feat(doc): clarify how to use scattermoe (#3408) [skip ci]
* feat(doc): clarify how to use scattermoe
* chore: fix wording
src/axolotl/integrations/kernels/README.md (new file, 44 lines)
@@ -0,0 +1,44 @@
# Kernels Integration

MoE (Mixture of Experts) kernels speed up training of MoE layers and reduce VRAM usage. In transformers v5, `batched_mm` and `grouped_mm` were integrated as built-in options, selected via the `experts_implementation` config kwarg:

```python
class ExpertsInterface(GeneralInterface):
    _global_mapping = {
        "batched_mm": batched_mm_experts_forward,
        "grouped_mm": grouped_mm_experts_forward,
    }
```
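A minimal sketch of selecting one of these built-in implementations when loading a model, assuming `experts_implementation` is accepted by `from_pretrained` the same way as other implementation-selection kwargs such as `attn_implementation` (the checkpoint name is only an example; check the transformers v5 docs for the exact API):

```python
from transformers import AutoModelForCausalLM

# Assumption: experts_implementation is passed to from_pretrained like
# attn_implementation; verify against the transformers v5 documentation.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",        # example MoE checkpoint
    experts_implementation="grouped_mm",  # or "batched_mm"
)
```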
In our custom integration, we add support for **ScatterMoE**, which is faster and more efficient than `grouped_mm`.

## Usage

Add the following to your axolotl YAML config:
```yaml
plugins:
  - axolotl.integrations.kernels.KernelsPlugin

use_kernels: true
use_scattermoe: true
```
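Then launch training as usual, for example with `axolotl train config.yaml` (the config filename here is just a placeholder).
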
**Important:** Setting `experts_implementation` is incompatible with `use_scattermoe`.
## How It Works

The `KernelsPlugin` runs before model loading and:

1. Registers the ScatterMoE kernel from the [`axolotl-ai-co/scattermoe`](https://huggingface.co/axolotl-ai-co/scattermoe) Hub repo.
2. Patches the model's `SparseMoeBlock` forward method with the optimized ScatterMoE implementation (sketched below).

This works for any MoE model in transformers that uses a `SparseMoeBlock` class (Mixtral, Qwen2-MoE, OLMoE, etc.).
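
As a rough illustration of that flow (a simplified sketch, not the plugin's actual code): `get_kernel` from the Hugging Face `kernels` library fetches a kernel from a Hub repo, and the forward of each matching module is then replaced. The helper name and the class-name match below are illustrative simplifications:

```python
# Illustrative sketch only -- simplified, not the plugin's real implementation.
from kernels import get_kernel  # Hugging Face `kernels` library

# 1. Fetch the ScatterMoE kernel from the Hub repo named above.
scattermoe = get_kernel("axolotl-ai-co/scattermoe")


def patch_sparse_moe_blocks(model, scattermoe_forward):
    """Swap a ScatterMoE-backed forward onto every *SparseMoeBlock module."""
    for module in model.modules():
        if module.__class__.__name__.endswith("SparseMoeBlock"):
            # 2. Bind the optimized forward to this block instance.
            module.forward = scattermoe_forward.__get__(module, type(module))
```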
## Limitations

ScatterMoE uses softmax -> top-k routing, so results may differ from the baseline for model architectures that route differently (e.g. GPT-OSS, GLM_MOE_DSA).
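
To make the routing difference concrete, here is a small, generic comparison of the two routing orders. It illustrates the general effect only and is not the routing code of ScatterMoE or of any particular model:

```python
import torch

router_logits = torch.tensor([2.0, 1.0, 0.5, -1.0])  # one token, four experts
k = 2

# softmax -> top-k (ScatterMoE-style): normalize over ALL experts, then select.
probs = torch.softmax(router_logits, dim=-1)
weights_a, experts_a = probs.topk(k)

# top-k -> softmax: select the experts first, then normalize over only those.
top_logits, experts_b = router_logits.topk(k)
weights_b = torch.softmax(top_logits, dim=-1)

print(weights_a)  # ~[0.61, 0.22] -- not renormalized, sums to less than 1
print(weights_b)  # ~[0.73, 0.27] -- renormalized over the selected experts
```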
## Note on MegaBlocks

We tested [MegaBlocks](https://huggingface.co/kernels-community/megablocks) but were unable to ensure numerical accuracy, so we did not integrate it. It was also incompatible with many newer model architectures in transformers.