From d6a2532dd7e351fc8952f669d2d7f62a56da2e3d Mon Sep 17 00:00:00 2001
From: NanoCode012
Date: Sun, 15 Feb 2026 19:51:28 +0700
Subject: [PATCH] feat(doc): clarify how to use scattermoe (#3408) [skip ci]

* feat(doc): clarify how to use scattermoe

* chore: fix wording
---
 src/axolotl/integrations/kernels/README.md | 44 ++++++++++++++++++++++
 1 file changed, 44 insertions(+)
 create mode 100644 src/axolotl/integrations/kernels/README.md

diff --git a/src/axolotl/integrations/kernels/README.md b/src/axolotl/integrations/kernels/README.md
new file mode 100644
index 000000000..96ff7b328
--- /dev/null
+++ b/src/axolotl/integrations/kernels/README.md
@@ -0,0 +1,44 @@
# Kernels Integration