From d6a2532dd7e351fc8952f669d2d7f62a56da2e3d Mon Sep 17 00:00:00 2001
From: NanoCode012
Date: Sun, 15 Feb 2026 19:51:28 +0700
Subject: [PATCH] feat(doc): clarify how to use scattermoe (#3408) [skip ci]

* feat(doc): clarify how to use scattermoe

* chore: fix wording
---
 src/axolotl/integrations/kernels/README.md | 44 ++++++++++++++++++++++
 1 file changed, 44 insertions(+)
 create mode 100644 src/axolotl/integrations/kernels/README.md

diff --git a/src/axolotl/integrations/kernels/README.md b/src/axolotl/integrations/kernels/README.md
new file mode 100644
index 000000000..96ff7b328
--- /dev/null
+++ b/src/axolotl/integrations/kernels/README.md
@@ -0,0 +1,44 @@
# Kernels Integration