- Uses the `libs/scattermoe_lora` package (includes fused LoRA support via Triton kernels).
- Replaces the `SparseMoeBlock` forward method with the optimized ScatterMoE implementation.
- Replaces the `SparseMoeBlock` forward method with the optimized ScatterMoE implementation via the HF kernels library.
- Both paths use the shared `resolve_moe_block_classes` utility in `constants.py` for model-type-to-class resolution.
See constants.py for the full list of supported model types (Qwen2-MoE, Qwen3-MoE, OLMoE, Mixtral, DeepSeek-V3, GLM-MoE, MiniMax, etc.).
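As a rough illustration of that resolution step, the flow is: look up the model type, get the corresponding `SparseMoeBlock` class, and swap in the optimized forward. The mapping and `patch_moe_forward` helper below are hypothetical sketches, not the actual contents of `constants.py`; the imports assume a recent `transformers` release.

```python
# Illustrative sketch only; the real table and the resolve_moe_block_classes
# helper live in constants.py. The entries below are assumptions.
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock
from transformers.models.qwen2_moe.modeling_qwen2_moe import Qwen2MoeSparseMoeBlock

MOE_BLOCK_CLASSES = {  # hypothetical model_type -> MoE block class mapping
    "mixtral": MixtralSparseMoeBlock,
    "qwen2_moe": Qwen2MoeSparseMoeBlock,
}

def patch_moe_forward(model_type: str, optimized_forward):
    """Swap the stock SparseMoeBlock.forward for an optimized kernel forward."""
    block_cls = MOE_BLOCK_CLASSES.get(model_type)
    if block_cls is None:
        raise ValueError(f"No MoE kernel integration for model_type={model_type!r}")
    block_cls.forward = optimized_forward
```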
All models use the SwiGLU activation (act_fn(gate) * up). Neither kernel currently supports non-SwiGLU MoE architectures.
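For reference, a minimal sketch of the per-expert SwiGLU computation both kernels assume (the projection names are illustrative):

```python
import torch
import torch.nn.functional as F

def swiglu_expert(x, w_gate, w_up, w_down):
    # x: [tokens, hidden]; w_gate/w_up: [hidden, inter]; w_down: [inter, hidden]
    gate = x @ w_gate
    up = x @ w_up
    return (F.silu(gate) * up) @ w_down  # act_fn(gate) * up, then down projection
```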
| Routing Strategy | Description | ScatterMoE | SonicMoE |
|---|---|---|---|
| softmax → topk | Softmax over experts, select top-K, optional renormalization | Yes | Yes |
| softmax → group selection → topk | Softmax, select top groups (sum of top-2 per group), topk from selected groups, renorm + scaling | No | Yes |
| sigmoid → topk (with groups) | Sigmoid + bias correction, group-based masking, topk from masked scores, weights from original sigmoid | Yes | Yes |
| sigmoid → topk (no groups) | Sigmoid + bias correction, straight topk (`n_group=1`) | Yes | Yes |
| softmax → bias correction → topk | Softmax, bias via `gate.moe_statics`, topk, gather from original probs, clamp-based renorm | No | Yes |
| softmax → group_limited_greedy | Softmax, group selection (max per group), topk, scale only (no renorm) | No | Yes |
| softmax → topk via gate.wg | Softmax, gate weight at `gate.wg.weight` (not `gate.weight`), always renormalize | No | Yes |
| fused topk → softmax | Routing + expert computation fused in a single kernel | No | Planned |
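To make the first (and most common) row concrete, here is a plain-PyTorch sketch of softmax → topk routing with the optional renormalization step; the kernels implement this logic inside Triton/CUTLASS rather than in Python:

```python
import torch

def softmax_topk_routing(router_logits, top_k, renormalize=True):
    # router_logits: [tokens, num_experts]
    probs = torch.softmax(router_logits, dim=-1, dtype=torch.float32)
    weights, expert_ids = torch.topk(probs, top_k, dim=-1)
    if renormalize:
        weights = weights / weights.sum(dim=-1, keepdim=True)
    return weights, expert_ids  # both [tokens, top_k]
```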
| Model Type | Architecture | Routing | ScatterMoE | SonicMoE |
|---|---|---|---|---|
| `qwen2_moe` | Qwen2-MoE | softmax → topk | Yes | Yes |
| `qwen3_moe` | Qwen3-MoE | softmax → topk | Yes | Yes |
| `qwen3_5_moe` | Qwen3.5-MoE | softmax → topk | Yes | Yes |
| `qwen3_5_moe_text` | Qwen3.5-MoE (VLM text) | softmax → topk | Yes | Yes |
| `qwen3_next` | Qwen3-Next | softmax → topk | Yes | Yes |
| `qwen3_vl_moe` | Qwen3-VL-MoE | softmax → topk | Yes | Yes |
| `qwen3_omni_moe` | Qwen3-Omni (Thinker + Talker) | softmax → topk | Yes | Yes |
| `olmoe` | OLMoE | softmax → topk | Yes | Yes |
| `mixtral` | Mixtral | softmax → topk | Yes | Yes |
| `minimax` | MiniMax | softmax → topk | Yes | Yes |
| `mistral4` | Mistral 4 | softmax → group → topk | No | Yes |
| `glm_moe_dsa` | GLM-MoE DSA (GLM 5) | sigmoid → topk (groups) | Yes | Yes |
| `deepseek_v3` | DeepSeek-V3 | sigmoid → topk (groups) | Yes | Yes |
| `glm4_moe` | GLM4-MoE | sigmoid → topk (groups) | Yes | Yes |
| `glm4_moe_lite` | GLM4-MoE Lite (GLM 4.7 Flash) | sigmoid → topk (groups) | Yes* | Yes |
| `glm4v_moe` | GLM4v-MoE | sigmoid → topk (groups) | Yes | Yes |
| `minimax_m2` | MiniMax M2 | sigmoid → topk (no groups) | Yes | Yes |
| `ernie4_5_moe` | ERNIE 4.5 MoE | softmax → bias → topk | No | Yes |
| `deepseek_v2` | DeepSeek-V2 | softmax → group_limited_greedy | No | Yes |
| `hunyuan_v1_moe` | HunYuan V1 MoE | softmax → topk (gate.wg) | No | Yes |
| `gpt_oss` | GPT-OSS | fused topk → softmax | No | Planned |
* glm4_moe_lite with ScatterMoE may have issues — see Limitations.
| Feature | ScatterMoE | SonicMoE |
|---|---|---|
| Kernel backend | Triton | CUTLASS |
| GPU requirement | Any CUDA | Hopper (H100/H200) or Blackwell (B200+) |
| LoRA approach | Fused in Triton kernel | Runtime materialization + custom autograd |
| LoRA overhead | Lower (fused computation) | Higher (per-forward materialization) |
| Gate/router LoRA | Yes | Yes |
| Expert LoRA | Yes (fused) | Yes (materialized) |
| Shared expert LoRA | Yes (standard PEFT) | Yes (standard PEFT) |
| Selective expert dequantization | Yes (~97% memory savings) | No |
| Weight format | Transposed `[E, hidden, 2*inter]` | Interleaved gate/up `[2*I, H, E]` |
| torch.compile routing | No | Yes (optional) |
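To illustrate the "runtime materialization" row, here is a hedged sketch of the idea: before each forward, the expert LoRA delta is folded into a temporary copy of the expert weights, and the MoE kernel then consumes the materialized tensor. Shapes and names are assumptions for illustration, not SonicMoE's actual API.

```python
import torch

def materialize_expert_weights(w, lora_a, lora_b, scaling):
    # w:      [E, out, in]  frozen base expert weights
    # lora_a: [E, r, in]    per-expert LoRA A
    # lora_b: [E, out, r]   per-expert LoRA B
    delta = torch.einsum("eor,eri->eoi", lora_b, lora_a)  # [E, out, in]
    return w + scaling * delta  # effective weights seen by the MoE kernel
```

In the fused ScatterMoE path the LoRA matmuls happen inside the Triton kernel, so no per-forward weight copy is needed; in the materialized path a custom autograd function wraps this step, presumably so that gradients flow to the LoRA factors rather than the frozen base weights.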
ScatterMoE uses softmax → topk routing, so results may differ from the baseline implementation for some model architectures (GPT-OSS, etc.). It is currently incompatible with `glm_moe_dsa` (GLM 5) and `glm4_moe_lite` (GLM 4.7 Flash).

SonicMoE supports both softmax → topk and sigmoid → topk routing, covering a wider range of architectures.
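For the sigmoid → topk (with groups) family (the DeepSeek-V3-style gate used by several of the models above), a rough sketch of the scoring path follows; variable names and defaults are illustrative only.

```python
import torch

def sigmoid_group_topk_routing(logits, bias, n_group, topk_group, top_k, scale=1.0):
    # logits: [tokens, E]; bias: [E] score-correction term (affects selection only)
    scores = torch.sigmoid(logits)
    scores_for_choice = scores + bias
    tokens, E = scores.shape
    # Group score = sum of the top-2 expert scores inside each group.
    group_scores = (scores_for_choice.view(tokens, n_group, E // n_group)
                    .topk(2, dim=-1).values.sum(dim=-1))                 # [tokens, n_group]
    group_ids = group_scores.topk(topk_group, dim=-1).indices
    group_mask = torch.zeros_like(group_scores).scatter_(1, group_ids, 1.0)
    expert_mask = group_mask.unsqueeze(-1).expand(tokens, n_group, E // n_group).reshape(tokens, E)
    masked = scores_for_choice.masked_fill(expert_mask == 0, float("-inf"))
    expert_ids = masked.topk(top_k, dim=-1).indices                      # topk from masked scores
    weights = scores.gather(1, expert_ids)                               # weights from original sigmoid
    weights = scale * weights / weights.sum(dim=-1, keepdim=True)
    return weights, expert_ids
```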
ScatterMoE does not currently work for GLM 4.7 Flash (`glm4_moe_lite`): its MoE block uses an `[E, H, 2*I]` weight layout, expert biases, and a custom GLU activation, so a dedicated forward path is needed.