support nemotron for scattermoe-lora

post merge lora fixes for CI (#3536 ) [skip ci]
* post merge lora fixes for CI * handle lora kernel auto-enable for moe without grouped_mm * prefer not to import torch in schema validation
2026-03-23 21:29:58 +00:00 · 2026-03-23 02:26:10 -04:00 · 2026-03-22 13:54:03 -04:00 · 2026-03-22 13:53:19 -04:00 · 2026-03-22 13:19:21 -04:00
24 changed files with 3946 additions and 473 deletions
--- a/cicd/cicd.sh
+++ b/cicd/cicd.sh
@@ -3,7 +3,8 @@ set -e
 python -c "import torch; assert '$PYTORCH_VERSION' in torch.__version__"
-curl --silent -L https://axolotl-ci.b-cdn.net/hf-cache.tar.zst | tar -xpf - -C "${HF_HOME}/hub/"  --use-compress-program unzstd --strip-components=1
+set -o pipefail
 curl --silent --show-error --fail --retry 3 --retry-delay 5 -L https://axolotl-ci.b-cdn.net/hf-cache.tar.zst | tar -xpf - -C "${HF_HOME}/hub/" --use-compress-program unzstd --strip-components=1
 # hf download "NousResearch/Meta-Llama-3-8B"
 # hf download "NousResearch/Meta-Llama-3-8B-Instruct"
 # hf download "microsoft/Phi-4-reasoning"
--- a/codecov.yml
+++ b/codecov.yml
@@ -37,6 +37,7 @@ coverage:
        only_pulls: false
        flags: null
        paths: null
        informational: true
 parsers:
  gcov:
--- a/src/axolotl/integrations/kernels/constants.py
+++ b/src/axolotl/integrations/kernels/constants.py
@@ -36,6 +36,8 @@ SPARSE_MOE_BLOCK = {
    "glm4v_moe": "Glm4vMoeTextMoE",
    # sigmoid -> topk routing (no group selection)
    "minimax_m2": "MiniMaxM2SparseMoeBlock",
    # sigmoid -> topk routing, non-gated experts (up_proj + down_proj, no gate_up_proj)
    "nemotron_h": "NemotronHMoE",
    # Models below need custom routing (not yet implemented):
    # "ernie4_5_moe": "Ernie4_5_MoeSparseMoeBlock",  # softmax->topk, e_score_correction_bias between softmax and topk
    # "deepseek_v2": "DeepseekV2Moe",  # softmax->topk, group_limited_greedy, different attr names (num_group)
--- a/src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py
+++ b/src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py
@@ -168,6 +168,9 @@ def _unwrap_experts_lora(experts_module):
          -> base_layer: ParamWrapper(gate_up_proj)
              -> base_layer: OlmoeExperts (the real module)
    For non-gated experts (e.g. NemotronH), the chain targets ``up_proj``
    instead of ``gate_up_proj``.
    This function walks the chain, collects LoRA params keyed by
    ``parameter_name``, and returns the base experts module.
@@ -176,6 +179,7 @@ def _unwrap_experts_lora(experts_module):
        Each ``*_lora`` is either ``(smoe_A, smoe_B, scaling)`` or ``None``.
        A/B are already in scattermoe layout.
        For non-gated experts, ``gup_lora`` holds the ``up_proj`` LoRA.
    """
    # Collect ParamWrapper layers by their parameter_name
    wrappers = {}
@@ -195,13 +199,15 @@ def _unwrap_experts_lora(experts_module):
    num_experts = getattr(base_experts, "num_experts", None)
    if num_experts is None:
        # Fallback: infer from parameter shape
-        gup = getattr(base_experts, "gate_up_proj", None)
+        for attr in ("gate_up_proj", "up_proj"):
-        if gup is not None:
+            param = getattr(base_experts, attr, None)
-            num_experts = gup.shape[0]
+            if param is not None:
                num_experts = param.shape[0]
                break
-    # Extract gate_up_proj LoRA (needs A<->B swap due to transposition)
+    # Extract gate_up_proj or up_proj LoRA (needs A<->B swap due to transposition)
    gup_lora = None
-    gup_wrapper = wrappers.get("gate_up_proj")
+    gup_wrapper = wrappers.get("gate_up_proj") or wrappers.get("up_proj")
    if gup_wrapper is not None:
        lora_A, lora_B, scaling = get_lora_params_from_wrapper(gup_wrapper)
        if lora_A is not None:
@@ -441,10 +447,12 @@ class HFScatterMoEGatedMLP(nn.Module):
    Supports:
    * **Softmax→topk routing**: OLMoE, Qwen2/3MoE, Mixtral, MiniMax
-    * **Sigmoid→topk routing**: GLM, DeepSeek V3, MiniMax M2
+    * **Sigmoid→topk routing**: GLM, DeepSeek V3, MiniMax M2, NemotronH
    * **Full-parameter training**: uses ``parallel_linear`` (base ScatterMoE)
    * **LoRA fine-tuning**: detects peft ``ParamWrapper`` on ``self.experts``,
      extracts adapter weights, and uses ``parallel_linear_lora`` (fused kernel)
    * **Non-gated experts**: NemotronH (up_proj + down_proj, no gate_up_proj)
    * **Latent projections**: NemotronH (fc1/fc2_latent_proj wrapping experts)
    """
    @staticmethod
@@ -467,7 +475,7 @@ class HFScatterMoEGatedMLP(nn.Module):
        hidden_states_flat = layer_input.view(-1, hidden_dim)
        # ====================================================================
-        # Shared Expert (if present, e.g. Qwen2MoE, DeepSeek V3)
+        # Shared Expert (if present, e.g. Qwen2MoE, DeepSeek V3, NemotronH)
        # ====================================================================
        shared_expert_output = _compute_shared_expert(self, hidden_states_flat)
@@ -489,6 +497,22 @@ class HFScatterMoEGatedMLP(nn.Module):
        # ====================================================================
        experts, gup_lora, down_lora = _unwrap_experts_lora(self.experts)
        # ====================================================================
        # Detect non-gated experts (e.g. NemotronH: up_proj + down_proj only)
        # ====================================================================
        is_gated = hasattr(experts, "gate_up_proj")
        up_proj_attr = "gate_up_proj" if is_gated else "up_proj"
        # ====================================================================
        # Optional latent projection (NemotronH: fc1/fc2_latent_proj)
        # ====================================================================
        fc1_latent_proj = getattr(self, "fc1_latent_proj", None)
        fc2_latent_proj = getattr(self, "fc2_latent_proj", None)
        expert_input = hidden_states_flat
        if fc1_latent_proj is not None and not isinstance(fc1_latent_proj, nn.Identity):
            expert_input = fc1_latent_proj(hidden_states_flat)
        # ====================================================================
        # Selective expert weight dequantization
        # ====================================================================
@@ -498,7 +522,7 @@ class HFScatterMoEGatedMLP(nn.Module):
        use_selective = (
            getattr(self, "_use_selective_dequant", False)
            and hasattr(experts, "parametrizations")
-            and "gate_up_proj" in experts.parametrizations
+            and up_proj_attr in experts.parametrizations
        )
        if use_selective:
@@ -517,11 +541,11 @@ class HFScatterMoEGatedMLP(nn.Module):
                num_experts,
            )
            # Dequantize only active experts' weights
-            gate_up_W = selective_expert_weights(
+            up_W = selective_expert_weights(
                experts,
-                "gate_up_proj",
+                up_proj_attr,
                active_experts,
-            ).transpose(2, 1)  # [num_active, hidden, 2*inter]
+            ).transpose(2, 1)
            # Remap LoRA weights to match compact expert indices
            if gup_lora is not None:
@@ -538,18 +562,18 @@ class HFScatterMoEGatedMLP(nn.Module):
            sei_gup = remapped_expert_idxs
            eo_gup = compact_offsets
        else:
-            gate_up_W = experts.gate_up_proj.transpose(2, 1)  # [E, hidden, 2*inter]
+            up_W = getattr(experts, up_proj_attr).transpose(2, 1)
            sei_gup = sorted_expert_idxs
            eo_gup = expert_offsets
        # ====================================================================
-        # Gate + Up projection
+        # Up projection (gated: gate_up_proj; non-gated: up_proj)
        # ====================================================================
        if gup_lora is not None:
            gup_A, gup_B, gup_scaling = gup_lora
-            gup = parallel_linear_lora(
+            up_out = parallel_linear_lora(
-                hidden_states_flat,
+                expert_input,
-                gate_up_W,
+                up_W,
                top_k,
                sei_gup,
                sorted_scattered_idxs,
@@ -563,9 +587,9 @@ class HFScatterMoEGatedMLP(nn.Module):
                use_fused_gather=True,
            )
        else:
-            gup = parallel_linear(
+            up_out = parallel_linear(
-                hidden_states_flat,
+                expert_input,
-                gate_up_W,
+                up_W,
                top_k,
                sei_gup,
                sorted_scattered_idxs,
@@ -574,8 +598,14 @@ class HFScatterMoEGatedMLP(nn.Module):
                grouped_out=True,
            )
-        gates, h = gup.chunk(2, dim=-1)
+        # ====================================================================
-        h = experts.act_fn(gates) * h
+        # Activation: gated (act_fn(gate) * up) vs non-gated (act_fn(up))
        # ====================================================================
        if is_gated:
            gates, h = up_out.chunk(2, dim=-1)
            h = experts.act_fn(gates) * h
        else:
            h = experts.act_fn(up_out)
        # ====================================================================
        # Down projection
@@ -635,6 +665,12 @@ class HFScatterMoEGatedMLP(nn.Module):
                gates=routing_weights,
            )
        # ====================================================================
        # Optional latent projection back to hidden_size (NemotronH)
        # ====================================================================
        if fc2_latent_proj is not None and not isinstance(fc2_latent_proj, nn.Identity):
            expert_output = fc2_latent_proj(expert_output)
        # ====================================================================
        # Combine with shared expert and reshape
        # ====================================================================
--- a/src/axolotl/integrations/liger/args.py
+++ b/src/axolotl/integrations/liger/args.py
@@ -30,6 +30,15 @@ class LigerArgs(BaseModel):
    liger_rope: bool | None = None
    liger_rms_norm: bool | None = None
    liger_rms_norm_gated: bool | None = Field(
        default=None,
        json_schema_extra={
            "description": (
                "Enables fused RMSNorm+SiLU gate Triton kernel for models with "
                "gated RMSNorm (e.g. Qwen3.5 / Qwen3.5 MoE linear attention layers)."
            )
        },
    )
    liger_layer_norm: bool | None = None
    liger_swiglu: bool | None = None
    liger_glu_activation: bool | None = None
--- a/src/axolotl/integrations/liger/models/qwen3_5.py
+++ b/src/axolotl/integrations/liger/models/qwen3_5.py
@@ -0,0 +1,175 @@
 """
 Liger FLCE for Qwen3.5. Based on transformers v5.3.0.
 """
 import sys
 from copy import deepcopy
 from typing import Optional, Union
 import torch
 from liger_kernel.transformers.model.loss_utils import LigerForCausalLMLoss
 from transformers.cache_utils import Cache
 from transformers.modeling_outputs import CausalLMOutputWithPast
 def lce_forward(
    self,
    input_ids: Optional[torch.LongTensor] = None,
    attention_mask: Optional[torch.Tensor] = None,
    position_ids: Optional[torch.LongTensor] = None,
    past_key_values: Optional[Cache] = None,
    inputs_embeds: Optional[torch.FloatTensor] = None,
    labels: Optional[torch.LongTensor] = None,
    use_cache: Optional[bool] = None,
    cache_position: Optional[torch.LongTensor] = None,
    logits_to_keep: Union[int, torch.Tensor] = 0,
    **kwargs,
 ) -> CausalLMOutputWithPast:
    r"""
    Args:
        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
            config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
            (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
        logits_to_keep (`int` or `torch.Tensor`, *optional*):
            If an `int`, compute logits for the last `logits_to_keep` tokens. If `0`, calculate logits for all
            `input_ids` (special case). Only last token logits are needed for generation, and calculating them only for that
            token can save memory, which becomes pretty significant for long sequences or large vocabulary size.
            If a `torch.Tensor`, must be 1D corresponding to the indices to keep in the sequence length dimension.
            This is useful when using packed tensor format (single dimension for batch and sequence length).
    Returns:
    """
    # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
    outputs = self.model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_values=past_key_values,
        inputs_embeds=inputs_embeds,
        use_cache=use_cache,
        cache_position=cache_position,
        **kwargs,
    )
    hidden_states = outputs[0]
    logits = None
    loss = None
    # if in training mode, don't materialize logits
    if self.training and (labels is not None):
        loss = LigerForCausalLMLoss(
            hidden_states=hidden_states,
            lm_head_weight=self.lm_head.weight,
            labels=labels,
            hidden_size=self.config.hidden_size,
            **kwargs,
        )
    else:  # if in inference mode materialize logits
        slice_indices = (
            slice(-logits_to_keep, None)
            if isinstance(logits_to_keep, int)
            else logits_to_keep
        )
        logits = self.lm_head(hidden_states[:, slice_indices, :])
        if labels is not None:
            loss = self.loss_function(
                logits=logits,
                labels=labels,
                vocab_size=self.config.vocab_size,
                **kwargs,
            )
    return CausalLMOutputWithPast(
        loss=loss,
        logits=logits,
        past_key_values=outputs.past_key_values,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
    )
 def apply_liger_kernel_to_qwen3_5(
    cross_entropy: bool = False,
    fused_linear_cross_entropy: bool = False,
    rms_norm: bool = False,
    rms_norm_gated: bool = False,
    glu_activation: bool = False,
    layer_norm: bool = False,
    **kwargs,
 ) -> None:
    """
    Apply Liger kernels to replace original implementation in HuggingFace Qwen3.5 models.
    Note: Qwen3_5RMSNorm uses zero-init weight with offset 1.0 (like Gemma),
    so we use LigerRMSNorm with offset=1.0 and init_fn="zeros".
    Args:
        cross_entropy (bool): Whether to apply Liger's cross entropy loss. Default is False.
        fused_linear_cross_entropy (bool):
            Whether to apply Liger's fused linear cross entropy loss. Default is False.
            `cross_entropy` and `fused_linear_cross_entropy` cannot both be True.
            If `fused_linear_cross_entropy` is True, the logits will not be materialized but more memory efficient.
        rms_norm (bool): Whether to apply Liger's RMSNorm. Default is False.
        rms_norm_gated (bool): Whether to apply fused RMSNorm+SiLU gate kernel for
            Qwen3_5RMSNormGated (used in linear attention layers). Default is False.
        glu_activation (bool): Whether to apply Liger's SwiGLU MLP. Default is False.
        layer_norm (bool): Whether to apply Liger's LayerNorm. Default is False.
    """
    import transformers.models.qwen3_5.modeling_qwen3_5  # noqa: F401
    from liger_kernel.transformers.functional import liger_cross_entropy
    from liger_kernel.transformers.layer_norm import LigerLayerNorm
    from liger_kernel.transformers.rms_norm import LigerRMSNorm
    from liger_kernel.transformers.swiglu import LigerSwiGLUMLP
    assert not (cross_entropy and fused_linear_cross_entropy), (
        "cross_entropy and fused_linear_cross_entropy cannot both be True."
    )
    modeling_qwen3_5 = sys.modules["transformers.models.qwen3_5.modeling_qwen3_5"]
    if rms_norm:
        # Qwen3_5RMSNorm uses zero-init weight with `output * (1.0 + weight)` pattern
        class LigerRMSNormForQwen3_5(LigerRMSNorm):
            def __init__(self, dim, eps=1e-6, **kwargs):
                super().__init__(
                    dim,
                    eps=eps,
                    offset=1.0,
                    casting_mode="gemma",
                    init_fn="zeros",
                    in_place=False,
                )
        modeling_qwen3_5.Qwen3_5RMSNorm = LigerRMSNormForQwen3_5
    if rms_norm_gated:
        from axolotl.kernels.rms_norm_gated import FusedRMSNormGated
        modeling_qwen3_5.Qwen3_5RMSNormGated = FusedRMSNormGated
    if glu_activation:
        def _liger_swiglu_mlp_wrapper(config, intermediate_size=None, **kwargs):
            """Accepts intermediate_size to pass to LigerSwiGLUMLP"""
            config = deepcopy(config)
            if intermediate_size is not None:
                config.intermediate_size = intermediate_size
            return LigerSwiGLUMLP(config, **kwargs)
        modeling_qwen3_5.Qwen3_5MLP = _liger_swiglu_mlp_wrapper
    if layer_norm:
        modeling_qwen3_5.nn.LayerNorm = LigerLayerNorm
    if cross_entropy:
        from transformers.loss.loss_utils import nn
        nn.functional.cross_entropy = liger_cross_entropy
    if fused_linear_cross_entropy:
        modeling_qwen3_5.Qwen3_5ForCausalLM.forward = lce_forward
--- a/src/axolotl/integrations/liger/models/qwen3_5_moe.py
+++ b/src/axolotl/integrations/liger/models/qwen3_5_moe.py
@@ -0,0 +1,198 @@
 """
 Liger FLCE for Qwen3.5 MoE. Based on transformers v5.3.0.
 """
 import sys
 from copy import deepcopy
 from typing import Optional, Union
 import torch
 from liger_kernel.transformers.model.loss_utils import LigerForCausalLMLoss
 from transformers.modeling_outputs import MoeCausalLMOutputWithPast
 def lce_forward(
    self,
    input_ids: Optional[torch.LongTensor] = None,
    attention_mask: Optional[torch.Tensor] = None,
    position_ids: Optional[torch.LongTensor] = None,
    past_key_values=None,
    inputs_embeds: Optional[torch.FloatTensor] = None,
    labels: Optional[torch.LongTensor] = None,
    use_cache: Optional[bool] = None,
    output_router_logits: Optional[bool] = None,
    cache_position: Optional[torch.LongTensor] = None,
    logits_to_keep: Union[int, torch.Tensor] = 0,
    **kwargs,
 ) -> MoeCausalLMOutputWithPast:
    r"""
    Args:
        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
            config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
            (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
        logits_to_keep (`int` or `torch.Tensor`, *optional*):
            If an `int`, compute logits for the last `logits_to_keep` tokens. If `0`, calculate logits for all
            `input_ids` (special case). Only last token logits are needed for generation, and calculating them only for that
            token can save memory, which becomes pretty significant for long sequences or large vocabulary size.
            If a `torch.Tensor`, must be 1D corresponding to the indices to keep in the sequence length dimension.
            This is useful when using packed tensor format (single dimension for batch and sequence length).
    Returns:
    """
    from transformers.models.qwen3_5_moe.modeling_qwen3_5_moe import (
        load_balancing_loss_func,
    )
    output_router_logits = (
        output_router_logits
        if output_router_logits is not None
        else self.config.output_router_logits
    )
    # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
    outputs = self.model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_values=past_key_values,
        inputs_embeds=inputs_embeds,
        use_cache=use_cache,
        output_router_logits=output_router_logits,
        cache_position=cache_position,
        **kwargs,
    )
    hidden_states = outputs[0]
    logits = None
    loss = None
    # if in training mode, don't materialize logits
    if self.training and (labels is not None):
        loss = LigerForCausalLMLoss(
            hidden_states=hidden_states,
            lm_head_weight=self.lm_head.weight,
            labels=labels,
            hidden_size=self.config.hidden_size,
            **kwargs,
        )
    else:  # if in inference mode materialize logits
        slice_indices = (
            slice(-logits_to_keep, None)
            if isinstance(logits_to_keep, int)
            else logits_to_keep
        )
        logits = self.lm_head(hidden_states[:, slice_indices, :])
        if labels is not None:
            loss = self.loss_function(
                logits,
                labels,
                self.vocab_size,
                **kwargs,
            )
    aux_loss = None
    if output_router_logits:
        aux_loss = load_balancing_loss_func(
            outputs.router_logits,
            self.num_experts,
            self.num_experts_per_tok,
            attention_mask,
        )
        if labels is not None:
            loss += self.router_aux_loss_coef * aux_loss.to(loss.device)
    return MoeCausalLMOutputWithPast(
        loss=loss,
        aux_loss=aux_loss,
        logits=logits,
        past_key_values=outputs.past_key_values,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
        router_logits=outputs.router_logits,
    )
 def apply_liger_kernel_to_qwen3_5_moe(
    cross_entropy: bool = False,
    fused_linear_cross_entropy: bool = False,
    rms_norm: bool = False,
    rms_norm_gated: bool = False,
    glu_activation: bool = False,
    layer_norm: bool = False,
    **kwargs,
 ) -> None:
    """
    Apply Liger kernels to replace original implementation in HuggingFace Qwen3.5 MoE models.
    Note: Qwen3_5MoeRMSNorm uses zero-init weight with offset 1.0 (like Gemma),
    so we use LigerRMSNorm with offset=1.0 and init_fn="zeros".
    Args:
        cross_entropy (bool): Whether to apply Liger's cross entropy loss. Default is False.
        fused_linear_cross_entropy (bool):
            Whether to apply Liger's fused linear cross entropy loss. Default is False.
            `cross_entropy` and `fused_linear_cross_entropy` cannot both be True.
            If `fused_linear_cross_entropy` is True, the logits will not be materialized but more memory efficient.
        rms_norm (bool): Whether to apply Liger's RMSNorm. Default is False.
        rms_norm_gated (bool): Whether to apply fused RMSNorm+SiLU gate kernel for
            Qwen3_5MoeRMSNormGated (used in linear attention layers). Default is False.
        glu_activation (bool): Whether to apply Liger's SwiGLU MLP. Default is False.
        layer_norm (bool): Whether to apply Liger's LayerNorm. Default is False.
    """
    import transformers.models.qwen3_5_moe.modeling_qwen3_5_moe  # noqa: F401
    from liger_kernel.transformers.functional import liger_cross_entropy
    from liger_kernel.transformers.layer_norm import LigerLayerNorm
    from liger_kernel.transformers.rms_norm import LigerRMSNorm
    from liger_kernel.transformers.swiglu import LigerSwiGLUMLP
    assert not (cross_entropy and fused_linear_cross_entropy), (
        "cross_entropy and fused_linear_cross_entropy cannot both be True."
    )
    modeling_mod = sys.modules["transformers.models.qwen3_5_moe.modeling_qwen3_5_moe"]
    if rms_norm:
        # Qwen3_5MoeRMSNorm uses zero-init weight with `output * (1.0 + weight)` pattern
        class LigerRMSNormForQwen3_5Moe(LigerRMSNorm):
            def __init__(self, dim, eps=1e-6, **kwargs):
                super().__init__(
                    dim,
                    eps=eps,
                    offset=1.0,
                    casting_mode="gemma",
                    init_fn="zeros",
                    in_place=False,
                )
        modeling_mod.Qwen3_5MoeRMSNorm = LigerRMSNormForQwen3_5Moe
    if rms_norm_gated:
        from axolotl.kernels.rms_norm_gated import FusedRMSNormGated
        modeling_mod.Qwen3_5MoeRMSNormGated = FusedRMSNormGated
    if glu_activation:
        def _liger_swiglu_mlp_wrapper(config, intermediate_size=None, **kwargs):
            """Accepts intermediate_size to pass to LigerSwiGLUMLP"""
            config = deepcopy(config)
            if intermediate_size is not None:
                config.intermediate_size = intermediate_size
            return LigerSwiGLUMLP(config, **kwargs)
        modeling_mod.Qwen3_5MoeMLP = _liger_swiglu_mlp_wrapper
    if layer_norm:
        modeling_mod.nn.LayerNorm = LigerLayerNorm
    if cross_entropy:
        from transformers.loss.loss_utils import nn
        nn.functional.cross_entropy = liger_cross_entropy
    if fused_linear_cross_entropy:
        modeling_mod.Qwen3_5MoeForCausalLM.forward = lce_forward
--- a/src/axolotl/integrations/liger/plugin.py
+++ b/src/axolotl/integrations/liger/plugin.py
@@ -174,6 +174,19 @@ class LigerPlugin(BasePlugin):
                rms_norm=cfg.liger_rms_norm,
                layer_norm=cfg.liger_layer_norm,
            )
        elif cfg.model_config_type == "qwen3_5":
            from axolotl.integrations.liger.models.qwen3_5 import (
                apply_liger_kernel_to_qwen3_5,
            )
            apply_liger_kernel_to_qwen3_5(
                cross_entropy=cfg.liger_cross_entropy,
                fused_linear_cross_entropy=cfg.liger_fused_linear_cross_entropy,
                glu_activation=cfg.liger_glu_activation,
                rms_norm=cfg.liger_rms_norm,
                rms_norm_gated=getattr(cfg, "liger_rms_norm_gated", False),
                layer_norm=cfg.liger_layer_norm,
            )
        elif cfg.model_config_type == "qwen3_moe":
            from axolotl.integrations.liger.models.qwen3_moe import (
                apply_liger_kernel_to_qwen3_moe,
@@ -186,6 +199,19 @@ class LigerPlugin(BasePlugin):
                rms_norm=cfg.liger_rms_norm,
                layer_norm=cfg.liger_layer_norm,
            )
        elif cfg.model_config_type == "qwen3_5_moe":
            from axolotl.integrations.liger.models.qwen3_5_moe import (
                apply_liger_kernel_to_qwen3_5_moe,
            )
            apply_liger_kernel_to_qwen3_5_moe(
                cross_entropy=cfg.liger_cross_entropy,
                fused_linear_cross_entropy=cfg.liger_fused_linear_cross_entropy,
                glu_activation=cfg.liger_glu_activation,
                rms_norm=cfg.liger_rms_norm,
                rms_norm_gated=getattr(cfg, "liger_rms_norm_gated", False),
                layer_norm=cfg.liger_layer_norm,
            )
        elif cfg.model_config_type == "granitemoe":
            from liger_kernel.transformers import apply_liger_kernel_to_granite
--- a/src/axolotl/kernels/dora.py
+++ b/src/axolotl/kernels/dora.py
@@ -0,0 +1,147 @@
 """
 Triton kernels for DoRA (Weight-Decomposed Low-Rank Adaptation).
 Fuses the weight norm computation and magnitude scaling to avoid
 materializing the full [out_features, in_features] combined weight matrix.
 The B@A product is computed row-by-row inside the kernel.
 """
 import torch
 import triton
 import triton.language as tl
 from .quantize import dequantize
@triton.jit
 def _dora_fused_norm_kernel(
    # Pointers
    W_ptr,  # base weight [out, in] (dequantized, row-major)
    B_ptr,  # LoRA B [out, rank] (row-major)
    A_ptr,  # LoRA A [rank, in] (row-major)
    mag_ptr,  # magnitude vector [out]
    out_ptr,  # output mag_norm_scale [out]
    # Shapes
    out_features,
    in_features,
    rank,
    # Scaling
    lora_scale,  # float scaling factor
    # Block sizes
    BLOCK_IN: tl.constexpr,
    BLOCK_R: tl.constexpr,  # >= rank, power of 2
 ):
    """Compute mag_norm_scale[i] = magnitude[i] / ||W[i,:] + s * (B[i,:] @ A)[:] ||_2
    Each program handles one output row. B[row,:] is loaded once (small),
    then we tile over in_features computing the dot product with A[:,tile]
    and accumulating the squared norm.
    This avoids materializing the full [out, in] B@A matrix.
    """
    row = tl.program_id(0)
    if row >= out_features:
        return
    # Accumulate squared norm across tiles of in_features
    norm_sq_acc = tl.zeros([BLOCK_IN], dtype=tl.float32)
    for start in range(0, in_features, BLOCK_IN):
        cols = start + tl.arange(0, BLOCK_IN)
        col_mask = cols < in_features
        # Load W[row, cols]
        w_vals = tl.load(
            W_ptr + row * in_features + cols,
            mask=col_mask,
            other=0.0,
        ).to(tl.float32)
        # Compute (B[row,:] @ A[:, cols]) for this tile
        # Load B[row, r] as scalar and A[r, cols] as vector for each r
        ba_vals = tl.zeros([BLOCK_IN], dtype=tl.float32)
        for r in tl.static_range(BLOCK_R):
            # Load scalar B[row, r]
            b_val = tl.load(
                B_ptr + row * rank + r,
                mask=(r < rank),
                other=0.0,
            ).to(tl.float32)
            # Load vector A[r, cols]
            a_vals = tl.load(
                A_ptr + r * in_features + cols,
                mask=(col_mask & (r < rank)),
                other=0.0,
            ).to(tl.float32)
            ba_vals += b_val * a_vals
        # Combined: W + s * (B @ A)
        combined = w_vals + lora_scale * ba_vals
        # Accumulate squared values
        norm_sq_acc += tl.where(col_mask, combined * combined, 0.0)
    # Reduce to scalar norm
    norm_sq = tl.sum(norm_sq_acc, axis=0)
    norm = tl.sqrt(norm_sq + 1e-12)  # epsilon for numerical stability
    # Load magnitude and compute scale
    mag = tl.load(mag_ptr + row).to(tl.float32)
    scale = mag / norm
    tl.store(out_ptr + row, scale)
 def triton_dora_scale(
    W: torch.Tensor,
    W_quant,
    A: torch.Tensor,
    B: torch.Tensor,
    s: float,
    magnitude: torch.Tensor,
    dtype: torch.dtype,
 ) -> torch.Tensor:
    """Compute DoRA mag_norm_scale using fused Triton kernel.
    Computes B@A row-by-row inside the kernel, avoiding the full
    [out_features, in_features] materialization.
    Args:
        W: base weight [out, in] (possibly quantized)
        W_quant: quantization state
        A: LoRA A [rank, in]
        B: LoRA B [out, rank]
        s: LoRA scaling factor
        magnitude: learned magnitude [out]
        dtype: compute dtype
    Returns:
        mag_norm_scale: [out] tensor = magnitude / ||W + s * B @ A||_2
    """
    # Dequantize W to [out, in]
    W_full = dequantize(W.t(), W_quant).t().contiguous().to(dtype)
    out_features, in_features = W_full.shape
    rank = A.shape[0]
    out = torch.empty(out_features, dtype=dtype, device=W.device)
    # Block sizes
    BLOCK_IN = triton.next_power_of_2(min(in_features, 2048))
    BLOCK_R = triton.next_power_of_2(rank)
    _dora_fused_norm_kernel[(out_features,)](
        W_full,
        B.contiguous().to(dtype),
        A.contiguous().to(dtype),
        magnitude.contiguous(),
        out,
        out_features=out_features,
        in_features=in_features,
        rank=rank,
        lora_scale=s,
        BLOCK_IN=BLOCK_IN,
        BLOCK_R=BLOCK_R,
    )
    return out.detach()
--- a/src/axolotl/kernels/lora.py
+++ b/src/axolotl/kernels/lora.py
--- a/src/axolotl/kernels/quantize.py
+++ b/src/axolotl/kernels/quantize.py
@@ -105,6 +105,10 @@ def dequantize(
    # Extract quantization state
    if not isinstance(quant_state, list):
        # New style quant_state class
        # Non-double-quantized models have offset=None and state2=None
        if quant_state.offset is None or quant_state.state2 is None:
            # Fall back to bitsandbytes standard dequantize
            return bnb.functional.dequantize_4bit(W, quant_state, quant_type="nf4")
        absmax = quant_state.absmax.to(target_device)
        shape = quant_state.shape
        dtype = quant_state.dtype
--- a/src/axolotl/kernels/rms_norm_gated.py
+++ b/src/axolotl/kernels/rms_norm_gated.py
@@ -0,0 +1,333 @@
 """
 Fused RMSNorm + SiLU Gate Triton kernel.
 Computes: Y = (W + offset) * RMSNorm(X) * silu(G)
 where RMSNorm(X) = X / sqrt(mean(X^2) + eps)
 and silu(G) = G * sigmoid(G)
 Used by Qwen3.5's GatedDeltaNet linear attention layers (Qwen3_5RMSNormGated).
 """
 import math
 import operator
 import torch
 import triton
 import triton.language as tl
 from liger_kernel.ops.utils import (
    calculate_settings,
    compare_version,
    ensure_contiguous,
    torch_to_triton_dtype,
 )
 from liger_kernel.utils import is_npu_available
 if compare_version("triton", operator.ge, "3.0.0") and not is_npu_available():
    try:
        from triton.language.extra.libdevice import rsqrt
    except ModuleNotFoundError:
        from triton.language.extra.cuda.libdevice import rsqrt
 else:
    from triton.language.math import rsqrt
@triton.jit
 def _rms_norm_gated_forward_kernel(
    Y_ptr,
    Y_row_stride,
    X_ptr,
    X_row_stride,
    G_ptr,
    G_row_stride,
    W_ptr,
    W_row_stride,
    RSTD_ptr,
    RSTD_row_stride,
    n_cols,
    eps,
    offset,
    BLOCK_SIZE: tl.constexpr,
 ):
    """
    Y = (W + offset) * (X / RMS(X)) * silu(G)
    All computation done in fp32 (Gemma-style), result cast to input dtype.
    """
    row_idx = tl.program_id(0).to(tl.int64)
    col_offsets = tl.arange(0, BLOCK_SIZE)
    mask = col_offsets < n_cols
    X_row = tl.load(X_ptr + row_idx * X_row_stride + col_offsets, mask=mask, other=0)
    G_row = tl.load(G_ptr + row_idx * G_row_stride + col_offsets, mask=mask, other=0)
    W_row = tl.load(W_ptr + col_offsets, mask=mask, other=0)
    X_row_dtype = X_row.dtype
    # Cast everything to fp32
    X_fp32 = X_row.to(tl.float32)
    G_fp32 = G_row.to(tl.float32)
    W_fp32 = W_row.to(tl.float32)
    # RMS norm
    mean_sq = tl.sum(X_fp32 * X_fp32, axis=0) / n_cols
    rstd = rsqrt(mean_sq + eps)
    tl.store(RSTD_ptr + row_idx * RSTD_row_stride, rstd)
    X_norm = X_fp32 * rstd
    # SiLU gate: silu(G) = G * sigmoid(G)
    sig_G = tl.sigmoid(G_fp32)
    silu_G = G_fp32 * sig_G
    # Fused output
    Y_row = (offset + W_fp32) * X_norm * silu_G
    tl.store(
        Y_ptr + row_idx * Y_row_stride + col_offsets,
        Y_row.to(X_row_dtype),
        mask=mask,
    )
@triton.jit
 def _rms_norm_gated_backward_kernel(
    dY_ptr,
    dY_row_stride,
    dX_ptr,
    dX_row_stride,
    dG_ptr,
    dG_row_stride,
    X_ptr,
    X_row_stride,
    X_dtype: tl.constexpr,
    G_ptr,
    G_row_stride,
    W_ptr,
    W_row_stride,
    RSTD_ptr,
    RSTD_row_stride,
    dW_ptr,
    dW_row_stride,
    n_rows,
    n_cols,
    offset,
    rows_per_program,
    BLOCK_SIZE: tl.constexpr,
 ):
    """
    Backward for Y = (W + offset) * (X * RSTD) * silu(G)
    dW = sum_batch(dY * X_norm * silu(G))
    dG = dY * (W + offset) * X_norm * silu'(G)
       where silu'(G) = sigmoid(G) * (1 + G * (1 - sigmoid(G)))
    dX = RSTD * (m - (1/N) * RSTD^2 * dot(m, X) * X)
       where m = dY * (W + offset) * silu(G)
    """
    row_block_id = tl.program_id(0).to(tl.int64)
    row_start = row_block_id * rows_per_program
    row_end = min((row_block_id + 1) * rows_per_program, n_rows)
    col_offsets = tl.arange(0, BLOCK_SIZE)
    mask = col_offsets < n_cols
    dW_acc = tl.zeros((BLOCK_SIZE,), dtype=tl.float32)
    W_row = tl.load(W_ptr + col_offsets, mask=mask, other=0.0)
    W_row = W_row.to(tl.float32) + offset
    for row_idx in range(row_start, row_end):
        dY_row = tl.load(
            dY_ptr + row_idx * dY_row_stride + col_offsets, mask=mask, other=0.0
        )
        X_row = tl.load(
            X_ptr + row_idx * X_row_stride + col_offsets, mask=mask, other=0.0
        )
        G_row = tl.load(
            G_ptr + row_idx * G_row_stride + col_offsets, mask=mask, other=0.0
        )
        rstd_row = tl.load(RSTD_ptr + row_idx * RSTD_row_stride)
        # Cast to fp32
        dY_fp32 = dY_row.to(tl.float32)
        X_fp32 = X_row.to(tl.float32)
        G_fp32 = G_row.to(tl.float32)
        # Recompute intermediates
        X_norm = X_fp32 * rstd_row
        sig_G = tl.sigmoid(G_fp32)
        silu_G = G_fp32 * sig_G
        # dW: accumulate dY * X_norm * silu(G)
        dW_acc += dY_fp32 * X_norm * silu_G
        # dG: dY * (W + offset) * X_norm * silu'(G)
        # silu'(G) = sigmoid(G) * (1 + G * (1 - sigmoid(G)))
        silu_prime_G = sig_G * (1.0 + G_fp32 * (1.0 - sig_G))
        dG_row = dY_fp32 * W_row * X_norm * silu_prime_G
        tl.store(
            dG_ptr + row_idx * dG_row_stride + col_offsets,
            dG_row.to(X_dtype),
            mask=mask,
        )
        # dX: standard RMSNorm backward with effective gradient m = dY * W * silu(G)
        m = dY_fp32 * W_row * silu_G
        dX_row = rstd_row * m
        dX_row += rstd_row * (
            -(1.0 / n_cols) * rstd_row * rstd_row * tl.sum(m * X_fp32, axis=0) * X_fp32
        )
        tl.store(
            dX_ptr + row_idx * dX_row_stride + col_offsets,
            dX_row.to(X_dtype),
            mask=mask,
        )
    tl.store(
        dW_ptr + row_block_id * dW_row_stride + col_offsets,
        dW_acc,
        mask=mask,
    )
 def rms_norm_gated_forward(X, G, W, eps, offset):
    shape = X.shape
    dim = shape[-1]
    X = X.view(-1, dim)
    G = G.view(-1, dim)
    n_rows, n_cols = X.shape
    BLOCK_SIZE, num_warps = calculate_settings(n_cols)
    Y = torch.empty((n_rows, n_cols), dtype=X.dtype, device=X.device)
    RSTD = torch.empty(n_rows, dtype=torch.float32, device=X.device)
    assert X.shape[1] == W.shape[0], (
        f"Incompatible hidden size: X.shape[1]={X.shape[1]} vs W.shape[0]={W.shape[0]}"
    )
    assert X.shape == G.shape, (
        f"X and G must have same shape, got {X.shape} and {G.shape}"
    )
    _rms_norm_gated_forward_kernel[(n_rows,)](
        Y,
        Y.stride(0),
        X,
        X.stride(0),
        G,
        G.stride(0),
        W,
        W.stride(0),
        RSTD,
        RSTD.stride(0),
        n_cols,
        eps,
        offset,
        BLOCK_SIZE=BLOCK_SIZE,
        num_warps=num_warps,
    )
    return Y.view(*shape), X, G, RSTD, BLOCK_SIZE, num_warps
 def rms_norm_gated_backward(dY, X, G, W, RSTD, offset, BLOCK_SIZE, num_warps):
    shape = dY.shape
    dim = shape[-1]
    dY = dY.view(-1, dim)
    n_rows, n_cols = dY.shape
    sm_count = torch.cuda.get_device_properties(X.device).multi_processor_count
    _dW = torch.empty((sm_count, n_cols), dtype=torch.float32, device=W.device)
    dX = torch.empty_like(dY)
    dG = torch.empty_like(dY)
    rows_per_program = math.ceil(n_rows / sm_count)
    grid = (sm_count,)
    _rms_norm_gated_backward_kernel[grid](
        dY,
        dY.stride(0),
        dX,
        dX.stride(0),
        dG,
        dG.stride(0),
        X,
        X.stride(0),
        torch_to_triton_dtype[X.dtype],
        G,
        G.stride(0),
        W,
        W.stride(0),
        RSTD,
        RSTD.stride(0),
        _dW,
        _dW.stride(0),
        n_rows,
        n_cols,
        offset,
        rows_per_program,
        BLOCK_SIZE=BLOCK_SIZE,
        num_warps=num_warps,
    )
    dX = dX.view(*shape)
    dG = dG.view(*shape)
    dW = _dW.sum(dim=0).to(W.dtype)
    return dX, dG, dW
 class FusedRMSNormGatedFunction(torch.autograd.Function):
    @staticmethod
    @ensure_contiguous
    def forward(ctx, X, G, W, eps, offset=0.0):
        """
        X: (B, T, H) or (BxT, H) — input hidden states
        G: (B, T, H) or (BxT, H) — gate tensor
        W: (H,) — weight parameter
        """
        Y, X, G, RSTD, BLOCK_SIZE, num_warps = rms_norm_gated_forward(
            X, G, W, eps, offset
        )
        ctx.offset = offset
        ctx.BLOCK_SIZE = BLOCK_SIZE
        ctx.num_warps = num_warps
        ctx.save_for_backward(X, G, W, RSTD)
        return Y
    @staticmethod
    @ensure_contiguous
    def backward(ctx, dY):
        X, G, W, RSTD = ctx.saved_tensors
        dX, dG, dW = rms_norm_gated_backward(
            dY, X, G, W, RSTD, ctx.offset, ctx.BLOCK_SIZE, ctx.num_warps
        )
        return dX, dG, dW, None, None
 class FusedRMSNormGated(torch.nn.Module):
    """
    Fused RMSNorm + SiLU Gate.
    Computes: Y = W * RMSNorm(X) * silu(G)
    Drop-in replacement for Qwen3_5RMSNormGated with matching
    init signature: __init__(hidden_size, eps=1e-6, **kwargs)
    and forward signature: forward(hidden_states, gate=None)
    """
    def __init__(self, hidden_size, eps=1e-6, offset=0.0, **kwargs):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps
        self.offset = offset
    def forward(self, hidden_states, gate=None):
        if gate is None:
            raise ValueError("FusedRMSNormGated requires a gate tensor")
        if hidden_states.device.type != "cuda":
            raise ValueError(
                f"FusedRMSNormGated requires CUDA tensors, got device={hidden_states.device}"
            )
        return FusedRMSNormGatedFunction.apply(
            hidden_states, gate, self.weight, self.variance_epsilon, self.offset
        )
    def extra_repr(self):
        return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
--- a/src/axolotl/monkeypatch/lora_kernels.py
+++ b/src/axolotl/monkeypatch/lora_kernels.py
@@ -12,6 +12,7 @@ from torch import nn
 from transformers import AutoConfig
 from axolotl.kernels.lora import (
    apply_lora_embedding,
    apply_lora_mlp_geglu,
    apply_lora_mlp_swiglu,
    apply_lora_o,
@@ -370,13 +371,13 @@ def apply_lora_kernel_patches(
        active_adapter = model.active_adapter
    lora_config = model.model.peft_config[active_adapter]
-    # Only patch if conditions are met
+    # Log what features are active
-    can_patch = lora_config.lora_dropout == 0 and lora_config.bias == "none"
+    if lora_config.lora_dropout > 0:
-
+        LOG.info(f"LoRA kernels: dropout={lora_config.lora_dropout} enabled")
-    if not can_patch:
+    if lora_config.bias != "none":
-        LOG.warning("Cannot patch layers - requires no dropout and no bias")
+        LOG.info(f"LoRA kernels: bias={lora_config.bias} enabled")
-        LOG.warning("Please specify `lora_dropout: 0` in your axolotl config file")
+    if lora_config.use_dora:
-        return model
+        LOG.info("LoRA kernels: DoRA enabled")
    # This needs to be reset after patching
    original_level = LOG.getEffectiveLevel()
@@ -419,44 +420,33 @@ def apply_lora_kernel_patches(
                    for linear_proj in ["q_proj", "k_proj", "v_proj"]
                ]
                can_patch_qkv = all(
-                    hasattr(module, "lora_A")
+                    hasattr(module, "lora_A") for module in layer_modules
                    and len(getattr(module, "lora_magnitude_vector", []) or []) == 0
                    for module in layer_modules
                )
                if can_patch_qkv:
                    # Add optimized implementation
                    self_attn.apply_qkv = types.MethodType(apply_lora_qkv, self_attn)
                else:
                    LOG.warning_once(
-                        "Cannot patch some attention QKV projections - requires LoRA "
+                        "Cannot patch some attention QKV projections - requires LoRA adapters"
                        "adapters and no lora_magnitude_vector (DoRA)"
                    )
            if cfg.lora_o_kernel:
                # Output patching
                layer_modules = [
                    getattr(self_attn, linear_proj) for linear_proj in ["o_proj"]
                ]
-                can_patch_o = all(
+                can_patch_o = all(hasattr(module, "lora_A") for module in layer_modules)
                    hasattr(module, "lora_A")
                    and len(getattr(module, "lora_magnitude_vector", []) or []) == 0
                    for module in layer_modules
                )
                if can_patch_o:
                    self_attn.apply_o = types.MethodType(apply_lora_o, self_attn)
                else:
                    LOG.warning_once(
-                        "Cannot patch some attention output projection - requires LoRA "
+                        "Cannot patch some attention output projection - requires LoRA adapters"
                        "adapters and no lora_magnitude_vector (DoRA)"
                    )
        for gate_proj, up_proj, down_proj, mlp in find_mlp_in_layer(layer):
            if cfg.lora_mlp_kernel:
                # MLP patching
                can_patch_mlp = all(
-                    hasattr(proj, "lora_A")
+                    hasattr(proj, "lora_A") for proj in (gate_proj, up_proj, down_proj)
                    and len(getattr(proj, "lora_magnitude_vector", []) or []) == 0
                    for proj in (gate_proj, up_proj, down_proj)
                )
                if can_patch_mlp:
@@ -464,15 +454,50 @@ def apply_lora_kernel_patches(
                    layer.mlp.forward = types.MethodType(apply_fn, mlp)
                else:
                    LOG.warning_once(
-                        "Cannot patch some MLP layers - requires LoRA adapters and no "
+                        "Cannot patch some MLP layers - requires LoRA adapters"
                        "lora_magnitude_vector (DoRA)"
                    )
    # Patch embedding layers (model-level, not per-layer)
    if cfg.lora_embedding_kernel:
        _patch_embedding_layers(model, cfg)
    LOG.setLevel(original_level)
    return model
 def _patch_embedding_layers(model: PeftModelForCausalLM, cfg: DictDefault):
    """Patch embedding layers with fused LoRA kernel.
    Handles both embed_tokens (nn.Embedding with lora_embedding_A/B) and
    lm_head (nn.Linear with lora_A/B, used when tied embeddings are untied by PEFT).
    """
    pretrained_model = model.model
    patched = 0
    # Find embedding modules - check common locations
    for attr_path in [
        ("model", "embed_tokens"),
        ("model", "language_model", "embed_tokens"),
    ]:
        parent = pretrained_model
        for attr in attr_path:
            parent = getattr(parent, attr, None)
            if parent is None:
                break
        if parent is not None and hasattr(parent, "lora_embedding_A"):
            LOG.info(f"Patching embedding layer: {'.'.join(attr_path)}")
            parent.forward = types.MethodType(apply_lora_embedding, parent)
            patched += 1
    # lm_head with LoRA is a Linear layer - already handled by LoRA_O/LoRA_W kernels
    # when included in target_modules. No special embedding handling needed since
    # PEFT wraps it as a Linear (not Embedding) even for tied models.
    if not patched:
        LOG.debug("No embedding layers with LoRA found to patch")
 class FakeMLP(nn.Module):
    """
    placeholder MLP for triton patching
--- a/src/axolotl/utils/schemas/config.py
+++ b/src/axolotl/utils/schemas/config.py
@@ -703,6 +703,12 @@ class AxolotlInputConfig(
            "description": "Apply custom LoRA autograd functions and activation function Triton kernels for speed and memory savings. See: https://docs.axolotl.ai/docs/lora_optims.html"
        },
    )
    lora_embedding_kernel: bool | None = Field(
        default=None,
        json_schema_extra={
            "description": "Apply custom LoRA autograd function for embedding layers. See: https://docs.axolotl.ai/docs/lora_optims.html"
        },
    )
    chunked_cross_entropy: bool | None = Field(
        default=None,
@@ -1313,6 +1319,7 @@ class AxolotlConfigWCapabilities(AxolotlInputConfig):
            data.get("lora_mlp_kernel")
            or data.get("lora_qkv_kernel")
            or data.get("lora_o_kernel")
            or data.get("lora_embedding_kernel")
        ):
            capabilities = data.get("capabilities")
            is_fsdp = data.get("fsdp_config") is not None
@@ -1360,7 +1367,12 @@ class AxolotlConfigWCapabilities(AxolotlInputConfig):
        if data.get("adapter") in ["lora", "qlora"]:
            # Skip if already set, using unsloth optimizations, or using 8-bit
            unsloth_fields = ["unsloth_lora_mlp", "unsloth_lora_qkv", "unsloth_lora_o"]
-            kernel_fields = ["lora_mlp_kernel", "lora_qkv_kernel", "lora_o_kernel"]
+            kernel_fields = [
                "lora_mlp_kernel",
                "lora_qkv_kernel",
                "lora_o_kernel",
                "lora_embedding_kernel",
            ]
            if (
                any(data.get(k) is not None for k in kernel_fields)
                or any(data.get(k) for k in unsloth_fields)
@@ -1373,9 +1385,38 @@ class AxolotlConfigWCapabilities(AxolotlInputConfig):
            if data.get("trust_remote_code"):
                return data
-            # Skip if dropout is not 0, as auto enabling it would just disable it during runtime patch checks
+            # Skip auto-enable for MoE models when native grouped_mm is unavailable
-            if data.get("lora_dropout") != 0:
+            # (torch < 2.9). The grouped_mm fallback in transformers uses torch.mm
-                return data
+            # with out= which bypasses autocast and fails on mixed dtypes during eval.
            env_capabilities = data.get("env_capabilities", {})
            torch_version = env_capabilities.get("torch_version")
            if torch_version is None:
                import torch
                torch_version = str(torch.__version__).split("+", maxsplit=1)[0]
            has_grouped_mm = version.parse(torch_version) >= version.parse("2.9.0")
            if not has_grouped_mm:
                is_moe = False
                model_type = data.get("model_config_type", "")
                if model_type and "moe" in model_type.lower():
                    is_moe = True
                if not is_moe:
                    try:
                        from transformers import AutoConfig
                        base_model = data.get("base_model")
                        if base_model:
                            auto_cfg = AutoConfig.from_pretrained(
                                base_model, trust_remote_code=False
                            )
                            if getattr(auto_cfg, "num_local_experts", None) or getattr(
                                auto_cfg, "num_experts", None
                            ):
                                is_moe = True
                    except Exception:  # pylint: disable=broad-exception-caught
                        pass
                if is_moe:
                    return data
            # Check multi-GPU compatibility
            capabilities = data.get("capabilities")
@@ -1398,6 +1439,9 @@ class AxolotlConfigWCapabilities(AxolotlInputConfig):
                if data.get("lora_o_kernel") is None:
                    data["lora_o_kernel"] = True
                if data.get("lora_embedding_kernel") is None:
                    data["lora_embedding_kernel"] = True
                LOG.warning(
                    "Auto-enabling LoRA kernel optimizations for faster training. "
                    + "Please explicitly set `lora_*_kernel` config values to `false` to disable. "
--- a/src/axolotl/utils/schemas/validation.py
+++ b/src/axolotl/utils/schemas/validation.py
@@ -681,15 +681,7 @@ class LoRAValidationMixin:
    @model_validator(mode="before")
    @classmethod
    def check_lora_kernels_dora(cls, data):
-        if (
+        # DoRA is now supported by lora kernels
            data.get("lora_mlp_kernel")
            or data.get("lora_qkv_kernel")
            or data.get("lora_o_kernel")
        ) and data.get("peft_use_dora"):
            raise ValueError(
                "lora_mlp_kernel, lora_qkv_kernel, and lora_o_kernel are not "
                "compatible with DoRA at the moment."
            )
        return data
    @model_validator(mode="before")
--- a/tests/core/test_async_grpo.py
+++ b/tests/core/test_async_grpo.py
@@ -153,7 +153,7 @@ class TestLoraFP8Guard(unittest.TestCase):
        proj.base_layer = base_layer
-        W, b, quant_state, A, B, s = get_lora_parameters(proj)
+        W, b, quant_state, A, B, s, *_ = get_lora_parameters(proj)
        # quant_state should be None since weight is bf16, not FP8
        self.assertIsNone(quant_state)
@@ -174,7 +174,7 @@ class TestLoraFP8Guard(unittest.TestCase):
        scale_inv = torch.ones(1)
        base_layer.weight_scale_inv = scale_inv
-        W, b, quant_state, A, B, s = get_lora_parameters(proj)
+        W, b, quant_state, A, B, s, *_ = get_lora_parameters(proj)
        self.assertIs(quant_state, scale_inv)
--- a/tests/e2e/kernels/test_lora.py
+++ b/tests/e2e/kernels/test_lora.py
@@ -102,7 +102,7 @@ def mock_proj():
 def test_get_lora_parameters(mock_proj):
    """Tests get_lora_parameters function"""
    # Test with LoRA enabled
-    W, b, _, A, B, s = get_lora_parameters(mock_proj)
+    W, b, _, A, B, s, *_ = get_lora_parameters(mock_proj)
    assert isinstance(W, torch.Tensor)
    assert W.shape == (128, 64)
@@ -113,13 +113,13 @@ def test_get_lora_parameters(mock_proj):
    # Test with LoRA disabled
    mock_proj.disable_adapters = True
-    W, b, _, A, B, s = get_lora_parameters(mock_proj)
+    W, b, _, A, B, s, *_ = get_lora_parameters(mock_proj)
    assert A is None and B is None and s is None
    # Test with merged state
    mock_proj.disable_adapters = False
    mock_proj.merged = True
-    W, b, _, A, B, s = get_lora_parameters(mock_proj)
+    W, b, _, A, B, s, *_ = get_lora_parameters(mock_proj)
    assert A is None and B is None and s is None
@@ -176,24 +176,31 @@ def test_lora_mlp_direct(sample_tensors, activation_forward, activation_backward
    X.requires_grad = True
    output = LoRA_MLP.apply(
        X,
        None,  # X_drop
        gate_proj.weight,
        gate_proj.bias,
        None,  # gate_quant
        None,  # gate_A
        None,  # gate_B
        None,  # gate_scale
        None,  # gate_lora_bias
        None,  # gate_magnitude
        up_proj.weight,
        up_proj.bias,
        None,  # up_quant
        None,  # up_A
        None,  # up_B
        None,  # up_scale
        None,  # up_lora_bias
        None,  # up_magnitude
        down_proj.weight,
        down_proj.bias,
        None,  # down_quant
        None,  # down_A
        None,  # down_B
        None,  # down_scale
        None,  # down_lora_bias
        None,  # down_magnitude
        activation_forward,
        activation_backward,
        True,  # inplace
@@ -247,24 +254,31 @@ def test_lora_mlp_with_adapters(
    # Forward pass with adapters
    output = LoRA_MLP.apply(
        X,
        None,  # X_drop
        gate_proj.weight,
        gate_proj.bias,
        None,
        gate_A,
        gate_B,
        scale,
        None,  # gate_lora_bias
        None,  # gate_magnitude
        up_proj.weight,
        up_proj.bias,
        None,
        up_A,
        up_B,
        scale,
        None,  # up_lora_bias
        None,  # up_magnitude
        down_proj.weight,
        down_proj.bias,
        None,
        down_A,
        down_B,
        scale,
        None,  # down_lora_bias
        None,  # down_magnitude
        activation_forward,
        activation_backward,
        True,
@@ -334,25 +348,32 @@ def test_lora_qkv(sample_tensors):
    Q1, K1, V1 = LoRA_QKV.apply(
        X,
        None,  # X_drop
        q_weight,
        None,
        None,
        None,
        None,
        None,
        None,
        None,  # Q: weight, bias, quant, A, B, scale, lora_bias, magnitude
        k_weight,
        None,
        None,
        None,
        None,
        None,
        None,
        None,  # K
        v_weight,
        None,
        None,
        None,
        None,
        None,
-        True,
+        None,
        None,  # V
        True,  # inplace
    )
    assert Q1.shape == K1.shape == V1.shape == X.shape
@@ -366,25 +387,32 @@ def test_lora_qkv(sample_tensors):
    # Test with LoRA adapters
    Q2, K2, V2 = LoRA_QKV.apply(
        X,
        None,  # X_drop
        q_weight,
        None,
        None,
        q_A,
        q_B,
        scale,
        None,
        None,  # Q
        k_weight,
        None,
        None,
        k_A,
        k_B,
        scale,
        None,
        None,  # K
        v_weight,
        None,
        None,
        v_A,
        v_B,
        scale,
-        True,
+        None,
        None,  # V
        True,  # inplace
    )
    assert Q2.shape == K2.shape == V2.shape == X.shape
@@ -427,7 +455,9 @@ def test_lora_o(sample_tensors):
    # Test forward pass
    X.requires_grad = True
-    output = LoRA_O.apply(X, W, b, None, A, B, scale)
+    output = LoRA_O.apply(
        X, None, W, b, None, A, B, scale, None, None
    )  # X_drop, ..., lora_bias, magnitude
    assert output.shape == (X.shape[0], X.shape[1], W.shape[0])
@@ -542,6 +572,7 @@ def test_inplace_operations(sample_tensors, apply_function):
            "down_proj": nn.Linear(shapes["out"], shapes["hidden"]).to(
                device="cuda", dtype=torch.float16
            ),
            "training": False,
        },
    )
--- a/tests/e2e/kernels/test_lora_features.py
+++ b/tests/e2e/kernels/test_lora_features.py
--- a/tests/e2e/multigpu/test_fsdp2_lora_kernels.py
+++ b/tests/e2e/multigpu/test_fsdp2_lora_kernels.py
@@ -0,0 +1,120 @@
 """Test LoRA kernels under FSDP2 multi-GPU training.
 Verifies that lora_qkv_kernel, lora_o_kernel, lora_mlp_kernel, and
 lora_embedding_kernel work correctly with FSDP2 sharding, including
 with bias, dropout, and DoRA enabled.
 """
 from pathlib import Path
 import yaml
 from accelerate.test_utils import execute_subprocess_async
 from transformers.testing_utils import get_torch_dist_unique_port
 from axolotl.utils.dict import DictDefault
 from tests.e2e.utils import require_torch_2_7_0
 AXOLOTL_ROOT = Path(__file__).parent.parent.parent.parent
 def _run_training(temp_dir, cfg):
    """Write config and launch multi-GPU training."""
    Path(temp_dir).mkdir(parents=True, exist_ok=True)
    with open(Path(temp_dir) / "config.yaml", "w", encoding="utf-8") as fout:
        fout.write(yaml.dump(cfg.to_dict(), Dumper=yaml.Dumper))
    execute_subprocess_async(
        [
            "axolotl",
            "train",
            str(Path(temp_dir) / "config.yaml"),
            "--num-processes",
            "2",
            "--main-process-port",
            f"{get_torch_dist_unique_port()}",
        ]
    )
 def _base_lora_fsdp2_config(temp_dir, **overrides):
    """Base config for LoRA + FSDP2 + kernel tests."""
    cfg = {
        "base_model": "Qwen/Qwen3-0.6B",
        "sequence_len": 512,
        "val_set_size": 0.0,
        "datasets": [
            {
                "path": "tatsu-lab/alpaca",
                "type": "alpaca",
                "split": "train[:1%]",
            },
        ],
        "adapter": "lora",
        "lora_r": 8,
        "lora_alpha": 16,
        "lora_target_linear": True,
        "num_epochs": 1,
        "max_steps": 3,
        "micro_batch_size": 1,
        "gradient_accumulation_steps": 1,
        "output_dir": temp_dir,
        "learning_rate": 1e-4,
        "optimizer": "adamw_torch_fused",
        "lr_scheduler": "cosine",
        "flash_attention": True,
        "bf16": True,
        "fsdp_version": 2,
        "fsdp_config": {
            "offload_params": False,
            "cpu_ram_efficient_loading": False,
            "transformer_layer_cls_to_wrap": "Qwen3DecoderLayer",
            "state_dict_type": "FULL_STATE_DICT",
            "auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
            "reshard_after_forward": True,
        },
        # Enable all LoRA kernels
        "lora_mlp_kernel": True,
        "lora_qkv_kernel": True,
        "lora_o_kernel": True,
        "lora_embedding_kernel": True,
        "save_safetensors": True,
    }
    cfg.update(overrides)
    return DictDefault(cfg)
 class TestFSDP2LoRAKernels:
    """Test LoRA kernels under FSDP2."""
    @require_torch_2_7_0
    def test_lora_kernels_basic(self, temp_dir):
        """Basic LoRA + kernels + FSDP2: no dropout, no bias, no DoRA."""
        cfg = _base_lora_fsdp2_config(temp_dir)
        _run_training(temp_dir, cfg)
        assert (Path(temp_dir) / "adapter_model.safetensors").exists()
    @require_torch_2_7_0
    def test_lora_kernels_with_dropout(self, temp_dir):
        """LoRA kernels + dropout + FSDP2."""
        cfg = _base_lora_fsdp2_config(temp_dir, lora_dropout=0.1)
        _run_training(temp_dir, cfg)
        assert (Path(temp_dir) / "adapter_model.safetensors").exists()
    @require_torch_2_7_0
    def test_lora_kernels_with_dora(self, temp_dir):
        """LoRA kernels + DoRA + FSDP2."""
        cfg = _base_lora_fsdp2_config(temp_dir, peft_use_dora=True)
        _run_training(temp_dir, cfg)
        assert (Path(temp_dir) / "adapter_model.safetensors").exists()
    @require_torch_2_7_0
    def test_lora_kernels_with_dora_and_dropout(self, temp_dir):
        """LoRA kernels + DoRA + dropout + FSDP2."""
        cfg = _base_lora_fsdp2_config(
            temp_dir,
            peft_use_dora=True,
            lora_dropout=0.05,
        )
        _run_training(temp_dir, cfg)
        assert (Path(temp_dir) / "adapter_model.safetensors").exists()
--- a/tests/e2e/patched/lora_kernels/test_lora_kernel_patching.py
+++ b/tests/e2e/patched/lora_kernels/test_lora_kernel_patching.py
@@ -222,9 +222,9 @@ def test_model_specific_activation(model_name, expected_activation):
 def test_kernel_patch_conditions():
-    """Test various conditions that should prevent kernel patching."""
+    """Test that kernels ARE patched even with dropout and bias (now supported)."""
    test_configs = [
-        # Dropout prevents patching
+        # Dropout — kernels now support this
        {
            "peft_type": "LORA",
            "task_type": "CAUSAL_LM",
@@ -234,7 +234,7 @@ def test_kernel_patch_conditions():
            "lora_dropout": 0.1,
            "bias": "none",
        },
-        # Bias prevents patching
+        # Bias — kernels now support this
        {
            "peft_type": "LORA",
            "task_type": "CAUSAL_LM",
@@ -252,13 +252,14 @@ def test_kernel_patch_conditions():
        model = PeftModelForCausalLM(model, peft_config)
        cfg = DictDefault({"lora_mlp_kernel": True})
        # Should not patch
        patched_model = apply_lora_kernel_patches(model, cfg)
        layer = patched_model.model.model.layers[0].mlp
-        # Verify no patches applied
+        # Verify patches ARE applied (dropout and bias are now supported)
-        assert layer.forward.__func__ is not apply_lora_mlp_swiglu
+        assert (
-        assert layer.forward.__func__ is not apply_lora_mlp_geglu
+            layer.forward.__func__ is apply_lora_mlp_swiglu
            or layer.forward.__func__ is apply_lora_mlp_geglu
        )
 def test_kernel_config_options():
@@ -511,7 +512,7 @@ def test_kernel_training_integration_auto_enable(temp_dir):
 def test_kernel_training_integration_dropout_non_zero(temp_dir):
-    """Test model loading with dropout non-zero should not patch."""
+    """Test model loading with dropout non-zero DOES patch (now supported)."""
    from axolotl.cli.utils import load_model_and_tokenizer
@@ -546,31 +547,18 @@ def test_kernel_training_integration_dropout_non_zero(temp_dir):
    # Load config
    cfg = load_cfg(str(path))
    # Get original attention class
    attention_cls = get_attention_cls_from_config(cfg)
    # Store original state before patching
    original_forward_method = attention_cls.forward
    # Load model
    model, tokenizer, _ = load_model_and_tokenizer(cfg=cfg)
    # We call modelloader as that's where the patches are applied
    # despite the fact that we're not using it to load the model
    model_loader = ModelLoader(cfg, tokenizer)
-    # Apply patch
+    # Apply patches — should succeed even with dropout > 0
    model_loader.patch_manager._apply_self_attention_lora_patch()
    # Verify patch was not applied
    assert attention_cls.forward == original_forward_method
    # Apply apply_lora_kernel_patches
    model_loader.patch_manager._apply_lora_kernel_patch(model)
-    # Verify patch was not applied
+    # Verify patches WERE applied (dropout is now supported by kernels)
    layers = get_layers(model)
    for layer in layers:
        for self_attn in find_self_attn_in_layer(layer):
-            assert not hasattr(self_attn, "apply_qkv")
+            assert hasattr(self_attn, "apply_qkv")
-            assert not hasattr(self_attn, "apply_o")
+            assert hasattr(self_attn, "apply_o")
--- a/tests/e2e/utils.py
+++ b/tests/e2e/utils.py
@@ -179,7 +179,7 @@ def check_tensorboard(
    tag: str,
    lt_val: float,
    assertion_err: str,
-    rtol: float = 0.02,
+    rtol: float = 0.05,
    gt_zero: bool = True,
 ) -> None:
    """
--- a/tests/kernels/test_rms_norm_gated.py
+++ b/tests/kernels/test_rms_norm_gated.py
@@ -0,0 +1,229 @@
 """
 Correctness tests for fused RMSNorm + SiLU Gate kernel.
 Tests against the eager Qwen3_5RMSNormGated implementation.
 """
 import pytest
 import torch
 import torch.nn.functional as F
 pytest.importorskip("triton", reason="triton required for fused kernels")
 if not torch.cuda.is_available():
    pytest.skip("CUDA required for fused kernel tests", allow_module_level=True)
 from axolotl.kernels.rms_norm_gated import FusedRMSNormGated
 class EagerRMSNormGated(torch.nn.Module):
    """Reference implementation matching Qwen3_5RMSNormGated exactly."""
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps
    def forward(self, hidden_states, gate=None):
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        hidden_states = self.weight * hidden_states.to(input_dtype)
        hidden_states = hidden_states * F.silu(gate.to(torch.float32))
        return hidden_states.to(input_dtype)
 def _sync_weights(eager_mod, fused_mod):
    """Copy weights from eager to fused module."""
    fused_mod.weight.data.copy_(eager_mod.weight.data)
@pytest.mark.parametrize("dtype", [torch.float32, torch.bfloat16, torch.float16])
@pytest.mark.parametrize(
    "shape",
    [
        (2, 128, 256),
        (4, 64, 512),
        (1, 32, 1024),
        (2, 16, 2560),  # Qwen3.5-4B hidden_size
        (2, 16, 4096),  # Qwen3.5-9B hidden_size
        (1, 8, 5120),  # Qwen3.5-27B hidden_size
        (4, 16, 2048),  # Qwen3.5-35B-A3B (MoE) hidden_size
        (4, 16, 3072),  # Qwen3.5-122B-A10B (MoE) hidden_size
    ],
 )
 class TestRMSNormGatedForward:
    def test_output_matches_eager(self, dtype, shape):
        torch.manual_seed(42)
        B, T, H = shape
        X = torch.randn(B, T, H, dtype=dtype, device="cuda")
        G = torch.randn(B, T, H, dtype=dtype, device="cuda")
        eager = EagerRMSNormGated(H).to(dtype=dtype, device="cuda")
        fused = FusedRMSNormGated(H).to(dtype=dtype, device="cuda")
        _sync_weights(eager, fused)
        y_eager = eager(X, gate=G)
        y_fused = fused(X, gate=G)
        if dtype == torch.float32:
            torch.testing.assert_close(y_fused, y_eager, atol=1e-5, rtol=1e-5)
        else:
            torch.testing.assert_close(y_fused, y_eager, atol=1e-2, rtol=1e-2)
    def test_output_shape(self, dtype, shape):
        B, T, H = shape
        X = torch.randn(B, T, H, dtype=dtype, device="cuda")
        G = torch.randn(B, T, H, dtype=dtype, device="cuda")
        fused = FusedRMSNormGated(H).to(dtype=dtype, device="cuda")
        y = fused(X, gate=G)
        assert y.shape == (B, T, H)
        assert y.dtype == dtype
@pytest.mark.parametrize("dtype", [torch.float32, torch.bfloat16, torch.float16])
@pytest.mark.parametrize(
    "shape",
    [
        (2, 32, 256),
        (2, 16, 512),
        (2, 16, 2560),  # Qwen3.5-4B
        (1, 8, 4096),  # Qwen3.5-9B
        (1, 8, 5120),  # Qwen3.5-27B
        (2, 16, 2048),  # Qwen3.5-35B-A3B (MoE)
        (2, 16, 3072),  # Qwen3.5-122B-A10B (MoE)
    ],
 )
 class TestRMSNormGatedBackward:
    def test_grad_x(self, dtype, shape):
        torch.manual_seed(42)
        B, T, H = shape
        X = torch.randn(B, T, H, dtype=dtype, device="cuda", requires_grad=True)
        G = torch.randn(B, T, H, dtype=dtype, device="cuda", requires_grad=True)
        X_ref = X.detach().clone().requires_grad_(True)
        G_ref = G.detach().clone().requires_grad_(True)
        eager = EagerRMSNormGated(H).to(dtype=dtype, device="cuda")
        fused = FusedRMSNormGated(H).to(dtype=dtype, device="cuda")
        _sync_weights(eager, fused)
        y_eager = eager(X_ref, gate=G_ref)
        y_fused = fused(X, gate=G)
        grad_out = torch.randn_like(y_eager)
        y_eager.backward(grad_out)
        y_fused.backward(grad_out)
        if dtype == torch.float32:
            atol, rtol = 1e-4, 1e-4
        else:
            atol, rtol = 5e-2, 5e-2
        torch.testing.assert_close(X.grad, X_ref.grad, atol=atol, rtol=rtol)
    def test_grad_gate(self, dtype, shape):
        torch.manual_seed(42)
        B, T, H = shape
        X = torch.randn(B, T, H, dtype=dtype, device="cuda", requires_grad=True)
        G = torch.randn(B, T, H, dtype=dtype, device="cuda", requires_grad=True)
        X_ref = X.detach().clone().requires_grad_(True)
        G_ref = G.detach().clone().requires_grad_(True)
        eager = EagerRMSNormGated(H).to(dtype=dtype, device="cuda")
        fused = FusedRMSNormGated(H).to(dtype=dtype, device="cuda")
        _sync_weights(eager, fused)
        y_eager = eager(X_ref, gate=G_ref)
        y_fused = fused(X, gate=G)
        grad_out = torch.randn_like(y_eager)
        y_eager.backward(grad_out)
        y_fused.backward(grad_out)
        if dtype == torch.float32:
            atol, rtol = 1e-4, 1e-4
        else:
            atol, rtol = 5e-2, 5e-2
        torch.testing.assert_close(G.grad, G_ref.grad, atol=atol, rtol=rtol)
    def test_grad_weight(self, dtype, shape):
        torch.manual_seed(42)
        B, T, H = shape
        X = torch.randn(B, T, H, dtype=dtype, device="cuda", requires_grad=True)
        G = torch.randn(B, T, H, dtype=dtype, device="cuda", requires_grad=True)
        X_ref = X.detach().clone().requires_grad_(True)
        G_ref = G.detach().clone().requires_grad_(True)
        eager = EagerRMSNormGated(H).to(dtype=dtype, device="cuda")
        fused = FusedRMSNormGated(H).to(dtype=dtype, device="cuda")
        _sync_weights(eager, fused)
        y_eager = eager(X_ref, gate=G_ref)
        y_fused = fused(X, gate=G)
        grad_out = torch.randn_like(y_eager)
        y_eager.backward(grad_out)
        y_fused.backward(grad_out)
        if dtype == torch.float32:
            atol, rtol = 1e-4, 1e-4
        else:
            atol, rtol = 5e-2, 5e-2
        torch.testing.assert_close(
            fused.weight.grad, eager.weight.grad, atol=atol, rtol=rtol
        )
 class TestRMSNormGatedEdgeCases:
    def test_gate_none_raises(self):
        fused = FusedRMSNormGated(256).cuda()
        X = torch.randn(2, 4, 256, device="cuda")
        with pytest.raises(ValueError, match="requires a gate tensor"):
            fused(X, gate=None)
    def test_2d_input(self):
        """Test with (BxT, H) shaped input instead of (B, T, H)."""
        torch.manual_seed(42)
        H = 512
        X = torch.randn(64, H, dtype=torch.bfloat16, device="cuda", requires_grad=True)
        G = torch.randn(64, H, dtype=torch.bfloat16, device="cuda", requires_grad=True)
        X_ref = X.detach().clone().requires_grad_(True)
        G_ref = G.detach().clone().requires_grad_(True)
        eager = EagerRMSNormGated(H).to(dtype=torch.bfloat16, device="cuda")
        fused = FusedRMSNormGated(H).to(dtype=torch.bfloat16, device="cuda")
        _sync_weights(eager, fused)
        y_eager = eager(X_ref, gate=G_ref)
        y_fused = fused(X, gate=G)
        torch.testing.assert_close(y_fused, y_eager, atol=1e-2, rtol=1e-2)
        grad_out = torch.randn_like(y_eager)
        y_eager.backward(grad_out)
        y_fused.backward(grad_out)
        torch.testing.assert_close(X.grad, X_ref.grad, atol=5e-2, rtol=5e-2)
        torch.testing.assert_close(G.grad, G_ref.grad, atol=5e-2, rtol=5e-2)
    def test_random_weight_init(self):
        """Test with non-default weight values."""
        torch.manual_seed(123)
        H = 256
        X = torch.randn(2, 16, H, dtype=torch.bfloat16, device="cuda")
        G = torch.randn(2, 16, H, dtype=torch.bfloat16, device="cuda")
        eager = EagerRMSNormGated(H).to(dtype=torch.bfloat16, device="cuda")
        # Randomize weights
        eager.weight.data = torch.randn_like(eager.weight.data)
        fused = FusedRMSNormGated(H).to(dtype=torch.bfloat16, device="cuda")
        _sync_weights(eager, fused)
        y_eager = eager(X, gate=G)
        y_fused = fused(X, gate=G)
        torch.testing.assert_close(y_fused, y_eager, atol=1e-2, rtol=1e-2)
--- a/tests/utils/lora/test_config_validation_lora.py
+++ b/tests/utils/lora/test_config_validation_lora.py
@@ -28,20 +28,22 @@ class TestLoRAConfigValidation:
        result = validate_config(valid_config)
        assert result["adapter"] == "lora"
-        with pytest.raises(ValueError, match="not compatible with DoRA"):
+        # DoRA is now compatible with lora kernels
-            invalid_config = DictDefault(
+        dora_kernel_config = DictDefault(
-                {
+            {
-                    "adapter": "lora",
+                "adapter": "lora",
-                    "lora_mlp_kernel": True,
+                "lora_mlp_kernel": True,
-                    "peft_use_dora": True,
+                "peft_use_dora": True,
-                    "datasets": [{"path": "dummy_dataset", "type": "alpaca"}],
+                "datasets": [{"path": "dummy_dataset", "type": "alpaca"}],
-                    "micro_batch_size": 1,
+                "micro_batch_size": 1,
-                    "gradient_accumulation_steps": 1,
+                "gradient_accumulation_steps": 1,
-                    "learning_rate": 1e-5,
+                "learning_rate": 1e-5,
-                    "base_model": "dummy_model",
+                "base_model": "dummy_model",
-                }
+            }
-            )
+        )
-            validate_config(invalid_config)
+        result = validate_config(dora_kernel_config)
        assert result["lora_mlp_kernel"] is True
        assert result["peft_use_dora"] is True
    def test_qlora_4bit_validation(self):
        """Test QLoRA 4-bit configuration validation"""
--- a/tests/utils/lora/test_freeze_lora.py
+++ b/tests/utils/lora/test_freeze_lora.py
@@ -38,6 +38,11 @@ class TestLoRAParameterFreezing:
            mock_layer.lora_A["default"].weight = torch.randn(16, 256, dtype=self.dtype)
            mock_layer.lora_B["default"].weight = torch.randn(512, 16, dtype=self.dtype)
            mock_layer.lora_B["default"].bias = None
            # Required by get_lora_parameters for dropout/DoRA extraction
            mock_layer.lora_dropout = {}
            mock_layer.lora_magnitude_vector = None
        else:
            mock_layer.weight = base_layer.weight
            mock_layer.bias = base_layer.bias
@@ -48,7 +53,7 @@ class TestLoRAParameterFreezing:
        """Test that LoRA parameters are None when adapters are disabled."""
        layer = self.create_mock_lora_layer(has_adapters=True, adapters_disabled=True)
-        W, b, quant_state, A, B, s = get_lora_parameters(layer)
+        W, b, quant_state, A, B, s, *_ = get_lora_parameters(layer)
        # Base parameters should be returned
        assert W is not None
@@ -62,7 +67,7 @@ class TestLoRAParameterFreezing:
        """Test that LoRA parameters are None when adapters are merged."""
        layer = self.create_mock_lora_layer(has_adapters=True, merged=True)
-        W, b, quant_state, A, B, s = get_lora_parameters(layer)
+        W, b, quant_state, A, B, s, *_ = get_lora_parameters(layer)
        # Base parameters should be returned
        assert W is not None
@@ -77,7 +82,7 @@ class TestLoRAParameterFreezing:
        """Test parameter behavior when no adapters are present."""
        layer = self.create_mock_lora_layer(has_adapters=False)
-        W, b, quant_state, A, B, s = get_lora_parameters(layer)
+        W, b, quant_state, A, B, s, *_ = get_lora_parameters(layer)
        # Base parameters should be returned
        assert W is not None
@@ -94,7 +99,7 @@ class TestLoRAParameterFreezing:
            has_adapters=True, adapters_disabled=False, merged=False
        )
-        W, b, quant_state, A, B, s = get_lora_parameters(layer)
+        W, b, quant_state, A, B, s, *_ = get_lora_parameters(layer)
        # All parameters should be returned
        assert W is not None
@@ -110,7 +115,7 @@ class TestLoRAParameterFreezing:
            has_adapters=True, adapters_disabled=False, merged=False
        )
-        W, b, quant_state, A, B, s = get_lora_parameters(layer)
+        W, b, quant_state, A, B, s, *_ = get_lora_parameters(layer)
        # Check shape consistency
        assert W.shape == (512, 256)
@@ -124,7 +129,7 @@ class TestLoRAParameterFreezing:
            has_adapters=True, adapters_disabled=False, merged=False
        )
-        W, b, quant_state, A, B, s = get_lora_parameters(layer)
+        W, b, quant_state, A, B, s, *_ = get_lora_parameters(layer)
        assert W.dtype == self.dtype
        assert b.dtype == self.dtype
@@ -138,7 +143,7 @@ class TestLoRAParameterFreezing:
        quant_state_mock = Mock()
        layer.base_layer.weight.quant_state = quant_state_mock
-        W, b, quant_state, A, B, s = get_lora_parameters(layer)
+        W, b, quant_state, A, B, s, *_ = get_lora_parameters(layer)
        assert quant_state == quant_state_mock
@@ -157,7 +162,7 @@ class TestLoRAParameterFreezing:
        layer.active_adapters = ["adapter2"]
-        W, b, quant_state, A, B, s = get_lora_parameters(layer)
+        W, b, quant_state, A, B, s, *_ = get_lora_parameters(layer)
        assert s == 0.2
        assert torch.equal(A, layer.lora_A["adapter2"].weight)
@@ -192,13 +197,13 @@ class TestLoRAParameterFreezingIntegration:
        model = get_peft_model(base_model, lora_config)
        lora_layer = model.base_model.model.linear
        # Test with adapters enabled
-        W, b, quant_state, A, B, s = get_lora_parameters(lora_layer)
+        W, b, quant_state, A, B, s, *_ = get_lora_parameters(lora_layer)
        assert A is not None
        assert B is not None
        assert s is not None
        # Test with adapters disabled
        model.disable_adapter_layers()
-        W, b, quant_state, A, B, s = get_lora_parameters(lora_layer)
+        W, b, quant_state, A, B, s, *_ = get_lora_parameters(lora_layer)
        assert A is None
        assert B is None
        assert s is None
Author	SHA1	Message	Date
Wing Lian	936149380f	support nemotron for scattermoe-lora	2026-03-23 21:29:58 +00:00
Wing Lian	86be9f329e	post merge lora fixes for CI (#3536 ) [skip ci] * post merge lora fixes for CI * handle lora kernel auto-enable for moe without grouped_mm * prefer not to import torch in schema validation	2026-03-23 02:26:10 -04:00
Wing Lian	0e583efeaa	increase rtol, codecov informational only, don't silently fail errors w curl (#3534 ) [skip ci]	2026-03-22 13:54:03 -04:00
Wing Lian	b3289fd190	feat: LoRA kernel support for bias, dropout, dora, embeddings (#3528 ) [skip ci] * feat: LoRA kernel support for bias, dropout, dora, embeddings * chore: lint * chore: lint * address PR feedback, add regression tests, add fsdp2 tests for lora kernels * update tests for new sigs * update tests now that bias and dropout are supported	2026-03-22 13:53:19 -04:00
Wing Lian	a67392c427	liger support for qwen 3.5 and fused rmsnorm+gated (#3531 ) [skip ci] * liger support for qwen 3.5 and fused rmsnorm+gated * support for qwen 3.5 moe * fix version ref * fixups for PR code review	2026-03-22 13:19:21 -04:00