diff --git a/.nojekyll b/.nojekyll
index f3c74bd25..7cd04cfc1 100644
--- a/.nojekyll
+++ b/.nojekyll
@@ -1 +1 @@
-605b84d9
\ No newline at end of file
+77cdfda6
\ No newline at end of file
diff --git a/docs/api/cli.main.html b/docs/api/cli.main.html
index edd9603a4..d49172e74 100644
--- a/docs/api/cli.main.html
+++ b/docs/api/cli.main.html
@@ -790,7 +790,9 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true});
  • 📚 Documentation
  • +
  • AI Agent Support
  • 🤝 Getting Help
  • 🌟 Contributing
  • 📈 Telemetry
@@ -976,6 +977,25 @@ Expand older updates
  • FAQ - Frequently asked questions
  • +
    +

    AI Agent Support

    +

    Axolotl ships with built-in documentation optimized for AI coding agents (Claude Code, Cursor, Copilot, etc.). These docs are bundled with the pip package — no repo clone needed.

    +
    # Show overview and available training methods
    +axolotl agent-docs
    +
    +# Topic-specific references
    +axolotl agent-docs sft                 # supervised fine-tuning
    +axolotl agent-docs grpo                # GRPO online RL
    +axolotl agent-docs preference_tuning   # DPO, KTO, ORPO, SimPO
    +axolotl agent-docs reward_modelling    # outcome and process reward models
    +axolotl agent-docs pretraining         # continual pretraining
    +axolotl agent-docs --list              # list all topics
    +
    +# Dump config schema for programmatic use
    +axolotl config-schema
    +axolotl config-schema --field adapter
    +

    If you’re working with the source repo, agent docs are also available at docs/agents/ and the project overview is in AGENTS.md.
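
    For tooling that wants to consume the schema rather than read it, the config-schema output can be parsed directly. A minimal sketch, assuming the command prints a standard JSON Schema document to stdout as shown above; the subprocess wrapper and the top-level "properties" layout are assumptions for illustration, not a documented part of the Axolotl API:

    import json
    import subprocess

    # Capture the full config schema emitted by the CLI.
    result = subprocess.run(
        ["axolotl", "config-schema"],
        capture_output=True,
        text=True,
        check=True,
    )
    schema = json.loads(result.stdout)

    # List every top-level config option the schema exposes.
    # Assumes the usual JSON Schema "properties" mapping (hypothetical here).
    for name in sorted(schema.get("properties", {})):
        print(name)

    The same approach works with axolotl config-schema --field adapter to inspect a single option.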

    +

    🤝 Getting Help

    📜 License

    diff --git a/search.json b/search.json index eda44af55..50682912a 100644 --- a/search.json +++ b/search.json @@ -629,21 +629,21 @@ "href": "docs/api/kernels.lora.html", "title": "kernels.lora", "section": "", - "text": "kernels.lora\nModule for definition of Low-Rank Adaptation (LoRA) Triton kernels.\nSee “LoRA: Low-Rank Adaptation of Large Language Models”\n(https://arxiv.org/abs/2106.09685).\nAlso supports DoRA (Weight-Decomposed Low-Rank Adaptation):\nSee “DoRA: Weight-Decomposed Low-Rank Adaptation” (https://arxiv.org/abs/2402.09353).\nCredit to unsloth (https://unsloth.ai/) for inspiration for this implementation.\n\n\n\n\n\nName\nDescription\n\n\n\n\nLoRA_Embedding\nFused LoRA embedding: F.embedding(x, W) + s * F.embedding(x, A^T) @ B^T.\n\n\nLoRA_MLP\nOptimized LoRA MLP implementation.\n\n\nLoRA_O\nOptimized LoRA implementation for output projection.\n\n\nLoRA_QKV\nOptimized LoRA QKV implementation with quantization support.\n\n\n\n\n\nkernels.lora.LoRA_Embedding()\nFused LoRA embedding: F.embedding(x, W) + s * F.embedding(x, A^T) @ B^T.\nSupports dropout and DoRA.\n\n\n\nkernels.lora.LoRA_MLP()\nOptimized LoRA MLP implementation.\nSupports bias, dropout, and DoRA. Dropout is applied to the input for\ngate/up projections. The down projection uses hidden states (post-activation)\nas input, so dropout is not applied there.\n\n\n\nkernels.lora.LoRA_O()\nOptimized LoRA implementation for output projection.\nSupports bias, dropout, and DoRA.\n\n\n\nkernels.lora.LoRA_QKV()\nOptimized LoRA QKV implementation with quantization support.\nSupports bias, dropout, and DoRA (Weight-Decomposed Low-Rank Adaptation).\nDropout is applied outside this Function so autograd handles its backward.\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\napply_lora_embedding\nApplies LoRA to embedding layer.\n\n\napply_lora_mlp_geglu\nApplies LoRA to MLP layer with GEGLU activation.\n\n\napply_lora_mlp_swiglu\nApplies LoRA to MLP layer with SwiGLU activation.\n\n\napply_lora_o\nApplies LoRA to output projection layer.\n\n\napply_lora_qkv\nApplies LoRA to compute Query, Key, Value projections.\n\n\nget_embedding_lora_parameters\nExtract LoRA parameters from a PEFT Embedding module.\n\n\nget_lora_parameters\nGets LoRA parameters from a projection module.\n\n\nmatmul_lora\nEfficient fused matmul + LoRA computation.\n\n\n\n\n\nkernels.lora.apply_lora_embedding(self, x)\nApplies LoRA to embedding layer.\n\n\n\nkernels.lora.apply_lora_mlp_geglu(self, X, inplace=True)\nApplies LoRA to MLP layer with GEGLU activation.\nSupports bias, dropout, and DoRA.\n\n\n\nkernels.lora.apply_lora_mlp_swiglu(self, X, inplace=True)\nApplies LoRA to MLP layer with SwiGLU activation.\nSupports bias, dropout, and DoRA.\n\n\n\nkernels.lora.apply_lora_o(self, X)\nApplies LoRA to output projection layer.\nSupports bias, dropout, and DoRA.\n\n\n\nkernels.lora.apply_lora_qkv(self, X, inplace=True)\nApplies LoRA to compute Query, Key, Value projections.\nSupports bias, dropout, and DoRA. Dropout is applied outside the autograd\nFunction so PyTorch handles its backward automatically. 
A single shared\ndropout mask is used across Q, K, V projections for memory efficiency.\n\n\n\nkernels.lora.get_embedding_lora_parameters(embed)\nExtract LoRA parameters from a PEFT Embedding module.\n\n\n\nkernels.lora.get_lora_parameters(proj)\nGets LoRA parameters from a projection module.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nproj\nnn.Module\nThe projection module to extract parameters from.\nrequired\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ntorch.Tensor\nA tuple containing:\n\n\n\ntorch.Tensor | None\n- W: base weight tensor\n\n\n\nQuantState | torch.Tensor | None\n- b: base layer bias (or None)\n\n\n\ntorch.Tensor | None\n- quant_state: quantization state (or None)\n\n\n\ntorch.Tensor | None\n- A: LoRA A weight (or None)\n\n\n\nfloat | None\n- B: LoRA B weight (or None)\n\n\n\ntorch.Tensor | None\n- s: LoRA scaling factor (or None)\n\n\n\nnn.Module | None\n- lora_bias: LoRA B bias (or None)\n\n\n\ntorch.Tensor | None\n- dropout: dropout module (or None)\n\n\n\ntuple[torch.Tensor, torch.Tensor | None, QuantState | torch.Tensor | None, torch.Tensor | None, torch.Tensor | None, float | None, torch.Tensor | None, nn.Module | None, torch.Tensor | None]\n- magnitude: DoRA magnitude vector (or None)\n\n\n\n\n\n\n\nkernels.lora.matmul_lora(\n X,\n W,\n b,\n W_quant,\n A,\n B,\n s,\n out=None,\n X_drop=None,\n lora_bias=None,\n)\nEfficient fused matmul + LoRA computation.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nX\ntorch.Tensor\nInput tensor [*, in_features]\nrequired\n\n\nW\ntorch.Tensor\nBase weight matrix [out_features, in_features]\nrequired\n\n\nW_quant\nQuantState | torch.Tensor | None\nQuantization state for W\nrequired\n\n\nA\ntorch.Tensor | None\nLoRA A matrix [rank, in_features]\nrequired\n\n\nB\ntorch.Tensor | None\nLoRA B matrix [out_features, rank]\nrequired\n\n\ns\nfloat | None\nLoRA scaling factor\nrequired\n\n\nout\ntorch.Tensor | None\nOptional output tensor for inplace operations\nNone\n\n\nX_drop\ntorch.Tensor | None\nOptional dropout-applied input for LoRA path (if None, uses X)\nNone\n\n\nlora_bias\ntorch.Tensor | None\nOptional LoRA B layer bias [out_features]\nNone\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ntorch.Tensor\nResult of X @ W + s * X_drop @ A @ B + b + s * lora_bias" + "text": "kernels.lora\nModule for definition of Low-Rank Adaptation (LoRA) Triton kernels.\nSee “LoRA: Low-Rank Adaptation of Large Language Models”\n(https://arxiv.org/abs/2106.09685).\nAlso supports DoRA (Weight-Decomposed Low-Rank Adaptation):\nSee “DoRA: Weight-Decomposed Low-Rank Adaptation” (https://arxiv.org/abs/2402.09353).\nCredit to unsloth (https://unsloth.ai/) for inspiration for this implementation.\n\n\n\n\n\nName\nDescription\n\n\n\n\nLoRA_Embedding\nFused LoRA embedding: F.embedding(x, W) + s * F.embedding(x, A^T) @ B^T.\n\n\nLoRA_MLP\nOptimized LoRA MLP implementation.\n\n\nLoRA_O\nOptimized LoRA implementation for output projection.\n\n\nLoRA_QK\nOptimized LoRA QK implementation for models where v_proj is None.\n\n\nLoRA_QKV\nOptimized LoRA QKV implementation with quantization support.\n\n\n\n\n\nkernels.lora.LoRA_Embedding()\nFused LoRA embedding: F.embedding(x, W) + s * F.embedding(x, A^T) @ B^T.\nSupports dropout and DoRA.\n\n\n\nkernels.lora.LoRA_MLP()\nOptimized LoRA MLP implementation.\nSupports bias, dropout, and DoRA. Dropout is applied to the input for\ngate/up projections. 
The down projection uses hidden states (post-activation)\nas input, so dropout is not applied there.\n\n\n\nkernels.lora.LoRA_O()\nOptimized LoRA implementation for output projection.\nSupports bias, dropout, and DoRA.\n\n\n\nkernels.lora.LoRA_QK()\nOptimized LoRA QK implementation for models where v_proj is None.\nUsed by models like Gemma4 with attention_k_eq_v=True, where key states are\nreused as value states. Only Q and K projections are fused; the caller\nreturns K a second time as V so that autograd accumulates key+value gradients\ninto a single dK.\nSupports bias, dropout, and DoRA (Weight-Decomposed Low-Rank Adaptation).\n\n\n\nkernels.lora.LoRA_QKV()\nOptimized LoRA QKV implementation with quantization support.\nSupports bias, dropout, and DoRA (Weight-Decomposed Low-Rank Adaptation).\nDropout is applied outside this Function so autograd handles its backward.\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\napply_lora_embedding\nApplies LoRA to embedding layer.\n\n\napply_lora_mlp_geglu\nApplies LoRA to MLP layer with GEGLU activation.\n\n\napply_lora_mlp_swiglu\nApplies LoRA to MLP layer with SwiGLU activation.\n\n\napply_lora_o\nApplies LoRA to output projection layer.\n\n\napply_lora_qk\nApplies LoRA to compute Query and Key projections for models where v_proj is None.\n\n\napply_lora_qkv\nApplies LoRA to compute Query, Key, Value projections.\n\n\nget_embedding_lora_parameters\nExtract LoRA parameters from a PEFT Embedding module.\n\n\nget_lora_parameters\nGets LoRA parameters from a projection module.\n\n\nmatmul_lora\nEfficient fused matmul + LoRA computation.\n\n\n\n\n\nkernels.lora.apply_lora_embedding(self, x)\nApplies LoRA to embedding layer.\n\n\n\nkernels.lora.apply_lora_mlp_geglu(self, X, inplace=True)\nApplies LoRA to MLP layer with GEGLU activation.\nSupports bias, dropout, and DoRA.\n\n\n\nkernels.lora.apply_lora_mlp_swiglu(self, X, inplace=True)\nApplies LoRA to MLP layer with SwiGLU activation.\nSupports bias, dropout, and DoRA.\n\n\n\nkernels.lora.apply_lora_o(self, X)\nApplies LoRA to output projection layer.\nSupports bias, dropout, and DoRA.\n\n\n\nkernels.lora.apply_lora_qk(self, X, inplace=True)\nApplies LoRA to compute Query and Key projections for models where v_proj is None.\nWhen v_proj is None (e.g. Gemma4 attention_k_eq_v), key states are reused as\nvalue states. Returns (Q, K, K) — the caller’s patched forward will use K as V.\nBecause K is returned twice, autograd accumulates gradients from both the key and\nvalue paths into dK before calling LoRA_QK.backward.\nSupports bias, dropout, and DoRA.\n\n\n\nkernels.lora.apply_lora_qkv(self, X, inplace=True)\nApplies LoRA to compute Query, Key, Value projections.\nSupports bias, dropout, and DoRA. Dropout is applied outside the autograd\nFunction so PyTorch handles its backward automatically. 
A single shared\ndropout mask is used across Q, K, V projections for memory efficiency.\n\n\n\nkernels.lora.get_embedding_lora_parameters(embed)\nExtract LoRA parameters from a PEFT Embedding module.\n\n\n\nkernels.lora.get_lora_parameters(proj)\nGets LoRA parameters from a projection module.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nproj\nnn.Module\nThe projection module to extract parameters from.\nrequired\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ntorch.Tensor\nA tuple containing:\n\n\n\ntorch.Tensor | None\n- W: base weight tensor\n\n\n\nQuantState | torch.Tensor | None\n- b: base layer bias (or None)\n\n\n\ntorch.Tensor | None\n- quant_state: quantization state (or None)\n\n\n\ntorch.Tensor | None\n- A: LoRA A weight (or None)\n\n\n\nfloat | None\n- B: LoRA B weight (or None)\n\n\n\ntorch.Tensor | None\n- s: LoRA scaling factor (or None)\n\n\n\nnn.Module | None\n- lora_bias: LoRA B bias (or None)\n\n\n\ntorch.Tensor | None\n- dropout: dropout module (or None)\n\n\n\ntuple[torch.Tensor, torch.Tensor | None, QuantState | torch.Tensor | None, torch.Tensor | None, torch.Tensor | None, float | None, torch.Tensor | None, nn.Module | None, torch.Tensor | None]\n- magnitude: DoRA magnitude vector (or None)\n\n\n\n\n\n\n\nkernels.lora.matmul_lora(\n X,\n W,\n b,\n W_quant,\n A,\n B,\n s,\n out=None,\n X_drop=None,\n lora_bias=None,\n)\nEfficient fused matmul + LoRA computation.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nX\ntorch.Tensor\nInput tensor [*, in_features]\nrequired\n\n\nW\ntorch.Tensor\nBase weight matrix [out_features, in_features]\nrequired\n\n\nW_quant\nQuantState | torch.Tensor | None\nQuantization state for W\nrequired\n\n\nA\ntorch.Tensor | None\nLoRA A matrix [rank, in_features]\nrequired\n\n\nB\ntorch.Tensor | None\nLoRA B matrix [out_features, rank]\nrequired\n\n\ns\nfloat | None\nLoRA scaling factor\nrequired\n\n\nout\ntorch.Tensor | None\nOptional output tensor for inplace operations\nNone\n\n\nX_drop\ntorch.Tensor | None\nOptional dropout-applied input for LoRA path (if None, uses X)\nNone\n\n\nlora_bias\ntorch.Tensor | None\nOptional LoRA B layer bias [out_features]\nNone\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ntorch.Tensor\nResult of X @ W + s * X_drop @ A @ B + b + s * lora_bias" }, { "objectID": "docs/api/kernels.lora.html#classes", "href": "docs/api/kernels.lora.html#classes", "title": "kernels.lora", "section": "", - "text": "Name\nDescription\n\n\n\n\nLoRA_Embedding\nFused LoRA embedding: F.embedding(x, W) + s * F.embedding(x, A^T) @ B^T.\n\n\nLoRA_MLP\nOptimized LoRA MLP implementation.\n\n\nLoRA_O\nOptimized LoRA implementation for output projection.\n\n\nLoRA_QKV\nOptimized LoRA QKV implementation with quantization support.\n\n\n\n\n\nkernels.lora.LoRA_Embedding()\nFused LoRA embedding: F.embedding(x, W) + s * F.embedding(x, A^T) @ B^T.\nSupports dropout and DoRA.\n\n\n\nkernels.lora.LoRA_MLP()\nOptimized LoRA MLP implementation.\nSupports bias, dropout, and DoRA. Dropout is applied to the input for\ngate/up projections. 
The down projection uses hidden states (post-activation)\nas input, so dropout is not applied there.\n\n\n\nkernels.lora.LoRA_O()\nOptimized LoRA implementation for output projection.\nSupports bias, dropout, and DoRA.\n\n\n\nkernels.lora.LoRA_QKV()\nOptimized LoRA QKV implementation with quantization support.\nSupports bias, dropout, and DoRA (Weight-Decomposed Low-Rank Adaptation).\nDropout is applied outside this Function so autograd handles its backward." + "text": "Name\nDescription\n\n\n\n\nLoRA_Embedding\nFused LoRA embedding: F.embedding(x, W) + s * F.embedding(x, A^T) @ B^T.\n\n\nLoRA_MLP\nOptimized LoRA MLP implementation.\n\n\nLoRA_O\nOptimized LoRA implementation for output projection.\n\n\nLoRA_QK\nOptimized LoRA QK implementation for models where v_proj is None.\n\n\nLoRA_QKV\nOptimized LoRA QKV implementation with quantization support.\n\n\n\n\n\nkernels.lora.LoRA_Embedding()\nFused LoRA embedding: F.embedding(x, W) + s * F.embedding(x, A^T) @ B^T.\nSupports dropout and DoRA.\n\n\n\nkernels.lora.LoRA_MLP()\nOptimized LoRA MLP implementation.\nSupports bias, dropout, and DoRA. Dropout is applied to the input for\ngate/up projections. The down projection uses hidden states (post-activation)\nas input, so dropout is not applied there.\n\n\n\nkernels.lora.LoRA_O()\nOptimized LoRA implementation for output projection.\nSupports bias, dropout, and DoRA.\n\n\n\nkernels.lora.LoRA_QK()\nOptimized LoRA QK implementation for models where v_proj is None.\nUsed by models like Gemma4 with attention_k_eq_v=True, where key states are\nreused as value states. Only Q and K projections are fused; the caller\nreturns K a second time as V so that autograd accumulates key+value gradients\ninto a single dK.\nSupports bias, dropout, and DoRA (Weight-Decomposed Low-Rank Adaptation).\n\n\n\nkernels.lora.LoRA_QKV()\nOptimized LoRA QKV implementation with quantization support.\nSupports bias, dropout, and DoRA (Weight-Decomposed Low-Rank Adaptation).\nDropout is applied outside this Function so autograd handles its backward." }, { "objectID": "docs/api/kernels.lora.html#functions", "href": "docs/api/kernels.lora.html#functions", "title": "kernels.lora", "section": "", - "text": "Name\nDescription\n\n\n\n\napply_lora_embedding\nApplies LoRA to embedding layer.\n\n\napply_lora_mlp_geglu\nApplies LoRA to MLP layer with GEGLU activation.\n\n\napply_lora_mlp_swiglu\nApplies LoRA to MLP layer with SwiGLU activation.\n\n\napply_lora_o\nApplies LoRA to output projection layer.\n\n\napply_lora_qkv\nApplies LoRA to compute Query, Key, Value projections.\n\n\nget_embedding_lora_parameters\nExtract LoRA parameters from a PEFT Embedding module.\n\n\nget_lora_parameters\nGets LoRA parameters from a projection module.\n\n\nmatmul_lora\nEfficient fused matmul + LoRA computation.\n\n\n\n\n\nkernels.lora.apply_lora_embedding(self, x)\nApplies LoRA to embedding layer.\n\n\n\nkernels.lora.apply_lora_mlp_geglu(self, X, inplace=True)\nApplies LoRA to MLP layer with GEGLU activation.\nSupports bias, dropout, and DoRA.\n\n\n\nkernels.lora.apply_lora_mlp_swiglu(self, X, inplace=True)\nApplies LoRA to MLP layer with SwiGLU activation.\nSupports bias, dropout, and DoRA.\n\n\n\nkernels.lora.apply_lora_o(self, X)\nApplies LoRA to output projection layer.\nSupports bias, dropout, and DoRA.\n\n\n\nkernels.lora.apply_lora_qkv(self, X, inplace=True)\nApplies LoRA to compute Query, Key, Value projections.\nSupports bias, dropout, and DoRA. Dropout is applied outside the autograd\nFunction so PyTorch handles its backward automatically. 
A single shared\ndropout mask is used across Q, K, V projections for memory efficiency.\n\n\n\nkernels.lora.get_embedding_lora_parameters(embed)\nExtract LoRA parameters from a PEFT Embedding module.\n\n\n\nkernels.lora.get_lora_parameters(proj)\nGets LoRA parameters from a projection module.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nproj\nnn.Module\nThe projection module to extract parameters from.\nrequired\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ntorch.Tensor\nA tuple containing:\n\n\n\ntorch.Tensor | None\n- W: base weight tensor\n\n\n\nQuantState | torch.Tensor | None\n- b: base layer bias (or None)\n\n\n\ntorch.Tensor | None\n- quant_state: quantization state (or None)\n\n\n\ntorch.Tensor | None\n- A: LoRA A weight (or None)\n\n\n\nfloat | None\n- B: LoRA B weight (or None)\n\n\n\ntorch.Tensor | None\n- s: LoRA scaling factor (or None)\n\n\n\nnn.Module | None\n- lora_bias: LoRA B bias (or None)\n\n\n\ntorch.Tensor | None\n- dropout: dropout module (or None)\n\n\n\ntuple[torch.Tensor, torch.Tensor | None, QuantState | torch.Tensor | None, torch.Tensor | None, torch.Tensor | None, float | None, torch.Tensor | None, nn.Module | None, torch.Tensor | None]\n- magnitude: DoRA magnitude vector (or None)\n\n\n\n\n\n\n\nkernels.lora.matmul_lora(\n X,\n W,\n b,\n W_quant,\n A,\n B,\n s,\n out=None,\n X_drop=None,\n lora_bias=None,\n)\nEfficient fused matmul + LoRA computation.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nX\ntorch.Tensor\nInput tensor [*, in_features]\nrequired\n\n\nW\ntorch.Tensor\nBase weight matrix [out_features, in_features]\nrequired\n\n\nW_quant\nQuantState | torch.Tensor | None\nQuantization state for W\nrequired\n\n\nA\ntorch.Tensor | None\nLoRA A matrix [rank, in_features]\nrequired\n\n\nB\ntorch.Tensor | None\nLoRA B matrix [out_features, rank]\nrequired\n\n\ns\nfloat | None\nLoRA scaling factor\nrequired\n\n\nout\ntorch.Tensor | None\nOptional output tensor for inplace operations\nNone\n\n\nX_drop\ntorch.Tensor | None\nOptional dropout-applied input for LoRA path (if None, uses X)\nNone\n\n\nlora_bias\ntorch.Tensor | None\nOptional LoRA B layer bias [out_features]\nNone\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ntorch.Tensor\nResult of X @ W + s * X_drop @ A @ B + b + s * lora_bias" + "text": "Name\nDescription\n\n\n\n\napply_lora_embedding\nApplies LoRA to embedding layer.\n\n\napply_lora_mlp_geglu\nApplies LoRA to MLP layer with GEGLU activation.\n\n\napply_lora_mlp_swiglu\nApplies LoRA to MLP layer with SwiGLU activation.\n\n\napply_lora_o\nApplies LoRA to output projection layer.\n\n\napply_lora_qk\nApplies LoRA to compute Query and Key projections for models where v_proj is None.\n\n\napply_lora_qkv\nApplies LoRA to compute Query, Key, Value projections.\n\n\nget_embedding_lora_parameters\nExtract LoRA parameters from a PEFT Embedding module.\n\n\nget_lora_parameters\nGets LoRA parameters from a projection module.\n\n\nmatmul_lora\nEfficient fused matmul + LoRA computation.\n\n\n\n\n\nkernels.lora.apply_lora_embedding(self, x)\nApplies LoRA to embedding layer.\n\n\n\nkernels.lora.apply_lora_mlp_geglu(self, X, inplace=True)\nApplies LoRA to MLP layer with GEGLU activation.\nSupports bias, dropout, and DoRA.\n\n\n\nkernels.lora.apply_lora_mlp_swiglu(self, X, inplace=True)\nApplies LoRA to MLP layer with SwiGLU activation.\nSupports bias, dropout, and DoRA.\n\n\n\nkernels.lora.apply_lora_o(self, X)\nApplies LoRA to output projection layer.\nSupports bias, dropout, and 
DoRA.\n\n\n\nkernels.lora.apply_lora_qk(self, X, inplace=True)\nApplies LoRA to compute Query and Key projections for models where v_proj is None.\nWhen v_proj is None (e.g. Gemma4 attention_k_eq_v), key states are reused as\nvalue states. Returns (Q, K, K) — the caller’s patched forward will use K as V.\nBecause K is returned twice, autograd accumulates gradients from both the key and\nvalue paths into dK before calling LoRA_QK.backward.\nSupports bias, dropout, and DoRA.\n\n\n\nkernels.lora.apply_lora_qkv(self, X, inplace=True)\nApplies LoRA to compute Query, Key, Value projections.\nSupports bias, dropout, and DoRA. Dropout is applied outside the autograd\nFunction so PyTorch handles its backward automatically. A single shared\ndropout mask is used across Q, K, V projections for memory efficiency.\n\n\n\nkernels.lora.get_embedding_lora_parameters(embed)\nExtract LoRA parameters from a PEFT Embedding module.\n\n\n\nkernels.lora.get_lora_parameters(proj)\nGets LoRA parameters from a projection module.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nproj\nnn.Module\nThe projection module to extract parameters from.\nrequired\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ntorch.Tensor\nA tuple containing:\n\n\n\ntorch.Tensor | None\n- W: base weight tensor\n\n\n\nQuantState | torch.Tensor | None\n- b: base layer bias (or None)\n\n\n\ntorch.Tensor | None\n- quant_state: quantization state (or None)\n\n\n\ntorch.Tensor | None\n- A: LoRA A weight (or None)\n\n\n\nfloat | None\n- B: LoRA B weight (or None)\n\n\n\ntorch.Tensor | None\n- s: LoRA scaling factor (or None)\n\n\n\nnn.Module | None\n- lora_bias: LoRA B bias (or None)\n\n\n\ntorch.Tensor | None\n- dropout: dropout module (or None)\n\n\n\ntuple[torch.Tensor, torch.Tensor | None, QuantState | torch.Tensor | None, torch.Tensor | None, torch.Tensor | None, float | None, torch.Tensor | None, nn.Module | None, torch.Tensor | None]\n- magnitude: DoRA magnitude vector (or None)\n\n\n\n\n\n\n\nkernels.lora.matmul_lora(\n X,\n W,\n b,\n W_quant,\n A,\n B,\n s,\n out=None,\n X_drop=None,\n lora_bias=None,\n)\nEfficient fused matmul + LoRA computation.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nX\ntorch.Tensor\nInput tensor [*, in_features]\nrequired\n\n\nW\ntorch.Tensor\nBase weight matrix [out_features, in_features]\nrequired\n\n\nW_quant\nQuantState | torch.Tensor | None\nQuantization state for W\nrequired\n\n\nA\ntorch.Tensor | None\nLoRA A matrix [rank, in_features]\nrequired\n\n\nB\ntorch.Tensor | None\nLoRA B matrix [out_features, rank]\nrequired\n\n\ns\nfloat | None\nLoRA scaling factor\nrequired\n\n\nout\ntorch.Tensor | None\nOptional output tensor for inplace operations\nNone\n\n\nX_drop\ntorch.Tensor | None\nOptional dropout-applied input for LoRA path (if None, uses X)\nNone\n\n\nlora_bias\ntorch.Tensor | None\nOptional LoRA B layer bias [out_features]\nNone\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ntorch.Tensor\nResult of X @ W + s * X_drop @ A @ B + b + s * lora_bias" }, { "objectID": "docs/api/monkeypatch.utils.html", @@ -3269,6 +3269,16 @@ "Home" ] }, + { + "objectID": "index.html#ai-agent-support", + "href": "index.html#ai-agent-support", + "title": "Axolotl", + "section": "AI Agent Support", + "text": "AI Agent Support\nAxolotl ships with built-in documentation optimized for AI coding agents (Claude Code, Cursor, Copilot, etc.). 
These docs are bundled with the pip package — no repo clone needed.\n# Show overview and available training methods\naxolotl agent-docs\n\n# Topic-specific references\naxolotl agent-docs sft # supervised fine-tuning\naxolotl agent-docs grpo # GRPO online RL\naxolotl agent-docs preference_tuning # DPO, KTO, ORPO, SimPO\naxolotl agent-docs reward_modelling # outcome and process reward models\naxolotl agent-docs pretraining # continual pretraining\naxolotl agent-docs --list # list all topics\n\n# Dump config schema for programmatic use\naxolotl config-schema\naxolotl config-schema --field adapter\nIf you’re working with the source repo, agent docs are also available at docs/agents/ and the project overview is in AGENTS.md.", + "crumbs": [ + "Home" + ] + }, { "objectID": "index.html#getting-help", "href": "index.html#getting-help", @@ -3718,7 +3728,7 @@ "href": "docs/custom_integrations.html#cut-cross-entropy", "title": "Custom Integrations", "section": "Cut Cross Entropy", - "text": "Cut Cross Entropy\nCut Cross Entropy (CCE) reduces VRAM usage through optimization on the cross-entropy operation during loss calculation.\nSee https://github.com/apple/ml-cross-entropy\n\nRequirements\n\nPyTorch 2.4.0 or higher\n\n\n\nInstallation\nRun the following command to install cut_cross_entropy[transformers] if you don’t have it already.\n\nIf you are in dev environment\n\npython scripts/cutcrossentropy_install.py | sh\n\nIf you are installing from pip\n\npip3 uninstall -y cut-cross-entropy && pip3 install \"cut-cross-entropy[transformers] @ git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@63b15e6\"\n\n\nUsage\nplugins:\n - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin\n\n\nSupported Models\n\nafmoe\napertus\narcee\ncohere\ncohere2\ndeepseek_v3\nexaone4\ngemma\ngemma2\ngemma3\ngemma3_text\ngemma3n\ngemma3n_text\nglm\nglm4\nglm4_moe\nglm4_moe_lite\nglm46v\nglm4v\nglm4v_moe\nglm_image\nglm_moe_dsa\ngpt_oss\ngranite\ngranitemoe\ngranitemoehybrid\ngranitemoeshared\nhunyuan_v1_dense\nhunyuan_v1_moe\ninternvl\nkimi_linear\nlfm2\nlfm2_moe\nlfm2_vl\nllama\nllama4\nllama4_text\nllava\nministral\nministral3\nmistral\nmistral3\nmistral4\nmixtral\nmllama\nnemotron_h\nolmo\nolmo2\nolmo3\nolmoe\nphi\nphi3\nphi4_multimodal\nqwen2\nqwen2_5_vl\nqwen2_moe\nqwen2_vl\nqwen3\nqwen3_5\nqwen3_5_text\nqwen3_5_moe\nqwen3_5_moe_text\nqwen3_moe\nqwen3_next\nqwen3_vl\nqwen3_vl_moe\nseed_oss\nsmollm3\nstep3p5\nvoxtral\n\n\n\nCitation\n@article{wijmans2024cut,\n author = {Erik Wijmans and\n Brody Huval and\n Alexander Hertzberg and\n Vladlen Koltun and\n Philipp Kr\\\"ahenb\\\"uhl},\n title = {Cut Your Losses in Large-Vocabulary Language Models},\n journal = {arXiv},\n year = {2024},\n url = {https://arxiv.org/abs/2411.09009},\n}\nPlease see reference here", + "text": "Cut Cross Entropy\nCut Cross Entropy (CCE) reduces VRAM usage through optimization on the cross-entropy operation during loss calculation.\nSee https://github.com/apple/ml-cross-entropy\n\nRequirements\n\nPyTorch 2.4.0 or higher\n\n\n\nInstallation\nRun the following command to install cut_cross_entropy[transformers] if you don’t have it already.\n\nIf you are in dev environment\n\npython scripts/cutcrossentropy_install.py | sh\n\nIf you are installing from pip\n\npip3 uninstall -y cut-cross-entropy && pip3 install \"cut-cross-entropy[transformers] @ git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@fec1a88\"\n\n\nUsage\nplugins:\n - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin\n\n\nSupported 
Models\n\nafmoe\napertus\narcee\ncohere\ncohere2\ndeepseek_v3\nexaone4\ngemma\ngemma2\ngemma3\ngemma3_text\ngemma3n\ngemma3n_text\ngemma4\nglm\nglm4\nglm4_moe\nglm4_moe_lite\nglm46v\nglm4v\nglm4v_moe\nglm_image\nglm_moe_dsa\ngpt_oss\ngranite\ngranitemoe\ngranitemoehybrid\ngranitemoeshared\nhunyuan_v1_dense\nhunyuan_v1_moe\ninternvl\nkimi_linear\nlfm2\nlfm2_moe\nlfm2_vl\nllama\nllama4\nllama4_text\nllava\nministral\nministral3\nmistral\nmistral3\nmistral4\nmixtral\nmllama\nnemotron_h\nolmo\nolmo2\nolmo3\nolmoe\nphi\nphi3\nphi4_multimodal\nqwen2\nqwen2_5_vl\nqwen2_moe\nqwen2_vl\nqwen3\nqwen3_5\nqwen3_5_text\nqwen3_5_moe\nqwen3_5_moe_text\nqwen3_moe\nqwen3_next\nqwen3_vl\nqwen3_vl_moe\nseed_oss\nsmollm3\nstep3p5\nvoxtral\n\n\n\nCitation\n@article{wijmans2024cut,\n author = {Erik Wijmans and\n Brody Huval and\n Alexander Hertzberg and\n Vladlen Koltun and\n Philipp Kr\\\"ahenb\\\"uhl},\n title = {Cut Your Losses in Large-Vocabulary Language Models},\n journal = {arXiv},\n year = {2024},\n url = {https://arxiv.org/abs/2411.09009},\n}\nPlease see reference here", "crumbs": [ "Advanced Features", "Custom Integrations" @@ -3762,7 +3772,7 @@ "href": "docs/custom_integrations.html#kernels-integration", "title": "Custom Integrations", "section": "Kernels Integration", - "text": "Kernels Integration\nMoE (Mixture of Experts) kernels speed up training for MoE layers and reduce VRAM costs. In transformers v5, batched_mm and grouped_mm were integrated as built-in options via the experts_implementation config kwarg:\nclass ExpertsInterface(GeneralInterface):\n _global_mapping = {\n \"batched_mm\": batched_mm_experts_forward,\n \"grouped_mm\": grouped_mm_experts_forward,\n }\nIn our custom integration, we add support for ScatterMoE and SonicMoE, which are more efficient and faster than grouped_mm.\n\nUsage\nAdd the following to your axolotl YAML config:\nplugins:\n - axolotl.integrations.kernels.KernelsPlugin\n\nuse_kernels: true\n\nuse_scattermoe: true\nuse_sonicmoe: true\nImportant: Setting experts_implementation to batched_mm or grouped_mm is incompatible with custom kernel options. The exception is experts_implementation: scattermoe, which is used for models like Gemma 4 that embed MoE directly in the decoder layer (no SparseMoeBlock) and dispatch through the transformers ExpertsInterface.\n\n\nSonicMoE installation\nPrerequisites:\n- NVIDIA Hopper (H100, H200) or Blackwell (B200, GB200) GPU\n- CUDA 12.9+ (13.0+ for B300)\n- PyTorch 2.7+ (2.9.1 recommended)\n- For B300: Triton 3.6.0\npip install --ignore-requires-python --no-deps \"sonic-moe @ git+https://github.com/Dao-AILab/sonic-moe.git@116e2df0a41874f77fa0ad269ce7df3f0cfcb956\" && pip install nvidia-cutlass-dsl==4.4.0 quack-kernels==0.2.5\nSee the SonicMoE installation guide for the latest prerequisite details.\nNote: Blackwell support is in upstream beta. 
On Blackwell GPUs, Axolotl automatically sets USE_QUACK_GEMM=1 to enable the Blackwell kernels.\n\n\nHow It Works\nThe KernelsPlugin runs before model loading and:\n\n\nScatterMoE\n\nRegisters the ScatterMoE kernel from the local libs/scattermoe_lora package (includes fused LoRA support via Triton kernels).\nPatches the model’s SparseMoeBlock forward method with the optimized ScatterMoE implementation via the HF kernels library.\n\n\n\nSonicMoE\n\nResolves the model’s MoE block class(es) from constants.py.\nPatches the forward method with SonicMoE’s optimized CUTLASS kernels and registers a weight converter for the interleaved gate/up projection format.\nSupports pluggable routing strategies (see routing table below).\n\nBoth paths use the shared resolve_moe_block_classes utility in constants.py for model-type-to-class resolution.\n\n\nModel Support Matrix\nMost models use the SwiGLU activation (silu(gate) * up). Gemma 4 uses GEGLU (gelu(gate) * up). ScatterMoE supports any gated activation (activation is applied in Python between kernel calls). SonicMoE supports SwiGLU, GEGLU, and REGLU via its ActivationType enum.\n\n\nRouting strategies\n\n\n\n\n\n\n\n\n\nRouting Strategy\nDescription\nScatterMoE\nSonicMoE\n\n\n\n\nsoftmax → topk\nSoftmax over experts, select top-K, optional renormalization\nYes\nYes\n\n\nsoftmax → group selection → topk\nSoftmax, select top groups (sum of top-2 per group), topk from selected groups, renorm + scaling\nNo\nYes\n\n\nsigmoid → topk (with groups)\nSigmoid + bias correction, group-based masking, topk from masked scores, weights from original sigmoid\nYes\nYes\n\n\nsigmoid → topk (no groups)\nSigmoid + bias correction, straight topk (n_group=1)\nYes\nYes\n\n\nsoftmax → bias correction → topk\nSoftmax, bias via gate.moe_statics, topk, gather from original probs, clamp-based renorm\nNo\nYes\n\n\nsoftmax → group_limited_greedy\nSoftmax, group selection (max per group), topk, scale only (no renorm)\nNo\nYes\n\n\nsoftmax → topk via gate.wg\nSoftmax, gate weight at gate.wg.weight (not gate.weight), always renormalize\nNo\nYes\n\n\nsoftmax → topk + per_expert_scale\nRMSNorm → scale → proj → softmax → topk → renorm → per-expert learned scales\nYes\nYes\n\n\nfused topk → softmax\nRouting + expert computation fused in a single kernel\nNo\nPlanned\n\n\n\n\n\nPer-model support\n\n\n\n\n\n\n\n\n\n\nModel Type\nArchitecture\nRouting\nScatterMoE\nSonicMoE\n\n\n\n\nqwen2_moe\nQwen2-MoE\nsoftmax → topk\nYes\nYes\n\n\nqwen3_moe\nQwen3-MoE\nsoftmax → topk\nYes\nYes\n\n\nqwen3_5_moe\nQwen3.5-MoE\nsoftmax → topk\nYes\nYes\n\n\nqwen3_5_moe_text\nQwen3.5-MoE (VLM text)\nsoftmax → topk\nYes\nYes\n\n\nqwen3_next\nQwen3-Next\nsoftmax → topk\nYes\nYes\n\n\nqwen3_vl_moe\nQwen3-VL-MoE\nsoftmax → topk\nYes\nYes\n\n\nqwen3_omni_moe\nQwen3-Omni (Thinker + Talker)\nsoftmax → topk\nYes\nYes\n\n\nolmoe\nOLMoE\nsoftmax → topk\nYes\nYes\n\n\nmixtral\nMixtral\nsoftmax → topk\nYes\nYes\n\n\nminimax\nMiniMax\nsoftmax → topk\nYes\nYes\n\n\nmistral4\nMistral 4\nsoftmax → group → topk\nNo\nYes\n\n\nglm_moe_dsa\nGLM-MoE DSA (GLM 5)\nsigmoid → topk (groups)\nYes\nYes\n\n\ndeepseek_v3\nDeepSeek-V3\nsigmoid → topk (groups)\nYes\nYes\n\n\nglm4_moe\nGLM4-MoE\nsigmoid → topk (groups)\nYes\nYes\n\n\nglm4_moe_lite\nGLM4-MoE Lite (GLM 4.7 Flash)\nsigmoid → topk (groups)\nYes*\nYes\n\n\nglm4v_moe\nGLM4v-MoE\nsigmoid → topk (groups)\nYes\nYes\n\n\nminimax_m2\nMiniMax M2\nsigmoid → topk (no groups)\nYes\nYes\n\n\nernie4_5_moe\nERNIE 4.5 MoE\nsoftmax → bias → topk\nNo\nYes\n\n\ndeepseek_v2\nDeepSeek-V2\nsoftmax → 
group_limited_greedy\nNo\nYes\n\n\nhunyuan_v1_moe\nHunYuan V1 MoE\nsoftmax → topk (gate.wg)\nNo\nYes\n\n\ngemma4_text\nGemma 4 (26B-A4B)\nsoftmax → topk + per_expert_scale\nYes**\nYes**\n\n\ngpt_oss\nGPT-OSS\nfused topk → softmax\nNo\nPlanned\n\n\n\n* glm4_moe_lite with ScatterMoE may have issues — see Limitations.\n** Gemma 4 uses experts_implementation: scattermoe path (registered via ExpertsInterface) instead of SparseMoeBlock patching, since Gemma 4 embeds MoE directly in its decoder layer (no separate SparseMoeBlock). See the Gemma 4 section below.\n\n\nFeature comparison\n\n\n\n\n\n\n\n\nFeature\nScatterMoE\nSonicMoE\n\n\n\n\nKernel backend\nTriton\nCUTLASS\n\n\nGPU requirement\nAny CUDA\nHopper (H100/H200) or Blackwell (B200+)\n\n\nLoRA approach\nFused in Triton kernel\nRuntime materialization + custom autograd\n\n\nLoRA overhead\nLower (fused computation)\nHigher (per-forward materialization)\n\n\nGate/router LoRA\nYes\nYes\n\n\nExpert LoRA\nYes (fused)\nYes (materialized)\n\n\nShared expert LoRA\nYes (standard PEFT)\nYes (standard PEFT)\n\n\nSelective expert dequantization\nYes (~97% memory savings)\nNo\n\n\nWeight format\nTransposed [E, hidden, 2*inter]\nInterleaved gate/up [2*I, H, E]\n\n\ntorch.compile routing\nNo\nYes (optional)\n\n\n\n\n\nShared Expert Handling\nBoth kernels handle shared experts identically. Shared expert attribute names are detected in order of priority:\n\nshared_expert (Qwen2-MoE)\nshared_experts (GLM-MoE, DeepSeek-V3)\nshared_mlp (HunYuan V1 MoE)\n\nIf shared_expert_gate exists, sigmoid gating is applied to the shared expert contribution before adding it to the routed output. PEFT wraps shared expert linear layers with standard LoRA — no special handling is needed.\n\n\nGemma 4\nGemma 4 (e.g. google/gemma-4-26B-A4B) has a unique hybrid MoE architecture:\n\nNo SparseMoeBlock: MoE is embedded directly in the decoder layer alongside a dense MLP. Both run in parallel and their outputs are summed.\nCustom router (Gemma4TextRouter): RMSNorm → learned scale → linear projection → softmax → top-k → renormalization → per-expert learned scales.\nGEGLU activation: Uses gelu_pytorch_tanh (not SiLU/SwiGLU like most other MoE models).\n128 experts, top-k=8 for the 26B-A4B variant.\n\nBecause there is no SparseMoeBlock class to patch, Gemma 4 uses a different integration path: we register \"scattermoe\" as a custom implementation in the transformers ExpertsInterface, and set experts_implementation: scattermoe in the config. The @use_experts_implementation decorator on Gemma4TextExperts then dispatches to our ScatterMoE kernel automatically. The router is untouched — it runs as-is.\nImportant limitations:\n- Flash Attention 2 is not supported — Gemma 4 uses global_head_dim: 512 for full attention layers, which exceeds FA2’s maximum head dimension of 256. Use sdp_attention: true instead.\n- Multimodal model: Gemma 4 includes vision and audio encoders. For text-only SFT, use lora_target_linear_modules with a regex to restrict LoRA to the text backbone (e.g. language_model\\.model\\.layers\\.\\d+\\.self_attn\\.(q|k|v|o)_proj).\n\n\nLimitations\n\nScatterMoE + GLM4-MoE Lite: ScatterMoE does not work reliably for GLM 4.7 Flash (glm4_moe_lite).\nNon-SwiGLU activations: Neither kernel supports MoE architectures with non-SwiGLU expert activations (e.g., GPT-OSS uses a custom GLU variant).\nGPT-OSS: Deferred — requires transposed weight layout [E, H, 2*I], expert biases, and custom GLU activation. 
A dedicated forward path is needed.\nFSDP + fused gate LoRA (SonicMoE): The fused topk→softmax path materializes a local tensor when LoRA delta is present to avoid DTensor + Tensor mixing under FSDP.\n\n\n\nNote on MegaBlocks\nWe tested MegaBlocks but were unable to ensure numerical accuracy, so we did not integrate it. It was also incompatible with many newer model architectures in transformers.\nPlease see reference here", + "text": "Kernels Integration\nMoE (Mixture of Experts) kernels speed up training for MoE layers and reduce VRAM costs. In transformers v5, batched_mm and grouped_mm were integrated as built-in options via the experts_implementation config kwarg:\nclass ExpertsInterface(GeneralInterface):\n _global_mapping = {\n \"batched_mm\": batched_mm_experts_forward,\n \"grouped_mm\": grouped_mm_experts_forward,\n }\nIn our custom integration, we add support for ScatterMoE and SonicMoE, which are more efficient and faster than grouped_mm.\n\nUsage\nAdd the following to your axolotl YAML config:\nplugins:\n - axolotl.integrations.kernels.KernelsPlugin\n\nuse_kernels: true\n\nuse_scattermoe: true\nuse_sonicmoe: true\nImportant: Setting experts_implementation to batched_mm or grouped_mm is incompatible with custom kernel options. The exception is experts_implementation: scattermoe, which is used for models like Gemma 4 that embed MoE directly in the decoder layer (no SparseMoeBlock) and dispatch through the transformers ExpertsInterface.\n\n\nSonicMoE installation\nPrerequisites:\n- NVIDIA Hopper (H100, H200) or Blackwell (B200, GB200) GPU\n- CUDA 12.9+ (13.0+ for B300)\n- PyTorch 2.7+ (2.9.1 recommended)\n- For B300: Triton 3.6.0\npip install --ignore-requires-python --no-deps \"sonic-moe @ git+https://github.com/Dao-AILab/sonic-moe.git@116e2df0a41874f77fa0ad269ce7df3f0cfcb956\" && pip install nvidia-cutlass-dsl==4.4.0 quack-kernels==0.2.5\nSee the SonicMoE installation guide for the latest prerequisite details.\nNote: Blackwell support is in upstream beta. On Blackwell GPUs, Axolotl automatically sets USE_QUACK_GEMM=1 to enable the Blackwell kernels.\n\n\nHow It Works\nThe KernelsPlugin runs before model loading and:\n\n\nScatterMoE\n\nRegisters the ScatterMoE kernel from the local libs/scattermoe_lora package (includes fused LoRA support via Triton kernels).\nPatches the model’s SparseMoeBlock forward method with the optimized ScatterMoE implementation via the HF kernels library.\n\n\n\nSonicMoE\n\nResolves the model’s MoE block class(es) from constants.py.\nPatches the forward method with SonicMoE’s optimized CUTLASS kernels and registers a weight converter for the interleaved gate/up projection format.\nSupports pluggable routing strategies (see routing table below).\n\nBoth paths use the shared resolve_moe_block_classes utility in constants.py for model-type-to-class resolution.\n\n\nModel Support Matrix\nMost models use the SwiGLU activation (silu(gate) * up). Gemma 4 uses GEGLU (gelu(gate) * up). ScatterMoE supports any gated activation (activation is applied in Python between kernel calls). 
SonicMoE supports SwiGLU, GEGLU, and REGLU via its ActivationType enum.\n\n\nRouting strategies\n\n\n\n\n\n\n\n\n\nRouting Strategy\nDescription\nScatterMoE\nSonicMoE\n\n\n\n\nsoftmax → topk\nSoftmax over experts, select top-K, optional renormalization\nYes\nYes\n\n\nsoftmax → group selection → topk\nSoftmax, select top groups (sum of top-2 per group), topk from selected groups, renorm + scaling\nNo\nYes\n\n\nsigmoid → topk (with groups)\nSigmoid + bias correction, group-based masking, topk from masked scores, weights from original sigmoid\nYes\nYes\n\n\nsigmoid → topk (no groups)\nSigmoid + bias correction, straight topk (n_group=1)\nYes\nYes\n\n\nsoftmax → bias correction → topk\nSoftmax, bias via gate.moe_statics, topk, gather from original probs, clamp-based renorm\nNo\nYes\n\n\nsoftmax → group_limited_greedy\nSoftmax, group selection (max per group), topk, scale only (no renorm)\nNo\nYes\n\n\nsoftmax → topk via gate.wg\nSoftmax, gate weight at gate.wg.weight (not gate.weight), always renormalize\nNo\nYes\n\n\nsoftmax → topk + per_expert_scale\nRMSNorm → scale → proj → softmax → topk → renorm → per-expert learned scales\nYes\nYes\n\n\nfused topk → softmax\nRouting + expert computation fused in a single kernel\nNo\nPlanned\n\n\n\n\n\nPer-model support\n\n\n\n\n\n\n\n\n\n\nModel Type\nArchitecture\nRouting\nScatterMoE\nSonicMoE\n\n\n\n\nqwen2_moe\nQwen2-MoE\nsoftmax → topk\nYes\nYes\n\n\nqwen3_moe\nQwen3-MoE\nsoftmax → topk\nYes\nYes\n\n\nqwen3_5_moe\nQwen3.5-MoE\nsoftmax → topk\nYes\nYes\n\n\nqwen3_5_moe_text\nQwen3.5-MoE (VLM text)\nsoftmax → topk\nYes\nYes\n\n\nqwen3_next\nQwen3-Next\nsoftmax → topk\nYes\nYes\n\n\nqwen3_vl_moe\nQwen3-VL-MoE\nsoftmax → topk\nYes\nYes\n\n\nqwen3_omni_moe\nQwen3-Omni (Thinker + Talker)\nsoftmax → topk\nYes\nYes\n\n\nolmoe\nOLMoE\nsoftmax → topk\nYes\nYes\n\n\nmixtral\nMixtral\nsoftmax → topk\nYes\nYes\n\n\nminimax\nMiniMax\nsoftmax → topk\nYes\nYes\n\n\nmistral4\nMistral 4\nsoftmax → group → topk\nNo\nYes\n\n\nglm_moe_dsa\nGLM-MoE DSA (GLM 5)\nsigmoid → topk (groups)\nYes\nYes\n\n\ndeepseek_v3\nDeepSeek-V3\nsigmoid → topk (groups)\nYes\nYes\n\n\nglm4_moe\nGLM4-MoE\nsigmoid → topk (groups)\nYes\nYes\n\n\nglm4_moe_lite\nGLM4-MoE Lite (GLM 4.7 Flash)\nsigmoid → topk (groups)\nYes*\nYes\n\n\nglm4v_moe\nGLM4v-MoE\nsigmoid → topk (groups)\nYes\nYes\n\n\nminimax_m2\nMiniMax M2\nsigmoid → topk (no groups)\nYes\nYes\n\n\nernie4_5_moe\nERNIE 4.5 MoE\nsoftmax → bias → topk\nNo\nYes\n\n\ndeepseek_v2\nDeepSeek-V2\nsoftmax → group_limited_greedy\nNo\nYes\n\n\nhunyuan_v1_moe\nHunYuan V1 MoE\nsoftmax → topk (gate.wg)\nNo\nYes\n\n\ngemma4_text\nGemma 4 (26B-A4B)\nsoftmax → topk + per_expert_scale\nYes**\nYes**\n\n\ngpt_oss\nGPT-OSS\nfused topk → softmax\nNo\nPlanned\n\n\n\n* glm4_moe_lite with ScatterMoE may have issues — see Limitations.\n** Gemma 4 uses experts_implementation: scattermoe path (registered via ExpertsInterface) instead of SparseMoeBlock patching, since Gemma 4 embeds MoE directly in its decoder layer (no separate SparseMoeBlock). 
See the Gemma 4 section below.\n\n\nFeature comparison\n\n\n\n\n\n\n\n\nFeature\nScatterMoE\nSonicMoE\n\n\n\n\nKernel backend\nTriton\nCUTLASS\n\n\nGPU requirement\nAny CUDA\nHopper (H100/H200) or Blackwell (B200+)\n\n\nLoRA approach\nFused in Triton kernel\nRuntime materialization + custom autograd\n\n\nLoRA overhead\nLower (fused computation)\nHigher (per-forward materialization)\n\n\nGate/router LoRA\nYes\nYes\n\n\nExpert LoRA\nYes (fused)\nYes (materialized)\n\n\nShared expert LoRA\nYes (standard PEFT)\nYes (standard PEFT)\n\n\nSelective expert dequantization\nYes (~97% memory savings)\nNo\n\n\nWeight format\nTransposed [E, hidden, 2*inter]\nInterleaved gate/up [2*I, H, E]\n\n\ntorch.compile routing\nNo\nYes (optional)\n\n\n\n\n\nShared Expert Handling\nBoth kernels handle shared experts identically. Shared expert attribute names are detected in order of priority:\n\nshared_expert (Qwen2-MoE)\nshared_experts (GLM-MoE, DeepSeek-V3)\nshared_mlp (HunYuan V1 MoE)\n\nIf shared_expert_gate exists, sigmoid gating is applied to the shared expert contribution before adding it to the routed output. PEFT wraps shared expert linear layers with standard LoRA — no special handling is needed.\n\n\nGemma 4\nGemma 4 (e.g. google/gemma-4-26B-A4B) has a unique hybrid MoE architecture:\n\nNo SparseMoeBlock: MoE is embedded directly in the decoder layer alongside a dense MLP. Both run in parallel and their outputs are summed.\nCustom router (Gemma4TextRouter): RMSNorm → learned scale → linear projection → softmax → top-k → renormalization → per-expert learned scales.\nGEGLU activation: Uses gelu_pytorch_tanh (not SiLU/SwiGLU like most other MoE models).\n128 experts, top-k=8 for the 26B-A4B variant.\n\nBecause there is no SparseMoeBlock class to patch, Gemma 4 uses a different integration path: we register \"scattermoe\" as a custom implementation in the transformers ExpertsInterface, and set experts_implementation: scattermoe in the config. The @use_experts_implementation decorator on Gemma4TextExperts then dispatches to our ScatterMoE kernel automatically. The router is untouched — it runs as-is.\n\n\nLimitations\n\nScatterMoE + GLM4-MoE Lite: ScatterMoE does not work reliably for GLM 4.7 Flash (glm4_moe_lite).\nNon-SwiGLU activations: Neither kernel supports MoE architectures with non-SwiGLU expert activations (e.g., GPT-OSS uses a custom GLU variant).\nGPT-OSS: Deferred — requires transposed weight layout [E, H, 2*I], expert biases, and custom GLU activation. A dedicated forward path is needed.\nFSDP + fused gate LoRA (SonicMoE): The fused topk→softmax path materializes a local tensor when LoRA delta is present to avoid DTensor + Tensor mixing under FSDP.\n\n\n\nNote on MegaBlocks\nWe tested MegaBlocks but were unable to ensure numerical accuracy, so we did not integrate it. 
It was also incompatible with many newer model architectures in transformers.\nPlease see reference here", "crumbs": [ "Advanced Features", "Custom Integrations" @@ -6357,14 +6367,14 @@ "href": "docs/api/cli.main.html", "title": "cli.main", "section": "", - "text": "cli.main\nClick CLI definitions for various axolotl commands.\n\n\n\n\n\nName\nDescription\n\n\n\n\ncli\nAxolotl CLI - Train and fine-tune large language models\n\n\nevaluate\nEvaluate a model.\n\n\nfetch\nFetch example configs or other resources.\n\n\ninference\nRun inference with a trained model.\n\n\nmerge_lora\nMerge trained LoRA adapters into a base model.\n\n\nmerge_sharded_fsdp_weights\nMerge sharded FSDP model weights.\n\n\npreprocess\nPreprocess datasets before training.\n\n\ntrain\nTrain or fine-tune a model.\n\n\n\n\n\ncli.main.cli()\nAxolotl CLI - Train and fine-tune large language models\n\n\n\ncli.main.evaluate(ctx, config, launcher, **kwargs)\nEvaluate a model.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nctx\nclick.Context\nClick context for extra args.\nrequired\n\n\nconfig\nstr\nPath to axolotl config YAML file.\nrequired\n\n\nlauncher\nstr\nLauncher to use for multi-GPU evaluation (“accelerate”, “torchrun”, or “python”).\nrequired\n\n\nkwargs\n\nAdditional keyword arguments which correspond to CLI args or axolotl config options.\n{}\n\n\n\n\n\n\n\ncli.main.fetch(directory, dest)\nFetch example configs or other resources.\nAvailable directories:\n- examples: Example configuration files\n- deepspeed_configs: DeepSpeed configuration files\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndirectory\nstr\nOne of examples, deepspeed_configs.\nrequired\n\n\ndest\nOptional[str]\nOptional destination directory.\nrequired\n\n\n\n\n\n\n\ncli.main.inference(ctx, config, launcher, gradio, **kwargs)\nRun inference with a trained model.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nctx\nclick.Context\nClick context for extra args.\nrequired\n\n\nconfig\nstr\nPath to axolotl config YAML file.\nrequired\n\n\nlauncher\nstr\nLauncher to use for multi-GPU inference (“accelerate”, “torchrun”, or “python”).\nrequired\n\n\ngradio\nbool\nWhether to use Gradio browser interface or command line for inference.\nrequired\n\n\nkwargs\n\nAdditional keyword arguments which correspond to CLI args or axolotl config options.\n{}\n\n\n\n\n\n\n\ncli.main.merge_lora(config, **kwargs)\nMerge trained LoRA adapters into a base model.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nconfig\nstr\nPath to axolotl config YAML file.\nrequired\n\n\nkwargs\n\nAdditional keyword arguments which correspond to CLI args or axolotl config options.\n{}\n\n\n\n\n\n\n\ncli.main.merge_sharded_fsdp_weights(ctx, config, launcher, **kwargs)\nMerge sharded FSDP model weights.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nctx\nclick.Context\nClick context for extra args.\nrequired\n\n\nconfig\nstr\nPath to axolotl config YAML file.\nrequired\n\n\nlauncher\nstr\nLauncher to use for weight merging (“accelerate”, “torchrun”, or “python”).\nrequired\n\n\nkwargs\n\nAdditional keyword arguments which correspond to CLI args or axolotl config options.\n{}\n\n\n\n\n\n\n\ncli.main.preprocess(config, cloud=None, **kwargs)\nPreprocess datasets before training.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nconfig\nstr\nPath to axolotl config YAML file.\nrequired\n\n\ncloud\nOptional[str]\nPath to a cloud accelerator configuration file.\nNone\n\n\nkwargs\n\nAdditional keyword 
arguments which correspond to CLI args or axolotl config options.\n{}\n\n\n\n\n\n\n\ncli.main.train(\n ctx,\n config,\n launcher='accelerate',\n cloud=None,\n sweep=None,\n **kwargs,\n)\nTrain or fine-tune a model.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nctx\nclick.Context\nClick context for extra args.\nrequired\n\n\nconfig\nstr\nPath to axolotl config YAML file.\nrequired\n\n\nlauncher\nLiteral['accelerate', 'torchrun', 'python']\nLauncher to use for multi-GPU training (“accelerate”, “torchrun”, or “python”).\n'accelerate'\n\n\ncloud\nstr | None\nPath to a cloud accelerator configuration file\nNone\n\n\nsweep\nstr | None\nPath to YAML config for sweeping hyperparameters.\nNone\n\n\nkwargs\n\nAdditional keyword arguments which correspond to CLI args or axolotl config options.\n{}" + "text": "cli.main\nClick CLI definitions for various axolotl commands.\n\n\n\n\n\nName\nDescription\n\n\n\n\nagent_docs\nShow agent-optimized documentation.\n\n\ncli\nAxolotl CLI - Train and fine-tune large language models\n\n\nconfig_schema\nDump the full config JSON schema.\n\n\nevaluate\nEvaluate a model.\n\n\nfetch\nFetch example configs or other resources.\n\n\ninference\nRun inference with a trained model.\n\n\nmerge_lora\nMerge trained LoRA adapters into a base model.\n\n\nmerge_sharded_fsdp_weights\nMerge sharded FSDP model weights.\n\n\npreprocess\nPreprocess datasets before training.\n\n\ntrain\nTrain or fine-tune a model.\n\n\n\n\n\ncli.main.agent_docs(topic, list_topics)\nShow agent-optimized documentation.\nPrints reference docs designed for AI coding agents.\nThese docs are bundled with the package — no network access needed.\n\b\nExamples:\naxolotl agent-docs # overview (start here)\naxolotl agent-docs grpo # GRPO reference\naxolotl agent-docs sft # SFT reference\naxolotl agent-docs –list # list all topics\n\n\n\ncli.main.cli()\nAxolotl CLI - Train and fine-tune large language models\n\n\n\ncli.main.config_schema(output_format, field)\nDump the full config JSON schema.\nUseful for AI agents and tooling to discover all available config options,\ntheir types, defaults, and descriptions.\n\b\nExamples:\naxolotl config-schema # full JSON schema\naxolotl config-schema –format yaml # YAML format\naxolotl config-schema –field adapter # single field\n\n\n\ncli.main.evaluate(ctx, config, launcher, **kwargs)\nEvaluate a model.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nctx\nclick.Context\nClick context for extra args.\nrequired\n\n\nconfig\nstr\nPath to axolotl config YAML file.\nrequired\n\n\nlauncher\nstr\nLauncher to use for multi-GPU evaluation (“accelerate”, “torchrun”, or “python”).\nrequired\n\n\nkwargs\n\nAdditional keyword arguments which correspond to CLI args or axolotl config options.\n{}\n\n\n\n\n\n\n\ncli.main.fetch(directory, dest)\nFetch example configs or other resources.\nAvailable directories:\n- examples: Example configuration files\n- deepspeed_configs: DeepSpeed configuration files\n- docs: Full documentation (Quarto markdown files)\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndirectory\nstr\nOne of examples, deepspeed_configs, docs.\nrequired\n\n\ndest\nOptional[str]\nOptional destination directory.\nrequired\n\n\n\n\n\n\n\ncli.main.inference(ctx, config, launcher, gradio, **kwargs)\nRun inference with a trained model.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nctx\nclick.Context\nClick context for extra args.\nrequired\n\n\nconfig\nstr\nPath to axolotl config YAML 
file.\nrequired\n\n\nlauncher\nstr\nLauncher to use for multi-GPU inference (“accelerate”, “torchrun”, or “python”).\nrequired\n\n\ngradio\nbool\nWhether to use Gradio browser interface or command line for inference.\nrequired\n\n\nkwargs\n\nAdditional keyword arguments which correspond to CLI args or axolotl config options.\n{}\n\n\n\n\n\n\n\ncli.main.merge_lora(config, **kwargs)\nMerge trained LoRA adapters into a base model.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nconfig\nstr\nPath to axolotl config YAML file.\nrequired\n\n\nkwargs\n\nAdditional keyword arguments which correspond to CLI args or axolotl config options.\n{}\n\n\n\n\n\n\n\ncli.main.merge_sharded_fsdp_weights(ctx, config, launcher, **kwargs)\nMerge sharded FSDP model weights.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nctx\nclick.Context\nClick context for extra args.\nrequired\n\n\nconfig\nstr\nPath to axolotl config YAML file.\nrequired\n\n\nlauncher\nstr\nLauncher to use for weight merging (“accelerate”, “torchrun”, or “python”).\nrequired\n\n\nkwargs\n\nAdditional keyword arguments which correspond to CLI args or axolotl config options.\n{}\n\n\n\n\n\n\n\ncli.main.preprocess(config, cloud=None, **kwargs)\nPreprocess datasets before training.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nconfig\nstr\nPath to axolotl config YAML file.\nrequired\n\n\ncloud\nOptional[str]\nPath to a cloud accelerator configuration file.\nNone\n\n\nkwargs\n\nAdditional keyword arguments which correspond to CLI args or axolotl config options.\n{}\n\n\n\n\n\n\n\ncli.main.train(\n ctx,\n config,\n launcher='accelerate',\n cloud=None,\n sweep=None,\n **kwargs,\n)\nTrain or fine-tune a model.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nctx\nclick.Context\nClick context for extra args.\nrequired\n\n\nconfig\nstr\nPath to axolotl config YAML file.\nrequired\n\n\nlauncher\nLiteral['accelerate', 'torchrun', 'python']\nLauncher to use for multi-GPU training (“accelerate”, “torchrun”, or “python”).\n'accelerate'\n\n\ncloud\nstr | None\nPath to a cloud accelerator configuration file\nNone\n\n\nsweep\nstr | None\nPath to YAML config for sweeping hyperparameters.\nNone\n\n\nkwargs\n\nAdditional keyword arguments which correspond to CLI args or axolotl config options.\n{}" }, { "objectID": "docs/api/cli.main.html#functions", "href": "docs/api/cli.main.html#functions", "title": "cli.main", "section": "", - "text": "Name\nDescription\n\n\n\n\ncli\nAxolotl CLI - Train and fine-tune large language models\n\n\nevaluate\nEvaluate a model.\n\n\nfetch\nFetch example configs or other resources.\n\n\ninference\nRun inference with a trained model.\n\n\nmerge_lora\nMerge trained LoRA adapters into a base model.\n\n\nmerge_sharded_fsdp_weights\nMerge sharded FSDP model weights.\n\n\npreprocess\nPreprocess datasets before training.\n\n\ntrain\nTrain or fine-tune a model.\n\n\n\n\n\ncli.main.cli()\nAxolotl CLI - Train and fine-tune large language models\n\n\n\ncli.main.evaluate(ctx, config, launcher, **kwargs)\nEvaluate a model.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nctx\nclick.Context\nClick context for extra args.\nrequired\n\n\nconfig\nstr\nPath to axolotl config YAML file.\nrequired\n\n\nlauncher\nstr\nLauncher to use for multi-GPU evaluation (“accelerate”, “torchrun”, or “python”).\nrequired\n\n\nkwargs\n\nAdditional keyword arguments which correspond to CLI args or axolotl config options.\n{}\n\n\n\n\n\n\n\ncli.main.fetch(directory, dest)\nFetch example 
configs or other resources.\nAvailable directories:\n- examples: Example configuration files\n- deepspeed_configs: DeepSpeed configuration files\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndirectory\nstr\nOne of examples, deepspeed_configs.\nrequired\n\n\ndest\nOptional[str]\nOptional destination directory.\nrequired\n\n\n\n\n\n\n\ncli.main.inference(ctx, config, launcher, gradio, **kwargs)\nRun inference with a trained model.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nctx\nclick.Context\nClick context for extra args.\nrequired\n\n\nconfig\nstr\nPath to axolotl config YAML file.\nrequired\n\n\nlauncher\nstr\nLauncher to use for multi-GPU inference (“accelerate”, “torchrun”, or “python”).\nrequired\n\n\ngradio\nbool\nWhether to use Gradio browser interface or command line for inference.\nrequired\n\n\nkwargs\n\nAdditional keyword arguments which correspond to CLI args or axolotl config options.\n{}\n\n\n\n\n\n\n\ncli.main.merge_lora(config, **kwargs)\nMerge trained LoRA adapters into a base model.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nconfig\nstr\nPath to axolotl config YAML file.\nrequired\n\n\nkwargs\n\nAdditional keyword arguments which correspond to CLI args or axolotl config options.\n{}\n\n\n\n\n\n\n\ncli.main.merge_sharded_fsdp_weights(ctx, config, launcher, **kwargs)\nMerge sharded FSDP model weights.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nctx\nclick.Context\nClick context for extra args.\nrequired\n\n\nconfig\nstr\nPath to axolotl config YAML file.\nrequired\n\n\nlauncher\nstr\nLauncher to use for weight merging (“accelerate”, “torchrun”, or “python”).\nrequired\n\n\nkwargs\n\nAdditional keyword arguments which correspond to CLI args or axolotl config options.\n{}\n\n\n\n\n\n\n\ncli.main.preprocess(config, cloud=None, **kwargs)\nPreprocess datasets before training.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nconfig\nstr\nPath to axolotl config YAML file.\nrequired\n\n\ncloud\nOptional[str]\nPath to a cloud accelerator configuration file.\nNone\n\n\nkwargs\n\nAdditional keyword arguments which correspond to CLI args or axolotl config options.\n{}\n\n\n\n\n\n\n\ncli.main.train(\n ctx,\n config,\n launcher='accelerate',\n cloud=None,\n sweep=None,\n **kwargs,\n)\nTrain or fine-tune a model.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nctx\nclick.Context\nClick context for extra args.\nrequired\n\n\nconfig\nstr\nPath to axolotl config YAML file.\nrequired\n\n\nlauncher\nLiteral['accelerate', 'torchrun', 'python']\nLauncher to use for multi-GPU training (“accelerate”, “torchrun”, or “python”).\n'accelerate'\n\n\ncloud\nstr | None\nPath to a cloud accelerator configuration file\nNone\n\n\nsweep\nstr | None\nPath to YAML config for sweeping hyperparameters.\nNone\n\n\nkwargs\n\nAdditional keyword arguments which correspond to CLI args or axolotl config options.\n{}" + "text": "Name\nDescription\n\n\n\n\nagent_docs\nShow agent-optimized documentation.\n\n\ncli\nAxolotl CLI - Train and fine-tune large language models\n\n\nconfig_schema\nDump the full config JSON schema.\n\n\nevaluate\nEvaluate a model.\n\n\nfetch\nFetch example configs or other resources.\n\n\ninference\nRun inference with a trained model.\n\n\nmerge_lora\nMerge trained LoRA adapters into a base model.\n\n\nmerge_sharded_fsdp_weights\nMerge sharded FSDP model weights.\n\n\npreprocess\nPreprocess datasets before training.\n\n\ntrain\nTrain or fine-tune a 
model.\n\n\n\n\n\ncli.main.agent_docs(topic, list_topics)\nShow agent-optimized documentation.\nPrints reference docs designed for AI coding agents.\nThese docs are bundled with the package — no network access needed.\n\b\nExamples:\naxolotl agent-docs # overview (start here)\naxolotl agent-docs grpo # GRPO reference\naxolotl agent-docs sft # SFT reference\naxolotl agent-docs --list # list all topics\n\n\n\ncli.main.cli()\nAxolotl CLI - Train and fine-tune large language models\n\n\n\ncli.main.config_schema(output_format, field)\nDump the full config JSON schema.\nUseful for AI agents and tooling to discover all available config options,\ntheir types, defaults, and descriptions.\n\b\nExamples:\naxolotl config-schema # full JSON schema\naxolotl config-schema --format yaml # YAML format\naxolotl config-schema --field adapter # single field\n\n\n\ncli.main.evaluate(ctx, config, launcher, **kwargs)\nEvaluate a model.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nctx\nclick.Context\nClick context for extra args.\nrequired\n\n\nconfig\nstr\nPath to axolotl config YAML file.\nrequired\n\n\nlauncher\nstr\nLauncher to use for multi-GPU evaluation (“accelerate”, “torchrun”, or “python”).\nrequired\n\n\nkwargs\n\nAdditional keyword arguments which correspond to CLI args or axolotl config options.\n{}\n\n\n\n\n\n\n\ncli.main.fetch(directory, dest)\nFetch example configs or other resources.\nAvailable directories:\n- examples: Example configuration files\n- deepspeed_configs: DeepSpeed configuration files\n- docs: Full documentation (Quarto markdown files)\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndirectory\nstr\nOne of examples, deepspeed_configs, docs.\nrequired\n\n\ndest\nOptional[str]\nOptional destination directory.\nrequired\n\n\n\n\n\n\n\ncli.main.inference(ctx, config, launcher, gradio, **kwargs)\nRun inference with a trained model.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nctx\nclick.Context\nClick context for extra args.\nrequired\n\n\nconfig\nstr\nPath to axolotl config YAML file.\nrequired\n\n\nlauncher\nstr\nLauncher to use for multi-GPU inference (“accelerate”, “torchrun”, or “python”).\nrequired\n\n\ngradio\nbool\nWhether to use Gradio browser interface or command line for inference.\nrequired\n\n\nkwargs\n\nAdditional keyword arguments which correspond to CLI args or axolotl config options.\n{}\n\n\n\n\n\n\n\ncli.main.merge_lora(config, **kwargs)\nMerge trained LoRA adapters into a base model.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nconfig\nstr\nPath to axolotl config YAML file.\nrequired\n\n\nkwargs\n\nAdditional keyword arguments which correspond to CLI args or axolotl config options.\n{}\n\n\n\n\n\n\n\ncli.main.merge_sharded_fsdp_weights(ctx, config, launcher, **kwargs)\nMerge sharded FSDP model weights.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nctx\nclick.Context\nClick context for extra args.\nrequired\n\n\nconfig\nstr\nPath to axolotl config YAML file.\nrequired\n\n\nlauncher\nstr\nLauncher to use for weight merging (“accelerate”, “torchrun”, or “python”).\nrequired\n\n\nkwargs\n\nAdditional keyword arguments which correspond to CLI args or axolotl config options.\n{}\n\n\n\n\n\n\n\ncli.main.preprocess(config, cloud=None, **kwargs)\nPreprocess datasets before training.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nconfig\nstr\nPath to axolotl config YAML file.\nrequired\n\n\ncloud\nOptional[str]\nPath to a cloud accelerator configuration 
file.\nNone\n\n\nkwargs\n\nAdditional keyword arguments which correspond to CLI args or axolotl config options.\n{}\n\n\n\n\n\n\n\ncli.main.train(\n ctx,\n config,\n launcher='accelerate',\n cloud=None,\n sweep=None,\n **kwargs,\n)\nTrain or fine-tune a model.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nctx\nclick.Context\nClick context for extra args.\nrequired\n\n\nconfig\nstr\nPath to axolotl config YAML file.\nrequired\n\n\nlauncher\nLiteral['accelerate', 'torchrun', 'python']\nLauncher to use for multi-GPU training (“accelerate”, “torchrun”, or “python”).\n'accelerate'\n\n\ncloud\nstr | None\nPath to a cloud accelerator configuration file\nNone\n\n\nsweep\nstr | None\nPath to YAML config for sweeping hyperparameters.\nNone\n\n\nkwargs\n\nAdditional keyword arguments which correspond to CLI args or axolotl config options.\n{}"
  },
  {
    "objectID": "docs/api/monkeypatch.trainer_fsdp_optim.html",
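The config_schema docstring above notes the output is meant for programmatic use. A minimal sketch of that workflow, assuming axolotl is on PATH, jq is installed, and the command emits a standard JSON Schema object with a top-level "properties" map (typical of Pydantic-generated schemas):

# Enumerate top-level config option names from the dumped schema
axolotl config-schema | jq -r '.properties | keys[]' | sort | head -n 20

Any name it prints can then be passed back via the documented --field option to inspect that option's type, default, and description.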
diff --git a/sitemap.xml b/sitemap.xml
index 4adca6915..d9a8c7c20 100644
--- a/sitemap.xml
+++ b/sitemap.xml
@@ -2,982 +2,982 @@
[sitemap.xml hunk elided: the <lastmod> timestamp of every page is bumped from the 2026-04-04 build to the 2026-04-06 build; the URL set itself is unchanged]
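Read together, the docstrings indexed above trace the standard fine-tuning loop. A hedged end-to-end sketch follows; the option spellings --launcher and --gradio are inferred from the parameter names documented above, so verify them against axolotl --help for your installed version:

# preprocess datasets, train across GPUs, merge adapters, then smoke-test
axolotl preprocess config.yml
axolotl train config.yml --launcher torchrun
axolotl merge-lora config.yml
axolotl inference config.yml --gradio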