Model-specific quirks, required settings, and known issues. Check this before debugging training failures on specific model families.
All VLM configs require these four lines:

```yaml
processor_type: AutoProcessor
skip_prepare_dataset: true
remove_unused_columns: false
sample_packing: false
```

Decision tree for VLM config:
```
Is the model multimodal (has vision/audio encoder)?
 ├─ YES: Add freeze_mm_modules: true if training text only
 │       Add chat_template: <model_template> (e.g. gemma4, qwen3_5, gemma3)
 │       LoRA: use regex lora_target_modules to restrict to the language model
 └─ NO:  Train as a regular text model

Is the model MoE (e.g. Gemma4 26B-A4B, Qwen3.5 35B-A3B)?
 ├─ YES: Add lora_target_parameters for expert LoRA
 │       Consider ScatterMoE kernels (see Plugins section)
 └─ NO:  Standard LoRA config
```
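Putting both YES branches together, a minimal sketch for a multimodal MoE model (the 26B-A4B values below are illustrative; substitute your model's template and adapter settings):

```yaml
# Sketch: multimodal + MoE branches of the decision tree (values are illustrative)
base_model: google/gemma-4-26B-A4B

# Required for all VLMs (the four lines above)
processor_type: AutoProcessor
skip_prepare_dataset: true
remove_unused_columns: false
sample_packing: false

# Multimodal branch: freeze encoders, set the model's chat template
freeze_mm_modules: true
chat_template: gemma4

# MoE branch: expert LoRA via 3D parameter tensors
adapter: lora
lora_target_parameters:
  - experts.gate_up_proj
  - experts.down_proj
```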
Cut Cross-Entropy

Computes loss from hidden states + lm_head weight without materializing the full logits tensor, saving significant VRAM. Install if not already present:

```bash
uv pip install "cut-cross-entropy[transformers] @ git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@main"
```

```yaml
plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
```

ScatterMoE

Fuses expert + LoRA computation into a single kernel for MoE models. Significant speedup for models with many experts.
```yaml
plugins:
  - axolotl.integrations.kernels.KernelsPlugin
use_kernels: true
use_scattermoe: true
experts_implementation: scattermoe

# Expert LoRA targets (3D parameter tensors, not nn.Linear):
lora_target_parameters:
  - experts.gate_up_proj
  - experts.down_proj
```

Supported: Gemma4 (gemma4_text), Mixtral, Qwen MoE variants. The plugin auto-detects model type and routing function. Without ScatterMoE, expert LoRA still works but runs the base expert matmul and LoRA as separate operations (see the fallback sketch below).
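A minimal sketch of that fallback, assuming the same expert parameter names as above: keep the expert targets and simply drop the kernels plugin and ScatterMoE flags.

```yaml
# Fallback sketch: expert LoRA without ScatterMoE (slower, no fused kernel).
# Assumes the same 3D expert parameter names as in the example above.
adapter: lora
lora_target_parameters:
  - experts.gate_up_proj
  - experts.down_proj
# No KernelsPlugin / use_scattermoe: base expert matmul and LoRA run separately.
```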
Gemma4

Models: google/gemma-4-26B-A4B (MoE), google/gemma-4-31B (dense), google/gemma-4-E2B, google/gemma-4-E4B
Architecture: Multimodal wrapper (Gemma4ForConditionalGeneration) over a text backbone (Gemma4TextModel), with optional vision/audio encoders. All Gemma4 HF repos have model_type: "gemma4" — even text-only variants load as multimodal with a vision tower.
```yaml
# Always needed for Gemma4:
freeze_mm_modules: true  # Freeze vision/audio encoders for text-only training
gradient_checkpointing_kwargs:
  use_reentrant: false  # Shared per-layer norms cause "marked ready twice" with reentrant

# LoRA target — restrict to language model only (DO NOT use lora_target_linear: true):
lora_target_modules: 'model.language_model.layers.[\d]+.(_checkpoint_wrapped_module.)?(mlp|self_attn).(up|down|gate|q|k|v|o)_proj'
```

FSDP2 config:
```yaml
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_version: 2
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Gemma4TextDecoderLayer
```

MoE internals (26B-A4B):

- enable_moe_block: true, 256 experts, top-k routing
- No separate SparseMoeBlock — MoE is embedded in each decoder layer
- Expert LoRA targets 3D parameter tensors:
```yaml
lora_target_parameters:
  - experts.gate_up_proj
  - experts.down_proj
```

ScatterMoE kernel acceleration:
```yaml
plugins:
  - axolotl.integrations.kernels.KernelsPlugin
use_kernels: true
use_scattermoe: true
experts_implementation: scattermoe
```

All Gemma4 models load as Gemma4ForConditionalGeneration with a vision tower. No custom ProcessingStrategy needed — the base class auto-detects the image token.
```yaml
base_model: google/gemma-4-E2B-it  # or E4B-it, 26B-A4B
processor_type: AutoProcessor
freeze_mm_modules: true
chat_template: gemma4

skip_prepare_dataset: true
remove_unused_columns: false
sample_packing: false
```

A starting VLM loss of ~8-15 is typical. In most runs, loss converges below 1.0 within ~30-50 steps, though results may vary across configurations.
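To verify that convergence pattern cheaply before committing to a full run, a short smoke test works well. A sketch using standard axolotl options (the step and batch values are illustrative):

```yaml
# Smoke test (illustrative values): ~50 steps is enough to see
# loss drop from the typical ~8-15 start toward <1.0
max_steps: 50
micro_batch_size: 1
gradient_accumulation_steps: 4
logging_steps: 1  # log every step to watch the loss curve
```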
For the 26B-A4B MoE variant with ScatterMoE + expert LoRA + CCE, add:

```yaml
plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
  - axolotl.integrations.kernels.KernelsPlugin
use_kernels: true
use_scattermoe: true
experts_implementation: scattermoe
lora_target_parameters:
  - experts.gate_up_proj
  - experts.down_proj
```

To profile training and identify optimization opportunities:
```yaml
# Profile steps 3-7 (after warmup/autotuning settles)
profiler_steps_start: 3
profiler_steps: 5
```

This produces profiler_trace.json (a Chrome trace) and snapshot.pickle (a memory snapshot) in output_dir. View the Chrome trace at chrome://tracing.
To programmatically inspect the trace:

```bash
python scripts/analyze_profile.py output_dir/
```

The trace shows per-kernel CUDA times, memory allocations, and an operator-level breakdown. Look for:

- Large matmul kernels: candidates for fusion or quantization
- Memory copies (H2D/D2H): unnecessary data movement
- Small frequent kernels: candidates for kernel fusion
- Gaps between kernels: pipeline bubbles from CPU overhead
Full troubleshooting: training_stability.qmd, debugging.qmd
All Gemma 4 variants (E2B, E4B, 26B-A4B, 31B) load as multimodal models even for text-only training.

```yaml
base_model: google/gemma-4-E2B-it  # or E4B-it, 26B-A4B, 31B

chat_template: gemma4
freeze_mm_modules: true  # freeze vision/audio encoders for text-only or vision LoRA

# For the 26B-A4B MoE model, enable ScatterMoE and expert LoRA:
plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
  - axolotl.integrations.kernels.KernelsPlugin
use_kernels: true
use_scattermoe: true
experts_implementation: scattermoe

lora_target_modules: 'model.language_model.layers.[\d]+.(_checkpoint_wrapped_module.)?(mlp|self_attn).(up|down|gate|q|k|v|o)_proj'

# MoE expert LoRA (3D tensors, not nn.Linear) — only for 26B-A4B:
lora_target_parameters:
  - experts.gate_up_proj
  - experts.down_proj
```

Gemma 4 VLM training starts with high loss (~8-15). This is expected — see the training stability guide for details.
For DDP training, axolotl auto-detects Gemma4 and sets use_reentrant=False and ddp_find_unused_parameters=True. However, when activation_offloading: true, ddp_find_unused_parameters is skipped (checkpoint wrappers conflict with it); use freeze_mm_modules: true instead to handle unused vision/audio params. For FSDP2, use fsdp_transformer_layer_cls_to_wrap: Gemma4TextDecoderLayer.
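For the activation-offloading case specifically, a minimal sketch of the relevant settings (every option named here appears elsewhere in this guide):

```yaml
# DDP + activation offloading on Gemma4 (sketch):
# ddp_find_unused_parameters is skipped in this combination, so freeze the
# unused vision/audio modules instead of relying on unused-parameter detection.
activation_offloading: true
freeze_mm_modules: true

# FSDP2 alternative: wrap per decoder layer rather than using DDP
# fsdp_config:
#   fsdp_version: 2
#   fsdp_transformer_layer_cls_to_wrap: Gemma4TextDecoderLayer
```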
Gemma 3

For multi-modal 4B/12B/27B models, use the following config:

```yaml
base_model: google/gemma-3-4b-it

chat_template: gemma3
```

Gemma 3n

Please make sure to install timm via pip3 install timm==1.0.17
```yaml
base_model: google/gemma-3n-E2B-it

chat_template: gemma3n
```

Qwen2-VL

```yaml
base_model: Qwen/Qwen2-VL-7B-Instruct

chat_template: qwen2_vl
```

Qwen2.5-VL

```yaml
base_model: Qwen/Qwen2.5-VL-7B-Instruct

chat_template: qwen2_vl  # same as qwen2-vl
```

Qwen3-VL

```yaml
base_model: Qwen/Qwen3-VL-4B-Instruct

chat_template: qwen2_vl  # same as qwen2-vl
```
Qwen3.5
```yaml
base_model: Qwen/Qwen3.5-9B

chat_template: qwen3_5
```
GLM-4.6V
Both GLM-4.6V (106B MoE) and GLM-4.6V-Flash (9B) are supported.
```yaml
# GLM-4.6V (106B MoE version)
base_model: zai-org/GLM-4.6V

# OR GLM-4.6V-Flash (9B version)
base_model: zai-org/GLM-4.6V-Flash
```
SmolVLM2
Please make sure to install num2words via pip3 install num2words==0.5.14
```yaml
base_model: HuggingFaceTB/SmolVLM2-500M-Video-Instruct
```

LFM2-VL

Please uninstall causal-conv1d via pip3 uninstall -y causal-conv1d

```yaml
base_model: LiquidAI/LFM2-VL-450M
```

InternVL3.5

Please make sure to install timm via pip3 install timm==1.0.19

```yaml
base_model: OpenGVLab/InternVL3_5-8B
```

Here is an example of a multi-modal dataset:
```json
[
  {
    "messages": [
      {
        "role": "system",
        "content": [
          {"type": "text", "text": "You are a helpful assistant."}
        ]
      },
      {
        "role": "user",
        "content": [
          {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
          {"type": "text", "text": "Describe this image in detail."}
        ]
      },
      {
        "role": "assistant",
        "content": [
          {"type": "text", "text": "The image is a bee."}
        ]
      }
    ]
  }
]
```
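To point a config at a dataset in this format, a minimal sketch using standard axolotl dataset options (the local path is illustrative):

```yaml
# Sketch: wiring the multimodal dataset above into a config.
# ./data/multimodal.jsonl is an illustrative path, not a real file.
datasets:
  - path: ./data/multimodal.jsonl
    type: chat_template
    field_messages: messages
```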
Requirements:
- NVIDIA GPU (Ampere or newer for bf16 and Flash Attention) or AMD GPU

```bash
# install uv if you don't already have it installed
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

# CUDA 12.8.1 tends to have better package compatibility
export UV_TORCH_BACKEND=cu128

# create a new virtual environment
uv venv --python 3.12
source .venv/bin/activate

uv pip install torch==2.10.0 torchvision
uv pip install --no-build-isolation axolotl[deepspeed]

# recommended - install cut-cross-entropy
uv pip install "cut-cross-entropy[transformers] @ git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@main"

# (optional) - prefetch flash-attn2 and causal-conv1d kernels
uv run --python 3.12 python -c "from kernels import get_kernel; get_kernel('kernels-community/flash-attn2'); get_kernel('kernels-community/causal-conv1d')"

# Download example axolotl configs, deepspeed configs
axolotl fetch examples
axolotl fetch deepspeed_configs # OPTIONAL
```

Alternatively, install with pip:
```bash
pip3 install -U packaging==26.0 setuptools==75.8.0 wheel ninja
pip3 install --no-build-isolation axolotl[flash-attn,deepspeed]

# Download example axolotl configs, deepspeed configs
axolotl fetch examples
axolotl fetch deepspeed_configs # OPTIONAL
```

Installing with Docker can be less error prone than installing in your own environment.
```bash
docker run --gpus '"all"' --rm -it axolotlai/axolotl:main-latest
```

Other installation approaches are described here.
```bash
# Fetch axolotl examples
axolotl fetch examples

# Or, specify a custom path
axolotl fetch examples --dest path/to/folder

# Train a model using LoRA
axolotl train examples/llama-3/lora-1b.yml
```

That’s it! Check out our Getting Started Guide for a more detailed walkthrough.
Axolotl ships with built-in documentation optimized for AI coding agents (Claude Code, Cursor, Copilot, etc.). These docs are bundled with the pip package — no repo clone needed.
```bash
# Show overview and available training methods
axolotl agent-docs

# Topic-specific references
axolotl agent-docs sft                # supervised fine-tuning
axolotl agent-docs grpo               # GRPO online RL
axolotl agent-docs preference_tuning  # DPO, KTO, ORPO, SimPO
axolotl agent-docs reward_modelling   # outcome and process reward models
axolotl agent-docs pretraining        # continual pretraining
axolotl agent-docs --list             # list all topics

# Dump config schema for programmatic use
axolotl config-schema
axolotl config-schema --field adapter
```

If you’re working with the source repo, agent docs are also available at docs/agents/ and the project overview is in AGENTS.md.
If you use Axolotl in your research or projects, please cite it as follows:
```bibtex
@software{axolotl,
  title = {Axolotl: Open Source LLM Post-Training},
  author = {{Axolotl maintainers and contributors}},
  url = {https://github.com/axolotl-ai-cloud/axolotl},
  license = {Apache-2.0},
  year = {2023}
}
```