Model Architectures — Agent Reference


Model-specific quirks, required settings, and known issues. Check this before debugging training failures on specific model families.


Gemma 4


Models: google/gemma-4-26B-A4B (MoE), google/gemma-4-31B (dense), google/gemma-4-E2B, google/gemma-4-E4B


Architecture: Multimodal wrapper (Gemma4ForConditionalGeneration) over a text backbone (Gemma4TextModel), with optional vision/audio encoders. All Gemma4 HF repos have model_type: "gemma4" — even text-only variants load as multimodal with a vision tower.


Required settings

# Always needed for Gemma4:
freeze_mm_modules: true          # Freeze vision/audio encoders for text-only training
gradient_checkpointing_kwargs:
  use_reentrant: false           # Shared per-layer norms cause "marked ready twice" with reentrant

# LoRA target — restrict to language model only (DO NOT use lora_target_linear: true):
lora_target_modules: 'model.language_model.layers.[\d]+.(_checkpoint_wrapped_module.)?(mlp|self_attn).(up|down|gate|q|k|v|o)_proj'
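The target regex can be sanity-checked offline with Python's `re` module before launching a run. A minimal sketch — the module names below are illustrative stand-ins, not read from a real checkpoint:

```python
import re

# Pattern from the config above (single-quoted in YAML, so backslashes survive as-is)
pattern = re.compile(
    r'model.language_model.layers.[\d]+.(_checkpoint_wrapped_module.)?(mlp|self_attn).(up|down|gate|q|k|v|o)_proj'
)

# Hypothetical module names, for illustration only
should_match = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.17._checkpoint_wrapped_module.mlp.down_proj",
]
should_not_match = [
    "model.vision_tower.encoder.layers.0.self_attn.q_proj",  # vision tower stays frozen
    "model.audio_tower.layers.3.mlp.up_proj",
]

assert all(pattern.fullmatch(name) for name in should_match)
assert not any(pattern.fullmatch(name) for name in should_not_match)
print("regex targets language-model projections only")
```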

Auto-detection


Axolotl auto-detects Gemma4 and applies:

- use_reentrant: false for gradient checkpointing
- ddp_find_unused_parameters: true for DDP (skipped when activation_offloading: true)


Multi-GPU

| Strategy | Works? | Notes |
| --- | --- | --- |
| DDP | Yes | Auto-sets ddp_find_unused_parameters=True |
| DDP + activation_offloading | Yes | find_unused_parameters is skipped (conflicts with checkpoint wrappers) |
| FSDP1 | No | OOM during dequantization/sharding with QLoRA |
| FSDP2 | Yes | Use Gemma4TextDecoderLayer (not Gemma4DecoderLayer) as wrap class |
| FSDP2 + activation_offloading | Yes | Lowest VRAM (~26 GiB/GPU for 26B-A4B) |

FSDP2 config:

fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_version: 2
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Gemma4TextDecoderLayer

MoE (26B-A4B)

- enable_moe_block: true, 256 experts, top-k routing
- No separate SparseMoeBlock — MoE is embedded in each decoder layer
- Expert LoRA targets 3D parameter tensors:

    lora_target_parameters:
      - experts.gate_up_proj
      - experts.down_proj

- ScatterMoE kernel acceleration:

    plugins:
      - axolotl.integrations.kernels.KernelsPlugin
    use_kernels: true
    use_scattermoe: true
    experts_implementation: scattermoe

Common issues

| Symptom | Cause | Fix |
| --- | --- | --- |
| mm_token_type_ids is required in DDP | model.config not accessible through DDP wrapper | Already fixed — unwrap_model() in compute_loss and prediction_step |
| marked a variable ready twice in DDP | ddp_find_unused_parameters=True + activation_offloading checkpoint wrappers | Auto-handled — find_unused_parameters is skipped when activation_offloading: true |
| Loss ~12 instead of ~0.5 | Using lora_target_linear: true (applies LoRA to vision/audio modules) | Use the regex lora_target_modules pattern instead |
| FSDP2 Could not find Gemma4AudioLayer | Auto-wrap detects _no_split_modules including audio layers that don’t exist | Explicitly set fsdp_transformer_layer_cls_to_wrap: Gemma4TextDecoderLayer |
| Gemma4ClippableLinear not supported by PEFT | Vision tower uses a non-standard linear wrapper | Axolotl patches this automatically via _patch_peft_clippable_linear() |

E2B/E4B dense models


These have hidden_size_per_layer_input: 256 (per-layer input embeddings) and attention_k_eq_v: False. Known issue: loss starts higher than expected (~12 vs ~0.5 for 26B). Root cause under investigation — may be related to the per-layer input mechanism or the Gemma4ForConditionalGeneration loss computation.


Gemma 3


Models: google/gemma-3-*

- ddp_find_unused_parameters: true needed (multimodal unused params)
- use_reentrant: false recommended
- Attention mask must be dropped for sample packing (handled automatically)
- Multi-GPU test currently skipped (tests/e2e/multigpu/test_gemma3.py)

Qwen 3.5 MoE


Models: Qwen/Qwen3.5-35B-A3B

- Hybrid architecture: DeltaNet linear attention (30 layers) + full attention (10 layers)
- 256 experts, 8 active per token
- Known weight scale drift in late DeltaNet layers (36-38) due to AdamW + rare expert interaction
- Fix: normalize_weight_scales config to detect and rescale outliers:

    normalize_weight_scales:
      - name_pattern: 'linear_attn\.conv1d\.weight'
        threshold: 1.3

General MoE Notes

- lora_target_linear: true with multimodal MoE models will apply LoRA to ALL linear modules including vision/audio encoders — use regex lora_target_modules to restrict to language model only
- Rare experts get larger effective learning rate from AdamW (small second-moment estimates) — can cause weight drift in recurrent/SSM components. Use normalize_weight_scales with dry_run: true to detect.
- For ScatterMoE kernel support, set experts_implementation: scattermoe and add the KernelsPlugin
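The effective-learning-rate effect falls straight out of the Adam-style update, step ≈ lr · m / (√v + ε): a parameter with a tiny second-moment estimate v takes a near-full step whenever a gradient finally arrives. A minimal sketch with illustrative numbers (bias correction and weight decay omitted for brevity):

```python
import math

def adamw_step_size(grad, m, v, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """Magnitude of one Adam-style parameter step after updating the moments."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    return lr * m / (math.sqrt(v) + eps)

# A frequently-routed expert whose second moment already tracks its gradients:
hot = adamw_step_size(grad=1e-3, m=1e-3, v=1e-6)
# A rarely-routed expert with near-zero moments suddenly receives the same gradient:
rare = adamw_step_size(grad=1e-3, m=0.0, v=0.0)

print(f"hot expert step:  {hot:.2e}")   # ~1.0e-04 (step size close to lr)
print(f"rare expert step: {rare:.2e}")  # ~3.2e-04, larger from the same gradient
```

The same gradient moves the rare expert roughly 3x further, which is the drift that normalize_weight_scales is meant to catch.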

New Model Support — Agent Reference


Guide for debugging and adding support for new model architectures in axolotl. Based on lessons learned from Gemma4, Gemma3, Qwen2-VL, and other multimodal/MoE models.


Quick Validation Checklist


When testing a new model, run through these checks in order:

1. Does the model load? axolotl preprocess config.yaml — catches config schema errors
2. Does LoRA apply? Check for “Unsupported layer type” warnings from PEFT
3. Is the initial loss sane? First-step loss for a pretrained model should be 0.5–2.0 for SFT
4. Does sample packing work? Compare loss with sample_packing: true vs false — should be similar
5. Is CCE active? Check for “Applying Cut Cross Entropy” log and verify peak VRAM is lower

Loss Debugging


Expected initial loss


A pretrained model doing SFT should start with loss roughly in the 0.5–2.0 range. If loss starts above 3.0, something is wrong. If it’s near log(vocab_size) (≈ 12 for 262K vocab), the model is predicting at random — attention masking or model weights are broken.
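The "predicting at random" threshold is just cross entropy against a uniform distribution, loss = ln(vocab_size). A quick check for the ~262K vocab mentioned above:

```python
import math

vocab_size = 262_144  # the ~262K vocab referenced above (2**18)
random_loss = math.log(vocab_size)
print(f"uniform-prediction loss: {random_loss:.2f}")  # 12.48
```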


Direct comparison technique


The fastest way to isolate a loss issue — bypass the trainer entirely:

# Load model via axolotl's pipeline (applies all patches)
from axolotl.cli.config import load_cfg
from axolotl.utils.config import normalize_config, prepare_plugins
from axolotl.loaders.tokenizer import load_tokenizer
from axolotl.loaders.model import ModelLoader

cfg = load_cfg("your_config.yaml")
normalize_config(cfg)
prepare_plugins(cfg)
tokenizer = load_tokenizer(cfg)
model, _ = ModelLoader(cfg, tokenizer).load()

# Forward pass on preprocessed data
model.train()
out = model(input_ids, labels=labels)
print(f"Direct loss: {out.loss.item()}")  # Compare to trainer's reported loss

If direct loss is correct (~1.0) but trainer reports 3–4x higher, check model_accepts_loss_kwargs (see below).


model_accepts_loss_kwargs inflation


HF Trainer checks if the model’s forward() has **kwargs and sets model_accepts_loss_kwargs=True. This changes loss normalization: the trainer does NOT divide loss by gradient_accumulation_steps before logging. The gradient is correct — only the logged loss is inflated.


Symptom: Logged loss ≈ actual_loss × gradient_accumulation_steps.


Which models are affected: Any model with **kwargs in forward (common in multimodal models for extra inputs like mm_token_type_ids, pixel_values, etc.).


Fix location: src/axolotl/core/trainers/base.py __init__() — after super().__init__(), check if the unwrapped model actually has num_items_in_batch in its forward signature. If not, set self.model_accepts_loss_kwargs = False.
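A self-contained sketch of that check, assuming the names described above (the real trainer code may differ; FakeMultimodalModel and fix_loss_kwargs_flag are illustrative stand-ins, and the unwrap here is a simplified substitute for accelerate's unwrap_model):

```python
import inspect
from types import SimpleNamespace

def fix_loss_kwargs_flag(trainer):
    """Turn off model_accepts_loss_kwargs unless forward() explicitly declares
    num_items_in_batch -- a bare **kwargs is not enough for correct loss logging."""
    # Simplified unwrap of DDP-style wrappers (illustrative)
    unwrapped = getattr(trainer.model, "module", trainer.model)
    params = inspect.signature(unwrapped.forward).parameters
    if "num_items_in_batch" not in params:
        trainer.model_accepts_loss_kwargs = False

# Stand-in for a multimodal model whose forward has a catch-all **kwargs:
class FakeMultimodalModel:
    def forward(self, input_ids, labels=None, **kwargs):
        pass

trainer = SimpleNamespace(model=FakeMultimodalModel(), model_accepts_loss_kwargs=True)
fix_loss_kwargs_flag(trainer)
print(trainer.model_accepts_loss_kwargs)  # False
```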


Multimodal Models (ForConditionalGeneration)


Many recent models use ForConditionalGeneration as the top-level class, not ForCausalLM:

- Gemma3 → Gemma3ForConditionalGeneration
- Gemma4 → Gemma4ForConditionalGeneration
- Qwen2-VL → Qwen2VLForConditionalGeneration
- LLaVA → LlavaForConditionalGeneration


Why this matters

| Component | Targets ForCausalLM | Needs ForConditionalGeneration |
| --- | --- | --- |
| CCE patches | ✅ (default) | ❌ silently inactive if not patched |
| PEFT LoRA | ✅ | May fail on custom layer types |
| HF Trainer label handling | ✅ | May need extra inputs |

Required extra inputs


Multimodal models require special inputs during training even for text-only data:

| Model | Required Input | Value for Text-Only |
| --- | --- | --- |
| Gemma4 | mm_token_type_ids | torch.zeros_like(input_ids) |
| Gemma3 | token_type_ids | torch.zeros_like(input_ids) |

Auto-inject in compute_loss() when not provided by the data collator. See core/trainers/base.py.
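The shape of that injection can be sketched without loading a model. This is an illustrative pure-Python version, with torch.zeros_like stubbed by a list helper so the sketch stays self-contained; the mapping follows the table above:

```python
# Stand-in for torch.zeros_like, to keep this sketch dependency-free
def zeros_like(seq):
    return [0] * len(seq)

# Which extra input each model family needs for text-only batches (from the table above)
EXTRA_TEXT_ONLY_INPUTS = {
    "gemma4": "mm_token_type_ids",
    "gemma3": "token_type_ids",
}

def inject_extra_inputs(model_type, inputs):
    """Mimic the compute_loss() auto-injection: add the all-zeros
    token-type input when the data collator didn't provide it."""
    key = EXTRA_TEXT_ONLY_INPUTS.get(model_type)
    if key and key not in inputs:
        inputs[key] = zeros_like(inputs["input_ids"])
    return inputs

batch = {"input_ids": [5, 6, 7, 8]}
inject_extra_inputs("gemma4", batch)
print(batch["mm_token_type_ids"])  # [0, 0, 0, 0]
```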


Custom layer types and PEFT


Vision towers often use custom module wrappers that PEFT doesn’t support:

| Model | Custom Layer | Wraps | Fix |
| --- | --- | --- | --- |
| Gemma4 | Gemma4ClippableLinear | nn.Linear | Redirect to .linear child |

Fix location: src/axolotl/loaders/adapter.py _patch_peft_clippable_linear().


Sample Packing


How packed sequence detection works (transformers ≥ 5.x)


transformers.masking_utils._preprocess_mask_arguments() detects packed sequences from position_ids resets. But only when attention_mask is None:

# From masking_utils.py:
if position_ids is not None and attention_mask is None and past_key_values is None:
    packed_sequence_mask = find_packed_sequence_indices(position_ids)

If the collator provides an all-ones attention_mask, packing detection is skipped and the model builds a single causal mask spanning all packed sequences → cross-sequence attention leakage → very high loss.
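The detection itself is simple: packed rows concatenate sequences, so position_ids restart at each boundary. A simplified pure-Python version of the idea behind find_packed_sequence_indices (the real function operates on tensors; this sketch assumes each packed sequence's positions restart at 0):

```python
def packed_sequence_indices(position_ids):
    """Assign each token a sequence index by detecting position resets."""
    indices, seq = [], 0
    for i, pos in enumerate(position_ids):
        if i > 0 and pos == 0:  # position reset => a new packed sequence begins here
            seq += 1
        indices.append(seq)
    return indices

# Sequences of lengths 3, 2, and 4 packed into a single row:
pos = [0, 1, 2, 0, 1, 0, 1, 2, 3]
print(packed_sequence_indices(pos))  # [0, 0, 0, 1, 1, 2, 2, 2, 2]
```

With an all-ones attention_mask this path never runs, which is exactly why the mask must be dropped.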


Fix for models using create_causal_mask_mapping


For Gemma3, Gemma4, and similar models that use the new transformers masking system, remove attention_mask from inputs when sample packing is active:

# In compute_loss():
if (
    self.args.sample_packing
    and model_type in ("gemma4", "gemma3")
    and "attention_mask" in inputs
    and "position_ids" in inputs
):
    del inputs["attention_mask"]

Fix location: src/axolotl/core/trainers/base.py compute_loss().


Models that DON’T need this fix


Older models that use _prepare_4d_causal_attention_mask (Llama, Mistral, Qwen2, etc.) handle sample packing via axolotl’s multipack attention monkeypatch instead. Only models using the new create_causal_mask_mapping / create_causal_mask masking system need the attention_mask removal.


Attention Backend Selection

| Backend | Config | head_dim limit | torch_compile | Notes |
| --- | --- | --- | --- | --- |
| FA2 | flash_attention: true | 256 | ✅ | Fastest when supported |
| FA4 | auto with flash_attention: true | 256 | ✅ (SM90+) | Auto-detected on H100+ |
| SDPA | sdp_attention: true | None | ✅ | Universal fallback |
| flex | flex_attention: true | None | ⚠️ Triton OOM for large head_dim | Good for variable head dims |
| eager | neither set | None | ✅ | Slowest, always works |

Check model support: Look at _supports_flash_attn_2, _supports_flex_attn, _supports_sdpa attributes on the model class.
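That check can be wrapped in a small helper. A sketch using the attribute names listed above; FakeModel is an illustrative stand-in for a real model class such as Gemma4ForConditionalGeneration:

```python
def supported_attn_backends(model_cls):
    """List the backends a model class advertises via its _supports_* flags."""
    flags = {
        "flash_attention_2": "_supports_flash_attn_2",
        "flex_attention": "_supports_flex_attn",
        "sdpa": "_supports_sdpa",
    }
    return [name for name, attr in flags.items() if getattr(model_cls, attr, False)]

# Illustrative stand-in; real usage passes the actual transformers model class
class FakeModel:
    _supports_flash_attn_2 = False  # e.g. ruled out by head_dim > 256
    _supports_flex_attn = True
    _supports_sdpa = True

print(supported_attn_backends(FakeModel))  # ['flex_attention', 'sdpa']
```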


head_dim gotcha: The 256 limit is specific to flash-attn CUDA kernels, NOT PyTorch-level. SDPA and flex_attention both handle arbitrary head_dim. Models with global_head_dim > 256 (Gemma4: 512) must use SDPA or flex.


flex + compile gotcha: torch_compile with flex_attention can hit Triton shared memory OOM for large head_dim. Falls back to eager per-function (not a crash, but slower). Unsloth disables flex for Gemma4 for this reason.


Cut Cross Entropy (CCE)


How CCE patches work


CCE replaces the model’s forward() with a fused version that computes loss from hidden states + lm_head weight without materializing the full logits tensor. This saves ~batch × seq_len × vocab_size × dtype_bytes of VRAM.
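Plugging illustrative numbers into that formula shows why the saving matters. Assuming a micro-batch of 4, sequence length 4096, the ~262K vocab, and bf16 (2 bytes):

```python
# Illustrative values, not taken from a specific config
batch, seq_len, vocab_size, dtype_bytes = 4, 4096, 262_144, 2
logits_bytes = batch * seq_len * vocab_size * dtype_bytes
print(f"logits tensor CCE avoids materializing: {logits_bytes / 2**30:.1f} GiB")  # 8.0 GiB
```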


Adding CCE for a new model

1. Check if the model type is in cut_cross_entropy.transformers.patch.PATCH_FNS
2. If not, axolotl’s generic fallback (integrations/cut_cross_entropy/__init__.py patch_llama_like()) patches {Prefix}ForCausalLM.forward with cce_forward
3. For multimodal models (ForConditionalGeneration), a model-specific patch is needed in the ml-cross-entropy repo
4. The multimodal cce_forward must accept all extra kwargs (pixel_values, mm_token_type_ids, etc.) and pop any that would conflict before calling self.model()

Common CCE pitfall


If CCE appears active (log says “Applying Cut Cross Entropy”) but peak VRAM doesn’t decrease, check which class was patched. If the model loads as ForConditionalGeneration but CCE patched ForCausalLM, the patch is silently inactive.
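The failure mode is easy to reproduce in miniature: patching the causal-LM class leaves a conditional-generation instance untouched, and nothing errors. The Fake* classes below are dummies for illustration only:

```python
class FakeForCausalLM:
    def forward(self):
        return "unfused loss"

# The multimodal wrapper is a separate class; it does NOT inherit from the one above
class FakeForConditionalGeneration:
    def forward(self):
        return "unfused loss"

def cce_forward(self):
    return "fused CCE loss"

# The patch targets ForCausalLM, but the checkpoint loads as the multimodal wrapper:
FakeForCausalLM.forward = cce_forward
model = FakeForConditionalGeneration()
print(model.forward())  # "unfused loss" -- the patch is silently inactive
```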


MoE Models


Dense MLP vs MoE experts


Some MoE models (e.g., Gemma4) have BOTH dense MLP layers and MoE expert layers at every decoder layer:

- gate_proj/up_proj/down_proj → targets the dense MLP (Gemma4TextMLP)
- experts.gate_up_proj/experts.down_proj → targets the MoE experts (Gemma4TextExperts)


LoRA on the dense MLP works normally. Expert LoRA via lora_target_parameters requires PEFT support for the specific expert module type (may warn “Unsupported layer type”).


ScatterMoE kernels


use_scattermoe: true with experts_implementation: scattermoe registers fused expert kernels via transformers’ ExpertsInterface. Significant speedup for MoE models. Requires the kernels plugin:

plugins:
  - axolotl.integrations.kernels.KernelsPlugin
use_kernels: true
use_scattermoe: true
experts_implementation: scattermoe
+

Where to Add Model-Specific Fixes

| What | Where | Example |
| --- | --- | --- |
| Missing forward inputs | core/trainers/base.py compute_loss() | mm_token_type_ids injection |
| Attention mask fixes | core/trainers/base.py compute_loss() | Sample packing mask removal |
| Loss logging fixes | core/trainers/base.py __init__() | model_accepts_loss_kwargs override |
| PEFT/LoRA patches | loaders/adapter.py | ClippableLinear redirect |
| Attention patches | monkeypatch/attention/ | FA4 tuple fix |
| Model-specific patches | loaders/patch_manager.py _apply_model_specific_patches() | Llama4, Kimi, NemotronH |
| CCE patches | ml-cross-entropy repo transformers/ | Per-model cce_forward |
| Example configs | examples/<model>/ | Validated YAML |
| Config validation | utils/schemas/validation.py | Compatibility checks |