fix ddp/fsdp w gemma4 (#3584)
* fix ddp/fsdp w gemma4
* address pr comments
* activation offloading fix and update agent docs for gemma4
110 docs/agents/model_architectures.md Normal file
@@ -0,0 +1,110 @@
# Model Architectures — Agent Reference

Model-specific quirks, required settings, and known issues. Check this before debugging training failures on specific model families.

## Gemma 4

**Models**: `google/gemma-4-26B-A4B` (MoE), `google/gemma-4-31B` (dense), `google/gemma-4-E2B`, `google/gemma-4-E4B`

**Architecture**: Multimodal wrapper (`Gemma4ForConditionalGeneration`) over a text backbone (`Gemma4TextModel`), with optional vision/audio encoders. All Gemma4 HF repos have `model_type: "gemma4"` — even text-only variants load as multimodal with a vision tower.

### Required settings

```yaml
# Always needed for Gemma4:
freeze_mm_modules: true  # Freeze vision/audio encoders for text-only training
gradient_checkpointing_kwargs:
  use_reentrant: false  # Shared per-layer norms cause "marked ready twice" with reentrant

# LoRA target — restrict to language model only (DO NOT use lora_target_linear: true):
lora_target_modules: 'model.language_model.layers.[\d]+.(_checkpoint_wrapped_module.)?(mlp|self_attn).(up|down|gate|q|k|v|o)_proj'
```

### Auto-detection

Axolotl auto-detects Gemma4 and applies:

- `use_reentrant: false` for gradient checkpointing
- `ddp_find_unused_parameters: true` for DDP (skipped when `activation_offloading: true`)

### Multi-GPU

| Strategy | Works? | Notes |
|----------|--------|-------|
| DDP | Yes | Auto-sets `ddp_find_unused_parameters=True` |
| DDP + activation_offloading | Yes | `find_unused_parameters` is skipped (conflicts with checkpoint wrappers) |
| FSDP1 | No | OOM during dequantization/sharding with QLoRA |
| FSDP2 | Yes | Use `Gemma4TextDecoderLayer` (not `Gemma4DecoderLayer`) as wrap class |
| FSDP2 + activation_offloading | Yes | Lowest VRAM (~26 GiB/GPU for 26B-A4B) |

FSDP2 config:

```yaml
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_version: 2
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Gemma4TextDecoderLayer
```
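
For the lowest-VRAM row in the table above, activation offloading layers on top of the same config. A minimal sketch, assuming the `activation_offloading` key used elsewhere in this doc:

```yaml
# Combine with the FSDP2 config above
activation_offloading: true
```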

### MoE (26B-A4B)

- `enable_moe_block: true`, 256 experts, top-k routing
- No separate `SparseMoeBlock` — MoE is embedded in each decoder layer
- Expert LoRA targets 3D parameter tensors:

  ```yaml
  lora_target_parameters:
    - experts.gate_up_proj
    - experts.down_proj
  ```

- ScatterMoE kernel acceleration:

  ```yaml
  plugins:
    - axolotl.integrations.kernels.KernelsPlugin
  use_kernels: true
  use_scattermoe: true
  experts_implementation: scattermoe
  ```

### Common issues

| Symptom | Cause | Fix |
|---------|-------|-----|
| `mm_token_type_ids is required` in DDP | `model.config` not accessible through DDP wrapper | Already fixed — `unwrap_model()` in `compute_loss` and `prediction_step` |
| `marked a variable ready twice` in DDP | `ddp_find_unused_parameters=True` + activation_offloading checkpoint wrappers | Auto-handled — `find_unused_parameters` is skipped when `activation_offloading: true` |
| Loss ~12 instead of ~0.5 | Using `lora_target_linear: true` (applies LoRA to vision/audio modules) | Use the regex `lora_target_modules` pattern instead |
| FSDP2 `Could not find Gemma4AudioLayer` | Auto-wrap detects `_no_split_modules` including audio layers that don't exist | Explicitly set `fsdp_transformer_layer_cls_to_wrap: Gemma4TextDecoderLayer` |
| `Gemma4ClippableLinear not supported` by PEFT | Vision tower uses a non-standard linear wrapper | Axolotl patches this automatically via `_patch_peft_clippable_linear()` |

### E2B/E4B dense models

These have `hidden_size_per_layer_input: 256` (per-layer input embeddings) and `attention_k_eq_v: False`. Known issue: loss starts higher than expected (~12 vs ~0.5 for 26B). Root cause under investigation — may be related to the per-layer input mechanism or the `Gemma4ForConditionalGeneration` loss computation.

## Gemma 3

**Models**: `google/gemma-3-*`

- `ddp_find_unused_parameters: true` needed (multimodal unused params)
- `use_reentrant: false` recommended (see the config sketch after this list)
- Attention mask must be dropped for sample packing (handled automatically)
- Multi-GPU test currently skipped (`tests/e2e/multigpu/test_gemma3.py`)
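
A minimal config sketch covering the first two bullets. These keys mirror the Gemma 4 example above; treat it as a starting point, not a validated Gemma 3 config:

```yaml
ddp_find_unused_parameters: true
gradient_checkpointing_kwargs:
  use_reentrant: false
```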

## Qwen 3.5 MoE

**Models**: `Qwen/Qwen3.5-35B-A3B`

- Hybrid architecture: DeltaNet linear attention (30 layers) + full attention (10 layers)
- 256 experts, 8 active per token
- Known weight scale drift in late DeltaNet layers (36-38) due to AdamW + rare expert interaction
- Fix: `normalize_weight_scales` config to detect and rescale outliers:

  ```yaml
  normalize_weight_scales:
    - name_pattern: 'linear_attn\.conv1d\.weight'
      threshold: 1.3
  ```

## General MoE Notes

- `lora_target_linear: true` with multimodal MoE models will apply LoRA to ALL linear modules, including vision/audio encoders — use a regex `lora_target_modules` to restrict LoRA to the language model only
- Rare experts get a larger effective learning rate from AdamW (small second-moment estimates), which can cause weight drift in recurrent/SSM components. Use `normalize_weight_scales` with `dry_run: true` to detect this (see the sketch after this list).
- For ScatterMoE kernel support, set `experts_implementation: scattermoe` and add the KernelsPlugin
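
A detection-only sketch. It assumes `dry_run` sits alongside the entry fields shown in the Qwen 3.5 section above; the exact placement of the flag is an assumption:

```yaml
normalize_weight_scales:
  - name_pattern: 'linear_attn\.conv1d\.weight'
    threshold: 1.3
    dry_run: true  # report outliers without rescaling (flag placement assumed)
```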

181 docs/agents/new_model_support.md Normal file
@@ -0,0 +1,181 @@
# New Model Support — Agent Reference

Guide for debugging and adding support for new model architectures in axolotl. Based on lessons learned from Gemma4, Gemma3, Qwen2-VL, and other multimodal/MoE models.

## Quick Validation Checklist

When testing a new model, run through these checks in order:

1. **Does the model load?** `axolotl preprocess config.yaml` — catches config schema errors
2. **Does LoRA apply?** Check for "Unsupported layer type" warnings from PEFT
3. **Is the initial loss sane?** First-step loss for a pretrained model should be 0.5–2.0 for SFT
4. **Does sample packing work?** Compare loss with `sample_packing: true` vs `false` — should be similar
5. **Is CCE active?** Check for the "Applying Cut Cross Entropy" log line and verify peak VRAM is lower

## Loss Debugging

### Expected initial loss

A pretrained model doing SFT should start with a loss roughly in the 0.5–2.0 range. If the loss starts above 3.0, something is wrong. If it is near `log(vocab_size)` (≈ 12 for a 262K vocab), the model is predicting at random — attention masking or model weights are broken.
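
Quick arithmetic check for that random-prediction baseline:

```python
import math

vocab_size = 262_144
print(math.log(vocab_size))  # ≈ 12.48; a first-step loss near this value means near-random predictions
```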

### Direct comparison technique

The fastest way to isolate a loss issue — bypass the trainer entirely:

```python
# Load model via axolotl's pipeline (applies all patches)
from axolotl.cli.config import load_cfg
from axolotl.utils.config import normalize_config, prepare_plugins
from axolotl.loaders.tokenizer import load_tokenizer
from axolotl.loaders.model import ModelLoader

cfg = load_cfg("your_config.yaml")
normalize_config(cfg)
prepare_plugins(cfg)
tokenizer = load_tokenizer(cfg)
model, _ = ModelLoader(cfg, tokenizer).load()

# Forward pass on preprocessed data
# (input_ids / labels: one tokenized batch from your preprocessed dataset, not constructed here)
model.train()
out = model(input_ids, labels=labels)
print(f"Direct loss: {out.loss.item()}")  # Compare to trainer's reported loss
```

If the direct loss is correct (~1.0) but the trainer reports 3–4x higher, check `model_accepts_loss_kwargs` (see below).

### `model_accepts_loss_kwargs` inflation

HF Trainer checks whether the model's `forward()` accepts `**kwargs` and, if so, sets `model_accepts_loss_kwargs=True`. This changes loss normalization: the trainer does NOT divide the loss by `gradient_accumulation_steps` before logging. The gradient is correct — only the logged loss is inflated.

**Symptom**: Logged loss ≈ actual_loss × gradient_accumulation_steps.

**Which models are affected**: Any model with `**kwargs` in `forward()` (common in multimodal models for extra inputs like `mm_token_type_ids`, `pixel_values`, etc.).

**Fix location**: `src/axolotl/core/trainers/base.py` `__init__()` — after `super().__init__()`, check whether the unwrapped model actually has `num_items_in_batch` in its forward signature. If not, set `self.model_accepts_loss_kwargs = False`.
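
A minimal sketch of that check. `fix_loss_kwargs_flag` is a hypothetical helper for illustration, not the actual axolotl code; it would run on the trainer right after `super().__init__()`:

```python
import inspect

def fix_loss_kwargs_flag(trainer) -> None:
    # Unwrap a DDP-style wrapper if present, then inspect the real forward signature
    base_model = getattr(trainer.model, "module", trainer.model)
    params = inspect.signature(base_model.forward).parameters
    if "num_items_in_batch" not in params:
        # forward() only has generic **kwargs, so HF's heuristic misfires; undo it
        trainer.model_accepts_loss_kwargs = False
```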

## Multimodal Models (ForConditionalGeneration)

Many recent models use `ForConditionalGeneration` as the top-level class, not `ForCausalLM`:

- Gemma3 → `Gemma3ForConditionalGeneration`
- Gemma4 → `Gemma4ForConditionalGeneration`
- Qwen2-VL → `Qwen2VLForConditionalGeneration`
- LLaVA → `LlavaForConditionalGeneration`

### Why this matters

| Component | Targets `ForCausalLM` | Needs `ForConditionalGeneration` |
|-----------|----------------------|--------------------------------|
| CCE patches | ✅ (default) | ❌ silently inactive if not patched |
| PEFT LoRA | ✅ | May fail on custom layer types |
| HF Trainer label handling | ✅ | May need extra inputs |

### Required extra inputs

Multimodal models require special inputs during training, even for text-only data:

| Model | Required Input | Value for Text-Only |
|-------|---------------|-------------------|
| Gemma4 | `mm_token_type_ids` | `torch.zeros_like(input_ids)` |
| Gemma3 | `token_type_ids` | `torch.zeros_like(input_ids)` |

Auto-inject these in `compute_loss()` when the data collator does not provide them. See `core/trainers/base.py`.
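
A minimal sketch of that injection, written as a standalone helper for illustration (the real change lives inside the trainer's `compute_loss()` override):

```python
import torch

def inject_mm_inputs(inputs: dict, model_type: str) -> dict:
    # Extra-input names from the table above; zeros are the text-only value
    extra_key = {"gemma4": "mm_token_type_ids", "gemma3": "token_type_ids"}.get(model_type)
    if extra_key and extra_key not in inputs:
        inputs[extra_key] = torch.zeros_like(inputs["input_ids"])
    return inputs
```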

### Custom layer types and PEFT

Vision towers often use custom module wrappers that PEFT doesn't support:

| Model | Custom Layer | Wraps | Fix |
|-------|-------------|-------|-----|
| Gemma4 | `Gemma4ClippableLinear` | `nn.Linear` | Redirect to `.linear` child |

Fix location: `src/axolotl/loaders/adapter.py` `_patch_peft_clippable_linear()`.
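
The "redirect" idea as a hypothetical standalone illustration (not the actual `_patch_peft_clippable_linear()` implementation): point LoRA at the wrapped `nn.Linear` child instead of the unsupported wrapper class.

```python
import torch.nn as nn

def resolve_lora_target(module: nn.Module) -> nn.Module:
    # If a wrapper exposes its real linear as a `.linear` child, target that instead
    inner = getattr(module, "linear", None)
    return inner if isinstance(inner, nn.Linear) else module
```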

## Sample Packing

### How packed sequence detection works (transformers ≥ 5.x)

`transformers.masking_utils._preprocess_mask_arguments()` detects packed sequences from `position_ids` resets — but **only when `attention_mask is None`**:

```python
# From masking_utils.py:
if position_ids is not None and attention_mask is None and past_key_values is None:
    packed_sequence_mask = find_packed_sequence_indices(position_ids)
```

If the collator provides an all-ones `attention_mask`, packing detection is **skipped** and the model builds a single causal mask spanning all packed sequences → cross-sequence attention leakage → very high loss.

### Fix for models using `create_causal_mask_mapping`

For Gemma3, Gemma4, and similar models that use the new transformers masking system, remove `attention_mask` from the inputs when sample packing is active:

```python
# In compute_loss():
if (
    self.args.sample_packing
    and model_type in ("gemma4", "gemma3")
    and "attention_mask" in inputs
    and "position_ids" in inputs
):
    del inputs["attention_mask"]
```

Fix location: `src/axolotl/core/trainers/base.py` `compute_loss()`.

### Models that DON'T need this fix

Older models that use `_prepare_4d_causal_attention_mask` (Llama, Mistral, Qwen2, etc.) handle sample packing via axolotl's multipack attention monkeypatch instead. Only models using the new `create_causal_mask_mapping` / `create_causal_mask` masking system need the `attention_mask` removal.

## Attention Backend Selection

| Backend | Config | head_dim limit | torch_compile | Notes |
|---------|--------|---------------|---------------|-------|
| FA2 | `flash_attention: true` | 256 | ✅ | Fastest when supported |
| FA4 | auto with `flash_attention: true` | 256 (SM90+) | ✅ | Auto-detected on H100+ |
| SDPA | `sdp_attention: true` | None | ✅ | Universal fallback |
| flex | `flex_attention: true` | None | ⚠️ Triton OOM for large head_dim | Good for variable head dims |
| eager | neither set | None | ✅ | Slowest, always works |

**Check model support**: Look at the `_supports_flash_attn_2`, `_supports_flex_attn`, and `_supports_sdpa` attributes on the model class.

**head_dim gotcha**: The 256 limit is specific to the flash-attn CUDA kernels, not a PyTorch-level limit. SDPA and flex_attention both handle arbitrary head_dim. Models with `global_head_dim > 256` (Gemma4: 512) must use SDPA or flex.
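
For example, forcing the universal fallback for such a model (keys from the table above):

```yaml
# head_dim > 256: flash-attn kernels unavailable, so use SDPA
flash_attention: false
sdp_attention: true
```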

**flex + compile gotcha**: `torch_compile` with flex_attention can hit Triton shared-memory OOM for large head_dim. It falls back to eager per-function (not a crash, but slower). Unsloth disables flex for Gemma4 for this reason.

## Cut Cross Entropy (CCE)

### How CCE patches work

CCE replaces the model's `forward()` with a fused version that computes the loss from hidden states + the lm_head weight without materializing the full logits tensor. This saves roughly `batch × seq_len × vocab_size × dtype_bytes` of VRAM.
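
A quick worked example of that saving (illustrative numbers, assuming bf16 logits and a 262K vocab):

```python
batch, seq_len, vocab_size, dtype_bytes = 1, 4096, 262_144, 2  # bf16
logits_bytes = batch * seq_len * vocab_size * dtype_bytes
print(logits_bytes / 2**30)  # 2.0 GiB of logits that CCE never materializes
```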

### Adding CCE for a new model

1. Check whether the model type is in `cut_cross_entropy.transformers.patch.PATCH_FNS` (see the quick check after this list)
2. If not, axolotl's generic fallback (`integrations/cut_cross_entropy/__init__.py` `patch_llama_like()`) patches `{Prefix}ForCausalLM.forward` with `cce_forward`
3. For multimodal models (`ForConditionalGeneration`), a model-specific patch is needed in the `ml-cross-entropy` repo
4. The multimodal `cce_forward` must accept all extra kwargs (`pixel_values`, `mm_token_type_ids`, etc.) and pop any that would conflict before calling `self.model()`
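
A quick check for step 1, assuming the module path named above and that `PATCH_FNS` is keyed by model type:

```python
# Assumes cut_cross_entropy exposes PATCH_FNS at the path given above
from cut_cross_entropy.transformers.patch import PATCH_FNS

print("gemma4" in PATCH_FNS)  # False → rely on the generic fallback or add a model-specific patch
```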

### Common CCE pitfall

If CCE appears active (the log says "Applying Cut Cross Entropy") but peak VRAM doesn't decrease, check which class was patched. If the model loads as `ForConditionalGeneration` but CCE patched `ForCausalLM`, the patch is silently inactive.
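
One rough way to see which class actually got patched (a heuristic, not an official API; it only inspects where the class's `forward` was defined):

```python
cls = type(model)
print(cls.__name__)            # e.g. ...ForConditionalGeneration vs ...ForCausalLM
print(cls.forward.__module__)  # mentions cut_cross_entropy only if this class was patched
```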

## MoE Models

### Dense MLP vs MoE experts

Some MoE models (e.g., Gemma4) have BOTH dense MLP layers and MoE expert layers in every decoder layer:

- `gate_proj/up_proj/down_proj` → targets the **dense MLP** (`Gemma4TextMLP`)
- `experts.gate_up_proj/experts.down_proj` → targets the **MoE experts** (`Gemma4TextExperts`)

LoRA on the dense MLP works normally. Expert LoRA via `lora_target_parameters` requires PEFT support for the specific expert module type (it may warn "Unsupported layer type").

### ScatterMoE kernels

`use_scattermoe: true` with `experts_implementation: scattermoe` registers fused expert kernels via transformers' `ExpertsInterface`, a significant speedup for MoE models. It requires the kernels plugin:

```yaml
plugins:
  - axolotl.integrations.kernels.KernelsPlugin
use_kernels: true
use_scattermoe: true
experts_implementation: scattermoe
```

## Where to Add Model-Specific Fixes

| What | Where | Example |
|------|-------|---------|
| Missing forward inputs | `core/trainers/base.py` `compute_loss()` | `mm_token_type_ids` injection |
| Attention mask fixes | `core/trainers/base.py` `compute_loss()` | Sample packing mask removal |
| Loss logging fixes | `core/trainers/base.py` `__init__()` | `model_accepts_loss_kwargs` override |
| PEFT/LoRA patches | `loaders/adapter.py` | ClippableLinear redirect |
| Attention patches | `monkeypatch/attention/` | FA4 tuple fix |
| Model-specific patches | `loaders/patch_manager.py` `_apply_model_specific_patches()` | Llama4, Kimi, NemotronH |
| CCE patches | `ml-cross-entropy` repo `transformers/` | Per-model `cce_forward` |
| Example configs | `examples/<model>/` | Validated YAML |
| Config validation | `utils/schemas/validation.py` | Compatibility checks |