Finetune Google's Gemma 4 with Axolotl
Gemma 4 is a family of multimodal models from Google. This guide covers how to train them with Axolotl.
Getting started
- Install Axolotl following the installation guide.
- Install Cut Cross Entropy to reduce training VRAM usage.
- Run one of the finetuning examples:
```bash
# 26B MoE QLoRA (1x80GB @ ~50 GiB)
axolotl train examples/gemma4/26b-a4b-moe-qlora.yaml

# 31B Dense QLoRA (1x80GB @ ~44 GiB)
axolotl train examples/gemma4/31b-qlora.yaml

# 31B Dense QLoRA Flex Attn (1x80GB @ ~26 GiB)
axolotl train examples/gemma4/31b-qlora-flex.yaml
```
MoE Expert Quantization & Expert LoRA (26B-A4B only)
The 26B-A4B config uses ScatterMoE kernels via the transformers ExpertsInterface and quantizes expert weights on load. To learn about expert quantization, expert LoRA targeting, and related limitations, see the MoE Expert Quantization docs.
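As a sketch, the expert-quantization keys in that config amount to the fragment below; see examples/gemma4/26b-a4b-moe-qlora.yaml for the authoritative settings:

```yaml
# QLoRA with the MoE expert weights also quantized on load
adapter: qlora
load_in_4bit: true
quantize_moe_experts: true
```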
Flex Attention
Reduce VRAM usage by roughly 40% (at the cost of up to half the throughput) by setting the options below (as shown in examples/gemma4/31b-qlora-flex.yaml):
```yaml
torch_compile: true
flex_attention: true
```
This works for both the MoE and dense models.
Limitations
- Flash Attention: FA2 (max `head_dim=256`) and FA4 (max `head_dim=128`) cannot support Gemma 4's `global_head_dim=512`. Use SDPA or flex attention instead.
- LoRA kernels: Not supported due to the KV-sharing layers.
- `lora_target_linear`: Incompatible with multimodal models; use `lora_target_modules` with a regex to restrict LoRA to the text backbone.
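For instance, the `lora_target_modules` workaround might look like the fragment below. The regex and module path are illustrative assumptions, not taken from the example configs; inspect the model's actual module names before relying on them:

```yaml
# Illustrative only: restrict LoRA to linear layers in the text backbone.
# The module path and projection names below are assumptions; verify them
# against the real model before use.
lora_target_modules:
  - "language_model.*.(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)"
```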
Tips
- Read more on how to load your own dataset in the dataset docs.
- You can run full finetuning by removing `adapter: qlora`, `load_in_4bit: true`, and `quantize_moe_experts: true` from the config. This is resource-intensive and has not been tested.
Optimization Guides
Please check the Optimizations doc.