Finetune Z.ai's GLM-4.7-Flash with Axolotl
GLM-4.7-Flash is a 30B-A3B MoE model by Z.ai.
This guide shows how to fine-tune it with Axolotl.
Getting started
- Install Axolotl following the installation guide.
- Install Cut Cross Entropy to reduce training VRAM usage.
- Run the finetuning example:
```bash
# QLoRA
# - no target experts (1x48GB @ ~24GiB/GPU)
# - target experts (1x48GB @ ~34GiB/GPU)
axolotl train examples/glm47-flash/qlora.yaml

# QLoRA FSDP2, no target experts (2x48GB @ ~29GiB/GPU)
axolotl train examples/glm47-flash/qlora_fsdp.yaml

# LoRA
# - no target experts (1x48GB @ ~35GiB/GPU)
# - target experts (1x48GB @ OOM; projected ~45-50GiB/GPU)
axolotl train examples/glm47-flash/lora.yaml

# LoRA FSDP2, no target experts (2x48GB @ ~43GiB/GPU)
axolotl train examples/glm47-flash/lora_fsdp.yaml
```
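To make the examples concrete, here is a minimal sketch of the kind of config those files contain. Apart from `adapter`, `load_in_4bit`, and `quantize_moe_experts` (which this guide mentions), every field and value below is illustrative, including the model id; the shipped `examples/glm47-flash/qlora.yaml` is the authoritative version.

```yaml
# Hypothetical, trimmed-down QLoRA config; defer to examples/glm47-flash/qlora.yaml.
base_model: zai-org/GLM-4.7-Flash   # assumed model id

adapter: qlora
load_in_4bit: true
quantize_moe_experts: true          # quantize expert weights on load (see next section)

lora_r: 16
lora_alpha: 32
lora_dropout: 0.05

datasets:
  - path: tatsu-lab/alpaca          # example dataset; swap in your own
    type: alpaca

sequence_len: 2048
micro_batch_size: 1
gradient_accumulation_steps: 4
learning_rate: 2.0e-4
num_epochs: 1
bf16: auto
```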
MoE Expert Quantization & Expert LoRA
This model quantizes its expert weights on load. To learn about expert quantization, expert LoRA targeting, and related limitations, see the MoE Expert Quantization docs.
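As a hedged illustration only (the exact option names and module list may differ for this architecture; the MoE Expert Quantization docs are authoritative), the "target experts" vs. "no target experts" variants above roughly come down to whether the expert MLP projections appear in the LoRA target list:

```yaml
# Illustrative only; module names below are assumptions, not verified for this model.
quantize_moe_experts: true    # quantize expert weights on load to cut VRAM
lora_target_modules:
  - q_proj                    # attention projections (names assumed)
  - k_proj
  - v_proj
  - o_proj
  - gate_proj                 # expert MLP projections; dropping these three
  - up_proj                   # gives the lower-VRAM "no target experts" runs
  - down_proj
```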
Limitations
- `lora_target_linear`: Incompatible with this model.
- LoRA kernels: Incompatible with this model due to its non-standard attention projections (DSA). They must be explicitly disabled (`lora_*_kernel: false`).
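For reference, a minimal sketch of disabling those kernels, assuming Axolotl's three per-projection kernel flags (verify the exact names against your Axolotl version):

```yaml
# Explicitly turn off the fused LoRA kernels, which expect the standard
# q/k/v/o attention projections that this model (DSA) does not use.
lora_qkv_kernel: false
lora_o_kernel: false
lora_mlp_kernel: false
```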
Tips

- For inference, the Z.ai team recommends these default settings for most tasks:
  - `temperature: 1.0`
  - `top_p: 0.95`
  - `max_new_tokens: 131072`
- You can run a full finetuning by removing `adapter: qlora`, `load_in_4bit: true`, and `quantize_moe_experts: true` from the config (see the sketch after this list). This is heavy, so we have not tested it.
- Read more on how to load your own dataset in the docs.
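A sketch of that full-finetune change, assuming the QLoRA example as the starting point (illustrative and, per the note above, untested):

```yaml
# Start from examples/glm47-flash/qlora.yaml and delete these keys:
# adapter: qlora
# load_in_4bit: true
# quantize_moe_experts: true
#
# Optionally lower the learning rate; full finetunes typically use a
# smaller LR than LoRA (the value below is an assumption, not tested).
learning_rate: 2.0e-5
```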
Optimization Guides
Please check the Optimizations doc.