# Finetune Z.ai's GLM-4.7-Flash with Axolotl

GLM-4.7-Flash is a 30B-A3B MoE model by Z.ai.

This guide shows how to fine-tune it with Axolotl.

## Getting started

1. Install Axolotl following the installation guide.

2. Install Cut Cross Entropy to reduce training VRAM usage (see the plugin sketch after the commands below).

3. Run the finetuning example:

```bash
# QLoRA
# - no target experts (1x48GB @ ~24GiB/GPU)
# - target experts (1x48GB @ ~34GiB/GPU)
axolotl train examples/glm4.7-flash/qlora.yaml

# QLoRA FSDP2, no target experts (2x48GB @ ~29GiB/GPU)
axolotl train examples/glm4.7-flash/qlora_fsdp.yaml

# LoRA
# - no target experts (1x48GB @ ~35GiB/GPU)
# - target experts (1x48GB @ OOM; projected ~45-50GiB/GPU)
axolotl train examples/glm4.7-flash/lora.yaml

# LoRA FSDP2, no target experts (2x48GB @ ~43GiB/GPU)
axolotl train examples/glm4.7-flash/lora_fsdp.yaml
```
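
Cut Cross Entropy (step 2 above) is wired in through Axolotl's plugin system. If you adapt the example configs or write your own, the relevant keys typically look like the hedged sketch below; verify against the bundled `qlora.yaml` / `lora.yaml`.

```yaml
# Hedged sketch: enable the Cut Cross Entropy plugin to reduce
# logit-related VRAM during training; verify against the example configs.
plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
cut_cross_entropy: true
```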

## Expert LoRA

To also apply LoRA adapters to the expert weights, add `lora_target_parameters` to your config.

Note: `lora_dropout` must be `0` when using `lora_target_parameters`.

```yaml
lora_target_parameters:
  - mlp.experts.gate_up_proj
  - mlp.experts.down_proj
  # - mlp.gate.weight  # router; untested but should work, not normally targeted
```
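
Taken together with the dropout requirement above, a minimal illustrative snippet of the relevant adapter keys looks like this (other adapter settings omitted):

```yaml
# Illustrative only: targeting expert parameters requires LoRA dropout of 0.
lora_dropout: 0.0
lora_target_parameters:
  - mlp.experts.gate_up_proj
  - mlp.experts.down_proj
```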

## Limitations

- **FSDP VRAM**: FSDP2 may use more VRAM per GPU than single-GPU training. We suspect not all layers are properly sharded across ranks.
- **FSDP initial spike**: FSDP LoRA (8-bit) can show a large VRAM spike during the first 1-2 steps that then drops. FSDP QLoRA (4-bit) does not exhibit this.
- **`cpu_ram_efficient_loading`**: Must be set to `false` with FSDP2; otherwise training hangs.
- **`lora_target_linear`**: Incompatible with this model.
- **LoRA kernels**: Incompatible with this model due to its non-standard attention projections (DSA). They must be explicitly disabled (`lora_*_kernel: false`); see the config sketch after this list.
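
Expressed as config, the workarounds above roughly amount to the following hedged sketch (key names are assumed to match the bundled `*_fsdp.yaml` examples; verify before use):

```yaml
# Hedged sketch of the overrides implied by the limitations above.
lora_qkv_kernel: false   # LoRA kernels must be disabled for this model
lora_o_kernel: false
lora_mlp_kernel: false

fsdp_version: 2
fsdp_config:
  cpu_ram_efficient_loading: false  # `true` hangs with FSDP2
```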

## Tips

- For inference, the Z.ai team recommends these default sampling settings for most tasks:
  - `temperature: 1.0`
  - `top_p: 0.95`
  - `max_new_tokens: 131072`
- You can run a full finetune by removing `adapter: qlora`, `load_in_4bit: true`, and `quantize_moe_experts: true` from the config (see the sketch after this list). This is resource-heavy, so we have not tested it.
- Read more on how to load your own dataset in the dataset docs.
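
A hedged sketch of the full-finetune change against `qlora.yaml` (untested, as noted above):

```yaml
# For a full finetune, remove these keys from qlora.yaml and keep the rest:
# adapter: qlora
# load_in_4bit: true
# quantize_moe_experts: true
```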

## Optimization Guides

Please check the Optimizations doc.