Finetune Z.ai's GLM-4.7-Flash with Axolotl
GLM-4.7-Flash is a 30B-A3B MoE model by Z.ai.
This guide shows how to fine-tune it with Axolotl.
Getting started
- Install Axolotl following the installation guide.
- Install Cut Cross Entropy to reduce training VRAM usage.
- Run the finetuning example:
```bash
# QLoRA
# - no target experts (1x48GB @ ~24GiB/GPU)
# - target experts (1x48GB @ ~34GiB/GPU)
axolotl train examples/glm47-flash/qlora.yaml

# QLoRA FSDP2, no target experts (2x48GB @ ~29GiB/GPU)
axolotl train examples/glm47-flash/qlora_fsdp.yaml

# LoRA
# - no target experts (1x48GB @ ~35GiB/GPU)
# - target experts (1x48GB @ OOM; projected ~45-50GiB/GPU)
axolotl train examples/glm47-flash/lora.yaml

# LoRA FSDP2, no target experts (2x48GB @ ~43GiB/GPU)
axolotl train examples/glm47-flash/lora_fsdp.yaml
```
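To make the examples concrete, here is a minimal sketch of the kind of config those files contain. Apart from `adapter`, `load_in_4bit`, and `quantize_moe_experts` (which this guide mentions), every field and value below is illustrative, including the model id; the shipped `examples/glm47-flash/qlora.yaml` is the authoritative version.

```yaml
# Hypothetical, trimmed-down QLoRA config; defer to examples/glm47-flash/qlora.yaml.
base_model: zai-org/GLM-4.7-Flash   # assumed model id

adapter: qlora
load_in_4bit: true
quantize_moe_experts: true          # quantize expert weights on load (see next section)

lora_r: 16
lora_alpha: 32
lora_dropout: 0.05

datasets:
  - path: tatsu-lab/alpaca          # example dataset; swap in your own
    type: alpaca

sequence_len: 2048
micro_batch_size: 1
gradient_accumulation_steps: 4
learning_rate: 2.0e-4
num_epochs: 1
bf16: auto
```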
MoE Expert Quantization & Expert LoRA
This model quantizes its expert weights on load. To learn about expert quantization, expert LoRA targeting, and related limitations, see the MoE Expert Quantization docs.
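As a hedged illustration only (the exact option names and module list may differ for this architecture; the MoE Expert Quantization docs are authoritative), the "target experts" vs. "no target experts" variants above roughly come down to whether the expert MLP projections appear in the LoRA target list:

```yaml
# Illustrative only; module names below are assumptions, not verified for this model.
quantize_moe_experts: true    # quantize expert weights on load to cut VRAM
lora_target_modules:
  - q_proj                    # attention projections (names assumed)
  - k_proj
  - v_proj
  - o_proj
  - gate_proj                 # expert MLP projections; dropping these three
  - up_proj                   # gives the lower-VRAM "no target experts" runs
  - down_proj
```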
Limitations
- `lora_target_linear`: Incompatible with this model.
- LoRA kernels: Incompatible with this model due to its non-standard attention projections (DSA). They must be explicitly disabled (`lora_*_kernel: false`).
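For reference, a minimal sketch of disabling those kernels, assuming Axolotl's three per-projection kernel flags (verify the exact names against your Axolotl version):

```yaml
# Explicitly turn off the fused LoRA kernels, which expect the standard
# q/k/v/o attention projections that this model (DSA) does not use.
lora_qkv_kernel: false
lora_o_kernel: false
lora_mlp_kernel: false
```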
Tips

- For inference, the Z.ai team recommends these default settings for most tasks:
  - `temperature: 1.0`
  - `top_p: 0.95`
  - `max_new_tokens: 131072`
- You can run a full finetuning by removing `adapter: qlora`, `load_in_4bit: true`, and `quantize_moe_experts: true` from the config (see the sketch after this list). This is heavy, so we have not tested it.
- Read more on how to load your own dataset in the docs.
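A sketch of that full-finetune change, assuming the QLoRA example as the starting point (illustrative and, per the note above, untested):

```yaml
# Start from examples/glm47-flash/qlora.yaml and delete these keys:
# adapter: qlora
# load_in_4bit: true
# quantize_moe_experts: true
#
# Optionally lower the learning rate; full finetunes typically use a
# smaller LR than LoRA (the value below is an assumption, not tested).
learning_rate: 2.0e-5
```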
Optimization Guides
Please check the Optimizations doc.