* fix: saving clones state dict
* fix: apply fix for only CP mode
* fix: add dropout check when using lora target param
* fix: re-add patch from transformers PR #39866
* feat: add moe quant to test by ved
* fix: try match target param properly end with
* fix: clear cache per param quant
* fix: attempt on-load quantize experts instead of post-load
* fix: attempt disable async load
* chore: add log
* chore: adjust log
* fix: remove cuda alloc for moe and enable async load
* chore: remove leftover logs
* chore: add extra empty cache
* fix(doc): clarify support
* fix: handle fsdp2 for paramwrapper dtensor
* feat: attempt to quant experts in 8bit mode too
* feat: attempt to release bf16 experts from vram
* feat: upgrade cce
* fix: fsdp2 init_sharded_param load int8/uint4 dtensor as require_grad=true on init
* fix: remove unnecessary gc and empty cache
* Revert "fix: remove unnecessary gc and empty cache"
This reverts commit 1d54518990.
* fix: do not call full_tensor on non-dtensors
* fix: attempt to address fsdp2 with quant exp high loss
* fix: attempt lora quant experts wrong dim
* fix: ensure require_grad patch applied for lora 8bit
* fix: attempt lora 8bit fsdp2
* fix: attribute access on save for lora 8bit fsdp2
* fix: wrong weight attrib access
* chore(refactor): add config, re-arrange position of patches, clean comments
* feat: add example docs
* chore: cherry pick trinity fixes from PR 3399
* chore: comments refactor; add guards
* fix: guard using wrong key
* fix: mamba save does not accept main process param
* fix: guard prevent double hook
* fix: move gc to upper scope
* chore: add comment on proxy forward patch
* fix: add comment to clarify
* feat: add test idempotency
* fix: AttributeError: `e_score_correction_bias` is not an nn.Parameter
* fix: AttributeError: 'NoneType' object has no attribute 'to'
* fix: update docs on cpu_ram_efficient_loading
# Finetune Z.ai's GLM-4.7-Flash with Axolotl

GLM-4.7-Flash is a 30B-A3B MoE model by Z.ai. This guide shows how to fine-tune it with Axolotl.
## Getting started

- Install Axolotl following the installation guide.
- Install Cut Cross Entropy to reduce training VRAM usage.
- Run the finetuning example:
```bash
# QLoRA
# - no target experts (1x48GB @ ~24GiB/GPU)
# - target experts (1x48GB @ ~34GiB/GPU)
axolotl train examples/glm4.7-flash/qlora.yaml

# QLoRA FSDP2, no target experts (2x48GB @ ~29GiB/GPU)
axolotl train examples/glm4.7-flash/qlora_fsdp.yaml

# LoRA
# - no target experts (1x48GB @ ~35GiB/GPU)
# - target experts (1x48GB @ OOM; projected ~45-50GiB/GPU)
axolotl train examples/glm4.7-flash/lora.yaml

# LoRA FSDP2, no target experts (2x48GB @ ~43GiB/GPU)
axolotl train examples/glm4.7-flash/lora_fsdp.yaml
```
## Expert LoRA

To also apply LoRA adapters to the expert weights, add `lora_target_parameters` to your config.

Note: `lora_dropout` must be `0` when using `lora_target_parameters`.

```yaml
lora_target_parameters:
  - mlp.experts.gate_up_proj
  - mlp.experts.down_proj
  # - mlp.gate.weight  # router; untested but should work, not normally targeted
```
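For context, here is a minimal sketch of how expert targeting sits alongside the rest of the LoRA settings; the `lora_r`/`lora_alpha` values are placeholders, not recommendations from the bundled examples:

```yaml
# Sketch only: expert targeting in the LoRA section of the config.
# lora_r / lora_alpha are placeholder values, not tuned recommendations.
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0          # must be 0 when lora_target_parameters is set
lora_target_parameters:
  - mlp.experts.gate_up_proj
  - mlp.experts.down_proj
```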
## Limitations

- FSDP VRAM: FSDP2 may use more VRAM per GPU than single-GPU training. We suspect not all layers are properly sharded across ranks.
- FSDP initial spike: FSDP LoRA (8-bit) may show a large initial VRAM spike during the first 1-2 steps that then drops. FSDP QLoRA (4-bit) does not exhibit this.
- `cpu_ram_efficient_loading`: Must be set to `false` with FSDP2; otherwise it causes a hang. See the config sketch after this list.
- `lora_target_linear`: Incompatible with this model.
- LoRA kernels: Incompatible with this model due to its non-standard attention projections (DSA); they must be explicitly disabled (`lora_*_kernel: false`).
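Expressed as config, the FSDP2-related constraints above look roughly like the sketch below. The exact kernel flag names (`lora_qkv_kernel`, `lora_o_kernel`, `lora_mlp_kernel`) and the placement of `cpu_ram_efficient_loading` under `fsdp_config` are assumptions; verify them against the bundled `*_fsdp.yaml` examples.

```yaml
# Sketch of FSDP2-related overrides for this model (key names/nesting assumed;
# check against examples/glm4.7-flash/qlora_fsdp.yaml).
lora_qkv_kernel: false   # LoRA kernels must stay disabled (non-standard attention / DSA)
lora_o_kernel: false
lora_mlp_kernel: false
# lora_target_linear: do not set for this model

fsdp_version: 2
fsdp_config:
  cpu_ram_efficient_loading: false   # must be false with FSDP2, otherwise it hangs
```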
## TIPS

- For inference, the Z.ai team recommends these default settings for most tasks: `temperature: 1.0`, `top_p: 0.95`, `max_new_tokens: 131072`.
- You can run a full finetune by removing `adapter: qlora`, `load_in_4bit: true`, and `quantize_moe_experts: true` from the config (see the sketch after this list). This is heavy, so we have not tested it.
- Read more on how to load your own dataset in the docs.
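As a rough sketch of the full-finetune change mentioned above (assuming the example `qlora.yaml` layout), the relevant keys are simply dropped rather than set to `false`:

```yaml
# Full finetune: remove the adapter and quantization keys from the QLoRA example.
# adapter: qlora              # remove
# load_in_4bit: true          # remove
# quantize_moe_experts: true  # remove
# All other settings (datasets, sequence length, optimizer, ...) can stay as in the example.
```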
## Optimization Guides

Please check the Optimizations doc.