feat: add doc for expert quantization, glm45 air example configs, and update readme for release (#3452) [skip ci]

* chore: rename without period * feat: add glm45 air * feat: add doc on expert quantization * feat: update base readme with new changes * chore: cleanup * chore: cleanup * chore: cleanup * fix: disable quantize_moe_expert on merge per comment * chore: add kernel info to optimizations doc
2026-03-05 21:58:09 +07:00
parent b6b8db805a
commit 753906cfc7
13 changed files with 248 additions and 29 deletions
--- a/examples/glm45/README.md
+++ b/examples/glm45/README.md
@@ -0,0 +1,72 @@
+# Finetune Z.ai's GLM-4.5-Air with Axolotl
+
+[GLM-4.5-Air](https://huggingface.co/zai-org/GLM-4.5-Air) is a MoE model by Z.ai.
+
+This guide shows how to fine-tune it with Axolotl.
+
+## Getting started
+
+1. Install Axolotl following the [installation guide](https://docs.axolotl.ai/docs/installation.html).
+
+2. Install [Cut Cross Entropy](https://docs.axolotl.ai/docs/custom_integrations.html#cut-cross-entropy) to reduce training VRAM usage.
+
+3. Run the finetuning example:
+
+```bash
+# QLoRA (1x80GB @ ~63.4GiB/GPU)
+axolotl train examples/glm45/glm-45-air-qlora.yaml
+```
+
+### Dataset
+
+In addition to the standard OpenAI Messages format, GLM-4.5 supports an extra parameter for thinking in the assistant section.
+
+```json
+{
+    "role": "assistant",
+    "reasoning_content": "...",  // or have </think>...</think> in `content`
+    "content": "..."
+}
+```
+
+Make sure you set the below extra attributes if needed:
+
+```yaml
+datasets:
+  - path: ...
+    type: chat_template
+    message_property_mappings:
+      role: role
+      content: content
+
+    #   tool_calls: tool_calls  # uncomment if using tools
+    #   reasoning_content: reasoning_content  # uncomment if have reasoning
+
+# Uncomment if training on tool role (you would rarely if ever need this)
+# eot_tokens:
+#   - <|observation|>
+```
+
+### Tips
+
+- The role name for tools in this template is `tool`.
+- You will see this Axolotl WARNING — this is expected as the template does not use EOS:
+  ```
+  EOS token '<|endoftext|>' not found in chat_template. Please check if your template/EOS token is correct.
+  ```
+- You can run a full finetuning by removing `adapter: qlora`, `load_in_4bit: true`, and `quantize_moe_experts: true` from the config.
+- **LoRA kernels**: Incompatible with this model. Must be explicitly disabled (`lora_*_kernel: false`).
+- Read more on how to load your own dataset at [docs](https://docs.axolotl.ai/docs/dataset_loading.html).
+
+## Optimization Guides
+
+Please check the [Optimizations doc](https://docs.axolotl.ai/docs/optimizations.html).
+
+## Related Resources
+
+- [GLM-4.5-Air on HuggingFace](https://huggingface.co/zai-org/GLM-4.5-Air)
+- [GLM-4.5 Blog](https://z.ai/blog/glm-4.5)
+- [Axolotl Docs](https://docs.axolotl.ai)
+- [Axolotl Website](https://axolotl.ai)
+- [Axolotl GitHub](https://github.com/axolotl-ai-cloud/axolotl)
+- [Axolotl Discord](https://discord.gg/7m9sfhzaf3)
--- a/examples/glm45/glm-45-air-qlora.yaml
+++ b/examples/glm45/glm-45-air-qlora.yaml
@@ -0,0 +1,64 @@
+base_model: zai-org/GLM-4.5-Air
+
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+
+plugins:
+  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
+
+load_in_8bit: false
+load_in_4bit: true
+
+quantize_moe_experts: true # important
+
+datasets:
+  - path: fozziethebeat/alpaca_messages_2k_test
+    type: chat_template
+
+dataset_prepared_path: last_run_prepared
+val_set_size: 0.1
+output_dir: ./outputs/lora-out
+
+adapter: qlora
+lora_model_dir:
+
+sequence_len: 2048
+sample_packing: true
+
+lora_r: 16
+lora_alpha: 8
+lora_dropout: 0
+lora_target_modules:
+  - q_proj
+  - v_proj
+  - k_proj
+  - o_proj
+
+# lora_target_parameters:
+#   - mlp.experts.gate_up_proj
+#   - mlp.experts.down_proj
+
+lora_mlp_kernel: false
+lora_qkv_kernel: false
+lora_o_kernel: false
+
+gradient_accumulation_steps: 2
+micro_batch_size: 2
+num_epochs: 1
+optimizer: adamw_bnb_8bit
+lr_scheduler: cosine
+learning_rate: 0.0002
+
+bf16: auto
+tf32: false
+
+gradient_checkpointing: true
+resume_from_checkpoint:
+logging_steps: 1
+flash_attention: true
+
+warmup_ratio: 0.1
+evals_per_epoch: 1
+saves_per_epoch: 1
+
+# save_first_step: true  # uncomment this to validate checkpoint saving works with your config
--- a/examples/glm4.7-flash/README.md
+++ b/examples/glm4.7-flash/README.md
@@ -16,40 +16,28 @@ This guide shows how to fine-tune it with Axolotl.
 # QLoRA
 # - no target experts (1x48GB @ ~24GiB/GPU)
 # - target experts (1x48GB @ ~34GiB/GPU)
-axolotl train examples/glm4.7-flash/qlora.yaml
+axolotl train examples/glm47-flash/qlora.yaml

 # QLoRA FSDP2 no target experts (2x48GB @ ~29GiB/GPU)
-axolotl train examples/glm4.7-flash/qlora_fsdp.yaml
+axolotl train examples/glm47-flash/qlora_fsdp.yaml
 ```

 ```bash
 # LoRA
 # - no target experts (1x48GB @ ~35GiB/GPU)
 # - target experts (1x48GB @ OOM. Projected ~45-50GiB/GPU)
-axolotl train examples/glm4.7-flash/lora.yaml
+axolotl train examples/glm47-flash/lora.yaml

 # LoRA FSDP2 no target experts (2x48GB @ ~43GiB/GPU)
-axolotl train examples/glm4.7-flash/lora_fsdp.yaml
+axolotl train examples/glm47-flash/lora_fsdp.yaml
 ```

-### Expert LoRA
+### MoE Expert Quantization & Expert LoRA

-To also apply LoRA adapters to expert weights, add `lora_target_parameters` to your config.
-
-Note: `lora_dropout` must be `0` when using `lora_target_parameters`.
-
-```yaml
-lora_target_parameters:
-  - mlp.experts.gate_up_proj
-  - mlp.experts.down_proj
-  # - mlp.gate.weight  # router, untested but should work, not normally targeted
-```
+This model quantize expert weights on load. To learn about expert quantization, expert LoRA targeting, and related limitations, see the [MoE Expert Quantization](https://docs.axolotl.ai/docs/expert_quantization.html) docs.

 ## Limitations

- **FSDP VRAM**: FSDP2 may use more VRAM per GPU than single GPU training. We suspect not all layers are properly sharded across ranks.
- **FSDP initial spike**: FSDP LoRA (8-bit) may have a large initial VRAM spike at the first 1-2 steps that then drops. FSDP QLoRA (4-bit) does not exhibit this.
- **cpu_ram_efficient_loading**: Must be set to `false` with FSDP2 — causes hang otherwise.
 - **lora_target_linear**: Incompatible for this model.
 - **LoRA kernels**: Incompatible with this model due to non-standard attention projections (DSA). Must be explicitly disabled (`lora_*_kernel: false`).

--- a/examples/glm4.7-flash/lora.yaml
+++ b/examples/glm4.7-flash/lora.yaml
--- a/examples/glm4.7-flash/lora_fsdp.yaml
+++ b/examples/glm4.7-flash/lora_fsdp.yaml
--- a/examples/glm4.7-flash/qlora.yaml
+++ b/examples/glm4.7-flash/qlora.yaml
--- a/examples/glm4.7-flash/qlora_fsdp.yaml
+++ b/examples/glm4.7-flash/qlora_fsdp.yaml