Fix: quantize and target MoE layers in transformers v5 for adapters, plus misc fixes (#3439)

* fix: saving clones state dict

* fix: apply fix for only CP mode

* fix: add dropout check when using lora target param

* fix: re-add patch from transformers PR #39866

* feat: add moe quant to test by ved

* fix: try to match target param properly using endswith

* fix: clear cache per param quant

* fix: attempt on-load quantize experts instead of post-load

* fix: attempt disable async load

* chore: add log

* chore: adjust log

* fix: remove cuda alloc for moe and enable async load

* chore: remove leftover logs

* chore: add extra empty cache

* fix(doc): clarify support

* fix: handle fsdp2 for paramwrapper dtensor

* feat: attempt to quant experts in 8bit mode too

* feat: attempt to release bf16 experts from vram

* feat: upgrade cce

* fix: fsdp2 init_sharded_param load int8/uint4 dtensor as require_grad=true on init

* fix: remove unnecessary gc and empty cache

* Revert "fix: remove unnecessary gc and empty cache"

This reverts commit 1d54518990.

* fix: do not call full_tensor on non-dtensors

* fix: attempt to address fsdp2 with quant exp high loss

* fix: attempt lora quant experts wrong dim

* fix: ensure require_grad patch applied for lora 8bit

* fix: attempt lora 8bit fsdp2

* fix: attribute access on save for lora 8bit fsdp2

* fix: wrong weight attrib access

* chore(refactor): add config, re-arrange position of patches, clean comments

* feat: add example docs

* chore: cherry pick trinity fixes from PR 3399

* chore: comments refactor; add guards

* fix: guard using wrong key

* fix: mamba save does not accept main process param

* fix: guard prevent double hook

* fix: move gc to upper scope

* chore: add comment on proxy forward patch

* fix: add comment to clarify

* feat: add test idempotency

* fix: AttributeError: `e_score_correction_bias` is not an nn.Parameter

* fix: AttributeError: 'NoneType' object has no attribute 'to'

* fix: update docs on cpu_ram_efficient_loading
NanoCode012
2026-03-03 22:06:23 +07:00
committed by GitHub
parent e672d37f33
commit 945c8aeb10
24 changed files with 1015 additions and 29 deletions

View File

@@ -40,7 +40,7 @@
"%%capture\n",
"# This step can take ~5-10 minutes to install dependencies\n",
"!pip install --no-build-isolation axolotl[flash-attn]>=0.9.1\n",
"!pip install \"cut-cross-entropy[transformers] @ git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@58d6572\""
"!pip install \"cut-cross-entropy[transformers] @ git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@a668583\""
]
},
{

View File

@@ -0,0 +1,77 @@
# Finetune Z.ai's GLM-4.7-Flash with Axolotl
[GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) is a 30B-A3B MoE model by Z.ai.
This guide shows how to fine-tune it with Axolotl.
## Getting started
1. Install Axolotl following the [installation guide](https://docs.axolotl.ai/docs/installation.html).
2. Install [Cut Cross Entropy](https://docs.axolotl.ai/docs/custom_integrations.html#cut-cross-entropy) to reduce training VRAM usage.
3. Run the finetuning example:
```bash
# QLoRA
# - no target experts (1x48GB @ ~24GiB/GPU)
# - target experts (1x48GB @ ~34GiB/GPU)
axolotl train examples/glm4.7-flash/qlora.yaml
# QLoRA FSDP2 no target experts (2x48GB @ ~29GiB/GPU)
axolotl train examples/glm4.7-flash/qlora_fsdp.yaml
```
```bash
# LoRA
# - no target experts (1x48GB @ ~35GiB/GPU)
# - target experts (1x48GB @ OOM. Projected ~45-50GiB/GPU)
axolotl train examples/glm4.7-flash/lora.yaml
# LoRA FSDP2 no target experts (2x48GB @ ~43GiB/GPU)
axolotl train examples/glm4.7-flash/lora_fsdp.yaml
```
### Expert LoRA
To also apply LoRA adapters to expert weights, add `lora_target_parameters` to your config.
Note: `lora_dropout` must be `0` when using `lora_target_parameters`.
```yaml
lora_target_parameters:
- mlp.experts.gate_up_proj
- mlp.experts.down_proj
# - mlp.gate.weight # router, untested but should work, not normally targeted
```
## Limitations
- **FSDP VRAM**: FSDP2 may use more VRAM per GPU than single GPU training. We suspect not all layers are properly sharded across ranks.
- **FSDP initial spike**: FSDP LoRA (8-bit) may show a large initial VRAM spike during the first 1-2 steps, which then drops. FSDP QLoRA (4-bit) does not exhibit this.
- **cpu_ram_efficient_loading**: Must be set to `false` with FSDP2; otherwise model loading hangs.
- **lora_target_linear**: Not compatible with this model.
- **LoRA kernels**: Not compatible with this model due to its non-standard attention projections (DSA); they must be explicitly disabled (`lora_*_kernel: false`). The snippet after this list shows the relevant config keys.
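The example configs below already encode these constraints; for quick reference, the relevant keys (excerpted from those configs) are:
```yaml
# LoRA kernels disabled because of the model's DSA attention
lora_mlp_kernel: false
lora_qkv_kernel: false
lora_o_kernel: false

# FSDP2 runs only: efficient CPU loading must stay off, otherwise loading hangs
fsdp_config:
  fsdp_version: 2
  cpu_ram_efficient_loading: false
```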
### Tips
- For inference, the official Z.ai team recommends these default settings (most tasks):
- `temperature: 1.0`
- `top_p: 0.95`
- `max_new_tokens: 131072`
- You can run a full finetune by removing `adapter: qlora`, `load_in_4bit: true`, and `quantize_moe_experts: true` from the config (see the sketch after this list). This is resource-heavy, so we have not tested it.
- Read more about loading your own dataset in the [docs](https://docs.axolotl.ai/docs/dataset_loading.html).
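A minimal sketch of that change, assuming you start from `examples/glm4.7-flash/qlora.yaml` and keep everything else as-is (untested, as noted above):
```yaml
# Full finetune sketch: start from qlora.yaml and delete these three keys
# adapter: qlora              # remove so the base weights are trained directly
# load_in_4bit: true          # remove to keep weights in bf16
# quantize_moe_experts: true  # remove so the MoE experts stay unquantized
```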
## Optimization Guides
Please check the [Optimizations doc](https://docs.axolotl.ai/docs/optimizations.html).
## Related Resources
- [GLM-4.7-Flash on HuggingFace](https://huggingface.co/zai-org/GLM-4.7-Flash)
- [GLM-4.7 Blog](https://z.ai/blog/glm-4.7)
- [Axolotl Docs](https://docs.axolotl.ai)
- [Axolotl Website](https://axolotl.ai)
- [Axolotl GitHub](https://github.com/axolotl-ai-cloud/axolotl)
- [Axolotl Discord](https://discord.gg/7m9sfhzaf3)

View File

@@ -0,0 +1,65 @@
base_model: zai-org/GLM-4.7-Flash
plugins:
- axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
load_in_8bit: true
quantize_moe_experts: true  # also quantize the MoE expert weights
datasets:
  - path: fozziethebeat/alpaca_messages_2k_test
    type: chat_template
dataset_prepared_path: last_run_prepared
val_set_size: 0.1
output_dir: ./outputs/glm4.7-flash-lora-8bit-out
adapter: lora
lora_model_dir:
sequence_len: 2048
sample_packing: true
lora_r: 32
lora_alpha: 16
lora_dropout: 0  # must stay 0 if lora_target_parameters is used
lora_target_modules:
- q_proj
- v_proj
- k_proj
- o_proj
# Uncomment to also target MoE expert weights:
# lora_target_parameters:
# - mlp.experts.gate_up_proj
# - mlp.experts.down_proj
# LoRA kernels incompatible with DSA attention
lora_mlp_kernel: false
lora_qkv_kernel: false
lora_o_kernel: false
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 1
optimizer: adamw_torch_8bit
lr_scheduler: cosine
learning_rate: 0.0002
bf16: auto
tf32: false
gradient_checkpointing: true
resume_from_checkpoint:
logging_steps: 1
flash_attention: true
warmup_ratio: 0.1
evals_per_epoch: 1
saves_per_epoch: 1

View File

@@ -0,0 +1,75 @@
base_model: zai-org/GLM-4.7-Flash
plugins:
- axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
load_in_8bit: true
quantize_moe_experts: true  # also quantize the MoE expert weights
datasets:
  - path: fozziethebeat/alpaca_messages_2k_test
    type: chat_template
dataset_prepared_path: last_run_prepared
val_set_size: 0.1
output_dir: ./outputs/glm4.7-flash-lora-8bit-fsdp-out
adapter: lora
lora_model_dir:
sequence_len: 2048
sample_packing: true
lora_r: 32
lora_alpha: 16
lora_dropout: 0  # must stay 0 if lora_target_parameters is used
lora_target_modules:
- q_proj
- v_proj
- k_proj
- o_proj
# Uncomment to also target MoE expert weights:
# lora_target_parameters:
# - mlp.experts.gate_up_proj
# - mlp.experts.down_proj
# LoRA kernels incompatible with DSA attention
lora_mlp_kernel: false
lora_qkv_kernel: false
lora_o_kernel: false
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 1
optimizer: adamw_torch_8bit
lr_scheduler: cosine
learning_rate: 0.0002
bf16: auto
tf32: false
resume_from_checkpoint:
logging_steps: 1
flash_attention: true
warmup_ratio: 0.1
evals_per_epoch: 1
saves_per_epoch: 1
fsdp_config:
  fsdp_version: 2
  offload_params: false
  cpu_ram_efficient_loading: false  # must be false for this model; loading hangs otherwise
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: Glm4MoeLiteDecoderLayer
  state_dict_type: FULL_STATE_DICT
  sharding_strategy: FULL_SHARD
  reshard_after_forward: true
  activation_checkpointing: true

View File

@@ -0,0 +1,65 @@
base_model: zai-org/GLM-4.7-Flash
plugins:
- axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
load_in_4bit: true
quantize_moe_experts: true  # also quantize the MoE expert weights
datasets:
  - path: fozziethebeat/alpaca_messages_2k_test
    type: chat_template
dataset_prepared_path: last_run_prepared
val_set_size: 0.1
output_dir: ./outputs/glm4.7-flash-qlora-out
adapter: qlora
lora_model_dir:
sequence_len: 2048
sample_packing: true
lora_r: 32
lora_alpha: 16
lora_dropout: 0  # must stay 0 if lora_target_parameters is used
lora_target_modules:
- q_proj
- v_proj
- k_proj
- o_proj
# Uncomment to also target MoE expert weights:
# lora_target_parameters:
# - mlp.experts.gate_up_proj
# - mlp.experts.down_proj
# LoRA kernels incompatible with DSA attention
lora_mlp_kernel: false
lora_qkv_kernel: false
lora_o_kernel: false
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 1
optimizer: adamw_torch_8bit
lr_scheduler: cosine
learning_rate: 0.0002
bf16: auto
tf32: false
gradient_checkpointing: true
resume_from_checkpoint:
logging_steps: 1
flash_attention: true
warmup_ratio: 0.1
evals_per_epoch: 1
saves_per_epoch: 1

View File

@@ -0,0 +1,75 @@
base_model: zai-org/GLM-4.7-Flash
plugins:
- axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
load_in_4bit: true
quantize_moe_experts: true  # also quantize the MoE expert weights
datasets:
  - path: fozziethebeat/alpaca_messages_2k_test
    type: chat_template
dataset_prepared_path: last_run_prepared
val_set_size: 0.1
output_dir: ./outputs/glm4.7-flash-qlora-fsdp-out
adapter: qlora
lora_model_dir:
sequence_len: 2048
sample_packing: true
lora_r: 32
lora_alpha: 16
lora_dropout: 0  # must stay 0 if lora_target_parameters is used
lora_target_modules:
- q_proj
- v_proj
- k_proj
- o_proj
# Uncomment to also target MoE expert weights:
# lora_target_parameters:
# - mlp.experts.gate_up_proj
# - mlp.experts.down_proj
# LoRA kernels incompatible with DSA attention
lora_mlp_kernel: false
lora_qkv_kernel: false
lora_o_kernel: false
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 1
optimizer: adamw_torch_8bit
lr_scheduler: cosine
learning_rate: 0.0002
bf16: auto
tf32: false
resume_from_checkpoint:
logging_steps: 1
flash_attention: true
warmup_ratio: 0.1
evals_per_epoch: 1
saves_per_epoch: 1
fsdp_config:
  fsdp_version: 2
  offload_params: false
  cpu_ram_efficient_loading: false  # must be false for this model; loading hangs otherwise
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: Glm4MoeLiteDecoderLayer
  state_dict_type: FULL_STATE_DICT
  sharding_strategy: FULL_SHARD
  reshard_after_forward: true
  activation_checkpointing: true

View File

@@ -8,13 +8,15 @@ This guide shows how to fine-tune it with Axolotl with multi-turn conversations
 1. Install Axolotl following the main from the [installation guide](https://docs.axolotl.ai/docs/installation.html#sec-edge-build).
-2. Run the finetuning example:
+2. Install [Cut Cross Entropy](https://docs.axolotl.ai/docs/custom_integrations.html#cut-cross-entropy) to reduce training VRAM usage.
+3. Run the finetuning example:
 ```bash
 axolotl train examples/trinity/trinity-nano-preview-qlora.yaml
 ```
-This config uses about 24.9 GiB VRAM.
+This config uses about 24.9 GiB VRAM (w/o CCE).
 Let us know how it goes. Happy finetuning! 🚀
@@ -29,10 +31,6 @@ Let us know how it goes. Happy finetuning! 🚀
 Please check the [Optimizations doc](https://docs.axolotl.ai/docs/optimizations.html).
-## Limitations
-**Cut Cross Entropy (CCE)**: Currently not supported. We plan to include CCE support for Trinity in the near future.
 ## Related Resources
 - [Trinity Blog](https://www.arcee.ai/blog/the-trinity-manifesto)

View File

@@ -1,5 +1,4 @@
base_model: arcee-ai/Trinity-Nano-Preview
trust_remote_code: true
revision_of_model: 2ee94b0
# Automatically upload checkpoint and final model to HF