* fix: saving clones state dict
* fix: apply fix for only CP mode
* fix: add dropout check when using lora target param
* fix: re-add patch from transformers PR #39866
* feat: add moe quant to test by ved
* fix: try match target param properly end with
* fix: clear cache per param quant
* fix: attempt on-load quantize experts instead of post-load
* fix: attempt disable async load
* chore: add log
* chore: adjust log
* fix: remove cuda alloc for moe and enable async load
* chore: remove leftover logs
* chore: add extra empty cache
* fix(doc): clarify support
* fix: handle fsdp2 for paramwrapper dtensor
* feat: attempt to quant experts in 8bit mode too
* feat: attempt to release bf16 experts from vram
* feat: upgrade cce
* fix: fsdp2 init_sharded_param load int8/uint4 dtensor as
require_grad=true on init
* fix: remove unnecessary gc and empty cache
* Revert "fix: remove unnecessary gc and empty cache"
This reverts commit 1d54518990.
* fix: do not call full_tensor on non-dtensors
* fix: attempt to address fsdp2 with quant exp high loss
* fix: attempt lora quant experts wrong dim
* fix: ensure require_grad patch applied for lora 8bit
* fix: attempt lora 8bit fsdp2
* fix: attribute access on save for lora 8bit fsdp2
* fix: wrong weight attrib access
* chore(refactor): add config, re-arrange position of patches, clean
comments
* feat: add example docs
* chore: cherry pick trinity fixes from PR 3399
* chore: comments refactor; add guards
* fix: guard using wrong key
* fix: mamba save does not accept main process param
* fix: guard prevent double hook
* fix: move gc to upper scope
* chore: add comment on proxy forward patch
* fix: add comment to clarify
* feat: add test idempotency
* fix: AttributeError: `e_score_correction_bias` is not an nn.Parameter
* fix: AttributeError: 'NoneType' object has no attribute 'to'
* fix: update docs on cpu_ram_efficient_loading
# Finetune Z.ai's GLM-4.7-Flash with Axolotl

[GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) is a 30B-A3B MoE model by Z.ai.

This guide shows how to fine-tune it with Axolotl.

## Getting started

1. Install Axolotl following the [installation guide](https://docs.axolotl.ai/docs/installation.html).

2. Install [Cut Cross Entropy](https://docs.axolotl.ai/docs/custom_integrations.html#cut-cross-entropy) to reduce training VRAM usage.

3. Run the finetuning example:

```bash
# QLoRA
# - no target experts (1x48GB @ ~24GiB/GPU)
# - target experts (1x48GB @ ~34GiB/GPU)
axolotl train examples/glm4.7-flash/qlora.yaml

# QLoRA FSDP2 no target experts (2x48GB @ ~29GiB/GPU)
axolotl train examples/glm4.7-flash/qlora_fsdp.yaml
```

```bash
# LoRA
# - no target experts (1x48GB @ ~35GiB/GPU)
# - target experts (1x48GB @ OOM. Projected ~45-50GiB/GPU)
axolotl train examples/glm4.7-flash/lora.yaml

# LoRA FSDP2 no target experts (2x48GB @ ~43GiB/GPU)
axolotl train examples/glm4.7-flash/lora_fsdp.yaml
```
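
Installing Cut Cross Entropy (step 2) does not by itself turn it on; the integration is enabled from the training config. A minimal sketch of the relevant lines, assuming the plugin path given in the Axolotl custom-integrations guide linked above. The bundled example configs may already contain it, so check before adding it again.

```yaml
# Assumption: plugin path as listed in the Axolotl custom-integrations docs.
plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
```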

### Expert LoRA

To also apply LoRA adapters to expert weights, add `lora_target_parameters` to your config.

Note: `lora_dropout` must be `0` when using `lora_target_parameters`.

```yaml
lora_target_parameters:
  - mlp.experts.gate_up_proj
  - mlp.experts.down_proj
  # - mlp.gate.weight  # router, untested but should work, not normally targeted
```
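
For context, here is a sketch of how the expert targets and the dropout requirement might sit together in the adapter section of a QLoRA config. Only `lora_dropout: 0`, the `lora_target_parameters` entries, and the `adapter` / `load_in_4bit` / `quantize_moe_experts` keys come from this guide; the `lora_r` and `lora_alpha` values are illustrative assumptions, not the values used in the example files.

```yaml
adapter: qlora
load_in_4bit: true
quantize_moe_experts: true

lora_r: 16        # illustrative value (assumption)
lora_alpha: 32    # illustrative value (assumption)
lora_dropout: 0   # must be 0 when lora_target_parameters is set

lora_target_parameters:
  - mlp.experts.gate_up_proj
  - mlp.experts.down_proj
```

Per the VRAM figures above, targeting experts fit on a single 48GB GPU with QLoRA (~34GiB) but ran out of memory with 8-bit LoRA.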

## Limitations

- **FSDP VRAM**: FSDP2 may use more VRAM per GPU than single-GPU training. We suspect not all layers are properly sharded across ranks.
- **FSDP initial spike**: FSDP LoRA (8-bit) may show a large VRAM spike during the first 1-2 steps, after which usage drops. FSDP QLoRA (4-bit) does not exhibit this.
- **cpu_ram_efficient_loading**: Must be set to `false` with FSDP2; training hangs otherwise.
- **lora_target_linear**: Incompatible with this model.
- **LoRA kernels**: Incompatible with this model due to non-standard attention projections (DSA). Must be explicitly disabled (`lora_*_kernel: false`).
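
The config-related limitations above come down to a handful of explicit settings. The fragment below is a sketch under assumptions rather than a copy of the example files: the three kernel keys are our expansion of `lora_*_kernel`, and nesting `cpu_ram_efficient_loading` under `fsdp_config` alongside `fsdp_version: 2` follows Axolotl's FSDP2 config layout as we understand it.

```yaml
# lora_target_linear is incompatible with this model: leave it unset or false.
lora_target_linear: false

# LoRA kernels must be explicitly disabled.
# Assumption: these three keys are what `lora_*_kernel` expands to.
lora_qkv_kernel: false
lora_o_kernel: false
lora_mlp_kernel: false

# FSDP2 runs only: efficient CPU RAM loading causes a hang, so turn it off.
# Assumption: key placement follows Axolotl's FSDP2 config layout.
fsdp_version: 2
fsdp_config:
  cpu_ram_efficient_loading: false
```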

### TIPS

- For inference, the Z.ai team recommends these default settings for most tasks:
  - `temperature: 1.0`
  - `top_p: 0.95`
  - `max_new_tokens: 131072`
- To run full fine-tuning, remove `adapter: qlora`, `load_in_4bit: true`, and `quantize_moe_experts: true` from the config. Full fine-tuning is memory-intensive, and we have not tested it.
- Read more on how to load your own dataset in the [docs](https://docs.axolotl.ai/docs/dataset_loading.html).

## Optimization Guides

Please check the [Optimizations doc](https://docs.axolotl.ai/docs/optimizations.html).

## Related Resources

- [GLM-4.7-Flash on HuggingFace](https://huggingface.co/zai-org/GLM-4.7-Flash)
- [GLM-4.7 Blog](https://z.ai/blog/glm-4.7)
- [Axolotl Docs](https://docs.axolotl.ai)
- [Axolotl Website](https://axolotl.ai)
- [Axolotl GitHub](https://github.com/axolotl-ai-cloud/axolotl)
- [Axolotl Discord](https://discord.gg/7m9sfhzaf3)