# Finetune Qwen3.5 with Axolotl

[Qwen3.5](https://huggingface.co/collections/Qwen/qwen35) is a hybrid architecture model series combining Gated DeltaNet linear attention with standard Transformer attention. All Qwen3.5 models are early-fusion vision-language models: dense variants use `Qwen3_5ForConditionalGeneration` and MoE variants use `Qwen3_5MoeForConditionalGeneration`.

## Getting started

1. Install Axolotl following the [installation guide](https://docs.axolotl.ai/docs/installation.html).

2. Install [Cut Cross Entropy](https://docs.axolotl.ai/docs/custom_integrations.html#cut-cross-entropy) to reduce training VRAM usage.

3. Install FLA for sample packing support with the Gated DeltaNet linear attention layers:

   ```bash
   uv pip uninstall causal-conv1d && uv pip install flash-linear-attention==0.4.1
   ```

   > FLA is required when `sample_packing: true`. Without it, training raises a `RuntimeError` on packed sequences. Vision configs use `sample_packing: false`, so FLA is optional there.

4. Pick any config from the table below and run:

   ```bash
   axolotl train examples/qwen3.5/<config>.yaml
   ```

Available configs:

| Config | Model | Type | Peak VRAM |
|---|---|---|---|
| `9b-lora-vision.yaml` | Qwen3.5-9B | Vision+text LoRA, single GPU | - |
| `9b-fft-vision.yaml` | Qwen3.5-9B | Vision+text FFT, single GPU | ~61 GiB |
| `27b-qlora.yaml` | Qwen3.5-27B | Dense, text-only QLoRA | ~47 GiB |
| `27b-fft.yaml` | Qwen3.5-27B | Dense, text-only FFT (vision frozen) | ~53 GiB |
| `27b-qlora-fsdp.yaml` | Qwen3.5-27B | Dense, text-only QLoRA + FSDP2 | - |
| `35b-a3b-moe-qlora.yaml` | Qwen3.5-35B-A3B | MoE, text-only QLoRA | - |
| `35b-a3b-moe-qlora-fsdp.yaml` | Qwen3.5-35B-A3B | MoE, text-only QLoRA + FSDP2 | - |
| `122b-a10b-moe-qlora.yaml` | Qwen3.5-122B-A10B | MoE, text-only QLoRA | - |
| `122b-a10b-moe-qlora-fsdp.yaml` | Qwen3.5-122B-A10B | MoE, text-only QLoRA + FSDP2 | - |
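
As a rough illustration of the FLA note in step 3, the packing behaviour is controlled by a single key in the config. The fragment below is only a sketch showing that key, not a complete config; the comments restate the note and add nothing beyond it.

```yaml
# Packing enabled: flash-linear-attention must be installed (step 3),
# otherwise training raises a RuntimeError on packed sequences.
sample_packing: true

# Vision configs use the following instead, which makes FLA optional:
# sample_packing: false
```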

### Gated DeltaNet Linear Attention

Qwen3.5 interleaves standard attention with Gated DeltaNet linear attention layers. To apply LoRA to them, add to `lora_target_modules`:

```yaml
lora_target_modules:
  # ... standard projections ...
  - linear_attn.in_proj_qkv
  - linear_attn.in_proj_z
  - linear_attn.out_proj
```
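
Put together, a `lora_target_modules` list covering both attention types might look like the sketch below. It assumes the "standard projections" are the `q_proj`, `k_proj`, `v_proj`, and `o_proj` entries listed in the Shared Experts section further down; the example config you start from may use a different list, so treat this as illustrative.

```yaml
lora_target_modules:
  # Standard attention projections (assumed; keep whatever your config already lists)
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  # Gated DeltaNet linear attention projections
  - linear_attn.in_proj_qkv
  - linear_attn.in_proj_z
  - linear_attn.out_proj
```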

### Routed Experts (MoE)

To apply LoRA to routed expert parameters, add `lora_target_parameters`:

```yaml
lora_target_parameters:
  - mlp.experts.gate_up_proj
  - mlp.experts.down_proj
  # - mlp.gate.weight  # router
```

### Shared Experts (MoE)

Shared experts use `nn.Linear` (unlike routed experts, which are 3D `nn.Parameter` tensors), so they can be targeted via `lora_target_modules`. To train the shared expert projections alongside attention, uncomment `gate_up_proj` and `down_proj` in `lora_target_modules`:

```yaml
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  # Add gate_up_proj and down_proj to also target shared experts (nn.Linear):
  # - gate_up_proj
  # - down_proj
```

Use `lora_target_parameters` (see [Routed Experts](#routed-experts-moe) above) to target routed experts separately.
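
For a MoE config that targets attention, shared experts, and routed experts at once, the two keys could be combined roughly as follows. This is a sketch assembled from the fragments in this section, not a copy of a shipped config; whether training all of these together fits your VRAM budget is not something this README measures.

```yaml
# Attention plus shared experts (nn.Linear), matched by module name
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_up_proj
  - down_proj

# Routed experts (3D nn.Parameter tensors), matched by parameter name
lora_target_parameters:
  - mlp.experts.gate_up_proj
  - mlp.experts.down_proj
  # - mlp.gate.weight  # router, left disabled as in the fragment above
```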

### Tips

- For inference hyperparameters, see the respective model card.
- To run a full finetune with one of the smaller configs, remove `adapter: qlora` and `load_in_4bit: true`. See [Optimization Guides](#optimization-guides) below.
- Read more on loading your own dataset in the [docs](https://docs.axolotl.ai/docs/dataset_loading.html).
- The dataset format follows the OpenAI Messages format, as described [here](https://docs.axolotl.ai/docs/dataset-formats/conversation.html#chat_template).
- For **multimodal** finetuning, set `processor_type: AutoProcessor`, `skip_prepare_dataset: true`, and `remove_unused_columns: false`, as shown in `9b-lora-vision.yaml`.
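
The two config-level tips above can be sketched as a fragment. Only the keys named in the tips appear here; everything else in a real config (model, datasets, and so on) is omitted, so this is not runnable on its own.

```yaml
# Multimodal (vision+text) finetuning: the three settings named in the tip above
processor_type: AutoProcessor
skip_prepare_dataset: true
remove_unused_columns: false

# Full finetune from a QLoRA config: delete these two keys
# (shown commented out here to mark what gets removed)
# adapter: qlora
# load_in_4bit: true
```

`9b-lora-vision.yaml` remains the reference for the multimodal settings in context.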

## Optimization Guides

- [Optimizations Guide](https://docs.axolotl.ai/docs/optimizations.html)

## Related Resources

- [Qwen3.5 Blog](https://qwenlm.github.io/blog/qwen3.5/)
- [Axolotl Docs](https://docs.axolotl.ai)
- [Axolotl Website](https://axolotl.ai)
- [Axolotl GitHub](https://github.com/axolotl-ai-cloud/axolotl)
- [Axolotl Discord](https://discord.gg/7m9sfhzaf3)