Finetune Qwen3.5 with Axolotl

Qwen3.5 is a hybrid architecture model series combining Gated DeltaNet linear attention with standard Transformer attention. All Qwen3.5 models are early-fusion vision-language models: dense variants use Qwen3_5ForConditionalGeneration and MoE variants use Qwen3_5MoeForConditionalGeneration.

Getting started

  1. Install Axolotl following the installation guide.

  2. Install Cut Cross Entropy to reduce training VRAM usage.

  3. Install FLA for sample packing support with the Gated DeltaNet linear attention layers:

uv pip uninstall causal-conv1d && uv pip install flash-linear-attention==0.4.1

FLA is required whenever sample_packing: true is set. Without it, training raises a RuntimeError on packed sequences. The vision configs use sample_packing: false, so FLA is optional for them.

  4. Pick any config from the table below and run:

    axolotl train examples/qwen3.5/<config>.yaml
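
As a rough illustration of the sample packing requirement from the FLA step, the relevant key looks like the fragment below. Only sample_packing and its two values come from this README; treat this as a sketch to compare against the example config you picked, not as a complete config.

```yaml
# Text-only configs: packing is enabled, so flash-linear-attention (FLA)
# must be installed. Without it, training raises a RuntimeError on
# packed sequences.
sample_packing: true

# Vision configs: packing is disabled, so FLA is optional.
# sample_packing: false
```

If you switch a config from one mode to the other, check whether FLA is installed before starting the run.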
    

Available configs:

| Config | Model | Type | Peak VRAM |
| --- | --- | --- | --- |
| 9b-lora-vision.yaml | Qwen3.5-9B | Vision+text LoRA, single GPU | |
| 9b-fft-vision.yaml | Qwen3.5-9B | Vision+text FFT, single GPU | ~61 GiB |
| 27b-qlora.yaml | Qwen3.5-27B | Dense, text-only QLoRA | ~47 GiB |
| 27b-fft.yaml | Qwen3.5-27B | Dense, text-only FFT (vision frozen) | ~53 GiB |
| 27b-qlora-fsdp.yaml | Qwen3.5-27B | Dense, text-only QLoRA + FSDP2 | |
| 35b-a3b-moe-qlora.yaml | Qwen3.5-35B-A3B | MoE, text-only QLoRA | |
| 35b-a3b-moe-qlora-fsdp.yaml | Qwen3.5-35B-A3B | MoE, text-only QLoRA + FSDP2 | |
| 122b-a10b-moe-qlora.yaml | Qwen3.5-122B-A10B | MoE, text-only QLoRA | |
| 122b-a10b-moe-qlora-fsdp.yaml | Qwen3.5-122B-A10B | MoE, text-only QLoRA + FSDP2 | |

Gated DeltaNet Linear Attention

Qwen3.5 interleaves standard attention layers with Gated DeltaNet linear attention layers. To apply LoRA to the linear attention layers as well, add their projections to lora_target_modules:

lora_target_modules:
  # ... standard projections ...
  - linear_attn.in_proj_qkv
  - linear_attn.in_proj_z
  - linear_attn.out_proj

Routed Experts (MoE)

To apply LoRA to routed expert parameters, add lora_target_parameters:

lora_target_parameters:
  - mlp.experts.gate_up_proj
  - mlp.experts.down_proj
#  - mlp.gate.weight  # router

Shared Experts (MoE)

Shared experts use nn.Linear (unlike routed experts, which are 3D nn.Parameter tensors), so they can be targeted via lora_target_modules. To train the shared expert projections alongside attention, uncomment gate_up_proj and down_proj in lora_target_modules:

lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  # Add gate_up_proj and down_proj to also target shared experts (nn.Linear):
  # - gate_up_proj
  # - down_proj

Use lora_target_parameters (see Routed Experts above) to target routed experts separately.
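
Putting the three sections together, the LoRA targets for an MoE config might look like the sketch below. Every module and parameter name is taken from the fragments above. Two things are assumptions made for illustration: that the "standard projections" are q_proj, k_proj, v_proj and o_proj, and that all groups are enabled together. Trim the lists to what you actually intend to train.

```yaml
lora_target_modules:
  # Standard attention projections (assumed to be the "standard projections")
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  # Gated DeltaNet linear attention projections
  - linear_attn.in_proj_qkv
  - linear_attn.in_proj_z
  - linear_attn.out_proj
  # Shared experts are nn.Linear modules; uncomment to train them too
  # - gate_up_proj
  # - down_proj

# Routed experts are 3D nn.Parameter tensors, so they are matched by
# parameter name rather than by module name.
lora_target_parameters:
  - mlp.experts.gate_up_proj
  - mlp.experts.down_proj
  # - mlp.gate.weight  # router
```

Keeping the two lists separate mirrors the distinction drawn above: lora_target_modules covers the nn.Linear layers, lora_target_parameters covers the raw expert tensors.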

TIPS

  • For inference hyperparameters, see the respective model card.
  • You can run full finetuning with the smaller configs by removing adapter: qlora and load_in_4bit: true. See Multi-GPU below.
  • Read more on loading your own dataset in the docs.
  • The dataset format follows the OpenAI Messages format as seen here.
  • For multimodal finetuning, set processor_type: AutoProcessor, skip_prepare_dataset: true, and remove_unused_columns: false as shown in 9b-lora-vision.yaml.
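
The two config-related tips above can be sketched as the fragment below. The keys and values are the ones named in the tips. The two parts apply to different configs (a text-only QLoRA config and a vision config), so this is an illustration, not a file to use as-is.

```yaml
# Tip: full finetuning of a smaller config.
# Start from a QLoRA config and delete these two keys:
# adapter: qlora
# load_in_4bit: true

# Tip: multimodal finetuning (compare with 9b-lora-vision.yaml).
processor_type: AutoProcessor
skip_prepare_dataset: true
remove_unused_columns: false
```

Full finetuning needs considerably more memory than QLoRA, as the Peak VRAM column in the config table suggests, so check the available VRAM before removing the adapter settings.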

Optimization Guides