NanoCode012 9de5b76336 feat: move to uv first (#3545)

2026-04-21 10:16:03 -04:00

Finetune Qwen3.5 with Axolotl

Qwen3.5 is a hybrid architecture model series combining Gated DeltaNet linear attention with standard Transformer attention. All Qwen3.5 models are early-fusion vision-language models: dense variants use Qwen3_5ForConditionalGeneration and MoE variants use Qwen3_5MoeForConditionalGeneration.

Getting started

Install Axolotl following the installation guide.
Install Cut Cross Entropy to reduce training VRAM usage.
Install FLA for sample packing support with the Gated DeltaNet linear attention layers:

uv pip uninstall causal-conv1d && uv pip install flash-linear-attention==0.4.1

FLA is required when sample_packing: true. Without it, training raises a RuntimeError on packed sequences. Vision configs use sample_packing: false, so FLA is optional there.

Pick any config from the table below and run:

axolotl train examples/qwen3.5/<config>.yaml

Available configs:

Config	Model	Type	Peak VRAM
`9b-lora-vision.yaml`	Qwen3.5-9B	Vision+text LoRA, single GPU	—
`9b-fft-vision.yaml`	Qwen3.5-9B	Vision+text FFT, single GPU	~61 GiB
`27b-qlora.yaml`	Qwen3.5-27B	Dense, text-only QLoRA	~47 GiB
`27b-fft.yaml`	Qwen3.5-27B	Dense, text-only FFT (vision frozen)	~53 GiB
`27b-qlora-fsdp.yaml`	Qwen3.5-27B	Dense, text-only QLoRA + FSDP2	—
`35b-a3b-moe-qlora.yaml`	Qwen3.5-35B-A3B	MoE, text-only QLoRA	—
`35b-a3b-moe-qlora-fsdp.yaml`	Qwen3.5-35B-A3B	MoE, text-only QLoRA + FSDP2	—
`122b-a10b-moe-qlora.yaml`	Qwen3.5-122B-A10B	MoE, text-only QLoRA	—
`122b-a10b-moe-qlora-fsdp.yaml`	Qwen3.5-122B-A10B	MoE, text-only QLoRA + FSDP2	—
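As a minimal sketch of the sample packing note in Getting started, the fragment below shows the one setting that decides whether FLA is needed. Only the sample_packing key is taken from this README; every other key in a real config is omitted here.

```yaml
# Text-only configs: packing is enabled, so flash-linear-attention (FLA)
# must be installed, otherwise packed sequences raise a RuntimeError.
sample_packing: true

# Vision configs set this to false instead, which makes FLA optional:
# sample_packing: false
```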

Gated DeltaNet Linear Attention

Qwen3.5 interleaves standard attention with Gated DeltaNet linear attention layers. To apply LoRA to the linear attention layers, add their projections to lora_target_modules:

lora_target_modules:
  # ... standard projections ...
  - linear_attn.in_proj_qkv
  - linear_attn.in_proj_z
  - linear_attn.out_proj
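For illustration, a complete list might look like the sketch below. It assumes the elided "standard projections" are the four attention projections listed under Shared Experts (q_proj, k_proj, v_proj, o_proj); check the shipped example configs for the exact list they use.

```yaml
lora_target_modules:
  # Standard attention projections (assumption: same four as in the
  # Shared Experts section of this README)
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  # Gated DeltaNet linear attention projections
  - linear_attn.in_proj_qkv
  - linear_attn.in_proj_z
  - linear_attn.out_proj
```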

Routed Experts (MoE)

To apply LoRA to routed expert parameters, add lora_target_parameters:

lora_target_parameters:
  - mlp.experts.gate_up_proj
  - mlp.experts.down_proj
  # - mlp.gate.weight  # router

Shared Experts (MoE)

Shared experts use nn.Linear (unlike routed experts, which are 3D nn.Parameter tensors), so they can be targeted via lora_target_modules. To train shared expert projections alongside attention, uncomment gate_up_proj and down_proj in lora_target_modules:

lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  # Add gate_up_proj and down_proj to also target shared experts (nn.Linear):
  # - gate_up_proj
  # - down_proj

Use lora_target_parameters (see Routed Experts above) to target routed experts separately.
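Putting the two MoE sections together, a config that targets attention, shared experts, and routed experts at once could be sketched as follows. This is an assembled illustration using only the keys and names shown above, not a copy of any shipped config.

```yaml
# Modules backed by nn.Linear: attention plus shared experts
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_up_proj   # shared experts
  - down_proj      # shared experts

# Routed experts are 3D nn.Parameter tensors, so they are listed separately
lora_target_parameters:
  - mlp.experts.gate_up_proj
  - mlp.experts.down_proj
```

Leaving the router (mlp.gate.weight) out of lora_target_parameters matches the commented-out default shown in the Routed Experts section.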

TIPS

For inference hyperparameters, see the respective model card.
You can run full finetuning on the smaller configs by removing adapter: qlora and load_in_4bit: true. See Multi-GPU below.
Read more on loading your own dataset in the docs.
The dataset format follows the OpenAI Messages format as seen here.
For multimodal finetuning, set processor_type: AutoProcessor, skip_prepare_dataset: true, and remove_unused_columns: false as shown in 9b-lora-vision.yaml.
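The multimodal tip above can be sketched as the config fragment below. The three keys and their values come from that tip, and sample_packing: false follows the note in Getting started; the rest of the config is omitted, so treat 9b-lora-vision.yaml as the authoritative version.

```yaml
# Multimodal (vision+text) finetuning settings
processor_type: AutoProcessor
skip_prepare_dataset: true
remove_unused_columns: false
sample_packing: false   # vision configs do not use packing
```

Similarly, switching a QLoRA config to full finetuning amounts to dropping the two keys named in the tips, shown here commented out:

```yaml
# Removed for full finetuning:
# adapter: qlora
# load_in_4bit: true
```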

Optimization Guides

Optimizations Guide
