Finetune Qwen3.5 with Axolotl

Qwen3.5 is a hybrid architecture model series combining Gated DeltaNet linear attention with standard Transformer attention. All Qwen3.5 models are early-fusion vision-language models: dense variants use Qwen3_5ForConditionalGeneration and MoE variants use Qwen3_5MoeForConditionalGeneration.

Vision and text tokens are processed through the same transformer stack. The configs below train on text-only data unless noted otherwise. See 9b-lora-vision.yaml for a multimodal example.
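
For reference, a text-only dataset stanza in one of these configs looks roughly like the sketch below. The dataset path is a hypothetical placeholder; the shipped configs define their own datasets.

datasets:
  - path: your-org/your-chat-dataset   # hypothetical placeholder; any dataset in OpenAI messages format
    type: chat_template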

Available configs:

Config                    Model              Type                                       Peak VRAM
27b-qlora.yaml            Qwen3.5-27B        Dense VLM, text-only QLoRA                 ~47 GiB
27b-fft.yaml              Qwen3.5-27B        Dense VLM, text-only FFT (vision frozen)   ~53 GiB
35b-a3b-moe-qlora.yaml    Qwen3.5-35B-A3B    MoE, text-only QLoRA
122b-a10b-moe-qlora.yaml  Qwen3.5-122B-A10B  MoE, text-only QLoRA
9b-lora-vision.yaml       Qwen3.5-9B         Vision+text LoRA, single GPU
9b-fft-vision.yaml        Qwen3.5-9B         Vision+text FFT, single GPU                ~61 GiB

Getting started

  1. Install Axolotl following the installation guide.

  2. Install Cut Cross Entropy to reduce training VRAM usage.

  3. Install FLA for sample packing support with the Gated DeltaNet linear attention layers:

pip3 uninstall -y causal-conv1d && pip3 install flash-linear-attention==0.4.1

FLA is required when sample_packing: true. Without it, training raises a RuntimeError on packed sequences. Vision configs use sample_packing: false, so FLA is optional there; the relevant config keys are sketched after the example commands below.

  4. Run a finetuning example:
# Dense 27B text-only (QLoRA, ~47 GiB VRAM with sample packing)
axolotl train examples/qwen3.5/27b-qlora.yaml

# Dense 27B text-only FFT with vision encoder frozen (~53 GiB, single 80 GiB GPU)
axolotl train examples/qwen3.5/27b-fft.yaml

# MoE 35B-A3B text-only (QLoRA)
axolotl train examples/qwen3.5/35b-a3b-moe-qlora.yaml

# MoE 122B-A10B text-only (QLoRA)
axolotl train examples/qwen3.5/122b-a10b-moe-qlora.yaml

# 9B vision+text (LoRA, multimodal dataset)
axolotl train examples/qwen3.5/9b-lora-vision.yaml

# 9B vision+text FFT, single 80 GiB GPU (~61 GiB peak)
axolotl train examples/qwen3.5/9b-fft-vision.yaml
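
As a rough illustration of the packing note from step 3, these are the relevant config keys; the values shown are assumptions for illustration rather than a copy of any shipped config.

# Text-only configs: packing requires flash-linear-attention for the Gated DeltaNet layers
sample_packing: true
sequence_len: 4096        # assumed value; check the shipped configs for the real setting

# Vision configs disable packing instead, so FLA stays optional there
# sample_packing: false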

TIPS

  • For inference, you can experiment with temperature: 0.7, top_p: 0.8, top_k: 20, and min_p: 0.
  • For text-only FFT on the 27B model, use 27b-fft.yaml, which uses unfrozen_parameters to keep the vision encoder (model.visual.*) frozen; this avoids wasting optimizer state on parameters that receive no gradient from text-only data.
  • You can run full finetuning with the smaller configs by removing adapter: qlora and load_in_4bit: true. See Multi-GPU below.
  • Read more on loading your own dataset in the docs.
  • The dataset format follows the OpenAI Messages format as seen here.
  • For multimodal finetuning, set processor_type: AutoProcessor, skip_prepare_dataset: true, and remove_unused_columns: false as shown in 9b-lora-vision.yaml.
  • The Gated DeltaNet linear attention layers (linear_attn.*) can optionally be added to lora_target_modules; they are commented out by default (see the illustrative excerpt after this list).
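
The sketch below gathers the config keys mentioned in these tips in one place. It is a minimal illustration rather than a copy of a shipped config: the LoRA target module names are assumed placeholders, and the real values live in the example YAML files.

# Multimodal finetuning keys (see 9b-lora-vision.yaml for the full config)
processor_type: AutoProcessor
skip_prepare_dataset: true
remove_unused_columns: false

# LoRA targets; this module list is illustrative only
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  # The Gated DeltaNet projections (linear_attn.*) can be added here; the exact
  # entries are listed, commented out, in the shipped configs.

# Text-only FFT with a frozen vision encoder (27b-fft.yaml): unfrozen_parameters
# lists the trainable parameter patterns and omits model.visual.*; see that file
# for the actual patterns.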

Optimization Guides