
Finetune Qwen3.5 with Axolotl

Qwen3.5 is a hybrid architecture model series combining Gated DeltaNet linear attention with standard Transformer attention. All Qwen3.5 models are early-fusion vision-language models: dense variants use Qwen3_5ForConditionalGeneration and MoE variants use Qwen3_5MoeForConditionalGeneration.

Getting started

  1. Install Axolotl following the installation guide.

  2. Install Cut Cross Entropy to reduce training VRAM usage.

  3. Install FLA for sample packing support with the Gated DeltaNet linear attention layers:

pip3 uninstall -y causal-conv1d && pip3 install flash-linear-attention==0.4.1

FLA is required when sample_packing: true. Without it, training raises a RuntimeError on packed sequences. Vision configs use sample_packing: false so FLA is optional there.
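As a reference, the relevant toggle looks like this (a minimal sketch; the surrounding options come from whichever example config you start from):

sample_packing: true   # requires flash-linear-attention; set to false (as in the vision configs) if FLA is not installed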

  4. Pick any config from the table below and run:

    axolotl train examples/qwen3.5/<config>.yaml
    

Available configs:

| Config                        | Model             | Type                                 | Peak VRAM |
|-------------------------------|-------------------|--------------------------------------|-----------|
| 9b-lora-vision.yaml           | Qwen3.5-9B        | Vision+text LoRA, single GPU         |           |
| 9b-fft-vision.yaml            | Qwen3.5-9B        | Vision+text FFT, single GPU          | ~61 GiB   |
| 27b-qlora.yaml                | Qwen3.5-27B       | Dense, text-only QLoRA               | ~47 GiB   |
| 27b-fft.yaml                  | Qwen3.5-27B       | Dense, text-only FFT (vision frozen) | ~53 GiB   |
| 27b-qlora-fsdp.yaml           | Qwen3.5-27B       | Dense, text-only QLoRA + FSDP2       |           |
| 35b-a3b-moe-qlora.yaml        | Qwen3.5-35B-A3B   | MoE, text-only QLoRA                 |           |
| 35b-a3b-moe-qlora-fsdp.yaml   | Qwen3.5-35B-A3B   | MoE, text-only QLoRA + FSDP2         |           |
| 122b-a10b-moe-qlora.yaml      | Qwen3.5-122B-A10B | MoE, text-only QLoRA                 |           |
| 122b-a10b-moe-qlora-fsdp.yaml | Qwen3.5-122B-A10B | MoE, text-only QLoRA + FSDP2         |           |

Gated DeltaNet Linear Attention

Qwen3.5 interleaves standard attention with Gated DeltaNet linear attention layers. To apply LoRA to them, add to lora_target_modules:

lora_target_modules:
  # ... standard projections ...
  - linear_attn.in_proj_qkv
  - linear_attn.in_proj_z
  - linear_attn.out_proj
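For reference, a complete target list might look like the sketch below. The standard projection names (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj) are an assumption based on typical Qwen-style dense layers; check the example configs for the exact list they use:

lora_target_modules:
  # standard attention/MLP projections (assumed typical Qwen-style names)
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj
  # Gated DeltaNet linear attention projections
  - linear_attn.in_proj_qkv
  - linear_attn.in_proj_z
  - linear_attn.out_proj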

Routed Experts (MoE)

To apply LoRA to routed expert parameters, add lora_target_parameters:

lora_target_parameters:
  - mlp.experts.gate_up_proj
  - mlp.experts.down_proj
#  - mlp.gate.weight  # router

Shared Experts (MoE)

Routed experts and shared experts both have gate_up_proj/down_proj, so a plain module name in lora_target_modules would match both. Use a regex to target only attention and shared expert projections, while lora_target_parameters above handles routed experts separately:

lora_target_modules: 'model\.(language_model\.)?layers\.[\d]+\.(mlp|self_attn)\.(shared_expert\.)?(up|down|gate|gate_up|q|k|v|o)_proj'
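Putting the two together, a MoE LoRA targeting section could look like the following sketch (the regex and parameter names are taken from the snippets above; including the router is optional):

# regex matches attention and shared-expert projections only
lora_target_modules: 'model\.(language_model\.)?layers\.[\d]+\.(mlp|self_attn)\.(shared_expert\.)?(up|down|gate|gate_up|q|k|v|o)_proj'
# routed expert weights are targeted by parameter name instead
lora_target_parameters:
  - mlp.experts.gate_up_proj
  - mlp.experts.down_proj
#  - mlp.gate.weight  # router (optional)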

TIPS

  • For inference hyperparameters, see the respective model cards.
  • You can run full finetuning with the smaller configs by removing adapter: qlora and load_in_4bit: true. See Multi-GPU below.
  • Read more on loading your own dataset in the docs.
  • The dataset format follows the OpenAI Messages format.
  • For multimodal finetuning, set processor_type: AutoProcessor, skip_prepare_dataset: true, and remove_unused_columns: false as shown in 9b-lora-vision.yaml (see the sketch after this list).
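For the multimodal case, the settings named in the tip above look like this in YAML (a minimal excerpt; see 9b-lora-vision.yaml for the full config and exact placement):

processor_type: AutoProcessor
skip_prepare_dataset: true
remove_unused_columns: false
sample_packing: false  # packing is disabled for the vision configs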

Optimization Guides