# Finetune Qwen3.5 with Axolotl
Qwen3.5 is a hybrid architecture model series combining Gated DeltaNet linear attention with standard Transformer attention. All Qwen3.5 models are early-fusion vision-language models: dense variants use `Qwen3_5ForConditionalGeneration` and MoE variants use `Qwen3_5MoeForConditionalGeneration`.
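As a quick orientation, the `base_model` field in a config is what selects between the two architectures. The Hub ids below are hypothetical, inferred from the model names in the table in this guide; verify them against the released model cards:

```yaml
# Hypothetical Hub ids for illustration only; check the actual model cards.
base_model: Qwen/Qwen3.5-9B          # dense -> Qwen3_5ForConditionalGeneration
# base_model: Qwen/Qwen3.5-35B-A3B   # MoE   -> Qwen3_5MoeForConditionalGeneration
```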
## Getting started

1. Install Axolotl following the installation guide.

2. Install Cut Cross Entropy to reduce training VRAM usage (see the config sketch after this list).

3. Install FLA for sample packing support with the Gated DeltaNet linear attention layers:

   ```bash
   pip3 uninstall -y causal-conv1d && pip3 install flash-linear-attention==0.4.1
   ```

   FLA is required when `sample_packing: true`. Without it, training raises a `RuntimeError` on packed sequences. Vision configs use `sample_packing: false`, so FLA is optional there.

4. Pick any config from the table below and run:

   ```bash
   axolotl train examples/qwen3.5/<config>.yaml
   ```
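Steps 2 and 3 correspond to config options. A minimal sketch, assuming the plugin path from Axolotl's Cut Cross Entropy integration docs:

```yaml
# Enable Cut Cross Entropy via Axolotl's plugin system
# (plugin path assumed from the cut_cross_entropy integration docs).
plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
cut_cross_entropy: true

# Packed sequences require FLA (step 3); vision configs keep this false.
sample_packing: true
```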
Available configs:
| Config | Model | Type | Peak VRAM |
|---|---|---|---|
| `9b-lora-vision.yaml` | Qwen3.5-9B | Vision+text LoRA, single GPU | — |
| `9b-fft-vision.yaml` | Qwen3.5-9B | Vision+text FFT, single GPU | ~61 GiB |
| `27b-qlora.yaml` | Qwen3.5-27B | Dense, text-only QLoRA | ~47 GiB |
| `27b-fft.yaml` | Qwen3.5-27B | Dense, text-only FFT (vision frozen) | ~53 GiB |
| `27b-qlora-fsdp.yaml` | Qwen3.5-27B | Dense, text-only QLoRA + FSDP2 | — |
| `35b-a3b-moe-qlora.yaml` | Qwen3.5-35B-A3B | MoE, text-only QLoRA | — |
| `35b-a3b-moe-qlora-fsdp.yaml` | Qwen3.5-35B-A3B | MoE, text-only QLoRA + FSDP2 | — |
| `122b-a10b-moe-qlora.yaml` | Qwen3.5-122B-A10B | MoE, text-only QLoRA | — |
| `122b-a10b-moe-qlora-fsdp.yaml` | Qwen3.5-122B-A10B | MoE, text-only QLoRA + FSDP2 | — |
## Gated DeltaNet Linear Attention
Qwen3.5 interleaves standard attention with Gated DeltaNet linear attention layers. To apply LoRA to them, add to `lora_target_modules`:

```yaml
lora_target_modules:
  # ... standard projections ...
  - linear_attn.in_proj_qkv
  - linear_attn.in_proj_z
  - linear_attn.out_proj
```
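For reference, a complete list might look like the following sketch; the standard projection names are inferred from the regex in the Shared Experts section below, not copied from a shipped config:

```yaml
lora_target_modules:
  # standard attention / MLP projections (names inferred from the regex below)
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj
  # Gated DeltaNet linear attention projections
  - linear_attn.in_proj_qkv
  - linear_attn.in_proj_z
  - linear_attn.out_proj
```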
## Routed Experts (MoE)
To apply LoRA to routed expert parameters, add `lora_target_parameters`:

```yaml
lora_target_parameters:
  - mlp.experts.gate_up_proj
  - mlp.experts.down_proj
  # - mlp.gate.weight # router
```
## Shared Experts (MoE)
Routed experts and shared experts both have `gate_up_proj`/`down_proj`, so a plain module name in `lora_target_modules` would match both. Use a regex to target only the attention and shared-expert projections, while `lora_target_parameters` above handles the routed experts separately:

```yaml
lora_target_modules: 'model\.(language_model\.)?layers\.[\d]+\.(mlp|self_attn)\.(shared_expert\.)?(up|down|gate|gate_up|q|k|v|o)_proj'
```
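Putting the pieces together, the adapter section of a MoE config might look like the following; this is a sketch assembled from the snippets above, not a verbatim copy of a shipped config:

```yaml
adapter: qlora
load_in_4bit: true
# attention + shared-expert projections via regex (avoids matching routed experts)
lora_target_modules: 'model\.(language_model\.)?layers\.[\d]+\.(mlp|self_attn)\.(shared_expert\.)?(up|down|gate|gate_up|q|k|v|o)_proj'
# routed-expert weights targeted as parameters
lora_target_parameters:
  - mlp.experts.gate_up_proj
  - mlp.experts.down_proj
```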
## Tips
- For inference hyperparameters, please see the respective model card details.
- You can run a full finetune of the smaller configs by removing `adapter: qlora` and `load_in_4bit: true`. See Multi-GPU below.
- Read more on loading your own dataset in the docs.
- The dataset format follows the OpenAI Messages format as seen here.
- For multimodal finetuning, set `processor_type: AutoProcessor`, `skip_prepare_dataset: true`, and `remove_unused_columns: false` as shown in `9b-lora-vision.yaml` (see the sketch below this list).
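A minimal sketch of the multimodal-specific settings from the last tip, mirroring what `9b-lora-vision.yaml` sets:

```yaml
processor_type: AutoProcessor
skip_prepare_dataset: true
remove_unused_columns: false
sample_packing: false  # vision configs don't pack, so FLA is optional
```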