Files

Wing Lian e4032fc90f Refactor separate attention flags with attn_implementation and capability/concerns feature flags (#3602 )

* upgrade to torchao 0.17.0

* chore: lint

* refactor attention handling

* replace legacy attention boolean flags with capability properties

Replace checks with capability-based properties derived from attn_implementation

This separates three concerns that were conflated under flash_attention:
1. Backend selection -> attn_implementation enum
2. Packing capability -> attn_supports_packing property
3. Flash-attn library dependency -> attn_uses_flash_lib property

* compute attn capability flags in normalizer instead of properties

* make attn_implementation the single source of truth

* move attention-dependent validators to mode=after

* migrate remaining consumers to canonical attn_implementation

* expand attention tests + rewrite docs

* migrate example configs to canonical attn_implementation

* update doc snippets + reject gemma4-hybrid with non-FA2 backend

* remove dead gemma4 branch in _set_attention_config

* fix duplicate attn_implementation in gpt-oss yamls and flaky caplog tests

* drop "Phase 2" naming from attn-implementation tests

* regroup attn_implementation tests by feature concern

* clean up verbose comments and remove MD

Signed-off-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: Axolotl Swarm <no-reply@axolotl.ai>

* fix(collator): pass return_dict=True at apply_chat_template top level for transformers 5.x

In transformers 5.x, ProcessorMixin.apply_chat_template gained its own
`return_dict` parameter (defaulting to False).  When return_dict=False
and tokenize=True the method returns out["input_ids"] directly — a 2-D
tensor — rather than the full BatchFeature dict.

The old code placed `return_dict=True` inside processor_kwargs.  In
transformers 5.x those kwargs are forwarded to the underlying processor
call self(...) where _merge_kwargs silently ignores any key not present
in MllamaProcessorKwargs (emitting a warning).  The outer return_dict
therefore stayed False, apply_chat_template returned the raw input_ids
tensor, and the subsequent `batch["input_ids"]` attempted to index a
2-D tensor with the 9-character string "input_ids", producing:

  IndexError: too many indices for tensor of dimension 2

The fix is to pass return_dict=True as a top-level keyword argument to
apply_chat_template (where it is actually consumed) and remove it from
processor_kwargs (where it was silently dropped).  No version guard is
needed: transformers is pinned to ==5.5.4 in pyproject.toml.

Adds a unit-level regression test (tests/test_mm_chat_collator.py) that
mocks the processor to return a raw tensor when apply_chat_template is
called without top-level return_dict=True, verifying the four invariants:
process_rows returns a dict, input_ids is 2-D, labels is 2-D, and
apply_chat_template receives return_dict=True as a top-level kwarg.

Fixes: tests/e2e/test_llama_vision.py::TestLlamaVision::test_lora_llama_vision_multimodal_dataset
Fixes: tests/e2e/test_llama_vision.py::TestLlamaVision::test_lora_llama_vision_text_only_dataset
Signed-off-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: Axolotl Swarm <no-reply@axolotl.ai>

* fix(collator): process_rows returns dict (BatchFeature) shape

Two related changes for the multimodal chat collator under transformers 5.x:

1. Wrap apply_chat_template result in dict(...) so process_rows returns
   a plain dict rather than a BatchFeature instance. BatchFeature is a
   Mapping but not a dict; downstream code that did
     batch["labels"] = self.processing_strategy.process_labels(batch["input_ids"])
   would index on a tensor when the result wasn't dict-shaped, raising
     IndexError: too many indices for tensor of dimension 2

2. Soften the regression test's contract from `dict` to `Mapping` so it
   exercises the actual semantic guarantee (key/value access) rather
   than the implementation detail (dict vs BatchFeature). Test guards
   against the original transformers 5.x breakage where apply_chat_template's
   return_dict default went from True to False.

Includes regression test under tests/test_mm_chat_collator.py.

Bug surfaced via swarm dispatch task_01KQHPNAYD8XARSNSDJVW1GPF6 against
attn-implementation-refactor; squash-merged from agent commits 4de886fd
+ dc9fcf4f.

Signed-off-by: Wing Lian <wing@axolotl.ai>

---------

Signed-off-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: Axolotl Swarm <no-reply@axolotl.ai>

2026-05-05 10:15:18 -04:00

26b-a4b-moe-qlora.yaml

Refactor separate attention flags with attn_implementation and capability/concerns feature flags (#3602 )

2026-05-05 10:15:18 -04:00

31b-qlora-flex.yaml

Refactor separate attention flags with attn_implementation and capability/concerns feature flags (#3602 )

2026-05-05 10:15:18 -04:00

31b-qlora.yaml

Refactor separate attention flags with attn_implementation and capability/concerns feature flags (#3602 )

2026-05-05 10:15:18 -04:00

e2b-vision-lora.yaml

Refactor separate attention flags with attn_implementation and capability/concerns feature flags (#3602 )

2026-05-05 10:15:18 -04:00

README.md

chore: cleanup post release v0.16 (#3577 )

2026-04-06 10:10:52 -07:00

README.md

Finetune Google's Gemma 4 with Axolotl

Gemma 4 is a family of multimodal models from Google. This guide covers how to train them with Axolotl.

Getting started

Install Axolotl following the installation guide.
Install Cut Cross Entropy to reduce training VRAM usage.
Run the finetuning example:

# 26B MoE QLoRA (1x80GB @ ~50 GiB)
axolotl train examples/gemma4/26b-a4b-moe-qlora.yaml

# 31B Dense QLoRA (1x80GB @ ~44 GiB)
axolotl train examples/gemma4/31b-qlora.yaml

# 31B Dense QLoRA Flex Attn (1x80GB @ ~26 GiB)
axolotl train examples/gemma4/31b-qlora-flex.yaml

MoE Expert Quantization & Expert LoRA (26B-A4B only)

The 26B-A4B config uses ScatterMoE kernels via the transformers ExpertsInterface and quantizes expert weights on load. To learn about expert quantization, expert LoRA targeting, and related limitations, see the MoE Expert Quantization docs.

Flex Attention

Reduce ~40% VRAM (at the cost of up to half throughput) by setting the below (shown in examples/gemma4/31b-qlora-flex.yaml):

torch_compile: true
flex_attention: true

This works for both the MoE and Dense model.

Limitations

Flash Attention: FA2 (max head_dim=256) and FA4 (max head_dim=128) cannot support Gemma 4's global_head_dim=512. Use SDP or flex attention instead.
LoRA kernels: Not supported due to KV-sharing layers.
lora_target_linear: Incompatible for multimodal models — use lora_target_modules with a regex to restrict LoRA to the text backbone.

TIPS

Read more on how to load your own dataset at docs.
You can run full finetuning by removing adapter: qlora, load_in_4bit: true, and quantize_moe_experts: true from the config. This is heavy and has not been tested.

Optimization Guides

Please check the Optimizations doc.

README.md

Finetune Google's Gemma 4 with Axolotl

Getting started

MoE Expert Quantization & Expert LoRA (26B-A4B only)

Flex Attention

Limitations

TIPS

Optimization Guides

Related Resources