Wing Lian e4032fc90f Refactor separate attention flags with attn_implementation and capability/concerns feature flags (#3602)
* upgrade to torchao 0.17.0

* chore: lint

* refactor attention handling

* replace legacy attention boolean flags with capability properties

Replace checks with capability-based properties derived from attn_implementation

This separates three concerns that were conflated under flash_attention:
1. Backend selection -> attn_implementation enum
2. Packing capability -> attn_supports_packing property
3. Flash-attn library dependency -> attn_uses_flash_lib property

* compute attn capability flags in normalizer instead of properties

* make attn_implementation the single source of truth

* move attention-dependent validators to mode=after

* migrate remaining consumers to canonical attn_implementation

* expand attention tests + rewrite docs

* migrate example configs to canonical attn_implementation

* update doc snippets + reject gemma4-hybrid with non-FA2 backend

* remove dead gemma4 branch in _set_attention_config

* fix duplicate attn_implementation in gpt-oss yamls and flaky caplog tests

* drop "Phase 2" naming from attn-implementation tests

* regroup attn_implementation tests by feature concern

* clean up verbose comments and remove MD

Signed-off-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: Axolotl Swarm <no-reply@axolotl.ai>

* fix(collator): pass return_dict=True at apply_chat_template top level for transformers 5.x

In transformers 5.x, ProcessorMixin.apply_chat_template gained its own
`return_dict` parameter (defaulting to False).  When return_dict=False
and tokenize=True the method returns out["input_ids"] directly — a 2-D
tensor — rather than the full BatchFeature dict.

The old code placed `return_dict=True` inside processor_kwargs.  In
transformers 5.x those kwargs are forwarded to the underlying processor
call self(...) where _merge_kwargs silently ignores any key not present
in MllamaProcessorKwargs (emitting a warning).  The outer return_dict
therefore stayed False, apply_chat_template returned the raw input_ids
tensor, and the subsequent `batch["input_ids"]` attempted to index a
2-D tensor with the 9-character string "input_ids", producing:

  IndexError: too many indices for tensor of dimension 2

The fix is to pass return_dict=True as a top-level keyword argument to
apply_chat_template (where it is actually consumed) and remove it from
processor_kwargs (where it was silently dropped).  No version guard is
needed: transformers is pinned to ==5.5.4 in pyproject.toml.

Adds a unit-level regression test (tests/test_mm_chat_collator.py) that
mocks the processor to return a raw tensor when apply_chat_template is
called without top-level return_dict=True, verifying the four invariants:
process_rows returns a dict, input_ids is 2-D, labels is 2-D, and
apply_chat_template receives return_dict=True as a top-level kwarg.

Fixes: tests/e2e/test_llama_vision.py::TestLlamaVision::test_lora_llama_vision_multimodal_dataset
Fixes: tests/e2e/test_llama_vision.py::TestLlamaVision::test_lora_llama_vision_text_only_dataset
Signed-off-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: Axolotl Swarm <no-reply@axolotl.ai>

* fix(collator): process_rows returns dict (BatchFeature) shape

Two related changes for the multimodal chat collator under transformers 5.x:

1. Wrap apply_chat_template result in dict(...) so process_rows returns
   a plain dict rather than a BatchFeature instance. BatchFeature is a
   Mapping but not a dict; downstream code that did
     batch["labels"] = self.processing_strategy.process_labels(batch["input_ids"])
   would index on a tensor when the result wasn't dict-shaped, raising
     IndexError: too many indices for tensor of dimension 2

2. Soften the regression test's contract from `dict` to `Mapping` so it
   exercises the actual semantic guarantee (key/value access) rather
   than the implementation detail (dict vs BatchFeature). Test guards
   against the original transformers 5.x breakage where apply_chat_template's
   return_dict default went from True to False.

Includes regression test under tests/test_mm_chat_collator.py.

Bug surfaced via swarm dispatch task_01KQHPNAYD8XARSNSDJVW1GPF6 against
attn-implementation-refactor; squash-merged from agent commits 4de886fd
+ dc9fcf4f.

Signed-off-by: Wing Lian <wing@axolotl.ai>

---------

Signed-off-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: Axolotl Swarm <no-reply@axolotl.ai>
2026-05-05 10:15:18 -04:00
2026-04-24 14:23:09 +07:00
2026-04-21 10:16:03 -04:00
2026-04-24 14:23:09 +07:00
2024-11-18 14:58:03 -05:00
2026-04-21 10:16:03 -04:00
2023-04-14 00:20:05 -04:00
2024-08-23 12:21:51 -04:00
2026-04-23 00:25:28 -04:00
2025-09-12 10:55:11 +01:00
2025-04-10 12:34:25 +07:00
2023-06-10 23:36:14 -07:00
2025-06-17 18:09:24 -04:00
2023-07-21 09:49:29 -04:00
2026-04-21 10:16:03 -04:00
2026-04-29 22:46:51 +07:00
2026-04-24 14:23:09 +07:00

Axolotl

A Free and Open Source LLM Fine-tuning Framework

GitHub License tests codecov Releases
contributors GitHub Repo stars
discord twitter google-colab
tests-nightly multigpu-semi-weekly tests

🎉 Latest Updates

Expand older updates
  • 2025/09: Axolotl now has text diffusion training. Read more here.
  • 2025/08: QAT has been updated to include NVFP4 support. See PR.
  • 2025/07:
    • ND Parallelism support has been added into Axolotl. Compose Context Parallelism (CP), Tensor Parallelism (TP), and Fully Sharded Data Parallelism (FSDP) within a single node and across multiple nodes. Check out the blog post for more info.
    • Axolotl adds more models: GPT-OSS, Gemma 3n, Liquid Foundation Model 2 (LFM2), and Arcee Foundation Models (AFM).
    • FP8 finetuning with fp8 gather op is now possible in Axolotl via torchao. Get started here!
    • Voxtral, Magistral 1.1, and Devstral with mistral-common tokenizer support has been integrated in Axolotl!
    • TiledMLP support for single-GPU to multi-GPU training with DDP, DeepSpeed and FSDP support has been added to support Arctic Long Sequence Training. (ALST). See examples for using ALST with Axolotl!
  • 2025/06: Magistral with mistral-common tokenizer support has been added to Axolotl. See docs to start training your own Magistral models with Axolotl!
  • 2025/05: Quantization Aware Training (QAT) support has been added to Axolotl. Explore the docs to learn more!
  • 2025/04: Llama 4 support has been added in Axolotl. See docs to start training your own Llama 4 models with Axolotl's linearized version!
  • 2025/03: Axolotl has implemented Sequence Parallelism (SP) support. Read the blog and docs to learn how to scale your context length when fine-tuning.
  • 2025/03: (Beta) Fine-tuning Multimodal models is now supported in Axolotl. Check out the docs to fine-tune your own!
  • 2025/02: Axolotl has added LoRA optimizations to reduce memory usage and improve training speed for LoRA and QLoRA in single GPU and multi-GPU training (DDP and DeepSpeed). Jump into the docs to give it a try.
  • 2025/02: Axolotl has added GRPO support. Dive into our blog and GRPO example and have some fun!
  • 2025/01: Axolotl has added Reward Modelling / Process Reward Modelling fine-tuning support. See docs.

Overview

Axolotl is a free and open-source tool designed to streamline post-training and fine-tuning for the latest large language models (LLMs).

Features:

🚀 Quick Start - LLM Fine-tuning in Minutes

Requirements:

  • NVIDIA GPU (Ampere or newer for bf16 and Flash Attention) or AMD GPU
  • Python >=3.11 (3.12 recommended)
  • PyTorch ≥2.9.1

Google Colab

Open In Colab

Installation

# install uv if you don't already have it installed (restart shell after)
curl -LsSf https://astral.sh/uv/install.sh | sh

# change depending on system
export UV_TORCH_BACKEND=cu128

# create a new virtual environment
uv venv --python 3.12
source .venv/bin/activate

uv pip install torch==2.10.0 torchvision
uv pip install --no-build-isolation axolotl[deepspeed]

# Download example axolotl configs, deepspeed configs
axolotl fetch examples
axolotl fetch deepspeed_configs  # OPTIONAL

Using Docker

Installing with Docker can be less error prone than installing in your own environment.

docker run --gpus '"all"' --ipc=host --rm -it axolotlai/axolotl:main-latest

Other installation approaches are described here.

Cloud Providers

Your First Fine-tune

# Fetch axolotl examples
axolotl fetch examples

# Or, specify a custom path
axolotl fetch examples --dest path/to/folder

# Train a model using LoRA
axolotl train examples/llama-3/lora-1b.yml

That's it! Check out our Getting Started Guide for a more detailed walkthrough.

📚 Documentation

AI Agent Support

Axolotl ships with built-in documentation optimized for AI coding agents (Claude Code, Cursor, Copilot, etc.). These docs are bundled with the pip package — no repo clone needed.

# Show overview and available training methods
axolotl agent-docs

# Topic-specific references
axolotl agent-docs sft                 # supervised fine-tuning
axolotl agent-docs grpo                # GRPO online RL
axolotl agent-docs preference_tuning   # DPO, KTO, ORPO, SimPO
axolotl agent-docs reward_modelling    # outcome and process reward models
axolotl agent-docs pretraining         # continual pretraining
axolotl agent-docs --list              # list all topics

# Dump config schema for programmatic use
axolotl config-schema
axolotl config-schema --field adapter

If you're working with the source repo, agent docs are also available at docs/agents/ and the project overview is in AGENTS.md.

🤝 Getting Help

🌟 Contributing

Contributions are welcome! Please see our Contributing Guide for details.

📈 Telemetry

Axolotl has opt-out telemetry that helps us understand how the project is being used and prioritize improvements. We collect basic system information, model types, and error rates—never personal data or file paths. Telemetry is enabled by default. To disable it, set AXOLOTL_DO_NOT_TRACK=1. For more details, see our telemetry documentation.

❤️ Sponsors

Interested in sponsoring? Contact us at wing@axolotl.ai

📝 Citing Axolotl

If you use Axolotl in your research or projects, please cite it as follows:

@software{axolotl,
  title = {Axolotl: Open Source LLM Post-Training},
  author = {{Axolotl maintainers and contributors}},
  url = {https://github.com/axolotl-ai-cloud/axolotl},
  license = {Apache-2.0},
  year = {2023}
}

📜 License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

Description
Fork of axolotl-ai-cloud/axolotl @ v0.16.1 � activeblue patches for RTX 5080 / CUDA 12.8
Readme Apache-2.0 Cite this repository 48 MiB
Languages
Python 97.4%
Jinja 2.3%
Shell 0.2%