* upgrade to torchao 0.17.0
* chore: lint
* refactor attention handling
* replace legacy attention boolean flags with capability properties
Replace checks with capability-based properties derived from attn_implementation
This separates three concerns that were conflated under flash_attention:
1. Backend selection -> attn_implementation enum
2. Packing capability -> attn_supports_packing property
3. Flash-attn library dependency -> attn_uses_flash_lib property
* compute attn capability flags in normalizer instead of properties
* make attn_implementation the single source of truth
* move attention-dependent validators to mode=after
* migrate remaining consumers to canonical attn_implementation
* expand attention tests + rewrite docs
* migrate example configs to canonical attn_implementation
* update doc snippets + reject gemma4-hybrid with non-FA2 backend
* remove dead gemma4 branch in _set_attention_config
* fix duplicate attn_implementation in gpt-oss yamls and flaky caplog tests
* drop "Phase 2" naming from attn-implementation tests
* regroup attn_implementation tests by feature concern
* clean up verbose comments and remove MD
Signed-off-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: Axolotl Swarm <no-reply@axolotl.ai>
* fix(collator): pass return_dict=True at apply_chat_template top level for transformers 5.x
In transformers 5.x, ProcessorMixin.apply_chat_template gained its own
`return_dict` parameter (defaulting to False). When return_dict=False
and tokenize=True the method returns out["input_ids"] directly — a 2-D
tensor — rather than the full BatchFeature dict.
The old code placed `return_dict=True` inside processor_kwargs. In
transformers 5.x those kwargs are forwarded to the underlying processor
call self(...) where _merge_kwargs silently ignores any key not present
in MllamaProcessorKwargs (emitting a warning). The outer return_dict
therefore stayed False, apply_chat_template returned the raw input_ids
tensor, and the subsequent `batch["input_ids"]` attempted to index a
2-D tensor with the 9-character string "input_ids", producing:
IndexError: too many indices for tensor of dimension 2
The fix is to pass return_dict=True as a top-level keyword argument to
apply_chat_template (where it is actually consumed) and remove it from
processor_kwargs (where it was silently dropped). No version guard is
needed: transformers is pinned to ==5.5.4 in pyproject.toml.
Adds a unit-level regression test (tests/test_mm_chat_collator.py) that
mocks the processor to return a raw tensor when apply_chat_template is
called without top-level return_dict=True, verifying the four invariants:
process_rows returns a dict, input_ids is 2-D, labels is 2-D, and
apply_chat_template receives return_dict=True as a top-level kwarg.
Fixes: tests/e2e/test_llama_vision.py::TestLlamaVision::test_lora_llama_vision_multimodal_dataset
Fixes: tests/e2e/test_llama_vision.py::TestLlamaVision::test_lora_llama_vision_text_only_dataset
Signed-off-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: Axolotl Swarm <no-reply@axolotl.ai>
* fix(collator): process_rows returns dict (BatchFeature) shape
Two related changes for the multimodal chat collator under transformers 5.x:
1. Wrap apply_chat_template result in dict(...) so process_rows returns
a plain dict rather than a BatchFeature instance. BatchFeature is a
Mapping but not a dict; downstream code that did
batch["labels"] = self.processing_strategy.process_labels(batch["input_ids"])
would index on a tensor when the result wasn't dict-shaped, raising
IndexError: too many indices for tensor of dimension 2
2. Soften the regression test's contract from `dict` to `Mapping` so it
exercises the actual semantic guarantee (key/value access) rather
than the implementation detail (dict vs BatchFeature). Test guards
against the original transformers 5.x breakage where apply_chat_template's
return_dict default went from True to False.
Includes regression test under tests/test_mm_chat_collator.py.
Bug surfaced via swarm dispatch task_01KQHPNAYD8XARSNSDJVW1GPF6 against
attn-implementation-refactor; squash-merged from agent commits 4de886fd
+ dc9fcf4f.
Signed-off-by: Wing Lian <wing@axolotl.ai>
---------
Signed-off-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: Axolotl Swarm <no-reply@axolotl.ai>
* jagged lr restart scheudler
var name fix
make sure to create scheduler first
* wire things together
* more fixes
* fix for nesting scheduler and first anneal phase
* no need for relora trainer anymore since we've generalized the relora scheduler
* remove redundant relora scheduler and lint
* update relora e2e test for updated params
* need restart steps for relora test
* update quarto docs for dropped relora trainer
* update example yaml
* drop verbose arg
* min lr scale support for jagged lr
* don't let min_lr be nonetype
* cleanup args
* make pad_to_sequence_len default to the same value as sample_packing
* remove duplicate validation
* fix test
* update description meta
Co-authored-by: NanoCode012 <nano@axolotl.ai>
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai>
* checkpoint model on first step callback
* remove debug
* add test cases; update existing tests not to save on first step
* move test out of solo
* delete
* default to False
* typo
* wip for dbrx finetuning
* add fastcore for parallel loading of sharded weights
* fix dtype for load, use PartialState instead of accelerator to init process group, remove redundant wandb callback
* update to use v2 of the converted model
* more fixes for dbrx loras
* make sure to enable fsdp activation checkpointing
* fix support for 8bit loras too for dbrx
* apply z3 leaf moe fix for DBRX with deepspeed
* don't raise value error since child module searches could fail and be ok
* revert a previous change to fix fsdp
* update mistral/mistral qlora+fsdp yamls
* fix qlora+fsdp quant storage type
* more edge cases for qlora-fsdp
* fixes for fsdp+qlora w optimizer in 8bit
* add bigstral z3 config and make sure to use full_state_dict for fsdp
* add lisa support
* fix default and fix attribute traversal for layers
* improve lisa callback logging
* fix LISA by ensuring params are not frozen during __init__
* example config for lisa
---------
Co-authored-by: Aman Karmani <aman@tmm1.net>
* wip qlora + fsdp fixes
* more fixes
* make sure to load the lora 🤦
* only setup quantized meta on non-zero rank:
* only run setup_quantized_peft_meta_for_training for qlora+fsdp
* more fixes for qlora+fsdp
* chore: lint
* add example yml
* support mistral too
* fix for model_type and add mixtral support too
* set cpu_offload: false to reduce vram, constrain new accleerator logic to qlora + fsdp
* refactor for duplicate code
* Add CausalLMBenchEvalCallback for measuring seq2seq performance
* Fix code for pre-commit
* Fix typing and improve logging
* eval_sample_packing must be false with CausalLMBenchEvalCallback
* loftq support for lora
* fix loftq check
* update readme for loftq
* readability cleanup
* use peft main for loftq fixes, remove unnecessary special tokens
* remove unused test from older deprecation
* set fp16 to false if bf16, update bf16: auto in example YAMLs
* unset fp16 so that it fallsback properly if bf16 isn't available
* Update README.md [skip-ci]
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
* test that bf16 disables fp16
---------
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
* Add s2_attn to hijack flash code
* Refactor code to account for s2_attn
* Add test for models utils
* Add ``s2_attention`` option to llama configs
* Add ``s2_attention`` option to README config
* Format code to appease linter
* chore: lint
* Remove xpos and llama-landmark [bad merge]
* add e2e smoke tests for shifted sparse attention
* remove stray patch from merge
* update yml with link to paper for s2_attention/longlora
* fix assertion check for full fine tune
* increase sequence len for tests and PR feedback updates
* reduce context len to 16k for tests
* reduce context len to 16k for tests
* reduce batch size for larger context len and udpate test to check message
* fix test for message
---------
Co-authored-by: joecummings <jrcummings@devvm050.nha0.facebook.com>
Co-authored-by: Wing Lian <wing.lian@gmail.com>
* Feat: Update to handle wandb env better
* chore: rename wandb_run_id to wandb_name
* feat: add new recommendation and update config
* fix: indent and pop disabled env if project passed
* feat: test env set for wandb and recommendation
* feat: update to use wandb_name and allow id
* chore: add info to readme
* various bugfixes
use latest tinyllama release
check if val_set_size is empty first
update sdp and xformers llama patches for updated upstream transformers
fix system prompt when no input
calculate total and total supervised tokens even when not sample packing
* add fix for when eval size is estimated to be too small
* should be len 1 for dataset length
* add catchall kwargs
* auto gptq support
* more tweaks and add yml
* remove old gptq docker
* don't need explicit peft install for tests
* fix setup.py to use extra index url
install torch for tests
fix cuda version for autogptq index
set torch in requirements so that it installs properly
move gptq install around to work with github cicd
* gptq doesn't play well with sample packing
* address pr feedback
* remove torch install for now
* set quantization_config from model config
* Fix the implementation for getting quant config from model config