axolotl

Author	SHA1	Message	Date
Wing Lian	e4032fc90f	Refactor separate attention flags with attn_implementation and capability/concerns feature flags (#3602 ) * upgrade to torchao 0.17.0 * chore: lint * refactor attention handling * replace legacy attention boolean flags with capability properties Replace checks with capability-based properties derived from attn_implementation This separates three concerns that were conflated under flash_attention: 1. Backend selection -> attn_implementation enum 2. Packing capability -> attn_supports_packing property 3. Flash-attn library dependency -> attn_uses_flash_lib property * compute attn capability flags in normalizer instead of properties * make attn_implementation the single source of truth * move attention-dependent validators to mode=after * migrate remaining consumers to canonical attn_implementation * expand attention tests + rewrite docs * migrate example configs to canonical attn_implementation * update doc snippets + reject gemma4-hybrid with non-FA2 backend * remove dead gemma4 branch in _set_attention_config * fix duplicate attn_implementation in gpt-oss yamls and flaky caplog tests * drop "Phase 2" naming from attn-implementation tests * regroup attn_implementation tests by feature concern * clean up verbose comments and remove MD Signed-off-by: Wing Lian <wing@axolotl.ai> Co-authored-by: Axolotl Swarm <no-reply@axolotl.ai> * fix(collator): pass return_dict=True at apply_chat_template top level for transformers 5.x In transformers 5.x, ProcessorMixin.apply_chat_template gained its own `return_dict` parameter (defaulting to False). When return_dict=False and tokenize=True the method returns out["input_ids"] directly — a 2-D tensor — rather than the full BatchFeature dict. The old code placed `return_dict=True` inside processor_kwargs. In transformers 5.x those kwargs are forwarded to the underlying processor call self(...) where _merge_kwargs silently ignores any key not present in MllamaProcessorKwargs (emitting a warning). The outer return_dict therefore stayed False, apply_chat_template returned the raw input_ids tensor, and the subsequent `batch["input_ids"]` attempted to index a 2-D tensor with the 9-character string "input_ids", producing: IndexError: too many indices for tensor of dimension 2 The fix is to pass return_dict=True as a top-level keyword argument to apply_chat_template (where it is actually consumed) and remove it from processor_kwargs (where it was silently dropped). No version guard is needed: transformers is pinned to ==5.5.4 in pyproject.toml. Adds a unit-level regression test (tests/test_mm_chat_collator.py) that mocks the processor to return a raw tensor when apply_chat_template is called without top-level return_dict=True, verifying the four invariants: process_rows returns a dict, input_ids is 2-D, labels is 2-D, and apply_chat_template receives return_dict=True as a top-level kwarg. Fixes: tests/e2e/test_llama_vision.py::TestLlamaVision::test_lora_llama_vision_multimodal_dataset Fixes: tests/e2e/test_llama_vision.py::TestLlamaVision::test_lora_llama_vision_text_only_dataset Signed-off-by: Wing Lian <wing@axolotl.ai> Co-authored-by: Axolotl Swarm <no-reply@axolotl.ai> * fix(collator): process_rows returns dict (BatchFeature) shape Two related changes for the multimodal chat collator under transformers 5.x: 1. Wrap apply_chat_template result in dict(...) so process_rows returns a plain dict rather than a BatchFeature instance. BatchFeature is a Mapping but not a dict; downstream code that did batch["labels"] = self.processing_strategy.process_labels(batch["input_ids"]) would index on a tensor when the result wasn't dict-shaped, raising IndexError: too many indices for tensor of dimension 2 2. Soften the regression test's contract from `dict` to `Mapping` so it exercises the actual semantic guarantee (key/value access) rather than the implementation detail (dict vs BatchFeature). Test guards against the original transformers 5.x breakage where apply_chat_template's return_dict default went from True to False. Includes regression test under tests/test_mm_chat_collator.py. Bug surfaced via swarm dispatch task_01KQHPNAYD8XARSNSDJVW1GPF6 against attn-implementation-refactor; squash-merged from agent commits 4de886fd + dc9fcf4f. Signed-off-by: Wing Lian <wing@axolotl.ai> --------- Signed-off-by: Wing Lian <wing@axolotl.ai> Co-authored-by: Axolotl Swarm <no-reply@axolotl.ai>	2026-05-05 10:15:18 -04:00
Grant Holmes (Ren)	850c1a5f8d	Add FSDP v2 swap memory support + QLoRA compatibility fixes (#3167 ) Co-authored-by: salman <salman.mohammadi@outlook.com>	2025-09-26 10:23:59 +01:00
Wing Lian	97e86c6d47	drop old patches and code that are no longer needed (#3007 ) [skip ci]	2025-08-06 08:02:39 -04:00
Wing Lian	7b68dfafd7	jagged lr restart scheudler (#1680 ) [skip ci] * jagged lr restart scheudler var name fix make sure to create scheduler first * wire things together * more fixes * fix for nesting scheduler and first anneal phase * no need for relora trainer anymore since we've generalized the relora scheduler * remove redundant relora scheduler and lint * update relora e2e test for updated params * need restart steps for relora test * update quarto docs for dropped relora trainer * update example yaml * drop verbose arg * min lr scale support for jagged lr * don't let min_lr be nonetype * cleanup args	2025-07-31 13:50:03 -04:00
Wing Lian	22810c97b7	use warmup_ratio as a better default than warmup steps since it's data dependent (#2897 ) [skip ci] * use warmup_ratio as a better default than warmup steps since it's data dependent * replace remainder of warmup_steps	2025-07-30 06:44:06 -04:00
Wing Lian	af8d257aa2	make pad_to_sequence_len default to the same value as sample_packing (#2941 ) [skip ci] * make pad_to_sequence_len default to the same value as sample_packing * remove duplicate validation * fix test * update description meta Co-authored-by: NanoCode012 <nano@axolotl.ai> --------- Co-authored-by: NanoCode012 <nano@axolotl.ai>	2025-07-21 11:40:56 -04:00
Dan Saunders	10ba1622f7	checkpoint model on first step callback (#2906 ) * checkpoint model on first step callback * remove debug * add test cases; update existing tests not to save on first step * move test out of solo * delete * default to False * typo	2025-07-15 15:00:48 -04:00
Wing Lian	dd8bad06d0	remove strict=false from example yamls [skip ci] (#2523 ) [skip ci]	2025-04-12 07:25:11 -07:00
Wing Lian	9f824ef76a	simplify the example configs to be more minimal and less daunting (#2486 ) [skip ci] * simplify the example configs to be more minimal and less daunting * drop empty s2_attention from example yamls	2025-04-04 13:47:26 -04:00
salman	ac471a697a	updating to fused (#2293 )	2025-01-30 11:45:56 -05:00
Sunny Liu	1c14c4a15c	Add hub model id config options to all example yml files (#2196 ) [skip ci] * added hub model_id in example yml * add hub model id to example yml	2024-12-17 11:24:30 -05:00
Wing Lian	4fde300e5f	update outputs path so that we can mount workspace to /workspace/data (#1623 ) * update outputs path so that we can mount workspace to /workspace/data * fix ln order	2024-05-15 12:44:13 -04:00
Wing Lian	132eb740f0	DBRX Model Support (#1462 ) * wip for dbrx finetuning * add fastcore for parallel loading of sharded weights * fix dtype for load, use PartialState instead of accelerator to init process group, remove redundant wandb callback * update to use v2 of the converted model * more fixes for dbrx loras * make sure to enable fsdp activation checkpointing * fix support for 8bit loras too for dbrx * apply z3 leaf moe fix for DBRX with deepspeed * don't raise value error since child module searches could fail and be ok * revert a previous change to fix fsdp * update mistral/mistral qlora+fsdp yamls * fix qlora+fsdp quant storage type * more edge cases for qlora-fsdp * fixes for fsdp+qlora w optimizer in 8bit * add bigstral z3 config and make sure to use full_state_dict for fsdp	2024-04-12 09:02:36 -04:00
Wing Lian	0ddfb24fcf	LISA (#1469 ) * add lisa support * fix default and fix attribute traversal for layers * improve lisa callback logging * fix LISA by ensuring params are not frozen during __init__ * example config for lisa --------- Co-authored-by: Aman Karmani <aman@tmm1.net>	2024-04-01 04:54:53 -07:00
Wing Lian	2a1589f6f6	strip out hacky qlora-fsdp workarounds now that qlora-fsdp fixes are upstreamed (#1428 )	2024-03-21 11:56:13 -04:00
Wing Lian	9b6ee83a73	FDSP + QLoRA (#1378 ) * wip qlora + fsdp fixes * more fixes * make sure to load the lora 🤦 * only setup quantized meta on non-zero rank: * only run setup_quantized_peft_meta_for_training for qlora+fsdp * more fixes for qlora+fsdp * chore: lint * add example yml * support mistral too * fix for model_type and add mixtral support too * set cpu_offload: false to reduce vram, constrain new accleerator logic to qlora + fsdp * refactor for duplicate code	2024-03-08 14:31:01 -05:00
NanoCode012	a7a9a1433a	fix(examples): remove is_*_derived as it's parsed automatically (#1297 )	2024-02-22 00:52:46 +09:00
Leonardo Emili	5a5d47458d	Add seq2seq eval benchmark callback (#1274 ) * Add CausalLMBenchEvalCallback for measuring seq2seq performance * Fix code for pre-commit * Fix typing and improve logging * eval_sample_packing must be false with CausalLMBenchEvalCallback	2024-02-13 08:24:30 -08:00
Wing Lian	4cb7900a56	Peft lotfq (#1222 ) * loftq support for lora * fix loftq check * update readme for loftq * readability cleanup * use peft main for loftq fixes, remove unnecessary special tokens * remove unused test from older deprecation	2024-01-28 18:50:08 -05:00
Wing Lian	54d2ac155b	Mixtral fixes 20240124 (#1192 ) [skip ci] * mixtral nccl fixes * make sure to patch for z3	2024-01-24 14:59:57 -05:00
Wing Lian	782b6a4216	set fp16 to false if bf16, update bf16: auto in example YAMLs (#1122 ) [skip ci] * set fp16 to false if bf16, update bf16: auto in example YAMLs * unset fp16 so that it fallsback properly if bf16 isn't available * Update README.md [skip-ci] Co-authored-by: NanoCode012 <kevinvong@rocketmail.com> * test that bf16 disables fp16 --------- Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>	2024-01-22 18:44:01 -05:00
Joe Cummings	1d70f24b50	Add shifted sparse attention (#973 ) [skip-ci] * Add s2_attn to hijack flash code * Refactor code to account for s2_attn * Add test for models utils * Add ``s2_attention`` option to llama configs * Add ``s2_attention`` option to README config * Format code to appease linter * chore: lint * Remove xpos and llama-landmark [bad merge] * add e2e smoke tests for shifted sparse attention * remove stray patch from merge * update yml with link to paper for s2_attention/longlora * fix assertion check for full fine tune * increase sequence len for tests and PR feedback updates * reduce context len to 16k for tests * reduce context len to 16k for tests * reduce batch size for larger context len and udpate test to check message * fix test for message --------- Co-authored-by: joecummings <jrcummings@devvm050.nha0.facebook.com> Co-authored-by: Wing Lian <wing.lian@gmail.com>	2024-01-18 10:16:07 -05:00
Tim Dolan	c75f916745	added tiny llama examples for lora and qlora (#1027 ) * added tiny llama examples for lora and qlora * corrected yml files and removed tiny-llama.yml from llama-2 example	2024-01-02 20:00:37 -05:00
Wing Lian	5f79b8242f	new evals_per_epoch and saves_per_epoch to make things cleaner (#944 ) * new evals_per_epoch and saves_per_epoch to make things cleaner * update per PR feedback	2023-12-12 15:35:23 -05:00
NanoCode012	a1da39cd48	Feat(wandb): Refactor to be more flexible (#767 ) * Feat: Update to handle wandb env better * chore: rename wandb_run_id to wandb_name * feat: add new recommendation and update config * fix: indent and pop disabled env if project passed * feat: test env set for wandb and recommendation * feat: update to use wandb_name and allow id * chore: add info to readme	2023-12-04 22:17:25 +09:00
Wing Lian	14706504e3	various bugfixes (#856 ) * various bugfixes use latest tinyllama release check if val_set_size is empty first update sdp and xformers llama patches for updated upstream transformers fix system prompt when no input calculate total and total supervised tokens even when not sample packing * add fix for when eval size is estimated to be too small * should be len 1 for dataset length * add catchall kwargs	2023-11-15 12:23:18 -05:00
Wing Lian	f544ab2bed	don't compile deepspeed or bitsandbytes from source (#837 )	2023-11-08 19:49:55 -05:00
Wing Lian	8b79ff0e94	fix eval_steps to be a sane default (#797 ) * fix eval_steps to be a sane default * update docs for fractional eval_steps	2023-10-27 22:36:30 -04:00
Wing Lian	2d8def68dc	simplify by removing duplicate base_model_config (#772 )	2023-10-23 01:42:38 -04:00
Casper	15d3a654bf	Implement fused modules (#747 ) * MLP: Memory saving * Remove RMSNorm restrictions * Map packed weights to original * FusedAttention module * Simplify code * Move fused modules * Fix critical typo * Split inplace * Add FFT config * Add validation of fused arguments * Add fused arguments to config * Update docs * Fix validation logic * Add fused modules to flash attn * Only fuse during training * Remove timing * Formatting * Formatting * Formatting * chore: lint * chore: lint * add e2e tests for fused llama * no lora for tests --------- Co-authored-by: Wing Lian <wing.lian@gmail.com>	2023-10-21 16:08:25 -04:00
Wing Lian	e50a64e85e	prepared dataset caching, other misc fixes (#665 ) * prepared dataset caching, other misc fixes * also don't load from disk cache unless explicit	2023-10-02 21:07:24 -04:00
Wing Lian	d887ad86c3	eval_table isn't quite stable enough to be in default llama configs (#637 )	2023-09-26 10:13:20 -04:00
mhenrichsen	4fecbfe5e1	default model changed	2023-09-24 18:52:53 +02:00
Wing Lian	faecff9798	support to disable exllama for gptq (#604 ) * support to disable exllama for gptq * update property instead of item * fix config key	2023-09-19 17:51:08 -04:00
Glavin Wiechert	5b67ea98a6	Add training callback to send predictions to WandB table (#521 ) * WIP Add training callback to send predictions to WandB table * WIP improve wandb table reporting callback * WIP improve wandb table reporting callback (cont) * Add VSCode launching for debugging * Add tiny llama example * WIP attempt to improve post-eval prediction generation for table * WIP attempt to improve post-eval prediction generation for table - part 2 * WIP batch generation * WIP attempt to handle sample_packing using position_ids for wandb prediction table * WIP add code for debugging * Fix sample_packing support for wandb prediction table * Clean up code for PR review * Add eval_table_size, eval_table_max_new_tokens configs & clean up code * Clean up PR, delete VSCode config, add tiny-llama example * Add eval_table_size, eval_table_max_new_tokens documentation. Fix linting/formatting	2023-09-13 09:51:08 -04:00
Wing Lian	343714972b	recommend padding when using sample packing (#531 )	2023-09-06 17:00:21 -04:00
Wing Lian	3355706e22	Add support for GPTQ using native transformers/peft (#468 ) * auto gptq support * more tweaks and add yml * remove old gptq docker * don't need explicit peft install for tests * fix setup.py to use extra index url install torch for tests fix cuda version for autogptq index set torch in requirements so that it installs properly move gptq install around to work with github cicd * gptq doesn't play well with sample packing * address pr feedback * remove torch install for now * set quantization_config from model config * Fix the implementation for getting quant config from model config	2023-09-05 12:43:22 -04:00
Charles O. Goddard	fe4d6baf92	Add example Llama 2 ReLoRA config (#471 ) * Add example Llama 2 ReLoRA config * Use adamw_bnb_8bit in example relora config	2023-08-27 10:08:34 +09:00
Wing Lian	1687be6a35	don't use mask expansion for inference (#392 )	2023-08-14 20:52:54 -04:00
mhenrichsen	fdffef5940	new llama-2 default settings (#370 ) * new default settings * fix whitespace * rm max packed sequence length --------- Co-authored-by: Mads Henrichsen <mads@BrbartiendeMads.lan>	2023-08-14 17:39:09 +09:00
Morgan McGuire	7019509daa	Add wandb_entity to wandb options, update example configs, update README (#361 ) * Update wandb_entity and add wandb descriptions * add wandb to config section * remove trailing whitespace for pre-commit hook * remove trailing whitespace for pre-commit hook --------- Co-authored-by: Morgan McGuire <morganmcguire@Morgans-MacBook-Pro.local> Co-authored-by: Wing Lian <wing.lian@gmail.com>	2023-08-12 12:17:11 -04:00
Aman Karmani	36fefcf94b	set group_by_length to false in examples	2023-08-06 23:59:09 -07:00
mhenrichsen	dc71d8872a	feat/llama-2 examples (#319 ) * qlora llama-2 * qlora llama-2 * linting * readme * lora added * linting * change group_by_length * 13b fitting on 24gb * grouped lengths true * add pad token * change out dir --------- Co-authored-by: Mads Henrichsen <mads@Brbar-tilhrende-Mads.local>	2023-08-03 19:22:48 +09:00

43 Commits