axolotl

Author	SHA1	Message	Date
Wing Lian	6ba5c0ed2c	use latest hf-xet and don't install vllm for torch 2.7.0 (#2603 ) * use latest hf-xet and don't install vllm for torch 2.7.0 * fix runpod hub tests	2025-04-30 18:27:39 -04:00
Wing Lian	dc4da4a7e2	update trl to 0.17.0 (#2560 ) * update trl to 0.17.0 * grpo + vllm no longer supported with 2.5.1 due to vllm constraints * disable VLLM_USE_V1 for ci * imporve handle killing off of multiprocessing vllm service * debug why this doesn't run in CI * increase vllm wait time * increase timeout to 5min * upgrade to vllm 0.8.4 * dump out the vllm log for debugging * use debug logging * increase vllm start timeout * use NVL instead * disable torch compile cache * revert some commented checks now that grpo tests are fixed * increase vllm timeoout back to 5min	2025-04-27 19:19:53 -04:00
Wing Lian	0d691cc2a7	add base docker image with pytorch 2.7.0 and variant for cuda 12.8 (#2551 ) * add base docker image with pytorch 2.7.0 and variant for cuda 12.8 * my bash is terrible	2025-04-23 14:59:03 -04:00
Chiwan Park	4ce469d32e	fix: upgrade liger to 0.5.8 and use native Gemma3 patches (#2527 ) * fix: upgrade liger to 0.5.8 and use native Gemma3 patches * fix: make lint happy * doc: update Liger Kernel FLCE support for Gemma 3	2025-04-18 09:57:40 -07:00
NanoCode012	682a9cf79b	Fix: add delinearization and make qlora work with fsdp2 (#2515 ) * fixes for delinearization, and make qlora work with fsdp2 * Add back mistakenly removed lm_eval * typo [skip ci] * patch evals for torch.compile + fsdp2 * also check torch_compile w fsdp2 * lots of fixes for flex attn with llama4 * fix patch check and patch llama4 too * attempt to make the patches stick * use transformers 4.51.2 * update configs and README for llama4 * remove torch.compile for CI test * cleanup any existing singletons * set singleton cache to None instead of deleting * use importlib reload with monkeypatch * don't worry about transformers version, mark inputs with grads, fix regex * make sure embeds aren't on cpu * logging and mem improvements * vllm version and add to docker, make sure to save processor on conversion * fix ambiguous tensor bool check * fix vllm to not use v1, upgrade hf transformers * fix tests * make flex_attn_compile_kwargs configurable, since this depends on model params --------- Co-authored-by: Wing Lian <wing@axolotl.ai> Co-authored-by: Salman Mohammadi <salman.mohammadi@outlook.com>	2025-04-15 23:31:39 -07:00
Wing Lian	630e40dd13	upgrade transformers to 4.51.1 (#2508 ) * upgrade transformers to 4.51.1 * multigpu longer timeout	2025-04-09 02:53:00 -04:00
NanoCode012	9b89591ead	Feat: Add doc on loading datasets and support for Azure/OCI (#2482 ) * fix: remove unused config * feat: add doc on dataset loading * feat: enable azure and oci remote file system * feat: add adlfs and ocifs to requirements * fix: add links between dataset formats and dataset loading * fix: remove unused condition * Revert "fix: remove unused condition" This reverts commit `5fe13be73e`.	2025-04-07 12:41:13 -04:00
Wing Lian	8bbad21bfd	llama4 support (#2493 ) * llama4 support * add xet support [skip ci] * be flexible on transformers version and skip test on version * don't use deepspeed for the fix_untrained_tokens test * reordering to trigger torch 2.6.0 tests first * slightly smaller train set * use 4.51.0 for now * remove stray print, add llama4 chat template to schema, bump peft to 0.15.1 * patches to make llama4 performant * add preliminary fp8 support	2025-04-07 10:49:15 -04:00
Wing Lian	5f4af3665d	FSDP2 support (#2469 ) * fsdp2 support * use accelerate release 1.6.0 * allow 8bit optims with fsdp2 * liger + torch compile fix * add fsdp2 e2e tests * use transformers commit with fsdp2 support * skip zero3 tests for this PR for now * fix fsdp2 config for ci * make sure both flex and flash attn work with fsdp2, skip fix untrained tokens * okay, actually use fdsp2... * more fixes to flex for fsdp2 * make sure to patch all the loaded models * additional validation for fsdp2, bump dep versions	2025-04-06 17:08:01 -04:00
Wing Lian	e7e0cd97ce	Update dependencies and show slow tests in CI (#2492 ) * use latest torchao, gradio, schedule-free * get info on slow tests * speed up tests by avoiding gradient checkpointing and reducing eval size	2025-04-05 17:41:31 -04:00
NanoCode012	990b5896bc	fix: downgrade deepspeed to fix grad checkpoint oom (#2465 ) [skip ci]	2025-04-01 12:25:05 -04:00
Wing Lian	e0aba74dd0	Release update 20250331 (#2460 ) [skip ci] * make torch 2.6.0 the default image * fix tests against upstream main * fix attribute access * use fixture dataset * fix dataset load * correct the fixtures + tests * more fixtures * add accidentally removed shakespeare fixture * fix conversion from unittest to pytest class * nightly main ci caches * build 12.6.3 cuda base image * override for fix from huggingface/transformers#37162 * address PR feedback	2025-04-01 08:47:50 -04:00
Wing Lian	b6fc46ada8	Updates for trl 0.16.0 - mostly for GRPO (#2437 ) [skip ci] * add grpo scale_rewards config for trl#3135 * options to connect to vllm server directly w grpo trl#3094 * temperature support trl#3029 * sampling/generation kwargs for grpo trl#2989 * make vllm_enable_prefix_caching a config param trl#2900 * grpo multi-step optimizeations trl#2899 * remove overrides for grpo trainer * bump trl to 0.16.0 * add cli to start vllm-serve via trl * call the python module directly * update to use vllm with 2.6.0 too now and call trl vllm serve from module * vllm 0.8.1 * use python3 * use sys.executable * remove context and wait for start * fixes to make it actually work * fixes so the grpo tests pass with new vllm paradigm * explicit host/port and check in start vllm * make sure that vllm doesn't hang by setting quiet so outouts go to dev null * also bump bnb to latest release * add option for wait from cli and nccl debugging for ci * grpo + vllm test on separate devices for now * make sure grpo + vllm tests runs single worker since pynccl comms would conflict * fix cli * remove wait and add caching for argilla dataset * refactoring configs * chore: lint * add vllm config * fixup vllm grpo args * fix one more incorrect schema/config path * fix another vlllm reference and increase timeout * make the tests run a bit faster * change mbsz back so it is correct for grpo * another change mbsz back so it is correct for grpo * fixing cli args * nits * adding docs * docs * include tensor parallel size for vllm in pydantic schema * moving start_vllm, more docs * limit output len for grpo vllm * vllm enable_prefix_caching isn't a bool cli arg * fix env ordering in tests and also use pid check when looking for vllm --------- Co-authored-by: Salman Mohammadi <salman.mohammadi@outlook.com>	2025-03-31 15:47:11 -04:00
Wing Lian	e46239f8d3	bump liger to 0.5.5 (#2448 )	2025-03-28 19:21:03 -04:00
Wing Lian	05f03b541a	hf offline decorator for tests to workaround rate limits (#2452 ) [skip ci] * hf offline decorator for tests to workaround rate limits * fail quicker so we can see logs * try new cache name * limit files downloaded * phi mini predownload * offline decorator for phi tokenizer * handle meta llama 8b offline too * make sure to return fixtures if they are wrapped too * more fixes * more things offline * more offline things * fix the env var * fix the model name * handle gemma also * force reload of modules to recheck offline status * prefetch mistral too * use reset_sessions so hub picks up offline mode * more fixes * rename so it doesn't seem like a context manager * fix backoff * switch out tinyshakespeare dataset since it runs a py script to fetch data and doesn't work offline * include additional dataset * more fixes * more fixes * replace tiny shakespeaere dataset * skip some tests for now * use more robust check using snapshot download to determine if a dataset name is on the hub * typo for skip reason * use local_files_only * more fixtures * remove local only * use tiny shakespeare as pretrain dataset and streaming can't be offline even if precached * make sure fixtures aren't offline improve the offline reset try bumping version of datasets reorder reloading and setting prime a new cache run the tests now with fresh cache try with a static cache * now run all the ci again with hopefully a correct cache * skip wonky tests for now * skip wonky tests for now * handle offline mode for model card creation	2025-03-28 19:20:46 -04:00
NanoCode012	9f00465a5c	Feat: Add support for gemma3_text and add e2e for gemma2 (#2406 )	2025-03-22 20:33:21 -04:00
Dan Saunders	23f0c51d88	Sequence parallelism (#2412 ) * adding easy_context as integration for now * progress on ring attn impl * progress on ring attn impl * cleanup * remove errant file * fix req * removing unused code * updates * pytest * update * updates * fixes * precommit fixes * working multi-group SP * fixing sample packing * remove debug logs and simplify * eval dataloader and sampler changes * removing some obvious comments * update config.qmd and rename option * scoping down problematic import * another import scoping change * pernicious Fire CLI bugfix * isolate cli tests * actually isolate CLI tests * gracefully handle no ring-flash-attn * fix * fix * move ring flash attn to extras with flash-attn (#2414) * removing flash-attn from requirements.txt (in setup.py extras already) * rename file, delete another * using field validator instead of model validator * test fix * sampler / dataloader refactor * non-seq2se1 collator fix * removing print statement * bugfix * add SP doc, review comments * small changes * review comments, docstrings * refactors, SP mixin * small updates * fix tests * precommit * precommit --------- Co-authored-by: Wing Lian <wing.lian@gmail.com> Co-authored-by: Dan Saunders <dan@axolotl.ai>	2025-03-21 12:43:55 -04:00
Wing Lian	aae4337f40	add 12.8.1 cuda to the base matrix (#2426 ) * add 12.8.1 cuda to the base matrix * use nightly * bump deepspeed and set no binary * deepspeed binary fixes hopefully * install deepspeed by itself * multiline fix * make sure ninja is installed * try with reversion of packaging/setuptools/wheel install * use license instead of license-file * try rolling back packaging and setuptools versions * comment out license for validation for now * make sure packaging version is consistent * more parity across tests and docker images for packaging/setuptools	2025-03-21 10:17:25 -04:00
Wing Lian	38df5a36ea	bump HF versions except for trl (#2427 )	2025-03-20 10:22:05 -04:00
Wing Lian	4d92a68a96	use default torch fused adamw optimizer as default as adamw_hf is deprecated (#2425 ) * use default torch fused adamw optimizer as default as adamw_hf is deprecated * make sure to have latest packaging installed * bump packagingin requirements.txt too	2025-03-19 23:58:33 -04:00
Wing Lian	59899b9817	pass additional info for fix untrained tokens when using distributed + offloading (#2388 ) * pass additional info for fix untrained tokens when using distributed + offloading * use latest version of vendored lib * use v0.0.5 of contribs lgpl * fix for no bad tokens and add tests * use release * add multigpu test too * make sure the multigpu zero3 test actually uses zero3	2025-03-11 12:02:43 -04:00
Wing Lian	ae66374156	Optimizer refactor and add Muon support (#2367 ) * add muon optimizer optimizer_cls_and_kwargs is on trainer_kwargs only add adamw_kwargs if they're non-null fix mocks better handling of override and check the optimizer unwrap optimizer * fix import	2025-03-06 11:49:19 -05:00
Wing Lian	9850f42204	bump liger to 0.5.3 (#2353 )	2025-02-24 12:40:54 -05:00
salman	29b366b2e1	Bumping 0.15.1 TRL version for GRPO+PEFT fix (#2344 ) * bumping TRL version * apply upstream fixes to our custom fix --------- Co-authored-by: Wing Lian <wing@axolotl.ai>	2025-02-20 22:56:04 -05:00
NanoCode012	b53a41372f	feat: update transformers version to 4.49.0 (#2340 )	2025-02-20 21:12:06 -05:00
Wing Lian	ffae8d6a95	GRPO (#2307 )	2025-02-13 16:01:01 -05:00
NanoCode012	e48e2df4dd	feat: update FA to 2.7.4.post1 which includes torch2.6 binary (#2315 )	2025-02-07 21:34:01 -05:00
Wing Lian	b7616022ab	bump transformers to 4.48.3 (#2318 )	2025-02-07 21:33:44 -05:00
NanoCode012	5bbad5ef93	feat: add torch2.6 to ci (#2311 )	2025-02-07 07:28:54 -05:00
Wing Lian	8779997ba5	native support for modal cloud from CLI (#2237 ) * native support for modal cloud from CLI * do lm_eval in cloud too * Fix the sub call to lm-eval * lm_eval option to not post eval, and append not extend * cache bust when using branch, grab sha of latest image tag, update lm-eval dep * allow minimal yaml for lm eval * include modal in requirements * update link in README to include utm * pr feedback * use chat template * revision support * apply chat template as arg * add wandb name support, allow explicit a100-40gb * cloud is optional * handle accidental setting of tasks with a single task str * document the modal cloud yaml for clarity [skip ci] * cli docs * support spawn vs remote for lm-eval * Add support for additional docker commands in modal image build * cloud config shouldn't be a dir * Update README.md Co-authored-by: Charles Frye <cfrye59@gmail.com> * fix annotation args --------- Co-authored-by: Charles Frye <cfrye59@gmail.com>	2025-01-30 11:34:02 -05:00
Wing Lian	0b52f06227	bump bnb to 0.45.1 (#2289 ) [skip ci]	2025-01-28 23:21:25 -05:00
Wing Lian	8a7a0b07dc	support for latest transformers release 4.48.1 (#2256 )	2025-01-23 21:17:57 -05:00
Wing Lian	fb3352e21c	rename liger test so it properly runs in ci (#2246 )	2025-01-09 17:31:43 -05:00
Wing Lian	7669a03fb4	update upstream HF deps (#2239 ) * bump axolotl contribs for upstream main conflicts: * bump datasets, tokenizer, trl * remove log workarounds in trl * bump lm-eval * remove unsloth_ import from critical path * remove llama fa2 from conftest * unsloth breaks with latest upstream	2025-01-09 21:01:59 +00:00
Wing Lian	e0a2eb2ebd	fix untrained tokens if specified explicitly from a list (#2210 )	2024-12-23 09:08:28 -05:00
Wing Lian	d91feaffc8	upgrade to liger 0.5.2 (#2181 ) [skip ci]	2024-12-17 13:58:21 -05:00
Wing Lian	e246ceffa4	use axolotl contribs for fix_untrained_tokens (#2194 ) [skip ci] * use axolotl contribs for fix_untrained_tokens * remove the module we're replacing * Add check for using fix_untrained_tokens	2024-12-17 13:57:16 -05:00
Wing Lian	1f623e6cc8	transformers 4.47.1 (#2187 ) * transformers 4.47.1 * drop monkeypatches * can't remove patches yet * make flash attention forward ignore the loss kwargs * patch the flash attention in the modeling arch too * remove fsdp and deepspeed patches * cleanup PR * bump accelerate and torchao, also logically reorder/group requirements * meant to include torchao * use official patch release	2024-12-17 11:01:21 -05:00
Wing Lian	effc4dc409	pin to 4.47.0 (#2180 )	2024-12-12 20:17:12 -05:00
Wing Lian	40907c6887	upgrade deepspeed to 0.16.1 (#2157 )	2024-12-09 07:25:10 -05:00
Wing Lian	1302e31049	Transformers version flexibility and FSDP optimizer patch (#2155 ) * allow flexibility in transformers version for FSDP * more flexibility with dev versions of 4.47.0.dev0 * add patch for fsdp * fix typo * correct fn name * stray character * fix patch * reset Trainer too * also reset Trainer.training_step * allow tests/patched to run more than one process on e2e runner * skip tests/patched in e2e for now since it's run in regular pytest	2024-12-08 14:50:40 -05:00
Wing Lian	be5f554a62	bump autoawq to 0.2.7.post3 (#2150 )	2024-12-07 22:24:09 -05:00
Wing Lian	743ba62bd5	Transformers 4.47.0 (#2138 ) * bump transformers and trl * fix: update trainer.log signature * fix trl trainer.log interfaces * broken 🦥 with latest transformers * skip parent, call grandparent - yeah, super janky * update HF HUB env var and fix reward trainer log since it doesn't directly override log * also bump accelerate * patches for llama ga * detab the code to check * fix whitespace for patch check * play nicely with CI tests since we patch everytime * fix pop default in case it doesn't exist * more tweaks to make patches nicer in CI * fix detab for when there are possibly multiple patches --------- Co-authored-by: NanoCode012 <nano@axolotl.ai>	2024-12-07 05:03:01 -05:00
Wing Lian	6b3058b2dc	upgrade bnb 0.45.0 and peft 0.14.0 (#2126 ) * upgrade bnb to lastest release * update peft to working supporting commit * bump to latest release of peft==0.14.0	2024-12-06 09:08:55 -05:00
Wing Lian	ce5bcff750	various tests fixes for flakey tests (#2110 ) * add mhenrichsen/alpaca_2k_test with revision dataset download fixture for flaky tests * log slowest tests * pin pynvml==11.5.3 * fix load local hub path * optimize for speed w smaller models and val_set_size * replace pynvml * make the resume from checkpoint e2e faster * make tests smaller	2024-12-02 17:28:58 -05:00
Sunny Liu	bf416bdfd0	bump_liger_0.4.2 (#2096 )	2024-11-21 13:24:52 -05:00
Wing Lian	d9b71edf84	bump transformers for fsdp-grad-accum fix, remove patch (#2079 )	2024-11-19 02:23:09 -05:00
Wing Lian	70cf79ef52	upgrade autoawq==0.2.7.post2 for transformers fix (#2070 ) * point to upstream autoawq for transformers fix * use autoawq 0.2.7 release * test wheel for awq * try different format for wheel def * autoawq re-release * Add intel_extension_for_pytorch dep * ipex gte version * forcefully remove intel-extension-for-pytorch * add -y option to pip uninstall for ipex * use post2 release for autoawq and remove uninstall of ipex	2024-11-18 11:53:37 -05:00
Wing Lian	d42f202046	Fsdp grad accum monkeypatch (#2064 )	2024-11-15 19:11:04 -05:00
Wing Lian	0dabde1962	support for schedule free and e2e ci smoke test (#2066 ) [skip ci] * support for schedule free and e2e ci smoke test * set default lr scheduler to constant in test * ignore duplicate code * fix quotes for config/dict	2024-11-15 19:10:14 -05:00

1 2 3 4

191 Commits