Wing Lian
ee59e4de97
add cu130 + torch 2.9.1 to test matrices ( #3343 )
* add cu130 + torch 2.9.1 to test matrices
* uv can't use pip3 directly
2026-01-05 15:24:29 -05:00
Wing Lian
4e61b8aa23
use updated version of prebuilt wheels for flash attention for cu130 ( #3342 )
* use updated version of prebuilt wheels for flash attention for cu130
* use elif
* fix the uv base installs of FA also
* make wget less verbose
2026-01-05 13:48:12 -05:00
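The entry above picks a prebuilt flash-attention wheel per CUDA version ("use elif"). A minimal sketch of that kind of branching follows; the wheel URL layout, version strings, and cu-tags are illustrative assumptions, not the repo's actual values.

```python
# Sketch: pick a prebuilt flash-attention wheel by CUDA version, mirroring
# the if/elif branching the commit describes. URL pattern and tags are
# hypothetical, for illustration only.
import torch

def flash_attn_wheel_url(fa_version: str = "2.8.3") -> str:
    cuda = torch.version.cuda or ""  # e.g. "13.0", "12.8"; empty on non-CUDA builds
    torch_tag = "torch" + ".".join(torch.__version__.split(".")[:2])
    if cuda.startswith("13."):
        cu_tag = "cu13"
    elif cuda.startswith("12.8") or cuda.startswith("12.9"):
        cu_tag = "cu128"
    elif cuda.startswith("12."):
        cu_tag = "cu126"
    else:
        raise RuntimeError(f"no prebuilt flash-attn wheel for CUDA {cuda!r}")
    # hypothetical release layout
    return (
        "https://example.com/flash-attn/"
        f"flash_attn-{fa_version}+{cu_tag}{torch_tag}-cp311-cp311-linux_x86_64.whl"
    )
```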
Wing Lian
b26ba3a5cb
don't build images w cuda 130 since we don't have flash attention wheels ( #3341 )
2026-01-03 18:08:28 -05:00
Wing Lian
afe18ace35
deprecate torch 2.7.1 ( #3339 )
2026-01-01 06:52:45 -05:00
Wing Lian
e73dab6df9
support pydantic 2.12 ( #3328 )
* upgrade pydantic to 2.12
* use latest modal version
* upgrade modal
* update modal in requirements and loosen pydantic
* upgrade modal too
2025-12-30 12:41:07 -05:00
Wing Lian
11c0b5b256
batch upgrade dependencies ( #3299 )
* upgrade dependencies
* don't use reset sessions
* downgrade transformers, upgrade other deps
* upgrade bnb to 0.49.0
* restore s3 cache
* explicitly use local files w/ hub
* decompress and strip top level dir
* use 2 levels for strip components
* try to preserve permissions for symlinks
* use updated tar
* fix #3293 for distributed
* downgrade bnb
* fast fail after 4
* fix total tokens device
* patch accelerate CP/SP (#3309 )
---------
Co-authored-by: salman <salman.mohammadi@outlook.com>
2025-12-30 09:02:49 -05:00
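Several bullets above deal with unpacking a cache archive while dropping leading directories ("use 2 levels for strip components") and keeping symlinks intact. Below is a minimal Python sketch of the `tar --strip-components=2` idea; the archive name and destination path are assumptions.

```python
# Sketch: extract an archive while dropping the first two path components,
# the tarfile equivalent of `tar --strip-components=2`. Symlink members are
# extracted as links rather than materialized files, which is where the
# "preserve permissions for symlinks" wrinkle comes in.
import os
import tarfile
from pathlib import Path

def extract_strip(archive: str, dest: str, strip: int = 2) -> None:
    dest_path = Path(dest)
    with tarfile.open(archive) as tar:
        for member in tar.getmembers():
            parts = Path(member.name).parts[strip:]
            if not parts:
                continue  # this member *is* a stripped top-level dir
            member.name = str(Path(*parts))
            tar.extract(member, dest_path)

# hypothetical usage against an HF cache tarball
extract_strip("hf-cache.tar.gz", os.path.expanduser("~/.cache/huggingface"))
```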
Wing Lian
efeb5a4e41
fix check for fp8 capability ( #3324 )
* fix check for fp8 capability
* handle non-cuda compute
* reduce concurrency of tests
2025-12-22 13:58:25 -05:00
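A sketch of the shape of this fix: gate fp8 on device compute capability and bail out cleanly on non-CUDA backends. The (8, 9) Ada/Hopper threshold is the usual fp8 cutoff, stated here as an assumption rather than axolotl's exact check.

```python
# Sketch: fp8 capability check with non-CUDA handling. The (8, 9)
# threshold is an assumption (Ada/Hopper-class GPUs).
import torch

def supports_fp8() -> bool:
    if not torch.cuda.is_available():
        return False  # CPU/MPS etc. have no fp8 kernels to probe
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)
```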
Wing Lian
07c41a6c2a
fix preview docs failing due to running out of disk ( #3326 ) [skip ci]
* fix preview docs failing due to running out of disk
* fix docs publish too
2025-12-19 11:34:55 -05:00
Wing Lian
2a664dc8ad
support for xformers wheels for torch 2.9 ( #3308 )
* support for xformers wheels for torch 2.9
* fix hf cache?
* don't use hf cache from s3
* show disk free space in ci
2025-12-11 11:56:40 -05:00
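The "show disk free space in ci" step reduces to a one-liner; a minimal sketch:

```python
# Sketch: print free disk space in a CI step, per "show disk free space in ci".
import shutil

total, used, free = shutil.disk_usage("/")
print(f"disk free: {free / 2**30:.1f} GiB of {total / 2**30:.1f} GiB")
```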
Wing Lian
0b635e69c5
build docker images for 2.9.x ( #3273 )
2025-11-20 09:26:24 -05:00
Wing Lian
0d27e14e45
Torch 2.9.1 base images ( #3268 )
* update torch 2.9.1 base images
* update base dockerfile image check
2025-11-20 09:04:37 -05:00
Wing Lian
a6bafb55cb
upgrade datasets to 4.4.1 ( #3266 )
* upgrade datasets
* cleanup pip cache earlier
* cleanup unused things from worker
* also cleanup sdist
2025-11-14 09:52:14 -08:00
Wing Lian
0fbde69e9c
only push axolotl images, personal repo is deprecated ( #3262 )
* only push axolotl images, personal repo is deprecated
* cleanup
2025-11-14 07:50:03 -08:00
Wing Lian
301e22849f
upgrade to latest deepspeed and make sure latest tagged axolotl images are using torch 2.8.0 ( #3261 )
2025-11-13 13:03:01 -05:00
salman
c37decb073
update pre-commit cadence ( #3245 )
2025-11-04 13:43:40 +00:00
Wing Lian
633afffacb
add torch 2.9.0 to ci ( #3223 )
2025-10-30 18:50:26 -04:00
Wing Lian
4b1b4fa6d8
upgrade numpy ( #3236 )
* upgrade numpy to 2.3.4
* bump contribs for numpy
* fix vllm versions
* bump numba
* make sure psutil is installed
* add psutil to cicd dockerfile jinja
* lower dep versions of numba + numpy for vllm
* bump datasets version
* resolve pydantic conflict too
2025-10-30 10:03:24 -04:00
Wing Lian
a4b921135b
build cuda 13.0.0 base image with 2.9.0 ( #3229 )
* build cuda 13.0.0 base image with 2.9.0
* upgrade causal-conv1d
* 1.5.4 not in pypi yet
* pin to 1.3.0
* use github release instead of pypi
* split the logic for incompatible packages
* fix bash in dockerfile
2025-10-29 18:07:29 -04:00
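The "1.5.4 not in pypi yet" / "use github release instead of pypi" bullets suggest a fallback install path. A hedged sketch of that split logic follows; the release asset URL and version pins are hypothetical, only the Dao-AILab/causal-conv1d repo name is real.

```python
# Sketch: install causal-conv1d from a GitHub release when the needed
# version isn't on PyPI yet, else use the PyPI pin. URL is hypothetical.
import subprocess
import sys

def install_causal_conv1d(cuda_major: int) -> None:
    if cuda_major >= 13:
        # hypothetical release asset built for CUDA 13
        src = ("https://github.com/Dao-AILab/causal-conv1d/releases/download/"
               "v1.5.4/causal_conv1d-1.5.4-cp311-cp311-linux_x86_64.whl")
    else:
        src = "causal-conv1d==1.3.0"  # pinned version available on PyPI
    subprocess.check_call([sys.executable, "-m", "pip", "install", src])
```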
Wing Lian
383f220cfd
build torch 2.9.0 base images ( #3221 )
2025-10-20 08:53:49 -04:00
Wing Lian
409cfb8a87
deprecate torch 2.6.0 support ( #3197 ) [skip ci]
2025-10-07 11:23:41 -04:00
Wing Lian
ce74c20109
don't cache pip install ( #3194 )
* don't cache pip install
* no cache dir for disk space for sdist too
2025-10-01 11:11:39 -04:00
salman
58d67bf98d
Migrate QAT API; fix axolotl quantize for QAT-ed models; add NVFP4 ( #3107 )
2025-09-12 10:55:50 +01:00
Wing Lian
06bebcb65f
run cu128-2.8.0 e2e tests on B200 ( #3126 )
* run cu128-2.8.0 e2e tests on B200
* not an int 🤦
* fix yaml
2025-09-02 13:13:23 -04:00
Wing Lian
6afba3871d
Add support for PyTorch 2.8.0 ( #3106 )
* Add support for PyTorch 2.8.0
* loosen triton requirements
* handle torch 2.8.0 in setup.py
* fix versions
* no vllm for torch 2.8.0
* remove comment
Co-authored-by: NanoCode012 <nano@axolotl.ai>
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-08-28 09:10:40 -04:00
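"handle torch 2.8.0 in setup.py" is the recurring pattern of version-gating pinned dependencies on the installed torch. A minimal sketch, with illustrative pins that stand in for whatever the real setup.py resolves:

```python
# Sketch: branch extra requirement pins on the installed torch version,
# the kind of logic setup.py grows when a new PyTorch lands. Pins are
# illustrative, not axolotl's actual constraints.
from importlib.metadata import PackageNotFoundError, version
from packaging.version import Version

def torch_gated_requirements() -> list[str]:
    try:
        torch_ver = Version(version("torch"))
    except PackageNotFoundError:
        return []  # torch not installed yet; defer pinning
    if torch_ver >= Version("2.8.0"):
        # e.g. looser triton, no vllm pin yet (per the commit notes)
        return ["triton>=3.4.0"]
    return ["triton==3.3.1", "vllm==0.9.1"]
```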
salman
d1de6f5f3d
Add option to skip slow tests in PRs ( #3060 ) [skip ci]
* testing e2e skip [skip-e2e]
* testing e2e skip [skip-e2e]
* testing e2e skip [skip-e2e]
* testing e2e skip [skip-e2e]
* testing e2e skip [skip-e2e]
* testing e2e skip [skip-e2e]
* testing e2e skip [skip-e2e]
* testing e2e skip [skip-e2e]
* testing e2e skip [skip-e2e]
* testing e2e skip [skip-e2e]
* testing e2e skip [skip-e2e]
* stop running multigpu [skip-e2e]
* should work now [skip-e2e]
* reverting [skip-e2e]
* testing [skip-e2e]
* debug [skip-e2e]
* debug [skip-e2e]
* round 2[skip-e2e]
* removing debug [skip-e2e]
* support skipping whole PR [skip-e2e]
* use script for e2e skip [skip-e2e]
* contributing [skip-e2e]
* contributing [skip-e2e]
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-08-13 22:57:51 -04:00
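A sketch of the kind of helper "use script for e2e skip" hints at: scan the PR title or commit message for the `[skip-e2e]` marker and emit a flag for the workflow. The environment variable names are assumptions for illustration.

```python
# Sketch: decide whether to skip e2e tests based on a [skip-e2e] marker.
# PR_TITLE / COMMIT_MESSAGE env var names are hypothetical.
import os
import sys

def should_skip_e2e() -> bool:
    haystack = " ".join(
        os.environ.get(key, "") for key in ("PR_TITLE", "COMMIT_MESSAGE")
    )
    return "[skip-e2e]" in haystack

if __name__ == "__main__":
    print("skip-e2e=true" if should_skip_e2e() else "skip-e2e=false")
    sys.exit(0)
```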
Wing Lian
686933194e
fix vllm tagging and add cloud images w/o tmux ( #3049 ) [skip ci]
2025-08-10 20:21:56 -04:00
Wing Lian
05f1b4b2e8
run monkeypatch tests in separate runner ( #3047 )
2025-08-09 14:34:07 -04:00
Wing Lian
c5e5aba547
Add 2.8.0 base images and uv images ( #3034 )
2025-08-08 02:30:16 -04:00
Wing Lian
10946afae7
fixes for spinning up vllm service for grpo ( #3001 )
2025-08-02 11:19:24 -04:00
salman
09dda462ab
Fix: don't preview docs for contributors ( #2994 ) [skip ci]
* checking against fork vs. main repo
* force doc preview
2025-07-31 11:12:41 -04:00
Wing Lian
1d2aa1e467
upgrade to support latest transformers release ( #2984 )
* upgrade to support latest transformers release
* bump mistral common too
* Fix dependencies
2025-07-27 17:05:12 -04:00
Wing Lian
add3e5076b
don't publish to netlify on contributor submissions since it requires auth tokens ( #2985 ) [skip ci]
* don't publish to netlify on contributor submissions since it requires auth tokens
* fix no-tmux build and add contact to motd
2025-07-27 17:04:27 -04:00
salman
1407aac779
Skip CI for draft PRs ( #2970 )
2025-07-24 09:11:46 +01:00
Wing Lian
d32058e149
include torchvision in build for upstream changes requiring it now ( #2953 ) [skip ci]
2025-07-22 04:19:16 -04:00
Wing Lian
8a4bcacdb2
cu126-torch271 for cloud docker image should be tagged with main-latest ( #2935 )
2025-07-17 00:01:23 -04:00
Wing Lian
d2c3d5a954
run nightly-vs-upstream-main on 2.7.1 and multi-gpu also ( #2929 ) [skip ci]
2025-07-16 21:45:42 -04:00
Wing Lian
942005f526
use modal==1.0.2 for nightlies and for cli ( #2925 ) [skip ci]
* use modal==1.0.2 for nightlies and for cli
* use latest cce fork for upstream changes
* increase timeout
2025-07-15 20:31:23 -04:00
Wing Lian
7dc3ac6cb3
update nightlies builds ( #2921 ) [skip ci]
2025-07-14 20:10:43 -04:00
Wing Lian
5081db7f8a
upgrade trl==0.19.1 ( #2892 ) [skip ci]
* upgrade trl==0.19.1
* add vllm for tests for grpo
* fixes to work with latest trl
* need data_parallel_size config too
* support for vllm_mode for server / colocate
* vllm settings for colocate
* relax vllm version
* bump min hf hub for latest vllm support
* add hints on string literal for vllm mode
* use latest transformers 4.53.2
* tweak acceptable loss on flaky test_ds_zero3_packed test
* don't run flaky vllm/grpo tests for now
2025-07-14 09:23:42 -04:00
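"add hints on string literal for vllm mode" maps naturally to a Literal-typed pydantic field. A minimal sketch, assuming hypothetical model and field names rather than axolotl's actual schema:

```python
# Sketch: a Literal-typed field gives pydantic validation and editor hints
# for the "server"/"colocate" vllm modes. Model/field names are assumptions.
from typing import Literal, Optional

from pydantic import BaseModel

class GRPOVllmConfig(BaseModel):
    vllm_mode: Optional[Literal["server", "colocate"]] = None
    data_parallel_size: int = 1

cfg = GRPOVllmConfig(vllm_mode="colocate")  # "colocated" would fail validation
```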
salman
03b2a113fe
Update doc preview workflow to use sticky comments ( #2873 )
2025-07-11 14:08:35 +01:00
Wing Lian
c6d69d5c1b
release v0.11.0 ( #2875 )
* release v0.11.0
* don't build vllm into release for now
* remove 2.5.1 references
* smollm3 multipack support
* fix ordering of e2e tests
2025-07-09 09:22:35 -04:00
Wing Lian
4ff96a2526
fix xformers version ( #2888 )
2025-07-09 08:43:40 -04:00
salman
89e99eaaa7
slowest durations ( #2887 ) [skip ci]
2025-07-09 08:43:26 -04:00
Wing Lian
6ed501f6dc
add 2.7.0 torch images back to support vllm ( #2885 )
2025-07-08 16:28:14 -04:00
Wing Lian
a5946ff1f0
build fa2 from source for base image with torch2.6 and cu124 ( #2867 )
2025-07-05 09:21:18 -04:00
Wing Lian
70ca1b2291
fix nightlies to use correct cache ( #2848 ) [skip ci]
* fix nightlies to use correct cache
* fix for handling None for bf16
2025-07-03 12:21:39 -04:00
Wing Lian
cb811f8bf1
upgrade to flash-attn 2.8.0.post2 ( #2828 )
* upgrade to flash-attn 2.8.0.post2
* use cu126 with torch 2.6
* seems vllm 0.8.5.post1 not compatible with cuda12.6.3 and torch 2.6
* cu126 + torch 2.6 as the default
* use cu126 for multigpu w torch 2.6 too
* drop vllm for now from ci for now
2025-06-29 22:11:16 -04:00
Dan Saunders
06a648263b
Config doc autogen: follow-up fix docs build ( #2806 )
* config reference doc autogen
* improvements
* cleanup; still ugly but working
* reformat
* remove autogen config ref from git
* factor out validations
* rewrite
* rewrite
* cleanup
* progress
* progress
* progress
* lint and minifying somewhat
* remove unneeded
* coderabbit
* coderabbit
* update preview-docs workflow triggers
* installing with deps
* coderabbit
* update refs
* overwrote file accidentally
* docs install deps
2025-06-18 15:42:54 -04:00
Dan Saunders
9d5bfc127e
Config doc autogen ( #2718 )
* config reference doc autogen
* improvements
* cleanup; still ugly but working
* reformat
* remove autogen config ref from git
* factor out validations
* rewrite
* rewrite
* cleanup
* progress
* progress
* progress
* lint and minifying somewhat
* remove unneeded
* coderabbit
* coderabbit
* update preview-docs workflow triggers
* installing with deps
* coderabbit
* update refs
* overwrote file accidentally
2025-06-18 15:36:53 -04:00
Wing Lian
ccc94da8ad
KD fix w/ online distillation ( #2700 ) [skip ci]
* kd fixes
* fix collator setup
* fix input args
* better handling to drop string fields for kd with raw dataset
* kd trainer has kd temp as part of the init
* drop top_k before softmax
* simplify and remove zscore
* WIP chunked KD loss with autograd wrapper
* more fixes and liger-type chunked loss
* collator cls for plugins
* remove debugging
* additional plugin collator kwargs, don't scale up kd loss by t^2
* don't need temp arg to distill method
* online kd wip
* add close to comment block
* support sampling params/max new tokens
* handle when no custom collator is used in plugins
* logsumexp trick
* fix check
* shift off the first empty token
* fix length of padding
* use max not min
* temp scale kd loss at end
* support for dynamic plugin training args mixins and symmetric kl
* chore: lint
* fix trainer callback base class
* Fix decay
* accept compressed responses for smaller wire payload
* post-rebase lint
* more KD updates
* increase hyperparams_count for gradients for added normalize_topk
* fix to remove attention_mask
* rename vars for consistency
* fix rebase issues
* default to dropping last batch in multipack batch sampler
* improve handling of train len
* init collator_cls_and_kwargs
* explicit drop_last=False when checking for multipack completeness
* use separate v2 loader for kd
* fix kd tests to use subprocess so it picks up kd training args
* default value for kd_beta arg
* use updated dataset for ci
* longer timeout for e2e
2025-06-17 12:09:13 -04:00
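Several bullets above (top-k, logsumexp trick, "temp scale kd loss at end") describe a temperature-scaled top-k KD loss. A minimal sketch of that shape follows, assuming teacher top-k log-probs arrive already temperature-scaled; it is an illustration of the technique, not axolotl's chunked/liger-style implementation.

```python
# Minimal sketch: temperature-scaled top-k KD loss with the logsumexp trick —
# student log-probs at the teacher's top-k ids are logits[ids] - logsumexp(logits),
# so no full softmax over the vocab is materialized.
import torch

def topk_kd_loss(
    student_logits: torch.Tensor,         # (batch, seq, vocab)
    teacher_topk_logprobs: torch.Tensor,  # (batch, seq, k), assumed already at temperature
    teacher_topk_ids: torch.Tensor,       # (batch, seq, k)
    temperature: float = 1.0,
) -> torch.Tensor:
    s = student_logits / temperature
    # student log-prob of each teacher top-k token, via the logsumexp trick
    s_at_k = torch.gather(s, dim=-1, index=teacher_topk_ids)
    s_logprobs = s_at_k - torch.logsumexp(s, dim=-1, keepdim=True)
    t_probs = teacher_topk_logprobs.exp()
    # forward KL restricted to the teacher's top-k support
    kl = (t_probs * (teacher_topk_logprobs - s_logprobs)).sum(-1)
    return kl.mean() * temperature**2  # "temp scale kd loss at end"
```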