Wing Lian
ee59e4de97
add cu130 + torch 2.9.1 to test matrices ( #3343 )
* add cu130 + torch 2.9.1 to test matrices
* uv can't use pip3 directly
2026-01-05 15:24:29 -05:00
Wing Lian
4e61b8aa23
use updated version of prebuilt wheels for flash attention for cu130 ( #3342 )
* use updated version of prebuilt wheels for flash attention for cu130
* use elif
* fix the uv base installs of FA also
* make wget less verbose
2026-01-05 13:48:12 -05:00
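The entry above picks a prebuilt flash-attention wheel per CUDA version ("use elif"). A minimal sketch of that kind of branching follows; the wheel URL layout, version strings, and cu-tags are illustrative assumptions, not the repo's actual values.

```python
# Sketch: pick a prebuilt flash-attention wheel by CUDA version, mirroring
# the if/elif branching the commit describes. URL pattern and tags are
# hypothetical, for illustration only.
import torch

def flash_attn_wheel_url(fa_version: str = "2.8.3") -> str:
    cuda = torch.version.cuda or ""  # e.g. "13.0", "12.8"; empty on non-CUDA builds
    torch_tag = "torch" + ".".join(torch.__version__.split(".")[:2])
    if cuda.startswith("13."):
        cu_tag = "cu13"
    elif cuda.startswith("12.8") or cuda.startswith("12.9"):
        cu_tag = "cu128"
    elif cuda.startswith("12."):
        cu_tag = "cu126"
    else:
        raise RuntimeError(f"no prebuilt flash-attn wheel for CUDA {cuda!r}")
    # hypothetical release layout
    return (
        "https://example.com/flash-attn/"
        f"flash_attn-{fa_version}+{cu_tag}{torch_tag}-cp311-cp311-linux_x86_64.whl"
    )
```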
Wing Lian
b26ba3a5cb
don't build images w cuda 130 since we don't have flash attention wheels ( #3341 )
2026-01-03 18:08:28 -05:00
Wing Lian
afe18ace35
deprecate torch 2.7.1 ( #3339 )
2026-01-01 06:52:45 -05:00
Wing Lian
e73dab6df9
support pydantic 2.12 ( #3328 )
* upgrade pydantic to 2.12
* use latest modal version
* upgrade modal
* update modal in requirements and loosen pydantic
* upgrade modal too
2025-12-30 12:41:07 -05:00
Wing Lian
11c0b5b256
batch upgrade dependencies ( #3299 )
* upgrade dependencies
* don't use reset sessions
* downgrade transformers, upgrade other deps
* upgrade bnb to 0.49.0
* restore s3 cache
* explicitly use local files w/ hub
* decompress and strip top level dir
* use 2 levels for strip components
* try to preserve permissions for symlinks
* use updated tar
* fix #3293 for distributed
* downgrade bnb
* fast fail after 4
* fix total tokens device
* patch accelerate CP/SP (#3309 )
---------
Co-authored-by: salman <salman.mohammadi@outlook.com>
2025-12-30 09:02:49 -05:00
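Several bullets above deal with unpacking a cache archive while dropping leading directories ("use 2 levels for strip components") and keeping symlinks intact. Below is a minimal Python sketch of the `tar --strip-components=2` idea; the archive name and destination path are assumptions.

```python
# Sketch: extract an archive while dropping the first two path components,
# the tarfile equivalent of `tar --strip-components=2`. Symlink members are
# extracted as links rather than materialized files, which is where the
# "preserve permissions for symlinks" wrinkle comes in.
import os
import tarfile
from pathlib import Path

def extract_strip(archive: str, dest: str, strip: int = 2) -> None:
    dest_path = Path(dest)
    with tarfile.open(archive) as tar:
        for member in tar.getmembers():
            parts = Path(member.name).parts[strip:]
            if not parts:
                continue  # this member *is* a stripped top-level dir
            member.name = str(Path(*parts))
            tar.extract(member, dest_path)

# hypothetical usage against an HF cache tarball
extract_strip("hf-cache.tar.gz", os.path.expanduser("~/.cache/huggingface"))
```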
Wing Lian
efeb5a4e41
fix check for fp8 capability ( #3324 )
* fix check for fp8 capability
* handle non-cuda compute
* reduce concurrency of tests
2025-12-22 13:58:25 -05:00
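A sketch of the shape of this fix: gate fp8 on device compute capability and bail out cleanly on non-CUDA backends. The (8, 9) Ada/Hopper threshold is the usual fp8 cutoff, stated here as an assumption rather than axolotl's exact check.

```python
# Sketch: fp8 capability check with non-CUDA handling. The (8, 9)
# threshold is an assumption (Ada/Hopper-class GPUs).
import torch

def supports_fp8() -> bool:
    if not torch.cuda.is_available():
        return False  # CPU/MPS etc. have no fp8 kernels to probe
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)
```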
Wing Lian
07c41a6c2a
fix preview docs failing due to running out of disk ( #3326 ) [skip ci]
* fix preview docs failing due to running out of disk
* fix docs publish too
2025-12-19 11:34:55 -05:00
Wing Lian
2a664dc8ad
support for xformers wheels for torch 2.9 ( #3308 )
* support for xformers wheels for torch 2.9
* fix hf cache?
* don't use hf cache from s3
* show disk free space in ci
2025-12-11 11:56:40 -05:00
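The "show disk free space in ci" step reduces to a one-liner; a minimal sketch:

```python
# Sketch: print free disk space in a CI step, per "show disk free space in ci".
import shutil

total, used, free = shutil.disk_usage("/")
print(f"disk free: {free / 2**30:.1f} GiB of {total / 2**30:.1f} GiB")
```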
Wing Lian
0b635e69c5
build docker images for 2.9.x ( #3273 )
2025-11-20 09:26:24 -05:00
Wing Lian
0d27e14e45
Torch 2.9.1 base images ( #3268 )
* update torch 2.9.1 base images
* update base dockerfile image check
2025-11-20 09:04:37 -05:00
Wing Lian
a6bafb55cb
upgrade datasets to 4.4.1 ( #3266 )
* upgrade datasets
* cleanup pip cache earlier
* cleanup unused things from worker
* also cleanup sdist
2025-11-14 09:52:14 -08:00
Wing Lian
0fbde69e9c
only push axolotl images, personal repo is deprecated ( #3262 )
* only push axolotl images, personal repo is deprecated
* cleanup
2025-11-14 07:50:03 -08:00
Wing Lian
301e22849f
upgrade to latest deepspeed and make sure latest tagged axolotl images are using torch 2.8.0 ( #3261 )
2025-11-13 13:03:01 -05:00
salman
c37decb073
update pre-commit cadence ( #3245 )
2025-11-04 13:43:40 +00:00
Wing Lian
633afffacb
add torch 2.9.0 to ci ( #3223 )
2025-10-30 18:50:26 -04:00
Wing Lian
4b1b4fa6d8
upgrade numpy ( #3236 )
* upgrade numpy to 2.3.4
* bump contribs for numpy
* fix vllm versions
* bump numba
* make sure psutil is installed
* add psutil to cicd dockerfile jinja
* lower dep versions of numba + numpy for vllm
* bump datasets version
* resolve pydantic conflict too
2025-10-30 10:03:24 -04:00
Wing Lian
a4b921135b
build cuda 13.0.0 base image with 2.9.0 ( #3229 )
* build cuda 13.0.0 base image with 2.9.0
* upgrade causal-conv1d
* 1.5.4 not in pypi yet
* pin to 1.3.0
* use github release instead of pypi
* split the logic for incompatible packages
* fix bash in dockerfile
2025-10-29 18:07:29 -04:00
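The "1.5.4 not in pypi yet" / "use github release instead of pypi" bullets suggest a fallback install path. A hedged sketch of that split logic follows; the release asset URL and version pins are hypothetical, only the Dao-AILab/causal-conv1d repo name is real.

```python
# Sketch: install causal-conv1d from a GitHub release when the needed
# version isn't on PyPI yet, else use the PyPI pin. URL is hypothetical.
import subprocess
import sys

def install_causal_conv1d(cuda_major: int) -> None:
    if cuda_major >= 13:
        # hypothetical release asset built for CUDA 13
        src = ("https://github.com/Dao-AILab/causal-conv1d/releases/download/"
               "v1.5.4/causal_conv1d-1.5.4-cp311-cp311-linux_x86_64.whl")
    else:
        src = "causal-conv1d==1.3.0"  # pinned version available on PyPI
    subprocess.check_call([sys.executable, "-m", "pip", "install", src])
```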
Wing Lian
383f220cfd
build torch 2.9.0 base images ( #3221 )
2025-10-20 08:53:49 -04:00
Wing Lian
409cfb8a87
deprecate torch 2.6.0 support ( #3197 ) [skip ci]
2025-10-07 11:23:41 -04:00
Wing Lian
ce74c20109
don't cache pip install ( #3194 )
* don't cache pip install
* no cache dir for disk space for sdist too
2025-10-01 11:11:39 -04:00
salman
58d67bf98d
Migrate QAT API; fix axolotl quantize for QAT-ed models; add NVFP4 ( #3107 )
2025-09-12 10:55:50 +01:00
Wing Lian
06bebcb65f
run cu128-2.8.0 e2e tests on B200 ( #3126 )
* run cu128-2.8.0 e2e tests on B200
* not an int 🤦
* fix yaml
2025-09-02 13:13:23 -04:00
Wing Lian
6afba3871d
Add support for PyTorch 2.8.0 ( #3106 )
* Add support for PyTorch 2.8.0
* loosen triton requirements
* handle torch 2.8.0 in setup.py
* fix versions
* no vllm for torch 2.8.0
* remove comment
Co-authored-by: NanoCode012 <nano@axolotl.ai>
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-08-28 09:10:40 -04:00
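"handle torch 2.8.0 in setup.py" is the recurring pattern of version-gating pinned dependencies on the installed torch. A minimal sketch, with illustrative pins that stand in for whatever the real setup.py resolves:

```python
# Sketch: branch extra requirement pins on the installed torch version,
# the kind of logic setup.py grows when a new PyTorch lands. Pins are
# illustrative, not axolotl's actual constraints.
from importlib.metadata import PackageNotFoundError, version
from packaging.version import Version

def torch_gated_requirements() -> list[str]:
    try:
        torch_ver = Version(version("torch"))
    except PackageNotFoundError:
        return []  # torch not installed yet; defer pinning
    if torch_ver >= Version("2.8.0"):
        # e.g. looser triton, no vllm pin yet (per the commit notes)
        return ["triton>=3.4.0"]
    return ["triton==3.3.1", "vllm==0.9.1"]
```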
salman
d1de6f5f3d
Add option to skip slow tests in PRs ( #3060 ) [skip ci]
* testing e2e skip [skip-e2e]
* testing e2e skip [skip-e2e]
* testing e2e skip [skip-e2e]
* testing e2e skip [skip-e2e]
* testing e2e skip [skip-e2e]
* testing e2e skip [skip-e2e]
* testing e2e skip [skip-e2e]
* testing e2e skip [skip-e2e]
* testing e2e skip [skip-e2e]
* testing e2e skip [skip-e2e]
* testing e2e skip [skip-e2e]
* stop running multigpu [skip-e2e]
* should work now [skip-e2e]
* reverting [skip-e2e]
* testing [skip-e2e]
* debug [skip-e2e]
* debug [skip-e2e]
* round 2[skip-e2e]
* removing debug [skip-e2e]
* support skipping whole PR [skip-e2e]
* use script for e2e skip [skip-e2e]
* contributing [skip-e2e]
* contributing [skip-e2e]
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-08-13 22:57:51 -04:00
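A sketch of the kind of helper "use script for e2e skip" hints at: scan the PR title or commit message for the `[skip-e2e]` marker and emit a flag for the workflow. The environment variable names are assumptions for illustration.

```python
# Sketch: decide whether to skip e2e tests based on a [skip-e2e] marker.
# PR_TITLE / COMMIT_MESSAGE env var names are hypothetical.
import os
import sys

def should_skip_e2e() -> bool:
    haystack = " ".join(
        os.environ.get(key, "") for key in ("PR_TITLE", "COMMIT_MESSAGE")
    )
    return "[skip-e2e]" in haystack

if __name__ == "__main__":
    print("skip-e2e=true" if should_skip_e2e() else "skip-e2e=false")
    sys.exit(0)
```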
Wing Lian
686933194e
fix vllm tagging and add cloud images w/o tmux ( #3049 ) [skip ci]
2025-08-10 20:21:56 -04:00
Wing Lian
05f1b4b2e8
run monkeypatch tests in separate runner ( #3047 )
2025-08-09 14:34:07 -04:00
Wing Lian
c5e5aba547
Add 2.8.0 base images and uv images ( #3034 )
2025-08-08 02:30:16 -04:00
Wing Lian
10946afae7
fixes for spinning up vllm service for grpo ( #3001 )
2025-08-02 11:19:24 -04:00
salman
09dda462ab
Fix: don't preview docs for contributors ( #2994 ) [skip ci]
* checking against fork vs. main repo
* force doc preview
2025-07-31 11:12:41 -04:00
Wing Lian
1d2aa1e467
upgrade to support latest transformers release ( #2984 )
* upgrade to support latest transformers release
* bump mistral common too
* Fix dependencies
2025-07-27 17:05:12 -04:00
Wing Lian
add3e5076b
don't publish to netlify on contributor submissions since it requires auth tokens ( #2985 ) [skip ci]
* don't publish to netlify on contributor submissions since it requires auth tokens
* fix no-tmux build and add contact to motd
2025-07-27 17:04:27 -04:00
salman
1407aac779
Skip CI for draft PRs ( #2970 )
2025-07-24 09:11:46 +01:00
Wing Lian
d32058e149
include torchvision in build for upstream changes requiring it now ( #2953 ) [skip ci]
2025-07-22 04:19:16 -04:00
Wing Lian
8a4bcacdb2
cu126-torch271 for cloud docker image should be tagged with main-latest ( #2935 )
2025-07-17 00:01:23 -04:00
Wing Lian
d2c3d5a954
run nightly-vs-upstream-main on 2.7.1 and multi-gpu also ( #2929 ) [skip ci]
2025-07-16 21:45:42 -04:00
Wing Lian
942005f526
use modal==1.0.2 for nightlies and for cli ( #2925 ) [skip ci]
* use modal==1.0.2 for nightlies and for cli
* use latest cce fork for upstream changes
* increase timeout
2025-07-15 20:31:23 -04:00
Wing Lian
7dc3ac6cb3
update nightlies builds ( #2921 ) [skip ci]
2025-07-14 20:10:43 -04:00
Wing Lian
5081db7f8a
upgrade trl==0.19.1 ( #2892 ) [skip ci]
* upgrade trl==0.19.1
* add vllm for tests for grpo
* fixes to work with latest trl
* need data_parallel_size config too
* support for vllm_mode for server / colocate
* vllm settings for colocate
* relax vllm version
* bump min hf hub for latest vllm support
* add hints on string literal for vllm mode
* use latest transformers 4.53.2
* tweak acceptable loss on flaky test_ds_zero3_packed test
* don't run flaky vllm/grpo tests for now
2025-07-14 09:23:42 -04:00
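"add hints on string literal for vllm mode" maps naturally to a Literal-typed pydantic field. A minimal sketch, assuming hypothetical model and field names rather than axolotl's actual schema:

```python
# Sketch: a Literal-typed field gives pydantic validation and editor hints
# for the "server"/"colocate" vllm modes. Model/field names are assumptions.
from typing import Literal, Optional

from pydantic import BaseModel

class GRPOVllmConfig(BaseModel):
    vllm_mode: Optional[Literal["server", "colocate"]] = None
    data_parallel_size: int = 1

cfg = GRPOVllmConfig(vllm_mode="colocate")  # "colocated" would fail validation
```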
salman
03b2a113fe
Update doc preview workflow to use sticky comments ( #2873 )
2025-07-11 14:08:35 +01:00
Wing Lian
c6d69d5c1b
release v0.11.0 ( #2875 )
* release v0.11.0
* don't build vllm into release for now
* remove 2.5.1 references
* smollm3 multipack support
* fix ordering of e2e tests
2025-07-09 09:22:35 -04:00
Wing Lian
4ff96a2526
fix xformers version ( #2888 )
2025-07-09 08:43:40 -04:00
salman
89e99eaaa7
slowest durations ( #2887 ) [skip ci]
2025-07-09 08:43:26 -04:00
Wing Lian
6ed501f6dc
add 2.7.0 torch images back to support vllm ( #2885 )
2025-07-08 16:28:14 -04:00
Wing Lian
a5946ff1f0
build fa2 from source for base image with torch2.6 and cu124 ( #2867 )
2025-07-05 09:21:18 -04:00
Wing Lian
70ca1b2291
fix nightlies to use correct cache ( #2848 ) [skip ci]
* fix nightlies to use correct cache
* fix for handling None for bf16
2025-07-03 12:21:39 -04:00
Wing Lian
cb811f8bf1
upgrade to flash-attn 2.8.0.post2 ( #2828 )
* upgrade to flash-attn 2.8.0.post2
* use cu126 with torch 2.6
* seems vllm 0.8.5.post1 not compatible with cuda12.6.3 and torch 2.6
* cu126 + torch 2.6 as the default
* use cu126 for multigpu w torch 2.6 too
* drop vllm for now from ci for now
2025-06-29 22:11:16 -04:00
Dan Saunders
06a648263b
Config doc autogen: follow-up fix docs build ( #2806 )
* config reference doc autogen
* improvements
* cleanup; still ugly but working
* reformat
* remove autogen config ref from git
* factor out validations
* rewrite
* rewrite
* cleanup
* progress
* progress
* progress
* lint and minifying somewhat
* remove unneeded
* coderabbit
* coderabbit
* update preview-docs workflow triggers
* installing with deps
* coderabbit
* update refs
* overwrote file accidentally
* docs install deps
2025-06-18 15:42:54 -04:00
Dan Saunders
9d5bfc127e
Config doc autogen ( #2718 )
* config reference doc autogen
* improvements
* cleanup; still ugly but working
* reformat
* remove autogen config ref from git
* factor out validations
* rewrite
* rewrite
* cleanup
* progress
* progress
* progress
* lint and minifying somewhat
* remove unneeded
* coderabbit
* coderabbit
* update preview-docs workflow triggers
* installing with deps
* coderabbit
* update refs
* overwrote file accidentally
2025-06-18 15:36:53 -04:00
Wing Lian
ccc94da8ad
KD fix w/ online distillation ( #2700 ) [skip ci]
* kd fixes
* fix collator setup
* fix input args
* better handling to drop string fields for kd with raw dataset
* kd trainer has kd temp as part of the init
* drop top_k before softmax
* simplify and remove zscore
* WIP chunked KD loss with autograd wrapper
* more fixes and liger-type chunked loss
* collator cls for plugins
* remove debugging
* additional plugin collator kwargs, don't scale up kd loss by t^2
* don't need temp arg to distill method
* online kd wip
* add close to comment block
* support sampling params/max new tokens
* handle when no custom collator is used in plugins
* logsumexp trick
* fix check
* shift off the first empty token
* fix length of padding
* use max not min
* temp scale kd loss at end
* support for dynamic plugin training args mixins and symmetric kl
* chore: lint
* fix trainer callback base class
* Fix decay
* accept compressed responses for smaller wire payload
* post-rebase lint
* more KD updates
* increase hyperparams_count for gradients for added normalize_topk
* fix to remove attention_mask
* rename vars for consistency
* fix rebase issues
* default to dropping last batch in multipack batch sampler
* improve handling of train len
* init collator_cls_and_kwargs
* explicit drop_last=False when checking for multipack completeness
* use separate v2 loader for kd
* fix kd tests to use subprocess so it picks up kd training args
* default value for kd_beta arg
* use updated dataset for ci
* longer timeout for e2e
2025-06-17 12:09:13 -04:00
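Several bullets above (top-k, logsumexp trick, "temp scale kd loss at end") describe a temperature-scaled top-k KD loss. A minimal sketch of that shape follows, assuming teacher top-k log-probs arrive already temperature-scaled; it is an illustration of the technique, not axolotl's chunked/liger-style implementation.

```python
# Minimal sketch: temperature-scaled top-k KD loss with the logsumexp trick —
# student log-probs at the teacher's top-k ids are logits[ids] - logsumexp(logits),
# so no full softmax over the vocab is materialized.
import torch

def topk_kd_loss(
    student_logits: torch.Tensor,         # (batch, seq, vocab)
    teacher_topk_logprobs: torch.Tensor,  # (batch, seq, k), assumed already at temperature
    teacher_topk_ids: torch.Tensor,       # (batch, seq, k)
    temperature: float = 1.0,
) -> torch.Tensor:
    s = student_logits / temperature
    # student log-prob of each teacher top-k token, via the logsumexp trick
    s_at_k = torch.gather(s, dim=-1, index=teacher_topk_ids)
    s_logprobs = s_at_k - torch.logsumexp(s, dim=-1, keepdim=True)
    t_probs = teacher_topk_logprobs.exp()
    # forward KL restricted to the teacher's top-k support
    kl = (t_probs * (teacher_topk_logprobs - s_logprobs)).sum(-1)
    return kl.mean() * temperature**2  # "temp scale kd loss at end"
```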