Commit Graph

257 Commits

Author SHA1 Message Date
salman
c37decb073 update pre-commit cadence (#3245) 2025-11-04 13:43:40 +00:00
Wing Lian
633afffacb add torch 2.9.0 to ci (#3223) 2025-10-30 18:50:26 -04:00
Wing Lian
4b1b4fa6d8 upgrade numpy (#3236)
* upgrade numpy to 2.3.4

* bump contribs for numpy

* fix vllm versions

* bump numba

* make sure psutil is installed

* add psutil to cicd dockerfile jinja

* lower dep versions of numba + numpy for vllm

* bump datasets version

* resolve pydantic conflict too
2025-10-30 10:03:24 -04:00
Wing Lian
a4b921135b build cuda 13.0.0 base image with 2.9.0 (#3229)
* build cuda 13.0.0 base image with 2.9.0

* upgrade causal-conv1d

* 1.5.4 not in pypi yet

* pin to 1.3.0

* use github release instead of pypi

* split the logic for incompatible packages

* fix bash in dockerfile
2025-10-29 18:07:29 -04:00
Wing Lian
383f220cfd build torch 2.9.0 base images (#3221) 2025-10-20 08:53:49 -04:00
Wing Lian
409cfb8a87 deprecate torch 2.6.0 support (#3197) [skip ci] 2025-10-07 11:23:41 -04:00
Wing Lian
ce74c20109 don't cache pip install (#3194)
* don't cache pip install

* no cache dir for disk space for sdist too
2025-10-01 11:11:39 -04:00
salman
58d67bf98d Migrate QAT API; fix axolotl quantize for QAT-ed models; add NVFP4 (#3107) 2025-09-12 10:55:50 +01:00
Wing Lian
06bebcb65f run cu128-2.8.0 e2e tests on B200 (#3126)
* run cu128-2.8.0 e2e tests on B200

* not an int 🤦

* fix yaml
2025-09-02 13:13:23 -04:00
Wing Lian
6afba3871d Add support for PyTorch 2.8.0 (#3106)
* Add support for PyTorch 2.8.0

* loosen triton requirements

* handle torch 2.8.0 in setup.py

* fix versions

* no vllm for torch 2.8.0

* remove comment

Co-authored-by: NanoCode012 <nano@axolotl.ai>

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-08-28 09:10:40 -04:00
salman
d1de6f5f3d Add option to skip slow tests in PRs (#3060) [skip ci]
* testing e2e skip [skip-e2e]

* testing e2e skip [skip-e2e]

* testing e2e skip [skip-e2e]

* testing e2e skip [skip-e2e]

* testing e2e skip [skip-e2e]

* testing e2e skip [skip-e2e]

* testing e2e skip [skip-e2e]

* testing e2e skip [skip-e2e]

* testing e2e skip [skip-e2e]

* testing e2e skip [skip-e2e]

* testing e2e skip [skip-e2e]

* stop running multigpu [skip-e2e]

* should work now [skip-e2e]

* reverting [skip-e2e]

* testing [skip-e2e]

* debug [skip-e2e]

* debug [skip-e2e]

* round 2 [skip-e2e]

* removing debug [skip-e2e]

* support skipping whole PR [skip-e2e]

* use script for e2e skip [skip-e2e]

* contributing [skip-e2e]

* contributing [skip-e2e]

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-08-13 22:57:51 -04:00
Wing Lian
686933194e fix vllm tagging and add cloud images w/o tmux (#3049) [skip ci] 2025-08-10 20:21:56 -04:00
Wing Lian
05f1b4b2e8 run monkeypatch tests in separate runner (#3047) 2025-08-09 14:34:07 -04:00
Wing Lian
c5e5aba547 Add 2.8.0 base images and uv images (#3034) 2025-08-08 02:30:16 -04:00
Wing Lian
10946afae7 fixes for spinning up vllm service for grpo (#3001) 2025-08-02 11:19:24 -04:00
salman
09dda462ab Fix don't preview docs for contributors (#2994) [skip ci]
* checking against fork vs. main repo

* force doc preview
2025-07-31 11:12:41 -04:00
Wing Lian
1d2aa1e467 upgrade to support latest transformers release (#2984)
* upgrade to support latest transformers release

* bump mistral common too

* Fix dependencies
2025-07-27 17:05:12 -04:00
Wing Lian
add3e5076b don't publish to netlify on contributor submissions since it requires auth tokens (#2985) [skip ci]
* don't publish to netlify on contributor submissions since it requires auth tokens

* fix no-tmux build and add contact to motd
2025-07-27 17:04:27 -04:00
salman
1407aac779 Skip CI for draft PRs (#2970) 2025-07-24 09:11:46 +01:00
Wing Lian
d32058e149 include torchvision in build for upstream changes requiring it now (#2953) [skip ci] 2025-07-22 04:19:16 -04:00
Wing Lian
8a4bcacdb2 cu126-torch271 for cloud docker image should be tagged with main-latest (#2935) 2025-07-17 00:01:23 -04:00
Wing Lian
d2c3d5a954 run nightly-vs-upstream-main on 2.7.1 and multi-gpu also (#2929) [skip ci] 2025-07-16 21:45:42 -04:00
Wing Lian
942005f526 use modal==1.0.2 for nightlies and for cli (#2925) [skip ci]
* use modal==1.0.2 for nightlies and for cli

* use latest cce fork for upstream changes

* increase timeout
2025-07-15 20:31:23 -04:00
Wing Lian
7dc3ac6cb3 update nightlies builds (#2921) [skip ci] 2025-07-14 20:10:43 -04:00
Wing Lian
5081db7f8a upgrade trl==0.19.1 (#2892) [skip ci]
* upgrade trl==0.19.1

* add vllm for tests for grpo

* fixes to work with latest trl

* need data_parallel_size config too

* support for vllm_mode for server / colocate

* vllm settings for colocate

* relax vllm version

* bump min hf hub for latest vllm support

* add hints on string literal for vllm mode

* use latest transformers 4.53.2

* tweak acceptable loss on flaky test_ds_zero3_packed test

* don't run flaky vllm/grpo tests for now
2025-07-14 09:23:42 -04:00
salman
03b2a113fe Update doc preview workflow to use sticky comments (#2873) 2025-07-11 14:08:35 +01:00
Wing Lian
c6d69d5c1b release v0.11.0 (#2875)
* release v0.11.0

* don't build vllm into release for now

* remove 2.5.1 references

* smollm3 multipack support

* fix ordering of e2e tests
2025-07-09 09:22:35 -04:00
Wing Lian
4ff96a2526 fix xformers version (#2888) 2025-07-09 08:43:40 -04:00
salman
89e99eaaa7 slowest durations (#2887) [skip ci] 2025-07-09 08:43:26 -04:00
Wing Lian
6ed501f6dc add 2.7.0 torch images back to support vllm (#2885) 2025-07-08 16:28:14 -04:00
Wing Lian
a5946ff1f0 build fa2 from source for base image with torch2.6 and cu124 (#2867) 2025-07-05 09:21:18 -04:00
Wing Lian
70ca1b2291 fix nightlies to use correct cache (#2848) [skip ci]
* fix nightlies to use correct cache

* fix for handling None for bf16
2025-07-03 12:21:39 -04:00
Wing Lian
cb811f8bf1 upgrade to flash-attn 2.8.0.post2 (#2828)
* upgrade to flash-attn 2.8.0.post2

* use cu126 with torch 2.6

* seems vllm 0.8.5.post1 not compatible with cuda12.6.3 and torch 2.6

* cu126 + torch 2.6 as the default

* use cu126 for multigpu w torch 2.6 too

* drop vllm for now from ci for now
2025-06-29 22:11:16 -04:00
Dan Saunders
06a648263b Config doc autogen: follow-up fix docs build (#2806)
* config reference doc autogen

* improvements

* cleanup; still ugly but working

* reformat

* remove autogen config ref from git

* factor out validations

* rewrite

* rewrite

* cleanup

* progress

* progress

* progress

* lint and minifying somewhat

* remove unneeded

* coderabbit

* coderabbit

* update preview-docs workflow triggers

* installing with deps

* coderabbit

* update refs

* overwrote file accidentally

* docs install deps
2025-06-18 15:42:54 -04:00
Dan Saunders
9d5bfc127e Config doc autogen (#2718)
* config reference doc autogen

* improvements

* cleanup; still ugly but working

* reformat

* remove autogen config ref from git

* factor out validations

* rewrite

* rewrite

* cleanup

* progress

* progress

* progress

* lint and minifying somewhat

* remove unneeded

* coderabbit

* coderabbit

* update preview-docs workflow triggers

* installing with deps

* coderabbit

* update refs

* overwrote file accidentally
2025-06-18 15:36:53 -04:00
Wing Lian
ccc94da8ad KD fix w/ online distillation (#2700) [skip ci]
* kd fixes

* fix collator setup

* fix input args

* better handling to drop string fields for kd with raw dataset

* kd trainer has kd temp as part of the init

* drop top_k before softmax

* simplify and remove zscore

* WIP chunked KD loss with autograd wrapper

* more fixes and liger-type chunked loss

* collator cls for plugins

* remove debugging

* additional plugin collator kwargs, don't scale up kd loss by t^2

* don't need temp arg to distill method

* online kd wip

* add close to comment block

* support sampling params/max new tokens

* handle when no custom collator is used in plugins

* logsumexp trick

* fix check

* shift off the first empty token

* fix length of padding

* use max not min

* temp scale kd loss at end

* support for dynamic plugin training args mixins and symmetric kl

* chore: lint

* fix trainer callback base class

* Fix decay

* accept compressed responses for smaller wire payload

* post-rebase lint

* more KD updates

* increase hyperparams_count for gradients for added normalize_topk

* fix to remove attention_mask

* rename vars for consistency

* fix rebase issues

* default to dropping last batch in multipack batch sampler

* improve handling of train len

* init collator_cls_and_kwargs

* explicit drop_last=False when checking for multipack completeness

* use separate v2 loader for kd

* fix kd tests to use subprocess so it picks up kd training args

* default value for kd_beta arg

* use updated dataset for ci

* longer timeout for e2e
2025-06-17 12:09:13 -04:00
Wing Lian
bcc108efc1 build 2.7.1 images too (#2784) [skip ci] 2025-06-12 13:22:20 -04:00
Wing Lian
581dd324cc build base images for torch 2.7.1 (#2764)
* build base images for torch 2.7.1

* fix: update base docker to use torch 2.7.1

* fix: update doc for main base to use 2.7.1

* make sure to install fa2 in base uv too

* use no build isolation for uv+flashattn

* install psutil also for fa2

* longer timeout for flash attn build

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-06-11 17:11:06 -04:00
Wing Lian
cb03c765a1 add uv tooling for e2e gpu tests (#2750)
* add uv tooling for e2e gpu tests

* fixes from PR feedback

* simplify check

* fix env var

* make sure to use uv for other install

* use raw_dockerfile_image

* Fix import

* fix args to experimental dockerfile image call

* use updated modal versions
2025-06-05 07:25:06 -07:00
Wing Lian
c67910fa6f bump hf deps (#2735) [skip ci]
* bump hf deps

* upgrade liger-kernel too

* install cce from fork for transformers fix

* fix reference to vocab size in gemma3 patch

* use padding_idx instead of pad_token_id

* remove fixed gemma3 patch

* use updated cce fork

* fix local mllama cce patches w docstring

* add test for multipack with trainer setup and fix trainer for trainer refactor upstream

* bump modal version

* guard for iterable datasets

* mllama model arch layout changed in latest transformers

* fix batch sampler with drop_last

* fix: address upstream vlm changes for lora

* fix: update references to old lora target path

* fix: remove mllama fa2 patch due to upstream fix

* fix: lora kernel patch path for multimodal models

* fix: removed mllama from quarto

* run test for came optim on 2.6.0+

* fix fsdp2 patch and remove deprecated patch

* make sure to set sequence_parallel_degree for grpo

* Add SP test for GRPO

* add sp to grpo config for trainer

* use reward_funcs as kwarg to grpo trainer

* fix the comprehension for reward funcs

* reward funcs already passed in as args

* init sp_group right before training

* fix check for adding models to SP context

* make sure to pass args to super

* upgrade deepspeed

* use updated trl and add reasoning flags for vllm

* patch the worker

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-06-05 07:20:33 -07:00
github-actions[bot]
94219f6ee8 chore: update pre-commit hooks (#2745)
* chore: update pre-commit hooks

* trigger linter when pre commit hooks are updated

* fix type checks from upgraded pre-commit

---------

Co-authored-by: djsaunde <1245942+djsaunde@users.noreply.github.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-06-02 15:54:29 -07:00
Wing Lian
ecc719f5c7 add support for base image with uv (#2691) 2025-06-02 12:48:55 -07:00
Dan Saunders
a0941a9271 no need to generate diff file (#2728) 2025-05-27 11:44:06 -04:00
Wing Lian
8f8a7afb05 Add ci and images for CUDA 12.8 for B200s (#2683) [skip ci]
* Add ci and images for CUDA 12.8 for B200s

* add comments explaining CI [skip e2e]
2025-05-16 13:06:08 -04:00
Wing Lian
f34eef546a update doc and use P2P=LOC for brittle grpo test (#2649)
* update doc and skip brittle grpo test

* fix the path to run the multigpu tests

* increase timeout, use LOC instead of NVL

* typo

* use hf cache from s3 backed cloudfront

* mark grpo as flaky test due to vllm start
2025-05-12 14:17:25 -04:00
Wing Lian
c7b6790614 Various fixes for CI, save_only_model for RL, prevent packing multiprocessing deadlocks (#2661)
* lean mistral ft tests, remove e2e torch 2.4.1 test

* make sure to pass save_only_model for RL

* more tests to make ci leaner, add cleanup to modal ci

* fix module for import in e2e tests

* use mp spawn to prevent deadlocks with packing

* make sure cleanup shell script is executable when cloned out
2025-05-12 10:51:18 -04:00
Wing Lian
0f3587174d swap tinymodels that have safetensors for some ci tests (#2641) 2025-05-07 15:06:07 -04:00
Wing Lian
f720b6e72d repop cache (#2639)
* repop cache

* pre-cache as a step

* fix the name

* add reason for pytest skipif

* restore pytorch matrix

* remove max-parallel now that we've optimized this a bit
2025-05-06 11:09:07 -04:00
NanoCode012
6a3e6f8c53 fix: run preview-docs only when md/qmd changes (#2606)
* fix: run preview-docs only when md/qmd changes

* feat: add quarto yaml based on PR feedback
2025-05-01 13:21:28 -04:00
Rahul Tuli
996fc124e5 Add: Sparse Finetuning Integration with llmcompressor (#2479)
* Add: SFTPlugin with llmcompressor

* Update: review comments!

* Add: llmcompressor installable

* pre commit hooks

* Use: warning over warn

* Revert: TODO's

* Update llmcompressor version to latest

* Apply suggestions from @markurtz

Co-authored-by: Mark Kurtz <mark.j.kurtz@gmail.com>

* Address review comments from @markurtz

* Add: llmcompressor installable

* Rename: sft.yaml to sparse-finetuning.yaml

* Use: absolute import

* Update model config

* Move: LLMCompressorPlugin into its own submodule

* Add: `llm_compressor` integration documentation

* Rebase and updates!

* Tests, Style, Updates

* Add: .qmd file

* Address Review Comments:
* deleted redundant docs/llm_compressor.qmd
* incorporated feedback in integration README.md
* added llmcompressor integration to docs/custom_integrations.qmd

Signed-off-by: Rahul Tuli <rtuli@redhat.com>

* Add: line about further optimizations using llmcompressor

Signed-off-by: Rahul Tuli <rtuli@redhat.com>

* Apply patch from @winglian

Signed-off-by: Rahul Tuli <rtuli@redhat.com>

* Fix: Test

Signed-off-by: Rahul Tuli <rtuli@redhat.com>

* additional fixes for docker and saving compressed

* split llmcompressor from vllm checks

* Reset session between tests

Signed-off-by: Rahul Tuli <rtuli@redhat.com>

* move decorator to test method instead of class

* make sure to reset the session after each test

* move import of llmcompressor to reset session inside test

---------

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Co-authored-by: Mark Kurtz <mark.j.kurtz@gmail.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-05-01 12:25:16 -04:00