Commit Graph

2374 Commits

Author SHA1 Message Date
Dan Saunders
1f75287a3a diffusion custom models approach 2025-08-19 04:09:46 +00:00
Dan Saunders
63d2280999 nits 2025-08-18 19:17:24 +00:00
Dan Saunders
b210db2d15 fixes 2025-08-18 19:09:09 +00:00
Dan Saunders
556a69118f sample generation, tests fixes 2025-08-18 18:25:04 +00:00
Dan Saunders
8569675b26 Merge branch 'main' into diffusion 2025-08-18 10:07:55 -04:00
VED
c10eb811fa data_parallel_size in VllmserveCliArgs (#3074)
* data_parallel_size in VllmserveCliArgs

* moved to 43
2025-08-18 08:44:37 -04:00
VED
0eef385b1a [feat] truncation support with excess_length_strategy (#3068) [skip ci]
* feat:truncation support with excess_len

* pre-commit

* excess_length_strategy

* requested changes

* lint

* added handle_long_seq_in_dataset in sft

* comments improved
2025-08-18 08:39:13 -04:00
Dan Saunders
077b5a4358 cleanup; tests draft 2025-08-16 02:44:44 +00:00
Wing Lian
ecbe8b2b61 [GPT-OSS] improve FSDP shard merging and documentation for GPT-OSS (#3073)
* improve fsdp shard merging

* improve logging

* update information on merging and inferencing GPT-OSS

* cleanup readme

* automate cleanup of FSDP prefix

* import GRPO only if necessary

* only modify config.json on rank0

* merge final checkpoint at end of training

* prevent circular import

* Fix saving for sharded state dict

* devx, move merged to output dir

* move import back to top

* Fix stuck merge

* fix conditionals from pr feedback and add test
2025-08-15 21:25:01 -04:00
Dan Saunders
234b7b3126 nits 2025-08-16 00:14:44 +00:00
Wing Lian
130ef7c51a Various fixes for VLMs (#3063)
* fix to not use batch feature indexing

* more vlm fixes

* use AutoModelForImageTextToText

* add example yaml and need num2words for chat template

* improve handling of adding image tokens to conversation

* add lfm2-vl support

* update the lfm readme

* fix markdown and add rtol for loss checks

* feat: add smolvlm2 processing strat

* fix: check for causal-conv1d in lfm models

* feat: add docs for lfm2

* feat: add new models and tips to docs

* feat: add smolvlm2 docs and remove extra dep

* chore: update docs

* feat: add video instructions

* chore: cleanup

* chore: comments

* fix: typo

* feat: add usage stats

* chore: refactor

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-08-15 10:52:57 -04:00
Dan Saunders
e19be0c2d9 add back in reinit_weights (clobbered?); masking / pretrain fixes 2025-08-15 02:21:25 +00:00
Dan Saunders
479a454ae3 fixes + improvements 2025-08-14 16:11:37 -04:00
Dan Saunders
0a9341acde nits 2025-08-14 01:53:24 -04:00
Dan Saunders
d8b63804bc cleanup 2025-08-14 01:51:13 -04:00
Dan Saunders
3156c605d4 diffusion training plugin 2025-08-14 01:48:22 -04:00
salman
d1de6f5f3d Add option to skip slow tests in PRs (#3060) [skip ci]
* testing e2e skip [skip-e2e]

* testing e2e skip [skip-e2e]

* testing e2e skip [skip-e2e]

* testing e2e skip [skip-e2e]

* testing e2e skip [skip-e2e]

* testing e2e skip [skip-e2e]

* testing e2e skip [skip-e2e]

* testing e2e skip [skip-e2e]

* testing e2e skip [skip-e2e]

* testing e2e skip [skip-e2e]

* testing e2e skip [skip-e2e]

* stop running multigpu [skip-e2e]

* should work now [skip-e2e]

* reverting [skip-e2e]

* testing [skip-e2e]

* debug [skip-e2e]

* debug [skip-e2e]

* round 2[skip-e2e]

* removing debug [skip-e2e]

* support skipping whole PR [skip-e2e]

* use script for e2e skip [skip-e2e]

* contributing [skip-e2e]

* contributing [skip-e2e]

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-08-13 22:57:51 -04:00
Wing Lian
48b7ae1677 use updated patch release (#3066) 2025-08-13 21:23:05 -04:00
NanoCode012
506e3a3907 fix: fsdp_config validation being None (#3061) [skip ci]
* fix: fsdp_config validation being None

* fix: handling

---------

Co-authored-by: salman <salman.mohammadi@outlook.com>
2025-08-13 21:21:50 -04:00
Wing Lian
09145de8fa upgrade transformers==4.55.1 and bitsandbytes==0.47.0 (#3064)
* upgrade transformers==4.55.1

* also upgrade bnb

* remove bnb params4bit patch (upstreamed)

* use latest causal-conv1d

* fix patching ring-flash-attn with now missing imports

---------

Co-authored-by: Dan Saunders <danjsaund@gmail.com>
2025-08-13 19:41:07 -04:00
Wing Lian
e0a2523a3b Workaround to unblock docs build in main (#3055)
Co-authored-by: Salman Mohammadi <salman.mohammadi@outlook.com>
2025-08-13 11:39:39 +01:00
Wing Lian
3d45620008 remove prepare-from-posids patch (#3052) [skip ci] 2025-08-11 09:34:41 -04:00
github-actions[bot]
ce20e838b5 chore: update pre-commit hooks (#3050) [skip ci]
Co-authored-by: djsaunde <1245942+djsaunde@users.noreply.github.com>
2025-08-11 09:32:21 -04:00
Wing Lian
d4d84d48af fix ray train and add fsdp2 smoke test for ray trainer (#3053)
* add fsdp2 smoke test for ray trainer

* fix raytrain with fsdp2
2025-08-11 09:31:54 -04:00
Wing Lian
9b12c05660 use exec instead of subprocess to make ctrl+c nicer for cli (#3044)
* use exec instead of subprocess to make ctrl+c nicer for cli

* change var name to use_exec

* simplify to bool

* flush std*

* patch subprocess as mock in test

* fix tests

* more test fixes
2025-08-10 20:22:20 -04:00
Wing Lian
686933194e fix vllm tagging and add cloud images w/o tmux (#3049) [skip ci] 2025-08-10 20:21:56 -04:00
Wing Lian
d12b461d19 follow up fix for plugin registration (#3054) [skip ci] 2025-08-10 20:21:38 -04:00
Wing Lian
d6b81b3683 update training args check for new defaults (#3051) [skip ci]
* update training args check for new defaults

* skip check for now
2025-08-10 11:26:22 -04:00
Wing Lian
05f1b4b2e8 run monkeypatch tests in separate runner (#3047) 2025-08-09 14:34:07 -04:00
Wing Lian
7cfc80ec77 set dev version (#3045) [skip ci] 2025-08-08 13:56:53 -04:00
salman
0da6a95efa Add citation.tff (#3043) [skip ci] 2025-08-08 16:18:42 +01:00
Wing Lian
2c8497e489 tag for v0.12.0 release (#3041)
v0.12.0
2025-08-08 08:24:09 -04:00
NanoCode012
f70d4de8c7 feat(doc): add links to new features on README (#2980) [skip ci]
* feat(doc): add links to new features on README

* fix merge error

* remove blurb about older FSDP2 integration

* update blog link

* chore: update cce commit

* feat: update model support into readme

* Update README.md

Co-authored-by: salman <salman.mohammadi@outlook.com>

* chore: lint num spaces

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: salman <salman.mohammadi@outlook.com>
2025-08-08 08:16:43 -04:00
Dan Saunders
0ae06d756d use nanmean for loss aggregation (CP fix) (#3033)
* use nanmean for loss aggregation (CP fix)

* use regular asserts

* small changes to make tests isolate

* combining evaluation_loop patches

* fix

* delete unused

* fix check
2025-08-08 08:15:17 -04:00
NanoCode012
2974670bf8 Feat: add arcee (#3028)
* feat: add arcee

* feat: add latest models supported by cce

* feat: add arcee example config

* chore: lint

* fix: typo

* feat: change to instruct

* feat: add vram usage

* Update README.md
2025-08-08 08:09:11 -04:00
Wing Lian
50f2b94d50 add 120b and deepspeed zero3 examples (#3035) [skip ci]
* add 120b and deepspeed zero3 examples

* add a bit of flavor and cleanup gpt oss readme

* fix: remove expert vram usage

* fix: remove redundant EOS token from eot_tokens

* feat: add 120B to docs

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-08-08 08:04:56 -04:00
Wing Lian
eb2c87b525 Example for Slurm and various fixes (#3038) [skip ci]
* slurm example and make preprocess play nicely

* start slurm if its init file exists

* remove incorrect comment

* feat: add slurm docs

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-08-08 08:02:03 -04:00
NanoCode012
4db7f023c6 feat(doc): standardize the axolotl install to a release (#3040) [skip ci] 2025-08-08 08:00:26 -04:00
NanoCode012
4273d5cf7e feat: update nd parallelism readme (#3039)
Co-authored-by: salman <salman.mohammadi@outlook.com>
2025-08-08 12:45:36 +01:00
Wing Lian
c5e5aba547 Add 2.8.0 base images and uv images (#3034) 2025-08-08 02:30:16 -04:00
Wing Lian
9d5c95db6f Add support for Accelerate CP, ND examples, and fix for parallel config w fsdp (#3019)
* fix for parallelism config from trainer

* fix handling of parallelism_config w accelerate

* add todo for removal

* update to latest axolotl-contribs-mit for optimizer fix too

* synchronize training after checkpoint save

* dir spelling

* use latest accelerate main

* fix to not use partial state parallelism_config

* more fixes

* use most recent accelerate fix

* fix cpu_ram_efficient_loading to meta devices from rank 0 to prevent CPU RAM oom

* improve handling of broadcasting fsdp2 state dict

* support for openai chat template with thinking key as the reasoning trace

* address PR feedback

* refactor to remove dependency on PartialState for parallelism config

* bump accelerate, gptoss fixes

* limit meta fixes to fsdp2 for now

* fixes for gpt oss

* fixup examples, don't use cpu-ram-efficient-loading for now

* remove problematic barrier

* patch parallelism config

* reorder comparison

* device mesh fixes

* make pure CP work

* lint
2025-08-07 21:22:15 -04:00
NanoCode012
ca796fb56e feat(doc): update gpt-oss readme (#3029) [skip ci]
* feat(doc): update gpt-oss readme

* fix: caps

* feat: add toolcalling section

* feat: add example tool dataset to docs

* chore: update
2025-08-07 09:26:42 -04:00
VED
597953bef0 clear cache before clean up (#3031) [skip ci]
* clear cache before save_model

* chore: lint

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-08-07 09:25:58 -04:00
NanoCode012
39fbd3b2b5 fix: lora kernels for mistral3 (#3027) [skip ci] 2025-08-07 09:25:37 -04:00
salman
46dfacf255 ND Parallel Doc Nits (#3032) 2025-08-07 10:34:26 +01:00
Wing Lian
4bce713b39 allow custom trainer_cls to be defined as a module reference in the YAML (#3024) [skip ci]
* allow custom trainer_cls to be defined as a module reference in the YAML

* address PR feedback and add test

* add tests
2025-08-06 22:49:19 -04:00
Dan Saunders
d09290f2f4 Lora kernels bias support (#3025)
* lora kernels bias support

* revert rename

* nit

* lint, tests

* satisfying the rabbit
2025-08-06 20:20:08 -04:00
Wing Lian
e442ff22aa fix keyerror on load_in_8bit/load_in_4bit access in _set_quantization_config (#3023)
* set load_in_8bit/load_in_4bit in _set_quantization_config to prevent keyerror

* use dict.get instead
2025-08-06 14:28:52 -04:00
Wing Lian
ba3dba3e4f add kernels for gpt oss models (#3020)
* add kernels for gpt oss models

* add support for gpt-oss

* typo incorrect package

* fix: layout for configs and added wandb/epochs

* add gptoss example w offload and set moe leaf for z3

* add support for Mxfp4Config from yaml

* update yaml to use official model

* fix lora and don't allow triton to go above 3.3.1

* fix lr and tweak vram use

* fix range for triton since pinned wasn't compatible with torch 2.6.0

* update cce with gpt oss patches

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-08-06 09:47:55 -04:00
Wing Lian
97e86c6d47 drop old patches and code that are no longer needed (#3007) [skip ci] 2025-08-06 08:02:39 -04:00