Wing Lian
7ed40f1d70
automatically set env vars for single gpu deepspeed zero3 ( #3118 ) [skip ci]
...
* automatically set env vars for single gpu deepspeed zero3
* use setdefault
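The setdefault approach above amounts to pre-seeding the distributed environment so DeepSpeed ZeRO-3 can initialize on a single GPU without an external launcher; a minimal sketch (the helper name and exact variable set are assumptions for illustration, not Axolotl's code):

```python
import os

# Assumed defaults for a single-process "distributed" run; setdefault
# leaves any values a real launcher (torchrun, deepspeed) already set.
_SINGLE_GPU_DEFAULTS = {
    "MASTER_ADDR": "127.0.0.1",
    "MASTER_PORT": "29500",
    "RANK": "0",
    "LOCAL_RANK": "0",
    "WORLD_SIZE": "1",
}

def ensure_single_gpu_env() -> dict:
    """Seed distributed env vars without clobbering existing ones."""
    for key, value in _SINGLE_GPU_DEFAULTS.items():
        os.environ.setdefault(key, value)
    return {k: os.environ[k] for k in _SINGLE_GPU_DEFAULTS}
```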
2025-08-29 13:36:47 -04:00
VED
5b6ec2820f
patch for ds_grads_remaining in deepspeed ( #3102 ) [skip ci]
...
* patch deepspeed
* deepspeed patch for ds_grads_remaining
* patch in PatchManager
* chore: lint
* deepspeed utils
* chore2
* patch ds_grads_remaining chore
* chore lint
* chore lint
* remove torch.nn patch
* lint
* Update src/axolotl/monkeypatch/utils.py
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
* patched with checkpoint wrapper
* lint
* only apply deepspeed patch when using activation offloading
---------
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-08-29 12:12:09 -04:00
Wing Lian
6afba3871d
Add support for PyTorch 2.8.0 ( #3106 )
...
* Add support for PyTorch 2.8.0
* loosen triton requirements
* handle torch 2.8.0 in setup.py
* fix versions
* no vllm for torch 2.8.0
* remove comment
Co-authored-by: NanoCode012 <nano@axolotl.ai>
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-08-28 09:10:40 -04:00
Dan Saunders
dc338c3b0e
Update .coderabbit.yaml ( #3109 ) [skip ci]
...
Oops, should be false.
2025-08-27 09:50:52 -04:00
salman
d0d2fc5606
Tokens per second logging [skip-e2e] ( #3072 )
2025-08-27 09:10:14 +01:00
Wing Lian
e1131e9619
always default skip_move_to_device to true ( #3084 )
2025-08-26 09:30:22 -04:00
Wing Lian
c4c4b90638
add tokenizer_save_jinja_files to keep legacy behavior of including chat template in tokenizer_config.json ( #3093 )
...
* add tokenizer_save_jinja_files to keep legacy behavior of including chat template in tokenizer_config.json
* fix test import
2025-08-26 09:30:04 -04:00
Wing Lian
0e9945e3b9
deploy training jobs to baseten w truss in axolotl cli ( #3086 ) [skip ci]
...
* deploy training jobs to baseten w truss in axolotl cli
* cleanup
2025-08-26 09:29:50 -04:00
NanoCode012
0de254a0d0
feat: add gemma3_text attention handling for lora kernels ( #3103 )
2025-08-26 16:47:26 +07:00
Dan Saunders
79ddaebe9a
Add ruff, remove black, isort, flake8, pylint ( #3092 )
...
* black, isort, flake8 -> ruff
* remove unused
* add back needed import
* fix
2025-08-23 23:37:33 -04:00
Dan Saunders
eea7a006e1
make multipack sampler patch explicit ( #3096 )
...
* make multipack sampler patch explicit
* combining
2025-08-22 14:29:10 -04:00
Wing Lian
ab4d604a8f
upgrade peft to 0.17.1 ( #3094 )
...
* upgrade peft to 0.17.1
* upgrade for transformers too
2025-08-22 07:26:30 -04:00
Wing Lian
0fa752e58b
upgrade flash-attn to 2.8.3 for gpt-oss attn sink support ( #3082 )
2025-08-21 15:04:10 -04:00
Dan Saunders
08e517ea48
Update .coderabbit.yaml ( #3091 ) [skip ci]
2025-08-20 22:14:13 -04:00
Wing Lian
07fd22f39b
better handling of lora w bias with fsdp2 and handling of files when saving model checkpoint ( #3090 )
2025-08-20 15:17:48 -04:00
Wing Lian
06eaf6c448
misc fixes ( #3085 )
2025-08-20 08:52:26 -04:00
goggle
050210e637
fix: Sweep runs overwrite each other because output_dir from base config is reused ( #3080 )
...
* refactor: improve output_dir handling in generate_config_files
* fix typo
* cli: harden sweep output_dir handling with base fallback
- Ensure sweep permutations always resolve a valid output_dir
- Default to ./model-out if neither permutation nor base config sets output_dir
- Append sweepXXXX suffix consistently for each permutation
- Prevent Path(None) TypeError and improve robustness of sweep config generation
* fix typo
* chore: lint
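The fallback chain these bullets describe can be sketched as follows (`sweep_output_dir` and its arguments are hypothetical names for illustration, not the actual CLI code):

```python
from pathlib import Path

def sweep_output_dir(permutation: dict, base_config: dict, idx: int) -> str:
    """Resolve a per-run output dir: prefer the permutation's value,
    fall back to the base config, then to ./model-out, so Path(None)
    can never be constructed; append a sweepXXXX suffix per run."""
    output_dir = (
        permutation.get("output_dir")
        or base_config.get("output_dir")
        or "./model-out"
    )
    return str(Path(output_dir) / f"sweep{idx:04d}")
```

Because each permutation gets its own suffixed directory, sweep runs no longer overwrite each other even when every config inherits the same base `output_dir`.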
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-08-19 20:25:20 -04:00
Wing Lian
05cedbfb1e
add baseten info for gpt-oss recipe ( #3078 )
...
* add baseten info for gpt-oss recipe
* incorporate PR review
2025-08-19 13:30:37 -04:00
VED
c10eb811fa
data_parallel_size in VllmserveCliArgs ( #3074 )
...
* data_parallel_size in VllmserveCliArgs
* moved to 43
2025-08-18 08:44:37 -04:00
VED
0eef385b1a
[feat] truncation support with excess_length_strategy ( #3068 ) [skip ci]
...
* feat: truncation support with excess_len
* pre-commit
* excess_length_strategy
* requested changes
* lint
* added handle_long_seq_in_dataset in sft
* comments improved
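A minimal sketch of what an excess-length strategy like this could look like (`handle_long_sequences` and the strategy names are illustrative assumptions, not Axolotl's API):

```python
def handle_long_sequences(examples, max_len, strategy="truncate"):
    """Apply an excess-length strategy to tokenized examples:
    either truncate over-length sequences to max_len, or drop
    them from the dataset entirely."""
    if strategy == "truncate":
        return [ex[:max_len] for ex in examples]
    if strategy == "drop":
        return [ex for ex in examples if len(ex) <= max_len]
    raise ValueError(f"unknown excess-length strategy: {strategy}")
```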
2025-08-18 08:39:13 -04:00
Wing Lian
ecbe8b2b61
[GPT-OSS] improve FSDP shard merging and documentation for GPT-OSS ( #3073 )
...
* improve fsdp shard merging
* improve logging
* update information on merging and inferencing GPT-OSS
* cleanup readme
* automate cleanup of FSDP prefix
* import GRPO only if necessary
* only modify config.json on rank0
* merge final checkpoint at end of training
* prevent circular import
* Fix saving for sharded state dict
* devx, move merged to output dir
* move import back to top
* Fix stuck merge
* fix conditionals from pr feedback and add test
2025-08-15 21:25:01 -04:00
Wing Lian
130ef7c51a
Various fixes for VLMs ( #3063 )
...
* fix to not use batch feature indexing
* more vlm fixes
* use AutoModelForImageTextToText
* add example yaml and need num2words for chat template
* improve handling of adding image tokens to conversation
* add lfm2-vl support
* update the lfm readme
* fix markdown and add rtol for loss checks
* feat: add smolvlm2 processing strat
* fix: check for causal-conv1d in lfm models
* feat: add docs for lfm2
* feat: add new models and tips to docs
* feat: add smolvlm2 docs and remove extra dep
* chore: update docs
* feat: add video instructions
* chore: cleanup
* chore: comments
* fix: typo
* feat: add usage stats
* chore: refactor
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-08-15 10:52:57 -04:00
salman
d1de6f5f3d
Add option to skip slow tests in PRs ( #3060 ) [skip ci]
...
* testing e2e skip [skip-e2e]
* testing e2e skip [skip-e2e]
* testing e2e skip [skip-e2e]
* testing e2e skip [skip-e2e]
* testing e2e skip [skip-e2e]
* testing e2e skip [skip-e2e]
* testing e2e skip [skip-e2e]
* testing e2e skip [skip-e2e]
* testing e2e skip [skip-e2e]
* testing e2e skip [skip-e2e]
* testing e2e skip [skip-e2e]
* stop running multigpu [skip-e2e]
* should work now [skip-e2e]
* reverting [skip-e2e]
* testing [skip-e2e]
* debug [skip-e2e]
* debug [skip-e2e]
* round 2[skip-e2e]
* removing debug [skip-e2e]
* support skipping whole PR [skip-e2e]
* use script for e2e skip [skip-e2e]
* contributing [skip-e2e]
* contributing [skip-e2e]
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-08-13 22:57:51 -04:00
Wing Lian
48b7ae1677
use updated patch release ( #3066 )
2025-08-13 21:23:05 -04:00
NanoCode012
506e3a3907
fix: fsdp_config validation being None ( #3061 ) [skip ci]
...
* fix: fsdp_config validation being None
* fix: handling
---------
Co-authored-by: salman <salman.mohammadi@outlook.com>
2025-08-13 21:21:50 -04:00
Wing Lian
09145de8fa
upgrade transformers==4.55.1 and bitsandbytes==0.47.0 ( #3064 )
...
* upgrade transformers==4.55.1
* also upgrade bnb
* remove bnb params4bit patch (upstreamed)
* use latest causal-conv1d
* fix patching ring-flash-attn with now missing imports
---------
Co-authored-by: Dan Saunders <danjsaund@gmail.com>
2025-08-13 19:41:07 -04:00
Wing Lian
e0a2523a3b
Workaround to unblock docs build in main ( #3055 )
...
Co-authored-by: Salman Mohammadi <salman.mohammadi@outlook.com>
2025-08-13 11:39:39 +01:00
Wing Lian
3d45620008
remove prepare-from-posids patch ( #3052 ) [skip ci]
2025-08-11 09:34:41 -04:00
github-actions[bot]
ce20e838b5
chore: update pre-commit hooks ( #3050 ) [skip ci]
...
Co-authored-by: djsaunde <1245942+djsaunde@users.noreply.github.com>
2025-08-11 09:32:21 -04:00
Wing Lian
d4d84d48af
fix ray train and add fsdp2 smoke test for ray trainer ( #3053 )
...
* add fsdp2 smoke test for ray trainer
* fix ray train with fsdp2
2025-08-11 09:31:54 -04:00
Wing Lian
9b12c05660
use exec instead of subprocess to make ctrl+c nicer for cli ( #3044 )
...
* use exec instead of subprocess to make ctrl+c nicer for cli
* change var name to use_exec
* simplify to bool
* flush std*
* patch subprocess as mock in test
* fix tests
* more test fixes
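The exec-vs-subprocess trade-off above can be sketched as follows (`launch` is a hypothetical helper; `use_exec` is the flag named in the bullets): `exec` replaces the current process image, so Ctrl+C (SIGINT) is delivered directly to the launched command rather than to a Python wrapper that must forward it.

```python
import os
import subprocess
import sys

def launch(cmd: list, use_exec: bool = True) -> None:
    """Run cmd either by replacing this process (exec) or by spawning
    a child (subprocess). Flush stdio first, since exec discards any
    buffered output in the current process."""
    sys.stdout.flush()
    sys.stderr.flush()
    if use_exec:
        os.execvp(cmd[0], cmd)  # never returns on success
    else:
        subprocess.run(cmd, check=True)
```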
2025-08-10 20:22:20 -04:00
Wing Lian
686933194e
fix vllm tagging and add cloud images w/o tmux ( #3049 ) [skip ci]
2025-08-10 20:21:56 -04:00
Wing Lian
d12b461d19
follow up fix for plugin registration ( #3054 ) [skip ci]
2025-08-10 20:21:38 -04:00
Wing Lian
d6b81b3683
update training args check for new defaults ( #3051 ) [skip ci]
...
* update training args check for new defaults
* skip check for now
2025-08-10 11:26:22 -04:00
Wing Lian
05f1b4b2e8
run monkeypatch tests in separate runner ( #3047 )
2025-08-09 14:34:07 -04:00
Wing Lian
7cfc80ec77
set dev version ( #3045 ) [skip ci]
2025-08-08 13:56:53 -04:00
salman
0da6a95efa
Add citation.cff ( #3043 ) [skip ci]
2025-08-08 16:18:42 +01:00
Wing Lian
2c8497e489
tag for v0.12.0 release ( #3041 )
v0.12.0
2025-08-08 08:24:09 -04:00
NanoCode012
f70d4de8c7
feat(doc): add links to new features on README ( #2980 ) [skip ci]
...
* feat(doc): add links to new features on README
* fix merge error
* remove blurb about older FSDP2 integration
* update blog link
* chore: update cce commit
* feat: update model support into readme
* Update README.md
Co-authored-by: salman <salman.mohammadi@outlook.com>
* chore: lint num spaces
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: salman <salman.mohammadi@outlook.com>
2025-08-08 08:16:43 -04:00
Dan Saunders
0ae06d756d
use nanmean for loss aggregation (CP fix) ( #3033 )
...
* use nanmean for loss aggregation (CP fix)
* use regular asserts
* small changes to make tests isolate
* combining evaluation_loop patches
* fix
* delete unused
* fix check
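Why nanmean matters here, in a minimal pure-Python sketch (the real fix presumably uses `torch.nanmean`; this helper is only illustrative): under context parallelism, some shards can produce a NaN loss (e.g. no valid label tokens), and a plain mean would propagate that NaN into the aggregate.

```python
import math

def nanmean(values):
    """Mean over the non-NaN entries; NaN only if nothing is finite."""
    finite = [v for v in values if not math.isnan(v)]
    if not finite:
        return math.nan
    return sum(finite) / len(finite)
```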
2025-08-08 08:15:17 -04:00
NanoCode012
2974670bf8
Feat: add arcee ( #3028 )
...
* feat: add arcee
* feat: add latest models supported by cce
* feat: add arcee example config
* chore: lint
* fix: typo
* feat: change to instruct
* feat: add vram usage
* Update README.md
2025-08-08 08:09:11 -04:00
Wing Lian
50f2b94d50
add 120b and deepspeed zero3 examples ( #3035 ) [skip ci]
...
* add 120b and deepspeed zero3 examples
* add a bit of flavor and cleanup gpt oss readme
* fix: remove expert vram usage
* fix: remove redundant EOS token from eot_tokens
* feat: add 120B to docs
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-08-08 08:04:56 -04:00
Wing Lian
eb2c87b525
Example for Slurm and various fixes ( #3038 ) [skip ci]
...
* slurm example and make preprocess play nicely
* start slurm if its init file exists
* remove incorrect comment
* feat: add slurm docs
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-08-08 08:02:03 -04:00
NanoCode012
4db7f023c6
feat(doc): standardize the axolotl install to a release ( #3040 ) [skip ci]
2025-08-08 08:00:26 -04:00
NanoCode012
4273d5cf7e
feat: update nd parallelism readme ( #3039 )
...
Co-authored-by: salman <salman.mohammadi@outlook.com>
2025-08-08 12:45:36 +01:00
Wing Lian
c5e5aba547
Add 2.8.0 base images and uv images ( #3034 )
2025-08-08 02:30:16 -04:00
Wing Lian
9d5c95db6f
Add support for Accelerate CP, ND examples, and fix for parallel config w fsdp ( #3019 )
...
* fix for parallelism config from trainer
* fix handling of parallelism_config w accelerate
* add todo for removal
* update to latest axolotl-contribs-mit for optimizer fix too
* synchronize training after checkpoint save
* dir spelling
* use latest accelerate main
* fix to not use partial state parallelism_config
* more fixes
* use most recent accelerate fix
* fix cpu_ram_efficient_loading to meta devices from rank 0 to prevent CPU RAM oom
* improve handling of broadcasting fsdp2 state dict
* support for openai chat template with thinking key as the reasoning trace
* address PR feedback
* refactor to remove dependency on PartialState for parallelism config
* bump accelerate, gptoss fixes
* limit meta fixes to fsdp2 for now
* fixes for gpt oss
* fixup examples, don't use cpu-ram-efficient-loading for now
* remove problematic barrier
* patch parallelism config
* reorder comparison
* device mesh fixes
* make pure CP work
* lint
2025-08-07 21:22:15 -04:00
NanoCode012
ca796fb56e
feat(doc): update gpt-oss readme ( #3029 ) [skip ci]
...
* feat(doc): update gpt-oss readme
* fix: caps
* feat: add toolcalling section
* feat: add example tool dataset to docs
* chore: update
2025-08-07 09:26:42 -04:00
VED
597953bef0
clear cache before clean up ( #3031 ) [skip ci]
...
* clear cache before save_model
* chore: lint
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-08-07 09:25:58 -04:00
NanoCode012
39fbd3b2b5
fix: lora kernels for mistral3 ( #3027 ) [skip ci]
2025-08-07 09:25:37 -04:00