salman
d1de6f5f3d
Add option to skip slow tests in PRs ( #3060 ) [skip ci]
...
* testing e2e skip [skip-e2e]
* stop running multigpu [skip-e2e]
* should work now [skip-e2e]
* reverting [skip-e2e]
* testing [skip-e2e]
* debug [skip-e2e]
* debug [skip-e2e]
* round 2 [skip-e2e]
* removing debug [skip-e2e]
* support skipping whole PR [skip-e2e]
* use script for e2e skip [skip-e2e]
* contributing [skip-e2e]
* contributing [skip-e2e]
---------
Co-authored-by: Wing Lian <wing@axolotl.ai >
2025-08-13 22:57:51 -04:00
Wing Lian
48b7ae1677
use updated patch release ( #3066 )
2025-08-13 21:23:05 -04:00
NanoCode012
506e3a3907
fix: fsdp_config validation being None ( #3061 ) [skip ci]
...
* fix: fsdp_config validation being None
* fix: handling
---------
Co-authored-by: salman <salman.mohammadi@outlook.com >
2025-08-13 21:21:50 -04:00
Wing Lian
09145de8fa
upgrade transformers==4.55.1 and bitsandbytes==0.47.0 ( #3064 )
...
* upgrade transformers==4.55.1
* also upgrade bnb
* remove bnb params4bit patch (upstreamed)
* use latest causal-conv1d
* fix patching ring-flash-attn with now missing imports
---------
Co-authored-by: Dan Saunders <danjsaund@gmail.com >
2025-08-13 19:41:07 -04:00
Wing Lian
e0a2523a3b
Workaround to unblock docs build in main ( #3055 )
...
Co-authored-by: Salman Mohammadi <salman.mohammadi@outlook.com >
2025-08-13 11:39:39 +01:00
Wing Lian
3d45620008
remove prepare-from-posids patch ( #3052 ) [skip ci]
2025-08-11 09:34:41 -04:00
github-actions[bot]
ce20e838b5
chore: update pre-commit hooks ( #3050 ) [skip ci]
...
Co-authored-by: djsaunde <1245942+djsaunde@users.noreply.github.com >
2025-08-11 09:32:21 -04:00
Wing Lian
d4d84d48af
fix ray train and add fsdp2 smoke test for ray trainer ( #3053 )
...
* add fsdp2 smoke test for ray trainer
* fix ray train with fsdp2
2025-08-11 09:31:54 -04:00
Wing Lian
9b12c05660
use exec instead of subprocess to make ctrl+c nicer for cli ( #3044 )
...
* use exec instead of subprocess to make ctrl+c nicer for cli
* change var name to use_exec
* simplify to bool
* flush std*
* patch subprocess as mock in test
* fix tests
* more test fixes
2025-08-10 20:22:20 -04:00
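A minimal sketch of the exec-over-subprocess idea from #3044, assuming a hypothetical `launch` helper (the function name and command are illustrative, not Axolotl's actual CLI code): replacing the wrapper process means Ctrl+C reaches the launched program directly instead of a Python parent that has to forward the signal.

```python
import os
import sys


def launch(cmd: list[str]) -> None:
    """Replace the current process with `cmd` instead of spawning a child.

    With subprocess, SIGINT hits the Python wrapper first and must be
    forwarded; after exec, the launched program owns the terminal signal.
    """
    # Flush stdio so buffered output survives the process image swap.
    sys.stdout.flush()
    sys.stderr.flush()
    # execvpe never returns on success; the wrapper ceases to exist.
    os.execvpe(cmd[0], cmd, os.environ)


if __name__ == "__main__":
    launch(["accelerate", "launch", "-m", "axolotl.cli.train", "config.yaml"])
```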
Wing Lian
686933194e
fix vllm tagging and add cloud images w/o tmux ( #3049 ) [skip ci]
2025-08-10 20:21:56 -04:00
Wing Lian
d12b461d19
follow up fix for plugin registration ( #3054 ) [skip ci]
2025-08-10 20:21:38 -04:00
Wing Lian
d6b81b3683
update training args check for new defaults ( #3051 ) [skip ci]
...
* update training args check for new defaults
* skip check for now
2025-08-10 11:26:22 -04:00
Wing Lian
05f1b4b2e8
run monkeypatch tests in separate runner ( #3047 )
2025-08-09 14:34:07 -04:00
Wing Lian
7cfc80ec77
set dev version ( #3045 ) [skip ci]
2025-08-08 13:56:53 -04:00
salman
0da6a95efa
Add citation.cff ( #3043 ) [skip ci]
2025-08-08 16:18:42 +01:00
Wing Lian
2c8497e489
tag for v0.12.0 release ( #3041 )
v0.12.0
2025-08-08 08:24:09 -04:00
NanoCode012
f70d4de8c7
feat(doc): add links to new features on README ( #2980 ) [skip ci]
...
* feat(doc): add links to new features on README
* fix merge error
* remove blurb about older FSDP2 integration
* update blog link
* chore: update cce commit
* feat: update model support into readme
* Update README.md
Co-authored-by: salman <salman.mohammadi@outlook.com >
* chore: lint num spaces
---------
Co-authored-by: Wing Lian <wing@axolotl.ai >
Co-authored-by: salman <salman.mohammadi@outlook.com >
2025-08-08 08:16:43 -04:00
Dan Saunders
0ae06d756d
use nanmean for loss aggregation (CP fix) ( #3033 )
...
* use nanmean for loss aggregation (CP fix)
* use regular asserts
* small changes to make tests isolated
* combining evaluation_loop patches
* fix
* delete unused
* fix check
2025-08-08 08:15:17 -04:00
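The nanmean change in #3033 guards against chunks that contribute no valid loss, as can happen under context parallelism; a small, assumption-level illustration of the difference:

```python
import torch

# Per-chunk losses; a chunk with no supervised tokens produces NaN.
losses = torch.tensor([0.82, 0.91, float("nan"), 0.77])

print(losses.mean())     # tensor(nan)    -- one empty chunk poisons the average
print(losses.nanmean())  # tensor(0.8333) -- NaN entries are ignored
```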
NanoCode012
2974670bf8
Feat: add arcee ( #3028 )
...
* feat: add arcee
* feat: add latest models supported by cce
* feat: add arcee example config
* chore: lint
* fix: typo
* feat: change to instruct
* feat: add vram usage
* Update README.md
2025-08-08 08:09:11 -04:00
Wing Lian
50f2b94d50
add 120b and deepspeed zero3 examples ( #3035 ) [skip ci]
...
* add 120b and deepspeed zero3 examples
* add a bit of flavor and cleanup gpt oss readme
* fix: remove expert vram usage
* fix: remove redundant EOS token from eot_tokens
* feat: add 120B to docs
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai >
2025-08-08 08:04:56 -04:00
Wing Lian
eb2c87b525
Example for Slurm and various fixes ( #3038 ) [skip ci]
...
* slurm example and make preprocess play nicely
* start slurm if its init file exists
* remove incorrect comment
* feat: add slurm docs
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai >
2025-08-08 08:02:03 -04:00
NanoCode012
4db7f023c6
feat(doc): standardize the axolotl install to a release ( #3040 ) [skip ci]
2025-08-08 08:00:26 -04:00
NanoCode012
4273d5cf7e
feat: update nd parallelism readme ( #3039 )
...
Co-authored-by: salman <salman.mohammadi@outlook.com >
2025-08-08 12:45:36 +01:00
Wing Lian
c5e5aba547
Add 2.8.0 base images and uv images ( #3034 )
2025-08-08 02:30:16 -04:00
Wing Lian
9d5c95db6f
Add support for Accelerate CP, ND examples, and fix for parallel config w fsdp ( #3019 )
...
* fix for parallelism config from trainer
* fix handling of parallelism_config w accelerate
* add todo for removal
* update to latest axolotl-contribs-mit for optimizer fix too
* synchronize training after checkpoint save
* dir spelling
* use latest accelerate main
* fix to not use partial state parallelism_config
* more fixes
* use most recent accelerate fix
* fix cpu_ram_efficient_loading to meta devices from rank 0 to prevent CPU RAM oom
* improve handling of broadcasting fsdp2 state dict
* support for openai chat template with thinking key as the reasoning trace
* address PR feedback
* refactor to remove dependency on PartialState for parallelism config
* bump accelerate, gptoss fixes
* limit meta fixes to fsdp2 for now
* fixes for gpt oss
* fixup examples, don't use cpu-ram-efficient-loading for now
* remove problematic barrier
* patch parallelism config
* reorder comparison
* device mesh fixes
* make pure CP work
* lint
2025-08-07 21:22:15 -04:00
NanoCode012
ca796fb56e
feat(doc): update gpt-oss readme ( #3029 ) [skip ci]
...
* feat(doc): update gpt-oss readme
* fix: caps
* feat: add toolcalling section
* feat: add example tool dataset to docs
* chore: update
2025-08-07 09:26:42 -04:00
VED
597953bef0
clear cache before clean up ( #3031 ) [skip ci]
...
* clear cache before save_model
* chore: lint
---------
Co-authored-by: Wing Lian <wing@axolotl.ai >
2025-08-07 09:25:58 -04:00
NanoCode012
39fbd3b2b5
fix: lora kernels for mistral3 ( #3027 ) [skip ci]
2025-08-07 09:25:37 -04:00
salman
46dfacf255
ND Parallel Doc Nits ( #3032 )
2025-08-07 10:34:26 +01:00
Wing Lian
4bce713b39
allow custom trainer_cls to be defined as a module reference in the YAML ( #3024 ) [skip ci]
...
* allow custom trainer_cls to be defined as a module reference in the YAML
* address PR feedback and add test
* add tests
2025-08-06 22:49:19 -04:00
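For #3024, resolving a dotted module reference from YAML to a class typically looks like the sketch below; the helper name and config key are assumptions, not necessarily Axolotl's exact code.

```python
import importlib


def resolve_class(path: str) -> type:
    """Resolve a dotted 'package.module.ClassName' string to the class object."""
    module_path, _, class_name = path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# A YAML config could then reference e.g. `trainer_cls: my_pkg.trainers.MyTrainer`
# (hypothetical); resolution works for any importable class:
print(resolve_class("collections.OrderedDict"))  # <class 'collections.OrderedDict'>
```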
Dan Saunders
d09290f2f4
Lora kernels bias support ( #3025 )
...
* lora kernels bias support
* revert rename
* nit
* lint, tests
* satisfying the rabbit
2025-08-06 20:20:08 -04:00
Wing Lian
e442ff22aa
fix KeyError on load_in_8bit/load_in_4bit access in _set_quantization_config ( #3023 )
...
* set load_in_8bit/load_in_4bit in _set_quantization_config to prevent KeyError
* use dict.get instead
2025-08-06 14:28:52 -04:00
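The "use dict.get instead" follow-up in #3023 amounts to tolerating missing quantization keys rather than indexing them directly; a minimal illustration (the dict contents are assumed for the example):

```python
model_config = {}  # load_in_8bit / load_in_4bit may simply be absent

# model_config["load_in_8bit"] would raise KeyError here; .get() falls back safely.
load_in_8bit = model_config.get("load_in_8bit", False)
load_in_4bit = model_config.get("load_in_4bit", False)
print(load_in_8bit, load_in_4bit)  # False False
```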
Wing Lian
ba3dba3e4f
add kernels for gpt oss models ( #3020 )
...
* add kernels for gpt oss models
* add support for gpt-oss
* typo incorrect package
* fix: layout for configs and added wandb/epochs
* add gptoss example w offload and set moe leaf for z3
* add support for Mxfp4Config from yaml
* update yaml to use official model
* fix lora and don't allow triton to go above 3.3.1
* fix lr and tweak vram use
* fix range for triton since pinned wasn't compatible with torch 2.6.0
* update cce with gpt oss patches
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai >
2025-08-06 09:47:55 -04:00
Wing Lian
97e86c6d47
drop old patches and code that are no longer needed ( #3007 ) [skip ci]
2025-08-06 08:02:39 -04:00
VED
784f8c0e95
fix: kd_distillation KeyError logprobs ( #2990 )
...
* fix: kd_distillation KeyError logprobs
* style
* fix: leave handling of pop logprobs to parent
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai >
2025-08-06 08:02:07 -04:00
NanoCode012
e3177c3210
feat: add complete optimizer docs ( #3017 ) [skip ci]
...
* feat: add complete optimizer docs
* fix: deprecate old torchao adamw low bit
2025-08-06 08:01:51 -04:00
Wing Lian
70faea331f
add support for connecting via prime-intellect ( #3021 )
2025-08-06 01:06:52 -04:00
Wing Lian
8021c718ce
use skip_move_to_device for all cases ( #3015 )
...
* use skip_move_to_device for all cases
* use experimental option for skip move
2025-08-06 00:13:12 -04:00
Wing Lian
42f5e6f9e9
upgrade transformers==4.55.0 ( #3018 )
2025-08-05 16:29:12 -04:00
Wing Lian
ab49d16e34
Dion optimizer support ( #3014 )
...
* Add support for Dion optimizer
* dion training kwargs
* fix var names
* no dion 8bit for now
* use updated axolotl-contribs-mit for dion optimizer
* add smoke test for dion optimizer
* add docs
* fix typo during edits
* fix test to not remove load in 8bit
2025-08-04 16:33:30 -04:00
Carsten Kragelund Jørgensen
33d094721c
fix: deepcopy lr in RexLR scheduler. ( #3012 )
...
* fix: deepcopy lr in RexLR scheduler.
This fixes a problem where, when the lr is a scalar tensor, the base_lrs in the get_lr function end up as references to the current learning rate rather than the correct initial learning rate (see the sketch after this entry).
See also related pytorch PR https://github.com/pytorch/pytorch/pull/127190/
* fix: add missing torch.Tensor import
2025-08-04 10:23:49 -04:00
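A hedged sketch of the aliasing issue #3012 describes, with the scheduler internals simplified: when the learning rate arrives as a scalar tensor, storing it without copying means the remembered "initial" value silently tracks later in-place updates, while a deepcopy keeps an independent snapshot.

```python
import copy

import torch

lr = torch.tensor(1e-3)

base_lr_aliased = lr                # same tensor object as the live lr
base_lr_copied = copy.deepcopy(lr)  # independent snapshot of the initial lr

lr.mul_(0.1)  # the scheduler later updates the learning rate in place

print(base_lr_aliased)  # tensor(1.0000e-04) -- drifted with the current lr
print(base_lr_copied)   # tensor(0.0010)     -- still the initial lr
```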
NanoCode012
a54c1be972
Fix: shorten mem logs to 2 decimal places and rename nd docs ( #3011 ) [skip ci]
...
* fix: shorten memory logs
* fix: title name
2025-08-04 10:23:36 -04:00
github-actions[bot]
5691992d34
chore: update pre-commit hooks ( #3009 ) [skip ci]
...
Co-authored-by: djsaunde <1245942+djsaunde@users.noreply.github.com >
2025-08-04 10:23:19 -04:00
Dan Saunders
e758343cac
FSDP2 + LoRA kernels ( #2992 )
...
* impl fix
* smoke tests
* patches for fsdp2 + qlora compat
* nit
* working fix
* working fix
* fix merge
* minifying patches; update bnb dep
* renaming; adding tests
* remove duplicate test, add dora guard
* generalize __torch_function__
* revert generalization
* update comments
2025-08-03 20:05:17 -04:00
Wing Lian
deac7b18a1
upgrade peft v0.17.0 and support for lora target_parameters ( #3006 )
2025-08-02 20:24:04 -04:00
Wing Lian
10946afae7
fixes for spinning up vllm service for grpo ( #3001 )
2025-08-02 11:19:24 -04:00
Wing Lian
5639552064
prevent usage of low bit ao optimizers with configurations that use parameter groups ( #3003 )
...
* prevent usage of low bit ao optimizers with configurations that use parameter groups
* use optimizer enum value
* fix validation
2025-08-01 17:54:04 -04:00
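A sketch of the kind of guard #3003 implies, assuming illustrative option names rather than the exact Axolotl validator: low-bit torchao optimizers are rejected whenever the run also defines parameter groups.

```python
# Illustrative optimizer names, not the exact enum values used by Axolotl.
LOW_BIT_AO_OPTIMIZERS = {"adamw_torchao_4bit", "adamw_torchao_8bit"}


def validate_optimizer(optimizer: str, has_param_groups: bool) -> None:
    """Reject low-bit torchao optimizers when custom parameter groups are configured."""
    if optimizer in LOW_BIT_AO_OPTIMIZERS and has_param_groups:
        raise ValueError(
            f"{optimizer} cannot be used with configurations that define parameter groups"
        )


validate_optimizer("adamw_torch", has_param_groups=True)           # passes
# validate_optimizer("adamw_torchao_8bit", has_param_groups=True)  # would raise
```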
Wing Lian
cda3c82351
move ib/rdma libs into base image ( #3002 )
...
* move ib/rdma libs into base image
* use --no-install-recommends
2025-08-01 16:10:37 -04:00
Wing Lian
7c3b428f23
Add validation for TP with models with tied embeddings ( #2999 )
...
* add validation for tp + tied embeddings models
* fix logic and messaging
* add additional guard for null tp size
2025-08-01 13:58:16 -04:00
Wing Lian
01a6bd1a0e
use CCE fix for TP using vocab parallel for CEL ( #3000 )
2025-08-01 13:21:58 -04:00