Wing Lian
2413688b08
upload the deepspeed json to wandb ( #2593 ) [skip ci]
2025-04-30 03:32:44 -04:00
NanoCode012
5bb1f3da56
feat: add qwen3 moe block for ds3 ( #2596 ) [skip ci]
2025-04-30 03:32:23 -04:00
Wing Lian
a21b9cc472
patch to convert LR from tensor to float when using DS ( #2595 ) [skip ci]
2025-04-30 03:31:57 -04:00
Aleksandr Dremov
41a1ec0c95
Plugins create_lr_scheduler support ( #2584 )
...
* lr_scheduler support
* fix
* Update scheduler.py
* Update scheduler.py
* cfg handling
* black
* remove debug
* remove adding the axolotl cfg to the scheduler mixin
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-04-29 17:08:30 -04:00
Dan Saunders
ecac731922
auto-enable lora kernels where possible ( #2589 )
...
* auto-enable lora kernels where possible
* test
* revert change to example yaml
* naming
* remove print
* slight logic change
2025-04-29 16:18:49 -04:00
NanoCode012
742fef4200
fix(doc): key used to point to url in multimodal doc ( #2575 ) [skip ci]
2025-04-29 15:10:59 -04:00
Wing Lian
a39caf8824
bump vllm==0.8.5 for qwen3 support ( #2583 ) [skip ci]
2025-04-29 15:10:40 -04:00
Wing Lian
07e4f2e25b
support for qwen3 with lora kernels ( #2588 )
...
* support for qwen3 with lora kernels
* fix patch
* typo
2025-04-29 15:02:49 -04:00
Dan Saunders
c7d07de6b4
Fix eval + add smoke test ( #2586 )
...
* fix evaluate CLI
* add smoke test
* fix naming
* lint
2025-04-29 12:58:54 -04:00
Wing Lian
6565ae85d8
set config on the PluginManager for callback access ( #2587 )
2025-04-29 12:05:44 -04:00
Wing Lian
80b4edb4a7
Post release fixes ( #2581 )
...
* fix missing kwarg on child
* make the runpod test shorter
* update docs
* rename runpod test json file
* typing fixes and ordering of doc
2025-04-29 10:01:38 -04:00
Wing Lian
fedbcc0254
remove torch 2.4.1 CI as part of support deprecation ( #2582 )
2025-04-29 08:28:32 -04:00
Wing Lian
8175896ada
add dev tag for v0.10.0.dev0 ( #2580 )
2025-04-28 20:30:14 -04:00
Wing Lian
14d670dbf0
v0.9.0 release ( #2578 )
v0.9.0
2025-04-28 18:23:17 -04:00
Wing Lian
2d77165dc0
automatically split out reasoning trace from dataset ( #2579 )
...
* automatically split out reasoning trace from dataset
* chore: lint
* fix import
2025-04-28 18:23:03 -04:00
Wing Lian
63b17e3109
chat template and example for qwen3 ( #2577 )
2025-04-28 15:09:41 -04:00
NanoCode012
1178a15ede
Feat: Add qwen3 and CCE for qwen family ( #2518 )
2025-04-28 12:18:46 -04:00
Wing Lian
c513487d1a
support val_set_size for splitting test split from train with DPO ( #2572 )
2025-04-28 12:12:15 -04:00
Dan Saunders
dda95e6c40
add preview-docs workflow ( #2432 )
...
* add preview-docs workflow
* update preview-docs workflow
* use correct publish-dir
* install deps prior to docs build
* use correct publish-dir
* use quarto publish with netlify target
* adding _publish.yml
* fix
* fix
* fix
* remove unused file
* fix naming
---------
Co-authored-by: Dan Saunders <dan@axolotl.ai>
2025-04-28 11:20:46 -04:00
NanoCode012
7099343c56
feat: add eos_tokens and train_on_eot for chat_template EOT parsing ( #2364 )
...
* feat: add eos_tokens and train_on_eot for chat_template EOT parsing
* fix: comments
* chore: add some examples of tokens
* feat: add new potential errors for chat_template to faq
* feat: add examples for EOT handling
* fix: change error to warning for missing EOS
* fix: warning typo
* feat: add tests for eot token handling
* fix: remove broken caplog capture in test
* fix: chattemplate strategy with kd missing eot changes
2025-04-28 10:11:20 -04:00
Wing Lian
5000cb3fe7
grab sys prompt too from dataset ( #2397 ) [skip ci]
...
* grab sys prompt too from dataset
* chore: add field_system to docs
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-04-28 10:11:06 -04:00
divyanshuaggarwal
170cdb5be9
Add Post_model_load, post_lora_load, post_train, post_train_unload function calls ( #2539 )
...
* Update train.py
add post_model_load and post_lora_load model calls.
* Update train.py
add post_train and post_train_unload function calls
* Update train.py
* Update base.py
* Update train.py
* chore: lint
* clarify plugin hooks
* Update src/axolotl/integrations/base.py
Co-authored-by: Dan Saunders <danjsaund@gmail.com>
* Update src/axolotl/utils/models.py
Co-authored-by: Dan Saunders <danjsaund@gmail.com>
* Update src/axolotl/utils/models.py
Co-authored-by: Dan Saunders <danjsaund@gmail.com>
* Update src/axolotl/integrations/base.py
Co-authored-by: Dan Saunders <danjsaund@gmail.com>
* Update models.py
* Update models.py
* remove extra call to post_model_load
* chore: lint
* add test for hooks and gc trainer
* disable duplicated code check for test
* fix the path and add better handling
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: Dan Saunders <danjsaund@gmail.com>
2025-04-28 10:10:28 -04:00
Ezekiel Wotring
5d182a1056
Add runpod sls handler ( #2530 ) [skip ci]
...
* Add runpod sls handler
* remove LICENSE and fix README
* chore: lint
* use axolotl cloud image as base and various fixes
* fix: trim allowed cuda versions
* restore dockerfile
* chore: update title
* use axolotl cloud image
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-04-28 10:08:32 -04:00
Wing Lian
40f4ea23ab
replace references to random 68m model w 135m smollm2 ( #2570 ) [skip ci]
...
* replace references to random 68m model w 135m smollm2
* use AutoTokenizer for smollm2
2025-04-28 10:08:07 -04:00
NanoCode012
f1df73a798
fix(doc): clarify vllm usage with grpo ( #2573 ) [skip ci]
...
* fix(doc): clarify vllm usage with grpo
* nit
Co-authored-by: salman <salman.mohammadi@outlook.com>
* Update docs/rlhf.qmd
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: salman <salman.mohammadi@outlook.com>
2025-04-28 10:07:45 -04:00
Dhruv Mullick
8b33ae1c4f
Fix bug in grpo reward module import ( #2571 )
2025-04-28 00:31:56 -04:00
Wing Lian
dc4da4a7e2
update trl to 0.17.0 ( #2560 )
...
* update trl to 0.17.0
* grpo + vllm no longer supported with 2.5.1 due to vllm constraints
* disable VLLM_USE_V1 for ci
* improve handling of killing off the multiprocessing vllm service
* debug why this doesn't run in CI
* increase vllm wait time
* increase timeout to 5min
* upgrade to vllm 0.8.4
* dump out the vllm log for debugging
* use debug logging
* increase vllm start timeout
* use NVL instead
* disable torch compile cache
* revert some commented checks now that grpo tests are fixed
* increase vllm timeout back to 5min
2025-04-27 19:19:53 -04:00
Wing Lian
f9c7c3bb72
don't use is_main_process during config validation ( #2569 )
2025-04-26 14:14:52 -04:00
Wing Lian
caf5cb63ea
add e2e smoke test for using activation/gradient checkpointing with offload ( #2565 )
...
* add e2e smoke test for using activation/gradient checkpointing with offload
* disable duplicate code check for the test
* fix relative import
* seq len too small to test this dataset with packing
* Fix checkpoint patching for tests
2025-04-25 21:11:17 -04:00
Wing Lian
5dba5c82a8
fix support for wandb run_name for rl trainers ( #2566 ) [skip ci]
...
* fix support for wandb run_name for rl trainers
* prefer to use wandb random names for run_name
2025-04-25 21:10:54 -04:00
Chiwan Park
e3c9d541a7
fix: crash when pretraining_dataset with dispatch_batches is false ( #2558 )
2025-04-25 17:15:03 -04:00
NanoCode012
9eba0ad118
chore(doc): update docker tags on doc ( #2559 ) [skip ci]
2025-04-25 17:14:48 -04:00
Wing Lian
53dbf97d85
make cce default to true when using the plugin ( #2562 ) [skip ci]
2025-04-25 17:14:26 -04:00
Eko Julianto Salim
2c2563bc34
fix: gradient checkpointing functools.partial object has no attribute __self__ ( #2563 ) [skip ci]
...
* fix: gradient checkpointing causing functools.partial error
* lint
* chore: lint
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-04-25 17:02:37 -04:00
Wing Lian
5cb3398460
don't fail on codecov upload for external contributor PRs ( #2564 ) [skip ci]
2025-04-25 15:10:55 -04:00
Dan Saunders
ae1c7ace63
Sequence parallel training context manager ( #2553 )
...
* ctx manager for SP
* updates
* update
* further simplifying
* accommodate both training context managers
* simplifying
* simplifying
* nit
* reorg
* tweak codecov yaml
* add gather post hook, simplify, fixes
* pytest
* pytest fix
2025-04-25 10:33:54 -04:00
Wing Lian
1447beb132
make sure to validate the config before normalizing so defaults get set ( #2554 )
...
* make sure to validate the config before normalizing so defaults get set
* validation not needed for particular test
* remove duplicate validations
* set qlora correctly
2025-04-24 13:01:43 -04:00
Dan Saunders
66f41ec6f1
disable codecov pr annotations ( #2556 )
2025-04-24 08:51:51 -04:00
NanoCode012
85053f4bd4
Fix(doc): add delinearize instruction ( #2545 )
...
* fix: mention to install pytorch before axolotl
* feat(doc): include instruction to delinearize
* fix: update instruction for delinearize with adapter
2025-04-24 01:03:43 -04:00
Wing Lian
a4d5112ae1
builds for torch 2.7.0 ( #2552 )
...
* builds for torch==2.7.0
* use xformers==0.0.29.post3
* no vllm support with torch 2.7
* update default, fix conditional
* no xformers for 270
* no vllm on 2.7.0 for multigpu test too
* remove deprecated verbose arg from scheduler
* 2.7.0 tests on cpu
2025-04-24 00:39:31 -04:00
Wing Lian
0d691cc2a7
add base docker image with pytorch 2.7.0 and variant for cuda 12.8 ( #2551 )
...
* add base docker image with pytorch 2.7.0 and variant for cuda 12.8
* my bash is terrible
2025-04-23 14:59:03 -04:00
Dan Saunders
c4053481ff
Codecov fixes / improvements ( #2549 )
...
* adding codecov reporting
* random change
* codecov fixes
* adding missing dependency
* fix
---------
Co-authored-by: Dan Saunders <dan@axolotl.ai>
2025-04-23 10:33:30 -04:00
NanoCode012
a6d28d19b1
feat: add glm and glm4 multipack and cce ( #2546 )
...
* feat: add glm and glm4 multipack
* feat: add glm4 example
* feat: add cce for glm
2025-04-23 10:27:51 -04:00
Wing Lian
32e335dd51
fix missing host/port for vllm ( #2543 )
...
* fix missing host/port for vllm
* set tensor parallel size so it doesn't always default to cli override
2025-04-22 10:16:48 -04:00
Wing Lian
7651550850
make sure to download fixtures for kd test ( #2541 )
...
* make sure to download fixtures for kd test
* use same alpaca dataset
2025-04-21 10:31:50 -04:00
Wing Lian
341e95aac9
prevent rate limiting to hf when using dispatch batches ( #2536 ) [skip ci]
2025-04-21 10:31:35 -04:00
Catgat
b882dfb63f
Fixed Rex Scheduler Warm Up ( #2535 ) [skip ci]
...
* Fixed Rex Scheduler Warm Up
* chore: lint
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-04-21 10:30:55 -04:00
Wing Lian
b640db1dbc
don't run multigpu tests twice, run SP in separate test ( #2542 )
...
* don't run multigpu tests twice, run SP in separate test
* fix multiline
2025-04-21 10:24:13 -04:00
Chiwan Park
4ce469d32e
fix: upgrade liger to 0.5.8 and use native Gemma3 patches ( #2527 )
...
* fix: upgrade liger to 0.5.8 and use native Gemma3 patches
* fix: make lint happy
* doc: update Liger Kernel FLCE support for Gemma 3
2025-04-18 09:57:40 -07:00
Wing Lian
60a8f0958d
zero val fix for beta ( #2538 )
2025-04-17 17:27:19 -07:00