Commit Graph

2086 Commits

Author SHA1 Message Date
Wing Lian
5a36b6ff2d Atropos support (#2666) [skip ci]
* allow peft+liger+grpo and custom vllm serve for atropos support

* set trainer class for RL
2025-05-13 17:06:05 -04:00
NanoCode012
224da88fa2 fix: disable auto lora kernel if dropout nonzero (#2655) [skip ci]
* fix: disable auto lora kernel if dropout nonzero

* Add comment from PR feedback

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-05-13 17:05:20 -04:00
Wing Lian
493eb8e5c6 update doc and use P2P=LOC for brittle grpo test (#2649)
* update doc and skip brittle grpo test

* fix the path to run the multigpu tests

* increase timeout, use LOC instead of NVL

* typo

* use hf cache from s3 backed cloudfront

* mark grpo as flaky test due to vllm start
2025-05-13 17:05:11 -04:00
Wing Lian
4780ac7c4d guard on deleting secrets from env (#2653) [skip ci] 2025-05-13 17:03:27 -04:00
Wing Lian
cf69de2eb9 Various fixes for CI, save_only_model for RL, prevent packing multiprocessing deadlocks (#2661)
* lean mistral ft tests, remove e2e torch 2.4.1 test

* make sure to pass save_only_model for RL

* more tests to make ci leaner, add cleanup to modal ci

* fix module for import in e2e tests

* use mp spawn to prevent deadlocks with packing

* make sure cleanup shell script is executable when cloned out
2025-05-13 17:03:08 -04:00
Wing Lian
27e3329273 .post1 version release for multipack fix
v0.9.1.post1
2025-05-09 21:54:04 -04:00
Dan Saunders
27fec49083 don't sort multipack sampler (#2657)
* don't sort multipack sampler

* increased packing efficiency increases loss

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-05-09 21:53:29 -04:00
Wing Lian
8cda9e93c1 set version for v0.9.1
v0.9.1
2025-05-07 16:10:51 -04:00
Wing Lian
17d715c2b3 swap tinymodels that have safetensors for some ci tests (#2641) 2025-05-07 16:10:18 -04:00
xzuyn
f943306263 Add CAME Optimizer (#2385) 2025-05-07 16:10:17 -04:00
NanoCode012
3c8b9b33d6 fix(doc): clarify instruction to delinearize llama4 similar to cli doc (#2644) [skip ci] 2025-05-07 16:10:17 -04:00
NanoCode012
8b0c2a71ad Fix: improve error message on failed dataset load (#2637) [skip ci]
* fix(log): clarify error on dataset loading failed

* fix: add path for easy tracking of broken config

* fix: improve error message based on pr feedback
2025-05-07 16:10:17 -04:00
Wing Lian
493910559a Configurable embeddings upcast (#2621)
* fsdp embeddings should be float32 per comment

* patch peft to not upcast everything

* add tabs back to code check

* fix import

* add configurable option and fix check

* add check for dtypes

* move embeddings test to patch dir

* fix test

* fix comment and logic
2025-05-07 16:10:16 -04:00
Eric Meier
c54534dbfa Fix cut_cross_entropy plugin install (#2642) [skip ci] 2025-05-07 16:10:16 -04:00
Wing Lian
cae5cebb59 xformers attention with packing (#2619)
* xformers attention with packing

* wire up the patch

* fix xformers + packing validation

* fix warning

* reorder the packing check

* fix fp16 / bf16 reset when using fp16 with bf16 auto

* fix seq lens calc to drop hanging sequences

* handle xformers patch for inference too

* fix batch size setter

* fix xformers inference

* add colab callback to fix inference post train

* PR feedback
2025-05-07 16:10:16 -04:00
Wing Lian
fcbd7477d0 Multipack parallel bin packing (#2631)
* improve readability of multipack sampler

* parallel bin packing
fix error with lambda and pickling

make sure things are in float instead of np.float

* annotations and comments update

* support for configurable group and bin size for sample packing

* fix missing map back to original indices
2025-05-07 16:10:15 -04:00
Wing Lian
038db85a40 allow plugins to return their own dataset (#2617) [skip ci]
* allow plugins to return their own dataset

* add post_trainer_create and wire up

* add hook check

* address PR feedback

* remove annotation causing circular import
2025-05-07 16:10:15 -04:00
NanoCode012
680dcc5a4d feat(doc): add split_thinking docs (#2613) [skip ci]
* feat(doc): add split_thinking docs

* fix: link config.qmd to conversation.qmd for split_thinking example

* update thinking => reasoning_content in messages format

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-05-07 16:10:15 -04:00
Wing Lian
fed5ca8254 bump liger dep to 0.5.9 (#2640) [skip ci]
* bump liger dep to 0.5.9

* also upgrade vllm to post1, and datasets to 3.5.1
2025-05-07 16:10:15 -04:00
mhenrichsen
7a2d017c88 Update lr_scheduler options in config.qmd to include additional scheduling strategies for improved training flexibility. (#2636) [skip ci] 2025-05-07 16:10:15 -04:00
Wing Lian
8c0303aa5e Print axolotl art if train is called outside of cli: (#2627) [skip ci] 2025-05-07 16:10:14 -04:00
Wing Lian
5d61169f7c fix dpo eval override to call grandparent instead of the broken super (#2628) [skip ci] 2025-05-07 16:10:14 -04:00
Wing Lian
e1586f7919 make sure gc_steps is used for all trainers (#2638) 2025-05-07 16:10:14 -04:00
Wing Lian
e4bf3ffb17 repop cache (#2639)
* repop cache

* pre-cache as a step

* fix the name

* add reason for pytest skipif

* restore pytorch matrix

* remove max-parallel now that we've optimized this a bit
2025-05-07 16:10:14 -04:00
mhenrichsen
30150fe1e1 Adds example for training a TTS model on top of a LLM. (#2614)
* Adds example for training a TTS model on top of a LLM.

* Update examples/orpheus/finetune.yml

Co-authored-by: NanoCode012 <nano@axolotl.ai>

* Update examples/orpheus/finetune.yml

Co-authored-by: NanoCode012 <nano@axolotl.ai>

* Update README.md to clarify GPU requirements for finetuning Orpheus TTS model

* Update finetune.yml to use the new base model canopylabs/orpheus-3b-0.1-pretrained

* Update finetune.yml and README.md for consistency and clarity

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-05-07 16:10:14 -04:00
Emmanuel Ferdman
7f7d7ade2e Fix logging deprecation warnings (#2623)
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
2025-05-07 16:10:14 -04:00
Wing Lian
776cf70fe4 include multipack support for qwen3 family (#2622) 2025-05-07 16:10:14 -04:00
Wing Lian
8730951aba setup hf transfer too and fix auto bf16 when fp16 enabled (#2620) [skip ci] 2025-05-07 16:10:13 -04:00
Wing Lian
e72c11ad55 qwen3 and qwen3_moe support for liger kernels (#2612)
* qwen3 and qwen3_moe support for liger kernels

* fix moe module path

* fix: qwen3 liger input args and mlp

* fix: qwen3 input args and output class

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-05-07 16:10:13 -04:00
aitechguy
1a7978b960 remove keys to incorporate changes for the trl update (#2616) 2025-05-07 16:10:13 -04:00
Wing Lian
60b0d14f1d automatically set pad_to_sequence_len when using packing (#2607)
* automatically set pad_to_sequence_len when using packing

* update tests
2025-05-07 16:10:13 -04:00
NanoCode012
a7a40378f5 fix: run preview-docs only when md/qmd changes (#2606)
* fix: run preview-docs only when md/qmd changes

* feat: add quarto yaml based on PR feedback
2025-05-07 16:10:13 -04:00
Wing Lian
b50d35bec9 Logging config for colab (#2611)
* only configure logging on cli to play nicely with colab

* allow reloading the config on the fly from a dict

* make sure to use dict for yaml

* reuse existing function for load

* make cli args optional

* mps fix and respect max_steps
2025-05-07 16:10:13 -04:00
Wing Lian
bc6dfa6899 add missing __init__ for lr monkeypatch fix (#2609) 2025-05-07 16:10:13 -04:00
Dhruv Mullick
9d6e8af622 Add num_completions_to_print for trl and grpo (#2604) 2025-05-07 16:10:12 -04:00
Wing Lian
17b441248c use latest hf-xet and don't install vllm for torch 2.7.0 (#2603)
* use latest hf-xet and don't install vllm for torch 2.7.0

* fix runpod hub tests
2025-05-07 16:10:12 -04:00
Wing Lian
d49a4268b8 additional args for grpo config/trainer (#2598) 2025-05-07 16:10:12 -04:00
Wing Lian
1d6e931115 replace zero_only with simpler if statement (#2592) 2025-05-07 16:10:12 -04:00
Wing Lian
ff106ace44 ensure we pass axolotl extras to the Dockerfile so vllm is included in shipped images (#2599) 2025-05-07 16:10:12 -04:00
Wing Lian
24907533d1 don't automatically enable lora kernels for RL training (#2600) 2025-05-07 16:10:12 -04:00
Wing Lian
0e9d816d2e only import vllm serve cli if it's being called (#2597) [skip ci] 2025-05-07 16:10:12 -04:00
Wing Lian
72f142186a Handle other reasoning trace dataset formats (#2591)
* Handle other reasoning trace dataset formats

* rename var to improve readability

* chore: refactor with comments

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-05-07 16:10:11 -04:00
Wing Lian
87726322bf upload the deepspeed json to wandb (#2593) [skip ci] 2025-05-07 16:10:11 -04:00
NanoCode012
ae8ae7534c feat: add qwen3 moe block for ds3 (#2596) [skip ci] 2025-05-07 16:10:11 -04:00
Wing Lian
ee00142cb5 patch to convert LR from tensor to float when using DS (#2595) [skip ci] 2025-05-07 16:10:11 -04:00
Aleksandr Dremov
097e7e3b5b Plugins create_lr_scheduler support (#2584)
* lr_scheduler support

* fix

* Update scheduler.py

* Update scheduler.py

* cfg handling

* black

* remove debug

* remove adding the axolotl cfg to the scheduler mixin

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-05-07 16:10:11 -04:00
Dan Saunders
c714958181 auto-enable lora kernels where possible (#2589)
* auto-enable lora kernels where possible

* test

* revert change to example yaml

* naming

* remove print

* slight logic change
2025-05-07 16:10:11 -04:00
NanoCode012
4402c293dc fix(doc): key used to point to url in multimodal doc (#2575) [skip ci] 2025-05-07 16:10:10 -04:00
Wing Lian
0d71f787a3 bump vllm==0.8.5 for qwen3 support (#2583) [skip ci] 2025-05-07 16:10:10 -04:00
Wing Lian
c337ca0872 support for qwen3 with lora kernels (#2588)
* support for qwen3 with lora kernels

* fix patch

* typo
2025-05-07 16:10:10 -04:00