Wing Lian
34b68ddaae
curl with apt instead of pip
2025-05-18 15:17:38 -07:00
Wing Lian
9a3d0c919b
make sure curl is installed
2025-05-18 15:17:38 -07:00
Wing Lian
bd34d0b861
install for hopper from pre-built wheel
2025-05-18 15:17:38 -07:00
Wing Lian
37220ab90a
install pybind11 for fa3 build
2025-05-18 15:17:38 -07:00
Wing Lian
e1b74d710b
update docker args to minimums used and use MAX_JOBS already set as arg
2025-05-18 15:17:38 -07:00
Wing Lian
79daf5b934
reduce max jobs for build of fa3
2025-05-18 15:17:38 -07:00
Wing Lian
ddd7c55576
build hopper w fa3 on torch 2.6
2025-05-18 15:17:37 -07:00
Wing Lian
65c6c98a76
whitespace fix in dockerfile
2025-05-18 15:17:37 -07:00
Wing Lian
4ef2e8293f
fix the bash in docker base
2025-05-18 15:17:37 -07:00
Wing Lian
c126d5cd04
fix suffix for tag
2025-05-18 15:17:37 -07:00
Wing Lian
9b0be4f15c
fix 12.8 image and add flash-attn v3 hopper base image
2025-05-18 15:17:37 -07:00
Wing Lian
a27b909c5c
GRPO fixes (peft) (#2676)
* don't set peft_config on grpo to prevent double peft wrap
* remove overrides needed to support bug
* fix grpo tests
* require more CPU for multigpu to help with torch compile for vllm
2025-05-16 15:47:03 -04:00
xzuyn
6cb07b9d12
Fix for setting adam_beta3 and adam_epsilon2 for CAME Optimizer (#2654) [skip ci]
* make setting `adam_beta3` and `adam_epsilon2` work correctly
* update config docs so users know args are specific to CAME optim
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-05-16 15:46:50 -04:00
C080
288653adb6
Fix: Make MLflow config artifact logging respect hf_mlflow_log_artifacts setting (#2675) [skip ci]
* Fix: Make MLflow config artifact logging respect hf_mlflow_log_artifacts setting
* cleanup and lint
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-05-16 15:46:31 -04:00
NanoCode012
3a5b495a74
Fix: improve doc on merge/inference cli visibility (#2674)
* feat: improve visibility for merge doc
* feat: add tip on reuse config between modes
2025-05-16 13:07:40 -04:00
xzuyn
f661858fc4
Print dataset name (#2668) [skip ci]
2025-05-16 13:06:58 -04:00
Eric Meier
c837c4a424
Add missing init file to liger plugin (#2670) [skip ci]
2025-05-16 13:06:46 -04:00
michelyang
c9797de6bb
Add num_proc to fix dataset slow processing issue (#2681) [skip ci]
2025-05-16 13:06:20 -04:00
Wing Lian
8f8a7afb05
Add ci and images for CUDA 12.8 for B200s (#2683) [skip ci]
* Add ci and images for CUDA 12.8 for B200s
* add comments explaining CI [skip e2e]
2025-05-16 13:06:08 -04:00
NanoCode012
86472715da
fix: remove doc string imports in monkeypatches (#2671) [skip ci]
2025-05-16 13:05:55 -04:00
Wing Lian
c0a0c7534c
Activation checkpointing with offloading to disk with prefetch (#2663)
* offload activations to disk instead of CPU RAM
* add prefetch
* Disco :dance:
* include offload_disk in e2e test for AC
* document and make sure to cleanup
* fix annotation to match docs
* fix docs build
* address PR feedback
2025-05-13 16:39:39 -04:00
Wing Lian
7fa1089cea
Atropos support (#2666) [skip ci]
* allow peft+liger+grpo and custom vllm serve for atropos support
* set trainer class for RL
2025-05-13 08:30:58 -04:00
Dan Saunders
80304c26a7
SP GRPO support + batch SP fixes (#2643)
* ctx manager for SP
* updates
* update
* further simplifying
* simplifying
* simplifying
* reorg
* batch api HF adapter for ring-flash-attn; cleanup and improvements
* update
* adding all batch ring-flash-attn methods via single adapter
* fix
* fixes for batch API funcs, simplify
* fix
* grpo sp support
* progress
* stronger subclassing of TRL GRPO trainer; custom distributed sampler
* subclassing constructor
* progress
* finalizing SP + GRPO trainer
* minimize diffs to GRPO trainer
* remove (most of) the custom GRPO trainer logic
* debug
* debug
* update
* update
* update
* progress
* cleanup
* cleanup
* minor changes
* update
* update
* update
* small changes
* updates
* cleanup; torch.compile ring_flash_attn functions to prevent numerical instability; lint
* spacing
* cleanup; log in pydantic model config only on main process
* remove comment
* fix sp sampler, update to latest upstream code, doc
* add docs
* update quartodoc autodoc contents
* fix, simplifications
* fixes + simplifications
* review comments
* lint
* removing main process only logs in favor of #2608
* fixes, additional smoke test
* updates
* more tests
* update
* fix grad accum bug (sort of)
* lint, tests
* todo
2025-05-12 17:52:40 -04:00
NanoCode012
67c4ea9c7c
fix: disable auto lora kernel if dropout nonzero (#2655) [skip ci]
* fix: disable auto lora kernel if dropout nonzero
* Add comment from PR feedback
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-05-12 16:23:53 -04:00
Wing Lian
526ddb886d
guard on deleting secrets from env (#2653) [skip ci]
2025-05-12 14:18:42 -04:00
Wing Lian
f34eef546a
update doc and use P2P=LOC for brittle grpo test (#2649)
* update doc and skip brittle grpo test
* fix the path to run the multigpu tests
* increase timeout, use LOC instead of NVL
* typo
* use hf cache from s3 backed cloudfront
* mark grpo as flaky test due to vllm start
2025-05-12 14:17:25 -04:00
Wing Lian
c7b6790614
Various fixes for CI, save_only_model for RL, prevent packing multiprocessing deadlocks (#2661)
* lean mistral ft tests, remove e2e torch 2.4.1 test
* make sure to pass save_only_model for RL
* more tests to make ci leaner, add cleanup to modal ci
* fix module for import in e2e tests
* use mp spawn to prevent deadlocks with packing
* make sure cleanup shell script is executable when cloned out
2025-05-12 10:51:18 -04:00
Dan Saunders
47e0e71bc8
don't sort multipack sampler (#2657)
* don't sort multipack sampler
* increased packing efficiency increases loss
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-05-09 20:28:58 -04:00
Wing Lian
0f3587174d
swap tinymodels that have safetensors for some ci tests (#2641)
2025-05-07 15:06:07 -04:00
xzuyn
25e6c5f9bd
Add CAME Optimizer (#2385)
2025-05-07 10:31:46 -04:00
NanoCode012
32f51bca35
fix(doc): clarify instruction to delinearize llama4 similar to cli doc (#2644) [skip ci]
2025-05-07 10:29:47 -04:00
NanoCode012
9daa04da90
Fix: improve error message on failed dataset load (#2637) [skip ci]
* fix(log): clarify error on dataset loading failed
* fix: add path for easy tracking of broken config
* fix: improve error message based on pr feedback
2025-05-07 10:29:05 -04:00
Wing Lian
0d71b0aa5f
Configurable embeddings upcast (#2621)
* fsdp embeddings should be float32 per comment
* patch peft to not upcast everything
* add tabs back to code check
* fix import
* add configurable option and fix check
* add check for dtypes
* move embeddings test to patch dir
* fix test
* fix comment and logic
2025-05-06 23:40:44 -04:00
Eric Meier
63aaccf85b
Fix cut_cross_entropy plugin install (#2642) [skip ci]
2025-05-06 22:56:00 -04:00
Wing Lian
ff0fe767c8
xformers attention with packing (#2619)
* xformers attention with packing
* wire up the patch
* fix xformers + packing validation
* fix warning
* reorder the packing check
* fix fp16 / bf16 reset when using fp16 with bf16 auto
* fix seq lens calc to drop hanging sequences
* handle xformers patch for inference too
* fix batch size setter
* fix xformers inference
* add colab callback to fix inference post train
* PR feedback
2025-05-06 22:49:22 -04:00
Wing Lian
8e4158cc0b
Multipack parallel bin packing (#2631)
* improve readability of multipack sampler
* parallel bin packing
fix error with lambda and pickling
make sure things are in float instead of np.float
* annotations and comments update
* support for configurable group and bin size for sample packing
* fix missing map back to original indices
2025-05-06 20:08:08 -04:00
Wing Lian
cd84325253
allow plugins to return their own dataset (#2617) [skip ci]
* allow plugins to return their own dataset
* add post_trainer_create and wire up
* add hook check
* address PR feedback
* remove annotation causing circular import
2025-05-06 20:05:51 -04:00
NanoCode012
0b140fef83
feat(doc): add split_thinking docs (#2613) [skip ci]
* feat(doc): add split_thinking docs
* fix: link config.qmd to conversation.qmd for split_thinking example
* update thinking => reasoning_content in messages format
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-05-06 20:05:32 -04:00
Wing Lian
e4cfebe995
bump liger dep to 0.5.9 (#2640) [skip ci]
* bump liger dep to 0.5.9
* also upgrade vllm to post1, and datasets to 3.5.1
2025-05-06 20:05:19 -04:00
mhenrichsen
a6cac5dd32
Update lr_scheduler options in config.qmd to include additional scheduling strategies for improved training flexibility. (#2636) [skip ci]
2025-05-06 11:24:07 -04:00
Wing Lian
b71c0e3447
Print axolotl art if train is called outside of cli (#2627) [skip ci]
2025-05-06 11:18:45 -04:00
Wing Lian
ddaebf8309
fix dpo eval override to call grandparent instead of the broken super (#2628) [skip ci]
2025-05-06 11:18:25 -04:00
Wing Lian
679743087a
make sure gc_steps is used for all trainers (#2638)
2025-05-06 11:18:00 -04:00
Wing Lian
f720b6e72d
repop cache (#2639)
* repop cache
* pre-cache as a step
* fix the name
* add reason for pytest skipif
* restore pytorch matrix
* remove max-parallel now that we've optimized this a bit
2025-05-06 11:09:07 -04:00
mhenrichsen
a980618fd0
Adds example for training a TTS model on top of an LLM. (#2614)
* Adds example for training a TTS model on top of an LLM.
* Update examples/orpheus/finetune.yml
Co-authored-by: NanoCode012 <nano@axolotl.ai>
* Update examples/orpheus/finetune.yml
Co-authored-by: NanoCode012 <nano@axolotl.ai>
* Update README.md to clarify GPU requirements for finetuning Orpheus TTS model
* Update finetune.yml to use the new base model canopylabs/orpheus-3b-0.1-pretrained
* Update finetune.yml and README.md for consistency and clarity
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-05-06 10:11:06 +02:00
Emmanuel Ferdman
54960d4de0
Fix logging deprecation warnings (#2623)
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
2025-05-04 08:22:45 -04:00
Wing Lian
ed922796b7
include multipack support for qwen3 family (#2622)
2025-05-03 12:02:39 -04:00
Wing Lian
3dd9c3bf3f
setup hf transfer too and fix auto bf16 when fp16 enabled (#2620) [skip ci]
2025-05-03 12:02:26 -04:00
Wing Lian
0ba7d362fa
qwen3 and qwen3_moe support for liger kernels (#2612)
* qwen3 and qwen3_moe support for liger kernels
* fix moe module path
* fix: qwen3 liger input args and mlp
* fix: qwen3 input args and output class
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-05-02 09:29:55 -04:00
aitechguy
e4f73bc98e
remove keys to incorporate changes for the trl update (#2616)
2025-05-02 08:47:42 -04:00