mhenrichsen
f5a3e3529e
RL datasets: warn and drop unsalvageable over-length prompts post-truncate; add post-truncate filter; support alias config key 'excess_token_handling'
2025-08-12 20:37:41 +02:00
mhenrichsen
618b008e36
Merge branch 'main' into 775-option-to-drop-vs-truncate-on-rows-longer-than-context-length
2025-05-27 12:31:31 +02:00
Dan Saunders
5eb01f3df1
Fix quarto ( #2717 )
...
* missing modules
* fix quarto complaints
2025-05-23 21:16:51 -04:00
xzuyn
d27c35ac44
Liger GraniteMoE ( #2715 )
2025-05-23 18:40:43 -04:00
Dan Saunders
a535b68043
update quarto for model loading refactor ( #2716 )
...
* update quarto for model loading refactor
* fix desc
2025-05-23 16:28:31 -04:00
Dan Saunders
b5f1e53a0f
models.py -> loaders/ module refactor ( #2680 )
...
* models.py -> loaders/ module refactor
* refactor ModelLoader class
* plugin manager changes
* circular import fix
* pytest
* pytest
* minor improvements
* fix
* minor changes
* fix test
* remove dead code
* coderabbit comments
* lint
* fix
* coderabbit suggestion I liked
* more coderabbit
* review comments, yak shaving
* lint
* updating in light of SP ctx manager changes
* review comment
* review comment 2
2025-05-23 15:51:11 -04:00
Dan Saunders
8cde256db2
Remove unused const ( #2714 )
...
* remove unused const
* accidentally committed benchmark plot
2025-05-23 12:27:38 -04:00
Dan Saunders
5f8f817200
SP context manager update ( #2699 )
...
* utilize accelerate prepare_data_loader with patching
* lint
* cleanup, fix
* update to support DPO quirk
* coderabbit commits, cleanup, remove dead code
* fix
* move ring attn patching to sp ctx manager
* lint
* lint
* test fix
* test fix
2025-05-22 11:18:32 -04:00
NanoCode012
aa0492c366
feat: do not find turn indices if turn is not trainable ( #2696 )
...
* feat: do not find turn indices if turn is not trainable
* fix: handle edge case where train on eos/eot is all
* fix: improve warning message
2025-05-22 19:19:59 +07:00
NanoCode012
798b5f5cfd
fix(RL): address plugin rl overwriting trainer_cls ( #2697 ) [skip ci]
...
* fix: plugin rl overwrite trainer_cls
* feat(test): add test to catch trainer_cls is not None
2025-05-22 19:19:12 +07:00
NanoCode012
1c83a1a020
feat(doc): clarify minimum pytorch and cuda to use blackwell ( #2704 ) [skip ci]
2025-05-22 19:18:27 +07:00
Dan Saunders
6aa41740df
SP dataloader patching + removing custom sampler / dataloader logic ( #2686 )
...
* utilize accelerate prepare_data_loader with patching
* lint
* cleanup, fix
* update to support DPO quirk
* small change
* coderabbit commits, cleanup, remove dead code
* quarto fix
* patch fix
* review comments
* moving monkeypatch up one level
* fix
2025-05-21 11:20:20 -04:00
Wing Lian
a27b909c5c
GRPO fixes (peft) ( #2676 )
...
* don't set peft_config on grpo to prevent double peft wrap
* remove overrides needed to support bug
* fix grpo tests
* require more CPU for multigpu to help with torch compile for vllm
2025-05-16 15:47:03 -04:00
xzuyn
6cb07b9d12
Fix for setting adam_beta3 and adam_epsilon2 for CAME Optimizer ( #2654 ) [skip ci]
...
* make setting `adam_beta3` and `adam_epsilon2` work correctly
* update config docs so users know args are specific to CAME optim
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-05-16 15:46:50 -04:00
C080
288653adb6
Fix: Make MLflow config artifact logging respect hf_mlflow_log_artifa… ( #2675 ) [skip ci]
...
* Fix: Make MLflow config artifact logging respect hf_mlflow_log_artifacts setting
* cleanup and lint
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-05-16 15:46:31 -04:00
NanoCode012
3a5b495a74
Fix: improve doc on merge/inference cli visibility ( #2674 )
...
* feat: improve visibility for merge doc
* feat: add tip on reuse config between modes
2025-05-16 13:07:40 -04:00
xzuyn
f661858fc4
Print dataset name ( #2668 ) [skip ci]
2025-05-16 13:06:58 -04:00
Eric Meier
c837c4a424
Add missing init file to liger plugin ( #2670 ) [skip ci]
2025-05-16 13:06:46 -04:00
michelyang
c9797de6bb
Add num_proc to fix data set slow processing issue ( #2681 ) [skip ci]
2025-05-16 13:06:20 -04:00
Wing Lian
8f8a7afb05
Add ci and images for CUDA 12.8 for B200s ( #2683 ) [skip ci]
...
* Add ci and images for CUDA 12.8 for B200s
* add comments explaining CI [skip e2e]
2025-05-16 13:06:08 -04:00
NanoCode012
86472715da
fix: remove doc string imports in monkeypatches ( #2671 ) [skip ci]
2025-05-16 13:05:55 -04:00
mhenrichsen
5d7a61576d
Refactor sequence length overflow handling in pretraining module
...
- Introduced DEFAULT_SEQUENCE_LEN_OVERFLOW_HANDLING constant in utils.py.
- Updated encode_packed_pretraining function to use this constant instead of a hardcoded value.
2025-05-15 12:55:09 +02:00
mhenrichsen
5ecf22b54e
Merge branch 'main' of github.com:axolotl-ai-cloud/axolotl into 775-option-to-drop-vs-truncate-on-rows-longer-than-context-length
2025-05-14 13:36:43 +02:00
mhenrichsen
9c5b8da22f
fix merge conflicts
2025-05-14 13:33:42 +02:00
Wing Lian
c0a0c7534c
Activation checkpointing with offloading to disk with prefetch ( #2663 )
...
* offload activations to disk instead of CPU RAM
* add prefetch
* Disco :dance:
* include offload_disk in e2e test for AC
* document and make sure to cleanup
* fix annotation to match docs
* fix docs build
* address PR feedback
2025-05-13 16:39:39 -04:00
Wing Lian
7fa1089cea
Atropos support ( #2666 ) [skip ci]
...
* allow peft+liger+grpo and custom vllm serve for atropos support
* set trainer class for RL
2025-05-13 08:30:58 -04:00
mhenrichsen
fea6649518
increased test coverage
2025-05-13 08:58:34 +02:00
mhenrichsen
124ad2b968
lint
2025-05-13 08:35:16 +02:00
Dan Saunders
80304c26a7
SP GRPO support + batch SP fixes ( #2643 )
...
* ctx manager for SP
* updates
* update
* further simplifying
* simplifying
* simplifying
* reorg
* batch api HF adapter for ring-flash-attn; cleanup and improvements
* update
* adding all batch ring-flash-attn methods via single adapter
* fix
* fixes for batch API funcs, simplify
* fix
* grpo sp support
* progress
* stronger subclassing of TRL GRPO trainer; custom distributed sampler
* subclassing constructor
* progress
* finalizing SP + GRPO trainer
* minimize diffs to GRPO trainer
* remove (most of) the custom GRPO trainer logic
* debug
* debug
* update
* update
* update
* progress
* cleanup
* cleanup
* minor changes
* update
* update
* update
* small changes
* updates
* cleanup; torch.compile ring_flash_attn functions to prevent numerical instability; lint
* spacing
* cleanup; log in pydantic model config only on main process
* remove comment
* fix sp sampler, update to latest upstream code, doc
* add docs
* update quartodoc autodoc contents
* fix, simplifications
* fixes + simplifications
* review comments
* lint
* removing main process only logs in favor of #2608
* fixes, additional smoke test
* updates
* more tests
* update
* fix grad accum bug (sort of)
* lint, tests
* todo
2025-05-12 17:52:40 -04:00
mhenrichsen
767c2340f1
docstring for tests
2025-05-12 22:57:43 +02:00
mhenrichsen
f6623c34cc
Linting fix
2025-05-12 22:53:30 +02:00
mhenrichsen
5dd8f0b2b8
Address review comments from winglian
2025-05-12 22:43:15 +02:00
NanoCode012
67c4ea9c7c
fix: disable auto lora kernel if dropout nonzero ( #2655 ) [skip ci]
...
* fix: disable auto lora kernel if dropout nonzero
* Add comment from PR feedback
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-05-12 16:23:53 -04:00
Wing Lian
526ddb886d
guard on deleting secrets from env ( #2653 ) [skip ci]
2025-05-12 14:18:42 -04:00
Wing Lian
f34eef546a
update doc and use P2P=LOC for brittle grpo test ( #2649 )
...
* update doc and skip brittle grpo test
* fix the path to run the multigpu tests
* increase timeout, use LOC instead of NVL
* typo
* use hf cache from s3 backed cloudfront
* mark grpo as flaky test due to vllm start
2025-05-12 14:17:25 -04:00
Wing Lian
c7b6790614
Various fixes for CI, save_only_model for RL, prevent packing multiprocessing deadlocks ( #2661 )
...
* lean mistral ft tests, remove e2e torch 2.4.1 test
* make sure to pass save_only_model for RL
* more tests to make ci leaner, add cleanup to modal ci
* fix module for import in e2e tests
* use mp spawn to prevent deadlocks with packing
* make sure cleanup shell script is executable when cloned out
2025-05-12 10:51:18 -04:00
mhenrichsen
be3c6bbd85
fix linting issues
2025-05-12 14:46:57 +02:00
mhenrichsen
f07db4f853
Refactor truncation logic in drop_long_rl_seq function
...
- Simplified the truncation process for chosen and rejected responses to ensure they fit within the specified sequence length while preserving the prompt.
- Improved readability by restructuring the code and removing redundant checks.
- Ensured that the function returns the sample correctly after processing, maintaining compatibility with existing handling options.
2025-05-12 14:40:10 +02:00
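The truncation described in this commit can be sketched roughly as follows. Only the function name `drop_long_rl_seq` appears in the commit message; the helper name, signature, and sample layout below are illustrative assumptions, not the repo's actual code.

```python
# Illustrative sketch: for an RL preference sample, keep the prompt intact
# and shorten only the chosen/rejected responses so that prompt + response
# fits within sequence_len. Returns None when the prompt alone already
# exceeds the limit (the unsalvageable case mentioned in a later commit).
def truncate_rl_responses(prompt_ids, chosen_ids, rejected_ids, sequence_len):
    budget = sequence_len - len(prompt_ids)  # tokens left for each response
    if budget <= 0:
        return None  # prompt alone is over-length; nothing to salvage
    return chosen_ids[:budget], rejected_ids[:budget]
```

Truncating the responses rather than the whole sequence preserves the prompt both completions share, which keeps the chosen/rejected pair comparable.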
mhenrichsen
17a5838d38
lint
2025-05-12 14:36:43 +02:00
mhenrichsen
9f68918f13
Implement configurable handling of excess tokens in datasets
...
- Added `excess_token_handling` option to the configuration, allowing users to choose between "drop" and "truncate" for handling tokens exceeding the maximum sequence length.
- Introduced `truncate_or_drop_long_seq` function to manage both single and batched samples based on the selected handling method.
- Updated relevant dataset processing functions to utilize the new handling option, ensuring backward compatibility with existing "drop" behavior.
- Enhanced logging to reflect truncation actions in dataset processing.
This change improves flexibility in managing sequence lengths during training and evaluation.
2025-05-12 14:08:43 +02:00
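The behavior this commit describes might look something like the sketch below. The option name `excess_token_handling` and the function name `truncate_or_drop_long_seq` come from the commit message; the signature, sample layout, and drop-as-`None` convention are assumptions for illustration.

```python
# Hypothetical sketch of the configurable handling: a sample whose token ids
# exceed sequence_len is either cut down ("truncate") or filtered out ("drop").
def truncate_or_drop_long_seq(sample, sequence_len, handling="drop"):
    """Return the (possibly truncated) sample, or None to signal a drop."""
    input_ids = sample["input_ids"]
    if len(input_ids) <= sequence_len:
        return sample
    if handling == "truncate":
        sample["input_ids"] = input_ids[:sequence_len]
        return sample
    return None  # "drop": caller filters this sample out of the dataset


sample = {"input_ids": list(range(10))}
truncate_or_drop_long_seq(sample, sequence_len=8, handling="truncate")
# sample["input_ids"] now holds the first 8 ids; with handling="drop" the
# function would instead return None and the sample would be filtered out.
```

Defaulting to "drop" matches the backward compatibility note above: existing configs keep the old behavior unless they opt in to truncation.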
Dan Saunders
47e0e71bc8
don't sort multipack sampler ( #2657 )
...
* don't sort multipack sampler
* increased packing efficiency increases loss
---------
Co-authored-by: Wing Lian <wing@axolotl.ai >
2025-05-09 20:28:58 -04:00
Wing Lian
0f3587174d
swap tinymodels that have safetensors for some ci tests ( #2641 )
2025-05-07 15:06:07 -04:00
xzuyn
25e6c5f9bd
Add CAME Optimizer ( #2385 )
2025-05-07 10:31:46 -04:00
NanoCode012
32f51bca35
fix(doc): clarify instruction to delinearize llama4 similar to cli doc ( #2644 ) [skip ci]
2025-05-07 10:29:47 -04:00
NanoCode012
9daa04da90
Fix: improve error message on failed dataset load ( #2637 ) [skip ci]
...
* fix(log): clarify error on dataset loading failed
* fix: add path for easy tracking of broken config
* fix: improve error message based on pr feedback
2025-05-07 10:29:05 -04:00
Wing Lian
0d71b0aa5f
Configurable embeddings upcast ( #2621 )
...
* fsdp embeddings should be float32 per comment
* patch peft to not upcast everything
* add tabs back to code check
* fix import
* add configurable option and fix check
* add check for dtypes
* move embeddings test to patch dir
* fix test
* fix comment and logic
2025-05-06 23:40:44 -04:00
Eric Meier
63aaccf85b
Fix cut_cross_entropy plugin install ( #2642 ) [skip ci]
2025-05-06 22:56:00 -04:00
Wing Lian
ff0fe767c8
xformers attention with packing ( #2619 )
...
* xformers attention with packing
* wire up the patch
* fix xformers + packing validation
* fix warning
* reorder the packing check
* fix fp16 / bf16 reset when using fp16 with bf16 auto
* fix seq lens calc to drop hanging sequences
* handle xformers patch for inference too
* fix batch size setter
* fix xformers inference
* add colab callback to fix inference post train
* PR feedback
2025-05-06 22:49:22 -04:00
Wing Lian
8e4158cc0b
Multipack parallel bin packing ( #2631 )
...
* improve readability of multipack sampler
* parallel bin packing
* fix error with lambda and pickling
* make sure things are in float instead of np.float
* annotations and comments update
* support for configurable group and bin size for sample packing
* fix missing map back to original indices
2025-05-06 20:08:08 -04:00
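The bin packing this PR parallelizes can be illustrated with a minimal serial first-fit packer; the real multipack sampler is parallelized and configurable (group size, bin size), so everything below is an assumption-laden sketch, not the actual implementation.

```python
# Minimal first-fit bin packing: place each sequence (by token length) into
# the first bin with enough free capacity, opening a new bin when none fits.
def pack_bins(lengths, bin_capacity):
    bins = []       # each bin is a list of sequence indices
    remaining = []  # free token capacity per bin
    for idx, length in enumerate(lengths):
        for b, free in enumerate(remaining):
            if length <= free:
                bins[b].append(idx)
                remaining[b] -= length
                break
        else:
            bins.append([idx])
            remaining.append(bin_capacity - length)
    return bins


pack_bins([4, 3, 2, 1], bin_capacity=5)  # -> [[0, 3], [1, 2]]
```

Splitting the input into independent groups, as the commit's configurable group size suggests, is what makes packing parallelizable: each group can be packed by a separate worker, at some cost in packing efficiency.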
Wing Lian
cd84325253
allow plugins to return their own dataset ( #2617 ) [skip ci]
...
* allow plugins to return their own dataset
* add post_trainer_create and wire up
* add hook check
* address PR feedback
* remove annotation causing circular import
2025-05-06 20:05:51 -04:00