* checkpoint model on first step callback
* remove debug
* add test cases; update existing tests not to save on first step
* move test out of solo
* delete
* default to False
* typo
* upgrade peft to 0.16.0
* upgrade datasets to 4.0.0
* refactor dupes from merge/rebase
* fix check for fsdp1 + sharded_state_dict
* use full state dict for ci
* upgrade trl==0.19.1
* add vllm for tests for grpo
* fixes to work with latest trl
* need data_parallel_size config too
* support for vllm_mode for server / colocate
* vllm settings for colocate
* relax vllm version
* bump min hf hub for latest vllm support
* add hints on string literal for vllm mode
* use latest transformers 4.53.2
* tweak acceptable loss on flaky test_ds_zero3_packed test
* don't run flaky vllm/grpo tests for now
* FSDP2 args migration implementation
This commit implements the migration to FSDP2 arguments including:
- FSDP2 support with LoRA training
- DPO integration with FSDP2
- Model loading fixes and refactoring
- CPU offloading and PEFT handling
- Test updates and CI improvements
- Bug fixes for dtype errors and various edge cases
* update transformers to 4.53.0
* remove attention_mask from signature columns if using packing
* remove attention_mask column from dataloader
* update signature of flash attn forward for ring attn patch
* fix FSDP
* patch ring-flash-attn with upstream signature fix
* fix patch indentation level
* fix the patch
* add batch flattening smoke test with loss check that works in older transformers
* fix patch
* don't drop attention mask for flex
* more fixes
* patch create_causal_mask for packing w flex
* global torch manual_seed fixture
* tweak loss checks
* fix patch and use single batch for flex
* don't need to reload
* fix causal mask patch
* use transformers patch release
* make sure env var is string
* make sure to drop attention mask for flex w packing for latest transformers patch release
* tweak loss
* guard on signature columns before removing attention mask
* bump loss
* set remove isn't chainable
* skip slow mistral test in 2.5.1
* bump hf deps
* upgrade liger-kernel too
* install cce from fork for transformers fix
* fix reference to vocab size in gemma3 patch
* use padding_idx instead of pad_token_id
* remove fixed gemma3 patch
* use updated cce fork
* fix local mllama cce patches w docstring
* add test for multipack with trainer setup and fix trainer for trainer refactor upstream
* bump modal version
* guard for iterable datasets
* mllama model arch layout changed in latest transformers
* fix batch sampler with drop_last
* fix: address upstream vlm changes for lora
* fix: update references to old lora target path
* fix: remove mllama fa2 patch due to upstream fix
* fix: lora kernel patch path for multimodal models
* fix: removed mllama from quarto
* run test for came optim on 2.6.0+
* fix fsdp2 patch and remove deprecated patch
* make sure to set sequence_parallel_degree for grpo
* Add SP test for GRPO
* add sp to grpo config for trainer
* use reward_funcs as kwarg to grpo trainer
* fix the comprehension for reward funcs
* reward funcs already passed in as args
* init sp_group right before training
* fix check for adding models to SP context
* make sure to pass args to super
* upgrade deepspeed
* use updated trl and add reasoning flags for vllm
* patch the worker
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai>
* don't set peft_config on grpo to prevent double peft wrap
* remove overrides needed to support bug
* fix grpo tests
* require more CPU for multigpu to help with torch compile for vllm
* update doc and skip brittle grpo test
* fix the path to run the multigpu tests
* increase timeout, use LOC instead of NVL
* typo
* use hf cache from s3 backed cloudfront
* mark grpo as flaky test due to vllm start
* update trl to 0.17.0
* grpo + vllm no longer supported with 2.5.1 due to vllm constraints
* disable VLLM_USE_V1 for ci
* improve handling of killing off the multiprocessing vllm service
* debug why this doesn't run in CI
* increase vllm wait time
* increase timeout to 5min
* upgrade to vllm 0.8.4
* dump out the vllm log for debugging
* use debug logging
* increase vllm start timeout
* use NVL instead
* disable torch compile cache
* revert some commented checks now that grpo tests are fixed
* increase vllm timeout back to 5min
* batch api HF adapter for ring-flash-attn; cleanup and improvements
* update
* adding all batch ring-flash-attn methods via single adapter
* removing pad_to_sequence_len=False for now
* fix
* updating docs to include batch SP
* review comments
* fixes for batch API funcs, simplify
* fixes
* fix
* updates
* add batch_zigzag smoke test
* fixes for delinearization, and make qlora work with fsdp2
* Add back mistakenly removed lm_eval
* typo [skip ci]
* patch evals for torch.compile + fsdp2
* also check torch_compile w fsdp2
* lots of fixes for flex attn with llama4
* fix patch check and patch llama4 too
* attempt to make the patches stick
* use transformers 4.51.2
* update configs and README for llama4
* remove torch.compile for CI test
* cleanup any existing singletons
* set singleton cache to None instead of deleting
* use importlib reload with monkeypatch
* don't worry about transformers version, mark inputs with grads, fix regex
* make sure embeds aren't on cpu
* logging and mem improvements
* vllm version and add to docker, make sure to save processor on conversion
* fix ambiguous tensor bool check
* fix vllm to not use v1, upgrade hf transformers
* fix tests
* make flex_attn_compile_kwargs configurable, since this depends on model params
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: Salman Mohammadi <salman.mohammadi@outlook.com>
* [ci] make e2e tests a bit faster by reducing test split size
* use 10% split of alpaca dataset to speed up dataset loading/tokenization
* reduce gas 4->2 for most e2e tests
* increase val set size for packing
* llama4 support
* add xet support [skip ci]
* be flexible on transformers version and skip test on version
* don't use deepspeed for the fix_untrained_tokens test
* reordering to trigger torch 2.6.0 tests first
* slightly smaller train set
* use 4.51.0 for now
* remove stray print, add llama4 chat template to schema, bump peft to 0.15.1
* patches to make llama4 performant
* add preliminary fp8 support
* fsdp2 support
* use accelerate release 1.6.0
* allow 8bit optims with fsdp2
* liger + torch compile fix
* add fsdp2 e2e tests
* use transformers commit with fsdp2 support
* skip zero3 tests for this PR for now
* fix fsdp2 config for ci
* make sure both flex and flash attn work with fsdp2, skip fix untrained tokens
* okay, actually use fsdp2...
* more fixes to flex for fsdp2
* make sure to patch all the loaded models
* additional validation for fsdp2, bump dep versions
* make gemma3 work with packing
* multi-gpu e2e for ci
* update gemma3 model namespace to use mirror
* add gradient checkpointing to multigpu e2e ci
* update gemma3 examples for use_reentrant and fix ddp find unused params
* fix tests for gemma3
* fix import for test utils
* set correct train loss for gemma3 e2e
* add grpo scale_rewards config for trl#3135
* options to connect to vllm server directly w grpo trl#3094
* temperature support trl#3029
* sampling/generation kwargs for grpo trl#2989
* make vllm_enable_prefix_caching a config param trl#2900
* grpo multi-step optimizations trl#2899
* remove overrides for grpo trainer
* bump trl to 0.16.0
* add cli to start vllm-serve via trl
* call the python module directly
* update to use vllm with 2.6.0 too now and call trl vllm serve from module
* vllm 0.8.1
* use python3
* use sys.executable
* remove context and wait for start
* fixes to make it actually work
* fixes so the grpo tests pass with new vllm paradigm
* explicit host/port and check in start vllm
* make sure that vllm doesn't hang by setting quiet so outputs go to /dev/null
* also bump bnb to latest release
* add option for wait from cli and nccl debugging for ci
* grpo + vllm test on separate devices for now
* make sure grpo + vllm tests run on a single worker since pynccl comms would conflict
* fix cli
* remove wait and add caching for argilla dataset
* refactoring configs
* chore: lint
* add vllm config
* fixup vllm grpo args
* fix one more incorrect schema/config path
* fix another vllm reference and increase timeout
* make the tests run a bit faster
* change mbsz back so it is correct for grpo
* another change mbsz back so it is correct for grpo
* fixing cli args
* nits
* adding docs
* docs
* include tensor parallel size for vllm in pydantic schema
* moving start_vllm, more docs
* limit output len for grpo vllm
* vllm enable_prefix_caching isn't a bool cli arg
* fix env ordering in tests and also use pid check when looking for vllm
---------
Co-authored-by: Salman Mohammadi <salman.mohammadi@outlook.com>
* fix: update chat_template
* fix: handle gemma3 showing a lot of no content for turn 0
* fix: remove unknown config from examples
* fix: test
* fix: temporary disable gemma2 test
* fix: stop overwriting config.text_config unnecessarily
* fix: handling of set cache to the text_config section
* feat: add liger gemma support and bump liger to 0.5.5
* fix: add double use_cache setting
* fix: add support for final_logit_softcap in CCE for gemma2/3
* fix: set use_cache before model load
* feat: add missing layernorm override
* fix: handle gemma3 rmsnorm
* fix: use wrapper to pass dim as hidden_size
* fix: change dim to positional
* fix: patch with wrong mlp
* chore: refactor use_cache handling
* fix import issues
* fix tests.e2e.utils import
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
* pass additional info for fix untrained tokens when using distributed + offloading
* use latest version of vendored lib
* use v0.0.5 of contribs lgpl
* fix for no bad tokens and add tests
* use release
* add multigpu test too
* make sure the multigpu zero3 test actually uses zero3
* current
not clean working version
move torch trainer to do_cli
update code with config changes and clean up
edit config
cleanup
add run name to trainer
* address comments
* use axolotl train in multigpu tests and add ray tests for multi-gpu
* accelerate uses underscores for main_process_port arg
* chore: lint
* fix order of accelerate args
* include ray train in docker images
* fix bf16 resolution behavior
* move dtype logic
* x
Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
* rename
Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
* add to sidebar
Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
* Apply suggestions from code review
Co-authored-by: Eric Tang <46737979+erictang000@users.noreply.github.com>
* Update docs/ray-integration.qmd
Co-authored-by: Eric Tang <46737979+erictang000@users.noreply.github.com>
* pre-commit fixes
Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
* use output_dir instead of hardcoded saves path
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
* bugfix storage dir
* change type for resources_per_worker
---------
Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: SumanthRH <sumanthrh@anyscale.com>
Co-authored-by: Sumanth R Hegde <39546518+SumanthRH@users.noreply.github.com>
Co-authored-by: Wing Lian <wing.lian@gmail.com>
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
* need to update deepspeed version in extras too
* fix patch import
* fix monkeypatch reloading in tests and deepspeed patch
* remove duplicated functionality fixture
* reset LlamaForCausalLM too in fixtures for cce patch
* reset llama attn too
* disable xformers patch for cce
* skip problematic test on low usage functionality
* add more test cases for gradient accumulation and fix zero3
* swap out for smaller model
* fix missing return
* fix missing pad_token in config
* support concurrency for multigpu testing
* cast empty deepspeed to empty string for zero3 check
* fix temp_dir as fixture so parametrize works properly
* fix test file for multigpu evals
* don't use default
* don't use default for fsdp_state_dict_type
* don't use llama tokenizer w smollm
* also automatically cancel multigpu for concurrency
* remove skipped test
* use mean_resizing_embeddings with qlora and added tokens
* use </s> as pad_token to prevent resize of embeddings
* make sure local hub test saves to a tmp dir
* use Path so concatenation works
* make sure to use tmp_ds_path for data files