* fixes for delinearization, and make qlora work with fsdp2
* Add back mistakenly removed lm_eval
* typo [skip ci]
* patch evals for torch.compile + fsdp2
* also check torch_compile w fsdp2
* lots of fixes for flex attn with llama4
* fix patch check and patch llama4 too
* attempt to make the patches stick
* use transformers 4.51.2
* update configs and README for llama4
* remove torch.compile for CI test
* cleanup any existing singletons
* set singleton cache to None instead of deleting
* use importlib reload with monkeypatch
* don't worry about transformers version, mark inputs with grads, fix regex
* make sure embeds aren't on cpu
* logging and mem improvements
* vllm version and add to docker, make sure to save processor on conversion
* fix ambiguous tensor bool check
* fix vllm to not use v1, upgrade hf transformers
* fix tests
* make flex_attn_compile_kwargs configurable, since this depends on model params
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: Salman Mohammadi <salman.mohammadi@outlook.com>
* [ci] make e2e tests a bit faster by reducing test split size
* use 10% split of alpaca dataset to speed up dataset loading/tokenization
* reduce gas 4->2 for most e2e tests
* increase val set size for packing
* llama4 support
* add xet support [skip ci]
* be flexible on transformers version and skip test on version
* don't use deepspeed for the fix_untrained_tokens test
* reordering to trigger torch 2.6.0 tests first
* slightly smaller train set
* use 4.51.0 for now
* remove stray print, add llama4 chat template to schema, bump peft to 0.15.1
* patches to make llama4 performant
* add preliminary fp8 support
* fsdp2 support
* use accelerate release 1.6.0
* allow 8bit optims with fsdp2
* liger + torch compile fix
* add fsdp2 e2e tests
* use transformers commit with fsdp2 support
* skip zero3 tests for this PR for now
* fix fsdp2 config for ci
* make sure both flex and flash attn work with fsdp2, skip fix untrained tokens
* okay, actually use fdsp2...
* more fixes to flex for fsdp2
* make sure to patch all the loaded models
* additional validation for fsdp2, bump dep versions
* make gemma3 work with packing
* multi-gpu e2e for ci
* update gemma3 model namespace to use mirror
* add gradient checkpointing to multigpu e2e ci
* update gemma3 examples for use_reentrant and fix ddp find unused params
* fix tests for gemma3
* fix import for test utils
* set correct train loss for gemma3 e2e
* add grpo scale_rewards config for trl#3135
* options to connect to vllm server directly w grpo trl#3094
* temperature support trl#3029
* sampling/generation kwargs for grpo trl#2989
* make vllm_enable_prefix_caching a config param trl#2900
* grpo multi-step optimizeations trl#2899
* remove overrides for grpo trainer
* bump trl to 0.16.0
* add cli to start vllm-serve via trl
* call the python module directly
* update to use vllm with 2.6.0 too now and call trl vllm serve from module
* vllm 0.8.1
* use python3
* use sys.executable
* remove context and wait for start
* fixes to make it actually work
* fixes so the grpo tests pass with new vllm paradigm
* explicit host/port and check in start vllm
* make sure that vllm doesn't hang by setting quiet so outouts go to dev null
* also bump bnb to latest release
* add option for wait from cli and nccl debugging for ci
* grpo + vllm test on separate devices for now
* make sure grpo + vllm tests runs single worker since pynccl comms would conflict
* fix cli
* remove wait and add caching for argilla dataset
* refactoring configs
* chore: lint
* add vllm config
* fixup vllm grpo args
* fix one more incorrect schema/config path
* fix another vlllm reference and increase timeout
* make the tests run a bit faster
* change mbsz back so it is correct for grpo
* another change mbsz back so it is correct for grpo
* fixing cli args
* nits
* adding docs
* docs
* include tensor parallel size for vllm in pydantic schema
* moving start_vllm, more docs
* limit output len for grpo vllm
* vllm enable_prefix_caching isn't a bool cli arg
* fix env ordering in tests and also use pid check when looking for vllm
---------
Co-authored-by: Salman Mohammadi <salman.mohammadi@outlook.com>
* fix: update chat_template
* fix: handle gemma3 showing a lot of no content for turn 0
* fix: remove unknown config from examples
* fix: test
* fix: temporary disable gemma2 test
* fix: stop overwriting config.text_config unnecessarily
* fix: handling of set cache to the text_config section
* feat: add liger gemma support and bump liger to 0.5.5
* fix: add double use_cache setting
* fix: add support for final_logit_softcap in CCE for gemma2/3
* fix: set use_cache before model load
* feat: add missing layernorm override
* fix: handle gemma3 rmsnorm
* fix: use wrapper to pass dim as hidden_size
* fix: change dim to positional
* fix: patch with wrong mlp
* chore: refactor use_cache handling
* fix import issues
* fix tests.e2e.utils import
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
* pass additional info for fix untrained tokens when using distributed + offloading
* use latest version of vendored lib
* use v0.0.5 of contribs lgpl
* fix for no bad tokens and add tests
* use release
* add multigpu test too
* make sure the multigpu zero3 test actually uses zero3
* current
not clean working version
move torch trainer to do_cli
update code with config changes and clean up
edit config
cleanup
add run name to trainer
* address comments
* use axolotl train in multigpu tests and add ray tests for multi-gpu
* accelerate uses underscores for main_process_port arg
* chore: lint
* fix order of accelerate args
* include ray train in docker images
* current
not clean working version
move torch trainer to do_cli
update code with config changes and clean up
edit config
cleanup
add run name to trainer
* address comments
* use axolotl train in multigpu tests and add ray tests for multi-gpu
* accelerate uses underscores for main_process_port arg
* chore: lint
* fix order of accelerate args
* include ray train in docker images
* fix bf16 resolution behavior
* move dtype logic
* x
Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
* rename
Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
* add to sidebar
Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
* Apply suggestions from code review
Co-authored-by: Eric Tang <46737979+erictang000@users.noreply.github.com>
* Update docs/ray-integration.qmd
Co-authored-by: Eric Tang <46737979+erictang000@users.noreply.github.com>
* pre-commit fixes
Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
* use output_dir instead of hardcoded saves path
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
* bugfix storage dir
* change type\ for resources_per_worker
---------
Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: SumanthRH <sumanthrh@anyscale.com>
Co-authored-by: Sumanth R Hegde <39546518+SumanthRH@users.noreply.github.com>
Co-authored-by: Wing Lian <wing.lian@gmail.com>
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
* need to update deepspeed version in extras too
* fix patch import
* fix monkeypatch reloading in tests and deepspeed patch
* remove duplicated functionality fixture
* reset LlamaForCausalLM too in fixtures for cce patch
* reset llama attn too
* disable xformers patch for cce
* skip problematic test on low usage functionality
* add more test cases for gradient accumulation and fix zero3
* swap out for smaller model
* fix missing return
* fix missing pad_token in config
* support concurrency for multigpu testing
* cast empty deepspeed to empty string for zero3 check
* fix temp_dir as fixture so parametrize works properly
* fix test file for multigpu evals
* don't use default
* don't use default for fsdp_state_dict_type
* don't use llama tokenizer w smollm
* also automatically cancel multigpu for concurrency
* remove skipped test
* use mean_resizing_embeddings with qlora and added tokens
* use </s> as pad_token to prevent resize of embeddings
* make sure local hub test saves to a tmp dir
* use Path so concatenation works
* make sure to use tmp_ds_path for data files
* add ds zero3 to multigpu biweekly tests
* fix for upstream api change
* use updated accelerate and fix deepspeed tests
* stringify the Path, and run multigpu tests if the multigpu tests change for a PR
* use correct json rather than yaml
* revert accelerate for deepspeed
* Attempt to run multigpu in PR CI for now to ensure it works
* fix yaml file
* forgot to include multigpu tests
* fix call to cicd.multigpu
* dump dictdefault to dict for yaml conversion
* use to_dict instead of casting
* 16bit-lora w flash attention, 8bit lora seems problematic
* add llama fsdp test
* more tests
* Add test for qlora + fsdp with prequant
* limit accelerate to 2 processes and disable broken qlora+fsdp+bnb test
* move multigpu tests to biweekly