* feat: upgrade transformers to v4.56
* fix handling of CP/SP now that position_ids are default even for unpacked sequences
* feat: monkeypatch list_repo_templates
* fix: apply patch for tests only
* see if updated main works at least
* fix: update to patch release and remove monkeypatch
* remove fsdp2 eval patch
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
* improve fsdp shard merging
* improve logging
* update information on merging and inferencing GPT-OSS
* cleanup readme
* automate cleanup of FSDP prefix
* import GRPO only if necessary
* only modify config.json on rank0
* merge final checkpoint at end of training
* prevent circular import
* Fix saving for sharded state dict
* devx: move merged model to the output dir
* move import back to top
* Fix stuck merge
* fix conditionals from pr feedback and add test
* fix to not use batch feature indexing
* more vlm fixes
* use AutoModelForImageTextToText
* add example yaml; num2words is needed for the chat template
* improve handling of adding image tokens to conversation
* add lfm2-vl support
* update the lfm readme
* fix markdown and add rtol for loss checks
* feat: add smolvlm2 processing strat
* fix: check for causal-conv1d in lfm models
* feat: add docs for lfm2
* feat: add new models and tips to docs
* feat: add smolvlm2 docs and remove extra dep
* chore: update docs
* feat: add video instructions
* chore: cleanup
* chore: comments
* fix: typo
* feat: add usage stats
* chore: refactor
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai>
* use exec instead of subprocess to make ctrl+c nicer for the cli (sketch below)
* change var name to use_exec
* simplify to bool
* flush std*
* patch subprocess as mock in test
* fix tests
* more test fixes
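A minimal sketch of the exec-over-subprocess idea, assuming a hypothetical `launch` helper: replacing the wrapper process means Ctrl+C (SIGINT) reaches the training process directly instead of a subprocess parent. This is illustrative, not the actual CLI code.

```python
import os
import sys
import subprocess


def launch(cmd: list[str], use_exec: bool = True) -> None:
    """Illustrative launcher: exec replaces the current process image,
    so Ctrl+C hits the training process directly."""
    if use_exec:
        # flush stdio before the process image is replaced, otherwise
        # anything still buffered would be lost
        sys.stdout.flush()
        sys.stderr.flush()
        os.execvp(cmd[0], cmd)  # does not return on success
    else:
        subprocess.run(cmd, check=True)


# e.g. launch(["accelerate", "launch", "train.py"])
```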
* use nanmean for loss aggregation (CP fix; sketch below)
* use regular asserts
* small changes to keep tests isolated
* combining evaluation_loop patches
* fix
* delete unused
* fix check
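A small illustration of why `nanmean` matters for the CP fix above: under context parallelism a shard can end up with no labelled tokens and report a NaN loss, which a plain mean would propagate. The tensor values here are made up.

```python
import torch

# per-shard losses gathered during evaluation; a shard with no labelled
# tokens can legitimately produce NaN under context parallelism
losses = torch.tensor([0.91, float("nan"), 1.07])

plain = losses.mean()           # nan -> one empty shard poisons the aggregate
robust = torch.nanmean(losses)  # 0.99 -> NaN entries are ignored
```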
* fix for parallelism config from trainer
* fix handling of parallelism_config w accelerate
* add todo for removal
* update to latest axolotl-contribs-mit for optimizer fix too
* synchronize training after checkpoint save
* dir spelling
* use latest accelerate main
* fix to not use partial state parallelism_config
* more fixes
* use most recent accelerate fix
* fix cpu_ram_efficient_loading to use meta devices outside rank 0 to prevent CPU RAM OOM
* improve handling of broadcasting fsdp2 state dict
* support for openai chat template with thinking key as the reasoning trace
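A minimal example of the message shape this enables, assuming the reasoning trace sits under a `thinking` key on the assistant turn of an OpenAI-format conversation; the surrounding field layout is illustrative.

```python
# one training sample in OpenAI messages format, with the reasoning trace
# stored under "thinking" alongside the final answer
sample = {
    "messages": [
        {"role": "user", "content": "What is 17 * 24?"},
        {
            "role": "assistant",
            "thinking": "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
            "content": "408",
        },
    ]
}
```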
* address PR feedback
* refactor to remove dependency on PartialState for parallelism config
* bump accelerate, gptoss fixes
* limit meta fixes to fsdp2 for now
* fixes for gpt oss
* fixup examples, don't use cpu-ram-efficient-loading for now
* remove problematic barrier
* patch parallelism config
* reorder comparison
* device mesh fixes
* make pure CP work
* lint
* Add support for Dion optimizer
* dion training kwargs
* fix var names
* no dion 8bit for now
* use updated axolotl-contribs-mit for dion optimizer
* add smoke test for dion optimizer
* add docs
* fix typo during edits
* fix test to not remove load in 8bit
* jagged lr restart scheduler (sketch below)
* var name fix
* make sure to create scheduler first
* wire things together
* more fixes
* fix for nesting scheduler and first anneal phase
* no need for relora trainer anymore since we've generalized the relora scheduler
* remove redundant relora scheduler and lint
* update relora e2e test for updated params
* need restart steps for relora test
* update quarto docs for dropped relora trainer
* update example yaml
* drop verbose arg
* min lr scale support for jagged lr
* don't let min_lr be nonetype
* cleanup args
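A minimal, generic sketch of the jagged restart schedule referenced above, built on `torch.optim.lr_scheduler.LambdaLR`: the LR anneals linearly down to a `min_lr_scale` floor within each restart interval, then snaps back to the peak (e.g. at a ReLoRA reset). This is illustrative only; the actual scheduler and option names in axolotl may differ.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR


def jagged_restart_lambda(restart_steps: int, min_lr_scale: float = 0.1):
    """Return a LambdaLR multiplier: linear decay from 1.0 to min_lr_scale
    within each restart interval, jumping back to 1.0 after every restart."""
    def lr_lambda(step: int) -> float:
        pos = (step % restart_steps) / max(restart_steps - 1, 1)
        return 1.0 - (1.0 - min_lr_scale) * pos
    return lr_lambda


model = torch.nn.Linear(8, 8)
opt = torch.optim.AdamW(model.parameters(), lr=2e-4)
sched = LambdaLR(opt, lr_lambda=jagged_restart_lambda(restart_steps=100, min_lr_scale=0.1))
# call sched.step() after each optimizer step; LR sweeps 2e-4 -> 2e-5, then restarts
```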
* make pad_to_sequence_len default to the same value as sample_packing (sketch below)
* remove duplicate validation
* fix test
* update description meta
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai>
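The default wiring noted above, sketched as plain Python; `pad_to_sequence_len` and `sample_packing` are the real config option names, but the surrounding normalization helper is illustrative.

```python
def normalize_padding_defaults(cfg: dict) -> dict:
    # when unset, pad_to_sequence_len follows sample_packing: packed batches
    # are padded to a fixed length, unpacked ones are not
    if cfg.get("pad_to_sequence_len") is None:
        cfg["pad_to_sequence_len"] = bool(cfg.get("sample_packing", False))
    return cfg


assert normalize_padding_defaults({"sample_packing": True})["pad_to_sequence_len"] is True
```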
* limit num_proc when saving datasets to disk
* enforce at least 1 worker in case the division rounds down to 0, and use a sane divisor of at least 8 rows per save worker (sketch below)
* update fixtures with dataset processes since that should never be NoneType
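A sketch of the worker clamp described above; the helper name is hypothetical, but the arithmetic matches the rule of at most one saver per ~8 rows and never fewer than one.

```python
def save_num_proc(num_rows: int, configured_num_proc: int, min_rows_per_proc: int = 8) -> int:
    """Clamp the number of dataset save workers: at most one worker per
    ~8 rows, and always at least 1 even when the division rounds down to 0."""
    return max(1, min(configured_num_proc, num_rows // min_rows_per_proc))


assert save_num_proc(num_rows=5, configured_num_proc=16) == 1
assert save_num_proc(num_rows=1000, configured_num_proc=16) == 16
```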
* improve reusability for tests
* checkpoint model on first step callback (sketch below)
* remove debug
* add test cases; update existing tests not to save on first step
* move test out of solo
* delete
* default to False
* typo
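A minimal sketch of a first-step checkpoint callback using the `transformers` callback API; the class name and `enabled` flag are hypothetical, and the feature is off by default.

```python
from transformers import TrainerCallback


class SaveFirstStepCallback(TrainerCallback):
    """Ask the Trainer to write a checkpoint right after the first optimizer
    step, surfacing save/merge problems early instead of hours into a run."""

    def __init__(self, enabled: bool = False):  # off by default
        self.enabled = enabled

    def on_step_end(self, args, state, control, **kwargs):
        if self.enabled and state.global_step == 1:
            control.should_save = True
        return control
```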
* support for deepspeed autotp
* bump to latest deepspeed that supports deepcompile too
* add deepcompile support too
* fix total steps calculation for TP
* setup fixture for tp
* update ds config to ensure weights are gathered for checkpoint
* fix duplicate validation names
* chore: lint
* upgrade peft to 0.16.0
* upgrade datasets to 4.0.0
* refactor dupes from merge/rebase
* fix check for fsdp1 + sharded_state_dict
* use full state dict for ci
* upgrade trl==0.19.1
* add vllm for tests for grpo
* fixes to work with latest trl
* need data_parallel_size config too
* support for vllm_mode for server / colocate (sketch below)
* vllm settings for colocate
* relax vllm version
* bump min hf hub for latest vllm support
* add hints on string literal for vllm mode
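A hedged sketch of the two vLLM generation modes via TRL's `GRPOConfig`: "server" offloads generation to a standalone vLLM server process, while "colocate" runs vLLM on the same GPUs as training. Parameter names come from trl and may differ across versions; the values shown are illustrative.

```python
from trl import GRPOConfig

# colocate: vLLM shares the training GPUs (no separate server process)
colocate_args = GRPOConfig(
    output_dir="outputs/grpo",
    use_vllm=True,
    vllm_mode="colocate",
    vllm_gpu_memory_utilization=0.3,  # leave headroom for the training step
)

# server: generation runs in a standalone vLLM server process
server_args = GRPOConfig(
    output_dir="outputs/grpo",
    use_vllm=True,
    vllm_mode="server",
)
```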
* use latest transformers 4.53.2
* tweak acceptable loss on flaky test_ds_zero3_packed test
* don't run flaky vllm/grpo tests for now