* improve fsdp shard merging
* improve logging
* update information on merging and running inference with GPT-OSS
* cleanup readme
* automate cleanup of FSDP prefix
* import GRPO only if necessary
* only modify config.json on rank0
* merge final checkpoint at end of training
* prevent circular import
* Fix saving for sharded state dict
* devx, move merged to output dir
* move import back to top
* Fix stuck merge
* fix conditionals from pr feedback and add test
* fix to not use batch feature indexing
* more vlm fixes
* use AutoModelForImageTextToText
* add example yaml and need num2words for chat template
* improve handling of adding image tokens to conversation
* add lfm2-vl support
* update the lfm readme
* fix markdown and add rtol for loss checks
* feat: add smolvlm2 processing strat
* fix: check for causal-conv1d in lfm models
* feat: add docs for lfm2
* feat: add new models and tips to docs
* feat: add smolvlm2 docs and remove extra dep
* chore: update docs
* feat: add video instructions
* chore: cleanup
* chore: comments
* fix: typo
* feat: add usage stats
* chore: refactor
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai>
* use exec instead of subprocess to make ctrl+c nicer for cli
* change var name to use_exec
* simplify to bool
* flush std*
* patch subprocess as mock in test
* fix tests
* more test fixes
* use nanmean for loss aggregation (CP fix)
* use regular asserts
* small changes to make tests isolated
* combining evaluation_loop patches
* fix
* delete unused
* fix check
* fix for parallelism config from trainer
* fix handling of parallelism_config w accelerate
* add todo for removal
* update to latest axolotl-contribs-mit for optimizer fix too
* synchronize training after checkpoint save
* dir spelling
* use latest accelerate main
* fix to not use partial state parallelism_config
* more fixes
* use most recent accelerate fix
* fix cpu_ram_efficient_loading to meta devices from rank 0 to prevent CPU RAM oom
* improve handling of broadcasting fsdp2 state dict
* support for openai chat template with thinking key as the reasoning trace
* address PR feedback
* refactor to remove dependency on PartialState for parallelism config
* bump accelerate, gptoss fixes
* limit meta fixes to fsdp2 for now
* fixes for gpt oss
* fixup examples, don't use cpu-ram-efficient-loading for now
* remove problematic barrier
* patch parallelism config
* reorder comparison
* device mesh fixes
* make pure CP work
* lint
* Add support for Dion optimizer
* dion training kwargs
* fix var names
* no dion 8bit for now
* use updated axolotl-contribs-mit for dion optimizer
* add smoke test for dion optimizer
* add docs
* fix typo during edits
* fix test to not remove load in 8bit
* jagged lr restart scheduler
* var name fix
* make sure to create scheduler first
* wire things together
* more fixes
* fix for nesting scheduler and first anneal phase
* no need for relora trainer anymore since we've generalized the relora scheduler
* remove redundant relora scheduler and lint
* update relora e2e test for updated params
* need restart steps for relora test
* update quarto docs for dropped relora trainer
* update example yaml
* drop verbose arg
* min lr scale support for jagged lr
* don't let min_lr be nonetype
* cleanup args
* make pad_to_sequence_len default to the same value as sample_packing
* remove duplicate validation
* fix test
* update description meta
Co-authored-by: NanoCode012 <nano@axolotl.ai>
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai>
* limit num_proc when saving datasets to disk
* enforce at least 1 worker in case it rounds down to 0; use a sane divisor of at least 8 rows per worker when saving
* update fixtures with dataset processes since that should never be NoneType
* improve reusability for tests
* checkpoint model on first step callback
* remove debug
* add test cases; update existing tests not to save on first step
* move test out of solo
* delete
* default to False
* typo
* support for deepspeed autotp
* bump to latest deepspeed that supports deepcompile too
* add deepcompile support too
* fix total steps calculation for TP
* setup fixture for tp
* update ds config to ensure weights are gathered for checkpoint
* fix duplicate validation names
* chore: lint
* upgrade peft to 0.16.0
* upgrade datasets to 4.0.0
* refactor dupes from merge/rebase
* fix check for fsdp1 + sharded_state_dict
* use full state dict for ci
* upgrade trl==0.19.1
* add vllm for tests for grpo
* fixes to work with latest trl
* need data_parallel_size config too
* support for vllm_mode for server / colocate
* vllm settings for colocate
* relax vllm version
* bump min hf hub for latest vllm support
* add hints on string literal for vllm mode
* use latest transformers 4.53.2
* tweak acceptable loss on flaky test_ds_zero3_packed test
* don't run flaky vllm/grpo tests for now
* FSDP2 args migration implementation
This commit implements the migration to FSDP2 arguments including:
- FSDP2 support with LoRA training
- DPO integration with FSDP2
- Model loading fixes and refactoring
- CPU offloading and PEFT handling
- Test updates and CI improvements
- Bug fixes for dtype errors and various edge cases
* fix: do not add training and training_detail block by default
* fix: magistral docs
* fix: address pad adding new fields and use built-in from_openai
* feat: try enable multiprocessing
* fix: check for keys before deleting attn_mask
* feat: add mistral pad test
* feat: add tool calling test
* feat: add devstral tokenizer tests
* fix: comma format
* chore: remove unused support_preprocessing as tokenizer is picklable now
* chore: update magistral doc
* feat: add devstral readme and example
* chore: refactor error handling
* update transformers to 4.53.0
* remove attention_mask from signature columns if using packing
* remove attention_mask column from dataloader
* update signature of flash attn forward for ring attn patch
* fix FSDP
* patch ring-flash-attn with upstream signature fix
* fix patch indentation level
* fix the patch
* add batch flattening smoke test with loss check that works in older transformers
* fix patch
* don't drop attention mask for flex
* more fixes
* patch create_causal_mask for packing w flex
* global torch manual_seed fixture
* tweak loss checks
* fix patch and use single batch for flex
* don't need to reload
* fix causal mask patch
* use transformers patch release
* make sure env var is string
* make sure to drop attention mask for flex w packing for latest transformers patch release
* tweak loss
* guard on signature columns before removing attention mask
* bump loss
* set remove isn't chainable
* skip slow mistral test in 2.5.1
* feat: update handling for mistraltokenizer decode
* fix: update mistral common package version
* fix: to use correct release
* fix triton path
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
* ignore generation/endgeneration tags
Axolotl calculates the mask for assistant turns on its own, so these tags are not needed. However, the analyzer currently does not recognize them at all and throws an error.
* feat: add phi4 tokenizer test and unblock gemma2
* fix: improve template
* chore: refactor
* chore: lint
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai>
Co-authored-by: Wing Lian <wing@axolotl.ai>
* kd fixes
* fix collator setup
* fix input args
* better handling to drop string fields for kd with raw dataset
* kd trainer has kd temp as part of the init
* drop top_k before softmax
* simplify and remove zscore
* WIP chunked KD loss with autograd wrapper
* more fixes and liger-type chunked loss
* collator cls for plugins
* remove debugging
* additional plugin collator kwargs, don't scale up kd loss by t^2
* don't need temp arg to distill method
* online kd wip
* add close to comment block
* support sampling params/max new tokens
* handle when no custom collator is used in plugins
* logsumexp trick
* fix check
* shift off the first empty token
* fix length of padding
* use max not min
* temp scale kd loss at end
* support for dynamic plugin training args mixins and symmetric kl
* chore: lint
* fix trainer callback base class
* Fix decay
* accept compressed responses for smaller wire payload
* post-rebase lint
* more KD updates
* increase hyperparams_count for gradients for added normalize_topk
* fix to remove attention_mask
* rename vars for consistency
* fix rebase issues
* default to dropping last batch in multipack batch sampler
* improve handling of train len
* init collator_cls_and_kwargs
* explicit drop_last=False when checking for multipack completeness
* use separate v2 loader for kd
* fix kd tests to use subprocess so it picks up kd training args
* default value for kd_beta arg
* use updated dataset for ci
* longer timeout for e2e
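
For reference, the "logsumexp trick" named in the KD changes above is the standard way to compute log-probabilities without overflow. The sketch below is a generic, pure-Python illustration of that technique only; it is not the code from these commits, and the helper names `log_sum_exp` and `log_softmax` are hypothetical.

```python
import math


def log_sum_exp(values):
    """Compute log(sum(exp(v))) stably by factoring out the maximum."""
    m = max(values)
    # After subtracting the max, every exponent is <= 0, so exp() cannot overflow.
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_softmax(values):
    """Log-probabilities of a softmax over `values` (hypothetical helper)."""
    lse = log_sum_exp(values)
    return [v - lse for v in values]


# Logits this large would overflow a naive exp(); the shifted form stays finite.
logits = [1000.0, 1001.0, 1002.0]
log_probs = log_softmax(logits)
```

Exponentiating `log_probs` gives probabilities that sum to 1, and shifting all logits by a constant leaves the result unchanged, which is why subtracting the maximum is safe.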