* Revert "fix deprecate deepspeed stage3_gather_16bit_weights_on_model_save arg…"
This reverts commit e207762928.
* don't revert the values that don't use 'auto'
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
* kd fixes
* fix collator setup
* fix input args
* better handling to drop string fields for kd with raw dataset
* kd trainer has kd temp as part of the init
* drop top_k before softmax
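The top-k filtering noted above can be sketched as follows; this is a minimal pure-Python illustration (not the repository's actual implementation), assuming the teacher logits arrive as a flat list:

```python
import math

def topk_softmax(logits, k):
    """Mask all but the k largest logits to -inf, then softmax.

    Dropping entries *before* the softmax lets the surviving top-k
    probabilities renormalize among themselves.
    """
    kept = set(sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:k])
    masked = [x if i in kept else float("-inf") for i, x in enumerate(logits)]
    m = max(masked)  # finite for any k >= 1
    exps = [math.exp(x - m) for x in masked]  # math.exp(-inf) == 0.0
    z = sum(exps)
    return [e / z for e in exps]
```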
* simplify and remove zscore
* WIP chunked KD loss with autograd wrapper
* more fixes and liger-type chunked loss
* collator cls for plugins
* remove debugging
* additional plugin collator kwargs, don't scale up kd loss by t^2
* don't need temp arg to distill method
* online kd wip
* add close to comment block
* support sampling params/max new tokens
* handle when no custom collator is used in plugins
* logsumexp trick
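The logsumexp trick referenced here is the standard max-shift for numerical stability when reducing over logits; a minimal sketch (illustrative only, not the repo's code):

```python
import math

def logsumexp(xs):
    # Shift by the max before exponentiating so exp() never overflows:
    # logsumexp(x) = m + log(sum(exp(x_i - m))), where m = max(x).
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))
```

A naive `log(sum(exp(x)))` would overflow for logits near 1000; the shifted form returns the exact same value without ever exponentiating a large number.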
* fix check
* shift off the first empty token
* fix length of padding
* use max not min
* temp scale kd loss at end
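Temperature-scaled KD loss, as touched on in the commits above, typically means softening both distributions with a temperature before comparing them. A hedged sketch using forward KL(teacher || student); the function names and default temperature are illustrative, not the repository's API:

```python
import math

def softmax_t(logits, t):
    # Soften the distribution with temperature t before normalizing.
    m = max(logits)
    exps = [math.exp((x - m) / t) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kd_kl_loss(student_logits, teacher_logits, temperature=2.0):
    # Forward KL(teacher || student) over temperature-softened
    # probabilities; whether to additionally rescale by T^2 is a
    # separate design choice the commits above weigh in on.
    p = softmax_t(teacher_logits, temperature)
    q = softmax_t(student_logits, temperature)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
```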
* support for dynamic plugin training args mixins and symmetric kl
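The symmetric KL mentioned here is usually the average of the two KL directions; a minimal sketch over probability vectors (illustrative, assuming both inputs are already normalized distributions):

```python
import math

def kl(p, q):
    # Forward KL divergence; skip zero-probability terms (0 * log 0 = 0).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def symmetric_kl(p, q):
    # Average the two directions so the divergence no longer depends
    # on which distribution plays the role of "reference".
    return 0.5 * (kl(p, q) + kl(q, p))
```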
* chore: lint
* fix trainer callback base class
* Fix decay
* accept compressed responses for smaller wire payload
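Compressing responses for a smaller wire payload could look like the following; a minimal stdlib sketch with hypothetical helper names, not the actual serialization used here:

```python
import gzip
import json

def encode_response(payload: dict) -> bytes:
    # gzip the serialized teacher response to shrink the wire payload.
    return gzip.compress(json.dumps(payload).encode("utf-8"))

def decode_response(data: bytes) -> dict:
    # Inverse of encode_response: gunzip, then parse the JSON back.
    return json.loads(gzip.decompress(data).decode("utf-8"))
```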
* post-rebase lint
* more KD updates
* increase hyperparams_count for gradients for added normalize_topk
* fix to remove attention_mask
* rename vars for consistency
* fix rebase issues
* default to dropping last batch in multipack batch sampler
* improve handling of train len
* init collator_cls_and_kwargs
* explicit drop_last=False when checking for multipack completeness
* use separate v2 loader for kd
* fix kd tests to use subprocess so it picks up kd training args
* default value for kd_beta arg
* use updated dataset for ci
* longer timeout for e2e
* wip for dbrx finetuning
* add fastcore for parallel loading of sharded weights
* fix dtype for load, use PartialState instead of accelerator to init process group, remove redundant wandb callback
* update to use v2 of the converted model
* more fixes for dbrx loras
* make sure to enable fsdp activation checkpointing
* fix support for 8bit loras too for dbrx
* apply z3 leaf moe fix for DBRX with deepspeed
* don't raise value error since child module searches could fail and be ok
* revert a previous change to fix fsdp
* update mistral/mistral qlora+fsdp yamls
* fix qlora+fsdp quant storage type
* more edge cases for qlora-fsdp
* fixes for fsdp+qlora w optimizer in 8bit
* add bigstral z3 config and make sure to use full_state_dict for fsdp