* ignore generation/endgeneration tags
Axolotl computes the assistant-turn mask on its own, so these tags are not needed; however, the analyzer currently does not recognize them at all and throws an error.
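The masking the line above refers to can be sketched as follows. This is a hypothetical illustration, not Axolotl's actual implementation: tokens outside assistant turns get the label `-100` so the loss ignores them, which is why `{% generation %}`/`{% endgeneration %}` tags in the template are redundant.

```python
# Hypothetical sketch of assistant-turn label masking. IGNORE_INDEX and
# mask_labels are illustrative names, not Axolotl internals.
IGNORE_INDEX = -100

def mask_labels(token_ids, assistant_spans):
    """Copy token_ids into labels, masking everything outside the
    (start, end) assistant spans with IGNORE_INDEX so the loss only
    covers assistant tokens."""
    labels = [IGNORE_INDEX] * len(token_ids)
    for start, end in assistant_spans:
        labels[start:end] = token_ids[start:end]
    return labels

tokens = [1, 5, 9, 7, 3, 2, 8, 4]
# Only positions 2..4 belong to an assistant turn.
labels = mask_labels(tokens, [(2, 5)])
```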
* feat: add phi4 tokenizer test and unblock gemma2
* fix: improve template
* chore: refactor
* chore: lint
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai>
Co-authored-by: Wing Lian <wing@axolotl.ai>
* kd fixes
* fix collator setup
* fix input args
* better handling to drop string fields for kd with raw dataset
* kd trainer has kd temp as part of the init
* drop top_k before softmax
* simplify and remove zscore
* WIP chunked KD loss with autograd wrapper
* more fixes and liger-type chunked loss
* collator cls for plugins
* remove debugging
* additional plugin collator kwargs, don't scale up kd loss by t^2
* don't need temp arg to distill method
* online kd wip
* add close to comment block
* support sampling params/max new tokens
* handle when no custom collator is used in plugins
* logsumexp trick
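The logsumexp trick mentioned above is the standard way to get numerically stable log-probabilities out of logits: subtracting the maximum before exponentiating keeps `exp()` from overflowing. A minimal sketch (a generic illustration of the trick, not the KD code in this PR):

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax via the logsumexp trick:
    shift by the max so every exp() argument is <= 0."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

# Naive exp(1002.0) would overflow a float64; the shifted form is fine.
probs = [math.exp(v) for v in log_softmax([1000.0, 1001.0, 1002.0])]
```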
* fix check
* shift off the first empty token
* fix length of padding
* use max not min
* temp scale kd loss at end
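The temperature handling in the two commits above ("don't scale up kd loss by t^2" vs. "temp scale kd loss at end") reflects a known design choice in distillation: softening both distributions by T shrinks gradients by roughly T^2, and whether to compensate is up to the implementation. A minimal hypothetical sketch of a temperature-scaled forward-KL distillation loss, with the scaling applied at the end (function name and exact formulation are assumptions, not the PR's code):

```python
import math

def kd_kl_loss(student_logits, teacher_logits, temperature=2.0):
    """Forward KL(teacher || student) on temperature-softened
    distributions, with the optional T**2 compensation applied last.
    Illustrative only; not Axolotl's actual KD loss."""
    def softmax(xs):
        m = max(xs)  # logsumexp-style shift for stability
        exps = [math.exp(x - m) for x in xs]
        z = sum(exps)
        return [e / z for e in exps]

    t = softmax([x / temperature for x in teacher_logits])
    s = softmax([x / temperature for x in student_logits])
    kl = sum(p * (math.log(p) - math.log(q)) for p, q in zip(t, s))
    return kl * temperature ** 2
```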
* support for dynamic plugin training args mixins and symmetric kl
* chore: lint
* fix trainer callback base class
* Fix decay
* accept compressed responses for smaller wire payload
* post-rebase lint
* more KD updates
* increase hyperparams_count for gradients for added normalize_topk
* fix to remove attention_mask
* rename vars for consistency
* fix rebase issues
* default to dropping last batch in multipack batch sampler
* improve handling of train len
* init collator_cls_and_kwargs
* explicit drop_last=False when checking for multipack completeness
* use separate v2 loader for kd
* fix kd tests to use subprocess so it picks up kd training args
* default value for kd_beta arg
* use updated dataset for ci
* longer timeout for e2e
* fix: do not pre-patch self attention if lora dropout non-zero
* fix: add test to check patch not applied
* fix: test
* fix: test config check
* fix where we check so that tests don't break
* fix: test
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
* feat: add fsdp config for magistral
* fix: add mllama self attention handling for lora kernels
* fix: no eval if val_set_size 0 despite having test_datasets
* fix: add note for cce for vlm in newer model
* build base images for torch 2.7.1
* fix: update base docker to use torch 2.7.1
* fix: update doc for main base to use 2.7.1
* make sure to install fa2 in base uv too
* use no build isolation for uv+flashattn
* install psutil also for fa2
* longer timeout for flash attn build
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai>
* Update batching.py: fix position IDs padding bug
If position IDs are padded with a long run of zeros, flash attention will crash.
* use alternate calculation for padding position_ids with a range
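The fix above can be sketched as follows: instead of padding `position_ids` with zeros (which a packed-sequence attention path can misread as the start of one enormous sequence), continue the positions as a range. This is a hypothetical illustration, not the actual `batching.py` change.

```python
def pad_position_ids(position_ids, max_len):
    """Pad position_ids out to max_len by continuing the sequence as a
    range rather than appending zeros: a long constant run of zeros can
    be misinterpreted by packed/varlen flash-attention bookkeeping and
    crash it. Illustrative helper, not Axolotl's actual code."""
    pad = max_len - len(position_ids)
    start = position_ids[-1] + 1 if position_ids else 0
    return position_ids + list(range(start, start + pad))
```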
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
* add uv tooling for e2e gpu tests
* fixes from PR feedback
* simplify check
* fix env var
* make sure to use uv for other install
* use raw_dockerfile_image
* Fix import
* fix args to experimental dockerfile image call
* use updated modal versions
* remove unused field for chat_template.default
"messages" field present in final dataset causes issues with DPO
training otherwise
* lint and fix tests for new return value
* fix for updated expected fields for dpo
* fix test still expecting "messages" field
* chore: lint
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>