axolotl

Files

Wing Lian ccc94da8ad KD fix w/ online distillation (#2700) [skip ci]

* kd fixes

* fix collator setup

* fix input args

* better handling to drop string fields for kd with raw dataset

* kd trainer has kd temp as part of the init

* drop top_k before softmax

* simplify and remove zscore

* WIP chunked KD loss with autograd wrapper

* more fixes and liger-type chunked loss

* collator cls for plugins

* remove debugging

* additional plugin collator kwargs, don't scale up kd loss by t^2

* don't need temp arg to distill method

* online kd wip

* add close to comment block

* support sampling params/max new tokens

* handle when no custom collator is used in plugins

* logsumexp trick

* fix check

* shift off the first empty token

* fix length of padding

* use max not min

* temp scale kd loss at end

* support for dynamic plugin training args mixins and symmetric kl

* chore: lint

* fix trainer callback base class

* Fix decay

* accept compressed responses for smaller wire payload

* post-rebase lint

* more KD updates

* increase hyperparams_count for gradients for added normalize_topk

* fix to remove attention_mask

* rename vars for consistency

* fix rebase issues

* default to dropping last batch in multipack batch sampler

* improve handling of train len

* init collator_cls_and_kwargs

* explicit drop_last=False when checking for multipack completeness

* use separate v2 loader for kd

* fix kd tests to use subprocess so it picks up kd training args

* default value for kd_beta arg

* use updated dataset for ci

* longer timeout for e2e

2025-06-17 12:09:13 -04:00
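
The commit notes above mention a "logsumexp trick" and temperature scaling of the KD loss without showing them. As a rough illustration only, here is a minimal sketch in plain Python. It is not the axolotl implementation: the function names (`logsumexp`, `log_softmax`, `forward_kl`), the use of forward KL, and the choice to apply the temperature to both sets of logits are assumptions made for this example.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs)).

    Subtracting the max before exponentiating keeps every exponent <= 0,
    so exp() cannot overflow even for very large logits.
    """
    m = max(xs)
    if math.isinf(m):
        return m
    return m + math.log(sum(math.exp(x - m) for x in xs))


def log_softmax(logits, temperature=1.0):
    """Log-probabilities from raw logits, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    lse = logsumexp(scaled)
    return [x - lse for x in scaled]


def forward_kl(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) over one token position (hypothetical helper)."""
    t_logp = log_softmax(teacher_logits, temperature)
    s_logp = log_softmax(student_logits, temperature)
    return sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))


if __name__ == "__main__":
    # A naive math.log(sum(math.exp(x) ...)) would raise OverflowError here.
    print(logsumexp([1000.0, 1000.0]))
    print(forward_kl([2.0, 1.0, 0.1], [1.5, 1.2, 0.3], temperature=2.0))
```

The same idea carries over to tensor libraries, which typically expose a built-in stable log-sum-exp, so a hand-rolled version like this is mainly useful for understanding what the trick does.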

zero1_torch_compile.json

add deepspeed example with torch compile enabled (#2212) [skip ci]

2024-12-22 12:11:39 -05:00

zero1.json

Set gradient_clipping to auto in DeepSpeed configs (#1382) [skip ci]

2024-03-10 20:50:12 -04:00

zero2_torch_compile.json

KD fix w/ online distillation (#2700) [skip ci]

2025-06-17 12:09:13 -04:00

zero2.json

Set gradient_clipping to auto in DeepSpeed configs (#1382) [skip ci]