* upgrade trl==0.19.1
* add vllm for tests for grpo
* fixes to work with latest trl
* need data_parallel_size config too
* support for vllm_mode for server / colocate
* vllm settings for colocate
* relax vllm version
* bump min hf hub for latest vllm support
* add hints on string literal for vllm mode
* use latest transformers 4.53.2
* tweak acceptable loss on flaky test_ds_zero3_packed test
* don't run flaky vllm/grpo tests for now
* upgrade to flash-attn 2.8.0.post2
* use cu126 with torch 2.6
* seems vllm 0.8.5.post1 not compatible with cuda12.6.3 and torch 2.6
* cu126 + torch 2.6 as the default
* use cu126 for multigpu w torch 2.6 too
* drop vllm for now from ci for now
* kd fixes
* fix collator setup
* fix input args
* better handling to drop string fields for kd with raw dataset
* kd trainer has kd temp as part of the init
* drop top_k before softmax
* simplfy and remove zscore
* WIP chunked KD loss with autograd wrapper
* more fixes and liger-type chunked loss
* collator cls for plugins
* remove debugging
* additional plugin collator kwargs, don't scale up kd loss by t^2
* don't need temp arg to distill method
* online kd wip
* add close to comment block
* suport sampling params/max new tokens
* handle when no custom collator is used in plugins
* logsumexp trick:
* fix check
* shift off the first empty token
* fix length of padding
* use max not min
* temp scale kd loss at end
* support for dynamic plugin training args mixins and symmetric kl
* chore: lint
* fix trainer callback base class
* Fix decay
* accept compressed responses for smaller wire payload
* post-rebase lint
* more KD updates
* increase hyperparams_count for gradients for added normalize_topk
* fix to remove attention_mask
* rename vars for consistency
* fix rebase issues
* default to dropping last batch in multipack batch sampler
* improve handling of train len
* init collator_cls_and_kwargs
* explicit drop_last=False when checking for multipack completeness
* use separate v2 loader for kd
* fix kd tests to use subprocess so it picks up kd training args
* default value for kd_beta arg
* use updated dataset for ci
* longer timeout for e2e
* build base images for torch 2.7.1
* fix: update base docker to use torch 2.7.1
* fix: update doc for main base to use 2.7.1
* make sure to install fa2 in base uv too
* use no build isolation for uv+flashattn
* install psutil also for fa2
* longer timeout for flash attn build
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai>
* add uv tooling for e2e gpu tests
* fixes from PR feedback
* simplify check
* fix env var
* make sure to use uv for other install
* use raw_dockerfile_image
* Fix import
* fix args to experimental dockerfile image call
* use updated modal versions