Commit Graph

2496 Commits

Author SHA1 Message Date
Wing Lian
a85efffbef bump transformers==4.52.4 (#2800) [skip ci]
* bump transformers==4.52.4

* don't use hf offline for qwen tokenizer

* increase timeout

* don't use methodtype

* increase timeout

* better assertion logging

* upgrade deepspeed version too
2025-06-18 15:46:14 -04:00
Dan Saunders
06a648263b Config doc autogen: follow-up fix docs build (#2806)
* config reference doc autogen

* improvements

* cleanup; still ugly but working

* reformat

* remove autogen config ref from git

* factor out validations

* rewrite

* rewrite

* cleanup

* progress

* progress

* progress

* lint and minifying somewhat

* remove unneeded

* coderabbit

* coderabbit

* update preview-docs workflow triggers

* installing with deps

* coderabbit

* update refs

* overwrote file accidentally

* docs install deps
2025-06-18 15:42:54 -04:00
Dan Saunders
9d5bfc127e Config doc autogen (#2718)
* config reference doc autogen

* improvements

* cleanup; still ugly but working

* reformat

* remove autogen config ref from git

* factor out validations

* rewrite

* rewrite

* cleanup

* progress

* progress

* progress

* lint and minifying somewhat

* remove unneeded

* coderabbit

* coderabbit

* update preview-docs workflow triggers

* installing with deps

* coderabbit

* update refs

* overwrote file accidentally
2025-06-18 15:36:53 -04:00
Wing Lian
da8f6c32b9 update favicon (#2801)
* update favicon

* correct size favicon
2025-06-17 18:09:24 -04:00
Wing Lian
88c0e8d048 release tag (#2799)
v0.10.0
2025-06-17 12:13:27 -04:00
NanoCode012
d8e8cd8558 feat: remove evalfirst callback with built-in trainer arg (#2797) 2025-06-17 12:09:33 -04:00
Wing Lian
ccc94da8ad KD fix w/ online distillation (#2700) [skip ci]
* kd fixes

* fix collator setup

* fix input args

* better handling to drop string fields for kd with raw dataset

* kd trainer has kd temp as part of the init

* drop top_k before softmax

* simplify and remove zscore

* WIP chunked KD loss with autograd wrapper

* more fixes and liger-type chunked loss

* collator cls for plugins

* remove debugging

* additional plugin collator kwargs, don't scale up kd loss by t^2

* don't need temp arg to distill method

* online kd wip

* add close to comment block

* support sampling params/max new tokens

* handle when no custom collator is used in plugins

* logsumexp trick

* fix check

* shift off the first empty token

* fix length of padding

* use max not min

* temp scale kd loss at end

* support for dynamic plugin training args mixins and symmetric kl

* chore: lint

* fix trainer callback base class

* Fix decay

* accept compressed responses for smaller wire payload

* post-rebase lint

* more KD updates

* increase hyperparams_count for gradients for added normalize_topk

* fix to remove attention_mask

* rename vars for consistency

* fix rebase issues

* default to dropping last batch in multipack batch sampler

* improve handling of train len

* init collator_cls_and_kwargs

* explicit drop_last=False when checking for multipack completeness

* use separate v2 loader for kd

* fix kd tests to use subprocess so it picks up kd training args

* default value for kd_beta arg

* use updated dataset for ci

* longer timeout for e2e
2025-06-17 12:09:13 -04:00
Matt Cummins
ba62aa65ee fixed the lora_target_modules syntax (#2793) 2025-06-15 16:47:02 -04:00
NanoCode012
21388cf615 Fix: lora kernel pre-patch applied despite post-patch not applied (#2772)
* fix: do not pre-patch self attention if lora dropout non-zero

* fix: add test to check patch not applied

* fix: test

* fix: test config check

* fix where we check so that tests don't break

* fix: test

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-06-14 11:54:06 -07:00
NanoCode012
80d5b066ec Fix: adding magistral fsdp config, fixing not eval with test_datasets, handle mllama attention (#2789) [skip ci]
* feat: add fsdp config for magistral

* fix: add mllama self attention handling for lora kernels

* fix: no eval if val_set_size 0 despite having test_datasets

* fix: add note for cce for vlm in newer model
2025-06-14 11:53:43 -07:00
NanoCode012
a3c82e8cbb fix: grpo doc link (#2788) [skip ci] 2025-06-13 12:03:47 -07:00
Wing Lian
b2274d430b support for QAT w RL (DPO) (#2776) 2025-06-13 10:00:35 -04:00
NanoCode012
eac4a61f55 Feat: Add Magistral and mistral-common tokenizer support (#2780) 2025-06-12 19:18:33 -04:00
Wing Lian
ace9287c96 update loss value for flakey e2e test (#2786) [skip ci]
* update loss value for flakey e2e test

* use pytest skip

* parametrize combinations
2025-06-12 18:06:14 -04:00
JZacaroli
f5fbc82f2b Fix logging import in evaluate.py (#2782) (#2783)
* Fix logging import in evaluate.py (#2782)

* chore: lint

---------

Co-authored-by: Joe Zacaroli <jaz@cyberscience.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-06-12 13:23:31 -04:00
NanoCode012
706c677cad feat(doc): update readme to include changelog and remove matrix (#2775) [skip ci]
* feat(doc): update readme to include changelog and remove matrix

* chore: improve wording

* chore: wording

* Update README.md

Co-authored-by: salman <salman.mohammadi@outlook.com>

* Update README.md

Co-authored-by: salman <salman.mohammadi@outlook.com>

* Update README.md

Co-authored-by: salman <salman.mohammadi@outlook.com>

* Update README.md

Co-authored-by: salman <salman.mohammadi@outlook.com>

* chore: address comment remove muon

* chore: address comments

* fix: address final comments

---------

Co-authored-by: salman <salman.mohammadi@outlook.com>
2025-06-12 13:23:18 -04:00
Wing Lian
468580d18e limit multipack sampler processes (#2771) [skip ci]
* limit to 16 packing processes

* make num_processes properly reflect configured dataset_processes
2025-06-12 13:22:58 -04:00
salman
3634d8ff9d QAT docfix (#2778) [skip ci]
* nits

* Update docs/qat.qmd

Co-authored-by: NanoCode012 <nano@axolotl.ai>

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-06-12 13:22:40 -04:00
Wing Lian
bcc108efc1 build 2.7.1 images too (#2784) [skip ci] 2025-06-12 13:22:20 -04:00
Wing Lian
581dd324cc build base images for torch 2.7.1 (#2764)
* build base images for torch 2.7.1

* fix: update base docker to use torch 2.7.1

* fix: update doc for main base to use 2.7.1

* make sure to install fa2 in base uv too

* use no build isolation for uv+flashattn

* install psutil also for fa2

* longer timeout for flash attn build

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-06-11 17:11:06 -04:00
Dan Saunders
00cda8cc70 Data loader refactor (#2707)
* data loading refactor (wip)

* updates

* progress

* pytest

* pytest fix

* lint

* zero_first -> filelock, more simplifications

* small simplification

* import change

* nit

* lint

* simplify dedup

* couldn't resist

* review comments WIP

* continued wip

* minor changes

* fix; remove contrived test

* further refactor

* set default seed in pydantic config

* lint

* continued simplification

* lint

* renaming and nits

* filelock tests

* fix

* fix

* lint

* remove nullable arg

* remove unnecessary code

* moving dataset save fn to shared module

* remove debug print

* matching var naming

* fn name change

* coderabbit comments

* naming nit

* fix test
2025-06-10 19:53:07 -04:00
Dan Saunders
52a0452acb magistral small placeholder (#2777) 2025-06-10 13:03:41 -04:00
NanoCode012
83632f71d8 Feat: add tool calling support via tools column (#2774)
* feat: add tool_calling field support

* fix: add tests
2025-06-09 21:42:05 -07:00
Qingyang Wu
92afa4fa27 Fix the bug of position ids padding (#2739) [skip ci]
* Update batching.py: fix the bug of position ids padding

if position ids are padded with a long sequence of zeros, flash attention will crash

* use alternate calculation for padding position_ids with a range

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-06-09 21:26:36 -07:00
Wing Lian
dd660c2ed0 handle when unable to save optimizer state when using ao optimizer with FSDP (#2773) [skip ci]
* handle when unable to save optimizer state when using ao optimizer with FSDP1

* improve messaging

Co-authored-by: salman <salman.mohammadi@outlook.com>

---------

Co-authored-by: salman <salman.mohammadi@outlook.com>
2025-06-09 21:26:14 -07:00
Wing Lian
09c685fd2c fix worker_init_fn signature handling (#2769) 2025-06-08 23:14:10 -07:00
Dan Saunders
345a159796 coderabbit comments 2025-06-07 04:50:29 +00:00
Dan Saunders
657bffd85f update posthog dep 2025-06-05 23:46:20 +00:00
Dan Saunders
f0dde8e2d5 lint 2025-06-05 23:41:46 +00:00
Dan Saunders
25fa4df70f fix 2025-06-05 23:33:46 +00:00
Dan Saunders
e735f4270b slight changes 2025-06-05 23:33:46 +00:00
Dan Saunders
035e7a2f4c simplifying 2025-06-05 23:33:46 +00:00
Dan Saunders
2d36c11264 minor fixes 2025-06-05 23:33:46 +00:00
Dan Saunders
b8ec5bdccf doc update 2025-06-05 23:33:44 +00:00
Dan Saunders
249405b46e docs fix 2025-06-05 23:31:44 +00:00
Dan Saunders
d3be84fec2 enable / disable logic update 2025-06-05 23:31:44 +00:00
Dan Saunders
1c74ab175f opt-in version of telemetry 2025-06-05 23:31:44 +00:00
Dan Saunders
b2f1fc109a distributed fix 2025-06-05 23:31:44 +00:00
Dan Saunders
5a2a80cc48 fix issue with tests in ci 2025-06-05 23:31:44 +00:00
Dan Saunders
4033fe74f8 fixes 2025-06-05 23:31:44 +00:00
Dan Saunders
e9df4444be remove duplicate info 2025-06-05 23:31:44 +00:00
Dan Saunders
ffd2985750 adding runtime metrics / system info, additional accelerator support, etc. 2025-06-05 23:31:44 +00:00
Dan Saunders
17310f9acc adding runtime metrics / system info, additional accelerator support, etc. 2025-06-05 23:31:44 +00:00
Dan Saunders
71ae6f9f87 improved redaction, send system info during model config load telemetry, etc. 2025-06-05 23:31:08 +00:00
Dan Saunders
9dd1092f8f doc update 2025-06-05 23:27:29 +00:00
Dan Saunders
2c2f2647a9 fix 2025-06-05 23:27:29 +00:00
Dan Saunders
98313a6b3f adding back in base_model redaction w/ whitelist 2025-06-05 23:27:29 +00:00
Dan Saunders
8b75205d3b sleep on all ranks in distributed setting 2025-06-05 23:27:29 +00:00
Dan Saunders
ef4990f304 simplifying path redaction 2025-06-05 23:27:29 +00:00
Dan Saunders
db3297b090 small update / fix 2025-06-05 23:27:27 +00:00