Commit Graph

2504 Commits

Author SHA1 Message Date
Wing Lian
efeb5a4e41 fix check for fp8 capability (#3324)
* fix check for fp8 capability

* handle non-cuda compute

* reduce concurrency of tests
2025-12-22 13:58:25 -05:00
VED
faaff6c792 allow users to set ndigits for rounding of metrics when logging (#3325)
* METRIC_PRECISION-> 8

* use ndigits and move env getter to top of log function

---------

Co-authored-by: Ved <ved.work2024@gmail.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-12-22 08:54:43 -05:00
Alexander Kozhevnikov
43cef27458 Fix typo in densemixer RuntimeError (#3327) [skip ci]
It offers installing densemizer while it should be densemixer
2025-12-22 08:53:58 -05:00
Wing Lian
07c41a6c2a fix preview docs failing due to running out of disk (#3326) [skip ci]
* fix preview docs failing due to running out of disk

* fix docs publish too
2025-12-19 11:34:55 -05:00
salman
bbd3486f57 Distributed Muon Optimizer (#3264)
* init

* working

* updating configs

* removing unneeded files

* lint

* comments

* lint

* fix regex match

* bump contribs version

* comments

* fixing tests and imports

* muon imports in test v2

* test cleanup

* bump contribs version

---------

Co-authored-by: Salman Mohammadi <“salman.mohammadi@outlook.com”>
2025-12-19 10:43:47 -05:00
VED
3750d7dd64 add liger support kernal for dpo (#3302)
* add liger kernal 4 dpo

* revert grpo changes,add support in dpo

* revert grpo changes,add support in dpo

* dpo_use_liger_kernal

* fix liger_dpo

---------

Co-authored-by: Ved <ved.work2024@gmail.com>
2025-12-18 11:11:06 -05:00
xzuyn
2197b0bf89 feat: cheap ppl metric (#3317)
* Import math and compute perplexity from loss values

* lint

* coderabbit changes

* lint

* fix: add rounding to ppl

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-12-18 09:02:41 -05:00
Seung Hyun Cho
3e51a680c2 fix: Fix evaluation loss in KD trainer (#3271)
* fix: Fix evaluation loss in KD trainer

* Fix v2 strategy super() call

* fix: Add safety check for total_tokens in log method

* fix: simplified num items and outputs return handling

* fix: add missing model forward pass in compute_loss

* refactor: Use Template Method pattern for chat template strategies

* refactor: use pop(None) and remove v2 override

* chore: lint

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-12-17 13:40:36 -05:00
xzuyn
2cf254b4af Add peft_autocast_adapter_dtype config option (#3311) [skip ci]
* Add `peft_autocast_adapter_dtype` field to schema

* Add `autocast_adapter_dtype` to `model_kwargs`

* chore: docs

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-12-17 10:09:39 -05:00
salman
83d4d97dcc Add QAT NVFP4 configs for blogpost (#3280) [skip ci]
* add configs for blogpost

* fix configs

* fixing baseline configs
2025-12-17 09:35:22 -05:00
NanoCode012
a1d07f42e4 Fix(misc): address PYTORCH_CUDA_ALLOC_CONF deprecate (#3313)
* fix: leftover ministral docs changes

* fix: pytorch_cuda_alloc_conf deprecation

* fix: set old PYTORCH_CUDA_ALLOC_CONF env too

* handle 2.9 separately

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-12-17 09:12:18 -05:00
Wing Lian
2a664dc8ad support for xformers wheels for torch 2.9 (#3308)
* support for xformers wheels for torch 2.9

* fix hf cache?

* don't use hf cache from s3

* show disk free space in ci
2025-12-11 11:56:40 -05:00
NanoCode012
4ac78aa562 fix: update qwen3 jinja tokenization off a few tokens (#3295)
* fix: update qwen3 jinja tokenization off a few tokens

* fix: add note on tokenization issue

* fix: pop last index for mistral tokenizer
2025-12-09 14:31:03 +07:00
VED
b3f4aa149f fix bin size (#3307)
* fix bin size

* lint

---------

Co-authored-by: Ved <ved.work2024@gmail.com>
2025-12-08 09:16:18 -05:00
salman
75b20fb66f Save processor in quantizer CLI (#3290) 2025-12-06 16:27:18 +00:00
NanoCode012
5992e607a2 fix: improve ministral3 docs to be clearer (#3300)
* fix: improve ministral3 docs to be clearer

* fix: title

* chore: wording
2025-12-04 21:44:44 +07:00
NanoCode012
2b66ee189c Feat: add ministral3 (#3297)
* feat: add ministral and mistral3

* chore: lint

* feat: update cce for ministral

* fix: add vram usage

* feat: update for release

* fix: save_pretrained issue in v5

* fix: add instructions to use v5 branch

* fix: add to multipack

* fix: improve instructions

* fix: add model to readme
2025-12-04 08:32:08 -05:00
NanoCode012
86d8cca149 Feat: add trinity by ArceeAI (#3292) 2025-12-02 13:12:55 -05:00
NanoCode012
4a0f98e612 feat: upgrade liger to 0.6.4 (#3289) 2025-12-02 09:16:23 -05:00
Yohan Na
c6ddcdd06a feat: add exaone4 chat template and update enums (#3279)
* feat: add exaone4 chat template and update enums

* fix: handle first message as system or tools in exaone4 chat template

* Update src/axolotl/utils/chat_templates/templates/exaone4.jinja

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* fix: lint

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-12-01 15:52:45 +07:00
github-actions[bot]
7fb6a947d9 chore: update pre-commit hooks (#3287)
Co-authored-by: SalmanMohammadi <25081738+SalmanMohammadi@users.noreply.github.com>
2025-12-01 15:03:14 +07:00
NanoCode012
b234532d9f Feat: add peft_ensure_weight_tying (#3278)
* feat: upgrade peft to 0.18.0

* feat: add peft_ensure_weight_tying

* fix: default

* chore: adjust kwarg per feedback
2025-11-28 18:54:48 +07:00
VED
8990ca3205 fix: removed unused "scikit-learn==1.4.2" (#3277)
Co-authored-by: Ved <ved.work2024@gmail.com>
2025-11-24 13:48:53 +07:00
NanoCode012
006f226270 Feat: add Olmo3 (BC with Olmo and Olmo2) (#3275)
* feat: update cce to include olmo family

* chore: update docs following feedback

* feat: add olmo3 config

* fix: clarify 3 methods

* chore: add olmo to readme
2025-11-24 10:21:31 +07:00
Wing Lian
0b635e69c5 build docker images for 2.9.x (#3273) 2025-11-20 09:26:24 -05:00
Wing Lian
0d27e14e45 Torch 2.9.1 base images (#3268)
* update torch 2.9.1 base images

* update base dockerfile image check
2025-11-20 09:04:37 -05:00
NanoCode012
f5f21fb216 chore: update readme with latest updates (#3267)
Some checks failed
ci-cd / build-axolotl (<nil>, 126, 12.6.3, 3.11, 2.7.0) (push) Has been cancelled
ci-cd / build-axolotl (<nil>, 128, 12.8.1, 3.11, 2.7.1) (push) Has been cancelled
ci-cd / build-axolotl (<nil>, 128, 12.8.1, true, 3.11, 2.8.0) (push) Has been cancelled
ci-cd / build-axolotl (vllm, 126, 12.6.3, 3.11, 2.7.1) (push) Has been cancelled
publish pypi / Create Release (push) Has been cancelled
ci-cd / build-axolotl-cloud (<nil>, 126, 12.6.3, 3.11, 2.7.0) (push) Has been cancelled
ci-cd / build-axolotl-cloud (<nil>, 126, 12.6.3, <nil>, 3.11, 2.7.1) (push) Has been cancelled
ci-cd / build-axolotl-cloud (<nil>, 128, 12.8.1, 3.11, 2.7.1) (push) Has been cancelled
ci-cd / build-axolotl-cloud (<nil>, 128, 12.8.1, true, 3.11, 2.8.0) (push) Has been cancelled
ci-cd / build-axolotl-cloud (vllm, 126, 12.6.3, 3.11, 2.7.1) (push) Has been cancelled
ci-cd / build-axolotl-cloud-no-tmux (<nil>, 126, 12.6.3, <nil>, 3.11, 2.7.1) (push) Has been cancelled
ci-cd / build-axolotl-cloud-no-tmux (<nil>, 128, 12.8.1, <nil>, 3.11, 2.8.0) (push) Has been cancelled
ci-cd / build-axolotl-cloud-no-tmux (vllm, 126, 12.6.3, true, 3.11, 2.7.1) (push) Has been cancelled
publish pypi / Upload release to PyPI (push) Has been cancelled
v0.13.0
2025-11-18 14:45:21 +07:00
NanoCode012
4e55871112 feat: Add opt-out Telemetry (#3237)
* initial telemetry manager impl

* adding todo

* updates

* updates

* progress on telemetry: config load, process, model load, train start / end, error tracking

* update error file path sanitization function; adding more error tracking

* updated sanitization logic, tests

* adding runtime metrics (cpu + gpu memory, steps/s, etc.)

* tests for runtime metrics telemetry and assoc. callback

* small update / fix

* simplifying path redaction

* sleep on all ranks in distributed setting

* adding back in base_model redaction w/ whitelist

* fix

* doc update

* improved redaction, send system info during model config load telemetry, etc.

* adding runtime metrics / system info additional accelerator support, etc.

* adding runtime metrics / system info additional accelerator support, etc.

* remove duplicate info

* fixes

* fix issue with tests in ci

* distributed fix

* opt-in version of telemetry

* enable / disable logic update

* docs fix

* doc update

* minor fixes

* simplifying

* slight changes

* fix

* lint

* update posthog dep

* coderabbit comments

* fix: opt-in model

* fix: increase time since last

* fix: increase whitelist orgs

* fix: posthog init and shutdown

* fix: imports

* fix: also check grad norm

* fix: duplicate plugin_manager calls

* fix: bad merge

* chore: update docs

* fix: cache process per comment

* fix: error handling

* fix: tests

* Revert "fix: error handling"

This reverts commit 22d1ea5755.

* fix: test telemetry error_handled bool

* fix: revert test

* chore: final doc fixes

---------

Co-authored-by: Dan Saunders <danjsaund@gmail.com>
Co-authored-by: Dan Saunders <dan@axolotl.ai>
2025-11-18 11:35:25 +07:00
Wing Lian
a6bafb55cb upgrade datasets to 4.4.1 (#3266)
* upgrade datasets

* cleanup pip cache earlier

* cleanup unused things from worker

* also cleanup sdist
2025-11-14 09:52:14 -08:00
Wing Lian
0fbde69e9c only push axolotl images, personal repo is deprecated (#3262)
* only push axolotl images, personal repo is deprecated

* cleanup
2025-11-14 07:50:03 -08:00
Wing Lian
301e22849f upgrade to latest deepspeed and make sure latest tagged axolotl images are using torch 2.8.0 (#3261) 2025-11-13 13:03:01 -05:00
VED
dcf24fd24e feat: save checkpoint after training started (#3233)
* add:config parameters for checkpoint

* callback main

* test file_type fix

* lint

* unit

* simplify dict/obj handeling

* Update src/axolotl/utils/schemas/dynamic_checkpoint.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Delete tests/e2e/integrations/__init__.py

* remove hard code path in test

* device check

* lint

* Update src/axolotl/utils/callbacks/dynamic_checkpoint.py

Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>

* Update src/axolotl/utils/callbacks/dynamic_checkpoint.py

Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>

* Update src/axolotl/utils/schemas/dynamic_checkpoint.py

Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>

* lint-2

* remove: singal based checkpoints

* lint

* remove signal tests

* add:is_main_process

* lint

* addis_d:istributed() for tests

* remove nested is_main_process

* Update src/axolotl/utils/schemas/dynamic_checkpoint.py

Co-authored-by: Wing Lian <wing.lian@gmail.com>

* Update src/axolotl/utils/schemas/dynamic_checkpoint.py

Co-authored-by: Wing Lian <wing.lian@gmail.com>

* add user_defined_filename

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
Co-authored-by: Wing Lian <wing.lian@gmail.com>
2025-11-13 10:21:05 -05:00
NanoCode012
49b8107989 feat: add granite4 examples (#3256) [skip ci] 2025-11-13 10:19:16 -05:00
NanoCode012
9901ee5602 fix: voxtralprocessor broken (#3255) [skip ci]
* fix: voxtralprocessor broken

* chore: add todo

* chore: wording
2025-11-13 10:18:42 -05:00
xzuyn
dd78f2e0cc Fix: warmup_steps: 0 & warmup_ratio: 0 not disabling warmup (#3254)
* fix unintentional falsy checks

* chore: lint

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-11-11 10:32:06 +07:00
Eduard Zl
b54f9c942b _get_tools in ChatTemplateStrategy : function "parameters" can be dict or string (#3238)
* When training of function calls, "tools" elements of a dataset can contain same parameter name but with different types. Datasets fails to load such training set. This fix allows "parameters" element of function call to be string( by running "json.dumps" in preparation of training data set). The _get_tools function will iterate over tool definitions, if "parameters" element is dict, it will keep that way, if it is a string, it will be converted to dict by invoking "json.loads" on string value.

* feat: add doc on tool parameters json loading

* feat: add tests for parameters json string

---------

Co-authored-by: ezlotnik <eduard_zlotnik@intuit.com>
Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-11-11 09:04:28 +07:00
NanoCode012
11eb36585a feat: add arg to enable dft in liger (#3125)
* feat: add arg to enable dft in liger

* feat: add tests use_token_scaling

* fix: test

* fix: move check to args
2025-11-10 21:37:47 +07:00
NanoCode012
d0c846fc5e feat: add granitemoeshared and granitemoehybrid (#3158) 2025-11-10 21:35:45 +07:00
Wing Lian
b5fcc2f14b log cumulative total trained tokens (#3252)
* log cumulative total trained tokens

* use is_distributed helper
2025-11-07 16:04:00 -05:00
Wing Lian
b62eed8809 add openenv-core to requirements (#3251) 2025-11-07 12:17:27 -05:00
VED
ed2e8cacd6 feat:openenv rollout_func (#3239) [skip ci]
* feat:openenv rollout_func

* chore lint

* docs

* add:docs processing_class

* tests

* lint
2025-11-07 08:51:40 -05:00
Lê Nam Khánh
80270a92fa Fix typos in some files (#3250) [skip ci] 2025-11-07 08:21:20 -05:00
Wing Lian
bfdc9a8249 upgrade trl and other hf deps (#3249)
* upgrade trl and other hf deps

* skip simpo for now
2025-11-06 16:06:03 -05:00
salman
c37decb073 update pre-commit cadence (#3245) 2025-11-04 13:43:40 +00:00
NanoCode012
01a346d86a feat(example): add gpt-oss-safeguard docs (#3243)
* feat(example): add gpt-oss-safeguard docs

* fix: add doc on reasoning_effort
2025-11-04 07:39:21 +07:00
NanoCode012
26f05b6008 fix(example): set model_type to load for gemma3 text (#3242)
* fix: set model_type to load for gemma3 text

* chore: simplify

* chore: unify
2025-11-04 07:35:07 +07:00
github-actions[bot]
ed58fa8a75 chore: update pre-commit hooks (#3244) 2025-11-03 15:55:40 +00:00
Wing Lian
633afffacb add torch 2.9.0 to ci (#3223) 2025-10-30 18:50:26 -04:00
Wing Lian
4b1b4fa6d8 upgrade numpy (#3236)
* upgrade numpy to 2.3.4

* bump contribs for numpy

* fix vllm versions

* bump numba

* make sure psutil is installed

* add psutil to cicd dockerfile jinja

* lower dep versions of numba + numpy for vllm

* bump datasets version

* resolve pydantic conflict too
2025-10-30 10:03:24 -04:00
github-actions[bot]
0f7c886b7b chore: update pre-commit hooks (#3222) [skip ci]
Co-authored-by: djsaunde <1245942+djsaunde@users.noreply.github.com>
2025-10-29 18:09:46 -04:00