Commit Graph

2511 Commits

Author SHA1 Message Date
VED
a6080df73c compute loss only if training and update token metric naming (#3293) [skip ci]
* compute loss only if training

* save total_tokens for checkpiont

* check if string

* refactor total_tokens/ num_tokens

* refactor 2

* rplc trainable_step/trian_per_sec_per_gpu

* lint + log trainable/tokens

* consolidate it in the callback.

* test for total_tokes aftr remuse

* check if tokenstate exist after ckpt

---------

Co-authored-by: Ved <ved.work2024@gmail.com>
2025-12-25 18:38:17 +07:00
NanoCode012
4f5e8a328a Feat: add MiMo and Plano (#3332) [skip-ci]
* feat: add xiaomi's mimo 7b

* fix: pin revision

* fix: update trinity docs and pin revision

* fix: wrong config name

* feat: add vram usage

* feat: add plano

* feat: update plano vram usage

* chore: comments
2025-12-25 18:09:03 +07:00
NanoCode012
418933f0d1 feat: add internvl3_5 (#3141) [skip-ci]
* feat: add internvl3_5

* fix: add timm instructions

* chore: add kimi-linear to cce doc

* feat: update internvl example

* chore: pin revision

* chore: remove from multipack

* fix: add to multimodal array

* fix: internvl use hf version

* feat: update cce

* chore: lint

* fix: list for image_size

* chore: add docs vram usage

* feat: enable cce

* fix: no need trust remote code

* fix: inconsistent timm version
2025-12-25 18:07:59 +07:00
NanoCode012
372f664c63 feat: cleanup old flex mask patch, suppress Matmul bnb warn, and misc (#3330) [skip-ci]
* feat: add pos id to flex attention for packing part 1

* feat: update to include sliding window mask patch

* fix: suppress MatMul8bitLt: inputs will be cast from warnings

* fix: remove redundant flex attention patch

* chore: update olmo docs

* feat: add validator patch for cross entropy
2025-12-25 17:56:20 +07:00
NanoCode012
97f1b1758d Feat: add kimi linear support (#3257)
* feat: add custom kimi linear patch [skip ci]

* feat: add configuration file and fix import [skip ci]

* fix: hijack tokenizer temporarily [skip ci]

* chore: remove accidental commit

* fix: attempt patch kimi remote

* fix: kwargs passsed

* fix: device for tensor

* fix: aux loss calculation

* feat: cleaned up patches order

* fix: remove duplicate tokenizer patch

* chore: add debug logs

* chore: add debug logs

* chore: debug

* Revert "chore: add debug logs"

This reverts commit da372a5f67.

* Revert "chore: add debug logs"

This reverts commit 97d1de1d7c.

* fix: KeyError: 'tokenization_kimi'

* fix: support remote_model_id in cce patch

* feat: add config preload patch

* fix: use standard aux loss calc and updated modeling

* fix: import

* feat: add kimi-linear docs and example

* chore: add note about moe kernels

* feat: update cce to include kimi-linear

* chore: lint

* chore: update main readme

* fix: patch mechanism to address comments

* chore: lint

* fix: tests

* chore: cleanup comment
2025-12-25 17:53:52 +07:00
Abubakar Abid
f2155eaf79 feat: add trackio as experiment tracking integration (#3253)
* feat: add trackio as experiment tracking integration

- Add TrackioConfig to integrations schema with project_name, run_name, and space_id
- Create trackio_.py module for environment setup
- Add is_trackio_available() utility function
- Integrate trackio with report_to in trainer builder
- Add trackio callback for experiment tracking
- Add trackio config keys to gpt-oss example YAMLs
- Trackio runs locally by default, syncs to HF Space if space_id provided

* changes

* changes

* changes

* changes

* changes

* changes

* changes

* Update requirements.txt

* don't allow pydantic 2.12 for now

---------

Co-authored-by: Abubakar Abid <aaabid93@gmail.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-12-23 08:49:07 -05:00
kallewoof
92ee4256f7 feature: raise on long sequence drop (#3321)
* feature: raise on long sequence drop

It is sometimes not desired that sequences are silently dropped from the dataset, especially when the dataset has been carefully crafted and pre-fitted for the training context. This would then suggest that an error occurred somewhere in the process. This feature adds a third value for excess_length_strategy called 'raise', which will raise a ValueError if a sequence is encountered that is too long and would have normally been dropped/truncated.

* tests: add excess_length_strategy tests

* doc: updated return value description for drop_long_seq_in_dataset

* add @enable_hf_offline

* fixed cfg modified after validate_config called

* hf offline fix

* fix tqdm desc when raise is used

* test: added test for non-batched case

* accidental code change revert

* test: use pytest.raises

* test: simplified drop_seq_len tests

* test: moved excess_length_strat test to test_data.py

---------

Co-authored-by: salman <salman.mohammadi@outlook.com>
2025-12-22 13:59:49 -05:00
Wing Lian
efeb5a4e41 fix check for fp8 capability (#3324)
* fix check for fp8 capability

* handle non-cuda compute

* reduce concurrency of tests
2025-12-22 13:58:25 -05:00
VED
faaff6c792 allow users to set ndigits for rounding of metrics when logging (#3325)
* METRIC_PRECISION-> 8

* use ndigits and move env getter to top of log function

---------

Co-authored-by: Ved <ved.work2024@gmail.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-12-22 08:54:43 -05:00
Alexander Kozhevnikov
43cef27458 Fix typo in densemixer RuntimeError (#3327) [skip ci]
It offers installing densemizer while it should be densemixer
2025-12-22 08:53:58 -05:00
Wing Lian
07c41a6c2a fix preview docs failing due to running out of disk (#3326) [skip ci]
* fix preview docs failing due to running out of disk

* fix docs publish too
2025-12-19 11:34:55 -05:00
salman
bbd3486f57 Distributed Muon Optimizer (#3264)
* init

* working

* updating configs

* removing unneeded files

* lint

* comments

* lint

* fix regex match

* bump contribs version

* comments

* fixing tests and imports

* muon imports in test v2

* test cleanup

* bump contribs version

---------

Co-authored-by: Salman Mohammadi <“salman.mohammadi@outlook.com”>
2025-12-19 10:43:47 -05:00
VED
3750d7dd64 add liger support kernal for dpo (#3302)
* add liger kernal 4 dpo

* revert grpo changes,add support in dpo

* revert grpo changes,add support in dpo

* dpo_use_liger_kernal

* fix liger_dpo

---------

Co-authored-by: Ved <ved.work2024@gmail.com>
2025-12-18 11:11:06 -05:00
xzuyn
2197b0bf89 feat: cheap ppl metric (#3317)
* Import math and compute perplexity from loss values

* lint

* coderabbit changes

* lint

* fix: add rounding to ppl

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-12-18 09:02:41 -05:00
Seung Hyun Cho
3e51a680c2 fix: Fix evaluation loss in KD trainer (#3271)
* fix: Fix evaluation loss in KD trainer

* Fix v2 strategy super() call

* fix: Add safety check for total_tokens in log method

* fix: simplified num items and outputs return handling

* fix: add missing model forward pass in compute_loss

* refactor: Use Template Method pattern for chat template strategies

* refactor: use pop(None) and remove v2 override

* chore: lint

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-12-17 13:40:36 -05:00
xzuyn
2cf254b4af Add peft_autocast_adapter_dtype config option (#3311) [skip ci]
* Add `peft_autocast_adapter_dtype` field to schema

* Add `autocast_adapter_dtype` to `model_kwargs`

* chore: docs

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-12-17 10:09:39 -05:00
salman
83d4d97dcc Add QAT NVFP4 configs for blogpost (#3280) [skip ci]
* add configs for blogpost

* fix configs

* fixing baseline configs
2025-12-17 09:35:22 -05:00
NanoCode012
a1d07f42e4 Fix(misc): address PYTORCH_CUDA_ALLOC_CONF deprecate (#3313)
* fix: leftover ministral docs changes

* fix: pytorch_cuda_alloc_conf deprecation

* fix: set old PYTORCH_CUDA_ALLOC_CONF env too

* handle 2.9 separately

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-12-17 09:12:18 -05:00
Wing Lian
2a664dc8ad support for xformers wheels for torch 2.9 (#3308)
* support for xformers wheels for torch 2.9

* fix hf cache?

* don't use hf cache from s3

* show disk free space in ci
2025-12-11 11:56:40 -05:00
NanoCode012
4ac78aa562 fix: update qwen3 jinja tokenization off a few tokens (#3295)
* fix: update qwen3 jinja tokenization off a few tokens

* fix: add note on tokenization issue

* fix: pop last index for mistral tokenizer
2025-12-09 14:31:03 +07:00
VED
b3f4aa149f fix bin size (#3307)
* fix bin size

* lint

---------

Co-authored-by: Ved <ved.work2024@gmail.com>
2025-12-08 09:16:18 -05:00
salman
75b20fb66f Save processor in quantizer CLI (#3290) 2025-12-06 16:27:18 +00:00
NanoCode012
5992e607a2 fix: improve ministral3 docs to be clearer (#3300)
* fix: improve ministral3 docs to be clearer

* fix: title

* chore: wording
2025-12-04 21:44:44 +07:00
NanoCode012
2b66ee189c Feat: add ministral3 (#3297)
* feat: add ministral and mistral3

* chore: lint

* feat: update cce for ministral

* fix: add vram usage

* feat: update for release

* fix: save_pretrained issue in v5

* fix: add instructions to use v5 branch

* fix: add to multipack

* fix: improve instructions

* fix: add model to readme
2025-12-04 08:32:08 -05:00
NanoCode012
86d8cca149 Feat: add trinity by ArceeAI (#3292) 2025-12-02 13:12:55 -05:00
NanoCode012
4a0f98e612 feat: upgrade liger to 0.6.4 (#3289) 2025-12-02 09:16:23 -05:00
Yohan Na
c6ddcdd06a feat: add exaone4 chat template and update enums (#3279)
* feat: add exaone4 chat template and update enums

* fix: handle first message as system or tools in exaone4 chat template

* Update src/axolotl/utils/chat_templates/templates/exaone4.jinja

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* fix: lint

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-12-01 15:52:45 +07:00
github-actions[bot]
7fb6a947d9 chore: update pre-commit hooks (#3287)
Co-authored-by: SalmanMohammadi <25081738+SalmanMohammadi@users.noreply.github.com>
2025-12-01 15:03:14 +07:00
NanoCode012
b234532d9f Feat: add peft_ensure_weight_tying (#3278)
* feat: upgrade peft to 0.18.0

* feat: add peft_ensure_weight_tying

* fix: default

* chore: adjust kwarg per feedback
2025-11-28 18:54:48 +07:00
VED
8990ca3205 fix: removed unused "scikit-learn==1.4.2" (#3277)
Co-authored-by: Ved <ved.work2024@gmail.com>
2025-11-24 13:48:53 +07:00
NanoCode012
006f226270 Feat: add Olmo3 (BC with Olmo and Olmo2) (#3275)
* feat: update cce to include olmo family

* chore: update docs following feedback

* feat: add olmo3 config

* fix: clarify 3 methods

* chore: add olmo to readme
2025-11-24 10:21:31 +07:00
Wing Lian
0b635e69c5 build docker images for 2.9.x (#3273) 2025-11-20 09:26:24 -05:00
Wing Lian
0d27e14e45 Torch 2.9.1 base images (#3268)
* update torch 2.9.1 base images

* update base dockerfile image check
2025-11-20 09:04:37 -05:00
NanoCode012
f5f21fb216 chore: update readme with latest updates (#3267)
Some checks failed
ci-cd / build-axolotl (<nil>, 126, 12.6.3, 3.11, 2.7.0) (push) Has been cancelled
ci-cd / build-axolotl (<nil>, 128, 12.8.1, 3.11, 2.7.1) (push) Has been cancelled
ci-cd / build-axolotl (<nil>, 128, 12.8.1, true, 3.11, 2.8.0) (push) Has been cancelled
ci-cd / build-axolotl (vllm, 126, 12.6.3, 3.11, 2.7.1) (push) Has been cancelled
publish pypi / Create Release (push) Has been cancelled
ci-cd / build-axolotl-cloud (<nil>, 126, 12.6.3, 3.11, 2.7.0) (push) Has been cancelled
ci-cd / build-axolotl-cloud (<nil>, 126, 12.6.3, <nil>, 3.11, 2.7.1) (push) Has been cancelled
ci-cd / build-axolotl-cloud (<nil>, 128, 12.8.1, 3.11, 2.7.1) (push) Has been cancelled
ci-cd / build-axolotl-cloud (<nil>, 128, 12.8.1, true, 3.11, 2.8.0) (push) Has been cancelled
ci-cd / build-axolotl-cloud (vllm, 126, 12.6.3, 3.11, 2.7.1) (push) Has been cancelled
ci-cd / build-axolotl-cloud-no-tmux (<nil>, 126, 12.6.3, <nil>, 3.11, 2.7.1) (push) Has been cancelled
ci-cd / build-axolotl-cloud-no-tmux (<nil>, 128, 12.8.1, <nil>, 3.11, 2.8.0) (push) Has been cancelled
ci-cd / build-axolotl-cloud-no-tmux (vllm, 126, 12.6.3, true, 3.11, 2.7.1) (push) Has been cancelled
publish pypi / Upload release to PyPI (push) Has been cancelled
v0.13.0
2025-11-18 14:45:21 +07:00
NanoCode012
4e55871112 feat: Add opt-out Telemetry (#3237)
* initial telemetry manager impl

* adding todo

* updates

* updates

* progress on telemetry: config load, process, model load, train start / end, error tracking

* update error file path sanitization function; adding more error tracking

* updated sanitization logic, tests

* adding runtime metrics (cpu + gpu memory, steps/s, etc.)

* tests for runtime metrics telemetry and assoc. callback

* small update / fix

* simplifying path redaction

* sleep on all ranks in distributed setting

* adding back in base_model redaction w/ whitelist

* fix

* doc update

* improved redaction, send system info during model config load telemetry, etc.

* adding runtime metrics / system info additional accelerator support, etc.

* adding runtime metrics / system info additional accelerator support, etc.

* remove duplicate info

* fixes

* fix issue with tests in ci

* distributed fix

* opt-in version of telemetry

* enable / disable logic update

* docs fix

* doc update

* minor fixes

* simplifying

* slight changes

* fix

* lint

* update posthog dep

* coderabbit comments

* fix: opt-in model

* fix: increase time since last

* fix: increase whitelist orgs

* fix: posthog init and shutdown

* fix: imports

* fix: also check grad norm

* fix: duplicate plugin_manager calls

* fix: bad merge

* chore: update docs

* fix: cache process per comment

* fix: error handling

* fix: tests

* Revert "fix: error handling"

This reverts commit 22d1ea5755.

* fix: test telemetry error_handled bool

* fix: revert test

* chore: final doc fixes

---------

Co-authored-by: Dan Saunders <danjsaund@gmail.com>
Co-authored-by: Dan Saunders <dan@axolotl.ai>
2025-11-18 11:35:25 +07:00
Wing Lian
a6bafb55cb upgrade datasets to 4.4.1 (#3266)
* upgrade datasets

* cleanup pip cache earlier

* cleanup unused things from worker

* also cleanup sdist
2025-11-14 09:52:14 -08:00
Wing Lian
0fbde69e9c only push axolotl images, personal repo is deprecated (#3262)
* only push axolotl images, personal repo is deprecated

* cleanup
2025-11-14 07:50:03 -08:00
Wing Lian
301e22849f upgrade to latest deepspeed and make sure latest tagged axolotl images are using torch 2.8.0 (#3261) 2025-11-13 13:03:01 -05:00
VED
dcf24fd24e feat: save checkpoint after training started (#3233)
* add:config parameters for checkpoint

* callback main

* test file_type fix

* lint

* unit

* simplify dict/obj handeling

* Update src/axolotl/utils/schemas/dynamic_checkpoint.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Delete tests/e2e/integrations/__init__.py

* remove hard code path in test

* device check

* lint

* Update src/axolotl/utils/callbacks/dynamic_checkpoint.py

Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>

* Update src/axolotl/utils/callbacks/dynamic_checkpoint.py

Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>

* Update src/axolotl/utils/schemas/dynamic_checkpoint.py

Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>

* lint-2

* remove: singal based checkpoints

* lint

* remove signal tests

* add:is_main_process

* lint

* addis_d:istributed() for tests

* remove nested is_main_process

* Update src/axolotl/utils/schemas/dynamic_checkpoint.py

Co-authored-by: Wing Lian <wing.lian@gmail.com>

* Update src/axolotl/utils/schemas/dynamic_checkpoint.py

Co-authored-by: Wing Lian <wing.lian@gmail.com>

* add user_defined_filename

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
Co-authored-by: Wing Lian <wing.lian@gmail.com>
2025-11-13 10:21:05 -05:00
NanoCode012
49b8107989 feat: add granite4 examples (#3256) [skip ci] 2025-11-13 10:19:16 -05:00
NanoCode012
9901ee5602 fix: voxtralprocessor broken (#3255) [skip ci]
* fix: voxtralprocessor broken

* chore: add todo

* chore: wording
2025-11-13 10:18:42 -05:00
xzuyn
dd78f2e0cc Fix: warmup_steps: 0 & warmup_ratio: 0 not disabling warmup (#3254)
* fix unintentional falsy checks

* chore: lint

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-11-11 10:32:06 +07:00
Eduard Zl
b54f9c942b _get_tools in ChatTemplateStrategy : function "parameters" can be dict or string (#3238)
* When training of function calls, "tools" elements of a dataset can contain same parameter name but with different types. Datasets fails to load such training set. This fix allows "parameters" element of function call to be string( by running "json.dumps" in preparation of training data set). The _get_tools function will iterate over tool definitions, if "parameters" element is dict, it will keep that way, if it is a string, it will be converted to dict by invoking "json.loads" on string value.

* feat: add doc on tool parameters json loading

* feat: add tests for parameters json string

---------

Co-authored-by: ezlotnik <eduard_zlotnik@intuit.com>
Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-11-11 09:04:28 +07:00
NanoCode012
11eb36585a feat: add arg to enable dft in liger (#3125)
* feat: add arg to enable dft in liger

* feat: add tests use_token_scaling

* fix: test

* fix: move check to args
2025-11-10 21:37:47 +07:00
NanoCode012
d0c846fc5e feat: add granitemoeshared and granitemoehybrid (#3158) 2025-11-10 21:35:45 +07:00
Wing Lian
b5fcc2f14b log cumulative total trained tokens (#3252)
* log cumulative total trained tokens

* use is_distributed helper
2025-11-07 16:04:00 -05:00
Wing Lian
b62eed8809 add openenv-core to requirements (#3251) 2025-11-07 12:17:27 -05:00
VED
ed2e8cacd6 feat:openenv rollout_func (#3239) [skip ci]
* feat:openenv rollout_func

* chore lint

* docs

* add:docs processing_class

* tests

* lint
2025-11-07 08:51:40 -05:00
Lê Nam Khánh
80270a92fa Fix typos in some files (#3250) [skip ci] 2025-11-07 08:21:20 -05:00
Wing Lian
bfdc9a8249 upgrade trl and other hf deps (#3249)
* upgrade trl and other hf deps

* skip simpo for now
2025-11-06 16:06:03 -05:00