Commit Graph

2258 Commits

Author SHA1 Message Date
mhenrhcsen
ea234afa8a Enhance model loading logic to include support for GraniteSpeechConfig, allowing for the use of the specific model class for Granite Speech. 2025-07-17 19:45:23 +02:00
mhenrhcsen
738adb2258 fixes 2025-07-16 21:50:56 +02:00
mhenrhcsen
f40e8caa28 checks 2025-07-16 21:30:01 +02:00
mhenrhcsen
f9bdf1fb44 checks 2025-07-16 21:23:25 +02:00
mhenrhcsen
2f670a5988 Fix: Update model loading logic to conditionally upcast based on lm_head presence for btlm models 2025-07-16 21:16:47 +02:00
mhenrhcsen
84ad69afad Fix: Ensure tie_weights method is called only if it exists in the model class 2025-07-16 21:03:49 +02:00
Wing Lian
36cbe13d18 activation offloading with cuda streams doesn't work with LoRA (#2927) 2025-07-16 11:59:20 -04:00
Wing Lian
2c408b5c5e Apply generic fused liger ce, cce, and tiledmlp for arbitrary models (#2908)
* Apply generic fused liger ce for unknown models

* fix deepseek liger modeling

* generic cce and config tiled mlp to use original mlp and auto detect compute params

* fix weight and lint

* update warnings

* address PR feedback

* use lookup for model class prefixes

* revert inadvertent change to flash attn version

* remove un-needed pylint annotations

* fix import
2025-07-15 22:40:41 -04:00
Wing Lian
942005f526 use modal==1.0.2 for nightlies and for cli (#2925) [skip ci]
* use modal==1.0.2 for nightlies and for cli

* use latest cce fork for upstream changes

* increase timeout
2025-07-15 20:31:23 -04:00
Dan Saunders
10ba1622f7 checkpoint model on first step callback (#2906)
* checkpoint model on first step callback

* remove debug

* add test cases; update existing tests not to save on first step

* move test out of solo

* delete

* default to False

* typo
2025-07-15 15:00:48 -04:00
Wing Lian
d320ef6199 fix for upstream refactor of KwargsForCausalLM (#2911) 2025-07-15 11:28:41 -04:00
NanoCode012
354eaaf0d3 feat: add call method to mistral tokenizer wrapper (#2898) 2025-07-14 22:33:35 -04:00
greenhestu
a061446540 Fix: Prevents merging of tool arguments during preprocessing (#2909) 2025-07-14 22:33:10 -04:00
Wing Lian
cd079b5536 Tensor parallel w DeepSpeed AutoTP (#2574)
* support for deepspeed autotp

* bump to latest deepspeed that supports deepcompile too

* add deepcompile support too

* fix total steps calculation for TP

* setup fixture for tp

* update ds config to ensure weights are gathered for checkpoint

* fix duplicate validation names

* chore: lint
2025-07-14 21:33:48 -04:00
Wing Lian
5cc16040a8 move the plugin post trainer create to the setup trainer (#2907)
* move the plugin post trainer create to the setup trainer

* move post-train plugins to execute-training fn
2025-07-14 20:11:33 -04:00
Wing Lian
38359a8997 allow profiling in mid-training rather than from the start (#2899) [skip ci]
* allow profiling in mid-training rather than from the start

* simplify based on PR feedback

* fix logic, improve saving at end, add tests
2025-07-14 20:11:11 -04:00
Wing Lian
7dc3ac6cb3 update nightlies builds (#2921) [skip ci] 2025-07-14 20:10:43 -04:00
Wing Lian
99187cd208 Activation Offloading w CUDA Streams (#2900) [skip ci]
* use cuda streams for activation offloading

* use torch native ops

* update cfg schema for streams

* fix literal constructor for set

* use context for training step so it doesn't affect evals

* disable streams

* auto gc on eval steps

* use activation_offloading config arg

* add docs for gradient checkpointing

* handle validation for gc/ao

* use cuda streams for act offloading

* add more validation for AC w/o GC

* fix docs

* move activation_offloading lower in definition so it doesn't break args/kwargs

* fix kd due to import order
2025-07-14 20:10:20 -04:00
Wing Lian
aa684122f1 upgrade peft==0.16.0 and datasets==4.0.0 (#2917) [skip ci]
* upgrade peft to 0.16.0

* upgrade datasets to 4.0.0

* refactor dupes from merge/rebase

* fix check for fsdp1 + sharded_state_dict

* use full state dict for ci
2025-07-14 20:09:26 -04:00
Wing Lian
ca4d4ef793 don't init distributed for deepspeed if preprocessing (#2920)
* don't init distributed for deepspeed if preprocessing

* add e2e test to validate preprocess cli with deepspeed

* ignore duplicate code for cfg
2025-07-14 14:19:19 -04:00
Dan Saunders
37edbe4999 Remove extra torch.compile call (#2904)
* debug

* debug

* debug

* moving validation code to transformers

* revert unneeded change

* add accelerator config to base trainer builder

* add back accumulated_cache_size_limit setting

* lint
2025-07-14 12:32:45 -04:00
Wing Lian
e581c15d40 refactor dupes from merge/rebase (#2919) [skip ci] 2025-07-14 10:05:26 -04:00
Wing Lian
af92151a7b FSDP2 fix validation and add tests (#2910)
* fix validation and add tests

* remove debugging and add more tests

* remove migrate_fsdp
2025-07-14 09:25:44 -04:00
Wing Lian
80dc4c261a fix xformers version for python 2.6 (#2916) [skip ci] 2025-07-14 09:24:29 -04:00
Wing Lian
7ccbbd8e77 upgrade liger to 0.6.0 (#2893) [skip ci] 2025-07-14 09:24:07 -04:00
Wing Lian
5081db7f8a upgrade trl==0.19.1 (#2892) [skip ci]
* upgrade trl==0.19.1

* add vllm for tests for grpo

* fixes to work with latest trl

* need data_parallel_size config too

* support for vllm_mode for server / colocate

* vllm settings for colocate

* relax vllm version

* bump min hf hub for latest vllm support

* add hints on string literal for vllm mode

* use latest transformers 4.53.2

* tweak acceptable loss on flaky test_ds_zero3_packed test

* don't run flaky vllm/grpo tests for now
2025-07-14 09:23:42 -04:00
Wing Lian
41664c7c4c fix ddp for incorrect steps (#2915)
* fix ddp for incorrect steps

* add test
2025-07-14 07:51:16 -04:00
Wing Lian
9a8073e73d Liquid Foundation Model 2 support (#2905)
* LFM2 support

* docs

* packing seems to work

* update install to force install in case already on dev version

* default to use chunked cross entropy
2025-07-12 11:41:34 -04:00
Jiawei Liu
7fb8441e0e fix: customized dataset with simpo (#2894) [skip ci] 2025-07-12 11:40:30 -04:00
NanoCode012
4dc5910e1c feat(doc): re-add docker 2.7.0 tag back (#2902) [skip ci] 2025-07-12 11:40:01 -04:00
Wing Lian
fb7bc9250d move unmaintained examples to archive (#2903) [skip ci] 2025-07-12 11:39:51 -04:00
salman
d6e4a611e5 FSDP1 -> FSDP2 (#2760)
* FSDP2 args migration implementation

This commit implements the migration to FSDP2 arguments including:
- FSDP2 support with LoRA training
- DPO integration with FSDP2
- Model loading fixes and refactoring
- CPU offloading and PEFT handling
- Test updates and CI improvements
- Bug fixes for dtype errors and various edge cases
2025-07-12 15:18:01 +01:00
Ed Sealing
eb662557a7 Register Plugins in Ray Workers (#2901) [skip ci]
* Access plugins in ray cluster

* Add comment

* chore: lint

---------

Co-authored-by: Ed Sealing <ed.sealing@patapsco.ai>
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-07-11 16:59:59 -04:00
salman
03b2a113fe Update doc preview workflow to use sticky comments (#2873) 2025-07-11 14:08:35 +01:00
NanoCode012
9b95a625ab feat: add devstral small 2507 (#2896)
* feat: add devstral small 2507

* chore: update blog doc
2025-07-11 09:34:19 +07:00
Wing Lian
c370d0795c [doc] Fix docs for text field mapping for completion datasets (#2890)
* Fix docs for text field mapping for completion datasets

* update another reference
2025-07-09 14:52:44 -04:00
Wing Lian
76aeb16156 tiled_mlp supports single gpu (#2891)
* tiled_mlp supports single gpu

* use checkpoint offloading for arctic training

* patch torch checkpoint too

* support for single gpu zero3

* add linkback to where it was copied from
2025-07-09 12:48:22 -04:00
Wing Lian
7c5ea0010f bump dev version (#2889) [skip ci] 2025-07-09 09:43:42 -04:00
Wing Lian
c6d69d5c1b release v0.11.0 (#2875)
* release v0.11.0

* don't build vllm into release for now

* remove 2.5.1 references

* smollm3 multipack support

* fix ordering of e2e tests
v0.11.0
2025-07-09 09:22:35 -04:00
Wing Lian
4ff96a2526 fix xformers version (#2888) 2025-07-09 08:43:40 -04:00
salman
89e99eaaa7 slowest durations (#2887) [skip ci] 2025-07-09 08:43:26 -04:00
Wing Lian
6ed501f6dc add 2.7.0 torch images back to support vllm (#2885) 2025-07-08 16:28:14 -04:00
NanoCode012
8c6a6ea6eb Feat: add devstral model support (#2880) [skip ci]
* fix: do not add training and training_detail block by default

* fixed: magistral docs

* fix: address pad adding new fields and use built-in from_openai

* feat: try enable multiprocessing

* fix: check for keys before deleting attn_mask

* feat: add mistral pad test

* feat: add tool calling test

* feat: add devstral tokenizer tests

* fix: comma format

* chore: remove unused support_preprocessing as tokenizer is picklable now

* chore: update magistral doc

* feat: add devstral readme and example

* chore: refactor error handling
2025-07-08 11:01:19 -04:00
NanoCode012
78bff4925e fix: set add_generation_prompt to False when apply chat template (#2859) [skip ci] 2025-07-08 11:00:44 -04:00
NanoCode012
b237c8a3f3 chore: update cce commit to include gemma3n fixes (#2881) [skip ci] 2025-07-08 10:59:35 -04:00
float-trip
1032e22650 Fix link in FSDP + QLoRA docs. (#2879) [skip ci] 2025-07-08 09:19:09 -04:00
Wing Lian
d68cc1e8ab densemixer plugin integration (#2868)
* densemixer plugin integration

* update readme with usage docs

* automatically find new integrations that aren't explicitly defined

* make sure to import os
2025-07-07 17:05:19 -04:00
github-actions[bot]
21f1bf4805 chore: update pre-commit hooks (#2870) [skip ci]
* chore: update pre-commit hooks

* don't bandit huggingface hub downloads without revision

---------

Co-authored-by: djsaunde <1245942+djsaunde@users.noreply.github.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-07-07 15:26:15 -04:00
Wing Lian
de2c5ba103 mark flaky geglu tests and add torch seed (#2876) [skip ci]
* mark flaky geglu tests and add torch seed

* restore accidental removal of seed
2025-07-07 15:24:16 -04:00
Wing Lian
9c0d7ee761 TiledMLP support (#2865) 2025-07-07 15:23:49 -04:00