Wing Lian
f196941315
additional fixes for docker and saving compressed models
2025-04-28 13:16:29 -04:00
Rahul Tuli
5be047ac46
Fix: Test
...
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
2025-04-28 13:16:29 -04:00
Rahul Tuli
758115b8c6
Apply patch from @winglian
...
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
2025-04-28 13:16:29 -04:00
Rahul Tuli
0dc1da5876
Add: line about further optimizations using llmcompressor
...
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
2025-04-28 13:16:29 -04:00
Rahul Tuli
f3e876dbfc
Address Review Comments:
...
* deleted redundant docs/llm_compressor.qmd
* incorporated feedback in integration README.md
* added llmcompressor integration to docs/custom_integrations.qmd
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
2025-04-28 13:16:29 -04:00
Rahul Tuli
99c13ef60c
Add: .qmd file
2025-04-28 13:16:29 -04:00
Rahul Tuli
2c24434ee0
Tests, Style, Updates
2025-04-28 13:16:29 -04:00
Rahul Tuli
586268a0d7
Rebase and updates!
2025-04-28 13:16:29 -04:00
Rahul Tuli
b600e119b6
Add: llm_compressor integration documentation
2025-04-28 13:16:29 -04:00
Rahul Tuli
a8e5ba000e
Move: LLMCompressorPlugin into its own submodule
2025-04-28 13:16:29 -04:00
Rahul Tuli
bc3dfa666d
Update model config
2025-04-28 13:16:29 -04:00
Rahul Tuli
4371f3459e
Use: absolute import
2025-04-28 13:16:29 -04:00
Rahul Tuli
cc58d5e072
Rename: sft.yaml to sparse-finetuning.yaml
2025-04-28 13:16:29 -04:00
Rahul Tuli
d197b054e3
Add: llmcompressor installable
2025-04-28 13:16:29 -04:00
Rahul Tuli
7e1e153831
Address review comments from @markurtz
2025-04-28 13:16:29 -04:00
Rahul Tuli
42de3096cf
Apply suggestions from @markurtz
...
Co-authored-by: Mark Kurtz <mark.j.kurtz@gmail.com>
2025-04-28 13:16:29 -04:00
Rahul Tuli
27758840a1
Update llmcompressor version to latest
2025-04-28 13:16:29 -04:00
Rahul Tuli
8dbf5c215a
Revert: TODOs
2025-04-28 13:16:29 -04:00
Rahul Tuli
6411ca3fe1
Use: warning over warn
2025-04-28 13:16:29 -04:00
Rahul Tuli
813809c54d
pre-commit hooks
2025-04-28 13:16:29 -04:00
Rahul Tuli
af7cfdc30b
Add: llmcompressor installable
2025-04-28 13:16:29 -04:00
Rahul Tuli
b76d2d1130
Update: review comments!
2025-04-28 13:16:29 -04:00
Rahul Tuli
7946f89df4
Add: SFTPlugin with llmcompressor
2025-04-28 13:16:29 -04:00
Dhruv Mullick
8b33ae1c4f
Fix bug in grpo reward module import (#2571)
2025-04-28 00:31:56 -04:00
Wing Lian
dc4da4a7e2
update trl to 0.17.0 (#2560)
...
* update trl to 0.17.0
* grpo + vllm no longer supported with 2.5.1 due to vllm constraints
* disable VLLM_USE_V1 for ci
* improve handling of killing off the multiprocessing vllm service
* debug why this doesn't run in CI
* increase vllm wait time
* increase timeout to 5min
* upgrade to vllm 0.8.4
* dump out the vllm log for debugging
* use debug logging
* increase vllm start timeout
* use NVL instead
* disable torch compile cache
* revert some commented checks now that grpo tests are fixed
* increase vllm timeout back to 5min
2025-04-27 19:19:53 -04:00
Wing Lian
f9c7c3bb72
don't use is_main_process during config validation (#2569)
2025-04-26 14:14:52 -04:00
Wing Lian
caf5cb63ea
add e2e smoke test for using activation/gradient checkpointing with offload (#2565)
...
* add e2e smoke test for using activation/gradient checkpointing with offload
* disable duplicate code check for the test
* fix relative import
* seq len too small to test this dataset with packing
* fix checkpoint patching for tests
2025-04-25 21:11:17 -04:00
Wing Lian
5dba5c82a8
fix support for wandb run_name for rl trainers (#2566) [skip ci]
...
* fix support for wandb run_name for rl trainers
* prefer to use wandb random names for run_name
2025-04-25 21:10:54 -04:00
Chiwan Park
e3c9d541a7
fix: crash when pretraining_dataset is used with dispatch_batches set to false (#2558)
2025-04-25 17:15:03 -04:00
NanoCode012
9eba0ad118
chore(doc): update docker tags in docs (#2559) [skip ci]
2025-04-25 17:14:48 -04:00
Wing Lian
53dbf97d85
make cce default to true when using the plugin (#2562) [skip ci]
2025-04-25 17:14:26 -04:00
Eko Julianto Salim
2c2563bc34
fix: gradient checkpointing functools.partial object has no attribute __self__ (#2563) [skip ci]
...
* fix: gradient checkpointing causing functools.partial error
* lint
* chore: lint
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-04-25 17:02:37 -04:00
Wing Lian
5cb3398460
don't fail on codecov upload for external contributor PRs (#2564) [skip ci]
2025-04-25 15:10:55 -04:00
Dan Saunders
ae1c7ace63
Sequence parallel training context manager (#2553)
...
* ctx manager for SP
* updates
* update
* further simplifying
* accommodate both training context managers
* simplifying
* simplifying
* nit
* reorg
* tweak codecov yaml
* add gather post hook, simplify, fixes
* pytest
* pytest fix
2025-04-25 10:33:54 -04:00
Wing Lian
1447beb132
make sure to validate the config before normalizing so defaults get set (#2554)
...
* make sure to validate the config before normalizing so defaults get set
* validation not needed for particular test
* remove duplicate validations
* set qlora correctly
2025-04-24 13:01:43 -04:00
Dan Saunders
66f41ec6f1
disable codecov pr annotations (#2556)
2025-04-24 08:51:51 -04:00
NanoCode012
85053f4bd4
Fix(doc): add delinearize instruction (#2545)
...
* fix: mention installing pytorch before axolotl
* feat(doc): include instruction to delinearize
* fix: update instruction for delinearize with adapter
2025-04-24 01:03:43 -04:00
Wing Lian
a4d5112ae1
builds for torch 2.7.0 (#2552)
...
* builds for torch==2.7.0
* use xformers==0.0.29.post3
* no vllm support with torch 2.7
* update default, fix conditional
* no xformers for 2.7.0
* no vllm on 2.7.0 for multigpu test too
* remove deprecated verbose arg from scheduler
* 2.7.0 tests on cpu
2025-04-24 00:39:31 -04:00
Wing Lian
0d691cc2a7
add base docker image with pytorch 2.7.0 and variant for cuda 12.8 (#2551)
...
* add base docker image with pytorch 2.7.0 and variant for cuda 12.8
* my bash is terrible
2025-04-23 14:59:03 -04:00
Dan Saunders
c4053481ff
Codecov fixes / improvements (#2549)
...
* adding codecov reporting
* random change
* codecov fixes
* adding missing dependency
* fix
---------
Co-authored-by: Dan Saunders <dan@axolotl.ai>
2025-04-23 10:33:30 -04:00
NanoCode012
a6d28d19b1
feat: add glm and glm4 multipack and cce (#2546)
...
* feat: add glm and glm4 multipack
* feat: add glm4 example
* feat: add cce for glm
2025-04-23 10:27:51 -04:00
Wing Lian
32e335dd51
fix missing host/port for vllm (#2543)
...
* fix missing host/port for vllm
* set tensor parallel size so it doesn't always default to cli override
2025-04-22 10:16:48 -04:00
Wing Lian
7651550850
make sure to download fixtures for kd test (#2541)
...
* make sure to download fixtures for kd test
* use same alpaca dataset
2025-04-21 10:31:50 -04:00
Wing Lian
341e95aac9
prevent rate limiting to hf when using dispatch batches (#2536) [skip ci]
2025-04-21 10:31:35 -04:00
Catgat
b882dfb63f
Fixed Rex Scheduler Warm Up (#2535) [skip ci]
...
* Fixed Rex Scheduler Warm Up
* chore: lint
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-04-21 10:30:55 -04:00
Wing Lian
b640db1dbc
don't run multigpu tests twice, run SP in separate test (#2542)
...
* don't run multigpu tests twice, run SP in separate test
* fix multiline
2025-04-21 10:24:13 -04:00
Chiwan Park
4ce469d32e
fix: upgrade liger to 0.5.8 and use native Gemma3 patches (#2527)
...
* fix: upgrade liger to 0.5.8 and use native Gemma3 patches
* fix: make lint happy
* doc: update Liger Kernel FLCE support for Gemma 3
2025-04-18 09:57:40 -07:00
Wing Lian
60a8f0958d
zero val fix for beta (#2538)
2025-04-17 17:27:19 -07:00
NanoCode012
9da730d6a4
fix(doc): cut cross entropy installation instructions broken in qmd (#2532)
2025-04-16 15:02:51 -07:00
NanoCode012
32637fad00
fix: preprocess yielding whole dataset to each worker (#2503) [skip ci]
2025-04-16 15:02:35 -07:00