Wing Lian
388e950016
restore dockerfile
2025-04-26 16:21:30 -04:00
NanoCode012
fb4adbb311
fix: trim allowed cuda versions
2025-04-26 16:21:30 -04:00
Wing Lian
5e8abca54f
use axolotl cloud image as base and various fixes
2025-04-26 16:21:30 -04:00
Wing Lian
168ec339e5
chore: lint
2025-04-26 16:21:30 -04:00
zeke
cb7185998b
remove LICENSE and fix README
2025-04-26 16:21:30 -04:00
zeke
c2fc35f520
Add runpod sls handler
2025-04-26 16:21:30 -04:00
Wing Lian
f9c7c3bb72
don't use is_main_process during config validation (#2569)
2025-04-26 14:14:52 -04:00
Wing Lian
caf5cb63ea
add e2e smoke test for using activation/gradient checkpointing with offload (#2565)
* add e2e smoke test for using activation/gradient checkpointing with offload
* disable duplicate code check for the test
* fix relative import
* seq len too small to test this dataset with packing
* Fix checkpoint patching for tests
2025-04-25 21:11:17 -04:00
Wing Lian
5dba5c82a8
fix support for wandb run_name for rl trainers (#2566) [skip ci]
* fix support for wandb run_name for rl trainers
* prefer to use wandb random names for run_name
2025-04-25 21:10:54 -04:00
Chiwan Park
e3c9d541a7
fix: crash when pretraining_dataset with dispatch_batches is false (#2558)
2025-04-25 17:15:03 -04:00
NanoCode012
9eba0ad118
chore(doc): update docker tags on doc (#2559) [skip ci]
2025-04-25 17:14:48 -04:00
Wing Lian
53dbf97d85
make cce default to true when using the plugin (#2562) [skip ci]
2025-04-25 17:14:26 -04:00
Eko Julianto Salim
2c2563bc34
fix: gradient checkpointing functools.partial object has no attribute __self__ (#2563) [skip ci]
* fix: gradient checkpointing causing functools.partial error
* lint
* chore: lint
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-04-25 17:02:37 -04:00
Wing Lian
5cb3398460
don't fail on codecov upload for external contributor PRs (#2564) [skip ci]
2025-04-25 15:10:55 -04:00
Dan Saunders
ae1c7ace63
Sequence parallel training context manager (#2553)
* ctx manager for SP
* updates
* update
* further simplifying
* accommodate both training context managers
* simplifying
* simplifying
* nit
* reorg
* tweak codecov yaml
* add gather post hook, simplify, fixes
* pytest
* pytest fix
2025-04-25 10:33:54 -04:00
Wing Lian
1447beb132
make sure to validate the config before normalizing so defaults get set (#2554)
* make sure to validate the config before normalizing so defaults get set
* validation not needed for particular test
* remove duplicate validations
* set qlora correctly
2025-04-24 13:01:43 -04:00
Dan Saunders
66f41ec6f1
disable codecov pr annotations (#2556)
2025-04-24 08:51:51 -04:00
NanoCode012
85053f4bd4
Fix(doc): add delinearize instruction (#2545)
* fix: mention to install pytorch before axolotl
* feat(doc): include instruction to delinearize
* fix: update instruction for delinearize with adapter
2025-04-24 01:03:43 -04:00
Wing Lian
a4d5112ae1
builds for torch 2.7.0 (#2552)
* builds for torch==2.7.0
* use xformers==0.0.29.post3
* no vllm support with torch 2.7
* update default, fix conditional
* no xformers for 270
* no vllm on 2.7.0 for multigpu test too
* remove deprecated verbose arg from scheduler
* 2.7.0 tests on cpu
2025-04-24 00:39:31 -04:00
Wing Lian
0d691cc2a7
add base docker image with pytorch 2.7.0 and variant for cuda 12.8 (#2551)
* add base docker image with pytorch 2.7.0 and variant for cuda 12.8
* my bash is terrible
2025-04-23 14:59:03 -04:00
Dan Saunders
c4053481ff
Codecov fixes / improvements (#2549)
* adding codecov reporting
* random change
* codecov fixes
* adding missing dependency
* fix
---------
Co-authored-by: Dan Saunders <dan@axolotl.ai>
2025-04-23 10:33:30 -04:00
NanoCode012
a6d28d19b1
feat: add glm and glm4 multipack and cce (#2546)
* feat: add glm and glm4 multipack
* feat: add glm4 example
* feat: add cce for glm
2025-04-23 10:27:51 -04:00
Wing Lian
32e335dd51
fix missing host/port for vllm (#2543)
* fix missing host/port for vllm
* set tensor parallel size so it doesn't always default to cli override
2025-04-22 10:16:48 -04:00
Wing Lian
7651550850
make sure to download fixtures for kd test (#2541)
* make sure to download fixtures for kd test
* use same alpaca dataset
2025-04-21 10:31:50 -04:00
Wing Lian
341e95aac9
prevent rate limiting to hf when using dispatch batches (#2536) [skip ci]
2025-04-21 10:31:35 -04:00
Catgat
b882dfb63f
Fixed Rex Scheduler Warm Up (#2535) [skip ci]
* Fixed Rex Scheduler Warm Up
* chore: lint
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-04-21 10:30:55 -04:00
Wing Lian
b640db1dbc
don't run multigpu tests twice, run SP in separate test (#2542)
* don't run multigpu tests twice, run SP in separate test
* fix multiline
2025-04-21 10:24:13 -04:00
Chiwan Park
4ce469d32e
fix: upgrade liger to 0.5.8 and use native Gemma3 patches (#2527)
* fix: upgrade liger to 0.5.8 and use native Gemma3 patches
* fix: make lint happy
* doc: update Liger Kernel FLCE support for Gemma 3
2025-04-18 09:57:40 -07:00
Wing Lian
60a8f0958d
zero val fix for beta (#2538)
2025-04-17 17:27:19 -07:00
NanoCode012
9da730d6a4
fix(doc): cut cross entropy installation instructions broken in qmd (#2532)
2025-04-16 15:02:51 -07:00
NanoCode012
32637fad00
fix: preprocess yielding whole dataset to each worker (#2503) [skip ci]
2025-04-16 15:02:35 -07:00
Dan Saunders
f776f889a1
adding codecov reporting (#2372) [skip ci]
* adding codecov reporting
* update codecov-action to v5
* fix
---------
Co-authored-by: Dan Saunders <dan@axolotl.ai>
2025-04-16 15:02:17 -07:00
Wing Lian
69eda209a6
re-enable DS zero3 ci with updated transformers (#2533)
2025-04-16 14:48:40 -07:00
Dan Saunders
b8c633aa97
batch api HF adapter for ring-flash-attn; cleanup and improvements (#2520)
* batch api HF adapter for ring-flash-attn; cleanup and improvements
* update
* adding all batch ring-flash-attn methods via single adapter
* removing pad_to_sequence_len=False for now
* fix
* updating docs to include batch SP
* review comments
* fixes for batch API funcs, simplify
* fixes
* fix
* updates
* add batch_zigzag smoke test
2025-04-16 13:50:48 -04:00
NanoCode012
682a9cf79b
Fix: add delinearization and make qlora work with fsdp2 (#2515)
* fixes for delinearization, and make qlora work with fsdp2
* Add back mistakenly removed lm_eval
* typo [skip ci]
* patch evals for torch.compile + fsdp2
* also check torch_compile w fsdp2
* lots of fixes for flex attn with llama4
* fix patch check and patch llama4 too
* attempt to make the patches stick
* use transformers 4.51.2
* update configs and README for llama4
* remove torch.compile for CI test
* cleanup any existing singletons
* set singleton cache to None instead of deleting
* use importlib reload with monkeypatch
* don't worry about transformers version, mark inputs with grads, fix regex
* make sure embeds aren't on cpu
* logging and mem improvements
* vllm version and add to docker, make sure to save processor on conversion
* fix ambiguous tensor bool check
* fix vllm to not use v1, upgrade hf transformers
* fix tests
* make flex_attn_compile_kwargs configurable, since this depends on model params
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: Salman Mohammadi <salman.mohammadi@outlook.com>
2025-04-15 23:31:39 -07:00
NanoCode012
271b24cccc
feat: update cce to latest (#2521)
2025-04-15 22:17:10 -07:00
Wing Lian
198d775d6d
make sure all of the model is on the same device, so this test will pass on multigpu (#2524) [skip ci]
2025-04-15 22:15:42 -07:00
NanoCode012
e4307fb7d7
feat: add examples for deepcoder (#2517)
2025-04-12 07:25:23 -07:00
Wing Lian
dd8bad06d0
remove strict=false from example yamls (#2523) [skip ci]
2025-04-12 07:25:11 -07:00
Wing Lian
de8a625dd7
make e2e tests a bit faster by reducing test split size (#2522) [skip ci]
* [ci] make e2e tests a bit faster by reducing test split size
* use 10% split of alpaca dataset to speed up dataset loading/tokenization
* reduce gas 4->2 for most e2e tests
* increase val set size for packing
2025-04-12 07:24:43 -07:00
NanoCode012
51267ded04
chore: update doc links (#2509)
* chore: update doc links
* fix: address pr feedback
2025-04-11 09:53:18 -04:00
NanoCode012
756a0559c1
feat(doc): explain deepspeed configs (#2514) [skip ci]
* feat(doc): explain deepspeed configs
* fix: add fetch configs
2025-04-11 09:52:43 -04:00
NanoCode012
9a8e3e9c7b
Feat(examples): add deepcogito (#2516) [skip ci]
* feat: add examples for deepcogito
* fix: reduce num evals per epoch
* fix: reduce num epochs
2025-04-11 09:52:23 -04:00
Wing Lian
7e7180fa10
add mocks for loading datasets in cli train tests (#2497) [skip ci]
* add mocks for loading datasets in cli train tests
* Apply suggestions from code review to fix patched module for preprocess
Co-authored-by: NanoCode012 <nano@axolotl.ai>
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-04-11 09:51:59 -04:00
Sung Ching Liu
22c562533d
Update rlhf.qmd (#2519)
Fix typo in the command that spawns a vllm server: should be `axolotl vllm-serve`, not `axolotl vllm_serve`
2025-04-10 11:33:09 -04:00
NanoCode012
16823e1de6
feat: add CNAME (#2513)
2025-04-10 12:34:25 +07:00
NanoCode012
e0420b3528
fix: allow merge lora on pre-quantized model (#2511)
* fix: allow merge lora on pre-quantized model
* fix: remove unused sections per comment
2025-04-09 14:01:42 -04:00
Wing Lian
9f986f5e71
Add Llama4 maverick examples (#2512)
2025-04-09 14:01:28 -04:00
NanoCode012
f85861a0b2
fix: liger swiglu for llama4 (#2504)
* fix: liger swiglu for llama4
* feat: add liger to deepseek v3
* fix: unpack not found
* fix: spelling
* fix: comment out deepseek v3
* fix: retest deepseek
* fix: map glu
* fix: patch model forward
* chore: add temp code to save
* fix: remove deepseek to move into separate PR
2025-04-09 02:53:17 -04:00
Wing Lian
630e40dd13
upgrade transformers to 4.51.1 (#2508)
* upgrade transformers to 4.51.1
* multigpu longer timeout
2025-04-09 02:53:00 -04:00