Commit Graph

15 Commits

Author SHA1 Message Date
Wing Lian
0aeb277456 add e2e smoke tests for llama liger integration (#1884)
* add e2e smoke tests for llama liger integration

* fix import

* don't use __main__ for test

* consolidate line
2024-09-01 19:29:37 -04:00
Wing Lian
e8ff5d5738 don't mess with bnb since it needs compiled wheels (#1859) 2024-08-23 12:18:47 -04:00
Wing Lian
dcbff16983 run nightly ci builds against upstream main (#1851)
* run nightly ci builds against upstream main

* add test badges

* run the multigpu tests against nightly main builds too
2024-08-22 13:10:54 -04:00
Wing Lian
54392ac8a6 Attempt to run multigpu in PR CI for now to ensure it works (#1815) [skip ci]
* Attempt to run multigpu in PR CI for now to ensure it works

* fix yaml file

* forgot to include multigpu tests

* fix call to cicd.multigpu

* dump dictdefault to dict for yaml conversion

* use to_dict instead of casting

* 16bit-lora w flash attention, 8bit lora seems problematic

* add llama fsdp test

* more tests

* Add test for qlora + fsdp with prequant

* limit accelerate to 2 processes and disable broken qlora+fsdp+bnb test

* move multigpu tests to biweekly
2024-08-09 11:50:13 -04:00
Wing Lian
9a63884597 update test and main/nightly builds (#1797)
* update test and main/nightly builds

* don't install mamba-ssm on 2.4.0 since it has no wheels yet
2024-07-30 12:37:40 -04:00
Wing Lian
e1725aef2b update modal package and don't cache pip install (#1757)
* update modal package and cleanup pip cache

* more verbosity on the test
2024-07-16 14:45:38 -04:00
Wing Lian
78e12f8ca5 add basic support for the optimi adamw optimizer (#1727)
* add support for optimi_adamw optimizer w kahan summation

* pydantic validator for optimi_adamw

* workaround for setting optimizer for fsdp

* make sure to install optimizer packages

* make sure to have parity for model parameters passed to optimizer

* add smoke test for optimi_adamw optimizer

* don't use foreach optimi by default
2024-07-14 19:12:57 -04:00
Wing Lian
98af5388ba bump flash attention 2.5.8 -> 2.6.1 (#1738)
* bump flash attention 2.5.8 -> 2.6.1

* use triton implementation of cross entropy from flash attn

* add smoke test for flash attn cross entropy patch

* fix args to xentropy.apply

* handle tuple from triton loss fn

* ensure the patch tests run independently

* use the wrapper already built into flash attn for cross entropy

* mark pytest as forked for patches

* use pytest xdist instead of forked, since cuda doesn't like forking

* limit to 1 process and use dist loadfile for pytest

* change up pytest for fixture to reload transformers w monkeypathc
2024-07-14 19:11:31 -04:00
mhenrichsen
1194c2e0b1 github urls (#1734)
Co-authored-by: Henrichsen, Mads (ext) <mads.henrichsen.ext@siemens-energy.com>
2024-07-11 09:19:29 -04:00
Wing Lian
fe650dd326 make sure the CI fails when pytest script fails (#1669)
* make sure the pytest script fails

* make sure the defaults come through for tests

* make sure tensorboard is loaded for test assertion
2024-05-29 10:12:11 -04:00
Wing Lian
89134f2143 make sure to install causal_conv1d in docker (#1459) 2024-03-29 16:43:25 -04:00
Wing Lian
dd449c5cd8 support galore once upstreamed into transformers (#1409)
* support galore once upstreamed into transformers

* update module name for llama in readme and fix typing for all linear

* bump trl for deprecation fixes from newer transformers

* include galore as an extra and install in docker image

* fix optim_args type

* fix optim_args

* update dependencies for galore

* add galore to cicd dockerfile
2024-03-19 09:26:35 -04:00
Wing Lian
00018629e7 run tests again on Modal (#1289) [skip ci]
* run tests again on Modal

* make sure to run the full suite of tests on modal

* run cicd steps via shell script

* run tests in different runs

* increase timeout

* split tests into steps on modal

* increase workflow timeout

* retry doing this with only a single script

* fix yml launch for modal ci

* reorder tests to run on modal

* skip dpo tests on modal

* run on L4s, A10G takes too long

* increase CPU and RAM for modal test

* run modal tests on A100s

* skip phi test on modal

* env not arg in modal dockerfile

* upgrade pydantic and fastapi for modal tests

* cleanup stray character

* use A10s instead of A100 for modal
2024-02-29 14:26:26 -05:00
Wing Lian
8da1633124 Revert "run PR e2e docker CI tests in Modal" (#1220) [skip ci] 2024-01-26 16:50:44 -05:00
Wing Lian
36d053f6f0 run PR e2e docker CI tests in Modal (#1217) [skip ci]
* wip modal for ci

* handle falcon layernorms better

* update

* rebuild the template each time with the pseudo-ARGS

* fix ref

* update tests to use modal

* cleanup ci script

* make sure to install jinja2 also

* kickoff the gh action on gh hosted runners and specify num gpus
2024-01-26 16:13:27 -05:00