Wing Lian
493eb8e5c6
update doc and use P2P=LOC for brittle grpo test ( #2649 )
...
* update doc and skip brittle grpo test
* fix the path to run the multigpu tests
* increase timeout, use LOC instead of NVL
* typo
* use hf cache from s3 backed cloudfront
* mark grpo as flaky test dues to vllm start
2025-05-13 17:05:11 -04:00
Dan Saunders
ae1c7ace63
Sequence parallel training context manager ( #2553 )
...
* ctx manager for SP
* updates
* update
* further simplifying
* accommodate both training context managers
* simplifying
* simplifying
* nit
* reorg
* tweak codecov yaml
* add gather post hook, simplify, fixes
* pytest
* pytest fix
2025-04-25 10:33:54 -04:00
Dan Saunders
66f41ec6f1
disable codecov pr annotations ( #2556 )
2025-04-24 08:51:51 -04:00
Dan Saunders
f776f889a1
adding codecov reporting ( #2372 ) [skip ci]
...
* adding codecov reporting
* update codecov-action to v5
* fix
---------
Co-authored-by: Dan Saunders <dan@axolotl.ai >
2025-04-16 15:02:17 -07:00