VED
cd856b45b1
feat:add support dataset_num_processes (#3129) [skip ci]
...
* feat:add support dataset_num_processes
* chore
* required changes
* requested changes
* required changes
* required changes
* required changes
* elif get_default_process_count()
* add:del data
* Update cicd/Dockerfile.jinja
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
* Update cicd/single_gpu.py
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
---------
Co-authored-by: salman <salman.mohammadi@outlook.com>
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
2025-10-13 17:18:12 +07:00
Dan Saunders
79ddaebe9a
Add ruff, remove black, isort, flake8, pylint (#3092)
...
* black, isort, flake8 -> ruff
* remove unused
* add back needed import
* fix
2025-08-23 23:37:33 -04:00
Wing Lian
563f5eed7a
update dependencies - liger + trl (#2987)
...
* update dependencies
* set dataset processes for tests
* add support for GSPO
2025-07-31 11:17:17 -04:00
Dan Saunders
10ba1622f7
checkpoint model on first step callback (#2906)
...
* checkpoint model on first step callback
* remove debug
* add test cases; update existing tests not to save on first step
* move test out of solo
* delete
* default to False
* typo
2025-07-15 15:00:48 -04:00
Dan Saunders
00cda8cc70
Data loader refactor (#2707)
...
* data loading refactor (wip)
* updates
* progress
* pytest
* pytest fix
* lint
* zero_first -> filelock, more simplifications
* small simplification
* import change
* nit
* lint
* simplify dedup
* couldn't resist
* review comments WIP
* continued wip
* minor changes
* fix; remove contrived test
* further refactor
* set default seed in pydantic config
* lint
* continued simplification
* lint
* renaming and nits
* filelock tests
* fix
* fix
* lint
* remove nullable arg
* remove unnecessary code
* moving dataset save fn to shared module
* remove debug print
* matching var naming
* fn name change
* coderabbit comments
* naming nit
* fix test
2025-06-10 19:53:07 -04:00
Wing Lian
c0a0c7534c
Activation checkpointing with offloading to disk with prefetch (#2663)
...
* offload activations to disk instead of CPU RAM
* add prefetch
* Disco :dance:
* include offload_disk in e2e test for AC
* document and make sure to cleanup
* fix annotation to match docs
* fix docs build
* address PR feedback
2025-05-13 16:39:39 -04:00
Wing Lian
caf5cb63ea
add e2e smoke test for using activation/gradient checkpointing with offload (#2565)
...
* add e2e smoke test for using activation/gradient checkpointing with offload
* disable duplicate code check for the test
* fix relative import
* seq len too small to test this dataset with packing
* Fix checkpoint patching for tests
2025-04-25 21:11:17 -04:00