* checkpoint model on first step callback
* remove debug
* add test cases; update existing tests not to save on first step
* move test out of solo
* delete
* default to False
* typo
* update transformers to 4.53.0
* remove attention_mask from signature columns if using packing
* remove attention_mask column from dataloader
* update signature of flash attn forward for ring attn patch
* fix FSDP
* patch ring-flash-attn with upstream signature fix
* fix patch indentation level
* fix the patch
* add batch flattening smoke test with loss check that works in older transformers
* fix patch
* don't drop attention mask for flex
* more fixes
* patch create_causal_mask for packing w flex
* global torch manual_seed fixture
* tweak loss checks
* fix patch and use single batch for flex
* don't need to reload
* fix causal mask patch
* use transformers patch releasE
* make sure env var is string
* make sure to drop attention mask for flex w packing for latest transformers patch release
* tweak loss
* guard on signature columns before removing attention mask
* bump loss
* set remove isn't chainable
* skip slow mistral test in 2.5.1
* lean mistral ft tests, remove e2e torch 2.4.1 test
* make sure to pass save_only_model for RL
* more tests to make ci leaner, add cleanup to modal ci
* fix module for import in e2e tests
* use mp spawn to prevent deadlocks with packing
* make sure cleanup shell script is executable when cloned out
* make sure to validate the config before normalizing so defaults get set
* validation not needed for particular test
* remove duplicate validations
* set qlora correctly
* add helper to verify the correct model output file exists
* more checks using helper
* chore: lint
* fix import and relora model check
* workaround for trl trainer saves
* remove stray print
* reduce test concurrency to avoid HF rate limiting, test suite parity
* make val_set_size smaller to speed up e2e tests
* more retries for pytest fixture downloads
* val_set_size was too small
* move retry_on_request_exceptions to data utils and add retry strategy
* pre-download ultrafeedback as a test fixture
* refactor download retry into it's own fn
* don't import from data utils
* use retry mechanism now for fixtures
* attempt to also run e2e tests that needs gpus
* fix stray quote
* checkout specific github ref
* dockerfile for tests with proper checkout
ensure wandb is dissabled for docker pytests
clear wandb env after testing
clear wandb env after testing
make sure to provide a default val for pop
tryin skipping wandb validation tests
explicitly disable wandb in the e2e tests
explicitly report_to None to see if that fixes the docker e2e tests
split gpu from non-gpu unit tests
skip bf16 check in test for now
build docker w/o cache since it uses branch name ref
revert some changes now that caching is fixed
skip bf16 check if on gpu w support
* pytest skip for auto-gptq requirements
* skip mamba tests for now, split multipack and non packed lora llama tests
* split tests that use monkeypatches
* fix relative import for prev commit
* move other tests using monkeypatches to the correct run