* wip for pretraining/iterable data with arbitrary prompt strategies
* more fixes, wip
* more fixes for custom pretraining
* iterable ds wrapper not needed
* remove extra features
* chore: lint
* update pretraining example yml
* fix order for partials
* fixup for tests
* support for true batches with multipack
* patch the map dataset fetcher to handle batches with packed indexes
* patch 4d mask creation for sdp attention
* better handling for BetterTransformer
* patch general case for 4d mask
* setup forward patch. WIP
* fix patch file
* support for multipack w/o flash attention for llama
* cleanup
* add warning about bf16 vs fp16 for multipack with sdpa
* bugfixes
* add 4d multipack tests, refactor patches
* update tests and add warnings
* fix e2e file check
* skip sdpa test if torch is older than 2.1.1, update docs
* [Feat] streaming multipack
* WIP: make continued pretraining work with multipack
* fix up hardcoding, lint
* fix dict check
* update test for updated pretraining multipack code
* fix hardcoded data collator for multipack pretraining
* fix the collator to be the max length for multipack pretraining
* don't bother with latest tag for test
* cleanup docker build/test
---------
Co-authored-by: jinwonkim93@github.com <jinwonkim>
Co-authored-by: Wing Lian <wing.lian@gmail.com>