axolotl

Author	SHA1	Message	Date
Wing Lian	7570446596	Preprocess dataset size fix (#1131) * overwrite cache on preprocess step * don't cache the TokenizedPromptDataset at all * load_from_cache_file no longer needed	2024-01-17 11:02:41 -05:00
Wing Lian	cdc71f73c8	update table for rwkv4 support, fix process count for dataset (#822)	2023-11-04 23:45:44 -04:00
Felix Yan	d1236f2c41	Correct typos in datasets.py (#639)	2023-09-27 12:12:10 -04:00
Wing Lian	97d3776ce6	split completion text to sequence_len (#616)	2023-09-21 21:51:25 -04:00
Wing Lian	2bb0b78975	Attention mask and position id fixes for packing (#285) * fix attention mask with packing * set position ids and use block diagonal attn mask * fix expand mask for multiple batch items, make sure we pad position_ids * don't move masks to cpu * use multi pack dataloader w random sampler * add position_ids back * more fixes for dataloader integration * est total tokens, fix field loop * more fixes, position_ids seems broken * more fixes for sample packing * use distributed sampler, avoid accelerate prepare * use accelerator prepare for dataloader * fix for position_ids w packing * Update src/axolotl/utils/dataloader.py * validation for sample packing and doc * more fixes for 4k and optimizations * optimized expand mask fn * better handling of variance in multipack dataloader length and trainer hanging when it runs out of data * fix rounding of len of batches to int * better handling so that all devices have the same dataloader len * fix step calc for packing * pass sample packing efficiency to training args * add a test for the mask expansion for sequence packing * only process eval dataset for packing if not None * don't split batches when packing * weighted CE losses * weighted CEL fixes * limit packing to sequences of max seq len * seq_len_multiple for packing * make sure the chunk size is an int * sample_packing_seq_len_multiplier config * use cumulative seq len with var len flash attn v2 w packing * properly calculate max len * fix flash-attn, xformers, packing, support chatml * fix chatml system prompt for openorca, legacy tokenizer opts * add chatml * add unit tests for cum seq lens, add ability to build cu_seq_lens from positional ids, fix prompt test * fix test and pylint checks * more packing and dataset optimizations and fixes * filter w multiple cpus * more fixes and optimizations * fixes and go back to distributed sampler since batch sampler won't work * fix counts by accounting for num devices * fix steps calculation * previous accelerate is still most performant * add numba to requirements. * use custom distributed checks * fix sampler to prevent overfit w new epochs * let's not cleanup the cached datasets * calculate cum seq lens with pos_ids instead of mask, simplify packing params, fix distributed barrier * speed optimizations and set accelerate fsdp env vars * optimize dataset concatenation? * more optimizations for dataset handling * fix import for annotation * manual pre-commit fixes * another sum optimization and bug fix for calc steps * fix packing estimations * fix formatting * pylint problems * add back flash attention branch for handling unpacked sequences separately * Address PR feedback * add optional sample packing config params to readme	2023-08-12 15:14:56 -04:00
NanoCode012	45ac7c4f88	feat: use multi-core	2023-07-19 10:16:54 +09:00
theobjectivedad	b1f4f7a34d	Fixed pre-commit problems, fixed small bug in logging_config to handle LOG_LEVEL env var	2023-07-15 12:29:35 +00:00
theobjectivedad	553a86b52c	Adding logging enhancement	2023-07-14 07:26:19 -05:00
Wing Lian	7b57ed7618	pylint for duplicated code for system prompts	2023-06-25 22:28:07 -04:00
Wing Lian	aac4b7691e	add new sharegpt, refactor prompt so it can be customized later, add exception if no data is processed	2023-06-11 19:42:25 -04:00
Wing Lian	9b8585dc70	fix packing so that concatenated sequences reset the attention	2023-05-31 11:38:52 -04:00
NanoCode012	37293dce07	Apply isort then black	2023-05-31 02:53:53 +09:00
NanoCode012	6abb7f6a16	Lint datasets	2023-05-31 02:53:53 +09:00
NanoCode012	392dfd9b07	Lint and format	2023-05-31 02:53:22 +09:00
Wing Lian	0f74464652	fix new dataset prompt tokenizers	2023-05-21 18:57:09 -04:00
Wing Lian	2bc1a5bde1	black formatting	2023-05-10 16:01:08 -04:00
Wing Lian	94f5e415a3	various bugfixes	2023-04-24 09:41:34 -04:00
Wing Lian	2db9436410	casts the prepared data to int16 (doesn't help with training memory)	2023-04-17 21:36:02 -04:00
Wing Lian	77fca25f1b	4bit quantized support (wip)	2023-04-17 11:37:39 -04:00
Wing Lian	80b2ed29d8	various bugfixes	2023-04-14 21:37:07 -04:00
Wing Lian	a6028d302e	black formatting	2023-04-14 07:25:52 -04:00
Wing Lian	8d959a7e26	make it work with pythia in the cloud	2023-04-14 07:24:55 -04:00
Wing Lian	ce24f5e246	WIP for axolotl trainer	2023-04-14 00:20:05 -04:00

23 Commits