Commit Graph

50 Commits

Author SHA1 Message Date
Wing Lian
9218ebecd2 e2e testing (#574) 2023-09-14 21:56:11 -04:00
Jan Philipp Harries
2f586d18db Fix pretraining with iterable/streaming Dataset (#556)
* return without packing prep/len

* fix remove columns

* fix encode arguments

* add error when max steps not set

* fix test

---------

Co-authored-by: Jan Philipp Harries <jphme@users.noreply.github.com>
2023-09-13 00:16:40 -04:00
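One bullet above adds an error when `max_steps` is not set for a streaming dataset. The reason: an iterable/streaming dataset has no length, so the trainer cannot derive the total step count from the dataset size. A minimal validation sketch (the function name and `cfg` keys are assumptions, not axolotl's actual code):

```python
def check_pretraining_config(cfg: dict) -> None:
    """Streaming/iterable datasets have no __len__, so the trainer
    cannot compute total training steps from the dataset size.
    Hypothetical validation sketch."""
    if cfg.get("pretraining_dataset") and not cfg.get("max_steps"):
        raise ValueError(
            "max_steps must be set when streaming a pretraining dataset"
        )
```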
Wing Lian
0b4cf5bc8c workaround for md5 variations (#533)
* workaround for md5 variations

* refactor the prepared hash too
2023-09-08 16:01:05 -04:00
Wing Lian
343714972b recommend padding when using sample packing (#531) 2023-09-06 17:00:21 -04:00
Wing Lian
d5dcf9c350 fix test fixture b/c hf trainer tokenization changed (#464) 2023-08-23 04:04:49 -04:00
Wing Lian
8cace80175 fix fixture for new tokenizer handling in transformers (#428) 2023-08-17 17:01:52 -04:00
Aman Karmani
efb3b2c95e simplify load_tokenizer 2023-08-12 18:55:06 -07:00
Aman Karmani
8cec513447 extract module for working with cfg 2023-08-12 18:25:27 -07:00
Aman Karmani
a13e45d548 fix DefaultDict.__or__ 2023-08-13 01:15:50 +00:00
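`a13e45d548` fixes `DictDefault.__or__`. The underlying pitfall: on Python 3.9+, `dict.__or__` on a subclass returns a plain `dict`, so a merge silently drops the default-value behavior unless the operator is overridden. A minimal sketch of the idea (not the actual axolotl implementation):

```python
class DictDefault(dict):
    """Sketch of a dict that returns None for missing keys."""

    def __missing__(self, key):
        return None

    def __or__(self, other):
        # Without this override, dict.__or__ returns a plain dict,
        # losing the __missing__ behavior after a merge.
        return DictDefault({**self, **other})
```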
Wing Lian
2bb0b78975 Attention mask and position id fixes for packing (#285)
* fix attention mask with packing

* set position ids and use block diagonal attn mask

* fix expand mask for multiple batch items, make sure we pad position_ids

* don't move masks to cpu

* use multi pack dataloader w random sampler

* add position_ids back

* more fixes for dataloader integration

* est total tokens, fix field loop

* more fixes, position_ids seems broken

* more fixes for sample packing

* use distributed sampler, avoid accelerate prepare

* use accelerator prepare for dataloader

* fix for position_ids w packing

* Update src/axolotl/utils/dataloader.py

* validation for sample packing and doc

* more fixes for 4k and optimizations

* optimized expand mask fn

* better handling of variance in multipack dataloader length and trainer hanging when it runs out of data

* fix rounding of len of batches to int

* better handling so that all devices have the same dataloader len

* fix step calc for packing

* pass sample packing efficiency to training args

* add a test for the mask expansion for sequence packing

* only process eval dataset for packing if not None

* don't split batches when packing

* weighted CE losses

* weighted CEL fixes

* limit packing to sequences of max seq len

* seq_len_multiple for packing

* make sure the chunk size is an int

* sample_packing_seq_len_multiplier config

* use cumulative seq len with var len flash attn v2 w packing

* properly calculate max len

* fix flash-attn, xformers, packing, support chatml

* fix chatml system prompt for openorca, legacy tokenizer opts

* add chatml

* add unit tests for cum seq lens, add ability to build cu_seq_lens from positional ids, fix prompt test

* fix test and pylint checks

* more packing and dataset optimizations and fixes

* filter w multiple cpus

* more fixes and optimizations

* fixes and go back to distributed sampler since batch sampler won't work

* fix counts by accounting for num devices

* fix steps calculation

* previous accelerate is still most performant

* add numba to requirements.

* use custom distributed checks

* fix sampler to prevent overfit w new epochs

* let's not cleanup the cached datasets

* calculate cum seq lens with pos_ids instead of mask, simplify packing params, fix distributed barrier

* speed optimizations and set accelerate fsdp env vars

* optimize dataset concatenation?

* more optimizations for dataset handling

* fix import for annotation

* manual pre-commit fixes

* another sum optimization and bug fix for calc steps

* fix packing estimations

* fix formatting

* pylint problems

* add back flash attention branch for handling unpacked sequences separately

* Address PR feedback

* add optional sample packing config params to readme
2023-08-12 15:14:56 -04:00
Jan Philipp Harries
3392270544 experimental llama 2 chat support (#296)
* experimental llama 2 chat support

* few small fixes

* llama2_chat

* small fix to follow original implementation

* small fixes and added fixtures/tests

* fix - mixed up inference and finetuning conversations

* args - small fix

* small fix

* small adjustment and warning

* fix with pre-commit

---------

Co-authored-by: Jan Philipp Harries <jpdus@users.noreply.github.com>
2023-08-06 17:40:52 -04:00
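For context on #296, a sketch of the Llama 2 chat format it targets, where the system prompt is embedded inside the first `[INST]` block (tags follow Meta's reference implementation; the helper name here is illustrative):

```python
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def format_llama2_turn(system: str, user: str) -> str:
    """Wrap a single user turn in the Llama 2 chat format,
    with the system prompt inside the [INST] block."""
    return f"{B_INST} {B_SYS}{system}{E_SYS}{user} {E_INST}"
```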
Wing Lian
3d4984b9a5 update prompts for open orca to match the paper (#317)
fix the test for the updated system tokenizer
2023-07-22 13:49:11 -04:00
theobjectivedad
b1f4f7a34d Fixed pre-commit problems, fixed small bug in logging_config to handle LOG_LEVEL env var 2023-07-15 12:29:35 +00:00
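The LOG_LEVEL fix above follows a common pattern: read the level name from the environment and fall back to a sane default when it is unset or invalid. A sketch of the pattern (axolotl's `logging_config` differs in detail):

```python
import logging
import os

def configure_logging():
    """Read the log level from the LOG_LEVEL env var, default to INFO.
    Unknown level names also fall back to INFO."""
    level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
    level = getattr(logging, level_name, logging.INFO)
    logging.basicConfig(level=level)
    return level
```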
theobjectivedad
553a86b52c Adding logging enhancement 2023-07-14 07:26:19 -05:00
Wing Lian
19cf0bda99 params are adam_*, not adamw_* 2023-07-08 12:13:39 -04:00
Wing Lian
3a38271276 add tests and support for loader for sys prompt data 2023-06-25 22:28:07 -04:00
Wing Lian
8d20e0a3d3 initial wip to get sys prompt from dataset 2023-06-25 22:28:07 -04:00
Wing Lian
47d601fa23 optionally define whether to use_fast tokenizer 2023-06-25 10:19:49 -04:00
Wing Lian
ad5ca4f734 Additional test case per pr 2023-06-15 10:12:47 -04:00
Wing Lian
cb9d3af5c0 add validation and tests for adamw hyperparam 2023-06-15 09:39:42 -04:00
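This ties to the earlier `19cf0bda99` fix: HF `TrainingArguments` names the hyperparameters `adam_beta1`/`adam_beta2`/`adam_epsilon` even when the optimizer is AdamW. A hypothetical rename sketch of the kind of validation added here (key names in `renames` are assumptions about the legacy config):

```python
def validate_optimizer_params(cfg: dict) -> dict:
    """Rename legacy adamw_* keys to the adam_* names that
    HF TrainingArguments expects. Hypothetical sketch."""
    renames = {
        "adamw_beta1": "adam_beta1",
        "adamw_beta2": "adam_beta2",
        "adamw_epsilon": "adam_epsilon",
    }
    for old, new in renames.items():
        if old in cfg:
            cfg[new] = cfg.pop(old)
    return cfg
```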
Wing Lian
1925eaf1e6 Merge pull request #214 from OpenAccess-AI-Collective/fix-tokenizing-labels
Fix tokenizing labels
2023-06-15 08:13:43 -04:00
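For context on #214: the usual convention masks prompt tokens in the labels with -100 so cross-entropy loss is computed only on the response, and an off-by-one here silently trains on part of the prompt. An illustrative sketch (not axolotl's actual tokenization code):

```python
IGNORE_INDEX = -100  # the label id PyTorch cross-entropy ignores

def build_labels(prompt_ids, response_ids):
    """Mask prompt tokens so loss is computed only on the response."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
```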
Wing Lian
1ab3bf3e67 fix test name 2023-06-15 02:09:33 -04:00
Wing Lian
baed440fa1 ignore duplicate code in tests 2023-06-15 02:03:53 -04:00
Wing Lian
7925ddce86 bugfix for potential off by one 2023-06-15 01:59:33 -04:00
Wing Lian
fd2c9814c9 Merge branch 'main' into flash-optimum 2023-06-12 13:12:15 -04:00
Wing Lian
14668fa54e new validation for mpt w grad checkpoints 2023-06-11 09:26:10 -04:00
Wing Lian
eea2731a5e add streaming dataset support for pretraining datasets 2023-06-10 14:23:56 -04:00
NanoCode012
babf0fdb71 Validate falcon with fsdp 2023-06-09 00:29:04 +09:00
NanoCode012
3c71c8debe Update doc for grad_accu and add validation tests for batch size 2023-06-01 06:13:47 +09:00
Wing Lian
0136f510f2 don't worry about duplicate code here 2023-05-31 12:05:43 -04:00
Wing Lian
9b8585dc70 fix packing so that concatenated sequences reset the attention 2023-05-31 11:38:52 -04:00
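`9b8585dc70` makes concatenated (packed) sequences reset attention. Conceptually this means a block-diagonal causal mask: token i attends to token j only if j precedes it and both belong to the same packed sequence. A small sketch with nested lists (real code would build a tensor):

```python
def packed_causal_mask(seq_ids):
    """Build a causal mask where token i may attend to token j only if
    j <= i and both tokens belong to the same packed sequence.
    seq_ids, e.g. [0, 0, 1], labels each token's source sequence."""
    n = len(seq_ids)
    return [
        [1 if j <= i and seq_ids[i] == seq_ids[j] else 0 for j in range(n)]
        for i in range(n)
    ]
```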
Wing Lian
6fa40bf8ad black formatting 2023-05-30 23:33:37 -04:00
Wing Lian
3aad5f3b3e add support for gradient accumulation steps 2023-05-30 23:24:37 -04:00
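The idea behind gradient accumulation: scale each micro-batch gradient by 1/accumulation_steps and sum, so the update matches one large batch while only a micro-batch fits in memory. A scalar-valued sketch of the loop (real training accumulates tensors and calls `optimizer.step()` where the yield is):

```python
def accumulated_gradient(micro_batch_grads, accumulation_steps):
    """Average 'gradients' over accumulation_steps micro-batches.
    Scalars stand in for gradient tensors, for illustration."""
    total = 0.0
    for step, grad in enumerate(micro_batch_grads, start=1):
        total += grad / accumulation_steps  # scale each micro-batch
        if step % accumulation_steps == 0:
            yield total  # optimizer.step() would fire here
            total = 0.0
```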
NanoCode012
b81c97ff76 Fix pre-commit for rebased files 2023-05-31 03:01:38 +09:00
Wing Lian
cfcc549f6b fix relative path for fixtures 2023-05-31 02:55:21 +09:00
NanoCode012
37293dce07 Apply isort then black 2023-05-31 02:53:53 +09:00
NanoCode012
0dd35c74af Ignore unsupported-binary-operation 2023-05-31 02:53:53 +09:00
NanoCode012
b832a0ac62 Black formatting 2023-05-31 02:53:53 +09:00
NanoCode012
1f3c3f5ea0 Lint validation 2023-05-31 02:53:53 +09:00
NanoCode012
0e952889dc Lint test_dict 2023-05-31 02:53:53 +09:00
NanoCode012
7eb33a77dd Lint test_prompters 2023-05-31 02:53:53 +09:00
NanoCode012
392dfd9b07 Lint and format 2023-05-31 02:53:22 +09:00
Wing Lian
e65aeedce7 fix relative path for fixtures 2023-05-30 10:38:20 -04:00
Wing Lian
e6fdeb087f add unit test for sharegpt tokenization 2023-05-30 10:28:17 -04:00
Wing Lian
fd5f9656a2 update for pr feedback 2023-05-28 14:23:27 -04:00
Wing Lian
1c33eb88a7 new hf_use_auth_token setting so login to hf isn't required 2023-05-28 13:08:49 -04:00
NanoCode012
52dd92a0cd Feat: Update validate_config and add tests 2023-05-29 00:25:54 +09:00
NanoCode012
f87bd20555 Fix incorrect syntax in test 2023-05-28 23:35:29 +09:00
NanoCode012
923151ffab Add test for DictDefault 2023-05-28 23:06:10 +09:00
Wing Lian
d199d6c261 automated testing in github actions 2023-05-27 11:51:01 -04:00