Attention mask and position id fixes for packing (#285)

* fix attetion mask with packing

* set position ids and use block diagonal attn mask

* fix expand mask for multiple batch items, make sure we pad position_ids

* don't move masks to cpu

* use multi pack dataloader w random sampler

* add position_ids back

* more fixes for dataloader integration

* est total tokens, fix field loop

* more fixes, position_ids seems broken

* more fixes for sample packing

* use distributed sampler, avoid accelerate prepare

* use accelerator prepare for dataloader

* fix for position_ids w packing

* Update src/axolotl/utils/dataloader.py

* validation for sample packing and doc

* more fixes for 4k and optimizations

* optimized expand mask fn

* better handling of variance in multipack dataloader length and trainer hanging when it runs out of data

* fix rounding of len of batches to int

* better handling so that all devices have the same dataloader len

* fix step calc for packing

* pass sample packing efficiency to training args

* add a test for the mask expansion for sequence packing

* only process eval dataset for packing if not None

* don't split batches when packing

* weighted CE losses

* weighted CEL fixes

* limit packing to sequences of max seq len

* seq_len_multiple for packing

* make sure the chunk size is an int

* sample_packing_seq_len_multiplier config

* use cumulative seq len with var len flash attn v2 w packing

* properly calculate max len

* fix flash-attn, xformers, packing, support chatml

* fix chatml system prompt for openorca, legacy tokenizer opts

* add chatml

* add unit tests for cum seq lens, add ability to build cu_seq_lens from positional ids, fix prompt test

* fix test and pylint checks

* more packing and dataset optimizations and fixes

* filter w multiple cpus

* more fixes and optimizations

* fixes and go back to distributed sampler since batch sampler won't work

* fix counts by accounting for num devices

* fix steps calculation

* previous accelerate is still most performant

* add numba to requirements.

* use custom distributed checks

* fix sampler to prevent overfit w new epochs

* let's not cleanup the cached datasets

* calculate cum seq lens with pos_ids instead of mask, simplify packing params, fix distributed barrier

* speed optimizations and set accelerate fsdp env vars

* optimize dataset concatenation?

* more optimizations for dataset handling

* fix import for annotation

* manual pre-commit fixes

* another sum optimization and bug fix for calc steps

* fix packing estimations

* fix formatting

* pylint problems

* add back flash attention branch for handling unpacked sequences seperately

* Address PR feedback

* add optional sample packing config params to readme

This commit is contained in:

Wing Lian

2023-08-12 15:14:56 -04:00

committed by

GitHub

parent a276c9c88d

commit 2bb0b78975

23 changed files with 1218 additions and 70 deletions

									
										12

tests/test_prompt_tokenizers.py
									
												View File
												
				@@ -134,9 +134,15 @@ class InstructionWSystemPromptTokenizingStrategyTest(unittest.TestCase):

				            "output": "Hi! How can I help?",

				        }

				        example = strat.tokenize_prompt(sample)

				        assert example["input_ids"][0:4] == [1, 835, 2184, 29901]  # "<s>### System:"

				        assert example["input_ids"][5:7] == [1509, 20118]  # "use cot"

				        assert example["input_ids"][9] == 11889  # USER

				        assert example["input_ids"][0:5] == [

				            1,

				            28962,

				            1254,

				            12665,

				            29901,

				        ]  # "<s>SYSTEM:"

				        assert example["input_ids"][5:7] == [671, 20118]  # " use cot"

				        assert example["input_ids"][8] == 11889  # USER

				class Llama2ChatTokenizationTest(unittest.TestCase):

Attention mask and position id fixes for packing (#285)

12 tests/test_prompt_tokenizers.py Unescape Escape View File

12

tests/test_prompt_tokenizers.py

View File