* hf offline decorator for tests to work around rate limits
* fail quicker so we can see logs
* try new cache name
* limit files downloaded
* phi mini predownload
* offline decorator for phi tokenizer
* handle meta llama 8b offline too
* make sure to return fixtures if they are wrapped too
* more fixes
* more things offline
* more offline things
* fix the env var
* fix the model name
* handle gemma also
* force reload of modules to recheck offline status
* prefetch mistral too
* use reset_sessions so hub picks up offline mode
* more fixes
* rename so it doesn't seem like a context manager
* fix backoff
* switch out tinyshakespeare dataset since it runs a py script to fetch data and doesn't work offline
* include additional dataset
* more fixes
* more fixes
* replace tiny shakespeare dataset
* skip some tests for now
* use more robust check using snapshot download to determine if a dataset name is on the hub
* typo for skip reason
* use local_files_only
* more fixtures
* remove local only
* use tiny shakespeare as pretrain dataset and streaming can't be offline even if precached
* make sure fixtures aren't offline
* improve the offline reset
* try bumping version of datasets
* reorder reloading and setting
* prime a new cache
* run the tests now with fresh cache
* try with a static cache
* now run all the ci again with hopefully a correct cache
* skip wonky tests for now
* skip wonky tests for now
* handle offline mode for model card creation
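The enable_hf_offline helper used in the test below is imported from the test utils module and its implementation is not shown here. As a minimal sketch only, assuming the standard HF_HUB_OFFLINE / HF_DATASETS_OFFLINE environment variables and the huggingface_hub.utils.reset_sessions helper mentioned in the notes above, such a decorator might look roughly like this:

import functools
import os

from huggingface_hub.utils import reset_sessions


def enable_hf_offline(wrapped_fn):
    """Sketch: run the wrapped callable with the Hugging Face Hub forced into offline mode."""

    @functools.wraps(wrapped_fn)
    def wrapper(*args, **kwargs):
        # remember the previous values so offline mode doesn't leak into other tests
        previous = {
            key: os.environ.get(key)
            for key in ("HF_HUB_OFFLINE", "HF_DATASETS_OFFLINE")
        }
        os.environ["HF_HUB_OFFLINE"] = "1"
        os.environ["HF_DATASETS_OFFLINE"] = "1"
        reset_sessions()  # per the notes above: reset hub sessions so offline mode is picked up
        try:
            return wrapped_fn(*args, **kwargs)
        finally:
            # restore the original environment and reset sessions again
            for key, value in previous.items():
                if value is None:
                    os.environ.pop(key, None)
                else:
                    os.environ[key] = value
            reset_sessions()

    return wrapper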
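The notes also mention a more robust check using snapshot download to decide whether a dataset name exists on the Hub. A hypothetical helper for that check (the name dataset_exists_on_hub and the exact error handling are assumptions, not code from this change) could look like:

from huggingface_hub import snapshot_download
from huggingface_hub.utils import GatedRepoError, RepositoryNotFoundError


def dataset_exists_on_hub(dataset_name: str) -> bool:
    """Sketch: treat a name as a Hub dataset if a snapshot of it can be resolved."""
    try:
        snapshot_download(repo_id=dataset_name, repo_type="dataset")
        return True
    except GatedRepoError:
        # the repo exists but requires access approval
        return True
    except RepositoryNotFoundError:
        return False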
"""Module for testing dataset sequence packing"""
|
|
|
|
import unittest
|
|
from pathlib import Path
|
|
|
|
from datasets import Dataset, load_dataset
|
|
from transformers import AutoTokenizer
|
|
from utils import enable_hf_offline
|
|
|
|
from axolotl.datasets import ConstantLengthDataset, TokenizedPromptDataset
|
|
from axolotl.prompt_tokenizers import AlpacaPromptTokenizingStrategy
|
|
from axolotl.prompters import AlpacaPrompter
|
|
|
|
|
|
class TestPacking(unittest.TestCase):
|
|
"""
|
|
Test class for packing dataset sequences
|
|
"""
|
|
|
|
@enable_hf_offline
|
|
def setUp(self) -> None:
|
|
# pylint: disable=duplicate-code
|
|
self.tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
|
|
self.tokenizer.add_special_tokens(
|
|
{
|
|
"bos_token": "<s>",
|
|
"eos_token": "</s>",
|
|
"unk_token": "<unk>",
|
|
}
|
|
)
|
|
|
|
def test_increments_attention(self):
|
|
prompter = AlpacaPrompter("chat")
|
|
strat = AlpacaPromptTokenizingStrategy(
|
|
prompter,
|
|
self.tokenizer,
|
|
False,
|
|
2048,
|
|
)
|
|
dateset = load_dataset(
|
|
"json",
|
|
data_files=str(Path(__file__).parent / "fixtures/alpaca/alpaca.json"),
|
|
)["train"]
|
|
dataset = Dataset.from_list(list(TokenizedPromptDataset(strat, dateset)))
|
|
|
|
constant_len_dataset = ConstantLengthDataset(
|
|
self.tokenizer,
|
|
[dataset],
|
|
seq_length=2048,
|
|
)
|
|
packed_dataset = Dataset.from_list(list(constant_len_dataset))
|
|
example = packed_dataset[0]
|
|
next_bos_index = (
|
|
example["input_ids"][1:].index(self.tokenizer.bos_token_id) + 1
|
|
) # add one since we sliced
|
|
|
|
# first example doesn't have mask reset
|
|
assert example["input_ids"][0] == self.tokenizer.bos_token_id
|
|
assert example["attention_mask"][0] == 1
|
|
assert example["position_ids"][0] == 0
|
|
assert example["position_ids"][1] == 1
|
|
|
|
# but subsequent one does
|
|
assert example["input_ids"][next_bos_index] == self.tokenizer.bos_token_id
|
|
assert example["attention_mask"][next_bos_index] == 2
|
|
assert example["position_ids"][next_bos_index] == 0
|
|
assert example["position_ids"][next_bos_index + 1] == 1
|
|
|
|
|
|
if __name__ == "__main__":
|
|
unittest.main()
|