hf offline decorator for tests to work around rate limits (#2452) [skip ci]

* hf offline decorator for tests to work around rate limits

* fail quicker so we can see logs

* try new cache name

* limit files downloaded

* phi mini predownload

* offline decorator for phi tokenizer

* handle meta llama 8b offline too

* make sure to return fixtures if they are wrapped too

* more fixes

* more things offline

* more offline things

* fix the env var

* fix the model name

* handle gemma also

* force reload of modules to recheck offline status

* prefetch mistral too

* use reset_sessions so hub picks up offline mode

* more fixes

* rename so it doesn't seem like a context manager

* fix backoff

* switch out tinyshakespeare dataset since it runs a py script to fetch data and doesn't work offline

* include additional dataset

* more fixes

* more fixes

* replace tiny shakespeare dataset

* skip some tests for now

* use more robust check using snapshot download to determine if a dataset name is on the hub

* typo for skip reason

* use local_files_only

* more fixtures

* remove local only

* use tiny shakespeare as pretrain dataset; streaming can't be offline even if precached

* make sure fixtures aren't offline

improve the offline reset
try bumping version of datasets
reorder reloading and setting
prime a new cache
run the tests now with fresh cache
try with a static cache

* now run all the ci again with hopefully a correct cache

* skip wonky tests for now

* skip wonky tests for now

* handle offline mode for model card creation
Wing Lian
2025-03-28 19:20:46 -04:00
committed by GitHub
parent a4e430e7c4
commit 05f03b541a
21 changed files with 381 additions and 50 deletions
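
The diff below imports `enable_hf_offline` and `disable_hf_offline` from the tests' `utils` module, but their implementation is not part of this section. As a rough illustration of the idea only, here is a minimal sketch assuming the decorators toggle the `HF_HUB_OFFLINE` environment variable around the wrapped call. The names `with_hf_offline_flag`, `enable_hf_offline_sketch` and `disable_hf_offline_sketch` are hypothetical, and the choice of environment variable is an assumption. The commit message also mentions reloading modules and calling `reset_sessions` so the hub picks up the new mode; this sketch deliberately leaves that out.

```python
import functools
import os
from contextlib import contextmanager


@contextmanager
def _temporary_env(name, value):
    """Set an environment variable for the duration of a block, then restore it."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


def with_hf_offline_flag(value):
    """Build a decorator that runs the wrapped function with HF_HUB_OFFLINE=value.

    Hypothetical helper: the real decorators in the commit may use other
    variables and additionally reload already-imported modules.
    """

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with _temporary_env("HF_HUB_OFFLINE", value):
                # Return the wrapped result so decorated fixtures still hand
                # their value back to the caller.
                return func(*args, **kwargs)

        return wrapper

    return decorator


# Hypothetical stand-ins for the decorators used in the diff.
enable_hf_offline_sketch = with_hf_offline_flag("1")
disable_hf_offline_sketch = with_hf_offline_flag("0")
```

Toggling the variable alone is usually not enough for libraries that read it once at import time, which is presumably why the commit message lists module reloading and session resets as separate steps.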

@@ -8,6 +8,7 @@ import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer
+from utils import disable_hf_offline, enable_hf_offline
from axolotl.utils.data import get_dataset_wrapper, wrap_pretraining_dataset
from axolotl.utils.dict import DictDefault
@@ -18,17 +19,18 @@ class TestPretrainingPacking(unittest.TestCase):
Test class for packing streaming dataset sequences
"""
+@enable_hf_offline
def setUp(self) -> None:
# pylint: disable=duplicate-code
self.tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
self.tokenizer.pad_token = "</s>"
-@pytest.mark.flaky(retries=3, delay=5)
+@pytest.mark.flaky(retries=1, delay=5)
+@disable_hf_offline
def test_packing_stream_dataset(self):
# pylint: disable=duplicate-code
dataset = load_dataset(
-"allenai/c4",
-"en",
+"winglian/tiny-shakespeare",
streaming=True,
)["train"]
@@ -36,8 +38,7 @@ class TestPretrainingPacking(unittest.TestCase):
{
"pretraining_dataset": [
{
-"path": "allenai/c4",
-"name": "en",
+"path": "winglian/tiny-shakespeare",
"type": "pretrain",
}
],