hf offline decorator for tests to work around rate limits (#2452) [skip ci]

* hf offline decorator for tests to work around rate limits
* fail quicker so we can see logs
* try new cache name
* limit files downloaded
* phi mini predownload
* offline decorator for phi tokenizer
* handle meta llama 8b offline too
* make sure to return fixtures if they are wrapped too
* more fixes
* more things offline
* more offline things
* fix the env var
* fix the model name
* handle gemma also
* force reload of modules to recheck offline status
* prefetch mistral too
* use reset_sessions so hub picks up offline mode
* more fixes
* rename so it doesn't seem like a context manager
* fix backoff
* switch out tinyshakespeare dataset since it runs a py script to fetch data and doesn't work offline
* include additional dataset
* more fixes
* more fixes
* replace tiny shakespeare dataset
* skip some tests for now
* use more robust check using snapshot download to determine if a dataset name is on the hub
* typo for skip reason
* use local_files_only
* more fixtures
* remove local only
* use tiny shakespeare as pretrain dataset and streaming can't be offline even if precached
* make sure fixtures aren't offline
* improve the offline reset
* try bumping version of datasets
* reorder reloading and setting
* prime a new cache
* run the tests now with fresh cache
* try with a static cache
* now run all the ci again with hopefully a correct cache
* skip wonky tests for now
* skip wonky tests for now
* handle offline mode for model card creation
2025-03-28 19:20:46 -04:00
parent a4e430e7c4
commit 05f03b541a
21 changed files with 381 additions and 50 deletions
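For readers without the rest of the change set, the decorator pair named in the commit message can be sketched roughly as below. This is a minimal illustration, not the code added to `tests/utils` by this commit: the helper `_with_hf_offline` is a hypothetical name, and the sketch only toggles the `HF_HUB_OFFLINE` environment variable and restores it afterwards. The commit message also mentions reloading modules and resetting hub sessions so that libraries which read the variable at import time pick up the change; that part is deliberately left out here.

```python
import functools
import os

_ENV_VAR = "HF_HUB_OFFLINE"  # environment variable read by huggingface_hub


def _with_hf_offline(value):
    """Hypothetical helper: build a decorator that pins HF_HUB_OFFLINE to `value`."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            previous = os.environ.get(_ENV_VAR)
            os.environ[_ENV_VAR] = value
            # Assumption: the real helper would also reload the affected
            # modules here so they re-read the variable; omitted in this sketch.
            try:
                return func(*args, **kwargs)
            finally:
                # Restore whatever was set before the wrapped call.
                if previous is None:
                    os.environ.pop(_ENV_VAR, None)
                else:
                    os.environ[_ENV_VAR] = previous

        return wrapper

    return decorator


# Names match the imports in the diff; the bodies are illustrative only.
enable_hf_offline = _with_hf_offline("1")
disable_hf_offline = _with_hf_offline("0")
```

Because each wrapper restores the previous value in a `finally` block, a test that must reach the network (such as the streaming test below) can be marked with `disable_hf_offline` without leaking that setting into later tests.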
--- a/tests/test_packed_pretraining.py
+++ b/tests/test_packed_pretraining.py
@@ -8,6 +8,7 @@ import torch
 from datasets import load_dataset
 from torch.utils.data import DataLoader
 from transformers import AutoTokenizer
+from utils import disable_hf_offline, enable_hf_offline

 from axolotl.utils.data import get_dataset_wrapper, wrap_pretraining_dataset
 from axolotl.utils.dict import DictDefault
@@ -18,17 +19,18 @@ class TestPretrainingPacking(unittest.TestCase):
    Test class for packing streaming dataset sequences
    """

+    @enable_hf_offline
    def setUp(self) -> None:
        # pylint: disable=duplicate-code
        self.tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
        self.tokenizer.pad_token = "</s>"

-    @pytest.mark.flaky(retries=3, delay=5)
+    @pytest.mark.flaky(retries=1, delay=5)
+    @disable_hf_offline
    def test_packing_stream_dataset(self):
        # pylint: disable=duplicate-code
        dataset = load_dataset(
-            "allenai/c4",
-            "en",
+            "winglian/tiny-shakespeare",
            streaming=True,
        )["train"]

@@ -36,8 +38,7 @@ class TestPretrainingPacking(unittest.TestCase):
            {
                "pretraining_dataset": [
                    {
-                        "path": "allenai/c4",
-                        "name": "en",
+                        "path": "winglian/tiny-shakespeare",
                        "type": "pretrain",
                    }
                ],