tweak loss

add seed for stable reproducibility
tweak acceptable loss from changed hyperparams
2025-07-06 19:42:43 -04:00 · 2025-07-06 19:29:51 -04:00 · 2025-07-06 19:25:26 -04:00 · 2025-07-06 19:11:46 -04:00 · 2025-07-06 18:55:16 -04:00 · 2025-07-06 13:27:55 -04:00
32 changed files with 448 additions and 445 deletions
--- a/.github/workflows/base.yml
+++ b/.github/workflows/base.yml
@@ -5,11 +5,13 @@ on:
    branches:
      - "main"
    paths:
-      - 'Dockerfile-base'
+      - 'docker/Dockerfile-base'
+      - 'docker/Dockerfile-uv-base'
      - '.github/workflows/base.yml'
  pull_request:
    paths:
-      - 'Dockerfile-base'
+      - 'docker/Dockerfile-base'
+      - 'docker/Dockerfile-uv-base'
      - '.github/workflows/base.yml'
  workflow_dispatch:

--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -20,12 +20,11 @@ jobs:
            python_version: "3.11"
            pytorch: 2.5.1
            axolotl_extras:
-          - cuda: 124
-            cuda_version: 12.4.1
+          - cuda: 126
+            cuda_version: 12.6.3
            python_version: "3.11"
            pytorch: 2.6.0
            axolotl_extras: vllm
-            is_latest: true
          - cuda: 126
            cuda_version: 12.6.3
            python_version: "3.11"
@@ -88,8 +87,8 @@ jobs:
            python_version: "3.11"
            pytorch: 2.5.1
            axolotl_extras:
-          - cuda: 124
-            cuda_version: 12.4.1
+          - cuda: 126
+            cuda_version: 12.6.3
            python_version: "3.11"
            pytorch: 2.6.0
            axolotl_extras:
@@ -146,8 +145,8 @@ jobs:
    strategy:
      matrix:
        include:
-          - cuda: 124
-            cuda_version: 12.4.1
+          - cuda: 126
+            cuda_version: 12.6.3
            python_version: "3.11"
            pytorch: 2.6.0
            axolotl_extras:
--- a/.github/workflows/multi-gpu-e2e.yml
+++ b/.github/workflows/multi-gpu-e2e.yml
@@ -26,11 +26,11 @@ jobs:
      fail-fast: false
      matrix:
        include:
-          - cuda: 124
-            cuda_version: 12.4.1
+          - cuda: 126
+            cuda_version: 12.6.3
            python_version: "3.11"
            pytorch: 2.6.0
-            axolotl_extras: vllm
+            axolotl_extras:
            num_gpus: 2
            nightly_build: "true"
          - cuda: 124
--- a/.github/workflows/tests-nightly.yml
+++ b/.github/workflows/tests-nightly.yml
@@ -18,96 +18,9 @@ jobs:
        env:
          SKIP: no-commit-to-branch

-  preload-cache:
-    name: Preload HF cache
-    runs-on: ubuntu-latest
-    strategy:
-      fail-fast: false
-      matrix:
-        python_version: ["3.11"]
-        pytorch_version: ["2.6.0"]
-    timeout-minutes: 20
-
-    env:
-      AXOLOTL_IS_CI_CACHE_PRELOAD: "1"
-
-    steps:
-      - name: Check out repository code
-        uses: actions/checkout@v4
-
-      - name: Restore HF cache
-        id: hf-cache-restore
-        uses: actions/cache/restore@v4
-        with:
-          path: |
-            /home/runner/.cache/huggingface/hub/datasets--*
-            /home/runner/.cache/huggingface/hub/models--*
-          key: ${{ runner.os }}-hf-hub-cache-v2
-
-      - name: Setup Python
-        uses: actions/setup-python@v5
-        with:
-          python-version: ${{ matrix.python_version }}
-          cache: 'pip' # caching pip dependencies
-
-      - name: upgrade pip
-        run: |
-          pip3 install --upgrade pip
-          pip3 install --upgrade packaging==23.2 setuptools==75.8.0 wheel
-
-      - name: Install PyTorch
-        run: |
-          pip3 install torch==${{ matrix.pytorch_version }}
-
-      - name: Install dependencies
-        run: |
-          pip3 show torch
-          pip3 install --no-build-isolation -U -e .
-          python scripts/unsloth_install.py | sh
-          python scripts/cutcrossentropy_install.py | sh
-          pip3 install -r requirements-dev.txt -r requirements-tests.txt
-
-      - name: Make sure PyTorch version wasn't clobbered
-        run: |
-          python -c "import torch; assert '${{ matrix.pytorch_version }}' in torch.__version__"
-
-      - name: Ensure axolotl CLI was installed
-        run: |
-          axolotl --help
-
-      - name: Pre-Download dataset fixture
-        run: |
-          huggingface-cli download --repo-type=dataset axolotl-ai-internal/axolotl-oss-dataset-fixtures
-
-      - name: Run tests
-        run: |
-          pytest -v tests/conftest.py
-
-      - name: Upload coverage to Codecov
-        uses: codecov/codecov-action@v5
-        with:
-          token: ${{ secrets.CODECOV_TOKEN }}
-          files: ./coverage.xml
-          flags: unittests,pytorch-${{ matrix.pytorch_version }}
-          fail_ci_if_error: false
-
-      - name: cleanup pip cache
-        run: |
-          find "$(pip cache dir)/http-v2" -type f -mtime +14 -exec rm {} \;
-
-      - name: Save HF cache
-        id: hf-cache
-        uses: actions/cache/save@v4
-        with:
-          path: |
-            /home/runner/.cache/huggingface/hub/datasets--*
-            /home/runner/.cache/huggingface/hub/models--*
-          key: ${{ steps.hf-cache-restore.outputs.cache-primary-key }}
-
  pytest:
    name: PyTest
    runs-on: ubuntu-latest
-    needs: [preload-cache]
    strategy:
      fail-fast: false
      max-parallel: 2
@@ -120,14 +33,11 @@ jobs:
      - name: Check out repository code
        uses: actions/checkout@v4

-      - name: Restore HF cache
-        id: hf-cache-restore
-        uses: actions/cache/restore@v4
-        with:
-          path: |
-            /home/runner/.cache/huggingface/hub/datasets--*
-            /home/runner/.cache/huggingface/hub/models--*
-          key: ${{ runner.os }}-hf-hub-cache-v2
+      - name: Restore Cache from S3
+        id: hf-cache-restore-s3
+        run: |
+          mkdir -p /home/runner/.cache/huggingface/hub
+          curl -L https://d1dttdx32dkk5p.cloudfront.net/hf-cache.tar.zst | tar -xf - -C /home/runner/.cache/huggingface/hub/  --use-compress-program unzstd

      - name: Setup Python
        uses: actions/setup-python@v5
@@ -168,10 +78,6 @@ jobs:
        run: |
          axolotl --help

-      - name: Pre-Download dataset fixture
-        run: |
-          huggingface-cli download --repo-type=dataset axolotl-ai-internal/axolotl-oss-dataset-fixtures
-
      - name: Run tests
        run: |
          pytest -v -n8 --dist loadfile --ignore=tests/e2e/ --ignore=tests/patched/ --ignore=tests/cli/ tests/
@@ -193,15 +99,8 @@ jobs:
      fail-fast: false
      matrix:
        include:
-          - cuda: 124
-            cuda_version: 12.4.1
-            python_version: "3.11"
-            pytorch: 2.5.1
-            num_gpus: 1
-            axolotl_extras:
-            nightly_build: "true"
-          - cuda: 124
-            cuda_version: 12.4.1
+          - cuda: 126
+            cuda_version: 12.6.3
            python_version: "3.11"
            pytorch: 2.6.0
            num_gpus: 1
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -195,12 +195,12 @@ jobs:
      fail-fast: false
      matrix:
        include:
-          - cuda: 124
-            cuda_version: 12.4.1
+          - cuda: 126
+            cuda_version: 12.6.3
            python_version: "3.11"
            pytorch: 2.6.0
            num_gpus: 1
-            axolotl_extras: vllm
+            axolotl_extras:
          - cuda: 126
            cuda_version: 12.6.3
            python_version: "3.11"
@@ -247,8 +247,8 @@ jobs:
      fail-fast: false
      matrix:
        include:
-          - cuda: 124
-            cuda_version: 12.4.1
+          - cuda: 126
+            cuda_version: 12.6.3
            python_version: "3.11"
            pytorch: 2.6.0
            num_gpus: 1
@@ -311,7 +311,7 @@ jobs:
            python_version: "3.11"
            pytorch: 2.6.0
            num_gpus: 1
-            axolotl_extras: vllm
+            axolotl_extras:
    steps:
      - name: Checkout
        uses: actions/checkout@v4
--- a/README.md
+++ b/README.md
@@ -59,6 +59,8 @@ Features:

 ### Installation

+#### Using pip
+
 ```bash
 pip3 install -U packaging==23.2 setuptools==75.8.0 wheel ninja
 pip3 install --no-build-isolation axolotl[flash-attn,deepspeed]
@@ -68,6 +70,13 @@ axolotl fetch examples
 axolotl fetch deepspeed_configs  # OPTIONAL
 ```

+#### Using Docker
+
+Installing with Docker can be less error prone than installing in your own environment.
+```bash
+docker run --gpus '"all"' --rm -it axolotlai/axolotl:main-latest
+```
+
 Other installation approaches are described [here](https://docs.axolotl.ai/docs/installation.html).

 ### Your First Fine-tune
--- a/cicd/single_gpu.py
+++ b/cicd/single_gpu.py
@@ -32,6 +32,8 @@ df_args = {
    "NIGHTLY_BUILD": os.environ.get("NIGHTLY_BUILD", ""),
    "CODECOV_TOKEN": os.environ.get("CODECOV_TOKEN", ""),
    "HF_HOME": "/workspace/data/huggingface-cache/hub",
+    "PYTHONUNBUFFERED": os.environ.get("PYTHONUNBUFFERED", "1"),
+    "DEEPSPEED_LOG_LEVEL": os.environ.get("DEEPSPEED_LOG_LEVEL", "WARNING"),
 }

 dockerfile_contents = df_template.render(**df_args)
--- a/docker/Dockerfile-base
+++ b/docker/Dockerfile-base
@@ -38,6 +38,6 @@ RUN git lfs install --skip-repo && \
    # The base image ships with `pydantic==1.8.2` which is not working
    pip3 install -U --no-cache-dir pydantic==1.10.10

-RUN if [ "$PYTORCH_VERSION" = "2.7.1" ] ; then \
-        pip3 install flash-attn==2.7.4.post1; \
+RUN if [ "$PYTORCH_VERSION" = "2.6.0" ] && [ "$CUDA" = "124" ] ; then \
+        FLASH_ATTENTION_FORCE_BUILD="TRUE" pip3 install --no-build-isolation flash-attn==2.8.0.post2; \
    fi
--- a/docker/Dockerfile-uv-base
+++ b/docker/Dockerfile-uv-base
@@ -34,7 +34,3 @@ RUN uv pip install packaging setuptools wheel psutil \
    && uv pip install --no-build-isolation "causal_conv1d @ git+https://github.com/Dao-AILab/causal-conv1d.git@main" \
    && uv pip install "mamba_ssm @ git+https://github.com/state-spaces/mamba.git@main" \
    && uv pip install awscli pydantic
-
-RUN if [ "$PYTORCH_VERSION" = "2.7.1" ] ; then \
-        uv pip install --no-build-isolation flash-attn==2.7.4.post1; \
-    fi
--- a/docs/docker.qmd
+++ b/docs/docker.qmd
@@ -9,7 +9,7 @@ format:
 This section describes the different Docker images that are released by AxolotlAI at [Docker Hub](https://hub.docker.com/u/axolotlai).

 ::: {.callout-important}
-For Blackwell GPUs, please use the tags with Pytorch 2.7.1 and CUDA 12.8.
+For Blackwell GPUs, please use the tags with PyTorch 2.7.1 and CUDA 12.8.
 :::

 ## Base
@@ -34,6 +34,7 @@ Tags examples:

 - `main-base-py3.11-cu128-2.7.1`
 - `main-base-py3.11-cu126-2.7.1`
+- `main-base-py3.11-cu126-2.6.0`
 - `main-base-py3.11-cu124-2.6.0`
 - `main-base-py3.11-cu124-2.5.1`

@@ -73,13 +74,15 @@ There may be some extra tags appended to the image, like `-vllm` which installs

 Tags examples:

- `main-py3.11-cu126-2.7.0`
+- `main-py3.11-cu128-2.7.1`
+- `main-py3.11-cu126-2.7.1`
+- `main-py3.11-cu126-2.6.0`
 - `main-py3.11-cu124-2.6.0`
 - `main-py3.11-cu124-2.5.1`
 - `main-latest`
 - `main-20250303-py3.11-cu124-2.6.0`
 - `main-20250303-py3.11-cu124-2.5.1`
- `0.9.2`
+- `0.10.1`

 ## Cloud

--- a/requirements.txt
+++ b/requirements.txt
@@ -1,7 +1,7 @@
 --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/

 # START section of dependencies that don't install on Darwin/MacOS
-bitsandbytes==0.45.4
+bitsandbytes==0.46.0
 triton>=3.0.0
 mamba-ssm==1.2.0.post1
 xformers>=0.0.23.post1
@@ -15,7 +15,7 @@ huggingface_hub==0.32.2
 peft==0.15.2
 transformers==4.52.4
 tokenizers>=0.21.1
-accelerate==1.7.0
+accelerate==1.8.1
 datasets==3.6.0
 deepspeed>=0.17.0
 trl==0.18.2
@@ -68,4 +68,4 @@ schedulefree==1.4.1
 axolotl-contribs-lgpl==0.0.6
 axolotl-contribs-mit==0.0.3

-mistral-common==1.6.0
+mistral-common==1.6.3
--- a/setup.py
+++ b/setup.py
@@ -111,9 +111,9 @@ def get_package_version():


 extras_require = {
-    "flash-attn": ["flash-attn==2.7.4.post1"],
+    "flash-attn": ["flash-attn==2.8.0.post2"],
    "ring-flash-attn": [
-        "flash-attn==2.7.4.post1",
+        "flash-attn==2.8.0.post2",
        "ring-flash-attn>=0.1.4",
        "yunchang==0.6.0",
    ],
--- a/src/axolotl/core/builders/base.py
+++ b/src/axolotl/core/builders/base.py
@@ -219,7 +219,9 @@ class TrainerBuilderBase(abc.ABC):
        if self.cfg.bf16 == "full":
            training_args_kwargs["bf16_full_eval"] = True
        else:
-            training_args_kwargs["bf16"] = self.cfg.bf16 or self.cfg.bfloat16
+            bf16 = self.cfg.bf16 or self.cfg.bfloat16
+            bf16 = bf16 if bf16 is not None else False
+            training_args_kwargs["bf16"] = bf16

    def _configure_scheduler(self, training_args_kwargs: dict):
        if self.cfg.lr_scheduler in ["one_cycle", "rex"]:
--- a/src/axolotl/core/builders/causal.py
+++ b/src/axolotl/core/builders/causal.py
@@ -253,6 +253,10 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
        training_arguments_kwargs["eval_sample_packing"] = bool(
            self.cfg.eval_sample_packing
        )
+        if self.cfg.sample_packing_sequentially is not None:
+            training_arguments_kwargs["sample_packing_sequentially"] = (
+                self.cfg.sample_packing_sequentially
+            )
        if self.cfg.sample_packing_bin_size is not None:
            training_arguments_kwargs["sample_packing_bin_size"] = (
                self.cfg.sample_packing_bin_size
--- a/src/axolotl/core/trainers/dpo/init.py
+++ b/src/axolotl/core/trainers/dpo/init.py
@@ -28,7 +28,7 @@ class DPOStrategy:
        training_args_kwargs["max_completion_length"] = None
        training_args_kwargs["max_length"] = cfg.sequence_len
        training_args_kwargs["max_prompt_length"] = cfg.sequence_len
-        training_args_kwargs["generate_during_eval"] = cfg.use_wandb
+        training_args_kwargs["generate_during_eval"] = cfg.dpo_generate_during_eval
        if cfg.dpo_use_weighting is not None:
            training_args_kwargs["use_weighting"] = cfg.dpo_use_weighting
        if cfg.dpo_padding_free is not None:
--- a/src/axolotl/loaders/patch_manager.py
+++ b/src/axolotl/loaders/patch_manager.py
@@ -65,6 +65,7 @@ class PatchManager:
        self._apply_mistral_cross_entropy_patch()
        self._apply_self_attention_lora_patch()
        self._apply_gemma3_conditional_generation_forward_patch()
+        self._apply_sequence_parallel_patches()

    def apply_post_model_load_patches(self, model: PreTrainedModel):
        """Apply patches that require the model instance."""
@@ -231,6 +232,17 @@ class PatchManager:

            patch_gemma3_conditional_generation_forward()

+    def _apply_sequence_parallel_patches(self):
+        """Apply sequence parallelism patches."""
+        if self.cfg.sequence_parallel_degree and self.cfg.sequence_parallel_degree > 1:
+            from axolotl.monkeypatch.ring_attn.patch import (
+                patch_prepare_data_loader,
+                patch_prepare_device_mesh,
+            )
+
+            patch_prepare_data_loader()
+            patch_prepare_device_mesh(self.cfg.sequence_parallel_degree, self.cfg.fsdp)
+
    def _patch_attention(self):
        """Apply attention-specific patches based on model type."""
        if not (self.cfg.flash_attention and hasattr(self.model_config, "model_type")):
--- a/src/axolotl/monkeypatch/ring_attn/patch.py
+++ b/src/axolotl/monkeypatch/ring_attn/patch.py
@@ -152,7 +152,7 @@ def update_ring_attn_params(position_ids: torch.Tensor | None):
 def patch_prepare_data_loader():
    """Patch `accelerate.data_loader.prepare_data_loader` to respect the SP degree.

-    Raies:
+    Raises:
        RuntimeError: If source code to patch does not exist.
    """
    original_fn = accelerate.data_loader.prepare_data_loader
@@ -168,23 +168,34 @@ def patch_prepare_data_loader():
        ORIGINAL_PREPARE_DATALOADER_CODE, NEW_PREPARE_DATALOADER_CODE
    )

+    items_to_import = []
+    for item in dir(accelerate.data_loader):
+        if item in patched_source:
+            items_to_import.append(item)
+
    # Create a new function from the patched source
    namespace = {}
    exec(  # pylint: disable=exec-used  # nosec B102
-        patched_source, accelerate.data_loader.__dict__, namespace
+        f"from accelerate.data_loader import ({', '.join(items_to_import)})",
+        globals(),
+    )
+    exec(  # pylint: disable=exec-used  # nosec B102
+        patched_source, globals(), namespace
    )
-    patched_function = namespace["prepare_data_loader"]

-    accelerate.data_loader.prepare_data_loader = patched_function
+    patched_function = namespace["prepare_data_loader"]
+    original_fn.__code__ = patched_function.__code__
+
    LOG.info("Patched accelerate.data_loader.prepare_data_loader for SP support")


-def patch_prepare_device_mesh(sequence_parallel_degree: int):
+def patch_prepare_device_mesh(sequence_parallel_degree: int, fsdp: bool = False):
    """Patches the `Accelerator._prepare_device_mesh` method to create a device mesh
    that includes sequence parallelism with the specified degree.

    Args:
-        sequence_parallel_degree (int): The degree of sequence parallelism to use.
+        sequence_parallel_degree: The degree of sequence parallelism to use.
+        fsdp: Whether to use FSDP.
    """

    def _prepare_device_mesh(self):
@@ -207,12 +218,14 @@ def patch_prepare_device_mesh(sequence_parallel_degree: int):
        )
        device_ids = list(range(world_size))

-        # Note that we use "cp" instead of "sp" to match the PyTorch native "context
-        # parallelism" implementation naming
+        # NOTE: We use "cp" instead of "sp" to match the PyTorch native "context
+        # parallelism" implementation naming.
+        # NOTE: We have a simplified FSDP handling here; i.e., if FSDP is enabled, we
+        # only use "fsdp" and "cp" for the device mesh.
        return dist.DeviceMesh(
            "cuda",
            torch.tensor(device_ids).reshape(mesh_shape),
-            mesh_dim_names=("dp", "cp"),
+            mesh_dim_names=("dp", "cp") if not fsdp else ("fsdp", "cp"),
        )

    # Replace the original method with our new method
--- a/src/axolotl/train.py
+++ b/src/axolotl/train.py
@@ -223,8 +223,9 @@ def execute_training(
            )

        LOG.info("Starting trainer...")
-        if cfg.bf16:
-            torch.set_default_dtype(torch.bfloat16)
+        # TODO: disabling for now as not compatible with FSDP2 + torchao low bit optimizers
+        # if cfg.bf16:
+        #     torch.set_default_dtype(torch.bfloat16)
        trainer.train(resume_from_checkpoint=resume_from_checkpoint)


--- a/src/axolotl/utils/ctx_managers/sequence_parallel.py
+++ b/src/axolotl/utils/ctx_managers/sequence_parallel.py
@@ -12,8 +12,6 @@ from transformers.utils import ModelOutput

 from axolotl.monkeypatch.ring_attn import (
    get_ring_attn_group,
-    patch_prepare_data_loader,
-    patch_prepare_device_mesh,
    register_ring_attn,
    update_ring_attn_params,
 )
@@ -238,12 +236,6 @@ class SequenceParallelContextManager:
            ring_attn_func=self.ring_attn_func,
        )

-        # Patches for accelerate functionality
-        patch_prepare_data_loader()
-        patch_prepare_device_mesh(
-            sequence_parallel_degree=self.sequence_parallel_degree
-        )
-
    def _register_model_hooks(self):
        # Forward pre-hook to apply sequence parallelism
        def sequence_parallel_pre_hook(_, args, kwargs):
--- a/src/axolotl/utils/data/shared.py
+++ b/src/axolotl/utils/data/shared.py
@@ -524,13 +524,24 @@ def merge_datasets(datasets: list[Dataset], cfg: DictDefault) -> Dataset:
        Merged dataset.
    """
    if len(datasets) == 1:
-        return datasets[0]
+        ds = datasets[0]
+
+        # Do not shuffle if curriculum sampling is enabled
+        if cfg.curriculum_sampling:
+            return ds
+
+        return ds.shuffle(seed=cfg.seed)

    LOG.info("Merging datasets...")
    merged_dataset = concatenate_datasets(datasets)

    if cfg.shuffle_merged_datasets:
        LOG.debug("Shuffling merged datasets...")
+        if cfg.curriculum_sampling:
+            LOG.warning(
+                "Shuffling merged datasets with curriculum sampling is not recommended. "
+                "This will randomize the order of samples."
+            )
        merged_dataset = merged_dataset.shuffle(seed=cfg.seed)
    else:
        LOG.debug("Not shuffling merged datasets.")
--- a/src/axolotl/utils/mistral_tokenizer.py
+++ b/src/axolotl/utils/mistral_tokenizer.py
@@ -8,7 +8,7 @@ from typing import TYPE_CHECKING, Optional
 import numpy as np
 from huggingface_hub import hf_hub_download
 from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
-from mistral_common.tokens.tokenizers.tekken import Tekkenizer
+from mistral_common.tokens.tokenizers.tekken import SpecialTokenPolicy, Tekkenizer
 from torch import Tensor
 from transformers.utils import PaddingStrategy

@@ -251,10 +251,13 @@ class HFMistralTokenizer:
            token_ids = [token_ids]

        if skip_special_tokens:
-            return self._mistral.instruct_tokenizer.tokenizer.decode(token_ids)
+            return self._mistral.instruct_tokenizer.tokenizer.decode(
+                token_ids, special_token_policy=SpecialTokenPolicy.IGNORE
+            )

-        # to_string returns a string with special tokens
-        return self._mistral.instruct_tokenizer.tokenizer.to_string(token_ids)
+        return self._mistral.instruct_tokenizer.tokenizer.decode(
+            token_ids, special_token_policy=SpecialTokenPolicy.KEEP
+        )

    def _create_mistral_chat_completion_request(
        self, conversation: list[dict], tools: list[dict] | None = None
--- a/src/axolotl/utils/schemas/config.py
+++ b/src/axolotl/utils/schemas/config.py
@@ -146,6 +146,7 @@ class AxolotlInputConfig(
    dpo_label_smoothing: float | None = None
    dpo_norm_loss: bool | None = None
    dpo_padding_free: bool | None = None
+    dpo_generate_during_eval: bool | None = None

    datasets: (
        Annotated[
--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -4,12 +4,14 @@ shared pytest fixtures

 import functools
 import importlib
+import logging
 import os
 import shutil
 import sys
 import tempfile
 import time
 from pathlib import Path
+from typing import Generator

 import datasets
 import pytest
@@ -24,6 +26,8 @@ from tests.hf_offline_utils import (
    hf_offline_context,
 )

+logging.getLogger("filelock").setLevel(logging.CRITICAL)
+

 def retry_on_request_exceptions(max_retries=3, delay=1):
    # pylint: disable=duplicate-code
@@ -411,7 +415,16 @@ def tokenizer_mistral_7b_instruct_chatml(tokenizer_mistral_7b_instruct):


@pytest.fixture
-def temp_dir():
+def temp_dir() -> Generator[str, None, None]:
+    # Create a temporary directory
+    _temp_dir = tempfile.mkdtemp()
+    yield _temp_dir
+    # Clean up the directory after the test
+    shutil.rmtree(_temp_dir)
+
+
+@pytest.fixture(scope="module")
+def module_temp_dir() -> Generator[str, None, None]:
    # Create a temporary directory
    _temp_dir = tempfile.mkdtemp()
    yield _temp_dir
--- a/tests/e2e/multigpu/patched/test_sp.py
+++ b/tests/e2e/multigpu/patched/test_sp.py
@@ -54,6 +54,7 @@ class TestSequenceParallelism:
                "micro_batch_size": micro_batch_size,
                "gradient_accumulation_steps": 2,
                "output_dir": temp_dir,
+                "dataset_prepared_path": temp_dir + "/last_run_prepared",
                "learning_rate": 0.00001,
                "optimizer": "adamw_8bit",
                "lr_scheduler": "cosine",
--- a/tests/e2e/multigpu/solo/test_flex.py
+++ b/tests/e2e/multigpu/solo/test_flex.py
@@ -54,6 +54,7 @@ class TestPackedFlex:
                "gradient_accumulation_steps": 2,
                "gradient_checkpointing": True,
                "output_dir": temp_dir,
+                "dataset_prepared_path": temp_dir + "/last_run_prepared",
                "learning_rate": 0.00001,
                "optimizer": "adamw_torch_fused",
                "lr_scheduler": "cosine",
--- a/tests/e2e/multigpu/solo/test_grpo.py
+++ b/tests/e2e/multigpu/solo/test_grpo.py
@@ -309,6 +309,7 @@ def oai_gsm8k_transform(cfg, *args, **kwargs):
                "warmup_steps": 10,
                "val_set_size": 0.0,
                "output_dir": temp_dir,
+                "dataset_prepared_path": temp_dir + "/last_run_prepared",
                "learning_rate": 0.0001,
                "optimizer": "adamw_torch_fused",
                "lr_scheduler": "cosine",
@@ -400,6 +401,7 @@ def oai_gsm8k_transform(cfg, *args, **kwargs):
                "warmup_steps": 10,
                "val_set_size": 0.0,
                "output_dir": temp_dir,
+                "dataset_prepared_path": temp_dir + "/last_run_prepared",
                "learning_rate": 0.0001,
                "optimizer": "adamw_torch_fused",
                "lr_scheduler": "cosine",
--- a/tests/e2e/multigpu/test_eval.py
+++ b/tests/e2e/multigpu/test_eval.py
@@ -38,12 +38,13 @@ class TestMultiGPUEval:
                "lora_dropout": 0.05,
                "lora_target_linear": True,
                "lora_modules_to_save": ["embed_tokens", "lm_head"],
-                "val_set_size": 0.004,
+                "val_set_size": 0.05,
                "special_tokens": {"pad_token": "<|endoftext|>"},
                "datasets": [
                    {
                        "path": "teknium/GPT4-LLM-Cleaned",
                        "type": "alpaca",
+                        "split": "train[:5%]",
                    },
                ],
                "num_epochs": 1,
@@ -51,6 +52,7 @@ class TestMultiGPUEval:
                "micro_batch_size": 2,
                "gradient_accumulation_steps": 2,
                "output_dir": temp_dir,
+                "dataset_prepared_path": temp_dir + "/last_run_prepared",
                "learning_rate": 0.00001,
                "optimizer": "adamw_8bit",
                "lr_scheduler": "cosine",
@@ -107,12 +109,13 @@ class TestMultiGPUEval:
                "lora_dropout": 0.05,
                "lora_target_linear": True,
                "lora_modules_to_save": ["embed_tokens", "lm_head"],
-                "val_set_size": 0.0004,
+                "val_set_size": 0.01,
                "special_tokens": {"pad_token": "<|endoftext|>"},
                "datasets": [
                    {
                        "path": "teknium/GPT4-LLM-Cleaned",
                        "type": "alpaca",
+                        "split": "train[:5%]",
                    },
                ],
                "num_epochs": 1,
@@ -120,6 +123,7 @@ class TestMultiGPUEval:
                "micro_batch_size": 2,
                "gradient_accumulation_steps": 2,
                "output_dir": temp_dir,
+                "dataset_prepared_path": temp_dir + "/last_run_prepared",
                "learning_rate": 0.00001,
                "optimizer": "adamw_8bit",
                "lr_scheduler": "cosine",
--- a/tests/e2e/multigpu/test_gemma3.py
+++ b/tests/e2e/multigpu/test_gemma3.py
@@ -64,6 +64,7 @@ class TestMultiGPUGemma3:
                },
                "gradient_accumulation_steps": 2,
                "output_dir": temp_dir,
+                "dataset_prepared_path": temp_dir + "/last_run_prepared",
                "learning_rate": 0.0001,
                "optimizer": "adamw_8bit",
                "lr_scheduler": "cosine",
--- a/tests/e2e/multigpu/test_llama.py
+++ b/tests/e2e/multigpu/test_llama.py
@@ -2,6 +2,8 @@
 E2E tests for multigpu lora tinyllama
 """

+# pylint: disable=redefined-outer-name
+
 from pathlib import Path

 import pytest
@@ -25,6 +27,60 @@ def download_model():
    snapshot_download("HuggingFaceTB/SmolLM2-135M")


+@pytest.fixture(scope="module")
+def sft_base_cfg():
+    cfg = DictDefault(
+        base_model="HuggingFaceTB/SmolLM2-135M",
+        tokenizer_config="HuggingFaceTB/SmolLM2-135M",  # this has to be manually set since we haven't done validation
+        sequence_len=1024,
+        special_tokens={
+            "pad_token": "<|endoftext|>",
+        },
+        datasets=[
+            {
+                "path": "tatsu-lab/alpaca",
+                "type": "alpaca",
+                "split": "train[:10%]",
+            },
+        ],
+        val_set_size=0.1,
+        sample_packing=True,
+        flash_attention=True,
+        learning_rate=0.00001,
+        optimizer="adamw_8bit",
+        seed=42,
+        # these need to be set since we aren't running schema validation
+        micro_batch_size=2,
+        gradient_accumulation_steps=1,
+    )
+
+    return cfg
+
+
+@pytest.fixture(scope="module", name="sft_prepared_dataset_alpaca_cfg")
+def sft_prepared_dataset_alpaca_cfg(module_temp_dir, sft_base_cfg):
+    dataset_prepared_path = module_temp_dir + "/last_run_prepared"
+    cfg = sft_base_cfg | DictDefault(
+        dataset_prepared_path=dataset_prepared_path,
+    )
+
+    Path(module_temp_dir).mkdir(parents=True, exist_ok=True)
+    with open(Path(module_temp_dir) / "config.yaml", "w", encoding="utf-8") as fout:
+        fout.write(yaml.dump(cfg.to_dict(), Dumper=yaml.Dumper))
+
+    execute_subprocess_async(
+        [
+            "axolotl",
+            "preprocess",
+            str(Path(module_temp_dir) / "config.yaml"),
+        ]
+    )
+
+    # unset flash attention since we have some flex attention tests too
+    cfg.flash_attention = None
+    return cfg
+
+
 def transformers_version_eq(required_version):
    return version.parse(transformers.__version__) == version.parse(required_version)

@@ -62,6 +118,7 @@ class TestMultiGPULlama:
                "gradient_accumulation_steps": 2,
                # "gradient_checkpointing": True,
                "output_dir": temp_dir,
+                "dataset_prepared_path": temp_dir + "/last_run_prepared",
                "learning_rate": 0.00001,
                "optimizer": "adamw_8bit",
                "lr_scheduler": "cosine",
@@ -96,44 +153,36 @@ class TestMultiGPULlama:
        "gradient_accumulation_steps",
        [1, 2],
    )
-    def test_lora_ddp_packed(self, temp_dir, gradient_accumulation_steps):
+    def test_lora_ddp_packed(
+        self, temp_dir, sft_prepared_dataset_alpaca_cfg, gradient_accumulation_steps
+    ):
        # pylint: disable=duplicate-code
-        cfg = DictDefault(
-            {
-                "base_model": "HuggingFaceTB/SmolLM2-135M",
-                "sequence_len": 2048,
-                "sample_packing": True,
-                "eval_sample_packing": False,
-                "pad_to_sequence_len": True,
-                "adapter": "lora",
-                "lora_r": 8,
-                "lora_alpha": 16,
-                "lora_dropout": 0.05,
-                "lora_target_linear": True,
-                "val_set_size": 0.05,
-                "special_tokens": {
-                    "pad_token": "<|endoftext|>",
-                },
-                "datasets": [
-                    {
-                        "path": "tatsu-lab/alpaca",
-                        "type": "alpaca",
-                        "split": "train[:20%]",
-                    },
-                ],
-                "num_epochs": 1,
-                "max_steps": 2,
-                "micro_batch_size": 1,
-                "gradient_accumulation_steps": gradient_accumulation_steps,
-                # "gradient_checkpointing": True,
-                "output_dir": temp_dir,
-                "learning_rate": 0.00001,
-                "optimizer": "adamw_8bit",
-                "lr_scheduler": "cosine",
-                "flash_attention": True,
-                "use_tensorboard": True,
-                "bf16": True,
-            }
+        cfg = (
+            DictDefault(
+                {
+                    "eval_sample_packing": False,
+                    "pad_to_sequence_len": True,
+                    "adapter": "lora",
+                    "lora_r": 8,
+                    "lora_alpha": 16,
+                    "lora_dropout": 0.05,
+                    "lora_target_linear": True,
+                    "val_set_size": 0.05,
+                    "num_epochs": 1,
+                    "max_steps": 2,
+                    "micro_batch_size": 1,
+                    "gradient_accumulation_steps": gradient_accumulation_steps,
+                    # "gradient_checkpointing": True,
+                    "output_dir": temp_dir,
+                    "learning_rate": 0.00001,
+                    "optimizer": "adamw_8bit",
+                    "lr_scheduler": "cosine",
+                    "flash_attention": True,
+                    "use_tensorboard": True,
+                    "bf16": True,
+                }
+            )
+            | sft_prepared_dataset_alpaca_cfg
        )

        # write cfg to yaml file
@@ -200,6 +249,7 @@ class TestMultiGPULlama:
                "gradient_accumulation_steps": 2,
                # "gradient_checkpointing": True,
                "output_dir": temp_dir,
+                "dataset_prepared_path": temp_dir + "/last_run_prepared",
                "warmup_steps": 0,
                "learning_rate": 0.00001,
                "optimizer": "adamw_8bit",
@@ -278,6 +328,7 @@ class TestMultiGPULlama:
                "gradient_accumulation_steps": 2,
                # "gradient_checkpointing": True,
                "output_dir": temp_dir,
+                "dataset_prepared_path": temp_dir + "/last_run_prepared",
                "warmup_steps": 0,
                "learning_rate": 0.00001,
                "optimizer": "adamw_8bit",
@@ -340,6 +391,7 @@ class TestMultiGPULlama:
                "gradient_accumulation_steps": gradient_accumulation_steps,
                # "gradient_checkpointing": True,
                "output_dir": temp_dir,
+                "dataset_prepared_path": temp_dir + "/last_run_prepared",
                "learning_rate": 0.00001,
                "optimizer": "adamw_torch_fused",
                "lr_scheduler": "cosine",
@@ -380,58 +432,50 @@ class TestMultiGPULlama:
        )

        check_tensorboard(
-            temp_dir + "/runs", "train/train_loss", 2.3, "Train Loss (%s) is too high"
+            temp_dir + "/runs", "train/train_loss", 2.5, "Train Loss (%s) is too high"
        )

    @pytest.mark.parametrize(
        "fsdp_state_dict_type",
        ["FULL_STATE_DICT", "SHARDED_STATE_DICT"],
    )
-    def test_fsdp_packed(self, temp_dir, fsdp_state_dict_type):
+    def test_fsdp_packed(
+        self, temp_dir, sft_prepared_dataset_alpaca_cfg, fsdp_state_dict_type
+    ):
        # pylint: disable=duplicate-code
-        cfg = DictDefault(
-            {
-                "base_model": "HuggingFaceTB/SmolLM2-135M",
-                "sample_packing": True,
-                "pad_to_sequence_len": True,
-                "sequence_len": 1024,
-                "val_set_size": 0.05,
-                "special_tokens": {
-                    "pad_token": "<|endoftext|>",
-                },
-                "datasets": [
-                    {
-                        "path": "tatsu-lab/alpaca",
-                        "type": "alpaca",
-                        "split": "train[:10%]",
+        cfg = (
+            DictDefault(
+                {
+                    "pad_to_sequence_len": True,
+                    "num_epochs": 1,
+                    "max_steps": 2,
+                    "micro_batch_size": 2,
+                    "gradient_accumulation_steps": 2,
+                    # "gradient_checkpointing": True,
+                    "output_dir": temp_dir,
+                    "dataset_prepared_path": temp_dir + "/last_run_prepared",
+                    "learning_rate": 0.00001,
+                    "optimizer": "adamw_torch_fused",
+                    "lr_scheduler": "cosine",
+                    "flash_attention": True,
+                    "fsdp": [
+                        "full_shard",
+                        "auto_wrap",
+                    ],
+                    "fsdp_config": {
+                        "fsdp_limit_all_gathers": True,
+                        "fsdp_offload_params": False,
+                        "fsdp_sync_module_states": True,
+                        "fsdp_use_orig_params": False,
+                        "fsdp_cpu_ram_efficient_loading": False,
+                        "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer",
+                        "fsdp_state_dict_type": fsdp_state_dict_type,
+                        "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
                    },
-                ],
-                "num_epochs": 1,
-                "max_steps": 2,
-                "micro_batch_size": 2,
-                "gradient_accumulation_steps": 2,
-                # "gradient_checkpointing": True,
-                "output_dir": temp_dir,
-                "learning_rate": 0.00001,
-                "optimizer": "adamw_torch_fused",
-                "lr_scheduler": "cosine",
-                "flash_attention": True,
-                "fsdp": [
-                    "full_shard",
-                    "auto_wrap",
-                ],
-                "fsdp_config": {
-                    "fsdp_limit_all_gathers": True,
-                    "fsdp_offload_params": False,
-                    "fsdp_sync_module_states": True,
-                    "fsdp_use_orig_params": False,
-                    "fsdp_cpu_ram_efficient_loading": False,
-                    "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer",
-                    "fsdp_state_dict_type": fsdp_state_dict_type,
-                    "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
-                },
-                "use_tensorboard": True,
-            }
+                    "use_tensorboard": True,
+                }
+            )
+            | sft_prepared_dataset_alpaca_cfg
        )

        # write cfg to yaml file
@@ -452,7 +496,7 @@ class TestMultiGPULlama:
        )

        check_tensorboard(
-            temp_dir + "/runs", "train/train_loss", 2.3, "Train Loss (%s) is too high"
+            temp_dir + "/runs", "train/train_loss", 2.4, "Train Loss (%s) is too high"
        )

    @require_torch_2_6_0
@@ -465,50 +509,43 @@ class TestMultiGPULlama:
        [True, False],
    )
    def test_fsdp2_packed(
-        self, temp_dir, attention_backend, fsdp_reshard_after_forward
+        self,
+        temp_dir,
+        sft_prepared_dataset_alpaca_cfg,
+        attention_backend,
+        fsdp_reshard_after_forward,
    ):
        # pylint: disable=duplicate-code
-        cfg = DictDefault(
-            {
-                "base_model": "HuggingFaceTB/SmolLM2-135M",
-                "sample_packing": True,
-                "pad_to_sequence_len": True,
-                "sequence_len": 2048,
-                "val_set_size": 0.1,
-                "special_tokens": {
-                    "pad_token": "<|endoftext|>",
-                },
-                "datasets": [
-                    {
-                        "path": "tatsu-lab/alpaca",
-                        "type": "alpaca",
-                        "split": "train[:10%]",
+        cfg = (
+            DictDefault(
+                {
+                    "pad_to_sequence_len": True,
+                    "num_epochs": 1,
+                    "max_steps": 2,
+                    "micro_batch_size": 4,
+                    "gradient_accumulation_steps": 2,
+                    "gradient_checkpointing": True,
+                    "output_dir": temp_dir,
+                    "learning_rate": 0.00001,
+                    "optimizer": "adamw_torch_8bit",
+                    "lr_scheduler": "cosine",
+                    "fsdp": [
+                        "auto_wrap",
+                    ],
+                    "fsdp_config": {
+                        "fsdp_version": 2,
+                        # "fsdp_forward_prefetch": True,  # not yet implemented in accelerate
+                        "fsdp_offload_params": False,
+                        "fsdp_cpu_ram_efficient_loading": False,
+                        "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer",
+                        "fsdp_state_dict_type": "SHARDED_STATE_DICT",
+                        "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
+                        "fsdp_reshard_after_forward": fsdp_reshard_after_forward,
                    },
-                ],
-                "num_epochs": 1,
-                "max_steps": 2,
-                "micro_batch_size": 4,
-                "gradient_accumulation_steps": 2,
-                "gradient_checkpointing": True,
-                "output_dir": temp_dir,
-                "learning_rate": 0.00001,
-                "optimizer": "adamw_torch_8bit",
-                "lr_scheduler": "cosine",
-                "fsdp": [
-                    "auto_wrap",
-                ],
-                "fsdp_config": {
-                    "fsdp_version": 2,
-                    # "fsdp_forward_prefetch": True,  # not yet implemented in accelerate
-                    "fsdp_offload_params": False,
-                    "fsdp_cpu_ram_efficient_loading": False,
-                    "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer",
-                    "fsdp_state_dict_type": "SHARDED_STATE_DICT",
-                    "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
-                    "fsdp_reshard_after_forward": fsdp_reshard_after_forward,
-                },
-                "use_tensorboard": True,
-            }
+                    "use_tensorboard": True,
+                }
+            )
+            | sft_prepared_dataset_alpaca_cfg
        )
        if attention_backend == "flash":
            cfg.flash_attention = True
@@ -536,63 +573,55 @@ class TestMultiGPULlama:
            temp_dir + "/runs", "train/train_loss", 2.1, "Train Loss (%s) is too high"
        )

-    def test_fsdp_qlora_prequant_packed(self, temp_dir):
+    def test_fsdp_qlora_prequant_packed(
+        self, temp_dir, sft_prepared_dataset_alpaca_cfg
+    ):
        # pylint: disable=duplicate-code
-        cfg = DictDefault(
-            {
-                "base_model": "axolotl-ai-co/SmolLM2-135M-bnb-nf4-bf16",
-                "adapter": "qlora",
-                "mean_resizing_embeddings": True,
-                "load_in_4bit": True,
-                "lora_r": 8,
-                "lora_alpha": 16,
-                "lora_dropout": 0.05,
-                "lora_target_linear": True,
-                # "lora_modules_to_save": [
-                #     "embed_tokens",
-                #     "lm_head",
-                # ],
-                "sample_packing": True,
-                "eval_sample_packing": False,
-                "pad_to_sequence_len": True,
-                "sequence_len": 1024,
-                "val_set_size": 0.01,
-                "special_tokens": {
-                    "pad_token": "<|endoftext|>",
-                },
-                "datasets": [
-                    {
-                        "path": "tatsu-lab/alpaca",
-                        "type": "alpaca",
-                        "split": "train[:10%]",
+        cfg = (
+            DictDefault(
+                {
+                    "base_model": "axolotl-ai-co/SmolLM2-135M-bnb-nf4-bf16",
+                    "adapter": "qlora",
+                    "mean_resizing_embeddings": True,
+                    "load_in_4bit": True,
+                    "lora_r": 8,
+                    "lora_alpha": 16,
+                    "lora_dropout": 0.05,
+                    "lora_target_linear": True,
+                    # "lora_modules_to_save": [
+                    #     "embed_tokens",
+                    #     "lm_head",
+                    # ],
+                    "eval_sample_packing": False,
+                    "pad_to_sequence_len": True,
+                    "num_epochs": 1,
+                    "max_steps": 2,
+                    "micro_batch_size": 2,
+                    "gradient_accumulation_steps": 2,
+                    # "gradient_checkpointing": True,
+                    "output_dir": temp_dir,
+                    "learning_rate": 0.00001,
+                    "optimizer": "adamw_torch_fused",
+                    "lr_scheduler": "cosine",
+                    "flash_attention": True,
+                    "fsdp": [
+                        "full_shard",
+                        "auto_wrap",
+                    ],
+                    "fsdp_config": {
+                        "fsdp_limit_all_gathers": True,
+                        "fsdp_offload_params": False,
+                        "fsdp_sync_module_states": True,
+                        "fsdp_use_orig_params": False,
+                        "fsdp_cpu_ram_efficient_loading": True,
+                        "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer",
+                        "fsdp_state_dict_type": "SHARDED_STATE_DICT",
+                        "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
                    },
-                ],
-                "num_epochs": 1,
-                "max_steps": 2,
-                "micro_batch_size": 2,
-                "gradient_accumulation_steps": 2,
-                # "gradient_checkpointing": True,
-                "output_dir": temp_dir,
-                "learning_rate": 0.00001,
-                "optimizer": "adamw_torch_fused",
-                "lr_scheduler": "cosine",
-                "flash_attention": True,
-                "fsdp": [
-                    "full_shard",
-                    "auto_wrap",
-                ],
-                "fsdp_config": {
-                    "fsdp_limit_all_gathers": True,
-                    "fsdp_offload_params": False,
-                    "fsdp_sync_module_states": True,
-                    "fsdp_use_orig_params": False,
-                    "fsdp_cpu_ram_efficient_loading": True,
-                    "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer",
-                    "fsdp_state_dict_type": "SHARDED_STATE_DICT",
-                    "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
-                },
-                "use_tensorboard": True,
-            }
+                    "use_tensorboard": True,
+                }
+            )
+            | sft_prepared_dataset_alpaca_cfg
        )

        # write cfg to yaml file
@@ -633,7 +662,12 @@ class TestMultiGPULlama:
        [True, False],
    )
    def test_ds_zero3_packed(
-        self, temp_dir, gradient_accumulation_steps, deepspeed, qlora
+        self,
+        temp_dir,
+        sft_prepared_dataset_alpaca_cfg,
+        gradient_accumulation_steps,
+        deepspeed,
+        qlora,
    ):
        # pylint: disable=duplicate-code
        if qlora:
@@ -647,36 +681,25 @@ class TestMultiGPULlama:
            }
        else:
            adapter = {}
-        cfg = DictDefault(
-            {
-                "base_model": "HuggingFaceTB/SmolLM2-135M",
-                "sample_packing": True,
-                "pad_to_sequence_len": True,
-                "sequence_len": 1024,
-                "val_set_size": 0.05,
-                "special_tokens": {
-                    "pad_token": "<|endoftext|>",
-                },
-                "datasets": [
-                    {
-                        "path": "tatsu-lab/alpaca",
-                        "type": "alpaca",
-                        "split": "train[:10%]",
-                    },
-                ],
-                "num_epochs": 1,
-                "max_steps": 2,
-                "micro_batch_size": 1,
-                "gradient_accumulation_steps": gradient_accumulation_steps,
-                "output_dir": temp_dir,
-                "learning_rate": 0.00001,
-                "optimizer": "adamw_torch_fused",
-                "lr_scheduler": "cosine",
-                "flash_attention": True,
-                "deepspeed": str(AXOLOTL_ROOT / deepspeed),
-                "use_tensorboard": True,
-                **adapter,
-            }
+        cfg = (
+            DictDefault(
+                {
+                    "pad_to_sequence_len": True,
+                    "num_epochs": 1,
+                    "max_steps": 2,
+                    "micro_batch_size": 1,
+                    "gradient_accumulation_steps": gradient_accumulation_steps,
+                    "output_dir": temp_dir,
+                    "learning_rate": 0.00001,
+                    "optimizer": "adamw_torch_fused",
+                    "lr_scheduler": "cosine",
+                    "flash_attention": True,
+                    "deepspeed": str(AXOLOTL_ROOT / deepspeed),
+                    "use_tensorboard": True,
+                    **adapter,
+                }
+            )
+            | sft_prepared_dataset_alpaca_cfg
        )

        # write cfg to yaml file
@@ -697,7 +720,7 @@ class TestMultiGPULlama:
        )

        check_tensorboard(
-            temp_dir + "/runs", "train/train_loss", 2.4, "Train Loss (%s) is too high"
+            temp_dir + "/runs", "train/train_loss", 2.5, "Train Loss (%s) is too high"
        )

    @pytest.mark.parametrize(
@@ -708,7 +731,13 @@ class TestMultiGPULlama:
        "qlora",
        [True, False],
    )
-    def test_ds_zero2_packed(self, temp_dir, gradient_accumulation_steps, qlora):
+    def test_ds_zero2_packed(
+        self,
+        temp_dir,
+        sft_prepared_dataset_alpaca_cfg,
+        gradient_accumulation_steps,
+        qlora,
+    ):
        # pylint: disable=duplicate-code
        if qlora:
            adapter = {
@@ -721,36 +750,25 @@ class TestMultiGPULlama:
            }
        else:
            adapter = {}
-        cfg = DictDefault(
-            {
-                "base_model": "HuggingFaceTB/SmolLM2-135M",
-                "sample_packing": True,
-                "pad_to_sequence_len": True,
-                "sequence_len": 1024,
-                "val_set_size": 0.01,
-                "special_tokens": {
-                    "pad_token": "<|endoftext|>",
-                },
-                "datasets": [
-                    {
-                        "path": "tatsu-lab/alpaca",
-                        "type": "alpaca",
-                        "split": "train[:10%]",
-                    },
-                ],
-                "num_epochs": 1,
-                "max_steps": 2,
-                "micro_batch_size": 1,
-                "gradient_accumulation_steps": gradient_accumulation_steps,
-                "output_dir": temp_dir,
-                "learning_rate": 0.00001,
-                "optimizer": "adamw_torch_fused",
-                "lr_scheduler": "cosine",
-                "flash_attention": True,
-                "deepspeed": str(AXOLOTL_ROOT / "deepspeed_configs/zero2.json"),
-                "use_tensorboard": True,
-                **adapter,
-            }
+        cfg = (
+            DictDefault(
+                {
+                    "pad_to_sequence_len": True,
+                    "num_epochs": 1,
+                    "max_steps": 2,
+                    "micro_batch_size": 1,
+                    "gradient_accumulation_steps": gradient_accumulation_steps,
+                    "output_dir": temp_dir,
+                    "learning_rate": 0.00001,
+                    "optimizer": "adamw_torch_fused",
+                    "lr_scheduler": "cosine",
+                    "flash_attention": True,
+                    "deepspeed": str(AXOLOTL_ROOT / "deepspeed_configs/zero2.json"),
+                    "use_tensorboard": True,
+                    **adapter,
+                }
+            )
+            | sft_prepared_dataset_alpaca_cfg
        )

        # write cfg to yaml file
@@ -771,7 +789,7 @@ class TestMultiGPULlama:
        )

        check_tensorboard(
-            temp_dir + "/runs", "train/train_loss", 2.3, "Train Loss (%s) is too high"
+            temp_dir + "/runs", "train/train_loss", 2.5, "Train Loss (%s) is too high"
        )

    @pytest.mark.parametrize(
@@ -782,7 +800,13 @@ class TestMultiGPULlama:
        "qlora",
        [True, False],
    )
-    def test_ds_zero1_packed(self, temp_dir, gradient_accumulation_steps, qlora):
+    def test_ds_zero1_packed(
+        self,
+        temp_dir,
+        sft_prepared_dataset_alpaca_cfg,
+        gradient_accumulation_steps,
+        qlora,
+    ):
        # pylint: disable=duplicate-code
        if qlora:
            adapter = {
@@ -795,36 +819,25 @@ class TestMultiGPULlama:
            }
        else:
            adapter = {}
-        cfg = DictDefault(
-            {
-                "base_model": "HuggingFaceTB/SmolLM2-135M",
-                "sample_packing": True,
-                "pad_to_sequence_len": True,
-                "sequence_len": 1024,
-                "val_set_size": 0.01,
-                "special_tokens": {
-                    "pad_token": "<|endoftext|>",
-                },
-                "datasets": [
-                    {
-                        "path": "tatsu-lab/alpaca",
-                        "type": "alpaca",
-                        "split": "train[:10%]",
-                    },
-                ],
-                "num_epochs": 1,
-                "max_steps": 2,
-                "micro_batch_size": 1,
-                "gradient_accumulation_steps": gradient_accumulation_steps,
-                "output_dir": temp_dir,
-                "learning_rate": 0.00001,
-                "optimizer": "adamw_torch_fused",
-                "lr_scheduler": "cosine",
-                "flash_attention": True,
-                "deepspeed": str(AXOLOTL_ROOT / "deepspeed_configs/zero1.json"),
-                "use_tensorboard": True,
-                **adapter,
-            }
+        cfg = (
+            DictDefault(
+                {
+                    "pad_to_sequence_len": True,
+                    "num_epochs": 1,
+                    "max_steps": 2,
+                    "micro_batch_size": 1,
+                    "gradient_accumulation_steps": gradient_accumulation_steps,
+                    "output_dir": temp_dir,
+                    "learning_rate": 0.00001,
+                    "optimizer": "adamw_torch_fused",
+                    "lr_scheduler": "cosine",
+                    "flash_attention": True,
+                    "deepspeed": str(AXOLOTL_ROOT / "deepspeed_configs/zero1.json"),
+                    "use_tensorboard": True,
+                    **adapter,
+                }
+            )
+            | sft_prepared_dataset_alpaca_cfg
        )

        # write cfg to yaml file
@@ -845,7 +858,7 @@ class TestMultiGPULlama:
        )

        check_tensorboard(
-            temp_dir + "/runs", "train/train_loss", 2.3, "Train Loss (%s) is too high"
+            temp_dir + "/runs", "train/train_loss", 2.5, "Train Loss (%s) is too high"
        )

    @pytest.mark.skip(
--- a/tests/e2e/multigpu/test_qwen2.py
+++ b/tests/e2e/multigpu/test_qwen2.py
@@ -46,6 +46,7 @@ class TestMultiGPUQwen2:
                "micro_batch_size": 2,
                "gradient_accumulation_steps": 2,
                "output_dir": temp_dir,
+                "dataset_prepared_path": temp_dir + "/last_run_prepared",
                "learning_rate": 0.00001,
                "optimizer": "adamw_torch_fused",
                "lr_scheduler": "cosine",
--- a/tests/e2e/multigpu/test_ray.py
+++ b/tests/e2e/multigpu/test_ray.py
@@ -48,6 +48,7 @@ class TestMultiGPURay:
                "micro_batch_size": 4,
                "gradient_accumulation_steps": 2,
                "output_dir": temp_dir,
+                "dataset_prepared_path": temp_dir + "/last_run_prepared",
                "learning_rate": 0.00001,
                "optimizer": "adamw_8bit",
                "lr_scheduler": "cosine",
@@ -107,6 +108,7 @@ class TestMultiGPURay:
                "micro_batch_size": 1,
                "gradient_accumulation_steps": gradient_accumulation_steps,
                "output_dir": temp_dir,
+                "dataset_prepared_path": temp_dir + "/last_run_prepared",
                "learning_rate": 0.00001,
                "optimizer": "adamw_torch",
                "lr_scheduler": "cosine",
--- a/tests/e2e/patched/lora_kernels/test_lora_kernel_patching.py
+++ b/tests/e2e/patched/lora_kernels/test_lora_kernel_patching.py
@@ -396,7 +396,7 @@ def test_model_architecture(model_config):


 # pylint: disable=duplicate-code
-def test_kernel_training_integration():
+def test_kernel_training_integration(temp_dir):
    """Test model loading with kernel patches enabled."""
    from axolotl.cli.utils import load_model_and_tokenizer

@@ -426,6 +426,14 @@ def test_kernel_training_integration():
        }
    )

+    # Write cfg to yaml file
+    path = Path(temp_dir) / "config.yaml"
+    with open(path, "w", encoding="utf-8") as fout:
+        fout.write(yaml.dump(cfg.to_dict(), Dumper=yaml.Dumper))
+
+    # Load config
+    cfg = load_cfg(str(path))
+
    # Load model
    model, _, _ = load_model_and_tokenizer(cfg=cfg)

@@ -505,7 +513,7 @@ def test_kernel_training_integration_auto_enable(temp_dir):
    assert found_patched_attn


-def test_kernel_training_integration_dropout_non_zero():
+def test_kernel_training_integration_dropout_non_zero(temp_dir):
    """Test model loading with dropout non-zero should not patch."""

    from axolotl.cli.utils import load_model_and_tokenizer
@@ -533,6 +541,14 @@ def test_kernel_training_integration_dropout_non_zero():
        }
    )

+    # Write cfg to yaml file
+    path = Path(temp_dir) / "config.yaml"
+    with open(path, "w", encoding="utf-8") as fout:
+        fout.write(yaml.dump(cfg.to_dict(), Dumper=yaml.Dumper))
+
+    # Load config
+    cfg = load_cfg(str(path))
+
    # Get original attention class
    attention_cls = get_attention_cls_from_config(cfg)
Author	SHA1	Message	Date
Wing Lian	b79996bdc4	tweak loss	2025-07-06 19:42:43 -04:00
Wing Lian	68368de7ed	add seed for stable reproducibility	2025-07-06 19:29:51 -04:00
Wing Lian	a94c4a014b	tweak acceptable loss from changed hyperparams	2025-07-06 19:25:26 -04:00
Wing Lian	0102ca5943	fix cfg merge	2025-07-06 19:11:46 -04:00
Wing Lian	97e8c01a70	tweak losses	2025-07-06 18:55:16 -04:00
Wing Lian	5c4705b185	unset fa	2025-07-06 13:27:55 -04:00
Wing Lian	47a88da330	set mbsz and revert non-packed test	2025-07-06 12:27:25 -04:00
Wing Lian	07ab737a55	set tokenizer_config in fixture	2025-07-06 12:24:21 -04:00
Wing Lian	c40da3b5eb	use shared fixture for preprocessed alpaca dataset	2025-07-06 11:44:31 -04:00
Wing Lian	a5946ff1f0	build fa2 from source for base image with torch2.6 and cu124 (#2867 )	2025-07-05 09:21:18 -04:00
Wing Lian	70ca1b2291	fix nightlies to use correct cache (#2848 ) [skip ci] * fix nightlies to use correct cache * fix for handling None for bf16	2025-07-03 12:21:39 -04:00
NanoCode012	8ae5a2311b	feat: update handling for mistraltokenizer decode and multiprocessing pickling fix (#2790 ) * feat: update handling for mistraltokenizer decode * fix: update mistral common package version * fix: to use correct release * fix triton path --------- Co-authored-by: Wing Lian <wing@axolotl.ai>	2025-07-02 08:07:18 -04:00
NanoCode012	6383630155	Fix: tokenize stall due to not shuffling dataset (#2845 ) * fix: shuffle dataset even if only one to fix tokenize stall * fix: warn if shuffling merged with curriculum sampling * chore: refactor	2025-07-02 08:06:00 -04:00
Vincenzo di Cicco	f2b352f2e5	Add sample_packing_sequentially to trainer args (#2853 ) [skip ci]	2025-07-02 08:05:35 -04:00
NanoCode012	bf5928d0ee	feat(doc): update docker tag examples (#2851 ) [skip ci] * feat(doc): update docker tag examples * chore: comment	2025-07-02 08:05:01 -04:00
Dhruv Mullick	d1224db8f4	Decouple generate_during_eval from wandb to support other visualizers (#2849 ) [skip ci] * Add generate_during_eval for mlflow for dpo * Decouple generate_during_eval from wandb	2025-07-02 08:04:40 -04:00
mhenrichsen	327b4e48e9	Add installation instructions for pip and Docker to README.md (#2854 ) * Add installation instructions for pip and Docker to README.md * Enhance README.md with Docker installation guidance for improved setup reliability.	2025-07-02 09:03:52 +02:00
Dan Saunders	35fdbce102	Ensure device mesh patching is applied (#2842 ) * move patches; make patch stronger * fix broken tests * guard sequence_parallel_degree comparison against none --------- Co-authored-by: Wing Lian <wing@axolotl.ai>	2025-06-29 22:16:32 -04:00
Wing Lian	cb811f8bf1	upgrade to flash-attn 2.8.0.post2 (#2828 ) * upgrade to flash-attn 2.8.0.post2 * use cu126 with torch 2.6 * seems vllm 0.8.5.post1 not compatible with cuda12.6.3 and torch 2.6 * cu126 + torch 2.6 as the default * use cu126 for multigpu w torch 2.6 too * drop vllm for now from ci for now	2025-06-29 22:11:16 -04:00
Wing Lian	7563e1bd30	set a different triton cache for each test to avoid blocking writes to cache (#2843 ) * set a different triton cache for each test to avoid blocking writes to cache * set log level * disable debug logging for filelock	2025-06-29 22:05:21 -04:00
Wing Lian	81893c775c	Accelerate 1.8.1 and BNB 0.46.0 update (#2815 ) * update accelerate to v1.8.0 * update bnb also * fix multigpu ci timeout * fix test set size * use latest accelerate 1.8.1 * disable default dtype	2025-06-28 15:29:19 -04:00