tweak loss

add seed for stable reproducibility
tweak acceptable loss from changed hyperparams
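
The seed presumably matters because a loss threshold checked in a test is only stable if samples arrive in the same order on every run. A minimal sketch of that idea in plain Python follows; it is not the code in this change, which calls `ds.shuffle(seed=cfg.seed)` on a Hugging Face dataset. `seeded_shuffle` is a hypothetical helper used only for illustration.

```python
import random


def seeded_shuffle(items, seed):
    """Return a shuffled copy of `items` using a private RNG seeded with `seed`.

    Hypothetical helper for illustration; not part of axolotl.
    """
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    shuffled = list(items)     # copy so the caller's list is left unchanged
    rng.shuffle(shuffled)
    return shuffled


samples = list(range(10))
first = seeded_shuffle(samples, seed=42)
second = seeded_shuffle(samples, seed=42)
```

With the same seed, `first` and `second` hold the same ordering, so any metric computed over the first few steps sees identical data across runs.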
2025-07-06 19:42:43 -04:00 · 2025-07-06 19:29:51 -04:00 · 2025-07-06 19:25:26 -04:00 · 2025-07-06 19:11:46 -04:00 · 2025-07-06 18:55:16 -04:00 · 2025-07-06 13:27:55 -04:00
33 changed files with 450 additions and 451 deletions
--- a/.github/workflows/base.yml
+++ b/.github/workflows/base.yml
@@ -5,11 +5,13 @@ on:
    branches:
      - "main"
    paths:
-      - 'Dockerfile-base'
+      - 'docker/Dockerfile-base'
      - 'docker/Dockerfile-uv-base'
      - '.github/workflows/base.yml'
  pull_request:
    paths:
-      - 'Dockerfile-base'
+      - 'docker/Dockerfile-base'
      - 'docker/Dockerfile-uv-base'
      - '.github/workflows/base.yml'
  workflow_dispatch:
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -20,12 +20,11 @@ jobs:
            python_version: "3.11"
            pytorch: 2.5.1
            axolotl_extras:
-          - cuda: 124
+          - cuda: 126
-            cuda_version: 12.4.1
+            cuda_version: 12.6.3
            python_version: "3.11"
            pytorch: 2.6.0
            axolotl_extras: vllm
            is_latest: true
          - cuda: 126
            cuda_version: 12.6.3
            python_version: "3.11"
@@ -88,8 +87,8 @@ jobs:
            python_version: "3.11"
            pytorch: 2.5.1
            axolotl_extras:
-          - cuda: 124
+          - cuda: 126
-            cuda_version: 12.4.1
+            cuda_version: 12.6.3
            python_version: "3.11"
            pytorch: 2.6.0
            axolotl_extras:
@@ -146,8 +145,8 @@ jobs:
    strategy:
      matrix:
        include:
-          - cuda: 124
+          - cuda: 126
-            cuda_version: 12.4.1
+            cuda_version: 12.6.3
            python_version: "3.11"
            pytorch: 2.6.0
            axolotl_extras:
--- a/.github/workflows/multi-gpu-e2e.yml
+++ b/.github/workflows/multi-gpu-e2e.yml
@@ -26,11 +26,11 @@ jobs:
      fail-fast: false
      matrix:
        include:
-          - cuda: 124
+          - cuda: 126
-            cuda_version: 12.4.1
+            cuda_version: 12.6.3
            python_version: "3.11"
            pytorch: 2.6.0
-            axolotl_extras: vllm
+            axolotl_extras:
            num_gpus: 2
            nightly_build: "true"
          - cuda: 124
--- a/.github/workflows/tests-nightly.yml
+++ b/.github/workflows/tests-nightly.yml
@@ -18,96 +18,9 @@ jobs:
        env:
          SKIP: no-commit-to-branch
-  preload-cache:
-    name: Preload HF cache
-    runs-on: ubuntu-latest
-    strategy:
-      fail-fast: false
-      matrix:
-        python_version: ["3.11"]
-        pytorch_version: ["2.6.0"]
-    timeout-minutes: 20
-    env:
-      AXOLOTL_IS_CI_CACHE_PRELOAD: "1"
-    steps:
-      - name: Check out repository code
-        uses: actions/checkout@v4
-      - name: Restore HF cache
-        id: hf-cache-restore
-        uses: actions/cache/restore@v4
-        with:
-          path: |
-            /home/runner/.cache/huggingface/hub/datasets--*
-            /home/runner/.cache/huggingface/hub/models--*
-          key: ${{ runner.os }}-hf-hub-cache-v2
-      - name: Setup Python
-        uses: actions/setup-python@v5
-        with:
-          python-version: ${{ matrix.python_version }}
-          cache: 'pip' # caching pip dependencies
-      - name: upgrade pip
-        run: |
-          pip3 install --upgrade pip
-          pip3 install --upgrade packaging==23.2 setuptools==75.8.0 wheel
-      - name: Install PyTorch
-        run: |
-          pip3 install torch==${{ matrix.pytorch_version }}
-      - name: Install dependencies
-        run: |
-          pip3 show torch
-          pip3 install --no-build-isolation -U -e .
-          python scripts/unsloth_install.py | sh
-          python scripts/cutcrossentropy_install.py | sh
-          pip3 install -r requirements-dev.txt -r requirements-tests.txt
-      - name: Make sure PyTorch version wasn't clobbered
-        run: |
-          python -c "import torch; assert '${{ matrix.pytorch_version }}' in torch.__version__"
-      - name: Ensure axolotl CLI was installed
-        run: |
-          axolotl --help
-      - name: Pre-Download dataset fixture
-        run: |
-          huggingface-cli download --repo-type=dataset axolotl-ai-internal/axolotl-oss-dataset-fixtures
-      - name: Run tests
-        run: |
-          pytest -v tests/conftest.py
-      - name: Upload coverage to Codecov
-        uses: codecov/codecov-action@v5
-        with:
-          token: ${{ secrets.CODECOV_TOKEN }}
-          files: ./coverage.xml
-          flags: unittests,pytorch-${{ matrix.pytorch_version }}
-          fail_ci_if_error: false
-      - name: cleanup pip cache
-        run: |
-          find "$(pip cache dir)/http-v2" -type f -mtime +14 -exec rm {} \;
-      - name: Save HF cache
-        id: hf-cache
-        uses: actions/cache/save@v4
-        with:
-          path: |
-            /home/runner/.cache/huggingface/hub/datasets--*
-            /home/runner/.cache/huggingface/hub/models--*
-          key: ${{ steps.hf-cache-restore.outputs.cache-primary-key }}
  pytest:
    name: PyTest
    runs-on: ubuntu-latest
-    needs: [preload-cache]
    strategy:
      fail-fast: false
      max-parallel: 2
@@ -120,14 +33,11 @@ jobs:
      - name: Check out repository code
        uses: actions/checkout@v4
-      - name: Restore HF cache
-        id: hf-cache-restore
-        uses: actions/cache/restore@v4
-        with:
-          path: |
-            /home/runner/.cache/huggingface/hub/datasets--*
-            /home/runner/.cache/huggingface/hub/models--*
-          key: ${{ runner.os }}-hf-hub-cache-v2
+      - name: Restore Cache from S3
+        id: hf-cache-restore-s3
+        run: |
+          mkdir -p /home/runner/.cache/huggingface/hub
+          curl -L https://d1dttdx32dkk5p.cloudfront.net/hf-cache.tar.zst | tar -xf - -C /home/runner/.cache/huggingface/hub/  --use-compress-program unzstd
      - name: Setup Python
        uses: actions/setup-python@v5
@@ -168,10 +78,6 @@ jobs:
        run: |
          axolotl --help
-      - name: Pre-Download dataset fixture
-        run: |
-          huggingface-cli download --repo-type=dataset axolotl-ai-internal/axolotl-oss-dataset-fixtures
      - name: Run tests
        run: |
          pytest -v -n8 --dist loadfile --ignore=tests/e2e/ --ignore=tests/patched/ --ignore=tests/cli/ tests/
@@ -193,15 +99,8 @@ jobs:
      fail-fast: false
      matrix:
        include:
-          - cuda: 124
-            cuda_version: 12.4.1
-            python_version: "3.11"
-            pytorch: 2.5.1
-            num_gpus: 1
-            axolotl_extras:
-            nightly_build: "true"
-          - cuda: 124
-            cuda_version: 12.4.1
+          - cuda: 126
+            cuda_version: 12.6.3
            python_version: "3.11"
            pytorch: 2.6.0
            num_gpus: 1
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -195,12 +195,12 @@ jobs:
      fail-fast: false
      matrix:
        include:
-          - cuda: 124
+          - cuda: 126
-            cuda_version: 12.4.1
+            cuda_version: 12.6.3
            python_version: "3.11"
            pytorch: 2.6.0
            num_gpus: 1
-            axolotl_extras: vllm
+            axolotl_extras:
          - cuda: 126
            cuda_version: 12.6.3
            python_version: "3.11"
@@ -247,8 +247,8 @@ jobs:
      fail-fast: false
      matrix:
        include:
-          - cuda: 124
+          - cuda: 126
-            cuda_version: 12.4.1
+            cuda_version: 12.6.3
            python_version: "3.11"
            pytorch: 2.6.0
            num_gpus: 1
@@ -311,7 +311,7 @@ jobs:
            python_version: "3.11"
            pytorch: 2.6.0
            num_gpus: 1
-            axolotl_extras: vllm
+            axolotl_extras:
    steps:
      - name: Checkout
        uses: actions/checkout@v4
--- a/README.md
+++ b/README.md
@@ -59,6 +59,8 @@ Features:
 ### Installation
 #### Using pip
 ```bash
 pip3 install -U packaging==23.2 setuptools==75.8.0 wheel ninja
 pip3 install --no-build-isolation axolotl[flash-attn,deepspeed]
@@ -68,6 +70,13 @@ axolotl fetch examples
 axolotl fetch deepspeed_configs  # OPTIONAL
 ```
 #### Using Docker
 Installing with Docker can be less error prone than installing in your own environment.
 ```bash
 docker run --gpus '"all"' --rm -it axolotlai/axolotl:main-latest
 ```
 Other installation approaches are described [here](https://docs.axolotl.ai/docs/installation.html).
 ### Your First Fine-tune
--- a/cicd/single_gpu.py
+++ b/cicd/single_gpu.py
@@ -32,6 +32,8 @@ df_args = {
    "NIGHTLY_BUILD": os.environ.get("NIGHTLY_BUILD", ""),
    "CODECOV_TOKEN": os.environ.get("CODECOV_TOKEN", ""),
    "HF_HOME": "/workspace/data/huggingface-cache/hub",
    "PYTHONUNBUFFERED": os.environ.get("PYTHONUNBUFFERED", "1"),
    "DEEPSPEED_LOG_LEVEL": os.environ.get("DEEPSPEED_LOG_LEVEL", "WARNING"),
 }
 dockerfile_contents = df_template.render(**df_args)
--- a/docker/Dockerfile-base
+++ b/docker/Dockerfile-base
@@ -38,6 +38,6 @@ RUN git lfs install --skip-repo && \
    # The base image ships with `pydantic==1.8.2` which is not working
    pip3 install -U --no-cache-dir pydantic==1.10.10
-RUN if [ "$PYTORCH_VERSION" = "2.7.1" ] ; then \
+RUN if [ "$PYTORCH_VERSION" = "2.6.0" ] && [ "$CUDA" = "124" ] ; then \
-        pip3 install flash-attn==2.7.4.post1; \
+        FLASH_ATTENTION_FORCE_BUILD="TRUE" pip3 install --no-build-isolation flash-attn==2.8.0.post2; \
    fi
--- a/docker/Dockerfile-uv-base
+++ b/docker/Dockerfile-uv-base
@@ -34,7 +34,3 @@ RUN uv pip install packaging setuptools wheel psutil \
    && uv pip install --no-build-isolation "causal_conv1d @ git+https://github.com/Dao-AILab/causal-conv1d.git@main" \
    && uv pip install "mamba_ssm @ git+https://github.com/state-spaces/mamba.git@main" \
    && uv pip install awscli pydantic
-RUN if [ "$PYTORCH_VERSION" = "2.7.1" ] ; then \
-        uv pip install --no-build-isolation flash-attn==2.7.4.post1; \
-    fi
--- a/docs/docker.qmd
+++ b/docs/docker.qmd
@@ -9,7 +9,7 @@ format:
 This section describes the different Docker images that are released by AxolotlAI at [Docker Hub](https://hub.docker.com/u/axolotlai).
 ::: {.callout-important}
-For Blackwell GPUs, please use the tags with Pytorch 2.7.1 and CUDA 12.8.
+For Blackwell GPUs, please use the tags with PyTorch 2.7.1 and CUDA 12.8.
 :::
 ## Base
@@ -34,6 +34,7 @@ Tags examples:
 - `main-base-py3.11-cu128-2.7.1`
 - `main-base-py3.11-cu126-2.7.1`
 - `main-base-py3.11-cu126-2.6.0`
 - `main-base-py3.11-cu124-2.6.0`
 - `main-base-py3.11-cu124-2.5.1`
@@ -73,13 +74,15 @@ There may be some extra tags appended to the image, like `-vllm` which installs
 Tags examples:
- `main-py3.11-cu126-2.7.0`
+- `main-py3.11-cu128-2.7.1`
 - `main-py3.11-cu126-2.7.1`
 - `main-py3.11-cu126-2.6.0`
 - `main-py3.11-cu124-2.6.0`
 - `main-py3.11-cu124-2.5.1`
 - `main-latest`
 - `main-20250303-py3.11-cu124-2.6.0`
 - `main-20250303-py3.11-cu124-2.5.1`
- `0.9.2`
+- `0.10.1`
 ## Cloud
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,7 +1,7 @@
 --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
 # START section of dependencies that don't install on Darwin/MacOS
-bitsandbytes==0.45.4
+bitsandbytes==0.46.0
 triton>=3.0.0
 mamba-ssm==1.2.0.post1
 xformers>=0.0.23.post1
@@ -15,7 +15,7 @@ huggingface_hub==0.32.2
 peft==0.15.2
 transformers==4.52.4
 tokenizers>=0.21.1
-accelerate==1.7.0
+accelerate==1.8.1
 datasets==3.6.0
 deepspeed>=0.17.0
 trl==0.18.2
@@ -68,4 +68,4 @@ schedulefree==1.4.1
 axolotl-contribs-lgpl==0.0.6
 axolotl-contribs-mit==0.0.3
-mistral-common==1.6.0
+mistral-common==1.6.3
--- a/setup.py
+++ b/setup.py
@@ -111,9 +111,9 @@ def get_package_version():
 extras_require = {
-    "flash-attn": ["flash-attn==2.7.4.post1"],
+    "flash-attn": ["flash-attn==2.8.0.post2"],
    "ring-flash-attn": [
-        "flash-attn==2.7.4.post1",
+        "flash-attn==2.8.0.post2",
        "ring-flash-attn>=0.1.4",
        "yunchang==0.6.0",
    ],
--- a/src/axolotl/core/builders/base.py
+++ b/src/axolotl/core/builders/base.py
@@ -219,7 +219,9 @@ class TrainerBuilderBase(abc.ABC):
        if self.cfg.bf16 == "full":
            training_args_kwargs["bf16_full_eval"] = True
        else:
-            training_args_kwargs["bf16"] = self.cfg.bf16 or self.cfg.bfloat16
+            bf16 = self.cfg.bf16 or self.cfg.bfloat16
+            bf16 = bf16 if bf16 is not None else False
+            training_args_kwargs["bf16"] = bf16
    def _configure_scheduler(self, training_args_kwargs: dict):
        if self.cfg.lr_scheduler in ["one_cycle", "rex"]:
--- a/src/axolotl/core/builders/causal.py
+++ b/src/axolotl/core/builders/causal.py
@@ -253,6 +253,10 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
        training_arguments_kwargs["eval_sample_packing"] = bool(
            self.cfg.eval_sample_packing
        )
+        if self.cfg.sample_packing_sequentially is not None:
+            training_arguments_kwargs["sample_packing_sequentially"] = (
+                self.cfg.sample_packing_sequentially
+            )
        if self.cfg.sample_packing_bin_size is not None:
            training_arguments_kwargs["sample_packing_bin_size"] = (
                self.cfg.sample_packing_bin_size
--- a/src/axolotl/core/trainers/dpo/init.py
+++ b/src/axolotl/core/trainers/dpo/init.py
@@ -28,7 +28,7 @@ class DPOStrategy:
        training_args_kwargs["max_completion_length"] = None
        training_args_kwargs["max_length"] = cfg.sequence_len
        training_args_kwargs["max_prompt_length"] = cfg.sequence_len
-        training_args_kwargs["generate_during_eval"] = cfg.use_wandb
+        training_args_kwargs["generate_during_eval"] = cfg.dpo_generate_during_eval
        if cfg.dpo_use_weighting is not None:
            training_args_kwargs["use_weighting"] = cfg.dpo_use_weighting
        if cfg.dpo_padding_free is not None:
--- a/src/axolotl/loaders/patch_manager.py
+++ b/src/axolotl/loaders/patch_manager.py
@@ -65,6 +65,7 @@ class PatchManager:
        self._apply_mistral_cross_entropy_patch()
        self._apply_self_attention_lora_patch()
        self._apply_gemma3_conditional_generation_forward_patch()
+        self._apply_sequence_parallel_patches()
    def apply_post_model_load_patches(self, model: PreTrainedModel):
        """Apply patches that require the model instance."""
@@ -231,6 +232,17 @@ class PatchManager:
            patch_gemma3_conditional_generation_forward()
+    def _apply_sequence_parallel_patches(self):
+        """Apply sequence parallelism patches."""
+        if self.cfg.sequence_parallel_degree and self.cfg.sequence_parallel_degree > 1:
+            from axolotl.monkeypatch.ring_attn.patch import (
+                patch_prepare_data_loader,
+                patch_prepare_device_mesh,
+            )
+            patch_prepare_data_loader()
+            patch_prepare_device_mesh(self.cfg.sequence_parallel_degree, self.cfg.fsdp)
    def _patch_attention(self):
        """Apply attention-specific patches based on model type."""
        if not (self.cfg.flash_attention and hasattr(self.model_config, "model_type")):
--- a/src/axolotl/monkeypatch/lora_kernels.py
+++ b/src/axolotl/monkeypatch/lora_kernels.py
@@ -156,12 +156,8 @@ def get_attention_cls_from_config(cfg: DictDefault) -> Type[nn.Module]:
        model_cls_prefix = "".join(
            [part.capitalize() for part in model_type.split("_")]
        )
-        if model_type == "gemma3n":
-            module = __import__(module_path, fromlist=[f"{model_cls_prefix}TextAttention"])
-            attention_cls = getattr(module, f"{model_cls_prefix}TextAttention")
-        else:
-            module = __import__(module_path, fromlist=[f"{model_cls_prefix}Attention"])
-            attention_cls = getattr(module, f"{model_cls_prefix}Attention")
+        module = __import__(module_path, fromlist=[f"{model_cls_prefix}Attention"])
+        attention_cls = getattr(module, f"{model_cls_prefix}Attention")
        return attention_cls
    except (ImportError, AttributeError) as e:
--- a/src/axolotl/monkeypatch/ring_attn/patch.py
+++ b/src/axolotl/monkeypatch/ring_attn/patch.py
@@ -152,7 +152,7 @@ def update_ring_attn_params(position_ids: torch.Tensor | None):
 def patch_prepare_data_loader():
    """Patch `accelerate.data_loader.prepare_data_loader` to respect the SP degree.
-    Raies:
+    Raises:
        RuntimeError: If source code to patch does not exist.
    """
    original_fn = accelerate.data_loader.prepare_data_loader
@@ -168,23 +168,34 @@ def patch_prepare_data_loader():
        ORIGINAL_PREPARE_DATALOADER_CODE, NEW_PREPARE_DATALOADER_CODE
    )
+    items_to_import = []
+    for item in dir(accelerate.data_loader):
+        if item in patched_source:
+            items_to_import.append(item)
    # Create a new function from the patched source
    namespace = {}
    exec(  # pylint: disable=exec-used  # nosec B102
-        patched_source, accelerate.data_loader.__dict__, namespace
+        f"from accelerate.data_loader import ({', '.join(items_to_import)})",
        globals(),
    )
    exec(  # pylint: disable=exec-used  # nosec B102
        patched_source, globals(), namespace
    )
    patched_function = namespace["prepare_data_loader"]
-    accelerate.data_loader.prepare_data_loader = patched_function
+    patched_function = namespace["prepare_data_loader"]
    original_fn.__code__ = patched_function.__code__
    LOG.info("Patched accelerate.data_loader.prepare_data_loader for SP support")
-def patch_prepare_device_mesh(sequence_parallel_degree: int):
+def patch_prepare_device_mesh(sequence_parallel_degree: int, fsdp: bool = False):
    """Patches the `Accelerator._prepare_device_mesh` method to create a device mesh
    that includes sequence parallelism with the specified degree.
    Args:
-        sequence_parallel_degree (int): The degree of sequence parallelism to use.
+        sequence_parallel_degree: The degree of sequence parallelism to use.
+        fsdp: Whether to use FSDP.
    """
    def _prepare_device_mesh(self):
@@ -207,12 +218,14 @@ def patch_prepare_device_mesh(sequence_parallel_degree: int):
        )
        device_ids = list(range(world_size))
-        # Note that we use "cp" instead of "sp" to match the PyTorch native "context
-        # parallelism" implementation naming
+        # NOTE: We use "cp" instead of "sp" to match the PyTorch native "context
+        # parallelism" implementation naming.
+        # NOTE: We have a simplified FSDP handling here; i.e., if FSDP is enabled, we
+        # only use "fsdp" and "cp" for the device mesh.
        return dist.DeviceMesh(
            "cuda",
            torch.tensor(device_ids).reshape(mesh_shape),
-            mesh_dim_names=("dp", "cp"),
+            mesh_dim_names=("dp", "cp") if not fsdp else ("fsdp", "cp"),
        )
    # Replace the original method with our new method
--- a/src/axolotl/train.py
+++ b/src/axolotl/train.py
@@ -223,8 +223,9 @@ def execute_training(
            )
        LOG.info("Starting trainer...")
-        if cfg.bf16:
-            torch.set_default_dtype(torch.bfloat16)
+        # TODO: disabling for now as not compatible with FSDP2 + torchao low bit optimizers
+        # if cfg.bf16:
+        #     torch.set_default_dtype(torch.bfloat16)
        trainer.train(resume_from_checkpoint=resume_from_checkpoint)
--- a/src/axolotl/utils/ctx_managers/sequence_parallel.py
+++ b/src/axolotl/utils/ctx_managers/sequence_parallel.py
@@ -12,8 +12,6 @@ from transformers.utils import ModelOutput
 from axolotl.monkeypatch.ring_attn import (
    get_ring_attn_group,
-    patch_prepare_data_loader,
-    patch_prepare_device_mesh,
    register_ring_attn,
    update_ring_attn_params,
 )
@@ -238,12 +236,6 @@ class SequenceParallelContextManager:
            ring_attn_func=self.ring_attn_func,
        )
-        # Patches for accelerate functionality
-        patch_prepare_data_loader()
-        patch_prepare_device_mesh(
-            sequence_parallel_degree=self.sequence_parallel_degree
-        )
    def _register_model_hooks(self):
        # Forward pre-hook to apply sequence parallelism
        def sequence_parallel_pre_hook(_, args, kwargs):
--- a/src/axolotl/utils/data/shared.py
+++ b/src/axolotl/utils/data/shared.py
@@ -524,13 +524,24 @@ def merge_datasets(datasets: list[Dataset], cfg: DictDefault) -> Dataset:
        Merged dataset.
    """
    if len(datasets) == 1:
-        return datasets[0]
+        ds = datasets[0]
+        # Do not shuffle if curriculum sampling is enabled
+        if cfg.curriculum_sampling:
+            return ds
+        return ds.shuffle(seed=cfg.seed)
    LOG.info("Merging datasets...")
    merged_dataset = concatenate_datasets(datasets)
    if cfg.shuffle_merged_datasets:
        LOG.debug("Shuffling merged datasets...")
        if cfg.curriculum_sampling:
            LOG.warning(
                "Shuffling merged datasets with curriculum sampling is not recommended. "
                "This will randomize the order of samples."
            )
        merged_dataset = merged_dataset.shuffle(seed=cfg.seed)
    else:
        LOG.debug("Not shuffling merged datasets.")
--- a/src/axolotl/utils/mistral_tokenizer.py
+++ b/src/axolotl/utils/mistral_tokenizer.py
@@ -8,7 +8,7 @@ from typing import TYPE_CHECKING, Optional
 import numpy as np
 from huggingface_hub import hf_hub_download
 from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
-from mistral_common.tokens.tokenizers.tekken import Tekkenizer
+from mistral_common.tokens.tokenizers.tekken import SpecialTokenPolicy, Tekkenizer
 from torch import Tensor
 from transformers.utils import PaddingStrategy
@@ -251,10 +251,13 @@ class HFMistralTokenizer:
            token_ids = [token_ids]
        if skip_special_tokens:
-            return self._mistral.instruct_tokenizer.tokenizer.decode(token_ids)
+            return self._mistral.instruct_tokenizer.tokenizer.decode(
+                token_ids, special_token_policy=SpecialTokenPolicy.IGNORE
+            )
-        # to_string returns a string with special tokens
-        return self._mistral.instruct_tokenizer.tokenizer.to_string(token_ids)
+        return self._mistral.instruct_tokenizer.tokenizer.decode(
+            token_ids, special_token_policy=SpecialTokenPolicy.KEEP
+        )
    def _create_mistral_chat_completion_request(
        self, conversation: list[dict], tools: list[dict] | None = None
--- a/src/axolotl/utils/schemas/config.py
+++ b/src/axolotl/utils/schemas/config.py
@@ -146,6 +146,7 @@ class AxolotlInputConfig(
    dpo_label_smoothing: float | None = None
    dpo_norm_loss: bool | None = None
    dpo_padding_free: bool | None = None
+    dpo_generate_during_eval: bool | None = None
    datasets: (
        Annotated[
--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -4,12 +4,14 @@ shared pytest fixtures
 import functools
 import importlib
 import logging
 import os
 import shutil
 import sys
 import tempfile
 import time
 from pathlib import Path
 from typing import Generator
 import datasets
 import pytest
@@ -24,6 +26,8 @@ from tests.hf_offline_utils import (
    hf_offline_context,
 )
 logging.getLogger("filelock").setLevel(logging.CRITICAL)
 def retry_on_request_exceptions(max_retries=3, delay=1):
    # pylint: disable=duplicate-code
@@ -411,7 +415,16 @@ def tokenizer_mistral_7b_instruct_chatml(tokenizer_mistral_7b_instruct):
@pytest.fixture
-def temp_dir():
+def temp_dir() -> Generator[str, None, None]:
    # Create a temporary directory
    _temp_dir = tempfile.mkdtemp()
    yield _temp_dir
    # Clean up the directory after the test
    shutil.rmtree(_temp_dir)
@pytest.fixture(scope="module")
 def module_temp_dir() -> Generator[str, None, None]:
    # Create a temporary directory
    _temp_dir = tempfile.mkdtemp()
    yield _temp_dir
--- a/tests/e2e/multigpu/patched/test_sp.py
+++ b/tests/e2e/multigpu/patched/test_sp.py
@@ -54,6 +54,7 @@ class TestSequenceParallelism:
                "micro_batch_size": micro_batch_size,
                "gradient_accumulation_steps": 2,
                "output_dir": temp_dir,
+                "dataset_prepared_path": temp_dir + "/last_run_prepared",
                "learning_rate": 0.00001,
                "optimizer": "adamw_8bit",
                "lr_scheduler": "cosine",
--- a/tests/e2e/multigpu/solo/test_flex.py
+++ b/tests/e2e/multigpu/solo/test_flex.py
@@ -54,6 +54,7 @@ class TestPackedFlex:
                "gradient_accumulation_steps": 2,
                "gradient_checkpointing": True,
                "output_dir": temp_dir,
+                "dataset_prepared_path": temp_dir + "/last_run_prepared",
                "learning_rate": 0.00001,
                "optimizer": "adamw_torch_fused",
                "lr_scheduler": "cosine",
--- a/tests/e2e/multigpu/solo/test_grpo.py
+++ b/tests/e2e/multigpu/solo/test_grpo.py
@@ -309,6 +309,7 @@ def oai_gsm8k_transform(cfg, *args, **kwargs):
                "warmup_steps": 10,
                "val_set_size": 0.0,
                "output_dir": temp_dir,
+                "dataset_prepared_path": temp_dir + "/last_run_prepared",
                "learning_rate": 0.0001,
                "optimizer": "adamw_torch_fused",
                "lr_scheduler": "cosine",
@@ -400,6 +401,7 @@ def oai_gsm8k_transform(cfg, *args, **kwargs):
                "warmup_steps": 10,
                "val_set_size": 0.0,
                "output_dir": temp_dir,
+                "dataset_prepared_path": temp_dir + "/last_run_prepared",
                "learning_rate": 0.0001,
                "optimizer": "adamw_torch_fused",
                "lr_scheduler": "cosine",
--- a/tests/e2e/multigpu/test_eval.py
+++ b/tests/e2e/multigpu/test_eval.py
@@ -38,12 +38,13 @@ class TestMultiGPUEval:
                "lora_dropout": 0.05,
                "lora_target_linear": True,
                "lora_modules_to_save": ["embed_tokens", "lm_head"],
-                "val_set_size": 0.004,
+                "val_set_size": 0.05,
                "special_tokens": {"pad_token": "<|endoftext|>"},
                "datasets": [
                    {
                        "path": "teknium/GPT4-LLM-Cleaned",
                        "type": "alpaca",
                        "split": "train[:5%]",
                    },
                ],
                "num_epochs": 1,
@@ -51,6 +52,7 @@ class TestMultiGPUEval:
                "micro_batch_size": 2,
                "gradient_accumulation_steps": 2,
                "output_dir": temp_dir,
+                "dataset_prepared_path": temp_dir + "/last_run_prepared",
                "learning_rate": 0.00001,
                "optimizer": "adamw_8bit",
                "lr_scheduler": "cosine",
@@ -107,12 +109,13 @@ class TestMultiGPUEval:
                "lora_dropout": 0.05,
                "lora_target_linear": True,
                "lora_modules_to_save": ["embed_tokens", "lm_head"],
-                "val_set_size": 0.0004,
+                "val_set_size": 0.01,
                "special_tokens": {"pad_token": "<|endoftext|>"},
                "datasets": [
                    {
                        "path": "teknium/GPT4-LLM-Cleaned",
                        "type": "alpaca",
                        "split": "train[:5%]",
                    },
                ],
                "num_epochs": 1,
@@ -120,6 +123,7 @@ class TestMultiGPUEval:
                "micro_batch_size": 2,
                "gradient_accumulation_steps": 2,
                "output_dir": temp_dir,
+                "dataset_prepared_path": temp_dir + "/last_run_prepared",
                "learning_rate": 0.00001,
                "optimizer": "adamw_8bit",
                "lr_scheduler": "cosine",
--- a/tests/e2e/multigpu/test_gemma3.py
+++ b/tests/e2e/multigpu/test_gemma3.py
@@ -64,6 +64,7 @@ class TestMultiGPUGemma3:
                },
                "gradient_accumulation_steps": 2,
                "output_dir": temp_dir,
+                "dataset_prepared_path": temp_dir + "/last_run_prepared",
                "learning_rate": 0.0001,
                "optimizer": "adamw_8bit",
                "lr_scheduler": "cosine",
--- a/tests/e2e/multigpu/test_llama.py
+++ b/tests/e2e/multigpu/test_llama.py
@@ -2,6 +2,8 @@
 E2E tests for multigpu lora tinyllama
 """
 # pylint: disable=redefined-outer-name
 from pathlib import Path
 import pytest
@@ -25,6 +27,60 @@ def download_model():
    snapshot_download("HuggingFaceTB/SmolLM2-135M")
+@pytest.fixture(scope="module")
+def sft_base_cfg():
+    cfg = DictDefault(
+        base_model="HuggingFaceTB/SmolLM2-135M",
+        tokenizer_config="HuggingFaceTB/SmolLM2-135M",  # this has to be manually set since we haven't done validation
+        sequence_len=1024,
+        special_tokens={
+            "pad_token": "<|endoftext|>",
+        },
+        datasets=[
+            {
+                "path": "tatsu-lab/alpaca",
+                "type": "alpaca",
+                "split": "train[:10%]",
+            },
+        ],
+        val_set_size=0.1,
+        sample_packing=True,
+        flash_attention=True,
+        learning_rate=0.00001,
+        optimizer="adamw_8bit",
+        seed=42,
+        # these need to be set since we aren't running schema validation
+        micro_batch_size=2,
+        gradient_accumulation_steps=1,
+    )
+    return cfg
+@pytest.fixture(scope="module", name="sft_prepared_dataset_alpaca_cfg")
+def sft_prepared_dataset_alpaca_cfg(module_temp_dir, sft_base_cfg):
+    dataset_prepared_path = module_temp_dir + "/last_run_prepared"
+    cfg = sft_base_cfg | DictDefault(
+        dataset_prepared_path=dataset_prepared_path,
+    )
+    Path(module_temp_dir).mkdir(parents=True, exist_ok=True)
+    with open(Path(module_temp_dir) / "config.yaml", "w", encoding="utf-8") as fout:
+        fout.write(yaml.dump(cfg.to_dict(), Dumper=yaml.Dumper))
+    execute_subprocess_async(
+        [
+            "axolotl",
+            "preprocess",
+            str(Path(module_temp_dir) / "config.yaml"),
+        ]
+    )
+    # unset flash attention since we have some flex attention tests too
+    cfg.flash_attention = None
+    return cfg
 def transformers_version_eq(required_version):
    return version.parse(transformers.__version__) == version.parse(required_version)
@@ -62,6 +118,7 @@ class TestMultiGPULlama:
                "gradient_accumulation_steps": 2,
                # "gradient_checkpointing": True,
                "output_dir": temp_dir,
+                "dataset_prepared_path": temp_dir + "/last_run_prepared",
                "learning_rate": 0.00001,
                "optimizer": "adamw_8bit",
                "lr_scheduler": "cosine",
@@ -96,44 +153,36 @@ class TestMultiGPULlama:
        "gradient_accumulation_steps",
        [1, 2],
    )
-    def test_lora_ddp_packed(self, temp_dir, gradient_accumulation_steps):
+    def test_lora_ddp_packed(
        self, temp_dir, sft_prepared_dataset_alpaca_cfg, gradient_accumulation_steps
    ):
        # pylint: disable=duplicate-code
-        cfg = DictDefault(
+        cfg = (
-            {
+            DictDefault(
-                "base_model": "HuggingFaceTB/SmolLM2-135M",
+                {
-                "sequence_len": 2048,
+                    "eval_sample_packing": False,
-                "sample_packing": True,
+                    "pad_to_sequence_len": True,
-                "eval_sample_packing": False,
+                    "adapter": "lora",
-                "pad_to_sequence_len": True,
+                    "lora_r": 8,
-                "adapter": "lora",
+                    "lora_alpha": 16,
-                "lora_r": 8,
+                    "lora_dropout": 0.05,
-                "lora_alpha": 16,
+                    "lora_target_linear": True,
-                "lora_dropout": 0.05,
+                    "val_set_size": 0.05,
-                "lora_target_linear": True,
+                    "num_epochs": 1,
-                "val_set_size": 0.05,
+                    "max_steps": 2,
-                "special_tokens": {
+                    "micro_batch_size": 1,
-                    "pad_token": "<|endoftext|>",
+                    "gradient_accumulation_steps": gradient_accumulation_steps,
-                },
+                    # "gradient_checkpointing": True,
-                "datasets": [
+                    "output_dir": temp_dir,
-                    {
+                    "learning_rate": 0.00001,
-                        "path": "tatsu-lab/alpaca",
+                    "optimizer": "adamw_8bit",
-                        "type": "alpaca",
+                    "lr_scheduler": "cosine",
-                        "split": "train[:20%]",
+                    "flash_attention": True,
-                    },
+                    "use_tensorboard": True,
-                ],
+                    "bf16": True,
-                "num_epochs": 1,
+                }
-                "max_steps": 2,
+            )
-                "micro_batch_size": 1,
+            | sft_prepared_dataset_alpaca_cfg
-                "gradient_accumulation_steps": gradient_accumulation_steps,
-                # "gradient_checkpointing": True,
-                "output_dir": temp_dir,
-                "learning_rate": 0.00001,
-                "optimizer": "adamw_8bit",
-                "lr_scheduler": "cosine",
-                "flash_attention": True,
-                "use_tensorboard": True,
-                "bf16": True,
-            }
        )
        # write cfg to yaml file
@@ -200,6 +249,7 @@ class TestMultiGPULlama:
                "gradient_accumulation_steps": 2,
                # "gradient_checkpointing": True,
                "output_dir": temp_dir,
+                "dataset_prepared_path": temp_dir + "/last_run_prepared",
                "warmup_steps": 0,
                "learning_rate": 0.00001,
                "optimizer": "adamw_8bit",
@@ -278,6 +328,7 @@ class TestMultiGPULlama:
                "gradient_accumulation_steps": 2,
                # "gradient_checkpointing": True,
                "output_dir": temp_dir,
+                "dataset_prepared_path": temp_dir + "/last_run_prepared",
                "warmup_steps": 0,
                "learning_rate": 0.00001,
                "optimizer": "adamw_8bit",
@@ -340,6 +391,7 @@ class TestMultiGPULlama:
                "gradient_accumulation_steps": gradient_accumulation_steps,
                # "gradient_checkpointing": True,
                "output_dir": temp_dir,
+                "dataset_prepared_path": temp_dir + "/last_run_prepared",
                "learning_rate": 0.00001,
                "optimizer": "adamw_torch_fused",
                "lr_scheduler": "cosine",
@@ -380,58 +432,50 @@ class TestMultiGPULlama:
        )
        check_tensorboard(
-            temp_dir + "/runs", "train/train_loss", 2.3, "Train Loss (%s) is too high"
+            temp_dir + "/runs", "train/train_loss", 2.5, "Train Loss (%s) is too high"
        )
    @pytest.mark.parametrize(
        "fsdp_state_dict_type",
        ["FULL_STATE_DICT", "SHARDED_STATE_DICT"],
    )
-    def test_fsdp_packed(self, temp_dir, fsdp_state_dict_type):
+    def test_fsdp_packed(
+        self, temp_dir, sft_prepared_dataset_alpaca_cfg, fsdp_state_dict_type
+    ):
        # pylint: disable=duplicate-code
-        cfg = DictDefault(
+        cfg = (
-            {
+            DictDefault(
-                "base_model": "HuggingFaceTB/SmolLM2-135M",
+                {
-                "sample_packing": True,
+                    "pad_to_sequence_len": True,
-                "pad_to_sequence_len": True,
+                    "num_epochs": 1,
-                "sequence_len": 1024,
+                    "max_steps": 2,
-                "val_set_size": 0.05,
+                    "micro_batch_size": 2,
-                "special_tokens": {
+                    "gradient_accumulation_steps": 2,
-                    "pad_token": "<|endoftext|>",
+                    # "gradient_checkpointing": True,
-                },
+                    "output_dir": temp_dir,
-                "datasets": [
+                    "dataset_prepared_path": temp_dir + "/last_run_prepared",
-                    {
+                    "learning_rate": 0.00001,
-                        "path": "tatsu-lab/alpaca",
+                    "optimizer": "adamw_torch_fused",
-                        "type": "alpaca",
+                    "lr_scheduler": "cosine",
-                        "split": "train[:10%]",
+                    "flash_attention": True,
+                    "fsdp": [
+                        "full_shard",
+                        "auto_wrap",
+                    ],
+                    "fsdp_config": {
+                        "fsdp_limit_all_gathers": True,
+                        "fsdp_offload_params": False,
+                        "fsdp_sync_module_states": True,
+                        "fsdp_use_orig_params": False,
+                        "fsdp_cpu_ram_efficient_loading": False,
+                        "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer",
+                        "fsdp_state_dict_type": fsdp_state_dict_type,
+                        "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
                    },
-                ],
+                    "use_tensorboard": True,
-                "num_epochs": 1,
+                }
-                "max_steps": 2,
+            )
-                "micro_batch_size": 2,
+            | sft_prepared_dataset_alpaca_cfg
-                "gradient_accumulation_steps": 2,
-                # "gradient_checkpointing": True,
-                "output_dir": temp_dir,
-                "learning_rate": 0.00001,
-                "optimizer": "adamw_torch_fused",
-                "lr_scheduler": "cosine",
-                "flash_attention": True,
-                "fsdp": [
-                    "full_shard",
-                    "auto_wrap",
-                ],
-                "fsdp_config": {
-                    "fsdp_limit_all_gathers": True,
-                    "fsdp_offload_params": False,
-                    "fsdp_sync_module_states": True,
-                    "fsdp_use_orig_params": False,
-                    "fsdp_cpu_ram_efficient_loading": False,
-                    "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer",
-                    "fsdp_state_dict_type": fsdp_state_dict_type,
-                    "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
-                },
-                "use_tensorboard": True,
-            }
        )
        # write cfg to yaml file
@@ -452,7 +496,7 @@ class TestMultiGPULlama:
        )
        check_tensorboard(
-            temp_dir + "/runs", "train/train_loss", 2.3, "Train Loss (%s) is too high"
+            temp_dir + "/runs", "train/train_loss", 2.4, "Train Loss (%s) is too high"
        )
    @require_torch_2_6_0
@@ -465,50 +509,43 @@ class TestMultiGPULlama:
        [True, False],
    )
    def test_fsdp2_packed(
-        self, temp_dir, attention_backend, fsdp_reshard_after_forward
+        self,
+        temp_dir,
+        sft_prepared_dataset_alpaca_cfg,
+        attention_backend,
+        fsdp_reshard_after_forward,
    ):
        # pylint: disable=duplicate-code
-        cfg = DictDefault(
+        cfg = (
-            {
+            DictDefault(
-                "base_model": "HuggingFaceTB/SmolLM2-135M",
+                {
-                "sample_packing": True,
+                    "pad_to_sequence_len": True,
-                "pad_to_sequence_len": True,
+                    "num_epochs": 1,
-                "sequence_len": 2048,
+                    "max_steps": 2,
-                "val_set_size": 0.1,
+                    "micro_batch_size": 4,
-                "special_tokens": {
+                    "gradient_accumulation_steps": 2,
-                    "pad_token": "<|endoftext|>",
+                    "gradient_checkpointing": True,
-                },
+                    "output_dir": temp_dir,
-                "datasets": [
+                    "learning_rate": 0.00001,
-                    {
+                    "optimizer": "adamw_torch_8bit",
-                        "path": "tatsu-lab/alpaca",
+                    "lr_scheduler": "cosine",
-                        "type": "alpaca",
+                    "fsdp": [
-                        "split": "train[:10%]",
+                        "auto_wrap",
+                    ],
+                    "fsdp_config": {
+                        "fsdp_version": 2,
+                        # "fsdp_forward_prefetch": True,  # not yet implemented in accelerate
+                        "fsdp_offload_params": False,
+                        "fsdp_cpu_ram_efficient_loading": False,
+                        "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer",
+                        "fsdp_state_dict_type": "SHARDED_STATE_DICT",
+                        "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
+                        "fsdp_reshard_after_forward": fsdp_reshard_after_forward,
                    },
-                ],
+                    "use_tensorboard": True,
-                "num_epochs": 1,
+                }
-                "max_steps": 2,
+            )
-                "micro_batch_size": 4,
+            | sft_prepared_dataset_alpaca_cfg
-                "gradient_accumulation_steps": 2,
-                "gradient_checkpointing": True,
-                "output_dir": temp_dir,
-                "learning_rate": 0.00001,
-                "optimizer": "adamw_torch_8bit",
-                "lr_scheduler": "cosine",
-                "fsdp": [
-                    "auto_wrap",
-                ],
-                "fsdp_config": {
-                    "fsdp_version": 2,
-                    # "fsdp_forward_prefetch": True,  # not yet implemented in accelerate
-                    "fsdp_offload_params": False,
-                    "fsdp_cpu_ram_efficient_loading": False,
-                    "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer",
-                    "fsdp_state_dict_type": "SHARDED_STATE_DICT",
-                    "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
-                    "fsdp_reshard_after_forward": fsdp_reshard_after_forward,
-                },
-                "use_tensorboard": True,
-            }
        )
        if attention_backend == "flash":
            cfg.flash_attention = True
@@ -536,63 +573,55 @@ class TestMultiGPULlama:
            temp_dir + "/runs", "train/train_loss", 2.1, "Train Loss (%s) is too high"
        )
-    def test_fsdp_qlora_prequant_packed(self, temp_dir):
+    def test_fsdp_qlora_prequant_packed(
+        self, temp_dir, sft_prepared_dataset_alpaca_cfg
+    ):
        # pylint: disable=duplicate-code
-        cfg = DictDefault(
+        cfg = (
-            {
+            DictDefault(
-                "base_model": "axolotl-ai-co/SmolLM2-135M-bnb-nf4-bf16",
+                {
-                "adapter": "qlora",
+                    "base_model": "axolotl-ai-co/SmolLM2-135M-bnb-nf4-bf16",
-                "mean_resizing_embeddings": True,
+                    "adapter": "qlora",
-                "load_in_4bit": True,
+                    "mean_resizing_embeddings": True,
-                "lora_r": 8,
+                    "load_in_4bit": True,
-                "lora_alpha": 16,
+                    "lora_r": 8,
-                "lora_dropout": 0.05,
+                    "lora_alpha": 16,
-                "lora_target_linear": True,
+                    "lora_dropout": 0.05,
-                # "lora_modules_to_save": [
+                    "lora_target_linear": True,
-                #     "embed_tokens",
+                    # "lora_modules_to_save": [
-                #     "lm_head",
+                    #     "embed_tokens",
-                # ],
+                    #     "lm_head",
-                "sample_packing": True,
+                    # ],
-                "eval_sample_packing": False,
+                    "eval_sample_packing": False,
-                "pad_to_sequence_len": True,
+                    "pad_to_sequence_len": True,
-                "sequence_len": 1024,
+                    "num_epochs": 1,
-                "val_set_size": 0.01,
+                    "max_steps": 2,
-                "special_tokens": {
+                    "micro_batch_size": 2,
-                    "pad_token": "<|endoftext|>",
+                    "gradient_accumulation_steps": 2,
-                },
+                    # "gradient_checkpointing": True,
-                "datasets": [
+                    "output_dir": temp_dir,
-                    {
+                    "learning_rate": 0.00001,
-                        "path": "tatsu-lab/alpaca",
+                    "optimizer": "adamw_torch_fused",
-                        "type": "alpaca",
+                    "lr_scheduler": "cosine",
-                        "split": "train[:10%]",
+                    "flash_attention": True,
+                    "fsdp": [
+                        "full_shard",
+                        "auto_wrap",
+                    ],
+                    "fsdp_config": {
+                        "fsdp_limit_all_gathers": True,
+                        "fsdp_offload_params": False,
+                        "fsdp_sync_module_states": True,
+                        "fsdp_use_orig_params": False,
+                        "fsdp_cpu_ram_efficient_loading": True,
+                        "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer",
+                        "fsdp_state_dict_type": "SHARDED_STATE_DICT",
+                        "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
                    },
-                ],
+                    "use_tensorboard": True,
-                "num_epochs": 1,
+                }
-                "max_steps": 2,
+            )
-                "micro_batch_size": 2,
+            | sft_prepared_dataset_alpaca_cfg
-                "gradient_accumulation_steps": 2,
-                # "gradient_checkpointing": True,
-                "output_dir": temp_dir,
-                "learning_rate": 0.00001,
-                "optimizer": "adamw_torch_fused",
-                "lr_scheduler": "cosine",
-                "flash_attention": True,
-                "fsdp": [
-                    "full_shard",
-                    "auto_wrap",
-                ],
-                "fsdp_config": {
-                    "fsdp_limit_all_gathers": True,
-                    "fsdp_offload_params": False,
-                    "fsdp_sync_module_states": True,
-                    "fsdp_use_orig_params": False,
-                    "fsdp_cpu_ram_efficient_loading": True,
-                    "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer",
-                    "fsdp_state_dict_type": "SHARDED_STATE_DICT",
-                    "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
-                },
-                "use_tensorboard": True,
-            }
        )
        # write cfg to yaml file
@@ -633,7 +662,12 @@ class TestMultiGPULlama:
        [True, False],
    )
    def test_ds_zero3_packed(
-        self, temp_dir, gradient_accumulation_steps, deepspeed, qlora
+        self,
+        temp_dir,
+        sft_prepared_dataset_alpaca_cfg,
+        gradient_accumulation_steps,
+        deepspeed,
+        qlora,
    ):
        # pylint: disable=duplicate-code
        if qlora:
@@ -647,36 +681,25 @@ class TestMultiGPULlama:
            }
        else:
            adapter = {}
-        cfg = DictDefault(
+        cfg = (
-            {
+            DictDefault(
-                "base_model": "HuggingFaceTB/SmolLM2-135M",
+                {
-                "sample_packing": True,
+                    "pad_to_sequence_len": True,
-                "pad_to_sequence_len": True,
+                    "num_epochs": 1,
-                "sequence_len": 1024,
+                    "max_steps": 2,
-                "val_set_size": 0.05,
+                    "micro_batch_size": 1,
-                "special_tokens": {
+                    "gradient_accumulation_steps": gradient_accumulation_steps,
-                    "pad_token": "<|endoftext|>",
+                    "output_dir": temp_dir,
-                },
+                    "learning_rate": 0.00001,
-                "datasets": [
+                    "optimizer": "adamw_torch_fused",
-                    {
+                    "lr_scheduler": "cosine",
-                        "path": "tatsu-lab/alpaca",
+                    "flash_attention": True,
-                        "type": "alpaca",
+                    "deepspeed": str(AXOLOTL_ROOT / deepspeed),
-                        "split": "train[:10%]",
+                    "use_tensorboard": True,
-                    },
+                    **adapter,
-                ],
+                }
-                "num_epochs": 1,
+            )
-                "max_steps": 2,
+            | sft_prepared_dataset_alpaca_cfg
-                "micro_batch_size": 1,
-                "gradient_accumulation_steps": gradient_accumulation_steps,
-                "output_dir": temp_dir,
-                "learning_rate": 0.00001,
-                "optimizer": "adamw_torch_fused",
-                "lr_scheduler": "cosine",
-                "flash_attention": True,
-                "deepspeed": str(AXOLOTL_ROOT / deepspeed),
-                "use_tensorboard": True,
-                **adapter,
-            }
        )
        # write cfg to yaml file
@@ -697,7 +720,7 @@ class TestMultiGPULlama:
        )
        check_tensorboard(
-            temp_dir + "/runs", "train/train_loss", 2.4, "Train Loss (%s) is too high"
+            temp_dir + "/runs", "train/train_loss", 2.5, "Train Loss (%s) is too high"
        )
    @pytest.mark.parametrize(
@@ -708,7 +731,13 @@ class TestMultiGPULlama:
        "qlora",
        [True, False],
    )
-    def test_ds_zero2_packed(self, temp_dir, gradient_accumulation_steps, qlora):
+    def test_ds_zero2_packed(
+        self,
+        temp_dir,
+        sft_prepared_dataset_alpaca_cfg,
+        gradient_accumulation_steps,
+        qlora,
+    ):
        # pylint: disable=duplicate-code
        if qlora:
            adapter = {
@@ -721,36 +750,25 @@ class TestMultiGPULlama:
            }
        else:
            adapter = {}
-        cfg = DictDefault(
+        cfg = (
-            {
+            DictDefault(
-                "base_model": "HuggingFaceTB/SmolLM2-135M",
+                {
-                "sample_packing": True,
+                    "pad_to_sequence_len": True,
-                "pad_to_sequence_len": True,
+                    "num_epochs": 1,
-                "sequence_len": 1024,
+                    "max_steps": 2,
-                "val_set_size": 0.01,
+                    "micro_batch_size": 1,
-                "special_tokens": {
+                    "gradient_accumulation_steps": gradient_accumulation_steps,
-                    "pad_token": "<|endoftext|>",
+                    "output_dir": temp_dir,
-                },
+                    "learning_rate": 0.00001,
-                "datasets": [
+                    "optimizer": "adamw_torch_fused",
-                    {
+                    "lr_scheduler": "cosine",
-                        "path": "tatsu-lab/alpaca",
+                    "flash_attention": True,
-                        "type": "alpaca",
+                    "deepspeed": str(AXOLOTL_ROOT / "deepspeed_configs/zero2.json"),
-                        "split": "train[:10%]",
+                    "use_tensorboard": True,
-                    },
+                    **adapter,
-                ],
+                }
-                "num_epochs": 1,
+            )
-                "max_steps": 2,
+            | sft_prepared_dataset_alpaca_cfg
-                "micro_batch_size": 1,
-                "gradient_accumulation_steps": gradient_accumulation_steps,
-                "output_dir": temp_dir,
-                "learning_rate": 0.00001,
-                "optimizer": "adamw_torch_fused",
-                "lr_scheduler": "cosine",
-                "flash_attention": True,
-                "deepspeed": str(AXOLOTL_ROOT / "deepspeed_configs/zero2.json"),
-                "use_tensorboard": True,
-                **adapter,
-            }
        )
        # write cfg to yaml file
@@ -771,7 +789,7 @@ class TestMultiGPULlama:
        )
        check_tensorboard(
-            temp_dir + "/runs", "train/train_loss", 2.3, "Train Loss (%s) is too high"
+            temp_dir + "/runs", "train/train_loss", 2.5, "Train Loss (%s) is too high"
        )
    @pytest.mark.parametrize(
@@ -782,7 +800,13 @@ class TestMultiGPULlama:
        "qlora",
        [True, False],
    )
-    def test_ds_zero1_packed(self, temp_dir, gradient_accumulation_steps, qlora):
+    def test_ds_zero1_packed(
+        self,
+        temp_dir,
+        sft_prepared_dataset_alpaca_cfg,
+        gradient_accumulation_steps,
+        qlora,
+    ):
        # pylint: disable=duplicate-code
        if qlora:
            adapter = {
@@ -795,36 +819,25 @@ class TestMultiGPULlama:
            }
        else:
            adapter = {}
-        cfg = DictDefault(
+        cfg = (
-            {
+            DictDefault(
-                "base_model": "HuggingFaceTB/SmolLM2-135M",
+                {
-                "sample_packing": True,
+                    "pad_to_sequence_len": True,
-                "pad_to_sequence_len": True,
+                    "num_epochs": 1,
-                "sequence_len": 1024,
+                    "max_steps": 2,
-                "val_set_size": 0.01,
+                    "micro_batch_size": 1,
-                "special_tokens": {
+                    "gradient_accumulation_steps": gradient_accumulation_steps,
-                    "pad_token": "<|endoftext|>",
+                    "output_dir": temp_dir,
-                },
+                    "learning_rate": 0.00001,
-                "datasets": [
+                    "optimizer": "adamw_torch_fused",
-                    {
+                    "lr_scheduler": "cosine",
-                        "path": "tatsu-lab/alpaca",
+                    "flash_attention": True,
-                        "type": "alpaca",
+                    "deepspeed": str(AXOLOTL_ROOT / "deepspeed_configs/zero1.json"),
-                        "split": "train[:10%]",
+                    "use_tensorboard": True,
-                    },
+                    **adapter,
-                ],
+                }
-                "num_epochs": 1,
+            )
-                "max_steps": 2,
+            | sft_prepared_dataset_alpaca_cfg
-                "micro_batch_size": 1,
-                "gradient_accumulation_steps": gradient_accumulation_steps,
-                "output_dir": temp_dir,
-                "learning_rate": 0.00001,
-                "optimizer": "adamw_torch_fused",
-                "lr_scheduler": "cosine",
-                "flash_attention": True,
-                "deepspeed": str(AXOLOTL_ROOT / "deepspeed_configs/zero1.json"),
-                "use_tensorboard": True,
-                **adapter,
-            }
        )
        # write cfg to yaml file
@@ -845,7 +858,7 @@ class TestMultiGPULlama:
        )
        check_tensorboard(
-            temp_dir + "/runs", "train/train_loss", 2.3, "Train Loss (%s) is too high"
+            temp_dir + "/runs", "train/train_loss", 2.5, "Train Loss (%s) is too high"
        )
    @pytest.mark.skip(
--- a/tests/e2e/multigpu/test_qwen2.py
+++ b/tests/e2e/multigpu/test_qwen2.py
@@ -46,6 +46,7 @@ class TestMultiGPUQwen2:
                "micro_batch_size": 2,
                "gradient_accumulation_steps": 2,
                "output_dir": temp_dir,
+                "dataset_prepared_path": temp_dir + "/last_run_prepared",
                "learning_rate": 0.00001,
                "optimizer": "adamw_torch_fused",
                "lr_scheduler": "cosine",
--- a/tests/e2e/multigpu/test_ray.py
+++ b/tests/e2e/multigpu/test_ray.py
@@ -48,6 +48,7 @@ class TestMultiGPURay:
                "micro_batch_size": 4,
                "gradient_accumulation_steps": 2,
                "output_dir": temp_dir,
+                "dataset_prepared_path": temp_dir + "/last_run_prepared",
                "learning_rate": 0.00001,
                "optimizer": "adamw_8bit",
                "lr_scheduler": "cosine",
@@ -107,6 +108,7 @@ class TestMultiGPURay:
                "micro_batch_size": 1,
                "gradient_accumulation_steps": gradient_accumulation_steps,
                "output_dir": temp_dir,
+                "dataset_prepared_path": temp_dir + "/last_run_prepared",
                "learning_rate": 0.00001,
                "optimizer": "adamw_torch",
                "lr_scheduler": "cosine",
--- a/tests/e2e/patched/lora_kernels/test_lora_kernel_patching.py
+++ b/tests/e2e/patched/lora_kernels/test_lora_kernel_patching.py
@@ -396,7 +396,7 @@ def test_model_architecture(model_config):
 # pylint: disable=duplicate-code
-def test_kernel_training_integration():
+def test_kernel_training_integration(temp_dir):
    """Test model loading with kernel patches enabled."""
    from axolotl.cli.utils import load_model_and_tokenizer
@@ -426,6 +426,14 @@ def test_kernel_training_integration():
        }
    )
+    # Write cfg to yaml file
+    path = Path(temp_dir) / "config.yaml"
+    with open(path, "w", encoding="utf-8") as fout:
+        fout.write(yaml.dump(cfg.to_dict(), Dumper=yaml.Dumper))
+    # Load config
+    cfg = load_cfg(str(path))
    # Load model
    model, _, _ = load_model_and_tokenizer(cfg=cfg)
@@ -505,7 +513,7 @@ def test_kernel_training_integration_auto_enable(temp_dir):
    assert found_patched_attn
-def test_kernel_training_integration_dropout_non_zero():
+def test_kernel_training_integration_dropout_non_zero(temp_dir):
    """Test model loading with dropout non-zero should not patch."""
    from axolotl.cli.utils import load_model_and_tokenizer
@@ -533,6 +541,14 @@ def test_kernel_training_integration_dropout_non_zero():
        }
    )
+    # Write cfg to yaml file
+    path = Path(temp_dir) / "config.yaml"
+    with open(path, "w", encoding="utf-8") as fout:
+        fout.write(yaml.dump(cfg.to_dict(), Dumper=yaml.Dumper))
+    # Load config
+    cfg = load_cfg(str(path))
    # Get original attention class
    attention_cls = get_attention_cls_from_config(cfg)
Author	SHA1	Message	Date
Wing Lian	b79996bdc4	tweak loss	2025-07-06 19:42:43 -04:00
Wing Lian	68368de7ed	add seed for stable reproducibility	2025-07-06 19:29:51 -04:00
Wing Lian	a94c4a014b	tweak acceptable loss from changed hyperparams	2025-07-06 19:25:26 -04:00
Wing Lian	0102ca5943	fix cfg merge	2025-07-06 19:11:46 -04:00
Wing Lian	97e8c01a70	tweak losses	2025-07-06 18:55:16 -04:00
Wing Lian	5c4705b185	unset fa	2025-07-06 13:27:55 -04:00
Wing Lian	47a88da330	set mbsz and revert non-packed test	2025-07-06 12:27:25 -04:00
Wing Lian	07ab737a55	set tokenizer_config in fixture	2025-07-06 12:24:21 -04:00
Wing Lian	c40da3b5eb	use shared fixture for preprocessed alpaca dataset	2025-07-06 11:44:31 -04:00
Wing Lian	a5946ff1f0	build fa2 from source for base image with torch2.6 and cu124 (#2867)	2025-07-05 09:21:18 -04:00
Wing Lian	70ca1b2291	fix nightlies to use correct cache (#2848) [skip ci] * fix nightlies to use correct cache * fix for handling None for bf16	2025-07-03 12:21:39 -04:00
NanoCode012	8ae5a2311b	feat: update handling for mistraltokenizer decode and multiprocessing pickling fix (#2790) * feat: update handling for mistraltokenizer decode * fix: update mistral common package version * fix: to use correct release * fix triton path --------- Co-authored-by: Wing Lian <wing@axolotl.ai>	2025-07-02 08:07:18 -04:00
NanoCode012	6383630155	Fix: tokenize stall due to not shuffling dataset (#2845) * fix: shuffle dataset even if only one to fix tokenize stall * fix: warn if shuffling merged with curriculum sampling * chore: refactor	2025-07-02 08:06:00 -04:00
Vincenzo di Cicco	f2b352f2e5	Add sample_packing_sequentially to trainer args (#2853) [skip ci]	2025-07-02 08:05:35 -04:00
NanoCode012	bf5928d0ee	feat(doc): update docker tag examples (#2851) [skip ci] * feat(doc): update docker tag examples * chore: comment	2025-07-02 08:05:01 -04:00
Dhruv Mullick	d1224db8f4	Decouple generate_during_eval from wandb to support other visualizers (#2849) [skip ci] * Add generate_during_eval for mlflow for dpo * Decouple generate_during_eval from wandb	2025-07-02 08:04:40 -04:00
mhenrichsen	327b4e48e9	Add installation instructions for pip and Docker to README.md (#2854) * Add installation instructions for pip and Docker to README.md * Enhance README.md with Docker installation guidance for improved setup reliability.	2025-07-02 09:03:52 +02:00
Dan Saunders	35fdbce102	Ensure device mesh patching is applied (#2842) * move patches; make patch stronger * fix broken tests * guard sequence_parallel_degree comparison against none --------- Co-authored-by: Wing Lian <wing@axolotl.ai>	2025-06-29 22:16:32 -04:00
Wing Lian	cb811f8bf1	upgrade to flash-attn 2.8.0.post2 (#2828) * upgrade to flash-attn 2.8.0.post2 * use cu126 with torch 2.6 * seems vllm 0.8.5.post1 not compatible with cuda12.6.3 and torch 2.6 * cu126 + torch 2.6 as the default * use cu126 for multigpu w torch 2.6 too * drop vllm for now from ci for now	2025-06-29 22:11:16 -04:00
Wing Lian	7563e1bd30	set a different triton cache for each test to avoid blocking writes to cache (#2843) * set a different triton cache for each test to avoid blocking writes to cache * set log level * disable debug logging for filelock	2025-06-29 22:05:21 -04:00
Wing Lian	81893c775c	Accelerate 1.8.1 and BNB 0.46.0 update (#2815) * update accelerate to v1.8.0 * update bnb also * fix multigpu ci timeout * fix test set size * use latest accelerate 1.8.1 * disable default dtype	2025-06-28 15:29:19 -04:00
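
Several tests in this change build their config as `DictDefault({...}) | sft_prepared_dataset_alpaca_cfg`, and one commit above is titled "fix cfg merge". As a rough illustration of why operand order matters in such a merge, here is a minimal sketch using plain Python dicts (Python 3.9+) rather than `DictDefault`, whose own `|` behavior is not shown in this diff and may differ. The `shared_dataset_cfg` contents and the `build_cfg` helper are hypothetical stand-ins, not the real fixture.

```python
# Hypothetical stand-in for a shared dataset fixture; the real fixture's
# keys and values are not visible in this diff.
shared_dataset_cfg = {
    "sample_packing": True,
    "sequence_len": 2048,
    "micro_batch_size": 2,
}


def build_cfg(test_overrides: dict, shared_cfg: dict) -> dict:
    """Merge per-test settings with shared settings.

    With plain dicts, `a | b` returns a new dict and the RIGHT operand
    wins when both sides define the same key.
    """
    return test_overrides | shared_cfg


test_overrides = {"micro_batch_size": 1, "max_steps": 2}
cfg = build_cfg(test_overrides, shared_dataset_cfg)

# The shared value replaced the per-test one because it sat on the right.
print(cfg["micro_batch_size"])
```

If a per-test value is meant to take precedence, the operands would need to be swapped under plain-dict semantics; whether that applies to `DictDefault` should be checked against its implementation.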