chore: refactor normalize_attn to use mapping and loop

fix: set replit mpt model to use eager attention
fixes from PR feedback
2025-05-07 17:10:18 +07:00 · 2025-05-07 17:10:18 +07:00 · 2025-05-07 17:10:18 +07:00 · 2025-05-07 17:10:18 +07:00 · 2025-05-06 23:40:44 -04:00 · 2025-05-06 22:56:00 -04:00
156 changed files with 1115 additions and 1273 deletions
--- a/.github/workflows/multi-gpu-e2e.yml
+++ b/.github/workflows/multi-gpu-e2e.yml
@@ -3,7 +3,7 @@ name: docker-multigpu-tests-biweekly
 on:
  pull_request:
    paths:
-      - 'tests/e2e/multigpu/**.py'
+      - 'tests/e2e/multigpu/*.py'
      - 'requirements.txt'
      - 'setup.py'
      - 'pyproject.toml'
--- a/.github/workflows/tests-nightly.yml
+++ b/.github/workflows/tests-nightly.yml
@@ -18,96 +18,9 @@ jobs:
        env:
          SKIP: no-commit-to-branch
  preload-cache:
    name: Preload HF cache
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python_version: ["3.11"]
        pytorch_version: ["2.6.0"]
    timeout-minutes: 20
    env:
      AXOLOTL_IS_CI_CACHE_PRELOAD: "1"
    steps:
      - name: Check out repository code
        uses: actions/checkout@v4
      - name: Restore HF cache
        id: hf-cache-restore
        uses: actions/cache/restore@v4
        with:
          path: |
            /home/runner/.cache/huggingface/hub/datasets--*
            /home/runner/.cache/huggingface/hub/models--*
          key: ${{ runner.os }}-hf-hub-cache-v2
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python_version }}
          cache: 'pip' # caching pip dependencies
      - name: upgrade pip
        run: |
          pip3 install --upgrade pip
          pip3 install --upgrade packaging==23.2 setuptools==75.8.0 wheel
      - name: Install PyTorch
        run: |
          pip3 install torch==${{ matrix.pytorch_version }}
      - name: Install dependencies
        run: |
          pip3 show torch
          pip3 install --no-build-isolation -U -e .
          python scripts/unsloth_install.py | sh
          python scripts/cutcrossentropy_install.py | sh
          pip3 install -r requirements-dev.txt -r requirements-tests.txt
      - name: Make sure PyTorch version wasn't clobbered
        run: |
          python -c "import torch; assert '${{ matrix.pytorch_version }}' in torch.__version__"
      - name: Ensure axolotl CLI was installed
        run: |
          axolotl --help
      - name: Pre-Download dataset fixture
        run: |
          huggingface-cli download --repo-type=dataset axolotl-ai-internal/axolotl-oss-dataset-fixtures
      - name: Run tests
        run: |
          pytest -v tests/conftest.py
      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v5
        with:
          token: ${{ secrets.CODECOV_TOKEN }}
          files: ./coverage.xml
          flags: unittests,pytorch-${{ matrix.pytorch_version }}
          fail_ci_if_error: false
      - name: cleanup pip cache
        run: |
          find "$(pip cache dir)/http-v2" -type f -mtime +14 -exec rm {} \;
      - name: Save HF cache
        id: hf-cache
        uses: actions/cache/save@v4
        with:
          path: |
            /home/runner/.cache/huggingface/hub/datasets--*
            /home/runner/.cache/huggingface/hub/models--*
          key: ${{ steps.hf-cache-restore.outputs.cache-primary-key }}
  pytest:
    name: PyTest
    runs-on: ubuntu-latest
    needs: [preload-cache]
    strategy:
      fail-fast: false
      max-parallel: 2
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -44,102 +44,96 @@ jobs:
        env:
          SKIP: no-commit-to-branch
-#  preload-cache:
+  preload-cache:
-#    name: Preload HF cache
+    name: Preload HF cache
-#    runs-on: ubuntu-latest
+    runs-on: ubuntu-latest
-#    strategy:
+    strategy:
-#      fail-fast: false
+      fail-fast: false
-#      matrix:
+      matrix:
-#        python_version: ["3.11"]
+        python_version: ["3.11"]
-#        pytorch_version: ["2.6.0"]
+        pytorch_version: ["2.6.0"]
-#    timeout-minutes: 20
+    timeout-minutes: 20
-#
+
-#    env:
+    env:
-#      AXOLOTL_IS_CI_CACHE_PRELOAD: "1"
+      AXOLOTL_IS_CI_CACHE_PRELOAD: "1"
-#
+
-#    steps:
+    steps:
-#      - name: Check out repository code
+      - name: Check out repository code
-#        uses: actions/checkout@v4
+        uses: actions/checkout@v4
-#
+
-#      - name: Restore HF cache
+      - name: Restore HF cache
-#        id: hf-cache-restore
+        id: hf-cache-restore
-#        uses: actions/cache/restore@v4
+        uses: actions/cache/restore@v4
-#        with:
+        with:
-#          path: |
+          path: |
-#            /home/runner/.cache/huggingface/hub/datasets--*
+            /home/runner/.cache/huggingface/hub/datasets--*
-#            /home/runner/.cache/huggingface/hub/models--*
+            /home/runner/.cache/huggingface/hub/models--*
-#          key: ${{ runner.os }}-hf-hub-cache-v2
+          key: ${{ runner.os }}-hf-hub-cache-v2
-#
+
-#      - name: Restore Cache from S3
+      - name: Setup Python
-#        id: hf-cache-restore-s3
+        uses: actions/setup-python@v5
-#        run: |
+        with:
-#          mkdir -p /home/runner/.cache/huggingface/hub
+          python-version: ${{ matrix.python_version }}
-#          curl -L https://d1dttdx32dkk5p.cloudfront.net/hf-cache.tar.zst | tar -xf - -C /home/runner/.cache/huggingface/hub/  --use-compress-program unzstd
+          cache: 'pip' # caching pip dependencies
-#
+
-#      - name: Setup Python
+      - name: upgrade pip
-#        uses: actions/setup-python@v5
+        run: |
-#        with:
+          pip3 install --upgrade pip
-#          python-version: ${{ matrix.python_version }}
+          pip3 install --upgrade packaging==23.2 setuptools==75.8.0 wheel
-#          cache: 'pip' # caching pip dependencies
+
-#
+      - name: Install PyTorch
-#      - name: upgrade pip
+        run: |
-#        run: |
+          pip3 install torch==${{ matrix.pytorch_version }}
-#          pip3 install --upgrade pip
+
-#          pip3 install --upgrade packaging==23.2 setuptools==75.8.0 wheel
+      - name: Install dependencies
-#
+        run: |
-#      - name: Install PyTorch
+          pip3 show torch
-#        run: |
+          pip3 install --no-build-isolation -U -e .
-#          pip3 install torch==${{ matrix.pytorch_version }}
+          python scripts/unsloth_install.py | sh
-#
+          python scripts/cutcrossentropy_install.py | sh
-#      - name: Install dependencies
+          pip3 install -r requirements-dev.txt -r requirements-tests.txt
-#        run: |
+
-#          pip3 show torch
+      - name: Make sure PyTorch version wasn't clobbered
-#          pip3 install --no-build-isolation -U -e .
+        run: |
-#          python scripts/unsloth_install.py | sh
+          python -c "import torch; assert '${{ matrix.pytorch_version }}' in torch.__version__"
-#          python scripts/cutcrossentropy_install.py | sh
+
-#          pip3 install -r requirements-dev.txt -r requirements-tests.txt
+      - name: Ensure axolotl CLI was installed
-#
+        run: |
-#      - name: Make sure PyTorch version wasn't clobbered
+          axolotl --help
-#        run: |
+
-#          python -c "import torch; assert '${{ matrix.pytorch_version }}' in torch.__version__"
+      - name: Pre-Download dataset fixture
-#
+        run: |
-#      - name: Ensure axolotl CLI was installed
+          huggingface-cli download --repo-type=dataset axolotl-ai-internal/axolotl-oss-dataset-fixtures
-#        run: |
+
-#          axolotl --help
+      - name: Run tests
-#
+        run: |
-#      - name: Pre-Download dataset fixture
+          pytest -v tests/conftest.py
-#        run: |
+
-#          huggingface-cli download --repo-type=dataset axolotl-ai-internal/axolotl-oss-dataset-fixtures
+      - name: Upload coverage to Codecov
-#
+        uses: codecov/codecov-action@v5
-#      - name: Run tests
+        with:
-#        run: |
+          token: ${{ secrets.CODECOV_TOKEN }}
-#          pytest -v tests/conftest.py
+          files: ./coverage.xml
-#
+          flags: unittests,pytorch-${{ matrix.pytorch_version }}
-#      - name: Upload coverage to Codecov
+          fail_ci_if_error: false
-#        uses: codecov/codecov-action@v5
+
-#        with:
+      - name: cleanup pip cache
-#          token: ${{ secrets.CODECOV_TOKEN }}
+        run: |
-#          files: ./coverage.xml
+          find "$(pip cache dir)/http-v2" -type f -mtime +14 -exec rm {} \;
-#          flags: unittests,pytorch-${{ matrix.pytorch_version }}
+
-#          fail_ci_if_error: false
+      - name: Save HF cache
-#
+        id: hf-cache
-#      - name: cleanup pip cache
+        uses: actions/cache/save@v4
-#        run: |
+        with:
-#          find "$(pip cache dir)/http-v2" -type f -mtime +14 -exec rm {} \;
+          path: |
-#
+            /home/runner/.cache/huggingface/hub/datasets--*
-#      - name: Save HF cache
+            /home/runner/.cache/huggingface/hub/models--*
-#        id: hf-cache
+          key: ${{ steps.hf-cache-restore.outputs.cache-primary-key }}
 #        uses: actions/cache/save@v4
 #        with:
 #          path: |
 #            /home/runner/.cache/huggingface/hub/datasets--*
 #            /home/runner/.cache/huggingface/hub/models--*
 #          key: ${{ steps.hf-cache-restore.outputs.cache-primary-key }}
  pytest:
    name: PyTest
    runs-on: ubuntu-latest
-#    needs: [preload-cache]
+    needs: [preload-cache]
    strategy:
      fail-fast: false
      matrix:
@@ -151,20 +145,14 @@ jobs:
      - name: Check out repository code
        uses: actions/checkout@v4
-#      - name: Restore HF cache
+      - name: Restore HF cache
-#        id: hf-cache-restore
+        id: hf-cache-restore
-#        uses: actions/cache/restore@v4
+        uses: actions/cache/restore@v4
-#        with:
+        with:
-#          path: |
+          path: |
-#            /home/runner/.cache/huggingface/hub/datasets--*
+            /home/runner/.cache/huggingface/hub/datasets--*
-#            /home/runner/.cache/huggingface/hub/models--*
+            /home/runner/.cache/huggingface/hub/models--*
-#          key: ${{ runner.os }}-hf-hub-cache-v2
+          key: ${{ runner.os }}-hf-hub-cache-v2
      - name: Restore Cache from S3
        id: hf-cache-restore-s3
        run: |
          mkdir -p /home/runner/.cache/huggingface/hub
          curl -L https://d1dttdx32dkk5p.cloudfront.net/hf-cache.tar.zst | tar -xf - -C /home/runner/.cache/huggingface/hub/  --use-compress-program unzstd
      - name: Setup Python
        uses: actions/setup-python@v5
@@ -222,7 +210,7 @@ jobs:
  pytest-sdist:
    name: PyTest from Source Dist
    runs-on: ubuntu-latest
-#    needs: [preload-cache]
+    needs: [preload-cache]
    strategy:
      fail-fast: false
      matrix:
@@ -234,20 +222,14 @@ jobs:
      - name: Check out repository code
        uses: actions/checkout@v4
-#      - name: Restore HF cache
+      - name: Restore HF cache
-#        id: hf-cache-restore
+        id: hf-cache-restore
-#        uses: actions/cache/restore@v4
+        uses: actions/cache/restore@v4
-#        with:
+        with:
-#          path: |
+          path: |
-#            /home/runner/.cache/huggingface/hub/datasets--*
+            /home/runner/.cache/huggingface/hub/datasets--*
-#            /home/runner/.cache/huggingface/hub/models--*
+            /home/runner/.cache/huggingface/hub/models--*
-#          key: ${{ runner.os }}-hf-hub-cache-v2
+          key: ${{ runner.os }}-hf-hub-cache-v2
      - name: Restore Cache from S3
        id: hf-cache-restore-s3
        run: |
          mkdir -p /home/runner/.cache/huggingface/hub
          curl -L https://d1dttdx32dkk5p.cloudfront.net/hf-cache.tar.zst | tar -xf - -C /home/runner/.cache/huggingface/hub/  --use-compress-program unzstd
      - name: Setup Python
        uses: actions/setup-python@v5
@@ -347,6 +329,18 @@ jobs:
      fail-fast: false
      matrix:
        include:
          - cuda: 124
            cuda_version: 12.4.1
            python_version: "3.11"
            pytorch: 2.6.0
            num_gpus: 1
            axolotl_extras: llmcompressor
          - cuda: 124
            cuda_version: 12.4.1
            python_version: "3.11"
            pytorch: 2.4.1
            num_gpus: 1
            axolotl_extras:
          - cuda: 124
            cuda_version: 12.4.1
            python_version: "3.11"
@@ -383,43 +377,3 @@ jobs:
      - name: Run tests job on Modal
        run: |
          modal run cicd.e2e_tests
  docker-e2e-cleanup:
    runs-on: [self-hosted, modal]
    timeout-minutes: 90
    needs: [docker-e2e-tests]
    strategy:
      fail-fast: false
      matrix:
        include:
          - cuda: 124
            cuda_version: 12.4.1
            python_version: "3.11"
            pytorch: 2.6.0
            num_gpus: 1
            axolotl_extras: vllm
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Install Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install Modal
        run: |
          python -m pip install --upgrade pip
          pip install modal==0.71.8 jinja2
      - name: Update env vars
        run: |
          echo "BASE_TAG=main-base-py${{ matrix.python_version }}-cu${{ matrix.cuda }}-${{ matrix.pytorch }}" >> $GITHUB_ENV
          echo "PYTORCH_VERSION=${{ matrix.pytorch}}" >> $GITHUB_ENV
          echo "AXOLOTL_ARGS=${{ matrix.axolotl_args}}" >> $GITHUB_ENV
          echo "AXOLOTL_EXTRAS=${{ matrix.axolotl_extras}}" >> $GITHUB_ENV
          echo "CUDA=${{ matrix.cuda }}" >> $GITHUB_ENV
          echo "MODAL_IMAGE_BUILDER_VERSION=2024.10" >> $GITHUB_ENV
          echo "N_GPUS=${{ matrix.num_gpus }}" >> $GITHUB_ENV
          echo "CODECOV_TOKEN=${{ secrets.CODECOV_TOKEN }}" >> $GITHUB_ENV
      - name: Run tests job on Modal
        run: |
          modal run cicd.cleanup
--- a/.runpod/src/handler.py
+++ b/.runpod/src/handler.py
@@ -57,10 +57,8 @@ async def handler(job):
    logger.info("Training Complete.")
    # Cleanup
-    if "WANDB_API_KEY" in os.environ:
+    del os.environ["WANDB_API_KEY"]
-        del os.environ["WANDB_API_KEY"]
+    del os.environ["HF_TOKEN"]
    if "HF_TOKEN" in os.environ:
        del os.environ["HF_TOKEN"]
 runpod.serverless.start({"handler": handler, "return_aggregate_stream": True})
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -124,8 +124,7 @@ quartodoc:
        - utils.optimizers.adopt
        - utils.data.pretraining
        - utils.data.sft
-        - utils.gradient_checkpointing.offload_cpu
+        - utils.gradient_checkpointing.unsloth
        - utils.gradient_checkpointing.offload_disk
    - title: Schemas
      desc: Pydantic data models for Axolotl config
      contents:
--- a/cicd/init.py
+++ b/cicd/init.py
--- a/cicd/cicd.sh
+++ b/cicd/cicd.sh
@@ -18,7 +18,7 @@ pytest -v --durations=10 \
  --cov-append
 # Run patched tests excluding lora kernels with coverage append
-pytest --full-trace -vvv --durations=10 \
+pytest -v --durations=10 \
  --ignore=tests/e2e/patched/lora_kernels \
  /workspace/axolotl/tests/e2e/patched \
  --cov=axolotl \
--- a/cicd/cleanup.py
+++ b/cicd/cleanup.py
@@ -1,19 +0,0 @@
 """Modal app to run axolotl GPU cleanup"""
 from .single_gpu import VOLUME_CONFIG, app, cicd_image, run_cmd
@app.function(
    image=cicd_image,
    timeout=60 * 60,
    cpu=8.0,
    memory=131072,
    volumes=VOLUME_CONFIG,
 )
 def cleanup():
    run_cmd("./cicd/cleanup.sh", "/workspace/axolotl")
@app.local_entrypoint()
 def main():
    cleanup.remote()
--- a/cicd/cleanup.sh
+++ b/cicd/cleanup.sh
@@ -1,6 +0,0 @@
 #!/bin/bash
 set -e
 # cleanup old cache files for datasets processing and intermediate mappings
 find /workspace/data/huggingface-cache/hub/datasets -name "cache-*" -type f -mtime +1 -exec rm {} \;
 find /workspace/data/huggingface-cache/hub/datasets -name "*.lock" -type f -mtime +1 -exec rm {} \;
--- a/cicd/e2e_tests.py
+++ b/cicd/e2e_tests.py
@@ -1,12 +1,75 @@
 """Modal app to run axolotl GPU tests"""
-from .single_gpu import GPU_CONFIG, VOLUME_CONFIG, app, cicd_image, run_cmd
+# pylint: disable=duplicate-code
 import os
 import pathlib
 import tempfile
 import jinja2
 import modal
 from jinja2 import select_autoescape
 from modal import App, Image
 cicd_path = pathlib.Path(__file__).parent.resolve()
 template_loader = jinja2.FileSystemLoader(searchpath=cicd_path)
 template_env = jinja2.Environment(
    loader=template_loader, autoescape=select_autoescape()
 )
 df_template = template_env.get_template("Dockerfile.jinja")
 df_args = {
    "AXOLOTL_EXTRAS": os.environ.get("AXOLOTL_EXTRAS", ""),
    "AXOLOTL_ARGS": os.environ.get("AXOLOTL_ARGS", ""),
    "PYTORCH_VERSION": os.environ.get("PYTORCH_VERSION", "2.4.1"),
    "BASE_TAG": os.environ.get("BASE_TAG", "main-base-py3.11-cu121-2.4.1"),
    "CUDA": os.environ.get("CUDA", "121"),
    "GITHUB_REF": os.environ.get("GITHUB_REF", "refs/heads/main"),
    "GITHUB_SHA": os.environ.get("GITHUB_SHA", ""),
    "NIGHTLY_BUILD": os.environ.get("NIGHTLY_BUILD", ""),
    "CODECOV_TOKEN": os.environ.get("CODECOV_TOKEN", ""),
    "HF_HOME": "/workspace/data/huggingface-cache/hub",
 }
 dockerfile_contents = df_template.render(**df_args)
 temp_dir = tempfile.mkdtemp()
 with open(pathlib.Path(temp_dir) / "Dockerfile", "w", encoding="utf-8") as f:
    f.write(dockerfile_contents)
 cicd_image = Image.from_dockerfile(
    pathlib.Path(temp_dir) / "Dockerfile",
    context_mount=None,
    force_build=True,
    gpu="A10G",
 ).env(df_args)
 app = App("Axolotl CI/CD", secrets=[])
 hf_cache_volume = modal.Volume.from_name(
    "axolotl-ci-hf-hub-cache", create_if_missing=True
 )
 VOLUME_CONFIG = {
    "/workspace/data/huggingface-cache/hub": hf_cache_volume,
 }
 N_GPUS = int(os.environ.get("N_GPUS", 1))
 GPU_CONFIG = modal.gpu.L40S(count=N_GPUS)
 def run_cmd(cmd: str, run_folder: str):
    import subprocess  # nosec
    # Propagate errors from subprocess.
    if exit_code := subprocess.call(cmd.split(), cwd=run_folder):  # nosec
        exit(exit_code)  # pylint: disable=consider-using-sys-exit
@app.function(
    image=cicd_image,
    gpu=GPU_CONFIG,
-    timeout=90 * 60,  # 90 min
+    timeout=60 * 60,
    cpu=8.0,
    memory=131072,
    volumes=VOLUME_CONFIG,
--- a/cicd/single_gpu.py
+++ b/cicd/single_gpu.py
@@ -1,66 +0,0 @@
 """Modal app to run axolotl GPU tests"""
 # pylint: disable=duplicate-code
 import os
 import pathlib
 import tempfile
 import jinja2
 import modal
 from jinja2 import select_autoescape
 from modal import App, Image
 cicd_path = pathlib.Path(__file__).parent.resolve()
 template_loader = jinja2.FileSystemLoader(searchpath=cicd_path)
 template_env = jinja2.Environment(
    loader=template_loader, autoescape=select_autoescape()
 )
 df_template = template_env.get_template("Dockerfile.jinja")
 df_args = {
    "AXOLOTL_EXTRAS": os.environ.get("AXOLOTL_EXTRAS", ""),
    "AXOLOTL_ARGS": os.environ.get("AXOLOTL_ARGS", ""),
    "PYTORCH_VERSION": os.environ.get("PYTORCH_VERSION", "2.4.1"),
    "BASE_TAG": os.environ.get("BASE_TAG", "main-base-py3.11-cu121-2.4.1"),
    "CUDA": os.environ.get("CUDA", "121"),
    "GITHUB_REF": os.environ.get("GITHUB_REF", "refs/heads/main"),
    "GITHUB_SHA": os.environ.get("GITHUB_SHA", ""),
    "NIGHTLY_BUILD": os.environ.get("NIGHTLY_BUILD", ""),
    "CODECOV_TOKEN": os.environ.get("CODECOV_TOKEN", ""),
    "HF_HOME": "/workspace/data/huggingface-cache/hub",
 }
 dockerfile_contents = df_template.render(**df_args)
 temp_dir = tempfile.mkdtemp()
 with open(pathlib.Path(temp_dir) / "Dockerfile", "w", encoding="utf-8") as f:
    f.write(dockerfile_contents)
 cicd_image = Image.from_dockerfile(
    pathlib.Path(temp_dir) / "Dockerfile",
    context_mount=None,
    force_build=True,
    gpu="A10G",
 ).env(df_args)
 app = App("Axolotl CI/CD", secrets=[])
 hf_cache_volume = modal.Volume.from_name(
    "axolotl-ci-hf-hub-cache", create_if_missing=True
 )
 VOLUME_CONFIG = {
    "/workspace/data/huggingface-cache/hub": hf_cache_volume,
 }
 N_GPUS = int(os.environ.get("N_GPUS", 1))
 GPU_CONFIG = modal.gpu.L40S(count=N_GPUS)
 def run_cmd(cmd: str, run_folder: str):
    import subprocess  # nosec
    # Propagate errors from subprocess.
    if exit_code := subprocess.call(cmd.split(), cwd=run_folder):  # nosec
        exit(exit_code)  # pylint: disable=consider-using-sys-exit
--- a/codecov.yml
+++ b/codecov.yml
@@ -19,7 +19,7 @@ coverage:
        if_no_uploads: error
        if_not_found: success
        if_ci_failed: error
-        only_pulls: true
+        only_pulls: false
        flags: null
        paths: null
    patch:
--- a/docs/config.qmd
+++ b/docs/config.qmd
@@ -505,7 +505,6 @@ save_strategy: # Set to `"no"` to skip checkpoint saves, `"epoch"` at end of eac
 save_steps: # Leave empty to save at each epoch, integer for every N steps. float for fraction of total steps
 saves_per_epoch: # number of times per epoch to save a checkpoint, mutually exclusive with save_steps
 save_total_limit: # Checkpoints saved at a time
 save_only_model: # Save only the model weights, skipping the optimizer. Using this means you can't resume from checkpoints.
 # Maximum number of iterations to train for. It precedes num_epochs which means that
 # if both are set, num_epochs will not be guaranteed.
 # e.g., when 1 epoch is 1000 steps => `num_epochs: 2` and `max_steps: 100` will train for 100 steps
@@ -539,7 +538,7 @@ train_on_inputs: false
 # Note that training loss may have an oscillating pattern with this enabled.
 group_by_length: false
-# Whether to use gradient checkpointing. Available options are: true, false, "offload", "offload_disk".
+# Whether to use gradient checkpointing. Available options are: true, false, "offload".
 # https://huggingface.co/docs/transformers/v4.18.0/en/performance#gradient-checkpointing
 gradient_checkpointing: false
 # additional kwargs to pass to the trainer for gradient checkpointing
@@ -613,7 +612,6 @@ lr_div_factor: # Learning rate div factor
 # - optimi_adamw
 # - ao_adamw_8bit
 # - ao_adamw_fp8
 # - came_pytorch
 optimizer:
 # Dictionary of arguments to pass to the optimizer
 optim_args:
--- a/docs/custom_integrations.qmd
+++ b/docs/custom_integrations.qmd
@@ -49,7 +49,8 @@ sections = [
    ("Knowledge Distillation (KD)", "kd"),
    ("Liger Kernels", "liger"),
    ("Language Model Evaluation Harness (LM Eval)", "lm_eval"),
-    ("Spectrum", "spectrum")
+    ("Spectrum", "spectrum"),
    ("LLMCompressor", "llm_compressor")
 ]
 for section_name, folder_name in sections:
--- a/examples/cerebras/btlm-ft.yml
+++ b/examples/cerebras/btlm-ft.yml
@@ -59,9 +59,7 @@ gradient_checkpointing: false
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 sdp_attention:
 flash_optimum:
 gptq_groupsize:
 gptq_model_v1:
--- a/examples/cerebras/qlora.yml
+++ b/examples/cerebras/qlora.yml
@@ -39,8 +39,7 @@ tf32: true
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-xformers_attention: true
+attention: xformers
 flash_attention:
 gptq_groupsize:
 gptq_model_v1:
 warmup_steps: 10
--- a/examples/code-llama/13b/lora.yml
+++ b/examples/code-llama/13b/lora.yml
@@ -45,7 +45,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 10
 evals_per_epoch: 4
--- a/examples/code-llama/13b/qlora.yml
+++ b/examples/code-llama/13b/qlora.yml
@@ -46,7 +46,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 10
 evals_per_epoch: 4
--- a/examples/code-llama/34b/lora.yml
+++ b/examples/code-llama/34b/lora.yml
@@ -45,7 +45,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 10
 evals_per_epoch: 4
--- a/examples/code-llama/34b/qlora.yml
+++ b/examples/code-llama/34b/qlora.yml
@@ -46,7 +46,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 10
 evals_per_epoch: 4
--- a/examples/code-llama/7b/lora.yml
+++ b/examples/code-llama/7b/lora.yml
@@ -45,7 +45,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 10
 evals_per_epoch: 4
--- a/examples/code-llama/7b/qlora.yml
+++ b/examples/code-llama/7b/qlora.yml
@@ -46,7 +46,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 10
 evals_per_epoch: 4
--- a/examples/cohere/command-r-7b-qlora.yml
+++ b/examples/cohere/command-r-7b-qlora.yml
@@ -49,7 +49,8 @@ tf32: true
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_ratio: 0.1
 evals_per_epoch:
--- a/examples/colab-notebooks/colab-axolotl-example.ipynb
+++ b/examples/colab-notebooks/colab-axolotl-example.ipynb
@@ -112,9 +112,7 @@
    "early_stopping_patience:\n",
    "resume_from_checkpoint:\n",
    "logging_steps: 1\n",
-    "xformers_attention:\n",
+    "attention: sdpa\n",
    "flash_attention: false\n",
    "sdp_attention: true\n",
    "\n",
    "warmup_steps: 1\n",
    "max_steps: 25\n",
--- a/examples/dbrx/16bit-lora.yaml
+++ b/examples/dbrx/16bit-lora.yaml
@@ -52,7 +52,8 @@ gradient_checkpointing_kwargs:
  use_reentrant: false
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 10
 evals_per_epoch:
--- a/examples/dbrx/8bit-lora.yaml
+++ b/examples/dbrx/8bit-lora.yaml
@@ -55,7 +55,8 @@ gradient_checkpointing_kwargs:
  use_reentrant: false
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 10
 evals_per_epoch:
--- a/examples/dbrx/fft-ds-zero3.yaml
+++ b/examples/dbrx/fft-ds-zero3.yaml
@@ -39,7 +39,8 @@ gradient_checkpointing_kwargs:
  use_reentrant: false
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 10
 evals_per_epoch:
--- a/examples/deepseek-v2/fft-fsdp-16b.yaml
+++ b/examples/deepseek-v2/fft-fsdp-16b.yaml
@@ -35,7 +35,8 @@ gradient_checkpointing_kwargs:
  use_reentrant: false
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 100
 evals_per_epoch: 2
--- a/examples/deepseek-v2/qlora-fsdp-2_5.yaml
+++ b/examples/deepseek-v2/qlora-fsdp-2_5.yaml
@@ -59,7 +59,8 @@ gradient_checkpointing_kwargs:
  use_reentrant: false
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 100
 evals_per_epoch: 2
--- a/examples/falcon/config-7b-lora.yml
+++ b/examples/falcon/config-7b-lora.yml
@@ -43,8 +43,7 @@ tf32: true
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-xformers_attention: true
+attention: xformers
 flash_attention:
 gptq_groupsize:
 gptq_model_v1:
 warmup_steps: 40
--- a/examples/falcon/config-7b-qlora.yml
+++ b/examples/falcon/config-7b-qlora.yml
@@ -73,8 +73,7 @@ early_stopping_patience: 3
 resume_from_checkpoint:
 auto_resume_from_checkpoints: true
 logging_steps: 1
-xformers_attention: true
+attention: xformers
 flash_attention:
 gptq_groupsize:
 gptq_model_v1:
 warmup_steps: 10
--- a/examples/falcon/config-7b.yml
+++ b/examples/falcon/config-7b.yml
@@ -40,8 +40,7 @@ tf32: true
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-xformers_attention: true
+attention: xformers
 flash_attention:
 gptq_groupsize:
 gptq_model_v1:
 warmup_steps: 40
--- a/examples/gemma/qlora.yml
+++ b/examples/gemma/qlora.yml
@@ -47,7 +47,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_ratio: 0.1
 evals_per_epoch: 4
--- a/examples/gemma2/qlora.yml
+++ b/examples/gemma2/qlora.yml
@@ -53,7 +53,8 @@ tf32: true
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_ratio: 0.1
 evals_per_epoch:
--- a/examples/gemma2/reward-model.yaml
+++ b/examples/gemma2/reward-model.yaml
@@ -43,7 +43,8 @@ gradient_checkpointing_kwargs:
  use_reentrant: false
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_ratio: 0.1
 evals_per_epoch:
--- a/examples/gemma3/gemma-3-1b-qlora.yml
+++ b/examples/gemma3/gemma-3-1b-qlora.yml
@@ -57,7 +57,8 @@ gradient_checkpointing_kwargs:
  use_reentrant: false
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_ratio: 0.1
 evals_per_epoch:
--- a/examples/gemma3/gemma-3-4b-qlora.yml
+++ b/examples/gemma3/gemma-3-4b-qlora.yml
@@ -51,8 +51,7 @@ gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: false
 logging_steps: 1
-flash_attention: true
+attention: flash
 eager_attention:
 warmup_ratio: 0.1
 evals_per_epoch: 1
--- a/examples/gemma3/gemma-3-4b-vision-qlora.yml
+++ b/examples/gemma3/gemma-3-4b-vision-qlora.yml
@@ -53,8 +53,7 @@ gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: false
 logging_steps: 1
-flash_attention: true
+attention: flash
 eager_attention:
 warmup_ratio: 0.1
 evals_per_epoch: 1
--- a/examples/gptj/qlora.yml
+++ b/examples/gptj/qlora.yml
@@ -36,8 +36,7 @@ tf32: true
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-xformers_attention: true
+attention: xformers
 flash_attention:
 gptq_groupsize:
 gptq_model_v1:
 warmup_steps: 10
--- a/examples/jamba/qlora.yaml
+++ b/examples/jamba/qlora.yaml
@@ -47,7 +47,8 @@ gradient_checkpointing_kwargs:
  use_reentrant: false
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 10
 evals_per_epoch:
--- a/examples/jamba/qlora_deepspeed.yaml
+++ b/examples/jamba/qlora_deepspeed.yaml
@@ -46,7 +46,8 @@ gradient_checkpointing_kwargs:
  use_reentrant: false
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 10
 evals_per_epoch:
--- a/examples/jamba/qlora_fsdp_large.yaml
+++ b/examples/jamba/qlora_fsdp_large.yaml
@@ -45,7 +45,8 @@ gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: true
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 10
 evals_per_epoch: 1
--- a/examples/jeopardy-bot/config.yml
+++ b/examples/jeopardy-bot/config.yml
@@ -37,8 +37,7 @@ bf16: auto
 tf32: true
 resume_from_checkpoint:
 logging_steps: 5
-xformers_attention: true
+attention: xformers
 flash_attention:
 gptq_groupsize:
 gptq_model_v1:
 warmup_steps: 20
--- a/examples/llama-2/fft_optimized.yml
+++ b/examples/llama-2/fft_optimized.yml
@@ -42,7 +42,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 flash_attn_cross_entropy: false
 flash_attn_rms_norm: true
 flash_attn_fuse_qkv: false
--- a/examples/llama-2/gptq-lora.yml
+++ b/examples/llama-2/gptq-lora.yml
@@ -53,9 +53,7 @@ tf32: true
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention:
+attention: flash
 sdp_attention:
 flash_optimum:
 warmup_steps: 100
 evals_per_epoch: 4
 saves_per_epoch: 1
--- a/examples/llama-2/lisa.yml
+++ b/examples/llama-2/lisa.yml
@@ -46,7 +46,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 flash_attn_cross_entropy: false
 flash_attn_rms_norm: true
 flash_attn_fuse_qkv: false
--- a/examples/llama-2/loftq.yml
+++ b/examples/llama-2/loftq.yml
@@ -45,7 +45,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 10
 evals_per_epoch: 4
--- a/examples/llama-2/lora.yml
+++ b/examples/llama-2/lora.yml
@@ -45,7 +45,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 10
 evals_per_epoch: 4
--- a/examples/llama-2/qlora-fsdp.yml
+++ b/examples/llama-2/qlora-fsdp.yml
@@ -48,7 +48,8 @@ gradient_checkpointing_kwargs:
  use_reentrant: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 10
 evals_per_epoch: 4
--- a/examples/llama-2/qlora.yml
+++ b/examples/llama-2/qlora.yml
@@ -46,7 +46,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 10
 evals_per_epoch: 4
--- a/examples/llama-2/relora.yml
+++ b/examples/llama-2/relora.yml
@@ -48,7 +48,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 10
 evals_per_epoch: 4
--- a/examples/llama-3-vision/lora-11b.yaml
+++ b/examples/llama-3-vision/lora-11b.yaml
@@ -50,8 +50,7 @@ tf32: true
 gradient_checkpointing: true
 logging_steps: 1
-flash_attention: true
+attention: flash
 eager_attention:
 warmup_ratio: 0.1
 evals_per_epoch: 1
--- a/examples/llama-3/fft-8b-liger-fsdp.yaml
+++ b/examples/llama-3/fft-8b-liger-fsdp.yaml
@@ -49,7 +49,8 @@ gradient_checkpointing_kwargs:
  use_reentrant: false
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 100
 evals_per_epoch: 2
--- a/examples/llama-3/fft-8b.yaml
+++ b/examples/llama-3/fft-8b.yaml
@@ -34,7 +34,8 @@ gradient_checkpointing_kwargs:
  use_reentrant: false
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 100
 evals_per_epoch: 2
--- a/examples/llama-3/instruct-dpo-lora-8b.yml
+++ b/examples/llama-3/instruct-dpo-lora-8b.yml
@@ -61,7 +61,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 10
 evals_per_epoch: 4
--- a/examples/llama-3/instruct-lora-8b.yml
+++ b/examples/llama-3/instruct-lora-8b.yml
@@ -56,7 +56,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 10
 evals_per_epoch: 4
--- a/examples/llama-3/lora-1b-deduplicate-dpo.yml
+++ b/examples/llama-3/lora-1b-deduplicate-dpo.yml
@@ -77,7 +77,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 10
 evals_per_epoch: 4
--- a/examples/llama-3/lora-1b-deduplicate-sft.yml
+++ b/examples/llama-3/lora-1b-deduplicate-sft.yml
@@ -53,7 +53,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 10
 evals_per_epoch: 4
--- a/examples/llama-3/lora-1b-kernels.yml
+++ b/examples/llama-3/lora-1b-kernels.yml
@@ -54,7 +54,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 loss_watchdog_threshold: 5.0
 loss_watchdog_patience: 3
--- a/examples/llama-3/lora-1b-ray.yml
+++ b/examples/llama-3/lora-1b-ray.yml
@@ -48,7 +48,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 loss_watchdog_threshold: 5.0
 loss_watchdog_patience: 3
--- a/examples/llama-3/lora-1b-sample-packing-sequentially.yml
+++ b/examples/llama-3/lora-1b-sample-packing-sequentially.yml
@@ -55,7 +55,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 10
 evals_per_epoch: 4
--- a/examples/llama-3/lora-1b.yml
+++ b/examples/llama-3/lora-1b.yml
@@ -48,7 +48,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 loss_watchdog_threshold: 5.0
 loss_watchdog_patience: 3
--- a/examples/llama-3/lora-8b.yml
+++ b/examples/llama-3/lora-8b.yml
@@ -49,7 +49,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 10
 evals_per_epoch: 4
--- a/examples/llama-3/qlora-1b-kto.yaml
+++ b/examples/llama-3/qlora-1b-kto.yaml
@@ -53,7 +53,8 @@ gradient_checkpointing_kwargs:
  use_reentrant: false
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 20
 evals_per_epoch: 4
--- a/examples/llama-3/qlora-1b.yml
+++ b/examples/llama-3/qlora-1b.yml
@@ -51,7 +51,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 loss_watchdog_threshold: 5.0
 loss_watchdog_patience: 3
--- a/examples/llama-3/qlora-fsdp-405b.yaml
+++ b/examples/llama-3/qlora-fsdp-405b.yaml
@@ -39,7 +39,8 @@ gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: true
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 10
 evals_per_epoch: 4
--- a/examples/llama-3/qlora-fsdp-70b.yaml
+++ b/examples/llama-3/qlora-fsdp-70b.yaml
@@ -48,7 +48,8 @@ gradient_checkpointing_kwargs:
  use_reentrant: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 10
 evals_per_epoch: 4
--- a/examples/llama-3/qlora.yml
+++ b/examples/llama-3/qlora.yml
@@ -46,7 +46,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 10
 evals_per_epoch: 4
--- a/examples/llama-3/sparse-finetuning.yaml
+++ b/examples/llama-3/sparse-finetuning.yaml
@@ -0,0 +1,77 @@
 base_model: neuralmagic/Sparse-Llama-3.1-8B-2of4
 plugins:
  - axolotl.integrations.llm_compressor.LLMCompressorPlugin
 load_in_8bit: false
 load_in_4bit: false
 strict: false
 datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
 dataset_prepared_path: last_run_prepared
 val_set_size: 0.05
 output_dir: ./outputs/out
 sequence_len: 4096
 sample_packing: true
 pad_to_sequence_len: true
 eval_sample_packing: false
 wandb_project:
 wandb_entity:
 wandb_watch:
 wandb_name:
 wandb_log_model:
 gradient_accumulation_steps: 8
 micro_batch_size: 1
 num_epochs: 1
 optimizer: paged_adamw_8bit
 lr_scheduler: cosine
 learning_rate: 2e-5
 train_on_inputs: false
 group_by_length: false
 bf16: auto
 fp16:
 tf32: false
 gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: false
 early_stopping_patience:
 resume_from_checkpoint:
 logging_steps: 1
 xformers_attention:
 flash_attention: true
 warmup_steps: 100
 evals_per_epoch: 2
 eval_table_size:
 saves_per_epoch: 1
 debug:
 deepspeed:
 weight_decay: 0.0
 fsdp:
 fsdp_config:
 special_tokens:
  pad_token: <|end_of_text|>
 llmcompressor:
  recipe:
    finetuning_stage:
      finetuning_modifiers:
        ConstantPruningModifier:
          targets: [
            're:.*q_proj.weight',
            're:.*k_proj.weight',
            're:.*v_proj.weight',
            're:.*o_proj.weight',
            're:.*gate_proj.weight',
            're:.*up_proj.weight',
            're:.*down_proj.weight',
          ]
          start: 0
  save_compressed: true
--- a/examples/llama-4/README.md
+++ b/examples/llama-4/README.md
@@ -34,5 +34,3 @@ We provide a script to delinearize Llama 4 linearized models into regular Huggin
 ```bash
 axolotl delinearize-llama4 --model path/to/model_dir --output path/to/output_dir
 ```
 Note: This only works with the non-quantized linearized model. If you have an adapter, merge it with the *non-quantized linearized* model before delinearizing.
--- a/examples/llava/lora-7b.yaml
+++ b/examples/llava/lora-7b.yaml
@@ -46,8 +46,7 @@ tf32: true
 gradient_checkpointing: true
 logging_steps: 1
-flash_attention: true
+attention: flash
 eager_attention:
 warmup_ratio: 0.1
 evals_per_epoch: 1
--- a/examples/mamba/config.yml
+++ b/examples/mamba/config.yml
@@ -39,7 +39,7 @@ tf32: true
 gradient_checkpointing: false
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention:
+attention: eager
 warmup_steps: 10
 evals_per_epoch: 4
--- a/examples/mistral/bigstral-ds-zero3.yaml
+++ b/examples/mistral/bigstral-ds-zero3.yaml
@@ -42,7 +42,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 save_total_limit: 1
 save_steps:
--- a/examples/mistral/config.yml
+++ b/examples/mistral/config.yml
@@ -36,7 +36,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 10
 evals_per_epoch: 4
--- a/examples/mistral/lora-mps.yml
+++ b/examples/mistral/lora-mps.yml
@@ -53,8 +53,7 @@ tf32: true
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: false
+attention: sdpa
 sdp_attention: true
 loss_watchdog_threshold: 5.0
 loss_watchdog_patience: 3
--- a/examples/mistral/lora.yml
+++ b/examples/mistral/lora.yml
@@ -54,7 +54,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 loss_watchdog_threshold: 5.0
 loss_watchdog_patience: 3
--- a/examples/mistral/mistral-dpo-qlora.yml
+++ b/examples/mistral/mistral-dpo-qlora.yml
@@ -71,7 +71,7 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: false
+attention: eager
 warmup_steps: 10
 evals_per_epoch: 4
--- a/examples/mistral/mistral-qlora-fsdp.yml
+++ b/examples/mistral/mistral-qlora-fsdp.yml
@@ -51,7 +51,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 loss_watchdog_threshold: 5.0
 loss_watchdog_patience: 3
--- a/examples/mistral/mistral-qlora-orpo.yml
+++ b/examples/mistral/mistral-qlora-orpo.yml
@@ -59,7 +59,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 loss_watchdog_threshold: 5.0
 loss_watchdog_patience: 3
--- a/examples/mistral/mistral-small-3.1-24B-lora.yml
+++ b/examples/mistral/mistral-small-3.1-24B-lora.yml
@@ -48,9 +48,7 @@ tf32: true
 gradient_checkpointing: true
 logging_steps: 1
-flash_attention: false # PixtralVisionModel does not support Flash Attention 2.0 yet.
+attention: eager  # PixtralVisionModel does not support Flash Attention 2.0 yet.
 eager_attention:
 warmup_ratio: 0.1
 evals_per_epoch: 1
 saves_per_epoch: 1
--- a/examples/mistral/mixtral-8x22b-qlora-fsdp.yml
+++ b/examples/mistral/mixtral-8x22b-qlora-fsdp.yml
@@ -49,7 +49,8 @@ tf32: true
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 loss_watchdog_threshold: 5.0
 loss_watchdog_patience: 3
--- a/examples/mistral/mixtral-qlora-fsdp.yml
+++ b/examples/mistral/mixtral-qlora-fsdp.yml
@@ -51,7 +51,8 @@ tf32: true
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 loss_watchdog_threshold: 5.0
 loss_watchdog_patience: 3
--- a/examples/mistral/mixtral.yml
+++ b/examples/mistral/mixtral.yml
@@ -69,7 +69,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 loss_watchdog_threshold: 5.0
 loss_watchdog_patience: 3
--- a/examples/mistral/mixtral_22.yml
+++ b/examples/mistral/mixtral_22.yml
@@ -40,7 +40,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 save_total_limit: 1
 save_steps:
--- a/examples/mistral/qlora.yml
+++ b/examples/mistral/qlora.yml
@@ -54,7 +54,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 loss_watchdog_threshold: 5.0
 loss_watchdog_patience: 3
--- a/examples/mpt-7b/config.yml
+++ b/examples/mpt-7b/config.yml
@@ -39,7 +39,7 @@ bf16: auto
 tf32: true
 resume_from_checkpoint:
 logging_steps: 5
-flash_attention:
+attention: eager
 gptq_groupsize:
 gptq_model_v1:
 warmup_steps: 20
--- a/examples/openllama-3b/config.yml
+++ b/examples/openllama-3b/config.yml
@@ -39,7 +39,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 gptq_groupsize:
 gptq_model_v1:
 warmup_steps: 20
--- a/examples/openllama-3b/lora.yml
+++ b/examples/openllama-3b/lora.yml
@@ -47,7 +47,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 gptq_groupsize:
 gptq_model_v1:
 warmup_steps: 20
--- a/examples/openllama-3b/qlora.yml
+++ b/examples/openllama-3b/qlora.yml
@@ -40,7 +40,8 @@ tf32: false
 gradient_checkpointing: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 gptq_groupsize:
 gptq_model_v1:
 warmup_steps: 20
--- a/examples/phi/phi-ft.yml
+++ b/examples/phi/phi-ft.yml
@@ -48,7 +48,8 @@ gradient_checkpointing_kwargs:
  use_reentrant: True
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 100
 evals_per_epoch: 4
--- a/examples/phi/phi-qlora.yml
+++ b/examples/phi/phi-qlora.yml
@@ -51,7 +51,8 @@ gradient_checkpointing_kwargs:
  use_reentrant: True
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 100
 evals_per_epoch: 4
--- a/examples/phi/phi2-ft.yml
+++ b/examples/phi/phi2-ft.yml
@@ -48,7 +48,8 @@ gradient_checkpointing_kwargs:
  use_reentrant: True
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 100
 evals_per_epoch: 4
--- a/examples/phi/phi3-ft-fsdp.yml
+++ b/examples/phi/phi3-ft-fsdp.yml
@@ -49,7 +49,8 @@ gradient_checkpointing_kwargs:
  use_reentrant: true
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 100
 evals_per_epoch: 4
--- a/examples/phi/phi3-ft.yml
+++ b/examples/phi/phi3-ft.yml
@@ -44,7 +44,8 @@ gradient_checkpointing_kwargs:
  use_reentrant: True
 early_stopping_patience: 3
 logging_steps: 1
-flash_attention: true
+attention: flash
 eval_steps: 1000
 save_steps: 5000
--- a/examples/pixtral/lora-12b.yml
+++ b/examples/pixtral/lora-12b.yml
@@ -46,8 +46,7 @@ tf32: true
 gradient_checkpointing: true
 logging_steps: 1
-flash_attention: false # PixtralVisionModel does not support Flash Attention 2.0 yet
+attention: eager  # PixtralVisionModel does not support Flash Attention 2.0 yet
 eager_attention:
 warmup_ratio: 0.1
 evals_per_epoch: 1
--- a/examples/qwen/lora.yml
+++ b/examples/qwen/lora.yml
@@ -47,7 +47,7 @@ tf32: false
 gradient_checkpointing: false
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention:
+attention: eager
 warmup_steps: 10
 evals_per_epoch: 4
--- a/examples/qwen/qlora.yml
+++ b/examples/qwen/qlora.yml
@@ -47,7 +47,7 @@ tf32: false
 gradient_checkpointing: false
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention:
+attention: flash
 warmup_steps: 10
 evals_per_epoch: 4
--- a/examples/qwen/qwen2-moe-lora.yaml
+++ b/examples/qwen/qwen2-moe-lora.yaml
@@ -43,7 +43,8 @@ gradient_checkpointing_kwargs:
  use_reentrant: false
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 10
 evals_per_epoch: 4
--- a/examples/qwen/qwen2-moe-qlora.yaml
+++ b/examples/qwen/qwen2-moe-qlora.yaml
@@ -46,7 +46,8 @@ gradient_checkpointing_kwargs:
  use_reentrant: false
 resume_from_checkpoint:
 logging_steps: 1
-flash_attention: true
+attention: flash
 warmup_steps: 10
 evals_per_epoch: 4
--- a/examples/qwen2-vl/lora-7b.yaml
+++ b/examples/qwen2-vl/lora-7b.yaml
@@ -46,8 +46,7 @@ tf32: true
 gradient_checkpointing: true
 logging_steps: 1
-flash_attention: true
+attention: flash
 eager_attention:
 warmup_ratio: 0.1
 evals_per_epoch: 1
--- a/Show More
+++ b/Show More
Author	SHA1	Message	Date
NanoCode012	ef883b6960	chore: refactor normalize_attn to use mapping and loop	2025-05-07 17:10:18 +07:00
NanoCode012	d0c4930dd5	fix: set replit mpt model to use eager attention	2025-05-07 17:10:18 +07:00
Wing Lian	6ee7cb30fa	fixes from PR feedback	2025-05-07 17:10:18 +07:00
Wing Lian	ba47adc24b	replace attention in the yaml config with an enum	2025-05-07 17:10:18 +07:00
Wing Lian	0d71b0aa5f	Configurable embeddings upcast (#2621 ) * fsdp embeddings should be float32 per comment * patch peft to not upcast everything * add tabs back to code check * fix import * add configurable option and fix check * add check for dtypes * move embeddings test to patch dir * fix test * fix comment and logic	2025-05-06 23:40:44 -04:00
Eric Meier	63aaccf85b	Fix cut_cross_entropy plugin install (#2642 ) [skip ci]	2025-05-06 22:56:00 -04:00
Wing Lian	ff0fe767c8	xformers attention with packing (#2619 ) * xformers attention with packing * wire up the patch * fix xformers + packing validation * fix warning * reorder the packing check * fix fp16 / bf16 reset when using fp16 with bf16 auto * fix seq lens calc to drop hanging sequences * handle xformers patch for inference too * fix batch size setter * fix xformers inference * add colab callback to fix inference post train * PR feedback	2025-05-06 22:49:22 -04:00
Wing Lian	8e4158cc0b	Multipack parallel bin packing (#2631 ) * improve readability of multipack sampler * parallel bin packing fix error with lambda and pickling make sure things are in float instead of np.float * annotations and comments update * support for configurable group and bin size for sample packing * fix missing map back to original indices	2025-05-06 20:08:08 -04:00
Wing Lian	cd84325253	allow plugins to return their own dataset (#2617 ) [skip ci] * allow plugins to return their own dataset * add post_trainer_create and wire up * add hook check * address PR feedback: * remove annotation causing circular import	2025-05-06 20:05:51 -04:00
NanoCode012	0b140fef83	feat(doc): add split_thinking docs (#2613 ) [skip ci] * feat(doc): add split_thinking docs * fix: link config.qmd to conversation.qmd for split_thinking example * update thinking => reasoning_content in messages format --------- Co-authored-by: Wing Lian <wing@axolotl.ai>	2025-05-06 20:05:32 -04:00
Wing Lian	e4cfebe995	bump liger dep to 0.5.9 (#2640 ) [skip ci] * bump liger dep to 0.5.9 * also upgrade vllm to post1, and datasets to 3.5.1	2025-05-06 20:05:19 -04:00
mhenrichsen	a6cac5dd32	Update lr_scheduler options in config.qmd to include additional scheduling strategies for improved training flexibility. (#2636 ) [skip ci]	2025-05-06 11:24:07 -04:00
Wing Lian	b71c0e3447	Print axolotl art if train is called outside of cli: (#2627 ) [skip ci]	2025-05-06 11:18:45 -04:00
Wing Lian	ddaebf8309	fix dpo eval override to call grandparent instead of the broken super (#2628 ) [skip ci]	2025-05-06 11:18:25 -04:00
Wing Lian	679743087a	make sure gc_steps is used for all trainers (#2638 )	2025-05-06 11:18:00 -04:00
Wing Lian	f720b6e72d	repop cache (#2639 ) * repop cache * pre-cache as a step * fix the name * add reason for pytest skipif * restore pytorch matrix * remove max-parallel now that we've optimized this a bit	2025-05-06 11:09:07 -04:00
mhenrichsen	a980618fd0	Adds example for training a TTS model on top of a LLM. (#2614 ) * Adds example for training a TTS model on top of a LLM. * Update examples/orpheus/finetune.yml Co-authored-by: NanoCode012 <nano@axolotl.ai> * Update examples/orpheus/finetune.yml Co-authored-by: NanoCode012 <nano@axolotl.ai> * Update README.md to clarify GPU requirements for finetuning Orpheus TTS model * Update finetune.yml to use the new base model canopylabs/orpheus-3b-0.1-pretrained * Update finetune.yml and README.md for consistency and clarity --------- Co-authored-by: NanoCode012 <nano@axolotl.ai>	2025-05-06 10:11:06 +02:00
Emmanuel Ferdman	54960d4de0	Fix logging deprecation warnings (#2623 ) Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>	2025-05-04 08:22:45 -04:00
Wing Lian	ed922796b7	include multipack support for qwen3 family (#2622 )	2025-05-03 12:02:39 -04:00
Wing Lian	3dd9c3bf3f	setup hf transfer too and fix auto bf16 when fp16 enabled (#2620 ) [skip ci]	2025-05-03 12:02:26 -04:00
Wing Lian	0ba7d362fa	qwen3 and qwen3_moe support for liger kernels (#2612 ) * qwen3 and qwen3_moe support for liger kernels * fix moe module path * fix: qwen3 liger input args and mlp * fix: qwen3 input args and output class --------- Co-authored-by: NanoCode012 <nano@axolotl.ai>	2025-05-02 09:29:55 -04:00
aitechguy	e4f73bc98e	remove keys to incoporate changes for the trl update (#2616 )	2025-05-02 08:47:42 -04:00
Wing Lian	bcb59c70e2	automatically set pad_to_sequence_len when use packing (#2607 ) * automatically set pad_to_sequence_len when use packing * update tests	2025-05-01 13:24:38 -04:00
NanoCode012	6a3e6f8c53	fix: run preview-docs only when md/qmd changes (#2606 ) * fix: run preview-docs only when md/qmd changes * feat: add quarto yaml based on PR feedback	2025-05-01 13:21:28 -04:00
Wing Lian	fee3c13bb5	Logging config for colab (#2611 ) * only configure logging on cli to play nicely with colab * allow reloading the config on the fly from a dict * make sure to use dict for yaml * reuse existing function for load * make cli args optional * mps fix and respect max_steps	2025-05-01 12:58:00 -04:00
Rahul Tuli	996fc124e5	Add: Sparse Finetuning Integration with llmcompressor (#2479 ) * Add: SFTPlugin with llmcompressor * Update: review comments! * Add:llmcompressor instalable * pre commit hooks * Use: warning over warn * Revert: TODO's * Update llmcompressor version to latest * Apply suggestions from @markurtz Co-authored-by: Mark Kurtz <mark.j.kurtz@gmail.com> * Address review comments from @markurtz * Add: llcompressor installable * Rename: sft.yaml to sparse-finetuning.yaml * Use: absolute import * Update model config * Move: LLMCompressorPlugin into it's own submodule * Add: `llm_compressor` integration documentation * Rebase and updates! * Tests, Style, Updates * Add: .qmd file * Address Review Comments: * deleted redundant docs/llm_compressor.qmd * incorporated feedback in integration README.md * added llmcompressor integration to docs/custom_integrations.qmd Signed-off-by: Rahul Tuli <rtuli@redhat.com> * Add: line about further optimizations using llmcompressor Signed-off-by: Rahul Tuli <rtuli@redhat.com> * Apply patch from @winglian Signed-off-by: Rahul Tuli <rtuli@redhat.com> * Fix: Test Signed-off-by: Rahul Tuli <rtuli@redhat.com> * additional fixes for docker and saving compressed * split llmcompressor from vllm checks * Reset session between tests Signed-off-by: Rahul Tuli <rtuli@redhat.com> * move decorator to test method instead of class * make sure to reset the session after each test * move import of llmcompressor to reset session inside test --------- Signed-off-by: Rahul Tuli <rtuli@redhat.com> Co-authored-by: Mark Kurtz <mark.j.kurtz@gmail.com> Co-authored-by: Wing Lian <wing@axolotl.ai>	2025-05-01 12:25:16 -04:00
Wing Lian	e963990ad7	add missing __init__ for lr monkeypatch fix (#2609 )	2025-05-01 09:41:32 -04:00
Dhruv Mullick	c3f2b1c5c2	Add num_completions_to_print for trl and grpo (#2604 )	2025-04-30 21:00:30 -04:00
Wing Lian	6ba5c0ed2c	use latest hf-xet and don't install vllm for torch 2.7.0 (#2603 ) * use latest hf-xet and don't install vllm for torch 2.7.0 * fix runpod hub tests	2025-04-30 18:27:39 -04:00
Wing Lian	24ff5f53f8	additional args for grpo config/trainer (#2598 )	2025-04-30 13:11:12 -04:00
Wing Lian	5e949eaa07	replace zero_only with simpler if statement (#2592 )	2025-04-30 13:11:03 -04:00
Wing Lian	89ca14d9a0	ensure we pass axolotl extras to the Dockerfile so vllm is included in shipped images (#2599 )	2025-04-30 11:35:45 -04:00
Wing Lian	8446b4ad28	don't automatically enable lora kernels for RL training (#2600 )	2025-04-30 11:06:50 -04:00
Wing Lian	fc79606b6d	only import vllm serve cli if its being called (#2597 ) [skip ci]	2025-04-30 09:11:25 -04:00
Wing Lian	baeb00231b	Handle other reasoning trace dataset formats (#2591 ) * Handle other reasoning trace dataset formats * rename var to improve readability * chore: refactor with comments --------- Co-authored-by: NanoCode012 <nano@axolotl.ai>	2025-04-30 03:32:55 -04:00
Wing Lian	2413688b08	upload the deepspeed json to wandb (#2593 ) [skip ci]	2025-04-30 03:32:44 -04:00
NanoCode012	5bb1f3da56	feat: add qwen3 moe block for ds3 (#2596 ) [skip ci]	2025-04-30 03:32:23 -04:00
Wing Lian	a21b9cc472	patch to convert LR from tensor to float when using DS (#2595 ) [skip ci]	2025-04-30 03:31:57 -04:00
Aleksandr Dremov	41a1ec0c95	Plugins create_lr_scheduler support (#2584 ) * lr_scheduler support * fix * Update scheduler.py * Update scheduler.py * cfg handling * black * remove debug * remove adding the axolotl cfg to the scheduler mixin --------- Co-authored-by: Wing Lian <wing@axolotl.ai>	2025-04-29 17:08:30 -04:00
Dan Saunders	ecac731922	auto-enable lora kernels where possible (#2589 ) * auto-enable lora kernels where possible * test * revert change to example yaml * naming * remove print * slight logic change	2025-04-29 16:18:49 -04:00
NanoCode012	742fef4200	fix(doc): key used to point to url in multimodal doc (#2575 ) [skip ci]	2025-04-29 15:10:59 -04:00
Wing Lian	a39caf8824	bump vllm==0.8.5 for qwen3 support (#2583 ) [skip ci]	2025-04-29 15:10:40 -04:00
Wing Lian	07e4f2e25b	support for qwen3 with lora kernels (#2588 ) * support for qwen3 with lora kernels * fix patch * typo	2025-04-29 15:02:49 -04:00
Dan Saunders	c7d07de6b4	Fix eval + add smoke test (#2586 ) * fix evaluate CLI * add smoke test * fix naming * lint	2025-04-29 12:58:54 -04:00
Wing Lian	6565ae85d8	set config on the PluginManager for callback access (#2587 )	2025-04-29 12:05:44 -04:00
Wing Lian	80b4edb4a7	Post release fixes (#2581 ) * fix missing kwarg on child * make the runpod test shorter * update docs * rename runpod test json file * typing fixes and ordering of doc	2025-04-29 10:01:38 -04:00
Wing Lian	fedbcc0254	remove torch 2.4.1 CI as part of support deprecation (#2582 )	2025-04-29 08:28:32 -04:00
Wing Lian	8175896ada	add dev tag for v0.10.0.dev0 (#2580 )	2025-04-28 20:30:14 -04:00