avoid crash/oom on train end

Activation checkpointing with offloading to disk with prefetch (#2663)

* offload activations to disk instead of CPU RAM
* add prefetch
* Disco :dance:
* include offload_disk in e2e test for AC
* document and make sure to cleanup
* fix annotation to match docs
* fix docs build
* address PR feedback
2025-05-15 15:53:35 -04:00 · 2025-05-13 16:39:39 -04:00 · 2025-05-13 08:30:58 -04:00 · 2025-05-12 17:52:40 -04:00 · 2025-05-12 16:23:53 -04:00 · 2025-05-12 14:18:42 -04:00
129 changed files with 5659 additions and 1160 deletions
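
Reviewer sketch (not code from this change): the commit message describes offloading activations to disk, prefetching them, and cleaning up afterwards. The block below is a minimal stdlib-only illustration of that pattern, assuming values are read back in reverse order as in a backward pass. `DiskOffloadStore` and its methods are hypothetical names, and pickle stands in for whatever tensor serialization the real implementation uses.

```python
import pickle
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


class DiskOffloadStore:
    """Hypothetical helper: spill saved values to disk, prefetch on read-back."""

    def __init__(self):
        self._dir = Path(tempfile.mkdtemp(prefix="offload_"))
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending = {}  # key -> Future for an in-flight background read
        self._count = 0

    def _path(self, key):
        return self._dir / f"{key}.pkl"

    def _read(self, key):
        with open(self._path(key), "rb") as fh:
            return pickle.load(fh)

    def save(self, value):
        """Write one value to disk and return its key (forward pass)."""
        key = self._count
        self._count += 1
        with open(self._path(key), "wb") as fh:
            pickle.dump(value, fh)
        return key

    def prefetch(self, key):
        """Start reading `key` in the background if it is a known key."""
        if 0 <= key < self._count and key not in self._pending:
            self._pending[key] = self._pool.submit(self._read, key)

    def load(self, key):
        """Return the value for `key`, then prefetch `key - 1`."""
        future = self._pending.pop(key, None)
        value = future.result() if future is not None else self._read(key)
        # Assumption: reads happen in reverse save order, so key - 1 is next.
        self.prefetch(key - 1)
        return value

    def cleanup(self):
        """Stop the worker thread and delete the temporary directory."""
        self._pool.shutdown(wait=True)
        shutil.rmtree(self._dir, ignore_errors=True)
```

Calling `cleanup()` at the end of training is what keeps the temporary directory from outliving the run; the single worker thread keeps at most one read in flight ahead of the consumer.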
--- a/.github/workflows/base.yml
+++ b/.github/workflows/base.yml
@@ -22,12 +22,6 @@ jobs:
      fail-fast: false
      matrix:
        include:
-          - cuda: "124"
-            cuda_version: 12.4.1
-            cudnn_version: ""
-            python_version: "3.11"
-            pytorch: 2.4.1
-            torch_cuda_arch_list: "7.0 7.5 8.0 8.6 8.7 8.9 9.0+PTX"
          - cuda: "124"
            cuda_version: 12.4.1
            cudnn_version: ""
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -15,11 +15,6 @@ jobs:
      fail-fast: false
      matrix:
        include:
-          - cuda: 124
-            cuda_version: 12.4.1
-            python_version: "3.11"
-            pytorch: 2.4.1
-            axolotl_extras:
          - cuda: 124
            cuda_version: 12.4.1
            python_version: "3.11"
@@ -35,7 +30,7 @@ jobs:
            cuda_version: 12.6.3
            python_version: "3.11"
            pytorch: 2.7.0
-            axolotl_extras: vllm
+            axolotl_extras:
    runs-on: axolotl-gpu-runner
    steps:
      - name: Checkout
@@ -67,6 +62,7 @@ jobs:
            CUDA=${{ matrix.cuda }}
            PYTORCH_VERSION=${{ matrix.pytorch }}
            AXOLOTL_ARGS=${{ matrix.axolotl_args }}
+            AXOLOTL_EXTRAS=${{ matrix.axolotl_extras}}
          file: ./docker/Dockerfile
          push: ${{ github.event_name != 'pull_request' }}
          tags: |
@@ -82,11 +78,6 @@ jobs:
    strategy:
      matrix:
        include:
-          - cuda: 124
-            cuda_version: 12.4.1
-            python_version: "3.11"
-            pytorch: 2.4.1
-            axolotl_extras:
          - cuda: 124
            cuda_version: 12.4.1
            python_version: "3.11"
--- a/.github/workflows/multi-gpu-e2e.yml
+++ b/.github/workflows/multi-gpu-e2e.yml
@@ -3,12 +3,13 @@ name: docker-multigpu-tests-biweekly
 on:
  pull_request:
    paths:
-      - 'tests/e2e/multigpu/*.py'
+      - 'tests/e2e/multigpu/**.py'
      - 'requirements.txt'
      - 'setup.py'
      - 'pyproject.toml'
      - '.github/workflows/multi-gpu-e2e.yml'
      - 'src/axolotl/core/trainers/mixins/sequence_parallel.py'
+      - 'src/axolotl/utils/distributed.py'
  workflow_dispatch:
  schedule:
    - cron: '0 0 * * 1,4'  # Runs at 00:00 UTC every monday & thursday
@@ -32,13 +33,6 @@ jobs:
            axolotl_extras: vllm
            num_gpus: 2
            nightly_build: "true"
-          - cuda: 124
-            cuda_version: 12.4.1
-            python_version: "3.11"
-            pytorch: 2.4.1
-            axolotl_extras:  # no vllm support for 2.4.1
-            num_gpus: 2
-            nightly_build: "true"
          - cuda: 124
            cuda_version: 12.4.1
            python_version: "3.11"
--- a/.github/workflows/nightlies.yml
+++ b/.github/workflows/nightlies.yml
@@ -12,11 +12,6 @@ jobs:
      fail-fast: false
      matrix:
        include:
-          - cuda: 124
-            cuda_version: 12.4.1
-            python_version: "3.11"
-            pytorch: 2.4.1
-            axolotl_extras:
          - cuda: 124
            cuda_version: 12.4.1
            python_version: "3.11"
@@ -70,11 +65,6 @@ jobs:
    strategy:
      matrix:
        include:
-          - cuda: 124
-            cuda_version: 12.4.1
-            python_version: "3.11"
-            pytorch: 2.4.1
-            axolotl_extras:
          - cuda: 124
            cuda_version: 12.4.1
            python_version: "3.11"
--- a/.github/workflows/preview-docs.yml
+++ b/.github/workflows/preview-docs.yml
@@ -4,6 +4,12 @@ on:
  pull_request:
    types: [opened, synchronize, reopened]

+    # Run the workflow only when one of these files changes
+    paths:
+      - '**/*.md'      # any Markdown file
+      - '**/*.qmd'     # any Quarto file
+      - '_quarto.yaml'
+
 permissions:
  checks: write
  contents: write
--- a/.github/workflows/tests-nightly.yml
+++ b/.github/workflows/tests-nightly.yml
@@ -18,15 +18,102 @@ jobs:
        env:
          SKIP: no-commit-to-branch

+  preload-cache:
+    name: Preload HF cache
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        python_version: ["3.11"]
+        pytorch_version: ["2.6.0"]
+    timeout-minutes: 20
+
+    env:
+      AXOLOTL_IS_CI_CACHE_PRELOAD: "1"
+
+    steps:
+      - name: Check out repository code
+        uses: actions/checkout@v4
+
+      - name: Restore HF cache
+        id: hf-cache-restore
+        uses: actions/cache/restore@v4
+        with:
+          path: |
+            /home/runner/.cache/huggingface/hub/datasets--*
+            /home/runner/.cache/huggingface/hub/models--*
+          key: ${{ runner.os }}-hf-hub-cache-v2
+
+      - name: Setup Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python_version }}
+          cache: 'pip' # caching pip dependencies
+
+      - name: upgrade pip
+        run: |
+          pip3 install --upgrade pip
+          pip3 install --upgrade packaging==23.2 setuptools==75.8.0 wheel
+
+      - name: Install PyTorch
+        run: |
+          pip3 install torch==${{ matrix.pytorch_version }}
+
+      - name: Install dependencies
+        run: |
+          pip3 show torch
+          pip3 install --no-build-isolation -U -e .
+          python scripts/unsloth_install.py | sh
+          python scripts/cutcrossentropy_install.py | sh
+          pip3 install -r requirements-dev.txt -r requirements-tests.txt
+
+      - name: Make sure PyTorch version wasn't clobbered
+        run: |
+          python -c "import torch; assert '${{ matrix.pytorch_version }}' in torch.__version__"
+
+      - name: Ensure axolotl CLI was installed
+        run: |
+          axolotl --help
+
+      - name: Pre-Download dataset fixture
+        run: |
+          huggingface-cli download --repo-type=dataset axolotl-ai-internal/axolotl-oss-dataset-fixtures
+
+      - name: Run tests
+        run: |
+          pytest -v tests/conftest.py
+
+      - name: Upload coverage to Codecov
+        uses: codecov/codecov-action@v5
+        with:
+          token: ${{ secrets.CODECOV_TOKEN }}
+          files: ./coverage.xml
+          flags: unittests,pytorch-${{ matrix.pytorch_version }}
+          fail_ci_if_error: false
+
+      - name: cleanup pip cache
+        run: |
+          find "$(pip cache dir)/http-v2" -type f -mtime +14 -exec rm {} \;
+
+      - name: Save HF cache
+        id: hf-cache
+        uses: actions/cache/save@v4
+        with:
+          path: |
+            /home/runner/.cache/huggingface/hub/datasets--*
+            /home/runner/.cache/huggingface/hub/models--*
+          key: ${{ steps.hf-cache-restore.outputs.cache-primary-key }}
+
  pytest:
    name: PyTest
    runs-on: ubuntu-latest
+    needs: [preload-cache]
    strategy:
      fail-fast: false
      max-parallel: 2
      matrix:
        python_version: ["3.11"]
-        pytorch_version: ["2.4.1", "2.5.1", "2.6.0"]
+        pytorch_version: ["2.5.1", "2.6.0", "2.7.0"]
    timeout-minutes: 20

    steps:
@@ -106,13 +193,6 @@ jobs:
      fail-fast: false
      matrix:
        include:
-          - cuda: 124
-            cuda_version: 12.4.1
-            python_version: "3.11"
-            pytorch: 2.4.1
-            num_gpus: 1
-            axolotl_extras:
-            nightly_build: "true"
          - cuda: 124
            cuda_version: 12.4.1
            python_version: "3.11"
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -27,6 +27,9 @@ concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

+env:
+  TRANSFORMERS_IS_CI: "yes"
+
 jobs:
  pre-commit:
    name: pre-commit
@@ -41,29 +44,127 @@ jobs:
        env:
          SKIP: no-commit-to-branch

+#  preload-cache:
+#    name: Preload HF cache
+#    runs-on: ubuntu-latest
+#    strategy:
+#      fail-fast: false
+#      matrix:
+#        python_version: ["3.11"]
+#        pytorch_version: ["2.6.0"]
+#    timeout-minutes: 20
+#
+#    env:
+#      AXOLOTL_IS_CI_CACHE_PRELOAD: "1"
+#
+#    steps:
+#      - name: Check out repository code
+#        uses: actions/checkout@v4
+#
+#      - name: Restore HF cache
+#        id: hf-cache-restore
+#        uses: actions/cache/restore@v4
+#        with:
+#          path: |
+#            /home/runner/.cache/huggingface/hub/datasets--*
+#            /home/runner/.cache/huggingface/hub/models--*
+#          key: ${{ runner.os }}-hf-hub-cache-v2
+#
+#      - name: Restore Cache from S3
+#        id: hf-cache-restore-s3
+#        run: |
+#          mkdir -p /home/runner/.cache/huggingface/hub
+#          curl -L https://d1dttdx32dkk5p.cloudfront.net/hf-cache.tar.zst | tar -xf - -C /home/runner/.cache/huggingface/hub/  --use-compress-program unzstd
+#
+#      - name: Setup Python
+#        uses: actions/setup-python@v5
+#        with:
+#          python-version: ${{ matrix.python_version }}
+#          cache: 'pip' # caching pip dependencies
+#
+#      - name: upgrade pip
+#        run: |
+#          pip3 install --upgrade pip
+#          pip3 install --upgrade packaging==23.2 setuptools==75.8.0 wheel
+#
+#      - name: Install PyTorch
+#        run: |
+#          pip3 install torch==${{ matrix.pytorch_version }}
+#
+#      - name: Install dependencies
+#        run: |
+#          pip3 show torch
+#          pip3 install --no-build-isolation -U -e .
+#          python scripts/unsloth_install.py | sh
+#          python scripts/cutcrossentropy_install.py | sh
+#          pip3 install -r requirements-dev.txt -r requirements-tests.txt
+#
+#      - name: Make sure PyTorch version wasn't clobbered
+#        run: |
+#          python -c "import torch; assert '${{ matrix.pytorch_version }}' in torch.__version__"
+#
+#      - name: Ensure axolotl CLI was installed
+#        run: |
+#          axolotl --help
+#
+#      - name: Pre-Download dataset fixture
+#        run: |
+#          huggingface-cli download --repo-type=dataset axolotl-ai-internal/axolotl-oss-dataset-fixtures
+#
+#      - name: Run tests
+#        run: |
+#          pytest -v tests/conftest.py
+#
+#      - name: Upload coverage to Codecov
+#        uses: codecov/codecov-action@v5
+#        with:
+#          token: ${{ secrets.CODECOV_TOKEN }}
+#          files: ./coverage.xml
+#          flags: unittests,pytorch-${{ matrix.pytorch_version }}
+#          fail_ci_if_error: false
+#
+#      - name: cleanup pip cache
+#        run: |
+#          find "$(pip cache dir)/http-v2" -type f -mtime +14 -exec rm {} \;
+#
+#      - name: Save HF cache
+#        id: hf-cache
+#        uses: actions/cache/save@v4
+#        with:
+#          path: |
+#            /home/runner/.cache/huggingface/hub/datasets--*
+#            /home/runner/.cache/huggingface/hub/models--*
+#          key: ${{ steps.hf-cache-restore.outputs.cache-primary-key }}
+
  pytest:
    name: PyTest
    runs-on: ubuntu-latest
+#    needs: [preload-cache]
    strategy:
      fail-fast: false
-      max-parallel: 2
      matrix:
        python_version: ["3.11"]
-        pytorch_version: ["2.4.1", "2.5.1", "2.6.0", "2.7.0"]
+        pytorch_version: ["2.5.1", "2.6.0", "2.7.0"]
    timeout-minutes: 20

    steps:
      - name: Check out repository code
        uses: actions/checkout@v4

-      - name: Restore HF cache
-        id: hf-cache-restore
-        uses: actions/cache/restore@v4
-        with:
-          path: |
-            /home/runner/.cache/huggingface/hub/datasets--*
-            /home/runner/.cache/huggingface/hub/models--*
-          key: ${{ runner.os }}-hf-hub-cache-v2
+#      - name: Restore HF cache
+#        id: hf-cache-restore
+#        uses: actions/cache/restore@v4
+#        with:
+#          path: |
+#            /home/runner/.cache/huggingface/hub/datasets--*
+#            /home/runner/.cache/huggingface/hub/models--*
+#          key: ${{ runner.os }}-hf-hub-cache-v2
+
+      - name: Restore Cache from S3
+        id: hf-cache-restore-s3
+        run: |
+          mkdir -p /home/runner/.cache/huggingface/hub
+          curl -L https://d1dttdx32dkk5p.cloudfront.net/hf-cache.tar.zst | tar -xf - -C /home/runner/.cache/huggingface/hub/  --use-compress-program unzstd

      - name: Setup Python
        uses: actions/setup-python@v5
@@ -118,38 +219,35 @@ jobs:
        run: |
          find "$(pip cache dir)/http-v2" -type f -mtime +14 -exec rm {} \;

-      - name: Save HF cache
-        id: hf-cache
-        uses: actions/cache/save@v4
-        with:
-          path: |
-            /home/runner/.cache/huggingface/hub/datasets--*
-            /home/runner/.cache/huggingface/hub/models--*
-          key: ${{ steps.hf-cache-restore.outputs.cache-primary-key }}
-
  pytest-sdist:
    name: PyTest from Source Dist
    runs-on: ubuntu-latest
+#    needs: [preload-cache]
    strategy:
      fail-fast: false
-      max-parallel: 1
      matrix:
        python_version: ["3.11"]
-        pytorch_version: ["2.4.1", "2.5.1", "2.6.0"]
+        pytorch_version: ["2.5.1", "2.6.0", "2.7.0"]
    timeout-minutes: 20

    steps:
      - name: Check out repository code
        uses: actions/checkout@v4

-      - name: Restore HF cache
-        id: hf-cache-restore
-        uses: actions/cache/restore@v4
-        with:
-          path: |
-            /home/runner/.cache/huggingface/hub/datasets--*
-            /home/runner/.cache/huggingface/hub/models--*
-          key: ${{ runner.os }}-hf-hub-cache-v2
+#      - name: Restore HF cache
+#        id: hf-cache-restore
+#        uses: actions/cache/restore@v4
+#        with:
+#          path: |
+#            /home/runner/.cache/huggingface/hub/datasets--*
+#            /home/runner/.cache/huggingface/hub/models--*
+#          key: ${{ runner.os }}-hf-hub-cache-v2
+
+      - name: Restore Cache from S3
+        id: hf-cache-restore-s3
+        run: |
+          mkdir -p /home/runner/.cache/huggingface/hub
+          curl -L https://d1dttdx32dkk5p.cloudfront.net/hf-cache.tar.zst | tar -xf - -C /home/runner/.cache/huggingface/hub/  --use-compress-program unzstd

      - name: Setup Python
        uses: actions/setup-python@v5
@@ -196,15 +294,6 @@ jobs:
        run: |
          find "$(pip cache dir)/http-v2" -type f -mtime +14 -exec rm {} \;

-      - name: Save HF cache
-        id: hf-cache
-        uses: actions/cache/save@v4
-        with:
-          path: |
-            /home/runner/.cache/huggingface/hub/datasets--*
-            /home/runner/.cache/huggingface/hub/models--*
-          key: ${{ steps.hf-cache-restore.outputs.cache-primary-key }}
-
  docker-e2e-tests-1st:
    if: ${{ ! contains(github.event.commits[0].message, '[skip e2e]') && github.repository_owner == 'axolotl-ai-cloud' }}
    # this job needs to be run on self-hosted GPU runners...
@@ -261,9 +350,9 @@ jobs:
          - cuda: 124
            cuda_version: 12.4.1
            python_version: "3.11"
-            pytorch: 2.4.1
+            pytorch: 2.6.0
            num_gpus: 1
-            axolotl_extras:
+            axolotl_extras: llmcompressor
          - cuda: 124
            cuda_version: 12.4.1
            python_version: "3.11"
@@ -300,3 +389,43 @@ jobs:
      - name: Run tests job on Modal
        run: |
          modal run cicd.e2e_tests
+
+  docker-e2e-cleanup:
+    runs-on: [self-hosted, modal]
+    timeout-minutes: 90
+    needs: [docker-e2e-tests]
+
+    strategy:
+      fail-fast: false
+      matrix:
+        include:
+          - cuda: 124
+            cuda_version: 12.4.1
+            python_version: "3.11"
+            pytorch: 2.6.0
+            num_gpus: 1
+            axolotl_extras: vllm
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+      - name: Install Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+      - name: Install Modal
+        run: |
+          python -m pip install --upgrade pip
+          pip install modal==0.71.8 jinja2
+      - name: Update env vars
+        run: |
+          echo "BASE_TAG=main-base-py${{ matrix.python_version }}-cu${{ matrix.cuda }}-${{ matrix.pytorch }}" >> $GITHUB_ENV
+          echo "PYTORCH_VERSION=${{ matrix.pytorch}}" >> $GITHUB_ENV
+          echo "AXOLOTL_ARGS=${{ matrix.axolotl_args}}" >> $GITHUB_ENV
+          echo "AXOLOTL_EXTRAS=${{ matrix.axolotl_extras}}" >> $GITHUB_ENV
+          echo "CUDA=${{ matrix.cuda }}" >> $GITHUB_ENV
+          echo "MODAL_IMAGE_BUILDER_VERSION=2024.10" >> $GITHUB_ENV
+          echo "N_GPUS=${{ matrix.num_gpus }}" >> $GITHUB_ENV
+          echo "CODECOV_TOKEN=${{ secrets.CODECOV_TOKEN }}" >> $GITHUB_ENV
+      - name: Run tests job on Modal
+        run: |
+          modal run cicd.cleanup
--- a/.runpod/src/handler.py
+++ b/.runpod/src/handler.py
@@ -57,8 +57,10 @@ async def handler(job):
    logger.info("Training Complete.")

    # Cleanup
-    del os.environ["WANDB_API_KEY"]
-    del os.environ["HF_TOKEN"]
+    if "WANDB_API_KEY" in os.environ:
+        del os.environ["WANDB_API_KEY"]
+    if "HF_TOKEN" in os.environ:
+        del os.environ["HF_TOKEN"]


 runpod.serverless.start({"handler": handler, "return_aggregate_stream": True})
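
Reviewer sketch (not part of this change): the guarded `del` above avoids a `KeyError` when a credential was never set. An equivalent and slightly more compact form uses `os.environ.pop` with a default. `clear_credentials` is a hypothetical helper name used only for illustration.

```python
import os


def clear_credentials(names=("WANDB_API_KEY", "HF_TOKEN")):
    """Remove each named variable from os.environ if present.

    Returns the names that were actually removed.
    """
    removed = []
    for name in names:
        # pop() with a default never raises, even if the variable is unset.
        if os.environ.pop(name, None) is not None:
            removed.append(name)
    return removed
```

Both forms behave the same for the handler; the `pop` variant simply folds the membership check and the deletion into one call.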
--- a/.runpod/test-input.json
+++ b/.runpod/test-input.json
@@ -0,0 +1,86 @@
+{
+  "input": {
+    "name": "quick_smoke_test_sft",
+    "user_id": "user",
+    "model_id": "llama-test",
+    "run_id": "llama-test",
+    "credentials": {
+      "wandb_api_key": "",
+      "hf_token": ""
+    },
+    "args": {
+      "base_model": "HuggingFaceTB/SmolLM2-135M",
+      "model_type": "AutoModelForCausalLM",
+      "tokenizer_type": "AutoTokenizer",
+      "load_in_4bit": true,
+      "strict": false,
+      "datasets": [
+        {
+          "path": "mhenrichsen/alpaca_2k_test",
+          "type": "alpaca",
+          "split": "train[:10%]"
+        }
+      ],
+      "val_set_size": 0.02,
+      "output_dir": "./outputs/lora-out",
+      "sequence_len": 4096,
+      "sample_packing": true,
+      "eval_sample_packing": false,
+      "pad_to_sequence_len": true,
+      "adapter": "qlora",
+      "lora_r": 32,
+      "lora_alpha": 64,
+      "lora_dropout": 0.05,
+      "lora_target_linear": true,
+      "lora_modules_to_save": [
+        "embed_tokens",
+        "lm_head"
+      ],
+      "gradient_accumulation_steps": 2,
+      "micro_batch_size": 1,
+      "num_epochs": 1,
+      "optimizer": "adamw_torch_fused",
+      "lr_scheduler": "cosine",
+      "learning_rate": 0.0002,
+      "train_on_inputs": false,
+      "group_by_length": false,
+      "bf16": "auto",
+      "tf32": true,
+      "gradient_checkpointing": true,
+      "logging_steps": 1,
+      "flash_attention": true,
+      "warmup_steps": 1,
+      "evals_per_epoch": 1,
+      "eval_max_new_tokens": 128,
+      "saves_per_epoch": 1,
+      "weight_decay": 0.0,
+      "special_tokens": {
+        "pad_token": "<|endoftext|>"
+      },
+      "max_steps": 20
+    },
+    "timeout": 100000
+  },
+  "config": {
+    "gpuTypeId": "NVIDIA GeForce RTX 4090",
+    "gpuCount": 1,
+    "containerDiskInGb": 200,
+    "env": [
+      {
+        "key": "TOKENIZER",
+        "value": ""
+      },
+      {
+        "key": "DISABLE_LOG_STATS",
+        "value": "true"
+      }
+    ],
+    "allowedCudaVersions": [
+      "12.8",
+      "12.7",
+      "12.6",
+      "12.5",
+      "12.4"
+    ]
+  }
+}
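
Reviewer sketch (not part of this change): `.runpod/test-input.json` keeps a single top-level `input` payload, while `.runpod/tests.json` now wraps the payload in a `tests` list with `name` and `timeout` lifted out of `input`. The hypothetical helper below shows that structural mapping only; it does not reproduce the argument value changes made in the diff.

```python
import copy


def to_tests_layout(single: dict) -> dict:
    """Convert a single-payload document into the `tests` list layout."""
    payload = copy.deepcopy(single["input"])
    # `name` and `timeout` move from the payload to the test entry.
    name = payload.pop("name", "unnamed")
    timeout = payload.pop("timeout", None)

    test = {"name": name, "input": payload}
    if timeout is not None:
        test["timeout"] = timeout

    converted = {"tests": [test]}
    if "config" in single:
        converted["config"] = copy.deepcopy(single["config"])
    return converted
```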
--- a/.runpod/tests.json
+++ b/.runpod/tests.json
@@ -1,65 +1,70 @@
 {
-  "input": {
-    "name": "quick_smoke_test_sft",
-    "user_id": "user",
-    "model_id": "llama-test",
-    "run_id": "llama-test",
-    "credentials": {
-      "wandb_api_key": "",
-      "hf_token": ""
-    },
-    "args": {
-      "base_model": "HuggingFaceTB/SmolLM2-135M",
-      "model_type": "AutoModelForCausalLM",
-      "tokenizer_type": "AutoTokenizer",
-      "load_in_8bit": true,
-      "load_in_4bit": false,
-      "strict": false,
-      "datasets": [
-        {
-          "path": "mhenrichsen/alpaca_2k_test",
-          "type": "alpaca"
+  "tests": [
+    {
+      "name": "quick_smoke_test_sft",
+      "input": {
+        "user_id": "user",
+        "model_id": "llama-test",
+        "run_id": "llama-test",
+        "credentials": {
+          "wandb_api_key": "",
+          "hf_token": ""
+        },
+        "args": {
+          "base_model": "HuggingFaceTB/SmolLM2-135M",
+          "model_type": "AutoModelForCausalLM",
+          "tokenizer_type": "AutoTokenizer",
+          "load_in_4bit": true,
+          "strict": false,
+          "datasets": [
+            {
+              "path": "mhenrichsen/alpaca_2k_test",
+              "type": "alpaca",
+              "split": "train[:10%]"
+            }
+          ],
+          "val_set_size": 0.02,
+          "output_dir": "./outputs/lora-out",
+          "sequence_len": 4096,
+          "sample_packing": true,
+          "eval_sample_packing": false,
+          "pad_to_sequence_len": true,
+          "adapter": "qlora",
+          "lora_r": 32,
+          "lora_alpha": 64,
+          "lora_dropout": 0.05,
+          "lora_target_linear": true,
+          "lora_modules_to_save": [
+            "embed_tokens",
+            "lm_head"
+          ],
+          "gradient_accumulation_steps": 2,
+          "micro_batch_size": 1,
+          "num_epochs": 1,
+          "optimizer": "adamw_torch_fused",
+          "lr_scheduler": "cosine",
+          "learning_rate": 0.0002,
+          "train_on_inputs": false,
+          "group_by_length": false,
+          "bf16": "auto",
+          "tf32": true,
+          "gradient_checkpointing": true,
+          "logging_steps": 1,
+          "flash_attention": true,
+          "warmup_steps": 1,
+          "evals_per_epoch": 1,
+          "eval_max_new_tokens": 128,
+          "saves_per_epoch": 1,
+          "weight_decay": 0.0,
+          "special_tokens": {
+            "pad_token": "<|endoftext|>"
+          },
+          "max_steps": 20
        }
-      ],
-      "val_set_size": 0.05,
-      "output_dir": "./outputs/lora-out",
-      "sequence_len": 4096,
-      "sample_packing": true,
-      "eval_sample_packing": false,
-      "pad_to_sequence_len": true,
-      "adapter": "lora",
-      "lora_r": 32,
-      "lora_alpha": 64,
-      "lora_dropout": 0.05,
-      "lora_target_linear": true,
-      "lora_modules_to_save": [
-        "embed_tokens",
-        "lm_head"
-      ],
-      "gradient_accumulation_steps": 4,
-      "micro_batch_size": 2,
-      "num_epochs": 1,
-      "optimizer": "adamw_torch_fused",
-      "lr_scheduler": "cosine",
-      "learning_rate": 0.0002,
-      "train_on_inputs": false,
-      "group_by_length": false,
-      "bf16": "auto",
-      "tf32": true,
-      "gradient_checkpointing": true,
-      "logging_steps": 1,
-      "flash_attention": true,
-      "warmup_steps": 1,
-      "evals_per_epoch": 1,
-      "eval_max_new_tokens": 128,
-      "saves_per_epoch": 1,
-      "weight_decay": 0.0,
-      "special_tokens": {
-        "pad_token": "<|endoftext|>"
-      }
-    },
-    "timeout": 100000
-  },
+      },
+      "timeout": 100000
+    }
+  ],
  "config": {
    "gpuTypeId": "NVIDIA GeForce RTX 4090",
    "gpuCount": 1,
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -48,8 +48,23 @@ quartodoc:
      contents:
        - core.trainers.base
        - core.trainers.trl
+        - core.trainers.mamba
+        - core.trainers.relora
        - core.trainers.dpo.trainer
        - core.trainers.grpo.trainer
+        - core.trainers.grpo.sampler
+        - core.trainers.utils
+    - title: Mixins
+      desc: Mixin classes for augmenting trainers
+      contents:
+        - core.trainers.mixins.optimizer
+        - core.trainers.mixins.rng_state_loader
+        - core.trainers.mixins.scheduler
+        - core.trainers.mixins.sequence_parallel
+    - title: Context Managers
+      desc: Context managers for altering trainer behaviors
+      contents:
+        - utils.ctx_managers.sequence_parallel
    - title: Prompt Strategies
      desc: Prompt formatting strategies
      contents:
@@ -86,7 +101,7 @@ quartodoc:
        - kernels.swiglu
        - kernels.quantize
        - kernels.utils
-    - title: MonkeyPatches
+    - title: Monkey Patches
      desc: Runtime patches for model optimizations
      contents:
        - monkeypatch.llama_attn_hijack_flash
@@ -124,7 +139,8 @@ quartodoc:
        - utils.optimizers.adopt
        - utils.data.pretraining
        - utils.data.sft
-        - utils.gradient_checkpointing.unsloth
+        - utils.gradient_checkpointing.offload_cpu
+        - utils.gradient_checkpointing.offload_disk
    - title: Schemas
      desc: Pydantic data models for Axolotl config
      contents:
--- a/cicd/__init__.py
+++ b/cicd/__init__.py
--- a/cicd/cicd.sh
+++ b/cicd/cicd.sh
@@ -18,7 +18,7 @@ pytest -v --durations=10 \
  --cov-append

 # Run patched tests excluding lora kernels with coverage append
-pytest -v --durations=10 \
+pytest --full-trace -vvv --durations=10 \
  --ignore=tests/e2e/patched/lora_kernels \
  /workspace/axolotl/tests/e2e/patched \
  --cov=axolotl \
--- a/cicd/cleanup.py
+++ b/cicd/cleanup.py
@@ -0,0 +1,19 @@
+"""Modal app to run axolotl GPU cleanup"""
+
+from .single_gpu import VOLUME_CONFIG, app, cicd_image, run_cmd
+
+
+@app.function(
+    image=cicd_image,
+    timeout=60 * 60,
+    cpu=8.0,
+    memory=131072,
+    volumes=VOLUME_CONFIG,
+)
+def cleanup():
+    run_cmd("./cicd/cleanup.sh", "/workspace/axolotl")
+
+
+@app.local_entrypoint()
+def main():
+    cleanup.remote()
--- a/cicd/cleanup.sh
+++ b/cicd/cleanup.sh
@@ -0,0 +1,6 @@
+#!/bin/bash
+set -e
+
+# cleanup old cache files for datasets processing and intermediate mappings
+find /workspace/data/huggingface-cache/hub/datasets -name "cache-*" -type f -mtime +1 -exec rm {} \;
+find /workspace/data/huggingface-cache/hub/datasets -name "*.lock" -type f -mtime +1 -exec rm {} \;
--- a/cicd/e2e_tests.py
+++ b/cicd/e2e_tests.py
@@ -1,75 +1,12 @@
 """Modal app to run axolotl GPU tests"""

-# pylint: disable=duplicate-code
-
-import os
-import pathlib
-import tempfile
-
-import jinja2
-import modal
-from jinja2 import select_autoescape
-from modal import App, Image
-
-cicd_path = pathlib.Path(__file__).parent.resolve()
-
-template_loader = jinja2.FileSystemLoader(searchpath=cicd_path)
-template_env = jinja2.Environment(
-    loader=template_loader, autoescape=select_autoescape()
-)
-df_template = template_env.get_template("Dockerfile.jinja")
-
-df_args = {
-    "AXOLOTL_EXTRAS": os.environ.get("AXOLOTL_EXTRAS", ""),
-    "AXOLOTL_ARGS": os.environ.get("AXOLOTL_ARGS", ""),
-    "PYTORCH_VERSION": os.environ.get("PYTORCH_VERSION", "2.4.1"),
-    "BASE_TAG": os.environ.get("BASE_TAG", "main-base-py3.11-cu121-2.4.1"),
-    "CUDA": os.environ.get("CUDA", "121"),
-    "GITHUB_REF": os.environ.get("GITHUB_REF", "refs/heads/main"),
-    "GITHUB_SHA": os.environ.get("GITHUB_SHA", ""),
-    "NIGHTLY_BUILD": os.environ.get("NIGHTLY_BUILD", ""),
-    "CODECOV_TOKEN": os.environ.get("CODECOV_TOKEN", ""),
-    "HF_HOME": "/workspace/data/huggingface-cache/hub",
-}
-
-dockerfile_contents = df_template.render(**df_args)
-
-temp_dir = tempfile.mkdtemp()
-with open(pathlib.Path(temp_dir) / "Dockerfile", "w", encoding="utf-8") as f:
-    f.write(dockerfile_contents)
-
-cicd_image = Image.from_dockerfile(
-    pathlib.Path(temp_dir) / "Dockerfile",
-    context_mount=None,
-    force_build=True,
-    gpu="A10G",
-).env(df_args)
-
-app = App("Axolotl CI/CD", secrets=[])
-
-hf_cache_volume = modal.Volume.from_name(
-    "axolotl-ci-hf-hub-cache", create_if_missing=True
-)
-VOLUME_CONFIG = {
-    "/workspace/data/huggingface-cache/hub": hf_cache_volume,
-}
-
-N_GPUS = int(os.environ.get("N_GPUS", 1))
-GPU_CONFIG = modal.gpu.L40S(count=N_GPUS)
-
-
-def run_cmd(cmd: str, run_folder: str):
-    import subprocess  # nosec
-
-    # Propagate errors from subprocess.
-    if exit_code := subprocess.call(cmd.split(), cwd=run_folder):  # nosec
-        exit(exit_code)  # pylint: disable=consider-using-sys-exit
+from .single_gpu import GPU_CONFIG, VOLUME_CONFIG, app, cicd_image, run_cmd


@app.function(
    image=cicd_image,
    gpu=GPU_CONFIG,
-    timeout=60 * 60,
+    timeout=90 * 60,  # 90 min
    cpu=8.0,
    memory=131072,
    volumes=VOLUME_CONFIG,
--- a/cicd/single_gpu.py
+++ b/cicd/single_gpu.py
@@ -0,0 +1,66 @@
+"""Modal app to run axolotl GPU tests"""
+
+# pylint: disable=duplicate-code
+
+import os
+import pathlib
+import tempfile
+
+import jinja2
+import modal
+from jinja2 import select_autoescape
+from modal import App, Image
+
+cicd_path = pathlib.Path(__file__).parent.resolve()
+
+template_loader = jinja2.FileSystemLoader(searchpath=cicd_path)
+template_env = jinja2.Environment(
+    loader=template_loader, autoescape=select_autoescape()
+)
+df_template = template_env.get_template("Dockerfile.jinja")
+
+df_args = {
+    "AXOLOTL_EXTRAS": os.environ.get("AXOLOTL_EXTRAS", ""),
+    "AXOLOTL_ARGS": os.environ.get("AXOLOTL_ARGS", ""),
+    "PYTORCH_VERSION": os.environ.get("PYTORCH_VERSION", "2.4.1"),
+    "BASE_TAG": os.environ.get("BASE_TAG", "main-base-py3.11-cu121-2.4.1"),
+    "CUDA": os.environ.get("CUDA", "121"),
+    "GITHUB_REF": os.environ.get("GITHUB_REF", "refs/heads/main"),
+    "GITHUB_SHA": os.environ.get("GITHUB_SHA", ""),
+    "NIGHTLY_BUILD": os.environ.get("NIGHTLY_BUILD", ""),
+    "CODECOV_TOKEN": os.environ.get("CODECOV_TOKEN", ""),
+    "HF_HOME": "/workspace/data/huggingface-cache/hub",
+}
+
+dockerfile_contents = df_template.render(**df_args)
+
+temp_dir = tempfile.mkdtemp()
+with open(pathlib.Path(temp_dir) / "Dockerfile", "w", encoding="utf-8") as f:
+    f.write(dockerfile_contents)
+
+cicd_image = Image.from_dockerfile(
+    pathlib.Path(temp_dir) / "Dockerfile",
+    context_mount=None,
+    force_build=True,
+    gpu="A10G",
+).env(df_args)
+
+app = App("Axolotl CI/CD", secrets=[])
+
+hf_cache_volume = modal.Volume.from_name(
+    "axolotl-ci-hf-hub-cache", create_if_missing=True
+)
+VOLUME_CONFIG = {
+    "/workspace/data/huggingface-cache/hub": hf_cache_volume,
+}
+
+N_GPUS = int(os.environ.get("N_GPUS", 1))
+GPU_CONFIG = modal.gpu.L40S(count=N_GPUS)
+
+
+def run_cmd(cmd: str, run_folder: str):
+    import subprocess  # nosec
+
+    # Propagate errors from subprocess.
+    if exit_code := subprocess.call(cmd.split(), cwd=run_folder):  # nosec
+        exit(exit_code)  # pylint: disable=consider-using-sys-exit
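
Reviewer sketch (not part of this change): `run_cmd` splits the command with `str.split()`, which is fine for `./cicd/cleanup.sh` but would break quoted arguments. A variant using `shlex.split` and `sys.exit` is shown below as one possible alternative; it keeps the same name and signature purely for comparison.

```python
import shlex
import subprocess
import sys


def run_cmd(cmd: str, run_folder: str) -> None:
    """Run `cmd` in `run_folder` and exit with its status on failure."""
    # shlex.split honours shell-style quoting, e.g. "echo 'a b'" -> ["echo", "a b"].
    exit_code = subprocess.call(shlex.split(cmd), cwd=run_folder)
    if exit_code:
        # Propagate the subprocess failure to the caller's process.
        sys.exit(exit_code)
```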
--- a/codecov.yml
+++ b/codecov.yml
@@ -19,7 +19,7 @@ coverage:
        if_no_uploads: error
        if_not_found: success
        if_ci_failed: error
-        only_pulls: false
+        only_pulls: true
        flags: null
        paths: null
    patch:
--- a/docs/config.qmd
+++ b/docs/config.qmd
@@ -32,6 +32,8 @@ tokenizer_legacy:
 resize_token_embeddings_to_32x:
 # Optional[bool] Whether to shrink the embeddings to len(tokenizer). By default, we won't shrink.
 shrink_embeddings:
+# Optional[bool] Don't upcast the embeddings to float32 when using PEFT. Useful for low-VRAM GPUs
+embeddings_skip_upcast:
 # Whether to load the model with randomly initialized weights. Useful for
 # pre-training a model from scratch or debugging purposes.
 random_init_weights:
@@ -73,11 +75,12 @@ load_in_8bit: true
 load_in_4bit:

 # Use CUDA bf16
-bf16: true # bool or 'full' for `bf16_full_eval`. require >=ampere
+bf16: true # bool or 'full' for `bf16_full_eval`, or 'auto' for automatic detection. require >=ampere
 # Use CUDA fp16
 fp16: true
 # Use CUDA tf32
 tf32: true # require >=ampere
+# Note: if bf16 is set to 'auto', and fp16 is set to true, we will prefer the explicit fp16 setting

 # No AMP (automatic mixed precision)
 bfloat16: true # require >=ampere
@@ -184,6 +187,10 @@ datasets:
    # adding a system turn with empty content.
    drop_system_message:

+    # Optional[bool]. (for Qwen3 template only) Whether to split the assistant content based on a reasoning trace inside delimited tags
+    # See example at `docs/dataset-formats/conversation.qmd`
+    split_thinking:
+
    # IMPORTANT: The following fields determine which parts of the conversation to train on.
    # Priority order: message_field_training > message_field_training_detail > train_on_inputs or role in roles_to_train
    # See examples at `docs/dataset-formats/conversation.qmd`
@@ -498,6 +505,7 @@ save_strategy: # Set to `"no"` to skip checkpoint saves, `"epoch"` at end of eac
 save_steps: # Leave empty to save at each epoch, integer for every N steps. float for fraction of total steps
 saves_per_epoch: # number of times per epoch to save a checkpoint, mutually exclusive with save_steps
 save_total_limit: # Checkpoints saved at a time
+save_only_model: # Save only the model weights, skipping the optimizer. Using this means you can't resume from checkpoints.
 # Maximum number of iterations to train for. It precedes num_epochs which means that
 # if both are set, num_epochs will not be guaranteed.
 # e.g., when 1 epoch is 1000 steps => `num_epochs: 2` and `max_steps: 100` will train for 100 steps
@@ -531,7 +539,7 @@ train_on_inputs: false
 # Note that training loss may have an oscillating pattern with this enabled.
 group_by_length: false

-# Whether to use gradient checkpointing. Available options are: true, false, "offload".
+# Whether to use gradient checkpointing. Available options are: true, false, "offload", "offload_disk".
 # https://huggingface.co/docs/transformers/v4.18.0/en/performance#gradient-checkpointing
 gradient_checkpointing: false
 # additional kwargs to pass to the trainer for gradient checkpointing
@@ -543,7 +551,7 @@ gradient_checkpointing: false
 early_stopping_patience: 3

 # Specify a scheduler and kwargs to use with the optimizer
-lr_scheduler: # 'one_cycle' | 'rex' | 'log_sweep' | empty for cosine
+lr_scheduler: # 'one_cycle' | 'rex' | 'log_sweep' | 'linear' | 'cosine_with_restarts' | 'polynomial' | 'constant' | 'constant_with_warmup' | 'inverse_sqrt' | 'reduce_lr_on_plateau' | 'cosine_with_min_lr' | 'warmup_stable_decay' | empty for cosine
 lr_scheduler_kwargs:
 cosine_min_lr_ratio: # decay lr to some percentage of the peak lr, e.g. cosine_min_lr_ratio=0.1 for 10% of peak lr
 cosine_constant_lr_ratio: # freeze lr at some percentage of the step, e.g. cosine_constant_lr_ratio=0.8 means start cosine_min_lr at 80% of training step (https://arxiv.org/pdf/2308.04014.pdf)
@@ -605,6 +613,7 @@ lr_div_factor: # Learning rate div factor
 # - optimi_adamw
 # - ao_adamw_8bit
 # - ao_adamw_fp8
+# - came_pytorch
 optimizer:
 # Dictionary of arguments to pass to the optimizer
 optim_args:
--- a/docs/custom_integrations.qmd
+++ b/docs/custom_integrations.qmd
@@ -49,7 +49,8 @@ sections = [
    ("Knowledge Distillation (KD)", "kd"),
    ("Liger Kernels", "liger"),
    ("Language Model Evaluation Harness (LM Eval)", "lm_eval"),
-    ("Spectrum", "spectrum")
+    ("Spectrum", "spectrum"),
+    ("LLMCompressor", "llm_compressor")
 ]

 for section_name, folder_name in sections:
--- a/docs/dataset-formats/conversation.qmd
+++ b/docs/dataset-formats/conversation.qmd
@@ -196,6 +196,34 @@ datasets:
 It is not necessary to set both `message_field_training` and `message_field_training_detail` at once.
 :::

+8. (For Qwen3 template only) Enable reasoning split, where the reasoning is split from the content and passed as a separate field into the template.
+
+```yaml
+datasets:
+  - path: ...
+    type: chat_template
+    chat_template: qwen3
+    split_thinking: true
+```
+
+For example, a message's content may look like:
+
+```json
+{
+  "content": "<think>Some thinking outputs</think>Output after thinking."
+}
+```
+
+After the split, it will look like:
+
+```json
+{
+  "reasoning_content": "Some thinking outputs",
+  "content": "Output after thinking."
+}
+```
+
+
 ## sharegpt

 ::: {.callout-important}
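Review note: the `split_thinking` behaviour documented above can be pictured with a minimal sketch. This is not the axolotl implementation; the helper name `split_thinking_message` and the regex are assumptions made for illustration, and only the `<think>` tag and the `reasoning_content` / `content` field names come from the docs in this diff.

```python
import re

# Hypothetical helper for illustration only; not the code used by axolotl.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking_message(message: dict) -> dict:
    """Move the first <think>...</think> span into `reasoning_content`."""
    content = message.get("content", "")
    match = THINK_RE.search(content)
    if match is None:
        # Nothing to split: return an unchanged copy.
        return dict(message)

    result = dict(message)
    result["reasoning_content"] = match.group(1).strip()
    # Keep everything outside the tags as the visible content.
    result["content"] = (content[: match.start()] + content[match.end():]).strip()
    return result


example = {"content": "<think>Some thinking outputs</think>Output after thinking."}
print(split_thinking_message(example))
```

Messages without a reasoning trace pass through unchanged in this sketch, which is probably the least surprising behaviour for mixed datasets.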
--- a/docs/multimodal.qmd
+++ b/docs/multimodal.qmd
@@ -164,7 +164,7 @@ Here is an example of a multi-modal dataset:
        {
            "role": "user",
            "content": [
-                {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
+                {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
                {"type": "text", "text": "Describe this image in detail."}
            ]
        },
--- a/docs/sequence_parallelism.qmd
+++ b/docs/sequence_parallelism.qmd
@@ -3,8 +3,6 @@ title: Sequence Parallelism
 description: Train with long sequences split across multiple GPUs.
 ---

-# Sequence Parallelism
-
 Sequence parallelism is a technique that splits sequences across multiple GPUs,
 allowing you to train with very long sequences that wouldn't fit on a single GPU. Each
 GPU processes a different portion of the sequence, and the results are aggregated
@@ -27,7 +25,7 @@ To enable sequence parallelism, add the following to your configuration file:
 sequence_parallel_degree: 4  # Split sequences across 4 GPUs
 # Optional; strides across the key dimension. Larger values use more memory but should make training faster.
 heads_k_stride: 1
-# Optional; one of "varlen_llama3", "batch_ring", "batch_zigzag", "batch_stripe". Defaults to
+# Optional; one of "varlen_llama3" or "batch_ring". Defaults to
 # "varlen_llama3" when `sample_packing: true`, and "batch_ring" otherwise.
 ring_attn_func:
 ```
--- a/examples/llama-3/sparse-finetuning.yaml
+++ b/examples/llama-3/sparse-finetuning.yaml
@@ -0,0 +1,77 @@
+base_model: neuralmagic/Sparse-Llama-3.1-8B-2of4
+
+plugins:
+  - axolotl.integrations.llm_compressor.LLMCompressorPlugin
+
+load_in_8bit: false
+load_in_4bit: false
+strict: false
+
+datasets:
+  - path: tatsu-lab/alpaca
+    type: alpaca
+dataset_prepared_path: last_run_prepared
+val_set_size: 0.05
+output_dir: ./outputs/out
+
+sequence_len: 4096
+sample_packing: true
+pad_to_sequence_len: true
+eval_sample_packing: false
+
+wandb_project:
+wandb_entity:
+wandb_watch:
+wandb_name:
+wandb_log_model:
+
+gradient_accumulation_steps: 8
+micro_batch_size: 1
+num_epochs: 1
+optimizer: paged_adamw_8bit
+lr_scheduler: cosine
+learning_rate: 2e-5
+
+train_on_inputs: false
+group_by_length: false
+bf16: auto
+fp16:
+tf32: false
+
+gradient_checkpointing: true
+gradient_checkpointing_kwargs:
+  use_reentrant: false
+early_stopping_patience:
+resume_from_checkpoint:
+logging_steps: 1
+xformers_attention:
+flash_attention: true
+
+warmup_steps: 100
+evals_per_epoch: 2
+eval_table_size:
+saves_per_epoch: 1
+debug:
+deepspeed:
+weight_decay: 0.0
+fsdp:
+fsdp_config:
+special_tokens:
+  pad_token: <|end_of_text|>
+
+llmcompressor:
+  recipe:
+    finetuning_stage:
+      finetuning_modifiers:
+        ConstantPruningModifier:
+          targets: [
+            're:.*q_proj.weight',
+            're:.*k_proj.weight',
+            're:.*v_proj.weight',
+            're:.*o_proj.weight',
+            're:.*gate_proj.weight',
+            're:.*up_proj.weight',
+            're:.*down_proj.weight',
+          ]
+          start: 0
+  save_compressed: true
--- a/examples/llama-4/README.md
+++ b/examples/llama-4/README.md
@@ -34,3 +34,5 @@ We provide a script to delinearize Llama 4 linearized models into regular Huggin
 ```bash
 axolotl delinearize-llama4 --model path/to/model_dir --output path/to/output_dir
 ```
+
+Note: This only works with the non-quantized linearized model. If you have an adapter, merge it with the *non-quantized linearized* model before delinearizing.
--- a/examples/orpheus/README.md
+++ b/examples/orpheus/README.md
@@ -0,0 +1,341 @@
+# Finetuning LLMs to output audio
+
+In this example, we finetune canopylabs/orpheus-3b-0.1-pretrained (a Llama 3.2 3B model) to output audio.
+
+The `finetune.yml` with the current settings will run on any Nvidia GPU with 45GB VRAM or more. If you reduce the batch size, it can also run on GPUs with under 24GB.
+
+## Dataset pre-processing for pre-training
+If you are adding another voice in English, please jump ahead to finetuning pre-processing.
+
+For this to work, we need to preprocess our dataset. Since we are expecting to output audio, we will need to add tokens to the tokenizer.
+
+The following code downloads the SNAC model, adds the correct tokens, and uploads the final dataset.
+
+```python
+import torch
+from snac import SNAC
+from datasets import load_dataset
+from huggingface_hub import snapshot_download
+import random
+import torchaudio.transforms as T
+from transformers import AutoTokenizer
+import os
+
+my_original_dataset_name = "<huggingface-id-of-dataset-that-we-want-to-preprocess>"
+name_to_push_dataset_to = "<huggingface-id-of-where-to-save-dataset>"
+
+dsn = my_original_dataset_name
+
+snapshot_download(
+    repo_id=dsn,
+    repo_type="dataset",
+    revision="main",
+    max_workers=64,
+)
+
+
+ds = load_dataset(dsn, split="train")
+ds_sample_rate = ds[0]["audio"]["sampling_rate"]
+
+model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
+# Use one device for both the model and its inputs
+if torch.cuda.is_available():
+    device = "cuda"
+elif torch.backends.mps.is_available():
+    device = "mps"
+else:
+    device = "cpu"
+model = model.to(device)
+
+def tokenise_audio(waveform):
+  waveform = torch.from_numpy(waveform).unsqueeze(0)
+  waveform = waveform.to(dtype=torch.float32)
+  resample_transform = T.Resample(orig_freq=ds_sample_rate, new_freq=24000)
+  waveform = resample_transform(waveform)
+
+  waveform = waveform.unsqueeze(0).to(device)
+
+  #generate the codes from snac
+  with torch.inference_mode():
+    codes = model.encode(waveform)
+
+  all_codes = []
+  for i in range(codes[0].shape[1]):
+    all_codes.append(codes[0][0][i].item()+128266)
+    all_codes.append(codes[1][0][2*i].item()+128266+4096)
+    all_codes.append(codes[2][0][4*i].item()+128266+(2*4096))
+    all_codes.append(codes[2][0][(4*i)+1].item()+128266+(3*4096))
+    all_codes.append(codes[1][0][(2*i)+1].item()+128266+(4*4096))
+    all_codes.append(codes[2][0][(4*i)+2].item()+128266+(5*4096))
+    all_codes.append(codes[2][0][(4*i)+3].item()+128266+(6*4096))
+
+
+  return all_codes
+
+def add_codes(example):
+    # Always initialize codes_list to None
+    codes_list = None
+
+    try:
+        answer_audio = example.get("audio")
+        # If there's a valid audio array, tokenise it
+        if answer_audio and "array" in answer_audio:
+            audio_array = answer_audio["array"]
+            codes_list = tokenise_audio(audio_array)
+    except Exception as e:
+        print(f"Skipping row due to error: {e}")
+        # Keep codes_list as None if we fail
+    example["codes_list"] = codes_list
+
+    return example
+
+ds = ds.map(add_codes, remove_columns=["audio"])
+
+#@title Load Tokenizer
+tokeniser_length = 128256
+start_of_text = 128000
+end_of_text = 128009
+
+start_of_speech = tokeniser_length + 1
+end_of_speech = tokeniser_length + 2
+
+start_of_human = tokeniser_length + 3
+end_of_human = tokeniser_length + 4
+
+start_of_ai = tokeniser_length + 5
+end_of_ai =  tokeniser_length + 6
+pad_token = tokeniser_length + 7
+
+audio_tokens_start = tokeniser_length + 10
+
+tokenizer_name = "canopylabs/orpheus-3b-0.1-pretrained"
+
+
+tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
+num_proc = max(1, (os.cpu_count() or 1) - 2)  # never drop below one worker
+
+ds = ds.filter(lambda x: x["codes_list"] is not None)
+ds = ds.filter(lambda x: len(x["codes_list"]) > 0)
+
+#@title Create Input Ids
+def remove_duplicate_frames(example):
+    vals = example["codes_list"]
+    if len(vals) % 7 != 0:
+        raise ValueError("Input list length must be divisible by 7")
+
+    result = vals[:7]
+
+    removed_frames = 0
+
+    for i in range(7, len(vals), 7):
+        current_first = vals[i]
+        previous_first = result[-7]
+
+        if current_first != previous_first:
+            result.extend(vals[i:i+7])
+        else:
+            removed_frames += 1
+
+    example["codes_list"] = result
+
+    return example
+
+ds = ds.map(remove_duplicate_frames, num_proc=num_proc)
+
+
+def create_input_ids(example):
+    text_ids = tokenizer.encode(example["text"], add_special_tokens=True)
+    text_ids.append(end_of_text)
+    example["text_tokens"] = text_ids
+    input_ids = (
+        [start_of_human]
+        + example["text_tokens"]
+        + [end_of_human]
+        + [start_of_ai]
+        + [start_of_speech]
+        + example["codes_list"]
+        + [end_of_speech]
+        + [end_of_ai]
+    )
+    example["input_ids"] = input_ids
+    example["labels"] = input_ids
+    example["attention_mask"] = [1] * len(input_ids)
+
+    return example
+
+ds = ds.map(create_input_ids, num_proc=num_proc, remove_columns=["text", "codes_list"])
+
+#@title Remove unnecessary columns
+columns_to_keep = ["input_ids", "labels", "attention_mask"]
+columns_to_remove = [col for col in ds.column_names if col not in columns_to_keep]
+
+ds = ds.remove_columns(columns_to_remove)
+
+ds.push_to_hub(name_to_push_dataset_to)
+```
+
+
+## Finetune pre-processing
+Use this code to add a new voice.
+
+```python
+import torch
+from snac import SNAC
+from datasets import load_dataset
+from huggingface_hub import snapshot_download
+import random
+import torchaudio.transforms as T
+from transformers import AutoTokenizer
+import os
+
+my_original_dataset_name = "<huggingface-id-of-dataset-that-we-want-to-preprocess>"
+name_to_push_dataset_to = "<huggingface-id-of-where-to-save-dataset>"
+
+dsn = my_original_dataset_name
+
+snapshot_download(
+    repo_id=dsn,
+    repo_type="dataset",
+    revision="main",
+    max_workers=64,
+)
+
+
+ds = load_dataset(dsn, split="train")
+ds_sample_rate = ds[0]["audio"]["sampling_rate"]
+
+model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
+# Use one device for both the model and its inputs
+if torch.cuda.is_available():
+    device = "cuda"
+elif torch.backends.mps.is_available():
+    device = "mps"
+else:
+    device = "cpu"
+model = model.to(device)
+
+def tokenise_audio(waveform):
+  waveform = torch.from_numpy(waveform).unsqueeze(0)
+  waveform = waveform.to(dtype=torch.float32)
+  resample_transform = T.Resample(orig_freq=ds_sample_rate, new_freq=24000)
+  waveform = resample_transform(waveform)
+
+  waveform = waveform.unsqueeze(0).to(device)
+
+  #generate the codes from snac
+  with torch.inference_mode():
+    codes = model.encode(waveform)
+
+  all_codes = []
+  for i in range(codes[0].shape[1]):
+    all_codes.append(codes[0][0][i].item()+128266)
+    all_codes.append(codes[1][0][2*i].item()+128266+4096)
+    all_codes.append(codes[2][0][4*i].item()+128266+(2*4096))
+    all_codes.append(codes[2][0][(4*i)+1].item()+128266+(3*4096))
+    all_codes.append(codes[1][0][(2*i)+1].item()+128266+(4*4096))
+    all_codes.append(codes[2][0][(4*i)+2].item()+128266+(5*4096))
+    all_codes.append(codes[2][0][(4*i)+3].item()+128266+(6*4096))
+
+
+  return all_codes
+
+def add_codes(example):
+    # Always initialize codes_list to None
+    codes_list = None
+
+    try:
+        answer_audio = example.get("audio")
+        # If there's a valid audio array, tokenise it
+        if answer_audio and "array" in answer_audio:
+            audio_array = answer_audio["array"]
+            codes_list = tokenise_audio(audio_array)
+    except Exception as e:
+        print(f"Skipping row due to error: {e}")
+        # Keep codes_list as None if we fail
+    example["codes_list"] = codes_list
+
+    return example
+
+ds = ds.map(add_codes, remove_columns=["audio"])
+
+#@title Load Tokenizer
+tokeniser_length = 128256
+start_of_text = 128000
+end_of_text = 128009
+
+start_of_speech = tokeniser_length + 1
+end_of_speech = tokeniser_length + 2
+
+start_of_human = tokeniser_length + 3
+end_of_human = tokeniser_length + 4
+
+start_of_ai = tokeniser_length + 5
+end_of_ai =  tokeniser_length + 6
+pad_token = tokeniser_length + 7
+
+audio_tokens_start = tokeniser_length + 10
+
+tokenizer_name = "canopylabs/orpheus-3b-0.1-pretrained"
+
+
+tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
+num_proc = max(1, (os.cpu_count() or 1) - 2)  # never drop below one worker
+
+ds = ds.filter(lambda x: x["codes_list"] is not None)
+ds = ds.filter(lambda x: len(x["codes_list"]) > 0)
+
+#@title Create Input Ids
+def remove_duplicate_frames(example):
+    vals = example["codes_list"]
+    if len(vals) % 7 != 0:
+        raise ValueError("Input list length must be divisible by 7")
+
+    result = vals[:7]
+
+    removed_frames = 0
+
+    for i in range(7, len(vals), 7):
+        current_first = vals[i]
+        previous_first = result[-7]
+
+        if current_first != previous_first:
+            result.extend(vals[i:i+7])
+        else:
+            removed_frames += 1
+
+    example["codes_list"] = result
+
+    return example
+
+ds = ds.map(remove_duplicate_frames, num_proc=num_proc)
+
+tok_info = '''*** HERE you can modify the text prompt
+i.e. if you wanted a multispeaker model like canopylabs/orpheus-3b-0.1-ft, you can pass:
+f"{example["source"]}:  {example["text"]}", as is passed.
+'''
+print(tok_info)
+
+def create_input_ids(example):
+    text_ids = tokenizer.encode(f"{example['speaker_id']}: {example['text']}",  add_special_tokens=True)
+    text_ids.append(end_of_text)
+    example["text_tokens"] = text_ids
+    input_ids = (
+        [start_of_human]
+        + example["text_tokens"]
+        + [end_of_human]
+        + [start_of_ai]
+        + [start_of_speech]
+        + example["codes_list"]
+        + [end_of_speech]
+        + [end_of_ai]
+    )
+    example["input_ids"] = input_ids
+    example["labels"] = input_ids
+    example["attention_mask"] = [1] * len(input_ids)
+
+    return example
+
+ds = ds.map(create_input_ids, num_proc=num_proc, remove_columns=["text", "codes_list"])
+
+#@title Remove unnecessary columns
+columns_to_keep = ["input_ids", "labels", "attention_mask"]
+columns_to_remove = [col for col in ds.column_names if col not in columns_to_keep]
+
+ds = ds.remove_columns(columns_to_remove)
+
+ds.push_to_hub(name_to_push_dataset_to)
+```
+
+## Training
+After preprocessing is done, fill in the blanks in `finetune.yml` and run `axolotl train finetune.yml`.
+
+## Inference
+For inference, please refer to the original [orpheus github](https://github.com/canopyai/Orpheus-TTS/tree/main).
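Review note: the token layout built in `tokenise_audio` above (seven tokens per frame, each slot shifted by a further 4096 from a base of 128266) can be checked without a GPU. The sketch below reproduces that interleaving on plain Python lists and adds an inverse. The `deinterleave` function and the constant names are assumptions introduced here for illustration; the README itself only shows the forward direction.

```python
AUDIO_TOKENS_START = 128256 + 10  # 128266, the base offset used in the README
SLOT_STRIDE = 4096                # each of the 7 slots is shifted by a further 4096


def interleave(layer0, layer1, layer2):
    """Flatten three code layers into 7 tokens per frame, as in tokenise_audio."""
    tokens = []
    for i in range(len(layer0)):
        picks = [
            layer0[i],
            layer1[2 * i],
            layer2[4 * i],
            layer2[4 * i + 1],
            layer1[2 * i + 1],
            layer2[4 * i + 2],
            layer2[4 * i + 3],
        ]
        for slot, code in enumerate(picks):
            tokens.append(code + AUDIO_TOKENS_START + slot * SLOT_STRIDE)
    return tokens


def deinterleave(tokens):
    """Hypothetical inverse: recover the three layers from the flat token list."""
    if len(tokens) % 7 != 0:
        raise ValueError("Token list length must be divisible by 7")
    layer0, layer1, layer2 = [], [], []
    for start in range(0, len(tokens), 7):
        frame = tokens[start:start + 7]
        codes = [tok - AUDIO_TOKENS_START - slot * SLOT_STRIDE
                 for slot, tok in enumerate(frame)]
        layer0.append(codes[0])
        layer1.extend([codes[1], codes[4]])
        layer2.extend([codes[2], codes[3], codes[5], codes[6]])
    return layer0, layer1, layer2


l0, l1, l2 = [1, 2], [10, 11, 12, 13], [20, 21, 22, 23, 24, 25, 26, 27]
flat = interleave(l0, l1, l2)
print(len(flat), deinterleave(flat) == (l0, l1, l2))
```

A round trip like this is a cheap sanity check to run on a few rows before pushing a large dataset to the Hub.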
--- a/examples/orpheus/finetune.yml
+++ b/examples/orpheus/finetune.yml
@@ -0,0 +1,52 @@
+base_model: canopylabs/orpheus-3b-0.1-pretrained
+
+hub_model_id: <your-hub-model-id>
+
+plugins:
+  - axolotl.integrations.liger.LigerPlugin
+liger_rope: true
+liger_rms_norm: true
+liger_glu_activation: true
+liger_fused_linear_cross_entropy: true
+
+datasets:
+  - path: <your-hf-dataset-id>
+    type:  # leave empty to load pre-tokenized
+dataset_prepared_path: last_run_prepared
+val_set_size: 0.01
+output_dir: ./outputs/out
+
+sequence_len: 8192
+sample_packing: true
+pad_to_sequence_len: true
+
+wandb_project:
+wandb_entity:
+wandb_watch:
+wandb_name:
+wandb_log_model:
+
+gradient_accumulation_steps: 8
+micro_batch_size: 4
+num_epochs: 3
+optimizer: adamw_torch_fused
+lr_scheduler: cosine
+learning_rate: 2e-5
+
+bf16: auto
+tf32: false
+
+gradient_checkpointing: true
+gradient_checkpointing_kwargs:
+  use_reentrant: false
+resume_from_checkpoint:
+logging_steps: 1
+flash_attention: true
+
+warmup_steps: 20
+evals_per_epoch: 5
+saves_per_epoch: 5
+weight_decay: 0.05
+
+special_tokens:
+  pad_token: <custom_token_7>
--- a/requirements.txt
+++ b/requirements.txt
@@ -6,19 +6,20 @@ triton>=3.0.0
 mamba-ssm==1.2.0.post1
 xformers>=0.0.23.post1
 autoawq==0.2.7.post3
-liger-kernel==0.5.8
+liger-kernel==0.5.9
 # END section

 packaging==23.2

+huggingface_hub==0.31.0
 peft==0.15.2
 transformers==4.51.3
 tokenizers>=0.21.1
 accelerate==1.6.0
-datasets==3.5.0
+datasets==3.5.1
 deepspeed>=0.15.4
 trl==0.17.0
-hf_xet==1.0.0
+hf_xet==1.1.0
 hqq==0.2.5

 optimum==1.16.2
--- a/setup.py
+++ b/setup.py
@@ -67,13 +67,13 @@ def parse_requirements(extras_require_map):
            if (major, minor) >= (2, 7):
                _install_requires.pop(_install_requires.index(xformers_version))
                # _install_requires.append("xformers==0.0.29.post3")  # xformers seems to be hard pinned to 2.6.0
-                extras_require_map["vllm"] = ["vllm==0.8.4"]
+                extras_require_map["vllm"] = ["vllm==0.8.5.post1"]
            elif (major, minor) >= (2, 6):
                _install_requires.pop(_install_requires.index(xformers_version))
                _install_requires.append(
                    "xformers==0.0.29.post2"
                )  # vllm needs post2 w torch 2.6
-                extras_require_map["vllm"] = ["vllm==0.8.4"]
+                extras_require_map["vllm"] = ["vllm==0.8.5.post1"]
            elif (major, minor) >= (2, 5):
                _install_requires.pop(_install_requires.index(xformers_version))
                if patch == 0:
@@ -142,6 +142,7 @@ extras_require = {
        "apollo-torch",
        "lomo-optim==0.1.1",
        "torch-optimi==0.2.1",
+        "came_pytorch==0.1.3",
    ],
    "ray": [
        "ray[train]",
@@ -149,6 +150,9 @@ extras_require = {
    "vllm": [
        "vllm==0.7.2",
    ],
+    "llmcompressor": [
+        "llmcompressor==0.5.1",
+    ],
 }

 install_requires, dependency_links, extras_require_build = parse_requirements(
--- a/src/axolotl/__init__.py
+++ b/src/axolotl/__init__.py
@@ -4,4 +4,4 @@ import pkgutil

 __path__ = pkgutil.extend_path(__path__, __name__)  # Make this a namespace package

-__version__ = "0.9.0"
+__version__ = "0.10.0.dev0"
--- a/src/axolotl/cli/__init__.py
+++ b/src/axolotl/cli/__init__.py
@@ -2,4 +2,7 @@

 import os

+from axolotl.logging_config import configure_logging
+
 os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
+configure_logging()
--- a/src/axolotl/cli/args.py
+++ b/src/axolotl/cli/args.py
@@ -82,6 +82,12 @@ class VllmServeCliArgs:
            "hardware support this feature."
        },
    )
+    serve_module: Optional[str] = field(
+        default=None,
+        metadata={
+            "help": "Module to serve. If not set, the default module will be used."
+        },
+    )


 @dataclass
--- a/src/axolotl/cli/art.py
+++ b/src/axolotl/cli/art.py
@@ -16,8 +16,15 @@ AXOLOTL_LOGO = """
    @@@@  @@@@@@@@@@@@@@@@
 """

+HAS_PRINTED_LOGO = False
+

 def print_axolotl_text_art():
    """Prints axolotl ASCII art."""
+
+    global HAS_PRINTED_LOGO  # pylint: disable=global-statement
+    if HAS_PRINTED_LOGO:
+        return
    if is_main_process():
+        HAS_PRINTED_LOGO = True
        print(AXOLOTL_LOGO)
--- a/src/axolotl/cli/checks.py
+++ b/src/axolotl/cli/checks.py
@@ -8,9 +8,6 @@ from accelerate.commands.config import config_args
 from huggingface_hub import HfApi
 from huggingface_hub.utils import LocalTokenNotFoundError

-from axolotl.logging_config import configure_logging
-
-configure_logging()
 LOG = logging.getLogger(__name__)


--- a/src/axolotl/cli/config.py
+++ b/src/axolotl/cli/config.py
@@ -5,6 +5,7 @@ import logging
 import os
 import tempfile
 from pathlib import Path
+from tempfile import NamedTemporaryFile
 from typing import Union
 from urllib.parse import urlparse

@@ -152,7 +153,15 @@ def prepare_plugins(cfg: DictDefault):
            plugin_manager.register(plugin_name)


-def load_cfg(config: Union[str, Path] = Path("examples/"), **kwargs) -> DictDefault:
+def plugin_set_cfg(cfg: DictDefault):
+    if cfg.get("plugins"):
+        plugin_manager = PluginManager.get_instance()
+        plugin_manager.cfg = cfg
+
+
+def load_cfg(
+    config: str | Path | DictDefault = Path("examples/"), **kwargs
+) -> DictDefault:
    """
    Loads the `axolotl` configuration stored at `config`, validates it, and performs
    various setup.
@@ -164,13 +173,24 @@ def load_cfg(config: Union[str, Path] = Path("examples/"), **kwargs) -> DictDefa
    Returns:
        `DictDefault` mapping configuration keys to values.
    """
-    config = check_remote_config(config)
-    if Path(config).is_dir():
-        config = choose_config(Path(config))
+    if isinstance(config, (str, Path)):
+        config = check_remote_config(config)
+        if Path(config).is_dir():
+            config = choose_config(Path(config))

-    # Load the config from the yaml file
-    with open(config, encoding="utf-8") as file:
-        cfg: DictDefault = DictDefault(yaml.safe_load(file))
+        # Load the config from the yaml file
+        with open(config, encoding="utf-8") as file:
+            cfg: DictDefault = DictDefault(yaml.safe_load(file))
+
+        cfg.axolotl_config_path = config
+    else:
+        cfg = config
+        with NamedTemporaryFile(
+            mode="w", delete=False, suffix=".yml", prefix="axolotl_config_"
+        ) as temp_file:
+            temp_file.write(yaml.dump(config.to_dict()))
+            temp_file.close()
+        cfg.axolotl_config_path = temp_file.name

    # If there are any options passed in the cli, if it is something that seems valid
    # from the yaml, then overwrite the value
@@ -184,8 +204,6 @@ def load_cfg(config: Union[str, Path] = Path("examples/"), **kwargs) -> DictDefa
            else:
                cfg[k] = kwargs[k]

-    cfg.axolotl_config_path = config
-
    try:
        device_props = torch.cuda.get_device_properties("cuda")
        gpu_version = "sm_" + str(device_props.major) + str(device_props.minor)
@@ -213,5 +231,6 @@ def load_cfg(config: Union[str, Path] = Path("examples/"), **kwargs) -> DictDefa
    setup_wandb_env_vars(cfg)
    setup_mlflow_env_vars(cfg)
    setup_comet_env_vars(cfg)
+    plugin_set_cfg(cfg)

    return cfg
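Review note: the new `load_cfg` branch accepts either a path or an in-memory config and, in the second case, writes the config to a temporary file so that a config path always exists. A minimal standalone sketch of that pattern follows. It is not the axolotl code: `resolve_config` is a hypothetical name, and `json` stands in for `yaml` only so the sketch needs nothing beyond the standard library.

```python
import json
import os
from pathlib import Path
from tempfile import NamedTemporaryFile


def resolve_config(config):
    """Return (cfg, path) for either a config file path or an in-memory mapping."""
    if isinstance(config, (str, Path)):
        # Path-like input: read the file and remember where it came from.
        with open(config, encoding="utf-8") as file:
            cfg = json.load(file)
        return cfg, str(config)

    # Mapping input: persist it so downstream code can still refer to a file.
    cfg = dict(config)
    with NamedTemporaryFile(
        mode="w", delete=False, suffix=".json", prefix="axolotl_config_",
        encoding="utf-8",
    ) as temp_file:
        json.dump(cfg, temp_file)
    # The with-block has closed the file; delete=False keeps it on disk.
    return cfg, temp_file.name


cfg, path = resolve_config({"base_model": "my-model", "micro_batch_size": 1})
reloaded, same_path = resolve_config(path)
print(reloaded == cfg, same_path == path)
os.remove(path)  # delete=False means the caller is responsible for cleanup
```

Because `delete=False` leaves the file behind, whoever owns the returned path should remove it once it is no longer needed.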
--- a/src/axolotl/cli/evaluate.py
+++ b/src/axolotl/cli/evaluate.py
@@ -1,6 +1,7 @@
 """CLI to run evaluation on a model."""

 import logging
+import os
 from pathlib import Path
 from typing import Union

@@ -14,6 +15,7 @@ from axolotl.cli.checks import check_accelerate_default_config, check_user_token
 from axolotl.cli.config import load_cfg
 from axolotl.common.datasets import load_datasets, load_preference_datasets
 from axolotl.evaluate import evaluate
+from axolotl.utils import patch_optimized_env
 from axolotl.utils.dict import DictDefault

 LOG = logging.getLogger(__name__)
@@ -29,10 +31,14 @@ def do_evaluate(cfg: DictDefault, cli_args: TrainerCliArgs) -> None:
        cfg: Dictionary mapping `axolotl` config keys to values.
        cli_args: CLI arguments.
    """
+    # Enable expandable segments for cuda allocation to improve VRAM usage
+    patch_optimized_env()
+
    # pylint: disable=duplicate-code
    print_axolotl_text_art()
    check_accelerate_default_config()
-    check_user_token()
+    if int(os.getenv("LOCAL_RANK", "0")) == 0:
+        check_user_token()

    if cfg.rl:
        dataset_meta = load_preference_datasets(cfg=cfg, cli_args=cli_args)
--- a/src/axolotl/cli/main.py
+++ b/src/axolotl/cli/main.py
@@ -28,9 +28,8 @@ from axolotl.cli.utils import (
    fetch_from_github,
    filter_none_kwargs,
 )
-from axolotl.cli.vllm_serve import do_vllm_serve
 from axolotl.integrations.lm_eval.cli import lm_eval
-from axolotl.utils import set_pytorch_cuda_alloc_conf
+from axolotl.utils import patch_optimized_env
 from axolotl.utils.schemas.config import AxolotlInputConfig


@@ -56,6 +55,8 @@ def preprocess(config: str, cloud: Optional[str] = None, **kwargs) -> None:
        kwargs: Additional keyword arguments which correspond to CLI args or `axolotl`
            config options.
    """
+    patch_optimized_env()
+
    if cloud:
        from axolotl.cli.cloud import do_cli_preprocess

@@ -101,7 +102,7 @@ def train(
            config options.
    """
    # Enable expandable segments for cuda allocation to improve VRAM usage
-    set_pytorch_cuda_alloc_conf()
+    patch_optimized_env()

    if "use_ray" in kwargs and kwargs["use_ray"]:
        accelerate = False
@@ -327,6 +328,8 @@ def fetch(directory: str, dest: Optional[str]) -> None:
 @add_options_from_dataclass(VllmServeCliArgs)
 @filter_none_kwargs
 def vllm_serve(config: str, **cli_args: VllmServeCliArgs):
+    from axolotl.cli.vllm_serve import do_vllm_serve
+
    do_vllm_serve(config, cli_args)


--- a/src/axolotl/cli/preprocess.py
+++ b/src/axolotl/cli/preprocess.py
@@ -18,6 +18,7 @@ from axolotl.cli.checks import check_accelerate_default_config, check_user_token
 from axolotl.cli.config import load_cfg
 from axolotl.common.const import DEFAULT_DATASET_PREPARED_PATH
 from axolotl.common.datasets import load_datasets, load_preference_datasets
+from axolotl.integrations.base import PluginManager
 from axolotl.utils.dict import DictDefault
 from axolotl.utils.trainer import disable_datasets_caching

@@ -47,7 +48,10 @@ def do_preprocess(cfg: DictDefault, cli_args: PreprocessCliArgs) -> None:
        cfg.dataset_prepared_path = DEFAULT_DATASET_PREPARED_PATH

    with disable_datasets_caching():
-        if cfg.rl:
+        plugin_manager = PluginManager.get_instance()
+        if plugin_manager.load_datasets(cfg, preprocess=True):
+            pass
+        elif cfg.rl:
            load_preference_datasets(cfg=cfg, cli_args=cli_args)
        else:
            load_datasets(cfg=cfg, cli_args=cli_args)
--- a/src/axolotl/cli/train.py
+++ b/src/axolotl/cli/train.py
@@ -18,7 +18,7 @@ from axolotl.cli.config import load_cfg
 from axolotl.common.datasets import load_datasets, load_preference_datasets
 from axolotl.integrations.base import PluginManager
 from axolotl.train import train
-from axolotl.utils import set_pytorch_cuda_alloc_conf
+from axolotl.utils import patch_optimized_env
 from axolotl.utils.config import normalize_config, resolve_dtype
 from axolotl.utils.dict import DictDefault

@@ -36,17 +36,20 @@ def do_train(cfg: DictDefault, cli_args: TrainerCliArgs):
        cli_args: Training-specific CLI arguments.
    """
    # Enable expandable segments for cuda allocation to improve VRAM usage
-    set_pytorch_cuda_alloc_conf()
+    patch_optimized_env()

    print_axolotl_text_art()
    check_accelerate_default_config()
    if int(os.getenv("LOCAL_RANK", "0")) == 0:
        check_user_token()

-    if cfg.rl:
-        dataset_meta = load_preference_datasets(cfg=cfg, cli_args=cli_args)
-    else:
-        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
+    plugin_manager = PluginManager.get_instance()
+    dataset_meta = plugin_manager.load_datasets(cfg, preprocess=False)
+    if not dataset_meta:
+        if cfg.rl:
+            dataset_meta = load_preference_datasets(cfg=cfg, cli_args=cli_args)
+        else:
+            dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

    model, tokenizer, trainer = train(cfg=cfg, dataset_meta=dataset_meta)
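
Review note: `do_train` now asks the plugin manager for datasets first and only falls back to the built-in loaders when no plugin returns anything. The sketch below illustrates that control flow in isolation. `PluginRegistry`, `StaticPlugin`, and `pick_dataset_meta` are hypothetical stand-ins; the real `PluginManager.load_datasets` is not shown in this part of the diff and may behave differently.

```python
class PluginRegistry:
    """Hypothetical stand-in for a plugin manager; not the axolotl class."""

    def __init__(self):
        self.plugins = []

    def load_datasets(self, cfg, preprocess=False):
        # The first plugin that returns something wins; None means "not handled".
        for plugin in self.plugins:
            meta = plugin.load_datasets(cfg, preprocess)
            if meta is not None:
                return meta
        return None


class StaticPlugin:
    """Toy plugin that only handles configs carrying a `custom_data` key."""

    def load_datasets(self, cfg, preprocess=False):
        if "custom_data" in cfg:
            return {"source": "plugin", "rows": cfg["custom_data"]}
        return None


def pick_dataset_meta(registry, cfg, default_loader):
    """Plugin-first lookup with a fallback to the default loader."""
    meta = registry.load_datasets(cfg, preprocess=False)
    if not meta:
        meta = default_loader(cfg)
    return meta


registry = PluginRegistry()
registry.plugins.append(StaticPlugin())
default = lambda cfg: {"source": "default", "rows": []}

print(pick_dataset_meta(registry, {"custom_data": [1, 2]}, default)["source"])
print(pick_dataset_meta(registry, {}, default)["source"])
```

The fallback uses a truthiness check, as the diff does, so a plugin that returns an empty result is treated the same as one that returns nothing.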

--- a/src/axolotl/cli/utils.py
+++ b/src/axolotl/cli/utils.py
@@ -20,11 +20,9 @@ from transformers import (
    ProcessorMixin,
 )

-from axolotl.logging_config import configure_logging
 from axolotl.utils.dict import DictDefault
 from axolotl.utils.models import load_model, load_processor, load_tokenizer

-configure_logging()
 LOG = logging.getLogger(__name__)


--- a/src/axolotl/cli/vllm_serve.py
+++ b/src/axolotl/cli/vllm_serve.py
@@ -6,7 +6,6 @@ from pathlib import Path
 from typing import Union

 from trl.scripts.vllm_serve import ScriptArguments
-from trl.scripts.vllm_serve import main as vllm_serve_main

 from axolotl.cli.config import load_cfg

@@ -28,6 +27,9 @@ def do_vllm_serve(
    cfg = load_cfg(config)
    model = cfg.base_model

+    serve_module = cli_args.get("serve_module", "trl.scripts.vllm_serve")
+    vllm_serve_main = getattr(__import__(serve_module, fromlist=["main"]), "main")
+
    tensor_parallel_size = (
        cli_args.get("tensor_parallel_size") or cfg.vllm.tensor_parallel_size
    )
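Review note: the `serve_module` option resolves a `main` callable from a dotted module path at runtime. The same lookup can be written with `importlib.import_module`, which returns the submodule directly. In the sketch below, `resolve_main` is a hypothetical helper, and `json.tool` is used as the default only because it is a standard-library module that exposes a `main` function; the diff's actual default is `trl.scripts.vllm_serve`.

```python
import importlib


def resolve_main(module_path=None, default="json.tool"):
    """Import a module by dotted path and return its `main` callable."""
    # `or` also covers an explicit None, not just a missing value.
    module = importlib.import_module(module_path or default)
    main = getattr(module, "main", None)
    if not callable(main):
        raise AttributeError(f"{module.__name__} does not define a callable `main`")
    return main


main_fn = resolve_main()
print(main_fn.__module__)
```

Validating that `main` exists and is callable turns a typo in the module path or a module without an entry point into an immediate, readable error.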
--- a/src/axolotl/common/architectures.py
+++ b/src/axolotl/common/architectures.py
@@ -11,5 +11,6 @@ MOE_ARCH_BLOCK = {
    ],
    "mixtral": "MixtralSparseMoeBlock",
    "qwen2_moe": "Qwen2MoeSparseMoeBlock",
+    "qwen3_moe": "Qwen3MoeSparseMoeBlock",
    "deepseek_v2": "DeepseekV2MoE",
 }
--- a/src/axolotl/common/datasets.py
+++ b/src/axolotl/common/datasets.py
@@ -14,6 +14,7 @@ from axolotl.utils.data import prepare_dataset
 from axolotl.utils.data.rl import load_prepare_preference_datasets
 from axolotl.utils.dict import DictDefault
 from axolotl.utils.models import load_processor, load_tokenizer
+from axolotl.utils.schemas.enums import RLType
 from axolotl.utils.tokenization import check_dataset_labels

 LOG = logging.getLogger(__name__)
@@ -47,7 +48,8 @@ def sample_dataset(dataset: Dataset, num_samples: int) -> Dataset:
 def load_datasets(
    *,
    cfg: DictDefault,
-    cli_args: Union[PreprocessCliArgs, TrainerCliArgs],
+    cli_args: PreprocessCliArgs | TrainerCliArgs | None = None,
+    debug: bool = False,
 ) -> TrainDatasetMeta:
    """
    Loads one or more training or evaluation datasets, calling
@@ -56,6 +58,7 @@ def load_datasets(
    Args:
        cfg: Dictionary mapping `axolotl` config keys to values.
        cli_args: Command-specific CLI arguments.
+        debug: Whether to print the tokenization of sampled examples.

    Returns:
        Dataclass with fields for training and evaluation datasets and the computed
@@ -64,7 +67,8 @@ def load_datasets(
    tokenizer = load_tokenizer(cfg)
    processor = load_processor(cfg, tokenizer=tokenizer) if cfg.processor_type else None
    preprocess_iterable = (
-        hasattr(cli_args, "iterable")
+        cli_args
+        and hasattr(cli_args, "iterable")
        and cli_args.iterable is not None
        and cli_args.iterable
    )
@@ -76,20 +80,25 @@ def load_datasets(
        preprocess_iterable=preprocess_iterable,
    )

-    if (
-        cli_args.debug
-        or cfg.debug
-        or cli_args.debug_text_only
-        or int(cli_args.debug_num_examples) > 0
-    ):
+    if (  # pylint: disable=too-many-boolean-expressions
+        cli_args
+        and (
+            cli_args.debug
+            or cfg.debug
+            or cli_args.debug_text_only
+            or int(cli_args.debug_num_examples) > 0
+        )
+    ) or debug:
        LOG.info("check_dataset_labels...")

-        train_samples = sample_dataset(train_dataset, cli_args.debug_num_examples)
+        num_examples = cli_args.debug_num_examples if cli_args else 1
+        text_only = cli_args.debug_text_only if cli_args else False
+        train_samples = sample_dataset(train_dataset, num_examples)
        check_dataset_labels(
            train_samples,
            tokenizer,
-            num_examples=cli_args.debug_num_examples,
-            text_only=cli_args.debug_text_only,
+            num_examples=num_examples,
+            text_only=text_only,
        )

        LOG.info("printing prompters...")
@@ -125,7 +134,7 @@ def load_preference_datasets(
    total_num_steps: Optional[int] = int(
        math.ceil(len(train_dataset) * cfg.num_epochs / cfg.batch_size)
    )
-    if cfg.rl == "grpo":
+    if cfg.rl is RLType.GRPO:
        total_num_steps = None

    if cli_args.debug or cfg.debug:
--- a/src/axolotl/core/trainer_builder.py
+++ b/src/axolotl/core/trainer_builder.py
@@ -21,6 +21,7 @@ import importlib.util
 import inspect
 import logging
 import math
+import os
 import sys
 from abc import abstractmethod
 from pathlib import Path
@@ -60,6 +61,7 @@ from axolotl.core.training_args import (
 from axolotl.integrations.base import PluginManager
 from axolotl.monkeypatch.multipack import SUPPORTED_MULTIPACK_MODEL_TYPES
 from axolotl.monkeypatch.relora import ReLoRACallback
+from axolotl.monkeypatch.trainer.lr import patch_trainer_get_lr
 from axolotl.processing_strategies import get_processing_strategy
 from axolotl.utils import is_comet_available, is_mlflow_available
 from axolotl.utils.callbacks import (
@@ -71,6 +73,7 @@ from axolotl.utils.callbacks import (
    SaveBetterTransformerModelCallback,
    bench_eval_callback_factory,
    causal_lm_bench_eval_callback_factory,
+    colab_inference_post_train_callback,
    log_prediction_callback_factory,
 )
 from axolotl.utils.callbacks.lisa import lisa_callback_factory
@@ -84,7 +87,7 @@ from axolotl.utils.collators import (
 )
 from axolotl.utils.collators.mm_chat import MultiModalChatDataCollator
 from axolotl.utils.models import ensure_dtype
-from axolotl.utils.schemas.enums import CustomSupportedOptimizers
+from axolotl.utils.schemas.enums import CustomSupportedOptimizers, RLType

 try:
    import torch._dynamo  # pylint: disable=ungrouped-imports
@@ -114,6 +117,8 @@ class TrainerBuilderBase(abc.ABC):
        if hasattr(model, "add_model_tags"):
            model.add_model_tags(["axolotl"])

+        patch_trainer_get_lr()
+
    @property
    def model_ref(self):
        return self._model_ref
@@ -165,6 +170,9 @@ class TrainerBuilderBase(abc.ABC):
                )
            )

+        if self.cfg.gc_steps:
+            callbacks.append(GCCallback(gc_steps=self.cfg.gc_steps))
+
        if self.cfg.use_wandb:
            callbacks.append(
                SaveAxolotlConfigtoWandBCallback(self.cfg.axolotl_config_path)
@@ -246,9 +254,6 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
        if self.cfg.loss_watchdog_threshold is not None:
            callbacks.append(LossWatchDogCallback(self.cfg))

-        if self.cfg.gc_steps:
-            callbacks.append(GCCallback(gc_steps=self.cfg.gc_steps))
-
        return callbacks

    def get_post_trainer_create_callbacks(self, trainer):
@@ -290,6 +295,10 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
        if self.cfg.lisa_step_interval and self.cfg.lisa_n_layers:
            callbacks.append(lisa_callback_factory(trainer))

+        if any("COLAB_" in key for key in os.environ):
+            ColabCallback = colab_inference_post_train_callback(trainer)
+            callbacks.append(ColabCallback(self.cfg))
+
        callbacks.extend(super().get_post_trainer_create_callbacks(trainer=trainer))
        return callbacks

@@ -344,7 +353,7 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
        training_arguments_kwargs["warmup_steps"] = warmup_steps
        training_arguments_kwargs["logging_steps"] = logging_steps

-        if self.cfg.seed:
+        if self.cfg.seed is not None:
            training_arguments_kwargs["seed"] = self.cfg.seed

        if self.cfg.gradient_checkpointing:
@@ -485,7 +494,7 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):

        # these are all the "standard" kwargs that are def used
        training_arguments_kwargs["max_steps"] = (
-            total_num_steps if self.cfg.max_steps else -1
+            self.cfg.max_steps if self.cfg.max_steps else -1
        )
        training_arguments_kwargs["max_seq_length"] = self.cfg.sequence_len
        training_arguments_kwargs["per_device_train_batch_size"] = (
@@ -538,8 +547,6 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
        report_to = []
        if self.cfg.use_wandb:
            report_to.append("wandb")
-            if self.cfg.wandb_name:
-                training_arguments_kwargs["run_name"] = self.cfg.wandb_name
        if self.cfg.use_mlflow:
            report_to.append("mlflow")
        if self.cfg.use_tensorboard:
@@ -699,6 +706,20 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
                optimizer_cls = ADOPT
                adam_kwargs["decouple"] = True
                optimizer_kwargs.update(adam_kwargs)
+            elif self.cfg.optimizer == "came_pytorch":
+                from came_pytorch import CAME
+
+                optimizer_cls = CAME
+
+                beta1 = training_arguments_kwargs.get("adam_beta1", 0.9)
+                beta2 = training_arguments_kwargs.get("adam_beta2", 0.999)
+                beta3 = training_arguments_kwargs.get("adam_beta3", 0.9999)
+                eps1 = training_arguments_kwargs.get("adam_epsilon", 1e-30)
+                eps2 = training_arguments_kwargs.get("adam_epsilon2", 1e-16)
+                adam_kwargs["betas"] = (beta1, beta2, beta3)
+                adam_kwargs["eps"] = (eps1, eps2)
+
+                optimizer_kwargs.update(adam_kwargs)

            # Parse any additional optimizer args from config
            if self.cfg.optim_args:
@@ -798,14 +819,15 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
        data_collator_kwargs = {
            "padding": True,  # True/"longest" is the default
        }
+        multiple = 64
        if self.cfg.pad_to_sequence_len:
-            data_collator_kwargs["pad_to_multiple_of"] = 64 * math.ceil(
-                self.cfg.sequence_len / 64
+            data_collator_kwargs["pad_to_multiple_of"] = multiple * math.ceil(
+                self.cfg.sequence_len / multiple
            )
        else:
            # A100 is best at 64, while others at 8. Let's use the larger so we don't have to check
            # https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html
-            data_collator_kwargs["pad_to_multiple_of"] = 64
+            data_collator_kwargs["pad_to_multiple_of"] = multiple

        if self.cfg.reward_model:
            data_collator_kwargs["max_length"] = self.cfg.sequence_len
@@ -1011,6 +1033,10 @@ class HFRLTrainerBuilder(TrainerBuilderBase):
            training_args_kwargs["dataloader_prefetch_factor"] = (
                self.cfg.dataloader_prefetch_factor
            )
+
+        if self.cfg.seed is not None:
+            training_args_kwargs["seed"] = self.cfg.seed
+
        if self.cfg.gradient_checkpointing:
            training_args_kwargs["gradient_checkpointing"] = (
                self.cfg.gradient_checkpointing
@@ -1034,6 +1060,8 @@ class HFRLTrainerBuilder(TrainerBuilderBase):
            # default to saving each epoch if not defined
            training_args_kwargs["save_strategy"] = "epoch"

+        training_args_kwargs["save_only_model"] = self.cfg.save_only_model
+
        if self.cfg.dataset_processes:
            training_args_kwargs["dataset_num_proc"] = self.cfg.dataset_processes

@@ -1051,9 +1079,13 @@ class HFRLTrainerBuilder(TrainerBuilderBase):
        if self.cfg.use_wandb:
            training_args_kwargs["run_name"] = self.cfg.wandb_name

+        training_args_kwargs["sequence_parallel_degree"] = (
+            self.cfg.sequence_parallel_degree
+        )
+
        training_args_cls = None
        blocklist_args_kwargs = []
-        if self.cfg.rl == "simpo":
+        if self.cfg.rl is RLType.SIMPO:
            training_args_cls = AxolotlCPOConfig
            training_args_kwargs["loss_type"] = "simpo"
            training_args_kwargs["max_length"] = self.cfg.sequence_len
@@ -1061,13 +1093,13 @@ class HFRLTrainerBuilder(TrainerBuilderBase):
            if self.cfg.cpo_alpha is not None:
                training_args_kwargs["cpo_alpha"] = self.cfg.cpo_alpha

-        elif self.cfg.rl == "orpo":
+        elif self.cfg.rl is RLType.ORPO:
            training_args_cls = AxolotlORPOConfig
            training_args_kwargs["max_length"] = self.cfg.sequence_len
            if self.cfg.max_prompt_len:
                training_args_kwargs["max_prompt_length"] = self.cfg.max_prompt_len

-        elif self.cfg.rl == "kto":
+        elif self.cfg.rl is RLType.KTO:
            training_args_cls = AxolotlKTOConfig

            training_args_kwargs["desirable_weight"] = (
@@ -1081,14 +1113,14 @@ class HFRLTrainerBuilder(TrainerBuilderBase):
            if self.cfg.max_prompt_len:
                training_args_kwargs["max_prompt_length"] = self.cfg.max_prompt_len

-        elif self.cfg.rl == "grpo":
+        elif self.cfg.rl is RLType.GRPO:
            training_args_cls = GRPOStrategy.get_training_args_class()
            training_args_kwargs.update(GRPOStrategy.set_training_args_kwargs(self.cfg))
            blocklist_args_kwargs = GRPOStrategy.get_blocklist_args_kwargs()

        else:
            training_args_cls = AxolotlDPOConfig
-            if self.cfg.rl == "ipo":
+            if self.cfg.rl is RLType.IPO:
                training_args_kwargs["loss_type"] = "ipo"
            training_args_kwargs["max_length"] = self.cfg.sequence_len
            training_args_kwargs["max_completion_length"] = None
@@ -1131,67 +1163,73 @@ class HFRLTrainerBuilder(TrainerBuilderBase):

    def build(self, total_num_steps):
        training_args = self.build_training_arguments(total_num_steps)
-        dpo_trainer_kwargs = {}
-        if self.cfg.rl == "ipo":
+        trainer_kwargs = {}
+        if self.cfg.rl is RLType.IPO:
            if self.cfg.dpo_label_smoothing:
-                dpo_trainer_kwargs["label_smoothing"] = self.cfg.dpo_label_smoothing
+                trainer_kwargs["label_smoothing"] = self.cfg.dpo_label_smoothing
        if self.eval_dataset:
-            dpo_trainer_kwargs["eval_dataset"] = self.eval_dataset
+            trainer_kwargs["eval_dataset"] = self.eval_dataset
        if self.cfg.adapter and self.peft_config:
-            dpo_trainer_kwargs["peft_config"] = self.peft_config
+            trainer_kwargs["peft_config"] = self.peft_config
        if self.cfg.precompute_ref_log_probs is not None:
-            dpo_trainer_kwargs["precompute_ref_log_probs"] = (
+            trainer_kwargs["precompute_ref_log_probs"] = (
                self.cfg.precompute_ref_log_probs
            )
-        if self.cfg.rl == "grpo":
-            trainer_cls = GRPOStrategy.get_trainer_class()
+        if self.cfg.rl is RLType.GRPO:
+            trainer_cls = GRPOStrategy.get_trainer_class(
+                sequence_parallel=self.cfg.sequence_parallel_degree > 1
+            )
            trainer_cls_args = [self.model]
            trainer_cls_args.extend(GRPOStrategy.set_trainer_args(self.cfg))
-            dpo_trainer_kwargs.update(GRPOStrategy.set_trainer_kwargs(self.cfg))
-        elif self.cfg.rl in ["dpo", "ipo"]:
+            trainer_kwargs.update(GRPOStrategy.set_trainer_kwargs(self.cfg))
+        elif self.cfg.rl in [RLType.DPO, RLType.IPO]:
            trainer_cls = DPOStrategy.get_trainer_class()
            trainer_cls_args = [self.model, self.model_ref]
-        elif self.cfg.rl == "orpo":
+        elif self.cfg.rl is RLType.ORPO:
            trainer_cls = AxolotlORPOTrainer
            trainer_cls_args = [self.model]
-        elif self.cfg.rl in ["kto"]:
+        elif self.cfg.rl is RLType.KTO:
            trainer_cls = AxolotlKTOTrainer
            trainer_cls_args = [self.model]
-        elif self.cfg.rl in ["simpo"]:
+        elif self.cfg.rl is RLType.SIMPO:
            trainer_cls = AxolotlCPOTrainer
            trainer_cls_args = [self.model]
        else:
            raise ValueError(f"Unsupported RL: {self.cfg.rl}")

+        if self.cfg.plugins:
+            plugin_manager = PluginManager.get_instance()
+            trainer_cls = plugin_manager.get_trainer_cls(self.cfg)
+
        sig = inspect.signature(trainer_cls)
        if "tokenizer" in sig.parameters.keys():
-            dpo_trainer_kwargs["tokenizer"] = self.tokenizer
+            trainer_kwargs["tokenizer"] = self.tokenizer
        else:
-            dpo_trainer_kwargs["processing_class"] = self.tokenizer
+            trainer_kwargs["processing_class"] = self.tokenizer

        if self.cfg.datasets is not None and (
            trainer_cls is DPOStrategy.get_trainer_class()
        ):
-            dpo_trainer_kwargs["dataset_tags"] = [
+            trainer_kwargs["dataset_tags"] = [
                d["path"] for d in self.cfg.datasets if not Path(d["path"]).is_dir()
            ]
-        dpo_trainer = trainer_cls(
+        trainer = trainer_cls(
            *trainer_cls_args,
            args=training_args,
            train_dataset=self.train_dataset,
            callbacks=self.get_callbacks(),
-            **dpo_trainer_kwargs,
+            **trainer_kwargs,
        )
        if self.cfg.fsdp:
-            ensure_dtype(dpo_trainer.model, dtype=self.cfg.torch_dtype)
-            if self.cfg.rl in ["dpo", "ipo"] and dpo_trainer.ref_model:
-                ensure_dtype(dpo_trainer.ref_model, dtype=self.cfg.torch_dtype)
+            ensure_dtype(trainer.model, dtype=self.cfg.torch_dtype)
+            if self.cfg.rl in [RLType.DPO, RLType.IPO] and trainer.ref_model:
+                ensure_dtype(trainer.ref_model, dtype=self.cfg.torch_dtype)

-        dpo_trainer = self.hook_post_create_trainer(dpo_trainer)
-        for callback in self.get_post_trainer_create_callbacks(dpo_trainer):
-            dpo_trainer.add_callback(callback)
+        trainer = self.hook_post_create_trainer(trainer)
+        for callback in self.get_post_trainer_create_callbacks(trainer):
+            trainer.add_callback(callback)

-        return dpo_trainer
+        return trainer


 class HFPPOTrainerBuilder(TrainerBuilderBase):
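Note on the `==` to `is` change for `cfg.rl` above: identity checks against enum members only succeed when `cfg.rl` already holds the enum member, not the raw string from the YAML config. The sketch below illustrates the difference. The `RLType` class here is a stand-in defined for the example; the real one lives in `axolotl.utils.schemas.enums` and is not shown in this diff, so its base classes and members are assumptions. The `normalize_rl` helper is also hypothetical.

```python
from enum import Enum
from typing import Optional


class RLType(str, Enum):
    """Stand-in for axolotl's RLType (assumed to be a str-valued enum)."""

    DPO = "dpo"
    IPO = "ipo"
    GRPO = "grpo"


def normalize_rl(value: Optional[str]) -> Optional[RLType]:
    """Hypothetical helper: coerce a raw config string to an RLType member."""
    if value is None:
        return None
    return RLType(value)  # raises ValueError for unknown values


raw = "grpo"

# A str-mixin enum still compares equal to the raw string...
equal_by_value = raw == RLType.GRPO

# ...but the raw string is never the same object as the member.
same_object = raw is RLType.GRPO

# After coercion, identity checks behave as the diff expects.
coerced_is_member = normalize_rl(raw) is RLType.GRPO
```

The `is` comparisons in the diff therefore rely on the config being validated into enum members before the trainer builder runs.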
--- a/src/axolotl/core/trainers/__init__.py
+++ b/src/axolotl/core/trainers/__init__.py
@@ -5,7 +5,7 @@

 from .base import AxolotlTrainer
 from .dpo.trainer import AxolotlDPOTrainer
-from .grpo.trainer import AxolotlGRPOTrainer
+from .grpo.trainer import AxolotlGRPOSequenceParallelTrainer, AxolotlGRPOTrainer
 from .mamba import AxolotlMambaTrainer
 from .relora import ReLoRATrainer
 from .trl import (
--- a/src/axolotl/core/trainers/base.py
+++ b/src/axolotl/core/trainers/base.py
@@ -114,6 +114,8 @@ class AxolotlTrainer(
            packing_efficiency_estimate=self.args.sample_packing_efficiency,
            batch_max_len=batch_max_len,
            batch_size=batch_size,
+            group_size=self.args.sample_packing_group_size,
+            bin_size=self.args.sample_packing_bin_size,
            sequential=self.args.sample_packing_sequentially,
            drop_last=True,
        )
@@ -371,15 +373,13 @@ class AxolotlTrainer(
                num_items_in_batch=num_items_in_batch,
            )

-        loss = super().compute_loss(
+        return super().compute_loss(
            model,
            inputs,
            return_outputs=return_outputs,
            num_items_in_batch=num_items_in_batch,
        )

-        return loss
-
    @staticmethod
    def orpo_concatenate_inputs(inputs, label_pad_token=-100, pad_token=0, device=None):
        concatenated_batch = {}
--- a/src/axolotl/core/trainers/dpo/__init__.py
+++ b/src/axolotl/core/trainers/dpo/__init__.py
@@ -1,14 +1,11 @@
-"""
-DPO Specific Strategy for training
-"""
+"""DPO Specific Strategy for training"""

 from axolotl.core.trainers.dpo.trainer import AxolotlDPOTrainer
+from axolotl.utils.schemas.enums import RLType


 class DPOStrategy:
-    """
-    Strategy for DPO training
-    """
+    """Strategy for DPO training"""

    @classmethod
    def get_trainer_class(cls):
@@ -23,7 +20,7 @@ class DPOStrategy:
    @classmethod
    def set_training_args_kwargs(cls, cfg):
        training_args_kwargs = {}
-        if cfg.rl == "ipo":
+        if cfg.rl is RLType.IPO:
            training_args_kwargs["loss_type"] = "ipo"
        training_args_kwargs["max_length"] = cfg.sequence_len
        training_args_kwargs["max_completion_length"] = None
--- a/src/axolotl/core/trainers/dpo/trainer.py
+++ b/src/axolotl/core/trainers/dpo/trainer.py
@@ -177,12 +177,8 @@ class AxolotlDPOTrainer(RngLoaderMixin, SchedulerMixin, DPOTrainer):
            # dpo trainer may incorrectly prepend the bos_token_id to the dpo outputs
            if res["chosen_input_ids"][0] == processing_class.bos_token_id:
                res["chosen_input_ids"] = res["chosen_input_ids"][1:]
-                res["chosen_labels"] = res["chosen_labels"][1:]
-                res["chosen_attention_mask"] = res["chosen_attention_mask"][1:]
            if res["rejected_input_ids"][0] == processing_class.bos_token_id:
                res["rejected_input_ids"] = res["rejected_input_ids"][1:]
-                res["rejected_labels"] = res["rejected_labels"][1:]
-                res["rejected_attention_mask"] = res["rejected_attention_mask"][1:]

        return res

@@ -251,7 +247,9 @@ class AxolotlDPOTrainer(RngLoaderMixin, SchedulerMixin, DPOTrainer):
                )

        # Base evaluation
-        initial_output = super().evaluation_loop(
+        initial_output = super(  # pylint: disable=bad-super-call
+            DPOTrainer, self
+        ).evaluation_loop(
            dataloader,
            description,
            prediction_loss_only,
--- a/src/axolotl/core/trainers/grpo/__init__.py
+++ b/src/axolotl/core/trainers/grpo/__init__.py
@@ -1,37 +1,41 @@
-"""
-GRPO Specific Strategy for training
-"""
+"""GRPO Specific Strategy for training"""

 import importlib
 import inspect
 import logging
+from typing import Any

 from trl.trainer.grpo_trainer import RewardFunc

-from axolotl.core.trainers.grpo.trainer import AxolotlGRPOTrainer
+from axolotl.core.trainers.grpo.args import AxolotlGRPOConfig
+from axolotl.core.trainers.grpo.trainer import (
+    AxolotlGRPOSequenceParallelTrainer,
+    AxolotlGRPOTrainer,
+)
+from axolotl.utils.dict import DictDefault
 from axolotl.utils.schemas.trl import TRLConfig

-LOG = logging.getLogger("axolotl")
+LOG = logging.getLogger(__name__)


 class GRPOStrategy:
-    """
-    Strategy for GRPO training
-    """
+    """Strategy for GRPO training"""

    @classmethod
-    def get_trainer_class(cls):
+    def get_trainer_class(
+        cls, sequence_parallel: bool
+    ) -> type[AxolotlGRPOTrainer] | type[AxolotlGRPOSequenceParallelTrainer]:
+        if sequence_parallel:
+            return AxolotlGRPOSequenceParallelTrainer
        return AxolotlGRPOTrainer

    @classmethod
-    def get_training_args_class(cls):
-        from axolotl.core.trainers.grpo.args import AxolotlGRPOConfig
-
+    def get_training_args_class(cls) -> type[AxolotlGRPOConfig]:
        return AxolotlGRPOConfig

    @classmethod
-    def set_training_args_kwargs(cls, cfg):
-        grpo_args_kwargs = {}
+    def set_training_args_kwargs(cls, cfg: DictDefault) -> dict[str, Any]:
+        grpo_args_kwargs: dict[str, Any] = {}

        if not hasattr(cfg, "trl") or not cfg.trl:
            return grpo_args_kwargs
@@ -40,8 +44,8 @@ class GRPOStrategy:

        if trl.use_vllm:
            grpo_args_kwargs["use_vllm"] = trl.use_vllm
-            grpo_args_kwargs["vllm_server_host"] = trl.vllm_server_host or trl.vllm.host
-            grpo_args_kwargs["vllm_server_port"] = trl.vllm_server_port or trl.vllm.port
+            grpo_args_kwargs["vllm_server_host"] = trl.vllm_server_host or trl.vllm.host  # type: ignore[attr-defined]
+            grpo_args_kwargs["vllm_server_port"] = trl.vllm_server_port or trl.vllm.port  # type: ignore[attr-defined]
            if trl.vllm_server_timeout:
                grpo_args_kwargs["vllm_server_timeout"] = trl.vllm_server_timeout
            if trl.vllm_guided_decoding_regex:
@@ -63,6 +67,7 @@ class GRPOStrategy:

        grpo_args_kwargs["max_completion_length"] = trl.max_completion_length
        grpo_args_kwargs["log_completions"] = trl.log_completions
+        grpo_args_kwargs["num_completions_to_print"] = trl.num_completions_to_print

        if trl.reward_weights:
            grpo_args_kwargs["reward_weights"] = trl.reward_weights
@@ -70,6 +75,13 @@ class GRPOStrategy:
        if trl.scale_rewards is not None:
            grpo_args_kwargs["scale_rewards"] = trl.scale_rewards

+        if trl.loss_type is not None:
+            grpo_args_kwargs["loss_type"] = trl.loss_type
+        if trl.mask_truncated_completions is not None:
+            grpo_args_kwargs["mask_truncated_completions"] = (
+                trl.mask_truncated_completions
+            )
+
        if trl.temperature is not None:
            grpo_args_kwargs["temperature"] = trl.temperature
        if trl.top_p is not None:
@@ -85,21 +97,27 @@ class GRPOStrategy:
            grpo_args_kwargs["num_iterations"] = trl.num_iterations
        if trl.epsilon is not None:
            grpo_args_kwargs["epsilon"] = trl.epsilon
+        if trl.epsilon_high is not None:
+            grpo_args_kwargs["epsilon_high"] = trl.epsilon_high
+
+        if trl.use_liger_loss is not None:
+            grpo_args_kwargs["use_liger_loss"] = trl.use_liger_loss

        return grpo_args_kwargs

    @classmethod
-    def set_trainer_args(cls, cfg):
+    def set_trainer_args(cls, cfg: DictDefault) -> list[Any]:
        trainer_args = []
        if cfg.trl and cfg.trl.reward_funcs:
            reward_funcs = []
            for reward_func_fqn in cfg.trl.reward_funcs:
                reward_funcs.append(cls.get_reward_func(reward_func_fqn))
            trainer_args.append(reward_funcs)
+
        return trainer_args

    @classmethod
-    def set_trainer_kwargs(cls, cfg):
+    def set_trainer_kwargs(cls, cfg: DictDefault) -> dict[str, Any]:
        trainer_kwargs = {}
        if cfg.trl and cfg.trl.reward_processing_classes:
            trainer_kwargs["reward_processing_classes"] = (
@@ -113,7 +131,7 @@ class GRPOStrategy:
        return None

    @classmethod
-    def get_blocklist_args_kwargs(cls):
+    def get_blocklist_args_kwargs(cls) -> list[str]:
        return ["dataset_num_proc"]

    @classmethod
@@ -124,13 +142,13 @@ class GRPOStrategy:
        Args:
            reward_func_fqn (str): Fully qualified name of the reward function (e.g. r1_grpo.gsm8k_transform),
                or a HF hub path to the reward model.
-        Raises:
-            ValueError: If the reward function does not accept at least two arguments.

        Returns:
            RewardFunc: A callable that accepts prompts and completions and returns rewards,
                or a path to a reward model.

+        Raises:
+            ValueError: If the reward function does not accept at least two arguments.
        """
        try:
            # use importlib to dynamically load the reward function from the module
--- a/src/axolotl/core/trainers/grpo/args.py
+++ b/src/axolotl/core/trainers/grpo/args.py
@@ -11,6 +11,4 @@ from axolotl.core.training_args import AxolotlTrainingMixins

 @dataclass
 class AxolotlGRPOConfig(AxolotlTrainingMixins, GRPOConfig):
-    """
-    Axolotl GRPO Config for GRPO training
-    """
+    """Axolotl GRPO Config for GRPO training"""
--- a/src/axolotl/core/trainers/grpo/sampler.py
+++ b/src/axolotl/core/trainers/grpo/sampler.py
@@ -0,0 +1,172 @@
+"""Repeat random sampler (similar to the one implemented in
+https://github.com/huggingface/trl/blob/main/trl/trainer/grpo_trainer.py) that adds
+sequence parallelism functionality; i.e., duplicating data across ranks in the same
+sequence parallel group.
+"""
+
+from typing import Iterator, Sized
+
+import torch
+from torch.utils.data import Sampler
+
+
+class SequenceParallelRepeatRandomSampler(Sampler):
+    """Sampler for GRPO training with sequence parallelism.
+
+    This sampler ensures:
+    - Ranks in the same sequence parallel (SP) group receive identical data.
+    - Each index is repeated multiple times for sampling different completions.
+    - Entire batches are repeated for reuse in multiple updates.
+    - Data is properly distributed across SP groups.
+
+    In the table below, the values represent dataset indices. Each SP group has
+    `sequence_parallel_degree = 2` GPUs working together on the same data. There are 2
+    SP groups (SP0 and SP1), with `world_size = 4` total GPUs.
+
+                                               Sequence Parallel Groups
+                                        |       SP0        |       SP1        |
+                                        |  GPU 0  |  GPU 1 |  GPU 2  |  GPU 3 |
+                    global_step  step    <---> mini_repeat_count=3
+                                            <----------> batch_size=2 per SP group
+    grad_accum=2   ▲  ▲  0       0         [0 0 0  1 1 1]     [2 2 2  3 3 3]   <- SP groups get different data
+                   ▼  |  0       1         [0 0 0  1 1 1]     [2 2 2  3 3 3]   <- Same data for each SP group GPU
+                      |
+                      |  1       2         [0 0 0  1 1 1]     [2 2 2  3 3 3]   <- Repeat same indices for iterations
+    num_iterations=2  ▼  1       3         [0 0 0  1 1 1]     [2 2 2  3 3 3]   <- When using gradient accumulation
+
+                         2       4         [4 4 4  5 5 5]     [6 6 6  7 7 7]   <- New batch of data indices
+                         2       5         [4 4 4  5 5 5]     [6 6 6  7 7 7]
+                                            ...
+
+    Args:
+        dataset: Dataset to sample from.
+        mini_repeat_count: How many times to repeat each sample immediately.
+        world_size: Total number of processes.
+        rank: Rank of current process.
+        batch_size: Number of samples per batch.
+        repeat_count: How many times to repeat the full sampling process.
+        sequence_parallel_degree: Number of ranks in a sequence parallel group.
+        shuffle: Whether to shuffle the dataset.
+        seed: Random seed for shuffling.
+        drop_last: Whether to drop the last incomplete batch.
+    """
+
+    def __init__(
+        self,
+        dataset: Sized,
+        mini_repeat_count: int,
+        world_size: int,
+        rank: int,
+        batch_size: int = 1,
+        repeat_count: int = 1,
+        sequence_parallel_degree: int = 1,
+        shuffle: bool = True,
+        seed: int = 0,
+        drop_last: bool = False,
+    ):
+        self.dataset = dataset
+        self.mini_repeat_count = mini_repeat_count
+        self.batch_size = batch_size
+        self.repeat_count = repeat_count
+        self.shuffle = shuffle
+        self.seed = seed
+        self.drop_last = drop_last
+        self.epoch = 0
+
+        self.world_size = world_size
+        self.rank = rank
+
+        # Sequence parallelism parameters
+        self.sequence_parallel_degree = sequence_parallel_degree
+        self.num_sp_groups = world_size // sequence_parallel_degree
+        self.sp_group_id = rank // sequence_parallel_degree
+
+        # Adjust dataset size for distributed sampling
+        self.num_samples = len(self.dataset)
+        self.total_size = self.num_samples
+
+        # Calculate effective number of samples per SP group
+        if (
+            self.drop_last
+            and self.total_size % (self.num_sp_groups * self.batch_size) != 0
+        ):
+            # Drop last incomplete batch if drop_last is True
+            self.num_samples_per_sp_group = (
+                self.total_size // self.batch_size // self.num_sp_groups
+            ) * self.batch_size
+        else:
+            # Round up to include last batch if drop_last is False
+            self.num_samples_per_sp_group = (
+                (self.total_size + self.batch_size * self.num_sp_groups - 1)
+                // (self.batch_size * self.num_sp_groups)
+                * self.batch_size
+            )
+
+        if shuffle:
+            self.generator = torch.Generator()
+            self.generator.manual_seed(seed)
+
+    def __iter__(self) -> Iterator[int]:
+        """Creates iterator over dataset indices.
+
+        Returns:
+            Iterator that yields indices into the dataset.
+        """
+        # Shuffle with the seeded generator (self.epoch is not consulted here)
+        if self.shuffle:
+            indices = torch.randperm(
+                self.num_samples, generator=self.generator
+            ).tolist()
+        else:
+            indices = list(range(self.num_samples))
+
+        # Add extra samples to make it evenly divisible by batch_size
+        if len(indices) % self.batch_size != 0:
+            padding = indices[: self.batch_size - len(indices) % self.batch_size]
+            indices += padding
+
+        # Subsample based on SP group ID
+        # Each SP group gets distinct batches of data
+        batch_indices = []
+        for i in range(0, len(indices), self.batch_size * self.num_sp_groups):
+            start_idx = i + self.sp_group_id * self.batch_size
+            end_idx = min(start_idx + self.batch_size, len(indices))
+            if start_idx < len(indices):
+                for j in range(self.batch_size):
+                    if start_idx + j < end_idx:
+                        batch_indices.append(indices[start_idx + j])
+
+        # Make sure batch_indices is exactly batch_size * num_batches_per_sp_group
+        if self.drop_last:
+            num_batches_per_sp_group = self.num_samples_per_sp_group // self.batch_size
+            target_len = self.batch_size * num_batches_per_sp_group
+            if len(batch_indices) > target_len:
+                batch_indices = batch_indices[:target_len]
+
+        # Apply the GRPO repeat pattern
+        final_indices = []
+        for _ in range(self.repeat_count):
+            for idx in batch_indices:
+                for _ in range(self.mini_repeat_count):
+                    final_indices.append(idx)
+
+        return iter(final_indices)
+
+    def __len__(self) -> int:
+        """Returns the total length of the iterable including repetitions.
+
+        Returns:
+            Total number of samples.
+        """
+        # Total length including all repetitions
+        return (
+            self.num_samples_per_sp_group * self.mini_repeat_count * self.repeat_count
+        )
+
+    def set_epoch(self, epoch: int) -> None:
+        """Sets the epoch for this sampler.
+
+        Args:
+            epoch: Epoch number to use for shuffling.
+        """
+        self.epoch = epoch
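
The index selection in `__iter__` above can be sketched in plain Python. This is a simplified illustration, not the sampler itself: `sp_group_indices` is a hypothetical helper name, shuffling and the `drop_last` truncation are left out, and the input is assumed to be an already-ordered list of dataset indices.

```python
def sp_group_indices(
    indices,
    batch_size,
    num_sp_groups,
    sp_group_id,
    mini_repeat_count=1,
    repeat_count=1,
):
    """Hypothetical helper mirroring the pad / subsample / repeat steps."""
    indices = list(indices)

    # Pad with leading indices so the length is a multiple of batch_size
    remainder = len(indices) % batch_size
    if remainder:
        indices += indices[: batch_size - remainder]

    # Each SP group takes its own batch out of every stride of
    # batch_size * num_sp_groups indices
    stride = batch_size * num_sp_groups
    batch_indices = []
    for i in range(0, len(indices), stride):
        start = i + sp_group_id * batch_size
        batch_indices.extend(indices[start : start + batch_size])

    # GRPO repeat pattern: each index is emitted mini_repeat_count times in a
    # row, and the whole sequence is replayed repeat_count times
    return [
        idx
        for _ in range(repeat_count)
        for idx in batch_indices
        for _ in range(mini_repeat_count)
    ]
```

With eight indices, a batch size of 2 and two SP groups, group 0 sees indices 0, 1, 4, 5 and group 1 sees 2, 3, 6, 7, so the groups receive disjoint batches while every rank inside one group iterates over the same indices.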
--- a/src/axolotl/core/trainers/grpo/trainer.py
+++ b/src/axolotl/core/trainers/grpo/trainer.py
@@ -1,23 +1,63 @@
-"""
-Axolotl GRPO trainer
-"""
+"""Axolotl GRPO trainers (with and without sequence parallelism handling)"""

+# pylint: disable=too-many-lines,duplicate-code,protected-access,no-member
+
+import warnings
 from contextlib import nullcontext
+from typing import Any

-from accelerate.utils import is_deepspeed_available, is_peft_model
+import datasets
+import torch
+import torch.distributed as dist
+import torch.utils.data
+from accelerate.utils import (
+    broadcast_object_list,
+    gather,
+    gather_object,
+    is_peft_model,
+)
+from datasets import Dataset, IterableDataset
+from torch import nn
+from torch.utils.data import (
+    BatchSampler,
+    DataLoader,
+    Sampler,
+)
+from transformers import (
+    PreTrainedModel,
+    PreTrainedTokenizerBase,
+    Trainer,
+    TrainerCallback,
+)
+from transformers.trainer_utils import seed_worker
+from transformers.utils import is_peft_available
 from trl import GRPOTrainer
-from trl.extras.profiling import profiling_decorator
+from trl.data_utils import (
+    apply_chat_template,
+    is_conversational,
+    maybe_apply_chat_template,
+)
+from trl.extras.profiling import profiling_context, profiling_decorator
+from trl.import_utils import is_deepspeed_available
+from trl.models import unwrap_model_for_generation
+from trl.trainer.grpo_config import GRPOConfig
+from trl.trainer.grpo_trainer import RewardFunc, nanstd
+from trl.trainer.utils import pad

+from axolotl.core.trainers.grpo.sampler import SequenceParallelRepeatRandomSampler
 from axolotl.core.trainers.mixins import RngLoaderMixin, SchedulerMixin
+from axolotl.monkeypatch.attention.ring_attn.patch import get_ring_attn_group
+
+if is_peft_available():
+    # pylint: disable=unused-import
+    from peft import PeftConfig

 if is_deepspeed_available():
    import deepspeed


 class AxolotlGRPOTrainer(RngLoaderMixin, SchedulerMixin, GRPOTrainer):
-    """
-    Extend the base GRPOTrainer for axolotl helpers
-    """
+    """Extend the base GRPOTrainer for axolotl helpers"""

    _tag_names = ["trl", "grpo", "axolotl"]

@@ -67,3 +107,600 @@ class AxolotlGRPOTrainer(RngLoaderMixin, SchedulerMixin, GRPOTrainer):
        # Reset cache on main process
        if self.accelerator.is_main_process:
            self.vllm_client.reset_prefix_cache()
+
+
+class AxolotlGRPOSequenceParallelTrainer(AxolotlGRPOTrainer):
+    """Extend the base GRPOTrainer for sequence parallelism handling"""
+
+    def __init__(
+        self,
+        model: str | PreTrainedModel,
+        reward_funcs: RewardFunc | list[RewardFunc],
+        args: GRPOConfig | None = None,
+        train_dataset: Dataset | IterableDataset | None = None,
+        eval_dataset: (
+            Dataset | IterableDataset | dict[str, Dataset | IterableDataset] | None
+        ) = None,
+        processing_class: PreTrainedTokenizerBase | None = None,
+        reward_processing_classes: (
+            PreTrainedTokenizerBase | list[PreTrainedTokenizerBase] | None
+        ) = None,
+        callbacks: list[TrainerCallback] | None = None,
+        optimizers: tuple[
+            torch.optim.Optimizer | None, torch.optim.lr_scheduler.LambdaLR | None
+        ] = (None, None),
+        peft_config: "PeftConfig | None" = None,
+    ):
+        # First call the superclass constructor with all arguments
+        super().__init__(
+            model=model,
+            reward_funcs=reward_funcs,
+            args=args,
+            train_dataset=train_dataset,
+            eval_dataset=eval_dataset,
+            processing_class=processing_class,
+            reward_processing_classes=reward_processing_classes,
+            callbacks=callbacks,
+            optimizers=optimizers,
+            peft_config=peft_config,
+        )
+
+        # Get number of SP groups (number of processes divided by SP degree)
+        num_processes = self.accelerator.num_processes
+        num_sp_groups = num_processes // self.args.sequence_parallel_degree
+
+        # Calculate batch size per SP group (not per process)
+        sp_group_batch_size = self.args.per_device_train_batch_size * num_sp_groups
+        possible_values = [
+            n_gen
+            for n_gen in range(2, sp_group_batch_size + 1)
+            if (sp_group_batch_size) % n_gen == 0
+        ]
+
+        if self.num_generations not in possible_values:
+            raise ValueError(
+                f"The batch size per SP group ({num_sp_groups} x "
+                f"{self.args.per_device_train_batch_size}) must be evenly divisible by "
+                f"the number of generations per prompt ({self.num_generations}). Given "
+                "the current configuration, the valid values for the number of "
+                f"generations are: {possible_values}."
+            )
+
+        if self.args.eval_strategy != "no":
+            # If sequence parallelism is enabled, calculate batch size per SP group
+            sp_group_eval_batch_size = self.args.per_device_eval_batch_size * num_sp_groups
+            possible_values = [
+                n_gen
+                for n_gen in range(2, sp_group_eval_batch_size + 1)
+                if (sp_group_eval_batch_size) % n_gen == 0
+            ]
+
+            if self.num_generations not in possible_values:
+                raise ValueError(
+                    f"With sequence parallelism (degree {self.args.sequence_parallel_degree}), "
+                    f"the eval batch size per SP group ({num_sp_groups} x {self.args.per_device_eval_batch_size}) "
+                    f"must be evenly divisible by the number of generations per prompt "
+                    f"({self.num_generations}). Given the current eval batch size, "
+                    f"the valid values for the number of generations are: {possible_values}."
+                )
+
+        # Initialize the SP group
+        self.sp_group = get_ring_attn_group()
+        self.rank = dist.get_rank()
+        self.world_size = dist.get_world_size()
+        self.local_rank = dist.get_rank(group=self.sp_group)
+        self.local_world_size = dist.get_world_size(group=self.sp_group)
+
+    def _get_train_sampler(self) -> Sampler:
+        effective_batch_size = (
+            self.args.per_device_train_batch_size
+            * self.world_size
+            * self.args.gradient_accumulation_steps
+        )
+
+        return SequenceParallelRepeatRandomSampler(
+            dataset=self.train_dataset,
+            mini_repeat_count=self.num_generations,
+            world_size=self.world_size,
+            rank=self.rank,
+            batch_size=effective_batch_size
+            // self.num_generations
+            // self.args.sequence_parallel_degree,
+            repeat_count=self.num_iterations * self.args.gradient_accumulation_steps,
+            sequence_parallel_degree=self.args.sequence_parallel_degree,
+            shuffle=True,
+            seed=self.args.seed,
+            drop_last=True,
+        )
+
+    def _create_dataloader_params(self, is_eval=False, custom_batch_size=None):
+        """Create common dataloader parameters for train or eval."""
+        batch_size = custom_batch_size or (
+            self.args.eval_batch_size if is_eval else self._train_batch_size
+        )
+
+        params = {
+            "batch_size": batch_size,
+            "collate_fn": self.data_collator,
+            "num_workers": self.args.dataloader_num_workers,
+            "pin_memory": self.args.dataloader_pin_memory,
+        }
+
+        # Add persistent workers only for training
+        if not is_eval and hasattr(self.args, "dataloader_persistent_workers"):
+            params["persistent_workers"] = self.args.dataloader_persistent_workers
+
+        # Add prefetch factor if specified
+        if self.args.dataloader_prefetch_factor:
+            params["prefetch_factor"] = self.args.dataloader_prefetch_factor
+
+        return params
+
+    def _prepare_dataloader(
+        self, dataset, sampler, is_eval=False, custom_batch_size=None
+    ):
+        """Prepare a dataloader with the given dataset and sampler."""
+        # Get base parameters
+        dataloader_params = self._create_dataloader_params(is_eval, custom_batch_size)
+
+        # Add sampler configuration
+        if not isinstance(dataset, torch.utils.data.IterableDataset):
+            if isinstance(sampler, BatchSampler):
+                # batch_size and batch_sampler are mutually exclusive
+                dataloader_params["batch_sampler"] = sampler
+                del dataloader_params["batch_size"]
+            else:
+                dataloader_params["sampler"] = sampler
+                dataloader_params["drop_last"] = self.args.dataloader_drop_last
+
+            if not is_eval:
+                dataloader_params["worker_init_fn"] = seed_worker
+
+        # Create the dataloader
+        dataloader = DataLoader(dataset, **dataloader_params)
+
+        if self.args.sample_packing and (
+            (not is_eval and not self.args.pretraining)
+            or (is_eval and self.args.eval_sample_packing is not False)
+        ):
+            self.accelerator.even_batches = False
+
+        # Return unprepared dataloader if using sequence parallelism
+        # TODO(djsaunde): We might be able to use `accelerate`'s dataloader preparation
+        # if we use `dispatch_batches` and `slice_fn_for_dispatch` properly (i.e.,
+        # slice each batch along the sequence dimension).
+        if self.args.sequence_parallel_degree > 1:
+            return dataloader
+
+        # Otherwise prepare with accelerator
+        return self.accelerator.prepare_data_loader(dataloader)
+
+    def get_train_dataloader(self) -> DataLoader:
+        """Get dataloader for training"""
+        train_dataset = self.train_dataset
+        # pylint: disable=access-member-before-definition
+        data_collator = self.data_collator  # type: ignore
+
+        # Handle dataset preprocessing
+        if isinstance(train_dataset, datasets.Dataset):
+            # Drop columns that the data collator should not receive
+            if self.args.sample_packing and not self.args.pretraining:
+                train_dataset = train_dataset.remove_columns(["length"])
+            if not self.args.sample_packing or self.args.pretraining:
+                train_dataset = self._remove_unused_columns(
+                    train_dataset, description="training"
+                )
+        else:
+            self.data_collator = self._get_collator_with_removed_columns(  # pylint: disable=attribute-defined-outside-init
+                data_collator,
+                description="training",
+            )
+
+        # Get sampler and create dataloader
+        sampler = self._get_train_sampler()
+        dataloader = self._prepare_dataloader(train_dataset, sampler, is_eval=False)
+
+        return dataloader
+
+    def _generate_and_score_completions(
+        self, inputs: list[dict[str, torch.Tensor | Any]]
+    ) -> dict[str, torch.Tensor | Any]:
+        device = self.accelerator.device
+        mode = "eval" if self.control.should_evaluate else "train"
+
+        prompts = [x["prompt"] for x in inputs]
+        prompts_text = [
+            maybe_apply_chat_template(example, self.processing_class)["prompt"]
+            for example in inputs
+        ]
+        prompt_inputs = self.processing_class(
+            text=prompts_text,
+            return_tensors="pt",
+            padding=True,
+            padding_side="left",
+            add_special_tokens=False,
+        )
+        prompt_inputs = Trainer._prepare_inputs(self, prompt_inputs)
+        prompt_ids, prompt_mask = (
+            prompt_inputs["input_ids"],
+            prompt_inputs["attention_mask"],
+        )
+
+        if self.max_prompt_length is not None:
+            prompt_ids = prompt_ids[:, -self.max_prompt_length :]
+            prompt_mask = prompt_mask[:, -self.max_prompt_length :]
+
+        # Generate completions using either vLLM or regular generation
+        if self.args.use_vllm:
+            # First, have main process load weights if needed
+            # pylint: disable=access-member-before-definition
+            if self.state.global_step != self._last_loaded_step:  # type: ignore[has-type]
+                self._move_model_to_vllm()
+                # pylint: disable=attribute-defined-outside-init
+                self._last_loaded_step = self.state.global_step
+
+            # Generate completions using vLLM: gather all prompts and use them in a single call in the main process
+            all_prompts_text = gather_object(prompts_text)
+            if self.accelerator.is_main_process:
+                if self.args.sequence_parallel_degree > 1:
+                    # Calculate sequence parallel group information
+                    world_size = self.accelerator.num_processes
+                    sequence_parallel_degree = self.args.sequence_parallel_degree
+                    num_sp_groups = world_size // sequence_parallel_degree
+
+                    # Since processes in the same SP group have the same prompts, we need to ensure
+                    # we only take one copy of each prompt from each SP group
+                    ordered_set_of_prompts = []
+                    for sp_group_id in range(num_sp_groups):
+                        # Get the first process from each SP group (typically the group leader)
+                        group_leader_rank = sp_group_id * sequence_parallel_degree
+
+                        # Extract prompts from this SP group, accounting for num_generations duplicates
+                        # We only need prompts from one rank in each SP group
+                        group_prompts = all_prompts_text[
+                            group_leader_rank
+                            * len(prompts_text) : (group_leader_rank + 1)
+                            * len(prompts_text) : self.num_generations
+                        ]
+
+                        ordered_set_of_prompts.extend(group_prompts)
+                else:
+                    # Since 'prompts' contains 'num_generations' duplicates, we first take unique prompts, and generate
+                    # num_generations outputs for each one. This is faster than generating outputs for each duplicate
+                    # prompt individually.
+                    ordered_set_of_prompts = all_prompts_text[
+                        :: self.num_generations * self.args.sequence_parallel_degree
+                    ]
+
+                with profiling_context(self, "vLLM.generate"):
+                    completion_ids = self.vllm_client.generate(
+                        prompts=ordered_set_of_prompts,
+                        n=self.num_generations,
+                        repetition_penalty=self.repetition_penalty,
+                        temperature=self.temperature,
+                        top_p=self.top_p,
+                        top_k=-1 if self.top_k is None else self.top_k,
+                        min_p=0.0 if self.min_p is None else self.min_p,
+                        max_tokens=self.max_completion_length,
+                        guided_decoding_regex=self.guided_decoding_regex,
+                    )
+            else:
+                completion_ids = [None] * (
+                    len(all_prompts_text) // self.args.sequence_parallel_degree
+                )
+
+            # Broadcast the completions from the main process to all processes
+            completion_ids = broadcast_object_list(completion_ids, from_process=0)
+
+            # Determine the appropriate slice based on sequence parallelism
+            if self.args.sequence_parallel_degree > 1:
+                # Calculate SP group ID (which group of ranks this rank belongs to)
+                sp_group_id = self.accelerator.process_index // self.local_world_size
+
+                # Calculate the start index for this SP group
+                sp_group_start = sp_group_id * len(prompts) * self.local_world_size
+
+                # All ranks in the same SP group get the same data slice
+                process_slice = slice(
+                    sp_group_start,
+                    sp_group_start + len(prompts),
+                )
+                completion_ids = completion_ids[process_slice]
+            else:
+                # Original behavior for non-sequence parallel case
+                process_slice = slice(
+                    self.accelerator.process_index * len(prompts),
+                    (self.accelerator.process_index + 1) * len(prompts),
+                )
+                completion_ids = completion_ids[process_slice]
+
+            # Pad the completions, and concatenate them with the prompts
+            completion_ids = [
+                torch.tensor(ids, device=device) for ids in completion_ids
+            ]
+            completion_ids = pad(
+                completion_ids, padding_value=self.processing_class.pad_token_id
+            )
+            prompt_completion_ids = torch.cat([prompt_ids, completion_ids], dim=1)
+        else:
+            # Regular generation path
+            with unwrap_model_for_generation(
+                self.model_wrapped,
+                self.accelerator,
+                gather_deepspeed3_params=self.args.ds3_gather_for_generation,
+            ) as unwrapped_model:
+                prompt_completion_ids = unwrapped_model.generate(
+                    prompt_ids,
+                    attention_mask=prompt_mask,
+                    generation_config=self.generation_config,
+                )
+
+            # Compute prompt length and extract completion ids
+            prompt_length = prompt_ids.size(1)
+            prompt_ids = prompt_completion_ids[:, :prompt_length]
+            completion_ids = prompt_completion_ids[:, prompt_length:]
+
+        # Mask everything after the first EOS token
+        is_eos = completion_ids == self.processing_class.eos_token_id
+        eos_idx = torch.full(
+            (is_eos.size(0),), is_eos.size(1), dtype=torch.long, device=device
+        )
+        eos_idx[is_eos.any(dim=1)] = is_eos.int().argmax(dim=1)[is_eos.any(dim=1)]
+        sequence_indices = torch.arange(is_eos.size(1), device=device).expand(
+            is_eos.size(0), -1
+        )
+        completion_mask = (sequence_indices <= eos_idx.unsqueeze(1)).int()
+
+        # If mask_truncated_completions is enabled, zero out truncated completions in completion_mask
+        if self.args.mask_truncated_completions:
+            truncated_completions = ~is_eos.any(dim=1)
+            completion_mask = (
+                completion_mask * (~truncated_completions).unsqueeze(1).int()
+            )
+
+        # Concatenate prompt_mask with completion_mask for logit computation
+        attention_mask = torch.cat([prompt_mask, completion_mask], dim=1)  # (B, P+C)
+
+        logits_to_keep = completion_ids.size(
+            1
+        )  # we only need to compute the logits for the completion tokens
+        batch_size = (
+            self.args.per_device_train_batch_size
+            if mode == "train"
+            else self.args.per_device_eval_batch_size
+        )
+
+        with torch.no_grad():
+            # When using num_iterations == 1, old_per_token_logps == per_token_logps, so we can skip its
+            # computation here, and use per_token_logps.detach() instead.
+            if self.num_iterations > 1:
+                old_per_token_logps = self._get_per_token_logps(
+                    self.model,
+                    prompt_completion_ids,
+                    attention_mask,
+                    logits_to_keep,
+                    batch_size,
+                )
+            else:
+                old_per_token_logps = None
+
+            if self.beta == 0.0:
+                ref_per_token_logps = None
+            elif self.ref_model is not None:
+                ref_per_token_logps = self._get_per_token_logps(
+                    self.ref_model,
+                    prompt_completion_ids,
+                    attention_mask,
+                    logits_to_keep,
+                    batch_size,
+                )
+            else:
+                with self.accelerator.unwrap_model(self.model).disable_adapter():
+                    ref_per_token_logps = self._get_per_token_logps(
+                        self.model,
+                        prompt_completion_ids,
+                        attention_mask,
+                        logits_to_keep,
+                        batch_size,
+                    )
+
+        # Decode the generated completions
+        completions_text = self.processing_class.batch_decode(
+            completion_ids, skip_special_tokens=True
+        )
+        if is_conversational(inputs[0]):
+            completions = []
+            for prompt, completion in zip(prompts, completions_text):
+                bootstrap = (
+                    prompt.pop()["content"] if prompt[-1]["role"] == "assistant" else ""
+                )
+                completions.append(
+                    [{"role": "assistant", "content": bootstrap + completion}]
+                )
+        else:
+            completions = completions_text
+
+        rewards_per_func = torch.zeros(
+            len(prompts), len(self.reward_funcs), device=device
+        )
+        for i, (reward_func, reward_processing_class, reward_func_name) in enumerate(
+            zip(
+                self.reward_funcs,
+                self.reward_processing_classes,
+                self.reward_func_names,
+            )
+        ):
+            with profiling_context(self, reward_func_name):
+                if isinstance(
+                    reward_func, nn.Module
+                ):  # Module instead of PretrainedModel for compat with compiled models
+                    if is_conversational(inputs[0]):
+                        messages = [
+                            {"messages": p + c} for p, c in zip(prompts, completions)
+                        ]
+                        texts = [
+                            apply_chat_template(x, reward_processing_class)["text"]
+                            for x in messages
+                        ]
+                    else:
+                        texts = [p + c for p, c in zip(prompts, completions)]
+                    reward_inputs = reward_processing_class(
+                        text=texts,
+                        return_tensors="pt",
+                        padding=True,
+                        padding_side="right",
+                        add_special_tokens=False,
+                    )
+                    reward_inputs = Trainer._prepare_inputs(self, reward_inputs)
+                    with torch.inference_mode():
+                        rewards_per_func[:, i] = reward_func(**reward_inputs).logits[
+                            :, 0
+                        ]  # Shape (B*G,)
+                else:
+                    # Repeat all input columns (but "prompt" and "completion") to match the number of generations
+                    keys = [
+                        key for key in inputs[0] if key not in ["prompt", "completion"]
+                    ]
+                    reward_kwargs = {
+                        key: [example[key] for example in inputs] for key in keys
+                    }
+                    output_reward_func = reward_func(
+                        prompts=prompts, completions=completions, **reward_kwargs
+                    )
+                    # Convert None values to NaN
+                    output_reward_func = [
+                        reward if reward is not None else torch.nan
+                        for reward in output_reward_func
+                    ]
+
+                    rewards_per_func[:, i] = torch.tensor(
+                        output_reward_func, dtype=torch.float32, device=device
+                    )
+
+        # If all reward functions return None for a given row, issue a detailed warning
+        if torch.isnan(rewards_per_func).all(dim=1).any():
+            nan_row_idx = (
+                torch.isnan(rewards_per_func).all(dim=1).nonzero(as_tuple=True)[0][0]
+            )
+            row_reward_kwargs = {
+                key: value[nan_row_idx] for key, value in reward_kwargs.items()
+            }
+            row_reward_kwargs["prompt"] = prompts[nan_row_idx]
+            row_reward_kwargs["completion"] = completions[nan_row_idx]
+            warnings.warn(
+                f"All reward functions returned None for the following kwargs: {row_reward_kwargs}. "
+                "Please ensure that at least one reward function returns a valid reward."
+            )
+
+        # Gather the reward per function: this part is crucial, because the rewards are normalized per group and the
+        # completions may be distributed across processes
+        rewards_per_func = gather(rewards_per_func)
+
+        # Apply weights to each reward function's output and sum
+        rewards = (
+            rewards_per_func * self.reward_weights.to(device).unsqueeze(0)
+        ).nansum(dim=1)
+
+        # Compute group-wise rewards
+        mean_grouped_rewards = rewards.view(-1, self.num_generations).mean(dim=1)
+        std_grouped_rewards = rewards.view(-1, self.num_generations).std(dim=1)
+
+        # Normalize the rewards to compute the advantages
+        mean_grouped_rewards = mean_grouped_rewards.repeat_interleave(
+            self.num_generations, dim=0
+        )
+        std_grouped_rewards = std_grouped_rewards.repeat_interleave(
+            self.num_generations, dim=0
+        )
+        advantages = rewards - mean_grouped_rewards
+        if self.args.scale_rewards:
+            advantages = advantages / (std_grouped_rewards + 1e-4)
+
+        # Slice to keep only the local part of the data
+        if self.args.sequence_parallel_degree > 1:
+            # Calculate SP group ID (which group of ranks this rank belongs to)
+            sp_group_id = self.accelerator.process_index // self.local_world_size
+
+            # Calculate the start index for this SP group
+            sp_group_start = sp_group_id * len(prompts) * self.local_world_size
+
+            # All ranks in the same SP group get the same data slice
+            process_slice = slice(
+                sp_group_start,
+                sp_group_start + len(prompts),
+            )
+        else:
+            # Original behavior for non-sequence parallel case
+            process_slice = slice(
+                self.accelerator.process_index * len(prompts),
+                (self.accelerator.process_index + 1) * len(prompts),
+            )
+        advantages = advantages[process_slice]
+
+        # Log the metrics
+        if mode == "train":
+            self._total_train_tokens += (
+                self.accelerator.gather_for_metrics(attention_mask.sum()).sum().item()
+            )
+        self._metrics[mode]["num_tokens"] = [self._total_train_tokens]
+
+        # log completion lengths, mean, min, max
+        agg_completion_mask = self.accelerator.gather_for_metrics(
+            completion_mask.sum(1)
+        )
+        self._metrics[mode]["completions/mean_length"].append(
+            agg_completion_mask.float().mean().item()
+        )
+        self._metrics[mode]["completions/min_length"].append(
+            agg_completion_mask.float().min().item()
+        )
+        self._metrics[mode]["completions/max_length"].append(
+            agg_completion_mask.float().max().item()
+        )
+
+        # identify sequences that terminated with EOS and log their lengths
+        agg_terminated_with_eos = self.accelerator.gather_for_metrics(is_eos.any(dim=1))
+        term_completion_mask = agg_completion_mask[agg_terminated_with_eos]
+        clipped_completions_ratio = 1 - len(term_completion_mask) / len(
+            agg_completion_mask
+        )
+        self._metrics[mode]["completions/clipped_ratio"].append(
+            clipped_completions_ratio
+        )
+        if len(term_completion_mask) == 0:
+            # edge case where no completed sequences are found
+            term_completion_mask = torch.zeros(1, device=device)
+        self._metrics[mode]["completions/mean_terminated_length"].append(
+            term_completion_mask.float().mean().item()
+        )
+        self._metrics[mode]["completions/min_terminated_length"].append(
+            term_completion_mask.float().min().item()
+        )
+        self._metrics[mode]["completions/max_terminated_length"].append(
+            term_completion_mask.float().max().item()
+        )
+
+        # Calculate mean reward per function, but only for samples where the function was applied (non-NaN values)
+        for i, reward_func_name in enumerate(self.reward_func_names):
+            mean_rewards = torch.nanmean(rewards_per_func[:, i]).item()
+            self._metrics[mode][f"rewards/{reward_func_name}/mean"].append(mean_rewards)
+            std_rewards = nanstd(rewards_per_func[:, i]).item()
+            self._metrics[mode][f"rewards/{reward_func_name}/std"].append(std_rewards)
+        self._metrics[mode]["reward"].append(mean_grouped_rewards.mean().item())
+        self._metrics[mode]["reward_std"].append(std_grouped_rewards.mean().item())
+
+        # Log prompt and completion texts
+        self._textual_logs["prompt"].extend(gather_object(prompts_text))
+        self._textual_logs["completion"].extend(gather_object(completions_text))
+        for i, name in enumerate(self.reward_func_names):
+            self._textual_logs["rewards"][name].extend(rewards_per_func[:, i].tolist())
+
+        return {
+            "prompt_ids": prompt_ids,
+            "prompt_mask": prompt_mask,
+            "completion_ids": completion_ids,
+            "completion_mask": completion_mask,
+            "advantages": advantages,
+            "old_per_token_logps": old_per_token_logps,
+            "ref_per_token_logps": ref_per_token_logps,
+        }
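
Two pieces of arithmetic in `_generate_and_score_completions` above may be easier to follow in isolation: the slice each rank takes from the gathered data when sequence parallelism is on, and the group-wise advantage normalization. The sketch below uses plain Python lists instead of tensors; `sp_process_slice` and `group_advantages` are hypothetical names, and the gathered data is assumed to be laid out rank by rank with `local_batch` rows per rank.

```python
from statistics import mean, stdev


def sp_process_slice(process_index, sp_degree, local_batch):
    """Slice of the gathered rows that a rank keeps (hypothetical helper).

    Ranks in the same SP group hold identical rows, so every rank in a group
    takes the copy contributed by the first rank of that group.
    """
    sp_group_id = process_index // sp_degree
    start = sp_group_id * local_batch * sp_degree
    return slice(start, start + local_batch)


def group_advantages(rewards, num_generations, scale=True, eps=1e-4):
    """Reward minus the group mean, optionally divided by (group std + eps)."""
    advantages = []
    for g in range(0, len(rewards), num_generations):
        group = rewards[g : g + num_generations]
        mu = mean(group)
        # Sample standard deviation (n - 1), the default of torch.Tensor.std
        sigma = stdev(group)
        for reward in group:
            adv = reward - mu
            advantages.append(adv / (sigma + eps) if scale else adv)
    return advantages
```

For example, with an SP degree of 2 and four rows per rank, ranks 0 and 1 both keep rows 0 to 3, while ranks 2 and 3 both keep rows 8 to 11. A group whose rewards are all equal gets an advantage of zero for every completion, since the `eps` term keeps the division defined.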
--- a/src/axolotl/core/trainers/mixins/__init__.py
+++ b/src/axolotl/core/trainers/mixins/__init__.py
@@ -6,4 +6,4 @@
 from .optimizer import OptimizerMixin
 from .rng_state_loader import RngLoaderMixin
 from .scheduler import SchedulerMixin
-from .sequence_parallel import SequenceParallelContextManager, SequenceParallelMixin
+from .sequence_parallel import SequenceParallelMixin
--- a/src/axolotl/core/trainers/mixins/scheduler.py
+++ b/src/axolotl/core/trainers/mixins/scheduler.py
@@ -3,9 +3,10 @@
 import logging

 import torch
-from torch.optim.lr_scheduler import OneCycleLR
+from torch.optim.lr_scheduler import LRScheduler, OneCycleLR
 from transformers.trainer import Trainer

+from axolotl.integrations.base import PluginManager
 from axolotl.utils.schedulers import (
    RexLR,
    get_cosine_schedule_with_min_lr,
@@ -25,9 +26,9 @@ class SchedulerMixin(Trainer):

    def create_scheduler(
        self, num_training_steps: int, optimizer: torch.optim.Optimizer = None
-    ):
+    ) -> LRScheduler:
        """
-        Setup the scheduler. The optimizer of the trainer must have been set up either before this method is called or
+        Set up the scheduler. The optimizer of the trainer must have been set up either before this method is called or
        passed as an argument.

        Args:
@@ -47,7 +48,16 @@ class SchedulerMixin(Trainer):
        # fmt: off
        if self.lr_scheduler is None:  # type: ignore  # pylint: disable=access-member-before-definition
            # fmt: on
-            if self.args.alternate_lr_scheduler_type == "one_cycle":
+            plugin_manager = PluginManager.get_instance()
+            lr_scheduler: LRScheduler | None = plugin_manager.create_lr_scheduler(
+                trainer=self,
+                optimizer=optimizer,
+                num_training_steps=num_training_steps
+            )
+            if lr_scheduler is not None:
+                LOG.info(f"Using plugin-created lr_scheduler: {lr_scheduler}")
+                self.lr_scheduler = lr_scheduler
+            elif self.args.alternate_lr_scheduler_type == "one_cycle":
                num_warmup_steps = self.args.get_warmup_steps(num_training_steps)
                pct_start = num_warmup_steps / num_training_steps
                extra_lr_kwargs = {}
@@ -110,4 +120,4 @@ class SchedulerMixin(Trainer):
            if use_cosine_min_lr:
                LOG.warning("axolotl's cosine scheduler with min lr not used (e.g., because of deepspeed).")

-        return self.lr_scheduler
+        return self.lr_scheduler  # type: ignore
--- a/src/axolotl/core/trainers/mixins/sequence_parallel.py
+++ b/src/axolotl/core/trainers/mixins/sequence_parallel.py
@@ -1,85 +1,13 @@
-"""
-Module for Axolotl trainer sequence parallelism mixin and training context manager
-"""
+"""Module for Axolotl trainer sequence parallelism mixin"""

-import functools
-import logging
-
-import torch
 import torch.distributed as dist
 from datasets import Dataset
-from torch import nn
 from torch.utils.data import DistributedSampler, Sampler
-from torch.utils.hooks import RemovableHandle

 from axolotl.monkeypatch.attention.ring_attn import (
-    RingAttnFunc,
    get_ring_attn_group,
-    update_ring_attn_params,
 )

-LOG = logging.getLogger(__name__)
-
-
-def apply_sequence_parallelism(
-    batch: dict[str, torch.Tensor],
-    local_rank: int,
-    local_world_size: int,
-    ring_attn_func: RingAttnFunc,
-) -> dict[str, torch.Tensor]:
-    """
-    Apply sequence parallelism slicing to a batch.
-
-    Args:
-        batch: Batch dictionary (e.g., input_ids, attention_mask, etc.)
-        local_rank: Local rank in the sequence parallel group
-        local_world_size: World size of the sequence parallel group
-        ring_attn_func: The ring attention function to use
-
-    Returns:
-        Sliced batch dictionary.
-    """
-    # Update ring attention params if needed
-    if batch.get("position_ids") is not None:
-        update_ring_attn_params(position_ids=batch["position_ids"])
-
-    # Slice batch for sequence parallel processing
-    total_seq_len = batch["input_ids"].size(1)
-    for key in batch:
-        if (
-            key in batch
-            and isinstance(batch[key], torch.Tensor)
-            and batch[key].dim() > 1
-            and batch[key].size(1) == total_seq_len
-        ):
-
-            if ring_attn_func in [
-                RingAttnFunc.VARLEN_LLAMA3,
-                RingAttnFunc.BATCH_RING,
-            ]:
-                # Split in sequential fashion and grab this rank's chunk
-                batch[key] = (
-                    batch[key].chunk(local_world_size, dim=1)[local_rank].contiguous()
-                )
-            elif ring_attn_func is RingAttnFunc.BATCH_ZIGZAG:
-                chunks = batch[key].chunk(2 * local_world_size, dim=1)
-
-                # Take rank's chunk and opposing chunk for zigzag pattern
-                selected_chunks = [
-                    chunks[local_rank],
-                    chunks[2 * local_world_size - local_rank - 1],
-                ]
-                batch[key] = torch.cat(selected_chunks, dim=1).contiguous()
-            elif ring_attn_func is RingAttnFunc.BATCH_STRIPE:
-                # Split into striped data and stack
-                tensor = torch.stack(
-                    batch[key].split(local_world_size, dim=1),
-                    dim=1,
-                ).transpose(1, 2)
-                batch[key] = tensor[:, local_rank].contiguous()
-
-    return batch
-

 class SequenceParallelMixin:
    """
@@ -157,157 +85,3 @@ class SequenceParallelMixin:
        return self._create_sequence_parallel_sampler(
            eval_dataset, shuffle=False, is_eval=True
        )
-
-
-class SequenceParallelContextManager:
-    """
-    Context manager for sequence parallelism operations.
-
-    This class provides a context that will automatically apply sequence parallelism
-    during model forward passes using a pre-forward hook, and gather outputs from
-    across the sequence parallelism group using a post-forward hook.
-    """
-
-    def __init__(
-        self,
-        model: nn.Module,
-        sequence_parallel_degree: int,
-        ring_attn_func: RingAttnFunc,
-    ):
-        self.model = model
-        self.sequence_parallel_degree = sequence_parallel_degree
-        self.ring_attn_func = ring_attn_func
-        self.process_group = get_ring_attn_group()
-
-        # Initialize sequence parallel group details
-        self.local_rank = dist.get_rank(self.process_group)
-        self.local_world_size = dist.get_world_size(self.process_group)
-
-        # Will store hook handles for removal
-        self.hook_handles: list[RemovableHandle] = []
-
-        # Create a partially applied version of the apply_sequence_parallelism function
-        # with pre-configured params
-        self.apply_sequence_parallelism = functools.partial(
-            apply_sequence_parallelism,
-            local_rank=self.local_rank,
-            local_world_size=self.local_world_size,
-            ring_attn_func=self.ring_attn_func,
-        )
-
-    def __enter__(self):
-        # Forward pre-hook to apply sequence parallelism
-        def sequence_parallel_pre_hook(_, args, kwargs):
-            # Apply sequence parallelism to kwargs
-            kwargs = self.apply_sequence_parallelism(batch=kwargs)
-            return args, kwargs
-
-        # Forward post-hook to gather outputs
-        def sequence_parallel_post_hook(_, __, output):
-            # Gather the sharded outputs
-            return self.gather_outputs(output)
-
-        # Register both hooks
-        self.hook_handles.append(
-            self.model.register_forward_pre_hook(
-                sequence_parallel_pre_hook, with_kwargs=True
-            )
-        )
-        self.hook_handles.append(
-            self.model.register_forward_hook(sequence_parallel_post_hook)
-        )
-
-        return self
-
-    def __exit__(self, exc_type, exc_val, exc_tb):
-        # Remove all hooks
-        for handle in self.hook_handles:
-            handle.remove()
-        self.hook_handles = []
-
-    def gather_outputs(self, output):
-        """Gather sharded outputs from all ranks and reconstruct the full tensor."""
-        # Handle different output formats (dict, tensor, etc.)
-        if isinstance(output, dict):
-            gathered_output = {}
-            for key, value in output.items():
-                if isinstance(value, torch.Tensor) and value.dim() > 1:
-                    # Gather logits or other sequence-sharded tensors
-                    gathered_value = self.gather_tensor(value)
-                    gathered_output[key] = gathered_value
-                else:
-                    gathered_value = value.clone()
-                    dist.all_reduce(
-                        gathered_value, op=dist.ReduceOp.SUM, group=self.process_group
-                    )
-                    gathered_output[key] = gathered_value
-            return gathered_output
-        if isinstance(output, torch.Tensor):
-            return self.gather_tensor(output)
-
-        return output
-
-    def gather_tensor(self, tensor):
-        """Gather a sharded tensor from all ranks."""
-        # Prepare tensors for all_gather
-        world_size = self.local_world_size
-
-        # Create list to store tensors from all ranks
-        gathered_tensors = [torch.zeros_like(tensor) for _ in range(world_size)]
-
-        # All-gather operation
-        dist.all_gather(gathered_tensors, tensor, group=self.process_group)
-
-        # Concatenate along sequence dimension (typically dim=1)
-        if self.ring_attn_func in [RingAttnFunc.VARLEN_LLAMA3, RingAttnFunc.BATCH_RING]:
-            # Simple concatenation for standard sharding
-            return torch.cat(gathered_tensors, dim=1)
-
-        if self.ring_attn_func is RingAttnFunc.BATCH_ZIGZAG:
-            # Each rank has a pattern of (rank, world_size*2-rank-1)
-            reconstituted_tensors = [None] * (world_size * 2)
-
-            # First, split each gathered tensor into its two chunks
-            for rank, gathered_tensor in enumerate(gathered_tensors):
-                # Each tensor contains two chunks in the sequence dimension
-                chunk_size = gathered_tensor.size(1) // 2
-                chunk1, chunk2 = gathered_tensor.split(chunk_size, dim=1)
-
-                # Place chunks in their original positions
-                reconstituted_tensors[rank] = chunk1
-                reconstituted_tensors[world_size * 2 - rank - 1] = chunk2
-
-            # Concatenate the reconstituted tensors in the correct order
-            return torch.cat(reconstituted_tensors, dim=1)
-
-        # Otherwise, RingAttnFunc.BATCH_STRIPE
-        # In striping, each rank has every world_size-th slice
-        batch_size = tensor.size(0)
-        hidden_dim = tensor.size(-1)
-
-        # First, determine the full sequence length
-        total_seq_len = 0
-        for t in gathered_tensors:
-            total_seq_len += t.size(1)
-
-        # Create a tensor to hold the unstriped result
-        result = torch.zeros(
-            batch_size,
-            total_seq_len,
-            hidden_dim,
-            dtype=tensor.dtype,
-            device=tensor.device,
-        )
-
-        # For each rank's tensor, distribute its slices to the correct positions
-        for rank, gathered_tensor in enumerate(gathered_tensors):
-            # The rank's tensor contains every world_size-th slice
-            # starting from its rank position
-            seq_len = gathered_tensor.size(1)
-            for i in range(seq_len):
-                # Calculate the position in the full tensor
-                pos = i * world_size + rank
-                if pos < total_seq_len:
-                    result[:, pos] = gathered_tensor[:, i]
-
-        return result
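Review note: the sharding and gathering logic removed above is easier to follow on plain Python lists. The sketch below mirrors the index arithmetic of the removed `BATCH_ZIGZAG` and `BATCH_STRIPE` branches (chunk `rank` paired with chunk `2 * world_size - rank - 1`, and position `i * world_size + rank`). It is an illustration only: the function names are hypothetical, it uses lists rather than tensors, and it assumes the sequence length divides evenly across ranks.

```python
# Illustrative sketch only: hypothetical helper names, plain lists instead of tensors.
# Assumes the sequence length divides evenly into the required number of chunks.


def zigzag_shard(tokens, world_size, rank):
    """Return this rank's slice: chunk `rank` plus its mirror chunk from the end."""
    n_chunks = 2 * world_size
    assert len(tokens) % n_chunks == 0, "length must be divisible by 2 * world_size"
    size = len(tokens) // n_chunks
    chunks = [tokens[i * size:(i + 1) * size] for i in range(n_chunks)]
    return chunks[rank] + chunks[n_chunks - rank - 1]


def zigzag_gather(shards):
    """Rebuild the full sequence from one zigzag shard per rank."""
    world_size = len(shards)
    slots = [None] * (2 * world_size)
    for rank, shard in enumerate(shards):
        half = len(shard) // 2
        # Each shard holds two chunks; put each back in its original slot.
        slots[rank] = shard[:half]
        slots[2 * world_size - rank - 1] = shard[half:]
    return [tok for chunk in slots for tok in chunk]


def stripe_shard(tokens, world_size, rank):
    """Return every world_size-th token, starting at position `rank`."""
    assert len(tokens) % world_size == 0, "length must be divisible by world_size"
    return tokens[rank::world_size]


def stripe_gather(shards):
    """Rebuild the full sequence from one striped shard per rank."""
    world_size = len(shards)
    total = sum(len(shard) for shard in shards)
    out = [None] * total
    for rank, shard in enumerate(shards):
        for i, tok in enumerate(shard):
            # Token i of this rank's shard came from position i * world_size + rank.
            out[i * world_size + rank] = tok
    return out
```

Zigzag pairs an early chunk with a late one so that, under causal attention, each rank gets a similar amount of work; striping interleaves tokens across ranks for the same reason.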
--- a/src/axolotl/core/trainers/relora.py
+++ b/src/axolotl/core/trainers/relora.py
@@ -1,6 +1,7 @@
 """Module for ReLoRA trainer"""

 import torch
+from torch.optim.lr_scheduler import LRScheduler

 from axolotl.core.trainers.base import AxolotlTrainer
 from axolotl.monkeypatch.relora import ReLoRAScheduler
@@ -19,9 +20,11 @@ class ReLoRATrainer(AxolotlTrainer):
        self,
        num_training_steps: int,
        optimizer: torch.optim.Optimizer | None = None,
-    ):
+    ) -> LRScheduler:
        optimizer = self.optimizer if optimizer is None else optimizer
-        lr_scheduler = super().create_scheduler(num_training_steps, optimizer)
+        lr_scheduler: LRScheduler = super().create_scheduler(
+            num_training_steps, optimizer
+        )

        if self.args.relora_steps:
            warmup_steps = (
@@ -30,7 +33,7 @@ class ReLoRATrainer(AxolotlTrainer):
            anneal_steps = (
                self.args.relora_anneal_steps if self.args.relora_anneal_steps else 1
            )
-            self.lr_scheduler = ReLoRAScheduler(
+            self.lr_scheduler = ReLoRAScheduler(  # type: ignore
                optimizer,
                lr_scheduler,
                self.args.relora_steps,
@@ -38,6 +41,6 @@ class ReLoRATrainer(AxolotlTrainer):
                warmup_steps,
            )
        else:
-            self.lr_scheduler = lr_scheduler
+            self.lr_scheduler = lr_scheduler  # type: ignore

-        return self.lr_scheduler
+        return self.lr_scheduler  # type: ignore
--- a/src/axolotl/core/training_args.py
+++ b/src/axolotl/core/training_args.py
@@ -9,7 +9,7 @@ from PIL.Image import Resampling
 from transformers import TrainingArguments
 from trl import CPOConfig, KTOConfig, ORPOConfig, PRMConfig, RewardConfig

-from axolotl.monkeypatch.attention.ring_attn.patch import RingAttnFunc
+from axolotl.utils.schemas.enums import RingAttnFunc


@dataclass
--- a/src/axolotl/evaluate.py
+++ b/src/axolotl/evaluate.py
@@ -11,20 +11,19 @@ from accelerate.logging import get_logger
 from datasets import Dataset
 from transformers.trainer import Trainer

-from axolotl.logging_config import configure_logging
-from axolotl.train import TrainDatasetMeta
-from axolotl.utils import set_pytorch_cuda_alloc_conf
+from axolotl.train import (
+    TrainDatasetMeta,
+    setup_model_and_tokenizer,
+)
 from axolotl.utils.dict import DictDefault
 from axolotl.utils.distributed import cleanup_distributed
-from axolotl.utils.models import load_model, load_processor, load_tokenizer
 from axolotl.utils.trainer import setup_trainer

 project_root = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
 src_dir = os.path.join(project_root, "src")
 sys.path.insert(0, src_dir)

-configure_logging()
-LOG = get_logger("axolotl.evaluate")
+LOG = get_logger(__name__)


 def evaluate_dataset(
@@ -75,37 +74,22 @@ def evaluate(*, cfg: DictDefault, dataset_meta: TrainDatasetMeta) -> Dict[str, f
    Returns:
        Dictionary mapping metric names to their values.
    """
-    # pylint: disable=duplicate-code
-    # Enable expandable segments for cuda allocation to improve VRAM usage
-    set_pytorch_cuda_alloc_conf()
-
-    # Load tokenizer
-    LOG.debug(
-        f"loading tokenizer... {cfg.tokenizer_config or cfg.base_model_config}",
-        main_process_only=True,
-    )
-    tokenizer = load_tokenizer(cfg)
-
-    # Load processor for multimodal models if needed
-    processor = None
-    if cfg.is_multimodal:
-        processor = load_processor(cfg, tokenizer)
+    # Load tokenizer, processor and model
+    LOG.debug("loading model for evaluation...")
+    model, tokenizer, _, processor = setup_model_and_tokenizer(cfg)

    # Get datasets
+    # pylint: disable=duplicate-code
    train_dataset = dataset_meta.train_dataset
    eval_dataset = dataset_meta.eval_dataset
    total_num_steps = dataset_meta.total_num_steps

-    # Load model
-    LOG.debug("loading model for evaluation...")
-    model, _ = load_model(cfg, tokenizer, processor=processor)
-
    # Set up trainer
    trainer = setup_trainer(
-        cfg,
+        cfg=cfg,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
-        model=(model, None, None),  # No need for model_ref or peft_config
+        model=model,
        tokenizer=tokenizer,
        processor=processor,
        total_num_steps=total_num_steps,
--- a/src/axolotl/integrations/base.py
+++ b/src/axolotl/integrations/base.py
@@ -24,6 +24,9 @@ import logging
 from typing import OrderedDict

 import torch
+from torch.optim.lr_scheduler import LRScheduler
+
+from axolotl.utils.dict import DictDefault


 class BasePlugin:
@@ -35,13 +38,15 @@ class BasePlugin:

    Methods:
    register(cfg): Registers the plugin with the given configuration.
+    load_datasets(cfg): Loads and preprocesses the dataset for training.
    pre_model_load(cfg): Performs actions before the model is loaded.
    post_model_build(cfg, model): Performs actions after the model is loaded, but before LoRA adapters are applied.
    pre_lora_load(cfg, model): Performs actions before LoRA weights are loaded.
    post_lora_load(cfg, model): Performs actions after LoRA weights are loaded.
    post_model_load(cfg, model): Performs actions after the model is loaded, inclusive of any adapters.
+    post_trainer_create(cfg, trainer): Performs actions after the trainer is created.
    create_optimizer(cfg, trainer): Creates and returns an optimizer for training.
-    create_lr_scheduler(cfg, trainer, optimizer): Creates and returns a learning rate scheduler.
+    create_lr_scheduler(cfg, trainer, optimizer, num_training_steps): Creates and returns a learning rate scheduler.
    add_callbacks_pre_trainer(cfg, model): Adds callbacks to the trainer before training.
    add_callbacks_post_trainer(cfg, trainer): Adds callbacks to the trainer after training.
    """
@@ -62,20 +67,32 @@ class BasePlugin:
        None
        """

-    def get_input_args(self):
+    def get_input_args(self) -> str | None:
        """
        Returns a pydantic model for the plugin's input arguments.
        """

+    def load_datasets(self, cfg: DictDefault, preprocess: bool = False):
+        """
+        Loads and preprocesses the dataset for training.
+
+        Args:
+            cfg: The configuration for the plugin.
+            preprocess: Whether this is the preprocess step of the datasets.
+
+        Returns:
+            dataset_meta: The metadata for the training dataset.
+        """
+
    def pre_model_load(self, cfg):  # pylint: disable=unused-argument
        """
        Performs actions before the model is loaded.

-        Parameters:
-        cfg (dict): The configuration for the plugin.
+        Args:
+            cfg (dict): The configuration for the plugin.

        Returns:
-        None
+            None
        """

    def post_model_build(self, cfg, model):  # pylint: disable=unused-argument
@@ -90,86 +107,99 @@ class BasePlugin:
        """
        Performs actions after the model is loaded.

-        Parameters:
-        cfg (dict): The configuration for the plugin.
-        model (object): The loaded model.
+        Args:
+            cfg (dict): The configuration for the plugin.
+            model (object): The loaded model.

        Returns:
-        None
+            None
        """

    def pre_lora_load(self, cfg, model):  # pylint: disable=unused-argument
        """
        Performs actions before LoRA weights are loaded.

-        Parameters:
-        cfg (dict): The configuration for the plugin.
-        model (object): The loaded model.
+        Args:
+            cfg (dict): The configuration for the plugin.
+            model (object): The loaded model.

        Returns:
-        None
+            None
        """

    def post_lora_load(self, cfg, model):  # pylint: disable=unused-argument
        """
        Performs actions after LoRA weights are loaded.

-        Parameters:
-        cfg (dict): The configuration for the plugin.
-        model (object): The loaded model.
+        Args:
+            cfg (dict): The configuration for the plugin.
+            model (object): The loaded model.

        Returns:
-        None
+            None
        """

    def get_trainer_cls(self, cfg):  # pylint: disable=unused-argument):
        """
        Returns a custom class for the trainer.

-        Parameters:
-        cfg (dict): The global axolotl configuration.
+        Args:
+            cfg (dict): The global axolotl configuration.

        Returns:
-        class: The class for the trainer.
+            class: The class for the trainer.
+        """
+
+    def post_trainer_create(self, cfg, trainer):  # pylint: disable=unused-argument
+        """
+        Performs actions after the trainer is created.
+
+        Args:
+            cfg (dict): The configuration for the plugin.
+            trainer (object): The trainer object for training.
+
+        Returns:
+            None
        """

    def create_optimizer(self, cfg, trainer):  # pylint: disable=unused-argument
        """
        Creates and returns an optimizer for training.

-        Parameters:
-        cfg (dict): The configuration for the plugin.
-        trainer (object): The trainer object for training.
+        Args:
+            cfg (dict): The configuration for the plugin.
+            trainer (object): The trainer object for training.

        Returns:
-        object: The created optimizer.
+            object: The created optimizer.
        """

    def create_lr_scheduler(
-        self, cfg, trainer, optimizer
-    ):  # pylint: disable=unused-argument
+        self, cfg, trainer, optimizer, num_training_steps
+    ) -> LRScheduler | None:  # pylint: disable=unused-argument
        """
        Creates and returns a learning rate scheduler.

-        Parameters:
-        cfg (dict): The configuration for the plugin.
-        trainer (object): The trainer object for training.
-        optimizer (object): The optimizer for training.
+        Args:
+            cfg (dict): The configuration for the plugin.
+            trainer (object): The trainer object for training.
+            optimizer (object): The optimizer for training.
+            num_training_steps (int): Total number of training steps.
+
         Returns:
-        object: The created learning rate scheduler.
+            LRScheduler | None: The created learning rate scheduler, or None if the plugin does not provide one.
        """

    def add_callbacks_pre_trainer(self, cfg, model):  # pylint: disable=unused-argument
        """
        setup callbacks before creating the trainer.

-        Parameters:
-        cfg (dict): The configuration for the plugin.
-        model (object): The loaded model.
+        Args:
+            cfg (dict): The configuration for the plugin.
+            model (object): The loaded model.

        Returns:
-        List[callable]: A list of callback functions to be added to the TrainingArgs
+            List[callable]: A list of callback functions to be added to the TrainingArgs
        """
        return []

@@ -180,12 +210,12 @@ class BasePlugin:
        Adds callbacks to the trainer after creating the trainer.
        This is useful for callbacks that require access to the model or trainer.

-        Parameters:
-        cfg (dict): The configuration for the plugin.
-        trainer (object): The trainer object for training.
+        Args:
+            cfg (dict): The configuration for the plugin.
+            trainer (object): The trainer object for training.

        Returns:
-        List[callable]: A list of callback functions to be added
+            List[callable]: A list of callback functions to be added
        """
        return []

@@ -193,23 +223,23 @@ class BasePlugin:
        """
        Performs actions after training is complete.

-        Parameters:
-        cfg (dict): The axolotl configuration
-        model (object): The loaded model.
+        Args:
+            cfg (dict): The axolotl configuration
+            model (object): The loaded model.

        Returns:
-        None
+            None
        """

    def post_train_unload(self, cfg):  # pylint: disable=unused-argument
        """
        Performs actions after training is complete and the model is unloaded.

-        Parameters:
-        cfg (dict): The configuration for the plugin.
+        Args:
+            cfg (dict): The configuration for the plugin.

        Returns:
-        None
+            None
        """


@@ -270,6 +300,7 @@ class PluginManager:
    plugins: OrderedDict[str, BasePlugin] = collections.OrderedDict()

    _instance = None
+    _cfg = None

    def __new__(cls):
        """
@@ -277,7 +308,9 @@ class PluginManager:
        """
        if cls._instance is None:
            cls._instance = super(PluginManager, cls).__new__(cls)
-            cls._instance.plugins = collections.OrderedDict()
+            cls._instance.plugins: OrderedDict[str, BasePlugin] = (
+                collections.OrderedDict()
+            )
        return cls._instance

    @staticmethod
@@ -290,6 +323,14 @@ class PluginManager:
            PluginManager()
        return PluginManager._instance  # type: ignore

+    @property
+    def cfg(self):
+        return self._cfg
+
+    @cfg.setter
+    def cfg(self, cfg):
+        self._cfg = cfg
+
    def register(self, plugin_name: str):
        """
        Registers a new plugin by its name.
@@ -325,6 +366,27 @@ class PluginManager:
                input_args.append(input_args_from_plugin)
        return input_args

+    def load_datasets(self, cfg, preprocess: bool = False):
+        """
+        Calls the load_datasets method of each registered plugin.
+
+        Args:
+            cfg: The configuration for the plugins.
+            preprocess: Whether this is the preprocess step of the datasets.
+
+        Returns:
+            dataset_meta: The dataset metadata loaded from all registered plugins.
+        """
+        return_ds_meta = None
+        for plugin in self.plugins.values():
+            dataset_meta = plugin.load_datasets(cfg, preprocess)
+            if dataset_meta is not None:
+                if return_ds_meta is None:
+                    return_ds_meta = dataset_meta
+                else:
+                    raise RuntimeError("Multiple plugins loaded datasets")
+        return return_ds_meta
+
    def pre_model_load(self, cfg):
        """
        Calls the pre_model_load method of all registered plugins.
@@ -409,29 +471,43 @@ class PluginManager:
                return trainer_cls
        return None

-    def create_optimizer(self, cfg, trainer):
+    def post_trainer_create(self, cfg, trainer):
        """
-        Calls the create_optimizer method of all registered plugins and returns the first non-None optimizer.
+        Calls the post_trainer_create method of all registered plugins.

        Parameters:
        cfg (dict): The configuration for the plugins.
        trainer (object): The trainer object for training.

+        Returns:
+        None
+        """
+        for plugin in self.plugins.values():
+            plugin.post_trainer_create(cfg, trainer)
+
+    def create_optimizer(self, trainer):
+        """
+        Calls the create_optimizer method of all registered plugins and returns the first non-None optimizer.
+
+        Parameters:
+        trainer (object): The trainer object for training.
+
        Returns:
        object: The created optimizer, or None if none was found.
        """
        for plugin in self.plugins.values():
-            optimizer = plugin.create_optimizer(cfg, trainer)
+            optimizer = plugin.create_optimizer(self.cfg, trainer)
            if optimizer is not None:
                return optimizer
        return None

-    def create_lr_scheduler(self, cfg, trainer, optimizer):
+    def create_lr_scheduler(
+        self, trainer, optimizer, num_training_steps
+    ) -> LRScheduler | None:
        """
        Calls the create_lr_scheduler method of all registered plugins and returns the first non-None scheduler.

        Parameters:
-        cfg (dict): The configuration for the plugins.
        trainer (object): The trainer object for training.
        optimizer (object): The optimizer for training.

@@ -439,7 +515,12 @@ class PluginManager:
        object: The created learning rate scheduler, or None if none was found.
        """
        for plugin in self.plugins.values():
-            scheduler = plugin.create_lr_scheduler(cfg, trainer, optimizer)
+            scheduler: LRScheduler | None = plugin.create_lr_scheduler(
+                self.cfg,
+                trainer=trainer,
+                optimizer=optimizer,
+                num_training_steps=num_training_steps,
+            )
            if scheduler is not None:
                return scheduler
        return None
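Review note: after this change the manager dispatches plugin hooks in two different ways. `create_lr_scheduler` returns the first non-None result and takes `cfg` from the manager itself, while `load_datasets` allows at most one plugin to return a result and raises otherwise. The sketch below reproduces only that control flow with hypothetical stand-in classes (`Plugin`, `Manager`, `CosinePlugin`, `DatasetPlugin`); it is not the axolotl implementation, and the returned values are placeholders rather than real schedulers or dataset metadata.

```python
# Illustrative sketch only: hypothetical stand-ins for BasePlugin / PluginManager.


class Plugin:
    """Base plugin: every hook returns None unless a subclass overrides it."""

    def load_datasets(self, cfg, preprocess=False):
        return None

    def create_lr_scheduler(self, cfg, trainer, optimizer, num_training_steps):
        return None


class Manager:
    """Holds the config and an insertion-ordered mapping of plugins."""

    def __init__(self, cfg, plugins):
        self.cfg = cfg
        self.plugins = plugins

    def create_lr_scheduler(self, trainer, optimizer, num_training_steps):
        # First plugin to return a scheduler wins; later plugins are not consulted.
        for plugin in self.plugins.values():
            scheduler = plugin.create_lr_scheduler(
                self.cfg,
                trainer=trainer,
                optimizer=optimizer,
                num_training_steps=num_training_steps,
            )
            if scheduler is not None:
                return scheduler
        return None

    def load_datasets(self, cfg, preprocess=False):
        # At most one plugin may supply datasets; a second supplier is an error.
        result = None
        for plugin in self.plugins.values():
            meta = plugin.load_datasets(cfg, preprocess)
            if meta is not None:
                if result is not None:
                    raise RuntimeError("Multiple plugins loaded datasets")
                result = meta
        return result


class CosinePlugin(Plugin):
    """Example plugin; the tuple is a placeholder for a real scheduler object."""

    def create_lr_scheduler(self, cfg, trainer, optimizer, num_training_steps):
        return ("cosine", num_training_steps)


class DatasetPlugin(Plugin):
    """Example plugin; the dict is a placeholder for real dataset metadata."""

    def load_datasets(self, cfg, preprocess=False):
        return {"source": "plugin", "preprocess": preprocess}
```

When no plugin returns a scheduler, the manager returns None and the caller (here, `SchedulerMixin.create_scheduler`) falls through to its built-in scheduler selection.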
--- a/src/axolotl/integrations/cut_cross_entropy/__init__.py
+++ b/src/axolotl/integrations/cut_cross_entropy/__init__.py
@@ -25,7 +25,7 @@ import torch

 from axolotl.integrations.base import BasePlugin
 from axolotl.utils import get_pytorch_version
-from axolotl.utils.distributed import zero_only
+from axolotl.utils.distributed import is_main_process

 from .args import CutCrossEntropyArgs  # pylint: disable=unused-import. # noqa: F401

@@ -76,7 +76,7 @@ class CutCrossEntropyPlugin(BasePlugin):
                cce_patch,
            )

-            with zero_only():
+            if is_main_process(use_environ=True):
                LOG.info(
                    f"Applying Cut Cross Entropy to model type: {cfg.model_config_type}"
                )
--- a/src/axolotl/integrations/cut_cross_entropy/monkeypatch/__init__.py
+++ b/src/axolotl/integrations/cut_cross_entropy/monkeypatch/__init__.py
--- a/src/axolotl/integrations/kd/chat_template.py
+++ b/src/axolotl/integrations/kd/chat_template.py
@@ -37,6 +37,7 @@ class ChatTemplateStrategyWithKD(ChatTemplateStrategy):
        train_on_eos=None,
        train_on_eot=None,
        eot_tokens=None,
+        split_thinking: bool | None = False,
        logprobs_field="logprobs",
        gen_temperature=1.0,
        kd_temperature=1.0,
@@ -54,6 +55,7 @@ class ChatTemplateStrategyWithKD(ChatTemplateStrategy):
            train_on_eos=train_on_eos,
            train_on_eot=train_on_eot,
            eot_tokens=eot_tokens,
+            split_thinking=split_thinking,
        )

    @property
--- a/src/axolotl/integrations/liger/__init__.py
+++ b/src/axolotl/integrations/liger/__init__.py
@@ -23,8 +23,8 @@ import logging
 import sys

 from axolotl.integrations.base import BasePlugin
+from axolotl.utils.distributed import is_main_process

-from ...utils.distributed import zero_only
 from .args import LigerArgs  # pylint: disable=unused-import. # noqa: F401
 from .utils import patch_with_compile_disable

@@ -85,7 +85,7 @@ class LigerPlugin(BasePlugin):
                kwargs["geglu"] = cfg.liger_glu_activation
            elif "swiglu" in liger_fn_sig.parameters:
                kwargs["swiglu"] = cfg.liger_glu_activation
-            with zero_only():
+            if is_main_process(use_environ=True):
                LOG.info(
                    f"Applying LIGER to {cfg.model_config_type} with kwargs: {kwargs}"
                )
@@ -151,6 +151,30 @@ class LigerPlugin(BasePlugin):
                rms_norm=cfg.liger_rms_norm,
                layer_norm=cfg.liger_layer_norm,
            )
+        elif cfg.model_config_type == "qwen3":
+            from axolotl.integrations.liger.models.qwen3 import (
+                apply_liger_kernel_to_qwen3,
+            )
+
+            apply_liger_kernel_to_qwen3(
+                cross_entropy=cfg.liger_cross_entropy,
+                fused_linear_cross_entropy=cfg.liger_fused_linear_cross_entropy,
+                glu_activation=cfg.liger_glu_activation,
+                rms_norm=cfg.liger_rms_norm,
+                layer_norm=cfg.liger_layer_norm,
+            )
+        elif cfg.model_config_type == "qwen3_moe":
+            from axolotl.integrations.liger.models.qwen3_moe import (
+                apply_liger_kernel_to_qwen3_moe,
+            )
+
+            apply_liger_kernel_to_qwen3_moe(
+                cross_entropy=cfg.liger_cross_entropy,
+                fused_linear_cross_entropy=cfg.liger_fused_linear_cross_entropy,
+                glu_activation=cfg.liger_glu_activation,
+                rms_norm=cfg.liger_rms_norm,
+                layer_norm=cfg.liger_layer_norm,
+            )
        else:
            logging.warning(
                f"Unsupported model config type: {cfg.model_config_type}. Liger not applied."
--- a/src/axolotl/integrations/liger/models/qwen3.py
+++ b/src/axolotl/integrations/liger/models/qwen3.py
@@ -0,0 +1,160 @@
+"""
+Liger FLCE for Qwen3. Based on transformers v4.51.3.
+"""
+
+import sys
+from typing import Optional, Tuple, Union
+
+import torch
+from liger_kernel.transformers.model.loss_utils import LigerForCausalLMLoss
+from transformers.cache_utils import Cache
+from transformers.modeling_outputs import CausalLMOutputWithPast
+
+
+def lce_forward(
+    self,
+    input_ids: Optional[torch.LongTensor] = None,
+    attention_mask: Optional[torch.Tensor] = None,
+    position_ids: Optional[torch.LongTensor] = None,
+    past_key_values: Optional[Cache] = None,
+    inputs_embeds: Optional[torch.FloatTensor] = None,
+    labels: Optional[torch.LongTensor] = None,
+    use_cache: Optional[bool] = None,
+    output_attentions: Optional[bool] = None,
+    output_hidden_states: Optional[bool] = None,
+    cache_position: Optional[torch.LongTensor] = None,
+    logits_to_keep: Union[int, torch.Tensor] = 0,
+    **kwargs,
+) -> Union[Tuple, CausalLMOutputWithPast]:
+    r"""
+    Args:
+        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
+            config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
+            (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
+
+        logits_to_keep (`int` or `torch.Tensor`, *optional*):
+            If an `int`, compute logits for the last `logits_to_keep` tokens. If `0`, calculate logits for all
+            `input_ids` (special case). Only last token logits are needed for generation, and calculating them only for that
+            token can save memory, which becomes pretty significant for long sequences or large vocabulary size.
+            If a `torch.Tensor`, must be 1D corresponding to the indices to keep in the sequence length dimension.
+            This is useful when using packed tensor format (single dimension for batch and sequence length).
+
+    Returns:
+    """
+
+    # pylint: disable=duplicate-code
+    output_attentions = (
+        output_attentions
+        if output_attentions is not None
+        else self.config.output_attentions
+    )
+    output_hidden_states = (
+        output_hidden_states
+        if output_hidden_states is not None
+        else self.config.output_hidden_states
+    )
+
+    # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
+    outputs = self.model(
+        input_ids=input_ids,
+        attention_mask=attention_mask,
+        position_ids=position_ids,
+        past_key_values=past_key_values,
+        inputs_embeds=inputs_embeds,
+        use_cache=use_cache,
+        output_attentions=output_attentions,
+        output_hidden_states=output_hidden_states,
+        cache_position=cache_position,
+        **kwargs,
+    )
+
+    hidden_states = outputs[0]
+
+    logits = None
+    loss = None
+    # if in training mode, don't materialize logits
+    if self.training and (labels is not None):
+        loss = LigerForCausalLMLoss(
+            hidden_states=hidden_states,
+            lm_head_weight=self.lm_head.weight,
+            labels=labels,
+            hidden_size=self.config.hidden_size,
+            **kwargs,
+        )
+
+    else:  # if in inference mode materialize logits
+        slice_indices = (
+            slice(-logits_to_keep, None)
+            if isinstance(logits_to_keep, int)
+            else logits_to_keep
+        )
+        logits = self.lm_head(hidden_states[:, slice_indices, :])
+        if labels is not None:
+            loss = self.loss_function(
+                logits=logits,
+                labels=labels,
+                vocab_size=self.config.vocab_size,
+                **kwargs,
+            )
+
+    return CausalLMOutputWithPast(
+        loss=loss,
+        logits=logits,
+        past_key_values=outputs.past_key_values,
+        hidden_states=outputs.hidden_states,
+        attentions=outputs.attentions,
+    )
+
+
+def apply_liger_kernel_to_qwen3(
+    cross_entropy: bool = False,
+    fused_linear_cross_entropy: bool = False,
+    rms_norm: bool = False,
+    glu_activation: bool = False,
+    layer_norm: bool = False,
+    **kwargs,  # pylint: disable=unused-argument
+) -> None:
+    # pylint: disable=duplicate-code
+    """
+    Apply Liger kernels to replace the original implementations in Hugging Face Qwen3 models.
+
+    Args:
+        cross_entropy (bool): Whether to apply Liger's cross entropy loss. Default is False.
+        fused_linear_cross_entropy (bool):
+            Whether to apply Liger's fused linear cross entropy loss. Default is False.
+            `cross_entropy` and `fused_linear_cross_entropy` cannot both be True.
+            If `fused_linear_cross_entropy` is True, the logits are not materialized during training, which is more memory efficient.
+        rms_norm (bool): Whether to apply Liger's RMSNorm. Default is False.
+        glu_activation (bool): Whether to apply Liger's SwiGLU MLP. Default is False.
+        layer_norm (bool): Whether to apply Liger's LayerNorm. Default is False.
+    """
+
+    import transformers.models.qwen3.modeling_qwen3  # noqa: F401  # pylint: disable=unused-import
+    from liger_kernel.transformers.functional import liger_cross_entropy
+    from liger_kernel.transformers.layer_norm import LigerLayerNorm
+    from liger_kernel.transformers.rms_norm import LigerRMSNorm
+    from liger_kernel.transformers.swiglu import LigerSwiGLUMLP
+
+    assert not (
+        cross_entropy and fused_linear_cross_entropy
+    ), "cross_entropy and fused_linear_cross_entropy cannot both be True."
+
+    modeling_qwen3 = sys.modules["transformers.models.qwen3.modeling_qwen3"]
+
+    if rms_norm:
+        modeling_qwen3.Qwen3RMSNorm = LigerRMSNorm
+
+    if glu_activation:
+        modeling_qwen3.Qwen3MLP = LigerSwiGLUMLP
+
+    if layer_norm:
+        modeling_qwen3.nn.LayerNorm = LigerLayerNorm
+
+    if cross_entropy:
+        from transformers.loss.loss_utils import nn
+
+        nn.functional.cross_entropy = liger_cross_entropy
+
+    if fused_linear_cross_entropy:
+        modeling_qwen3.Qwen3ForCausalLM.forward = lce_forward
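Note on the file above: `apply_liger_kernel_to_qwen3` works by rebinding names on the already-imported modeling module, so any code that later resolves those names picks up the replacements. The following is a minimal sketch of that pattern using only the standard library. The module `fake_modeling`, the classes, and `apply_patches` are hypothetical stand-ins, not the real transformers or Liger APIs.

```python
import sys
import types


class OriginalRMSNorm:
    """Stand-in for the stock normalization layer."""

    name = "original"


class FastRMSNorm:
    """Stand-in for a drop-in replacement (hypothetical, not LigerRMSNorm)."""

    name = "fast"


# Hypothetical stand-in for an imported modeling module.
fake_modeling = types.ModuleType("fake_modeling")
fake_modeling.RMSNorm = OriginalRMSNorm

# Define build_layer inside the module's namespace so that `RMSNorm` is
# resolved as a module-level global at call time, as model code would do.
exec(
    "def build_layer():\n"
    "    return RMSNorm()\n",
    fake_modeling.__dict__,
)
sys.modules["fake_modeling"] = fake_modeling


def apply_patches(rms_norm=False, cross_entropy=False, fused_linear_cross_entropy=False):
    """Rebind names on the module; the two loss options are mutually exclusive."""
    if cross_entropy and fused_linear_cross_entropy:
        raise ValueError(
            "cross_entropy and fused_linear_cross_entropy cannot both be True."
        )
    module = sys.modules["fake_modeling"]
    if rms_norm:
        module.RMSNorm = FastRMSNorm


before = fake_modeling.build_layer().name  # built before patching
apply_patches(rms_norm=True)
after = fake_modeling.build_layer().name  # built after patching
```

Objects created before the patch keep their original class, which is why this kind of patch is normally applied before the model is instantiated.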
--- a/src/axolotl/integrations/liger/models/qwen3_moe.py
+++ b/src/axolotl/integrations/liger/models/qwen3_moe.py
@@ -0,0 +1,191 @@
+"""
+Liger FLCE for Qwen3 MoE. Based on transformers v4.51.3.
+"""
+
+import sys
+from copy import deepcopy
+from typing import List, Optional, Union
+
+import torch
+from liger_kernel.transformers.model.loss_utils import LigerForCausalLMLoss
+from transformers.modeling_outputs import MoeCausalLMOutputWithPast
+from transformers.models.qwen3_moe.modeling_qwen3_moe import load_balancing_loss_func
+
+
+def lce_forward(
+    self,
+    input_ids: Optional[torch.LongTensor] = None,
+    attention_mask: Optional[torch.Tensor] = None,
+    position_ids: Optional[torch.LongTensor] = None,
+    past_key_values: Optional[List[torch.FloatTensor]] = None,
+    inputs_embeds: Optional[torch.FloatTensor] = None,
+    labels: Optional[torch.LongTensor] = None,
+    use_cache: Optional[bool] = None,
+    output_attentions: Optional[bool] = None,
+    output_hidden_states: Optional[bool] = None,
+    output_router_logits: Optional[bool] = None,
+    cache_position: Optional[torch.LongTensor] = None,
+    logits_to_keep: Union[int, torch.Tensor] = 0,
+    **kwargs,
+) -> MoeCausalLMOutputWithPast:
+    r"""
+    Args:
+        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
+            config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
+            (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
+
+        logits_to_keep (`int` or `torch.Tensor`, *optional*):
+            If an `int`, compute logits for the last `logits_to_keep` tokens. If `0`, calculate logits for all
+            `input_ids` (special case). Only last token logits are needed for generation, and calculating them only for that
+            token can save memory, which becomes pretty significant for long sequences or large vocabulary size.
+            If a `torch.Tensor`, must be 1D corresponding to the indices to keep in the sequence length dimension.
+            This is useful when using packed tensor format (single dimension for batch and sequence length).
+
+    Returns:
+    """
+
+    # pylint: disable=duplicate-code
+    output_attentions = (
+        output_attentions
+        if output_attentions is not None
+        else self.config.output_attentions
+    )
+    output_router_logits = (
+        output_router_logits
+        if output_router_logits is not None
+        else self.config.output_router_logits
+    )
+    output_hidden_states = (
+        output_hidden_states
+        if output_hidden_states is not None
+        else self.config.output_hidden_states
+    )
+
+    # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
+    outputs = self.model(
+        input_ids=input_ids,
+        attention_mask=attention_mask,
+        position_ids=position_ids,
+        past_key_values=past_key_values,
+        inputs_embeds=inputs_embeds,
+        use_cache=use_cache,
+        output_attentions=output_attentions,
+        output_hidden_states=output_hidden_states,
+        output_router_logits=output_router_logits,
+        cache_position=cache_position,
+        **kwargs,
+    )
+
+    hidden_states = outputs[0]
+
+    logits = None
+    loss = None
+    # if in training mode, don't materialize logits
+    if self.training and (labels is not None):
+        loss = LigerForCausalLMLoss(
+            hidden_states=hidden_states,
+            lm_head_weight=self.lm_head.weight,
+            labels=labels,
+            hidden_size=self.config.hidden_size,
+            **kwargs,
+        )
+
+    else:  # if in inference mode materialize logits
+        slice_indices = (
+            slice(-logits_to_keep, None)
+            if isinstance(logits_to_keep, int)
+            else logits_to_keep
+        )
+        logits = self.lm_head(hidden_states[:, slice_indices, :])
+        if labels is not None:
+            loss = self.loss_function(
+                logits=logits,
+                labels=labels,
+                vocab_size=self.config.vocab_size,
+                **kwargs,
+            )
+
+    aux_loss = None
+    if output_router_logits:
+        aux_loss = load_balancing_loss_func(
+            outputs.router_logits,
+            self.num_experts,
+            self.num_experts_per_tok,
+            attention_mask,
+        )
+        if labels is not None:
+            loss += self.router_aux_loss_coef * aux_loss.to(
+                loss.device
+            )  # make sure to reside in the same device
+
+    return MoeCausalLMOutputWithPast(
+        loss=loss,
+        aux_loss=aux_loss,
+        logits=logits,
+        past_key_values=outputs.past_key_values,
+        hidden_states=outputs.hidden_states,
+        attentions=outputs.attentions,
+    )
+
+
+def apply_liger_kernel_to_qwen3_moe(
+    cross_entropy: bool = False,
+    fused_linear_cross_entropy: bool = False,
+    rms_norm: bool = False,
+    glu_activation: bool = False,
+    layer_norm: bool = False,
+    **kwargs,  # pylint: disable=unused-argument
+) -> None:
+    # pylint: disable=duplicate-code
+    """
+    Apply Liger kernels to replace the original implementations in Hugging Face Qwen3 MoE models.
+
+    Args:
+        cross_entropy (bool): Whether to apply Liger's cross entropy loss. Default is False.
+        fused_linear_cross_entropy (bool):
+            Whether to apply Liger's fused linear cross entropy loss. Default is False.
+            `cross_entropy` and `fused_linear_cross_entropy` cannot both be True.
+            If `fused_linear_cross_entropy` is True, the logits are not materialized during training, which is more memory efficient.
+        rms_norm (bool): Whether to apply Liger's RMSNorm. Default is False.
+        glu_activation (bool): Whether to apply Liger's SwiGLU MLP. Default is False.
+        layer_norm (bool): Whether to apply Liger's LayerNorm. Default is False.
+    """
+
+    import transformers.models.qwen3_moe.modeling_qwen3_moe  # noqa: F401  # pylint: disable=unused-import
+    from liger_kernel.transformers.functional import liger_cross_entropy
+    from liger_kernel.transformers.layer_norm import LigerLayerNorm
+    from liger_kernel.transformers.rms_norm import LigerRMSNorm
+    from liger_kernel.transformers.swiglu import LigerSwiGLUMLP
+
+    assert not (
+        cross_entropy and fused_linear_cross_entropy
+    ), "cross_entropy and fused_linear_cross_entropy cannot both be True."
+
+    modeling_qwen3_moe = sys.modules["transformers.models.qwen3_moe.modeling_qwen3_moe"]
+
+    if rms_norm:
+        modeling_qwen3_moe.Qwen3MoeRMSNorm = LigerRMSNorm
+
+    if glu_activation:
+
+        def _liger_swiglu_mlp_wrapper(config, intermediate_size=None, **kwargs):
+            "Accepts intermediate_size to pass to LigerSwiGLUMLP"
+            # clone config to avoid modifying the original
+            config = deepcopy(config)
+            if intermediate_size:
+                setattr(config, "intermediate_size", intermediate_size)
+            return LigerSwiGLUMLP(config, **kwargs)
+
+        modeling_qwen3_moe.Qwen3MoeMLP = _liger_swiglu_mlp_wrapper
+
+    if layer_norm:
+        modeling_qwen3_moe.nn.LayerNorm = LigerLayerNorm
+
+    if cross_entropy:
+        from transformers.loss.loss_utils import nn
+
+        nn.functional.cross_entropy = liger_cross_entropy
+
+    if fused_linear_cross_entropy:
+        modeling_qwen3_moe.Qwen3MoeForCausalLM.forward = lce_forward
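Note on the file above: the MoE variant replaces the MLP class with a factory function because callers may pass a per-expert `intermediate_size`, while the replacement MLP is constructed from the config alone. A minimal sketch of that adapter follows, assuming a replacement class that only reads sizes from its config. `StandInMLP`, `mlp_factory`, and the config values are hypothetical.

```python
from copy import deepcopy
from types import SimpleNamespace


class StandInMLP:
    """Hypothetical replacement MLP that reads its sizes from the config."""

    def __init__(self, config):
        self.hidden_size = config.hidden_size
        self.intermediate_size = config.intermediate_size


def mlp_factory(config, intermediate_size=None):
    """Build an MLP, optionally overriding intermediate_size for this instance."""
    # Copy first so the override never leaks into the shared model config.
    config = deepcopy(config)
    if intermediate_size:
        config.intermediate_size = intermediate_size
    return StandInMLP(config)


shared_config = SimpleNamespace(hidden_size=64, intermediate_size=256)
expert = mlp_factory(shared_config, intermediate_size=32)  # per-expert size
dense = mlp_factory(shared_config)  # falls back to the config value
```

The copy costs one config clone per constructed MLP, which in this sketch is the price of keeping the shared config untouched.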
--- a/src/axolotl/integrations/llm_compressor/README.md
+++ b/src/axolotl/integrations/llm_compressor/README.md
@@ -0,0 +1,108 @@
+# LLMCompressor Integration
+
+Fine-tune sparsified models in Axolotl using Neural Magic's [LLMCompressor](https://github.com/vllm-project/llm-compressor).
+
+This integration enables fine-tuning of models sparsified using LLMCompressor within the Axolotl training framework. By combining LLMCompressor's model compression capabilities with Axolotl's distributed training pipelines, users can efficiently fine-tune sparse models at scale.
+
+It uses Axolotl’s plugin system to hook into the fine-tuning flows while maintaining sparsity throughout training.
+
+---
+
+## Requirements
+
+- Axolotl with `llmcompressor` extras:
+
+  ```bash
+  pip install "axolotl[llmcompressor]"
+  ```
+
+- Requires `llmcompressor >= 0.5.1`
+
+This will install all necessary dependencies to fine-tune sparsified models using the integration.
+
+---
+
+## Usage
+
+To enable sparse fine-tuning with this integration, include the plugin in your Axolotl config:
+
+```yaml
+plugins:
+  - axolotl.integrations.llm_compressor.LLMCompressorPlugin
+
+llmcompressor:
+  recipe:
+    finetuning_stage:
+      finetuning_modifiers:
+        ConstantPruningModifier:
+          targets: [
+            're:.*q_proj.weight',
+            're:.*k_proj.weight',
+            're:.*v_proj.weight',
+            're:.*o_proj.weight',
+            're:.*gate_proj.weight',
+            're:.*up_proj.weight',
+            're:.*down_proj.weight',
+          ]
+          start: 0
+  save_compressed: true
+# ... (other training arguments)
+```
+
+This plugin **does not apply pruning or sparsification itself** — it is intended for **fine-tuning models that have already been sparsified**.
+
+Pre-sparsified checkpoints can be:
+- Generated using [LLMCompressor](https://github.com/vllm-project/llm-compressor)
+- Downloaded from [Neural Magic's Hugging Face page](https://huggingface.co/neuralmagic)
+- Custom models with compatible sparsity patterns that you have created yourself
+
+To learn more about writing and customizing LLMCompressor recipes, refer to the official documentation:
+[https://github.com/vllm-project/llm-compressor/blob/main/README.md](https://github.com/vllm-project/llm-compressor/blob/main/README.md)
+
+### Storage Optimization with save_compressed
+
+Setting `save_compressed: true` in your configuration enables saving models in a compressed format, which:
+- Reduces disk space usage by approximately 40%
+- Maintains compatibility with vLLM for accelerated inference
+- Maintains compatibility with llmcompressor for further optimization (example: quantization)
+
+This option is highly recommended when working with sparse models to maximize the benefits of model compression.
+
+### Example Config
+
+See [`examples/llama-3/sparse-finetuning.yaml`](examples/llama-3/sparse-finetuning.yaml) for a complete example.
+
+---
+
+## Inference with vLLM
+
+You can use LLMCompressor to apply additional quantization to your fine-tuned sparse
+model before inference for even greater performance benefits. After fine-tuning,
+you can use vLLM for efficient inference:
+
+```python
+from vllm import LLM, SamplingParams
+
+prompts = [
+    "Hello, my name is",
+    "The president of the United States is",
+    "The capital of France is",
+    "The future of AI is",
+]
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+llm = LLM("path/to/your/sparse/model")
+outputs = llm.generate(prompts, sampling_params)
+
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
+
+For more details on vLLM's capabilities and advanced configuration options, see the [official vLLM documentation](https://docs.vllm.ai/).
+
+## Learn More
+
+For details on available sparsity and quantization schemes, fine-tuning recipes, and usage examples, visit the official LLMCompressor repository:
+
+[https://github.com/vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)
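Note on the README above: "maintaining sparsity throughout training" can be pictured as re-applying a fixed zero mask after every optimizer step, so pruned weights stay at zero even when their gradients are nonzero. The sketch below illustrates that idea with plain Python lists and a hand-written SGD step. It is an assumption-laden illustration, not LLMCompressor's implementation, and every function name in it is hypothetical.

```python
def build_masks(weights):
    """Mark nonzero entries with 1.0; zeros are treated as pruned."""
    return [[1.0 if w != 0.0 else 0.0 for w in row] for row in weights]


def sgd_step(weights, grads, lr):
    """Plain gradient descent update, unaware of sparsity."""
    return [
        [w - lr * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, grads)
    ]


def apply_masks(weights, masks):
    """Zero out any entry that was pruned when the masks were built."""
    return [
        [w * m for w, m in zip(w_row, m_row)]
        for w_row, m_row in zip(weights, masks)
    ]


weights = [[0.5, 0.0], [0.0, -1.0]]  # two entries already pruned
grads = [[0.1, 0.2], [0.3, 0.4]]  # gradients are dense
masks = build_masks(weights)

# Without the mask, the update would make the pruned entries nonzero.
updated = apply_masks(sgd_step(weights, grads, lr=0.1), masks)
```

In this sketch the masks are computed once from the starting checkpoint, which matches the README's point that the model is expected to be sparsified before fine-tuning begins.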
--- a/src/axolotl/integrations/llm_compressor/__init__.py
+++ b/src/axolotl/integrations/llm_compressor/__init__.py
@@ -0,0 +1,5 @@
+"""Integration entry point for the LLMCompressor plugin."""
+
+from .plugin import LLMCompressorPlugin
+
+__all__ = ["LLMCompressorPlugin"]
--- a/src/axolotl/integrations/llm_compressor/args.py
+++ b/src/axolotl/integrations/llm_compressor/args.py
@@ -0,0 +1,40 @@
+"""
+LLMCompressor and Sparse Finetuning config models.
+"""
+
+from typing import Any
+
+from pydantic import BaseModel, Field
+from typing_extensions import Annotated
+
+
+class CompressionArgs(BaseModel):
+    """Sparse Finetuning config for LLMCompressor."""
+
+    # Typing for recipe is set to Any due to:
+    # https://github.com/vllm-project/llm-compressor/issues/1319
+    recipe: Annotated[
+        Any,
+        Field(
+            description="The recipe containing the compression algorithms and hyperparameters to apply."
+        ),
+    ]
+
+    save_compressed: Annotated[
+        bool,
+        Field(
+            default=False,
+            description="Whether to save the compressed model after training.",
+        ),
+    ]
+
+
+class LLMCompressorArgs(BaseModel):
+    """LLMCompressor configuration BaseModel."""
+
+    llmcompressor: Annotated[
+        CompressionArgs,
+        Field(
+            description="Arguments enabling compression pathways through the LLM Compressor plugins"
+        ),
+    ]
--- a/src/axolotl/integrations/llm_compressor/plugin.py
+++ b/src/axolotl/integrations/llm_compressor/plugin.py
@@ -0,0 +1,171 @@
+"""
+Sparse Finetuning plugin for Axolotl — enables handling of sparse neural networks
+by maintaining masks for zero weights during training.
+"""
+
+import logging
+from functools import wraps
+from typing import Any, Callable, Concatenate, ParamSpec, TypeVar
+
+from llmcompressor import active_session, create_session
+from llmcompressor.core import callbacks as session_callbacks
+from llmcompressor.recipe import Recipe
+from torch.nn import Module
+from transformers.trainer import Trainer
+from transformers.trainer_callback import TrainerCallback, TrainerControl, TrainerState
+from transformers.training_args import TrainingArguments
+
+from axolotl.integrations.base import BasePlugin
+
+P = ParamSpec("P")  # Params for generic function signatures
+R = TypeVar("R")  # Return type for generic function signatures
+
+LOG = logging.getLogger("axolotl.integrations.llm_compressor")
+
+
+class LLMCompressorCallbackHandler(TrainerCallback):
+    """
+    Trainer callback for Sparse Finetuning.
+    Maintains sparsity patterns during training by applying masks after optimization steps,
+    ensuring zero-weight updates are canceled out.
+    """
+
+    def __init__(self, trainer: Trainer, recipe: Any):
+        """
+        Initialize the Sparse Finetuning callback handler.
+
+        Args:
+            trainer (Trainer): Huggingface Trainer instance.
+            recipe (Recipe | dict): Sparse finetuning recipe to apply.
+        """
+        super().__init__()
+        self.trainer = trainer
+        self.recipe = (
+            Recipe.model_validate(recipe) if not isinstance(recipe, Recipe) else recipe
+        )
+        self.original_compute_loss = trainer.compute_loss
+        self.trainer.compute_loss = compute_loss_wrapper(self.trainer.compute_loss)
+        create_session()
+
+    def on_train_begin(
+        self,
+        args: TrainingArguments,
+        state: TrainerState,
+        control: TrainerControl,
+        **kwargs,
+    ) -> None:
+        """
+        Called at the beginning of training. Initializes the compression session.
+
+        Args:
+            args (TrainingArguments): Training arguments.
+            state (TrainerState): Trainer state.
+            control (TrainerControl): Trainer control.
+        """
+        super().on_train_begin(args, state, control, **kwargs)
+        self.trainer.accelerator.wait_for_everyone()
+        active_session().initialize(
+            model=self.trainer.model,
+            optimizer=self.trainer.optimizer,
+            start=state.epoch,
+            recipe=self.recipe,
+        )
+        self.trainer.accelerator.wait_for_everyone()
+
+    def on_step_begin(
+        self,
+        args: TrainingArguments,
+        state: TrainerState,
+        control: TrainerControl,
+        **kwargs,
+    ) -> None:
+        """
+        Called at the beginning of a training step. Triggers batch_start callback.
+        """
+        super().on_step_begin(args, state, control, **kwargs)
+        session_callbacks.batch_start()
+
+    def on_step_end(
+        self,
+        args: TrainingArguments,
+        state: TrainerState,
+        control: TrainerControl,
+        **kwargs,
+    ) -> None:
+        """
+        Called at the end of a training step. Triggers optimizer and batch_end callbacks.
+        """
+        super().on_step_end(args, state, control, **kwargs)
+        session_callbacks.optim_pre_step()
+        session_callbacks.optim_post_step()
+        session_callbacks.batch_end()
+
+    def on_train_end(
+        self,
+        args: TrainingArguments,
+        state: TrainerState,
+        control: TrainerControl,
+        **kwargs,
+    ) -> None:
+        """
+        Called at the end of training. Finalizes the compression session.
+        """
+        super().on_train_end(args, state, control, **kwargs)
+        active_session().finalize()
+        self.trainer.compute_loss = self.original_compute_loss
+
+
+class LLMCompressorPlugin(BasePlugin):
+    """
+    Sparse Finetuning plugin for Axolotl integration.
+    """
+
+    def get_input_args(self) -> str:
+        """
+        Returns the path to the plugin's argument definition.
+
+        Returns:
+            str: Dotted path to the LLMCompressorArgs class.
+        """
+        return "axolotl.integrations.llm_compressor.args.LLMCompressorArgs"
+
+    def add_callbacks_post_trainer(self, cfg: Any, trainer: Trainer) -> list:
+        """
+        Adds Sparse Finetuning callback to the Trainer instance.
+
+        Args:
+            cfg (Any): Configuration object containing the sparse recipe.
+            trainer (Trainer): Huggingface Trainer instance.
+
+        Returns:
+            list: List containing the configured callback instances.
+        """
+        LOG.info("Adding Sparse Finetuning callback to the trainer")
+        callback = LLMCompressorCallbackHandler(
+            trainer=trainer,
+            recipe=cfg.llmcompressor.recipe,
+        )
+        return [callback]
+
+
+def compute_loss_wrapper(
+    compute_loss_func: Callable[Concatenate[Module, P], R],
+) -> Callable[Concatenate[Module, P], R]:
+    """
+    Wraps the loss computation function to trigger the loss_calculated callback.
+
+    Args:
+        compute_loss_func (Callable): Original loss computation function.
+
+    Returns:
+        Callable: Wrapped function that also invokes the loss_calculated callback.
+    """
+
+    @wraps(compute_loss_func)
+    def compute_and_notify(model: Module, *args: P.args, **kwargs: P.kwargs) -> R:
+        loss = compute_loss_func(model, *args, **kwargs)
+        if active_session().lifecycle.initialized_ and model.training:
+            session_callbacks.loss_calculated(loss=loss)
+        return loss
+
+    return compute_and_notify
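Note on the file above: the callback handler wraps the trainer's loss function so that each computed loss is also reported to the compression session, and it is expected to put the original function back when training ends. A minimal sketch of that wrap-and-restore cycle follows. `TinyTrainer`, `notify_after`, and the `events` list are hypothetical stand-ins for the Trainer, the wrapper, and the session callbacks.

```python
from functools import wraps

events = []  # hypothetical stand-in for the session's callback log


def notify_after(func, on_result):
    """Return a wrapper that calls func, reports its result, then returns it."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        on_result(result)
        return result

    return wrapper


class TinyTrainer:
    """Hypothetical trainer exposing a compute_loss method."""

    def compute_loss(self, value):
        return value * 2


trainer = TinyTrainer()
original = trainer.compute_loss  # keep a handle for restoring later
trainer.compute_loss = notify_after(original, events.append)

loss = trainer.compute_loss(3)  # runs the original and records the result

# Restore by assigning to the same attribute that was wrapped.
trainer.compute_loss = original
```

Restoring has to target the attribute that was replaced; assigning the saved function to a different attribute would leave the wrapper installed.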
--- a/src/axolotl/integrations/llm_compressor/utils.py
+++ b/src/axolotl/integrations/llm_compressor/utils.py
@@ -0,0 +1,40 @@
+"""Utilities for llmcompressor integration with axolotl."""
+
+from typing import Union
+
+from llmcompressor.transformers.sparsification.compressed_tensors_utils import (
+    modify_save_pretrained,
+)
+from transformers import PreTrainedModel, Trainer
+
+
+def save_compressed_model(
+    model: PreTrainedModel,
+    output_dir: Union[str, bytes],
+    trainer: Trainer,
+    safe_serialization: bool = False,
+    save_compressed: bool = False,
+) -> None:
+    """
+    Synchronize processes, apply compression hooks, and save the model.
+
+    Args:
+        model (PreTrainedModel): The model to be saved.
+        output_dir (str or bytes): Path where the model files will be written.
+        trainer (Trainer): Hugging Face Trainer for process synchronization.
+        safe_serialization (bool): Use safe serialization if True.
+        save_compressed (bool): Write compressed tensors if True.
+    """
+    trainer.accelerator.wait_for_everyone()
+
+    # Only the main process writes the files
+    if not trainer.accelerator.is_main_process:
+        return
+
+    modify_save_pretrained(model)
+    model.save_pretrained(
+        output_dir,
+        safe_serialization=safe_serialization,
+        save_compressed=save_compressed,
+        skip_sparsity_compression_stats=not save_compressed,
+    )
--- a/src/axolotl/monkeypatch/attention/__init__.py
+++ b/src/axolotl/monkeypatch/attention/__init__.py
@@ -0,0 +1,19 @@
+"""
+attention module for attention monkeypatches
+"""
+
+from transformers.integrations.flash_attention import flash_attention_forward
+
+
+def patch_xformers_attn_over_fa2():
+    from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS
+
+    from .xformers import xformers_attention_forward
+
+    ALL_ATTENTION_FUNCTIONS["flash_attention_2"] = xformers_attention_forward
+
+
+def unpatch_xformers_attn_over_fa2():
+    from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS
+
+    ALL_ATTENTION_FUNCTIONS["flash_attention_2"] = flash_attention_forward
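Note on the file above: the patch and unpatch helpers swap the callable stored under one key of an attention registry. The sketch below shows the same swap on a plain dictionary. `REGISTRY`, `stock_forward`, and `alt_forward` are hypothetical stand-ins and do not reproduce the transformers registry or its function signatures.

```python
REGISTRY = {}


def stock_forward(x):
    """Stand-in for the default attention implementation."""
    return ("stock", x)


def alt_forward(x):
    """Stand-in for the alternative attention implementation."""
    return ("alt", x)


REGISTRY["flash_attention_2"] = stock_forward


def patch():
    """Route the existing key to the alternative implementation."""
    REGISTRY["flash_attention_2"] = alt_forward


def unpatch():
    """Put the default back. Store the function object itself, not its result."""
    REGISTRY["flash_attention_2"] = stock_forward


patch()
patched_result = REGISTRY["flash_attention_2"](1)
unpatch()
restored_result = REGISTRY["flash_attention_2"](1)
```

Reusing the existing key means callers that select the implementation by name need no changes, at the cost of the key no longer describing what actually runs while the patch is active.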
--- a/src/axolotl/monkeypatch/attention/ring_attn/__init__.py
+++ b/src/axolotl/monkeypatch/attention/ring_attn/__init__.py
@@ -4,7 +4,6 @@
 # flake8: noqa

 from .patch import (
-    RingAttnFunc,
    get_ring_attn_group,
    register_ring_attn,
    set_ring_attn_group,
--- a/src/axolotl/monkeypatch/attention/ring_attn/adapters/batch.py
+++ b/src/axolotl/monkeypatch/attention/ring_attn/adapters/batch.py
@@ -16,11 +16,7 @@ import torch
 import torch.distributed as dist
 import transformers
 import transformers.modeling_flash_attention_utils
-from ring_flash_attn import (
-    ring_flash_attn_func,
-    stripe_flash_attn_func,
-    zigzag_ring_flash_attn_func,
-)
+from ring_flash_attn import ring_flash_attn_func
 from ring_flash_attn.adapters.hf_adapter import check_params
 from transformers.modeling_flash_attention_utils import (
    _flash_supports_window_size,
@@ -28,12 +24,12 @@ from transformers.modeling_flash_attention_utils import (
 )
 from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS

-from axolotl.monkeypatch.attention.ring_attn.patch import RingAttnFunc
+from axolotl.utils.schemas.enums import RingAttnFunc

 RING_ATTN_FUNC_MAPPING = {
-    RingAttnFunc.BATCH_RING: ring_flash_attn_func,
-    RingAttnFunc.BATCH_ZIGZAG: zigzag_ring_flash_attn_func,
-    RingAttnFunc.BATCH_STRIPE: stripe_flash_attn_func,
+    RingAttnFunc.BATCH_RING: torch.compile(ring_flash_attn_func),
+    # RingAttnFunc.BATCH_ZIGZAG: torch.compile(zigzag_ring_flash_attn_func),
+    # RingAttnFunc.BATCH_STRIPE: torch.compile(stripe_flash_attn_func),
 }


--- a/src/axolotl/monkeypatch/attention/ring_attn/patch.py
+++ b/src/axolotl/monkeypatch/attention/ring_attn/patch.py
@@ -6,16 +6,13 @@ package, specifically the `hf_adapter.substitute_hf_flash_attn` function to patc
 their sequence parallel version of Flash Attention 2.
 """

-from enum import Enum
-
 import torch
 import torch.distributed as dist
 from accelerate.logging import get_logger

-from axolotl.logging_config import configure_logging
 from axolotl.monkeypatch.utils import get_cu_seqlens_from_pos_ids
+from axolotl.utils.schemas.enums import RingAttnFunc

-configure_logging()
 LOG = get_logger(__name__)


@@ -43,17 +40,6 @@ def set_ring_attn_group(ring_attn_group: dist.ProcessGroup | None):
    RING_ATTN_GROUP = ring_attn_group


-class RingAttnFunc(str, Enum):
-    """Enum class for supported `ring-flash-attn` implementations"""
-
-    # VARLEN_RING = "varlen_ring"
-    # VARLEN_ZIGZAG = "varlen_zigzag"
-    VARLEN_LLAMA3 = "varlen_llama3"
-    BATCH_RING = "batch_ring"
-    BATCH_ZIGZAG = "batch_zigzag"
-    BATCH_STRIPE = "batch_stripe"
-
-
 def register_ring_attn(
    sequence_parallel_degree: int,
    heads_k_stride: int | None,
@@ -119,11 +105,7 @@ def register_ring_attn(
        substitute_hf_flash_attn(
            process_group=get_ring_attn_group(), heads_k_stride=heads_k_stride or 1
        )
-    elif ring_attn_func in [
-        RingAttnFunc.BATCH_RING,
-        RingAttnFunc.BATCH_ZIGZAG,
-        RingAttnFunc.BATCH_STRIPE,
-    ]:
+    elif ring_attn_func is RingAttnFunc.BATCH_RING:
        from axolotl.monkeypatch.attention.ring_attn.adapters.batch import (
            substitute_hf_flash_attn,
        )
--- a/src/axolotl/monkeypatch/attention/xformers.py
+++ b/src/axolotl/monkeypatch/attention/xformers.py
@@ -0,0 +1,160 @@
+"""
+xformers attention implementation for packing
+"""
+
+from typing import Optional
+
+import torch
+import xformers
+import xformers.ops.fmha
+from transformers.modeling_flash_attention_utils import (
+    _upad_input,
+)
+
+from axolotl.monkeypatch.utils import get_cu_seqlens_from_pos_ids
+
+xformers_attention = xformers.ops.fmha.memory_efficient_attention
+
+
+def xformers_attention_forward(
+    module: torch.nn.Module,
+    query: torch.Tensor,
+    key: torch.Tensor,
+    value: torch.Tensor,
+    attention_mask: Optional[torch.Tensor] = None,
+    position_ids: Optional[torch.LongTensor] = None,
+    dropout: float = 0.0,  # pylint: disable=unused-argument
+    scaling: Optional[float] = None,  # pylint: disable=unused-argument
+    sliding_window: Optional[int] = None,  # pylint: disable=unused-argument
+    softcap: Optional[float] = None,  # pylint: disable=unused-argument
+    cu_seq_lens_q: Optional[torch.LongTensor] = None,
+    cu_seq_lens_k: Optional[torch.LongTensor] = None,
+    max_length_q: Optional[int] = None,
+    max_length_k: Optional[int] = None,  # pylint: disable=unused-argument
+    **kwargs,  # pylint: disable=unused-argument
+):
+    # Get dimensions
+    # query: [batch, heads, seq_len, head_dim]
+    batch_size = query.size(0)
+    query_length = query.shape[2]
+    key_length = key.shape[2]
+
+    # Default causal mask
+    attn_bias = xformers.ops.LowerTriangularMask()
+
+    # Check if we have sliding window attention
+    has_sliding_window = sliding_window is not None and sliding_window < query_length
+
+    # Transpose dimensions for xformers (Q: [b, h, s, d] -> [b, s, h, d])
+    query = query.transpose(1, 2)
+    key = key.transpose(1, 2)
+    value = value.transpose(1, 2)
+
+    # Get GQA parameters
+    num_attention_heads = module.config.num_attention_heads
+    num_key_value_heads = module.config.num_key_value_heads
+    head_dim = query.size(-1)
+    is_gqa = num_attention_heads != num_key_value_heads
+    n_groups = num_attention_heads // num_key_value_heads if is_gqa else 1
+
+    # If position_ids is provided, check whether the batch is packed: position ids that never decrease
+    # probably mean a single sequence; otherwise several sequences are packed together. `query_length != 1` limits
+    # the check to the pre-fill/training stage. Without cu_seq_lens, build a block-diagonal causal mask to prevent cross-example attention.
+    if position_ids is not None and (
+        max_length_q is not None
+        or (query_length != 1 and not (torch.diff(position_ids, dim=-1) >= 0).all())
+    ):
+        if cu_seq_lens_q is None or cu_seq_lens_k is None:
+            cu_seq_lens_q = get_cu_seqlens_from_pos_ids(position_ids)[0]
+            cu_seq_lens_q = cu_seq_lens_q.squeeze()
+            seq_lengths = cu_seq_lens_q[1:] - cu_seq_lens_q[:-1]
+            attn_bias = (
+                xformers.ops.fmha.attn_bias.BlockDiagonalCausalMask.from_seqlens(
+                    q_seqlen=seq_lengths.tolist(),
+                )
+            )
+        else:
+            query = query.reshape(-1, query.size(-2), query.size(-1))
+            key = key.reshape(-1, key.size(-2), key.size(-1))
+            value = value.reshape(-1, value.size(-2), value.size(-1))
+
+        # Handle GQA
+        if is_gqa:
+            key = key.repeat_interleave(n_groups, dim=2)
+            value = value.repeat_interleave(n_groups, dim=2)
+
+    elif attention_mask is not None:
+        query, key, value, _, cu_seq_lens, _ = _upad_input(
+            query, key, value, attention_mask, query_length
+        )
+        cu_seq_lens_q, cu_seq_lens_k = cu_seq_lens
+        seq_lengths = []
+        for i in range(len(cu_seq_lens_q) - 1):
+            seq_lengths.append(cu_seq_lens_q[i + 1] - cu_seq_lens_q[i])
+        attn_bias = xformers.ops.fmha.attn_bias.BlockDiagonalCausalMask.from_seqlens(
+            q_seqlen=seq_lengths,
+            kv_seqlen=seq_lengths,
+        )
+
+        # Handle GQA
+        if is_gqa:
+            key = key.repeat_interleave(n_groups, dim=2)
+            value = value.repeat_interleave(n_groups, dim=2)
+    else:
+        # Handle Group Query Attention (GQA) using view/expand approach from reference
+        key = key.view(batch_size, key_length, num_key_value_heads, 1, head_dim)
+        value = value.view(batch_size, key_length, num_key_value_heads, 1, head_dim)
+        key = key.expand(
+            batch_size, key_length, num_key_value_heads, n_groups, head_dim
+        )
+        value = value.expand(
+            batch_size, key_length, num_key_value_heads, n_groups, head_dim
+        )
+
+        if module.training:
+            key = key.reshape(batch_size, key_length, num_attention_heads, head_dim)
+            value = value.reshape(batch_size, key_length, num_attention_heads, head_dim)
+
+            if has_sliding_window:
+                query = query.view(
+                    1, batch_size * query_length, num_attention_heads, head_dim
+                )
+                key = key.view(
+                    1, batch_size * key_length, num_attention_heads, head_dim
+                )
+                value = value.view(
+                    1, batch_size * key_length, num_attention_heads, head_dim
+                )
+        else:
+            query = query.view(
+                batch_size, query_length, num_key_value_heads, n_groups, head_dim
+            )
+
+            # If we need a sliding window attention
+            if has_sliding_window:
+                query = query.view(
+                    1,
+                    batch_size * query_length,
+                    num_key_value_heads,
+                    n_groups,
+                    head_dim,
+                )
+                key = key.view(
+                    1, batch_size * key_length, num_key_value_heads, n_groups, head_dim
+                )
+                value = value.view(
+                    1, batch_size * key_length, num_key_value_heads, n_groups, head_dim
+                )
+
+    # Run the xformers attention
+    attn_output = xformers_attention(
+        query,
+        key,
+        value,
+        attn_bias=attn_bias,
+    )
+
+    attn_output = attn_output.view(
+        batch_size, -1, attn_output.size(-2), attn_output.size(-1)
+    )
+    return attn_output, None
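Reviewer note: the packed branch of `xformers_attention_forward` turns cumulative sequence boundaries into per-sequence lengths before building the block-diagonal mask. A minimal pure-Python sketch of that bookkeeping follows. Both helper names are hypothetical, and it assumes every packed sequence restarts its position ids at 0; the patch itself relies on `get_cu_seqlens_from_pos_ids` and tensors instead.

```python
def cu_seqlens_from_position_ids(position_ids: list[int]) -> list[int]:
    """Hypothetical helper: cumulative boundaries of packed sequences.

    Assumes each packed sequence restarts its position ids at 0.
    """
    starts = [i for i, pos in enumerate(position_ids) if pos == 0]
    return starts + [len(position_ids)]


def seq_lengths_from_cu_seqlens(cu_seqlens: list[int]) -> list[int]:
    """Same arithmetic as `cu_seq_lens_q[1:] - cu_seq_lens_q[:-1]` in the patch."""
    return [end - start for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:])]


# Three sequences of lengths 3, 4 and 2 packed into one row.
packed = [0, 1, 2, 0, 1, 2, 3, 0, 1]
cu = cu_seqlens_from_position_ids(packed)
lengths = seq_lengths_from_cu_seqlens(cu)
print(cu, lengths)
```

The resulting lengths are what a block-diagonal mask constructor would consume, so tokens only attend within their own sequence.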
--- a/src/axolotl/monkeypatch/lora_kernels.py
+++ b/src/axolotl/monkeypatch/lora_kernels.py
@@ -23,22 +23,42 @@ from axolotl.utils.dict import DictDefault

 LOG = get_logger(__name__)

-ORIGINAL_QKV_CODE = """
+QKV_PATCHES = [
+    (
+        """
    query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
    key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
    value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
 """.lstrip(
-    "\n"
-)
-
-PATCHED_QKV_CODE = """
+            "\n"
+        ),
+        """
    query_states, key_states, value_states = self.apply_qkv(hidden_states)
    query_states = query_states.view(hidden_shape).transpose(1, 2)
    key_states = key_states.view(hidden_shape).transpose(1, 2)
    value_states = value_states.view(hidden_shape).transpose(1, 2)
 """.lstrip(
-    "\n"
-)
+            "\n"
+        ),
+    ),
+    (
+        """
+    query_states = self.q_norm(self.q_proj(hidden_states).view(hidden_shape)).transpose(1, 2)
+    key_states = self.k_norm(self.k_proj(hidden_states).view(hidden_shape)).transpose(1, 2)
+    value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+""".lstrip(
+            "\n"
+        ),
+        """
+    query_states, key_states, value_states = self.apply_qkv(hidden_states)
+    query_states = self.q_norm(query_states.view(hidden_shape)).transpose(1, 2)
+    key_states = self.k_norm(key_states.view(hidden_shape)).transpose(1, 2)
+    value_states = value_states.view(hidden_shape).transpose(1, 2)
+""".lstrip(
+            "\n"
+        ),
+    ),
+]

 ORIGINAL_O_CODE = """
    attn_output = self.o_proj(attn_output)
@@ -128,10 +148,11 @@ def get_attention_cls_from_config(cfg: DictDefault) -> Type[nn.Module]:
    try:
        # Dynamically import the module and attention class
        module_path = f"transformers.models.{model_type}.modeling_{model_type}"
-        module = __import__(
-            module_path, fromlist=[f"{model_type.capitalize()}Attention"]
+        model_cls_prefix = "".join(
+            [part.capitalize() for part in model_type.split("_")]
        )
-        attention_cls = getattr(module, f"{model_type.capitalize()}Attention")
+        module = __import__(module_path, fromlist=[f"{model_cls_prefix}Attention"])
+        attention_cls = getattr(module, f"{model_cls_prefix}Attention")

        return attention_cls
    except (ImportError, AttributeError) as e:
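Reviewer note: the `model_cls_prefix` change in `get_attention_cls_from_config` matters for model types that contain underscores. A minimal sketch of the difference, using a hypothetical helper `attention_cls_name` that is not part of the patch:

```python
def attention_cls_name(model_type: str) -> str:
    """Hypothetical helper mirroring the patched prefix logic."""
    # Capitalize each underscore-separated part, then join them.
    prefix = "".join(part.capitalize() for part in model_type.split("_"))
    return f"{prefix}Attention"


old_style = "qwen3_moe".capitalize() + "Attention"  # pre-patch behavior
new_style = attention_cls_name("qwen3_moe")
print(old_style, new_style)
```

With the old `str.capitalize()` call the underscore survives in the class name, so the `getattr` lookup would likely fail for such model types.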
@@ -168,10 +189,18 @@ def patch_self_attn_lora(cfg: DictDefault):
    attention_cls._original_forward = self_attn_forward
    self_attn_forward, _ = detab_code(self_attn_forward)

-    assert ORIGINAL_QKV_CODE in self_attn_forward, "Original QKV code not found"
+    assert any(
+        qkv_options[0] in self_attn_forward for qkv_options in QKV_PATCHES
+    ), "Original QKV code not found"
    assert ORIGINAL_O_CODE in self_attn_forward, "Original O code not found"

-    self_attn_forward = self_attn_forward.replace(ORIGINAL_QKV_CODE, PATCHED_QKV_CODE)
+    for qkv_orig, qkv_patched in QKV_PATCHES:
+        if qkv_orig in self_attn_forward:
+            self_attn_forward = self_attn_forward.replace(
+                qkv_orig,
+                qkv_patched,
+            )
+            break
    self_attn_forward = self_attn_forward.replace(ORIGINAL_O_CODE, PATCHED_O_CODE)
    self_attn_forward = self_attn_forward.replace(
        "def forward(",
--- a/src/axolotl/monkeypatch/multipack.py
+++ b/src/axolotl/monkeypatch/multipack.py
@@ -18,6 +18,8 @@ SUPPORTED_MULTIPACK_MODEL_TYPES = [
    "mixtral",
    "qwen2",
    "qwen2_moe",
+    "qwen3",
+    "qwen3_moe",
    "falcon",
    "phi",
    "phi3",
--- a/src/axolotl/monkeypatch/peft/__init__.py
+++ b/src/axolotl/monkeypatch/peft/__init__.py
--- a/src/axolotl/monkeypatch/peft/utils.py
+++ b/src/axolotl/monkeypatch/peft/utils.py
@@ -0,0 +1,78 @@
+"""
+Patch prepare_model_for_kbit_training to not upcast everything
+"""
+
+import inspect
+import logging
+
+import peft
+
+import axolotl
+from axolotl.monkeypatch.utils import detab_code
+
+LOG = logging.getLogger(__name__)
+
+ORIGINAL_PREPARE_CODE = """
+        for param in model.parameters():
+            if (
+                (param.dtype == torch.float16) or (param.dtype == torch.bfloat16)
+            ) and param.__class__.__name__ != "Params4bit":
+                param.data = param.data.to(torch.float32)
+"""
+
+PATCHED_PREPARE_CODE = """
+        for name, param in model.named_parameters():
+            if (
+                (param.dtype == torch.float16) or (param.dtype == torch.bfloat16)
+            ) and param.__class__.__name__ != "Params4bit" and all(embed_name not in name for embed_name in ["embed_tokens", "lm_head"]):
+                param.data = param.data.to(torch.float32)
+"""
+
+
+def get_peft_prep_code() -> str:
+    prepare = inspect.getsource(peft.utils.other.prepare_model_for_kbit_training)
+    return prepare
+
+
+def check_peft_prep_code_is_patchable() -> bool:
+    prep_code = get_peft_prep_code()
+    prep_code, _ = detab_code(prep_code)
+    return ORIGINAL_PREPARE_CODE in prep_code
+
+
+def patch_peft_prep_code():
+    """
+    monkeypatch prepare_model_for_kbit_training so it does not upcast embed_tokens and lm_head to float32
+    """
+
+    try:
+        prep_code = get_peft_prep_code()
+    except OSError:
+        return
+    peft.utils.other._original_create_accelerator_and_postprocess = (  # pylint: disable=protected-access
+        prep_code
+    )
+    prep_code, _ = detab_code(prep_code)
+    if ORIGINAL_PREPARE_CODE not in prep_code:
+        return
+
+    prep_code = prep_code.replace(ORIGINAL_PREPARE_CODE, PATCHED_PREPARE_CODE)
+    prep_code = prep_code.replace(
+        "def prepare_model_for_kbit_training(",
+        "def fixed_prepare_model_for_kbit_training(",
+        1,
+    )
+
+    items_to_import = []
+    for item in dir(peft.utils.other):
+        if item in prep_code:
+            items_to_import.append(item)
+
+    exec(  # pylint: disable=exec-used  # nosec B102
+        "from peft.utils.other import (" + ", ".join(x for x in items_to_import) + ")",
+        globals(),
+    )
+    exec(prep_code, globals())  # pylint: disable=exec-used  # nosec B102
+    LOG.info("patching prepare_model_for_kbit_training to skip upcasting embed_tokens and lm_head")
+    peft.utils.other.prepare_model_for_kbit_training = fixed_prepare_model_for_kbit_training  # pylint: disable=protected-access  # pylint: disable=undefined-variable  # noqa: F821
+    axolotl.utils.models.prepare_model_for_kbit_training = fixed_prepare_model_for_kbit_training  # pylint: disable=protected-access  # pylint: disable=undefined-variable  # noqa: F821
--- a/src/axolotl/monkeypatch/trainer/__init__.py
+++ b/src/axolotl/monkeypatch/trainer/__init__.py
--- a/src/axolotl/monkeypatch/trainer/lr.py
+++ b/src/axolotl/monkeypatch/trainer/lr.py
@@ -0,0 +1,42 @@
+"""
+monkeypatch for Trainer _get_learning_rate method
+"""
+
+import logging
+
+import torch
+
+LOG = logging.getLogger(__name__)
+
+
+# TODO remove this patch once https://github.com/huggingface/transformers/pull/37881 is included in a release
+def _get_learning_rate(self):
+    if self.is_deepspeed_enabled:
+        # with deepspeed's fp16 and dynamic loss scale enabled the optimizer/scheduler steps may
+        # not run for the first few dozen steps while loss scale is too large, and thus during
+        # that time `get_last_lr` will fail if called during that warm up stage, so work around it:
+        try:
+            last_lr = self.lr_scheduler.get_last_lr()[0]
+        except AssertionError as e:
+            if "need to call step" in str(e):
+                LOG.warning(
+                    "tried to get lr value before scheduler/optimizer started stepping, returning lr=0"
+                )
+                last_lr = 0
+            else:
+                raise
+    else:
+        if isinstance(self.lr_scheduler, torch.optim.lr_scheduler.ReduceLROnPlateau):
+            last_lr = self.optimizer.param_groups[0]["lr"]
+        else:
+            last_lr = self.lr_scheduler.get_last_lr()[0]
+
+    if torch.is_tensor(last_lr):
+        last_lr = last_lr.item()
+    return last_lr
+
+
+def patch_trainer_get_lr():
+    from transformers.trainer import Trainer
+
+    Trainer._get_learning_rate = _get_learning_rate  # pylint: disable=protected-access
--- a/src/axolotl/prompt_strategies/chat_template.py
+++ b/src/axolotl/prompt_strategies/chat_template.py
@@ -4,7 +4,7 @@ HF Chat Templates prompt strategy

 import logging
 from collections import defaultdict
-from typing import Any, Dict, List, Optional, Set, Union
+from typing import Any, Dict, List, Set, Union

 from pydantic import BaseModel
 from transformers import ProcessorMixin
@@ -29,12 +29,12 @@ class ChatTemplatePrompter(Prompter):
        chat_template: str,
        processor=None,
        max_length=2048,
-        message_property_mappings: Optional[Dict[str, str]] = None,
-        message_field_training: Optional[str] = None,
-        message_field_training_detail: Optional[str] = None,
+        message_property_mappings: Dict[str, str] | None = None,
+        message_field_training: str | None = None,
+        message_field_training_detail: str | None = None,
        field_messages: str = "messages",
        field_system: str = "system",
-        roles: Optional[Dict[str, List[str]]] = None,
+        roles: Dict[str, List[str]] | None = None,
        drop_system_message: bool = False,
    ):
        # check if message_property_mappings is None or empty dict
@@ -42,6 +42,7 @@ class ChatTemplatePrompter(Prompter):
            message_property_mappings = {
                "role": "role",
                "content": "content",
+                "reasoning_content": "reasoning_content",
            }

        if roles:
@@ -65,7 +66,7 @@ class ChatTemplatePrompter(Prompter):
        self.field_messages = field_messages
        self.field_system = field_system
        self.tokenizer = tokenizer
-        self.processor: Optional[ProcessorMixin] = processor
+        self.processor: ProcessorMixin | None = processor
        self.chat_template = chat_template
        self.max_length = max_length
        self.drop_system_message = drop_system_message
@@ -224,11 +225,11 @@ class ChatTemplateStrategy(PromptTokenizingStrategy):
        tokenizer,
        train_on_inputs: bool,
        sequence_len: int,
-        roles_to_train: Optional[List[str]] = None,
-        train_on_eos: Optional[str] = None,
-        train_on_eot: Optional[str] = None,
-        eot_tokens: Optional[List[str]] = None,
-        split_thinking: Optional[bool] = False,
+        roles_to_train: list[str] | None = None,
+        train_on_eos: str | None = None,
+        train_on_eot: str | None = None,
+        eot_tokens: list[str] | None = None,
+        split_thinking: bool | None = False,
    ):
        super().__init__(prompter, tokenizer, train_on_inputs, sequence_len)
        self.prompter: ChatTemplatePrompter = prompter
@@ -661,16 +662,46 @@ class ChatTemplateStrategy(PromptTokenizingStrategy):
        # if the role is assistant that we want to use reasoning_content
        if self.split_thinking and transformed_message["role"] == "assistant":
            content = transformed_message["content"]
-            pairs = [("<think>", "</think>"), ("<reasoning>", "</reasoning>")]
-            for pair in pairs:
-                if pair[0] in content and pair[1] in content:
-                    start_idx = content.find(pair[0])
-                    end_idx = content.find(pair[1])
-                    thinking_content = content[start_idx + len(pair[0]) : end_idx]
+            thinking_pairs = [
+                ("<think>", "</think>"),
+                ("<reasoning>", "</reasoning>"),
+                ("<|begin_of_thought|>", "<|end_of_thought|>"),
+            ]
+            content_pairs = [("<|begin_of_solution|>", "<|end_of_solution|>")]
+            for tpair in thinking_pairs:
+                # check if the thinking pair is in the content
+                if tpair[0] in content and tpair[1] in content:
+                    # find the start and end index of the thinking pair
+                    t_start_idx = content.find(tpair[0])
+                    t_end_idx = content.find(tpair[1])
+
+                    # get the thinking content
+                    thinking_content = content[t_start_idx + len(tpair[0]) : t_end_idx]
                    transformed_message["reasoning_content"] = thinking_content.strip()
-                    transformed_message["content"] = content[
-                        end_idx + len(pair[1]) :
-                    ].lstrip()
+
+                    # take remainder of the content
+                    # strip whitespace from beginning of the remainder (thinking tokens)
+                    remainder = content[t_end_idx + len(tpair[1]) :].lstrip()
+
+                    # check if the content pair is in the remainder
+                    cpair_found = False
+                    for cpair in content_pairs:
+                        if cpair[0] in remainder and cpair[1] in remainder:
+                            # find the start and end index of the content pair
+                            c_start_idx = remainder.find(cpair[0])
+                            c_end_idx = remainder.find(cpair[1])
+
+                            # get the solution content
+                            content_content = remainder[
+                                c_start_idx + len(cpair[0]) : c_end_idx
+                            ]
+                            transformed_message["content"] = content_content.strip()
+                            cpair_found = True
+                            break
+
+                    # else, the content is the remainder
+                    if not cpair_found:
+                        transformed_message["content"] = remainder
                    break

        # Determine which keys in the original message were not mapped
@@ -714,7 +745,7 @@ class StrategyLoader:
        self,
        tokenizer,
        cfg,
-        ds_cfg: Optional[Union[Dict[str, Any], DatasetConfig]] = None,
+        ds_cfg: Union[Dict[str, Any], DatasetConfig] | None = None,
        processor=None,
    ):
        if ds_cfg is None:
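Reviewer note: the reworked `split_thinking` branch in `ChatTemplateStrategy` can be exercised in isolation. The sketch below is a standalone approximation with a hypothetical function name; it returns a plain dict instead of mutating `transformed_message`, but follows the same first-match logic for thinking and solution tags.

```python
THINKING_PAIRS = [
    ("<think>", "</think>"),
    ("<reasoning>", "</reasoning>"),
    ("<|begin_of_thought|>", "<|end_of_thought|>"),
]
CONTENT_PAIRS = [("<|begin_of_solution|>", "<|end_of_solution|>")]


def split_thinking(content: str) -> dict:
    """Hypothetical standalone version of the split_thinking branch."""
    message = {"content": content}
    for t_open, t_close in THINKING_PAIRS:
        if t_open in content and t_close in content:
            t_start = content.find(t_open)
            t_end = content.find(t_close)
            message["reasoning_content"] = content[t_start + len(t_open) : t_end].strip()

            # Everything after the closing thinking tag is the candidate answer.
            remainder = content[t_end + len(t_close) :].lstrip()
            message["content"] = remainder

            # If the answer is wrapped in solution tags, keep only the inside.
            for c_open, c_close in CONTENT_PAIRS:
                if c_open in remainder and c_close in remainder:
                    c_start = remainder.find(c_open)
                    c_end = remainder.find(c_close)
                    message["content"] = remainder[c_start + len(c_open) : c_end].strip()
                    break
            break
    return message


tagged = split_thinking(
    "<|begin_of_thought|> add 2 and 2 <|end_of_thought|>\n"
    "<|begin_of_solution|> 4 <|end_of_solution|>"
)
simple = split_thinking("<think>hmm</think> answer")
print(tagged, simple)
```

Messages without any recognized tag pair pass through unchanged, which matches the behavior of the loop in the patch when no pair is found.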
--- a/src/axolotl/train.py
+++ b/src/axolotl/train.py
@@ -2,17 +2,17 @@

 import importlib
 import inspect
+import logging
 import os
 import signal
 import sys
 import weakref
-from contextlib import nullcontext
+from contextlib import ExitStack
 from pathlib import Path
 from typing import Any, Dict

 import torch
 import transformers.modelcard
-from accelerate.logging import get_logger
 from accelerate.utils import save_fsdp_model
 from datasets import Dataset
 from huggingface_hub.errors import OfflineModeIsEnabled
@@ -21,20 +21,19 @@ from transformers import PreTrainedModel, PreTrainedTokenizer, ProcessorMixin
 from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled
 from transformers.trainer import Trainer

+from axolotl.cli.art import print_axolotl_text_art
 from axolotl.common.datasets import TrainDatasetMeta
 from axolotl.contribs.lgpl import (  # pylint: disable = no-name-in-module
    fix_untrained_tokens,
 )
 from axolotl.core.trainer_builder import HFCausalTrainerBuilder, HFRLTrainerBuilder
-from axolotl.core.trainers.mixins.sequence_parallel import (
-    SequenceParallelContextManager,
-)
 from axolotl.integrations.base import PluginManager
-from axolotl.logging_config import configure_logging
+from axolotl.utils.ctx_managers.sequence_parallel import SequenceParallelContextManager
 from axolotl.utils.dict import DictDefault
 from axolotl.utils.distributed import cleanup_distributed
 from axolotl.utils.freeze import freeze_layers_except
 from axolotl.utils.models import load_model, load_processor, load_tokenizer
+from axolotl.utils.schemas.enums import RLType
 from axolotl.utils.trainer import setup_trainer

 try:
@@ -42,8 +41,7 @@ try:
 except ImportError:
    BetterTransformer = None

-configure_logging()
-LOG = get_logger(__name__)
+LOG = logging.getLogger(__name__)


 def setup_model_and_tokenizer(
@@ -64,7 +62,6 @@ def setup_model_and_tokenizer(
    # Load tokenizer
    LOG.debug(
        f"loading tokenizer... {cfg.tokenizer_config or cfg.base_model_config}",
-        main_process_only=True,
    )
    tokenizer = load_tokenizer(cfg)

@@ -109,7 +106,7 @@ def setup_reference_model(
        Reference model if needed for RL training, `None` otherwise.
    """
    model_ref = None
-    if cfg.rl and cfg.rl != "orpo":
+    if cfg.rl and cfg.rl != RLType.ORPO:
        if cfg.adapter and not cfg.rl_adapter_ref_model:
            # use built-in trl autounwrap
            LOG.debug("Passing model_ref: None to RL trainer")
@@ -190,28 +187,32 @@ def execute_training(
        trainer: The configured trainer object.
        resume_from_checkpoint: Path to checkpoint to resume from, if applicable.
    """
-    # Define the context managers to use
-    flash_context = (
-        torch.backends.cuda.sdp_kernel(
-            enable_flash=True,
-            enable_math=True,
-            enable_mem_efficient=True,
-        )
-        if cfg.flash_optimum
-        else nullcontext()
-    )
-    sequence_parallel_context = (
-        SequenceParallelContextManager(
-            model=trainer.model,
-            sequence_parallel_degree=cfg.sequence_parallel_degree,
-            ring_attn_func=cfg.ring_attn_func,
-        )
-        if cfg.sequence_parallel_degree > 1
-        else nullcontext()
-    )
+    with ExitStack() as stack:
+        # Define the context managers to use
+        if cfg.flash_optimum:
+            stack.enter_context(
+                torch.backends.cuda.sdp_kernel(
+                    enable_flash=True,
+                    enable_math=True,
+                    enable_mem_efficient=True,
+                )
+            )

-    LOG.info("Starting trainer...")
-    with flash_context, sequence_parallel_context:
+        if cfg.sequence_parallel_degree > 1:
+            models = [trainer.model]
+            if hasattr(trainer, "ref_model"):
+                models.append(trainer.ref_model)
+
+            stack.enter_context(
+                SequenceParallelContextManager(
+                    models=models,
+                    sequence_parallel_degree=cfg.sequence_parallel_degree,
+                    gradient_accumulation_steps=cfg.gradient_accumulation_steps,
+                    ring_attn_func=cfg.ring_attn_func,
+                )
+            )
+
+        LOG.info("Starting trainer...")
        trainer.train(resume_from_checkpoint=resume_from_checkpoint)


@@ -288,15 +289,32 @@ def save_trained_model(
                os.remove(os.path.join(cfg.output_dir, "model.safetensors"))
            except FileNotFoundError:
                pass
-    elif cfg.local_rank == 0:
-        if cfg.flash_optimum and BetterTransformer:
-            model = BetterTransformer.reverse(model)
+    else:
+        if cfg.local_rank == 0:
+            if cfg.flash_optimum and BetterTransformer:
+                model = BetterTransformer.reverse(model)

-        if cfg.rl and cfg.adapter and not cfg.rl_adapter_ref_model:
-            trainer.model.save_pretrained(
-                cfg.output_dir, safe_serialization=safe_serialization
-            )
-        model.save_pretrained(cfg.output_dir, safe_serialization=safe_serialization)
+            if cfg.rl and cfg.adapter and not cfg.rl_adapter_ref_model:
+                trainer.model.save_pretrained(
+                    cfg.output_dir, safe_serialization=safe_serialization
+                )
+
+            model.save_pretrained(cfg.output_dir, safe_serialization=safe_serialization)
+        trainer.accelerator.wait_for_everyone()
+
+    if hasattr(cfg, "llmcompressor") and cfg.llmcompressor:
+        # TODO: add integration support so this can be implemented completely within the plugin
+        from axolotl.integrations.llm_compressor.utils import (
+            save_compressed_model,
+        )
+
+        save_compressed_model(
+            model=model,
+            output_dir=cfg.output_dir,
+            trainer=trainer,
+            safe_serialization=safe_serialization,
+            save_compressed=cfg.llmcompressor.save_compressed,
+        )


 def create_model_card(cfg: DictDefault, trainer: Trainer):
@@ -503,6 +521,8 @@ def train(
    Returns:
        Tuple of (model, tokenizer) after training
    """
+    print_axolotl_text_art()
+
    # Setup model, tokenizer, (causal or RLHF) trainer, etc.
    (
        trainer,
@@ -512,6 +532,9 @@ def train(
        processor,
    ) = setup_model_and_trainer(cfg, dataset_meta)

+    plugin_manager = PluginManager.get_instance()
+    plugin_manager.post_trainer_create(cfg, trainer)
+
    # Handle untrained tokens if configured
    safe_serialization = cfg.save_safetensors is True
    train_dataset = dataset_meta.train_dataset
@@ -534,7 +557,6 @@ def train(
    if not cfg.use_ray:
        cleanup_distributed()

-    plugin_manager = PluginManager.get_instance()
    plugin_manager.post_train(cfg, model)

    return model, tokenizer, trainer
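Reviewer note: `execute_training` now enters its context managers conditionally through `contextlib.ExitStack` rather than pairing each one with `nullcontext()`. A minimal sketch of the pattern, where `tracked` and `run` are hypothetical stand-ins for the real SDP-kernel and sequence-parallel contexts:

```python
from contextlib import ExitStack, contextmanager


@contextmanager
def tracked(name: str, log: list):
    """Hypothetical context manager that records enter/exit order."""
    log.append(f"enter {name}")
    try:
        yield
    finally:
        log.append(f"exit {name}")


def run(use_flash: bool, use_sequence_parallel: bool) -> list:
    log: list = []
    with ExitStack() as stack:
        # Only the contexts that are actually needed get entered.
        if use_flash:
            stack.enter_context(tracked("flash", log))
        if use_sequence_parallel:
            stack.enter_context(tracked("sequence_parallel", log))
        log.append("train")
    return log


both = run(True, True)
neither = run(False, False)
print(both, neither)
```

Contexts are exited in reverse order of entry, so the sequence-parallel context unwinds before the flash context, just as with nested `with` statements.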
--- a/src/axolotl/utils/__init__.py
+++ b/src/axolotl/utils/__init__.py
@@ -43,3 +43,12 @@ def set_pytorch_cuda_alloc_conf():
            os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
                "expandable_segments:True,roundup_power2_divisions:16"
            )
+
+
+def patch_optimized_env():
+    """
+    Patch environment variables to improve VRAM usage and increase download speed
+    """
+    if os.getenv("HF_HUB_ENABLE_HF_TRANSFER") is None:
+        os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
+    set_pytorch_cuda_alloc_conf()
--- a/src/axolotl/utils/callbacks/__init__.py
+++ b/src/axolotl/utils/callbacks/__init__.py
@@ -3,6 +3,7 @@
 from __future__ import annotations

 import gc
+import json
 import logging
 import os
 import traceback
@@ -808,11 +809,44 @@ class SaveAxolotlConfigtoWandBCallback(TrainerCallback):
                    artifact.add_file(temp_file.name)
                    wandb.log_artifact(artifact)
                    wandb.save(temp_file.name)
-                LOG.info(
-                    "The Axolotl config has been saved to the WandB run under files."
-                )
+                    LOG.info(
+                        "The Axolotl config has been saved to the WandB run under files."
+                    )
            except (FileNotFoundError, ConnectionError) as err:
                LOG.warning(f"Error while saving Axolotl config to WandB: {err}")
+
+            if args.deepspeed:
+                try:
+                    # sync config to top level in run; the file cannot be deleted right away because wandb schedules it for syncing even with policy='now', so let the OS delete it later.
+                    with NamedTemporaryFile(
+                        mode="w",
+                        delete=False,
+                        suffix=".json",
+                        prefix="deepspeed_config_",
+                    ) as temp_file:
+                        skip_upload = False
+                        if isinstance(args.deepspeed, dict):
+                            json.dump(args.deepspeed, temp_file, indent=4)
+                        elif isinstance(args.deepspeed, str) and os.path.exists(
+                            args.deepspeed
+                        ):
+                            copyfile(args.deepspeed, temp_file.name)
+                        else:
+                            skip_upload = True
+                        if not skip_upload:
+                            artifact = wandb.Artifact(
+                                f"deepspeed-config-{wandb.run.id}",
+                                type="deepspeed-config",
+                            )
+                            artifact.add_file(temp_file.name)
+                            wandb.log_artifact(artifact)
+                            wandb.save(temp_file.name)
+                            LOG.info(
+                                "The DeepSpeed config has been saved to the WandB run under files."
+                            )
+                except (FileNotFoundError, ConnectionError) as err:
+                    LOG.warning(f"Error while saving DeepSpeed config to WandB: {err}")
+
        return control


@@ -834,3 +868,28 @@ class GCCallback(TrainerCallback):
    ):
        torch.cuda.empty_cache()
        gc.collect()
+
+
+def colab_inference_post_train_callback(trainer: Trainer):
+    class ColabCallback(TrainerCallback):
+        """Callback to prep model for inference on Google Colab"""
+
+        def __init__(self, cfg):
+            self.gpu_name = torch.cuda.get_device_name(0)
+            self.cfg = cfg
+
+        def on_train_end(
+            self, args, state, control, **kwargs
+        ):  # pylint: disable=unused-argument
+            """
+            handle T4 gpu, we need to convert attention to eager for inference
+            """
+            if "Tesla T4" in self.gpu_name and self.cfg.xformers_attention:
+                trainer.model.config._attn_implementation = (  # pylint: disable=protected-access
+                    "eager"
+                )
+            trainer.model.gradient_checkpointing_disable()
+            trainer.model.config.use_cache = True
+            trainer.model.eval()
+
+    return ColabCallback
--- a/src/axolotl/utils/config/__init__.py
+++ b/src/axolotl/utils/config/__init__.py
@@ -59,7 +59,7 @@ def choose_device(cfg):

 def resolve_dtype(cfg):
    if (
-        cfg.bf16 == "auto" and not cfg.use_ray
+        not cfg.fp16 and cfg.bf16 == "auto" and not cfg.use_ray
    ):  # if we use ray we want to defer this check to the worker node
        if is_torch_bf16_gpu_available():
            LOG.debug("bf16 support detected, enabling for this configuration.")
@@ -67,9 +67,12 @@ def resolve_dtype(cfg):
        else:
            LOG.debug("bf16 support not detected, disabling for this configuration.")
            cfg.bf16 = False
-            if cfg.fp16 is None:
+            if cfg.fp16 is None and not cfg.float16:
                cfg.fp16 = True

+    if cfg.fp16 and cfg.bf16 == "auto":
+        cfg.bf16 = False
+
    if cfg.device == "mps":
        cfg.load_in_8bit = False
        cfg.tf32 = False
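Reviewer note: the `resolve_dtype` change makes an explicit `fp16` setting win over `bf16: auto`. The sketch below restates that decision as a pure function. The function name and the `bf16_available` flag are hypothetical (the patch calls `is_torch_bf16_gpu_available()`), and setting `bf16 = True` in the detected branch is an assumption inferred from the log message, since that line falls outside the hunk.

```python
def resolve_bf16_fp16(bf16, fp16, float16=None, bf16_available=False, use_ray=False):
    """Hypothetical pure-function restatement of the bf16/fp16 resolution."""
    if not fp16 and bf16 == "auto" and not use_ray:
        if bf16_available:
            bf16 = True  # assumption: not visible in the hunk
        else:
            bf16 = False
            # Fall back to fp16 only if the user did not ask for plain float16.
            if fp16 is None and not float16:
                fp16 = True

    # An explicit fp16 request overrides bf16 auto-detection.
    if fp16 and bf16 == "auto":
        bf16 = False
    return bf16, fp16


explicit_fp16 = resolve_bf16_fp16("auto", True, bf16_available=True)
no_bf16_gpu = resolve_bf16_fp16("auto", None, bf16_available=False)
print(explicit_fp16, no_bf16_gpu)
```

Before the patch, `fp16: true` combined with `bf16: auto` on a bf16-capable GPU would have enabled both flags; the added guard avoids that conflict.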
--- a/src/axolotl/utils/ctx_managers/__init__.py
+++ b/src/axolotl/utils/ctx_managers/__init__.py
@@ -0,0 +1,6 @@
+"""Init for context manager submodule"""
+
+# pylint: disable=unused-import
+# flake8: noqa
+
+from .sequence_parallel import SequenceParallelContextManager
--- a/src/axolotl/utils/ctx_managers/sequence_parallel.py
+++ b/src/axolotl/utils/ctx_managers/sequence_parallel.py
@@ -0,0 +1,335 @@
+"""Module for Axolotl trainer sequence parallelism manager and utilities"""
+
+import functools
+
+import torch
+import torch.distributed as dist
+from torch import nn
+from torch.utils.hooks import RemovableHandle
+from transformers.modeling_outputs import CausalLMOutputWithPast
+from transformers.utils import ModelOutput
+
+from axolotl.monkeypatch.attention.ring_attn.patch import (
+    get_ring_attn_group,
+    update_ring_attn_params,
+)
+from axolotl.utils.schemas.enums import RingAttnFunc
+
+
+# TODO(djsaunde): implement zigzag, stripe patterns here (and elsewhere) in this
+# module. Currently, we just focus on batch ring and varlen llama3 for simplicity.
+def apply_sequence_parallelism(
+    batch: dict[str, torch.Tensor],
+    local_rank: int,
+    local_world_size: int,
+    gradient_accumulation_steps: int,
+    ring_attn_func: RingAttnFunc,  # pylint: disable=unused-argument
+) -> tuple[dict[str, torch.Tensor], int, int]:
+    """
+    Apply sequence parallelism slicing to a batch.
+
+    Special handling is implemented for integer logits_to_keep, which indicates
+    to only keep the last N tokens in the sequence during generation.
+
+    Args:
+        batch: Batch dictionary (e.g., input_ids, attention_mask, etc.).
+        local_rank: Local rank in the sequence parallel group.
+        local_world_size: World size of the sequence parallel group.
+        gradient_accumulation_steps: Number of steps to accumulate gradients over.
+        ring_attn_func: Which ring attention function to use. Currently unused, but
+            related to above TODO.
+
+    Returns:
+        tuple of:
+            - Batch dictionary with sliced tensors.
+            - The original sequence length before padding.
+            - The number of padding tokens added.
+    """
+    original_seq_len = batch["input_ids"].size(1)
+
+    # Update ring attention params if needed
+    if batch.get("position_ids") is not None:
+        update_ring_attn_params(position_ids=batch["position_ids"])
+    else:
+        # If position_ids aren't already in the batch, create them
+        batch["position_ids"] = torch.arange(
+            0,
+            original_seq_len,
+            dtype=torch.long,
+            device=batch["input_ids"].device,
+        ).expand(batch["input_ids"].size(0), -1)
+
+    if "logits_to_keep" in batch and isinstance(batch["logits_to_keep"], int):
+        logits_to_keep = batch["logits_to_keep"]
+
+        # Calculate which positions in the full sequence contain the last N tokens
+        start_position = max(0, original_seq_len - logits_to_keep)
+        chunk_size = original_seq_len // local_world_size
+        rank_start = local_rank * chunk_size
+        rank_end = rank_start + chunk_size
+
+        # Create a boolean mask tensor for this rank's chunk
+        mask = torch.zeros(
+            chunk_size,
+            dtype=torch.bool,
+            device=batch["input_ids"].device,
+        )
+
+        if rank_end > start_position:
+            # Calculate how many of the last N tokens fall within this rank's range
+            tokens_in_rank = min(rank_end, original_seq_len) - max(
+                rank_start, start_position
+            )
+
+            # Calculate where these tokens start in the local chunk
+            local_start_idx = max(0, start_position - rank_start)
+
+            # Set the appropriate positions in the mask to True
+            mask[local_start_idx : local_start_idx + tokens_in_rank] = True
+
+        # Replace the integer with the boolean mask
+        batch["logits_to_keep"] = mask
+
+    # Add padding to make sequence length divisible by local_world_size
+    total_seq_len = original_seq_len
+    pad_len = 0
+    divisor = min(local_world_size, 64)
+    if total_seq_len % divisor != 0:
+        pad_len = divisor - (total_seq_len % divisor)
+
+        # Apply padding to all relevant tensors
+        for key in batch:
+            if (
+                isinstance(batch[key], torch.Tensor)
+                and batch[key].dim() > 1
+                and batch[key].size(1) == total_seq_len
+            ):
+                # Create padding tensor
+                pad_value = -100 if key == "labels" else 0
+                padding = torch.full(
+                    (batch[key].size(0), pad_len, *batch[key].shape[2:]),
+                    pad_value,
+                    dtype=batch[key].dtype,
+                    device=batch[key].device,
+                )
+
+                # Concatenate padding to the right side of the tensor
+                batch[key] = torch.cat([batch[key], padding], dim=1)
+            if key == "logits_to_keep":
+                # Create padding tensor
+                padding = torch.ones(
+                    1,
+                    dtype=batch[key].dtype,
+                    device=batch[key].device,
+                )
+
+                # Concatenate padding to the right side of the tensor
+                batch[key] = torch.cat([batch[key], padding], dim=0)
+
+        # Update the total sequence length after padding
+        total_seq_len = batch["input_ids"].size(1)
+
+    # Slice batch for sequence parallel
+    for key in batch:
+        if not isinstance(batch[key], torch.Tensor) or batch[key].dim() <= 1:
+            continue
+
+        # Split in sequential fashion and grab this rank's chunk
+        if batch[key].size(1) == total_seq_len:
+            batch[key] = (
+                batch[key].chunk(local_world_size, dim=1)[local_rank].contiguous()
+            )
+        elif key == "logits_to_keep":
+            batch[key] = (
+                batch[key].chunk(local_world_size, dim=0)[local_rank].contiguous()
+            )
+
+    # Handle num_items_in_batch
+    if "num_items_in_batch" in batch:
+        # Approximation; this is needed since num_items_in_batch may be counted
+        # across all samples in a gradient accumulated batch, not on a per-step basis.
+        batch["num_items_in_batch"] = (
+            batch["labels"] != -100
+        ).sum() * gradient_accumulation_steps
+
+    return batch, original_seq_len, pad_len
+
+
+class SequenceParallelContextManager:
+    """Context manager for sequence parallelism operations.
+
+    This class provides a context that will automatically apply sequence parallelism
+    during model forward passes using a pre-forward hook, and gather outputs from
+    across the sequence parallelism group using a post-forward hook.
+
+    Args:
+        models: List of models on which to register the sequence parallelism pre- and
+            post-forward hooks.
+        sequence_parallel_degree: Number of processes to split sequences over.
+        gradient_accumulation_steps: Number of steps to accumulate gradients over.
+        ring_attn_func: Which ring attention function to use. Currently unused.
+    """
+
+    def __init__(
+        self,
+        models: list[nn.Module],
+        sequence_parallel_degree: int,
+        gradient_accumulation_steps: int,
+        ring_attn_func: RingAttnFunc,
+    ):
+        self.models = models
+        self.sequence_parallel_degree = sequence_parallel_degree
+        self.gradient_accumulation_steps = gradient_accumulation_steps
+        self.ring_attn_func = ring_attn_func
+        self.process_group = get_ring_attn_group()
+
+        # Initialize sequence parallel group details
+        self.local_rank = dist.get_rank(self.process_group)
+        self.local_world_size = dist.get_world_size(self.process_group)
+
+        # Will store hook handles for removal
+        self.hook_handles: list[RemovableHandle] = []
+
+        # Store original sequence length and padding information
+        self.original_seq_len = 0
+        self.pad_len = 0
+
+        # Create a partially applied version of the apply_sequence_parallelism function
+        self.apply_sequence_parallelism = functools.partial(
+            apply_sequence_parallelism,
+            local_rank=self.local_rank,
+            local_world_size=self.local_world_size,
+            gradient_accumulation_steps=self.gradient_accumulation_steps,
+            ring_attn_func=self.ring_attn_func,
+        )
+
+    def __enter__(self):
+        # Forward pre-hook to apply sequence parallelism
+        def sequence_parallel_pre_hook(_, args, kwargs):
+            # Apply sequence parallelism to kwargs and get original sequence length and padding info
+            kwargs, self.original_seq_len, self.pad_len = (
+                self.apply_sequence_parallelism(batch=kwargs)
+            )
+
+            return args, kwargs
+
+        # Forward post-hook to gather outputs
+        def sequence_parallel_post_hook(_, __, output: ModelOutput) -> ModelOutput:
+            # Gather the sharded outputs
+            output = self.gather_outputs(output)
+
+            # Remove padding if it was added
+            if self.pad_len > 0:
+                for key, value in output.items():
+                    if isinstance(value, torch.Tensor) and value.dim() > 1:
+                        if value.size(1) == self.original_seq_len + self.pad_len:
+                            # Slice to remove padding
+                            output[key] = value[:, : self.original_seq_len].contiguous()
+
+            return output
+
+        # Register both hooks
+        for model in self.models:
+            self.hook_handles.append(
+                model.register_forward_pre_hook(
+                    sequence_parallel_pre_hook, with_kwargs=True
+                )
+            )
+            self.hook_handles.append(
+                model.register_forward_hook(sequence_parallel_post_hook)
+            )
+
+        return self
+
+    def __exit__(self, exc_type, exc_val, exc_tb):
+        # Remove all hooks
+        for handle in self.hook_handles:
+            handle.remove()
+        self.hook_handles = []
+
+    def gather_outputs(self, output: CausalLMOutputWithPast) -> CausalLMOutputWithPast:
+        """Gather sharded outputs from all ranks and reconstruct the full tensor."""
+        for key, value in output.items():
+            if isinstance(value, torch.Tensor) and value.dim() > 1:
+                output[key] = AllGatherWithGrad.apply(value, self.process_group)
+
+        return output
+
+
+class AllGatherWithGrad(torch.autograd.Function):
+    """Custom autograd function for all-gather to preserve gradients."""
+
+    @staticmethod
+    def forward(
+        ctx: torch.autograd.function.FunctionCtx,
+        input_tensor: torch.Tensor,
+        group: dist.ProcessGroup,
+    ) -> torch.Tensor:
+        """
+        Forward pass of all-gather of data with sequence dimension.
+
+        Args:
+            ctx: `torch.autograd` function context.
+            input_tensor: Tensor from model output with sequence dimension.
+            group: `torch.distributed` process group.
+
+        Returns:
+            Tensor from gathering the `input_tensor` from across the process group and
+                concatenating along the sequence dimension.
+        """
+        ctx.group = group
+        ctx.rank = dist.get_rank(group)
+        world_size = dist.get_world_size(group)
+
+        # Gather shape metadata
+        local_shape = torch.tensor(list(input_tensor.shape), device=input_tensor.device)
+        all_shapes = [torch.zeros_like(local_shape) for _ in range(world_size)]
+        dist.all_gather(all_shapes, local_shape, group=group)
+
+        # Store sequence lengths for backward pass
+        seq_lens = [int(shape[1].item()) for shape in all_shapes]
+        ctx.seq_lens = seq_lens
+
+        # Perform all_gather operation
+        gathered = [
+            torch.zeros(
+                tuple(shape.tolist()),
+                dtype=input_tensor.dtype,
+                device=input_tensor.device,
+            )
+            for shape in all_shapes
+        ]
+        dist.all_gather(gathered, input_tensor, group=group)
+
+        # Concatenate tensors along sequence dimension
+        result = torch.cat(gathered, dim=1)
+
+        return result
+
+    @staticmethod
+    def backward(
+        ctx: torch.autograd.function.FunctionCtx, grad_output: torch.Tensor
+    ) -> tuple[torch.Tensor, None]:
+        """
+        Backward pass for all-gather operation.
+
+        Extracts the gradient slice corresponding to this rank's original input
+        from the full gradient tensor.
+
+        Args:
+            ctx: `torch.autograd` function context.
+            grad_output: Gradient from subsequent layers with respect to the
+                concatenated output tensor.
+
+        Returns:
+            Tuple containing the gradient slice for this rank's input tensor and `None`
+                for the process group parameter which doesn't require gradients.
+        """
+        rank = ctx.rank
+        seq_lens = ctx.seq_lens
+
+        # Extract gradient for this rank's chunk
+        offset = sum(seq_lens[:rank])
+        grad_slice = grad_output[:, offset : offset + seq_lens[rank]].contiguous()
+
+        return grad_slice, None
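The pad, shard, gather, and unpad round trip performed by `apply_sequence_parallelism` and the forward hooks can be followed on plain Python lists. This is a sketch under stated assumptions, not the module's code: `pad_and_shard`, `gather_and_unpad`, and `keep_mask` are hypothetical names, lists stand in for tensors, and the divisor is taken to be `world_size` (the source uses `min(local_world_size, 64)`, which is the same value for groups of at most 64 ranks).

```python
def pad_and_shard(tokens, rank, world_size, pad_value=0):
    """Right-pad to a multiple of world_size, then return this rank's chunk."""
    pad_len = (-len(tokens)) % world_size
    padded = tokens + [pad_value] * pad_len
    chunk = len(padded) // world_size
    return padded[rank * chunk : (rank + 1) * chunk], pad_len


def gather_and_unpad(shards, original_len):
    """Concatenate per-rank chunks in rank order and drop the padding."""
    gathered = [tok for shard in shards for tok in shard]
    return gathered[:original_len]


def keep_mask(seq_len, logits_to_keep, rank, world_size):
    """Boolean mask over this rank's chunk marking the last N global tokens."""
    start_position = max(0, seq_len - logits_to_keep)
    chunk_size = seq_len // world_size
    rank_start = rank * chunk_size
    rank_end = rank_start + chunk_size
    mask = [False] * chunk_size
    if rank_end > start_position:
        tokens_in_rank = min(rank_end, seq_len) - max(rank_start, start_position)
        local_start = max(0, start_position - rank_start)
        for i in range(local_start, local_start + tokens_in_rank):
            mask[i] = True
    return mask


tokens = list(range(10))
shards = [pad_and_shard(tokens, r, 4)[0] for r in range(4)]
restored = gather_and_unpad(shards, len(tokens))
masks = [keep_mask(8, 3, r, 2) for r in range(2)]
```

Ten tokens over four ranks are padded to twelve, so only the last rank's chunk contains padding, and gathering followed by slicing to the original length restores the input. For `keep_mask`, an 8-token sequence with `logits_to_keep=3` on two ranks marks nothing on rank 0 and the last three positions on rank 1.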
--- a/src/axolotl/utils/data/rl.py
+++ b/src/axolotl/utils/data/rl.py
@@ -18,8 +18,9 @@ from axolotl.utils.data.utils import deduplicate_and_log_datasets, md5
 from axolotl.utils.dict import DictDefault
 from axolotl.utils.distributed import is_main_process, zero_first
 from axolotl.utils.models import load_tokenizer
+from axolotl.utils.schemas.enums import RLType

-LOG = logging.getLogger("axolotl")
+LOG = logging.getLogger(__name__)


 def _get_path(ds_hash, cfg):
@@ -80,7 +81,7 @@ def map_dataset(cfg, data_set, ds_transform_fn, tokenizer, **map_kwargs):
 def drop_long_rl_seq(
    sample, rl, tokenizer, sequence_len  # pylint: disable=invalid-name
 ):
-    if rl in ("dpo", "ipo", "orpo", "simpo"):
+    if rl in (RLType.DPO, RLType.IPO, RLType.ORPO, RLType.SIMPO):
        if not (
            sample.get("prompt") and sample.get("chosen") and sample.get("rejected")
        ):
@@ -100,7 +101,7 @@ def drop_long_rl_seq(
            len_prompt + len_rejected
        ) <= sequence_len

-    if rl == "kto":
+    if rl is RLType.KTO:
        if not (sample.get("prompt") and sample.get("completion")):
            raise ValueError("Prompt and completion keys are required for KTO datasets")

@@ -114,7 +115,7 @@ def drop_long_rl_seq(

        return (len_prompt + len_completion) <= sequence_len

-    if rl == "grpo":
+    if rl is RLType.GRPO:
        return True

    raise ValueError("Unknown RL type")
@@ -137,9 +138,9 @@ def load_prepare_preference_datasets(cfg):
            if _type:
                if isinstance(_type, DictDefault):
                    _type = "user_defined.default"
-                if _cfg.rl == "orpo":
+                if _cfg.rl is RLType.ORPO:
                    ds_transform_fn = load_orpo(_type, _cfg, dataset_idx=i)
-                elif _cfg.rl == "kto":
+                elif _cfg.rl is RLType.KTO:
                    ds_transform_fn = load_kto(_type, _cfg, dataset_idx=i)
                else:
                    ds_transform_fn = load_dpo(_type, _cfg, dataset_idx=i)
@@ -150,7 +151,7 @@ def load_prepare_preference_datasets(cfg):
                split_datasets[i] = map_dataset(
                    cfg, data_set, ds_transform_fn, tokenizer, **map_kwargs
                )
-            elif _cfg.rl == "kto":
+            elif _cfg.rl is RLType.KTO:
                ds_transform_fn = load_kto(_type, _cfg, dataset_idx=i)
                map_kwargs = {}
                if isinstance(ds_transform_fn, tuple):
@@ -185,7 +186,7 @@ def load_prepare_preference_datasets(cfg):
                    )

        combined_datasets = concatenate_datasets(split_datasets)
-        combined_datasets = combined_datasets.shuffle(seed=cfg.seed)
+        combined_datasets = combined_datasets.shuffle(seed=cfg.seed if cfg.seed is not None else 42)

        return combined_datasets

@@ -205,6 +206,8 @@ def load_prepare_preference_datasets(cfg):
                eval_dataset = load_split(cfg.test_datasets, cfg)
        if not eval_dataset:
            if cfg.val_set_size:
+                seed = cfg.seed if cfg.seed is not None else 42
+
                # ensure we end up with the same fingerprint by doing rank0 first and being able to cache
                to_hash_train = (
                    train_dataset._fingerprint  # pylint: disable=protected-access
@@ -213,7 +216,7 @@ def load_prepare_preference_datasets(cfg):
                    + "|"
                    + "train"
                    + "|"
-                    + str(cfg.seed or 42)
+                    + str(seed)
                )
                to_hash_test = (
                    train_dataset._fingerprint  # pylint: disable=protected-access
@@ -222,13 +225,13 @@ def load_prepare_preference_datasets(cfg):
                    + "|"
                    + "test"
                    + "|"
-                    + str(cfg.seed or 42)
+                    + str(seed)
                )
                train_fingerprint = md5(to_hash_train)
                test_fingerprint = md5(to_hash_test)
                ds_w_test_split = train_dataset.train_test_split(
                    test_size=cfg.val_set_size,
-                    seed=cfg.seed,
+                    seed=seed,
                    shuffle=False,
                    train_new_fingerprint=train_fingerprint,
                    test_new_fingerprint=test_fingerprint,
--- a/src/axolotl/utils/data/sft.py
+++ b/src/axolotl/utils/data/sft.py
@@ -148,7 +148,7 @@ def prepare_dataset(cfg, tokenizer, processor=None, preprocess_iterable=None):
            ds_wrapper_partial,
            max_tokens=cfg.sequence_len,
            batch_size=cfg.micro_batch_size,
-            seed=cfg.seed or 42,
+            seed=cfg.seed if cfg.seed is not None else 42,
            buffer_size=cfg.pretrain_multipack_buffer_size or 10_000,
        )
        # https://discuss.huggingface.co/t/how-to-use-huggingface-trainer-streaming-datasets-without-wrapping-it-with-torchdatas-iterablewrapper/25230
@@ -416,6 +416,8 @@ def load_prepare_datasets(
    )

    if split == "train" and val_set_size:
+        seed = cfg.seed if cfg.seed is not None else 42
+
        # ensure we end up with the same fingerprint by doing rank0 first and being able to cache
        to_hash_train = (
            dataset._fingerprint  # pylint: disable=protected-access
@@ -424,7 +426,7 @@ def load_prepare_datasets(
            + "|"
            + "train"
            + "|"
-            + str(cfg.seed or 42)
+            + str(seed)
        )
        to_hash_test = (
            dataset._fingerprint  # pylint: disable=protected-access
@@ -433,7 +435,7 @@ def load_prepare_datasets(
            + "|"
            + "test"
            + "|"
-            + str(cfg.seed or 42)
+            + str(seed)
        )
        train_fingerprint = md5(to_hash_train)
        test_fingerprint = md5(to_hash_test)
@@ -442,7 +444,7 @@ def load_prepare_datasets(
        dataset = dataset.train_test_split(
            test_size=val_set_size,
            shuffle=False,
-            seed=cfg.seed or 42,
+            seed=seed,
            train_new_fingerprint=train_fingerprint,
            test_new_fingerprint=test_fingerprint,
        )
--- a/src/axolotl/utils/data/shared.py
+++ b/src/axolotl/utils/data/shared.py
@@ -281,6 +281,10 @@ def load_dataset_w_config(
            **load_ds_kwargs,
        )
    if not ds:
-        raise ValueError("unhandled dataset load")
+        raise ValueError(
+            "The dataset could not be loaded. This could be due to a misconfigured dataset path "
+            f"({config_dataset.path}). Try double-checking your path / name / data_files. "
+            "This is not caused by the dataset type."
+        )

    return ds
--- a/src/axolotl/utils/distributed.py
+++ b/src/axolotl/utils/distributed.py
@@ -69,17 +69,27 @@ def barrier():
        dist.barrier()


-def is_main_process():
+def is_main_process(use_environ=False):
    """
    Check if the current process is the main process. If not in distributed mode,
    always return `True`.
+
+    Args:
+    - use_environ (bool, optional): Use environment variable to determine main process.
+
+    Returns:
+    - bool: `True` if the current process is the main process, `False` otherwise.
    """
+    if use_environ:
+        return os.environ.get("LOCAL_RANK", "0") == "0"
    if not is_distributed():
        return True
    return dist.get_rank() == 0


-def is_local_main_process():
+def is_local_main_process(use_environ=False):
+    if use_environ:
+        return os.environ.get("LOCAL_RANK", "0") == "0"
    return PartialState().is_local_main_process


@@ -99,17 +109,6 @@ def cleanup_distributed():
        torch.distributed.destroy_process_group()


-@contextmanager
-def zero_only():
-    """
-    Context manager that only runs the enclosed block on the main rank.
-    """
-    if is_main_process():
-        yield
-    else:
-        yield None
-
-
 @contextmanager
 def zero_first(is_main):
    """
--- a/src/axolotl/utils/gradient_checkpointing/__init__.py
+++ b/src/axolotl/utils/gradient_checkpointing/__init__.py
@@ -1,16 +1,59 @@
 """custom checkpointing utils"""

+import importlib.metadata
 from functools import partial

-from axolotl.utils.gradient_checkpointing.unsloth import (
-    Unsloth_Offloaded_Gradient_Checkpointer,
+from packaging import version
+
+from axolotl.utils.gradient_checkpointing.offload_cpu import (
+    CPU_Offloaded_Gradient_Checkpointer,
 )
+from axolotl.utils.gradient_checkpointing.offload_disk import (
+    Disco,
+)
+
+transformers_version = version.parse(importlib.metadata.version("transformers"))
+if transformers_version > version.parse("4.51.3"):
+    from transformers.modeling_layers import GradientCheckpointingLayer
+
+    def uses_gc_layers(decoder_layer):
+        return isinstance(decoder_layer.func.__self__, GradientCheckpointingLayer)
+
+else:
+
+    def uses_gc_layers(_):
+        return False


 def hf_grad_checkpoint_offload_wrapper(
    decoder_layer, *args, use_reentrant=None
 ):  # pylint: disable=unused-argument
-    return Unsloth_Offloaded_Gradient_Checkpointer.apply(
+    if uses_gc_layers(decoder_layer):
+        return CPU_Offloaded_Gradient_Checkpointer.apply(
+            decoder_layer,
+            *args,
+        )
+
+    return CPU_Offloaded_Gradient_Checkpointer.apply(
+        (
+            decoder_layer.func.__self__
+            if isinstance(decoder_layer, partial)
+            else decoder_layer.__self__
+        ),
+        *args,
+    )
+
+
+def hf_grad_checkpoint_disk_offload_wrapper(
+    decoder_layer, *args, use_reentrant=None
+):  # pylint: disable=unused-argument
+    if uses_gc_layers(decoder_layer):
+        return Disco.apply(
+            decoder_layer,
+            *args,
+        )
+
+    return Disco.apply(
        (
            decoder_layer.func.__self__
            if isinstance(decoder_layer, partial)
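Both offload wrappers recover the owning layer from the callable they are handed, which may be a bound method or a `functools.partial` around one. A minimal sketch of that unwrapping, where `ToyLayer` and `unwrap_owner` are hypothetical names used only for illustration:

```python
from functools import partial


class ToyLayer:
    """Stand-in for a decoder layer; only the bound method matters here."""

    def forward(self, x, scale=1):
        return x * scale


def unwrap_owner(fn):
    """Return the object a bound method (or a partial around one) belongs to."""
    return fn.func.__self__ if isinstance(fn, partial) else fn.__self__


layer = ToyLayer()
owner_from_method = unwrap_owner(layer.forward)
owner_from_partial = unwrap_owner(partial(layer.forward, scale=2))
```

In both cases the helper returns the same `layer` instance, which is what the wrappers pass on to the checkpointing function in place of the callable.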
--- a/src/axolotl/utils/gradient_checkpointing/offload_cpu.py
+++ b/src/axolotl/utils/gradient_checkpointing/offload_cpu.py
@@ -1,4 +1,4 @@
-"""Unsloth checkpointing"""
+"""CPU offloaded checkpointing"""

 # Copyright 2023-present Daniel Han-Chen & the Unsloth team. All rights reserved.
 #
@@ -26,7 +26,7 @@ else:
    torch_cuda_amp_custom_bwd = torch.amp.custom_bwd(device_type="cuda")


-class Unsloth_Offloaded_Gradient_Checkpointer(  # pylint: disable=invalid-name
+class CPU_Offloaded_Gradient_Checkpointer(  # pylint: disable=invalid-name
    torch.autograd.Function
 ):
    """
--- a/src/axolotl/utils/gradient_checkpointing/offload_disk.py
+++ b/src/axolotl/utils/gradient_checkpointing/offload_disk.py
@@ -0,0 +1,531 @@
+"""
+DISCO - DIsk-based Storage and Checkpointing with Optimized prefetching
+"""
+
+# Copyright 2025 Axolotl AI. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import atexit
+import concurrent.futures
+import logging
+import os
+import queue
+import shutil
+import tempfile
+import threading
+import time
+import uuid
+from collections import deque
+from concurrent.futures import Future
+from typing import Dict
+
+import torch
+
+torch_cuda_amp_custom_fwd = torch.amp.custom_fwd(device_type="cuda")
+torch_cuda_amp_custom_bwd = torch.amp.custom_bwd(device_type="cuda")
+
+# Setup logger
+logger = logging.getLogger(__name__)
+
+
+class DiskOffloadManager:
+    """
+    Manages offloaded tensors and handles prefetching in a separate thread.
+    Includes synchronization to prevent race conditions.
+    """
+
+    def __init__(
+        self,
+        prefetch_size: int = 3,
+        prefetch_to_gpu: bool = True,
+        save_workers: int = 4,
+    ):
+        """
+        Args:
+            prefetch_size: Maximum number of tensors to prefetch in the background.
+            prefetch_to_gpu: Whether to prefetch tensors directly to GPU memory.
+            save_workers: Maximum number of concurrent save operations.
+        """
+        self.temp_dir = tempfile.mkdtemp(prefix="disco_")
+
+        # Track tensor paths and their status
+        self.tensor_paths: deque = deque()  # Ordered history of tensor paths (LIFO)
+        self.file_locks: Dict[str, threading.Lock] = (
+            {}
+        )  # Maps file_path -> threading.Lock()
+        # Maps file_path -> status ("saving", "ready", "prefetching", "loaded", "deleted")
+        self.file_status: Dict[str, str] = {}
+
+        self.max_prefetch = prefetch_size
+        self.prefetch_to_gpu = prefetch_to_gpu
+
+        # Thread synchronization
+        self.manager_lock = threading.RLock()  # Used for thread-safe operations
+
+        # Prefetch queue and cache
+        self.prefetch_queue: queue.Queue = queue.Queue()
+        self.prefetch_cache: Dict[str, torch.Tensor] = {}  # Maps file_path -> tensor
+
+        # Save queue and thread pool
+        self.save_queue: queue.Queue = queue.Queue()
+        self.save_pool = concurrent.futures.ThreadPoolExecutor(max_workers=save_workers)
+        self.save_futures: Dict[str, Future] = {}
+        self.save_semaphore = threading.Semaphore(
+            save_workers * 2
+        )  # Limit concurrent save operations
+
+        # Start prefetch worker thread
+        self.stop_event = threading.Event()
+        # start multiple threads for prefetching
+        self.prefetch_worker_count = 2
+        self.prefetch_workers = []
+        for _ in range(self.prefetch_worker_count):
+            worker = threading.Thread(target=self._prefetch_worker, daemon=True)
+            worker.start()
+            self.prefetch_workers.append(worker)
+
+        # Start save worker thread
+        self.save_worker = threading.Thread(target=self._save_worker, daemon=True)
+        self.save_worker.start()
+        self.idx = 0
+
+        atexit.register(self.cleanup)
+
+    def _save_worker(self):
+        """Background thread that processes the save queue"""
+        while not self.stop_event.is_set():
+            try:
+                save_item = self.save_queue.get(timeout=0.5)
+                if save_item is None:
+                    continue
+
+                tensor, file_path = save_item
+
+                # Submit the save task to the thread pool
+                future = self.save_pool.submit(
+                    self._save_tensor_to_disk, tensor, file_path
+                )
+                with self.manager_lock:
+                    self.save_futures[file_path] = future
+
+                self.save_queue.task_done()
+
+            except queue.Empty:
+                time.sleep(0.01)  # Small sleep to prevent CPU spinning
+                continue
+
+    def _save_tensor_to_disk(self, tensor: torch.Tensor, file_path: str):
+        """Actually save the tensor to disk"""
+        try:
+            # Save tensor to disk
+            cpu_tensor = tensor.detach().cpu()
+            torch.save(cpu_tensor, file_path)
+            del cpu_tensor
+
+            with self.manager_lock:
+                # Mark file as ready
+                self.file_status[file_path] = "ready"
+
+            # Release semaphore
+            self.save_semaphore.release()
+
+            return True
+        except FileNotFoundError as e:
+            logger.error(f"Error saving tensor to {file_path}: {e}")
+            with self.manager_lock:
+                self.file_status[file_path] = "error"
+
+            # Release semaphore
+            self.save_semaphore.release()
+
+            return False
+
+    def _prefetch_worker(self):
+        """Background thread that loads tensors from disk ahead of time"""
+        while not self.stop_event.is_set():
+            try:
+                file_path = self.prefetch_queue.get(timeout=0.5)
+                if file_path is None:
+                    continue
+
+                # Check if file is available and not already in cache
+                with self.manager_lock:
+                    # Skip unknown, deleted, or already-cached files
+                    if (
+                        file_path not in self.file_status
+                        or self.file_status[file_path] == "deleted"
+                        or file_path in self.prefetch_cache
+                    ):
+                        self.prefetch_queue.task_done()
+                        continue
+
+                    # If file is still being saved, wait for it
+                    if (
+                        self.file_status[file_path] == "saving"
+                        and file_path in self.save_futures
+                    ):
+                        # Re-queue this prefetch request with a little delay
+                        self.prefetch_queue.task_done()
+                        time.sleep(0.1)
+                        self.prefetch_queue.put(file_path)
+                        continue
+
+                    # Mark file as being prefetched
+                    self.file_status[file_path] = "prefetching"
+
+                # Load tensor from disk and store in cache
+                try:
+                    if os.path.exists(file_path):
+                        if self.prefetch_to_gpu:
+                            tensor = torch.load(
+                                file_path,
+                                map_location=torch.device("cuda"),
+                                weights_only=True,
+                            )
+                        else:
+                            tensor = torch.load(file_path, weights_only=True)
+
+                        with self.manager_lock:
+                            self.prefetch_cache[file_path] = tensor
+                            self.file_status[file_path] = "ready"
+                    else:
+                        with self.manager_lock:
+                            if self.file_status.get(file_path) != "deleted":
+                                logger.warning(
+                                    f"Prefetch error: File not found {file_path}"
+                                )
+                                self.file_status[file_path] = "missing"
+
+                except FileNotFoundError as e:
+                    with self.manager_lock:
+                        if self.file_status.get(file_path) != "deleted":
+                            logger.warning(f"Prefetch error for {file_path}: {e}")
+                            self.file_status[file_path] = "error"
+
+                self.prefetch_queue.task_done()
+
+            except queue.Empty:
+                time.sleep(0.01)  # Small sleep to prevent CPU spinning
+                continue
+
+    def save_tensor(self, tensor: torch.Tensor):
+        """Save tensor to disk asynchronously and return file path with thread-safe operations"""
+        # Generate unique file path
+        self.idx += 1
+        file_path: str = os.path.join(
+            self.temp_dir, f"{self.idx:06d}-{uuid.uuid4()}.pt"
+        )
+
+        with self.manager_lock:
+            # Mark file as being saved
+            self.file_locks[file_path] = threading.Lock()
+            self.file_status[file_path] = "saving"
+            # Add to history
+            self.tensor_paths.append(file_path)
+
+        # Acquire semaphore to limit concurrent save operations
+        self.save_semaphore.acquire()  # pylint: disable=consider-using-with
+        # Queue tensor for saving in background
+        self.save_queue.put((tensor.detach(), file_path))
+
+        return file_path
+
+    def wait_for_save(self, file_path, timeout=None) -> None:
+        """Wait for a tensor to be saved to disk"""
+        start_time = time.time()
+        while timeout is None or time.time() - start_time < timeout:
+            with self.manager_lock:
+                if self.file_status.get(file_path) == "ready":
+                    return
+                if self.file_status.get(file_path) in ["error", "missing", "deleted"]:
+                    return
+
+                if file_path in self.save_futures:
+                    future = self.save_futures[file_path]
+                    if future.done():
+                        return
+
+            # Small sleep to prevent CPU spinning
+            time.sleep(0.01)
+
+        # Timeout
+        logger.warning(f"Timeout waiting for tensor to be saved: {file_path}")
+        return
+
+    def load_tensor(self, file_path, target_device="cuda"):
+        """Load tensor from disk or prefetch cache with proper synchronization"""
+        # Wait for tensor to be saved if it's still in progress
+        self.wait_for_save(file_path)
+
+        tensor = None
+
+        # Try to get from cache first
+        with self.manager_lock:
+            # Check if tensor is already in cache
+            if file_path in self.prefetch_cache:
+                tensor = self.prefetch_cache[file_path]
+                del self.prefetch_cache[file_path]
+                self.file_status[file_path] = "loaded"
+
+        if tensor is not None:
+            # Ensure tensor is on correct device
+            if target_device != "cpu" and tensor.device.type == "cpu":
+                tensor = tensor.to(target_device, non_blocking=True)
+            return tensor
+
+        # If not in cache, load directly from disk
+        try:
+            if not os.path.exists(file_path):
+                logger.error(f"File not found for loading: {file_path}")
+                raise FileNotFoundError(f"File not found: {file_path}")
+
+            tensor = torch.load(file_path, weights_only=True)
+
+            with self.manager_lock:
+                self.file_status[file_path] = "loaded"
+
+            if target_device != "cpu":
+                tensor = tensor.to(target_device, non_blocking=True)
+
+            return tensor
+
+        except Exception as e:
+            logger.error(f"Error loading tensor from {file_path}: {e}")
+            raise
+
+    def _safe_delete_file(self, file_path):
+        """Safely delete a file with proper synchronization"""
+        with self.manager_lock:
+            # Make sure any save operation is completed
+            if file_path in self.save_futures:
+                future = self.save_futures[file_path]
+                try:
+                    if not future.done():
+                        future.cancel()
+                    del self.save_futures[file_path]
+                except FileNotFoundError as e:
+                    logger.warning(
+                        f"Error canceling save operation for {file_path}: {e}"
+                    )
+
+            # Only delete if file exists and is not being prefetched
+            status = self.file_status.get(file_path)
+            if status in ["ready", "loaded", "error", "missing"]:
+                try:
+                    if os.path.exists(file_path):
+                        os.remove(file_path)
+                    self.file_status[file_path] = "deleted"
+                    return True
+                except FileNotFoundError as e:
+                    logger.warning(f"Error deleting file {file_path}: {e}")
+            return False
+
+    def trigger_prefetch(self, n=None):
+        """Trigger prefetching of the next N tensors with proper synchronization"""
+        if n is None:
+            n = self.max_prefetch
+
+        prefetch_paths = []
+        with self.manager_lock:
+            # Find files that are ready to be prefetched (not already in cache or being prefetched)
+            for path in reversed(self.tensor_paths):
+                if (
+                    path not in self.prefetch_cache
+                    and self.file_status.get(path) == "ready"
+                ):
+                    prefetch_paths.append(path)
+                    if len(prefetch_paths) >= n:
+                        break
+
+        # Queue files for prefetching
+        for path in prefetch_paths:
+            self.prefetch_queue.put(path)
+
+    def cleanup_tensor(self, file_path: str):
+        """Clean up a specific tensor file after it's been used"""
+        with self.manager_lock:
+            if file_path in self.tensor_paths:
+                self.tensor_paths.remove(file_path)
+
+            # Remove from prefetch cache if present
+            if file_path in self.prefetch_cache:
+                del self.prefetch_cache[file_path]
+
+            # Remove from save futures if present
+            if file_path in self.save_futures:
+                future = self.save_futures[file_path]
+                if not future.done():
+                    future.cancel()
+                del self.save_futures[file_path]
+
+        # Try to delete the file
+        self._safe_delete_file(file_path)
+
+    def cleanup(self):
+        """Clean up all temp files and stop prefetch thread with proper synchronization"""
+        self.stop_event.set()
+
+        # Cancel all pending save operations
+        with self.manager_lock:
+            for _, future in self.save_futures.items():
+                if not future.done():
+                    future.cancel()
+            self.save_futures.clear()
+
+        # Drain the save queue
+        while not self.save_queue.empty():
+            try:
+                self.save_queue.get_nowait()
+                self.save_queue.task_done()
+            except queue.Empty:
+                break
+
+        # Shutdown the save pool
+        self.save_pool.shutdown(wait=False)
+
+        # Join the save worker thread
+        if self.save_worker.is_alive():
+            self.save_worker.join(timeout=2.0)
+
+        # Join the prefetch worker threads
+        for thread in self.prefetch_workers:
+            if thread.is_alive():
+                thread.join(timeout=2.0)
+
+        # Clear cache and remove all temporary files
+        with self.manager_lock:
+            self.prefetch_cache.clear()
+            paths_to_delete = list(self.tensor_paths)
+            self.tensor_paths.clear()
+
+        # Delete all temporary files
+        for path in paths_to_delete:
+            self._safe_delete_file(path)
+
+        # Remove temp directory
+        try:
+            if os.path.exists(self.temp_dir):
+                shutil.rmtree(self.temp_dir, ignore_errors=True)
+        except FileNotFoundError as e:
+            logger.warning(f"Error removing temporary directory {self.temp_dir}: {e}")
+
+
+class Disco(torch.autograd.Function):
+    """
+    Disco: DIsk-based Storage and Checkpointing with Optimized prefetching
+    Advanced disk-based gradient checkpointer with prefetching.
+    """
+
+    # Shared manager instance across all checkpointing operations
+    _manager = None
+
+    @staticmethod
+    def get_instance(prefetch_size=1, prefetch_to_gpu=True, save_workers=4):
+        """Get or create the offload manager"""
+        if Disco._manager is None:
+            Disco._manager = DiskOffloadManager(
+                prefetch_size=prefetch_size,
+                prefetch_to_gpu=prefetch_to_gpu,
+                save_workers=save_workers,
+            )
+        return Disco._manager
+
+    @staticmethod
+    @torch_cuda_amp_custom_fwd
+    def forward(
+        ctx,
+        forward_function,
+        hidden_states,
+        *args,
+        prefetch_size=1,
+        prefetch_to_gpu=True,
+        save_workers=4,
+    ):
+        """Forward pass that offloads activations to disk asynchronously"""
+        # Get or create the manager
+        manager = Disco.get_instance(
+            prefetch_size=prefetch_size,
+            prefetch_to_gpu=prefetch_to_gpu,
+            save_workers=save_workers,
+        )
+
+        # Save tensor to disk asynchronously
+        file_path = manager.save_tensor(hidden_states)
+
+        # Run forward pass immediately without waiting for save to complete
+        with torch.no_grad():
+            output = forward_function(hidden_states, *args)
+
+        # Store what we need for backward
+        ctx.save_for_backward(torch.tensor([0]))  # Dummy tensor
+        ctx.file_path = file_path
+        ctx.forward_function = forward_function
+        ctx.args = args
+
+        return output
+
+    @staticmethod
+    @torch_cuda_amp_custom_bwd
+    def backward(ctx, *grad_outputs):
+        """Backward pass that loads activations from disk with prefetching"""
+        # Get the manager
+        manager = Disco._manager
+
+        # Trigger prefetching for future tensors
+        # This happens at the start of backward, so the loads should have time to complete
+        manager.trigger_prefetch()
+
+        # Load hidden states from disk or prefetch cache
+        file_path = ctx.file_path
+        try:
+            # Ensure the file is saved before we try to load it
+            manager.wait_for_save(file_path)
+
+            hidden_states = manager.load_tensor(file_path)
+            hidden_states.requires_grad = True
+
+            # Compute gradients
+            with torch.enable_grad():
+                output = ctx.forward_function(hidden_states, *ctx.args)
+
+                # Handle tuple outputs properly
+                if isinstance(output, tuple):
+                    if len(grad_outputs) == len(output):
+                        torch.autograd.backward(output, grad_outputs)
+                    else:
+                        torch.autograd.backward(output, grad_outputs[0])
+                else:
+                    torch.autograd.backward(output, grad_outputs[0])
+
+            # Clean up the file after we're done with it
+            manager.cleanup_tensor(file_path)
+
+            return (
+                (
+                    None,  # forward_function
+                    hidden_states.grad,  # hidden_states grad
+                )
+                + (None,) * len(ctx.args)  # for each arg
+                + (
+                    None,  # prefetch_size
+                    None,  # prefetch_to_gpu
+                    None,  # save_workers
+                )
+            )
+
+        except Exception as e:
+            logger.error(f"Error in backward pass: {e}")
+            # Clean up the file even on error
+            manager.cleanup_tensor(file_path)
+            raise
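Reviewer note: the save / wait / load / cleanup cycle added above can be sketched without torch. The class below is a hypothetical, stdlib-only stand-in and is not part of this PR: `TinyDiskOffloader` and its method names are invented for illustration, `pickle` replaces `torch.save`/`torch.load`, and a `threading.Event` per file replaces the status polling in `wait_for_save`. It only illustrates the ordering guarantee the backward pass relies on, namely that a load blocks until the background save has finished.

```python
import os
import pickle
import queue
import shutil
import tempfile
import threading
import uuid


class TinyDiskOffloader:
    """Hypothetical, stdlib-only stand-in for the manager in this diff."""

    def __init__(self):
        self.temp_dir = tempfile.mkdtemp(prefix="offload_sketch_")
        self.save_queue = queue.Queue()
        self.saved = {}  # file_path -> Event, set once the file is on disk
        self.lock = threading.Lock()
        self.worker = threading.Thread(target=self._save_worker, daemon=True)
        self.worker.start()

    def _save_worker(self):
        # Background thread: write queued payloads, then signal completion.
        while True:
            item = self.save_queue.get()
            if item is None:  # sentinel pushed by cleanup()
                break
            payload, file_path = item
            with open(file_path, "wb") as handle:
                pickle.dump(payload, handle)
            with self.lock:
                event = self.saved[file_path]
            event.set()

    def save(self, payload):
        # Register the file before queueing it so load() can always find its Event.
        file_path = os.path.join(self.temp_dir, f"{uuid.uuid4()}.pkl")
        with self.lock:
            self.saved[file_path] = threading.Event()
        self.save_queue.put((payload, file_path))
        return file_path

    def load(self, file_path, timeout=5.0):
        # Block until the background save has finished, then read and delete.
        with self.lock:
            event = self.saved[file_path]
        if not event.wait(timeout):
            raise TimeoutError(f"save did not finish in time: {file_path}")
        with open(file_path, "rb") as handle:
            payload = pickle.load(handle)
        os.remove(file_path)
        return payload

    def cleanup(self):
        # Stop the worker and remove the temporary directory.
        self.save_queue.put(None)
        self.worker.join(timeout=2.0)
        shutil.rmtree(self.temp_dir, ignore_errors=True)
```

In this sketch, `save` plays the role of the forward pass and `load` the role of the backward pass. The per-file event avoids the 10 ms sleep loop, at the cost of keeping one `Event` per outstanding file.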
--- a/src/axolotl/utils/models.py
+++ b/src/axolotl/utils/models.py
@@ -68,11 +68,15 @@ from axolotl.utils.distributed import (
    get_device_count,
    get_device_type,
    is_local_main_process,
-    zero_only,
+    is_main_process,
+)
+from axolotl.utils.gradient_checkpointing import (
+    hf_grad_checkpoint_disk_offload_wrapper,
+    hf_grad_checkpoint_offload_wrapper,
 )
-from axolotl.utils.gradient_checkpointing import hf_grad_checkpoint_offload_wrapper
 from axolotl.utils.lora_embeddings import get_linear_embedding_layers
 from axolotl.utils.model_shard_quant import load_sharded_model, load_sharded_model_quant
+from axolotl.utils.schemas.enums import RLType

 LOG = logging.getLogger(__name__)
 PLUGIN_MANAGER = PluginManager.get_instance()
@@ -141,6 +145,22 @@ def check_model_config(cfg: DictDefault, model_config: PretrainedConfig):
        hasattr(model_config, "quantization_config")
        and model_config.quantization_config
    )
+
+    # Detect compressed-tensors config
+    is_compressed_tensors_config = (
+        quant_config_exists
+        and model_config.quantization_config.get("quant_method") == "compressed-tensors"
+    )
+
+    if is_compressed_tensors_config:
+        if model_config.quantization_config.get("config_groups"):
+            LOG.warning(
+                "Found `config_groups` in a compressed-tensors config. "
+                "QAT integration with llmcompressor is not tested."
+            )
+        # Skip further quant checks for compressed-tensors
+        return
+
    quant_config_method_is_gptq = (
        quant_config_exists
        and "quant_method" in model_config.quantization_config
@@ -437,7 +457,7 @@ def load_tokenizer(cfg):
            {"additional_special_tokens": additional_special_tokens}
        )

-    with zero_only():
+    if is_main_process(use_environ=True):
        LOG.debug(f"EOS: {tokenizer.eos_token_id} / {tokenizer.eos_token}")
        LOG.debug(f"BOS: {tokenizer.bos_token_id} / {tokenizer.bos_token}")
        LOG.debug(f"PAD: {tokenizer.pad_token_id} / {tokenizer.pad_token}")
@@ -540,11 +560,21 @@ class ModelLoader:
        self.auto_model_loader = AutoModelForCausalLM  # pylint: disable=invalid-name

    def apply_patches(self) -> None:
+        if self.cfg.xformers_attention and self.cfg.sample_packing:
+            from axolotl.monkeypatch.attention import patch_xformers_attn_over_fa2
+
+            patch_xformers_attn_over_fa2()
+            self.cfg.flash_attention = True
        if self.cfg.fsdp_config and str(self.cfg.fsdp_config.fsdp_version) == "2":
            from axolotl.monkeypatch.accelerate.fsdp2 import patch_accelerate_fsdp_utils

            patch_accelerate_fsdp_utils()

+        if self.cfg.adapter and self.cfg.embeddings_skip_upcast:
+            from axolotl.monkeypatch.peft.utils import patch_peft_prep_code
+
+            patch_peft_prep_code()
+
        if self.cfg.flex_attention:
            from axolotl.monkeypatch.attention.flex_attn import (
                patch_flex_make_mask,
@@ -593,6 +623,10 @@ class ModelLoader:

        if self.cfg.gradient_checkpointing in ["unsloth", "offload"]:
            transformers.modeling_utils.checkpoint = hf_grad_checkpoint_offload_wrapper
+        if self.cfg.gradient_checkpointing == "offload_disk":
+            transformers.modeling_utils.checkpoint = (
+                hf_grad_checkpoint_disk_offload_wrapper
+            )

        if self.cfg.flash_attention:
            self.patch_attention()
@@ -1164,7 +1198,7 @@ class ModelLoader:
                ],
            )

-    def prepare_model(self, qlora_fsdp) -> None:
+    def prepare_model(self, qlora_fsdp: bool) -> None:
        skip_prepare_model_for_kbit_training = False
        if self.cfg.model_config_type == "qwen" and self.cfg.adapter == "lora":
            # Qwen doesn't play nicely with LoRA if this is enabled
@@ -1294,7 +1328,10 @@ class ModelLoader:
        # make sure these are fp32 per Ramesh et al. (2021)
        embedding_modules = get_linear_embedding_layers(self.cfg.model_config_type)
        if not self.cfg.fsdp:
-            # FSDP doesn't like mixed Float and BFloat16
+            # we don't run this during FSDP because this will leave mixed
+            # float and bfloat16 dtypes in the model which FSDP doesn't like
+            if self.cfg.load_in_4bit and self.cfg.embeddings_skip_upcast:
+                embedding_modules = []
            self.convert_embedding_modules_dtype(
                embedding_modules,
                dist_dtype=torch.float32,
@@ -1343,7 +1380,7 @@ class ModelLoader:
            # then the dpo trainer doesn't want the peft model loaded over it, it just wants the lora/peft config
            if (
                self.cfg.adapter
-                and self.cfg.rl in ["dpo", "ipo", "kto"]
+                and self.cfg.rl in [RLType.DPO, RLType.IPO, RLType.KTO]
                and not self.cfg.merge_lora
            ):
                _, lora_config = load_lora(
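Reviewer note: the compressed-tensors early return added to `check_model_config` above depends on one predicate. A minimal sketch of that predicate follows, assuming the quantization config is a plain dict (or `None`), as the `.get(...)` calls in the hunk suggest. The helper name `is_compressed_tensors` is hypothetical and does not exist in the codebase.

```python
def is_compressed_tensors(quantization_config):
    """Return True when a quantization config dict declares compressed-tensors.

    Hypothetical helper mirroring the check in the hunk above; a missing or
    empty config counts as "not compressed-tensors".
    """
    return bool(quantization_config) and (
        quantization_config.get("quant_method") == "compressed-tensors"
    )
```

Under this reading, a GPTQ or absent config falls through to the existing quantization checks, and only a compressed-tensors config skips them.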
--- a/src/axolotl/utils/samplers/multipack.py
+++ b/src/axolotl/utils/samplers/multipack.py
@@ -1,10 +1,13 @@
-# pylint: skip-file
 """
-Multipack Batch Sampler
+Multipack Batch Sampler - An efficient batch sampler for packing variable-length sequences
+into fixed-capacity batches to optimize memory usage and training throughput.
 """
+
 import logging
 import math
-from typing import Any, Iterable, List, Union
+from concurrent.futures import ProcessPoolExecutor
+from multiprocessing import cpu_count, get_context
+from typing import Iterable, Union

 import numba
 import numpy as np
@@ -13,26 +16,39 @@ from torch.utils.data import BatchSampler, Sampler, SequentialSampler
 from axolotl.utils.distributed import reduce_and_broadcast

 LOG = logging.getLogger(__name__)
-
 LOG.setLevel(logging.INFO)


@numba.njit
-def ffd_check(a: np.ndarray, c: int, n: int):
-    # First-fit-decreasing bin packing
-    # Check if a[] could fit in n bins with capacity c
-    # https://en.wikipedia.org/wiki/First-fit-decreasing_bin_packing
+def ffd_check(sequence_lengths: np.ndarray, bin_capacity: int, num_bins: int):
+    """
+    First-fit-decreasing bin packing algorithm check

-    a = np.sort(a)[::-1]
-    bins = np.full((n,), c, dtype=a.dtype)
-    for size in a:
+    Checks if sequences with the given lengths could fit in the specified number of bins
+
+    Args:
+        sequence_lengths: Array of sequence lengths
+        bin_capacity: Maximum capacity of each bin
+        num_bins: Number of bins available
+
+    Returns:
+        True if all sequences can be packed, False otherwise
+    """
+    # Sort sequence lengths in descending order (the "decreasing" step of first-fit-decreasing)
+    sequence_lengths = np.sort(sequence_lengths)[::-1]
+    # Initialize all bins with full capacity
+    bins = np.full((num_bins,), bin_capacity, dtype=sequence_lengths.dtype)
+
+    # Try to place each sequence in the first bin it fits
+    for size in sequence_lengths:
        not_found = True
-        for idx in range(n):
+        for idx in range(num_bins):
            if bins[idx] >= size:
                bins[idx] -= size
                not_found = False
                break

+        # If no bin could fit this sequence, packing failed
        if not_found:
            return False

@@ -40,86 +56,155 @@ def ffd_check(a: np.ndarray, c: int, n: int):


@numba.njit
-def ffd_with_result(a: np.ndarray, c: int, start_index: int):
-    # First-fit-decreasing bin packing (with result return)
+def pack_group(
+    sequence_lengths: np.ndarray,
+    group_offset: int,
+    bin_capacity: int,
+    max_bins: int,
+    bin_size: int,
+    safe_mode: bool = True,
+):
+    """
+    Pack a group of sequences into bins using a first-fit pass over the given order (no sorting)

-    indices = np.argsort(a)[::-1]
-    a = a[indices]
+    Args:
+        sequence_lengths: Array of sequence lengths
+        group_offset: Offset to apply to indices when returning results
+        bin_capacity: Maximum capacity of each bin
+        max_bins: Maximum number of bins to use
+        bin_size: Maximum number of sequences per bin
+        safe_mode: If True, skip sequences that would require more than max_bins bins

-    bins: List[Any] = []
-    bins_result: List[Any] = []
-    for a_id, size in enumerate(a):
-        add_new = True
-        for idx in range(len(bins)):
-            if bins[idx] >= size:
-                bins[idx] -= size
-                bins_result[idx].append(indices[a_id] + start_index)
-                add_new = False
+    Returns:
+        List of bins, where each bin contains indices of sequences assigned to it
+    """
+    bins_remaining_space: list = []  # Tracks remaining capacity in each bin
+    bins_assigned_sequences: list = []  # Tracks sequence indices assigned to each bin
+
+    for seq_id, size in enumerate(sequence_lengths):
+        global_idx = seq_id + group_offset
+
+        # Try to place sequence in existing bins
+        add_new_bin = True
+        for bin_idx, _ in enumerate(bins_remaining_space):
+            if (
+                bins_remaining_space[bin_idx] >= size
+                and len(bins_assigned_sequences[bin_idx]) < bin_size
+            ):
+                bins_remaining_space[bin_idx] -= size
+                bins_assigned_sequences[bin_idx].append(global_idx)
+                add_new_bin = False
                break

-        if add_new:
-            bins.append(c - size)
-            bins_result.append([indices[a_id] + start_index])
+        # Create a new bin if needed and if we haven't reached the limit
+        if add_new_bin:
+            if len(bins_remaining_space) >= max_bins and safe_mode:
+                # In safe mode, skip items that would exceed max_bins
+                continue
+            bins_remaining_space.append(bin_capacity - size)
+            bins_assigned_sequences.append([global_idx])

-    return bins_result
+            # Safety check to avoid infinite bins
+            if len(bins_remaining_space) > len(sequence_lengths):
+                break
+
+    return bins_assigned_sequences


-@numba.njit
-def allocate(
-    lengths: np.ndarray, lengths_cumsum: np.ndarray, rank: int, c: int, n: int
+# Define a standalone function for multiprocessing
+def _process_group(args):
+    group_lengths, start_idx, bin_capacity, max_bins, bin_size, safe_mode = args
+    return pack_group(
+        group_lengths, start_idx, bin_capacity, max_bins, bin_size, safe_mode
+    )
+
+
+def pack_parallel(
+    sequence_lengths: np.ndarray,
+    bin_capacity: int,
+    group_size: int,
+    bin_size: int,
+    num_processes: int | None = None,
+    safe_mode: bool = True,
+    mp_start_method: str | None = "spawn",
 ):
-    # Dynamic batch allocator, similar to Multifit
-    # https://en.wikipedia.org/wiki/Multifit_algorithm
-    # ~99.5% efficiency on OpenChat training set (12 * 2048 ctx len)
+    """
+    Pack sequences into bins using parallel processing

-    s = 0
-    start_index = 0
-    result = []
+    Args:
+        sequence_lengths: Array of sequence lengths
+        bin_capacity: Maximum capacity of each bin as total number of tokens
+        group_size: Number of sequences to process in each group
+        bin_size: Maximum number of sequences per bin
+        num_processes: Number of parallel processes to use
+        safe_mode: If True, use a more conservative packing approach
+        mp_start_method: Multiprocessing start method ('fork', 'spawn', 'forkserver').
+                         'spawn' is often safer with Numba/PyTorch.
+                         Set to None to use system default.
+    Returns:
+        List of bins, where each bin contains indices of sequences assigned to it
+    """
+    num_items = len(sequence_lengths)
+    if num_processes is None:
+        num_processes = max(1, min(num_items // group_size, cpu_count()))

-    while True:
-        # binary search [l, r)
-        left = 1
-        right = 1 + np.searchsorted(lengths_cumsum[start_index:], s + c * n, "right")
+    # Create tasks for parallel processing
+    tasks = []
+    for i in range(0, num_items, group_size):
+        group_lengths = sequence_lengths[i : i + group_size]
+        max_bins = len(group_lengths)  # Allow as many bins as items in the group
+        tasks.append((group_lengths, i, bin_capacity, max_bins, bin_size, safe_mode))

-        while right - left > 1:
-            mid = (left + right) // 2
-            if ffd_check(lengths[start_index : start_index + mid], c, n):
-                left = mid
-            else:
-                right = mid
+    # Process groups in parallel
+    all_bins = []

-        # use length l
-        batch = ffd_with_result(
-            lengths[start_index : start_index + left], c, start_index
-        )
-        assert len(batch) <= n
-        if len(batch) < n:
-            break
+    mp_ctx = None
+    if mp_start_method:
+        try:
+            mp_ctx = get_context(mp_start_method)
+        except ValueError:
+            LOG.warning(
+                f"Failed to get multiprocessing context '{mp_start_method}'. "
+                f"Falling back to default. Available: {get_context().get_all_start_methods()}"
+            )
+            mp_ctx = (
+                None  # Fallback to default context if specified one is not available
+            )

-        start_index += left
-        s = lengths_cumsum[start_index - 1]
+    if num_processes == 1:
+        LOG.debug("Using single process for pack_parallel, running sequentially.")
+        for task_args in tasks:
+            group_bins = _process_group(task_args)
+            all_bins.extend(group_bins)
+    else:
+        # Use ProcessPoolExecutor only if num_processes > 1
+        # Pass mp_context if available
+        with ProcessPoolExecutor(
+            max_workers=num_processes, mp_context=mp_ctx
+        ) as executor:
+            for group_bins in executor.map(_process_group, tasks):
+                all_bins.extend(group_bins)

-        # add local rank
-        result.append(batch[rank])
-
-    return result, s, len(result) * c * n
+    return all_bins


@numba.njit
-def allocate_sequentially(lengths: np.ndarray, rank: int, c: int, n: int):
+def allocate_sequentially(
+    sequence_lengths: np.ndarray, rank: int, bin_capacity: int, num_ranks: int
+):
    """
    Sequential allocator that preserves example order

-    Parameters:
-    - lengths: The lengths of all examples
-    - rank: The current rank (for distributed training)
-    - c: The capacity of each bin (maximum sequence length)
-    - n: Number of ranks
+    Args:
+        sequence_lengths: The lengths of all examples
+        rank: The current rank (for distributed training)
+        bin_capacity: The capacity of each bin (maximum sequence length)
+        num_ranks: Number of ranks (processes/GPUs)

    Returns:
-    - result: List of batches for the current rank
-    - total_used: Number of actual example tokens
-    - total_slots: Maximum theoretical number of example tokens (number of bins * bin capacity)
+        rank_batches: List of batches for the current rank
+        total_tokens_used: Number of actual example tokens
+        total_token_slots: Maximum theoretical number of example tokens (number of bins * bin capacity)
    """
    result = []
    total_used = 0
@@ -127,9 +212,9 @@ def allocate_sequentially(lengths: np.ndarray, rank: int, c: int, n: int):
    # First, do sequential packing into bins
    all_bins = []
    current_bin = [0 for i in range(0)]  # numba hint
-    remaining_capacity = c
+    remaining_capacity = bin_capacity

-    for idx, size in enumerate(lengths):
+    for idx, size in enumerate(sequence_lengths):
        if size <= remaining_capacity:
            # Example fits in current bin
            current_bin.append(idx)
@@ -140,7 +225,7 @@ def allocate_sequentially(lengths: np.ndarray, rank: int, c: int, n: int):
            if current_bin:  # Add non-empty bin to all_bins
                all_bins.append(current_bin)
            current_bin = [idx]
-            remaining_capacity = c - size
+            remaining_capacity = bin_capacity - size
            total_used += size

    # Add the last bin if not empty
@@ -148,132 +233,227 @@ def allocate_sequentially(lengths: np.ndarray, rank: int, c: int, n: int):
        all_bins.append(current_bin)

    # Assign bins to ranks - each rank gets every n-th bin
-    for bin_idx in range(rank, len(all_bins), n):
+    for bin_idx in range(rank, len(all_bins), num_ranks):
        result.append(all_bins[bin_idx])

-    return result, total_used, len(all_bins) * c
+    return result, total_used, len(all_bins) * bin_capacity


 class MultipackBatchSampler(BatchSampler):
-    """Batch sampler class for multipack"""
+    """
+    Batch sampler class for efficient packing of variable-length sequences
+
+    This sampler packs sequences into fixed-capacity bins (batches) to maximize
+    GPU memory utilization and training throughput by reducing padding.
+
+    It supports both parallel packing (using FFD algorithm) and
+    sequential packing (preserving original sequence order).
+    """

    def __init__(
        self,
        sampler: Union[Sampler[int], Iterable[int]],
-        batch_size: int,
-        batch_max_len: int,
-        lengths: np.ndarray,
-        packing_efficiency_estimate: float = 1.0,
-        drop_last: bool = False,
-        num_count_samples: int = 16,
-        sequential: bool = False,
-        **kwargs,
+        batch_size: int,  # Number of bins per batch
+        batch_max_len: int,  # Maximum sequence length (bin capacity)
+        lengths: np.ndarray,  # Sequence lengths
+        packing_efficiency_estimate: float = 1.0,  # Initial efficiency estimate
+        drop_last: bool = False,  # Whether to drop final batches (might be incomplete)
+        num_count_samples: int = 16,  # Number of times to estimate batch count
+        sequential: bool = False,  # Whether to use sequential packing
+        group_size: int = 100_000,  # Size of groups for parallel packing
+        bin_size: int = 200,  # The max number of samples that can be packed in a single bin
+        num_processes: int | None = None,  # Number of processes for parallel packing
+        safe_mode: bool = True,  # Conservative packing to prevent training instability
+        **kwargs,  # pylint: disable=unused-argument
    ):
        super().__init__(sampler, batch_size, drop_last)
        self.batch_size = batch_size
        self.batch_max_len = batch_max_len
-        self.lengths: np.ndarray = lengths
+        self.lengths = np.array(lengths, dtype=np.int32)
        self.packing_efficiency_estimate = packing_efficiency_estimate or 1.0
        self.sequential = sequential
+        self.group_size = group_size
+        self.bin_size = bin_size
+        self.num_processes = num_processes
+        self.safe_mode = safe_mode

        assert isinstance(self.lengths, np.ndarray)

        self.epoch = 0

-        # statistics
-        self.eff_total_used = 0
-        self.eff_total_slots = 0
+        # Efficiency statistics tracking
+        self.total_tokens_used = 0
+        self.total_token_slots = 0

-        # The number of times to calculate the batches to determine the minimum packed dataset length for the local rank
+        # The number of times to calculate batches to determine minimum packed dataset length
        self.num_count_samples = num_count_samples
-        # the minimum packed dataset length across all ranks determined by a gather/broadcast
+        # Minimum packed dataset length across all ranks (determined by gather/broadcast)
        self.len_across_ranks = None

+        # Cache for batches
+        self._batches = None
+
        if self.sequential and not isinstance(sampler, SequentialSampler):
-            LOG.warn(
+            LOG.warning(
                "using sequential sample packing with non-sequential sampler, did you want to also enable curriculum_sampling?"
            )

    def set_epoch(self, epoch: int):
+        """Set the epoch number, used for reproducible shuffling across epochs"""
        self.epoch = epoch
+        self._batches = None  # Invalidate batch cache

    def generate_batches(self, set_stats=False):
-        indices = [idx for idx in self.sampler]
+        """
+        Generate packed batches for training

-        lengths = self.lengths[indices]
-        lengths_cumsum = np.cumsum(lengths)
+        Args:
+            set_stats: Whether to update efficiency statistics

-        if self.sequential:
-            batches, total_used, total_slots = allocate_sequentially(
-                lengths=lengths,
-                rank=0,
-                c=self.batch_max_len,
-                n=1,
-            )
-        else:
-            batches, total_used, total_slots = allocate(
-                lengths=lengths,
-                lengths_cumsum=lengths_cumsum,
-                rank=0,
-                c=self.batch_max_len,
-                n=1,
-            )
+        Returns:
+            List of batches, where each batch contains multiple bins,
+            and each bin contains multiple sequence indices
+        """
+        if self._batches is not None:
+            return self._batches

-        batches = [
-            [
-                [indices[b_idx] for b_idx in batch]
-                for batch in batches[i : i + self.batch_size]
-            ]
-            for i in range(0, len(batches), self.batch_size)
+        # Get indices from the sampler
+        indices = [  # pylint: disable=unnecessary-comprehension
+            idx for idx in self.sampler
        ]

-        # statistics
-        if set_stats:
-            self.eff_total_used += total_used
-            self.eff_total_slots += total_slots
+        # Get lengths of the selected sequences
+        lengths = self.lengths[indices]

+        # Pack sequences into bins using either sequential or parallel packing
+        if self.sequential:
+            bins, total_used, total_slots = allocate_sequentially(
+                lengths,
+                rank=0,
+                bin_capacity=self.batch_max_len,
+                num_ranks=1,
+            )
+            # Map bin indices back to original indices
+            bins = [[indices[b_idx] for b_idx in bin_indices] for bin_indices in bins]
+        else:
+            # Use parallel packing
+            all_bins = pack_parallel(
+                lengths,
+                bin_capacity=self.batch_max_len,
+                group_size=self.group_size,
+                bin_size=self.bin_size,
+                num_processes=self.num_processes,
+                safe_mode=self.safe_mode,
+            )
+
+            # Map bin indices back to original indices
+            bins = [
+                [indices[b_idx] for b_idx in bin_indices] for bin_indices in all_bins
+            ]
+
+            # Calculate efficiency statistics
+            total_used = lengths.sum()
+            total_slots = len(all_bins) * self.batch_max_len
+
+        # Group bins into batches (each batch contains batch_size bins)
+        batches = [
+            bins[i : i + self.batch_size] for i in range(0, len(bins), self.batch_size)
+        ]
+
+        # Drop last batch if requested and it's incomplete
+        if self.drop_last and batches and len(batches[-1]) < self.batch_size:
+            dropped_batch = batches[-1]
+            batches = batches[:-1]
+            # Remove the dropped batch's bins from total_slots
+            if not self.sequential:
+                total_slots -= len(dropped_batch) * self.batch_max_len
+
+        # Update statistics if requested
+        if set_stats:
+            self.total_tokens_used += total_used
+            self.total_token_slots += total_slots
+
+        self._batches = batches
        return batches

    def __iter__(self):
+        """
+        Return an iterator over batches
+
+        The batches are truncated to match the minimum number of batches across all ranks
+        to ensure distributed training balance
+        """
        batches = self.generate_batches(set_stats=True)
        if self.len_across_ranks:
-            # make sure the batches we iterate over is truncated to the same min length across all ranks
+            # Truncate batches to ensure all ranks have the same number of batches
            batches = batches[: self.len_across_ranks]
        return iter(batches)

-    def num_batches(self):
-        batches = self.generate_batches(set_stats=True)
-        return len(batches)
-
    def efficiency(self):
-        return self.eff_total_used / self.eff_total_slots
+        """
+        Calculate the packing efficiency (ratio of tokens used to total token slots)
+        Higher is better - 1.0 would mean perfect packing with no wasted space
+        """
+        if self.total_token_slots == 0:
+            self.generate_batches(set_stats=True)
+        if self.total_token_slots == 0:
+            return 0.0
+        # Return a Python float instead of potentially a numpy float
+        return float(self.total_tokens_used / self.total_token_slots)

    def gather_efficiency(self):
-        def calc_sample_packing_eff_est(estimates: List[float]):
-            LOG.debug(f"sample_packing_eff_est across ranks: {repr(estimates)}")
-            return math.floor(0.997 * max(estimates))
+        """
+        Gather and synchronize packing efficiency estimates across all distributed ranks
+        Returns a conservative efficiency estimate based on the measurements
+        """

+        def calc_sample_packing_eff_est(estimates: list[float]):
+            LOG.debug(f"sample_packing_eff_est across ranks: {repr(estimates)}")
+            # Use 99.7% of max observed efficiency as a safe estimate
+            max_eff = max(float(eff) for eff in estimates)
+            return math.floor(0.997 * max_eff)
+
+        # Gather efficiency from all ranks and apply the calculation function
        sample_packing_actual_eff_all = reduce_and_broadcast(
-            lambda: self.efficiency(),  # pylint: disable=unnecessary-lambda
+            lambda: float(self.efficiency()),  # pylint: disable=unnecessary-lambda
            calc_sample_packing_eff_est,
        )
+
+        # Quantize to 0.5% intervals for stability
        sample_packing_eff_est = (
            math.ceil(sample_packing_actual_eff_all * 200.0) / 200.0
        )
        return sample_packing_eff_est

    def gather_len_batches(self, num):
+        """
+        Gather and synchronize batch counts across all distributed ranks
+        Returns the minimum number of batches available on any rank
+        """
+
        def calc_min_len(estimates: list[(int, float)]):
            LOG.info(f"gather_len_batches: {repr(estimates)}")
            return math.floor(min(estimates))

+        # Find minimum batch count across ranks to ensure balance
        min_len_batches = reduce_and_broadcast(lambda: num, calc_min_len)
        return min_len_batches

    def __len__(self):
-        if not self.len_across_ranks:
-            len_batches = min(
-                [self.num_batches() for _ in range(self.num_count_samples)]
+        """
+        Return the total number of batches that will be yielded by this sampler
+
+        This is calculated as the minimum number of batches available on any rank
+        to ensure balanced distributed training
+        """
+        if self._batches is None:
+            self._batches = self.generate_batches(set_stats=True)
+
+        if self.len_across_ranks is None:
+            # Batches are cached, so each of the num_count_samples reads returns the same length
+            len_batches = min(  # pylint: disable=consider-using-generator
+                [len(self._batches) for _ in range(self.num_count_samples)]
            )
+            # Gather minimum across all ranks
            self.len_across_ranks = self.gather_len_batches(len_batches)
+
        return self.len_across_ranks
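
Review note: the grouping and efficiency bookkeeping in `generate_batches` and `gather_efficiency` above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the sampler's code: `group_bins`, `packing_efficiency`, and `quantize_up` are hypothetical helper names, and bins are plain lists of sequence indices rather than the output of `pack_parallel`.

```python
import math


def group_bins(bins, batch_size, drop_last=False):
    """Group bins into batches of `batch_size` bins each.

    With drop_last=True, a final batch holding fewer than `batch_size`
    bins is discarded. An empty input yields an empty list.
    """
    batches = [bins[i : i + batch_size] for i in range(0, len(bins), batch_size)]
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches = batches[:-1]
    return batches


def packing_efficiency(lengths, bins, bin_capacity):
    """Ratio of tokens placed in bins to the total token slots of those bins."""
    total_slots = len(bins) * bin_capacity
    if total_slots == 0:
        return 0.0
    total_used = sum(lengths[idx] for bin_indices in bins for idx in bin_indices)
    return float(total_used / total_slots)


def quantize_up(eff):
    """Round an efficiency up to the next 0.5% step, as in the diff above."""
    return math.ceil(eff * 200.0) / 200.0


# Hypothetical data: five sequences packed into three bins of capacity 1024.
lengths = [700, 300, 512, 400, 100]
bins = [[0, 1], [2, 3], [4]]

batches = group_bins(bins, batch_size=2, drop_last=True)
eff = packing_efficiency(lengths, bins, bin_capacity=1024)
print(batches, eff, quantize_up(eff))
```

The guard on an empty `batches` list matters when the sampler yields no indices; indexing `batches[-1]` without it would raise `IndexError`.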
--- a/src/axolotl/utils/schemas/config.py
+++ b/src/axolotl/utils/schemas/config.py
@@ -27,7 +27,7 @@ from axolotl.utils.schemas.datasets import (
    StepwiseSupervisedDataset,
 )
 from axolotl.utils.schemas.deprecated import DeprecatedParameters, RemappedParameters
-from axolotl.utils.schemas.enums import ChatTemplate, RLType
+from axolotl.utils.schemas.enums import ChatTemplate, RingAttnFunc, RLType
 from axolotl.utils.schemas.integrations import (
    CometConfig,
    GradioConfig,
@@ -82,6 +82,7 @@ class AxolotlInputConfig(
    mean_resizing_embeddings: bool | None = False
    # optionally shrink the embeddings when the tokenizer vocab size is smaller
    shrink_embeddings: bool | None = None
+    embeddings_skip_upcast: bool | None = None

    rl: RLType | None = None
    trl: TRLConfig | None = Field(
@@ -177,7 +178,7 @@ class AxolotlInputConfig(

    # torch_dtype: torch.dtype | None

-    gradient_checkpointing: Literal["unsloth", "offload"] | bool | None = Field(
+    gradient_checkpointing: Literal["offload", "offload_disk"] | bool | None = Field(
        default=False
    )
    gradient_checkpointing_kwargs: dict[str, Any] | None = None
@@ -259,7 +260,7 @@ class AxolotlInputConfig(

    sequence_parallel_degree: int | None = None
    heads_k_stride: int | None = None
-    ring_attn_func: str | None = None
+    ring_attn_func: RingAttnFunc | None = None

    special_tokens: SpecialTokensConfig | None = None
    tokens: list[str] | None = None
@@ -435,16 +436,6 @@ class AxolotlInputConfig(
            )
        return data

-    @model_validator(mode="before")
-    @classmethod
-    def check_sample_packing_w_xformers(cls, data):
-        if data.get("sample_packing") and data.get("xformers_attention"):
-            raise ValueError(
-                "sample_packing not compatible with xformers_attention. Use flash_attention"
-            )
-
-        return data
-
    @model_validator(mode="before")
    @classmethod
    # pylint: disable=duplicate-code
@@ -471,9 +462,10 @@ class AxolotlInputConfig(
            and not data.get("flash_attention")
            and not data.get("sdp_attention")
            and not data.get("flex_attention")
+            and not data.get("xformers_attention")
        ):
            LOG.warning(
-                "sample_packing without flash, sdp or flex attention does not handle cross sample decontamination."
+                "sample_packing without flash, sdp, xformers or flex attention does not handle cross sample decontamination."
            )

        return data
@@ -512,10 +504,17 @@ class AxolotlInputConfig(
    @model_validator(mode="before")
    @classmethod
    def hint_sample_packing_padding(cls, data):
-        if data.get("sample_packing") and not data.get("pad_to_sequence_len"):
-            LOG.warning(
-                "`pad_to_sequence_len: true` is recommended when using sample_packing"
-            )
+        if data.get("sample_packing"):
+            pad_to_sequence_len = data.get("pad_to_sequence_len")
+            if pad_to_sequence_len is False:
+                LOG.warning(
+                    "`pad_to_sequence_len: true` is recommended when using sample_packing"
+                )
+            elif pad_to_sequence_len is None:
+                LOG.info(
+                    "Setting `pad_to_sequence_len: true` to prevent memory leaks when sample_packing"
+                )
+                data["pad_to_sequence_len"] = True
        return data

    @model_validator(mode="before")
@@ -783,7 +782,7 @@ class AxolotlInputConfig(

    @model_validator(mode="after")
    def check_simpo_warmup(self):
-        if self.rl == "simpo" and self.warmup_ratio:
+        if self.rl is RLType.SIMPO and self.warmup_ratio:
            raise ValueError(
                "warmup_ratio is not supported with the simpo trainer. Please use `warmup_steps` instead"
            )
@@ -1150,6 +1149,30 @@ class AxolotlInputConfig(

        return data

+    # @model_validator(mode="before")
+    # @classmethod
+    # def check_grpo_peft_liger(cls, data):
+    #     if (
+    #         data.get("rl") == "grpo"
+    #         and data.get("trl", {})
+    #         and data.get("trl").get("use_liger_loss")
+    #         and data.get("adapter")
+    #     ):
+    #         raise ValueError("PEFT + GRPO + Liger is not yet supported")
+    #     return data
+    #
+    @model_validator(mode="before")
+    @classmethod
+    def check_grpo_liger_sequence_parallel(cls, data):
+        if (
+            data.get("rl") == "grpo"
+            and data.get("trl", {})
+            and data.get("trl").get("use_liger_loss")
+            and data.get("sequence_parallel_degree", 1) > 1
+        ):
+            raise ValueError("GRPO + SP + Liger not currently supported")
+        return data
+
    @model_validator(mode="after")
    def check_sequence_parallel_degree(self):
        if not self.sequence_parallel_degree:
@@ -1162,7 +1185,7 @@ class AxolotlInputConfig(

            if self.sample_packing and self.micro_batch_size > 1:
                raise ValueError(
-                    "micro_batch_size must be set to 1 when sample_packing is enabled"
+                    "micro_batch_size must be set to 1 when sample_packing is enabled "
                    "due to a `ring-flash-attn` requirement"
                )

@@ -1194,16 +1217,8 @@ class AxolotlInputConfig(
        if getattr(self, "sequence_parallel_degree", 1) == 1:
            return self

-        from axolotl.monkeypatch.attention.ring_attn.patch import RingAttnFunc
-
        if self.ring_attn_func is not None:
-            valid_funcs = list(RingAttnFunc)
-            if self.ring_attn_func in valid_funcs:
-                self.ring_attn_func = RingAttnFunc(self.ring_attn_func)
-            else:
-                raise ValueError(
-                    f"ring_attn_func: {self.ring_attn_func} must be in {valid_funcs}"
-                )
+            self.ring_attn_func = RingAttnFunc(self.ring_attn_func)
        else:
            # Default ring attention function selection
            sample_packing = getattr(self, "sample_packing", False)
@@ -1315,6 +1330,61 @@ class AxolotlConfigWCapabilities(AxolotlInputConfig):
                    )
        return data

+    @model_validator(mode="before")
+    @classmethod
+    def check_auto_enable_lora_kernels(cls, data):
+        # Only proceed if using LoRA or QLoRA adapter
+        if data.get("rl"):
+            # RL trainers not tested so don't enable kernels by default
+            return data
+        if data.get("adapter") in ["lora", "qlora"]:
+            # Skip if already set, using unsloth optimizations, or using 8-bit
+            unsloth_fields = ["unsloth_lora_mlp", "unsloth_lora_qkv", "unsloth_lora_o"]
+            kernel_fields = ["lora_mlp_kernel", "lora_qkv_kernel", "lora_o_kernel"]
+            if (
+                any(data.get(k) is not None for k in kernel_fields)
+                or any(data.get(k) for k in unsloth_fields)
+                or (data.get("adapter") == "lora" and data.get("load_in_8bit"))
+            ):
+                return data
+
+            # Skip if dropout is not 0, as auto enabling it would just disable it during runtime patch checks
+            if data.get("lora_dropout") != 0:
+                return data
+
+            # Check multi-GPU compatibility
+            capabilities = data.get("capabilities")
+            is_multi_gpu = capabilities and capabilities.get("n_gpu", 0) > 1
+            is_fsdp = data.get("fsdp") is not None
+            is_fsdp2 = (
+                data.get("fsdp_config") is not None
+                and str(data.get("fsdp_config").get("fsdp_version")) == "2"
+            )
+
+            if (
+                not is_multi_gpu
+                or (is_multi_gpu and not is_fsdp)
+                or (is_multi_gpu and is_fsdp2)
+            ):
+                # Auto-enable kernels if not explicitly set by user
+                if data.get("lora_mlp_kernel") is None:
+                    data["lora_mlp_kernel"] = True
+
+                if data.get("lora_qkv_kernel") is None:
+                    data["lora_qkv_kernel"] = True
+
+                if data.get("lora_o_kernel") is None:
+                    data["lora_o_kernel"] = True
+
+                LOG.warning(
+                    "Auto-enabling LoRA kernel optimizations for faster training. "
+                    + "Please explicitly set `lora_*_kernel` config values to `false` to disable. "
+                    + "See https://docs.axolotl.ai/docs/lora_optims.html for more info."
+                )
+
+        return data
+
    @model_validator(mode="before")
    @classmethod
    def check_adopt_torch_version(cls, data):
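
Review note: the `hint_sample_packing_padding` hunk in this file turns `pad_to_sequence_len` into a three-state setting (explicit `False`, explicit `True`, unset). The sketch below restates that branching on a plain dict so the three cases are easy to compare. It is an illustration only: `resolve_padding` is a hypothetical name, and instead of calling the logger it returns the log level that the validator would use.

```python
from typing import Optional, Tuple


def resolve_padding(data: dict) -> Tuple[dict, Optional[str]]:
    """Return an updated copy of the config and the log level that would fire."""
    data = dict(data)  # work on a copy so the caller's dict is untouched
    level = None
    if data.get("sample_packing"):
        pad_to_sequence_len = data.get("pad_to_sequence_len")
        if pad_to_sequence_len is False:
            # Explicit opt-out: keep the user's choice, but warn
            level = "warning"
        elif pad_to_sequence_len is None:
            # Unset: default to True and inform the user
            level = "info"
            data["pad_to_sequence_len"] = True
    return data, level


unset_cfg, unset_level = resolve_padding({"sample_packing": True})
opt_out_cfg, opt_out_level = resolve_padding(
    {"sample_packing": True, "pad_to_sequence_len": False}
)
explicit_cfg, explicit_level = resolve_padding(
    {"sample_packing": True, "pad_to_sequence_len": True}
)
```

Only the unset case mutates the config; an explicit `False` is respected, which is the behavioral change relative to the removed lines.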
--- a/src/axolotl/utils/schemas/enums.py
+++ b/src/axolotl/utils/schemas/enums.py
@@ -6,12 +6,12 @@ from enum import Enum
 class RLType(str, Enum):
    """RL trainer type configuration subset"""

-    dpo = "dpo"  # pylint: disable=invalid-name
-    grpo = "grpo"  # pylint: disable=invalid-name
-    ipo = "ipo"  # pylint: disable=invalid-name
-    orpo = "orpo"  # pylint: disable=invalid-name
-    kto = "kto"  # pylint: disable=invalid-name
-    simpo = "simpo"  # pylint: disable=invalid-name
+    DPO = "dpo"  # pylint: disable=invalid-name
+    GRPO = "grpo"  # pylint: disable=invalid-name
+    IPO = "ipo"  # pylint: disable=invalid-name
+    ORPO = "orpo"  # pylint: disable=invalid-name
+    KTO = "kto"  # pylint: disable=invalid-name
+    SIMPO = "simpo"  # pylint: disable=invalid-name


 class ChatTemplate(str, Enum):
@@ -53,4 +53,16 @@ class CustomSupportedOptimizers(str, Enum):
    ao_adamw_8bit = "ao_adamw_8bit"  # pylint: disable=invalid-name
    ao_adamw_fp8 = "ao_adamw_fp8"  # pylint: disable=invalid-name
    adopt_adamw = "adopt_adamw"  # pylint: disable=invalid-name
+    came_pytorch = "came_pytorch"  # pylint: disable=invalid-name
    muon = "muon"  # pylint: disable=invalid-name
+
+
+class RingAttnFunc(str, Enum):
+    """Enum class for supported `ring-flash-attn` implementations"""
+
+    # VARLEN_RING = "varlen_ring"
+    # VARLEN_ZIGZAG = "varlen_zigzag"
+    VARLEN_LLAMA3 = "varlen_llama3"
+    BATCH_RING = "batch_ring"
+    # BATCH_ZIGZAG = "batch_zigzag"
+    # BATCH_STRIPE = "batch_stripe"
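
Review note: typing `ring_attn_func` as `RingAttnFunc` lets the validator in config.py drop its manual membership check, because constructing a `str`-backed `Enum` from a value already rejects unknown strings. A minimal standalone sketch of that behavior, using a local copy of the enum with only the two enabled members:

```python
from enum import Enum


class RingAttnFunc(str, Enum):
    """Local copy of the enum from the diff, limited to the enabled members."""

    VARLEN_LLAMA3 = "varlen_llama3"
    BATCH_RING = "batch_ring"


# Lookup by value returns the existing member.
func = RingAttnFunc("batch_ring")

# Passing a member back in is a no-op, so re-validation is safe.
same_func = RingAttnFunc(func)

# Thanks to the str mixin, members still compare equal to raw strings.
equals_raw_string = func == "batch_ring"

# A value that is commented out in the enum is rejected with ValueError.
try:
    RingAttnFunc("batch_zigzag")
    rejected = False
except ValueError:
    rejected = True
```

The same mixin is why `data.get("rl") == "grpo"` keeps working in the `mode="before"` validators, while the `mode="after"` check can compare against `RLType.SIMPO` directly.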
--- a/src/axolotl/utils/schemas/training.py
+++ b/src/axolotl/utils/schemas/training.py
@@ -75,8 +75,10 @@ class HyperparametersConfig(BaseModel):
    lr_groups: list[LrGroup] | None = None

    adam_epsilon: float | None = None
+    adam_epsilon2: float | None = None
    adam_beta1: float | None = None
    adam_beta2: float | None = None
+    adam_beta3: float | None = None
    max_grad_norm: float | None = None
    num_epochs: float = Field(default=1.0)

Author	SHA1	Message	Date
Wing Lian	459f407e69	avoid crash/oom on train end	2025-05-15 15:53:35 -04:00
Wing Lian	c0a0c7534c	Activation checkpointing with offloading to disk with prefetch (#2663 ) * offload activations to disk instead of CPU RAM * add prefetch * Disco :dance: * include offload_disk in e2e test for AC * document and make sure to cleanup * fix annotation to match docs * fix docs build * address PR feedback	2025-05-13 16:39:39 -04:00
Wing Lian	7fa1089cea	Atropos support (#2666 ) [skip ci] * allow peft+liger+grpo and custom vllm serve for atropos support * set trainer class for RL	2025-05-13 08:30:58 -04:00
Dan Saunders	80304c26a7	SP GRPO support + batch SP fixes (#2643 ) * ctx manager for SP * updates * update * further simplifying * simplifying * simplifying * reorg * batch api HF adapter for ring-flash-attn; cleanup and improvements * update * adding all batch ring-flash-attn methods via single adapter * fix * fixes for batch API funcs, simplify * fix * grpo sp support * progress * stronger subclassing of TRL GRPO trainer; custom distributed sampler * subclassing constructor * progress * finalizing SP + GRPO trainer * minimize diffs to GRPO trainer * remove (most of) the custom GRPO trainer logic * debug * debug * update * update * update * progress * cleanup * cleanup * minor changes * update * update * update * small changes * updates * cleanup; torch.compile ring_flash_attn functions to prevent numerical instability; lint * spacing * cleanup; log in pydantic model config only on main process * remove comment * fix sp sampler, update to latest upstream code, doc * add docs * update quartodoc autodoc contents * fix, simplifications * fixes + simplifications * review comments * lint * removing main process only logs in favor of #2608 * fixes, additional smoke test * updatse * more tests * update * fix grad accum bug (sort of) * lint, tests * todo	2025-05-12 17:52:40 -04:00
NanoCode012	67c4ea9c7c	fix: disable auto lora kernel if dropout nonzero (#2655 ) [skip ci] * fix: disable auto lora kernel if dropout nonzero * Add comment from PR feedback --------- Co-authored-by: Wing Lian <wing@axolotl.ai>	2025-05-12 16:23:53 -04:00
Wing Lian	526ddb886d	guard on deleting secrets from env (#2653 ) [skip ci]	2025-05-12 14:18:42 -04:00
Wing Lian	f34eef546a	update doc and use P2P=LOC for brittle grpo test (#2649 ) * update doc and skip brittle grpo test * fix the path to run the multigpu tests * increase timeout, use LOC instead of NVL * typo * use hf cache from s3 backed cloudfront * mark grpo as flaky test dues to vllm start	2025-05-12 14:17:25 -04:00
Wing Lian	c7b6790614	Various fixes for CI, save_only_model for RL, prevent packing multiprocessing deadlocks (#2661 ) * lean mistral ft tests, remove e2e torch 2.4.1 test * make sure to pass save_only_model for RL * more tests to make ci leaner, add cleanup to modal ci * fix module for import in e2e tests * use mp spawn to prevent deadlocks with packing * make sure cleanup shell script is executable when cloned out	2025-05-12 10:51:18 -04:00
Dan Saunders	47e0e71bc8	don't sort multipack sampler (#2657 ) * don't sort multipack sampler * increased packing efficiency increases loss --------- Co-authored-by: Wing Lian <wing@axolotl.ai>	2025-05-09 20:28:58 -04:00
Wing Lian	0f3587174d	swap tinymodels that have safetensors for some ci tests (#2641 )	2025-05-07 15:06:07 -04:00
xzuyn	25e6c5f9bd	Add CAME Optimizer (#2385 )	2025-05-07 10:31:46 -04:00
NanoCode012	32f51bca35	fix(doc): clarify instruction to delinearize llama4 similar to cli doc (#2644 ) [skip ci]	2025-05-07 10:29:47 -04:00
NanoCode012	9daa04da90	Fix: improve error message on failed dataset load (#2637 ) [skip ci] * fix(log): clarify error on dataset loading failed * fix: add path for easy tracking of broken config * fix: improve error message based on pr feedback	2025-05-07 10:29:05 -04:00
Wing Lian	0d71b0aa5f	Configurable embeddings upcast (#2621 ) * fsdp embeddings should be float32 per comment * patch peft to not upcast everything * add tabs back to code check * fix import * add configurable option and fix check * add check for dtypes * move embeddings test to patch dir * fix test * fix comment and logic	2025-05-06 23:40:44 -04:00
Eric Meier	63aaccf85b	Fix cut_cross_entropy plugin install (#2642 ) [skip ci]	2025-05-06 22:56:00 -04:00
Wing Lian	ff0fe767c8	xformers attention with packing (#2619 ) * xformers attention with packing * wire up the patch * fix xformers + packing validation * fix warning * reorder the packing check * fix fp16 / bf16 reset when using fp16 with bf16 auto * fix seq lens calc to drop hanging sequences * handle xformers patch for inference too * fix batch size setter * fix xformers inference * add colab callback to fix inference post train * PR feedback	2025-05-06 22:49:22 -04:00
Wing Lian	8e4158cc0b	Multipack parallel bin packing (#2631 ) * improve readability of multipack sampler * parallel bin packing fix error with lambda and pickling make sure things are in float instead of np.float * annotations and comments update * support for configurable group and bin size for sample packing * fix missing map back to original indices	2025-05-06 20:08:08 -04:00
Wing Lian	cd84325253	allow plugins to return their own dataset (#2617 ) [skip ci] * allow plugins to return their own dataset * add post_trainer_create and wire up * add hook check * address PR feedback: * remove annotation causing circular import	2025-05-06 20:05:51 -04:00
NanoCode012	0b140fef83	feat(doc): add split_thinking docs (#2613 ) [skip ci] * feat(doc): add split_thinking docs * fix: link config.qmd to conversation.qmd for split_thinking example * update thinking => reasoning_content in messages format --------- Co-authored-by: Wing Lian <wing@axolotl.ai>	2025-05-06 20:05:32 -04:00
Wing Lian	e4cfebe995	bump liger dep to 0.5.9 (#2640 ) [skip ci] * bump liger dep to 0.5.9 * also upgrade vllm to post1, and datasets to 3.5.1	2025-05-06 20:05:19 -04:00
mhenrichsen	a6cac5dd32	Update lr_scheduler options in config.qmd to include additional scheduling strategies for improved training flexibility. (#2636 ) [skip ci]	2025-05-06 11:24:07 -04:00
Wing Lian	b71c0e3447	Print axolotl art if train is called outside of cli: (#2627 ) [skip ci]	2025-05-06 11:18:45 -04:00
Wing Lian	ddaebf8309	fix dpo eval override to call grandparent instead of the broken super (#2628 ) [skip ci]	2025-05-06 11:18:25 -04:00
Wing Lian	679743087a	make sure gc_steps is used for all trainers (#2638 )	2025-05-06 11:18:00 -04:00
Wing Lian	f720b6e72d	repop cache (#2639 ) * repop cache * pre-cache as a step * fix the name * add reason for pytest skipif * restore pytorch matrix * remove max-parallel now that we've optimized this a bit	2025-05-06 11:09:07 -04:00
mhenrichsen	a980618fd0	Adds example for training a TTS model on top of a LLM. (#2614 ) * Adds example for training a TTS model on top of a LLM. * Update examples/orpheus/finetune.yml Co-authored-by: NanoCode012 <nano@axolotl.ai> * Update examples/orpheus/finetune.yml Co-authored-by: NanoCode012 <nano@axolotl.ai> * Update README.md to clarify GPU requirements for finetuning Orpheus TTS model * Update finetune.yml to use the new base model canopylabs/orpheus-3b-0.1-pretrained * Update finetune.yml and README.md for consistency and clarity --------- Co-authored-by: NanoCode012 <nano@axolotl.ai>	2025-05-06 10:11:06 +02:00
Emmanuel Ferdman	54960d4de0	Fix logging deprecation warnings (#2623 ) Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>	2025-05-04 08:22:45 -04:00
Wing Lian	ed922796b7	include multipack support for qwen3 family (#2622 )	2025-05-03 12:02:39 -04:00
Wing Lian	3dd9c3bf3f	setup hf transfer too and fix auto bf16 when fp16 enabled (#2620 ) [skip ci]	2025-05-03 12:02:26 -04:00
Wing Lian	0ba7d362fa	qwen3 and qwen3_moe support for liger kernels (#2612 ) * qwen3 and qwen3_moe support for liger kernels * fix moe module path * fix: qwen3 liger input args and mlp * fix: qwen3 input args and output class --------- Co-authored-by: NanoCode012 <nano@axolotl.ai>	2025-05-02 09:29:55 -04:00
aitechguy	e4f73bc98e	remove keys to incoporate changes for the trl update (#2616 )	2025-05-02 08:47:42 -04:00
Wing Lian	bcb59c70e2	automatically set pad_to_sequence_len when use packing (#2607 ) * automatically set pad_to_sequence_len when use packing * update tests	2025-05-01 13:24:38 -04:00
NanoCode012	6a3e6f8c53	fix: run preview-docs only when md/qmd changes (#2606 ) * fix: run preview-docs only when md/qmd changes * feat: add quarto yaml based on PR feedback	2025-05-01 13:21:28 -04:00
Wing Lian	fee3c13bb5	Logging config for colab (#2611 ) * only configure logging on cli to play nicely with colab * allow reloading the config on the fly from a dict * make sure to use dict for yaml * reuse existing function for load * make cli args optional * mps fix and respect max_steps	2025-05-01 12:58:00 -04:00
Rahul Tuli	996fc124e5	Add: Sparse Finetuning Integration with llmcompressor (#2479 ) * Add: SFTPlugin with llmcompressor * Update: review comments! * Add:llmcompressor instalable * pre commit hooks * Use: warning over warn * Revert: TODO's * Update llmcompressor version to latest * Apply suggestions from @markurtz Co-authored-by: Mark Kurtz <mark.j.kurtz@gmail.com> * Address review comments from @markurtz * Add: llcompressor installable * Rename: sft.yaml to sparse-finetuning.yaml * Use: absolute import * Update model config * Move: LLMCompressorPlugin into it's own submodule * Add: `llm_compressor` integration documentation * Rebase and updates! * Tests, Style, Updates * Add: .qmd file * Address Review Comments: * deleted redundant docs/llm_compressor.qmd * incorporated feedback in integration README.md * added llmcompressor integration to docs/custom_integrations.qmd Signed-off-by: Rahul Tuli <rtuli@redhat.com> * Add: line about further optimizations using llmcompressor Signed-off-by: Rahul Tuli <rtuli@redhat.com> * Apply patch from @winglian Signed-off-by: Rahul Tuli <rtuli@redhat.com> * Fix: Test Signed-off-by: Rahul Tuli <rtuli@redhat.com> * additional fixes for docker and saving compressed * split llmcompressor from vllm checks * Reset session between tests Signed-off-by: Rahul Tuli <rtuli@redhat.com> * move decorator to test method instead of class * make sure to reset the session after each test * move import of llmcompressor to reset session inside test --------- Signed-off-by: Rahul Tuli <rtuli@redhat.com> Co-authored-by: Mark Kurtz <mark.j.kurtz@gmail.com> Co-authored-by: Wing Lian <wing@axolotl.ai>	2025-05-01 12:25:16 -04:00
Wing Lian	e963990ad7	add missing __init__ for lr monkeypatch fix (#2609 )	2025-05-01 09:41:32 -04:00
Dhruv Mullick	c3f2b1c5c2	Add num_completions_to_print for trl and grpo (#2604 )	2025-04-30 21:00:30 -04:00
Wing Lian	6ba5c0ed2c	use latest hf-xet and don't install vllm for torch 2.7.0 (#2603 ) * use latest hf-xet and don't install vllm for torch 2.7.0 * fix runpod hub tests	2025-04-30 18:27:39 -04:00
Wing Lian	24ff5f53f8	additional args for grpo config/trainer (#2598 )	2025-04-30 13:11:12 -04:00
Wing Lian	5e949eaa07	replace zero_only with simpler if statement (#2592 )	2025-04-30 13:11:03 -04:00
Wing Lian	89ca14d9a0	ensure we pass axolotl extras to the Dockerfile so vllm is included in shipped images (#2599 )	2025-04-30 11:35:45 -04:00
Wing Lian	8446b4ad28	don't automatically enable lora kernels for RL training (#2600 )	2025-04-30 11:06:50 -04:00
Wing Lian	fc79606b6d	only import vllm serve cli if its being called (#2597 ) [skip ci]	2025-04-30 09:11:25 -04:00
Wing Lian	baeb00231b	Handle other reasoning trace dataset formats (#2591 ) * Handle other reasoning trace dataset formats * rename var to improve readability * chore: refactor with comments --------- Co-authored-by: NanoCode012 <nano@axolotl.ai>	2025-04-30 03:32:55 -04:00
Wing Lian	2413688b08	upload the deepspeed json to wandb (#2593 ) [skip ci]	2025-04-30 03:32:44 -04:00
NanoCode012	5bb1f3da56	feat: add qwen3 moe block for ds3 (#2596 ) [skip ci]	2025-04-30 03:32:23 -04:00
Wing Lian	a21b9cc472	patch to convert LR from tensor to float when using DS (#2595 ) [skip ci]	2025-04-30 03:31:57 -04:00
Aleksandr Dremov	41a1ec0c95	Plugins create_lr_scheduler support (#2584 ) * lr_scheduler support * fix * Update scheduler.py * Update scheduler.py * cfg handling * black * remove debug * remove adding the axolotl cfg to the scheduler mixin --------- Co-authored-by: Wing Lian <wing@axolotl.ai>	2025-04-29 17:08:30 -04:00
Dan Saunders	ecac731922	auto-enable lora kernels where possible (#2589 ) * auto-enable lora kernels where possible * test * revert change to example yaml * naming * remove print * slight logic change	2025-04-29 16:18:49 -04:00
NanoCode012	742fef4200	fix(doc): key used to point to url in multimodal doc (#2575 ) [skip ci]	2025-04-29 15:10:59 -04:00
Wing Lian	a39caf8824	bump vllm==0.8.5 for qwen3 support (#2583 ) [skip ci]	2025-04-29 15:10:40 -04:00
Wing Lian	07e4f2e25b	support for qwen3 with lora kernels (#2588 ) * support for qwen3 with lora kernels * fix patch * typo	2025-04-29 15:02:49 -04:00
Dan Saunders	c7d07de6b4	Fix eval + add smoke test (#2586 ) * fix evaluate CLI * add smoke test * fix naming * lint	2025-04-29 12:58:54 -04:00
Wing Lian	6565ae85d8	set config on the PluginManager for callback access (#2587 )	2025-04-29 12:05:44 -04:00
Wing Lian	80b4edb4a7	Post release fixes (#2581 ) * fix missing kwarg on child * make the runpod test shorter * update docs * rename runpod test json file * typing fixes and ordering of doc	2025-04-29 10:01:38 -04:00
Wing Lian	fedbcc0254	remove torch 2.4.1 CI as part of support deprecation (#2582 )	2025-04-29 08:28:32 -04:00
Wing Lian	8175896ada	add dev tag for v0.10.0.dev0 (#2580 )	2025-04-28 20:30:14 -04:00