improve handling of train len

default to dropping last batch in multipack batch sampler
fix rebase issues
2025-06-06 22:07:29 -07:00 · 2025-06-05 16:00:24 -07:00 · 2025-06-05 15:31:28 -07:00 · 2025-06-05 15:29:21 -07:00 · 2025-06-05 15:29:20 -07:00 · 2025-06-05 15:29:20 -07:00
230 changed files with 6761 additions and 3575 deletions
--- a/.github/workflows/base.yml
+++ b/.github/workflows/base.yml
@@ -17,7 +17,7 @@ jobs:
  build-base:
    if: github.repository_owner == 'axolotl-ai-cloud'
    # this job needs to be run on self-hosted GPU runners...
-    runs-on: axolotl-gpu-runner
+    runs-on: ubuntu-latest-m
    strategy:
      fail-fast: false
      matrix:
@@ -28,42 +28,50 @@ jobs:
            python_version: "3.11"
            pytorch: 2.5.1
            torch_cuda_arch_list: "7.0 7.5 8.0 8.6 8.7 8.9 9.0+PTX"
            dockerfile: "Dockerfile-base"
          - cuda: "124"
            cuda_version: 12.4.1
            cudnn_version: ""
            python_version: "3.11"
            pytorch: 2.6.0
            torch_cuda_arch_list: "7.0 7.5 8.0 8.6 8.7 8.9 9.0+PTX"
            dockerfile: "Dockerfile-base"
          - cuda: "126"
            cuda_version: 12.6.3
            cudnn_version: ""
            python_version: "3.11"
            pytorch: 2.6.0
            torch_cuda_arch_list: "7.0 7.5 8.0 8.6 8.7 8.9 9.0+PTX"
            dockerfile: "Dockerfile-base"
          - cuda: "126"
            cuda_version: 12.6.3
            cudnn_version: ""
            python_version: "3.11"
            pytorch: 2.7.0
            torch_cuda_arch_list: "7.0 7.5 8.0 8.6 8.7 8.9 9.0+PTX"
            dockerfile: "Dockerfile-base"
          - cuda: "128"
            cuda_version: 12.8.1
            cudnn_version: ""
            python_version: "3.11"
            pytorch: 2.7.0
            torch_cuda_arch_list: "7.0 7.5 8.0 8.6 8.7 8.9 9.0+PTX"
            dockerfile: "Dockerfile-base"
          - cuda: "128"
            cuda_version: 12.8.1
            cudnn_version: ""
            python_version: "3.11"
            pytorch: nightly
            torch_cuda_arch_list: "7.0 7.5 8.0 8.6 8.7 8.9 9.0+PTX"
-          - cuda: "128"
-            cuda_version: 12.8.1
-            cudnn_version: ""
-            python_version: "3.11"
-            pytorch: next
-            torch_cuda_arch_list: "7.0 7.5 8.0 8.6 8.7 8.9 9.0+PTX"
+            dockerfile: "Dockerfile-base-nightly"
+#          # "next" is for release candidates of pytorch
+#          - cuda: "128"
+#            cuda_version: 12.8.1
+#            cudnn_version: ""
+#            python_version: "3.11"
 #            pytorch: next
 #            torch_cuda_arch_list: "7.0 7.5 8.0 8.6 8.7 8.9 9.0+PTX"
 #            dockerfile: "Dockerfile-base-next"
    steps:
      - name: Checkout
        uses: actions/checkout@v4
@@ -85,7 +93,59 @@ jobs:
        uses: docker/build-push-action@v4
        with:
          context: .
-          file: ${{ matrix.pytorch == 'nightly' && './docker/Dockerfile-base-nightly' || matrix.pytorch == 'next' && './docker/Dockerfile-base-next' || './docker/Dockerfile-base' }}
+          file: ./docker/${{ matrix.dockerfile }}
          push: ${{ github.event_name != 'pull_request' }}
          tags: ${{ steps.metadata.outputs.tags }}-base-py${{ matrix.python_version }}-cu${{ matrix.cuda }}-${{ matrix.pytorch }}${{ matrix.axolotl_extras != '' && '-' || '' }}${{ matrix.axolotl_extras }}
          labels: ${{ steps.metadata.outputs.labels }}
          build-args: |
            CUDA_VERSION=${{ matrix.cuda_version }}
            CUDNN_VERSION=${{ matrix.cudnn_version }}
            CUDA=${{ matrix.cuda }}
            PYTHON_VERSION=${{ matrix.python_version }}
            PYTORCH_VERSION=${{ matrix.pytorch }}
            TORCH_CUDA_ARCH_LIST=${{ matrix.torch_cuda_arch_list }}
  build-base-uv:
    if: github.repository_owner == 'axolotl-ai-cloud'
    runs-on: ubuntu-latest-m
    strategy:
      fail-fast: false
      matrix:
        include:
          - cuda: "126"
            cuda_version: 12.6.3
            cudnn_version: ""
            python_version: "3.11"
            pytorch: 2.6.0
            torch_cuda_arch_list: "7.0 7.5 8.0 8.6 8.7 8.9 9.0+PTX"
            dockerfile: "Dockerfile-uv-base"
          - cuda: "128"
            cuda_version: 12.8.1
            cudnn_version: ""
            python_version: "3.11"
            pytorch: 2.7.0
            torch_cuda_arch_list: "7.0 7.5 8.0 8.6 8.7 8.9 9.0+PTX"
            dockerfile: "Dockerfile-uv-base"
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Docker metadata
        id: metadata
        uses: docker/metadata-action@v5
        with:
          images: |
            axolotlai/axolotl-base-uv
      - name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Build
        uses: docker/build-push-action@v4
        with:
          context: .
          file: ./docker/${{ matrix.dockerfile }}
          push: ${{ github.event_name != 'pull_request' }}
          tags: ${{ steps.metadata.outputs.tags }}-base-py${{ matrix.python_version }}-cu${{ matrix.cuda }}-${{ matrix.pytorch }}${{ matrix.axolotl_extras != '' && '-' || '' }}${{ matrix.axolotl_extras }}
          labels: ${{ steps.metadata.outputs.labels }}
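The `file:` change in this workflow replaces a chained `&&`/`||` expression keyed on `matrix.pytorch` with a per-entry `dockerfile` field. A minimal Python sketch of the two lookups follows; the helper names `old_dockerfile` and `new_dockerfile` are hypothetical and exist only for illustration. It suggests that the explicit field reproduces the old mapping for the existing entries while letting a new Dockerfile such as `Dockerfile-uv-base` be added without touching the expression.

```python
def old_dockerfile(pytorch: str) -> str:
    """Mirror of the removed expression: derive the Dockerfile from the PyTorch version."""
    if pytorch == "nightly":
        return "./docker/Dockerfile-base-nightly"
    if pytorch == "next":
        return "./docker/Dockerfile-base-next"
    return "./docker/Dockerfile-base"


def new_dockerfile(entry: dict) -> str:
    """Mirror of the new expression: each matrix entry names its own Dockerfile."""
    return f"./docker/{entry['dockerfile']}"


# Matrix entries copied from the diff (only the fields used here).
matrix = [
    {"pytorch": "2.7.0", "dockerfile": "Dockerfile-base"},
    {"pytorch": "nightly", "dockerfile": "Dockerfile-base-nightly"},
    {"pytorch": "2.7.0", "dockerfile": "Dockerfile-uv-base"},
]

for entry in matrix:
    print(entry["pytorch"], "->", new_dockerfile(entry))
```

The third entry is the case the old expression could not express: the same PyTorch version built from a different Dockerfile.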
--- a/.github/workflows/lint.yml
+++ b/.github/workflows/lint.yml
@@ -9,6 +9,7 @@ on:
       - '.github/workflows/*.yml'
       - "*.[q]md"
       - "examples/**/*.y[a]?ml"
       - ".pre-commit-config.yaml"
  workflow_dispatch:
 jobs:
--- a/.github/workflows/multi-gpu-e2e.yml
+++ b/.github/workflows/multi-gpu-e2e.yml
@@ -59,7 +59,7 @@ jobs:
      - name: Install Modal
        run: |
          python -m pip install --upgrade pip
-          pip install modal==0.71.8 jinja2
+          pip install modal==1.0.2 jinja2
      - name: Update env vars
        run: |
          echo "BASE_TAG=main-base-py${{ matrix.python_version }}-cu${{ matrix.cuda }}-${{ matrix.pytorch }}" >> $GITHUB_ENV
--- a/.github/workflows/precommit-autoupdate.yml
+++ b/.github/workflows/precommit-autoupdate.yml
@@ -25,7 +25,6 @@ jobs:
          pre-commit autoupdate
          if [[ -n $(git status --porcelain) ]]; then
            echo "changes=true" >> $GITHUB_OUTPUT
            git diff .pre-commit-config.yaml > pre-commit-update.diff
          fi
      - name: Create Pull Request
@@ -39,11 +38,3 @@ jobs:
          commit-message: "chore: update pre-commit hooks"
          body: |
            Automated PR to update pre-commit hooks to their latest versions.
            <details>
            <summary>Changes:</summary>
            ```diff
            ${{ steps.update.outputs.diff }}
            ```
            </details>
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -44,98 +44,6 @@ jobs:
        env:
          SKIP: no-commit-to-branch
 #  preload-cache:
 #    name: Preload HF cache
 #    runs-on: ubuntu-latest
 #    strategy:
 #      fail-fast: false
 #      matrix:
 #        python_version: ["3.11"]
 #        pytorch_version: ["2.6.0"]
 #    timeout-minutes: 20
 #
 #    env:
 #      AXOLOTL_IS_CI_CACHE_PRELOAD: "1"
 #
 #    steps:
 #      - name: Check out repository code
 #        uses: actions/checkout@v4
 #
 #      - name: Restore HF cache
 #        id: hf-cache-restore
 #        uses: actions/cache/restore@v4
 #        with:
 #          path: |
 #            /home/runner/.cache/huggingface/hub/datasets--*
 #            /home/runner/.cache/huggingface/hub/models--*
 #          key: ${{ runner.os }}-hf-hub-cache-v2
 #
 #      - name: Restore Cache from S3
 #        id: hf-cache-restore-s3
 #        run: |
 #          mkdir -p /home/runner/.cache/huggingface/hub
 #          curl -L https://d1dttdx32dkk5p.cloudfront.net/hf-cache.tar.zst | tar -xf - -C /home/runner/.cache/huggingface/hub/  --use-compress-program unzstd
 #
 #      - name: Setup Python
 #        uses: actions/setup-python@v5
 #        with:
 #          python-version: ${{ matrix.python_version }}
 #          cache: 'pip' # caching pip dependencies
 #
 #      - name: upgrade pip
 #        run: |
 #          pip3 install --upgrade pip
 #          pip3 install --upgrade packaging==23.2 setuptools==75.8.0 wheel
 #
 #      - name: Install PyTorch
 #        run: |
 #          pip3 install torch==${{ matrix.pytorch_version }}
 #
 #      - name: Install dependencies
 #        run: |
 #          pip3 show torch
 #          pip3 install --no-build-isolation -U -e .
 #          python scripts/unsloth_install.py | sh
 #          python scripts/cutcrossentropy_install.py | sh
 #          pip3 install -r requirements-dev.txt -r requirements-tests.txt
 #
 #      - name: Make sure PyTorch version wasn't clobbered
 #        run: |
 #          python -c "import torch; assert '${{ matrix.pytorch_version }}' in torch.__version__"
 #
 #      - name: Ensure axolotl CLI was installed
 #        run: |
 #          axolotl --help
 #
 #      - name: Pre-Download dataset fixture
 #        run: |
 #          huggingface-cli download --repo-type=dataset axolotl-ai-internal/axolotl-oss-dataset-fixtures
 #
 #      - name: Run tests
 #        run: |
 #          pytest -v tests/conftest.py
 #
 #      - name: Upload coverage to Codecov
 #        uses: codecov/codecov-action@v5
 #        with:
 #          token: ${{ secrets.CODECOV_TOKEN }}
 #          files: ./coverage.xml
 #          flags: unittests,pytorch-${{ matrix.pytorch_version }}
 #          fail_ci_if_error: false
 #
 #      - name: cleanup pip cache
 #        run: |
 #          find "$(pip cache dir)/http-v2" -type f -mtime +14 -exec rm {} \;
 #
 #      - name: Save HF cache
 #        id: hf-cache
 #        uses: actions/cache/save@v4
 #        with:
 #          path: |
 #            /home/runner/.cache/huggingface/hub/datasets--*
 #            /home/runner/.cache/huggingface/hub/models--*
 #          key: ${{ steps.hf-cache-restore.outputs.cache-primary-key }}
  pytest:
    name: PyTest
    runs-on: ubuntu-latest
@@ -151,15 +59,6 @@ jobs:
      - name: Check out repository code
        uses: actions/checkout@v4
 #      - name: Restore HF cache
 #        id: hf-cache-restore
 #        uses: actions/cache/restore@v4
 #        with:
 #          path: |
 #            /home/runner/.cache/huggingface/hub/datasets--*
 #            /home/runner/.cache/huggingface/hub/models--*
 #          key: ${{ runner.os }}-hf-hub-cache-v2
      - name: Restore Cache from S3
        id: hf-cache-restore-s3
        run: |
@@ -222,7 +121,6 @@ jobs:
  pytest-sdist:
    name: PyTest from Source Dist
    runs-on: ubuntu-latest
 #    needs: [preload-cache]
    strategy:
      fail-fast: false
      matrix:
@@ -234,15 +132,6 @@ jobs:
      - name: Check out repository code
        uses: actions/checkout@v4
 #      - name: Restore HF cache
 #        id: hf-cache-restore
 #        uses: actions/cache/restore@v4
 #        with:
 #          path: |
 #            /home/runner/.cache/huggingface/hub/datasets--*
 #            /home/runner/.cache/huggingface/hub/models--*
 #          key: ${{ runner.os }}-hf-hub-cache-v2
      - name: Restore Cache from S3
        id: hf-cache-restore-s3
        run: |
@@ -312,6 +201,13 @@ jobs:
            pytorch: 2.6.0
            num_gpus: 1
            axolotl_extras: vllm
          - cuda: 126
            cuda_version: 12.6.3
            python_version: "3.11"
            pytorch: 2.6.0
            num_gpus: 1
            axolotl_extras:
            dockerfile: "Dockerfile-uv.jinja"
    steps:
      - name: Checkout
        uses: actions/checkout@v4
@@ -322,7 +218,7 @@ jobs:
      - name: Install Modal
        run: |
          python -m pip install --upgrade pip
-          pip install modal==0.71.8 jinja2
+          pip install modal==1.0.2 jinja2
      - name: Update env vars
        run: |
          echo "BASE_TAG=main-base-py${{ matrix.python_version }}-cu${{ matrix.cuda }}-${{ matrix.pytorch }}" >> $GITHUB_ENV
@@ -333,6 +229,7 @@ jobs:
          echo "MODAL_IMAGE_BUILDER_VERSION=2024.10" >> $GITHUB_ENV
          echo "N_GPUS=${{ matrix.num_gpus }}" >> $GITHUB_ENV
          echo "CODECOV_TOKEN=${{ secrets.CODECOV_TOKEN }}" >> $GITHUB_ENV
          echo "E2E_DOCKERFILE=${{ matrix.dockerfile || 'Dockerfile.jinja'}}" >> $GITHUB_ENV
      - name: Run tests job on Modal
        run: |
          modal run cicd.e2e_tests
@@ -384,7 +281,7 @@ jobs:
      - name: Install Modal
        run: |
          python -m pip install --upgrade pip
-          pip install modal==0.71.8 jinja2
+          pip install modal==1.0.2 jinja2
      - name: Update env vars
        run: |
          echo "BASE_TAG=main-base-py${{ matrix.python_version }}-cu${{ matrix.cuda }}-${{ matrix.pytorch }}" >> $GITHUB_ENV
@@ -395,6 +292,7 @@ jobs:
          echo "MODAL_IMAGE_BUILDER_VERSION=2024.10" >> $GITHUB_ENV
          echo "N_GPUS=${{ matrix.num_gpus }}" >> $GITHUB_ENV
          echo "CODECOV_TOKEN=${{ secrets.CODECOV_TOKEN }}" >> $GITHUB_ENV
          echo "E2E_DOCKERFILE=${{ matrix.dockerfile || 'Dockerfile.jinja'}}" >> $GITHUB_ENV
      - name: Run tests job on Modal
        run: |
          modal run cicd.e2e_tests
@@ -424,7 +322,7 @@ jobs:
      - name: Install Modal
        run: |
          python -m pip install --upgrade pip
-          pip install modal==0.71.8 jinja2
+          pip install modal==1.0.2 jinja2
      - name: Update env vars
        run: |
          echo "BASE_TAG=main-base-py${{ matrix.python_version }}-cu${{ matrix.cuda }}-${{ matrix.pytorch }}" >> $GITHUB_ENV
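The `BASE_TAG` exported in these jobs follows the pattern `main-base-py<python>-cu<cuda>-<pytorch>`. A small Python sketch of that composition is given here; the function name `base_tag` and its `prefix` parameter are assumptions made for illustration and do not appear in the workflow.

```python
def base_tag(python_version: str, cuda: str, pytorch: str, prefix: str = "main") -> str:
    """Compose a base image tag in the same shape the workflow writes to GITHUB_ENV."""
    return f"{prefix}-base-py{python_version}-cu{cuda}-{pytorch}"


# Values taken from the matrix entries and defaults that appear in this diff.
print(base_tag("3.11", "124", "2.5.1"))
print(base_tag("3.11", "126", "2.6.0"))
```

The first call yields the same string as the new default `BASE_TAG` in `cicd/multigpu.py` and `cicd/single_gpu.py`.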
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -19,15 +19,15 @@ repos:
    hooks:
      - id: isort
 -   repo: https://github.com/PyCQA/flake8
-    rev: 7.1.2
+    rev: 7.2.0
    hooks:
    - id: flake8
 -   repo: https://github.com/pylint-dev/pylint
-    rev: v3.3.6
+    rev: v3.3.7
    hooks:
    - id: pylint
 -   repo: https://github.com/pre-commit/mirrors-mypy
-    rev: v1.15.0
+    rev: v1.16.0
    hooks:
    - id: mypy
      additional_dependencies:
--- a/.runpod/src/config/config.yaml
+++ b/.runpod/src/config/config.yaml
@@ -242,16 +242,12 @@
 # early_stopping_patience: 3
 # # Specify a scheduler and kwargs to use with the optimizer
-# lr_scheduler: # 'one_cycle' | 'log_sweep' | empty for cosine
+# lr_scheduler: # 'one_cycle' | empty for cosine
 # lr_scheduler_kwargs:
 # # For one_cycle optim
 # lr_div_factor: # Learning rate div factor
 # # For log_sweep optim
 # log_sweep_min_lr:
 # log_sweep_max_lr:
 # # Specify optimizer
 # # Valid values are driven by the Transformers OptimizerNames class, see:
 # # https://github.com/huggingface/transformers/blob/95b374952dc27d8511541d6f5a4e22c9ec11fb24/src/transformers/training_args.py#L134
--- a/README.md
+++ b/README.md
@@ -51,7 +51,7 @@ Features:
 - NVIDIA GPU (Ampere or newer for `bf16` and Flash Attention) or AMD GPU
 - Python 3.11
- PyTorch ≥2.4.1
+- PyTorch ≥2.5.1
 ### Installation
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -17,7 +17,9 @@ quartodoc:
        - convert
        - prompt_tokenizers
        - logging_config
-        - core.trainer_builder
+        - core.builders.base
        - core.builders.causal
        - core.builders.rl
        - core.training_args
        - core.chat.messages
        - core.chat.format.chatml
@@ -43,6 +45,7 @@ quartodoc:
        - cli.vllm_serve
        - cli.cloud.base
        - cli.cloud.modal_
        - cli.quantize
    - title: Trainers
      desc: Training implementations
      contents:
@@ -54,6 +57,15 @@ quartodoc:
        - core.trainers.grpo.trainer
        - core.trainers.grpo.sampler
        - core.trainers.utils
    - title: Model Loading
      desc: Functionality for loading and patching models, tokenizers, etc.
      contents:
        - loaders.model
        - loaders.tokenizer
        - loaders.processor
        - loaders.adapter
        - loaders.patch_manager
        - loaders.constants
    - title: Mixins
      desc: Mixin classes for augmenting trainers
      contents:
@@ -117,17 +129,16 @@ quartodoc:
        - monkeypatch.trainer_fsdp_optim
        - monkeypatch.transformers_fa_utils
        - monkeypatch.unsloth_
        - monkeypatch.attention.mllama
        - monkeypatch.data.batch_dataset_fetcher
        - monkeypatch.mixtral
        - monkeypatch.gradient_checkpointing.offload_cpu
        - monkeypatch.gradient_checkpointing.offload_disk
    - title: Utils
      desc: Utility functions
      contents:
        - utils.models
        - utils.tokenization
        - utils.chat_templates
        - utils.lora
        - utils.lora_embeddings
        - utils.model_shard_quant
        - utils.bench
        - utils.freeze
@@ -138,8 +149,7 @@ quartodoc:
        - utils.optimizers.adopt
        - utils.data.pretraining
        - utils.data.sft
-        - utils.gradient_checkpointing.offload_cpu
+        - utils.quantization
        - utils.gradient_checkpointing.offload_disk
    - title: Schemas
      desc: Pydantic data models for Axolotl config
      contents:
@@ -189,12 +199,14 @@ quartodoc:
        - utils.callbacks.lisa
        - utils.callbacks.mlflow_
        - utils.callbacks.comet_
-
+        - utils.callbacks.qat
 website:
  title: "Axolotl"
  description: "We make fine-tuning accessible, scalable, and fun"
  favicon: favicon.jpg
  google-analytics: "G-9KYCVJBNMQ"
  navbar:
    logo: image/axolotl_logo_digital_white.svg
    title: false
@@ -247,6 +259,8 @@ website:
            - docs/lr_groups.qmd
            - docs/lora_optims.qmd
            - docs/dataset_loading.qmd
            - docs/qat.qmd
            - docs/quantize.qmd
        - section: "Core Concepts"
          contents:
--- a/cicd/Dockerfile-uv.jinja
+++ b/cicd/Dockerfile-uv.jinja
@@ -0,0 +1,52 @@
 FROM axolotlai/axolotl-base-uv:{{ BASE_TAG }}
 ENV TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6 9.0+PTX"
 ENV AXOLOTL_EXTRAS="{{ AXOLOTL_EXTRAS }}"
 ENV AXOLOTL_ARGS="{{ AXOLOTL_ARGS }}"
 ENV CUDA="{{ CUDA }}"
 ENV PYTORCH_VERSION="{{ PYTORCH_VERSION }}"
 ENV GITHUB_REF="{{ GITHUB_REF }}"
 ENV GITHUB_SHA="{{ GITHUB_SHA }}"
 ENV NIGHTLY_BUILD="{{ NIGHTLY_BUILD }}"
 ENV HF_HOME="{{ HF_HOME }}"
 RUN apt-get update && \
    apt-get install -y --allow-change-held-packages vim curl nano libnccl2 libnccl-dev
 WORKDIR /workspace
 RUN git clone --depth=1 https://github.com/axolotl-ai-cloud/axolotl.git
 WORKDIR /workspace/axolotl
 RUN git fetch origin +$GITHUB_REF && \
    git checkout FETCH_HEAD
 # If AXOLOTL_EXTRAS is set, append it in brackets
 RUN if [ "$NIGHTLY_BUILD" = "true" ] ; then \
        sed -i 's#^transformers.*#transformers @ git+https://github.com/huggingface/transformers.git@main#' requirements.txt; \
        sed -i 's#^peft.*#peft @ git+https://github.com/huggingface/peft.git@main#' requirements.txt; \
        sed -i 's#^accelerate.*#accelerate @ git+https://github.com/huggingface/accelerate.git@main#' requirements.txt; \
        sed -i 's#^trl.*#trl @ git+https://github.com/huggingface/trl.git@main#' requirements.txt; \
        sed -i 's#^datasets.*#datasets @ git+https://github.com/huggingface/datasets.git@main#' requirements.txt; \
    fi
 RUN uv pip install packaging==23.2 setuptools==75.8.0
 RUN if [ "$AXOLOTL_EXTRAS" != "" ] ; then \
        uv pip install --no-build-isolation -e .[deepspeed,flash-attn,ring-flash-attn,optimizers,ray,$AXOLOTL_EXTRAS] $AXOLOTL_ARGS; \
    else \
        uv pip install --no-build-isolation -e .[deepspeed,flash-attn,ring-flash-attn,optimizers,ray] $AXOLOTL_ARGS; \
    fi
 RUN python scripts/unsloth_install.py --uv | sh
 RUN python scripts/cutcrossentropy_install.py --uv | sh
 # So we can test the Docker image
 RUN uv pip install -r requirements-dev.txt -r requirements-tests.txt
 # fix so that git fetch/pull from remote works
 RUN git config remote.origin.fetch "+refs/heads/*:refs/remotes/origin/*" && \
    git config --get remote.origin.fetch
 # helper for huggingface-login cli
 RUN git config --global credential.helper store
--- a/cicd/multigpu.py
+++ b/cicd/multigpu.py
@@ -24,9 +24,9 @@ df_template = template_env.get_template("Dockerfile.jinja")
 df_args = {
    "AXOLOTL_EXTRAS": os.environ.get("AXOLOTL_EXTRAS", ""),
    "AXOLOTL_ARGS": os.environ.get("AXOLOTL_ARGS", ""),
-    "PYTORCH_VERSION": os.environ.get("PYTORCH_VERSION", "2.4.1"),
+    "PYTORCH_VERSION": os.environ.get("PYTORCH_VERSION", "2.5.1"),
-    "BASE_TAG": os.environ.get("BASE_TAG", "main-base-py3.11-cu121-2.4.1"),
+    "BASE_TAG": os.environ.get("BASE_TAG", "main-base-py3.11-cu124-2.5.1"),
-    "CUDA": os.environ.get("CUDA", "121"),
+    "CUDA": os.environ.get("CUDA", "124"),
    "GITHUB_REF": os.environ.get("GITHUB_REF", "refs/heads/main"),
    "GITHUB_SHA": os.environ.get("GITHUB_SHA", ""),
    "CODECOV_TOKEN": os.environ.get("CODECOV_TOKEN", ""),
@@ -55,7 +55,7 @@ VOLUME_CONFIG = {
 }
 N_GPUS = int(os.environ.get("N_GPUS", 2))
-GPU_CONFIG = modal.gpu.H100(count=N_GPUS)
+GPU_CONFIG = f"H100:{N_GPUS}"
 def run_cmd(cmd: str, run_folder: str):
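The `GPU_CONFIG` change swaps the `modal.gpu.H100(count=N_GPUS)` object for a plain `<type>:<count>` string. The sketch below builds such a string from the environment without importing Modal; `gpu_spec` is a hypothetical helper, not part of `cicd/multigpu.py`, and the validation step is an added assumption.

```python
import os
from typing import Mapping


def gpu_spec(gpu_type: str, default_count: int, env: Mapping[str, str] = os.environ) -> str:
    """Return a '<type>:<count>' string, reading the count from N_GPUS when it is set."""
    count = int(env.get("N_GPUS", default_count))
    if count < 1:
        raise ValueError("N_GPUS must be at least 1")
    return f"{gpu_type}:{count}"


print(gpu_spec("H100", 2, env={}))               # falls back to the default count
print(gpu_spec("L40S", 1, env={"N_GPUS": "4"}))  # N_GPUS overrides the default
```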
--- a/cicd/single_gpu.py
+++ b/cicd/single_gpu.py
@@ -8,8 +8,9 @@ import tempfile
 import jinja2
 import modal
 import modal.experimental
 from jinja2 import select_autoescape
-from modal import App, Image
+from modal import App
 cicd_path = pathlib.Path(__file__).parent.resolve()
@@ -17,14 +18,15 @@ template_loader = jinja2.FileSystemLoader(searchpath=cicd_path)
 template_env = jinja2.Environment(
    loader=template_loader, autoescape=select_autoescape()
 )
-df_template = template_env.get_template("Dockerfile.jinja")
+dockerfile = os.environ.get("E2E_DOCKERFILE", "Dockerfile.jinja")
 df_template = template_env.get_template(dockerfile)
 df_args = {
    "AXOLOTL_EXTRAS": os.environ.get("AXOLOTL_EXTRAS", ""),
    "AXOLOTL_ARGS": os.environ.get("AXOLOTL_ARGS", ""),
-    "PYTORCH_VERSION": os.environ.get("PYTORCH_VERSION", "2.4.1"),
+    "PYTORCH_VERSION": os.environ.get("PYTORCH_VERSION", "2.5.1"),
-    "BASE_TAG": os.environ.get("BASE_TAG", "main-base-py3.11-cu121-2.4.1"),
+    "BASE_TAG": os.environ.get("BASE_TAG", "main-base-py3.11-cu124-2.5.1"),
-    "CUDA": os.environ.get("CUDA", "121"),
+    "CUDA": os.environ.get("CUDA", "124"),
    "GITHUB_REF": os.environ.get("GITHUB_REF", "refs/heads/main"),
    "GITHUB_SHA": os.environ.get("GITHUB_SHA", ""),
    "NIGHTLY_BUILD": os.environ.get("NIGHTLY_BUILD", ""),
@@ -38,11 +40,11 @@ temp_dir = tempfile.mkdtemp()
 with open(pathlib.Path(temp_dir) / "Dockerfile", "w", encoding="utf-8") as f:
    f.write(dockerfile_contents)
-cicd_image = Image.from_dockerfile(
+cicd_image = modal.experimental.raw_dockerfile_image(
    pathlib.Path(temp_dir) / "Dockerfile",
-    context_mount=None,
+    # context_mount=None,
    force_build=True,
-    gpu="A10G",
+    # gpu="A10G",
 ).env(df_args)
 app = App("Axolotl CI/CD", secrets=[])
@@ -55,7 +57,7 @@ VOLUME_CONFIG = {
 }
 N_GPUS = int(os.environ.get("N_GPUS", 1))
-GPU_CONFIG = modal.gpu.L40S(count=N_GPUS)
+GPU_CONFIG = f"L40S:{N_GPUS}"
 def run_cmd(cmd: str, run_folder: str):
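The Dockerfile template is now chosen in two places: the workflow exports `E2E_DOCKERFILE` with a fallback, and `cicd/single_gpu.py` reads it with the same fallback. The sketch below mirrors both lookups with plain dictionaries; `workflow_value` and `template_name` are hypothetical names used only for illustration.

```python
def workflow_value(matrix: dict) -> str:
    """Mirror of `${{ matrix.dockerfile || 'Dockerfile.jinja' }}` in tests.yml."""
    return matrix.get("dockerfile") or "Dockerfile.jinja"


def template_name(env: dict) -> str:
    """Mirror of the E2E_DOCKERFILE lookup in cicd/single_gpu.py."""
    return env.get("E2E_DOCKERFILE", "Dockerfile.jinja")


# A matrix entry that sets `dockerfile`, one that does not, and a local run
# where nothing was exported at all.
uv_env = {"E2E_DOCKERFILE": workflow_value({"dockerfile": "Dockerfile-uv.jinja"})}
default_env = {"E2E_DOCKERFILE": workflow_value({})}
local_env = {}

for env in (uv_env, default_env, local_env):
    print(template_name(env))
```

Keeping the default in the script as well means the module can still be run by hand without the workflow's environment.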
--- a/deepspeed_configs/zero2_torch_compile.json
+++ b/deepspeed_configs/zero2_torch_compile.json
@@ -0,0 +1,31 @@
 {
  "compile": {
    "disable": false,
    "backend": "inductor"
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu"
    },
    "contiguous_gradients": true,
    "overlap_comm": true
  },
  "bf16": {
    "enabled": "auto"
  },
  "fp16": {
    "enabled": "auto",
    "auto_cast": false,
    "loss_scale": 0,
    "initial_scale_power": 32,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
 }
--- a/docker/Dockerfile-uv-base
+++ b/docker/Dockerfile-uv-base
@@ -0,0 +1,36 @@
 ARG CUDA_VERSION="12.6.3"
 ARG CUDNN_VERSION=""
 ARG UBUNTU_VERSION="22.04"
 ARG MAX_JOBS=4
 FROM nvidia/cuda:$CUDA_VERSION-cudnn$CUDNN_VERSION-devel-ubuntu$UBUNTU_VERSION AS base-builder
 ARG PYTHON_VERSION="3.11"
 ARG PYTORCH_VERSION="2.6.0"
 ARG CUDA="126"
 ARG TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6 9.0+PTX"
 ENV PYTHON_VERSION=$PYTHON_VERSION
 ENV TORCH_CUDA_ARCH_LIST=$TORCH_CUDA_ARCH_LIST
 ENV UV_TORCH_BACKEND="cu${CUDA}"
 RUN apt-get update \
    && apt-get install -y wget git build-essential ninja-build git-lfs libaio-dev pkg-config curl && rm -rf /var/lib/apt/lists/* \
    && git lfs install --skip-repo \
    && curl -LsSf https://astral.sh/uv/install.sh | sh
 ENV PATH="/root/.local/bin:${PATH}"
 RUN uv python install ${PYTHON_VERSION}
 WORKDIR /workspace
 RUN uv venv --no-project --relocatable axolotl-venv
 ENV PATH="/workspace/axolotl-venv/bin:${PATH}"
 RUN uv pip install packaging setuptools wheel \
    && uv pip install torch==${PYTORCH_VERSION} \
    && uv pip install --no-build-isolation "causal_conv1d @ git+https://github.com/Dao-AILab/causal-conv1d.git@main" \
    && uv pip install "mamba_ssm @ git+https://github.com/state-spaces/mamba.git@main" \
    && uv pip install awscli pydantic
--- a/docs/cli.qmd
+++ b/docs/cli.qmd
@@ -209,6 +209,16 @@ axolotl delinearize-llama4 --model path/to/model_dir --output path/to/output_dir
 This would be necessary to use with other frameworks. If you have an adapter, merge it with the non-quantized linearized model before delinearizing.
 ### quantize
 Quantizes a model using the quantization configuration specified in your YAML file.
 ```bash
 axolotl quantize config.yml
 ```
 See [Quantization](./quantize.qmd) for more details.
 ## Legacy CLI Usage
--- a/docs/config.qmd
+++ b/docs/config.qmd
@@ -65,6 +65,20 @@ bnb_config_kwargs:
  bnb_4bit_quant_type: nf4
  bnb_4bit_use_double_quant: true
 # quantization aware training
 qat:
  activation_dtype: # Optional[str] = "int8". Fake quantization layout to use for activation quantization. Valid options are "int4" and "int8"
  weight_dtype: # Optional[str] = "int8". Fake quantization layout to use for weight quantization. Valid options are "int4" and "int8"
  group_size: # Optional[int] = 32. The number of elements in each group for per-group fake quantization
  fake_quant_after_n_steps: # Optional[int] = None. The number of training steps to run before fake quantization is applied
 # post-training quantization
 quantization:
  weight_dtype: # Optional[str] = "int8". Fake quantization layout to use for weight quantization. Valid options are uintX for X in [1, 2, 3, 4, 5, 6, 7], or int4, or int8
  activation_dtype: # Optional[str] = "int8". Fake quantization layout to use for activation quantization. Valid options are "int4" and "int8"
  group_size: # Optional[int] = 32. The number of elements in each group for per-group fake quantization
  quantize_embedding: # Optional[bool] = False. Whether to quantize the embedding layer.
 # Whether you are training a 4-bit GPTQ quantized model
 gptq: true
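To make `group_size` concrete, the following is a generic pure-Python sketch of symmetric per-group fake quantization (quantize, then immediately dequantize). It is an illustration under stated assumptions, not the code path Axolotl uses: the function name `fake_quantize`, the symmetric scaling rule, and the rounding mode are all assumptions.

```python
def fake_quantize(values, group_size=32, bits=8):
    """Quantize then dequantize `values` one group at a time (symmetric, signed)."""
    qmax = 2 ** (bits - 1) - 1   # 127 for int8, 7 for int4
    qmin = -(2 ** (bits - 1))
    out = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        # One scale per group; an all-zero group gets a dummy scale of 1.0.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        for v in group:
            q = max(qmin, min(qmax, round(v / scale)))
            out.append(q * scale)
    return out


weights = [0.5, -1.0, 0.25, 0.1, 10.0, -5.0, 2.5, 0.0]
approx = fake_quantize(weights, group_size=4, bits=8)
for original, restored in zip(weights, approx):
    print(f"{original:+.4f} -> {restored:+.4f}")
```

Because each group has its own scale, the small values in the first group are not forced onto the coarse grid that the large values in the second group would require.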
@@ -98,8 +112,10 @@ plugins:
  # - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
 # A list of one or more datasets to finetune the model with
 # See https://docs.axolotl.ai/docs/dataset_loading.html for guide on loading datasets
 # See https://docs.axolotl.ai/docs/dataset-formats/ for guide on dataset formats
 datasets:
-  # HuggingFace dataset repo | s3://,gs:// path | "json" for local dataset, make sure to fill data_files
+  # HuggingFace dataset repo | s3:// | gs:// | path to local file or directory
  - path: vicgalle/alpaca-gpt4
    # The type of prompt to use for training. [alpaca, gpteacher, oasst, reflection]
    type: alpaca # format | format:<prompt_style> (chat/instruct) | <prompt_strategies>.load_<load_fn>
@@ -221,7 +237,7 @@ datasets:
 # The same applies to the `test_datasets` option and the `pretraining_dataset` option. Default is true.
 shuffle_merged_datasets: true
-Deduplicates datasets and test_datasets with identical entries.
+# Deduplicates datasets and test_datasets with identical entries.
 dataset_exact_deduplication: true
 # A list of one or more datasets to eval the model with.
@@ -270,10 +286,25 @@ trl:
  num_generations: # Optional[int]. Number of generations to sample.
  log_completions: # Optional[bool]. Whether to log completions.
  num_completions_to_print: # Optional[int]. Number of completions to print when log_completions is True.
  sync_ref_model: # Optional[bool]. Whether to sync the reference model.
  ref_model_mixup_alpha: # Optional[float]. Mixup alpha for the reference model.
  ref_model_sync_steps: # Optional[int]. Sync steps for the reference model.
  scale_rewards: # Optional[bool]. Whether to scale rewards by their standard deviation.
  temperature: # Optional[float]. Sampling temperature for the GRPO policy.
  top_p: # Optional[float]. Top-p sampling probability for the generation policy.
  top_k: # Optional[int]. Top-k sampling for the generation policy.
  min_p: # Optional[float]. Minimum probability for the generation policy.
  repetition_penalty: # Optional[float]. Penalty for tokens that appear in prompt and generated text.
  num_iterations: # Optional[int]. Number of iterations per batch (μ) for GRPO.
  epsilon: # Optional[float]. Epsilon value for clipping in the GRPO algorithm.
  epsilon_high: # Optional[float]. Upper-bound epsilon value for clipping in the GRPO algorithm.
  use_liger_loss: # Optional[bool]. Whether to use Liger loss for GRPO.
  loss_type: # Optional[str]. Loss formulation to use. Supported values: grpo, bnpo, dr_grpo.
  mask_truncated_completions: # Optional[bool]. Whether to exclude truncated completions from loss calculation.
 # reward modelling: `True` or `False`
@@ -483,6 +514,7 @@ output_dir: ./completed-model
 # setting to `auto` will enable torch compile when torch>=2.5.1
 torch_compile:  # Optional[Union[Literal["auto"], bool]]
 torch_compile_backend:  # Optional[str]
 torch_compile_mode:  # 'default' | 'reduce-overhead' | 'max-autotune'
 # Training hyperparameters
@@ -529,7 +561,7 @@ profiler_steps: # enable the pytorch profiler to capture the first N steps of tr
 loss_watchdog_threshold: # High loss value, indicating the learning has broken down (a good estimate is ~2 times the loss at the start of training)
 loss_watchdog_patience: # Number of high-loss steps in a row before the trainer aborts (default: 3)
-# Save model as safetensors (require safetensors package)
+# Save model as safetensors (requires the safetensors package). Default True
 save_safetensors:
 # Whether to mask out or include the human's prompt from the training labels
@@ -551,7 +583,24 @@ gradient_checkpointing: false
 early_stopping_patience: 3
 # Specify a scheduler and kwargs to use with the optimizer
-lr_scheduler: # 'one_cycle' | 'rex' | 'log_sweep' | 'linear' | 'cosine_with_restarts' | 'polynomial' | 'constant' | 'constant_with_warmup' | 'inverse_sqrt' | 'reduce_lr_on_plateau' | 'cosine_with_min_lr' | 'warmup_stable_decay' | empty for cosine
+# Valid values are driven by the Transformers SchedulerType class, see:
 # https://github.com/huggingface/transformers/blob/5f4ecf2d9f867a1255131d2461d75793c0cf1db2/src/transformers/trainer_utils.py#L420
 # Valid values include
 # - 'linear'
 # - 'cosine' (default)
 # - 'cosine_with_restarts'
 # - 'polynomial'
 # - 'constant'
 # - 'constant_with_warmup'
 # - 'inverse_sqrt'
 # - 'reduce_lr_on_plateau'
 # - 'cosine_with_min_lr'
 # - 'warmup_stable_decay'
 # Additional schedulers include:
 # - 'one_cycle'
 # - 'rex'
 lr_scheduler:
 lr_scheduler_kwargs:
 cosine_min_lr_ratio: # decay lr to some percentage of the peak lr, e.g. cosine_min_lr_ratio=0.1 for 10% of peak lr
 cosine_constant_lr_ratio: # fraction of training steps after which the lr is held constant, e.g. cosine_constant_lr_ratio=0.8 means the lr stays at the cosine_min_lr floor from 80% of the training steps onward (https://arxiv.org/pdf/2308.04014.pdf)
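As an illustration of what `cosine_min_lr_ratio` describes, here is a sketch of a cosine decay that bottoms out at a fraction of the peak learning rate. It assumes a common textbook formulation with no warmup; the function name `cosine_lr` is hypothetical and the exact schedule used by the trainer may differ.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float, min_lr_ratio: float = 0.0) -> float:
    """Cosine decay from peak_lr down to peak_lr * min_lr_ratio over total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1.0 at the start, 0.0 at the end
    return peak_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)


for step in (0, 50, 100):
    print(step, cosine_lr(step, total_steps=100, peak_lr=1e-3, min_lr_ratio=0.1))
```

With `min_lr_ratio=0.1` the schedule ends at 10% of the peak rather than at zero.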
@@ -569,7 +618,7 @@ lr_div_factor: # Learning rate div factor
 #
 # Valid values for 'optimizer' include:
 # - adamw_torch
-# - adamw_torch_fused
+# - adamw_torch_fused (default)
 # - adamw_torch_xla
 # - adamw_torch_npu_fused
 # - adamw_apex_fused
--- a/docs/dataset-formats/index.qmd
+++ b/docs/dataset-formats/index.qmd
@@ -36,10 +36,6 @@ It is typically recommended to save your dataset as `.jsonl` due to its flexibil
 Axolotl supports loading from a Hugging Face hub repo or from local files.
 ::: {.callout-important}
 For pre-training only, Axolotl splits texts that exceed the context length into multiple smaller prompts.
 :::
 ### Pre-training from Hugging Face hub datasets
 As an example, to train using a Hugging Face dataset `hf_org/name`, you can pass the following config:
@@ -77,18 +73,21 @@ datasets:
    type: completion
 ```
-From local files (either example works):
+From local files:
 ```yaml
 datasets:
  - path: A.jsonl
    type: completion
-  - path: json
+  - path: B.jsonl
    data_files: ["A.jsonl", "B.jsonl", "C.jsonl"]
    type: completion
 ```
 ::: {.callout-important}
 For `completion` only, Axolotl splits texts that exceed the context length into multiple smaller prompts. If you are interested in having this for `pretraining_dataset` too, please let us know or help make a PR!
 :::
 ### Pre-training dataset configuration tips
 #### Setting max_steps
--- a/docs/dataset_loading.qmd
+++ b/docs/dataset_loading.qmd
@@ -54,7 +54,7 @@ datasets:
 #### Files
-Usually, to load a JSON file, you would do something like this:
+To load a JSON file, you would do something like this:
 ```python
 from datasets import load_dataset
@@ -66,20 +66,12 @@ Which translates to the following config:
 ```yaml
 datasets:
-  - path: json
+  - path: data.json
    data_files: /path/to/your/file.jsonl
 ```
 However, to make things easier, we have added a few shortcuts for loading local dataset files.
 You can just point the `path` to the file or directory along with the `ds_type` to load the dataset. The example below shows this for a JSON file:
 ```yaml
 datasets:
  - path: /path/to/your/file.jsonl
    ds_type: json
 ```
 In the example above, it can be seen that we can just point the `path` to the file or directory along with the `ds_type` to load the dataset.
 This works for CSV, JSON, Parquet, and Arrow files.
 ::: {.callout-tip}
--- a/docs/docker.qmd
+++ b/docs/docker.qmd
@@ -36,7 +36,6 @@ Tags examples:
 - `main-base-py3.11-cu126-2.7.0`
 - `main-base-py3.11-cu124-2.6.0`
 - `main-base-py3.11-cu124-2.5.1`
 - `main-base-py3.11-cu124-2.4.1`
 ## Main
@@ -77,12 +76,10 @@ Tags examples:
 - `main-py3.11-cu126-2.7.0`
 - `main-py3.11-cu124-2.6.0`
 - `main-py3.11-cu124-2.5.1`
 - `main-py3.11-cu124-2.4.1`
 - `main-latest`
 - `main-20250303-py3.11-cu124-2.6.0`
 - `main-20250303-py3.11-cu124-2.5.1`
- `main-20250303-py3.11-cu124-2.4.1`
+- `0.9.2`
 - `0.7.1`
 ## Cloud
--- a/docs/faq.qmd
+++ b/docs/faq.qmd
@@ -110,3 +110,17 @@ description: Frequently asked questions
 > A: If `eot_tokens: ` is not provided, the default behavior is the same as before. EOS tokens used to delimit turns are masked/unmasked depending on whether the turn is trainable.
 > Internally, `eot_tokens: tokenizer.eos_token` and `train_on_eot: train_on_eos` (which defaults to `turn`). This transition helps clarify the naming and behavior of EOT/EOS tokens.
 **Q: `Data processing error: CAS service error`**
 > A: Try disabling XET with `export HF_HUB_DISABLE_XET=1`
 **Q: `torch._inductor.exc.LoweringException: NoValidChoicesError: No choices to select, please consider adding ATEN into max_autotune_gemm_backends config (defined in torch/_inductor/config.py) to allow at least one choice. `**
 > A: Depending on the version of torch, you may need to include this in your YAML:
 > ```yaml
 > flex_attn_compile_kwargs:
 >   dynamic: false
 >   mode: max-autotune-no-cudagraphs
 > ```
--- a/docs/getting-started.qmd
+++ b/docs/getting-started.qmd
@@ -180,7 +180,7 @@ Now that you have the basics, you might want to:
 Check our other guides for details on these topics:
 - [Configuration Guide](config.qmd) - Full configuration options
- [Dataset Loading](dataset-loading.qmd) - Loading datasets from various sources
+- [Dataset Loading](dataset_loading.qmd) - Loading datasets from various sources
 - [Dataset Formats](dataset-formats) - Working with different data formats
 - [Multi-GPU Training](multi-gpu.qmd)
 - [Multi-Node Training](multi-node.qmd)
--- a/docs/installation.qmd
+++ b/docs/installation.qmd
@@ -15,7 +15,7 @@ This guide covers all the ways you can install and set up Axolotl for your envir
 - NVIDIA GPU (Ampere architecture or newer for `bf16` and Flash Attention) or AMD GPU
 - Python ≥3.10
- PyTorch ≥2.4.1
+- PyTorch ≥2.5.1
 ## Installation Methods {#sec-installation-methods}
@@ -41,6 +41,40 @@ installed) in order not to clobber it, and so that we set the correct version of
 dependencies that are specific to the PyTorch version or other installed
 co-dependencies.
 ### uv Installation {#sec-uv}
 uv is a fast, reliable Python package installer and resolver built in Rust. It offers significant performance improvements over pip and provides better dependency resolution, making it an excellent choice for complex environments.
 Install uv if not already installed
 ```{.bash}
 curl -LsSf https://astral.sh/uv/install.sh | sh
 source $HOME/.local/bin/env
 ```
 Choose your CUDA version to use with PyTorch; e.g. `cu124`, `cu126`, `cu128`,
 then create the venv and activate
 ```{.bash}
 export UV_TORCH_BACKEND=cu126
 uv venv --no-project --relocatable
 source .venv/bin/activate
 ```
 Install PyTorch
 - PyTorch 2.6.0 recommended
 ```{.bash}
 uv pip install packaging setuptools wheel
 uv pip install torch==2.6.0
 uv pip install awscli pydantic
 ```
 Install axolotl from PyPI
 ```{.bash}
 uv pip install --no-build-isolation axolotl[deepspeed,flash-attn]
 # optionally install with vLLM if you're using torch==2.6.0 and want to train w/ GRPO
 uv pip install --no-build-isolation axolotl[deepspeed,flash-attn,vllm]
 ```
 ### Edge/Development Build {#sec-edge-build}
 For the latest features between releases:
--- a/docs/lora_optims.qmd
+++ b/docs/lora_optims.qmd
@@ -84,6 +84,10 @@ lora_qkv_kernel: true
 lora_o_kernel: true
 ```
 ::: {.callout-note}
 Currently, LoRA kernels are not supported for RLHF training, only SFT.
 :::
 ## Requirements
 - One or more NVIDIA or AMD GPUs (in order to use the Triton kernels)
--- a/docs/multimodal.qmd
+++ b/docs/multimodal.qmd
@@ -43,7 +43,7 @@ datasets:
 # leave the vision model and vision tower frozen
 # load_in_8bit: true
 adapter: lora
-lora_target_modules: 'language_model.model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'
+lora_target_modules: 'model.language_model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'
 # (optional) if you want to resize images to a set size
 image_size: 512
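Because `lora_target_modules` is given here as a single regex string rather than a list, its prefix has to line up with the model's actual module names. The sketch below checks both the old and the new pattern against a few names using `re.fullmatch`. The module names are hypothetical examples, and whole-name matching is an assumption about how the pattern is applied.

```python
import re

NEW_PATTERN = r"model.language_model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj"
OLD_PATTERN = r"language_model.model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj"

# Hypothetical module names, for illustration only.
module_names = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.12.mlp.down_proj",
    "model.vision_tower.layers.0.self_attn.q_proj",
]


def matching(pattern, names):
    """Return the names that the pattern matches in full."""
    return [name for name in names if re.fullmatch(pattern, name)]


new_hits = matching(NEW_PATTERN, module_names)
old_hits = matching(OLD_PATTERN, module_names)
print("new pattern:", new_hits)
print("old pattern:", old_hits)
```

A pattern that matches nothing is worth checking against the names reported by `model.named_modules()` before training.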
--- a/docs/qat.qmd
+++ b/docs/qat.qmd
@@ -0,0 +1,32 @@
 ---
 title: "Quantization Aware Training (QAT)"
 back-to-top-navigation: true
 toc: true
 toc-expand: 2
 toc-depth: 4
 ---
 ## Overview
 [Quantization Aware Training](https://pytorch.org/blog/introduction-to-quantization-on-pytorch/#quantization-aware-training) (QAT) is a technique for improving the accuracy of models which are quantized
 by applying "fake" quantizations to the model's weights (and optionally, activations) during training. This fake
 quantization allows for the model to adjust for noise introduced by the quantization, so when the model is eventually
 quantized, the accuracy loss is minimized. We use the quantization techniques implemented in [torchao](https://github.com/pytorch/ao) to provide
 support for QAT and post-training quantization (PTQ) in axolotl.
 We recommend reviewing the excellent QAT tutorial in the [torchtune library](https://pytorch.org/torchtune/main/tutorials/qat_finetune.html#quantizing-the-qat-model),
 and the QAT documentation in the [torchao library](https://github.com/pytorch/ao/tree/main/torchao/quantization/qat), for more details.
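To make the idea of "fake" quantization concrete, here is a minimal pure-Python sketch of a symmetric, per-group quantize-then-dequantize round trip. It is an illustration only and not how torchao implements it; the function name `fake_quantize` and the symmetric scaling scheme are assumptions.

```python
def fake_quantize(values, bits=8, group_size=4):
    """Quantize and immediately dequantize each group of `group_size` values."""
    qmax = 2 ** (bits - 1) - 1   # 127 for int8, 7 for int4
    qmin = -(2 ** (bits - 1))    # -128 for int8, -8 for int4
    out = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        # One scale per group; fall back to 1.0 for an all-zero group.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        for v in group:
            q = max(qmin, min(qmax, round(v / scale)))  # snap to the integer grid
            out.append(q * scale)                       # map back to float
    return out


weights = [0.10, -0.52, 0.33, 0.98, 0.01, 0.02, -0.03, 0.04]
int8_w = fake_quantize(weights, bits=8, group_size=4)
int4_w = fake_quantize(weights, bits=4, group_size=4)

# Largest rounding error introduced by each bit width.
int8_err = max(abs(a - b) for a, b in zip(weights, int8_w))
int4_err = max(abs(a - b) for a, b in zip(weights, int4_w))
print(f"int8 max error: {int8_err:.5f}, int4 max error: {int4_err:.5f}")
```

During QAT the model trains against values like `int8_w` rather than `weights`, so it can adapt to the rounding error before the real quantization step.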
 ## Configuring QAT in Axolotl
 To enable QAT in axolotl, add the following to your configuration file:
 ```yaml
 qat:
  activation_dtype: # Optional[str] = "int8". Fake quantization layout to use for activation quantization. Valid options are "int4" and "int8"
  weight_dtype: # Optional[str] = "int8". Fake quantization layout to use for weight quantization. Valid options are "int4" and "int8"
  group_size: # Optional[int] = 32. The number of elements in each group for per-group fake quantization
  fake_quant_after_n_steps: # Optional[int] = None. The number of steps to apply fake quantization after
 ```
 Once you have finished training, you must quantize your model using the same quantization configuration that you used to train it. You can use the [`quantize` command](./quantize.md) to do this.
--- a/docs/quantize.qmd
+++ b/docs/quantize.qmd
@@ -0,0 +1,53 @@
 ---
 title: "Quantization with torchao"
 back-to-top-navigation: true
 toc: true
 toc-expand: 2
 toc-depth: 4
 ---
 Quantization is a technique to lower the memory footprint of your model, potentially at the cost of accuracy or model performance. We support quantizing your model using the [torchao](https://github.com/pytorch/ao) library. Quantization is supported for both post-training quantization (PTQ) and quantization-aware training (QAT).
 ::: {.callout-note}
 We do not currently support quantization techniques such as GGUF, GPTQ, or EXL2.
 :::
 ## Configuring Quantization in Axolotl
 Quantization is configured using the `quantization` key in your configuration file.
 ```yaml
 base_model: # The path to the model to quantize.
 quantization:
  weight_dtype: # Optional[str] = "int8". Fake quantization layout to use for weight quantization. Valid options are uintX for X in [1, 2, 3, 4, 5, 6, 7], or int4, or int8
  activation_dtype: # Optional[str] = "int8". Fake quantization layout to use for activation quantization. Valid options are "int4" and "int8"
  group_size: # Optional[int] = 32. The number of elements in each group for per-group fake quantization
  quantize_embedding: # Optional[bool] = False. Whether to quantize the embedding layer.
 output_dir:  # The path to the output directory.
 ```
 Once quantization is complete, your quantized model will be saved in the `{output_dir}/quantized` directory.
 You may also use the `quantize` command to quantize a model which has been trained with [QAT](./qat.md) - you can do this by using the existing QAT configuration file which
 you used to train the model:
 ```yaml
 # qat.yml
 qat:
  activation_dtype: int8
  weight_dtype: int8
  group_size: 256
  quantize_embedding: true
 output_dir: # The path to the output directory used during training where the final checkpoint has been saved.
 ```
 ```bash
 axolotl quantize qat.yml
 ```
 This ensures that an identical quantization configuration is used to quantize the model as was used to train it.
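The `quantize` command reads its settings from either the `qat` key or the `quantization` key, and rejects a config that sets both or neither. The following is a minimal sketch of that selection, using a plain dict in place of the loaded config. The function name `select_quantize_cfg` is hypothetical and the error messages are shortened.

```python
def select_quantize_cfg(cfg: dict) -> dict:
    """Pick the quantization settings from a config dict, or raise if ambiguous."""
    qat = cfg.get("qat")
    quantization = cfg.get("quantization")
    if qat and quantization:
        raise ValueError("Specify only one of qat or quantization.")
    if qat:
        return qat
    if quantization:
        return quantization
    raise ValueError("Specify either qat or quantization.")


# Reusing a QAT training config: the `qat` block supplies the settings.
chosen = select_quantize_cfg({"qat": {"weight_dtype": "int8", "group_size": 256}})
print(chosen)

# Setting both keys is rejected.
try:
    select_quantize_cfg({"qat": {"group_size": 32}, "quantization": {"group_size": 32}})
    both_rejected = False
except ValueError:
    both_rejected = True
```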
--- a/docs/rlhf.qmd
+++ b/docs/rlhf.qmd
@@ -16,7 +16,8 @@ feedback. Various methods include, but not limited to:
 - [Identity Preference Optimization (IPO)](#ipo)
 - [Kahneman-Tversky Optimization (KTO)](#kto)
 - [Odds Ratio Preference Optimization (ORPO)](#orpo)
- Proximal Policy Optimization (PPO) (not yet supported in axolotl)
+- [Group Relative Policy Optimization (GRPO)](#grpo)
 - Proximal Policy Optimization (PPO) (not yet supported in axolotl, if you're interested in contributing, please reach out!)
 ## RLHF using Axolotl
@@ -582,7 +583,20 @@ datasets:
 To see other examples of custom reward functions, please see [TRL GRPO Docs](https://github.com/huggingface/trl/blob/main/docs/source/grpo_trainer.md#using-a-custom-reward-function).
-To see description of the configs, please see [TRLConfig](https://github.com/axolotl-ai-cloud/axolotl/blob/main/src/axolotl/utils/config/models/input/v0_4_1/trl.py).
+To see all configs, please see [TRLConfig](https://github.com/axolotl-ai-cloud/axolotl/blob/v0.9.2/src/axolotl/utils/schemas/trl.py).
 #### GRPO with DAPO/Dr. GRPO loss
 The DAPO paper, and subsequently the Dr. GRPO paper, proposed an alternative loss function for GRPO to remediate the penalty in longer responses.
 ```yaml
 trl:
  loss_type: dr_grpo
  # Normalizes loss based on max completion length (default: 256)
  max_completion_length:
 ```
 For more information, see [GRPO docs](https://huggingface.co/docs/trl/v0.17.0/en/grpo_trainer#loss-types).
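As a rough illustration of why the normalization matters, the sketch below compares two approaches: averaging per-token losses over each completion's own length, and dividing the summed loss by a fixed `max_completion_length`. This is a simplified pure-Python reading of the idea, not TRL's implementation; the function names and toy values are assumptions.

```python
def length_normalized_loss(per_token_losses):
    """Mean over completions of each completion's mean token loss."""
    per_seq = [sum(seq) / len(seq) for seq in per_token_losses]
    return sum(per_seq) / len(per_seq)


def constant_normalized_loss(per_token_losses, max_completion_length):
    """Sum of all token losses divided by (num completions * max_completion_length)."""
    total = sum(sum(seq) for seq in per_token_losses)
    return total / (len(per_token_losses) * max_completion_length)


# Toy batch: one 2-token completion and one 8-token completion.
batch = [[2.0, 2.0], [1.0] * 8]

by_length = length_normalized_loss(batch)
by_constant = constant_normalized_loss(batch, max_completion_length=8)
print(by_length, by_constant)
```

With a per-completion mean, each token of the 8-token completion contributes one quarter as much as a token of the 2-token completion. With the constant divisor, every token carries the same weight regardless of completion length.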
 ### SimPO
--- a/examples/gemma3/gemma-3-4b-qlora.yml
+++ b/examples/gemma3/gemma-3-4b-qlora.yml
@@ -28,7 +28,7 @@ pad_to_sequence_len: true
 lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
-lora_target_modules: 'language_model.model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'
+lora_target_modules: 'model.language_model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'
 wandb_project:
 wandb_entity:
--- a/examples/gemma3/gemma-3-4b-vision-qlora.yml
+++ b/examples/gemma3/gemma-3-4b-vision-qlora.yml
@@ -30,7 +30,7 @@ pad_to_sequence_len: false
 lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
-lora_target_modules: 'language_model.model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'
+lora_target_modules: 'model.language_model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'
 wandb_project:
 wandb_entity:
--- a/examples/llama-3-vision/lora-11b.yaml
+++ b/examples/llama-3-vision/lora-11b.yaml
@@ -29,7 +29,7 @@ pad_to_sequence_len: false
 lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
-lora_target_modules: 'language_model.model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'
+lora_target_modules: 'model.language_model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'
 wandb_project:
 wandb_entity:
--- a/examples/llama-3/3b-qat-fsdp2.yaml
+++ b/examples/llama-3/3b-qat-fsdp2.yaml
@@ -0,0 +1,79 @@
 base_model: meta-llama/Llama-3.2-3B
 # Automatically upload checkpoint and final model to HF
 # hub_model_id: username/custom_model_name
 load_in_8bit: false
 load_in_4bit: false
 strict: false
 plugins:
  - axolotl.integrations.liger.LigerPlugin
 liger_rope: true
 liger_rms_norm: true
 liger_glu_activation: true
 liger_layer_norm: true
 liger_fused_linear_cross_entropy: true
 datasets:
  - path: yahma/alpaca-cleaned
    type: alpaca
 output_dir: ./outputs/qat_out/
 sample_packing: true
 pad_to_sequence_len: true
 sequence_len: 512
 flex_attention: true
 flex_attn_compile_kwargs:
  dynamic: false
  mode: max-autotune-no-cudagraphs
 qat:
  activation_dtype: int8
  weight_dtype: int4
  group_size: 32
 wandb_project:
 wandb_entity:
 wandb_watch:
 wandb_name:
 wandb_log_model:
 gradient_accumulation_steps: 1
 micro_batch_size: 16
 num_epochs: 1
 optimizer: adamw_torch_fused
 cosine_constant_lr_ratio: 0
 cosine_min_lr_ratio: 1.0
 learning_rate: 2e-5
 save_only_model: true
 bf16: true
 resume_from_checkpoint:
 logging_steps: 1
 evals_per_epoch: 1
 saves_per_epoch: 1
 warmup_steps: 10
 weight_decay: 0.0
 fsdp:
  - full_shard
  - auto_wrap
 fsdp_config:
  fsdp_version: 2
  fsdp_offload_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_reshard_after_forward: true
  fsdp_activation_checkpointing: true
 special_tokens:
  pad_token: <|end_of_text|>
--- a/examples/llama-3/lora-1b.yml
+++ b/examples/llama-3/lora-1b.yml
@@ -5,7 +5,7 @@ base_model: NousResearch/Llama-3.2-1B
 datasets:
  - path: teknium/GPT4-LLM-Cleaned
    type: alpaca
-dataset_prepared_path: last_run_prepared
+
 val_set_size: 0.1
 output_dir: ./outputs/lora-out
@@ -38,6 +38,7 @@ wandb_log_model:
 gradient_accumulation_steps: 2
 micro_batch_size: 2
 num_epochs: 1
 optimizer: adamw_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002
--- a/examples/llava/lora-7b.yaml
+++ b/examples/llava/lora-7b.yaml
@@ -25,7 +25,7 @@ pad_to_sequence_len: false
 lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
-lora_target_modules: 'language_model.model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'
+lora_target_modules: 'model.language_model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'
 wandb_project:
 wandb_entity:
--- a/examples/mistral/bigstral-ds-zero3.yaml
+++ b/examples/mistral/bigstral-ds-zero3.yaml
--- a/examples/mistral/devstral-small-2505.yml
+++ b/examples/mistral/devstral-small-2505.yml
@@ -1,48 +0,0 @@
 base_model: mistralai/Devstral-Small-2505
 processor_type: AutoProcessor
 # these 3 lines are needed for now to handle vision chat templates w images
 skip_prepare_dataset: true
 remove_unused_columns: false
 sample_packing: false
 chat_template: mistral_v7_tekken
 datasets:
  - path: HuggingFaceH4/llava-instruct-mix-vsft
    type: chat_template
    split: train[:1%]
    field_messages: messages
 dataset_prepared_path: last_run_prepared
 val_set_size: 0.01
 output_dir: ./outputs/out
 sequence_len: 2048
 pad_to_sequence_len: false
 wandb_project:
 wandb_entity:
 wandb_watch:
 wandb_name:
 wandb_log_model:
 gradient_accumulation_steps: 1
 micro_batch_size: 1
 num_epochs: 1
 optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
 learning_rate: 0.0002
 bf16: auto
 fp16:
 tf32: false
 gradient_checkpointing: true
 logging_steps: 1
 flash_attention: false
 eager_attention:
 warmup_ratio: 0.1
 evals_per_epoch: 1
 saves_per_epoch: 1
 weight_decay: 0.0
 special_tokens:
--- a/examples/mistral/mistral-small-3.1-24B-lora.yml
+++ b/examples/mistral/mistral-small-3.1-24B-lora.yml
@@ -27,7 +27,7 @@ pad_to_sequence_len: false
 lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
-lora_target_modules: 'language_model.model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'
+lora_target_modules: 'model.language_model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'
 wandb_project:
 wandb_entity:
--- a/examples/pixtral/lora-12b.yml
+++ b/examples/pixtral/lora-12b.yml
@@ -25,7 +25,7 @@ pad_to_sequence_len: false
 lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
-lora_target_modules: 'language_model.model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'
+lora_target_modules: 'model.language_model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'
 wandb_project:
 wandb_entity:
--- a/examples/qwen3/8b-qat-fsdp2.yml
+++ b/examples/qwen3/8b-qat-fsdp2.yml
@@ -0,0 +1,78 @@
 base_model: Qwen/Qwen3-8B
 # Automatically upload checkpoint and final model to HF
 # hub_model_id: username/custom_model_name
 load_in_8bit: false
 load_in_4bit: false
 strict: false
 plugins:
  - axolotl.integrations.liger.LigerPlugin
 liger_rope: true
 liger_rms_norm: true
 liger_glu_activation: true
 liger_layer_norm: true
 liger_fused_linear_cross_entropy: true
 datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
 output_dir: ./outputs/qat_out/
 sequence_len: 2048
 sample_packing: true
 flex_attention: true
 pad_to_sequence_len: true
 flex_attn_compile_kwargs:
  dynamic: false
  mode: max-autotune-no-cudagraphs
 qat:
  activation_dtype: int8
  weight_dtype: int4
  group_size: 256
  fake_quant_after_n_steps: 1000
 wandb_project:
 wandb_entity:
 wandb_watch:
 wandb_name:
 wandb_log_model:
 gradient_accumulation_steps: 1
 micro_batch_size: 2
 max_steps: 2000
 optimizer: adamw_torch_fused
 lr_scheduler: cosine
 learning_rate: 2e-5
 bf16: true
 tf32: true
 resume_from_checkpoint:
 logging_steps: 1
 evals_per_epoch: 1
 saves_per_epoch: 1
 warmup_steps: 10
 weight_decay: 0.0
 fsdp:
  - full_shard
  - auto_wrap
 fsdp_config:
  fsdp_version: 2
  fsdp_offload_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Qwen3DecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_reshard_after_forward: true
  fsdp_activation_checkpointing: true
 special_tokens:
--- a/requirements.txt
+++ b/requirements.txt
@@ -6,21 +6,20 @@ triton>=3.0.0
 mamba-ssm==1.2.0.post1
 xformers>=0.0.23.post1
 autoawq==0.2.7.post3
-liger-kernel==0.5.9
+liger-kernel==0.5.10
 # END section
 packaging==23.2
-huggingface_hub==0.31.0
+huggingface_hub==0.32.2
 peft==0.15.2
-transformers==4.51.3
+transformers==4.52.3
 tokenizers>=0.21.1
-accelerate==1.6.0
+accelerate==1.7.0
-datasets==3.5.1
+datasets==3.6.0
-deepspeed>=0.15.4
+deepspeed>=0.17.0
-trl==0.17.0
+trl==0.18.1
-hf_xet==1.1.0
+hf_xet==1.1.2
 hqq==0.2.5
 optimum==1.16.2
 hf_transfer
@@ -63,7 +62,7 @@ langdetect==1.0.9
 immutabledict==4.2.0
 antlr4-python3-runtime==4.13.2
-torchao==0.9.0
+torchao==0.10.0
 schedulefree==1.4.1
 axolotl-contribs-lgpl==0.0.6
--- a/scripts/cutcrossentropy_install.py
+++ b/scripts/cutcrossentropy_install.py
@@ -9,6 +9,8 @@ except ImportError as exc:
    raise ImportError("Install torch via `pip install torch`") from exc
 from packaging.version import Version as V
 USE_UV = "--uv" in sys.argv[1:]
 v = V(torch.__version__)
 # no cut-cross-entropy support for torch < 2.4.0
@@ -23,7 +25,9 @@ if cce_spec:
    if not importlib.util.find_spec("cut_cross_entropy.transformers"):
        UNINSTALL_PREFIX = "pip uninstall -y cut-cross-entropy && "
 UV_PREFIX = "uv " if USE_UV else ""
 print(
    UNINSTALL_PREFIX
-    + 'pip install "cut-cross-entropy[transformers] @ git+https://github.com/apple/ml-cross-entropy.git@bad6f7b49c75fdec69471abb71b4cddd0f0c6438"'
+    + f'{UV_PREFIX}pip install "cut-cross-entropy[transformers] @ git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@a1174ca"'
 )
--- a/scripts/motd
+++ b/scripts/motd
@@ -11,7 +11,7 @@
                                 =@#       @#  #@=     #@   =#@@@@#=    +#@@=  +#@@@@#=    .##@@+   @@
    @@@@  @@@@@@@@@@@@@@@@
-Welcome to the axolotl cloud image! If the you've mounted a disk to /workspace and the axolotl directory ie empty, run the following commands:
+Welcome to the axolotl cloud image! If you've mounted a disk to /workspace and the axolotl directory is empty, run the following commands:
 ```
 cd /workspace
--- a/scripts/unsloth_install.py
+++ b/scripts/unsloth_install.py
@@ -1,11 +1,15 @@
 # noqa
 # pylint: skip-file
 import sys
 try:
    import torch
 except ImportError:
    raise ImportError("Install torch via `pip install torch`")
 from packaging.version import Version as V
 use_uv = "--uv" in sys.argv[1:]
 v = V(torch.__version__)
 cuda = str(torch.version.cuda)
 try:
@@ -31,6 +35,7 @@ elif v < V("2.6.0"):
 else:
    raise RuntimeError(f"Torch = {v} too new!")
 x = x.format(cuda.replace(".", ""), "-ampere" if is_ampere else "")
 uv_prefix = "uv " if use_uv else ""
 print(
-    f'pip install unsloth-zoo==2024.12.1 && pip install --no-deps "unsloth[{x}]==2024.12.4"'
+    f'{uv_prefix}pip install unsloth-zoo==2024.12.1 && {uv_prefix}pip install --no-deps "unsloth[{x}]==2024.12.4"'
 )
--- a/setup.py
+++ b/setup.py
@@ -118,7 +118,7 @@ extras_require = {
        "yunchang==0.6.0",
    ],
    "deepspeed": [
-        "deepspeed==0.15.4",
+        "deepspeed==0.17.0",
        "deepspeed-kernels",
    ],
    "mamba-ssm": [
--- a/src/axolotl/cli/args.py
+++ b/src/axolotl/cli/args.py
@@ -28,7 +28,6 @@ class TrainerCliArgs:
    debug: bool = field(default=False)
    debug_text_only: bool = field(default=False)
    debug_num_examples: int = field(default=0)
    merge_lora: bool = field(default=False)
    prompter: Optional[str] = field(default=None)
    shard: bool = field(default=False)
    main_process_port: Optional[int] = field(default=None)
@@ -89,6 +88,26 @@ class VllmServeCliArgs:
        },
    )
    enable_reasoning: Optional[bool] = field(
        default=None,
    )
    reasoning_parser: Optional[str] = field(
        default=None,
    )
@dataclass
 class QuantizeCliArgs:
    """Dataclass with CLI arguments for `axolotl quantize` command."""
    base_model: Optional[str] = field(default=None)
    weight_dtype: Optional[str] = field(default=None)
    activation_dtype: Optional[str] = field(default=None)
    quantize_embedding: Optional[bool] = field(default=None)
    group_size: Optional[int] = field(default=None)
    output_dir: Optional[str] = field(default=None)
@dataclass
 class EvaluateCliArgs:
--- a/src/axolotl/cli/checks.py
+++ b/src/axolotl/cli/checks.py
@@ -1,6 +1,5 @@
 """Various checks for Axolotl CLI."""
 import logging
 import os
 from pathlib import Path
@@ -8,7 +7,9 @@ from accelerate.commands.config import config_args
 from huggingface_hub import HfApi
 from huggingface_hub.utils import LocalTokenNotFoundError
-LOG = logging.getLogger(__name__)
+from axolotl.utils.logging import get_logger
 LOG = get_logger(__name__)
 def check_accelerate_default_config() -> None:
--- a/src/axolotl/cli/cloud/modal_.py
+++ b/src/axolotl/cli/cloud/modal_.py
@@ -82,7 +82,7 @@ class ModalCloud(Cloud):
        return res
    def get_image(self):
-        docker_tag = "main-py3.11-cu124-2.5.1"
+        docker_tag = "main-py3.11-cu124-2.6.0"
        if self.config.docker_tag:
            docker_tag = self.config.docker_tag
        docker_image = f"axolotlai/axolotl:{docker_tag}"
--- a/src/axolotl/cli/config.py
+++ b/src/axolotl/cli/config.py
@@ -1,7 +1,6 @@
 """Configuration loading and processing."""
 import json
 import logging
 import os
 import tempfile
 from pathlib import Path
@@ -22,11 +21,12 @@ from axolotl.utils.config import (
    validate_config,
 )
 from axolotl.utils.dict import DictDefault
 from axolotl.utils.logging import get_logger
 from axolotl.utils.mlflow_ import setup_mlflow_env_vars
 from axolotl.utils.trainer import prepare_opinionated_env, prepare_optim_env
 from axolotl.utils.wandb_ import setup_wandb_env_vars
-LOG = logging.getLogger(__name__)
+LOG = get_logger(__name__, use_environ=True)
 def check_remote_config(config: Union[str, Path]) -> Union[str, Path]:
@@ -119,12 +119,12 @@ def choose_config(path: Path) -> str:
        )
    if len(yaml_files) == 1:
-        print(f"Using default YAML file '{yaml_files[0]}'")
+        LOG.info(f"Using default YAML file '{yaml_files[0]}'")
        return str(yaml_files[0])
-    print("Choose a YAML file:")
+    LOG.info("Choose a YAML file:")
    for idx, file in enumerate(yaml_files):
-        print(f"{idx + 1}. {file}")
+        LOG.info(f"{idx + 1}. {file}")
    chosen_file = None
    while chosen_file is None:
@@ -133,9 +133,9 @@ def choose_config(path: Path) -> str:
            if 1 <= choice <= len(yaml_files):
                chosen_file = str(yaml_files[choice - 1])
            else:
-                print("Invalid choice. Please choose a number from the list.")
+                LOG.info("Invalid choice. Please choose a number from the list.")
        except ValueError:
-            print("Invalid input. Please enter a number.")
+            LOG.info("Invalid input. Please enter a number.")
    return chosen_file
--- a/src/axolotl/cli/evaluate.py
+++ b/src/axolotl/cli/evaluate.py
@@ -1,6 +1,5 @@
 """CLI to run evaluation on a model."""
 import logging
 import os
 from pathlib import Path
 from typing import Union
@@ -17,8 +16,9 @@ from axolotl.common.datasets import load_datasets, load_preference_datasets
 from axolotl.evaluate import evaluate
 from axolotl.utils import patch_optimized_env
 from axolotl.utils.dict import DictDefault
 from axolotl.utils.logging import get_logger
-LOG = logging.getLogger(__name__)
+LOG = get_logger(__name__)
 def do_evaluate(cfg: DictDefault, cli_args: TrainerCliArgs) -> None:
--- a/src/axolotl/cli/inference.py
+++ b/src/axolotl/cli/inference.py
@@ -1,7 +1,6 @@
 """CLI to run inference on a trained model."""
 import importlib
 import logging
 import sys
 from pathlib import Path
 from threading import Thread
@@ -22,8 +21,9 @@ from axolotl.utils.chat_templates import (
    get_chat_template_from_config,
 )
 from axolotl.utils.dict import DictDefault
 from axolotl.utils.logging import get_logger
-LOG = logging.getLogger(__name__)
+LOG = get_logger(__name__)
 def get_multi_line_input() -> str:
--- a/src/axolotl/cli/main.py
+++ b/src/axolotl/cli/main.py
@@ -2,7 +2,6 @@
 # pylint: disable=redefined-outer-name
 import logging
 import os
 import subprocess  # nosec B404
 import tempfile
@@ -17,6 +16,7 @@ import axolotl
 from axolotl.cli.args import (
    EvaluateCliArgs,
    PreprocessCliArgs,
    QuantizeCliArgs,
    TrainerCliArgs,
    VllmServeCliArgs,
 )
@@ -30,8 +30,11 @@ from axolotl.cli.utils import (
 )
 from axolotl.integrations.lm_eval.cli import lm_eval
 from axolotl.utils import patch_optimized_env
 from axolotl.utils.logging import get_logger
 from axolotl.utils.schemas.config import AxolotlInputConfig
 LOG = get_logger(__name__)
@click.group()
@click.version_option(version=axolotl.__version__, prog_name="axolotl")
@@ -176,7 +179,7 @@ def train(
                    do_cli(config=cfg_file, **kwargs)
        except subprocess.CalledProcessError as exc:
-            logging.error(f"Failed to train/fine-tune config '{cfg_file}': {exc}")
+            LOG.error(f"Failed to train/fine-tune config '{cfg_file}': {exc}")
            if not sweep:
                raise exc
@@ -333,6 +336,16 @@ def vllm_serve(config: str, **cli_args: VllmServeCliArgs):
    do_vllm_serve(config, cli_args)
@cli.command()
@click.argument("config", type=click.Path(exists=True, path_type=str))
@add_options_from_dataclass(QuantizeCliArgs)
@filter_none_kwargs
 def quantize(config: str, **cli_args: QuantizeCliArgs):
    from axolotl.cli.quantize import do_quantize
    do_quantize(config, cli_args)
@cli.command()
@click.argument("model", type=click.Path(exists=True, path_type=str))
@click.argument("output", type=click.Path(exists=False, path_type=str))
--- a/src/axolotl/cli/merge_lora.py
+++ b/src/axolotl/cli/merge_lora.py
@@ -1,20 +1,18 @@
 """CLI to merge a trained LoRA into a base model."""
 import logging
 from pathlib import Path
 from typing import Union
 import fire
 import transformers
 from dotenv import load_dotenv
 from axolotl.cli.args import TrainerCliArgs
 from axolotl.cli.art import print_axolotl_text_art
 from axolotl.cli.config import load_cfg
 from axolotl.cli.utils import load_model_and_tokenizer
 from axolotl.utils.dict import DictDefault
 from axolotl.utils.logging import get_logger
-LOG = logging.getLogger(__name__)
+LOG = get_logger(__name__)
 def do_merge_lora(*, cfg: DictDefault) -> None:
@@ -68,12 +66,6 @@ def do_cli(config: Union[Path, str] = Path("examples/"), **kwargs) -> None:
    Raises:
        ValueError: If target directory for LoRA merged model does not exist.
    """
    # pylint: disable=duplicate-code
    parser = transformers.HfArgumentParser(TrainerCliArgs)
    parsed_cli_args, _ = parser.parse_args_into_dataclasses(
        return_remaining_strings=True
    )
    parsed_cli_args.merge_lora = True
    parsed_cfg = load_cfg(
        config,
--- a/src/axolotl/cli/merge_sharded_fsdp_weights.py
+++ b/src/axolotl/cli/merge_sharded_fsdp_weights.py
@@ -1,7 +1,6 @@
 """CLI to merge sharded FSDP model checkpoints into a single combined checkpoint."""
 import json
 import logging
 import os
 import shutil
 from pathlib import Path
@@ -11,7 +10,6 @@ import fire
 import torch
 import torch.distributed.checkpoint as dist_cp
 import torch.distributed.checkpoint.format_utils as dist_cp_format_utils
 import transformers
 from accelerate.utils import (
    SAFE_WEIGHTS_INDEX_NAME,
    SAFE_WEIGHTS_NAME,
@@ -24,11 +22,11 @@ from huggingface_hub import split_torch_state_dict_into_shards
 from safetensors.torch import save_file as safe_save_file
 from torch.distributed.checkpoint.format_utils import _EmptyStateDictLoadPlanner
 from axolotl.cli.args import TrainerCliArgs
 from axolotl.cli.art import print_axolotl_text_art
 from axolotl.cli.config import load_cfg
 from axolotl.utils.logging import get_logger
-LOG = logging.getLogger(__name__)
+LOG = get_logger(__name__)
 class BFloat16CastPlanner(_EmptyStateDictLoadPlanner):
@@ -197,11 +195,6 @@ def do_cli(config: Union[Path, str] = Path("examples/"), **kwargs):
    """
    # pylint: disable=duplicate-code
    print_axolotl_text_art()
    parser = transformers.HfArgumentParser(TrainerCliArgs)
    parsed_cli_args, _ = parser.parse_args_into_dataclasses(
        return_remaining_strings=True
    )
    parsed_cli_args.merge_lora = True
    parsed_cfg = load_cfg(config, **kwargs)
    fsdp_dir = Path(parsed_cfg.output_dir) / "pytorch_model_fsdp_0"
--- a/src/axolotl/cli/preprocess.py
+++ b/src/axolotl/cli/preprocess.py
@@ -1,6 +1,5 @@
 """CLI to run preprocessing of a dataset."""
 import logging
 import warnings
 from pathlib import Path
 from typing import Union
@@ -20,9 +19,10 @@ from axolotl.common.const import DEFAULT_DATASET_PREPARED_PATH
 from axolotl.common.datasets import load_datasets, load_preference_datasets
 from axolotl.integrations.base import PluginManager
 from axolotl.utils.dict import DictDefault
 from axolotl.utils.logging import get_logger
 from axolotl.utils.trainer import disable_datasets_caching
-LOG = logging.getLogger(__name__)
+LOG = get_logger(__name__)
 def do_preprocess(cfg: DictDefault, cli_args: PreprocessCliArgs) -> None:
--- a/src/axolotl/cli/quantize.py
+++ b/src/axolotl/cli/quantize.py
@@ -0,0 +1,90 @@
 """
 CLI to post-training quantize a model using torchao
 """
 from pathlib import Path
 from typing import Union
 from transformers import AutoModelForCausalLM
 from axolotl.cli.art import print_axolotl_text_art
 from axolotl.cli.config import load_cfg
 from axolotl.loaders import load_tokenizer
 from axolotl.utils.logging import get_logger
 from axolotl.utils.quantization import TorchIntDType, quantize_model_for_ptq
 LOG = get_logger(__name__)
 def do_quantize(
    config: Union[Path, str],
    cli_args: dict,
 ):
    """
    Quantizes a model's weights
    Args:
        config (Union[Path, str]): The path to the config file
        cli_args (dict): Additional command-line arguments
    """
    print_axolotl_text_art()
    cfg = load_cfg(config)
    if cfg.qat and cfg.quantization:
        raise ValueError(
            "QAT and quantization cannot be used together. Please specify only one of qat or quantization in your config file."
        )
    if cfg.qat:
        quantize_cfg = cfg.qat
    elif cfg.quantization:
        quantize_cfg = cfg.quantization
    else:
        raise ValueError(
            "No quantization configuration found. Please specify either qat or quantization in your config file."
        )
    model_path = cli_args.get("model_path") or cfg.output_dir
    if weight_dtype := cli_args.get("weight_dtype"):
        weight_dtype = TorchIntDType[weight_dtype]
    else:
        weight_dtype = quantize_cfg.weight_dtype
    if activation_dtype := cli_args.get("activation_dtype"):
        activation_dtype = TorchIntDType[activation_dtype]
    else:
        activation_dtype = quantize_cfg.activation_dtype
    group_size = cli_args.get("group_size") or quantize_cfg.group_size
    quantize_embedding = (
        cli_args.get("quantize_embedding") or quantize_cfg.quantize_embedding
    )
    output_dir = cli_args.get("output_dir") or cfg.output_dir
    LOG.info(f"Loading model from {model_path}...")
    tokenizer = load_tokenizer(cfg)
    model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
    LOG.info(
        f"Quantizing model with configuration: \n"
        f"\tweight_dtype: {weight_dtype}\n"
        f"\tactivation_dtype: {activation_dtype}\n"
        f"\tgroup_size: {group_size}\n"
        f"\tquantize_embedding: {quantize_embedding}"
    )
    quantize_model_for_ptq(
        model, weight_dtype, group_size, activation_dtype, quantize_embedding
    )
    LOG.info(f"Saving quantized model to: {str(Path(output_dir) / 'quantized')}...")
    model.save_pretrained(
        str(Path(output_dir) / "quantized"),
        safe_serialization=False,
        progressbar=True,
    )
    tokenizer.save_pretrained(
        str(Path(output_dir) / "quantized"),
        safe_serialization=False,
        progressbar=True,
    )
    LOG.info(f"Quantized model saved to: {str(Path(output_dir) / 'quantized')}...")
--- a/src/axolotl/cli/train.py
+++ b/src/axolotl/cli/train.py
@@ -1,7 +1,6 @@
 """CLI to run training on a model."""
 import gc
 import logging
 import os
 from pathlib import Path
 from typing import Union
@@ -22,8 +21,6 @@ from axolotl.utils import patch_optimized_env
 from axolotl.utils.config import normalize_config, resolve_dtype
 from axolotl.utils.dict import DictDefault
 LOG = logging.getLogger(__name__)
 def do_train(cfg: DictDefault, cli_args: TrainerCliArgs):
    """
--- a/src/axolotl/cli/utils.py
+++ b/src/axolotl/cli/utils.py
@@ -4,7 +4,6 @@ import concurrent.futures
 import dataclasses
 import hashlib
 import json
 import logging
 from functools import wraps
 from pathlib import Path
 from types import NoneType
@@ -23,8 +22,9 @@ from transformers import (
 from axolotl.loaders import load_processor, load_tokenizer
 from axolotl.loaders.model import ModelLoader
 from axolotl.utils.dict import DictDefault
 from axolotl.utils.logging import get_logger
-LOG = logging.getLogger(__name__)
+LOG = get_logger(__name__)
 def strip_optional_type(field_type: type | str | None):
--- a/src/axolotl/cli/vllm_serve.py
+++ b/src/axolotl/cli/vllm_serve.py
@@ -2,14 +2,27 @@
 CLI to start the vllm server for online RL
 """
 import os
 from dataclasses import dataclass, field
 from pathlib import Path
 from typing import Union
 import trl
 from trl.scripts.vllm_serve import ScriptArguments
 from axolotl.cli.config import load_cfg
 @dataclass
 class AxolotlScriptArguments(ScriptArguments):
    """
    Additional arguments for the VLLM server
    """
    reasoning_parser: str = field(default="", kw_only=True)
    enable_reasoning: bool | None = field(default=None, kw_only=True)
 def do_vllm_serve(
    config: Union[Path, str],
    cli_args: dict,
@@ -24,6 +37,7 @@ def do_vllm_serve(
    Returns:
        process_id: the process id of the started VLLM server
    """
    patch_vllm_worker()
    cfg = load_cfg(config)
    model = cfg.base_model
@@ -43,9 +57,16 @@ def do_vllm_serve(
    enable_prefix_caching = (
        cli_args.get("enable_prefix_caching") or cfg.vllm.enable_prefix_caching
    )
    reasoning_parser = (
        cli_args.get("reasoning_parser") or cfg.vllm.reasoning_parser or ""
    )
    enable_reasoning = (
        cli_args.get("enable_reasoning") or cfg.vllm.enable_reasoning or False
    )
-    vllm_script_args = ScriptArguments(
-        model,
+    # pylint: disable=unexpected-keyword-arg
+    vllm_script_args = AxolotlScriptArguments(
        model=model,
        tensor_parallel_size=tensor_parallel_size,
        host=host,
        port=port,
@@ -53,5 +74,67 @@ def do_vllm_serve(
        dtype=dtype,
        max_model_len=max_model_len,
        enable_prefix_caching=enable_prefix_caching,
        reasoning_parser=reasoning_parser,
        enable_reasoning=enable_reasoning,
    )
    vllm_serve_main(vllm_script_args)
 def patch_vllm_worker():
    from multiprocessing.connection import Connection
    from vllm import LLM
    def llm_worker(
        script_args: AxolotlScriptArguments,
        data_parallel_rank: int,
        master_port: int,
        connection: Connection,
    ) -> None:
        # Set required environment variables for DP to work with vLLM
        os.environ["VLLM_DP_RANK"] = str(data_parallel_rank)
        os.environ["VLLM_DP_RANK_LOCAL"] = str(data_parallel_rank)
        os.environ["VLLM_DP_SIZE"] = str(script_args.data_parallel_size)
        os.environ["VLLM_DP_MASTER_PORT"] = str(master_port)
        llm = LLM(
            model=script_args.model,
            revision=script_args.revision,
            tensor_parallel_size=script_args.tensor_parallel_size,
            gpu_memory_utilization=script_args.gpu_memory_utilization,
            enforce_eager=script_args.enforce_eager,
            dtype=script_args.dtype,
            # Automatic Prefix Caching caches the KV cache of existing queries, so that a new query can
            # directly reuse the KV cache if it shares the same prefix with one of the existing queries.
            # This is particularly useful here because we generate completions from the same prompts.
            enable_prefix_caching=script_args.enable_prefix_caching,
            kv_cache_dtype=script_args.kv_cache_dtype,
            max_model_len=script_args.max_model_len,
            worker_extension_cls="trl.scripts.vllm_serve.WeightSyncWorkerExtension",
            enable_reasoning=script_args.enable_reasoning,
            reasoning_parser=script_args.reasoning_parser,
        )
        # Send ready signal to parent process
        connection.send({"status": "ready"})
        while True:
            # Wait for commands from the parent process
            try:
                command = connection.recv()
            except KeyboardInterrupt:
                llm.collective_rpc(method="close_communicator")
                break
            # Handle commands
            if command["type"] in ["call", "fire_and_forget"]:
                method_name = command["method"]
                args, kwargs = command.get("args", ()), command.get("kwargs", {})
                method = getattr(llm, method_name)
                result = method(*args, **kwargs)
                if command["type"] == "call":
                    connection.send(result)
            elif command["type"] == "shutdown":
                break
    trl.scripts.vllm_serve.llm_worker = llm_worker
--- a/src/axolotl/common/datasets.py
+++ b/src/axolotl/common/datasets.py
@@ -1,6 +1,5 @@
 """Dataset loading utilities."""
 import logging
 import math
 import random
 from dataclasses import dataclass
@@ -14,10 +13,11 @@ from axolotl.loaders import load_processor, load_tokenizer
 from axolotl.utils.data import prepare_dataset
 from axolotl.utils.data.rl import load_prepare_preference_datasets
 from axolotl.utils.dict import DictDefault
 from axolotl.utils.logging import get_logger
 from axolotl.utils.schemas.enums import RLType
 from axolotl.utils.tokenization import check_dataset_labels
-LOG = logging.getLogger(__name__)
+LOG = get_logger(__name__)
 @dataclass
--- a/src/axolotl/core/builders/__init__.py
+++ b/src/axolotl/core/builders/__init__.py
@@ -0,0 +1,6 @@
 """Trainer builder classes"""
 from .causal import HFCausalTrainerBuilder
 from .rl import HFRLTrainerBuilder
 __all__ = ["HFCausalTrainerBuilder", "HFRLTrainerBuilder"]
--- a/src/axolotl/core/builders/base.py
+++ b/src/axolotl/core/builders/base.py
@@ -0,0 +1,503 @@
 # Copyright 2024 Axolotl AI. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """Base class for trainer builder"""
 import abc
 import importlib
 import logging
 import sys
 from abc import abstractmethod
 from contextlib import suppress
 from pathlib import Path
 from typing import Any
 import torch
 from transformers import (
    TrainerCallback,
 )
 from transformers.training_args import OptimizerNames
 from axolotl.integrations.base import PluginManager
 from axolotl.monkeypatch.trainer.lr import patch_trainer_get_lr
 from axolotl.utils import is_comet_available, is_mlflow_available
 from axolotl.utils.callbacks import (
    GCCallback,
    GPUStatsCallback,
    SaveAxolotlConfigtoWandBCallback,
 )
 from axolotl.utils.callbacks.profiler import PytorchProfilerCallback
 from axolotl.utils.schemas.enums import CustomSupportedOptimizers
 LOG = logging.getLogger(__name__)
 with suppress(ImportError):
    import torch._dynamo  # pylint: disable=ungrouped-imports
 class TrainerBuilderBase(abc.ABC):
    """Base class for trainer builder."""
    def __init__(self, cfg, model, tokenizer, processor=None):
        self.cfg = cfg
        self.model = model
        self.tokenizer = tokenizer
        self.processor = processor
        self._train_dataset = None
        self._eval_dataset = None
        self._model_ref = None
        self._peft_config = None
        # If the model supports tagging, add the axolotl tag.
        # This makes sure the tag is correctly pushed even if a user calls
        # model.push_to_hub instead of trainer.push_to_hub.
        if hasattr(model, "add_model_tags"):
            model.add_model_tags(["axolotl"])
        patch_trainer_get_lr()
    @property
    def model_ref(self):
        return self._model_ref
    @model_ref.setter
    def model_ref(self, model):
        self._model_ref = model
    @property
    def train_dataset(self):
        return self._train_dataset
    @train_dataset.setter
    def train_dataset(self, dataset):
        self._train_dataset = dataset
    @property
    def eval_dataset(self):
        return self._eval_dataset
    @eval_dataset.setter
    def eval_dataset(self, dataset):
        self._eval_dataset = dataset
    @property
    def peft_config(self):
        return self._peft_config
    @peft_config.setter
    def peft_config(self, peft_config):
        self._peft_config = peft_config
    @abstractmethod
    def build(self, total_num_steps):
        pass
    def get_callbacks(self) -> list[TrainerCallback]:
        callbacks = []
        plugin_manager = PluginManager.get_instance()
        callbacks.extend(
            plugin_manager.add_callbacks_pre_trainer(cfg=self.cfg, model=self.model)
        )
        if self.cfg.profiler_steps:
            callbacks.append(
                PytorchProfilerCallback(
                    steps_to_profile=self.cfg.profiler_steps,
                )
            )
        if self.cfg.gc_steps:
            callbacks.append(GCCallback(gc_steps=self.cfg.gc_steps))
        if self.cfg.use_wandb:
            callbacks.append(
                SaveAxolotlConfigtoWandBCallback(self.cfg.axolotl_config_path)
            )
        if self.cfg.use_mlflow and is_mlflow_available():
            from axolotl.utils.callbacks.mlflow_ import (
                SaveAxolotlConfigtoMlflowCallback,
            )
            callbacks.extend(
                [
                    SaveAxolotlConfigtoMlflowCallback(self.cfg.axolotl_config_path),
                ]
            )
        if self.cfg.use_comet and is_comet_available():
            from axolotl.utils.callbacks.comet_ import SaveAxolotlConfigtoCometCallback
            callbacks.append(
                SaveAxolotlConfigtoCometCallback(self.cfg.axolotl_config_path)
            )
        callbacks.append(GPUStatsCallback(cfg=self.cfg))
        return callbacks
    def get_post_trainer_create_callbacks(self, trainer):
        """
        Callbacks added after the trainer is created, usually because they need access to the trainer
        """
        callbacks = []
        if self.cfg.plugins:
            plugin_manager = PluginManager.get_instance()
            callbacks.extend(
                [
                    cb
                    for cb in plugin_manager.add_callbacks_post_trainer(
                        self.cfg, trainer
                    )
                    if cb
                ]
            )
        return callbacks
    def hook_pre_create_training_args(self, training_arguments_kwargs):
        # TODO
        return training_arguments_kwargs
    def hook_post_create_training_args(self, training_arguments):
        # TODO
        return training_arguments
    def hook_pre_create_trainer(self, trainer_kwargs, trainer_cls):
        # TODO
        return trainer_kwargs, trainer_cls
    def hook_post_create_trainer(self, trainer):
        # TODO
        return trainer
    def _configure_warmup_and_logging(
        self, total_num_steps: int, training_args_kwargs: dict
    ):
        warmup_steps = 0
        warmup_ratio = 0.0
        if self.cfg.warmup_steps:
            warmup_steps = self.cfg.warmup_steps
        elif self.cfg.warmup_ratio:
            if total_num_steps:
                warmup_steps = max(int(self.cfg.warmup_ratio * total_num_steps), 0)
            else:
                warmup_ratio = self.cfg.warmup_ratio
        elif total_num_steps:
            warmup_steps = min(int(0.03 * total_num_steps), 100)
        else:
            warmup_ratio = 0.03
        if warmup_steps == 1:
            warmup_steps = 2
        if self.cfg.logging_steps is not None:
            training_args_kwargs["logging_steps"] = self.cfg.logging_steps
        else:
            training_args_kwargs["logging_steps"] = (
                500  # transformers defaults to 500
                if not total_num_steps
                else max(min(int(0.005 * total_num_steps), 10), 1)
            )
        training_args_kwargs["warmup_ratio"] = warmup_ratio
        training_args_kwargs["warmup_steps"] = warmup_steps
    def _configure_precision_settings(self, training_args_kwargs: dict):
        training_args_kwargs["fp16"] = (self.cfg.fp16 and not self.cfg.bf16) or False
        training_args_kwargs["tf32"] = self.cfg.tf32
        if self.cfg.bf16 == "full":
            training_args_kwargs["bf16_full_eval"] = True
        else:
            training_args_kwargs["bf16"] = self.cfg.bf16 or self.cfg.bfloat16
    def _configure_scheduler(self, training_args_kwargs: dict):
        if self.cfg.lr_scheduler in ["one_cycle", "rex"]:
            training_args_kwargs["lr_scheduler_type"] = "cosine"
            training_args_kwargs["alternate_lr_scheduler_type"] = self.cfg.lr_scheduler
        else:
            training_args_kwargs["lr_scheduler_type"] = (
                self.cfg.lr_scheduler if self.cfg.lr_scheduler else "cosine"
            )
        training_args_kwargs["lr_scheduler_kwargs"] = (
            self.cfg.lr_scheduler_kwargs if self.cfg.lr_scheduler_kwargs else {}
        )
    def _configure_optimizer(self, training_args_kwargs: dict, trainer_kwargs: dict):
        def _configure_custom_optimizer(
            training_args_kwargs: dict, trainer_kwargs: dict
        ):
            # Common optimizer kwargs
            optimizer_kwargs = {
                "lr": training_args_kwargs["learning_rate"],
                "weight_decay": training_args_kwargs["weight_decay"],
            }
            # Adam-specific kwargs
            adam_kwargs: dict = {}
            if training_args_kwargs.get("adam_beta1") and training_args_kwargs.get(
                "adam_beta2"
            ):
                adam_kwargs["betas"] = (
                    training_args_kwargs.get("adam_beta1"),
                    training_args_kwargs.get("adam_beta2"),
                )
            if training_args_kwargs.get("adam_epsilon"):
                adam_kwargs["eps"] = training_args_kwargs.get("adam_epsilon")
            if self.cfg.optimizer == "muon":
                from axolotl.contribs.mit.muon import (  # pylint: disable=no-name-in-module
                    MuonOptimizerFactory,
                )
                optimizer_cls = MuonOptimizerFactory
                optimizer_kwargs.update(adam_kwargs)
            elif self.cfg.optimizer == "optimi_adamw":
                from optimi import AdamW
                optimizer_kwargs["foreach"] = False
                optimizer_cls = AdamW
                optimizer_kwargs.update(adam_kwargs)
            elif self.cfg.optimizer == "ao_adamw_4bit":
                # TODO remove 20250401
                from torchao.prototype.low_bit_optim import AdamW4bit
                optimizer_cls = AdamW4bit
                optimizer_kwargs.update(adam_kwargs)
                LOG.warning(
                    f"`ao_adamw_4bit` will be deprecated soon. Please use `{OptimizerNames.ADAMW_TORCH_4BIT}` instead."
                )
            elif self.cfg.optimizer == "ao_adamw_8bit":
                from torchao.prototype.low_bit_optim import AdamW8bit
                optimizer_cls = AdamW8bit
                optimizer_kwargs.update(adam_kwargs)
            elif self.cfg.optimizer == "ao_adamw_fp8":
                from torchao.prototype.low_bit_optim import AdamWFp8
                optimizer_cls = AdamWFp8
                optimizer_kwargs.update(adam_kwargs)
            elif self.cfg.optimizer == "adopt_adamw":
                from axolotl.utils.optimizers.adopt import ADOPT
                optimizer_cls = ADOPT
                adam_kwargs["decouple"] = True
                optimizer_kwargs.update(adam_kwargs)
            elif self.cfg.optimizer == "came_pytorch":
                from came_pytorch import CAME
                optimizer_cls = CAME
                beta1 = training_args_kwargs.get("adam_beta1", 0.9)
                beta2 = training_args_kwargs.get("adam_beta2", 0.999)
                beta3 = training_args_kwargs.get("adam_beta3", 0.9999)
                eps1 = training_args_kwargs.get("adam_epsilon", 1e-30)
                eps2 = training_args_kwargs.get("adam_epsilon2", 1e-16)
                adam_kwargs["betas"] = (beta1, beta2, beta3)
                adam_kwargs["eps"] = (eps1, eps2)
                optimizer_kwargs.update(adam_kwargs)
            else:
                raise ValueError(
                    f"Unhandled optimizer: {self.cfg.optimizer}. Please raise an Issue."
                )
            # Parse any additional optimizer args from config
            if self.cfg.optim_args:
                if isinstance(self.cfg.optim_args, dict):
                    optimizer_kwargs.update(self.cfg.optim_args)
                else:
                    # Parse string format "key1=value1,key2=value2"
                    for mapping in self.cfg.optim_args.replace(" ", "").split(","):
                        key, value = mapping.split("=")
                        optimizer_kwargs[key] = value
            # Note: This is not used in training_args_kwargs, but in trainer_kwargs
            trainer_kwargs["optimizer_cls_and_kwargs"] = (
                optimizer_cls,
                optimizer_kwargs,
            )
        # Handle custom optimizer
        custom_supported_optimizers = [opt.value for opt in CustomSupportedOptimizers]
        if self.cfg.optimizer in custom_supported_optimizers:
            _configure_custom_optimizer(training_args_kwargs, trainer_kwargs)
        else:
            # Use transformers' optimizer
            training_args_kwargs["optim"] = self.cfg.optimizer
            # Parse any additional optimizer args from config
            if self.cfg.optim_args:
                if isinstance(self.cfg.optim_args, dict):
                    optim_args = ",".join(
                        [f"{key}={value}" for key, value in self.cfg.optim_args.items()]
                    )
                else:
                    optim_args = self.cfg.optim_args
                training_args_kwargs["optim_args"] = optim_args
            if (
                self.cfg.optimizer == "adamw_anyprecision"
                and Path(self.cfg.torchdistx_path).exists()
            ):
                sys.path.append(self.cfg.torchdistx_path)
                importlib.import_module("torchdistx")
    def _configure_hub_parameters(self, training_args_kwargs: dict):
        if self.cfg.hub_model_id:
            training_args_kwargs["hub_model_id"] = self.cfg.hub_model_id
            training_args_kwargs["push_to_hub"] = True
            training_args_kwargs["hub_private_repo"] = True
            training_args_kwargs["hub_always_push"] = True
            if self.cfg.hub_strategy:
                training_args_kwargs["hub_strategy"] = self.cfg.hub_strategy
    def _configure_save_and_eval_strategy(self, training_args_kwargs: dict):
        # save_strategy and save_steps
        if self.cfg.save_steps:
            training_args_kwargs["save_strategy"] = "steps"
            training_args_kwargs["save_steps"] = self.cfg.save_steps
        elif self.cfg.save_strategy:
            training_args_kwargs["save_strategy"] = self.cfg.save_strategy
        else:
            # default to saving each epoch if not defined
            training_args_kwargs["save_strategy"] = "epoch"
        training_args_kwargs["save_total_limit"] = (
            self.cfg.save_total_limit if self.cfg.save_total_limit else 4
        )
        # eval_strategy and eval_steps
        if not self.eval_dataset or self.cfg.val_set_size == 0:
            # do not eval if no eval_dataset or val_set_size=0
            training_args_kwargs["eval_strategy"] = "no"
        elif self.cfg.eval_steps:
            training_args_kwargs["eval_strategy"] = "steps"
            training_args_kwargs["eval_steps"] = self.cfg.eval_steps
        elif self.cfg.eval_strategy:
            training_args_kwargs["eval_strategy"] = self.cfg.eval_strategy
    def _configure_reporting(self, training_args_kwargs: dict):
        report_to = []
        if self.cfg.use_wandb:
            report_to.append("wandb")
        if self.cfg.use_mlflow:
            report_to.append("mlflow")
        if self.cfg.use_tensorboard:
            report_to.append("tensorboard")
        if self.cfg.use_comet:
            report_to.append("comet_ml")
        training_args_kwargs["report_to"] = report_to
        if self.cfg.use_wandb:
            training_args_kwargs["run_name"] = self.cfg.wandb_name
        elif self.cfg.use_mlflow:
            training_args_kwargs["run_name"] = self.cfg.mlflow_run_name
        else:
            training_args_kwargs["run_name"] = None
    def _configure_torch_compile(self, training_args_kwargs: dict):
        if self.cfg.torch_compile and getattr(torch, "_dynamo", None):
            torch._dynamo.config.suppress_errors = (  # pylint: disable=protected-access
                True
            )
            training_args_kwargs["torch_compile"] = self.cfg.torch_compile
            if self.cfg.torch_compile_backend:
                training_args_kwargs["torch_compile_backend"] = (
                    self.cfg.torch_compile_backend
                )
            if self.cfg.torch_compile_mode:
                training_args_kwargs["torch_compile_mode"] = self.cfg.torch_compile_mode
    def _configure_gradient_checkpointing(self, training_args_kwargs: dict):
        if self.cfg.gradient_checkpointing:
            training_args_kwargs["gradient_checkpointing"] = (
                self.cfg.gradient_checkpointing
            )
            if self.cfg.gradient_checkpointing_kwargs is not None:
                training_args_kwargs["gradient_checkpointing_kwargs"] = (
                    self.cfg.gradient_checkpointing_kwargs
                )
            else:
                training_args_kwargs["gradient_checkpointing_kwargs"] = {
                    "use_reentrant": False
                }
    def _set_base_training_args(
        self, total_num_steps
    ) -> tuple[dict[str, Any], dict[str, Any]]:
        training_args_kwargs: dict[str, Any] = {}
        trainer_kwargs: dict[str, Any] = {}
        self._configure_warmup_and_logging(total_num_steps, training_args_kwargs)
        self._configure_precision_settings(training_args_kwargs)
        self._configure_save_and_eval_strategy(training_args_kwargs)
        self._configure_gradient_checkpointing(training_args_kwargs)
        # copy each arg into training_args_kwargs under the same name if its value is not None
        for arg in [
            # optim/scheduler
            "adam_beta1",
            "adam_beta2",
            "adam_beta3",
            "adam_epsilon",
            "adam_epsilon2",
            "cosine_min_lr_ratio",
            "cosine_constant_lr_ratio",
            "optim_target_modules",
            # trainer
            "max_grad_norm",
            "dataloader_num_workers",
            "dataloader_pin_memory",
            "dataloader_prefetch_factor",
            "gradient_accumulation_steps",
            "learning_rate",
            "embedding_lr",
            "embedding_lr_scale",
            "lr_groups",
            "loraplus_lr_ratio",
            "loraplus_lr_embedding",
            "output_dir",
            "save_safetensors",
            "save_only_model",
            "include_tokens_per_second",
            "weight_decay",
            "seed",
        ]:
            if hasattr(self.cfg, arg) and getattr(self.cfg, arg) is not None:
                training_args_kwargs[arg] = getattr(self.cfg, arg)
        training_args_kwargs["per_device_train_batch_size"] = self.cfg.micro_batch_size
        if self.cfg.eval_batch_size:
            training_args_kwargs["per_device_eval_batch_size"] = (
                self.cfg.eval_batch_size
            )
        training_args_kwargs["max_steps"] = self.cfg.max_steps or total_num_steps or -1
        training_args_kwargs["num_train_epochs"] = self.cfg.num_epochs
        # max_length is not used in CausalTrainer
        if self.cfg.reward_model or self.cfg.rl:
            training_args_kwargs["max_length"] = self.cfg.sequence_len
        self._configure_reporting(training_args_kwargs)
        self._configure_hub_parameters(training_args_kwargs)
        self._configure_scheduler(training_args_kwargs)
        self._configure_optimizer(training_args_kwargs, trainer_kwargs)
        self._configure_torch_compile(training_args_kwargs)
        return training_args_kwargs, trainer_kwargs
--- a/src/axolotl/core/builders/causal.py
+++ b/src/axolotl/core/builders/causal.py
@@ -0,0 +1,489 @@
 """Builder for causal trainers"""
 import inspect
 import math
 import os
 from pathlib import Path
 from typing import Type, Union
 import transformers
 from transformers import (
    DataCollatorWithFlattening,
    EarlyStoppingCallback,
 )
 from trl.trainer.utils import RewardDataCollatorWithPadding
 from axolotl.core.builders.base import TrainerBuilderBase
 from axolotl.core.trainers import (
    AxolotlMambaTrainer,
    AxolotlPRMTrainer,
    AxolotlRewardTrainer,
    AxolotlTrainer,
    ReLoRATrainer,
 )
 from axolotl.integrations.base import PluginManager
 from axolotl.monkeypatch.multipack import SUPPORTED_MULTIPACK_MODEL_TYPES
 from axolotl.monkeypatch.relora import ReLoRACallback
 from axolotl.processing_strategies import get_processing_strategy
 from axolotl.utils import is_comet_available, is_mlflow_available
 from axolotl.utils.callbacks import (
    EvalFirstStepCallback,
    LossWatchDogCallback,
    SaveBetterTransformerModelCallback,
    bench_eval_callback_factory,
    causal_lm_bench_eval_callback_factory,
    colab_inference_post_train_callback,
    log_prediction_callback_factory,
 )
 from axolotl.utils.callbacks.lisa import lisa_callback_factory
 from axolotl.utils.callbacks.qat import QATCallback
 from axolotl.utils.chat_templates import get_chat_template_from_config
 from axolotl.utils.collators import (
    BatchSamplerDataCollatorForSeq2Seq,
    DataCollatorForSeq2Seq,
    MambaDataCollator,
    V2BatchSamplerDataCollatorForSeq2Seq,
 )
 from axolotl.utils.collators.mm_chat import MultiModalChatDataCollator
 from axolotl.utils.logging import get_logger
 LOG = get_logger(__name__)
 class HFCausalTrainerBuilder(TrainerBuilderBase):
    """
    Build the HuggingFace training args/trainer for causal models and reward modeling
    using TRL.
    """
    def get_callbacks(self):
        callbacks = super().get_callbacks()
        callbacks.append(EvalFirstStepCallback())
        if self.cfg.relora_steps:
            callbacks.append(ReLoRACallback(self.cfg))
        if (
            hasattr(self.model, "use_bettertransformer")
            and self.model.use_bettertransformer is True
        ):
            callbacks.append(SaveBetterTransformerModelCallback())
        # TODO: check if can move to base class
        if self.cfg.loss_watchdog_threshold is not None:
            callbacks.append(LossWatchDogCallback(self.cfg))
        if self.cfg.qat:
            callbacks.append(QATCallback(self.cfg.qat))
        return callbacks
    def get_post_trainer_create_callbacks(self, trainer):
        callbacks = []
        if self.cfg.use_wandb and self.cfg.eval_table_size > 0:
            LogPredictionCallback = log_prediction_callback_factory(
                trainer, self.tokenizer, "wandb"
            )
            callbacks.append(LogPredictionCallback(self.cfg))
        if (
            self.cfg.use_mlflow
            and is_mlflow_available()
            and self.cfg.eval_table_size > 0
        ):
            LogPredictionCallback = log_prediction_callback_factory(
                trainer, self.tokenizer, "mlflow"
            )
            callbacks.append(LogPredictionCallback(self.cfg))
        if self.cfg.use_comet and is_comet_available() and self.cfg.eval_table_size > 0:
            LogPredictionCallback = log_prediction_callback_factory(
                trainer, self.tokenizer, "comet_ml"
            )
            callbacks.append(LogPredictionCallback(self.cfg))
        if self.cfg.do_bench_eval:
            callbacks.append(bench_eval_callback_factory(trainer, self.tokenizer))
        if self.cfg.do_causal_lm_eval:
            CausalLMBenchEvalCallback = causal_lm_bench_eval_callback_factory(
                trainer, self.tokenizer
            )
            callbacks.append(CausalLMBenchEvalCallback(self.cfg))
        if self.cfg.early_stopping_patience:
            early_stop_cb = EarlyStoppingCallback(
                self.cfg.early_stopping_patience,
            )
            callbacks.append(early_stop_cb)
        if self.cfg.lisa_step_interval and self.cfg.lisa_n_layers:
            callbacks.append(lisa_callback_factory(trainer))
        if any("COLAB_" in key for key in os.environ):
            ColabCallback = colab_inference_post_train_callback(trainer)
            callbacks.append(ColabCallback(self.cfg))
        callbacks.extend(super().get_post_trainer_create_callbacks(trainer=trainer))
        return callbacks
    def _get_trainer_cls(self):
        """
        Gets the trainer class for the given configuration.
        """
        if self.cfg.plugins:
            plugin_manager = PluginManager.get_instance()
            trainer_cls = plugin_manager.get_trainer_cls(self.cfg)
            if trainer_cls:
                return trainer_cls
        if self.cfg.relora_steps:
            return ReLoRATrainer
        if self.cfg.model_config_type == "mamba":
            return AxolotlMambaTrainer
        if self.cfg.reward_model:
            return AxolotlRewardTrainer
        if self.cfg.process_reward_model:
            return AxolotlPRMTrainer
        return AxolotlTrainer
    def build(self, total_num_steps):
        from axolotl.core.training_args import (
            AxolotlPRMConfig,
            AxolotlRewardConfig,
            AxolotlTrainingArguments,
        )
        training_arguments_kwargs, trainer_kwargs = self._set_base_training_args(
            total_num_steps
        )
        if self.cfg.fsdp:
            training_arguments_kwargs["fsdp"] = self.cfg.fsdp
            if self.cfg.fsdp_config:
                training_arguments_kwargs["fsdp_config"] = {
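                    # removeprefix drops only the literal "fsdp_" prefix; lstrip would
                    # strip any leading run of the characters f, s, d, p and _
                    k.removeprefix("fsdp_"): v
                    for k, v in dict(self.cfg.fsdp_config).items()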
                }
        if self.cfg.adapter == "qlora":
            training_arguments_kwargs["qlora"] = True
        # deepspeed
        if self.cfg.deepspeed:
            training_arguments_kwargs["deepspeed"] = self.cfg.deepspeed
        if self.cfg.lr_quadratic_warmup is not None:
            training_arguments_kwargs["lr_quadratic_warmup"] = (
                self.cfg.lr_quadratic_warmup
            )
        if self.cfg.dataloader_drop_last is not None:
            training_arguments_kwargs["dataloader_drop_last"] = (
                self.cfg.dataloader_drop_last
            )
        elif self.cfg.sample_packing and self.cfg.eval_sample_packing is False:
            training_arguments_kwargs["dataloader_drop_last"] = True
        if self.cfg.remove_unused_columns is not None:
            training_arguments_kwargs["remove_unused_columns"] = (
                self.cfg.remove_unused_columns
            )
        if self.cfg.do_bench_eval:
            training_arguments_kwargs["do_bench_eval"] = self.cfg.do_bench_eval
            if self.cfg.bench_dataset:
                training_arguments_kwargs["bench_dataset"] = self.cfg.bench_dataset
        if self.cfg.do_causal_lm_eval:
            training_arguments_kwargs["do_causal_lm_eval"] = self.cfg.do_causal_lm_eval
        if self.cfg.metric_for_best_model:
            training_arguments_kwargs["metric_for_best_model"] = (
                self.cfg.metric_for_best_model
            )
        if self.cfg.greater_is_better:
            training_arguments_kwargs["greater_is_better"] = self.cfg.greater_is_better
        # DDP Config
        if self.cfg.ddp_timeout:
            training_arguments_kwargs["ddp_timeout"] = self.cfg.ddp_timeout
        # see https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html
        if self.cfg.ddp_bucket_cap_mb:
            training_arguments_kwargs["ddp_bucket_cap_mb"] = self.cfg.ddp_bucket_cap_mb
        if self.cfg.ddp_broadcast_buffers is not None:
            training_arguments_kwargs["ddp_broadcast_buffers"] = (
                self.cfg.ddp_broadcast_buffers
            )
        # these are all the "standard" kwargs that are definitely used
        training_arguments_kwargs["max_seq_length"] = self.cfg.sequence_len
        if self.cfg.auto_find_batch_size is not None:
            training_arguments_kwargs["auto_find_batch_size"] = (
                self.cfg.auto_find_batch_size
            )
        training_arguments_kwargs["eval_accumulation_steps"] = (
            self.cfg.gradient_accumulation_steps
        )
        training_arguments_kwargs["load_best_model_at_end"] = (
            (
                self.cfg.load_best_model_at_end is not False
                or self.cfg.early_stopping_patience
            )
            and (
                (not self.cfg.test_datasets and self.cfg.val_set_size > 0)
                or (self.cfg.test_datasets and self.cfg.val_set_size == 0)
            )
            and self.cfg.save_steps
            and self.cfg.eval_steps
            and self.cfg.save_steps % self.cfg.eval_steps == 0
        ) or False
        # handle ddp
        ddp_find_unused_parameters = None
        if self.cfg.ddp:
            ddp_find_unused_parameters = bool(self.cfg.ddp_find_unused_parameters)
        training_arguments_kwargs["ddp_find_unused_parameters"] = (
            ddp_find_unused_parameters
        )
        training_arguments_kwargs["group_by_length"] = self.cfg.group_by_length
        training_arguments_kwargs["curriculum_sampling"] = self.cfg.curriculum_sampling
        training_arguments_kwargs["sample_packing"] = bool(self.cfg.sample_packing)
        training_arguments_kwargs["multipack_real_batches"] = (
            self.cfg.multipack_real_batches
            if self.cfg.multipack_real_batches is not None
            else not self.cfg.flash_attention
        )
        training_arguments_kwargs["eval_sample_packing"] = bool(
            self.cfg.eval_sample_packing
        )
        if self.cfg.sample_packing_bin_size is not None:
            training_arguments_kwargs["sample_packing_bin_size"] = (
                self.cfg.sample_packing_bin_size
            )
        if self.cfg.sample_packing_group_size is not None:
            training_arguments_kwargs["sample_packing_group_size"] = (
                self.cfg.sample_packing_group_size
            )
        if self.cfg.sample_packing_eff_est:
            training_arguments_kwargs["sample_packing_efficiency"] = (
                self.cfg.sample_packing_eff_est
            )
        if self.cfg.relora_steps:
            training_arguments_kwargs["relora_steps"] = self.cfg.relora_steps
            training_arguments_kwargs["relora_warmup_steps"] = (
                self.cfg.relora_warmup_steps
            )
            if self.cfg.relora_anneal_steps:
                training_arguments_kwargs["relora_anneal_steps"] = (
                    self.cfg.relora_anneal_steps
                )
            if self.cfg.relora_prune_ratio:
                training_arguments_kwargs["relora_prune_ratio"] = (
                    self.cfg.relora_prune_ratio
                )
        if self.cfg.lisa_step_interval and self.cfg.lisa_n_layers:
            training_arguments_kwargs["lisa_n_layers"] = self.cfg.lisa_n_layers
            training_arguments_kwargs["lisa_step_interval"] = (
                self.cfg.lisa_step_interval
            )
            training_arguments_kwargs["lisa_layers_attribute"] = (
                self.cfg.lisa_layers_attribute
            )
        training_arguments_kwargs = self.hook_pre_create_training_args(
            training_arguments_kwargs
        )
        training_arguments_kwargs["model_type"] = self.cfg.model_config_type
        training_arguments_kwargs["pretraining"] = bool(self.cfg.pretraining_dataset)
        if self.cfg.chat_template:
            training_arguments_kwargs["chat_template"] = get_chat_template_from_config(
                cfg=self.cfg,
                tokenizer=self.tokenizer,
            )
        if self.cfg.neftune_noise_alpha is not None:
            training_arguments_kwargs["neftune_noise_alpha"] = (
                self.cfg.neftune_noise_alpha
            )
        if self.cfg.accelerator_config:
            training_arguments_kwargs["accelerator_config"] = (
                self.cfg.accelerator_config
            )
        if self.cfg.image_size:
            training_arguments_kwargs["image_size"] = self.cfg.image_size
        if self.cfg.image_resize_algorithm:
            training_arguments_kwargs["image_resize_algorithm"] = (
                self.cfg.image_resize_algorithm
            )
        if self.cfg.plugins:
            plugin_manager = PluginManager.get_instance()
            plugin_training_args = plugin_manager.get_training_args(self.cfg)
            if plugin_training_args:
                training_arguments_kwargs.update(plugin_training_args)
        if self.cfg.reward_model:
            training_args_cls = AxolotlRewardConfig
        elif self.cfg.process_reward_model:
            training_args_cls = AxolotlPRMConfig
        else:
            training_args_cls = AxolotlTrainingArguments
        training_args = training_args_cls(  # pylint: disable=unexpected-keyword-arg
            **training_arguments_kwargs,
        )
        training_args = self.hook_post_create_training_args(training_args)
        # unset run_name so wandb sets up experiment names
        if self.cfg.use_wandb and training_args.run_name == training_args.output_dir:
            training_args.run_name = (  # pylint: disable=attribute-defined-outside-init
                None
            )
        data_collator_kwargs = {
            "padding": True,  # True/"longest" is the default
        }
        multiple = 64
        if self.cfg.pad_to_sequence_len:
            data_collator_kwargs["pad_to_multiple_of"] = multiple * math.ceil(
                self.cfg.sequence_len / multiple
            )
        else:
            # A100 is best at 64, while others at 8. Let's use the larger so we don't have to check
            # https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html
            data_collator_kwargs["pad_to_multiple_of"] = multiple
        trainer_cls = self._get_trainer_cls()
        trainer_kwargs, trainer_cls = self.hook_pre_create_trainer(
            trainer_kwargs, trainer_cls
        )
        if eval_data_collator := self.build_collator(
            training_args, is_eval=True, **data_collator_kwargs
        ):
            if not (self.cfg.reward_model or self.cfg.process_reward_model):
                trainer_kwargs["eval_data_collator"] = eval_data_collator
        if not (self.cfg.reward_model or self.cfg.process_reward_model):
            trainer_kwargs["bench_data_collator"] = transformers.DataCollatorForSeq2Seq(
                self.tokenizer,
                return_tensors="pt",
                **data_collator_kwargs,
            )
        sig = inspect.signature(trainer_cls)
        if "processing_class" in sig.parameters:
            trainer_kwargs["processing_class"] = self.tokenizer
        elif "tokenizer" in sig.parameters:
            trainer_kwargs["tokenizer"] = self.tokenizer
        if (
            trainer_cls not in [AxolotlRewardTrainer, AxolotlPRMTrainer]
            and self.cfg.datasets is not None
        ):
            trainer_kwargs["dataset_tags"] = [
                d["path"] for d in self.cfg.datasets if not Path(d["path"]).is_dir()
            ]
        trainer = trainer_cls(
            model=self.model,
            train_dataset=self.train_dataset,
            eval_dataset=self.eval_dataset,
            args=training_args,
            data_collator=self.build_collator(training_args, **data_collator_kwargs),
            callbacks=self.get_callbacks(),
            **trainer_kwargs,
        )
        trainer = self.hook_post_create_trainer(trainer)
        for callback in self.get_post_trainer_create_callbacks(trainer):
            trainer.add_callback(callback)
        if self.cfg.deepspeed and self.cfg.sample_packing:
            trainer.accelerator.state.deepspeed_plugin.deepspeed_config[
                "train_micro_batch_size_per_gpu"
            ] = self.cfg.micro_batch_size
        return trainer
    def build_collator(
        self,
        training_args,  # type: "AxolotlTrainingArguments"  # type: ignore
        is_eval=False,
        **kwargs,
    ):
        if training_args.pretraining:
            if (
                self.cfg.pretraining_sample_concatenation is False
                or self.cfg.micro_batch_size > 1
            ):
                return DataCollatorForSeq2Seq(self.tokenizer, **kwargs)
            return None
        if self.cfg.model_config_type == "mamba":
            return MambaDataCollator(tokenizer=self.tokenizer)
        use_batch_sampler_collator = False
        if is_eval is False and training_args.sample_packing:
            use_batch_sampler_collator = True
        if is_eval and training_args.eval_sample_packing:
            use_batch_sampler_collator = True
        collator: Type[
            Union[
                V2BatchSamplerDataCollatorForSeq2Seq,
                BatchSamplerDataCollatorForSeq2Seq,
                DataCollatorForSeq2Seq,
                DataCollatorWithFlattening,
                RewardDataCollatorWithPadding,
            ]
        ]
        collator_args = [self.tokenizer]
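        # default to None so the check below is safe when no plugins are configured
        collator_cls_and_kwargs = None
        if self.cfg.plugins: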
            plugin_manager = PluginManager.get_instance()
            collator_cls_and_kwargs = plugin_manager.get_collator_cls_and_kwargs(
                self.cfg, is_eval=is_eval
            )
        if collator_cls_and_kwargs:
            collator = collator_cls_and_kwargs[0]
            if kwargs and isinstance(kwargs, dict):
                kwargs.update(collator_cls_and_kwargs[1])
        elif self.cfg.reward_model:
            collator = RewardDataCollatorWithPadding
        elif use_batch_sampler_collator:
            # Use V2BatchSamplerDataCollatorForSeq2Seq for flex attention,
            # supported multipack models, or non-flash-attention llama
            if (
                self.cfg.flex_attention
                or self.cfg.model_config_type in SUPPORTED_MULTIPACK_MODEL_TYPES
                or (
                    self.cfg.model_config_type in ["llama"]
                    and self.cfg.flash_attention is not True
                )
            ):
                collator = V2BatchSamplerDataCollatorForSeq2Seq
            else:
                collator = BatchSamplerDataCollatorForSeq2Seq
        else:
            if self.cfg.processor_type and self.processor:
                collator = MultiModalChatDataCollator
                kwargs["processing_strategy"] = get_processing_strategy(
                    self.processor,
                    training_args.chat_template,
                    self.cfg.chat_template,
                    image_size=training_args.image_size,
                    image_resize_algorithm=training_args.image_resize_algorithm,
                )
            elif self.cfg.batch_flattening:
                collator = DataCollatorWithFlattening
                collator_args.pop(0)
                kwargs.pop("pad_to_multiple_of", None)
                kwargs.pop("padding", None)
            else:
                collator = DataCollatorForSeq2Seq
        kwargs["return_tensors"] = "pt"
        return collator(
            *collator_args,
            **kwargs,
        )
--- a/src/axolotl/core/builders/rl.py
+++ b/src/axolotl/core/builders/rl.py
@@ -0,0 +1,254 @@
 """Builder for RLHF trainers"""
 import inspect
 from pathlib import Path
 from axolotl.core.builders.base import TrainerBuilderBase
 from axolotl.core.trainers import (
    AxolotlCPOTrainer,
    AxolotlKTOTrainer,
    AxolotlORPOTrainer,
 )
 from axolotl.core.trainers.dpo import DPOStrategy
 from axolotl.core.trainers.dpo.args import AxolotlDPOConfig
 from axolotl.core.trainers.grpo import GRPOStrategy
 from axolotl.integrations.base import PluginManager
 from axolotl.loaders.utils import ensure_dtype
 from axolotl.utils.logging import get_logger
 from axolotl.utils.schemas.enums import RLType
 LOG = get_logger(__name__)
 class HFRLTrainerBuilder(TrainerBuilderBase):
    """Trainer factory class for TRL-based RLHF trainers (e.g. DPO)"""
    def get_callbacks(self):
        callbacks = super().get_callbacks()
        return callbacks
    def get_post_trainer_create_callbacks(self, trainer):
        callbacks = super().get_post_trainer_create_callbacks(trainer=trainer)
        return callbacks
    def _get_trainer_cls(self, trainer_kwargs: dict):
        """
        Returns trainer_cls and trainer_cls_args
        """
        if self.cfg.plugins:
            plugin_manager = PluginManager.get_instance()
            trainer_cls = plugin_manager.get_trainer_cls(self.cfg)
            trainer_cls_args = []  # type: ignore
            if trainer_cls is not None:
                return trainer_cls, trainer_cls_args
        trainer_cls = None
        trainer_cls_args = [self.model]
        if self.cfg.rl is RLType.GRPO:
            trainer_cls = GRPOStrategy.get_trainer_class(
                sequence_parallel=self.cfg.sequence_parallel_degree > 1
            )
            trainer_cls_args.extend(GRPOStrategy.set_trainer_args(self.cfg))
            trainer_kwargs.update(GRPOStrategy.set_trainer_kwargs(self.cfg))
        elif self.cfg.rl in [RLType.DPO, RLType.IPO]:
            trainer_cls = DPOStrategy.get_trainer_class()
            trainer_cls_args.append(self.model_ref)
        elif self.cfg.rl is RLType.ORPO:
            trainer_cls = AxolotlORPOTrainer
        elif self.cfg.rl is RLType.KTO:
            trainer_cls = AxolotlKTOTrainer
        elif self.cfg.rl is RLType.SIMPO:
            trainer_cls = AxolotlCPOTrainer
        else:
            raise ValueError(f"Unsupported RL: {self.cfg.rl}")
        return trainer_cls, trainer_cls_args
    def _build_training_arguments(self, total_num_steps):
        """
        Returns training_args and trainer_kwargs
        """
        from axolotl.core.training_args import (
            AxolotlCPOConfig,
            AxolotlKTOConfig,
            AxolotlORPOConfig,
        )
        training_args_kwargs, trainer_kwargs = self._set_base_training_args(
            total_num_steps=total_num_steps
        )
        if self.cfg.remove_unused_columns is not None:
            training_args_kwargs["remove_unused_columns"] = (
                self.cfg.remove_unused_columns
            )
        else:
            training_args_kwargs["remove_unused_columns"] = False
        # only rlhf
        if self.cfg.dataset_processes:
            training_args_kwargs["dataset_num_proc"] = self.cfg.dataset_processes
        if self.cfg.trl and self.cfg.trl.beta is not None:
            training_args_kwargs["beta"] = self.cfg.trl.beta
        elif self.cfg.rl_beta is not None:
            training_args_kwargs["beta"] = self.cfg.rl_beta
        elif self.cfg.orpo_alpha is not None:
            # TRL reuses its `beta` argument for ORPO's alpha, so pass orpo_alpha as beta
            training_args_kwargs["beta"] = self.cfg.orpo_alpha
        if self.cfg.rpo_alpha is not None:
            training_args_kwargs["rpo_alpha"] = self.cfg.rpo_alpha
        if self.cfg.use_wandb:
            training_args_kwargs["run_name"] = self.cfg.wandb_name
        training_args_cls = None
        blocklist_args_kwargs = []
        if self.cfg.rl is RLType.SIMPO:
            training_args_cls = AxolotlCPOConfig
            training_args_kwargs["loss_type"] = "simpo"
            training_args_kwargs["simpo_gamma"] = self.cfg.simpo_gamma
            if self.cfg.cpo_alpha is not None:
                training_args_kwargs["cpo_alpha"] = self.cfg.cpo_alpha
        elif self.cfg.rl is RLType.ORPO:
            training_args_cls = AxolotlORPOConfig
            if self.cfg.max_prompt_len:
                training_args_kwargs["max_prompt_length"] = self.cfg.max_prompt_len
        elif self.cfg.rl is RLType.KTO:
            training_args_cls = AxolotlKTOConfig
            training_args_kwargs["desirable_weight"] = (
                self.cfg.kto_desirable_weight or 1.0
            )
            training_args_kwargs["undesirable_weight"] = (
                self.cfg.kto_undesirable_weight or 1.0
            )
            if self.cfg.max_prompt_len:
                training_args_kwargs["max_prompt_length"] = self.cfg.max_prompt_len
        elif self.cfg.rl is RLType.GRPO:
            training_args_cls = GRPOStrategy.get_training_args_class()
            training_args_kwargs.update(GRPOStrategy.set_training_args_kwargs(self.cfg))
            blocklist_args_kwargs = GRPOStrategy.get_blocklist_args_kwargs()
        elif self.cfg.rl in [RLType.DPO, RLType.IPO]:
            training_args_cls = AxolotlDPOConfig
            if self.cfg.rl is RLType.IPO:
                training_args_kwargs["loss_type"] = "ipo"
            # Not compatible with IPO
            if self.cfg.rl is RLType.DPO and self.cfg.dpo_label_smoothing:
                training_args_kwargs["label_smoothing"] = self.cfg.dpo_label_smoothing
            training_args_kwargs["max_completion_length"] = None
            training_args_kwargs["max_prompt_length"] = self.cfg.sequence_len
            training_args_kwargs["generate_during_eval"] = self.cfg.use_wandb
            if self.cfg.dpo_use_weighting is not None:
                training_args_kwargs["use_weighting"] = self.cfg.dpo_use_weighting
            if self.cfg.dpo_use_logits_to_keep is not None:
                training_args_kwargs["use_logits_to_keep"] = (
                    self.cfg.dpo_use_logits_to_keep
                )
        else:
            raise ValueError(f"Unsupported RL: {self.cfg.rl}")
        for blocklist_key in blocklist_args_kwargs:
            if blocklist_key in training_args_kwargs:
                del training_args_kwargs[blocklist_key]
        if self.cfg.plugins:
            plugin_manager = PluginManager.get_instance()
            plugin_training_args = plugin_manager.get_training_args(self.cfg)
            if plugin_training_args:
                training_args_kwargs.update(plugin_training_args)
        training_args = training_args_cls(  # pylint: disable=unexpected-keyword-arg
            logging_first_step=True,
            **training_args_kwargs,
        )
        # unset run_name so wandb sets up experiment names
        if self.cfg.use_wandb and training_args.run_name == training_args.output_dir:
            training_args.run_name = (  # pylint: disable=attribute-defined-outside-init
                None
            )
        return training_args, trainer_kwargs
    def build(self, total_num_steps):
        training_args, trainer_kwargs = self._build_training_arguments(total_num_steps)
        if self.eval_dataset:
            trainer_kwargs["eval_dataset"] = self.eval_dataset
        if self.cfg.adapter and self.peft_config and self.cfg.rl is not RLType.GRPO:
            trainer_kwargs["peft_config"] = self.peft_config
        if self.cfg.precompute_ref_log_probs is not None:
            trainer_kwargs["precompute_ref_log_probs"] = (
                self.cfg.precompute_ref_log_probs
            )
        trainer_cls, trainer_cls_args = self._get_trainer_cls(trainer_kwargs)
        sig = inspect.signature(trainer_cls)
        if "tokenizer" in sig.parameters:
            trainer_kwargs["tokenizer"] = self.tokenizer
        else:
            trainer_kwargs["processing_class"] = self.tokenizer
        if self.cfg.datasets is not None and (
            trainer_cls is DPOStrategy.get_trainer_class()
        ):
            trainer_kwargs["dataset_tags"] = [
                d["path"] for d in self.cfg.datasets if not Path(d["path"]).is_dir()
            ]
        trainer_kwargs, trainer_cls = self.hook_pre_create_trainer(
            trainer_kwargs, trainer_cls
        )
        trainer = trainer_cls(
            *trainer_cls_args,
            args=training_args,
            train_dataset=self.train_dataset,
            callbacks=self.get_callbacks(),
            **trainer_kwargs,
        )
        if self.cfg.fsdp:
            ensure_dtype(trainer.model, dtype=self.cfg.torch_dtype)
            if self.cfg.rl in [RLType.DPO, RLType.IPO] and trainer.ref_model:
                ensure_dtype(trainer.ref_model, dtype=self.cfg.torch_dtype)
        trainer = self.hook_post_create_trainer(trainer)
        for callback in self.get_post_trainer_create_callbacks(trainer):
            trainer.add_callback(callback)
        return trainer
 class HFPPOTrainerBuilder(TrainerBuilderBase):
    """
    HF Factory class for PPO Trainer
    """
    def get_callbacks(self):
        callbacks = super().get_callbacks()
        return callbacks
    def get_post_trainer_create_callbacks(self, trainer):
        callbacks = super().get_post_trainer_create_callbacks(trainer=trainer)
        return callbacks
    def build(self, total_num_steps):
        # TODO: build PPOConfig
        raise NotImplementedError("PPO trainer builder is not implemented yet.")
--- a/src/axolotl/core/chat/messages.py
+++ b/src/axolotl/core/chat/messages.py
@@ -156,7 +156,6 @@ class Messages(BaseModel):
                        len(input_ids) : len(input_ids) + len(pending_input_ids)
                    ]
                    if new_pending_inputs != pending_input_ids:
                        # logging.warning("tokenization mismatch from concatenation.")
                        pending_input_ids = new_pending_inputs
                    input_ids.extend(pending_input_ids)
                    if pending_weight:
--- a/src/axolotl/core/trainer_builder.py
+++ b/src/axolotl/core/trainer_builder.py
--- a/src/axolotl/core/trainers/base.py
+++ b/src/axolotl/core/trainers/base.py
@@ -4,11 +4,10 @@
 from __future__ import annotations
 import logging
 import os
 from collections import defaultdict
-from functools import wraps
+from functools import partial, wraps
-from typing import Literal
+from typing import Callable, Literal, Optional
 import datasets
 import torch
@@ -34,9 +33,11 @@ from axolotl.core.trainers.utils import (
    sanitize_kwargs_for_ds_tagging,
    sanitize_kwargs_for_tagging,
 )
 from axolotl.utils import get_not_null
 from axolotl.utils.logging import get_logger
 from axolotl.utils.samplers import MultipackBatchSampler, get_dataset_lengths
-LOG = logging.getLogger(__name__)
+LOG = get_logger(__name__)
 class AxolotlTrainer(SchedulerMixin, OptimizerMixin, RngLoaderMixin, Trainer):
@@ -101,7 +102,7 @@ class AxolotlTrainer(SchedulerMixin, OptimizerMixin, RngLoaderMixin, Trainer):
            )
            batch_max_len = train_batch_size * self.args.max_seq_length
-        return MultipackBatchSampler(
+        sampler = MultipackBatchSampler(
            base_sampler,
            lengths=get_dataset_lengths(dataset),
            packing_efficiency_estimate=self.args.sample_packing_efficiency,
@@ -113,7 +114,12 @@ class AxolotlTrainer(SchedulerMixin, OptimizerMixin, RngLoaderMixin, Trainer):
            drop_last=True,
        )
-    def _get_train_sampler(self) -> Sampler | None:
+        len(sampler)
        return sampler
    def _get_train_sampler(
        self, train_dataset: Optional[Dataset] = None
    ) -> Optional[Sampler]:
        """
        Helper method to get the sampler for training. Handles cases for sample packing
        and curriculum sampling (sequential).
@@ -137,7 +143,7 @@ class AxolotlTrainer(SchedulerMixin, OptimizerMixin, RngLoaderMixin, Trainer):
        if use_sample_packing:
            return self._create_multipack_sampler(
                base_sampler=base_sampler,
-                dataset=self.train_dataset,
+                dataset=train_dataset,
            )
        return base_sampler
@@ -150,8 +156,6 @@ class AxolotlTrainer(SchedulerMixin, OptimizerMixin, RngLoaderMixin, Trainer):
            If the dataset is non-empty, a sampler is returned, the type of which
                depends on the passed training args.
        """
        eval_dataset = eval_dataset if eval_dataset is not None else self.eval_dataset
        # Multipacking enabled if training is enabled and eval is not explicitly disabled
        use_multipack = (
            self.args.sample_packing and self.args.eval_sample_packing is not False
@@ -172,125 +176,93 @@ class AxolotlTrainer(SchedulerMixin, OptimizerMixin, RngLoaderMixin, Trainer):
        return base_sampler
-    def _create_dataloader_params(self, is_eval=False, custom_batch_size=None):
+    def _get_dataloader(
-        """Create common dataloader parameters for train or eval."""
+        self,
-        batch_size = custom_batch_size or (
+        dataset: Dataset,
-            self.args.eval_batch_size if is_eval else self._train_batch_size
+        description: str,
-        )
+        batch_size: int,
        sampler_fn: Optional[Callable[[Dataset], torch.utils.data.Sampler]] = None,
        is_training: bool = False,
        dataloader_key: Optional[str] = None,
    ) -> DataLoader:
        """Create a [`~torch.utils.data.DataLoader`] from the given dataset."""
-        params = {
+        data_collator = self.data_collator if is_training else self.eval_data_collator
        if dataset.column_names and "length" in dataset.column_names:
            dataset = dataset.remove_columns(["length"])
        if isinstance(dataset, datasets.Dataset):
            if is_training:
                if not self.args.sample_packing or self.args.pretraining:
                    dataset = self._remove_unused_columns(
                        dataset, description="training"
                    )
            elif (
                not is_training
                and self.args.sample_packing
                and self.args.eval_sample_packing is not False
            ):
                batch_size = (
                    batch_size
                    if self.args.sample_packing
                    else self.args.per_device_eval_batch_size
                )
            else:
                dataset = self._remove_unused_columns(dataset, description=description)
        else:
            data_collator = self._get_collator_with_removed_columns(
                self.data_collator, description=description
            )
        dataloader_params = {
            "batch_size": batch_size,
-            "collate_fn": self.data_collator,
+            "collate_fn": data_collator,
            "num_workers": self.args.dataloader_num_workers,
            "pin_memory": self.args.dataloader_pin_memory,
            "persistent_workers": self.args.dataloader_persistent_workers,
        }
        # Add persistent workers only for training
        if not is_eval and hasattr(self.args, "dataloader_persistent_workers"):
            params["persistent_workers"] = self.args.dataloader_persistent_workers
        # Add prefetch factor if specified
        if self.args.dataloader_prefetch_factor:
            params["prefetch_factor"] = self.args.dataloader_prefetch_factor
        return params
    def _prepare_dataloader(
        self, dataset, sampler, is_eval=False, custom_batch_size=None
    ):
        """Prepare a dataloader with the given dataset and sampler."""
        # Get base parameters
        dataloader_params = self._create_dataloader_params(is_eval, custom_batch_size)
        # Add sampler configuration
        if not isinstance(dataset, torch.utils.data.IterableDataset):
-            if isinstance(sampler, BatchSampler):
-                # batch_size and batch_sampler are mutually exclusive
-                dataloader_params["batch_sampler"] = sampler
-                del dataloader_params["batch_size"]
-            else:
-                dataloader_params["sampler"] = sampler
-                dataloader_params["drop_last"] = self.args.dataloader_drop_last
-
-            if not is_eval:
-                dataloader_params["worker_init_fn"] = seed_worker
-
-        # Create the dataloader
+            dataloader_params["drop_last"] = get_not_null(
+                self.args.dataloader_drop_last, True
+            )
+            if sampler_fn is not None:
+                sampler = sampler_fn(dataset)
+                if isinstance(sampler, BatchSampler):
+                    # batch_size and batch_sampler are mutually exclusive
+                    dataloader_params["batch_sampler"] = sampler
+                    del dataloader_params["batch_size"]
+                    del dataloader_params["drop_last"]
+                else:
+                    dataloader_params["sampler"] = sampler
        dataloader = DataLoader(dataset, **dataloader_params)
            dataloader_params["prefetch_factor"] = self.args.dataloader_prefetch_factor
            if is_training:
                dataloader_params["worker_init_fn"] = partial(
                    seed_worker,
                    num_workers=self.args.dataloader_num_workers,
                    rank=self.args.process_index,
                )
        if self.args.sample_packing and (
-            (not is_eval and not self.args.pretraining)
+            (is_training and not self.args.pretraining)
-            or (is_eval and self.args.eval_sample_packing is not False)
+            or (not is_training and self.args.eval_sample_packing is not False)
        ):
            self.accelerator.even_batches = False
-        return self.accelerator.prepare_data_loader(dataloader)
-    def get_train_dataloader(self) -> DataLoader:
-        """Get dataloader for training"""
-        train_dataset = self.train_dataset
-        data_collator = self.data_collator  # type: ignore
+        dataloader = DataLoader(dataset, **dataloader_params)
+        # Accelerator.free_memory() will destroy the references, so
+        # we need to store the non-prepared version for eval dataloaders.
+        # fmt: off
+        if dataloader_key is not None and self.args.dataloader_persistent_workers:
            if hasattr(self, "_eval_dataloaders"):
                self._eval_dataloaders[dataloader_key] = dataloader  # type: ignore  # pylint: disable=access-member-before-definition
            else:
                self._eval_dataloaders = {dataloader_key: dataloader}  # pylint: disable=attribute-defined-outside-init
        # fmt: on
-        # Handle dataset preprocessing
+        return self.accelerator.prepare(dataloader)
        if isinstance(train_dataset, datasets.Dataset):
            if self.args.sample_packing and not self.args.pretraining:
                train_dataset = train_dataset.remove_columns(["length"])
            if not self.args.sample_packing or self.args.pretraining:
                train_dataset = self._remove_unused_columns(
                    train_dataset, description="training"
                )
        else:
            self.data_collator = self._get_collator_with_removed_columns(  # pylint: disable=attribute-defined-outside-init
                data_collator,
                description="training",
            )
        # Get sampler and create dataloader
        sampler = self._get_train_sampler()
        return self._prepare_dataloader(train_dataset, sampler, is_eval=False)
    def get_eval_dataloader(self, eval_dataset: Dataset | None = None) -> DataLoader:
        """Get dataloader for evaluation"""
        eval_dataset = eval_dataset if eval_dataset is not None else self.eval_dataset
        # Handle special case: sample packing is enabled but eval_sample_packing is False
        if self.args.sample_packing and self.args.eval_sample_packing is False:
            self.data_collator = (  # pylint: disable=attribute-defined-outside-init
                self.eval_data_collator
            )
            if "length" in eval_dataset.column_names:
                eval_dataset = eval_dataset.remove_columns(["length"])
            dataloader = super().get_eval_dataloader(eval_dataset)
            self.data_collator = (  # pylint: disable=attribute-defined-outside-init
                self.train_data_collator
            )
            return dataloader
        if self.args.sample_packing and self.args.eval_sample_packing is not False:
            # Get appropriate data collator
            self.data_collator = (  # pylint: disable=attribute-defined-outside-init
                self.eval_data_collator
                if hasattr(self, "eval_data_collator") and self.eval_data_collator
                else self.data_collator
            )
            if "length" in eval_dataset.column_names:
                eval_dataset = eval_dataset.remove_columns(["length"])
            # Use eval_batch_size for sample packing, per_device_eval_batch_size otherwise
            batch_size = (
                self.args.eval_batch_size
                if self.args.sample_packing
                else self.args.per_device_eval_batch_size
            )
            sampler = self._get_eval_sampler(eval_dataset)
            dataloader = self._prepare_dataloader(
                eval_dataset, sampler, is_eval=True, custom_batch_size=batch_size
            )
            return dataloader
        return super().get_eval_dataloader(eval_dataset)
    def _get_bench_sampler(
        self, bench_dataset: Dataset
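The dataloader parameter handling in the hunk above can be sketched in isolation. This is a minimal illustration and not the trainer's actual code: `build_dataloader_params` and the stand-in `BatchSampler` class are hypothetical names, and the fallback behaviour of `get_not_null` (use the default only when the value is `None`) is an assumption inferred from how it is called in the diff. The point it shows is that `drop_last` now defaults to `True`, and that both `batch_size` and `drop_last` are removed when a batch sampler is supplied, since PyTorch's `DataLoader` treats `batch_sampler` as mutually exclusive with them.

```python
from typing import Any, Callable, Optional


class BatchSampler:
    """Hypothetical stand-in for torch.utils.data.BatchSampler."""

    def __init__(self, batches):
        self.batches = batches

    def __iter__(self):
        return iter(self.batches)


def get_not_null(value: Any, default: Any) -> Any:
    # Assumed semantics: fall back to `default` only when `value` is None.
    return default if value is None else value


def build_dataloader_params(
    dataset,
    batch_size: int,
    dataloader_drop_last: Optional[bool],
    sampler_fn: Optional[Callable] = None,
) -> dict:
    params = {
        "batch_size": batch_size,
        # Unset (None) now means "drop the last incomplete batch".
        "drop_last": get_not_null(dataloader_drop_last, True),
    }
    if sampler_fn is not None:
        sampler = sampler_fn(dataset)
        if isinstance(sampler, BatchSampler):
            # A batch sampler already decides batch boundaries, so the
            # per-batch options must not be passed alongside it.
            params["batch_sampler"] = sampler
            del params["batch_size"]
            del params["drop_last"]
        else:
            params["sampler"] = sampler
    return params


packed = build_dataloader_params(
    dataset=list(range(6)),
    batch_size=2,
    dataloader_drop_last=None,
    sampler_fn=lambda ds: BatchSampler([[0, 1, 2], [3, 4, 5]]),
)
plain = build_dataloader_params(list(range(6)), 2, None)
explicit = build_dataloader_params(list(range(6)), 2, False)
```

Here `packed` carries only `batch_sampler`, `plain` falls back to `drop_last=True`, and `explicit` keeps the caller's `False`.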
--- a/src/axolotl/core/trainers/dpo/trainer.py
+++ b/src/axolotl/core/trainers/dpo/trainer.py
@@ -5,65 +5,31 @@ from functools import wraps
 from typing import Any, Dict, Union
 import torch
 from peft.optimizers import create_loraplus_optimizer
 from torch import nn
 from transformers import Trainer
 from transformers.utils import is_sagemaker_mp_enabled
 from trl import DPOTrainer
 from axolotl.core.trainers.mixins import RngLoaderMixin, SchedulerMixin
 from axolotl.core.trainers.mixins.optimizer import OptimizerInitMixin, OptimizerMixin
 from axolotl.core.trainers.utils import (
    sanitize_kwargs_for_ds_tagging,
    sanitize_kwargs_for_tagging,
 )
 if is_sagemaker_mp_enabled():
    import smdistributed.modelparallel.torch as smp
-
-class AxolotlDPOTrainer(RngLoaderMixin, SchedulerMixin, DPOTrainer):
+class AxolotlDPOTrainer(
+    RngLoaderMixin, SchedulerMixin, OptimizerMixin, OptimizerInitMixin, DPOTrainer
 ):
    """Extend the base DPOTrainer for axolotl helpers."""
    tag_names = ["axolotl", "dpo"]
    def __init__(self, *args, dataset_tags=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.dataset_tags = dataset_tags
        self.optimizer = None
        self.model_accepts_loss_kwargs = False
    def create_optimizer(self):
        # pylint: disable=duplicate-code
        if self.args.loraplus_lr_ratio is None:
            return super().create_optimizer()
        opt_model = self.model_wrapped if is_sagemaker_mp_enabled() else self.model
        if self.optimizer is None:  # pylint: disable=access-member-before-definition
            optimizer_cls, optimizer_kwargs = Trainer.get_optimizer_cls_and_kwargs(
                self.args,
                opt_model,
            )
            loraplus_lr_ratio = getattr(self.args, "loraplus_lr_ratio", None)
            if loraplus_lr_ratio:
                print("Using lora+")
            loraplus_lr_embedding = getattr(self.args, "loraplus_lr_embedding", None)
            # pylint: disable=duplicate-code
            self.optimizer = create_loraplus_optimizer(  # pylint: disable=attribute-defined-outside-init
                opt_model,
                optimizer_cls,
                loraplus_lr_ratio=loraplus_lr_ratio,
                loraplus_lr_embedding=loraplus_lr_embedding,
                **optimizer_kwargs,
            )
        if is_sagemaker_mp_enabled():
            self.optimizer = smp.DistributedOptimizer(  # pylint: disable=attribute-defined-outside-init
                self.optimizer
            )
        return self.optimizer
    @wraps(DPOTrainer.push_to_hub)
    def push_to_hub(self, *args, **kwargs) -> str:
        """
--- a/src/axolotl/core/trainers/grpo/init.py
+++ b/src/axolotl/core/trainers/grpo/init.py
@@ -2,7 +2,6 @@
 import importlib
 import inspect
 import logging
 from typing import Any
 from trl.trainer.grpo_trainer import RewardFunc
@@ -13,9 +12,10 @@ from axolotl.core.trainers.grpo.trainer import (
    AxolotlGRPOTrainer,
 )
 from axolotl.utils.dict import DictDefault
 from axolotl.utils.logging import get_logger
 from axolotl.utils.schemas.trl import TRLConfig
-LOG = logging.getLogger(__name__)
+LOG = get_logger(__name__)
 class GRPOStrategy:
@@ -69,6 +69,9 @@ class GRPOStrategy:
        grpo_args_kwargs["log_completions"] = trl.log_completions
        grpo_args_kwargs["num_completions_to_print"] = trl.num_completions_to_print
        if cfg.sequence_parallel_degree > 1:
            grpo_args_kwargs["sequence_parallel_degree"] = cfg.sequence_parallel_degree
        if trl.reward_weights:
            grpo_args_kwargs["reward_weights"] = trl.reward_weights
@@ -106,7 +109,9 @@ class GRPOStrategy:
        return grpo_args_kwargs
    @classmethod
-    def set_trainer_args(cls, cfg: DictDefault) -> list[Any]:
+    def set_trainer_args(
        cls, cfg: DictDefault
    ) -> list[Any]:  # pylint: disable=unused-argument
        trainer_args = []
        if cfg.trl and cfg.trl.reward_funcs:
            reward_funcs = []
@@ -123,6 +128,7 @@ class GRPOStrategy:
            trainer_kwargs["reward_processing_classes"] = (
                cfg.trl.reward_processing_classes
            )
        return trainer_kwargs
    @classmethod
@@ -132,7 +138,7 @@ class GRPOStrategy:
    @classmethod
    def get_blocklist_args_kwargs(cls) -> list[str]:
-        return ["dataset_num_proc"]
+        return ["dataset_num_proc", "max_length"]
    @classmethod
    def get_reward_func(cls, reward_func_fqn: str) -> RewardFunc:
@@ -167,4 +173,4 @@ class GRPOStrategy:
            LOG.info(
                f"Reward function {reward_func_fqn} is a pre-trained model path - if this is unexpected, please check the reward function path."
            )
-            return reward_func
+            return reward_func_fqn
--- a/src/axolotl/core/trainers/grpo/args.py
+++ b/src/axolotl/core/trainers/grpo/args.py
@@ -12,3 +12,5 @@ from axolotl.core.training_args import AxolotlTrainingMixins
@dataclass
 class AxolotlGRPOConfig(AxolotlTrainingMixins, GRPOConfig):
    """Axolotl GRPO Config for GRPO training"""
    sequence_parallel_degree: int | None = None
--- a/src/axolotl/core/trainers/grpo/trainer.py
+++ b/src/axolotl/core/trainers/grpo/trainer.py
@@ -43,6 +43,7 @@ from trl.trainer.utils import pad
 from axolotl.core.trainers.grpo.sampler import SequenceParallelRepeatRandomSampler
 from axolotl.core.trainers.mixins import RngLoaderMixin, SchedulerMixin
 from axolotl.core.trainers.mixins.optimizer import OptimizerInitMixin, OptimizerMixin
 from axolotl.monkeypatch.ring_attn import get_ring_attn_group
 if is_peft_available():
@@ -50,7 +51,9 @@ if is_peft_available():
    from peft import PeftConfig
-class AxolotlGRPOTrainer(RngLoaderMixin, SchedulerMixin, GRPOTrainer):
+class AxolotlGRPOTrainer(
    RngLoaderMixin, SchedulerMixin, OptimizerMixin, OptimizerInitMixin, GRPOTrainer
 ):
    """Extend the base GRPOTrainer for axolotl helpers"""
    _tag_names = ["trl", "grpo", "axolotl"]
@@ -77,6 +80,7 @@ class AxolotlGRPOSequenceParallelTrainer(AxolotlGRPOTrainer):
            torch.optim.Optimizer | None, torch.optim.lr_scheduler.LambdaLR | None
        ] = (None, None),
        peft_config: "PeftConfig | None" = None,
        optimizer_cls_and_kwargs: tuple[type, dict] | None = None,
    ):
        # First call the superclass constructor with all arguments
        super().__init__(
@@ -90,6 +94,7 @@ class AxolotlGRPOSequenceParallelTrainer(AxolotlGRPOTrainer):
            callbacks=callbacks,
            optimizers=optimizers,
            peft_config=peft_config,
            optimizer_cls_and_kwargs=optimizer_cls_and_kwargs,
        )
        # Get number of SP groups (number of processes divided by SP degree)
@@ -131,6 +136,13 @@ class AxolotlGRPOSequenceParallelTrainer(AxolotlGRPOTrainer):
                    f"the valid values for the number of generations are: {possible_values}."
                )
        self.sp_group = None
        self.rank = dist.get_rank()
        self.world_size = dist.get_world_size()
        self.local_rank = 0
        self.local_world_size = 1
    def train(self, *args, **kwargs):
        # Initialize the SP group
        self.sp_group = get_ring_attn_group()
        self.rank = dist.get_rank()
@@ -138,6 +150,8 @@ class AxolotlGRPOSequenceParallelTrainer(AxolotlGRPOTrainer):
        self.local_rank = dist.get_rank(group=self.sp_group)
        self.local_world_size = dist.get_world_size(group=self.sp_group)
        return super().train(*args, **kwargs)
    def _get_train_sampler(self) -> Sampler:
        effective_batch_size = (
            self.args.per_device_train_batch_size
--- a/src/axolotl/core/trainers/mixins/optimizer.py
+++ b/src/axolotl/core/trainers/mixins/optimizer.py
@@ -1,18 +1,17 @@
 """Module for Axolotl trainer optimizer mixin"""
 import logging
 from peft.optimizers import create_loraplus_optimizer
 from torch import nn
 from transformers.trainer import Trainer
 from transformers.utils import is_sagemaker_mp_enabled
 from axolotl.integrations.base import BaseOptimizerFactory
 from axolotl.utils.logging import get_logger
 if is_sagemaker_mp_enabled():
    import smdistributed.modelparallel.torch as smp
-LOG = logging.getLogger(__name__)
+LOG = get_logger(__name__)
 class OptimizerMixin(Trainer):
@@ -199,3 +198,20 @@ class OptimizerMixin(Trainer):
            )
        return self.optimizer
 class OptimizerInitMixin:
    """
    Mixin to handle common optimizer initialization logic for Trainers (mostly TRL) that do not
    accept optimizer_cls_and_kwargs as kwarg in constructor.
    """
    def __init__(self, *args, **kwargs):
        optimizer_cls_and_kwargs = kwargs.pop("optimizer_cls_and_kwargs", None)
        super().__init__(*args, **kwargs)
        if (
            optimizer_cls_and_kwargs
            and self.optimizer_cls_and_kwargs is None
            and self.optimizer is None
        ):
            self.optimizer_cls_and_kwargs = optimizer_cls_and_kwargs
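The keyword-popping pattern used by `OptimizerInitMixin` can be tried without TRL installed. In the sketch below, `BaseTrainer`, `DemoTrainer`, and `FakeOptimizer` are hypothetical stand-ins (assumptions for illustration only); the mixin body mirrors the lines added in the diff above. It shows why the mixin must precede the base trainer in the class bases: the mixin's `__init__` runs first, strips the keyword the base constructor would reject, and restores it afterwards.

```python
class BaseTrainer:
    """Hypothetical stand-in for a trainer whose constructor does not
    accept optimizer_cls_and_kwargs."""

    def __init__(self, model=None):
        self.model = model
        self.optimizer = None
        self.optimizer_cls_and_kwargs = None


class OptimizerInitMixin:
    def __init__(self, *args, **kwargs):
        # Remove the kwarg before the base constructor sees it.
        optimizer_cls_and_kwargs = kwargs.pop("optimizer_cls_and_kwargs", None)
        super().__init__(*args, **kwargs)
        # Only fill it in if the base class left both slots empty.
        if (
            optimizer_cls_and_kwargs
            and self.optimizer_cls_and_kwargs is None
            and self.optimizer is None
        ):
            self.optimizer_cls_and_kwargs = optimizer_cls_and_kwargs


class DemoTrainer(OptimizerInitMixin, BaseTrainer):
    """Mixin listed first so its __init__ wraps BaseTrainer.__init__."""


class FakeOptimizer:
    """Hypothetical optimizer class used only as a placeholder."""


trainer = DemoTrainer(
    model="m", optimizer_cls_and_kwargs=(FakeOptimizer, {"lr": 1e-4})
)

# Without the mixin, the same keyword is rejected by the base constructor.
try:
    BaseTrainer(model="m", optimizer_cls_and_kwargs=(FakeOptimizer, {}))
    rejected = False
except TypeError:
    rejected = True
```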
--- a/src/axolotl/core/trainers/mixins/rng_state_loader.py
+++ b/src/axolotl/core/trainers/mixins/rng_state_loader.py
@@ -6,7 +6,6 @@ See https://github.com/huggingface/transformers/pull/37162
 TODO: Remove when upstream added PR to release
 """
 import logging
 import os
 import random
@@ -17,7 +16,9 @@ from transformers.trainer import safe_globals
 from transformers.trainer_pt_utils import set_rng_state_for_device
 from transformers.training_args import ParallelMode
-LOG = logging.getLogger(__name__)
+from axolotl.utils.logging import get_logger
 LOG = get_logger(__name__)
 class RngLoaderMixin(Trainer):
--- a/src/axolotl/core/trainers/mixins/scheduler.py
+++ b/src/axolotl/core/trainers/mixins/scheduler.py
@@ -1,12 +1,11 @@
 """Module for Axolotl trainer scheduler mixin"""
 import logging
 import torch
 from torch.optim.lr_scheduler import LRScheduler, OneCycleLR
 from transformers.trainer import Trainer
 from axolotl.integrations.base import PluginManager
 from axolotl.utils.logging import get_logger
 from axolotl.utils.schedulers import (
    RexLR,
    get_cosine_schedule_with_min_lr,
@@ -14,7 +13,7 @@ from axolotl.utils.schedulers import (
    get_cosine_schedule_with_warmup_decay_constant,
 )
-LOG = logging.getLogger(__name__)
+LOG = get_logger(__name__)
 class SchedulerMixin(Trainer):
@@ -80,13 +79,15 @@ class SchedulerMixin(Trainer):
                self.lr_scheduler = RexLR(
                    optimizer=optimizer,
                    max_lr=self.args.learning_rate,
-                    min_lr=0 if not use_cosine_min_lr else (self.args.learning_rate * self.args.cosine_min_lr_ratio),
+                    min_lr=0 if not use_cosine_min_lr else (
                        self.args.learning_rate * self.args.cosine_min_lr_ratio),
                    total_steps=num_training_steps,
                    num_warmup_steps=self.args.get_warmup_steps(num_training_steps),
                )
            elif use_cosine_quadratic:
                if use_cosine_min_lr:
-                    LOG.warning("Both cosine quadratic warmup and min lr detected. Using quadratic warmup.")
+                    LOG.warning(
                        "Both cosine quadratic warmup and min lr detected. Using quadratic warmup.")
                self.lr_scheduler = get_cosine_schedule_with_quadratic_warmup(  # pylint: disable=attribute-defined-outside-init
                    optimizer,
@@ -115,9 +116,11 @@ class SchedulerMixin(Trainer):
                return super().create_scheduler(num_training_steps, optimizer=optimizer)
        else:
            if use_cosine_quadratic:
-                LOG.warning("axolotl's cosine scheduler with quadratic warmup not used (e.g., because of deepspeed).")
+                LOG.warning(
                    "axolotl's cosine scheduler with quadratic warmup not used (e.g., because of deepspeed).")
            if use_cosine_min_lr:
-                LOG.warning("axolotl's cosine scheduler with min lr not used (e.g., because of deepspeed).")
+                LOG.warning(
                    "axolotl's cosine scheduler with min lr not used (e.g., because of deepspeed).")
        return self.lr_scheduler  # type: ignore
--- a/src/axolotl/core/trainers/trl.py
+++ b/src/axolotl/core/trainers/trl.py
@@ -1,7 +1,5 @@
 """Module for TRL PPO trainer"""
 from typing import Literal, Union
 import torch
 from tqdm import tqdm
 from trl import (
@@ -14,6 +12,7 @@ from trl import (
 )
 from axolotl.core.trainers.mixins import RngLoaderMixin
 from axolotl.core.trainers.mixins.optimizer import OptimizerInitMixin, OptimizerMixin
 from axolotl.core.trainers.mixins.scheduler import SchedulerMixin
@@ -75,87 +74,19 @@ class TRLPPOTrainer(PPOTrainer):
            )
-class AxolotlORPOTrainer(RngLoaderMixin, SchedulerMixin, ORPOTrainer):
+class AxolotlORPOTrainer(
    RngLoaderMixin, SchedulerMixin, OptimizerMixin, OptimizerInitMixin, ORPOTrainer
 ):
    """
    Extend the base ORPOTrainer for axolotl helpers
    """
    tag_names = ["axolotl", "orpo"]
    def get_batch_loss_metrics(
        self,
        model,
        batch: dict[str, Union[list, torch.LongTensor]],
        train_eval: Literal["train", "eval"] = "train",
    ):
        """Compute the ORPO loss and other metrics for the given batch of inputs for train or test."""
-        # TODO remove once https://github.com/huggingface/trl/pull/3069 is included in a trl release
-
-        metrics = {}
+class AxolotlKTOTrainer(
+    RngLoaderMixin, SchedulerMixin, OptimizerMixin, OptimizerInitMixin, KTOTrainer
+):
        forward_output = self.concatenated_forward(model, batch)
        (
            policy_chosen_logps,
            policy_rejected_logps,
            policy_chosen_logits,
            policy_rejected_logits,
            policy_nll_loss,
        ) = forward_output[:5]
        if self.aux_loss_enabled:
            aux_loss = forward_output[5]
        losses, chosen_rewards, rejected_rewards, log_odds_ratio, log_odds_chosen = (
            self.odds_ratio_loss(policy_chosen_logps, policy_rejected_logps)
        )
        # full ORPO loss
        loss = policy_nll_loss - losses.mean()
        reward_accuracies = (chosen_rewards > rejected_rewards).float()
        prefix = "eval_" if train_eval == "eval" else ""
        metrics[f"{prefix}rewards/chosen"] = self.accelerator.gather_for_metrics(
            chosen_rewards
        ).mean()
        metrics[f"{prefix}rewards/rejected"] = self.accelerator.gather_for_metrics(
            rejected_rewards
        ).mean()
        metrics[f"{prefix}rewards/accuracies"] = self.accelerator.gather_for_metrics(
            reward_accuracies
        ).mean()
        metrics[f"{prefix}rewards/margins"] = self.accelerator.gather_for_metrics(
            chosen_rewards - rejected_rewards
        ).mean()
        metrics[f"{prefix}logps/rejected"] = (
            self.accelerator.gather_for_metrics(policy_rejected_logps).detach().mean()
        )
        metrics[f"{prefix}logps/chosen"] = (
            self.accelerator.gather_for_metrics(policy_chosen_logps).detach().mean()
        )
        metrics[f"{prefix}logits/rejected"] = self.accelerator.gather_for_metrics(
            policy_rejected_logits.detach().mean()
        ).mean()
        metrics[f"{prefix}logits/chosen"] = self.accelerator.gather_for_metrics(
            policy_chosen_logits.detach().mean()
        ).mean()
        metrics[f"{prefix}nll_loss"] = (
            self.accelerator.gather_for_metrics(policy_nll_loss).detach().mean()
        )
        metrics[f"{prefix}log_odds_ratio"] = (
            self.accelerator.gather_for_metrics(log_odds_ratio).detach().mean()
        )
        metrics[f"{prefix}log_odds_chosen"] = (
            self.accelerator.gather_for_metrics(log_odds_chosen).detach().mean()
        )
        for k, v in metrics.items():
            metrics[k] = v.item()
        if self.aux_loss_enabled:
            loss += self.aux_loss_coef * aux_loss
        return loss, metrics
 class AxolotlKTOTrainer(RngLoaderMixin, SchedulerMixin, KTOTrainer):
    """
    Extend the base KTOTrainer for axolotl helpers
    """
@@ -163,89 +94,19 @@ class AxolotlKTOTrainer(RngLoaderMixin, SchedulerMixin, KTOTrainer):
    tag_names = ["axolotl", "kto"]
-class AxolotlCPOTrainer(RngLoaderMixin, SchedulerMixin, CPOTrainer):
+class AxolotlCPOTrainer(
    RngLoaderMixin, SchedulerMixin, OptimizerMixin, OptimizerInitMixin, CPOTrainer
 ):
    """
    Extend the base CPOTrainer for axolotl helpers
    """
    tag_names = ["axolotl", "cpo"]
    def get_batch_loss_metrics(
        self,
        model,
        batch: dict[str, Union[list, torch.LongTensor]],
        train_eval: Literal["train", "eval"] = "train",
    ):
        """Compute the CPO loss and other metrics for the given batch of inputs for train or test."""
        metrics = {}
-        forward_output = self.concatenated_forward(model, batch)
-        (
-            policy_chosen_logps,
+class AxolotlRewardTrainer(
+    RngLoaderMixin, SchedulerMixin, OptimizerMixin, OptimizerInitMixin, RewardTrainer
+):
            policy_rejected_logps,
            policy_chosen_logits,
            policy_rejected_logits,
            policy_nll_loss,
        ) = forward_output[:5]
        if self.aux_loss_enabled:
            aux_loss = forward_output[5]
        losses, chosen_rewards, rejected_rewards = self.cpo_loss(
            policy_chosen_logps,
            policy_rejected_logps,
        )
        loss = losses.mean() + self.cpo_alpha * policy_nll_loss
        reward_accuracies = (chosen_rewards > rejected_rewards).float()
        prefix = "eval_" if train_eval == "eval" else ""
        metrics[f"{prefix}rewards/chosen"] = (
            self.accelerator.gather_for_metrics(chosen_rewards).mean().item()
        )
        metrics[f"{prefix}rewards/rejected"] = (
            self.accelerator.gather_for_metrics(rejected_rewards).mean().item()
        )
        metrics[f"{prefix}rewards/accuracies"] = (
            self.accelerator.gather_for_metrics(reward_accuracies).mean().item()
        )
        metrics[f"{prefix}rewards/margins"] = (
            self.accelerator.gather_for_metrics(chosen_rewards - rejected_rewards)
            .mean()
            .item()
        )
        metrics[f"{prefix}logps/rejected"] = (
            self.accelerator.gather_for_metrics(policy_rejected_logps)
            .detach()
            .mean()
            .item()
        )
        metrics[f"{prefix}logps/chosen"] = (
            self.accelerator.gather_for_metrics(policy_chosen_logps)
            .detach()
            .mean()
            .item()
        )
        metrics[f"{prefix}logits/rejected"] = (
            self.accelerator.gather_for_metrics(policy_rejected_logits.detach().mean())
            .mean()
            .item()
        )
        metrics[f"{prefix}logits/chosen"] = (
            self.accelerator.gather_for_metrics(policy_chosen_logits.detach().mean())
            .mean()
            .item()
        )
        metrics[f"{prefix}nll_loss"] = (
            self.accelerator.gather_for_metrics(policy_nll_loss).detach().mean().item()
        )
        if self.aux_loss_enabled:
            loss += self.aux_loss_coef * aux_loss
        return loss, metrics
 class AxolotlRewardTrainer(RngLoaderMixin, SchedulerMixin, RewardTrainer):
    """
    Extend the base RewardTrainer for axolotl helpers
    """
@@ -253,7 +114,9 @@ class AxolotlRewardTrainer(RngLoaderMixin, SchedulerMixin, RewardTrainer):
    tag_names = ["axolotl", "reward"]
-class AxolotlPRMTrainer(RngLoaderMixin, SchedulerMixin, PRMTrainer):
+class AxolotlPRMTrainer(
    RngLoaderMixin, SchedulerMixin, OptimizerMixin, OptimizerInitMixin, PRMTrainer
 ):
    """
    Extend the base trl.PRMTrainer for axolotl helpers
    """
--- a/src/axolotl/core/training_args.py
+++ b/src/axolotl/core/training_args.py
@@ -2,244 +2,17 @@
 extra axolotl specific training args
 """
-from dataclasses import dataclass, field
-from typing import Optional
+from __future__ import annotations
+
 from dataclasses import dataclass, field
 from typing import Optional, Type
 from PIL.Image import Resampling
 from transformers import TrainingArguments
 from trl import CPOConfig, KTOConfig, ORPOConfig, PRMConfig, RewardConfig
 from axolotl.integrations.config import merge_training_args
-@dataclass
+AxolotlTrainingMixins: Type = merge_training_args()
 class AxolotlTrainingMixins:
    """
    Mixin class for the Axolotl training args.
    """
    # pylint: disable=duplicate-code
    model_type: Optional[str] = field(
        default=None, metadata={"help": "HF model configuration model_type."}
    )
    lr_quadratic_warmup: bool = field(
        default=False,
        metadata={"help": "Use quadratic warmup for cosine scheduling."},
    )
    pretraining: bool = field(
        default=False,
        metadata={
            "help": "Indicates to trainer whether we are doing continued pretraining."
        },
    )
    sample_packing: bool = field(
        default=False,
        metadata={"help": "Use sample packing for efficient training."},
    )
    sample_packing_sequentially: bool = field(
        default=False,
        metadata={
            "help": "Use next-fit sample packing that preserves the order of samples coming from the sampler. Use in combination with curriculum_sampling for fully sequential packing."
        },
    )
    multipack_real_batches: bool = field(
        default=False,
        metadata={"help": "Use real batches for efficient training."},
    )
    eval_sample_packing: Optional[bool] = field(
        default=None,
        metadata={"help": "Use sample packing for efficient evals."},
    )
    sample_packing_efficiency: float = field(
        default=1.0,
        metadata={"help": "Sample packing efficiency for calculating batch length."},
    )
    sample_packing_bin_size: int = field(
        default=200,
        metadata={
            "help": "The max number of samples that packed sample can contain after packing. Increase for better packing."
        },
    )
    sample_packing_group_size: int = field(
        default=100000,
        metadata={
            "help": "The number of samples to group together for packing. Increase for better packing."
        },
    )
    max_seq_length: int = field(
        default=2048,
        metadata={"help": "The maximum sequence length the model can handle"},
    )
    relora_steps: Optional[int] = field(
        default=None,
        metadata={"help": "how often to reset for ReLoRA"},
    )
    relora_warmup_steps: Optional[int] = field(
        default=None,
        metadata={"help": "how many warmup steps to take after reset for ReLoRA"},
    )
    relora_anneal_steps: Optional[int] = field(
        default=None,
        metadata={"help": "how many warmup steps to take after reset for ReLoRA"},
    )
    relora_prune_ratio: Optional[float] = field(
        default=0.9,
        metadata={"help": "prune ratio for magnitude pruning of the optimizer"},
    )
    bench_split: Optional[str] = field(
        default="eval", metadata={"help": "The benchmark split to run on"}
    )
    bench_dataset: Optional[str] = field(
        default="pharaouk/dharma-1/dharma_1_mini.json",
        metadata={
            "help": "Benchmark dataset to use: options are `mmlu-zs`, `mmlu-fs`, or the full path to the dataset file"
        },
    )
    do_bench_eval: Optional[bool] = field(
        default=False, metadata={"help": "Whether to run the Benchmark evaluation."}
    )
    do_causal_lm_eval: Optional[bool] = field(
        default=False, metadata={"help": "Whether to run the Causal LM evaluation."}
    )
    max_bench_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": "If set, only evaluates on `max_bench_samples` of the benchmark dataset."
        },
    )
    bench_source_max_len: int = field(
        default=2048, metadata={"help": "Maximum source sequence length for bench."}
    )
    dataloader_prefetch_factor: Optional[int] = field(
        default=None,
        metadata={"help": "prefetch_factor argument to the dataloader"},
    )
    cosine_min_lr_ratio: Optional[float] = field(
        default=None,
        metadata={"help": "Minimum learning rate is min_lr_ratio * learning_rate"},
    )
    cosine_constant_lr_ratio: Optional[float] = field(
        default=None,
        metadata={
            "help": "Starting constant learning rate step is cosine_constant_lr_ratio * max_steps"
        },
    )
    loraplus_lr_ratio: Optional[float] = field(
        default=None, metadata={"help": "loraplus learning rate ratio lr_B / lr_A."}
    )
    loraplus_lr_embedding: Optional[float] = field(
        default=1e-6,
        metadata={"help": "loraplus learning rate for lora embedding layers."},
    )
    embedding_lr_scale: Optional[float] = field(
        default=None,
        metadata={"help": "Scale the learning rate for the embedding layers."},
    )
    lr_groups: Optional[list[dict]] = field(
        default=None,
        metadata={"help": "Specify learning rate groups for with different LRs."},
    )
    embedding_lr: Optional[float] = field(
        default=None,
        metadata={"help": "absolute learning rate for the embedding layers."},
    )
    qlora: bool = field(
        default=False,
        metadata={"help": "whether this is a qlora training"},
    )
    orpo_alpha: Optional[float] = field(
        default=None,
    )
    lisa_n_layers: Optional[int] = field(
        default=None,
        metadata={"help": "the number of activate layers in LISA"},
    )
    lisa_step_interval: Optional[int] = field(
        default=None,
        metadata={"help": "how often to switch layers in LISA"},
    )
    lisa_layers_attribute: Optional[str] = field(
        default=None,
        metadata={"help": "path under the model to access the layers"},
    )
    curriculum_sampling: Optional[bool] = field(
        default=None,
        metadata={"help": "whether to use sequential sampling for curriculum learning"},
    )
    alternate_optimizer: Optional[str] = field(
        default=None,
        metadata={
            "help": "workaround to pass an alternate optimizer to the HF trainer"
        },
    )
    alternate_lr_scheduler_type: Optional[str] = field(
        default=None,
        metadata={
            "help": "workaround to pass an alternate lr scheduler to the HF trainer"
        },
    )
    chat_template: Optional[str] = field(
        default=None,
        metadata={"help": "Chat template converting chat messages to text"},
    )
    kd_ce_alpha: Optional[float] = field(
        default=None,
        metadata={
            "help": "The alpha scaling parameter for SFT cross entropy loss when using KD"
        },
    )
    kd_alpha: Optional[float] = field(
        default=1.0,
        metadata={"help": "The alpha scaling parameter for KD loss"},
    )
    kd_temperature: Optional[float] = field(
        default=1.0,
        metadata={
            "help": "the temperature parameter for KL divergence loss when using KD"
        },
    )
    kd_zscore_base_temp: Optional[float] = field(
        default=None,
        metadata={
            "help": "the base temperature parameter for KL divergence with z-score when using KD"
        },
    )
    kd_top_k_before_softmax: Optional[bool] = field(
        default=None,
        metadata={
            "help": "Whether to apply top_k_before_softmax to the logits when using KD"
        },
    )
    adam_beta3: Optional[float] = field(
        default=None,
        metadata={
            "help": "The beta3 hyperparameter used in some optimizers such as CAME"
        },
    )
    adam_epsilon2: Optional[float] = field(
        default=None,
        metadata={
            "help": "The epsilon2 hyperparameter used in some optimizers such as CAME"
        },
    )
    # multi-modal section
    image_size: int | tuple[int, int] | None = field(
        default=None,
        metadata={"help": "The size of the image to resize to"},
    )
    image_resize_algorithm: Resampling | None = field(
        default=None,
        metadata={"help": "The algorithm to use for image resizing"},
    )
    # end of multi-modal section
@dataclass
--- a/src/axolotl/core/training_args_base.py
+++ b/src/axolotl/core/training_args_base.py
@@ -0,0 +1,220 @@
 """
 Base Axolotl Training Mixins shared across various trainer configs
 """
 from dataclasses import dataclass, field
 from typing import Optional
 from PIL.Image import Resampling
@dataclass
 class AxolotlTrainingMixins:
    """
    Mixin class for the Axolotl training args.
    """
    # pylint: disable=duplicate-code
    model_type: Optional[str] = field(
        default=None, metadata={"help": "HF model configuration model_type."}
    )
    lr_quadratic_warmup: bool = field(
        default=False,
        metadata={"help": "Use quadratic warmup for cosine scheduling."},
    )
    pretraining: bool = field(
        default=False,
        metadata={
            "help": "Indicates to trainer whether we are doing continued pretraining."
        },
    )
    sample_packing: bool = field(
        default=False,
        metadata={"help": "Use sample packing for efficient training."},
    )
    sample_packing_sequentially: bool = field(
        default=False,
        metadata={
            "help": "Use next-fit sample packing that preserves the order of samples coming from the sampler. Use in combination with curriculum_sampling for fully sequential packing."
        },
    )
    multipack_real_batches: bool = field(
        default=False,
        metadata={"help": "Use real batches for efficient training."},
    )
    eval_sample_packing: Optional[bool] = field(
        default=None,
        metadata={"help": "Use sample packing for efficient evals."},
    )
    sample_packing_efficiency: float = field(
        default=1.0,
        metadata={"help": "Sample packing efficiency for calculating batch length."},
    )
    sample_packing_bin_size: int = field(
        default=200,
        metadata={
            "help": "The max number of samples that a packed sample can contain after packing. Increase for better packing."
        },
    )
    sample_packing_group_size: int = field(
        default=100000,
        metadata={
            "help": "The number of samples to group together for packing. Increase for better packing."
        },
    )
    max_seq_length: int = field(
        default=2048,
        metadata={"help": "The maximum sequence length the model can handle"},
    )
    relora_steps: Optional[int] = field(
        default=None,
        metadata={"help": "how often to reset for ReLoRA"},
    )
    relora_warmup_steps: Optional[int] = field(
        default=None,
        metadata={"help": "how many warmup steps to take after reset for ReLoRA"},
    )
    relora_anneal_steps: Optional[int] = field(
        default=None,
        metadata={"help": "how many anneal steps to take before reset for ReLoRA"},
    )
    relora_prune_ratio: Optional[float] = field(
        default=0.9,
        metadata={"help": "prune ratio for magnitude pruning of the optimizer"},
    )
    bench_split: Optional[str] = field(
        default="eval", metadata={"help": "The benchmark split to run on"}
    )
    bench_dataset: Optional[str] = field(
        default="pharaouk/dharma-1/dharma_1_mini.json",
        metadata={
            "help": "Benchmark dataset to use: options are `mmlu-zs`, `mmlu-fs`, or the full path to the dataset file"
        },
    )
    do_bench_eval: Optional[bool] = field(
        default=False, metadata={"help": "Whether to run the Benchmark evaluation."}
    )
    do_causal_lm_eval: Optional[bool] = field(
        default=False, metadata={"help": "Whether to run the Causal LM evaluation."}
    )
    max_bench_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": "If set, only evaluates on `max_bench_samples` of the benchmark dataset."
        },
    )
    bench_source_max_len: int = field(
        default=2048, metadata={"help": "Maximum source sequence length for bench."}
    )
    dataloader_prefetch_factor: Optional[int] = field(
        default=None,
        metadata={"help": "prefetch_factor argument to the dataloader"},
    )
    cosine_min_lr_ratio: Optional[float] = field(
        default=None,
        metadata={"help": "Minimum learning rate is cosine_min_lr_ratio * learning_rate"},
    )
    cosine_constant_lr_ratio: Optional[float] = field(
        default=None,
        metadata={
            "help": "Starting constant learning rate step is cosine_constant_lr_ratio * max_steps"
        },
    )
    loraplus_lr_ratio: Optional[float] = field(
        default=None, metadata={"help": "loraplus learning rate ratio lr_B / lr_A."}
    )
    loraplus_lr_embedding: Optional[float] = field(
        default=1e-6,
        metadata={"help": "loraplus learning rate for lora embedding layers."},
    )
    embedding_lr_scale: Optional[float] = field(
        default=None,
        metadata={"help": "Scale the learning rate for the embedding layers."},
    )
    lr_groups: Optional[list[dict]] = field(
        default=None,
        metadata={"help": "Specify learning rate groups with different LRs."},
    )
    embedding_lr: Optional[float] = field(
        default=None,
        metadata={"help": "absolute learning rate for the embedding layers."},
    )
    qlora: bool = field(
        default=False,
        metadata={"help": "whether this is a qlora training"},
    )
    orpo_alpha: Optional[float] = field(
        default=None,
    )
    lisa_n_layers: Optional[int] = field(
        default=None,
        metadata={"help": "the number of active layers in LISA"},
    )
    lisa_step_interval: Optional[int] = field(
        default=None,
        metadata={"help": "how often to switch layers in LISA"},
    )
    lisa_layers_attribute: Optional[str] = field(
        default=None,
        metadata={"help": "path under the model to access the layers"},
    )
    curriculum_sampling: Optional[bool] = field(
        default=None,
        metadata={"help": "whether to use sequential sampling for curriculum learning"},
    )
    alternate_lr_scheduler_type: Optional[str] = field(
        default=None,
        metadata={
            "help": "workaround to pass an alternate lr scheduler to the HF trainer"
        },
    )
    chat_template: Optional[str] = field(
        default=None,
        metadata={"help": "Chat template converting chat messages to text"},
    )
    # kd_ce_alpha: Optional[float] = field(
    #     default=None,
    #     metadata={
    #         "help": "The alpha scaling parameter for SFT cross entropy loss when using KD"
    #     },
    # )
    #
    # kd_alpha: Optional[float] = field(
    #     default=1.0,
    #     metadata={"help": "The alpha scaling parameter for KD loss"},
    # )
    #
    # kd_temperature: Optional[float] = field(
    #     default=1.0,
    #     metadata={
    #         "help": "the temperature parameter for KL divergence loss when using KD"
    #     },
    # )
    adam_beta3: Optional[float] = field(
        default=None,
        metadata={
            "help": "The beta3 hyperparameter used in some optimizers such as CAME"
        },
    )
    adam_epsilon2: Optional[float] = field(
        default=None,
        metadata={
            "help": "The epsilon2 hyperparameter used in some optimizers such as CAME"
        },
    )
    # multi-modal section
    image_size: int | tuple[int, int] | None = field(
        default=None,
        metadata={"help": "The size of the image to resize to"},
    )
    image_resize_algorithm: Resampling | None = field(
        default=None,
        metadata={"help": "The algorithm to use for image resizing"},
    )
    # end of multi-modal section
--- a/src/axolotl/datasets.py
+++ b/src/axolotl/datasets.py
@@ -1,12 +1,13 @@
 """Module containing Dataset functionality"""
 import logging
 import os
 from typing import List, Optional, Union
 import torch
 from datasets import Dataset, IterableDataset
 from axolotl.utils.logging import get_logger
 from .prompt_tokenizers import PromptTokenizingStrategy
 # We want this to be a wrapper for an existing dataset that we have loaded
@@ -15,7 +16,7 @@ from .prompt_tokenizers import PromptTokenizingStrategy
 # let's check to ensure we don't truncate an item in the middle, we'll use
 # the collators later on to pad the datasets
-LOG = logging.getLogger("axolotl")
+LOG = get_logger(__name__)
 class TokenizedPromptDataset(Dataset):
--- a/src/axolotl/integrations/base.py
+++ b/src/axolotl/integrations/base.py
@@ -22,7 +22,7 @@ from __future__ import annotations
 import collections
 import importlib
-import logging
+import traceback
 from typing import TYPE_CHECKING, Callable, OrderedDict, Union
 from peft import PeftModel
@@ -31,6 +31,9 @@ from torch.optim.lr_scheduler import LRScheduler
 from transformers import PreTrainedModel, Trainer
 from axolotl.utils.dict import DictDefault
 from axolotl.utils.logging import get_logger
 LOG = get_logger(__name__, use_environ=True)
 if TYPE_CHECKING:
    from axolotl.common.datasets import TrainDatasetMeta
@@ -39,31 +42,39 @@ if TYPE_CHECKING:
 class BasePlugin:
    """Base class for all plugins. Defines the interface for plugin methods.
-    Methods:
+    A plugin is a reusable, modular, and self-contained piece of code that extends
-        register(cfg): Registers the plugin with the given configuration.
+    the functionality of Axolotl. Plugins can be used to integrate third-party models,
-        load_datasets(cfg): Loads and preprocesses the dataset for training.
+    modify the training process, or add new features.
-        pre_model_load(cfg): Performs actions before the model is loaded.
+
-        post_model_build(cfg, model): Performs actions after the model is loaded, but
+    To create a new plugin, you need to inherit from the BasePlugin class and
    implement the required methods.
    Note:
        Plugin methods include:
        - register(cfg): Registers the plugin with the given configuration.
        - load_datasets(cfg): Loads and preprocesses the dataset for training.
        - pre_model_load(cfg): Performs actions before the model is loaded.
        - post_model_build(cfg, model): Performs actions after the model is loaded, but
            before LoRA adapters are applied.
-        pre_lora_load(cfg, model): Performs actions before LoRA weights are loaded.
+        - pre_lora_load(cfg, model): Performs actions before LoRA weights are loaded.
-        post_lora_load(cfg, model): Performs actions after LoRA weights are loaded.
+        - post_lora_load(cfg, model): Performs actions after LoRA weights are loaded.
-        post_model_load(cfg, model): Performs actions after the model is loaded,
+        - post_model_load(cfg, model): Performs actions after the model is loaded,
            inclusive of any adapters.
-        post_trainer_create(cfg, trainer): Performs actions after the trainer is
+        - post_trainer_create(cfg, trainer): Performs actions after the trainer is
            created.
-        create_optimizer(cfg, trainer): Creates and returns an optimizer for training.
+        - create_optimizer(cfg, trainer): Creates and returns an optimizer for training.
-        create_lr_scheduler(cfg, trainer, optimizer, num_training_steps): Creates and
+        - create_lr_scheduler(cfg, trainer, optimizer, num_training_steps): Creates and
            returns a learning rate scheduler.
-        add_callbacks_pre_trainer(cfg, model): Adds callbacks to the trainer before
+        - add_callbacks_pre_trainer(cfg, model): Adds callbacks to the trainer before
            training.
-        add_callbacks_post_trainer(cfg, trainer): Adds callbacks to the trainer after
+        - add_callbacks_post_trainer(cfg, trainer): Adds callbacks to the trainer after
            training.
    """
    def __init__(self):
        """Initializes the BasePlugin."""
-    def register(self, cfg):  # pylint: disable=unused-argument
+    def register(self, cfg: DictDefault):  # pylint: disable=unused-argument
        """Registers the plugin with the given configuration.
        Args:
@@ -73,6 +84,11 @@ class BasePlugin:
    def get_input_args(self) -> str | None:
        """Returns a pydantic model for the plugin's input arguments."""
    def get_training_args_mixin(self) -> str | None:
        """
        Returns a dataclass model for the plugin's training arguments.
        """
    def load_datasets(
        self, cfg: DictDefault, preprocess: bool = False
    ) -> Union["TrainDatasetMeta", None]:
@@ -148,6 +164,31 @@ class BasePlugin:
            trainer: The trainer object for training.
        """
    def get_training_args(self, cfg: DictDefault):  # pylint: disable=unused-argument
        """
        Returns custom training arguments to set on TrainingArgs.
        Args:
            cfg: The global axolotl configuration.
        Returns:
            object: dict containing the training arguments.
        """
    def get_collator_cls_and_kwargs(
        self, cfg: DictDefault, is_eval: bool = False
    ):  # pylint: disable=unused-argument
        """
        Returns a custom collator class and its keyword arguments.
        Args:
            cfg: The global axolotl configuration.
            is_eval: Whether this is an eval split.
        Returns:
            tuple: The collator class and a dict of kwargs for it, or None.
        """
    # pylint: disable=unused-argument
    def create_optimizer(self, cfg: DictDefault, trainer: Trainer) -> Optimizer | None:
        """Creates and returns an optimizer for training.
@@ -268,17 +309,18 @@ def load_plugin(plugin_name: str) -> BasePlugin:
    return plugin
-class PluginManager:
+class PluginManager:  # pylint: disable=too-many-public-methods
    """The `PluginManager` class is responsible for loading and managing plugins. It
    should be a singleton so it can be accessed from anywhere in the codebase.
    Attributes:
        plugins: A list of loaded plugins.
-    Methods:
+    Note:
-        get_instance(): Static method to get the singleton instance of `PluginManager`.
+        Key methods include:
-        register(plugin_name: str): Registers a new plugin by its name.
+        - get_instance(): Static method to get the singleton instance of `PluginManager`.
-        pre_model_load(cfg): Calls the pre_model_load method of all registered plugins.
+        - register(plugin_name: str): Registers a new plugin by its name.
        - pre_model_load(cfg): Calls the pre_model_load method of all registered plugins.
    """
    plugins: OrderedDict[str, BasePlugin] = collections.OrderedDict()
@@ -322,12 +364,15 @@ class PluginManager:
            ImportError: If the plugin module cannot be imported.
        """
        try:
-            logging.info(f"Attempting to load plugin: {plugin_name}")
+            LOG.info(f"Attempting to load plugin: {plugin_name}")
            plugin = load_plugin(plugin_name)
            self.plugins[plugin_name] = plugin
-            logging.info(f"Plugin loaded successfully: {plugin_name}")
+            LOG.info(f"Plugin loaded successfully: {plugin_name}")
-        except ImportError:
+        except ImportError as exc:
-            logging.error(f"Failed to load plugin: {plugin_name}")
+            LOG.error(f"Failed to load plugin: {plugin_name}")
            # print stacktrace
            traceback.print_exc()
            print(f"Error: {exc}")
    def get_input_args(self) -> list[str]:
        """Returns a list of Pydantic classes for all registered plugins' input arguments.
@@ -342,6 +387,20 @@ class PluginManager:
                input_args.append(input_args_from_plugin)
        return input_args
    def get_training_args_mixin(self):
        """
        Returns a list of dataclasses for all registered plugins' training args mixins.
        Returns:
        list[str]: A list of dotted import paths to the plugins' dataclasses.
        """
        training_args = []
        for plugin in self.plugins.values():
            training_args_from_plugin = plugin.get_training_args_mixin()
            if training_args_from_plugin is not None:
                training_args.append(training_args_from_plugin)
        return training_args
    def load_datasets(
        self, cfg: DictDefault, preprocess: bool = False
    ) -> Union["TrainDatasetMeta", None]:
@@ -431,6 +490,42 @@ class PluginManager:
                return trainer_cls
        return None
    def get_training_args(self, cfg):
        """
        Calls the get_training_args method of all registered plugins and returns the combined training arguments.
        Parameters:
        cfg (dict): The configuration for the plugins.
        Returns:
        object: The training arguments
        """
        training_args_kwargs = {}
        for plugin in self.plugins.values():
            training_args = plugin.get_training_args(cfg)
            if training_args is not None:
                training_args_kwargs.update(training_args)
        return training_args_kwargs
    def get_collator_cls_and_kwargs(self, cfg, is_eval=False):
        """
        Calls the get_collator_cls_and_kwargs method of all registered plugins and returns the first non-None collator class.
        Parameters:
        cfg (dict): The configuration for the plugins.
        is_eval (bool): Whether this is an eval split.
        Returns:
        tuple | None: The collator class and its kwargs, or None if no plugin provides one.
        """
        for plugin in self.plugins.values():
            collator = plugin.get_collator_cls_and_kwargs(cfg, is_eval=is_eval)
            if collator is not None:
                collator_cls, collator_kwargs = collator
                return collator_cls, collator_kwargs
        return None
    def post_trainer_create(self, cfg: DictDefault, trainer: Trainer):
        """Calls the `post_trainer_create` method of all registered plugins.
@@ -534,7 +629,6 @@ class PluginManager:
        Args:
            cfg: The configuration for the plugins.
            model: The loaded model.
        """
        for plugin in self.plugins.values():
            plugin.post_train_unload(cfg)
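
The two aggregation rules added to `PluginManager` in this file differ: `get_training_args` merges the dicts from every plugin (later plugins overwrite earlier keys), while `get_collator_cls_and_kwargs` returns the first non-None result. A minimal sketch of both rules, assuming two hypothetical plugins (`PluginA`, `PluginB`) and using `dict` as a placeholder collator class; none of these names come from the diff.

```python
from collections import OrderedDict


class PluginA:
    """Hypothetical plugin that contributes training args but no collator."""

    def get_training_args(self, cfg):
        return {"kd_alpha": 1.0, "kd_temperature": 2.0}

    def get_collator_cls_and_kwargs(self, cfg, is_eval=False):
        return None


class PluginB:
    """Hypothetical plugin that overrides one arg and supplies a collator."""

    def get_training_args(self, cfg):
        return {"kd_temperature": 1.0}

    def get_collator_cls_and_kwargs(self, cfg, is_eval=False):
        # `dict` stands in for a real collator class.
        return dict, {"is_eval": is_eval}


def combined_training_args(plugins, cfg):
    """Merge every plugin's training args; later plugins win on key clashes."""
    merged = {}
    for plugin in plugins.values():
        args = plugin.get_training_args(cfg)
        if args is not None:
            merged.update(args)
    return merged


def first_collator(plugins, cfg, is_eval=False):
    """Return the first non-None (collator_cls, kwargs) pair, else None."""
    for plugin in plugins.values():
        result = plugin.get_collator_cls_and_kwargs(cfg, is_eval=is_eval)
        if result is not None:
            return result
    return None


plugins = OrderedDict(a=PluginA(), b=PluginB())
training_args = combined_training_args(plugins, cfg={})
collator = first_collator(plugins, cfg={}, is_eval=True)
```

Because registration order decides both outcomes, the order of plugins in the config matters when two plugins set the same training arg or both provide a collator.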
--- a/src/axolotl/integrations/config.py
+++ b/src/axolotl/integrations/config.py
@@ -16,7 +16,7 @@ Module to handle merging the plugins' input arguments with the base configuratio
 This was moved here to prevent circular imports.
 """
-from typing import Any, Dict, List
+from typing import Any, Dict, List, Type
 from axolotl.utils.schemas.config import (
    AxolotlConfigWCapabilities as AxolotlConfigWCapabilitiesBase,
@@ -61,3 +61,43 @@ def merge_input_args():
        ]
        return AxolotlConfigWCapabilities, AxolotlInputConfig
    return AxolotlConfigWCapabilitiesBase, AxolotlInputConfigBase
 def merge_training_args() -> Type:
    """
    Merges training args mixins from registered plugins with the base AxolotlTrainingMixins.
    This function retrieves the training args mixins from registered plugins using the PluginManager.
    It then dynamically creates a new class, AxolotlTrainingMixins,
    that inherits from the base mixins and includes the training arguments from the plugins.
    Returns:
    Type: The newly created AxolotlTrainingMixins class, or the base class if no plugin provides a mixin.
    """
    # pylint: disable=duplicate-code
    from axolotl.core.training_args_base import (
        AxolotlTrainingMixins as AxolotlTrainingMixinsBase,
    )
    from axolotl.integrations.base import PluginManager
    plugin_manager = PluginManager.get_instance()
    training_args_mixins: List[str] = plugin_manager.get_training_args_mixin()
    mixin_classes = []
    dynamic_input = ""
    for plugin_args in training_args_mixins:
        plugin_module, plugin_cls = plugin_args.rsplit(".", 1)
        dynamic_input += f"from {plugin_module} import {plugin_cls}\n"
        mixin_classes.append(plugin_cls)
    if dynamic_input:
        dynamic_input += f"class AxolotlTrainingMixins(AxolotlTrainingMixinsBase, {', '.join(mixin_classes)}):\n    pass\n"
        namespace: Dict[Any, Any] = {}
        local_vars = {"AxolotlTrainingMixinsBase": AxolotlTrainingMixinsBase}
        exec(  # pylint: disable=exec-used  # nosec B102
            dynamic_input, {**globals(), **local_vars}, namespace
        )
        AxolotlTrainingMixins = namespace[  # pylint: disable=invalid-name
            "AxolotlTrainingMixins"
        ]
        return AxolotlTrainingMixins
    return AxolotlTrainingMixinsBase
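
The class composition done by `merge_training_args` can be tried in isolation. The sketch below is an illustration only: `BaseMixins` and `PluginMixin` are hypothetical stand-ins for `AxolotlTrainingMixinsBase` and a plugin mixin, and it builds the class with `type()` rather than the `exec` call used in the diff. It also applies `@dataclass` to the merged class so that the merged fields are usable directly, which the diff does not do at this point.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class BaseMixins:
    """Hypothetical stand-in for AxolotlTrainingMixinsBase."""

    sample_packing: bool = False


@dataclass
class PluginMixin:
    """Hypothetical stand-in for a plugin's training args mixin."""

    kd_alpha: Optional[float] = None


def merge_mixins(base, plugin_mixins):
    """Return a dataclass inheriting from `base` and every plugin mixin."""
    if not plugin_mixins:
        # No plugin contributed a mixin: hand back the base class unchanged.
        return base
    merged = type("MergedMixins", (base, *plugin_mixins), {})
    # Re-running dataclass() regenerates __init__ over the fields of all bases.
    return dataclass(merged)


merged_cls = merge_mixins(BaseMixins, [PluginMixin])
args = merged_cls(sample_packing=True, kd_alpha=0.5)
field_names = sorted(f.name for f in fields(merged_cls))
```

Every field needs a default here, since the dataclass field order follows the reversed MRO and a non-default field after a default one would raise a `TypeError`.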
--- a/src/axolotl/integrations/cut_cross_entropy/init.py
+++ b/src/axolotl/integrations/cut_cross_entropy/init.py
@@ -19,17 +19,16 @@ Cut Cross Entropy is an optimized implementation of cross entropy loss
 from Apple's ML team.
 """
 import importlib
 import logging
 import torch
 from axolotl.integrations.base import BasePlugin
 from axolotl.utils import get_pytorch_version
-from axolotl.utils.distributed import is_main_process
+from axolotl.utils.logging import get_logger
 from .args import CutCrossEntropyArgs  # pylint: disable=unused-import. # noqa: F401
-LOG = logging.getLogger("axolotl.integrations.cut_cross_entropy")
+LOG = get_logger(__name__, use_environ=True)
 _CCE_INSTALL_MESSAGE = (
    "Please install cut_cross_entropy with transformers support using "
@@ -76,10 +75,9 @@ class CutCrossEntropyPlugin(BasePlugin):
                cce_patch,
            )
-            if is_main_process(use_environ=True):
-                LOG.info(
-                    f"Applying Cut Cross Entropy to model type: {cfg.model_config_type}"
-                )
+            LOG.info(
+                f"Applying Cut Cross Entropy to model type: {cfg.model_config_type}"
+            )
            # The patch checks model_type internally
            cce_patch(cfg.model_config_type)
--- a/src/axolotl/integrations/cut_cross_entropy/args.py
+++ b/src/axolotl/integrations/cut_cross_entropy/args.py
@@ -15,12 +15,13 @@
 """
 Module for handling Cut Cross Entropy input arguments.
 """
 import logging
 from typing import Optional
 from pydantic import BaseModel, model_validator
-LOG = logging.getLogger("axolotl.integrations.cut_cross_entropy.args")
+from axolotl.utils.logging import get_logger
 LOG = get_logger(__name__)
 class CutCrossEntropyArgs(BaseModel):
--- a/src/axolotl/integrations/cut_cross_entropy/monkeypatch/mllama.py
+++ b/src/axolotl/integrations/cut_cross_entropy/monkeypatch/mllama.py
@@ -15,23 +15,14 @@ from cut_cross_entropy.transformers.utils import (
 from transformers.cache_utils import Cache
 from transformers.modeling_outputs import CausalLMOutputWithPast
 from transformers.models.mllama.modeling_mllama import (
    MLLAMA_INPUTS_DOCSTRING,
    _prepare_cross_attention_mask,
 )
 from transformers.utils import (
    add_start_docstrings_to_model_forward,
    replace_return_docstrings,
 )
 from transformers.utils.deprecation import deprecate_kwarg
 _PATCH_OPTS: PatchOptions | None = None
@deprecate_kwarg("num_logits_to_keep", version="4.50", new_name="logits_to_keep")
@add_start_docstrings_to_model_forward(MLLAMA_INPUTS_DOCSTRING)
@replace_return_docstrings(
    output_type=CausalLMOutputWithPast, config_class="MllamaTextConfig"
 )
 def cce_forward(
    self,
    input_ids: torch.LongTensor | None = None,
@@ -164,10 +155,6 @@ def cce_forward(
@deprecate_kwarg("num_logits_to_keep", version="4.50", new_name="logits_to_keep")
@add_start_docstrings_to_model_forward(MLLAMA_INPUTS_DOCSTRING)
@replace_return_docstrings(
    output_type=CausalLMOutputWithPast, config_class="MllamaConfig"
 )
 def cce_forward_multimodal(
    self,
    input_ids: Optional[torch.LongTensor] = None,
--- a/src/axolotl/integrations/grokfast/init.py
+++ b/src/axolotl/integrations/grokfast/init.py
@@ -2,15 +2,15 @@
 Grokfast plugin for Axolotl
 """
 import logging
 from transformers.trainer_callback import TrainerCallback
 from axolotl.utils.logging import get_logger
 from ..base import BasePlugin
 from .args import GrokfastArgs  # pylint: disable=unused-import. # noqa: F401
 from .optimizer import gradfilter_ema
-LOG = logging.getLogger("axolotl.integrations.grokfast")
+LOG = get_logger(__name__)
 class GrokfastCallbackHandler(TrainerCallback):
--- a/src/axolotl/integrations/kd/README.md
+++ b/src/axolotl/integrations/kd/README.md
@@ -21,3 +21,32 @@ datasets:
 ```
 An example dataset can be found at [`axolotl-ai-co/evolkit-logprobs-pipeline-75k-v2-sample`](https://huggingface.co/datasets/axolotl-ai-co/evolkit-logprobs-pipeline-75k-v2-sample)
 ## Online KD (sglang)
 ```bash
 export UV_TORCH_BACKEND=cu124
 uv venv sglang --python 3.11
 source sglang/bin/activate
 uv pip install --upgrade pip
 uv pip install setuptools
 uv pip install torch~=2.5.1 --index-url https://download.pytorch.org/whl/cu124
 uv pip install sgl-kernel --force-reinstall --no-deps
 uv pip install "sglang[all]>=0.4.2.post4" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer/
 ```
 ## Online KD (vllm)
 ```bash
 VLLM_USE_V1=0 vllm serve open-r1/OlympicCoder-32B --max-model-len 16400 --port 8888 --max-logprobs 128 --return-tokens-as-token-ids --tensor-parallel-size 8 --max-num-seqs 256 --gpu_memory_utilization 0.2 --enable-chunked-prefill
 ```
 ```bash
 vllm serve open-r1/OlympicCoder-32B --max-model-len 16400 --port 8888 --max-logprobs 128 --return-tokens-as-token-ids --tensor-parallel-size 8 --no-enable-prefix-caching --gpu-memory-utilization 0.3 --max-num-batched-tokens 131072 --host 0.0.0.0
 ```
 ```bash
 python -m sglang.launch_server --model-path open-r1/OlympicCoder-32B --tensor-parallel-size 8 --port 8080 --host 0.0.0.0 --max-running-requests 256 --context-length 16400 --mem-fraction-static 0.2 --schedule-conservativeness 0.3 --chunked-prefill-size 131072 --schedule-policy fcfs --skip-tokenizer-init
 ```
--- a/src/axolotl/integrations/kd/init.py
+++ b/src/axolotl/integrations/kd/init.py
@@ -15,7 +15,12 @@
 """
 Plugin init to add KD support to Axolotl.
 """
 from typing import Any
 from transformers import Trainer
 from axolotl.integrations.base import BasePlugin
 from axolotl.integrations.kd.callbacks import KDTemperatureSchedulerCallback
 from .args import KDArgs  # pylint: disable=unused-import. # noqa: F401
@@ -28,9 +33,75 @@ class KDPlugin(BasePlugin):
    def get_input_args(self):
        return "axolotl.integrations.kd.KDArgs"
    def get_training_args_mixin(self):
        return "axolotl.integrations.kd.args.KDTrainingArgsMixin"
    def get_trainer_cls(self, cfg):
        if cfg.kd_trainer:
            from .trainer import AxolotlKDTrainer
            return AxolotlKDTrainer
        return None
    def get_training_args(self, cfg):
        return {
            "kd_ce_alpha": cfg.kd_ce_alpha,
            "kd_alpha": cfg.kd_alpha,
            "kd_temperature": cfg.kd_temperature,
            "kd_beta": cfg.kd_beta,
            "kd_normalize_topk": cfg.kd_normalize_topk,
        }
    def get_collator_cls_and_kwargs(self, cfg, is_eval=False):
        if not cfg.kd_trainer:
            return None
        from .collator import DataCollatorForKD, KDBatchSamplerDataCollatorForSeq2Seq
        use_batch_sampler_collator = False
        if is_eval is False and cfg.sample_packing:
            use_batch_sampler_collator = True
        if cfg.eval_sample_packing and is_eval:
            use_batch_sampler_collator = True
        if cfg.kd_online_server_base_url:
            from .collator_online_teacher import OnlineTeacherCollator
            return OnlineTeacherCollator, {
                "kd_online_server_base_url": cfg.kd_online_server_base_url,
                "kd_online_topk": cfg.kd_online_topk,
                "kd_temperature": cfg.kd_temperature,
                "kd_online_server": cfg.kd_online_server,
                "kd_online_timeout": cfg.kd_online_timeout,
                "kd_normalize_topk": cfg.kd_normalize_topk,
            }
        if use_batch_sampler_collator:
            return KDBatchSamplerDataCollatorForSeq2Seq, {}
        return DataCollatorForKD, {}
    def pre_model_load(self, cfg):
        from .kernels.models import apply_kernel
        apply_kernel(cfg.model_config_type)
    def add_callbacks_post_trainer(self, cfg: Any, trainer: Trainer) -> list:
        """
        Adds temp scheduler callback to the Trainer instance.
        Args:
            cfg (Any): Configuration object containing the sparse recipe.
            trainer (Trainer): Huggingface Trainer instance.
        Returns:
            list: List containing the configured callback instances.
        """
        if cfg.kd_temperature_min is not None and cfg.kd_online_server_base_url:
            callback = KDTemperatureSchedulerCallback(
                cfg.kd_temperature,
                cfg.kd_temperature_min,
                trainer,
            )
            return [callback]
        return []
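
The branch order in `KDPlugin.get_collator_cls_and_kwargs` above gives the online teacher collator priority over sample packing. A minimal sketch of that order, assuming KD is enabled and returning class names as strings instead of importing the real collators; `select_collator_name` and the server URL are hypothetical, used only for illustration.

```python
from types import SimpleNamespace


def select_collator_name(cfg, is_eval=False):
    """Mirror the branch order of the KD plugin, returning names only."""
    use_batch_sampler = (not is_eval and bool(cfg.sample_packing)) or (
        is_eval and bool(cfg.eval_sample_packing)
    )
    # An online teacher server takes priority over any packing setting.
    if cfg.kd_online_server_base_url:
        return "OnlineTeacherCollator"
    if use_batch_sampler:
        return "KDBatchSamplerDataCollatorForSeq2Seq"
    return "DataCollatorForKD"


packed = SimpleNamespace(
    sample_packing=True,
    eval_sample_packing=False,
    kd_online_server_base_url=None,
)
online = SimpleNamespace(
    sample_packing=True,
    eval_sample_packing=False,
    kd_online_server_base_url="http://localhost:8888",  # hypothetical URL
)

train_choice = select_collator_name(packed)
eval_choice = select_collator_name(packed, is_eval=True)
online_choice = select_collator_name(online)
```

With packing enabled only for training, the train split gets the batch-sampler collator while the eval split falls back to the plain KD collator.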
--- a/src/axolotl/integrations/kd/args.py
+++ b/src/axolotl/integrations/kd/args.py
@@ -15,9 +15,19 @@
 """
 Plugin args for KD support.
 """
-from typing import Optional
+from dataclasses import dataclass
 from enum import Enum
-from pydantic import BaseModel
+from pydantic import BaseModel, Field
 class InferenceServerType(str, Enum):
    """
    Online inferences server types to handle different request args
    """
    vllm = "vllm"  # pylint: disable=invalid-name
    sglang = "sglang"  # pylint: disable=invalid-name
 class KDArgs(BaseModel):
@@ -25,13 +35,41 @@ class KDArgs(BaseModel):
    Input args for knowledge distillation.
    """
-    kd_trainer: Optional[bool] = None  # whether to use KD trainer
+    kd_trainer: bool | None = None  # whether to use KD trainer
-    kd_ce_alpha: Optional[float] = (
+    kd_ce_alpha: float | None = (
        None  # loss coefficient for cross-entropy loss during KD
    )
-    kd_alpha: Optional[float] = None  # loss coefficient for KD loss
+    kd_alpha: float | None = None  # loss coefficient for KD loss
-    kd_temperature: Optional[float] = None  # temperature for sampling during KD
+    kd_temperature: float | None = None  # temperature for sampling during KD
-    kd_zscore_base_temp: Optional[float] = None  # base temperature for zscore scaling
+    kd_beta: float | None = None  # beta coefficient for ratio of fwd and reverse KL
-    kd_top_k_before_softmax: Optional[bool] = (
+    kd_normalize_topk: bool | None = (
-        None  # whether to sample top k before softmax during KD
+        None  # whether to normalize student logits during KD
    )
    # TODO online kd
    kd_online_server_base_url: str | None = None
    kd_online_topk: int | None = None
    kd_online_server: InferenceServerType | None = Field(
        default_factory=lambda: InferenceServerType.vllm
    )
    kd_online_timeout: int | None = 120
    kd_temperature_min: float | None = (
        None  # kd temperature scheduling during online kd
    )
@dataclass
 class KDTrainingArgsMixin:
    """
    Additional args for KD training.
    """
    kd_ce_alpha: float | None = (
        None  # loss coefficient for cross-entropy loss during KD
    )
    kd_alpha: float | None = None  # loss coefficient for KD loss
    kd_temperature: float | None = None  # temperature for sampling during KD
    kd_beta: float | None = None  # beta coefficient for ratio of fwd and reverse KL
    kd_normalize_topk: bool | None = (
        None  # whether to normalize student logits during KD
    )
--- a/src/axolotl/integrations/kd/callbacks.py
+++ b/src/axolotl/integrations/kd/callbacks.py
@@ -0,0 +1,36 @@
 """
 Transformers trainer callbacks to schedule the KD temperature during training
 """
 import math
 from transformers.trainer_callback import TrainerCallback
 class KDTemperatureSchedulerCallback(TrainerCallback):
    """
    KD temperature scheduler callback for the trainer.
    """
    def __init__(self, temperature_start, temperature_min, trainer):
        self.temperature_start = temperature_start
        self.temperature_min = temperature_min
        self.temperature = temperature_start
        self.trainer = trainer
    def on_step_end(
        self, args, state, control, **kwargs
    ):  # pylint: disable=unused-argument
        # cosine decay temperature over the max steps
        progress = state.global_step / state.max_steps
        # Cosine decay factor: 0.5 * (1 + cos(pi * progress))
        # This factor goes from 1 (at progress=0) to 0 (at progress=1)
        decay_factor = 0.5 * (1.0 + math.cos(math.pi * progress))
        self.temperature = self.temperature_start - (
            (self.temperature_start - self.temperature_min) * (1.0 - decay_factor)
        )
        if hasattr(self.trainer.data_collator, "kd_temperature"):
            self.trainer.data_collator.kd_temperature = self.temperature
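The schedule applied in `on_step_end` above can be checked in isolation. The following is a minimal sketch, not part of the patch: `cosine_kd_temperature` is a hypothetical helper name, and the clamping of `progress` to [0, 1] is an added assumption that the callback itself does not make.

```python
import math


def cosine_kd_temperature(
    step: int, max_steps: int, t_start: float, t_min: float
) -> float:
    """Cosine-decay a KD temperature from t_start (step 0) to t_min (max_steps).

    Hypothetical standalone helper mirroring the arithmetic in the callback.
    """
    # Assumption: clamp progress so steps past max_steps stay at t_min.
    progress = min(max(step / max_steps, 0.0), 1.0)
    # Goes from 1 (progress=0) to 0 (progress=1).
    decay_factor = 0.5 * (1.0 + math.cos(math.pi * progress))
    # Algebraically the same as
    # t_start - (t_start - t_min) * (1 - decay_factor).
    return t_min + (t_start - t_min) * decay_factor


if __name__ == "__main__":
    for step in (0, 250, 500, 750, 1000):
        print(step, round(cosine_kd_temperature(step, 1000, 2.0, 1.0), 4))
```

Writing the result as `t_min + (t_start - t_min) * decay_factor` makes the two endpoints easy to read off: the factor is 1 at the first step and 0 at the last.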
--- a/src/axolotl/integrations/kd/chat_template.py
+++ b/src/axolotl/integrations/kd/chat_template.py
@@ -15,12 +15,15 @@
 """
 Chat template prompt strategy loader with KD support
 """
 import logging
 from typing import Any, Dict
 import torch
 from axolotl.prompt_strategies.chat_template import ChatTemplateStrategy, StrategyLoader
 LOG = logging.getLogger(__name__)
 class ChatTemplateStrategyWithKD(ChatTemplateStrategy):
    """
@@ -101,10 +104,8 @@ class ChatTemplateStrategyWithKD(ChatTemplateStrategy):
        # fill with -inf for padding_len tokens for top_k tokens
        # extend target_logprobs with a padding_len x top_k 2D list filled with -inf
-        # for causal models, if we start the range at 1, then we don't need to shift in the trainer
-        # otherwise, we need to shift in the trainer
-        shift = 0
-        for _ in range(shift, input_padding_len):
+        # we shift for causal models in the trainer, so start the range from 0
+        for _ in range(0, input_padding_len):
            target_logprobs.append([-float("inf")] * top_k)
            target_token_ids.append(list(range(top_k)))
            target_mask.append([0] * top_k)
@@ -143,6 +144,10 @@ class ChatTemplateStrategyWithKD(ChatTemplateStrategy):
            #
            # Convert from log to probability
            teacher_probs_t1 = position_logprobs_tensor.exp()
            # normalize probabilities to sum to 1 in case they aren't already
            teacher_probs_t1_sum = teacher_probs_t1.sum(dim=0, keepdim=True)
            if teacher_probs_t1_sum > 1e-9:
                teacher_probs_t1 = teacher_probs_t1 / teacher_probs_t1_sum
            if self.kd_temperature != self.gen_temperature:
                # Exponentiate by factor (T1 / T2)
                exponent = self.gen_temperature / self.kd_temperature
@@ -162,12 +167,6 @@ class ChatTemplateStrategyWithKD(ChatTemplateStrategy):
            target_logprobs.append(position_logprobs_scaled)
            target_token_ids.append(position_token_ids)
-        if shift == 1:
-            # since we started at index 1 for causal, we need one more padding token
-            target_logprobs.append([-float("inf")] * top_k)
-            target_token_ids.append(list(range(top_k)))
-            target_mask.append([0] * top_k)
        # Update sample with transformed logprobs
        sample["target_logprobs"] = target_logprobs
        sample["target_token_ids"] = target_token_ids
@@ -184,13 +183,124 @@ class ChatTemplateStrategyWithKD(ChatTemplateStrategy):
        return tokenized_prompt
 class ChatTemplateStrategyWithKDv2(ChatTemplateStrategyWithKD):
    """
    Strat for datasets with complete structured KD logprob data
    """
    def transform_logprobs(self, sample):
        """
        Transform logprobs to target format for KD training
        """
        # pylint: disable=duplicate-code
        logprobs = sample.pop(self.logprobs_field)
        target_seq_len = len(logprobs)
        input_seq_len = len(sample["input_ids"])
        input_padding_len = input_seq_len - target_seq_len
        # get non-zero top-k (prune None logprobs from vllm data step)
        top_k_vals = [
            len(logprobs[i])
            for i in range(len(logprobs))
            if logprobs[i] is not None and len(logprobs[i])
        ]
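        if not top_k_vals:
            # max()/min() raise on an empty sequence, so check before using them
            raise ValueError("No non-zero top-k logprobs found.")
        # take the smaller of the most frequent and least frequent top-k sizes
        max_top_k = max(set(top_k_vals), key=top_k_vals.count)
        min_top_k = min(set(top_k_vals), key=top_k_vals.count)
        top_k = min(max_top_k, min_top_k)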
        target_logprobs = []
        target_token_ids = []
        target_mask = []
        if input_padding_len < 0:
            # logprobs is longer than input_seq_len, so drop the excess entries
            # from the left/beginning and keep the last input_seq_len positions
            logprobs = logprobs[-input_seq_len:]
            input_padding_len = 0
            # target_seq_len = input_seq_len
        # truncate the second dimension of the logprobs to top_k
        logprobs = [row[:top_k] for row in logprobs]
        # fill with -inf for padding_len tokens for top_k tokens
        # extend target_logprobs with a padding_len x top_k 2D list filled with -inf
        # we shift for causal models in the trainer, so start the range from 0
        for _ in range(0, input_padding_len):
            target_logprobs.append([-float("inf")] * top_k)
            target_token_ids.append(list(range(top_k)))
            target_mask.append([0] * top_k)
        for position in range(input_padding_len, input_seq_len):
            if sample["labels"][position] == -100:
                target_mask.append([0] * top_k)
            else:
                target_mask.append([1] * top_k)
        for token_pos_logprobs, pos_target_token_ids in zip(
            logprobs, sample["target_token_ids"]
        ):
            # Convert to a tensor for easier manipulation
            position_logprobs_tensor = torch.tensor(
                token_pos_logprobs, dtype=torch.float
            )
            # Now we have distribution at T1 in log form, i.e. log p_{T1}(k).
            # Next, re-scale to T2 = self.kd_temperature via exponent-based trick
            # p_{T2}(k) = [p_{T1}(k)]^(T1 / T2) / Z
            #
            # Convert from log to probability
            teacher_probs_t1 = position_logprobs_tensor.exp()
            # normalize probabilities to sum to 1 in case they aren't already
            teacher_probs_t1_sum = teacher_probs_t1.sum(dim=0, keepdim=True)
            if teacher_probs_t1_sum > 1e-9:
                teacher_probs_t1 = teacher_probs_t1 / teacher_probs_t1_sum
            if self.kd_temperature != self.gen_temperature:
                # Exponentiate by factor (T1 / T2)
                exponent = self.gen_temperature / self.kd_temperature
                teacher_probs_t2 = teacher_probs_t1**exponent
            else:
                teacher_probs_t2 = teacher_probs_t1
            # Re-normalize
            teacher_probs_t2 = teacher_probs_t2 / teacher_probs_t2.sum(
                dim=0, keepdim=True
            )
            # Convert back to log
            position_logprobs_tensor = torch.log(teacher_probs_t2)
            # Now we have log p_{teacher, T2}(k) stored in position_logprobs_tensor
            position_logprobs_scaled = position_logprobs_tensor.tolist()
            target_logprobs.append(position_logprobs_scaled)
            target_token_ids.append(pos_target_token_ids)
        # Update sample with transformed logprobs
        sample["target_logprobs"] = target_logprobs
        sample["target_token_ids"] = target_token_ids
        sample["target_mask"] = target_mask
        return sample
    def _tokenize_single_prompt(self, prompt):
        logprobs = prompt.pop(self.logprobs_field)
        target_token_ids = prompt.pop("target_token_ids")
        tokenized_prompt = super()._tokenize_single_prompt(prompt)
        tokenized_prompt[self.logprobs_field] = logprobs
        tokenized_prompt["target_token_ids"] = target_token_ids
        tokenized_prompt = self.transform_logprobs(tokenized_prompt)
        return tokenized_prompt
 class KDStrategyLoader(StrategyLoader):
    """
    Load ChatTemplateStrategy with KD support using StrategyLoader.
    """
    def _get_strategy_cls(self):
-        return ChatTemplateStrategyWithKD
+        return ChatTemplateStrategyWithKDv2
    def _get_strategy_params(self, cfg, ds_cfg: Dict[str, Any]):
        strategy_params = super()._get_strategy_params(cfg, ds_cfg)
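The temperature re-scaling in `transform_logprobs` above, p_{T2}(k) = [p_{T1}(k)]^(T1 / T2) / Z, can be illustrated without torch. This is a minimal sketch under stated assumptions: `rescale_topk_logprobs` is a hypothetical name, it works on a single position's top-k list, and the handling of zero-probability entries (mapped back to `-inf`) is an addition that the patch does not spell out.

```python
import math
from typing import List


def rescale_topk_logprobs(
    logprobs: List[float], gen_temperature: float, kd_temperature: float
) -> List[float]:
    """Re-scale top-k logprobs generated at T1 to a target temperature T2.

    Hypothetical helper; mirrors the exponent-based trick on plain lists.
    """
    # Convert from log to probability; exp(-inf) is 0.0.
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    # Normalize over the top-k subset in case it does not sum to 1.
    if total > 1e-9:
        probs = [p / total for p in probs]
    if kd_temperature != gen_temperature:
        exponent = gen_temperature / kd_temperature  # T1 / T2
        probs = [p**exponent for p in probs]
    # Re-normalize (the Z term) and convert back to log space.
    z = sum(probs)
    # Assumption: zero-probability entries map back to -inf.
    return [math.log(p / z) if p > 0.0 else -math.inf for p in probs]


if __name__ == "__main__":
    teacher = [math.log(0.5), math.log(0.25), math.log(0.25)]
    softened = rescale_topk_logprobs(teacher, gen_temperature=1.0, kd_temperature=2.0)
    print([round(math.exp(lp), 4) for lp in softened])
```

With `kd_temperature` above `gen_temperature` the exponent is below 1, so the distribution flattens; the result still sums to 1 over the top-k entries because of the final normalization.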
--- a/src/axolotl/integrations/kd/collator.py
+++ b/src/axolotl/integrations/kd/collator.py
@@ -47,11 +47,16 @@ class DataCollatorForKD(DataCollatorForSeq2Seq):
    position_pad_token_id: int = 0
    return_tensors: str = "pt"
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.tokenizer.deprecation_warnings["Asking-to-pad-a-fast-tokenizer"] = True
    def __call__(self, features, return_tensors=None):
        if return_tensors is None:
            return_tensors = self.return_tensors
        padding_side = self.tokenizer.padding_side
        max_len = 0
        # Pad labels and position_ids first
        for feature_name, pad_token_id in [
@@ -102,7 +107,9 @@ class DataCollatorForKD(DataCollatorForSeq2Seq):
                target_mask_list.append(f.pop("target_mask"))
            # Determine max lengths
-            max_teacher_seq_len = max(len(seq) for seq in target_logprobs_list)
+            max_teacher_seq_len = max_len or max(
+                len(seq) for seq in target_logprobs_list
+            )
            max_k = max(len(seq_k) for seq in target_logprobs_list for seq_k in seq)
            padded_target_logprobs = []
@@ -209,7 +216,9 @@ class KDBatchSamplerDataCollatorForSeq2Seq(DataCollatorForKD):
        #    We want to produce a single "merged" feature dict for each sub-batch.
        out_features = [{} for _ in features]
-        for i, sub_features in enumerate(features):
+        for i, sub_features in enumerate(  # pylint: disable=too-many-nested-blocks
+            features
+        ):
            # sub_features is a list of dicts, each dict = one sequence’s features
            # We'll merge them into out_features[i].
            #
@@ -243,10 +252,17 @@ class KDBatchSamplerDataCollatorForSeq2Seq(DataCollatorForKD):
                    # For example, input_ids or labels are often arrays.
                    arrays = []
                    for feat in sub_features:
-                        if field_name in feat:
+                        if field_name in feat and isinstance(
+                            feat[field_name], (list, torch.Tensor)
+                        ):
+                            if isinstance(
+                                feat[field_name][0], (dict, str)
+                            ):  # pylint: disable=too-many-nested-blocks
+                                continue
                            arr = np.array(feat[field_name])
                            arrays.append(arr)
-                    out_features[i][field_name] = np.concatenate(arrays)
+                    if arrays:
+                        out_features[i][field_name] = np.concatenate(arrays)
        # 3) Now call the parent collator, which will do:
        #    - padding of labels/position_ids
--- a/src/axolotl/integrations/kd/collator_online_teacher.py
+++ b/src/axolotl/integrations/kd/collator_online_teacher.py
@@ -0,0 +1,561 @@
 """
 Packed data loader for online teacher training supporting vllm and sglang.
 """
 import hashlib
 import hmac
 import logging
 from typing import Any, Dict, List, Optional
 import requests
 import torch
 import orjson
 from axolotl.integrations.kd.collator import KDBatchSamplerDataCollatorForSeq2Seq
 from axolotl.integrations.kd.utils import normalize_logprobs
 from axolotl.utils.data.utils import retry_on_request_exceptions
 LOG = logging.getLogger(__name__)
 def hmac_sha_from_int_list(int_list, key, hash_func=hashlib.sha256):
    """
    Create HMAC-SHA hash from a list of integers
    Args:
        int_list: List of integers
        key: Secret key (string or bytes)
        hash_func: Hash function (default: sha256)
    Returns:
        HMAC digest as hex string
    """
    # Convert key to bytes if it's a string
    if isinstance(key, str):
        key = key.encode("utf-8")
    # Convert list of ints to bytes
    # Method 1: Convert each int to bytes and concatenate
    data = b"".join(i.to_bytes(4, byteorder="big") for i in int_list)
    # Create HMAC
    h = hmac.new(key, data, hash_func)
    return h.hexdigest()
 class OnlineTeacherCollator(KDBatchSamplerDataCollatorForSeq2Seq):
    """
    Collator for online teacher training.
    """
    DEFAULT_LABEL_PAD_TOKEN_ID: int = -100
    def __init__(
        self,
        *args: Any,
        kd_online_server_base_url: Optional[str] = None,
        kd_online_topk: Optional[int] = None,
        kd_temperature: Optional[float] = 1.0,
        kd_online_server: Optional[str] = "vllm",
        kd_online_timeout: Optional[int] = 120,
        kd_cache_dir: Optional[str] = None,
        kd_normalize_topk: Optional[bool] = True,
        **kwargs: Any,
    ):
        super().__init__(*args, **kwargs)
        if kd_online_server_base_url is None:
            raise ValueError(
                "kd_online_server_base_url must be provided for OnlineTeacherCollator"
            )
        if kd_online_topk is None or kd_online_topk <= 0:
            raise ValueError(
                "kd_online_topk must be a positive integer for OnlineTeacherCollator"
            )
        self.kd_online_server_base_url = kd_online_server_base_url.rstrip("/")
        self.kd_online_topk = kd_online_topk
        self.kd_temperature = kd_temperature
        self.kd_online_server = kd_online_server
        self.http_session = requests.Session()
        self.kd_online_timeout = kd_online_timeout
        self.kd_cache_dir = kd_cache_dir
        self.kd_normalize_topk = kd_normalize_topk
    def _normalize_logprobs(self, raw_logprobs: List[float]) -> List[float]:
        """
        Re-normalizes top-k raw logprobs as probabilities, and converts back to logprobs.
        """
        if not raw_logprobs or self.kd_online_topk == 0:
            return (
                [-float("inf")] * self.kd_online_topk if self.kd_online_topk > 0 else []
            )
        raw_logprobs_tensor = torch.tensor(raw_logprobs, dtype=torch.float32)
        return normalize_logprobs(raw_logprobs_tensor, self.kd_online_topk).tolist()
    @retry_on_request_exceptions(max_retries=10, delay=5)
    def fetch_online_logprobs_sglang(
        self, batch_input_ids: List[List[int]], labels: List[List[int]]
    ):
        """
        Fetches logprobs from an online teacher served by sglang for a batch of input_ids.
        Assumes each position's top logprobs are returned as [logprob, token_id, token_str] entries.
        """
        api_endpoint = f"{self.kd_online_server_base_url}/generate"
        payload = {
            "input_ids": batch_input_ids,
            "return_logprob": True,
            "top_logprobs_num": self.kd_online_topk,
            "logprob_start_len": 0,
            "return_text_in_logprobs": True,
            "echo": True,
            "sampling_params": {
                "max_new_tokens": 0,
                "temperature": self.kd_temperature,
                "skip_special_tokens": False,
            },
        }
        # Initialize with empty lists; these are returned as-is only if the response format is invalid.
        ret_data_target_token_ids: List[List[List[int]]] = []
        ret_data_target_logprobs: List[List[List[float]]] = []
        ret_data_target_mask: List[List[List[int]]] = []
        try:
            response = self.http_session.post(
                api_endpoint, json=payload, timeout=self.kd_online_timeout
            )
            response.raise_for_status()
            api_data: list[dict] = response.json()
            # Ensure api_data is a list, and its length matches batch_input_ids
            if not isinstance(api_data, list) or len(api_data) != len(batch_input_ids):
                LOG.error(
                    f"API response format error. Expected a list of {len(batch_input_ids)} "
                    f"items, got {type(api_data)} with length {len(api_data) if isinstance(api_data, list) else 'N/A'}."
                )
                # Return empty data; items processed later will get default empty KD fields
                return {
                    "target_token_ids": ret_data_target_token_ids,
                    "target_logprobs": ret_data_target_logprobs,
                    "target_mask": ret_data_target_mask,
                }
            for sequence_data, seq_input_ids, seq_labels in zip(
                api_data, batch_input_ids, labels
            ):
                current_target_logprobs = []
                current_target_token_ids = []
                current_target_mask = []
                meta_info = sequence_data.pop("meta_info", {})
                # Ensure input_top_logprobs is a list
                input_top_logprobs: Optional[list[None | list[tuple]]] = meta_info.pop(
                    "input_top_logprobs", []
                )
                if not isinstance(input_top_logprobs, list):
                    LOG.warning(
                        f"Received non-list input_top_logprobs: {input_top_logprobs}. Skipping sequence."
                    )
                    input_top_logprobs = []  # Treat as empty
                # basic check that the logprob data len matches the input len, so no need to handle padding
                assert len(seq_input_ids) == len(input_top_logprobs)
                for i, _, label in zip(
                    range(len(seq_input_ids)), seq_input_ids, seq_labels
                ):
                    if i < len(input_top_logprobs) and input_top_logprobs[i] is None:
                        # this is always the case for the first token.
                        # there is never logprob data for the first token since that's a true input
                        # so we replace the None value with padding data
                        current_target_logprobs.append(
                            [-float("inf")] * self.kd_online_topk
                        )
                        current_target_token_ids.append([0] * self.kd_online_topk)
                        current_target_mask.append([0] * self.kd_online_topk)
                    elif (
                        i < len(input_top_logprobs)
                        and input_top_logprobs[i] is not None
                    ):
                        pos_top_logprobs_data = input_top_logprobs[i]
                        # Ensure pos_top_logprobs_data is a list of lists as expected
                        if not (
                            isinstance(pos_top_logprobs_data, list)
                            and all(
                                isinstance(item, list) for item in pos_top_logprobs_data
                            )
                            and len(pos_top_logprobs_data) > 0
                            and len(pos_top_logprobs_data[0]) == 3
                        ):  # [logprob, token_id, token_str]
                            LOG.warning(
                                f"Malformed pos_top_logprobs_data: {pos_top_logprobs_data}. Padding this position."
                            )
                            current_target_logprobs.append(
                                [-float("inf")] * self.kd_online_topk
                            )
                            current_target_token_ids.append([0] * self.kd_online_topk)
                            current_target_mask.append([0] * self.kd_online_topk)
                            continue
                        # pos_top_logprobs: list of logprobs, pos_token_ids: list of token_ids
                        pos_logprobs_raw, pos_token_ids, _ = [
                            list(row) for row in zip(*pos_top_logprobs_data)
                        ]
                        # Ensure correct length (top_k)
                        if len(pos_logprobs_raw) < self.kd_online_topk:
                            pad_len = self.kd_online_topk - len(pos_logprobs_raw)
                            pos_logprobs_raw.extend([-float("inf")] * pad_len)
                            pos_token_ids.extend([0] * pad_len)  # Pad with 0 token_id
                        # truncate to top_k in case the response was longer
                        current_target_token_ids.append(
                            pos_token_ids[: self.kd_online_topk]
                        )
                        if self.kd_normalize_topk:
                            normalized_logprobs_for_position = self._normalize_logprobs(
                                pos_logprobs_raw[: self.kd_online_topk]
                            )
                            current_target_logprobs.append(
                                normalized_logprobs_for_position
                            )
                        else:
                            current_target_logprobs.append(
                                pos_logprobs_raw[: self.kd_online_topk]
                            )
                        # Mask depends on the corresponding label for the student
                        if label == self.DEFAULT_LABEL_PAD_TOKEN_ID:
                            current_target_mask.append([0] * self.kd_online_topk)
                        else:
                            current_target_mask.append([1] * self.kd_online_topk)
                    else:
                        # Pad if no logprobs for this position (either due to length mismatch or None entry)
                        current_target_logprobs.append(
                            [-float("inf")] * self.kd_online_topk
                        )
                        current_target_token_ids.append([0] * self.kd_online_topk)
                        current_target_mask.append([0] * self.kd_online_topk)
                ret_data_target_token_ids.append(current_target_token_ids)
                ret_data_target_logprobs.append(current_target_logprobs)
                ret_data_target_mask.append(current_target_mask)
        except requests.exceptions.RequestException as e:
            LOG.error(f"Error fetching logprobs from online teacher: {e}")
            # re-raise so the failure reaches the retry decorator and the caller
            raise
        except Exception as e:  # Catch other potential errors during processing
            LOG.error(
                f"Unexpected error processing API response in fetch_online_logprobs: {e}",
                exc_info=True,
            )
            raise e
        return {
            "target_token_ids": ret_data_target_token_ids,
            "target_logprobs": ret_data_target_logprobs,
            "target_mask": ret_data_target_mask,
        }
    @retry_on_request_exceptions(max_retries=10, delay=5)
    def fetch_online_logprobs_vllm(
        self, batch_input_ids: List[List[int]], labels: List[List[int]]
    ):
        """
        Fetches logprobs from an online teacher served by vllm for a batch of input_ids.
        Assumes API returns token IDs as strings in logprob dictionary keys.
        """
        api_endpoint = f"{self.kd_online_server_base_url}/v1/completions"
        payload = {
            "prompt": batch_input_ids,
            "echo": True,
            "logprobs": True,
            "prompt_logprobs": self.kd_online_topk,
            "top_logprobs": self.kd_online_topk,
            "max_new_tokens": 0,
            "skip_special_tokens": False,
            "temperature": self.kd_temperature,
            "sampling_params": {
                "max_tokens": 0,
            },
        }
        # Initialize with empty lists; these are returned as-is only if the response format is invalid.
        ret_data_target_token_ids: List[List[List[int]]] = []
        ret_data_target_logprobs: List[List[List[float]]] = []
        ret_data_target_mask: List[List[List[int]]] = []
        try:
            headers = {"Accept-Encoding": "deflate, gzip, br, zstd"}
            response = self.http_session.post(
                api_endpoint,
                json=payload,
                headers=headers,
                timeout=self.kd_online_timeout,
            )
            response.raise_for_status()
            api_data: dict = orjson.loads(response.content)
            choices: list[dict] = api_data["choices"]
            # Ensure choices is a list, and its length matches batch_input_ids
            if not isinstance(choices, list) or len(choices) != len(batch_input_ids):
                LOG.error(
                    f"API response format error. Expected a list of {len(batch_input_ids)} "
                    f"items, got {type(choices)} with length {len(choices) if isinstance(choices, list) else 'N/A'}."
                )
                # Return empty data; items processed later will get default empty KD fields
                return {
                    "target_token_ids": ret_data_target_token_ids,
                    "target_logprobs": ret_data_target_logprobs,
                    "target_mask": ret_data_target_mask,
                }
            for sequence_data, seq_input_ids, seq_labels in zip(
                choices, batch_input_ids, labels
            ):
                # seq_input_ids: List[int]
                # seq_labels: List[int]
                current_target_logprobs = []
                current_target_token_ids = []
                current_target_mask = []
                # Ensure input_top_logprobs is a list
                input_top_logprobs: Optional[list[None | dict[str, dict]]] = (
                    sequence_data.pop("prompt_logprobs", [])
                )
                if not isinstance(input_top_logprobs, list):
                    LOG.warning(
                        f"Received non-list input_top_logprobs: {input_top_logprobs}. Skipping sequence."
                    )
                    input_top_logprobs = []  # Treat as empty
                # basic check that the logprob data len matches the input len, so no need to handle padding
                assert len(seq_input_ids) == len(input_top_logprobs)
                seq_len = len(seq_input_ids)
                for i, _, label in zip(range(seq_len), seq_input_ids, seq_labels):
                    if i < len(input_top_logprobs) and input_top_logprobs[i] is None:
                        # this is always the case for the first token.
                        # there is never logprob data for the first token since that's a true input
                        continue
                    if (
                        i < len(input_top_logprobs)
                        and input_top_logprobs[i] is not None
                    ):
                        pos_top_logprobs_data: dict[str, dict] = input_top_logprobs[i]  # type: ignore[assignment]
                        # Ensure pos_top_logprobs_data is a non-empty dict of dicts as expected
                        if not (
                            isinstance(pos_top_logprobs_data, dict)
                            and all(
                                isinstance(item, dict)
                                for item in pos_top_logprobs_data.values()
                            )
                            and len(pos_top_logprobs_data.keys()) > 0
                        ):  # {token_id_str: {"logprob": ...}}
                            LOG.warning(
                                f"Malformed pos_top_logprobs_data: {pos_top_logprobs_data}. Padding this position."
                            )
                            current_target_logprobs.append(
                                [-float("inf")] * self.kd_online_topk
                            )
                            current_target_token_ids.append(
                                list(range(self.kd_online_topk))
                            )
                            current_target_mask.append([0] * self.kd_online_topk)
                            continue
                        # pos_top_logprobs: list of logprobs, pos_token_ids: list of token_ids
                        pos_token_ids_str = list(pos_top_logprobs_data.keys())
                        pos_logprobs_dict = pos_top_logprobs_data.values()
                        pos_token_ids = [
                            int(token_id) for token_id in pos_token_ids_str
                        ]
                        pos_logprobs_raw = [
                            float(logprob.get("logprob", -float("inf")))
                            for logprob in pos_logprobs_dict
                        ]
                        # Ensure correct length (top_k)
                        if len(pos_logprobs_raw) < self.kd_online_topk:
                            pad_len = self.kd_online_topk - len(pos_logprobs_raw)
                            LOG.warning(
                                f"Padding position {i} with {pad_len} top-k tokens and logprobs."
                            )
                            pos_logprobs_raw.extend([-float("inf")] * pad_len)
                            pos_token_ids.extend([0] * pad_len)  # Pad with 0 token_id
                        # truncate to top_k in case the response was longer
                        current_target_token_ids.append(
                            pos_token_ids[: self.kd_online_topk]
                        )
                        if self.kd_normalize_topk:
                            normalized_logprobs_for_position = self._normalize_logprobs(
                                pos_logprobs_raw[: self.kd_online_topk]
                            )
                            current_target_logprobs.append(
                                normalized_logprobs_for_position
                            )
                        else:
                            current_target_logprobs.append(
                                pos_logprobs_raw[: self.kd_online_topk]
                            )
                        # Mask depends on the corresponding label for the student
                        if label == self.DEFAULT_LABEL_PAD_TOKEN_ID:
                            current_target_mask.append([0] * self.kd_online_topk)
                        else:
                            current_target_mask.append([1] * self.kd_online_topk)
                    else:
                        # Pad if no logprobs for this position (either due to length mismatch or None entry)
                        current_target_logprobs.append(
                            [-float("inf")] * self.kd_online_topk
                        )
                        current_target_token_ids.append(
                            list(range(self.kd_online_topk))
                        )
                        current_target_mask.append([0] * self.kd_online_topk)
                for i in range(max(0, seq_len - len(current_target_logprobs))):
                    current_target_logprobs.append(
                        [-float("inf")] * self.kd_online_topk
                    )
                    current_target_token_ids.append(list(range(self.kd_online_topk)))
                    current_target_mask.append([0] * self.kd_online_topk)
                ret_data_target_token_ids.append(current_target_token_ids)
                ret_data_target_logprobs.append(current_target_logprobs)
                ret_data_target_mask.append(current_target_mask)
                # TODO save and load targets to disk for caching for next epoch
                # generate a hmac SHA256 hash over the list seq_input_ids and convert it to an int
                # if self.kd_cache_dir:
                #     hash_input_ids = hmac_sha_from_int_list(
                #         seq_input_ids, f"{self.kd_online_server_base_url}:{self.kd_online_topk}"
                #     )
                #     with open(f"{self.kd_cache_dir}/{hash_input_ids}.parquet", "wb") as f:
                #         pd.DataFrame(ret_logprobs_data).to_parquet(f, index=False)
        except requests.exceptions.RequestException as e:
            LOG.error(f"Error fetching logprobs from online teacher: {e}")
            # Re-raise so the caller sees the failure; no partial results are returned.
            raise e
        except Exception as e:  # Catch other potential errors during processing
            LOG.error(
                f"Unexpected error processing API response in fetch_online_logprobs: {e}",
                exc_info=True,
            )
            raise e
        return {
            "target_token_ids": ret_data_target_token_ids,
            "target_logprobs": ret_data_target_logprobs,
            "target_mask": ret_data_target_mask,
        }
    def __call__(
        self, features: List[List[Dict[str, Any]]], return_tensors: Optional[str] = None
    ) -> Dict[str, Any]:
        if not features:
            return super().__call__(features, return_tensors=return_tensors)
        for (
            sub_batch_features
        ) in features:  # sub_batch_features is List[Dict[str, Any]]
            if not sub_batch_features:
                continue
            input_ids_for_api_call: List[List[int]] = []
            labels_for_api_call: List[List[int]] = []
            # Store references to the original item dictionaries to update them in-place
            items_for_api_call: List[Dict[str, Any]] = []
            for item_dict in sub_batch_features:
                if not isinstance(item_dict, dict):
                    LOG.warning(
                        f"Skipping non-dict item in sub_batch_features: {item_dict}"
                    )
                    continue
                current_input_ids = item_dict.get("input_ids")
                current_labels = item_dict.get("labels")
                if current_input_ids is not None and current_labels is not None:
                    # Ensure input_ids and labels are lists of ints for JSON serialization
                    input_ids_list = (
                        current_input_ids.tolist()
                        if hasattr(current_input_ids, "tolist")
                        else list(current_input_ids)
                    )
                    labels_list = (
                        current_labels.tolist()
                        if hasattr(current_labels, "tolist")
                        else list(current_labels)
                    )
                    input_ids_for_api_call.append(input_ids_list)
                    labels_for_api_call.append(labels_list)
                    items_for_api_call.append(item_dict)
                else:
                    # This item will not get teacher logprobs from the API.
                    # Initialize KD fields to empty lists so downstream collators handle them uniformly.
                    item_dict.setdefault("target_token_ids", [])
                    item_dict.setdefault("target_logprobs", [])
                    item_dict.setdefault("target_mask", [])
            # print(items_for_api_call)
            if items_for_api_call:  # Only call API if there's something to process
                if self.kd_online_server == "sglang":
                    api_responses_for_sub_batch = self.fetch_online_logprobs_sglang(
                        input_ids_for_api_call, labels_for_api_call
                    )
                else:
                    api_responses_for_sub_batch = self.fetch_online_logprobs_vllm(
                        input_ids_for_api_call, labels_for_api_call
                    )
                # api_responses_for_sub_batch has keys: "target_token_ids", "target_logprobs", "target_mask"
                # Each value is a list, corresponding to items_for_api_call
                for i, item_to_update in enumerate(items_for_api_call):
                    # TODO make sure to figure out which input in sub_batch_features to update the batch in the original `features` object so the super class can handle it properly.
                    if api_responses_for_sub_batch and i < len(
                        api_responses_for_sub_batch["target_token_ids"]
                    ):  # Check bounds
                        assert len(
                            api_responses_for_sub_batch["target_token_ids"][i]
                        ) == len(item_to_update["input_ids"])
                        assert len(
                            api_responses_for_sub_batch["target_logprobs"][i]
                        ) == len(item_to_update["input_ids"])
                        assert len(
                            api_responses_for_sub_batch["target_mask"][i]
                        ) == len(item_to_update["labels"])
                        item_to_update["target_token_ids"] = (
                            api_responses_for_sub_batch["target_token_ids"][i]
                        )
                        item_to_update["target_logprobs"] = api_responses_for_sub_batch[
                            "target_logprobs"
                        ][i]
                        item_to_update["target_mask"] = api_responses_for_sub_batch[
                            "target_mask"
                        ][i]
                    else:
                        # API call failed for this item, or response was shorter than expected.
                        # Ensure KD fields are initialized as empty lists.
                        LOG.warning(
                            f"API call failed for item (index {i}), or API response was too short. "
                            f"API response keys: {list(api_responses_for_sub_batch.keys()) if api_responses_for_sub_batch else 'None'}"
                        )
                        item_to_update.setdefault("target_token_ids", [])
                        item_to_update.setdefault("target_logprobs", [])
                        item_to_update.setdefault("target_mask", [])
        return super().__call__(features, return_tensors=return_tensors)
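The per-position handling in the fetch method above (pad short top-k responses, truncate long ones, and mask positions whose label is ignored) can be sketched in isolation. This is a minimal illustration, not the collator's code: the helper names `pad_or_truncate_topk` and `position_mask` are hypothetical, and the ignore label of -100 is an assumption standing in for `DEFAULT_LABEL_PAD_TOKEN_ID`.

```python
from typing import List, Tuple

NEG_INF = -float("inf")


def pad_or_truncate_topk(
    token_ids: List[int],
    logprobs: List[float],
    k: int,
    pad_token_id: int = 0,
) -> Tuple[List[int], List[float]]:
    """Return exactly k token ids and k logprobs for one sequence position."""
    if len(token_ids) != len(logprobs):
        raise ValueError("token_ids and logprobs must have the same length")
    pad_len = max(0, k - len(token_ids))
    # Padded slots get a logprob of -inf, i.e. zero probability mass.
    ids = (list(token_ids) + [pad_token_id] * pad_len)[:k]
    lps = (list(logprobs) + [NEG_INF] * pad_len)[:k]
    return ids, lps


def position_mask(label: int, k: int, ignore_label: int = -100) -> List[int]:
    """All-zero mask when the student label is ignored, all-one otherwise."""
    return [0 if label == ignore_label else 1] * k


if __name__ == "__main__":
    print(pad_or_truncate_topk([5, 9], [-0.1, -2.3], k=4))
    print(position_mask(-100, k=4))
```

Keeping every position at exactly `k` entries is what lets the downstream collator stack the three fields into rectangular tensors.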
--- a/src/axolotl/integrations/kd/kernels/liger.py
+++ b/src/axolotl/integrations/kd/kernels/liger.py
@@ -0,0 +1,485 @@
 """
 Liger Kernels for Chunked Top-K Log-Prob Distillation
 """
 import torch
 import torch.nn.functional as F
 from liger_kernel.chunked_loss.fused_linear_distillation import (
    LigerFusedLinearDistillationBase,
 )
 from axolotl.integrations.kd.utils import normalize_logprobs
 class LigerFusedLinearKLTopKLogprobFunction(LigerFusedLinearDistillationBase):
    """
    Chunked kl-div loss for top-k logprobs
    """
    @staticmethod
    def distillation_loss_fn(
        student_logits_temp_scaled: torch.Tensor,  # [chunk_size, vocab_size], already temp-scaled
        target_token_ids_chunk: torch.Tensor,  # [chunk_size, top_k]
        target_logprobs_chunk: torch.Tensor,  # [chunk_size, top_k], already temp-scaled and normalized logprobs
        target_mask_chunk: torch.Tensor,  # [chunk_size, top_k]
        beta: float = 0.0,
        normalize_topk: bool = True,
    ) -> torch.Tensor:
        """
        Compute Top-K KL divergence loss for a chunk.
        Args:
            student_logits_temp_scaled: Student logits, scaled by temperature. Shape: (N, V).
            target_token_ids_chunk: Top-k teacher token IDs. Shape: (N, K).
            target_logprobs_chunk: Top-k teacher log probabilities (temp-scaled, normalized). Shape: (N, K).
            target_mask_chunk: Mask for valid top-k tokens. Shape: (N, K).
            beta: Controls the type of KL divergence.
                  0.0 for Forward KL (P_teacher || P_student).
                  1.0 for Reverse KL (P_student || P_teacher).
                  Any other value: generalized Jensen-Shannon divergence against the
                  mixture (1 - beta) * P_student + beta * P_teacher.
            normalize_topk: Whether to normalize the log probabilities
        Returns:
            Sum of KL divergence losses for the chunk.
        """
        topk = target_token_ids_chunk.shape[-1]
        student_logits_temp_scaled = (  # [chunk_size, vocab_size]
            student_logits_temp_scaled.float()
        )
        target_logprobs_chunk = target_logprobs_chunk.float()
        # Gather student logits for the top-k teacher token IDs
        # target_token_ids_chunk: [chunk_size, top_k]
        # student_logits_topk_temp_scaled: [chunk_size, top_k]
        student_logits_topk_temp_scaled = torch.gather(
            student_logits_temp_scaled, dim=-1, index=target_token_ids_chunk
        )
        # Student log-probabilities for the gathered top-k tokens
        student_lse = torch.logsumexp(
            student_logits_temp_scaled, dim=-1, keepdim=True
        )  # [chunk_size, 1]
        student_logprobs_topk_temp_scaled = (
            student_logits_topk_temp_scaled - student_lse
        )
        # we have the top-k student logprobs, normalize them
        if normalize_topk:
            student_logprobs_topk_temp_scaled = normalize_logprobs(
                student_logprobs_topk_temp_scaled, topk
            )
        valid_mask = target_mask_chunk.to(torch.bool)  # [chunk_size, top_k]
        student_logprobs_topk_valid = student_logprobs_topk_temp_scaled[valid_mask]
        teacher_logprobs_valid = target_logprobs_chunk[valid_mask]
        # Teacher probabilities P(y|x_teacher) from logprobs
        # teacher_logprobs_valid are already normalized (log(softmax(teacher_logits/T)))
        teacher_probs_valid = teacher_logprobs_valid.exp()
        # Student probabilities P_student from log P_student
        student_probs_topk_valid = student_logprobs_topk_valid.exp()
        # KL divergence: sum(P_teacher * (log P_teacher - log P_student))
        # = sum(P_teacher * log P_teacher) - sum(P_teacher * log P_student)
        # The distillation loss is often formulated as -sum(P_teacher * log P_student)
        # or as sum(P_teacher * (log_softmax_teacher - log_softmax_student))
        # Here, teacher_logprobs_valid are log_softmax_teacher.
        # student_logprobs_topk_valid are log_softmax_student (for the selected K indices).
        if beta == 0.0:  # Contribution from Forward KL
            fwd_kl_per_token = teacher_probs_valid * (
                teacher_logprobs_valid - student_logprobs_topk_valid
            )
            kd_loss = fwd_kl_per_token.sum()
        elif beta == 1.0:  # Contribution from Reverse KL
            rev_kl_per_token = student_probs_topk_valid * (
                student_logprobs_topk_valid - teacher_logprobs_valid
            )
            kd_loss = rev_kl_per_token.sum()
        else:
            # JSD - Jensen-Shannon Divergence / Symmetric
            mean_probs = (
                1 - beta
            ) * student_probs_topk_valid + beta * teacher_probs_valid
            log_mean_probs = mean_probs.log()
            student_kl = F.kl_div(
                log_mean_probs,
                student_logprobs_topk_valid,
                reduction="sum",
                log_target=True,
            )
            teacher_kl = F.kl_div(
                log_mean_probs, teacher_logprobs_valid, reduction="sum", log_target=True
            )
            jsd_loss = beta * teacher_kl + (1 - beta) * student_kl
            kd_loss = jsd_loss
        return kd_loss
    @staticmethod
    def _compute_loss_kl_topk(
        student_input_chunk: torch.Tensor,
        student_weight: torch.Tensor,
        # Args for student_bias, target_token_ids_chunk etc. are passed to the lambda wrapped by grad_and_value
        # or through `partial`. Let's make them explicit here for clarity.
        target_token_ids_chunk: torch.Tensor,
        target_logprobs_chunk: torch.Tensor,
        target_mask_chunk: torch.Tensor,
        target_chunk: torch.Tensor,  # For hard loss (true labels)
        student_bias: torch.Tensor = None,  # This will be one of the grad targets
        # Other params passed via `partial` from `forward`
        distillation_loss_fn=None,
        ignore_index: int = -100,
        weight_hard_loss: float = 0.5,
        weight_soft_loss: float = 0.5,
        compute_ce_loss: bool = True,
        temperature: float = 1.0,
        beta: float = 0.0,
        normalize_topk: bool = True,
    ):
        # Compute student logits for the chunk from hidden states and LM head
        # student_input_chunk: [chunk_size, hidden_dim]
        # student_lm_head_weight: [vocab_size, hidden_dim]
        # student_logits_chunk: [chunk_size, vocab_size]
        student_logits_chunk = F.linear(
            student_input_chunk, student_weight, student_bias
        )
        ce_loss = torch.tensor(
            0.0, device=student_logits_chunk.device, dtype=student_logits_chunk.dtype
        )
        if compute_ce_loss and weight_hard_loss > 0.0:
            ce_loss = F.cross_entropy(
                student_logits_chunk.view(-1, student_logits_chunk.shape[-1]),
                target_chunk.view(-1),
                reduction="sum",
                ignore_index=ignore_index,
            )
        soft_loss = torch.tensor(
            0.0, device=student_logits_chunk.device, dtype=student_logits_chunk.dtype
        )
        if weight_soft_loss > 0.0:
            student_logits_chunk_temp_scaled = student_logits_chunk / temperature
            # Assuming student_weight.shape[0] (vocab_size) is adequate for target_token_ids_chunk.max()
            # No explicit padding here; user must ensure vocab alignment or pre-pad student_weight.
            soft_loss = distillation_loss_fn(
                student_logits_chunk_temp_scaled,
                target_token_ids_chunk,
                target_logprobs_chunk,
                target_mask_chunk,
                beta=beta,
                normalize_topk=normalize_topk,
            )
        return soft_loss, ce_loss
    @classmethod
    def forward(
        cls,
        ctx,
        student_input: torch.Tensor,  # [batch_size, seq_len, dim]
        student_lm_head_weight: torch.Tensor,  # [vocab_size, dim]
        target_token_ids: torch.Tensor,  # [batch_size, seq_len, top_k]
        target_logprobs: torch.Tensor,  # [batch_size, seq_len, top_k]
        target_mask: torch.Tensor,  # [batch_size, seq_len, top_k]
        true_labels: torch.Tensor,  # [batch_size, seq_len]
        student_lm_head_bias: torch.Tensor = None,
        weight_hard_loss: float = 0.5,
        weight_soft_loss: float = 0.5,
        ignore_index: int = -100,
        temperature: float = 1.0,
        beta: float = 0.0,
        compiled: bool = False,
        chunk_size: int = 1024,
        compute_ce_loss: bool = True,
        normalize_topk: bool = True,
    ):
        CHUNK_SIZE = chunk_size  # pylint: disable=invalid-name
        grad_weight_acc = torch.zeros_like(student_lm_head_weight)
        grad_inputs_list = []
        grad_bias_acc = (
            torch.zeros_like(student_lm_head_bias)
            if student_lm_head_bias is not None
            else None
        )
        kd_loss_acc = torch.zeros(
            (), device=student_input.device, dtype=student_input.dtype
        )
        ce_loss_acc = torch.zeros(
            (), device=student_input.device, dtype=student_input.dtype
        )
        # This function will be what torch.func.grad_and_value differentiates.
        # It takes student_input_chunk, student_weight (full), student_bias (full) as primals.
        # Other necessary data (target_*, etc.) are passed as non-differentiable arguments.
        def loss_fn_for_grad(
            _student_input_chunk,
            _student_lm_head_weight,  # full weight
            _student_lm_head_bias,  # full bias
            # Fixed arguments for a given chunk, not differentiated:
            _target_token_ids_chunk,
            _target_logprobs_chunk,
            _target_mask_chunk,
            _true_labels_chunk,
        ):
            return cls._compute_loss_kl_topk(
                student_input_chunk=_student_input_chunk,
                student_weight=_student_lm_head_weight,
                target_token_ids_chunk=_target_token_ids_chunk,
                target_logprobs_chunk=_target_logprobs_chunk,
                target_mask_chunk=_target_mask_chunk,
                target_chunk=_true_labels_chunk,
                student_bias=_student_lm_head_bias,
                distillation_loss_fn=cls.distillation_loss_fn,
                ignore_index=ignore_index,
                weight_hard_loss=weight_hard_loss,
                weight_soft_loss=weight_soft_loss,
                compute_ce_loss=compute_ce_loss,
                temperature=temperature,
                beta=beta,
                normalize_topk=normalize_topk,
            )
        def accumulate_chunk_grads(
            student_input_chunk_ac,
            target_token_ids_chunk_ac,
            target_logprobs_chunk_ac,
            target_mask_chunk_ac,
            true_labels_chunk_ac,
        ):
            # student_weight and student_bias are closed over from the outer scope (full tensors)
            if student_lm_head_bias is not None:
                (
                    (chunk_grad_input, chunk_grad_weight, chunk_grad_bias),
                    (chunk_kd_loss, chunk_ce_loss),
                ) = torch.func.grad_and_value(
                    loss_fn_for_grad, argnums=(0, 1, 2), has_aux=True
                )(
                    student_input_chunk_ac,
                    student_lm_head_weight,
                    student_lm_head_bias,  # primals
                    target_token_ids_chunk_ac,
                    target_logprobs_chunk_ac,
                    target_mask_chunk_ac,
                    true_labels_chunk_ac,
                )  # non-primals
                grad_bias_acc.add_(chunk_grad_bias)
            else:
                argnums_for_grad = (0, 1)  # Differentiate wrt input_chunk, weight
                (
                    (chunk_grad_input, chunk_grad_weight),  # No grad for bias
                    (chunk_kd_loss, chunk_ce_loss),
                ) = torch.func.grad_and_value(
                    loss_fn_for_grad, argnums=argnums_for_grad, has_aux=True
                )(
                    student_input_chunk_ac,
                    student_lm_head_weight,
                    None,  # Pass None for student_bias primal
                    target_token_ids_chunk_ac,
                    target_logprobs_chunk_ac,
                    target_mask_chunk_ac,
                    true_labels_chunk_ac,
                )
            grad_weight_acc.add_(chunk_grad_weight)
            kd_loss_acc.add_(chunk_kd_loss)
            ce_loss_acc.add_(chunk_ce_loss)
            return chunk_grad_input
        if compiled:
            accumulate_chunk_grads_compiled = torch.compile(
                accumulate_chunk_grads, dynamic=True, backend="inductor"
            )  # dynamic=True often helpful
        else:
            accumulate_chunk_grads_compiled = accumulate_chunk_grads
        # Use the same chunking logic as LigerFusedLinearDistillationBase.forward
        B, N, D = student_input.shape  # pylint: disable=invalid-name
        K = target_token_ids.shape[-1]  # pylint: disable=invalid-name
        student_input_flat = student_input.reshape(-1, student_input.shape[-1])
        target_token_ids_flat = target_token_ids.reshape(-1, target_token_ids.shape[-1])
        target_logprobs_flat = target_logprobs.reshape(-1, target_logprobs.shape[-1])
        target_mask_flat = target_mask.reshape(-1, target_mask.shape[-1])
        # pad and shift for cross entropy loss
        true_labels = torch.nn.functional.pad(true_labels, (0, 1), value=ignore_index)
        true_labels_flat = true_labels[:, 1:].contiguous().view(-1)
        num_chunks = max(1, student_input_flat.shape[0] // CHUNK_SIZE)
        _student_input_chunks = torch.chunk(
            student_input_flat, chunks=num_chunks, dim=0
        )
        _target_token_ids_chunks = torch.chunk(
            target_token_ids_flat, chunks=num_chunks, dim=0
        )
        _target_logprobs_chunks = torch.chunk(
            target_logprobs_flat, chunks=num_chunks, dim=0
        )
        _target_mask_chunks = torch.chunk(target_mask_flat, chunks=num_chunks, dim=0)
        _true_labels_chunks = torch.chunk(true_labels_flat, chunks=num_chunks, dim=0)
        for i in range(num_chunks):
            grad_input_chunk = accumulate_chunk_grads_compiled(
                _student_input_chunks[i],
                _target_token_ids_chunks[i],
                _target_logprobs_chunks[i],
                _target_mask_chunks[i],
                _true_labels_chunks[i],
            )
            grad_inputs_list.append(grad_input_chunk)
        grad_inputs_combined = torch.cat(grad_inputs_list, dim=0)
        ctx.save_for_backward(grad_inputs_combined, grad_weight_acc, grad_bias_acc)
        # For matching None returns in backward for non-tensor/non-grad_requiring inputs
        ctx.hyperparams_count = 9  # Corresponds to number of hyperparams after main tensors in fwd signature
        ctx.bias_was_none = student_lm_head_bias is None
        ctx.orig_dims = (B, N, D, K)
        # since this is packed, there is simply a single batch, so batchmean reduction of kl-div is simply the accumulated sum
        # we still need to scale the kd_loss by the temp^2
        kd_loss_acc = kd_loss_acc * (temperature**2)
        final_loss = weight_soft_loss * kd_loss_acc + weight_hard_loss * ce_loss_acc
        return final_loss
    @staticmethod
    def backward(ctx, grad_output):
        grad_input_flat, grad_weight, grad_bias_maybe = (
            ctx.saved_tensors
        )  # grad_input_flat is (B*N, D)
        # Scale gradients by grad_output if it's not 1.0
        if not torch.equal(
            grad_output,
            torch.tensor(1.0, device=grad_output.device, dtype=grad_output.dtype),
        ):
            grad_input_flat = grad_input_flat * grad_output
            grad_weight = grad_weight * grad_output
            if grad_bias_maybe is not None:
                grad_bias_maybe = grad_bias_maybe * grad_output
        # Reshape grad_input_flat to match original student_input shape (B, N, D)
        # ctx.orig_dims stores (B, N, D, K)
        # We need the first three dimensions for student_input's shape.
        # Ensure that orig_dims are not (0,0,0,K) for empty inputs leading to view errors
        if (
            ctx.orig_dims[0] * ctx.orig_dims[1] * ctx.orig_dims[2] == 0
            and grad_input_flat.numel() == 0
        ):
            # If original input was empty, gradient should also be empty with correct shape
            grad_input_reshaped = torch.zeros(
                ctx.orig_dims[0],
                ctx.orig_dims[1],
                ctx.orig_dims[2],
                dtype=grad_input_flat.dtype,
                device=grad_input_flat.device,
            )
        elif grad_input_flat.numel() == 0 and not (
            ctx.orig_dims[0] * ctx.orig_dims[1] * ctx.orig_dims[2] == 0
        ):
            # This case should ideally not happen if forward path is correct (non-empty input -> non-empty flat grad)
            # but as a safeguard:
            grad_input_reshaped = torch.zeros(
                ctx.orig_dims[0],
                ctx.orig_dims[1],
                ctx.orig_dims[2],
                dtype=grad_input_flat.dtype,
                device=grad_input_flat.device,
            )
        else:
            grad_input_reshaped = grad_input_flat.view(
                ctx.orig_dims[0], ctx.orig_dims[1], ctx.orig_dims[2]
            )
        nones_for_hyperparams = [None] * ctx.hyperparams_count
        grad_bias_return = grad_bias_maybe if not ctx.bias_was_none else None
        return (
            grad_input_reshaped,  # Gradient for student_input (reshaped)
            grad_weight,  # Gradient for student_lm_head_weight
            None,  # Gradient for target_token_ids
            None,  # Gradient for target_logprobs
            None,  # Gradient for target_mask
            None,  # Gradient for true_labels
            grad_bias_return,  # Gradient for student_lm_head_bias
            *nones_for_hyperparams,  # Grads for weight_hard_loss, ..., normalize_topk
        )
 class LigerFusedLinearKLTopKLogprobLoss(torch.nn.Module):
    """
    Module wrapper for the chunked top-k logprob KL-divergence loss
    """
    def __init__(
        self,
        weight_hard_loss: float = 0.5,
        weight_soft_loss: float = 0.5,
        temperature: float = 1.0,  # This is the kd_temperature
        beta: float = 1.0,
        ignore_index: int = -100,
        compiled: bool = True,
        chunk_size: int = 1024,
        compute_ce_loss: bool = True,
        normalize_topk: bool = True,
    ):
        super().__init__()
        if not (0.0 <= weight_hard_loss <= 1.0 and 0.0 <= weight_soft_loss <= 1.0):
            raise ValueError("Loss weights must be between 0.0 and 1.0.")
        if temperature <= 0:
            raise ValueError("Temperature must be positive.")
        self.weight_hard_loss = weight_hard_loss
        self.weight_soft_loss = weight_soft_loss
        self.temperature = temperature
        self.beta = beta
        self.ignore_index = ignore_index
        self.compiled = compiled
        self.chunk_size = chunk_size
        self.compute_ce_loss = compute_ce_loss
        self.normalize_topk = normalize_topk
        if not self.compute_ce_loss and self.weight_hard_loss > 0.0:
            print(
                f"Warning: compute_ce_loss is False, but weight_hard_loss ({self.weight_hard_loss}) > 0. Hard loss will effectively be zero."
            )
            # self.weight_hard_loss = 0.0 # Or let user manage this
        if self.weight_soft_loss == 0.0:
            print(
                "Warning: weight_soft_loss is 0.0. Soft (KD) loss will not be computed."
            )
    def forward(
        self,
        lm_head_weight: torch.Tensor,  # Weights of the linear layer in the LM head
        student_hidden_states: torch.Tensor,  # student_hidden_states before the lm_head
        target_token_ids: torch.Tensor,
        target_logprobs: torch.Tensor,
        target_mask: torch.Tensor,
        true_labels: torch.Tensor,
        student_bias: torch.Tensor = None,
    ) -> torch.Tensor:
        return LigerFusedLinearKLTopKLogprobFunction.apply(
            student_hidden_states,
            lm_head_weight,
            target_token_ids,
            target_logprobs,
            target_mask,
            true_labels,
            student_bias,
            self.weight_hard_loss,
            self.weight_soft_loss,
            self.ignore_index,
            self.temperature,
            self.beta,
            self.compiled,
            self.chunk_size,
            self.compute_ce_loss,
            self.normalize_topk,
        )
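For reviewing the `beta == 0.0` and `beta == 1.0` branches of `distillation_loss_fn`, a dependency-free reference for a single position may help. This is a sketch under stated assumptions: `normalize_topk_logprobs` is a hypothetical stand-in that renormalizes over the k entries (the behavior of `axolotl.integrations.kd.utils.normalize_logprobs` is assumed, not confirmed here), and all inputs are assumed finite, meaning masked or `-inf` entries were already filtered out as the kernel does with `valid_mask`.

```python
import math
from typing import List


def normalize_topk_logprobs(logprobs: List[float]) -> List[float]:
    """Shift logprobs so that exp(logprobs) sums to 1 over the k entries."""
    m = max(logprobs)
    lse = m + math.log(sum(math.exp(lp - m) for lp in logprobs))
    return [lp - lse for lp in logprobs]


def topk_kl(teacher_lp: List[float], student_lp: List[float], beta: float = 0.0) -> float:
    """Forward KL (beta=0.0) or reverse KL (beta=1.0) over one position's top-k."""
    if len(teacher_lp) != len(student_lp):
        raise ValueError("teacher and student must cover the same k tokens")
    if beta == 0.0:
        # KL(P_teacher || P_student)
        return sum(math.exp(t) * (t - s) for t, s in zip(teacher_lp, student_lp))
    if beta == 1.0:
        # KL(P_student || P_teacher)
        return sum(math.exp(s) * (s - t) for t, s in zip(teacher_lp, student_lp))
    raise ValueError("this sketch only covers beta == 0.0 and beta == 1.0")


if __name__ == "__main__":
    teacher = normalize_topk_logprobs([math.log(0.5), math.log(0.25)])
    student = normalize_topk_logprobs([math.log(0.3), math.log(0.3)])
    print(topk_kl(teacher, student, beta=0.0))
    print(topk_kl(teacher, student, beta=1.0))
```

Summing this quantity over all unmasked positions, then multiplying by `temperature**2`, mirrors what `forward` accumulates into `kd_loss_acc`.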
--- a/src/axolotl/integrations/kd/kernels/models.py
+++ b/src/axolotl/integrations/kd/kernels/models.py
@@ -0,0 +1,97 @@
 """
 model patcher for chunked top-k kl-div
 """
 from typing import Optional, Union, Unpack
 import torch
 from transformers import Cache
 from transformers.modeling_flash_attention_utils import FlashAttentionKwargs
 from transformers.modeling_outputs import CausalLMOutputWithPast
 from transformers.utils import LossKwargs
 class KwargsForCausalLM(FlashAttentionKwargs, LossKwargs):
    """
    placeholder kwargs for hf model classes
    """
 def kldiv_forward_llama_like(
    self,
    input_ids: Optional[torch.LongTensor] = None,
    target_logprobs: Optional[torch.Tensor] = None,
    target_token_ids: Optional[torch.LongTensor] = None,
    target_mask: Optional[torch.Tensor] = None,
    attention_mask: Optional[torch.Tensor] = None,
    position_ids: Optional[torch.LongTensor] = None,
    past_key_values: Optional[Cache] = None,
    inputs_embeds: Optional[torch.FloatTensor] = None,
    labels: Optional[torch.LongTensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    cache_position: Optional[torch.LongTensor] = None,
    logits_to_keep: Union[int, torch.Tensor] = 0,  # pylint: disable=unused-argument
    **kwargs: Unpack[KwargsForCausalLM],  # type: ignore[misc]
 ) -> CausalLMOutputWithPast:
    # pylint: disable=duplicate-code
    output_attentions = (
        output_attentions
        if output_attentions is not None
        else self.config.output_attentions
    )
    output_hidden_states = (
        output_hidden_states
        if output_hidden_states is not None
        else self.config.output_hidden_states
    )
    # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
    outputs = self.model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_values=past_key_values,
        inputs_embeds=inputs_embeds,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        cache_position=cache_position,
        **kwargs,
    )
    hidden_states = outputs.last_hidden_state
    # Logits are never materialized here: the fused loss consumes the hidden states and lm_head weight directly
    # TODO, we can optimize this further by filtering hidden_states on sequence dimension using labels != -100
    # self.loss_function should be LigerFusedLinearKLTopKLogprobLoss
    loss = self.loss_function(
        self.lm_head.weight,
        hidden_states,
        target_token_ids,
        target_logprobs,
        target_mask,
        true_labels=labels,
    )
    num_items_in_batch = kwargs.pop("num_items_in_batch", -1)
    if num_items_in_batch is not None and num_items_in_batch > 0:
        loss = loss / num_items_in_batch
    return CausalLMOutputWithPast(
        loss=loss,
        logits=None,
        past_key_values=outputs.past_key_values,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
    )
 def apply_kernel(model_type):
    # Dynamically import the modeling module and look up its causal LM class
    module_path = f"transformers.models.{model_type}.modeling_{model_type}"
    model_cls_prefix = "".join([part.capitalize() for part in model_type.split("_")])
    module = __import__(module_path, fromlist=[f"{model_cls_prefix}ForCausalLM"])
    model_cls = getattr(module, f"{model_cls_prefix}ForCausalLM")
    model_cls.forward = kldiv_forward_llama_like
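The name derivation in `apply_kernel` is worth checking against the model types it will be given. The sketch below reproduces only the string logic, without importing `transformers`; the helper names `causal_lm_class_name` and `modeling_module_path` are hypothetical and exist only for illustration.

```python
def causal_lm_class_name(model_type: str) -> str:
    """Derive the class name the same way apply_kernel does."""
    # str.capitalize() upper-cases the first character and lower-cases the rest.
    prefix = "".join(part.capitalize() for part in model_type.split("_"))
    return f"{prefix}ForCausalLM"


def modeling_module_path(model_type: str) -> str:
    """Module path that apply_kernel imports for a given model type."""
    return f"transformers.models.{model_type}.modeling_{model_type}"


if __name__ == "__main__":
    for name in ("llama", "qwen2", "gpt_neox"):
        print(name, modeling_module_path(name), causal_lm_class_name(name))
```

The heuristic works for single-word types such as `llama`, but it yields `GptNeoxForCausalLM` for `gpt_neox`, whereas the class in `transformers` is, to my knowledge, named `GPTNeoXForCausalLM`. For such model types the `getattr` call would raise `AttributeError`, so an explicit mapping may be needed if they are to be supported.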
--- a/src/axolotl/integrations/kd/topk_logprob/forward_kl.py
+++ b/src/axolotl/integrations/kd/topk_logprob/forward_kl.py
@@ -16,40 +16,7 @@
 loss for top_k KL divergence
 """
 import torch
-
+from torch import nn
 def zscore_standardize(
    logits: torch.Tensor,
    mask: torch.Tensor = None,
    base_temperature: float = 1.0,
    eps: float = 1e-9,
 ):
    """
    Z-score standardize along the last dimension of `logits`.
    i.e., for each [B, seq_len] row, across K entries:
        z = (logits - mean) / std,
    then scale by 1 / base_temperature if desired.
    mask can be broadcastable or None. If None, we standardize all elements.
    """
    if mask is None:
        # shape: [B, seq_len, K]
        # Mean and std over dim=-1
        mean = logits.mean(dim=-1, keepdim=True)
        var = logits.var(dim=-1, unbiased=False, keepdim=True)
    else:
        # If you have to exclude some tokens, multiply by mask, etc.
        float_mask = mask.to(logits.dtype)
        count = float_mask.sum(dim=-1, keepdim=True).clamp_min(1.0)
        mean = (logits * float_mask).sum(dim=-1, keepdim=True) / count
        var = (float_mask * (logits - mean) ** 2).sum(dim=-1, keepdim=True) / count
    std = torch.sqrt(var.clamp_min(eps))
    z = (logits - mean) / std
    # Scale by 1 / base_temperature
    z = z / base_temperature
    return z
@torch.jit.script
@@ -60,7 +27,6 @@ def loss(
    target_mask: torch.Tensor,
    num_items_in_batch: int = -1,  # Use -1 to indicate "None"
    kd_temperature: float = 1.0,
    top_k_before_softmax: int = 0,
 ) -> torch.Tensor:
    """
    A KD loss function that is TorchScript-friendly.
@@ -77,8 +43,6 @@ def loss(
        num_items_in_batch (int, optional): The number of items in the batch.
        kd_temperature (float, optional): The temperature for KD.
            Default: 1.0
        top_k_before_softmax (int, optional): Flag of whether to apply softmax before gathering student top-k logits
            Default: 0
    """
    target_logprobs = target_logprobs.float()
@@ -88,46 +52,24 @@ def loss(
    # student_logits shape:   [B, student_seq_len, vocab_size]
    teacher_seq_len = target_token_ids.shape[1]
-    if top_k_before_softmax:
-        # Slice student logits to match teacher-provided sequence length
-        student_logits_for_kd = student_logits[
-            :, :teacher_seq_len, :
-        ]  # [B, teacher_seq_len, vocab_size]
-        # Gather student logits for teacher's top-K tokens
-        student_logits_topk = torch.gather(
-            student_logits_for_kd, dim=-1, index=target_token_ids
-        )  # [B, teacher_seq_len, K]
-        student_logits_topk = student_logits_topk.float()
-        # Apply KD temperature to student’s logits
-        if kd_temperature != 1.0:
-            student_logits_topk = student_logits_topk / kd_temperature
-        # Convert student top-k logits to logprobs
-        student_logprobs_topk = student_logits_topk - torch.logsumexp(
-            student_logits_topk, dim=-1, keepdim=True
-        )  # [B, teacher_seq_len, K]
-    else:
-        # Slice student logits to match teacher-provided sequence length
-        student_logits_for_kd = (
-            student_logits[:, :teacher_seq_len, :] / kd_temperature
-        )  # [B, teacher_seq_len, vocab_size]
-        # keep in full precision for numerical stability of loss
-        student_logits_for_kd = student_logits_for_kd.float()
-        # Gather student logits for teacher's top-K tokens
-        student_logits_topk = torch.gather(
-            student_logits_for_kd, dim=-1, index=target_token_ids
-        )  # [B, teacher_seq_len, K]
-        # Compute logsumexp across full vocabulary
-        student_lse = torch.logsumexp(student_logits_for_kd, dim=-1, keepdim=True)
-        #  Convert just the top-k logits to logprobs
-        student_logprobs_topk = student_logits_topk - student_lse
+    # Slice student logits to match teacher-provided sequence length
+    student_logits_for_kd = (
+        student_logits[:, :teacher_seq_len, :] / kd_temperature
+    )  # [B, teacher_seq_len, vocab_size]
+    # keep in full precision for numerical stability of loss
+    student_logits_for_kd = student_logits_for_kd.float()
+    # Gather student logits for teacher's top-K tokens
+    student_logits_topk = torch.gather(
+        student_logits_for_kd, dim=-1, index=target_token_ids
+    )  # [B, teacher_seq_len, K]
+    # Compute logsumexp across full vocabulary
+    student_lse = torch.logsumexp(student_logits_for_kd, dim=-1, keepdim=True)
+    #  Convert just the top-k logits to logprobs
+    student_logprobs_topk = student_logits_topk - student_lse
    # Convert teacher_mask to boolean for indexing
    # In TorchScript, .bool() is sometimes unsupported, so we do:
@@ -144,10 +86,6 @@ def loss(
    kd_loss_per_token = teacher_probs * (target_logprobs - student_logprobs_topk)
    kd_loss = kd_loss_per_token.sum()
    # Multiply by T^2 (classical KD scaling)
    if kd_temperature != 1.0:
        kd_loss = kd_loss * (kd_temperature**2)
    # Normalize by number of items (if provided) or by valid tokens
    if num_items_in_batch > 0:
        kd_loss = kd_loss / float(num_items_in_batch)
@@ -158,80 +96,74 @@ def loss(
    return kd_loss
-def topk_kd_loss_with_zscore(
-    student_logits: torch.Tensor,  # [B, seq_len, vocab_size]
-    target_token_ids: torch.Tensor,  # [B, seq_len, K]
-    target_logprobs: torch.Tensor,  # [B, seq_len, K], sums to 1.0 in prob space
-    target_mask: torch.Tensor,  # [B, seq_len, K] or [B, seq_len]
-    kd_temperature: float = 1.0,  # classic KD temperature
-    zscore_base_temp: float = 1.0,  # from the paper
-    num_items_in_batch: int = -1,
-):
-    """
-    A variant of top_k KL divergence with Z-score scaling
-    from "Logit Standardization in Knowledge Distillation".
-    """
-    target_logprobs = target_logprobs.float()
-    B, teacher_seq_len, K = target_logprobs.shape  # pylint: disable=invalid-name
-    # 1) Gather the student's top-k logits to match teacher
-    student_logits_for_kd = student_logits[
-        :, :teacher_seq_len, :
-    ]  # [B, seq_len, vocab]
-    student_topk_logits = torch.gather(
-        student_logits_for_kd, dim=-1, index=target_token_ids
-    )  # [B, seq_len, K]
-    student_topk_logits = student_topk_logits.float()
-    # 2) If you want to keep the "classical" T scaling, apply it first
-    if kd_temperature != 1.0:
-        student_topk_logits = student_topk_logits / kd_temperature
-    # 3) Convert teacher logprobs -> treat them as “logits” for z-score
-    #    (They differ by +some_constant from real logits, but in z-score
-    #     that constant is subtracted out anyway.)
-    teacher_logits_for_zscore = target_logprobs  # rename variable for clarity
-    # 4) Z-score teacher and student
-    #    If target_mask is 2D, expand to 3D for the K dimension
-    if target_mask.dim() == 2 and target_mask.shape[:2] == (B, teacher_seq_len):
-        target_mask = target_mask.unsqueeze(-1).expand(-1, -1, K)
-    teacher_z = zscore_standardize(
-        teacher_logits_for_zscore, mask=target_mask, base_temperature=zscore_base_temp
-    )
-    student_z = zscore_standardize(
-        student_topk_logits, mask=target_mask, base_temperature=zscore_base_temp
-    )
-    # 5) Convert to log-probs for KL
-    teacher_logprobs_z = teacher_z - torch.logsumexp(teacher_z, dim=-1, keepdim=True)
-    student_logprobs_z = student_z - torch.logsumexp(student_z, dim=-1, keepdim=True)
-    # 6) Restrict to valid tokens if needed
-    valid_mask = target_mask.bool()  # shape [B, seq_len, K]
-    teacher_probs_z = teacher_logprobs_z.exp()
-    teacher_probs_z = teacher_probs_z[valid_mask]
-    teacher_logprobs_z = teacher_logprobs_z[valid_mask]
-    student_logprobs_z = student_logprobs_z[valid_mask]
-    # 7) forward KL:  sum( p_teacher * [log(p_teacher) - log(p_student)] )
-    kd_loss_per_token = teacher_probs_z * (teacher_logprobs_z - student_logprobs_z)
-    kd_loss = kd_loss_per_token.sum()
-    # 8) If using classical KD scaling by T^2
-    if kd_temperature != 1.0:
-        kd_loss = kd_loss * (kd_temperature**2)
-    # Optionally scale by zscore_base_temp**2 if you want (paper might differ).
-    # kd_loss = kd_loss * (zscore_base_temp**2)
-    # 9) Normalize
-    if num_items_in_batch is not None and num_items_in_batch > 0:
-        kd_loss = kd_loss / float(num_items_in_batch)
-    else:
-        kd_loss = kd_loss / float(kd_loss_per_token.size(0))
-    return kd_loss
+class ChunkedTopKKDLoss(nn.Module):
+    """
+    A wrapper that chunks (splits) the student and teacher outputs along the time dimension
+    to reduce peak memory usage when upcasting from bf16 to fp32, especially for large vocabularies.
+    Usage is analogous to ForwardKLWithChunkedOutputLoss but adapted to top-K teacher logprobs.
+    """
+    def __init__(self, num_output_chunks: int = 8, kd_temperature: float = 1.0):
+        super().__init__()
+        self.num_output_chunks = num_output_chunks
+        self.kd_temperature = kd_temperature
+    def forward(
+        self,
+        student_logits: torch.Tensor,  # [B, seq_len, vocab_size]
+        target_token_ids: torch.Tensor,  # [B, seq_len, K]
+        target_logprobs: torch.Tensor,  # [B, seq_len, K]
+        target_mask: torch.Tensor,  # [B, seq_len, K]
+        num_items_in_batch: int = -1,  # optional batch size for normalization
+    ) -> torch.Tensor:
+        # 1. Split along the "token" dimension (dim=1).
+        student_logits_chunks = student_logits.chunk(self.num_output_chunks, dim=1)
+        token_ids_chunks = target_token_ids.chunk(self.num_output_chunks, dim=1)
+        logprobs_chunks = target_logprobs.chunk(self.num_output_chunks, dim=1)
+        mask_chunks = target_mask.chunk(self.num_output_chunks, dim=1)
+        # We'll accumulate a global "sum of losses" and "sum of valid tokens"
+        # so that our final average is consistent with the entire sequence/batch.
+        total_loss = 0.0
+        total_valid_tokens = 0
+        # 2. Loop over each chunk and compute a chunk-specific loss.
+        for st_chunk, tid_chunk, lp_chunk, msk_chunk in zip(
+            student_logits_chunks, token_ids_chunks, logprobs_chunks, mask_chunks
+        ):
+            # We pass num_items_in_batch=-1 so that the kd_loss
+            # will average over *this chunk's* valid tokens only.
+            chunk_loss = loss(
+                student_logits=st_chunk,
+                target_token_ids=tid_chunk,
+                target_logprobs=lp_chunk,
+                target_mask=msk_chunk,
+                num_items_in_batch=-1,  # ensure per-chunk averaging by valid tokens
+                kd_temperature=self.kd_temperature,
+            )
+            # kd_loss returns an average over the chunk's valid tokens.
+            # We want a global average in the end, so we need to re‐weight
+            # by the number of valid tokens in this chunk and keep track of the total.
+            chunk_valid_mask = msk_chunk.to(torch.bool)
+            chunk_valid_count = chunk_valid_mask.sum()  # scalar tensor
+            # Re-scale "chunk average" back to "chunk sum"
+            chunk_loss_sum = chunk_loss * chunk_valid_count
+            total_loss += chunk_loss_sum
+            total_valid_tokens += chunk_valid_count
+        # 3. Normalize *once* at the end.
+        if num_items_in_batch > 0:
+            # If the user gave us a manual denominator (e.g. total items in batch),
+            # we divide by it. Typically used if each item is of different length.
+            final_loss = total_loss / float(num_items_in_batch)
+        else:
+            # Otherwise, divide by total valid tokens across all chunks.
+            # to get the same result as a non-chunked approach.
+            final_loss = total_loss / float(total_valid_tokens)
+        return final_loss
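Review note on the re-weighting in `ChunkedTopKKDLoss.forward`: the comments there describe turning each chunk's average back into a sum before dividing once by the total valid-token count. The sketch below illustrates why that matters, using plain Python lists instead of tensors. The helper name `global_mean_from_chunks` and the sample numbers are hypothetical and only serve the illustration; they are not part of the PR.

```python
def global_mean_from_chunks(values, mask, num_chunks):
    """Average masked values chunk by chunk, then recombine into one global mean."""
    # Ceiling division so every element lands in some chunk.
    chunk_size = -(-len(values) // num_chunks)
    total_sum = 0.0
    total_valid = 0
    for start in range(0, len(values), chunk_size):
        vals = values[start:start + chunk_size]
        msk = mask[start:start + chunk_size]
        valid = sum(msk)
        if valid == 0:
            continue  # an all-padding chunk contributes nothing
        chunk_mean = sum(v for v, m in zip(vals, msk) if m) / valid
        # Re-scale the chunk average back to a chunk sum before accumulating.
        total_sum += chunk_mean * valid
        total_valid += valid
    return total_sum / total_valid


# Hypothetical per-token losses; the third token is masked out.
per_token_loss = [0.5, 1.5, 2.0, 4.0, 1.0, 3.0]
valid_mask = [1, 1, 0, 1, 1, 1]

direct = sum(v for v, m in zip(per_token_loss, valid_mask) if m) / sum(valid_mask)
chunked = global_mean_from_chunks(per_token_loss, valid_mask, num_chunks=3)
# Averaging the three chunk means directly would weight the one-token chunk too heavily.
mean_of_means = (1.0 + 4.0 + 2.0) / 3
```

With these numbers the re-weighted result matches the unchunked mean, while the plain mean of chunk means does not, because the chunks hold different numbers of valid tokens.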
--- a/src/axolotl/integrations/kd/trainer.py
+++ b/src/axolotl/integrations/kd/trainer.py
@@ -18,8 +18,7 @@ KD trainer
 from axolotl.core.trainers.base import AxolotlTrainer
-from .topk_logprob.forward_kl import loss as topk_kd_loss
+from .kernels.liger import LigerFusedLinearKLTopKLogprobLoss
 from .topk_logprob.forward_kl import topk_kd_loss_with_zscore
 class AxolotlKDTrainer(AxolotlTrainer):
@@ -27,6 +26,18 @@ class AxolotlKDTrainer(AxolotlTrainer):
    Custom trainer subclass for Knowledge Distillation (KD)
    """
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.model_accepts_loss_kwargs = True
        self.model._loss_function = LigerFusedLinearKLTopKLogprobLoss(
            self.args.kd_ce_alpha,  # hard label loss
            self.args.kd_alpha,  # kd loss
            self.args.kd_temperature,
            self.args.kd_beta,
            compute_ce_loss=bool(self.args.kd_ce_alpha),
            normalize_topk=self.args.kd_normalize_topk,
        )
    def _set_signature_columns_if_needed(self):
        super()._set_signature_columns_if_needed()
        columns_to_add = []
@@ -52,12 +63,12 @@ class AxolotlKDTrainer(AxolotlTrainer):
        Subclass and override for custom behavior.
        """
-
-        target_logprobs = inputs.pop("target_logprobs")
-        target_token_ids = inputs.pop("target_token_ids")
-        target_mask = inputs.pop("target_mask")
-
-        seq_len = target_token_ids.shape[1]
+        if (
+            self.args.sample_packing
+            and hasattr(inputs, "attention_mask")
+            and hasattr(inputs, "position_ids")
+        ):
+            del inputs["attention_mask"]
        if self.model_accepts_loss_kwargs:
            loss_kwargs = {}
@@ -65,49 +76,4 @@ class AxolotlKDTrainer(AxolotlTrainer):
                loss_kwargs["num_items_in_batch"] = num_items_in_batch
            inputs = {**inputs, **loss_kwargs}
        outputs = model(**inputs)
-
-        # FIXME: account for tokenizer.padding_side
-        student_logits = outputs["logits"][:, : seq_len - 1, :].contiguous()
-        shift_logits = student_logits.contiguous()
-        target_logprobs_for_loss = target_logprobs[..., 1:, :].contiguous()
-        target_token_ids_for_loss = target_token_ids[..., 1:, :].contiguous()
-        target_mask_for_loss = target_mask[..., 1:, :].contiguous()
-        if self.args.kd_zscore_base_temp:
-            loss_kd = topk_kd_loss_with_zscore(
-                shift_logits,
-                target_token_ids_for_loss,
-                target_logprobs_for_loss,
-                target_mask_for_loss,
-                kd_temperature=self.args.kd_temperature,
-                zscore_base_temp=self.args.kd_zscore_base_temp,
-                num_items_in_batch=num_items_in_batch,
-            )
-        else:
-            loss_kd = topk_kd_loss(
-                shift_logits,
-                target_token_ids_for_loss,
-                target_logprobs_for_loss,
-                target_mask_for_loss,
-                num_items_in_batch=num_items_in_batch,
-                kd_temperature=self.args.kd_temperature,
-                top_k_before_softmax=1 if self.args.kd_top_k_before_softmax else 0,
-            )
-        if self.args.kd_ce_alpha > 0:
-            kd_alpha = self.args.kd_alpha
-            loss = self.args.kd_ce_alpha * outputs["loss"] + kd_alpha * loss_kd
-        else:
-            loss = loss_kd
-        # Save past state if it exists
-        # TODO: this needs to be fixed and made cleaner later.
-        if self.args.past_index >= 0:
-            self._past = outputs[  # pylint: disable=attribute-defined-outside-init
-                self.args.past_index
-            ]
-        if self.args.average_tokens_across_devices and self.model_accepts_loss_kwargs:
-            loss *= self.accelerator.num_processes
-        return (loss, outputs) if return_outputs else loss
+        return outputs[0]
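Review note on the new guard in `compute_loss`: it tests `hasattr(inputs, "attention_mask")` and then deletes `inputs["attention_mask"]`. If `inputs` is a plain `dict` (an assumption; mapping types that also expose keys as attributes would behave differently), `hasattr` inspects attributes rather than keys, so the branch may never run. The sketch below shows the difference; `drop_attention_mask` is a hypothetical helper written for this note, not code from the PR.

```python
# A stand-in batch, assuming the collator hands the trainer a plain dict.
inputs = {
    "input_ids": [[1, 2, 3]],
    "attention_mask": [[1, 1, 1]],
    "position_ids": [[0, 1, 2]],
}

# hasattr() looks up attributes on the dict object, not its keys.
attr_check = hasattr(inputs, "attention_mask")
# A membership test looks at the keys.
key_check = "attention_mask" in inputs


def drop_attention_mask(batch, sample_packing):
    """Return a copy of `batch` without attention_mask when packing with position_ids."""
    if sample_packing and "attention_mask" in batch and "position_ids" in batch:
        batch = {k: v for k, v in batch.items() if k != "attention_mask"}
    return batch


packed = drop_attention_mask(inputs, sample_packing=True)
unpacked = drop_attention_mask(inputs, sample_packing=False)
```

Whether the key-based check is what the author intended depends on the concrete type of `inputs`, which this diff does not show.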
--- a/src/axolotl/integrations/kd/utils.py
+++ b/src/axolotl/integrations/kd/utils.py
@@ -0,0 +1,100 @@
 """Helper KD utils"""
 import math
 from typing import List, Union
 import numpy as np
 import torch
 from torch import FloatTensor, Tensor
 def normalize_logprobs(logprobs: FloatTensor, topk: int) -> FloatTensor:
    """
    Re-normalizes top-k raw logprobs as probabilities, and converts back to logprobs.
    """
    # Ensure raw_logprobs matches kd_online_topk length for tensor operations
    # This should ideally be handled by the caller ensuring correct padding/truncation first
    if logprobs.shape[-1] != topk:
        # pad last dimension of logprobs to match topk length with -inf
        padding_len = topk - logprobs.shape[-1]
        padding_tensor = torch.full(
            (
                *logprobs.shape[:-1],
                padding_len,
            ),  # Takes all dimensions of logprobs except the last, then appends padding_len
            float("-inf"),
            dtype=logprobs.dtype,
            device=logprobs.device,
        )
        logprobs = torch.cat((logprobs, padding_tensor), dim=-1)
    # Convert logprobs at T_online to probabilities
    # use log sum exp trick to avoid underflow
    position_logprobs_lse = torch.logsumexp(logprobs, dim=-1, keepdim=True)
    teacher_probs_t_online = torch.exp(logprobs - position_logprobs_lse)
    # Normalize probabilities (sum to 1)
    # This is important if the top-k from server aren't a full distribution
    teacher_probs_t_online_sum = teacher_probs_t_online.sum(dim=-1, keepdim=True)
    teacher_probs_t_online = teacher_probs_t_online / teacher_probs_t_online_sum
    final_logprobs_tensor = torch.log(teacher_probs_t_online)
    return final_logprobs_tensor
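Review note on `normalize_logprobs`: the function pads the last dimension with `-inf`, subtracts the logsumexp, and returns log-probabilities that sum to 1 in probability space. The sketch below mirrors that arithmetic for a single row using only the standard library, so the effect can be checked by hand. `renormalize_row` and the sample values are hypothetical and not part of the PR.

```python
import math


def renormalize_row(logprobs, topk):
    """Renormalize one row of top-k logprobs so the probabilities sum to 1."""
    # Pad with -inf so the row has exactly `topk` entries; exp(-inf) is 0.0.
    padded = list(logprobs) + [float("-inf")] * (topk - len(logprobs))
    # logsumexp with the maximum subtracted first, for numerical stability.
    peak = max(padded)
    lse = peak + math.log(sum(math.exp(x - peak) for x in padded))
    return [x - lse for x in padded]


# Three raw logprobs covering only 0.8 of the probability mass.
raw = [math.log(0.5), math.log(0.2), math.log(0.1)]
out = renormalize_row(raw, topk=4)
probs = [math.exp(x) for x in out]
```

Subtracting the logsumexp already yields a normalized distribution, so the extra divide-by-sum step in the diff should be a numerical no-op up to rounding.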
 def strided_chunk_views(
    tensor: Union[np.ndarray, torch.Tensor],
    chunks: int,
    dim: int = 0,
    stride: int = 1,
    chunk_size: int | None = None,
 ) -> List[Union[np.ndarray, torch.Tensor]]:
    """
    Split a tensor into chunks along a dimension with striding, prioritizing views over copies.
    Args:
        tensor: Input tensor (numpy array or torch tensor)
        chunks: Number of chunks to create
        dim: Dimension along which to chunk (default: 0)
        stride: Stride between chunk starting positions (default: 1)
        chunk_size: Size of each chunk. If None, calculated automatically (default: None)
    Returns:
        List of tensor chunks (views when possible, copies when necessary)
    """
    # Get the size of the specified dimension
    dim_size = tensor.shape[dim]
    # Calculate chunk size if not provided
    if chunk_size is None:
        chunk_size = (dim_size + chunks - 1) // chunks  # Ceiling division
    chunks_list = []
    for i in range(chunks):
        start_idx = i * stride
        end_idx = min(start_idx + chunk_size, dim_size)
        # Break if we've gone beyond the tensor
        if start_idx >= dim_size:
            break
        # Create slice objects for all dimensions
        slices = [slice(None)] * tensor.ndim
        slices[dim] = slice(start_idx, end_idx)
        chunk = tensor[tuple(slices)]
        chunks_list.append(chunk)
    return chunks_list
 def chunk_overlap(input_tensor: Tensor, chunks: int, dim: int = 0, overlap: int = 1):
    dim_size = input_tensor.shape[dim]
    stride = math.ceil(dim_size / chunks)
    return strided_chunk_views(
        input_tensor, chunks, dim, stride=stride, chunk_size=stride + overlap
    )
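Review note on `chunk_overlap`: it sets the stride to `ceil(dim_size / chunks)` and the chunk size to `stride + overlap`, so each chunk extends `overlap` elements into the next one. The sketch below reproduces only the index arithmetic of `strided_chunk_views` to show which `(start, end)` slices would result. `overlap_ranges` is a hypothetical helper for illustration and works on sizes, not tensors.

```python
import math


def overlap_ranges(dim_size, chunks, overlap=1):
    """Return the (start, end) slice bounds chunk_overlap would request."""
    stride = math.ceil(dim_size / chunks)
    chunk_size = stride + overlap
    ranges = []
    for i in range(chunks):
        start = i * stride
        # Stop once the start index runs past the end of the dimension.
        if start >= dim_size:
            break
        ranges.append((start, min(start + chunk_size, dim_size)))
    return ranges


ranges = overlap_ranges(10, 3)
short = overlap_ranges(4, 8)
```

For a dimension of 10 split into 3 chunks, the slices are `(0, 5)`, `(4, 9)`, and `(8, 10)`; the last chunk is clipped at the end of the dimension. Asking for more chunks than elements yields fewer chunks than requested, because the loop breaks early.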
--- a/src/axolotl/integrations/liger/__init__.py
+++ b/src/axolotl/integrations/liger/__init__.py
@@ -19,16 +19,15 @@ Liger Kernel is the collection of Triton-native kernels for LLM Training.
 It is designed to be performant, correct, and light-weight.
 """
 import inspect
 import logging
 import sys
 from axolotl.integrations.base import BasePlugin
-from axolotl.utils.distributed import is_main_process
+from axolotl.utils.logging import get_logger
 from .args import LigerArgs  # pylint: disable=unused-import. # noqa: F401
 from .utils import patch_with_compile_disable
-LOG = logging.getLogger("axolotl.integrations.liger")
+LOG = get_logger(__name__, use_environ=True)
 class LigerPlugin(BasePlugin):
@@ -85,10 +84,7 @@ class LigerPlugin(BasePlugin):
                kwargs["geglu"] = cfg.liger_glu_activation
            elif "swiglu" in liger_fn_sig.parameters:
                kwargs["swiglu"] = cfg.liger_glu_activation
-            if is_main_process(use_environ=True):
-                LOG.info(
-                    f"Applying LIGER to {cfg.model_config_type} with kwargs: {kwargs}"
-                )
+            LOG.info(f"Applying LIGER to {cfg.model_config_type} with kwargs: {kwargs}")
            apply_liger_fn(**kwargs)
        elif cfg.model_config_type == "jamba":
            from transformers.models.jamba import modeling_jamba
@@ -124,9 +120,9 @@ class LigerPlugin(BasePlugin):
            if cfg.liger_rope:
                # The DeepseekV2 version of RoPE is different than upstream LLaMA.
                # See https://github.com/linkedin/Liger-Kernel/issues/129#issuecomment-2313763528
-                logging.warning("Fused liger_rope is not supported for DeepseekV2.")
+                LOG.warning("Fused liger_rope is not supported for DeepseekV2.")
            if cfg.liger_glu_activation:
-                logging.warning("liger_glu_activation is not supported for DeepseekV2.")
+                LOG.warning("liger_glu_activation is not supported for DeepseekV2.")
            if cfg.liger_rms_norm:
                modeling_mod.DeepseekV2RMSNorm = LigerRMSNorm
            if cfg.liger_glu_activation:
@@ -175,7 +171,17 @@ class LigerPlugin(BasePlugin):
                rms_norm=cfg.liger_rms_norm,
                layer_norm=cfg.liger_layer_norm,
            )
        elif cfg.model_config_type == "granitemoe":
            from liger_kernel.transformers import apply_liger_kernel_to_granite
            apply_liger_kernel_to_granite(
                rope=cfg.liger_rope,
                cross_entropy=cfg.liger_cross_entropy,
                fused_linear_cross_entropy=cfg.liger_fused_linear_cross_entropy,
                rms_norm=cfg.liger_rms_norm,
                swiglu=cfg.liger_glu_activation,
            )
        else:
-            logging.warning(
+            LOG.warning(
                f"Unsupported model config type: {cfg.model_config_type}. Liger not applied."
            )
--- a/src/axolotl/integrations/liger/args.py
+++ b/src/axolotl/integrations/liger/args.py
@@ -15,12 +15,13 @@
 """
 Module for handling LIGER input arguments.
 """
 import logging
 from typing import Optional
 from pydantic import BaseModel, model_validator
-LOG = logging.getLogger("axolotl.integrations.liger.args")
+from axolotl.utils.logging import get_logger
 LOG = get_logger(__name__)
 class LigerArgs(BaseModel):
--- a/src/axolotl/integrations/llm_compressor/plugin.py
+++ b/src/axolotl/integrations/llm_compressor/plugin.py
@@ -3,7 +3,6 @@ Sparse Finetuning plugin for Axolotl — enables handling of sparse neural netwo
 by maintaining masks for zero weights during training.
 """
 import logging
 from functools import wraps
 from typing import Any, Callable, Concatenate, ParamSpec, TypeVar
@@ -16,11 +15,12 @@ from transformers.trainer_callback import TrainerCallback, TrainerControl, Train
 from transformers.training_args import TrainingArguments
 from axolotl.integrations.base import BasePlugin
 from axolotl.utils.logging import get_logger
 P = ParamSpec("P")  # Params for generic function signatures
 R = TypeVar("R")  # Return type for generic function signatures
-LOG = logging.getLogger("axolotl.integrations.llm_compressor")
+LOG = get_logger(__name__)
 class LLMCompressorCallbackHandler(TrainerCallback):
--- a/src/axolotl/integrations/spectrum/__init__.py
+++ b/src/axolotl/integrations/spectrum/__init__.py
@@ -17,14 +17,16 @@ Spectrum Plugin to automatically generate unfrozen parameters based on SNR data.
 """
 import json
 import logging
 import requests
 from axolotl.integrations.base import BasePlugin
 from axolotl.utils.logging import get_logger
 from .args import SpectrumArgs  # pylint: disable=unused-import. # noqa: F401
 LOG = get_logger(__name__)
 def _generate_unfrozen_params_yaml(snr_data, top_fraction=0.5):
    unfrozen_parameters = {}
@@ -83,17 +85,17 @@ class SpectrumPlugin(BasePlugin):
        except FileNotFoundError:
            pass
        except Exception as exc:  # pylint: disable=broad-exception-caught
-            logging.warning(f"Failed to read SNR data from {snr_path}: {exc}")
+            LOG.warning(f"Failed to read SNR data from {snr_path}: {exc}")
        if not snr_data:
            try:
                snr_data = requests.get(snr_url, timeout=60).json()
            except requests.exceptions.RequestException as exc:
-                logging.warning(f"Failed to fetch SNR data from {snr_url}: {exc}")
+                LOG.warning(f"Failed to fetch SNR data from {snr_url}: {exc}")
                return
            # also catch json parsing errors
            except json.JSONDecodeError as exc:
-                logging.warning(f"Failed to parse SNR data from {snr_url}: {exc}")
+                LOG.warning(f"Failed to parse SNR data from {snr_url}: {exc}")
                return
        unfrozen_parameters = _generate_unfrozen_params_yaml(
--- a/src/axolotl/kernels/lora.py
+++ b/src/axolotl/kernels/lora.py
@@ -280,19 +280,19 @@ class LoRA_MLP(torch.autograd.Function):
        # Initialize and compute LoRA gradients
        d_down_A = d_down_B = d_up_A = d_up_B = d_gate_A = d_gate_B = None
-        if down_A is not None:
+        if down_A is not None and down_B is not None:
            d_down_A = h.t() @ (grad_output @ down_B.t())
            d_down_B = (down_A.t() @ h.t()) @ grad_output
            d_down_A *= down_scale
            d_down_B *= down_scale
-        if up_A is not None:
+        if up_A is not None and up_B is not None:
            d_up_A = X.t() @ (grad_up @ up_B.t())
            d_up_B = (up_A.t() @ X.t()) @ grad_up
            d_up_A *= up_scale
            d_up_B *= up_scale
-        if gate_A is not None:
+        if gate_A is not None and gate_B is not None:
            d_gate_A = X.t() @ (grad_gate @ gate_B.t())
            d_gate_B = (gate_A.t() @ X.t()) @ grad_gate
            d_gate_A *= gate_scale
@@ -311,7 +311,7 @@ class LoRA_MLP(torch.autograd.Function):
            del up_weight
            # Note the .to(dtype) only where mixing LoRA with base weights
-            if up_A is not None:
+            if up_A is not None and up_B is not None:
                dX += grad_up @ up_B.to(dtype).t() @ (up_scale * up_A.to(dtype).t())
            # Gate projection gradients
@@ -319,7 +319,7 @@ class LoRA_MLP(torch.autograd.Function):
            dX += grad_gate @ gate_weight.t()
            del gate_weight
-            if gate_A is not None:
+            if gate_A is not None and gate_B is not None:
                dX += (
                    grad_gate
                    @ gate_B.to(dtype).t()
--- a/src/axolotl/loaders/adapter.py
+++ b/src/axolotl/loaders/adapter.py
@@ -1,6 +1,5 @@
 """Adapter loading functionality, including LoRA / QLoRA and associated utils"""
 import logging
 import os
 import types
 from typing import Any
@@ -21,8 +20,9 @@ from transformers import PreTrainedModel
 from axolotl.loaders.utils import get_linear_embedding_layers
 from axolotl.utils.dict import DictDefault
 from axolotl.utils.logging import get_logger
-LOG = logging.getLogger(__name__)
+LOG = get_logger(__name__)
 def setup_quantized_meta_for_peft(model: torch.nn.Module):
Author	SHA1	Message	Date
Wing Lian	2491303c46	improve handling of train len	2025-06-06 22:07:29 -07:00
Wing Lian	2c66483a47	default to dropping last batch in multipack batch sampler	2025-06-05 16:00:24 -07:00
Wing Lian	01382b9a79	fix rebase issues	2025-06-05 15:31:28 -07:00
Wing Lian	cfcd69df0d	rename vars for consistency	2025-06-05 15:29:21 -07:00
Wing Lian	2302b14a84	fix to remove attention_mask	2025-06-05 15:29:20 -07:00
Wing Lian	a8e2bddd19	increase hyperparams_count for gradients for added normalize_topk	2025-06-05 15:29:20 -07:00
Wing Lian	d55a51623f	more KD updates	2025-06-05 15:29:20 -07:00
Wing Lian	73a84ad0dd	post-rebase lint	2025-06-05 15:29:20 -07:00
Wing Lian	3cffe881bb	accept compressed responses for smaller wire payload	2025-06-05 15:29:20 -07:00
Wing Lian	e77d62933d	Fix decay	2025-06-05 15:29:19 -07:00
Wing Lian	3a0faa97ca	fix trainer callback base class	2025-06-05 15:29:19 -07:00
Wing Lian	20602fd93f	chore: lint	2025-06-05 15:29:17 -07:00
Wing Lian	770bb0605a	support for dynamic plugin training args mixins and symmetric kl	2025-06-05 15:28:25 -07:00
Wing Lian	24b96b1c4f	temp scale kd loss at end	2025-06-05 15:19:33 -07:00
Wing Lian	90c7228ff9	use max not min	2025-06-05 15:19:33 -07:00
Wing Lian	9eb53f5c9e	fix length of padding	2025-06-05 15:19:33 -07:00
Wing Lian	225b420dc5	shift off the first empty token	2025-06-05 15:19:33 -07:00
Wing Lian	b75db13615	fix check	2025-06-05 15:19:33 -07:00
Wing Lian	c7b1db329e	logsumexp trick:	2025-06-05 15:19:32 -07:00
Wing Lian	a40e484803	handle when no custom collator is used in plugins	2025-06-05 15:19:32 -07:00
Wing Lian	9899c924f9	suport sampling params/max new tokens	2025-06-05 15:19:32 -07:00
Wing Lian	505009b454	add close to comment block	2025-06-05 15:19:31 -07:00
Wing Lian	b4e96ef12c	online kd wip	2025-06-05 15:19:04 -07:00
Wing Lian	a8d9fab635	don't need temp arg to distill method	2025-06-05 15:18:20 -07:00
Wing Lian	49e2fa825d	additional plugin collator kwargs, don't scale up kd loss by t^2	2025-06-05 15:18:19 -07:00
Wing Lian	7263845207	remove debugging	2025-06-05 15:17:13 -07:00
Wing Lian	5ccfd225cb	collator cls for plugins	2025-06-05 15:16:31 -07:00
Wing Lian	28eb8632a1	more fixes and liger-type chunked loss	2025-06-05 15:14:38 -07:00
Wing Lian	5cfaac3767	WIP chunked KD loss with autograd wrapper	2025-06-05 15:14:37 -07:00
Wing Lian	ca70fb7cb0	simplfy and remove zscore	2025-06-05 15:13:55 -07:00
Wing Lian	22b50d6619	drop top_k before softmax	2025-06-05 15:13:24 -07:00
Wing Lian	a2248673d8	kd trainer has kd temp as part of the init	2025-06-05 15:12:23 -07:00
Wing Lian	0399aefcb3	better handling to drop string fields for kd with raw dataset	2025-06-05 15:12:22 -07:00
Wing Lian	83ad248e5b	fix input args	2025-06-05 15:12:22 -07:00
Wing Lian	6fafe46562	fix collator setup	2025-06-05 15:12:21 -07:00
Wing Lian	0e46367e01	kd fixes	2025-06-05 15:09:59 -07:00
Wing Lian	7909bfb076	add manual seed for flaky test_geglu_backward test (#2763 ) [skip ci]	2025-06-05 09:23:17 -07:00
Wing Lian	cb03c765a1	add uv tooling for e2e gpu tests (#2750 ) * add uv tooling for e2e gpu tests * fixes from PR feedback * simplify check * fix env var * make sure to use uv for other install * use raw_dockerfile_image * Fix import * fix args to experimental dockerfile image call * use updated modal versions	2025-06-05 07:25:06 -07:00
Timofey Klyubin	4440b4a1ce	remove unused field for chat_template.default for DPO training (#2755 ) [skip ci] * remove unused field for chat_template.default "messages" field present in final dataset causes issues with DPO training otherwise * lint and fix tests for new return value * remove unused field for chat_template.default "messages" field present in final dataset causes issues with DPO training otherwise lint and fix tests for new return value fix for updated expected fields for dpo remove unused field for chat_template.default "messages" field present in final dataset causes issues with DPO training otherwise fix test still expecting "messages" field * chore: lint --------- Co-authored-by: Wing Lian <wing@axolotl.ai>	2025-06-05 07:22:58 -07:00
NanoCode012	e8e45b3441	fix: remove hqq (#2759 ) [skip ci]	2025-06-05 07:22:23 -07:00
Wing Lian	c67910fa6f	bump hf deps (#2735 ) [skip ci] * bump hf deps * upgrade liger-kernel too * install cce from fork for transformers fix * fix reference to vocab size in gemma3 patch * use padding_idx instead of pad_token_id * remove fixed gemma3 patch * use updated cce fork * fix local mllama cce patches w docstring * add test for multipack with trainer setup and fix trainer for trainer refactor upstream * bump modal version * guard for iterable datasetS * mllama model arch layout changed in latest transformers * fix batch sampler with drop_last * fix: address upstream vlm changes for lora * fix: update references to old lora target path * fix: remove mllama fa2 patch due to upstream fix * fix: lora kernel patch path for multimodal models * fix: removed mllama from quarto * run test for came optim on 2.6.0+ * fix fsdp2 patch and remove deprecated patch * make sure to set sequence_parallel_degree for grpo * Add SP test for GRPO * add sp to grpo config for trainer * use reward_funcs as kwarg to grpo trainer * fix the comprehension for reward funcs * reward funcs already passed in as args * init sp_group right before training * fix check for adding models to SP context * make sure to pass args to super * upgrade deepspeed * use updated trl and add reasoning flags for vllm * patch the worker --------- Co-authored-by: NanoCode012 <nano@axolotl.ai>	2025-06-05 07:20:33 -07:00
NanoCode012	787880215b	fix(deepspeed): deepspeed config not being set for z3 (#2754) * fix(deepspeed): deepspeed config not being set for z3 * fix: comments	2025-06-03 14:27:09 -07:00
NanoCode012	4b1a29c694	feat(modal): update docker tag to use torch2.6 from torch2.5 (#2749) [skip ci]	2025-06-03 14:26:07 -07:00
NanoCode012	d7fa60662e	feat: add chat_template kwargs (#2694) [skip ci]	2025-06-03 14:25:26 -07:00
Dan Saunders	1d91d905c9	remove deprecated wandb env var (#2751) * remove deprecated wandb env var * remove os.environ wandb setting; unused loggers * remove os.environ wandb setting; unused loggers	2025-06-03 14:04:15 -07:00
mhenrhcsen	2bf61d8e25	fix abbreviation spelling error	2025-06-03 21:30:40 +02:00
mhenrhcsen	68788e419e	feat: add Group Relative Policy Optimization (GPRO) to RLHF documentation	2025-06-03 21:30:40 +02:00
github-actions[bot]	94219f6ee8	chore: update pre-commit hooks (#2745) * chore: update pre-commit hooks * trigger linter when pre commit hooks are updated * fix type checks from upgraded pre-commit --------- Co-authored-by: djsaunde <1245942+djsaunde@users.noreply.github.com> Co-authored-by: Wing Lian <wing@axolotl.ai>	2025-06-02 15:54:29 -07:00
Wing Lian	ecc719f5c7	add support for base image with uv (#2691)	2025-06-02 12:48:55 -07:00
NanoCode012	d5d0dc5938	fix: suppress non-axolotl logs unless it's warning or higher (#2724) * fix: increase log level for root loggers and axolotl's * fix: BasePlugin using wrong logger * fix: update logger to take name from module * feat: change logger class to AxolotlLogger to filter non-axolotl infos or below * fix: change behavior to not disable existing loggers * fix: update logging to respect correct env * chore: fix comment * fix: suppress accelerate log to LOG_LEVEL if not set --------- Co-authored-by: salman <salman.mohammadi@outlook.com>	2025-05-31 12:13:43 +07:00
NanoCode012	5e86c35322	fix(log): remove duplicate merge_lora param (#2742) [skip ci]	2025-05-31 12:13:31 +07:00
NanoCode012	6778856804	Fix: RL base feature parity (#2133 ) * feat: add num_proc and load from cache for rl mapping * fix: refactor sft and rl trainer to set same base args * feat: add report_to to set run name * fix: consolidate handling of fp16, bf16, tf32 kwarg * chore: consolidate eval_strat, loraplus, lr sched, max_length * fix: deprecate old types * fix: adding missing Any * fix: max_steps incorrectly set * fix: remove unnecessary datacollator kwarg insert and pop * fix: update default max_steps * fix: add missing weight_decay handling * fix: ignore max_length for grpo * feat: update CI on trainer_builder * fix: comments * improve handling of warmup/logging steps * use transformers default for logging steps, not None * fix: remove redundant override * fix: lint * feat: allow custom optim for rl methods * fix: duplicate optim setting * fix(test): set sequence_parallel_degree default in base cfg * feat: add handling for seed and SP/ring-attn config * chore: add back return typing from rebase * fix(test): use RLType directly to skip needing to validate * feat: split training builder into sub modules * fix: remove deprecated clause * chore: add missing config to doc * fix: update quarto autodoc * fix: import path for trainer builder and submodules * fix: remove redundant configs from rebase mistake * chore: simplify dynamo check * fix: optimizer_cls_and_kwargs to be passed into trainer_kwargs * fix: add missing rex from rebase * fix: move pop optimizer_cls_and_kwargs * fix: pop optimizer cls in rl too * fix: leftover bug from rebase * fix: update handling of trainer_cls in RL * fix: address pr feedback * feat: call hook_pre_create_trainer for rl * chore: lint * fix: return notimplemented for ppo * feat: moved torch compile to base and refactor collator setting * chore: remove unused importlib.util import * fix: optimizer cls not being popped * feat: move epoch setting to base * fix: catch unhandled custom optimizer * fix: remove duplicate lora plus setting * 
chore: refactor if condition * chore: refactor set_base_training_args into smaller modules * fix: address TrainerBuilderBase class variables to instance var * fix: add handling for beta3 and episilon2 * fix: change to pass dict via arg instead of updating dict * chore: simplify if condition * fix: force access to lr & weight decay in case not provided to early error * fix: remove log sweep * chore: refactor if condition * fix: address renamed cfg * fix: improve handling of cosine hyp * fix: remove unused params * chore: refactor * chore: clarify doc safetensors * fix: update import path to be unified following comments * fix: duplicate kwargs passed * feat: return separate trainer_kwargs * chore: refactor * chore: refactor based on comments * chore: refactor based on comments * fix: move gpustats callback to base * chore: create trainer_cls_args first based on comments * fix: ipo label smoothing passed incorrectly * feat: add optimizer parity for RL methods with test * feat: add parity for optimizer in RM/PRM and add test * fix: remove redundant function override for orpo/cpo batch metrics * fix: improve handling of dpo_label_smoothing and merge issue * fix: test fixture returning wrong field * fix: address avoid direct modify fixture * chore: minor refactor * Revert "chore: refactor" This reverts commit `99c8859eb0`. * feat: rename trainer_builder to builders --------- Co-authored-by: Wing Lian <wing@axolotl.ai>	2025-05-30 11:21:47 +07:00
Wing Lian	ec4ebfd997	Add a few items to faq (#2734) * Add a few items to faq * formatting * chore: lint	2025-05-28 16:20:19 -04:00
Dan Saunders	bde8b5b6bd	fix dist state init before deepspeed setup (#2737)	2025-05-28 14:59:57 -04:00
Dan Saunders	2962a398b7	Lora kernels fix (#2732) * fix lora kernel patching and improve test * simplification	2025-05-28 10:03:43 -04:00
salman	65c5481120	Rank 0-only logging (#2608) Co-authored-by: Wing Lian <wing@axolotl.ai>	2025-05-28 14:57:30 +01:00
salman	5fca214108	QAT (#2590) QAT and quantization w/torchao	2025-05-28 12:35:47 +01:00
NanoCode012	20fda75917	feat(doc): add google analytics to docs (#2708)	2025-05-28 15:51:21 +07:00
NanoCode012	6b6370f4e3	feat(doc): add info on how to use dapo / dr grpo and misc doc fixes (#2673) [skip ci] * feat(doc): add info on how to use dapo / dr grpo * chore: add missing config to docs * fix: missing comment * fix: add missing scheduler from schema * chore: refactor lr scheduler docs * fix: remove log_sweep	2025-05-28 15:51:04 +07:00
mashdragon	add2025253	Fix Mistral chat template (mistral_v7_tekken) (#2710) [skip ci] Per `4b8dd8aae7`	2025-05-28 15:50:47 +07:00
artem	a703560a10	add two checks to handle legacy format interleaved multimodal ds (#2721) [skip ci] * add two checks to handle legacy format interleaved ds * fix: add warning about multiple image using legacy format --------- Co-authored-by: NanoCode012 <nano@axolotl.ai>	2025-05-28 15:49:43 +07:00
NOHHYEOB, BAE	4a80d309e8	Add chat templates for command-a and aya-23-8B models (#2731) [skip ci] * Add chat templates for command-a and aya model * Fix: isolate for-loop update and remove unintended changes	2025-05-28 15:49:16 +07:00
NanoCode012	e33f225434	feat(doc): note lora kernel incompat with RLHF (#2706) [skip ci] * feat(doc): note lora kernel incompat with RLHF * fix: add validation following comments * chore: fix typo following suggestion	2025-05-28 15:48:40 +07:00
NanoCode012	3e6948be97	Fix(doc): clarify data loading for local datasets and splitting samples (#2726) [skip ci] * fix(doc): remove incorrect json dataset loading method * fix(doc): clarify splitting only happens in completion mode * fix: update local file loading on config doc * fix: typo	2025-05-28 15:48:22 +07:00
github-actions[bot]	4a8af60d34	chore: update pre-commit hooks (#2729) Co-authored-by: djsaunde <1245942+djsaunde@users.noreply.github.com>	2025-05-27 11:45:31 -04:00
Dan Saunders	a0941a9271	no need to generate diff file (#2728)	2025-05-27 11:44:06 -04:00
Dan Saunders	5eb01f3df1	Fix quarto (#2717) * missing modules * fix quarto complaints	2025-05-23 21:16:51 -04:00
xzuyn	d27c35ac44	Liger GraniteMoE (#2715)	2025-05-23 18:40:43 -04:00
Dan Saunders	a535b68043	update quarto for model loading refactor (#2716) * update quarto for model loading refactor * fix desc	2025-05-23 16:28:31 -04:00