Compare commits


58 Commits

Author SHA1 Message Date
sunny
a02af506ed pixtral example 2024-10-03 16:11:15 -04:00
sunny
431a0b0f9d added pixtral example 2024-10-03 16:01:21 -04:00
Wing Lian
e1915f5625 Multimodal Vision Llama - rudimentary support (#1940)
---------

Co-authored-by: Sunny <sunny@Sunnys-MacBook-Air.local>
Co-authored-by: sunny <sunnyliu19981005@gmail.com>
2024-10-02 21:02:48 -04:00
Wing Lian
844331005c bump transformers to 4.45.1 (#1936) 2024-09-30 13:56:12 -04:00
Wing Lian
61aa291119 fix for empty lora+ lr embedding (#1932) 2024-09-27 15:58:35 -04:00
Wing Lian
b98d7d7098 update upstream deps versions and replace lora+ (#1928)
* update upstream deps versions and replace lora+

* typo transformers version
2024-09-26 11:33:41 -04:00
Wing Lian
d7eea2ff34 validation fixes 20240923 (#1925)
* validation fixes 20240923

* fix run name for wandb and defaults for chat template fields

* fix gradio inference with llama chat template
2024-09-24 14:05:58 -04:00
Keith Stevens
7b9f669a3a Trigger the original tokenization behavior when no advanced turn settings are provided (#1915) 2024-09-14 08:22:54 -04:00
Wing Lian
5c42f11411 remove dynamic module loader monkeypatch as this was fixed upstream (#1914) 2024-09-13 22:19:54 -04:00
Wing Lian
3853ab7ae9 bump accelerate to 0.34.2 (#1901)
* bump accelerate

* add fixture to predownload the test model

* change fixture
2024-09-07 14:39:31 -04:00
Wing Lian
6e354682e3 fix zero3 integration (#1897)
* fix zero3 integration

* bump transformers and accelerate too
2024-09-05 10:58:50 -04:00
Alpay Ariyak
ab461d83c4 Fix documentation for pre-tokenized dataset (#1894)
It's currently asking to not add BOS and EOS, stating that Axolotl adds them, but this is not true
2024-09-05 23:11:31 +09:00
Wing Lian
93b769a979 lint fix and update gha regex (#1899) 2024-09-05 09:58:21 -04:00
Tijmen de Haan
f18f4268b5 Docs for AMD-based HPC systems (#1891)
* Add documentation for installing on AMD-based HPC systems.

* Accept suggestion to add note about deepspeed

Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>

* Update _quarto.yml with amd_hpc doc

---------

Co-authored-by: Tijmen de Haan <tijmen.dehaan@gmail.com>
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
2024-09-05 18:33:19 +09:00
Wing Lian
dca1fe47d4 fix optimizer + fsdp combination in example (#1893) 2024-09-04 11:28:47 -04:00
Wing Lian
4e5400c732 support for auto_find_batch_size when packing (#1885)
* support for auto_find_batch_size when packing

* make sure to return data from validation

* make sure to return data from validation

* actually expose multipack_real_batches in the config

* calculate gathered efficiency in sampler

* tweak to fix auto find and use actual sampler len for multipack

* uncomment

* use args for bsz when not available from auto find
2024-09-03 20:02:44 -04:00
Wing Lian
0aeb277456 add e2e smoke tests for llama liger integration (#1884)
* add e2e smoke tests for llama liger integration

* fix import

* don't use __main__ for test

* consolidate line
2024-09-01 19:29:37 -04:00
Chiwan Park
bdab3ec587 Fix RMSNorm monkey patch for Gemma models (#1886) 2024-09-01 18:34:24 -04:00
Wing Lian
3c6b9eda2e run pytests with varied pytorch versions too (#1883) 2024-08-31 22:49:35 -04:00
DocShotgun
15408d0f09 Update supported models for Liger Kernel (#1875)
* Update supported models for Liger Kernel

Add Mistral LCE, Gemma LCE, Gemma 2 without LCE (softcapping is not yet implemented for Gemma in Liger Kernel LCE forward), Phi3 without LCE

* move import to their appropriate conditions

* Integrate Phi3 LCE support

https://github.com/linkedin/Liger-Kernel/pull/103/

---------

Co-authored-by: Wing Lian <wing.lian@gmail.com>
2024-08-31 21:59:48 -04:00
Wing Lian
ce33e1ed83 pin liger-kernel to latest 0.2.1 (#1882) [skip ci] 2024-08-30 17:51:18 -04:00
Byron Hsu
e3a38450de Add liger kernel to features (#1881) [skip ci] 2024-08-29 08:19:18 -04:00
Aman Gupta Karmani
7037e3c836 deepseekv2 liger support (#1878)
* deepseekv2 liger support

* add comment

* add missing impl
2024-08-27 23:52:40 -04:00
Aman Gupta Karmani
c1a61ae23c fix liger plugin load issues (#1876) 2024-08-27 23:08:26 -04:00
Aman Gupta Karmani
159b8b9a74 monkey-patch transformers to simplify monkey-patching modeling code (#1877)
* monkey-patch transformers so that monkey-patched modeling code doesnt get overwritten

* unnecessary now

* add comment
2024-08-27 17:22:26 -07:00
Wing Lian
1e43660701 Sample pack trust remote code v2 (#1873)
* fix the multipack patch for remote code models

* add deepseek v2 lite example w fsdp
2024-08-27 13:39:24 -04:00
Chiwan Park
f6362d2a05 Add Liger Kernal support for Qwen2 (#1871) 2024-08-27 13:03:16 -04:00
Wing Lian
17af1d7081 clear cuda cache to help with memory leak/creep (#1858)
* clear cuda cache to help with memory leak/creep

* reverse order of gc
2024-08-26 15:50:26 -04:00
Chiwan Park
2dac1edf72 Fix drop_long_seq bug due to truncation in prompt tokenization strategies when using chat_template (#1867) 2024-08-26 12:56:12 -04:00
Wing Lian
6819c12cee update specturm authors (#1869) 2024-08-26 12:00:36 -04:00
Wing Lian
8e29bdefdd Spectrum plugin (#1866) 2024-08-25 17:54:02 -04:00
Wing Lian
f245964f22 better handling of llama-3 tool rolw (#1782) 2024-08-25 12:31:40 -04:00
Wing Lian
22f4eafa55 simplify logic (#1856) 2024-08-23 20:23:08 -04:00
Wing Lian
77a4b9cda2 change up import to prevent AttributeError (#1863)
* change up import to prevent AttributeError

* tweak patching check for updated upstream
2024-08-23 17:00:01 -04:00
Wing Lian
810ecd4e81 add liger to readme (#1865)
* add liger to readme

* updates from PR feedback
2024-08-23 14:34:03 -04:00
Wing Lian
da0d581a8c add liger example (#1864) 2024-08-23 12:37:50 -04:00
Wing Lian
1f686c576c Liger Kernel integration (#1861)
* add initial plugin support w Liger kernel patches

* integrate the input args classes

* fix liger plugin and dynamic configuration class

* drop untrainable samples and refactor config plugins integration

* fix incorrect inputs and circular imports

* fix bool comparison

* fix for dropping untraibable tokens

* fix licensing so liger integration is Apache 2.0

* add jamba support

* pylint ignore
2024-08-23 12:21:51 -04:00
Wing Lian
e8ff5d5738 don't mess with bnb since it needs compiled wheels (#1859) 2024-08-23 12:18:47 -04:00
Wing Lian
328fd4b3b7 add axolotl community license (#1862) 2024-08-23 11:40:21 -04:00
Wing Lian
fefa95e350 most model types now support flash attention 2 regardless of multipack support (#1854) 2024-08-22 16:39:23 -04:00
Wing Lian
b33dc07a77 rename nightly test and add badge (#1853) 2024-08-22 13:13:33 -04:00
Wing Lian
dcbff16983 run nightly ci builds against upstream main (#1851)
* run nightly ci builds against upstream main

* add test badges

* run the multigpu tests against nightly main builds too
2024-08-22 13:10:54 -04:00
Wing Lian
2f8037fee6 ensure that the hftrainer deepspeed config is set before the trainer class is ever init'ed (#1850) [skip ci] 2024-08-22 13:10:40 -04:00
Aman Gupta Karmani
de4ea2d1f2 docs: minor syntax highlight fix (#1839) 2024-08-22 11:47:34 -04:00
JohanWork
7ed92e61c2 fix: prompt phi (#1845) [skip ci]
* corecting phi system prompt

* phi test

* update

* add test
2024-08-22 11:46:57 -04:00
Wing Lian
9caa3eb699 make the train_on_eos default to turn so all eos tokens are treated the same (#1847) [skip ci] 2024-08-22 11:45:37 -04:00
Wing Lian
5b0b774e38 ensure that the bias is also in the correct dtype (#1848) [skip ci]
* ensure that the bias is also in the correct dtype

* add nightly for dpo-qlora-fsdp
2024-08-22 11:45:00 -04:00
Wing Lian
c3fc529bfc numpy 2.1.0 was released, but incompatible with numba (#1849) [skip ci] 2024-08-22 11:44:45 -04:00
Gal Cohen (galco)
957c956f89 rename jamba example (#1846) [skip ci]
* rename jamba example

* feat: change readme

---------

Co-authored-by: Gal Cohen <galc@ai21.com>
2024-08-22 09:22:55 -04:00
Aman Gupta Karmani
f07802f9fa examples: fix tiny-llama pretrain yml syntax (#1840) 2024-08-21 13:37:51 -04:00
Gal Cohen (galco)
9f917245f6 feat: add jamba chat_template (#1843)
* feat: add jamba chat_template

* fix: black

* feat: jamba fsdp+qlora

---------

Co-authored-by: Gal Cohen <galc@ai21.com>
2024-08-21 13:37:17 -04:00
Aman Gupta Karmani
649c19aba3 pretrain: fix with sample_packing=false (#1841) 2024-08-21 13:36:51 -04:00
Gal Cohen (galco)
5aac4bc284 fix: dont change quant storage dtype in case of fsdp (#1837)
* fix: dont change quant storage dtype in case of fsdp

* fix black

---------

Co-authored-by: Gal Cohen <galc@ai21.com>
2024-08-20 12:41:48 -04:00
Wing Lian
e29931259b optionally save the final FSDP model as a sharded state dict (#1828)
* efficiently save very large llms when using FSDP

* fix parsing and index of sharded chunks

* only save fsdp on main process

* debugging for rename

* save sharded state dict

* remove unused new param

* get state dict directly

* tweak acc merge fsdp to shard the weight files

* sharded_state_dict alongside save_safetensors seems to hang on checkpoint save
2024-08-19 14:59:24 -04:00
Wing Lian
b1d2921222 add validation to prevent 8bit lora finetuning on H100s (#1827) 2024-08-16 21:32:00 -04:00
Wing Lian
803fed3e90 update sklearn versrion, torch compile env vars, don't worry about failure on preprocess load model (#1821)
* update sklearn versrion, torch compile env vars, don't worry about failure on preprocess load model

* There is already a condition check within the function. This outer one is not necessary

Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>

---------

Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
2024-08-16 10:41:51 -04:00
NanoCode012
68a3c7678a fix: parse model_kwargs (#1825) 2024-08-16 07:51:19 -04:00
NanoCode012
f18925fb4b fix: parse eager_attention (#1824) 2024-08-14 09:46:46 -04:00
77 changed files with 4649 additions and 1034 deletions


@@ -6,7 +6,7 @@ on:
- '**.py'
- 'requirements.txt'
- '.github/workflows/*.yml'
- "*.md"
- "*.[q]md"
- "examples/**/*.y[a]?ml"
workflow_dispatch:


@@ -1,6 +1,9 @@
name: docker-multigpu-tests-biweekly
on:
pull_request:
paths:
- 'tests/e2e/multigpu/*.py'
workflow_dispatch:
schedule:
- cron: '0 0 * * 1,4' # Runs at 00:00 UTC every monday & thursday
@@ -18,6 +21,13 @@ jobs:
pytorch: 2.3.1
axolotl_extras:
num_gpus: 2
- cuda: 121
cuda_version: 12.1.1
python_version: "3.11"
pytorch: 2.3.1
axolotl_extras:
num_gpus: 2
nightly_build: "true"
runs-on: [self-hosted, modal]
timeout-minutes: 120
steps:
@@ -39,6 +49,7 @@ jobs:
echo "AXOLOTL_EXTRAS=${{ matrix.axolotl_extras}}" >> $GITHUB_ENV
echo "CUDA=${{ matrix.cuda }}" >> $GITHUB_ENV
echo "N_GPUS=${{ matrix.num_gpus }}" >> $GITHUB_ENV
echo "NIGHTLY_BUILD=${{ matrix.nightly_build }}" >> $GITHUB_ENV
- name: Run tests job on Modal
run: |
modal run cicd.multigpu

.github/workflows/tests-nightly.yml (new file)

@@ -0,0 +1,120 @@
name: Tests Nightly against upstream main
on:
workflow_dispatch:
schedule:
- cron: '0 0 * * *' # Runs at 00:00 UTC every day
jobs:
pre-commit:
name: pre-commit
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/setup-python@v4
with:
python-version: "3.10"
cache: 'pip' # caching pip dependencies
- uses: pre-commit/action@v3.0.0
env:
SKIP: no-commit-to-branch
pytest:
name: PyTest
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
python_version: ["3.10", "3.11"]
pytorch_version: ["2.3.1", "2.4.0"]
timeout-minutes: 20
steps:
- name: Check out repository code
uses: actions/checkout@v3
- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python_version }}
cache: 'pip' # caching pip dependencies
- name: Install PyTorch
run: |
pip3 install torch==${{ matrix.pytorch_version }} --index-url https://download.pytorch.org/whl/cpu
- name: Update requirements.txt
run: |
sed -i 's#^transformers.*#transformers @ git+https://github.com/huggingface/transformers.git@main#' requirements.txt
sed -i 's#^peft.*#peft @ git+https://github.com/huggingface/peft.git@main#' requirements.txt
sed -i 's#^accelerate.*#accelerate @ git+https://github.com/huggingface/accelerate.git@main#' requirements.txt
- name: Install dependencies
run: |
pip3 install --upgrade pip
pip3 install --upgrade packaging
pip3 install -U -e .
pip3 install -r requirements-tests.txt
- name: Run tests
run: |
pytest --ignore=tests/e2e/ tests/
- name: cleanup pip cache
run: |
find "$(pip cache dir)/http-v2" -type f -mtime +14 -exec rm {} \;
docker-e2e-tests:
if: github.repository_owner == 'axolotl-ai-cloud'
# this job needs to be run on self-hosted GPU runners...
runs-on: [self-hosted, modal]
timeout-minutes: 60
needs: [pre-commit, pytest]
strategy:
fail-fast: false
matrix:
include:
- cuda: 121
cuda_version: 12.1.1
python_version: "3.10"
pytorch: 2.3.1
num_gpus: 1
axolotl_extras: mamba-ssm
nightly_build: "true"
- cuda: 121
cuda_version: 12.1.1
python_version: "3.11"
pytorch: 2.3.1
num_gpus: 1
axolotl_extras: mamba-ssm
nightly_build: "true"
- cuda: 124
cuda_version: 12.4.1
python_version: "3.11"
pytorch: 2.4.0
num_gpus: 1
axolotl_extras:
nightly_build: "true"
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Install Python
uses: actions/setup-python@v5
with:
python-version: "3.10"
- name: Install Modal
run: |
python -m pip install --upgrade pip
pip install modal==0.63.64 jinja2
- name: Update env vars
run: |
echo "BASE_TAG=main-base-py${{ matrix.python_version }}-cu${{ matrix.cuda }}-${{ matrix.pytorch }}" >> $GITHUB_ENV
echo "PYTORCH_VERSION=${{ matrix.pytorch}}" >> $GITHUB_ENV
echo "AXOLOTL_ARGS=${{ matrix.axolotl_args}}" >> $GITHUB_ENV
echo "AXOLOTL_EXTRAS=${{ matrix.axolotl_extras}}" >> $GITHUB_ENV
echo "CUDA=${{ matrix.cuda }}" >> $GITHUB_ENV
echo "N_GPUS=${{ matrix.num_gpus }}" >> $GITHUB_ENV
echo "NIGHTLY_BUILD=${{ matrix.nightly_build }}" >> $GITHUB_ENV
- name: Run tests job on Modal
run: |
modal run cicd.tests


@@ -36,6 +36,7 @@ jobs:
fail-fast: false
matrix:
python_version: ["3.10", "3.11"]
pytorch_version: ["2.3.1", "2.4.0"]
timeout-minutes: 20
steps:
@@ -48,6 +49,10 @@ jobs:
python-version: ${{ matrix.python_version }}
cache: 'pip' # caching pip dependencies
- name: Install PyTorch
run: |
pip3 install torch==${{ matrix.pytorch_version }} --index-url https://download.pytorch.org/whl/cpu
- name: Install dependencies
run: |
pip3 install --upgrade pip


@@ -11,6 +11,9 @@ ignore_errors = True
[mypy-axolotl.models.mixtral.*]
ignore_errors = True
[mypy-axolotl.integrations.liger.models.*]
ignore_errors = True
[mypy-axolotl.models.phi.*]
ignore_errors = True

README.md

@@ -1,5 +1,9 @@
# Axolotl
![tests](https://github.com/axolotl-ai-cloud/axolotl/actions/workflows/tests.yml/badge.svg)
![tests-nightly](https://github.com/axolotl-ai-cloud/axolotl/actions/workflows/tests-nightly.yml/badge.svg)
![multigpu-semi-weekly tests](https://github.com/axolotl-ai-cloud/axolotl/actions/workflows/multi-gpu-e2e.yml/badge.svg)
Axolotl is a tool designed to streamline the fine-tuning of various AI models, offering support for multiple configurations and architectures.
Features:
@@ -7,7 +11,7 @@ Features:
- Supports fullfinetune, lora, qlora, relora, and gptq
- Customize configurations using a simple yaml file or CLI overwrite
- Load different dataset formats, use custom formats, or bring your own tokenized datasets
- Integrated with xformer, flash attention, rope scaling, and multipacking
- Integrated with xformer, flash attention, [liger kernel](https://github.com/linkedin/Liger-Kernel), rope scaling, and multipacking
- Works with single GPU or multiple GPUs via FSDP or Deepspeed
- Easily run with Docker locally or on the cloud
- Log results and optionally checkpoints to wandb or mlflow
@@ -22,39 +26,50 @@ Features:
<td>
## Table of Contents
- [Introduction](#axolotl)
- [Supported Features](#axolotl-supports)
- [Quickstart](#quickstart-)
- [Environment](#environment)
- [Docker](#docker)
- [Conda/Pip venv](#condapip-venv)
- [Cloud GPU](#cloud-gpu) - Latitude.sh, JarvisLabs, RunPod
- [Bare Metal Cloud GPU](#bare-metal-cloud-gpu)
- [Windows](#windows)
- [Mac](#mac)
- [Google Colab](#google-colab)
- [Launching on public clouds via SkyPilot](#launching-on-public-clouds-via-skypilot)
- [Launching on public clouds via dstack](#launching-on-public-clouds-via-dstack)
- [Dataset](#dataset)
- [Config](#config)
- [Train](#train)
- [Inference](#inference-playground)
- [Merge LORA to Base](#merge-lora-to-base)
- [Special Tokens](#special-tokens)
- [All Config Options](#all-config-options)
- Advanced Topics
- [Multipack](./docs/multipack.qmd)<svg width="24" height="24" viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M17 13.5v6H5v-12h6m3-3h6v6m0-6-9 9" class="icon_svg-stroke" stroke="#666" stroke-width="1.5" fill="none" fill-rule="evenodd" stroke-linecap="round" stroke-linejoin="round"></path></svg>
- [RLHF & DPO](./docs/rlhf.qmd)<svg width="24" height="24" viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M17 13.5v6H5v-12h6m3-3h6v6m0-6-9 9" class="icon_svg-stroke" stroke="#666" stroke-width="1.5" fill="none" fill-rule="evenodd" stroke-linecap="round" stroke-linejoin="round"></path></svg>
- [Dataset Pre-Processing](./docs/dataset_preprocessing.qmd)<svg width="24" height="24" viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M17 13.5v6H5v-12h6m3-3h6v6m0-6-9 9" class="icon_svg-stroke" stroke="#666" stroke-width="1.5" fill="none" fill-rule="evenodd" stroke-linecap="round" stroke-linejoin="round"></path></svg>
- [Unsloth](./docs/unsloth.qmd)<svg width="24" height="24" viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M17 13.5v6H5v-12h6m3-3h6v6m0-6-9 9" class="icon_svg-stroke" stroke="#666" stroke-width="1.5" fill="none" fill-rule="evenodd" stroke-linecap="round" stroke-linejoin="round"></path></svg>
- [Common Errors](#common-errors-)
- [Tokenization Mismatch b/w Training & Inference](#tokenization-mismatch-bw-inference--training)
- [Debugging Axolotl](#debugging-axolotl)
- [Need Help?](#need-help-)
- [Badge](#badge-)
- [Community Showcase](#community-showcase)
- [Contributing](#contributing-)
- [Sponsors](#sponsors-)
- [Axolotl](#axolotl)
- [Table of Contents](#table-of-contents)
- [Axolotl supports](#axolotl-supports)
- [Quickstart ⚡](#quickstart-)
- [Usage](#usage)
- [Advanced Setup](#advanced-setup)
- [Environment](#environment)
- [Docker](#docker)
- [Conda/Pip venv](#condapip-venv)
- [Cloud GPU](#cloud-gpu)
- [Bare Metal Cloud GPU](#bare-metal-cloud-gpu)
- [LambdaLabs](#lambdalabs)
- [GCP](#gcp)
- [Windows](#windows)
- [Mac](#mac)
- [Google Colab](#google-colab)
- [Launching on public clouds via SkyPilot](#launching-on-public-clouds-via-skypilot)
- [Launching on public clouds via dstack](#launching-on-public-clouds-via-dstack)
- [Dataset](#dataset)
- [Config](#config)
- [All Config Options](#all-config-options)
- [Train](#train)
- [Preprocess dataset](#preprocess-dataset)
- [Multi-GPU](#multi-gpu)
- [DeepSpeed](#deepspeed)
- [FSDP](#fsdp)
- [FSDP + QLoRA](#fsdp--qlora)
- [Weights \& Biases Logging](#weights--biases-logging)
- [Special Tokens](#special-tokens)
- [Liger Kernel](#liger-kernel)
- [Inference Playground](#inference-playground)
- [Merge LORA to base](#merge-lora-to-base)
- [Common Errors 🧰](#common-errors-)
- [Tokenization Mismatch b/w Inference \& Training](#tokenization-mismatch-bw-inference--training)
- [Debugging Axolotl](#debugging-axolotl)
- [Need help? 🙋](#need-help-)
- [Badge ❤🏷️](#badge-)
- [Community Showcase](#community-showcase)
- [Contributing 🤝](#contributing-)
- [Sponsors 🤝❤](#sponsors-)
- [💎 Diamond Sponsors - Contact directly](#-diamond-sponsors---contact-directly)
- [🥇 Gold Sponsors - $5000/mo](#-gold-sponsors---5000mo)
- [🥈 Silver Sponsors - $1000/mo](#-silver-sponsors---1000mo)
- [🥉 Bronze Sponsors - $500/mo](#-bronze-sponsors---500mo)
</td>
<td>
@@ -96,6 +111,7 @@ Features:
| RWKV | ✅ | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| Qwen | ✅ | ✅ | ✅ | ❓ | ❓ | ❓ | ❓ |
| Gemma | ✅ | ✅ | ✅ | ❓ | ❓ | ✅ | ❓ |
| Jamba | ✅ | ✅ | ✅ | ❓ | ❓ | ✅ | ❓ |
✅: supported
❌: not supported
@@ -515,6 +531,25 @@ tokens: # these are delimiters
When you include these tokens in your axolotl config, axolotl adds these tokens to the tokenizer's vocabulary.
##### Liger Kernel
Liger Kernel: Efficient Triton Kernels for LLM Training
https://github.com/linkedin/Liger-Kernel
Liger (LinkedIn GPU Efficient Runtime) Kernel is a collection of Triton kernels designed specifically for LLM training.
It can effectively increase multi-GPU training throughput by 20% and reduces memory usage by 60%. The Liger Kernel
composes well and is compatible with both FSDP and Deepspeed.
```yaml
plugins:
- axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: true
```
### Inference Playground
Axolotl allows you to load your model in an interactive terminal playground for quick experimentation.


@@ -37,6 +37,7 @@ website:
- docs/mac.qmd
- docs/multi-node.qmd
- docs/unsloth.qmd
- docs/amd_hpc.qmd
- section: "Dataset Formats"
contents: docs/dataset-formats/*
- section: "Reference"


@@ -8,6 +8,7 @@ ENV BNB_CUDA_VERSION="{{ CUDA }}"
ENV PYTORCH_VERSION="{{ PYTORCH_VERSION }}"
ENV GITHUB_REF="{{ GITHUB_REF }}"
ENV GITHUB_SHA="{{ GITHUB_SHA }}"
ENV NIGHTLY_BUILD="{{ NIGHTLY_BUILD }}"
RUN apt-get update && \
apt-get install -y --allow-change-held-packages vim curl nano libnccl2 libnccl-dev
@@ -23,6 +24,12 @@ RUN git fetch origin +$GITHUB_REF && \
# If AXOLOTL_EXTRAS is set, append it in brackets
RUN pip install causal_conv1d
RUN if [ "$NIGHTLY_BUILD" = "true" ] ; then \
sed -i 's#^transformers.*#transformers @ git+https://github.com/huggingface/transformers.git@main#' requirements.txt; \
sed -i 's#^peft.*#peft @ git+https://github.com/huggingface/peft.git@main#' requirements.txt; \
sed -i 's#^accelerate.*#accelerate @ git+https://github.com/huggingface/accelerate.git@main#' requirements.txt; \
fi
RUN if [ "$AXOLOTL_EXTRAS" != "" ] ; then \
pip install -e .[deepspeed,flash-attn,optimizers,$AXOLOTL_EXTRAS] $AXOLOTL_ARGS; \
else \


@@ -2,5 +2,5 @@
set -e
pytest --ignore=tests/e2e/ /workspace/axolotl/tests/
pytest -n1 --dist loadfile -v /workspace/axolotl/tests/e2e/patched/
pytest --ignore=tests/e2e/patched/ --ignore=tests/e2e/multigpu/ /workspace/axolotl/tests/e2e/
pytest -n1 --dist loadfile -v /workspace/axolotl/tests/e2e/patched/ /workspace/axolotl/tests/e2e/integrations/
pytest --ignore=tests/e2e/patched/ --ignore=tests/e2e/multigpu/ --ignore=tests/e2e/integrations/ /workspace/axolotl/tests/e2e/


@@ -28,6 +28,7 @@ df_args = {
"CUDA": os.environ.get("CUDA", "121"),
"GITHUB_REF": os.environ.get("GITHUB_REF", "refs/heads/main"),
"GITHUB_SHA": os.environ.get("GITHUB_SHA", ""),
"NIGHTLY_BUILD": os.environ.get("NIGHTLY_BUILD", ""),
}
dockerfile_contents = df_template.render(**df_args)

docs/amd_hpc.qmd (new file)

@@ -0,0 +1,108 @@
---
title: Training with AMD GPUs on HPC Systems
description: A comprehensive guide for using Axolotl on distributed systems with AMD GPUs
---
This guide provides step-by-step instructions for installing and configuring Axolotl on a High-Performance Computing (HPC) environment equipped with AMD GPUs.
## Setup
### 1. Install Python
We recommend using Miniforge, a minimal conda-based Python distribution:
```bash
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh
```
### 2. Configure Python Environment
Add Python to your PATH and ensure it's available at login:
```bash
echo 'export PATH=~/miniforge3/bin:$PATH' >> ~/.bashrc
echo 'if [ -f ~/.bashrc ]; then . ~/.bashrc; fi' >> ~/.bash_profile
```
### 3. Load AMD GPU Software
Load the ROCm module:
```bash
module load rocm/5.7.1
```
Note: The specific module name and version may vary depending on your HPC system. Consult your system documentation for the correct module name.
### 4. Install PyTorch
Install PyTorch with ROCm support:
```bash
pip install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.7 --force-reinstall
```
### 5. Install Flash Attention
Clone and install the Flash Attention repository:
```bash
git clone --recursive https://github.com/ROCmSoftwarePlatform/flash-attention.git
export GPU_ARCHS="gfx90a"
cd flash-attention
export PYTHON_SITE_PACKAGES=$(python -c 'import site; print(site.getsitepackages()[0])')
patch "${PYTHON_SITE_PACKAGES}/torch/utils/hipify/hipify_python.py" hipify_patch.patch
pip install .
```
### 6. Install Axolotl
Clone and install Axolotl:
```bash
git clone https://github.com/axolotl-ai-cloud/axolotl
cd axolotl
pip install packaging ninja
pip install -e .
```
### 7. Apply xformers Workaround
xformers appears to be incompatible with ROCm. Apply the following workarounds:
- Edit $HOME/packages/axolotl/src/axolotl/monkeypatch/llama_attn_hijack_flash.py modifying the code to always return `False` for SwiGLU availability from xformers.
- Edit $HOME/miniforge3/lib/python3.10/site-packages/xformers/ops/swiglu_op.py replacing the "SwiGLU" function with a pass statement.
### 8. Prepare Job Submission Script
Create a script for job submission using your HPC's particular software (e.g. Slurm, PBS). Include necessary environment setup and the command to run Axolotl training. If the compute node(s) do(es) not have internet access, it is recommended to include
```bash
export TRANSFORMERS_OFFLINE=1
export HF_DATASETS_OFFLINE=1
```
### 9. Download Base Model
Download a base model using the Hugging Face CLI:
```bash
huggingface-cli download meta-llama/Meta-Llama-3.1-8B --local-dir ~/hfdata/llama3.1-8B
```
### 10. Create Axolotl Configuration
Create an Axolotl configuration file (YAML format) tailored to your specific training requirements and dataset. Use FSDP for multi-node training.
Note: Deepspeed did not work at the time of testing. However, if anyone managed to get it working, please let us know.
### 11. Preprocess Data
Run preprocessing on the login node:
```bash
CUDA_VISIBLE_DEVICES="" python -m axolotl.cli.preprocess /path/to/your/config.yaml
```
### 12. Train
You are now ready to submit your previously prepared job script. 🚂


@@ -7,7 +7,7 @@ order: 5
- Pass an empty `type:` in your axolotl config.
- Columns in Dataset must be exactly `input_ids`, `attention_mask`, `labels`
- To indicate that a token should be ignored during training, set its corresponding label to `-100`.
- Do not add BOS/EOS. Axolotl will add them for you based on the default tokenizer for the model you're using.
- You must add BOS and EOS, and make sure that you are training on EOS by not setting its label to -100.
- For pretraining, do not truncate/pad documents to the context window length.
- For instruction training, documents must be truncated/padded as desired.
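Following the updated guidance in this hunk (the user supplies BOS/EOS, and EOS must be trained on), a single pre-tokenized sample might look like the sketch below. The token ids are invented for illustration; in practice, use the real ids from your model's tokenizer.

```python
# Minimal sketch of one pre-tokenized sample per the rules above.
# BOS/EOS ids here are illustrative, not from any real tokenizer.
BOS, EOS = 1, 2
prompt = [345, 678]        # instruction tokens we don't want to train on
response = [910, 1112]     # tokens the model should learn to produce

input_ids = [BOS] + prompt + response + [EOS]
attention_mask = [1] * len(input_ids)
# Mask BOS and the prompt with -100 so loss is computed only on the
# response and EOS (EOS keeps its real label, so it is trained on).
labels = [-100] * (1 + len(prompt)) + response + [EOS]

sample = {
    "input_ids": input_ids,
    "attention_mask": attention_mask,
    "labels": labels,
}
assert len(labels) == len(input_ids) == len(attention_mask)
```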


@@ -205,7 +205,7 @@ ds = load_from_disk(f'last_run_prepared/{directory[0]}/')
hi there!. goodbye farewell</s>
```
We can check that the right tokens are ingored by comparing the labels
We can check that the right tokens are ignored by comparing the labels
to each token:
```python

docs/multimodal.qmd (new file)

@@ -0,0 +1,28 @@
# MultiModal / Vision Language Models (BETA)
### Supported Models
- Mllama, i.e. llama with vision models
### Usage
Currently multimodal support is limited and doesn't have full feature parity. To finetune a multimodal Llama w/ LoRA,
you'll need to use the following in YAML in combination with the rest of the required hyperparams.
```yaml
base_model: alpindale/Llama-3.2-11B-Vision-Instruct
processor_type: AutoProcessor
skip_prepare_dataset: true
chat_template: llama3_2_vision
datasets:
- path: HuggingFaceH4/llava-instruct-mix-vsft
type: chat_template
split: train[:1%]
field_messages: messages
remove_unused_columns: false
sample_packing: false
# only finetune the Language model, leave the vision model and vision tower frozen
lora_target_modules: 'language_model.model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'
```
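LoRA target-module strings like the one above are typically applied as a regular expression matched against each module's full name, which is how the vision tower stays frozen. A quick sketch of the matching behavior (the module paths below are illustrative examples, not read from the actual model graph):

```python
import re

# The lora_target_modules pattern from the config above.
pattern = re.compile(
    r"language_model.model.layers.[\d]+."
    r"(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj"
)

targeted = "language_model.model.layers.7.cross_attn.q_proj"
frozen = "vision_model.transformer.layers.3.self_attn.q_proj"

assert pattern.fullmatch(targeted)        # language-model proj layer: adapted
assert pattern.fullmatch(frozen) is None  # vision tower module: left frozen
```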


@@ -34,7 +34,7 @@ unsloth_lora_o: true
```
These options are composable and can be used with multi-gpu finetuning
```
```yaml
unsloth_cross_entropy_loss: true
unsloth_rms_norm: true
unsloth_rope: true


@@ -0,0 +1,67 @@
base_model: deepseek-ai/DeepSeek-V2-Lite
trust_remote_code: true
load_in_8bit: false
load_in_4bit: false
strict: false
datasets:
- path: tatsu-lab/alpaca
type: alpaca
dataset_prepared_path: last_run_prepared
val_set_size: 0.0
output_dir: ./outputs/out
sequence_len: 2048
sample_packing: true
pad_to_sequence_len: true
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 2e-5
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false
gradient_checkpointing: true
gradient_checkpointing_kwargs:
use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: true
warmup_steps: 100
evals_per_epoch: 2
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
special_tokens:
fsdp:
- full_shard
- auto_wrap
fsdp_config:
fsdp_limit_all_gathers: true
fsdp_sync_module_states: true
fsdp_offload_params: true
fsdp_use_orig_params: false
fsdp_cpu_ram_efficient_loading: true
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_transformer_layer_cls_to_wrap: DeepseekV2DecoderLayer
fsdp_state_dict_type: FULL_STATE_DICT
fsdp_sharding_strategy: FULL_SHARD


@@ -0,0 +1,83 @@
base_model: axolotl-quants/DeepSeek-V2.5-bnb-nf4-bf16
trust_remote_code: true
load_in_8bit: false
load_in_4bit: true
strict: false
plugins:
- axolotl.integrations.liger.LigerPlugin
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: true
chat_template: deepseek_v2
datasets:
- path: mlabonne/FineTome-100k
type: chat_template
split: train
dataset_prepared_path: last_run_prepared
val_set_size: 0.0
output_dir: ./outputs/out
sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
adapter: qlora
lora_r: 256
lora_alpha: 256
lora_target_linear: true
peft_use_rslora: true
gradient_accumulation_steps: 1
micro_batch_size: 8
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 2e-5
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false
gradient_checkpointing: true
gradient_checkpointing_kwargs:
use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: true
warmup_steps: 100
evals_per_epoch: 2
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
special_tokens:
fsdp:
- full_shard
- auto_wrap
fsdp_config:
fsdp_limit_all_gathers: true
fsdp_sync_module_states: true
fsdp_offload_params: true
fsdp_use_orig_params: false
fsdp_cpu_ram_efficient_loading: true
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_transformer_layer_cls_to_wrap: DeepseekV2DecoderLayer
fsdp_state_dict_type: FULL_STATE_DICT
fsdp_sharding_strategy: FULL_SHARD


@@ -6,5 +6,5 @@
- ✅ qlora w/ deepspeed Zero-3 needs at least 2x GPUs and 67GiB VRAM (wtf?)
- ✅ qlora single-gpu, ~51GiB VRAM
- ✅ multipack
- FSDP
- FSDP
- ❓ 8-bit LoRA


@@ -0,0 +1,61 @@
base_model: ai21labs/AI21-Jamba-1.5-Large
tokenizer_type: AutoTokenizer
load_in_4bit: true
strict: false
use_tensorboard: true
datasets:
- path: cgato/SlimOrcaDedupCleaned
type: chat_template
chat_template: jamba
drop_system_message: true
dataset_prepared_path: last_run_prepared
val_set_size: 0.0
output_dir: jamba-large-fsdp-qlora-ft
save_safetensors: true
adapter: qlora
sequence_len: 2048
sample_packing: true
pad_to_sequence_len: true
lora_r: 16
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: [down_proj,gate_proj,in_proj,k_proj,o_proj,out_proj,q_proj,up_proj,v_proj,x_proj]
lora_target_linear: false
gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 2
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.00001
train_on_inputs: false
group_by_length: false
bf16: true
tf32: true
gradient_checkpointing: true
gradient_checkpointing_kwargs:
use_reentrant: true
logging_steps: 1
flash_attention: true
warmup_steps: 10
evals_per_epoch: 1
saves_per_epoch: 1
weight_decay: 0.0
fsdp:
- full_shard
- auto_wrap
fsdp_config:
fsdp_limit_all_gathers: true
fsdp_sync_module_states: true
fsdp_offload_params: false
fsdp_use_orig_params: false
fsdp_cpu_ram_efficient_loading: true
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_transformer_layer_cls_to_wrap: JambaAttentionDecoderLayer,JambaMambaDecoderLayer
fsdp_state_dict_type: FULL_STATE_DICT
fsdp_sharding_strategy: FULL_SHARD

View File

@@ -0,0 +1,63 @@
base_model: alpindale/Llama-3.2-11B-Vision-Instruct
processor_type: AutoProcessor
strict: false
# these 3 lines are needed for now to handle vision chat templates w images
skip_prepare_dataset: true
remove_unused_columns: false
sample_packing: false
chat_template: llama3_2_vision
datasets:
- path: HuggingFaceH4/llava-instruct-mix-vsft
type: chat_template
split: train[:1%]
field_messages: messages
dataset_prepared_path: last_run_prepared
val_set_size: 0.0
output_dir: ./outputs/out
adapter: lora
lora_model_dir:
sequence_len: 8192
pad_to_sequence_len: false
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: 'language_model.model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
bf16: true
fp16:
tf32: true
gradient_checkpointing: true
local_rank:
logging_steps: 1
flash_attention: true
eager_attention:
warmup_ratio: 0.1
evals_per_epoch: 1
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
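The `lora_target_modules` regex above selects only the language-model projection layers by full-name match (PEFT treats a string target spec as a pattern and, to our understanding, applies `re.fullmatch` against each module name). A small sketch of which names it captures:

```python
import re

# The pattern from the config above. The unescaped dots match any
# character, which is harmless here since module names use literal dots.
pattern = r"language_model.model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj"

matched = bool(re.fullmatch(pattern, "language_model.model.layers.0.cross_attn.q_proj"))
skipped = bool(re.fullmatch(pattern, "vision_model.transformer.layers.0.self_attn.q_proj"))
print(matched, skipped)  # True False
```

Vision-tower modules fall outside the pattern, so only the text side of the model receives LoRA adapters.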

View File

@@ -0,0 +1,76 @@
base_model: NousResearch/Meta-Llama-3.1-8B
plugins:
- axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: true
strict: false
chat_template: llama3
datasets:
- path: mlabonne/FineTome-100k
type: chat_template
split: train[:20%]
dataset_prepared_path: last_run_prepared
val_set_size: 0.02
output_dir: ./outputs/out
sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 2e-5
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false
gradient_checkpointing: true
gradient_checkpointing_kwargs:
use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: true
warmup_steps: 100
evals_per_epoch: 2
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
- full_shard
- auto_wrap
fsdp_config:
fsdp_limit_all_gathers: true
fsdp_sync_module_states: true
fsdp_offload_params: true
fsdp_use_orig_params: false
fsdp_cpu_ram_efficient_loading: true
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
fsdp_state_dict_type: FULL_STATE_DICT
fsdp_sharding_strategy: FULL_SHARD
fsdp_backward_prefetch: BACKWARD_PRE
special_tokens:
pad_token: <|finetune_right_pad_id|>
eos_token: <|eot_id|>
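The dataset settings above combine an HF split slice with axolotl's own eval split. A rough sketch of the resulting row counts, assuming FineTome-100k has exactly 100,000 rows (an assumption for illustration):

```python
# Sketch: approximate train/eval row counts implied by
# `split: train[:20%]` plus `val_set_size: 0.02` above.
total_rows = 100_000                   # assumed dataset size
rows_loaded = int(total_rows * 0.20)   # split: train[:20%]
eval_rows = int(rows_loaded * 0.02)    # val_set_size: 0.02
train_rows = rows_loaded - eval_rows
print(rows_loaded, eval_rows, train_rows)  # 20000 400 19600
```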

View File

@@ -1,6 +1,4 @@
base_model: NousResearch/Meta-Llama-3-8B
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
base_model: NousResearch/Meta-Llama-3.1-8B
load_in_8bit: false
load_in_4bit: false

View File

@@ -0,0 +1,76 @@
base_model: microsoft/Phi-3.5-mini-instruct
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
load_in_8bit: true
load_in_4bit: false
strict: false
chat_template: phi_3
datasets:
- path: fozziethebeat/alpaca_messages_2k_test
type: chat_template
chat_template: phi_3
field_messages: messages
message_field_role: role
message_field_content: content
roles:
user:
- user
assistant:
- assistant
dataset_prepared_path:
val_set_size: 0.05
output_dir: ./outputs/lora-out
sequence_len: 4096
sample_packing: false
pad_to_sequence_len: true
adapter: lora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
gradient_accumulation_steps: 4
micro_batch_size: 4
num_epochs: 2
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
bfloat16: true
bf16: true
fp16:
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
s2_attention:
warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128
saves_per_epoch: 4
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:

View File

@@ -0,0 +1,65 @@
base_model: mistral-community/pixtral-12b
processor_type: AutoProcessor
load_in_8bit: true
strict: false
# these 3 lines are needed for now to handle vision chat templates w images
skip_prepare_dataset: true
remove_unused_columns: false
sample_packing: false
chat_template: llama3_2_vision
datasets:
- path: HuggingFaceH4/llava-instruct-mix-vsft
type: chat_template
split: train[:1%]
field_messages: messages
dataset_prepared_path: last_run_prepared
val_set_size: 0.0
output_dir: ./outputs/out
adapter: lora
lora_model_dir:
sequence_len: 8192
pad_to_sequence_len: false
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: 'language_model.model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
bf16: true
fp16:
tf32: true
gradient_checkpointing: true
local_rank:
logging_steps: 1
flash_attention: true
eager_attention:
warmup_ratio: 0.1
evals_per_epoch: 1
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:

View File

@@ -72,4 +72,5 @@ fsdp_config:
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_transformer_layer_cls_to_wrap: Qwen2DecoderLayer
fsdp_state_dict_type: FULL_STATE_DICT
fsdp_sharding_strategy: FULL_SHARD
special_tokens:

View File

@@ -9,9 +9,9 @@ strict: false
max_steps: 200
pretraining_dataset:
path: c4
name: en
type: pretrain
- path: allenai/c4
name: en
type: pretrain
dataset_prepared_path:
val_set_size: 0.0
output_dir: ./outputs/model-out

View File

@@ -1,11 +1,11 @@
--extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
packaging==23.2
peft==0.12.0
transformers==4.44.0
peft==0.13.0
transformers==4.45.1
tokenizers>=0.19.1
bitsandbytes==0.43.3
accelerate==0.33.0
datasets==2.20.0
bitsandbytes==0.44.0
accelerate==0.34.2
datasets==2.21.0
deepspeed==0.14.4
pydantic==2.6.3
addict
@@ -21,11 +21,11 @@ optimum==1.16.2
hf_transfer
colorama
numba
numpy>=1.24.4
numpy>=1.24.4,<=2.0.1
# qlora things
evaluate==0.4.1
scipy
scikit-learn==1.2.2
scikit-learn==1.4.2
pynvml
art
fschat @ git+https://github.com/lm-sys/FastChat.git@27a05b04a35510afb1d767ae7e5990cbd278f8fe
@@ -33,6 +33,8 @@ gradio==3.50.2
tensorboard
python-dotenv==1.0.1
autoawq>=0.2.5
triton>=2.3.0
liger-kernel==0.3.0
mamba-ssm==1.2.0.post1

View File

@@ -80,7 +80,7 @@ setup(
dependency_links=dependency_links,
extras_require={
"flash-attn": [
"flash-attn==2.6.2",
"flash-attn==2.6.3",
],
"fused-dense-lib": [
"fused-dense-lib @ git+https://github.com/Dao-AILab/flash-attention@v2.6.2#subdirectory=csrc/fused_dense_lib",

View File

@@ -27,8 +27,10 @@ from transformers.utils import is_torch_bf16_gpu_available
from transformers.utils.import_utils import _is_package_available
from axolotl.common.cli import TrainerCliArgs, load_model_and_tokenizer
from axolotl.integrations.base import PluginManager
from axolotl.logging_config import configure_logging
from axolotl.train import TrainDatasetMeta
from axolotl.utils.chat_templates import chat_templates
from axolotl.utils.config import (
normalize_cfg_datasets,
normalize_config,
@@ -38,7 +40,7 @@ from axolotl.utils.data import load_prepare_dpo_datasets, prepare_dataset
from axolotl.utils.dict import DictDefault
from axolotl.utils.distributed import is_main_process
from axolotl.utils.mlflow_ import setup_mlflow_env_vars
from axolotl.utils.models import load_tokenizer
from axolotl.utils.models import load_processor, load_tokenizer
from axolotl.utils.tokenization import check_dataset_labels
from axolotl.utils.trainer import prepare_opinionated_env, prepare_optim_env
from axolotl.utils.wandb_ import setup_wandb_env_vars
@@ -233,7 +235,8 @@ def do_inference_gradio(
model, tokenizer = load_model_and_tokenizer(cfg=cfg, cli_args=cli_args)
prompter = cli_args.prompter
default_tokens = {"unk_token": "<unk>", "bos_token": "<s>", "eos_token": "</s>"}
# default_tokens = {"unk_token": "<unk>", "bos_token": "<s>", "eos_token": "</s>"}
default_tokens: Dict[str, str] = {}
for token, symbol in default_tokens.items():
# If the token isn't already specified in the config, add it
@@ -241,10 +244,13 @@ def do_inference_gradio(
tokenizer.add_special_tokens({token: symbol})
prompter_module = None
chat_template_str = None
if prompter:
prompter_module = getattr(
importlib.import_module("axolotl.prompters"), prompter
)
elif cfg.chat_template:
chat_template_str = chat_templates(cfg.chat_template)
model = model.to(cfg.device, dtype=cfg.torch_dtype)
@@ -258,7 +264,24 @@ def do_inference_gradio(
)
else:
prompt = instruction.strip()
batch = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)
if chat_template_str:
batch = tokenizer.apply_chat_template(
[
{
"role": "user",
"content": prompt,
}
],
return_tensors="pt",
add_special_tokens=True,
add_generation_prompt=True,
chat_template=chat_template_str,
tokenize=True,
return_dict=True,
)
else:
batch = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)
model.eval()
with torch.no_grad():
@@ -281,6 +304,7 @@ def do_inference_gradio(
streamer = TextIteratorStreamer(tokenizer)
generation_kwargs = {
"inputs": batch["input_ids"].to(cfg.device),
"attention_mask": batch["attention_mask"].to(cfg.device),
"generation_config": generation_config,
"streamer": streamer,
}
@@ -365,6 +389,11 @@ def load_cfg(config: Union[str, Path] = Path("examples/"), **kwargs):
cfg.axolotl_config_path = config
if cfg.get("plugins"):
plugin_manager = PluginManager.get_instance()
for plugin_name in cfg["plugins"]:
plugin_manager.register(plugin_name)
try:
device_props = torch.cuda.get_device_properties("cuda")
gpu_version = "sm_" + str(device_props.major) + str(device_props.minor)
@@ -401,9 +430,12 @@ def load_datasets(
cli_args: TrainerCliArgs,
) -> TrainDatasetMeta:
tokenizer = load_tokenizer(cfg)
processor = load_processor(cfg, tokenizer=tokenizer) if cfg.processor_type else None
train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
cfg, tokenizer
cfg,
tokenizer,
processor=processor,
)
if cli_args.debug or cfg.debug:
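The `load_datasets` change above gates processor creation on `cfg.processor_type`, so text-only configs skip the multimodal path entirely. A minimal sketch of that conditional pattern with stand-in loaders (the dummy `cfg` dict and loader bodies are illustrative, not axolotl's API):

```python
# Illustrative sketch of the "load a processor only for multimodal
# configs" pattern from the diff above; loader bodies are stand-ins.
def load_tokenizer(cfg):
    return {"kind": "tokenizer"}

def load_processor(cfg, tokenizer):
    return {"kind": "processor", "tokenizer": tokenizer}

def prepare(cfg):
    tokenizer = load_tokenizer(cfg)
    # processor stays None unless the config opts in via processor_type
    processor = load_processor(cfg, tokenizer) if cfg.get("processor_type") else None
    return tokenizer, processor

_, proc_mm = prepare({"processor_type": "AutoProcessor"})
_, proc_txt = prepare({})
print(proc_mm is not None, proc_txt)  # True None
```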

View File

@@ -0,0 +1,204 @@
"""
This module provides a CLI to merge sharded FSDP model checkpoints into a single combined checkpoint
"""
import json
import logging
import os
import shutil
from pathlib import Path
from typing import Dict, Union
import fire
import torch
import torch.distributed.checkpoint as dist_cp
import torch.distributed.checkpoint.format_utils as dist_cp_format_utils
import transformers
from accelerate.utils import (
SAFE_WEIGHTS_INDEX_NAME,
SAFE_WEIGHTS_NAME,
WEIGHTS_INDEX_NAME,
WEIGHTS_NAME,
is_torch_version,
)
from dotenv import load_dotenv
from huggingface_hub import split_torch_state_dict_into_shards
from safetensors.torch import save_file as safe_save_file
from torch.distributed.checkpoint.format_utils import _EmptyStateDictLoadPlanner
from axolotl.cli import load_cfg, print_axolotl_text_art
from axolotl.common.cli import TrainerCliArgs
LOG = logging.getLogger("axolotl.cli.merge_sharded_fsdp_weights")
class BFloat16CastPlanner(_EmptyStateDictLoadPlanner):
"""
A custom planner to cast tensors to bfloat16 on the fly during loading.
"""
def commit_tensor(self, read_item, tensor): # pylint: disable=unused-argument
tensor.copy_(tensor.to(torch.bfloat16))
def _distributed_checkpoint_to_merged_weights(
checkpoint_dir: Union[str, Path],
save_path: str,
safe_serialization: bool = False,
max_shard_size: str = "5GB",
):
"""
Passthrough to `torch.distributed.checkpoint.format_utils.dcp_to_torch_save`
Will save under `save_path` as either `model.safetensors` or `pytorch_model.bin`.
"""
state_dict: Dict = {}
save_path_ = Path(save_path)
save_path_.mkdir(exist_ok=True)
dist_cp_format_utils._load_state_dict( # pylint: disable=protected-access
state_dict,
storage_reader=dist_cp.FileSystemReader(checkpoint_dir),
planner=BFloat16CastPlanner(), # pylint: disable=protected-access
no_dist=True,
)
# To handle if state is a dict like {model: {...}}
if len(state_dict.keys()) == 1:
state_dict = state_dict[list(state_dict)[0]]
# Ensure all tensors are in bfloat16
for key, value in state_dict.items():
if isinstance(value, torch.Tensor) and value.dtype != torch.bfloat16:
state_dict[key] = value.to(torch.bfloat16)
weights_name = SAFE_WEIGHTS_NAME if safe_serialization else WEIGHTS_NAME
filename_pattern = weights_name.replace(".bin", "{suffix}.bin").replace(
".safetensors", "{suffix}.safetensors"
)
state_dict_split = split_torch_state_dict_into_shards(
state_dict, filename_pattern=filename_pattern, max_shard_size=max_shard_size
)
# Save index if sharded
index = None
if state_dict_split.is_sharded:
index = {
"metadata": state_dict_split.metadata,
"weight_map": state_dict_split.tensor_to_filename,
}
# Save the model
filename_to_tensors = state_dict_split.filename_to_tensors.items()
for shard_file, tensors in filename_to_tensors:
shard = {tensor: state_dict[tensor] for tensor in tensors}
if safe_serialization:
safe_save_file(
shard, os.path.join(save_path_, shard_file), metadata={"format": "pt"}
)
else:
torch.save(shard, os.path.join(save_path_, shard_file))
if index is not None:
save_index_file = (
SAFE_WEIGHTS_INDEX_NAME if safe_serialization else WEIGHTS_INDEX_NAME
)
save_index_file = os.path.join(save_path_, save_index_file)
# Save the index as well
with open(save_index_file, "w", encoding="utf-8") as fout:
content = json.dumps(index, indent=2, sort_keys=True) + "\n"
fout.write(content)
return save_path_
def merge_fsdp_weights(
checkpoint_dir: str,
output_path: str,
safe_serialization: bool = False,
remove_checkpoint_dir: bool = False,
):
"""
Merge the weights from sharded FSDP model checkpoints into a single combined checkpoint. Should be used if
`SHARDED_STATE_DICT` was used for the model. Weights will be saved to `{output_path}/model.safetensors` if
`safe_serialization` else `pytorch_model.bin`.
Note: this is a CPU-bound process.
Args:
checkpoint_dir (`str`):
The directory containing the FSDP checkpoints (can be either the model or optimizer).
output_path (`str`):
The path to save the merged checkpoint.
safe_serialization (`bool`, *optional*, defaults to `False`):
Whether to save the merged weights with safetensors (recommended).
remove_checkpoint_dir (`bool`, *optional*, defaults to `False`):
Whether to remove the checkpoint directory after merging.
"""
checkpoint_dir_ = Path(checkpoint_dir)
from accelerate.state import PartialState
if not is_torch_version(">=", "2.3.0"):
raise ValueError("`merge_fsdp_weights` requires PyTorch >= 2.3.0")
# Verify that the checkpoint directory exists
if not checkpoint_dir_.exists():
model_path_exists = (checkpoint_dir_ / "pytorch_model_fsdp_0").exists()
optimizer_path_exists = (checkpoint_dir_ / "optimizer_0").exists()
err = f"Tried to load from {checkpoint_dir_} but couldn't find a valid metadata file."
if model_path_exists and optimizer_path_exists:
err += (
" However, potential model and optimizer checkpoint directories exist."
)
err += f" Please pass in either {checkpoint_dir_}/pytorch_model_fsdp_0 or {checkpoint_dir_}/optimizer_0"
err += " instead."
elif model_path_exists:
err += " However, a potential model checkpoint directory exists."
err += (
f" Please try passing in {checkpoint_dir_}/pytorch_model_fsdp_0 instead."
)
elif optimizer_path_exists:
err += " However, a potential optimizer checkpoint directory exists."
err += f" Please try passing in {checkpoint_dir_}/optimizer_0 instead."
raise ValueError(err)
# To setup `save` to work
state = PartialState()
if state.is_main_process:
LOG.info(f"Merging FSDP weights from {checkpoint_dir_}")
save_path = _distributed_checkpoint_to_merged_weights(
checkpoint_dir_, output_path, safe_serialization
)
LOG.info(f"Successfully merged FSDP weights and saved to {save_path}")
if remove_checkpoint_dir:
LOG.info(f"Removing old checkpoint directory {checkpoint_dir_}")
shutil.rmtree(checkpoint_dir_)
state.wait_for_everyone()
def do_cli(config: Path = Path("examples/"), **kwargs):
# pylint: disable=duplicate-code
print_axolotl_text_art()
parser = transformers.HfArgumentParser((TrainerCliArgs))
parsed_cli_args, _ = parser.parse_args_into_dataclasses(
return_remaining_strings=True
)
parsed_cli_args.merge_lora = True
parsed_cfg = load_cfg(
config,
**kwargs,
)
fsdp_dir = Path(parsed_cfg.output_dir) / "pytorch_model_fsdp_0"
merge_fsdp_weights(
checkpoint_dir=str(fsdp_dir),
output_path=str(Path(parsed_cfg.output_dir) / "merged"),
safe_serialization=True,
)
if __name__ == "__main__":
load_dotenv()
fire.Fire(do_cli)
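The `filename_pattern` trick in `_distributed_checkpoint_to_merged_weights` above turns a single weights name into a shard-name template that `split_torch_state_dict_into_shards` can fill in. A pure-Python sketch of how the suffixed shard names come out (the shard index shown is illustrative):

```python
# Sketch: how the shard filename template above expands.
SAFE_WEIGHTS_NAME = "model.safetensors"  # value defined in accelerate.utils

filename_pattern = SAFE_WEIGHTS_NAME.replace(".bin", "{suffix}.bin").replace(
    ".safetensors", "{suffix}.safetensors"
)
# For an unsharded save the suffix is empty; sharded saves get an index.
shard_name = filename_pattern.format(suffix="-00001-of-00002")
print(shard_name)  # model-00001-of-00002.safetensors
```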

View File

@@ -82,7 +82,14 @@ def do_cli(config: Union[Path, str] = Path("examples/"), **kwargs):
# "copying from a non-meta parameter in the checkpoint to a meta parameter in the current model"
warnings.simplefilter("ignore")
with init_empty_weights(include_buffers=True):
AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
# fmt: off
try:
AutoModelForCausalLM.from_pretrained(
model_name, trust_remote_code=True
)
except Exception as exc: # pylint: disable=broad-exception-caught,unused-variable # nosec B110 # noqa F841
pass
# fmt: on
LOG.info(
Fore.GREEN

View File

@@ -4,6 +4,7 @@ Builder for the training args and trainer
"""
import abc
import gc
import importlib
import importlib.util
import logging
@@ -15,11 +16,13 @@ from collections import defaultdict
from dataclasses import dataclass, field
from functools import wraps
from pathlib import Path
from typing import Dict, List, Literal, Optional, Type, Union
from typing import Any, Dict, List, Literal, Optional, Type, Union
import torch
import transformers
from datasets import Dataset
from peft.optimizers import create_loraplus_optimizer
from torch import nn
from torch.optim.lr_scheduler import OneCycleLR
from torch.utils.data import BatchSampler, DataLoader, RandomSampler, SequentialSampler
from transformers import (
@@ -43,7 +46,6 @@ from trl import (
)
from trl.trainer.utils import pad_to_length
from axolotl.loraplus import create_loraplus_optimizer
from axolotl.monkeypatch.multipack import SUPPORTED_MULTIPACK_MODEL_TYPES
from axolotl.monkeypatch.relora import ReLoRACallback, ReLoRAScheduler
from axolotl.utils import is_mlflow_available
@@ -59,12 +61,14 @@ from axolotl.utils.callbacks import (
log_prediction_callback_factory,
)
from axolotl.utils.callbacks.lisa import lisa_callback_factory
from axolotl.utils.chat_templates import chat_templates
from axolotl.utils.collators import (
BatchSamplerDataCollatorForSeq2Seq,
DataCollatorForSeq2Seq,
MambaDataCollator,
V2BatchSamplerDataCollatorForSeq2Seq,
)
from axolotl.utils.collators.mm_chat import MultiModalChatDataCollator
from axolotl.utils.models import ensure_dtype
from axolotl.utils.samplers import MultipackBatchSampler, get_dataset_lengths
from axolotl.utils.schedulers import (
@@ -248,6 +252,10 @@ class AxolotlTrainingMixins:
"help": "workaround to pass an alternate lr scheduler to the HF trainer"
},
)
chat_template: Optional[str] = field(
default=None,
metadata={"help": "Chat template converting chat messages to text"},
)
@dataclass
@@ -454,14 +462,14 @@ class AxolotlTrainer(SchedulerMixin, Trainer):
if self.args.loraplus_lr_ratio is not None:
loraplus_lr_ratio = getattr(self.args, "loraplus_lr_ratio", None)
loraplus_lr_embedding = getattr(
self.args, "loraplus_lr_embedding", None
self.args, "loraplus_lr_embedding", 1e-6
)
self.optimizer = create_loraplus_optimizer( # pylint: disable=attribute-defined-outside-init
opt_model,
optimizer_cls,
optimizer_kwargs,
loraplus_lr_ratio,
loraplus_lr_embedding,
loraplus_lr_ratio=loraplus_lr_ratio,
loraplus_lr_embedding=loraplus_lr_embedding,
**optimizer_kwargs,
)
elif self.args.alternate_optimizer == "optimi_adamw":
from optimi import AdamW
@@ -504,9 +512,10 @@ class AxolotlTrainer(SchedulerMixin, Trainer):
batch_max_len = self.args.max_seq_length
else:
batch_size = 1
batch_max_len = (
self.args.per_device_train_batch_size * self.args.max_seq_length
train_batch_size = (
self.state.train_batch_size or self.args.per_device_train_batch_size
)
batch_max_len = train_batch_size * self.args.max_seq_length
return MultipackBatchSampler(
RandomSampler(self.train_dataset),
lengths=get_dataset_lengths(self.train_dataset),
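The `batch_max_len` fix in this hunk prefers the trainer state's resolved batch size (populated e.g. by `auto_find_batch_size`) and falls back to the configured per-device value. A minimal sketch of that fallback, with illustrative values:

```python
# Sketch of the multipack batch_max_len fallback from the diff above.
def batch_max_len(state_train_batch_size, per_device_train_batch_size, max_seq_length):
    # `or` falls back when the trainer state hasn't resolved a batch size yet
    train_batch_size = state_train_batch_size or per_device_train_batch_size
    return train_batch_size * max_seq_length

print(batch_max_len(None, 2, 4096))  # 8192  (state not yet populated)
print(batch_max_len(4, 2, 4096))     # 16384 (state resolved a batch size of 4)
```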
@@ -966,9 +975,9 @@ class AxolotlDPOTrainer(SchedulerMixin, DPOTrainer):
self.optimizer = create_loraplus_optimizer( # pylint: disable=attribute-defined-outside-init
opt_model,
optimizer_cls,
optimizer_kwargs,
loraplus_lr_ratio,
loraplus_lr_embedding,
loraplus_lr_ratio=loraplus_lr_ratio,
loraplus_lr_embedding=loraplus_lr_embedding,
**optimizer_kwargs,
)
if is_sagemaker_mp_enabled():
@@ -997,6 +1006,14 @@ class AxolotlDPOTrainer(SchedulerMixin, DPOTrainer):
res[key] = res[key][1:]
return res
def training_step(
self, model: nn.Module, inputs: Dict[str, Union[torch.Tensor, Any]]
) -> torch.Tensor:
loss: torch.Tensor = super().training_step(model, inputs)
gc.collect()
torch.cuda.empty_cache()
return loss
class AxolotlORPOTrainer(SchedulerMixin, ORPOTrainer):
"""
@@ -1032,10 +1049,11 @@ class TrainerBuilderBase(abc.ABC):
_model_ref = None
_peft_config = None
def __init__(self, cfg, model, tokenizer):
def __init__(self, cfg, model, tokenizer, processor=None):
self.cfg = cfg
self.model = model
self.tokenizer = tokenizer
self.processor = processor
# in case the model supports tagging, add the axolotl tag.
# This makes sure the tag is correctly pushed even if a user calls
@@ -1369,6 +1387,10 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
training_arguments_kwargs[
"per_device_eval_batch_size"
] = self.cfg.eval_batch_size
if self.cfg.auto_find_batch_size is not None:
training_arguments_kwargs[
"auto_find_batch_size"
] = self.cfg.auto_find_batch_size
training_arguments_kwargs[
"gradient_accumulation_steps"
] = self.cfg.gradient_accumulation_steps
@@ -1402,6 +1424,8 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
report_to = []
if self.cfg.use_wandb:
report_to.append("wandb")
if self.cfg.wandb_name:
training_arguments_kwargs["run_name"] = self.cfg.wandb_name
if self.cfg.use_mlflow:
report_to.append("mlflow")
if self.cfg.use_tensorboard:
@@ -1451,9 +1475,9 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
)
training_arguments_kwargs["sample_packing"] = bool(self.cfg.sample_packing)
training_arguments_kwargs[
"multipack_real_batches"
] = not self.cfg.flash_attention
training_arguments_kwargs["multipack_real_batches"] = (
not self.cfg.flash_attention or self.cfg.multipack_real_batches
)
training_arguments_kwargs["eval_sample_packing"] = bool(
self.cfg.eval_sample_packing
)
@@ -1498,6 +1522,10 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
)
training_arguments_kwargs["model_type"] = self.cfg.model_config_type
training_arguments_kwargs["pretraining"] = bool(self.cfg.pretraining_dataset)
if self.cfg.chat_template:
training_arguments_kwargs["chat_template"] = chat_templates(
self.cfg.chat_template
)
if self.cfg.rl == "orpo":
training_arguments_kwargs["orpo_alpha"] = self.cfg.orpo_alpha
@@ -1559,6 +1587,12 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
)
training_args = self.hook_post_create_training_args(training_args)
# unset run_name so wandb sets up experiment names
if self.cfg.use_wandb and training_args.run_name == training_args.output_dir:
training_args.run_name = ( # pylint: disable=attribute-defined-outside-init
None
)
data_collator_kwargs = {
"padding": True, # True/"longest" is the default
}
@@ -1638,7 +1672,12 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
else:
collator = BatchSamplerDataCollatorForSeq2Seq
else:
collator = DataCollatorForSeq2Seq
if self.cfg.processor_type and self.processor:
collator = MultiModalChatDataCollator
kwargs["processor"] = self.processor
kwargs["chat_template"] = training_args.chat_template
else:
collator = DataCollatorForSeq2Seq
return collator(
self.tokenizer,
@@ -1846,6 +1885,8 @@ class HFRLTrainerBuilder(TrainerBuilderBase):
)
if self.cfg.fsdp:
ensure_dtype(dpo_trainer.model, dtype=self.cfg.torch_dtype)
if self.cfg.rl in ["dpo", "ipo"] and dpo_trainer.ref_model:
ensure_dtype(dpo_trainer.ref_model, dtype=self.cfg.torch_dtype)
dpo_trainer = self.hook_post_create_trainer(dpo_trainer)
for callback in self.get_post_trainer_create_callbacks(dpo_trainer):

View File

@@ -0,0 +1,58 @@
### AXOLOTL COMMUNITY LICENSE AGREEMENT
This Axolotl Community License Agreement (“Agreement”) is entered into by and between Axolotl AI Corp. (“Axolotl”) and
any individual or entity (“Licensee”) who wishes to use the Software (as defined below) in accordance with the terms
and conditions set forth in this Agreement.
1. Definitions
1.1 “Licensee” refers to any individual or entity who has obtained a copy of the Software under this Agreement.
1.2 “Plugin Integration” means independent integration software modules which may or may not be offered by Axolotl,
which may be licensed separately by their respective authors and/or licensors.
1.3 “Software” refers to the specific sub-directory of the Axolotl, Inc. software located at
https://github.com/axolotl-ai-cloud/axolotl/tree/main/src/axolotl/integrations and its subdirectories which
permits Plugin Integrations to integrate with the Axolotl service.
2. Grant of License
2.1 Axolotl hereby grants Licensee a worldwide, non-exclusive, royalty-free, license to use, copy, modify, merge,
publish, distribute, sublicense, and/or otherwise exploit the Software, subject to the following conditions:
- Licensee must comply with all the terms and conditions of this Agreement.
- Licensee must include the original copyright notice and disclaimer of warranty in all copies or substantial
portions of the Software.
2.2 Licensee may use the Software for any lawful purpose, except as restricted in Section 3.
3. Restrictions
3.1 Licensee shall not use the Software for any activity that constitutes a commercial activity of offering for
free or for sale any services, platform, or equivalent to third parties for the purposes of allowing such
third parties to fine-tune artificial intelligence models.
3.2 Licensee shall not:
- Use the Software for any illegal or unauthorized purpose.
- Reverse engineer, decompile, or disassemble the Software.
- Remove or modify any copyright, trademark, or other proprietary notices contained in the Software.
- Use the Software in a way that could damage, disable, overburden, or impair the functionality of the
Software or interfere with any third-party use of the Software.
3.3 Axolotl reserves the right to restrict certain Plugin Integrations for use with the Software. To the extent Licensee integrates a permitted, applicable Plugin Integration with the Software, Licensee shall comply with any additional terms and conditions imposed by the licensors of such Plugin Integration for use of such Plugin Integrations. Licensee shall contact Axolotl if it has questions about whether its use of the Software falls beyond the scope of this Agreement.
4. Intellectual Property Rights
4.1 Axolotl and its contributors retain all intellectual property rights in and to the Software. Licensee
acknowledges that this Agreement does not transfer any ownership rights or intellectual property rights to
Licensee.
5. Disclaimer of Warranty
5.1 THE SOFTWARE IS PROVIDED “AS IS,” WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED
TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT. IN NO EVENT SHALL
THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES, OR OTHER LIABILITY, WHETHER IN AN ACTION OF
CONTRACT, TORT, OR OTHERWISE, ARISING FROM, OUT OF, OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.
6. Termination
6.1 Axolotl may terminate this Agreement at any time if Licensee fails to comply with any of the terms and
conditions set forth herein. Upon termination, Licensee shall cease all use of the Software and destroy any
copies in its possession.
7. Governing Law
7.1 This Agreement shall be governed by and construed in accordance with the laws of the State of California,
without regard to conflicts of laws provisions thereof.
8. Entire Agreement
8.1 This Agreement constitutes the entire agreement between Axolotl and Licensee with respect to the subject matter
hereof and supersedes all prior or contemporaneous understandings or agreements between the parties concerning
the Software, whether written or oral. Axolotl may update the terms of this Agreement from time to time, and
Licensee's continued use of the Software after any such updates shall constitute acceptance of the updated terms
on a go-forward basis. Axolotl will use commercially reasonable efforts to provide Licensee notice of any
material updates. By using the Software, Licensee acknowledges that it has read, understood, and agrees to be
bound by the terms and conditions of this Agreement.
This Agreement was last updated on August 23, 2024.

View File

@@ -0,0 +1,383 @@
# Copyright 2024 Axolotl AI. All rights reserved.
#
# This software may be used and distributed according to
# the terms of the Axolotl Community License Agreement (the "License");
# you may not use this file except in compliance with the License.
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations under
# the License.
"""
Base class for all plugins.
A plugin is a reusable, modular, and self-contained piece of code that extends the functionality of Axolotl.
Plugins can be used to integrate third-party models, modify the training process, or add new features.
To create a new plugin, you need to inherit from the BasePlugin class and implement the required methods.
"""
import importlib
import logging
from typing import List
class BasePlugin:
"""
Base class for all plugins. Defines the interface for plugin methods.
Attributes:
None
Methods:
register(cfg): Registers the plugin with the given configuration.
pre_model_load(cfg): Performs actions before the model is loaded.
post_model_load(cfg, model): Performs actions after the model is loaded.
pre_lora_load(cfg, model): Performs actions before LoRA weights are loaded.
post_lora_load(cfg, model): Performs actions after LoRA weights are loaded.
create_optimizer(cfg, trainer): Creates and returns an optimizer for training.
create_lr_scheduler(cfg, trainer, optimizer): Creates and returns a learning rate scheduler.
add_callbacks_pre_trainer(cfg, model): Adds callbacks to the trainer before training.
add_callbacks_post_trainer(cfg, trainer): Adds callbacks to the trainer after training.
"""
def __init__(self):
"""
Initializes the BasePlugin.
"""
def register(self, cfg):
"""
Registers the plugin with the given configuration.
Parameters:
cfg (dict): The configuration for the plugin.
Returns:
None
"""
def get_input_args(self):
"""
Returns a pydantic model for the plugin's input arguments.
"""
def pre_model_load(self, cfg):
"""
Performs actions before the model is loaded.
Parameters:
cfg (dict): The configuration for the plugin.
Returns:
None
"""
def post_model_load(self, cfg, model):
"""
Performs actions after the model is loaded.
Parameters:
cfg (dict): The configuration for the plugin.
model (object): The loaded model.
Returns:
None
"""
def pre_lora_load(self, cfg, model):
"""
Performs actions before LoRA weights are loaded.
Parameters:
cfg (dict): The configuration for the plugin.
model (object): The loaded model.
Returns:
None
"""
def post_lora_load(self, cfg, model):
"""
Performs actions after LoRA weights are loaded.
Parameters:
cfg (dict): The configuration for the plugin.
model (object): The loaded model.
Returns:
None
"""
def create_optimizer(self, cfg, trainer):
"""
Creates and returns an optimizer for training.
Parameters:
cfg (dict): The configuration for the plugin.
trainer (object): The trainer object for training.
Returns:
object: The created optimizer.
"""
def create_lr_scheduler(self, cfg, trainer, optimizer):
"""
Creates and returns a learning rate scheduler.
Parameters:
cfg (dict): The configuration for the plugin.
trainer (object): The trainer object for training.
optimizer (object): The optimizer for training.
Returns:
object: The created learning rate scheduler.
"""
def add_callbacks_pre_trainer(self, cfg, model):
"""
Adds callbacks to the trainer before training.
Parameters:
cfg (dict): The configuration for the plugin.
model (object): The loaded model.
Returns:
List[callable]: A list of callback functions to be added to the TrainingArgs
"""
def add_callbacks_post_trainer(self, cfg, trainer):
"""
Adds callbacks to the trainer after training.
Parameters:
cfg (dict): The configuration for the plugin.
trainer (object): The trainer object for training.
Returns:
List[callable]: A list of callback functions to be added to the TrainingArgs
"""
def load_plugin(plugin_name: str) -> BasePlugin:
"""
Loads a plugin based on the given plugin name.
The plugin name should be in the format "module_name.class_name".
This function splits the plugin name into module and class, imports the module,
retrieves the class from the module, and creates an instance of the class.
Parameters:
plugin_name (str): The name of the plugin to be loaded. The name should be in the format "module_name.class_name".
Returns:
BasePlugin: An instance of the loaded plugin.
Raises:
ImportError: If the plugin module cannot be imported.
"""
# split the plugin name into module and class
module_name, class_name = plugin_name.rsplit(".", 1)
# import the module
module = importlib.import_module(module_name)
# instantiate the class
plugin_class = getattr(module, class_name)
# create an instance of the class
plugin = plugin_class()
return plugin
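The `rsplit`-plus-`importlib` pattern above works for any importable class, not just plugins. A self-contained sketch of the same logic, exercised against a stdlib class standing in for a real plugin:

```python
import importlib

def load_plugin(plugin_name: str):
    """Resolve 'module_name.ClassName' and return an instance (mirrors the logic above)."""
    module_name, class_name = plugin_name.rsplit(".", 1)
    module = importlib.import_module(module_name)
    plugin_class = getattr(module, class_name)
    return plugin_class()

# Any importable class works; collections.OrderedDict stands in for a plugin class here.
instance = load_plugin("collections.OrderedDict")
```

Note that `rsplit(".", 1)` splits on the last dot only, so dotted package paths like `axolotl.integrations.liger.LigerPlugin` resolve correctly.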
class PluginManager:
"""
The PluginManager class is responsible for loading and managing plugins.
It should be a singleton so it can be accessed from anywhere in the codebase.
Attributes:
plugins (List[BasePlugin]): A list of loaded plugins.
Methods:
get_instance(): Static method to get the singleton instance of PluginManager.
register(plugin_name: str): Registers a new plugin by its name.
pre_model_load(cfg): Calls the pre_model_load method of all registered plugins.
"""
plugins: List[BasePlugin] = []
_instance = None
def __new__(cls):
"""
Creates a new instance of PluginManager if it doesn't exist yet.
"""
if cls._instance is None:
cls._instance = super(PluginManager, cls).__new__(cls)
cls._instance.plugins: List[BasePlugin] = []
return cls._instance
@staticmethod
def get_instance() -> "PluginManager":
"""
Returns the singleton instance of PluginManager.
If the instance doesn't exist, it creates a new one.
"""
if PluginManager._instance is None:
PluginManager()
return PluginManager._instance # type: ignore
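The `__new__`/`get_instance` pair above guarantees a single shared instance regardless of how callers obtain it. A stand-alone sketch of the same pattern (class renamed to keep the example self-contained):

```python
class PluginManagerSketch:
    """Stand-in reproducing the singleton pattern used by PluginManager."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.plugins = []
        return cls._instance

    @staticmethod
    def get_instance() -> "PluginManagerSketch":
        if PluginManagerSketch._instance is None:
            PluginManagerSketch()
        return PluginManagerSketch._instance

# Constructing "twice" and using get_instance all yield the same object,
# so plugin state registered anywhere is visible everywhere.
first = PluginManagerSketch.get_instance()
second = PluginManagerSketch()
first.plugins.append("dummy-plugin")
```

This is why `register()` can be called early in CLI setup while training code later retrieves the same plugin list via `get_instance()`.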
def register(self, plugin_name: str):
"""
Registers a new plugin by its name.
Parameters:
plugin_name (str): The name of the plugin to be registered.
Returns:
None
Raises:
ImportError: If the plugin module cannot be imported.
"""
try:
plugin = load_plugin(plugin_name)
self.plugins.append(plugin)
except ImportError:
logging.error(f"Failed to load plugin: {plugin_name}")
def get_input_args(self):
"""
Returns the input-argument descriptors of all registered plugins.
Returns:
List[str]: The dotted-path names of each plugin's Pydantic args model.
"""
input_args = []
for plugin in self.plugins:
input_args_from_plugin = plugin.get_input_args()
if input_args_from_plugin is not None:
input_args.append(input_args_from_plugin)
return input_args
def pre_model_load(self, cfg):
"""
Calls the pre_model_load method of all registered plugins.
Parameters:
cfg (dict): The configuration for the plugins.
Returns:
None
"""
for plugin in self.plugins:
plugin.pre_model_load(cfg)
def post_model_load(self, cfg, model):
"""
Calls the post_model_load method of all registered plugins.
Parameters:
cfg (dict): The configuration for the plugins.
model (object): The loaded model.
Returns:
None
"""
for plugin in self.plugins:
plugin.post_model_load(cfg, model)
def pre_lora_load(self, cfg, model):
"""
Calls the pre_lora_load method of all registered plugins.
Parameters:
cfg (dict): The configuration for the plugins.
model (object): The loaded model.
Returns:
None
"""
for plugin in self.plugins:
plugin.pre_lora_load(cfg, model)
def post_lora_load(self, cfg, model):
"""
Calls the post_lora_load method of all registered plugins.
Parameters:
cfg (dict): The configuration for the plugins.
model (object): The loaded model.
Returns:
None
"""
for plugin in self.plugins:
plugin.post_lora_load(cfg, model)
def create_optimizer(self, cfg, trainer):
"""
Calls the create_optimizer method of all registered plugins and returns the first non-None optimizer.
Parameters:
cfg (dict): The configuration for the plugins.
trainer (object): The trainer object for training.
Returns:
object: The created optimizer, or None if none was found.
"""
for plugin in self.plugins:
optimizer = plugin.create_optimizer(cfg, trainer)
if optimizer is not None:
return optimizer
return None
def create_lr_scheduler(self, cfg, trainer, optimizer):
"""
Calls the create_lr_scheduler method of all registered plugins and returns the first non-None scheduler.
Parameters:
cfg (dict): The configuration for the plugins.
trainer (object): The trainer object for training.
optimizer (object): The optimizer for training.
Returns:
object: The created learning rate scheduler, or None if none was found.
"""
for plugin in self.plugins:
scheduler = plugin.create_lr_scheduler(cfg, trainer, optimizer)
if scheduler is not None:
return scheduler
return None
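Both `create_optimizer` and `create_lr_scheduler` use a "first plugin wins" dispatch: each plugin is polled in registration order and the first non-None result is returned. The pattern, extracted into a generic sketch (the hooks and config keys below are illustrative):

```python
def first_non_none(hooks, *args):
    """Call each hook in order; return the first result that is not None."""
    for hook in hooks:
        result = hook(*args)
        if result is not None:
            return result
    return None

# Two toy plugins: only the second one provides an "optimizer".
hooks = [
    lambda cfg: None,
    lambda cfg: {"name": "adamw", "lr": cfg["lr"]},
]
optimizer = first_non_none(hooks, {"lr": 2e-4})
```

If no plugin contributes, the caller falls back to Axolotl's default optimizer/scheduler construction.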
def add_callbacks_pre_trainer(self, cfg, model):
"""
Calls the add_callbacks_pre_trainer method of all registered plugins.
Parameters:
cfg (dict): The configuration for the plugins.
model (object): The loaded model.
Returns:
List[callable]: A list of callback functions to be added to the TrainingArgs.
"""
callbacks = []
for plugin in self.plugins:
callbacks.extend(plugin.add_callbacks_pre_trainer(cfg, model))
return callbacks
def add_callbacks_post_trainer(self, cfg, trainer):
"""
Calls the add_callbacks_post_trainer method of all registered plugins.
Parameters:
cfg (dict): The configuration for the plugins.
trainer (object): The trainer object for training.
Returns:
List[callable]: A list of callback functions to be added to the TrainingArgs.
"""
callbacks = []
for plugin in self.plugins:
callbacks.extend(plugin.add_callbacks_post_trainer(cfg, trainer))
return callbacks

# Copyright 2024 Axolotl AI. All rights reserved.
#
# This software may be used and distributed according to
# the terms of the Axolotl Community License Agreement (the "License");
# you may not use this file except in compliance with the License.
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations under
# the License.
"""
Module for merging the plugins' input arguments into the base configuration models.
This lives here, rather than alongside the config models, to avoid circular imports.
"""
from typing import Any, Dict, List
from axolotl.utils.config.models.input.v0_4_1 import (
AxolotlConfigWCapabilities as AxolotlConfigWCapabilitiesBase,
)
from axolotl.utils.config.models.input.v0_4_1 import (
AxolotlInputConfig as AxolotlInputConfigBase,
)
def merge_input_args():
"""
Merges input arguments from registered plugins with the base configurations.
This function retrieves the input arguments from registered plugins using the PluginManager.
It then dynamically creates new classes, AxolotlConfigWCapabilities and AxolotlInputConfig,
that inherit from the base configurations and include the input arguments from the plugins.
Returns:
tuple: A tuple containing the newly created classes, AxolotlConfigWCapabilities and AxolotlInputConfig.
"""
from axolotl.integrations.base import PluginManager
plugin_manager = PluginManager.get_instance()
input_args: List[str] = plugin_manager.get_input_args()
plugin_classes = []
dynamic_input = ""
for plugin_args in input_args:
plugin_module, plugin_cls = plugin_args.rsplit(".", 1)
dynamic_input += f"from {plugin_module} import {plugin_cls}\n"
plugin_classes.append(plugin_cls)
if dynamic_input:
dynamic_input += f"class AxolotlConfigWCapabilities(AxolotlConfigWCapabilitiesBase, {', '.join(plugin_classes)}):\n pass\n"
dynamic_input += f"class AxolotlInputConfig(AxolotlInputConfigBase, {', '.join(plugin_classes)}):\n pass\n"
namespace: Dict[Any, Any] = {}
exec( # pylint: disable=exec-used # nosec B102
dynamic_input, globals(), namespace
)
AxolotlInputConfig = namespace[ # pylint: disable=invalid-name
"AxolotlInputConfig"
]
AxolotlConfigWCapabilities = namespace[ # pylint: disable=invalid-name
"AxolotlConfigWCapabilities"
]
return AxolotlConfigWCapabilities, AxolotlInputConfig
return AxolotlConfigWCapabilitiesBase, AxolotlInputConfigBase
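The `exec`-built classes above are ordinary multiple-inheritance subclasses: the merged config inherits from the base model plus every plugin args class. An equivalent, `exec`-free sketch using `type()` (class names and fields here are stand-ins, not the real Axolotl models):

```python
class AxolotlInputConfigBase:
    """Stand-in for the real base config model."""
    base_model: str = "example/base"

class LigerArgsSketch:
    """Stand-in for a plugin's args model."""
    liger_rope: bool = False

plugin_classes = [LigerArgsSketch]

# Equivalent to the exec-built class: one subclass inheriting the base
# config plus every plugin's args class.
AxolotlInputConfig = type(
    "AxolotlInputConfig", (AxolotlInputConfigBase, *plugin_classes), {}
)

cfg = AxolotlInputConfig()
```

The production code uses `exec` with real import statements so the plugin classes need not be imported ahead of time; the class-composition result is the same.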

Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

# Copyright 2024 Axolotl AI. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Module for the plugin providing LIGER integration with Axolotl.
Liger Kernel is a collection of Triton-native kernels for LLM training,
designed to be performant, correct, and lightweight.
"""
import logging
import sys
from functools import partial
from liger_kernel.transformers.cross_entropy import LigerCrossEntropyLoss
from liger_kernel.transformers.geglu import LigerGEGLUMLP
from liger_kernel.transformers.rms_norm import LigerRMSNorm
from liger_kernel.transformers.rope import liger_rotary_pos_emb
from liger_kernel.transformers.swiglu import LigerSwiGLUMLP
from axolotl.integrations.base import BasePlugin
from .args import LigerArgs  # pylint: disable=unused-import  # noqa: F401
class LigerPlugin(BasePlugin):
"""
Plugin for LIGER integration with Axolotl.
"""
def get_input_args(self):
return "axolotl.integrations.liger.LigerArgs"
def pre_model_load(self, cfg):
if cfg.model_config_type == "llama":
from liger_kernel.transformers.model.llama import (
lce_forward as llama_lce_forward,
)
from transformers.models.llama import modeling_llama
if cfg.liger_rope:
modeling_llama.apply_rotary_pos_emb = liger_rotary_pos_emb
if cfg.liger_rms_norm:
modeling_llama.LlamaRMSNorm = LigerRMSNorm
if cfg.liger_swiglu:
modeling_llama.LlamaMLP = LigerSwiGLUMLP
if cfg.liger_cross_entropy:
modeling_llama.CrossEntropyLoss = LigerCrossEntropyLoss
elif cfg.liger_fused_linear_cross_entropy:
modeling_llama.LlamaForCausalLM.forward = llama_lce_forward
elif cfg.model_config_type == "mistral":
from liger_kernel.transformers.model.mistral import (
lce_forward as mistral_lce_forward,
)
from transformers.models.mistral import modeling_mistral
if cfg.liger_rope:
modeling_mistral.apply_rotary_pos_emb = liger_rotary_pos_emb
if cfg.liger_rms_norm:
modeling_mistral.MistralRMSNorm = LigerRMSNorm
if cfg.liger_swiglu:
modeling_mistral.MistralMLP = LigerSwiGLUMLP
if cfg.liger_cross_entropy:
modeling_mistral.CrossEntropyLoss = LigerCrossEntropyLoss
if cfg.liger_fused_linear_cross_entropy:
modeling_mistral.MistralForCausalLM.forward = mistral_lce_forward
elif cfg.model_config_type == "gemma":
from liger_kernel.transformers.model.gemma import (
lce_forward as gemma_lce_forward,
)
from transformers.models.gemma import modeling_gemma
if cfg.liger_rope:
modeling_gemma.apply_rotary_pos_emb = liger_rotary_pos_emb
if cfg.liger_rms_norm:
modeling_gemma.GemmaRMSNorm = partial(
LigerRMSNorm, offset=1.0, init_fn="zeros", casting_mode="gemma"
)
if cfg.liger_swiglu:
modeling_gemma.GemmaMLP = LigerGEGLUMLP
if cfg.liger_cross_entropy:
modeling_gemma.CrossEntropyLoss = LigerCrossEntropyLoss
if cfg.liger_fused_linear_cross_entropy:
modeling_gemma.GemmaForCausalLM.forward = gemma_lce_forward
elif cfg.model_config_type == "jamba":
from transformers.models.jamba import modeling_jamba
from .models.jamba import lce_forward as jamba_lce_forward
if cfg.liger_rope:
modeling_jamba.apply_rotary_pos_emb = liger_rotary_pos_emb
if cfg.liger_rms_norm:
modeling_jamba.JambaRMSNorm = LigerRMSNorm
if cfg.liger_swiglu:
modeling_jamba.JambaMLP = LigerSwiGLUMLP
if cfg.liger_cross_entropy:
modeling_jamba.CrossEntropyLoss = LigerCrossEntropyLoss
if cfg.liger_fused_linear_cross_entropy:
modeling_jamba.JambaForCausalLM.forward = jamba_lce_forward
elif cfg.model_config_type == "qwen2":
from liger_kernel.transformers.model.qwen2 import (
lce_forward as qwen2_lce_forward,
)
from transformers.models.qwen2 import modeling_qwen2
if cfg.liger_rope:
modeling_qwen2.apply_rotary_pos_emb = liger_rotary_pos_emb
if cfg.liger_rms_norm:
modeling_qwen2.Qwen2RMSNorm = LigerRMSNorm
if cfg.liger_swiglu:
modeling_qwen2.Qwen2MLP = LigerSwiGLUMLP
if cfg.liger_cross_entropy:
modeling_qwen2.CrossEntropyLoss = LigerCrossEntropyLoss
if cfg.liger_fused_linear_cross_entropy:
modeling_qwen2.Qwen2ForCausalLM.forward = qwen2_lce_forward
elif cfg.model_config_type == "deepseek_v2":
from accelerate import init_empty_weights
from transformers import AutoModelForCausalLM
with init_empty_weights():
model = AutoModelForCausalLM.from_pretrained(
cfg.base_model, trust_remote_code=cfg.trust_remote_code or False
)
modeling_mod = sys.modules[model.__class__.__module__]
from .models.deepseekv2 import lce_forward as deepseekv2_lce_forward
if cfg.liger_rope:
# The DeepseekV2 version of RoPE is different from upstream LLaMA's.
# See https://github.com/linkedin/Liger-Kernel/issues/129#issuecomment-2313763528
logging.warning("Fused liger_rope is not supported for DeepseekV2.")
if cfg.liger_rms_norm:
modeling_mod.DeepseekV2RMSNorm = LigerRMSNorm
if cfg.liger_swiglu:
modeling_mod.DeepseekV2MLP.forward = LigerSwiGLUMLP.forward
if cfg.liger_cross_entropy:
modeling_mod.CrossEntropyLoss = LigerCrossEntropyLoss
if cfg.liger_fused_linear_cross_entropy:
modeling_mod.DeepseekV2ForCausalLM.forward = deepseekv2_lce_forward
elif cfg.model_config_type == "gemma2":
from transformers.models.gemma2 import modeling_gemma2
if cfg.liger_rope:
modeling_gemma2.apply_rotary_pos_emb = liger_rotary_pos_emb
if cfg.liger_rms_norm:
modeling_gemma2.Gemma2RMSNorm = partial(
LigerRMSNorm, offset=1.0, init_fn="zeros", casting_mode="gemma"
)
if cfg.liger_swiglu:
modeling_gemma2.Gemma2MLP = LigerGEGLUMLP
if cfg.liger_cross_entropy:
modeling_gemma2.CrossEntropyLoss = LigerCrossEntropyLoss
if cfg.liger_fused_linear_cross_entropy:
logging.warning(
"Fused linear cross entropy is not supported for Gemma 2."
)
elif cfg.model_config_type == "phi3":
from liger_kernel.transformers.model.phi3 import (
lce_forward as phi3_lce_forward,
)
from transformers.models.phi3 import modeling_phi3
if cfg.liger_rope:
modeling_phi3.apply_rotary_pos_emb = liger_rotary_pos_emb
if cfg.liger_rms_norm:
modeling_phi3.Phi3RMSNorm = LigerRMSNorm
if cfg.liger_swiglu:
modeling_phi3.Phi3MLP = LigerSwiGLUMLP
if cfg.liger_cross_entropy:
modeling_phi3.CrossEntropyLoss = LigerCrossEntropyLoss
if cfg.liger_fused_linear_cross_entropy:
modeling_phi3.Phi3ForCausalLM.forward = phi3_lce_forward
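Every branch above patches the same way: it rebinds a symbol on the `transformers` modeling module (e.g. `modeling_llama.LlamaRMSNorm = LigerRMSNorm`) before the model is built, so construction picks up the Liger class. A toy module shows why the swap must happen in `pre_model_load`, i.e. before any layers are instantiated (all names below are stand-ins):

```python
import types

# Stand-in for e.g. transformers.models.llama.modeling_llama:
modeling = types.ModuleType("modeling_sketch")

class StockRMSNorm:
    kernel = "eager"

class LigerRMSNormSketch:
    kernel = "triton"

modeling.LlamaRMSNorm = StockRMSNorm

def build_model():
    # Model construction resolves the symbol at call time, so a patch
    # applied beforehand changes which class gets instantiated.
    return modeling.LlamaRMSNorm()

# The plugin's pre_model_load hook performs the swap first:
modeling.LlamaRMSNorm = LigerRMSNormSketch
norm = build_model()
```

Patching after model load would leave already-constructed layers on the stock implementation, which is why these replacements live in `pre_model_load` rather than `post_model_load`.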

# Copyright 2024 Axolotl AI. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Module for handling LIGER input arguments.
"""
from typing import Optional
from pydantic import BaseModel
class LigerArgs(BaseModel):
"""
Input args for LIGER.
"""
liger_rope: Optional[bool] = None
liger_rms_norm: Optional[bool] = None
liger_swiglu: Optional[bool] = None
liger_cross_entropy: Optional[bool] = None
liger_fused_linear_cross_entropy: Optional[bool] = None
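All flags default to None, so a kernel is only swapped in when a config explicitly enables it. A dataclass stand-in for the pydantic model (illustrative only; the real model is the pydantic `LigerArgs` above) makes the opt-in behavior concrete:

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class LigerArgsSketch:
    """Dataclass stand-in for the pydantic LigerArgs model."""
    liger_rope: Optional[bool] = None
    liger_rms_norm: Optional[bool] = None
    liger_swiglu: Optional[bool] = None
    liger_cross_entropy: Optional[bool] = None
    liger_fused_linear_cross_entropy: Optional[bool] = None

# Only explicitly enabled kernels become truthy; unset flags stay None
# and every `if cfg.liger_*:` branch in the plugin is skipped.
args = LigerArgsSketch(liger_rope=True, liger_rms_norm=True)
enabled = [name for name, value in asdict(args).items() if value]
```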

"""
DeepseekV2 model with LigerFusedLinearCrossEntropyLoss
"""
# pylint: disable=duplicate-code
from typing import List, Optional, Tuple, Union
import torch
from liger_kernel.transformers.fused_linear_cross_entropy import (
LigerFusedLinearCrossEntropyLoss,
)
from torch.nn import CrossEntropyLoss
from transformers.modeling_outputs import CausalLMOutputWithPast
# @add_start_docstrings_to_model_forward(DeepseekV2_INPUTS_DOCSTRING)
# @replace_return_docstrings(
# output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC
# )
def lce_forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, CausalLMOutputWithPast]:
r"""
Args:
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
(masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
Returns:
Example:
```python
>>> from transformers import AutoTokenizer, DeepseekV2ForCausalLM
>>> model = DeepseekV2ForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
>>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
```"""
output_attentions = (
output_attentions
if output_attentions is not None
else self.config.output_attentions
)
output_hidden_states = (
output_hidden_states
if output_hidden_states is not None
else self.config.output_hidden_states
)
return_dict = (
return_dict if return_dict is not None else self.config.use_return_dict
)
# decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
outputs = self.model(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
hidden_states = outputs[0]
loss = None
logits = None
if self.training and labels is not None:
shift_hidden_states = hidden_states[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
# flatten tokens
shift_hidden_states = shift_hidden_states.view(-1, self.config.hidden_size)
shift_labels = shift_labels.view(-1)
lce = LigerFusedLinearCrossEntropyLoss()
loss = lce(self.lm_head.weight, shift_hidden_states, shift_labels)
else:
logits = self.lm_head(hidden_states)
logits = logits.float()
loss = None
if labels is not None:
# Shift so that tokens < n predict n
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
# Flatten the tokens
loss_fct = CrossEntropyLoss()
shift_logits = shift_logits.view(-1, self.config.vocab_size)
shift_labels = shift_labels.view(-1)
# Enable model parallelism
shift_labels = shift_labels.to(shift_logits.device)
loss = loss_fct(shift_logits, shift_labels)
if not return_dict:
output = (logits,) + outputs[1:]
return (loss,) + output if loss is not None else output
return CausalLMOutputWithPast(
loss=loss,
logits=logits,
past_key_values=outputs.past_key_values,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
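Both branches of `lce_forward` apply the same next-token alignment: hidden states (or logits) keep positions `[:-1]` while labels keep positions `[1:]`, so position `t` is scored against token `t+1`, and `-100` entries are excluded from the loss. A list-based sketch of that slicing (no tensors, purely to illustrate the alignment):

```python
def shift_for_next_token(states, labels, ignore_index=-100):
    """Align position t with label t+1 and drop ignored positions
    (a list-based sketch of the tensor slicing in lce_forward)."""
    shifted_states = states[:-1]   # states[..., :-1, :]
    shifted_labels = labels[1:]    # labels[..., 1:]
    return [
        (state, label)
        for state, label in zip(shifted_states, shifted_labels)
        if label != ignore_index
    ]

# Hidden states for four positions; the first two labels are masked prompt tokens.
pairs = shift_for_next_token(["h0", "h1", "h2", "h3"], [-100, -100, 7, 9])
```

The fused-linear path feeds the surviving `(hidden_state, label)` pairs directly to `LigerFusedLinearCrossEntropyLoss`, which avoids materializing the full vocabulary-sized logits tensor during training.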

"""
Jamba model with LigerFusedLinearCrossEntropyLoss
"""
# pylint: disable=duplicate-code
from typing import Optional, Tuple, Union
import torch
from liger_kernel.transformers.fused_linear_cross_entropy import (
LigerFusedLinearCrossEntropyLoss,
)
from torch.nn import CrossEntropyLoss
from transformers.modeling_outputs import MoeCausalLMOutputWithPast
from transformers.models.jamba.modeling_jamba import (
_CONFIG_FOR_DOC,
JAMBA_INPUTS_DOCSTRING,
HybridMambaAttentionDynamicCache,
load_balancing_loss_func,
)
from transformers.utils import (
add_start_docstrings_to_model_forward,
replace_return_docstrings,
)
@add_start_docstrings_to_model_forward(JAMBA_INPUTS_DOCSTRING)
@replace_return_docstrings(
output_type=MoeCausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC
)
def lce_forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[HybridMambaAttentionDynamicCache] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
output_router_logits: Optional[bool] = None,
return_dict: Optional[bool] = None,
cache_position: Optional[torch.LongTensor] = None,
num_logits_to_keep: Optional[Union[int, None]] = None,
) -> Union[Tuple, MoeCausalLMOutputWithPast]:
r"""
Args:
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
(masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
num_logits_to_keep (`int` or `None`, *optional*):
Calculate logits for the last `num_logits_to_keep` tokens. If `None`, calculate logits for all
`input_ids`. Only last token logits are needed for generation, and calculating them only for that token
can save memory, which becomes pretty significant for long sequences.
Returns:
Example:
```python
>>> from transformers import AutoTokenizer, JambaForCausalLM
>>> model = JambaForCausalLM.from_pretrained("ai21labs/Jamba-v0.1")
>>> tokenizer = AutoTokenizer.from_pretrained("ai21labs/Jamba-v0.1")
>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
```"""
output_attentions = (
output_attentions
if output_attentions is not None
else self.config.output_attentions
)
output_router_logits = (
output_router_logits
if output_router_logits is not None
else self.config.output_router_logits
)
output_hidden_states = (
output_hidden_states
if output_hidden_states is not None
else self.config.output_hidden_states
)
return_dict = (
return_dict if return_dict is not None else self.config.use_return_dict
)
# decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
outputs = self.model(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
output_router_logits=output_router_logits,
cache_position=cache_position,
return_dict=return_dict,
)
hidden_states = outputs[0]
loss = None
logits = None
if self.training:
shift_hidden_states = hidden_states[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
# flatten tokens
shift_hidden_states = shift_hidden_states.view(-1, self.config.hidden_size)
shift_labels = shift_labels.view(-1)
lce = LigerFusedLinearCrossEntropyLoss()
loss = lce(self.lm_head.weight, shift_hidden_states, shift_labels)
else:
if num_logits_to_keep is None:
logits = self.lm_head(hidden_states)
else:
logits = self.lm_head(hidden_states[..., -num_logits_to_keep:, :])
logits = logits.float()
if labels is not None:
# Shift so that tokens < n predict n
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
# Flatten the tokens
loss_fct = CrossEntropyLoss()
shift_logits = shift_logits.view(-1, self.config.vocab_size)
shift_labels = shift_labels.view(-1)
# Enable model parallelism
shift_labels = shift_labels.to(shift_logits.device)
loss = loss_fct(shift_logits, shift_labels)
aux_loss = None
if output_router_logits:
aux_loss = load_balancing_loss_func(
outputs.router_logits if return_dict else outputs[-1],
self.num_experts,
self.num_experts_per_tok,
attention_mask,
)
if labels is not None:
loss += self.router_aux_loss_coef * aux_loss.to(
loss.device
) # make sure to reside in the same device
if not return_dict:
output = (logits,) + outputs[1:]
if output_router_logits:
output = (aux_loss,) + output
return (loss,) + output if loss is not None else output
return MoeCausalLMOutputWithPast(
loss=loss,
aux_loss=aux_loss,
logits=logits,
past_key_values=outputs.past_key_values,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
router_logits=outputs.router_logits,
)
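Both Liger patches above rely on the same causal shift before the loss: position `i`'s hidden state (or logit) is scored against token `i+1`. A dependency-free sketch of that alignment, using toy string stand-ins rather than real tensors:

```python
# Toy illustration of the shift in the patched forward passes above:
# hidden[..., :-1, :] is paired with labels[..., 1:], so each position
# predicts the *next* token.
hidden = ["h0", "h1", "h2", "h3"]  # stand-ins for per-position hidden states
labels = ["t0", "t1", "t2", "t3"]  # stand-ins for token ids

shift_hidden = hidden[:-1]  # drop the last position (nothing follows it)
shift_labels = labels[1:]   # drop the first token (nothing predicts it)

pairs = list(zip(shift_hidden, shift_labels))
print(pairs)  # [('h0', 't1'), ('h1', 't2'), ('h2', 't3')]
```

In the training branch this pairing feeds `LigerFusedLinearCrossEntropyLoss` together with `lm_head.weight`, so the full logits tensor is never materialized.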


@@ -0,0 +1,202 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.


@@ -0,0 +1,21 @@
## Spectrum: Targeted Training on Signal to Noise Ratio
by Eric Hartford, Lucas Atkins, Fernando Fernandes, David Golchinfar
This plugin contains code to freeze the bottom fraction of modules in a model, based on the Signal-to-Noise Ratio (SNR).
### Overview
Spectrum is a tool for scanning and evaluating the Signal-to-Noise Ratio (SNR) of layers in large language models.
Training only the top n% of layers with the highest SNR concentrates compute on the most informative layers and reduces the number of trainable parameters.
### Usage
```yaml
plugins:
- axolotl.integrations.spectrum.SpectrumPlugin
spectrum_top_fraction: 0.5
# Optional if using a pre-scanned model as your base_model. Useful if using a model mirror
spectrum_model_name: meta-llama/Meta-Llama-3.1-8B
```
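The selection rule behind `spectrum_top_fraction` can be sketched in plain Python. The SNR values below are invented for illustration; the real plugin reads them from a pre-computed JSON file per model:

```python
# Minimal sketch of Spectrum's selection rule: rank layers by SNR within
# each layer type, keep the top fraction, and leave the rest frozen.
snr_data = {
    "model.layers.0.mlp.down_proj": {"type": "mlp.down_proj", "snr": 0.12},
    "model.layers.1.mlp.down_proj": {"type": "mlp.down_proj", "snr": 0.87},
    "model.layers.2.mlp.down_proj": {"type": "mlp.down_proj", "snr": 0.45},
    "model.layers.3.mlp.down_proj": {"type": "mlp.down_proj", "snr": 0.63},
}

def select_unfrozen(snr_data, top_fraction=0.5):
    by_type = {}
    for name, info in snr_data.items():
        by_type.setdefault(info["type"], []).append((name, info["snr"]))
    unfrozen = []
    for layers in by_type.values():
        layers.sort(key=lambda x: x[1], reverse=True)  # highest SNR first
        keep = int(len(layers) * top_fraction)
        unfrozen.extend(name for name, _ in layers[:keep])
    return unfrozen

print(select_unfrozen(snr_data))
# → ['model.layers.1.mlp.down_proj', 'model.layers.3.mlp.down_proj']
```

The plugin then prepends `^lm_head.weight$` and `^model.embed_tokens.weight$` and writes the result into `unfrozen_parameters`.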


@@ -0,0 +1,102 @@
# Copyright 2024 Axolotl AI. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Spectrum Plugin to automatically generate unfrozen parameters based on SNR data.
"""
import json
import logging
import requests
from axolotl.integrations.base import BasePlugin
from .args import SpectrumArgs  # pylint: disable=unused-import  # noqa: F401
def _generate_unfrozen_params_yaml(snr_data, top_fraction=0.5):
unfrozen_parameters = {}
for layer_name, info in snr_data.items():
layer_type = info["type"]
if layer_type not in unfrozen_parameters:
unfrozen_parameters[layer_type] = []
unfrozen_parameters[layer_type].append((layer_name, info["snr"]))
top_layers_by_type = {}
for layer_type, layers in unfrozen_parameters.items():
layers_sorted = sorted(layers, key=lambda x: x[1], reverse=True)
num_top_layers = int(len(layers) * top_fraction)
top_layers_by_type[layer_type] = [
layer[0] for layer in layers_sorted[:num_top_layers]
]
unfrozen_parameters = [
"^lm_head.weight$",
"^model.embed_tokens.weight$",
]
for layer_type, layer_names in top_layers_by_type.items():
for layer_name in layer_names:
unfrozen_parameters.append(layer_name)
return unfrozen_parameters
class SpectrumPlugin(BasePlugin):
"""
Spectrum Plugin to automatically generate unfrozen parameters based on SNR data.
"""
base_url = "https://raw.githubusercontent.com/cognitivecomputations/spectrum/main/model_snr_results/"
base_path = "./model_snr_results/"
snr_file_template = "snr_results_{model_name_slug}.json"
def get_input_args(self):
return "axolotl.integrations.spectrum.SpectrumArgs"
def pre_model_load(self, cfg):
if cfg.get("spectrum_model_name"):
model_name = cfg["spectrum_model_name"]
else:
model_name = cfg["base_model"]
top_fraction = cfg.get("spectrum_top_fraction", 0.5)
model_slug = model_name.replace("/", "-").replace("_", "-")
snr_url = self.base_url + self.snr_file_template.format(
model_name_slug=model_slug
)
snr_path = self.base_path + self.snr_file_template.format(
model_name_slug=model_slug
)
# first check if the files exist locally and read the json
snr_data = None
try:
with open(snr_path, "r", encoding="utf-8") as fin:
snr_data = json.load(fin)
except FileNotFoundError:
pass
except Exception as exc: # pylint: disable=broad-exception-caught
logging.warning(f"Failed to read SNR data from {snr_path}: {exc}")
if not snr_data:
try:
snr_data = requests.get(snr_url, timeout=60).json()
except requests.exceptions.RequestException as exc:
logging.warning(f"Failed to fetch SNR data from {snr_url}: {exc}")
return
# also catch json parsing errors
except json.JSONDecodeError as exc:
logging.warning(f"Failed to parse SNR data from {snr_url}: {exc}")
return
unfrozen_parameters = _generate_unfrozen_params_yaml(
snr_data, top_fraction=top_fraction
)
cfg["unfrozen_parameters"] = unfrozen_parameters


@@ -0,0 +1,29 @@
# Copyright 2024 Axolotl AI. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Module for handling Spectrum input arguments.
"""
from typing import Optional
from pydantic import BaseModel
class SpectrumArgs(BaseModel):
"""
Input args for Spectrum.
"""
spectrum_top_fraction: Optional[float] = 0.5
spectrum_model_name: Optional[str] = None


@@ -1,133 +0,0 @@
"""Module for LoRA+"""
# MIT License
#
# Copyright (c) 2024 nikhil-ghosh-berkeley
# https://github.com/nikhil-ghosh-berkeley/loraplus
import logging
from functools import reduce
from peft.tuners import lora
from torch import nn
from transformers.pytorch_utils import ALL_LAYERNORM_LAYERS
from transformers.trainer_pt_utils import get_parameter_names
LOG = logging.getLogger("axolotl.loraplus")
def get_module(name, opt_model):
"""
Retrieve a module from a model using its parameter name.
Args:
name (str): Full name of the parameter, typically including module path.
opt_model (torch.nn.Module): The model from which to retrieve the module.
Returns:
Module corresponding to the given name.
"""
parent_idx = 2 if "lora" in name else 1
module_names = name.split(sep=".")[:-parent_idx]
module = reduce(getattr, module_names, opt_model)
return module
def create_loraplus_optimizer(
opt_model,
optimizer_cls,
optimizer_kwargs,
loraplus_lr_ratio,
loraplus_lr_embedding=None,
):
"""
Creates an optimizer for the given model, applying LoRA-specific learning rate adjustments to different parameter groups.
Args:
opt_model (torch.nn.Module): The model for which the optimizer is being created.
optimizer_cls (class): The class of the optimizer to be used (e.g., torch.optim.Adam).
optimizer_kwargs (dict): A dictionary of keyword arguments for the optimizer's initialization.
loraplus_lr_ratio (float): The learning rate ratio to be applied to LoRA parameters.
loraplus_lr_embedding (float, optional): A specific learning rate for embedding parameters, with a default value if not provided.
Returns:
An instance of the specified optimizer class configured with the model's parameters organized into groups with custom learning rates.
"""
assert loraplus_lr_ratio is not None, "loraplus_lr_ratio must be provided."
if loraplus_lr_embedding is None:
loraplus_lr_embedding = 1e-6
decay_parameters = get_parameter_names(opt_model, ALL_LAYERNORM_LAYERS)
decay_parameters = [name for name in decay_parameters if "bias" not in name]
param_groups = {
"groupA": {},
"groupB": {},
"groupB_no_decay": {},
"embedding": {},
}
for name, param in opt_model.named_parameters():
if not param.requires_grad:
continue
module = get_module(name, opt_model)
if isinstance(module, lora.Embedding):
param_groups["embedding"][name] = param
elif "lora_B" in name or param.ndim == 1:
if name in decay_parameters:
param_groups["groupB"][name] = param
else:
param_groups["groupB_no_decay"][name] = param
else:
param_groups["groupA"][name] = param
assigned_param_groups = ""
for group, group_params in param_groups.items():
assigned_param_groups += f"{group}\n {list(group_params.keys())}\n\n"
LOG.info(assigned_param_groups)
lr = optimizer_kwargs["lr"] # pylint: disable=invalid-name
weight_decay = optimizer_kwargs.get("weight_decay", 0.0)
optimizer_grouped_parameters = [
{
"params": list(param_groups["groupA"].values()),
"weight_decay": weight_decay,
"lr": lr,
},
{
"params": list(param_groups["embedding"].values()),
"weight_decay": weight_decay,
"lr": loraplus_lr_embedding,
},
{
"params": list(param_groups["groupB"].values()),
"weight_decay": weight_decay,
"lr": lr * loraplus_lr_ratio,
},
{
"params": list(param_groups["groupB_no_decay"].values()),
"weight_decay": 0.0,
"lr": lr * loraplus_lr_ratio,
},
]
optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)
if optimizer_cls.__name__ == "Adam8bit":
import bitsandbytes
manager = bitsandbytes.optim.GlobalOptimManager.get_instance()
skipped = 0
for module in opt_model.modules():
if isinstance(module, nn.Embedding):
skipped += sum(
{p.data_ptr(): p.numel() for p in module.parameters()}.values()
)
LOG.info(f"skipped {module}: {skipped/2**20}M params")
manager.register_module_override(module, "weight", {"optim_bits": 32})
LOG.debug(f"bitsandbytes: will optimize {module} in fp32")
LOG.info(f"skipped: {skipped/2**20}M params")
return optimizer
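The core of LoRA+ is the learning-rate split built by `create_loraplus_optimizer` above: `lora_B` (and other 1-D) parameters train at `lr * loraplus_lr_ratio`, embeddings at their own small rate, and everything else (including `lora_A`) at the base `lr`. A simplified, dependency-free sketch of that rule (the real code inspects the module with `isinstance(module, lora.Embedding)` rather than matching the name, and further splits group B by weight decay; the parameter names here are made up):

```python
def loraplus_lr_for(name, ndim, lr=2e-4, ratio=16.0, lr_embedding=1e-6):
    """Return the per-parameter learning rate under the LoRA+ grouping."""
    if "embed_tokens" in name:          # embedding group
        return lr_embedding
    if "lora_B" in name or ndim == 1:   # group B (incl. 1-D params)
        return lr * ratio
    return lr                           # group A: base learning rate

assert loraplus_lr_for("q_proj.lora_A.default.weight", 2) == 2e-4
assert loraplus_lr_for("q_proj.lora_B.default.weight", 2) == 2e-4 * 16.0
assert loraplus_lr_for("model.embed_tokens.weight", 2) == 1e-6
```

The asymmetry (B trained faster than A) is the LoRA+ result: it lets the low-rank update escape the small-initialization regime of `lora_B` without destabilizing `lora_A`.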


@@ -0,0 +1,229 @@
"""
Monkeypatch for Vision Llama for FA2 support
"""
# pylint: disable=duplicate-code
from typing import Optional, Tuple
import torch
from flash_attn.flash_attn_interface import flash_attn_func
from transformers.cache_utils import Cache
from transformers.modeling_flash_attention_utils import _flash_attention_forward
from transformers.models.mllama.configuration_mllama import MllamaTextConfig
from transformers.models.mllama.modeling_mllama import (
MllamaTextCrossAttention,
MllamaTextSelfAttention,
apply_rotary_pos_emb,
repeat_kv,
)
from transformers.utils import is_flash_attn_greater_or_equal_2_10
class MllamaTextCrossFlashAttention2(MllamaTextCrossAttention):
"""
Mllama flash cross-attention module. This module inherits from `MllamaTextCrossAttention` and
implements the forward pass using Flash Attention for improved performance.
"""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
# Check if flash attention version is greater or equal to 2.1
self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
def forward(
self,
hidden_states: torch.Tensor,
cross_attention_states: Optional[torch.Tensor] = None,
past_key_value: Optional[Cache] = None,
attention_mask: Optional[ # pylint: disable=unused-argument
torch.Tensor
] = None,
output_attentions: bool = False,
use_cache: bool = False, # pylint: disable=unused-argument
cache_position: Optional[torch.LongTensor] = None,
) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
bsz, q_len, _ = hidden_states.size()
query_states = self.q_proj(hidden_states)
query_states = query_states.view(
bsz, q_len, self.num_heads, self.head_dim
).transpose(1, 2)
query_states = self.q_norm(query_states)
if cross_attention_states is not None:
key_states = self.k_proj(cross_attention_states)
value_states = self.v_proj(cross_attention_states)
key_states = key_states.view(
bsz, -1, self.num_key_value_heads, self.head_dim
).transpose(1, 2)
value_states = value_states.view(
bsz, -1, self.num_key_value_heads, self.head_dim
).transpose(1, 2)
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
key_states = self.k_norm(key_states)
if past_key_value is not None:
key_states, value_states = past_key_value.update(
key_states,
value_states,
self.layer_idx,
{"cache_position": cache_position},
)
elif cache_position[0] != 0:
key_states, value_states = (
past_key_value.key_cache[self.layer_idx],
past_key_value.value_cache[self.layer_idx],
)
else:
raise ValueError(
"Cross attention layer can find neither `cross_attention_states` nor cached values for key/values!"
)
# Transpose to get the expected layout for flash attention
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
# Apply Flash Attention
dropout_rate = self.dropout if self.training else 0.0
output = flash_attn_func(
query_states,
key_states,
value_states,
dropout_p=dropout_rate,
softmax_scale=None,
causal=False,
return_attn_probs=output_attentions,
)
attn_output = output.contiguous().view(bsz, q_len, -1)
attn_output = self.o_proj(attn_output)
if not output_attentions:
attn_weights = None
return attn_output, attn_weights, past_key_value
class MllamaTextSelfFlashAttention2(MllamaTextSelfAttention):
"""
Mllama flash self-attention module. This module inherits from `MllamaTextSelfAttention` and
implements the forward pass using Flash Attention for improved performance.
"""
def __init__(self, config: MllamaTextConfig, layer_idx: int, *args, **kwargs):
super().__init__(config, layer_idx, *args, **kwargs)
# Check if flash attention version is greater or equal to 2.1
self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
output_attentions: bool = False,
use_cache: bool = False, # pylint: disable=unused-argument
past_key_value=None,
cache_position: Optional[torch.LongTensor] = None,
**kwargs, # pylint: disable=unused-argument
) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
output_attentions = False
bsz, q_len, _ = hidden_states.size()
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
# Flash attention requires the input to have the shape
# batch_size x seq_length x num_heads x head_dim
query_states = query_states.view(
bsz, q_len, self.num_heads, self.head_dim
).transpose(1, 2)
key_states = key_states.view(
bsz, q_len, self.num_key_value_heads, self.head_dim
).transpose(1, 2)
value_states = value_states.view(
bsz, q_len, self.num_key_value_heads, self.head_dim
).transpose(1, 2)
cos, sin = position_embeddings
query_states, key_states = apply_rotary_pos_emb(
query_states, key_states, cos, sin
)
if past_key_value is not None:
# sin and cos are specific to RoPE models; cache_position needed for the static cache
cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs
)
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
# Transpose to get the expected layout for flash attention
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
dropout_rate = self.dropout if self.training else 0.0
# Handle potential silent casting to float32
input_dtype = query_states.dtype
if input_dtype == torch.float32:
if torch.is_autocast_enabled():
target_dtype = torch.get_autocast_gpu_dtype()
elif hasattr(self.config, "_pre_quantization_dtype"):
target_dtype = (
self.config._pre_quantization_dtype # pylint: disable=protected-access
)
else:
target_dtype = self.q_proj.weight.dtype
query_states = query_states.to(target_dtype)
key_states = key_states.to(target_dtype)
value_states = value_states.to(target_dtype)
attn_output = _flash_attention_forward(
query_states,
key_states,
value_states,
attention_mask,
q_len,
dropout=dropout_rate,
use_top_left_mask=self._flash_attn_uses_top_left_mask,
is_causal=True,
)
attn_output = attn_output.reshape(bsz, q_len, -1).contiguous()
attn_output = self.o_proj(attn_output)
if not output_attentions:
attn_weights = None
return attn_output, attn_weights, past_key_value
def patch_mllama():
from transformers.models.mllama.modeling_mllama import (
MLLAMA_TEXT_ATTENTION_CLASSES,
MLLAMA_TEXT_CROSS_ATTENTION_CLASSES,
MLLAMA_VISION_ATTENTION_CLASSES,
MllamaPreTrainedModel,
)
MllamaPreTrainedModel._supports_flash_attn_2 = ( # pylint: disable=protected-access
True
)
MLLAMA_TEXT_ATTENTION_CLASSES["flash_attention_2"] = MllamaTextSelfFlashAttention2
MLLAMA_TEXT_CROSS_ATTENTION_CLASSES[
"flash_attention_2"
] = MllamaTextCrossFlashAttention2
# fallback to SDPA
MLLAMA_VISION_ATTENTION_CLASSES[
"flash_attention_2"
] = MLLAMA_VISION_ATTENTION_CLASSES["sdpa"]
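`patch_mllama` works by swapping entries in transformers' attention-class registries before the model is instantiated: transformers dispatches on the `attn_implementation` string through these dicts, so registering a class under `"flash_attention_2"` is enough to route the model through it. The pattern, reduced to a dependency-free sketch with toy classes:

```python
# Toy registries mirroring transformers' *_ATTENTION_CLASSES dicts.
class SdpaAttention:
    pass

class FlashAttention2(SdpaAttention):
    pass

TEXT_ATTENTION_CLASSES = {"eager": SdpaAttention, "sdpa": SdpaAttention}
VISION_ATTENTION_CLASSES = {"eager": SdpaAttention, "sdpa": SdpaAttention}

def patch():
    # Register the flash-attention implementation for the text tower...
    TEXT_ATTENTION_CLASSES["flash_attention_2"] = FlashAttention2
    # ...and fall back to SDPA for the vision tower, as patch_mllama does,
    # since no flash kernel exists there.
    VISION_ATTENTION_CLASSES["flash_attention_2"] = VISION_ATTENTION_CLASSES["sdpa"]

patch()
print(TEXT_ATTENTION_CLASSES["flash_attention_2"].__name__)    # FlashAttention2
print(VISION_ATTENTION_CLASSES["flash_attention_2"].__name__)  # SdpaAttention
```

Because the swap mutates module-level dicts, it must run before `from_pretrained` builds the layers; the same ordering constraint applies to `patch_mllama`.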


@@ -9,18 +9,18 @@ from axolotl.monkeypatch.utils import (
def hijack_llama_prepare_4d_mask():
import transformers.modeling_attn_mask_utils
import transformers.models.llama.modeling_llama
from transformers import modeling_attn_mask_utils
from transformers.models.llama import modeling_llama
transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_for_sdpa = ( # pylint: disable=protected-access
modeling_llama._prepare_4d_causal_attention_mask_for_sdpa = ( # pylint: disable=protected-access
patched_prepare_4d_causal_attention_mask_for_sdpa
)
transformers.modeling_attn_mask_utils._prepare_4d_causal_attention_mask_for_sdpa = ( # pylint: disable=protected-access
modeling_attn_mask_utils._prepare_4d_causal_attention_mask_for_sdpa = ( # pylint: disable=protected-access
patched_prepare_4d_causal_attention_mask_for_sdpa
)
transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask = ( # pylint: disable=protected-access
modeling_llama._prepare_4d_causal_attention_mask = ( # pylint: disable=protected-access
patched_prepare_4d_causal_attention_mask
)
transformers.modeling_attn_mask_utils._prepare_4d_causal_attention_mask = ( # pylint: disable=protected-access
modeling_attn_mask_utils._prepare_4d_causal_attention_mask = ( # pylint: disable=protected-access
patched_prepare_4d_causal_attention_mask
)
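`hijack_llama_prepare_4d_mask` has to replace the function at every module path that references it: both the canonical definition in `modeling_attn_mask_utils` and the alias already imported into `modeling_llama`. The attribute swap itself is simple; a minimal sketch on a throwaway module:

```python
import types

def hijack(module, attr, replacement):
    """Swap a module-level function, returning the original so it can be restored."""
    original = getattr(module, attr)
    setattr(module, attr, replacement)
    return original

# demo on a synthetic module standing in for transformers' modules
mod = types.ModuleType("demo")
mod.prepare_mask = lambda: "original"
restored = hijack(mod, "prepare_mask", lambda: "patched")
```

The subtlety is that `from x import f` binds a second name; patching only `x.f` leaves callers of the imported alias on the old code, which is why the hunk patches both locations.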

View File

@@ -10,6 +10,7 @@ from axolotl.monkeypatch.mixtral import patch_mixtral_moe_forward_zero3
from axolotl.monkeypatch.utils import get_unpad_data
SUPPORTED_MULTIPACK_MODEL_TYPES = [
"mllama_text_model",
"llama",
"mistral",
"mixtral",
@@ -17,6 +18,7 @@ SUPPORTED_MULTIPACK_MODEL_TYPES = [
"qwen2_moe",
"falcon",
"phi",
"phi3",
"gemma",
"gemma2",
"gemmoe",

View File

@@ -16,6 +16,7 @@
# This code is based off the following work:
# https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py
# https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neox/modeling_gpt_neox.py
# pylint: disable=duplicate-code
""" PyTorch StableLM Epoch model. """
import importlib
import math

View File

@@ -16,8 +16,7 @@ from transformers.models.llama.modeling_llama import (
LOG = get_logger("axolotl.monkeypatch.unsloth")
ORIGINAL_CEL_CODE = """ if labels is not None:
# Shift so that tokens < n predict n
ORIGINAL_CEL_CODE = """# Shift so that tokens < n predict n
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
# Flatten the tokens
@@ -29,8 +28,7 @@ ORIGINAL_CEL_CODE = """ if labels is not None:
loss = loss_fct(shift_logits, shift_labels)
"""
PATCHED_CEL_CODE = """ if labels is not None:
shift_logits = logits[..., :-1, :].contiguous()
PATCHED_CEL_CODE = """shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
loss = fast_cross_entropy_loss(
logits = shift_logits,

View File

@@ -17,11 +17,9 @@ def get_max_seqlen_in_batch(attention_mask: torch.Tensor) -> torch.Tensor:
max_num = int(torch.max(attention_mask).item())
batch_size, _ = attention_mask.shape
counts = torch.zeros((batch_size, max_num), dtype=torch.int32)
for i in range(1, max_num + 1):
mask = attention_mask == i
counts[:, i - 1] = torch.sum(mask, dim=-1).to(dtype=torch.int32)
result = counts.flatten()
nonzero_indices = torch.nonzero(result).squeeze(-1)
return result[nonzero_indices]
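For multipack, the attention mask tags each packed sequence in a row with a distinct integer (1, 2, 3, …) and padding with 0; the helper above counts tokens per sequence id per row and drops the zero counts. A pure-Python equivalent of that counting logic:

```python
def seqlens_in_batch(attention_mask):
    """attention_mask: list of rows; each packed sequence is tagged 1..N, padding is 0."""
    max_num = max(max(row) for row in attention_mask)
    result = []
    for row in attention_mask:
        for seq_id in range(1, max_num + 1):
            count = sum(1 for tok in row if tok == seq_id)
            if count:  # skip sequence ids absent from this row
                result.append(count)
    return result
```

The flash-attention varlen kernels consume these per-sequence lengths (as cumulative offsets), which is what lets multiple short examples share one row without cross-attention between them.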

View File

@@ -9,7 +9,7 @@ from axolotl.prompt_strategies.user_defined import UserDefinedDatasetConfig
LOG = logging.getLogger("axolotl.prompt_strategies")
def load(strategy, tokenizer, cfg, ds_cfg):
def load(strategy, tokenizer, cfg, ds_cfg, processor=None):
try:
load_fn = "load"
if strategy.split(".")[-1].startswith("load_"):
@@ -24,6 +24,8 @@ def load(strategy, tokenizer, cfg, ds_cfg):
sig = inspect.signature(func)
if "ds_cfg" in sig.parameters:
load_kwargs["ds_cfg"] = ds_cfg
if "processor" in sig.parameters:
load_kwargs["processor"] = processor
return func(tokenizer, cfg, **load_kwargs)
except ModuleNotFoundError:
return None
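The loader only forwards `ds_cfg` and the new `processor` when the strategy's `load` function actually declares them, so older strategies keep working without signature changes. The dispatch can be sketched as:

```python
import inspect

def call_with_supported_kwargs(func, tokenizer, **optional):
    # forward only the keyword arguments the target function declares
    sig = inspect.signature(func)
    kwargs = {k: v for k, v in optional.items() if k in sig.parameters}
    return func(tokenizer, **kwargs)

def legacy_load(tokenizer, cfg=None):
    return ("legacy", tokenizer)

def multimodal_load(tokenizer, cfg=None, processor=None):
    return ("mm", tokenizer, processor)
```

Signature inspection keeps the plugin surface backward compatible: a strategy opts into new arguments simply by naming them.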

View File

@@ -5,6 +5,8 @@ HF Chat Templates prompt strategy
import logging
from typing import Any, Dict, List, Optional
from transformers import ProcessorMixin
from axolotl.prompt_tokenizers import PromptTokenizingStrategy
from axolotl.prompters import IGNORE_TOKEN_ID, Prompter
from axolotl.utils.chat_templates import chat_templates
@@ -20,12 +22,13 @@ class ChatTemplatePrompter(Prompter):
def __init__(
self,
tokenizer,
processor=None,
chat_template=None,
max_length=2048,
message_field_role: str = "from",
message_field_content: str = "value",
message_field_training: str = "train",
message_field_training_detail: str = "train_detail",
message_field_training: Optional[str] = None,
message_field_training_detail: Optional[str] = None,
roles: Optional[Dict[str, List[str]]] = None,
drop_system_message: bool = False,
):
@@ -44,11 +47,12 @@ class ChatTemplatePrompter(Prompter):
self.message_field_training = message_field_training
self.message_field_training_detail = message_field_training_detail
self.tokenizer = tokenizer
self.processor: ProcessorMixin = processor
self.chat_template = chat_template
self.max_length = max_length
self.drop_system_message = drop_system_message
def build_prompt(self, conversation, add_generation_prompt=False):
def build_prompt(self, conversation, add_generation_prompt=False, images=None):
turns = [
{
"role": self.roles[t[self.message_field_role]],
@@ -61,6 +65,28 @@ class ChatTemplatePrompter(Prompter):
if self.drop_system_message and turns[0]["role"] == "system":
turns = turns[1:]
if self.processor:
text = self.processor.apply_chat_template(
turns,
chat_template=self.chat_template,
tokenize=False,
add_generation_prompt=add_generation_prompt,
)
batch = self.processor(
text=text,
images=images,
return_tensors="pt",
truncation=True,
max_length=self.max_length,
)
# workaround since processor works in batches instead of single examples
for k, val in batch.items():
if k in ["pixel_values"]:
batch[k] = val.tolist()
else:
batch[k] = val.squeeze().tolist()
return batch
return self.tokenizer.apply_chat_template(
turns,
truncation=True,
@@ -186,11 +212,12 @@ class ChatTemplateStrategy(PromptTokenizingStrategy):
train_on_inputs,
sequence_len,
roles_to_train=None,
train_on_eos="last",
train_on_eos=None,
):
super().__init__(prompter, tokenizer, train_on_inputs, sequence_len)
self.roles_to_train = roles_to_train if roles_to_train is not None else []
self.train_on_eos = train_on_eos
self.images = "images"
@property
def messages(self):
@@ -201,6 +228,40 @@ class ChatTemplateStrategy(PromptTokenizingStrategy):
self._messages = messages
def tokenize_prompt(self, prompt):
# Old simple legacy behavior that works reliably.
if (
not self.roles_to_train
and not self.train_on_eos
and not self.prompter.message_field_training
and not self.prompter.message_field_training_detail
):
turns = self.get_conversation_thread(prompt)
images = self.get_images(prompt)
prompt_ids = self.prompter.build_prompt(
turns[:-1],
add_generation_prompt=True,
images=images,
)
tokenized_res = self.prompter.build_prompt(turns, images=images)
tokenized_prompt = {}
if isinstance(tokenized_res, list):
input_ids = prompt_ids + tokenized_res[len(prompt_ids) :]
tokenized_prompt["input_ids"] = input_ids
tokenized_prompt["attention_mask"] = [1] * len(input_ids)
else:
input_ids = tokenized_res["input_ids"]
tokenized_prompt = tokenized_res
if not self.train_on_inputs:
user_prompt_len = len(prompt_ids)
labels = [-100] * user_prompt_len + input_ids[user_prompt_len:]
else:
labels = input_ids
tokenized_prompt["labels"] = labels
return tokenized_prompt
turns = prompt[self.messages]
input_ids = self.prompter.build_prompt(turns)
labels = [IGNORE_TOKEN_ID] * len(input_ids)
@@ -219,9 +280,11 @@ class ChatTemplateStrategy(PromptTokenizingStrategy):
should_train = (
train_turn
if train_turn is not None
else bool(train_detail is not None)
if train_detail is not None
else self.train_on_inputs or role in self.roles_to_train
else (
bool(train_detail is not None)
if train_detail is not None
else self.train_on_inputs or role in self.roles_to_train
)
)
LOG.debug(f"Should train: {should_train}")
@@ -335,29 +398,35 @@ class ChatTemplateStrategy(PromptTokenizingStrategy):
def get_conversation_thread(self, prompt):
return prompt[self.messages]
def get_images(self, prompt):
return prompt.get(self.images, None)
def load(tokenizer, cfg, ds_cfg: Optional[Dict[str, Any]] = None):
def load(tokenizer, cfg, ds_cfg: Optional[Dict[str, Any]] = None, processor=None):
ds_cfg = ds_cfg or {}
prompter_params = {
"tokenizer": tokenizer,
"chat_template": chat_templates(ds_cfg.get("chat_template", "chatml")),
"message_field_role": ds_cfg.get("message_field_role", "from"),
"message_field_content": ds_cfg.get("message_field_content", "value"),
"message_field_training": ds_cfg.get("message_field_training", "training"),
"message_field_role": ds_cfg.get("message_field_role", "role"),
"message_field_content": ds_cfg.get("message_field_content", "content"),
"message_field_training": ds_cfg.get("message_field_training", None),
"message_field_training_detail": ds_cfg.get(
"message_field_training_detail", "train_detail"
"message_field_training_detail",
None,
),
"roles": ds_cfg.get("roles"),
"drop_system_message": ds_cfg.get("drop_system_message", False),
"max_length": cfg.sequence_len,
# we need to add one for detecting sequences exceeding the `sequence_len` limit.
"max_length": cfg.sequence_len + 1,
"processor": processor,
}
strategy_params = {
"train_on_inputs": cfg.train_on_inputs,
"sequence_len": cfg.sequence_len,
"roles_to_train": ds_cfg.get("roles_to_train", ["gpt", "assistant"]),
"train_on_eos": ds_cfg.get("train_on_eos", "last"),
"roles_to_train": ds_cfg.get("roles_to_train", []),
"train_on_eos": ds_cfg.get("train_on_eos", None),
}
strategy = ChatTemplateStrategy(
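In the restored legacy path, the prompt (all turns but the last, with the generation prompt appended) is tokenized separately so its length can mask the label prefix when `train_on_inputs` is off. The masking step reduces to:

```python
IGNORE_TOKEN_ID = -100

def build_labels(prompt_ids, full_ids, train_on_inputs=False):
    """Mask the prompt prefix with -100 so only the response contributes to the loss."""
    if train_on_inputs:
        return list(full_ids)
    prompt_len = len(prompt_ids)
    return [IGNORE_TOKEN_ID] * prompt_len + list(full_ids[prompt_len:])
```

This is why the legacy branch only activates when no per-turn training fields are configured: with a single assistant turn at the end, prefix masking is equivalent to the fine-grained per-turn logic and far simpler.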

View File

@@ -65,8 +65,10 @@ class AlpacaPrompter(Prompter):
self.system_format = "<|im_start|>system\n{system}<|im_end|>\n"
elif self.prompt_style == PromptStyle.PHI.value:
self.turn_format = "<|user|>\n{instruction}<|end|>{input}<|assistant|>"
self.turn_no_input_format = "<|user|>\n{instruction}<|end|><|assistant|>"
self.system_format = "<|system|>{system}\n"
self.turn_no_input_format = (
"<|user|>\n{instruction}<|end|>\n<|assistant|>\n"
)
self.system_format = "<|system|>\n{system}<|end|>\n"
def _build_result(self, instruction, input_text, output):
# returns the full prompt from instruction and optional input
@@ -350,9 +352,12 @@ class ShareGPTPrompter(Prompter): # pylint: disable=too-few-public-methods
"Please help us by creating an Issue to add support for this conversation type."
)
role = CONVERSATION_ROLE_FORMAT[self._conversation.name].format(
ROLE=from_role
)
if self._conversation.name in ["llama3"]:
role = from_role
else:
role = CONVERSATION_ROLE_FORMAT[self._conversation.name].format(
ROLE=from_role
)
if len(conv.messages) > 0 and ((role == conv.messages[-1][0])):
if (

View File

@@ -12,6 +12,7 @@ import torch
import transformers.modelcard
from accelerate import Accelerator
from accelerate.logging import get_logger
from accelerate.utils import save_fsdp_model
from datasets import Dataset
from peft import PeftModel
from pkg_resources import get_distribution # type: ignore
@@ -23,7 +24,7 @@ from axolotl.core.tokenizer_utils import fix_untrained_tokens
from axolotl.logging_config import configure_logging
from axolotl.utils.dict import DictDefault
from axolotl.utils.freeze import freeze_layers_except
from axolotl.utils.models import load_model, load_tokenizer
from axolotl.utils.models import load_model, load_processor, load_tokenizer
from axolotl.utils.trainer import setup_trainer
try:
@@ -68,6 +69,9 @@ def train(
main_process_only=True,
)
tokenizer = load_tokenizer(cfg)
processor = None
if cfg.is_multimodal:
processor = load_processor(cfg, tokenizer)
train_dataset = dataset_meta.train_dataset
eval_dataset = dataset_meta.eval_dataset
@@ -95,7 +99,9 @@ def train(
LOG.debug(msg)
# we wait until the last possible moment to set up the Accelerator
Accelerator()
model, peft_config = load_model(cfg, tokenizer, inference=cli_args.inference)
model, peft_config = load_model(
cfg, tokenizer, processor=processor, inference=cli_args.inference
)
model.generation_config.do_sample = True
model_ref = None
@@ -121,6 +127,7 @@ def train(
eval_dataset,
(model, model_ref, peft_config),
tokenizer,
processor,
total_num_steps,
)
@@ -194,9 +201,12 @@ def train(
if hasattr(module, "_post_training"):
module._post_training(model, name) # pylint: disable=protected-access
state_dict_type = "FULL_STATE_DICT"
if trainer.is_fsdp_enabled:
trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")
LOG.info("Set FSDP state dict type to FULL_STATE_DICT for saving.")
if cfg.fsdp_final_state_dict_type:
state_dict_type = cfg.fsdp_final_state_dict_type
trainer.accelerator.state.fsdp_plugin.set_state_dict_type(state_dict_type)
LOG.info(f"Set FSDP state dict type to {state_dict_type} for saving.")
if cfg.relora_steps:
if cfg.adapter == "lora" and not (cfg.load_in_4bit or cfg.load_in_8bit):
@@ -208,7 +218,18 @@ def train(
# TODO do we need this fix? https://huggingface.co/docs/accelerate/usage_guides/fsdp#saving-and-loading
# only save on rank 0, otherwise it corrupts output on multi-GPU when multiple processes attempt to write the same file
if cfg.fsdp:
trainer.save_model(cfg.output_dir)
if (
state_dict_type == "SHARDED_STATE_DICT"
and cfg.fsdp_config.fsdp_state_dict_type == "SHARDED_STATE_DICT"
):
save_fsdp_model(
trainer.accelerator.state.fsdp_plugin,
trainer.accelerator,
trainer.model,
cfg.output_dir,
)
elif state_dict_type == "FULL_STATE_DICT":
trainer.save_model(cfg.output_dir)
elif cfg.deepspeed and is_deepspeed_zero3_enabled():
# Copied over from: https://github.com/huggingface/accelerate/blob/5ae611118057232f441055f7ef9ba0b0f2b8d533/docs/source/usage_guides/deepspeed.md#saving-and-loading
trainer.accelerator.wait_for_everyone()

File diff suppressed because one or more lines are too long

View File

@@ -0,0 +1,10 @@
"""
shared axolotl collators for multipack, mamba, multimodal
"""
from .batching import ( # noqa: F401
BatchSamplerDataCollatorForSeq2Seq,
DataCollatorForSeq2Seq,
PretrainingBatchSamplerDataCollatorForSeq2Seq,
V2BatchSamplerDataCollatorForSeq2Seq,
)
from .mamba import MambaDataCollator # noqa: F401

View File

@@ -1,17 +1,14 @@
"""
DataCollator for axolotl to pad labels and position_ids for packed sequences
"""
from dataclasses import dataclass
from typing import Any, Dict, Optional, Sequence, Union
from typing import Any, Optional, Union
import numpy as np
import torch
import transformers
from transformers import PreTrainedTokenizerBase
from transformers.utils import PaddingStrategy
IGNORE_INDEX = -100
@dataclass
class DataCollatorForSeq2Seq:
@@ -183,34 +180,6 @@ class V2BatchSamplerDataCollatorForSeq2Seq(DataCollatorForSeq2Seq):
return super().__call__(out_features, return_tensors=return_tensors)
@dataclass
class MambaDataCollator:
"""
Collator for State Space Models (Mamba)
"""
tokenizer: transformers.PreTrainedTokenizer
def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
input_ids, labels = tuple(
[torch.LongTensor(instance[key]) for instance in instances]
for key in ("input_ids", "labels")
)
input_ids = torch.nn.utils.rnn.pad_sequence(
input_ids,
batch_first=True,
padding_value=self.tokenizer.pad_token_id,
)
labels = torch.nn.utils.rnn.pad_sequence(
labels, batch_first=True, padding_value=IGNORE_INDEX
)
return {
"input_ids": input_ids,
"labels": labels,
}
@dataclass
class PretrainingBatchSamplerDataCollatorForSeq2Seq(DataCollatorForSeq2Seq):
"""

View File

@@ -0,0 +1,4 @@
"""
basic shared collator constants
"""
IGNORE_INDEX = -100

View File

@@ -0,0 +1,38 @@
"""
collators for Mamba
"""
from dataclasses import dataclass
from typing import Dict, Sequence
import torch
import transformers
from axolotl.utils.collators.core import IGNORE_INDEX
@dataclass
class MambaDataCollator:
"""
Collator for State Space Models (Mamba)
"""
tokenizer: transformers.PreTrainedTokenizer
def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
input_ids, labels = tuple(
[torch.LongTensor(instance[key]) for instance in instances]
for key in ("input_ids", "labels")
)
input_ids = torch.nn.utils.rnn.pad_sequence(
input_ids,
batch_first=True,
padding_value=self.tokenizer.pad_token_id,
)
labels = torch.nn.utils.rnn.pad_sequence(
labels, batch_first=True, padding_value=IGNORE_INDEX
)
return {
"input_ids": input_ids,
"labels": labels,
}
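The Mamba collator pads `input_ids` with the tokenizer's pad id but pads `labels` with `IGNORE_INDEX`, so padded positions are excluded from the loss. Without torch, the same right-padding behavior (what `pad_sequence(batch_first=True)` does) looks like:

```python
IGNORE_INDEX = -100

def pad_batch(sequences, pad_value):
    """Right-pad variable-length sequences to the batch max length."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]

def collate(instances, pad_token_id=0):
    return {
        "input_ids": pad_batch([ex["input_ids"] for ex in instances], pad_token_id),
        "labels": pad_batch([ex["labels"] for ex in instances], IGNORE_INDEX),
    }
```

Note the collator returns no attention mask: Mamba's state-space scan is sequential rather than attention-based, so padding handling differs from transformer collators.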

View File

@@ -0,0 +1,77 @@
"""
Collators for multi-modal chat messages and packing
"""
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Union
from transformers import PreTrainedTokenizerBase, ProcessorMixin
from transformers.data.data_collator import DataCollatorMixin
from transformers.utils import PaddingStrategy
@dataclass
class MultiModalChatDataCollator(DataCollatorMixin):
"""
Collator for multi-modal chat messages
"""
tokenizer: PreTrainedTokenizerBase
processor: ProcessorMixin
return_tensors: str = "pt"
chat_template: Optional[str] = None
packing: bool = False
max_images: int = -1
padding: Union[bool, str, PaddingStrategy] = True
pad_to_multiple_of: Optional[int] = None
def __post_init__(self):
if self.packing:
raise ValueError("Packing is currently not supported.")
def torch_call(
self, examples: List[Union[List[int], Any, Dict[str, Any]]]
) -> Dict[str, Any]:
# Handle dict or lists with proper padding and conversion to tensor.
return self.__class__.process_rows(
examples, self.processor, self.chat_template, self.max_images
)
@staticmethod
def process_rows(examples, processor, chat_template, max_images, length_only=False):
# HINT: use `_torch_collate_batch` to stack and pad tensors
# see also DataCollatorWithFlattening and DefaultDataCollator
# *** This is COPIED from the trl example sft_vlm.py code ***
# use this as a starting point
# Get the texts and images, and apply the chat template
texts = [
processor.apply_chat_template(
example["messages"], chat_template=chat_template, tokenize=False
)
for example in examples
]
images = [example["images"] for example in examples]
if max_images > 0:
images = [img_batch[:max_images] for img_batch in images]
# Tokenize the texts and process the images
batch = processor(text=texts, images=images, return_tensors="pt", padding=True)
# The labels are the input_ids, and we mask the padding tokens in the loss computation
labels = batch["input_ids"].clone()
labels[labels == processor.tokenizer.pad_token_id] = -100
# Ignore the image token index in the loss computation (model specific)
image_token_id = processor.tokenizer.convert_tokens_to_ids(
processor.image_token
)
labels[labels == image_token_id] = -100
batch["labels"] = labels
if length_only:
return {
"length": [len(sample) for sample in batch["input_ids"]]
}
return batch
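After running the processor over text and images, the collator clones `input_ids` into `labels` and masks both padding tokens and the image placeholder token with -100, since neither should contribute to the language-modeling loss. Per row, that is simply:

```python
def mask_vlm_labels(input_ids, pad_token_id, image_token_id):
    """Labels start as a copy of input_ids; pad and image positions are ignored in the loss."""
    return [
        -100 if tok in (pad_token_id, image_token_id) else tok
        for tok in input_ids
    ]
```

The image token id is model-specific (looked up via `convert_tokens_to_ids`), which is why the masking lives in the collator rather than in the tokenization strategy.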

View File

@@ -8,11 +8,14 @@ from typing import Optional
import torch
from transformers.utils import is_torch_bf16_gpu_available
from axolotl.integrations.config import merge_input_args
from axolotl.utils.bench import log_gpu_memory_usage
from axolotl.utils.config.models.input.v0_4_1 import SUPPORTED_METRICS
from axolotl.utils.config.models.input.v0_4_1 import (
SUPPORTED_METRICS,
AxolotlConfigWCapabilities,
AxolotlInputConfig,
AxolotlConfigWCapabilities as AxolotlConfigWCapabilitiesBase,
)
from axolotl.utils.config.models.input.v0_4_1 import (
AxolotlInputConfig as AxolotlInputConfigBase,
)
from axolotl.utils.dict import DictDefault
from axolotl.utils.models import load_model_config
@@ -118,15 +121,36 @@ def normalize_config(cfg):
cfg.base_model_config = cfg.base_model
model_config = load_model_config(cfg)
cfg.model_config_type = model_config.model_type
cfg.tokenizer_config = (
cfg.tokenizer_config or cfg.base_model_config or cfg.base_model
)
cfg.is_multimodal = (
hasattr(model_config, "model_type")
and model_config.model_type in ["llava", "mllama"]
or any(
multimodal_name in cfg.base_model.lower()
for multimodal_name in [
"pixtral",
]
)
or cfg.is_multimodal
)
if cfg.is_multimodal:
cfg.processor_config = (
cfg.processor_config or cfg.base_model_config or cfg.base_model
)
model_config = model_config.text_config
cfg.model_config_type = model_config.model_type
# figure out if the model is llama
cfg.is_llama_derived_model = (
(hasattr(model_config, "model_type") and model_config.model_type == "llama")
(
hasattr(model_config, "model_type")
and model_config.model_type in ["llama", "mllama_text_model"]
)
or cfg.is_llama_derived_model
or "llama" in cfg.base_model.lower()
or (cfg.type_of_model and "llama" in cfg.type_of_model.lower())
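`normalize_config` decides multimodality from the config's `model_type` or, failing that, from name substrings in `base_model` (pixtral checkpoints here carry no distinguishing model_type), then swaps in `text_config` so downstream checks see the language model. The detection step, roughly:

```python
def is_multimodal(model_type, base_model, name_hints=("pixtral",)):
    """model_type comes from the HF config; base_model is the repo id or path string."""
    return model_type in ("llava", "mllama") or any(
        hint in base_model.lower() for hint in name_hints
    )
```

String matching on the model name is a stopgap; once a checkpoint family publishes a proper `model_type`, the first branch should take over.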
@@ -207,6 +231,15 @@ def normalize_cfg_datasets(cfg):
def validate_config(cfg: DictDefault, capabilities: Optional[dict] = None):
AxolotlConfigWCapabilities = AxolotlConfigWCapabilitiesBase
AxolotlInputConfig = AxolotlInputConfigBase
if cfg.plugins:
(
AxolotlConfigWCapabilities, # pylint: disable=invalid-name
AxolotlInputConfig, # pylint: disable=invalid-name
) = merge_input_args()
if capabilities:
return DictDefault(
dict(

View File

@@ -188,8 +188,11 @@ class ChatTemplate(str, Enum):
gemma = "gemma" # pylint: disable=invalid-name
cohere = "cohere" # pylint: disable=invalid-name
llama3 = "llama3" # pylint: disable=invalid-name
llama3_2_vision = "llama3_2_vision" # pylint: disable=invalid-name
phi_3 = "phi_3" # pylint: disable=invalid-name
phi_35 = "phi_35" # pylint: disable=invalid-name
deepseek_v2 = "deepseek_v2" # pylint: disable=invalid-name
jamba = "jamba" # pylint: disable=invalid-name
class LoftQConfig(BaseModel):
@@ -226,11 +229,12 @@ class LoraConfig(BaseModel):
lora_r: Optional[int] = None
lora_alpha: Optional[int] = None
lora_fan_in_fan_out: Optional[bool] = None
lora_target_modules: Optional[List[str]] = None
lora_target_modules: Optional[Union[str, List[str]]] = None
lora_target_linear: Optional[bool] = None
lora_modules_to_save: Optional[List[str]] = None
lora_dropout: Optional[float] = 0.0
peft_layers_to_transform: Optional[List[int]] = None
peft_layers_pattern: Optional[List[str]] = None
peft: Optional[PeftConfig] = None
peft_use_dora: Optional[bool] = None
peft_use_rslora: Optional[bool] = None
@@ -296,6 +300,13 @@ class LoraConfig(BaseModel):
raise ValueError("Require cfg.load_in_4bit to be True for qlora")
return self
@field_validator("loraplus_lr_embedding")
@classmethod
def convert_loraplus_lr_embedding(cls, loraplus_lr_embedding):
if loraplus_lr_embedding and isinstance(loraplus_lr_embedding, str):
loraplus_lr_embedding = float(loraplus_lr_embedding)
return loraplus_lr_embedding
class ReLoRAConfig(BaseModel):
"""ReLoRA configuration subset"""
@@ -319,8 +330,13 @@ class ModelInputConfig(BaseModel):
tokenizer_type: Optional[str] = Field(
default=None, metadata={"help": "transformers tokenizer class"}
)
processor_type: Optional[str] = Field(
default=None, metadata={"help": "transformers processor class"}
)
trust_remote_code: Optional[bool] = None
model_kwargs: Optional[Dict[str, Any]] = None
@field_validator("trust_remote_code")
@classmethod
def hint_trust_remote_code(cls, trust_remote_code):
@@ -352,6 +368,8 @@ class HyperparametersConfig(BaseModel):
},
)
auto_find_batch_size: Optional[bool] = None
train_on_inputs: Optional[bool] = False
group_by_length: Optional[bool] = None
@@ -517,6 +535,7 @@ class AxolotlInputConfig(
dataset_prepared_path: Optional[str] = None
dataset_shard_num: Optional[int] = None
dataset_shard_idx: Optional[int] = None
skip_prepare_dataset: Optional[bool] = False
pretraining_dataset: Optional[ # type: ignore
conlist(Union[PretrainingDataset, SFTDataset], min_length=1)
@@ -589,6 +608,7 @@ class AxolotlInputConfig(
eval_sample_packing: Optional[bool] = None
pad_to_sequence_len: Optional[bool] = None
curriculum_sampling: Optional[bool] = None
multipack_real_batches: Optional[bool] = None
# for PoSE context length extension
use_pose: Optional[bool] = None
@@ -614,6 +634,8 @@ class AxolotlInputConfig(
flash_attn_fuse_mlp: Optional[bool] = None
flash_optimum: Optional[bool] = None
eager_attention: Optional[bool] = None
unsloth_cross_entropy_loss: Optional[bool] = None
unsloth_lora_mlp: Optional[bool] = None
unsloth_lora_qkv: Optional[bool] = None
@@ -624,6 +646,9 @@ class AxolotlInputConfig(
deepspeed: Optional[Union[str, Dict[str, Any]]] = None
fsdp: Optional[List[str]] = None
fsdp_config: Optional[Dict[str, Any]] = None
fsdp_final_state_dict_type: Optional[
Literal["FULL_STATE_DICT", "LOCAL_STATE_DICT", "SHARDED_STATE_DICT"]
] = None
val_set_size: Optional[float] = Field(default=0.0)
@@ -978,6 +1003,18 @@ class AxolotlInputConfig(
return data
@model_validator(mode="before")
@classmethod
def check_mm_prepare(cls, data):
if data.get("skip_prepare_dataset"):
if data.get("remove_unused_columns") is None:
LOG.info(
"setting `remove_unused_columns: false` for skip_prepare_dataset"
)
data["remove_unused_columns"] = False
return data
@model_validator(mode="before")
@classmethod
def check_warmup(cls, data):
@@ -1005,12 +1042,20 @@ class AxolotlInputConfig(
return neftune_noise_alpha
@model_validator(mode="after")
def check(self):
def check_rl_beta(self):
if self.dpo_beta and not self.rl_beta:
self.rl_beta = self.dpo_beta
del self.dpo_beta
return self
@model_validator(mode="after")
def check_simpo_warmup(self):
if self.rl == "simpo" and self.warmup_ratio:
raise ValueError(
"warmup_ratio is not supported with the simpo trainer. Please use `warmup_steps` instead"
)
return self
@model_validator(mode="before")
@classmethod
def check_frozen(cls, data):
@@ -1025,6 +1070,15 @@ class AxolotlInputConfig(
return data
@model_validator(mode="before")
@classmethod
def check_peft_layers_pattern(cls, data):
if data.get("peft_layers_pattern") and not data.get("peft_layers_to_transform"):
raise ValueError(
"peft_layers_pattern requires peft_layers_to_transform to be set"
)
return data
@model_validator(mode="after")
def check_fft_possible_bad_config(self):
if (
@@ -1144,6 +1198,20 @@ class AxolotlInputConfig(
)
return data
@model_validator(mode="before")
@classmethod
def check_fsdp_sharded_state_dict_w_safetensors(cls, data):
if (
data.get("fsdp")
and data.get("save_safetensors")
and data.get("fsdp_config")
and data["fsdp_config"].get("fsdp_state_dict_type") == "SHARDED_STATE_DICT"
):
raise ValueError(
"FSDP SHARDED_STATE_DICT not compatible with save_safetensors"
)
return data
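This validator rejects a combination that would otherwise only fail at checkpoint-save time: safetensors serialization cannot represent FSDP sharded state dicts. As a plain function over the raw config dict, the check is:

```python
def check_fsdp_sharded_safetensors(data):
    """Raise at config-validation time instead of failing during checkpoint save."""
    if (
        data.get("fsdp")
        and data.get("save_safetensors")
        and (data.get("fsdp_config") or {}).get("fsdp_state_dict_type")
        == "SHARDED_STATE_DICT"
    ):
        raise ValueError("FSDP SHARDED_STATE_DICT not compatible with save_safetensors")
    return data
```

Failing fast here matters because the incompatibility would otherwise surface hours into a training run, at the first save.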
@model_validator(mode="before")
@classmethod
def check_causal_lm_evals(cls, data):
@@ -1263,6 +1331,19 @@ class AxolotlConfigWCapabilities(AxolotlInputConfig):
return data
@model_validator(mode="before")
@classmethod
def check_hopper_8bit_lora(cls, data):
is_sm_90: bool = (
data["capabilities"]
and data["capabilities"].get("compute_capability") == "sm_90"
)
if data.get("adapter") and data.get("load_in_8bit") and is_sm_90:
# see https://github.com/bitsandbytes-foundation/bitsandbytes/issues/538#issuecomment-2262945464
raise ValueError("8-bit LoRA is not supported on Hopper GPUs")
return data
@model_validator(mode="before")
@classmethod
def check_fsdp_deepspeed(cls, data):

View File

@@ -18,10 +18,10 @@ LOG = logging.getLogger("axolotl")
def encode_pretraining(
tokenizer: PreTrainedTokenizerBase, max_tokens: int, examples: List[str]
tokenizer: PreTrainedTokenizerBase, max_tokens: int, examples: Dict[str, List]
) -> Dict[str, List]:
res = tokenizer(
examples,
examples["text"],
truncation=True,
max_length=max_tokens - 2,
add_special_tokens=True,

View File

@@ -51,20 +51,31 @@ from axolotl.utils.trainer import (
LOG = logging.getLogger("axolotl")
def prepare_dataset(cfg, tokenizer):
def prepare_dataset(cfg, tokenizer, processor=None):
prompters = []
if not cfg.pretraining_dataset:
with zero_first(is_local_main_process()):
if cfg.test_datasets:
train_dataset, _, prompters = load_prepare_datasets(
tokenizer, cfg, DEFAULT_DATASET_PREPARED_PATH, split="train"
tokenizer,
cfg,
DEFAULT_DATASET_PREPARED_PATH,
split="train",
processor=processor,
)
_, eval_dataset, _ = load_prepare_datasets(
tokenizer, cfg, DEFAULT_DATASET_PREPARED_PATH, split="test"
tokenizer,
cfg,
DEFAULT_DATASET_PREPARED_PATH,
split="test",
processor=processor,
)
else:
train_dataset, eval_dataset, prompters = load_prepare_datasets(
tokenizer, cfg, DEFAULT_DATASET_PREPARED_PATH
tokenizer,
cfg,
DEFAULT_DATASET_PREPARED_PATH,
processor=processor,
)
else:
path = cfg.pretraining_dataset
@@ -123,6 +134,7 @@ def load_tokenized_prepared_datasets(
cfg,
default_dataset_prepared_path,
split="train",
processor=None,
) -> Tuple[DatasetDict, List[Prompter]]:
cfg_datasets = cfg.test_datasets if split == "test" else cfg.datasets
tokenizer_name = cfg.tokenizer_config
@@ -180,6 +192,7 @@ def load_tokenized_prepared_datasets(
cfg.dataset_prepared_path
and any(prepared_ds_path.glob("*"))
and not cfg.is_preprocess
and not cfg.skip_prepare_dataset
):
LOG.info(f"Loading prepared dataset from disk at {prepared_ds_path}...")
dataset = load_from_disk(str(prepared_ds_path))
@@ -423,12 +436,16 @@ def load_tokenized_prepared_datasets(
dataset=ds,
d_base_type=d_base_type,
d_prompt_style=d_prompt_style,
processor=processor,
)
datasets.append(dataset_wrapper)
prompters.append(dataset_prompter)
LOG.info("merging datasets")
dataset = concatenate_datasets(datasets)
if len(datasets) == 1:
dataset = datasets[0]
else:
LOG.info("merging datasets")
dataset = concatenate_datasets(datasets)
if len(datasets) > 1:
if cfg.shuffle_merged_datasets:
@@ -437,9 +454,10 @@ def load_tokenized_prepared_datasets(
else:
LOG.debug("NOT shuffling merged datasets")
dataset, _ = process_datasets_for_packing(cfg, dataset, None)
if not cfg.skip_prepare_dataset:
dataset, _ = process_datasets_for_packing(cfg, dataset, None)
if cfg.local_rank == 0:
if cfg.local_rank == 0 and not cfg.skip_prepare_dataset:
LOG.info(f"Saving merged prepared dataset to disk... {prepared_ds_path}")
dataset.save_to_disk(str(prepared_ds_path))
if cfg.push_dataset_to_hub:
@@ -478,9 +496,14 @@ def load_prepare_datasets(
cfg,
default_dataset_prepared_path,
split="train",
processor=None,
) -> Tuple[Dataset, Dataset, List[Prompter]]:
dataset, prompters = load_tokenized_prepared_datasets(
tokenizer, cfg, default_dataset_prepared_path, split=split
tokenizer,
cfg,
default_dataset_prepared_path,
split=split,
processor=processor,
)
if cfg.dataset_shard_num and cfg.dataset_shard_idx is not None:
@@ -546,6 +569,7 @@ def get_dataset_wrapper(
d_base_type,
dataset,
d_prompt_style=None,
processor=None,
):
dataset_wrapper = None
dataset_prompter = None
@@ -578,7 +602,11 @@ def get_dataset_wrapper(
dataset,
**ds_kwargs,
)
elif ds_strategy := load(config_dataset.type, tokenizer, cfg, config_dataset):
elif cfg.skip_prepare_dataset:
dataset_wrapper = dataset
elif ds_strategy := load(
config_dataset.type, tokenizer, cfg, config_dataset, processor=processor
):
dataset_prompter = UnsupportedPrompter()
dataset_wrapper = TokenizedPromptDataset(
ds_strategy,

View File

@@ -28,12 +28,17 @@ from transformers import ( # noqa: F401
AddedToken,
AutoConfig,
AutoModelForCausalLM,
AutoModelForVision2Seq,
AutoProcessor,
AutoTokenizer,
AwqConfig,
BitsAndBytesConfig,
GPTQConfig,
LlavaForConditionalGeneration,
MllamaForConditionalGeneration,
PreTrainedModel,
PreTrainedTokenizerBase,
ProcessorMixin,
)
from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled
@@ -80,6 +85,9 @@ def get_module_class_from_name(module, name):
def check_model_config(cfg: DictDefault, model_config: Union[AutoConfig, DictDefault]):
if cfg.is_multimodal:
model_config = model_config.text_config
quant_config_exists = (
hasattr(model_config, "quantization_config")
and model_config.quantization_config
@@ -299,25 +307,63 @@ def load_tokenizer(cfg):
return tokenizer
def load_processor(cfg: DictDefault, tokenizer: PreTrainedTokenizerBase):
processor_kwargs: Dict[str, Any] = {} # do we actually need this?
processor_cls = AutoProcessor
if cfg.processor_type:
processor_cls = getattr(transformers, cfg.processor_type)
processor = processor_cls.from_pretrained(
cfg.processor_config,
trust_remote_code=cfg.trust_remote_code or False,
tokenizer=tokenizer,
**processor_kwargs,
)
return processor
def load_model(
cfg: DictDefault,
tokenizer: PreTrainedTokenizerBase,
*,
processor: ProcessorMixin = None, # pylint: disable=unused-argument
inference: bool = False,
reference_model: bool = False,
**kwargs, # pylint: disable=unused-argument
) -> Tuple[PreTrainedModel, Optional[PeftConfig]]:
"""
Load a model for a given configuration and tokenizer.
"""
base_model = cfg.base_model
model_type = cfg.type_of_model
model_config = load_model_config(cfg)
# load any patches from plugins
from axolotl.integrations.base import PluginManager
plugin_manager = PluginManager.get_instance()
plugin_manager.pre_model_load(cfg)
if cfg.is_multimodal:
text_model_config = model_config.text_config
else:
text_model_config = model_config
# TODO refactor as a kwarg
load_in_8bit = cfg.load_in_8bit
if cfg.gradient_checkpointing == "unsloth":
transformers.modeling_utils.checkpoint = hf_grad_checkpoint_unsloth_wrapper
if hasattr(model_config, "model_type") and model_config.model_type == "mllama":
if cfg.flash_attention:
from axolotl.monkeypatch.attention.mllama import patch_mllama
patch_mllama()
if hasattr(model_config, "model_type") and model_config.model_type == "btlm":
if cfg.flash_attention:
from axolotl.monkeypatch.btlm_attn_hijack_flash import (
@@ -454,6 +500,19 @@ def load_model(
max_memory = cfg.max_memory
device_map = cfg.device_map
AutoModelLoader = AutoModelForCausalLM # pylint: disable=invalid-name
if cfg.is_multimodal:
if model_config.model_type == "llava":
AutoModelLoader = ( # pylint: disable=invalid-name
LlavaForConditionalGeneration
)
elif model_config.model_type == "mllama":
AutoModelLoader = ( # pylint: disable=invalid-name
MllamaForConditionalGeneration
)
else:
AutoModelLoader = AutoModelForVision2Seq # pylint: disable=invalid-name
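The dispatch above reduces to a small pure function. A minimal sketch, with plain strings standing in for the transformers classes the real code assigns:

```python
# Sketch of the multimodal loader dispatch above. Class names are strings
# here, standing in for the transformers classes.
def resolve_loader(is_multimodal: bool, model_type: str) -> str:
    if not is_multimodal:
        return "AutoModelForCausalLM"
    special = {
        "llava": "LlavaForConditionalGeneration",
        "mllama": "MllamaForConditionalGeneration",
    }
    # any other multimodal architecture falls back to the generic
    # vision-to-seq auto class
    return special.get(model_type, "AutoModelForVision2Seq")
```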
if cfg.gpu_memory_limit:
gpu_memory_limit = (
str(cfg.gpu_memory_limit) + "GiB"
@@ -471,7 +530,7 @@ def load_model(
from accelerate import infer_auto_device_map
with init_empty_weights():
model_canvas = AutoModelForCausalLM.from_config(
model_canvas = AutoModelLoader.from_config(
model_config, trust_remote_code=cfg.trust_remote_code or False
)
model_canvas.tie_weights()
@@ -544,7 +603,9 @@ def load_model(
"bnb_4bit_quant_type": "nf4",
"bnb_4bit_quant_storage": torch.bfloat16,
}
if cfg.model_config_type in ["jamba", "qwen2_moe"] and not cfg.deepspeed:
if cfg.model_config_type in ["jamba", "qwen2_moe"] and not (
cfg.deepspeed or cfg.fsdp
):
# for some reason, this causes the loss to be off by an order of magnitude
# but deepspeed needs this still in bfloat16
bnb_config["bnb_4bit_quant_storage"] = torch.float32
@@ -580,25 +641,12 @@ def load_model(
# sample packing uses custom FA2 patch
if cfg.flash_attention:
if not cfg.sample_packing:
if cfg.s2_attention:
pass
# most other models support flash attention, we can define exceptions as they come up
model_kwargs["attn_implementation"] = "flash_attention_2"
model_config._attn_implementation = ( # pylint: disable=protected-access
"flash_attention_2"
)
else:
if model_config.model_type in SUPPORTED_MULTIPACK_MODEL_TYPES:
model_kwargs["attn_implementation"] = "flash_attention_2"
model_config._attn_implementation = ( # pylint: disable=protected-access
"flash_attention_2"
)
else:
model_kwargs["attn_implementation"] = "eager"
model_config._attn_implementation = ( # pylint: disable=protected-access
"eager"
)
if not cfg.sample_packing and cfg.s2_attention:
pass
model_kwargs["attn_implementation"] = "flash_attention_2"
model_config._attn_implementation = ( # pylint: disable=protected-access
"flash_attention_2"
)
elif cfg.sdp_attention:
model_kwargs["attn_implementation"] = "sdpa"
model_config._attn_implementation = "sdpa" # pylint: disable=protected-access
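After this refactor the selection no longer branches on sample packing: flash attention wins whenever enabled, then SDPA. A minimal sketch, where `None` means the implementation is left for transformers to decide:

```python
# Sketch of the attention-implementation selection after the refactor above.
def pick_attn_implementation(flash_attention: bool, sdp_attention: bool):
    # flash attention, when enabled, is used regardless of sample packing
    if flash_attention:
        return "flash_attention_2"
    if sdp_attention:
        return "sdpa"
    # otherwise leave attn_implementation unset; transformers picks a default
    return None
```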
@@ -637,6 +685,8 @@ def load_model(
quantization_config = (
quantization_config or model_kwargs["quantization_config"]
)
if cfg.is_multimodal:
model_config.text_config = text_model_config
model = load_sharded_model_quant(
base_model,
model_config,
@@ -655,19 +705,13 @@ def load_model(
if "device_map" in model_kwargs:
del model_kwargs["device_map"]
if cfg.fsdp and not cfg.adapter and cfg.local_rank != 0:
with init_empty_weights():
model = AutoModelForCausalLM.from_pretrained(
base_model,
config=model_config,
**model_kwargs,
)
else:
model = AutoModelForCausalLM.from_pretrained(
base_model,
config=model_config,
**model_kwargs,
)
if cfg.is_multimodal:
model_config.text_config = text_model_config
model = AutoModelLoader.from_pretrained(
base_model,
config=model_config,
**model_kwargs,
)
if cfg.flash_attention and not inference:
from axolotl.monkeypatch.llama_attn_hijack_flash import (
@@ -702,13 +746,17 @@ def load_model(
and not cfg.trust_remote_code
):
if cfg.gptq:
model = AutoModelForCausalLM.from_pretrained(
if cfg.is_multimodal:
model_config.text_config = text_model_config
model = AutoModelLoader.from_pretrained(
base_model,
config=model_config,
trust_remote_code=cfg.trust_remote_code or False,
**model_kwargs,
)
else:
if cfg.is_multimodal:
model_config.text_config = text_model_config
model = getattr(transformers, model_type).from_pretrained(
base_model,
config=model_config,
@@ -719,21 +767,23 @@ def load_model(
# Shouldn't be a problem most of the time. Will obviously error if the model doesn't support this
# when training starts
if (
hasattr(model_config, "max_seq_len")
and model_config.max_seq_len
hasattr(text_model_config, "max_seq_len")
and text_model_config.max_seq_len
and cfg.sequence_len > model_config.max_seq_len
):
model_config.max_seq_len = cfg.sequence_len
text_model_config.max_seq_len = cfg.sequence_len
LOG.warning(f"increasing context length to {cfg.sequence_len}")
elif (
hasattr(model_config, "max_sequence_length")
and model_config.max_sequence_length
and cfg.sequence_len > model_config.max_sequence_length
hasattr(text_model_config, "max_sequence_length")
and text_model_config.max_sequence_length
and cfg.sequence_len > text_model_config.max_sequence_length
):
model_config.max_sequence_length = cfg.sequence_len
text_model_config.max_sequence_length = cfg.sequence_len
LOG.warning(f"increasing context length to {cfg.sequence_len}")
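The two branches above implement one rule over differently named config attributes. A sketch over a plain dict, assuming only an increase is ever applied:

```python
# Sketch of the context-length bump above: models expose either max_seq_len
# or max_sequence_length; raise it only when the requested sequence_len
# exceeds the configured maximum.
def bump_context_length(text_config: dict, sequence_len: int) -> dict:
    for key in ("max_seq_len", "max_sequence_length"):
        if text_config.get(key) and sequence_len > text_config[key]:
            text_config[key] = sequence_len
            break
    return text_config
```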
if cfg.gptq:
model = AutoModelForCausalLM.from_pretrained(
if cfg.is_multimodal:
model_config.text_config = text_model_config
model = AutoModelLoader.from_pretrained(
base_model,
config=model_config,
trust_remote_code=cfg.trust_remote_code or False,
@@ -746,7 +796,9 @@ def load_model(
if "device_map" in model_kwargs:
del model_kwargs["device_map"]
model = AutoModelForCausalLM.from_pretrained(
if cfg.is_multimodal:
model_config.text_config = text_model_config
model = AutoModelLoader.from_pretrained(
base_model,
config=model_config,
trust_remote_code=cfg.trust_remote_code or False,
@@ -1028,12 +1080,17 @@ def load_lora(model, cfg, inference=False, config_only=False):
from peft import LoraConfig, get_peft_model
lora_target_modules = list(cfg.lora_target_modules or [])
lora_target_modules = cfg.lora_target_modules or []
if cfg.lora_target_linear:
linear_names = find_all_linear_names(model)
LOG.info(f"found linear modules: {repr(sorted(linear_names))}")
lora_target_modules = list(set(lora_target_modules + linear_names))
lora_target_modules_as_list = (
lora_target_modules
if isinstance(lora_target_modules, list)
else [lora_target_modules]
)
lora_target_modules = list(set(lora_target_modules_as_list + linear_names))
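The fix above guards against `lora_target_modules` arriving as a bare string instead of a list. The normalization can be sketched as:

```python
# Sketch of the normalization above: coerce a string-or-list config value to
# a list, merge with the discovered linear module names, de-duplicate.
def merge_target_modules(configured, discovered):
    as_list = configured if isinstance(configured, list) else [configured]
    return sorted(set(as_list + list(discovered)))
```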
lora_config_kwargs = {}
loftq_bits = cfg.peft and cfg.peft.loftq_config and cfg.peft.loftq_config.loftq_bits
@@ -1052,11 +1109,13 @@ def load_lora(model, cfg, inference=False, config_only=False):
lora_alpha=cfg.lora_alpha,
target_modules=lora_target_modules,
layers_to_transform=cfg.peft_layers_to_transform,
layers_pattern=cfg.peft_layers_pattern,
lora_dropout=cfg.lora_dropout,
fan_in_fan_out=cfg.lora_fan_in_fan_out,
modules_to_save=cfg.lora_modules_to_save if cfg.lora_modules_to_save else None,
bias="none",
task_type="CAUSAL_LM",
task_type="CONDITIONAL_GENERATION" if cfg.is_multimodal else "CAUSAL_LM",
**lora_config_kwargs,
)
@@ -1108,9 +1167,20 @@ def load_lora(model, cfg, inference=False, config_only=False):
def ensure_dtype(model, dtype=torch.bfloat16):
for name, module in model.named_modules():
weight_mismatch = False
bias_mismatch = False
try:
if module.weight.dtype != dtype:
print(f"Converting module {name}: {module.weight.dtype} -> {dtype}")
module.to(dtype)
weight_mismatch = module.weight.dtype != dtype
except AttributeError:
pass
try:
bias_mismatch = module.bias.dtype != dtype
except AttributeError:
pass
if weight_mismatch:
print(f"Converting module {name}.weight: {module.weight.dtype} -> {dtype}")
if bias_mismatch:
print(f"Converting module {name}.bias: {module.bias.dtype} -> {dtype}")
if weight_mismatch or bias_mismatch:
module.to(dtype)
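The rewritten `ensure_dtype` now checks both `weight` and `bias` before converting. The control flow can be exercised in isolation with dummy stand-ins for torch modules (the classes below are illustrative, not part of the codebase):

```python
# Standalone sketch of the ensure_dtype control flow above; DummyParam and
# DummyModule stand in for torch parameters/modules, and dtypes are strings.
class DummyParam:
    def __init__(self, dtype):
        self.dtype = dtype

class DummyModule:
    def __init__(self, weight_dtype=None, bias_dtype=None):
        if weight_dtype is not None:
            self.weight = DummyParam(weight_dtype)
        if bias_dtype is not None:
            self.bias = DummyParam(bias_dtype)
        self.converted = False

    def to(self, dtype):
        self.converted = True

def ensure_dtype_one(module, dtype="bfloat16"):
    weight_mismatch = False
    bias_mismatch = False
    try:
        weight_mismatch = module.weight.dtype != dtype
    except AttributeError:
        pass  # module has no weight (e.g. a container module)
    try:
        bias_mismatch = module.bias.dtype != dtype
    except AttributeError:
        pass  # module has no bias
    if weight_mismatch or bias_mismatch:
        module.to(dtype)
    return module.converted
```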

View File

@@ -11,6 +11,8 @@ import numba
import numpy as np
from torch.utils.data import BatchSampler, Sampler
from axolotl.utils.distributed import reduce_and_broadcast
LOG = logging.getLogger("axolotl.utils.samplers.multipack")
@@ -174,16 +176,46 @@ class MultipackBatchSampler(BatchSampler):
def efficiency(self):
return self.eff_total_used / self.eff_total_slots
def gather_efficiency(self):
def calc_sample_packing_eff_est(estimates: List[float]):
LOG.debug(f"sample_packing_eff_est across ranks: {repr(estimates)}")
return math.floor(0.997 * max(estimates))
sample_packing_actual_eff_all = reduce_and_broadcast(
lambda: self.efficiency(), # pylint: disable=unnecessary-lambda
calc_sample_packing_eff_est,
)
sample_packing_eff_est = (
math.ceil(sample_packing_actual_eff_all * 200.0) / 200.0
)
return sample_packing_eff_est
def gather_len_batches(self, num):
def calc_min_len(estimates: list[int]):
LOG.info(f"gather_len_batches: {repr(estimates)}")
return math.floor(0.998 * min(estimates))
min_len_batches = reduce_and_broadcast(
lambda: num,
calc_min_len,
)
return min_len_batches
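The reduction inside `gather_len_batches` runs through `reduce_and_broadcast` in training; in isolation it is a pure function over the per-rank batch counts:

```python
import math

# The calc_min_len reduction above in isolation: each rank reports its local
# batch count, and all ranks agree on 99.8% of the minimum so no rank runs
# out of batches before the others.
def calc_min_len(estimates):
    return math.floor(0.998 * min(estimates))
```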
def __len__(self):
self.num_batches()
return self._len_est()
len_batches = self.num_batches()
return self.gather_len_batches(len_batches)
def _len_est(self):
efficiency = (
self.packing_efficiency_estimate
if self.packing_efficiency_estimate
else self.gather_efficiency()
)
world_size = int(os.getenv("WORLD_SIZE", "1"))
lengths_sum = np.sum(self.lengths)
lengths_sum_per_device = lengths_sum // world_size
LOG.info(
f"packing_efficiency_estimate: {self.packing_efficiency_estimate} "
f"packing_efficiency_estimate: {efficiency} "
f"total_num_tokens per device: {lengths_sum_per_device}"
)
@@ -195,7 +227,7 @@ class MultipackBatchSampler(BatchSampler):
* math.floor(
0.99
* lengths_sum_per_device
/ self.packing_efficiency_estimate
/ efficiency
// (self.batch_max_len * self.batch_size)
)
- 1

View File

@@ -217,6 +217,24 @@ def process_datasets_for_packing(cfg, train_dataset, eval_dataset):
desc="Dropping Long Sequences",
)
# drop samples where the number of labels not equal to -100 (i.e. trainable tokens) is zero
def drop_no_trainable_tokens(sample):
return np.sum(np.array(sample["labels"]) != -100) > 0
train_dataset = train_dataset.filter(
drop_no_trainable_tokens,
num_proc=cfg.dataset_processes,
load_from_cache_file=not cfg.is_preprocess,
desc="Drop Samples with Zero Trainable Tokens",
)
if eval_dataset:
eval_dataset = eval_dataset.filter(
drop_no_trainable_tokens,
num_proc=cfg.dataset_processes,
load_from_cache_file=not cfg.is_preprocess,
desc="Drop Samples with Zero Trainable Tokens",
)
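The filter predicate above, stripped of the `datasets`/numpy dependencies, keeps a sample only if at least one label is trainable:

```python
IGNORE_INDEX = -100  # label value the loss ignores

# Sketch of the drop_no_trainable_tokens filter above over plain dicts.
def has_trainable_tokens(sample):
    return any(label != IGNORE_INDEX for label in sample["labels"])

samples = [
    {"labels": [IGNORE_INDEX, IGNORE_INDEX, 42, 43]},  # kept
    {"labels": [IGNORE_INDEX, IGNORE_INDEX]},          # dropped: all masked
]
kept = [s for s in samples if has_trainable_tokens(s)]
```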
if cfg.group_by_length:
train_dataset = train_dataset.map(
add_length,
@@ -288,7 +306,7 @@ def process_pretraining_datasets_for_packing(
def calculate_total_num_steps(cfg, train_dataset, update=True):
if not cfg.total_num_tokens:
if not cfg.total_num_tokens and not cfg.skip_prepare_dataset:
total_num_tokens = np.sum(
train_dataset.data.column("input_ids")
.to_pandas()
@@ -301,7 +319,11 @@ def calculate_total_num_steps(cfg, train_dataset, update=True):
skip_estimates = cfg.model_config_type == "mamba"
if not skip_estimates and not cfg.total_supervised_tokens:
if (
not skip_estimates
and not cfg.total_supervised_tokens
and not cfg.skip_prepare_dataset
):
total_supervised_tokens = (
train_dataset.data.column("labels")
.to_pandas()
@@ -339,7 +361,7 @@ def calculate_total_num_steps(cfg, train_dataset, update=True):
main_process_only=True,
)
else:
if cfg.flash_attention:
if cfg.flash_attention and not cfg.multipack_real_batches:
sampler_batch_size = 1
batch_max_len = cfg.micro_batch_size * cfg.sequence_len
else:
@@ -390,13 +412,25 @@ def calculate_total_num_steps(cfg, train_dataset, update=True):
return total_num_steps
def setup_torch_compile_env(cfg):
if cfg.torch_compile:
if not cfg.torch_compile_backend:
os.environ["ACCELERATE_DYNAMO_BACKEND"] = "INDUCTOR"
else:
os.environ["ACCELERATE_DYNAMO_BACKEND"] = cfg.torch_compile_backend.upper()
def setup_deepspeed_env(cfg, stage=None):
from transformers.integrations.deepspeed import HfTrainerDeepSpeedConfig
os.environ["ACCELERATE_USE_DEEPSPEED"] = "true"
os.environ["ACCELERATE_DEEPSPEED_CONFIG_FILE"] = cfg.deepspeed
if stage:
os.environ["ACCELERATE_DEEPSPEED_ZERO_STAGE"] = str(stage)
if stage == 3:
os.environ["ACCELERATE_DEEPSPEED_ZERO3_INIT"] = "true"
# If we don't assign this, it doesn't actually get set in the accelerate weakref
_ = HfTrainerDeepSpeedConfig(cfg.deepspeed)
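The environment-variable plumbing above can be sketched as a pure mapping (the config path below is illustrative); stage 3 additionally enables zero3 init so large models can be materialized directly into shards:

```python
# Sketch of setup_deepspeed_env above as a pure function returning the
# accelerate environment variables instead of mutating os.environ.
def deepspeed_env(config_path: str, stage=None) -> dict:
    env = {
        "ACCELERATE_USE_DEEPSPEED": "true",
        "ACCELERATE_DEEPSPEED_CONFIG_FILE": config_path,
    }
    if stage:
        env["ACCELERATE_DEEPSPEED_ZERO_STAGE"] = str(stage)
        if stage == 3:
            env["ACCELERATE_DEEPSPEED_ZERO3_INIT"] = "true"
    return env
```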
def setup_fsdp_envs(cfg):
@@ -434,6 +468,8 @@ def prepare_optim_env(cfg):
stage = deepspeed_config.get("zero_optimization", {}).get("stage", None)
setup_deepspeed_env(cfg, stage=stage)
setup_torch_compile_env(cfg)
if (cfg.bf16 == "auto" and is_torch_bf16_gpu_available()) or cfg.bf16 is True:
os.environ["ACCELERATE_MIXED_PRECISION"] = "bf16"
elif cfg.fp16:
@@ -446,13 +482,15 @@ def prepare_opinionated_env(cfg):
os.environ["TOKENIZERS_PARALLELISM"] = "false"
def setup_trainer(cfg, train_dataset, eval_dataset, model, tokenizer, total_num_steps):
def setup_trainer(
cfg, train_dataset, eval_dataset, model, tokenizer, processor, total_num_steps
):
if cfg.rl in ["dpo", "ipo", "orpo", "kto", "simpo"]:
trainer_builder = HFRLTrainerBuilder(cfg, model[0], tokenizer)
trainer_builder = HFRLTrainerBuilder(cfg, model[0], tokenizer, processor)
trainer_builder.model_ref = model[1]
trainer_builder.peft_config = model[2]
else:
trainer_builder = HFCausalTrainerBuilder(cfg, model[0], tokenizer)
trainer_builder = HFCausalTrainerBuilder(cfg, model[0], tokenizer, processor)
trainer_builder.train_dataset = train_dataset
trainer_builder.eval_dataset = eval_dataset

View File

View File

@@ -0,0 +1,110 @@
"""
Simple end-to-end test for Liger integration
"""
import unittest
from pathlib import Path
from axolotl.cli import load_datasets
from axolotl.common.cli import TrainerCliArgs
from axolotl.train import train
from axolotl.utils.config import normalize_config
from axolotl.utils.dict import DictDefault
from ..utils import with_temp_dir
class LigerIntegrationTestCase(unittest.TestCase):
"""
e2e tests for liger integration with Axolotl
"""
@with_temp_dir
def test_llama_wo_flce(self, temp_dir):
cfg = DictDefault(
{
"base_model": "JackFram/llama-68m",
"tokenizer_type": "LlamaTokenizer",
"plugins": [
"axolotl.integrations.liger.LigerPlugin",
],
"liger_rope": True,
"liger_rms_norm": True,
"liger_swiglu": True,
"liger_cross_entropy": True,
"liger_fused_linear_cross_entropy": False,
"sequence_len": 1024,
"val_set_size": 0.1,
"special_tokens": {
"unk_token": "<unk>",
"bos_token": "<s>",
"eos_token": "</s>",
},
"datasets": [
{
"path": "mhenrichsen/alpaca_2k_test",
"type": "alpaca",
},
],
"num_epochs": 1,
"micro_batch_size": 8,
"gradient_accumulation_steps": 1,
"output_dir": temp_dir,
"learning_rate": 0.00001,
"optimizer": "adamw_torch",
"lr_scheduler": "cosine",
"save_safetensors": True,
"bf16": "auto",
}
)
normalize_config(cfg)
cli_args = TrainerCliArgs()
dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
assert (Path(temp_dir) / "model.safetensors").exists()
@with_temp_dir
def test_llama_w_flce(self, temp_dir):
cfg = DictDefault(
{
"base_model": "JackFram/llama-68m",
"tokenizer_type": "LlamaTokenizer",
"plugins": [
"axolotl.integrations.liger.LigerPlugin",
],
"liger_rope": True,
"liger_rms_norm": True,
"liger_swiglu": True,
"liger_cross_entropy": False,
"liger_fused_linear_cross_entropy": True,
"sequence_len": 1024,
"val_set_size": 0.1,
"special_tokens": {
"unk_token": "<unk>",
"bos_token": "<s>",
"eos_token": "</s>",
},
"datasets": [
{
"path": "mhenrichsen/alpaca_2k_test",
"type": "alpaca",
},
],
"num_epochs": 1,
"micro_batch_size": 8,
"gradient_accumulation_steps": 1,
"output_dir": temp_dir,
"learning_rate": 0.00001,
"optimizer": "adamw_torch",
"lr_scheduler": "cosine",
"save_safetensors": True,
"bf16": "auto",
}
)
normalize_config(cfg)
cli_args = TrainerCliArgs()
dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
assert (Path(temp_dir) / "model.safetensors").exists()

View File

@@ -10,6 +10,7 @@ from pathlib import Path
import pytest
import yaml
from accelerate.test_utils import execute_subprocess_async
from huggingface_hub import snapshot_download
from axolotl.utils.dict import DictDefault
@@ -19,6 +20,12 @@ LOG = logging.getLogger("axolotl.tests.e2e.multigpu")
os.environ["WANDB_DISABLED"] = "true"
@pytest.fixture(scope="session", autouse=True)
def download_model():
# download the model
snapshot_download("TinyLlama/TinyLlama_v1.1")
class TestMultiGPULlama(unittest.TestCase):
"""
Test case for Llama models using LoRA

View File

@@ -0,0 +1,98 @@
"""
E2E tests for multigpu qwen2
"""
import logging
import os
import unittest
from pathlib import Path
import yaml
from accelerate.test_utils import execute_subprocess_async
from axolotl.utils.dict import DictDefault
from ..utils import with_temp_dir
LOG = logging.getLogger("axolotl.tests.e2e.multigpu")
os.environ["WANDB_DISABLED"] = "true"
class TestMultiGPUQwen2(unittest.TestCase):
"""
Test case for Qwen2 models using QLoRA + FSDP with DPO
"""
@with_temp_dir
def test_qlora_fsdp_dpo(self, temp_dir):
# pylint: disable=duplicate-code
cfg = DictDefault(
{
"base_model": "Qwen/Qwen2-1.5B",
"load_in_4bit": True,
"rl": "dpo",
"chat_template": "chatml",
"sequence_len": 2048,
"adapter": "qlora",
"lora_r": 8,
"lora_alpha": 16,
"lora_dropout": 0.05,
"lora_target_linear": True,
"val_set_size": 0.05,
"datasets": [
{
"path": "Intel/orca_dpo_pairs",
"split": "train",
"type": "chatml.intel",
},
],
"num_epochs": 1,
"max_steps": 100,
"warmup_steps": 20,
"micro_batch_size": 4,
"gradient_accumulation_steps": 2,
"output_dir": temp_dir,
"learning_rate": 0.00001,
"optimizer": "adamw_torch",
"lr_scheduler": "cosine",
"flash_attention": True,
"bf16": "auto",
"tf32": True,
"gradient_checkpointing": True,
"gradient_checkpointing_kwargs": {
"use_reentrant": False,
},
"fsdp": [
"full_shard",
"auto_wrap",
],
"fsdp_config": {
"fsdp_limit_all_gathers": True,
"fsdp_offload_params": False,
"fsdp_sync_module_states": True,
"fsdp_use_orig_params": False,
"fsdp_cpu_ram_efficient_loading": False,
"fsdp_transformer_layer_cls_to_wrap": "Qwen2DecoderLayer",
"fsdp_state_dict_type": "FULL_STATE_DICT",
"fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
"fsdp_sharding_strategy": "FULL_SHARD",
},
}
)
# write cfg to yaml file
Path(temp_dir).mkdir(parents=True, exist_ok=True)
with open(Path(temp_dir) / "config.yaml", "w", encoding="utf-8") as fout:
fout.write(yaml.dump(cfg.to_dict(), Dumper=yaml.Dumper))
execute_subprocess_async(
[
"accelerate",
"launch",
"--num-processes",
"2",
"-m",
"axolotl.cli.train",
str(Path(temp_dir) / "config.yaml"),
]
)

View File

@@ -0,0 +1,71 @@
"""
shared fixtures for prompt strategies tests
"""
import pytest
from datasets import Dataset
from transformers import AutoTokenizer
@pytest.fixture(name="assistant_dataset")
def fixture_assistant_dataset():
return Dataset.from_list(
[
{
"messages": [
{"role": "user", "content": "hello"},
{"role": "assistant", "content": "hello"},
{"role": "user", "content": "goodbye"},
{"role": "assistant", "content": "goodbye"},
]
}
]
)
@pytest.fixture(name="sharegpt_dataset")
def fixture_sharegpt_dataset():
# pylint: disable=duplicate-code
return Dataset.from_list(
[
{
"conversations": [
{"from": "human", "value": "hello"},
{"from": "gpt", "value": "hello"},
{"from": "human", "value": "goodbye"},
{"from": "gpt", "value": "goodbye"},
]
}
]
)
@pytest.fixture(name="basic_dataset")
def fixture_basic_dataset():
# pylint: disable=duplicate-code
return Dataset.from_list(
[
{
"conversations": [
{"from": "system", "value": "You are an AI assistant."},
{"from": "human", "value": "Hello"},
{"from": "assistant", "value": "Hi there!"},
{"from": "human", "value": "How are you?"},
{"from": "assistant", "value": "I'm doing well, thank you!"},
]
}
]
)
@pytest.fixture(name="llama3_tokenizer")
def fixture_llama3_tokenizer():
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Meta-Llama-3-8B-Instruct")
return tokenizer
@pytest.fixture(name="phi35_tokenizer")
def fixture_phi35_tokenizer():
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")
return tokenizer

View File

@@ -5,10 +5,6 @@ tests for chat_template prompt strategy
import logging
import unittest
import pytest
from datasets import Dataset
from transformers import AutoTokenizer
from axolotl.prompt_strategies.chat_template import (
ChatTemplatePrompter,
ChatTemplateStrategy,
@@ -22,657 +18,6 @@ logging.basicConfig(level=logging.DEBUG)
LOG = logging.getLogger("axolotl")
@pytest.fixture(name="assistant_dataset")
def fixture_assistant_dataset():
return Dataset.from_list(
[
{
"messages": [
{"role": "user", "content": "hello"},
{"role": "assistant", "content": "hello"},
{"role": "user", "content": "goodbye"},
{"role": "assistant", "content": "goodbye"},
]
}
]
)
@pytest.fixture(name="sharegpt_dataset")
def fixture_sharegpt_dataset():
# pylint: disable=duplicate-code
return Dataset.from_list(
[
{
"conversations": [
{"from": "human", "value": "hello"},
{"from": "gpt", "value": "hello"},
{"from": "human", "value": "goodbye"},
{"from": "gpt", "value": "goodbye"},
]
}
]
)
@pytest.fixture(name="basic_dataset")
def fixture_basic_dataset():
# pylint: disable=duplicate-code
return Dataset.from_list(
[
{
"conversations": [
{"from": "system", "value": "You are an AI assistant."},
{"from": "human", "value": "Hello"},
{"from": "assistant", "value": "Hi there!"},
{"from": "human", "value": "How are you?"},
{"from": "assistant", "value": "I'm doing well, thank you!"},
]
}
]
)
@pytest.fixture(name="llama3_tokenizer")
def fixture_llama3_tokenizer():
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Meta-Llama-3-8B-Instruct")
return tokenizer
class TestChatTemplateConfigurations:
"""
Test class for various configurations of ChatTemplateStrategy.
"""
@staticmethod
def find_sublist(full_list, sub_list):
token_count = len(sub_list)
for index in range(len(full_list) - token_count + 1):
if full_list[index : index + token_count] == sub_list:
return index
return -1
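The `find_sublist` helper above, copied verbatim so it can be checked standalone: it returns the start index of the first occurrence of `sub_list` in `full_list`, or -1.

```python
# find_sublist from the test class above, runnable in isolation.
def find_sublist(full_list, sub_list):
    token_count = len(sub_list)
    for index in range(len(full_list) - token_count + 1):
        if full_list[index : index + token_count] == sub_list:
            return index
    return -1
```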
def test_train_on_inputs_true(self, llama3_tokenizer, basic_dataset):
LOG.info("Testing with train_on_inputs=True")
strategy = ChatTemplateStrategy(
ChatTemplatePrompter(llama3_tokenizer, chat_templates("llama3")),
tokenizer=llama3_tokenizer,
train_on_inputs=True,
sequence_len=512,
roles_to_train=["assistant"],
)
res = strategy.tokenize_prompt(basic_dataset[0])
labels = res["labels"]
input_ids = res["input_ids"]
# Verify that assistant responses are labeled
assistant_responses = ["Hi there!", "I'm doing well, thank you!"]
for response in assistant_responses:
response_ids = llama3_tokenizer.encode(response, add_special_tokens=False)
start_idx = self.find_sublist(input_ids, response_ids)
LOG.debug(
f"Assistant response '{response}' expected IDs: {response_ids}, found at: {start_idx}"
)
assert start_idx != -1, f"Could not find '{response}' in input_ids"
assert all(
label != IGNORE_TOKEN_ID
for label in labels[start_idx : start_idx + len(response_ids)]
), f"Expected labels for assistant response '{response}' to be set, but got {labels[start_idx:start_idx+len(response_ids)]}"
# Check the behavior of human inputs
human_inputs = ["Hello", "How are you?"]
for input_text in human_inputs:
input_text_ids = llama3_tokenizer.encode(input_text, add_special_tokens=False)
start_idx = self.find_sublist(input_ids, input_text_ids)
labeled = all(
label != IGNORE_TOKEN_ID
for label in labels[start_idx : start_idx + len(input_text_ids)]
)
LOG.debug(
f"Human input '{input_text}' is {'labeled' if labeled else 'not labeled'}, expected IDs: {input_text_ids}, found at: {start_idx}"
)
LOG.debug("Full labels: %s", labels)
LOG.debug("Full input_ids: %s", input_ids)
def test_train_on_inputs_false(self, llama3_tokenizer, basic_dataset):
LOG.info("Testing with train_on_inputs=False")
strategy = ChatTemplateStrategy(
ChatTemplatePrompter(llama3_tokenizer, chat_templates("llama3")),
tokenizer=llama3_tokenizer,
train_on_inputs=False,
sequence_len=512,
roles_to_train=["assistant"],
)
res = strategy.tokenize_prompt(basic_dataset[0])
labels = res["labels"]
input_ids = res["input_ids"]
# Verify that only assistant responses are labeled
assistant_responses = ["Hi there!", "I'm doing well, thank you!"]
for response in assistant_responses:
response_ids = llama3_tokenizer.encode(response, add_special_tokens=False)
start_idx = self.find_sublist(input_ids, response_ids)
LOG.debug(
f"Assistant response '{response}' expected IDs: {response_ids}, found at: {start_idx}"
)
assert start_idx != -1, f"Could not find '{response}' in input_ids"
assert all(
label != IGNORE_TOKEN_ID
for label in labels[start_idx : start_idx + len(response_ids)]
), f"Expected labels for assistant response '{response}' to be set, but got {labels[start_idx:start_idx+len(response_ids)]}"
# Verify that human inputs are not labeled
human_inputs = ["Hello", "How are you?"]
for input_text in human_inputs:
input_text_ids = llama3_tokenizer.encode(input_text, add_special_tokens=False)
start_idx = self.find_sublist(input_ids, input_text_ids)
LOG.debug(
f"Human input '{input_text}' expected IDs: {input_text_ids}, found at: {start_idx}"
)
assert start_idx != -1, f"Could not find '{input_text}' in input_ids"
assert all(
label == IGNORE_TOKEN_ID
for label in labels[start_idx : start_idx + len(input_text_ids)]
), f"Expected labels for human input '{input_text}' to be IGNORE_TOKEN_ID, but got {labels[start_idx:start_idx+len(input_text_ids)]}"
def test_roles_to_train_assistant_only(self, llama3_tokenizer, basic_dataset):
LOG.info("Testing roles_to_train with assistant only")
strategy = ChatTemplateStrategy(
ChatTemplatePrompter(llama3_tokenizer, chat_templates("llama3")),
tokenizer=llama3_tokenizer,
train_on_inputs=False,
sequence_len=512,
roles_to_train=["assistant"],
)
res = strategy.tokenize_prompt(basic_dataset[0])
labels = res["labels"]
input_ids = res["input_ids"]
# Verify that only assistant responses are labeled
assistant_responses = ["Hi there!", "I'm doing well, thank you!"]
for response in assistant_responses:
response_ids = llama3_tokenizer.encode(response, add_special_tokens=False)
start_idx = self.find_sublist(input_ids, response_ids)
LOG.debug(
f"Assistant response '{response}' expected IDs: {response_ids}, found at: {start_idx}"
)
assert all(
label != IGNORE_TOKEN_ID
for label in labels[start_idx : start_idx + len(response_ids)]
), f"Expected labels for assistant response '{response}' to be set, but got {labels[start_idx:start_idx+len(response_ids)]}"
def test_roles_to_train_all(self, llama3_tokenizer, basic_dataset):
LOG.info("Testing roles_to_train with all roles")
strategy = ChatTemplateStrategy(
ChatTemplatePrompter(llama3_tokenizer, chat_templates("llama3")),
tokenizer=llama3_tokenizer,
train_on_inputs=True,
sequence_len=512,
roles_to_train=["human", "assistant"],
)
res = strategy.tokenize_prompt(basic_dataset[0])
labels = res["labels"]
input_ids = res["input_ids"]
# Verify that all responses are labeled (except for special tokens)
all_responses = [
"Hello",
"Hi there!",
"How are you?",
"I'm doing well, thank you!",
]
for response in all_responses:
response_ids = llama3_tokenizer.encode(response, add_special_tokens=False)
start_idx = self.find_sublist(input_ids, response_ids)
LOG.debug(
f"Response '{response}' expected IDs: {response_ids}, found at: {start_idx}"
)
assert all(
label != IGNORE_TOKEN_ID
for label in labels[start_idx : start_idx + len(response_ids)]
), f"Expected labels for response '{response}' to be set, but got {labels[start_idx:start_idx+len(response_ids)]}"
def test_empty_roles_to_train(self, llama3_tokenizer, basic_dataset):
LOG.info("Testing with empty roles_to_train")
strategy = ChatTemplateStrategy(
ChatTemplatePrompter(llama3_tokenizer, chat_templates("llama3")),
tokenizer=llama3_tokenizer,
train_on_inputs=False,
sequence_len=512,
roles_to_train=[],
train_on_eos="none", # Add this line
)
res = strategy.tokenize_prompt(basic_dataset[0])
labels = res["labels"]
# Verify that no labels are set when roles_to_train is empty
LOG.debug("Full labels: %s", labels)
assert all(
label == IGNORE_TOKEN_ID for label in labels
), "Expected all labels to be IGNORE_TOKEN_ID when roles_to_train is empty"
def test_train_on_eos_all(self, llama3_tokenizer, basic_dataset):
LOG.info("Testing with train_on_eos='all'")
strategy = ChatTemplateStrategy(
ChatTemplatePrompter(llama3_tokenizer, chat_templates("llama3")),
tokenizer=llama3_tokenizer,
train_on_inputs=False,
sequence_len=512,
roles_to_train=["assistant"],
train_on_eos="all",
)
res = strategy.tokenize_prompt(basic_dataset[0])
labels = res["labels"]
input_ids = res["input_ids"]
eos_token_id = llama3_tokenizer.eos_token_id
eos_indices = [
i for i, token_id in enumerate(input_ids) if token_id == eos_token_id
]
assert len(eos_indices) > 0, "Expected at least one EOS token in the input"
for eos_idx in eos_indices:
assert (
labels[eos_idx] != IGNORE_TOKEN_ID
), f"Expected EOS token at index {eos_idx} to be labeled"
def test_train_on_eos_turn(self, llama3_tokenizer, basic_dataset):
LOG.info("Testing with train_on_eos='turn'")
strategy = ChatTemplateStrategy(
ChatTemplatePrompter(llama3_tokenizer, chat_templates("llama3")),
tokenizer=llama3_tokenizer,
train_on_inputs=False,
sequence_len=512,
roles_to_train=["assistant"],
train_on_eos="turn",
)
res = strategy.tokenize_prompt(basic_dataset[0])
labels = res["labels"]
input_ids = res["input_ids"]
eos_token_id = llama3_tokenizer.eos_token_id
assistant_responses = ["Hi there!", "I'm doing well, thank you!"]
for response in assistant_responses:
response_ids = llama3_tokenizer.encode(response, add_special_tokens=False)
start_idx = self.find_sublist(input_ids, response_ids)
assert start_idx != -1, f"Could not find '{response}' in input_ids"
eos_idx = start_idx + len(response_ids)
while eos_idx < len(input_ids) and input_ids[eos_idx] != eos_token_id:
eos_idx += 1
assert eos_idx < len(
input_ids
), f"Could not find EOS token after '{response}'"
assert (
labels[eos_idx] != IGNORE_TOKEN_ID
), f"Expected EOS token after assistant response '{response}' to be labeled"
# Check that EOS tokens after human inputs are not labeled
human_inputs = ["Hello", "How are you?"]
for input_text in human_inputs:
input_text_ids = llama3_tokenizer.encode(input_text, add_special_tokens=False)
start_idx = self.find_sublist(input_ids, input_text_ids)
assert start_idx != -1, f"Could not find '{input_text}' in input_ids"
eos_idx = start_idx + len(input_text_ids)
while eos_idx < len(input_ids) and input_ids[eos_idx] != eos_token_id:
eos_idx += 1
assert (
labels[eos_idx] == IGNORE_TOKEN_ID
), f"Expected EOS token after human input '{input_text}' to not be labeled"
def test_train_on_eos_last(self, llama3_tokenizer, basic_dataset):
LOG.info("Testing with train_on_eos='last'")
strategy = ChatTemplateStrategy(
ChatTemplatePrompter(llama3_tokenizer, chat_templates("llama3")),
tokenizer=llama3_tokenizer,
train_on_inputs=False,
sequence_len=512,
roles_to_train=["assistant"],
train_on_eos="last",
)
res = strategy.tokenize_prompt(basic_dataset[0])
labels = res["labels"]
input_ids = res["input_ids"]
eos_token_id = llama3_tokenizer.eos_token_id
eos_indices = [
i for i, token_id in enumerate(input_ids) if token_id == eos_token_id
]
assert len(eos_indices) > 0, "Expected at least one EOS token in the input"
last_eos_idx = eos_indices[-1]
# Check that only the last EOS token is labeled
for idx in eos_indices[:-1]:
assert (
labels[idx] == IGNORE_TOKEN_ID
), f"Expected EOS token at index {idx} to not be labeled"
assert (
labels[last_eos_idx] != IGNORE_TOKEN_ID
), f"Expected last EOS token at index {last_eos_idx} to be labeled"
def test_train_on_eos_none(self, llama3_tokenizer, basic_dataset):
LOG.info("Testing with train_on_eos='none'")
strategy = ChatTemplateStrategy(
ChatTemplatePrompter(llama3_tokenizer, chat_templates("llama3")),
tokenizer=llama3_tokenizer,
train_on_inputs=False,
sequence_len=512,
roles_to_train=["assistant"],
train_on_eos="none",
)
res = strategy.tokenize_prompt(basic_dataset[0])
labels = res["labels"]
input_ids = res["input_ids"]
eos_token_id = llama3_tokenizer.eos_token_id
eos_indices = [
i for i, token_id in enumerate(input_ids) if token_id == eos_token_id
]
assert len(eos_indices) > 0, "Expected at least one EOS token in the input"
for eos_idx in eos_indices:
assert (
labels[eos_idx] == IGNORE_TOKEN_ID
), f"Expected EOS token at index {eos_idx} to not be labeled"
def test_drop_system_message(self, llama3_tokenizer, basic_dataset):
LOG.info("Testing with drop_system_message=True")
strategy = ChatTemplateStrategy(
ChatTemplatePrompter(
llama3_tokenizer, chat_templates("llama3"), drop_system_message=True
),
tokenizer=llama3_tokenizer,
train_on_inputs=False,
sequence_len=512,
roles_to_train=["assistant"],
)
res = strategy.tokenize_prompt(basic_dataset[0])
input_ids = res["input_ids"]
# Check if system message is not present in input_ids
system_message = "You are an AI assistant."
system_ids = llama3_tokenizer.encode(system_message, add_special_tokens=False)
assert (
self.find_sublist(input_ids, system_ids) == -1
), "Expected system message to be dropped"
def test_custom_roles(self, llama3_tokenizer):
LOG.info("Testing with custom roles mapping")
custom_roles = {
"user": ["human", "user"],
"assistant": ["ai", "assistant"],
"system": ["context"],
}
strategy = ChatTemplateStrategy(
ChatTemplatePrompter(
llama3_tokenizer, chat_templates("llama3"), roles=custom_roles
),
tokenizer=llama3_tokenizer,
train_on_inputs=False,
sequence_len=512,
roles_to_train=["ai"],
)
# Create a new dataset with modified role names
modified_conversations = [
{"from": "context", "value": "You are an AI assistant."},
{"from": "human", "value": "Hello"},
{"from": "ai", "value": "Hi there!"},
{"from": "human", "value": "How are you?"},
{"from": "ai", "value": "I'm doing well, thank you!"},
]
modified_dataset = Dataset.from_dict(
{"conversations": [modified_conversations]}
)
res = strategy.tokenize_prompt(modified_dataset[0])
labels = res["labels"]
input_ids = res["input_ids"]
# Check if AI responses are labeled correctly
ai_responses = ["Hi there!", "I'm doing well, thank you!"]
for response in ai_responses:
response_ids = llama3_tokenizer.encode(response, add_special_tokens=False)
start_idx = self.find_sublist(input_ids, response_ids)
assert start_idx != -1, f"Could not find response '{response}' in input_ids"
assert all(
label != IGNORE_TOKEN_ID
for label in labels[start_idx : start_idx + len(response_ids)]
), f"Expected labels for AI response '{response}' to be set"
# Check if human messages are not labeled
human_messages = ["Hello", "How are you?"]
for message in human_messages:
message_ids = llama3_tokenizer.encode(message, add_special_tokens=False)
start_idx = self.find_sublist(input_ids, message_ids)
assert start_idx != -1, f"Could not find message '{message}' in input_ids"
assert all(
label == IGNORE_TOKEN_ID
for label in labels[start_idx : start_idx + len(message_ids)]
), f"Expected labels for human message '{message}' to be IGNORE_TOKEN_ID"
def test_message_field_training(self, llama3_tokenizer):
LOG.info("Testing with message_field_training")
strategy = ChatTemplateStrategy(
ChatTemplatePrompter(
llama3_tokenizer,
chat_templates("llama3"),
message_field_training="train",
message_field_training_detail="train_detail",
),
tokenizer=llama3_tokenizer,
train_on_inputs=False,
sequence_len=512,
roles_to_train=[],
)
# Create a new dataset with the train and train_detail fields
modified_conversation = [
{"from": "system", "value": "You are an AI assistant.", "train": False},
{"from": "human", "value": "Hello", "train": False},
{"from": "assistant", "value": "Hello", "train": True},
{"from": "human", "value": "How are you?", "train": True},
{
"from": "assistant",
"value": "I'm doing very well, thank you!",
"train_detail": [
{"begin_offset": 0, "end_offset": 8, "train": False},
{"begin_offset": 9, "end_offset": 18, "train": True},
{"begin_offset": 19, "end_offset": 30, "train": False},
],
},
{
"from": "human",
"value": "I'm doing very well, thank you!",
"train": False,
},
{"from": "assistant", "value": "Hi there!", "train": True},
]
modified_dataset = Dataset.from_dict({"conversations": [modified_conversation]})
res = strategy.tokenize_prompt(modified_dataset[0])
labels = res["labels"]
input_ids = res["input_ids"]
# Function to find all occurrences of a sublist
def find_all_sublists(full_list, sub_list):
indices = []
for index in range(len(full_list) - len(sub_list) + 1):
if full_list[index : index + len(sub_list)] == sub_list:
indices.append(index)
return indices
# Keep track of which occurrences we've processed
processed_occurrences = {}
# Check if messages are labeled correctly based on train or train_detail
for i, turn in enumerate(modified_conversation):
turn_tokens = llama3_tokenizer.encode(
turn["value"], add_special_tokens=False
)
occurrences = find_all_sublists(input_ids, turn_tokens)
turn_key = turn["value"]
if turn_key not in processed_occurrences:
processed_occurrences[turn_key] = 0
current_occurrence = processed_occurrences[turn_key]
assert current_occurrence < len(
occurrences
), f"Not enough occurrences found for message: {turn['value']}"
start_idx = occurrences[current_occurrence]
processed_occurrences[turn_key] += 1
end_idx = start_idx + len(turn_tokens)
LOG.debug(
f"Processing turn {i}: role={turn['from']}, content='{turn['value']}', start_idx={start_idx}, end_idx={end_idx}"
)
if "train_detail" in turn:
# Get token offsets
tokenized_output = llama3_tokenizer(
turn["value"], return_offsets_mapping=True, add_special_tokens=False
)
token_offsets = tokenized_output["offset_mapping"]
# Adjust token offsets as done in the implementation
for j in range(len(token_offsets) - 1):
token_offsets[j] = (
token_offsets[j][0],
token_offsets[j + 1][0] - 1,
)
token_offsets[-1] = (token_offsets[-1][0], len(turn["value"]) - 1)
# Adjust train_details
adjusted_train_details = strategy.prompter.adjust_train_details(
turn["train_detail"], token_offsets
)
LOG.debug(f"Original train_details: {turn['train_detail']}")
LOG.debug(f"Adjusted train_details: {adjusted_train_details}")
# Handle train_detail
token_offsets = strategy.prompter.get_offsets_for_train_detail(
text=turn["value"],
train_details=adjusted_train_details,
mask_untrainable=False,
)
token_offsets_masked = strategy.prompter.get_offsets_for_train_detail(
text=turn["value"],
train_details=adjusted_train_details,
mask_untrainable=True,
)
LOG.debug(f"Token offsets: {token_offsets_masked}")
expected_labels = [IGNORE_TOKEN_ID] * len(turn_tokens)
for j, offset in enumerate(token_offsets_masked):
if offset != IGNORE_TOKEN_ID:
expected_labels[j] = turn_tokens[j]
actual_labels = labels[
start_idx : start_idx + len(token_offsets_masked)
]
assert (
actual_labels == expected_labels
), f"Labels mismatch for turn: {turn['value']}\nExpected: {expected_labels}\nActual: {actual_labels}"
for detail in adjusted_train_details:
# Find the token indices that correspond to the character offsets
detail_start = start_idx + next(
i
for i, offset in enumerate(token_offsets)
if offset >= detail["begin_offset"]
)
detail_end = start_idx + next(
(
i
for i, offset in enumerate(token_offsets)
if offset > detail["end_offset"]
),
len(token_offsets),
)
detail_text = turn["value"][
detail["begin_offset"] : detail["end_offset"] + 1
]
detail_labels = labels[detail_start:detail_end]
detail_input_ids = input_ids[detail_start:detail_end]
LOG.debug(
f"Detail: '{detail_text}', Start: {detail_start}, End: {detail_end}"
)
LOG.debug(f"Detail input_ids: {detail_input_ids}")
LOG.debug(f"Detail labels: {detail_labels}")
LOG.debug(
f"Decoded detail: {llama3_tokenizer.decode(detail_input_ids)}"
)
LOG.debug(
f"Token offsets for this detail: {token_offsets[detail_start-start_idx:detail_end-start_idx]}"
)
if detail["train"]:
assert all(
label != IGNORE_TOKEN_ID for label in detail_labels
), (
f"Expected labels for trainable detail '{detail_text}' to be set, but some were IGNORE_TOKEN_ID. "
f"Labels({detail_start}:{detail_end}): {detail_labels}, "
f"InputIDs: {detail_input_ids}, "
f"Decoded: '{llama3_tokenizer.decode(detail_input_ids)}'"
)
else:
assert all(
label == IGNORE_TOKEN_ID for label in detail_labels
), (
f"Expected all labels for non-trainable detail '{detail_text}' to be IGNORE_TOKEN_ID, but some were not. "
f"Labels({detail_start}:{detail_end}): {detail_labels}, "
f"InputIDs: {detail_input_ids}, "
f"Decoded: '{llama3_tokenizer.decode(detail_input_ids)}'"
)
else:
should_train = turn.get("train", False)
turn_labels = labels[start_idx:end_idx]
LOG.debug(f"Should train: {should_train}")
LOG.debug(f"Turn indices: start={start_idx}, end={end_idx}")
LOG.debug(f"Turn labels: {turn_labels}")
LOG.debug(f"Turn input IDs: {input_ids[start_idx:end_idx]}")
LOG.debug(
f"Decoded turn: {llama3_tokenizer.decode(input_ids[start_idx:end_idx])}"
)
if should_train:
assert all(label != IGNORE_TOKEN_ID for label in turn_labels), (
f"Expected all labels for '{turn['value']}' to be set\n"
f"Labels({start_idx}:{end_idx}): {turn_labels}, "
f"InputIDs: {input_ids[start_idx:end_idx]}, "
f"Decoded: '{llama3_tokenizer.decode(input_ids[start_idx:end_idx])}'"
)
else:
assert all(label == IGNORE_TOKEN_ID for label in turn_labels), (
f"Expected all labels for '{turn['value']}' to be IGNORE_TOKEN_ID\n"
f"Labels({start_idx}:{end_idx}): {turn_labels}, "
f"InputIDs: {input_ids[start_idx:end_idx]}, "
f"Decoded: '{llama3_tokenizer.decode(input_ids[start_idx:end_idx])}'"
)
LOG.debug(
f"Processed turn: {turn['from']}, content: '{turn['value']}', "
f"start_idx: {start_idx}, end_idx: {end_idx}, "
f"labels: {labels[start_idx:end_idx]}"
)
LOG.debug(f"Final labels: {labels}")
LOG.debug(f"Final input_ids: {input_ids}")
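The label-masking convention all of these assertions rely on is that positions whose label equals IGNORE_TOKEN_ID (-100, the value transformers' loss ignores) are excluded from training. A minimal standalone sketch of that convention, where `mask_labels` is a hypothetical helper and not part of axolotl:

```python
IGNORE_TOKEN_ID = -100  # positions with this label are skipped by the loss


def mask_labels(input_ids, trainable_mask):
    # Keep the token id as the label where the position is trainable,
    # otherwise replace it with IGNORE_TOKEN_ID.
    return [
        tok if trainable else IGNORE_TOKEN_ID
        for tok, trainable in zip(input_ids, trainable_mask)
    ]


print(mask_labels([10, 11, 12], [False, True, True]))  # [-100, 11, 12]
```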
class TestAssistantChatTemplateLlama3:
"""
Test class for assistant style datasets with llama-3 prompts using the chat_template strategy.
@@ -728,7 +73,7 @@ class TestAssistantChatTemplateLlama3:
strategy = ChatTemplateStrategy(
ChatTemplatePrompter(
llama3_tokenizer,
chat_template=chat_templates("llama3"),
message_field_role="role",
message_field_content="content",
roles={
@@ -740,7 +85,6 @@ class TestAssistantChatTemplateLlama3:
tokenizer=llama3_tokenizer,
train_on_inputs=False,
sequence_len=512,
roles_to_train=["assistant"],
)
strategy.messages = "messages"
res = strategy.tokenize_prompt(assistant_dataset[0])
@@ -764,12 +108,70 @@ class TestAssistantChatTemplateLlama3:
input_ids == expected_input_ids
), f"Input IDs mismatch: {input_ids} != {expected_input_ids}"
def test_phi35(self, phi35_tokenizer, assistant_dataset):
LOG.info("Testing phi-3.5 with assistant dataset")
strategy = ChatTemplateStrategy(
ChatTemplatePrompter(
phi35_tokenizer,
chat_template=chat_templates("phi_35"),
message_field_role="role",
message_field_content="content",
roles={
"user": ["user"],
"assistant": ["assistant"],
"system": ["system"],
},
),
tokenizer=phi35_tokenizer,
train_on_inputs=False,
sequence_len=512,
)
strategy.messages = "messages"
res = strategy.tokenize_prompt(assistant_dataset[0])
input_ids = res["input_ids"]
labels = res["labels"]
# fmt: off
expected_input_ids = [
32010, # user
22172, 32007, # user eot
32001, # assistant
22172, 32007, # assistant eot
32010, # user
1781, 26966, 32007, # user eot
32001, # assistant
1781, 26966, 32007, # assistant eot
32000, # eos
]
expected_labels = [
-100, # user
-100, -100, # user eot
-100, # assistant
-100, -100, # assistant eot,
-100, # user
-100, -100, -100, # user eot
-100, # assistant
1781, 26966, 32007, # assistant eot
32000, # eos
]
# fmt: on
LOG.debug(f"Expected input_ids: {expected_input_ids}")
LOG.debug(f"Actual input_ids: {input_ids}")
assert (
input_ids == expected_input_ids
), f"Input IDs mismatch: {input_ids} != {expected_input_ids}"
LOG.debug(f"Expected labels : {expected_labels}")
LOG.debug(f"Actual labels : {labels}")
assert (
labels == expected_labels
), f"Labels mismatch: {labels} != {expected_labels}"
def test_llama3_with_training_data(self, llama3_tokenizer, assistant_dataset):
LOG.info("Testing llama-3 with assistant dataset including training data")
strategy = ChatTemplateStrategy(
ChatTemplatePrompter(
llama3_tokenizer,
chat_template=chat_templates("llama3"),
message_field_role="role",
message_field_content="content",
message_field_training="training",
@@ -825,8 +227,11 @@ class TestSharegptChatTemplateLlama3:
def test_llama3_assistant(self, llama3_tokenizer, sharegpt_dataset):
LOG.info("Testing ShareGPT style datasets with llama-3 assistant prompts")
# pylint: disable=duplicate-code
strategy = ChatTemplateStrategy(
ChatTemplatePrompter(
llama3_tokenizer, chat_template=chat_templates("llama3")
),
tokenizer=llama3_tokenizer,
train_on_inputs=False,
train_on_eos="none",
@@ -875,8 +280,11 @@ class TestSharegptChatTemplateLlama3:
def test_llama3_human(self, llama3_tokenizer, sharegpt_dataset):
LOG.info("Testing ShareGPT style datasets with llama-3 human prompts")
# pylint: disable=duplicate-code
strategy = ChatTemplateStrategy(
ChatTemplatePrompter(
llama3_tokenizer, chat_template=chat_templates("llama3")
),
tokenizer=llama3_tokenizer,
train_on_inputs=False,
train_on_eos="none",
@@ -925,8 +333,11 @@ class TestSharegptChatTemplateLlama3:
def test_llama3_system_human(self, llama3_tokenizer, basic_dataset):
LOG.info("Testing ShareGPT style datasets with llama-3 system/human prompts")
# pylint: disable=duplicate-code
strategy = ChatTemplateStrategy(
ChatTemplatePrompter(
llama3_tokenizer, chat_template=chat_templates("llama3")
),
tokenizer=llama3_tokenizer,
train_on_inputs=False,
train_on_eos="none",


@@ -0,0 +1,637 @@
"""
tests for chat_template prompt strategy
"""
import logging
import unittest
from datasets import Dataset
from axolotl.prompt_strategies.chat_template import (
ChatTemplatePrompter,
ChatTemplateStrategy,
)
from axolotl.prompters import IGNORE_TOKEN_ID
from axolotl.utils.chat_templates import chat_templates
logging.basicConfig(level=logging.DEBUG)
LOG = logging.getLogger("axolotl")
class TestChatTemplateConfigurations:
"""
Test class for various configurations of ChatTemplateStrategy.
"""
@staticmethod
def find_sublist(full_list, sub_list):
token_count = len(sub_list)
for index in range(len(full_list) - token_count + 1):
if full_list[index : index + token_count] == sub_list:
return index
return -1
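The `find_sublist` helper above returns the index of the first contiguous occurrence of `sub_list` in `full_list`, or -1 when there is none. Its matching semantics can be exercised standalone; this sketch simply duplicates the helper outside the test class:

```python
def find_sublist(full_list, sub_list):
    # Slide a window of len(sub_list) across full_list and compare slices.
    token_count = len(sub_list)
    for index in range(len(full_list) - token_count + 1):
        if full_list[index : index + token_count] == sub_list:
            return index
    return -1


print(find_sublist([1, 2, 3, 4], [2, 3]))  # 1 (first match)
print(find_sublist([1, 2, 3, 4], [4, 5]))  # -1 (no match)
```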
def test_train_on_inputs_true(self, llama3_tokenizer, basic_dataset):
LOG.info("Testing with train_on_inputs=True")
strategy = ChatTemplateStrategy(
ChatTemplatePrompter(
llama3_tokenizer, chat_template=chat_templates("llama3")
),
tokenizer=llama3_tokenizer,
train_on_inputs=True,
sequence_len=512,
roles_to_train=["assistant"],
)
res = strategy.tokenize_prompt(basic_dataset[0])
labels = res["labels"]
input_ids = res["input_ids"]
# Verify that assistant responses are labeled
assistant_responses = ["Hi there!", "I'm doing well, thank you!"]
for response in assistant_responses:
response_ids = llama3_tokenizer.encode(response, add_special_tokens=False)
start_idx = self.find_sublist(input_ids, response_ids)
LOG.debug(
f"Assistant response '{response}' expected IDs: {response_ids}, found at: {start_idx}"
)
assert start_idx != -1, f"Could not find '{response}' in input_ids"
assert all(
label != IGNORE_TOKEN_ID
for label in labels[start_idx : start_idx + len(response_ids)]
), f"Expected labels for assistant response '{response}' to be set, but got {labels[start_idx:start_idx+len(response_ids)]}"
# Check the behavior of human inputs
human_inputs = ["Hello", "How are you?"]
for input_text in human_inputs:
input_text_ids = llama3_tokenizer.encode(input_text, add_special_tokens=False)
start_idx = self.find_sublist(input_ids, input_text_ids)
labeled = all(
label != IGNORE_TOKEN_ID
for label in labels[start_idx : start_idx + len(input_text_ids)]
)
LOG.debug(
f"Human input '{input_text}' is {'labeled' if labeled else 'not labeled'}, expected IDs: {input_text_ids}, found at: {start_idx}"
)
LOG.debug("Full labels: %s", labels)
LOG.debug("Full input_ids: %s", input_ids)
def test_train_on_inputs_false(self, llama3_tokenizer, basic_dataset):
LOG.info("Testing with train_on_inputs=False")
strategy = ChatTemplateStrategy(
ChatTemplatePrompter(
llama3_tokenizer, chat_template=chat_templates("llama3")
),
tokenizer=llama3_tokenizer,
train_on_inputs=False,
sequence_len=512,
roles_to_train=["assistant"],
)
res = strategy.tokenize_prompt(basic_dataset[0])
labels = res["labels"]
input_ids = res["input_ids"]
# Verify that only assistant responses are labeled
assistant_responses = ["Hi there!", "I'm doing well, thank you!"]
for response in assistant_responses:
response_ids = llama3_tokenizer.encode(response, add_special_tokens=False)
start_idx = self.find_sublist(input_ids, response_ids)
LOG.debug(
f"Assistant response '{response}' expected IDs: {response_ids}, found at: {start_idx}"
)
assert start_idx != -1, f"Could not find '{response}' in input_ids"
assert all(
label != IGNORE_TOKEN_ID
for label in labels[start_idx : start_idx + len(response_ids)]
), f"Expected labels for assistant response '{response}' to be set, but got {labels[start_idx:start_idx+len(response_ids)]}"
# Verify that human inputs are not labeled
human_inputs = ["Hello", "How are you?"]
for input_text in human_inputs:
input_text_ids = llama3_tokenizer.encode(input_text, add_special_tokens=False)
start_idx = self.find_sublist(input_ids, input_text_ids)
LOG.debug(
f"Human input '{input_text}' expected IDs: {input_text_ids}, found at: {start_idx}"
)
assert start_idx != -1, f"Could not find '{input_text}' in input_ids"
assert all(
label == IGNORE_TOKEN_ID
for label in labels[start_idx : start_idx + len(input_text_ids)]
), f"Expected labels for human input '{input_text}' to be IGNORE_TOKEN_ID, but got {labels[start_idx:start_idx+len(input_text_ids)]}"
def test_roles_to_train_assistant_only(self, llama3_tokenizer, basic_dataset):
LOG.info("Testing roles_to_train with assistant only")
strategy = ChatTemplateStrategy(
ChatTemplatePrompter(
llama3_tokenizer, chat_template=chat_templates("llama3")
),
tokenizer=llama3_tokenizer,
train_on_inputs=False,
sequence_len=512,
roles_to_train=["assistant"],
)
res = strategy.tokenize_prompt(basic_dataset[0])
labels = res["labels"]
input_ids = res["input_ids"]
# Verify that only assistant responses are labeled
assistant_responses = ["Hi there!", "I'm doing well, thank you!"]
for response in assistant_responses:
response_ids = llama3_tokenizer.encode(response, add_special_tokens=False)
start_idx = self.find_sublist(input_ids, response_ids)
LOG.debug(
f"Assistant response '{response}' expected IDs: {response_ids}, found at: {start_idx}"
)
assert all(
label != IGNORE_TOKEN_ID
for label in labels[start_idx : start_idx + len(response_ids)]
), f"Expected labels for assistant response '{response}' to be set, but got {labels[start_idx:start_idx+len(response_ids)]}"
def test_roles_to_train_all(self, llama3_tokenizer, basic_dataset):
LOG.info("Testing roles_to_train with all roles")
strategy = ChatTemplateStrategy(
ChatTemplatePrompter(
llama3_tokenizer, chat_template=chat_templates("llama3")
),
tokenizer=llama3_tokenizer,
train_on_inputs=True,
sequence_len=512,
roles_to_train=["human", "assistant"],
)
res = strategy.tokenize_prompt(basic_dataset[0])
labels = res["labels"]
input_ids = res["input_ids"]
# Verify that all responses are labeled (except for special tokens)
all_responses = [
"Hello",
"Hi there!",
"How are you?",
"I'm doing well, thank you!",
]
for response in all_responses:
response_ids = llama3_tokenizer.encode(response, add_special_tokens=False)
start_idx = self.find_sublist(input_ids, response_ids)
LOG.debug(
f"Response '{response}' expected IDs: {response_ids}, found at: {start_idx}"
)
assert all(
label != IGNORE_TOKEN_ID
for label in labels[start_idx : start_idx + len(response_ids)]
), f"Expected labels for response '{response}' to be set, but got {labels[start_idx:start_idx+len(response_ids)]}"
def test_empty_roles_to_train(self, llama3_tokenizer, basic_dataset):
LOG.info("Testing with empty roles_to_train")
strategy = ChatTemplateStrategy(
ChatTemplatePrompter(
llama3_tokenizer, chat_template=chat_templates("llama3")
),
tokenizer=llama3_tokenizer,
train_on_inputs=False,
sequence_len=512,
roles_to_train=[],
train_on_eos="none",
)
res = strategy.tokenize_prompt(basic_dataset[0])
labels = res["labels"]
# Verify that no labels are set when roles_to_train is empty
LOG.debug("Full labels: %s", labels)
assert all(
label == IGNORE_TOKEN_ID for label in labels
), "Expected all labels to be IGNORE_TOKEN_ID when roles_to_train is empty"
def test_train_on_eos_all(self, llama3_tokenizer, basic_dataset):
LOG.info("Testing with train_on_eos='all'")
strategy = ChatTemplateStrategy(
ChatTemplatePrompter(
llama3_tokenizer, chat_template=chat_templates("llama3")
),
tokenizer=llama3_tokenizer,
train_on_inputs=False,
sequence_len=512,
roles_to_train=["assistant"],
train_on_eos="all",
)
res = strategy.tokenize_prompt(basic_dataset[0])
labels = res["labels"]
input_ids = res["input_ids"]
eos_token_id = llama3_tokenizer.eos_token_id
eos_indices = [
i for i, token_id in enumerate(input_ids) if token_id == eos_token_id
]
assert len(eos_indices) > 0, "Expected at least one EOS token in the input"
for eos_idx in eos_indices:
assert (
labels[eos_idx] != IGNORE_TOKEN_ID
), f"Expected EOS token at index {eos_idx} to be labeled"
def test_train_on_eos_turn(self, llama3_tokenizer, basic_dataset):
LOG.info("Testing with train_on_eos='turn'")
strategy = ChatTemplateStrategy(
ChatTemplatePrompter(
llama3_tokenizer, chat_template=chat_templates("llama3")
),
tokenizer=llama3_tokenizer,
train_on_inputs=False,
sequence_len=512,
roles_to_train=["assistant"],
train_on_eos="turn",
)
res = strategy.tokenize_prompt(basic_dataset[0])
labels = res["labels"]
input_ids = res["input_ids"]
eos_token_id = llama3_tokenizer.eos_token_id
assistant_responses = ["Hi there!", "I'm doing well, thank you!"]
for response in assistant_responses:
response_ids = llama3_tokenizer.encode(response, add_special_tokens=False)
start_idx = self.find_sublist(input_ids, response_ids)
assert start_idx != -1, f"Could not find '{response}' in input_ids"
eos_idx = start_idx + len(response_ids)
while eos_idx < len(input_ids) and input_ids[eos_idx] != eos_token_id:
eos_idx += 1
assert eos_idx < len(
input_ids
), f"Could not find EOS token after '{response}'"
assert (
labels[eos_idx] != IGNORE_TOKEN_ID
), f"Expected EOS token after assistant response '{response}' to be labeled"
# Check that EOS tokens after human inputs are not labeled
human_inputs = ["Hello", "How are you?"]
for input_text in human_inputs:
input_text_ids = llama3_tokenizer.encode(input_text, add_special_tokens=False)
start_idx = self.find_sublist(input_ids, input_text_ids)
assert start_idx != -1, f"Could not find '{input_text}' in input_ids"
eos_idx = start_idx + len(input_text_ids)
while eos_idx < len(input_ids) and input_ids[eos_idx] != eos_token_id:
eos_idx += 1
assert eos_idx < len(
input_ids
), f"Could not find EOS token after '{input_text}'"
assert (
labels[eos_idx] == IGNORE_TOKEN_ID
), f"Expected EOS token after human input '{input_text}' to not be labeled"
def test_train_on_eos_last(self, llama3_tokenizer, basic_dataset):
LOG.info("Testing with train_on_eos='last'")
strategy = ChatTemplateStrategy(
ChatTemplatePrompter(
llama3_tokenizer, chat_template=chat_templates("llama3")
),
tokenizer=llama3_tokenizer,
train_on_inputs=False,
sequence_len=512,
roles_to_train=["assistant"],
train_on_eos="last",
)
res = strategy.tokenize_prompt(basic_dataset[0])
labels = res["labels"]
input_ids = res["input_ids"]
eos_token_id = llama3_tokenizer.eos_token_id
eos_indices = [
i for i, token_id in enumerate(input_ids) if token_id == eos_token_id
]
assert len(eos_indices) > 0, "Expected at least one EOS token in the input"
last_eos_idx = eos_indices[-1]
# Check that only the last EOS token is labeled
for idx in eos_indices[:-1]:
assert (
labels[idx] == IGNORE_TOKEN_ID
), f"Expected EOS token at index {idx} to not be labeled"
assert (
labels[last_eos_idx] != IGNORE_TOKEN_ID
), f"Expected last EOS token at index {last_eos_idx} to be labeled"
def test_train_on_eos_none(self, llama3_tokenizer, basic_dataset):
LOG.info("Testing with train_on_eos='none'")
strategy = ChatTemplateStrategy(
ChatTemplatePrompter(
llama3_tokenizer, chat_template=chat_templates("llama3")
),
tokenizer=llama3_tokenizer,
train_on_inputs=False,
sequence_len=512,
roles_to_train=["assistant"],
train_on_eos="none",
)
res = strategy.tokenize_prompt(basic_dataset[0])
labels = res["labels"]
input_ids = res["input_ids"]
eos_token_id = llama3_tokenizer.eos_token_id
eos_indices = [
i for i, token_id in enumerate(input_ids) if token_id == eos_token_id
]
assert len(eos_indices) > 0, "Expected at least one EOS token in the input"
for eos_idx in eos_indices:
assert (
labels[eos_idx] == IGNORE_TOKEN_ID
), f"Expected EOS token at index {eos_idx} to not be labeled"
def test_drop_system_message(self, llama3_tokenizer, basic_dataset):
LOG.info("Testing with drop_system_message=True")
strategy = ChatTemplateStrategy(
ChatTemplatePrompter(
llama3_tokenizer,
chat_template=chat_templates("llama3"),
drop_system_message=True,
),
tokenizer=llama3_tokenizer,
train_on_inputs=False,
sequence_len=512,
roles_to_train=["assistant"],
)
res = strategy.tokenize_prompt(basic_dataset[0])
input_ids = res["input_ids"]
# Check if system message is not present in input_ids
system_message = "You are an AI assistant."
system_ids = llama3_tokenizer.encode(system_message, add_special_tokens=False)
assert (
self.find_sublist(input_ids, system_ids) == -1
), "Expected system message to be dropped"
def test_custom_roles(self, llama3_tokenizer):
LOG.info("Testing with custom roles mapping")
custom_roles = {
"user": ["human", "user"],
"assistant": ["ai", "assistant"],
"system": ["context"],
}
strategy = ChatTemplateStrategy(
ChatTemplatePrompter(
llama3_tokenizer,
chat_template=chat_templates("llama3"),
roles=custom_roles,
),
tokenizer=llama3_tokenizer,
train_on_inputs=False,
sequence_len=512,
roles_to_train=["ai"],
)
# Create a new dataset with modified role names
modified_conversations = [
{"from": "context", "value": "You are an AI assistant."},
{"from": "human", "value": "Hello"},
{"from": "ai", "value": "Hi there!"},
{"from": "human", "value": "How are you?"},
{"from": "ai", "value": "I'm doing well, thank you!"},
]
modified_dataset = Dataset.from_dict(
{"conversations": [modified_conversations]}
)
res = strategy.tokenize_prompt(modified_dataset[0])
labels = res["labels"]
input_ids = res["input_ids"]
# Check if AI responses are labeled correctly
ai_responses = ["Hi there!", "I'm doing well, thank you!"]
for response in ai_responses:
response_ids = llama3_tokenizer.encode(response, add_special_tokens=False)
start_idx = self.find_sublist(input_ids, response_ids)
assert start_idx != -1, f"Could not find response '{response}' in input_ids"
assert all(
label != IGNORE_TOKEN_ID
for label in labels[start_idx : start_idx + len(response_ids)]
), f"Expected labels for AI response '{response}' to be set"
# Check if human messages are not labeled
human_messages = ["Hello", "How are you?"]
for message in human_messages:
message_ids = llama3_tokenizer.encode(message, add_special_tokens=False)
start_idx = self.find_sublist(input_ids, message_ids)
assert start_idx != -1, f"Could not find message '{message}' in input_ids"
assert all(
label == IGNORE_TOKEN_ID
for label in labels[start_idx : start_idx + len(message_ids)]
), f"Expected labels for human message '{message}' to be IGNORE_TOKEN_ID"
def test_message_field_training(self, llama3_tokenizer):
LOG.info("Testing with message_field_training")
strategy = ChatTemplateStrategy(
ChatTemplatePrompter(
llama3_tokenizer,
chat_template=chat_templates("llama3"),
message_field_training="train",
message_field_training_detail="train_detail",
),
tokenizer=llama3_tokenizer,
train_on_inputs=False,
sequence_len=512,
roles_to_train=[],
)
# Create a new dataset with the train and train_detail fields
modified_conversation = [
{"from": "system", "value": "You are an AI assistant.", "train": False},
{"from": "human", "value": "Hello", "train": False},
{"from": "assistant", "value": "Hello", "train": True},
{"from": "human", "value": "How are you?", "train": True},
{
"from": "assistant",
"value": "I'm doing very well, thank you!",
"train_detail": [
{"begin_offset": 0, "end_offset": 8, "train": False},
{"begin_offset": 9, "end_offset": 18, "train": True},
{"begin_offset": 19, "end_offset": 30, "train": False},
],
},
{
"from": "human",
"value": "I'm doing very well, thank you!",
"train": False,
},
{"from": "assistant", "value": "Hi there!", "train": True},
]
modified_dataset = Dataset.from_dict({"conversations": [modified_conversation]})
res = strategy.tokenize_prompt(modified_dataset[0])
labels = res["labels"]
input_ids = res["input_ids"]
# Function to find all occurrences of a sublist
def find_all_sublists(full_list, sub_list):
indices = []
for index in range(len(full_list) - len(sub_list) + 1):
if full_list[index : index + len(sub_list)] == sub_list:
indices.append(index)
return indices
# Keep track of which occurrences we've processed
processed_occurrences = {}
# Check if messages are labeled correctly based on train or train_detail
for i, turn in enumerate(modified_conversation):
turn_tokens = llama3_tokenizer.encode(
turn["value"], add_special_tokens=False
)
occurrences = find_all_sublists(input_ids, turn_tokens)
turn_key = turn["value"]
if turn_key not in processed_occurrences:
processed_occurrences[turn_key] = 0
current_occurrence = processed_occurrences[turn_key]
assert current_occurrence < len(
occurrences
), f"Not enough occurrences found for message: {turn['value']}"
start_idx = occurrences[current_occurrence]
processed_occurrences[turn_key] += 1
end_idx = start_idx + len(turn_tokens)
LOG.debug(
f"Processing turn {i}: role={turn['from']}, content='{turn['value']}', start_idx={start_idx}, end_idx={end_idx}"
)
if "train_detail" in turn:
# Get token offsets
tokenized_output = llama3_tokenizer(
turn["value"], return_offsets_mapping=True, add_special_tokens=False
)
token_offsets = tokenized_output["offset_mapping"]
# Adjust token offsets as done in the implementation
for j in range(len(token_offsets) - 1):
token_offsets[j] = (
token_offsets[j][0],
token_offsets[j + 1][0] - 1,
)
token_offsets[-1] = (token_offsets[-1][0], len(turn["value"]) - 1)
# Adjust train_details
adjusted_train_details = strategy.prompter.adjust_train_details(
turn["train_detail"], token_offsets
)
LOG.debug(f"Original train_details: {turn['train_detail']}")
LOG.debug(f"Adjusted train_details: {adjusted_train_details}")
# Handle train_detail
token_offsets = strategy.prompter.get_offsets_for_train_detail(
text=turn["value"],
train_details=adjusted_train_details,
mask_untrainable=False,
)
token_offsets_masked = strategy.prompter.get_offsets_for_train_detail(
text=turn["value"],
train_details=adjusted_train_details,
mask_untrainable=True,
)
LOG.debug(f"Token offsets: {token_offsets_masked}")
expected_labels = [IGNORE_TOKEN_ID] * len(turn_tokens)
for j, offset in enumerate(token_offsets_masked):
if offset != IGNORE_TOKEN_ID:
expected_labels[j] = turn_tokens[j]
actual_labels = labels[
start_idx : start_idx + len(token_offsets_masked)
]
assert (
actual_labels == expected_labels
), f"Labels mismatch for turn: {turn['value']}\nExpected: {expected_labels}\nActual: {actual_labels}"
for detail in adjusted_train_details:
# Find the token indices that correspond to the character offsets
detail_start = start_idx + next(
i
for i, offset in enumerate(token_offsets)
if offset >= detail["begin_offset"]
)
detail_end = start_idx + next(
(
i
for i, offset in enumerate(token_offsets)
if offset > detail["end_offset"]
),
len(token_offsets),
)
detail_text = turn["value"][
detail["begin_offset"] : detail["end_offset"] + 1
]
detail_labels = labels[detail_start:detail_end]
detail_input_ids = input_ids[detail_start:detail_end]
LOG.debug(
f"Detail: '{detail_text}', Start: {detail_start}, End: {detail_end}"
)
LOG.debug(f"Detail input_ids: {detail_input_ids}")
LOG.debug(f"Detail labels: {detail_labels}")
LOG.debug(
f"Decoded detail: {llama3_tokenizer.decode(detail_input_ids)}"
)
LOG.debug(
f"Token offsets for this detail: {token_offsets[detail_start-start_idx:detail_end-start_idx]}"
)
if detail["train"]:
assert all(
label != IGNORE_TOKEN_ID for label in detail_labels
), (
f"Expected labels for trainable detail '{detail_text}' to be set, but some were IGNORE_TOKEN_ID. "
f"Labels({detail_start}:{detail_end}): {detail_labels}, "
f"InputIDs: {detail_input_ids}, "
f"Decoded: '{llama3_tokenizer.decode(detail_input_ids)}'"
)
else:
assert all(
label == IGNORE_TOKEN_ID for label in detail_labels
), (
f"Expected all labels for non-trainable detail '{detail_text}' to be IGNORE_TOKEN_ID, but some were not. "
f"Labels({detail_start}:{detail_end}): {detail_labels}, "
f"InputIDs: {detail_input_ids}, "
f"Decoded: '{llama3_tokenizer.decode(detail_input_ids)}'"
)
else:
should_train = turn.get("train", False)
turn_labels = labels[start_idx:end_idx]
LOG.debug(f"Should train: {should_train}")
LOG.debug(f"Turn indices: start={start_idx}, end={end_idx}")
LOG.debug(f"Turn labels: {turn_labels}")
LOG.debug(f"Turn input IDs: {input_ids[start_idx:end_idx]}")
LOG.debug(
f"Decoded turn: {llama3_tokenizer.decode(input_ids[start_idx:end_idx])}"
)
if should_train:
assert all(label != IGNORE_TOKEN_ID for label in turn_labels), (
f"Expected all labels for '{turn['value']}' to be set\n"
f"Labels({start_idx}:{end_idx}): {turn_labels}, "
f"InputIDs: {input_ids[start_idx:end_idx]}, "
f"Decoded: '{llama3_tokenizer.decode(input_ids[start_idx:end_idx])}'"
)
else:
assert all(label == IGNORE_TOKEN_ID for label in turn_labels), (
f"Expected all labels for '{turn['value']}' to be IGNORE_TOKEN_ID\n"
f"Labels({start_idx}:{end_idx}): {turn_labels}, "
f"InputIDs: {input_ids[start_idx:end_idx]}, "
f"Decoded: '{llama3_tokenizer.decode(input_ids[start_idx:end_idx])}'"
)
LOG.debug(
f"Processed turn: {turn['from']}, content: '{turn['value']}', "
f"start_idx: {start_idx}, end_idx: {end_idx}, "
f"labels: {labels[start_idx:end_idx]}"
)
LOG.debug(f"Final labels: {labels}")
LOG.debug(f"Final input_ids: {input_ids}")
if __name__ == "__main__":
unittest.main()
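The offset-adjustment loop near the top of this test (stretching each token's end offset to just before the next token's start) can be sketched standalone. This is a minimal illustration of that step only; the offsets below are made up for the example, not produced by a real tokenizer, and `adjust_token_offsets` is a hypothetical helper name, not part of the codebase.

```python
# Sketch of the offset-adjustment step exercised in the test above: each
# token's end offset is extended to just before the next token's start, so
# inter-token characters (e.g. spaces) are attributed to the preceding
# token, and the last token runs to the end of the text.

def adjust_token_offsets(token_offsets, text_length):
    adjusted = list(token_offsets)
    for i in range(len(adjusted) - 1):
        adjusted[i] = (adjusted[i][0], adjusted[i + 1][0] - 1)
    # the final token's span is clamped to the end of the text
    adjusted[-1] = (adjusted[-1][0], text_length - 1)
    return adjusted

text = "hello world again"
offsets = [(0, 5), (6, 11), (12, 17)]  # (start, end) per token
print(adjust_token_offsets(offsets, len(text)))
# [(0, 5), (6, 11), (12, 16)]
```

With adjusted spans like these, character-level `train_details` can be mapped onto whole tokens, which is what the masked/unmasked label comparisons later in the test rely on.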


@@ -35,7 +35,7 @@ class TestEncodePretraining(unittest.TestCase):
"hello, hello",
]
}
- result = encode_pretraining(self.tokenizer, self.max_tokens, examples["text"])
+ result = encode_pretraining(self.tokenizer, self.max_tokens, examples)
self.assertEqual(len(result["input_ids"]), 3)


@@ -42,6 +42,19 @@ class AlpacaPrompterTest(unittest.TestCase):
assert "USER:" not in res
assert "ASSISTANT:" not in res
def test_prompt_style_w_phi(self):
prompter = AlpacaPrompter(prompt_style=PromptStyle.PHI.value)
res = next(prompter.build_prompt("tell me a joke about the following"))
assert (
"""<|system|>
Below is an instruction that describes a task. Write a response that appropriately completes the request.<|end|>
<|user|>
tell me a joke about the following<|end|>
<|assistant|>
"""
== res
)
def test_prompt_style_w_chat(self):
prompter = AlpacaPrompter(prompt_style=PromptStyle.CHAT.value)
res = next(