Compare commits

...

38 Commits

Author SHA1 Message Date
Wing Lian
39ad38a1fb update address and port for spaces 2024-02-08 17:55:44 -05:00
Mads Henrichsen
ddb60883f5 create config 2024-02-08 09:26:58 +01:00
Mads Henrichsen
a5724ef08d axolotl start training 2024-02-07 18:16:21 +01:00
Mads Henrichsen
190930b5df spaces ui 2024-02-07 15:52:30 +01:00
JohanWork
1c7ed26785 lock pytorch (#1247) [skip ci] 2024-02-06 07:48:26 -05:00
Philip May
13eea21f9b Add more save strategies for DPO training. (#1255)
* Set save_strategy and save_steps in HFDPOTrainerBuilder

* fix doublicate save_steps
2024-02-06 00:38:43 -05:00
Chirag Jain
1072f28874 Fix typo bloat16 -> bfloat16 (#1257) 2024-02-06 00:38:14 -05:00
Wing Lian
c7cf3810bd Pretrain transforms (#1261)
* wip for pretraining/iterable data with arbitrary prompt strategies

* more fixes, wip

* more fixes for custom pretraining

* iterable ds wrapper not needed

* remove extra features

* chore: lint

* update pretraning example yml

* fix order for partials

* fixup for tests
2024-02-06 00:37:03 -05:00
Wing Lian
8c2e05ade3 relora: magnitude pruning of the optimizer (#1245)
* magnitude pruning of the optimizer

* add alpaca chat template and fix relora patch

* fix handling of lora adapter for relora

* fix merge and save call

* fixes for 8-bit lora merge

* save intermediate checkpoint adapters

* auto merge

* fix eval check

* handle relora annealing

* fix anneal step logic

* chore: lint

* misx fix

* fix types

* Update tests/e2e/test_relora_llama.py

* check for safetensors saved from relora
2024-02-06 00:35:30 -05:00
NanoCode012
2d65f470d5 fix(model): apply gate fp32 only for mixtral (#1241)
* fix(model): apply gate fp32 only for mixtral

* Update src/axolotl/utils/models.py

* fix gate layer check

---------

Co-authored-by: Wing Lian <wing.lian@gmail.com>
2024-02-01 13:55:05 -05:00
Wing Lian
dfd188502a add contact info for dedicated support for axolotl [skip ci] (#1243) 2024-02-01 12:59:07 -05:00
Wing Lian
00568c1539 support for true batches with multipack (#1230)
* support for true batches with multipack

* patch the map dataset fetcher to handle batches with packed indexes

* patch 4d mask creation for sdp attention

* better handling for BetterTransformer

* patch general case for 4d mask

* setup forward patch. WIP

* fix patch file

* support for multipack w/o flash attention for llama

* cleanup

* add warning about bf16 vs fp16 for multipack with sdpa

* bugfixes

* add 4d multipack tests, refactor patches

* update tests and add warnings

* fix e2e file check

* skip sdpa test if not at least torch 2.1.1, update docs
2024-02-01 10:18:42 -05:00
Wing Lian
c67fb71583 Peft deepspeed resume (#1227)
* import deepspeed integration

* monkeypatch peft adapater with deepspeed for resume from checkpoint

* fix patch

* fix patches attempt 2

* make sure to set lora_model_dir

* skip pylint for deepspeed.utils

* pick up upstream fix in transformers

* remove monkeypatch for deepspeed/peft fix

* no need to set the lora_model_dir on resume

* unset load_in_*bit when using quant config

* guard before del

* better handling of load_in* kwargs
2024-01-31 18:13:29 -05:00
DreamGenX
25e037fe2d Support for additional_special_tokens (#1221) [skip ci]
* Support for additional_special_tokens

* Support for additional_special_tokens. Adjust whitespace.

* Support for additional_special_tokens. Use correct quotes.

* Support for additional_special_tokens. Safe pop.

* Support for additional_special_tokens. nt.

* Support for additional_special_tokens. cfg.special_tokens may be None.

* add token if not in vocabulary when adding additional_special_tokens

* fix logic for copy/pasta

* bugfix for popping from config and tokenizer reload

* no need to add tokens manually now with previous bugfix

---------

Co-authored-by: Wing Lian <wing.lian@gmail.com>
2024-01-31 18:13:13 -05:00
Hamel Husain
52c83d30bf Update rlhf.md (#1237) [skip ci] 2024-01-31 17:27:35 -05:00
Wing Lian
d113331e9a add a helpful motd for cloud image (#1235) [skip ci] 2024-01-31 10:26:02 -05:00
Wing Lian
8f2b591baf set torch version to what is installed during axolotl install (#1234) 2024-01-31 08:47:34 -05:00
DreamGenX
5787e1a23f Fix and document test_datasets (#1228)
* Make sure test_dataset are used and treat val_set_size.

* Add test_datasets docs.

* Apply suggestions from code review

---------

Co-authored-by: Wing Lian <wing.lian@gmail.com>
2024-01-31 06:48:57 -05:00
xhedit
8608d8003e Fix typo (#1231) [skip ci] 2024-01-31 06:46:55 -05:00
Wing Lian
4cb7900a56 Peft lotfq (#1222)
* loftq support for lora

* fix loftq check

* update readme for loftq

* readability cleanup

* use peft main for loftq fixes, remove unnecessary special tokens

* remove unused test from older deprecation
2024-01-28 18:50:08 -05:00
Filippo Broggini
18f811978c FEAT: add tagging support to axolotl for DPOTrainer (#1209)
* Add AxolotlDPOTrainer

* chore: lint

---------

Co-authored-by: Wing Lian <wing.lian@gmail.com>
2024-01-26 20:01:57 -05:00
Wing Lian
afb5dd9655 Update FUNDING.yml [skip ci] 2024-01-26 20:00:28 -05:00
Wing Lian
8da1633124 Revert "run PR e2e docker CI tests in Modal" (#1220) [skip ci] 2024-01-26 16:50:44 -05:00
Wing Lian
36d053f6f0 run PR e2e docker CI tests in Modal (#1217) [skip ci]
* wip modal for ci

* handle falcon layernorms better

* update

* rebuild the template each time with the pseudo-ARGS

* fix ref

* update tests to use modal

* cleanup ci script

* make sure to install jinja2 also

* kickoff the gh action on gh hosted runners and specify num gpus
2024-01-26 16:13:27 -05:00
JohanWork
af29d81f80 ADD: warning if hub_model_id ist set but not any save strategy (#1202)
* warning if hub model id set but no save

* add warning

* move the warning

* add test

* allow more public methods for tests for now

* fix tests

---------

Co-authored-by: Wing Lian <wing.lian@gmail.com>
2024-01-26 10:38:55 -05:00
Wing Lian
1b180034c7 ensure the tests use the same version of torch as the latest base docker images (#1215) [skip ci] 2024-01-26 10:38:30 -05:00
DreamGenX
62ca4a2b71 Respect sliding_window=None (#1214) 2024-01-26 07:43:37 -05:00
Igor Berlenko
5407ddd233 Update qlora.yml - remove max_packed_sequence_len (#1210) [skip ci] 2024-01-26 07:43:05 -05:00
Wing Lian
74c72ca5eb drop py39 docker images, add py311, upgrade pytorch to 2.1.2 (#1205)
* drop py39 docker images, add py311, upgrade pytorch to 2.1.2

* also allow the main build to be manually triggered

* fix workflow_dispatch in yaml
2024-01-26 00:38:49 -05:00
Wing Lian
e923e62d24 more checks and fixes for deepspeed and fsdp (#1208) [skip ci] 2024-01-25 20:01:45 -05:00
Wing Lian
ba944e6554 workaround for transformers bug requireing do_sample for saveing pretrained (#1206) 2024-01-25 11:34:41 -05:00
Wing Lian
badda3783b make sure to register the base chatml template even if no system message is provided (#1207) 2024-01-25 10:38:08 -05:00
Wing Lian
a01b998c0f Update deps 202401 (#1204) [skip ci]
* update deps

* xformers fix too
2024-01-25 10:11:49 -05:00
Wing Lian
33e117088f precompute dpo logprobs setting and fixes (#1199) [skip ci]
* add support for precompute_ref_log_probs for dpo

* add chatml.icr type for argilla orca dpo

* update inline doc

* also set use_reentrant to false for dpo when not set

* don't set use_reentrant to true for rl

* make sure to set gradient checkpointing too
2024-01-25 09:31:55 -05:00
Ricardo Dominguez-Olmedo
b4ac96adef fix learning rate scheduler's warnings (#1135) [skip ci]
* fix schedulers warnings

* chore: lint

---------

Co-authored-by: Wing Lian <wing.lian@gmail.com>
2024-01-25 07:09:34 -05:00
mhenrichsen
98b4762077 Feat/chatml add system message (#1117)
* add system message to template

* readme update

* added code to register new system message

* register chatml template for test

---------

Co-authored-by: Mads Henrichsen <mads@BrbartiendeMads.lan>
Co-authored-by: Wing Lian <wing.lian@gmail.com>
2024-01-25 08:24:27 +01:00
JohanWork
ee0b5f60e5 add colab example (#1196) [skip ci] 2024-01-24 20:09:09 -05:00
NanoCode012
08719b9609 fix(log): improve warning to clarify that lora_modules_to_save expect a list (#1197) 2024-01-24 20:08:34 -05:00
60 changed files with 1717 additions and 563 deletions

.github/FUNDING.yml

@@ -1,6 +1,6 @@
# These are supported funding model platforms
github: OpenAccess-AI-Collective # Replace with up to 4 GitHub Sponsors-enabled usernames e.g., [user1, user2]
github: [winglian, OpenAccess-AI-Collective] # Replace with up to 4 GitHub Sponsors-enabled usernames e.g., [user1, user2]
patreon: # Replace with a single Patreon username
open_collective: # Replace with a single Open Collective username
ko_fi: axolotl_ai # Replace with a single Ko-fi username


@@ -1,10 +1,7 @@
name: ci-cd-base
on:
push:
branches:
- "main-base"
- "dev-base"
workflow_dispatch:
jobs:
build-base:
@@ -15,11 +12,6 @@ jobs:
fail-fast: false
matrix:
include:
- cuda: "118"
cuda_version: 11.8.0
python_version: "3.9"
pytorch: 2.0.1
torch_cuda_arch_list: "7.0 7.5 8.0 8.6 9.0+PTX"
- cuda: "118"
cuda_version: 11.8.0
python_version: "3.10"
@@ -28,12 +20,17 @@ jobs:
- cuda: "118"
cuda_version: 11.8.0
python_version: "3.10"
pytorch: 2.1.1
pytorch: 2.1.2
torch_cuda_arch_list: "7.0 7.5 8.0 8.6 9.0+PTX"
- cuda: "121"
cuda_version: 12.1.0
python_version: "3.10"
pytorch: 2.1.1
pytorch: 2.1.2
torch_cuda_arch_list: "7.0 7.5 8.0 8.6 9.0+PTX"
- cuda: "121"
cuda_version: 12.1.0
python_version: "3.11"
pytorch: 2.1.2
torch_cuda_arch_list: "7.0 7.5 8.0 8.6 9.0+PTX"
steps:
- name: Checkout
@@ -56,7 +53,7 @@ jobs:
context: .
file: ./docker/Dockerfile-base
push: ${{ github.event_name != 'pull_request' }}
tags: ${{ steps.metadata.outputs.tags }}-py${{ matrix.python_version }}-cu${{ matrix.cuda }}-${{ matrix.pytorch }}${{ matrix.axolotl_extras != '' && '-' || '' }}${{ matrix.axolotl_extras }}
tags: ${{ steps.metadata.outputs.tags }}-base-py${{ matrix.python_version }}-cu${{ matrix.cuda }}-${{ matrix.pytorch }}${{ matrix.axolotl_extras != '' && '-' || '' }}${{ matrix.axolotl_extras }}
labels: ${{ steps.metadata.outputs.labels }}
build-args: |
CUDA_VERSION=${{ matrix.cuda_version }}


@@ -4,6 +4,7 @@ on:
push:
branches:
- "main"
workflow_dispatch:
jobs:
build-axolotl:
@@ -15,24 +16,24 @@ jobs:
include:
- cuda: 118
cuda_version: 11.8.0
python_version: "3.9"
python_version: "3.10"
pytorch: 2.0.1
axolotl_extras:
- cuda: 118
cuda_version: 11.8.0
python_version: "3.10"
pytorch: 2.0.1
pytorch: 2.1.2
axolotl_extras:
is_latest: true
- cuda: 118
cuda_version: 11.8.0
python_version: "3.10"
pytorch: 2.1.1
axolotl_extras:
- cuda: 121
cuda_version: 12.1.0
python_version: "3.10"
pytorch: 2.1.1
pytorch: 2.1.2
axolotl_extras:
- cuda: 121
cuda_version: 12.1.0
python_version: "3.11"
pytorch: 2.1.2
axolotl_extras:
runs-on: [self-hosted, gpu, docker]
steps:
@@ -86,24 +87,24 @@ jobs:
include:
- cuda: 118
cuda_version: 11.8.0
python_version: "3.9"
python_version: "3.10"
pytorch: 2.0.1
axolotl_extras:
- cuda: 118
cuda_version: 11.8.0
python_version: "3.10"
pytorch: 2.0.1
pytorch: 2.1.2
axolotl_extras:
is_latest: true
- cuda: 118
cuda_version: 11.8.0
python_version: "3.10"
pytorch: 2.1.1
axolotl_extras:
- cuda: 121
cuda_version: 12.1.0
python_version: "3.10"
pytorch: 2.1.1
pytorch: 2.1.2
axolotl_extras:
- cuda: 121
cuda_version: 12.1.0
python_version: "3.11"
pytorch: 2.1.2
axolotl_extras:
runs-on: [self-hosted, gpu, docker]
steps:


@@ -73,7 +73,7 @@ jobs:
- cuda: 121
cuda_version: 12.1.0
python_version: "3.10"
pytorch: 2.1.1
pytorch: 2.1.2
steps:
- name: Checkout
uses: actions/checkout@v4
@@ -106,3 +106,7 @@ jobs:
- name: GPU Unit Tests monkeypatched w docker image
run: |
docker run --privileged --gpus "all" --env WANDB_DISABLED=true --rm ${{ steps.metadata.outputs.tags }}-py${{ matrix.python_version }}-cu${{ matrix.cuda }}-${{ matrix.pytorch }} pytest /workspace/axolotl/tests/e2e/patched/
- name: Prune image from docker
if: github.ref != 'refs/heads/main'
run: |
docker rmi -f ${{ steps.metadata.outputs.tags }}-py${{ matrix.python_version }}-cu${{ matrix.cuda }}-${{ matrix.pytorch }}


@@ -37,6 +37,9 @@ Features:
- [Inference](#inference)
- [Merge LORA to Base](#merge-lora-to-base)
- [Special Tokens](#special-tokens)
- Advanced Topics
- [Multipack](./docs/multipack.md)
- [RLHF & DPO](./docs/rlhf.md)
- [Common Errors](#common-errors-)
- [Tokenization Mismatch b/w Training & Inference](#tokenization-mismatch-bw-inference--training)
- [Debugging Axolotl](#debugging-axolotl)
@@ -607,12 +610,25 @@ datasets:
# For `completion` datasets only, uses the provided field instead of `text` column
field:
# A list of one or more datasets to eval the model with.
# You can use either test_datasets or val_set_size, but not both.
test_datasets:
- path: /workspace/data/eval.jsonl
ds_type: json
# You need to specify a split. For "json" datasets the default split is called "train".
split: train
type: completion
data_files:
- /workspace/data/eval.jsonl
# use RL training: dpo, ipo, kto_pair
rl:
# Saves the desired chat template to the tokenizer_config.json for easier inferencing
# Currently supports chatml and inst (mistral/mixtral)
chat_template: chatml
# Changes the default system message
default_system_message: You are a helpful assistant. Please give a long and detailed answer. # Currently only supports chatml.
# Axolotl attempts to save the dataset as an arrow after packing the data together so
# subsequent training attempts load faster, relative path
dataset_prepared_path: data/last_run_prepared
@@ -694,6 +710,12 @@ lora_modules_to_save:
lora_fan_in_fan_out: false
peft:
# Configuration options for loftq initialization for LoRA
# https://huggingface.co/docs/peft/developer_guides/quantization#loftq-initialization
loftq_config:
loftq_bits: # typically 4 bits
# ReLoRA configuration
# Must use either 'lora' or 'qlora' adapter, and does not support fsdp or deepspeed
relora_steps: # Number of steps per ReLoRA restart
@@ -1133,9 +1155,11 @@ Having misalignment between your prompts during training and inference can cause
See [this debugging guide](docs/debugging.md) for tips on debugging Axolotl, along with an example configuration for debugging with VSCode.
## Need help? 🙋‍♂️
## Need help? 🙋
Join our [Discord server](https://discord.gg/HhrNrHJPRb) where we can help you
Join our [Discord server](https://discord.gg/HhrNrHJPRb) where our community members can help you.
Need dedicated support? Please contact us at [✉wing@openaccessaicollective.org](mailto:wing@openaccessaicollective.org) for dedicated support options.
## Badge ❤🏷️


@@ -15,15 +15,6 @@
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"gradient_accumulation_steps": "auto",
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",


@@ -19,15 +19,6 @@
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"gradient_accumulation_steps": "auto",
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",


@@ -23,15 +23,6 @@
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"gradient_accumulation_steps": "auto",
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",


@@ -23,15 +23,6 @@
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"gradient_accumulation_steps": "auto",
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",


@@ -11,6 +11,7 @@ EXPOSE 8888
EXPOSE 22
COPY scripts/cloud-entrypoint.sh /root/cloud-entrypoint.sh
COPY scripts/motd /etc/motd
RUN pip install jupyterlab notebook ipywidgets && \
jupyter lab clean
@@ -18,6 +19,7 @@ RUN apt install --yes --no-install-recommends openssh-server tmux && \
mkdir -p ~/.ssh && \
chmod 700 ~/.ssh && \
printf "\n[[ -z \"\$TMUX\" ]] && { tmux attach-session -t ssh_tmux || tmux new-session -s ssh_tmux; exit; }\n" >> ~/.bashrc && \
printf "[ ! -z \"\$TERM\" -a -r /etc/motd ] && cat /etc/motd\n" >> ~/.bashrc && \
chmod +x /workspace/axolotl/scripts/cloud-entrypoint.sh && \
chmod +x /root/cloud-entrypoint.sh

docs/images/4d-mask.png (new binary file, 239 KiB)


@@ -1,4 +1,11 @@
# Multipack
# Multipack (Sample Packing)
## Visualization of Multipack with Flash Attention
Because Flash Attention simply drops the attention mask, we do not need to
construct a 4d attention mask. We only need to concatenate the sequences into
a single batch and let flash attention know where each new sequence begins.
4k context, bsz = 4,
each character represents 256 tokens
@@ -49,3 +56,18 @@ w packing ( note it's the same effective number of tokens per step, but a true b
E E E E F F F F F G G G H H H H
I I I J J J J K K K K K L L L X ]]
```
cu_seqlens:
[[ 0, 11, 17, 24, 28, 36, 41, 44, 48, 51, 55, 60, 64]]
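The `cu_seqlens` offsets are just the cumulative sums of the packed sequence lengths; a minimal plain-Python sketch, with the lengths read off the example above:

```python
# Cumulative sequence offsets for the packed example above:
# each entry marks where a new sequence begins in the flattened batch.
seq_lens = [11, 6, 7, 4, 8, 5, 3, 4, 3, 4, 5, 4]

cu_seqlens = [0]
for length in seq_lens:
    cu_seqlens.append(cu_seqlens[-1] + length)

print(cu_seqlens)
# [0, 11, 17, 24, 28, 36, 41, 44, 48, 51, 55, 60, 64]
```

Flash Attention's variable-length kernels consume exactly these offsets instead of a materialized attention mask.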
## Multipack without Flash Attention
Multipack can still be achieved without Flash Attention, though with lower packing
efficiency, since context-length limits prevent joining multiple batches into a
single batch. We can instead use either PyTorch's Scaled Dot Product Attention
implementation or the native PyTorch attention implementation along with
[4d attention masks](https://github.com/huggingface/transformers/pull/27539)
to pack sequences together and avoid cross-attention between them.
<img src="./images/4d-mask.png" alt="axolotl" width="800">
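As a rough illustration of what such a 4d mask encodes, here is a plain-Python sketch of a block-diagonal causal mask (the function name is illustrative, not repo API; a real implementation would build a tensor of shape `(batch, heads, seq, seq)`):

```python
NEG_INF = float("-inf")

def packed_causal_mask(seq_lens):
    """Additive mask: 0.0 where attention is allowed, -inf where blocked.
    Each packed sequence gets its own causal block, so tokens never
    attend across sequence boundaries."""
    total = sum(seq_lens)
    mask = [[NEG_INF] * total for _ in range(total)]
    offset = 0
    for length in seq_lens:
        for i in range(length):
            for j in range(i + 1):  # causal within the block
                mask[offset + i][offset + j] = 0.0
        offset += length
    return mask

mask = packed_causal_mask([3, 2])
# token 3 (start of the second sequence) cannot see tokens 0-2
```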


@@ -12,8 +12,8 @@ feedback. Various methods include, but not limited to:
### RLHF using Axolotl
[!IMPORTANT]
This is a BETA feature and many features are not fully implemented. You are encouraged to open new PRs to improve the integration and functionality.
>[!IMPORTANT]
>This is a BETA feature and many features are not fully implemented. You are encouraged to open new PRs to improve the integration and functionality.
The various RL training methods are implemented in trl and wrapped via axolotl. Below are various examples of how you can use different preference datasets to train models that use ChatML


@@ -11,7 +11,6 @@ val_set_size: 0.05
adapter: qlora
lora_model_dir:
sequence_len: 2048
max_packed_sequence_len: 2048
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05


@@ -0,0 +1,198 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "AKjdG7tbTb-n"
},
"source": [
"# Example notebook for running Axolotl on Google Colab"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "RcbNpOgWRcii"
},
"outputs": [],
"source": [
"import torch\n",
"# Check that a GPU is available; a T4 (free tier) is enough to run this notebook\n",
"assert torch.cuda.is_available()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "h3nLav8oTRA5"
},
"source": [
"## Install Axolotl and dependencies"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "3c3yGAwnOIdi",
"outputId": "e3777b5a-40ef-424f-e181-62dfecd1dd01"
},
"outputs": [],
"source": [
"!pip install torch==\"2.1.2\"\n",
"!pip install -e git+https://github.com/OpenAccess-AI-Collective/axolotl#egg=axolotl\n",
"!pip install flash-attn==\"2.5.0\"\n",
"!pip install deepspeed==\"0.13.1\""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BW2MFr7HTjub"
},
"source": [
"## Create a YAML config file"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "9pkF2dSoQEUN"
},
"outputs": [],
"source": [
"import yaml\n",
"\n",
"# Your YAML string\n",
"yaml_string = \"\"\"\n",
"base_model: TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T\n",
"model_type: LlamaForCausalLM\n",
"tokenizer_type: LlamaTokenizer\n",
"is_llama_derived_model: true\n",
"\n",
"load_in_8bit: false\n",
"load_in_4bit: true\n",
"strict: false\n",
"\n",
"datasets:\n",
" - path: mhenrichsen/alpaca_2k_test\n",
" type: alpaca\n",
"dataset_prepared_path:\n",
"val_set_size: 0.05\n",
"output_dir: ./qlora-out\n",
"\n",
"adapter: qlora\n",
"lora_model_dir:\n",
"\n",
"sequence_len: 1096\n",
"sample_packing: true\n",
"pad_to_sequence_len: true\n",
"\n",
"lora_r: 32\n",
"lora_alpha: 16\n",
"lora_dropout: 0.05\n",
"lora_target_modules:\n",
"lora_target_linear: true\n",
"lora_fan_in_fan_out:\n",
"\n",
"wandb_project:\n",
"wandb_entity:\n",
"wandb_watch:\n",
"wandb_name:\n",
"wandb_log_model:\n",
"\n",
"mlflow_experiment_name: colab-example\n",
"\n",
"gradient_accumulation_steps: 1\n",
"micro_batch_size: 1\n",
"num_epochs: 4\n",
"max_steps: 20\n",
"optimizer: paged_adamw_32bit\n",
"lr_scheduler: cosine\n",
"learning_rate: 0.0002\n",
"\n",
"train_on_inputs: false\n",
"group_by_length: false\n",
"bf16: false\n",
"fp16: true\n",
"tf32: false\n",
"\n",
"gradient_checkpointing: true\n",
"early_stopping_patience:\n",
"resume_from_checkpoint:\n",
"local_rank:\n",
"logging_steps: 1\n",
"xformers_attention:\n",
"flash_attention: false\n",
"\n",
"warmup_steps: 10\n",
"evals_per_epoch:\n",
"saves_per_epoch:\n",
"debug:\n",
"deepspeed:\n",
"weight_decay: 0.0\n",
"fsdp:\n",
"fsdp_config:\n",
"special_tokens:\n",
"\n",
"\"\"\"\n",
"\n",
"# Convert the YAML string to a Python dictionary\n",
"yaml_dict = yaml.safe_load(yaml_string)\n",
"\n",
"# Specify your file path\n",
"file_path = 'test_axolotl.yaml'\n",
"\n",
"# Write the YAML file\n",
"with open(file_path, 'w') as file:\n",
" yaml.dump(yaml_dict, file)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "bidoj8YLTusD"
},
"source": [
"## Launch the training"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "ydTI2Jk2RStU",
"outputId": "d6d0df17-4b53-439c-c802-22c0456d301b"
},
"outputs": [],
"source": [
"# By using the ! prefix, the command will be executed as a bash command\n",
"!accelerate launch -m axolotl.cli.train /content/test_axolotl.yaml"
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"gpuType": "T4",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

View File

@@ -67,6 +67,3 @@ weight_decay: 0.1
fsdp:
fsdp_config:
special_tokens:
bos_token: "<s>"
eos_token: "</s>"
unk_token: "<unk>"


@@ -0,0 +1,70 @@
base_model: NousResearch/Llama-2-7b-hf
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
is_llama_derived_model: true
load_in_8bit: false
load_in_4bit: false
strict: false
datasets:
- path: mhenrichsen/alpaca_2k_test
type: alpaca
dataset_prepared_path:
val_set_size: 0.05
output_dir: ./lora-out
sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
adapter: lora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
peft:
loftq_config:
loftq_bits: 4
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 4
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
s2_attention:
warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_table_max_new_tokens: 128
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:


@@ -65,6 +65,3 @@ weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
bos_token: "<s>"
eos_token: "</s>"
unk_token: "<unk>"


@@ -65,6 +65,3 @@ weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
bos_token: "<s>"
eos_token: "</s>"
unk_token: "<unk>"


@@ -12,6 +12,7 @@ max_steps: 200
pretraining_dataset:
path: c4
name: en
type: pretrain
dataset_prepared_path:
val_set_size: 0.0
output_dir: ./model-out


@@ -1,7 +1,7 @@
--extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
packaging==23.2
peft==0.7.0
transformers==4.37.0
peft @ git+https://github.com/huggingface/peft.git
transformers @ git+https://github.com/huggingface/transformers.git@bebeeee01275c32fccec3fa36d8b148d3813a7dc
tokenizers==0.15.0
bitsandbytes>=0.41.1
accelerate==0.26.1
@@ -15,16 +15,14 @@ sentencepiece
wandb
einops
xformers==0.0.22
optimum==1.13.2
optimum==1.16.2
hf_transfer
colorama
numba
numpy>=1.24.4
mlflow
# qlora things
bert-score==0.3.13
evaluate==0.4.0
rouge-score==0.1.2
scipy
scikit-learn==1.2.2
pynvml

scripts/motd (new file)

@@ -0,0 +1,17 @@
dP dP dP
88 88 88
.d8888b. dP. .dP .d8888b. 88 .d8888b. d8888P 88
88' `88 `8bd8' 88' `88 88 88' `88 88 88
88. .88 .d88b. 88. .88 88 88. .88 88 88
`88888P8 dP' `dP `88888P' dP `88888P' dP dP
Welcome to the axolotl cloud image! If you've mounted a disk to /workspace and the axolotl directory is empty, run the following commands:
```
cd /workspace
rm -rf /workspace/axolotl
git clone https://github.com/OpenAccess-AI-Collective/axolotl.git
cd axolotl
pip install --no-deps -e .
```


@@ -27,9 +27,10 @@ def parse_requirements():
try:
torch_version = version("torch")
if torch_version.startswith("2.1.1"):
_install_requires.append(f"torch=={torch_version}")
if torch_version.startswith("2.1."):
_install_requires.pop(_install_requires.index("xformers==0.0.22"))
_install_requires.append("xformers==0.0.23")
_install_requires.append("xformers>=0.0.23")
except PackageNotFoundError:
pass
@@ -50,7 +51,7 @@ setup(
dependency_links=dependency_links,
extras_require={
"flash-attn": [
"flash-attn==2.3.3",
"flash-attn==2.5.0",
],
"fused-dense-lib": [
"fused-dense-lib @ git+https://github.com/Dao-AILab/flash-attention@v2.3.3#subdirectory=csrc/fused_dense_lib",


@@ -18,6 +18,7 @@ from axolotl.cli import (
)
from axolotl.common.cli import PreprocessCliArgs
from axolotl.common.const import DEFAULT_DATASET_PREPARED_PATH
from axolotl.prompt_strategies.sharegpt import register_chatml_template
LOG = logging.getLogger("axolotl.cli.preprocess")
@@ -34,6 +35,14 @@ def do_cli(config: Path = Path("examples/"), **kwargs):
return_remaining_strings=True
)
if parsed_cfg.chat_template == "chatml" and parsed_cfg.default_system_message:
LOG.info(
f"ChatML set. Adding default system message: {parsed_cfg.default_system_message}"
)
register_chatml_template(parsed_cfg.default_system_message)
else:
register_chatml_template()
if not parsed_cfg.dataset_prepared_path:
msg = (
Fore.RED


@@ -6,8 +6,9 @@ from pathlib import Path
from typing import Tuple
import fire
import transformers
from transformers import PreTrainedModel, PreTrainedTokenizer
from transformers.hf_argparser import HfArgumentParser
from transformers.modeling_utils import PreTrainedModel
from transformers.tokenization_utils import PreTrainedTokenizer
from axolotl.cli import (
check_accelerate_default_config,
@@ -18,6 +19,7 @@ from axolotl.cli import (
print_axolotl_text_art,
)
from axolotl.common.cli import TrainerCliArgs
from axolotl.prompt_strategies.sharegpt import register_chatml_template
from axolotl.train import train
LOG = logging.getLogger("axolotl.cli.train")
@@ -26,7 +28,7 @@ LOG = logging.getLogger("axolotl.cli.train")
def do_cli(config: Path = Path("examples/"), **kwargs):
# pylint: disable=duplicate-code
parsed_cfg = load_cfg(config, **kwargs)
parser = transformers.HfArgumentParser((TrainerCliArgs))
parser = HfArgumentParser((TrainerCliArgs))
parsed_cli_args, _ = parser.parse_args_into_dataclasses(
return_remaining_strings=True
)
@@ -37,6 +39,14 @@ def do_train(cfg, cli_args) -> Tuple[PreTrainedModel, PreTrainedTokenizer]:
print_axolotl_text_art()
check_accelerate_default_config()
check_user_token()
if cfg.chat_template == "chatml" and cfg.default_system_message:
LOG.info(
f"ChatML set. Adding default system message: {cfg.default_system_message}"
)
register_chatml_template(cfg.default_system_message)
else:
register_chatml_template()
if cfg.rl:
dataset_meta = load_rl_datasets(cfg=cfg, cli_args=cli_args)
else:


@@ -6,6 +6,7 @@ import logging
from dataclasses import dataclass, field
from typing import Optional
import axolotl.monkeypatch.data.batch_dataset_fetcher # pylint: disable=unused-import # noqa: F401
from axolotl.logging_config import configure_logging
from axolotl.utils.dict import DictDefault
from axolotl.utils.models import load_model, load_tokenizer


@@ -59,6 +59,22 @@ except ImportError:
LOG = logging.getLogger("axolotl.core.trainer_builder")
def _sanitize_kwargs_for_tagging(tag_names, kwargs=None):
if isinstance(tag_names, str):
tag_names = [tag_names]
if kwargs is not None:
if "tags" not in kwargs:
kwargs["tags"] = tag_names
elif "tags" in kwargs and isinstance(kwargs["tags"], list):
kwargs["tags"].extend(tag_names)
elif "tags" in kwargs and isinstance(kwargs["tags"], str):
tag_names.append(kwargs["tags"])
kwargs["tags"] = tag_names
return kwargs
@dataclass
class AxolotlTrainingArguments(TrainingArguments):
"""
@@ -82,6 +98,10 @@ class AxolotlTrainingArguments(TrainingArguments):
default=False,
metadata={"help": "Use sample packing for efficient training."},
)
multipack_real_batches: bool = field(
default=False,
metadata={"help": "Use real batches for efficient training."},
)
eval_sample_packing: Optional[bool] = field(
default=None,
metadata={"help": "Use sample packing for efficient evals."},
@@ -106,6 +126,10 @@ class AxolotlTrainingArguments(TrainingArguments):
default=None,
metadata={"help": "how many warmup steps to take after reset for ReLoRA"},
)
relora_anneal_steps: Optional[int] = field(
default=None,
metadata={"help": "how many steps to anneal the learning rate over before each ReLoRA reset"},
)
bench_split: Optional[str] = field(
default="eval", metadata={"help": "The benchmark split to run on"}
)
@@ -170,24 +194,30 @@ class AxolotlTrainer(Trainer):
num_training_steps (int): The number of training steps to do.
optimizer (torch.optim.Optimizer): The training optimizer
"""
use_cosine_quadratic = (
self.args.lr_scheduler_type == "cosine"
and self.args.lr_quadratic_warmup is True
)
use_cosine_min_lr = (
self.args.lr_scheduler_type == "cosine"
and self.args.cosine_min_lr_ratio is not None
)
# fmt: off
if self.lr_scheduler is None: # type: ignore # pylint: disable=access-member-before-definition
# fmt: on
if (
self.args.lr_scheduler_type == "cosine"
and self.args.lr_quadratic_warmup is True
):
if use_cosine_quadratic:
if use_cosine_min_lr:
LOG.warning("Both cosine quadratic warmup and min lr detected. Using quadratic warmup.")
self.lr_scheduler = get_cosine_schedule_with_quadratic_warmup( # pylint: disable=attribute-defined-outside-init
optimizer,
num_warmup_steps=self.args.get_warmup_steps(num_training_steps),
num_training_steps=num_training_steps,
)
elif self.args.lr_scheduler_type == "cosine" and self.args.cosine_min_lr_ratio is not None:
elif self.args.cosine_min_lr_ratio and use_cosine_min_lr:
assert 0 <= self.args.cosine_min_lr_ratio <= 1.0, "cosine_min_lr_ratio must be between 0.0 and 1.0"
if self.args.deepspeed:
LOG.warning("Using cosine scheduler with deepspeed. This may be ignored if a scheduler is set \
in the deepspeed JSON")
self.lr_scheduler = get_cosine_schedule_with_min_lr( # pylint: disable=attribute-defined-outside-init
optimizer,
num_warmup_steps=self.args.get_warmup_steps(num_training_steps),
@@ -196,15 +226,30 @@ class AxolotlTrainer(Trainer):
)
else:
return super().create_scheduler(num_training_steps, optimizer)
else:
if use_cosine_quadratic:
LOG.warning("axolotl's cosine scheduler with quadratic warmup is not being used (e.g., because of deepspeed).")
if use_cosine_min_lr:
LOG.warning("axolotl's cosine scheduler with min lr is not being used (e.g., because of deepspeed).")
return self.lr_scheduler
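The scheduler selection above wires up either a cosine schedule with quadratic warmup or a cosine schedule with a minimum LR floor. The shape of the first variant can be sketched standalone (the function name and the `min_ratio` handling here are illustrative, not the library's API):

```python
import math

def cosine_lr_with_quadratic_warmup(step, warmup_steps, total_steps, min_ratio=0.0):
    """LR multiplier: quadratic ramp during warmup, cosine decay afterwards.

    A sketch of the schedule shape only; the real implementation is
    get_cosine_schedule_with_quadratic_warmup via a torch LambdaLR.
    """
    if step < warmup_steps:
        # quadratic warmup: (step / warmup)^2 rises more slowly than linear early on
        return (step / max(1, warmup_steps)) ** 2
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_ratio + (1.0 - min_ratio) * cosine
```

At step 0 the multiplier is 0, it reaches 1.0 exactly when warmup ends, and decays to `min_ratio` at the final step.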
def _get_train_sampler(self) -> Optional[torch.utils.data.Sampler]:
if self.args.sample_packing and not self.args.pretraining:
if self.args.multipack_real_batches:
batch_size = self.args.per_device_train_batch_size
batch_max_len = self.args.max_seq_length
else:
batch_size = 1
batch_max_len = (
self.args.per_device_train_batch_size * self.args.max_seq_length
)
return MultipackBatchSampler(
RandomSampler(self.train_dataset),
self.args.train_batch_size,
batch_size=batch_size,
drop_last=True,
batch_max_len=self._train_batch_size * self.args.max_seq_length,
batch_max_len=batch_max_len,
lengths=get_dataset_lengths(self.train_dataset),
packing_efficiency_estimate=self.args.sample_packing_efficiency,
)
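The new `multipack_real_batches` branch changes what a "batch" means to the sampler: with real batches each step holds `per_device_train_batch_size` rows capped at `max_seq_length`, while the flash-attention path packs the whole step into a single long row. A standalone sketch of that branching (helper name is illustrative):

```python
def multipack_batch_params(per_device_batch_size, max_seq_length, real_batches):
    """Mirror of the batch_size / batch_max_len branching in _get_train_sampler."""
    if real_batches:
        # each sampled batch holds `per_device_batch_size` packed rows,
        # each capped at max_seq_length tokens
        return per_device_batch_size, max_seq_length
    # flash-attention path: one packed "row" per step, holding up to
    # batch_size * seq_len tokens
    return 1, per_device_batch_size * max_seq_length
```

E.g. with a micro batch size of 4 and a 2048-token context, the real-batch path yields (4, 2048) while the packed path yields (1, 8192).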
@@ -214,11 +259,19 @@ class AxolotlTrainer(Trainer):
self, eval_dataset: Dataset
) -> Optional[torch.utils.data.Sampler]:
if self.args.sample_packing and self.args.eval_sample_packing is not False:
if self.args.multipack_real_batches:
batch_size = self.args.per_device_eval_batch_size
batch_max_len = self.args.max_seq_length
else:
batch_size = 1
batch_max_len = (
self.args.per_device_eval_batch_size * self.args.max_seq_length
)
return MultipackBatchSampler(
SequentialSampler(eval_dataset),
self.args.per_device_eval_batch_size,
batch_size=batch_size,
drop_last=True,
batch_max_len=self.args.eval_batch_size * self.args.max_seq_length,
batch_max_len=batch_max_len,
lengths=get_dataset_lengths(eval_dataset),
packing_efficiency_estimate=self.args.sample_packing_efficiency,
)
@@ -336,30 +389,13 @@ class AxolotlTrainer(Trainer):
# return (loss, outputs) if return_outputs else loss
return super().compute_loss(model, inputs, return_outputs=return_outputs)
def _sanitize_kwargs_for_tagging(self, tag_names, kwargs=None):
if isinstance(tag_names, str):
tag_names = [tag_names]
if kwargs is not None:
if "tags" not in kwargs:
kwargs["tags"] = tag_names
elif "tags" in kwargs and isinstance(kwargs["tags"], list):
kwargs["tags"].extend(tag_names)
elif "tags" in kwargs and isinstance(kwargs["tags"], str):
tag_names.append(kwargs["tags"])
kwargs["tags"] = tag_names
return kwargs
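The tag-sanitizing helper being moved out of the class merges the trainer's tags into whatever `tags` the caller already passed to `push_to_hub`. A standalone copy for illustration:

```python
def sanitize_kwargs_for_tagging(tag_names, kwargs=None):
    """Merge trainer tags into push_to_hub kwargs (standalone sketch)."""
    if isinstance(tag_names, str):
        tag_names = [tag_names]
    if kwargs is not None:
        if "tags" not in kwargs:
            kwargs["tags"] = tag_names
        elif isinstance(kwargs["tags"], list):
            # caller already passed a list: extend it with the trainer's tags
            kwargs["tags"].extend(tag_names)
        elif isinstance(kwargs["tags"], str):
            # caller passed a single string tag: fold it into the list
            tag_names.append(kwargs["tags"])
            kwargs["tags"] = tag_names
    return kwargs
```

So a user-supplied string tag survives alongside the forced `axolotl`/`dpo` tags rather than being overwritten.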
@wraps(Trainer.push_to_hub)
def push_to_hub(self, *args, **kwargs) -> str:
"""
Overwrite the `push_to_hub` method in order to force-add the tags when pushing the
model on the Hub. Please refer to `~transformers.Trainer.push_to_hub` for more details.
"""
kwargs = self._sanitize_kwargs_for_tagging(
tag_names=self.tag_names, kwargs=kwargs
)
kwargs = _sanitize_kwargs_for_tagging(tag_names=self.tag_names, kwargs=kwargs)
return super().push_to_hub(*args, **kwargs)
@@ -446,10 +482,14 @@ class ReLoRATrainer(AxolotlTrainer):
warmup_steps = (
self.args.relora_warmup_steps if self.args.relora_warmup_steps else 10
)
anneal_steps = (
self.args.relora_anneal_steps if self.args.relora_anneal_steps else 1
)
self.lr_scheduler = ReLoRAScheduler(
optimizer,
lr_scheduler,
self.args.relora_steps,
anneal_steps,
warmup_steps,
)
else:
@@ -458,6 +498,24 @@ class ReLoRATrainer(AxolotlTrainer):
return self.lr_scheduler
class AxolotlDPOTrainer(DPOTrainer):
"""
Extend the base DPOTrainer for axolotl helpers
"""
tag_names = ["axolotl", "dpo"]
@wraps(DPOTrainer.push_to_hub)
def push_to_hub(self, *args, **kwargs) -> str:
"""
Overwrite the `push_to_hub` method in order to force-add the tags when pushing the
model on the Hub. Please refer to `~transformers.Trainer.push_to_hub` for more details.
"""
kwargs = _sanitize_kwargs_for_tagging(tag_names=self.tag_names, kwargs=kwargs)
return super().push_to_hub(*args, **kwargs)
class TrainerBuilderBase(abc.ABC):
"""
Base class for trainer builder
@@ -638,7 +696,7 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
training_arguments_kwargs[
"gradient_checkpointing"
] = self.cfg.gradient_checkpointing
if self.cfg.gradient_checkpointing_kwargs:
if self.cfg.gradient_checkpointing_kwargs is not None:
training_arguments_kwargs[
"gradient_checkpointing_kwargs"
] = self.cfg.gradient_checkpointing_kwargs
@@ -705,7 +763,7 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
elif self.cfg.sample_packing and self.cfg.eval_sample_packing is False:
training_arguments_kwargs["dataloader_drop_last"] = True
if self.cfg.val_set_size == 0:
if not self.cfg.test_datasets and self.cfg.val_set_size == 0:
# no eval set, so don't eval
training_arguments_kwargs["evaluation_strategy"] = "no"
elif self.cfg.eval_steps:
@@ -792,6 +850,7 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
self.cfg.load_best_model_at_end is not False
or self.cfg.early_stopping_patience
)
and not self.cfg.test_datasets
and self.cfg.val_set_size > 0
and self.cfg.save_steps
and self.cfg.eval_steps
@@ -829,6 +888,9 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
training_arguments_kwargs["sample_packing"] = (
self.cfg.sample_packing if self.cfg.sample_packing else False
)
training_arguments_kwargs["multipack_real_batches"] = (
self.cfg.flash_attention is not True
)
training_arguments_kwargs["eval_sample_packing"] = (
self.cfg.sample_packing
if self.cfg.eval_sample_packing is not False
@@ -839,6 +901,7 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
] = self.cfg.micro_batch_size
training_arguments_kwargs["relora_steps"] = self.cfg.relora_steps
training_arguments_kwargs["relora_warmup_steps"] = self.cfg.relora_warmup_steps
training_arguments_kwargs["relora_anneal_steps"] = self.cfg.relora_anneal_steps
training_arguments_kwargs = self.hook_pre_create_training_args(
training_arguments_kwargs
)
@@ -933,6 +996,11 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
if use_batch_sampler_collator:
if self.cfg.model_config_type in ["mixtral", "qwen2", "falcon", "phi"]:
collator = V2BatchSamplerDataCollatorForSeq2Seq
elif (
self.cfg.model_config_type in ["llama"]
and self.cfg.flash_attention is not True
):
collator = V2BatchSamplerDataCollatorForSeq2Seq
else:
collator = BatchSamplerDataCollatorForSeq2Seq
else:
@@ -1015,19 +1083,36 @@ class HFDPOTrainerBuilder(TrainerBuilderBase):
training_args_kwargs[
"dataloader_prefetch_factor"
] = self.cfg.dataloader_prefetch_factor
if self.cfg.gradient_checkpointing:
training_args_kwargs[
"gradient_checkpointing"
] = self.cfg.gradient_checkpointing
if self.cfg.gradient_checkpointing_kwargs is not None:
training_args_kwargs[
"gradient_checkpointing_kwargs"
] = self.cfg.gradient_checkpointing_kwargs
else:
training_args_kwargs["gradient_checkpointing_kwargs"] = {
"use_reentrant": False
}
# set save_strategy and save_steps
if self.cfg.save_steps:
training_args_kwargs["save_strategy"] = "steps"
training_args_kwargs["save_steps"] = self.cfg.save_steps
elif self.cfg.save_strategy:
training_args_kwargs["save_strategy"] = self.cfg.save_strategy
else:
# default to saving each epoch if not defined
training_args_kwargs["save_strategy"] = "epoch"
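The save-strategy precedence added for DPO training is: explicit `save_steps` wins, then a configured `save_strategy`, else save each epoch. As a minimal sketch (function name illustrative):

```python
def resolve_save_strategy(save_steps=None, save_strategy=None):
    """Sketch of the precedence used when building the DPO TrainingArguments."""
    kwargs = {}
    if save_steps:
        # step-based saving takes priority over any configured strategy
        kwargs["save_strategy"] = "steps"
        kwargs["save_steps"] = save_steps
    elif save_strategy:
        kwargs["save_strategy"] = save_strategy
    else:
        # default to saving each epoch if nothing is configured
        kwargs["save_strategy"] = "epoch"
    return kwargs
```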
training_args = TrainingArguments(
per_device_train_batch_size=self.cfg.micro_batch_size,
max_steps=self.cfg.max_steps or total_num_steps,
gradient_accumulation_steps=self.cfg.gradient_accumulation_steps,
learning_rate=self.cfg.learning_rate,
save_strategy="steps",
save_steps=self.cfg.save_steps,
output_dir=self.cfg.output_dir,
warmup_steps=self.cfg.warmup_steps,
gradient_checkpointing=self.cfg.gradient_checkpointing,
gradient_checkpointing_kwargs=self.cfg.gradient_checkpointing_kwargs
or {"use_reentrant": False},
logging_first_step=True,
logging_steps=1,
optim=self.cfg.optimizer,
@@ -1050,7 +1135,11 @@ class HFDPOTrainerBuilder(TrainerBuilderBase):
dpo_trainer_kwargs["eval_dataset"] = self.eval_dataset
if self.cfg.adapter and self.peft_config:
dpo_trainer_kwargs["peft_config"] = self.peft_config
dpo_trainer = DPOTrainer(
if self.cfg.precompute_ref_log_probs is not None:
dpo_trainer_kwargs[
"precompute_ref_log_probs"
] = self.cfg.precompute_ref_log_probs
dpo_trainer = AxolotlDPOTrainer(
self.model,
self.model_ref,
args=training_args,

@@ -31,7 +31,7 @@ class TokenizedPromptDataset(Dataset):
def __init__( # pylint: disable=super-init-not-called
self,
prompt_tokenizer: PromptTokenizingStrategy,
dataset: IterableDataset,
dataset: Dataset,
process_count: Optional[int] = None,
keep_in_memory: Optional[bool] = False,
**kwargs,

@@ -0,0 +1,46 @@
"""monkey patches for the dataset fetcher to handle batches of packed indexes"""
# pylint: disable=protected-access
import torch
from torch.utils.data._utils.fetch import _BaseDatasetFetcher
from torch.utils.data._utils.worker import _worker_loop
class _MapDatasetFetcher(_BaseDatasetFetcher):
def fetch(self, possibly_batched_index):
if isinstance(possibly_batched_index[0], list):
data = [None for i in possibly_batched_index]
for i, possibly_batched_index_ in enumerate(possibly_batched_index):
if self.auto_collation:
if (
hasattr(self.dataset, "__getitems__")
and self.dataset.__getitems__
):
data[i] = self.dataset.__getitems__(possibly_batched_index_)
else:
data[i] = [self.dataset[idx] for idx in possibly_batched_index_]
else:
data[i] = self.dataset[possibly_batched_index_]
else:
if self.auto_collation:
if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__:
data = self.dataset.__getitems__(possibly_batched_index)
else:
data = [self.dataset[idx] for idx in possibly_batched_index]
else:
data = self.dataset[possibly_batched_index]
return self.collate_fn(data)
def patch_fetchers():
torch.utils.data._utils.fetch._MapDatasetFetcher = _MapDatasetFetcher
torch.utils.data.dataloader._utils.fetch._MapDatasetFetcher = _MapDatasetFetcher
def patched_worker_loop(*args, **kwargs):
patch_fetchers()
return _worker_loop(*args, **kwargs)
torch.utils.data._utils.worker._worker_loop = patched_worker_loop
patch_fetchers()
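The patched fetcher's key change is handling a batch that is a *list of index-lists* (one sub-list per packed row) instead of a flat index list. The core branching can be reproduced standalone (helper name illustrative; the real patch also routes through `__getitems__` and `collate_fn`):

```python
def fetch_packed(dataset, possibly_batched_index):
    """Standalone sketch of the patched fetch for nested packed indexes."""
    if isinstance(possibly_batched_index[0], list):
        # a "true batch" of packed rows: fetch each sub-list separately
        return [
            [dataset[idx] for idx in sub_index]
            for sub_index in possibly_batched_index
        ]
    # ordinary flat batch of indexes
    return [dataset[idx] for idx in possibly_batched_index]
```

With a toy dataset `[10, 11, 12, 13]`, the nested form `[[0, 1], [2]]` fetches two packed rows while `[0, 2]` behaves like a normal batch.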

@@ -1,142 +0,0 @@
"""
Patched LlamaAttention to use torch.nn.functional.scaled_dot_product_attention
"""
import warnings
from typing import Optional, Tuple
import torch
import torch.nn.functional as F
import transformers.models.llama.modeling_llama
from transformers.models.llama.modeling_llama import apply_rotary_pos_emb, repeat_kv
def hijack_llama_sdp_attention():
transformers.models.llama.modeling_llama.LlamaAttention.forward = (
sdp_attention_forward
)
def sdp_attention_forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Tuple[torch.Tensor]] = None,
output_attentions: bool = False,
use_cache: bool = False,
padding_mask: Optional[torch.LongTensor] = None, # pylint: disable=unused-argument
**kwargs, # pylint: disable=unused-argument
) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
# pylint: disable=duplicate-code
bsz, q_len, _ = hidden_states.size()
if not hasattr(self, "pretraining_tp"):
self.pretraining_tp = 1
if self.pretraining_tp > 1:
key_value_slicing = (
self.num_key_value_heads * self.head_dim
) // self.pretraining_tp
query_slices = self.q_proj.weight.split(
(self.num_heads * self.head_dim) // self.pretraining_tp, dim=0
)
key_slices = self.k_proj.weight.split(key_value_slicing, dim=0)
value_slices = self.v_proj.weight.split(key_value_slicing, dim=0)
query_states = [
F.linear(hidden_states, query_slices[i]) for i in range(self.pretraining_tp)
]
query_states = torch.cat(query_states, dim=-1)
key_states = [
F.linear(hidden_states, key_slices[i]) for i in range(self.pretraining_tp)
]
key_states = torch.cat(key_states, dim=-1)
value_states = [
F.linear(hidden_states, value_slices[i]) for i in range(self.pretraining_tp)
]
value_states = torch.cat(value_states, dim=-1)
else:
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
query_states = query_states.view(
bsz, q_len, self.num_heads, self.head_dim
).transpose(1, 2)
key_states = key_states.view(
bsz, q_len, self.num_key_value_heads, self.head_dim
).transpose(1, 2)
value_states = value_states.view(
bsz, q_len, self.num_key_value_heads, self.head_dim
).transpose(1, 2)
# [bsz, q_len, nh, hd]
# [bsz, nh, q_len, hd]
kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
kv_seq_len += past_key_value[0].shape[-2]
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
query_states, key_states = apply_rotary_pos_emb(
query_states, key_states, cos, sin, position_ids
)
# [bsz, nh, t, hd]
if past_key_value is not None:
# reuse k, v, self_attention
key_states = torch.cat([past_key_value[0], key_states], dim=2)
value_states = torch.cat([past_key_value[1], value_states], dim=2)
past_key_value = (key_states, value_states) if use_cache else None
# repeat k/v heads if n_kv_heads < n_heads
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
if output_attentions:
warnings.warn(
"Output attentions is not supported for patched `LlamaAttention`, returning `None` instead."
)
#
# sdp-attn start
#
with torch.backends.cuda.sdp_kernel():
attn_output = torch.nn.functional.scaled_dot_product_attention(
query_states,
key_states,
value_states,
attn_mask=attention_mask,
is_causal=False,
)
if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
raise ValueError(
f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
f" {attn_output.size()}"
)
attn_output = attn_output.transpose(1, 2)
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
#
# sdp-attn end
#
if self.pretraining_tp > 1:
attn_output = attn_output.split(self.hidden_size // self.pretraining_tp, dim=2)
o_proj_slices = self.o_proj.weight.split(
self.hidden_size // self.pretraining_tp, dim=1
)
attn_output = sum(
F.linear(attn_output[i], o_proj_slices[i])
for i in range(self.pretraining_tp)
)
else:
attn_output = self.o_proj(attn_output)
return attn_output, None, past_key_value

@@ -5,38 +5,11 @@ from typing import Optional
import torch
from axolotl.monkeypatch.utils import mask_2d_to_4d
def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
"""
Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
This expansion handles packed sequences so that sequences share the same attention mask integer value
when they attend to each other within that sequence.
This expansion transforms the mask to lower triangular form to prevent future peeking.
"""
bsz, src_len = mask.size()
tgt_len = tgt_len if tgt_len is not None else src_len
mask = mask.unsqueeze(1).unsqueeze(2)
mask = mask.expand(bsz, 1, tgt_len, src_len)
# Create a binary mask from the original mask where zeros remain zeros and all other values are set to one
binary_mask = torch.where(
mask != 0,
torch.tensor(1).to(dtype),
torch.tensor(0).to(dtype),
)
# Create a block-diagonal mask.
# we multiply by the binary mask so that 0's in the original mask are correctly excluded
zero_one_mask = torch.eq(mask, mask.transpose(-1, -2)).int() * binary_mask
# Now let's create a lower triangular mask of ones that will zero out the upper triangular part
lower_triangular_ones = torch.tril(torch.ones((tgt_len, src_len), dtype=dtype)).to(
mask.device
)
# Use the lower triangular mask to zero out the upper triangular part of the zero_one_mask
masked_zero_one_mask = zero_one_mask * lower_triangular_ones
masked_zero_one_mask = mask_2d_to_4d(mask, dtype, tgt_len)
inverted_mask = 1.0 - masked_zero_one_mask
return inverted_mask.masked_fill(

@@ -0,0 +1,26 @@
"""
Monkeypatch for transformers' 4D causal attention mask preparation to support packed sequences
"""
from axolotl.monkeypatch.utils import (
patched_prepare_4d_causal_attention_mask,
patched_prepare_4d_causal_attention_mask_for_sdpa,
)
def hijack_llama_prepare_4d_mask():
import transformers.modeling_attn_mask_utils
import transformers.models.llama.modeling_llama
transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_for_sdpa = ( # pylint: disable=protected-access
patched_prepare_4d_causal_attention_mask_for_sdpa
)
transformers.modeling_attn_mask_utils._prepare_4d_causal_attention_mask_for_sdpa = ( # pylint: disable=protected-access
patched_prepare_4d_causal_attention_mask_for_sdpa
)
transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask = ( # pylint: disable=protected-access
patched_prepare_4d_causal_attention_mask
)
transformers.modeling_attn_mask_utils._prepare_4d_causal_attention_mask = ( # pylint: disable=protected-access
patched_prepare_4d_causal_attention_mask
)

@@ -94,7 +94,7 @@ def _prepare_decoder_attention_mask(
sliding_window,
): # pylint: disable=unused-argument
# [bsz, seq_len]
if attention_mask is None:
if attention_mask is None or sliding_window is None:
return attention_mask
# NOTE: attention mask and sliding masks are only broadcastable in certain scenarios.
@@ -151,7 +151,7 @@ def flashattn_forward(
)
use_sliding_windows = (
hasattr(self.config, "sliding_window") is not None
getattr(self.config, "sliding_window") is not None
and kv_seq_len > self.config.sliding_window
)

@@ -4,14 +4,16 @@ import json
import logging
import os.path
import shutil
from functools import partial
from pathlib import Path
from typing import Dict, List, Sequence
from typing import Dict, List, Sequence, Union
import bitsandbytes as bnb
import peft
import safetensors.torch as st
import torch
from huggingface_hub import snapshot_download
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.optim.lr_scheduler import LRScheduler
from torch.optim.optimizer import Optimizer
from transformers import (
@@ -23,23 +25,50 @@ from transformers import (
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
from axolotl.utils.dict import DictDefault
from axolotl.utils.distributed import is_main_process
from axolotl.utils.distributed import barrier, is_main_process
LOG = logging.getLogger("axolotl.relora")
def reset_optimizer(optimizer: torch.optim.Optimizer):
for group in optimizer.param_groups:
for param in group["params"]:
param_state = optimizer.state[param]
for key in param_state:
if "qmap" in key:
continue
if key == "step" and isinstance(param_state[key], int):
param_state[key] = 0
else:
param_state[key] = torch.zeros_like(param_state[key])
@torch.no_grad()
def magnitude_pruning_(tensor, prune_ratio):
tensor_magnitude = torch.abs(tensor)
threshold = torch.quantile(
tensor_magnitude.flatten().to(dtype=torch.float32), prune_ratio
).to(dtype=tensor.dtype)
mask = tensor_magnitude > threshold
tensor.mul_(mask.to(dtype=tensor.dtype))
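Instead of zeroing the optimizer state wholesale on each ReLoRA reset, the new `magnitude_pruning_` keeps only the largest-magnitude entries. A pure-Python sketch of the same idea (the real version uses `torch.quantile` and in-place tensor ops; the nearest-rank threshold here is illustrative):

```python
def magnitude_prune(values, prune_ratio):
    """Zero out roughly the smallest-magnitude `prune_ratio` fraction of values.

    Pure-Python sketch of magnitude_pruning_ for illustration only.
    """
    magnitudes = sorted(abs(v) for v in values)
    # nearest-rank quantile threshold (torch.quantile interpolates instead)
    threshold = magnitudes[min(len(values) - 1, int(prune_ratio * len(values)))]
    # keep values strictly above the threshold, zero the rest
    return [v if abs(v) > threshold else 0.0 for v in values]
```

With `prune_ratio=0.9` (the value hard-coded via `partial` in `reset_optimizer`), only the top ~10% of optimizer moments survive each reset.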
def reset_optimizer(
optimizer: torch.optim.Optimizer,
*,
reset_params: list[str], # where str is the key to a torch.nn.Parameter
optimizer_state_keys: list[str],
):
pruning_fn = partial(magnitude_pruning_, prune_ratio=0.9)
n_zeros = 0
n_total = 0
optimizer_state = optimizer.state
if isinstance(optimizer, ZeroRedundancyOptimizer):
optimizer_state = optimizer.optim.state
for param in reset_params:
param_state = optimizer_state[param]
if len(param_state) == 0:  # no state for this param; happens for the ZeRO optimizer
continue
for key in optimizer_state_keys:
pruning_fn(
param_state[key]
) # pruning fn has to be inplace to keep the same keys in the dict
n_total += param_state[key].numel()
n_zeros += torch.sum(param_state[key] == 0).item()
_zeroed = n_zeros / (1e-7 + n_total) * 100
LOG.info(f"Percent of optimizer states zeroed: {_zeroed:.2f}")
LOG.info(f"absolute number of optimizer states zeroed: {n_zeros}")
class ReLoRACallback(TrainerCallback):
@@ -97,6 +126,25 @@ class ReLoRACallback(TrainerCallback):
"relora",
)
if "adam" in args.optim.lower():
optimizer_state_keys = ["exp_avg", "exp_avg_sq"]
else:
raise ValueError(f"Optimizer {args.optim} not supported with ReLoRA")
lora_params = [
n
for n, p in model.named_parameters()
if p.requires_grad and "lora_" in n
]
model.save_pretrained(
os.path.join(
args.output_dir,
f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}",
"adapter",
),
safe_serialization=True,
)
with torch.no_grad():
merge_and_save(
model,
@@ -107,7 +155,11 @@ class ReLoRACallback(TrainerCallback):
actually_save=is_main_process(),
cpu_offload=self.cpu_offload,
)
reset_optimizer(optimizer)
reset_optimizer(
optimizer,
reset_params=lora_params,
optimizer_state_keys=optimizer_state_keys,
)
if self.quantized:
self.last_full_model = checkpoint_folder
@@ -197,11 +249,13 @@ class ReLoRAScheduler(LRScheduler):
inner_schedule: LRScheduler,
relora_steps: int,
warmup_steps: int,
anneal_steps: int = 1,
min_lr_scale: float = 0.001,
) -> None:
self.inner_schedule = inner_schedule
self.relora_steps = relora_steps
self.warmup_steps = warmup_steps
self.anneal_steps = anneal_steps
self.min_lr_scale = min_lr_scale
super().__init__(optimizer, inner_schedule.last_epoch, inner_schedule.verbose)
@@ -210,10 +264,20 @@ class ReLoRAScheduler(LRScheduler):
original = self.inner_schedule.get_lr()
step = self.last_epoch
if step < self.relora_steps:
scale = 1
else:
cycle_t = min(1.0, (step % self.relora_steps) / self.warmup_steps)
per_relora_progress = step % self.relora_steps
if per_relora_progress < self.warmup_steps:
cycle_t = min(1.0, (per_relora_progress) / self.warmup_steps)
elif per_relora_progress > (self.relora_steps - self.anneal_steps):
cycle_t = min(
1.0,
(self.relora_steps - per_relora_progress) / self.anneal_steps,
)
else:
cycle_t = 1
scale = cycle_t * (1 - self.min_lr_scale) + self.min_lr_scale
if isinstance(original, Sequence):
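The reworked `get_lr` gives each ReLoRA cycle three phases: warm up after the reset, hold at full scale, then anneal back down just before the next reset. The per-step scale can be sketched as a standalone function (signature illustrative):

```python
def relora_lr_scale(step, relora_steps, warmup_steps, anneal_steps, min_lr_scale=0.001):
    """Sketch of the per-cycle LR scale from ReLoRAScheduler.get_lr."""
    if step < relora_steps:
        return 1.0  # first cycle runs the inner schedule untouched
    progress = step % relora_steps
    if progress < warmup_steps:
        cycle_t = progress / warmup_steps        # ramp up after a reset
    elif progress > relora_steps - anneal_steps:
        cycle_t = min(1.0, (relora_steps - progress) / anneal_steps)  # anneal down
    else:
        cycle_t = 1.0                            # steady phase
    return cycle_t * (1 - min_lr_scale) + min_lr_scale
```

With `relora_steps=100`, `warmup_steps=10`, `anneal_steps=5`: step 100 sits at the floor, step 105 is halfway up the warmup, step 150 is at full scale, and step 198 is partway down the anneal.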
@@ -238,7 +302,11 @@ def sharded_paths(path: str, module_names: List[str]) -> Dict[str, str]:
def lora_delta_weight(layer: peft.tuners.lora.LoraLayer, device) -> torch.Tensor:
if isinstance(layer, (peft.tuners.lora.Linear8bitLt, peft.tuners.lora.Linear4bit)):
adapter = layer.active_adapter
adapter: Union[List[str], str] = layer.active_adapter
if isinstance(adapter, list):
if len(adapter) > 1:
raise ValueError("unhandled relora for multiple adapters")
adapter = adapter[0]
return (
peft.utils.transpose(
layer.lora_B[adapter].weight.detach().to(device)
@@ -248,7 +316,7 @@ def lora_delta_weight(layer: peft.tuners.lora.LoraLayer, device) -> torch.Tensor
* layer.scaling[adapter]
)
return layer.get_delta_weight().to(device)
raise ValueError("unhandled lora layer type")
def find_lora_modules(model: peft.LoraModel) -> Dict[str, peft.tuners.lora.LoraLayer]:
@@ -273,9 +341,9 @@ def update_weights(
):
if reinit:
for adapter_name in target.lora_A:
target.reset_lora_parameters(adapter_name)
target.reset_lora_parameters(adapter_name, True)
for adapter_name in target.lora_embedding_A:
target.reset_lora_parameters(adapter_name)
target.reset_lora_parameters(adapter_name, True)
if isinstance(target, peft.tuners.lora.Linear4bit):
# This could be faster, but the quantization of Linear4bit weights occurs
@@ -286,7 +354,9 @@ def update_weights(
target.weight.data = new_weight.cpu()
target.to(device)
elif isinstance(target, peft.tuners.lora.Linear8bitLt):
target.weight = bnb.nn.Int8Params(new_weight, requires_grad=False).to(device)
target.weight.data = (
bnb.nn.Int8Params(new_weight, requires_grad=False).to(device).data
)
else:
target.weight.data = new_weight.to(device)
@@ -304,14 +374,17 @@ def merge_and_save(
if not quantized:
for module_name, target in modules.items():
update = target.get_delta_weight(target.active_adapter).detach()
active_adapter = target.active_adapter
if isinstance(active_adapter, list):
active_adapter = active_adapter[0]
update = target.get_delta_weight(active_adapter).detach()
target.weight.data += update
if reinit:
for adapter_name in target.lora_A:
target.reset_lora_parameters(adapter_name)
target.reset_lora_parameters(adapter_name, True)
for adapter_name in target.lora_embedding_A:
target.reset_lora_parameters(adapter_name)
target.reset_lora_parameters(adapter_name, True)
return
os.makedirs(model_dst, exist_ok=True)
@@ -363,6 +436,7 @@ def merge_and_save(
LOG.info(f"saving tensors to {shard_fn}")
st.save_file(out_tensors, shard_fn, metadata={"format": "pt"})
barrier()
del in_tensors
del out_tensors
torch.cuda.empty_cache()

@@ -1,8 +1,15 @@
"""
Shared utils for the monkeypatches
"""
from typing import Optional
import torch
import torch.nn.functional as F
from transformers.modeling_attn_mask_utils import (
_prepare_4d_causal_attention_mask,
_prepare_4d_causal_attention_mask_for_sdpa,
)
from transformers.utils import is_torch_bf16_gpu_available
@torch.jit.script
@@ -89,7 +96,6 @@ def get_cu_seqlens(attn_mask):
return torch.stack(results).to(dtype=torch.int32), torch.stack(max_seq_lens)
@torch.jit.script
def get_cu_seqlens_from_pos_ids(position_ids):
"""generate a cumulative sequence length mask for flash attention using pos ids"""
if len(position_ids.shape) == 1:
@@ -135,7 +141,18 @@ def get_cu_seqlens_from_pos_ids(position_ids):
results.append(cu_seqlens)
max_seq_lens.append(max_seq_len)
return torch.stack(results).to(dtype=torch.int32), torch.stack(max_seq_lens)
# Find the maximum value across all tensors
max_value = max(t.max() for t in results)
# Find the length of the longest tensor
max_length = max(t.size(0) for t in results)
# Pad each tensor to the same length and collect them in a list
padded_results = [
F.pad(t, (0, max_length - t.size(0)), "constant", max_value) for t in results
]
return torch.stack(padded_results).to(dtype=torch.int32), torch.stack(max_seq_lens)
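The padding added here lets rows with different numbers of packed sequences stack into one tensor. The underlying per-row computation, cumulative sequence boundaries derived from position-id resets, can be illustrated in pure Python (the real helper is `torch.jit.script`'ed and pads across rows):

```python
def cu_seqlens_from_pos_ids(position_ids):
    """Cumulative sequence lengths for one packed row of position ids (sketch).

    A position id of 0 after the start marks the beginning of a new packed
    sequence, so boundaries fall wherever the ids reset.
    """
    cu_seqlens = [0]
    for i, pos in enumerate(position_ids):
        if i > 0 and pos == 0:
            cu_seqlens.append(i)
    cu_seqlens.append(len(position_ids))
    max_seq_len = max(b - a for a, b in zip(cu_seqlens, cu_seqlens[1:]))
    return cu_seqlens, max_seq_len
```

For `[0, 1, 2, 0, 1, 0, 1, 2, 3]` this yields boundaries `[0, 3, 5, 9]` and a max packed length of 4, which is the cu_seqlens format flash attention's varlen kernels expect.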
def set_module_name(model, name, value):
@@ -149,3 +166,62 @@ def set_module_name(model, name, value):
child_name = name
setattr(parent, child_name, value)
def mask_2d_to_4d(
mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None
):
"""
Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
This expansion handles packed sequences so that sequences share the same attention mask integer value
when they attend to each other within that sequence.
This expansion transforms the mask to lower triangular form to prevent future peeking.
"""
bsz, src_len = mask.size()
tgt_len = tgt_len if tgt_len is not None else src_len
mask = mask.unsqueeze(1).unsqueeze(2)
mask = mask.expand(bsz, 1, tgt_len, src_len)
# Create a binary mask from the original mask where zeros remain zeros and all other values are set to one
binary_mask = torch.where(
mask != 0,
torch.tensor(1).to(dtype),
torch.tensor(0).to(dtype),
)
# Create a block-diagonal mask.
# we multiply by the binary mask so that 0's in the original mask are correctly excluded
zero_one_mask = torch.eq(mask, mask.transpose(-1, -2)).int() * binary_mask
# Now let's create a lower triangular mask of ones that will zero out the upper triangular part
lower_triangular_ones = torch.tril(torch.ones((tgt_len, src_len), dtype=dtype)).to(
mask.device
)
# Use the lower triangular mask to zero out the upper triangular part of the zero_one_mask
masked_zero_one_mask = zero_one_mask * lower_triangular_ones
return masked_zero_one_mask
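For a single row, `mask_2d_to_4d` reduces to: token i may attend to token j iff j is not padding, j carries the same sequence id as i, and j is not in the future. A pure-Python sketch of that rule (the real function operates on batched tensors and an arbitrary dtype):

```python
def mask_2d_to_4d_row(mask_row):
    """One-row sketch of mask_2d_to_4d: block-diagonal causal attention
    over packed sequences identified by shared non-zero mask values."""
    n = len(mask_row)
    return [
        [
            1 if (mask_row[j] != 0 and mask_row[i] == mask_row[j] and j <= i) else 0
            for j in range(n)
        ]
        for i in range(n)
    ]
```

With two packed sequences `[1, 1, 2, 2]`, each sequence gets its own lower-triangular block and cannot attend across the boundary.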
def patched_prepare_4d_causal_attention_mask(
attention_mask: Optional[torch.Tensor],
*args,
):
dtype = torch.bfloat16 if is_torch_bf16_gpu_available() else torch.float32
return _prepare_4d_causal_attention_mask(
mask_2d_to_4d(attention_mask, dtype=dtype),
*args,
)
def patched_prepare_4d_causal_attention_mask_for_sdpa(
attention_mask: Optional[torch.Tensor],
*args,
):
dtype = torch.bfloat16 if is_torch_bf16_gpu_available() else torch.float32
return _prepare_4d_causal_attention_mask_for_sdpa(
mask_2d_to_4d(attention_mask, dtype=dtype),
*args,
)

@@ -23,6 +23,31 @@ def argilla(
return transform_fn
def icr(
cfg,
): # pylint: disable=possibly-unused-variable,unused-argument
"""
chatml transforms for datasets with system, input, chosen, rejected
ex. https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs
"""
def transform_fn(sample):
if "system" in sample and sample["system"]:
sample["prompt"] = (
f"<|im_start|>system\n{sample['system']}<|im_end|>\n"
f"<|im_start|>user\n{sample['input']}<|im_end|>\n<|im_start|>assistant\n"
)
else:
sample[
"prompt"
] = f"<|im_start|>user\n{sample['input']}<|im_end|>\n<|im_start|>assistant\n"
sample["chosen"] = f"{sample['chosen']}<|im_end|>"
sample["rejected"] = f"{sample['rejected']}<|im_end|>"
return sample
return transform_fn
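The `icr` transform wraps each DPO sample in chatml markup, with an optional system turn. Its prompt construction can be exercised standalone (function name illustrative; behavior mirrors the `transform_fn` above):

```python
def chatml_dpo_prompt(sample):
    """Sketch of the icr transform's chatml formatting."""
    if sample.get("system"):
        prompt = (
            f"<|im_start|>system\n{sample['system']}<|im_end|>\n"
            f"<|im_start|>user\n{sample['input']}<|im_end|>\n<|im_start|>assistant\n"
        )
    else:
        prompt = f"<|im_start|>user\n{sample['input']}<|im_end|>\n<|im_start|>assistant\n"
    return {
        **sample,
        "prompt": prompt,
        # both completions get the closing chatml token appended
        "chosen": f"{sample['chosen']}<|im_end|>",
        "rejected": f"{sample['rejected']}<|im_end|>",
    }
```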
def intel(cfg): # pylint: disable=possibly-unused-variable,unused-argument
"""
For Intel Orca DPO Pairs

@@ -0,0 +1,33 @@
"""Module containing the InstructShareGPTPromptTokenizingStrategy class"""
from typing import Any, Dict, Optional
from axolotl.prompt_tokenizers import ShareGPTPromptTokenizingStrategy
from axolotl.prompters import ShareGPTPrompterV2
def load(tokenizer, cfg, ds_cfg: Optional[Dict[str, Any]] = None):
conversation = (
ds_cfg["conversation"] if ds_cfg and "conversation" in ds_cfg else None
)
strategy = InstructShareGPTPromptTokenizingStrategy(
# pylint: disable=duplicate-code
ShareGPTPrompterV2(
conversation=conversation,
),
tokenizer,
cfg.train_on_inputs,
cfg.sequence_len,
)
return strategy
class InstructShareGPTPromptTokenizingStrategy(ShareGPTPromptTokenizingStrategy):
"""
basic sharegpt strategy to grab conversations from the sample row
"""
def get_conversation_thread(self, prompt):
return [
{"from": "human", "value": prompt["instruction"]},
{"from": "gpt", "value": prompt["output"]},
]

@@ -0,0 +1,58 @@
"""pretraining prompt strategies"""
from typing import Generator
from transformers import BatchEncoding
from axolotl.prompt_tokenizers import PromptTokenizingStrategy
class PretrainTokenizer:
"""basic tokenization class for pretraining"""
def build_prompt(self, prompt) -> Generator[str, None, None]:
yield prompt
class PretrainTokenizationStrategy(PromptTokenizingStrategy):
"""handles tokenization for pretraining with strides"""
@property
def supports_batched(self):
return True
def __init__(self, *args, max_length=None, **kwargs):
super().__init__(*args, **kwargs)
if max_length:
self.max_length = max_length
def _tokenize(
self, prompt: str, add_eos_token: bool = True, strip_bos_token: bool = False
) -> BatchEncoding:
res = self.tokenizer(
prompt,
truncation=True,
max_length=self.max_length - 1,
add_special_tokens=True,
return_overflowing_tokens=True,
stride=256,
)
res["input_ids"] = [
seq + [self.tokenizer.eos_token_id] for seq in res["input_ids"]
]
res["attention_mask"] = [seq + [1] for seq in res["attention_mask"]]
return res
def tokenize_prompt(self, prompt):
return self._tokenize(prompt["text"])
def load(tokenizer, cfg):
strat = PretrainTokenizationStrategy(
PretrainTokenizer(),
tokenizer,
cfg.train_on_inputs,
cfg.sequence_len,
max_length=cfg.sequence_len * 64,
)
return strat
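The pretraining strategy relies on the tokenizer's `return_overflowing_tokens` with `stride=256`, which splits a long token stream into max-length chunks that overlap the previous chunk by the stride. Roughly, the windowing behaves like this pure-Python sketch (a simplification of the tokenizer's actual overflow handling):

```python
def sliding_windows(token_ids, max_length, stride):
    """Split token_ids into max_length chunks, each overlapping the previous
    chunk by `stride` tokens (rough sketch of return_overflowing_tokens)."""
    windows = []
    step = max_length - stride  # how far each window advances
    for start in range(0, max(1, len(token_ids) - stride), step):
        windows.append(token_ids[start:start + max_length])
    return windows
```

With 10 tokens, a window of 6, and a stride of 2, the two windows share tokens 4 and 5, so no context is lost at chunk boundaries.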

@@ -6,16 +6,19 @@ from fastchat.conversation import Conversation, SeparatorStyle, register_conv_te
from axolotl.prompt_tokenizers import ShareGPTPromptTokenizingStrategy
from axolotl.prompters import ShareGPTPrompterV2
register_conv_template(
Conversation(
name="chatml",
system_template="<|im_start|>system\n{system_message}",
system_message="You are a helpful assistant.",
roles=["<|im_start|>user", "<|im_start|>assistant"],
sep_style=SeparatorStyle.CHATML,
sep="<|im_end|>",
)
)
def register_chatml_template(system_message=None):
system_message = system_message or "You are a helpful assistant."
register_conv_template(
Conversation(
name="chatml",
system_template="<|im_start|>system\n{system_message}",
system_message=system_message,
roles=["<|im_start|>user", "<|im_start|>assistant"],
sep_style=SeparatorStyle.CHATML,
sep="<|im_end|>",
)
)
def load(tokenizer, cfg, ds_cfg: Optional[Dict[str, Any]] = None):

@@ -11,7 +11,6 @@ import torch
import transformers.modelcard
from accelerate.logging import get_logger
from datasets import Dataset
from optimum.bettertransformer import BetterTransformer
from peft import PeftModel
from pkg_resources import get_distribution # type: ignore
from transformers import PreTrainedModel, PreTrainedTokenizer
@@ -24,6 +23,11 @@ from axolotl.utils.freeze import freeze_parameters_except
from axolotl.utils.models import load_model, load_tokenizer
from axolotl.utils.trainer import setup_trainer
try:
from optimum.bettertransformer import BetterTransformer
except ImportError:
BetterTransformer = None
project_root = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
src_dir = os.path.join(project_root, "src")
sys.path.insert(0, src_dir)
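The `try`/`except ImportError` around `optimum.bettertransformer` makes the dependency optional; call sites then guard on the name being non-None (`cfg.flash_optimum and BetterTransformer`). The same pattern in isolation, using a deliberately nonexistent module name for illustration:

```python
try:
    # hypothetical optional dependency; the import is expected to fail here
    from nonexistent_optional_dep import Accelerator
except ImportError:
    Accelerator = None

def maybe_accelerate(model):
    # Only take the optional code path when the import actually succeeded.
    if Accelerator is not None:
        return Accelerator.transform(model)
    return model
```

This keeps the module importable on installs without the optional package, at the cost of a None check at each use site.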
@@ -57,26 +61,6 @@ def train(
eval_dataset = dataset_meta.eval_dataset
total_num_steps = dataset_meta.total_num_steps
# Load the model and tokenizer
msg = "loading model"
if cfg.adapter:
msg += " and peft_config..."
LOG.debug(msg)
model, peft_config = load_model(cfg, tokenizer, inference=cli_args.inference)
model_ref = None
if cfg.rl:
if cfg.adapter and not cfg.rl_adapter_ref_model:
# use built-in trl autounwrap
LOG.debug("Passing model_ref: None to RL trainer")
model_ref = None # explicit setting to None
else:
# load the model again for model_ref/baseline
model_ref, _ = load_model(
cfg, tokenizer, inference=cli_args.inference, reference_model=True
)
safe_serialization = cfg.save_safetensors is True
if cfg.resume_from_checkpoint is None and cfg.auto_resume_from_checkpoints:
possible_checkpoints = [
str(cp) for cp in Path(cfg.output_dir).glob("checkpoint-*")
@@ -92,6 +76,28 @@ def train(
)
resume_from_checkpoint = cfg.resume_from_checkpoint
# Load the model and tokenizer
msg = "loading model"
if cfg.adapter:
msg += " and peft_config..."
LOG.debug(msg)
model, peft_config = load_model(cfg, tokenizer, inference=cli_args.inference)
model.generation_config.do_sample = True
model_ref = None
if cfg.rl:
if cfg.adapter and not cfg.rl_adapter_ref_model:
# use built-in trl autounwrap
LOG.debug("Passing model_ref: None to RL trainer")
model_ref = None # explicit setting to None
else:
# load the model again for model_ref/baseline
model_ref, _ = load_model(
cfg, tokenizer, inference=cli_args.inference, reference_model=True
)
safe_serialization = cfg.save_safetensors is True
if cfg.unfrozen_parameters:
freeze_parameters_except(model, cfg.unfrozen_parameters)
@@ -122,7 +128,7 @@ def train(
if cfg.local_rank == 0:
def terminate_handler(_, __, model):
if cfg.flash_optimum:
if cfg.flash_optimum and BetterTransformer:
model = BetterTransformer.reverse(model)
model.save_pretrained(cfg.output_dir, safe_serialization=safe_serialization)
sys.exit(0)
@@ -147,7 +153,10 @@ def train(
pretrain_hooks(cfg, trainer)
if cfg.flash_optimum:
with torch.backends.cuda.sdp_kernel(
enable_flash=True, enable_math=True, enable_mem_efficient=True
# TODO configure these from the YAML w/ sdp_kernel_kwargs: ...
enable_flash=True,
enable_math=True,
enable_mem_efficient=True,
):
trainer.train(resume_from_checkpoint=resume_from_checkpoint)
else:
@@ -193,7 +202,7 @@ def train(
state_dict=trainer.accelerator.get_state_dict(trainer.model_wrapped),
)
elif cfg.local_rank == 0:
if cfg.flash_optimum:
if cfg.flash_optimum and BetterTransformer:
model = BetterTransformer.reverse(model)
model.save_pretrained(cfg.output_dir, safe_serialization=safe_serialization)


@@ -19,8 +19,9 @@ def chat_templates(user_choice: str):
"""
templates = {
"alpaca": "{% for message in messages %}{% if message['role'] == 'user' %}{{ '### Instruction: ' + message['content'] + '\n\n' }}{% elif message['role'] == 'assistant' %}{{ '### Response: ' + message['content'] + eos_token}}{% endif %}{% endfor %}",
"inst": "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}", # I don't know what this one is called. Used by Mistral/Mixtral.
"chatml": "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
"chatml": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = 'You are a helpful assistant.' %}{% endif %}{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in loop_messages %}{% if loop.index0 == 0 %}{{'<|im_start|>system\n' + system_message + '<|im_end|>\n'}}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
}
if user_choice in templates:

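The updated chatml template hoists an explicit leading system message out of the loop and falls back to a default otherwise. A minimal pure-Python sketch of the same rendering for non-empty conversations (illustrative; the real rendering is the Jinja template above):

```python
def render_chatml(messages, default_system="You are a helpful assistant.",
                  add_generation_prompt=False):
    """Render messages in chatml form: an explicit leading system message
    is used as-is, otherwise the default system message is injected."""
    if messages and messages[0]["role"] == "system":
        system, rest = messages[0]["content"], messages[1:]
    else:
        system, rest = default_system, messages
    out = f"<|im_start|>system\n{system}<|im_end|>\n"
    for m in rest:
        out += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    if add_generation_prompt:
        out += "<|im_start|>assistant\n"
    return out
```

With `add_generation_prompt=True` the output ends with an open assistant turn, ready for generation.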

@@ -132,24 +132,26 @@ class BatchSamplerDataCollatorForSeq2Seq(DataCollatorForSeq2Seq):
"""
def __call__(self, features, return_tensors=None):
chunked_data = {}
for feature in features[0].keys():
if feature == "length":
continue
if feature == "attention_mask":
arrays = [
(1) * np.array(item[feature])
for item in features
if feature in item
]
chunked_data[feature] = np.concatenate(arrays)
else:
arrays = [
np.array(item[feature]) for item in features if feature in item
]
chunked_data[feature] = np.concatenate(arrays)
features = [chunked_data]
return super().__call__(features, return_tensors=return_tensors)
if not isinstance(features[0], list):
features = [features]
out_features = [{} for _ in features]
for i, features_ in enumerate(features):
for feature in features_[0].keys():
if feature == "length":
continue
if feature == "attention_mask":
arrays = [
(1) * np.array(item[feature])
for item in features_
if feature in item
]
out_features[i][feature] = np.concatenate(arrays)
else:
arrays = [
np.array(item[feature]) for item in features_ if feature in item
]
out_features[i][feature] = np.concatenate(arrays)
return super().__call__(out_features, return_tensors=return_tensors)
@dataclass
@@ -159,24 +161,26 @@ class V2BatchSamplerDataCollatorForSeq2Seq(DataCollatorForSeq2Seq):
"""
def __call__(self, features, return_tensors=None):
chunked_data = {}
for feature in features[0].keys():
if feature == "length":
continue
if feature == "attention_mask":
arrays = [
(i + 1) * np.array(item[feature])
for i, item in enumerate(features)
if feature in item
]
chunked_data[feature] = np.concatenate(arrays)
else:
arrays = [
np.array(item[feature]) for item in features if feature in item
]
chunked_data[feature] = np.concatenate(arrays)
features = [chunked_data]
return super().__call__(features, return_tensors=return_tensors)
if not isinstance(features[0], list):
features = [features]
out_features = [{} for _ in features]
for i, features_ in enumerate(features):
for feature in features_[0].keys():
if feature == "length":
continue
if feature == "attention_mask":
arrays = [
(i + 1) * np.array(item[feature])
for i, item in enumerate(features_)
if feature in item
]
out_features[i][feature] = np.concatenate(arrays)
else:
arrays = [
np.array(item[feature]) for item in features_ if feature in item
]
out_features[i][feature] = np.concatenate(arrays)
return super().__call__(out_features, return_tensors=return_tensors)
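The V2 collator above concatenates each packed group's samples into one long row, tagging the i-th sample's `attention_mask` with `(i + 1)` so sample boundaries survive the concatenation. The core transform on one packed group, sketched with plain lists (the real code uses numpy arrays and `np.concatenate`):

```python
def pack_group(features_):
    """Concatenate a list of per-sample feature dicts into one packed row,
    numbering attention_mask entries 1, 2, 3, ... per sample (V2 behavior)."""
    packed = {}
    for key in features_[0]:
        if key == "length":
            continue  # helper column, dropped before collation
        if key == "attention_mask":
            row = []
            for i, item in enumerate(features_):
                row.extend((i + 1) * v for v in item[key])
        else:
            row = [v for item in features_ for v in item[key]]
        packed[key] = row
    return packed

group = [
    {"input_ids": [5, 6], "attention_mask": [1, 1]},
    {"input_ids": [7], "attention_mask": [1]},
]
packed = pack_group(group)
```

Downstream attention patches can then recover per-sample boundaries from the distinct mask values in the packed row.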
@dataclass


@@ -95,7 +95,7 @@ def normalize_config(cfg):
save_steps = 1.0 / (cfg.saves_per_epoch * cfg.num_epochs)
if save_steps < 1.0: # prevent saves on every step
cfg.save_steps = save_steps
if cfg.evals_per_epoch:
if (cfg.val_set_size or cfg.test_datasets) and cfg.evals_per_epoch:
eval_steps = 1.0 / (cfg.evals_per_epoch * cfg.num_epochs)
if eval_steps < 1.0: # prevent evals on every step
cfg.eval_steps = eval_steps
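`saves_per_epoch` and `evals_per_epoch` are converted into the fractional `save_steps`/`eval_steps` values that HF `TrainingArguments` interprets as a fraction of the total run; the `< 1.0` guard skips the degenerate case. The arithmetic in isolation (toy values):

```python
def fractional_steps(per_epoch, num_epochs):
    """Fraction of the full training run between saves/evals; only applied
    when < 1.0, since >= 1.0 would mean triggering on every step."""
    steps = 1.0 / (per_epoch * num_epochs)
    return steps if steps < 1.0 else None
```

For example, 4 saves per epoch over 2 epochs yields a save every 12.5% of total steps.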
@@ -163,6 +163,7 @@ def normalize_config(cfg):
cfg.gradient_checkpointing
and cfg.unfrozen_parameters is None
and cfg.gradient_checkpointing_kwargs is None
and cfg.rl is None
):
cfg.gradient_checkpointing_kwargs = {"use_reentrant": True}
@@ -201,6 +202,20 @@ def validate_config(cfg):
raise ValueError(
"bf16 requested, but AMP is not supported on this GPU. Requires Ampere series or above."
)
if (
# pylint: disable=too-many-boolean-expressions
not (cfg.bf16 or cfg.bfloat16)
and (cfg.fp16 or cfg.float16)
and not cfg.adapter
and not cfg.flash_attention
and cfg.sample_packing
):
LOG.warning(
"Full fine tune w/o FA2 w/ sample packing and fp16/float16 is likely to raise errors. Try LoRA."
)
# ValueError: Attempting to unscale FP16 gradients.
# OR
# RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::Half
if cfg.max_packed_sequence_len:
raise DeprecationWarning("`max_packed_sequence_len` is no longer supported")
@@ -231,9 +246,6 @@ def validate_config(cfg):
"eval_batch_size != micro_batch_size. This can lead to VRAM instability."
)
if cfg.load_4bit:
raise ValueError("cfg.load_4bit parameter has been deprecated")
if cfg.adapter == "qlora":
if cfg.merge_lora:
# can't merge qlora if loaded in 8bit or 4bit
@@ -259,7 +271,8 @@ def validate_config(cfg):
if cfg.flash_attn_fuse_qkv or cfg.flash_attn_fuse_mlp:
raise ValueError("Fused modules are not supported with QLoRA")
if not cfg.load_in_8bit and cfg.adapter == "lora":
loftq = cfg.peft and cfg.peft.loftq_config and cfg.peft.loftq_config.loftq_bits
if not cfg.load_in_8bit and cfg.adapter == "lora" and not loftq:
LOG.warning("We recommend setting `load_in_8bit: true` for LORA finetuning")
if cfg.adapter == "lora" and (cfg.flash_attn_fuse_qkv or cfg.flash_attn_fuse_mlp):
@@ -309,7 +322,7 @@ def validate_config(cfg):
LOG.warning("BetterTransformers probably doesn't work with PEFT adapters")
if cfg.fp16 or cfg.bf16:
raise ValueError("AMP is not supported with BetterTransformer")
if cfg.float16 is not True and cfg.bloat16 is not True:
if cfg.float16 is not True and cfg.bfloat16 is not True:
LOG.warning(
"You should probably set bfloat16 or float16 to true to "
"load the model in float16 for BetterTransformers"
@@ -339,6 +352,11 @@ def validate_config(cfg):
"push_to_hub_model_id is deprecated. Please use hub_model_id instead."
)
if cfg.hub_model_id and not (cfg.save_steps or cfg.saves_per_epoch):
LOG.warning(
"hub_model_id is set without any models being saved. To save a model, set either save_steps or saves_per_epoch."
)
if cfg.gptq and cfg.model_revision:
raise ValueError(
"model_revision is not supported for GPTQ models. "
@@ -346,17 +364,24 @@ def validate_config(cfg):
+ "point to its path, and remove model_revision from the config."
)
if cfg.sample_packing and cfg.sdp_attention:
# incompatible due to bug w/ accelerate causing 0.0 loss when using llama2
raise ValueError(
"sample_packing not compatible with sdp_attention. Use flash_attention"
)
# if cfg.sample_packing and cfg.sdp_attention:
# # incompatible due to bug w/ accelerate causing 0.0 loss when using llama2
# raise ValueError(
# "sample_packing not compatible with sdp_attention. Use flash_attention"
# )
if cfg.sample_packing and cfg.xformers_attention:
raise ValueError(
"sample_packing not compatible with xformers_attention. Use flash_attention"
)
if cfg.sample_packing and cfg.sdp_attention and (cfg.bfloat16 or cfg.bf16):
# https://github.com/pytorch/pytorch/blob/1b03423526536b5f3d35bdfa95ccc6197556cf9b/test/test_transformers.py#L2440-L2450
LOG.warning(
"sample_packing & torch sdpa with bf16 is unsupported and may result in 0.0 loss. "
"This may work on H100s."
)
if cfg.early_stopping_patience:
if not cfg.save_steps or not cfg.eval_steps:
raise ValueError(
@@ -422,7 +447,11 @@ def validate_config(cfg):
"evaluation_strategy and eval_steps mismatch. Please set evaluation_strategy to 'steps' or remove eval_steps."
)
if cfg.val_set_size == 0 and (cfg.eval_steps or cfg.evaluation_strategy):
if (
cfg.val_set_size == 0
and (cfg.eval_steps or cfg.evaluation_strategy)
and not cfg.test_datasets
):
raise ValueError(
"eval_steps and evaluation_strategy are not supported with val_set_size == 0"
)
@@ -484,35 +513,43 @@ def validate_config(cfg):
"`use_reentrant` must be false when used with partially frozen model."
)
if cfg.flash_attention and cfg.deepspeed and Path(cfg.deepspeed).is_file():
if cfg.deepspeed and Path(cfg.deepspeed).is_file():
with open(cfg.deepspeed, encoding="utf-8") as file:
contents = file.read()
deepspeed_cfg: DictDefault = DictDefault(json.loads(contents))
if (
deepspeed_cfg.zero_optimization
and deepspeed_cfg.zero_optimization.stage == 3
):
if not (
(
deepspeed_cfg.bf16
and deepspeed_cfg.bf16.enabled # pylint: disable=no-member
is True
)
or (
deepspeed_cfg.fp16
and deepspeed_cfg.fp16.enabled # pylint: disable=no-member
is True
)
if cfg.flash_attention:
if (
deepspeed_cfg.zero_optimization
and deepspeed_cfg.zero_optimization.stage == 3
):
raise ValueError(
"bf16.enabled or fp16.enabled must be set to true when using ZeRO-3 with flash-attention"
)
if not (
(
deepspeed_cfg.bf16
and deepspeed_cfg.bf16.enabled # pylint: disable=no-member
is True
)
or (
deepspeed_cfg.fp16
and deepspeed_cfg.fp16.enabled # pylint: disable=no-member
is True
)
):
raise ValueError(
"bf16.enabled or fp16.enabled must be set to true when using ZeRO-3 with flash-attention"
)
if "8bit" in cfg.optimizer and deepspeed_cfg.optimizer:
LOG.warning(
f"conflicting optimizer: {cfg.optimizer} used alongside deepspeed optimizer."
)
if cfg.test_datasets and cfg.val_set_size:
raise ValueError(
"non-zero val_set_size should not be used with test_datasets configuration"
)
if cfg.fsdp and "bnb" in cfg.optimizer:
raise ValueError(f"FSDP not compatible with {cfg.optimizer}")
# TODO
# MPT 7b
# https://github.com/facebookresearch/bitsandbytes/issues/25
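The refactor above broadens the DeepSpeed config validation: the file is now always parsed when present, the bf16/fp16-required check only fires for ZeRO-3 with flash attention, and an 8-bit optimizer alongside a DeepSpeed-configured optimizer logs a warning. A standalone sketch of the ZeRO-3 dtype check (config shape assumed from standard DeepSpeed JSON):

```python
import json

def check_zero3_dtype(deepspeed_json_text, flash_attention=True):
    """Raise if ZeRO-3 is configured without bf16 or fp16 enabled while
    flash attention is in use (mirrors the validation above)."""
    cfg = json.loads(deepspeed_json_text)
    zero = cfg.get("zero_optimization", {})
    if flash_attention and zero.get("stage") == 3:
        bf16_ok = cfg.get("bf16", {}).get("enabled") is True
        fp16_ok = cfg.get("fp16", {}).get("enabled") is True
        if not (bf16_ok or fp16_ok):
            raise ValueError(
                "bf16.enabled or fp16.enabled must be set to true "
                "when using ZeRO-3 with flash-attention"
            )
```

Note the check is skipped entirely for lower ZeRO stages or when flash attention is off, matching the new control flow.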


@@ -4,7 +4,7 @@ import hashlib
import logging
from collections import defaultdict
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple, Union
from typing import Any, Callable, Dict, List, Optional, Tuple, Union
import torch
import yaml
@@ -16,6 +16,7 @@ from datasets import (
load_from_disk,
)
from huggingface_hub import hf_hub_download
from huggingface_hub.utils import HFValidationError
from torch.utils.data import RandomSampler
from transformers import PreTrainedTokenizerBase
@@ -87,12 +88,21 @@ def prepare_dataset(cfg, tokenizer):
path = cfg.pretraining_dataset[0]["path"]
name = cfg.pretraining_dataset[0]["name"]
train_dataset = load_pretraining_dataset(
path,
ds_wrapper_partial = functools.partial(
get_dataset_wrapper,
cfg.pretraining_dataset[0],
tokenizer,
cfg,
name=name,
cfg.pretraining_dataset[0]["type"] or "pretrain",
)
train_dataset = wrap_pretraining_dataset(
load_dataset(path, streaming=True, split="train", name=name),
tokenizer,
cfg,
ds_wrapper_partial,
max_tokens=cfg.sequence_len,
batch_size=cfg.micro_batch_size,
seed=cfg.seed or 42,
)
# https://discuss.huggingface.co/t/how-to-use-huggingface-trainer-streaming-datasets-without-wrapping-it-with-torchdatas-iterablewrapper/25230
@@ -139,7 +149,7 @@ def load_tokenized_prepared_datasets(
+ "|".join(
sorted(
[
f"{d.path}:{d.type}:{d.shards}:{d.conversation}"
f"{d.path}:{d.type}:{d.shards}:{d.conversation}{d.split}"
for d in cfg_datasets
]
)
@@ -213,7 +223,7 @@ def load_tokenized_prepared_datasets(
token=use_auth_token,
)
ds_from_hub = True
except (FileNotFoundError, ConnectionError):
except (FileNotFoundError, ConnectionError, HFValidationError):
pass
ds_from_cloud = False
@@ -382,9 +392,9 @@ def load_tokenized_prepared_datasets(
dataset_wrapper, dataset_prompter = get_dataset_wrapper(
config_dataset=config_dataset,
dataset=ds,
tokenizer=tokenizer,
cfg=cfg,
dataset=ds,
d_base_type=d_base_type,
d_prompt_style=d_prompt_style,
)
@@ -439,7 +449,7 @@ def load_prepare_datasets(
split="train",
) -> Tuple[Dataset, Dataset, List[Prompter]]:
dataset, prompters = load_tokenized_prepared_datasets(
tokenizer, cfg, default_dataset_prepared_path
tokenizer, cfg, default_dataset_prepared_path, split=split
)
if cfg.dataset_shard_num and cfg.dataset_shard_idx is not None:
@@ -495,7 +505,12 @@ def load_prepare_datasets(
def get_dataset_wrapper(
config_dataset, dataset, tokenizer, cfg, d_base_type, d_prompt_style
config_dataset,
tokenizer,
cfg,
d_base_type,
dataset,
d_prompt_style=None,
):
dataset_wrapper = None
dataset_prompter = None
@@ -506,7 +521,8 @@ def get_dataset_wrapper(
}
if (
"input_ids" in dataset.features
isinstance(dataset, Dataset)
and "input_ids" in dataset.features
and "attention_mask" in dataset.features
and "labels" in dataset.features
):
@@ -764,76 +780,67 @@ def encode_pretraining(
return ret
def load_pretraining_dataset(path, tokenizer, cfg, name=None, max_tokens=2048, seed=42):
def wrap_pretraining_dataset(
dataset,
tokenizer,
cfg,
ds_wrapper_fn,
max_tokens=2048,
batch_size=1,
seed=42,
buffer_size=10_000,
):
if cfg.sample_packing:
collate_fn = PretrainingBatchSamplerDataCollatorForSeq2Seq(
tokenizer,
return_tensors="pt",
padding=True,
pad_to_multiple_of=max_tokens * cfg.micro_batch_size,
pad_to_multiple_of=max_tokens * batch_size,
)
encode = functools.partial(
encode_packed_pretraining,
tokenizer,
collate_fn,
ds_wrapper_fn,
max_seq_length=max_tokens,
batch_size=cfg.micro_batch_size,
batch_size=batch_size,
)
# set this to 1 so downstream data_loader doesn't try to increase the batch again
cfg.micro_batch_size = 1
else:
encode = functools.partial(encode_pretraining, tokenizer, max_tokens)
dataset = load_dataset(path, streaming=True, split="train", name=name)
dataset = dataset.shuffle(seed=seed, buffer_size=10_000)
dataset = dataset.shuffle(seed=seed, buffer_size=buffer_size)
dataset = dataset.map(
encode,
batched=True,
batch_size=10_000,
input_columns="text",
batch_size=buffer_size,
# input_columns="text",
# remove all the existing columns after mapping since they end up having
# a different length than the encoded/tokenized column
remove_columns=dataset.features.keys(),
desc="Encoding Pretraining",
)
return dataset
def encode_packed_pretraining(
tokenizer: PreTrainedTokenizerBase,
collate_fn,
examples: List[str],
ds_wrapper: Callable,
examples: Dict[str, List],
max_seq_length: int = 2048,
batch_size: int = 4,
) -> Dict[str, List]:
# pylint: disable=duplicate-code
# tokenize all the examples
# rows get split with stride (overlap)
res = tokenizer(
examples,
truncation=True,
max_length=max_seq_length - 1,
add_special_tokens=True,
return_overflowing_tokens=True,
stride=256,
)
train_dataset = ds_wrapper(Dataset.from_dict(examples))[0]
input_ids = [seq + [tokenizer.eos_token_id] for seq in res["input_ids"]]
attention_mask = [seq + [1] for seq in res["attention_mask"]]
tokenized_examples = {
"input_ids": input_ids,
"attention_mask": attention_mask,
}
train_dataset = Dataset.from_dict(tokenized_examples)
train_dataset = process_pretraining_datasets_for_packing(
train_dataset, max_seq_length
)
sampler = MultipackBatchSampler(
RandomSampler(train_dataset),
batch_size=batch_size,
batch_size=1,
drop_last=True,
batch_max_len=batch_size * max_seq_length,
lengths=get_dataset_lengths(train_dataset),
@@ -841,15 +848,23 @@ def encode_packed_pretraining(
chunked_data = defaultdict(list)
for data in sampler:
features = train_dataset[data]
features["labels"] = features["input_ids"].copy()
collated_features = collate_fn(features)
for batch in sampler:
for data in batch:
features = train_dataset[data]
if "num_truncated_tokens" in features:
del features["num_truncated_tokens"]
if "num_truncated_tokens" in features:
del features["num_truncated_tokens"]
if "overflow_to_sample_mapping" in features:
del features["overflow_to_sample_mapping"]
if "labels" not in features:
features["labels"] = features["input_ids"].copy()
collated_features = collate_fn(features)
for feature in features.keys():
if feature == "length":
continue
chunked_data[feature].append(collated_features[feature].squeeze(0))
for feature in features.keys():
if feature == "length":
continue
chunked_data[feature].append(collated_features[feature].squeeze(0))
return chunked_data
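Before collating each packed example, the loop above drops the bookkeeping columns the tokenizer added (`num_truncated_tokens`, `overflow_to_sample_mapping`) and defaults `labels` to a copy of `input_ids`. That cleanup step in isolation (a sketch on a plain dict; the real code operates on dataset rows):

```python
def prepare_features(features):
    """Drop tokenizer bookkeeping columns and default labels to input_ids."""
    for key in ("num_truncated_tokens", "overflow_to_sample_mapping"):
        features.pop(key, None)
    if "labels" not in features:
        features["labels"] = features["input_ids"].copy()
    return features

feats = prepare_features(
    {"input_ids": [1, 2, 3], "overflow_to_sample_mapping": [0, 0, 0]}
)
```

Copying (rather than aliasing) `input_ids` matters because the collator may later mask label positions independently.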


@@ -8,8 +8,13 @@ import addict
import bitsandbytes as bnb
import torch
import transformers
from optimum.bettertransformer import BetterTransformer
from peft import PeftConfig, prepare_model_for_kbit_training
from peft import (
LoftQConfig,
PeftConfig,
PeftModel,
PeftModelForCausalLM,
prepare_model_for_kbit_training,
)
from peft.tuners.lora import QuantLinear
from transformers import ( # noqa: F401
AddedToken,
@@ -67,7 +72,7 @@ def check_model_config(cfg: DictDefault, model_config: Union[AutoConfig, DictDef
):
lora_modules_to_save = ", ".join(map(lambda x: f"`{x}`", lora_modules_to_save))
raise ValueError(
f"`lora_modules_to_save` not properly set when adding new tokens. Please include {lora_modules_to_save} in `lora_modules_to_save`."
f"`lora_modules_to_save` not properly set when adding new tokens. Please include [{lora_modules_to_save}] in `lora_modules_to_save`."
)
@@ -161,15 +166,20 @@ def load_tokenizer(cfg):
if getattr(tokenizer, attr_name) is None:
setattr(tokenizer, attr_name, "<|endoftext|>")
additional_special_tokens = None
if cfg.special_tokens:
special_tokens = cfg.special_tokens.to_dict()
additional_special_tokens = special_tokens.pop(
"additional_special_tokens", None
)
lora_modules_to_save = get_linear_embedding_layers(model_config.model_type)
for k, val in cfg.special_tokens.items():
for k, val in special_tokens.items():
# check if new special token is not already in tokenizer and
# is adapter training to make sure lora_modules_to_save is set
# pylint: disable=too-many-boolean-expressions
if (
(getattr(tokenizer, k) is None or getattr(tokenizer, k) != val)
and (len(tokenizer.encode(val)) > 1)
and (len(tokenizer.encode(val, add_special_tokens=False)) > 2)
and cfg.adapter
and (
not cfg.lora_modules_to_save
@@ -182,7 +192,7 @@ def load_tokenizer(cfg):
[f"`{x}`" for x in lora_modules_to_save]
)
raise ValueError(
f"Please set lora_modules_to_save to {lora_modules_to_save} when using an adapter and changing the special tokens."
f"Please set lora_modules_to_save to [{lora_modules_to_save}] when using an adapter and changing the special tokens."
)
tokenizer.add_special_tokens(
@@ -213,13 +223,34 @@ def load_tokenizer(cfg):
]
)
# Additional special tokens are a List, and need to be treated differently than regular special
# tokens. We add them after we have called `add_tokens` in case these additional special tokens
# are new tokens.
#
# Usage:
#
# ```py
# special_tokens:
# additional_special_tokens: ["<|im_start|>", "<|im_end|>"]
# ```
if additional_special_tokens is not None:
tokenizer.add_special_tokens(
{"additional_special_tokens": additional_special_tokens}
)
LOG.debug(f"EOS: {tokenizer.eos_token_id} / {tokenizer.eos_token}")
LOG.debug(f"BOS: {tokenizer.bos_token_id} / {tokenizer.bos_token}")
LOG.debug(f"PAD: {tokenizer.pad_token_id} / {tokenizer.pad_token}")
LOG.debug(f"UNK: {tokenizer.unk_token_id} / {tokenizer.unk_token}")
if cfg.chat_template:
tokenizer.chat_template = chat_templates(cfg.chat_template)
chat_template_string = chat_templates(cfg.chat_template)
if cfg.default_system_message and cfg.chat_template == "chatml":
chat_template_string = chat_template_string.replace(
"You are a helpful assistant.", cfg.default_system_message
)
tokenizer.chat_template = chat_template_string
else:
LOG.info(
"No Chat template selected. Consider adding a chat template for easier inference."
@@ -298,13 +329,13 @@ def load_model(
LOG.info("patching with xformers attention")
hijack_llama_attention()
elif cfg.sdp_attention:
from axolotl.monkeypatch.llama_attn_hijack_sdp import (
hijack_llama_sdp_attention,
elif cfg.sample_packing:
from axolotl.monkeypatch.llama_patch_multipack import (
hijack_llama_prepare_4d_mask,
)
LOG.info("patching with sdp attention")
hijack_llama_sdp_attention()
LOG.info("patching llama _prepare_4d_causal_attention_mask*")
hijack_llama_prepare_4d_mask()
elif cfg.s2_attention:
raise NotImplementedError(
"Shifted-sparse attention not currently implemented without flash attention."
@@ -447,6 +478,18 @@ def load_model(
**bnb_config,
)
if cfg.load_in_8bit and cfg.adapter is not None:
model_kwargs["load_in_8bit"] = True
if cfg.load_in_4bit and cfg.adapter is not None:
model_kwargs["load_in_4bit"] = True
# no longer needed per https://github.com/huggingface/transformers/pull/26610
if "quantization_config" in model_kwargs or cfg.gptq:
if "load_in_8bit" in model_kwargs:
del model_kwargs["load_in_8bit"]
if "load_in_4bit" in model_kwargs:
del model_kwargs["load_in_4bit"]
# sample packing uses custom FA2 patch
if cfg.flash_attention:
if not cfg.sample_packing:
@@ -468,6 +511,12 @@ def load_model(
model_config._attn_implementation = ( # pylint: disable=protected-access
"eager"
)
elif cfg.sdp_attention:
model_kwargs["attn_implementation"] = "sdpa"
model_config._attn_implementation = "sdpa" # pylint: disable=protected-access
elif cfg.eager_attention:
model_kwargs["attn_implementation"] = "eager"
model_config._attn_implementation = "eager" # pylint: disable=protected-access
try:
if (
@@ -480,8 +529,6 @@ def load_model(
model = LlamaForCausalLM.from_pretrained(
base_model,
config=model_config,
load_in_8bit=cfg.load_in_8bit and cfg.adapter is not None,
load_in_4bit=cfg.load_in_4bit and cfg.adapter is not None,
**model_kwargs,
)
@@ -549,8 +596,6 @@ def load_model(
model = getattr(transformers, model_type).from_pretrained(
base_model,
config=model_config,
load_in_8bit=cfg.load_in_8bit and cfg.adapter is not None,
load_in_4bit=cfg.load_in_4bit and cfg.adapter is not None,
trust_remote_code=cfg.trust_remote_code or False,
**model_kwargs,
)
@@ -582,8 +627,6 @@ def load_model(
model = AutoModelForCausalLM.from_pretrained(
base_model,
config=model_config,
load_in_8bit=cfg.load_in_8bit and cfg.adapter is not None,
load_in_4bit=cfg.load_in_4bit and cfg.adapter is not None,
trust_remote_code=cfg.trust_remote_code or False,
**model_kwargs,
)
@@ -591,6 +634,9 @@ def load_model(
LOG.exception(err)
raise err
if isinstance(model, (PeftModel, PeftModelForCausalLM)):
model = model.merge_and_unload()
embeddings_len = (
math.ceil(len(tokenizer) / 32) * 32
if cfg.resize_token_embeddings_to_32x
@@ -636,21 +682,25 @@ def load_model(
# make sure these are fp32 per Ramesh et al. (2021)
embedding_modules = get_linear_embedding_layers(cfg.model_config_type)
for name, module in model.named_modules():
if any(m in name for m in ["norm", "gate"]):
module.to(torch.float32)
if model_config.model_type == "btlm":
# don't upcast lm_head for btlm
continue
if any(m in name for m in embedding_modules):
if hasattr(module, "weight"):
if not cfg.fsdp:
# FSDP doesn't like mixed Float and BFloat16
for name, module in model.named_modules():
if "norm" in name or name.endswith(".gate"):
module.to(torch.float32)
if model_config.model_type == "btlm":
# don't upcast lm_head for btlm
continue
if any(m in name for m in embedding_modules):
if hasattr(module, "weight"):
module.to(torch.float32)
needs_fa2_dtype = cfg.adapter or cfg.fsdp
skip_prepare_model_for_kbit_training = False
if cfg.model_config_type == "mixtral" and is_deepspeed_zero3_enabled():
from deepspeed.utils import set_z3_leaf_modules
from deepspeed.utils import ( # pylint: disable=no-name-in-module
set_z3_leaf_modules,
)
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock
set_z3_leaf_modules(model, [MixtralSparseMoeBlock])
@@ -659,13 +709,17 @@ def load_model(
# Qwen doesn't play nicely with LoRA if this is enabled
skip_prepare_model_for_kbit_training = True
if (cfg.adapter == "lora" and load_in_8bit) or (
cfg.adapter == "qlora" and cfg.load_in_4bit
):
LOG.info("converting PEFT model w/ prepare_model_for_kbit_training")
loftq_bits = cfg.peft and cfg.peft.loftq_config and cfg.peft.loftq_config.loftq_bits
if cfg.adapter == "lora" and loftq_bits:
skip_prepare_model_for_kbit_training = True
if cfg.adapter in ["lora", "qlora"]:
if cfg.gradient_checkpointing:
model.gradient_checkpointing_enable()
if not skip_prepare_model_for_kbit_training:
if (
cfg.load_in_8bit or cfg.load_in_4bit
) and not skip_prepare_model_for_kbit_training:
LOG.info("converting PEFT model w/ prepare_model_for_kbit_training")
model = prepare_model_for_kbit_training(
model, use_gradient_checkpointing=cfg.gradient_checkpointing
)
@@ -692,6 +746,7 @@ def load_model(
model, lora_config = load_adapter(model, cfg, cfg.adapter)
if cfg.ddp and not load_in_8bit and not (cfg.rl and cfg.load_in_4bit):
# TODO revalidate this conditional
model.to(f"cuda:{cfg.local_rank}")
if torch.cuda.device_count() > 1 and int(os.getenv("WORLD_SIZE", "1")) == 1:
@@ -708,6 +763,8 @@ def load_model(
model.config.use_cache = False
if cfg.flash_optimum:
from optimum.bettertransformer import BetterTransformer
model = BetterTransformer.transform(model)
if cfg.adapter is not None:
@@ -734,7 +791,7 @@ def load_adapter(model, cfg, adapter, inference=False):
def load_llama_adapter(model, cfg):
# type: (PreTrainedModel, DictDefault) -> Tuple[PreTrainedModel, Optional[PeftConfig]]
from peft import AdaptionPromptConfig, PeftModel, get_peft_model
from peft import AdaptionPromptConfig, get_peft_model
peft_config = AdaptionPromptConfig(
adapter_layers=cfg.peft_adapter.layers, # layers (L)
@@ -743,7 +800,7 @@ def load_llama_adapter(model, cfg):
)
if cfg.lora_model_dir:
LOG.debug("Loading pretained PEFT - llama_adapter")
LOG.debug("Loading pretrained PEFT - llama_adapter")
model = PeftModel.from_pretrained(
model,
cfg.lora_model_dir,
@@ -780,7 +837,7 @@ def find_all_linear_names(model):
def load_lora(model, cfg, inference=False, config_only=False):
# type: (PreTrainedModel, DictDefault, bool, bool) -> Tuple[Optional[PreTrainedModel], Optional[PeftConfig]]
from peft import LoraConfig, PeftModel, get_peft_model
from peft import LoraConfig, get_peft_model
lora_target_modules = list(cfg.lora_target_modules or [])
@@ -789,6 +846,12 @@ def load_lora(model, cfg, inference=False, config_only=False):
LOG.info(f"found linear modules: {repr(linear_names)}")
lora_target_modules = list(set(lora_target_modules + linear_names))
lora_config_kwargs = {}
loftq_bits = cfg.peft and cfg.peft.loftq_config and cfg.peft.loftq_config.loftq_bits
if loftq_bits:
lora_config_kwargs["loftq_config"] = LoftQConfig(loftq_bits=loftq_bits)
lora_config_kwargs["init_lora_weights"] = "loftq"
lora_config = LoraConfig(
r=cfg.lora_r,
lora_alpha=cfg.lora_alpha,
@@ -799,13 +862,14 @@ def load_lora(model, cfg, inference=False, config_only=False):
modules_to_save=cfg.lora_modules_to_save if cfg.lora_modules_to_save else None,
bias="none",
task_type="CAUSAL_LM",
**lora_config_kwargs,
)
if config_only:
return None, lora_config
if cfg.lora_model_dir:
LOG.debug("Loading pretained PEFT - LoRA")
LOG.debug("Loading pretrained PEFT - LoRA")
model_kwargs: Any = {}
if cfg.lora_on_cpu:
model_kwargs["max_memory"] = {"cpu": "256GiB"}


@@ -117,7 +117,7 @@ class MultipackBatchSampler(BatchSampler):
packing_efficiency_estimate: float = 1.0,
):
super().__init__(sampler, batch_size, drop_last)
self.batch_size = None
self.batch_size = batch_size
self.batch_max_len = batch_max_len
self.lengths: np.ndarray = lengths
self.packing_efficiency_estimate = packing_efficiency_estimate or 1.0
@@ -147,7 +147,13 @@ class MultipackBatchSampler(BatchSampler):
n=1,
)
batches = [[indices[b_idx] for b_idx in batch] for batch in batches]
batches = [
[
[indices[b_idx] for b_idx in batch]
for batch in batches[i : i + self.batch_size]
]
for i in range(0, len(batches), self.batch_size)
]
# statistics
if set_stats:
@@ -189,7 +195,7 @@ class MultipackBatchSampler(BatchSampler):
0.99
* lengths_sum_per_device
/ self.packing_efficiency_estimate
// self.batch_max_len
// (self.batch_max_len * self.batch_size)
)
- 1
),
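The sampler change above groups the flat list of packs into true batches of `batch_size` packs each, and divides the length estimate by `batch_size` to match. The grouping itself is plain stride-slicing (sketch with toy data; the real code slices lists of index lists):

```python
def group_packs(packs, batch_size):
    """Group a flat list of packs into batches of batch_size packs each;
    a short final batch is kept here (the sampler drops it via drop_last)."""
    return [packs[i : i + batch_size] for i in range(0, len(packs), batch_size)]

batches = group_packs([[0, 1], [2], [3, 4], [5]], batch_size=2)
```

Each yielded batch is now a list of packs, which is why the collator gained its `isinstance(features[0], list)` branch.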


@@ -237,11 +237,17 @@ def calculate_total_num_steps(cfg, train_dataset, update=True):
main_process_only=True,
)
else:
if cfg.flash_attention:
batch_size = 1
batch_max_len = cfg.micro_batch_size * cfg.sequence_len
else:
batch_size = cfg.micro_batch_size
batch_max_len = cfg.sequence_len
sampler = MultipackBatchSampler(
sampler=RandomSampler(train_dataset),
batch_size=cfg.micro_batch_size,
batch_size=batch_size,
drop_last=True,
batch_max_len=cfg.micro_batch_size * cfg.sequence_len,
batch_max_len=batch_max_len,
lengths=get_dataset_lengths(train_dataset),
)
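With flash attention, multipack packs up to `micro_batch_size * sequence_len` tokens into a single row (so `batch_size=1`); without it, the sampler batches `micro_batch_size` rows of at most `sequence_len` tokens each. The branch in isolation:

```python
def multipack_batch_params(micro_batch_size, sequence_len, flash_attention):
    """Return (batch_size, batch_max_len) for MultipackBatchSampler,
    mirroring the conditional in calculate_total_num_steps above."""
    if flash_attention:
        # packing into one long row; FA2 handles the packed boundaries
        return 1, micro_batch_size * sequence_len
    return micro_batch_size, sequence_len
```

Either way the token budget per device step is `micro_batch_size * sequence_len`; only its shape (one long row vs. several short rows) changes.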
@@ -249,7 +255,7 @@ def calculate_total_num_steps(cfg, train_dataset, update=True):
train_dataset.remove_columns(["length"]),
batch_sampler=sampler,
)
data_loader_len = len(data_loader)
data_loader_len = len(data_loader) // batch_size
actual_eff = sampler.efficiency()
LOG.debug(f"data_loader_len: {data_loader_len}", main_process_only=True)
# FIXME: is there a bug here somewhere? the total num steps depends
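The second hunk divides the heuristic step estimate by the per-step token budget of a full batch, `batch_max_len * batch_size`, instead of a single bin. Written out as plain arithmetic (a sketch of the expression in the hunk, with example numbers):

```python
def estimate_total_steps(lengths_sum_per_device, packing_efficiency_estimate,
                         batch_max_len, batch_size):
    # Heuristic from calculate_total_num_steps: assume ~99% of tokens pack
    # cleanly, scale by the packing-efficiency estimate, then floor-divide by
    # the tokens consumed per optimizer step (batch_size bins of batch_max_len).
    return int(
        0.99 * lengths_sum_per_device / packing_efficiency_estimate
        // (batch_max_len * batch_size)
    ) - 1

# e.g. 1,000,000 tokens on this device, 4096-token bins, 2 bins per batch,
# estimated 95% packing efficiency:
steps = estimate_total_steps(1_000_000, 0.95, 4096, 2)
# -> 126
```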


@@ -0,0 +1,114 @@
"""
E2E tests for multipack fft llama using 4d attention masks
"""
import logging
import os
import unittest
from pathlib import Path
from axolotl.cli import load_datasets
from axolotl.common.cli import TrainerCliArgs
from axolotl.train import train
from axolotl.utils.config import normalize_config
from axolotl.utils.dict import DictDefault
from ..utils import require_torch_2_1_1, with_temp_dir
LOG = logging.getLogger("axolotl.tests.e2e")
os.environ["WANDB_DISABLED"] = "true"
class Test4dMultipackLlama(unittest.TestCase):
"""
Test case for Llama models using 4d attention with multipack
"""
@require_torch_2_1_1
@with_temp_dir
def test_sdp_lora_packing(self, temp_dir):
# pylint: disable=duplicate-code
cfg = DictDefault(
{
"base_model": "JackFram/llama-68m",
"flash_attention": False,
"sdp_attention": True,
"sample_packing": True,
"pad_to_sequence_len": True,
"load_in_8bit": True,
"adapter": "lora",
"lora_r": 32,
"lora_alpha": 16,
"lora_dropout": 0.05,
"lora_target_linear": True,
"sequence_len": 1024,
"val_set_size": 0.1,
"datasets": [
{
"path": "mhenrichsen/alpaca_2k_test",
"type": "alpaca",
},
],
"num_epochs": 2,
"micro_batch_size": 2,
"gradient_accumulation_steps": 1,
"output_dir": temp_dir,
"learning_rate": 0.00001,
"optimizer": "adamw_torch",
"lr_scheduler": "cosine",
"max_steps": 20,
"save_steps": 10,
"eval_steps": 10,
"fp16": True,
}
)
normalize_config(cfg)
cli_args = TrainerCliArgs()
dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
assert (Path(temp_dir) / "adapter_model.bin").exists()
@with_temp_dir
def test_torch_lora_packing(self, temp_dir):
# pylint: disable=duplicate-code
cfg = DictDefault(
{
"base_model": "JackFram/llama-68m",
"flash_attention": False,
"sdp_attention": False,
"sample_packing": True,
"pad_to_sequence_len": True,
"sequence_len": 1024,
"load_in_8bit": True,
"adapter": "lora",
"lora_r": 32,
"lora_alpha": 16,
"lora_dropout": 0.05,
"lora_target_linear": True,
"val_set_size": 0.1,
"datasets": [
{
"path": "mhenrichsen/alpaca_2k_test",
"type": "alpaca",
},
],
"num_epochs": 2,
"micro_batch_size": 2,
"gradient_accumulation_steps": 1,
"output_dir": temp_dir,
"learning_rate": 0.00001,
"optimizer": "adamw_torch",
"lr_scheduler": "cosine",
"max_steps": 20,
"save_steps": 10,
"eval_steps": 10,
"fp16": True,
}
)
normalize_config(cfg)
cli_args = TrainerCliArgs()
dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
assert (Path(temp_dir) / "adapter_model.bin").exists()


@@ -33,6 +33,7 @@ class TestFusedLlama(unittest.TestCase):
{
"base_model": "JackFram/llama-68m",
"flash_attention": True,
"pad_to_sequence_len": True,
"flash_attn_fuse_qkv": True,
"flash_attn_fuse_mlp": True,
"sample_packing": True,


@@ -7,8 +7,6 @@ import os
import unittest
from pathlib import Path
from transformers.utils import is_torch_bf16_gpu_available
from axolotl.cli import load_datasets
from axolotl.common.cli import TrainerCliArgs
from axolotl.train import train
@@ -63,6 +61,7 @@ class TestMistral(unittest.TestCase):
"max_steps": 20,
"save_steps": 10,
"eval_steps": 10,
"bf16": "auto",
}
)
normalize_config(cfg)
@@ -103,12 +102,9 @@ class TestMistral(unittest.TestCase):
"max_steps": 20,
"save_steps": 10,
"eval_steps": 10,
"bf16": "auto",
}
)
if is_torch_bf16_gpu_available():
cfg.bf16 = True
else:
cfg.fp16 = True
normalize_config(cfg)
cli_args = TrainerCliArgs()
dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
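This hunk replaces the test's explicit `is_torch_bf16_gpu_available()` branching with `"bf16": "auto"`, deferring the choice to config normalization. A hypothetical sketch of what such an "auto" resolution could look like (the real logic lives in axolotl's `normalize_config`; the function and parameter names here are illustrative):

```python
def resolve_precision(bf16_setting, bf16_gpu_available):
    # Hypothetical resolution of a "bf16: auto" config value: prefer bf16
    # when the GPU supports it, otherwise fall back to fp16 — the same
    # branching the test previously did inline.
    if bf16_setting == "auto":
        return "bf16" if bf16_gpu_available else "fp16"
    return "bf16" if bf16_setting else "fp16"

assert resolve_precision("auto", True) == "bf16"
```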


@@ -0,0 +1,68 @@
"""
E2E tests for relora llama
"""
import logging
import os
import unittest
from pathlib import Path
from axolotl.cli import load_datasets
from axolotl.common.cli import TrainerCliArgs
from axolotl.train import train
from axolotl.utils.config import normalize_config
from axolotl.utils.dict import DictDefault
from .utils import with_temp_dir
LOG = logging.getLogger("axolotl.tests.e2e")
os.environ["WANDB_DISABLED"] = "true"
class TestReLoraLlama(unittest.TestCase):
"""
Test case for Llama models using LoRA
"""
@with_temp_dir
def test_relora(self, temp_dir):
# pylint: disable=duplicate-code
cfg = DictDefault(
{
"base_model": "JackFram/llama-68m",
"tokenizer_type": "LlamaTokenizer",
"sequence_len": 1024,
"load_in_8bit": True,
"adapter": "lora",
"lora_r": 32,
"lora_alpha": 16,
"lora_dropout": 0.05,
"lora_target_modules": ["q_proj", "v_proj"],
"relora_steps": 25,
"relora_warmup_steps": 5,
"relora_anneal_steps": 5,
"relora_cpu_offload": True,
"val_set_size": 0.0,
"special_tokens": {},
"datasets": [
{
"path": "mhenrichsen/alpaca_2k_test",
"type": "alpaca",
},
],
"warmup_steps": 15,
"num_epochs": 2,
"micro_batch_size": 4,
"gradient_accumulation_steps": 1,
"output_dir": temp_dir,
"learning_rate": 0.00001,
"optimizer": "adamw_torch",
"lr_scheduler": "cosine",
}
)
normalize_config(cfg)
cli_args = TrainerCliArgs()
dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
assert (Path(temp_dir) / "model.safetensors").exists()


@@ -4,7 +4,9 @@ helper utils for tests
import os
import shutil
import tempfile
import unittest
from functools import wraps
from importlib.metadata import version
from pathlib import Path
@@ -31,3 +33,15 @@ def most_recent_subdir(path):
subdir = max(subdirectories, key=os.path.getctime)
return subdir
def require_torch_2_1_1(test_case):
"""
Decorator marking a test that requires torch >= 2.1.1
"""
def is_min_2_1_1():
torch_version = version("torch")
return torch_version >= "2.1.1"
return unittest.skipUnless(is_min_2_1_1(), "test torch 2.1.1")(test_case)
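Note that `torch_version >= "2.1.1"` is a lexical string comparison: it works for these releases but would misorder versions like "2.10.0". A sketch of a numeric comparison, assuming plain dotted version strings (pre-release suffixes out of scope):

```python
def parse_version(v):
    # Split a plain dotted version string ("2.1.1") into an integer tuple
    # so comparison is numeric per component, not character-by-character.
    return tuple(int(part) for part in v.split("."))

def is_at_least(installed, required):
    return parse_version(installed) >= parse_version(required)

# Lexical string comparison would get this wrong: "2.10.0" < "2.1.1" as strings.
assert is_at_least("2.10.0", "2.1.1")
```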


@@ -30,6 +30,20 @@ class TestMonkeyPatchUtils(unittest.TestCase):
torch.allclose(get_cu_seqlens_from_pos_ids(position_ids)[0], target_res)
)
def test_get_cu_seqlens_from_pos_ids_2d(self):
position_ids = torch.tensor(
[
[0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 4, 0, 1, 0, 0],
[0, 1, 2, 3, 4, 0, 1, 2, 0, 1, 2, 3, 4, 5, 6, 0],
]
)
target_res = torch.tensor(
[[0, 4, 7, 12, 14, 16], [0, 5, 8, 15, 16, 16]], dtype=torch.int32
)
self.assertTrue(
torch.allclose(get_cu_seqlens_from_pos_ids(position_ids)[0], target_res)
)
def test_get_max_seqlen_in_batch(self):
attn_mask = torch.tensor([[1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 0, 0]])
target_res = torch.tensor([4, 3, 5, 2], dtype=torch.int32)
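The tests above derive cumulative sequence lengths (`cu_seqlens`) from packed `position_ids`. A simplified 1-D sketch of the idea — a new packed sequence starts wherever the position counter resets to 0. The real `get_cu_seqlens_from_pos_ids` also merges trailing padding positions, which this sketch ignores:

```python
def cu_seqlens_1d(position_ids):
    # Boundary wherever the position counter resets to 0 (a new packed
    # sequence begins); the final boundary is the total length.
    cu = [0]
    for i, pos in enumerate(position_ids):
        if pos == 0 and i > 0:
            cu.append(i)
    cu.append(len(position_ids))
    return cu

# Three packed sequences of lengths 4, 3, and 2:
assert cu_seqlens_1d([0, 1, 2, 3, 0, 1, 2, 0, 1]) == [0, 4, 7, 9]
```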


@@ -7,9 +7,14 @@ from tokenizers import AddedToken
from transformers import AutoTokenizer
from axolotl.datasets import TokenizedPromptDataset
from axolotl.prompt_strategies.sharegpt import SimpleShareGPTPromptTokenizingStrategy
from axolotl.prompt_strategies.sharegpt import (
SimpleShareGPTPromptTokenizingStrategy,
register_chatml_template,
)
from axolotl.prompters import ShareGPTPrompterV2
register_chatml_template()
@pytest.fixture(name="sharegpt_dataset")
def fixture_sharegpt_dataset():


@@ -0,0 +1,99 @@
"""Module for testing streaming dataset sequence packing"""
import pytest
from datasets import concatenate_datasets, load_dataset
from torch.utils.data import DataLoader, RandomSampler
from transformers import AutoTokenizer
from axolotl.datasets import TokenizedPromptDataset
from axolotl.prompt_strategies.completion import load
from axolotl.utils.collators import V2BatchSamplerDataCollatorForSeq2Seq
from axolotl.utils.dict import DictDefault
from axolotl.utils.samplers import MultipackBatchSampler, get_dataset_lengths
@pytest.fixture(name="tokenizer")
def fixture_tokenizer():
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
tokenizer.pad_token = "</s>"
return tokenizer
@pytest.fixture(name="max_seq_length")
def fixture_max_seq_length():
return 4096
class TestBatchedSamplerPacking:
"""
Test class for packing streaming dataset sequences
"""
@pytest.mark.parametrize(
"batch_size, num_workers",
[
(1, 0),
(2, 0),
(1, 2),
(2, 2),
],
)
def test_packing(self, batch_size, num_workers, tokenizer, max_seq_length):
import axolotl.monkeypatch.data.batch_dataset_fetcher # pylint: disable=unused-import # noqa: F401
dataset = load_dataset(
"Trelis/tiny-shakespeare",
split="train",
)
cfg = DictDefault(
{
"train_on_inputs": True,
"sequence_len": max_seq_length,
}
)
ds_cfg = DictDefault(
{
"field": "Text",
}
)
completion_strategy = load(tokenizer, cfg, ds_cfg)
dataset_wrapper = TokenizedPromptDataset(
completion_strategy,
dataset,
)
train_dataset = concatenate_datasets([dataset_wrapper])
batch_sampler = MultipackBatchSampler(
sampler=RandomSampler(train_dataset),
batch_size=batch_size,
drop_last=True,
batch_max_len=max_seq_length,
lengths=get_dataset_lengths(train_dataset),
)
loader = DataLoader(
train_dataset,
batch_sampler=batch_sampler,
collate_fn=V2BatchSamplerDataCollatorForSeq2Seq( # pylint: disable=unexpected-keyword-arg
tokenizer=tokenizer,
padding=True,
pad_to_multiple_of=max_seq_length,
return_tensors="pt",
),
num_workers=num_workers,
)
inputs = next(iter(loader))
assert inputs["input_ids"].shape == (batch_size, max_seq_length)
assert inputs["labels"].shape == (batch_size, max_seq_length)
assert inputs["attention_mask"].shape == (batch_size, max_seq_length)
assert inputs["input_ids"].tolist()[0][0] == 2
assert inputs["labels"].tolist()[0][0] == -100
assert inputs["attention_mask"].tolist()[0][0] == 0
assert inputs["attention_mask"].tolist()[0][-1] > 1
if batch_size >= 2:
assert inputs["input_ids"].tolist()[1][0] == 2
assert inputs["labels"].tolist()[1][0] == -100
assert inputs["attention_mask"].tolist()[1][0] == 0
assert inputs["attention_mask"].tolist()[1][-1] > 1
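The attention-mask assertions above rely on the packed collator assigning a distinct id to each packed sample (0 for padding, then 1, 2, … per sample), which is why the last position must hold a value greater than 1. A sketch of one such mask layout, assuming left-padding as the first-position assertions suggest:

```python
def packed_attention_mask(sample_lengths, total_len):
    # Left-pad with 0s, then tag each packed sample's span with its own id
    # (1, 2, ...) — the per-sample-id scheme the assertions above check.
    pad = total_len - sum(sample_lengths)
    mask = [0] * pad
    for sample_id, length in enumerate(sample_lengths, start=1):
        mask.extend([sample_id] * length)
    return mask

mask = packed_attention_mask([3, 2], total_len=8)
# -> [0, 0, 0, 1, 1, 1, 2, 2]
```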


@@ -1,17 +1,17 @@
"""Module for testing streaming dataset sequence packing"""
import functools
import unittest
from functools import partial
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer
from axolotl.utils.collators import PretrainingBatchSamplerDataCollatorForSeq2Seq
from axolotl.utils.data import encode_packed_pretraining
from axolotl.utils.data import get_dataset_wrapper, wrap_pretraining_dataset
from axolotl.utils.dict import DictDefault
class TestPacking(unittest.TestCase):
class TestPretrainingPacking(unittest.TestCase):
"""
Test class for packing streaming dataset sequences
"""
@@ -20,8 +20,6 @@ class TestPacking(unittest.TestCase):
# pylint: disable=duplicate-code
self.tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
self.tokenizer.pad_token = "</s>"
self.max_seq_length = 2048
self.batch_size = 2
def test_packing_stream_dataset(self):
# pylint: disable=duplicate-code
@@ -31,30 +29,43 @@ class TestPacking(unittest.TestCase):
streaming=True,
)["train"]
collate_fn = PretrainingBatchSamplerDataCollatorForSeq2Seq(
self.tokenizer,
return_tensors="pt",
padding=True,
pad_to_multiple_of=self.max_seq_length,
cfg = DictDefault(
{
"pretraining_dataset": [
{
"path": "c4",
"name": "en",
"type": "pretrain",
}
],
"sample_packing": True,
"pad_to_sequence_len": True,
"sequence_len": 2048,
"micro_batch_size": 2,
}
)
encode = partial(
encode_packed_pretraining,
ds_wrapper_partial = functools.partial(
get_dataset_wrapper,
cfg.pretraining_dataset[0],
self.tokenizer,
collate_fn,
max_seq_length=self.max_seq_length,
batch_size=self.batch_size,
cfg,
cfg.pretraining_dataset[0]["type"] or "pretrain",
)
dataset = dataset.map(
encode,
batched=True,
input_columns="text",
remove_columns=dataset.features.keys(),
original_bsz = cfg.micro_batch_size
train_dataset = wrap_pretraining_dataset(
dataset,
self.tokenizer,
cfg,
ds_wrapper_partial,
max_tokens=cfg.sequence_len,
batch_size=cfg.micro_batch_size,
seed=cfg.seed or 42,
)
trainer_loader = DataLoader(
dataset,
train_dataset,
batch_size=1,
collate_fn=None,
drop_last=True,
@@ -64,16 +75,16 @@ class TestPacking(unittest.TestCase):
if idx > 10:
break
assert data["input_ids"].shape == torch.Size(
[1, self.batch_size * self.max_seq_length]
[1, original_bsz * cfg.sequence_len]
)
assert data["position_ids"].shape == torch.Size(
[1, self.batch_size * self.max_seq_length]
[1, original_bsz * cfg.sequence_len]
)
assert data["labels"].shape == torch.Size(
[1, self.batch_size * self.max_seq_length]
[1, original_bsz * cfg.sequence_len]
)
assert data["attention_mask"].shape == torch.Size(
[1, self.batch_size * self.max_seq_length]
[1, original_bsz * cfg.sequence_len]
)
idx += 1
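The shape assertions above check that the pretraining wrapper emits fixed-length packed batches of `micro_batch_size * sequence_len` tokens. The core idea behind that style of packing can be sketched as a generator that concatenates tokenized documents and emits fixed-length chunks (an illustration of the concept, not axolotl's `wrap_pretraining_dataset`):

```python
def pack_stream(token_streams, chunk_len):
    # Concatenate tokenized documents into a rolling buffer and emit
    # fixed-length chunks; the incomplete tail is dropped, like drop_last.
    buffer = []
    for tokens in token_streams:
        buffer.extend(tokens)
        while len(buffer) >= chunk_len:
            yield buffer[:chunk_len]
            buffer = buffer[chunk_len:]

chunks = list(pack_stream([[1, 2, 3], [4, 5], [6, 7, 8, 9]], chunk_len=4))
# -> [[1, 2, 3, 4], [5, 6, 7, 8]]
```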


@@ -67,6 +67,21 @@ class TestTokenizers(unittest.TestCase):
)
load_tokenizer(cfg)
def test_add_additional_special_tokens(self):
cfg = DictDefault(
{
"tokenizer_config": "huggyllama/llama-7b",
"special_tokens": {"additional_special_tokens": ["<|im_start|>"]},
}
)
tokenizer = load_tokenizer(cfg)
self.assertEqual(tokenizer("<|im_start|>user")["input_ids"], [1, 32000, 1404])
self.assertEqual(len(tokenizer), 32001)
# ensure reloading the tokenizer again from cfg results in same vocab length
tokenizer = load_tokenizer(cfg)
self.assertEqual(len(tokenizer), 32001)
if __name__ == "__main__":
unittest.main()
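The second `load_tokenizer` call in the test above verifies idempotency: re-adding an already-registered special token must not grow the vocabulary. A stand-in sketch of that invariant with a plain dict vocab (illustrative, not the HF tokenizer API):

```python
def add_special_token(vocab, token):
    # Idempotent registration: only extend the vocab if the token is new,
    # so reloading/re-adding leaves the vocab size unchanged.
    if token not in vocab:
        vocab[token] = len(vocab)
    return vocab

vocab = {"user": 0}
add_special_token(vocab, "<|im_start|>")
add_special_token(vocab, "<|im_start|>")  # second add: no growth
assert len(vocab) == 2
```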


@@ -26,21 +26,12 @@ class BaseValidation(unittest.TestCase):
self._caplog = caplog
# pylint: disable=too-many-public-methods
class ValidationTest(BaseValidation):
"""
Test the validation module
"""
def test_load_4bit_deprecate(self):
cfg = DictDefault(
{
"load_4bit": True,
}
)
with pytest.raises(ValueError):
validate_config(cfg)
def test_batch_size_unused_warning(self):
cfg = DictDefault(
{
@@ -698,6 +689,22 @@ class ValidationTest(BaseValidation):
):
validate_config(cfg)
def test_hub_model_id_save_value_warns(self):
cfg = DictDefault({"hub_model_id": "test"})
with self._caplog.at_level(logging.WARNING):
validate_config(cfg)
assert (
"set without any models being saved" in self._caplog.records[0].message
)
def test_hub_model_id_save_value(self):
cfg = DictDefault({"hub_model_id": "test", "saves_per_epoch": 4})
with self._caplog.at_level(logging.WARNING):
validate_config(cfg)
assert len(self._caplog.records) == 0
class ValidationCheckModelConfig(BaseValidation):
"""

ui/main.py (new file, 98 lines)

@@ -0,0 +1,98 @@
"""
This module is used to launch Axolotl with user defined configurations.
"""
import gradio as gr
import yaml
def config(
base_model,
dataset,
dataset_type,
learn_rate,
gradient_accumulation_steps,
micro_batch_size,
seq_length,
num_epochs,
output_dir,
val_size,
):
"""
This function generates a configuration dictionary and saves it as a yaml file.
"""
config_dict = {
"base_model": base_model,
"datasets": [{"path": dataset, "type": dataset_type}],
"learning_rate": learn_rate,
"gradient_accumulation_steps": gradient_accumulation_steps,
"micro_batch_size": micro_batch_size,
"sequence_len": seq_length,
"num_epochs": num_epochs,
"output_dir": output_dir,
"val_set_size": val_size,
}
with open("config.yml", "w", encoding="utf-8") as file:
yaml.dump(config_dict, file)
print(config_dict)
return yaml.dump(config_dict)
with gr.Blocks(title="Axolotl Launcher") as demo:
gr.Markdown(
"""
# Axolotl Launcher
Fill out the required fields below to create a training run.
"""
)
with gr.Row():
base_model_name = gr.Textbox(
"TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T", label="Base model"
)
mode = gr.Radio(
choices=["Full finetune", "QLoRA", "LoRA"],
label="Training mode",
info="FFT = 16 bit, Qlora = 4 bit, Lora = 8 bit",
)
with gr.Row():
dataset_path = gr.Textbox("mhenrichsen/alpaca_2k_test", label="Dataset")
dataset_type_name = gr.Dropdown(
choices=["alpaca", "sharegpt"], label="Dataset type", value="alpaca"
)
with gr.Accordion("Hyperparameters", open=False):
gr.Markdown("Choose hyperparameters")
with gr.Row():
learning_rate = gr.Number(0.000001, label="Learning rate")
gradient_accumulation_steps_count = gr.Number(
1, label="Gradient accumulation steps"
)
val_set_size_count = gr.Number(0, label="Validation size")
with gr.Row():
micro_batch_size_count = gr.Number(1, label="Micro batch size")
sequence_length = gr.Number(1024, label="Sequence length")
num_epochs_count = gr.Number(1, label="Epochs")
output_dir_path = gr.Textbox("./model-out", label="Output directory")
create_config = gr.Button("Create config")
output = gr.TextArea(label="Generated config")
create_config.click(
config,
inputs=[
base_model_name,
dataset_path,
dataset_type_name,
learning_rate,
gradient_accumulation_steps_count,
micro_batch_size_count,
sequence_length,
num_epochs_count,
output_dir_path,
val_set_size_count,
],
outputs=output,
)
demo.launch(debug=True, server_name="0.0.0.0", server_port=7860)
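For the default values shown in the UI widgets, the `config.yml` written by `config()` would look roughly like the following (`yaml.dump` sorts keys alphabetically by default; formatting is illustrative):

```yaml
base_model: TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T
datasets:
- path: mhenrichsen/alpaca_2k_test
  type: alpaca
gradient_accumulation_steps: 1
learning_rate: 1.0e-06
micro_batch_size: 1
num_epochs: 1
output_dir: ./model-out
sequence_len: 1024
val_set_size: 0
```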