Compare commits

...

11 Commits

Author SHA1 Message Date
Wing Lian
f8acc72dd8 proof of concept for sage attention 2024-11-22 14:47:19 -05:00
Aman Karmani
51c9e1a035 .gitignore improvements (#349) [skip ci] 2024-11-22 11:08:54 -05:00
Sunny Liu
45c0825587 updated colab notebook (#2074)
* updated colab notebook

* update pip installation

* cleared cell output

* Update examples/colab-notebooks/colab-axolotl-example.ipynb

Co-authored-by: NanoCode012 <nano@axolotl.ai>

* Update examples/colab-notebooks/colab-axolotl-example.ipynb

Co-authored-by: NanoCode012 <nano@axolotl.ai>

* Update examples/colab-notebooks/colab-axolotl-example.ipynb

Co-authored-by: NanoCode012 <nano@axolotl.ai>

* Update examples/colab-notebooks/colab-axolotl-example.ipynb

Co-authored-by: NanoCode012 <nano@axolotl.ai>

* modified notebook

* Update examples/colab-notebooks/colab-axolotl-example.ipynb

Co-authored-by: NanoCode012 <nano@axolotl.ai>

* Update examples/colab-notebooks/colab-axolotl-example.ipynb

Co-authored-by: NanoCode012 <nano@axolotl.ai>

* Update examples/colab-notebooks/colab-axolotl-example.ipynb

Co-authored-by: NanoCode012 <nano@axolotl.ai>

* Update examples/colab-notebooks/colab-axolotl-example.ipynb

Co-authored-by: NanoCode012 <nano@axolotl.ai>

* Update examples/colab-notebooks/colab-axolotl-example.ipynb

Co-authored-by: NanoCode012 <nano@axolotl.ai>

* Update examples/colab-notebooks/colab-axolotl-example.ipynb

Co-authored-by: NanoCode012 <nano@axolotl.ai>

* cleared cell output

* cleared unnecessary logs

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
2024-11-22 10:09:10 -05:00
Wing Lian
94fc223f6c actions/create-release is unmaintained, and doesn't create proper release notes (#2098) [skip ci] 2024-11-21 14:32:41 -05:00
Sunny Liu
151abb7a67 fix None-type not iterable error when deepspeed is left blank w/ use_… (#2087)
* fix None-type not iterable error when deepspeed is left blank w/ use_reentrant: false and qlora

* added unit test[skip e2e]

* corrected test case[skip e2e]

* assert warning message [skip e2e]

* assert warning message [skip e2e]

* corrected test cases [skip e2e]

* lint
2024-11-21 13:36:51 -05:00
Sunny Liu
bf416bdfd0 bump_liger_0.4.2 (#2096) 2024-11-21 13:24:52 -05:00
Mengqing Cao
838b74d05b Add Ascend NPU support (#1758) 2024-11-20 21:28:41 -05:00
Wing Lian
2e99bb303e fix inference when no chat_template is set, fix unsloth dora check (#2092)
* fix inference when no chat_template is set, fix unsloth dora check

* remove old unsloth version check

* update docs on installing unsloth
2024-11-20 14:07:54 -05:00
Chirag Jain
68a26f1005 Fix duplication of plugin callbacks (#2090) 2024-11-20 14:06:08 -05:00
Wing Lian
db51a9e4cb use pep440 instead of semver (#2088) [skip ci] 2024-11-19 15:02:10 -05:00
Wing Lian
8961364bc9 release 0.5.2 (#2086) 2024-11-19 12:44:42 -05:00
22 changed files with 1598 additions and 162 deletions

View File

@@ -49,7 +49,7 @@ jobs:
axolotlai/axolotl
tags: |
type=ref,event=branch
type=semver,pattern={{version}}
type=pep440,pattern={{version}}
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to Docker Hub
@@ -116,7 +116,7 @@ jobs:
axolotlai/axolotl-cloud
tags: |
type=ref,event=branch
type=semver,pattern={{version}}
type=pep440,pattern={{version}}
- name: Login to Docker Hub
uses: docker/login-action@v3
with:
@@ -163,7 +163,7 @@ jobs:
axolotlai/axolotl-cloud-term
tags: |
type=ref,event=branch
type=semver,pattern={{version}}
type=pep440,pattern={{version}}
- name: Login to Docker Hub
uses: docker/login-action@v3
with:

View File

@@ -13,19 +13,10 @@ jobs:
permissions:
contents: write
steps:
- name: Get the tag version
id: extract_branch
run: echo ::set-output name=branch::${GITHUB_REF#refs/tags/}
shell: bash
- name: Create Release
id: create_release
uses: actions/create-release@v1
- name: Create release
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
tag_name: ${{ steps.extract_branch.outputs.branch }}
release_name: ${{ steps.extract_branch.outputs.branch }}
run: gh release create "$GITHUB_REF_NAME" # GITHUB_REF_NAME is the tag name in `on.push.tags` workflows
pypi-publish:
name: Upload release to PyPI
runs-on: ubuntu-latest

3
.gitignore vendored
View File

@@ -182,3 +182,6 @@ submit.sh
typings/
out/
# vim
*.swp

View File

@@ -11,12 +11,10 @@ standard industry baselines.
### Installation
The following will install unsloth from source and downgrade xformers as unsloth is incompatible with the most up
to date libraries.
The following will install the correct unsloth and extras from source.
```bash
pip install --no-deps "unsloth @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps --force-reinstall xformers==0.0.26.post1
python scripts/unsloth_install.py | sh
```
### Using unsloth w Axolotl

View File

@@ -2,19 +2,15 @@
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "AKjdG7tbTb-n"
},
"metadata": {},
"source": [
"# Example notebook for running Axolotl on google colab"
"## Setting up"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "RcbNpOgWRcii"
},
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
@@ -22,82 +18,76 @@
"assert (torch.cuda.is_available()==True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "h3nLav8oTRA5"
},
"source": [
"## Install Axolotl and dependencies"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "3c3yGAwnOIdi",
"outputId": "e3777b5a-40ef-424f-e181-62dfecd1dd01"
},
"metadata": {},
"outputs": [],
"source": [
"!pip install -e git+https://github.com/axolotl-ai-cloud/axolotl#egg=axolotl\n",
"!pip install flash-attn==\"2.7.0.post2\"\n",
"!pip install deepspeed==\"0.13.1\"!pip install mlflow==\"2.13.0\""
"!pip install axolotl[deepspeed]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BW2MFr7HTjub"
},
"metadata": {},
"source": [
"## Create an yaml config file"
"## Hugging Face login (optional)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "9pkF2dSoQEUN"
},
"metadata": {},
"outputs": [],
"source": [
"from huggingface_hub import notebook_login\n",
"notebook_login()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example configuration"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import yaml\n",
"\n",
"# Your YAML string\n",
"yaml_string = \"\"\"\n",
"base_model: TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T\n",
"model_type: LlamaForCausalLM\n",
"tokenizer_type: LlamaTokenizer\n",
"base_model: NousResearch/Meta-Llama-3.1-8B\n",
"\n",
"load_in_8bit: false\n",
"load_in_4bit: true\n",
"strict: false\n",
"\n",
"datasets:\n",
" - path: mhenrichsen/alpaca_2k_test\n",
" - path: tatsu-lab/alpaca\n",
" type: alpaca\n",
"dataset_prepared_path:\n",
"dataset_prepared_path: last_run_prepared\n",
"val_set_size: 0.05\n",
"output_dir: ./outputs/qlora-out\n",
"output_dir: ./outputs/lora-out\n",
"\n",
"sequence_len: 2048\n",
"sample_packing: true\n",
"eval_sample_packing: true\n",
"pad_to_sequence_len: true\n",
"\n",
"adapter: qlora\n",
"lora_model_dir:\n",
"\n",
"sequence_len: 4096\n",
"sample_packing: true\n",
"eval_sample_packing: false\n",
"pad_to_sequence_len: true\n",
"\n",
"lora_r: 32\n",
"lora_alpha: 16\n",
"lora_dropout: 0.05\n",
"lora_target_modules:\n",
"lora_target_linear: true\n",
"lora_fan_in_fan_out:\n",
"lora_modules_to_save:\n",
" - embed_tokens\n",
" - lm_head\n",
"\n",
"wandb_project:\n",
"wandb_entity:\n",
@@ -105,12 +95,12 @@
"wandb_name:\n",
"wandb_log_model:\n",
"\n",
"gradient_accumulation_steps: 4\n",
"micro_batch_size: 2\n",
"num_epochs: 4\n",
"optimizer: paged_adamw_32bit\n",
"gradient_accumulation_steps: 2\n",
"micro_batch_size: 1\n",
"num_epochs: 1\n",
"optimizer: paged_adamw_8bit\n",
"lr_scheduler: cosine\n",
"learning_rate: 0.0002\n",
"learning_rate: 2e-5\n",
"\n",
"train_on_inputs: false\n",
"group_by_length: false\n",
@@ -121,13 +111,15 @@
"gradient_checkpointing: true\n",
"early_stopping_patience:\n",
"resume_from_checkpoint:\n",
"local_rank:\n",
"logging_steps: 1\n",
"xformers_attention:\n",
"flash_attention: true\n",
"flash_attention: false\n",
"sdp_attention: true\n",
"\n",
"warmup_steps: 10\n",
"evals_per_epoch: 4\n",
"warmup_steps: 1\n",
"max_steps: 25\n",
"evals_per_epoch: 1\n",
"eval_table_size:\n",
"saves_per_epoch: 1\n",
"debug:\n",
"deepspeed:\n",
@@ -135,9 +127,10 @@
"fsdp:\n",
"fsdp_config:\n",
"special_tokens:\n",
"\n",
" pad_token: <|end_of_text|>\n",
"\"\"\"\n",
"\n",
"\n",
"# Convert the YAML string to a Python dictionary\n",
"yaml_dict = yaml.safe_load(yaml_string)\n",
"\n",
@@ -146,31 +139,124 @@
"\n",
"# Write the YAML file\n",
"with open(file_path, 'w') as file:\n",
" yaml.dump(yaml_dict, file)\n"
" yaml.dump(yaml_dict, file)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "bidoj8YLTusD"
},
"metadata": {},
"source": [
"## Launch the training"
"Above we have a configuration file with base LLM model and datasets specified, among many other things. Axolotl can automatically detect whether the specified datasets are on HuggingFace repo or local machine.\n",
"\n",
"The Axolotl configuration options encompass model and dataset selection, data pre-processing, and training. Let's go through them line by line:\n",
"\n",
"* \"base model\": String value, specifies the underlying pre-trained LLM that will be used for finetuning\n",
"\n",
"Next we have options for model weights quantization. Quantization allows for reduction in occupied memory on GPUs.\n",
"\n",
"* \"load_in_8bit\": Boolean value, whether to quantize the model weights into 8-bit integer.\n",
"\n",
"* \"load_in_4bit\": Boolean value, whether to quantize the model weights into 4-bit integer.\n",
"\n",
"* \"strict\": Boolean value. If false, it allows for overriding established configuration options in the yaml file when executing in command-line interface.\n",
"\n",
"* \"datasets\": a list of dicts that contain path and type of data sets as well as other optional configurations where datasets are concerned. Supports multiple datasets.\n",
"\n",
"* \"val_set_size\": Either a float value less than one or an integer less than the total size of dataset. Sets the size of validation set from the whole dataset. If float, sets the proportion of the dataset assigned for validation. If integer, sets the direct size of validation set.\n",
"\n",
"* \"output_dir\": String value. Path of trained model.\n",
"\n",
"For data preprocessing:\n",
"\n",
"* \"sequence_len\": Integer. Specifies the maximum sequence length of the input. Typically 2048 or less.\n",
"\n",
"* \"pad_to_sequence_len\": Boolean. Padding input to maximum sequence length.\n",
"\n",
"* \"sample_packing\": Boolean. Specifies whether to use multi-packing with block diagonal attention.\n",
"\n",
"* \"special_tokens\": Python dict, optional. Allows users to specify the additional special tokens to be ignored by the tokenizer.\n",
"\n",
"For LoRA configuration and its hyperparameters:\n",
"\n",
"* \"adapter\": String. Either \"lora\" or \"qlora\", depending on user's choice.\n",
"\n",
"* \"lora_model_dir\": String, Optional. Path to directory that contains LoRA model, if there is already a trained LoRA model the user would like to use.\n",
"\n",
"* \"lora_r\": Integer. Refers to the rank of LoRA decomposition matrices. Higher value will reduce LoRA efficiency. Recommended to be set to 8.\n",
"\n",
"* \"lora_alpha\": Integer. Scale the weight matrices by $\\frac{\\text{lora_alpha}}{\\text{lora_r}}$. Recommended to be fixed at 16.\n",
"\n",
"* \"lora_dropout\": Float that is 1 or less. The dropout probability of a lora layer.\n",
"\n",
"* \"lora_target_linear\": Boolean. If true, lora will target all linear modules in the transformers architecture.\n",
"\n",
"* \"lora_modules_to_save\": If you added new tokens to the tokenizer, you may need to save some LoRA modules because they need to know the new tokens.\n",
"\n",
"See [LoRA](https://arxiv.org/abs/2106.09685) for detailed explanation of LoRA implementation.\n",
"\n",
"For the training configurations:\n",
"\n",
"* \"gradient_accumulation_steps\": Integer. The number of steps over which to accumulate gradient for batch training. E.g. if 2, backprop is performed every two steps.\n",
"\n",
"* \"micro_batch_size\": Integer. Batch size per gpu / gradient_accumulation_steps\n",
"\n",
"* \"num_epochs\": Integer. Number of epochs. One epoch is when training has looped over every batch in the whole data set once.\n",
"\n",
"* \"optimizer\": The optimizer to use for the training.\n",
"\n",
"* \"learning_rate\": The learning rate.\n",
"\n",
"* \"lr_scheduler\": The learning rate scheduler to use for adjusting learning rate during training.\n",
"\n",
"* \"train_on_inputs\": Boolean. Whether to ignore or include the user's prompt from the training labels.\n",
"\n",
"* \"group_by_length\": Boolean. Whether to group similarly sized data to minimize padding.\n",
"\n",
"* \"bf16\": Either \"auto\", \"true\", or \"false\". Whether to use CUDA bf16 floating point format. If set to \"auto\", will automatically apply bf16 should the gpu supports it.\n",
"\n",
"* \"fp16\": Optional. Specifies whether to use CUDA fp16. Automatically set to true if \"bf16\" is set to true. Otherwise false.\n",
"\n",
"* \"tf32\": Boolean. Whether to use CUDA tf32. Will override bf16.\n",
"\n",
"* \"gradient_checkpointing\": Boolean. Whether to use gradient checkpointing https://huggingface.co/docs/transformers/v4.18.0/en/performance#gradient-checkpointing\n",
"\n",
"* \"gradient_checkpointing_kwargs\": Python Dict. Fed into the trainer.\n",
"\n",
"* \"logging_steps\": Integer. Log training information over every specified number of steps.\n",
"\n",
"* \"flash_attention\": Boolean. Whether to use the [flash attention](https://github.com/Dao-AILab/flash-attention) mechanism.\n",
"\n",
"* \"sdp_attention\": Boolean. Whether to use the Scaled Dot Product attention mechanism (the attention mechanism in the [original implementation](https://arxiv.org/abs/1706.03762) of transformers.)\n",
"\n",
"* \"warmup_steps\": Integer. The number of pre-training steps where a very low learning rate is used.\n",
"\n",
"* \"evals_per_epoch\": Integer. Number of evaluations to be performed within one training epoch.\n",
"\n",
"* \"saves_per_epoch\": Integer. Number of times the model is saved in one training epoch.\n",
"\n",
"* \"weight_decay\": Positive Float. Sets the \"strength\" of weight decay (i.e. setting the coefficient of L2 regularization)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above is but a snippet aiming to get users familiarized with the types of streamlined configuration options axolotl provides. For a full list of configuration options, see [here](https://axolotl-ai-cloud.github.io/axolotl/docs/config.html)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Train the model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "ydTI2Jk2RStU",
"outputId": "d6d0df17-4b53-439c-c802-22c0456d301b"
},
"metadata": {},
"outputs": [],
"source": [
"# By using the ! the command will be executed as a bash command\n",
"!accelerate launch -m axolotl.cli.train /content/test_axolotl.yaml"
]
},
@@ -178,7 +264,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Play with inference"
"Predict with trained model"
]
},
{
@@ -187,36 +273,85 @@
"metadata": {},
"outputs": [],
"source": [
"# By using the ! the command will be executed as a bash command\n",
"!accelerate launch -m axolotl.cli.inference /content/test_axolotl.yaml \\\n",
" --qlora_model_dir=\"./qlora-out\" --gradio"
" --lora_model_dir=\"./outputs/lora-out\" --gradio"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deeper Dive"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is also helpful to gain some familiarity with some of the core inner workings of axolotl"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Configuration Normalization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Axolotl uses a custom Dict class, called ```DictDefault```\n",
"to store configurations specified in the yaml configuration file (into a Python variable named ```cfg```). The definition for this custom Dict can be found in the [utils/dict.py](https://github.com/axolotl-ai-cloud/axolotl/blob/main/src/axolotl/utils/dict.py)\n",
"\n",
"```DictDefault``` is amended such that calling a missing key from it will result in a ```None``` return type. This is important because if some configuration options aren't specified by the user, the ```None``` type allows Axolotl to perform boolean operations to determine the default settings for missing configurations. For more examples on how this is done, check out [utils/config/__init__.py](https://github.com/axolotl-ai-cloud/axolotl/blob/main/src/axolotl/utils/config/__init__.py)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading Models, Tokenizers, and Trainer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we inspect [cli.train.py](https://github.com/axolotl-ai-cloud/axolotl/blob/main/src/axolotl/cli/train.py), we will find that most of the heavy lifting were done by the function ```train()``` which is itself imported from [src/axolotl/train.py](https://github.com/axolotl-ai-cloud/axolotl/blob/main/src/axolotl/train.py).\n",
"\n",
"```train()``` takes care of loading the appropriate tokenizer and pre-trained model through ```load_model()``` and ```load_tokenizer()``` from [src/axolotl/utils/models.py](https://github.com/axolotl-ai-cloud/axolotl/blob/main/src/axolotl/utils/models.py) respectively.\n",
"\n",
"```load_tokenizer()``` loads in the appropriate tokenizer given the desired model, as well as chat templates.\n",
"\n",
"```ModelLoader``` class follows after tokenizer has been selected. It will automatically discern the base model type, load in the desired model, as well as applying model-appropriate attention mechanism modifications (e.g. flash attention). Depending on which base model the user chooses in the configuration, ```ModelLoader``` will utilize the corresponding \"attention hijacking\" script. For example, if the user specified the base model to be ```NousResearch/Meta-Llama-3.1-8B```, which is of llama type, and set ```flash_attn``` to ```True```, ```ModelLoader``` will load in [llama_attn_hijack_flash.py](https://github.com/axolotl-ai-cloud/axolotl/blob/main/src/axolotl/monkeypatch/llama_attn_hijack_flash.py). For a list of supported attention hijacking, please refer to the directory [/src/axolotl/monkeypatch/](https://github.com/axolotl-ai-cloud/axolotl/tree/main/src/axolotl/monkeypatch)\n",
"\n",
"Another important operation encompassed in ```train()``` is setting up the training so that it takes into account user-specified training configurations (e.g. num_epochs, optimizer) through the use of ```setup_trainer()``` from [/src/axolotl/utils/trainer.py](https://github.com/axolotl-ai-cloud/axolotl/blob/main/src/axolotl/utils/trainer.py), which in turn relies on modules from [/src/axolotl/core/trainer_builder.py](https://github.com/axolotl-ai-cloud/axolotl/blob/main/src/axolotl/core/trainer_builder.py).\n",
"```trainer_builder.py``` provides a list of trainer object options bespoke for the task type (Causal or Reinforcement learning ('dpo', 'ipo', 'kto') )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Monkey patch\n",
"\n",
"The [Monkey patch directory](https://github.com/axolotl-ai-cloud/axolotl/tree/main/src/axolotl/monkeypatch) is where model architecture/optimization patching scripts are stored (these are modifications that are not implemented in the official releases, hence the name monkey patch). It includes attention jacking, ReLoRA, and unsloth optimization."
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"gpuType": "T4",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.1"
"version": "3.9.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
"nbformat_minor": 2
}

View File

@@ -33,7 +33,7 @@ tensorboard
python-dotenv==1.0.1
autoawq==0.2.7.post2
triton>=2.3.0
liger-kernel==0.4.1
liger-kernel==0.4.2
mamba-ssm==1.2.0.post1

View File

@@ -0,0 +1,33 @@
# noqa
# pylint: skip-file
# Prints (to stdout) the pip command that installs the unsloth build matching
# the local torch + CUDA combination; intended usage:
#   python scripts/unsloth_install.py | sh
try:
    import torch
except ImportError:
    raise ImportError("Install torch via `pip install torch`")
from packaging.version import Version as V

v = V(torch.__version__)
cuda = str(torch.version.cuda)
# SM >= 8.0 (Ampere or newer) selects the "-ampere" wheel variant below.
# NOTE(review): requires a visible CUDA device; fails on CPU-only machines.
is_ampere = torch.cuda.get_device_capability()[0] >= 8
# Only CUDA toolkits that unsloth publishes builds for are accepted.
if cuda != "12.1" and cuda != "11.8" and cuda != "12.4":
    raise RuntimeError(f"CUDA = {cuda} not supported!")
# Map the installed torch version onto the matching unsloth extras tag.
if v <= V("2.1.0"):
    raise RuntimeError(f"Torch = {v} too old!")
elif v <= V("2.1.1"):
    x = "cu{}{}-torch211"
elif v <= V("2.1.2"):
    x = "cu{}{}-torch212"
elif v < V("2.3.0"):
    x = "cu{}{}-torch220"
elif v < V("2.4.0"):
    x = "cu{}{}-torch230"
elif v < V("2.5.0"):
    x = "cu{}{}-torch240"
elif v < V("2.6.0"):
    x = "cu{}{}-torch250"
else:
    raise RuntimeError(f"Torch = {v} too new!")
# e.g. "cu121-ampere-torch240" -> full pip install command on stdout.
x = x.format(cuda.replace(".", ""), "-ampere" if is_ampere else "")
print(
    f'pip install unsloth-zoo && pip install --no-deps "unsloth[{x}] @ git+https://github.com/unslothai/unsloth.git"'
)

View File

@@ -96,7 +96,7 @@ install_requires, dependency_links = parse_requirements()
setup(
name="axolotl",
version="0.5.1",
version="0.5.2",
description="LLM Trainer",
long_description="Axolotl is a tool designed to streamline the fine-tuning of various AI models, offering support for multiple configurations and architectures.",
package_dir={"": "src"},

View File

@@ -30,7 +30,10 @@ from axolotl.common.cli import TrainerCliArgs, load_model_and_tokenizer
from axolotl.integrations.base import PluginManager
from axolotl.logging_config import configure_logging
from axolotl.train import TrainDatasetMeta
from axolotl.utils.chat_templates import get_chat_template
from axolotl.utils.chat_templates import (
get_chat_template,
get_chat_template_from_config,
)
from axolotl.utils.comet_ import setup_comet_env_vars
from axolotl.utils.config import (
normalize_cfg_datasets,
@@ -199,6 +202,10 @@ def do_inference(
)
elif cfg.chat_template:
chat_template_str = get_chat_template(cfg.chat_template)
elif cfg.datasets[0].type == "chat_template":
chat_template_str = get_chat_template_from_config(
cfg=cfg, ds_cfg=cfg.datasets[0], tokenizer=tokenizer
)
model = model.to(cfg.device, dtype=cfg.torch_dtype)

View File

@@ -1212,11 +1212,17 @@ class TrainerBuilderBase(abc.ABC):
Callbacks added after the trainer is created, usually b/c these need access to the trainer
"""
callbacks = []
plugin_manager = PluginManager.get_instance()
callbacks.extend(
plugin_manager.add_callbacks_post_trainer(cfg=self.cfg, trainer=trainer)
)
if self.cfg.plugins:
plugin_manager = PluginManager.get_instance()
callbacks.extend(
[
cb
for cb in plugin_manager.add_callbacks_post_trainer(
self.cfg, trainer
)
if cb
]
)
return callbacks
def hook_pre_create_training_args(self, training_arguments_kwargs):
@@ -1263,7 +1269,7 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
return callbacks
def get_post_trainer_create_callbacks(self, trainer):
callbacks = super().get_post_trainer_create_callbacks(trainer=trainer)
callbacks = []
if self.cfg.use_wandb and self.cfg.eval_table_size > 0:
LogPredictionCallback = log_prediction_callback_factory(
trainer, self.tokenizer, "wandb"
@@ -1301,17 +1307,7 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
if self.cfg.lisa_step_interval and self.cfg.lisa_n_layers:
callbacks.append(lisa_callback_factory(trainer))
if self.cfg.plugins:
plugin_manager = PluginManager.get_instance()
callbacks.extend(
[
cb
for cb in plugin_manager.add_callbacks_post_trainer(
self.cfg, trainer
)
if cb
]
)
callbacks.extend(super().get_post_trainer_create_callbacks(trainer=trainer))
return callbacks
def _get_trainer_cls(self):

View File

@@ -0,0 +1,361 @@
"""
Copyright (c) 2024 by SageAttention team.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""
from typing import Any, Optional
import torch
from torch.autograd import Function
from .triton.attn_qk_int8_per_block_causal_varlen import (
backward as sageattn_varlen_backward,
)
from .triton.attn_qk_int8_per_block_causal_varlen import forward as attn_true_varlen
from .triton.quant_per_block_varlen import (
per_block_int8 as per_block_int8_varlen_triton,
)
def get_cuda_arch_versions():
    """Return the SM architecture tag (e.g. ``"sm80"``) of every visible CUDA device."""
    return [
        f"sm{cap[0]}{cap[1]}"
        for cap in (
            torch.cuda.get_device_capability(dev)
            for dev in range(torch.cuda.device_count())
        )
    ]
def sageattn_varlen(
    q: torch.Tensor,
    k: torch.Tensor,
    v: torch.Tensor,
    cu_seqlens_q: torch.Tensor,
    cu_seqlens_k: torch.Tensor,
    max_seqlen_q: int,
    max_seqlen_k: int,
    sm_scale: Optional[float] = None,
    smooth_k: bool = True,
    **kwargs: Any,  # absorbed for signature compatibility; unused below
) -> torch.Tensor:
    """
    Variable-length ("packed") SageAttention: attention with per-block INT8
    quantization of Q and K over sequences concatenated along dim 0.

    Parameters
    ----------
    q : torch.Tensor
        The query tensor, shape: ``[cu_seqlens_q[-1], num_qo_heads, head_dim]``.
    k : torch.Tensor
        The key tensor, shape: ``[cu_seqlens_k[-1], num_kv_heads, head_dim]``.
    v : torch.Tensor
        The value tensor, shape: ``[cu_seqlens_k[-1], num_kv_heads, head_dim]``.
    cu_seqlens_q : torch.Tensor
        Cumulative sequence lengths for the query sequences in the batch, used
        to index into `q`. Shape: ``[batch_size + 1]``; each entry is the
        cumulative length of sequences up to that batch index.
    cu_seqlens_k : torch.Tensor
        Cumulative sequence lengths for the key/value sequences in the batch,
        used to index into `k` and `v`. Shape: ``[batch_size + 1]``.
    max_seqlen_q : int
        The maximum query sequence length in the batch.
    max_seqlen_k : int
        The maximum key/value sequence length in the batch.
    sm_scale : Optional[float]
        The scale used in softmax; if not provided, ``1.0 / sqrt(head_dim)``.
    smooth_k : bool
        Whether to smooth the key tensor by subtracting the mean along the
        sequence dimension. Default: True.

    Returns
    -------
    torch.Tensor
        The output tensor, shape: ``[cu_seqlens_q[-1], num_qo_heads, head_dim]``.

    Note
    ----
    - ``num_qo_heads`` must be divisible by ``num_kv_heads``.
    - `q`, `k` and `v` must be ``torch.float16`` or ``torch.bfloat16``; the
      assert below rejects ``torch.float32`` even though a float32 branch
      survives further down — TODO confirm intended dtype set upstream.
    - `cu_seqlens_q` / `cu_seqlens_k` must be ``torch.int32`` or ``torch.int64``.
    - All tensors must be on the same cuda device.
    - `smooth_k` introduces slight overhead but improves accuracy under most
      circumstances.
    - NOTE(review): there is no ``is_causal`` parameter; the kernel used is
      the *causal* varlen kernel (``attn_qk_int8_per_block_causal_varlen``),
      so causal masking is presumably always applied — verify.
    """
    dtype = q.dtype
    # Validate the device/dtype/layout invariants the Triton kernels rely on.
    assert q.is_cuda, "Input tensors must be on cuda."
    assert dtype in [
        torch.float16,
        torch.bfloat16,
    ], "Input tensors must be in dtype of torch.float16 or torch.bfloat16"
    assert q.device == k.device == v.device, "All tensors must be on the same device."
    assert q.dtype == k.dtype == v.dtype, "All tensors must have the same dtype."
    head_dim = q.size(-1)
    assert head_dim in [64, 128], "varlen only support head_dim [64, 128]."
    assert (
        q.stride(-1) == 1 and k.stride(-1) == 1 and v.stride(-1) == 1
    ), "Last dim of qkv must be contiguous."
    assert (
        cu_seqlens_q.is_contiguous() and cu_seqlens_k.is_contiguous()
    ), "cu_seqlens_q and cu_seqlens_k must be contiguous."
    # PV is computed in fp16 by the kernel; downcast v for other dtypes.
    # (The float32 case is unreachable given the dtype assert above.)
    if dtype == torch.bfloat16 or dtype == torch.float32:
        v = v.to(torch.float16)
    if smooth_k:
        km = k.mean(
            dim=0, keepdim=True
        )  # ! km is calculated on the all the batches. Calculate over each individual sequence requires dedicated kernel.
        k -= km  # NOTE(review): mutates the caller's `k` in place.
    # Per-block INT8 quantization of q and k, plus their per-block scales.
    (
        q_int8,
        q_scale,
        k_int8,
        k_scale,
        cu_seqlens_q_scale,
        cu_seqlens_k_scale,
    ) = per_block_int8_varlen_triton(
        q, k, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k, sm_scale=sm_scale
    )
    # Quantized varlen attention kernel; output produced in the input dtype.
    o = attn_true_varlen(
        q_int8,
        k_int8,
        v,
        cu_seqlens_q,
        cu_seqlens_k,
        max_seqlen_q,
        q_scale,
        k_scale,
        cu_seqlens_q_scale,
        cu_seqlens_k_scale,
        output_dtype=dtype,
    )
    return o
class SageAttentionFunction(Function):
    """Autograd wrapper exposing the varlen SageAttention kernels through the
    standard ``scaled_dot_product_attention``-style interface.

    Dense ``[batch, heads, seq, dim]`` inputs are flattened so that each
    (batch, head) pair becomes one "sequence" of the varlen kernel, with
    synthetic cumulative-sequence-length tensors built via ``arange``.
    """

    @staticmethod
    def forward(
        ctx,
        query,
        key,
        value,
        attn_mask=None,
        dropout_p=0.0,
        is_causal=False,
        scale=None,
    ):
        """
        query: Tensor of shape [batch_size, num_heads, seq_len_q, head_dim]
        key: Tensor of shape [batch_size, num_heads, seq_len_k, head_dim]
        value: Tensor of shape [batch_size, num_heads, seq_len_k, head_dim]
        attn_mask: Optional[Tensor], mask tensor
        dropout_p: float, dropout probability
        is_causal: bool, whether to apply causal masking
        scale: Optional[float], scaling factor for attention scores

        NOTE(review): ``attn_mask`` and ``dropout_p`` are stored on ``ctx``
        but never applied in the computation below — confirm callers only
        pass the defaults. Only ``is_causal=True`` is implemented.
        """
        # Ensure inputs are contiguous
        query = query.contiguous()
        key = key.contiguous()
        value = value.contiguous()
        # Handle default scale (matches torch's SDPA: 1/sqrt(head_dim))
        if scale is None:
            scale = 1.0 / (query.size(-1) ** 0.5)
        # Save parameters needed for backward
        ctx.scale = scale
        ctx.is_causal = is_causal
        ctx.dropout_p = dropout_p
        ctx.attn_mask = attn_mask
        # Prepare cumulative sequence lengths and max sequence lengths
        # Assuming batch sizes are consistent across query, key, and value
        batch_size, num_heads, seq_len_q, head_dim = query.shape
        seq_len_k = key.shape[2]
        # Flatten batch and head dimensions
        q = query.view(
            -1, seq_len_q, head_dim
        )  # [batch_size * num_heads, seq_len_q, head_dim]
        k = key.view(-1, seq_len_k, head_dim)
        v = value.view(-1, seq_len_k, head_dim)
        # Create cumulative sequence lengths: each (batch, head) pair is
        # treated as its own fixed-length sequence, so the offsets are just
        # multiples of seq_len.
        cu_seqlens_q = torch.arange(
            0,
            (batch_size * num_heads + 1) * seq_len_q,
            seq_len_q,
            dtype=torch.int32,
            device=query.device,
        )
        cu_seqlens_k = torch.arange(
            0,
            (batch_size * num_heads + 1) * seq_len_k,
            seq_len_k,
            dtype=torch.int32,
            device=key.device,
        )
        max_seqlen_q = seq_len_q
        max_seqlen_k = seq_len_k
        # Call your custom per-block int8 quantization function
        (
            q_int8,
            q_scale,
            k_int8,
            k_scale,
            cu_seqlens_q_scale,
            cu_seqlens_k_scale,
        ) = per_block_int8_varlen_triton(
            q, k, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k, sm_scale=scale
        )
        # Call your custom attention function
        if is_causal:
            output = attn_true_varlen(
                q_int8,
                k_int8,
                v,
                cu_seqlens_q,
                cu_seqlens_k,
                max_seqlen_q,
                q_scale,
                k_scale,
                cu_seqlens_q_scale,
                cu_seqlens_k_scale,
                output_dtype=query.dtype,
            )
        else:
            raise NotImplementedError("Non-causal attention is not implemented yet.")
        # Reshape output to match the expected shape
        output = output.view(batch_size, num_heads, seq_len_q, head_dim)
        # Save tensors for backward
        ctx.save_for_backward(
            query,
            key,
            value,
            q_int8,
            k_int8,
            q_scale,
            k_scale,
            cu_seqlens_q,
            cu_seqlens_k,
            cu_seqlens_q_scale,
            cu_seqlens_k_scale,
            output,
        )
        return output

    @staticmethod
    def backward(ctx, grad_output):
        """Backward pass: delegates to the varlen backward kernel and reshapes
        the flattened gradients back to ``[batch, heads, seq, dim]``."""
        (
            query,
            key,
            value,
            q_int8,
            k_int8,
            q_scale,
            k_scale,
            cu_seqlens_q,
            cu_seqlens_k,
            cu_seqlens_q_scale,
            cu_seqlens_k_scale,
            output,
        ) = ctx.saved_tensors
        scale = ctx.scale
        is_causal = ctx.is_causal
        dropout_p = ctx.dropout_p
        attn_mask = ctx.attn_mask
        # Flatten batch and head dimensions (same layout as in forward)
        batch_size, num_heads, seq_len_q, head_dim = query.shape
        seq_len_k = key.shape[2]
        grad_output = grad_output.contiguous()
        do = grad_output.view(-1, seq_len_q, head_dim)
        # Compute gradients w.r.t. q, k, v
        dq, dk, dv = sageattn_varlen_backward(
            do,
            query.view(-1, seq_len_q, head_dim),
            key.view(-1, seq_len_k, head_dim),
            value.view(-1, seq_len_k, head_dim),
            cu_seqlens_q,
            cu_seqlens_k,
            seq_len_q,
            seq_len_k,
            q_int8,
            k_int8,
            q_scale,
            k_scale,
            cu_seqlens_q_scale,
            cu_seqlens_k_scale,
            scale,
            is_causal,
        )
        # Reshape gradients to match the input shapes
        dq = dq.view(batch_size, num_heads, seq_len_q, head_dim)
        dk = dk.view(batch_size, num_heads, seq_len_k, head_dim)
        dv = dv.view(batch_size, num_heads, seq_len_k, head_dim)
        # Handle optional arguments: one return slot per forward() input.
        d_attn_mask = None  # Assuming attn_mask does not require gradients
        d_dropout_p = (
            None  # Dropout probability is a hyperparameter, typically not optimized
        )
        d_is_causal = None  # Not differentiable
        d_scale = None  # If scale is a tensor and requires grad, compute its gradient
        return dq, dk, dv, d_attn_mask, d_dropout_p, d_is_causal, d_scale
def scaled_dot_product_attention(
    query,
    key,
    value,
    attn_mask=None,
    dropout_p=0.0,
    is_causal=False,
    scale=None,
):
    """
    Drop-in replacement for ``torch.nn.functional.scaled_dot_product_attention``
    that routes the computation through ``SageAttentionFunction``.
    """
    args = (query, key, value, attn_mask, dropout_p, is_causal, scale)
    return SageAttentionFunction.apply(*args)
def monkeypatch_sdp_w_sage_attention():
    """
    Patch ``torch.nn.functional`` so ``scaled_dot_product_attention``
    resolves to the SageAttention-backed implementation in this module.
    """
    setattr(
        torch.nn.functional,
        "scaled_dot_product_attention",
        scaled_dot_product_attention,
    )

View File

@@ -0,0 +1,622 @@
"""
Copyright (c) 2024 by SageAttention team.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""
import math
import torch
import triton
import triton.language as tl
@triton.jit
def _attn_fwd_inner(
    acc,
    l_i,
    m_i,
    q,
    q_scale,
    kv_len,
    K_ptrs,
    K_scale_ptr,
    V_ptrs,
    stride_kn,
    stride_vn,
    start_m,
    H: tl.constexpr,
    BLOCK_M: tl.constexpr,
    HEAD_DIM: tl.constexpr,
    BLOCK_N: tl.constexpr,
    STAGE: tl.constexpr,
    offs_m: tl.constexpr,
    offs_n: tl.constexpr,
):
    # Inner loop of the online-softmax attention forward pass over K/V tiles.
    # STAGE == 1 covers the fully-visible key blocks strictly before the
    # current query block; STAGE == 2 covers the diagonal range that needs
    # the causal mask. acc / l_i / m_i are the running output accumulator,
    # softmax denominator, and row max (in log2 space, since exp2 is used).
    if STAGE == 1:
        lo, hi = 0, start_m * BLOCK_M
    elif STAGE == 2:
        lo, hi = start_m * BLOCK_M, (start_m + 1) * BLOCK_M
        lo = tl.multiple_of(lo, BLOCK_M)
    # Advance K/V tile pointers (and the per-block K scale) to the loop start.
    K_scale_ptr += (lo // BLOCK_N) * H
    K_ptrs += stride_kn * lo
    V_ptrs += stride_vn * lo
    for start_n in range(lo, hi, BLOCK_N):
        start_n = tl.multiple_of(start_n, BLOCK_N)
        # Mask out key positions past the end of this packed sequence.
        k_mask = offs_n[None, :] < (kv_len - start_n)
        k = tl.load(K_ptrs, mask=k_mask)
        k_scale = tl.load(K_scale_ptr)
        # INT8 q.k, dequantized by the per-block scales.
        qk = tl.dot(q, k).to(tl.float32) * q_scale * k_scale
        if STAGE == 2:
            # Causal mask on the diagonal blocks: keys must not follow queries.
            mask = offs_m[:, None] >= (start_n + offs_n[None, :])
            qk = qk + tl.where(mask, 0, -1.0e6)
            m_ij = tl.maximum(m_i, tl.max(qk, 1))
            qk -= m_ij[:, None]
        else:
            m_ij = tl.maximum(m_i, tl.max(qk, 1))
            qk = qk - m_ij[:, None]
        # Online softmax update in base-2 (the quantizer folds log2(e) into
        # the query scale, see quant_per_block_int8_kernel's sm_scale).
        p = tl.math.exp2(qk)
        l_ij = tl.sum(p, 1)
        alpha = tl.math.exp2(m_i - m_ij)
        l_i = l_i * alpha + l_ij
        acc = acc * alpha[:, None]
        v = tl.load(V_ptrs, mask=offs_n[:, None] < (kv_len - start_n))
        p = p.to(tl.float16)
        acc += tl.dot(p, v, out_dtype=tl.float16)
        m_i = m_ij
        # Advance to the next K/V tile and its scale.
        K_ptrs += BLOCK_N * stride_kn
        K_scale_ptr += H
        V_ptrs += BLOCK_N * stride_vn
    return acc, l_i, m_i
@triton.jit
def _attn_fwd(
    Q,
    K,
    V,
    cu_seqlens_q,
    cu_seqlens_k,
    Q_scale,
    K_scale,
    cu_seqlens_q_scale,
    cu_seqlens_k_scale,
    Out,
    stride_qh,
    stride_qn,
    stride_kh,
    stride_kn,
    stride_vh,
    stride_vn,
    stride_oh,
    stride_on,
    H: tl.constexpr,
    num_kv_groups: tl.constexpr,
    HEAD_DIM: tl.constexpr,
    BLOCK_M: tl.constexpr,
    BLOCK_N: tl.constexpr,
    STAGE: tl.constexpr,
):
    # Varlen attention forward kernel. Grid: (q blocks, q heads, batches).
    # Each program computes one BLOCK_M slab of output rows for one head of
    # one packed sequence, streaming K/V in BLOCK_N tiles via _attn_fwd_inner
    # (first the unmasked range, then the causal diagonal range).
    start_m = tl.program_id(0)
    off_z = tl.program_id(2).to(tl.int64)  # batch (sequence) index
    off_h = tl.program_id(1).to(tl.int64)  # query-head index
    # Sequence boundaries for this batch element in the packed q layout.
    cu_seqlens_q_start = tl.load(cu_seqlens_q + off_z)
    cu_seqlens_q_end = tl.load(cu_seqlens_q + off_z + 1)
    qo_len = cu_seqlens_q_end - cu_seqlens_q_start
    # This q block starts past the end of the sequence: nothing to do.
    if (start_m * BLOCK_M) >= qo_len:
        return
    # Per-block quantization scales are packed per sequence; K scales are
    # shared across each group of num_kv_groups query heads (GQA/MQA).
    cu_seq_lens_q_scale_start = tl.load(cu_seqlens_q_scale + off_z)
    cu_seq_lens_k_scale_start = tl.load(cu_seqlens_k_scale + off_z)
    q_scale_offset = cu_seq_lens_q_scale_start * H + off_h + start_m * H
    k_scale_offset = (
        cu_seq_lens_k_scale_start * (H // num_kv_groups) + off_h // num_kv_groups
    )
    cu_seqlens_k_start = tl.load(cu_seqlens_k + off_z)
    cu_seqlens_k_end = tl.load(cu_seqlens_k + off_z + 1)
    kv_len = cu_seqlens_k_end - cu_seqlens_k_start
    offs_m = start_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, HEAD_DIM)
    # Tile pointers. K is addressed transposed (HEAD_DIM x BLOCK_N) so that
    # tl.dot(q, k) in the inner loop contracts over the head dimension.
    Q_ptrs = (
        Q
        + (cu_seqlens_q_start * stride_qn + off_h * stride_qh)
        + offs_m[:, None] * stride_qn
        + offs_k[None, :]
    )
    Q_scale_ptr = Q_scale + q_scale_offset
    K_ptrs = (
        K
        + (cu_seqlens_k_start * stride_kn + (off_h // num_kv_groups) * stride_kh)
        + offs_n[None, :] * stride_kn
        + offs_k[:, None]
    )
    K_scale_ptr = K_scale + k_scale_offset
    V_ptrs = (
        V
        + (cu_seqlens_k_start * stride_vn + (off_h // num_kv_groups) * stride_vh)
        + offs_n[:, None] * stride_vn
        + offs_k[None, :]
    )
    O_block_ptr = (
        Out
        + (cu_seqlens_q_start * stride_on + off_h * stride_oh)
        + offs_m[:, None] * stride_on
        + offs_k[None, :]
    )
    # Online-softmax state: running row max, denominator, and accumulator.
    m_i = tl.zeros([BLOCK_M], dtype=tl.float32) - float("inf")
    l_i = tl.zeros([BLOCK_M], dtype=tl.float32) + 1.0
    acc = tl.zeros([BLOCK_M, HEAD_DIM], dtype=tl.float32)
    q = tl.load(Q_ptrs, mask=offs_m[:, None] < qo_len)
    q_scale = tl.load(Q_scale_ptr)
    # Pass 1: fully-visible key blocks (STAGE == 3 maps to inner stage 1).
    acc, l_i, m_i = _attn_fwd_inner(
        acc,
        l_i,
        m_i,
        q,
        q_scale,
        kv_len,
        K_ptrs,
        K_scale_ptr,
        V_ptrs,
        stride_kn,
        stride_vn,
        start_m,
        H // num_kv_groups,
        BLOCK_M,
        HEAD_DIM,
        BLOCK_N,
        4 - STAGE,
        offs_m,
        offs_n,
    )
    # Pass 2: diagonal blocks with the causal mask (inner stage 2). The
    # final m_i is not needed after this, hence the discard.
    acc, l_i, _ = _attn_fwd_inner(
        acc,
        l_i,
        m_i,
        q,
        q_scale,
        kv_len,
        K_ptrs,
        K_scale_ptr,
        V_ptrs,
        stride_kn,
        stride_vn,
        start_m,
        H // num_kv_groups,
        BLOCK_M,
        HEAD_DIM,
        BLOCK_N,
        2,
        offs_m,
        offs_n,
    )
    # Final normalization by the softmax denominator, then store.
    acc = acc / l_i[:, None]
    tl.store(O_block_ptr, acc.to(Out.type.element_ty), mask=(offs_m[:, None] < qo_len))
@triton.jit
def _attn_bwd_inner(
    dq_acc,
    dk_acc,
    dv_acc,
    l_i,
    m_i,
    q,
    k,
    v,
    do,
    q_scale,
    k_scale,
    kv_len,
    stride_kn,
    stride_vn,
    start_m,
    H,
    BLOCK_M: tl.constexpr,
    HEAD_DIM: tl.constexpr,
    BLOCK_N: tl.constexpr,
    STAGE: tl.constexpr,
    offs_m: tl.constexpr,
    offs_n: tl.constexpr,
):
    # Inner loop of the backward pass: walks the same K/V tile ranges as
    # _attn_fwd_inner (STAGE 1 = unmasked range, STAGE 2 = causal diagonal)
    # and accumulates dq/dk/dv for this query block. k, v, and k_scale are
    # raw pointers advanced per tile.
    if STAGE == 1:
        lo, hi = 0, start_m * BLOCK_M
    elif STAGE == 2:
        lo, hi = start_m * BLOCK_M, (start_m + 1) * BLOCK_M
        lo = tl.multiple_of(lo, BLOCK_M)
    k += stride_kn * lo
    v += stride_vn * lo
    for start_n in range(lo, hi, BLOCK_N):
        start_n = tl.multiple_of(start_n, BLOCK_N)
        k_mask = offs_n[None, :] < (kv_len - start_n)
        k_curr = tl.load(k, mask=k_mask)
        v_curr = tl.load(v, mask=k_mask)
        k_scale_curr = tl.load(k_scale)
        # Dequantized logits for this tile.
        # NOTE(review): recent Triton versions removed the `trans_b=` kwarg
        # from tl.dot — this likely needs tl.dot(q, tl.trans(k_curr)). Also
        # the caller loads K transposed (HEAD_DIM x BLOCK_N); audit the
        # dot/.T shapes in this loop against that layout before enabling.
        s = tl.dot(q, k_curr, trans_b=True).to(tl.float32) * q_scale * k_scale_curr
        if STAGE == 2:
            # Causal mask on the diagonal blocks.
            mask = offs_m[:, None] >= (start_n + offs_n[None, :])
            s = s + tl.where(mask, 0.0, -float("inf"))
            m_ij = tl.maximum(m_i, tl.max(s, 1))
            s = s - m_ij[:, None]
        else:
            m_ij = tl.maximum(m_i, tl.max(s, 1))
            s = s - m_ij[:, None]
        # Online softmax recurrence in base-2, mirroring the forward pass.
        p = tl.math.exp2(s)
        l_ij = tl.sum(p, 1)
        alpha = tl.math.exp2(m_i - m_ij)
        l_i = l_i * alpha + l_ij
        m_i = m_ij
        p = p / l_i[:, None]  # Normalize probabilities
        # NOTE(review): both p and do are divided by the *running* l_i here
        # rather than the final denominator saved from the forward pass;
        # this deviates from the standard flash-attention backward
        # recurrence — verify against a reference implementation.
        # Compute gradients
        # Compute softmax gradient
        do_scaled = do / l_i[:, None]
        dv_contrib = tl.dot(p.to(tl.float16).T, do_scaled.to(tl.float16))
        dv_acc += dv_contrib
        dp = tl.dot(do_scaled.to(tl.float16), v_curr.to(tl.float16).T)
        # Compute ds (gradient w.r.t. logits s)
        p_dp = p * dp
        sum_p_dp = tl.sum(p_dp, axis=1)
        ds = (p_dp - p * sum_p_dp[:, None]) * tl.math.log(2.0)  # Adjust for exp2
        # Compute gradients w.r.t q and k
        dq_contrib = tl.dot(ds.to(tl.float16), k_curr.to(tl.float16))
        dk_contrib = tl.dot(ds.to(tl.float16).T, q.to(tl.float16))
        dq_acc += dq_contrib * (q_scale * k_scale_curr)
        dk_acc += dk_contrib * (q_scale * k_scale_curr)
        # Advance to the next K/V tile and its per-block scale.
        k += BLOCK_N * stride_kn
        k_scale += H
        v += BLOCK_N * stride_vn
    return dq_acc, dk_acc, dv_acc, l_i, m_i
@triton.jit
def _attn_bwd(
    DO,
    Q,
    K,
    V,
    cu_seqlens_q,
    cu_seqlens_k,
    Q_scale,
    K_scale,
    cu_seqlens_q_scale,
    cu_seqlens_k_scale,
    L,
    M,
    DQ,
    DK,
    DV,
    stride_qh,
    stride_qn,
    stride_kh,
    stride_kn,
    stride_vh,
    stride_vn,
    H: tl.constexpr,
    num_kv_groups: tl.constexpr,
    HEAD_DIM: tl.constexpr,
    BLOCK_M: tl.constexpr,
    BLOCK_N: tl.constexpr,
    STAGE: tl.constexpr,
):
    # Varlen attention backward kernel. Grid: (q blocks, q heads, batches),
    # one program per BLOCK_M slab of query rows, mirroring _attn_fwd.
    # L/M hold per-row softmax denominator / running max from the forward
    # pass. NOTE(review): _attn_fwd in this file does not write L or M;
    # confirm where they are produced before relying on this kernel.
    start_m = tl.program_id(0)
    off_z = tl.program_id(2).to(tl.int64)  # batch (sequence) index
    off_h = tl.program_id(1).to(tl.int64)  # query-head index
    cu_seqlens_q_start = tl.load(cu_seqlens_q + off_z)
    cu_seqlens_q_end = tl.load(cu_seqlens_q + off_z + 1)
    qo_len = cu_seqlens_q_end - cu_seqlens_q_start
    # This q block starts past the end of the sequence: nothing to do.
    if (start_m * BLOCK_M) >= qo_len:
        return
    # Scale bookkeeping identical to the forward kernel (GQA-aware for K).
    cu_seq_lens_q_scale_start = tl.load(cu_seqlens_q_scale + off_z)
    cu_seq_lens_k_scale_start = tl.load(cu_seqlens_k_scale + off_z)
    q_scale_offset = cu_seq_lens_q_scale_start * H + off_h + start_m * H
    k_scale_offset = (
        cu_seq_lens_k_scale_start * (H // num_kv_groups) + off_h // num_kv_groups
    )
    cu_seqlens_k_start = tl.load(cu_seqlens_k + off_z)
    cu_seqlens_k_end = tl.load(cu_seqlens_k + off_z + 1)
    kv_len = cu_seqlens_k_end - cu_seqlens_k_start
    offs_m = start_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, HEAD_DIM)
    # Tile pointers; layouts match the forward kernel (K loaded transposed).
    Q_ptrs = (
        Q
        + (cu_seqlens_q_start * stride_qn + off_h * stride_qh)
        + offs_m[:, None] * stride_qn
        + offs_k[None, :]
    )
    DO_ptrs = (
        DO
        + (cu_seqlens_q_start * stride_qn + off_h * stride_qh)
        + offs_m[:, None] * stride_qn
        + offs_k[None, :]
    )
    Q_scale_ptr = Q_scale + q_scale_offset
    K_ptrs = (
        K
        + (cu_seqlens_k_start * stride_kn + (off_h // num_kv_groups) * stride_kh)
        + offs_n[None, :] * stride_kn
        + offs_k[:, None]
    )
    K_scale_ptr = K_scale + k_scale_offset
    V_ptrs = (
        V
        + (cu_seqlens_k_start * stride_vn + (off_h // num_kv_groups) * stride_vh)
        + offs_n[:, None] * stride_vn
        + offs_k[None, :]
    )
    DQ_ptrs = (
        DQ
        + (cu_seqlens_q_start * stride_qn + off_h * stride_qh)
        + offs_m[:, None] * stride_qn
        + offs_k[None, :]
    )
    DK_ptrs = (
        DK
        + (cu_seqlens_k_start * stride_kn + (off_h // num_kv_groups) * stride_kh)
        + offs_n[None, :] * stride_kn
        + offs_k[:, None]
    )
    DV_ptrs = (
        DV
        + (cu_seqlens_k_start * stride_vn + (off_h // num_kv_groups) * stride_vh)
        + offs_n[:, None] * stride_vn
        + offs_k[None, :]
    )
    # Saved softmax statistics for this slab of query rows.
    L_ptrs = L + (cu_seqlens_q_start + offs_m)
    M_ptrs = M + (cu_seqlens_q_start + offs_m)
    m_i = tl.load(M_ptrs, mask=offs_m < qo_len, other=float("-inf"))
    l_i = tl.load(L_ptrs, mask=offs_m < qo_len, other=1.0)
    dq_acc = tl.zeros([BLOCK_M, HEAD_DIM], dtype=tl.float32)
    dk_acc = tl.zeros([BLOCK_N, HEAD_DIM], dtype=tl.float32)
    dv_acc = tl.zeros([BLOCK_N, HEAD_DIM], dtype=tl.float32)
    q = tl.load(Q_ptrs, mask=offs_m[:, None] < qo_len)
    do = tl.load(DO_ptrs, mask=offs_m[:, None] < qo_len)
    q_scale = tl.load(Q_scale_ptr)
    # Pass 1: fully-visible key blocks (STAGE == 3 maps to inner stage 1).
    dq_acc, dk_acc, dv_acc, l_i, m_i = _attn_bwd_inner(
        dq_acc,
        dk_acc,
        dv_acc,
        l_i,
        m_i,
        q,
        K_ptrs,
        V_ptrs,
        do,
        q_scale,
        K_scale_ptr,
        kv_len,
        stride_kn,
        stride_vn,
        start_m,
        H // num_kv_groups,
        BLOCK_M,
        HEAD_DIM,
        BLOCK_N,
        4 - STAGE,
        offs_m,
        offs_n,
    )
    # Pass 2: diagonal blocks with the causal mask (inner stage 2).
    dq_acc, dk_acc, dv_acc, l_i, m_i = _attn_bwd_inner(
        dq_acc,
        dk_acc,
        dv_acc,
        l_i,
        m_i,
        q,
        K_ptrs,
        V_ptrs,
        do,
        q_scale,
        K_scale_ptr,
        kv_len,
        stride_kn,
        stride_vn,
        start_m,
        H // num_kv_groups,
        BLOCK_M,
        HEAD_DIM,
        BLOCK_N,
        2,
        offs_m,
        offs_n,
    )
    tl.store(DQ_ptrs, dq_acc.to(DQ.dtype.element_ty), mask=offs_m[:, None] < qo_len)
    # NOTE(review): the DK/DV stores below are addressed independently of
    # start_m, so every query-block program writes the same K/V gradient
    # tiles and the last writer wins — accumulation (e.g. tl.atomic_add)
    # looks required. Also DK_ptrs is laid out (HEAD_DIM, BLOCK_N) while
    # dk_acc is (BLOCK_N, HEAD_DIM); audit these shapes before use.
    tl.store(DK_ptrs, dk_acc.to(DK.dtype.element_ty), mask=offs_n[None, :] < kv_len)
    tl.store(DV_ptrs, dv_acc.to(DV.dtype.element_ty), mask=offs_n[:, None] < kv_len)
def forward(
    q,
    k,
    v,
    cu_seqlens_q,
    cu_seqlens_k,
    max_seqlen_q,
    q_scale,
    k_scale,
    cu_seqlens_q_scale,
    cu_seqlens_k_scale,
    output_dtype=torch.float16,
):
    """
    Launch the varlen SageAttention forward kernel.

    q/k/v are packed (total_tokens, heads, head_dim) tensors (q/k INT8 with
    per-block scales, v full precision); returns the attention output with
    the same shape as q, in output_dtype.
    """
    BLOCK_M = 128
    BLOCK_N = 64
    num_batches = cu_seqlens_q.shape[0] - 1
    _, num_qo_heads, head_dim = q.shape
    num_kv_heads = k.shape[1]
    out = torch.empty(q.shape, dtype=output_dtype, device=q.device)
    # One program per (query block, query head, batch element).
    grid = (triton.cdiv(max_seqlen_q, BLOCK_M), num_qo_heads, num_batches)
    _attn_fwd[grid](
        q,
        k,
        v,
        cu_seqlens_q,
        cu_seqlens_k,
        q_scale,
        k_scale,
        cu_seqlens_q_scale,
        cu_seqlens_k_scale,
        out,
        q.stride(1),
        q.stride(0),
        k.stride(1),
        k.stride(0),
        v.stride(1),
        v.stride(0),
        out.stride(1),
        out.stride(0),
        num_qo_heads,
        num_qo_heads // num_kv_heads,  # GQA group size (num_kv_groups)
        BLOCK_M=BLOCK_M,
        BLOCK_N=BLOCK_N,
        HEAD_DIM=head_dim,
        STAGE=3,
        num_warps=4 if head_dim == 64 else 8,
        num_stages=4,
    )
    return out
def backward(
    do,
    q,
    k,
    v,
    cu_seqlens_q,
    cu_seqlens_k,
    max_seqlen_q,
    q_scale,
    k_scale,
    cu_seqlens_q_scale,
    cu_seqlens_k_scale,
    l,
    m,
    output_dtype=torch.float16,
):
    """
    Launch the varlen SageAttention backward kernel and return (dq, dk, dv).

    do is the upstream gradient of the attention output, packed like q.
    l and m are the per-row softmax denominator and running max saved from
    the forward pass; the remaining arguments mirror forward().
    """
    BLOCK_M = 128
    BLOCK_N = 64
    stage = 3  # causal; the kernel maps this to its two inner stages
    b = cu_seqlens_q.shape[0] - 1
    _, h_qo, head_dim = q.shape
    _, h_kv, _ = k.shape
    num_kv_groups = h_qo // h_kv
    # Zero-initialized so the kernel can accumulate into them.
    dq = torch.zeros_like(q, dtype=output_dtype)
    dk = torch.zeros_like(k, dtype=output_dtype)
    dv = torch.zeros_like(v, dtype=output_dtype)
    # One program per (query block, query head, batch element).
    grid = (triton.cdiv(max_seqlen_q, BLOCK_M), h_qo, b)
    _attn_bwd[grid](
        do,
        q,
        k,
        v,
        cu_seqlens_q,
        cu_seqlens_k,
        q_scale,
        k_scale,
        cu_seqlens_q_scale,
        cu_seqlens_k_scale,
        l,
        m,
        dq,
        dk,
        dv,
        q.stride(1),
        q.stride(0),
        k.stride(1),
        k.stride(0),
        v.stride(1),
        v.stride(0),
        h_qo,
        num_kv_groups,
        HEAD_DIM=head_dim,
        BLOCK_M=BLOCK_M,
        BLOCK_N=BLOCK_N,
        STAGE=stage,
        num_warps=4 if head_dim == 64 else 8,
        num_stages=4,
    )
    return dq, dk, dv
# class TritonAttentionFunction(torch.autograd.Function):
# @staticmethod
# def forward(ctx, q, k, v, cu_seqlens_q, cu_seqlens_k, q_scale, k_scale, cu_seqlens_q_scale, cu_seqlens_k_scale):
# l = torch.zeros(q.shape[0], device=q.device, dtype=torch.float32)
# m = torch.zeros(q.shape[0], device=q.device, dtype=torch.float32)
# output = forward(q, k, v, cu_seqlens_q, cu_seqlens_k, q.shape[0], q_scale, k_scale, cu_seqlens_q_scale, cu_seqlens_k_scale, l, m)
# ctx.save_for_backward(q, k, v, cu_seqlens_q, cu_seqlens_k, q_scale, k_scale, cu_seqlens_q_scale, cu_seqlens_k_scale, l, m)
# return output
#
# @staticmethod
# def backward(ctx, do):
# q, k, v, cu_seqlens_q, cu_seqlens_k, q_scale, k_scale, cu_seqlens_q_scale, cu_seqlens_k_scale, l, m = ctx.saved_tensors
# dq, dk, dv = backward(
# do, q, k, v,
# cu_seqlens_q, cu_seqlens_k,
# q.shape[0], q_scale, k_scale,
# cu_seqlens_q_scale, cu_seqlens_k_scale,
# l, m,
# )
# return dq, dk, dv, None, None, None, None, None, None

View File

@@ -0,0 +1,158 @@
"""
Copyright (c) 2024 by SageAttention team.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""
import torch
import triton
import triton.language as tl
@triton.jit
def quant_per_block_int8_kernel(
    Input,
    Output,
    Scale,
    cu_seqlens_input,
    cu_seqlens_scale,
    stride_ih,
    stride_in,
    stride_oh,
    stride_on,
    sm_scale,
    H: tl.constexpr,
    C: tl.constexpr,
    BLK: tl.constexpr,
):
    # Quantize one BLK-row block of one head to INT8 with a single
    # per-(block, head) float32 scale. Grid: (blocks, heads, batches).
    off_blk = tl.program_id(0)
    off_h = tl.program_id(1)
    off_b = tl.program_id(2)
    # Sequence boundaries for this batch element in the packed layout.
    cu_seqlens_input_start = tl.load(cu_seqlens_input + off_b)
    cu_seqlens_input_end = tl.load(cu_seqlens_input + off_b + 1)
    L = cu_seqlens_input_end - cu_seqlens_input_start
    # This block starts past the end of the sequence: nothing to do.
    if (off_blk * BLK) >= L:
        return
    cu_seqlens_scale_start = tl.load(cu_seqlens_scale + off_b)
    offs_n = off_blk * BLK + tl.arange(0, BLK)
    offs_k = tl.arange(0, C)
    input_ptrs = (
        Input
        + cu_seqlens_input_start * stride_in
        + off_h * stride_ih
        + offs_n[:, None] * stride_in
        + offs_k[None, :]
    )
    output_ptrs = (
        Output
        + cu_seqlens_input_start * stride_on
        + off_h * stride_oh
        + offs_n[:, None] * stride_on
        + offs_k[None, :]
    )
    # Scales are laid out (total_blocks, H), hence the H-strided offset.
    scale_ptrs = Scale + cu_seqlens_scale_start * H + off_h + off_blk * H
    x = tl.load(input_ptrs, mask=offs_n[:, None] < L)
    x = x.to(tl.float32)
    # sm_scale is folded into the values before quantization (for q it
    # carries softmax_scale * log2(e); for k it is 1.0 — see per_block_int8).
    x *= sm_scale
    # Symmetric per-block scale mapping the max magnitude to 127.
    scale = tl.max(tl.abs(x)) / 127.0
    # NOTE(review): an all-zero block yields scale == 0 and a 0/0 division
    # below (NaN); consider clamping scale to a small epsilon.
    x_int8 = x / scale
    # Round half away from zero before truncation to int8.
    x_int8 += 0.5 * tl.where(x_int8 >= 0, 1, -1)
    x_int8 = x_int8.to(tl.int8)
    tl.store(output_ptrs, x_int8, mask=offs_n[:, None] < L)
    tl.store(scale_ptrs, scale)
def per_block_int8(
    q,
    k,
    cu_seqlens_q,
    cu_seqlens_k,
    max_seqlen_q,
    max_seqlen_k,
    BLKQ=128,
    BLKK=64,
    sm_scale=None,
):
    """
    Quantize packed q and k to INT8 in blocks of BLKQ/BLKK rows.

    Returns (q_int8, q_scale, k_int8, k_scale, cu_seqlens_q_scale,
    cu_seqlens_k_scale): the quantized tensors, their per-(block, head)
    float32 scales, and cumulative offsets into those scale tensors.
    """
    num_batches = cu_seqlens_q.shape[0] - 1
    heads_q = q.shape[1]
    heads_k = k.shape[1]
    head_dim = q.shape[-1]
    if sm_scale is None:
        sm_scale = head_dim**-0.5

    # Number of scale blocks per sequence (ceil division), then cumulative
    # offsets with a leading zero, matching the packed cu_seqlens layout.
    blocks_per_seq_q = (cu_seqlens_q[1:] - cu_seqlens_q[:-1] + BLKQ - 1) // BLKQ
    blocks_per_seq_k = (cu_seqlens_k[1:] - cu_seqlens_k[:-1] + BLKK - 1) // BLKK
    cu_seqlens_q_scale = torch.nn.functional.pad(
        torch.cumsum(blocks_per_seq_q, dim=0), (1, 0), value=0
    )
    cu_seqlens_k_scale = torch.nn.functional.pad(
        torch.cumsum(blocks_per_seq_k, dim=0), (1, 0), value=0
    )

    q_int8 = torch.empty(q.shape, dtype=torch.int8, device=q.device)
    k_int8 = torch.empty(k.shape, dtype=torch.int8, device=k.device)
    q_scale = torch.empty(
        (cu_seqlens_q_scale[-1], heads_q), device=q.device, dtype=torch.float32
    )
    k_scale = torch.empty(
        (cu_seqlens_k_scale[-1], heads_k), device=k.device, dtype=torch.float32
    )

    # q absorbs the softmax scale times log2(e) (the attention kernel works
    # in exp2 space); k is quantized with unit scale.
    quant_per_block_int8_kernel[
        ((max_seqlen_q + BLKQ - 1) // BLKQ, heads_q, num_batches)
    ](
        q,
        q_int8,
        q_scale,
        cu_seqlens_q,
        cu_seqlens_q_scale,
        q.stride(1),
        q.stride(0),
        q_int8.stride(1),
        q_int8.stride(0),
        sm_scale=(sm_scale * 1.44269504),
        H=heads_q,
        C=head_dim,
        BLK=BLKQ,
    )
    quant_per_block_int8_kernel[
        ((max_seqlen_k + BLKK - 1) // BLKK, heads_k, num_batches)
    ](
        k,
        k_int8,
        k_scale,
        cu_seqlens_k,
        cu_seqlens_k_scale,
        k.stride(1),
        k.stride(0),
        k_int8.stride(1),
        k_int8.stride(0),
        sm_scale=1.0,
        H=heads_k,
        C=head_dim,
        BLK=BLKK,
    )
    return q_int8, q_scale, k_int8, k_scale, cu_seqlens_q_scale, cu_seqlens_k_scale

View File

@@ -188,7 +188,7 @@ def integrate_lora_mlp_patch(peft_model: PeftModelForCausalLM):
for module in layer_modules
)
mlp_not_dora = all(
getattr(module, "lora_magnitude_vector", None) is None
len(getattr(module, "lora_magnitude_vector", []) or []) == 0
for module in layer_modules
)
@@ -213,7 +213,7 @@ def integrate_lora_patch(peft_model: PeftModelForCausalLM, cfg):
for module in layer_modules
)
qkv_not_dora = all(
getattr(module, "lora_magnitude_vector", None) is None
len(getattr(module, "lora_magnitude_vector", []) or []) == 0
for module in layer_modules
)
@@ -232,7 +232,7 @@ def integrate_lora_patch(peft_model: PeftModelForCausalLM, cfg):
for module in layer_modules
)
o_not_dora = all(
getattr(module, "lora_magnitude_vector", None) is None
len(getattr(module, "lora_magnitude_vector", []) or []) == 0
for module in layer_modules
)

View File

@@ -4,6 +4,9 @@ import functools
import pynvml
import torch
from pynvml.nvml import NVMLError
from transformers.utils.import_utils import is_torch_npu_available
from axolotl.utils.distributed import get_device_type
def check_cuda_device(default_value):
@@ -53,6 +56,12 @@ def mps_memory_usage_all():
return usage, reserved - usage, 0
def npu_memory_usage_all(device=0):
    """Return (used_gb, cached_gb, misc_gb) for the given Ascend NPU device."""
    gib = 1024.0**3
    usage = torch.npu.memory_allocated(device) / gib
    reserved = torch.npu.memory_reserved(device) / gib
    return usage, reserved - usage, 0
@check_cuda_device(0.0)
def gpu_memory_usage_smi(device=0):
if isinstance(device, torch.device):
@@ -69,8 +78,11 @@ def gpu_memory_usage_smi(device=0):
def log_gpu_memory_usage(log, msg, device):
cur_device = get_device_type()
if torch.backends.mps.is_available():
usage, cache, misc = mps_memory_usage_all()
elif "npu" in str(cur_device) and is_torch_npu_available():
usage, cache, misc = npu_memory_usage_all(device)
else:
usage, cache, misc = gpu_memory_usage_all(device)
extras = []
@@ -79,6 +91,7 @@ def log_gpu_memory_usage(log, msg, device):
if misc > 0:
extras.append(f"+{misc:.03f}GB misc")
log.info(
f"GPU memory usage {msg}: {usage:.03f}GB ({', '.join(extras)})", stacklevel=2
f"{str(cur_device)} memory usage {msg}: {usage:.03f}GB ({', '.join(extras)})",
stacklevel=2,
)
return usage, cache, misc

View File

@@ -5,6 +5,7 @@ from typing import Optional
import torch
from transformers.utils import is_torch_bf16_gpu_available
from transformers.utils.import_utils import is_torch_npu_available
from axolotl.integrations.config import merge_input_args
from axolotl.utils.bench import log_gpu_memory_usage
@@ -29,7 +30,10 @@ def choose_device(cfg):
if torch.backends.mps.is_available():
return "mps"
raise SystemError("No CUDA/mps device found")
if is_torch_npu_available():
return f"npu:{cfg.local_rank}"
raise SystemError("No CUDA/mps/npu device found")
except Exception: # pylint: disable=broad-exception-caught
return "cpu"
@@ -39,6 +43,8 @@ def choose_device(cfg):
else:
if cfg.device.startswith("cuda"):
cfg.device_map = {"": torch.cuda.current_device()}
elif cfg.device.startswith("npu"):
cfg.device_map = {"npu": torch.npu.current_device()}
else:
cfg.device_map = {"": cfg.device}

View File

@@ -7,7 +7,6 @@ Module for pydantic models for configuration
import logging
import os
from enum import Enum
from importlib.metadata import version
from typing import Annotated, Any, Dict, List, Literal, Optional, Tuple, Union
from pydantic import (
@@ -20,6 +19,7 @@ from pydantic import (
)
from transformers import SchedulerType
from transformers.training_args import OptimizerNames
from transformers.utils.import_utils import is_torch_npu_available
from axolotl.utils.config.models.internals import GPUCapabilities
@@ -1314,6 +1314,7 @@ class AxolotlInputConfig(
and data.get("gradient_checkpointing_kwargs", {})
and data.get("gradient_checkpointing_kwargs", {}).get("use_reentrant")
is False
and data.get("deepspeed", "") is not None
and "zero3" in data.get("deepspeed", "")
):
# may result in:
@@ -1425,21 +1426,6 @@ class AxolotlInputConfig(
)
return data
@model_validator(mode="before")
@classmethod
def check_unsloth_xformers_version(cls, data):
if (
data.get("unsloth_lora_mlp")
or data.get("unsloth_lora_qkv")
or data.get("unsloth_lora_o")
):
xformers_version = version("xformers")
if xformers_version == "0.0.27":
raise ValueError(
"xformers version 0.0.27 is not supported with unsloth. Please downgrade to 0.0.26.post1"
)
return data
@model_validator(mode="before")
@classmethod
def check_torch_compile_deepspeed(cls, data):
@@ -1449,6 +1435,40 @@ class AxolotlInputConfig(
)
return data
    @model_validator(mode="before")
    @classmethod
    def check_npu_config(cls, data):
        """Reject configuration options that are not supported on Ascend NPU:
        fused attention kernels, optimizers whose names contain "bit"
        (8/4-bit variants), quantized model loading, and tf32."""
        if is_torch_npu_available():
            # check attention config
            attn_list = ["flash_attention", "sdp_attention", "s2_attention"]
            for attn in attn_list:
                if data.get(attn):
                    raise NotImplementedError(
                        f"{attn} is currently not supported in Ascend npu, please disable this configuration."
                    )
            # check quant config
            if data.get("optimizer") is not None and "bit" in data.get("optimizer"):
                optimizer = data.get("optimizer")
                raise NotImplementedError(
                    f"{optimizer} is currently not supported in Ascend npu, choose another one please."
                )
            quant_list = ["load_in_8bit", "load_in_4bit"]
            for quant in quant_list:
                if data.get(quant):
                    raise NotImplementedError(
                        f"Quantification is currently not supported in Ascend npu, please disable {quant}."
                    )
            # check dtype config
            if data.get("tf32"):
                raise NotImplementedError(
                    "tf32 dtype is currently not supported in Ascend npu, please disable this configuration"
                )
        return data
class AxolotlConfigWCapabilities(AxolotlInputConfig):
"""wrapper to valdiate gpu capabilities with the configured options"""

View File

@@ -9,10 +9,44 @@ from datetime import timedelta
import torch
import torch.distributed as dist
from accelerate import PartialState
from transformers.utils.import_utils import (
is_torch_cuda_available,
is_torch_mps_available,
is_torch_npu_available,
)
distributed_state = None # pylint: disable=invalid-name
def get_device_type():
    """Return the available accelerator as a torch.device, preferring
    cuda, then mps, then npu, falling back to cpu."""
    if is_torch_cuda_available():
        return torch.device("cuda")
    if is_torch_mps_available():
        return torch.device("mps")
    if is_torch_npu_available():
        return torch.device("npu")
    return torch.device("cpu")
def get_device_count():
    """Number of accelerator devices for the active backend (1 for cpu/mps)."""
    device_kind = str(get_device_type())
    if "cuda" in device_kind:
        return torch.cuda.device_count()
    if "npu" in device_kind:
        return torch.npu.device_count()
    return 1
def get_current_device():
    """Index of the current accelerator device (0 for cpu/mps backends)."""
    device_kind = str(get_device_type())
    if "cuda" in device_kind:
        return torch.cuda.current_device()
    if "npu" in device_kind:
        return torch.npu.current_device()
    return 0
def is_distributed():
"""
Check if distributed training is initialized.
@@ -91,7 +125,7 @@ def gather_scalar_from_all_ranks(fn, world_size=1): # pylint: disable=invalid-n
if not is_distributed():
return [value_scalar]
value_tensor = torch.tensor(
value_scalar, device=torch.cuda.current_device()
value_scalar, device=f"{get_device_type()}:{get_current_device()}"
).float()
if not is_main_process():
@@ -115,13 +149,14 @@ def broadcast_dict(vals: dict):
if not is_distributed():
return vals
cur_device = get_device_type()
if is_main_process():
data_byte = pickle.dumps(vals)
data_tensor = torch.ByteTensor(list(data_byte)).to("cuda")
data_size = torch.IntTensor([len(data_byte)]).to("cuda")
data_tensor = torch.ByteTensor(list(data_byte)).to(cur_device)
data_size = torch.IntTensor([len(data_byte)]).to(cur_device)
else:
data_tensor = torch.empty([1024], dtype=torch.uint8, device="cuda")
data_size = torch.IntTensor([0]).to("cuda")
data_tensor = torch.empty([1024], dtype=torch.uint8, device=cur_device)
data_size = torch.IntTensor([0]).to(cur_device)
dist.broadcast(data_size, 0)
if not is_main_process():
@@ -150,14 +185,15 @@ def compute_and_broadcast(fn): # pylint: disable=invalid-name
Returns:
- The computed value (int or float).
"""
cur_device = f"{get_device_type()}:{get_current_device()}"
if is_main_process():
value_scalar = fn()
value_tensor = torch.tensor(
value_scalar, device=torch.cuda.current_device(), dtype=torch.float32
value_scalar, device=cur_device, dtype=torch.float32
)
else:
value_tensor = torch.tensor(
0.0, device=torch.cuda.current_device(), dtype=torch.float32
0.0, device=cur_device, dtype=torch.float32
) # Placeholder tensor
# Broadcast the tensor to all processes.
@@ -184,7 +220,7 @@ def gather_from_all_ranks(fn, world_size=1): # pylint: disable=invalid-name
"""
value_scalar = fn()
value_tensor = torch.tensor(
value_scalar, device=torch.cuda.current_device()
value_scalar, device=f"{get_device_type()}:{get_current_device()}"
).float()
# Placeholder tensor for gathering results

View File

@@ -46,6 +46,7 @@ from transformers.integrations.deepspeed import (
)
from axolotl.common.architectures import MOE_ARCH_BLOCK
from axolotl.integrations.sageattention.lib.core import monkeypatch_sdp_w_sage_attention
from axolotl.models.mamba import fix_mamba_attn_for_loss
from axolotl.monkeypatch.multipack import (
SUPPORTED_MULTIPACK_MODEL_TYPES,
@@ -55,7 +56,7 @@ from axolotl.prompt_tokenizers import LLAMA_DEFAULT_EOS_TOKEN
from axolotl.utils.bench import log_gpu_memory_usage
from axolotl.utils.chat_templates import get_chat_template_from_config
from axolotl.utils.dict import DictDefault
from axolotl.utils.distributed import zero_only
from axolotl.utils.distributed import get_device_count, get_device_type, zero_only
from axolotl.utils.gradient_checkpointing import hf_grad_checkpoint_unsloth_wrapper
from axolotl.utils.lora_embeddings import get_linear_embedding_layers
from axolotl.utils.model_shard_quant import load_sharded_model, load_sharded_model_quant
@@ -570,7 +571,8 @@ class ModelLoader:
)
max_memory = {}
for i in range(torch.cuda.device_count()):
num_device = get_device_count()
for i in range(num_device):
max_memory[i] = gpu_memory_limit
max_memory["cpu"] = "256GiB" # something sufficiently large to fit anything
@@ -595,8 +597,11 @@ class ModelLoader:
self.model_kwargs["device_map"] = device_map
self.model_kwargs["torch_dtype"] = self.cfg.torch_dtype
if torch.backends.mps.is_available():
cur_device = get_device_type()
if "mps" in str(cur_device):
self.model_kwargs["device_map"] = "mps:0"
elif "npu" in str(cur_device):
self.model_kwargs["device_map"] = "npu:0"
# TODO can we put the reference model on it's own gpu? I think we have to move logits around to calculate loss
# if cfg.rl:
@@ -703,6 +708,7 @@ class ModelLoader:
self.model_config._attn_implementation = ( # pylint: disable=protected-access
"sdpa"
)
monkeypatch_sdp_w_sage_attention()
elif self.cfg.eager_attention:
self.model_kwargs["attn_implementation"] = "eager"
self.model_config._attn_implementation = ( # pylint: disable=protected-access
@@ -1050,7 +1056,11 @@ class ModelLoader:
self.ajust_model_config()
# log device memory usage
if hasattr(self.model, "device") and self.model.device.type in ("cuda", "mps"):
if hasattr(self.model, "device") and self.model.device.type in (
"cuda",
"mps",
"npu",
):
log_gpu_memory_usage(LOG, "after model load", self.model.device)
# make sure these are fp32 per Ramesh et al. (2021)
@@ -1118,9 +1128,9 @@ class ModelLoader:
and not skip_move_to_device
):
# TODO revaldate this conditional
self.model.to(f"cuda:{self.cfg.local_rank}")
self.model.to(f"{str(get_device_type())}:{self.cfg.local_rank}")
if torch.cuda.device_count() > 1 and int(os.getenv("WORLD_SIZE", "1")) == 1:
if get_device_count() > 1 and int(os.getenv("WORLD_SIZE", "1")) == 1:
setattr(self.model, "is_parallelizable", True)
setattr(self.model, "model_parallel", True)

View File

@@ -68,6 +68,53 @@ class TestValidation(BaseValidation):
assert cfg.train_on_inputs is False
assert cfg.weight_decay is None
    def test_zero3_qlora_use_reentrant_false(self, minimal_cfg):
        # qlora + deepspeed zero3 + use_reentrant: False should validate,
        # but must warn about a possible CheckpointError on recomputed values.
        test_cfg = DictDefault(
            {
                "deepspeed": "deepspeed_configs/zero3_bf16.json",
                "gradient_checkpointing": True,
                "gradient_checkpointing_kwargs": {"use_reentrant": False},
                "load_in_4bit": True,
                "adapter": "qlora",
            }
            | minimal_cfg
        )
        with self._caplog.at_level(logging.WARNING):
            validate_config(test_cfg)
            assert (
                "qlora + zero3 with use_reentrant: false may result in a CheckpointError about recomputed values"
                in self._caplog.records[0].message
            )
    def test_deepspeed_empty(self, minimal_cfg):
        # deepspeed set to an empty string must not crash validation
        # ("" is not None but contains no "zero3", so no warning path fires).
        test_cfg = DictDefault(
            {
                "deepspeed": "",
                "gradient_checkpointing": True,
                "gradient_checkpointing_kwargs": {"use_reentrant": False},
                "load_in_4bit": True,
                "adapter": "qlora",
            }
            | minimal_cfg
        )
        _ = validate_config(test_cfg)
    def test_deepspeed_not_set(self, minimal_cfg):
        # Regression test: deepspeed left as None previously raised a
        # "'NoneType' is not iterable" error during the zero3 check.
        test_cfg = DictDefault(
            {
                "deepspeed": None,
                "gradient_checkpointing": True,
                "gradient_checkpointing_kwargs": {"use_reentrant": False},
                "load_in_4bit": True,
                "adapter": "qlora",
            }
            | minimal_cfg
        )
        _ = validate_config(test_cfg)
def test_datasets_min_length(self):
cfg = DictDefault(
{