fix the sed command to replace the version with the tag

add long_description for pypi push (#555)
replace tags, build dist for pypi publish (#553)
2023-09-11 13:44:19 -04:00 · 2023-09-11 13:34:29 -04:00 · 2023-09-11 13:25:41 -04:00 · 2023-09-11 12:35:45 -04:00 · 2023-09-11 10:33:42 -04:00 · 2023-09-11 10:27:17 -04:00
53 changed files with 2644 additions and 691 deletions
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -13,21 +13,16 @@ jobs:
      fail-fast: false
      matrix:
        include:
-          - cuda: cu118
+          - cuda: 118
            cuda_version: 11.8.0
            python_version: "3.9"
            pytorch: 2.0.1
            axolotl_extras:
-          - cuda: cu118
+          - cuda: 118
            cuda_version: 11.8.0
            python_version: "3.10"
            pytorch: 2.0.1
            axolotl_extras:
-          - cuda: cu118
-            cuda_version: 11.8.0
-            python_version: "3.9"
-            pytorch: 2.0.1
-            axolotl_extras: gptq
    runs-on: self-hosted
    steps:
      - name: Checkout
@@ -49,10 +44,11 @@ jobs:
        with:
          context: .
          build-args: |
-            BASE_TAG=${{ github.ref_name }}-base-py${{ matrix.python_version }}-${{ matrix.cuda }}-${{ matrix.pytorch }}
+            BASE_TAG=${{ github.ref_name }}-base-py${{ matrix.python_version }}-cu${{ matrix.cuda }}-${{ matrix.pytorch }}
+            CUDA=${{ matrix.cuda }}
          file: ./docker/Dockerfile
          push: ${{ github.event_name != 'pull_request' }}
-          tags: ${{ steps.metadata.outputs.tags }}-py${{ matrix.python_version }}-${{ matrix.cuda }}-${{ matrix.pytorch }}${{ matrix.axolotl_extras != '' && '-' || '' }}${{ matrix.axolotl_extras }}
+          tags: ${{ steps.metadata.outputs.tags }}-py${{ matrix.python_version }}-cu${{ matrix.cuda }}-${{ matrix.pytorch }}${{ matrix.axolotl_extras != '' && '-' || '' }}${{ matrix.axolotl_extras }}
          labels: ${{ steps.metadata.outputs.labels }}
  build-axolotl-runpod:
    needs: build-axolotl
@@ -72,11 +68,6 @@ jobs:
            pytorch: 2.0.1
            axolotl_extras:
            is_latest: true
-          - cuda: 118
-            cuda_version: 11.8.0
-            python_version: "3.9"
-            pytorch: 2.0.1
-            axolotl_extras: gptq
    runs-on: self-hosted
    steps:
      - name: Checkout
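The `main.yml` change above moves the `cu` prefix out of the matrix value and into the tag template, so the bare value (`118`) can also be passed to the Dockerfile as the `CUDA` build argument. The sketch below shows how the resulting tag is assembled. `image_tag` is a hypothetical helper written for illustration, and the base value `winglian/axolotl:main` is an assumption about what `steps.metadata.outputs.tags` resolves to.

```python
def image_tag(base: str, python_version: str, cuda: str, pytorch: str, extras: str = "") -> str:
    """Compose an image tag the way the workflow template does (illustrative only)."""
    # The "cu" prefix now lives in the template, not in the matrix value.
    tag = f"{base}-py{python_version}-cu{cuda}-{pytorch}"
    # The workflow appends "-<extras>" only when axolotl_extras is non-empty.
    if extras:
        tag += f"-{extras}"
    return tag


print(image_tag("winglian/axolotl:main", "3.10", "118", "2.0.1"))
```

With these inputs the result has the same shape as the image name used in the README's `docker run` example.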
--- a/.github/workflows/pypi.yml
+++ b/.github/workflows/pypi.yml
@@ -0,0 +1,45 @@
+name: publish pypi
+
+on:
+  push:
+    tags:
+      - '*'
+
+jobs:
+  pypi-publish:
+    name: Upload release to PyPI
+    runs-on: ubuntu-latest
+    environment:
+      name: pypi
+      url: https://pypi.org/p/axolotl
+    permissions:
+      id-token: write  # IMPORTANT: this permission is mandatory for trusted publishing
+    steps:
+      - name: Check out repository code
+        uses: actions/checkout@v3
+
+      - name: Setup Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: "3.10"
+
+      - name: Install dependencies
+        run: |
+          pip3 install wheel
+          pip3 install -e .
+          pip3 install -r requirements-tests.txt
+
+      - name: Extract tag name
+        id: tag
+        run: echo ::set-output name=TAG_NAME::$(echo $GITHUB_REF | cut -d / -f 3)
+
+      - name: Update version in setup.py
+        run: >-
+          sed -i -E 's/version="([0-9.]+)",/version="${{ steps.tag.outputs.TAG_NAME }}",/g' setup.py
+
+      - name: Build a binary wheel
+        run: >-
+          python setup.py sdist bdist_wheel
+
+      - name: Publish package distributions to PyPI
+        uses: pypa/gh-action-pypi-publish@release/v1
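The two shell steps in `pypi.yml` (extract the tag with `cut`, then rewrite `setup.py` with `sed`) can be checked locally with a small Python equivalent. This is a sketch under stated assumptions: `tag_from_ref` and `set_version` are hypothetical helpers, and the sample `setup.py` text and tag are made up for the example.

```python
import re


def tag_from_ref(github_ref: str) -> str:
    """Return the third '/'-separated field, like `cut -d / -f 3`.

    Assumes a ref of the form refs/tags/<tag>.
    """
    return github_ref.split("/")[2]


def set_version(setup_py: str, tag: str) -> str:
    """Replace version="<digits and dots>", with the tag, like the sed expression."""
    # A callable replacement avoids backslash escapes in the tag being interpreted.
    return re.sub(r'version="[0-9.]+",', lambda _m: f'version="{tag}",', setup_py)


sample = 'setup(\n    name="axolotl",\n    version="0.1",\n)\n'  # made-up file content
print(set_version(sample, tag_from_ref("refs/tags/0.3.0")))
```

Both the shell version and this sketch keep only the third field of the ref, so a tag that itself contains a `/` would be truncated. The pattern also matches only versions made of digits and dots, so a `setup.py` that already holds a non-numeric version string is left unchanged.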
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -24,8 +24,8 @@ jobs:

      - name: Install dependencies
        run: |
-          pip install -e .
-          pip install -r requirements-tests.txt
+          pip3 install -e .
+          pip3 install -r requirements-tests.txt

      - name: Run tests
        run: |
--- a/README.md
+++ b/README.md
@@ -16,6 +16,7 @@ Axolotl is a tool designed to streamline the fine-tuning of various AI models, o
  - [LambdaLabs Installation](#lambdalabs)
 - [Dataset](#dataset)
  - [How to Add Custom Prompts](#how-to-add-custom-prompts)
+  - [How to Use Custom Pretokenized Dataset](#how-to-use-your-custom-pretokenized-dataset)
 - [Config](#config)
  - [Train](#train)
  - [Inference](#inference)
@@ -68,8 +69,9 @@ Get started with Axolotl in just a few steps! This quickstart guide will walk yo

 ```bash
 git clone https://github.com/OpenAccess-AI-Collective/axolotl
+cd axolotl

-pip3 install -e .
+pip3 install -e .[flash-attn]
 pip3 install -U git+https://github.com/huggingface/peft.git

 # finetune lora
@@ -88,8 +90,7 @@ accelerate launch scripts/finetune.py examples/openllama-3b/lora.yml \
  ```bash
  docker run --gpus '"all"' --rm -it winglian/axolotl:main-py3.10-cu118-2.0.1
  ```
-  - `winglian/axolotl-runpod:main-py3.10-cu118-2.0.1`: for runpod
-  - `winglian/axolotl-runpod:main-py3.9-cu118-2.0.1-gptq`: for gptq
+  - `winglian/axolotl-runpod:main-latest`: for runpod or use this [direct link](https://runpod.io/gsc?template=v2ickqhz9s&ref=6i7fkpdz)

  Or run on the current files for development:

@@ -98,23 +99,13 @@ accelerate launch scripts/finetune.py examples/openllama-3b/lora.yml \
  ```

 - Conda/Pip venv
-  1. Install python **3.9**
+  1. Install python >=**3.9**

  2. Install pytorch stable https://pytorch.org/get-started/locally/

-  3. Install python dependencies with ONE of the following:
-      - Recommended, supports QLoRA, NO gptq/int4 support
+  3. Install axolotl along with python dependencies
        ```bash
-        pip3 install -e .
-        pip3 install -U git+https://github.com/huggingface/peft.git
-        ```
-      - gptq/int4 support, NO QLoRA
-        ```bash
-        pip3 install -e .[gptq]
-        ```
-      - same as above but not recommended
-        ```bash
-        pip3 install -e .[gptq_triton]
+        pip3 install -e .[flash-attn]
        ```

 - LambdaLabs
@@ -149,12 +140,9 @@ accelerate launch scripts/finetune.py examples/openllama-3b/lora.yml \
  git clone https://github.com/OpenAccess-AI-Collective/axolotl
  cd axolotl

-  pip3 install -e . # change depend on needs
+  pip3 install -e .
  pip3 install protobuf==3.20.3
-  pip3 install -U requests
-  pip3 install -U --ignore-installed psutil
-  pip3 install -U scipy
-  pip3 install git+https://github.com/huggingface/peft.git # not for gptq
+  pip3 install -U --ignore-installed requests Pillow psutil scipy
  ```

  5. Set path
@@ -163,6 +151,8 @@ accelerate launch scripts/finetune.py examples/openllama-3b/lora.yml \
  ```
  </details>

+- Windows: Please use WSL or Docker!
+
 ### Dataset

 Axolotl supports a variety of dataset formats. Below are some of the formats you can use.
@@ -257,6 +247,10 @@ Have dataset(s) in one of the following format (JSONL recommended):
  ```json
  {"conversations": [{"role": "...", "value": "..."}]}
  ```
+- `metharme`: instruction, adds additional eos tokens
+  ```json
+  {"prompt": "...", "generation": "..."}
+  ```
 - `sharegpt_simple.load_role`: conversations where `role` is used instead of `from`
  ```json
  {"conversations": [{"role": "...", "value": "..."}]}
@@ -274,11 +268,29 @@ Have dataset(s) in one of the following format (JSONL recommended):

 #### How to add custom prompts

-  1. Add your method to a file in [prompt_strategies](src/axolotl/prompt_strategies). Please see other files as example.
-  2. Use your custom file name as the dataset type `<prompt_strategies_file>.load_<load_fn>`.
+Using yaml. Example:
+```yaml
+datasets:
+  - path: repo
+    type:
+      system_prompt: ""
+      no_input_format: |-
+        User: {instruction}<|end_of_turn|>
+        Assistant:
+      format: |-
+        User: {instruction}
+        {input}<|end_of_turn|>
+        Assistant:
+```

-Optionally, download some datasets, see [data/README.md](data/README.md)
+Using file:
+1. Add your method to a file in [prompt_strategies](src/axolotl/prompt_strategies). Please see other files as example.
+2. Use your custom file name as the dataset type `<prompt_strategies_file>.load_<load_fn>`.

+#### How to use your custom pretokenized dataset
+
+- Do not pass a `type:`
+- Dataset must contain the columns `input_ids`, `attention_mask`, and `labels`


 ### Config
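The pretokenized dataset requirement added to the README can be sanity-checked before training. The following is a minimal sketch, not Axolotl's own validation: `check_row` is a hypothetical helper, the token IDs are invented, the equal-length check is an assumption about what a collator expects, and `-100` is used as the label value for ignored positions, following the usual PyTorch cross-entropy convention.

```python
import json

REQUIRED_COLUMNS = ("input_ids", "attention_mask", "labels")


def check_row(row: dict) -> None:
    """Raise ValueError if a row lacks a required column or the columns differ in length."""
    missing = [name for name in REQUIRED_COLUMNS if name not in row]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    lengths = {len(row[name]) for name in REQUIRED_COLUMNS}
    if len(lengths) != 1:
        raise ValueError("input_ids, attention_mask and labels must have the same length")


# Invented example row; -100 marks a position assumed to be excluded from the loss.
row = {"input_ids": [1, 15043, 2], "attention_mask": [1, 1, 1], "labels": [-100, 15043, 2]}
check_row(row)
print(json.dumps(row))  # one JSONL line
```

Writing one such JSON object per line gives a JSONL file that can be referenced from a dataset entry with no `type:` set.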
@@ -306,11 +318,20 @@ See [examples](examples) for quick start. It is recommended to duplicate and mod
      name: enron_emails
      type: completion # format from earlier

+  # huggingface repo with multiple named configurations/subsets
+  datasets:
+    - path: bigcode/commitpackft
+      name:
+        - ruby
+        - python
+        - typescript
+      type: ... # unimplemented custom format
+
  # local
  datasets:
-    - path: json
-      data_files: data.jsonl # or json
-      type: alpaca # format from earlier
+    - path: data.jsonl # or json
+      ds_type: json # see other options below
+      type: alpaca
  ```

 - loading
@@ -385,16 +406,39 @@ fp16: true
 # Use CUDA tf32
 tf32: true # require >=ampere

+# No AMP (automatic mixed precision)
+bfloat16: true # require >=ampere
+float16: true
+
 # a list of one or more datasets to finetune the model with
 datasets:
  # hf dataset repo | "json" for local dataset, make sure to fill data_files
  - path: vicgalle/alpaca-gpt4
  # The type of prompt to use for training. [alpaca, sharegpt, gpteacher, oasst, reflection]
    type: alpaca # format | format:<prompt_style> (chat/instruct) | <prompt_strategies>.load_<load_fn>
+    ds_type: # Optional[str] (json|arrow|parquet) defines the datatype when path is a file
    data_files: # path to source data files
    shards: # number of shards to split data into
    name: # name of dataset configuration to load

+  # custom user prompt
+  - path: repo
+    type:
+      # the below are defaults. only set what's needed.
+      system_prompt: ""
+      field_system: system
+      field_instruction: instruction
+      field_output: input
+
+      # customizable to be single line or multi-line
+      system_format: "{system}"
+      # 'format' can include {input}
+      format: |-
+        User: {instruction} {input}
+        Assistant:
+      # 'no_input_format' cannot include {input}
+      no_input_format: "{instruction} "
+
 # axolotl attempts to save the dataset as an arrow after packing the data together so
 # subsequent training attempts load faster, relative path
 dataset_prepared_path: data/last_run_prepared
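The custom user prompt options in the hunk above (`format` and `no_input_format`) can be read as two templates, one for rows with an `input` field and one for rows without. The sketch below illustrates that reading with plain `str.format`; `render` is a hypothetical helper, and the rule for choosing between the templates is an assumption inferred from the option names, not taken from Axolotl's code.

```python
FORMAT = "User: {instruction} {input}\nAssistant:"
NO_INPUT_FORMAT = "{instruction} "


def render(row: dict, fmt: str = FORMAT, no_input_fmt: str = NO_INPUT_FORMAT) -> str:
    """Fill the template that matches the row (assumed selection rule)."""
    instruction = row["instruction"]
    user_input = row.get("input", "")
    if user_input:
        # 'format' may reference both {instruction} and {input}.
        return fmt.format(instruction=instruction, input=user_input)
    # 'no_input_format' may only reference {instruction}.
    return no_input_fmt.format(instruction=instruction)


print(render({"instruction": "Add the numbers.", "input": "1 2"}))
print(render({"instruction": "Say hello."}))
```

Rendering a few rows this way is a quick check that a template produces the intended turn markers and whitespace before it goes into a training config.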
@@ -418,6 +462,9 @@ dataset_shard_idx:
 # the maximum length of an input to train with, this should typically be less than 2048
 # as most models have a token/context limit of 2048
 sequence_len: 2048
+# pad inputs so each step uses constant sized buffers
+# this will reduce memory fragmentation and may prevent OOMs, by re-using memory more efficiently
+pad_to_sequence_len:
 # max sequence length to concatenate training samples together up to
 # inspired by StackLLaMA. see https://huggingface.co/blog/stackllama#supervised-fine-tuning
 # FutureWarning: This will soon be DEPRECATED
@@ -452,6 +499,12 @@ lora_modules_to_save:
 lora_out_dir:
 lora_fan_in_fan_out: false

+# ReLoRA configuration
+# must use either 'lora' or 'qlora' adapter, and does not support fsdp or deepspeed
+relora_steps: # number of steps per ReLoRA restart
+relora_warmup_steps: # number of per-restart warmup steps
+relora_cpu_offload: # true to perform lora weight merges on cpu during restarts, for modest gpu memory savings
+
 # wandb configuration if you're using it
 wandb_mode: # "offline" to save run metadata locally and not sync to the server, "disabled" to turn off wandb
 wandb_project: # your wandb project name
@@ -472,8 +525,9 @@ warmup_steps: 100
 learning_rate: 0.00003
 lr_quadratic_warmup:
 logging_steps:
+save_strategy: # set to `no` to skip checkpoint saves
 save_steps: # leave empty to save at each epoch
-eval_steps:
+eval_steps: # leave empty to eval at each epoch
 save_total_limit: # checkpoints saved at a time
 max_steps:

@@ -506,6 +560,30 @@ log_sweep_min_lr:
 log_sweep_max_lr:

 # specify optimizer
+# Valid values are driven by the Transformers OptimizerNames class, see:
+# https://github.com/huggingface/transformers/blob/95b374952dc27d8511541d6f5a4e22c9ec11fb24/src/transformers/training_args.py#L134
+#
+# Note that not all optimizers may be available in your environment, ex: 'adamw_anyprecision' is part of
+# torchdistx, 'adamw_bnb_8bit' is part of bnb.optim.Adam8bit, etc. When in doubt, it is recommended to start with the optimizer used
+# in the examples/ for your model and fine-tuning use case.
+#
+# Valid values for 'optimizer' include:
+# - adamw_hf
+# - adamw_torch
+# - adamw_torch_fused
+# - adamw_torch_xla
+# - adamw_apex_fused
+# - adafactor
+# - adamw_anyprecision
+# - sgd
+# - adagrad
+# - adamw_bnb_8bit
+# - lion_8bit
+# - lion_32bit
+# - paged_adamw_32bit
+# - paged_adamw_8bit
+# - paged_lion_32bit
+# - paged_lion_8bit
 optimizer:
 # specify weight decay
 weight_decay:
@@ -559,12 +637,14 @@ fsdp_config:
 # Deepspeed config path
 deepspeed:

+# Advanced DDP Arguments
+ddp_timeout:
+ddp_bucket_cap_mb:
+ddp_broadcast_buffers:
+
 # Path to torch distx for optim 'adamw_anyprecision'
 torchdistx_path:

-# Set padding for data collator to 'longest'
-collator_pad_to_longest:
-
 # Set to HF dataset for type: 'completion' for streaming instead of pre-tokenize
 pretraining_dataset:

@@ -584,7 +664,7 @@ strict:

 Run
 ```bash
-accelerate launch scripts/finetune.py configs/your_config.yml
+accelerate launch scripts/finetune.py your_config.yml
 ```

 #### Multi-GPU
@@ -666,7 +746,9 @@ Please reduce any below
  - `gradient_accumulation_steps`
  - `sequence_len`

-> `failed (exitcode: -9)` usually means your system has run out of system memory.
+> `failed (exitcode: -9)`
+
+Usually means your system has run out of system memory.
 Similarly, you should consider reducing the same settings as when you run out of VRAM.
 Additionally, look into upgrading your system RAM which should be simpler than GPU upgrades.

@@ -682,6 +764,10 @@ Try to turn off xformers.

 It's safe to ignore it.

+> NCCL Timeouts during training
+
+See the [NCCL](docs/nccl.md) guide.
+
 ## Need help? 🙋♂️

 Join our [Discord server](https://discord.gg/HhrNrHJPRb) where we can help you
--- a/data/README.md
+++ b/data/README.md
@@ -1,24 +0,0 @@
-
-## Download some datasets
-```shell
-curl https://raw.githubusercontent.com/tloen/alpaca-lora/main/alpaca_data_gpt4.json -o data/raw/alpaca_data_gpt4.json
-curl https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json -L -o data/raw/vicuna_cleaned.json
-curl https://github.com/teknium1/GPTeacher/blob/main/Instruct/gpt4-instruct-similarity-0.6-dataset.json?raw=true -L -o data/raw/gpt4-instruct-similarity-0.6-dataset.json
-curl https://github.com/teknium1/GPTeacher/blob/main/Roleplay/roleplay-similarity_0.6-instruct-dataset.json?raw=true -L -o data/raw/roleplay-similarity_0.6-instruct-dataset.json
-```
-
-## Convert the JSON data files to JSONL.
-
-```shell
-python3 ./scripts/alpaca_json_to_jsonl.py --file data/alpaca_data_gpt4.json --output data/alpaca_data_gpt4.jsonl
-python3 ./scripts/alpaca_json_to_jsonl.py --file data/raw/vicuna_cleaned.json --output data/vicuna_cleaned.jsonl
-python3 ./scripts/alpaca_json_to_jsonl.py --file data/raw/roleplay-similarity_0.6-instruct-dataset.json --output data/roleplay-similarity_0.6-instruct-dataset.jsonl
-python3 ./scripts/alpaca_json_to_jsonl.py --file data/raw/gpt4-instruct-similarity-0.6-dataset.json --output data/gpt4-instruct-similarity-0.6-dataset.jsonl
-```
---
-
-Using JSONL makes it easier to subset the data if you want a smaller training set, i.e get 2000 random examples.
-
-```shell
-shuf -n2000 data/vicuna_cleaned.jsonl > data/vicuna_cleaned.subset0.jsonl
-```
--- a/data/raw/.gitignore
+++ b/data/raw/.gitignore
@@ -1 +0,0 @@
-**
--- a/deepspeed/zero2.json
+++ b/deepspeed/zero2.json
@@ -0,0 +1,46 @@
+{
+    "zero_optimization": {
+      "stage": 2,
+      "offload_optimizer": {
+        "device": "cpu"
+      },
+      "contiguous_gradients": true,
+      "overlap_comm": true
+    },
+    "bf16": {
+      "enabled": "auto"
+    },
+    "fp16": {
+      "enabled": "auto",
+      "auto_cast": false,
+      "loss_scale": 0,
+      "initial_scale_power": 32,
+      "loss_scale_window": 1000,
+      "hysteresis": 2,
+      "min_loss_scale": 1
+    },
+    "optimizer": {
+      "type": "AdamW",
+      "params": {
+        "lr": "auto",
+        "betas": [
+          0.9,
+          0.999
+        ],
+        "eps": 1e-8,
+        "weight_decay": "auto"
+      }
+    },
+    "scheduler": {
+      "type": "WarmupDecayLR",
+      "params": {
+        "warmup_min_lr": "auto",
+        "warmup_max_lr": "auto",
+        "warmup_num_steps": "auto",
+        "total_num_steps": "auto"
+      }
+    },
+    "train_batch_size": "auto",
+    "train_micro_batch_size_per_gpu": "auto",
+    "wall_clock_breakdown": false
+}
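Several values in `zero2.json` are the string `"auto"`. In the Hugging Face Transformers DeepSpeed integration, such values are filled in from the training arguments at launch, so it helps to know which settings a config delegates. The sketch below lists them for an excerpt of the file above; `auto_paths` is a hypothetical helper, and the excerpt is trimmed for brevity.

```python
import json


def auto_paths(node, prefix=""):
    """Return dotted key paths whose value is the string "auto"."""
    paths = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{prefix}.{key}" if prefix else key
            paths.extend(auto_paths(value, child))
    elif node == "auto":
        paths.append(prefix)
    return paths


# Trimmed excerpt of the zero2.json shown in the diff.
config = json.loads("""
{
  "bf16": {"enabled": "auto"},
  "optimizer": {
    "type": "AdamW",
    "params": {"lr": "auto", "betas": [0.9, 0.999], "eps": 1e-8, "weight_decay": "auto"}
  },
  "train_batch_size": "auto"
}
""")
print(auto_paths(config))
```

Values that are set explicitly, such as `betas` here, do not appear in the result and are used as written.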
--- a/deepspeed/zero3.json
+++ b/deepspeed/zero3.json
@@ -35,10 +35,7 @@
    "type": "AdamW",
    "params": {
      "lr": "auto",
-      "betas": [
-        0.9,
-        0.95
-      ],
+      "betas": "auto",
      "eps": 1e-8,
      "weight_decay": "auto"
    }
--- a/docker-compose.yaml
+++ b/docker-compose.yaml
@@ -9,6 +9,11 @@ services:
      - ~/.cache/huggingface/:/root/.cache/huggingface/
    # set environment variables
    environment:
+      # Set environment variables
+      - GIT_AUTHOR_NAME=${GIT_AUTHOR_NAME}
+      - GIT_AUTHOR_EMAIL=${GIT_AUTHOR_EMAIL}
+      - GIT_COMMITTER_NAME=${GIT_COMMITTER_NAME}
+      - GIT_COMMITTER_EMAIL=${GIT_COMMITTER_EMAIL}
      - WANDB_API_KEY=${WANDB_API_KEY}
    deploy:
      resources:
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
@@ -11,14 +11,13 @@ RUN apt-get update && \

 WORKDIR /workspace

-RUN pip3 install --force-reinstall "peft @ git+https://github.com/huggingface/peft.git@main"
 RUN git clone --depth=1 https://github.com/OpenAccess-AI-Collective/axolotl.git
 # If AXOLOTL_EXTRAS is set, append it in brackets
 RUN cd axolotl && \
    if [ "$AXOLOTL_EXTRAS" != "" ] ; then \
-        pip install -e .[$AXOLOTL_EXTRAS]; \
+        pip install -e .[flash-attn,$AXOLOTL_EXTRAS]; \
    else \
-        pip install -e .; \
+        pip install -e .[flash-attn]; \
    fi

 # fix so that git fetch/pull from remote works
--- a/docker/Dockerfile-base
+++ b/docker/Dockerfile-base
@@ -31,26 +31,6 @@ WORKDIR /workspace
 RUN python3 -m pip install --upgrade pip && pip3 install packaging && \
    python3 -m pip install --no-cache-dir -U torch==${PYTORCH_VERSION}+cu${CUDA} --extra-index-url https://download.pytorch.org/whl/cu$CUDA

-
-FROM base-builder AS flash-attn-builder
-
-WORKDIR /workspace
-
-ARG TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6 9.0+PTX"
-
-RUN git clone https://github.com/Dao-AILab/flash-attention.git && \
-    cd flash-attention && \
-    git checkout v2.0.4  && \
-    python3 setup.py bdist_wheel && \
-    cd csrc/fused_dense_lib && \
-    python3 setup.py bdist_wheel && \
-    cd ../xentropy && \
-    python3 setup.py bdist_wheel && \
-    cd ../rotary && \
-    python3 setup.py bdist_wheel && \
-    cd ../layer_norm && \
-    python3 setup.py bdist_wheel
-
 FROM base-builder AS deepspeed-builder

 ARG TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6 9.0+PTX"
@@ -90,13 +70,8 @@ RUN mkdir -p /workspace/wheels/bitsandbytes
 COPY --from=deepspeed-builder /workspace/DeepSpeed/dist/deepspeed-*.whl wheels
 COPY --from=bnb-builder /workspace/bitsandbytes/dist/bitsandbytes-*.whl wheels
 COPY --from=bnb-builder /workspace/bitsandbytes/bitsandbytes/libbitsandbytes*.so wheels/bitsandbytes
-COPY --from=flash-attn-builder /workspace/flash-attention/dist/flash_attn-*.whl wheels
-COPY --from=flash-attn-builder /workspace/flash-attention/csrc/fused_dense_lib/dist/fused_dense_lib-*.whl wheels
-COPY --from=flash-attn-builder /workspace/flash-attention/csrc/xentropy/dist/xentropy_cuda_lib-*.whl wheels
-COPY --from=flash-attn-builder /workspace/flash-attention/csrc/rotary/dist/rotary_emb-*.whl wheels
-COPY --from=flash-attn-builder /workspace/flash-attention/csrc/layer_norm/dist/dropout_layer_norm-*.whl wheels

-RUN pip3 install wheels/deepspeed-*.whl wheels/flash_attn-*.whl wheels/fused_dense_lib-*.whl wheels/xentropy_cuda_lib-*.whl wheels/rotary_emb-*.whl wheels/dropout_layer_norm-*.whl
+RUN pip3 install wheels/deepspeed-*.whl
 RUN cd /workspace/builds/bitsandbytes && python3 setup.py install
 RUN git lfs install --skip-repo
 RUN pip3 install awscli && \
--- a/docs/nccl.md
+++ b/docs/nccl.md
@@ -0,0 +1,46 @@
+# NCCL
+
+NVIDIA NCCL is a library to facilitate and optimize multi-GPU communication operations, such as broadcast, all-gather, reduce, all-reduce, etc. Broadly, NCCL configuration is highly environment-specific and is configured via several [environment variables](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html). A common NCCL-related problem occurs when a long-running operation times out causing the training process to abort:
+
+```text
+Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1806948 milliseconds before timing out.
+```
+
+Often, this timeout will happen after 30 minutes (the default setting) and is accompanied by below-average power consumption with near 100% GPU utilization before the error is raised. NVIDIA recommends [disabling PCI access control services (ACS)](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#pci-access-control-services-acs) as a possible solution if this is available to you.
+
+Forcing cross-GPU communication via [NVLink](https://en.wikipedia.org/wiki/NVLink) may help without increasing timeouts. To verify that your configuration is leveraging NVLink run the following command:
+
+```shell
+nvidia-smi nvlink --status
+```
+
+To force NCCL to use NVLink, simply set this in the environment:
+
+```shell
+export NCCL_P2P_LEVEL=NVL
+```
+
+If NVLink is not available in your environment there are other options for ``NCCL_P2P_LEVEL`` in the table below:
+
+| NCCL_P2P_LEVEL | Description |
+| -------------- | ----------- |
+| PIX | P2P data transfers through no more than a single PCIe bridge. Faster data transfer rates compared to paths involving multiple bridges, but slower compared to direct GPU-to-GPU communication. |
+| PXB | P2P data transfers through multiple PCIe bridges but not going through the PCIe Host Bridge; this path involves a complex routing process, potentially incurring a moderate level of latency. |
+| PHB | P2P data transfers occur over PCIe and through a PCIe Host Bridge, typically involving the CPU, which can facilitate direct memory access but might introduce additional latency compared to more direct paths (e.g., PIX, NVL). |
+
+To validate that acceptable data transfer speeds exist for your training job, running [NCCL Tests](https://github.com/NVIDIA/nccl-tests/blob/master/README.md) can help pinpoint bottlenecks, for example:
+
+```shell
+./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
+```
+
+It can be useful when debugging NCCL communication timeouts to activate additional logging in both PyTorch and NCCL:
+
+```shell
+export NCCL_DEBUG=INFO
+export NCCL_DEBUG_SUBSYS=ALL
+export TORCH_DISTRIBUTED_DEBUG=INFO
+export TORCHELASTIC_ERROR_FILE=/PATH/TO/torcherror.log
+```
+
+Finally, if you believe your training job needs more time you can increase the timeout past 30 minutes by setting the ``ddp_timeout`` value in the Axolotl configuration. See [PyTorch init_process_group](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group) for documentation on this value.
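The watchdog message quoted in `docs/nccl.md` carries both the configured timeout and the time the collective actually ran, which helps when deciding whether a larger `ddp_timeout` is plausible. The sketch below parses that message; `parse_timeout` is a hypothetical helper, and the regular expression is written against the single example line above, so other PyTorch versions may word the message differently.

```python
import re

PATTERN = re.compile(r"Timeout\(ms\)=(\d+)\) ran for (\d+) milliseconds")


def parse_timeout(message: str):
    """Extract the configured timeout and the overrun from a watchdog message."""
    match = PATTERN.search(message)
    if match is None:
        return None
    timeout_ms, ran_ms = int(match.group(1)), int(match.group(2))
    return {
        "timeout_minutes": timeout_ms / 60_000,  # 60,000 ms per minute
        "overrun_ms": ran_ms - timeout_ms,
    }


message = (
    "Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, "
    "OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1806948 milliseconds before timing out."
)
print(parse_timeout(message))
```

For the example line, 1800000 ms is the 30 minute default and the overrun is a few seconds, which only shows that the watchdog fired shortly after the limit, not how long the operation would have needed. If `ddp_timeout` follows the Transformers `TrainingArguments` convention of seconds (an assumption worth confirming for your version), a one hour limit would be written as `3600`.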
--- a/examples/code-llama/13b/lora.yml
+++ b/examples/code-llama/13b/lora.yml
@@ -0,0 +1,68 @@
+base_model: codellama/CodeLlama-13b-hf
+base_model_config: codellama/CodeLlama-13b-hf
+model_type: LlamaForCausalLM
+tokenizer_type: CodeLlamaTokenizer
+is_llama_derived_model: true
+
+load_in_8bit: true
+load_in_4bit: false
+strict: false
+
+datasets:
+  - path: mhenrichsen/alpaca_2k_test
+    type: alpaca
+dataset_prepared_path: last_run_prepared
+val_set_size: 0.01
+output_dir: ./lora-out
+
+sequence_len: 100000
+sample_packing: true
+pad_to_sequence_len: true
+
+adapter: lora
+lora_model_dir:
+lora_r: 32
+lora_alpha: 16
+lora_dropout: 0.05
+lora_target_linear: true
+lora_fan_in_fan_out:
+
+wandb_project:
+wandb_entity:
+wandb_watch:
+wandb_run_id:
+wandb_log_model:
+
+gradient_accumulation_steps: 4
+micro_batch_size: 2
+num_epochs: 3
+optimizer: adamw_bnb_8bit
+lr_scheduler: cosine
+learning_rate: 0.0002
+
+train_on_inputs: false
+group_by_length: false
+bf16: true
+fp16: false
+tf32: false
+
+gradient_checkpointing: true
+early_stopping_patience:
+resume_from_checkpoint:
+local_rank:
+logging_steps: 1
+xformers_attention:
+flash_attention: true
+
+warmup_steps: 10
+eval_steps: 20
+save_steps:
+debug:
+deepspeed:
+weight_decay: 0.0
+fsdp:
+fsdp_config:
+special_tokens:
+  bos_token: "<s>"
+  eos_token: "</s>"
+  unk_token: "<unk>"
--- a/examples/code-llama/13b/qlora.yml
+++ b/examples/code-llama/13b/qlora.yml
@@ -0,0 +1,70 @@
+base_model: codellama/CodeLlama-13b-hf
+base_model_config: codellama/CodeLlama-13b-hf
+model_type: LlamaForCausalLM
+tokenizer_type: CodeLlamaTokenizer
+is_llama_derived_model: true
+
+load_in_8bit: false
+load_in_4bit: true
+strict: false
+
+datasets:
+  - path: mhenrichsen/alpaca_2k_test
+    type: alpaca
+dataset_prepared_path: last_run_prepared
+val_set_size: 0.01
+output_dir: ./qlora-out
+
+adapter: qlora
+lora_model_dir:
+
+sequence_len: 100000
+sample_packing: true
+pad_to_sequence_len: true
+
+lora_r: 32
+lora_alpha: 16
+lora_dropout: 0.05
+lora_target_modules:
+lora_target_linear: true
+lora_fan_in_fan_out:
+
+wandb_project:
+wandb_entity:
+wandb_watch:
+wandb_run_id:
+wandb_log_model:
+
+gradient_accumulation_steps: 4
+micro_batch_size: 2
+num_epochs: 3
+optimizer: paged_adamw_32bit
+lr_scheduler: cosine
+learning_rate: 0.0002
+
+train_on_inputs: false
+group_by_length: false
+bf16: true
+fp16: false
+tf32: false
+
+gradient_checkpointing: true
+early_stopping_patience:
+resume_from_checkpoint:
+local_rank:
+logging_steps: 1
+xformers_attention:
+flash_attention: true
+
+warmup_steps: 10
+eval_steps: 20
+save_steps:
+debug:
+deepspeed:
+weight_decay: 0.0
+fsdp:
+fsdp_config:
+special_tokens:
+  bos_token: "<s>"
+  eos_token: "</s>"
+  unk_token: "<unk>"
--- a/examples/code-llama/34b/lora.yml
+++ b/examples/code-llama/34b/lora.yml
@@ -0,0 +1,68 @@
+base_model: codellama/CodeLlama-34b-hf
+base_model_config: codellama/CodeLlama-34b-hf
+model_type: LlamaForCausalLM
+tokenizer_type: CodeLlamaTokenizer
+is_llama_derived_model: true
+
+load_in_8bit: true
+load_in_4bit: false
+strict: false
+
+datasets:
+  - path: mhenrichsen/alpaca_2k_test
+    type: alpaca
+dataset_prepared_path: last_run_prepared
+val_set_size: 0.01
+output_dir: ./lora-out
+
+sequence_len: 100000
+sample_packing: true
+pad_to_sequence_len: true
+
+adapter: lora
+lora_model_dir:
+lora_r: 32
+lora_alpha: 16
+lora_dropout: 0.05
+lora_target_linear: true
+lora_fan_in_fan_out:
+
+wandb_project:
+wandb_entity:
+wandb_watch:
+wandb_run_id:
+wandb_log_model:
+
+gradient_accumulation_steps: 4
+micro_batch_size: 2
+num_epochs: 3
+optimizer: adamw_bnb_8bit
+lr_scheduler: cosine
+learning_rate: 0.0002
+
+train_on_inputs: false
+group_by_length: false
+bf16: true
+fp16: false
+tf32: false
+
+gradient_checkpointing: true
+early_stopping_patience:
+resume_from_checkpoint:
+local_rank:
+logging_steps: 1
+xformers_attention:
+flash_attention: true
+
+warmup_steps: 10
+eval_steps: 20
+save_steps:
+debug:
+deepspeed:
+weight_decay: 0.0
+fsdp:
+fsdp_config:
+special_tokens:
+  bos_token: "<s>"
+  eos_token: "</s>"
+  unk_token: "<unk>"
--- a/examples/code-llama/34b/qlora.yml
+++ b/examples/code-llama/34b/qlora.yml
@@ -0,0 +1,70 @@
+base_model: codellama/CodeLlama-34b-hf
+base_model_config: codellama/CodeLlama-34b-hf
+model_type: LlamaForCausalLM
+tokenizer_type: CodeLlamaTokenizer
+is_llama_derived_model: true
+
+load_in_8bit: false
+load_in_4bit: true
+strict: false
+
+datasets:
+  - path: mhenrichsen/alpaca_2k_test
+    type: alpaca
+dataset_prepared_path: last_run_prepared
+val_set_size: 0.01
+output_dir: ./qlora-out
+
+adapter: qlora
+lora_model_dir:
+
+sequence_len: 100000
+sample_packing: true
+pad_to_sequence_len: true
+
+lora_r: 32
+lora_alpha: 16
+lora_dropout: 0.05
+lora_target_modules:
+lora_target_linear: true
+lora_fan_in_fan_out:
+
+wandb_project:
+wandb_entity:
+wandb_watch:
+wandb_run_id:
+wandb_log_model:
+
+gradient_accumulation_steps: 4
+micro_batch_size: 2
+num_epochs: 3
+optimizer: paged_adamw_32bit
+lr_scheduler: cosine
+learning_rate: 0.0002
+
+train_on_inputs: false
+group_by_length: false
+bf16: true
+fp16: false
+tf32: false
+
+gradient_checkpointing: true
+early_stopping_patience:
+resume_from_checkpoint:
+local_rank:
+logging_steps: 1
+xformers_attention:
+flash_attention: true
+
+warmup_steps: 10
+eval_steps: 20
+save_steps:
+debug:
+deepspeed:
+weight_decay: 0.0
+fsdp:
+fsdp_config:
+special_tokens:
+  bos_token: "<s>"
+  eos_token: "</s>"
+  unk_token: "<unk>"
--- a/examples/code-llama/7b/lora.yml
+++ b/examples/code-llama/7b/lora.yml
@@ -0,0 +1,68 @@
+base_model: codellama/CodeLlama-7b-hf
+base_model_config: codellama/CodeLlama-7b-hf
+model_type: LlamaForCausalLM
+tokenizer_type: CodeLlamaTokenizer
+is_llama_derived_model: true
+
+load_in_8bit: true
+load_in_4bit: false
+strict: false
+
+datasets:
+  - path: mhenrichsen/alpaca_2k_test
+    type: alpaca
+dataset_prepared_path: last_run_prepared
+val_set_size: 0.01
+output_dir: ./lora-out
+
+sequence_len: 100000
+sample_packing: true
+pad_to_sequence_len: true
+
+adapter: lora
+lora_model_dir:
+lora_r: 32
+lora_alpha: 16
+lora_dropout: 0.05
+lora_target_linear: true
+lora_fan_in_fan_out:
+
+wandb_project:
+wandb_entity:
+wandb_watch:
+wandb_run_id:
+wandb_log_model:
+
+gradient_accumulation_steps: 4
+micro_batch_size: 2
+num_epochs: 3
+optimizer: adamw_bnb_8bit
+lr_scheduler: cosine
+learning_rate: 0.0002
+
+train_on_inputs: false
+group_by_length: false
+bf16: true
+fp16: false
+tf32: false
+
+gradient_checkpointing: true
+early_stopping_patience:
+resume_from_checkpoint:
+local_rank:
+logging_steps: 1
+xformers_attention:
+flash_attention: true
+
+warmup_steps: 10
+eval_steps: 20
+save_steps:
+debug:
+deepspeed:
+weight_decay: 0.0
+fsdp:
+fsdp_config:
+special_tokens:
+  bos_token: "<s>"
+  eos_token: "</s>"
+  unk_token: "<unk>"
--- a/examples/code-llama/7b/qlora.yml
+++ b/examples/code-llama/7b/qlora.yml
@@ -0,0 +1,70 @@
+base_model: codellama/CodeLlama-7b-hf
+base_model_config: codellama/CodeLlama-7b-hf
+model_type: LlamaForCausalLM
+tokenizer_type: CodeLlamaTokenizer
+is_llama_derived_model: true
+
+load_in_8bit: false
+load_in_4bit: true
+strict: false
+
+datasets:
+  - path: mhenrichsen/alpaca_2k_test
+    type: alpaca
+dataset_prepared_path: last_run_prepared
+val_set_size: 0.01
+output_dir: ./qlora-out
+
+adapter: qlora
+lora_model_dir:
+
+sequence_len: 100000
+sample_packing: true
+pad_to_sequence_len: true
+
+lora_r: 32
+lora_alpha: 16
+lora_dropout: 0.05
+lora_target_modules:
+lora_target_linear: true
+lora_fan_in_fan_out:
+
+wandb_project:
+wandb_entity:
+wandb_watch:
+wandb_run_id:
+wandb_log_model:
+
+gradient_accumulation_steps: 4
+micro_batch_size: 2
+num_epochs: 3
+optimizer: paged_adamw_32bit
+lr_scheduler: cosine
+learning_rate: 0.0002
+
+train_on_inputs: false
+group_by_length: false
+bf16: true
+fp16: false
+tf32: false
+
+gradient_checkpointing: true
+early_stopping_patience:
+resume_from_checkpoint:
+local_rank:
+logging_steps: 1
+xformers_attention:
+flash_attention: true
+
+warmup_steps: 10
+eval_steps: 20
+save_steps:
+debug:
+deepspeed:
+weight_decay: 0.0
+fsdp:
+fsdp_config:
+special_tokens:
+  bos_token: "<s>"
+  eos_token: "</s>"
+  unk_token: "<unk>"
--- a/examples/code-llama/README.md
+++ b/examples/code-llama/README.md
@@ -0,0 +1,22 @@
+# Overview
+
+This directory contains example Code Llama configurations for the 7b, 13b and 34b models.
+
+The 7b variant fits on any GPU with 24 GB of VRAM; training uses about 17 GB with qlora and about 20 GB with lora. On an RTX 4090 it trains 3 epochs of the default dataset in about 15 minutes.
+
+The 13b variant will fit if you change these settings to the following values:
+gradient_accumulation_steps: 2
+micro_batch_size: 1
+
+The 34b variant does not fit in 24 GB of VRAM. You will need a GPU with 40+ GB of VRAM that also supports flash attention v2; an A6000 or A100 is a good choice.
+
+```shell
+accelerate launch scripts/finetune.py examples/code-llama/[MODEL_SIZE]/qlora.yml
+
+```
+or
+
+```shell
+accelerate launch scripts/finetune.py examples/code-llama/[MODEL_SIZE]/lora.yml
+
+```
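As a rough illustration of what the 13b settings in the README above trade away, the sketch below compares the number of micro-batches' worth of samples contributing to each optimizer step. It assumes the usual definition (micro batch size times gradient accumulation steps times number of GPUs); `effective_batch_size` is a hypothetical helper for this note, not an axolotl function, and with sample packing the number of original examples per step can differ.

```python
# Hypothetical helper, not part of axolotl: samples per optimizer step
# under the common definition micro_batch * grad_accum * num_gpus.
def effective_batch_size(
    micro_batch_size: int,
    gradient_accumulation_steps: int,
    num_gpus: int = 1,
) -> int:
    return micro_batch_size * gradient_accumulation_steps * num_gpus


# Values from examples/code-llama/7b/qlora.yml in this diff.
default_7b = effective_batch_size(micro_batch_size=2, gradient_accumulation_steps=4)

# Values the README suggests for the 13b variant on a 24 GB GPU.
reduced_13b = effective_batch_size(micro_batch_size=1, gradient_accumulation_steps=2)

print(default_7b, reduced_13b)
```

The smaller per-step batch lowers activation memory, but it may also change training dynamics, so the learning rate might need revisiting.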
--- a/examples/gptq-lora-7b/README.md
+++ b/examples/gptq-lora-7b/README.md
@@ -1,8 +0,0 @@
-# LLaMa 7B using LoRA
-
-This is a good place to start for beginners. This will run on an NVIDIA RTX4090 with no other changes needed.
-
-```shell
-accelerate launch scripts/finetune.py examples/gptq-lora-7b/config.yml
-
-```
--- a/examples/gptq-lora-7b/config.yml
+++ b/examples/gptq-lora-7b/config.yml
@@ -1,63 +0,0 @@
-base_model: Neko-Institute-of-Science/LLaMA-7B-4bit-128g
-base_model_config: Neko-Institute-of-Science/LLaMA-7B-4bit-128g
-model_type: LlamaForCausalLM
-tokenizer_type: LlamaTokenizer
-trust_remote_code:
-load_in_8bit: true
-gptq: true
-datasets:
-  - path: vicgalle/alpaca-gpt4
-    type: alpaca
-dataset_prepared_path: last_run_prepared
-val_set_size: 0.02
-adapter:
-lora_model_dir:
-sequence_len: 2048
-max_packed_sequence_len:
-lora_r: 8
-lora_alpha: 16
-lora_dropout: 0.05
-lora_target_modules:
-  - q_proj
-  - v_proj
-lora_fan_in_fan_out: false
-wandb_project: llama-7b-lora-int4
-wandb_entity:
-wandb_watch:
-wandb_run_id:
-wandb_log_model:
-output_dir: ./llama-7b-lora-int4
-gradient_accumulation_steps: 1
-micro_batch_size: 1
-num_epochs: 3
-optimizer: adamw_bnb_8bit
-torchdistx_path:
-lr_scheduler: cosine
-learning_rate: 0.0000002
-train_on_inputs: false
-group_by_length: false
-fp16: true
-bf16: false
-tf32: true
-early_stopping_patience:
-resume_from_checkpoint:
-local_rank:
-logging_steps: 5
-xformers_attention:
-flash_attention:
-gradient_checkpointing: true
-gptq_groupsize: 128
-gptq_model_v1: false
-warmup_steps: 20
-eval_steps: 110
-save_steps: 660
-debug:
-deepspeed:
-weight_decay: 0.0001
-fsdp:
-fsdp_config:
-tokens:
-  pad_token: "[PAD]"
-  bos_token: "<s>"
-  eos_token: "</s>"
-  unk_token: "<unk>"
--- a/examples/llama-2/gptq-lora.yml
+++ b/examples/llama-2/gptq-lora.yml
@@ -0,0 +1,76 @@
+base_model: TheBloke/Llama-2-7B-GPTQ
+base_model_config: TheBloke/Llama-2-7B-GPTQ
+is_llama_derived_model: false
+gptq: true
+gptq_bits: 4
+model_type: AutoModelForCausalLM
+tokenizer_type: LlamaTokenizer
+tokenizer_use_fast: true
+tokenizer_legacy: true
+load_in_8bit: false
+load_in_4bit: false
+strict: false
+push_dataset_to_hub:
+hf_use_auth_token: true
+datasets:
+  - path: mhenrichsen/alpaca_2k_test
+    type: alpaca
+dataset_prepared_path: last_run_prepared
+val_set_size: 0.01
+adapter: lora
+lora_model_dir:
+sequence_len: 4096
+sample_packing:
+lora_r: 8
+lora_alpha: 32
+lora_dropout: 0.05
+lora_target_modules:
+  - k_proj
+  - o_proj
+  - q_proj
+  - v_proj
+lora_target_linear:
+lora_fan_in_fan_out:
+wandb_project:
+wandb_watch:
+wandb_run_id:
+wandb_log_model:
+output_dir: ./model-out
+gradient_accumulation_steps: 1
+micro_batch_size: 1
+num_epochs: 3
+optimizer: adamw_torch
+adam_beta2: 0.95
+adam_eps: 0.00001
+max_grad_norm: 1.0
+torchdistx_path:
+lr_scheduler: cosine
+lr_quadratic_warmup: true
+learning_rate: 0.000017
+train_on_inputs: false
+group_by_length: false
+bf16: false
+fp16: false
+float16: true
+tf32: true
+gradient_checkpointing: true
+early_stopping_patience:
+resume_from_checkpoint:
+local_rank:
+logging_steps: 1
+xformers_attention:
+flash_attention:
+sdp_attention:
+flash_optimum:
+gptq_groupsize:
+gptq_model_v1:
+warmup_steps: 100
+eval_steps:
+save_steps:
+debug:
+deepspeed:
+weight_decay: 0.1
+special_tokens:
+  bos_token: "<s>"
+  eos_token: "</s>"
+  unk_token: "<unk>"
--- a/examples/llama-2/lora.yml
+++ b/examples/llama-2/lora.yml
@@ -17,6 +17,7 @@ output_dir: ./lora-out

 sequence_len: 4096
 sample_packing: true
+pad_to_sequence_len: true

 adapter: lora
 lora_model_dir:
--- a/examples/llama-2/qlora.yml
+++ b/examples/llama-2/qlora.yml
@@ -20,6 +20,7 @@ lora_model_dir:

 sequence_len: 4096
 sample_packing: true
+pad_to_sequence_len: true

 lora_r: 32
 lora_alpha: 16
--- a/examples/llama-2/relora.yml
+++ b/examples/llama-2/relora.yml
@@ -0,0 +1,74 @@
+base_model: meta-llama/Llama-2-7b-hf
+base_model_config: meta-llama/Llama-2-7b-hf
+model_type: LlamaForCausalLM
+tokenizer_type: LlamaTokenizer
+is_llama_derived_model: true
+
+load_in_8bit: false
+load_in_4bit: true
+strict: false
+
+datasets:
+  - path: teknium/GPT4-LLM-Cleaned
+    type: alpaca
+dataset_prepared_path: last_run_prepared
+val_set_size: 0.01
+output_dir: ./relora-out
+
+adapter: qlora
+lora_model_dir:
+
+sequence_len: 4096
+sample_packing: true
+pad_to_sequence_len: true
+
+lora_r: 8
+lora_alpha: 16
+lora_dropout: 0.05
+lora_target_modules:
+lora_target_linear: true
+lora_fan_in_fan_out:
+
+relora_steps: 150
+relora_warmup_steps: 10
+relora_cpu_offload: false
+
+wandb_project:
+wandb_entity:
+wandb_watch:
+wandb_run_id:
+wandb_log_model:
+
+gradient_accumulation_steps: 4
+micro_batch_size: 4
+num_epochs: 3
+optimizer: adamw_bnb_8bit
+lr_scheduler: cosine
+learning_rate: 0.0002
+
+train_on_inputs: false
+group_by_length: false
+bf16: true
+fp16: false
+tf32: false
+
+gradient_checkpointing: true
+early_stopping_patience:
+resume_from_checkpoint:
+local_rank:
+logging_steps: 1
+xformers_attention:
+flash_attention: true
+
+warmup_steps: 10
+eval_steps: 20
+save_steps: 50
+debug:
+deepspeed:
+weight_decay: 0.0
+fsdp:
+fsdp_config:
+special_tokens:
+  bos_token: "<s>"
+  eos_token: "</s>"
+  unk_token: "<unk>"
--- a/examples/pythia-12b/config.yml
+++ b/examples/pythia-12b/config.yml
@@ -47,4 +47,3 @@ local_rank:
 gradient_checkpointing: true
 fsdp:
 fsdp_config:
-collator_pad_to_longest: true
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,20 +1,27 @@
+--extra-index-url https://download.pytorch.org/whl/cu118
+--extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
+torch==2.0.1
+auto-gptq
+packaging
 peft @ git+https://github.com/huggingface/peft.git
 transformers @ git+https://github.com/huggingface/transformers.git
 bitsandbytes>=0.41.1
-accelerate @ git+https://github.com/huggingface/accelerate@2a289f6108e77a77a4efffb3f6316bc98538413b
+accelerate @ git+https://github.com/huggingface/accelerate
 addict
+evaluate
 fire
-PyYAML==6.0
+PyYAML>=6.0
 datasets
-accelerate>=0.19.0
+flash-attn>=2.2.1
 sentencepiece
 wandb
 einops
 xformers
 optimum
 hf_transfer
+colorama
 numba
-numpy==1.24.4
+numpy>=1.24.4
 # qlora things
 bert-score==0.3.13
 evaluate==0.4.0
@@ -22,3 +29,4 @@ rouge-score==0.1.2
 scipy
 scikit-learn==1.2.2
 pynvml
+art
--- a/scripts/alpaca_json_to_jsonl.py
+++ b/scripts/alpaca_json_to_jsonl.py
@@ -1,52 +0,0 @@
-"""Module to convert json file to jsonl"""
-
-import os
-import sys
-from pathlib import Path
-from typing import Optional, Union
-
-import fire
-
-from axolotl.convert import (
-    FileReader,
-    FileWriter,
-    JsonlSerializer,
-    JsonParser,
-    JsonToJsonlConverter,
-    StdoutWriter,
-)
-from axolotl.logging_config import configure_logging
-
-configure_logging()
-
-# add src to the pythonpath so we don't need to pip install this
-project_root = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
-src_dir = os.path.join(project_root, "src")
-sys.path.insert(0, src_dir)
-
-
-def main(
-    file: Path,
-    output: Optional[Path] = None,
-    to_stdout: Optional[bool] = False,
-):
-    """
-    Convert a json file to jsonl
-    """
-
-    file_reader = FileReader()
-    writer: Union[StdoutWriter, FileWriter]
-    if to_stdout or output is None:
-        writer = StdoutWriter()
-    else:
-        writer = FileWriter(output)
-    json_parser = JsonParser()
-    jsonl_serializer = JsonlSerializer()
-
-    converter = JsonToJsonlConverter(file_reader, writer, json_parser, jsonl_serializer)
-
-    converter.convert(file, output)
-
-
-if __name__ == "__main__":
-    fire.Fire(main)
--- a/scripts/finetune.py
+++ b/scripts/finetune.py
@@ -4,27 +4,28 @@ import importlib
 import logging
 import os
 import random
-import signal
 import sys
 from pathlib import Path
 from typing import Any, Dict, List, Optional, Union

 import fire
 import torch
+import transformers
 import yaml

 # add src to the pythonpath so we don't need to pip install this
-from optimum.bettertransformer import BetterTransformer
+from art import text2art
 from transformers import GenerationConfig, TextStreamer

+from axolotl.common.cli import TrainerCliArgs, load_model_and_tokenizer
 from axolotl.logging_config import configure_logging
+from axolotl.train import TrainDatasetMeta, train
 from axolotl.utils.config import normalize_config, validate_config
 from axolotl.utils.data import prepare_dataset
 from axolotl.utils.dict import DictDefault
 from axolotl.utils.distributed import is_main_process
-from axolotl.utils.models import load_model, load_tokenizer
+from axolotl.utils.models import load_tokenizer
 from axolotl.utils.tokenization import check_dataset_labels
-from axolotl.utils.trainer import setup_trainer
 from axolotl.utils.wandb import setup_wandb_env_vars

 project_root = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
@@ -37,15 +38,12 @@ LOG = logging.getLogger("axolotl.scripts")
 os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"


-def print_axolotl_text_art():
-    ascii_art = """
-                           dP            dP   dP
-                           88            88   88
-.d8888b. dP.  .dP .d8888b. 88 .d8888b. d8888P 88
-88'  `88  `8bd8'  88'  `88 88 88'  `88   88   88
-88.  .88  .d88b.  88.  .88 88 88.  .88   88   88
-`88888P8 dP'  `dP `88888P' dP `88888P'   dP   dP
-"""
+def print_axolotl_text_art(suffix=None):
+    font = "nancyj"
+    ascii_text = "  axolotl"
+    if suffix:
+        ascii_text += f"  x  {suffix}"
+    ascii_art = text2art(ascii_text, font=font)

    if is_main_process():
        print(ascii_art)
@@ -60,7 +58,45 @@ def get_multi_line_input() -> Optional[str]:
    return instruction


-def do_inference(cfg, model, tokenizer, prompter: Optional[str]):
+def do_merge_lora(
+    *,
+    cfg: DictDefault,
+    cli_args: TrainerCliArgs,
+):
+    model, tokenizer = load_model_and_tokenizer(cfg=cfg, cli_args=cli_args)
+    safe_serialization = cfg.save_safetensors is True
+
+    LOG.info("running merge of LoRA with base model")
+    model = model.merge_and_unload()
+    model.to(dtype=torch.float16)
+
+    if cfg.local_rank == 0:
+        LOG.info("saving merged model")
+        model.save_pretrained(
+            str(Path(cfg.output_dir) / "merged"),
+            safe_serialization=safe_serialization,
+        )
+        tokenizer.save_pretrained(str(Path(cfg.output_dir) / "merged"))
+
+
+def shard(
+    *,
+    cfg: DictDefault,
+    cli_args: TrainerCliArgs,
+):
+    model, _ = load_model_and_tokenizer(cfg=cfg, cli_args=cli_args)
+    safe_serialization = cfg.save_safetensors is True
+    LOG.debug("Re-saving model w/ sharding")
+    model.save_pretrained(cfg.output_dir, safe_serialization=safe_serialization)
+
+
+def do_inference(
+    *,
+    cfg: DictDefault,
+    cli_args: TrainerCliArgs,
+):
+    model, tokenizer = load_model_and_tokenizer(cfg=cfg, cli_args=cli_args)
+    prompter = cli_args.prompter
    default_tokens = {"unk_token": "<unk>", "bos_token": "<s>", "eos_token": "</s>"}

    for token, symbol in default_tokens.items():
@@ -82,6 +118,8 @@ def do_inference(cfg, model, tokenizer, prompter: Optional[str]):
            max_seq_len=255, mem_freq=50, top_k=5, max_cache_size=None
        )

+    model = model.to(cfg.device)
+
    while True:
        print("=" * 80)
        # support for multiline inputs
@@ -133,6 +171,10 @@ def choose_config(path: Path):
            "No YAML config files found in the specified directory. Are you using a .yml extension?"
        )

+    if len(yaml_files) == 1:
+        print(f"Using default YAML file '{yaml_files[0]}'")
+        return yaml_files[0]
+
    print("Choose a YAML file:")
    for idx, file in enumerate(yaml_files):
        print(f"{idx + 1}. {file}")
@@ -155,12 +197,7 @@ def check_not_in(list1: List[str], list2: Union[Dict[str, Any], List[str]]) -> b
    return not any(el in list2 for el in list1)


-def train(
-    config: Path = Path("configs/"),
-    prepare_ds_only: bool = False,
-    **kwargs,
-):
-    print_axolotl_text_art()
+def load_cfg(config: Path = Path("examples/"), **kwargs):
    if Path(config).is_dir():
        config = choose_config(config)

@@ -184,132 +221,58 @@ def train(
    normalize_config(cfg)

    setup_wandb_env_vars(cfg)
+    return cfg

-    # load the tokenizer first
-    LOG.info(f"loading tokenizer... {cfg.tokenizer_config or cfg.base_model_config}")
+
+def load_datasets(
+    *,
+    cfg: DictDefault,
+    cli_args: TrainerCliArgs,
+) -> TrainDatasetMeta:
    tokenizer = load_tokenizer(cfg)

-    if (
-        check_not_in(["shard", "merge_lora"], kwargs) and not cfg.inference
-    ):  # don't need to load dataset for these
-        train_dataset, eval_dataset, total_num_steps = prepare_dataset(cfg, tokenizer)
+    train_dataset, eval_dataset, total_num_steps = prepare_dataset(cfg, tokenizer)

-    if cfg.debug or "debug" in kwargs:
+    if cli_args.debug or cfg.debug:
        LOG.info("check_dataset_labels...")
        check_dataset_labels(
            train_dataset.select(
-                [random.randrange(0, len(train_dataset) - 1) for _ in range(5)]  # nosec
+                [
+                    random.randrange(len(train_dataset))  # nosec
+                    for _ in range(cli_args.debug_num_examples)
+                ]
            ),
            tokenizer,
+            num_examples=cli_args.debug_num_examples,
+            text_only=cli_args.debug_text_only,
        )

-    if prepare_ds_only:
-        LOG.info("Finished preparing dataset. Exiting...")
-        return
-
-    # Load the model and tokenizer
-    LOG.info("loading model and (optionally) peft_config...")
-    model, peft_config = load_model(cfg, tokenizer)
-
-    safe_serialization = cfg.save_safetensors is True
-
-    if "merge_lora" in kwargs and cfg.adapter is not None:
-        LOG.info("running merge of LoRA with base model")
-        model = model.merge_and_unload()
-        model.to(dtype=torch.float16)
-
-        if cfg.local_rank == 0:
-            LOG.info("saving merged model")
-            model.save_pretrained(
-                str(Path(cfg.output_dir) / "merged"),
-                safe_serialization=safe_serialization,
-            )
-            tokenizer.save_pretrained(str(Path(cfg.output_dir) / "merged"))
-        return
-
-    if cfg.inference:
-        LOG.info("calling do_inference function")
-        prompter: Optional[str] = "AlpacaPrompter"
-        if "prompter" in kwargs:
-            if kwargs["prompter"] == "None":
-                prompter = None
-            else:
-                prompter = kwargs["prompter"]
-        do_inference(cfg, model, tokenizer, prompter=prompter)
-        return
-
-    if "shard" in kwargs:
-        model.save_pretrained(cfg.output_dir, safe_serialization=safe_serialization)
-        return
-
-    trainer = setup_trainer(
-        cfg, train_dataset, eval_dataset, model, tokenizer, total_num_steps
+    return TrainDatasetMeta(
+        train_dataset=train_dataset,
+        eval_dataset=eval_dataset,
+        total_num_steps=total_num_steps,
    )

-    model.config.use_cache = False

-    if torch.__version__ >= "2" and sys.platform != "win32":
-        LOG.info("Compiling torch model")
-        model = torch.compile(model)
-
-    # go ahead and presave, so we have the adapter config available to inspect
-    if peft_config:
-        LOG.info(f"Pre-saving adapter config to {cfg.output_dir}")
-        peft_config.save_pretrained(cfg.output_dir)
-
-    # In case we want to stop early with ctrl+c, this is a nice to have to save the pretrained model
-    if cfg.local_rank == 0:
-
-        def terminate_handler(_, __, model):
-            if cfg.flash_optimum:
-                model = BetterTransformer.reverse(model)
-            model.save_pretrained(cfg.output_dir, safe_serialization=safe_serialization)
-            sys.exit(0)
-
-        signal.signal(
-            signal.SIGINT, lambda signum, frame: terminate_handler(signum, frame, model)
-        )
-
-    LOG.info("Starting trainer...")
-    if cfg.group_by_length:
-        LOG.info("hang tight... sorting dataset for group_by_length")
-    resume_from_checkpoint = cfg.resume_from_checkpoint
-    if cfg.resume_from_checkpoint is None and cfg.auto_resume_from_checkpoints:
-        possible_checkpoints = [
-            str(cp) for cp in Path(cfg.output_dir).glob("checkpoint-*")
-        ]
-        if len(possible_checkpoints) > 0:
-            sorted_paths = sorted(
-                possible_checkpoints,
-                key=lambda path: int(path.split("-")[-1]),
-            )
-            resume_from_checkpoint = sorted_paths[-1]
-            LOG.info(
-                f"Using Auto-resume functionality to start with checkpoint at {resume_from_checkpoint}"
-            )
-
-    if not Path(cfg.output_dir).is_dir():
-        os.makedirs(cfg.output_dir, exist_ok=True)
-    tokenizer.save_pretrained(cfg.output_dir)
-    if cfg.flash_optimum:
-        with torch.backends.cuda.sdp_kernel(
-            enable_flash=True, enable_math=True, enable_mem_efficient=True
-        ):
-            trainer.train(resume_from_checkpoint=resume_from_checkpoint)
+def do_cli(config: Path = Path("examples/"), **kwargs):
+    print_axolotl_text_art()
+    parsed_cfg = load_cfg(config, **kwargs)
+    parser = transformers.HfArgumentParser((TrainerCliArgs))
+    parsed_cli_args, _ = parser.parse_args_into_dataclasses(
+        return_remaining_strings=True
+    )
+    if parsed_cli_args.inference:
+        do_inference(cfg=parsed_cfg, cli_args=parsed_cli_args)
+    elif parsed_cli_args.merge_lora:
+        do_merge_lora(cfg=parsed_cfg, cli_args=parsed_cli_args)
+    elif parsed_cli_args.shard:
+        shard(cfg=parsed_cfg, cli_args=parsed_cli_args)
    else:
-        trainer.train(resume_from_checkpoint=resume_from_checkpoint)
-
-    LOG.info(f"Training Completed!!! Saving pre-trained model to {cfg.output_dir}")
-
-    # TODO do we need this fix? https://huggingface.co/docs/accelerate/usage_guides/fsdp#saving-and-loading
-    # only save on rank 0, otherwise it corrupts output on multi-GPU when multiple processes attempt to write the same file
-    if cfg.fsdp:
-        trainer.save_model(cfg.output_dir)
-    elif cfg.local_rank == 0:
-        if cfg.flash_optimum:
-            model = BetterTransformer.reverse(model)
-        model.save_pretrained(cfg.output_dir, safe_serialization=safe_serialization)
+        dataset_meta = load_datasets(cfg=parsed_cfg, cli_args=parsed_cli_args)
+        if parsed_cli_args.prepare_ds_only:
+            return
+        train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)


 if __name__ == "__main__":
-    fire.Fire(train)
+    fire.Fire(do_cli)
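The refactored `do_cli` above picks exactly one action from a set of boolean flags on a dataclass. A minimal, standard-library-only sketch of that dispatch pattern follows. It is an approximation for illustration: `CliArgs`, `parse_cli_args` and `choose_action` are hypothetical names, and `argparse` stands in for the `transformers.HfArgumentParser` used in the diff.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class CliArgs:
    """Illustrative stand-in for TrainerCliArgs (mode flags only)."""

    inference: bool = False
    merge_lora: bool = False
    shard: bool = False
    prepare_ds_only: bool = False


def parse_cli_args(argv):
    """Build one --flag per dataclass field; ignore anything unrecognised."""
    parser = argparse.ArgumentParser()
    for f in fields(CliArgs):
        parser.add_argument(f"--{f.name}", action="store_true")
    known, _unknown = parser.parse_known_args(argv)
    return CliArgs(**vars(known))


def choose_action(args: CliArgs) -> str:
    """Mirror the if/elif order of do_cli: first matching flag wins."""
    if args.inference:
        return "inference"
    if args.merge_lora:
        return "merge_lora"
    if args.shard:
        return "shard"
    if args.prepare_ds_only:
        return "prepare_dataset"
    return "train"


print(choose_action(parse_cli_args(["config.yml", "--merge_lora"])))
```

Because the branches are checked in a fixed order, passing several mode flags at once silently runs only the first match; a stricter CLI could reject such combinations instead.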
--- a/setup.py
+++ b/setup.py
@@ -2,31 +2,40 @@

 from setuptools import find_packages, setup

-install_requires = []
-with open("./requirements.txt", encoding="utf-8") as requirements_file:
-    # don't include peft yet until we check the int4
-    # need to manually install peft for now...
-    reqs = [r.strip() for r in requirements_file.readlines() if "peft" not in r]
-    reqs = [r for r in reqs if r and r[0] != "#"]
-    for r in reqs:
-        install_requires.append(r)
+
+def parse_requirements():
+    _install_requires = []
+    _dependency_links = []
+    with open("./requirements.txt", encoding="utf-8") as requirements_file:
+        lines = [r.strip() for r in requirements_file.readlines()]
+        for line in lines:
+            if line.startswith("--extra-index-url"):
+                # Handle custom index URLs
+                _, url = line.split()
+                _dependency_links.append(url)
+            elif "flash-attn" not in line and line and line[0] != "#":
+                # Handle standard packages
+                _install_requires.append(line)
+    return _install_requires, _dependency_links
+
+
+install_requires, dependency_links = parse_requirements()
+

 setup(
    name="axolotl",
-    version="0.1",
-    description="You know you're going to axolotl questions",
+    version="0.3.0",
+    description="LLM Trainer",
+    long_description="Axolotl is a tool designed to streamline the fine-tuning of various AI models, offering support for multiple configurations and architectures.",
    package_dir={"": "src"},
    packages=find_packages(),
    install_requires=install_requires,
+    dependency_links=dependency_links,
    extras_require={
-        "gptq": [
-            "alpaca_lora_4bit @ git+https://github.com/winglian/alpaca_lora_4bit.git@setup_pip",
-        ],
-        "gptq_triton": [
-            "alpaca_lora_4bit[triton] @ git+https://github.com/winglian/alpaca_lora_4bit.git@setup_pip",
+        "flash-attn": [
+            "flash-attn>=2.2.1",
        ],
        "extras": [
-            "flash-attn",
            "deepspeed",
        ],
    },
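To make the behaviour of `parse_requirements` above easier to check, here is a sketch of the same filtering logic applied to an in-memory string rather than `./requirements.txt`. The function name `parse_requirements_text` and the sample text are assumptions made for this illustration.

```python
def parse_requirements_text(text: str):
    """Split requirement lines into install requirements and index URLs."""
    install_requires, dependency_links = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            # Skip blank lines and comments.
            continue
        if line.startswith("--extra-index-url"):
            # Keep only the URL part of the option.
            _, url = line.split()
            dependency_links.append(url)
        elif "flash-attn" not in line:
            # flash-attn is left out here; setup.py exposes it as an extra.
            install_requires.append(line)
    return install_requires, dependency_links


sample = """\
--extra-index-url https://download.pytorch.org/whl/cu118
torch==2.0.1
# qlora things
flash-attn>=2.2.1
PyYAML>=6.0
"""

reqs, links = parse_requirements_text(sample)
print(reqs)
print(links)
```

Note that recent pip releases are understood to ignore `dependency_links`, so the extra index URLs may still need to be passed to pip explicitly at install time.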
--- a/src/axolotl/common/__init__.py
+++ b/src/axolotl/common/__init__.py
--- a/src/axolotl/common/cli.py
+++ b/src/axolotl/common/cli.py
@@ -0,0 +1,43 @@
+"""
+shared module for cli specific things
+"""
+
+import logging
+from dataclasses import dataclass, field
+from typing import Optional
+
+from axolotl.logging_config import configure_logging
+from axolotl.utils.dict import DictDefault
+from axolotl.utils.models import load_model, load_tokenizer
+
+configure_logging()
+LOG = logging.getLogger("axolotl.common.cli")
+
+
+@dataclass
+class TrainerCliArgs:
+    """
+    dataclass representing the various non-training arguments
+    """
+
+    debug: bool = field(default=False)
+    debug_text_only: bool = field(default=False)
+    debug_num_examples: int = field(default=5)
+    inference: bool = field(default=False)
+    merge_lora: bool = field(default=False)
+    prepare_ds_only: bool = field(default=False)
+    prompter: Optional[str] = field(default=None)
+    shard: bool = field(default=False)
+
+
+def load_model_and_tokenizer(
+    *,
+    cfg: DictDefault,
+    cli_args: TrainerCliArgs,
+):
+    LOG.info(f"loading tokenizer... {cfg.tokenizer_config or cfg.base_model_config}")
+    tokenizer = load_tokenizer(cfg)
+    LOG.info("loading model and (optionally) peft_config...")
+    model, _ = load_model(cfg, tokenizer, inference=cli_args.inference)
+
+    return model, tokenizer
--- a/src/axolotl/logging_config.py
+++ b/src/axolotl/logging_config.py
@@ -1,16 +1,43 @@
-"""Logging configuration settings"""
+"""
+Common logging module for axolotl
+"""

 import os
 import sys
+from logging import Formatter
 from logging.config import dictConfig
 from typing import Any, Dict

+from colorama import Fore, Style, init
+
+
+class ColorfulFormatter(Formatter):
+    """
+    Formatter to add coloring to log messages by log type
+    """
+
+    COLORS = {
+        "WARNING": Fore.YELLOW,
+        "ERROR": Fore.RED,
+        "CRITICAL": Fore.RED + Style.BRIGHT,
+    }
+
+    def format(self, record):
+        record.rank = int(os.getenv("LOCAL_RANK", "0"))
+        log_message = super().format(record)
+        return self.COLORS.get(record.levelname, "") + log_message + Style.RESET_ALL
+
+
 DEFAULT_LOGGING_CONFIG: Dict[str, Any] = {
    "version": 1,
    "formatters": {
        "simple": {
            "format": "[%(asctime)s] [%(levelname)s] [%(name)s.%(funcName)s:%(lineno)d] [PID:%(process)d] %(message)s",
        },
+        "colorful": {
+            "()": ColorfulFormatter,
+            "format": "[%(asctime)s] [%(levelname)s] [%(name)s.%(funcName)s:%(lineno)d] [PID:%(process)d] [RANK:%(rank)d] %(message)s",
+        },
    },
    "filters": {},
    "handlers": {
@@ -20,14 +47,25 @@ DEFAULT_LOGGING_CONFIG: Dict[str, Any] = {
            "filters": [],
            "stream": sys.stdout,
        },
+        "color_console": {
+            "class": "logging.StreamHandler",
+            "formatter": "colorful",
+            "filters": [],
+            "stream": sys.stdout,
+        },
    },
    "root": {"handlers": ["console"], "level": os.getenv("LOG_LEVEL", "INFO")},
    "loggers": {
-        "axolotl": {"handlers": ["console"], "level": "DEBUG", "propagate": False},
+        "axolotl": {
+            "handlers": ["color_console"],
+            "level": "DEBUG",
+            "propagate": False,
+        },
    },
 }


 def configure_logging():
    """Configure with default logging"""
+    init()  # Initialize colorama
    dictConfig(DEFAULT_LOGGING_CONFIG)
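The formatter added above does two things: it injects a `rank` attribute read from `LOCAL_RANK` so the format string can use `%(rank)d`, and it wraps the message in a colour chosen by level. The sketch below shows the same idea with raw ANSI escape codes so it runs without colorama. `RankColorFormatter` and the constants are hypothetical names for this illustration, and unlike the diff it only emits a reset when a colour was applied.

```python
import logging
import os

# ANSI escape sequences (assumed terminal support): yellow, red, bold, reset.
YELLOW, RED, BRIGHT, RESET_ALL = "\033[33m", "\033[31m", "\033[1m", "\033[0m"


class RankColorFormatter(logging.Formatter):
    """Colour messages by level and expose the local rank to the format string."""

    COLORS = {"WARNING": YELLOW, "ERROR": RED, "CRITICAL": RED + BRIGHT}

    def format(self, record):
        # Must be set before super().format() so %(rank)d can be resolved.
        record.rank = int(os.getenv("LOCAL_RANK", "0"))
        message = super().format(record)
        color = self.COLORS.get(record.levelname, "")
        return f"{color}{message}{RESET_ALL}" if color else message


formatter = RankColorFormatter("[%(levelname)s] [RANK:%(rank)d] %(message)s")
record = logging.LogRecord(
    "axolotl.demo", logging.WARNING, "demo.py", 1, "low disk space", None, None
)
print(formatter.format(record))
```

Setting the attribute inside `format` keeps call sites unchanged: loggers do not need to pass `extra={"rank": ...}` on every call.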
--- a/src/axolotl/monkeypatch/llama_attn_hijack_flash.py
+++ b/src/axolotl/monkeypatch/llama_attn_hijack_flash.py
@@ -2,7 +2,9 @@

 # copied from https://github.com/lm-sys/FastChat/blob/main/fastchat/train/llama_flash_attn_monkey_patch.py

+import logging
 import warnings
+from functools import partial
 from typing import List, Optional, Tuple, Union

 import torch
@@ -33,6 +35,9 @@ except ImportError:
    )


+LOG = logging.getLogger("axolotl")
+
+
 def replace_llama_attn_with_flash_attn(packed: Optional[bool] = False):
    transformers.models.llama.modeling_llama.LlamaModel._prepare_decoder_attention_mask = (  # pylint: disable=protected-access
        _prepare_decoder_attention_mask
@@ -44,6 +49,34 @@ def replace_llama_attn_with_flash_attn(packed: Optional[bool] = False):
            llama_model_forward
        )

+    try:
+        from flash_attn.losses.cross_entropy import CrossEntropyLoss
+
+        LOG.info("patching with flash_attn.losses.cross_entropy")
+        transformers.models.llama.modeling_llama.CrossEntropyLoss = partial(
+            CrossEntropyLoss, inplace_backward=True
+        )
+    except ImportError:
+        LOG.info(
+            "optimized flash-attention CrossEntropyLoss not found (run `pip install 'git+https://github.com/Dao-AILab/flash-attention.git#egg=xentropy_cuda_lib&subdirectory=csrc/xentropy'`)"
+        )
+
+    try:
+        from flash_attn.ops.rms_norm import RMSNorm
+
+        class LlamaRMSNorm(RMSNorm):
+            """Patched LlamaRMSNorm"""
+
+            def __init__(self, hidden_size, eps=1e-6):
+                super().__init__(hidden_size, eps=eps)
+
+        LOG.info("patching with flash_attn.ops.rms_norm")
+        transformers.models.llama.modeling_llama.LlamaRMSNorm = LlamaRMSNorm
+    except ImportError:
+        LOG.info(
+            "optimized flash-attention RMSNorm not found (run `pip install 'git+https://github.com/Dao-AILab/flash-attention.git#egg=dropout_layer_norm&subdirectory=csrc/layer_norm'`)"
+        )
+

 # Disable the transformation of the attention mask in LlamaModel as the flash attention
 # requires the attention mask to be the same as the key_padding_mask
@@ -158,7 +191,7 @@ def flashattn_forward(
    else:
        # turn off FA causal mask after first inference autoregressive iteration
        # only on first autoregressive step q,k,v have same seqlen
-        is_causal = past_key_value is not None
+        is_causal = key_states.shape == query_states.shape

    if cu_seqlens is not None and max_seqlen is not None:
        # special handling using sample packing
@@ -169,7 +202,7 @@ def flashattn_forward(
        qkv = rearrange(qkv, "b s ... -> (b s) ...")

        output = flash_attn_varlen_qkvpacked_func(
-            qkv, cu_seqlens, max_seqlen, 0.0, softmax_scale=None, causal=is_causal
+            qkv, cu_seqlens, max_seqlen, 0.0, softmax_scale=None, causal=True
        )
        output = rearrange(output, "(b s) ... -> b s ...", b=bsz)
    elif query_states.shape == key_states.shape:
--- a/src/axolotl/monkeypatch/relora.py
+++ b/src/axolotl/monkeypatch/relora.py
@@ -0,0 +1,393 @@
+"""Implements the ReLoRA training procedure from https://arxiv.org/abs/2307.05695, minus the initial full fine-tune."""
+import glob
+import json
+import logging
+import os.path
+import shutil
+from pathlib import Path
+from typing import Dict, List, Sequence
+
+import bitsandbytes as bnb
+import peft
+import safetensors.torch as st
+import torch
+from huggingface_hub import snapshot_download
+from torch.optim.lr_scheduler import LRScheduler
+from torch.optim.optimizer import Optimizer
+from transformers import (
+    TrainerCallback,
+    TrainerControl,
+    TrainerState,
+    TrainingArguments,
+)
+from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
+
+from axolotl.utils.dict import DictDefault
+from axolotl.utils.distributed import is_main_process
+
+LOG = logging.getLogger("axolotl.relora")
+
+
+def reset_optimizer(optimizer: torch.optim.Optimizer):
+    for group in optimizer.param_groups:
+        for param in group["params"]:
+            param_state = optimizer.state[param]
+            for key in param_state:
+                if "qmap" in key:
+                    continue
+
+                if key == "step" and isinstance(param_state[key], int):
+                    param_state[key] = 0
+                else:
+                    param_state[key] = torch.zeros_like(param_state[key])
+
+
+class ReLoRACallback(TrainerCallback):
+    """Callback to merge LoRA weights into the base model and save full-weight checkpoints"""
+
+    def __init__(self, cfg: DictDefault):
+        self.relora_steps = cfg.relora_steps
+        self.cpu_offload = cfg.relora_cpu_offload
+        self.quantized = cfg.load_in_4bit or cfg.load_in_8bit
+        self.last_full_model = cfg.base_model
+        self.resume_from_checkpoint = cfg.resume_from_checkpoint
+
+        if not os.path.exists(self.last_full_model):
+            self.last_full_model = str(Path(snapshot_download(cfg.base_model)))
+
+        assert os.path.exists(
+            self.last_full_model
+        ), "for ReLORA base_model must be a local path"
+
+        self.num_lora_restarts = 0
+        self.need_full_save = False
+
+    def on_train_begin(
+        self,
+        _args: TrainingArguments,
+        _state: TrainerState,
+        control: TrainerControl,
+        model: peft.LoraModel,
+        **_kwargs,
+    ):
+        if self.resume_from_checkpoint:
+            weight_path = os.path.join(self.resume_from_checkpoint, "relora")
+            if not os.path.exists(weight_path):
+                LOG.warning(
+                    "Resuming ReLoRA from checkpoint, but no full-weight save found"
+                )
+            else:
+                LOG.info(f"Loading adjusted base weights from {weight_path}")
+                load_weight_checkpoint(model, weight_path)
+        return control
+
+    def on_step_begin(
+        self,
+        args: TrainingArguments,
+        state: TrainerState,
+        control: TrainerControl,
+        model: peft.LoraModel,
+        optimizer: torch.optim.Optimizer,
+        **_kwargs,
+    ):
+        if state.global_step > 0 and state.global_step % self.relora_steps == 0:
+            checkpoint_folder = os.path.join(
+                args.output_dir,
+                f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}",
+                "relora",
+            )
+
+            with torch.no_grad():
+                merge_and_save(
+                    model,
+                    self.last_full_model,
+                    checkpoint_folder,
+                    reinit=True,
+                    quantized=self.quantized,
+                    actually_save=is_main_process(),
+                    cpu_offload=self.cpu_offload,
+                )
+                reset_optimizer(optimizer)
+
+            if self.quantized:
+                self.last_full_model = checkpoint_folder
+            self.num_lora_restarts += 1
+
+        return control
+
+    def on_save(
+        self,
+        args: TrainingArguments,
+        state: TrainerState,
+        control: TrainerControl,
+        model: peft.LoraModel,
+        **_kwargs,
+    ):
+        checkpoint_folder = os.path.join(
+            args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}", "relora"
+        )
+        if (
+            state.global_step >= self.relora_steps
+            and state.global_step % self.relora_steps != 0
+        ):
+            if self.quantized:
+                if is_main_process() and self.last_full_model != checkpoint_folder:
+                    # ensure the latest full parameter save is in the latest checkpoint
+                    # folder, so that automatic pruning of checkpoints does not remove it
+                    LOG.info(f"moving last full parameter save to {checkpoint_folder}")
+                    os.makedirs(checkpoint_folder, exist_ok=True)
+                    chunks = glob.glob(
+                        f"{self.last_full_model}/model*.safetensors"
+                    ) + glob.glob(f"{self.last_full_model}/model*.index.json")
+                    for path in chunks:
+                        new_path = os.path.abspath(shutil.move(path, checkpoint_folder))
+                        try:
+                            os.symlink(new_path, path)
+                        except OSError:
+                            # probably on Windows without permission to symlink
+                            pass
+
+                    self.last_full_model = checkpoint_folder
+            else:
+                model.model.save_pretrained(checkpoint_folder, safe_serialization=True)
+
+        return control
+
+    def on_log(
+        self,
+        _args: TrainingArguments,
+        _state: TrainerState,
+        control: TrainerControl,
+        logs: Dict[str, float],
+        **_kwargs,
+    ):
+        logs["num_lora_restarts"] = self.num_lora_restarts
+        return control
+
+    def on_train_end(
+        self,
+        args: TrainingArguments,
+        _state: TrainerState,
+        control: TrainerControl,
+        model: peft.LoraModel,
+        **_kwargs,
+    ):
+        if self.quantized:
+            # perform final merge and save
+            with torch.no_grad():
+                merge_and_save(
+                    model,
+                    self.last_full_model,
+                    args.output_dir,
+                    reinit=False,
+                    quantized=self.quantized,
+                    actually_save=is_main_process(),
+                    cpu_offload=self.cpu_offload,
+                )
+        # no need to save if unquantized, as finetune.py will call merge_and_unload()
+        return control
+
+
+class ReLoRAScheduler(LRScheduler):
+    """Wraps another scheduler to apply per-lora-restart learning rate warmups."""
+
+    def __init__(
+        self,
+        optimizer: Optimizer,
+        inner_schedule: LRScheduler,
+        relora_steps: int,
+        warmup_steps: int,
+        min_lr_scale: float = 0.001,
+    ) -> None:
+        self.inner_schedule = inner_schedule
+        self.relora_steps = relora_steps
+        self.warmup_steps = warmup_steps
+        self.min_lr_scale = min_lr_scale
+        super().__init__(optimizer, inner_schedule.last_epoch, inner_schedule.verbose)
+
+    def get_lr(self) -> float:
+        self.inner_schedule.last_epoch = self.last_epoch
+
+        original = self.inner_schedule.get_lr()
+        step = self.last_epoch
+        if step < self.relora_steps:
+            scale = 1
+        else:
+            cycle_t = min(1.0, (step % self.relora_steps) / self.warmup_steps)
+            scale = cycle_t * (1 - self.min_lr_scale) + self.min_lr_scale
+
+        if isinstance(original, Sequence):
+            return [lr * scale for lr in original]
+        return original * scale
+
+
+def sharded_paths(path: str, module_names: List[str]) -> Dict[str, str]:
+    model_name = "model.safetensors"
+    if not os.path.exists(str(Path(path) / model_name)) and not os.path.exists(
+        str(Path(path) / f"{model_name}.index.json")
+    ):
+        model_name = "pytorch_model.bin"
+
+    index_path = str(Path(path) / f"{model_name}.index.json")
+    if os.path.exists(index_path):
+        with open(index_path, "r", encoding="utf-8") as file:
+            data = json.load(file)
+        return data["weight_map"]
+    return {(module_name + ".weight"): model_name for module_name in module_names}
+
+
+def lora_delta_weight(layer: peft.tuners.lora.LoraLayer, device) -> torch.Tensor:
+    if isinstance(layer, (peft.tuners.lora.Linear8bitLt, peft.tuners.lora.Linear4bit)):
+        adapter = layer.active_adapter
+        return (
+            peft.utils.transpose(
+                layer.lora_B[adapter].weight.detach().to(device)
+                @ layer.lora_A[adapter].weight.detach().to(device),
+                getattr(layer, "fan_in_fan_out", False),
+            )
+            * layer.scaling[adapter]
+        )
+
+    return layer.get_delta_weight().to(device)
+
+
+def find_lora_modules(model: peft.LoraModel) -> Dict[str, peft.tuners.lora.LoraLayer]:
+    modules: Dict[str, peft.tuners.lora.LoraLayer] = {}
+
+    key_list = [key for key, _ in model.model.named_modules() if "lora" not in key]
+    for key in key_list:
+        try:
+            # pylint: disable=protected-access
+            _parent, target, _target_name = peft.utils._get_submodules(model.model, key)
+        except AttributeError:
+            continue
+
+        if isinstance(target, peft.tuners.lora.LoraLayer):
+            modules[key] = target
+
+    return modules
+
+
+def update_weights(
+    target: peft.tuners.lora.LoraLayer, new_weight: torch.Tensor, reinit: bool, device
+):
+    if reinit:
+        for adapter_name in target.lora_A:
+            target.reset_lora_parameters(adapter_name)
+        for adapter_name in target.lora_embedding_A:
+            target.reset_lora_parameters(adapter_name)
+
+    if isinstance(target, peft.tuners.lora.Linear4bit):
+        # This could be faster, but the quantization of Linear4bit weights occurs
+        # when the module is moved from cpu to gpu. Without meddling *too* deeply in
+        # PEFT's innards or maintaining a duplicate of that codepath, this is good
+        # enough for now.
+        target.weight.quant_state = None
+        target.weight.data = new_weight.cpu()
+        target.to(device)
+    elif isinstance(target, peft.tuners.lora.Linear8bitLt):
+        target.weight = bnb.nn.Int8Params(new_weight, requires_grad=False).to(device)
+    else:
+        target.weight.data = new_weight.to(device)
+
+
+def merge_and_save(
+    model: peft.LoraModel,
+    model_src: str,
+    model_dst: str,
+    reinit: bool = False,
+    quantized: bool = False,
+    cpu_offload: bool = False,
+    actually_save: bool = True,
+):
+    modules = find_lora_modules(model)
+
+    if not quantized:
+        for module_name, target in modules.items():
+            update = target.get_delta_weight(target.active_adapter).detach()
+            target.weight.data += update
+
+            if reinit:
+                for adapter_name in target.lora_A:
+                    target.reset_lora_parameters(adapter_name)
+                for adapter_name in target.lora_embedding_A:
+                    target.reset_lora_parameters(adapter_name)
+        return
+
+    os.makedirs(model_dst, exist_ok=True)
+    shard_paths = sharded_paths(model_src, modules.keys())
+    out_shard_paths = {}
+
+    unique_shards = list(set(shard_paths.values()))
+    for shard_path in unique_shards:
+        out_tensors = {}
+        if shard_path.endswith(".safetensors"):
+            in_tensors = st.load_file(str(Path(model_src) / shard_path))
+        else:
+            in_tensors = torch.load(Path(model_src) / shard_path)
+            if "state_dict" in in_tensors:
+                in_tensors = in_tensors["state_dict"]
+
+        for module_name, target in modules.items():
+            key = module_name + ".weight"
+            if key not in shard_paths or shard_paths[key] != shard_path:
+                continue
+
+            orig_weight = in_tensors[key]
+            old_dev = target.weight.device
+            math_dev = "cpu" if cpu_offload else old_dev
+
+            delta_weight = lora_delta_weight(target, math_dev)
+            new_weight = orig_weight.to(math_dev) + delta_weight
+            del delta_weight
+
+            if actually_save:
+                out_tensors[key] = new_weight.half().cpu()
+
+            update_weights(target, new_weight, reinit=reinit, device=old_dev)
+
+        if actually_save:
+            out_shard_name = shard_path
+            if out_shard_name.startswith("pytorch_model"):
+                out_shard_name = (
+                    out_shard_name.replace("pytorch_model", "model")
+                    .removesuffix(".bin") + ".safetensors"
+                )
+
+            for module_name in in_tensors:
+                if module_name not in out_tensors:
+                    out_tensors[module_name] = in_tensors[module_name].half()
+                out_shard_paths[module_name] = out_shard_name
+
+            shard_fn = str(Path(model_dst) / out_shard_name)
+            LOG.info(f"saving tensors to {shard_fn}")
+            st.save_file(out_tensors, shard_fn, metadata={"format": "pt"})
+
+        del in_tensors
+        del out_tensors
+        torch.cuda.empty_cache()
+
+    if actually_save and len(unique_shards) > 1:
+        with open(
+            str(Path(model_dst, "model.safetensors.index.json")), "w", encoding="utf-8"
+        ) as file:
+            json.dump({"metadata": {}, "weight_map": out_shard_paths}, file)
+
+
+def load_weight_checkpoint(model: peft.LoraModel, checkpoint_path: str):
+    modules = find_lora_modules(model)
+    shard_paths = sharded_paths(checkpoint_path, modules.keys())
+    unique_shards = list(set(shard_paths.values()))
+
+    for shard_path in unique_shards:
+        tensors = st.load_file(os.path.join(checkpoint_path, shard_path))
+
+        for module_name, target in modules.items():
+            key = module_name + ".weight"
+            if key not in shard_paths or shard_paths[key] != shard_path:
+                continue
+
+            new_weight = tensors[key]
+            update_weights(
+                target, new_weight, reinit=False, device=target.weight.device
+            )
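
The per-restart warmup applied by `ReLoRAScheduler.get_lr` above can be checked in isolation. The sketch below reimplements only the scale factor as a standalone function. `relora_lr_scale` is a hypothetical helper name that is not part of this patch, and the sketch assumes `warmup_steps > 0`.

```python
def relora_lr_scale(
    step: int,
    relora_steps: int,
    warmup_steps: int,
    min_lr_scale: float = 0.001,
) -> float:
    """Scale factor applied to the inner schedule's learning rate at `step`.

    Hypothetical standalone helper mirroring the arithmetic in get_lr;
    assumes warmup_steps > 0.
    """
    if step < relora_steps:
        # Before the first restart the inner schedule is used unchanged.
        return 1.0
    # Position within the current restart cycle, clamped to at most 1.0.
    cycle_t = min(1.0, (step % relora_steps) / warmup_steps)
    # Linear ramp from min_lr_scale up to 1.0 over warmup_steps.
    return cycle_t * (1 - min_lr_scale) + min_lr_scale
```

With `relora_steps=100` and `warmup_steps=10`, the scale drops to `min_lr_scale` at step 100 and ramps back to roughly 1.0 by step 110.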
--- a/src/axolotl/prompt_strategies/__init__.py
+++ b/src/axolotl/prompt_strategies/__init__.py
@@ -2,8 +2,10 @@

 import importlib

+from axolotl.prompt_strategies.user_defined import UserDefinedDatasetConfig

-def load(strategy, tokenizer, cfg):
+
+def load(strategy, tokenizer, cfg, ds_cfg):
    try:
        load_fn = "load"
        if strategy.split(".")[-1].startswith("load_"):
@@ -11,6 +13,9 @@ def load(strategy, tokenizer, cfg):
            strategy = ".".join(strategy.split(".")[:-1])
        mod = importlib.import_module(f".{strategy}", "axolotl.prompt_strategies")
        func = getattr(mod, load_fn)
-        return func(tokenizer, cfg)
+        load_kwargs = {}
+        if strategy == "user_defined":
+            load_kwargs["ds_cfg"] = UserDefinedDatasetConfig(**ds_cfg)
+        return func(tokenizer, cfg, **load_kwargs)
    except Exception:  # pylint: disable=broad-exception-caught
        return None
--- a/src/axolotl/prompt_strategies/alpaca_w_system.py
+++ b/src/axolotl/prompt_strategies/alpaca_w_system.py
@@ -57,6 +57,8 @@ class SystemDataPrompter(AlpacaPrompter):
    Alpaca Style Prompter that uses system prompts from the dataset
    """

+    system_format: str = "### System:\n{system}\n\n"
+
    def build_prompt_w_system(
        self,
        system: str,
--- a/src/axolotl/prompt_strategies/metharme.py
+++ b/src/axolotl/prompt_strategies/metharme.py
@@ -0,0 +1,76 @@
+"""Module containing the MetharmePromptTokenizingStrategy and MetharmePrompter class"""
+
+import logging
+from typing import Tuple
+
+from axolotl.prompt_tokenizers import InstructionPromptTokenizingStrategy
+from axolotl.prompters import AlpacaPrompter
+
+LOG = logging.getLogger("axolotl")
+
+IGNORE_TOKEN_ID = -100
+
+# pylint: disable=duplicate-code
+
+
+class MetharmePromptTokenizingStrategy(InstructionPromptTokenizingStrategy):
+    """
+    Tokenizing strategy for the Metharme models
+    """
+
+    def parse_instruction_fields(self, prompt) -> Tuple[str, str, str]:
+        return (prompt["prompt"], "", prompt["generation"])
+
+    def _tokenize(
+        self,
+        prompt: str,
+        add_eos_token: bool = True,
+        strip_bos_token: bool = False,
+        num_eos_tokens: int = 3,
+    ):
+        result = self.tokenizer(
+            prompt,
+            truncation=True,
+            max_length=self.sequence_len,
+            padding=False,
+            return_tensors=None,
+        )
+        if len(result["input_ids"]) == 0:
+            LOG.warning("Tokenizer result is empty. You may want to audit your dataset")
+        # If there's already an EOS token there, subtract from the number added
+        if result["input_ids"][-1:] == [self.tokenizer.eos_token_id]:
+            num_eos_tokens -= 1
+
+        if num_eos_tokens > 0 and add_eos_token and len(result["input_ids"]) > 0:
+            for _ in range(num_eos_tokens):
+                if len(result["input_ids"]) < self.sequence_len:
+                    result["input_ids"].append(self.tokenizer.eos_token_id)
+                    result["attention_mask"].append(1)
+
+        if result["input_ids"][:1] == [self.tokenizer.bos_token_id] and strip_bos_token:
+            result["input_ids"] = result["input_ids"][1:]
+            result["attention_mask"] = result["attention_mask"][1:]
+
+        result["labels"] = result["input_ids"].copy()
+        return result
+
+
+class MetharmePrompter(AlpacaPrompter):
+    """
+    Prompter for the Metharme models.
+    """
+
+    system_prompt = ""
+    system_no_input_prompt = ""
+    system_format = ""
+    turn_format = "{instruction}"
+    turn_no_input_format = "{instruction}"
+
+    def __init__(self, *args, **kwargs):  # pylint: disable=super-init-not-called
+        pass
+
+
+def load(tokenizer, cfg):
+    return MetharmePromptTokenizingStrategy(
+        MetharmePrompter(), tokenizer, cfg.train_on_inputs, cfg.sequence_len
+    )
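
The EOS handling in `MetharmePromptTokenizingStrategy._tokenize` above can be illustrated on plain lists of token ids. This is a minimal sketch rather than the strategy itself: `append_eos` is a hypothetical helper, the token ids are made up, and it assumes an empty tokenizer result should be returned unchanged.

```python
from typing import List


def append_eos(
    input_ids: List[int], eos_id: int, max_len: int, num_eos: int = 3
) -> List[int]:
    """Append up to `num_eos` EOS ids without exceeding `max_len`."""
    ids = list(input_ids)
    # An EOS already at the end counts toward the total.
    # Slicing keeps this safe when `ids` is empty.
    if ids[-1:] == [eos_id]:
        num_eos -= 1
    if ids:
        for _ in range(max(num_eos, 0)):
            # Stop appending once the sequence length limit is reached.
            if len(ids) < max_len:
                ids.append(eos_id)
    return ids
```

For example, `append_eos([5, 6], eos_id=2, max_len=10)` ends with three EOS ids, and so does `append_eos([5, 6, 2], eos_id=2, max_len=10)`, because the existing EOS is counted.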
--- a/src/axolotl/prompt_strategies/user_defined.py
+++ b/src/axolotl/prompt_strategies/user_defined.py
@@ -0,0 +1,98 @@
+"""
+User Defined prompts with configuration from the YML config
+"""
+
+from dataclasses import dataclass
+from functools import partial
+from typing import Optional, Tuple
+
+from axolotl.prompt_strategies.alpaca_w_system import (
+    InstructionWSystemPromptTokenizingStrategy,
+    SystemDataPrompter,
+)
+
+
+@dataclass
+class UserDefinedDatasetConfig:
+    """
+    dataclass configuration representing a user-defined dataset type
+    """
+
+    system_prompt: str = ""
+    field_system: str = "system"
+    field_instruction: str = "instruction"
+    field_input: str = "input"
+    field_output: str = "output"
+    format: str = "{instruction} {input} "
+    no_input_format: str = "{instruction} "
+    system_format: str = "{system}"
+
+    def __getitem__(self, item):
+        return getattr(self, item)
+
+
+class UserDefinedPromptTokenizationStrategy(InstructionWSystemPromptTokenizingStrategy):
+    """
+    Prompt Tokenization Strategy for user defined prompts
+    """
+
+
+def load(tokenizer, cfg, ds_cfg: Optional[UserDefinedDatasetConfig] = None):
+    if not ds_cfg:
+        raise ValueError("Missing dataset prompt configuration")
+
+    system_prompt = ""
+    if ds_cfg.system_prompt:
+        system_prompt = ds_cfg.system_prompt
+
+    def parse_instruction_fields(
+        field_instruction,
+        field_input,
+        field_output,
+        field_system,
+        system_prompt,
+        prompt,
+    ) -> Tuple[str, str, str, str]:
+        return (
+            prompt[field_instruction],
+            prompt[field_input] if field_input in prompt else "",
+            prompt[field_output] if field_output in prompt else "",
+            prompt[field_system] if field_system in prompt else system_prompt,
+        )
+
+    turn_format = ds_cfg.format
+    turn_no_input_format = ds_cfg.no_input_format
+    system_format = ds_cfg.system_format
+
+    class UserDefinedPrompter(SystemDataPrompter):
+        """
+        Prompter for user defined prompts
+        """
+
+        def match_prompt_style(self):
+            self.turn_format = turn_format
+            self.turn_no_input_format = turn_no_input_format
+            self.system_format = system_format
+
+    prompter = UserDefinedPrompter()
+
+    strat = UserDefinedPromptTokenizationStrategy(
+        prompter,
+        tokenizer,
+        cfg.train_on_inputs,
+        cfg.sequence_len,
+    )
+
+    setattr(
+        strat,
+        "parse_instruction_fields",
+        partial(
+            parse_instruction_fields,
+            ds_cfg.field_instruction,
+            ds_cfg.field_input,
+            ds_cfg.field_output,
+            ds_cfg.field_system,
+            system_prompt,
+        ),
+    )
+    return strat
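
The `partial`-bound `parse_instruction_fields` above maps configurable column names onto the (instruction, input, output, system) tuple. The sketch below shows that mapping on a plain dict. The column names `question`, `context`, and `answer` and the sample row are made up for illustration, and `parse_fields` is a hypothetical stand-in for the nested function in the patch.

```python
from functools import partial
from typing import Dict, Tuple


def parse_fields(
    field_instruction: str,
    field_input: str,
    field_output: str,
    field_system: str,
    system_prompt: str,
    row: Dict[str, str],
) -> Tuple[str, str, str, str]:
    """Return (instruction, input, output, system) for one dataset row."""
    return (
        row[field_instruction],  # the instruction column is required
        row.get(field_input, ""),  # optional columns fall back to ""
        row.get(field_output, ""),
        row.get(field_system, system_prompt),  # per-row system, else the default
    )


# Bind the column names once, as the patch does with functools.partial.
parse_row = partial(
    parse_fields, "question", "context", "answer", "system", "You are helpful."
)

row = {"question": "What is 2+2?", "answer": "4"}
fields = parse_row(row)
```

Here the row has no `context` or `system` column, so the input is empty and the default system prompt is used.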
--- a/src/axolotl/prompt_tokenizers.py
+++ b/src/axolotl/prompt_tokenizers.py
@@ -13,7 +13,7 @@ from axolotl.prompters import IGNORE_TOKEN_ID
 LOG = logging.getLogger("axolotl")

 IGNORE_INDEX = -100
-LLAMA_DEFAULT_PAD_TOKEN = "[PAD]"  # nosec
+LLAMA_DEFAULT_PAD_TOKEN = "<pad>"  # nosec
 LLAMA_DEFAULT_EOS_TOKEN = "</s>"  # nosec
 LLAMA_DEFAULT_BOS_TOKEN = "<s>"  # nosec
 LLAMA_DEFAULT_UNK_TOKEN = "<unk>"  # nosec
@@ -85,7 +85,11 @@ class PromptTokenizingStrategy(abc.ABC):
            result["input_ids"].append(self.tokenizer.eos_token_id)
            result["attention_mask"].append(1)

-        if result["input_ids"][0] == self.tokenizer.bos_token_id and strip_bos_token:
+        if (
+            len(result["input_ids"]) > 0
+            and result["input_ids"][0] == self.tokenizer.bos_token_id
+            and strip_bos_token
+        ):
            result["input_ids"] = result["input_ids"][1:]
            result["attention_mask"] = result["attention_mask"][1:]

--- a/src/axolotl/prompters.py
+++ b/src/axolotl/prompters.py
@@ -26,7 +26,7 @@ class AlpacaPrompter:

    system_prompt = "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n"
    system_no_input_prompt = "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n"
-    system_format: str
+    system_format: str = "{system}"
    turn_format: str
    turn_no_input_format: str
    prompt_style: Optional[PromptStyle] = None
@@ -63,13 +63,17 @@ class AlpacaPrompter:
        # returns the full prompt from instruction and optional input
        # if a label (=response, =output) is provided, it's also appended.
        if input:
-            res = self.system_prompt + self.turn_format.format(
-                instruction=instruction, input=input
-            )
+            res = (
+                self.system_format.format(system=self.system_prompt)
+                if self.system_prompt
+                else ""
+            ) + self.turn_format.format(instruction=instruction, input=input)
        else:
-            res = self.system_no_input_prompt + self.turn_no_input_format.format(
-                instruction=instruction
-            )
+            res = (
+                self.system_format.format(system=self.system_no_input_prompt)
+                if self.system_no_input_prompt
+                else ""
+            ) + self.turn_no_input_format.format(instruction=instruction)
        if output:
            res = f"{res}{output}"
        yield res
@@ -305,10 +309,6 @@ class ShareGPTPrompter:  # pylint: disable=too-few-public-methods
        )

    def build_prompt(self, source) -> Generator[str, None, None]:
-        # ignore the system prompt if provided
-        if source[0]["from"] == "system":
-            source.pop(0)
-
        if len(source) < 2:
            # If there isn't a back and forth conversation, ignore it
            # also happens on the data splitting leaving empty conversations
@@ -317,6 +317,12 @@ class ShareGPTPrompter:  # pylint: disable=too-few-public-methods
            )

        conv = self._conversation.copy()
+
+        # Add the conversation system prompt if provided, otherwise use the default one
+        if source[0]["from"] == "system":
+            conv.system = source[0]["value"]
+            source.pop(0)
+
        roles = {"human": conv.roles[0], "gpt": conv.roles[1]}

        try:
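
The `system_format` change to `AlpacaPrompter` above wraps the system prompt in a template only when a prompt is present. A minimal sketch of that prefix logic follows. `build_prefix` is a hypothetical helper, not a function in `prompters.py`, and the sample prompts are made up.

```python
def build_prefix(system_prompt: str, system_format: str = "{system}") -> str:
    """Format the system prompt, or return "" when no prompt is set."""
    if not system_prompt:
        # An empty prompt produces no prefix, whatever the template is.
        return ""
    return system_format.format(system=system_prompt)


default_prefix = build_prefix("Be brief.\n\n")
templated_prefix = build_prefix("You are a bot.", "### System:\n{system}\n\n")
```

With the default `"{system}"` template the prompt passes through unchanged, which keeps the earlier concatenation behavior for prompters that do not override `system_format`.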
--- a/src/axolotl/train.py
+++ b/src/axolotl/train.py
@@ -0,0 +1,141 @@
+"""Prepare and train a model on a dataset. Can also infer from a model or merge lora"""
+
+import logging
+import os
+import signal
+import sys
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Optional
+
+import torch
+
+# add src to the pythonpath so we don't need to pip install this
+from datasets import Dataset
+from optimum.bettertransformer import BetterTransformer
+
+from axolotl.common.cli import TrainerCliArgs
+from axolotl.logging_config import configure_logging
+from axolotl.utils.dict import DictDefault
+from axolotl.utils.models import load_model, load_tokenizer
+from axolotl.utils.trainer import setup_trainer
+
+project_root = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
+src_dir = os.path.join(project_root, "src")
+sys.path.insert(0, src_dir)
+
+configure_logging()
+LOG = logging.getLogger("axolotl.train")
+
+
+@dataclass
+class TrainDatasetMeta:
+    """
+    dataclass to capture the dataset specific options for training
+    """
+
+    train_dataset: Dataset
+    eval_dataset: Optional[Dataset] = None
+    total_num_steps: Optional[int] = None
+
+
+def train(
+    *,
+    cfg: DictDefault,
+    cli_args: TrainerCliArgs,
+    dataset_meta: TrainDatasetMeta,
+):
+    # load the tokenizer first
+    LOG.info(f"loading tokenizer... {cfg.tokenizer_config or cfg.base_model_config}")
+    tokenizer = load_tokenizer(cfg)
+
+    train_dataset = dataset_meta.train_dataset
+    eval_dataset = dataset_meta.eval_dataset
+    total_num_steps = dataset_meta.total_num_steps
+
+    # Load the model (the tokenizer was loaded above)
+    LOG.info("loading model and (optionally) peft_config...")
+    model, peft_config = load_model(cfg, tokenizer, inference=cli_args.inference)
+
+    safe_serialization = cfg.save_safetensors is True
+
+    if cfg.resume_from_checkpoint is None and cfg.auto_resume_from_checkpoints:
+        possible_checkpoints = [
+            str(cp) for cp in Path(cfg.output_dir).glob("checkpoint-*")
+        ]
+        if len(possible_checkpoints) > 0:
+            sorted_paths = sorted(
+                possible_checkpoints,
+                key=lambda path: int(path.split("-")[-1]),
+            )
+            cfg.resume_from_checkpoint = sorted_paths[-1]
+            LOG.info(
+                f"Using Auto-resume functionality to start with checkpoint at {cfg.resume_from_checkpoint}"
+            )
+    resume_from_checkpoint = cfg.resume_from_checkpoint
+
+    trainer = setup_trainer(
+        cfg, train_dataset, eval_dataset, model, tokenizer, total_num_steps
+    )
+
+    model.config.use_cache = False
+
+    if torch.__version__ >= "2" and sys.platform != "win32":
+        LOG.info("Compiling torch model")
+        model = torch.compile(model)
+
+    # go ahead and presave, so we have the adapter config available to inspect
+    if peft_config:
+        LOG.info(f"Pre-saving adapter config to {cfg.output_dir}")
+        peft_config.save_pretrained(cfg.output_dir)
+    # additionally presave the tokenizer and model configs
+    if not Path(cfg.output_dir).is_dir():
+        os.makedirs(cfg.output_dir, exist_ok=True)
+    tokenizer.save_pretrained(str(Path(cfg.output_dir)))
+    model.config.save_pretrained(str(Path(cfg.output_dir)))
+
+    # If training is stopped early with Ctrl+C, save the model before exiting
+    if cfg.local_rank == 0:
+
+        def terminate_handler(_, __, model):
+            if cfg.flash_optimum:
+                model = BetterTransformer.reverse(model)
+            model.save_pretrained(cfg.output_dir, safe_serialization=safe_serialization)
+            sys.exit(0)
+
+        signal.signal(
+            signal.SIGINT, lambda signum, frame: terminate_handler(signum, frame, model)
+        )
+
+    LOG.info("Starting trainer...")
+    if cfg.group_by_length:
+        LOG.info("hang tight... sorting dataset for group_by_length")
+
+    if cfg.flash_optimum:
+        with torch.backends.cuda.sdp_kernel(
+            enable_flash=True, enable_math=True, enable_mem_efficient=True
+        ):
+            trainer.train(resume_from_checkpoint=resume_from_checkpoint)
+    else:
+        trainer.train(resume_from_checkpoint=resume_from_checkpoint)
+
+    LOG.info(f"Training Completed!!! Saving pre-trained model to {cfg.output_dir}")
+
+    if cfg.relora_steps:
+        if cfg.adapter == "lora" and not (cfg.load_in_4bit or cfg.load_in_8bit):
+            model = model.merge_and_unload()
+        else:
+            # final model weights have already been saved by `ReLoRACallback.on_train_end`
+            return model, tokenizer
+
+    # TODO do we need this fix? https://huggingface.co/docs/accelerate/usage_guides/fsdp#saving-and-loading
+    # only save on rank 0, otherwise it corrupts output on multi-GPU when multiple processes attempt to write the same file
+    if cfg.fsdp:
+        trainer.save_model(cfg.output_dir)
+    elif cfg.local_rank == 0:
+        if cfg.flash_optimum:
+            model = BetterTransformer.reverse(model)
+
+        model.save_pretrained(cfg.output_dir, safe_serialization=safe_serialization)
+
+    return model, tokenizer
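
The auto-resume branch in `train()` above sorts checkpoint directories by the integer step after the last `-`, not by string order. The sketch below isolates that choice. `latest_checkpoint` is a hypothetical helper and the paths are made up.

```python
from typing import List, Optional


def latest_checkpoint(paths: List[str]) -> Optional[str]:
    """Return the checkpoint path with the highest step number, if any."""
    if not paths:
        return None
    # Compare by the numeric step suffix; plain string order would rank
    # "checkpoint-900" after "checkpoint-1000".
    return max(paths, key=lambda path: int(path.split("-")[-1]))


candidates = ["out/checkpoint-900", "out/checkpoint-1000", "out/checkpoint-50"]
newest = latest_checkpoint(candidates)
```

A lexicographic sort of the same list would pick `out/checkpoint-900`, which is why the key converts the suffix to `int`.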
--- a/src/axolotl/utils/callbacks.py
+++ b/src/axolotl/utils/callbacks.py
@@ -1,9 +1,19 @@
 """Callbacks for Trainer class"""

+from __future__ import annotations
+
 import logging
 import os
+from typing import TYPE_CHECKING, Dict, List

+import evaluate
+import numpy as np
+import pandas as pd
+import torch
+import torch.distributed as dist
+from datasets import load_dataset
 from optimum.bettertransformer import BetterTransformer
+from tqdm import tqdm
 from transformers import (
    TrainerCallback,
    TrainerControl,
@@ -13,8 +23,21 @@ from transformers import (
 from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR, IntervalStrategy

 from axolotl.utils.bench import log_gpu_memory_usage
+from axolotl.utils.distributed import (
+    barrier,
+    broadcast_dict,
+    gather_scalar_from_all_ranks,
+    get_world_size,
+    is_distributed,
+    is_main_process,
+    zero_first,
+)
+
+if TYPE_CHECKING:
+    from axolotl.utils.trainer import AxolotlTrainingArguments

 LOG = logging.getLogger("axolotl.callbacks")
+IGNORE_INDEX = -100


 class SavePeftModelCallback(TrainerCallback):  # pylint: disable=too-few-public-methods
@@ -33,7 +56,9 @@ class SavePeftModelCallback(TrainerCallback):  # pylint: disable=too-few-public-
        )

        peft_model_path = os.path.join(checkpoint_folder, "adapter_model")
-        kwargs["model"].save_pretrained(peft_model_path)
+        kwargs["model"].save_pretrained(
+            peft_model_path, safe_serialization=args.save_safetensors
+        )

        return control

@@ -94,3 +119,207 @@ class GPUStatsCallback(
            log_gpu_memory_usage(LOG, "while training", self.cfg.device)
            self.logged = True
        return control
+
+
+def bench_eval_callback_factory(trainer, tokenizer):
+    accuracy = evaluate.load("accuracy")
+    abcd_idx = [
+        tokenizer("A", add_special_tokens=False).input_ids[0],
+        tokenizer("B", add_special_tokens=False).input_ids[0],
+        tokenizer("C", add_special_tokens=False).input_ids[0],
+        tokenizer("D", add_special_tokens=False).input_ids[0],
+        tokenizer("E", add_special_tokens=False).input_ids[0],
+        tokenizer("F", add_special_tokens=False).input_ids[0],
+        tokenizer("G", add_special_tokens=False).input_ids[0],
+    ]
+    bench_split = "eval"
+
+    def transform_bench_subject(example):
+        # Split on ':' and trim whitespace
+        parts = example["subject"].split(":")
+        first_part = (
+            parts[0].strip().lower().replace("-", "_")
+        )  # Lowercase the first part
+        second_part = (
+            parts[1].strip().replace("-", "_") if len(parts) > 1 else "all"
+        )  # Replace hyphens with underscores
+
+        # Return the transformed values
+        return {"name": first_part, "subject": second_part}
+
+    if trainer.args.bench_dataset == "mmlu-zs":
+        bench_dataset = load_dataset(
+            "openaccess-ai-collective/mmlu-evals",
+            data_files={
+                "eval": "zero_shot_mmlu_val.json",
+                "test": "zero_shot_mmlu_test.json",
+            },
+        )
+        # bench_dataset = bench_dataset.remove_columns("subject")
+    # MMLU Five-shot (Eval/Test only)
+    elif trainer.args.bench_dataset in ["mmlu", "mmlu-fs"]:
+        bench_dataset = load_dataset(
+            "openaccess-ai-collective/mmlu-evals",
+            data_files={
+                "eval": "five_shot_mmlu_val.json",
+                "test": "five_shot_mmlu_test.json",
+            },
+        )
+        # bench_dataset = bench_dataset.remove_columns('subject')
+    elif "/" in trainer.args.bench_dataset:
+        bench_ds = trainer.args.bench_dataset
+        bench_ds_name = "/".join(bench_ds.split("/", 2)[:2])
+        bench_ds_data_file = "/".join(bench_ds.split("/", 2)[2:])
+        bench_dataset = load_dataset(
+            bench_ds_name,
+            data_files={
+                "eval": bench_ds_data_file,
+            },
+        )
+        bench_dataset["eval"] = bench_dataset["eval"].map(transform_bench_subject)
+    else:
+        raise ValueError(
+            f"unhandled value `{trainer.args.bench_dataset}` for bench_dataset training args"
+        )
+    bench_dataset = bench_dataset[trainer.args.bench_split]
+    if trainer.args.max_bench_samples is not None:
+        bench_dataset = bench_dataset.select(range(trainer.args.max_bench_samples))
+
+    def tokenize_evals(example):
+        source = f"{tokenizer.bos_token}{example['input']}"
+        target = f"{example['output']}{tokenizer.eos_token}"
+
+        tokenized_source = tokenizer(
+            source,
+            max_length=2048,
+            truncation=True,
+            add_special_tokens=False,
+        )
+        tokenized_target = tokenizer(
+            target,
+            max_length=2048,
+            truncation=True,
+            add_special_tokens=False,
+        )
+        input_ids = tokenized_source["input_ids"] + tokenized_target["input_ids"]
+        labels = [IGNORE_INDEX] * len(tokenized_source["input_ids"]) + tokenized_target[
+            "input_ids"
+        ]
+
+        return {
+            "input_ids": input_ids,
+            "labels": labels,
+            "subject": example["subject"],
+        }
+
+    with zero_first(is_main_process()):
+        bench_dataset = bench_dataset.map(tokenize_evals)
+        bench_dataset = bench_dataset.filter(lambda x: x["labels"][-2] in abcd_idx)
+
+    class BenchEvalCallback(TrainerCallback):
+        """
+        TrainerCallback that runs the MMLU evals
+        """
+
+        def on_evaluate(
+            self,
+            args: AxolotlTrainingArguments,
+            state: TrainerState,  # pylint: disable=unused-argument
+            control: TrainerControl,  # pylint: disable=unused-argument
+            metrics: Dict[str, float],  # pylint: disable=unused-argument
+            **kwargs,  # pylint: disable=unused-argument
+        ):
+            data_loader = trainer.get_bench_dataloader(
+                bench_dataset.remove_columns(["input", "subject", "output", "name"])
+            )
+            trainer.model.eval()
+            preds, refs = [], []
+            loss_bench = 0
+            for batch in tqdm(data_loader, total=len(data_loader)):
+                (loss, logits, labels) = trainer.prediction_step(
+                    trainer.model,
+                    batch,
+                    prediction_loss_only=False,
+                )
+                # Each target holds two tokens: the answer token and the EOS token.
+                for i, logit in enumerate(logits):
+                    label_non_zero_id = (batch["labels"][i] != IGNORE_INDEX).nonzero()[
+                        0
+                    ][0]
+                    logit_abcd = logit[label_non_zero_id - 1][abcd_idx]
+                    preds.append(torch.argmax(logit_abcd).item())
+                labels = labels[labels != IGNORE_INDEX].view(-1, 2)[:, 0]
+                refs += [
+                    abcd_idx.index(label) if label in abcd_idx else -1
+                    for label in labels.tolist()
+                ]
+                loss_bench += loss.item()
+            # Extract results by subject.
+            bench_name = bench_dataset["name"]
+            bench_names: dict = {s: {"refs": [], "preds": []} for s in set(bench_name)}
+            for s, p, r in zip(bench_name, preds, refs):  # pylint: disable=invalid-name
+                bench_names[s]["preds"].append(p)
+                bench_names[s]["refs"].append(r)
+            barrier()
+            local_bench_names = bench_names
+            gathered_bench_names: List[Dict] = [{} for _ in range(get_world_size())]
+            # Gather results from all GPUs to GPU 0
+
+            loss_bench_ranks = gather_scalar_from_all_ranks(
+                lambda: loss_bench, get_world_size()
+            )
+            len_data_loader_ranks = gather_scalar_from_all_ranks(
+                lambda: len(data_loader), get_world_size()
+            )
+
+            results = {}
+            if is_distributed() and not is_main_process():
+                dist.gather_object(local_bench_names, dst=0)
+            else:
+                if is_distributed():
+                    dist.gather_object(local_bench_names, gathered_bench_names, dst=0)
+                else:
+                    gathered_bench_names = [local_bench_names]
+                bench_loss = sum(loss_bench_ranks) / sum(len_data_loader_ranks)
+                results = {f"{bench_split}_bench_loss": bench_loss}
+
+                # Combine results from all GPUs
+                combined_bench_names: Dict[str, Dict[str, List]] = {}
+                for bench_name in gathered_bench_names:
+                    for name, data in bench_name.items():
+                        if name not in combined_bench_names:
+                            combined_bench_names[name] = {"refs": [], "preds": []}
+                        combined_bench_names[name]["refs"].extend(data["refs"])
+                        combined_bench_names[name]["preds"].extend(data["preds"])
+
+                bench_scores = []
+                bench_refs = []
+                bench_preds = []
+                for (
+                    bench_name
+                ) in combined_bench_names:  # pylint: disable=consider-using-dict-items
+                    bench_score = accuracy.compute(
+                        references=combined_bench_names[bench_name]["refs"],
+                        predictions=combined_bench_names[bench_name]["preds"],
+                    )["accuracy"]
+                    bench_refs.extend(combined_bench_names[bench_name]["refs"])
+                    bench_preds.extend(combined_bench_names[bench_name]["preds"])
+                    if not pd.isna(bench_score):
+                        results[
+                            f"{bench_split}_bench_accuracy_{bench_name}"
+                        ] = bench_score
+                        bench_scores.append(bench_score)
+                    else:
+                        results[f"{bench_split}_bench_accuracy_{bench_name}"] = 0.0
+                        bench_scores.append(0.0)
+                results[f"{bench_split}_bench_average_accuracy"] = np.mean(bench_scores)
+                results[f"{bench_split}_bench_total_accuracy"] = accuracy.compute(
+                    references=bench_refs, predictions=bench_preds
+                )["accuracy"]
+                trainer.log(results)
+
+            results = broadcast_dict(results)
+            for key, val in results.items():
+                metrics[key] = val
+
+    return BenchEvalCallback
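Review note: the tail of `on_evaluate` merges the per-rank `{name: {"refs": [...], "preds": [...]}}` dicts and then reports both an average of per-subject accuracies and a single accuracy over all samples. The sketch below reproduces that aggregation in plain Python so the difference between the two numbers is easy to see. The helper names (`combine_rank_results`, `simple_accuracy`, `summarize`) are hypothetical, and `simple_accuracy` stands in for the `accuracy.compute(...)` call used in the diff.

```python
from typing import Dict, List, Tuple

RankResult = Dict[str, Dict[str, List[int]]]


def combine_rank_results(gathered: List[RankResult]) -> RankResult:
    """Merge per-rank results into one dict keyed by subject name."""
    combined: RankResult = {}
    for rank_result in gathered:
        for subject, data in rank_result.items():
            entry = combined.setdefault(subject, {"refs": [], "preds": []})
            entry["refs"].extend(data["refs"])
            entry["preds"].extend(data["preds"])
    return combined


def simple_accuracy(refs: List[int], preds: List[int]) -> float:
    """Fraction of matching positions; 0.0 for an empty subject."""
    if not refs:
        return 0.0
    return sum(r == p for r, p in zip(refs, preds)) / len(refs)


def summarize(combined: RankResult) -> Tuple[Dict[str, float], float, float]:
    """Return (per-subject accuracy, average of subjects, accuracy over all samples)."""
    per_subject = {
        s: simple_accuracy(d["refs"], d["preds"]) for s, d in combined.items()
    }
    average = sum(per_subject.values()) / len(per_subject)
    all_refs = [r for d in combined.values() for r in d["refs"]]
    all_preds = [p for d in combined.values() for p in d["preds"]]
    return per_subject, average, simple_accuracy(all_refs, all_preds)


# Two ranks, one of which saw samples from both subjects.
gathered = [
    {"math": {"refs": [0, 1], "preds": [0, 1]}},
    {"math": {"refs": [2], "preds": [2]}, "law": {"refs": [3], "preds": [0]}},
]
per_subject, average, total = summarize(combine_rank_results(gathered))
```

The average weights every subject equally, while the total weights every sample equally, so the two diverge whenever subjects have different sizes.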
--- a/src/axolotl/utils/config.py
+++ b/src/axolotl/utils/config.py
@@ -6,6 +6,7 @@ import os
 import torch

 from axolotl.utils.bench import log_gpu_memory_usage
+from axolotl.utils.models import load_model_config

 LOG = logging.getLogger("axolotl")

@@ -62,6 +63,23 @@ def normalize_config(cfg):
    else:
        torch.backends.cuda.matmul.allow_tf32 = cfg.tf32 or False

+    if cfg.bf16 or cfg.bfloat16:
+        cfg.torch_dtype = torch.bfloat16
+    elif cfg.load_in_8bit or cfg.fp16 or cfg.float16:
+        cfg.torch_dtype = torch.float16
+    else:
+        cfg.torch_dtype = torch.float32
+
+    model_config = load_model_config(cfg)
+
+    # figure out if the model is llama
+    cfg.is_llama_derived_model = (
+        (hasattr(model_config, "model_type") and model_config.model_type == "llama")
+        or cfg.is_llama_derived_model
+        or "llama" in cfg.base_model
+        or (cfg.model_type and "llama" in cfg.model_type.lower())
+    )
+
    log_gpu_memory_usage(LOG, "baseline", cfg.device)


@@ -79,6 +97,11 @@ def validate_config(cfg):
            )
        )

+    if cfg.sample_packing and not cfg.pad_to_sequence_len:
+        LOG.warning(
+            "`pad_to_sequence_len: true` is recommended when using sample_packing"
+        )
+
    if cfg.gradient_accumulation_steps and cfg.batch_size:
        raise ValueError(
            "please set only one of gradient_accumulation_steps or batch_size"
@@ -90,9 +113,7 @@ def validate_config(cfg):
            "To calculate the equivalent gradient_accumulation_steps, divide batch_size / micro_batch_size / number of gpus.",
        )
    if cfg.load_4bit:
-        raise ValueError(
-            "cfg.load_4bit parameter has been deprecated and replaced by cfg.gptq"
-        )
+        raise ValueError("cfg.load_4bit parameter has been deprecated")

    if cfg.adapter == "qlora":
        if cfg.merge_lora:
@@ -119,6 +140,19 @@ def validate_config(cfg):
    if not cfg.load_in_8bit and cfg.adapter == "lora":
        LOG.warning("We recommend setting `load_in_8bit: true` for LORA finetuning")

+    if cfg.relora_steps:
+        if cfg.adapter not in ("lora", "qlora"):
+            raise ValueError("cfg.adapter must be lora or qlora to use ReLoRA")
+
+        if cfg.fsdp:
+            raise ValueError("fsdp not supported with ReLoRA")
+
+        if cfg.deepspeed:
+            raise ValueError("deepspeed not supported with ReLoRA")
+
+        if cfg.lr_scheduler == "one_cycle":
+            raise ValueError("ReLoRA is not compatible with the one_cycle scheduler")
+
    if cfg.trust_remote_code:
        LOG.warning(
            "`trust_remote_code` is set to true. Please make sure that you reviewed the remote code/model."
@@ -186,6 +220,15 @@ def validate_config(cfg):
            "sample_packing not compatible with xformers_attention. Use flash_attention"
        )

+    if cfg.early_stopping_patience:
+        if not cfg.save_steps or not cfg.eval_steps:
+            raise ValueError(
+                "`early_stopping_patience` requires save_steps and eval_steps to be set. eval_steps should evenly divide save_steps."
+            )
+        if cfg.save_steps % cfg.eval_steps != 0:
+            raise ValueError(
+                "`early_stopping_patience` requires that eval_steps evenly divides save_steps."
+            )
    # TODO
    # MPT 7b
    # https://github.com/facebookresearch/bitsandbytes/issues/25
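Review note: the new `early_stopping_patience` checks can be read as one small rule: both step counts must be set, and `save_steps` must be a multiple of `eval_steps`. The standalone function below is only a sketch of that rule; `check_early_stopping` is a hypothetical name, and it takes plain values rather than the `cfg` object used in `validate_config`. The diff does not state the motivation; a plausible reading (an assumption) is that every checkpoint should coincide with an evaluation.

```python
from typing import Optional


def check_early_stopping(
    early_stopping_patience: Optional[int],
    save_steps: Optional[int],
    eval_steps: Optional[int],
) -> None:
    """Raise ValueError when the early-stopping settings are inconsistent."""
    if not early_stopping_patience:
        return  # nothing to validate when early stopping is off
    if not save_steps or not eval_steps:
        raise ValueError(
            "early_stopping_patience requires save_steps and eval_steps to be set"
        )
    if save_steps % eval_steps != 0:
        raise ValueError("eval_steps must evenly divide save_steps")


def is_valid(patience, save_steps, eval_steps) -> bool:
    """Small wrapper so the rule can be probed without try/except at call sites."""
    try:
        check_early_stopping(patience, save_steps, eval_steps)
    except ValueError:
        return False
    return True
```

For example, `save_steps=100` with `eval_steps=50` passes, while `save_steps=100` with `eval_steps=30` is rejected.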
--- a/src/axolotl/utils/data.py
+++ b/src/axolotl/utils/data.py
@@ -2,7 +2,6 @@
 import functools
 import hashlib
 import logging
-from hashlib import md5
 from pathlib import Path
 from typing import Tuple, Union

@@ -41,6 +40,7 @@ from axolotl.prompters import (
    ShareGPTPrompter,
    SummarizeTLDRPrompter,
 )
+from axolotl.utils.dict import DictDefault
 from axolotl.utils.distributed import is_main_process, zero_first
 from axolotl.utils.trainer import (
    calculate_total_num_steps,
@@ -51,11 +51,19 @@ LOG = logging.getLogger("axolotl")
 DEFAULT_DATASET_PREPARED_PATH = "last_run_prepared"


+def md5(to_hash: str, encoding: str = "utf-8") -> str:
+    try:
+        return hashlib.md5(to_hash.encode(encoding), usedforsecurity=False).hexdigest()
+    except TypeError:
+        return hashlib.md5(to_hash.encode(encoding)).hexdigest()  # nosec
+
+
 def prepare_dataset(cfg, tokenizer):
    if not cfg.pretraining_dataset:
-        train_dataset, eval_dataset = load_prepare_datasets(
-            tokenizer, cfg, DEFAULT_DATASET_PREPARED_PATH
-        )
+        with zero_first(is_main_process()):
+            train_dataset, eval_dataset = load_prepare_datasets(
+                tokenizer, cfg, DEFAULT_DATASET_PREPARED_PATH
+            )
    else:
        train_dataset = load_pretraining_dataset(
            cfg.pretraining_dataset,
@@ -86,7 +94,7 @@ def load_tokenized_prepared_datasets(
 ) -> DatasetDict:
    tokenizer_name = tokenizer.__class__.__name__
    ds_hash = str(
-        md5(  # nosec
+        md5(
            (
                str(cfg.sequence_len)
                + "@"
@@ -95,8 +103,8 @@ def load_tokenized_prepared_datasets(
                )
                + "|"
                + tokenizer_name
-            ).encode("utf-8")
-        ).hexdigest()
+            )
+        )
    )
    prepared_ds_path = (
        Path(cfg.dataset_prepared_path) / ds_hash
@@ -132,8 +140,17 @@ def load_tokenized_prepared_datasets(
            seed = 42

        datasets = []
+
+        def for_d_in_datasets(dataset_configs):
+            for dataset in dataset_configs:
+                if dataset.name and isinstance(dataset.name, list):
+                    for name in dataset.name:
+                        yield DictDefault({**dataset, "name": name})
+                else:
+                    yield dataset
+
        # pylint: disable=invalid-name
-        for d in cfg.datasets:
+        for d in for_d_in_datasets(cfg.datasets):
            ds: Union[Dataset, DatasetDict] = None
            ds_from_hub = False
            try:
@@ -160,8 +177,15 @@ def load_tokenized_prepared_datasets(
                        split=None,
                    )
                elif local_path.is_file():
+                    ds_type = "json"
+                    if d.ds_type:
+                        ds_type = d.ds_type
+                    elif ".parquet" in d.path:
+                        ds_type = "parquet"
+                    elif ".arrow" in d.path:
+                        ds_type = "arrow"
                    ds = load_dataset(
-                        "json",
+                        ds_type,
                        name=d.name,
                        data_files=d.path,
                        streaming=False,
@@ -198,13 +222,27 @@ def load_tokenized_prepared_datasets(
                    )
                else:
                    ds = ds.shuffle(seed=seed).shard(num_shards=d.shards, index=0)
+
+            d_base_type = d_prompt_style = None
            d_type = d.type
-            d_type_split = d_type.split(":")
-            d_base_type = d_type_split[0]
-            d_prompt_style = d_type_split[1] if len(d_type_split) > 1 else None
+            if isinstance(d_type, str):
+                d_type_split = d_type.split(":")
+                d_base_type = d_type_split[0]
+                d_prompt_style = d_type_split[1] if len(d_type_split) > 1 else None
            if "train" in ds:
                ds = ds["train"]
-            if ds_strategy := load(d.type, tokenizer, cfg):
+            if (
+                "input_ids" in ds.features
+                and "attention_mask" in ds.features
+                and "labels" in ds.features
+            ):
+                # dataset is already tokenized, just drop it straight in
+                datasets.append(ds)
+            elif isinstance(d.type, DictDefault):
+                ds_strategy = load("user_defined", tokenizer, cfg, d.type.to_dict())
+                ds_wrapper = TokenizedPromptDataset(ds_strategy, ds)
+                datasets.append(ds_wrapper)
+            elif ds_strategy := load(d.type, tokenizer, cfg, d):
                ds_wrapper = TokenizedPromptDataset(ds_strategy, ds)
                datasets.append(ds_wrapper)
            elif d_base_type == "alpaca":
@@ -342,7 +380,7 @@ def load_prepare_datasets(
        # see if we can go ahead and load the stacked dataset
        seed = f"@{str(cfg.seed)}" if cfg.seed else ""
        ds_hash = str(
-            md5(  # nosec
+            md5(
                (
                    str(cfg.sequence_len)
                    + "@"
@@ -353,8 +391,8 @@ def load_prepare_datasets(
                    )
                    + "|"
                    + tokenizer_name
-                ).encode("utf-8")
-            ).hexdigest()
+                )
+            )
        )
        prepared_ds_path = (
            Path(cfg.dataset_prepared_path) / ds_hash
@@ -468,12 +506,8 @@ def load_prepare_datasets(
            + "|"
            + str(cfg.seed or 42)
        )
-        train_fingerprint = hashlib.md5(
-            to_hash_train.encode(), usedforsecurity=False
-        ).hexdigest()
-        test_fingerprint = hashlib.md5(
-            to_hash_test.encode(), usedforsecurity=False
-        ).hexdigest()
+        train_fingerprint = md5(to_hash_train)
+        test_fingerprint = md5(to_hash_test)

        with zero_first(is_main_process()):
            dataset = dataset.train_test_split(
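Review note: two small behaviours are added to `load_tokenized_prepared_datasets` here: a dataset entry whose `name` is a list is expanded into one entry per name, and the loader type for a local file is chosen from an explicit `ds_type` or from the path. The sketch below mirrors both with plain dicts instead of `DictDefault`; the function names `expand_dataset_names` and `infer_ds_type` are hypothetical and exist only for illustration.

```python
from typing import Dict, Iterable, Iterator, Optional


def expand_dataset_names(dataset_configs: Iterable[Dict]) -> Iterator[Dict]:
    """Yield one config per name when `name` is a list, else the config unchanged."""
    for dataset in dataset_configs:
        name = dataset.get("name")
        if name and isinstance(name, list):
            for single_name in name:
                # copy the entry, overriding only the name
                yield {**dataset, "name": single_name}
        else:
            yield dataset


def infer_ds_type(path: str, ds_type: Optional[str] = None) -> str:
    """Pick a loader type: explicit setting first, then path hints, then JSON."""
    if ds_type:
        return ds_type
    if ".parquet" in path:
        return "parquet"
    if ".arrow" in path:
        return "arrow"
    return "json"


configs = [
    {"path": "org/bench", "name": ["subset_a", "subset_b"]},
    {"path": "data/train.jsonl", "name": None},
]
expanded = list(expand_dataset_names(configs))
```

As in the diff, the path check is a substring match rather than a suffix match, so a path such as `data.parquet.bak` would also be treated as Parquet.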
--- a/src/axolotl/utils/dataloader.py
+++ b/src/axolotl/utils/dataloader.py
@@ -243,6 +243,18 @@ class MultipackDistributedDataloader:
            len_remaining -= 1
            if not len_remaining:
                return
+        # yield a no-op for cases where we don't have any data left to pack
+        for _ in range(len_remaining):
+            yield self.collate_fn(
+                [
+                    {
+                        "input_ids": [0],
+                        "labels": [-100],
+                        "attention_mask": [True],
+                        "position_ids": [0],
+                    }
+                ]
+            )

    def _len_est(self):
        lengths_sum = np.sum(self.lengths)
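Review note: the added loop in `MultipackDistributedDataloader` keeps yielding placeholder batches until the expected count is reached. The placeholder's labels are all `-100`, the default ignore index of PyTorch's cross-entropy loss, so such a batch should not contribute to the loss. The sketch below shows the same pattern with a generic generator; `pad_batches` and `noop_batch` are hypothetical names, and it assumes `expected_len >= 1`. Keeping the batch count equal across ranks is presumably the goal, although the diff does not say so.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def pad_batches(batches: Iterable[T], expected_len: int, noop_batch: T) -> Iterator[T]:
    """Yield exactly `expected_len` items, padding with `noop_batch` if data runs out."""
    remaining = expected_len
    for batch in batches:
        yield batch
        remaining -= 1
        if not remaining:
            return  # reached the expected count; drop any surplus batches
    # data ran out early: emit placeholders for the missing batches
    for _ in range(remaining):
        yield noop_batch


NOOP = {"input_ids": [0], "labels": [-100], "attention_mask": [True], "position_ids": [0]}
padded = list(pad_batches(iter(["b0", "b1"]), 4, NOOP))
truncated = list(pad_batches(iter(["b0", "b1", "b2"]), 2, NOOP))
```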
--- a/src/axolotl/utils/distributed.py
+++ b/src/axolotl/utils/distributed.py
@@ -1,8 +1,11 @@
 """
 utility helpers for distributed checks
 """
+import os
+import pickle  # nosec
 from contextlib import contextmanager

+import torch
 import torch.distributed as dist
 from accelerate import Accelerator

@@ -43,6 +46,10 @@ def is_main_process():
    return dist.get_rank() == 0


+def get_world_size():
+    return int(os.getenv("WORLD_SIZE", "1"))
+
+
 @contextmanager
 def zero_first(is_main):
    """
@@ -53,3 +60,64 @@ def zero_first(is_main):
    yield
    if is_main:  # then rank 0 waits after it has run the context
        barrier()
+
+
+def gather_scalar_from_all_ranks(fn, world_size=1):  # pylint: disable=invalid-name
+    """
+    Run a callable 'fn' on all ranks and gather the results on rank 0.
+
+    Args:
+    - fn (callable): A function that computes the value. This should not have any side effects.
+    - world_size (int, optional): Total number of processes in the current distributed setup. Default is 1.
+
+    Returns:
+    - A list of computed values from all ranks if on rank 0 (or when not distributed), otherwise None.
+    """
+    value_scalar = fn()
+    if not is_distributed():
+        return [value_scalar]
+    value_tensor = torch.tensor(value_scalar, device=dist.get_rank()).float()
+
+    if not is_main_process():
+        dist.gather(value_tensor, dst=0)
+    else:
+        gathered_tensors = [torch.zeros_like(value_tensor) for _ in range(world_size)]
+        dist.gather(value_tensor, gather_list=gathered_tensors, dst=0)
+
+        # Convert tensors back to their original type (int or float)
+        gathered_values = []
+        for tensor in gathered_tensors:
+            if tensor == tensor.int():
+                gathered_values.append(int(tensor.item()))
+            else:
+                gathered_values.append(float(tensor.item()))
+        return gathered_values
+    return None
+
+
+def broadcast_dict(vals: dict):
+    if not is_distributed():
+        return vals
+
+    if is_main_process():
+        data_byte = pickle.dumps(vals)
+        data_tensor = torch.ByteTensor(list(data_byte)).to("cuda")
+        data_size = torch.IntTensor([len(data_byte)]).to("cuda")
+    else:
+        data_tensor = torch.empty([1024], dtype=torch.uint8, device="cuda")
+        data_size = torch.IntTensor([0]).to("cuda")
+
+    dist.broadcast(data_size, 0)
+    if not is_main_process():
+        # resize
+        data_tensor = data_tensor.new_empty([data_size.item()])
+
+    dist.broadcast(data_tensor, 0)
+
+    if not is_main_process():
+        data_list = data_tensor.cpu().tolist()
+        data_byte = bytes(data_list[: data_size.item()])
+        vals = pickle.loads(data_byte)  # nosec
+
+    return vals
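Review note: `broadcast_dict` sends the payload in two steps: first the byte length, so receivers can allocate a buffer of the right size, then the pickled bytes themselves. The sketch below isolates the encode and decode halves without any `torch.distributed` transport; `encode_for_broadcast` and `decode_from_broadcast` are hypothetical names, and plain lists of ints stand in for the `uint8` tensors.

```python
import pickle
from typing import Any, Dict, List, Tuple


def encode_for_broadcast(vals: Dict[str, Any]) -> Tuple[int, List[int]]:
    """Sender side: pickle the dict and return (byte length, byte values)."""
    payload = pickle.dumps(vals)
    return len(payload), list(payload)


def decode_from_broadcast(size: int, byte_values: List[int]) -> Dict[str, Any]:
    """Receiver side: keep the first `size` bytes and unpickle them."""
    return pickle.loads(bytes(byte_values[:size]))


size, data = encode_for_broadcast({"eval_bench_loss": 1.5})
# Simulate a receive buffer that is larger than the payload.
received = decode_from_broadcast(size, data + [0, 0, 0])
```

Unpickling is only safe here because the bytes come from rank 0 of the same job, which is why the diff marks the `pickle` usage with `# nosec`.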
--- a/src/axolotl/utils/models.py
+++ b/src/axolotl/utils/models.py
@@ -4,32 +4,37 @@
 import logging
 import math
 import os
-from pathlib import Path
-from typing import TYPE_CHECKING, Optional, Tuple  # noqa: F401
+from typing import Optional, Tuple  # noqa: F401

 import bitsandbytes as bnb
 import torch
 import transformers
 from optimum.bettertransformer import BetterTransformer
+from peft import PeftConfig, prepare_model_for_kbit_training
 from transformers import (  # noqa: F401
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
+    GPTQConfig,
    LlamaConfig,
    PreTrainedModel,
    PreTrainedTokenizerBase,
 )

-from axolotl.prompt_tokenizers import LLAMA_DEFAULT_PAD_TOKEN
+from axolotl.prompt_tokenizers import LLAMA_DEFAULT_EOS_TOKEN
 from axolotl.utils.bench import log_gpu_memory_usage
+from axolotl.utils.dict import DictDefault

 LOG = logging.getLogger("axolotl")

-if TYPE_CHECKING:
-    from peft import PeftConfig  # noqa: F401

-    from axolotl.utils.dict import DictDefault  # noqa: F401
+def load_model_config(cfg):
+    model_config_name = cfg.base_model_config or cfg.base_model
+    trust_remote_code: bool = False or cfg.trust_remote_code
+    return AutoConfig.from_pretrained(
+        model_config_name, trust_remote_code=trust_remote_code
+    )


 def load_tokenizer(cfg):
@@ -54,11 +59,18 @@ def load_tokenizer(cfg):
        **tokenizer_kwargs,
    )

-    if tokenizer.__class__.__name__ in [
-        "LlamaTokenizer",
-        "LlamaTokenizerFast",
-    ]:
-        tokenizer.pad_token = LLAMA_DEFAULT_PAD_TOKEN
+    if (
+        tokenizer.__class__.__name__
+        in [
+            "LlamaTokenizer",
+            "LlamaTokenizerFast",
+            "CodeLlamaTokenizer",
+        ]
+        and hasattr(tokenizer, "pad_token")
+        and not tokenizer.pad_token
+    ):
+        # set a pad_token, but use eos_token so we don't add a new token
+        tokenizer.pad_token = LLAMA_DEFAULT_EOS_TOKEN

    LOG.debug(f"EOS: {tokenizer.eos_token_id} / {tokenizer.eos_token}")
    LOG.debug(f"BOS: {tokenizer.bos_token_id} / {tokenizer.bos_token}")
@@ -79,8 +91,10 @@ def load_tokenizer(cfg):


 def load_model(
-    cfg, tokenizer
-):  # type: (DictDefault, PreTrainedTokenizerBase) -> Tuple[PreTrainedModel, Optional[PeftConfig]]
+    cfg: DictDefault,
+    tokenizer: PreTrainedTokenizerBase,
+    inference: bool = False,
+) -> Tuple[PreTrainedModel, Optional[PeftConfig]]:
    """
    Load a model for a given configuration and tokenizer.
    """
@@ -90,14 +104,9 @@ def load_model(

    # TODO refactor as a kwarg
    load_in_8bit = cfg.load_in_8bit
-    cfg.is_llama_derived_model = (
-        "llama" in base_model
-        or (cfg.model_type and "llama" in cfg.model_type.lower())
-        or cfg.is_llama_derived_model
-    )

    if cfg.is_llama_derived_model and cfg.flash_attention:
-        if cfg.device not in ["mps", "cpu"] and not cfg.inference:
+        if cfg.device not in ["mps", "cpu"] and not inference:
            from axolotl.monkeypatch.llama_attn_hijack_flash import (
                replace_llama_attn_with_flash_attn,
            )
@@ -139,94 +148,35 @@ def load_model(
    if (
        cfg.is_llama_derived_model
        and (cfg.max_packed_sequence_len or cfg.sample_packing)
-        and not cfg.inference
+        and not inference
    ):
        from axolotl.monkeypatch.llama_expand_mask import hijack_expand_mask

        LOG.info("patching _expand_mask")
        hijack_expand_mask()

-    if cfg.bf16 or cfg.bfloat16:
-        torch_dtype = torch.bfloat16
-    elif cfg.load_in_8bit or cfg.fp16 or cfg.float16:
-        torch_dtype = torch.float16
-    else:
-        torch_dtype = torch.float32
-    try:
-        if cfg.gptq:
-            from alpaca_lora_4bit.monkeypatch.peft_tuners_lora_monkey_patch import (
-                replace_peft_model_with_int4_lora_model,
-            )
-
-            replace_peft_model_with_int4_lora_model()
-    except Exception as err:
-        LOG.exception(err)
-        raise err
-
-    if not cfg.gptq and (
-        (cfg.adapter == "lora" and load_in_8bit)
-        or (cfg.adapter == "qlora" and cfg.load_in_4bit)
-    ):
-        try:
-            from peft import prepare_model_for_kbit_training
-        except ImportError:
-            # For backward compatibility
-            from peft import (
-                prepare_model_for_int8_training as prepare_model_for_kbit_training,
-            )
-
    model_kwargs = {}
    if cfg.model_revision:
        model_kwargs["revision"] = cfg.model_revision
+    if cfg.gptq:
+        model_config = load_model_config(cfg)
+        if not hasattr(model_config, "quantization_config"):
+            LOG.warning("model config does not contain quantization_config information")
+        else:
+            model_kwargs["quantization_config"] = GPTQConfig(
+                **model_config.quantization_config
+            )
    if cfg.adapter == "qlora" and cfg.load_in_4bit:
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            llm_int8_threshold=6.0,
            llm_int8_has_fp16_weight=False,
-            bnb_4bit_compute_dtype=torch_dtype,
+            bnb_4bit_compute_dtype=cfg.torch_dtype,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
        )
    try:
-        if cfg.gptq and cfg.is_llama_derived_model:
-            from alpaca_lora_4bit.autograd_4bit import load_llama_model_4bit_low_ram
-            from huggingface_hub import snapshot_download
-
-            try:
-                snapshot_download_kwargs = {}
-                if cfg.base_model_ignore_patterns:
-                    snapshot_download_kwargs[
-                        "ignore_patterns"
-                    ] = cfg.base_model_ignore_patterns
-                cache_model_path = Path(
-                    snapshot_download(base_model, **snapshot_download_kwargs)
-                )
-                files = (
-                    list(cache_model_path.glob("*.pt"))
-                    + list(cache_model_path.glob("*.safetensors"))
-                    + list(cache_model_path.glob("*.bin"))
-                )
-                if len(files) > 0:
-                    model_path = str(files[0])
-                else:
-                    LOG.warning(
-                        "unable to find a cached model file, this will likely fail..."
-                    )
-                    model_path = str(cache_model_path)
-            except Exception:  # pylint: disable=broad-exception-caught
-                model_path = cfg.base_model
-            model, _ = load_llama_model_4bit_low_ram(
-                base_model_config if base_model_config else base_model,
-                model_path,
-                device_map=cfg.device_map,
-                half=cfg.fp16,
-                groupsize=cfg.gptq_groupsize if cfg.gptq_groupsize else -1,
-                is_v1_model=cfg.gptq_model_v1
-                if cfg.gptq_model_v1 is not None
-                else True,
-            )
-            load_in_8bit = False
-        elif cfg.is_llama_derived_model and not cfg.trust_remote_code:
+        if cfg.is_llama_derived_model and not cfg.trust_remote_code and not cfg.gptq:
            from transformers import LlamaForCausalLM

            config_kwargs = {}
@@ -242,7 +192,7 @@ def load_model(
                device_map=cfg.device_map,
                load_in_8bit=cfg.load_in_8bit and cfg.adapter is not None,
                load_in_4bit=cfg.load_in_4bit and cfg.adapter is not None,
-                torch_dtype=torch_dtype,
+                torch_dtype=cfg.torch_dtype,
                **model_kwargs,
            )
        # elif model_type == "GPTNeoXForCausalLM" and cfg.flash_attention:
@@ -272,15 +222,24 @@ def load_model(
        #     )
        #     model.train() # sets to train instead of eval mode
        elif model_type and not cfg.trust_remote_code:
-            model = getattr(transformers, model_type).from_pretrained(
-                base_model,
-                device_map=cfg.device_map,
-                load_in_8bit=cfg.load_in_8bit and cfg.adapter is not None,
-                load_in_4bit=cfg.load_in_4bit and cfg.adapter is not None,
-                torch_dtype=torch_dtype,
-                trust_remote_code=cfg.trust_remote_code or False,
-                **model_kwargs,
-            )
+            if cfg.gptq:
+                model = AutoModelForCausalLM.from_pretrained(
+                    base_model,
+                    device_map=cfg.device_map,
+                    torch_dtype=cfg.torch_dtype,
+                    trust_remote_code=cfg.trust_remote_code or False,
+                    **model_kwargs,
+                )
+            else:
+                model = getattr(transformers, model_type).from_pretrained(
+                    base_model,
+                    device_map=cfg.device_map,
+                    load_in_8bit=cfg.load_in_8bit and cfg.adapter is not None,
+                    load_in_4bit=cfg.load_in_4bit and cfg.adapter is not None,
+                    torch_dtype=cfg.torch_dtype,
+                    trust_remote_code=cfg.trust_remote_code or False,
+                    **model_kwargs,
+                )
        else:
            config = AutoConfig.from_pretrained(
                base_model,
@@ -308,7 +267,7 @@ def load_model(
                device_map=cfg.device_map,
                load_in_8bit=cfg.load_in_8bit and cfg.adapter is not None,
                load_in_4bit=cfg.load_in_4bit and cfg.adapter is not None,
-                torch_dtype=torch_dtype,
+                torch_dtype=cfg.torch_dtype,
                trust_remote_code=cfg.trust_remote_code or False,
                **model_kwargs,
            )
@@ -322,7 +281,7 @@ def load_model(
            device_map=cfg.device_map,
            load_in_8bit=cfg.load_in_8bit and cfg.adapter is not None,
            load_in_4bit=cfg.load_in_4bit and cfg.adapter is not None,
-            torch_dtype=torch_dtype,
+            torch_dtype=cfg.torch_dtype,
            trust_remote_code=cfg.trust_remote_code or False,
            **model_kwargs,
        )
@@ -347,46 +306,46 @@ def load_model(
    if model.device.type == "cuda":
        log_gpu_memory_usage(LOG, "after model load", model.device)

-    if not cfg.gptq and (
-        (cfg.adapter == "lora" and load_in_8bit)
-        or (cfg.adapter == "qlora" and cfg.load_in_4bit)
+    # make sure these are fp32 per Ramesh et al. (2021)
+    for name, module in model.named_modules():
+        if "norm" in name:
+            module.to(torch.float32)
+        if "lm_head" in name or "embed_tokens" in name:
+            if hasattr(module, "weight"):
+                module.to(torch.float32)
+
+    needs_fa2_dtype = cfg.adapter or cfg.fsdp
+    if (cfg.adapter == "lora" and load_in_8bit) or (
+        cfg.adapter == "qlora" and cfg.load_in_4bit
    ):
        LOG.info("converting PEFT model w/ prepare_model_for_kbit_training")
+        if cfg.gradient_checkpointing:
+            model.gradient_checkpointing_enable()
        model = prepare_model_for_kbit_training(
            model, use_gradient_checkpointing=cfg.gradient_checkpointing
        )
+        needs_fa2_dtype = True

-        # LlamaRMSNorm layers are in fp32 after kbit_training, so we need to
-        # convert them back to fp16/bf16 for flash-attn compatibility.
-        if cfg.flash_attention and cfg.is_llama_derived_model:
-            for name, module in model.named_modules():
-                if "norm" in name:
-                    module.to(torch_dtype)
-                if "lm_head" in name or "embed_tokens" in name:
-                    if hasattr(module, "weight"):
-                        module.to(torch_dtype)
+    # LlamaRMSNorm layers are in fp32 after kbit_training or full finetune, so we need to
+    # convert them back to fp16/bf16 for flash-attn compatibility.
+    if needs_fa2_dtype or (cfg.flash_attention and cfg.is_llama_derived_model):
+        LOG.info("converting modules to %s for flash attention", cfg.torch_dtype)
+        for name, module in model.named_modules():
+            if "norm" in name:
+                module.to(cfg.torch_dtype)
+            if "lm_head" in name or "embed_tokens" in name:
+                if hasattr(module, "weight"):
+                    module.to(cfg.torch_dtype)

    model, lora_config = load_adapter(model, cfg, cfg.adapter)

    if cfg.ddp and not load_in_8bit:
        model.to(f"cuda:{cfg.local_rank}")

-    if cfg.gptq:
-        # Scales to half
-        LOG.info("Fitting 4bit scales and zeros to half")
-        for _, module in model.named_modules():
-            if "Autograd4bitQuantLinear" in str(type(module)) or "Linear4bitLt" in str(
-                type(module)
-            ):
-                if hasattr(module, "is_v1_model") and module.is_v1_model:
-                    module.zeros = module.zeros.half()
-                module.scales = module.scales.half()
-                module.bias = module.bias.half()
-
    if (
        torch.cuda.device_count() > 1
        and int(os.getenv("WORLD_SIZE", "1")) > 1
-        and (cfg.gptq or cfg.load_in_4bit)
+        and (cfg.load_in_4bit)
    ):
        # llama is PROBABLY model parallelizable, but the default isn't that it is
        # so let's only set it for the 4bit, see
@@ -412,15 +371,15 @@ def load_model(
    return model, lora_config


-def load_adapter(model, cfg, adapter):
-    # type: (PreTrainedModel, DictDefault, Optional[str]) -> Tuple[PreTrainedModel, Optional[PeftConfig]]
+def load_adapter(model, cfg, adapter, inference=False):
+    # type: (PreTrainedModel, DictDefault, Optional[str], bool) -> Tuple[PreTrainedModel, Optional[PeftConfig]]

    if adapter is None:
        return model, None
    if hasattr(model, "enable_input_require_grads"):
        model.enable_input_require_grads()
    if adapter in ["lora", "qlora"]:
-        return load_lora(model, cfg)
+        return load_lora(model, cfg, inference=inference)
    if adapter == "llama-adapter":
        return load_llama_adapter(model, cfg)

@@ -438,7 +397,7 @@ def load_llama_adapter(model, cfg):
    )

    if cfg.lora_model_dir:
-        LOG.info("Loading pretained LORA")
+        LOG.debug("Loading pretrained PEFT - llama_adapter")
        model = PeftModel.from_pretrained(
            model,
            cfg.lora_model_dir,
@@ -452,12 +411,8 @@ def load_llama_adapter(model, cfg):
    return model, peft_config


-def find_all_linear_names(bits, model):
-    cls = (
-        bnb.nn.Linear4bit
-        if bits == 4
-        else (bnb.nn.Linear8bitLt if bits == 8 else torch.nn.Linear)
-    )
+def find_all_linear_names(model):
+    cls = (bnb.nn.Linear4bit, bnb.nn.Linear8bitLt, torch.nn.Linear)
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
@@ -470,21 +425,15 @@ def find_all_linear_names(bits, model):
    return list(lora_module_names)


-def load_lora(model, cfg):
-    # type: (PreTrainedModel, DictDefault) -> Tuple[PreTrainedModel, Optional[PeftConfig]]
+def load_lora(model, cfg, inference=False):
+    # type: (PreTrainedModel, DictDefault, bool) -> Tuple[PreTrainedModel, Optional[PeftConfig]]

    from peft import LoraConfig, PeftModel, get_peft_model

    lora_target_modules = list(cfg.lora_target_modules or [])

    if cfg.lora_target_linear:
-        bits = None
-        if cfg.load_in_4bit:
-            bits = 4
-        elif cfg.load_in_8bit:
-            bits = 8
-
-        linear_names = find_all_linear_names(bits, model)
+        linear_names = find_all_linear_names(model)
        LOG.info(f"found linear modules: {repr(linear_names)}")
        lora_target_modules = list(set(lora_target_modules + linear_names))

@@ -500,10 +449,11 @@ def load_lora(model, cfg):
    )

    if cfg.lora_model_dir:
+        LOG.debug("Loading pretrained PEFT - LoRA")
        model = PeftModel.from_pretrained(
            model,
            cfg.lora_model_dir,
-            is_trainable=not cfg.inference,
+            is_trainable=(not inference),
        )
    else:
        model = get_peft_model(model, lora_config)
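Note on `find_all_linear_names`: the simplified version relies on `isinstance` accepting a tuple of classes, so the `bits` argument is no longer needed to pick a single class. The sketch below illustrates only that matching step. The `Fake*` classes and `linear_module_names` are hypothetical stand-ins for `bnb.nn.Linear4bit`, `bnb.nn.Linear8bitLt`, and `torch.nn.Linear`, and the rest of the real function body (not shown in this diff) is not reproduced.

```python
# Hypothetical stand-in classes; the real code matches bitsandbytes and torch layers.
class FakeLinear:
    pass


class FakeLinear4bit:
    pass


class FakeLinear8bit:
    pass


class FakeNorm:
    pass


# isinstance() accepts a tuple and returns True if the object matches any entry.
LINEAR_TYPES = (FakeLinear4bit, FakeLinear8bit, FakeLinear)


def linear_module_names(named_modules):
    """Return the sorted names of modules whose type is one of LINEAR_TYPES."""
    return sorted(
        name for name, module in named_modules if isinstance(module, LINEAR_TYPES)
    )


modules = [
    ("q_proj", FakeLinear4bit()),
    ("norm", FakeNorm()),
    ("up_proj", FakeLinear()),
]
names = linear_module_names(modules)
```

With this approach the caller does not need to know how the model was quantized; any of the listed layer types is picked up.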
--- a/src/axolotl/utils/tokenization.py
+++ b/src/axolotl/utils/tokenization.py
@@ -8,13 +8,13 @@ from termcolor import colored
 LOG = logging.getLogger("axolotl")


-def check_dataset_labels(dataset, tokenizer):
+def check_dataset_labels(dataset, tokenizer, num_examples=5, text_only=False):
    # the dataset is already shuffled, so let's just check the first 5 elements
-    for idx in range(5):
-        check_example_labels(dataset[idx], tokenizer)
+    for idx in range(num_examples):
+        check_example_labels(dataset[idx], tokenizer, text_only=text_only)


-def check_example_labels(example, tokenizer):
+def check_example_labels(example, tokenizer, text_only=False):
    # Get the input_ids, labels, and attention_mask from the dataset
    input_ids = example["input_ids"]
    labels = example["labels"]
@@ -29,8 +29,10 @@ def check_example_labels(example, tokenizer):
        decoded_input_token = tokenizer.decode(input_id)
        # Choose the color based on whether the label has the ignore value or not
        color = "red" if label_id == -100 else ("yellow" if label_id == 0 else "green")
-        colored_token = colored(decoded_input_token, color) + colored(
-            f"({label_id}, {mask}, {input_id})", "white"
+        colored_token = colored(decoded_input_token, color) + (
+            not text_only
+            and colored(f"({label_id}, {mask}, {input_id})", "white")
+            or ""
        )
        colored_tokens.append(colored_token)
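Note on the `text_only` change: the hunk above uses the `A and B or ""` idiom to append the `(label_id, mask, input_id)` suffix only when `text_only` is false. A minimal sketch of the same logic with plain strings follows. `annotate` is a hypothetical helper, not part of axolotl, and `termcolor` is left out so the example stays self-contained.

```python
def annotate(token: str, details: str, text_only: bool = False) -> str:
    """Append `details` to `token` unless `text_only` is set."""
    # The and/or idiom as written in the diff: yields `details` when
    # text_only is False and `details` is non-empty, otherwise "".
    legacy = token + (not text_only and details or "")
    # An equivalent conditional expression, which some readers find clearer.
    explicit = token + ("" if text_only else details)
    # Both forms agree here because the fallback value is the empty string.
    assert legacy == explicit
    return explicit


full = annotate("Hello", "(15043, 1, 15043)")
plain = annotate("Hello", "(15043, 1, 15043)", text_only=True)
```

The two forms only differ in general when the middle operand can be falsy; here the fallback is `""`, so the result is the same either way.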

--- a/src/axolotl/utils/trainer.py
+++ b/src/axolotl/utils/trainer.py
@@ -10,28 +10,31 @@ from functools import partial
 from pathlib import Path
 from typing import Optional, Union

-import bitsandbytes as bnb
 import numpy as np
 import torch.cuda
 import transformers
 from datasets import Dataset, set_caching_enabled
-from torch import nn
 from torch.optim.lr_scheduler import OneCycleLR
-from torch.utils.data import DataLoader, DistributedSampler, RandomSampler
+from torch.utils.data import (
+    DataLoader,
+    DistributedSampler,
+    RandomSampler,
+    SequentialSampler,
+)
 from transformers import EarlyStoppingCallback, Trainer, TrainingArguments
-from transformers.trainer_pt_utils import get_parameter_names
+from transformers.trainer_pt_utils import SequentialDistributedSampler

+from axolotl.monkeypatch.relora import ReLoRACallback, ReLoRAScheduler
 from axolotl.utils.callbacks import (
    GPUStatsCallback,
    SaveBetterTransformerModelCallback,
    SavePeftModelCallback,
+    bench_eval_callback_factory,
 )
 from axolotl.utils.collators import DataCollatorForSeq2Seq
 from axolotl.utils.dataloader import MultipackDistributedDataloader
-from axolotl.utils.schedulers import (
-    InterpolatingLogScheduler,
-    get_cosine_schedule_with_quadratic_warmup,
-)
+from axolotl.utils.distributed import is_main_process, zero_first
+from axolotl.utils.schedulers import get_cosine_schedule_with_quadratic_warmup

 LOG = logging.getLogger("axolotl")

@@ -124,6 +127,35 @@ class AxolotlTrainingArguments(TrainingArguments):
        default=1,
        metadata={"help": "the multiplier for the max len for packed sequences"},
    )
+    relora_steps: Optional[int] = field(
+        default=None,
+        metadata={"help": "how often to reset for ReLoRA"},
+    )
+    relora_warmup_steps: Optional[int] = field(
+        default=None,
+        metadata={"help": "how many warmup steps to take after reset for ReLoRA"},
+    )
+    bench_split: Optional[str] = field(
+        default="eval", metadata={"help": "The benchmark split to run on"}
+    )
+    bench_dataset: Optional[str] = field(
+        default="pharaouk/dharma-1/dharma_1_mini.json",
+        metadata={
+            "help": "Benchmark dataset to use: options are `mmlu-zs`, `mmlu-fs`, or the full path to the dataset file"
+        },
+    )
+    do_bench_eval: Optional[bool] = field(
+        default=False, metadata={"help": "Whether to run the Benchmark evaluation."}
+    )
+    max_bench_samples: Optional[int] = field(
+        default=None,
+        metadata={
+            "help": "If set, only evaluates on `max_bench_samples` of the benchmark dataset."
+        },
+    )
+    bench_source_max_len: int = field(
+        default=2048, metadata={"help": "Maximum source sequence length for bench."}
+    )


 class AxolotlTrainer(Trainer):
@@ -133,6 +165,10 @@ class AxolotlTrainer(Trainer):

    args = None  # type: AxolotlTrainingArguments

+    def __init__(self, *args, bench_data_collator=None, **kwargs):
+        self.bench_data_collator = bench_data_collator
+        super().__init__(*args, **kwargs)
+
    def create_scheduler(
        self, num_training_steps: int, optimizer: torch.optim.Optimizer = None
    ):
@@ -171,6 +207,18 @@ class AxolotlTrainer(Trainer):
            )
        return super()._get_train_sampler()

+    def _get_eval_sampler(
+        self, eval_dataset: Dataset
+    ) -> Optional[torch.utils.data.Sampler]:
+        if self.args.world_size > 1 and self.args.sample_packing:
+            return SequentialDistributedSampler(
+                eval_dataset,
+                num_replicas=self.args.world_size,
+                rank=self.args.process_index,
+                batch_size=self.args.per_device_eval_batch_size,
+            )
+        return super()._get_eval_sampler(eval_dataset)
+
    def get_train_dataloader(self) -> Union[DataLoader, MultipackDistributedDataloader]:
        if self.args.sample_packing:
            train_sampler = self._get_train_sampler()
@@ -195,6 +243,7 @@ class AxolotlTrainer(Trainer):
            eval_dataset = (
                eval_dataset if eval_dataset is not None else self.eval_dataset
            )
+
            eval_sampler = self._get_eval_sampler(eval_dataset)
            return self.accelerator.prepare(
                MultipackDistributedDataloader(
@@ -210,6 +259,31 @@ class AxolotlTrainer(Trainer):
            )
        return super().get_eval_dataloader(eval_dataset)

+    def _get_bench_sampler(
+        self, bench_dataset: Dataset
+    ) -> Optional[torch.utils.data.Sampler]:
+        if self.args.world_size <= 1:
+            return SequentialSampler(bench_dataset)
+        return None
+
+    def get_bench_dataloader(
+        self,
+        bench_dataset: Dataset,
+    ) -> Union[DataLoader, MultipackDistributedDataloader]:
+        dataloader_params = {
+            "batch_size": self.args.eval_batch_size,
+            "collate_fn": self.bench_data_collator,
+            "num_workers": self.args.dataloader_num_workers,
+            "pin_memory": self.args.dataloader_pin_memory,
+        }
+
+        if not isinstance(bench_dataset, torch.utils.data.IterableDataset):
+            dataloader_params["sampler"] = self._get_bench_sampler(bench_dataset)
+            dataloader_params["drop_last"] = self.args.dataloader_drop_last
+
+        return DataLoader(bench_dataset, **dataloader_params)
+        # return self.accelerator.prepare(DataLoader(bench_dataset, **dataloader_params))
+
    def compute_loss(self, model, inputs, return_outputs=False):
        # use one's weighted cross entropy loss calc
        # if self.args.sample_packing:
@@ -249,13 +323,46 @@ class OneCycleLRSchedulerTrainer(AxolotlTrainer):
        return self.lr_scheduler


+class ReLoRATrainer(AxolotlTrainer):
+    """
+    Trainer subclass that wraps the base scheduler in a ReLoRAScheduler when relora_steps is set
+    """
+
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self.lr_scheduler = None
+
+    def create_scheduler(
+        self,
+        num_training_steps: int,
+        optimizer: Optional[torch.optim.Optimizer] = None,
+    ):
+        optimizer = self.optimizer if optimizer is None else optimizer
+        lr_scheduler = super().create_scheduler(num_training_steps, optimizer)
+
+        if self.args.relora_steps:
+            warmup_steps = (
+                self.args.relora_warmup_steps if self.args.relora_warmup_steps else 10
+            )
+            self.lr_scheduler = ReLoRAScheduler(
+                optimizer,
+                lr_scheduler,
+                self.args.relora_steps,
+                warmup_steps,
+            )
+        else:
+            self.lr_scheduler = lr_scheduler
+
+        return self.lr_scheduler
+
+
 def add_position_ids(sample):
    sample["position_ids"] = torch.arange(len(sample["input_ids"]))
    return sample


 def drop_long_seq(sample, sequence_len=2048):
-    return len(sample["input_ids"]) <= sequence_len
+    return len(sample["input_ids"]) <= sequence_len and len(sample["input_ids"]) > 0


@contextmanager
@@ -268,15 +375,18 @@ def disable_datasets_caching():


 def process_datasets_for_packing(cfg, train_dataset, eval_dataset):
-    if cfg.sample_packing:
-        drop_long = partial(drop_long_seq, sequence_len=cfg.sequence_len)
-        train_dataset = train_dataset.filter(drop_long, num_proc=os.cpu_count()).map(
-            add_position_ids, num_proc=os.cpu_count()
-        )
+    drop_long = partial(drop_long_seq, sequence_len=cfg.sequence_len)
+    with zero_first(is_main_process()):
+        train_dataset = train_dataset.filter(drop_long, num_proc=os.cpu_count())
        if eval_dataset:
-            eval_dataset = eval_dataset.filter(drop_long, num_proc=os.cpu_count()).map(
-                add_position_ids, num_proc=os.cpu_count()
-            )
+            eval_dataset = eval_dataset.filter(drop_long, num_proc=os.cpu_count())
+
+        if cfg.sample_packing:
+            train_dataset = train_dataset.map(add_position_ids, num_proc=os.cpu_count())
+            if eval_dataset:
+                eval_dataset = eval_dataset.map(
+                    add_position_ids, num_proc=os.cpu_count()
+                )
    return train_dataset, eval_dataset


@@ -295,6 +405,16 @@ def calculate_total_num_steps(cfg, train_dataset, tokenizer):
            LOG.info(f"📝 UPDATE CONFIG WITH: `total_num_tokens: {total_num_tokens}`")
            cfg.total_num_tokens = total_num_tokens

+        if not cfg.total_supervised_tokens:
+            total_supervised_tokens = (
+                train_dataset.data.column("labels")
+                .to_pandas()
+                .apply(lambda x: np.sum(np.array(x) != -100))
+                .sum()
+            )
+            LOG.info(f"`total_supervised_tokens: {total_supervised_tokens}`")
+            cfg.total_supervised_tokens = total_supervised_tokens
+
        if cfg.sample_packing_eff_est:
            total_num_steps = (
                # match count to len est in dataloader
@@ -355,10 +475,16 @@ def calculate_total_num_steps(cfg, train_dataset, tokenizer):

 def setup_fsdp_envs(cfg):
    os.environ["ACCELERATE_USE_FSDP"] = "true"
+    if cfg.fsdp_config.fsdp_offload_params:
+        os.environ["FSDP_OFFLOAD_PARAMS"] = "true"
    if cfg.fsdp_config.fsdp_sync_module_states:
        os.environ["FSDP_SYNC_MODULE_STATES"] = "true"
    if cfg.fsdp_config.fsdp_state_dict_type:
        os.environ["FSDP_STATE_DICT_TYPE"] = cfg.fsdp_config.fsdp_state_dict_type
+    if cfg.fsdp_config.fsdp_transformer_layer_cls_to_wrap:
+        os.environ[
+            "FSDP_TRANSFORMER_CLS_TO_WRAP"
+        ] = cfg.fsdp_config.fsdp_transformer_layer_cls_to_wrap


 def setup_trainer(cfg, train_dataset, eval_dataset, model, tokenizer, total_num_steps):
@@ -392,23 +518,7 @@ def setup_trainer(cfg, train_dataset, eval_dataset, model, tokenizer, total_num_
        training_arguments_kwargs["seed"] = cfg.seed

    if cfg.gradient_checkpointing:
-        if cfg.gptq:
-            from alpaca_lora_4bit.gradient_checkpointing import (
-                apply_gradient_checkpointing,
-            )
-
-            gradient_checkpointing_ratio = (
-                cfg.gradient_checkpointing_ratio
-                if cfg.gradient_checkpointing_ratio
-                else 1.0
-            )
-            apply_gradient_checkpointing(
-                model, checkpoint_ratio=gradient_checkpointing_ratio
-            )
-        else:
-            training_arguments_kwargs[
-                "gradient_checkpointing"
-            ] = cfg.gradient_checkpointing
+        training_arguments_kwargs["gradient_checkpointing"] = cfg.gradient_checkpointing
    if cfg.fsdp:
        training_arguments_kwargs["fsdp"] = cfg.fsdp
        if cfg.fsdp_config:
@@ -455,6 +565,31 @@ def setup_trainer(cfg, train_dataset, eval_dataset, model, tokenizer, total_num_
        # we have an eval set, but no steps defined, use epoch
        training_arguments_kwargs["evaluation_strategy"] = "epoch"

+    if cfg.save_strategy:
+        training_arguments_kwargs["save_strategy"] = cfg.save_strategy
+    else:
+        training_arguments_kwargs["save_strategy"] = (
+            "steps" if cfg.save_steps else "epoch"
+        )
+
+    if cfg.do_bench_eval:
+        training_arguments_kwargs["do_bench_eval"] = cfg.do_bench_eval
+        if cfg.bench_dataset:
+            training_arguments_kwargs["bench_dataset"] = cfg.bench_dataset
+    if cfg.metric_for_best_model:
+        training_arguments_kwargs["metric_for_best_model"] = cfg.metric_for_best_model
+    if cfg.greater_is_better:
+        training_arguments_kwargs["greater_is_better"] = cfg.greater_is_better
+
+    # DDP Config
+    if cfg.ddp_timeout:
+        training_arguments_kwargs["ddp_timeout"] = cfg.ddp_timeout
+    # see https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html
+    if cfg.ddp_bucket_cap_mb:
+        training_arguments_kwargs["ddp_bucket_cap_mb"] = cfg.ddp_bucket_cap_mb
+    if cfg.ddp_broadcast_buffers is not None:
+        training_arguments_kwargs["ddp_broadcast_buffers"] = cfg.ddp_broadcast_buffers
+
    training_args = AxolotlTrainingArguments(  # pylint: disable=unexpected-keyword-arg
        max_steps=total_num_steps if cfg.max_steps else -1,
        max_seq_length=cfg.sequence_len,
@@ -466,16 +601,14 @@ def setup_trainer(cfg, train_dataset, eval_dataset, model, tokenizer, total_num_
        eval_accumulation_steps=cfg.gradient_accumulation_steps,
        num_train_epochs=cfg.num_epochs,
        learning_rate=cfg.learning_rate,
-        save_strategy="steps" if cfg.save_steps else "epoch",
        save_steps=cfg.save_steps,
        output_dir=cfg.output_dir,
        save_total_limit=cfg.save_total_limit if cfg.save_total_limit else 4,
        load_best_model_at_end=(
-            cfg.load_best_model_at_end is not False
+            (cfg.load_best_model_at_end is not False or cfg.early_stopping_patience)
            and cfg.val_set_size > 0
            and cfg.save_steps
            and cfg.save_steps % cfg.eval_steps == 0
-            and cfg.load_in_8bit is not True
        )
        or False,
        ddp_find_unused_parameters=False if cfg.ddp else None,
@@ -489,6 +622,8 @@ def setup_trainer(cfg, train_dataset, eval_dataset, model, tokenizer, total_num_
        weight_decay=cfg.weight_decay if cfg.weight_decay is not None else 0.0,
        sample_packing=cfg.sample_packing if cfg.sample_packing else False,
        sample_packing_seq_len_multiplier=cfg.micro_batch_size,
+        relora_steps=cfg.relora_steps,
+        relora_warmup_steps=cfg.relora_warmup_steps,
        **training_arguments_kwargs,
    )

@@ -498,75 +633,12 @@ def setup_trainer(cfg, train_dataset, eval_dataset, model, tokenizer, total_num_
        if Path(cfg.torchdistx_path).exists():
            sys.path.append(cfg.torchdistx_path)
            importlib.import_module("torchdistx")
-    if (
-        cfg.optimizer == "adamw_bnb_8bit"
-        and not cfg.gptq
-        and "deepspeed" not in training_arguments_kwargs
-        and not cfg.fsdp
-    ):
-        decay_parameters = get_parameter_names(model, [nn.LayerNorm])
-        decay_parameters = [name for name in decay_parameters if "bias" not in name]
-        optimizer_grouped_parameters = [
-            {
-                "params": [
-                    p
-                    for n, p in model.named_parameters()
-                    if (n in decay_parameters and p.requires_grad)
-                ],
-                "weight_decay": training_args.weight_decay,
-            },
-            {
-                "params": [
-                    p
-                    for n, p in model.named_parameters()
-                    if (n not in decay_parameters and p.requires_grad)
-                ],
-                "weight_decay": 0.0,
-            },
-        ]
-
-        optimizer = bnb.optim.Adam8bit(
-            optimizer_grouped_parameters,
-            betas=(training_args.adam_beta1, training_args.adam_beta2),
-            eps=training_args.adam_epsilon,
-            lr=training_args.learning_rate,
-        )
-
-        if cfg.lr_scheduler == "one_cycle":
-            lr_scheduler_kwargs = (
-                cfg.lr_scheduler_kwargs if cfg.lr_scheduler_kwargs else {}
-            )
-            lr_scheduler = OneCycleLR(
-                optimizer,
-                cfg.learning_rate,
-                total_steps=total_num_steps,
-                epochs=cfg.num_epochs,
-                div_factor=cfg.lr_div_factor if cfg.lr_div_factor else 6,
-                **lr_scheduler_kwargs,
-            )
-        elif cfg.lr_scheduler == "log_sweep":
-            lr_scheduler = InterpolatingLogScheduler(
-                optimizer,
-                cfg.warmup_steps,
-                cfg.log_sweep_min_lr if cfg.log_sweep_min_lr else 1e-10,
-                cfg.log_sweep_max_lr if cfg.log_sweep_max_lr else 10,
-            )
-        else:
-            lr_scheduler = transformers.get_cosine_schedule_with_warmup(
-                optimizer,
-                training_args.warmup_steps,
-                total_num_steps,
-            )
-        trainer_kwargs["optimizers"] = (optimizer, lr_scheduler)

    callbacks = []
    callbacks.append(GPUStatsCallback(cfg))
-    # TODO on_save callback to sync checkpoints to GCP/AWS in background
-    if cfg.early_stopping_patience:
-        early_stop_cb = EarlyStoppingCallback(
-            cfg.early_stopping_patience,
-        )
-        callbacks.append(early_stop_cb)
+
+    if cfg.relora_steps:
+        callbacks.append(ReLoRACallback(cfg))

    if cfg.local_rank == 0 and cfg.adapter in [
        "lora",
@@ -578,10 +650,12 @@ def setup_trainer(cfg, train_dataset, eval_dataset, model, tokenizer, total_num_
        callbacks.append(SaveBetterTransformerModelCallback)

    data_collator_kwargs = {
-        "padding": True,
+        "padding": True,  # True/"longest" is the default
    }
-    if cfg.collator_pad_to_longest:
-        data_collator_kwargs["padding"] = "longest"
+    if cfg.pad_to_sequence_len:
+        data_collator_kwargs["pad_to_multiple_of"] = 64 * math.ceil(
+            cfg.sequence_len / 64
+        )
    else:
        # A100 is best at 64, while others at 8. Let's use the larger so we don't have to check
        # https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html
@@ -605,11 +679,11 @@ def setup_trainer(cfg, train_dataset, eval_dataset, model, tokenizer, total_num_
                num_proc=32,
            )

-    trainer_cls = (
-        OneCycleLRSchedulerTrainer
-        if cfg.lr_scheduler == "one_cycle" and (cfg.fsdp or cfg.adapter == "qlora")
-        else AxolotlTrainer
-    )
+    trainer_cls = AxolotlTrainer
+    if cfg.lr_scheduler == "one_cycle" and (cfg.fsdp or cfg.adapter == "qlora"):
+        trainer_cls = OneCycleLRSchedulerTrainer
+    elif cfg.relora_steps:
+        trainer_cls = ReLoRATrainer
    trainer = trainer_cls(
        model=model,
        train_dataset=train_dataset,
@@ -620,8 +694,23 @@ def setup_trainer(cfg, train_dataset, eval_dataset, model, tokenizer, total_num_
            return_tensors="pt",
            **data_collator_kwargs,
        ),
+        bench_data_collator=transformers.DataCollatorForSeq2Seq(
+            tokenizer,
+            return_tensors="pt",
+            **data_collator_kwargs,
+        ),
        callbacks=callbacks,
        **trainer_kwargs,
    )

+    if cfg.do_bench_eval:
+        trainer.add_callback(bench_eval_callback_factory(trainer, tokenizer))
+
+    # TODO on_save callback to sync checkpoints to GCP/AWS in background
+    if cfg.early_stopping_patience:
+        early_stop_cb = EarlyStoppingCallback(
+            cfg.early_stopping_patience,
+        )
+        trainer.add_callback(early_stop_cb)
+
    return trainer
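Note on the data-preparation changes in trainer.py: three small pieces of arithmetic are introduced above. These are the length filter that now also drops empty rows, the supervised-token count (labels not equal to `-100`), and the `pad_to_sequence_len` rounding up to a multiple of 64. The sketch below restates them in pure Python under stated assumptions. The names `keep_sample`, `count_supervised_tokens`, and `padded_length` are hypothetical, and plain lists stand in for the `datasets` and pandas objects used in the real code.

```python
import math

IGNORE_INDEX = -100  # label value treated as "not supervised" in the diff


def keep_sample(sample, sequence_len=2048):
    """Mirror of drop_long_seq: keep rows that are non-empty and not too long."""
    return 0 < len(sample["input_ids"]) <= sequence_len


def count_supervised_tokens(rows):
    """Count label positions that contribute to the loss (label != -100)."""
    return sum(
        1 for row in rows for label in row["labels"] if label != IGNORE_INDEX
    )


def padded_length(sequence_len, multiple=64):
    """Round sequence_len up to the next multiple, as in 64 * ceil(len / 64)."""
    return multiple * math.ceil(sequence_len / multiple)


rows = [
    {"input_ids": [1, 2, 3], "labels": [-100, 2, 3]},        # kept
    {"input_ids": [], "labels": []},                          # dropped: empty
    {"input_ids": list(range(10)), "labels": [-100] * 10},    # dropped: too long
]
kept = [row for row in rows if keep_sample(row, sequence_len=8)]
supervised = count_supervised_tokens(kept)
```

Rounding up rather than to the nearest multiple means the padded length is never shorter than the configured sequence length.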
--- a/tests/fixtures/conversation.tokenized_llama2chat.json
+++ b/tests/fixtures/conversation.tokenized_llama2chat.json
--- a/tests/test_data.py
+++ b/tests/test_data.py
@@ -0,0 +1,64 @@
+"""
+test module for the axolotl.utils.data module
+"""
+import unittest
+
+from transformers import LlamaTokenizer
+
+from axolotl.utils.data import encode_pretraining, md5
+
+
+class TestEncodePretraining(unittest.TestCase):
+    """
+    test class for encode pretraining and md5 helper
+    """
+
+    def setUp(self):
+        self.tokenizer = LlamaTokenizer.from_pretrained("huggyllama/llama-7b")
+        self.tokenizer.add_special_tokens(
+            {
+                "eos_token": "</s>",
+                "bos_token": "<s>",
+                "unk_token": "<unk>",
+                "pad_token": "<pad>",
+            }
+        )
+        self.max_tokens = 15  # set a small number for easy inspection
+
+    def test_encode_pretraining(self):
+        examples = {
+            "text": [
+                "Hello, world!",
+                "Nice to meet you.",
+                "lorem ipsum dolor sit amet.",
+                "Nice to meet you again!.",
+                "hello, hello",
+            ]
+        }
+        result = encode_pretraining(self.tokenizer, self.max_tokens, examples)
+
+        self.assertEqual(len(result["input_ids"]), 3)
+
+        # Assert the length of input_ids and attention_mask is correct
+        self.assertEqual(len(result["input_ids"][0]), self.max_tokens)
+        self.assertEqual(len(result["attention_mask"][0]), self.max_tokens)
+
+        # Assert EOS and PAD tokens are correctly added
+        # hello world! is 4 tokens
+        self.assertEqual(result["input_ids"][0][0], self.tokenizer.bos_token_id)
+        self.assertEqual(result["input_ids"][0][5], self.tokenizer.eos_token_id)
+        self.assertEqual(result["input_ids"][0][6], self.tokenizer.pad_token_id)
+        # second part, 5 tokens
+        self.assertEqual(result["input_ids"][0][7], self.tokenizer.bos_token_id)
+        self.assertEqual(result["input_ids"][0][13], self.tokenizer.eos_token_id)
+        self.assertEqual(result["input_ids"][0][14], self.tokenizer.pad_token_id)
+
+    def test_md5(self):
+        self.assertEqual(md5("hello world"), "5eb63bbbe01eeed093cb22bb8f5acdc3")
+        self.assertEqual(
+            md5("hello world", "utf-8"), "5eb63bbbe01eeed093cb22bb8f5acdc3"
+        )
+
+
+if __name__ == "__main__":
+    unittest.main()
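Note on `test_md5`: the test calls `md5` with and without an explicit encoding and expects the standard hex digest. The helper itself is not shown in this diff, so the sketch below is one plausible shape inferred from the test, not the actual implementation in `axolotl.utils.data`. The `usedforsecurity=False` fallback is an assumption about how hash-policy differences between Python builds might be handled.

```python
import hashlib


def md5(to_hash: str, encoding: str = "utf-8") -> str:
    """Return the hex MD5 digest of a string (used for cache keys, not security)."""
    data = to_hash.encode(encoding)
    try:
        # Python 3.9+ accepts this flag; it marks the hash as non-cryptographic use.
        return hashlib.md5(data, usedforsecurity=False).hexdigest()
    except TypeError:
        # Older interpreters do not know the keyword argument.
        return hashlib.md5(data).hexdigest()


digest = md5("hello world")
```

The expected value asserted in the test, `5eb63bbbe01eeed093cb22bb8f5acdc3`, is the MD5 digest of the UTF-8 bytes of `hello world`.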
--- a/tests/test_validation.py
+++ b/tests/test_validation.py
@@ -328,6 +328,20 @@ class ValidationTest(unittest.TestCase):
                for record in self._caplog.records
            )

+        cfg = DictDefault(
+            {
+                "sample_packing": True,
+                "pad_to_sequence_len": None,
+            }
+        )
+        with self._caplog.at_level(logging.WARNING):
+            validate_config(cfg)
+            assert any(
+                "`pad_to_sequence_len: true` is recommended when using sample_packing"
+                in record.message
+                for record in self._caplog.records
+            )
+
        cfg = DictDefault(
            {
                "max_packed_sequence_len": 2048,
Author	SHA1	Message	Date
Wing Lian	772cd870d4	fix the sed command to replace the version w the tag	2023-09-11 13:44:19 -04:00
Wing Lian	6c5fbe6223	add long_description for pypi push (#555 )	2023-09-11 13:34:29 -04:00
Wing Lian	bcbc9597e9	replace tags, build dist for pypi publish (#553 ) * replace tags, build dist for pypi publish * missing trailing comma	2023-09-11 13:25:41 -04:00
The Objective Dad	6d57f2f0f0	ergonomic update to optimizer config doc (#548 )	2023-09-11 12:35:45 -04:00
Wing Lian	20ed4c1f9e	pypi on tag push (#552 )	2023-09-11 10:33:42 -04:00
Wing Lian	c5dedb17ad	remove with section, doesn't seem to work (#551 )	2023-09-11 10:27:17 -04:00
Wing Lian	b56503d423	publish to pypi workflow on tagged release (#549 )	2023-09-11 09:44:47 -04:00
Wing Lian	a94f9cb99e	fix for quant config from model (#540 )	2023-09-10 12:40:52 -04:00
dongxiaolong	c1921c9acb	Update requirements.txt (#543 ) fix fsdp	2023-09-08 16:07:11 -04:00
Wing Lian	0b4cf5bc8c	workaround for md5 variations (#533 ) * workaround for md5 variations * refactor the prepared hash too	2023-09-08 16:01:05 -04:00
SlapDrone	78ee2cdab2	add git environment variables to compose: avoid checkout failure error 128 on build (#534 )	2023-09-08 15:59:49 -04:00
Wing Lian	34c0a86a11	update readme to point to direct link to runpod template, cleanup install instrucitons (#532 ) * update readme to point to direct link to runpod template, cleanup install instrucitons * default install flash-attn and auto-gptq now too * update readme w flash-attn extra * fix version in setup	2023-09-08 11:58:54 -04:00
The Objective Dad	5e2d8a42d9	Adding NCCL Timeout Guide (#536 ) * fixes NCCL_P2P_LEVEL=NVL #429 * adding more insights into verious values of NCCL_P2P_LEVEL	2023-09-08 11:57:47 -04:00
Wing Lian	e30f1e3cf7	Early stopping metric (#537 ) * set early stopping metric to check * tweak how load_best_model_at_end gets set for early stopping * add validation for earl;y stopping patience * remove negation * save results to metrics in callback * move early stopping callback after the benchmark evals * broadcast metrics so early stopping works	2023-09-08 11:57:02 -04:00
Wing Lian	343714972b	recommend padding when using sample packing (#531 )	2023-09-06 17:00:21 -04:00
Wing Lian	245c5c41e2	log rank too (#527 )	2023-09-06 08:37:51 -04:00
Wing Lian	a546ca2813	misc fixes/improvements (#513 ) fix per pr feedback	2023-09-05 16:40:13 -04:00
Wing Lian	3355706e22	Add support for GPTQ using native transformers/peft (#468 ) * auto gptq support * more tweaks and add yml * remove old gptq docker * don't need explicit peft install for tests * fix setup.py to use extra index url install torch for tests fix cuda version for autogptq index set torch in requirements so that it installs properly move gptq install around to work with github cicd * gptq doesn't play well with sample packing * address pr feedback * remove torch install for now * set quantization_config from model config * Fix the implementation for getting quant config from model config	2023-09-05 12:43:22 -04:00
mhenrichsen	daa4faca12	Merge pull request #520 from bdashore3/sharegpt-fixes Allow for custom system prompts with ShareGPT	2023-09-05 09:02:55 +02:00
Aman Karmani	fc8766e502	reorg a bit	2023-09-05 02:21:24 +00:00
Aman Gupta Karmani	72a6fe1c1f	use flash_attn rmsnorm when available (#526 ) * use flash_attn xentropy when available * use flash_attn.ops.rms_norm when available * log when xentropy is not found * log how to install RMSNorm * add quotes so pip install works	2023-09-04 19:44:51 -04:00
Aman Gupta Karmani	5fe30b1497	use flash_attn xentropy when available (#525 ) * use flash_attn xentropy when available * log when xentropy is not found	2023-09-04 17:49:16 -04:00
Aman Gupta Karmani	44454ae4c4	move is_llama_derived_model into normalize_config (#524 )	2023-09-04 00:19:03 -04:00
Wing Lian	09f154397e	No gather single gpu (#523 ) * don't attempt to gather on multi-gpu * also check distributed status in bench callback	2023-09-03 23:24:28 -04:00
kingbri	995557bdf3	Prompters: ShareGPT: Allow for custom system prompts If a system prompt is present in a conversation, add it instead of using the default. Signed-off-by: kingbri <bdashore3@proton.me>	2023-09-01 13:53:05 -04:00
Maxime	1991946c5a	fix: bad dtype for full finetune (#504 ) * fix: bad dtype for full finetune * Update src/axolotl/utils/models.py Co-authored-by: Wing Lian <wing.lian@gmail.com> * Update models.py --------- Co-authored-by: Wing Lian <wing.lian@gmail.com>	2023-09-01 07:11:45 -07:00
NanoCode012	f51c9c56c6	Fix(doc): Inform Windows users to use WSL/docker (#518 )	2023-09-01 00:08:21 -07:00
Wing Lian	7710e81f50	log supervised token count (#448 )	2023-08-31 15:45:23 -07:00
Tom Jobbins	48434bec54	Debug tokenization output: Add ability to output text only (no tokens), and/or specify num samples to see (#511 )	2023-08-31 14:26:52 -07:00
Jan Philipp Harries	396a7a74fc	Added advanced DDP args (#515 ) * add ddp_config * add advanced ddp config * add ddp_config * add advanced ddp config --------- Co-authored-by: Jan Philipp Harries <jphme@users.noreply.github.com>	2023-08-31 10:37:47 -07:00
Wing Lian	b21e4a20fe	split train from other cli options (#503 )	2023-08-30 22:01:47 -07:00
Alpay Ariyak	42f9642792	Changed Bench Eval to report metrics correctly by split. Added total accuracy and renamed previously used bench_accuracy to bench_average_accuracy. (#512 ) * Added "eval_" prefix * Added total bench accuracy and renamed the previous one to bench_average_accuracy. Changed naming to use bench_split instead of always using eval_ prefix.	2023-08-30 22:00:50 -07:00
Wing Lian	c56b450cf5	drop empty tokenized rows too (#509 )	2023-08-30 06:55:26 -07:00
Aman Gupta Karmani	1e07c162f1	set zero3 optimizer betas to auto so they inherit from HF trainer config (#507 )	2023-08-30 08:10:33 -04:00
Wing Lian	76576323df	add eval benchmark callback (#441 ) * add mmlu callback * use hf dataset for mmlu evals * default to mmlu-zs * make sure to define all the explicit positional args * include metrics in callback * another callback fix for collator max len attribute * fix mmlu evals * sample benchmarks, ensure we drop long samples * fix the data file * fix elif and add better messaging * more fixes * rename mmlu to bench * more fixes * dataset handling and aggregate across benchmark * better handling when no subjects * benchmark callback has its own dataloader and collator * fixes * updated dataset * more fixes * missing transformers import * improve support for customized dataset for bench evals * gather benchmarks from all ranks * fix for gather across multiple gpus	2023-08-29 13:24:19 -07:00
Wing Lian	548787daae	customizable ascii art (#506 )	2023-08-29 10:13:42 -07:00
Wing Lian	5ac3392075	support for datasets with multiple names (#480 ) * support for datasets with multiple names * update docs	2023-08-29 06:18:17 -07:00
Aman Gupta Karmani	e356b297cb	remove --force-reinstall from Dockerfile to ensure correct pytorch version (#492 )	2023-08-29 06:17:51 -07:00
NanoCode012	48c56470d0	Fix(doc): Clarify no amp to full yaml docs (#496 )	2023-08-29 06:17:37 -07:00
Maxime	36b2e1cfee	tweak: use default config file when only one file is present (#501 )	2023-08-29 06:17:10 -07:00
Wing Lian	125cccb786	Refactor train cfg cli (#499 ) * wip to cleanup cfg cli options * fix launcher * fix cli args	2023-08-29 05:37:53 -07:00
Aman Karmani	fd55bc87e2	use math.ceil instead of round /cc #498	2023-08-29 01:03:41 +00:00
Birch-san	8e197f6fb4	pad_to_worst_case_seq_len boolean, for testing memory limits (#498 ) * pad_to_worst_case_seq_len boolean, for testing memory limits * remove collator_pad_to_longest option since it does nothing; see docs: https://huggingface.co/docs/transformers/main_classes/data_collator#transformers.DataCollatorWithPadding.padding (True and "longest" mean the same thing) * rename to `pad_to_sequence_len`, and ensure 64 alignment --------- Co-authored-by: Aman Karmani <aman@tmm1.net>	2023-08-28 18:47:16 -04:00
Aman Karmani	267b7b24e5	simplify linear layer locator	2023-08-28 09:45:16 -04:00
Wing Lian	98bf76e236	fsdp requires params be the same type too (#493 )	2023-08-28 04:33:50 -04:00
NanoCode012	4c37bd0b54	Fix(tokenizer): Make sure to add pad for CodeLlamaTokenizer (#489 )	2023-08-28 09:39:10 +09:00
Aman Gupta Karmani	f144e98a32	Merge pull request #485 from maximegmd/patch-4 fix: finetune model inference needs the dtype fix to work with flash-attn	2023-08-27 16:27:47 -04:00
Aman Karmani	3a011ea1ef	fix condition and add logging	2023-08-27 20:09:26 +00:00
Aman Karmani	1f613e5aa7	Merge branch 'main' into patch-4	2023-08-27 19:57:34 +00:00
Aman Karmani	f319b0bc67	rename var and reformat	2023-08-27 19:55:11 +00:00
Maxime	7fd662dd89	Update src/axolotl/utils/models.py Co-authored-by: Aman Gupta Karmani <aman@tmm1.net>	2023-08-27 21:01:43 +02:00
Maxime	9e699683d7	Update src/axolotl/utils/models.py Co-authored-by: Aman Gupta Karmani <aman@tmm1.net>	2023-08-27 21:01:37 +02:00
mhenrichsen	35130711d6	Feat(cfg): Add code-llama configs for all sizes (#479 ) * configs for all sizes * update tokenizer type --------- Co-authored-by: mhenrichsen <some_email@hey.com>	2023-08-27 10:20:17 +09:00
mhenrichsen	3fc9006298	Feat(deepspeed): Add zero2 config (#476 ) * zero2 config * config added * linting --------- Co-authored-by: mhenrichsen <some_email@hey.com>	2023-08-27 10:10:33 +09:00
NanoCode012	ad8be435ad	Feat(doc): Update eval_steps doc (#487 )	2023-08-27 10:09:09 +09:00
Charles O. Goddard	fe4d6baf92	Add example Llama 2 ReLoRA config (#471 ) * Add example Llama 2 ReLoRA config * Use adamw_bnb_8bit in example relora config	2023-08-27 10:08:34 +09:00
Aman Gupta Karmani	f31301063d	Merge pull request #486 from OpenAccess-AI-Collective/adam-bnb-simpler let transformers handle adamw_bnb_8bit	2023-08-26 20:44:19 -04:00
Aman Karmani	868530c39c	let transformers handle adamw_bnb_8bit	2023-08-26 21:40:12 +00:00
Maxime	d03887fad5	ignore: address pr review	2023-08-26 22:45:45 +02:00
Maxime	17605b85d8	fix: inference did not move the model to the correct device (#483 )	2023-08-26 16:40:56 -04:00
Maxime	a184549e4c	ignore: linter	2023-08-26 22:36:14 +02:00
Maxime	f311df9462	fix: finetune model inference needs the dtype fix to work with flash-attn	2023-08-26 22:34:11 +02:00
Maxime	c500d02517	Fix missing 'packaging' wheel (#482 )	2023-08-26 12:02:15 -04:00
Wing Lian	31f3e71764	fix checkpoints on multigpu (#481 )	2023-08-26 12:00:03 -04:00
Aman Gupta Karmani	56c4a94caf	Merge pull request #484 from OpenAccess-AI-Collective/reqs allow newer deps in requirements.txt	2023-08-26 11:13:41 -04:00
Aman Karmani	c29117a0d7	allow newer deps	2023-08-26 15:06:05 +00:00
Wing Lian	0b7ba57ec4	fix types w lora (#478 )	2023-08-25 02:03:24 -04:00
NanoCode012	71bd06243c	Fix(tokenizer): Fix condition to add pad token (#477 ) * Fix(tokenizer): Fix condition to add pad token * chore: fix lint	2023-08-25 14:30:50 +09:00
Wing Lian	cb9797ef5a	improve llama pad token handling (#475 ) * improve llama pad token handling * tweak logic to not clobber	2023-08-24 13:20:35 -04:00
Charles O. Goddard	bde3c5a478	ReLoRA implementation (with quantization) (#322 ) * Experimental ReLoRA (+qlora) implementation * Add CPU offload * Remove local config * Fix saving logic * Remove redundant assert * Fix logic errors * Move ReLoRA into its own trainer class with a method override to create the proper scheduler * Formatting & typing fixes * Use safe_serialization * Don't allow fsdp/deepspeed with ReLoRA * Fix cpu-offload logic, enable multi gpu * Document parameters and add comment * Fix merge issue * Smooth over some sharp edges * Implement resume from checkpoint for relora * Address review comments * Fix saving logic * Add necessary metadata to safetensors --------- Co-authored-by: Wing Lian <wing.lian@gmail.com>	2023-08-23 23:07:18 -04:00
NanoCode012	55c23c7bcb	Fix(doc): Clarify config (#466 )	2023-08-23 11:56:01 -04:00
Wing Lian	c69faee7a7	workaround so training doesn't hang when packed dataloader batches aren't even (#461 ) * workaround so training doesn't hang when packed dataloader batches aren't even * don't bother labeling anything in the no-op data	2023-08-23 10:39:11 -04:00
Wing Lian	d5dcf9c350	fix test fixture b/c hf trainer tokenization changed (#464 )	2023-08-23 04:04:49 -04:00
TearGosling	f4746507f6	feat: add Metharme prompt strategy (#446 ) * Add Metharme tokenizing strategy This strategy accounts for how the Metharme JSONLs are formatted as well as adds duplicated EOS tokens which can help trim model output length. I haven't gotten the chance to test this yet, and probably won't have the chance for quite a bit, so I'm committing this now. * Redo Metharme tokenizing strategy lol * fix: oops * Rearrange a conditional * chore: reformat code in accordance with linter * chore: Make lint not freak out * chore: fix lint --------- Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>	2023-08-22 11:21:45 +09:00
Wing Lian	96deb6bd67	recast loralayer, norm, lmhead + embed token weights per original qlora (#393 ) * recast loralayer, norm, lmhead + embed token weights per original qlora * try again for the fix * refactor torch dtype picking * linter fixes * missing import for LoraLayer * fix install for tests now that peft is involved	2023-08-21 18:41:12 -04:00
Wing Lian	50682a3c06	always drop samples that are too long (#452 )	2023-08-21 16:43:33 -04:00
Wing Lian	5a1985ba24	set env var for FSDP layer to wrap (#453 )	2023-08-21 16:43:22 -04:00
Aman Gupta Karmani	5e9c6afa10	Merge pull request #451 from OpenAccess-AI-Collective/eval-is-causal is_causal fix for evals?	2023-08-21 10:43:46 -07:00
Aman Karmani	a213d9972a	fix eval regression caused in `13f7efaf74`	2023-08-21 10:40:06 -07:00
Wing Lian	fbf49a4770	is_causal fix for evals?	2023-08-21 10:36:26 -04:00
Wing Lian	58cf7e7fed	add missing positional arg (#450 )	2023-08-21 04:10:19 -04:00
NanoCode012	04a42b6db1	feat(docs): improve user customized prompts (#443 ) * feat(docs): improve user customized prompts * feat(doc): add custom pretokenized instructions * chore: clean old data folder * chore: add new line	2023-08-20 23:59:43 -04:00
NanoCode012	919f4cac90	feat(doc): add pillow to lambda instructions (#445 )	2023-08-20 23:59:23 -04:00
Wing Lian	ee262818ef	fix evals (#447 )	2023-08-20 23:39:42 -04:00
Wing Lian	9d629d8bff	gracefully handle empty input (#442 )	2023-08-20 09:18:18 -04:00
Wing Lian	d2e7f27240	support user defined prompters, pretokenized datasets in config, local parquet, local arrow files (#348 ) * support user defined prompters, pretokenized datasets in config, local parquet, local arrow files * fix user defined dataset types * fix for system prompts * fix tests * fix checks for parquet and arrow * aha moment that d.data_files isn't used * add documentation for ds_type to add support for parquet and arrow	2023-08-20 09:17:49 -04:00
Philpax	d21318dfb9	docs(readme): add `cd axolotl` (#440 )	2023-08-19 19:14:05 -04:00
Wing Lian	f733d0f31e	disable eval using multipack for now (#437 )	2023-08-19 10:35:04 -04:00
Wing Lian	008505c8ae	fix comma, not a tuple (#436 )	2023-08-19 00:57:40 -04:00
Wing Lian	b3f5e00ff5	use save_strategy from config if available (#434 ) * use save_strategy from config if available * update docs for save_strategy	2023-08-18 20:28:23 -04:00
Wing Lian	5247c5004e	set env for FSDP offload params (#433 )	2023-08-18 20:28:09 -04:00
mhenrichsen	cf6654769a	flash attn pip install (#426 ) * flash attn pip * add packaging * add packaging to apt get * install flash attn in dockerfile * remove unused whls * add wheel * clean up pr; fix packaging requirement for ci; upgrade pip for ci; skip build isolation for requirements to get flash-attn working; install flash-attn separately * install wheel for ci * no flash-attn for basic cicd * install flash-attn as pip extras --------- Co-authored-by: Ubuntu <mgh@mgh-vm.wsyvwcia0jxedeyrchqg425tpb.ax.internal.cloudapp.net> Co-authored-by: mhenrichsen <some_email@hey.com> Co-authored-by: Mads Henrichsen <mads@BrbartiendeMads.lan> Co-authored-by: Wing Lian <wing.lian@gmail.com>	2023-08-18 19:00:27 -04:00
Aman Gupta Karmani	06edf175ac	standardize attn hijack patches (#381 ) * split sdp attn into its own patch * sync xformers patch to follow shared format and be diffable * update flash-attn patch for 70B/GQA and inference using helper from flash-attn tests * speed up flash-attn inference * fix patch to check position ids and don't use multipack for evals * copy LlamaModel.forward and LlamaDecoderLayer.forward into monkeypatch * update forwards so we only calculate cu_seqlens once * enable eval dataloader using multipack again * fix the patch to work properly and work with FSDP --------- Co-authored-by: Wing Lian <wing.lian@gmail.com>	2023-08-18 12:54:16 -04:00
mhenrichsen	0a228479b3	adds color (#425 ) * adds color * chore: lint * fix for colorama --------- Co-authored-by: Wing Lian <wing.lian@gmail.com>	2023-08-18 10:59:43 -04:00
Wing Lian	82e111aba9	remove extra accelerate in requirements (#430 )	2023-08-18 10:56:14 -04:00