set model merge dtype based on cfg

push merged lora to hf
deduplicate code
2023-08-23 04:07:44 -04:00
44 changed files with 564 additions and 2272 deletions
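The first commit message ("set model merge dtype based on cfg") describes the rule that `merge_lora` in `scripts/finetune.py` applies: use bfloat16 when the config sets `bf16` or `bfloat16`, otherwise use float16. A minimal sketch of that rule follows. The helper name `pick_merge_dtype` is hypothetical and not part of the change. It returns the dtype name as a string so the snippet runs without torch installed; with torch available, `getattr(torch, name)` would give the actual dtype.

```python
from types import SimpleNamespace


def pick_merge_dtype(cfg) -> str:
    """Return the name of the dtype a merged model would be cast to.

    Hypothetical helper: mirrors the bf16/bfloat16 check in the diff,
    treating a missing or empty flag as False.
    """
    use_bf16 = bool(getattr(cfg, "bf16", None) or getattr(cfg, "bfloat16", None))
    return "bfloat16" if use_bf16 else "float16"


# Stand-ins for a loaded config; the real cfg is a DictDefault.
bf16_cfg = SimpleNamespace(bf16=True, bfloat16=None)
fp16_cfg = SimpleNamespace(bf16=False, bfloat16=None)
empty_cfg = SimpleNamespace()
```

Here `pick_merge_dtype(bf16_cfg)` selects bfloat16, while the other two configs fall back to float16, which was the hard-coded dtype before this change.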
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -23,6 +23,11 @@ jobs:
            python_version: "3.10"
            pytorch: 2.0.1
            axolotl_extras:
+          - cuda: 118
+            cuda_version: 11.8.0
+            python_version: "3.9"
+            pytorch: 2.0.1
+            axolotl_extras: gptq
    runs-on: self-hosted
    steps:
      - name: Checkout
@@ -68,6 +73,11 @@ jobs:
            pytorch: 2.0.1
            axolotl_extras:
            is_latest: true
+          - cuda: 118
+            cuda_version: 11.8.0
+            python_version: "3.9"
+            pytorch: 2.0.1
+            axolotl_extras: gptq
    runs-on: self-hosted
    steps:
      - name: Checkout
--- a/.github/workflows/pypi.yml
+++ b/.github/workflows/pypi.yml
@@ -1,45 +0,0 @@
-name: publish pypi
-
-on:
-  push:
-    tags:
-      - '*'
-
-jobs:
-  pypi-publish:
-    name: Upload release to PyPI
-    runs-on: ubuntu-latest
-    environment:
-      name: pypi
-      url: https://pypi.org/p/axolotl
-    permissions:
-      id-token: write  # IMPORTANT: this permission is mandatory for trusted publishing
-    steps:
-      - name: Check out repository code
-        uses: actions/checkout@v3
-
-      - name: Setup Python
-        uses: actions/setup-python@v4
-        with:
-          python-version: "3.10"
-
-      - name: Install dependencies
-        run: |
-          pip3 install wheel
-          pip3 install -e .
-          pip3 install -r requirements-tests.txt
-
-      - name: Extract tag name
-        id: tag
-        run: echo ::set-output name=TAG_NAME::$(echo $GITHUB_REF | cut -d / -f 3)
-
-      - name: Update version in setup.py
-        run: >-
-          sed -i -E 's/version="([0-9.]+)",/version="${{ steps.tag.outputs.TAG_NAME }}",/g' setup.py
-
-      - name: Build a binary wheel
-        run: >-
-          python setup.py sdist bdist_wheel
-
-      - name: Publish package distributions to PyPI
-        uses: pypa/gh-action-pypi-publish@release/v1
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -24,8 +24,8 @@ jobs:

      - name: Install dependencies
        run: |
-          pip3 install -e .
-          pip3 install -r requirements-tests.txt
+          pip install -e .[peft]
+          pip install -r requirements-tests.txt

      - name: Run tests
        run: |
--- a/README.md
+++ b/README.md
@@ -90,7 +90,8 @@ accelerate launch scripts/finetune.py examples/openllama-3b/lora.yml \
  ```bash
  docker run --gpus '"all"' --rm -it winglian/axolotl:main-py3.10-cu118-2.0.1
  ```
-  - `winglian/axolotl-runpod:main-latest`: for runpod or use this [direct link](https://runpod.io/gsc?template=v2ickqhz9s&ref=6i7fkpdz)
+  - `winglian/axolotl-runpod:main-py3.10-cu118-2.0.1`: for runpod
+  - `winglian/axolotl-runpod:main-py3.9-cu118-2.0.1-gptq`: for gptq

  Or run on the current files for development:

@@ -103,9 +104,19 @@ accelerate launch scripts/finetune.py examples/openllama-3b/lora.yml \

  2. Install pytorch stable https://pytorch.org/get-started/locally/

-  3. Install axolotl along with python dependencies
+  3. Install python dependencies with ONE of the following:
+      - Recommended, supports QLoRA, NO gptq/int4 support
        ```bash
-        pip3 install -e .[flash-attn]
+        pip3 install -e .
+        pip3 install -U git+https://github.com/huggingface/peft.git
+        ```
+      - gptq/int4 support, NO QLoRA
+        ```bash
+        pip3 install -e .[gptq]
+        ```
+      - same as above but not recommended
+        ```bash
+        pip3 install -e .[gptq_triton]
        ```

 - LambdaLabs
@@ -140,9 +151,10 @@ accelerate launch scripts/finetune.py examples/openllama-3b/lora.yml \
  git clone https://github.com/OpenAccess-AI-Collective/axolotl
  cd axolotl

-  pip3 install -e .
+  pip3 install -e . # change depending on your needs
  pip3 install protobuf==3.20.3
  pip3 install -U --ignore-installed requests Pillow psutil scipy
+  pip3 install git+https://github.com/huggingface/peft.git # not for gptq
  ```

  5. Set path
@@ -151,8 +163,6 @@ accelerate launch scripts/finetune.py examples/openllama-3b/lora.yml \
  ```
  </details>

- Windows: Please use WSL or Docker!
-
 ### Dataset

 Axolotl supports a variety of dataset formats. Below are some of the formats you can use.
@@ -318,15 +328,6 @@ See [examples](examples) for quick start. It is recommended to duplicate and mod
      name: enron_emails
      type: completion # format from earlier

-  # huggingface repo with multiple named configurations/subsets
-  datasets:
-    - path: bigcode/commitpackft
-      name:
-        - ruby
-        - python
-        - typescript
-      type: ... # unimplemented custom format
-
  # local
  datasets:
    - path: data.jsonl # or json
@@ -406,10 +407,6 @@ fp16: true
 # Use CUDA tf32
 tf32: true # require >=ampere

-# No AMP (automatic mixed precision)
-bfloat16: true # require >=ampere
-float16: true
-
 # a list of one or more datasets to finetune the model with
 datasets:
  # hf dataset repo | "json" for local dataset, make sure to fill data_files
@@ -462,9 +459,6 @@ dataset_shard_idx:
 # the maximum length of an input to train with, this should typically be less than 2048
 # as most models have a token/context limit of 2048
 sequence_len: 2048
-# pad inputs so each step uses constant sized buffers
-# this will reduce memory fragmentation and may prevent OOMs, by re-using memory more efficiently
-pad_to_sequence_len:
 # max sequence length to concatenate training samples together up to
 # inspired by StackLLaMA. see https://huggingface.co/blog/stackllama#supervised-fine-tuning
 # FutureWarning: This will soon be DEPRECATED
@@ -499,12 +493,6 @@ lora_modules_to_save:
 lora_out_dir:
 lora_fan_in_fan_out: false

-# ReLoRA configuration
-# must use either 'lora' or 'qlora' adapter, and does not support fsdp or deepspeed
-relora_steps: # number of steps per ReLoRA restart
-relora_warmup_steps: # number of per-restart warmup steps
-relora_cpu_offload: # true to perform lora weight merges on cpu during restarts, for modest gpu memory savings
-
 # wandb configuration if you're using it
 wandb_mode: # "offline" to save run metadata locally and not sync to the server, "disabled" to turn off wandb
 wandb_project: # your wandb project name
@@ -527,7 +515,7 @@ lr_quadratic_warmup:
 logging_steps:
 save_strategy: # set to `no` to skip checkpoint saves
 save_steps: # leave empty to save at each epoch
-eval_steps: # leave empty to eval at each epoch
+eval_steps:
 save_total_limit: # checkpoints saved at a time
 max_steps:

@@ -560,30 +548,6 @@ log_sweep_min_lr:
 log_sweep_max_lr:

 # specify optimizer
-# Valid values are driven by the Transformers OptimizerNames class, see:
-# https://github.com/huggingface/transformers/blob/95b374952dc27d8511541d6f5a4e22c9ec11fb24/src/transformers/training_args.py#L134
-#
-# Note that not all optimizers may be available in your environment, ex: 'adamw_anyprecision' is part of
-# torchdistx, 'adamw_bnb_8bit' is part of bnb.optim.Adam8bit, etc. When in doubt, it is recommended to start with the optimizer used
-# in the examples/ for your model and fine-tuning use case.
-#
-# Valid values for 'optimizer' include:
-# - adamw_hf
-# - adamw_torch
-# - adamw_torch_fused
-# - adamw_torch_xla
-# - adamw_apex_fused
-# - adafactor
-# - adamw_anyprecision
-# - sgd
-# - adagrad
-# - adamw_bnb_8bit
-# - lion_8bit
-# - lion_32bit
-# - paged_adamw_32bit
-# - paged_adamw_8bit
-# - paged_lion_32bit
-# - paged_lion_8bit
 optimizer:
 # specify weight decay
 weight_decay:
@@ -637,14 +601,12 @@ fsdp_config:
 # Deepspeed config path
 deepspeed:

-# Advanced DDP Arguments
-ddp_timeout:
-ddp_bucket_cap_mb:
-ddp_broadcast_buffers:
-
 # Path to torch distx for optim 'adamw_anyprecision'
 torchdistx_path:

+# Set padding for data collator to 'longest'
+collator_pad_to_longest:
+
 # Set to HF dataset for type: 'completion' for streaming instead of pre-tokenize
 pretraining_dataset:

@@ -664,7 +626,7 @@ strict:

 Run
 ```bash
-accelerate launch scripts/finetune.py your_config.yml
+accelerate launch scripts/finetune.py configs/your_config.yml
 ```

 #### Multi-GPU
@@ -764,10 +726,6 @@ Try to turn off xformers.

 It's safe to ignore it.

-> NCCL Timeouts during training
-
-See the [NCCL](docs/nccl.md) guide.
-
 ## Need help? 🙋♂️

 Join our [Discord server](https://discord.gg/HhrNrHJPRb) where we can help you
--- a/deepspeed/zero2.json
+++ b/deepspeed/zero2.json
@@ -1,46 +0,0 @@
-{
-    "zero_optimization": {
-      "stage": 2,
-      "offload_optimizer": {
-        "device": "cpu"
-      },
-      "contiguous_gradients": true,
-      "overlap_comm": true
-    },
-    "bf16": {
-      "enabled": "auto"
-    },
-    "fp16": {
-      "enabled": "auto",
-      "auto_cast": false,
-      "loss_scale": 0,
-      "initial_scale_power": 32,
-      "loss_scale_window": 1000,
-      "hysteresis": 2,
-      "min_loss_scale": 1
-    },
-    "optimizer": {
-      "type": "AdamW",
-      "params": {
-        "lr": "auto",
-        "betas": [
-          0.9,
-          0.999
-        ],
-        "eps": 1e-8,
-        "weight_decay": "auto"
-      }
-    },
-    "scheduler": {
-      "type": "WarmupDecayLR",
-      "params": {
-        "warmup_min_lr": "auto",
-        "warmup_max_lr": "auto",
-        "warmup_num_steps": "auto",
-        "total_num_steps": "auto"
-      }
-    },
-    "train_batch_size": "auto",
-    "train_micro_batch_size_per_gpu": "auto",
-    "wall_clock_breakdown": false
-}
--- a/deepspeed/zero3.json
+++ b/deepspeed/zero3.json
@@ -35,7 +35,10 @@
    "type": "AdamW",
    "params": {
      "lr": "auto",
-      "betas": "auto",
+      "betas": [
+        0.9,
+        0.95
+      ],
      "eps": 1e-8,
      "weight_decay": "auto"
    }
--- a/docker-compose.yaml
+++ b/docker-compose.yaml
@@ -9,11 +9,6 @@ services:
      - ~/.cache/huggingface/:/root/.cache/huggingface/
    # set environment variables
    environment:
-      # Set environment variables
-      - GIT_AUTHOR_NAME=${GIT_AUTHOR_NAME}
-      - GIT_AUTHOR_EMAIL=${GIT_AUTHOR_EMAIL}
-      - GIT_COMMITTER_NAME=${GIT_COMMITTER_NAME}
-      - GIT_COMMITTER_EMAIL=${GIT_COMMITTER_EMAIL}
      - WANDB_API_KEY=${WANDB_API_KEY}
    deploy:
      resources:
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
@@ -11,6 +11,7 @@ RUN apt-get update && \

 WORKDIR /workspace

+RUN pip3 install --force-reinstall "peft @ git+https://github.com/huggingface/peft.git@main"
 RUN git clone --depth=1 https://github.com/OpenAccess-AI-Collective/axolotl.git
 # If AXOLOTL_EXTRAS is set, append it in brackets
 RUN cd axolotl && \
--- a/docs/nccl.md
+++ b/docs/nccl.md
@@ -1,46 +0,0 @@
-# NCCL
-
-NVIDIA NCCL is a library to facilitate and optimize multi-GPU communication operations, such as broadcast, all-gather, reduce, all-reduce, etc. Broadly, NCCL configuration is highly environment-specific and is configured via several [environment variables](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html). A common NCCL-related problem occurs when a long-running operation times out causing the training process to abort:
-
-```text
-Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1806948 milliseconds before timing out.
-```
-
-Often, this timeout will happen after 30 minutes (the default setting) and is accompanied by below-average power consumption with near 100% GPU utilization before the error is raised. Nvidia recommends [disabling PCI access control services (ACS)](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#pci-access-control-services-acs) as a possible solution if this is available to you.
-
-Forcing cross-GPU communication via [NVLink](https://en.wikipedia.org/wiki/NVLink) may help without increasing timeouts. To verify that your configuration is leveraging NVLink run the following command:
-
-```shell
-nvidia-smi nvlink --status
-```
-
-To force NCCL to use NVLink, simply set this in the environment:
-
-```shell
-export NCCL_P2P_LEVEL=NVL
-```
-
-If NVLink is not available in your environment there are other options for ``NCCL_P2P_LEVEL`` in the table below:
-
-| NCCL_P2P_LEVEL | Description |
-| -------------- | ----------- |
-| PIX | P2P data transfers through no more than a single PCIe bridge. Faster data transfer rates vs to paths involving multiple bridges, but slower compared to direct GPU-to-GPU communication. |
-| PXB | P2P data transfers through multiple PCIe bridges but not going through the PCIe Host Bridge; this path involves a complex routing process, potentially incurring a moderate level of latency. |
-| PHB | P2P data transfers occur over the PCIe and through a PCIe Host Bridge, typically involving the CPU, which can facilitate direct memory access but might introduce additional latency compared to more direct paths (ex PIX, NVL) |
-
-To validate that acceptable data transfer speeds exist for your training job, running [NCCL Tests](https://github.com/NVIDIA/nccl-tests/blob/master/README.md) can help pinpoint bottlenecks, for example:
-
-```shell
-./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
-```
-
-It can be useful when debugging NCCL communication timeouts to activate additional logging in both PyTorch and NCCL:
-
-```shell
-export NCCL_DEBUG=INFO
-export NCCL_DEBUG_SUBSYS=ALL
-export TORCH_DISTRIBUTED_DEBUG=INFO
-export TORCHELASTIC_ERROR_FILE=/PATH/TO/torcherror.log
-```
-
-Finally, if you believe your training job needs more time you can increase the timeout past 30 minutes by setting the ``ddp_timeout`` value in the Axolotl configuration. See [PyTorch init_process_group](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group) for documentation on this value.
--- a/examples/code-llama/13b/lora.yml
+++ b/examples/code-llama/13b/lora.yml
@@ -1,68 +0,0 @@
-base_model: codellama/CodeLlama-13b-hf
-base_model_config: codellama/CodeLlama-13b-hf
-model_type: LlamaForCausalLM
-tokenizer_type: CodeLlamaTokenizer
-is_llama_derived_model: true
-
-load_in_8bit: true
-load_in_4bit: false
-strict: false
-
-datasets:
-  - path: mhenrichsen/alpaca_2k_test
-    type: alpaca
-dataset_prepared_path: last_run_prepared
-val_set_size: 0.01
-output_dir: ./lora-out
-
-sequence_len: 100000
-sample_packing: true
-pad_to_sequence_len: true
-
-adapter: lora
-lora_model_dir:
-lora_r: 32
-lora_alpha: 16
-lora_dropout: 0.05
-lora_target_linear: true
-lora_fan_in_fan_out:
-
-wandb_project:
-wandb_entity:
-wandb_watch:
-wandb_run_id:
-wandb_log_model:
-
-gradient_accumulation_steps: 4
-micro_batch_size: 2
-num_epochs: 3
-optimizer: adamw_bnb_8bit
-lr_scheduler: cosine
-learning_rate: 0.0002
-
-train_on_inputs: false
-group_by_length: false
-bf16: true
-fp16: false
-tf32: false
-
-gradient_checkpointing: true
-early_stopping_patience:
-resume_from_checkpoint:
-local_rank:
-logging_steps: 1
-xformers_attention:
-flash_attention: true
-
-warmup_steps: 10
-eval_steps: 20
-save_steps:
-debug:
-deepspeed:
-weight_decay: 0.0
-fsdp:
-fsdp_config:
-special_tokens:
-  bos_token: "<s>"
-  eos_token: "</s>"
-  unk_token: "<unk>"
--- a/examples/code-llama/13b/qlora.yml
+++ b/examples/code-llama/13b/qlora.yml
@@ -1,70 +0,0 @@
-base_model: codellama/CodeLlama-13b-hf
-base_model_config: codellama/CodeLlama-13b-hf
-model_type: LlamaForCausalLM
-tokenizer_type: CodeLlamaTokenizer
-is_llama_derived_model: true
-
-load_in_8bit: false
-load_in_4bit: true
-strict: false
-
-datasets:
-  - path: mhenrichsen/alpaca_2k_test
-    type: alpaca
-dataset_prepared_path: last_run_prepared
-val_set_size: 0.01
-output_dir: ./qlora-out
-
-adapter: qlora
-lora_model_dir:
-
-sequence_len: 100000
-sample_packing: true
-pad_to_sequence_len: true
-
-lora_r: 32
-lora_alpha: 16
-lora_dropout: 0.05
-lora_target_modules:
-lora_target_linear: true
-lora_fan_in_fan_out:
-
-wandb_project:
-wandb_entity:
-wandb_watch:
-wandb_run_id:
-wandb_log_model:
-
-gradient_accumulation_steps: 4
-micro_batch_size: 2
-num_epochs: 3
-optimizer: paged_adamw_32bit
-lr_scheduler: cosine
-learning_rate: 0.0002
-
-train_on_inputs: false
-group_by_length: false
-bf16: true
-fp16: false
-tf32: false
-
-gradient_checkpointing: true
-early_stopping_patience:
-resume_from_checkpoint:
-local_rank:
-logging_steps: 1
-xformers_attention:
-flash_attention: true
-
-warmup_steps: 10
-eval_steps: 20
-save_steps:
-debug:
-deepspeed:
-weight_decay: 0.0
-fsdp:
-fsdp_config:
-special_tokens:
-  bos_token: "<s>"
-  eos_token: "</s>"
-  unk_token: "<unk>"
--- a/examples/code-llama/34b/lora.yml
+++ b/examples/code-llama/34b/lora.yml
@@ -1,68 +0,0 @@
-base_model: codellama/CodeLlama-34b-hf
-base_model_config: codellama/CodeLlama-34b-hf
-model_type: LlamaForCausalLM
-tokenizer_type: CodeLlamaTokenizer
-is_llama_derived_model: true
-
-load_in_8bit: true
-load_in_4bit: false
-strict: false
-
-datasets:
-  - path: mhenrichsen/alpaca_2k_test
-    type: alpaca
-dataset_prepared_path: last_run_prepared
-val_set_size: 0.01
-output_dir: ./lora-out
-
-sequence_len: 100000
-sample_packing: true
-pad_to_sequence_len: true
-
-adapter: lora
-lora_model_dir:
-lora_r: 32
-lora_alpha: 16
-lora_dropout: 0.05
-lora_target_linear: true
-lora_fan_in_fan_out:
-
-wandb_project:
-wandb_entity:
-wandb_watch:
-wandb_run_id:
-wandb_log_model:
-
-gradient_accumulation_steps: 4
-micro_batch_size: 2
-num_epochs: 3
-optimizer: adamw_bnb_8bit
-lr_scheduler: cosine
-learning_rate: 0.0002
-
-train_on_inputs: false
-group_by_length: false
-bf16: true
-fp16: false
-tf32: false
-
-gradient_checkpointing: true
-early_stopping_patience:
-resume_from_checkpoint:
-local_rank:
-logging_steps: 1
-xformers_attention:
-flash_attention: true
-
-warmup_steps: 10
-eval_steps: 20
-save_steps:
-debug:
-deepspeed:
-weight_decay: 0.0
-fsdp:
-fsdp_config:
-special_tokens:
-  bos_token: "<s>"
-  eos_token: "</s>"
-  unk_token: "<unk>"
--- a/examples/code-llama/34b/qlora.yml
+++ b/examples/code-llama/34b/qlora.yml
@@ -1,70 +0,0 @@
-base_model: codellama/CodeLlama-34b-hf
-base_model_config: codellama/CodeLlama-34b-hf
-model_type: LlamaForCausalLM
-tokenizer_type: CodeLlamaTokenizer
-is_llama_derived_model: true
-
-load_in_8bit: false
-load_in_4bit: true
-strict: false
-
-datasets:
-  - path: mhenrichsen/alpaca_2k_test
-    type: alpaca
-dataset_prepared_path: last_run_prepared
-val_set_size: 0.01
-output_dir: ./qlora-out
-
-adapter: qlora
-lora_model_dir:
-
-sequence_len: 100000
-sample_packing: true
-pad_to_sequence_len: true
-
-lora_r: 32
-lora_alpha: 16
-lora_dropout: 0.05
-lora_target_modules:
-lora_target_linear: true
-lora_fan_in_fan_out:
-
-wandb_project:
-wandb_entity:
-wandb_watch:
-wandb_run_id:
-wandb_log_model:
-
-gradient_accumulation_steps: 4
-micro_batch_size: 2
-num_epochs: 3
-optimizer: paged_adamw_32bit
-lr_scheduler: cosine
-learning_rate: 0.0002
-
-train_on_inputs: false
-group_by_length: false
-bf16: true
-fp16: false
-tf32: false
-
-gradient_checkpointing: true
-early_stopping_patience:
-resume_from_checkpoint:
-local_rank:
-logging_steps: 1
-xformers_attention:
-flash_attention: true
-
-warmup_steps: 10
-eval_steps: 20
-save_steps:
-debug:
-deepspeed:
-weight_decay: 0.0
-fsdp:
-fsdp_config:
-special_tokens:
-  bos_token: "<s>"
-  eos_token: "</s>"
-  unk_token: "<unk>"
--- a/examples/code-llama/7b/lora.yml
+++ b/examples/code-llama/7b/lora.yml
@@ -1,68 +0,0 @@
-base_model: codellama/CodeLlama-7b-hf
-base_model_config: codellama/CodeLlama-7b-hf
-model_type: LlamaForCausalLM
-tokenizer_type: CodeLlamaTokenizer
-is_llama_derived_model: true
-
-load_in_8bit: true
-load_in_4bit: false
-strict: false
-
-datasets:
-  - path: mhenrichsen/alpaca_2k_test
-    type: alpaca
-dataset_prepared_path: last_run_prepared
-val_set_size: 0.01
-output_dir: ./lora-out
-
-sequence_len: 100000
-sample_packing: true
-pad_to_sequence_len: true
-
-adapter: lora
-lora_model_dir:
-lora_r: 32
-lora_alpha: 16
-lora_dropout: 0.05
-lora_target_linear: true
-lora_fan_in_fan_out:
-
-wandb_project:
-wandb_entity:
-wandb_watch:
-wandb_run_id:
-wandb_log_model:
-
-gradient_accumulation_steps: 4
-micro_batch_size: 2
-num_epochs: 3
-optimizer: adamw_bnb_8bit
-lr_scheduler: cosine
-learning_rate: 0.0002
-
-train_on_inputs: false
-group_by_length: false
-bf16: true
-fp16: false
-tf32: false
-
-gradient_checkpointing: true
-early_stopping_patience:
-resume_from_checkpoint:
-local_rank:
-logging_steps: 1
-xformers_attention:
-flash_attention: true
-
-warmup_steps: 10
-eval_steps: 20
-save_steps:
-debug:
-deepspeed:
-weight_decay: 0.0
-fsdp:
-fsdp_config:
-special_tokens:
-  bos_token: "<s>"
-  eos_token: "</s>"
-  unk_token: "<unk>"
--- a/examples/code-llama/7b/qlora.yml
+++ b/examples/code-llama/7b/qlora.yml
@@ -1,70 +0,0 @@
-base_model: codellama/CodeLlama-7b-hf
-base_model_config: codellama/CodeLlama-7b-hf
-model_type: LlamaForCausalLM
-tokenizer_type: CodeLlamaTokenizer
-is_llama_derived_model: true
-
-load_in_8bit: false
-load_in_4bit: true
-strict: false
-
-datasets:
-  - path: mhenrichsen/alpaca_2k_test
-    type: alpaca
-dataset_prepared_path: last_run_prepared
-val_set_size: 0.01
-output_dir: ./qlora-out
-
-adapter: qlora
-lora_model_dir:
-
-sequence_len: 100000
-sample_packing: true
-pad_to_sequence_len: true
-
-lora_r: 32
-lora_alpha: 16
-lora_dropout: 0.05
-lora_target_modules:
-lora_target_linear: true
-lora_fan_in_fan_out:
-
-wandb_project:
-wandb_entity:
-wandb_watch:
-wandb_run_id:
-wandb_log_model:
-
-gradient_accumulation_steps: 4
-micro_batch_size: 2
-num_epochs: 3
-optimizer: paged_adamw_32bit
-lr_scheduler: cosine
-learning_rate: 0.0002
-
-train_on_inputs: false
-group_by_length: false
-bf16: true
-fp16: false
-tf32: false
-
-gradient_checkpointing: true
-early_stopping_patience:
-resume_from_checkpoint:
-local_rank:
-logging_steps: 1
-xformers_attention:
-flash_attention: true
-
-warmup_steps: 10
-eval_steps: 20
-save_steps:
-debug:
-deepspeed:
-weight_decay: 0.0
-fsdp:
-fsdp_config:
-special_tokens:
-  bos_token: "<s>"
-  eos_token: "</s>"
-  unk_token: "<unk>"
--- a/examples/code-llama/README.md
+++ b/examples/code-llama/README.md
@@ -1,22 +0,0 @@
-# Overview
-
-This is an example of CodeLLaMA configuration for 7b, 13b and 34b.
-
-The 7b variant fits on any 24GB VRAM GPU and will take up about 17 GB of VRAM during training if using qlora and 20 GB if using lora. On a RTX 4090 it trains 3 epochs of the default dataset in about 15 minutes.
-
-The 13b variant will fit if you change these settings to these values:
-gradient_accumulation_steps: 2
-micro_batch_size: 1
-
-The 34b variant does not fit on 24GB of VRAM - you will need something with +40 gb VRAM that also supports flash attention v2 - A6000 or A100 are good choices.
-
-```shell
-accelerate launch scripts/finetune.py examples/code-llama/[MODEL_SIZE]/qlora.yml
-
-```
-or
-
-```shell
-accelerate launch scripts/finetune.py examples/code-llama/[MODEL_SIZE]/lora.yml
-
-```
--- a/examples/gptq-lora-7b/README.md
+++ b/examples/gptq-lora-7b/README.md
@@ -0,0 +1,8 @@
+# LLaMa 7B using LoRA
+
+This is a good place to start for beginners. This will run on an NVIDIA RTX4090 with no other changes needed.
+
+```shell
+accelerate launch scripts/finetune.py examples/gptq-lora-7b/config.yml
+
+```
--- a/examples/gptq-lora-7b/config.yml
+++ b/examples/gptq-lora-7b/config.yml
@@ -0,0 +1,63 @@
+base_model: Neko-Institute-of-Science/LLaMA-7B-4bit-128g
+base_model_config: Neko-Institute-of-Science/LLaMA-7B-4bit-128g
+model_type: LlamaForCausalLM
+tokenizer_type: LlamaTokenizer
+trust_remote_code:
+load_in_8bit: true
+gptq: true
+datasets:
+  - path: vicgalle/alpaca-gpt4
+    type: alpaca
+dataset_prepared_path: last_run_prepared
+val_set_size: 0.02
+adapter:
+lora_model_dir:
+sequence_len: 2048
+max_packed_sequence_len:
+lora_r: 8
+lora_alpha: 16
+lora_dropout: 0.05
+lora_target_modules:
+  - q_proj
+  - v_proj
+lora_fan_in_fan_out: false
+wandb_project: llama-7b-lora-int4
+wandb_entity:
+wandb_watch:
+wandb_run_id:
+wandb_log_model:
+output_dir: ./llama-7b-lora-int4
+gradient_accumulation_steps: 1
+micro_batch_size: 1
+num_epochs: 3
+optimizer: adamw_bnb_8bit
+torchdistx_path:
+lr_scheduler: cosine
+learning_rate: 0.0000002
+train_on_inputs: false
+group_by_length: false
+fp16: true
+bf16: false
+tf32: true
+early_stopping_patience:
+resume_from_checkpoint:
+local_rank:
+logging_steps: 5
+xformers_attention:
+flash_attention:
+gradient_checkpointing: true
+gptq_groupsize: 128
+gptq_model_v1: false
+warmup_steps: 20
+eval_steps: 110
+save_steps: 660
+debug:
+deepspeed:
+weight_decay: 0.0001
+fsdp:
+fsdp_config:
+tokens:
+  pad_token: "[PAD]"
+  bos_token: "<s>"
+  eos_token: "</s>"
+  unk_token: "<unk>"
--- a/examples/llama-2/gptq-lora.yml
+++ b/examples/llama-2/gptq-lora.yml
@@ -1,76 +0,0 @@
-base_model: TheBloke/Llama-2-7B-GPTQ
-base_model_config: TheBloke/Llama-2-7B-GPTQ
-is_llama_derived_model: false
-gptq: true
-gptq_bits: 4
-model_type: AutoModelForCausalLM
-tokenizer_type: LlamaTokenizer
-tokenizer_use_fast: true
-tokenizer_legacy: true
-load_in_8bit: false
-load_in_4bit: false
-strict: false
-push_dataset_to_hub:
-hf_use_auth_token: true
-datasets:
-  - path: mhenrichsen/alpaca_2k_test
-    type: alpaca
-dataset_prepared_path: last_run_prepared
-val_set_size: 0.01
-adapter: lora
-lora_model_dir:
-sequence_len: 4096
-sample_packing:
-lora_r: 8
-lora_alpha: 32
-lora_dropout: 0.05
-lora_target_modules:
-  - k_proj
-  - o_proj
-  - q_proj
-  - v_proj
-lora_target_linear:
-lora_fan_in_fan_out:
-wandb_project:
-wandb_watch:
-wandb_run_id:
-wandb_log_model:
-output_dir: ./model-out
-gradient_accumulation_steps: 1
-micro_batch_size: 1
-num_epochs: 3
-optimizer: adamw_torch
-adam_beta2: 0.95
-adam_eps: 0.00001
-max_grad_norm: 1.0
-torchdistx_path:
-lr_scheduler: cosine
-lr_quadratic_warmup: true
-learning_rate: 0.000017
-train_on_inputs: false
-group_by_length: false
-bf16: false
-fp16: false
-float16: true
-tf32: true
-gradient_checkpointing: true
-early_stopping_patience:
-resume_from_checkpoint:
-local_rank:
-logging_steps: 1
-xformers_attention:
-flash_attention:
-sdp_attention:
-flash_optimum:
-gptq_groupsize:
-gptq_model_v1:
-warmup_steps: 100
-eval_steps:
-save_steps:
-debug:
-deepspeed:
-weight_decay: 0.1
-special_tokens:
-  bos_token: "<s>"
-  eos_token: "</s>"
-  unk_token: "<unk>"
--- a/examples/llama-2/lora.yml
+++ b/examples/llama-2/lora.yml
@@ -17,7 +17,6 @@ output_dir: ./lora-out

 sequence_len: 4096
 sample_packing: true
-pad_to_sequence_len: true

 adapter: lora
 lora_model_dir:
--- a/examples/llama-2/qlora.yml
+++ b/examples/llama-2/qlora.yml
@@ -20,7 +20,6 @@ lora_model_dir:

 sequence_len: 4096
 sample_packing: true
-pad_to_sequence_len: true

 lora_r: 32
 lora_alpha: 16
--- a/examples/llama-2/relora.yml
+++ b/examples/llama-2/relora.yml
@@ -1,74 +0,0 @@
-base_model: meta-llama/Llama-2-7b-hf
-base_model_config: meta-llama/Llama-2-7b-hf
-model_type: LlamaForCausalLM
-tokenizer_type: LlamaTokenizer
-is_llama_derived_model: true
-
-load_in_8bit: false
-load_in_4bit: true
-strict: false
-
-datasets:
-  - path: teknium/GPT4-LLM-Cleaned
-    type: alpaca
-dataset_prepared_path: last_run_prepared
-val_set_size: 0.01
-output_dir: ./relora-out
-
-adapter: qlora
-lora_model_dir:
-
-sequence_len: 4096
-sample_packing: true
-pad_to_sequence_len: true
-
-lora_r: 8
-lora_alpha: 16
-lora_dropout: 0.05
-lora_target_modules:
-lora_target_linear: true
-lora_fan_in_fan_out:
-
-relora_steps: 150
-relora_warmup_steps: 10
-relora_cpu_offload: false
-
-wandb_project:
-wandb_entity:
-wandb_watch:
-wandb_run_id:
-wandb_log_model:
-
-gradient_accumulation_steps: 4
-micro_batch_size: 4
-num_epochs: 3
-optimizer: adamw_bnb_8bit
-lr_scheduler: cosine
-learning_rate: 0.0002
-
-train_on_inputs: false
-group_by_length: false
-bf16: true
-fp16: false
-tf32: false
-
-gradient_checkpointing: true
-early_stopping_patience:
-resume_from_checkpoint:
-local_rank:
-logging_steps: 1
-xformers_attention:
-flash_attention: true
-
-warmup_steps: 10
-eval_steps: 20
-save_steps: 50
-debug:
-deepspeed:
-weight_decay: 0.0
-fsdp:
-fsdp_config:
-special_tokens:
-  bos_token: "<s>"
-  eos_token: "</s>"
-  unk_token: "<unk>"
--- a/examples/pythia-12b/config.yml
+++ b/examples/pythia-12b/config.yml
@@ -47,3 +47,4 @@ local_rank:
 gradient_checkpointing: true
 fsdp:
 fsdp_config:
+collator_pad_to_longest: true
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,18 +1,12 @@
--extra-index-url https://download.pytorch.org/whl/cu118
--extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
-torch==2.0.1
-auto-gptq
-packaging
 peft @ git+https://github.com/huggingface/peft.git
 transformers @ git+https://github.com/huggingface/transformers.git
 bitsandbytes>=0.41.1
-accelerate @ git+https://github.com/huggingface/accelerate
+accelerate @ git+https://github.com/huggingface/accelerate@2a289f6108e77a77a4efffb3f6316bc98538413b
 addict
-evaluate
 fire
-PyYAML>=6.0
+PyYAML==6.0
 datasets
-flash-attn>=2.2.1
+flash-attn==2.0.8
 sentencepiece
 wandb
 einops
@@ -21,7 +15,7 @@ optimum
 hf_transfer
 colorama
 numba
-numpy>=1.24.4
+numpy==1.24.4
 # qlora things
 bert-score==0.3.13
 evaluate==0.4.0
@@ -29,4 +23,3 @@ rouge-score==0.1.2
 scipy
 scikit-learn==1.2.2
 pynvml
-art
--- a/scripts/finetune.py
+++ b/scripts/finetune.py
@@ -4,28 +4,27 @@ import importlib
 import logging
 import os
 import random
+import signal
 import sys
 from pathlib import Path
 from typing import Any, Dict, List, Optional, Union

 import fire
 import torch
-import transformers
 import yaml

 # add src to the pythonpath so we don't need to pip install this
-from art import text2art
+from optimum.bettertransformer import BetterTransformer
 from transformers import GenerationConfig, TextStreamer

-from axolotl.common.cli import TrainerCliArgs, load_model_and_tokenizer
 from axolotl.logging_config import configure_logging
-from axolotl.train import TrainDatasetMeta, train
 from axolotl.utils.config import normalize_config, validate_config
 from axolotl.utils.data import prepare_dataset
 from axolotl.utils.dict import DictDefault
 from axolotl.utils.distributed import is_main_process
-from axolotl.utils.models import load_tokenizer
+from axolotl.utils.models import load_model, load_tokenizer
 from axolotl.utils.tokenization import check_dataset_labels
+from axolotl.utils.trainer import setup_trainer
 from axolotl.utils.wandb import setup_wandb_env_vars

 project_root = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
@@ -38,12 +37,15 @@ LOG = logging.getLogger("axolotl.scripts")
 os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"


-def print_axolotl_text_art(suffix=None):
-    font = "nancyj"
-    ascii_text = "  axolotl"
-    if suffix:
-        ascii_text += f"  x  {suffix}"
-    ascii_art = text2art(" axolotl", font=font)
+def print_axolotl_text_art():
+    ascii_art = """
+                           dP            dP   dP
+                           88            88   88
+.d8888b. dP.  .dP .d8888b. 88 .d8888b. d8888P 88
+88'  `88  `8bd8'  88'  `88 88 88'  `88   88   88
+88.  .88  .d88b.  88.  .88 88 88.  .88   88   88
+`88888P8 dP'  `dP `88888P' dP `88888P'   dP   dP
+"""

    if is_main_process():
        print(ascii_art)
@@ -58,45 +60,7 @@ def get_multi_line_input() -> Optional[str]:
    return instruction


-def do_merge_lora(
-    *,
-    cfg: DictDefault,
-    cli_args: TrainerCliArgs,
-):
-    model, tokenizer = load_model_and_tokenizer(cfg=cfg, cli_args=cli_args)
-    safe_serialization = cfg.save_safetensors is True
-
-    LOG.info("running merge of LoRA with base model")
-    model = model.merge_and_unload()
-    model.to(dtype=torch.float16)
-
-    if cfg.local_rank == 0:
-        LOG.info("saving merged model")
-        model.save_pretrained(
-            str(Path(cfg.output_dir) / "merged"),
-            safe_serialization=safe_serialization,
-        )
-        tokenizer.save_pretrained(str(Path(cfg.output_dir) / "merged"))
-
-
-def shard(
-    *,
-    cfg: DictDefault,
-    cli_args: TrainerCliArgs,
-):
-    model, _ = load_model_and_tokenizer(cfg=cfg, cli_args=cli_args)
-    safe_serialization = cfg.save_safetensors is True
-    LOG.debug("Re-saving model w/ sharding")
-    model.save_pretrained(cfg.output_dir, safe_serialization=safe_serialization)
-
-
-def do_inference(
-    *,
-    cfg: DictDefault,
-    cli_args: TrainerCliArgs,
-):
-    model, tokenizer = load_model_and_tokenizer(cfg=cfg, cli_args=cli_args)
-    prompter = cli_args.prompter
+def do_inference(cfg, model, tokenizer, prompter: Optional[str]):
    default_tokens = {"unk_token": "<unk>", "bos_token": "<s>", "eos_token": "</s>"}

    for token, symbol in default_tokens.items():
@@ -118,8 +82,6 @@ def do_inference(
            max_seq_len=255, mem_freq=50, top_k=5, max_cache_size=None
        )

-    model = model.to(cfg.device)
-
    while True:
        print("=" * 80)
        # support for multiline inputs
@@ -171,10 +133,6 @@ def choose_config(path: Path):
            "No YAML config files found in the specified directory. Are you using a .yml extension?"
        )

-    if len(yaml_files) == 1:
-        print(f"Using default YAML file '{yaml_files[0]}'")
-        return yaml_files[0]
-
    print("Choose a YAML file:")
    for idx, file in enumerate(yaml_files):
        print(f"{idx + 1}. {file}")
@@ -197,7 +155,29 @@ def check_not_in(list1: List[str], list2: Union[Dict[str, Any], List[str]]) -> b
    return not any(el in list2 for el in list1)


-def load_cfg(config: Path = Path("examples/"), **kwargs):
+def merge_lora(model, tokenizer, cfg):
+    LOG.info("running merge of LoRA with base model")
+    model = model.merge_and_unload()
+    model_dtype = torch.bfloat16 if cfg.bf16 or cfg.bfloat16 else torch.float16
+    model.to(dtype=model_dtype)
+    if cfg.hub_model_id:
+        model.push_to_hub(cfg.hub_model_id)
+
+    if cfg.local_rank == 0:
+        LOG.info("saving merged model")
+        model.save_pretrained(
+            str(Path(cfg.output_dir) / "merged"),
+            safe_serialization=cfg.save_safetensors is True,
+        )
+        tokenizer.save_pretrained(str(Path(cfg.output_dir) / "merged"))
+
+
+def train(
+    config: Path = Path("configs/"),
+    prepare_ds_only: bool = False,
+    **kwargs,
+):
+    print_axolotl_text_art()
    if Path(config).is_dir():
        config = choose_config(config)

@@ -221,58 +201,125 @@ def load_cfg(config: Path = Path("examples/"), **kwargs):
    normalize_config(cfg)

    setup_wandb_env_vars(cfg)
-    return cfg

-
-def load_datasets(
-    *,
-    cfg: DictDefault,
-    cli_args: TrainerCliArgs,
-) -> TrainDatasetMeta:
+    # load the tokenizer first
+    LOG.info(f"loading tokenizer... {cfg.tokenizer_config or cfg.base_model_config}")
    tokenizer = load_tokenizer(cfg)

-    train_dataset, eval_dataset, total_num_steps = prepare_dataset(cfg, tokenizer)
+    if (
+        check_not_in(["shard", "merge_lora"], kwargs) and not cfg.inference
+    ):  # don't need to load dataset for these
+        train_dataset, eval_dataset, total_num_steps = prepare_dataset(cfg, tokenizer)

-    if cli_args.debug or cfg.debug:
+    if cfg.debug or "debug" in kwargs:
        LOG.info("check_dataset_labels...")
        check_dataset_labels(
            train_dataset.select(
-                [
-                    random.randrange(0, len(train_dataset) - 1)  # nosec
-                    for _ in range(cli_args.debug_num_examples)
-                ]
+                [random.randrange(0, len(train_dataset) - 1) for _ in range(5)]  # nosec
            ),
            tokenizer,
-            num_examples=cli_args.debug_num_examples,
-            text_only=cli_args.debug_text_only,
        )

-    return TrainDatasetMeta(
-        train_dataset=train_dataset,
-        eval_dataset=eval_dataset,
-        total_num_steps=total_num_steps,
+    if prepare_ds_only:
+        LOG.info("Finished preparing dataset. Exiting...")
+        return
+
+    # Load the model and tokenizer
+    LOG.info("loading model and (optionally) peft_config...")
+    model, peft_config = load_model(cfg, tokenizer)
+
+    safe_serialization = cfg.save_safetensors is True
+
+    if "merge_lora" in kwargs and cfg.adapter is not None:
+        merge_lora(model, tokenizer, cfg)
+        return
+
+    if cfg.inference:
+        LOG.info("calling do_inference function")
+        prompter: Optional[str] = "AlpacaPrompter"
+        if "prompter" in kwargs:
+            if kwargs["prompter"] == "None":
+                prompter = None
+            else:
+                prompter = kwargs["prompter"]
+        do_inference(cfg, model, tokenizer, prompter=prompter)
+        return
+
+    if "shard" in kwargs:
+        model.save_pretrained(cfg.output_dir, safe_serialization=safe_serialization)
+        return
+
+    trainer = setup_trainer(
+        cfg, train_dataset, eval_dataset, model, tokenizer, total_num_steps
    )

+    model.config.use_cache = False

-def do_cli(config: Path = Path("examples/"), **kwargs):
-    print_axolotl_text_art()
-    parsed_cfg = load_cfg(config, **kwargs)
-    parser = transformers.HfArgumentParser((TrainerCliArgs))
-    parsed_cli_args, _ = parser.parse_args_into_dataclasses(
-        return_remaining_strings=True
-    )
-    if parsed_cli_args.inference:
-        do_inference(cfg=parsed_cfg, cli_args=parsed_cli_args)
-    elif parsed_cli_args.merge_lora:
-        do_merge_lora(cfg=parsed_cfg, cli_args=parsed_cli_args)
-    elif parsed_cli_args.shard:
-        shard(cfg=parsed_cfg, cli_args=parsed_cli_args)
+    if torch.__version__ >= "2" and sys.platform != "win32":
+        LOG.info("Compiling torch model")
+        model = torch.compile(model)
+
+    # go ahead and presave, so we have the adapter config available to inspect
+    if peft_config:
+        LOG.info(f"Pre-saving adapter config to {cfg.output_dir}")
+        peft_config.save_pretrained(cfg.output_dir)
+
+    # In case we want to stop early with ctrl+c, this is a nice to have to save the pretrained model
+    if cfg.local_rank == 0:
+
+        def terminate_handler(_, __, model):
+            if cfg.flash_optimum:
+                model = BetterTransformer.reverse(model)
+            model.save_pretrained(cfg.output_dir, safe_serialization=safe_serialization)
+            sys.exit(0)
+
+        signal.signal(
+            signal.SIGINT, lambda signum, frame: terminate_handler(signum, frame, model)
+        )
+
+    LOG.info("Starting trainer...")
+    if cfg.group_by_length:
+        LOG.info("hang tight... sorting dataset for group_by_length")
+    resume_from_checkpoint = cfg.resume_from_checkpoint
+    if cfg.resume_from_checkpoint is None and cfg.auto_resume_from_checkpoints:
+        possible_checkpoints = [
+            str(cp) for cp in Path(cfg.output_dir).glob("checkpoint-*")
+        ]
+        if len(possible_checkpoints) > 0:
+            sorted_paths = sorted(
+                possible_checkpoints,
+                key=lambda path: int(path.split("-")[-1]),
+            )
+            resume_from_checkpoint = sorted_paths[-1]
+            LOG.info(
+                f"Using Auto-resume functionality to start with checkpoint at {resume_from_checkpoint}"
+            )
+
+    if not Path(cfg.output_dir).is_dir():
+        os.makedirs(cfg.output_dir, exist_ok=True)
+    tokenizer.save_pretrained(cfg.output_dir)
+    if cfg.flash_optimum:
+        with torch.backends.cuda.sdp_kernel(
+            enable_flash=True, enable_math=True, enable_mem_efficient=True
+        ):
+            trainer.train(resume_from_checkpoint=resume_from_checkpoint)
    else:
-        dataset_meta = load_datasets(cfg=parsed_cfg, cli_args=parsed_cli_args)
-        if parsed_cli_args.prepare_ds_only:
-            return
-        train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
+        trainer.train(resume_from_checkpoint=resume_from_checkpoint)
+
+    LOG.info(f"Training Completed!!! Saving pre-trained model to {cfg.output_dir}")
+
+    # TODO do we need this fix? https://huggingface.co/docs/accelerate/usage_guides/fsdp#saving-and-loading
+    # only save on rank 0, otherwise it corrupts output on multi-GPU when multiple processes attempt to write the same file
+    if cfg.fsdp:
+        trainer.save_model(cfg.output_dir)
+    elif cfg.local_rank == 0:
+        if cfg.flash_optimum:
+            model = BetterTransformer.reverse(model)
+        model.save_pretrained(cfg.output_dir, safe_serialization=safe_serialization)
+
+    if cfg.adapter is not None:
+        merge_lora(model, tokenizer, cfg)


 if __name__ == "__main__":
-    fire.Fire(do_cli)
+    fire.Fire(train)
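
Review note: the auto-resume branch added to `train` above picks the checkpoint with the highest step number under `cfg.output_dir`. The sketch below isolates that step so it can be checked on its own. It assumes checkpoint directories are named `checkpoint-<step>`; `latest_checkpoint` is a hypothetical helper name used for illustration and is not part of axolotl.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint(output_dir: str) -> Optional[str]:
    """Return the checkpoint-<step> path with the highest step, or None.

    Hypothetical helper mirroring the auto-resume logic in the diff.
    """
    candidates = [str(cp) for cp in Path(output_dir).glob("checkpoint-*")]
    if not candidates:
        return None
    # Compare on the integer step suffix. A plain string sort would rank
    # checkpoint-1000 before checkpoint-200.
    return max(candidates, key=lambda path: int(path.split("-")[-1]))
```

As in the diff, a directory whose suffix is not an integer (for example `checkpoint-final`) would raise `ValueError`; a caller that expects such names would need to filter them first.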
--- a/setup.py
+++ b/setup.py
@@ -2,41 +2,38 @@

 from setuptools import find_packages, setup

-
-def parse_requirements():
-    _install_requires = []
-    _dependency_links = []
-    with open("./requirements.txt", encoding="utf-8") as requirements_file:
-        lines = [r.strip() for r in requirements_file.readlines()]
-        for line in lines:
-            if line.startswith("--extra-index-url"):
-                # Handle custom index URLs
-                _, url = line.split()
-                _dependency_links.append(url)
-            elif "flash-attn" not in line and line and line[0] != "#":
-                # Handle standard packages
-                _install_requires.append(line)
-    return _install_requires, _dependency_links
-
-
-install_requires, dependency_links = parse_requirements()
-
+install_requires = []
+with open("./requirements.txt", encoding="utf-8") as requirements_file:
+    # don't include peft yet until we check the int4
+    # need to manually install peft for now...
+    reqs = [r.strip() for r in requirements_file.readlines() if "peft" not in r]
+    reqs = [r for r in reqs if "flash-attn" not in r]
+    reqs = [r for r in reqs if r and r[0] != "#"]
+    for r in reqs:
+        install_requires.append(r)

 setup(
    name="axolotl",
-    version="0.3.0",
-    description="LLM Trainer",
-    long_description="Axolotl is a tool designed to streamline the fine-tuning of various AI models, offering support for multiple configurations and architectures.",
+    version="0.1",
+    description="You know you're going to axolotl questions",
    package_dir={"": "src"},
    packages=find_packages(),
    install_requires=install_requires,
-    dependency_links=dependency_links,
    extras_require={
+        "gptq": [
+            "alpaca_lora_4bit @ git+https://github.com/winglian/alpaca_lora_4bit.git@setup_pip",
+        ],
+        "gptq_triton": [
+            "alpaca_lora_4bit[triton] @ git+https://github.com/winglian/alpaca_lora_4bit.git@setup_pip",
+        ],
        "flash-attn": [
-            "flash-attn>=2.2.1",
+            "flash-attn==2.0.8",
        ],
        "extras": [
            "deepspeed",
        ],
+        "peft": [
+            "peft @ git+https://github.com/huggingface/peft.git",
+        ],
    },
 )
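
Review note: the new `setup.py` builds `install_requires` by filtering `requirements.txt` in three passes. A minimal single-pass sketch of the same filtering follows, assuming the input is an iterable of raw lines; `filter_requirements` and its `excluded` parameter are hypothetical names, not part of the repository.

```python
from typing import Iterable, List, Sequence


def filter_requirements(
    lines: Iterable[str],
    excluded: Sequence[str] = ("peft", "flash-attn"),
) -> List[str]:
    """Drop blanks, comments, and any line mentioning an excluded name."""
    kept = []
    for raw in lines:
        line = raw.strip()
        # Skip empty lines and comment lines.
        if not line or line.startswith("#"):
            continue
        # Substring match, as in the diff: any line containing "peft" or
        # "flash-attn" anywhere is dropped, not only exact package names.
        if any(name in line for name in excluded):
            continue
        kept.append(line)
    return kept
```

Like the added lines in the diff, this sketch does not special-case `--extra-index-url` entries, which the removed `parse_requirements` routed to `dependency_links`; such a line would pass through as if it were a requirement.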
--- a/src/axolotl/common/__init__.py
+++ b/src/axolotl/common/__init__.py
--- a/src/axolotl/common/cli.py
+++ b/src/axolotl/common/cli.py
@@ -1,43 +0,0 @@
-"""
-shared module for cli specific things
-"""
-
-import logging
-from dataclasses import dataclass, field
-from typing import Optional
-
-from axolotl.logging_config import configure_logging
-from axolotl.utils.dict import DictDefault
-from axolotl.utils.models import load_model, load_tokenizer
-
-configure_logging()
-LOG = logging.getLogger("axolotl.common.cli")
-
-
-@dataclass
-class TrainerCliArgs:
-    """
-    dataclass representing the various non-training arguments
-    """
-
-    debug: bool = field(default=False)
-    debug_text_only: bool = field(default=False)
-    debug_num_examples: int = field(default=5)
-    inference: bool = field(default=False)
-    merge_lora: bool = field(default=False)
-    prepare_ds_only: bool = field(default=False)
-    prompter: Optional[str] = field(default=None)
-    shard: bool = field(default=False)
-
-
-def load_model_and_tokenizer(
-    *,
-    cfg: DictDefault,
-    cli_args: TrainerCliArgs,
-):
-    LOG.info(f"loading tokenizer... {cfg.tokenizer_config or cfg.base_model_config}")
-    tokenizer = load_tokenizer(cfg)
-    LOG.info("loading model and (optionally) peft_config...")
-    model, _ = load_model(cfg, tokenizer, inference=cli_args.inference)
-
-    return model, tokenizer
--- a/src/axolotl/logging_config.py
+++ b/src/axolotl/logging_config.py
@@ -23,7 +23,6 @@ class ColorfulFormatter(Formatter):
    }

    def format(self, record):
-        record.rank = int(os.getenv("LOCAL_RANK", "0"))
        log_message = super().format(record)
        return self.COLORS.get(record.levelname, "") + log_message + Fore.RESET

@@ -36,7 +35,7 @@ DEFAULT_LOGGING_CONFIG: Dict[str, Any] = {
        },
        "colorful": {
            "()": ColorfulFormatter,
-            "format": "[%(asctime)s] [%(levelname)s] [%(name)s.%(funcName)s:%(lineno)d] [PID:%(process)d] [RANK:%(rank)d] %(message)s",
+            "format": "[%(asctime)s] [%(levelname)s] [%(name)s.%(funcName)s:%(lineno)d] [PID:%(process)d] %(message)s",
        },
    },
    "filters": {},
--- a/src/axolotl/monkeypatch/llama_attn_hijack_flash.py
+++ b/src/axolotl/monkeypatch/llama_attn_hijack_flash.py
@@ -2,9 +2,7 @@

 # copied from https://github.com/lm-sys/FastChat/blob/main/fastchat/train/llama_flash_attn_monkey_patch.py

-import logging
 import warnings
-from functools import partial
 from typing import List, Optional, Tuple, Union

 import torch
@@ -35,9 +33,6 @@ except ImportError:
    )


-LOG = logging.getLogger("axolotl")
-
-
 def replace_llama_attn_with_flash_attn(packed: Optional[bool] = False):
    transformers.models.llama.modeling_llama.LlamaModel._prepare_decoder_attention_mask = (  # pylint: disable=protected-access
        _prepare_decoder_attention_mask
@@ -49,34 +44,6 @@ def replace_llama_attn_with_flash_attn(packed: Optional[bool] = False):
            llama_model_forward
        )

-    try:
-        from flash_attn.losses.cross_entropy import CrossEntropyLoss
-
-        LOG.info("patching with flash_attn.losses.cross_entropy")
-        transformers.models.llama.modeling_llama.CrossEntropyLoss = partial(
-            CrossEntropyLoss, inplace_backward=True
-        )
-    except ImportError:
-        LOG.info(
-            "optimized flash-attention CrossEntropyLoss not found (run `pip install 'git+https://github.com/Dao-AILab/flash-attention.git#egg=xentropy_cuda_lib&subdirectory=csrc/xentropy'`)"
-        )
-
-    try:
-        from flash_attn.ops.rms_norm import RMSNorm
-
-        class LlamaRMSNorm(RMSNorm):
-            """Patched LLamaRMSNorm"""
-
-            def __init__(self, hidden_size, eps=1e-6):
-                super().__init__(hidden_size, eps=eps)
-
-        LOG.info("patching with flash_attn.ops.rms_norm")
-        transformers.models.llama.modeling_llama.LlamaRMSNorm = LlamaRMSNorm
-    except ImportError:
-        LOG.info(
-            "optimized flash-attention RMSNorm not found (run `pip install 'git+https://github.com/Dao-AILab/flash-attention.git#egg=dropout_layer_norm&subdirectory=csrc/layer_norm'`)"
-        )
-

 # Disable the transformation of the attention mask in LlamaModel as the flash attention
 # requires the attention mask to be the same as the key_padding_mask
--- a/src/axolotl/monkeypatch/relora.py
+++ b/src/axolotl/monkeypatch/relora.py
@@ -1,393 +0,0 @@
-"""Implements the ReLoRA training procedure from https://arxiv.org/abs/2307.05695, minus the initial full fine-tune."""
-import glob
-import json
-import logging
-import os.path
-import shutil
-from pathlib import Path
-from typing import Dict, List, Sequence
-
-import bitsandbytes as bnb
-import peft
-import safetensors.torch as st
-import torch
-from huggingface_hub import snapshot_download
-from torch.optim.lr_scheduler import LRScheduler
-from torch.optim.optimizer import Optimizer
-from transformers import (
-    TrainerCallback,
-    TrainerControl,
-    TrainerState,
-    TrainingArguments,
-)
-from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
-
-from axolotl.utils.dict import DictDefault
-from axolotl.utils.distributed import is_main_process
-
-LOG = logging.getLogger("axolotl.relora")
-
-
-def reset_optimizer(optimizer: torch.optim.Optimizer):
-    for group in optimizer.param_groups:
-        for param in group["params"]:
-            param_state = optimizer.state[param]
-            for key in param_state:
-                if "qmap" in key:
-                    continue
-
-                if key == "step" and isinstance(param_state[key], int):
-                    param_state[key] = 0
-                else:
-                    param_state[key] = torch.zeros_like(param_state[key])
-
-
-class ReLoRACallback(TrainerCallback):
-    """Callback to merge LoRA weights into the base model and save full-weight checkpoints"""
-
-    def __init__(self, cfg: DictDefault):
-        self.relora_steps = cfg.relora_steps
-        self.cpu_offload = cfg.relora_cpu_offload
-        self.quantized = cfg.load_in_4bit or cfg.load_in_8bit
-        self.last_full_model = cfg.base_model
-        self.resume_from_checkpoint = cfg.resume_from_checkpoint
-
-        if not os.path.exists(self.last_full_model):
-            self.last_full_model = str(Path(snapshot_download(cfg.base_model)))
-
-        assert os.path.exists(
-            self.last_full_model
-        ), "for ReLORA base_model must be a local path"
-
-        self.num_lora_restarts = 0
-        self.need_full_save = False
-
-    def on_train_begin(
-        self,
-        _args: TrainingArguments,
-        _state: TrainerState,
-        control: TrainerControl,
-        model: peft.LoraModel,
-        **_kwargs,
-    ):
-        if self.resume_from_checkpoint:
-            weight_path = os.path.join(self.resume_from_checkpoint, "relora")
-            if not os.path.exists(weight_path):
-                LOG.warning(
-                    "Resuming ReLoRA from checkpoint, but no full-weight save found"
-                )
-            else:
-                LOG.info(f"Loading adjusted base weights from {weight_path}")
-                load_weight_checkpoint(model, weight_path)
-        return control
-
-    def on_step_begin(
-        self,
-        args: TrainingArguments,
-        state: TrainerState,
-        control: TrainerControl,
-        model: peft.LoraModel,
-        optimizer: torch.optim.Optimizer,
-        **_kwargs,
-    ):
-        if state.global_step > 0 and state.global_step % self.relora_steps == 0:
-            checkpoint_folder = os.path.join(
-                args.output_dir,
-                f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}",
-                "relora",
-            )
-
-            with torch.no_grad():
-                merge_and_save(
-                    model,
-                    self.last_full_model,
-                    checkpoint_folder,
-                    reinit=True,
-                    quantized=self.quantized,
-                    actually_save=is_main_process(),
-                    cpu_offload=self.cpu_offload,
-                )
-                reset_optimizer(optimizer)
-
-            if self.quantized:
-                self.last_full_model = checkpoint_folder
-            self.num_lora_restarts += 1
-
-        return control
-
-    def on_save(
-        self,
-        args: TrainingArguments,
-        state: TrainerState,
-        control: TrainerControl,
-        model: peft.LoraModel,
-        **_kwargs,
-    ):
-        checkpoint_folder = os.path.join(
-            args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}", "relora"
-        )
-        if (
-            state.global_step >= self.relora_steps
-            and state.global_step % self.relora_steps != 0
-        ):
-            if self.quantized:
-                if is_main_process() and self.last_full_model != checkpoint_folder:
-                    # ensure the latest full parameter save is in the latest checkpoint
-                    # folder, so that automatic pruning of checkpoints does not remove it
-                    LOG.info(f"moving last full parameter save to {checkpoint_folder}")
-                    os.makedirs(checkpoint_folder, exist_ok=True)
-                    chunks = glob.glob(
-                        f"{self.last_full_model}/model*.safetensors"
-                    ) + glob.glob(f"{self.last_full_model}/model*.index.json")
-                    for path in chunks:
-                        new_path = os.path.abspath(shutil.move(path, checkpoint_folder))
-                        try:
-                            os.symlink(new_path, path)
-                        except OSError:
-                            # probably on windows without permission to symlink
-                            pass
-
-                    self.last_full_model = checkpoint_folder
-            else:
-                model.model.save_pretrained(checkpoint_folder, safe_serialization=True)
-
-        return control
-
-    def on_log(
-        self,
-        _args: TrainingArguments,
-        _state: TrainerState,
-        control: TrainerControl,
-        logs: Dict[str, float],
-        **_kwargs,
-    ):
-        logs["num_lora_restarts"] = self.num_lora_restarts
-        return control
-
-    def on_train_end(
-        self,
-        args: TrainingArguments,
-        _state: TrainerState,
-        control: TrainerControl,
-        model: peft.LoraModel,
-        **_kwargs,
-    ):
-        if self.quantized:
-            # perform final merge and save
-            with torch.no_grad():
-                merge_and_save(
-                    model,
-                    self.last_full_model,
-                    args.output_dir,
-                    reinit=False,
-                    quantized=self.quantized,
-                    actually_save=is_main_process(),
-                    cpu_offload=self.cpu_offload,
-                )
-        # no need to save if unquantized, as finetune.py will call merge_and_unload()
-        return control
-
-
-class ReLoRAScheduler(LRScheduler):
-    """Wraps another scheduler to apply per-lora-restart learning rate warmups."""
-
-    def __init__(
-        self,
-        optimizer: Optimizer,
-        inner_schedule: LRScheduler,
-        relora_steps: int,
-        warmup_steps: int,
-        min_lr_scale: float = 0.001,
-    ) -> None:
-        self.inner_schedule = inner_schedule
-        self.relora_steps = relora_steps
-        self.warmup_steps = warmup_steps
-        self.min_lr_scale = min_lr_scale
-        super().__init__(optimizer, inner_schedule.last_epoch, inner_schedule.verbose)
-
-    def get_lr(self) -> float:
-        self.inner_schedule.last_epoch = self.last_epoch
-
-        original = self.inner_schedule.get_lr()
-        step = self.last_epoch
-        if step < self.relora_steps:
-            scale = 1
-        else:
-            cycle_t = min(1.0, (step % self.relora_steps) / self.warmup_steps)
-            scale = cycle_t * (1 - self.min_lr_scale) + self.min_lr_scale
-
-        if isinstance(original, Sequence):
-            return [lr * scale for lr in original]
-        return original * scale
-
-
-def sharded_paths(path: str, module_names: List[str]) -> Dict[str, str]:
-    model_name = "model.safetensors"
-    if not os.path.exists(str(Path(path) / model_name)) and not os.path.exists(
-        str(Path(path) / f"{model_name}.index.json")
-    ):
-        model_name = "pytorch_model.bin"
-
-    index_path = str(Path(path) / f"{model_name}.index.json")
-    if os.path.exists(index_path):
-        with open(index_path, "r", encoding="utf-8") as file:
-            data = json.load(file)
-        return data["weight_map"]
-    return {(module_name + ".weight"): model_name for module_name in module_names}
-
-
-def lora_delta_weight(layer: peft.tuners.lora.LoraLayer, device) -> torch.Tensor:
-    if isinstance(layer, (peft.tuners.lora.Linear8bitLt, peft.tuners.lora.Linear4bit)):
-        adapter = layer.active_adapter
-        return (
-            peft.utils.transpose(
-                layer.lora_B[adapter].weight.detach().to(device)
-                @ layer.lora_A[adapter].weight.detach().to(device),
-                getattr(layer, "fan_in_fan_out", False),
-            )
-            * layer.scaling[adapter]
-        )
-
-    return layer.get_delta_weight().to(device)
-
-
-def find_lora_modules(model: peft.LoraModel) -> Dict[str, peft.tuners.lora.LoraLayer]:
-    modules: Dict[str, peft.tuners.lora.LoraLayer] = {}
-
-    key_list = [key for key, _ in model.model.named_modules() if "lora" not in key]
-    for key in key_list:
-        try:
-            # pylint: disable=protected-access
-            _parent, target, _target_name = peft.utils._get_submodules(model.model, key)
-        except AttributeError:
-            continue
-
-        if isinstance(target, peft.tuners.lora.LoraLayer):
-            modules[key] = target
-
-    return modules
-
-
-def update_weights(
-    target: peft.tuners.lora.LoraLayer, new_weight: torch.Tensor, reinit: bool, device
-):
-    if reinit:
-        for adapter_name in target.lora_A:
-            target.reset_lora_parameters(adapter_name)
-        for adapter_name in target.lora_embedding_A:
-            target.reset_lora_parameters(adapter_name)
-
-    if isinstance(target, peft.tuners.lora.Linear4bit):
-        # This could be faster, but the quantization of Linear4bit weights occurs
-        # when the module is moved from cpu to gpu. Without meddling *too* deeply in
-        # PEFT's innards or maintaining a duplicate of that codepath, this is good
-        # enough for now.
-        target.weight.quant_state = None
-        target.weight.data = new_weight.cpu()
-        target.to(device)
-    elif isinstance(target, peft.tuners.lora.Linear8bitLt):
-        target.weight = bnb.nn.Int8Params(new_weight, requires_grad=False).to(device)
-    else:
-        target.weight.data = new_weight.to(device)
-
-
-def merge_and_save(
-    model: peft.LoraModel,
-    model_src: str,
-    model_dst: str,
-    reinit: bool = False,
-    quantized: bool = False,
-    cpu_offload: bool = False,
-    actually_save: bool = True,
-):
-    modules = find_lora_modules(model)
-
-    if not quantized:
-        for module_name, target in modules.items():
-            update = target.get_delta_weight(target.active_adapter).detach()
-            target.weight.data += update
-
-            if reinit:
-                for adapter_name in target.lora_A:
-                    target.reset_lora_parameters(adapter_name)
-                for adapter_name in target.lora_embedding_A:
-                    target.reset_lora_parameters(adapter_name)
-        return
-
-    os.makedirs(model_dst, exist_ok=True)
-    shard_paths = sharded_paths(model_src, modules.keys())
-    out_shard_paths = {}
-
-    unique_shards = list(set(shard_paths.values()))
-    for shard_path in unique_shards:
-        out_tensors = {}
-        if shard_path.endswith(".safetensors"):
-            in_tensors = st.load_file(str(Path(model_src) / shard_path))
-        else:
-            in_tensors = torch.load(Path(model_src) / shard_path)
-            if "state_dict" in in_tensors:
-                in_tensors = in_tensors["state_dict"]
-
-        for module_name, target in modules.items():
-            key = module_name + ".weight"
-            if key not in shard_paths or shard_paths[key] != shard_path:
-                continue
-
-            orig_weight = in_tensors[key]
-            old_dev = target.weight.device
-            math_dev = "cpu" if cpu_offload else old_dev
-
-            delta_weight = lora_delta_weight(target, math_dev)
-            new_weight = orig_weight.to(math_dev) + delta_weight
-            del delta_weight
-
-            if actually_save:
-                out_tensors[key] = new_weight.half().cpu()
-
-            update_weights(target, new_weight, reinit=reinit, device=old_dev)
-
-        if actually_save:
-            out_shard_name = shard_path
-            if out_shard_name.startswith("pytorch_model"):
-                out_shard_name = (
-                    out_shard_name.replace("pytorch_model", "model").rstrip(".bin")
-                    + ".safetensors"
-                )
-
-            for module_name in in_tensors:
-                if module_name not in out_tensors:
-                    out_tensors[module_name] = in_tensors[module_name].half()
-                out_shard_paths[module_name] = out_shard_name
-
-            shard_fn = str(Path(model_dst) / out_shard_name)
-            LOG.info(f"saving tensors to {shard_fn}")
-            st.save_file(out_tensors, shard_fn, metadata={"format": "pt"})
-
-        del in_tensors
-        del out_tensors
-        torch.cuda.empty_cache()
-
-    if actually_save and len(unique_shards) > 1:
-        with open(
-            str(Path(model_dst, "model.safetensors.index.json")), "w", encoding="utf-8"
-        ) as file:
-            json.dump({"metadata": {}, "weight_map": out_shard_paths}, file)
-
-
-def load_weight_checkpoint(model: peft.LoraModel, checkpoint_path: str):
-    modules = find_lora_modules(model)
-    shard_paths = sharded_paths(checkpoint_path, modules.keys())
-    unique_shards = list(set(shard_paths.values()))
-
-    for shard_path in unique_shards:
-        tensors = st.load_file(os.path.join(checkpoint_path, shard_path))
-
-        for module_name, target in modules.items():
-            key = module_name + ".weight"
-            if key not in shard_paths or shard_paths[key] != shard_path:
-                continue
-
-            new_weight = tensors[key]
-            update_weights(
-                target, new_weight, reinit=False, device=target.weight.device
-            )
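
Review note: the removed `ReLoRAScheduler.get_lr` multiplies the inner schedule's learning rate by a per-restart warmup factor. The standalone sketch below reproduces only that factor so its shape is easy to see. `relora_lr_scale` is a hypothetical function name; the arithmetic follows the removed lines, and the sketch assumes `warmup_steps` is positive.

```python
def relora_lr_scale(
    step: int,
    relora_steps: int,
    warmup_steps: int,
    min_lr_scale: float = 0.001,
) -> float:
    """Scale factor applied to the inner schedule's learning rate."""
    # Before the first restart the inner schedule is used unchanged.
    if step < relora_steps:
        return 1.0
    # Position within the current restart cycle, clamped to [0, 1].
    cycle_t = min(1.0, (step % relora_steps) / warmup_steps)
    # Linear ramp from min_lr_scale up to 1.0 over warmup_steps.
    return cycle_t * (1 - min_lr_scale) + min_lr_scale
```

With `relora_steps=100` and `warmup_steps=10`, the factor drops to `min_lr_scale` at step 100, climbs linearly until step 110, and stays at 1.0 until the next restart at step 200.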
--- a/src/axolotl/prompt_tokenizers.py
+++ b/src/axolotl/prompt_tokenizers.py
@@ -13,7 +13,7 @@ from axolotl.prompters import IGNORE_TOKEN_ID
 LOG = logging.getLogger("axolotl")

 IGNORE_INDEX = -100
-LLAMA_DEFAULT_PAD_TOKEN = "<pad>"  # nosec
+LLAMA_DEFAULT_PAD_TOKEN = "[PAD]"  # nosec
 LLAMA_DEFAULT_EOS_TOKEN = "</s>"  # nosec
 LLAMA_DEFAULT_BOS_TOKEN = "<s>"  # nosec
 LLAMA_DEFAULT_UNK_TOKEN = "<unk>"  # nosec
--- a/src/axolotl/prompters.py
+++ b/src/axolotl/prompters.py
@@ -309,6 +309,10 @@ class ShareGPTPrompter:  # pylint: disable=too-few-public-methods
        )

    def build_prompt(self, source) -> Generator[str, None, None]:
+        # ignore the system prompt if provided
+        if source[0]["from"] == "system":
+            source.pop(0)
+
        if len(source) < 2:
            # If there isn't a back and forth conversation, ignore it
            # also happens on the data splitting leaving empty conversations
@@ -317,12 +321,6 @@ class ShareGPTPrompter:  # pylint: disable=too-few-public-methods
            )

        conv = self._conversation.copy()
-
-        # Add the conversation system prompt if provided, otherwise use the default one
-        if source[0]["from"] == "system":
-            conv.system = source[0]["value"]
-            source.pop(0)
-
        roles = {"human": conv.roles[0], "gpt": conv.roles[1]}

        try:
--- a/src/axolotl/train.py
+++ b/src/axolotl/train.py
@@ -1,141 +0,0 @@
-"""Prepare and train a model on a dataset. Can also infer from a model or merge lora"""
-
-import logging
-import os
-import signal
-import sys
-from dataclasses import dataclass
-from pathlib import Path
-from typing import Optional
-
-import torch
-
-# add src to the pythonpath so we don't need to pip install this
-from datasets import Dataset
-from optimum.bettertransformer import BetterTransformer
-
-from axolotl.common.cli import TrainerCliArgs
-from axolotl.logging_config import configure_logging
-from axolotl.utils.dict import DictDefault
-from axolotl.utils.models import load_model, load_tokenizer
-from axolotl.utils.trainer import setup_trainer
-
-project_root = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
-src_dir = os.path.join(project_root, "src")
-sys.path.insert(0, src_dir)
-
-configure_logging()
-LOG = logging.getLogger("axolotl.train")
-
-
-@dataclass
-class TrainDatasetMeta:
-    """
-    dataclass to capture the dataset specific options for training
-    """
-
-    train_dataset: Dataset
-    eval_dataset: Optional[Dataset] = None
-    total_num_steps: Optional[int] = None
-
-
-def train(
-    *,
-    cfg: DictDefault,
-    cli_args: TrainerCliArgs,
-    dataset_meta: TrainDatasetMeta,
-):
-    # load the tokenizer first
-    LOG.info(f"loading tokenizer... {cfg.tokenizer_config or cfg.base_model_config}")
-    tokenizer = load_tokenizer(cfg)
-
-    train_dataset = dataset_meta.train_dataset
-    eval_dataset = dataset_meta.eval_dataset
-    total_num_steps = dataset_meta.total_num_steps
-
-    # Load the model and tokenizer
-    LOG.info("loading model and (optionally) peft_config...")
-    model, peft_config = load_model(cfg, tokenizer, inference=cli_args.inference)
-
-    safe_serialization = cfg.save_safetensors is True
-
-    if cfg.resume_from_checkpoint is None and cfg.auto_resume_from_checkpoints:
-        possible_checkpoints = [
-            str(cp) for cp in Path(cfg.output_dir).glob("checkpoint-*")
-        ]
-        if len(possible_checkpoints) > 0:
-            sorted_paths = sorted(
-                possible_checkpoints,
-                key=lambda path: int(path.split("-")[-1]),
-            )
-            cfg.resume_from_checkpoint = sorted_paths[-1]
-            LOG.info(
-                f"Using Auto-resume functionality to start with checkpoint at {cfg.resume_from_checkpoint}"
-            )
-    resume_from_checkpoint = cfg.resume_from_checkpoint
-
-    trainer = setup_trainer(
-        cfg, train_dataset, eval_dataset, model, tokenizer, total_num_steps
-    )
-
-    model.config.use_cache = False
-
-    if torch.__version__ >= "2" and sys.platform != "win32":
-        LOG.info("Compiling torch model")
-        model = torch.compile(model)
-
-    # go ahead and presave, so we have the adapter config available to inspect
-    if peft_config:
-        LOG.info(f"Pre-saving adapter config to {cfg.output_dir}")
-        peft_config.save_pretrained(cfg.output_dir)
-    # additionally presave the tokenizer and model configs
-    if not Path(cfg.output_dir).is_dir():
-        os.makedirs(cfg.output_dir, exist_ok=True)
-    tokenizer.save_pretrained(str(Path(cfg.output_dir)))
-    model.config.save_pretrained(str(Path(cfg.output_dir)))
-
-    # In case we want to stop early with ctrl+c, this is a nice to have to save the pretrained model
-    if cfg.local_rank == 0:
-
-        def terminate_handler(_, __, model):
-            if cfg.flash_optimum:
-                model = BetterTransformer.reverse(model)
-            model.save_pretrained(cfg.output_dir, safe_serialization=safe_serialization)
-            sys.exit(0)
-
-        signal.signal(
-            signal.SIGINT, lambda signum, frame: terminate_handler(signum, frame, model)
-        )
-
-    LOG.info("Starting trainer...")
-    if cfg.group_by_length:
-        LOG.info("hang tight... sorting dataset for group_by_length")
-
-    if cfg.flash_optimum:
-        with torch.backends.cuda.sdp_kernel(
-            enable_flash=True, enable_math=True, enable_mem_efficient=True
-        ):
-            trainer.train(resume_from_checkpoint=resume_from_checkpoint)
-    else:
-        trainer.train(resume_from_checkpoint=resume_from_checkpoint)
-
-    LOG.info(f"Training Completed!!! Saving pre-trained model to {cfg.output_dir}")
-
-    if cfg.relora_steps:
-        if cfg.adapter == "lora" and not (cfg.load_in_4bit or cfg.load_in_8bit):
-            model = model.merge_and_unload()
-        else:
-            # final model weights have already been saved by `ReLoRACallback.on_train_end`
-            return model, tokenizer
-
-    # TODO do we need this fix? https://huggingface.co/docs/accelerate/usage_guides/fsdp#saving-and-loading
-    # only save on rank 0, otherwise it corrupts output on multi-GPU when multiple processes attempt to write the same file
-    if cfg.fsdp:
-        trainer.save_model(cfg.output_dir)
-    elif cfg.local_rank == 0:
-        if cfg.flash_optimum:
-            model = BetterTransformer.reverse(model)
-
-        model.save_pretrained(cfg.output_dir, safe_serialization=safe_serialization)
-
-    return model, tokenizer
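Review note on the removed `train.py`: the auto-resume logic picks the newest `checkpoint-*` directory by sorting on the integer step suffix, not on the path string. The sketch below isolates that step so the reason for the numeric key is visible. `latest_checkpoint` is a hypothetical name, and it uses `max` rather than the `sorted(...)[-1]` form in the diff, which should be equivalent for this purpose.

```python
import tempfile
from pathlib import Path
from typing import Optional


def latest_checkpoint(output_dir: str) -> Optional[str]:
    """Return the checkpoint path with the highest step number, if any."""
    candidates = [str(p) for p in Path(output_dir).glob("checkpoint-*")]
    if not candidates:
        return None
    # Compare on the numeric suffix; a plain string sort would rank
    # "checkpoint-9" above "checkpoint-100".
    return max(candidates, key=lambda path: int(path.split("-")[-1]))


with tempfile.TemporaryDirectory() as tmp:
    for step in (9, 10, 100):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    newest_name = Path(latest_checkpoint(tmp)).name
```

The key function assumes every matching directory name ends in `-<integer>`; a stray directory such as `checkpoint-final` would raise `ValueError`, both here and in the original code.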
--- a/src/axolotl/utils/callbacks.py
+++ b/src/axolotl/utils/callbacks.py
@@ -1,19 +1,9 @@
 """Callbacks for Trainer class"""

-from __future__ import annotations
-
 import logging
 import os
-from typing import TYPE_CHECKING, Dict, List

-import evaluate
-import numpy as np
-import pandas as pd
-import torch
-import torch.distributed as dist
-from datasets import load_dataset
 from optimum.bettertransformer import BetterTransformer
-from tqdm import tqdm
 from transformers import (
    TrainerCallback,
    TrainerControl,
@@ -23,21 +13,8 @@ from transformers import (
 from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR, IntervalStrategy

 from axolotl.utils.bench import log_gpu_memory_usage
-from axolotl.utils.distributed import (
-    barrier,
-    broadcast_dict,
-    gather_scalar_from_all_ranks,
-    get_world_size,
-    is_distributed,
-    is_main_process,
-    zero_first,
-)
-
-if TYPE_CHECKING:
-    from axolotl.utils.trainer import AxolotlTrainingArguments

 LOG = logging.getLogger("axolotl.callbacks")
-IGNORE_INDEX = -100


 class SavePeftModelCallback(TrainerCallback):  # pylint: disable=too-few-public-methods
@@ -56,9 +33,7 @@ class SavePeftModelCallback(TrainerCallback):  # pylint: disable=too-few-public-
        )

        peft_model_path = os.path.join(checkpoint_folder, "adapter_model")
-        kwargs["model"].save_pretrained(
-            peft_model_path, save_safetensors=args.save_safetensors
-        )
+        kwargs["model"].save_pretrained(peft_model_path)

        return control

@@ -119,207 +94,3 @@ class GPUStatsCallback(
            log_gpu_memory_usage(LOG, "while training", self.cfg.device)
            self.logged = True
        return control
-
-
-def bench_eval_callback_factory(trainer, tokenizer):
-    accuracy = evaluate.load("accuracy")
-    abcd_idx = [
-        tokenizer("A", add_special_tokens=False).input_ids[0],
-        tokenizer("B", add_special_tokens=False).input_ids[0],
-        tokenizer("C", add_special_tokens=False).input_ids[0],
-        tokenizer("D", add_special_tokens=False).input_ids[0],
-        tokenizer("E", add_special_tokens=False).input_ids[0],
-        tokenizer("F", add_special_tokens=False).input_ids[0],
-        tokenizer("G", add_special_tokens=False).input_ids[0],
-    ]
-    bench_split = "eval"
-
-    def transform_bench_subject(example):
-        # Split on ':' and trim whitespace
-        parts = example["subject"].split(":")
-        first_part = (
-            parts[0].strip().lower().replace("-", "_")
-        )  # Lowercase the first part
-        second_part = (
-            parts[1].strip().replace("-", "_") if len(parts) > 1 else "all"
-        )  # Replace hyphens with underscores
-
-        # Return the transformed values
-        return {"name": first_part, "subject": second_part}
-
-    if trainer.args.bench_dataset == "mmlu-zs":
-        bench_dataset = load_dataset(
-            "openaccess-ai-collective/mmlu-evals",
-            data_files={
-                "eval": "zero_shot_mmlu_val.json",
-                "test": "zero_shot_mmlu_test.json",
-            },
-        )
-        # bench_dataset = bench_dataset.remove_columns("subject")
-    # MMLU Five-shot (Eval/Test only)
-    elif trainer.args.bench_dataset in ["mmlu", "mmlu-fs"]:
-        bench_dataset = load_dataset(
-            "openaccess-ai-collective/mmlu-evals",
-            data_files={
-                "eval": "five_shot_mmlu_val.json",
-                "test": "five_shot_mmlu_test.json",
-            },
-        )
-        # bench_dataset = bench_dataset.remove_columns('subject')
-    elif "/" in trainer.args.bench_dataset:
-        bench_ds = trainer.args.bench_dataset
-        bench_ds_name = "/".join(bench_ds.split("/", 2)[:2])
-        bench_ds_data_file = "/".join(bench_ds.split("/", 2)[2:])
-        bench_dataset = load_dataset(
-            bench_ds_name,
-            data_files={
-                "eval": bench_ds_data_file,
-            },
-        )
-        bench_dataset["eval"] = bench_dataset["eval"].map(transform_bench_subject)
-    else:
-        raise ValueError(
-            f"unhandled value `{trainer.args.bench_dataset}` for bench_dataset training args"
-        )
-    bench_dataset = bench_dataset[trainer.args.bench_split]
-    if trainer.args.max_bench_samples is not None:
-        bench_dataset = bench_dataset.select(range(trainer.args.max_bench_samples))
-
-    def tokenize_evals(example):
-        source = f"{tokenizer.bos_token}{example['input']}"
-        target = f"{example['output']}{tokenizer.eos_token}"
-
-        tokenized_source = tokenizer(
-            source,
-            max_length=2048,
-            truncation=True,
-            add_special_tokens=False,
-        )
-        tokenized_target = tokenizer(
-            target,
-            max_length=2048,
-            truncation=True,
-            add_special_tokens=False,
-        )
-        input_ids = tokenized_source["input_ids"] + tokenized_target["input_ids"]
-        labels = [IGNORE_INDEX] * len(tokenized_source["input_ids"]) + tokenized_target[
-            "input_ids"
-        ]
-
-        return {
-            "input_ids": input_ids,
-            "labels": labels,
-            "subject": example["subject"],
-        }
-
-    with zero_first(is_main_process()):
-        bench_dataset = bench_dataset.map(tokenize_evals)
-        bench_dataset = bench_dataset.filter(lambda x: x["labels"][-2] in abcd_idx)
-
-    class BenchEvalCallback(TrainerCallback):
-        """
-        TrainerCallback that runs the MMLU evals
-        """
-
-        def on_evaluate(
-            self,
-            args: AxolotlTrainingArguments,
-            state: TrainerState,  # pylint: disable=unused-argument
-            control: TrainerControl,  # pylint: disable=unused-argument
-            metrics: Dict[str, float],  # pylint: disable=unused-argument
-            **kwargs,  # pylint: disable=unused-argument
-        ):
-            data_loader = trainer.get_bench_dataloader(
-                bench_dataset.remove_columns(["input", "subject", "output", "name"])
-            )
-            trainer.model.eval()
-            preds, refs = [], []
-            loss_bench = 0
-            for batch in tqdm(data_loader, total=len(data_loader)):
-                (loss, logits, labels) = trainer.prediction_step(
-                    trainer.model,
-                    batch,
-                    prediction_loss_only=False,
-                )
-                # There are two tokens, the output, and eos token.
-                for i, logit in enumerate(logits):
-                    label_non_zero_id = (batch["labels"][i] != IGNORE_INDEX).nonzero()[
-                        0
-                    ][0]
-                    logit_abcd = logit[label_non_zero_id - 1][abcd_idx]
-                    preds.append(torch.argmax(logit_abcd).item())
-                labels = labels[labels != IGNORE_INDEX].view(-1, 2)[:, 0]
-                refs += [
-                    abcd_idx.index(label) if label in abcd_idx else -1
-                    for label in labels.tolist()
-                ]
-                loss_bench += loss.item()
-            # Extract results by subject.
-            bench_name = bench_dataset["name"]
-            bench_names: dict = {s: {"refs": [], "preds": []} for s in set(bench_name)}
-            for s, p, r in zip(bench_name, preds, refs):  # pylint: disable=invalid-name
-                bench_names[s]["preds"].append(p)
-                bench_names[s]["refs"].append(r)
-            barrier()
-            local_bench_names = bench_names
-            gathered_bench_names: List[Dict] = [{} for _ in range(get_world_size())]
-            # Gather results from all GPUs to GPU 0
-
-            loss_bench_ranks = gather_scalar_from_all_ranks(
-                lambda: loss_bench, get_world_size()
-            )
-            len_data_loader_ranks = gather_scalar_from_all_ranks(
-                lambda: len(data_loader), get_world_size()
-            )
-
-            results = {}
-            if is_distributed() and not is_main_process():
-                dist.gather_object(local_bench_names, dst=0)
-            else:
-                if is_distributed():
-                    dist.gather_object(local_bench_names, gathered_bench_names, dst=0)
-                else:
-                    gathered_bench_names = [local_bench_names]
-                bench_loss = sum(loss_bench_ranks) / sum(len_data_loader_ranks)
-                results = {f"{bench_split}_bench_loss": bench_loss}
-
-                # Combine results from all GPUs
-                combined_bench_names: Dict[str, Dict[str, List]] = {}
-                for bench_name in gathered_bench_names:
-                    for name, data in bench_name.items():
-                        if name not in combined_bench_names:
-                            combined_bench_names[name] = {"refs": [], "preds": []}
-                        combined_bench_names[name]["refs"].extend(data["refs"])
-                        combined_bench_names[name]["preds"].extend(data["preds"])
-
-                bench_scores = []
-                bench_refs = []
-                bench_preds = []
-                for (
-                    bench_name
-                ) in combined_bench_names:  # pylint: disable=consider-using-dict-items
-                    bench_score = accuracy.compute(
-                        references=combined_bench_names[bench_name]["refs"],
-                        predictions=combined_bench_names[bench_name]["preds"],
-                    )["accuracy"]
-                    bench_refs.extend(combined_bench_names[bench_name]["refs"])
-                    bench_preds.extend(combined_bench_names[bench_name]["preds"])
-                    if not pd.isna(bench_score):
-                        results[
-                            f"{bench_split}_bench_accuracy_{bench_name}"
-                        ] = bench_score
-                        bench_scores.append(bench_score)
-                    else:
-                        results[f"{bench_split}_bench_accuracy_{bench_name}"] = 0.0
-                        bench_scores.append(0.0)
-                results[f"{bench_split}_bench_average_accuracy"] = np.mean(bench_scores)
-                results[f"{bench_split}_bench_total_accuracy"] = accuracy.compute(
-                    references=bench_refs, predictions=bench_preds
-                )["accuracy"]
-                trainer.log(results)
-
-            results = broadcast_dict(results)
-            for key, val in results.items():
-                metrics[key] = val
-
-    return BenchEvalCallback
--- a/src/axolotl/utils/config.py
+++ b/src/axolotl/utils/config.py
@@ -6,7 +6,6 @@ import os
 import torch

 from axolotl.utils.bench import log_gpu_memory_usage
-from axolotl.utils.models import load_model_config

 LOG = logging.getLogger("axolotl")

@@ -70,16 +69,6 @@ def normalize_config(cfg):
    else:
        cfg.torch_dtype = torch.float32

-    model_config = load_model_config(cfg)
-
-    # figure out if the model is llama
-    cfg.is_llama_derived_model = (
-        (hasattr(model_config, "model_type") and model_config.model_type == "llama")
-        or cfg.is_llama_derived_model
-        or "llama" in cfg.base_model
-        or (cfg.model_type and "llama" in cfg.model_type.lower())
-    )
-
    log_gpu_memory_usage(LOG, "baseline", cfg.device)


@@ -97,11 +86,6 @@ def validate_config(cfg):
            )
        )

-    if cfg.sample_packing and not cfg.pad_to_sequence_len:
-        LOG.warning(
-            "`pad_to_sequence_len: true` is recommended when using sample_packing"
-        )
-
    if cfg.gradient_accumulation_steps and cfg.batch_size:
        raise ValueError(
            "please set only one of gradient_accumulation_steps or batch_size"
@@ -113,7 +97,9 @@ def validate_config(cfg):
            "To calculate the equivalent gradient_accumulation_steps, divide batch_size / micro_batch_size / number of gpus.",
        )
    if cfg.load_4bit:
-        raise ValueError("cfg.load_4bit parameter has been deprecated")
+        raise ValueError(
+            "cfg.load_4bit parameter has been deprecated and replaced by cfg.gptq"
+        )

    if cfg.adapter == "qlora":
        if cfg.merge_lora:
@@ -140,19 +126,6 @@ def validate_config(cfg):
    if not cfg.load_in_8bit and cfg.adapter == "lora":
        LOG.warning("We recommend setting `load_in_8bit: true` for LORA finetuning")

-    if cfg.relora_steps:
-        if cfg.adapter not in ("lora", "qlora"):
-            raise ValueError("cfg.adapter must be lora or qlora to use ReLoRA")
-
-        if cfg.fsdp:
-            raise ValueError("fsdp not supported with ReLoRA")
-
-        if cfg.deepspeed:
-            raise ValueError("deepspeed not supported with ReLoRA")
-
-        if cfg.lr_scheduler == "one_cycle":
-            raise ValueError("ReLoRA is not compatible with the one_cycle scheduler")
-
    if cfg.trust_remote_code:
        LOG.warning(
            "`trust_remote_code` is set to true. Please make sure that you reviewed the remote code/model."
@@ -220,15 +193,6 @@ def validate_config(cfg):
            "sample_packing not compatible with xformers_attention. Use flash_attention"
        )

-    if cfg.early_stopping_patience:
-        if not cfg.save_steps or not cfg.eval_steps:
-            raise ValueError(
-                "`early_stopping_patience` requires save_steps and eval_steps to be set. eval_steps should evenly divide save_steps."
-            )
-        if cfg.save_steps % cfg.eval_steps != 0:
-            raise ValueError(
-                "`early_stopping_patience` requires that eval_steps should evenly divide save_steps."
-            )
    # TODO
    # MPT 7b
    # https://github.com/facebookresearch/bitsandbytes/issues/25
--- a/src/axolotl/utils/data.py
+++ b/src/axolotl/utils/data.py
@@ -2,6 +2,7 @@
 import functools
 import hashlib
 import logging
+from hashlib import md5
 from pathlib import Path
 from typing import Tuple, Union

@@ -51,19 +52,11 @@ LOG = logging.getLogger("axolotl")
 DEFAULT_DATASET_PREPARED_PATH = "last_run_prepared"


-def md5(to_hash: str, encoding: str = "utf-8") -> str:
-    try:
-        return hashlib.md5(to_hash.encode(encoding), usedforsecurity=False).hexdigest()
-    except TypeError:
-        return hashlib.md5(to_hash.encode(encoding)).hexdigest()  # nosec
-
-
 def prepare_dataset(cfg, tokenizer):
    if not cfg.pretraining_dataset:
-        with zero_first(is_main_process()):
-            train_dataset, eval_dataset = load_prepare_datasets(
-                tokenizer, cfg, DEFAULT_DATASET_PREPARED_PATH
-            )
+        train_dataset, eval_dataset = load_prepare_datasets(
+            tokenizer, cfg, DEFAULT_DATASET_PREPARED_PATH
+        )
    else:
        train_dataset = load_pretraining_dataset(
            cfg.pretraining_dataset,
@@ -94,7 +87,7 @@ def load_tokenized_prepared_datasets(
 ) -> DatasetDict:
    tokenizer_name = tokenizer.__class__.__name__
    ds_hash = str(
-        md5(
+        md5(  # nosec
            (
                str(cfg.sequence_len)
                + "@"
@@ -103,8 +96,8 @@ def load_tokenized_prepared_datasets(
                )
                + "|"
                + tokenizer_name
-            )
-        )
+            ).encode("utf-8")
+        ).hexdigest()
    )
    prepared_ds_path = (
        Path(cfg.dataset_prepared_path) / ds_hash
@@ -140,17 +133,8 @@ def load_tokenized_prepared_datasets(
            seed = 42

        datasets = []
-
-        def for_d_in_datasets(dataset_configs):
-            for dataset in dataset_configs:
-                if dataset.name and isinstance(dataset.name, list):
-                    for name in dataset.name:
-                        yield DictDefault({**dataset, "name": name})
-                else:
-                    yield dataset
-
        # pylint: disable=invalid-name
-        for d in for_d_in_datasets(cfg.datasets):
+        for d in cfg.datasets:
            ds: Union[Dataset, DatasetDict] = None
            ds_from_hub = False
            try:
@@ -380,7 +364,7 @@ def load_prepare_datasets(
        # see if we can go ahead and load the stacked dataset
        seed = f"@{str(cfg.seed)}" if cfg.seed else ""
        ds_hash = str(
-            md5(
+            md5(  # nosec
                (
                    str(cfg.sequence_len)
                    + "@"
@@ -391,8 +375,8 @@ def load_prepare_datasets(
                    )
                    + "|"
                    + tokenizer_name
-                )
-            )
+                ).encode("utf-8")
+            ).hexdigest()
        )
        prepared_ds_path = (
            Path(cfg.dataset_prepared_path) / ds_hash
@@ -506,8 +490,12 @@ def load_prepare_datasets(
            + "|"
            + str(cfg.seed or 42)
        )
-        train_fingerprint = md5(to_hash_train)
-        test_fingerprint = md5(to_hash_test)
+        train_fingerprint = hashlib.md5(
+            to_hash_train.encode(), usedforsecurity=False
+        ).hexdigest()
+        test_fingerprint = hashlib.md5(
+            to_hash_test.encode(), usedforsecurity=False
+        ).hexdigest()

        with zero_first(is_main_process()):
            dataset = dataset.train_test_split(
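Review note on the `data.py` hunks: the removed `md5` helper fell back to a plain `hashlib.md5(...)` call when the `usedforsecurity` keyword was rejected, whereas the new fingerprint code passes `usedforsecurity=False` unconditionally. That keyword was added in Python 3.9, so the direct call presumably assumes 3.9 or newer. A sketch of the fallback pattern, using the hypothetical name `fingerprint` and a made-up cache-key string:

```python
import hashlib


def fingerprint(to_hash: str, encoding: str = "utf-8") -> str:
    """Hex MD5 digest used as a cache key, not for security."""
    data = to_hash.encode(encoding)
    try:
        # Accepted on Python 3.9+; marks the digest as non-cryptographic use.
        return hashlib.md5(data, usedforsecurity=False).hexdigest()
    except TypeError:
        # Older interpreters do not accept the keyword argument.
        return hashlib.md5(data).hexdigest()  # nosec


# Illustrative input only; the real key is built from cfg values.
key = fingerprint("2048@dataset_a|LlamaTokenizer")
```

Both branches yield the same digest for the same input, so swapping between the helper and the direct call should not invalidate previously prepared dataset caches.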
--- a/src/axolotl/utils/dataloader.py
+++ b/src/axolotl/utils/dataloader.py
@@ -243,18 +243,6 @@ class MultipackDistributedDataloader:
            len_remaining -= 1
            if not len_remaining:
                return
-        # yield a no-op for cases where we don't have any data left to pack
-        for i in range(0, len_remaining):
-            yield self.collate_fn(
-                [
-                    {
-                        "input_ids": [0],
-                        "labels": [-100],
-                        "attention_mask": [True],
-                        "position_ids": [0],
-                    }
-                ]
-            )

    def _len_est(self):
        lengths_sum = np.sum(self.lengths)
--- a/src/axolotl/utils/distributed.py
+++ b/src/axolotl/utils/distributed.py
@@ -1,11 +1,8 @@
 """
 utility helpers for distributed checks
 """
-import os
-import pickle  # nosec
 from contextlib import contextmanager

-import torch
 import torch.distributed as dist
 from accelerate import Accelerator

@@ -46,10 +43,6 @@ def is_main_process():
    return dist.get_rank() == 0


-def get_world_size():
-    return int(os.getenv("WORLD_SIZE", "1"))
-
-
 @contextmanager
 def zero_first(is_main):
    """
@@ -60,64 +53,3 @@ def zero_first(is_main):
    yield
    if is_main:  # then rank 0 waits after it has run the context
        barrier()
-
-
-def gather_scalar_from_all_ranks(fn, world_size=1):  # pylint: disable=invalid-name
-    """
-    Run a callable 'fn' on all ranks and gather the results on the specified rank.
-
-    Args:
-    - fn (callable): A function that computes the value. This should not have any side effects.
-    - rank (int, optional): The rank that gathers the values. Default is 0.
-    - world_size (int, optional): Total number of processes in the current distributed setup.
-
-    Returns:
-    - A list of computed values from all ranks if on the gathering rank, otherwise None.
-    """
-    value_scalar = fn()
-    if not is_distributed():
-        return [value_scalar]
-    value_tensor = torch.tensor(value_scalar, device=dist.get_rank()).float()
-
-    if not is_main_process():
-        dist.gather(value_tensor, dst=0)
-    else:
-        gathered_tensors = [torch.zeros_like(value_tensor) for _ in range(world_size)]
-        dist.gather(value_tensor, gather_list=gathered_tensors, dst=0)
-
-        # Convert tensors back to their original type (int or float)
-        gathered_values = []
-        for tensor in gathered_tensors:
-            if tensor == tensor.int():
-                gathered_values.append(int(tensor.item()))
-            else:
-                gathered_values.append(float(tensor.item()))
-        return gathered_values
-    return None
-
-
-def broadcast_dict(vals: dict):
-    if not is_distributed():
-        return vals
-
-    if is_main_process():
-        data_byte = pickle.dumps(vals)
-        data_tensor = torch.ByteTensor(list(data_byte)).to("cuda")
-        data_size = torch.IntTensor([len(data_byte)]).to("cuda")
-    else:
-        data_tensor = torch.empty([1024], dtype=torch.uint8, device="cuda")
-        data_size = torch.IntTensor([0]).to("cuda")
-
-    dist.broadcast(data_size, 0)
-    if not is_main_process():
-        # resize
-        data_tensor = data_tensor.new_empty([data_size.item()])
-
-    dist.broadcast(data_tensor, 0)
-
-    if not is_main_process():
-        data_list = data_tensor.cpu().tolist()
-        data_byte = bytes(data_list[: data_size.item()])
-        vals = pickle.loads(data_byte)  # nosec
-
-    return vals
--- a/src/axolotl/utils/models.py
+++ b/src/axolotl/utils/models.py
@@ -4,37 +4,33 @@
 import logging
 import math
 import os
-from typing import Optional, Tuple  # noqa: F401
+from pathlib import Path
+from typing import TYPE_CHECKING, Optional, Tuple  # noqa: F401

 import bitsandbytes as bnb
 import torch
 import transformers
 from optimum.bettertransformer import BetterTransformer
-from peft import PeftConfig, prepare_model_for_kbit_training
+from peft.tuners.lora import LoraLayer
 from transformers import (  # noqa: F401
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
-    GPTQConfig,
    LlamaConfig,
    PreTrainedModel,
    PreTrainedTokenizerBase,
 )

-from axolotl.prompt_tokenizers import LLAMA_DEFAULT_EOS_TOKEN
+from axolotl.prompt_tokenizers import LLAMA_DEFAULT_PAD_TOKEN
 from axolotl.utils.bench import log_gpu_memory_usage
-from axolotl.utils.dict import DictDefault

 LOG = logging.getLogger("axolotl")

+if TYPE_CHECKING:
+    from peft import PeftConfig  # noqa: F401

-def load_model_config(cfg):
-    model_config_name = cfg.base_model_config or cfg.base_model
-    trust_remote_code: bool = False or cfg.trust_remote_code
-    return AutoConfig.from_pretrained(
-        model_config_name, trust_remote_code=trust_remote_code
-    )
+    from axolotl.utils.dict import DictDefault  # noqa: F401


 def load_tokenizer(cfg):
@@ -59,18 +55,11 @@ def load_tokenizer(cfg):
        **tokenizer_kwargs,
    )

-    if (
-        tokenizer.__class__.__name__
-        in [
-            "LlamaTokenizer",
-            "LlamaTokenizerFast",
-            "CodeLlamaTokenizer",
-        ]
-        and hasattr(tokenizer, "pad_token")
-        and not tokenizer.pad_token
-    ):
-        # set a pad_token, but use eos_token so we don't add a new token
-        tokenizer.pad_token = LLAMA_DEFAULT_EOS_TOKEN
+    if tokenizer.__class__.__name__ in [
+        "LlamaTokenizer",
+        "LlamaTokenizerFast",
+    ]:
+        tokenizer.pad_token = LLAMA_DEFAULT_PAD_TOKEN

    LOG.debug(f"EOS: {tokenizer.eos_token_id} / {tokenizer.eos_token}")
    LOG.debug(f"BOS: {tokenizer.bos_token_id} / {tokenizer.bos_token}")
@@ -91,10 +80,8 @@ def load_tokenizer(cfg):


 def load_model(
-    cfg: DictDefault,
-    tokenizer: PreTrainedTokenizerBase,
-    inference: bool = False,
-) -> Tuple[PreTrainedModel, Optional[PeftConfig]]:
+    cfg, tokenizer
+):  # type: (DictDefault, PreTrainedTokenizerBase) -> Tuple[PreTrainedModel, Optional[PeftConfig]]
    """
    Load a model for a given configuration and tokenizer.
    """
@@ -104,9 +91,14 @@ def load_model(

    # TODO refactor as a kwarg
    load_in_8bit = cfg.load_in_8bit
+    cfg.is_llama_derived_model = (
+        "llama" in base_model
+        or (cfg.model_type and "llama" in cfg.model_type.lower())
+        or cfg.is_llama_derived_model
+    )

    if cfg.is_llama_derived_model and cfg.flash_attention:
-        if cfg.device not in ["mps", "cpu"] and not inference:
+        if cfg.device not in ["mps", "cpu"] and not cfg.inference:
            from axolotl.monkeypatch.llama_attn_hijack_flash import (
                replace_llama_attn_with_flash_attn,
            )
@@ -148,24 +140,39 @@ def load_model(
    if (
        cfg.is_llama_derived_model
        and (cfg.max_packed_sequence_len or cfg.sample_packing)
-        and not inference
+        and not cfg.inference
    ):
        from axolotl.monkeypatch.llama_expand_mask import hijack_expand_mask

        LOG.info("patching _expand_mask")
        hijack_expand_mask()

+    try:
+        if cfg.gptq:
+            from alpaca_lora_4bit.monkeypatch.peft_tuners_lora_monkey_patch import (
+                replace_peft_model_with_int4_lora_model,
+            )
+
+            replace_peft_model_with_int4_lora_model()
+    except Exception as err:
+        LOG.exception(err)
+        raise err
+
+    if not cfg.gptq and (
+        (cfg.adapter == "lora" and load_in_8bit)
+        or (cfg.adapter == "qlora" and cfg.load_in_4bit)
+    ):
+        try:
+            from peft import prepare_model_for_kbit_training
+        except ImportError:
+            # For backward compatibility
+            from peft import (
+                prepare_model_for_int8_training as prepare_model_for_kbit_training,
+            )
+
    model_kwargs = {}
    if cfg.model_revision:
        model_kwargs["revision"] = cfg.model_revision
-    if cfg.gptq:
-        model_config = load_model_config(cfg)
-        if not hasattr(model_config, "quantization_config"):
-            LOG.warning("model config does not contain quantization_config information")
-        else:
-            model_kwargs["quantization_config"] = GPTQConfig(
-                **model_config.quantization_config
-            )
    if cfg.adapter == "qlora" and cfg.load_in_4bit:
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
@@ -176,7 +183,45 @@ def load_model(
            bnb_4bit_quant_type="nf4",
        )
    try:
-        if cfg.is_llama_derived_model and not cfg.trust_remote_code and not cfg.gptq:
+        if cfg.gptq and cfg.is_llama_derived_model:
+            from alpaca_lora_4bit.autograd_4bit import load_llama_model_4bit_low_ram
+            from huggingface_hub import snapshot_download
+
+            try:
+                snapshot_download_kwargs = {}
+                if cfg.base_model_ignore_patterns:
+                    snapshot_download_kwargs[
+                        "ignore_patterns"
+                    ] = cfg.base_model_ignore_patterns
+                cache_model_path = Path(
+                    snapshot_download(base_model, **snapshot_download_kwargs)
+                )
+                files = (
+                    list(cache_model_path.glob("*.pt"))
+                    + list(cache_model_path.glob("*.safetensors"))
+                    + list(cache_model_path.glob("*.bin"))
+                )
+                if len(files) > 0:
+                    model_path = str(files[0])
+                else:
+                    LOG.warning(
+                        "unable to find a cached model file, this will likely fail..."
+                    )
+                    model_path = str(cache_model_path)
+            except Exception:  # pylint: disable=broad-exception-caught
+                model_path = cfg.base_model
+            model, _ = load_llama_model_4bit_low_ram(
+                base_model_config if base_model_config else base_model,
+                model_path,
+                device_map=cfg.device_map,
+                half=cfg.fp16,
+                groupsize=cfg.gptq_groupsize if cfg.gptq_groupsize else -1,
+                is_v1_model=cfg.gptq_model_v1
+                if cfg.gptq_model_v1 is not None
+                else True,
+            )
+            load_in_8bit = False
+        elif cfg.is_llama_derived_model and not cfg.trust_remote_code:
            from transformers import LlamaForCausalLM

            config_kwargs = {}
@@ -222,24 +267,15 @@ def load_model(
        #     )
        #     model.train() # sets to train instead of eval mode
        elif model_type and not cfg.trust_remote_code:
-            if cfg.gptq:
-                model = AutoModelForCausalLM.from_pretrained(
-                    base_model,
-                    device_map=cfg.device_map,
-                    torch_dtype=cfg.torch_dtype,
-                    trust_remote_code=cfg.trust_remote_code or False,
-                    **model_kwargs,
-                )
-            else:
-                model = getattr(transformers, model_type).from_pretrained(
-                    base_model,
-                    device_map=cfg.device_map,
-                    load_in_8bit=cfg.load_in_8bit and cfg.adapter is not None,
-                    load_in_4bit=cfg.load_in_4bit and cfg.adapter is not None,
-                    torch_dtype=cfg.torch_dtype,
-                    trust_remote_code=cfg.trust_remote_code or False,
-                    **model_kwargs,
-                )
+            model = getattr(transformers, model_type).from_pretrained(
+                base_model,
+                device_map=cfg.device_map,
+                load_in_8bit=cfg.load_in_8bit and cfg.adapter is not None,
+                load_in_4bit=cfg.load_in_4bit and cfg.adapter is not None,
+                torch_dtype=cfg.torch_dtype,
+                trust_remote_code=cfg.trust_remote_code or False,
+                **model_kwargs,
+            )
        else:
            config = AutoConfig.from_pretrained(
                base_model,
@@ -306,46 +342,36 @@ def load_model(
    if model.device.type == "cuda":
        log_gpu_memory_usage(LOG, "after model load", model.device)

-    # make sure these are fp32 per Ramesh et al. (2021)
-    for name, module in model.named_modules():
-        if "norm" in name:
-            module.to(torch.float32)
-        if "lm_head" in name or "embed_tokens" in name:
-            if hasattr(module, "weight"):
-                module.to(torch.float32)
-
-    needs_fa2_dtype = cfg.adapter or cfg.fsdp
-    if (cfg.adapter == "lora" and load_in_8bit) or (
-        cfg.adapter == "qlora" and cfg.load_in_4bit
+    if not cfg.gptq and (
+        (cfg.adapter == "lora" and load_in_8bit)
+        or (cfg.adapter == "qlora" and cfg.load_in_4bit)
    ):
        LOG.info("converting PEFT model w/ prepare_model_for_kbit_training")
-        if cfg.gradient_checkpointing:
-            model.gradient_checkpointing_enable()
        model = prepare_model_for_kbit_training(
            model, use_gradient_checkpointing=cfg.gradient_checkpointing
        )
-        needs_fa2_dtype = True
-
-    # LlamaRMSNorm layers are in fp32 after kbit_training or full finetune, so we need to
-    # convert them back to fp16/bf16 for flash-attn compatibility.
-    if needs_fa2_dtype or (cfg.flash_attention and cfg.is_llama_derived_model):
-        LOG.info("converting modules to %s for flash attention", cfg.torch_dtype)
-        for name, module in model.named_modules():
-            if "norm" in name:
-                module.to(cfg.torch_dtype)
-            if "lm_head" in name or "embed_tokens" in name:
-                if hasattr(module, "weight"):
-                    module.to(cfg.torch_dtype)

    model, lora_config = load_adapter(model, cfg, cfg.adapter)

    if cfg.ddp and not load_in_8bit:
        model.to(f"cuda:{cfg.local_rank}")

+    if cfg.gptq:
+        # Scales to half
+        LOG.info("Fitting 4bit scales and zeros to half")
+        for _, module in model.named_modules():
+            if "Autograd4bitQuantLinear" in str(type(module)) or "Linear4bitLt" in str(
+                type(module)
+            ):
+                if hasattr(module, "is_v1_model") and module.is_v1_model:
+                    module.zeros = module.zeros.half()
+                module.scales = module.scales.half()
+                module.bias = module.bias.half()
+
    if (
        torch.cuda.device_count() > 1
        and int(os.getenv("WORLD_SIZE", "1")) > 1
-        and (cfg.load_in_4bit)
+        and (cfg.gptq or cfg.load_in_4bit)
    ):
        # llama is PROBABLY model parallelizable, but the default isn't that it is
        # so let's only set it for the 4bit, see
@@ -371,15 +397,15 @@ def load_model(
    return model, lora_config


-def load_adapter(model, cfg, adapter, inference=False):
-    # type: (PreTrainedModel, DictDefault, Optional[str], bool) -> Tuple[PreTrainedModel, Optional[PeftConfig]]
+def load_adapter(model, cfg, adapter):
+    # type: (PreTrainedModel, DictDefault, Optional[str]) -> Tuple[PreTrainedModel, Optional[PeftConfig]]

    if adapter is None:
        return model, None
    if hasattr(model, "enable_input_require_grads"):
        model.enable_input_require_grads()
    if adapter in ["lora", "qlora"]:
-        return load_lora(model, cfg, inference=inference)
+        return load_lora(model, cfg)
    if adapter == "llama-adapter":
        return load_llama_adapter(model, cfg)

@@ -411,8 +437,12 @@ def load_llama_adapter(model, cfg):
    return model, peft_config


-def find_all_linear_names(model):
-    cls = (bnb.nn.Linear4bit, bnb.nn.Linear8bitLt, torch.nn.Linear)
+def find_all_linear_names(bits, model):
+    cls = (
+        bnb.nn.Linear4bit
+        if bits == 4
+        else (bnb.nn.Linear8bitLt if bits == 8 else torch.nn.Linear)
+    )
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
@@ -425,15 +455,21 @@ def find_all_linear_names(model):
    return list(lora_module_names)


-def load_lora(model, cfg, inference=False):
-    # type: (PreTrainedModel, DictDefault, bool) -> Tuple[PreTrainedModel, Optional[PeftConfig]]
+def load_lora(model, cfg):
+    # type: (PreTrainedModel, DictDefault) -> Tuple[PreTrainedModel, Optional[PeftConfig]]

    from peft import LoraConfig, PeftModel, get_peft_model

    lora_target_modules = list(cfg.lora_target_modules or [])

    if cfg.lora_target_linear:
-        linear_names = find_all_linear_names(model)
+        bits = None
+        if cfg.load_in_4bit:
+            bits = 4
+        elif cfg.load_in_8bit:
+            bits = 8
+
+        linear_names = find_all_linear_names(bits, model)
        LOG.info(f"found linear modules: {repr(linear_names)}")
        lora_target_modules = list(set(lora_target_modules + linear_names))

@@ -453,11 +489,27 @@ def load_lora(model, cfg, inference=False):
        model = PeftModel.from_pretrained(
            model,
            cfg.lora_model_dir,
-            is_trainable=(not inference),
+            is_trainable=not cfg.inference,
        )
    else:
        model = get_peft_model(model, lora_config)

+    for name, module in model.named_modules():
+        if isinstance(module, LoraLayer):
+            module = module.to(cfg.torch_dtype)
+        if "norm" in name:
+            module = module.to(torch.float32)
+        if "lm_head" in name or "embed_tokens" in name:
+            if hasattr(module, "weight"):
+                module = module.to(cfg.torch_dtype)
+
+    # LlamaRMSNorm layers are in fp32 after kbit_training, so we need to
+    # convert them back to fp16/bf16 for flash-attn compatibility.
+    if cfg.flash_attention and cfg.is_llama_derived_model:
+        for name, module in model.named_modules():
+            if "norm" in name:
+                module = module.to(cfg.torch_dtype)
+
    model.print_trainable_parameters()

    return model, lora_config
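Review note on the GPTQ branch added to load_model above: the weight-file lookup takes the first match from three globs concatenated in the order *.pt, *.safetensors, *.bin, and falls back to the snapshot directory when nothing matches. A minimal standalone sketch of that selection follows. The helper name pick_model_file is hypothetical, and a temporary directory stands in for the snapshot_download cache path.

```python
import tempfile
from pathlib import Path


def pick_model_file(cache_dir: Path) -> str:
    """Hypothetical helper: return the first weight file found, else the directory."""
    # Same priority as the hunk: *.pt, then *.safetensors, then *.bin.
    files = (
        list(cache_dir.glob("*.pt"))
        + list(cache_dir.glob("*.safetensors"))
        + list(cache_dir.glob("*.bin"))
    )
    if files:
        return str(files[0])
    # Nothing matched; the code in the diff logs a warning at this point.
    return str(cache_dir)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # An empty directory falls back to the directory path itself.
    empty_fell_back = pick_model_file(root) == str(root)
    (root / "model.bin").touch()
    (root / "model.safetensors").touch()
    # With both present, the .safetensors file is listed before the .bin file.
    chosen = Path(pick_model_file(root)).name
```

Within a single extension the glob order depends on the filesystem, so a sharded checkpoint with several files of the same type may not resolve to a predictable shard.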
--- a/src/axolotl/utils/tokenization.py
+++ b/src/axolotl/utils/tokenization.py
@@ -8,13 +8,13 @@ from termcolor import colored
 LOG = logging.getLogger("axolotl")


-def check_dataset_labels(dataset, tokenizer, num_examples=5, text_only=False):
+def check_dataset_labels(dataset, tokenizer):
    # the dataset is already shuffled, so let's just check the first 5 elements
-    for idx in range(num_examples):
-        check_example_labels(dataset[idx], tokenizer, text_only=text_only)
+    for idx in range(5):
+        check_example_labels(dataset[idx], tokenizer)


-def check_example_labels(example, tokenizer, text_only=False):
+def check_example_labels(example, tokenizer):
    # Get the input_ids, labels, and attention_mask from the dataset
    input_ids = example["input_ids"]
    labels = example["labels"]
@@ -29,10 +29,8 @@ def check_example_labels(example, tokenizer, text_only=False):
        decoded_input_token = tokenizer.decode(input_id)
        # Choose the color based on whether the label has the ignore value or not
        color = "red" if label_id == -100 else ("yellow" if label_id == 0 else "green")
-        colored_token = colored(decoded_input_token, color) + (
-            not text_only
-            and colored(f"({label_id}, {mask}, {input_id})", "white")
-            or ""
+        colored_token = colored(decoded_input_token, color) + colored(
+            f"({label_id}, {mask}, {input_id})", "white"
        )
        colored_tokens.append(colored_token)

--- a/src/axolotl/utils/trainer.py
+++ b/src/axolotl/utils/trainer.py
@@ -10,31 +10,31 @@ from functools import partial
 from pathlib import Path
 from typing import Optional, Union

+import bitsandbytes as bnb
 import numpy as np
 import torch.cuda
 import transformers
 from datasets import Dataset, set_caching_enabled
+from torch import nn
 from torch.optim.lr_scheduler import OneCycleLR
-from torch.utils.data import (
-    DataLoader,
-    DistributedSampler,
-    RandomSampler,
-    SequentialSampler,
-)
+from torch.utils.data import DataLoader, DistributedSampler, RandomSampler
 from transformers import EarlyStoppingCallback, Trainer, TrainingArguments
-from transformers.trainer_pt_utils import SequentialDistributedSampler
+from transformers.trainer_pt_utils import (
+    SequentialDistributedSampler,
+    get_parameter_names,
+)

-from axolotl.monkeypatch.relora import ReLoRACallback, ReLoRAScheduler
 from axolotl.utils.callbacks import (
    GPUStatsCallback,
    SaveBetterTransformerModelCallback,
    SavePeftModelCallback,
-    bench_eval_callback_factory,
 )
 from axolotl.utils.collators import DataCollatorForSeq2Seq
 from axolotl.utils.dataloader import MultipackDistributedDataloader
-from axolotl.utils.distributed import is_main_process, zero_first
-from axolotl.utils.schedulers import get_cosine_schedule_with_quadratic_warmup
+from axolotl.utils.schedulers import (
+    InterpolatingLogScheduler,
+    get_cosine_schedule_with_quadratic_warmup,
+)

 LOG = logging.getLogger("axolotl")

@@ -127,35 +127,6 @@ class AxolotlTrainingArguments(TrainingArguments):
        default=1,
        metadata={"help": "the multiplier for the max len for packed sequences"},
    )
-    relora_steps: Optional[int] = field(
-        default=None,
-        metadata={"help": "how often to reset for ReLoRA"},
-    )
-    relora_warmup_steps: Optional[int] = field(
-        default=None,
-        metadata={"help": "how many warmup steps to take after reset for ReLoRA"},
-    )
-    bench_split: Optional[str] = field(
-        default="eval", metadata={"help": "The benchmark split to run on"}
-    )
-    bench_dataset: Optional[str] = field(
-        default="pharaouk/dharma-1/dharma_1_mini.json",
-        metadata={
-            "help": "Benchmark dataset to use: options are `mmlu-zs`, `mmlu-fs`, or the full path to the dataset file"
-        },
-    )
-    do_bench_eval: Optional[bool] = field(
-        default=False, metadata={"help": "Whether to run the Benchmark evaluation."}
-    )
-    max_bench_samples: Optional[int] = field(
-        default=None,
-        metadata={
-            "help": "If set, only evaluates on `max_bench_samples` of the benchmark dataset."
-        },
-    )
-    bench_source_max_len: int = field(
-        default=2048, metadata={"help": "Maximum source sequence length for bench."}
-    )


 class AxolotlTrainer(Trainer):
@@ -165,10 +136,6 @@ class AxolotlTrainer(Trainer):

    args = None  # type: AxolotlTrainingArguments

-    def __init__(self, *args, bench_data_collator=None, **kwargs):
-        self.bench_data_collator = bench_data_collator
-        super().__init__(*args, **kwargs)
-
    def create_scheduler(
        self, num_training_steps: int, optimizer: torch.optim.Optimizer = None
    ):
@@ -259,31 +226,6 @@ class AxolotlTrainer(Trainer):
            )
        return super().get_eval_dataloader(eval_dataset)

-    def _get_bench_sampler(
-        self, bench_dataset: Dataset
-    ) -> Optional[torch.utils.data.Sampler]:
-        if self.args.world_size <= 1:
-            return SequentialSampler(bench_dataset)
-        return None
-
-    def get_bench_dataloader(
-        self,
-        bench_dataset: Dataset,
-    ) -> Union[DataLoader, MultipackDistributedDataloader]:
-        dataloader_params = {
-            "batch_size": self.args.eval_batch_size,
-            "collate_fn": self.bench_data_collator,
-            "num_workers": self.args.dataloader_num_workers,
-            "pin_memory": self.args.dataloader_pin_memory,
-        }
-
-        if not isinstance(bench_dataset, torch.utils.data.IterableDataset):
-            dataloader_params["sampler"] = self._get_bench_sampler(bench_dataset)
-            dataloader_params["drop_last"] = self.args.dataloader_drop_last
-
-        return DataLoader(bench_dataset, **dataloader_params)
-        # return self.accelerator.prepare(DataLoader(bench_dataset, **dataloader_params))
-
    def compute_loss(self, model, inputs, return_outputs=False):
        # use one's weighted cross entropy loss calc
        # if self.args.sample_packing:
@@ -323,46 +265,13 @@ class OneCycleLRSchedulerTrainer(AxolotlTrainer):
        return self.lr_scheduler


-class ReLoRATrainer(AxolotlTrainer):
-    """
-    Trainer subclass that uses the OneCycleLR scheduler
-    """
-
-    def __init__(self, *args, **kwargs):
-        super().__init__(*args, **kwargs)
-        self.lr_scheduler = None
-
-    def create_scheduler(
-        self,
-        num_training_steps: int,
-        optimizer: Optional[torch.optim.Optimizer] = None,
-    ):
-        optimizer = self.optimizer if optimizer is None else optimizer
-        lr_scheduler = super().create_scheduler(num_training_steps, optimizer)
-
-        if self.args.relora_steps:
-            warmup_steps = (
-                self.args.relora_warmup_steps if self.args.relora_warmup_steps else 10
-            )
-            self.lr_scheduler = ReLoRAScheduler(
-                optimizer,
-                lr_scheduler,
-                self.args.relora_steps,
-                warmup_steps,
-            )
-        else:
-            self.lr_scheduler = lr_scheduler
-
-        return self.lr_scheduler
-
-
 def add_position_ids(sample):
    sample["position_ids"] = torch.arange(len(sample["input_ids"]))
    return sample


 def drop_long_seq(sample, sequence_len=2048):
-    return len(sample["input_ids"]) <= sequence_len and len(sample["input_ids"]) > 0
+    return len(sample["input_ids"]) <= sequence_len


@contextmanager
@@ -376,17 +285,14 @@ def disable_datasets_caching():

 def process_datasets_for_packing(cfg, train_dataset, eval_dataset):
    drop_long = partial(drop_long_seq, sequence_len=cfg.sequence_len)
-    with zero_first(is_main_process()):
-        train_dataset = train_dataset.filter(drop_long, num_proc=os.cpu_count())
-        if eval_dataset:
-            eval_dataset = eval_dataset.filter(drop_long, num_proc=os.cpu_count())
+    train_dataset = train_dataset.filter(drop_long, num_proc=os.cpu_count())
+    if eval_dataset:
+        eval_dataset = eval_dataset.filter(drop_long, num_proc=os.cpu_count())

-        if cfg.sample_packing:
-            train_dataset = train_dataset.map(add_position_ids, num_proc=os.cpu_count())
-            if eval_dataset:
-                eval_dataset = eval_dataset.map(
-                    add_position_ids, num_proc=os.cpu_count()
-                )
+    if cfg.sample_packing:
+        train_dataset = train_dataset.map(add_position_ids, num_proc=os.cpu_count())
+        if eval_dataset:
+            eval_dataset = eval_dataset.map(add_position_ids, num_proc=os.cpu_count())
    return train_dataset, eval_dataset


@@ -405,16 +311,6 @@ def calculate_total_num_steps(cfg, train_dataset, tokenizer):
            LOG.info(f"📝 UPDATE CONFIG WITH: `total_num_tokens: {total_num_tokens}`")
            cfg.total_num_tokens = total_num_tokens

-        if not cfg.total_supervised_tokens:
-            total_supervised_tokens = (
-                train_dataset.data.column("labels")
-                .to_pandas()
-                .apply(lambda x: np.sum(np.array(x) != -100))
-                .sum()
-            )
-            LOG.info(f"`total_supervised_tokens: {total_supervised_tokens}`")
-            cfg.total_supervised_tokens = total_supervised_tokens
-
        if cfg.sample_packing_eff_est:
            total_num_steps = (
                # match count to len est in dataloader
@@ -518,7 +414,23 @@ def setup_trainer(cfg, train_dataset, eval_dataset, model, tokenizer, total_num_
        training_arguments_kwargs["seed"] = cfg.seed

    if cfg.gradient_checkpointing:
-        training_arguments_kwargs["gradient_checkpointing"] = cfg.gradient_checkpointing
+        if cfg.gptq:
+            from alpaca_lora_4bit.gradient_checkpointing import (
+                apply_gradient_checkpointing,
+            )
+
+            gradient_checkpointing_ratio = (
+                cfg.gradient_checkpointing_ratio
+                if cfg.gradient_checkpointing_ratio
+                else 1.0
+            )
+            apply_gradient_checkpointing(
+                model, checkpoint_ratio=gradient_checkpointing_ratio
+            )
+        else:
+            training_arguments_kwargs[
+                "gradient_checkpointing"
+            ] = cfg.gradient_checkpointing
    if cfg.fsdp:
        training_arguments_kwargs["fsdp"] = cfg.fsdp
        if cfg.fsdp_config:
@@ -572,24 +484,6 @@ def setup_trainer(cfg, train_dataset, eval_dataset, model, tokenizer, total_num_
            "steps" if cfg.save_steps else "epoch"
        )

-    if cfg.do_bench_eval:
-        training_arguments_kwargs["do_bench_eval"] = cfg.do_bench_eval
-        if cfg.bench_dataset:
-            training_arguments_kwargs["bench_dataset"] = cfg.bench_dataset
-    if cfg.metric_for_best_model:
-        training_arguments_kwargs["metric_for_best_model"] = cfg.metric_for_best_model
-    if cfg.greater_is_better:
-        training_arguments_kwargs["greater_is_better"] = cfg.greater_is_better
-
-    # DDP Config
-    if cfg.ddp_timeout:
-        training_arguments_kwargs["ddp_timeout"] = cfg.ddp_timeout
-    # see https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html
-    if cfg.ddp_bucket_cap_mb:
-        training_arguments_kwargs["ddp_bucket_cap_mb"] = cfg.ddp_bucket_cap_mb
-    if cfg.ddp_broadcast_buffers is not None:
-        training_arguments_kwargs["ddp_broadcast_buffers"] = cfg.ddp_broadcast_buffers
-
    training_args = AxolotlTrainingArguments(  # pylint: disable=unexpected-keyword-arg
        max_steps=total_num_steps if cfg.max_steps else -1,
        max_seq_length=cfg.sequence_len,
@@ -605,10 +499,11 @@ def setup_trainer(cfg, train_dataset, eval_dataset, model, tokenizer, total_num_
        output_dir=cfg.output_dir,
        save_total_limit=cfg.save_total_limit if cfg.save_total_limit else 4,
        load_best_model_at_end=(
-            (cfg.load_best_model_at_end is not False or cfg.early_stopping_patience)
+            cfg.load_best_model_at_end is not False
            and cfg.val_set_size > 0
            and cfg.save_steps
            and cfg.save_steps % cfg.eval_steps == 0
+            and cfg.load_in_8bit is not True
        )
        or False,
        ddp_find_unused_parameters=False if cfg.ddp else None,
@@ -622,8 +517,6 @@ def setup_trainer(cfg, train_dataset, eval_dataset, model, tokenizer, total_num_
        weight_decay=cfg.weight_decay if cfg.weight_decay is not None else 0.0,
        sample_packing=cfg.sample_packing if cfg.sample_packing else False,
        sample_packing_seq_len_multiplier=cfg.micro_batch_size,
-        relora_steps=cfg.relora_steps,
-        relora_warmup_steps=cfg.relora_warmup_steps,
        **training_arguments_kwargs,
    )

@@ -633,12 +526,75 @@ def setup_trainer(cfg, train_dataset, eval_dataset, model, tokenizer, total_num_
        if Path(cfg.torchdistx_path).exists():
            sys.path.append(cfg.torchdistx_path)
            importlib.import_module("torchdistx")
+    if (
+        cfg.optimizer == "adamw_bnb_8bit"
+        and not cfg.gptq
+        and "deepspeed" not in training_arguments_kwargs
+        and not cfg.fsdp
+    ):
+        decay_parameters = get_parameter_names(model, [nn.LayerNorm])
+        decay_parameters = [name for name in decay_parameters if "bias" not in name]
+        optimizer_grouped_parameters = [
+            {
+                "params": [
+                    p
+                    for n, p in model.named_parameters()
+                    if (n in decay_parameters and p.requires_grad)
+                ],
+                "weight_decay": training_args.weight_decay,
+            },
+            {
+                "params": [
+                    p
+                    for n, p in model.named_parameters()
+                    if (n not in decay_parameters and p.requires_grad)
+                ],
+                "weight_decay": 0.0,
+            },
+        ]
+
+        optimizer = bnb.optim.Adam8bit(
+            optimizer_grouped_parameters,
+            betas=(training_args.adam_beta1, training_args.adam_beta2),
+            eps=training_args.adam_epsilon,
+            lr=training_args.learning_rate,
+        )
+
+        if cfg.lr_scheduler == "one_cycle":
+            lr_scheduler_kwargs = (
+                cfg.lr_scheduler_kwargs if cfg.lr_scheduler_kwargs else {}
+            )
+            lr_scheduler = OneCycleLR(
+                optimizer,
+                cfg.learning_rate,
+                total_steps=total_num_steps,
+                epochs=cfg.num_epochs,
+                div_factor=cfg.lr_div_factor if cfg.lr_div_factor else 6,
+                **lr_scheduler_kwargs,
+            )
+        elif cfg.lr_scheduler == "log_sweep":
+            lr_scheduler = InterpolatingLogScheduler(
+                optimizer,
+                cfg.warmup_steps,
+                cfg.log_sweep_min_lr if cfg.log_sweep_min_lr else 1e-10,
+                cfg.log_sweep_max_lr if cfg.log_sweep_max_lr else 10,
+            )
+        else:
+            lr_scheduler = transformers.get_cosine_schedule_with_warmup(
+                optimizer,
+                training_args.warmup_steps,
+                total_num_steps,
+            )
+        trainer_kwargs["optimizers"] = (optimizer, lr_scheduler)

    callbacks = []
    callbacks.append(GPUStatsCallback(cfg))
-
-    if cfg.relora_steps:
-        callbacks.append(ReLoRACallback(cfg))
+    # TODO on_save callback to sync checkpoints to GCP/AWS in background
+    if cfg.early_stopping_patience:
+        early_stop_cb = EarlyStoppingCallback(
+            cfg.early_stopping_patience,
+        )
+        callbacks.append(early_stop_cb)

    if cfg.local_rank == 0 and cfg.adapter in [
        "lora",
@@ -650,12 +606,10 @@ def setup_trainer(cfg, train_dataset, eval_dataset, model, tokenizer, total_num_
        callbacks.append(SaveBetterTransformerModelCallback)

    data_collator_kwargs = {
-        "padding": True,  # True/"longest" is the default
+        "padding": True,
    }
-    if cfg.pad_to_sequence_len:
-        data_collator_kwargs["pad_to_multiple_of"] = 64 * math.ceil(
-            cfg.sequence_len / 64
-        )
+    if cfg.collator_pad_to_longest:
+        data_collator_kwargs["padding"] = "longest"
    else:
        # A100 is best at 64, while others at 8. Let's use the larger so we don't have to check
        # https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html
@@ -679,11 +633,11 @@ def setup_trainer(cfg, train_dataset, eval_dataset, model, tokenizer, total_num_
                num_proc=32,
            )

-    trainer_cls = AxolotlTrainer
-    if cfg.lr_scheduler == "one_cycle" and (cfg.fsdp or cfg.adapter == "qlora"):
-        trainer_cls = OneCycleLRSchedulerTrainer
-    elif cfg.relora_steps:
-        trainer_cls = ReLoRATrainer
+    trainer_cls = (
+        OneCycleLRSchedulerTrainer
+        if cfg.lr_scheduler == "one_cycle" and (cfg.fsdp or cfg.adapter == "qlora")
+        else AxolotlTrainer
+    )
    trainer = trainer_cls(
        model=model,
        train_dataset=train_dataset,
@@ -694,23 +648,8 @@ def setup_trainer(cfg, train_dataset, eval_dataset, model, tokenizer, total_num_
            return_tensors="pt",
            **data_collator_kwargs,
        ),
-        bench_data_collator=transformers.DataCollatorForSeq2Seq(
-            tokenizer,
-            return_tensors="pt",
-            **data_collator_kwargs,
-        ),
        callbacks=callbacks,
        **trainer_kwargs,
    )

-    if cfg.do_bench_eval:
-        trainer.add_callback(bench_eval_callback_factory(trainer, tokenizer))
-
-    # TODO on_save callback to sync checkpoints to GCP/AWS in background
-    if cfg.early_stopping_patience:
-        early_stop_cb = EarlyStoppingCallback(
-            cfg.early_stopping_patience,
-        )
-        trainer.add_callback(early_stop_cb)
-
    return trainer
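Review note on the drop_long_seq change in the trainer.py diff above: the removed version also rejected empty samples, while the added version only enforces the upper bound. The sketch below copies both predicates from the diff under hypothetical names (drop_long_seq_old, drop_long_seq_new) and applies them to made-up samples to show the difference.

```python
from functools import partial


def drop_long_seq_old(sample, sequence_len=2048):
    # Removed version: rejects over-long samples and empty samples.
    return len(sample["input_ids"]) <= sequence_len and len(sample["input_ids"]) > 0


def drop_long_seq_new(sample, sequence_len=2048):
    # Added version: only an upper bound on length.
    return len(sample["input_ids"]) <= sequence_len


# Illustrative data: an empty sample, a short one, and one that is too long.
samples = [
    {"input_ids": []},
    {"input_ids": [1, 2, 3]},
    {"input_ids": list(range(10))},
]

keep_old = partial(drop_long_seq_old, sequence_len=4)
keep_new = partial(drop_long_seq_new, sequence_len=4)

old_lengths = [len(s["input_ids"]) for s in samples if keep_old(s)]
new_lengths = [len(s["input_ids"]) for s in samples if keep_new(s)]
```

With the added version, zero-length samples pass the filter and reach add_position_ids and the packing code.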
--- a/tests/test_data.py
+++ b/tests/test_data.py
@@ -1,64 +0,0 @@
-"""
-test module for the axolotl.utis.data module
-"""
-import unittest
-
-from transformers import LlamaTokenizer
-
-from axolotl.utils.data import encode_pretraining, md5
-
-
-class TestEncodePretraining(unittest.TestCase):
-    """
-    test class for encode pretraining and md5 helper
-    """
-
-    def setUp(self):
-        self.tokenizer = LlamaTokenizer.from_pretrained("huggyllama/llama-7b")
-        self.tokenizer.add_special_tokens(
-            {
-                "eos_token": "</s>",
-                "bos_token": "<s>",
-                "unk_token": "<unk>",
-                "pad_token": "<pad>",
-            }
-        )
-        self.max_tokens = 15  # set a small number for easy inspection
-
-    def test_encode_pretraining(self):
-        examples = {
-            "text": [
-                "Hello, world!",
-                "Nice to meet you.",
-                "lorem ipsum dolor sit amet.",
-                "Nice to meet you again!.",
-                "hello, hello",
-            ]
-        }
-        result = encode_pretraining(self.tokenizer, self.max_tokens, examples)
-
-        self.assertEqual(len(result["input_ids"]), 3)
-
-        # Assert the length of input_ids and attention_mask is correct
-        self.assertEqual(len(result["input_ids"][0]), self.max_tokens)
-        self.assertEqual(len(result["attention_mask"][0]), self.max_tokens)
-
-        # Assert EOS and PAD tokens are correctly added
-        # hello world! is 4 tokens
-        self.assertEqual(result["input_ids"][0][0], self.tokenizer.bos_token_id)
-        self.assertEqual(result["input_ids"][0][5], self.tokenizer.eos_token_id)
-        self.assertEqual(result["input_ids"][0][6], self.tokenizer.pad_token_id)
-        # second part, 5 tokens
-        self.assertEqual(result["input_ids"][0][7], self.tokenizer.bos_token_id)
-        self.assertEqual(result["input_ids"][0][13], self.tokenizer.eos_token_id)
-        self.assertEqual(result["input_ids"][0][14], self.tokenizer.pad_token_id)
-
-    def test_md5(self):
-        self.assertEqual(md5("hello world"), "5eb63bbbe01eeed093cb22bb8f5acdc3")
-        self.assertEqual(
-            md5("hello world", "utf-8"), "5eb63bbbe01eeed093cb22bb8f5acdc3"
-        )
-
-
-if __name__ == "__main__":
-    unittest.main()
--- a/tests/test_validation.py
+++ b/tests/test_validation.py
@@ -328,20 +328,6 @@ class ValidationTest(unittest.TestCase):
                for record in self._caplog.records
            )

-        cfg = DictDefault(
-            {
-                "sample_packing": True,
-                "pad_to_sequence_len": None,
-            }
-        )
-        with self._caplog.at_level(logging.WARNING):
-            validate_config(cfg)
-            assert any(
-                "`pad_to_sequence_len: true` is recommended when using sample_packing"
-                in record.message
-                for record in self._caplog.records
-            )
-
        cfg = DictDefault(
            {
                "max_packed_sequence_len": 2048,
Author	SHA1	Message	Date
Wing Lian	9aaa4b8ced	set model merge dtype based on cfg	2023-08-23 04:07:44 -04:00
Wing Lian	8be7da8999	push merged lora to hf	2023-08-23 04:07:44 -04:00
Wing Lian	53e739f11e	deduplicate code	2023-08-23 04:07:44 -04:00
Wing Lian	f20c8deff1	merge lora on train completion	2023-08-23 04:07:44 -04:00