Compare commits
1 commit
llama-flas
...
20240307-u

| Author | SHA1 | Date |
|---|---|---|
|  | 3b432346e3 |  |
@@ -25,7 +25,7 @@ Features:
- [Environment](#environment)
- [Docker](#docker)
- [Conda/Pip venv](#condapip-venv)
-- [Cloud GPU](#cloud-gpu) - Latitude.sh, JarvisLabs, RunPod
+- [Cloud GPU](#cloud-gpu) - Latitude.sh, RunPod
- [Bare Metal Cloud GPU](#bare-metal-cloud-gpu)
- [Windows](#windows)
- [Launching on public clouds via SkyPilot](#launching-on-public-clouds-via-skypilot)
@@ -199,7 +199,6 @@ docker run --privileged --gpus '"all"' --shm-size 10g --rm -it --name axolotl --
For cloud GPU providers that support docker images, use [`winglian/axolotl-cloud:main-latest`](https://hub.docker.com/r/winglian/axolotl-cloud/tags)

- on Latitude.sh use this [direct link](https://latitude.sh/blueprint/989e0e79-3bf6-41ea-a46b-1f246e309d5c)
-- on JarvisLabs.ai use this [direct link](https://jarvislabs.ai/templates/axolotl)
- on RunPod use this [direct link](https://runpod.io/gsc?template=v2ickqhz9s&ref=6i7fkpdz)

#### Bare Metal Cloud GPU
@@ -1080,10 +1079,6 @@ fsdp_config:
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
```

-##### FSDP + QLoRA
-
-Axolotl supports training with FSDP and QLoRA, see [these docs](docs/fsdp_qlora.md) for more information.
-
##### Weights & Biases Logging

Make sure your `WANDB_API_KEY` environment variable is set (recommended) or you login to wandb with `wandb login`.
@@ -1303,6 +1298,4 @@ consider sponsoring the project via [GitHub Sponsors](https://github.com/sponsor

#### 🥉 Bronze Sponsors - $500/mo

-- [JarvisLabs.ai](https://jarvislabs.ai)
-
---
@@ -16,7 +16,6 @@
    "min_loss_scale": 1
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false

@@ -20,7 +20,6 @@
    "min_loss_scale": 1
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false

@@ -24,7 +24,6 @@
    "min_loss_scale": 1
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false

@@ -24,7 +24,6 @@
    "min_loss_scale": 1
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
@@ -1,37 +0,0 @@
# FSDP + QLoRA

## Background

Using FSDP with QLoRA is essential for **fine-tuning larger (70b+ parameter) LLMs on consumer GPUs.** For example, you can use FSDP + QLoRA to train a 70b model on two 24GB GPUs[^1].

Below, we describe how to use this feature in Axolotl.

## Usage

To enable `QLoRA` with `FSDP`, you need to perform the following steps:

> [!Tip]
> See the [example config](#example-config) file in addition to reading these instructions.

1. Set `adapter: qlora` in your axolotl config file.
2. Enable FSDP in your axolotl config, as [described here](https://github.com/OpenAccess-AI-Collective/axolotl?tab=readme-ov-file#fsdp).
3. Use one of the supported model types: `llama`, `mistral` or `mixtral`.

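The three steps above map onto a small set of config keys. A minimal sketch, shown as a Python dict for illustration (the real config is YAML; key names follow the example config in this compare, values are illustrative):

```python
# Minimal subset of an axolotl QLoRA + FSDP config, expressed as a Python
# dict purely for illustration; axolotl actually reads a YAML file.
qlora_fsdp_cfg = {
    "adapter": "qlora",  # step 1: use a QLoRA adapter
    "fsdp": ["full_shard"],  # step 2: enable FSDP
    "fsdp_config": {
        "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer",
    },
    "base_model": "NousResearch/Llama-2-7b-hf",  # step 3: a supported type (llama)
    "load_in_4bit": True,  # QLoRA quantizes the frozen base weights to 4-bit
}

SUPPORTED_MODEL_TYPES = {"llama", "mistral", "mixtral"}
assert qlora_fsdp_cfg["adapter"] == "qlora"
assert "full_shard" in qlora_fsdp_cfg["fsdp"]
```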
## Example Config

[examples/llama-2/qlora-fsdp.yml](../examples/llama-2/qlora-fsdp.yml) contains an example of how to enable QLoRA + FSDP in axolotl.

## References

- [PR #1378](https://github.com/OpenAccess-AI-Collective/axolotl/pull/1378) enabling QLoRA in FSDP in Axolotl.
- [Blog Post](https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html) from the [Answer.AI](https://www.answer.ai/) team describing the work that enabled QLoRA in FSDP.
- Related HuggingFace PRs enabling FSDP + QLoRA:
  - Accelerate [PR#2544](https://github.com/huggingface/accelerate/pull/2544)
  - Transformers [PR#29587](https://github.com/huggingface/transformers/pull/29587)
  - TRL [PR#1416](https://github.com/huggingface/trl/pull/1416)
  - PEFT [PR#1550](https://github.com/huggingface/peft/pull/1550)

[^1]: This was enabled by [this work](https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html) from the Answer.AI team.
@@ -21,7 +21,7 @@ lora_dropout: 0.05
lora_target_linear: true

sequence_len: 4096
-sample_packing: false
+sample_packing: true
pad_to_sequence_len: true

wandb_project:
@@ -1,70 +0,0 @@
base_model: NousResearch/Llama-2-7b-hf
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: yahma/alpaca-cleaned
    type: alpaca
dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: ./qlora-out

adapter: qlora
lora_model_dir:

sequence_len: 512
sample_packing: false
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 4
num_epochs: 4
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.00001

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
  - full_shard
fsdp_config:
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
special_tokens:
@@ -1,74 +0,0 @@
base_model: mistralai/Mixtral-8x7B-v0.1
model_type: AutoModelForCausalLM
tokenizer_type: LlamaTokenizer
trust_remote_code: true

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
dataset_prepared_path: last_run_prepared
val_set_size: 0.02
output_dir: ./qlora-out

model_config:
  output_router_logits: true

adapter: qlora
lora_model_dir:

sequence_len: 1024
sample_packing: false
pad_to_sequence_len: false

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 1
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

loss_watchdog_threshold: 5.0
loss_watchdog_patience: 3

warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128
saves_per_epoch: 1
debug:
weight_decay: 0.0
fsdp:
  - full_shard
fsdp_config:
  fsdp_transformer_layer_cls_to_wrap: MixtralSparseMoeBlock
special_tokens:
@@ -16,12 +16,12 @@ output_dir: ./qlora-out

## You can optionally freeze the entire model and unfreeze a subset of parameters
unfrozen_parameters:
-# - ^lm_head.weight$
-# - ^model.embed_tokens.weight$[:32000]
-# - model.layers.2[0-9]+.block_sparse_moe.gate
-# - model.layers.2[0-9]+.block_sparse_moe.experts
-# - model.layers.3[0-9]+.block_sparse_moe.gate
-# - model.layers.3[0-9]+.block_sparse_moe.experts
+# - lm_head.*
+# - model.embed_tokens.*
+# - model.layers.2[0-9]+.block_sparse_moe.gate.*
+# - model.layers.2[0-9]+.block_sparse_moe.experts.*
+# - model.layers.3[0-9]+.block_sparse_moe.gate.*
+# - model.layers.3[0-9]+.block_sparse_moe.experts.*

model_config:
  output_router_logits: true
@@ -3,7 +3,7 @@ packaging==23.2
peft==0.9.0
transformers==4.38.2
tokenizers==0.15.0
-bitsandbytes>=0.43.0
+bitsandbytes>=0.41.1
accelerate==0.26.1
deepspeed==0.13.1
pydantic==2.6.3
@@ -40,4 +40,3 @@ gcsfs
# adlfs

trl>=0.7.9
-fastcore>=1.5.29
@@ -1,55 +0,0 @@
"""module for building the auto wrap policy for FSDP"""
import functools

from peft import PrefixEncoder, PromptEmbedding, PromptEncoder
from torch.distributed.fsdp.wrap import (
    _or_policy,
    lambda_auto_wrap_policy,
    transformer_auto_wrap_policy,
)
from transformers.models.llama.modeling_llama import LlamaDecoderLayer
from transformers.models.mistral.modeling_mistral import MistralDecoderLayer
from transformers.models.mixtral.modeling_mixtral import MixtralDecoderLayer

SUPPORTED_AUTO_WRAP_MODEL_TYPES = [
    "llama",
    "mistral",
    "mixtral",
]


def get_wrapping_policy_factory(model_type):
    if model_type == "llama":
        layer_to_wrap = LlamaDecoderLayer
    elif model_type == "mistral":
        layer_to_wrap = MistralDecoderLayer
    elif model_type == "mixtral":
        layer_to_wrap = MixtralDecoderLayer

    def get_wrapping_policy():
        """This checks for lora layers (has weight and requires_grad)"""

        def lambda_policy_fn(module):
            return (
                len(list(module.named_children())) == 0
                and getattr(module, "weight", None) is not None
                and module.weight.requires_grad
            )

        lambda_policy = functools.partial(
            lambda_auto_wrap_policy, lambda_fn=lambda_policy_fn
        )
        transformer_layer_name = layer_to_wrap
        transformer_wrap_policy = functools.partial(
            transformer_auto_wrap_policy,
            transformer_layer_cls=(
                PrefixEncoder,
                PromptEncoder,
                PromptEmbedding,
                transformer_layer_name,
            ),
        )
        policies = [lambda_policy, transformer_wrap_policy]
        return functools.partial(_or_policy, policies=policies)

    return get_wrapping_policy
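The factory above composes two predicates via `_or_policy`: wrap transformer blocks, and wrap any leaf module that owns a trainable weight (the LoRA layers). A torch-free sketch of that predicate logic, using hypothetical stand-in module classes:

```python
import functools

# Hypothetical stand-ins for torch modules, just to exercise the predicate.
class FakeParam:
    def __init__(self, requires_grad):
        self.requires_grad = requires_grad

class FakeModule:
    def __init__(self, weight=None, children=()):
        self.weight = weight
        self._children = list(children)

    def named_children(self):
        return [(str(i), child) for i, child in enumerate(self._children)]

def lambda_policy_fn(module):
    # Same check as in the module above: a leaf module with a trainable weight.
    return (
        len(list(module.named_children())) == 0
        and getattr(module, "weight", None) is not None
        and module.weight.requires_grad
    )

def or_policy(module, policies):
    # Mirrors the spirit of torch's _or_policy: wrap if any sub-policy matches.
    return any(policy(module) for policy in policies)

policy = functools.partial(or_policy, policies=[lambda_policy_fn])

lora_leaf = FakeModule(weight=FakeParam(requires_grad=True))
frozen_leaf = FakeModule(weight=FakeParam(requires_grad=False))
parent = FakeModule(children=[lora_leaf])

assert policy(lora_leaf)        # wrapped: trainable leaf
assert not policy(frozen_leaf)  # skipped: weight is frozen
assert not policy(parent)       # skipped: not a leaf
```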
@@ -8,7 +8,6 @@ import importlib
import importlib.util
import logging
import math
import os
import sys
from abc import abstractmethod
from dataclasses import dataclass, field
@@ -18,10 +17,7 @@ from typing import List, Optional, Type, Union

import torch
import transformers
-from accelerate import FullyShardedDataParallelPlugin
-from accelerate.utils import str_to_bool
from datasets import Dataset
-from torch.distributed.fsdp import MixedPrecision
from torch.optim.lr_scheduler import OneCycleLR
from torch.utils.data import BatchSampler, DataLoader, RandomSampler, SequentialSampler
from transformers import (
@@ -34,7 +30,6 @@ from transformers.trainer_utils import seed_worker
from transformers.utils import is_sagemaker_mp_enabled
from trl import DPOTrainer

-from axolotl.core.policies.auto_wrap import get_wrapping_policy_factory
from axolotl.loraplus import create_loraplus_optimizer
from axolotl.monkeypatch.multipack import SUPPORTED_MULTIPACK_MODEL_TYPES
from axolotl.monkeypatch.relora import ReLoRACallback, ReLoRAScheduler
@@ -196,10 +191,6 @@ class AxolotlTrainingArguments(TrainingArguments):
        default=1e-6,
        metadata={"help": "loraplus learning rate for lora embedding layers."},
    )
-    qlora: bool = field(
-        default=False,
-        metadata={"help": "whether this is a qlora training"},
-    )


class AxolotlTrainer(Trainer):
@@ -477,56 +468,6 @@ class AxolotlTrainer(Trainer):

        return super().push_to_hub(*args, **kwargs)

-    @wraps(Trainer.create_accelerator_and_postprocess)
-    def create_accelerator_and_postprocess(self):
-        rank = int(os.environ.get("LOCAL_RANK", 0))
-        res = super().create_accelerator_and_postprocess()
-
-        if self.args.qlora is False:
-            return res
-
-        # the rest of this method override is specific to fsdp + qlora (for now)
-        sync_module_states = (
-            str_to_bool(os.environ.get("FSDP_SYNC_MODULE_STATES", "True")) == 1
-        )
-
-        mp_policy = None
-        amp = os.environ["ACCELERATE_MIXED_PRECISION"]
-        if amp == "fp16":
-            mp_policy = MixedPrecision(
-                param_dtype=torch.float32,
-                reduce_dtype=torch.float32,
-                buffer_dtype=torch.float32,
-            )
-        elif amp == "bf16":
-            mp_policy = MixedPrecision(
-                param_dtype=torch.float32,
-                reduce_dtype=torch.float32,
-                buffer_dtype=torch.float32,
-            )
-
-        # If somehow we figure out how we want to parameterize we want to autocast buffers...
-        # mp_policy = MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16, buffer_dtype=torch.float32)
-        # load_param_skip_names = ['inv_freq']
-
-        if self.is_fsdp_enabled:
-            wrapping_policy = get_wrapping_policy_factory(self.args.model_type)
-            fsdp_plugin = FullyShardedDataParallelPlugin(
-                auto_wrap_policy=wrapping_policy(),
-                cpu_offload=False,
-                use_orig_params=False,
-                limit_all_gathers=True,
-                param_init_fn=lambda module: module.to_empty(
-                    device=torch.device("cuda"), recurse=False
-                )
-                if (rank != 0 and sync_module_states)
-                else None,
-                mixed_precision_policy=mp_policy,
-            )
-            self.accelerator.state.fsdp_plugin = fsdp_plugin
-
-        return res
-

class AxolotlMambaTrainer(AxolotlTrainer):
    """
@@ -800,7 +741,6 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
        return AxolotlTrainer

    def build(self, total_num_steps):
        warmup_steps = None
        if self.cfg.warmup_steps is not None:
            warmup_steps = self.cfg.warmup_steps
        elif self.cfg.warmup_ratio is not None:
@@ -846,9 +786,6 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
        if self.cfg.fsdp_config:
            training_arguments_kwargs["fsdp_config"] = dict(self.cfg.fsdp_config)

-        if self.cfg.adapter == "qlora":
-            training_arguments_kwargs["qlora"] = True
-
        # deepspeed
        if self.cfg.deepspeed:
            training_arguments_kwargs["deepspeed"] = self.cfg.deepspeed
@@ -30,7 +30,6 @@ class ColorfulFormatter(Formatter):

DEFAULT_LOGGING_CONFIG: Dict[str, Any] = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "simple": {
            "format": "[%(asctime)s] [%(levelname)s] [%(name)s.%(funcName)s:%(lineno)d] [PID:%(process)d] %(message)s",
@@ -245,6 +245,7 @@ def flashattn_forward_with_s2attn(
    past_key_value: Optional[Tuple[torch.Tensor]] = None,
    output_attentions: bool = False,
    use_cache: bool = False,
    padding_mask: Optional[torch.LongTensor] = None,  # pylint: disable=unused-argument
    cu_seqlens: Optional[torch.Tensor] = None,  # pylint: disable=unused-argument
    max_seqlen: Optional[torch.Tensor] = None,  # pylint: disable=unused-argument
) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
@@ -373,6 +374,7 @@ def flashattn_forward(
    past_key_value: Optional[Tuple[torch.Tensor]] = None,
    output_attentions: bool = False,
    use_cache: bool = False,
    padding_mask: Optional[torch.LongTensor] = None,  # pylint: disable=unused-argument
    cu_seqlens: Optional[torch.Tensor] = None,
    max_seqlen: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
@@ -768,6 +770,12 @@ def llama_model_forward(
                dtype=torch.bool,
                device=inputs_embeds.device,
            )
            padding_mask = None
        else:
            if 0 in attention_mask:
                padding_mask = attention_mask
            else:
                padding_mask = None

        attention_mask = (
            self._prepare_decoder_attention_mask(  # pylint: disable=protected-access
@@ -817,6 +825,7 @@ def llama_model_forward(
                past_key_value,
                output_attentions,
                None,
                padding_mask,
                cu_seqlens,
                max_seqlen,
            )
@@ -828,6 +837,7 @@ def llama_model_forward(
                past_key_value=past_key_value,
                output_attentions=output_attentions,
                use_cache=use_cache,
                padding_mask=padding_mask,
                cu_seqlens=cu_seqlens,
                max_seqlen=max_seqlen,
            )
@@ -874,6 +884,7 @@ class LlamaDecoderLayer(OriginalLlamaDecoderLayer):
        past_key_value: Optional[Tuple[torch.Tensor]] = None,
        output_attentions: Optional[bool] = False,
        use_cache: Optional[bool] = False,
        padding_mask: Optional[torch.LongTensor] = None,
        cu_seqlens: Optional[torch.Tensor] = None,
        max_seqlen: Optional[torch.Tensor] = None,
    ) -> Tuple[
@@ -906,6 +917,7 @@ class LlamaDecoderLayer(OriginalLlamaDecoderLayer):
            past_key_value=past_key_value,
            output_attentions=output_attentions,
            use_cache=use_cache,
            padding_mask=padding_mask,
            cu_seqlens=cu_seqlens,
            max_seqlen=max_seqlen,
        )
@@ -1,9 +1,6 @@
"""multipack patching for v2 of sample packing"""
-import importlib

import transformers
-from accelerate import init_empty_weights
-from transformers import AutoConfig, AutoModelForCausalLM
from transformers.integrations import is_deepspeed_zero3_enabled

from axolotl.monkeypatch.mixtral import patch_mixtral_moe_forward_zero3
@@ -15,12 +12,11 @@ SUPPORTED_MULTIPACK_MODEL_TYPES = [
    "falcon",
    "phi",
    "gemma",
-    "gemmoe",
    "starcoder2",
]


-def patch_for_multipack(model_type, model_name=None):
+def patch_for_multipack(model_type):
    if model_type == "mixtral":
        transformers.models.mixtral.modeling_mixtral._get_unpad_data = (  # pylint: disable=protected-access
            get_unpad_data
@@ -47,15 +43,3 @@ def patch_for_multipack(model_type, model_name=None):
        transformers.models.starcoder2.modeling_starcoder2._get_unpad_data = (  # pylint: disable=protected-access
            get_unpad_data
        )
-    elif model_type == "gemmoe":
-        model_config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
-        # we need to load the model here in order for modeling_gemmoe to be available
-        with init_empty_weights():
-            AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
-        module_name = model_config.__class__.__module__.replace(
-            ".configuration_gemmoe", ".modeling_gemmoe"
-        )
-        modeling_gemmoe = importlib.import_module(module_name)
-        modeling_gemmoe._get_unpad_data = (  # pylint: disable=protected-access
-            get_unpad_data
-        )
@@ -24,25 +24,6 @@ def argilla(
    return transform_fn


-def argilla_chat(
-    cfg,
-    **kwargs,
-):  # pylint: disable=possibly-unused-variable,unused-argument
-    """
-    for argilla/dpo-mix-7k conversations
-    """
-
-    def transform_fn(sample):
-        sample[
-            "prompt"
-        ] = f"<|im_start|>user\n{sample['chosen'][0]['content']}<|im_end|>\n<|im_start|>assistant\n"
-        sample["chosen"] = f"{sample['chosen'][1]['content']}<|im_end|>"
-        sample["rejected"] = f"{sample['rejected'][1]['content']}<|im_end|>"
-        return sample
-
-    return transform_fn
-
-
def icr(
    cfg,
    **kwargs,
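The `transform_fn` inside `argilla_chat` flattens a two-turn DPO sample into ChatML-formatted `prompt`/`chosen`/`rejected` strings. A self-contained sketch of that mapping, using a hypothetical sample in the assumed argilla/dpo-mix-7k layout:

```python
# Hypothetical sample in the argilla/dpo-mix-7k conversation layout.
sample = {
    "chosen": [
        {"role": "user", "content": "What is 2+2?"},
        {"role": "assistant", "content": "4"},
    ],
    "rejected": [
        {"role": "user", "content": "What is 2+2?"},
        {"role": "assistant", "content": "5"},
    ],
}

def transform_fn(sample):
    # The prompt comes from the first (user) turn of the chosen conversation.
    sample["prompt"] = (
        f"<|im_start|>user\n{sample['chosen'][0]['content']}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    # Responses come from the second (assistant) turn of each conversation.
    sample["chosen"] = f"{sample['chosen'][1]['content']}<|im_end|>"
    sample["rejected"] = f"{sample['rejected'][1]['content']}<|im_end|>"
    return sample

out = transform_fn(sample)
assert out["chosen"] == "4<|im_end|>"
assert out["rejected"] == "5<|im_end|>"
assert out["prompt"].startswith("<|im_start|>user\nWhat is 2+2?")
```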
@@ -1,15 +1,10 @@
"""Module containing the SimpleShareGPTPromptTokenizingStrategy class"""

from typing import Any, Dict, Optional

from fastchat.conversation import Conversation, SeparatorStyle, register_conv_template

from axolotl.prompt_tokenizers import ShareGPTPromptTokenizingStrategy
from axolotl.prompters import ShareGPTPrompterV2
-from axolotl.utils.tokenization import (
-    chatml_to_conversation,
-    merge_consecutive_messages,
-)


def register_chatml_template(system_message=None):
@@ -24,16 +19,6 @@ def register_chatml_template(system_message=None):
            sep="<|im_end|>",
        )
    )
-    register_conv_template(
-        Conversation(
-            name="chatml_glaive",
-            system_template="<|im_start|>system\n{system_message}",
-            system_message=system_message,
-            roles=["<|im_start|>user", "<|im_start|>assistant", "<|im_start|>tool"],
-            sep_style=SeparatorStyle.CHATML,
-            sep="<|im_end|>",
-        )
-    )


def load(tokenizer, cfg, ds_cfg: Optional[Dict[str, Any]] = None):
@@ -92,20 +77,6 @@ def load_guanaco(tokenizer, cfg):
    )


-def load_glaive(tokenizer, cfg, ds_cfg: Optional[Dict[str, Any]] = None):
-    conversation = (
-        ds_cfg["conversation"]
-        if ds_cfg and "conversation" in ds_cfg
-        else "chatml_glaive"
-    )
-    return GlaiveShareGPTPromptTokenizingStrategy(
-        ShareGPTPrompterV2(conversation=conversation),
-        tokenizer,
-        cfg.train_on_inputs,
-        cfg.sequence_len,
-    )
-
-
class SimpleShareGPTPromptTokenizingStrategy(ShareGPTPromptTokenizingStrategy):
    """
    basic sharegpt strategy to grab conversations from the sample row
@@ -187,15 +158,3 @@ class UltrachatShareGPTPromptTokenizingStrategy(SimpleShareGPTPromptTokenizingSt
            {"from": role_map[t["role"]], "value": t["content"]} for t in conversations
        ]
        return turns
-
-
-class GlaiveShareGPTPromptTokenizingStrategy(SimpleShareGPTPromptTokenizingStrategy):
-    """
-    sharegpt strategy that remaps glaive data to sharegpt format
-    """
-
-    def get_conversation_thread(self, prompt):
-        conversation = chatml_to_conversation(prompt)
-        conversation = merge_consecutive_messages(conversation)
-
-        return conversation
@@ -360,19 +360,11 @@ class ShareGPTPromptTokenizingStrategy(PromptTokenizingStrategy):
                LOG.warning(f"expected tuple, got {part}")
                continue

-            tool_role_label = None
-            if len(conversation.roles) == 3:
-                (
-                    user_role_label,
-                    assistant_role_label,
-                    tool_role_label,
-                ) = conversation.roles
-            else:
-                user_role_label, assistant_role_label = conversation.roles
+            user, assistant = conversation.roles
            role, content = part

            # Uses "in" because role contains extra characters
-            if user_role_label in role:
+            if user in role:
                role = (
                    role.replace(role_remap[0]["from"], role_remap[0]["to"])
                    if role_remap
@@ -392,7 +384,7 @@ class ShareGPTPromptTokenizingStrategy(PromptTokenizingStrategy):
                else:
                    # everything from this is masked out from the labels
                    labels = [IGNORE_TOKEN_ID] * len(res["input_ids"])
-            elif assistant_role_label in role:
+            elif assistant in role:
                role = (
                    role.replace(role_remap[1]["from"], role_remap[1]["to"])
                    if role_remap
@@ -434,8 +426,6 @@ class ShareGPTPromptTokenizingStrategy(PromptTokenizingStrategy):
                else:
                    # everything from this is masked out from the labels
                    labels = [IGNORE_TOKEN_ID] * len(res["input_ids"])
-            elif tool_role_label and tool_role_label in role:
-                labels = [IGNORE_TOKEN_ID] * len(res["input_ids"])
            else:
                LOG.warning(f"unhandled role: {role}")
                continue
@@ -267,8 +267,6 @@ class ShareGPTPrompter(Prompter):  # pylint: disable=too-few-public-methods

    role_key_human = "human"
    role_key_model = "gpt"
-    # Optional, only used for tool usage datasets.
-    role_key_tool = None

    def __init__(
        self,
@@ -276,7 +274,6 @@ class ShareGPTPrompter(Prompter):  # pylint: disable=too-few-public-methods
        conversation: Optional[Union[str, Conversation]] = None,
        role_key_human: Optional[str] = None,
        role_key_model: Optional[str] = None,
-        role_key_tool: Optional[str] = None,
    ):
        if conversation:
            if isinstance(conversation, Conversation):
@@ -289,8 +286,6 @@ class ShareGPTPrompter(Prompter):  # pylint: disable=too-few-public-methods
            self.role_key_human = role_key_human
        if role_key_model:
            self.role_key_model = role_key_model
-        if role_key_tool:
-            self.role_key_tool = role_key_tool

    def _build_result(self, source):
        if len(source) < 2:
@@ -308,8 +303,6 @@ class ShareGPTPrompter(Prompter):  # pylint: disable=too-few-public-methods
        source.pop(0)

        roles = {self.role_key_human: conv.roles[0], self.role_key_model: conv.roles[1]}
-        if self.role_key_tool:
-            roles[self.role_key_tool] = conv.roles[2]

        try:
            # Apply prompt templates
@@ -11,7 +11,7 @@ import torch
import transformers.modelcard
from accelerate.logging import get_logger
from datasets import Dataset
-from peft import PeftModel
+from peft import PeftModel, PeftModelForCausalLM
from pkg_resources import get_distribution  # type: ignore
from transformers import PreTrainedModel, PreTrainedTokenizer
from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled
@@ -19,7 +19,7 @@ from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled
from axolotl.common.cli import TrainerCliArgs
from axolotl.logging_config import configure_logging
from axolotl.utils.dict import DictDefault
-from axolotl.utils.freeze import freeze_layers_except
+from axolotl.utils.freeze import freeze_parameters_except
from axolotl.utils.models import load_model, load_tokenizer
from axolotl.utils.trainer import setup_trainer

@@ -99,7 +99,7 @@ def train(
    safe_serialization = cfg.save_safetensors is True

    if cfg.unfrozen_parameters:
-        freeze_layers_except(model, cfg.unfrozen_parameters)
+        freeze_parameters_except(model, cfg.unfrozen_parameters)

    trainer = setup_trainer(
        cfg,
@@ -207,6 +207,20 @@ def train(

    model.save_pretrained(cfg.output_dir, safe_serialization=safe_serialization)

+    if cfg.adapter and isinstance(model, (PeftModel, PeftModelForCausalLM)):
+        model.to("cpu")
+        model = model.merge_and_unload()
+
+        if cfg.local_rank == 0:
+            LOG.info(f"saving merged model to: {str(Path(cfg.output_dir) / 'merged')}")
+            model.save_pretrained(
+                str(Path(cfg.output_dir) / "merged"),
+                safe_serialization=safe_serialization,
+                progressbar=True,
+            )
+            tokenizer.save_pretrained(str(Path(cfg.output_dir) / "merged"))
+
    if not cfg.hub_model_id:
        try:
            trainer.create_model_card(model_name=cfg.output_dir.lstrip("./"))
@@ -24,9 +24,9 @@ def check_cuda_device(default_value):
            or not torch.cuda.is_available()
            or device == "auto"
            or torch.device(device).type == "cpu"
            or torch.device(device).type == "meta"
        ):
            return default_value

        return func(*args, **kwargs)

    return wrapper
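`check_cuda_device` is a decorator factory that short-circuits to a default value when the requested device cannot support the wrapped CUDA query. A torch-free sketch of the same guard pattern (names and the device checks are illustrative stand-ins):

```python
import functools

def check_device(default_value):
    # Return `default_value` instead of calling `func` when the device
    # argument cannot support the call (stand-in for the CUDA checks above).
    def decorator(func):
        @functools.wraps(func)
        def wrapper(device, *args, **kwargs):
            if device in (None, "auto", "cpu", "meta"):
                return default_value
            return func(device, *args, **kwargs)
        return wrapper
    return decorator

@check_device(default_value=0)
def gpu_memory_gib(device):
    return 24  # stand-in for an actual CUDA memory query

assert gpu_memory_gib("cpu") == 0      # guarded: falls back to the default
assert gpu_memory_gib("cuda:0") == 24  # passes through to the real query
```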
@@ -1,7 +1,6 @@
"""
Module for pydantic models for configuration
"""
# pylint: disable=too-many-lines

import logging
import os
@@ -129,10 +128,8 @@ class RLType(str, Enum):
class ChatTemplate(str, Enum):
    """Chat templates configuration subset"""

    alpaca = "alpaca"  # pylint: disable=invalid-name
    chatml = "chatml"  # pylint: disable=invalid-name
    inst = "inst"  # pylint: disable=invalid-name
    gemma = "gemma"  # pylint: disable=invalid-name


class LoftQConfig(BaseModel):
@@ -182,7 +179,6 @@ class LoraConfig(BaseModel):
    peft_layers_to_transform: Optional[List[int]] = None
    peft: Optional[PeftConfig] = None
    peft_use_dora: Optional[bool] = None
    peft_use_relora: Optional[bool] = None

    lora_on_cpu: Optional[bool] = None
    gptq: Optional[bool] = None
@@ -515,12 +511,10 @@ class AxolotlInputConfig(

    neftune_noise_alpha: Optional[float] = None

-    max_memory: Optional[
-        Dict[Union[int, Literal["cpu", "disk"]], Union[int, str]]
-    ] = None
+    max_memory: Optional[Union[int, str]] = None
    gpu_memory_limit: Optional[Union[int, str]] = None

-    chat_template: Optional[ChatTemplate] = None
+    chat_template: Optional[Union[Literal["chatml", "inst"], ChatTemplate]] = None
    default_system_message: Optional[str] = None

    # INTERNALS - document for now, generally not set externally
@@ -995,10 +989,3 @@ class AxolotlConfigWCapabilities(AxolotlInputConfig):
        )

        return data
-
-    @model_validator(mode="before")
-    @classmethod
-    def check_fsdp_deepspeed(cls, data):
-        if data.get("deepspeed") and data.get("fsdp"):
-            raise ValueError("deepspeed and fsdp cannot be used together.")
-        return data
@@ -114,7 +114,9 @@ def prepare_dataset(cfg, tokenizer):
        total_eval_steps = calculate_total_num_steps(cfg, eval_dataset, update=False)
        if total_eval_steps == 0:
            raise ValueError(
-                "eval dataset split is too small for sample_packing. You should set `eval_sample_packing: False`. "
+                "eval dataset split is too small for sample_packing. "
+                "You should set `eval_sample_packing: False` "
+                "or decrease the value of `eval_batch_size`. "
            )

    if cfg.max_steps:
@@ -3,14 +3,13 @@ module to freeze/unfreeze parameters by name
"""
import logging
import re
from typing import Callable, List, Tuple

from axolotl.utils.distributed import is_main_process

LOG = logging.getLogger("axolotl.utils.freeze")


def freeze_layers_except(model, regex_patterns):
def freeze_parameters_except(model, regex_patterns):
    """
    Freezes all layers of the given model except for the layers that match given regex patterns.
    Periods in the patterns are treated as literal periods, not as wildcard characters.
@@ -18,209 +17,22 @@ def freeze_layers_except(model, regex_patterns):
    Parameters:
    - model (nn.Module): The PyTorch model to be modified.
    - regex_patterns (list of str): List of regex patterns to match layer names to keep unfrozen.
      Note that you cannot use a dot as a wildcard character in the patterns since it is reserved for separating layer names.
      Also, to match the entire layer name, the pattern should start with "^" and end with "$", otherwise it will match any part of the layer name.
      The range pattern part is optional and it is not compiled as a regex pattern which means you must put "$" before the range pattern if you want to match the entire layer name.
      E.g., ["^model.embed_tokens.weight$[:32000]", "layers.2[0-9]+.block_sparse_moe.gate.[a-z]+$"]

    Returns:
    None; the model is modified in place.
    """
    if isinstance(regex_patterns, str):
        regex_patterns = [regex_patterns]
    # Escape periods and compile the regex patterns
    compiled_patterns = [
        re.compile(pattern.replace(".", "\\.")) for pattern in regex_patterns
    ]

    patterns = [LayerNamePattern(pattern) for pattern in regex_patterns]
    # First, freeze all parameters in the model
    for param in model.parameters():
        param.requires_grad = False

    # Unfreeze layers that match the regex patterns
    for name, param in model.named_parameters():
        param.requires_grad = False
        unfrozen_ranges = []
        for pattern in patterns:
            if not pattern.match(name):
                continue

        if any(pattern.match(name) for pattern in compiled_patterns):
            if is_main_process():
                LOG.debug(f"unfreezing {name}")
            param.requires_grad = True

            if pattern.range is not None:
                unfrozen_ranges.append(pattern.range)

        merged_unfrozen_ranges = _merge_ranges(unfrozen_ranges, len(param))

        if param.requires_grad and is_main_process():
            unfrozen_ranges = (
                f" with ranges {merged_unfrozen_ranges}"
                if merged_unfrozen_ranges
                else ""
            )
            LOG.debug(f"Unfrozen {name}{unfrozen_ranges}")

        if not merged_unfrozen_ranges:
            continue

        # The range list we need is actually the inverted of the merged ranges
        ranges_to_freeze = _invert_ranges(merged_unfrozen_ranges, len(param))

        param.register_hook(_create_freeze_parameters_hook(ranges_to_freeze))

    if is_main_process() and all(
        not param.requires_grad for param in model.parameters()
    ):
        LOG.warning("All parameters are frozen. Model will not be trained.")


def _invert_ranges(
    given_ranges: List[Tuple[int, int]], layer_size: int
) -> List[Tuple[int, int]]:
    """
    Inverts a list of ranges to obtain the ranges not covered by the given ranges.

    Parameters:
    - given_ranges (List[Tuple[int, int]]): List of ranges to invert. Each range is represented as a tuple of start (inclusive) and end (exclusive) indices.
    - layer_size (int): The length of the layer. E.g., len(model.layer.weight)
    Returns:
    - List[Tuple[int, int]]: List of inverted ranges, where each range is represented as a tuple of start (inclusive) and end (exclusive) indices.
    """
    if not given_ranges:
        return [(0, layer_size)]

    inverted_ranges = []
    current_start = 0

    for start, end in sorted(given_ranges):
        if start > current_start:
            inverted_ranges.append((current_start, start))
        current_start = max(current_start, end)

    # Handle the case where the last given range does not reach the end of the total_size
    if current_start < layer_size:
        inverted_ranges.append((current_start, layer_size))

    return inverted_ranges


def _merge_ranges(
    given_ranges: List[Tuple[int, int | None]], layer_size: int
) -> List[Tuple[int, int]]:
    """
    Merges overlapping ranges and sorts the given ranges.

    This function takes a list of ranges and merges any overlapping ranges. The ranges are represented
    as tuples, where the first element is the start index (inclusive) and the second element is the end
    index (exclusive). The end index can be None, indicating that the range extends to the end of the
    sequence.

    Parameters:
    - given_ranges (List[Tuple[int, int | None]]): List of ranges to merge.
    - layer_size (int): The length of the layer. E.g., len(model.layer.weight)

    Returns:
    - List[Tuple[int, int]]: List of merged ranges, as start (inclusive) and end (exclusive) indices.
    """
    # End of each range can be determined now since we have the total size
    processed_ranges = [
        (start, end if end is not None else layer_size) for start, end in given_ranges
    ]

    # No need to merge if there's only one or no ranges
    if len(processed_ranges) <= 1:
        return processed_ranges

    sorted_ranges = sorted(processed_ranges)

    merged_ranges = [sorted_ranges[0]]
    for start, end in sorted_ranges[1:]:
        prev_start, prev_end = merged_ranges[-1]
        if start <= prev_end:
            merged_ranges[-1] = (prev_start, max(prev_end, end))
        else:
            merged_ranges.append((start, end))

    return merged_ranges


def _create_freeze_parameters_hook(ranges_to_freeze: List[Tuple[int, int]]) -> Callable:
    """
    Create a hook to freeze parameters in specified ranges by setting their gradients to zero.

    This function takes a list of tuples representing the ranges of indices to freeze. Each tuple should contain
    two integers representing the start and end indices of the range.

    Parameters:
    - ranges_to_freeze (List[Tuple[int, int]]): Ranges of indices to freeze.

    Returns:
    - Callable: A hook function to be used with `register_hook` on parameters.

    Example usage:
    ranges_to_freeze = [(0, 10), (20, 30)]
    hook = _create_freeze_parameters_hook(ranges_to_freeze)
    model.register_hook(hook)
    """

    def freeze_parameters_hook(gradients):
        for start, end in ranges_to_freeze:
            gradients[start:end].zero_()

    return freeze_parameters_hook


class LayerNamePattern:
    """
    Represents a regex pattern for layer names, potentially including a parameter index range.
    """

    def __init__(self, pattern: str):
        """
        Initializes a new instance of the LayerNamePattern class.

        Parameters:
        - pattern (str): The regex pattern for layer names, potentially including a parameter index range.
        """
        self.raw_pattern = pattern
        name_pattern, self.range = self._parse_pattern(pattern)
        self.name_regex = re.compile(name_pattern.replace(".", "\\."))

    def match(self, name: str) -> bool:
        """
        Checks if the given layer name matches the regex pattern.

        Parameters:
        - name (str): The layer name to check.

        Returns:
        - bool: True if the layer name matches the pattern, False otherwise.
        """
        return self.name_regex.match(name) is not None

    def _parse_pattern(self, pattern: str) -> Tuple[str, Tuple[int, int | None] | None]:
        """
        Extracts the range pattern from the given pattern.

        Parameters:
        - pattern (str): The pattern to extract the range from.

        Returns:
        - Tuple[str, Tuple[int, int | None] | None]: A tuple containing the regex pattern to match the layer name without the range pattern and the range of layer indices to match, if specified.
        """
        match = re.match(r"^(.+)\[([0-9]*)(?::([0-9]*))?\]$", pattern)
        if not match:
            return pattern, None

        base_pattern, start_part, end_part = match.groups()

        if end_part is None and start_part.isdecimal():
            index = int(start_part)
            return base_pattern, (index, index + 1)

        # [:end] or [start:] or [start:end]
        start = int(start_part) if start_part else 0
        end = int(end_part) if end_part else None

        if end is not None and start >= end:
            raise ValueError(
                f"Invalid range in layer name pattern: {pattern}."
                "End of range must be greater than start."
            )
        return base_pattern, (start, end)
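The range bookkeeping in the freeze module is small enough to sketch standalone. The helpers below mirror the documented semantics of `_merge_ranges` and `_invert_ranges` from the diff (names shortened; a sketch, not the project's code): a pattern like `features.layer[1:5]` keeps rows 1–4 trainable, and the gradient hook zeroes everything in the complement of the merged ranges.

```python
from typing import List, Optional, Tuple


def merge_ranges(ranges: List[Tuple[int, Optional[int]]], size: int) -> List[Tuple[int, int]]:
    # Resolve open-ended ranges (end=None means "to the end"), then merge overlaps after sorting.
    resolved = sorted((s, e if e is not None else size) for s, e in ranges)
    merged: List[Tuple[int, int]] = []
    for start, end in resolved:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def invert_ranges(ranges: List[Tuple[int, int]], size: int) -> List[Tuple[int, int]]:
    # Complement of the (merged, sorted) unfrozen ranges within [0, size):
    # these are the rows whose gradients the hook must zero out.
    inverted: List[Tuple[int, int]] = []
    cursor = 0
    for start, end in ranges:
        if start > cursor:
            inverted.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < size:
        inverted.append((cursor, size))
    return inverted


unfrozen = merge_ranges([(1, 5)], size=10)   # -> [(1, 5)]
frozen = invert_ranges(unfrozen, size=10)    # -> [(0, 1), (5, 10)]
```

This matches the `features.layer[1:5]` test case in the deleted test file, where rows 1–4 receive gradients and rows 0 and 5–9 get zeroed.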
@@ -1,20 +1,13 @@
"""Module for models and model loading"""
# pylint: disable=too-many-lines

import logging
import math
import os
import types
from typing import Any, Dict, List, Optional, Tuple, Type, Union  # noqa: F401
from typing import Any, Dict, Optional, Tuple, Union  # noqa: F401

import addict
import bitsandbytes as bnb
import safetensors
import torch
import transformers
from accelerate import init_empty_weights
from bitsandbytes.nn import Linear4bit, Params4bit
from fastcore.parallel import parallel
from peft import (
    LoftQConfig,
    PeftConfig,
@@ -23,7 +16,6 @@ from peft import (
    prepare_model_for_kbit_training,
)
from peft.tuners.lora import QuantLinear
from torch import Tensor, nn
from transformers import (  # noqa: F401
    AddedToken,
    AutoConfig,
@@ -35,9 +27,7 @@ from transformers import (  # noqa: F401
    PreTrainedTokenizerBase,
)
from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled
from transformers.utils import SAFE_WEIGHTS_INDEX_NAME, SAFE_WEIGHTS_NAME, hub

from axolotl.core.policies.auto_wrap import SUPPORTED_AUTO_WRAP_MODEL_TYPES
from axolotl.models.mamba import fix_mamba_attn_for_loss
from axolotl.monkeypatch.multipack import (
    SUPPORTED_MULTIPACK_MODEL_TYPES,
@@ -272,117 +262,6 @@ def load_tokenizer(cfg):
    return tokenizer


def replace_linear(
    model: nn.Module,
    linear_replacement: Type[nn.Module],
    quant_config: Union[dict, None] = None,
    skip_modules=None,
    **kwargs,
):
    """
    Replace linear modules with a new Linear module.
    Parameters:
        model (`torch.nn.Module`):
            Input model or `torch.nn.Module` as the function is run recursively.
        linear_replacement (`torch.nn.Module`):
            The linear module that replaces the old one. Only expects standard arguments.
            If other arguments need to be passed, use a lambda.
        skip_modules (`List[str]`, *optional*, defaults to `lm_head`):
            List of modules names not to convert. Defaults to `lm_head`.
    """
    if skip_modules is None:
        skip_modules = ["lm_head"]
    for name, module in model.named_children():
        if len(list(module.children())) > 0:
            replace_linear(
                module, linear_replacement, quant_config, skip_modules, **kwargs
            )

        if isinstance(module, torch.nn.Linear) and name not in skip_modules:
            if issubclass(linear_replacement, Linear4bit):
                model._modules[  # pylint: disable=protected-access
                    name
                ] = linear_replacement(
                    module.in_features,
                    module.out_features,
                    module.bias is not None,
                    **kwargs,
                )
            else:
                raise ValueError(
                    f"Unsupported linear replacement: {type(linear_replacement)}"
                )
    return model


def load_and_quantize(
    module: nn.Module,
    name: str,
    value: Tensor,
    device: torch.device = None,
    dtype: torch.dtype = None,
    skip_names: Optional[List[str]] = None,
    is_meta_rank: bool = False,
    low_memory: bool = True,
    verbose: bool = False,
    quant_method: str = "bnb",
):
    """
    Loads `value` tensor into submodule of `module`, optionally skipping `skip_names` and converting to `dtype`.

    Quantizes `Params4bit` on `device` then places on "cpu" if low_memory=True or "meta" if is_meta_rank=True.
    """

    if skip_names is None:
        skip_names = []

    def place_on_device(value):
        if is_meta_rank:
            device = "meta"
        elif low_memory:
            device = "cpu"
        else:
            device = "cuda"
        return value.to(device=device, dtype=dtype)

    if any(skip_name in name for skip_name in skip_names):
        if verbose:
            print(f"Skipping {name} because it is in skip_names")
        return

    module_key, _, value_key = name.rpartition(".")
    try:
        submodule = module.get_submodule(module_key)
    except AttributeError as exc:
        print(f"Module {module_key} not found:\n{exc}")
        return

    try:
        if quant_method == "bnb":
            param = submodule.get_parameter(value_key)
            if isinstance(param, Params4bit):
                # With `sync_module_states=True`, a meta device Params4bit needs to be the same
                # shape as the quantized Params4bit with an initialized quant_state. However,
                # FSDP only syncs parameters and buffers, so the quant_state isn't copied. This
                # workaround quantizes Params4bit to initialize quant_state on all ranks, then
                # replaces Params4bit's data with a meta tensor to free memory on non-rank 0.
                value = type(param)(
                    value.to(device=device, dtype=dtype).data, **param.__dict__
                ).cuda(device)
                if is_meta_rank:
                    value = type(param)(value.data.to("meta"), **value.__dict__)
                elif low_memory:
                    value = type(param)(value.data.to("cpu"), **value.__dict__)
            else:
                value = type(param)(place_on_device(value).data)

    except AttributeError:
        # it's a buffer
        value = place_on_device(value)

    setattr(submodule, value_key, value)


def load_model(
    cfg: DictDefault,
    tokenizer: PreTrainedTokenizerBase,
@@ -429,7 +308,7 @@ def load_model(
        and cfg.flash_attention
        and cfg.sample_packing
    ):
        patch_for_multipack(cfg.model_config_type, model_name=cfg.base_model)
        patch_for_multipack(cfg.model_config_type)
    elif cfg.is_llama_derived_model:
        # Modify all llama derived models in one block

@@ -515,7 +394,7 @@ def load_model(

    if max_memory is not None:
        # Based on https://github.com/togethercomputer/OpenChatKit/blob/main/inference/bot.py
        from accelerate import infer_auto_device_map
        from accelerate import infer_auto_device_map, init_empty_weights

        with init_empty_weights():
            model_canvas = AutoModelForCausalLM.from_config(model_config)
@@ -617,78 +496,8 @@ def load_model(
            model_kwargs["attn_implementation"] = "eager"
            model_config._attn_implementation = "eager"  # pylint: disable=protected-access

    qlora_fsdp = (
        cfg.fsdp
        and cfg.adapter == "qlora"
        and model_config.model_type in SUPPORTED_AUTO_WRAP_MODEL_TYPES
    )

    try:
        if qlora_fsdp:
            if cfg.bf16 or cfg.bfloat16:
                torch_dtype, compute_dtype = torch.float32, torch.bfloat16
            elif cfg.fp16 or cfg.float16:
                torch_dtype, compute_dtype = torch.float32, torch.float16
            else:
                torch_dtype, compute_dtype = torch.float32, torch.float16

            with init_empty_weights():
                LOG.info("Loading model with empty weights.")
                model = AutoModelForCausalLM.from_config(model_config)
                model.model = replace_linear(
                    model.model,
                    Linear4bit,
                    compute_dtype=compute_dtype,
                    quant_type="nf4",
                    quant_storage=torch_dtype,
                )

            model.is_loaded_in_4bit = True

            # Grab the safetensors files that hold the weights
            try:
                idx = hub.cached_file(base_model, SAFE_WEIGHTS_INDEX_NAME)
                files, _ = hub.get_checkpoint_shard_files(base_model, idx)
            except OSError:
                try:
                    # This means the model doesn't have a model.safetensors.index.json because it is not sharded
                    files = []
                    files.append(hub.cached_file(base_model, SAFE_WEIGHTS_NAME))
                except OSError as exc:
                    # This means the model probably doesn't have a safetensors file
                    raise exc

            # Load in the weights, using our custom load_and_quantize method which quantizes Params4bit on the fly
            # and then places each layer on CPU or meta if using low_memory to minimize GPU memory usage
            def load_and_quantize_parallel(name_param, model, **kwargs):
                name, param = name_param
                load_and_quantize(model, name, param, **kwargs)

            param_count = sum((p.numel() for n, p in model.named_parameters()))
            for filename in files:
                weights = safetensors.torch.load_file(filename)
                quant_method = "bnb"
                devprops = torch.cuda.get_device_properties(torch.cuda.current_device())
                left = int(os.cpu_count() / torch.cuda.device_count())
                right = int(
                    8 * (devprops.total_memory / 1e9 / 40) * (70 / (param_count / 1e9))
                )
                n_workers = min(left, right)
                parallel(
                    load_and_quantize_parallel,
                    weights.items(),
                    n_workers=n_workers,
                    threadpool=True,
                    model=model,
                    dtype=torch_dtype,
                    device=cfg.local_rank,
                    skip_names=[],
                    is_meta_rank=(cfg.local_rank != 0),
                    verbose=False,
                    quant_method=quant_method,
                )

        elif (
        if (
            model_config.model_type == "llama"
            and not cfg.trust_remote_code
            and not cfg.gptq
@@ -804,7 +613,7 @@ def load_model(
        LOG.exception(err)
        raise err

    if isinstance(model, (PeftModel, PeftModelForCausalLM)) and not qlora_fsdp:
    if isinstance(model, (PeftModel, PeftModelForCausalLM)):
        model = model.merge_and_unload()

    embeddings_len = (
@@ -883,9 +692,6 @@ def load_model(
    if cfg.adapter == "lora" and loftq_bits:
        skip_prepare_model_for_kbit_training = True

    if qlora_fsdp:
        skip_prepare_model_for_kbit_training = True

    if cfg.adapter in ["lora", "qlora"]:
        if cfg.gradient_checkpointing:
            model.gradient_checkpointing_enable()
@@ -900,7 +706,7 @@ def load_model(

    # LlamaRMSNorm layers are in fp32 after kbit_training or full finetune, so we need to
    # convert them back to fp16/bf16 for flash-attn compatibility.
    if (needs_fa2_dtype or cfg.flash_attention) and not qlora_fsdp:
    if needs_fa2_dtype or cfg.flash_attention:
        LOG.info("converting modules to %s for flash attention", cfg.torch_dtype)
        for name, module in model.named_modules():
            if "norm" in name:
@@ -918,12 +724,7 @@ def load_model(
    else:
        model, lora_config = load_adapter(model, cfg, cfg.adapter)

    if (
        cfg.ddp
        and not load_in_8bit
        and not (cfg.rl and cfg.load_in_4bit)
        and not qlora_fsdp
    ):
    if cfg.ddp and not load_in_8bit and not (cfg.rl and cfg.load_in_4bit):
        # TODO revaldate this conditional
        model.to(f"cuda:{cfg.local_rank}")

@@ -1012,30 +813,6 @@ def find_all_linear_names(model):
    return list(lora_module_names)


def setup_quantized_meta_for_peft(model: nn.Module):
    """Replaces `quant_state.to` with a dummy function to prevent PEFT from moving `quant_state` to meta device"""

    def temp_to_method(self, *args, **kwargs):  # pylint: disable=unused-argument
        return self

    for param in model.parameters():
        if isinstance(param, Params4bit):
            param.quant_state._orig_to = (  # pylint: disable=protected-access
                param.quant_state.to
            )
            param.quant_state.to = types.MethodType(temp_to_method, param.quant_state)


def setup_quantized_peft_meta_for_training(model: nn.Module):
    """Replaces dummy `quant_state.to` method with the original function to allow training to continue"""
    for param in model.parameters():
        if isinstance(param, Params4bit) and hasattr(param.quant_state, "_orig_to"):
            param.quant_state.to = (
                param.quant_state._orig_to  # pylint: disable=protected-access
            )
            param.quant_state._orig_to = None  # pylint: disable=protected-access


def load_lora(model, cfg, inference=False, config_only=False):
    # type: (PreTrainedModel, DictDefault, bool, bool) -> Tuple[Optional[PreTrainedModel], Optional[PeftConfig]]

@@ -1055,8 +832,6 @@ def load_lora(model, cfg, inference=False, config_only=False):
        lora_config_kwargs["init_lora_weights"] = "loftq"
    if cfg.peft_use_dora:
        lora_config_kwargs["use_dora"] = cfg.peft_use_dora
    if cfg.peft_use_rslora:
        lora_config_kwargs["use_rslora"] = cfg.use_rslora

    lora_config = LoraConfig(
        r=cfg.lora_r,
@@ -1074,11 +849,6 @@ def load_lora(model, cfg, inference=False, config_only=False):
    if config_only:
        return None, lora_config

    rank = int(os.environ.get("LOCAL_RANK", 0))

    if cfg.fsdp and cfg.adapter == "qlora" and rank != 0:
        setup_quantized_meta_for_peft(model)

    if cfg.lora_model_dir:
        LOG.debug("Loading pretrained PEFT - LoRA")
        model_kwargs: Any = {}
@@ -1094,9 +864,6 @@ def load_lora(model, cfg, inference=False, config_only=False):
    else:
        model = get_peft_model(model, lora_config)

    if rank == 0:
        model.print_trainable_parameters()
    elif cfg.fsdp and cfg.adapter == "qlora":
        setup_quantized_peft_meta_for_training(model)
    model.print_trainable_parameters()

    return model, lora_config
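The worker-count heuristic in the quantized-loading loop above packs two bounds into one `min`: the CPUs available per GPU, and a scale factor derived from GPU memory (relative to a 40 GB card) and model size (relative to a 70B-parameter model). Pulled out as a pure function for readability (a sketch with a hypothetical name; the constants are copied from the diff):

```python
def loading_workers(cpu_count: int, gpu_count: int, gpu_mem_bytes: int, param_count: int) -> int:
    # CPUs available to each GPU's loading process
    left = int(cpu_count / gpu_count)
    # Scale a baseline of 8 workers by memory headroom (vs. a 40 GB card)
    # and by model size (vs. 70e9 parameters): more memory allows more
    # concurrent shard loads, a bigger model allows fewer.
    right = int(8 * (gpu_mem_bytes / 1e9 / 40) * (70 / (param_count / 1e9)))
    return min(left, right)

# e.g. 64 CPUs, 8 x 80 GB GPUs, a 70B model:
# left = 8, right = int(8 * 2 * 1) = 16 -> 8 workers
```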
@@ -5,7 +5,7 @@ Multipack Batch Sampler
import logging
import math
import os
from typing import Any, Iterable, List, Union
from typing import Any, Iterable, List, Union, Optional

import numba
import numpy as np
@@ -115,12 +115,14 @@ class MultipackBatchSampler(BatchSampler):
        batch_max_len: int,
        lengths: np.ndarray,
        packing_efficiency_estimate: float = 1.0,
        consistent_length: Optional[bool] = False,
    ):
        super().__init__(sampler, batch_size, drop_last)
        self.batch_size = batch_size
        self.batch_max_len = batch_max_len
        self.lengths: np.ndarray = lengths
        self.packing_efficiency_estimate = packing_efficiency_estimate or 1.0
        self.consistent_length = consistent_length

        assert isinstance(self.lengths, np.ndarray)

@@ -164,11 +166,18 @@ class MultipackBatchSampler(BatchSampler):

    def __iter__(self):
        batches = self.generate_batches(set_stats=True)
        return iter(batches)
        if self.consistent_length:
            length = self._len_est()
            return iter(batches[:length])
        else:
            return iter(batches)

    def num_batches(self):
        batches = self.generate_batches(set_stats=True)
        return len(batches)
        if self.consistent_length:
            return self._len_est()
        else:
            return len(batches)

    def efficiency(self):
        return self.eff_total_used / self.eff_total_slots
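The `consistent_length` flag above can be illustrated with a toy stand-in (a hypothetical class, not the project's sampler): bin packing produces a batch count that varies with each epoch's shuffle, so truncating the batch list to a precomputed length estimate keeps `len()` stable across epochs and ranks, which schedulers and distributed training assume.

```python
import random


class ToySampler:
    """Toy illustration of the consistent_length idea.

    generate_batches() returns a count that jitters with the shuffle, the
    way bin packing does; truncating to a fixed estimate hides the jitter.
    """

    def __init__(self, est_batches: int, consistent_length: bool = False):
        self.est_batches = est_batches
        self.consistent_length = consistent_length

    def generate_batches(self):
        # Batch count varies a little from epoch to epoch.
        jitter = random.choice([0, 1, 2])
        return [[i] for i in range(self.est_batches + jitter)]

    def __iter__(self):
        batches = self.generate_batches()
        if self.consistent_length:
            # Always yield exactly the estimated number of batches.
            return iter(batches[: self.est_batches])
        return iter(batches)
```

With `consistent_length=True` every epoch yields the same number of batches; without it, the length drifts with the packing outcome.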
@@ -2,8 +2,6 @@


import logging
import re
from typing import Dict, List

from termcolor import colored

@@ -38,65 +36,3 @@ def check_example_labels(example, tokenizer, text_only=False):
    LOG.info("\n\n\n")

    return " ".join(colored_tokens)


GLAIVE_ROLES = ["USER", "ASSISTANT", "FUNCTION RESPONSE"]
GLAIVE_TO_SHAREGPT_ROLE = {
    "SYSTEM": "system",
    "USER": "human",
    "ASSISTANT": "gpt",
    "FUNCTION RESPONSE": "tool",
}

GLAIVE_MSG_REGEX = re.compile(rf"({'|'.join(GLAIVE_ROLES)}): ")


def chatml_to_conversation(row: Dict[str, str]) -> List[Dict[str, str]]:
    """
    Converts a ChatML formatted row to a list of messages in ShareGPT format.
    Initially based off https://github.com/lilacai/lilac/blob/main/notebooks/GlaiveToShareGPT.ipynb.
    """

    system_prompt = row.get("system")
    if system_prompt:
        system_prompt = system_prompt.removeprefix("SYSTEM: ")

    chat_str = row["chat"]
    chat_msgs = [s.strip() for s in GLAIVE_MSG_REGEX.split(chat_str) if s]

    chat_msg_dicts = [
        {"from": GLAIVE_TO_SHAREGPT_ROLE[role], "value": value}
        for role, value in zip(chat_msgs[::2], chat_msgs[1::2])
    ]

    if system_prompt:
        chat_msg_dicts = [
            {"from": GLAIVE_TO_SHAREGPT_ROLE["SYSTEM"], "value": system_prompt}
        ] + chat_msg_dicts

    return chat_msg_dicts


def merge_consecutive_messages(messages):
    """
    Merge consecutive messages from the same sender into a single message.
    This can be useful with datasets that contain multiple consecutive tool calls.
    """

    merged_messages = []
    current_from = None
    current_message = ""

    for msg in messages:
        if current_from == msg["from"]:
            current_message += msg["value"]
        else:
            if current_from is not None:
                merged_messages.append({"from": current_from, "value": current_message})
            current_from = msg["from"]
            current_message = msg["value"]

    if current_from is not None:
        merged_messages.append({"from": current_from, "value": current_message})

    return merged_messages
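The `merge_consecutive_messages` helper above walks the list with explicit state; the same behavior can be expressed compactly with `itertools.groupby` (an equivalent sketch, not the project's code), which makes the "group runs of the same sender and concatenate" intent explicit:

```python
from itertools import groupby


def merge_consecutive_messages(messages):
    # Group consecutive messages with the same "from" key, then join
    # each run's values into a single message.
    return [
        {"from": sender, "value": "".join(m["value"] for m in group)}
        for sender, group in groupby(messages, key=lambda m: m["from"])
    ]


msgs = [
    {"from": "gpt", "value": "calling tool A"},
    {"from": "gpt", "value": " and tool B"},
    {"from": "tool", "value": "result"},
]
merged = merge_consecutive_messages(msgs)
# -> [{"from": "gpt", "value": "calling tool A and tool B"},
#     {"from": "tool", "value": "result"}]
```

This is the shape of data a multi-tool-call dataset produces: consecutive `gpt` turns collapse into one before tokenization.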
@@ -277,7 +277,7 @@ def calculate_total_num_steps(cfg, train_dataset, update=True):
                calc_sample_packing_eff_est,
            )
            sample_packing_eff_est = (
                math.ceil(sample_packing_actual_eff_all * 100.0) / 100.0
                math.ceil(sample_packing_actual_eff_all * 10000.0) / 10000.0
            )
            if update:
                cfg.sample_packing_eff_est = sample_packing_eff_est
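The rounding change above is easy to mis-read: both versions round *up* with `math.ceil`; the change only increases the precision from two to four decimal places, so the packing-efficiency estimate is less coarse. A small helper (hypothetical name) shows the effect:

```python
import math


def round_up(value: float, decimals: int) -> float:
    # Ceiling-round to a fixed number of decimal places, as the diff does
    # with hard-coded factors of 100.0 and 10000.0.
    factor = 10.0 ** decimals
    return math.ceil(value * factor) / factor


round_up(0.91234, 2)  # -> 0.92   (old behavior: 2 decimal places)
round_up(0.91234, 4)  # -> 0.9124 (new behavior: 4 decimal places)
```

Rounding up rather than to-nearest keeps the estimate conservative, so the step count derived from it is never overstated by the truncation.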
@@ -1,7 +1,6 @@
"""
Test module for sharegpt integration w chatml
"""

import pytest
from datasets import Dataset
from tokenizers import AddedToken
@@ -9,7 +8,6 @@ from transformers import AutoTokenizer

from axolotl.datasets import TokenizedPromptDataset
from axolotl.prompt_strategies.sharegpt import (
    GlaiveShareGPTPromptTokenizingStrategy,
    SimpleShareGPTPromptTokenizingStrategy,
    register_chatml_template,
)
@@ -50,18 +48,6 @@ def fixture_sharegpt_dataset():
    )


@pytest.fixture(name="glaive_dataset")
def fixture_sharegpt_glaive_dataset():
    return Dataset.from_list(
        [
            {
                "system": "SYSTEM: This is a system prompt",
                "chat": "USER: Can you book a flight for me from New York to London? ASSISTANT: I'm sorry, but I don't have the capability to book flights. <|endoftext|>",
            }
        ]
    )


@pytest.fixture(name="tokenizer")
def fixture_tokenizer():
    tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
@@ -170,29 +156,3 @@ class TestSharegpt:
            32001, 13892, 13, 12684, 17664, 32000, 28705, 13,  # gpt
        ]
        # fmt: on

    def test_chatml_glaive(self, glaive_dataset, tokenizer):
        strategy = GlaiveShareGPTPromptTokenizingStrategy(
            ShareGPTPrompterV2(
                conversation="chatml",
                role_key_model=None,
                role_key_human=None,
            ),
            tokenizer,
            True,  # train_on_inputs
            2048,  # sequence_len
        )

        dataset_wrapper = TokenizedPromptDataset(
            strategy, glaive_dataset, process_count=1
        )

        labels = dataset_wrapper[0]["labels"]
        # fmt: off
        assert labels == [
            1,  # bos
            32001, 1587, 13, 3260, 349, 264, 1587, 11510, 32000, 28705, 13,  # system
            32001, 2188, 13, 6325, 368, 1820, 264, 9314, 354, 528, 477, 1450, 2726, 298, 4222, 28804, 32000, 28705, 13,  # human
            32001, 13892, 13, 28737, 28742, 28719, 7371, 28725, 562, 315, 949, 28742, 28707, 506, 272, 21368, 298, 1820, 22447, 28723, 28705, 523, 28766, 416, 1009, 772, 28766, 28767, 32000, 28705, 13  # gpt
        ]
        # fmt: on
@@ -1,285 +0,0 @@
|
||||
"""
|
||||
This module contains unit tests for the `freeze_layers_except` function.
|
||||
|
||||
The `freeze_layers_except` function is used to freeze layers in a model, except for the specified layers.
|
||||
The unit tests in this module verify the behavior of the `freeze_layers_except` function in different scenarios.
|
||||
"""
|
||||
|
||||
import unittest
|
||||
|
||||
import torch
|
||||
from torch import nn
|
||||
|
||||
from axolotl.utils.freeze import freeze_layers_except
|
||||
|
||||
ZERO = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
|
||||
ONE_TO_TEN = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
|
||||
|
||||
|
||||
class TestFreezeLayersExcept(unittest.TestCase):
|
||||
"""
|
||||
A test case class for the `freeze_layers_except` function.
|
||||
"""
|
||||
|
||||
def setUp(self):
|
||||
self.model = _TestModel()
|
||||
|
||||
def test_freeze_layers_with_dots_in_name(self):
|
||||
freeze_layers_except(self.model, ["features.layer"])
|
||||
self.assertTrue(
|
||||
self.model.features.layer.weight.requires_grad,
|
||||
"model.features.layer should be trainable.",
|
||||
)
|
||||
self.assertFalse(
|
||||
self.model.classifier.weight.requires_grad,
|
||||
"model.classifier should be frozen.",
|
||||
)
|
||||
|
||||
def test_freeze_layers_without_dots_in_name(self):
|
||||
freeze_layers_except(self.model, ["classifier"])
|
||||
self.assertFalse(
|
||||
self.model.features.layer.weight.requires_grad,
|
||||
"model.features.layer should be trainable.",
|
||||
)
|
||||
self.assertTrue(
|
||||
self.model.classifier.weight.requires_grad,
|
||||
"model.classifier should be frozen.",
|
||||
)
|
||||
|
||||
def test_freeze_layers_regex_patterns(self):
|
||||
# The second pattern cannot match because only characters 'a' to 'c' are allowed after the word 'class', whereas it should be matching the character 'i'.
|
||||
freeze_layers_except(self.model, [r"^features.[a-z]+.weight$", r"class[a-c]+"])
|
||||
self.assertTrue(
|
||||
self.model.features.layer.weight.requires_grad,
|
||||
"model.features.layer should be trainable.",
|
||||
)
|
||||
self.assertFalse(
|
||||
self.model.classifier.weight.requires_grad,
|
||||
"model.classifier should be frozen.",
|
||||
)
|
||||
|
||||
def test_all_layers_frozen(self):
|
||||
freeze_layers_except(self.model, [])
|
||||
self.assertFalse(
|
||||
self.model.features.layer.weight.requires_grad,
|
||||
"model.features.layer should be frozen.",
|
||||
)
|
||||
self.assertFalse(
|
||||
self.model.classifier.weight.requires_grad,
|
||||
"model.classifier should be frozen.",
|
||||
)
|
||||
|
||||
def test_all_layers_unfrozen(self):
|
||||
freeze_layers_except(self.model, ["features.layer", "classifier"])
|
||||
self.assertTrue(
|
||||
self.model.features.layer.weight.requires_grad,
|
||||
"model.features.layer should be trainable.",
|
||||
)
|
||||
self.assertTrue(
|
||||
self.model.classifier.weight.requires_grad,
|
||||
"model.classifier should be trainable.",
|
||||
)
|
||||
|

    def test_freeze_layers_with_range_pattern_start_end(self):
        freeze_layers_except(self.model, ["features.layer[1:5]"])
        self.assertTrue(
            self.model.features.layer.weight.requires_grad,
            "model.features.layer should be trainable.",
        )
        self.assertFalse(
            self.model.classifier.weight.requires_grad,
            "model.classifier should be frozen.",
        )

        self._assert_gradient_output(
            [
                ZERO,
                ONE_TO_TEN,
                ONE_TO_TEN,
                ONE_TO_TEN,
                ONE_TO_TEN,
                ZERO,
                ZERO,
                ZERO,
                ZERO,
                ZERO,
            ]
        )

    def test_freeze_layers_with_range_pattern_single_index(self):
        freeze_layers_except(self.model, ["features.layer[5]"])
        self.assertTrue(
            self.model.features.layer.weight.requires_grad,
            "model.features.layer should be trainable.",
        )
        self.assertFalse(
            self.model.classifier.weight.requires_grad,
            "model.classifier should be frozen.",
        )

        self._assert_gradient_output(
            [ZERO, ZERO, ZERO, ZERO, ZERO, ONE_TO_TEN, ZERO, ZERO, ZERO, ZERO]
        )

    def test_freeze_layers_with_range_pattern_start_omitted(self):
        freeze_layers_except(self.model, ["features.layer[:5]"])
        self.assertTrue(
            self.model.features.layer.weight.requires_grad,
            "model.features.layer should be trainable.",
        )
        self.assertFalse(
            self.model.classifier.weight.requires_grad,
            "model.classifier should be frozen.",
        )

        self._assert_gradient_output(
            [
                ONE_TO_TEN,
                ONE_TO_TEN,
                ONE_TO_TEN,
                ONE_TO_TEN,
                ONE_TO_TEN,
                ZERO,
                ZERO,
                ZERO,
                ZERO,
                ZERO,
            ]
        )

    def test_freeze_layers_with_range_pattern_end_omitted(self):
        freeze_layers_except(self.model, ["features.layer[4:]"])
        self.assertTrue(
            self.model.features.layer.weight.requires_grad,
            "model.features.layer should be trainable.",
        )
        self.assertFalse(
            self.model.classifier.weight.requires_grad,
            "model.classifier should be frozen.",
        )

        self._assert_gradient_output(
            [
                ZERO,
                ZERO,
                ZERO,
                ZERO,
                ONE_TO_TEN,
                ONE_TO_TEN,
                ONE_TO_TEN,
                ONE_TO_TEN,
                ONE_TO_TEN,
                ONE_TO_TEN,
            ]
        )

    def test_freeze_layers_with_range_pattern_merge_included(self):
        freeze_layers_except(self.model, ["features.layer[4:]", "features.layer[5:6]"])
        self.assertTrue(
            self.model.features.layer.weight.requires_grad,
            "model.features.layer should be trainable.",
        )
        self.assertFalse(
            self.model.classifier.weight.requires_grad,
            "model.classifier should be frozen.",
        )

        self._assert_gradient_output(
            [
                ZERO,
                ZERO,
                ZERO,
                ZERO,
                ONE_TO_TEN,
                ONE_TO_TEN,
                ONE_TO_TEN,
                ONE_TO_TEN,
                ONE_TO_TEN,
                ONE_TO_TEN,
            ]
        )

    def test_freeze_layers_with_range_pattern_merge_intersect(self):
        freeze_layers_except(self.model, ["features.layer[4:7]", "features.layer[6:8]"])
        self.assertTrue(
            self.model.features.layer.weight.requires_grad,
            "model.features.layer should be trainable.",
        )
        self.assertFalse(
            self.model.classifier.weight.requires_grad,
            "model.classifier should be frozen.",
        )

        self._assert_gradient_output(
            [
                ZERO,
                ZERO,
                ZERO,
                ZERO,
                ONE_TO_TEN,
                ONE_TO_TEN,
                ONE_TO_TEN,
                ONE_TO_TEN,
                ZERO,
                ZERO,
            ]
        )

    def test_freeze_layers_with_range_pattern_merge_separate(self):
        freeze_layers_except(
            self.model,
            ["features.layer[1:2]", "features.layer[3:4]", "features.layer[5:6]"],
        )
        self.assertTrue(
            self.model.features.layer.weight.requires_grad,
            "model.features.layer should be trainable.",
        )
        self.assertFalse(
            self.model.classifier.weight.requires_grad,
            "model.classifier should be frozen.",
        )

        self._assert_gradient_output(
            [
                ZERO,
                ONE_TO_TEN,
                ZERO,
                ONE_TO_TEN,
                ZERO,
                ONE_TO_TEN,
                ZERO,
                ZERO,
                ZERO,
                ZERO,
            ]
        )

    def _assert_gradient_output(self, expected):
        input_tensor = torch.tensor([ONE_TO_TEN], dtype=torch.float32)

        self.model.features.layer.weight.grad = None  # Reset gradients
        output = self.model.features.layer(input_tensor)
        loss = output.sum()
        loss.backward()

        expected_grads = torch.tensor(expected)
        torch.testing.assert_close(
            self.model.features.layer.weight.grad, expected_grads
        )


class _SubLayerModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(10, 10)


class _TestModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = _SubLayerModule()
        self.classifier = nn.Linear(10, 2)


if __name__ == "__main__":
    unittest.main()
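Taken together, the tests above pin down how overlapping index-range patterns such as `features.layer[4:]` and `features.layer[5:6]` combine: a row is trainable if any supplied range covers it. The sketch below re-implements just that merging rule for illustration (it is a standalone assumption-level model of the behaviour the merge tests exercise, not axolotl's actual pattern-parsing code):

```python
def merge_ranges(ranges, length):
    """Mark indices covered by any half-open [start, end) range.

    ``None`` for start or end mimics an omitted bound, as in the
    "features.layer[:5]" and "features.layer[4:]" test patterns.
    """
    trainable = [False] * length
    for start, end in ranges:
        start = 0 if start is None else start
        end = length if end is None else end
        for i in range(start, min(end, length)):
            trainable[i] = True
    return trainable


# Mirrors test_freeze_layers_with_range_pattern_merge_intersect:
# [4:7] and [6:8] together leave rows 4..7 trainable.
print(merge_ranges([(4, 7), (6, 8)], 10))
```

The half-open semantics and the "union of ranges" behaviour match the gradient expectations asserted in the merge tests (rows inside any range get `ONE_TO_TEN` gradients, all others `ZERO`).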
@@ -1,5 +1,4 @@
"""Module for testing prompt tokenizers."""

import json
import logging
import unittest
@@ -19,7 +18,6 @@ from axolotl.prompt_strategies.llama2_chat import (
    Llama2ChatPrompter,
    LLama2ChatTokenizingStrategy,
)
from axolotl.prompt_strategies.sharegpt import GlaiveShareGPTPromptTokenizingStrategy
from axolotl.prompt_tokenizers import (
    AlpacaPromptTokenizingStrategy,
    ShareGPTPromptTokenizingStrategy,
@@ -268,23 +266,6 @@ class TestPromptTokenizationStrategies(unittest.TestCase):
        idx = res["input_ids"].index(20255)  # assistant token
        assert res["labels"][idx] == -100

    def test_glaive_tool_label_ignore(self):
        conversation = {
            "system": "SYSTEM: This is a system prompt",
            "chat": "USER: Can you book a flight for me from New York to London? ASSISTANT: I'm sorry, but I don't have the capability to book flights. <|endoftext|>",
        }
        prompter = ShareGPTPrompterV2()
        strat = GlaiveShareGPTPromptTokenizingStrategy(
            prompter,
            self.tokenizer,
            False,
            2048,
        )
        with self._caplog.at_level(logging.WARNING):
            res = strat.tokenize_prompt(conversation)
            idx = res["input_ids"].index(13566)  # assistant token
            assert res["labels"][idx] == -100

    def test_no_sys_prompt(self):
        """
        tests the interface between the user and assistant parts