Compare commits: completion...fp8 (1 commit)

| Author | SHA1 | Date |
|---|---|---|
|  | 8836986a92 |  |

.github/workflows/tests.yml (vendored, 1 change)
@@ -71,7 +71,6 @@ jobs:
      - name: Install dependencies
        run: |
          pip3 install --extra-index-url https://download.pytorch.org/whl/cu118 -U torch==2.0.1
          pip3 uninstall -y transformers accelerate
          pip3 install -U -e .[flash-attn]
          pip3 install -r requirements-tests.txt
README.md (68 changed lines)
@@ -25,10 +25,8 @@ Features:
- [Installation](#installation)
- [Docker](#docker)
- [Conda/Pip venv](#condapip-venv)
- [Runpod](#runpod)
- [LambdaLabs](#lambdalabs)
- [Windows](#windows)
- [Launching on public clouds via SkyPilot](#launching-on-public-clouds-via-skypilot)
- [Dataset](#dataset)
- [How to Add Custom Prompts](#how-to-add-custom-prompts)
- [How to Use Custom Pretokenized Dataset](#how-to-use-your-custom-pretokenized-dataset)
@@ -77,7 +75,6 @@ Features:
| XGen | ✅ | ❓ | ✅ | ❓ | ❓ | ❓ | ✅ |
| phi | ✅ | ✅ | ✅ | ❓ | ❓ | ❓ | ❓ |
| RWKV | ✅ | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| Qwen | ✅ | ✅ | ✅ | ❓ | ❓ | ❓ | ❓ |


## Quickstart ⚡
@@ -86,19 +83,14 @@ Get started with Axolotl in just a few steps! This quickstart guide will walk yo

**Requirements**: Python >=3.9 and Pytorch >=2.0.

`pip3 install "axolotl[flash-attn,deepspeed] @ git+https://github.com/OpenAccess-AI-Collective/axolotl"`

### For developers
```bash
git clone https://github.com/OpenAccess-AI-Collective/axolotl
cd axolotl

pip3 install packaging
pip3 install -e '.[flash-attn,deepspeed]'
```
pip3 install -U git+https://github.com/huggingface/peft.git

### Usage
```bash
# finetune lora
accelerate launch -m axolotl.cli.train examples/openllama-3b/lora.yml
@@ -119,6 +111,7 @@ accelerate launch -m axolotl.cli.inference examples/openllama-3b/lora.yml \
```bash
docker run --gpus '"all"' --rm -it winglian/axolotl:main-py3.10-cu118-2.0.1
```
- `winglian/axolotl-runpod:main-latest`: for runpod or use this [direct link](https://runpod.io/gsc?template=v2ickqhz9s&ref=6i7fkpdz)

Or run on the current files for development:
@@ -133,15 +126,13 @@ accelerate launch -m axolotl.cli.inference examples/openllama-3b/lora.yml \
A more powerful Docker command to run would be this:

```bash
docker run --privileged --gpus '"all"' --shm-size 10g --rm -it --name axolotl --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --mount type=volume,src=axolotl,target=/workspace/axolotl -v ${HOME}/.cache/huggingface:/root/.cache/huggingface winglian/axolotl:main-py3.10-cu118-2.0.1
docker run --gpus '"all"' --rm -it --name axolotl --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --mount type=volume,src=axolotl,target=/workspace/axolotl -v ${HOME}/.cache/huggingface:/root/.cache/huggingface winglian/axolotl:main-py3.10-cu118-2.0.1
```

It additionally:
* Prevents memory issues when running e.g. deepspeed (you could otherwise hit a SIGBUS/signal 7 error) through the `--ipc` and `--ulimit` args.
* Persists the downloaded HF data (models etc.) and your modifications to the axolotl code through the `--mount`/`-v` args.
* The `--name` argument simply makes it easier to refer to the container in vscode (`Dev Containers: Attach to Running Container...`) or in your terminal.
* The `--privileged` flag gives all capabilities to the container.
* The `--shm-size 10g` argument increases the shared memory size. Use this if you see `exitcode: -7` errors when using deepspeed.

[More information on the nvidia website](https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html#setincshmem)
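As a quick sanity check that the larger `--shm-size` actually took effect before launching deepspeed, a small Python snippet can report the size of the container's shared-memory mount. This is a sketch and assumes the standard `/dev/shm` tmpfs mount:

```python
# Sanity-check the container's shared memory size (assumes /dev/shm is mounted).
import os

stats = os.statvfs("/dev/shm")
shm_gib = stats.f_frsize * stats.f_blocks / 1024**3
print(f"/dev/shm size: {shm_gib:.1f} GiB")  # expect roughly 10 GiB with --shm-size 10g
```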
@@ -163,10 +154,6 @@ accelerate launch -m axolotl.cli.inference examples/openllama-3b/lora.yml \
```
Get the token at huggingface.co/settings/tokens

#### Runpod

Use `winglian/axolotl-runpod:main-latest` or use this [direct link](https://runpod.io/gsc?template=v2ickqhz9s&ref=6i7fkpdz)

#### LambdaLabs
<details>
@@ -214,28 +201,6 @@ Use `winglian/axolotl-runpod:main-latest` or use this [direct link](https://runp
#### Windows
Please use WSL or Docker!


#### Launching on public clouds via SkyPilot
To launch on GPU instances (both on-demand and spot instances) on 7+ clouds (GCP, AWS, Azure, OCI, and more), you can use [SkyPilot](https://skypilot.readthedocs.io/en/latest/index.html):
```bash
pip install "skypilot-nightly[gcp,aws,azure,oci,lambda,kubernetes,ibm,scp]"  # choose your clouds
sky check
```
Get the [example YAMLs](https://github.com/skypilot-org/skypilot/tree/master/llm/axolotl) for using Axolotl to finetune `mistralai/Mistral-7B-v0.1`:
```
git clone https://github.com/skypilot-org/skypilot.git
cd skypilot/llm/axolotl
```
Use one command to launch:
```bash
# On-demand
HF_TOKEN=xx sky launch axolotl.yaml --env HF_TOKEN

# Managed spot (auto-recovery on preemption)
HF_TOKEN=xx BUCKET=<unique-name> sky spot launch axolotl-spot.yaml --env HF_TOKEN --env BUCKET
```


### Dataset

Axolotl supports a variety of dataset formats. Below are some of the formats you can use.
@@ -432,12 +397,6 @@ See [examples](examples) for quick start. It is recommended to duplicate and mod
  - path: knowrohit07/know_sql
    type: context_qa.load_v2
    train_on_split: validation

# loading from s3 or gcs
# s3 creds will be loaded from the system default and gcs only supports public access
dataset:
  - path: s3://path_to_ds # Accepts folder with arrow/parquet or file path like above. Supports s3, gcs.
    ...
```
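For reference, here is a minimal sketch of how such an `s3://` path can be resolved, mirroring the remote-filesystem handling that appears in the `utils/data.py` hunks further down this diff. The bucket path is the placeholder from the YAML above, and the `"parquet"` builder is an assumption for the single-file case:

```python
# Sketch only: resolve an s3:// dataset path, assuming the s3fs/aiobotocore extras
# from requirements.txt are installed and default AWS credentials are configured.
import aiobotocore.session
import s3fs
from datasets import load_dataset, load_from_disk

s3_session = aiobotocore.session.AioSession(profile="default")  # ~/.aws/credentials
storage_options = {"session": s3_session}
fs = s3fs.S3FileSystem(**storage_options)

path = "s3://path_to_ds"  # placeholder path from the YAML example above
if fs.isdir(path):
    # a folder saved with Dataset.save_to_disk() (arrow/parquet shards)
    ds = load_from_disk(path, storage_options=storage_options)
elif fs.isfile(path):
    # a single file; the builder ("parquet", "json", ...) depends on the extension
    ds = load_dataset("parquet", data_files=path, storage_options=storage_options)
```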

- loading
@@ -500,15 +459,6 @@ is_falcon_derived_model:
is_llama_derived_model:
# Please note that if you set this to true, `padding_side` will be set to "left" by default
is_mistral_derived_model:
is_qwen_derived_model:

# optional overrides to the base model configuration
model_config:
  # RoPE Scaling https://github.com/huggingface/transformers/pull/24653
  rope_scaling:
    type: # linear | dynamic
    factor: # float


# Whether you are training a 4-bit GPTQ quantized model
gptq: true
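The `model_config:` block is applied by setting each key directly onto the loaded Hugging Face config object; a short sketch of that logic, as it appears on one side of the `load_model_config()` hunk later in this diff:

```python
# How optional `model_config:` overrides (e.g. rope_scaling) reach the HF config.
from transformers import AutoConfig

def load_model_config(cfg):
    model_config_name = cfg.base_model_config or cfg.base_model
    model_config = AutoConfig.from_pretrained(
        model_config_name, trust_remote_code=cfg.trust_remote_code is True
    )
    # copy every key under `model_config:` onto the transformers config object
    if cfg.model_config:
        for key, val in cfg.model_config.items():
            setattr(model_config, key, val)
    return model_config
```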
@@ -533,7 +483,7 @@ float16: true

# A list of one or more datasets to finetune the model with
datasets:
  # HuggingFace dataset repo | s3://,gs:// path | "json" for local dataset, make sure to fill data_files
  # HuggingFace dataset repo | "json" for local dataset, make sure to fill data_files
  - path: vicgalle/alpaca-gpt4
    # The type of prompt to use for training. [alpaca, sharegpt, gpteacher, oasst, reflection]
    type: alpaca # format | format:<prompt_style> (chat/instruct) | <prompt_strategies>.load_<load_fn>
@@ -541,12 +491,9 @@ datasets:
    data_files: # Optional[str] path to source data files
    shards: # Optional[int] number of shards to split data into
    name: # Optional[str] name of dataset configuration to load
    train_on_split: train # Optional[str] name of dataset split to load from

    # Optional[str] fastchat conversation type, only used with type: sharegpt
    conversation: # Options (see Conversation 'name'): https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
    field_human: # Optional[str]. Human key to use for conversation.
    field_model: # Optional[str]. Assistant key to use for conversation.

  # Custom user prompt
  - path: repo
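For a local `path:`, the loader picks a `datasets` builder from the file extension, defaulting to json. A sketch based on the `get_ds_type()` helper that appears later in this diff; the example file path is purely illustrative:

```python
# Infer the datasets builder for a local dataset file, defaulting to "json".
from typing import Optional
from datasets import load_dataset

def infer_ds_type(path: str, explicit: Optional[str] = None) -> str:
    if explicit:  # an explicit ds_type in the config wins
        return explicit
    for ext, ds_type in ((".parquet", "parquet"), (".arrow", "arrow"),
                         (".csv", "csv"), (".txt", "text")):
        if ext in path:
            return ds_type
    return "json"

# e.g. a datasets entry with `path: data/train.jsonl` and `data_files: data/train.jsonl`
ds = load_dataset(infer_ds_type("data/train.jsonl"), data_files="data/train.jsonl")
```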
@@ -677,8 +624,7 @@ gradient_accumulation_steps: 1
micro_batch_size: 2
eval_batch_size:
num_epochs: 4
warmup_steps: 100 # cannot use with warmup_ratio
warmup_ratio: 0.05 # cannot use with warmup_steps
warmup_steps: 100
learning_rate: 0.00003
lr_quadratic_warmup:
logging_steps:
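The `warmup_steps` / `warmup_ratio` interaction follows the trainer-builder hunk later in this diff: an explicit `warmup_steps` wins, otherwise `warmup_ratio` is applied to the total step count, otherwise a default kicks in. A small sketch; the step counts are example numbers:

```python
# Resolve warmup steps the way the HFCausalTrainerBuilder.build() hunk does.
def resolve_warmup_steps(warmup_steps, warmup_ratio, total_num_steps):
    if warmup_steps is not None:
        return warmup_steps
    if warmup_ratio is not None:
        return max(int(warmup_ratio * total_num_steps), 0)
    # fallback: 3% of total steps, capped at 100
    return min(int(0.03 * total_num_steps), 100)

assert resolve_warmup_steps(100, None, 2000) == 100
assert resolve_warmup_steps(None, 0.05, 2000) == 100
assert resolve_warmup_steps(None, None, 2000) == 60
```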
@@ -780,6 +726,10 @@ landmark_attention:
# xpos RoPE see https://github.com/kaiokendev/cutoff-len-is-context-len/blob/main/util/xpos_rope_llama_monkey_patch.py
# LLaMA only
xpos_rope:
# RoPE Scaling https://github.com/huggingface/transformers/pull/24653
rope_scaling:
  type: # linear | dynamic
  factor: # float

# Resume from a specific checkpoint dir
resume_from_checkpoint:
@@ -1,4 +1,4 @@
base_model: PY007/TinyLlama-1.1B-intermediate-step-715k-1.5T
base_model: PY007/TinyLlama-1.1B-step-50K-105b

model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
@@ -1,5 +1,5 @@
base_model: microsoft/phi-1_5
model_type: PhiForCausalLM
model_type: MixFormerSequentialForCausalLM
tokenizer_type: AutoTokenizer
is_llama_derived_model: false
trust_remote_code: true
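The two `model_type:` values in this example correspond to different code paths in `load_model()`. A rough sketch of how the string is resolved, based on the hunks later in this diff, with the custom phi import on one side and the generic `transformers` lookup as the fallback:

```python
# Illustrative only: map a config `model_type:` string to a causal-LM class.
import transformers

def resolve_model_cls(model_type: str):
    if model_type == "MixFormerSequentialForCausalLM":
        # custom class bundled with axolotl on one side of this diff
        from axolotl.models.phi import MixFormerSequentialForCausalLM
        return MixFormerSequentialForCausalLM
    if model_type == "PhiForCausalLM":
        # the other side of the diff renames the module contents to PhiForCausalLM
        from axolotl.models.phi import PhiForCausalLM
        return PhiForCausalLM
    # otherwise fall back to the class of the same name in transformers
    return getattr(transformers, model_type)
```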
@@ -1,68 +0,0 @@
base_model: Qwen/Qwen-7B
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

is_qwen_derived_model: true
trust_remote_code: true

load_in_8bit: true
load_in_4bit: false
strict: false

datasets:
  - path: mhenrichsen/alpaca_2k_test
    type: alpaca
dataset_prepared_path:
val_set_size: 0.05
output_dir: ./lora-out

sequence_len: 2048 # supports up to 8192
sample_packing: false
pad_to_sequence_len:

adapter: lora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project:
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 4
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: false
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
eval_steps: 0.05
eval_table_size:
eval_table_max_new_tokens: 128
save_steps:
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
@@ -1,68 +0,0 @@
base_model: Qwen/Qwen-7B
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

is_qwen_derived_model: true
trust_remote_code: true

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: mhenrichsen/alpaca_2k_test
    type: alpaca
dataset_prepared_path:
val_set_size: 0.05
output_dir: ./lora-out

sequence_len: 2048 # supports up to 8192
sample_packing: false
pad_to_sequence_len:

adapter: qlora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project:
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 4
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: false
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
eval_steps: 0.05
eval_table_size:
eval_table_max_new_tokens: 128
save_steps:
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
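The `is_qwen_derived_model: true` flag in these configs drives Qwen-specific tokenizer handling; a sketch of it, as it appears in the `load_tokenizer()` hunk later in this diff (one side of the diff removes this block):

```python
# Qwen's base tokenizer ships only a single special token, so missing token ids
# and names are backfilled from eod_id / "<|endoftext|>".
def patch_qwen_special_tokens(tokenizer):
    for attr_name in ("bos_token_id", "eos_token_id", "pad_token_id", "unk_token_id"):
        if getattr(tokenizer, attr_name) is None:
            setattr(tokenizer, attr_name, tokenizer.eod_id)
    for attr_name in ("bos_token", "eos_token", "pad_token", "unk_token"):
        if getattr(tokenizer, attr_name) is None:
            setattr(tokenizer, attr_name, "<|endoftext|>")
```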
@@ -1,20 +1,22 @@
--extra-index-url https://download.pytorch.org/whl/cu118
--extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
auto-gptq==0.5.1
torch==2.0.1
auto-gptq==0.4.2
packaging
peft==0.6.0
transformers==4.35.1
transformers @ git+https://github.com/huggingface/transformers.git@acc394c4f5e1283c19783581790b3dc3105a3697
bitsandbytes>=0.41.1
accelerate==0.24.1
accelerate @ git+https://github.com/huggingface/accelerate@80da9cfb09bb3cc9f1b385cb55d6b90d025a5fd9
deepspeed
addict
fire
PyYAML>=6.0
datasets>=2.14.0
flash-attn==2.3.3
datasets
flash-attn>=2.3.0
sentencepiece
wandb
einops
xformers==0.0.22
xformers>=0.0.22
optimum==1.13.2
hf_transfer
colorama
@@ -30,9 +32,3 @@ pynvml
art
fschat==0.2.29
gradio
tensorboard

# remote filesystems
s3fs
gcsfs
# adlfs
@@ -46,7 +46,7 @@ def print_axolotl_text_art(suffix=None):
    ascii_text = " axolotl"
    if suffix:
        ascii_text += f" x {suffix}"
    ascii_art = text2art(ascii_text, font=font)
    ascii_art = text2art(" axolotl", font=font)

    if is_main_process():
        print(ascii_art)
@@ -11,7 +11,7 @@ from abc import abstractmethod
|
||||
from dataclasses import dataclass, field
|
||||
from functools import partial
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
from typing import Optional, Union
|
||||
|
||||
import torch
|
||||
import transformers
|
||||
@@ -31,6 +31,7 @@ from axolotl.utils.callbacks import (
|
||||
log_prediction_callback_factory,
|
||||
)
|
||||
from axolotl.utils.collators import BatchSamplerDataCollatorForSeq2Seq
|
||||
from axolotl.utils.dataloader import MultipackDistributedDataloader
|
||||
from axolotl.utils.samplers import MultipackBatchSampler
|
||||
from axolotl.utils.schedulers import get_cosine_schedule_with_quadratic_warmup
|
||||
|
||||
@@ -214,7 +215,9 @@ class AxolotlTrainer(Trainer):
|
||||
)
|
||||
return super().get_train_dataloader()
|
||||
|
||||
def get_eval_dataloader(self, eval_dataset: Optional[Dataset] = None) -> DataLoader:
|
||||
def get_eval_dataloader(
|
||||
self, eval_dataset: Optional[Dataset] = None
|
||||
) -> Union[DataLoader, MultipackDistributedDataloader]:
|
||||
if self.args.sample_packing and self.args.eval_sample_packing is not False:
|
||||
eval_dataset = (
|
||||
eval_dataset if eval_dataset is not None else self.eval_dataset
|
||||
@@ -257,7 +260,7 @@ class AxolotlTrainer(Trainer):
|
||||
def get_bench_dataloader(
|
||||
self,
|
||||
bench_dataset: Dataset,
|
||||
) -> DataLoader:
|
||||
) -> Union[DataLoader, MultipackDistributedDataloader]:
|
||||
dataloader_params = {
|
||||
"batch_size": self.args.eval_batch_size,
|
||||
"collate_fn": self.bench_data_collator,
|
||||
@@ -461,14 +464,11 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
|
||||
return AxolotlTrainer
|
||||
|
||||
def build(self, total_num_steps):
|
||||
warmup_steps = None
|
||||
if self.cfg.warmup_steps is not None:
|
||||
warmup_steps = self.cfg.warmup_steps
|
||||
elif self.cfg.warmup_ratio is not None:
|
||||
warmup_steps = max(int(self.cfg.warmup_ratio * total_num_steps), 0)
|
||||
else:
|
||||
warmup_steps = min(int(0.03 * total_num_steps), 100)
|
||||
|
||||
warmup_steps = (
|
||||
self.cfg.warmup_steps
|
||||
if self.cfg.warmup_steps is not None
|
||||
else min(int(0.03 * total_num_steps), 100)
|
||||
)
|
||||
logging_steps = (
|
||||
self.cfg.logging_steps
|
||||
if self.cfg.logging_steps is not None
|
||||
@@ -483,6 +483,10 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
|
||||
training_arguments_kwargs["fp16"] = (
|
||||
self.cfg.fp16 and not self.cfg.bf16
|
||||
) or False
|
||||
if self.cfg.fp8:
|
||||
training_arguments_kwargs["fp16"] = False
|
||||
training_arguments_kwargs["bf16"] = False
|
||||
|
||||
training_arguments_kwargs["tf32"] = self.cfg.tf32
|
||||
training_arguments_kwargs["warmup_steps"] = warmup_steps
|
||||
training_arguments_kwargs["logging_steps"] = logging_steps
|
||||
@@ -546,16 +550,16 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
|
||||
"dataloader_prefetch_factor"
|
||||
] = self.cfg.dataloader_prefetch_factor
|
||||
|
||||
if self.cfg.val_set_size == 0:
|
||||
# no eval set, so don't eval
|
||||
training_arguments_kwargs["evaluation_strategy"] = "no"
|
||||
elif self.cfg.eval_steps:
|
||||
if self.cfg.eval_steps:
|
||||
training_arguments_kwargs["evaluation_strategy"] = "steps"
|
||||
training_arguments_kwargs["eval_steps"] = self.cfg.eval_steps
|
||||
elif self.cfg.evaluation_strategy:
|
||||
training_arguments_kwargs[
|
||||
"evaluation_strategy"
|
||||
] = self.cfg.evaluation_strategy
|
||||
elif self.cfg.val_set_size == 0:
|
||||
# no eval set, so don't eval
|
||||
training_arguments_kwargs["evaluation_strategy"] = "no"
|
||||
else:
|
||||
# we have an eval set, but no steps defined, default to use epoch
|
||||
training_arguments_kwargs["evaluation_strategy"] = "epoch"
|
||||
@@ -661,9 +665,7 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
|
||||
self.cfg.sample_packing if self.cfg.sample_packing else False
|
||||
)
|
||||
training_arguments_kwargs["eval_sample_packing"] = (
|
||||
self.cfg.sample_packing
|
||||
if self.cfg.eval_sample_packing is not False
|
||||
else False
|
||||
self.cfg.sample_packing if self.cfg.sample_packing else False
|
||||
)
|
||||
training_arguments_kwargs[
|
||||
"sample_packing_seq_len_multiplier"
|
||||
|
||||
@@ -3,6 +3,4 @@ MixFormers model architecture used for phi models
|
||||
"""
|
||||
|
||||
from .configuration_mixformer_sequential import MixFormerSequentialConfig # noqa
|
||||
from .configuration_phi import PhiConfig # noqa
|
||||
from .modeling_mixformer_sequential import MixFormerSequentialForCausalLM # noqa
|
||||
from .modeling_phi import PhiForCausalLM # noqa
|
||||
|
||||
@@ -1,65 +0,0 @@
|
||||
# pylint: skip-file
|
||||
# Copyright (c) Microsoft Corporation.
|
||||
# Licensed under the MIT license.
|
||||
|
||||
import math
|
||||
from typing import Optional
|
||||
|
||||
from transformers import PretrainedConfig
|
||||
|
||||
|
||||
class PhiConfig(PretrainedConfig):
|
||||
"""Phi configuration."""
|
||||
|
||||
model_type = "phi"
|
||||
attribute_map = {
|
||||
"max_position_embeddings": "n_positions",
|
||||
"hidden_size": "n_embd",
|
||||
"num_attention_heads": "n_head",
|
||||
"num_hidden_layers": "n_layer",
|
||||
}
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
vocab_size: int = 50304,
|
||||
n_positions: int = 2048,
|
||||
n_embd: int = 1024,
|
||||
n_layer: int = 20,
|
||||
n_inner: Optional[int] = None,
|
||||
n_head: int = 16,
|
||||
n_head_kv: Optional[int] = None,
|
||||
rotary_dim: Optional[int] = 32,
|
||||
activation_function: Optional[str] = "gelu_new",
|
||||
flash_attn: bool = False,
|
||||
flash_rotary: bool = False,
|
||||
fused_dense: bool = False,
|
||||
attn_pdrop: float = 0.0,
|
||||
embd_pdrop: float = 0.0,
|
||||
resid_pdrop: float = 0.0,
|
||||
layer_norm_epsilon: float = 1e-5,
|
||||
initializer_range: float = 0.02,
|
||||
tie_word_embeddings: bool = False,
|
||||
pad_vocab_size_multiple: int = 64,
|
||||
**kwargs
|
||||
) -> None:
|
||||
self.vocab_size = int(
|
||||
math.ceil(vocab_size / pad_vocab_size_multiple) * pad_vocab_size_multiple
|
||||
)
|
||||
self.n_positions = n_positions
|
||||
self.n_embd = n_embd
|
||||
self.n_layer = n_layer
|
||||
self.n_inner = n_inner
|
||||
self.n_head = n_head
|
||||
self.n_head_kv = n_head_kv
|
||||
self.rotary_dim = min(rotary_dim, n_embd // n_head)
|
||||
self.activation_function = activation_function
|
||||
self.flash_attn = flash_attn
|
||||
self.flash_rotary = flash_rotary
|
||||
self.fused_dense = fused_dense
|
||||
self.attn_pdrop = attn_pdrop
|
||||
self.embd_pdrop = embd_pdrop
|
||||
self.resid_pdrop = resid_pdrop
|
||||
self.layer_norm_epsilon = layer_norm_epsilon
|
||||
self.initializer_range = initializer_range
|
||||
|
||||
super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)
|
||||
(File diff suppressed because it is too large.)
@@ -321,8 +321,6 @@ def flashattn_forward(
|
||||
# only on first autoregressive step q,k,v have same seqlen
|
||||
is_causal = key_states.shape == query_states.shape
|
||||
|
||||
dropout_rate = 0.0 if not self.training else getattr(self, "attention_dropout", 0.0)
|
||||
|
||||
if cu_seqlens is not None and max_seqlen is not None and cu_seqlens.dim() == 1:
|
||||
# special handling using sample packing
|
||||
qkv = torch.stack(
|
||||
@@ -332,12 +330,7 @@ def flashattn_forward(
|
||||
qkv = rearrange(qkv, "b s ... -> (b s) ...")
|
||||
|
||||
output = flash_attn_varlen_qkvpacked_func(
|
||||
qkv,
|
||||
cu_seqlens,
|
||||
max_seqlen,
|
||||
dropout_p=dropout_rate,
|
||||
softmax_scale=None,
|
||||
causal=True,
|
||||
qkv, cu_seqlens, max_seqlen, 0.0, softmax_scale=None, causal=True
|
||||
)
|
||||
output = rearrange(output, "(b s) ... -> b s ...", b=bsz)
|
||||
elif query_states.shape == key_states.shape:
|
||||
@@ -360,7 +353,7 @@ def flashattn_forward(
|
||||
qkv_unpad,
|
||||
cu_seqlens_q,
|
||||
max_seqlen_q,
|
||||
dropout_p=dropout_rate,
|
||||
0.0,
|
||||
softmax_scale=None,
|
||||
causal=is_causal,
|
||||
)
|
||||
@@ -373,7 +366,6 @@ def flashattn_forward(
|
||||
output = flash_attn_kvpacked_func(
|
||||
query_states,
|
||||
torch.stack([key_states, value_states], 2),
|
||||
dropout_p=dropout_rate,
|
||||
causal=is_causal,
|
||||
)
|
||||
else:
|
||||
@@ -406,7 +398,7 @@ def flashattn_forward(
|
||||
cu_seqlens_k,
|
||||
max_seqlen_q,
|
||||
max_seqlen_k,
|
||||
dropout_p=dropout_rate,
|
||||
0.0,
|
||||
softmax_scale=None,
|
||||
causal=is_causal,
|
||||
)
|
||||
|
||||
@@ -25,8 +25,6 @@ def sdp_attention_forward(
|
||||
past_key_value: Optional[Tuple[torch.Tensor]] = None,
|
||||
output_attentions: bool = False,
|
||||
use_cache: bool = False,
|
||||
padding_mask: Optional[torch.LongTensor] = None, # pylint: disable=unused-argument
|
||||
**kwargs, # pylint: disable=unused-argument
|
||||
) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
|
||||
# pylint: disable=duplicate-code
|
||||
bsz, q_len, _ = hidden_states.size()
|
||||
|
||||
@@ -29,8 +29,6 @@ def xformers_forward(
|
||||
past_key_value: Optional[Tuple[torch.Tensor]] = None,
|
||||
output_attentions: bool = False,
|
||||
use_cache: bool = False,
|
||||
padding_mask: Optional[torch.LongTensor] = None, # pylint: disable=unused-argument
|
||||
**kwargs, # pylint: disable=unused-argument
|
||||
) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
|
||||
# pylint: disable=duplicate-code
|
||||
bsz, q_len, _ = hidden_states.size()
|
||||
|
||||
@@ -201,8 +201,6 @@ def flashattn_forward(
|
||||
# only on first autoregressive step q,k,v have same seqlen
|
||||
is_causal = key_states.shape == query_states.shape
|
||||
|
||||
dropout_rate = 0.0 if not self.training else getattr(self, "attention_dropout", 0.0)
|
||||
|
||||
if cu_seqlens is not None and max_seqlen is not None and cu_seqlens.dim() == 1:
|
||||
# special handling using sample packing
|
||||
qkv = torch.stack(
|
||||
@@ -215,7 +213,7 @@ def flashattn_forward(
|
||||
qkv,
|
||||
cu_seqlens,
|
||||
max_seqlen,
|
||||
dropout_p=dropout_rate,
|
||||
0.0,
|
||||
softmax_scale=None,
|
||||
causal=True,
|
||||
window_size=window_size,
|
||||
@@ -241,7 +239,7 @@ def flashattn_forward(
|
||||
qkv_unpad,
|
||||
cu_seqlens_q,
|
||||
max_seqlen_q,
|
||||
dropout_p=dropout_rate,
|
||||
0.0,
|
||||
softmax_scale=None,
|
||||
causal=is_causal,
|
||||
window_size=window_size,
|
||||
@@ -255,7 +253,6 @@ def flashattn_forward(
|
||||
output = flash_attn_kvpacked_func(
|
||||
query_states,
|
||||
torch.stack([key_states, value_states], 2),
|
||||
dropout_p=dropout_rate,
|
||||
causal=is_causal,
|
||||
window_size=window_size,
|
||||
)
|
||||
@@ -289,7 +286,7 @@ def flashattn_forward(
|
||||
cu_seqlens_k,
|
||||
max_seqlen_q,
|
||||
max_seqlen_k,
|
||||
dropout_p=dropout_rate,
|
||||
0.0,
|
||||
softmax_scale=None,
|
||||
causal=is_causal,
|
||||
window_size=window_size,
|
||||
|
||||
@@ -1,7 +1,6 @@
|
||||
"""
|
||||
Basic completion text
|
||||
"""
|
||||
import json
|
||||
from collections import defaultdict
|
||||
from typing import Any, Dict, Generator, Optional, Tuple
|
||||
|
||||
@@ -65,19 +64,6 @@ class CompletionPromptTokenizingStrategy(InstructionPromptTokenizingStrategy):
|
||||
return next(iter(self.prompter.build_prompt(instruction, input, response)))
|
||||
|
||||
|
||||
class CompletionJSONPromptTokenizationStrategy(CompletionPromptTokenizingStrategy):
|
||||
"""
|
||||
Strategy to return the stringified JSON of the entire row as the training data
|
||||
"""
|
||||
|
||||
def parse_instruction_fields(self, prompt) -> Tuple[str, str, str]:
|
||||
return (
|
||||
json.dumps(prompt),
|
||||
"",
|
||||
"",
|
||||
)
|
||||
|
||||
|
||||
class CompletionPrompter:
|
||||
"""
|
||||
Prompter for completion
|
||||
@@ -96,7 +82,7 @@ def load(tokenizer, cfg, ds_cfg: Optional[Dict[str, Any]] = None):
|
||||
strat = CompletionPromptTokenizingStrategy(
|
||||
CompletionPrompter(),
|
||||
tokenizer,
|
||||
True,
|
||||
cfg.train_on_inputs,
|
||||
cfg.sequence_len,
|
||||
max_length=cfg.sequence_len * 64,
|
||||
)
|
||||
@@ -104,15 +90,3 @@ def load(tokenizer, cfg, ds_cfg: Optional[Dict[str, Any]] = None):
|
||||
strat.field = ds_cfg["field"]
|
||||
|
||||
return strat
|
||||
|
||||
|
||||
def load_json(tokenizer, cfg):
|
||||
strat = CompletionJSONPromptTokenizationStrategy(
|
||||
CompletionPrompter(),
|
||||
tokenizer,
|
||||
True,
|
||||
cfg.sequence_len,
|
||||
max_length=cfg.sequence_len * 64,
|
||||
)
|
||||
|
||||
return strat
|
||||
|
||||
@@ -22,13 +22,7 @@ class PromptStyle(Enum):
|
||||
CHATML = "chatml"
|
||||
|
||||
|
||||
class Prompter:
|
||||
"""
|
||||
Base prompter class for all prompters
|
||||
"""
|
||||
|
||||
|
||||
class AlpacaPrompter(Prompter):
|
||||
class AlpacaPrompter:
|
||||
"""
|
||||
Base class for alpaca prompters
|
||||
"""
|
||||
@@ -75,7 +69,7 @@ class AlpacaPrompter(Prompter):
|
||||
else:
|
||||
res = (
|
||||
self.system_format.format(system=self.system_no_input_prompt)
|
||||
if self.system_no_input_prompt
|
||||
if self.system_prompt
|
||||
else ""
|
||||
) + self.turn_no_input_format.format(instruction=instruction)
|
||||
if output:
|
||||
@@ -165,7 +159,7 @@ class NomicGPT4AllPrompter(AlpacaPrompter):
|
||||
"""
|
||||
|
||||
|
||||
class ReflectAlpacaPrompter(Prompter):
|
||||
class ReflectAlpacaPrompter:
|
||||
"""
|
||||
Prompter for ReflectAlpaca
|
||||
"""
|
||||
@@ -260,7 +254,7 @@ SHAREGPT_ASSERTION_FAILED_ROLE = (
|
||||
)
|
||||
|
||||
|
||||
class ShareGPTPrompter(Prompter): # pylint: disable=too-few-public-methods
|
||||
class ShareGPTPrompter: # pylint: disable=too-few-public-methods
|
||||
"""
|
||||
A prompter that generates prompts for the ShareGPT
|
||||
"""
|
||||
@@ -355,7 +349,7 @@ class ShareGPTPrompterV2(ShareGPTPrompter):
|
||||
)
|
||||
|
||||
|
||||
class UnsupportedPrompter(Prompter):
|
||||
class UnsupportedPrompter:
|
||||
"""
|
||||
A dummy class for custom prompters
|
||||
"""
|
||||
|
||||
@@ -70,7 +70,9 @@ def normalize_config(cfg):
|
||||
else:
|
||||
torch.backends.cuda.matmul.allow_tf32 = cfg.tf32 or False
|
||||
|
||||
if cfg.bf16 or cfg.bfloat16:
|
||||
if cfg.fp8:
|
||||
cfg.torch_dtype = torch.bfloat16
|
||||
elif cfg.bf16 or cfg.bfloat16:
|
||||
cfg.torch_dtype = torch.bfloat16
|
||||
elif cfg.load_in_8bit or cfg.fp16 or cfg.float16:
|
||||
cfg.torch_dtype = torch.float16
|
||||
@@ -122,19 +124,6 @@ def normalize_config(cfg):
|
||||
or (cfg.model_type and "mistral" in cfg.model_type.lower())
|
||||
)
|
||||
|
||||
cfg.is_qwen_derived_model = (
|
||||
(
|
||||
hasattr(model_config, "model_type")
|
||||
and model_config.model_type
|
||||
in [
|
||||
"qwen",
|
||||
]
|
||||
)
|
||||
or cfg.is_qwen_derived_model
|
||||
or "qwen" in cfg.base_model.lower()
|
||||
or (cfg.model_type and "qwen" in cfg.model_type.lower())
|
||||
)
|
||||
|
||||
if isinstance(cfg.learning_rate, str):
|
||||
cfg.learning_rate = float(cfg.learning_rate)
|
||||
|
||||
@@ -178,11 +167,7 @@ def validate_config(cfg):
|
||||
"batch_size is not recommended. Please use gradient_accumulation_steps instead.",
|
||||
"To calculate the equivalent gradient_accumulation_steps, divide batch_size / micro_batch_size / number of gpus.",
|
||||
)
|
||||
if (
|
||||
cfg.eval_batch_size
|
||||
and cfg.micro_batch_size
|
||||
and cfg.eval_batch_size != cfg.micro_batch_size
|
||||
):
|
||||
if cfg.eval_batch_size != cfg.micro_batch_size:
|
||||
LOG.warning(
|
||||
"eval_batch_size != micro_batch_size. This can lead to VRAM instability."
|
||||
)
|
||||
@@ -386,17 +371,6 @@ def validate_config(cfg):
|
||||
"If you want to full finetune, please turn off load_in_8bit and load_in_4bit."
|
||||
)
|
||||
|
||||
if cfg.rope_scaling:
|
||||
LOG.warning("`rope_scaling` should now be be a key under `model_config`")
|
||||
|
||||
if cfg.warmup_steps and cfg.warmup_ratio:
|
||||
raise ValueError("warmup_steps and warmup_ratio are mutually exclusive")
|
||||
|
||||
if cfg.is_qwen_derived_model and cfg.gradient_checkpointing:
|
||||
LOG.warning(
|
||||
"Gradient checkpointing is broken for Qwen models for transformers>=4.35.0, except main branch."
|
||||
)
|
||||
|
||||
# TODO
|
||||
# MPT 7b
|
||||
# https://github.com/facebookresearch/bitsandbytes/issues/25
|
||||
|
||||
@@ -3,7 +3,7 @@ import functools
|
||||
import hashlib
|
||||
import logging
|
||||
from pathlib import Path
|
||||
from typing import Dict, List, Tuple, Union
|
||||
from typing import Any, Dict, List, Tuple, Union
|
||||
|
||||
import torch
|
||||
from datasets import (
|
||||
@@ -34,7 +34,6 @@ from axolotl.prompters import (
|
||||
JeopardyPrompter,
|
||||
MultipleChoiceConcisePrompter,
|
||||
MultipleChoiceExplainPrompter,
|
||||
Prompter,
|
||||
ReflectAlpacaPrompter,
|
||||
SummarizeTLDRPrompter,
|
||||
UnsupportedPrompter,
|
||||
@@ -79,14 +78,6 @@ def prepare_dataset(cfg, tokenizer):
|
||||
train_dataset, eval_dataset = process_datasets_for_packing(
|
||||
cfg, train_dataset, eval_dataset, tokenizer
|
||||
)
|
||||
|
||||
if eval_dataset and cfg.sample_packing and cfg.eval_sample_packing is not False:
|
||||
total_eval_steps = calculate_total_num_steps(cfg, eval_dataset, update=False)
|
||||
if total_eval_steps == 0:
|
||||
raise ValueError(
|
||||
"eval dataset split is too small for sample_packing. You should set `eval_sample_packing: False`. "
|
||||
)
|
||||
|
||||
if cfg.max_steps:
|
||||
total_num_steps = min(
|
||||
calculate_total_num_steps(cfg, train_dataset), cfg.max_steps
|
||||
@@ -99,7 +90,7 @@ def prepare_dataset(cfg, tokenizer):
|
||||
|
||||
def load_tokenized_prepared_datasets(
|
||||
tokenizer, cfg, default_dataset_prepared_path
|
||||
) -> Tuple[DatasetDict, List[Prompter]]:
|
||||
) -> DatasetDict:
|
||||
tokenizer_name = tokenizer.__class__.__name__
|
||||
ds_hash = str(
|
||||
md5(
|
||||
@@ -107,12 +98,7 @@ def load_tokenized_prepared_datasets(
|
||||
str(cfg.sequence_len)
|
||||
+ "@"
|
||||
+ "|".join(
|
||||
sorted(
|
||||
[
|
||||
f"{d.path}:{d.type}:{d.shards}:{d.conversation}"
|
||||
for d in cfg.datasets
|
||||
]
|
||||
)
|
||||
sorted([f"{d.path}:{d.type}:{d.shards}" for d in cfg.datasets])
|
||||
)
|
||||
+ "|"
|
||||
+ tokenizer_name
|
||||
@@ -178,66 +164,6 @@ def load_tokenized_prepared_datasets(
|
||||
except (FileNotFoundError, ConnectionError):
|
||||
pass
|
||||
|
||||
ds_from_cloud = False
|
||||
storage_options = {}
|
||||
remote_file_system = None
|
||||
if config_dataset.path.startswith("s3://"):
|
||||
try:
|
||||
import aiobotocore.session # type: ignore
|
||||
import s3fs # type: ignore
|
||||
except ImportError as exc:
|
||||
raise ImportError(
|
||||
"s3:// paths require aiobotocore and s3fs to be installed"
|
||||
) from exc
|
||||
|
||||
# Takes credentials from ~/.aws/credentials for default profile
|
||||
s3_session = aiobotocore.session.AioSession(profile="default")
|
||||
storage_options = {"session": s3_session}
|
||||
remote_file_system = s3fs.S3FileSystem(**storage_options)
|
||||
elif config_dataset.path.startswith(
|
||||
"gs://"
|
||||
) or config_dataset.path.startswith("gcs://"):
|
||||
try:
|
||||
import gcsfs # type: ignore
|
||||
except ImportError as exc:
|
||||
raise ImportError(
|
||||
"gs:// or gcs:// paths require gcsfs to be installed"
|
||||
) from exc
|
||||
|
||||
# gcsfs will use default credentials from the environment else anon
|
||||
# https://gcsfs.readthedocs.io/en/latest/#credentials
|
||||
storage_options = {"token": None}
|
||||
remote_file_system = gcsfs.GCSFileSystem(**storage_options)
|
||||
# TODO: Figure out how to get auth creds passed
|
||||
# elif config_dataset.path.startswith("adl://") or config_dataset.path.startswith("abfs://"):
|
||||
# try:
|
||||
# import adlfs
|
||||
# except ImportError as exc:
|
||||
# raise ImportError(
|
||||
# "adl:// or abfs:// paths require adlfs to be installed"
|
||||
# ) from exc
|
||||
|
||||
# # Gen 1
|
||||
# storage_options = {
|
||||
# "tenant_id": TENANT_ID,
|
||||
# "client_id": CLIENT_ID,
|
||||
# "client_secret": CLIENT_SECRET,
|
||||
# }
|
||||
# # Gen 2
|
||||
# storage_options = {
|
||||
# "account_name": ACCOUNT_NAME,
|
||||
# "account_key": ACCOUNT_KEY,
|
||||
# }
|
||||
|
||||
# remote_file_system = adlfs.AzureBlobFileSystem(**storage_options)
|
||||
try:
|
||||
if remote_file_system and remote_file_system.exists(
|
||||
config_dataset.path
|
||||
):
|
||||
ds_from_cloud = True
|
||||
except (FileNotFoundError, ConnectionError):
|
||||
pass
|
||||
|
||||
# prefer local dataset, even if hub exists
|
||||
local_path = Path(config_dataset.path)
|
||||
if local_path.exists():
|
||||
@@ -251,8 +177,17 @@ def load_tokenized_prepared_datasets(
|
||||
split=None,
|
||||
)
|
||||
elif local_path.is_file():
|
||||
ds_type = get_ds_type(config_dataset)
|
||||
|
||||
ds_type = "json"
|
||||
if config_dataset.ds_type:
|
||||
ds_type = config_dataset.ds_type
|
||||
elif ".parquet" in config_dataset.path:
|
||||
ds_type = "parquet"
|
||||
elif ".arrow" in config_dataset.path:
|
||||
ds_type = "arrow"
|
||||
elif ".csv" in config_dataset.path:
|
||||
ds_type = "csv"
|
||||
elif ".txt" in config_dataset.path:
|
||||
ds_type = "text"
|
||||
ds = load_dataset(
|
||||
ds_type,
|
||||
name=config_dataset.name,
|
||||
@@ -272,22 +207,6 @@ def load_tokenized_prepared_datasets(
|
||||
data_files=config_dataset.data_files,
|
||||
token=use_auth_token,
|
||||
)
|
||||
elif ds_from_cloud and remote_file_system:
|
||||
if remote_file_system.isdir(config_dataset.path):
|
||||
ds = load_from_disk(
|
||||
config_dataset.path,
|
||||
storage_options=storage_options,
|
||||
)
|
||||
elif remote_file_system.isfile(config_dataset.path):
|
||||
ds_type = get_ds_type(config_dataset)
|
||||
ds = load_dataset(
|
||||
ds_type,
|
||||
name=config_dataset.name,
|
||||
data_files=config_dataset.path,
|
||||
streaming=False,
|
||||
split=None,
|
||||
storage_options=storage_options,
|
||||
)
|
||||
else:
|
||||
if isinstance(config_dataset.data_files, str):
|
||||
fp = hf_hub_download(
|
||||
@@ -379,29 +298,11 @@ def load_tokenized_prepared_datasets(
|
||||
return dataset, prompters
|
||||
|
||||
|
||||
def get_ds_type(config_dataset: DictDefault):
|
||||
"""
|
||||
Get the dataset type from the path if it's not specified
|
||||
"""
|
||||
ds_type = "json"
|
||||
if config_dataset.ds_type:
|
||||
ds_type = config_dataset.ds_type
|
||||
elif ".parquet" in config_dataset.path:
|
||||
ds_type = "parquet"
|
||||
elif ".arrow" in config_dataset.path:
|
||||
ds_type = "arrow"
|
||||
elif ".csv" in config_dataset.path:
|
||||
ds_type = "csv"
|
||||
elif ".txt" in config_dataset.path:
|
||||
ds_type = "text"
|
||||
return ds_type
|
||||
|
||||
|
||||
def load_prepare_datasets(
|
||||
tokenizer: PreTrainedTokenizerBase,
|
||||
cfg,
|
||||
default_dataset_prepared_path,
|
||||
) -> Tuple[Dataset, Dataset, List[Prompter]]:
|
||||
) -> Tuple[Dataset, Dataset, List[Any]]:
|
||||
max_packed_sequence_len = (
|
||||
cfg.max_packed_sequence_len if cfg.max_packed_sequence_len else cfg.sequence_len
|
||||
)
|
||||
@@ -410,7 +311,7 @@ def load_prepare_datasets(
|
||||
) # make sure we don't accidentally set it larger than sequence_len
|
||||
|
||||
tokenizer_name = tokenizer.__class__.__name__
|
||||
prompters: List[Prompter] = []
|
||||
prompters = []
|
||||
if cfg.max_packed_sequence_len is not None:
|
||||
# see if we can go ahead and load the stacked dataset
|
||||
seed = f"@{str(cfg.seed)}" if cfg.seed else ""
|
||||
@@ -544,13 +445,14 @@ def load_prepare_datasets(
|
||||
train_fingerprint = md5(to_hash_train)
|
||||
test_fingerprint = md5(to_hash_test)
|
||||
|
||||
dataset = dataset.train_test_split(
|
||||
test_size=cfg.val_set_size,
|
||||
shuffle=False,
|
||||
seed=cfg.seed or 42,
|
||||
train_new_fingerprint=train_fingerprint,
|
||||
test_new_fingerprint=test_fingerprint,
|
||||
)
|
||||
with zero_first(is_main_process()):
|
||||
dataset = dataset.train_test_split(
|
||||
test_size=cfg.val_set_size,
|
||||
shuffle=False,
|
||||
seed=cfg.seed or 42,
|
||||
train_new_fingerprint=train_fingerprint,
|
||||
test_new_fingerprint=test_fingerprint,
|
||||
)
|
||||
|
||||
train_dataset = dataset["train"]
|
||||
eval_dataset = dataset["test"]
|
||||
|
||||
src/axolotl/utils/dataloader.py (new file, 342 lines)
@@ -0,0 +1,342 @@
|
||||
# pylint: skip-file
|
||||
import hashlib
|
||||
import itertools
|
||||
import logging
|
||||
import math
|
||||
import time
|
||||
from queue import Queue
|
||||
from threading import Thread
|
||||
from typing import Any, Callable, List, Union
|
||||
|
||||
import numba
|
||||
import numpy as np
|
||||
from torch.utils.data import DistributedSampler, Sampler
|
||||
|
||||
LOG = logging.getLogger("axolotl.utils.dataloader")
|
||||
|
||||
|
||||
@numba.njit
|
||||
def ffd_check(a: np.ndarray, c: int, n: int):
|
||||
# First-fit-decreasing bin packing
|
||||
# Check if a[] could fit in n bins with capacity c
|
||||
# https://en.wikipedia.org/wiki/First-fit-decreasing_bin_packing
|
||||
|
||||
a = np.sort(a)[::-1]
|
||||
bins = np.full((n,), c, dtype=a.dtype)
|
||||
for size in a:
|
||||
not_found = True
|
||||
for idx in range(n):
|
||||
if bins[idx] >= size:
|
||||
bins[idx] -= size
|
||||
not_found = False
|
||||
break
|
||||
|
||||
if not_found:
|
||||
return False
|
||||
|
||||
return True
|
||||
|
||||
|
||||
@numba.njit
|
||||
def ffd_with_result(a: np.ndarray, c: int, start_index: int):
|
||||
# First-fit-decreasing bin packing (with result return)
|
||||
|
||||
indices = np.argsort(a)[::-1]
|
||||
a = a[indices]
|
||||
|
||||
bins: List[Any] = []
|
||||
bins_result: List[Any] = []
|
||||
for a_id, size in enumerate(a):
|
||||
add_new = True
|
||||
for idx in range(len(bins)):
|
||||
if bins[idx] >= size:
|
||||
bins[idx] -= size
|
||||
bins_result[idx].append(indices[a_id] + start_index)
|
||||
add_new = False
|
||||
break
|
||||
|
||||
if add_new:
|
||||
bins.append(c - size)
|
||||
bins_result.append([indices[a_id] + start_index])
|
||||
|
||||
return bins_result, len(a)
|
||||
|
||||
|
||||
@numba.njit
|
||||
def allocate(
|
||||
lengths: np.ndarray, lengths_cumsum: np.ndarray, rank: int, c: int, n: int
|
||||
):
|
||||
"""
|
||||
:param lengths: array of lengths of each sample
|
||||
:param lengths_cumsum: cumulative sum of consecutive lengths
|
||||
:param rank: rank for this process
|
||||
:param c: length of tokens per batch
|
||||
:param n: number of ranks
|
||||
:return:
|
||||
"""
|
||||
# Dynamic batch allocator, similar to Multifit
|
||||
# https://en.wikipedia.org/wiki/Multifit_algorithm
|
||||
# ~99.5% efficiency on OpenChat training set (12 * 2048 ctx len)
|
||||
|
||||
s = 0
|
||||
start_index = 0
|
||||
result = []
|
||||
result_totseqs = []
|
||||
|
||||
while True:
|
||||
# binary search [left, right)
|
||||
left = 1
|
||||
right = 1 + np.searchsorted(lengths_cumsum[start_index:], s + c * n, "right")
|
||||
|
||||
while right - left > 1:
|
||||
mid = (left + right) // 2
|
||||
if ffd_check(lengths[start_index : start_index + mid], c, n):
|
||||
left = mid
|
||||
else:
|
||||
right = mid
|
||||
|
||||
# use length left
|
||||
batch, tot_seqs = ffd_with_result(
|
||||
lengths[start_index : start_index + left], c, start_index
|
||||
)
|
||||
if len(batch) < n:
|
||||
break
|
||||
|
||||
start_index += left
|
||||
s = lengths_cumsum[start_index - 1]
|
||||
|
||||
# add local rank
|
||||
result.append(batch[rank])
|
||||
# add total seqs for all ranks
|
||||
result_totseqs.append(tot_seqs)
|
||||
# yield batch[rank], tot_seqs, s, len(result) * c * n
|
||||
return result, result_totseqs, s, len(result) * c * n
|
||||
|
||||
|
||||
def chunk(iterable, n):
|
||||
"""
|
||||
Chunk data into tuples of length n
|
||||
"""
|
||||
# batched('ABCDEFG', 3) --> ABC DEF G
|
||||
if n < 1:
|
||||
raise ValueError("n must be at least one")
|
||||
it = iter(iterable)
|
||||
while batch := tuple(itertools.islice(it, n)):
|
||||
yield batch
|
||||
|
||||
|
||||
def hash_indices(lst: List[int]) -> str:
|
||||
# Convert the list of integers to a string representation
|
||||
concatenated = ",".join(map(str, lst))
|
||||
|
||||
# Generate the hash
|
||||
sha256 = hashlib.sha256()
|
||||
sha256.update(concatenated.encode())
|
||||
|
||||
return sha256.hexdigest()
|
||||
|
||||
|
||||
class MultipackDistributedDataloader:
|
||||
"""Unpadded data loading using Multipack.
|
||||
Adapted from https://github.com/imoneoi/openchat/blob/v3_fix_mle_loss/ochat/training_deepspeed/multipack_dataloader.py
|
||||
Approximate (at most ~1.22x) the optimal solution of the identical-machines scheduling problem, which is NP-hard.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
dataset: Any,
|
||||
collate_fn: Callable,
|
||||
seq_max_length: int = 2048,
|
||||
batch_size: int = 1,
|
||||
sampler: Union[Sampler, DistributedSampler] = None,
|
||||
packing_efficiency_estimate: float = 1.0,
|
||||
sample_packing_seq_len_multiplier: int = 1,
|
||||
device_count: int = 1,
|
||||
prefetch_max: int = 1000,
|
||||
num_epochs: int = 1,
|
||||
):
|
||||
# Dataset
|
||||
self.dataset = dataset
|
||||
self.lengths = (
|
||||
dataset.data.column("position_ids")
|
||||
.to_pandas()
|
||||
.apply(lambda x: x[-1] + 1)
|
||||
.values
|
||||
)
|
||||
assert isinstance(self.lengths, np.ndarray)
|
||||
assert batch_size % sample_packing_seq_len_multiplier == 0
|
||||
assert batch_size >= sample_packing_seq_len_multiplier
|
||||
self.sampler = sampler
|
||||
self.batch_size = batch_size
|
||||
self.sample_packing_seq_len_multiplier = sample_packing_seq_len_multiplier
|
||||
self.seq_max_length = seq_max_length
|
||||
self.batch_max_length = batch_size * seq_max_length
|
||||
self.collate_fn = collate_fn
|
||||
self.num_epochs = num_epochs
|
||||
|
||||
self.num_replicas = 1
|
||||
self.rank = 0
|
||||
|
||||
# statistics
|
||||
self.eff_total_used = 0
|
||||
self.eff_total_slots = 0
|
||||
self.packing_efficiency_estimate = packing_efficiency_estimate or 1.0
|
||||
self.device_count = device_count
|
||||
|
||||
# maxsize is maximum number of samples in queue
|
||||
self.prefetch_max = prefetch_max
|
||||
self.queue: Queue = Queue(maxsize=prefetch_max)
|
||||
self.thread = None
|
||||
|
||||
def _worker(self):
|
||||
LOG.info(
|
||||
f"[WORKER] Epochs: {self.num_epochs}, Samples: {self.len_w_stats()*self.batch_size}"
|
||||
)
|
||||
for epoch in range(self.num_epochs):
|
||||
for sample in self._internal_batch_generator():
|
||||
while True:
|
||||
if self.queue.full():
|
||||
time.sleep(1)
|
||||
else:
|
||||
break
|
||||
self.queue.put(sample)
|
||||
|
||||
# stop the queue when epoch is done
|
||||
self.queue.put(None)
|
||||
|
||||
def __iter__(self):
|
||||
if hasattr(self.sampler, "set_epoch"):
|
||||
new_epoch = self.sampler.epoch + 1
|
||||
self.sampler.set_epoch(new_epoch)
|
||||
LOG.info(f"calling sampler.set_epoch({new_epoch})")
|
||||
|
||||
if self.thread is None:
|
||||
self.thread = Thread(target=self._worker, daemon=True)
|
||||
self.thread.start()
|
||||
|
||||
while True:
|
||||
item = self.queue.get()
|
||||
|
||||
if item is None:
|
||||
break
|
||||
yield item
|
||||
|
||||
def generate_batches(self, set_stats=False):
|
||||
LOG.info("generating packed batches")
|
||||
if self.sampler:
|
||||
indices = [idx for idx in self.sampler]
|
||||
else:
|
||||
indices = range(0, len(self.dataset))
|
||||
|
||||
LOG.info(hash_indices(indices))
|
||||
lengths = self.lengths[indices]
|
||||
lengths_cumsum = np.cumsum(lengths)
|
||||
|
||||
batches, totseqs, total_used, total_slots = allocate(
|
||||
lengths=lengths,
|
||||
lengths_cumsum=lengths_cumsum,
|
||||
rank=self.rank,
|
||||
# c=self.batch_max_length,
|
||||
c=self.seq_max_length * self.sample_packing_seq_len_multiplier,
|
||||
n=self.num_replicas,
|
||||
)
|
||||
|
||||
batches = [[indices[b_idx] for b_idx in batch] for batch in batches]
|
||||
|
||||
# statistics
|
||||
if set_stats:
|
||||
self.eff_total_used += total_used
|
||||
self.eff_total_slots += total_slots
|
||||
|
||||
return batches, totseqs
|
||||
|
||||
def _internal_batch_generator(self):
|
||||
all_batches, _ = self.generate_batches(set_stats=True)
|
||||
features = self.dataset.features.keys()
|
||||
len_remaining = self._len_est()
|
||||
for batches in chunk(
|
||||
all_batches, self.batch_size // self.sample_packing_seq_len_multiplier
|
||||
):
|
||||
chunked_data = []
|
||||
attn_mask_cum_idx = 0
|
||||
for batch in batches:
|
||||
concatenated = {}
|
||||
batched_data = [self.dataset[batch_idx] for batch_idx in batch]
|
||||
for feature in features:
|
||||
if feature == "length":
|
||||
continue
|
||||
if feature == "attention_mask":
|
||||
arrays = [
|
||||
(attn_mask_cum_idx + idx + 1) * np.array(item[feature])
|
||||
for idx, item in enumerate(batched_data)
|
||||
if feature in item
|
||||
]
|
||||
attn_mask_cum_idx += len(batched_data)
|
||||
concatenated[feature] = np.concatenate(arrays)
|
||||
else:
|
||||
arrays = [
|
||||
np.array(item[feature])
|
||||
for item in batched_data
|
||||
if feature in item
|
||||
]
|
||||
concatenated[feature] = np.concatenate(arrays)
|
||||
chunked_data.append(concatenated)
|
||||
yield self.collate_fn(chunked_data)
|
||||
len_remaining -= 1
|
||||
if not len_remaining:
|
||||
return
|
||||
# yield a no-op for cases where we don't have any data left to pack
|
||||
for i in range(0, len_remaining):
|
||||
yield self.collate_fn(
|
||||
[
|
||||
{
|
||||
"input_ids": [0],
|
||||
"labels": [-100],
|
||||
"attention_mask": [True],
|
||||
"position_ids": [0],
|
||||
}
|
||||
]
|
||||
)
|
||||
|
||||
def _len_est(self):
|
||||
lengths_sum = np.sum(self.lengths)
|
||||
lengths_sum_per_device = lengths_sum // self.device_count
|
||||
LOG.info(
|
||||
f"packing_efficiency_estimate: {self.packing_efficiency_estimate} "
|
||||
f"total_num_tokens per device: {lengths_sum_per_device}"
|
||||
)
|
||||
|
||||
# shave off 1% + 1 for dealing with variance in packing from random sampler to sampler
|
||||
return (
|
||||
math.floor(
|
||||
0.99
|
||||
* lengths_sum_per_device
|
||||
/ self.packing_efficiency_estimate
|
||||
// self.seq_max_length
|
||||
// self.batch_size
|
||||
)
|
||||
- 1
|
||||
)
|
||||
|
||||
def __len__(self):
|
||||
# this doesn't return the actual length b/c with distributed samplers, not all dataloaders get
|
||||
# the same share of total tokens
|
||||
# if not self.eff_total_used:
|
||||
# batches, _ = self.generate_batches(set_stats=True)
|
||||
# LOG.info(
|
||||
# f"packing_efficiency_estimate: {self.packing_efficiency_estimate} "
|
||||
# f"actual packing efficiency: {self.efficiency()}"
|
||||
# )
|
||||
return max(1, self._len_est())
|
||||
|
||||
def len_w_stats(self):
|
||||
if not self.eff_total_used:
|
||||
batches, _ = self.generate_batches(set_stats=True)
|
||||
LOG.info(
|
||||
f"packing_efficiency_estimate: {self.packing_efficiency_estimate} "
|
||||
f"actual packing efficiency: {self.efficiency()}"
|
||||
)
|
||||
return max(1, self._len_est())
|
||||
|
||||
def efficiency(self):
|
||||
return self.eff_total_used / self.eff_total_slots
|
||||
@@ -17,6 +17,7 @@ from transformers import ( # noqa: F401
|
||||
AutoTokenizer,
|
||||
BitsAndBytesConfig,
|
||||
GPTQConfig,
|
||||
LlamaConfig,
|
||||
PreTrainedModel,
|
||||
PreTrainedTokenizerBase,
|
||||
)
|
||||
@@ -31,14 +32,9 @@ LOG = logging.getLogger("axolotl")
|
||||
def load_model_config(cfg):
|
||||
model_config_name = cfg.base_model_config or cfg.base_model
|
||||
trust_remote_code = cfg.trust_remote_code is True
|
||||
model_config = AutoConfig.from_pretrained(
|
||||
return AutoConfig.from_pretrained(
|
||||
model_config_name, trust_remote_code=trust_remote_code
|
||||
)
|
||||
if cfg.model_config:
|
||||
for key, val in cfg.model_config.items():
|
||||
setattr(model_config, key, val)
|
||||
|
||||
return model_config
|
||||
|
||||
|
||||
def load_tokenizer(cfg):
|
||||
@@ -55,7 +51,7 @@ def load_tokenizer(cfg):
|
||||
if cfg.tokenizer_type:
|
||||
tokenizer_cls = getattr(transformers, cfg.tokenizer_type)
|
||||
|
||||
tokenizer_config = cfg.tokenizer_config or cfg.base_model_config or cfg.base_model
|
||||
tokenizer_config = cfg.tokenizer_config or cfg.base_model_config
|
||||
tokenizer = tokenizer_cls.from_pretrained(
|
||||
tokenizer_config,
|
||||
trust_remote_code=cfg.trust_remote_code or False,
|
||||
@@ -84,18 +80,6 @@ def load_tokenizer(cfg):
|
||||
if cfg.is_mistral_derived_model and cfg.flash_attention and not cfg.sample_packing:
|
||||
tokenizer.padding_side = "left"
|
||||
|
||||
# Qwen base only has single token, so we need to set the special tokens
|
||||
if cfg.is_qwen_derived_model:
|
||||
token_ids = ["bos_token_id", "eos_token_id", "pad_token_id", "unk_token_id"]
|
||||
for attr_name in token_ids:
|
||||
if getattr(tokenizer, attr_name) is None:
|
||||
setattr(tokenizer, attr_name, tokenizer.eod_id)
|
||||
|
||||
token_names = ["bos_token", "eos_token", "pad_token", "unk_token"]
|
||||
for attr_name in token_names:
|
||||
if getattr(tokenizer, attr_name) is None:
|
||||
setattr(tokenizer, attr_name, "<|endoftext|>")
|
||||
|
||||
if cfg.special_tokens:
|
||||
for k, val in cfg.special_tokens.items():
|
||||
tokenizer.add_special_tokens(
|
||||
@@ -126,6 +110,7 @@ def load_model(
|
||||
Load a model for a given configuration and tokenizer.
|
||||
"""
|
||||
base_model = cfg.base_model
|
||||
base_model_config = cfg.base_model_config
|
||||
model_type = cfg.model_type
|
||||
model_config = load_model_config(cfg)
|
||||
|
||||
@@ -253,9 +238,16 @@ def load_model(
|
||||
if cfg.is_llama_derived_model and not cfg.trust_remote_code and not cfg.gptq:
|
||||
from transformers import LlamaForCausalLM
|
||||
|
||||
config_kwargs = {}
|
||||
if cfg.rope_scaling:
|
||||
config_kwargs["rope_scaling"] = cfg.rope_scaling
|
||||
config = LlamaConfig.from_pretrained(
|
||||
base_model_config,
|
||||
**config_kwargs,
|
||||
)
|
||||
model = LlamaForCausalLM.from_pretrained(
|
||||
base_model,
|
||||
config=model_config,
|
||||
config=config,
|
||||
load_in_8bit=cfg.load_in_8bit and cfg.adapter is not None,
|
||||
load_in_4bit=cfg.load_in_4bit and cfg.adapter is not None,
|
||||
**model_kwargs,
|
||||
@@ -300,10 +292,10 @@ def load_model(
|
||||
# device=cfg.device,
|
||||
# )
|
||||
# model.train() # sets to train instead of eval mode
|
||||
elif model_type == "PhiForCausalLM":
|
||||
from axolotl.models.phi import PhiForCausalLM
|
||||
elif model_type == "MixFormerSequentialForCausalLM":
|
||||
from axolotl.models.phi import MixFormerSequentialForCausalLM
|
||||
|
||||
model = PhiForCausalLM.from_pretrained(
|
||||
model = MixFormerSequentialForCausalLM.from_pretrained(
|
||||
base_model,
|
||||
load_in_8bit=cfg.load_in_8bit and cfg.adapter is not None,
|
||||
load_in_4bit=cfg.load_in_4bit and cfg.adapter is not None,
|
||||
@@ -313,55 +305,66 @@ def load_model(
|
||||
if cfg.gptq:
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
base_model,
|
||||
config=model_config,
|
||||
trust_remote_code=cfg.trust_remote_code or False,
|
||||
**model_kwargs,
|
||||
)
|
||||
else:
|
||||
model = getattr(transformers, model_type).from_pretrained(
|
||||
base_model,
|
||||
config=model_config,
|
||||
load_in_8bit=cfg.load_in_8bit and cfg.adapter is not None,
|
||||
load_in_4bit=cfg.load_in_4bit and cfg.adapter is not None,
|
||||
trust_remote_code=cfg.trust_remote_code or False,
|
||||
**model_kwargs,
|
||||
)
|
||||
else:
|
||||
config = AutoConfig.from_pretrained(
|
||||
base_model,
|
||||
trust_remote_code=cfg.trust_remote_code or False,
|
||||
)
|
||||
# Shouldn't be a problem most of the time. will obviously error if the model doesn't support this
|
||||
# when training starts
|
||||
if (
|
||||
hasattr(model_config, "max_seq_len")
|
||||
and model_config.max_seq_len
|
||||
and cfg.sequence_len > model_config.max_seq_len
|
||||
hasattr(config, "max_seq_len")
|
||||
and config.max_seq_len
|
||||
and cfg.sequence_len > config.max_seq_len
|
||||
):
|
||||
model_config.max_seq_len = cfg.sequence_len
|
||||
config.max_seq_len = cfg.sequence_len
|
||||
LOG.warning(f"increasing context length to {cfg.sequence_len}")
|
||||
elif (
|
||||
hasattr(model_config, "max_sequence_length")
|
||||
and model_config.max_sequence_length
|
||||
and cfg.sequence_len > model_config.max_sequence_length
|
||||
hasattr(config, "max_sequence_length")
|
||||
and config.max_sequence_length
|
||||
and cfg.sequence_len > config.max_sequence_length
|
||||
):
|
||||
model_config.max_sequence_length = cfg.sequence_len
|
||||
config.max_sequence_length = cfg.sequence_len
|
||||
LOG.warning(f"increasing context length to {cfg.sequence_len}")
|
||||
if cfg.gptq:
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
base_model,
|
||||
config=model_config,
|
||||
config=config,
|
||||
trust_remote_code=cfg.trust_remote_code or False,
|
||||
**model_kwargs,
|
||||
)
|
||||
else:
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
base_model,
|
||||
config=model_config,
|
||||
config=config,
|
||||
load_in_8bit=cfg.load_in_8bit and cfg.adapter is not None,
|
||||
load_in_4bit=cfg.load_in_4bit and cfg.adapter is not None,
|
||||
trust_remote_code=cfg.trust_remote_code or False,
|
||||
**model_kwargs,
|
||||
)
|
||||
except Exception as err: # pylint: disable=broad-exception-caught
|
||||
LOG.error(
|
||||
"Exception raised attempting to load model, retrying with AutoModelForCausalLM"
|
||||
)
|
||||
LOG.exception(err)
|
||||
raise err
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
base_model,
|
||||
load_in_8bit=cfg.load_in_8bit and cfg.adapter is not None,
|
||||
load_in_4bit=cfg.load_in_4bit and cfg.adapter is not None,
|
||||
trust_remote_code=cfg.trust_remote_code or False,
|
||||
**model_kwargs,
|
||||
)
|
||||
|
||||
embeddings_len = (
|
||||
math.ceil(len(tokenizer) / 32) * 32
|
||||
|
||||
@@ -181,16 +181,13 @@ class MultipackBatchSampler(BatchSampler):
|
||||
)
|
||||
|
||||
# shave off 1% + 1 for dealing with variance in packing from random sampler to sampler
|
||||
return max(
|
||||
0,
|
||||
(
|
||||
world_size
|
||||
* math.floor(
|
||||
0.99
|
||||
* lengths_sum_per_device
|
||||
/ self.packing_efficiency_estimate
|
||||
// self.batch_max_len
|
||||
)
|
||||
- 1
|
||||
),
|
||||
return (
|
||||
world_size
|
||||
* math.floor(
|
||||
0.99
|
||||
* lengths_sum_per_device
|
||||
/ self.packing_efficiency_estimate
|
||||
// self.batch_max_len
|
||||
)
|
||||
- 1
|
||||
)
|
||||
|
||||
@@ -141,35 +141,32 @@ def process_datasets_for_packing(cfg, train_dataset, eval_dataset, tokenizer):
    return train_dataset, eval_dataset


def calculate_total_num_steps(cfg, train_dataset, update=True):
    if not cfg.total_num_tokens:
        total_num_tokens = np.sum(
            train_dataset.data.column("input_ids")
            .to_pandas()
            .apply(lambda x: len(x))  # pylint: disable=unnecessary-lambda
            .values
        )
        LOG.debug(f"total_num_tokens: {total_num_tokens}", main_process_only=True)
        if update:
            cfg.total_num_tokens = total_num_tokens

    if not cfg.total_supervised_tokens:
        total_supervised_tokens = (
            train_dataset.data.column("labels")
            .to_pandas()
            .apply(lambda x: np.sum(np.array(x) != -100))
            .sum()
        )
        LOG.debug(
            f"`total_supervised_tokens: {total_supervised_tokens}`",
            main_process_only=True,
        )
        if update:
            cfg.total_supervised_tokens = total_supervised_tokens

def calculate_total_num_steps(cfg, train_dataset):
    if cfg.sample_packing:
        # we have to drop anything longer then sequence len otherwise
        # flash attention with position ids fails
        if not cfg.total_num_tokens:
            total_num_tokens = np.sum(
                train_dataset.data.column("input_ids")
                .to_pandas()
                .apply(lambda x: len(x))  # pylint: disable=unnecessary-lambda
                .values
            )
            LOG.debug(f"total_num_tokens: {total_num_tokens}", main_process_only=True)
            cfg.total_num_tokens = total_num_tokens

        if not cfg.total_supervised_tokens:
            total_supervised_tokens = (
                train_dataset.data.column("labels")
                .to_pandas()
                .apply(lambda x: np.sum(np.array(x) != -100))
                .sum()
            )
            LOG.debug(
                f"`total_supervised_tokens: {total_supervised_tokens}`",
                main_process_only=True,
            )
            cfg.total_supervised_tokens = total_supervised_tokens

        if cfg.sample_packing_eff_est:
            total_num_steps = (
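
Both versions of the token counting shown above do the same arithmetic over the tokenized dataset; the refactor only changes whether the totals are written back to `cfg` (the `update` flag) and whether the work happens under `if cfg.sample_packing:`. A self-contained toy version of the two counts (column names follow the diff; the data is invented):

```python
import numpy as np
import pandas as pd

# Toy tokenized dataset: label positions set to -100 are masked out of the loss.
df = pd.DataFrame(
    {
        "input_ids": [[1, 2, 3, 4], [5, 6, 7]],
        "labels": [[-100, -100, 3, 4], [-100, 6, 7]],
    }
)

total_num_tokens = np.sum(df["input_ids"].apply(len).values)
total_supervised_tokens = (
    df["labels"].apply(lambda x: np.sum(np.array(x) != -100)).sum()
)
print(total_num_tokens, total_supervised_tokens)  # 7 4
```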
@@ -234,8 +231,7 @@ def calculate_total_num_steps(cfg, train_dataset, update=True):
            sample_packing_eff_est = (
                math.ceil(sample_packing_actual_eff_all * 100.0) / 100.0
            )
            if update:
                cfg.sample_packing_eff_est = sample_packing_eff_est
            cfg.sample_packing_eff_est = sample_packing_eff_est
            LOG.debug(
                f"sample_packing_eff_est: {cfg.sample_packing_eff_est}",
                main_process_only=True,
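
The kept line rounds the measured packing efficiency up to two decimal places before storing it; the older code additionally gated the assignment behind the `update` flag. For instance:

```python
import math

sample_packing_actual_eff_all = 0.8731  # hypothetical measured efficiency
sample_packing_eff_est = math.ceil(sample_packing_actual_eff_all * 100.0) / 100.0
print(sample_packing_eff_est)  # 0.88 -- always rounded up, never down
```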
@@ -272,6 +268,8 @@ def setup_trainer(cfg, train_dataset, eval_dataset, model, tokenizer, total_num_
        setup_fsdp_envs(cfg)
    elif cfg.deepspeed:
        os.environ["ACCELERATE_USE_DEEPSPEED"] = "true"
    if cfg.fp8:
        os.environ["ACCELERATE_MIXED_PRECISION"] = "fp8"

    trainer_builder = HFCausalTrainerBuilder(cfg, model, tokenizer)
    trainer_builder.train_dataset = train_dataset
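
The two added lines are the fp8 hook itself: trainer setup exports `ACCELERATE_MIXED_PRECISION=fp8` and lets Accelerate pick the mode up at startup. A hedged sketch of the same idea as a standalone helper (the helper name is illustrative, not axolotl code):

```python
import os


def export_precision_env(cfg):
    """Mirror config flags onto the environment variables Accelerate reads at startup."""
    if getattr(cfg, "deepspeed", None):
        os.environ["ACCELERATE_USE_DEEPSPEED"] = "true"
    if getattr(cfg, "fp8", False):
        os.environ["ACCELERATE_MIXED_PRECISION"] = "fp8"
```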
@@ -101,7 +101,6 @@ class TestLoraLlama(unittest.TestCase):
                "learning_rate": 0.00001,
                "optimizer": "adamw_torch",
                "lr_scheduler": "cosine",
                "bf16": True,
            }
        )
        normalize_config(cfg)
@@ -31,7 +31,7 @@ class TestPhi(unittest.TestCase):
            {
                "base_model": "microsoft/phi-1_5",
                "trust_remote_code": True,
                "model_type": "PhiForCausalLM",
                "model_type": "MixFormerSequentialForCausalLM",
                "tokenizer_type": "AutoTokenizer",
                "sequence_len": 512,
                "sample_packing": False,
@@ -76,7 +76,7 @@ class TestPhi(unittest.TestCase):
            {
                "base_model": "microsoft/phi-1_5",
                "trust_remote_code": True,
                "model_type": "PhiForCausalLM",
                "model_type": "MixFormerSequentialForCausalLM",
                "tokenizer_type": "AutoTokenizer",
                "sequence_len": 512,
                "sample_packing": True,
@@ -1,95 +0,0 @@
"""
E2E tests for resuming training
"""

import logging
import os
import re
import subprocess
import unittest
from pathlib import Path

from transformers.utils import is_torch_bf16_gpu_available

from axolotl.cli import load_datasets
from axolotl.common.cli import TrainerCliArgs
from axolotl.train import train
from axolotl.utils.config import normalize_config
from axolotl.utils.dict import DictDefault

from .utils import most_recent_subdir, with_temp_dir

LOG = logging.getLogger("axolotl.tests.e2e")
os.environ["WANDB_DISABLED"] = "true"


class TestResumeLlama(unittest.TestCase):
    """
    Test case for resuming training of llama models
    """

    @with_temp_dir
    def test_resume_qlora(self, temp_dir):
        # pylint: disable=duplicate-code
        cfg = DictDefault(
            {
                "base_model": "JackFram/llama-68m",
                "tokenizer_type": "LlamaTokenizer",
                "sequence_len": 1024,
                "sample_packing": True,
                "flash_attention": True,
                "load_in_4bit": True,
                "adapter": "qlora",
                "lora_r": 32,
                "lora_alpha": 64,
                "lora_dropout": 0.05,
                "lora_target_linear": True,
                "val_set_size": 0.1,
                "special_tokens": {},
                "datasets": [
                    {
                        "path": "vicgalle/alpaca-gpt4",
                        "type": "alpaca",
                    },
                ],
                "num_epochs": 2,
                "micro_batch_size": 1,
                "gradient_accumulation_steps": 1,
                "output_dir": temp_dir,
                "learning_rate": 0.00001,
                "optimizer": "adamw_torch",
                "lr_scheduler": "cosine",
                "save_steps": 10,
                "save_total_limit": 5,
                "max_steps": 40,
            }
        )
        if is_torch_bf16_gpu_available():
            cfg.bf16 = True
        else:
            cfg.fp16 = True
        normalize_config(cfg)
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)

        resume_cfg = cfg | DictDefault(
            {
                "resume_from_checkpoint": f"{temp_dir}/checkpoint-30/",
            }
        )
        normalize_config(resume_cfg)
        cli_args = TrainerCliArgs()

        train(cfg=resume_cfg, cli_args=cli_args, dataset_meta=dataset_meta)
        assert (Path(temp_dir) / "adapter_model.bin").exists()

        tb_log_path_1 = most_recent_subdir(temp_dir + "/runs")
        cmd = f"tensorboard --inspect --logdir {tb_log_path_1}"
        res = subprocess.run(
            cmd, shell=True, text=True, capture_output=True, check=True
        )
        pattern = r"first_step\s+(\d+)"
        first_steps = int(re.findall(pattern, res.stdout)[0])
        assert first_steps == 31
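
The resume test removed in this hunk ends with the interesting assertion: it shells out to `tensorboard --inspect` and scrapes the reported `first_step` to prove the resumed run started at step 31, one past the checkpoint. The scraping step in isolation, with a canned stdout string standing in for a real TensorBoard run (the exact output format is assumed, only the `first_step <number>` shape matters to the regex):

```python
import re

# Canned stand-in for the summary text `tensorboard --inspect --logdir <dir>` prints.
stdout = """
scalars
   first_step           31
   last_step            40
"""

pattern = r"first_step\s+(\d+)"
first_steps = int(re.findall(pattern, stdout)[0])
assert first_steps == 31
```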
@@ -1,11 +1,10 @@
"""
helper utils for tests
"""
import os
import shutil
import tempfile
from functools import wraps
from pathlib import Path


def with_temp_dir(test_func):
@@ -21,13 +20,3 @@ def with_temp_dir(test_func):
        shutil.rmtree(temp_dir)

    return wrapper


def most_recent_subdir(path):
    base_path = Path(path)
    subdirectories = [d for d in base_path.iterdir() if d.is_dir()]
    if not subdirectories:
        return None
    subdir = max(subdirectories, key=os.path.getctime)

    return subdir
@@ -649,33 +649,3 @@ class ValidationTest(unittest.TestCase):
        )

        validate_config(cfg)

    def test_warmup_step_no_conflict(self):
        cfg = DictDefault(
            {
                "warmup_steps": 10,
                "warmup_ratio": 0.1,
            }
        )

        with pytest.raises(
            ValueError,
            match=r".*warmup_steps and warmup_ratio are mutually exclusive*",
        ):
            validate_config(cfg)

        cfg = DictDefault(
            {
                "warmup_steps": 10,
            }
        )

        validate_config(cfg)

        cfg = DictDefault(
            {
                "warmup_ratio": 0.1,
            }
        )

        validate_config(cfg)
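
The warmup test dropped in this hunk pins down one validation rule: `warmup_steps` and `warmup_ratio` may each be set alone, but not together. A toy re-statement of that rule (a sketch, not axolotl's `validate_config`):

```python
def check_warmup(cfg: dict) -> None:
    # warmup_steps and warmup_ratio are mutually exclusive
    if cfg.get("warmup_steps") is not None and cfg.get("warmup_ratio") is not None:
        raise ValueError("warmup_steps and warmup_ratio are mutually exclusive")


check_warmup({"warmup_steps": 10})    # fine
check_warmup({"warmup_ratio": 0.1})   # fine
# check_warmup({"warmup_steps": 10, "warmup_ratio": 0.1})  # raises ValueError
```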