fix softmax class check

register rala
use autoconfig w rala
2025-01-15 23:23:13 -05:00 · 2025-01-15 23:21:22 -05:00 · 2025-01-15 23:14:47 -05:00 · 2025-01-15 22:45:02 -05:00 · 2025-01-15 21:36:14 -05:00 · 2025-01-15 21:27:12 -05:00
112 changed files with 4969 additions and 2849 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -186,3 +186,6 @@ out/

 # vim
 *.swp
+
+# symlinked to axolotl-artifacts in docker containers
+outputs
--- a/README.md
+++ b/README.md
@@ -519,8 +519,8 @@ See [examples](examples) for quick start. It is recommended to duplicate and mod
      train_on_split: validation

      # loading from s3 or gcs
-      # s3 creds will be loaded from the system default / gcs will attempt to load from gcloud creds, google metadata service, or anon
-    - path: s3://path_to_ds # Accepts folder with arrow/parquet or file path like above
+      # s3 creds will be loaded from the system default and gcs only supports public access
+    - path: s3://path_to_ds # Accepts folder with arrow/parquet or file path like above. Supports s3, gcs.
      ...

      # Loading Data From a Public URL
--- a/cicd/cicd.sh
+++ b/cicd/cicd.sh
@@ -4,8 +4,6 @@ set -e
 python -c "import torch; assert '$PYTORCH_VERSION' in torch.__version__"

 pytest -v --durations=10 -n8 --ignore=tests/e2e/ --ignore=tests/patched/ /workspace/axolotl/tests/
-# pytest -v --durations=10 -n8 --dist loadfile /workspace/axolotl/tests/patched/
 pytest -v --durations=10 /workspace/axolotl/tests/e2e/patched/
-pytest -v --durations=10 -n1 /workspace/axolotl/tests/e2e/solo/
 pytest -v --durations=10 /workspace/axolotl/tests/e2e/integrations/
-pytest -v --durations=10 --ignore=tests/e2e/solo/ --ignore=tests/e2e/patched/ --ignore=tests/e2e/multigpu/ --ignore=tests/e2e/integrations/ /workspace/axolotl/tests/e2e/
+pytest -v --durations=10 --ignore=tests/e2e/patched/ --ignore=tests/e2e/multigpu/ --ignore=tests/e2e/integrations/ /workspace/axolotl/tests/e2e/
--- a/cicd/multigpu.py
+++ b/cicd/multigpu.py
@@ -1,6 +1,6 @@
 """
- modal application to run axolotl gpu tests in Modal
- """
+modal application to run axolotl gpu tests in Modal
+"""
 # pylint: disable=duplicate-code

 import os
--- a/docker/Dockerfile-cloud
+++ b/docker/Dockerfile-cloud
@@ -20,8 +20,7 @@ RUN apt install --yes --no-install-recommends openssh-server tmux && \
    printf "\n[[ -z \"\$TMUX\"  ]] && { tmux attach-session -t ssh_tmux || tmux new-session -s ssh_tmux; exit; }\n" >> ~/.bashrc && \
    printf "[ ! -z \"\$TERM\" -a -r /etc/motd ] && cat /etc/motd\n" >> ~/.bashrc && \
    chmod +x /workspace/axolotl/scripts/cloud-entrypoint.sh && \
-    chmod +x /root/cloud-entrypoint.sh && \
-    echo 'set-option -g history-limit 5000' >> ~/.tmux.conf
+    chmod +x /root/cloud-entrypoint.sh

 ENTRYPOINT ["/root/cloud-entrypoint.sh"]
 CMD ["sleep", "infinity"]
--- a/docs/config.qmd
+++ b/docs/config.qmd
@@ -244,8 +244,6 @@ total_num_tokens:
 sample_packing_group_size: 100000
 # The number of samples which can be packed into one sequence. Increase if using a large sequence_len with many short samples.
 sample_packing_bin_size: 200
-# whether to concatenate samples during pretraining
-pretraining_sample_concatenation:

 # Use batch flattening for speedups when not using sample_packing
 batch_flattening:
@@ -360,11 +358,10 @@ warmup_ratio: 0.05  # cannot use with warmup_steps
 learning_rate: 0.00003
 lr_quadratic_warmup:
 logging_steps:
-eval_steps: # Leave empty to eval at each epoch, integer for every N steps. float for fraction of total steps
+eval_steps: # Leave empty to eval at each epoch, integer for every N steps, float for fraction of total steps
 evals_per_epoch: # number of times per epoch to run evals, mutually exclusive with eval_steps
-eval_strategy: # Set to `"no"` to skip evaluation, `"epoch"` at end of each epoch, leave empty to infer from `eval_steps`.
-save_strategy: # Set to `"no"` to skip checkpoint saves, `"epoch"` at end of each epoch, `"best"` when better result is achieved, leave empty to infer from `save_steps`.
-save_steps: # Leave empty to save at each epoch, integer for every N steps. float for fraction of total steps
+save_strategy: # Set to `"no"` to skip checkpoint saves
+save_steps: # Leave empty to save at each epoch
 saves_per_epoch: # number of times per epoch to save a checkpoint, mutually exclusive with save_steps
 save_total_limit: # Checkpoints saved at a time
 # Maximum number of iterations to train for. It precedes num_epochs which means that
--- a/docs/dataset-formats/pretraining.qmd
+++ b/docs/dataset-formats/pretraining.qmd
@@ -19,14 +19,7 @@ For pretraining, there is no prompt template or roles.  The only required field
 Axolotl usually loads the entire dataset into memory. This will be challenging for large datasets. Use the following config to enable streaming:

 ```{.yaml filename="config.yaml"}
-pretraining_dataset:
-  - name:
-    path:
-    split:
-    text_column: # column in dataset with the data, usually `text`
-    type: pretrain
-    trust_remote_code:
-    skip: # number of rows of data to skip over from the beginning
+pretraining_dataset: # hf path only
 ...
 ```

--- a/docs/lr_groups.qmd
+++ b/docs/lr_groups.qmd
@@ -1,29 +0,0 @@
----
-title: Learning Rate Groups
-description: "Setting different learning rates by module name"
----
-
-## Background
-
-Inspired by LoRA+, Axolotl allows practitioners to specify separate learning rates for each module or groups of
-modules in a model.
-
-## Example
-
-```yaml
-lr_groups:
-  - name: o_proj
-    modules:
-      - self_attn.o_proj.weight
-    lr: 1e-6
-  - name: q_proj
-    modules:
-      - model.layers.2.self_attn.q_proj.weight
-    lr: 1e-5
-
-learning_rate: 2e-5
-```
-
-In this example, we have a default learning rate of 2e-5 across the entire model, but we have a separate learning rate
-of 1e-6 for all the self attention `o_proj` modules across all layers, and a learning are of 1e-5 to the 3rd layer's
-self attention `q_proj` module.
--- a/requirements.txt
+++ b/requirements.txt
@@ -13,9 +13,9 @@ liger-kernel==0.5.2
 packaging==23.2

 peft==0.14.0
-transformers==4.48.1
+transformers==4.47.1
 tokenizers>=0.21.0
-accelerate==1.3.0
+accelerate==1.2.1
 datasets==3.2.0
 deepspeed==0.16.1
 trl==0.13.0
--- a/scripts/chat_datasets.py
+++ b/scripts/chat_datasets.py
@@ -30,7 +30,7 @@ def parse_dataset(dataset=None, split="train"):
        )
    ds_cfg["field_messages"] = field_messages

-    message_fields = features[field_messages][0].keys()
+    message_fields = features["conversations"][0].keys()
    message_field_role = None
    for key in ["from", "role"]:
        if key in message_fields:
--- a/scripts/finetune.py
+++ b/scripts/finetune.py
@@ -0,0 +1,52 @@
+"""Prepare and train a model on a dataset. Can also infer from a model or merge lora"""
+import logging
+from pathlib import Path
+
+import fire
+import transformers
+
+from axolotl.cli import (
+    check_accelerate_default_config,
+    check_user_token,
+    do_inference,
+    do_merge_lora,
+    load_cfg,
+    load_datasets,
+    print_axolotl_text_art,
+)
+from axolotl.cli.shard import shard
+from axolotl.common.cli import TrainerCliArgs
+from axolotl.train import train
+
+LOG = logging.getLogger("axolotl.scripts.finetune")
+
+
+def do_cli(config: Path = Path("examples/"), **kwargs):
+    print_axolotl_text_art()
+    LOG.warning(
+        str(
+            PendingDeprecationWarning(
+                "scripts/finetune.py will be replaced with calling axolotl.cli.train"
+            )
+        )
+    )
+    parsed_cfg = load_cfg(config, **kwargs)
+    check_accelerate_default_config()
+    check_user_token()
+    parser = transformers.HfArgumentParser((TrainerCliArgs))
+    parsed_cli_args, _ = parser.parse_args_into_dataclasses(
+        return_remaining_strings=True
+    )
+    if parsed_cli_args.inference:
+        do_inference(cfg=parsed_cfg, cli_args=parsed_cli_args)
+    elif parsed_cli_args.merge_lora:
+        do_merge_lora(cfg=parsed_cfg, cli_args=parsed_cli_args)
+    elif parsed_cli_args.shard:
+        shard(cfg=parsed_cfg, cli_args=parsed_cli_args)
+    else:
+        dataset_meta = load_datasets(cfg=parsed_cfg, cli_args=parsed_cli_args)
+        train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
+
+
+if __name__ == "__main__":
+    fire.Fire(do_cli)
--- a/src/axolotl/cli/__init__.py
+++ b/src/axolotl/cli/__init__.py
@@ -1,5 +1,568 @@
-"""Axolotl CLI module initialization."""
+"""Prepare and train a model on a dataset. Can also infer from a model or merge lora"""

+import importlib
+import json
+import logging
+import math
 import os
+import random
+import sys
+import tempfile
+from pathlib import Path
+from threading import Thread
+from typing import Any, Dict, List, Optional, Union
+from urllib.parse import urlparse
+
+import requests
+import torch
+import yaml
+
+# add src to the pythonpath so we don't need to pip install this
+from accelerate.commands.config import config_args
+from art import text2art
+from huggingface_hub import HfApi
+from huggingface_hub.utils import LocalTokenNotFoundError
+from transformers import GenerationConfig, TextIteratorStreamer, TextStreamer
+from transformers.utils import is_torch_bf16_gpu_available
+from transformers.utils.import_utils import _is_package_available
+
+from axolotl.common.cli import TrainerCliArgs, load_model_and_tokenizer
+from axolotl.logging_config import configure_logging
+from axolotl.train import TrainDatasetMeta
+from axolotl.utils.chat_templates import (
+    get_chat_template,
+    get_chat_template_from_config,
+)
+from axolotl.utils.comet_ import setup_comet_env_vars
+from axolotl.utils.config import (
+    normalize_cfg_datasets,
+    normalize_config,
+    prepare_plugins,
+    validate_config,
+)
+from axolotl.utils.data import load_prepare_dpo_datasets, prepare_dataset
+from axolotl.utils.dict import DictDefault
+from axolotl.utils.distributed import is_main_process
+from axolotl.utils.mlflow_ import setup_mlflow_env_vars
+from axolotl.utils.models import load_processor, load_tokenizer
+from axolotl.utils.tokenization import check_dataset_labels
+from axolotl.utils.trainer import prepare_opinionated_env, prepare_optim_env
+from axolotl.utils.wandb_ import setup_wandb_env_vars
+
+project_root = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
+src_dir = os.path.join(project_root, "src")
+sys.path.insert(0, src_dir)
+
+configure_logging()
+LOG = logging.getLogger("axolotl.scripts")

 os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
+
+AXOLOTL_LOGO = """
+     #@@ #@@      @@# @@#
+    @@  @@          @@  @@           =@@#                               @@                 #@    =@@#.
+    @@    #@@@@@@@@@    @@           #@#@=                              @@                 #@     .=@@
+      #@@@@@@@@@@@@@@@@@            =@# @#     ##=     ##    =####=+    @@      =#####+  =#@@###.   @@
+    @@@@@@@@@@/  +@@/  +@@          #@  =@=     #@=   @@   =@#+  +#@#   @@    =@#+  +#@#   #@.      @@
+    @@@@@@@@@@  ##@@  ##@@         =@#   @#      =@# @#    @@      @@   @@    @@      #@   #@       @@
+     @@@@@@@@@@@@@@@@@@@@          #@=+++#@=      =@@#     @@      @@   @@    @@      #@   #@       @@
+                                  =@#=====@@     =@# @#    @@      @@   @@    @@      #@   #@       @@
+    @@@@@@@@@@@@@@@@  @@@@        #@      #@=   #@=  +@@   #@#    =@#   @@.   =@#    =@#   #@.      @@
+                                 =@#       @#  #@=     #@   =#@@@@#=    +#@@=  +#@@@@#=    .##@@+   @@
+    @@@@  @@@@@@@@@@@@@@@@
+"""
+
+
+def print_legacy_axolotl_text_art(suffix=None):
+    font = "nancyj"
+    ascii_text = "  axolotl"
+    if suffix:
+        ascii_text += f"  x  {suffix}"
+    ascii_art = text2art(ascii_text, font=font)
+
+    if is_main_process():
+        print(ascii_art)
+
+    print_dep_versions()
+
+
+def print_axolotl_text_art(
+    **kwargs,  # pylint: disable=unused-argument
+):
+    if is_main_process():
+        print(AXOLOTL_LOGO)
+
+
+def print_dep_versions():
+    packages = ["accelerate", "peft", "transformers", "trl", "torch", "bitsandbytes"]
+    max_len = max(len(pkg) for pkg in packages)
+    if is_main_process():
+        print("*" * 40)
+        print("**** Axolotl Dependency Versions *****")
+        for pkg in packages:
+            pkg_version = _is_package_available(pkg, return_version=True)
+            print(f"{pkg: >{max_len}}: {pkg_version[1]: <15}")
+        print("*" * 40)
+
+
+def check_remote_config(config: Union[str, Path]):
+    # Check if the config is a valid HTTPS URL to a .yml or .yaml file
+    if not (isinstance(config, str) and config.startswith("https://")):
+        return config  # Return the original value if it's not a valid URL
+
+    filename = os.path.basename(urlparse(config).path)
+    temp_dir = tempfile.mkdtemp()
+
+    try:
+        response = requests.get(config, timeout=30)
+        response.raise_for_status()  # Check for HTTP errors
+
+        content = response.content
+        try:
+            # Try parsing as JSON first to catch cases where JSON content is mistakenly considered YAML
+            json.loads(content)
+            # Log a warning but do not raise an error; JSON is technically valid YAML - this can happen when you forget to point to a raw github link
+            LOG.warning(
+                f"Warning: The content of the file at {config} is JSON, which is technically valid YAML but might not be intended."
+            )
+        except json.JSONDecodeError:
+            # If it's not valid JSON, verify it's valid YAML
+            try:
+                yaml.safe_load(content)
+            except yaml.YAMLError as err:
+                raise ValueError(
+                    f"Failed to parse the content at {config} as YAML: {err}"
+                ) from err
+
+        # Write the content to a file if it's valid YAML (or JSON treated as YAML)
+        output_path = Path(temp_dir) / filename
+        with open(output_path, "wb") as file:
+            file.write(content)
+        LOG.info(
+            f"Using the following config obtained from {config}: \n\n{content.decode('utf-8')}\n"
+        )
+        return output_path
+
+    except requests.RequestException as err:
+        # This catches all requests-related exceptions including HTTPError
+        raise RuntimeError(f"Failed to download {config}: {err}") from err
+    except Exception as err:
+        # Catch-all for any other exceptions
+        raise err
+
+
+def get_multi_line_input() -> Optional[str]:
+    print("Give me an instruction (Ctrl + D to submit): ")
+    instruction = ""
+    for line in sys.stdin:
+        instruction += line  # pylint: disable=consider-using-join
+    # instruction = pathlib.Path("/proc/self/fd/0").read_text()
+    return instruction
+
+
+def do_merge_lora(
+    *,
+    cfg: DictDefault,
+    cli_args: TrainerCliArgs,
+):
+    model, tokenizer = load_model_and_tokenizer(cfg=cfg, cli_args=cli_args)
+    safe_serialization = cfg.save_safetensors is True
+
+    LOG.info("running merge of LoRA with base model")
+    model = model.merge_and_unload(progressbar=True)
+    try:
+        model.to(dtype=cfg.torch_dtype)
+    except RuntimeError:
+        pass
+    model.generation_config.do_sample = True
+
+    if cfg.local_rank == 0:
+        LOG.info(f"saving merged model to: {str(Path(cfg.output_dir) / 'merged')}")
+        model.save_pretrained(
+            str(Path(cfg.output_dir) / "merged"),
+            safe_serialization=safe_serialization,
+            progressbar=True,
+        )
+        tokenizer.save_pretrained(str(Path(cfg.output_dir) / "merged"))
+
+
+def do_inference(
+    *,
+    cfg: DictDefault,
+    cli_args: TrainerCliArgs,
+):
+    model, tokenizer = load_model_and_tokenizer(cfg=cfg, cli_args=cli_args)
+    prompter = cli_args.prompter
+
+    prompter_module = None
+    chat_template_str = None
+    if prompter:
+        prompter_module = getattr(
+            importlib.import_module("axolotl.prompters"), prompter
+        )
+    elif cfg.chat_template:
+        chat_template_str = get_chat_template(cfg.chat_template)
+    elif cfg.datasets and cfg.datasets[0].type == "chat_template":
+        chat_template_str = get_chat_template_from_config(
+            cfg=cfg, ds_cfg=cfg.datasets[0], tokenizer=tokenizer
+        )
+
+    model = model.to(cfg.device, dtype=cfg.torch_dtype)
+
+    while True:
+        print("=" * 80)
+        # support for multiline inputs
+        instruction = get_multi_line_input()
+        if not instruction:
+            return
+
+        if prompter_module:
+            prompt: str = next(
+                prompter_module().build_prompt(instruction=instruction.strip("\n"))
+            )
+        else:
+            prompt = instruction.strip()
+
+        if chat_template_str:
+            batch = tokenizer.apply_chat_template(
+                [
+                    {
+                        "role": "user",
+                        "content": prompt,
+                    }
+                ],
+                return_tensors="pt",
+                add_special_tokens=True,
+                add_generation_prompt=True,
+                chat_template=chat_template_str,
+                tokenize=True,
+                return_dict=True,
+            )
+        else:
+            batch = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)
+
+        print("=" * 40)
+        model.eval()
+        with torch.no_grad():
+            generation_config = GenerationConfig(
+                repetition_penalty=1.1,
+                max_new_tokens=1024,
+                temperature=0.9,
+                top_p=0.95,
+                top_k=40,
+                bos_token_id=tokenizer.bos_token_id,
+                eos_token_id=tokenizer.eos_token_id,
+                pad_token_id=tokenizer.pad_token_id,
+                do_sample=True,
+                use_cache=True,
+                return_dict_in_generate=True,
+                output_attentions=False,
+                output_hidden_states=False,
+                output_scores=False,
+            )
+            streamer = TextStreamer(tokenizer)
+            generated = model.generate(
+                inputs=batch["input_ids"].to(cfg.device),
+                generation_config=generation_config,
+                streamer=streamer,
+            )
+        print("=" * 40)
+        print(tokenizer.decode(generated["sequences"].cpu().tolist()[0]))
+
+
+def do_inference_gradio(
+    *,
+    cfg: DictDefault,
+    cli_args: TrainerCliArgs,
+):
+    import gradio as gr
+
+    model, tokenizer = load_model_and_tokenizer(cfg=cfg, cli_args=cli_args)
+    prompter = cli_args.prompter
+
+    prompter_module = None
+    chat_template_str = None
+    if prompter:
+        prompter_module = getattr(
+            importlib.import_module("axolotl.prompters"), prompter
+        )
+    elif cfg.chat_template:
+        chat_template_str = get_chat_template(cfg.chat_template, tokenizer=tokenizer)
+
+    model = model.to(cfg.device, dtype=cfg.torch_dtype)
+
+    def generate(instruction):
+        if not instruction:
+            return
+        if prompter_module:
+            # pylint: disable=stop-iteration-return
+            prompt: str = next(
+                prompter_module().build_prompt(instruction=instruction.strip("\n"))
+            )
+        else:
+            prompt = instruction.strip()
+
+        if chat_template_str:
+            batch = tokenizer.apply_chat_template(
+                [
+                    {
+                        "role": "user",
+                        "content": prompt,
+                    }
+                ],
+                return_tensors="pt",
+                add_special_tokens=True,
+                add_generation_prompt=True,
+                chat_template=chat_template_str,
+                tokenize=True,
+                return_dict=True,
+            )
+        else:
+            batch = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)
+
+        model.eval()
+        with torch.no_grad():
+            generation_config = GenerationConfig(
+                repetition_penalty=1.1,
+                max_new_tokens=cfg.get("gradio_max_new_tokens", 1024),
+                temperature=cfg.get("gradio_temperature", 0.9),
+                top_p=0.95,
+                top_k=40,
+                bos_token_id=tokenizer.bos_token_id,
+                eos_token_id=tokenizer.eos_token_id,
+                pad_token_id=tokenizer.pad_token_id,
+                do_sample=True,
+                use_cache=True,
+                return_dict_in_generate=True,
+                output_attentions=False,
+                output_hidden_states=False,
+                output_scores=False,
+            )
+            streamer = TextIteratorStreamer(tokenizer)
+            generation_kwargs = {
+                "inputs": batch["input_ids"].to(cfg.device),
+                "attention_mask": batch["attention_mask"].to(cfg.device),
+                "generation_config": generation_config,
+                "streamer": streamer,
+            }
+
+            thread = Thread(target=model.generate, kwargs=generation_kwargs)
+            thread.start()
+
+            all_text = ""
+
+            for new_text in streamer:
+                all_text += new_text
+                yield all_text
+
+    demo = gr.Interface(
+        fn=generate,
+        inputs="textbox",
+        outputs="text",
+        title=cfg.get("gradio_title", "Axolotl Gradio Interface"),
+    )
+
+    demo.queue().launch(
+        show_api=False,
+        share=cfg.get("gradio_share", True),
+        server_name=cfg.get("gradio_server_name", "127.0.0.1"),
+        server_port=cfg.get("gradio_server_port", None),
+    )
+
+
+def choose_config(path: Path):
+    yaml_files = list(path.glob("*.yml"))
+
+    if not yaml_files:
+        raise ValueError(
+            "No YAML config files found in the specified directory. Are you using a .yml extension?"
+        )
+
+    if len(yaml_files) == 1:
+        print(f"Using default YAML file '{yaml_files[0]}'")
+        return str(yaml_files[0])
+
+    print("Choose a YAML file:")
+    for idx, file in enumerate(yaml_files):
+        print(f"{idx + 1}. {file}")
+
+    chosen_file = None
+    while chosen_file is None:
+        try:
+            choice = int(input("Enter the number of your choice: "))
+            if 1 <= choice <= len(yaml_files):
+                chosen_file = str(yaml_files[choice - 1])
+            else:
+                print("Invalid choice. Please choose a number from the list.")
+        except ValueError:
+            print("Invalid input. Please enter a number.")
+
+    return chosen_file
+
+
+def check_not_in(list1: List[str], list2: Union[Dict[str, Any], List[str]]) -> bool:
+    return not any(el in list2 for el in list1)
+
+
+def load_cfg(config: Union[str, Path] = Path("examples/"), **kwargs):
+    config = check_remote_config(config)
+    if Path(config).is_dir():
+        config = choose_config(Path(config))
+
+    # load the config from the yaml file
+    with open(config, encoding="utf-8") as file:
+        cfg: DictDefault = DictDefault(yaml.safe_load(file))
+    # if there are any options passed in the cli, if it is something that seems valid from the yaml,
+    # then overwrite the value
+    cfg_keys = cfg.keys()
+    for k, _ in kwargs.items():
+        # if not strict, allow writing to cfg even if it's not in the yml already
+        if k in cfg_keys or not cfg.strict:
+            # handle booleans
+            if isinstance(cfg[k], bool):
+                cfg[k] = bool(kwargs[k])
+            else:
+                cfg[k] = kwargs[k]
+
+    cfg.axolotl_config_path = config
+
+    try:
+        device_props = torch.cuda.get_device_properties("cuda")
+        gpu_version = "sm_" + str(device_props.major) + str(device_props.minor)
+    except:  # pylint: disable=bare-except # noqa: E722
+        gpu_version = None
+
+    prepare_plugins(cfg)
+
+    cfg = validate_config(
+        cfg,
+        capabilities={
+            "bf16": is_torch_bf16_gpu_available(),
+            "n_gpu": int(os.environ.get("WORLD_SIZE", 1)),
+            "compute_capability": gpu_version,
+        },
+        env_capabilities={
+            "torch_version": str(torch.__version__).split("+", maxsplit=1)[0],
+        },
+    )
+
+    prepare_optim_env(cfg)
+
+    prepare_opinionated_env(cfg)
+
+    normalize_config(cfg)
+
+    normalize_cfg_datasets(cfg)
+
+    setup_wandb_env_vars(cfg)
+
+    setup_mlflow_env_vars(cfg)
+
+    setup_comet_env_vars(cfg)
+
+    return cfg
+
+
+def load_datasets(
+    *,
+    cfg: DictDefault,
+    cli_args: TrainerCliArgs,
+) -> TrainDatasetMeta:
+    tokenizer = load_tokenizer(cfg)
+    processor = load_processor(cfg, tokenizer=tokenizer) if cfg.processor_type else None
+
+    train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
+        cfg,
+        tokenizer,
+        processor=processor,
+    )
+
+    if (
+        cli_args.debug
+        or cfg.debug
+        or cli_args.debug_text_only
+        or int(cli_args.debug_num_examples) > 0
+    ):
+        LOG.info("check_dataset_labels...")
+        check_dataset_labels(
+            train_dataset.select(
+                [
+                    random.randrange(len(train_dataset))  # nosec
+                    for _ in range(cli_args.debug_num_examples)
+                ]
+            ),
+            tokenizer,
+            num_examples=cli_args.debug_num_examples,
+            text_only=cli_args.debug_text_only,
+        )
+
+        LOG.info("printing prompters...")
+        for prompter in prompters:
+            LOG.info(prompter)
+
+    return TrainDatasetMeta(
+        train_dataset=train_dataset,
+        eval_dataset=eval_dataset,
+        total_num_steps=total_num_steps,
+    )
+
+
+def load_rl_datasets(
+    *,
+    cfg: DictDefault,
+    cli_args: TrainerCliArgs,  # pylint: disable=unused-argument
+) -> TrainDatasetMeta:
+    train_dataset, eval_dataset = load_prepare_dpo_datasets(cfg)
+    total_num_steps = int(
+        math.ceil(len(train_dataset) * cfg.num_epochs / cfg.batch_size)
+    )
+
+    if cli_args.debug or cfg.debug:
+        LOG.info("check_dataset_labels...")
+
+        tokenizer = load_tokenizer(cfg)
+        check_dataset_labels(
+            train_dataset.select(
+                [
+                    random.randrange(len(train_dataset))  # nosec
+                    for _ in range(cli_args.debug_num_examples)
+                ]
+            ),
+            tokenizer,
+            num_examples=cli_args.debug_num_examples,
+            text_only=cli_args.debug_text_only,
+            rl_mode=True,
+        )
+
+    return TrainDatasetMeta(
+        train_dataset=train_dataset,
+        eval_dataset=eval_dataset,
+        total_num_steps=total_num_steps,
+    )
+
+
+def check_accelerate_default_config():
+    if Path(config_args.default_yaml_config_file).exists():
+        LOG.warning(
+            f"accelerate config file found at {config_args.default_yaml_config_file}. This can lead to unexpected errors"
+        )
+
+
+def check_user_token():
+    # Skip check if HF_HUB_OFFLINE is set to True
+    if os.getenv("HF_HUB_OFFLINE") == "1":
+        LOG.info(
+            "Skipping HuggingFace token verification because HF_HUB_OFFLINE is set to True. Only local files will be used."
+        )
+        return True
+
+    # Verify if token is valid
+    api = HfApi()
+    try:
+        user_info = api.whoami()
+        return bool(user_info)
+    except LocalTokenNotFoundError:
+        LOG.warning(
+            "Error verifying HuggingFace token. Remember to log in using `huggingface-cli login` and get your access token from https://huggingface.co/settings/tokens if you want to use gated models or datasets."
+        )
+        return False
--- a/src/axolotl/cli/args.py
+++ b/src/axolotl/cli/args.py
@@ -1,43 +0,0 @@
-"""Module for axolotl CLI command arguments."""
-
-from dataclasses import dataclass, field
-from typing import Optional
-
-
-@dataclass
-class PreprocessCliArgs:
-    """Dataclass with CLI arguments for `axolotl preprocess` command."""
-
-    debug: bool = field(default=False)
-    debug_text_only: bool = field(default=False)
-    debug_num_examples: int = field(default=1)
-    prompter: Optional[str] = field(default=None)
-    download: Optional[bool] = field(default=True)
-
-
-@dataclass
-class TrainerCliArgs:
-    """Dataclass with CLI arguments for `axolotl train` command."""
-
-    debug: bool = field(default=False)
-    debug_text_only: bool = field(default=False)
-    debug_num_examples: int = field(default=0)
-    merge_lora: bool = field(default=False)
-    prompter: Optional[str] = field(default=None)
-    shard: bool = field(default=False)
-
-
-@dataclass
-class EvaluateCliArgs:
-    """Dataclass with CLI arguments for `axolotl evaluate` command."""
-
-    debug: bool = field(default=False)
-    debug_text_only: bool = field(default=False)
-    debug_num_examples: int = field(default=0)
-
-
-@dataclass
-class InferenceCliArgs:
-    """Dataclass with CLI arguments for `axolotl inference` command."""
-
-    prompter: Optional[str] = field(default=None)
--- a/src/axolotl/cli/art.py
+++ b/src/axolotl/cli/art.py
@@ -1,23 +0,0 @@
-"""Axolotl ASCII logo utils."""
-
-from axolotl.utils.distributed import is_main_process
-
-AXOLOTL_LOGO = """
-     #@@ #@@      @@# @@#
-    @@  @@          @@  @@           =@@#                               @@                 #@    =@@#.
-    @@    #@@@@@@@@@    @@           #@#@=                              @@                 #@     .=@@
-      #@@@@@@@@@@@@@@@@@            =@# @#     ##=     ##    =####=+    @@      =#####+  =#@@###.   @@
-    @@@@@@@@@@/  +@@/  +@@          #@  =@=     #@=   @@   =@#+  +#@#   @@    =@#+  +#@#   #@.      @@
-    @@@@@@@@@@  ##@@  ##@@         =@#   @#      =@# @#    @@      @@   @@    @@      #@   #@       @@
-     @@@@@@@@@@@@@@@@@@@@          #@=+++#@=      =@@#     @@      @@   @@    @@      #@   #@       @@
-                                  =@#=====@@     =@# @#    @@      @@   @@    @@      #@   #@       @@
-    @@@@@@@@@@@@@@@@  @@@@        #@      #@=   #@=  +@@   #@#    =@#   @@.   =@#    =@#   #@.      @@
-                                 =@#       @#  #@=     #@   =#@@@@#=    +#@@=  +#@@@@#=    .##@@+   @@
-    @@@@  @@@@@@@@@@@@@@@@
-"""
-
-
-def print_axolotl_text_art():
-    """Prints axolotl ASCII art."""
-    if is_main_process():
-        print(AXOLOTL_LOGO)
--- a/src/axolotl/cli/checks.py
+++ b/src/axolotl/cli/checks.py
@@ -1,50 +0,0 @@
-"""Various checks for Axolotl CLI."""
-
-import logging
-import os
-from pathlib import Path
-
-from accelerate.commands.config import config_args
-from huggingface_hub import HfApi
-from huggingface_hub.utils import LocalTokenNotFoundError
-
-from axolotl.logging_config import configure_logging
-
-configure_logging()
-LOG = logging.getLogger(__name__)
-
-
-def check_accelerate_default_config() -> None:
-    """Logs at warning level if no accelerate config file is found."""
-    if Path(config_args.default_yaml_config_file).exists():
-        LOG.warning(
-            f"accelerate config file found at {config_args.default_yaml_config_file}. This can lead to unexpected errors"
-        )
-
-
-def check_user_token() -> bool:
-    """Checks for HF user info. Check is skipped if HF_HUB_OFFLINE=1.
-
-    Returns:
-        Boolean indicating successful check (i.e., HF_HUB_OFFLINE=1 or HF user info is retrieved).
-
-    Raises:
-        LocalTokenNotFoundError: If HF user info can't be retrieved.
-    """
-    # Skip check if HF_HUB_OFFLINE is set to True
-    if os.getenv("HF_HUB_OFFLINE") == "1":
-        LOG.info(
-            "Skipping HuggingFace token verification because HF_HUB_OFFLINE is set to True. Only local files will be used."
-        )
-        return True
-
-    # Verify if token is valid
-    api = HfApi()
-    try:
-        user_info = api.whoami()
-        return bool(user_info)
-    except LocalTokenNotFoundError:
-        LOG.warning(
-            "Error verifying HuggingFace token. Remember to log in using `huggingface-cli login` and get your access token from https://huggingface.co/settings/tokens if you want to use gated models or datasets."
-        )
-        return False
--- a/src/axolotl/cli/config.py
+++ b/src/axolotl/cli/config.py
@@ -1,217 +0,0 @@
-"""Configuration loading and processing."""
-
-import json
-import logging
-import os
-import tempfile
-from pathlib import Path
-from typing import Union
-from urllib.parse import urlparse
-
-import requests
-import torch
-import yaml
-from transformers.utils import is_torch_bf16_gpu_available
-
-from axolotl.integrations.base import PluginManager
-from axolotl.utils.comet_ import setup_comet_env_vars
-from axolotl.utils.config import (
-    normalize_cfg_datasets,
-    normalize_config,
-    validate_config,
-)
-from axolotl.utils.dict import DictDefault
-from axolotl.utils.mlflow_ import setup_mlflow_env_vars
-from axolotl.utils.trainer import prepare_opinionated_env, prepare_optim_env
-from axolotl.utils.wandb_ import setup_wandb_env_vars
-
-LOG = logging.getLogger(__name__)
-
-
-def check_remote_config(config: Union[str, Path]) -> Union[str, Path]:
-    """
-    First, determines if the passed config is a valid HTTPS URL. Then, attempts to query
-    for it and parse its content, first as JSON, then as YAML (YAML is preferred).
-    Finally, the parsed content is written to a local file and its path is returned.
-
-    Args:
-        config: HTTPS URL to a YAML or JSON file.
-
-    Returns:
-        Either the original `config` if it's not a valid HTTPS URL, or the path to the
-        downloaded remote config.
-
-    Raises:
-        ValueError: If the remote configuration is neither valid JSON or YAML.
-        RuntimeError: If some request-related exception occurs from the file download.
-        Exception: Catch-all for any other exception.
-    """
-    # Check if the config is a valid HTTPS URL to a .yml or .yaml file
-    if not (isinstance(config, str) and config.startswith("https://")):
-        return config  # Return the original value if it's not a valid URL
-
-    filename = os.path.basename(urlparse(config).path)
-    temp_dir = tempfile.mkdtemp()
-
-    try:
-        response = requests.get(config, timeout=30)
-        response.raise_for_status()  # Check for HTTP errors
-
-        content = response.content
-        try:
-            # Try parsing as JSON first to catch cases where JSON content is mistakenly
-            # considered YAML.
-            json.loads(content)
-
-            # Log a warning but do not raise an error; JSON is technically valid YAML.
-            # This can happen when you forget to point to a raw GitHub link.
-            LOG.warning(
-                f"Warning: The content of the file at {config} is JSON, which is technically valid YAML but might not be intended."
-            )
-        except json.JSONDecodeError:
-            # If it's not valid JSON, verify it's valid YAML
-            try:
-                yaml.safe_load(content)
-            except yaml.YAMLError as err:
-                raise ValueError(
-                    f"Failed to parse the content at {config} as YAML: {err}"
-                ) from err
-
-        # Write the content to a file if it's valid YAML (or JSON treated as YAML)
-        output_path = Path(temp_dir) / filename
-        with open(output_path, "wb") as file:
-            file.write(content)
-        LOG.info(
-            f"Using the following config obtained from {config}: \n\n{content.decode('utf-8')}\n"
-        )
-        return output_path
-
-    except requests.RequestException as err:
-        # This catches all requests-related exceptions including HTTPError
-        raise RuntimeError(f"Failed to download {config}: {err}") from err
-    except Exception as err:
-        # Catch-all for any other exceptions
-        raise err
-
-
-def choose_config(path: Path) -> str:
-    """
-    Helper method for choosing a `axolotl` config YAML file (considering only files
-    ending with `.yml` or `.yaml`). If more than one config file exists in the passed
-    `path`, the user is prompted to choose one.
-
-    Args:
-        path: Directory in which config file(s) are stored.
-
-    Returns:
-        Path to either (1) the sole YAML file, or (2) if more than one YAML files exist,
-        the user-selected YAML file.
-
-    Raises:
-        ValueError: If no YAML files are found in the given `path`.
-    """
-    yaml_files = list(path.glob("*.yml")) + list(path.glob("*.yaml"))
-
-    if not yaml_files:
-        raise ValueError(
-            "No YAML config files found in the specified directory. Are you using a .yml extension?"
-        )
-
-    if len(yaml_files) == 1:
-        print(f"Using default YAML file '{yaml_files[0]}'")
-        return str(yaml_files[0])
-
-    print("Choose a YAML file:")
-    for idx, file in enumerate(yaml_files):
-        print(f"{idx + 1}. {file}")
-
-    chosen_file = None
-    while chosen_file is None:
-        try:
-            choice = int(input("Enter the number of your choice: "))
-            if 1 <= choice <= len(yaml_files):
-                chosen_file = str(yaml_files[choice - 1])
-            else:
-                print("Invalid choice. Please choose a number from the list.")
-        except ValueError:
-            print("Invalid input. Please enter a number.")
-
-    return chosen_file
-
-
-def prepare_plugins(cfg: DictDefault):
-    """
-    Registers the plugins for the given configuration.
-
-    Args:
-        cfg: Dictionary mapping `axolotl` config keys to values.
-    """
-    if cfg.get("plugins"):
-        plugin_manager = PluginManager.get_instance()
-        for plugin_name in cfg["plugins"]:
-            plugin_manager.register(plugin_name)
-
-
-def load_cfg(config: Union[str, Path] = Path("examples/"), **kwargs) -> DictDefault:
-    """
-    Loads the `axolotl` configuration stored at `config`, validates it, and performs
-    various setup.
-
-    Args:
-        config: Path (local or remote) to `axolotl` config YAML file.
-        kwargs: Additional keyword arguments to override config file values.
-
-    Returns:
-        `DictDefault` mapping configuration keys to values.
-    """
-    config = check_remote_config(config)
-    if Path(config).is_dir():
-        config = choose_config(Path(config))
-
-    # Load the config from the yaml file
-    with open(config, encoding="utf-8") as file:
-        cfg: DictDefault = DictDefault(yaml.safe_load(file))
-
-    # If there are any options passed in the cli, if it is something that seems valid
-    # from the yaml, then overwrite the value
-    cfg_keys = cfg.keys()
-    for k, _ in kwargs.items():
-        # if not strict, allow writing to cfg even if it's not in the yml already
-        if k in cfg_keys or not cfg.strict:
-            # handle booleans
-            if isinstance(cfg[k], bool):
-                cfg[k] = bool(kwargs[k])
-            else:
-                cfg[k] = kwargs[k]
-
-    cfg.axolotl_config_path = config
-
-    try:
-        device_props = torch.cuda.get_device_properties("cuda")
-        gpu_version = "sm_" + str(device_props.major) + str(device_props.minor)
-    except:  # pylint: disable=bare-except # noqa: E722
-        gpu_version = None
-
-    prepare_plugins(cfg)
-
-    cfg = validate_config(
-        cfg,
-        capabilities={
-            "bf16": is_torch_bf16_gpu_available(),
-            "n_gpu": int(os.environ.get("WORLD_SIZE", 1)),
-            "compute_capability": gpu_version,
-        },
-        env_capabilities={
-            "torch_version": str(torch.__version__).split("+", maxsplit=1)[0]
-        },
-    )
-
-    prepare_optim_env(cfg)
-    prepare_opinionated_env(cfg)
-    normalize_config(cfg)
-    normalize_cfg_datasets(cfg)
-    setup_wandb_env_vars(cfg)
-    setup_mlflow_env_vars(cfg)
-    setup_comet_env_vars(cfg)
-
-    return cfg
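
Review note: the CLI override loop in `load_cfg` above is easy to misread, so here is a minimal standalone sketch of its semantics. It assumes a plain `dict` in place of `DictDefault`, and `apply_cli_overrides` is a hypothetical name used only for illustration; it is not part of axolotl.

```python
from typing import Any, Dict


def apply_cli_overrides(cfg: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper mirroring the override loop in `load_cfg`."""
    strict = bool(cfg.get("strict"))
    merged = dict(cfg)
    for key, value in overrides.items():
        # In strict mode, only keys already present in the YAML may be overridden.
        if key in cfg or not strict:
            if isinstance(cfg.get(key), bool):
                # Boolean YAML values coerce the override with bool().
                merged[key] = bool(value)
            else:
                merged[key] = value
    return merged
```

One consequence worth keeping in mind: in Python, `bool()` of any non-empty string is `True`, so an override that arrives as the string `"false"` for a boolean key would be stored as `True` under this logic.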
--- a/src/axolotl/cli/evaluate.py
+++ b/src/axolotl/cli/evaluate.py
@@ -1,55 +1,43 @@
-"""CLI to run evaluation on a model."""
-
+"""
+CLI to run evaluation on a model
+"""
 import logging
 from pathlib import Path
-from typing import Union
+from typing import Dict, Union

 import fire
 from dotenv import load_dotenv
 from transformers.hf_argparser import HfArgumentParser

-from axolotl.cli.args import TrainerCliArgs
-from axolotl.cli.art import print_axolotl_text_art
-from axolotl.cli.checks import check_accelerate_default_config, check_user_token
-from axolotl.cli.config import load_cfg
-from axolotl.common.datasets import load_datasets, load_preference_datasets
+from axolotl.cli import (
+    check_accelerate_default_config,
+    check_user_token,
+    load_cfg,
+    load_datasets,
+    load_rl_datasets,
+    print_axolotl_text_art,
+)
+from axolotl.common.cli import TrainerCliArgs
 from axolotl.evaluate import evaluate
-from axolotl.utils.dict import DictDefault

-LOG = logging.getLogger(__name__)
+LOG = logging.getLogger("axolotl.cli.evaluate")


-def do_evaluate(cfg: DictDefault, cli_args: TrainerCliArgs) -> None:
-    """
-    Evaluates a `transformers` model by first loading the dataset(s) specified in the
-    `axolotl` config, and then calling `axolotl.evaluate.evaluate`, which computes
-    evaluation metrics on the given dataset(s) and writes them to disk.
-
-    Args:
-        cfg: Dictionary mapping `axolotl` config keys to values.
-        cli_args: CLI arguments.
-    """
+def do_evaluate(cfg, cli_args) -> Dict[str, float]:
    # pylint: disable=duplicate-code
    print_axolotl_text_art()
    check_accelerate_default_config()
    check_user_token()

-    if cfg.rl:
-        dataset_meta = load_preference_datasets(cfg=cfg, cli_args=cli_args)
+    if cfg.rl:  # and cfg.rl != "orpo":
+        dataset_meta = load_rl_datasets(cfg=cfg, cli_args=cli_args)
    else:
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-    evaluate(cfg=cfg, dataset_meta=dataset_meta)
+    return evaluate(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)


 def do_cli(config: Union[Path, str] = Path("examples/"), **kwargs) -> None:
-    """
-    Parses `axolotl` config, CLI args, and calls `do_evaluate`.
-
-    Args:
-        config: Path to `axolotl` config YAML file.
-        kwargs: Additional keyword arguments to override config file values.
-    """
    # pylint: disable=duplicate-code
    parsed_cfg = load_cfg(config, **kwargs)
    parser = HfArgumentParser(TrainerCliArgs)
--- a/src/axolotl/cli/inference.py
+++ b/src/axolotl/cli/inference.py
@@ -1,267 +1,32 @@
-"""CLI to run inference on a trained model."""
-
-import importlib
-import logging
-import sys
+"""
+CLI to run inference on a trained model
+"""
 from pathlib import Path
-from threading import Thread
 from typing import Union

 import fire
-import torch
 import transformers
 from dotenv import load_dotenv
-from transformers import GenerationConfig, TextIteratorStreamer, TextStreamer

-from axolotl.cli.args import InferenceCliArgs
-from axolotl.cli.art import print_axolotl_text_art
-from axolotl.cli.config import load_cfg
-from axolotl.cli.utils import load_model_and_tokenizer
-from axolotl.utils.chat_templates import (
-    get_chat_template,
-    get_chat_template_from_config,
+from axolotl.cli import (
+    do_inference,
+    do_inference_gradio,
+    load_cfg,
+    print_axolotl_text_art,
 )
-from axolotl.utils.dict import DictDefault
-
-LOG = logging.getLogger(__name__)
+from axolotl.common.cli import TrainerCliArgs


-def get_multi_line_input() -> str:
-    """
-    Gets multi-line input from terminal.
-
-    Returns:
-        Possibly multi-line, possibly empty stdin input as a string.
-    """
-    print("Give me an instruction (Ctrl + D to submit): ")
-
-    instruction = ""
-    for line in sys.stdin:
-        instruction += line  # pylint: disable=consider-using-join
-
-    return instruction
-
-
-def do_inference(
-    *,
-    cfg: DictDefault,
-    cli_args: InferenceCliArgs,
-):
-    """
-    Runs inference on the command line in a loop. User input is accepted, a chat template
-    is (optionally) applied, and the model specified in the `axolotl` config is used to
-    generate completions according to a default generation config.
-
-    Args:
-        cfg: Dictionary mapping `axolotl` config keys to values.
-        cli_args: Inference-specific CLI arguments.
-    """
-    model, tokenizer = load_model_and_tokenizer(cfg=cfg, inference=True)
-    prompter = cli_args.prompter
-
-    prompter_module = None
-    chat_template_str = None
-    if prompter:
-        prompter_module = getattr(
-            importlib.import_module("axolotl.prompters"), prompter
-        )
-    elif cfg.chat_template:
-        chat_template_str = get_chat_template(cfg.chat_template)
-    elif cfg.datasets[0].type == "chat_template":
-        chat_template_str = get_chat_template_from_config(
-            cfg=cfg, ds_cfg=cfg.datasets[0], tokenizer=tokenizer
-        )
-
-    model = model.to(cfg.device, dtype=cfg.torch_dtype)
-
-    while True:
-        print("=" * 80)
-        # support for multiline inputs
-        instruction = get_multi_line_input()
-        if not instruction:
-            return
-
-        if prompter_module:
-            prompt: str = next(
-                prompter_module().build_prompt(instruction=instruction.strip("\n"))
-            )
-        else:
-            prompt = instruction.strip()
-
-        if chat_template_str:
-            batch = tokenizer.apply_chat_template(
-                [
-                    {
-                        "role": "user",
-                        "content": prompt,
-                    }
-                ],
-                return_tensors="pt",
-                add_special_tokens=True,
-                add_generation_prompt=True,
-                chat_template=chat_template_str,
-                tokenize=True,
-                return_dict=True,
-            )
-        else:
-            batch = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)
-
-        print("=" * 40)
-        model.eval()
-        with torch.no_grad():
-            generation_config = GenerationConfig(
-                repetition_penalty=1.1,
-                max_new_tokens=1024,
-                temperature=0.9,
-                top_p=0.95,
-                top_k=40,
-                bos_token_id=tokenizer.bos_token_id,
-                eos_token_id=tokenizer.eos_token_id,
-                pad_token_id=tokenizer.pad_token_id,
-                do_sample=True,
-                use_cache=True,
-                return_dict_in_generate=True,
-                output_attentions=False,
-                output_hidden_states=False,
-                output_scores=False,
-            )
-            streamer = TextStreamer(tokenizer)
-            generated = model.generate(
-                inputs=batch["input_ids"].to(cfg.device),
-                generation_config=generation_config,
-                streamer=streamer,
-            )
-        print("=" * 40)
-        print(tokenizer.decode(generated["sequences"].cpu().tolist()[0]))
-
-
-def do_inference_gradio(
-    *,
-    cfg: DictDefault,
-    cli_args: InferenceCliArgs,
-):
-    """
-    Runs inference in a Gradio interface. User input is accepted, a chat template is
-    (optionally) applied, and the model specified in the `axolotl` config is used to
-    generate completions according to a default generation config.
-
-    Args:
-        cfg: Dictionary mapping `axolotl` config keys to values.
-        cli_args: Inference-specific CLI arguments.
-    """
-    import gradio as gr
-
-    model, tokenizer = load_model_and_tokenizer(cfg=cfg, inference=True)
-    prompter = cli_args.prompter
-
-    prompter_module = None
-    chat_template_str = None
-    if prompter:
-        prompter_module = getattr(
-            importlib.import_module("axolotl.prompters"), prompter
-        )
-    elif cfg.chat_template:
-        chat_template_str = get_chat_template(cfg.chat_template, tokenizer=tokenizer)
-
-    model = model.to(cfg.device, dtype=cfg.torch_dtype)
-
-    def generate(instruction):
-        if not instruction:
-            return
-        if prompter_module:
-            # pylint: disable=stop-iteration-return
-            prompt: str = next(
-                prompter_module().build_prompt(instruction=instruction.strip("\n"))
-            )
-        else:
-            prompt = instruction.strip()
-
-        if chat_template_str:
-            batch = tokenizer.apply_chat_template(
-                [
-                    {
-                        "role": "user",
-                        "content": prompt,
-                    }
-                ],
-                return_tensors="pt",
-                add_special_tokens=True,
-                add_generation_prompt=True,
-                chat_template=chat_template_str,
-                tokenize=True,
-                return_dict=True,
-            )
-        else:
-            batch = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)
-
-        model.eval()
-        with torch.no_grad():
-            generation_config = GenerationConfig(
-                repetition_penalty=1.1,
-                max_new_tokens=cfg.get("gradio_max_new_tokens", 1024),
-                temperature=cfg.get("gradio_temperature", 0.9),
-                top_p=0.95,
-                top_k=40,
-                bos_token_id=tokenizer.bos_token_id,
-                eos_token_id=tokenizer.eos_token_id,
-                pad_token_id=tokenizer.pad_token_id,
-                do_sample=True,
-                use_cache=True,
-                return_dict_in_generate=True,
-                output_attentions=False,
-                output_hidden_states=False,
-                output_scores=False,
-            )
-            streamer = TextIteratorStreamer(tokenizer)
-            generation_kwargs = {
-                "inputs": batch["input_ids"].to(cfg.device),
-                "attention_mask": batch["attention_mask"].to(cfg.device),
-                "generation_config": generation_config,
-                "streamer": streamer,
-            }
-
-            thread = Thread(target=model.generate, kwargs=generation_kwargs)
-            thread.start()
-
-            all_text = ""
-
-            for new_text in streamer:
-                all_text += new_text
-                yield all_text
-
-    demo = gr.Interface(
-        fn=generate,
-        inputs="textbox",
-        outputs="text",
-        title=cfg.get("gradio_title", "Axolotl Gradio Interface"),
-    )
-
-    demo.queue().launch(
-        show_api=False,
-        share=cfg.get("gradio_share", True),
-        server_name=cfg.get("gradio_server_name", "127.0.0.1"),
-        server_port=cfg.get("gradio_server_port", None),
-    )
-
-
-def do_cli(
-    config: Union[Path, str] = Path("examples/"), gradio: bool = False, **kwargs
-) -> None:
-    """
-    Parses axolotl config, CLI args, and calls `do_inference` or `do_inference_gradio`.
-
-    Args:
-        config: Path to `axolotl` config YAML file.
-        kwargs: Additional keyword arguments to override config file values.
-    """
+def do_cli(config: Union[Path, str] = Path("examples/"), gradio=False, **kwargs):
    # pylint: disable=duplicate-code
    print_axolotl_text_art()
    parsed_cfg = load_cfg(config, inference=True, **kwargs)
    parsed_cfg.sample_packing = False
-    parser = transformers.HfArgumentParser(InferenceCliArgs)
+    parser = transformers.HfArgumentParser(TrainerCliArgs)
    parsed_cli_args, _ = parser.parse_args_into_dataclasses(
        return_remaining_strings=True
    )
+    parsed_cli_args.inference = True

    if gradio:
        do_inference_gradio(cfg=parsed_cfg, cli_args=parsed_cli_args)
--- a/src/axolotl/cli/integrations/init.py
+++ b/src/axolotl/cli/integrations/init.py
--- a/src/axolotl/cli/integrations/convert_diff_transformer.py
+++ b/src/axolotl/cli/integrations/convert_diff_transformer.py
@@ -0,0 +1,208 @@
+"""CLI to convert a transformers model's attention layers to differential attention layers."""
+
+import logging
+import warnings
+from pathlib import Path
+from time import time
+from typing import Union
+
+import fire
+import torch
+import yaml
+from colorama import Fore
+from dotenv import load_dotenv
+from transformers import HfArgumentParser
+
+from axolotl.cli import load_cfg, print_axolotl_text_art
+from axolotl.common.cli import ConvertDiffTransformerCliArgs, load_model_and_tokenizer
+from axolotl.integrations.diff_transformer.modeling_diff_attn import (
+    LlamaDifferentialConfig,
+    LlamaDifferentialForCausalLM,
+)
+from axolotl.utils.yaml import dump_yaml_preserved_order
+
+LOG = logging.getLogger(__name__)
+
+
+def test_inference(model, tokenizer, prompt="The quick brown fox"):
+    """Run test inference and return the generation time and generated text."""
+    inputs = tokenizer(prompt, return_tensors="pt")
+    inputs = {k: v.to(device=model.device, dtype=torch.long) for k, v in inputs.items()}
+
+    start = time()
+    with torch.no_grad():
+        outputs = model.generate(
+            **inputs,
+            max_new_tokens=20,
+            num_beams=1,
+            do_sample=False,
+            pad_token_id=tokenizer.pad_token_id,
+            use_cache=False,
+        )
+    elapsed = time() - start
+
+    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
+    LOG.info("Prompt: %s", prompt)
+    LOG.info("Generated: %s", generated_text)
+    LOG.info("Generation time: %.2fs", elapsed)
+
+    return elapsed, generated_text
+
+
+def convert_diff_transformer(cfg, cli_args, config_path):
+    assert not (
+        cli_args.split_heads and cli_args.zero_init
+    ), "Both `split_heads` and `zero_init` cannot be `True`"
+    assert not (
+        cli_args.zero_init and cli_args.mirror_weights
+    ), "Both `zero_init` and `mirror_weights` cannot be `True`"
+
+    debug_info = {}
+
+    # Load model and tokenizer
+    with warnings.catch_warnings():
+        warnings.simplefilter("ignore")
+        model, tokenizer = load_model_and_tokenizer(cfg=cfg, cli_args=cli_args)
+        model.to(cfg.device, dtype=cfg.torch_dtype)
+
+    # Log original model info
+    LOG.info(
+        "Original model config:\n\t- Hidden size: %d\n\t- Num attention heads: %d",
+        model.config.hidden_size,
+        model.config.num_attention_heads,
+    )
+
+    # Test original model
+    if cli_args.debug:
+        LOG.info("Testing original model...")
+        debug_info["orig_time"], debug_info["orig_text"] = test_inference(
+            model, tokenizer
+        )
+
+    try:
+        # Convert attention
+        LOG.info("Converting to differential attention...")
+
+        config = LlamaDifferentialConfig(
+            **model.config.__dict__,
+            zero_init=cli_args.zero_init,
+            sublayer_norm=cli_args.sublayer_norm,
+            split_heads=cli_args.split_heads,
+            mirror_weights=cli_args.mirror_weights,
+        )
+        model = LlamaDifferentialForCausalLM.from_llama(model, config)
+        model.to(cfg.device, dtype=cfg.torch_dtype)
+    except Exception as exc:
+        LOG.error(Fore.RED + "Conversion failed: %s" + Fore.RESET, str(exc))
+        raise
+
+    # Test converted model
+    if cli_args.debug:
+        LOG.info("Testing converted model...")
+        debug_info["conv_time"], debug_info["conv_text"] = test_inference(
+            model, tokenizer
+        )
+
+    # Save if requested
+    if cfg.output_dir:
+        # Save model and tokenizer
+        LOG.info("Saving converted model to %s", cfg.output_dir)
+        model.save_pretrained(cfg.output_dir)
+        tokenizer.save_pretrained(cfg.output_dir)
+
+        # Modify config to reflect new path / differential attention
+        output_config_path = Path(cfg.output_dir) / "axolotl_config.yml"
+        LOG.info("Saving updated config to %s", output_config_path)
+
+        with open(config_path, "r", encoding="utf-8") as file:
+            modified_cfg = yaml.safe_load(file) or {}
+
+        modified_cfg["base_model"] = cfg.output_dir
+        modified_cfg["diff_attention"] = True
+        plugin_class = (
+            "axolotl.integrations.diff_transformer.DifferentialTransformerPlugin"
+        )
+        if "plugins" in modified_cfg:
+            modified_cfg["plugins"].append(plugin_class)
+        else:
+            modified_cfg["plugins"] = [plugin_class]
+
+        # Write out the updated axolotl config while preserving original ordering / formatting
+        dump_yaml_preserved_order(
+            data=modified_cfg,
+            reference_yaml_path=config_path,
+            output_path=output_config_path,
+        )
+    else:
+        LOG.info("Not saving converted model to disk")
+        LOG.info("Pass --output-dir path/to/save to save model")
+
+    if cli_args.debug:
+        LOG.info(
+            Fore.GREEN
+            + "Conversion successful!\n"
+            + f"Original generation time: {debug_info['orig_time']:.2f}s\n"
+            + f"Converted generation time: {debug_info['conv_time']:.2f}s"
+            + Fore.RESET
+        )
+
+        if debug_info["orig_text"] == debug_info["conv_text"]:
+            LOG.info(
+                Fore.GREEN
+                + "Generations match!\n"
+                + "Model generation:\n"
+                + "*" * 50
+                + "\n"
+                + f"{debug_info['orig_text']}\n"
+                + "*" * 50
+                + "\n"
+                + Fore.RESET
+            )
+            debug_info["generations_match"] = True
+        else:
+            message = (
+                "Generations do not match.\n"
+                + "Original generation:\n"
+                + "*" * 50
+                + "\n"
+                + f"{debug_info['orig_text']}\n"
+                + "*" * 50
+                + "\n"
+                + "Converted generation:\n"
+                + "*" * 50
+                + "\n"
+                + f"{debug_info['conv_text']}\n"
+                + "*" * 50
+                + "\n"
+            )
+            debug_info["generations_match"] = False
+
+            if cli_args.zero_init and not cli_args.sublayer_norm:
+                LOG.info(Fore.RED + message + Fore.RESET)
+                debug_info["match_expected"] = True
+            else:
+                LOG.info(
+                    Fore.YELLOW
+                    + message
+                    + "However, this is expected since --zero-init"
+                    + " and --no-sublayer-norm were not passed."
+                    + Fore.RESET
+                )
+                debug_info["match_expected"] = False
+
+    return model, debug_info
+
+
+def do_cli(config: Union[Path, str] = Path("examples/"), **kwargs):
+    print_axolotl_text_art()
+
+    cfg = load_cfg(config, **kwargs)
+    parser = HfArgumentParser(ConvertDiffTransformerCliArgs)
+    cli_args, _ = parser.parse_args_into_dataclasses(return_remaining_strings=True)
+
+    convert_diff_transformer(cfg, cli_args, config)
+
+
+if __name__ == "__main__":
+    load_dotenv()
+    fire.Fire(do_cli)
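
Review note: the config rewrite in `convert_diff_transformer` appends the plugin class unconditionally, so re-converting from an already-updated config would list the plugin twice, and a YAML file with `plugins:` set to null would fail on `.append`. A minimal sketch of a more defensive variant follows. `update_config_for_conversion` is a hypothetical helper, not an existing axolotl function, and the de-duplication is a suggested change rather than what the diff does.

```python
from typing import Any, Dict

DIFF_PLUGIN = "axolotl.integrations.diff_transformer.DifferentialTransformerPlugin"


def update_config_for_conversion(
    raw_cfg: Dict[str, Any],
    output_dir: str,
    plugin_class: str = DIFF_PLUGIN,
    flag: str = "diff_attention",
) -> Dict[str, Any]:
    """Hypothetical helper: point the config at the converted model."""
    updated = dict(raw_cfg)
    updated["base_model"] = output_dir
    updated[flag] = True
    # Treat a missing or null `plugins` key as an empty list.
    plugins = list(updated.get("plugins") or [])
    # Deviation from the diff: skip the append if the plugin is already listed.
    if plugin_class not in plugins:
        plugins.append(plugin_class)
    updated["plugins"] = plugins
    return updated
```

The same helper would cover the RALA converter by passing `plugin_class="axolotl.integrations.rala.RalaPlugin"` and `flag="rala_attention"`, which are the values used in `convert_rala.py`.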
--- a/src/axolotl/cli/integrations/convert_rala.py
+++ b/src/axolotl/cli/integrations/convert_rala.py
@@ -0,0 +1,198 @@
+"""CLI to convert a transformers model's attention layers to RALA attention layers."""
+import logging
+import warnings
+from pathlib import Path
+from time import time
+from typing import Union
+
+import fire
+import torch
+import yaml
+from colorama import Fore
+from dotenv import load_dotenv
+from transformers import HfArgumentParser
+
+from axolotl.cli import load_cfg, print_axolotl_text_art
+from axolotl.common.cli import ConvertDiffTransformerCliArgs, load_model_and_tokenizer
+from axolotl.integrations.rala.convert import convert_to_rala
+from axolotl.utils.yaml import dump_yaml_preserved_order
+
+LOG = logging.getLogger(__name__)
+
+
+def test_inference(model, tokenizer, prompt="The quick brown fox"):
+    """Run test inference and return the generation time and generated text."""
+    try:
+        inputs = tokenizer(prompt, return_tensors="pt")
+        inputs = {
+            k: v.to(device=model.device, dtype=torch.long) for k, v in inputs.items()
+        }
+
+        start = time()
+        with torch.no_grad():
+            outputs = model.generate(
+                **inputs,
+                max_new_tokens=20,
+                num_beams=1,
+                do_sample=False,
+                pad_token_id=tokenizer.pad_token_id,
+                use_cache=False,
+            )
+        elapsed = time() - start
+
+        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
+        LOG.info("Prompt: %s", prompt)
+        LOG.info("Generated: %s", generated_text)
+        LOG.info("Generation time: %.2fs", elapsed)
+
+        return elapsed, generated_text
+
+    except Exception as exc:
+        LOG.error("Inference failed: %s", str(exc))
+        raise
+
+
+def convert_rala(cfg, cli_args, config_path):
+    debug_info = {}
+
+    # Load model and tokenizer
+    with warnings.catch_warnings():
+        warnings.simplefilter("ignore")
+        model, tokenizer = load_model_and_tokenizer(cfg=cfg, cli_args=cli_args)
+        model.to(cfg.device, dtype=cfg.torch_dtype)
+
+    # Log original model info
+    LOG.info(
+        "Original model config:\n\t- Hidden size: %d\n\t- Num attention heads: %d",
+        model.config.hidden_size,
+        model.config.num_attention_heads,
+    )
+
+    # Test original model
+    if cli_args.debug:
+        LOG.info("Testing original model...")
+        debug_info["orig_time"], debug_info["orig_text"] = test_inference(
+            model, tokenizer
+        )
+
+    # Convert attention
+    try:
+        model = convert_to_rala(
+            model=model,
+            zero_init=cli_args.zero_init,
+        )
+        model.to(cfg.device, dtype=cfg.torch_dtype)
+        model.config.model_type = "llama-rala"
+    except Exception as exc:
+        LOG.error(Fore.RED + "Conversion failed: %s" + Fore.RESET, str(exc))
+        raise
+
+    # Test converted model
+    if cli_args.debug:
+        LOG.info("Testing converted model...")
+        debug_info["conv_time"], debug_info["conv_text"] = test_inference(
+            model, tokenizer
+        )
+
+    # Save if requested
+    if cfg.output_dir:
+        # Save model and tokenizer
+        LOG.info("Saving converted model to %s", cfg.output_dir)
+        model.save_pretrained(cfg.output_dir)
+        tokenizer.save_pretrained(cfg.output_dir)
+
+        # Modify config to reflect new path / RALA attention
+        output_config_path = Path(cfg.output_dir) / "axolotl_config.yml"
+        LOG.info("Saving updated config to %s", output_config_path)
+
+        with open(config_path, "r", encoding="utf-8") as file:
+            modified_cfg = yaml.safe_load(file) or {}
+
+        modified_cfg["base_model"] = cfg.output_dir
+        modified_cfg["rala_attention"] = True
+        plugin_class = "axolotl.integrations.rala.RalaPlugin"
+        if "plugins" in modified_cfg:
+            modified_cfg["plugins"].append(plugin_class)
+        else:
+            modified_cfg["plugins"] = [plugin_class]
+
+        dump_yaml_preserved_order(
+            data=modified_cfg,
+            reference_yaml_path=config_path,
+            output_path=output_config_path,
+        )
+    else:
+        LOG.info("Not saving converted model to disk")
+        LOG.info("Pass --output-dir path/to/save to save model")
+
+    if cli_args.debug:
+        LOG.info(
+            Fore.GREEN
+            + "Conversion successful!\n"
+            + f"Original generation time: {debug_info['orig_time']:.2f}s\n"
+            + f"Converted generation time: {debug_info['conv_time']:.2f}s"
+            + Fore.RESET
+        )
+
+        if debug_info["orig_text"] == debug_info["conv_text"]:
+            LOG.info(
+                Fore.GREEN
+                + "Generations match!\n"
+                + "Model generation:\n"
+                + "*" * 50
+                + "\n"
+                + f"{debug_info['orig_text']}\n"
+                + "*" * 50
+                + "\n"
+                + Fore.RESET
+            )
+            debug_info["generations_match"] = True
+        else:
+            message = (
+                "Generations do not match.\n"
+                + "Original generation:\n"
+                + "*" * 50
+                + "\n"
+                + f"{debug_info['orig_text']}\n"
+                + "*" * 50
+                + "\n"
+                + "Converted generation:\n"
+                + "*" * 50
+                + "\n"
+                + f"{debug_info['conv_text']}\n"
+                + "*" * 50
+                + "\n"
+            )
+            debug_info["generations_match"] = False
+
+            if cli_args.zero_init and not cli_args.sublayer_norm:
+                LOG.info(Fore.RED + message + Fore.RESET)
+                debug_info["match_expected"] = True
+            else:
+                LOG.info(
+                    Fore.YELLOW
+                    + message
+                    + "However, this is expected since --zero-init"
+                    + " and --no-sublayer-norm were not passed."
+                    + Fore.RESET
+                )
+                debug_info["match_expected"] = False
+
+    return model, debug_info
+
+
+def do_cli(config: Union[Path, str] = Path("examples/"), **kwargs):
+    print_axolotl_text_art()
+
+    cfg = load_cfg(config, **kwargs)
+    if cfg.rala_attention:
+        cfg.rala_attention = False
+    parser = HfArgumentParser(ConvertDiffTransformerCliArgs)
+    cli_args, _ = parser.parse_args_into_dataclasses(return_remaining_strings=True)
+
+    convert_rala(cfg, cli_args, config)
+
+
+if __name__ == "__main__":
+    load_dotenv()
+    fire.Fire(do_cli)
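
The config rewrite at the end of `convert_rala` comes down to three dictionary updates made before the YAML is dumped. Below is a minimal sketch of that step, using a plain dict in place of the parsed YAML. The helper name `register_rala_plugin` is hypothetical and not part of axolotl. Unlike the diff, the sketch tolerates `plugins: null` and does not append the plugin twice; both are assumptions about the desired behaviour.

```python
from typing import Any, Dict

# Plugin path taken from the diff above.
RALA_PLUGIN = "axolotl.integrations.rala.RalaPlugin"


def register_rala_plugin(cfg: Dict[str, Any], output_dir: str) -> Dict[str, Any]:
    """Return a copy of cfg that points at the converted model with RALA enabled.

    Hypothetical helper for illustration only.
    """
    updated = dict(cfg)  # shallow copy so the caller's dict is not mutated
    updated["base_model"] = output_dir
    updated["rala_attention"] = True

    # Treat a missing or null `plugins` key as an empty list.
    plugins = list(updated.get("plugins") or [])
    if RALA_PLUGIN not in plugins:  # assumption: avoid duplicate registration
        plugins.append(RALA_PLUGIN)
    updated["plugins"] = plugins
    return updated
```

Running the helper twice on its own output leaves the plugin list unchanged, which can matter if an already converted config is passed through the converter again.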
--- a/src/axolotl/cli/main.py
+++ b/src/axolotl/cli/main.py
@@ -1,19 +1,22 @@
-"""Click CLI definitions for various axolotl commands."""
+"""CLI definition for various axolotl commands."""
 # pylint: disable=redefined-outer-name
-
 import subprocess  # nosec B404
 from typing import Optional

 import click

 import axolotl
-from axolotl.cli.args import EvaluateCliArgs, PreprocessCliArgs, TrainerCliArgs
 from axolotl.cli.utils import (
    add_options_from_config,
    add_options_from_dataclass,
    build_command,
    fetch_from_github,
-    filter_none_kwargs,
+)
+from axolotl.common.cli import (
+    ConvertDiffTransformerCliArgs,
+    EvaluateCliArgs,
+    PreprocessCliArgs,
+    TrainerCliArgs,
 )
 from axolotl.utils import set_pytorch_cuda_alloc_conf
 from axolotl.utils.config.models.input.v0_4_1 import AxolotlInputConfig
@@ -29,16 +32,10 @@ def cli():
@click.argument("config", type=click.Path(exists=True, path_type=str))
@add_options_from_dataclass(PreprocessCliArgs)
@add_options_from_config(AxolotlInputConfig)
-@filter_none_kwargs
-def preprocess(config: str, **kwargs) -> None:
-    """
-    Preprocess datasets before training.
+def preprocess(config: str, **kwargs):
+    """Preprocess datasets before training."""
+    kwargs = {k: v for k, v in kwargs.items() if v is not None}

-    Args:
-        config: Path to `axolotl` config YAML file.
-        kwargs: Additional keyword arguments which correspond to CLI args or `axolotl`
-            config options.
-    """
    from axolotl.cli.preprocess import do_cli

    do_cli(config=config, **kwargs)
@@ -53,17 +50,10 @@ def preprocess(config: str, **kwargs) -> None:
 )
@add_options_from_dataclass(TrainerCliArgs)
@add_options_from_config(AxolotlInputConfig)
-@filter_none_kwargs
-def train(config: str, accelerate: bool, **kwargs) -> None:
-    """
-    Train or fine-tune a model.
+def train(config: str, accelerate: bool, **kwargs):
+    """Train or fine-tune a model."""
+    kwargs = {k: v for k, v in kwargs.items() if v is not None}

-    Args:
-        config: Path to `axolotl` config YAML file.
-        accelerate: Whether to use `accelerate` launcher.
-        kwargs: Additional keyword arguments which correspond to CLI args or `axolotl`
-            config options.
-    """
    # Enable expandable segments for cuda allocation to improve VRAM usage
    set_pytorch_cuda_alloc_conf()

@@ -88,17 +78,13 @@ def train(config: str, accelerate: bool, **kwargs) -> None:
 )
@add_options_from_dataclass(EvaluateCliArgs)
@add_options_from_config(AxolotlInputConfig)
-@filter_none_kwargs
-def evaluate(config: str, accelerate: bool, **kwargs) -> None:
-    """
-    Evaluate a model.
+def evaluate(config: str, accelerate: bool, **kwargs):
+    """Evaluate a model."""
+    kwargs = {k: v for k, v in kwargs.items() if v is not None}
+
+    # Enable expandable segments for cuda allocation to improve VRAM usage
+    set_pytorch_cuda_alloc_conf()

-    Args:
-        config: Path to `axolotl` config YAML file.
-        accelerate: Whether to use `accelerate` launcher.
-        kwargs: Additional keyword arguments which correspond to CLI args or `axolotl`
-            config options.
-    """
    if accelerate:
        base_cmd = ["accelerate", "launch", "-m", "axolotl.cli.evaluate"]
        if config:
@@ -118,33 +104,81 @@ def evaluate(config: str, accelerate: bool, **kwargs) -> None:
    default=False,
    help="Use accelerate launch for multi-GPU inference",
 )
+@click.option(
+    "--lora-model-dir",
+    type=click.Path(exists=True, path_type=str),
+    help="Directory containing LoRA model",
+)
+@click.option(
+    "--base-model",
+    type=click.Path(exists=True, path_type=str),
+    help="Path to base model for non-LoRA models",
+)
@click.option("--gradio", is_flag=True, help="Launch Gradio interface")
+@click.option("--load-in-8bit", is_flag=True, help="Load model in 8-bit mode")
@add_options_from_dataclass(TrainerCliArgs)
@add_options_from_config(AxolotlInputConfig)
-@filter_none_kwargs
-def inference(config: str, accelerate: bool, gradio: bool, **kwargs) -> None:
-    """
-    Run inference with a trained model.
+def inference(
+    config: str,
+    accelerate: bool,
+    lora_model_dir: Optional[str] = None,
+    base_model: Optional[str] = None,
+    **kwargs,
+):
+    """Run inference with a trained model."""
+    kwargs = {k: v for k, v in kwargs.items() if v is not None}
+    kwargs.pop("inference", None)  # interferes with inference.do_cli
+
+    if lora_model_dir:
+        kwargs["lora_model_dir"] = lora_model_dir
+    if base_model:
+        kwargs["base_model"] = base_model

-    Args:
-        config: Path to `axolotl` config YAML file.
-        accelerate: Whether to use `accelerate` launcher.
-        gradio: Whether to use Gradio browser interface or command line for inference.
-        kwargs: Additional keyword arguments which correspond to CLI args or `axolotl`
-            config options.
-    """
    if accelerate:
        base_cmd = ["accelerate", "launch", "-m", "axolotl.cli.inference"]
        if config:
            base_cmd.append(config)
-        if gradio:
-            base_cmd.append("--gradio")
        cmd = build_command(base_cmd, kwargs)
        subprocess.run(cmd, check=True)  # nosec B603
    else:
        from axolotl.cli.inference import do_cli

-        do_cli(config=config, gradio=gradio, **kwargs)
+        do_cli(config=config, **kwargs)
+
+
+@cli.command()
+@click.argument("config", type=click.Path(exists=True, path_type=str))
+@click.option(
+    "--accelerate/--no-accelerate",
+    default=False,
+    help="Use accelerate launch for multi-GPU operations",
+)
+@click.option(
+    "--model-dir",
+    type=click.Path(exists=True, path_type=str),
+    help="Directory containing model weights to shard",
+)
+@click.option(
+    "--save-dir",
+    type=click.Path(path_type=str),
+    help="Directory to save sharded weights",
+)
+@add_options_from_dataclass(TrainerCliArgs)
+@add_options_from_config(AxolotlInputConfig)
+def shard(config: str, accelerate: bool, **kwargs):
+    """Shard model weights."""
+    kwargs = {k: v for k, v in kwargs.items() if v is not None}
+
+    if accelerate:
+        base_cmd = ["accelerate", "launch", "-m", "axolotl.cli.shard"]
+        if config:
+            base_cmd.append(config)
+        cmd = build_command(base_cmd, kwargs)
+        subprocess.run(cmd, check=True)  # nosec B603
+    else:
+        from axolotl.cli.shard import do_cli
+
+        do_cli(config=config, **kwargs)


@cli.command()
@@ -154,19 +188,20 @@ def inference(config: str, accelerate: bool, gradio: bool, **kwargs) -> None:
    default=True,
    help="Use accelerate launch for weight merging",
 )
+@click.option(
+    "--model-dir",
+    type=click.Path(exists=True, path_type=str),
+    help="Directory containing sharded weights",
+)
+@click.option(
+    "--save-path", type=click.Path(path_type=str), help="Path to save merged weights"
+)
@add_options_from_dataclass(TrainerCliArgs)
@add_options_from_config(AxolotlInputConfig)
-@filter_none_kwargs
-def merge_sharded_fsdp_weights(config: str, accelerate: bool, **kwargs) -> None:
-    """
-    Merge sharded FSDP model weights.
+def merge_sharded_fsdp_weights(config: str, accelerate: bool, **kwargs):
+    """Merge sharded FSDP model weights."""
+    kwargs = {k: v for k, v in kwargs.items() if v is not None}

-    Args:
-        config: Path to `axolotl` config YAML file.
-        accelerate: Whether to use `accelerate` launcher.
-        kwargs: Additional keyword arguments which correspond to CLI args or `axolotl`
-            config options.
-    """
    if accelerate:
        base_cmd = [
            "accelerate",
@@ -186,38 +221,69 @@ def merge_sharded_fsdp_weights(config: str, accelerate: bool, **kwargs) -> None:

@cli.command()
@click.argument("config", type=click.Path(exists=True, path_type=str))
-@add_options_from_dataclass(TrainerCliArgs)
-@add_options_from_config(AxolotlInputConfig)
-@filter_none_kwargs
-def merge_lora(config: str, **kwargs) -> None:
-    """
-    Merge trained LoRA adapters into a base model.
+@click.option(
+    "--lora-model-dir",
+    type=click.Path(exists=True, path_type=str),
+    help="Directory containing the LoRA model to merge",
+)
+@click.option(
+    "--output-dir",
+    type=click.Path(path_type=str),
+    help="Directory to save the merged model",
+)
+def merge_lora(
+    config: str,
+    lora_model_dir: Optional[str] = None,
+    output_dir: Optional[str] = None,
+):
+    """Merge a trained LoRA into a base model"""
+    kwargs = {}
+    if lora_model_dir:
+        kwargs["lora_model_dir"] = lora_model_dir
+    if output_dir:
+        kwargs["output_dir"] = output_dir

-    Args:
-        config: Path to `axolotl` config YAML file.
-        accelerate: Whether to use `accelerate` launcher.
-        kwargs: Additional keyword arguments which correspond to CLI args or `axolotl`
-            config options.
-    """
    from axolotl.cli.merge_lora import do_cli

    do_cli(config=config, **kwargs)


+@cli.command()
+@click.argument("config", type=click.Path(exists=True, path_type=str))
+@add_options_from_dataclass(ConvertDiffTransformerCliArgs)
+@add_options_from_config(AxolotlInputConfig)
+def convert_diff_transformer(config: str, **kwargs):
+    """Convert model attention layers to differential attention layers."""
+    kwargs = {k: v for k, v in kwargs.items() if v is not None}
+
+    from axolotl.cli.integrations.convert_diff_transformer import do_cli
+
+    do_cli(config=config, **kwargs)
+
+
+@cli.command()
+@click.argument("config", type=click.Path(exists=True, path_type=str))
+@add_options_from_dataclass(ConvertDiffTransformerCliArgs)
+@add_options_from_config(AxolotlInputConfig)
+def convert_rala(config: str, **kwargs):
+    """Convert model attention layers to RALA attention layers."""
+    kwargs = {k: v for k, v in kwargs.items() if v is not None}
+
+    from axolotl.cli.integrations.convert_rala import do_cli
+
+    do_cli(config=config, **kwargs)
+
+
@cli.command()
@click.argument("directory", type=click.Choice(["examples", "deepspeed_configs"]))
@click.option("--dest", help="Destination directory")
-def fetch(directory: str, dest: Optional[str]) -> None:
+def fetch(directory: str, dest: Optional[str]):
    """
    Fetch example configs or other resources.

    Available directories:
    - examples: Example configuration files
    - deepspeed_configs: DeepSpeed configuration files
-
-    Args:
-        directory: One of `examples`, `deepspeed_configs`.
-        dest: Optional destination directory.
    """
    fetch_from_github(f"{directory}/", dest)
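
Each command in this file now repeats the same `None`-filtering comprehension that the removed `filter_none_kwargs` decorator used to provide. For comparison, here is a standalone sketch of that decorator pattern using only the standard library. `show` is a hypothetical stand-in for a Click command and is not part of axolotl.

```python
from functools import wraps
from typing import Any, Callable


def filter_none_kwargs(func: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap func so that keyword arguments whose value is None are dropped."""

    @wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        # Only None is removed; falsy values such as False or 0 are kept.
        filtered = {k: v for k, v in kwargs.items() if v is not None}
        return func(*args, **filtered)

    return wrapper


@filter_none_kwargs
def show(config: str, **kwargs: Any) -> dict:
    """Hypothetical command body: echo back what it received."""
    return {"config": config, **kwargs}
```

Either form keeps unset CLI options from overriding values loaded from the YAML config. The decorator keeps that rule in one place; the inline comprehension makes it visible in each command.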

--- a/src/axolotl/cli/merge_lora.py
+++ b/src/axolotl/cli/merge_lora.py
@@ -1,6 +1,6 @@
-"""CLI to merge a trained LoRA into a base model."""
-
-import logging
+"""
+CLI to merge a trained LoRA into a base model
+"""
 from pathlib import Path
 from typing import Union

@@ -8,58 +8,14 @@ import fire
 import transformers
 from dotenv import load_dotenv

-from axolotl.cli.args import TrainerCliArgs
-from axolotl.cli.art import print_axolotl_text_art
-from axolotl.cli.config import load_cfg
-from axolotl.cli.utils import load_model_and_tokenizer
-from axolotl.utils.dict import DictDefault
-
-LOG = logging.getLogger(__name__)
+from axolotl.cli import do_merge_lora, load_cfg, print_axolotl_text_art
+from axolotl.common.cli import TrainerCliArgs


-def do_merge_lora(*, cfg: DictDefault) -> None:
-    """
-    Calls `transformers`' `merge_and_unload` on the model given in the `axolotl` config
-    along with the LoRA adapters to combine them into a single base model.
-
-    Args:
-        cfg: Dictionary mapping `axolotl` config keys to values.
-    """
-    print_axolotl_text_art()
-
-    model, tokenizer = load_model_and_tokenizer(cfg=cfg)
-    safe_serialization = cfg.save_safetensors is True
-
-    LOG.info("Running merge of LoRA with base model...")
-    model = model.merge_and_unload(progressbar=True)
-    model.to(dtype=cfg.torch_dtype)
-    model.generation_config.do_sample = True
-
-    if cfg.local_rank == 0:
-        LOG.info(f"Saving merged model to: {str(Path(cfg.output_dir) / 'merged')}...")
-        model.save_pretrained(
-            str(Path(cfg.output_dir) / "merged"),
-            safe_serialization=safe_serialization,
-            progressbar=True,
-        )
-        tokenizer.save_pretrained(str(Path(cfg.output_dir) / "merged"))
-
-
-def do_cli(config: Union[Path, str] = Path("examples/"), **kwargs) -> None:
-    """
-    Parses `axolotl` config, CLI args, and calls `do_merge_lora`. Note that various
-    config values will be overwritten to allow the LoRA merge logic to work as expected
-    (`load_in_8bit=False`, `load_in4bit=False`, `flash_attention=False`, etc.).
-
-    Args:
-        config: Path to `axolotl` config YAML file.
-        kwargs: Additional keyword arguments to override config file values.
-
-    Raises:
-        ValueError: If target directory for LoRA merged model does not exist.
-    """
+def do_cli(config: Union[Path, str] = Path("examples/"), **kwargs):
    # pylint: disable=duplicate-code
-    parser = transformers.HfArgumentParser(TrainerCliArgs)
+    print_axolotl_text_art()
+    parser = transformers.HfArgumentParser(TrainerCliArgs)
    parsed_cli_args, _ = parser.parse_args_into_dataclasses(
        return_remaining_strings=True
    )
@@ -90,7 +46,7 @@ def do_cli(config: Union[Path, str] = Path("examples/"), **kwargs) -> None:
    parsed_cfg.fsdp = None
    parsed_cfg.fsdp_config = None

-    do_merge_lora(cfg=parsed_cfg)
+    do_merge_lora(cfg=parsed_cfg, cli_args=parsed_cli_args)


 if __name__ == "__main__":
--- a/src/axolotl/cli/merge_sharded_fsdp_weights.py
+++ b/src/axolotl/cli/merge_sharded_fsdp_weights.py
@@ -1,5 +1,6 @@
-"""CLI to merge sharded FSDP model checkpoints into a single combined checkpoint."""
-
+"""
+This module provides a CLI to merge sharded FSDP model checkpoints into a single combined checkpoint
+"""
 import json
 import logging
 import os
@@ -24,15 +25,16 @@ from huggingface_hub import split_torch_state_dict_into_shards
 from safetensors.torch import save_file as safe_save_file
 from torch.distributed.checkpoint.format_utils import _EmptyStateDictLoadPlanner

-from axolotl.cli.args import TrainerCliArgs
-from axolotl.cli.art import print_axolotl_text_art
-from axolotl.cli.config import load_cfg
+from axolotl.cli import load_cfg, print_axolotl_text_art
+from axolotl.common.cli import TrainerCliArgs

-LOG = logging.getLogger(__name__)
+LOG = logging.getLogger("axolotl.cli.merge_sharded_fsdp_weights")


 class BFloat16CastPlanner(_EmptyStateDictLoadPlanner):
-    """A custom planner to cast tensors to bfloat16 on the fly during loading."""
+    """
+    A custom planner to cast tensors to bfloat16 on the fly during loading.
+    """

    def commit_tensor(self, read_item, tensor):  # pylint: disable=unused-argument
        tensor.copy_(tensor.to(torch.bfloat16))
@@ -43,19 +45,11 @@ def _distributed_checkpoint_to_merged_weights(
    save_path: str,
    safe_serialization: bool = False,
    max_shard_size: str = "5GB",
-) -> Path:
+):
    """
-    Passthrough to `torch.distributed.checkpoint.format_utils.dcp_to_torch_save`. Will
-    save under `save_path` as either `model.safetensors` or `pytorch_model.bin`.
+    Passthrough to `torch.distributed.checkpoint.format_utils.dcp_to_torch_save`

-    Args:
-        checkpoint_dir: Directory where distributed checkpoint is saved.
-        save_path: Path to save model to.
-        safe_serialization: Whether to save in safetensors format.
-        max_shard_size: Max size of model shards to save.
-
-    Returns:
-        Path where model is saved.
+    Will save under `save_path` as either `model.safetensors` or `pytorch_model.bin`.
    """

    state_dict: Dict = {}
@@ -85,7 +79,6 @@ def _distributed_checkpoint_to_merged_weights(
    state_dict_split = split_torch_state_dict_into_shards(
        state_dict, filename_pattern=filename_pattern, max_shard_size=max_shard_size
    )
-
    # Save index if sharded
    index = None
    if state_dict_split.is_sharded:
@@ -142,9 +135,6 @@ def merge_fsdp_weights(
            Whether to save the merged weights with safetensors (recommended).
        remove_checkpoint_dir (`bool`, *optional*, defaults to `False`):
            Whether to remove the checkpoint directory after merging.
-
-    Raises:
-        ValueError: If torch version < 2.3.0, or if `checkpoint_dir` does not exist.
    """
    checkpoint_dir_ = Path(checkpoint_dir)
    from accelerate.state import PartialState
@@ -188,21 +178,18 @@ def merge_fsdp_weights(


 def do_cli(config: Union[Path, str] = Path("examples/"), **kwargs):
-    """
-    Parses `axolotl` config, CLI args, and calls `merge_fsdp_weights`.
-
-    Args:
-        config: Path to `axolotl` config YAML file.
-        kwargs: Additional keyword arguments to override config file values.
-    """
    # pylint: disable=duplicate-code
    print_axolotl_text_art()
-    parser = transformers.HfArgumentParser(TrainerCliArgs)
+    parser = transformers.HfArgumentParser(TrainerCliArgs)
    parsed_cli_args, _ = parser.parse_args_into_dataclasses(
        return_remaining_strings=True
    )
    parsed_cli_args.merge_lora = True
-    parsed_cfg = load_cfg(config, **kwargs)
+
+    parsed_cfg = load_cfg(
+        config,
+        **kwargs,
+    )

    fsdp_dir = Path(parsed_cfg.output_dir) / "pytorch_model_fsdp_0"
    merge_fsdp_weights(
--- a/src/axolotl/cli/preprocess.py
+++ b/src/axolotl/cli/preprocess.py
@@ -1,5 +1,6 @@
-"""CLI to run preprocessing of a dataset."""
-
+"""
+CLI to run preprocessing of a dataset
+"""
 import logging
 import warnings
 from pathlib import Path
@@ -12,31 +13,34 @@ from colorama import Fore
 from dotenv import load_dotenv
 from transformers import AutoModelForCausalLM

-from axolotl.cli.args import PreprocessCliArgs
-from axolotl.cli.art import print_axolotl_text_art
-from axolotl.cli.checks import check_accelerate_default_config, check_user_token
-from axolotl.cli.config import load_cfg
+from axolotl.cli import (
+    check_accelerate_default_config,
+    check_user_token,
+    load_cfg,
+    load_datasets,
+    load_rl_datasets,
+    print_axolotl_text_art,
+)
+from axolotl.common.cli import PreprocessCliArgs
 from axolotl.common.const import DEFAULT_DATASET_PREPARED_PATH
-from axolotl.common.datasets import load_datasets, load_preference_datasets
-from axolotl.utils.dict import DictDefault
 from axolotl.utils.trainer import disable_datasets_caching

-LOG = logging.getLogger(__name__)
+LOG = logging.getLogger("axolotl.cli.preprocess")


-def do_preprocess(cfg: DictDefault, cli_args: PreprocessCliArgs) -> None:
-    """
-    Preprocesses dataset specified in axolotl config.
-
-    Args:
-        cfg: Dictionary mapping `axolotl` config keys to values.
-        cli_args: Preprocessing-specific CLI arguments.
-    """
+def do_cli(config: Union[Path, str] = Path("examples/"), **kwargs):
+    # pylint: disable=duplicate-code
    print_axolotl_text_art()
+    parsed_cfg = load_cfg(config, **kwargs)
+    parsed_cfg.is_preprocess = True
    check_accelerate_default_config()
    check_user_token()
+    parser = transformers.HfArgumentParser(PreprocessCliArgs)
+    parsed_cli_args, _ = parser.parse_args_into_dataclasses(
+        return_remaining_strings=True
+    )

-    if not cfg.dataset_prepared_path:
+    if not parsed_cfg.dataset_prepared_path:
        msg = (
            Fore.RED
            + "preprocess CLI called without dataset_prepared_path set, "
@@ -44,16 +48,16 @@ def do_preprocess(cfg: DictDefault, cli_args: PreprocessCliArgs) -> None:
            + Fore.RESET
        )
        LOG.warning(msg)
-        cfg.dataset_prepared_path = DEFAULT_DATASET_PREPARED_PATH
+        parsed_cfg.dataset_prepared_path = DEFAULT_DATASET_PREPARED_PATH

    with disable_datasets_caching():
-        if cfg.rl:
-            load_preference_datasets(cfg=cfg, cli_args=cli_args)
+        if parsed_cfg.rl:  # and parsed_cfg.rl != "orpo":
+            load_rl_datasets(cfg=parsed_cfg, cli_args=parsed_cli_args)
        else:
-            load_datasets(cfg=cfg, cli_args=cli_args)
+            load_datasets(cfg=parsed_cfg, cli_args=parsed_cli_args)

-    if cli_args.download:
-        model_name = cfg.base_model
+    if parsed_cli_args.download:
+        model_name = parsed_cfg.base_model
        with warnings.catch_warnings():
            # there are a bunch of useless UserWarnings about
            # "copying from a non-meta parameter in the checkpoint to a meta parameter in the current model"
@@ -70,30 +74,11 @@ def do_preprocess(cfg: DictDefault, cli_args: PreprocessCliArgs) -> None:

    LOG.info(
        Fore.GREEN
-        + f"Success! Preprocessed data path: `dataset_prepared_path: {cfg.dataset_prepared_path}`"
+        + f"Success! Preprocessed data path: `dataset_prepared_path: {parsed_cfg.dataset_prepared_path}`"
        + Fore.RESET
    )


-def do_cli(config: Union[Path, str] = Path("examples/"), **kwargs) -> None:
-    """
-    Parses `axolotl` config, CLI args, and calls `do_preprocess`.
-
-    Args:
-        config: Path to `axolotl` config YAML file.
-        kwargs: Additional keyword arguments to override config file values.
-    """
-    # pylint: disable=duplicate-code
-    parsed_cfg = load_cfg(config, **kwargs)
-    parsed_cfg.is_preprocess = True
-    parser = transformers.HfArgumentParser(PreprocessCliArgs)
-    parsed_cli_args, _ = parser.parse_args_into_dataclasses(
-        return_remaining_strings=True
-    )
-
-    do_preprocess(parsed_cfg, parsed_cli_args)
-
-
 if __name__ == "__main__":
    load_dotenv()
    fire.Fire(do_cli)
--- a/src/axolotl/cli/shard.py
+++ b/src/axolotl/cli/shard.py
@@ -0,0 +1,45 @@
+"""
+CLI to shard a trained model into 10GiB chunks
+"""
+import logging
+from pathlib import Path
+from typing import Union
+
+import fire
+import transformers
+from dotenv import load_dotenv
+
+from axolotl.cli import load_cfg, print_axolotl_text_art
+from axolotl.common.cli import TrainerCliArgs, load_model_and_tokenizer
+from axolotl.utils.dict import DictDefault
+
+LOG = logging.getLogger("axolotl.cli.shard")
+
+
+def shard(
+    *,
+    cfg: DictDefault,
+    cli_args: TrainerCliArgs,
+):
+    model, _ = load_model_and_tokenizer(cfg=cfg, cli_args=cli_args)
+    safe_serialization = cfg.save_safetensors is True
+    LOG.debug("Re-saving model w/ sharding")
+    model.save_pretrained(cfg.output_dir, safe_serialization=safe_serialization)
+
+
+def do_cli(config: Union[Path, str] = Path("examples/"), **kwargs):
+    # pylint: disable=duplicate-code
+    print_axolotl_text_art()
+    parsed_cfg = load_cfg(config, **kwargs)
+    parser = transformers.HfArgumentParser(TrainerCliArgs)
+    parsed_cli_args, _ = parser.parse_args_into_dataclasses(
+        return_remaining_strings=True
+    )
+    parsed_cli_args.shard = True
+
+    shard(cfg=parsed_cfg, cli_args=parsed_cli_args)
+
+
+if __name__ == "__main__":
+    load_dotenv()
+    fire.Fire(do_cli)
--- a/src/axolotl/cli/train.py
+++ b/src/axolotl/cli/train.py
@@ -1,5 +1,6 @@
-"""CLI to run training on a model."""
-
+"""
+CLI to run training on a model
+"""
 import logging
 from pathlib import Path
 from typing import Union
@@ -8,38 +9,42 @@ import fire
 from dotenv import load_dotenv
 from transformers.hf_argparser import HfArgumentParser

-from axolotl.cli.args import TrainerCliArgs
-from axolotl.cli.art import print_axolotl_text_art
-from axolotl.cli.checks import check_accelerate_default_config, check_user_token
-from axolotl.cli.config import load_cfg
-from axolotl.common.datasets import load_datasets, load_preference_datasets
+from axolotl.cli import (
+    check_accelerate_default_config,
+    check_user_token,
+    load_cfg,
+    load_datasets,
+    load_rl_datasets,
+    print_axolotl_text_art,
+)
+from axolotl.common.cli import TrainerCliArgs
 from axolotl.integrations.base import PluginManager
 from axolotl.train import train
-from axolotl.utils.dict import DictDefault

-LOG = logging.getLogger(__name__)
+LOG = logging.getLogger("axolotl.cli.train")


-def do_train(cfg: DictDefault, cli_args: TrainerCliArgs) -> None:
-    """
-    Trains a `transformers` model by first loading the dataset(s) specified in the
-    `axolotl` config, and then calling `axolotl.train.train`. Also runs the plugin
-    manager's `post_train_unload` once training completes.
+def do_cli(config: Union[Path, str] = Path("examples/"), **kwargs):
+    # pylint: disable=duplicate-code
+    parsed_cfg = load_cfg(config, **kwargs)
+    parser = HfArgumentParser(TrainerCliArgs)
+    parsed_cli_args, _ = parser.parse_args_into_dataclasses(
+        return_remaining_strings=True
+    )
+    return do_train(parsed_cfg, parsed_cli_args)

-    Args:
-        cfg: Dictionary mapping `axolotl` config keys to values.
-        cli_args: Training-specific CLI arguments.
-    """
+
+def do_train(cfg, cli_args) -> None:
    print_axolotl_text_art()
    check_accelerate_default_config()
    check_user_token()

-    if cfg.rl:
-        dataset_meta = load_preference_datasets(cfg=cfg, cli_args=cli_args)
+    if cfg.rl:  # and cfg.rl != "orpo":
+        dataset_meta = load_rl_datasets(cfg=cfg, cli_args=cli_args)
    else:
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-    model, tokenizer = train(cfg=cfg, dataset_meta=dataset_meta)
+    model, tokenizer = train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
    plugin_manager = PluginManager.get_instance()

    del model
@@ -48,24 +53,6 @@ def do_train(cfg: DictDefault, cli_args: TrainerCliArgs) -> None:
    plugin_manager.post_train_unload(cfg)


-def do_cli(config: Union[Path, str] = Path("examples/"), **kwargs) -> None:
-    """
-    Parses `axolotl` config, CLI args, and calls `do_train`.
-
-    Args:
-        config: Path to `axolotl` config YAML file.
-        kwargs: Additional keyword arguments to override config file values.
-    """
-    # pylint: disable=duplicate-code
-    parsed_cfg = load_cfg(config, **kwargs)
-    parser = HfArgumentParser(TrainerCliArgs)
-    parsed_cli_args, _ = parser.parse_args_into_dataclasses(
-        return_remaining_strings=True
-    )
-
-    do_train(parsed_cfg, parsed_cli_args)
-
-
 if __name__ == "__main__":
    load_dotenv()
    fire.Fire(do_cli)
--- a/src/axolotl/cli/utils.py
+++ b/src/axolotl/cli/utils.py
@@ -1,84 +1,31 @@
-"""Utility methods for axolotl CLI."""
-
+"""Utility methods for axolotl CLI."""
 import concurrent.futures
 import dataclasses
 import hashlib
 import json
 import logging
-import typing
-from functools import wraps
 from pathlib import Path
 from types import NoneType
-from typing import Any, Callable, Type, Union, get_args, get_origin
+from typing import Any, Dict, List, Optional, Tuple, Type, Union, get_args, get_origin

 import click
 import requests
 from pydantic import BaseModel
-from transformers import PreTrainedModel, PreTrainedTokenizer, PreTrainedTokenizerFast

-from axolotl.logging_config import configure_logging
-from axolotl.utils.dict import DictDefault
-from axolotl.utils.models import load_model, load_tokenizer
-
-configure_logging()
-LOG = logging.getLogger(__name__)
+LOG = logging.getLogger("axolotl.cli.utils")


-def strip_optional_type(field_type: type | typing._SpecialForm | None):
-    """
-    Extracts the non-`None` type from an `Optional` / `Union` type.
+def add_options_from_dataclass(config_class: Type[Any]):
+    """Create Click options from the fields of a dataclass."""

-    Args:
-        field_type: Type of field for Axolotl CLI command.
-
-    Returns:
-        If the input type is `Union[T, None]` or `Optional[T]`, returns `T`. Otherwise
-            returns the input type unchanged.
-    """
-    if get_origin(field_type) is Union and type(None) in get_args(field_type):
-        field_type = next(
-            t for t in get_args(field_type) if not isinstance(t, NoneType)
-        )
-
-    return field_type
-
-
-def filter_none_kwargs(func: Callable) -> Callable:
-    """
-    Wraps function to remove `None`-valued `kwargs`.
-
-    Args:
-        func: Function to wrap.
-
-    Returns:
-        Wrapped function.
-    """
-
-    @wraps(func)
-    def wrapper(*args, **kwargs) -> Callable:
-        """Filters out `None`-valued `kwargs`."""
-        filtered_kwargs = {k: v for k, v in kwargs.items() if v is not None}
-
-        return func(*args, **filtered_kwargs)
-
-    return wrapper
-
-
-def add_options_from_dataclass(config_class: Type[Any]) -> Callable:
-    """
-    Create Click options from the fields of a dataclass.
-
-    Args:
-        config_class: Dataclass with fields to parse from the CLI.
-
-    Returns:
-        Function decorator for Axolotl CLI command.
-    """
-
-    def decorator(function: Callable) -> Callable:
+    def decorator(function):
        # Process dataclass fields in reverse order for correct option ordering
        for field in reversed(dataclasses.fields(config_class)):
-            field_type = strip_optional_type(field.type)
+            field_type = field.type
+            if get_origin(field_type) is Union and type(None) in get_args(field_type):
+                field_type = next(
+                    t for t in get_args(field_type) if t is not NoneType
+                )

            if field_type == bool:
                field_name = field.name.replace("_", "-")
@@ -102,22 +49,19 @@ def add_options_from_dataclass(config_class: Type[Any]) -> Callable:
    return decorator


-def add_options_from_config(config_class: Type[BaseModel]) -> Callable:
-    """
-    Create Click options from the fields of a Pydantic model.
+def add_options_from_config(config_class: Type[BaseModel]):
+    """Create Click options from the fields of a Pydantic model."""

-    Args:
-        config_class: PyDantic model with fields to parse from the CLI
-
-    Returns:
-        Function decorator for Axolotl CLI command.
-    """
-
-    def decorator(function: Callable) -> Callable:
+    def decorator(function):
        # Process model fields in reverse order for correct option ordering
        for name, field in reversed(config_class.model_fields.items()):
-            field_type = strip_optional_type(field.annotation)
+            field_type = field.annotation
+            if get_origin(field_type) is Union and type(None) in get_args(field_type):
+                field_type = next(
+                    t for t in get_args(field_type) if t is not type(None)
+                )

+            # NOTE: defaults are handled by the pydantic model config classes.
            if field_type == bool:
                field_name = name.replace("_", "-")
                option_name = f"--{field_name}/--no-{field_name}"
@@ -135,17 +79,8 @@ def add_options_from_config(config_class: Type[BaseModel]) -> Callable:
    return decorator


-def build_command(base_cmd: list[str], options: dict[str, Any]) -> list[str]:
-    """
-    Build command list from base command and options.
-
-    Args:
-        base_cmd: Command without options.
-        options: Options to parse and append to base command.
-
-    Returns:
-        List of strings giving shell command.
-    """
+def build_command(base_cmd: List[str], options: Dict[str, Any]) -> List[str]:
+    """Build command list from base command and options."""
    cmd = base_cmd.copy()

    for key, value in options.items():
@@ -157,6 +92,8 @@ def build_command(base_cmd: list[str], options: dict[str, Any]) -> list[str]:
        if isinstance(value, bool):
            if value:
                cmd.append(f"--{key}")
+            else:
+                cmd.append(f"--no-{key}")
        else:
            cmd.extend([f"--{key}", str(value)])

@@ -165,18 +102,18 @@ def build_command(base_cmd: list[str], options: dict[str, Any]) -> list[str]:

 def download_file(
    file_info: tuple, raw_base_url: str, dest_path: Path, dir_prefix: str
-) -> tuple[str, str]:
+) -> Tuple[str, str]:
    """
    Download a single file and return its processing status.

    Args:
-        file_info: Tuple of (file_path, remote_sha).
-        raw_base_url: Base URL for raw GitHub content.
-        dest_path: Local destination directory.
-        dir_prefix: Directory prefix to filter files.
+        file_info: Tuple of (file_path, remote_sha)
+        raw_base_url: Base URL for raw GitHub content
+        dest_path: Local destination directory
+        dir_prefix: Directory prefix to filter files

    Returns:
-        Tuple of (file_path, status) where status is 'new', 'updated', or 'unchanged'.
+        Tuple of (file_path, status) where status is 'new', 'updated', or 'unchanged'
    """
    file_path, remote_sha = file_info
    raw_url = f"{raw_base_url}/{file_path}"
@@ -218,17 +155,16 @@ def download_file(


 def fetch_from_github(
-    dir_prefix: str, dest_dir: str | None = None, max_workers: int = 5
+    dir_prefix: str, dest_dir: Optional[str] = None, max_workers: int = 5
 ) -> None:
    """
    Sync files from a specific directory in the GitHub repository.
    Only downloads files that don't exist locally or have changed.

    Args:
-        dir_prefix: Directory prefix to filter files (e.g., 'examples/',
-            'deepspeed_configs/').
-        dest_dir: Local destination directory.
-        max_workers: Maximum number of concurrent downloads.
+        dir_prefix: Directory prefix to filter files (e.g., 'examples/', 'deepspeed_configs/')
+        dest_dir: Local destination directory
+        max_workers: Maximum number of concurrent downloads
    """
    api_url = "https://api.github.com/repos/axolotl-ai-cloud/axolotl/git/trees/main?recursive=1"
    raw_base_url = "https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main"
@@ -253,7 +189,7 @@ def fetch_from_github(
    dest_path = Path(dest_dir) if dest_dir else default_dest

    # Keep track of processed files for summary
-    files_processed: dict[str, list[str]] = {
+    files_processed: Dict[str, List[str]] = {
        "new": [],
        "updated": [],
        "unchanged": [],
@@ -290,28 +226,3 @@ def fetch_from_github(
    LOG.info(f"Unchanged files: {len(files_processed['unchanged'])}")
    if files_processed["error"]:
        LOG.info(f"Failed files: {len(files_processed['error'])}")
-
-
-def load_model_and_tokenizer(
-    *,
-    cfg: DictDefault,
-    inference: bool = False,
-) -> tuple[PreTrainedModel, PreTrainedTokenizer | PreTrainedTokenizerFast | Any]:
-    """
-    Helper function for loading a model and tokenizer specified in the given `axolotl`
-    config.
-
-    Args:
-        cfg: Dictionary mapping `axolotl` config keys to values.
-        inference: Boolean denoting inference mode.
-
-    Returns:
-        `transformers` model and tokenizer.
-    """
-    LOG.info(f"loading tokenizer... {cfg.tokenizer_config or cfg.base_model_config}")
-    tokenizer = load_tokenizer(cfg)
-
-    LOG.info("loading model...")
-    model, _ = load_model(cfg, tokenizer, inference=inference)
-
-    return model, tokenizer
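
The `Optional[...]` unwrapping that the option decorators in this file apply to field annotations can be sketched in isolation. This is a minimal sketch, not the code from this changeset: `unwrap_optional` is a hypothetical helper name, and it filters out `None` by identity against `type(None)`. It only covers `typing.Union`; annotations spelled `X | None` may report a different origin on some Python versions.

```python
from typing import Optional, Union, get_args, get_origin


def unwrap_optional(annotation):
    """Return the first non-None member of Optional[X] / Union[X, None].

    Any other annotation is returned unchanged. Hypothetical helper for
    illustration only.
    """
    if get_origin(annotation) is Union and type(None) in get_args(annotation):
        # get_args yields types, so compare by identity rather than isinstance
        return next(t for t in get_args(annotation) if t is not type(None))
    return annotation


if __name__ == "__main__":
    print(unwrap_optional(Optional[int]))
    print(unwrap_optional(Union[None, str]))
    print(unwrap_optional(bool))
```

Comparing by identity matters when `None` is listed first in the union: an `isinstance(t, NoneType)` check is false for every member, so `next(...)` would return `NoneType` itself.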
--- a/src/axolotl/common/cli.py
+++ b/src/axolotl/common/cli.py
@@ -0,0 +1,74 @@
+"""
+shared module for cli specific things
+"""
+
+import logging
+from dataclasses import dataclass, field
+from typing import Optional, Union
+
+import axolotl.monkeypatch.data.batch_dataset_fetcher  # pylint: disable=unused-import  # noqa: F401
+from axolotl.logging_config import configure_logging
+from axolotl.utils.dict import DictDefault
+from axolotl.utils.models import load_model, load_tokenizer
+
+configure_logging()
+LOG = logging.getLogger(__name__)
+
+
+@dataclass
+class PreprocessCliArgs:
+    """dataclass with arguments for preprocessing only"""
+
+    debug: bool = field(default=False)
+    debug_text_only: bool = field(default=False)
+    debug_num_examples: int = field(default=1)
+    prompter: Optional[str] = field(default=None)
+    download: Optional[bool] = field(default=True)
+
+
+@dataclass
+class TrainerCliArgs:
+    """dataclass with various non-training arguments"""
+
+    debug: bool = field(default=False)
+    debug_text_only: bool = field(default=False)
+    debug_num_examples: int = field(default=0)
+    inference: bool = field(default=False)
+    merge_lora: bool = field(default=False)
+    prompter: Optional[str] = field(default=None)
+    shard: bool = field(default=False)
+
+
+@dataclass
+class EvaluateCliArgs:
+    """dataclass with various evaluation arguments"""
+
+    debug: bool = field(default=False)
+    debug_text_only: bool = field(default=False)
+    debug_num_examples: int = field(default=0)
+
+
+@dataclass
+class ConvertDiffTransformerCliArgs:
+    """dataclass with arguments for convert-diff-transformer CLI"""
+
+    debug: bool = field(default=False)
+    zero_init: bool = field(default=False)
+    sublayer_norm: bool = field(default=True)
+    split_heads: bool = field(default=False)
+    mirror_weights: bool = field(default=False)
+
+
+def load_model_and_tokenizer(
+    *,
+    cfg: DictDefault,
+    cli_args: Union[TrainerCliArgs, EvaluateCliArgs, ConvertDiffTransformerCliArgs],
+):
+    LOG.info(f"loading tokenizer... {cfg.tokenizer_config or cfg.base_model_config}")
+    tokenizer = load_tokenizer(cfg)
+
+    LOG.info("loading model and (optionally) peft_config...")
+    inference = getattr(cli_args, "inference", False)
+    model, _ = load_model(cfg, tokenizer, inference=inference)
+
+    return model, tokenizer
--- a/src/axolotl/common/datasets.py
+++ b/src/axolotl/common/datasets.py
@@ -1,140 +0,0 @@
-"""Dataset loading utilities."""
-
-import logging
-import math
-import random
-from dataclasses import dataclass
-from typing import Optional, Union
-
-from datasets import Dataset
-
-import axolotl.monkeypatch.data.batch_dataset_fetcher  # pylint: disable=unused-import  # noqa: F401
-from axolotl.cli.args import PreprocessCliArgs, TrainerCliArgs
-from axolotl.utils.data import prepare_dataset
-from axolotl.utils.data.rl import load_prepare_preference_datasets
-from axolotl.utils.dict import DictDefault
-from axolotl.utils.models import load_processor, load_tokenizer
-from axolotl.utils.tokenization import check_dataset_labels
-
-LOG = logging.getLogger(__name__)
-
-
-@dataclass
-class TrainDatasetMeta:
-    """Dataclass with fields for training and validation datasets and metadata."""
-
-    train_dataset: Dataset
-    eval_dataset: Optional[Dataset] = None
-    total_num_steps: Optional[int] = None
-
-
-def sample_dataset(dataset: Dataset, num_samples: int) -> Dataset:
-    """
-    Randomly sample `num_samples` samples from `dataset`.
-
-    Args:
-        dataset: Dataset.
-        num_samples: Number of samples to return.
-
-    Returns:
-        Random sample (with replacement) of examples in `dataset`.
-    """
-    return dataset.select(
-        [random.randrange(0, len(dataset) - 1) for _ in range(num_samples)]  # nosec
-    )
-
-
-def load_datasets(
-    *,
-    cfg: DictDefault,
-    cli_args: Union[PreprocessCliArgs, TrainerCliArgs],
-) -> TrainDatasetMeta:
-    """
-    Loads one or more training or evaluation datasets, calling
-    `axolotl.utils.data.prepare_dataset`. Optionally, logs out debug information.
-
-    Args:
-        cfg: Dictionary mapping `axolotl` config keys to values.
-        cli_args: Command-specific CLI arguments.
-
-    Returns:
-        Dataclass with fields for training and evaluation datasets and the computed
-        `total_num_steps`.
-    """
-    tokenizer = load_tokenizer(cfg)
-    processor = load_processor(cfg, tokenizer=tokenizer) if cfg.processor_type else None
-
-    train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
-        cfg,
-        tokenizer,
-        processor=processor,
-    )
-
-    if (
-        cli_args.debug
-        or cfg.debug
-        or cli_args.debug_text_only
-        or int(cli_args.debug_num_examples) > 0
-    ):
-        LOG.info("check_dataset_labels...")
-
-        train_samples = sample_dataset(train_dataset, cli_args.debug_num_examples)
-        check_dataset_labels(
-            train_samples,
-            tokenizer,
-            num_examples=cli_args.debug_num_examples,
-            text_only=cli_args.debug_text_only,
-        )
-
-        LOG.info("printing prompters...")
-        for prompter in prompters:
-            LOG.info(prompter)
-
-    return TrainDatasetMeta(
-        train_dataset=train_dataset,
-        eval_dataset=eval_dataset,
-        total_num_steps=total_num_steps,
-    )
-
-
-def load_preference_datasets(
-    *,
-    cfg: DictDefault,
-    cli_args: Union[PreprocessCliArgs, TrainerCliArgs],
-) -> TrainDatasetMeta:
-    """
-    Loads one or more training or evaluation datasets for RL training using paired
-    preference data, calling `axolotl.utils.data.rl.load_prepare_preference_datasets`.
-    Optionally, logs out debug information.
-
-    Args:
-        cfg: Dictionary mapping `axolotl` config keys to values.
-        cli_args: Command-specific CLI arguments.
-
-    Returns:
-        Dataclass with fields for training and evaluation datasets and the computed
-        `total_num_steps`.
-    """
-    train_dataset, eval_dataset = load_prepare_preference_datasets(cfg)
-    total_num_steps = int(
-        math.ceil(len(train_dataset) * cfg.num_epochs / cfg.batch_size)
-    )
-
-    if cli_args.debug or cfg.debug:
-        LOG.info("check_dataset_labels...")
-
-        tokenizer = load_tokenizer(cfg)
-        train_samples = sample_dataset(train_dataset, cli_args.debug_num_examples)
-        check_dataset_labels(
-            train_samples,
-            tokenizer,
-            num_examples=cli_args.debug_num_examples,
-            text_only=cli_args.debug_text_only,
-            rl_mode=True,
-        )
-
-    return TrainDatasetMeta(
-        train_dataset=train_dataset,
-        eval_dataset=eval_dataset,
-        total_num_steps=total_num_steps,
-    )
--- a/src/axolotl/core/trainer_builder.py
+++ b/src/axolotl/core/trainer_builder.py
@@ -243,10 +243,6 @@ class AxolotlTrainingMixins:
        default=None,
        metadata={"help": "Scale the learning rate for the embedding layers."},
    )
-    lr_groups: Optional[list[dict]] = field(
-        default=None,
-        metadata={"help": "Specify learning rate groups for with different LRs."},
-    )
    embedding_lr: Optional[float] = field(
        default=None,
        metadata={"help": "absolute learning rate for the embedding layers."},
@@ -297,7 +293,7 @@ class AxolotlTrainingArguments(AxolotlTrainingMixins, TrainingArguments):
    """
    Training arguments for Causal trainer

-    This code is duplicated due to HF TrainingArguments not setting output_dir with a defaujlt value
+    This code is duplicated due to HF TrainingArguments not setting output_dir with a default value
    so it can't be used as a mixin.
    """

@@ -465,95 +461,11 @@ class AxolotlTrainer(SchedulerMixin, Trainer):
            )
        return super()._wrap_model(model, training=training, dataloader=dataloader)

-    def create_optimizer_grouped_parameters(self, opt_model, optimizer_kwargs):
-        decay_parameters = self.get_decay_parameter_names(opt_model)
-        params = {
-            "to_weight_decay": {},  # LayerNorm and bias
-            "embeddings": {},  # lm_head, embed_tokens,
-            "no_weight_decay": {},
-        }
-        lr_groups_lookup = {}
-        lr_groups_learning_rates = {}
-        if self.args.lr_groups:
-            for lr_group in self.args.lr_groups:
-                group_name = lr_group["name"]
-                group_modules = lr_group["modules"]
-                for module in group_modules:
-                    lr_groups_lookup[module] = group_name
-                lr_groups_learning_rates[group_name] = lr_group["lr"]
-                params[f"to_weight_decay_{group_name}"] = {}
-
-        for name, param in opt_model.named_parameters():
-            if not param.requires_grad:
-                continue
-            if name.endswith("modules_to_save.default.weight") or any(
-                embed_name in name for embed_name in ["embed_tokens", "lm_head"]
-            ):
-                params["embeddings"][name] = param
-            elif name in decay_parameters:
-                lr_group_modules = [
-                    group_modules
-                    for group_modules in lr_groups_lookup
-                    if group_modules in name
-                ]
-                if lr_groups_lookup and any(lr_group_modules):
-                    lr_group_module = lr_group_modules[0]
-                    group_name = lr_groups_lookup[lr_group_module]
-                    params[f"to_weight_decay_{group_name}"][name] = param
-                else:
-                    params["to_weight_decay"][name] = param
-            else:
-                params["no_weight_decay"][name] = param
-        optimizer_grouped_parameters = []
-        if params["to_weight_decay"]:
-            optimizer_grouped_parameters.append(
-                {
-                    "params": list(params["to_weight_decay"].values()),
-                    "weight_decay": self.args.weight_decay,
-                    "lr": optimizer_kwargs["lr"],
-                }
-            )
-        if params["embeddings"]:
-            lr = optimizer_kwargs["lr"]  # pylint: disable=invalid-name
-            if self.args.embedding_lr_scale:
-                lr *= self.args.embedding_lr_scale  # pylint: disable=invalid-name
-            elif self.args.embedding_lr:
-                lr = self.args.embedding_lr  # pylint: disable=invalid-name
-            optimizer_grouped_parameters.append(
-                {
-                    "params": list(params["embeddings"].values()),
-                    "weight_decay": 0.0,
-                    "lr": lr,
-                }
-            )
-        if params["no_weight_decay"]:
-            optimizer_grouped_parameters.append(
-                {
-                    "params": list(params["no_weight_decay"].values()),
-                    "weight_decay": 0.0,
-                    "lr": optimizer_kwargs["lr"],
-                }
-            )
-        for group_name, group_lr in lr_groups_learning_rates.items():
-            if params[f"to_weight_decay_{group_name}"]:
-                optimizer_grouped_parameters.append(
-                    {
-                        "params": list(
-                            params[f"to_weight_decay_{group_name}"].values()
-                        ),
-                        "weight_decay": self.args.weight_decay,
-                        "lr": group_lr,
-                    }
-                )
-
-        return optimizer_grouped_parameters
-
    def create_optimizer(self):
        if (
            self.args.loraplus_lr_ratio is None
            and self.args.embedding_lr_scale is None
            and self.args.embedding_lr is None
-            and self.args.lr_groups is None
            and self.args.alternate_optimizer
            not in [
                "optimi_adamw",
@@ -567,13 +479,59 @@ class AxolotlTrainer(SchedulerMixin, Trainer):

        opt_model = self.model_wrapped if is_sagemaker_mp_enabled() else self.model
        if self.optimizer is None:  # pylint: disable=access-member-before-definition
+            decay_parameters = self.get_decay_parameter_names(opt_model)
+            params = {
+                "to_weight_decay": {},  # LayerNorm except bias
+                "embeddings": {},  # lm_head, embed_tokens,
+                "no_weight_decay": {},
+            }
+
            optimizer_cls, optimizer_kwargs = Trainer.get_optimizer_cls_and_kwargs(
                self.args,
                opt_model,
            )
-            optimizer_grouped_parameters = self.create_optimizer_grouped_parameters(
-                opt_model, optimizer_kwargs
-            )
+
+            for name, param in opt_model.named_parameters():
+                if not param.requires_grad:
+                    continue
+                if name.endswith("modules_to_save.default.weight") or any(
+                    embed_name in name for embed_name in ["embed_tokens", "lm_head"]
+                ):
+                    params["embeddings"][name] = param
+                elif name in decay_parameters:
+                    params["to_weight_decay"][name] = param
+                else:
+                    params["no_weight_decay"][name] = param
+            optimizer_grouped_parameters = []
+            if params["to_weight_decay"]:
+                optimizer_grouped_parameters.append(
+                    {
+                        "params": list(params["to_weight_decay"].values()),
+                        "weight_decay": self.args.weight_decay,
+                        "lr": optimizer_kwargs["lr"],
+                    }
+                )
+            if params["embeddings"]:
+                lr = optimizer_kwargs["lr"]  # pylint: disable=invalid-name
+                if self.args.embedding_lr_scale:
+                    lr *= self.args.embedding_lr_scale  # pylint: disable=invalid-name
+                elif self.args.embedding_lr:
+                    lr = self.args.embedding_lr  # pylint: disable=invalid-name
+                optimizer_grouped_parameters.append(
+                    {
+                        "params": list(params["embeddings"].values()),
+                        "weight_decay": 0.0,
+                        "lr": lr,
+                    }
+                )
+            if params["no_weight_decay"]:
+                optimizer_grouped_parameters.append(
+                    {
+                        "params": list(params["no_weight_decay"].values()),
+                        "weight_decay": 0.0,
+                        "lr": optimizer_kwargs["lr"],
+                    }
+                )

            if self.args.loraplus_lr_ratio is not None:
                loraplus_lr_ratio = getattr(self.args, "loraplus_lr_ratio", None)
@@ -590,7 +548,6 @@ class AxolotlTrainer(SchedulerMixin, Trainer):
            elif (
                self.args.embedding_lr_scale is not None
                or self.args.embedding_lr is not None
-                or self.args.lr_groups is not None
            ):
                self.optimizer = (  # pylint: disable=attribute-defined-outside-init
                    optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)
@@ -1122,7 +1079,6 @@ class AxolotlDPOTrainer(SchedulerMixin, DPOTrainer):
        super().__init__(*args, **kwargs)
        self.dataset_tags = dataset_tags
        self.optimizer = None
-        self.model_accepts_loss_kwargs = False

    def create_optimizer(self):
        if self.args.loraplus_lr_ratio is None:
@@ -1708,7 +1664,6 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
        ] = self.cfg.loraplus_lr_embedding
        training_arguments_kwargs["embedding_lr"] = self.cfg.embedding_lr
        training_arguments_kwargs["embedding_lr_scale"] = self.cfg.embedding_lr_scale
-        training_arguments_kwargs["lr_groups"] = self.cfg.lr_groups

        if self.cfg.lr_scheduler in ["one_cycle", "log_sweep"]:
            training_arguments_kwargs["lr_scheduler_type"] = "cosine"
@@ -1924,8 +1879,6 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
        if training_args.pretraining:
            if self.cfg.pretraining_sample_concatenation is False:
                return DataCollatorForSeq2Seq(self.tokenizer, **kwargs)
-            if self.cfg.micro_batch_size > 1:
-                return DataCollatorForSeq2Seq(self.tokenizer, **kwargs)
            return None

        if self.cfg.model_config_type == "mamba":
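
The parameter bucketing that `create_optimizer` now does inline can be sketched without a model, working on parameter names only. This is a simplified sketch under stated assumptions: `group_parameter_names` and `resolve_embedding_lr` are hypothetical names, and the `requires_grad` filter from the real loop is omitted.

```python
def group_parameter_names(names, decay_names, embedding_markers=("embed_tokens", "lm_head")):
    """Bucket parameter names the same way the trainer buckets parameters."""
    groups = {"to_weight_decay": [], "embeddings": [], "no_weight_decay": []}
    for name in names:
        is_embedding = name.endswith("modules_to_save.default.weight") or any(
            marker in name for marker in embedding_markers
        )
        if is_embedding:
            groups["embeddings"].append(name)
        elif name in decay_names:
            groups["to_weight_decay"].append(name)
        else:
            groups["no_weight_decay"].append(name)
    return groups


def resolve_embedding_lr(base_lr, scale=None, absolute=None):
    """A scale factor takes precedence over an absolute embedding LR."""
    if scale:
        return base_lr * scale
    if absolute:
        return absolute
    return base_lr


if __name__ == "__main__":
    names = [
        "model.embed_tokens.weight",
        "model.layers.0.self_attn.q_proj.weight",
        "model.layers.0.input_layernorm.weight",
        "lm_head.weight",
    ]
    decay = {"model.layers.0.self_attn.q_proj.weight"}
    print(group_parameter_names(names, decay))
    print(resolve_embedding_lr(2e-5, scale=0.5))
```

Each non-empty bucket then becomes one optimizer parameter group with its own `weight_decay` and `lr`.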
--- a/src/axolotl/evaluate.py
+++ b/src/axolotl/evaluate.py
@@ -9,11 +9,11 @@ from typing import Dict, Optional
 import torch
 from accelerate.logging import get_logger

+from axolotl.common.cli import EvaluateCliArgs, load_model_and_tokenizer
 from axolotl.logging_config import configure_logging
 from axolotl.train import TrainDatasetMeta
-from axolotl.utils import set_pytorch_cuda_alloc_conf
 from axolotl.utils.dict import DictDefault
-from axolotl.utils.models import load_model, load_processor, load_tokenizer
+from axolotl.utils.models import load_processor
 from axolotl.utils.trainer import setup_trainer

 project_root = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
@@ -61,13 +61,17 @@ def evaluate_dataset(
    return metrics


-def evaluate(*, cfg: DictDefault, dataset_meta: TrainDatasetMeta) -> Dict[str, float]:
+# pylint: disable=duplicate-code
+def evaluate(
+    *, cfg: DictDefault, cli_args: EvaluateCliArgs, dataset_meta: TrainDatasetMeta
+) -> Dict[str, float]:
    """
    Evaluate a model on training and validation datasets

    Args:
-        cfg: Dictionary mapping `axolotl` config keys to values.
-        dataset_meta: Dataset metadata containing training and evaluation datasets.
+        cfg: Configuration dictionary
+        cli_args: Command line arguments
+        dataset_meta: Dataset metadata containing training and evaluation datasets

    Returns:
        Tuple containing:
@@ -75,16 +79,11 @@ def evaluate(*, cfg: DictDefault, dataset_meta: TrainDatasetMeta) -> Dict[str, f
        - The tokenizer
        - Dictionary of evaluation metrics
    """
-    # pylint: disable=duplicate-code
-    # Enable expandable segments for cuda allocation to improve VRAM usage
-    set_pytorch_cuda_alloc_conf()
+    # Load model
+    LOG.debug("loading model for evaluation...")

-    # Load tokenizer
-    LOG.debug(
-        f"loading tokenizer... {cfg.tokenizer_config or cfg.base_model_config}",
-        main_process_only=True,
-    )
-    tokenizer = load_tokenizer(cfg)
+    model, tokenizer = load_model_and_tokenizer(cfg=cfg, cli_args=cli_args)
+    model = model.to(cfg.device, dtype=cfg.torch_dtype)

    # Load processor for multimodal models if needed
    processor = None
@@ -96,10 +95,6 @@ def evaluate(*, cfg: DictDefault, dataset_meta: TrainDatasetMeta) -> Dict[str, f
    eval_dataset = dataset_meta.eval_dataset
    total_num_steps = dataset_meta.total_num_steps

-    # Load model
-    LOG.debug("loading model for evaluation...")
-    model, _ = load_model(cfg, tokenizer, processor=processor)
-
    # Set up trainer
    trainer = setup_trainer(
        cfg,
--- a/src/axolotl/integrations/base.py
+++ b/src/axolotl/integrations/base.py
@@ -50,10 +50,10 @@ class BasePlugin:

    def register(self):  # pylint: disable=unused-argument
        """
-        Registers the plugin
+        Registers the plugin.

        Parameters:
-        cfg (dict): The configuration for the plugin.
+        None

        Returns:
        None
@@ -75,6 +75,19 @@ class BasePlugin:
        None
        """

+    def set_attn_config(
+        self, cfg, model_kwargs, model_config
+    ):  # pylint: disable=unused-argument
+        """
+        Sets attention configuration for the model.
+        Parameters:
+        cfg (dict): The configuration for the plugin.
+        model_kwargs (dict): The model kwargs for the plugin.
+        model_config (object): The model configuration.
+        Returns:
+        None
+        """
+
    def post_model_load(self, cfg, model):  # pylint: disable=unused-argument
        """
        Performs actions after the model is loaded.
@@ -305,6 +318,17 @@ class PluginManager:
        for plugin in self.plugins.values():
            plugin.pre_model_load(cfg)

+    def set_attn_config(self, cfg, model_kwargs, model_config):
+        """
+        modifies the attention configuration of the model kwargs for loading
+        Parameters:
+            cfg (dict): The configuration for the plugins.
+            model_kwargs (dict): The kwargs used to construct the model.
+            model_config (dict): The model's configuration.
+        """
+        for plugin in self.plugins.values():
+            plugin.set_attn_config(cfg, model_kwargs, model_config)
+
    def post_model_load(self, cfg, model):
        """
        Calls the post_model_load method of all registered plugins.
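
How the new `set_attn_config` hook fans out across plugins can be sketched with stand-in classes. Everything here is hypothetical and for illustration only: `SketchPlugin`, `ForceEagerAttention`, `SketchManager`, and the `force_eager` config key are not part of this changeset, and `attn_implementation` is used only as an example of a model kwarg a plugin might set.

```python
class SketchPlugin:
    """Stand-in for a plugin base class; the hook is a no-op by default."""

    def set_attn_config(self, cfg, model_kwargs, model_config):
        return None


class ForceEagerAttention(SketchPlugin):
    """Hypothetical plugin that overrides one model-loading kwarg."""

    def set_attn_config(self, cfg, model_kwargs, model_config):
        if cfg.get("force_eager"):
            # Mutates the shared kwargs dict in place; nothing is returned
            model_kwargs["attn_implementation"] = "eager"


class SketchManager:
    """Stand-in for a plugin manager that calls the hook on every plugin."""

    def __init__(self):
        self.plugins = {}

    def register(self, name, plugin):
        self.plugins[name] = plugin

    def set_attn_config(self, cfg, model_kwargs, model_config):
        for plugin in self.plugins.values():
            plugin.set_attn_config(cfg, model_kwargs, model_config)


if __name__ == "__main__":
    manager = SketchManager()
    manager.register("noop", SketchPlugin())
    manager.register("eager", ForceEagerAttention())
    kwargs = {}
    manager.set_attn_config({"force_eager": True}, kwargs, None)
    print(kwargs)
```

Because the hook returns `None`, plugins communicate only by mutating `model_kwargs` (or `model_config`), so registration order decides which plugin wins when two write the same key.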
--- a/src/axolotl/integrations/config.py
+++ b/src/axolotl/integrations/config.py
@@ -43,10 +43,12 @@ def merge_input_args():
    input_args: List[str] = plugin_manager.get_input_args()
    plugin_classes = []
    dynamic_input = ""
+
    for plugin_args in input_args:
        plugin_module, plugin_cls = plugin_args.rsplit(".", 1)
        dynamic_input += f"from {plugin_module} import {plugin_cls}\n"
        plugin_classes.append(plugin_cls)
+
    if dynamic_input:
        dynamic_input += f"class AxolotlConfigWCapabilities(AxolotlConfigWCapabilitiesBase, {', '.join(plugin_classes)}):\n    pass\n"
        dynamic_input += f"class AxolotlInputConfig(AxolotlInputConfigBase, {', '.join(plugin_classes)}):\n    pass\n"
@@ -62,4 +64,5 @@ def merge_input_args():
            "AxolotlConfigWCapabilities"
        ]
        return AxolotlConfigWCapabilities, AxolotlInputConfig
+
    return AxolotlConfigWCapabilitiesBase, AxolotlInputConfigBase
--- a/src/axolotl/integrations/diff_transformer/README.md
+++ b/src/axolotl/integrations/diff_transformer/README.md
@@ -0,0 +1,12 @@
+# Differential Transformer
+
+### Usage
+
+**Note:** The following will be set in the model config output by the `axolotl convert-diff-transformer` command.
+
+```yaml
+plugins:
+  - axolotl.integrations.diff_transformer.DifferentialTransformerPlugin
+
+diff_attention: true
+```
--- a/src/axolotl/integrations/diff_transformer/__init__.py
+++ b/src/axolotl/integrations/diff_transformer/__init__.py
@@ -0,0 +1,67 @@
+"""Definition of differential transformer plugin."""
+
+import logging
+from typing import List
+
+from transformers import PreTrainedModel, TrainerCallback
+
+from axolotl.integrations.base import BasePlugin
+from axolotl.utils.callbacks.diff_attn import (
+    DifferentialAttentionMixingCallback,
+    DifferentialAttentionMonitorCallback,
+)
+from axolotl.utils.dict import DictDefault
+
+LOG = logging.getLogger(__name__)
+
+
+class DifferentialTransformerPlugin(BasePlugin):
+    """Plugin for differential transformer integration with Axolotl."""
+
+    def __init__(self) -> None:
+        """
+        Constructor for differential transformers plugin. Calls `register_diff_attn`
+        to register differential attention custom modeling implementation to `AutoConfig`
+        and `AutoModel`.
+        """
+        from .modeling_diff_attn import register_diff_attn
+
+        register_diff_attn()
+
+    def get_input_args(self) -> str:
+        """Returns module path to diff transformer plugin args for `axolotl` config."""
+        return "axolotl.integrations.diff_transformer.args.DifferentialTransformerArgs"
+
+    # pylint: disable=unused-argument
+    def add_callbacks_pre_trainer(
+        self, cfg: DictDefault, model: PreTrainedModel
+    ) -> List[TrainerCallback]:
+        """
+        Returns `DifferentialAttentionMonitorCallback` if wandb usage is enabled and
+        `DifferentialAttentionMixingCallback` if `diff_attn_warmup_steps` is set.
+
+        Parameters:
+            cfg: Dictionary mapping `axolotl` config keys to values.
+            model: The loaded model.
+
+        Returns:
+            A (possibly empty) list of instantiated callbacks.
+        """
+        callbacks = []
+        if cfg.use_wandb:
+            callbacks.append(
+                DifferentialAttentionMonitorCallback(
+                    log_every=cfg.diff_attn_log_every,
+                    num_monitor_layers=cfg.diff_attn_num_monitor_layers,
+                    warmup_steps=cfg.diff_attn_warmup_steps,
+                )
+            )
+
+        if cfg.diff_attn_warmup_steps:
+            callbacks.append(
+                DifferentialAttentionMixingCallback(
+                    warmup_steps=cfg.diff_attn_warmup_steps
+                )
+            )
+
+        return callbacks
--- a/src/axolotl/integrations/diff_transformer/args.py
+++ b/src/axolotl/integrations/diff_transformer/args.py
@@ -0,0 +1,27 @@
+"""Module for handling differential transformer input arguments."""
+
+import logging
+from typing import Optional
+
+from pydantic import BaseModel
+
+LOG = logging.getLogger(__name__)
+
+
+class DifferentialTransformerArgs(BaseModel):
+    """
+    Input args for differential transformer.
+
+    Attributes:
+        diff_attention: Whether to use differential attention layers.
+        diff_attn_log_every: How often to log differential attention statistics.
+        diff_attn_num_monitor_layers: Number of layers to monitor for attention stats.
+        diff_attn_warmup_steps: Number of steps to linearly increase negative attention
+            mixing weight from 0 to 1. If specified, will reach full mixing at this
+            step. If `None`, negative attention has full weight from the start.
+    """
+
+    diff_attention: Optional[bool] = None
+    diff_attn_log_every: Optional[int] = 100
+    diff_attn_num_monitor_layers: Optional[int] = 3
+    diff_attn_warmup_steps: Optional[int] = None
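
The `diff_attn_warmup_steps` docstring describes a linear ramp of the negative attention mixing weight from 0 to 1. A minimal sketch of that schedule follows; `negative_attention_weight` is a hypothetical function, and how the mixing callback actually computes the weight is not shown in this section.

```python
from typing import Optional


def negative_attention_weight(step: int, warmup_steps: Optional[int]) -> float:
    """Linear ramp from 0 to 1 over `warmup_steps`; full weight if unset."""
    if not warmup_steps:
        # No warmup configured: negative attention has full weight from the start
        return 1.0
    # Clamp so the weight stays in [0, 1] before and after the ramp
    return min(1.0, max(0.0, step / warmup_steps))


if __name__ == "__main__":
    for step in (0, 5, 10, 25):
        print(step, negative_attention_weight(step, warmup_steps=10))
```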
--- a/src/axolotl/integrations/diff_transformer/diff_attn.py
+++ b/src/axolotl/integrations/diff_transformer/diff_attn.py
@@ -0,0 +1,694 @@
+"""Re-implementation of differential attention from the Differential Transformer paper
+(https://arxiv.org/abs/2410.05258)."""
+# pylint: disable=invalid-name
+
+import logging
+import math
+from typing import Any
+
+import torch
+import torch.nn.functional as F
+from torch import nn
+from transformers.cache_utils import Cache
+from transformers.models.llama.modeling_llama import (
+    LlamaRMSNorm,
+    LlamaRotaryEmbedding,
+    apply_rotary_pos_emb,
+)
+
+logging.basicConfig(level=logging.INFO)
+LOG = logging.getLogger(__name__)
+
+try:
+    from flash_attn.flash_attn_interface import flash_attn_func
+
+    FLASH_ATTENTION_AVAILABLE = True
+except ImportError:
+    FLASH_ATTENTION_AVAILABLE = False
+
+
+def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
+    """
+    Repeats key/value heads to match the number of query heads in multi-head attention.
+
+    Args:
+        x: Input tensor of shape `(batch_size, num_kv_heads, seq_len, head_dim)`.
+        n_rep: Number of times to repeat each head.
+
+    Returns:
+        Tensor with repeated heads of shape `(batch_size, num_kv_heads * n_rep,
+            seq_len, head_dim)`.
+        If `n_rep` is 1, returns the input tensor unchanged.
+    """
+    batch_size, n_kv_heads, slen, head_dim = x.shape
+    if n_rep == 1:
+        return x
+    return (
+        x[:, :, None, :, :]
+        .expand(batch_size, n_kv_heads, n_rep, slen, head_dim)
+        .reshape(batch_size, n_kv_heads * n_rep, slen, head_dim)
+    )
+
+
+def lambda_init_fn(depth: int) -> float:
+    """
+    Lambda mixing parameter init function from the "Differential Transformer" paper.
+
+    Args:
+        depth: Index of layer to init lambda parameter.
+
+    Returns:
+        Lambda initialization value (increasing with `depth`, from 0.2 toward 0.8).
+    """
+    return 0.8 - 0.6 * math.exp(-0.3 * depth)
+
+
+class LlamaDifferentialAttentionBase(nn.Module):
+    """
+    Base class for differential attention implementations.
+
+    This class implements the core differential attention mechanism used in Llama models.
+    It supports both split heads and double projection modes for attention computation.
+    """
+
+    def __init__(self, config: Any, layer_idx: int):
+        """
+        Initializes the differential attention module.
+
+        Args:
+            config: Model configuration object containing hyperparameters, including:
+                - hidden_size: The size of hidden states.
+                - num_attention_heads: Number of attention heads.
+                - num_key_value_heads: Number of key/value heads.
+                - attention_bias: Whether to use bias in attention projections.
+                - split_heads: Whether to use split heads mode.
+                - rms_norm_eps: Epsilon for RMS normalization.
+            layer_idx: The index of this layer in the model.
+
+        Note:
+            The initialization process consists of four steps:
+            1. Configuration initialization (`_init_config`)
+            2. Projection layers initialization (`_init_projections`)
+            3. Differential parameters initialization (`_init_differential_params`)
+            4. Normalization layers initialization (`_init_normalization`)
+        """
+        super().__init__()
+
+        self.config = config
+        self._init_config(layer_idx)
+        self._init_projections()
+        self._init_differential_params()
+        self._init_normalization()
+
+        # For logging
+        self.attn1 = None
+        self.attn2 = None
+        self.lambda_full = None
+
+    def _init_config(self, layer_idx: int) -> None:
+        """
+        Initializes configuration parameters for the attention layer. Sets up various
+        dimension sizes and head counts based on the provided config. Handles both
+        split heads and double projection modes.
+
+        In split heads mode, the number of heads is divided by 2 (rounding down), which
+        differs from the original implementation that required an even number.
+
+        Args:
+            layer_idx: Index of the current layer.
+        """
+        self.head_dim = self.config.hidden_size // self.config.num_attention_heads
+        self.base_num_heads = self.config.num_attention_heads
+        self.base_num_kv_heads = self.config.num_key_value_heads
+        self.num_key_value_groups = self.base_num_heads // self.base_num_kv_heads
+        self.layer_idx = layer_idx
+
+        if self.config.split_heads:
+            self.heads_per_component = self.base_num_heads // 2
+            self.kv_heads_per_component = self.base_num_kv_heads // 2
+            self.value_head_dim = 2 * self.head_dim
+        else:
+            self.heads_per_component = self.base_num_heads
+            self.kv_heads_per_component = self.base_num_kv_heads
+            self.value_head_dim = self.head_dim
+
+    def _init_projections(self) -> None:
+        """
+        Initializes the query, key, value, and output projection layers.
+
+        Creates linear transformations for Q, K, V projections with dimensions
+        depending on whether split heads or double projection mode is used.
+        The output projection combines the attention heads back to model dimension.
+        """
+        if self.config.split_heads:
+            q_out_dim = self.config.hidden_size
+            k_out_dim = self.head_dim * self.base_num_kv_heads
+        else:
+            q_out_dim = self.config.hidden_size * 2
+            k_out_dim = self.head_dim * self.base_num_kv_heads * 2
+
+        self.q_proj = nn.Linear(
+            self.config.hidden_size, q_out_dim, bias=self.config.attention_bias
+        )
+        self.k_proj = nn.Linear(
+            self.config.hidden_size, k_out_dim, bias=self.config.attention_bias
+        )
+        self.v_proj = nn.Linear(
+            self.config.hidden_size,
+            self.head_dim * self.base_num_kv_heads,
+            bias=self.config.attention_bias,
+        )
+        self.o_proj = nn.Linear(
+            self.base_num_heads * self.head_dim,
+            self.config.hidden_size,
+            bias=self.config.attention_bias,
+        )
+
+    def _init_differential_params(self) -> None:
+        """
+        Initializes parameters specific to differential attention.
+
+        Sets up the state used by the differential attention mechanism:
+        - Mixing weight for the negative attention component (warmup phase).
+        - Learnable lambda parameters for queries and keys.
+        - Fixed (non-trainable) initial lambda value based on layer index.
+        - Rotary position embedding layer.
+        """
+        self.diff_attn_mix = 1.0  # Default to full mixing
+
+        self.lambda_init = nn.Parameter(
+            torch.full((), lambda_init_fn(self.layer_idx)),
+            requires_grad=False,
+        )
+        self.lambda_q1 = nn.Parameter(
+            torch.zeros(self.head_dim).normal_(mean=0, std=0.1)
+        )
+        self.lambda_k1 = nn.Parameter(
+            torch.zeros(self.head_dim).normal_(mean=0, std=0.1)
+        )
+        self.lambda_q2 = nn.Parameter(
+            torch.zeros(self.head_dim).normal_(mean=0, std=0.1)
+        )
+        self.lambda_k2 = nn.Parameter(
+            torch.zeros(self.head_dim).normal_(mean=0, std=0.1)
+        )
+
+        self.rotary_emb = LlamaRotaryEmbedding(config=self.config)
+
+    def _init_normalization(self) -> None:
+        """
+        Initializes normalization layers for the attention mechanism.
+
+        Sets up either RMS normalization or identity transformation based on config.
+        The normalization is applied to the sublayer output if enabled.
+        """
+        sublayer_norm = getattr(self.config, "sublayer_norm", True)
+        if sublayer_norm:
+            self.subln = LlamaRMSNorm(self.value_head_dim, eps=self.config.rms_norm_eps)
+        else:
+            self.subln = nn.Identity()
+
+    def _prepare_attention_inputs(
+        self, hidden_states: torch.Tensor
+    ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
+        """
+        Prepares input tensors for attention computation.
+
+        Projects input hidden states to query, key, and value spaces, then reshapes
+        them for multi-head attention processing.
+
+        Args:
+            hidden_states: Input tensor of shape `(batch_size, seq_len,
+                hidden_size)`.
+
+        Returns:
+            tuple: Tuple containing:
+                - q1: Positive attention query component
+                - q2: Negative attention query component
+                - k1: Positive attention key component
+                - k2: Negative attention key component
+                - v: Value tensor
+        """
+        bsz, q_len, _ = hidden_states.size()
+
+        q = self.q_proj(hidden_states)
+        k = self.k_proj(hidden_states)
+        v = self.v_proj(hidden_states)
+        q1, q2 = q.chunk(2, dim=-1)
+        k1, k2 = k.chunk(2, dim=-1)
+
+        q1 = q1.view(bsz, q_len, self.heads_per_component, self.head_dim).transpose(
+            1, 2
+        )
+        q2 = q2.view(bsz, q_len, self.heads_per_component, self.head_dim).transpose(
+            1, 2
+        )
+        k1 = k1.view(bsz, q_len, self.kv_heads_per_component, self.head_dim).transpose(
+            1, 2
+        )
+        k2 = k2.view(bsz, q_len, self.kv_heads_per_component, self.head_dim).transpose(
+            1, 2
+        )
+        v = v.view(bsz, q_len, self.base_num_kv_heads, self.head_dim).transpose(1, 2)
+
+        return q1, q2, k1, k2, v
+
+    def _apply_rotary_embeddings(
+        self,
+        q1: torch.Tensor,
+        q2: torch.Tensor,
+        k1: torch.Tensor,
+        k2: torch.Tensor,
+        position_ids: torch.Tensor,
+        position_embeddings: tuple[torch.Tensor, torch.Tensor] | None,
+    ) -> tuple[
+        torch.Tensor,
+        torch.Tensor,
+        torch.Tensor,
+        torch.Tensor,
+        torch.Tensor,
+        torch.Tensor,
+    ]:
+        """
+        Applies rotary positional embeddings to queries and keys.
+
+        Args:
+            q1: Positive attention query component.
+            q2: Negative attention query component.
+            k1: Positive attention key component.
+            k2: Negative attention key component.
+            position_ids: Token position indices.
+            position_embeddings: Pre-computed rotary embeddings (cos, sin).
+
+        Returns:
+            tuple: Tuple containing:
+                - q1: Positive attention query with positional encoding.
+                - q2: Negative attention query with positional encoding.
+                - k1: Positive attention key with positional encoding.
+                - k2: Negative attention key with positional encoding.
+                - cos: Cosine part of rotary embeddings.
+                - sin: Sine part of rotary embeddings.
+        """
+        if position_embeddings is None:
+            LOG.warning(
+                "The attention layers in this model are transitioning from computing the RoPE embeddings internally "
+                "through `position_ids` (2D tensor with the indexes of the tokens), to using externally computed "
+                "`position_embeddings` (Tuple of tensors, containing cos and sin). In v4.46 `position_ids` will be "
+                "removed and `position_embeddings` will be mandatory."
+            )
+            cos, sin = self.rotary_emb(q1, position_ids)
+        else:
+            cos, sin = position_embeddings
+
+        q1, k1 = apply_rotary_pos_emb(q1, k1, cos, sin)
+        q2, k2 = apply_rotary_pos_emb(q2, k2, cos, sin)
+
+        return q1, q2, k1, k2, cos, sin
+
+    def _handle_cache(
+        self,
+        k1: torch.Tensor,
+        k2: torch.Tensor,
+        v: torch.Tensor,
+        past_key_value: Cache | None,
+        cache_kwargs: dict,
+    ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+        """
+        Handles key-value caching for autoregressive generation and the repetition of
+        key-value heads to match the number of query heads.
+
+        Args:
+            k1: Positive attention key component.
+            k2: Negative attention key component.
+            v: Value tensor.
+            past_key_value: Cache object for storing previous key-value pairs.
+            cache_kwargs: Additional arguments for cache handling.
+
+        Returns:
+            tuple: Tuple containing:
+                - k1: Processed positive attention key component.
+                - k2: Processed negative attention key component.
+                - v: Processed value tensor.
+        """
+        if past_key_value is not None:
+            k = torch.stack([k1, k2], dim=1)
+            k, v = past_key_value.update(k, v, self.layer_idx, cache_kwargs)
+            k1, k2 = k.unbind(dim=1)
+
+        k1 = repeat_kv(k1, self.num_key_value_groups)
+        k2 = repeat_kv(k2, self.num_key_value_groups)
+        v = repeat_kv(v, self.num_key_value_groups)
+        if self.config.split_heads:
+            v = torch.cat(torch.chunk(v, 2, dim=1), dim=-1)
+
+        return k1, k2, v
+
+    def _compute_lambda(self, q1: torch.Tensor) -> torch.Tensor:
+        """
+        Computes lambda values for differential attention.
+
+        The lambda value is computed as λ₁ - λ₂ + λ_init, where λ₁ and λ₂ are computed
+        from the learned parameters. `diff_attn_mix` is multiplied through the result
+        for negative attention component warmup phase (if applicable).
+
+        Args:
+            q1: Positive attention query component, used for type casting.
+
+        Returns:
+            Computed lambda value for differential attention.
+        """
+        lambda_1 = torch.exp(
+            torch.sum(self.lambda_q1 * self.lambda_k1, dim=-1).float()
+        ).type_as(q1)
+        lambda_2 = torch.exp(
+            torch.sum(self.lambda_q2 * self.lambda_k2, dim=-1).float()
+        ).type_as(q1)
+        lambda_full = lambda_1 - lambda_2 + self.lambda_init
+
+        return self.diff_attn_mix * lambda_full
+
+    def _process_attention_output(
+        self, attn: torch.Tensor, bsz: int, q_len: int
+    ) -> torch.Tensor:
+        """
+        Processes and projects the attention output. Applies sublayer normalization
+        and projects back to model dimension; the (1 - λ_init) scaling is disabled.
+
+        Args:
+            attn: Raw attention output.
+            bsz: Batch size.
+            q_len: Query sequence length.
+
+        Returns:
+            Processed attention output of shape `(batch_size, seq_len, hidden_size)`.
+        """
+        attn = self.subln(attn)
+        # NOTE: this may need to be added back in, but doesn't interact well with
+        # `diff_attn_mix`, and doesn't allow us to preserve the original model output.
+        # attn = attn * self.diff_attn_mix * (1 - self.lambda_init)
+        attn = attn.transpose(1, 2).reshape(bsz, q_len, self.config.hidden_size)
+
+        return self.o_proj(attn)
+
+
+class LlamaDifferentialAttention(LlamaDifferentialAttentionBase):
+    """
+    Standard implementation of differential attention.
+
+    This class implements the standard differential attention mechanism using
+    explicit matrix multiplications for the attention computation.
+    """
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: torch.Tensor | None = None,
+        position_ids: torch.LongTensor | None = None,
+        past_key_value: Cache | None = None,
+        output_attentions: bool = False,
+        use_cache: bool = False,  # pylint: disable=unused-argument
+        cache_position: torch.LongTensor | None = None,
+        position_embeddings: tuple[torch.Tensor, torch.Tensor] | None = None,
+        **kwargs,  # pylint: disable=unused-argument
+    ):
+        """
+        Computes differential attention using standard matrix multiplication operations.
+
+        Args:
+            hidden_states: Input tensor containing sequence to attend to.
+            attention_mask: Mask to avoid attention on padding tokens.
+            position_ids: Indices of positions for positional embeddings.
+            past_key_value: Cached key and value tensors for autoregressive decoding.
+            output_attentions: Whether to return attention weights.
+            use_cache: Whether to use cached key/value states.
+            cache_position: Position indices for cached states.
+            position_embeddings: Pre-computed positional embeddings.
+            **kwargs: Additional arguments passed to the forward call.
+
+        Returns:
+            tuple containing:
+                - Output tensor after attention computation.
+                - Attention weights if output_attentions is True, else None.
+                - Updated key-value cache if use_cache is True, else None.
+        """
+        bsz, q_len, _ = hidden_states.size()
+        q1, q2, k1, k2, v = self._prepare_attention_inputs(hidden_states)
+        q1, q2, k1, k2, cos, sin = self._apply_rotary_embeddings(
+            q1, q2, k1, k2, position_ids, position_embeddings
+        )
+
+        cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+        k1, k2, v = self._handle_cache(k1, k2, v, past_key_value, cache_kwargs)
+
+        # Standard attention computation
+        attn1 = torch.matmul(q1, k1.transpose(-1, -2)) / math.sqrt(self.head_dim)
+        attn2 = torch.matmul(q2, k2.transpose(-1, -2)) / math.sqrt(self.head_dim)
+
+        if attention_mask is not None:
+            causal_mask = attention_mask[:, :, :, : k1.shape[-2]]
+            attn1 = attn1 + causal_mask
+            attn2 = attn2 + causal_mask
+
+        attn1 = F.softmax(attn1, dim=-1, dtype=torch.float32).type_as(attn1)
+        attn2 = F.softmax(attn2, dim=-1, dtype=torch.float32).type_as(attn2)
+
+        dropout_p = self.config.attention_dropout if self.training else 0.0
+        attn1 = F.dropout(attn1, p=dropout_p, training=self.training)
+        attn2 = F.dropout(attn2, p=dropout_p, training=self.training)
+
+        lambda_full = self._compute_lambda(q1)
+        attn = torch.matmul(attn1, v) - lambda_full * torch.matmul(attn2, v)
+        attn = self._process_attention_output(attn, bsz, q_len)
+
+        # Save for logging
+        self.attn1 = attn1
+        self.attn2 = attn2
+        self.lambda_full = lambda_full
+
+        if output_attentions:
+            attn_weights = attn1 - lambda_full * attn2
+            attn_weights = attn_weights.view(bsz, self.heads_per_component, q_len, -1)
+            return attn, attn_weights, past_key_value
+        return attn, None, past_key_value
+
+
+class LlamaDifferentialSdpaAttention(LlamaDifferentialAttentionBase):
+    """
+    SDPA-based implementation of differential attention.
+
+    This class implements differential attention using PyTorch's scaled_dot_product_attention
+    for improved performance on supported hardware.
+    """
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: torch.Tensor | None = None,
+        position_ids: torch.LongTensor | None = None,
+        past_key_value: Cache | None = None,
+        output_attentions: bool = False,
+        use_cache: bool = False,
+        cache_position: torch.LongTensor | None = None,
+        position_embeddings: tuple[torch.Tensor, torch.Tensor] | None = None,
+        **kwargs,  # pylint: disable=unused-argument
+    ):
+        """
+        Computes differential attention using PyTorch's scaled dot product attention.
+
+        Args:
+            hidden_states: Input tensor containing sequence to attend to.
+            attention_mask: Mask to avoid attention on padding tokens.
+            position_ids: Indices of positions for positional embeddings.
+            past_key_value: Cached key and value tensors for autoregressive decoding.
+            output_attentions: Whether to return attention weights.
+            use_cache: Whether to use cached key/value states.
+            cache_position: Position indices for cached states.
+            position_embeddings: Pre-computed positional embeddings.
+            **kwargs: Additional arguments passed to the forward call.
+
+        Returns:
+            tuple containing:
+                - Output tensor after attention computation.
+                - None for attention weights (SDPA doesn't support output_attentions).
+                - Updated key-value cache if use_cache is True, else None.
+        """
+        if output_attentions:
+            LOG.warning(
+                "LlamaDifferentialModel is using LlamaDifferentialSdpaAttention, but "
+                + "`torch.nn.functional.scaled_dot_product_attention` does not support "
+                + "`output_attentions=True`. Falling back to the eager attention implementation."
+            )
+
+            # pylint: disable=duplicate-code
+            return LlamaDifferentialAttention.forward(
+                self,
+                hidden_states,
+                attention_mask,
+                position_ids,
+                past_key_value,
+                output_attentions,
+                use_cache,
+                cache_position,
+                position_embeddings,
+            )
+
+        bsz, q_len, _ = hidden_states.size()
+        q1, q2, k1, k2, v = self._prepare_attention_inputs(hidden_states)
+        q1, q2, k1, k2, cos, sin = self._apply_rotary_embeddings(
+            q1, q2, k1, k2, position_ids, position_embeddings
+        )
+
+        cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+        k1, k2, v = self._handle_cache(k1, k2, v, past_key_value, cache_kwargs)
+
+        # SDPA-specific attention computation
+        causal_mask = (
+            None if attention_mask is None else attention_mask[:, :, :, : k1.shape[-2]]
+        )
+        is_causal = attention_mask is None and q_len > 1
+        dropout_p = self.config.attention_dropout if self.training else 0.0
+
+        if q1.device.type == "cuda" and causal_mask is not None:
+            q1, q2 = q1.contiguous(), q2.contiguous()
+            k1, k2 = k1.contiguous(), k2.contiguous()
+            v = v.contiguous()
+
+        attn1 = F.scaled_dot_product_attention(
+            q1, k1, v, attn_mask=causal_mask, dropout_p=dropout_p, is_causal=is_causal
+        )
+        attn2 = F.scaled_dot_product_attention(
+            q2, k2, v, attn_mask=causal_mask, dropout_p=dropout_p, is_causal=is_causal
+        )
+
+        lambda_full = self._compute_lambda(q1)
+        attn = attn1 - lambda_full * attn2
+        attn = self._process_attention_output(attn, bsz, q_len)
+
+        # Save for logging
+        self.attn1 = attn1
+        self.attn2 = attn2
+        self.lambda_full = lambda_full
+
+        return attn, None, past_key_value
+
+
+class LlamaDifferentialFlashAttention2(LlamaDifferentialAttentionBase):
+    """
+    Flash Attention 2-based implementation of differential attention.
+
+    This class implements differential attention using Flash Attention 2 for maximum
+    performance on supported hardware.
+    """
+
+    def __init__(self, *args, **kwargs):
+        """
+        Initializes the Flash Attention 2 differential attention module.
+
+        Args:
+            *args: Positional arguments passed to parent class.
+            **kwargs: Keyword arguments passed to parent class.
+
+        Raises:
+            ImportError: If flash-attn library is not installed.
+        """
+        if not FLASH_ATTENTION_AVAILABLE:
+            raise ImportError(
+                "LlamaDifferentialFlashAttention2 requires flash-attn library. "
+                "Please install with `pip install flash-attn --no-build-isolation`"
+            )
+
+        super().__init__(*args, **kwargs)
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: torch.Tensor | None = None,
+        position_ids: torch.LongTensor | None = None,
+        past_key_value: Cache | None = None,
+        output_attentions: bool = False,
+        use_cache: bool = False,
+        cache_position: torch.LongTensor | None = None,
+        position_embeddings: tuple[torch.Tensor, torch.Tensor] | None = None,
+        **kwargs,  # pylint: disable=unused-argument
+    ):
+        """
+        Computes differential attention using Flash Attention 2.
+
+        Args:
+            hidden_states: Input tensor containing sequence to attend to.
+            attention_mask: Mask to avoid attention on padding tokens.
+            position_ids: Indices of positions for positional embeddings.
+            past_key_value: Cached key and value tensors for autoregressive decoding.
+            output_attentions: Whether to return attention weights.
+            use_cache: Whether to use cached key/value states.
+            cache_position: Position indices for cached states.
+            position_embeddings: Pre-computed positional embeddings.
+            **kwargs: Additional arguments passed to the forward call.
+
+        Returns:
+            tuple containing:
+                - Output tensor after attention computation.
+                - None for attention weights (Flash Attention doesn't support output_attentions).
+                - Updated key-value cache if use_cache is True, else None.
+        """
+        if output_attentions:
+            LOG.warning(
+                "LlamaDifferentialModel is using LlamaDifferentialFlashAttention2, but "
+                + "flash attention does not support `output_attentions=True`. Falling back "
+                + "to the eager attention implementation."
+            )
+
+            # pylint: disable=duplicate-code
+            return LlamaDifferentialAttention.forward(
+                self,
+                hidden_states,
+                attention_mask,
+                position_ids,
+                past_key_value,
+                output_attentions,
+                use_cache,
+                cache_position,
+                position_embeddings,
+            )
+
+        bsz, q_len, _ = hidden_states.size()
+        q1, q2, k1, k2, v = self._prepare_attention_inputs(hidden_states)
+        q1, q2, k1, k2, cos, sin = self._apply_rotary_embeddings(
+            q1, q2, k1, k2, position_ids, position_embeddings
+        )
+
+        cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+        k1, k2, v = self._handle_cache(k1, k2, v, past_key_value, cache_kwargs)
+
+        # Flash Attention specific processing
+        q1, q2 = q1.transpose(1, 2), q2.transpose(1, 2)
+        k1, k2 = k1.transpose(1, 2), k2.transpose(1, 2)
+        v = v.transpose(1, 2)
+
+        dropout_p = self.config.attention_dropout if self.training else 0.0
+
+        if self.config.split_heads:
+            v1, v2 = v.chunk(2, dim=-1)
+            attn11 = flash_attn_func(q1, k1, v1, dropout_p=dropout_p, causal=True)
+            attn12 = flash_attn_func(q1, k1, v2, dropout_p=dropout_p, causal=True)
+            attn1 = torch.cat([attn11, attn12], dim=-1)
+
+            attn21 = flash_attn_func(q2, k2, v1, dropout_p=dropout_p, causal=True)
+            attn22 = flash_attn_func(q2, k2, v2, dropout_p=dropout_p, causal=True)
+            attn2 = torch.cat([attn21, attn22], dim=-1)
+        else:
+            attn1 = flash_attn_func(q1, k1, v, dropout_p=dropout_p, causal=True)
+            attn2 = flash_attn_func(q2, k2, v, dropout_p=dropout_p, causal=True)
+
+        attn1, attn2 = attn1.transpose(1, 2), attn2.transpose(1, 2)
+
+        lambda_full = self._compute_lambda(q1)
+        attn = attn1 - lambda_full * attn2
+        attn = self._process_attention_output(attn, bsz, q_len)
+
+        # Save for logging
+        self.attn1 = attn1
+        self.attn2 = attn2
+        self.lambda_full = lambda_full
+
+        return attn, None, past_key_value
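
Review note on the core computation in `diff_attn.py`: all three attention classes compute `attn1 - lambda_full * attn2`, with `lambda_full` built from the learned lambda vectors plus `lambda_init`. The pure-Python sketch below mirrors that arithmetic for a single head and a single query position so the numbers can be checked by hand. `lambda_init_fn` copies the formula from the diff; the helper names `compute_lambda`, `softmax`, and `diff_attention_row` are illustrative assumptions and do not exist in this change.

```python
import math


def lambda_init_fn(depth: int) -> float:
    # Same formula as in diff_attn.py: about 0.2 at depth 0, approaching 0.8.
    return 0.8 - 0.6 * math.exp(-0.3 * depth)


def compute_lambda(lq1, lk1, lq2, lk2, lambda_init, mix=1.0):
    # exp(<lq1, lk1>) - exp(<lq2, lk2>) + lambda_init, scaled by the warmup mix.
    dot1 = sum(a * b for a, b in zip(lq1, lk1))
    dot2 = sum(a * b for a, b in zip(lq2, lk2))
    return mix * (math.exp(dot1) - math.exp(dot2) + lambda_init)


def softmax(row):
    # Numerically stable softmax over one row of attention scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def diff_attention_row(scores1, scores2, values, lam):
    # One query position: softmax(scores1) @ V - lam * softmax(scores2) @ V.
    p1, p2 = softmax(scores1), softmax(scores2)
    dim = len(values[0])
    out1 = [sum(p * v[d] for p, v in zip(p1, values)) for d in range(dim)]
    out2 = [sum(p * v[d] for p, v in zip(p2, values)) for d in range(dim)]
    return [a - lam * b for a, b in zip(out1, out2)]
```

With `mix=0.0` the lambda term vanishes and the result reduces to ordinary softmax attention over the positive component, which appears to be the reason the warmup weight is multiplied into lambda rather than into the output.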
--- a/src/axolotl/integrations/diff_transformer/modeling_diff_attn.py
+++ b/src/axolotl/integrations/diff_transformer/modeling_diff_attn.py
@@ -0,0 +1,401 @@
+"""
+Modeling for differential transformers.
+
+This module implements differential attention variants of the LLaMA model,
+providing various attention implementations for improved performance.
+"""
+
+import logging
+
+import torch
+from transformers import AutoConfig, AutoModel, AutoModelForCausalLM
+from transformers.models.llama.configuration_llama import LlamaConfig
+from transformers.models.llama.modeling_llama import LlamaForCausalLM, LlamaModel
+
+from .diff_attn import (
+    LlamaDifferentialAttention,
+    LlamaDifferentialFlashAttention2,
+    LlamaDifferentialSdpaAttention,
+)
+
+logger = logging.getLogger(__name__)
+
+
+class LlamaDifferentialConfig(LlamaConfig):
+    """
+    Configuration class for Differential LLaMA model.
+
+    Extends the base LLaMA configuration with additional parameters for differential
+    attention mechanisms.
+    """
+
+    model_type = "llama-differential"
+
+    def __init__(
+        self,
+        split_heads: bool = False,
+        sublayer_norm: bool = True,
+        zero_init: bool = False,
+        mirror_weights: bool = False,
+        **kwargs,
+    ):
+        """
+        Initialize differential LLaMA configuration.
+
+        Args:
+            split_heads: Whether to use split heads mode for attention computation.
+            sublayer_norm: Whether to apply normalization to sublayers.
+            zero_init: Whether to initialize new weights to zero.
+            mirror_weights: Whether to copy the positive attention component weights to
+                the negative attention component.
+            **kwargs: Additional arguments passed to LlamaConfig.
+        """
+        super().__init__(**kwargs)
+        self.split_heads = split_heads
+        self.sublayer_norm = sublayer_norm
+        self.zero_init = zero_init
+        self.mirror_weights = mirror_weights
+        self.architectures = ["LlamaDifferentialModel"]
+        self._attn_implementations = {
+            "eager": "differential_eager",
+            "sdpa": "differential_sdpa",
+            "flash_attention_2": "differential_flash_attention_2",
+        }
+
+
+class LlamaDifferentialModel(LlamaModel):
+    """
+    LlamaModel with differential attention.
+
+    This class extends the base LLaMA model by replacing standard attention with
+    differential attention mechanisms.
+    """
+
+    config_class = LlamaDifferentialConfig
+    base_model_prefix = "llama_differential"
+
+    def __init__(self, config: LlamaDifferentialConfig):
+        """
+        Initialize a differential LLaMA model.
+
+        Args:
+            config: Configuration object for the model.
+
+        Raises:
+            ValueError: If specified attention implementation is not supported.
+        """
+        super().__init__(config)
+
+        # Handle attention implementation
+        attn_impl = config._attn_implementation or "eager"
+        if attn_impl in config._attn_implementations:
+            attn_impl = config._attn_implementations[attn_impl]
+
+        # Validate attention implementation
+        valid_impls = [
+            None,
+            "differential_eager",
+            "differential_sdpa",
+            "differential_flash_attention_2",
+        ]
+        if attn_impl not in valid_impls:
+            raise ValueError(f"Invalid attention implementation: {attn_impl}")
+
+        # Replace standard attention with differential attention in each layer
+        attn_classes = {
+            "differential_eager": LlamaDifferentialAttention,
+            "differential_sdpa": LlamaDifferentialSdpaAttention,
+            "differential_flash_attention_2": LlamaDifferentialFlashAttention2,
+        }
+        attn_class = attn_classes.get(attn_impl, LlamaDifferentialAttention)
+
+        for idx, layer in enumerate(self.layers):
+            layer.self_attn = attn_class(config, idx)
+
+    @classmethod
+    # pylint: disable=protected-access
+    def _autoset_attn_implementation(
+        cls,
+        config: LlamaDifferentialConfig,
+        **kwargs,  # pylint: disable=unused-argument
+    ) -> LlamaDifferentialConfig:
+        """
+        Automatically set the attention implementation based on config.
+
+        Args:
+            config: Model configuration object.
+            **kwargs: Additional arguments (unused).
+
+        Returns:
+            Updated configuration object.
+
+        Raises:
+            ValueError: If specified attention implementation is not supported.
+        """
+        config._attn_implementation_autoset = True
+        attn_implementation = getattr(config, "_attn_implementation", None)
+
+        # Map standard types to differential types if mapping exists
+        if attn_implementation in config._attn_implementations:
+            config._attn_implementation = config._attn_implementations[
+                attn_implementation
+            ]
+            return config
+
+        # If no mapping, validate it's a valid differential type
+        valid_impls = [
+            None,
+            "differential_eager",
+            "differential_sdpa",
+            "differential_flash_attention_2",
+        ]
+        if attn_implementation not in valid_impls:
+            message = (
+                f"Specified `attn_implementation={attn_implementation}` is not supported. "
+                f"The only possible arguments are: {', '.join(repr(x) for x in valid_impls if x)}"
+            )
+            raise ValueError(message)
+
+        return config
+
+    @classmethod
+    def from_llama(
+        cls,
+        model: LlamaModel | LlamaForCausalLM,
+        config: LlamaDifferentialConfig | None = None,
+    ) -> "LlamaDifferentialModel":
+        """
+        Convert a `LlamaModel` to use differential attention.
+
+        Args:
+            model: Base LLaMA model to convert.
+            config: Configuration for differential attention. If `None`, created from
+                base model config.
+
+        Returns:
+            Converted model with differential attention.
+
+        Raises:
+            ValueError: If number of heads is not even when using `split_heads` mode.
+        """
+        logger.info(f"Converting {type(model).__name__} to {cls.__name__}")
+
+        # Handle LlamaForCausalLM
+        if isinstance(model, LlamaForCausalLM):
+            model = model.model
+
+        if config is None:
+            config = LlamaDifferentialConfig(**model.config.__dict__)
+            logger.debug(f"Created config: {config}")
+
+        # Validate head counts if using split heads mode
+        if config.split_heads:
+            if config.num_attention_heads % 2 != 0:
+                raise ValueError(
+                    f"Number of attention heads ({config.num_attention_heads}) must be even "
+                    "when using split_heads=True"
+                )
+            if config.num_key_value_heads % 2 != 0:
+                raise ValueError(
+                    f"Number of key/value heads ({config.num_key_value_heads}) must be even "
+                    "when using split_heads=True"
+                )
+
+        new_model = cls(config)
+
+        # Copy all weights except attention
+        logger.debug("Copying embeddings and norm")
+        new_model.embed_tokens.load_state_dict(model.embed_tokens.state_dict())
+        new_model.norm.load_state_dict(model.norm.state_dict())
+
+        logger.debug("Copying layer weights")
+        for layer_idx, (new_layer, old_layer) in enumerate(
+            zip(new_model.layers, model.layers)
+        ):
+            # Copy everything except attention weights
+            new_layer.mlp.load_state_dict(old_layer.mlp.state_dict())
+            new_layer.input_layernorm.load_state_dict(
+                old_layer.input_layernorm.state_dict()
+            )
+            new_layer.post_attention_layernorm.load_state_dict(
+                old_layer.post_attention_layernorm.state_dict()
+            )
+
+            # Handle attention weights
+            new_layer.self_attn.v_proj.load_state_dict(
+                old_layer.self_attn.v_proj.state_dict()
+            )
+            new_layer.self_attn.o_proj.load_state_dict(
+                old_layer.self_attn.o_proj.state_dict()
+            )
+
+            # Get the original projection sizes
+            old_q_size = old_layer.self_attn.q_proj.weight.size(0)
+            old_k_size = old_layer.self_attn.k_proj.weight.size(0)
+
+            if not config.split_heads:
+                logger.debug(
+                    f"Layer {layer_idx}: Copying Q/K projections with sizes {old_q_size}, {old_k_size}"
+                )
+                new_layer.self_attn.q_proj.weight.data[:old_q_size].copy_(
+                    old_layer.self_attn.q_proj.weight.data
+                )
+                new_layer.self_attn.k_proj.weight.data[:old_k_size].copy_(
+                    old_layer.self_attn.k_proj.weight.data
+                )
+
+                if config.zero_init:
+                    logger.debug(f"Layer {layer_idx}: Zero initializing")
+                    with torch.no_grad():
+                        new_layer.self_attn.q_proj.weight.data[old_q_size:].zero_()
+                        new_layer.self_attn.k_proj.weight.data[old_k_size:].zero_()
+                        new_layer.self_attn.lambda_q1.zero_()
+                        new_layer.self_attn.lambda_k1.zero_()
+                        new_layer.self_attn.lambda_q2.zero_()
+                        new_layer.self_attn.lambda_k2.zero_()
+                        new_layer.self_attn.lambda_init.zero_()
+                elif config.mirror_weights:
+                    # Mirror weights for second component
+                    new_layer.self_attn.q_proj.weight.data[old_q_size:].copy_(
+                        old_layer.self_attn.q_proj.weight.data
+                    )
+                    new_layer.self_attn.k_proj.weight.data[old_k_size:].copy_(
+                        old_layer.self_attn.k_proj.weight.data
+                    )
+
+        logger.info("Conversion complete")
+
+        return new_model
+
+
+class LlamaDifferentialForCausalLM(LlamaForCausalLM):
+    """
+    `LlamaForCausalLM` with differential attention.
+
+    This class extends the base LLaMA causal language model by incorporating
+    differential attention mechanisms.
+    """
+
+    config_class = LlamaDifferentialConfig
+    base_model_prefix = "llama_differential"
+
+    def __init__(self, config: LlamaDifferentialConfig):
+        """
+        Initialize a differential LLaMA model for causal language modeling.
+
+        Args:
+            config: Configuration object for the model.
+        """
+        super().__init__(config)
+        self.model = LlamaDifferentialModel(config)
+
+    @classmethod
+    # pylint: disable=protected-access
+    def _autoset_attn_implementation(
+        cls,
+        config: LlamaDifferentialConfig,
+        **kwargs,  # pylint: disable=unused-argument
+    ) -> LlamaDifferentialConfig:
+        """
+        Automatically set the attention implementation based on config.
+
+        Args:
+            config: Model configuration object.
+            **kwargs: Additional arguments (unused).
+
+        Returns:
+            Updated configuration object.
+
+        Raises:
+            ValueError: If specified attention implementation is not supported.
+        """
+        config._attn_implementation_autoset = True
+        attn_implementation = getattr(config, "_attn_implementation", None)
+
+        # Map standard types to differential types if mapping exists
+        if attn_implementation in config._attn_implementations:
+            config._attn_implementation = config._attn_implementations[
+                attn_implementation
+            ]
+
+            return config
+
+        # If no mapping, validate it's a valid differential type
+        valid_impls = [
+            None,
+            "differential_eager",
+            "differential_sdpa",
+            "differential_flash_attention_2",
+        ]
+        if attn_implementation not in valid_impls:
+            message = (
+                f"Specified `attn_implementation={attn_implementation}` is not supported. "
+                f"The only possible arguments are: {', '.join(repr(x) for x in valid_impls if x)}"
+            )
+            raise ValueError(message)
+
+        return config
+
+    @classmethod
+    def from_llama(
+        cls, model: LlamaForCausalLM, config: LlamaDifferentialConfig | None = None
+    ) -> "LlamaDifferentialForCausalLM":
+        """
+        Convert a `LlamaForCausalLM` to use differential attention.
+
+        Args:
+            model: Base LLaMA model to convert.
+            config: Configuration for differential attention. If `None`, created from
+                base model config.
+
+        Returns:
+            Converted model with differential attention.
+
+        Raises:
+            ValueError: If number of heads is not even when using `split_heads` mode.
+        """
+        if config is None:
+            config = LlamaDifferentialConfig(**model.config.__dict__)
+
+        # Validate head counts if using split heads mode
+        if config.split_heads:
+            if config.num_attention_heads % 2 != 0:
+                raise ValueError(
+                    f"Number of attention heads ({config.num_attention_heads}) must be even "
+                    "when using split_heads=True"
+                )
+            if config.num_key_value_heads % 2 != 0:
+                raise ValueError(
+                    f"Number of key/value heads ({config.num_key_value_heads}) must be even "
+                    "when using split_heads=True"
+                )
+
+        new_model = cls(config)
+        new_model.model = LlamaDifferentialModel.from_llama(model.model, config)
+        new_model.lm_head.load_state_dict(model.lm_head.state_dict())
+
+        return new_model
+
+
+def register_diff_attn() -> None:
+    """
+    Register differential attention components with the transformers library.
+
+    This function registers the differential attention configurations and model classes
+    with the Auto* classes from `transformers`, making them available through the
+    standard model loading pipeline.
+    """
+    # Register configs
+    AutoConfig.register("llama-differential", LlamaDifferentialConfig)
+
+    # Register models
+    AutoModel.register(LlamaDifferentialConfig, LlamaDifferentialModel)
+    AutoModelForCausalLM.register(LlamaDifferentialConfig, LlamaDifferentialForCausalLM)
+
+    from transformers.models.llama.modeling_llama import LLAMA_ATTENTION_CLASSES
+
+    LLAMA_ATTENTION_CLASSES["differential_eager"] = LlamaDifferentialAttention
+    LLAMA_ATTENTION_CLASSES["differential_sdpa"] = LlamaDifferentialSdpaAttention
+    LLAMA_ATTENTION_CLASSES[
+        "differential_flash_attention_2"
+    ] = LlamaDifferentialFlashAttention2
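Review note: both `_autoset_attn_implementation` overrides above follow the same two-step rule (map a standard name to its differential counterpart, otherwise validate). A minimal standalone sketch of that rule follows. The contents of `STANDARD_TO_DIFFERENTIAL` are an assumption, since `config._attn_implementations` is defined elsewhere and is not shown in this hunk, and `resolve_attn_implementation` is a hypothetical helper name used only for illustration.

```python
from typing import Optional

# Assumed mapping; the real one lives on the config as `_attn_implementations`.
STANDARD_TO_DIFFERENTIAL = {
    "eager": "differential_eager",
    "sdpa": "differential_sdpa",
    "flash_attention_2": "differential_flash_attention_2",
}

# Same list the diff validates against (None means "not set").
VALID_DIFFERENTIAL = (
    None,
    "differential_eager",
    "differential_sdpa",
    "differential_flash_attention_2",
)


def resolve_attn_implementation(name: Optional[str]) -> Optional[str]:
    """Hypothetical helper: map a standard name, else validate a differential one."""
    if name in STANDARD_TO_DIFFERENTIAL:
        return STANDARD_TO_DIFFERENTIAL[name]
    if name not in VALID_DIFFERENTIAL:
        options = ", ".join(repr(x) for x in VALID_DIFFERENTIAL if x)
        raise ValueError(
            f"Specified `attn_implementation={name}` is not supported. "
            f"The only possible arguments are: {options}"
        )
    return name
```

Factoring the rule into one shared helper like this would also remove the duplicated `valid_impls` lists in the two classes.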
--- a/src/axolotl/integrations/rala/__init__.py
+++ b/src/axolotl/integrations/rala/__init__.py
@@ -0,0 +1,21 @@
+"""Definition of RALA plugin."""
+
+import logging
+
+from axolotl.integrations.base import BasePlugin
+from axolotl.integrations.rala.auto.llama.modeling_rala import register_rala_model
+
+LOG = logging.getLogger(__name__)
+
+
+class RalaPlugin(BasePlugin):
+    """
+    Plugin for Rala integration with Axolotl.
+    """
+
+    def get_input_args(self):
+        return "axolotl.integrations.rala.args.RalaArgs"
+
+    def register(self):
+        LOG.info("Registering RALA model with AutoConfig & AutoModel")
+        register_rala_model()
--- a/src/axolotl/integrations/rala/args.py
+++ b/src/axolotl/integrations/rala/args.py
@@ -0,0 +1,14 @@
+"""Module for handling RALA input arguments."""
+
+import logging
+from typing import Optional
+
+from pydantic import BaseModel
+
+LOG = logging.getLogger(__name__)
+
+
+class RalaArgs(BaseModel):
+    """Input args for RALA."""
+
+    rala_attention: Optional[bool] = None
--- a/src/axolotl/integrations/rala/auto/__init__.py
+++ b/src/axolotl/integrations/rala/auto/__init__.py
--- a/src/axolotl/integrations/rala/auto/llama/__init__.py
+++ b/src/axolotl/integrations/rala/auto/llama/__init__.py
--- a/src/axolotl/integrations/rala/auto/llama/configuration_rala.py
+++ b/src/axolotl/integrations/rala/auto/llama/configuration_rala.py
@@ -0,0 +1,13 @@
+"""
+Rala config class
+"""
+from transformers import LlamaConfig
+
+
+class LlamaRalaConfig(LlamaConfig):
+    """
+    Configuration for LlamaRala model
+    """
+
+    model_type = "llama-rala"
+    softmax_every: int = 6  # every N-th layer applies softmax
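Review note: to see which layers `softmax_every` actually selects, here is a standalone sketch that mirrors the arithmetic of `LlamaRalaDecoderLayer.is_layer_idx_softmax` in `modeling_rala.py`. The function name `softmax_layer_indices` is hypothetical and exists only for this illustration. It returns the full set of softmax layers rather than testing one index.

```python
def softmax_layer_indices(num_hidden_layers: int, softmax_every: int) -> list[int]:
    """Hypothetical helper: list the layer indices that keep softmax attention."""
    inner_layers = num_hidden_layers - 2
    # Width covered by evenly spaced softmax layers among the inner layers.
    span = 1 + softmax_every * (inner_layers // softmax_every)
    if span == inner_layers:
        start = 1
    else:
        if span > inner_layers:
            # One group too many: drop a group before centering.
            span = 1 + softmax_every * ((inner_layers // softmax_every) - 1)
        # Center the softmax layers within the inner layers.
        start = 1 + (inner_layers - span) // 2

    layers = set(range(start, num_hidden_layers, softmax_every))
    # The first and last layers always use softmax attention.
    layers.update({0, num_hidden_layers - 1})
    return sorted(layers)


print(softmax_layer_indices(32, 6))  # [0, 3, 9, 15, 21, 27, 31]
print(softmax_layer_indices(16, 6))  # [0, 1, 7, 13, 15]
```

With the default `softmax_every = 6`, a 32-layer model keeps softmax attention in 7 layers and uses RALA in the remaining 25.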
--- a/src/axolotl/integrations/rala/auto/llama/modeling_rala.py
+++ b/src/axolotl/integrations/rala/auto/llama/modeling_rala.py
@@ -0,0 +1,623 @@
+# Copyright 2024-2025 Axolotl AI. All rights reserved.
+#
+# This software may be used and distributed according to
+# the terms of the Apache License 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+# License for the specific language governing permissions and limitations under
+# the License.
+
+"""
+Custom modeling code for RALA Llama
+"""
+
+from typing import List, Optional, Tuple, Union, Unpack
+
+import torch
+import torch.nn.functional as F
+from torch import nn
+from transformers import (
+    AutoConfig,
+    AutoModel,
+    AutoModelForCausalLM,
+    Cache,
+    GenerationMixin,
+    LlamaModel,
+)
+from transformers.modeling_outputs import CausalLMOutputWithPast
+from transformers.models.llama.modeling_llama import (
+    LLAMA_ATTENTION_CLASSES,
+    KwargsForCausalLM,
+    LlamaDynamicNTKScalingRotaryEmbedding,
+    LlamaLinearScalingRotaryEmbedding,
+    LlamaMLP,
+    LlamaPreTrainedModel,
+    LlamaRMSNorm,
+    LlamaRotaryEmbedding,
+    apply_rotary_pos_emb,
+    repeat_kv,
+)
+
+from .configuration_rala import LlamaRalaConfig
+
+
+def kappa(x: torch.Tensor) -> torch.Tensor:  # pylint: disable=invalid-name
+    """
+    The paper uses κ(x) = ELU(x) + 1.
+    x is assumed to be [batch, n_heads, seq_len, head_dim].
+    """
+    return F.elu(x) + 1
+
+
+class LlamaRALAAttention(nn.Module):
+    """
+    LlamaAttention replaced with Rank-Augmented Linear Attention (RALA).
+    Adapted from the standard LlamaAttention for demonstration.
+    **Not** a fully drop-in replacement if you need caching/TP.
+    """
+
+    def __init__(self, config, layer_idx: Optional[int] = None):
+        super().__init__()
+        self.config = config
+        self.layer_idx = layer_idx
+
+        self.attention_dropout = config.attention_dropout
+        self.hidden_size = config.hidden_size
+        self.num_heads = config.num_attention_heads
+        self.head_dim = self.hidden_size // self.num_heads
+        self.num_key_value_heads = config.num_key_value_heads
+        self.num_key_value_groups = self.num_heads // self.num_key_value_heads
+        self.max_position_embeddings = config.max_position_embeddings
+        self.rope_theta = config.rope_theta
+        self.is_causal = True
+
+        if (self.head_dim * self.num_heads) != self.hidden_size:
+            raise ValueError(
+                f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
+                f" and `num_heads`: {self.num_heads})."
+            )
+
+        # Same Q, K, V, output projections
+        self.q_proj = nn.Linear(
+            self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias
+        )
+        self.k_proj = nn.Linear(
+            self.hidden_size,
+            self.num_key_value_heads * self.head_dim,
+            bias=config.attention_bias,
+        )
+        self.v_proj = nn.Linear(
+            self.hidden_size,
+            self.num_key_value_heads * self.head_dim,
+            bias=config.attention_bias,
+        )
+        self.o_proj = nn.Linear(
+            self.hidden_size, self.hidden_size, bias=config.attention_bias
+        )
+
+        # We will preserve rope usage
+        self._init_rope()
+
+        # A simple φ-projection for RALA:
+        # The paper uses φ(x) as a linear transform or identity. We'll do a linear:
+        self.phi = nn.Linear(self.hidden_size, self.hidden_size, bias=False)
+
+    def _init_rope(self):
+        # Standard Llama rope logic
+        if self.config.rope_scaling is None:
+            self.rotary_emb = LlamaRotaryEmbedding(
+                self.head_dim,
+                max_position_embeddings=self.max_position_embeddings,
+                base=self.rope_theta,
+            )
+        else:
+            scaling_type = self.config.rope_scaling["type"]
+            scaling_factor = self.config.rope_scaling["factor"]
+            if scaling_type == "linear":
+                self.rotary_emb = LlamaLinearScalingRotaryEmbedding(
+                    self.head_dim,
+                    max_position_embeddings=self.max_position_embeddings,
+                    scaling_factor=scaling_factor,
+                    base=self.rope_theta,
+                )
+            elif scaling_type == "dynamic":
+                self.rotary_emb = LlamaDynamicNTKScalingRotaryEmbedding(
+                    self.head_dim,
+                    max_position_embeddings=self.max_position_embeddings,
+                    scaling_factor=scaling_factor,
+                    base=self.rope_theta,
+                )
+            else:
+                raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_value: Optional[Cache] = None,
+        output_attentions: bool = False,
+        use_cache: bool = False,  # pylint: disable=unused-argument
+        cache_position: Optional[torch.LongTensor] = None,
+        position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
+        **kwargs,  # pylint: disable=unused-argument
+    ):
+        """
+        RALA forward pass.
+        This version omits incremental decoding with `past_key_value` for simplicity
+        (linear attention caching is non-trivial).
+        """
+        bsz, q_len, _ = hidden_states.size()
+
+        # Standard Q, K, V
+        query_states = self.q_proj(hidden_states)  # [b, seq, n_heads*dim]
+        key_states = self.k_proj(hidden_states)  # [b, seq, n_kv_heads*dim]
+        value_states = self.v_proj(hidden_states)  # [b, seq, n_kv_heads*dim]
+
+        # Reshape to [b, n_heads, seq_len, head_dim]
+        query_states = query_states.view(
+            bsz, q_len, self.num_heads, self.head_dim
+        ).transpose(1, 2)
+        key_states = key_states.view(
+            bsz, q_len, self.num_key_value_heads, self.head_dim
+        ).transpose(1, 2)
+        value_states = value_states.view(
+            bsz, q_len, self.num_key_value_heads, self.head_dim
+        ).transpose(1, 2)
+
+        # Apply RoPE (rotary embeddings) just as in standard Llama
+        cos, sin = self.rotary_emb(value_states, position_ids)
+        query_states, key_states = apply_rotary_pos_emb(
+            query_states, key_states, cos, sin
+        )
+
+        # If a past_key_value (Cache object) is provided, let it update / append
+        if past_key_value is not None:
+            # This is the normal Llama pattern
+            cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+            # The .update() method returns updated (key_states, value_states)
+            # and typically updates internal buffers. It may also store `layer_idx` data.
+            key_states, value_states = past_key_value.update(
+                key_states, value_states, self.layer_idx, cache_kwargs
+            )
+
+        # If you still want to handle the repeated KV for multi-group setups:
+        key_states = repeat_kv(key_states, self.num_key_value_groups)
+        value_states = repeat_kv(value_states, self.num_key_value_groups)
+
+        # Now we apply RALA.
+
+        # 1) Apply κ(.) to Q,K: shape [b, n_heads, seq_len, head_dim]
+        Q_kappa = kappa(query_states)  # pylint: disable=invalid-name
+        K_kappa = kappa(key_states)  # pylint: disable=invalid-name
+
+        # 2) Compute global query Q_g = average of Q_kappa across seq_len => [b, n_heads, head_dim]
+        # The paper denotes Q_g = (1/N) Σ_i Q_kappa_i
+        seq_len_float = float(q_len)  # for scaling
+        Q_g = Q_kappa.mean(  # pylint: disable=invalid-name
+            dim=2
+        )  # [b, n_heads, head_dim]
+
+        # 3) Compute alpha_j for each token j in [0..seq_len-1]
+        #    alpha_j = N * softmax( Q_g · K_kappa_j^T ), shape => [b, n_heads, seq_len]
+        # Dot product over head_dim
+        # K_kappa is [b, n_heads, seq_len, head_dim], Q_g is [b, n_heads, head_dim]
+        # We'll do an einsum or transpose to produce logits [b, n_heads, seq_len]
+
+        # Dot product across the last dimension (d_head), resulting in shape [b, n_heads, seq_len]
+        # logits = torch.einsum("bnh, bnsh -> bns", Q_g, K_kappa)  # [b, n_heads, seq_len]
+        logits = (Q_g.unsqueeze(2) * K_kappa).sum(
+            dim=-1
+        )  # -> [b, n_heads, seq_len]  # identical to above but torch.compile should work
+
+        # 4) Incorporate causal or padding mask if provided.
+        #    In standard Llama, attention_mask is broadcast as [b, 1, seq_len, seq_len] or similar.
+        #    For RALA, we only do a single softmax over "j" dimension. We can add the mask to logits.
+        #    Caution: This might not replicate strict causal linear attention. It's a best-effort approach.
+        if attention_mask is not None:
+            # Usually Llama's causal mask is [b, 1, q_len, kv_len] with 0 or -inf
+            # We want shape [b, n_heads, seq_len], so we can broadcast accordingly:
+            # e.g., attention_mask: [b, 1, q_len, seq_len]
+            # We pick the slice that corresponds to q_len vs. kv_len.
+            # Typically the last two dims are (q_len, kv_len). We want the kv_len dimension to be `seq_len`.
+            # We'll do something like:
+            if attention_mask.dim() == 4:
+                # attention_mask: [b, 1, q_len, kv_len]
+                # if q_len == kv_len, we can do attention_mask[:, :, :, :seq_len], then squeeze dims
+                mask_2d = attention_mask[:, 0, :, :q_len]  # [b, q_len, seq_len]
+                # We need [b, n_heads, seq_len], but alpha_j is a single weight per key j
+                # (per head) with no query index i, so a per-query mask cannot be applied exactly.
+                # Approximation: take the minimum over the query dimension, so key j is
+                # masked if it is masked for ANY query position i.
+                # Caution: with a strictly causal mask every key j > 0 is masked for
+                # query i = 0, so this reduction leaves only j = 0 unmasked.
+                mask_1d = torch.min(mask_2d, dim=1)[
+                    0
+                ]  # [b, seq_len], picking the worst mask across query positions
+                # broadcast for n_heads
+                mask_1d = mask_1d.unsqueeze(1).expand(
+                    -1, self.num_heads, -1
+                )  # [b, n_heads, seq_len]
+                logits = logits + mask_1d
+            else:
+                # Possibly it's [b, seq_len]. Then we just broadcast to [b,n_heads,seq_len].
+                mask_1d = attention_mask  # [b, seq_len]
+                mask_1d = mask_1d.unsqueeze(1).expand(-1, self.num_heads, -1)
+                logits = logits + mask_1d
+
+        alpha = F.softmax(logits, dim=-1)  # [b, n_heads, seq_len]
+        # multiply by seq_len per the formula
+        alpha = alpha * seq_len_float
+
+        # 5) Construct the outer-sum:  Σ_j alpha_j * (K_kappa_j^T V_j)
+        #    The paper shows a d×d matrix formed per head.
+        #    K_kappa: [b, n_heads, seq_len, head_dim], V: [b, n_heads, seq_len, head_dim]
+        #    For each j, do outer product K_kappa_j (d×1) × V_j^T (1×d) => d×d
+        #    Then multiply by alpha_j and sum over j.
+        #    We'll do an einsum for that: [b,n_heads,seq_len,d] outer [b,n_heads,seq_len,d] => [b,n_heads,d,d]
+        #    alpha: [b, n_heads, seq_len].
+        value_states_ = value_states  # [b, n_heads, seq_len, head_dim]
+        outer_sum = torch.einsum("bns,bnsd,bnsf->bndf", alpha, K_kappa, value_states_)
+
+        # Explanation of the einsum "bns,bnsd,bnsf->bndf":
+        #  - 'bns'  is alpha    (b, n_heads, seq_len)
+        #  - 'bnsd' is K_kappa  (b, n_heads, seq_len, d)
+        #  - 'bnsf' is V        (b, n_heads, seq_len, d)
+        # Summing over s gives sum_j alpha_j * (K_kappa_j^T outer V_j) with shape
+        # [b, n_heads, d, f], i.e. one d×d matrix per head.
+
+        # 6) For each token i, Y_i = φ(X_i) ∘ [ κ(Q_i) × outer_sum ]
+        #    Here κ(Q) is shape [b,n,s,d] over all tokens and outer_sum is shape [b,n,d,d].
+        #    A batched matmul gives result_attn = Q_kappa × outer_sum => [b,n,s,d]
+        #    Then multiply elementwise by φ(X_i).
+        #    φ(X) is [b,seq_len,d_model], so we reshape it to [b,n,seq_len,head_dim].
+        #    All tokens are handled in a single einsum rather than a per-token loop:
+
+        # first, compute φ(X):
+        # X is the original hidden_states: [b, seq_len, d_model]
+        X_phi = self.phi(  # pylint: disable=invalid-name
+            hidden_states
+        )  # [b, seq_len, d_model]
+        X_phi = X_phi.view(  # pylint: disable=invalid-name
+            bsz, q_len, self.num_heads, self.head_dim
+        )  # [b, s, n, d]
+        X_phi = X_phi.transpose(1, 2)  # [b, n, s, d]  # pylint: disable=invalid-name
+
+        # Now for each i in [0..q_len-1], we do a matrix multiply:
+        # result_attn_i = Q_kappa_i [b,n,s,d] × outer_sum [b,n,d,d] => we want [b,n,s,d].
+        # We'll do:
+        result_attn = torch.einsum("bnsd,bndf->bnsf", Q_kappa, outer_sum)  # [b,n,s,d]
+
+        # Then elementwise multiply by φ(X_i):
+        context_layer = X_phi * result_attn  # [b,n,s,d]
+
+        # Finally, reorder to [b, s, n, d] -> [b, s, n*d]
+        context_layer = context_layer.transpose(1, 2).contiguous()  # [b, s, n, d]
+        context_layer = context_layer.view(bsz, q_len, self.hidden_size)
+
+        # One last linear projection:
+        attn_output = self.o_proj(context_layer)
+
+        if output_attentions:
+            # alpha => [b, n_heads, (past_len + q_len)]
+            attn_weights = alpha
+        else:
+            attn_weights = None
+
+        # Return 3-tuple: (attn_output, attn_weights, past_key_value)
+        return attn_output, attn_weights, past_key_value
+
+
+class LlamaRalaDecoderLayer(nn.Module):
+    """
+    LlamaDecoderLayer with RALA support
+    """
+
+    def __init__(self, config: LlamaRalaConfig, layer_idx: int):
+        super().__init__()
+        self.hidden_size = config.hidden_size
+
+        if LlamaRalaDecoderLayer.is_layer_idx_softmax(
+            config.num_hidden_layers, layer_idx, config.softmax_every
+        ):
+            self.self_attn = LLAMA_ATTENTION_CLASSES[config._attn_implementation](
+                config=config, layer_idx=layer_idx
+            )
+            # self.self_attn = LlamaAttention(config=config, layer_idx=layer_idx)
+        else:
+            self.self_attn = LlamaRALAAttention(config=config, layer_idx=layer_idx)
+
+        self.mlp = LlamaMLP(config)
+        self.input_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.post_attention_layernorm = LlamaRMSNorm(
+            config.hidden_size, eps=config.rms_norm_eps
+        )
+
+    @classmethod
+    def is_layer_idx_softmax(
+        cls, num_hidden_layers: int, layer_idx: int, softmax_every: int
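+    ) -> bool:
+        """
+        Return whether `layer_idx` uses softmax attention instead of RALA.
+
+        The first and last layers always use softmax attention; in between, every
+        `softmax_every`-th layer does, with the start offset chosen to roughly
+        center the softmax layers among the inner layers.
+        """
+        inner_layers = num_hidden_layers - 2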
+        if 1 + softmax_every * (inner_layers // softmax_every) == inner_layers:
+            softmax_start_idx = 1
+        elif 1 + softmax_every * (inner_layers // softmax_every) > inner_layers:
+            layer_group_size = 1 + softmax_every * ((inner_layers // softmax_every) - 1)
+            softmax_start_idx = 1 + (inner_layers - layer_group_size) // 2
+        else:
+            layer_group_size = 1 + softmax_every * (inner_layers // softmax_every)
+            softmax_start_idx = 1 + (inner_layers - layer_group_size) // 2
+
+        softmax_layers = set(range(softmax_start_idx, num_hidden_layers, softmax_every))
+        softmax_layers.add(0)
+        softmax_layers.add(num_hidden_layers - 1)
+
+        return layer_idx in softmax_layers
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_value: Optional[Cache] = None,
+        output_attentions: Optional[bool] = False,
+        use_cache: Optional[bool] = False,
+        cache_position: Optional[torch.LongTensor] = None,
+        position_embeddings: Optional[
+            Tuple[torch.Tensor, torch.Tensor]
+        ] = None,  # will become mandatory in v4.46
+        **kwargs,
+    ) -> Tuple[
+        torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]
+    ]:
+        """
+        Args:
+            hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
+            attention_mask (`torch.FloatTensor`, *optional*):
+                attention mask of size `(batch_size, sequence_length)` if flash attention is used or `(batch_size, 1,
+                query_sequence_length, key_sequence_length)` if default attention is used.
+            output_attentions (`bool`, *optional*):
+                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
+                returned tensors for more detail.
+            use_cache (`bool`, *optional*):
+                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
+                (see `past_key_values`).
+            past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
+            cache_position (`torch.LongTensor` of shape `(sequence_length)`, *optional*):
+                Indices depicting the position of the input sequence tokens in the sequence
+            position_embeddings (`Tuple[torch.FloatTensor, torch.FloatTensor]`, *optional*):
+                Tuple containing the cosine and sine positional embeddings of shape `(batch_size, seq_len, head_dim)`,
+                with `head_dim` being the embedding dimension of each attention head.
+            kwargs (`dict`, *optional*):
+                Arbitrary kwargs to be ignored, used for FSDP and other methods that inject code
+                into the model
+        """
+        residual = hidden_states
+
+        hidden_states = self.input_layernorm(hidden_states)
+
+        # Self Attention
+        hidden_states, self_attn_weights, present_key_value = self.self_attn(
+            hidden_states=hidden_states,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_value=past_key_value,
+            output_attentions=output_attentions,
+            use_cache=use_cache,
+            cache_position=cache_position,
+            position_embeddings=position_embeddings,
+            **kwargs,
+        )
+        hidden_states = residual + hidden_states
+
+        # Fully Connected
+        residual = hidden_states
+        hidden_states = self.post_attention_layernorm(hidden_states)
+        hidden_states = self.mlp(hidden_states)
+        hidden_states = residual + hidden_states
+
+        outputs = (hidden_states,)
+
+        if output_attentions:
+            outputs += (self_attn_weights,)  # type: ignore
+
+        if use_cache:
+            outputs += (present_key_value,)  # type: ignore
+
+        return outputs  # type: ignore
+
+
+class LlamaRalaModel(LlamaModel):
+    """
+    LlamaModel with RALA support
+    """
+
+    config_class = LlamaRalaConfig
+
+    def __init__(self, config: LlamaRalaConfig):
+        LlamaPreTrainedModel.__init__(self, config)
+        self.padding_idx = config.pad_token_id
+        self.vocab_size = config.vocab_size
+
+        self.embed_tokens = nn.Embedding(
+            config.vocab_size, config.hidden_size, self.padding_idx
+        )
+
+        self.layers = nn.ModuleList(
+            [
+                LlamaRalaDecoderLayer(config, layer_idx)
+                for layer_idx in range(config.num_hidden_layers)
+            ]
+        )
+
+        self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.rotary_emb = LlamaRotaryEmbedding(config=config)
+
+        self.gradient_checkpointing = False
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+
+class LlamaRalaForCausalLM(LlamaPreTrainedModel, GenerationMixin):
+    """
+    LlamaForCausalLM with RALA support
+    """
+
+    config_class = LlamaRalaConfig
+    _no_split_modules = ["LlamaRalaDecoderLayer"]
+
+    _tied_weights_keys = ["lm_head.weight"]
+    _tp_plan = {"lm_head": "colwise_rep"}
+
+    def __init__(self, config):
+        super().__init__(config)
+        self.model = LlamaRalaModel(config)
+        self.vocab_size = config.vocab_size
+        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self):
+        return self.model.embed_tokens
+
+    def set_input_embeddings(self, value):
+        self.model.embed_tokens = value
+
+    def get_output_embeddings(self):
+        return self.lm_head
+
+    def set_output_embeddings(self, new_embeddings):
+        self.lm_head = new_embeddings
+
+    def set_decoder(self, decoder):
+        self.model = decoder
+
+    def get_decoder(self):
+        return self.model
+
+    def forward(
+        self,
+        input_ids: torch.LongTensor = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        labels: Optional[torch.LongTensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        num_logits_to_keep: int = 0,
+        **kwargs: Unpack[KwargsForCausalLM],  # type: ignore
+    ) -> Union[Tuple, CausalLMOutputWithPast]:
+        r"""
+        Args:
+            labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+                Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
+                config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
+                (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
+            num_logits_to_keep (`int`, *optional*):
+                Calculate logits for the last `num_logits_to_keep` tokens. If `0`, calculate logits for all
+                `input_ids` (special case). Only last token logits are needed for generation, and calculating them only for that
+                token can save memory, which becomes pretty significant for long sequences or large vocabulary size.
+        Returns:
+        Example:
+        ```python
+        >>> from transformers import AutoTokenizer, LlamaForCausalLM
+        >>> model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
+        >>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
+        >>> prompt = "Hey, are you conscious? Can you talk to me?"
+        >>> inputs = tokenizer(prompt, return_tensors="pt")
+        >>> # Generate
+        >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
+        >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+        "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
+        ```"""
+        output_attentions = (
+            output_attentions
+            if output_attentions is not None
+            else self.config.output_attentions
+        )
+        output_hidden_states = (
+            output_hidden_states
+            if output_hidden_states is not None
+            else self.config.output_hidden_states
+        )
+        return_dict = (
+            return_dict if return_dict is not None else self.config.use_return_dict
+        )
+
+        # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
+        outputs = self.model(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            cache_position=cache_position,
+            **kwargs,
+        )
+
+        hidden_states = outputs[0]
+        # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
+        logits = self.lm_head(hidden_states[:, -num_logits_to_keep:, :])
+
+        loss = None
+        if labels is not None:
+            loss = self.loss_function(
+                logits=logits,
+                labels=labels,
+                vocab_size=self.config.vocab_size,
+                **kwargs,
+            )
+
+        if not return_dict:
+            output = (logits,) + outputs[1:]
+            return (loss,) + output if loss is not None else output
+
+        return CausalLMOutputWithPast(
+            loss=loss,
+            logits=logits,
+            past_key_values=outputs.past_key_values,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
+
+
+def register_rala_model() -> None:
+    """
+    Register RALA attention components with the transformers library.
+    This function registers the RALA configuration and model classes
+    with the Auto* classes from `transformers`, making them available through the
+    standard model loading pipeline.
+    """
+    # Register configs
+    AutoConfig.register("llama-rala", LlamaRalaConfig)
+
+    # Register models
+    AutoModel.register(LlamaRalaConfig, LlamaRalaModel)
+    AutoModelForCausalLM.register(LlamaRalaConfig, LlamaRalaForCausalLM)
+
+    LLAMA_ATTENTION_CLASSES["rala"] = LlamaRALAAttention
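The `forward` method above slices `hidden_states[:, -num_logits_to_keep:, :]` before applying `lm_head`. The sketch below uses plain Python lists to show why the documented special case `0` keeps every position: `-0` equals `0`, so the slice starts at the beginning. The list is an assumption made for illustration (the real code slices a tensor along the sequence dimension), and `keep_last` is a hypothetical helper that is not part of this change.

```python
def keep_last(positions, num_logits_to_keep=0):
    """Mimic the sequence-dimension slice `[-num_logits_to_keep:]` on a list."""
    # -0 == 0, so num_logits_to_keep=0 yields positions[0:], i.e. everything.
    return positions[-num_logits_to_keep:]


tokens = ["t0", "t1", "t2", "t3"]
all_positions = keep_last(tokens)   # default 0: every position is kept
last_only = keep_last(tokens, 1)    # generation only needs the final position
last_two = keep_last(tokens, 2)
```

During generation, a caller would typically pass `1` so that only the final position is projected to the vocabulary.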
--- a/src/axolotl/integrations/rala/convert.py
+++ b/src/axolotl/integrations/rala/convert.py
@@ -0,0 +1,106 @@
+"""
+conversion for llama models to use RALA attention
+"""
+import logging
+
+from torch import nn
+from transformers import PreTrainedModel
+from transformers.models.llama.modeling_llama import LlamaAttention
+
+from axolotl.integrations.rala.auto.llama.modeling_rala import (
+    LlamaRALAAttention,
+    LlamaRalaDecoderLayer,
+)
+
+logger = logging.getLogger(__name__)
+
+ATTENTION_MAPPING = {
+    LlamaAttention: LlamaRALAAttention,
+}
+
+
+def copy_attention_weights(
+    old_attn,
+    new_attn,
+    zero_init: bool = False,
+) -> None:
+    """
+    Copy weights from the old attention layer to the new RALA layer.
+    Copies the q, k, v, and o projection weights, then initializes phi.
+    """
+    new_attn.q_proj.weight.data.copy_(old_attn.q_proj.weight.data)
+    new_attn.k_proj.weight.data.copy_(old_attn.k_proj.weight.data)
+    new_attn.v_proj.weight.data.copy_(old_attn.v_proj.weight.data)
+    new_attn.o_proj.weight.data.copy_(old_attn.o_proj.weight.data)
+
+    # phi has no counterpart in the old attention layer: zero it or use a normal init
+    if zero_init:
+        nn.init.zeros_(new_attn.phi.weight)
+    else:
+        nn.init.normal_(new_attn.phi.weight)
+    if new_attn.phi.bias is not None:
+        nn.init.normal_(new_attn.phi.bias)
+
+    logger.debug(
+        "Copied attention weights from %s to %s",
+        type(old_attn).__name__,
+        type(new_attn).__name__,
+    )
+
+
+def convert_to_rala(
+    model: PreTrainedModel, zero_init: bool = False, softmax_every_n: int = 6
+) -> PreTrainedModel:
+    """Convert a pre-trained model's attention layers to RALA attention"""
+    layer_idx = 0
+
+    def convert_module(module, softmax_every, num_hidden_layers):
+        nonlocal layer_idx
+
+        # Iterate through module children, convert any attention layers to RALA
+        for name, child in module.named_children():
+            if isinstance(child, tuple(ATTENTION_MAPPING.keys())):
+                decoder_layer_idx = child.layer_idx
+                if LlamaRalaDecoderLayer.is_layer_idx_softmax(
+                    num_hidden_layers, decoder_layer_idx, softmax_every
+                ):
+                    continue
+                # Look up the RALA attention class for this layer type
+                # pylint: disable=duplicate-code
+                attention_class = ATTENTION_MAPPING[type(child)]
+
+                layer_type = type(child).__name__
+                logger.info(
+                    f"Converting attention layer {decoder_layer_idx}: {layer_type} to {attention_class.__name__}"
+                )
+
+                # Create new RALA attention layer
+                new_attention = attention_class(
+                    config=module.config if hasattr(module, "config") else model.config,
+                    layer_idx=decoder_layer_idx,
+                )
+
+                # Copy weights from old attention to new attention
+                new_attention.to(child.q_proj.weight.device)
+                copy_attention_weights(child, new_attention, zero_init=zero_init)
+
+                # Replace the layer
+                setattr(module, name, new_attention)
+                layer_idx += 1
+            elif len(list(child.children())) > 0:
+                convert_module(child, softmax_every, num_hidden_layers)
+
+    model.config.softmax_every = softmax_every_n
+    convert_module(model, softmax_every_n, model.config.num_hidden_layers)
+    logger.info(f"Converted {layer_idx} attention layers to RALA attention")
+
+    model.config.architectures = [
+        "LlamaRalaForCausalLM",
+    ]
+    model.config.model_type = "llama-rala"
+    # model.config.auto_map = {
+    #     "AutoConfig": "llama.configuration_rala.LlamaRalaConfig",
+    #     "AutoModel": "llama.modeling_rala.LlamaRalaModel",
+    #     "AutoModelForCausalLM": "llama.modeling_rala.LlamaRalaForCausalLM",
+    # }
+    return model
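The recursive replacement in `convert_to_rala` can be sketched without any ML dependencies. The toy `Node` class, the `convert` function, and the `keep_softmax` rule below are all hypothetical stand-ins: the real predicate is `LlamaRalaDecoderLayer.is_layer_idx_softmax`, whose logic is not shown in this file, and real modules are `nn.Module` instances rather than dictionaries. The sketch only illustrates the traversal pattern: walk the children, skip layers that stay on softmax attention, and swap the rest in place.

```python
class Node:
    """Toy stand-in for a module; `layer_idx` is set only on attention nodes."""

    def __init__(self, kind, layer_idx=None, children=None):
        self.kind = kind
        self.layer_idx = layer_idx
        self.children = children or {}


def convert(root, keep_softmax):
    """Replace attention nodes in place unless `keep_softmax(layer_idx)` is true."""
    converted = []

    def visit(node):
        for name, child in list(node.children.items()):
            if child.kind == "attn":
                if keep_softmax(child.layer_idx):
                    continue  # leave this layer as ordinary softmax attention
                node.children[name] = Node("rala", layer_idx=child.layer_idx)
                converted.append(child.layer_idx)
            elif child.children:
                visit(child)

    visit(root)
    return converted


# Four decoder layers, each holding one attention node.
model = Node("model", children={
    str(i): Node("layer", children={"self_attn": Node("attn", layer_idx=i)})
    for i in range(4)
})
# Hypothetical rule: keep softmax attention on every second layer.
converted = convert(model, keep_softmax=lambda idx: idx % 2 == 1)
```

Here `converted` holds the original indices of the replaced layers. Its length, the running count of conversions, stops matching those indices as soon as any layer is skipped.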
--- a/src/axolotl/integrations/rrt/init.py
+++ b/src/axolotl/integrations/rrt/init.py
@@ -1,25 +0,0 @@
-"""
-Axolotl Plugin for Relaxed Recursive Transformers
-"""
-
-import logging
-
-from axolotl.integrations.base import BasePlugin
-from axolotl.integrations.rrt.modeling import register_rrt_model
-
-LOG = logging.getLogger(__name__)
-
-
-class RelaxedRecursiveTransformerPlugin(BasePlugin):
-    """
-    Plugin for Relaxed Recursive Transformers integration with Axolotl
-    """
-
-    def get_input_args(self):
-        return "axolotl.integrations.rrt.args.RelaxedRecursiveTransformerArgs"
-
-    def register(self):
-        LOG.info(
-            "Registering Relaxed Recursive Transformers modeling with transformers"
-        )
-        register_rrt_model()
--- a/src/axolotl/integrations/rrt/args.py
+++ b/src/axolotl/integrations/rrt/args.py
@@ -1,11 +0,0 @@
-"""
-Axolotl config args for Relaxed Recursive Transformers plugin
-"""
-
-from pydantic import BaseModel
-
-
-class RelaxedRecursiveTransformerArgs(BaseModel):
-    """
-    Arguments pertaining to the Relaxed Recursive Transformer model.
-    """
--- a/src/axolotl/integrations/rrt/cli/convert.py
+++ b/src/axolotl/integrations/rrt/cli/convert.py
@@ -1,370 +0,0 @@
-"""
-cli script for converting a pretrained model to a relaxed recursive transformer model
-"""
-import json
-import logging
-import math
-import os
-import re
-from pathlib import Path
-from typing import Tuple
-
-import safetensors
-import torch
-from huggingface_hub import snapshot_download, split_torch_state_dict_into_shards
-from safetensors.torch import save_file
-from tqdm import tqdm
-from transformers import AutoConfig, AutoTokenizer
-from transformers.utils import SAFE_WEIGHTS_INDEX_NAME, SAFE_WEIGHTS_NAME
-
-from axolotl.integrations.rrt.modeling.modeling_rrt_llama import (
-    RelaxedRecursiveLlamaConfig,
-)
-
-logger = logging.getLogger(__name__)
-
-
-def extract_layer_number(key):
-    """Extract layer number from parameter key."""
-    match = re.search(r"layers\.(\d+)\.", key)
-    return int(match.group(1)) if match else None
-
-
-def iter_parameter_weights(model_path, device="mps"):
-    """
-    iterator over parameter weights in the model shards
-
-    :param model_path: Path to model shards
-    :param device: Computing device
-    :return: generator yielding (parameter key, parameter weight, layer index) tuples
-    """
-    shards = list(model_path.glob("model*.safetensors"))
-    if not shards:
-        raise ValueError(f"No model shards found in {model_path}")
-
-    for shard in tqdm(shards, desc="Processing shards"):
-        with safetensors.safe_open(shard, framework="pt", device=device) as f:
-            for key in f.keys():
-                layer_idx = extract_layer_number(key)
-                weight = f.get_tensor(key)
-                yield key, weight, layer_idx
-
-
-def iter_recursive_parameter_weights(
-    model_path, modules_to_recurse: list[str], device="mps", recurse_layers=12
-):
-    # setup placeholder state_dict for recursive weights, need to keep in float32 precision
-    # to avoid precision loss when averaging weights across layers
-    rrt_avg_model_state_dict: dict[str, list[torch.Tensor]] = {}
-
-    # iterate over all parameter weights in the model shards
-    for key, weight, layer_idx in iter_parameter_weights(model_path, device=device):
-        # get the matching module name in modules_to_recurse for the current parameter key
-        matched_module_name = next(
-            (module for module in modules_to_recurse if module in key), None
-        )
-        if matched_module_name is None:
-            continue
-
-        recurse_idx = layer_idx % recurse_layers
-        suffix = f"{recurse_idx}.{matched_module_name}"
-        if rrt_avg_model_state_dict.get(suffix) is None:
-            # setup as storage for suffix with torch.stack
-            rrt_avg_model_state_dict[suffix] = [weight.to(torch.float32).detach().cpu()]
-        else:
-            rrt_avg_model_state_dict[suffix].append(
-                weight.to(torch.float32).detach().cpu()
-            )
-
-    for module_name in modules_to_recurse:
-        for recurse_idx in range(recurse_layers):
-            suffix = f"{recurse_idx}.{module_name}"
-            prefix = f"model.layers.{suffix}"
-            avg_weight = torch.stack(rrt_avg_model_state_dict[suffix]).mean(dim=0)
-            yield f"{prefix}.weight_base", avg_weight
-
-    # compute the decomposed lora diff from the weight base to the actual weight for each module
-
-
-def low_rank_decomposition(
-    weight: torch.Tensor, max_rank: int
-) -> Tuple[torch.Tensor, torch.Tensor]:
-    """
-    Decompose a 2D matrix into low-rank matrices L and R using SVD.
-
-    :param weight: The matrix to decompose, of shape (H, W)
-    :param max_rank: The maximum rank of the decomposition
-    :return: A tuple of tensors (L, R)
-    """
-    # pylint: disable=invalid-name
-    assert (
-        weight.dim() == 2
-    ), f"Only support 2D matrix, but input has {weight.dim()} dimensions."
-    assert (
-        max_rank >= 1
-    ), f"Maximum rank must be a positive integer, but input max_rank={max_rank}."
-
-    dtype = weight.dtype
-
-    U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)
-
-    # Distribute S to both to improve numerical precision
-    sqrt_S = torch.sqrt(torch.diag(S[:max_rank]))
-    A = sqrt_S @ Vh[:max_rank, :]  # shape: [r, cols]
-    B = U[:, :max_rank] @ sqrt_S  # shape: [rows, r]
-
-    return A.to(dtype), B.to(dtype)
-
-
-def get_weight_norm(weight, lora_weight, scaling) -> torch.Tensor:
-    # calculate L2 norm of weight matrix, column-wise
-    weight = weight + scaling * lora_weight
-    weight_norm = torch.linalg.norm(weight, dim=1).to(weight.dtype)
-    return weight_norm
-
-
-def decompose_delta_weight(layer_weight, avg_weight, alpha, rank, use_dora=True):
-    """
-    Decompose the difference in directions (ΔV) via SVD,
-    and return (magnitudes, L, R).
-    """
-    device = "cuda" if torch.cuda.is_available() else "mps"
-
-    # rslora
-    scaling = alpha / math.sqrt(rank)
-
-    base_weight = avg_weight.to(device)
-    final_weight = layer_weight.to(device)
-
-    delta_for_svd = final_weight - base_weight
-
-    # Low-rank factorization of the delta direction
-    lora_A, lora_B = low_rank_decomposition(  # pylint: disable=invalid-name
-        delta_for_svd, rank
-    )
-
-    if use_dora:
-        lora_weight = lora_B @ lora_A
-        weight_norm = get_weight_norm(
-            base_weight.to(lora_A.device), lora_weight, scaling
-        )
-        return lora_A.cpu(), lora_B.cpu(), weight_norm.cpu()
-
-    # let's rescale the lora weight to have the same magnitude as the base weight
-
-    return lora_A.cpu(), lora_B.cpu(), None
-
-
-def iter_dora_parameter_weights(
-    model_path,
-    avg_recursive_weights,
-    modules_to_recurse: list[str],
-    alpha,
-    rank,
-    device="mps",
-    recurse_layers=12,
-    use_dora=True,
-):
-    # iterate over all parameter weights in the model shards
-    for key, weight, layer_idx in iter_parameter_weights(model_path, device=device):
-        # get the matching module name in modules_to_recurse for the current parameter key
-        matched_module_name = next(
-            (module for module in modules_to_recurse if module in key), None
-        )
-        if matched_module_name is None:
-            if "input_layernorm" in key:
-                # map to input_layernorm_list in the recursive layers and account for the layer_idx and loop_idx
-                loop_idx = layer_idx // recurse_layers
-                layer_idx = layer_idx % recurse_layers
-                layernorm_key = (
-                    f"model.layers.{layer_idx}.input_layernorm_list.{loop_idx}.weight"
-                )
-                yield layernorm_key, weight
-            elif "post_attention_layernorm" in key:
-                # map to input_layernorm_list in the recursive layers and account for the layer_idx and loop_idx
-                loop_idx = layer_idx // recurse_layers
-                layer_idx = layer_idx % recurse_layers
-                layernorm_key = f"model.layers.{layer_idx}.post_attention_layernorm_list.{loop_idx}.weight"
-                yield layernorm_key, weight
-            else:
-                yield key, weight
-            continue
-
-        # figure out the base weight layer for this key
-        loop_idx = layer_idx // recurse_layers
-        layer_idx = layer_idx % recurse_layers
-        suffix = f"{layer_idx}.{matched_module_name}"
-        prefix = f"model.layers.{suffix}.weight_base"
-        avg_weight = avg_recursive_weights[prefix]
-        lora_a_key = f"model.layers.{suffix}.lora_A_list.{loop_idx}"
-        lora_b_key = f"model.layers.{suffix}.lora_B_list.{loop_idx}"
-        lora_magnitude_key = (
-            f"model.layers.{suffix}.lora_magnitude_vector_list.{loop_idx}"
-        )
-        lora_a, lora_b, lora_magnitude = decompose_delta_weight(
-            weight,
-            avg_weight,
-            alpha,
-            rank,
-            use_dora=use_dora,
-        )
-        yield lora_a_key, lora_a
-        yield lora_b_key, lora_b
-        if use_dora:
-            yield lora_magnitude_key, lora_magnitude
-
-
-def save_state_dict_to_safetensors(state_dict, save_directory):
-    os.makedirs(save_directory, exist_ok=True)
-    weights_name = SAFE_WEIGHTS_NAME
-
-    filename_pattern = weights_name.replace(".bin", "{suffix}.bin").replace(
-        ".safetensors", "{suffix}.safetensors"
-    )
-    state_dict_split = split_torch_state_dict_into_shards(
-        state_dict, filename_pattern=filename_pattern, max_shard_size="1GB"
-    )
-    # pylint: disable=duplicate-code
-    # Save index if sharded
-    index = None
-    if state_dict_split.is_sharded:
-        index = {
-            "metadata": state_dict_split.metadata,
-            "weight_map": state_dict_split.tensor_to_filename,
-        }
-
-    # Clean the folder from a previous save
-    for filename in os.listdir(save_directory):
-        full_filename = os.path.join(save_directory, filename)
-        # If we have a shard file that is not going to be replaced, we delete it, but only from the main process
-        # in distributed settings to avoid race conditions.
-        weights_no_suffix = weights_name.replace(".bin", "").replace(".safetensors", "")
-
-        # make sure that file to be deleted matches format of sharded file, e.g. pytorch_model-00001-of-00005
-        filename_no_suffix = filename.replace(".bin", "").replace(".safetensors", "")
-        reg = re.compile(r"(.*?)-\d{5}-of-\d{5}")
-
-        if (
-            filename.startswith(weights_no_suffix)
-            and os.path.isfile(full_filename)
-            and filename not in state_dict_split.filename_to_tensors.keys()
-            and reg.fullmatch(filename_no_suffix) is not None
-        ):
-            os.remove(full_filename)
-
-    filename_to_tensors = state_dict_split.filename_to_tensors.items()
-    for shard_file, tensors in filename_to_tensors:
-        shard = {}
-        for tensor in tensors:
-            shard[tensor] = state_dict[tensor].contiguous()
-            del state_dict[tensor]
-
-        save_file(
-            shard, os.path.join(save_directory, shard_file), metadata={"format": "pt"}
-        )
-
-    del state_dict
-
-    if index is None:
-        path_to_weights = os.path.join(save_directory, weights_name)
-        logger.info(f"Model weights saved in {path_to_weights}")
-    else:
-        save_index_file = SAFE_WEIGHTS_INDEX_NAME
-        save_index_file = os.path.join(save_directory, save_index_file)
-        # Save the index as well
-        with open(save_index_file, "w", encoding="utf-8") as f:
-            content = json.dumps(index, indent=2, sort_keys=True) + "\n"
-            f.write(content)
-
-
-def convert_llama_to_rrt(
-    model_name,
-    output_dir,
-    recurse_layers: int = 12,
-    rank=32,
-    alpha=32,
-    device=None,
-    use_dora=True,
-):
-    if not device:
-        if torch.backends.mps.is_available():
-            device = "mps"
-        elif torch.cuda.is_available():
-            device = "cuda"
-        else:
-            device = "cpu"
-
-    modules_to_recurse = [
-        "self_attn.q_proj",
-        "self_attn.k_proj",
-        "self_attn.v_proj",
-        "self_attn.o_proj",
-        "mlp.down_proj",
-        "mlp.gate_proj",
-        "mlp.up_proj",
-    ]
-
-    config = AutoConfig.from_pretrained(model_name)
-    tokenizer = AutoTokenizer.from_pretrained(model_name)
-    num_hidden_layers = config.num_hidden_layers
-    if num_hidden_layers % recurse_layers != 0:
-        raise ValueError(
-            f"The number of hidden layers ({num_hidden_layers}) in the model must be "
-            f"divisible by the recurse layers ({recurse_layers})"
-        )
-
-    config = RelaxedRecursiveLlamaConfig.from_dict(
-        {
-            **config.to_dict(),
-            "recurse_layers": recurse_layers,
-            "rank": rank,
-            "alpha": alpha,
-            "use_dora": use_dora,
-        }
-    )
-    config.save_pretrained(output_dir)
-    tokenizer.save_pretrained(output_dir)
-    model_path = Path(snapshot_download(model_name, ignore_patterns="*.pth"))
-
-    # create a new state_dict to store the RRT model weights
-    rrt_model_state_dict = {}
-
-    logger.info("Calculating average recursive weights...")
-    for key, weight in iter_recursive_parameter_weights(
-        model_path, modules_to_recurse, device=device, recurse_layers=recurse_layers
-    ):
-        rrt_model_state_dict[key] = weight.to(torch.bfloat16).detach().cpu()
-
-    logger.info("Calculating decomposed lora diff...")
-    # now that we have the average weights, we need to loop over the shards again to calculate the decomposed lora diff
-    rrt_lora_state_dict = {}
-    for key, weight in iter_dora_parameter_weights(
-        model_path,
-        rrt_model_state_dict,
-        modules_to_recurse,
-        alpha=32,
-        rank=rank,
-        device=device,
-        recurse_layers=recurse_layers,
-        use_dora=use_dora,
-    ):
-        rrt_lora_state_dict[key] = weight.to(torch.bfloat16).detach().cpu()
-
-    # combine state dicts into a single state_dict
-    rrt_model_state_dict.update(rrt_lora_state_dict)
-
-    # save state dict as sharded safetensors to disk using split_torch_state_dict_into_shards
-    save_state_dict_to_safetensors(rrt_model_state_dict, output_dir)
-
-
-if __name__ == "__main__":
-    # meta-llama/Llama-3.2-1B has 16 hidden layers
-    # meta-llama/Llama-3.2-3B has 28 hidden layers
-    convert_llama_to_rrt(
-        "meta-llama/Llama-3.2-3B",
-        "/tmp/rrt_model",  # nosec
-        recurse_layers=4,
-        rank=256,
-        alpha=512,
-        use_dora=False,
-    )
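The removed converter maps each original layer index to a shared layer with `layer_idx % recurse_layers` and to a loop iteration with `layer_idx // recurse_layers`. A minimal sketch of that index arithmetic follows; the helper names `shared_slot` and `group_by_shared_layer` are hypothetical and chosen only for illustration.

```python
def shared_slot(layer_idx, recurse_layers):
    """Return (loop_idx, shared_layer_idx) for an original layer index."""
    loop_idx, shared_layer_idx = divmod(layer_idx, recurse_layers)
    return loop_idx, shared_layer_idx


def group_by_shared_layer(num_hidden_layers, recurse_layers):
    """List the original layers whose weights are averaged into each shared layer."""
    if num_hidden_layers % recurse_layers != 0:
        raise ValueError("num_hidden_layers must be divisible by recurse_layers")
    groups = {}
    for layer_idx in range(num_hidden_layers):
        _, shared = shared_slot(layer_idx, recurse_layers)
        groups.setdefault(shared, []).append(layer_idx)
    return groups


# 28 hidden layers folded onto 4 shared layers, as in the __main__ call above.
groups = group_by_shared_layer(28, 4)
```

With these values each shared layer collects seven original layers, so each base weight is the mean of seven matrices and each shared layer carries seven per-loop low-rank offsets.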
--- a/src/axolotl/integrations/rrt/modeling/init.py
+++ b/src/axolotl/integrations/rrt/modeling/init.py
@@ -1,25 +0,0 @@
-"""
-module for modeling relaxed recursive transformers model
-"""
-from transformers import AutoConfig, AutoModel, AutoModelForCausalLM
-
-from .configuration_rrt_llama import RelaxedRecursiveLlamaConfig
-from .modeling_rrt_llama import (
-    RelaxedRecursiveLlamaForCausalLM,
-    RelaxedRecursiveLlamaModel,
-)
-
-
-def register_rrt_model():
-    """
-    Register Relaxed Recursive Transformers model with transformers
-    """
-
-    # Register configs
-    AutoConfig.register("llama-rrt", RelaxedRecursiveLlamaConfig)
-
-    # Register models
-    AutoModel.register(RelaxedRecursiveLlamaConfig, RelaxedRecursiveLlamaModel)
-    AutoModelForCausalLM.register(
-        RelaxedRecursiveLlamaConfig, RelaxedRecursiveLlamaForCausalLM
-    )
--- a/src/axolotl/integrations/rrt/modeling/configuration_rrt_llama.py
+++ b/src/axolotl/integrations/rrt/modeling/configuration_rrt_llama.py
@@ -1,16 +0,0 @@
-"""
-module for custom configuration for relaxed recursive transformers model
-"""
-from transformers import LlamaConfig
-
-
-class RelaxedRecursiveLlamaConfig(LlamaConfig):
-    """
-    Configuration for Relaxed Recursive Llama.
-    """
-
-    model_type: str = "llama-rrt"
-    recurse_layers: int = 4
-    rank: int
-    alpha: int
-    use_dora: bool = True
--- a/src/axolotl/integrations/rrt/modeling/linear.py
+++ b/src/axolotl/integrations/rrt/modeling/linear.py
@@ -1,116 +0,0 @@
-"""
-module for the shared linear layer for the relaxed recursive transformers model
-"""
-import math
-
-import torch
-import torch.nn.functional as F
-from peft.utils import transpose
-from torch import nn
-
-
-class RelaxedRecursiveDoraLinear(nn.Module):
-    """
-    A single linear layer that is "shared" across multiple loop iterations,
-    but each iteration has its own DoRA offsets (A_i, B_i, magnitude_i).
-
-    The constructor expects you to specify:
-      - in_features, out_features
-      - B: number of loop iterations (i.e., how many times we "unroll")
-      - fan_in_fan_out: pass True if your underlying base weight is transposed, etc.
-
-    The forward(...) expects an additional argument "loop_idx" in [0..B-1],
-    which picks out the iteration-specific DoRA offsets.
-    """
-
-    def __init__(
-        self,
-        in_features: int,
-        out_features: int,
-        B: int,  # pylint: disable=invalid-name
-        rank: int,
-        alpha: int,
-        fan_in_fan_out: bool = False,
-        bias: bool = True,
-        use_dora: bool = True,
-    ):
-        super().__init__()
-        self.B = B  # pylint: disable=invalid-name
-        self.fan_in_fan_out = fan_in_fan_out
-
-        self.weight_base = nn.Parameter(torch.empty(out_features, in_features))
-
-        self.use_bias = bias
-        if self.use_bias:
-            self.bias = nn.Parameter(torch.zeros(out_features))
-        else:
-            self.register_parameter("bias", None)
-
-        self.lora_A_list = nn.ParameterList(  # pylint: disable=invalid-name
-            [nn.Parameter(torch.zeros(rank, in_features)) for _ in range(B)]
-        )
-        self.lora_B_list = nn.ParameterList(  # pylint: disable=invalid-name
-            [nn.Parameter(torch.zeros(out_features, rank)) for _ in range(B)]
-        )
-        # rslora
-        self.scaling = alpha / math.sqrt(rank)
-        self.use_dora = use_dora
-        if use_dora:
-            self.lora_magnitude_vector_list = nn.ParameterList(
-                [nn.Parameter(torch.ones(out_features)) for _ in range(B)]
-            )
-
-    def get_weight_norm(self, weight, lora_weight, scaling) -> torch.Tensor:
-        # calculate L2 norm of weight matrix, column-wise
-        weight = transpose(weight, self.fan_in_fan_out)
-        weight = weight + scaling * lora_weight
-        weight_norm = torch.linalg.norm(weight, dim=1).to(weight.dtype)
-        return weight_norm
-
-    def forward(self, x, loop_idx: int):
-        """
-
-        :param x: hidden state of shape (batch_size, seq_len, in_features)
-        :param loop_idx:
-        :return:
-        """
-        eps = 1e-6
-        w_base = self.weight_base
-        w_base = w_base.to(x.dtype)
-
-        lora_A: torch.Tensor = self.lora_A_list[  # pylint: disable=invalid-name
-            loop_idx
-        ]
-        lora_B: torch.Tensor = self.lora_B_list[  # pylint: disable=invalid-name
-            loop_idx
-        ]
-
-        base_out: torch.Tensor = F.linear(x, w_base, self.bias)
-        lora_out: torch.Tensor = F.linear(F.linear(x, lora_A), lora_B) * self.scaling
-
-        if self.use_dora:
-            x_eye: torch.Tensor = torch.eye(
-                lora_A.shape[1], device=lora_A.device, dtype=x.dtype
-            )
-            tmp = F.linear(x_eye, lora_A)  # [hidden_size, rank]
-            w_dora_full: torch.Tensor = F.linear(tmp, lora_B)
-            w_dora_full = w_dora_full.t()
-
-            magnitude_vector: torch.Tensor = self.lora_magnitude_vector_list[loop_idx]
-            w_dora_norm: torch.Tensor = self.get_weight_norm(
-                w_base, w_dora_full.detach(), self.scaling
-            )
-            w_dora_norm = w_dora_norm.detach()
-            scale_factor = (magnitude_vector / w_dora_norm).unsqueeze(
-                0
-            )  # shape [1, out_features]
-
-            result_dora = (scale_factor - 1) * base_out + scale_factor * lora_out
-            return result_dora
-
-        # scale the lora norm to prevent gradient explosion
-        orig_norm = torch.linalg.norm(w_base)
-        update_norm = torch.linalg.norm(lora_out)
-        scale = orig_norm / (update_norm + eps)
-
-        return base_out + lora_out * scale
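Both the removed linear layer and the removed converter scale the low-rank update by `alpha / math.sqrt(rank)`, which the code comments label "rslora". For comparison, the sketch below places it next to the conventional LoRA factor `alpha / rank`. The function names are hypothetical, and the sample values are the ones passed in the converter's `__main__` block.

```python
import math


def rslora_scaling(alpha, rank):
    """Rank-stabilized scaling, as used in the removed RRT code."""
    return alpha / math.sqrt(rank)


def lora_scaling(alpha, rank):
    """Conventional LoRA scaling, shown only for comparison."""
    return alpha / rank


# rank=256, alpha=512 as in the converter's __main__ block.
rs = rslora_scaling(512, 256)
plain = lora_scaling(512, 256)
```

At large ranks the two factors diverge substantially, so weights decomposed with one convention should be loaded with the same one.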
--- a/src/axolotl/integrations/rrt/modeling/modeling_rrt_llama.py
+++ b/src/axolotl/integrations/rrt/modeling/modeling_rrt_llama.py
@@ -1,471 +0,0 @@
-import logging
-from typing import Callable, Optional, Tuple, Union, Unpack
-
-import torch
-from torch import nn
-from transformers import Cache, DynamicCache, LlamaConfig
-from transformers.activations import ACT2FN
-from transformers.modeling_flash_attention_utils import FlashAttentionKwargs
-from transformers.modeling_outputs import BaseModelOutputWithPast
-from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS
-from transformers.models.llama.modeling_llama import (
-    LlamaForCausalLM,
-    LlamaModel,
-    LlamaRMSNorm,
-    LlamaRotaryEmbedding,
-    apply_rotary_pos_emb,
-    eager_attention_forward,
-)
-
-from axolotl.integrations.rrt.modeling.linear import RelaxedRecursiveDoraLinear
-
-from .configuration_rrt_llama import RelaxedRecursiveLlamaConfig
-
-logger = logging.getLogger(__name__)
-
-
-# pylint: skip-file
-# mypy: ignore-errors
-
-
-class RelaxedRecursiveLlamaMLP(nn.Module):
-    def __init__(self, config: RelaxedRecursiveLlamaConfig):
-        super().__init__()
-        recurse_loops = config.num_hidden_layers // config.recurse_layers
-        self.config = config
-        self.hidden_size = config.hidden_size
-        self.intermediate_size = config.intermediate_size
-        self.gate_proj = RelaxedRecursiveDoraLinear(
-            self.hidden_size,
-            self.intermediate_size,
-            recurse_loops,
-            config.rank,
-            config.alpha,
-            bias=config.mlp_bias,
-            use_dora=config.use_dora,
-        )
-        self.up_proj = RelaxedRecursiveDoraLinear(
-            self.hidden_size,
-            self.intermediate_size,
-            recurse_loops,
-            config.rank,
-            config.alpha,
-            bias=config.mlp_bias,
-            use_dora=config.use_dora,
-        )
-        self.down_proj = RelaxedRecursiveDoraLinear(
-            self.intermediate_size,
-            self.hidden_size,
-            recurse_loops,
-            config.rank,
-            config.alpha,
-            bias=config.mlp_bias,
-            use_dora=config.use_dora,
-        )
-        self.act_fn = ACT2FN[config.hidden_act]
-
-    def forward(self, x, loop_idx: int):
-        down_proj = self.down_proj(
-            self.act_fn(self.gate_proj(x, loop_idx)) * self.up_proj(x, loop_idx),
-            loop_idx,
-        )
-        return down_proj
-
-
-class RelaxedRecursiveLlamaAttention(nn.Module):
-    """
-    A single attention layer of the Relaxed Recursive Llama.
-    """
-
-    def __init__(self, config: RelaxedRecursiveLlamaConfig, layer_idx: int):
-        super().__init__()
-        recurse_loops = config.num_hidden_layers // config.recurse_layers
-        self.config = config
-        self.layer_idx = layer_idx
-        self.head_dim = getattr(
-            config, "head_dim", config.hidden_size // config.num_attention_heads
-        )
-        self.num_key_value_groups = (
-            config.num_attention_heads // config.num_key_value_heads
-        )
-        self.scaling = self.head_dim**-0.5
-        self.attention_dropout = config.attention_dropout
-        self.is_causal = True
-
-        self.q_proj = RelaxedRecursiveDoraLinear(
-            config.hidden_size,
-            config.num_attention_heads * self.head_dim,
-            recurse_loops,
-            config.rank,
-            config.alpha,
-            bias=config.attention_bias,
-            use_dora=config.use_dora,
-        )
-        self.k_proj = RelaxedRecursiveDoraLinear(
-            config.hidden_size,
-            config.num_key_value_heads * self.head_dim,
-            recurse_loops,
-            config.rank,
-            config.alpha,
-            bias=config.attention_bias,
-            use_dora=config.use_dora,
-        )
-        self.v_proj = RelaxedRecursiveDoraLinear(
-            config.hidden_size,
-            config.num_key_value_heads * self.head_dim,
-            recurse_loops,
-            config.rank,
-            config.alpha,
-            bias=config.attention_bias,
-            use_dora=config.use_dora,
-        )
-        self.o_proj = RelaxedRecursiveDoraLinear(
-            config.num_attention_heads * self.head_dim,
-            config.hidden_size,
-            recurse_loops,
-            config.rank,
-            config.alpha,
-            bias=config.attention_bias,
-            use_dora=config.use_dora,
-        )
-
-    def forward(
-        self,
-        hidden_states: torch.Tensor,
-        position_embeddings: Tuple[torch.Tensor, torch.Tensor],
-        attention_mask: Optional[torch.Tensor],
-        loop_idx: int,
-        past_key_value: Optional[Cache] = None,
-        cache_position: Optional[torch.LongTensor] = None,
-        **kwargs: Unpack[FlashAttentionKwargs],  # pylint: disable=misc
-    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
-        input_shape = hidden_states.shape[:-1]
-        hidden_shape = (*input_shape, -1, self.head_dim)
-
-        query_states = (
-            self.q_proj(hidden_states, loop_idx).view(hidden_shape).transpose(1, 2)
-        )
-        key_states = (
-            self.k_proj(hidden_states, loop_idx).view(hidden_shape).transpose(1, 2)
-        )
-        value_states = (
-            self.v_proj(hidden_states, loop_idx).view(hidden_shape).transpose(1, 2)
-        )
-
-        cos, sin = position_embeddings
-        query_states, key_states = apply_rotary_pos_emb(
-            query_states, key_states, cos, sin
-        )
-
-        if past_key_value is not None:
-            # sin and cos are specific to RoPE models; cache_position needed for the static cache
-            cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
-            key_states, value_states = past_key_value.update(
-                key_states, value_states, self.layer_idx, cache_kwargs
-            )
-
-        attention_interface: Callable = eager_attention_forward
-        if self.config._attn_implementation != "eager":
-            if self.config._attn_implementation == "sdpa" and kwargs.get(
-                "output_attentions", False
-            ):
-                logger.warning(
-                    "`torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to "
-                    'eager attention. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
-                )
-            else:
-                attention_interface = ALL_ATTENTION_FUNCTIONS[
-                    self.config._attn_implementation
-                ]
-
-        attn_output, attn_weights = attention_interface(
-            self,
-            query_states,
-            key_states,
-            value_states,
-            attention_mask,
-            dropout=0.0 if not self.training else self.attention_dropout,
-            scaling=self.scaling,
-            **kwargs,
-        )
-
-        attn_output = attn_output.reshape(*input_shape, -1).contiguous()
-        attn_output = self.o_proj(attn_output, loop_idx)
-        return attn_output, attn_weights  # pylint: disable=return-value
-
-
-class RelaxedRecursiveLlamaDecoderLayer(nn.Module):
-    """
-    A single layer of the Relaxed Recursive Llama decoder.
-    """
-
-    def __init__(self, config: LlamaConfig, layer_idx: int):
-        super().__init__()
-        recurse_loops = config.num_hidden_layers // config.recurse_layers
-        self.hidden_size = config.hidden_size
-
-        self.self_attn = RelaxedRecursiveLlamaAttention(
-            config=config, layer_idx=layer_idx
-        )
-
-        self.mlp = RelaxedRecursiveLlamaMLP(config)
-
-        self.input_layernorm_list = nn.ModuleList(
-            [
-                LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
-                for _ in range(recurse_loops)
-            ]
-        )
-        self.post_attention_layernorm_list = nn.ModuleList(
-            [
-                LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
-                for _ in range(recurse_loops)
-            ]
-        )
-
-    def forward(
-        self,
-        hidden_states: torch.Tensor,
-        loop_idx: int,
-        attention_mask: Optional[torch.Tensor] = None,
-        position_ids: Optional[torch.LongTensor] = None,
-        past_key_value: Optional[Cache] = None,
-        output_attentions: Optional[bool] = False,
-        use_cache: Optional[bool] = False,
-        cache_position: Optional[torch.LongTensor] = None,
-        position_embeddings: Optional[
-            Tuple[torch.Tensor, torch.Tensor]
-        ] = None,  # necessary, but kept here for BC
-        **kwargs: Unpack[FlashAttentionKwargs],  # pylint: disable=misc
-    ) -> Tuple[
-        torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]
-    ]:
-        residual = hidden_states
-
-        hidden_states = self.input_layernorm_list[loop_idx](hidden_states)
-
-        # Self Attention
-        hidden_states, self_attn_weights = self.self_attn(
-            hidden_states=hidden_states,
-            attention_mask=attention_mask,
-            loop_idx=loop_idx,
-            position_ids=position_ids,
-            past_key_value=past_key_value,
-            output_attentions=output_attentions,
-            use_cache=use_cache,
-            cache_position=cache_position,
-            position_embeddings=position_embeddings,
-            **kwargs,
-        )
-        hidden_states = residual + hidden_states
-
-        # Fully Connected
-        residual = hidden_states
-        hidden_states = self.post_attention_layernorm_list[loop_idx](hidden_states)
-        hidden_states = self.mlp(hidden_states, loop_idx)
-        hidden_states = residual + hidden_states
-
-        outputs = (hidden_states,)
-        if output_attentions:
-            outputs += (self_attn_weights,)
-
-        return outputs
-
-
-class RelaxedRecursiveLlamaModel(LlamaModel):
-    config_class = RelaxedRecursiveLlamaConfig
-
-    def __init__(self, config):
-        super(LlamaModel, self).__init__(config)
-        self.recurse_loops = config.num_hidden_layers // config.recurse_layers
-        self.padding_idx = config.pad_token_id
-        self.vocab_size = config.vocab_size
-
-        self.embed_tokens = nn.Embedding(
-            config.vocab_size, config.hidden_size, self.padding_idx
-        )
-        self.layers = nn.ModuleList(
-            [
-                RelaxedRecursiveLlamaDecoderLayer(config, layer_idx)
-                for layer_idx in range(config.recurse_layers)
-            ]
-        )
-        self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
-        self.rotary_emb = LlamaRotaryEmbedding(config=config)
-        self.gradient_checkpointing = False
-
-        # Initialize weights and apply final processing
-        self.post_init()
-
-    def forward(
-        self,
-        input_ids: torch.LongTensor = None,
-        attention_mask: Optional[torch.Tensor] = None,
-        position_ids: Optional[torch.LongTensor] = None,
-        past_key_values: Optional[Cache] = None,
-        inputs_embeds: Optional[torch.FloatTensor] = None,
-        use_cache: Optional[bool] = None,
-        output_attentions: Optional[bool] = None,
-        output_hidden_states: Optional[bool] = None,
-        return_dict: Optional[bool] = None,
-        cache_position: Optional[torch.LongTensor] = None,
-        **flash_attn_kwargs: Unpack[FlashAttentionKwargs],
-    ) -> Union[Tuple, BaseModelOutputWithPast]:
-        output_attentions = (
-            output_attentions
-            if output_attentions is not None
-            else self.config.output_attentions
-        )
-        output_hidden_states = (
-            output_hidden_states
-            if output_hidden_states is not None
-            else self.config.output_hidden_states
-        )
-        use_cache = use_cache if use_cache is not None else self.config.use_cache
-        return_dict = (
-            return_dict if return_dict is not None else self.config.use_return_dict
-        )
-
-        if (input_ids is None) ^ (inputs_embeds is not None):
-            raise ValueError(
-                "You must specify exactly one of input_ids or inputs_embeds"
-            )
-
-        if self.gradient_checkpointing and self.training and use_cache:
-            logger.warning_once(
-                "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
-            )
-            use_cache = False
-
-        if inputs_embeds is None:
-            inputs_embeds = self.embed_tokens(input_ids)
-
-        if use_cache and past_key_values is None:
-            past_key_values = DynamicCache()
-
-        if cache_position is None:
-            past_seen_tokens = (
-                past_key_values.get_seq_length() if past_key_values is not None else 0
-            )
-            cache_position = torch.arange(
-                past_seen_tokens,
-                past_seen_tokens + inputs_embeds.shape[1],
-                device=inputs_embeds.device,
-            )
-
-        if position_ids is None:
-            position_ids = cache_position.unsqueeze(0)
-
-        causal_mask = self._update_causal_mask(
-            attention_mask,
-            inputs_embeds,
-            cache_position,
-            past_key_values,
-            output_attentions,
-        )
-
-        hidden_states = inputs_embeds
-
-        # create position embeddings to be shared across the decoder layers
-        position_embeddings = self.rotary_emb(hidden_states, position_ids)
-
-        # decoder layers
-        all_hidden_states = () if output_hidden_states else None
-        all_self_attns = () if output_attentions else None
-
-        for loop_idx in range(self.recurse_loops):
-            for decoder_layer in self.layers[: self.config.recurse_layers]:
-                if output_hidden_states:
-                    all_hidden_states += (hidden_states,)
-
-                if self.gradient_checkpointing and self.training:
-                    layer_outputs = self._gradient_checkpointing_func(
-                        decoder_layer.__call__,
-                        hidden_states,
-                        loop_idx,
-                        causal_mask,
-                        position_ids,
-                        past_key_values,
-                        output_attentions,
-                        use_cache,
-                        cache_position,
-                        position_embeddings,
-                    )
-                else:
-                    layer_outputs = decoder_layer(
-                        hidden_states,
-                        loop_idx,
-                        attention_mask=causal_mask,
-                        position_ids=position_ids,
-                        past_key_value=past_key_values,
-                        output_attentions=output_attentions,
-                        use_cache=use_cache,
-                        cache_position=cache_position,
-                        position_embeddings=position_embeddings,
-                        **flash_attn_kwargs,
-                    )
-
-                hidden_states = layer_outputs[0]
-
-                if output_attentions:
-                    all_self_attns += (layer_outputs[1],)
-
-        hidden_states = self.norm(hidden_states)
-
-        # add hidden states from the last decoder layer
-        if output_hidden_states:
-            all_hidden_states += (hidden_states,)
-
-        output = BaseModelOutputWithPast(
-            last_hidden_state=hidden_states,
-            past_key_values=past_key_values if use_cache else None,
-            hidden_states=all_hidden_states,
-            attentions=all_self_attns,
-        )
-        return output if return_dict else output.to_tuple()
-
-
-class RelaxedRecursiveLlamaForCausalLM(LlamaForCausalLM):
-    config_class = RelaxedRecursiveLlamaConfig
-
-    def __init__(self, config):
-        super(LlamaForCausalLM, self).__init__(config)
-        self.model = RelaxedRecursiveLlamaModel(config)
-        self.vocab_size = config.vocab_size
-        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
-
-        # Initialize weights and apply final processing
-        self.post_init()
-
-    def get_nb_trainable_parameters(self) -> tuple[int, int, int]:
-        r"""
-        Returns the number of trainable parameters and the number of all parameters in the model.
-        """
-        trainable_params = 0
-        all_param = 0
-        lora_params = 0
-        for name, param in self.named_parameters():
-            num_params = param.numel()
-            # if using DS Zero 3 and the weights are initialized empty
-            if num_params == 0 and hasattr(param, "ds_numel"):
-                num_params = param.ds_numel
-
-            # Due to the design of 4bit linear layers from bitsandbytes
-            # one needs to multiply the number of parameters by 2 to get
-            # the correct number of parameters
-            if param.__class__.__name__ == "Params4bit":
-                if hasattr(param, "element_size"):
-                    num_bytes = param.element_size()
-                elif not hasattr(param, "quant_storage"):
-                    num_bytes = 1
-                else:
-                    num_bytes = param.quant_storage.itemsize
-                num_params = num_params * 2 * num_bytes
-
-            all_param += num_params
-            if param.requires_grad:
-                trainable_params += num_params
-            if "lora_" in name:
-                lora_params += num_params
-
-        return trainable_params, all_param, lora_params
--- a/src/axolotl/monkeypatch/trainer_grad_accum.py
+++ b/src/axolotl/monkeypatch/trainer_grad_accum.py
@@ -0,0 +1,308 @@
+"""
+fix for FSDP gradient accumulation
+see https://github.com/huggingface/transformers/pull/35128
+"""
+import inspect
+import logging
+
+from transformers import LlamaForCausalLM, Trainer
+from transformers.modeling_flash_attention_utils import _flash_attention_forward
+
+from axolotl.monkeypatch.utils import detab_code
+
+LOG = logging.getLogger("axolotl.monkeypatch.trainer_grad_accum")
+
+ORIGINAL_CONTEXT_CODE = """
+    with self.compute_loss_context_manager():
+        loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
+"""
+
+PATCHED_CONTEXT_CODE = """
+    with self.compute_loss_context_manager():
+        if self.model_accepts_loss_kwargs:
+            loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
+        else:
+            loss = self.compute_loss(model, inputs)
+"""
+
+ORIGINAL_LLAMA_FCLM_CODE = """
+    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+    output_hidden_states = (
+        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+    )
+    return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+    # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
+    outputs = self.model(
+        input_ids=input_ids,
+        attention_mask=attention_mask,
+        position_ids=position_ids,
+        past_key_values=past_key_values,
+        inputs_embeds=inputs_embeds,
+        use_cache=use_cache,
+        output_attentions=output_attentions,
+        output_hidden_states=output_hidden_states,
+        return_dict=return_dict,
+        cache_position=cache_position,
+        **kwargs,
+    )
+
+    hidden_states = outputs[0]
+    # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
+    logits = self.lm_head(hidden_states[:, -num_logits_to_keep:, :])
+
+    loss = None
+    if labels is not None:
+        loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)
+"""
+
+PATCHED_LLAMA_FCLM_CODE = """
+    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+    output_hidden_states = (
+        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+    )
+    return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+    # remove num_items_in_batch otherwise self.model attempts to pass it to flash_attention
+    num_items_in_batch = kwargs.pop("num_items_in_batch", None)
+
+    # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
+    outputs = self.model(
+        input_ids=input_ids,
+        attention_mask=attention_mask,
+        position_ids=position_ids,
+        past_key_values=past_key_values,
+        inputs_embeds=inputs_embeds,
+        use_cache=use_cache,
+        output_attentions=output_attentions,
+        output_hidden_states=output_hidden_states,
+        return_dict=return_dict,
+        cache_position=cache_position,
+        **kwargs,
+    )
+    hidden_states = outputs[0]
+    # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
+    logits = self.lm_head(hidden_states[:, -num_logits_to_keep:, :])
+
+    loss = None
+    if labels is not None:
+        loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, num_items_in_batch=num_items_in_batch, **kwargs)
+"""
+
+
+def get_training_step_code() -> str:
+    training_step = inspect.getsource(
+        Trainer.training_step  # pylint: disable=protected-access
+    )
+    return training_step
+
+
+def check_training_step_is_patchable() -> bool:
+    training_step = get_training_step_code()
+    training_step, _ = detab_code(training_step)
+    return ORIGINAL_CONTEXT_CODE in training_step
+
+
+def patch_training_step_for_ga():
+    """
+    monkeypatch Trainer.training_step so num_items_in_batch is only passed to compute_loss when the model accepts loss kwargs
+    """
+
+    try:
+        training_step = get_training_step_code()
+    except OSError:
+        return
+    Trainer._original_training_step = training_step  # pylint: disable=protected-access
+    training_step, _ = detab_code(training_step)
+    if ORIGINAL_CONTEXT_CODE not in training_step:
+        return
+    # assert (
+    #     ORIGINAL_CONTEXT_CODE in training_step
+    # ), "Original training_step code not found"
+
+    training_step = training_step.replace(ORIGINAL_CONTEXT_CODE, PATCHED_CONTEXT_CODE)
+    training_step = training_step.replace(
+        "def training_step(",
+        "def _fixed_training_step(",
+        1,
+    )
+
+    # load necessary imports
+    import transformers.trainer
+
+    items_to_import = []
+    for item in dir(transformers.trainer):
+        if item in training_step:
+            items_to_import.append(item)
+
+    exec(  # pylint: disable=exec-used  # nosec B102
+        "from transformers.trainer import ("
+        + ", ".join(x for x in items_to_import)
+        + ")",
+        globals(),
+    )
+    exec(training_step, globals())  # pylint: disable=exec-used  # nosec B102
+    LOG.info("patching training_step")
+    Trainer.training_step = (  # pylint: disable=protected-access
+        _fixed_training_step  # pylint: disable=undefined-variable  # noqa: F821
+    )
+
+
+def get_model_forward_code() -> str:
+    forward = inspect.getsource(
+        LlamaForCausalLM.forward  # pylint: disable=protected-access
+    )
+    return forward
+
+
+def check_forward_is_patchable() -> bool:
+    forward = get_model_forward_code()
+    forward, _ = detab_code(forward)
+    return ORIGINAL_LLAMA_FCLM_CODE in forward
+
+
+def patch_forward_for_ga():
+    """
+    monkeypatch LlamaForCausalLM.forward to pop num_items_in_batch from kwargs and pass it only to the loss function
+    """
+
+    try:
+        forward = get_model_forward_code()
+    except OSError:
+        return
+    LlamaForCausalLM._original_forward = forward  # pylint: disable=protected-access
+    forward, _ = detab_code(forward)
+    if ORIGINAL_LLAMA_FCLM_CODE not in forward:
+        return
+    # assert ORIGINAL_LLAMA_FCLM_CODE in forward, "Original forward code not found"
+
+    forward = forward.replace(ORIGINAL_LLAMA_FCLM_CODE, PATCHED_LLAMA_FCLM_CODE)
+    forward = forward.replace(
+        "def forward(",
+        "def _fixed_forward(",
+        1,
+    )
+
+    # load necessary imports
+    import transformers.models.llama.modeling_llama
+
+    items_to_import = []
+    for item in dir(transformers.models.llama.modeling_llama):
+        if item in forward:
+            items_to_import.append(item)
+
+    exec(  # pylint: disable=exec-used  # nosec B102
+        "from transformers.models.llama.modeling_llama import ("
+        + ", ".join(x for x in items_to_import)
+        + ")",
+        globals(),
+    )
+    exec(forward, globals())  # pylint: disable=exec-used  # nosec B102
+    LOG.info("patching forward")
+    LlamaForCausalLM.forward = (  # pylint: disable=protected-access
+        _fixed_forward  # pylint: disable=undefined-variable  # noqa: F821
+    )
+
+
+ORIGINAL_TRAINER_CODE = """
+                context = (
+                    functools.partial(self.accelerator.no_sync, model=model)
+                    if i != len(batch_samples) - 1
+                    else contextlib.nullcontext
+                )
+                with context():
+                    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
+"""
+
+PATCHED_TRAINER_CODE = """
+                disable_deepspeed_no_sync = (
+                        self.accelerator.distributed_type == DistributedType.DEEPSPEED
+                        # and self.accelerator.deepspeed_engine_wrapped.engine.zero_optimization_partition_gradients()
+                )
+                context = (
+                    functools.partial(self.accelerator.no_sync, model=model)
+                    if i != len(batch_samples) - 1 and not disable_deepspeed_no_sync
+                    else contextlib.nullcontext
+                )
+                with context():
+                    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
+"""
+
+
+def get_training_loop_code() -> str:
+    training_loop = inspect.getsource(
+        Trainer._inner_training_loop  # pylint: disable=protected-access
+    )
+    return training_loop
+
+
+def check_training_loop_is_patchable() -> bool:
+    training_loop = get_training_loop_code()
+    training_loop, _ = detab_code(training_loop)
+    return ORIGINAL_TRAINER_CODE in training_loop
+
+
+def patch_training_loop_for_deepspeed_0_16_x():
+    """
+    monkeypatch for fixing the training loop for deepspeed GA
+
+    see https://github.com/huggingface/transformers/pull/35157
+    """
+
+    try:
+        training_loop = get_training_loop_code()
+    except OSError:
+        return
+    Trainer._original_inner_training_loop = (  # pylint: disable=protected-access
+        training_loop
+    )
+    training_loop, _ = detab_code(training_loop)
+    if ORIGINAL_TRAINER_CODE not in training_loop:
+        return
+
+    training_loop = training_loop.replace(ORIGINAL_TRAINER_CODE, PATCHED_TRAINER_CODE)
+    training_loop = training_loop.replace(
+        "def _inner_training_loop(",
+        "def _fixed_inner_training_loop(",
+        1,
+    )
+
+    # load necessary imports
+    import transformers.trainer
+
+    items_to_import = []
+    for item in dir(transformers.trainer):
+        if item in training_loop:
+            items_to_import.append(item)
+
+    exec(  # pylint: disable=exec-used  # nosec B102
+        "from transformers.trainer import ("
+        + ", ".join(x for x in items_to_import)
+        + ")",
+        globals(),
+    )
+    exec(training_loop, globals())  # pylint: disable=exec-used  # nosec B102
+    LOG.info("patching _inner_training_loop for deepspeed gradient accumulation")
+    Trainer._inner_training_loop = (  # pylint: disable=protected-access
+        _fixed_inner_training_loop  # pylint: disable=undefined-variable  # noqa: F821
+    )
+
+
+def patch_flash_attention_forward():
+    """
+    monkeypatch for fixing the forward pass for flash attention to ignore num_items_in_batch
+    """
+
+    import transformers.modeling_flash_attention_utils
+
+    def proxy_flash_attention_forward(*args, **kwargs):
+        kwargs.pop("num_items_in_batch", None)
+
+        return _flash_attention_forward(*args, **kwargs)
+
+    transformers.modeling_flash_attention_utils._flash_attention_forward = (  # pylint: disable=protected-access
+        proxy_flash_attention_forward
+    )
+    transformers.models.llama.modeling_llama._flash_attention_forward = (  # pylint: disable=protected-access
+        proxy_flash_attention_forward
+    )
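Note on the source-patching pattern used in trainer_grad_accum.py: each patch reads a method's source with inspect.getsource, checks that an expected snippet is present, swaps in the replacement, renames the function, executes the result, and rebinds it on the class. The following is a minimal sketch of that flow on a toy class, not the code from this commit. ToyTrainer, accepts_loss_kwargs, load_toy_module and patch_training_step are hypothetical names, textwrap.dedent stands in for detab_code (whose implementation is not shown in this diff), and the sketch executes into a private namespace rather than globals().

```python
import importlib.util
import inspect
import tempfile
import textwrap
from pathlib import Path

# Hypothetical stand-in for a library class. It is written to disk because
# inspect.getsource needs a real source file to read from.
TOY_MODULE = '''
class ToyTrainer:
    def compute_loss(self, inputs, num_items_in_batch=None):
        return sum(inputs) / (num_items_in_batch or len(inputs))

    def training_step(self, inputs, num_items_in_batch=None):
        loss = self.compute_loss(inputs, num_items_in_batch=num_items_in_batch)
        return loss
'''

# Snippet expected in the dedented source, and its replacement.
ORIGINAL = "    loss = self.compute_loss(inputs, num_items_in_batch=num_items_in_batch)\n"
PATCHED = (
    "    if self.accepts_loss_kwargs:\n"
    "        loss = self.compute_loss(inputs, num_items_in_batch=num_items_in_batch)\n"
    "    else:\n"
    "        loss = self.compute_loss(inputs)\n"
)


def load_toy_module():
    """Write the toy module to a temp directory and import it from there."""
    path = Path(tempfile.mkdtemp()) / "toy_trainer.py"
    path.write_text(TOY_MODULE)
    spec = importlib.util.spec_from_file_location("toy_trainer", str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def patch_training_step(cls) -> bool:
    """Return True if the method was patched, False if it was left untouched."""
    try:
        source = inspect.getsource(cls.training_step)
    except OSError:
        return False  # source unavailable, leave the method alone
    source = textwrap.dedent(source)
    if ORIGINAL not in source:
        return False  # upstream code changed, skip instead of guessing
    source = source.replace(ORIGINAL, PATCHED)
    source = source.replace("def training_step(", "def _fixed_training_step(", 1)
    namespace = {}
    exec(source, namespace)  # defines _fixed_training_step in namespace
    cls.training_step = namespace["_fixed_training_step"]
    return True


toy = load_toy_module()
toy.ToyTrainer.accepts_loss_kwargs = False
patched = patch_training_step(toy.ToyTrainer)
# With the patch applied and accepts_loss_kwargs False, num_items_in_batch is
# no longer forwarded, so the loss is averaged over len(inputs).
result = toy.ToyTrainer().training_step([1.0, 2.0, 3.0], num_items_in_batch=6)
```

Returning False when the expected snippet is missing mirrors the early return in the patch functions above: if the upstream source drifts, the method is left as-is rather than partially rewritten.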
--- a/src/axolotl/monkeypatch/transformers_fa_utils.py
+++ b/src/axolotl/monkeypatch/transformers_fa_utils.py
@@ -1,67 +0,0 @@
-"""
-see https://github.com/huggingface/transformers/pull/35834
-"""
-
-import logging
-from functools import partial
-from typing import Optional
-
-import torch
-
-logger = logging.getLogger(__name__)
-
-
-def fixed_fa_peft_integration_check(
-    query: torch.Tensor,
-    key: torch.Tensor,
-    value: torch.Tensor,
-    target_dtype: Optional[torch.dtype] = None,
-    preferred_dtype: Optional[torch.dtype] = None,
-):
-    """
-    PEFT usually casts the layer norms in float32 for training stability reasons
-    therefore the input hidden states gets silently casted in float32. Hence, we need
-    cast them back in float16 / bfloat16 just to be sure everything works as expected.
-    This might slowdown training & inference so it is recommended to not cast the LayerNorms!
-
-    Args:
-        query (`torch.Tensor`):
-            Input query states to be passed to Flash Attention API
-        key (`torch.Tensor`):
-            Input key states to be passed to Flash Attention API
-        value (`torch.Tensor`):
-            Input value states to be passed to Flash Attention API
-        target_dtype (`torch.dtype`, *optional*):
-            The dtype to convert the attention tensors to. Conversion can be ignored by
-            not providing the target dtype.
-        preferred_dtype (`torch.dtype`, *optional*):
-            The preferred dtype to convert the attention tensors to regardless of the
-            target dtype.
-    """
-    if target_dtype is None and preferred_dtype is None:
-        return query, key, value
-
-    if preferred_dtype and target_dtype != preferred_dtype:
-        target_dtype = preferred_dtype
-
-    # check if any of query, key, or value are in float32. If so, cast them back to target dtype.
-    if any(module.dtype == torch.float32 for module in [query, key, value]):
-        logger.warning_once(
-            f"The input hidden states seems to be silently casted in float32, this might be related to"
-            f" the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in"
-            f" {target_dtype}."
-        )
-
-        query = query.to(target_dtype)
-        key = key.to(target_dtype)
-        value = value.to(target_dtype)
-
-    return query, key, value
-
-
-def patch_fa_peft_integration():
-    import transformers.modeling_flash_attention_utils
-
-    transformers.modeling_flash_attention_utils.fa_peft_integration_check = partial(
-        fixed_fa_peft_integration_check, preferred_dtype=None
-    )
--- a/src/axolotl/train.py
+++ b/src/axolotl/train.py
@@ -5,19 +5,21 @@ import os
 import signal
 import sys
 import weakref
+from dataclasses import dataclass
 from pathlib import Path
-from typing import Tuple, Union
+from typing import Optional, Tuple, Union

 import torch
 import transformers.modelcard
 from accelerate.logging import get_logger
 from accelerate.utils import save_fsdp_model
+from datasets import Dataset
 from peft import PeftModel
 from pkg_resources import get_distribution  # type: ignore
 from transformers import PreTrainedModel, PreTrainedTokenizer
 from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled

-from axolotl.common.datasets import TrainDatasetMeta
+from axolotl.common.cli import TrainerCliArgs
 from axolotl.contribs.lgpl.unsloth import (  # pylint: disable = no-name-in-module
    fix_untrained_tokens,
 )
@@ -37,11 +39,22 @@ src_dir = os.path.join(project_root, "src")
 sys.path.insert(0, src_dir)

 configure_logging()
-LOG = get_logger(__name__)
+LOG = get_logger("axolotl.train")
+
+
+@dataclass
+class TrainDatasetMeta:
+    """
+    Dataclass capturing the dataset-specific options for training.
+    """
+
+    train_dataset: Dataset
+    eval_dataset: Optional[Dataset] = None
+    total_num_steps: Optional[int] = None


 def train(
-    *, cfg: DictDefault, dataset_meta: TrainDatasetMeta
+    *, cfg: DictDefault, cli_args: TrainerCliArgs, dataset_meta: TrainDatasetMeta
 ) -> Tuple[Union[PeftModel, PreTrainedModel], PreTrainedTokenizer]:
    # Load tokenizer
    LOG.debug(
@@ -80,7 +93,9 @@ def train(
    if cfg.adapter:
        msg += " and peft_config..."
    LOG.debug(msg)
-    model, peft_config = load_model(cfg, tokenizer, processor=processor)
+    model, peft_config = load_model(
+        cfg, tokenizer, processor=processor, inference=cli_args.inference
+    )
    if model.generation_config is not None:
        model.generation_config.do_sample = True

@@ -92,7 +107,9 @@ def train(
            model_ref = None  # explicit setting to None
        else:
            # load the model again for model_ref/baseline
-            model_ref, _ = load_model(cfg, tokenizer, reference_model=True)
+            model_ref, _ = load_model(
+                cfg, tokenizer, inference=cli_args.inference, reference_model=True
+            )

    safe_serialization = cfg.save_safetensors is True

--- a/src/axolotl/utils/callbacks/diff_attn.py
+++ b/src/axolotl/utils/callbacks/diff_attn.py
@@ -0,0 +1,234 @@
+"""
+Monitor and log differential attention components during training.
+
+This module provides a callback for tracking the behavior of differential attention
+mechanisms, including lambda parameters and attention statistics.
+"""
+
+from typing import Any
+
+import torch
+import wandb
+from torch import nn
+from transformers import TrainerCallback
+
+from axolotl.utils.distributed import is_main_process
+
+
+class DifferentialAttentionMonitorCallback(TrainerCallback):
+    """
+    Callback to monitor differential attention components and lambda parameters.
+
+    This callback tracks attention statistics across all layers and provides detailed
+    monitoring for a specified number of layers evenly spaced through the model.
+    """
+
+    def __init__(
+        self,
+        log_every: int = 250,
+        num_monitor_layers: int = 3,
+        warmup_steps: int | None = None,
+    ):
+        """
+        Initialize the differential attention monitor.
+
+        Args:
+            log_every: Number of steps between logging events.
+            num_monitor_layers: Number of individual layers to monitor in detail.
+            warmup_steps: Optional warmup length; when set, `diff_attn_mix` is logged.
+        """
+        self.log_every = log_every
+        self.num_monitor_layers = num_monitor_layers
+        self.warmup_steps = warmup_steps
+        self.monitor_layers: list[int] | None = None  # Will be set in on_train_begin
+
+    # pylint: disable=unused-argument
+    def on_train_begin(
+        self,
+        args: Any,
+        state: Any,
+        control: Any,
+        model: torch.nn.Module,
+        **kwargs,
+    ) -> None:
+        """
+        Set up layer monitoring at the start of training.
+
+        Args:
+            args: Training arguments.
+            state: Training state.
+            control: Training control object.
+            model: The model being trained.
+            **kwargs: Additional arguments passed by the trainer.
+        """
+        if is_main_process():
+            num_layers = len(model.model.layers)
+            self.num_monitor_layers = min(self.num_monitor_layers, num_layers)
+
+            stride = (
+                (num_layers - 1) / (self.num_monitor_layers - 1)
+                if self.num_monitor_layers > 1
+                else 0
+            )
+            self.monitor_layers = [
+                round(i * stride) for i in range(self.num_monitor_layers)
+            ]
+            print(f"Monitoring layers {self.monitor_layers} in detail")
+
+    # pylint: disable=unused-argument
+    def on_step_end(
+        self, args: Any, state: Any, control: Any, model: torch.nn.Module, **kwargs
+    ) -> None:
+        """
+        Log attention metrics every `log_every` steps (main process only).
+
+        Collects and logs:
+            - Lambda parameter norms and values.
+            - Attention statistics (mean and std).
+            - Both per-layer and aggregate metrics.
+
+        Args:
+            args: Training arguments.
+            state: Training state.
+            control: Training control object.
+            model: The model being trained.
+            **kwargs: Additional arguments passed by the trainer.
+        """
+        if not is_main_process() or state.global_step % self.log_every != 0:
+            return
+
+        assert self.monitor_layers is not None
+
+        # Aggregate stats across all layers
+        all_q1_norms = []
+        all_q2_norms = []
+        all_k1_norms = []
+        all_k2_norms = []
+        all_lambda1 = []
+        all_lambda2 = []
+        all_lambda_full = []
+
+        metrics = {}
+        for layer_idx, layer in enumerate(model.model.layers):
+            attn = layer.self_attn
+
+            # Collect stats for aggregation
+            all_q1_norms.append(attn.lambda_q1.norm().item())
+            all_q2_norms.append(attn.lambda_q2.norm().item())
+            all_k1_norms.append(attn.lambda_k1.norm().item())
+            all_k2_norms.append(attn.lambda_k2.norm().item())
+
+            lambda1 = torch.exp(torch.sum(attn.lambda_q1 * attn.lambda_k1)).item()
+            lambda2 = torch.exp(torch.sum(attn.lambda_q2 * attn.lambda_k2)).item()
+            all_lambda1.append(lambda1)
+            all_lambda2.append(lambda2)
+            all_lambda_full.append(attn.lambda_full)
+
+            # Log detailed metrics for monitored layers
+            if layer_idx in self.monitor_layers:
+                metrics.update(
+                    {
+                        f"layer_{layer_idx}/lambda_q1_norm": attn.lambda_q1.norm().item(),
+                        f"layer_{layer_idx}/lambda_k1_norm": attn.lambda_k1.norm().item(),
+                        f"layer_{layer_idx}/lambda_q2_norm": attn.lambda_q2.norm().item(),
+                        f"layer_{layer_idx}/lambda_k2_norm": attn.lambda_k2.norm().item(),
+                        f"layer_{layer_idx}/lambda1": lambda1,
+                        f"layer_{layer_idx}/lambda2": lambda2,
+                        f"layer_{layer_idx}/lambda_init": attn.lambda_init.item(),
+                        f"layer_{layer_idx}/lambda_full": lambda1
+                        - lambda2
+                        + attn.lambda_init.item(),
+                        f"layer_{layer_idx}/attn1_mean": attn.attn1.mean().item(),
+                        f"layer_{layer_idx}/attn2_mean": attn.attn2.mean().item(),
+                        f"layer_{layer_idx}/attn1_std": attn.attn1.std().item(),
+                        f"layer_{layer_idx}/attn2_std": attn.attn2.std().item(),
+                    }
+                )
+
+        # Add aggregate metrics
+        metrics.update(
+            {
+                "aggregate/lambda_q1_norm_mean": torch.tensor(all_q1_norms)
+                .mean()
+                .item(),
+                "aggregate/lambda_q1_norm_std": torch.tensor(all_q1_norms).std().item(),
+                "aggregate/lambda_q2_norm_mean": torch.tensor(all_q2_norms)
+                .mean()
+                .item(),
+                "aggregate/lambda_q2_norm_std": torch.tensor(all_q2_norms).std().item(),
+                "aggregate/lambda_k1_norm_mean": torch.tensor(all_k1_norms)
+                .mean()
+                .item(),
+                "aggregate/lambda_k1_norm_std": torch.tensor(all_k1_norms).std().item(),
+                "aggregate/lambda_k2_norm_mean": torch.tensor(all_k2_norms)
+                .mean()
+                .item(),
+                "aggregate/lambda_k2_norm_std": torch.tensor(all_k2_norms).std().item(),
+                "aggregate/lambda1_mean": torch.tensor(all_lambda1).mean().item(),
+                "aggregate/lambda1_std": torch.tensor(all_lambda1).std().item(),
+                "aggregate/lambda2_mean": torch.tensor(all_lambda2).mean().item(),
+                "aggregate/lambda2_std": torch.tensor(all_lambda2).std().item(),
+                "aggregate/lambda_full_mean": torch.tensor(all_lambda_full)
+                .mean()
+                .item(),
+                "aggregate/lambda_full_std": torch.tensor(all_lambda_full).std().item(),
+            }
+        )
+
+        if self.warmup_steps:
+            metrics["aggregate/diff_attn_mix"] = attn.diff_attn_mix
+
+        wandb.log(metrics, step=state.global_step)
+
+
+class DifferentialAttentionMixingCallback(TrainerCallback):
+    """
+    Callback to gradually increase the weight of negative attention components during
+    training.
+    """
+
+    def __init__(self, warmup_steps: int | None):
+        """
+        Args:
+            warmup_steps: Number of steps over which to linearly increase the
+                negative attention weight from 0 to 1. If `None`, negative attention
+                has full weight from the start.
+        """
+        self.warmup_steps = warmup_steps
+        self.diff_attention_layers: list[nn.Module] | None = None
+
+    # pylint: disable=unused-argument
+    def on_train_begin(
+        self,
+        args: Any,
+        state: Any,
+        control: Any,
+        model: torch.nn.Module,
+        **kwargs,
+    ) -> None:
+        """Cache the differential attention layers at the start of training."""
+        if model is not None:
+            # Get the actual model if it's wrapped
+            if hasattr(model, "module"):
+                model = model.module
+
+            # Cache all differential attention layers
+            self.diff_attention_layers = [
+                module for module in model.modules() if hasattr(module, "diff_attn_mix")
+            ]
+
+    def on_step_begin(
+        self,
+        args: Any,
+        state: Any,
+        control: Any,
+        model: torch.nn.Module = None,
+        **kwargs,
+    ) -> None:
+        if self.diff_attention_layers and self.warmup_steps:
+            # Calculate mixing parameter (0 to 1)
+            mix = min(1.0, state.global_step / self.warmup_steps)
+
+            # Update cached layers
+            for layer in self.diff_attention_layers:
+                layer.diff_attn_mix = mix
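Reviewer note: the layer-selection arithmetic in `on_train_begin` and the linear warmup in `on_step_begin` can be checked in isolation. The sketch below restates both in plain Python under that assumption; the helper names `pick_monitor_layers` and `warmup_mix` are hypothetical and are not part of this change.

```python
def pick_monitor_layers(num_layers: int, num_monitor_layers: int) -> list[int]:
    """Return evenly spaced layer indices, mirroring the on_train_begin logic."""
    # Never monitor more layers than the model has.
    count = min(num_monitor_layers, num_layers)
    # Spread the indices from the first layer (0) to the last (num_layers - 1).
    stride = (num_layers - 1) / (count - 1) if count > 1 else 0
    return [round(i * stride) for i in range(count)]


def warmup_mix(global_step: int, warmup_steps: int) -> float:
    """Linear ramp from 0 to 1 over warmup_steps, then held at 1."""
    return min(1.0, global_step / warmup_steps)


if __name__ == "__main__":
    print(pick_monitor_layers(33, 3))
    print(warmup_mix(50, 100))
```

The ramp divides by `warmup_steps`, which is why the callback guards on `self.warmup_steps` being truthy before computing the mix.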
--- a/src/axolotl/utils/config/models/input/v0_4_1/__init__.py
+++ b/src/axolotl/utils/config/models/input/v0_4_1/__init__.py
@@ -129,7 +129,6 @@ class PretrainingDataset(BaseModel):
    type: Optional[str] = "pretrain"
    trust_remote_code: Optional[bool] = False
    data_files: Optional[str] = None
-    skip: Optional[int] = None


 class UserDefinedPrompterType(BaseModel):
@@ -147,14 +146,6 @@ class UserDefinedPrompterType(BaseModel):
    field: Optional[str] = None


-class LrGroup(BaseModel):
-    """Custom learning rate group configuration"""
-
-    name: str
-    modules: List[str]
-    lr: float
-
-
 class SFTDataset(BaseModel):
    """SFT configuration subset"""

@@ -376,13 +367,6 @@ class LoraConfig(BaseModel):
            loraplus_lr_embedding = float(loraplus_lr_embedding)
        return loraplus_lr_embedding

-    @model_validator(mode="before")
-    @classmethod
-    def validate_lora_dropout(cls, data):
-        if data.get("adapter") is not None and data.get("lora_dropout") is None:
-            data["lora_dropout"] = 0.0
-        return data
-

 class ReLoRAConfig(BaseModel):
    """ReLoRA configuration subset"""
@@ -483,7 +467,6 @@ class HyperparametersConfig(BaseModel):
    cosine_min_lr_ratio: Optional[float] = None
    cosine_constant_lr_ratio: Optional[float] = None
    lr_div_factor: Optional[float] = None
-    lr_groups: Optional[List[LrGroup]] = None

    adam_epsilon: Optional[float] = None
    adam_beta1: Optional[float] = None
--- a/src/axolotl/utils/data/__init__.py
+++ b/src/axolotl/utils/data/__init__.py
@@ -5,7 +5,7 @@ from axolotl.utils.data.pretraining import (  # noqa: F401
    encode_pretraining,
    wrap_pretraining_dataset,
 )
-from axolotl.utils.data.rl import load_prepare_preference_datasets  # noqa: F401
+from axolotl.utils.data.rl import load_prepare_dpo_datasets  # noqa: F401
 from axolotl.utils.data.sft import (  # noqa: F401
    get_dataset_wrapper,
    load_prepare_datasets,
--- a/src/axolotl/utils/data/pretraining.py
+++ b/src/axolotl/utils/data/pretraining.py
@@ -21,11 +21,10 @@ def encode_pretraining(
    tokenizer: PreTrainedTokenizerBase,
    max_tokens: int,
    examples: Dict[str, List],
-    text_column: str = "text",
    concatenate: bool = True,
 ) -> Dict[str, List]:
    res = tokenizer(
-        examples[text_column],
+        examples["text"],
        truncation=True,
        max_length=max_tokens - 2,
        add_special_tokens=True,
@@ -191,7 +190,7 @@ def wrap_pretraining_dataset(
            tokenizer,
            return_tensors="pt",
            padding=True,
-            pad_to_multiple_of=max_tokens,
+            pad_to_multiple_of=max_tokens * batch_size,
            multipack_attn=cfg.pretrain_multipack_attn,
        )
        encode = functools.partial(
@@ -201,17 +200,17 @@ def wrap_pretraining_dataset(
            max_seq_length=max_tokens,
            batch_size=batch_size,
            multipack_attn=cfg.pretrain_multipack_attn,
+            group_size=cfg.sample_packing_group_size,
+            bin_size=cfg.sample_packing_bin_size,
        )
        # set this to 1 so downstream data_loader doesn't try to increase the batch again
        cfg.micro_batch_size = 1
-    else:
+    elif cfg.pretraining_sample_concatenation is False:
        encode = functools.partial(
-            encode_pretraining,
-            tokenizer,
-            max_tokens,
-            text_column=cfg.pretraining_dataset[0].text_column or "text",
-            concatenate=cfg.pretraining_sample_concatenation is True,
+            encode_pretraining, tokenizer, max_tokens, concatenate=False
        )
+    else:
+        encode = functools.partial(encode_pretraining, tokenizer, max_tokens)

    if cfg.shuffle_merged_datasets:
        dataset = dataset.shuffle(seed=seed, buffer_size=buffer_size)
@@ -245,7 +244,9 @@ def encode_packed_pretraining(
    examples: Dict[str, List],
    max_seq_length: int = 2048,
    batch_size: int = 4,
-    multipack_attn: Optional[bool] = True,
+    multipack_attn: Optional[bool] = False,
+    group_size: int = 100000,
+    bin_size: int = 200,
 ) -> Dict[str, List]:
    # pylint: disable=duplicate-code
    # tokenize all the examples
@@ -256,9 +257,6 @@ def encode_packed_pretraining(
        train_dataset,
        max_seq_length,
        skip_position_ids=not multipack_attn,
-        # FIXME using attention mask unpad/pad with trainer and packed pretraining is broken atm
-        # workaround by using the position id logic for now in trainer
-        drop_attention_mask=multipack_attn,
    )

    sampler = MultipackBatchSampler(
@@ -266,6 +264,8 @@ def encode_packed_pretraining(
        lengths=get_dataset_lengths(train_dataset),
        batch_size=1,
        batch_max_len=batch_size * max_seq_length,
+        group_size=group_size,
+        bin_size=bin_size,
        drop_last=True,
    )

--- a/src/axolotl/utils/data/rl.py
+++ b/src/axolotl/utils/data/rl.py
@@ -115,7 +115,7 @@ def drop_long_rl_seq(
    raise ValueError("Unknown RL type")


-def load_prepare_preference_datasets(cfg):
+def load_prepare_dpo_datasets(cfg):
    def load_split(dataset_cfgs, _cfg):
        split_datasets: List[Any] = []
        for i, ds_cfg in enumerate(dataset_cfgs):
--- a/src/axolotl/utils/data/sft.py
+++ b/src/axolotl/utils/data/sft.py
@@ -89,13 +89,11 @@ def prepare_dataset(cfg, tokenizer, processor=None):
        split = "train"
        name = None
        data_files = None
-        skip = 0
        if isinstance(cfg.pretraining_dataset, list) and isinstance(
            cfg.pretraining_dataset[0], dict
        ):
            path = cfg.pretraining_dataset[0]["path"]
            name = cfg.pretraining_dataset[0]["name"]
-            skip = cfg.pretraining_dataset[0]["skip"]
            if "split" in cfg.pretraining_dataset[0]:
                split = cfg.pretraining_dataset[0]["split"]

@@ -109,14 +107,10 @@ def prepare_dataset(cfg, tokenizer, processor=None):
            cfg.pretraining_dataset[0]["type"] or "pretrain",
        )

-        iter_ds = load_dataset(
-            path, streaming=True, split=split, name=name, data_files=data_files
-        )
-        if skip:
-            LOG.info(f"Skipping {skip} samples from the dataset")
-            iter_ds = iter_ds.skip(skip)
        train_dataset = wrap_pretraining_dataset(
-            iter_ds,
+            load_dataset(
+                path, streaming=True, split=split, name=name, data_files=data_files
+            ),
            tokenizer,
            cfg,
            ds_wrapper_partial,
--- a/src/axolotl/utils/data/shared.py
+++ b/src/axolotl/utils/data/shared.py
@@ -107,13 +107,6 @@ def load_dataset_w_config(config_dataset, auth_token):
    except (FileNotFoundError, ConnectionError):
        pass

-    # gather extra args from the config
-    load_ds_kwargs = {}
-    if config_dataset.split:
-        load_ds_kwargs["split"] = config_dataset.split
-    else:
-        load_ds_kwargs["split"] = None
-
    # prefer local dataset, even if hub exists
    local_path = Path(config_dataset.path)
    if local_path.exists():
@@ -125,7 +118,7 @@ def load_dataset_w_config(config_dataset, auth_token):
                    name=config_dataset.name,
                    data_files=config_dataset.data_files,
                    streaming=False,
-                    **load_ds_kwargs,
+                    split=None,
                )
            else:
                try:
@@ -137,7 +130,7 @@ def load_dataset_w_config(config_dataset, auth_token):
                        config_dataset.path,
                        name=config_dataset.name,
                        streaming=False,
-                        **load_ds_kwargs,
+                        split=None,
                    )
        elif local_path.is_file():
            ds_type = get_ds_type(config_dataset)
@@ -147,13 +140,16 @@ def load_dataset_w_config(config_dataset, auth_token):
                name=config_dataset.name,
                data_files=config_dataset.path,
                streaming=False,
-                **load_ds_kwargs,
+                split=None,
            )
        else:
            raise ValueError(
                "unhandled dataset load: local path exists, but is neither a directory or a file"
            )
    elif ds_from_hub:
+        load_ds_kwargs = {}
+        if config_dataset.split:
+            load_ds_kwargs["split"] = config_dataset.split
        ds = load_dataset(
            config_dataset.path,
            name=config_dataset.name,
@@ -177,9 +173,9 @@ def load_dataset_w_config(config_dataset, auth_token):
                name=config_dataset.name,
                data_files=config_dataset.path,
                streaming=False,
+                split=None,
                storage_options=storage_options,
                trust_remote_code=config_dataset.trust_remote_code,
-                **load_ds_kwargs,
            )
    elif config_dataset.path.startswith("https://"):
        ds_type = get_ds_type(config_dataset)
@@ -188,9 +184,9 @@ def load_dataset_w_config(config_dataset, auth_token):
            name=config_dataset.name,
            data_files=config_dataset.path,
            streaming=False,
+            split=None,
            storage_options=storage_options,
            trust_remote_code=config_dataset.trust_remote_code,
-            **load_ds_kwargs,
        )
    else:
        if isinstance(config_dataset.data_files, str):
@@ -218,7 +214,7 @@ def load_dataset_w_config(config_dataset, auth_token):
            name=config_dataset.name,
            data_files=fp,
            streaming=False,
-            **load_ds_kwargs,
+            split=None,
        )
    if not ds:
        raise ValueError("unhandled dataset load")
--- a/src/axolotl/utils/models.py
+++ b/src/axolotl/utils/models.py
@@ -48,6 +48,7 @@ from transformers.integrations.deepspeed import (
 )

 from axolotl.common.architectures import MOE_ARCH_BLOCK
+from axolotl.integrations.base import PluginManager
 from axolotl.models.mamba import fix_mamba_attn_for_loss
 from axolotl.monkeypatch.multipack import (
    SUPPORTED_MULTIPACK_MODEL_TYPES,
@@ -375,24 +376,26 @@ class ModelLoader:

    def apply_patches(self) -> None:
        # load any patches from plugins
-        from axolotl.integrations.base import PluginManager
-
        plugin_manager = PluginManager.get_instance()
        plugin_manager.pre_model_load(self.cfg)

-        if self.cfg.adapter:
-            from axolotl.monkeypatch.transformers_fa_utils import (
-                patch_fa_peft_integration,
-            )
-
-            patch_fa_peft_integration()
-
        if self.cfg.gradient_checkpointing == "unsloth":
            transformers.modeling_utils.checkpoint = hf_grad_checkpoint_unsloth_wrapper

        if self.cfg.flash_attention:
            self.patch_attention()

+        if self.cfg.model_config_type == "llama":
+            from axolotl.monkeypatch.trainer_grad_accum import (
+                patch_flash_attention_forward,
+                patch_forward_for_ga,
+                patch_training_step_for_ga,
+            )
+
+            patch_flash_attention_forward()
+            patch_forward_for_ga()
+            patch_training_step_for_ga()
+
        if self.cfg.sample_packing and self.cfg.s2_attention:
            raise ValueError(
                "Received `sample_packing=true` and `s2_attention=true`; however, \
@@ -709,24 +712,53 @@ class ModelLoader:
        if self.cfg.flash_attention:
            if not self.cfg.sample_packing and self.cfg.s2_attention:
                pass
-            self.model_kwargs["attn_implementation"] = "flash_attention_2"
-            self.model_config._attn_implementation = (  # pylint: disable=protected-access
-                "flash_attention_2"
-            )
+
+            if self.cfg.diff_attention:
+                self.model_kwargs[
+                    "attn_implementation"
+                ] = "differential_flash_attention_2"
+                self.model_config._attn_implementation = (  # pylint: disable=protected-access
+                    "differential_flash_attention_2"
+                )
+            else:
+                self.model_kwargs["attn_implementation"] = "flash_attention_2"
+                self.model_config._attn_implementation = (  # pylint: disable=protected-access
+                    "flash_attention_2"
+                )
        elif self.cfg.sdp_attention:
-            self.model_kwargs["attn_implementation"] = "sdpa"
-            self.model_config._attn_implementation = (  # pylint: disable=protected-access
-                "sdpa"
-            )
+            if self.cfg.diff_attention:
+                self.model_kwargs["attn_implementation"] = "differential_sdpa"
+                self.model_config._attn_implementation = (  # pylint: disable=protected-access
+                    "differential_sdpa"
+                )
+            else:
+                self.model_kwargs["attn_implementation"] = "sdpa"
+                self.model_config._attn_implementation = (  # pylint: disable=protected-access
+                    "sdpa"
+                )
        elif self.cfg.eager_attention:
-            self.model_kwargs["attn_implementation"] = "eager"
+            if self.cfg.diff_attention:
+                self.model_kwargs["attn_implementation"] = "differential_eager"
+                self.model_config._attn_implementation = (  # pylint: disable=protected-access
+                    "differential_eager"
+                )
+            else:
+                self.model_kwargs["attn_implementation"] = "eager"
+                self.model_config._attn_implementation = (  # pylint: disable=protected-access
+                    "eager"
+                )
+        elif self.cfg.diff_attention:
+            self.model_kwargs["attn_implementation"] = "differential_eager"
            self.model_config._attn_implementation = (  # pylint: disable=protected-access
-                "eager"
+                "differential_eager"
            )

        if self.cfg.low_cpu_mem_usage:
            self.model_kwargs["low_cpu_mem_usage"] = True

+        plugin_manager = PluginManager.get_instance()
+        plugin_manager.set_attn_config(self.cfg, self.model_kwargs, self.model_config)
+
    def build_model(self, qlora_fsdp) -> bool:
        def _configure_zero3_memory_efficient_loading():
            """
@@ -812,6 +844,7 @@ class ModelLoader:

            if self.cfg.is_multimodal:
                self.model_config.text_config = self.text_model_config
+
            self.model = self.AutoModelLoader.from_pretrained(
                self.base_model,
                config=self.model_config,
@@ -1053,7 +1086,7 @@ class ModelLoader:
        )
        if (
            hasattr(self.model, "get_input_embeddings")
-            and self.model.get_input_embeddings().num_embeddings != embeddings_len
+            and self.model.get_input_embeddings().num_embeddings < embeddings_len
        ):
            resize_kwargs = {}
            if self.cfg.mean_resizing_embeddings is not None:
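Reviewer note: the attention-selection hunk above repeats the same assignment pair in seven branches. One possible consolidation is sketched below, assuming the only goal is to pick the implementation string; `resolve_attn_implementation` is a hypothetical helper, not something this change or the library provides.

```python
from typing import Optional


def resolve_attn_implementation(
    *,
    flash_attention: bool = False,
    sdp_attention: bool = False,
    eager_attention: bool = False,
    diff_attention: bool = False,
) -> Optional[str]:
    """Return the attn_implementation string chosen by the branches in the diff."""
    if flash_attention:
        base = "flash_attention_2"
    elif sdp_attention:
        base = "sdpa"
    elif eager_attention or diff_attention:
        # diff_attention with no explicit backend falls back to eager.
        base = "eager"
    else:
        # No attention flag set: leave the model's default untouched.
        return None
    return f"differential_{base}" if diff_attention else base
```

The caller would then assign the returned string once to both `model_kwargs["attn_implementation"]` and `model_config._attn_implementation`, skipping the assignment when the result is `None`.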
--- a/src/axolotl/utils/trainer.py
+++ b/src/axolotl/utils/trainer.py
@@ -310,22 +310,19 @@ def process_datasets_for_packing(cfg, train_dataset, eval_dataset):


 def process_pretraining_datasets_for_packing(
-    train_dataset, sequence_len, skip_position_ids=True, drop_attention_mask=False
+    train_dataset, sequence_len, skip_position_ids=True
 ):
    drop_long = partial(drop_long_seq, sequence_len=sequence_len)

    train_dataset = train_dataset.filter(
        drop_long,
        desc="Dropping Long Sequences",
-        load_from_cache_file=False,
    )
-    if not skip_position_ids:
+    if skip_position_ids:
        train_dataset = train_dataset.map(
            add_position_ids,
            desc="Add position_id column (Pretraining Sample Packing)",
        )
-    if drop_attention_mask:
-        train_dataset = train_dataset.remove_columns("attention_mask")

    return train_dataset

--- a/src/axolotl/utils/yaml.py
+++ b/src/axolotl/utils/yaml.py
@@ -0,0 +1,157 @@
+"""Utilities for YAML files."""
+
+from collections import OrderedDict
+from typing import Any, Dict, List, Set, Tuple, Union
+
+import yaml
+
+
+class YAMLOrderTracker:
+    """Tracks the order of keys and section breaks in YAML files."""
+
+    def __init__(self, yaml_path: str):
+        self.yaml_path = yaml_path
+        self.structure, self.needs_break = self._parse_yaml_structure()
+
+    def _get_indentation_level(self, line: str) -> int:
+        """Get the indentation level of a line."""
+        return len(line) - len(line.lstrip())
+
+    def _parse_yaml_structure(
+        self,
+    ) -> Tuple[Dict[str, Union[List[str], Dict]], Set[str]]:
+        """Parse the YAML file to extract structure and identify section breaks."""
+        with open(self.yaml_path, "r", encoding="utf-8") as file:
+            contents = file.readlines()
+
+        structure: OrderedDict = OrderedDict()
+        needs_break = set()  # Track which keys should have a break before them
+        current_path = []
+        last_indentation = -1
+        had_empty_line = False
+
+        for line in contents:
+            # Track empty lines and comments
+            if not line.strip() or line.strip().startswith("#"):
+                had_empty_line = True
+                continue
+
+            # Get indentation level and content
+            indentation = self._get_indentation_level(line)
+            content = line.strip()
+
+            # Skip lines that don't define keys
+            if ":" not in content:
+                continue
+
+            # Extract key
+            key = content.split(":")[0].strip()
+
+            # If this is a top-level key and we had an empty line, mark it
+            if indentation == 0:
+                if had_empty_line:
+                    needs_break.add(key)
+                had_empty_line = False
+
+            # Handle indentation changes
+            if indentation > last_indentation:
+                current_path.append(key)
+            elif indentation < last_indentation:
+                levels_up = max(1, (last_indentation - indentation) // 2)
+                current_path = current_path[:-levels_up] or [key]
+                current_path[-1] = key
+            else:
+                if current_path:
+                    current_path[-1] = key
+
+            # Update structure
+            current_dict = structure
+            for path_key in current_path[:-1]:
+                if path_key not in current_dict:
+                    current_dict[path_key] = OrderedDict()
+                current_dict = current_dict[path_key]
+
+            if current_path:
+                if current_path[-1] not in current_dict:
+                    current_dict[current_path[-1]] = OrderedDict()
+
+            last_indentation = indentation
+
+        return structure, needs_break
+
+
+class OrderedDumper(yaml.SafeDumper):
+    """Custom YAML dumper that maintains dictionary order."""
+
+
+def represent_none(self, _):
+    """Represent None values as empty fields."""
+    return self.represent_scalar("tag:yaml.org,2002:null", "")
+
+
+def ordered_dict_representer(dumper: OrderedDumper, data: Dict) -> Any:
+    """Custom representer for dictionaries that maintains order."""
+    return dumper.represent_mapping("tag:yaml.org,2002:map", data.items())
+
+
+def reorder_dict(data: Dict, reference_structure: Dict) -> OrderedDict:
+    """Reorder a dictionary based on a reference structure."""
+    ordered = OrderedDict()
+
+    # First add keys that are in the reference order
+    for key in reference_structure:
+        if key in data:
+            if isinstance(reference_structure[key], dict) and isinstance(
+                data[key], dict
+            ):
+                ordered[key] = reorder_dict(data[key], reference_structure[key])
+            else:
+                ordered[key] = data[key]
+
+    # Then add any remaining keys that weren't in the reference
+    for key in data:
+        if key not in ordered:
+            ordered[key] = data[key]
+
+    return ordered
+
+
+def dump_yaml_preserved_order(
+    data: Dict, reference_yaml_path: str, output_path: str
+) -> None:
+    """Dump YAML file while preserving nested order and normalized spacing."""
+    # Get reference structure and spacing
+    tracker = YAMLOrderTracker(reference_yaml_path)
+
+    # Reorder the data
+    ordered_data = reorder_dict(data, tracker.structure)
+
+    # Register the custom representers
+    OrderedDumper.add_representer(type(None), represent_none)
+    OrderedDumper.add_representer(dict, ordered_dict_representer)
+    OrderedDumper.add_representer(OrderedDict, ordered_dict_representer)
+
+    # First dump to string
+    yaml_str = yaml.dump(
+        ordered_data, Dumper=OrderedDumper, sort_keys=False, default_flow_style=False
+    )
+
+    # Add spacing according to reference
+    lines = yaml_str.split("\n")
+    result_lines: List[str] = []
+    current_line = 0
+
+    while current_line < len(lines):
+        line = lines[current_line]
+        if line.strip() and ":" in line and not line.startswith(" "):  # Top-level key
+            key = line.split(":")[0].strip()
+            if key in tracker.needs_break:
+                # Add single empty line before this key
+                if result_lines and result_lines[-1] != "":
+                    result_lines.append("")
+        result_lines.append(line)
+        current_line += 1
+
+    # Write the final result
+    with open(output_path, "w", encoding="utf-8") as file:
+        file.write("\n".join(result_lines))
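Review note: the ordering and spacing logic above can be exercised without PyYAML or a reference file. The sketch below copies `reorder_dict` from the diff and pulls the blank-line pass into a standalone helper. The helper name `add_breaks`, the sample `reference` and `data` dicts, and passing `needs_break` as a plain set of top-level keys are assumptions made for illustration; in the diff that set comes from `YAMLOrderTracker`, whose implementation is not shown here.

```python
from collections import OrderedDict
from typing import Dict, List, Set


def reorder_dict(data: Dict, reference_structure: Dict) -> OrderedDict:
    """Reorder a dictionary based on a reference structure (same logic as the diff)."""
    ordered: OrderedDict = OrderedDict()

    # Keys that appear in the reference come first, in reference order.
    for key in reference_structure:
        if key in data:
            if isinstance(reference_structure[key], dict) and isinstance(
                data[key], dict
            ):
                ordered[key] = reorder_dict(data[key], reference_structure[key])
            else:
                ordered[key] = data[key]

    # Keys missing from the reference are appended in their original order.
    for key in data:
        if key not in ordered:
            ordered[key] = data[key]

    return ordered


def add_breaks(yaml_str: str, needs_break: Set[str]) -> str:
    """Hypothetical helper: insert one blank line before selected top-level keys."""
    result: List[str] = []
    for line in yaml_str.split("\n"):
        # A non-indented line containing ":" is treated as a top-level key.
        if line.strip() and ":" in line and not line.startswith(" "):
            key = line.split(":")[0].strip()
            if key in needs_break and result and result[-1] != "":
                result.append("")
        result.append(line)
    return "\n".join(result)


# Illustrative inputs only; these are not taken from any real config.
reference = {"base_model": None, "lora": {"r": None, "alpha": None}, "output_dir": None}
data = {"output_dir": "out", "extra": 1, "lora": {"alpha": 16, "r": 8}, "base_model": "m"}

ordered = reorder_dict(data, reference)
print(list(ordered))          # ['base_model', 'lora', 'output_dir', 'extra']
print(list(ordered["lora"]))  # ['r', 'alpha']

spaced = add_breaks(
    "base_model: m\nlora:\n  r: 8\noutput_dir: out", {"lora", "output_dir"}
)
print(spaced)
```

Keys absent from the reference (`extra` here) are kept and placed last, so nothing in `data` is dropped by the reordering.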
--- a/tests/cli/test_cli_base.py
+++ b/tests/cli/test_cli_base.py
@@ -43,14 +43,12 @@ class BaseCliTest:
            result = cli_runner.invoke(cli, [command, str(config_path)])

            assert mock.called
-            assert mock.call_args.args[0] == [
+            assert mock.call_args.args[0][:5] == [
                "accelerate",
                "launch",
                "-m",
                f"axolotl.cli.{command}",
                str(config_path),
-                "--debug-num-examples",
-                "0",
            ]
            assert mock.call_args.kwargs == {"check": True}
            assert result.exit_code == 0
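Review note: comparing only `args[0][:5]` keeps the test stable when the launcher appends extra flags after the config path. A minimal sketch of the pattern follows, using only the standard library. The `launch` function is a hypothetical stand-in for the CLI's command builder, not axolotl's actual implementation, and the trailing flags are just sample values.

```python
import subprocess
from typing import List
from unittest.mock import patch


def launch(command: str, config: str, extra_flags: List[str]) -> None:
    """Hypothetical launcher: builds an accelerate command and runs it."""
    cmd = ["accelerate", "launch", "-m", f"axolotl.cli.{command}", config, *extra_flags]
    subprocess.run(cmd, check=True)


# Patch subprocess.run so nothing is actually executed.
with patch("subprocess.run") as mock:
    launch("train", "config.yml", ["--debug-num-examples", "0"])

called_cmd = mock.call_args.args[0]

# Only the fixed five-element prefix is compared; trailing flags may vary.
assert called_cmd[:5] == [
    "accelerate",
    "launch",
    "-m",
    "axolotl.cli.train",
    "config.yml",
]
assert mock.call_args.kwargs == {"check": True}
```

The trade-off is that the test no longer notices changes to the trailing flags, so any flag that matters needs its own assertion.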
--- a/tests/cli/test_cli_interface.py
+++ b/tests/cli/test_cli_interface.py
@@ -23,6 +23,7 @@ def test_build_command():
        "--batch-size",
        "8",
        "--debug",
+        "--nouse-fp16",
    ]


--- a/tests/cli/test_cli_merge_sharded_fsdp_weights.py
+++ b/tests/cli/test_cli_merge_sharded_fsdp_weights.py
@@ -16,3 +16,46 @@ def test_merge_sharded_fsdp_weights_no_accelerate(cli_runner, config_path):
        assert mock.called
        assert mock.call_args.kwargs["config"] == str(config_path)
        assert result.exit_code == 0
+
+
+def test_merge_sharded_fsdp_weights_with_model_dir(cli_runner, config_path, tmp_path):
+    """Test merge_sharded_fsdp_weights command with model_dir option"""
+    model_dir = tmp_path / "model"
+    model_dir.mkdir()
+
+    with patch("axolotl.cli.merge_sharded_fsdp_weights.do_cli") as mock:
+        result = cli_runner.invoke(
+            cli,
+            [
+                "merge-sharded-fsdp-weights",
+                str(config_path),
+                "--no-accelerate",
+                "--model-dir",
+                str(model_dir),
+            ],
+        )
+
+        assert mock.called
+        assert mock.call_args.kwargs["config"] == str(config_path)
+        assert mock.call_args.kwargs["model_dir"] == str(model_dir)
+        assert result.exit_code == 0
+
+
+def test_merge_sharded_fsdp_weights_with_save_path(cli_runner, config_path):
+    """Test merge_sharded_fsdp_weights command with save_path option"""
+    with patch("axolotl.cli.merge_sharded_fsdp_weights.do_cli") as mock:
+        result = cli_runner.invoke(
+            cli,
+            [
+                "merge-sharded-fsdp-weights",
+                str(config_path),
+                "--no-accelerate",
+                "--save-path",
+                "/path/to/save",
+            ],
+        )
+
+        assert mock.called
+        assert mock.call_args.kwargs["config"] == str(config_path)
+        assert mock.call_args.kwargs["save_path"] == "/path/to/save"
+        assert result.exit_code == 0
--- a/tests/cli/test_cli_shard.py
+++ b/tests/cli/test_cli_shard.py
@@ -0,0 +1,75 @@
+"""pytest tests for axolotl CLI shard command."""
+# pylint: disable=duplicate-code
+
+from unittest.mock import patch
+
+from axolotl.cli.main import cli
+
+
+def test_shard_with_accelerate(cli_runner, config_path):
+    """Test shard command with accelerate"""
+    with patch("subprocess.run") as mock:
+        result = cli_runner.invoke(cli, ["shard", str(config_path), "--accelerate"])
+
+        assert mock.called
+        assert mock.call_args.args[0][:5] == [
+            "accelerate",
+            "launch",
+            "-m",
+            "axolotl.cli.shard",
+            str(config_path),
+        ]
+        assert mock.call_args.kwargs == {"check": True}
+        assert result.exit_code == 0
+
+
+def test_shard_no_accelerate(cli_runner, config_path):
+    """Test shard command without accelerate"""
+    with patch("axolotl.cli.shard.do_cli") as mock:
+        result = cli_runner.invoke(cli, ["shard", str(config_path), "--no-accelerate"])
+
+        assert mock.called
+        assert result.exit_code == 0
+
+
+def test_shard_with_model_dir(cli_runner, config_path, tmp_path):
+    """Test shard command with model_dir option"""
+    model_dir = tmp_path / "model"
+    model_dir.mkdir()
+
+    with patch("axolotl.cli.shard.do_cli") as mock:
+        result = cli_runner.invoke(
+            cli,
+            [
+                "shard",
+                str(config_path),
+                "--no-accelerate",
+                "--model-dir",
+                str(model_dir),
+            ],
+            catch_exceptions=False,
+        )
+
+        assert mock.called
+        assert mock.call_args.kwargs["config"] == str(config_path)
+        assert mock.call_args.kwargs["model_dir"] == str(model_dir)
+        assert result.exit_code == 0
+
+
+def test_shard_with_save_dir(cli_runner, config_path):
+    with patch("axolotl.cli.shard.do_cli") as mock:
+        result = cli_runner.invoke(
+            cli,
+            [
+                "shard",
+                str(config_path),
+                "--no-accelerate",
+                "--save-dir",
+                "/path/to/save",
+            ],
+        )
+
+        assert mock.called
+        assert mock.call_args.kwargs["config"] == str(config_path)
+        assert mock.call_args.kwargs["save_dir"] == "/path/to/save"
+        assert result.exit_code == 0
--- a/tests/e2e/integrations/convert_diff_transformer/__init__.py
+++ b/tests/e2e/integrations/convert_diff_transformer/__init__.py
--- a/tests/e2e/integrations/convert_diff_transformer/conftest.py
+++ b/tests/e2e/integrations/convert_diff_transformer/conftest.py
@@ -0,0 +1,31 @@
+"""Shared fixtures for differential transformer conversion tests."""
+
+import pytest
+from click.testing import CliRunner
+
+
+@pytest.fixture(scope="class")
+def base_config():
+    """Basic config for testing."""
+    return {
+        "base_model": "HuggingFaceTB/SmolLM2-135M",
+        "datasets": [
+            {
+                "path": "axolotl-ai-co/alpaca_100_test",
+                "type": "alpaca",
+            },
+        ],
+        "gradient_accumulation_steps": 1,
+        "learning_rate": 1e-4,
+        "val_set_size": 0.1,
+        "micro_batch_size": 1,
+        "sequence_len": 2048,
+        "special_tokens": {
+            "pad_token": "<|endoftext|>",
+        },
+    }
+
+
+@pytest.fixture(scope="class")
+def cli_runner():
+    return CliRunner()
--- a/tests/e2e/integrations/convert_diff_transformer/test_convert_and_evaluate.py
+++ b/tests/e2e/integrations/convert_diff_transformer/test_convert_and_evaluate.py
@@ -0,0 +1,51 @@
+"""End-to-end tests for differential transformer conversion and evaluation."""
+# pylint: disable=duplicate-code
+
+from pathlib import Path
+
+import yaml
+from pytest import approx
+
+from axolotl.cli import load_cfg
+from axolotl.cli.evaluate import do_evaluate
+from axolotl.cli.integrations.convert_diff_transformer import convert_diff_transformer
+from axolotl.common.cli import ConvertDiffTransformerCliArgs, EvaluateCliArgs
+
+
+def test_conversion_and_eval_cli(tmp_path: Path, base_config):
+    output_dir = tmp_path / "converted"
+    base_config["output_dir"] = str(output_dir)
+
+    config_path = tmp_path / "config.yml"
+    with open(config_path, "w", encoding="utf-8") as file:
+        yaml.dump(base_config, file)
+
+    cfg = load_cfg(str(config_path))
+    cli_args = ConvertDiffTransformerCliArgs(
+        debug=True, zero_init=True, sublayer_norm=False
+    )
+    _, debug_info = convert_diff_transformer(cfg, cli_args, str(config_path))
+
+    assert debug_info["generations_match"] is True
+    assert (output_dir / "model.safetensors").exists()
+    assert (output_dir / "config.json").exists()
+    assert (output_dir / "axolotl_config.yml").exists()
+
+    eval_cfg = load_cfg(str(output_dir))
+    eval_cli_args = EvaluateCliArgs()
+    all_metrics = do_evaluate(eval_cfg, eval_cli_args)
+
+    assert list(all_metrics.keys()) == [
+        "train_loss",
+        "train_model_preparation_time",
+        "train_runtime",
+        "train_samples_per_second",
+        "train_steps_per_second",
+        "eval_loss",
+        "eval_model_preparation_time",
+        "eval_runtime",
+        "eval_samples_per_second",
+        "eval_steps_per_second",
+    ]
+    assert all_metrics["train_loss"] == approx(1.7307, rel=1e-4)
+    assert all_metrics["eval_loss"] == approx(1.8387, rel=1e-4)
--- a/tests/e2e/integrations/convert_diff_transformer/test_convert_diff_transformer.py
+++ b/tests/e2e/integrations/convert_diff_transformer/test_convert_diff_transformer.py
@@ -0,0 +1,150 @@
+"""End-to-end tests for differential transformer conversion."""
+# pylint: disable=redefined-outer-name
+# pylint: disable=duplicate-code
+
+from pathlib import Path
+from typing import Optional
+from unittest.mock import patch
+
+import pytest
+import yaml
+
+from axolotl.cli import load_cfg
+from axolotl.cli.integrations.convert_diff_transformer import convert_diff_transformer
+from axolotl.cli.main import cli
+from axolotl.common.cli import ConvertDiffTransformerCliArgs
+
+
+def test_cli_validation(cli_runner):
+    # Test missing config file
+    result = cli_runner.invoke(cli, ["convert-diff-transformer"])
+    assert result.exit_code != 0
+    assert "Error: Missing argument 'CONFIG'." in result.output
+
+    # Test non-existent config file
+    result = cli_runner.invoke(cli, ["convert-diff-transformer", "nonexistent.yml"])
+    assert result.exit_code != 0
+    assert "Error: Invalid value for 'CONFIG'" in result.output
+
+
+def test_basic_execution(cli_runner, tmp_path: Path, base_config):
+    config_path = tmp_path / "config.yml"
+    with open(config_path, "w", encoding="utf-8") as file:
+        yaml.dump(base_config, file)
+
+    with patch(
+        "axolotl.cli.integrations.convert_diff_transformer.do_cli"
+    ) as mock_do_cli:
+        result = cli_runner.invoke(cli, ["convert-diff-transformer", str(config_path)])
+        assert result.exit_code == 0
+
+        mock_do_cli.assert_called_once()
+        assert mock_do_cli.call_args.kwargs["config"] == str(config_path)
+
+
+def test_conversion_cli_basic(tmp_path: Path, base_config):
+    output_dir = tmp_path / "converted"
+    base_config["output_dir"] = str(output_dir)
+
+    config_path = tmp_path / "config.yml"
+    with open(config_path, "w", encoding="utf-8") as file:
+        yaml.dump(base_config, file)
+
+    cfg = load_cfg(str(config_path))
+    cli_args = ConvertDiffTransformerCliArgs()
+    _, debug_info = convert_diff_transformer(cfg, cli_args, str(config_path))
+
+    assert not debug_info
+    assert (output_dir / "model.safetensors").exists()
+    assert (output_dir / "config.json").exists()
+    assert (output_dir / "axolotl_config.yml").exists()
+
+
+def test_conversion_cli_debug(tmp_path: Path, base_config):
+    output_dir = tmp_path / "converted"
+    base_config["output_dir"] = str(output_dir)
+
+    config_path = tmp_path / "config.yml"
+    with open(config_path, "w", encoding="utf-8") as file:
+        yaml.dump(base_config, file)
+
+    cfg = load_cfg(str(config_path))
+    cli_args = ConvertDiffTransformerCliArgs(debug=True)
+    _, debug_info = convert_diff_transformer(cfg, cli_args, str(config_path))
+
+    assert not debug_info["generations_match"]
+    assert not debug_info["match_expected"]
+    assert (output_dir / "model.safetensors").exists()
+    assert (output_dir / "config.json").exists()
+    assert (output_dir / "axolotl_config.yml").exists()
+
+
+def test_conversion_cli_reproduce(tmp_path: Path, base_config):
+    output_dir = tmp_path / "converted"
+    base_config["output_dir"] = str(output_dir)
+
+    config_path = tmp_path / "config.yml"
+    with open(config_path, "w", encoding="utf-8") as file:
+        yaml.dump(base_config, file)
+
+    cfg = load_cfg(str(config_path))
+    cli_args = ConvertDiffTransformerCliArgs(
+        debug=True, zero_init=True, sublayer_norm=False
+    )
+    _, debug_info = convert_diff_transformer(cfg, cli_args, str(config_path))
+
+    assert debug_info["generations_match"] is True
+    assert (output_dir / "model.safetensors").exists()
+    assert (output_dir / "config.json").exists()
+    assert (output_dir / "axolotl_config.yml").exists()
+
+
+@pytest.mark.parametrize(
+    "attention", ["eager_attention", "sdp_attention", "flash_attention"]
+)
+def test_conversion_cli_reproduce_attentions(
+    tmp_path: Path, base_config, attention: Optional[str]
+):
+    output_dir = tmp_path / "converted"
+    base_config["output_dir"] = str(output_dir)
+    base_config[attention] = True
+
+    config_path = tmp_path / "config.yml"
+    with open(config_path, "w", encoding="utf-8") as file:
+        yaml.dump(base_config, file)
+
+    cfg = load_cfg(str(config_path))
+    cli_args = ConvertDiffTransformerCliArgs(
+        debug=True, zero_init=True, sublayer_norm=False
+    )
+    _, debug_info = convert_diff_transformer(cfg, cli_args, str(config_path))
+
+    assert debug_info["generations_match"] is True
+    assert (output_dir / "model.safetensors").exists()
+    assert (output_dir / "config.json").exists()
+    assert (output_dir / "axolotl_config.yml").exists()
+
+
+@pytest.mark.parametrize(
+    "attention", ["eager_attention", "sdp_attention", "flash_attention"]
+)
+def test_conversion_cli_split_heads(tmp_path: Path, base_config, attention: str):
+    output_dir = tmp_path / "converted"
+
+    # Smallest model with an even number of attention heads
+    base_config["base_model"] = "HuggingFaceTB/SmolLM2-1.7B"
+    base_config["output_dir"] = str(output_dir)
+    base_config[attention] = True
+
+    config_path = tmp_path / "config.yml"
+    with open(config_path, "w", encoding="utf-8") as file:
+        yaml.dump(base_config, file)
+
+    cfg = load_cfg(str(config_path))
+    cli_args = ConvertDiffTransformerCliArgs(debug=True, split_heads=True)
+    _, debug_info = convert_diff_transformer(cfg, cli_args, str(config_path))
+
+    assert debug_info["generations_match"] is False
+    assert (output_dir / "model.safetensors").exists()
+    assert (output_dir / "config.json").exists()
+    assert (output_dir / "axolotl_config.yml").exists()
--- a/tests/e2e/integrations/test_cut_cross_entropy.py
+++ b/tests/e2e/integrations/test_cut_cross_entropy.py
@@ -2,17 +2,17 @@
 Simple end-to-end test for Cut Cross Entropy integration
 """

+from pathlib import Path
+
 import pytest

-from axolotl.cli.args import TrainerCliArgs
-from axolotl.common.datasets import load_datasets
+from axolotl.cli import load_datasets
+from axolotl.common.cli import TrainerCliArgs
 from axolotl.train import train
 from axolotl.utils import get_pytorch_version
 from axolotl.utils.config import normalize_config, prepare_plugins
 from axolotl.utils.dict import DictDefault

-from ..utils import check_model_output_exists
-
 # pylint: disable=duplicate-code


@@ -64,10 +64,10 @@ class TestCutCrossEntropyIntegration:
        major, minor, _ = get_pytorch_version()
        if (major, minor) < (2, 4):
            with pytest.raises(ImportError):
-                train(cfg=cfg, dataset_meta=dataset_meta)
+                train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
        else:
-            train(cfg=cfg, dataset_meta=dataset_meta)
-            check_model_output_exists(temp_dir, cfg)
+            train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+            assert (Path(temp_dir) / "model.safetensors").exists()

    @pytest.mark.parametrize(
        "attention_type",
@@ -92,7 +92,7 @@ class TestCutCrossEntropyIntegration:
        major, minor, _ = get_pytorch_version()
        if (major, minor) < (2, 4):
            with pytest.raises(ImportError):
-                train(cfg=cfg, dataset_meta=dataset_meta)
+                train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
        else:
-            train(cfg=cfg, dataset_meta=dataset_meta)
-            check_model_output_exists(temp_dir, cfg)
+            train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+            assert (Path(temp_dir) / "model.safetensors").exists()
--- a/tests/e2e/integrations/test_liger.py
+++ b/tests/e2e/integrations/test_liger.py
@@ -1,17 +1,16 @@
 """
 Simple end-to-end test for Liger integration
 """
+from pathlib import Path

 from e2e.utils import require_torch_2_4_1

-from axolotl.cli.args import TrainerCliArgs
-from axolotl.common.datasets import load_datasets
+from axolotl.cli import load_datasets
+from axolotl.common.cli import TrainerCliArgs
 from axolotl.train import train
 from axolotl.utils.config import normalize_config, prepare_plugins
 from axolotl.utils.dict import DictDefault

-from ..utils import check_model_output_exists
-

 class LigerIntegrationTestCase:
    """
@@ -60,8 +59,8 @@ class LigerIntegrationTestCase:
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "model.safetensors").exists()

    @require_torch_2_4_1
    def test_llama_w_flce(self, temp_dir):
@@ -105,5 +104,5 @@ class LigerIntegrationTestCase:
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "model.safetensors").exists()
--- a/tests/e2e/multigpu/test_llama.py
+++ b/tests/e2e/multigpu/test_llama.py
@@ -63,7 +63,6 @@ class TestMultiGPULlama:
                "lr_scheduler": "cosine",
                "flash_attention": True,
                "use_tensorboard": True,
-                "bf16": True,
            }
        )

@@ -128,7 +127,6 @@ class TestMultiGPULlama:
                "lr_scheduler": "cosine",
                "flash_attention": True,
                "use_tensorboard": True,
-                "bf16": True,
            }
        )

@@ -203,7 +201,6 @@ class TestMultiGPULlama:
                "lr_scheduler": "cosine",
                "flash_attention": True,
                "use_tensorboard": True,
-                "bf16": True,
            }
        )

@@ -226,12 +223,8 @@ class TestMultiGPULlama:
            ]
        )

-        loss_threshold = 2.3
        check_tensorboard(
-            temp_dir + "/runs",
-            "train/train_loss",
-            loss_threshold,
-            "Train Loss is too high",
+            temp_dir + "/runs", "train/train_loss", 2.3, "Train Loss is too high"
        )

    def test_dpo_qlora_ddp(self, temp_dir):
@@ -282,7 +275,6 @@ class TestMultiGPULlama:
                "lr_scheduler": "cosine",
                "flash_attention": True,
                "use_tensorboard": True,
-                "bf16": True,
            }
        )

@@ -305,12 +297,8 @@ class TestMultiGPULlama:
            ]
        )

-        loss_threshold = 2.3
        check_tensorboard(
-            temp_dir + "/runs",
-            "train/train_loss",
-            loss_threshold,
-            "Train Loss is too high",
+            temp_dir + "/runs", "train/train_loss", 2.3, "Train Loss is too high"
        )

    @pytest.mark.parametrize(
--- a/tests/e2e/patched/test_4d_multipack_llama.py
+++ b/tests/e2e/patched/test_4d_multipack_llama.py
@@ -5,14 +5,15 @@ E2E tests for multipack fft llama using 4d attention masks
 import logging
 import os
 import unittest
+from pathlib import Path

-from axolotl.cli.args import TrainerCliArgs
-from axolotl.common.datasets import load_datasets
+from axolotl.cli import load_datasets
+from axolotl.common.cli import TrainerCliArgs
 from axolotl.train import train
 from axolotl.utils.config import normalize_config
 from axolotl.utils.dict import DictDefault

-from ..utils import check_model_output_exists, require_torch_2_3_1, with_temp_dir
+from ..utils import require_torch_2_3_1, with_temp_dir

 LOG = logging.getLogger("axolotl.tests.e2e")
 os.environ["WANDB_DISABLED"] = "true"
@@ -65,8 +66,8 @@ class Test4dMultipackLlama(unittest.TestCase):
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "adapter_model.bin").exists()

    @with_temp_dir
    def test_torch_lora_packing(self, temp_dir):
@@ -109,5 +110,5 @@ class Test4dMultipackLlama(unittest.TestCase):
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "adapter_model.bin").exists()
--- a/tests/e2e/patched/test_cli_integrations.py
+++ b/tests/e2e/patched/test_cli_integrations.py
@@ -5,7 +5,7 @@ from pathlib import Path

 import yaml

-from axolotl.cli.config import load_cfg
+from axolotl.cli import load_cfg
 from axolotl.utils.dict import DictDefault


--- a/tests/e2e/patched/test_fa_xentropy.py
+++ b/tests/e2e/patched/test_fa_xentropy.py
@@ -4,17 +4,18 @@ E2E tests for lora llama

 import logging
 import os
+from pathlib import Path

 import pytest
 from transformers.utils import is_torch_bf16_gpu_available

-from axolotl.cli.args import TrainerCliArgs
-from axolotl.common.datasets import load_datasets
+from axolotl.cli import load_datasets
+from axolotl.common.cli import TrainerCliArgs
 from axolotl.train import train
 from axolotl.utils.config import normalize_config
 from axolotl.utils.dict import DictDefault

-from ..utils import check_model_output_exists, check_tensorboard
+from ..utils import check_tensorboard

 LOG = logging.getLogger("axolotl.tests.e2e")
 os.environ["WANDB_DISABLED"] = "true"
@@ -80,8 +81,8 @@ class TestFAXentropyLlama:
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "adapter_model.bin").exists()

        check_tensorboard(
            temp_dir + "/runs", "train/train_loss", 1.5, "Train Loss is too high"
--- a/tests/e2e/patched/test_falcon_samplepack.py
+++ b/tests/e2e/patched/test_falcon_samplepack.py
@@ -5,14 +5,15 @@ E2E tests for falcon
 import logging
 import os
 import unittest
+from pathlib import Path

-from axolotl.cli.args import TrainerCliArgs
-from axolotl.common.datasets import load_datasets
+from axolotl.cli import load_datasets
+from axolotl.common.cli import TrainerCliArgs
 from axolotl.train import train
 from axolotl.utils.config import normalize_config
 from axolotl.utils.dict import DictDefault

-from ..utils import check_model_output_exists, with_temp_dir
+from ..utils import with_temp_dir

 LOG = logging.getLogger("axolotl.tests.e2e")
 os.environ["WANDB_DISABLED"] = "true"
@@ -67,8 +68,8 @@ class TestFalconPatched(unittest.TestCase):
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "adapter_model.bin").exists()

    @with_temp_dir
    def test_ft(self, temp_dir):
@@ -107,5 +108,5 @@ class TestFalconPatched(unittest.TestCase):
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "pytorch_model.bin").exists()
--- a/tests/e2e/patched/test_fused_llama.py
+++ b/tests/e2e/patched/test_fused_llama.py
@@ -5,17 +5,18 @@ E2E tests for lora llama
 import logging
 import os
 import unittest
+from pathlib import Path

 import pytest
 from transformers.utils import is_torch_bf16_gpu_available

-from axolotl.cli.args import TrainerCliArgs
-from axolotl.common.datasets import load_datasets
+from axolotl.cli import load_datasets
+from axolotl.common.cli import TrainerCliArgs
 from axolotl.train import train
 from axolotl.utils.config import normalize_config
 from axolotl.utils.dict import DictDefault

-from ..utils import check_model_output_exists, with_temp_dir
+from ..utils import with_temp_dir

 LOG = logging.getLogger("axolotl.tests.e2e")
 os.environ["WANDB_DISABLED"] = "true"
@@ -71,5 +72,5 @@ class TestFusedLlama(unittest.TestCase):
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "pytorch_model.bin").exists()
--- a/tests/e2e/patched/test_llama_s2_attention.py
+++ b/tests/e2e/patched/test_llama_s2_attention.py
@@ -5,16 +5,17 @@ E2E tests for llama w/ S2 attn
 import logging
 import os
 import unittest
+from pathlib import Path

 import pytest

-from axolotl.cli.args import TrainerCliArgs
-from axolotl.common.datasets import load_datasets
+from axolotl.cli import load_datasets
+from axolotl.common.cli import TrainerCliArgs
 from axolotl.train import train
 from axolotl.utils.config import normalize_config
 from axolotl.utils.dict import DictDefault

-from ..utils import check_model_output_exists, with_temp_dir
+from ..utils import with_temp_dir

 LOG = logging.getLogger("axolotl.tests.e2e")
 os.environ["WANDB_DISABLED"] = "true"
@@ -69,8 +70,8 @@ class TestLlamaShiftedSparseAttention(unittest.TestCase):
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "adapter_model.bin").exists()

    @with_temp_dir
    def test_fft_s2_attn(self, temp_dir):
@@ -109,5 +110,5 @@ class TestLlamaShiftedSparseAttention(unittest.TestCase):
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "pytorch_model.bin").exists()
--- a/tests/e2e/patched/test_lora_llama_multipack.py
+++ b/tests/e2e/patched/test_lora_llama_multipack.py
@@ -5,17 +5,18 @@ E2E tests for lora llama
 import logging
 import os
 import unittest
+from pathlib import Path

 import pytest
 from transformers.utils import is_auto_gptq_available, is_torch_bf16_gpu_available

-from axolotl.cli.args import TrainerCliArgs
-from axolotl.common.datasets import load_datasets
+from axolotl.cli import load_datasets
+from axolotl.common.cli import TrainerCliArgs
 from axolotl.train import train
 from axolotl.utils.config import normalize_config
 from axolotl.utils.dict import DictDefault

-from ..utils import check_model_output_exists, with_temp_dir
+from ..utils import with_temp_dir

 LOG = logging.getLogger("axolotl.tests.e2e")
 os.environ["WANDB_DISABLED"] = "true"
@@ -74,8 +75,8 @@ class TestLoraLlama(unittest.TestCase):
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "adapter_model.bin").exists()

    @pytest.mark.skipif(not is_auto_gptq_available(), reason="auto-gptq not available")
    @with_temp_dir
@@ -124,5 +125,5 @@ class TestLoraLlama(unittest.TestCase):
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "adapter_model.bin").exists()
--- a/tests/e2e/patched/test_mistral_samplepack.py
+++ b/tests/e2e/patched/test_mistral_samplepack.py
@@ -5,14 +5,15 @@ E2E tests for lora llama
 import logging
 import os
 import unittest
+from pathlib import Path

-from axolotl.cli.args import TrainerCliArgs
-from axolotl.common.datasets import load_datasets
+from axolotl.cli import load_datasets
+from axolotl.common.cli import TrainerCliArgs
 from axolotl.train import train
 from axolotl.utils.config import normalize_config
 from axolotl.utils.dict import DictDefault

-from ..utils import check_model_output_exists, with_temp_dir
+from ..utils import with_temp_dir

 LOG = logging.getLogger("axolotl.tests.e2e")
 os.environ["WANDB_DISABLED"] = "true"
@@ -67,8 +68,8 @@ class TestMistral(unittest.TestCase):
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "adapter_model.bin").exists()

    @with_temp_dir
    def test_ft_packing(self, temp_dir):
@@ -108,5 +109,5 @@ class TestMistral(unittest.TestCase):
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "pytorch_model.bin").exists()
--- a/tests/e2e/patched/test_mixtral_samplepack.py
+++ b/tests/e2e/patched/test_mixtral_samplepack.py
@@ -5,14 +5,15 @@ E2E tests for mixtral
 import logging
 import os
 import unittest
+from pathlib import Path

-from axolotl.cli.args import TrainerCliArgs
-from axolotl.common.datasets import load_datasets
+from axolotl.cli import load_datasets
+from axolotl.common.cli import TrainerCliArgs
 from axolotl.train import train
 from axolotl.utils.config import normalize_config
 from axolotl.utils.dict import DictDefault

-from ..utils import check_model_output_exists, with_temp_dir
+from ..utils import with_temp_dir

 LOG = logging.getLogger("axolotl.tests.e2e")
 os.environ["WANDB_DISABLED"] = "true"
@@ -64,8 +65,8 @@ class TestMixtral(unittest.TestCase):
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "adapter_model.bin").exists()

    @with_temp_dir
    def test_ft(self, temp_dir):
@@ -102,5 +103,9 @@ class TestMixtral(unittest.TestCase):
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        model, _ = train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (
+            "MixtralFlashAttention2"
+            in model.model.layers[0].self_attn.__class__.__name__
+        )
+        assert (Path(temp_dir) / "pytorch_model.bin").exists()
--- a/tests/e2e/patched/test_model_patches.py
+++ b/tests/e2e/patched/test_model_patches.py
@@ -6,6 +6,7 @@ import unittest

 import transformers

+from axolotl.common.cli import TrainerCliArgs
 from axolotl.utils.config import normalize_config
 from axolotl.utils.dict import DictDefault
 from axolotl.utils.models import load_model, load_tokenizer
@@ -48,8 +49,14 @@ class TestModelPatches(unittest.TestCase):
            }
        )
        normalize_config(cfg)
+        cli_args = TrainerCliArgs()
        tokenizer = load_tokenizer(cfg)
-        load_model(cfg, tokenizer, inference=False)
+        model, _ = load_model(cfg, tokenizer, inference=cli_args.inference)
+
+        assert (
+            "MixtralFlashAttention2"
+            in model.model.layers[0].self_attn.__class__.__name__
+        )

    @with_temp_dir
    def test_mistral_multipack(self, temp_dir):
@@ -80,8 +87,9 @@ class TestModelPatches(unittest.TestCase):
            }
        )
        normalize_config(cfg)
+        cli_args = TrainerCliArgs()
        tokenizer = load_tokenizer(cfg)
-        load_model(cfg, tokenizer, inference=False)
+        load_model(cfg, tokenizer, inference=cli_args.inference)

        assert (
            "torch.jit"
--- a/tests/e2e/patched/test_phi_multipack.py
+++ b/tests/e2e/patched/test_phi_multipack.py
@@ -5,14 +5,15 @@ E2E tests for lora llama
 import logging
 import os
 import unittest
+from pathlib import Path

-from axolotl.cli.args import TrainerCliArgs
-from axolotl.common.datasets import load_datasets
+from axolotl.cli import load_datasets
+from axolotl.common.cli import TrainerCliArgs
 from axolotl.train import train
 from axolotl.utils.config import normalize_config
 from axolotl.utils.dict import DictDefault

-from ..utils import check_model_output_exists, with_temp_dir
+from ..utils import with_temp_dir

 LOG = logging.getLogger("axolotl.tests.e2e")
 os.environ["WANDB_DISABLED"] = "true"
@@ -67,8 +68,8 @@ class TestPhiMultipack(unittest.TestCase):
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "pytorch_model.bin").exists()

    @with_temp_dir
    def test_qlora_packed(self, temp_dir):
@@ -118,5 +119,5 @@ class TestPhiMultipack(unittest.TestCase):
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "adapter_model.bin").exists()
--- a/tests/e2e/patched/test_resume.py
+++ b/tests/e2e/patched/test_resume.py
@@ -6,16 +6,17 @@ import logging
 import os
 import re
 import subprocess
+from pathlib import Path

 from transformers.utils import is_torch_bf16_gpu_available

-from axolotl.cli.args import TrainerCliArgs
-from axolotl.common.datasets import load_datasets
+from axolotl.cli import load_datasets
+from axolotl.common.cli import TrainerCliArgs
 from axolotl.train import train
 from axolotl.utils.config import normalize_config
 from axolotl.utils.dict import DictDefault

-from ..utils import check_model_output_exists, most_recent_subdir
+from ..utils import most_recent_subdir

 LOG = logging.getLogger("axolotl.tests.e2e")
 os.environ["WANDB_DISABLED"] = "true"
@@ -71,7 +72,7 @@ class TestResumeLlama:
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)

        resume_cfg = cfg | DictDefault(
            {
@@ -81,8 +82,8 @@ class TestResumeLlama:
        normalize_config(resume_cfg)
        cli_args = TrainerCliArgs()

-        train(cfg=resume_cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=resume_cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "adapter_model.bin").exists()

        tb_log_path_1 = most_recent_subdir(temp_dir + "/runs")
        cmd = f"tensorboard --inspect  --logdir {tb_log_path_1}"
--- a/tests/e2e/patched/test_unsloth_integration.py
+++ b/tests/e2e/patched/test_unsloth_integration.py
@@ -3,6 +3,8 @@ import unittest

 import pytest

+from axolotl.monkeypatch.unsloth_ import check_self_attn_is_patchable
+

@pytest.mark.skip(
    reason="Unsloth integration will be broken going into latest transformers"
@@ -11,8 +13,6 @@ class TestUnslothIntegration(unittest.TestCase):
    """Unsloth monkeypatch integration tests."""

    def test_is_self_attn_patchable(self):
-        from axolotl.monkeypatch.unsloth_ import check_self_attn_is_patchable
-
        # ensures the current version of transformers has loss code that matches our patching code
        self.assertTrue(
            check_self_attn_is_patchable(),
--- a/tests/e2e/patched/test_unsloth_qlora.py
+++ b/tests/e2e/patched/test_unsloth_qlora.py
@@ -3,16 +3,17 @@ e2e tests for unsloth qlora
 """
 import logging
 import os
+from pathlib import Path

 import pytest

-from axolotl.cli.args import TrainerCliArgs
-from axolotl.common.datasets import load_datasets
+from axolotl.cli import load_datasets
+from axolotl.common.cli import TrainerCliArgs
 from axolotl.train import train
 from axolotl.utils.config import normalize_config
 from axolotl.utils.dict import DictDefault

-from ..utils import check_model_output_exists, check_tensorboard
+from ..utils import check_tensorboard

 LOG = logging.getLogger("axolotl.tests.e2e")
 os.environ["WANDB_DISABLED"] = "true"
@@ -75,8 +76,8 @@ class TestUnslothQLoRA:
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "adapter_model.bin").exists()

        check_tensorboard(
            temp_dir + "/runs", "train/train_loss", 2.0, "Train Loss is too high"
@@ -125,8 +126,8 @@ class TestUnslothQLoRA:
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "adapter_model.bin").exists()

        check_tensorboard(
            temp_dir + "/runs", "train/train_loss", 2.0, "Train Loss is too high"
@@ -180,8 +181,8 @@ class TestUnslothQLoRA:
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "adapter_model.bin").exists()

        check_tensorboard(
            temp_dir + "/runs", "train/train_loss", 2.0, "Train Loss is too high"
--- a/tests/e2e/test_dpo.py
+++ b/tests/e2e/test_dpo.py
@@ -9,13 +9,13 @@ from pathlib import Path

 import pytest

-from axolotl.cli.args import TrainerCliArgs
-from axolotl.common.datasets import load_preference_datasets
+from axolotl.cli import load_rl_datasets
+from axolotl.common.cli import TrainerCliArgs
 from axolotl.train import train
 from axolotl.utils.config import normalize_config
 from axolotl.utils.dict import DictDefault

-from .utils import check_model_output_exists, with_temp_dir
+from .utils import with_temp_dir

 LOG = logging.getLogger("axolotl.tests.e2e")
 os.environ["WANDB_DISABLED"] = "true"
@@ -65,10 +65,10 @@ class TestDPOLlamaLora(unittest.TestCase):
        )
        normalize_config(cfg)
        cli_args = TrainerCliArgs()
-        dataset_meta = load_preference_datasets(cfg=cfg, cli_args=cli_args)
+        dataset_meta = load_rl_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(Path(temp_dir) / "checkpoint-20", cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "checkpoint-20/adapter_model.safetensors").exists()

    @with_temp_dir
    def test_dpo_nll_lora(self, temp_dir):
@@ -110,10 +110,10 @@ class TestDPOLlamaLora(unittest.TestCase):
        )
        normalize_config(cfg)
        cli_args = TrainerCliArgs()
-        dataset_meta = load_preference_datasets(cfg=cfg, cli_args=cli_args)
+        dataset_meta = load_rl_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(Path(temp_dir) / "checkpoint-20", cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "checkpoint-20/adapter_model.safetensors").exists()

    @with_temp_dir
    def test_dpo_use_weighting(self, temp_dir):
@@ -155,10 +155,10 @@ class TestDPOLlamaLora(unittest.TestCase):
        )
        normalize_config(cfg)
        cli_args = TrainerCliArgs()
-        dataset_meta = load_preference_datasets(cfg=cfg, cli_args=cli_args)
+        dataset_meta = load_rl_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(Path(temp_dir) / "checkpoint-20", cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "checkpoint-20/adapter_model.safetensors").exists()

    @pytest.mark.skip("kto_pair no longer supported in trl")
    @with_temp_dir
@@ -200,10 +200,10 @@ class TestDPOLlamaLora(unittest.TestCase):
        )
        normalize_config(cfg)
        cli_args = TrainerCliArgs()
-        dataset_meta = load_preference_datasets(cfg=cfg, cli_args=cli_args)
+        dataset_meta = load_rl_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(Path(temp_dir) / "checkpoint-20", cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "checkpoint-20/adapter_model.safetensors").exists()

    @with_temp_dir
    def test_ipo_lora(self, temp_dir):
@@ -244,10 +244,10 @@ class TestDPOLlamaLora(unittest.TestCase):
        )
        normalize_config(cfg)
        cli_args = TrainerCliArgs()
-        dataset_meta = load_preference_datasets(cfg=cfg, cli_args=cli_args)
+        dataset_meta = load_rl_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(Path(temp_dir) / "checkpoint-20", cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "checkpoint-20/adapter_model.safetensors").exists()

    @with_temp_dir
    def test_orpo_lora(self, temp_dir):
@@ -291,10 +291,10 @@ class TestDPOLlamaLora(unittest.TestCase):
        )
        normalize_config(cfg)
        cli_args = TrainerCliArgs()
-        dataset_meta = load_preference_datasets(cfg=cfg, cli_args=cli_args)
+        dataset_meta = load_rl_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(Path(temp_dir) / "checkpoint-20", cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "checkpoint-20/adapter_model.safetensors").exists()

    @pytest.mark.skip(reason="Fix the implementation")
    @with_temp_dir
@@ -355,7 +355,7 @@ class TestDPOLlamaLora(unittest.TestCase):
        )
        normalize_config(cfg)
        cli_args = TrainerCliArgs()
-        dataset_meta = load_preference_datasets(cfg=cfg, cli_args=cli_args)
+        dataset_meta = load_rl_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(Path(temp_dir) / "checkpoint-20", cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "checkpoint-20/adapter_model.safetensors").exists()
--- a/tests/e2e/test_embeddings_lr.py
+++ b/tests/e2e/test_embeddings_lr.py
@@ -5,14 +5,15 @@ E2E tests for llama pretrain
 import logging
 import os
 import unittest
+from pathlib import Path

-from axolotl.cli.args import TrainerCliArgs
-from axolotl.common.datasets import load_datasets
+from axolotl.cli import load_datasets
+from axolotl.common.cli import TrainerCliArgs
 from axolotl.train import train
 from axolotl.utils.config import normalize_config
 from axolotl.utils.dict import DictDefault

-from .utils import check_model_output_exists, check_tensorboard, with_temp_dir
+from .utils import check_tensorboard, with_temp_dir

 LOG = logging.getLogger("axolotl.tests.e2e")
 os.environ["WANDB_DISABLED"] = "true"
@@ -60,8 +61,8 @@ class TestEmbeddingsLrScale(unittest.TestCase):
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "model.safetensors").exists()

        check_tensorboard(
            temp_dir + "/runs", "train/train_loss", 2.0, "Loss is too high"
@@ -104,8 +105,8 @@ class TestEmbeddingsLrScale(unittest.TestCase):
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "model.safetensors").exists()

        check_tensorboard(
            temp_dir + "/runs", "train/train_loss", 2.0, "Loss is too high"
--- a/tests/e2e/test_falcon.py
+++ b/tests/e2e/test_falcon.py
@@ -5,14 +5,15 @@ E2E tests for falcon
 import logging
 import os
 import unittest
+from pathlib import Path

-from axolotl.cli.args import TrainerCliArgs
-from axolotl.common.datasets import load_datasets
+from axolotl.cli import load_datasets
+from axolotl.common.cli import TrainerCliArgs
 from axolotl.train import train
 from axolotl.utils.config import normalize_config
 from axolotl.utils.dict import DictDefault

-from .utils import check_model_output_exists, with_temp_dir
+from .utils import with_temp_dir

 LOG = logging.getLogger("axolotl.tests.e2e")
 os.environ["WANDB_DISABLED"] = "true"
@@ -69,8 +70,8 @@ class TestFalcon(unittest.TestCase):
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "adapter_model.bin").exists()

    @with_temp_dir
    def test_lora_added_vocab(self, temp_dir):
@@ -122,8 +123,8 @@ class TestFalcon(unittest.TestCase):
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "adapter_model.bin").exists()

    @with_temp_dir
    def test_ft(self, temp_dir):
@@ -161,5 +162,5 @@ class TestFalcon(unittest.TestCase):
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "pytorch_model.bin").exists()
--- a/tests/e2e/test_llama.py
+++ b/tests/e2e/test_llama.py
@@ -4,11 +4,10 @@ E2E tests for llama

 import logging
 import os
+from pathlib import Path

-from e2e.utils import check_model_output_exists
-
-from axolotl.cli.args import TrainerCliArgs
-from axolotl.common.datasets import load_datasets
+from axolotl.cli import load_datasets
+from axolotl.common.cli import TrainerCliArgs
 from axolotl.train import train
 from axolotl.utils.config import normalize_config
 from axolotl.utils.dict import DictDefault
@@ -60,8 +59,8 @@ class TestLlama:
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "model.safetensors").exists()

    def test_fix_untrained_tokens(self, temp_dir):
        # pylint: disable=duplicate-code
@@ -103,8 +102,8 @@ class TestLlama:
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "model.safetensors").exists()

    def test_batch_flattening(self, temp_dir):
        # pylint: disable=duplicate-code
@@ -142,5 +141,5 @@ class TestLlama:
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "model.safetensors").exists()
--- a/tests/e2e/test_llama_pretrain.py
+++ b/tests/e2e/test_llama_pretrain.py
@@ -4,49 +4,40 @@ E2E tests for llama pretrain

 import logging
 import os
+import unittest
+from pathlib import Path

-import pytest
-
-from axolotl.cli.args import TrainerCliArgs
-from axolotl.common.datasets import load_datasets
+from axolotl.cli import load_datasets
+from axolotl.common.cli import TrainerCliArgs
 from axolotl.train import train
 from axolotl.utils.config import normalize_config
 from axolotl.utils.dict import DictDefault

-from .utils import check_model_output_exists, check_tensorboard
+from .utils import with_temp_dir

 LOG = logging.getLogger("axolotl.tests.e2e")
 os.environ["WANDB_DISABLED"] = "true"


-class TestPretrainLlama:
+class TestPretrainLlama(unittest.TestCase):
    """
    Test case for Llama models w pretraining
    """

-    @pytest.mark.parametrize(
-        "sample_packing",
-        [True, False],
-    )
-    @pytest.mark.parametrize(
-        "pretrain_multipack_attn",
-        [True, False],
-    )
-    def test_pretrain(self, temp_dir, sample_packing, pretrain_multipack_attn):
-        if not sample_packing and pretrain_multipack_attn:
-            return
-
+    @with_temp_dir
+    def test_pretrain_w_sample_packing(self, temp_dir):
        # pylint: disable=duplicate-code
        cfg = DictDefault(
            {
-                "base_model": "HuggingFaceTB/SmolLM2-135M",
+                "base_model": "JackFram/llama-68m",
+                "tokenizer_type": "LlamaTokenizer",
                "flash_attention": True,
                "sequence_len": 1024,
-                "sample_packing": sample_packing,
-                "pretrain_multipack_attn": pretrain_multipack_attn,
-                "dataset_processes": 1,
+                "sample_packing": True,
                "special_tokens": {
-                    "pad_token": "<|endoftext|>",
+                    "unk_token": "<unk>",
+                    "bos_token": "<s>",
+                    "eos_token": "</s>",
                },
                "pretraining_dataset": [
                    {
@@ -57,7 +48,7 @@ class TestPretrainLlama:
                ],
                "max_steps": 5,
                "num_epochs": 1,
-                "micro_batch_size": 2,
+                "micro_batch_size": 1,
                "gradient_accumulation_steps": 1,
                "val_set_size": 0.0,
                "output_dir": temp_dir,
@@ -66,21 +57,11 @@ class TestPretrainLlama:
                "lr_scheduler": "cosine",
                "save_safetensors": True,
                "bf16": "auto",
-                "use_tensorboard": True,
            }
        )
        normalize_config(cfg)
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
-        loss_threshold = 3.5
-        if sample_packing and not pretrain_multipack_attn:
-            loss_threshold = 6.5
-        check_tensorboard(
-            temp_dir + "/runs",
-            "train/train_loss",
-            loss_threshold,
-            "Train Loss is too high",
-        )
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "model.safetensors").exists()
--- a/tests/e2e/test_llama_vision.py
+++ b/tests/e2e/test_llama_vision.py
@@ -5,14 +5,15 @@ E2E tests for lora llama
 import logging
 import os
 import unittest
+from pathlib import Path

-from axolotl.cli.args import TrainerCliArgs
-from axolotl.common.datasets import load_datasets
+from axolotl.cli import load_datasets
+from axolotl.common.cli import TrainerCliArgs
 from axolotl.train import train
 from axolotl.utils.config import normalize_config
 from axolotl.utils.dict import DictDefault

-from .utils import check_model_output_exists, with_temp_dir
+from .utils import with_temp_dir

 LOG = logging.getLogger("axolotl.tests.e2e")
 os.environ["WANDB_DISABLED"] = "true"
@@ -66,8 +67,8 @@ class TestLlamaVision(unittest.TestCase):
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "adapter_model.safetensors").exists()

    @with_temp_dir
    def test_lora_llama_vision_multimodal_dataset(self, temp_dir):
@@ -111,5 +112,5 @@ class TestLlamaVision(unittest.TestCase):
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "adapter_model.safetensors").exists()
--- a/tests/e2e/test_lora_llama.py
+++ b/tests/e2e/test_lora_llama.py
@@ -5,14 +5,15 @@ E2E tests for lora llama
 import logging
 import os
 import unittest
+from pathlib import Path

-from axolotl.cli.args import TrainerCliArgs
-from axolotl.common.datasets import load_datasets
+from axolotl.cli import load_datasets
+from axolotl.common.cli import TrainerCliArgs
 from axolotl.train import train
 from axolotl.utils.config import normalize_config
 from axolotl.utils.dict import DictDefault

-from .utils import check_model_output_exists, with_temp_dir
+from .utils import with_temp_dir

 LOG = logging.getLogger("axolotl.tests.e2e")
 os.environ["WANDB_DISABLED"] = "true"
@@ -63,5 +64,5 @@ class TestLoraLlama(unittest.TestCase):
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "adapter_model.bin").exists()
--- a/tests/e2e/test_mamba.py
+++ b/tests/e2e/test_mamba.py
@@ -5,16 +5,17 @@ E2E tests for lora llama
 import logging
 import os
 import unittest
+from pathlib import Path

 import pytest

-from axolotl.cli.args import TrainerCliArgs
-from axolotl.common.datasets import load_datasets
+from axolotl.cli import load_datasets
+from axolotl.common.cli import TrainerCliArgs
 from axolotl.train import train
 from axolotl.utils.config import normalize_config
 from axolotl.utils.dict import DictDefault

-from .utils import check_model_output_exists, with_temp_dir
+from .utils import with_temp_dir

 LOG = logging.getLogger("axolotl.tests.e2e")
 os.environ["WANDB_DISABLED"] = "true"
@@ -63,5 +64,5 @@ class TestMamba(unittest.TestCase):
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "pytorch_model.bin").exists()
--- a/tests/e2e/test_mistral.py
+++ b/tests/e2e/test_mistral.py
@@ -5,16 +5,17 @@ E2E tests for lora llama
 import logging
 import os
 import unittest
+from pathlib import Path

 from transformers.utils import is_torch_bf16_gpu_available

-from axolotl.cli.args import TrainerCliArgs
-from axolotl.common.datasets import load_datasets
+from axolotl.cli import load_datasets
+from axolotl.common.cli import TrainerCliArgs
 from axolotl.train import train
 from axolotl.utils.config import normalize_config
 from axolotl.utils.dict import DictDefault

-from .utils import check_model_output_exists, with_temp_dir
+from .utils import with_temp_dir

 LOG = logging.getLogger("axolotl.tests.e2e")
 os.environ["WANDB_DISABLED"] = "true"
@@ -67,8 +68,8 @@ class TestMistral(unittest.TestCase):
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "adapter_model.bin").exists()

    @with_temp_dir
    def test_ft(self, temp_dir):
@@ -110,5 +111,5 @@ class TestMistral(unittest.TestCase):
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
+        train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
+        assert (Path(temp_dir) / "pytorch_model.bin").exists()
Author	SHA1	Message	Date
Wing Lian	8c4f89745a	fix softmax class check	2025-01-15 23:23:13 -05:00
Wing Lian	36b71f34d7	register rala	2025-01-15 23:21:22 -05:00
Wing Lian	d28fee7609	use autoconfig w rala	2025-01-15 23:14:47 -05:00
Wing Lian	c196776996	option to not concatenate during pretraining	2025-01-15 22:45:02 -05:00
Wing Lian	79ae776102	fixup logging layer	2025-01-15 21:36:14 -05:00
Wing Lian	145664d82c	more fixups	2025-01-15 21:27:12 -05:00
Dan Saunders	28694219a5	inline comment change	2025-01-14 16:59:43 +00:00
Dan Saunders	fd8ad6fcbf	fixing negative component mixing	2025-01-13 19:21:55 +00:00
Dan Saunders	661d71a14b	adding diff attn negative component warmup (in progress)	2025-01-10 21:57:31 +00:00
Dan Saunders	6dd47edcb8	fire CLI fixes	2025-01-10 18:24:16 +00:00
Dan Saunders	7aca08ff60	adding guard statements	2025-01-10 16:39:21 +00:00
Dan Saunders	4f804f6d88	adding diff attn callback, adding documentation	2025-01-10 16:28:51 +00:00
Dan Saunders	443327c585	CLI build_command bugfix	2025-01-10 16:28:51 +00:00
Dan Saunders	70c4e6fbe6	updates and cleanup	2025-01-10 16:28:51 +00:00
Dan Saunders	2a7f139ad2	pre-commit fix	2025-01-10 16:28:51 +00:00
Dan Saunders	332ce0ae85	fixes and cleanup	2025-01-10 16:28:51 +00:00
Dan Saunders	e5fa842ff8	update	2025-01-10 16:28:51 +00:00
Dan Saunders	78e0ec0aa5	changes	2025-01-10 16:28:51 +00:00
Dan Saunders	3bc568eb27	adding registration function	2025-01-10 16:28:51 +00:00
Dan Saunders	eb6611d55f	progress on modeling code	2025-01-10 16:28:51 +00:00
Dan Saunders	4ff3328e66	updated custom modeling code	2025-01-10 16:28:51 +00:00
Dan Saunders	a3fd5074a9	fix duplicate-code warnings	2025-01-10 16:28:51 +00:00
Dan Saunders	5b90da0be3	added modeling code; cleanup + refactor	2025-01-10 16:28:51 +00:00
Dan Saunders	fcbfa86373	refactor and fixing test isolation issues	2025-01-10 16:28:51 +00:00
Dan Saunders	0d56582090	adding yaml dumper preserving input config format	2025-01-10 16:28:51 +00:00
Dan Saunders	390cb5742e	removing extra pytest xdist args	2025-01-10 16:28:51 +00:00
Dan Saunders	1d935f65c3	moving tests around for flash_attn install	2025-01-10 16:28:51 +00:00
Dan Saunders	66176b3e07	adding split_heads argument for retaining original (Q, K) dimensionanlity	2025-01-10 16:28:51 +00:00
Dan Saunders	505321ac95	isolating problematic test	2025-01-10 16:28:51 +00:00
Dan Saunders	0b382c88da	fixes post-rebase	2025-01-10 16:28:51 +00:00
Dan Saunders	ea07a7086e	plugin implementation	2025-01-10 16:28:51 +00:00
Dan Saunders	d22e1136bc	convert-differential-transformer test coverage	2025-01-10 16:28:51 +00:00
Dan Saunders	63b8e42c6b	duplicate code ignore	2025-01-10 16:28:51 +00:00
Dan Saunders	bda1eed59e	differential flash attention 2; cleanup	2025-01-10 16:28:51 +00:00
Dan Saunders	41ebd93158	moving monkeypatch	2025-01-10 16:28:51 +00:00
Dan Saunders	4c050ce807	pre-commit fix	2025-01-10 16:28:51 +00:00
Dan Saunders	6665acf63d	fix model save / load logic	2025-01-10 16:28:51 +00:00
Dan Saunders	2f9fa4c465	various improvemnents	2025-01-10 16:28:51 +00:00
Dan Saunders	849bc94112	various improvemnents	2025-01-10 16:28:51 +00:00
Dan Saunders	e484ec778d	training fixes, patching, minor cleanup	2025-01-10 16:28:51 +00:00
Dan Saunders	df1504ae14	adding CLI command for convert-diff-transformer	2025-01-10 16:28:51 +00:00
Dan Saunders	7be0d7496c	Adding script for doing conversion; fixes and updates	2025-01-10 16:28:51 +00:00
Dan Saunders	13cdffa91f	initial diff attn layer / model conversion implementation (support for llama arch)	2025-01-10 16:28:51 +00:00
Dan Saunders	7a4b296f60	Basic evaluate CLI command / codepath (#2188) * basic evaluate CLI command / codepath * tests for evaluate CLI command * fixes and cleanup * review comments; slightly DRYing up things --------- Co-authored-by: Dan Saunders <danjsaund@gmail.com>	2025-01-10 16:28:51 +00:00