Compare commits
2 Commits
| Author | SHA1 | Date |
|---|---|---|
| | 6c49083d8b | |
| | 94c226edb3 | |
@@ -519,8 +519,8 @@ See [examples](examples) for quick start. It is recommended to duplicate and mod
train_on_split: validation

# loading from s3 or gcs
# s3 creds will be loaded from the system default / gcs will attempt to load from gcloud creds, google metadata service, or anon
- path: s3://path_to_ds # Accepts folder with arrow/parquet or file path like above
# s3 creds will be loaded from the system default and gcs only supports public access
- path: s3://path_to_ds # Accepts folder with arrow/parquet or file path like above. Supports s3, gcs.
...

# Loading Data From a Public URL
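For context, a minimal sketch of what loading such a path boils down to with the `datasets` library, assuming `s3fs` is installed; the bucket and glob below are placeholders, not part of this diff:

```python
# Illustration only: the bucket path is a placeholder. Credentials resolve from
# the system default (env vars / AWS profile), as the comments above describe.
from datasets import load_dataset

ds = load_dataset(
    "parquet",
    data_files="s3://my-bucket/train/*.parquet",
    split="train",
)
```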
64 _quarto.yml
@@ -19,47 +19,35 @@ website:
      href: https://discord.gg/7m9sfhzaf3

  sidebar:
    pinned: true
    collapse-level: 2
    style: docked
    contents:
      - text: Home
        href: index.qmd
      - section: "How-To Guides"
        contents:
          - docs/debugging.qmd
          - docs/multipack.qmd
          - docs/fsdp_qlora.qmd
          - docs/input_output.qmd
          - docs/rlhf.qmd
          - docs/nccl.qmd
          - docs/mac.qmd
          - docs/multi-node.qmd
          - docs/unsloth.qmd
          - docs/amd_hpc.qmd
      - section: "Dataset Formats"
        contents: docs/dataset-formats/*
      - section: "Reference"
        contents:
          - docs/config.qmd
          - section: "API Reference"
            contents: "{{ api_contents }}"
      - text: "FAQ"
        href: docs/faq.qmd
    pinned: true
    collapse-level: 2
    style: docked
    contents:
      - text: Home
        href: index.qmd
      - section: "How-To Guides"
        contents:
          # TODO Edit folder structure after we have more docs.
          - docs/debugging.qmd
          - docs/multipack.qmd
          - docs/fsdp_qlora.qmd
          - docs/input_output.qmd
          - docs/rlhf.qmd
          - docs/nccl.qmd
          - docs/mac.qmd
          - docs/multi-node.qmd
          - docs/unsloth.qmd
          - docs/amd_hpc.qmd
      - section: "Dataset Formats"
        contents: docs/dataset-formats/*
      - section: "Reference"
        contents:
          - docs/config.qmd
          - docs/faq.qmd


format:
  html:
    theme: materia
    css: styles.css
    toc: true

quartodoc:
  package: axolotl
  parser: google
  dir: api
  sections:
    - title: Core API
      desc: Core functionality of Axolotl

metadata-files:
  - api/_sidebar.yml
17 _sidebar.yml
@@ -1,17 +0,0 @@
website:
  sidebar:
    - collapse-level: 2
      contents:
        - href: introduction.qmd
          text: Introduction
        - contents:
            - reference/index.qmd
            - contents: []
              section: axolotl
          section: Reference
        - href: basics-summary.qmd
          text: Basics
      id: reference
      search: true
      style: docked
    - id: dummy-sidebar
@@ -1,11 +0,0 @@
# ConstantLengthDataset { #axolotl.ConstantLengthDataset }

```python
ConstantLengthDataset(self, tokenizer, datasets, seq_length=2048)
```

Iterable dataset that returns constant length chunks of tokens from stream of text files.

Args:
    tokenizer (Tokenizer): The processor used for processing the data.
    datasets (dataset.Dataset): Datasets with text files.
    seq_length (int): Length of token sequences to return.
@@ -1,19 +0,0 @@
# TokenizedPromptDataset { #axolotl.TokenizedPromptDataset }

```python
TokenizedPromptDataset(
    self,
    prompt_tokenizer,
    dataset,
    process_count=None,
    keep_in_memory=False,
    **kwargs,
)
```

Dataset that returns tokenized prompts from a stream of text files.

Args:
    prompt_tokenizer (PromptTokenizingStrategy): The prompt tokenizing method for processing the data.
    dataset (dataset.Dataset): Dataset with text files.
    process_count (int): Number of processes to use for tokenizing.
    keep_in_memory (bool): Whether to keep the tokenized dataset in memory.
@@ -1,28 +0,0 @@
# choose_config { #axolotl.choose_config }

```python
choose_config(path)
```

Helper method for choosing an `axolotl` config YAML file (considering only files
ending with `.yml` or `.yaml`). If more than one config file exists in the passed
`path`, the user is prompted to choose one.

## Parameters {.doc-section .doc-section-parameters}

| Name | Type | Description                                   | Default    |
|------|------|-----------------------------------------------|------------|
| path | Path | Directory in which config file(s) are stored. | _required_ |

## Returns {.doc-section .doc-section-returns}

| Name | Type | Description                                                                                                  |
|------|------|--------------------------------------------------------------------------------------------------------------|
|      | str  | Path to either (1) the sole YAML file, or (2) if more than one YAML file exists, the user-selected YAML file. |

## Raises {.doc-section .doc-section-raises}

| Name | Type       | Description                                     |
|------|------------|-------------------------------------------------|
|      | ValueError | If no YAML files are found in the given `path`. |
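Per the tables above, typical usage reduces to the following sketch (the directory name is a placeholder):

```python
# Sketch of the documented behavior: pick a config YAML from a directory,
# prompting the user if more than one candidate exists.
from pathlib import Path

from axolotl import choose_config

try:
    config_path = choose_config(Path("configs/"))  # placeholder directory
    print(f"using config: {config_path}")
except ValueError:
    # raised when no .yml/.yaml files are found, per the Raises table
    print("no YAML config found")
```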
@@ -1,5 +0,0 @@
# Function reference {.doc .doc-index}

## Core API

Core functionality of Axolotl
@@ -1,21 +0,0 @@
# load_cfg { #axolotl.load_cfg }

```python
load_cfg(config=Path('examples/'), **kwargs)
```

Loads the `axolotl` configuration stored at `config`, validates it, and performs
various setup.

## Parameters {.doc-section .doc-section-parameters}

| Name   | Type               | Description                                                   | Default             |
|--------|--------------------|---------------------------------------------------------------|---------------------|
| config | Union\[str, Path\] | Path (local or remote) to `axolotl` config YAML file.         | `Path('examples/')` |
| kwargs |                    | Additional keyword arguments to override config file values.  | `{}`                |

## Returns {.doc-section .doc-section-returns}

| Name | Type        | Description                                          |
|------|-------------|------------------------------------------------------|
|      | DictDefault | `DictDefault` mapping configuration keys to values.  |
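A usage sketch of the documented API, with a kwarg override; the YAML path is a placeholder:

```python
# Sketch only: load and validate a config, overriding one value via kwargs.
from axolotl import load_cfg

cfg = load_cfg("path/to/config.yml", learning_rate=1e-5)  # placeholder path
print(type(cfg).__name__)  # DictDefault, per the Returns table above
```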
@@ -1,5 +0,0 @@
# validate_config { #axolotl.validate_config }

```python
validate_config(cfg, capabilities=None, env_capabilities=None)
```
@@ -6,6 +6,5 @@ python -c "import torch; assert '$PYTORCH_VERSION' in torch.__version__"
pytest -v --durations=10 -n8 --ignore=tests/e2e/ --ignore=tests/patched/ /workspace/axolotl/tests/
# pytest -v --durations=10 -n8 --dist loadfile /workspace/axolotl/tests/patched/
pytest -v --durations=10 /workspace/axolotl/tests/e2e/patched/
pytest -v --durations=10 -n1 /workspace/axolotl/tests/e2e/solo/
pytest -v --durations=10 /workspace/axolotl/tests/e2e/integrations/
pytest -v --durations=10 --ignore=tests/e2e/solo/ --ignore=tests/e2e/patched/ --ignore=tests/e2e/multigpu/ --ignore=tests/e2e/integrations/ /workspace/axolotl/tests/e2e/
pytest -v --durations=10 --ignore=tests/e2e/patched/ --ignore=tests/e2e/multigpu/ --ignore=tests/e2e/integrations/ /workspace/axolotl/tests/e2e/
@@ -360,11 +360,10 @@ warmup_ratio: 0.05 # cannot use with warmup_steps
learning_rate: 0.00003
lr_quadratic_warmup:
logging_steps:
eval_steps: # Leave empty to eval at each epoch, integer for every N steps. float for fraction of total steps
eval_steps: # Leave empty to eval at each epoch, integers for every N steps. decimal for fraction of total steps
evals_per_epoch: # number of times per epoch to run evals, mutually exclusive with eval_steps
eval_strategy: # Set to `"no"` to skip evaluation, `"epoch"` at end of each epoch, leave empty to infer from `eval_steps`.
save_strategy: # Set to `"no"` to skip checkpoint saves, `"epoch"` at end of each epoch, `"best"` when better result is achieved, leave empty to infer from `save_steps`.
save_steps: # Leave empty to save at each epoch, integer for every N steps. float for fraction of total steps
save_strategy: # Set to `"no"` to skip checkpoint saves
save_steps: # Leave empty to save at each epoch
saves_per_epoch: # number of times per epoch to save a checkpoint, mutually exclusive with save_steps
save_total_limit: # Checkpoints saved at a time
# Maximum number of iterations to train for. It precedes num_epochs which means that
@@ -1,29 +0,0 @@
---
title: Learning Rate Groups
description: "Setting different learning rates by module name"
---

## Background

Inspired by LoRA+, Axolotl allows practitioners to specify separate learning rates for each module or group of
modules in a model.

## Example

```yaml
lr_groups:
  - name: o_proj
    modules:
      - self_attn.o_proj.weight
    lr: 1e-6
  - name: q_proj
    modules:
      - model.layers.2.self_attn.q_proj.weight
    lr: 1e-5

learning_rate: 2e-5
```

In this example, we have a default learning rate of 2e-5 across the entire model, but a separate learning rate
of 1e-6 for all the self-attention `o_proj` modules across all layers, and a learning rate of 1e-5 for the 3rd layer's
self-attention `q_proj` module.
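The matching rule implied here (substring match of each configured module name against the full parameter name) is made explicit in the trainer changes later in this diff; a standalone sketch of the lookup, with illustrative parameter names:

```python
# Standalone sketch of the lr_groups name-matching rule; parameter names
# below are illustrative, not taken from a specific model.
lr_groups = [
    {"name": "o_proj", "modules": ["self_attn.o_proj.weight"], "lr": 1e-6},
    {"name": "q_proj", "modules": ["model.layers.2.self_attn.q_proj.weight"], "lr": 1e-5},
]
default_lr = 2e-5


def lr_for(param_name: str) -> float:
    for group in lr_groups:
        # substring match, mirroring `if group_modules in name` in the trainer diff
        if any(module in param_name for module in group["modules"]):
            return group["lr"]
    return default_lr


assert lr_for("model.layers.0.self_attn.o_proj.weight") == 1e-6
assert lr_for("model.layers.2.self_attn.q_proj.weight") == 1e-5
assert lr_for("model.layers.1.mlp.gate_proj.weight") == default_lr
```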
@@ -1 +0,0 @@
{"project": "axolotl", "version": "0.0.9999", "count": 0, "items": []}
@@ -1,3 +0,0 @@
# API Reference {.doc .doc-index}

## Core API
@@ -2,5 +2,3 @@ pre-commit
black
mypy
types-requests
quartodoc
quarto-cli
@@ -13,9 +13,9 @@ liger-kernel==0.5.2
packaging==23.2

peft==0.14.0
transformers==4.48.1
transformers==4.47.1
tokenizers>=0.21.0
accelerate==1.3.0
accelerate==1.2.1
datasets==3.2.0
deepspeed==0.16.1
trl==0.13.0
@@ -2,20 +2,6 @@
import pkgutil

from .cli.config import choose_config, load_cfg, validate_config
from .datasets import ConstantLengthDataset, TokenizedPromptDataset
from .evaluate import evaluate
from .train import train

__path__ = pkgutil.extend_path(__path__, __name__)  # Make this a namespace package
__version__ = "0.6.0"

__all__ = [
    "train",
    "evaluate",
    "TokenizedPromptDataset",
    "ConstantLengthDataset",
    "load_cfg",
    "choose_config",
    "validate_config",
]
__version__ = "0.6.0"
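As a quick sanity check of the public surface defined above (everything imported here comes straight from the `__all__` in this hunk):

```python
# Import check for the re-exported public API listed in __all__ above.
import axolotl
from axolotl import (
    ConstantLengthDataset,
    TokenizedPromptDataset,
    choose_config,
    evaluate,
    load_cfg,
    train,
    validate_config,
)

print(axolotl.__version__)  # "0.6.0" per this diff
```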
@@ -243,10 +243,6 @@ class AxolotlTrainingMixins:
        default=None,
        metadata={"help": "Scale the learning rate for the embedding layers."},
    )
    lr_groups: Optional[list[dict]] = field(
        default=None,
        metadata={"help": "Specify learning rate groups with different LRs."},
    )
    embedding_lr: Optional[float] = field(
        default=None,
        metadata={"help": "absolute learning rate for the embedding layers."},
@@ -465,95 +461,11 @@ class AxolotlTrainer(SchedulerMixin, Trainer):
            )
        return super()._wrap_model(model, training=training, dataloader=dataloader)

    def create_optimizer_grouped_parameters(self, opt_model, optimizer_kwargs):
        decay_parameters = self.get_decay_parameter_names(opt_model)
        params = {
            "to_weight_decay": {},  # LayerNorm and bias
            "embeddings": {},  # lm_head, embed_tokens,
            "no_weight_decay": {},
        }
        lr_groups_lookup = {}
        lr_groups_learning_rates = {}
        if self.args.lr_groups:
            for lr_group in self.args.lr_groups:
                group_name = lr_group["name"]
                group_modules = lr_group["modules"]
                for module in group_modules:
                    lr_groups_lookup[module] = group_name
                lr_groups_learning_rates[group_name] = lr_group["lr"]
                params[f"to_weight_decay_{group_name}"] = {}

        for name, param in opt_model.named_parameters():
            if not param.requires_grad:
                continue
            if name.endswith("modules_to_save.default.weight") or any(
                embed_name in name for embed_name in ["embed_tokens", "lm_head"]
            ):
                params["embeddings"][name] = param
            elif name in decay_parameters:
                lr_group_modules = [
                    group_modules
                    for group_modules in lr_groups_lookup
                    if group_modules in name
                ]
                if lr_groups_lookup and any(lr_group_modules):
                    lr_group_module = lr_group_modules[0]
                    group_name = lr_groups_lookup[lr_group_module]
                    params[f"to_weight_decay_{group_name}"][name] = param
                else:
                    params["to_weight_decay"][name] = param
            else:
                params["no_weight_decay"][name] = param
        optimizer_grouped_parameters = []
        if params["to_weight_decay"]:
            optimizer_grouped_parameters.append(
                {
                    "params": list(params["to_weight_decay"].values()),
                    "weight_decay": self.args.weight_decay,
                    "lr": optimizer_kwargs["lr"],
                }
            )
        if params["embeddings"]:
            lr = optimizer_kwargs["lr"]  # pylint: disable=invalid-name
            if self.args.embedding_lr_scale:
                lr *= self.args.embedding_lr_scale  # pylint: disable=invalid-name
            elif self.args.embedding_lr:
                lr = self.args.embedding_lr  # pylint: disable=invalid-name
            optimizer_grouped_parameters.append(
                {
                    "params": list(params["embeddings"].values()),
                    "weight_decay": 0.0,
                    "lr": lr,
                }
            )
        if params["no_weight_decay"]:
            optimizer_grouped_parameters.append(
                {
                    "params": list(params["no_weight_decay"].values()),
                    "weight_decay": 0.0,
                    "lr": optimizer_kwargs["lr"],
                }
            )
        for group_name, group_lr in lr_groups_learning_rates.items():
            if params[f"to_weight_decay_{group_name}"]:
                optimizer_grouped_parameters.append(
                    {
                        "params": list(
                            params[f"to_weight_decay_{group_name}"].values()
                        ),
                        "weight_decay": self.args.weight_decay,
                        "lr": group_lr,
                    }
                )

        return optimizer_grouped_parameters

    def create_optimizer(self):
        if (
            self.args.loraplus_lr_ratio is None
            and self.args.embedding_lr_scale is None
            and self.args.embedding_lr is None
            and self.args.lr_groups is None
            and self.args.alternate_optimizer
            not in [
                "optimi_adamw",
@@ -567,13 +479,59 @@ class AxolotlTrainer(SchedulerMixin, Trainer):
        opt_model = self.model_wrapped if is_sagemaker_mp_enabled() else self.model
        if self.optimizer is None:  # pylint: disable=access-member-before-definition
            decay_parameters = self.get_decay_parameter_names(opt_model)
            params = {
                "to_weight_decay": {},  # LayerNorm and bias
                "embeddings": {},  # lm_head, embed_tokens,
                "no_weight_decay": {},
            }

            optimizer_cls, optimizer_kwargs = Trainer.get_optimizer_cls_and_kwargs(
                self.args,
                opt_model,
            )
            optimizer_grouped_parameters = self.create_optimizer_grouped_parameters(
                opt_model, optimizer_kwargs
            )

            for name, param in opt_model.named_parameters():
                if not param.requires_grad:
                    continue
                if name.endswith("modules_to_save.default.weight") or any(
                    embed_name in name for embed_name in ["embed_tokens", "lm_head"]
                ):
                    params["embeddings"][name] = param
                elif name in decay_parameters:
                    params["to_weight_decay"][name] = param
                else:
                    params["no_weight_decay"][name] = param
            optimizer_grouped_parameters = []
            if params["to_weight_decay"]:
                optimizer_grouped_parameters.append(
                    {
                        "params": list(params["to_weight_decay"].values()),
                        "weight_decay": self.args.weight_decay,
                        "lr": optimizer_kwargs["lr"],
                    }
                )
            if params["embeddings"]:
                lr = optimizer_kwargs["lr"]  # pylint: disable=invalid-name
                if self.args.embedding_lr_scale:
                    lr *= self.args.embedding_lr_scale  # pylint: disable=invalid-name
                elif self.args.embedding_lr:
                    lr = self.args.embedding_lr  # pylint: disable=invalid-name
                optimizer_grouped_parameters.append(
                    {
                        "params": list(params["embeddings"].values()),
                        "weight_decay": 0.0,
                        "lr": lr,
                    }
                )
            if params["no_weight_decay"]:
                optimizer_grouped_parameters.append(
                    {
                        "params": list(params["no_weight_decay"].values()),
                        "weight_decay": 0.0,
                        "lr": optimizer_kwargs["lr"],
                    }
                )

            if self.args.loraplus_lr_ratio is not None:
                loraplus_lr_ratio = getattr(self.args, "loraplus_lr_ratio", None)
@@ -590,7 +548,6 @@ class AxolotlTrainer(SchedulerMixin, Trainer):
            elif (
                self.args.embedding_lr_scale is not None
                or self.args.embedding_lr is not None
                or self.args.lr_groups is not None
            ):
                self.optimizer = (  # pylint: disable=attribute-defined-outside-init
                    optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)
@@ -1122,7 +1079,6 @@ class AxolotlDPOTrainer(SchedulerMixin, DPOTrainer):
        super().__init__(*args, **kwargs)
        self.dataset_tags = dataset_tags
        self.optimizer = None
        self.model_accepts_loss_kwargs = False

    def create_optimizer(self):
        if self.args.loraplus_lr_ratio is None:
@@ -1708,7 +1664,6 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
        ] = self.cfg.loraplus_lr_embedding
        training_arguments_kwargs["embedding_lr"] = self.cfg.embedding_lr
        training_arguments_kwargs["embedding_lr_scale"] = self.cfg.embedding_lr_scale
        training_arguments_kwargs["lr_groups"] = self.cfg.lr_groups

        if self.cfg.lr_scheduler in ["one_cycle", "log_sweep"]:
            training_arguments_kwargs["lr_scheduler_type"] = "cosine"
@@ -1924,8 +1879,6 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
        if training_args.pretraining:
            if self.cfg.pretraining_sample_concatenation is False:
                return DataCollatorForSeq2Seq(self.tokenizer, **kwargs)
            if self.cfg.micro_batch_size > 1:
                return DataCollatorForSeq2Seq(self.tokenizer, **kwargs)
            return None

        if self.cfg.model_config_type == "mamba":
308 src/axolotl/monkeypatch/trainer_grad_accum.py Normal file
@@ -0,0 +1,308 @@
"""
fix for FSDP gradient accumulation
see https://github.com/huggingface/transformers/pull/35128
"""
import inspect
import logging

from transformers import LlamaForCausalLM, Trainer
from transformers.modeling_flash_attention_utils import _flash_attention_forward

from axolotl.monkeypatch.utils import detab_code

LOG = logging.getLogger("axolotl.monkeypatch.trainer_grad_accum")

ORIGINAL_CONTEXT_CODE = """
    with self.compute_loss_context_manager():
        loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
"""

PATCHED_CONTEXT_CODE = """
    with self.compute_loss_context_manager():
        if self.model_accepts_loss_kwargs:
            loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
        else:
            loss = self.compute_loss(model, inputs)
"""

ORIGINAL_LLAMA_FCLM_CODE = """
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
    outputs = self.model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_values=past_key_values,
        inputs_embeds=inputs_embeds,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
        cache_position=cache_position,
        **kwargs,
    )

    hidden_states = outputs[0]
    # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
    logits = self.lm_head(hidden_states[:, -num_logits_to_keep:, :])

    loss = None
    if labels is not None:
        loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)
"""

PATCHED_LLAMA_FCLM_CODE = """
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    # remove num_items_in_batch otherwise self.model attempts to pass it to flash_attention
    num_items_in_batch = kwargs.pop("num_items_in_batch", None)

    # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
    outputs = self.model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_values=past_key_values,
        inputs_embeds=inputs_embeds,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
        cache_position=cache_position,
        **kwargs,
    )
    hidden_states = outputs[0]
    # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
    logits = self.lm_head(hidden_states[:, -num_logits_to_keep:, :])

    loss = None
    if labels is not None:
        loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, num_items_in_batch=num_items_in_batch, **kwargs)
"""


def get_training_step_code() -> str:
    training_step = inspect.getsource(
        Trainer.training_step  # pylint: disable=protected-access
    )
    return training_step


def check_training_step_is_patchable() -> bool:
    training_step = get_training_step_code()
    training_step, _ = detab_code(training_step)
    return ORIGINAL_CONTEXT_CODE in training_step


def patch_training_step_for_ga():
    """
    monkeypatch for fixing the training loop for gradient accumulation
    """

    try:
        training_step = get_training_step_code()
    except OSError:
        return
    Trainer._original_training_step = training_step  # pylint: disable=protected-access
    training_step, _ = detab_code(training_step)
    if ORIGINAL_CONTEXT_CODE not in training_step:
        return
    # assert (
    #     ORIGINAL_CONTEXT_CODE in training_step
    # ), "Original training_step code not found"

    training_step = training_step.replace(ORIGINAL_CONTEXT_CODE, PATCHED_CONTEXT_CODE)
    training_step = training_step.replace(
        "def training_step(",
        "def _fixed_training_step(",
        1,
    )

    # load imports necessary
    import transformers.trainer

    items_to_import = []
    for item in dir(transformers.trainer):
        if item in training_step:
            items_to_import.append(item)

    exec(  # pylint: disable=exec-used  # nosec B102
        "from transformers.trainer import ("
        + ", ".join(x for x in items_to_import)
        + ")",
        globals(),
    )
    exec(training_step, globals())  # pylint: disable=exec-used  # nosec B102
    LOG.info("patching training_step")
    Trainer.training_step = (  # pylint: disable=protected-access
        _fixed_training_step  # pylint: disable=undefined-variable  # noqa: F821
    )


def get_model_forward_code() -> str:
    forward = inspect.getsource(
        LlamaForCausalLM.forward  # pylint: disable=protected-access
    )
    return forward


def check_forward_is_patchable() -> bool:
    forward = get_model_forward_code()
    forward, _ = detab_code(forward)
    return ORIGINAL_LLAMA_FCLM_CODE in forward


def patch_forward_for_ga():
    """
    monkeypatch for fixing the training loop for gradient accumulation
    """

    try:
        forward = get_model_forward_code()
    except OSError:
        return
    LlamaForCausalLM._original_forward = forward  # pylint: disable=protected-access
    forward, _ = detab_code(forward)
    if ORIGINAL_LLAMA_FCLM_CODE not in forward:
        return
    # assert ORIGINAL_LLAMA_FCLM_CODE in forward, "Original forward code not found"

    forward = forward.replace(ORIGINAL_LLAMA_FCLM_CODE, PATCHED_LLAMA_FCLM_CODE)
    forward = forward.replace(
        "def forward(",
        "def _fixed_forward(",
        1,
    )

    # load imports necessary
    import transformers.models.llama.modeling_llama

    items_to_import = []
    for item in dir(transformers.models.llama.modeling_llama):
        if item in forward:
            items_to_import.append(item)

    exec(  # pylint: disable=exec-used  # nosec B102
        "from transformers.models.llama.modeling_llama import ("
        + ", ".join(x for x in items_to_import)
        + ")",
        globals(),
    )
    exec(forward, globals())  # pylint: disable=exec-used  # nosec B102
    LOG.info("patching forward")
    LlamaForCausalLM.forward = (  # pylint: disable=protected-access
        _fixed_forward  # pylint: disable=undefined-variable  # noqa: F821
    )


ORIGINAL_TRAINER_CODE = """
    context = (
        functools.partial(self.accelerator.no_sync, model=model)
        if i != len(batch_samples) - 1
        else contextlib.nullcontext
    )
    with context():
        tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
"""

PATCHED_TRAINER_CODE = """
    disable_deepspeed_no_sync = (
        self.accelerator.distributed_type == DistributedType.DEEPSPEED
        # and self.accelerator.deepspeed_engine_wrapped.engine.zero_optimization_partition_gradients()
    )
    context = (
        functools.partial(self.accelerator.no_sync, model=model)
        if i != len(batch_samples) - 1 and not disable_deepspeed_no_sync
        else contextlib.nullcontext
    )
    with context():
        tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
"""


def get_training_loop_code() -> str:
    training_loop = inspect.getsource(
        Trainer._inner_training_loop  # pylint: disable=protected-access
    )
    return training_loop


def check_training_loop_is_patchable() -> bool:
    training_loop = get_training_loop_code()
    training_loop, _ = detab_code(training_loop)
    return ORIGINAL_TRAINER_CODE in training_loop


def patch_training_loop_for_deepspeed_0_16_x():
    """
    monkeypatch for fixing the training loop for deepspeed GA

    see https://github.com/huggingface/transformers/pull/35157
    """

    try:
        training_loop = get_training_loop_code()
    except OSError:
        return
    Trainer._original_inner_training_loop = (  # pylint: disable=protected-access
        training_loop
    )
    training_loop, _ = detab_code(training_loop)
    if ORIGINAL_TRAINER_CODE not in training_loop:
        return

    training_loop = training_loop.replace(ORIGINAL_TRAINER_CODE, PATCHED_TRAINER_CODE)
    training_loop = training_loop.replace(
        "def _inner_training_loop(",
        "def _fixed_inner_training_loop(",
        1,
    )

    # load imports necessary
    import transformers.trainer

    items_to_import = []
    for item in dir(transformers.trainer):
        if item in training_loop:
            items_to_import.append(item)

    exec(  # pylint: disable=exec-used  # nosec B102
        "from transformers.trainer import ("
        + ", ".join(x for x in items_to_import)
        + ")",
        globals(),
    )
    exec(training_loop, globals())  # pylint: disable=exec-used  # nosec B102
    LOG.info("patching _inner_training_loop for fsdp optimizer save")
    Trainer._inner_training_loop = (  # pylint: disable=protected-access
        _fixed_inner_training_loop  # pylint: disable=undefined-variable  # noqa: F821
    )


def patch_flash_attention_forward():
    """
    monkeypatch for fixing the forward pass for flash attention to ignore num_items_in_batch
    """

    import transformers.modeling_flash_attention_utils

    def proxy_flash_attention_forward(*args, **kwargs):
        kwargs.pop("num_items_in_batch", None)

        return _flash_attention_forward(*args, **kwargs)

    transformers.modeling_flash_attention_utils._flash_attention_forward = (  # pylint: disable=protected-access
        proxy_flash_attention_forward
    )
    transformers.models.llama.modeling_llama._flash_attention_forward = (  # pylint: disable=protected-access
        proxy_flash_attention_forward
    )
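The file above repeats one pattern three times: fetch the source with `inspect.getsource`, check that the expected snippet is still present, rewrite it textually, `exec` the result, and rebind the attribute. A distilled sketch of that pattern on a toy function (nothing here is transformers API; it assumes the code lives in a file so `inspect.getsource` can find it):

```python
# Distilled sketch of the source-rewriting monkeypatch pattern used above,
# applied to a toy module-level function instead of transformers internals.
import inspect


def greet():
    return "hello"


source = inspect.getsource(greet)
assert '"hello"' in source  # analogous to the check_*_is_patchable() helpers

patched = source.replace('"hello"', '"patched hello"')
patched = patched.replace("def greet(", "def _fixed_greet(", 1)

namespace = {}
exec(patched, namespace)  # same exec-and-rebind step as the patches above
greet = namespace["_fixed_greet"]
assert greet() == "patched hello"
```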
@@ -1,67 +0,0 @@
"""
see https://github.com/huggingface/transformers/pull/35834
"""

import logging
from functools import partial
from typing import Optional

import torch

logger = logging.getLogger(__name__)


def fixed_fa_peft_integration_check(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    target_dtype: Optional[torch.dtype] = None,
    preferred_dtype: Optional[torch.dtype] = None,
):
    """
    PEFT usually casts the layer norms in float32 for training stability reasons
    therefore the input hidden states gets silently casted in float32. Hence, we need
    cast them back in float16 / bfloat16 just to be sure everything works as expected.
    This might slowdown training & inference so it is recommended to not cast the LayerNorms!

    Args:
        query (`torch.Tensor`):
            Input query states to be passed to Flash Attention API
        key (`torch.Tensor`):
            Input key states to be passed to Flash Attention API
        value (`torch.Tensor`):
            Input value states to be passed to Flash Attention API
        target_dtype (`torch.dtype`, *optional*):
            The dtype to convert the attention tensors to. Conversion can be ignored by
            not providing the target dtype.
        preferred_dtype (`torch.dtype`, *optional*):
            The preferred dtype to convert the attention tensors to regardless of the
            target dtype.
    """
    if target_dtype is None and preferred_dtype is None:
        return query, key, value

    if preferred_dtype and target_dtype != preferred_dtype:
        target_dtype = preferred_dtype

    # check if any of query, key, or value are in float32. If so, cast them back to target dtype.
    if any(module.dtype == torch.float32 for module in [query, key, value]):
        logger.warning_once(
            f"The input hidden states seems to be silently casted in float32, this might be related to"
            f" the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in"
            f" {target_dtype}."
        )

        query = query.to(target_dtype)
        key = key.to(target_dtype)
        value = value.to(target_dtype)

    return query, key, value


def patch_fa_peft_integration():
    import transformers.modeling_flash_attention_utils

    transformers.modeling_flash_attention_utils.fa_peft_integration_check = partial(
        fixed_fa_peft_integration_check, preferred_dtype=None
    )
@@ -223,7 +223,7 @@ class ChatTemplateStrategy(PromptTokenizingStrategy):
    def tokenize_prompt(self, prompt):
        # Old simple legacy behavior that works reliably.
        if (
            not self.roles_to_train
            (not self.roles_to_train or self.roles_to_train == ["assistant"])
            and not self.train_on_eos
            and not self.prompter.message_field_training
            and not self.prompter.message_field_training_detail
@@ -147,14 +147,6 @@ class UserDefinedPrompterType(BaseModel):
    field: Optional[str] = None


class LrGroup(BaseModel):
    """Custom learning rate group configuration"""

    name: str
    modules: List[str]
    lr: float


class SFTDataset(BaseModel):
    """SFT configuration subset"""
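For reference, the `LrGroup` model shown in this hunk can be exercised on its own; a minimal check assuming pydantic v2 semantics (which `BaseModel` here implies):

```python
# Minimal standalone check of the LrGroup schema shown in this hunk,
# assuming pydantic v2 semantics.
from typing import List

from pydantic import BaseModel


class LrGroup(BaseModel):
    """Custom learning rate group configuration"""

    name: str
    modules: List[str]
    lr: float


group = LrGroup(name="o_proj", modules=["self_attn.o_proj.weight"], lr=1e-6)
print(group.model_dump())  # {'name': 'o_proj', 'modules': [...], 'lr': 1e-06}
```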
@@ -483,7 +475,6 @@ class HyperparametersConfig(BaseModel):
    cosine_min_lr_ratio: Optional[float] = None
    cosine_constant_lr_ratio: Optional[float] = None
    lr_div_factor: Optional[float] = None
    lr_groups: Optional[List[LrGroup]] = None

    adam_epsilon: Optional[float] = None
    adam_beta1: Optional[float] = None
@@ -191,7 +191,7 @@ def wrap_pretraining_dataset(
        tokenizer,
        return_tensors="pt",
        padding=True,
        pad_to_multiple_of=max_tokens,
        pad_to_multiple_of=max_tokens * batch_size,
        multipack_attn=cfg.pretrain_multipack_attn,
    )
    encode = functools.partial(
@@ -201,6 +201,8 @@ def wrap_pretraining_dataset(
        max_seq_length=max_tokens,
        batch_size=batch_size,
        multipack_attn=cfg.pretrain_multipack_attn,
        group_size=cfg.sample_packing_group_size,
        bin_size=cfg.sample_packing_bin_size,
    )
    # set this to 1 so downstream data_loader doesn't try to increase the batch again
    cfg.micro_batch_size = 1
@@ -245,7 +247,9 @@ def encode_packed_pretraining(
    examples: Dict[str, List],
    max_seq_length: int = 2048,
    batch_size: int = 4,
    multipack_attn: Optional[bool] = True,
    multipack_attn: Optional[bool] = False,
    group_size: int = 100000,
    bin_size: int = 200,
) -> Dict[str, List]:
    # pylint: disable=duplicate-code
    # tokenize all the examples
@@ -256,9 +260,6 @@ def encode_packed_pretraining(
        train_dataset,
        max_seq_length,
        skip_position_ids=not multipack_attn,
        # FIXME using attention mask unpad/pad with trainer and packed pretraining is broken atm
        # workaround by using the position id logic for now in trainer
        drop_attention_mask=multipack_attn,
    )

    sampler = MultipackBatchSampler(
@@ -266,6 +267,8 @@ def encode_packed_pretraining(
        lengths=get_dataset_lengths(train_dataset),
        batch_size=1,
        batch_max_len=batch_size * max_seq_length,
        group_size=group_size,
        bin_size=bin_size,
        drop_last=True,
    )
@@ -107,13 +107,6 @@ def load_dataset_w_config(config_dataset, auth_token):
    except (FileNotFoundError, ConnectionError):
        pass

    # gather extra args from the config
    load_ds_kwargs = {}
    if config_dataset.split:
        load_ds_kwargs["split"] = config_dataset.split
    else:
        load_ds_kwargs["split"] = None

    # prefer local dataset, even if hub exists
    local_path = Path(config_dataset.path)
    if local_path.exists():
@@ -125,7 +118,7 @@ def load_dataset_w_config(config_dataset, auth_token):
                name=config_dataset.name,
                data_files=config_dataset.data_files,
                streaming=False,
                **load_ds_kwargs,
                split=None,
            )
        else:
            try:
@@ -137,7 +130,7 @@ def load_dataset_w_config(config_dataset, auth_token):
                config_dataset.path,
                name=config_dataset.name,
                streaming=False,
                **load_ds_kwargs,
                split=None,
            )
        elif local_path.is_file():
            ds_type = get_ds_type(config_dataset)
@@ -147,13 +140,16 @@ def load_dataset_w_config(config_dataset, auth_token):
                name=config_dataset.name,
                data_files=config_dataset.path,
                streaming=False,
                **load_ds_kwargs,
                split=None,
            )
        else:
            raise ValueError(
                "unhandled dataset load: local path exists, but is neither a directory or a file"
            )
    elif ds_from_hub:
        load_ds_kwargs = {}
        if config_dataset.split:
            load_ds_kwargs["split"] = config_dataset.split
        ds = load_dataset(
            config_dataset.path,
            name=config_dataset.name,
@@ -177,9 +173,9 @@ def load_dataset_w_config(config_dataset, auth_token):
            name=config_dataset.name,
            data_files=config_dataset.path,
            streaming=False,
            split=None,
            storage_options=storage_options,
            trust_remote_code=config_dataset.trust_remote_code,
            **load_ds_kwargs,
        )
    elif config_dataset.path.startswith("https://"):
        ds_type = get_ds_type(config_dataset)
@@ -188,9 +184,9 @@ def load_dataset_w_config(config_dataset, auth_token):
            name=config_dataset.name,
            data_files=config_dataset.path,
            streaming=False,
            split=None,
            storage_options=storage_options,
            trust_remote_code=config_dataset.trust_remote_code,
            **load_ds_kwargs,
        )
    else:
        if isinstance(config_dataset.data_files, str):
@@ -218,7 +214,7 @@ def load_dataset_w_config(config_dataset, auth_token):
                name=config_dataset.name,
                data_files=fp,
                streaming=False,
                **load_ds_kwargs,
                split=None,
            )
    if not ds:
        raise ValueError("unhandled dataset load")
@@ -380,19 +380,23 @@ class ModelLoader:
        plugin_manager = PluginManager.get_instance()
        plugin_manager.pre_model_load(self.cfg)

        if self.cfg.adapter:
            from axolotl.monkeypatch.transformers_fa_utils import (
                patch_fa_peft_integration,
            )

            patch_fa_peft_integration()

        if self.cfg.gradient_checkpointing == "unsloth":
            transformers.modeling_utils.checkpoint = hf_grad_checkpoint_unsloth_wrapper

        if self.cfg.flash_attention:
            self.patch_attention()

        if self.cfg.model_config_type == "llama":
            from axolotl.monkeypatch.trainer_grad_accum import (
                patch_flash_attention_forward,
                patch_forward_for_ga,
                patch_training_step_for_ga,
            )

            patch_flash_attention_forward()
            patch_forward_for_ga()
            patch_training_step_for_ga()

        if self.cfg.sample_packing and self.cfg.s2_attention:
            raise ValueError(
                "Received `sample_packing=true` and `s2_attention=true`; however, \
@@ -310,22 +310,19 @@ def process_datasets_for_packing(cfg, train_dataset, eval_dataset):
def process_pretraining_datasets_for_packing(
    train_dataset, sequence_len, skip_position_ids=True, drop_attention_mask=False
    train_dataset, sequence_len, skip_position_ids=True
):
    drop_long = partial(drop_long_seq, sequence_len=sequence_len)

    train_dataset = train_dataset.filter(
        drop_long,
        desc="Dropping Long Sequences",
        load_from_cache_file=False,
    )
    if not skip_position_ids:
    if skip_position_ids:
        train_dataset = train_dataset.map(
            add_position_ids,
            desc="Add position_id column (Pretraining Sample Packing)",
        )
    if drop_attention_mask:
        train_dataset = train_dataset.remove_columns("attention_mask")

    return train_dataset
@@ -63,7 +63,6 @@ class TestMultiGPULlama:
                "lr_scheduler": "cosine",
                "flash_attention": True,
                "use_tensorboard": True,
                "bf16": True,
            }
        )
@@ -128,7 +127,6 @@ class TestMultiGPULlama:
                "lr_scheduler": "cosine",
                "flash_attention": True,
                "use_tensorboard": True,
                "bf16": True,
            }
        )
@@ -203,7 +201,6 @@ class TestMultiGPULlama:
                "lr_scheduler": "cosine",
                "flash_attention": True,
                "use_tensorboard": True,
                "bf16": True,
            }
        )
@@ -226,12 +223,8 @@ class TestMultiGPULlama:
            ]
        )

        loss_threshold = 2.3
        check_tensorboard(
            temp_dir + "/runs",
            "train/train_loss",
            loss_threshold,
            "Train Loss is too high",
            temp_dir + "/runs", "train/train_loss", 2.3, "Train Loss is too high"
        )

    def test_dpo_qlora_ddp(self, temp_dir):
@@ -282,7 +275,6 @@ class TestMultiGPULlama:
                "lr_scheduler": "cosine",
                "flash_attention": True,
                "use_tensorboard": True,
                "bf16": True,
            }
        )
@@ -305,12 +297,8 @@ class TestMultiGPULlama:
            ]
        )

        loss_threshold = 2.3
        check_tensorboard(
            temp_dir + "/runs",
            "train/train_loss",
            loss_threshold,
            "Train Loss is too high",
            temp_dir + "/runs", "train/train_loss", 2.3, "Train Loss is too high"
        )

    @pytest.mark.parametrize(
@@ -102,5 +102,9 @@ class TestMixtral(unittest.TestCase):
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

        train(cfg=cfg, dataset_meta=dataset_meta)
        model, _ = train(cfg=cfg, dataset_meta=dataset_meta)
        assert (
            "MixtralFlashAttention2"
            in model.model.layers[0].self_attn.__class__.__name__
        )
        check_model_output_exists(temp_dir, cfg)
@@ -49,7 +49,12 @@ class TestModelPatches(unittest.TestCase):
        )
        normalize_config(cfg)
        tokenizer = load_tokenizer(cfg)
        load_model(cfg, tokenizer, inference=False)
        model, _ = load_model(cfg, tokenizer, inference=False)

        assert (
            "MixtralFlashAttention2"
            in model.model.layers[0].self_attn.__class__.__name__
        )

    @with_temp_dir
    def test_mistral_multipack(self, temp_dir):
@@ -3,6 +3,8 @@ import unittest
import unittest

import pytest

from axolotl.monkeypatch.unsloth_ import check_self_attn_is_patchable


@pytest.mark.skip(
    reason="Unsloth integration will be broken going into latest transformers"
@@ -11,8 +13,6 @@ class TestUnslothIntegration(unittest.TestCase):
    """Unsloth monkeypatch integration tests."""

    def test_is_self_attn_patchable(self):
        from axolotl.monkeypatch.unsloth_ import check_self_attn_is_patchable

        # ensures the current version of transformers has loss code that matches our patching code
        self.assertTrue(
            check_self_attn_is_patchable(),
@@ -13,7 +13,7 @@ from axolotl.train import train
from axolotl.utils.config import normalize_config
from axolotl.utils.dict import DictDefault

from .utils import check_model_output_exists, check_tensorboard
from .utils import check_model_output_exists

LOG = logging.getLogger("axolotl.tests.e2e")
os.environ["WANDB_DISABLED"] = "true"
@@ -28,25 +28,19 @@ class TestPretrainLlama:
        "sample_packing",
        [True, False],
    )
    @pytest.mark.parametrize(
        "pretrain_multipack_attn",
        [True, False],
    )
    def test_pretrain(self, temp_dir, sample_packing, pretrain_multipack_attn):
        if not sample_packing and pretrain_multipack_attn:
            return

    def test_pretrain(self, temp_dir, sample_packing):
        # pylint: disable=duplicate-code
        cfg = DictDefault(
            {
                "base_model": "HuggingFaceTB/SmolLM2-135M",
                "base_model": "JackFram/llama-68m",
                "tokenizer_type": "LlamaTokenizer",
                "flash_attention": True,
                "sequence_len": 1024,
                "sample_packing": sample_packing,
                "pretrain_multipack_attn": pretrain_multipack_attn,
                "dataset_processes": 1,
                "special_tokens": {
                    "pad_token": "<|endoftext|>",
                    "unk_token": "<unk>",
                    "bos_token": "<s>",
                    "eos_token": "</s>",
                },
                "pretraining_dataset": [
                    {
@@ -57,7 +51,7 @@ class TestPretrainLlama:
                ],
                "max_steps": 5,
                "num_epochs": 1,
                "micro_batch_size": 2,
                "micro_batch_size": 1,
                "gradient_accumulation_steps": 1,
                "val_set_size": 0.0,
                "output_dir": temp_dir,
@@ -66,7 +60,6 @@ class TestPretrainLlama:
                "lr_scheduler": "cosine",
                "save_safetensors": True,
                "bf16": "auto",
                "use_tensorboard": True,
            }
        )
        normalize_config(cfg)
@@ -75,12 +68,3 @@ class TestPretrainLlama:
        train(cfg=cfg, dataset_meta=dataset_meta)
        check_model_output_exists(temp_dir, cfg)
        loss_threshold = 3.5
        if sample_packing and not pretrain_multipack_attn:
            loss_threshold = 6.5
        check_tensorboard(
            temp_dir + "/runs",
            "train/train_loss",
            loss_threshold,
            "Train Loss is too high",
        )
@@ -13,7 +13,7 @@ from axolotl.train import train
from axolotl.utils.config import normalize_config
from axolotl.utils.dict import DictDefault

from ..utils import check_model_output_exists, check_tensorboard, with_temp_dir
from .utils import check_model_output_exists, check_tensorboard, with_temp_dir

LOG = logging.getLogger("axolotl.tests.e2e")
os.environ["WANDB_DISABLED"] = "true"
25 tests/patched/test_llama_trainer_ga.py Normal file
@@ -0,0 +1,25 @@
"""Test module for checking whether the Hugging Face Transformers is working as expected."""
import unittest

from axolotl.monkeypatch.trainer_grad_accum import (
    check_forward_is_patchable,
    check_training_step_is_patchable,
)


class TestTrainerGAIntegration(unittest.TestCase):
    """llama monkeypatch integration tests."""

    def test_train_step_patchable(self):
        # ensures the current version of transformers has loss code that matches our patching code
        self.assertTrue(
            check_training_step_is_patchable(),
            "HF transformers Trainer.training_step has changed and isn't patchable",
        )

    def test_model_forward_patchable(self):
        # ensures the current version of transformers has loss code that matches our patching code
        self.assertTrue(
            check_forward_is_patchable(),
            "HF transformers LlamaForCausalLM.forward has changed and isn't patchable",
        )
@@ -41,7 +41,6 @@ class TestPretrainingPacking(unittest.TestCase):
                }
            ],
            "sample_packing": True,
            "pretrain_multipack_attn": True,
            "pad_to_sequence_len": True,
            "sequence_len": 2048,
            "micro_batch_size": 2,
@@ -88,11 +87,9 @@ class TestPretrainingPacking(unittest.TestCase):
                assert data["labels"].shape == torch.Size(
                    [1, original_bsz * cfg.sequence_len]
                )
                assert "attention_mask" not in data
                # FIXME add back once we fix packing unpad/pad with attention mask
                # assert data["attention_mask"].shape == torch.Size(
                #     [1, original_bsz * cfg.sequence_len]
                # )
                assert data["attention_mask"].shape == torch.Size(
                    [1, original_bsz * cfg.sequence_len]
                )
                idx += 1