Compare commits
12 Commits
sp-restore ... fix/gemma3
| Author | SHA1 | Date |
|---|---|---|
| | 8eba033dc4 | |
| | a9c0f43202 | |
| | a1a740608d | |
| | ec15a7a691 | |
| | 0a7a216b60 | |
| | d8280d45c1 | |
| | 24f2887e87 | |
| | 29289a4de9 | |
| | a24957fa04 | |
| | 927bf530bc | |
| | 18954ba100 | |
| | d8cf66edbd | |
@@ -19,7 +19,7 @@ repos:
     hooks:
       - id: isort
   - repo: https://github.com/PyCQA/flake8
-    rev: 7.2.0
+    rev: 7.3.0
     hooks:
       - id: flake8
   - repo: https://github.com/pylint-dev/pylint
@@ -27,7 +27,7 @@ repos:
     hooks:
       - id: pylint
   - repo: https://github.com/pre-commit/mirrors-mypy
-    rev: v1.16.0
+    rev: v1.16.1
     hooks:
       - id: mypy
         additional_dependencies:
@@ -36,7 +36,7 @@ repos:
             'pydantic>=2.5.3',
           ]
   - repo: https://github.com/PyCQA/bandit
-    rev: 1.8.3
+    rev: 1.8.5
     hooks:
       - id: bandit
         args: [
@@ -43,7 +43,7 @@ Features:
 - **Multiple Model Support**: Train various models like LLaMA, Mistral, Mixtral, Pythia, and more. We are compatible with HuggingFace transformers causal language models.
 - **Training Methods**: Full fine-tuning, LoRA, QLoRA, GPTQ, QAT, Preference Tuning (DPO, IPO, KTO, ORPO), RL (GRPO), Multimodal, and Reward Modelling (RM) / Process Reward Modelling (PRM).
 - **Easy Configuration**: Re-use a single YAML file between dataset preprocess, training, evaluation, quantization, and inference.
-- **Performance Optimizations**: [Multipacking](https://docs.axolotl.ai/docs/multipack.html), [Flash Attention](https://github.com/Dao-AILab/flash-attention), [Xformers](https://github.com/facebookresearch/xformers), [Flex Attention](https://pytorch.org/blog/flexattention/), [Liger Kernel](https://github.com/linkedin/Liger-Kernel), [Cut Cross Entropy](https://github.com/apple/ml-cross-entropy/tree/main), Sequence Parallelism (SP), LoRA optimizations, Multi-GPU training (FSDP1, FSDP2, DeepSpeed), Multi-node training (Torchrun, Ray), and many more!
+- **Performance Optimizations**: [Multipacking](https://docs.axolotl.ai/docs/multipack.html), [Flash Attention](https://github.com/Dao-AILab/flash-attention), [Xformers](https://github.com/facebookresearch/xformers), [Flex Attention](https://pytorch.org/blog/flexattention/), [Liger Kernel](https://github.com/linkedin/Liger-Kernel), [Cut Cross Entropy](https://github.com/apple/ml-cross-entropy/tree/main), [Sequence Parallelism (SP)](https://docs.axolotl.ai/docs/sequence_parallelism.html), [LoRA optimizations](https://docs.axolotl.ai/docs/lora_optims.html), [Multi-GPU training (FSDP1, FSDP2, DeepSpeed)](https://docs.axolotl.ai/docs/multi-gpu.html), [Multi-node training (Torchrun, Ray)](https://docs.axolotl.ai/docs/multi-node.html), and many more!
 - **Flexible Dataset Handling**: Load from local, HuggingFace, and cloud (S3, Azure, GCP, OCI) datasets.
 - **Cloud Ready**: We ship [Docker images](https://hub.docker.com/u/axolotlai) and also [PyPI packages](https://pypi.org/project/axolotl/) for use on cloud platforms and local hardware.
@@ -9,7 +9,7 @@ order: 3
 Chat Template strategy uses a jinja2 template that converts a list of messages into a prompt. Support using tokenizer's template, a supported template, or custom jinja2.

 ```{.json filename="data.jsonl"}
-{"conversations": [{"role": "...", "content": "..."}]}
+{"messages": [{"role": "...", "content": "..."}, {"role": "...", "content": "..."}, ...]}
 ```

 See [configs](../config-reference.qmd) for full configs and supported templates.
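For context on the format shown above, the sketch below is a hypothetical dataset entry in an axolotl YAML config using the chat_template strategy; the file path and the `field_messages` value are placeholders, not part of this diff.

```yaml
datasets:
  - path: data.jsonl         # placeholder path to the JSONL file above
    type: chat_template
    field_messages: messages  # name of the message-list field in each row
```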
File diff suppressed because it is too large
@@ -75,13 +75,17 @@ def load_datasets(

         num_examples = cli_args.debug_num_examples if cli_args else 1
         text_only = cli_args.debug_text_only if cli_args else False
-        train_samples = sample_dataset(train_dataset, num_examples)
-        check_dataset_labels(
-            train_samples,
-            tokenizer,
-            num_examples=num_examples,
-            text_only=text_only,
-        )
+        try:
+            train_samples = sample_dataset(train_dataset, num_examples)
+            check_dataset_labels(
+                train_samples,
+                tokenizer,
+                num_examples=num_examples,
+                text_only=text_only,
+            )
+        except AttributeError:
+            # can't sample iterable datasets
+            pass

         LOG.info("printing prompters...")
         for prompter in prompters:
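The new try/except exists because streaming datasets cannot be randomly sampled. A minimal sketch of the failure mode, assuming Hugging Face `datasets` behavior (the probed attribute is illustrative of the random-access helpers that sampling relies on):

```python
from datasets import Dataset

ds = Dataset.from_dict({"input_ids": [[1, 2, 3], [4, 5, 6]]})
streamed = ds.to_iterable_dataset()

# Map-style datasets expose random-access helpers such as .select() ...
print(hasattr(ds, "select"))        # True
# ... iterable (streaming) datasets do not, so index-based sampling raises
# AttributeError -- which the wrapped call above now tolerates.
print(hasattr(streamed, "select"))  # False
```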
@@ -413,7 +413,8 @@ class HFCausalTrainerBuilder(TrainerBuilderBase):
                 or self.cfg.micro_batch_size > 1
             ):
                 return DataCollatorForSeq2Seq(self.tokenizer, **kwargs)
-            return None
+            if not (self.cfg.sample_packing and self.cfg.pretrain_multipack_attn):
+                return None

         if self.cfg.model_config_type == "mamba":
             return MambaDataCollator(tokenizer=self.tokenizer)
@@ -116,6 +116,7 @@ class AxolotlTrainer(
                 sequential=self.args.sample_packing_sequentially,
                 drop_last=True,
                 num_processes=self.args.dataset_num_proc,
+                mp_start_method=self.args.sample_packing_mp_start_method or "fork",
             )

             len(sampler)
@@ -38,6 +38,10 @@ class AxolotlTrainingMixins:
             "help": "Use next-fit sample packing that preserves the order of samples coming from the sampler. Use in combination with curriculum_sampling for fully sequential packing."
         },
     )
+    sample_packing_mp_start_method: str | None = field(
+        default=None,
+        metadata={"help": "The multiprocessing start method to use."},
+    )
     multipack_real_batches: bool = field(
         default=False,
         metadata={"help": "Use real batches for efficient training."},
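The new `sample_packing_mp_start_method` argument is surfaced as a user-facing config field later in this diff. A hedged sketch of how it might be set alongside the related packing options; all values here are illustrative, not defaults from this PR:

```yaml
sample_packing: true
sample_packing_sequentially: true
curriculum_sampling: true
# 'fork' is the fallback used in the code above; 'spawn' or 'forkserver' may be
# preferable in environments where forking a threaded process is unsafe
sample_packing_mp_start_method: spawn
```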
@@ -776,6 +776,9 @@ class ModelLoader:
         dist_dtype: torch.dtype,
         before_kbit_train_or_finetune: bool,
     ):
+        dest = {"dtype": dist_dtype}
+        if self.cfg.lora_on_cpu:
+            dest["device"] = "cpu"
         for name, module in self.model.named_modules():
             if "norm" in name:
                 module.to(dist_dtype)
@@ -786,4 +789,4 @@ class ModelLoader:
                 # don't upcast lm_head for btlm
                 continue
             if any(m in name for m in embedding_modules) and hasattr(module, "weight"):
-                module.to(dist_dtype)
+                module.to(**dest)
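Passing a kwargs dict to `Module.to` lets a single call set the dtype and, optionally, pin the module to CPU. A standalone sketch of the pattern introduced above; the `lora_on_cpu` toggle is a stand-in for the config flag, not this PR's code:

```python
import torch
from torch import nn

norm = nn.LayerNorm(8)

dest = {"dtype": torch.bfloat16}
lora_on_cpu = True  # stand-in for cfg.lora_on_cpu
if lora_on_cpu:
    dest["device"] = "cpu"

norm.to(**dest)  # one call handles dtype and, optionally, device
print(norm.weight.dtype, norm.weight.device)  # torch.bfloat16 cpu
```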
@@ -156,8 +156,12 @@ def get_attention_cls_from_config(cfg: DictDefault) -> Type[nn.Module]:
         model_cls_prefix = "".join(
             [part.capitalize() for part in model_type.split("_")]
         )
-        module = __import__(module_path, fromlist=[f"{model_cls_prefix}Attention"])
-        attention_cls = getattr(module, f"{model_cls_prefix}Attention")
+        if model_type == "gemma3n":
+            module = __import__(module_path, fromlist=[f"{model_cls_prefix}TextAttention"])
+            attention_cls = getattr(module, f"{model_cls_prefix}TextAttention")
+        else:
+            module = __import__(module_path, fromlist=[f"{model_cls_prefix}Attention"])
+            attention_cls = getattr(module, f"{model_cls_prefix}Attention")

         return attention_cls
     except (ImportError, AttributeError) as e:
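The gemma3n branch reuses the same dynamic-import pattern with a different class suffix. A minimal sketch of that pattern with a model type known to exist in transformers; the module-path construction is an assumption about how `module_path` is built upstream of this hunk:

```python
# Assumes `transformers` is installed.
model_type = "llama"
module_path = f"transformers.models.{model_type}.modeling_{model_type}"
model_cls_prefix = "".join(part.capitalize() for part in model_type.split("_"))

module = __import__(module_path, fromlist=[f"{model_cls_prefix}Attention"])
attention_cls = getattr(module, f"{model_cls_prefix}Attention")
print(attention_cls)  # <class 'transformers.models.llama.modeling_llama.LlamaAttention'>
```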
@@ -42,6 +42,10 @@ def patch_for_multipack(model_type, model_name=None, has_remote_code=False):
     if has_remote_code:
         patch_remote(model_name)
     elif hasattr(transformers, "modeling_flash_attention_utils"):
+        # sanity check in case upstream api changes on this
+        assert hasattr(
+            transformers.modeling_flash_attention_utils, "_get_unpad_data"
+        ), "transformers api changed for _get_unpad_data for flash attention"
         transformers.modeling_flash_attention_utils._get_unpad_data = (  # pylint: disable=protected-access
             get_unpad_data
         )
@@ -103,6 +103,7 @@ class ChatTemplatePrompter(Prompter):
         chat_template_kwargs = {
             "chat_template": self.chat_template,
             "add_generation_prompt": add_generation_prompt,
+            **self.chat_template_kwargs,
         }

         if tools:
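The splatted `**self.chat_template_kwargs` ultimately reaches the tokenizer's `apply_chat_template`, which forwards extra keyword arguments into the jinja template. A hedged sketch of that mechanism; the model name and the `enable_thinking` key are illustrative, and the call assumes network access plus a template that understands that key:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")  # downloads the tokenizer
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "hi"}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # extra kwarg consumed by the jinja template
)
print(prompt)
```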
@@ -223,6 +223,8 @@ def execute_training(
     )

     LOG.info("Starting trainer...")
+    if cfg.bf16:
+        torch.set_default_dtype(torch.bfloat16)
     trainer.train(resume_from_checkpoint=resume_from_checkpoint)

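`torch.set_default_dtype` changes the dtype used for newly created floating-point tensors, which is the effect the `cfg.bf16` branch above relies on. A tiny self-contained check:

```python
import torch

torch.set_default_dtype(torch.bfloat16)
print(torch.zeros(2).dtype)  # torch.bfloat16 -- new float tensors now default to bf16
```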
@@ -207,9 +207,6 @@ class SequenceParallelContextManager:
         # Store original sequence length and padding information
         self.original_seq_len = 0
         self.pad_len = 0

-        # Store kwargs passed to model forward pass
-        self.original_kwargs: None | dict[str, torch.Tensor] = None
-
         # Create a partially applied version of the apply_sequence_parallelism function
         self.apply_sequence_parallelism = functools.partial(
@@ -262,9 +259,6 @@ class SequenceParallelContextManager:

             # Any excess positional arguments are kept as-is
             remaining_args = args[len(forward_params) :]

-            # Store original kwargs
-            self.original_kwargs = {key: value.clone() for key, value in updated_kwargs.items()}
-
             # Apply sequence parallelism to updated kwargs
             updated_kwargs, self.original_seq_len, self.pad_len = (
@@ -224,10 +224,10 @@ def wrap_pretraining_dataset(
     remove_columns = []
     if dataset.features is None:
         for first_row in dataset:
-            remove_columns = first_row.keys()
+            remove_columns = list(first_row.keys())
             break
     else:
-        remove_columns = dataset.features.keys()
+        remove_columns = list(dataset.features.keys())

     dataset = dataset.map(
         encode,
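Wrapping the column names in `list(...)` matters once they are handed off to worker processes, because dict views cannot be pickled; this is the same error message the new config validator later in this diff guards against. A quick demonstration:

```python
import pickle

cols = {"text": 1, "meta": 2}.keys()
try:
    pickle.dumps(cols)
except TypeError as err:
    print(err)  # cannot pickle 'dict_keys' object

pickle.dumps(list(cols))  # a plain list pickles without issue
```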
@@ -267,6 +267,7 @@ def encode_packed_pretraining(
         batch_size=1,
         batch_max_len=batch_size * max_seq_length,
         drop_last=True,
+        num_processes=1,
     )

     chunked_data = defaultdict(list)
@@ -334,7 +334,10 @@ def _load_raw_datasets(
     dataset = merge_datasets(datasets, cfg)

     if not cfg.skip_prepare_dataset:
-        dataset = drop_long_seq_in_dataset(dataset, cfg)
+        if split == "test" and cfg.eval_sequence_len:
+            dataset = drop_long_seq_in_dataset(dataset, cfg.eval_sequence_len, cfg)
+        else:
+            dataset = drop_long_seq_in_dataset(dataset, cfg.sequence_len, cfg)
         if cfg.sample_packing:
             dataset, _ = process_datasets_for_packing(cfg, dataset, None)

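With the split-aware call above, evaluation data can be filtered against its own length limit. A hedged config sketch using the new `eval_sequence_len` field; the numbers are illustrative:

```yaml
sequence_len: 4096        # training samples longer than this are dropped
eval_sequence_len: 8192   # evaluation samples may be longer than training samples
min_sample_len: 8
```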
@@ -148,11 +148,14 @@ def deduplicate_and_log_datasets(
     return dataset, other_dataset


-def drop_long_seq_in_dataset(dataset: Dataset, cfg: DictDefault) -> Dataset:
+def drop_long_seq_in_dataset(
+    dataset: Dataset, sequence_len: int, cfg: DictDefault
+) -> Dataset:
     """Remove sequences longer than configured maximum from dataset.

     Args:
         dataset: Dataset to filter.
+        sequence_len: Maximum length for sequences to keep
         cfg: Dictionary mapping `axolotl` config keys to values.

     Returns:
@@ -167,7 +170,7 @@ def drop_long_seq_in_dataset(dataset: Dataset, cfg: DictDefault) -> Dataset:

     drop_long = functools.partial(
         drop_long_seq,
-        sequence_len=cfg.sequence_len,
+        sequence_len=sequence_len,
         min_sequence_len=cfg.min_sample_len,
     )

@@ -187,7 +190,7 @@ def drop_long_seq_in_dataset(dataset: Dataset, cfg: DictDefault) -> Dataset:

     drop_long_kwargs = {}
     if filter_map_kwargs:
-        drop_long_kwargs["desc"] = "Dropping Long Sequences"
+        drop_long_kwargs["desc"] = f"Dropping Long Sequences (>{sequence_len})"

     dataset = dataset.filter(
         drop_long,
@@ -127,7 +127,7 @@ def pack_parallel(
     bin_size: int,
     num_processes: int | None = None,
     safe_mode: bool = True,
-    mp_start_method: str | None = "spawn",
+    mp_start_method: str | None = "fork",
 ) -> list[list[int]]:
     """Pack sequences into bins using parallel processing.

@@ -260,12 +260,13 @@ class MultipackBatchSampler(BatchSampler):
         lengths: np.ndarray,  # Sequence lengths
         packing_efficiency_estimate: float = 1.0,  # Initial efficiency estimate
         drop_last: bool = True,  # Whether to drop final batches (might be incomplete)
-        num_count_samples: int = 8,  # Number of times to estimate batch count
+        num_count_samples: int = 4,  # Number of times to estimate batch count
         sequential: bool = False,  # Whether to use sequential packing
         group_size: int = 100_000,  # Size of groups for parallel packing
         bin_size: int = 200,  # The max number of samples that can be packed in a single bin
         num_processes: int | None = None,  # Number of processes for parallel packing
         safe_mode: bool = True,  # Conservative packing to prevent training instability
+        mp_start_method: str = "fork",
         **kwargs,  # pylint: disable=unused-argument
     ):
         super().__init__(sampler, batch_size, drop_last)
@@ -278,6 +279,7 @@ class MultipackBatchSampler(BatchSampler):
         self.bin_size = bin_size
         self.num_processes = num_processes
         self.safe_mode = safe_mode
+        self.mp_start_method = mp_start_method

         assert isinstance(self.lengths, np.ndarray)

@@ -333,13 +335,15 @@ class MultipackBatchSampler(BatchSampler):
             bins = [[indices[b_idx] for b_idx in bin_indices] for bin_indices in bins]
         else:
             # Use parallel packing
+            num_processes = self.num_processes or 1
             all_bins = pack_parallel(
                 lengths,
                 bin_capacity=self.batch_max_len,
                 group_size=self.group_size,
                 bin_size=self.bin_size,
-                num_processes=self.num_processes,
+                num_processes=min(4, num_processes) if num_processes else 4,
                 safe_mode=self.safe_mode,
+                mp_start_method=self.mp_start_method,
             )

             # Map bin indices back to original indices
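The sampler now threads a configurable start method into its worker pool. A minimal sketch of the underlying trade-off, assuming a Linux host where the `fork` context is available:

```python
import multiprocessing as mp

# 'fork' starts workers quickly but can misbehave with threaded or CUDA-initialized
# parents; 'spawn'/'forkserver' are slower but safer. The sampler now lets the
# training config choose between them.
ctx = mp.get_context("fork")
with ctx.Pool(processes=2) as pool:
    print(pool.map(len, [[1, 2], [3]]))  # [2, 1]
```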
@@ -366,6 +366,12 @@ class AxolotlInputConfig(
             "description": "The maximum length of an input to train with, this should typically be less than 2048 as most models have a token/context limit of 2048"
         },
     )
+    eval_sequence_len: int | None = Field(
+        default=None,
+        json_schema_extra={
+            "description": "The maximum length of an input for evaluation. If not specified, defaults to sequence_len"
+        },
+    )
     min_sample_len: int | None = None
     max_prompt_len: int = Field(
         default=512,
@@ -393,6 +399,12 @@ class AxolotlInputConfig(
         default=None,
         json_schema_extra={"description": "Whether to pack samples sequentially"},
     )
+    sample_packing_mp_start_method: str | None = Field(
+        default=None,
+        json_schema_extra={
+            "description": "The multiprocessing start method to use for packing. Should be 'fork', 'spawn' or 'forkserver'"
+        },
+    )
     eval_sample_packing: bool | None = Field(
         default=None,
         json_schema_extra={
@@ -772,6 +784,12 @@ class AxolotlInputConfig(
             "description": "Custom jinja template for chat template. This will be only used if chat_template is set to `jinja` or `null` (in which case chat_template is automatically set to `jinja`). Default is null."
         },
     )
+    chat_template_kwargs: dict[str, Any] | None = Field(
+        default=None,
+        json_schema_extra={
+            "description": "Additional kwargs to pass to the chat template. This is useful for customizing the chat template. For example, you can pass `thinking=False` to add a generation prompt to the chat template."
+        },
+    )
     eot_tokens: list[str] | None = Field(
         default=None,
         json_schema_extra={
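A hedged sketch of how the new `chat_template_kwargs` field might appear in a training config; the template choice and the kwarg name are placeholders for whatever keys your model's chat template actually accepts:

```yaml
chat_template: tokenizer_default
chat_template_kwargs:
  enable_thinking: false   # forwarded verbatim into the jinja template
```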
@@ -462,6 +462,20 @@ class TrainingValidationMixin:

         return data

+    @model_validator(mode="before")
+    @classmethod
+    def pretrain_with_tps(cls, data):
+        if data.get("pretraining_dataset") and data.get(
+            "include_tokens_per_second", False
+        ):
+            # combining these would raise `TypeError: cannot pickle 'dict_keys' object`
+            # due to trying to count the number of tokens total in the dataset
+            raise ValueError(
+                "pretraining_dataset and include_tokens_per_second cannot be used together."
+            )
+
+        return data
+

 class LoRAValidationMixin:
     """Validation methods related to LoRA/QLoRA configuration."""
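For reference, a sketch of a config combination that the validator above would now reject; the dataset path is a placeholder:

```yaml
pretraining_dataset: some-org/some-pretraining-corpus   # placeholder dataset
include_tokens_per_second: true                          # rejected alongside pretraining_dataset
```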
@@ -381,6 +381,7 @@ def process_pretraining_datasets_for_packing(
     if not skip_position_ids:
         train_dataset = train_dataset.map(
             add_position_ids,
+            batched=True,
             desc="Add position_id column (Pretraining Sample Packing)",
         )
     if drop_attention_mask:
@@ -467,6 +468,7 @@ def calculate_total_num_steps(cfg, train_dataset, update=True):
             sequential=cfg.sample_packing_sequentially,
             drop_last=True,
             num_processes=cfg.dataset_processes,
+            mp_start_method=cfg.sample_packing_mp_start_method or "fork",
         )

         data_loader = DataLoader(
@@ -70,7 +70,7 @@ class TestBatchedSamplerPacking:
         )
         train_dataset = concatenate_datasets([dataset_wrapper])

-        train_dataset = drop_long_seq_in_dataset(train_dataset, cfg)
+        train_dataset = drop_long_seq_in_dataset(train_dataset, cfg.sequence_len, cfg)

         lengths = get_dataset_lengths(train_dataset)
         batch_sampler = MultipackBatchSampler(