Compare commits


49 Commits

Author SHA1 Message Date
Dan Saunders
e910e3e164 Revert "Multipack parallel bin packing (#2631)"
This reverts commit 8e4158cc0b.
2025-05-09 17:33:31 +00:00
Wing Lian
0f3587174d swap tinymodels that have safetensors for some ci tests (#2641) 2025-05-07 15:06:07 -04:00
xzuyn
25e6c5f9bd Add CAME Optimizer (#2385) 2025-05-07 10:31:46 -04:00
NanoCode012
32f51bca35 fix(doc): clarify instruction to delinearize llama4 similar to cli doc (#2644) [skip ci] 2025-05-07 10:29:47 -04:00
NanoCode012
9daa04da90 Fix: improve error message on failed dataset load (#2637) [skip ci]
* fix(log): clarify error on dataset loading failed

* fix: add path for easy tracking of broken config

* fix: improve error message based on pr feedback
2025-05-07 10:29:05 -04:00
Wing Lian
0d71b0aa5f Configurable embeddings upcast (#2621)
* fsdp embeddings should be float32 per comment

* patch peft to not upcast everything

* add tabs back to code check

* fix import

* add configurable option and fix check

* add check for dtypes

* move embeddings test to patch dir

* fix test

* fix comment and logic
2025-05-06 23:40:44 -04:00
Eric Meier
63aaccf85b Fix cut_cross_entropy plugin install (#2642) [skip ci] 2025-05-06 22:56:00 -04:00
Wing Lian
ff0fe767c8 xformers attention with packing (#2619)
* xformers attention with packing

* wire up the patch

* fix xformers + packing validation

* fix warning

* reorder the packing check

* fix fp16 / bf16 reset when using fp16 with bf16 auto

* fix seq lens calc to drop hanging sequences

* handle xformers patch for inference too

* fix batch size setter

* fix xformers inference

* add colab callback to fix inference post train

* PR feedback
2025-05-06 22:49:22 -04:00
Wing Lian
8e4158cc0b Multipack parallel bin packing (#2631)
* improve readability of multipack sampler

* parallel bin packing
fix error with lambda and pickling

make sure things are in float instead of np.float

* annotations and comments update

* support for configurable group and bin size for sample packing

* fix missing map back to original indices
2025-05-06 20:08:08 -04:00
Wing Lian
cd84325253 allow plugins to return their own dataset (#2617) [skip ci]
* allow plugins to return their own dataset

* add post_trainer_create and wire up

* add hook check

* address PR feedback:

* remove annotation causing circular import
2025-05-06 20:05:51 -04:00
NanoCode012
0b140fef83 feat(doc): add split_thinking docs (#2613) [skip ci]
* feat(doc): add split_thinking docs

* fix: link config.qmd to conversation.qmd for split_thinking example

* update thinking => reasoning_content in messages format

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-05-06 20:05:32 -04:00
Wing Lian
e4cfebe995 bump liger dep to 0.5.9 (#2640) [skip ci]
* bump liger dep to 0.5.9

* also upgrade vllm to post1, and datasets to 3.5.1
2025-05-06 20:05:19 -04:00
mhenrichsen
a6cac5dd32 Update lr_scheduler options in config.qmd to include additional scheduling strategies for improved training flexibility. (#2636) [skip ci] 2025-05-06 11:24:07 -04:00
Wing Lian
b71c0e3447 Print axolotl art if train is called outside of cli: (#2627) [skip ci] 2025-05-06 11:18:45 -04:00
Wing Lian
ddaebf8309 fix dpo eval override to call grandparent instead of the broken super (#2628) [skip ci] 2025-05-06 11:18:25 -04:00
Wing Lian
679743087a make sure gc_steps is used for all trainers (#2638) 2025-05-06 11:18:00 -04:00
Wing Lian
f720b6e72d repop cache (#2639)
* repop cache

* pre-cache as a step

* fix the name

* add reason for pytest skipif

* restore pytorch matrix

* remove max-parallel now that we've optimized this a bit
2025-05-06 11:09:07 -04:00
mhenrichsen
a980618fd0 Adds example for training a TTS model on top of an LLM. (#2614)
* Adds example for training a TTS model on top of an LLM.

* Update examples/orpheus/finetune.yml

Co-authored-by: NanoCode012 <nano@axolotl.ai>

* Update examples/orpheus/finetune.yml

Co-authored-by: NanoCode012 <nano@axolotl.ai>

* Update README.md to clarify GPU requirements for finetuning Orpheus TTS model

* Update finetune.yml to use the new base model canopylabs/orpheus-3b-0.1-pretrained

* Update finetune.yml and README.md for consistency and clarity

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-05-06 10:11:06 +02:00
Emmanuel Ferdman
54960d4de0 Fix logging deprecation warnings (#2623)
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
2025-05-04 08:22:45 -04:00
Wing Lian
ed922796b7 include multipack support for qwen3 family (#2622) 2025-05-03 12:02:39 -04:00
Wing Lian
3dd9c3bf3f setup hf transfer too and fix auto bf16 when fp16 enabled (#2620) [skip ci] 2025-05-03 12:02:26 -04:00
Wing Lian
0ba7d362fa qwen3 and qwen3_moe support for liger kernels (#2612)
* qwen3 and qwen3_moe support for liger kernels

* fix moe module path

* fix: qwen3 liger input args and mlp

* fix: qwen3 input args and output class

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-05-02 09:29:55 -04:00
aitechguy
e4f73bc98e remove keys to incorporate changes for the trl update (#2616) 2025-05-02 08:47:42 -04:00
Wing Lian
bcb59c70e2 automatically set pad_to_sequence_len when using packing (#2607)
* automatically set pad_to_sequence_len when using packing

* update tests
2025-05-01 13:24:38 -04:00
NanoCode012
6a3e6f8c53 fix: run preview-docs only when md/qmd changes (#2606)
* fix: run preview-docs only when md/qmd changes

* feat: add quarto yaml based on PR feedback
2025-05-01 13:21:28 -04:00
Wing Lian
fee3c13bb5 Logging config for colab (#2611)
* only configure logging on cli to play nicely with colab

* allow reloading the config on the fly from a dict

* make sure to use dict for yaml

* reuse existing function for load

* make cli args optional

* mps fix and respect max_steps
2025-05-01 12:58:00 -04:00
Rahul Tuli
996fc124e5 Add: Sparse Finetuning Integration with llmcompressor (#2479)
* Add: SFTPlugin with llmcompressor

* Update: review comments!

* Add: llmcompressor installable

* pre commit hooks

* Use: warning over warn

* Revert: TODO's

* Update llmcompressor version to latest

* Apply suggestions from @markurtz

Co-authored-by: Mark Kurtz <mark.j.kurtz@gmail.com>

* Address review comments from @markurtz

* Add: llmcompressor installable

* Rename: sft.yaml to sparse-finetuning.yaml

* Use: absolute import

* Update model config

* Move: LLMCompressorPlugin into its own submodule

* Add: `llm_compressor` integration documentation

* Rebase and updates!

* Tests, Style, Updates

* Add: .qmd file

* Address Review Comments:
* deleted redundant docs/llm_compressor.qmd
* incorporated feedback in integration README.md
* added llmcompressor integration to docs/custom_integrations.qmd

Signed-off-by: Rahul Tuli <rtuli@redhat.com>

* Add: line about further optimizations using llmcompressor

Signed-off-by: Rahul Tuli <rtuli@redhat.com>

* Apply patch from @winglian

Signed-off-by: Rahul Tuli <rtuli@redhat.com>

* Fix: Test

Signed-off-by: Rahul Tuli <rtuli@redhat.com>

* additional fixes for docker and saving compressed

* split llmcompressor from vllm checks

* Reset session between tests

Signed-off-by: Rahul Tuli <rtuli@redhat.com>

* move decorator to test method instead of class

* make sure to reset the session after each test

* move import of llmcompressor to reset session inside test

---------

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Co-authored-by: Mark Kurtz <mark.j.kurtz@gmail.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-05-01 12:25:16 -04:00
Wing Lian
e963990ad7 add missing __init__ for lr monkeypatch fix (#2609) 2025-05-01 09:41:32 -04:00
Dhruv Mullick
c3f2b1c5c2 Add num_completions_to_print for trl and grpo (#2604) 2025-04-30 21:00:30 -04:00
Wing Lian
6ba5c0ed2c use latest hf-xet and don't install vllm for torch 2.7.0 (#2603)
* use latest hf-xet and don't install vllm for torch 2.7.0

* fix runpod hub tests
2025-04-30 18:27:39 -04:00
Wing Lian
24ff5f53f8 additional args for grpo config/trainer (#2598) 2025-04-30 13:11:12 -04:00
Wing Lian
5e949eaa07 replace zero_only with simpler if statement (#2592) 2025-04-30 13:11:03 -04:00
Wing Lian
89ca14d9a0 ensure we pass axolotl extras to the Dockerfile so vllm is included in shipped images (#2599) 2025-04-30 11:35:45 -04:00
Wing Lian
8446b4ad28 don't automatically enable lora kernels for RL training (#2600) 2025-04-30 11:06:50 -04:00
Wing Lian
fc79606b6d only import vllm serve cli if it's being called (#2597) [skip ci] 2025-04-30 09:11:25 -04:00
Wing Lian
baeb00231b Handle other reasoning trace dataset formats (#2591)
* Handle other reasoning trace dataset formats

* rename var to improve readability

* chore: refactor with comments

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
2025-04-30 03:32:55 -04:00
Wing Lian
2413688b08 upload the deepspeed json to wandb (#2593) [skip ci] 2025-04-30 03:32:44 -04:00
NanoCode012
5bb1f3da56 feat: add qwen3 moe block for ds3 (#2596) [skip ci] 2025-04-30 03:32:23 -04:00
Wing Lian
a21b9cc472 patch to convert LR from tensor to float when using DS (#2595) [skip ci] 2025-04-30 03:31:57 -04:00
Aleksandr Dremov
41a1ec0c95 Plugins create_lr_scheduler support (#2584)
* lr_scheduler support

* fix

* Update scheduler.py

* Update scheduler.py

* cfg handling

* black

* remove debug

* remove adding the axolotl cfg to the scheduler mixin

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>
2025-04-29 17:08:30 -04:00
Dan Saunders
ecac731922 auto-enable lora kernels where possible (#2589)
* auto-enable lora kernels where possible

* test

* revert change to example yaml

* naming

* remove print

* slight logic change
2025-04-29 16:18:49 -04:00
NanoCode012
742fef4200 fix(doc): key used to point to url in multimodal doc (#2575) [skip ci] 2025-04-29 15:10:59 -04:00
Wing Lian
a39caf8824 bump vllm==0.8.5 for qwen3 support (#2583) [skip ci] 2025-04-29 15:10:40 -04:00
Wing Lian
07e4f2e25b support for qwen3 with lora kernels (#2588)
* support for qwen3 with lora kernels

* fix patch

* typo
2025-04-29 15:02:49 -04:00
Dan Saunders
c7d07de6b4 Fix eval + add smoke test (#2586)
* fix evaluate CLI

* add smoke test

* fix naming

* lint
2025-04-29 12:58:54 -04:00
Wing Lian
6565ae85d8 set config on the PluginManager for callback access (#2587) 2025-04-29 12:05:44 -04:00
Wing Lian
80b4edb4a7 Post release fixes (#2581)
* fix missing kwarg on child

* make the runpod test shorter

* update docs

* rename runpod test json file

* typing fixes and ordering of doc
2025-04-29 10:01:38 -04:00
Wing Lian
fedbcc0254 remove torch 2.4.1 CI as part of support deprecation (#2582) 2025-04-29 08:28:32 -04:00
Wing Lian
8175896ada add dev tag for v0.10.0.dev0 (#2580) 2025-04-28 20:30:14 -04:00
16 changed files with 750 additions and 292 deletions

View File

@@ -329,6 +329,18 @@ jobs:
      fail-fast: false
      matrix:
        include:
+          - cuda: 124
+            cuda_version: 12.4.1
+            python_version: "3.11"
+            pytorch: 2.6.0
+            num_gpus: 1
+            axolotl_extras: llmcompressor
+          - cuda: 124
+            cuda_version: 12.4.1
+            python_version: "3.11"
+            pytorch: 2.4.1
+            num_gpus: 1
+            axolotl_extras:
          - cuda: 124
            cuda_version: 12.4.1
            python_version: "3.11"

View File

@@ -49,7 +49,8 @@ sections = [
("Knowledge Distillation (KD)", "kd"),
("Liger Kernels", "liger"),
("Language Model Evaluation Harness (LM Eval)", "lm_eval"),
("Spectrum", "spectrum")
("Spectrum", "spectrum"),
("LLMCompressor", "llm_compressor")
]
for section_name, folder_name in sections:

View File

@@ -0,0 +1,77 @@
base_model: neuralmagic/Sparse-Llama-3.1-8B-2of4

plugins:
  - axolotl.integrations.llm_compressor.LLMCompressorPlugin

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: ./outputs/out

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
eval_sample_packing: false

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 2e-5

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 100
evals_per_epoch: 2
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  pad_token: <|end_of_text|>
llmcompressor:
  recipe:
    finetuning_stage:
      finetuning_modifiers:
        ConstantPruningModifier:
          targets: [
            're:.*q_proj.weight',
            're:.*k_proj.weight',
            're:.*v_proj.weight',
            're:.*o_proj.weight',
            're:.*gate_proj.weight',
            're:.*up_proj.weight',
            're:.*down_proj.weight',
          ]
          start: 0
  save_compressed: true

View File

@@ -150,6 +150,9 @@ extras_require = {
"vllm": [
"vllm==0.7.2",
],
"llmcompressor": [
"llmcompressor==0.5.1",
],
}
install_requires, dependency_links, extras_require_build = parse_requirements(

View File

@@ -4,4 +4,4 @@ import pkgutil
 __path__ = pkgutil.extend_path(__path__, __name__)  # Make this a namespace package

-__version__ = "0.9.1"
+__version__ = "0.10.0.dev0"

View File

@@ -114,8 +114,6 @@ class AxolotlTrainer(
             packing_efficiency_estimate=self.args.sample_packing_efficiency,
             batch_max_len=batch_max_len,
             batch_size=batch_size,
-            group_size=self.args.sample_packing_group_size,
-            bin_size=self.args.sample_packing_bin_size,
             sequential=self.args.sample_packing_sequentially,
             drop_last=True,
         )

View File

@@ -0,0 +1,108 @@
# LLMCompressor Integration

Fine-tune sparsified models in Axolotl using Neural Magic's [LLMCompressor](https://github.com/vllm-project/llm-compressor).

This integration enables fine-tuning of models sparsified using LLMCompressor within the Axolotl training framework. By combining LLMCompressor's model compression capabilities with Axolotl's distributed training pipelines, users can efficiently fine-tune sparse models at scale. It uses Axolotl's plugin system to hook into the fine-tuning flows while maintaining sparsity throughout training.

---

## Requirements

- Axolotl with the `llmcompressor` extras:

  ```bash
  pip install "axolotl[llmcompressor]"
  ```

- Requires `llmcompressor >= 0.5.1`

This will install all necessary dependencies to fine-tune sparsified models using the integration.
---
## Usage

To enable sparse fine-tuning with this integration, include the plugin in your Axolotl config:

```yaml
plugins:
  - axolotl.integrations.llm_compressor.LLMCompressorPlugin

llmcompressor:
  recipe:
    finetuning_stage:
      finetuning_modifiers:
        ConstantPruningModifier:
          targets: [
            're:.*q_proj.weight',
            're:.*k_proj.weight',
            're:.*v_proj.weight',
            're:.*o_proj.weight',
            're:.*gate_proj.weight',
            're:.*up_proj.weight',
            're:.*down_proj.weight',
          ]
          start: 0
  save_compressed: true

# ... (other training arguments)
```
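
With the plugin configured, training goes through the normal Axolotl entry points. A minimal launch sketch, assuming the config above is saved as `sparse-finetune.yaml` (a placeholder filename):

```bash
# Launch sparse fine-tuning with the LLMCompressor plugin enabled in the config.
axolotl train sparse-finetune.yaml

# Equivalent invocation via accelerate:
accelerate launch -m axolotl.cli.train sparse-finetune.yaml
```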

This plugin **does not apply pruning or sparsification itself** — it is intended for **fine-tuning models that have already been sparsified**.

Pre-sparsified checkpoints can be:

- Generated using [LLMCompressor](https://github.com/vllm-project/llm-compressor)
- Downloaded from [Neural Magic's Hugging Face page](https://huggingface.co/neuralmagic)
- Any custom LLM with compatible sparsity patterns that you've created yourself

To learn more about writing and customizing LLMCompressor recipes, refer to the official documentation:
[https://github.com/vllm-project/llm-compressor/blob/main/README.md](https://github.com/vllm-project/llm-compressor/blob/main/README.md)

### Storage Optimization with save_compressed

Setting `save_compressed: true` in your configuration enables saving models in a compressed format, which:

- Reduces disk space usage by approximately 40%
- Maintains compatibility with vLLM for accelerated inference
- Maintains compatibility with llmcompressor for further optimization (example: quantization)

This option is highly recommended when working with sparse models to maximize the benefits of model compression.
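
Because a `save_compressed` checkpoint remains loadable by llmcompressor, a post-training quantization pass can be applied to the fine-tuned model. A minimal sketch using upstream llmcompressor's `oneshot` entry point and `GPTQModifier` (these names come from the llmcompressor project, not this plugin; the paths and calibration dataset are placeholders):

```python
# Hedged sketch: one-shot W4A16 quantization of a fine-tuned sparse checkpoint
# with upstream llmcompressor. Paths and dataset below are placeholders.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

oneshot(
    model="path/to/your/sparse/model",  # the Axolotl output_dir from fine-tuning
    dataset="open_platypus",  # calibration data for quantization
    recipe=GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
    output_dir="path/to/your/sparse-quantized/model",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```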
### Example Config
See [`examples/llama-3/sparse-finetuning.yaml`](examples/llama-3/sparse-finetuning.yaml) for a complete example.
---
## Inference with vLLM

After fine-tuning your sparse model, you can leverage vLLM for efficient inference.
You can also use LLMCompressor to apply additional quantization to your fine-tuned
sparse model before inference for even greater performance benefits:
```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM("path/to/your/sparse/model")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
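
For serving instead of offline generation, the same checkpoint can be exposed through vLLM's OpenAI-compatible server; a minimal sketch (the model path is a placeholder):

```bash
# Serve the fine-tuned sparse model over an OpenAI-compatible HTTP API.
vllm serve path/to/your/sparse/model --port 8000
```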
For more details on vLLM's capabilities and advanced configuration options, see the [official vLLM documentation](https://docs.vllm.ai/).
## Learn More
For details on available sparsity and quantization schemes, fine-tuning recipes, and usage examples, visit the official LLMCompressor repository:
[https://github.com/vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)

View File

@@ -0,0 +1,5 @@
"""Integration entry point for the LLMCompressor plugin."""
from .plugin import LLMCompressorPlugin
__all__ = ["LLMCompressorPlugin"]

View File

@@ -0,0 +1,40 @@
"""
LLMCompressor and Sparse Finetuning config models.
"""
from typing import Any
from pydantic import BaseModel, Field
from typing_extensions import Annotated
class CompressionArgs(BaseModel):
"""Sparse Finetuning config for LLMCompressor."""
# Typing for recipe is set to Any due to:
# https://github.com/vllm-project/llm-compressor/issues/1319
recipe: Annotated[
Any,
Field(
description="The recipe containing the compression algorithms and hyperparameters to apply."
),
]
save_compressed: Annotated[
bool,
Field(
default=False,
description="Whether to save the compressed model after training.",
),
]
class LLMCompressorArgs(BaseModel):
"""LLMCompressor configuration BaseModel."""
llmcompressor: Annotated[
CompressionArgs,
Field(
description="Arguments enabling compression pathways through the LLM Compressor plugins"
),
]

View File

@@ -0,0 +1,171 @@
"""
Sparse Finetuning plugin for Axolotl — enables handling of sparse neural networks
by maintaining masks for zero weights during training.
"""
import logging
from functools import wraps
from typing import Any, Callable, Concatenate, ParamSpec, TypeVar
from llmcompressor import active_session, create_session
from llmcompressor.core import callbacks as session_callbacks
from llmcompressor.recipe import Recipe
from torch.nn import Module
from transformers.trainer import Trainer
from transformers.trainer_callback import TrainerCallback, TrainerControl, TrainerState
from transformers.training_args import TrainingArguments
from axolotl.integrations.base import BasePlugin
P = ParamSpec("P") # Params for generic function signatures
R = TypeVar("R") # Return type for generic function signatures
LOG = logging.getLogger("axolotl.integrations.llm_compressor")
class LLMCompressorCallbackHandler(TrainerCallback):
"""
Trainer callback for Sparse Finetuning.
Maintains sparsity patterns during training by applying masks after optimization steps,
ensuring zero-weight updates are canceled out.
"""
def __init__(self, trainer: Trainer, recipe: Any):
"""
Initialize the Sparse Finetuning callback handler.
Args:
trainer (Trainer): Huggingface Trainer instance.
recipe (Recipe | dict): Sparse finetuning recipe to apply.
"""
super().__init__()
self.trainer = trainer
self.recipe = (
Recipe.model_validate(recipe) if not isinstance(recipe, Recipe) else recipe
)
self.original_compute_loss = trainer.compute_loss
self.trainer.compute_loss = compute_loss_wrapper(self.trainer.compute_loss)
create_session()
def on_train_begin(
self,
args: TrainingArguments,
state: TrainerState,
control: TrainerControl,
**kwargs,
) -> None:
"""
Called at the beginning of training. Initializes the compression session.
Args:
args (TrainingArguments): Training arguments.
state (TrainerState): Trainer state.
control (TrainerControl): Trainer control.
"""
super().on_train_begin(args, state, control, **kwargs)
self.trainer.accelerator.wait_for_everyone()
active_session().initialize(
model=self.trainer.model,
optimizer=self.trainer.optimizer,
start=state.epoch,
recipe=self.recipe,
)
self.trainer.accelerator.wait_for_everyone()
def on_step_begin(
self,
args: TrainingArguments,
state: TrainerState,
control: TrainerControl,
**kwargs,
) -> None:
"""
Called at the beginning of a training step. Triggers batch_start callback.
"""
super().on_step_begin(args, state, control, **kwargs)
session_callbacks.batch_start()
def on_step_end(
self,
args: TrainingArguments,
state: TrainerState,
control: TrainerControl,
**kwargs,
) -> None:
"""
Called at the end of a training step. Triggers optimizer and batch_end callbacks.
"""
super().on_step_end(args, state, control, **kwargs)
session_callbacks.optim_pre_step()
session_callbacks.optim_post_step()
session_callbacks.batch_end()
def on_train_end(
self,
args: TrainingArguments,
state: TrainerState,
control: TrainerControl,
**kwargs,
) -> None:
"""
Called at the end of training. Finalizes the compression session.
"""
super().on_train_end(args, state, control, **kwargs)
active_session().finalize()
self.trainer.compute_loss_func = self.original_compute_loss
class LLMCompressorPlugin(BasePlugin):
"""
Sparse Finetuning plugin for Axolotl integration.
"""
def get_input_args(self) -> str:
"""
Returns the path to the plugin's argument definition.
Returns:
str: Dotted path to the LLMCompressorArgs class.
"""
return "axolotl.integrations.llm_compressor.args.LLMCompressorArgs"
def add_callbacks_post_trainer(self, cfg: Any, trainer: Trainer) -> list:
"""
Adds Sparse Finetuning callback to the Trainer instance.
Args:
cfg (Any): Configuration object containing the sparse recipe.
trainer (Trainer): Huggingface Trainer instance.
Returns:
list: List containing the configured callback instances.
"""
LOG.info("Adding Sparse Finetuning callback to the trainer")
callback = LLMCompressorCallbackHandler(
trainer=trainer,
recipe=cfg.llmcompressor.recipe,
)
return [callback]
def compute_loss_wrapper(
compute_loss_func: Callable[Concatenate[Module, P], R],
) -> Callable[Concatenate[Module, P], R]:
"""
Wraps the loss computation function to trigger the loss_calculated callback.
Args:
compute_loss_func (Callable): Original loss computation function.
Returns:
Callable: Wrapped function that also invokes the loss_calculated callback.
"""
@wraps(compute_loss_func)
def compute_and_notify(model: Module, *args: P.args, **kwargs: P.kwargs) -> R:
loss = compute_loss_func(model, *args, **kwargs)
if active_session().lifecycle.initialized_ and model.training:
session_callbacks.loss_calculated(loss=loss)
return loss
return compute_and_notify

View File

@@ -0,0 +1,40 @@
"""Utilities for llmcompressor integration with axolotl."""
from typing import Union
from llmcompressor.transformers.sparsification.compressed_tensors_utils import (
modify_save_pretrained,
)
from transformers import PreTrainedModel, Trainer
def save_compressed_model(
model: PreTrainedModel,
output_dir: Union[str, bytes],
trainer: Trainer,
safe_serialization: bool = False,
save_compressed: bool = False,
) -> None:
"""
Synchronize processes, apply compression hooks, and save the model.
Args:
model (PreTrainedModel): The model to be saved.
output_dir (str or bytes): Path where the model files will be written.
trainer (Trainer): Hugging Face Trainer for process synchronization.
safe_serialization (bool): Use safe serialization if True.
save_compressed (bool): Write compressed tensors if True.
"""
trainer.accelerator.wait_for_everyone()
# Only the main process writes the files
if not trainer.accelerator.is_main_process:
return
modify_save_pretrained(model)
model.save_pretrained(
output_dir,
safe_serialization=safe_serialization,
save_compressed=save_compressed,
skip_sparsity_compression_stats=not save_compressed,
)

View File

@@ -294,8 +294,23 @@ def save_trained_model(
         trainer.model.save_pretrained(
             cfg.output_dir, safe_serialization=safe_serialization
         )
     model.save_pretrained(cfg.output_dir, safe_serialization=safe_serialization)

+    if hasattr(cfg, "llmcompressor") and cfg.llmcompressor:
+        # TODO: add integration support so this can be implemented completely within the plugin
+        from axolotl.integrations.llm_compressor.utils import (
+            save_compressed_model,
+        )
+
+        save_compressed_model(
+            model=model,
+            output_dir=cfg.output_dir,
+            trainer=trainer,
+            safe_serialization=safe_serialization,
+            save_compressed=cfg.llmcompressor.save_compressed,
+        )
+

 def create_model_card(cfg: DictDefault, trainer: Trainer):
     """

View File

@@ -141,6 +141,22 @@ def check_model_config(cfg: DictDefault, model_config: PretrainedConfig):
         hasattr(model_config, "quantization_config")
         and model_config.quantization_config
     )
+
+    # Detect compressed-tensors config
+    is_compressed_tensors_config = (
+        quant_config_exists
+        and model_config.quantization_config.get("quant_method") == "compressed-tensors"
+    )
+
+    if is_compressed_tensors_config:
+        if model_config.quantization_config.get("config_groups"):
+            LOG.warning(
+                "Found `config_groups` in a compressed-tensors config. "
+                "QAT integration with llmcompressor is not tested."
+            )
+        # Skip further quant checks for compressed-tensors
+        return
+
     quant_config_method_is_gptq = (
         quant_config_exists
         and "quant_method" in model_config.quantization_config

View File

@@ -1,13 +1,10 @@
 # pylint: skip-file
 """
-Multipack Batch Sampler - An efficient batch sampler for packing variable-length sequences
-into fixed-capacity batches to optimize memory usage and training throughput.
+Multipack Batch Sampler
 """
 import logging
 import math
-from concurrent.futures import ProcessPoolExecutor
-from multiprocessing import cpu_count
-from typing import Iterable, Union
+from typing import Any, Iterable, List, Union

 import numba
 import numpy as np
@@ -16,39 +13,26 @@ from torch.utils.data import BatchSampler, Sampler, SequentialSampler
 from axolotl.utils.distributed import reduce_and_broadcast

 LOG = logging.getLogger(__name__)
 LOG.setLevel(logging.INFO)


 @numba.njit
-def ffd_check(sequence_lengths: np.ndarray, bin_capacity: int, num_bins: int):
-    """
-    First-fit-decreasing bin packing algorithm check
-
-    Checks if sequences with the given lengths could fit in the specified number of bins
-
-    Args:
-        sequence_lengths: Array of sequence lengths
-        bin_capacity: Maximum capacity of each bin
-        num_bins: Number of bins available
-
-    Returns:
-        True if all sequences can be packed, False otherwise
-    """
-    # Sort sequence lengths in descending order for optimal packing
-    sequence_lengths = np.sort(sequence_lengths)[::-1]
-    # Initialize all bins with full capacity
-    bins = np.full((num_bins,), bin_capacity, dtype=sequence_lengths.dtype)
-    # Try to place each sequence in the first bin it fits
-    for size in sequence_lengths:
+def ffd_check(a: np.ndarray, c: int, n: int):
+    # First-fit-decreasing bin packing
+    # Check if a[] could fit in n bins with capacity c
+    # https://en.wikipedia.org/wiki/First-fit-decreasing_bin_packing
+
+    a = np.sort(a)[::-1]
+    bins = np.full((n,), c, dtype=a.dtype)
+    for size in a:
         not_found = True
-        for idx in range(num_bins):
+        for idx in range(n):
             if bins[idx] >= size:
                 bins[idx] -= size
                 not_found = False
                 break
-        # If no bin could fit this sequence, packing failed
+
         if not_found:
             return False
@@ -56,132 +40,86 @@ def ffd_check(sequence_lengths: np.ndarray, bin_capacity: int, num_bins: int):
 @numba.njit
-def pack_group(
-    sequence_lengths: np.ndarray,
-    group_offset: int,
-    bin_capacity: int,
-    max_bins: int,
-    bin_size: int,
-    safe_mode: bool = True,
-):
-    """
-    Pack a group of sequences into bins using First-Fit Decreasing algorithm
-
-    Args:
-        sequence_lengths: Array of sequence lengths
-        group_offset: Offset to apply to indices when returning results
-        bin_capacity: Maximum capacity of each bin
-        max_bins: Maximum number of bins to use
-        bin_size: Maximum number of sequences per bin
-        safe_mode: If True, use a more conservative packing approach
-
-    Returns:
-        List of bins, where each bin contains indices of sequences assigned to it
-    """
-    # Get sorting indices and sort lengths in descending order
-    indices = np.argsort(sequence_lengths)[::-1]
-    sorted_lengths = sequence_lengths[indices]
-
-    bins_remaining_space: list = []  # Tracks remaining capacity in each bin
-    bins_assigned_sequences: list = []  # Tracks sequence indices assigned to each bin
-
-    for seq_id, size in enumerate(sorted_lengths):
-        global_idx = indices[seq_id] + group_offset
-        # Try to place sequence in existing bins
-        add_new_bin = True
-        for bin_idx, _ in enumerate(bins_remaining_space):
-            if (
-                bins_remaining_space[bin_idx] >= size
-                and len(bins_assigned_sequences[bin_idx]) < bin_size
-            ):
-                bins_remaining_space[bin_idx] -= size
-                bins_assigned_sequences[bin_idx].append(global_idx)
-                add_new_bin = False
-                break
-
-        # Create a new bin if needed and if we haven't reached the limit
-        if add_new_bin:
-            if len(bins_remaining_space) >= max_bins and safe_mode:
-                # In safe mode, skip items that would exceed max_bins
-                continue
-            bins_remaining_space.append(bin_capacity - size)
-            bins_assigned_sequences.append([global_idx])
-
-        # Safety check to avoid infinite bins
-        if len(bins_remaining_space) > len(sequence_lengths):
-            break
-
-    return bins_assigned_sequences
-
-
-# Define a standalone function for multiprocessing
-def _process_group(args):
-    group_lengths, start_idx, bin_capacity, max_bins, bin_size, safe_mode = args
-    return pack_group(
-        group_lengths, start_idx, bin_capacity, max_bins, bin_size, safe_mode
-    )
-
-
-def pack_parallel(
-    sequence_lengths: np.ndarray,
-    bin_capacity: int,
-    group_size: int,
-    bin_size: int,
-    num_processes: int | None = None,
-    safe_mode: bool = True,
-):
-    """
-    Pack sequences into bins using parallel processing
-
-    Args:
-        sequence_lengths: Array of sequence lengths
-        bin_capacity: Maximum capacity of each bin as total number of tokens
-        group_size: Number of sequences to process in each group
-        bin_size: Maximum number of bins to use
-        num_processes: Number of parallel processes to use
-        safe_mode: If True, use a more conservative packing approach
-
-    Returns:
-        List of bins, where each bin contains indices of sequences assigned to it
-    """
-    num_items = len(sequence_lengths)
-    if num_processes is None:
-        num_processes = max(1, min(num_items // group_size, cpu_count()))
-
-    # Create tasks for parallel processing
-    tasks = []
-    for i in range(0, num_items, group_size):
-        group_lengths = sequence_lengths[i : i + group_size]
-        max_bins = len(group_lengths)  # Allow as many bins as items in the group
-        tasks.append((group_lengths, i, bin_capacity, max_bins, bin_size, safe_mode))
-
-    # Process groups in parallel
-    all_bins = []
-    with ProcessPoolExecutor(max_workers=num_processes) as executor:
-        for group_bins in executor.map(_process_group, tasks):
-            all_bins.extend(group_bins)
-
-    return all_bins
+def ffd_with_result(a: np.ndarray, c: int, start_index: int):
+    # First-fit-decreasing bin packing (with result return)
+
+    indices = np.argsort(a)[::-1]
+    a = a[indices]
+
+    bins: List[Any] = []
+    bins_result: List[Any] = []
+    for a_id, size in enumerate(a):
+        add_new = True
+        for idx in range(len(bins)):
+            if bins[idx] >= size:
+                bins[idx] -= size
+                bins_result[idx].append(indices[a_id] + start_index)
+                add_new = False
+                break
+
+        if add_new:
+            bins.append(c - size)
+            bins_result.append([indices[a_id] + start_index])
+
+    return bins_result
+
+
+@numba.njit
+def allocate(
+    lengths: np.ndarray, lengths_cumsum: np.ndarray, rank: int, c: int, n: int
+):
+    # Dynamic batch allocator, similar to Multifit
+    # https://en.wikipedia.org/wiki/Multifit_algorithm
+    # ~99.5% efficiency on OpenChat training set (12 * 2048 ctx len)
+
+    s = 0
+    start_index = 0
+    result = []
+
+    while True:
+        # binary search [l, r)
+        left = 1
+        right = 1 + np.searchsorted(lengths_cumsum[start_index:], s + c * n, "right")
+
+        while right - left > 1:
+            mid = (left + right) // 2
+            if ffd_check(lengths[start_index : start_index + mid], c, n):
+                left = mid
+            else:
+                right = mid
+
+        # use length l
+        batch = ffd_with_result(
+            lengths[start_index : start_index + left], c, start_index
+        )
+        assert len(batch) <= n
+        if len(batch) < n:
+            break
+
+        start_index += left
+        s = lengths_cumsum[start_index - 1]
+
+        # add local rank
+        result.append(batch[rank])
+
+    return result, s, len(result) * c * n


 @numba.njit
-def allocate_sequentially(
-    sequence_lengths: np.ndarray, rank: int, bin_capacity: int, num_ranks: int
-):
+def allocate_sequentially(lengths: np.ndarray, rank: int, c: int, n: int):
     """
     Sequential allocator that preserves example order

-    Args:
-        sequence_lengths: The lengths of all examples
-        rank: The current rank (for distributed training)
-        bin_capacity: The capacity of each bin (maximum sequence length)
-        num_ranks: Number of ranks (processes/GPUs)
+    Parameters:
+    - lengths: The lengths of all examples
+    - rank: The current rank (for distributed training)
+    - c: The capacity of each bin (maximum sequence length)
+    - n: Number of ranks

     Returns:
-        rank_batches: List of batches for the current rank
-        total_tokens_used: Number of actual example tokens
-        total_token_slots: Maximum theoretical number of example tokens (number of bins * bin capacity)
+    - result: List of batches for the current rank
+    - total_used: Number of actual example tokens
+    - total_slots: Maximum theoretical number of example tokens (number of bins * bin capacity)
     """
     result = []
     total_used = 0
@@ -189,9 +127,9 @@ def allocate_sequentially(
     # First, do sequential packing into bins
     all_bins = []
     current_bin = [0 for i in range(0)]  # numba hint
-    remaining_capacity = bin_capacity
+    remaining_capacity = c

-    for idx, size in enumerate(sequence_lengths):
+    for idx, size in enumerate(lengths):
         if size <= remaining_capacity:
             # Example fits in current bin
             current_bin.append(idx)
@@ -202,7 +140,7 @@ def allocate_sequentially(
             if current_bin:  # Add non-empty bin to all_bins
                 all_bins.append(current_bin)
             current_bin = [idx]
-            remaining_capacity = bin_capacity - size
+            remaining_capacity = c - size
         total_used += size

     # Add the last bin if not empty
@@ -210,227 +148,132 @@ def allocate_sequentially(
         all_bins.append(current_bin)

     # Assign bins to ranks - each rank gets every n-th bin
-    for bin_idx in range(rank, len(all_bins), num_ranks):
+    for bin_idx in range(rank, len(all_bins), n):
         result.append(all_bins[bin_idx])

-    return result, total_used, len(all_bins) * bin_capacity
+    return result, total_used, len(all_bins) * c


 class MultipackBatchSampler(BatchSampler):
-    """
-    Batch sampler class for efficient packing of variable-length sequences
-
-    This sampler packs sequences into fixed-capacity bins (batches) to maximize
-    GPU memory utilization and training throughput by reducing padding.
-
-    It supports both parallel packing (using FFD algorithm) and
-    sequential packing (preserving original sequence order).
-    """
+    """Batch sampler class for multipack"""

     def __init__(
         self,
         sampler: Union[Sampler[int], Iterable[int]],
-        batch_size: int,  # Number of bins per batch
-        batch_max_len: int,  # Maximum sequence length (bin capacity)
-        lengths: np.ndarray,  # Sequence lengths
-        packing_efficiency_estimate: float = 1.0,  # Initial efficiency estimate
-        drop_last: bool = False,  # Whether to drop final batches (might be incomplete)
-        num_count_samples: int = 16,  # Number of times to estimate batch count
-        sequential: bool = False,  # Whether to use sequential packing
-        group_size: int = 100_000,  # Size of groups for parallel packing
-        bin_size: int = 200,  # The max number of samples that can be packed in a single bin
-        num_processes: int | None = None,  # Number of processes for parallel packing
-        safe_mode: bool = True,  # Conservative packing to prevent training instability
-        **kwargs,  # pylint: disable=unused-argument
+        batch_size: int,
+        batch_max_len: int,
+        lengths: np.ndarray,
+        packing_efficiency_estimate: float = 1.0,
+        drop_last: bool = False,
+        num_count_samples: int = 16,
+        sequential: bool = False,
+        **kwargs,
     ):
         super().__init__(sampler, batch_size, drop_last)
         self.batch_size = batch_size
         self.batch_max_len = batch_max_len
-        self.lengths = np.array(lengths, dtype=np.int32)
+        self.lengths: np.ndarray = lengths
         self.packing_efficiency_estimate = packing_efficiency_estimate or 1.0
         self.sequential = sequential
-        self.group_size = group_size
-        self.bin_size = bin_size
-        self.num_processes = num_processes
-        self.safe_mode = safe_mode
+
+        assert isinstance(self.lengths, np.ndarray)

         self.epoch = 0

-        # Efficiency statistics tracking
-        self.total_tokens_used = 0
-        self.total_token_slots = 0
+        # statistics
+        self.eff_total_used = 0
+        self.eff_total_slots = 0

-        # The number of times to calculate batches to determine minimum packed dataset length
+        # The number of times to calculate the batches to determine the minimum packed dataset length for the local rank
         self.num_count_samples = num_count_samples

-        # Minimum packed dataset length across all ranks (determined by gather/broadcast)
+        # the minimum packed dataset length across all ranks determined by a gather/broadcast
         self.len_across_ranks = None

-        # Cache for batches
-        self._batches = None
-
         if self.sequential and not isinstance(sampler, SequentialSampler):
             LOG.warning(
                 "using sequential sample packing with non-sequential sampler, did you want to also enable curriculum_sampling?"
             )

     def set_epoch(self, epoch: int):
-        """Set the epoch number, used for reproducible shuffling across epochs"""
         self.epoch = epoch
-        self._batches = None  # Invalidate batch cache

     def generate_batches(self, set_stats=False):
-        """
-        Generate packed batches for training
-
-        Args:
-            set_stats: Whether to update efficiency statistics
-
-        Returns:
-            List of batches, where each batch contains multiple bins,
-            and each bin contains multiple sequence indices
-        """
-        if self._batches is not None:
-            return self._batches
-
-        # Get indices from the sampler
-        indices = [  # pylint: disable=unnecessary-comprehension
-            idx for idx in self.sampler
-        ]
+        indices = [idx for idx in self.sampler]

-        # Get lengths of the selected sequences
         lengths = self.lengths[indices]
+        lengths_cumsum = np.cumsum(lengths)

-        # Pack sequences into bins using either sequential or parallel packing
         if self.sequential:
-            bins, total_used, total_slots = allocate_sequentially(
-                lengths,
+            batches, total_used, total_slots = allocate_sequentially(
+                lengths=lengths,
                 rank=0,
-                bin_capacity=self.batch_max_len,
-                num_ranks=1,
+                c=self.batch_max_len,
+                n=1,
             )
-            # Map bin indices back to original indices
-            bins = [[indices[b_idx] for b_idx in bin_indices] for bin_indices in bins]
         else:
-            # Use parallel packing
-            all_bins = pack_parallel(
-                lengths,
-                bin_capacity=self.batch_max_len,
-                group_size=self.group_size,
-                bin_size=self.bin_size,
-                num_processes=self.num_processes,
-                safe_mode=self.safe_mode,
+            batches, total_used, total_slots = allocate(
+                lengths=lengths,
+                lengths_cumsum=lengths_cumsum,
+                rank=0,
+                c=self.batch_max_len,
+                n=1,
             )
-            # Map bin indices back to original indices
-            bins = [
-                [indices[b_idx] for b_idx in bin_indices] for bin_indices in all_bins
-            ]
-            # Calculate efficiency statistics
-            total_used = lengths.sum()
-            total_slots = len(all_bins) * self.batch_max_len

-        # Group bins into batches (each batch contains batch_size bins)
         batches = [
-            bins[i : i + self.batch_size] for i in range(0, len(bins), self.batch_size)
+            [
+                [indices[b_idx] for b_idx in batch]
+                for batch in batches[i : i + self.batch_size]
+            ]
+            for i in range(0, len(batches), self.batch_size)
         ]

-        # Drop last batch if requested and it's incomplete
-        if self.drop_last and len(batches[-1]) < self.batch_size:
-            batches = batches[:-1]
-            # Adjust total_slots if we dropped a batch
-            if not self.sequential:
-                total_slots -= (self.batch_size - len(batches[-1])) * self.batch_max_len
-
-        # Update statistics if requested
+        # statistics
         if set_stats:
-            self.total_tokens_used += total_used
-            self.total_token_slots += total_slots
+            self.eff_total_used += total_used
+            self.eff_total_slots += total_slots

-        self._batches = batches
         return batches

     def __iter__(self):
-        """
-        Return an iterator over batches
-
-        The batches are truncated to match the minimum number of batches across all ranks
-        to ensure distributed training balance
-        """
         batches = self.generate_batches(set_stats=True)
         if self.len_across_ranks:
-            # Truncate batches to ensure all ranks have the same number of batches
+            # make sure the batches we iterate over is truncated to the same min length across all ranks
             batches = batches[: self.len_across_ranks]
         return iter(batches)

+    def num_batches(self):
+        batches = self.generate_batches(set_stats=True)
+        return len(batches)
+
     def efficiency(self):
-        """
-        Calculate the packing efficiency (ratio of tokens used to total token slots)
-
-        Higher is better - 1.0 would mean perfect packing with no wasted space
-        """
-        if self.total_token_slots == 0:
-            self.generate_batches(set_stats=True)
-        if self.total_token_slots == 0:
-            return 0.0
-        # Return a Python float instead of potentially a numpy float
-        return float(self.total_tokens_used / self.total_token_slots)
+        return self.eff_total_used / self.eff_total_slots

     def gather_efficiency(self):
-        """
-        Gather and synchronize packing efficiency estimates across all distributed ranks
-
-        Returns a conservative efficiency estimate based on the measurements
-        """
-        def calc_sample_packing_eff_est(estimates: list[float]):
+        def calc_sample_packing_eff_est(estimates: List[float]):
             LOG.debug(f"sample_packing_eff_est across ranks: {repr(estimates)}")
-            # Use 99.7% of max observed efficiency as a safe estimate
-            max_eff = max(float(eff) for eff in estimates)
-            return math.floor(0.997 * max_eff)
+            return math.floor(0.997 * max(estimates))

-        # Gather efficiency from all ranks and apply the calculation function
         sample_packing_actual_eff_all = reduce_and_broadcast(
-            lambda: float(self.efficiency()),  # pylint: disable=unnecessary-lambda
+            lambda: self.efficiency(),  # pylint: disable=unnecessary-lambda
             calc_sample_packing_eff_est,
         )
-        # Quantize to 0.5% intervals for stability
         sample_packing_eff_est = (
             math.ceil(sample_packing_actual_eff_all * 200.0) / 200.0
         )
         return sample_packing_eff_est

     def gather_len_batches(self, num):
-        """
-        Gather and synchronize batch counts across all distributed ranks
-
-        Returns the minimum number of batches available on any rank
-        """
         def calc_min_len(estimates: list[(int, float)]):
             LOG.info(f"gather_len_batches: {repr(estimates)}")
             return math.floor(min(estimates))

-        # Find minimum batch count across ranks to ensure balance
         min_len_batches = reduce_and_broadcast(lambda: num, calc_min_len)
         return min_len_batches

     def __len__(self):
-        """
-        Return the total number of batches that will be yielded by this sampler
-
-        This is calculated as the minimum number of batches available on any rank
-        to ensure balanced distributed training
-        """
-        if self._batches is None:
-            self._batches = self.generate_batches(set_stats=True)
-        if self.len_across_ranks is None:
-            # Sample multiple times to get stable estimate
-            len_batches = min(  # pylint: disable=consider-using-generator
-                [len(self._batches) for _ in range(self.num_count_samples)]
+        if not self.len_across_ranks:
+            len_batches = min(
+                [self.num_batches() for _ in range(self.num_count_samples)]
             )
-            # Gather minimum across all ranks
             self.len_across_ranks = self.gather_len_batches(len_batches)

         return self.len_across_ranks
View File

@@ -0,0 +1,111 @@
"""
E2E smoke tests for LLMCompressorPlugin integration
"""
from pathlib import Path
import pytest
from axolotl.cli.args import TrainerCliArgs
from axolotl.common.datasets import load_datasets
from axolotl.train import train
from axolotl.utils.config import normalize_config, prepare_plugins, validate_config
from axolotl.utils.dict import DictDefault
from tests.e2e.utils import (
check_model_output_exists,
require_llmcompressor,
require_torch_2_4_1,
)
MODELS = [
"nm-testing/llama2.c-stories42M-pruned2.4-compressed",
"nm-testing/llama2.c-stories42M-gsm8k-sparse-only-compressed",
]
@pytest.mark.parametrize(
"base_model", MODELS, ids=["no-checkpoint-recipe", "with-checkpoint-recipe"]
)
@pytest.mark.parametrize(
"save_compressed", [True, False], ids=["save_compressed", "save_uncompressed"]
)
class TestLLMCompressorIntegration:
"""
e2e tests for axolotl.integrations.llm_compressor.LLMCompressorPlugin
"""
@require_llmcompressor
@require_torch_2_4_1
def test_llmcompressor_plugin(
self, temp_dir, base_model: str, save_compressed: bool
):
from llmcompressor import active_session
# core cfg
cfg = DictDefault(
{
"base_model": base_model,
"plugins": ["axolotl.integrations.llm_compressor.LLMCompressorPlugin"],
"sequence_len": 1024,
"val_set_size": 0.05,
"special_tokens": {"pad_token": "<|endoftext|>"},
"datasets": [{"path": "mhenrichsen/alpaca_2k_test", "type": "alpaca"}],
"num_epochs": 1,
"micro_batch_size": 2,
"gradient_accumulation_steps": 2,
"output_dir": temp_dir,
"learning_rate": 1e-5,
"optimizer": "adamw_torch_fused",
"lr_scheduler": "cosine",
"save_safetensors": True,
"bf16": "auto",
"max_steps": 5,
"llmcompressor": {
"recipe": {
"finetuning_stage": {
"finetuning_modifiers": {
"ConstantPruningModifier": {
"targets": [
"re:.*q_proj.weight",
"re:.*k_proj.weight",
"re:.*v_proj.weight",
"re:.*o_proj.weight",
"re:.*gate_proj.weight",
"re:.*up_proj.weight",
"re:.*down_proj.weight",
],
"start": 0,
},
},
},
},
"save_compressed": save_compressed,
},
}
)
prepare_plugins(cfg)
cfg = validate_config(cfg)
normalize_config(cfg)
cli_args = TrainerCliArgs()
dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
try:
train(cfg=cfg, dataset_meta=dataset_meta)
check_model_output_exists(temp_dir, cfg)
_check_llmcompressor_model_outputs(temp_dir, save_compressed)
finally:
active_session().reset()
def _check_llmcompressor_model_outputs(temp_dir, save_compressed):
if save_compressed:
assert (Path(temp_dir) / "recipe.yaml").exists()
from compressed_tensors import ModelCompressor
from compressed_tensors.config import Sparse24BitMaskConfig
compressor = ModelCompressor.from_pretrained(temp_dir)
assert compressor is not None
assert isinstance(compressor.sparsity_config, Sparse24BitMaskConfig)

View File

@@ -105,7 +105,25 @@ def require_vllm(test_case):
         return False

     return unittest.skipUnless(
-        is_vllm_installed(), "test requires a vllm to be installed"
+        is_vllm_installed(), "test requires vllm to be installed"
     )(test_case)


+def require_llmcompressor(test_case):
+    """
+    Decorator marking a test that requires a llmcompressor to be installed
+    """
+
+    def is_llmcompressor_installed():
+        try:
+            import llmcompressor  # pylint: disable=unused-import # noqa: F401
+
+            return True
+        except ImportError:
+            return False
+
+    return unittest.skipUnless(
+        is_llmcompressor_installed(), "test requires llmcompressor to be installed"
+    )(test_case)