split llmcompressor from vllm checks

additional fixes for docker and saving compressed
Fix: Test
2025-04-29 08:35:06 -04:00 · 2025-04-28 13:16:29 -04:00
19 changed files with 703 additions and 32 deletions
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -24,7 +24,7 @@ jobs:
            cuda_version: 12.4.1
            python_version: "3.11"
            pytorch: 2.5.1
-            axolotl_extras: vllm
+            axolotl_extras:
          - cuda: 124
            cuda_version: 12.4.1
            python_version: "3.11"
--- a/.github/workflows/multi-gpu-e2e.yml
+++ b/.github/workflows/multi-gpu-e2e.yml
@@ -43,7 +43,7 @@ jobs:
            cuda_version: 12.4.1
            python_version: "3.11"
            pytorch: 2.5.1
-            axolotl_extras: vllm
+            axolotl_extras:
            num_gpus: 2
            nightly_build: "true"
          - cuda: 126
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -258,6 +258,12 @@ jobs:
      fail-fast: false
      matrix:
        include:
+          - cuda: 124
+            cuda_version: 12.4.1
+            python_version: "3.11"
+            pytorch: 2.6.0
+            num_gpus: 1
+            axolotl_extras: llmcompressor
          - cuda: 124
            cuda_version: 12.4.1
            python_version: "3.11"
@@ -269,7 +275,7 @@ jobs:
            python_version: "3.11"
            pytorch: 2.5.1
            num_gpus: 1
-            axolotl_extras: vllm
+            axolotl_extras:
          - cuda: 126
            cuda_version: 12.6.3
            python_version: "3.11"
--- a/cicd/multigpu.sh
+++ b/cicd/multigpu.sh
@@ -20,4 +20,4 @@ pytest -v  --durations=10 -n1 /workspace/axolotl/tests/e2e/multigpu/patched/ \
  --cov-report=xml:multigpu-coverage.xml

 # Upload coverage to Codecov
-codecov upload-process -t $CODECOV_TOKEN -f multigpu-coverage.xml -F multigpu,docker-tests,pytorch-${PYTORCH_VERSION}
+codecov upload-process -t "${CODECOV_TOKEN}" -f multigpu-coverage.xml -F multigpu,docker-tests,pytorch-${PYTORCH_VERSION} || true
--- a/docs/custom_integrations.qmd
+++ b/docs/custom_integrations.qmd
@@ -49,7 +49,8 @@ sections = [
    ("Knowledge Distillation (KD)", "kd"),
    ("Liger Kernels", "liger"),
    ("Language Model Evaluation Harness (LM Eval)", "lm_eval"),
-    ("Spectrum", "spectrum")
+    ("Spectrum", "spectrum"),
+    ("LLMCompressor", "llm_compressor")
 ]

 for section_name, folder_name in sections:
--- a/examples/llama-3/sparse-finetuning.yaml
+++ b/examples/llama-3/sparse-finetuning.yaml
@@ -0,0 +1,77 @@
+base_model: neuralmagic/Sparse-Llama-3.1-8B-2of4
+
+plugins:
+  - axolotl.integrations.llm_compressor.LLMCompressorPlugin
+
+load_in_8bit: false
+load_in_4bit: false
+strict: false
+
+datasets:
+  - path: tatsu-lab/alpaca
+    type: alpaca
+dataset_prepared_path: last_run_prepared
+val_set_size: 0.05
+output_dir: ./outputs/out
+
+sequence_len: 4096
+sample_packing: true
+pad_to_sequence_len: true
+eval_sample_packing: false
+
+wandb_project:
+wandb_entity:
+wandb_watch:
+wandb_name:
+wandb_log_model:
+
+gradient_accumulation_steps: 8
+micro_batch_size: 1
+num_epochs: 1
+optimizer: paged_adamw_8bit
+lr_scheduler: cosine
+learning_rate: 2e-5
+
+train_on_inputs: false
+group_by_length: false
+bf16: auto
+fp16:
+tf32: false
+
+gradient_checkpointing: true
+gradient_checkpointing_kwargs:
+  use_reentrant: false
+early_stopping_patience:
+resume_from_checkpoint:
+logging_steps: 1
+xformers_attention:
+flash_attention: true
+
+warmup_steps: 100
+evals_per_epoch: 2
+eval_table_size:
+saves_per_epoch: 1
+debug:
+deepspeed:
+weight_decay: 0.0
+fsdp:
+fsdp_config:
+special_tokens:
+  pad_token: <|end_of_text|>
+
+llmcompressor:
+  recipe:
+    finetuning_stage:
+      finetuning_modifiers:
+        ConstantPruningModifier:
+          targets: [
+            're:.*q_proj.weight',
+            're:.*k_proj.weight',
+            're:.*v_proj.weight',
+            're:.*o_proj.weight',
+            're:.*gate_proj.weight',
+            're:.*up_proj.weight',
+            're:.*down_proj.weight',
+          ]
+          start: 0
+  save_compressed: true
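The `targets` entries in the recipe above carry a `re:` prefix. The sketch below illustrates how such patterns could select parameter names. It assumes that `re:` means "treat the rest as a regular expression matched from the start of the parameter name"; llmcompressor's own matching rules may differ. The helper `matches_any` and the sample parameter names are hypothetical.

```python
import re

# Patterns copied from the example recipe.
TARGETS = [
    "re:.*q_proj.weight",
    "re:.*k_proj.weight",
    "re:.*v_proj.weight",
    "re:.*o_proj.weight",
    "re:.*gate_proj.weight",
    "re:.*up_proj.weight",
    "re:.*down_proj.weight",
]


def matches_any(param_name: str, targets: list[str]) -> bool:
    """Return True if param_name is selected by any target.

    Assumption: a "re:" prefix marks a regex; anything else is an exact name.
    """
    for target in targets:
        if target.startswith("re:"):
            if re.match(target[len("re:"):], param_name):
                return True
        elif target == param_name:
            return True
    return False


# Hypothetical parameter names in the style of a Llama checkpoint.
for name in (
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
    "model.embed_tokens.weight",
):
    print(name, matches_any(name, TARGETS))
```

Under this reading, the attention and MLP projection weights are selected, while embeddings and norm weights are left alone.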
--- a/requirements.txt
+++ b/requirements.txt
@@ -11,13 +11,13 @@ liger-kernel==0.5.8

 packaging==23.2

-peft==0.15.1
+peft==0.15.2
 transformers==4.51.3
 tokenizers>=0.21.1
 accelerate==1.6.0
 datasets==3.5.0
 deepspeed>=0.15.4
-trl==0.16.1
+trl==0.17.0
 hf_xet==1.0.0
 hqq==0.2.5

--- a/setup.py
+++ b/setup.py
@@ -67,13 +67,13 @@ def parse_requirements(extras_require_map):
            if (major, minor) >= (2, 7):
                _install_requires.pop(_install_requires.index(xformers_version))
                # _install_requires.append("xformers==0.0.29.post3")  # xformers seems to be hard pinned to 2.6.0
-                extras_require_map["vllm"] = ["vllm==0.8.3"]
+                extras_require_map["vllm"] = ["vllm==0.8.4"]
            elif (major, minor) >= (2, 6):
                _install_requires.pop(_install_requires.index(xformers_version))
                _install_requires.append(
                    "xformers==0.0.29.post2"
                )  # vllm needs post2 w torch 2.6
-                extras_require_map["vllm"] = ["vllm==0.8.3"]
+                extras_require_map["vllm"] = ["vllm==0.8.4"]
            elif (major, minor) >= (2, 5):
                _install_requires.pop(_install_requires.index(xformers_version))
                if patch == 0:
@@ -149,6 +149,9 @@ extras_require = {
    "vllm": [
        "vllm==0.7.2",
    ],
+    "llmcompressor": [
+        "llmcompressor==0.5.1",
+    ],
 }

 install_requires, dependency_links, extras_require_build = parse_requirements(
--- a/src/axolotl/core/trainers/grpo/__init__.py
+++ b/src/axolotl/core/trainers/grpo/__init__.py
@@ -135,7 +135,9 @@ class GRPOStrategy:
        try:
            # use importlib to dynamically load the reward function from the module
            reward_func_module_name = reward_func_fqn.split(".")[-1]
-            reward_func_module = importlib.import_module(reward_func_fqn.split(".")[-2])
+            reward_func_module = importlib.import_module(
+                ".".join(reward_func_fqn.split(".")[:-1])
+            )
            reward_func = getattr(reward_func_module, reward_func_module_name)
            if not len(inspect.signature(reward_func).parameters) >= 2:
                raise ValueError(
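The effect of this fix can be seen with a small standalone sketch. It uses `json.decoder.JSONDecoder` from the standard library as a stand-in for a nested reward function path; the helper names are hypothetical and not part of Axolotl.

```python
import importlib


def parent_only_module_name(fqn: str) -> str:
    """Old behaviour: keep only the last component before the attribute name."""
    return fqn.split(".")[-2]


def full_module_name(fqn: str) -> str:
    """Fixed behaviour: keep the whole dotted path before the attribute name."""
    return ".".join(fqn.split(".")[:-1])


def load_attr(fqn: str):
    """Import the full module path, then fetch the attribute from it."""
    module = importlib.import_module(full_module_name(fqn))
    return getattr(module, fqn.split(".")[-1])


fqn = "json.decoder.JSONDecoder"  # stand-in for "pkg.sub.rewards.my_reward"
print(parent_only_module_name(fqn))  # decoder
print(full_module_name(fqn))         # json.decoder
print(load_attr(fqn).__name__)       # JSONDecoder
```

With the old expression, only `decoder` would be passed to `importlib.import_module`, which works only when the reward function lives in a top-level module.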
--- a/src/axolotl/integrations/llm_compressor/README.md
+++ b/src/axolotl/integrations/llm_compressor/README.md
@@ -0,0 +1,108 @@
+# LLMCompressor Integration
+
+Fine-tune sparsified models in Axolotl using Neural Magic's [LLMCompressor](https://github.com/vllm-project/llm-compressor).
+
+This integration enables fine-tuning of models sparsified using LLMCompressor within the Axolotl training framework. By combining LLMCompressor's model compression capabilities with Axolotl's distributed training pipelines, users can efficiently fine-tune sparse models at scale.
+
+It uses Axolotl’s plugin system to hook into the fine-tuning flows while maintaining sparsity throughout training.
+
+---
+
+## Requirements
+
+- Axolotl with `llmcompressor` extras:
+
+  ```bash
+  pip install "axolotl[llmcompressor]"
+  ```
+
+- Requires `llmcompressor >= 0.5.1`
+
+Installing the extras pulls in all dependencies needed to fine-tune sparsified models with this integration.
+
+---
+
+## Usage
+
+To enable sparse fine-tuning with this integration, include the plugin in your Axolotl config:
+
+```yaml
+plugins:
+  - axolotl.integrations.llm_compressor.LLMCompressorPlugin
+
+llmcompressor:
+  recipe:
+    finetuning_stage:
+      finetuning_modifiers:
+        ConstantPruningModifier:
+          targets: [
+            're:.*q_proj.weight',
+            're:.*k_proj.weight',
+            're:.*v_proj.weight',
+            're:.*o_proj.weight',
+            're:.*gate_proj.weight',
+            're:.*up_proj.weight',
+            're:.*down_proj.weight',
+          ]
+          start: 0
+  save_compressed: true
+# ... (other training arguments)
+```
+
+This plugin **does not apply pruning or sparsification itself** — it is intended for **fine-tuning models that have already been sparsified**.
+
+Pre-sparsified checkpoints can be:
+- Generated using [LLMCompressor](https://github.com/vllm-project/llm-compressor)
+- Downloaded from [Neural Magic's Hugging Face page](https://huggingface.co/neuralmagic)
+- Any custom LLM with compatible sparsity patterns that you've created yourself
+
+To learn more about writing and customizing LLMCompressor recipes, refer to the official documentation:
+[https://github.com/vllm-project/llm-compressor/blob/main/README.md](https://github.com/vllm-project/llm-compressor/blob/main/README.md)
+
+### Storage Optimization with save_compressed
+
+Setting `save_compressed: true` in your configuration enables saving models in a compressed format, which:
+- Reduces disk space usage by approximately 40%
+- Maintains compatibility with vLLM for accelerated inference
+- Maintains compatibility with llmcompressor for further optimization (example: quantization)
+
+This option is highly recommended when working with sparse models to maximize the benefits of model compression.
+
+### Example Config
+
+See [`examples/llama-3/sparse-finetuning.yaml`](examples/llama-3/sparse-finetuning.yaml) for a complete example.
+
+---
+
+## Inference with vLLM
+
+After fine-tuning your sparse model, you can use vLLM for efficient inference.
+You can also use LLMCompressor to apply additional quantization to your fine-tuned
+sparse model before inference for even greater performance benefits. A minimal vLLM example:
+
+```python
+from vllm import LLM, SamplingParams
+
+prompts = [
+    "Hello, my name is",
+    "The president of the United States is",
+    "The capital of France is",
+    "The future of AI is",
+]
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+llm = LLM("path/to/your/sparse/model")
+outputs = llm.generate(prompts, sampling_params)
+
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
+
+For more details on vLLM's capabilities and advanced configuration options, see the [official vLLM documentation](https://docs.vllm.ai/).
+
+## Learn More
+
+For details on available sparsity and quantization schemes, fine-tuning recipes, and usage examples, visit the official LLMCompressor repository:
+
+[https://github.com/vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)
--- a/src/axolotl/integrations/llm_compressor/__init__.py
+++ b/src/axolotl/integrations/llm_compressor/__init__.py
@@ -0,0 +1,5 @@
+"""Integration entry point for the LLMCompressor plugin."""
+
+from .plugin import LLMCompressorPlugin
+
+__all__ = ["LLMCompressorPlugin"]
--- a/src/axolotl/integrations/llm_compressor/args.py
+++ b/src/axolotl/integrations/llm_compressor/args.py
@@ -0,0 +1,40 @@
+"""
+LLMCompressor and Sparse Finetuning config models.
+"""
+
+from typing import Any
+
+from pydantic import BaseModel, Field
+from typing_extensions import Annotated
+
+
+class CompressionArgs(BaseModel):
+    """Sparse Finetuning config for LLMCompressor."""
+
+    # Typing for recipe is set to Any due to:
+    # https://github.com/vllm-project/llm-compressor/issues/1319
+    recipe: Annotated[
+        Any,
+        Field(
+            description="The recipe containing the compression algorithms and hyperparameters to apply."
+        ),
+    ]
+
+    save_compressed: Annotated[
+        bool,
+        Field(
+            default=False,
+            description="Whether to save the compressed model after training.",
+        ),
+    ]
+
+
+class LLMCompressorArgs(BaseModel):
+    """LLMCompressor configuration BaseModel."""
+
+    llmcompressor: Annotated[
+        CompressionArgs,
+        Field(
+            description="Arguments enabling compression pathways through the LLM Compressor plugins"
+        ),
+    ]
--- a/src/axolotl/integrations/llm_compressor/plugin.py
+++ b/src/axolotl/integrations/llm_compressor/plugin.py
@@ -0,0 +1,171 @@
+"""
+Sparse Finetuning plugin for Axolotl — enables handling of sparse neural networks
+by maintaining masks for zero weights during training.
+"""
+
+import logging
+from functools import wraps
+from typing import Any, Callable, Concatenate, ParamSpec, TypeVar
+
+from llmcompressor import active_session, create_session
+from llmcompressor.core import callbacks as session_callbacks
+from llmcompressor.recipe import Recipe
+from torch.nn import Module
+from transformers.trainer import Trainer
+from transformers.trainer_callback import TrainerCallback, TrainerControl, TrainerState
+from transformers.training_args import TrainingArguments
+
+from axolotl.integrations.base import BasePlugin
+
+P = ParamSpec("P")  # Params for generic function signatures
+R = TypeVar("R")  # Return type for generic function signatures
+
+LOG = logging.getLogger("axolotl.integrations.llm_compressor")
+
+
+class LLMCompressorCallbackHandler(TrainerCallback):
+    """
+    Trainer callback for Sparse Finetuning.
+    Maintains sparsity patterns during training by applying masks after optimization steps,
+    ensuring zero-weight updates are canceled out.
+    """
+
+    def __init__(self, trainer: Trainer, recipe: Any):
+        """
+        Initialize the Sparse Finetuning callback handler.
+
+        Args:
+            trainer (Trainer): Huggingface Trainer instance.
+            recipe (Recipe | dict): Sparse finetuning recipe to apply.
+        """
+        super().__init__()
+        self.trainer = trainer
+        self.recipe = (
+            Recipe.model_validate(recipe) if not isinstance(recipe, Recipe) else recipe
+        )
+        self.original_compute_loss = trainer.compute_loss
+        self.trainer.compute_loss = compute_loss_wrapper(self.trainer.compute_loss)
+        create_session()
+
+    def on_train_begin(
+        self,
+        args: TrainingArguments,
+        state: TrainerState,
+        control: TrainerControl,
+        **kwargs,
+    ) -> None:
+        """
+        Called at the beginning of training. Initializes the compression session.
+
+        Args:
+            args (TrainingArguments): Training arguments.
+            state (TrainerState): Trainer state.
+            control (TrainerControl): Trainer control.
+        """
+        super().on_train_begin(args, state, control, **kwargs)
+        self.trainer.accelerator.wait_for_everyone()
+        active_session().initialize(
+            model=self.trainer.model,
+            optimizer=self.trainer.optimizer,
+            start=state.epoch,
+            recipe=self.recipe,
+        )
+        self.trainer.accelerator.wait_for_everyone()
+
+    def on_step_begin(
+        self,
+        args: TrainingArguments,
+        state: TrainerState,
+        control: TrainerControl,
+        **kwargs,
+    ) -> None:
+        """
+        Called at the beginning of a training step. Triggers batch_start callback.
+        """
+        super().on_step_begin(args, state, control, **kwargs)
+        session_callbacks.batch_start()
+
+    def on_step_end(
+        self,
+        args: TrainingArguments,
+        state: TrainerState,
+        control: TrainerControl,
+        **kwargs,
+    ) -> None:
+        """
+        Called at the end of a training step. Triggers optimizer and batch_end callbacks.
+        """
+        super().on_step_end(args, state, control, **kwargs)
+        session_callbacks.optim_pre_step()
+        session_callbacks.optim_post_step()
+        session_callbacks.batch_end()
+
+    def on_train_end(
+        self,
+        args: TrainingArguments,
+        state: TrainerState,
+        control: TrainerControl,
+        **kwargs,
+    ) -> None:
+        """
+        Called at the end of training. Finalizes the compression session.
+        """
+        super().on_train_end(args, state, control, **kwargs)
+        active_session().finalize()
+        self.trainer.compute_loss = self.original_compute_loss
+
+
+class LLMCompressorPlugin(BasePlugin):
+    """
+    Sparse Finetuning plugin for Axolotl integration.
+    """
+
+    def get_input_args(self) -> str:
+        """
+        Returns the path to the plugin's argument definition.
+
+        Returns:
+            str: Dotted path to the LLMCompressorArgs class.
+        """
+        return "axolotl.integrations.llm_compressor.args.LLMCompressorArgs"
+
+    def add_callbacks_post_trainer(self, cfg: Any, trainer: Trainer) -> list:
+        """
+        Adds Sparse Finetuning callback to the Trainer instance.
+
+        Args:
+            cfg (Any): Configuration object containing the sparse recipe.
+            trainer (Trainer): Huggingface Trainer instance.
+
+        Returns:
+            list: List containing the configured callback instances.
+        """
+        LOG.info("Adding Sparse Finetuning callback to the trainer")
+        callback = LLMCompressorCallbackHandler(
+            trainer=trainer,
+            recipe=cfg.llmcompressor.recipe,
+        )
+        return [callback]
+
+
+def compute_loss_wrapper(
+    compute_loss_func: Callable[Concatenate[Module, P], R],
+) -> Callable[Concatenate[Module, P], R]:
+    """
+    Wraps the loss computation function to trigger the loss_calculated callback.
+
+    Args:
+        compute_loss_func (Callable): Original loss computation function.
+
+    Returns:
+        Callable: Wrapped function that also invokes the loss_calculated callback.
+    """
+
+    @wraps(compute_loss_func)
+    def compute_and_notify(model: Module, *args: P.args, **kwargs: P.kwargs) -> R:
+        loss = compute_loss_func(model, *args, **kwargs)
+        if active_session().lifecycle.initialized_ and model.training:
+            session_callbacks.loss_calculated(loss=loss)
+        return loss
+
+    return compute_and_notify
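The callback docstring describes keeping zero weights at zero by re-applying a mask after each optimizer step. The following is a conceptual sketch of that idea in plain Python, not llmcompressor's implementation: the function names, the SGD update, and the sample values are all assumptions chosen for illustration.

```python
def build_mask(weights: list[float]) -> list[bool]:
    """Record which weights are non-zero in the already-sparsified model."""
    return [w != 0.0 for w in weights]


def sgd_step(weights: list[float], grads: list[float], lr: float) -> list[float]:
    """Plain gradient step; it updates pruned positions too."""
    return [w - lr * g for w, g in zip(weights, grads)]


def apply_mask(weights: list[float], mask: list[bool]) -> list[float]:
    """Reset every pruned position to zero, cancelling its update."""
    return [w if keep else 0.0 for w, keep in zip(weights, mask)]


weights = [0.5, 0.0, -0.25, 0.0]  # two of every four weights are zero
mask = build_mask(weights)        # captured once, before training
grads = [0.1, 0.2, -0.3, 0.4]     # gradients are dense

updated = sgd_step(weights, grads, lr=0.1)
weights = apply_mask(updated, mask)
print(weights)
```

Without the final `apply_mask` call, the second and fourth entries would drift away from zero and the sparsity pattern would be lost after the first step.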
--- a/src/axolotl/integrations/llm_compressor/utils.py
+++ b/src/axolotl/integrations/llm_compressor/utils.py
@@ -0,0 +1,40 @@
+"""Utilities for llmcompressor integration with axolotl."""
+
+from typing import Union
+
+from llmcompressor.transformers.sparsification.compressed_tensors_utils import (
+    modify_save_pretrained,
+)
+from transformers import PreTrainedModel, Trainer
+
+
+def save_compressed_model(
+    model: PreTrainedModel,
+    output_dir: Union[str, bytes],
+    trainer: Trainer,
+    safe_serialization: bool = False,
+    save_compressed: bool = False,
+) -> None:
+    """
+    Synchronize processes, apply compression hooks, and save the model.
+
+    Args:
+        model (PreTrainedModel): The model to be saved.
+        output_dir (str or bytes): Path where the model files will be written.
+        trainer (Trainer): Hugging Face Trainer for process synchronization.
+        safe_serialization (bool): Use safe serialization if True.
+        save_compressed (bool): Write compressed tensors if True.
+    """
+    trainer.accelerator.wait_for_everyone()
+
+    # Only the main process writes the files
+    if not trainer.accelerator.is_main_process:
+        return
+
+    modify_save_pretrained(model)
+    model.save_pretrained(
+        output_dir,
+        safe_serialization=safe_serialization,
+        save_compressed=save_compressed,
+        skip_sparsity_compression_stats=not save_compressed,
+    )
--- a/src/axolotl/train.py
+++ b/src/axolotl/train.py
@@ -295,8 +295,23 @@ def save_trained_model(
            trainer.model.save_pretrained(
                cfg.output_dir, safe_serialization=safe_serialization
            )
+
        model.save_pretrained(cfg.output_dir, safe_serialization=safe_serialization)

+    if hasattr(cfg, "llmcompressor") and cfg.llmcompressor:
+        # TODO: add integration support so this can be implemented completely within the plugin
+        from axolotl.integrations.llm_compressor.utils import (
+            save_compressed_model,
+        )
+
+        save_compressed_model(
+            model=model,
+            output_dir=cfg.output_dir,
+            trainer=trainer,
+            safe_serialization=safe_serialization,
+            save_compressed=cfg.llmcompressor.save_compressed,
+        )
+

 def create_model_card(cfg: DictDefault, trainer: Trainer):
    """
--- a/src/axolotl/utils/models.py
+++ b/src/axolotl/utils/models.py
@@ -139,6 +139,22 @@ def check_model_config(cfg: DictDefault, model_config: PretrainedConfig):
        hasattr(model_config, "quantization_config")
        and model_config.quantization_config
    )
+
+    # Detect compressed-tensors config
+    is_compressed_tensors_config = (
+        quant_config_exists
+        and model_config.quantization_config.get("quant_method") == "compressed-tensors"
+    )
+
+    if is_compressed_tensors_config:
+        if model_config.quantization_config.get("config_groups"):
+            LOG.warning(
+                "Found `config_groups` in a compressed-tensors config. "
+                "QAT integration with llmcompressor is not tested."
+            )
+        # Skip further quant checks for compressed-tensors
+        return
+
    quant_config_method_is_gptq = (
        quant_config_exists
        and "quant_method" in model_config.quantization_config
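The early return added in this hunk can be summarised as a small classification step over the `quantization_config` mapping. The sketch below mirrors that logic on plain dictionaries; `classify_quant_config` and its return labels are hypothetical and exist only for illustration.

```python
from typing import Any, Mapping, Optional


def classify_quant_config(quantization_config: Optional[Mapping[str, Any]]) -> str:
    """Label a quantization config the way the check above branches on it."""
    if not quantization_config:
        return "none"
    if quantization_config.get("quant_method") == "compressed-tensors":
        # config_groups indicates quantization settings; the diff only warns here.
        if quantization_config.get("config_groups"):
            return "compressed-tensors (with config_groups)"
        return "compressed-tensors"
    return "other"


print(classify_quant_config(None))
print(classify_quant_config({"quant_method": "compressed-tensors"}))
print(classify_quant_config({"quant_method": "gptq"}))
```

Only the "other" case would continue on to the GPTQ and related checks that follow in `check_model_config`.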
--- a/tests/e2e/integrations/test_llm_compressor.py
+++ b/tests/e2e/integrations/test_llm_compressor.py
@@ -0,0 +1,106 @@
+"""
+E2E smoke tests for LLMCompressorPlugin integration
+"""
+
+from pathlib import Path
+
+import pytest
+
+from axolotl.cli.args import TrainerCliArgs
+from axolotl.common.datasets import load_datasets
+from axolotl.train import train
+from axolotl.utils.config import normalize_config, prepare_plugins, validate_config
+from axolotl.utils.dict import DictDefault
+
+from tests.e2e.utils import (
+    check_model_output_exists,
+    require_llmcompressor,
+    require_torch_2_4_1,
+)
+
+MODELS = [
+    "nm-testing/llama2.c-stories42M-pruned2.4-compressed",
+    "nm-testing/llama2.c-stories42M-gsm8k-sparse-only-compressed",
+]
+
+
+@pytest.mark.parametrize(
+    "base_model", MODELS, ids=["no-checkpoint-recipe", "with-checkpoint-recipe"]
+)
+@pytest.mark.parametrize(
+    "save_compressed", [True, False], ids=["save_compressed", "save_uncompressed"]
+)
+@require_llmcompressor
+class TestLLMCompressorIntegration:
+    """
+    e2e tests for axolotl.integrations.llm_compressor.LLMCompressorPlugin
+    """
+
+    @require_torch_2_4_1
+    def test_llmcompressor_plugin(
+        self, temp_dir, base_model: str, save_compressed: bool
+    ):
+        # core cfg
+        cfg = DictDefault(
+            {
+                "base_model": base_model,
+                "plugins": ["axolotl.integrations.llm_compressor.LLMCompressorPlugin"],
+                "sequence_len": 1024,
+                "val_set_size": 0.05,
+                "special_tokens": {"pad_token": "<|endoftext|>"},
+                "datasets": [{"path": "mhenrichsen/alpaca_2k_test", "type": "alpaca"}],
+                "num_epochs": 1,
+                "micro_batch_size": 2,
+                "gradient_accumulation_steps": 2,
+                "output_dir": temp_dir,
+                "learning_rate": 1e-5,
+                "optimizer": "adamw_torch_fused",
+                "lr_scheduler": "cosine",
+                "save_safetensors": True,
+                "bf16": "auto",
+                "max_steps": 5,
+                "llmcompressor": {
+                    "recipe": {
+                        "finetuning_stage": {
+                            "finetuning_modifiers": {
+                                "ConstantPruningModifier": {
+                                    "targets": [
+                                        "re:.*q_proj.weight",
+                                        "re:.*k_proj.weight",
+                                        "re:.*v_proj.weight",
+                                        "re:.*o_proj.weight",
+                                        "re:.*gate_proj.weight",
+                                        "re:.*up_proj.weight",
+                                        "re:.*down_proj.weight",
+                                    ],
+                                    "start": 0,
+                                },
+                            },
+                        },
+                    },
+                    "save_compressed": save_compressed,
+                },
+            }
+        )
+
+        prepare_plugins(cfg)
+        cfg = validate_config(cfg)
+        normalize_config(cfg)
+        cli_args = TrainerCliArgs()
+        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
+
+        train(cfg=cfg, dataset_meta=dataset_meta)
+        check_model_output_exists(temp_dir, cfg)
+        _check_llmcompressor_model_outputs(temp_dir, save_compressed)
+
+
+def _check_llmcompressor_model_outputs(temp_dir, save_compressed):
+    if save_compressed:
+        assert (Path(temp_dir) / "recipe.yaml").exists()
+
+        from compressed_tensors import ModelCompressor
+        from compressed_tensors.config import Sparse24BitMaskConfig
+
+        compressor = ModelCompressor.from_pretrained(temp_dir)
+        assert compressor is not None
+        assert isinstance(compressor.sparsity_config, Sparse24BitMaskConfig)
--- a/tests/e2e/multigpu/solo/test_grpo.py
+++ b/tests/e2e/multigpu/solo/test_grpo.py
@@ -4,11 +4,14 @@ GRPO test suite

 import os
 import random
+import shutil
 import subprocess  # nosec B404
 import sys
+import tempfile
 import time
 from pathlib import Path

+import psutil
 import pytest
 import requests
 import yaml
@@ -21,8 +24,8 @@ from tests.e2e.utils import require_vllm


 def start_vllm(
-    model: str, env: dict | None = None, wait: int | None = None, quiet=False, **kwargs
-) -> int:
+    model: str, env: dict, wait: int | None = None, quiet=False, **kwargs
+) -> subprocess.Popen:
    """
    helper function to start the VLLM server in the background, mostly for testing purposes
    """
@@ -46,10 +49,41 @@ def start_vllm(
    # print out the command to be executed
    print(" ".join(cmd))

+    vllm_logging_json = Path(tempfile.mkdtemp()) / "vllm_logging.json"
+    with open(vllm_logging_json, "w", encoding="utf-8") as temp_file:
+        temp_file.write(
+            """{
+  "formatters": {
+    "json": {
+      "class": "pythonjsonlogger.jsonlogger.JsonFormatter"
+    }
+  },
+  "handlers": {
+    "file": {
+      "class": "logging.FileHandler",
+      "formatter": "json",
+      "level": "DEBUG",
+      "filename": "/tmp/vllm.log",
+      "mode": "a"
+    }
+  },
+  "loggers": {
+    "vllm": {
+      "handlers": ["file"],
+      "level": "DEBUG",
+      "propagate": false
+    }
+  },
+  "version": 1
+}"""
+        )
+
+    cmd_env = env.copy()
+    cmd_env.update({"VLLM_LOGGING_CONFIG_PATH": vllm_logging_json})
    # start `trl vllm-serve` command in the background and capture the process id
    process = subprocess.Popen(  # pylint: disable=consider-using-with
        cmd,
-        env=env,
+        env=cmd_env,
        stdout=subprocess.DEVNULL if quiet else subprocess.PIPE,
        stderr=subprocess.DEVNULL if quiet else subprocess.PIPE,
    )  # nosec B603
@@ -58,32 +92,51 @@ def start_vllm(
    print(f"VLLM server process started (PID: {process.pid})")

    # wait until the http server is ready, even if it 404s, but timeout after 60 seconds
+    period_seconds = 5
    started = False
    if wait and host and port:
-        for _ in range(int(wait)):
+        for i in range(0, int(wait), period_seconds):
            try:
                response = requests.get(f"http://{host}:{port}", timeout=1)
+                print(f"{i}: VLLM server (status: {response.status_code})")
                if int(response.status_code) in [200, 404]:
                    started = True
                    break
-            except requests.exceptions.RequestException:
-                pass
+            except requests.exceptions.RequestException as exc:
+                print(f"{i}: VLLM server failed to start: {str(exc)}")

            # also check if the process.pid is still running
            if not process.poll() is None:
                break

-            time.sleep(1)
+            time.sleep(period_seconds)

    if wait and not started:
        print(
            f"VLLM server process did not start within {wait} seconds. Please check your server logs."
        )
-        process.kill()
+        recursive_kill(process)
+        with open("/tmp/vllm.log", "r", encoding="utf-8") as log_file:
+            print(log_file.read())
+        Path("/tmp/vllm.log").unlink(missing_ok=True)
        raise RuntimeError(f"VLLM server process did not start within {wait} seconds.")

-    # return the process id
-    return process.pid
+    # return the process
+    return process
+
+
+def recursive_kill(process: subprocess.Popen):
+    """
+    Recursively kill a process and its children
+    """
+    process = psutil.Process(process.pid)
+    for child in psutil.Process(process.pid).children(recursive=True):
+        child.terminate()
+        child.kill()
+        os.kill(child.pid, 9)
+    process.terminate()
+    process.kill()
+    os.kill(process.pid, 9)


 class TestGRPO:
@@ -174,16 +227,17 @@ def oai_gsm8k_transform(cfg, *args, **kwargs):

        current_env = os.environ.copy()
        env = {
-            "NCCL_P2P_LEVEL": "LOC",
+            "NCCL_P2P_LEVEL": "NVL",
            **current_env,
            "CUDA_VISIBLE_DEVICES": "1",
-            "VLLM_USE_V1": "0",
+            "VLLM_DISABLE_COMPILE_CACHE": "1",
+            # "VLLM_USE_V1": "0",
        }
-        vllm_process_id = start_vllm(
+        vllm_process = start_vllm(
            cfg.base_model,
            env=env,
            quiet=True,
-            wait=120,
+            wait=300,
            gpu_memory_utilization=0.15,
            max_model_len=cfg.vllm.max_model_len,
            enable_prefix_caching=cfg.vllm.enable_prefix_caching,
@@ -202,10 +256,14 @@ def oai_gsm8k_transform(cfg, *args, **kwargs):
                    "--main-process-port",
                    f"{get_torch_dist_unique_port()}",
                ],
-                env={"NCCL_P2P_LEVEL": "LOC", "NCCL_DEBUG": "INFO", **current_env},
+                env={
+                    "NCCL_P2P_LEVEL": "NVL",
+                    "NCCL_DEBUG": "INFO",
+                    **current_env,
+                },
            )
        finally:
-            os.kill(vllm_process_id, 9)
+            recursive_kill(vllm_process)

    @pytest.mark.parametrize(
        "num_gpus",
@@ -262,16 +320,17 @@ def oai_gsm8k_transform(cfg, *args, **kwargs):

        current_env = os.environ.copy()
        env = {
-            "NCCL_P2P_LEVEL": "LOC",  # nccl can be brittle, assume P2P isn't reliable
+            "NCCL_P2P_LEVEL": "NVL",  # nccl can be brittle, assume P2P isn't reliable
            **current_env,
            "CUDA_VISIBLE_DEVICES": "1",
-            "VLLM_USE_V1": "0",
+            "VLLM_DISABLE_COMPILE_CACHE": "1",
+            # "VLLM_USE_V1": "0",
        }
-        vllm_process_id = start_vllm(
+        vllm_process = start_vllm(
            cfg.base_model,
            env=env,
            quiet=True,
-            wait=120,
+            wait=300,
            gpu_memory_utilization=0.15,
            max_model_len=cfg.vllm.max_model_len,
            enable_prefix_caching=cfg.vllm.enable_prefix_caching,
@@ -290,7 +349,11 @@ def oai_gsm8k_transform(cfg, *args, **kwargs):
                    "--main-process-port",
                    f"{get_torch_dist_unique_port()}",
                ],
-                env={"NCCL_P2P_LEVEL": "LOC", "NCCL_DEBUG": "INFO", **current_env},
+                env={
+                    "NCCL_P2P_LEVEL": "NVL",
+                    "NCCL_DEBUG": "INFO",
+                    **current_env,
+                },
            )
        finally:
-            os.kill(vllm_process_id, 9)
+            recursive_kill(vllm_process)
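The revised startup loop in `start_vllm` polls every `period_seconds` rather than every second. A generic version of that pattern is sketched below; `wait_until`, its parameters, and the injected `sleep` are hypothetical names, and the probe here is a stub rather than a real HTTP request.

```python
import time
from typing import Callable


def wait_until(
    probe: Callable[[], bool],
    timeout: int,
    period: int,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe every `period` seconds until it returns True or `timeout` elapses."""
    for _ in range(0, int(timeout), int(period)):
        if probe():
            return True
        sleep(period)
    return False


# Stub probe: the "server" becomes ready on the third attempt.
answers = iter([False, False, True])
slept: list[float] = []

ready = wait_until(lambda: next(answers), timeout=300, period=5, sleep=slept.append)
print(ready, slept)  # True [5, 5]
```

Injecting `sleep` keeps the helper testable without real delays; with `timeout=300` and `period=5` the probe is attempted at most 60 times.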
--- a/tests/e2e/utils.py
+++ b/tests/e2e/utils.py
@@ -109,6 +109,24 @@ def require_vllm(test_case):
    )(test_case)


+def require_llmcompressor(test_case):
+    """
+    Decorator marking a test that requires llmcompressor to be installed
+    """
+
+    def is_llmcompressor_installed():
+        try:
+            import llmcompressor  # pylint: disable=unused-import  # noqa: F401
+
+            return True
+        except ImportError:
+            return False
+
+    return unittest.skipUnless(
+        is_llmcompressor_installed(), "test requires llmcompressor to be installed"
+    )(test_case)
+
+
 def is_hopper():
    compute_capability = torch.cuda.get_device_capability()
    return compute_capability == (9, 0)
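`require_llmcompressor` detects the optional dependency by attempting an import. An alternative, sketched here under the assumption that only presence (not importability) matters, is to ask the import system for a module spec; `is_installed` is a hypothetical helper, not part of the test utilities.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


print(is_installed("json"))
print(is_installed("surely_not_a_real_module_name_12345"))
```

The trade-off is that `find_spec` does not execute the package, so an installation that is present but fails on import would still be reported as installed; the try/except import used in the diff catches that case.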
Author	SHA1	Message	Date
Wing Lian	c9880977be	split llmcompressor from vllm checks	2025-04-29 08:35:06 -04:00
Wing Lian	f196941315	additional fixes for docker and saving compressed	2025-04-28 13:16:29 -04:00
Rahul Tuli	5be047ac46	Fix: Test Signed-off-by: Rahul Tuli <rtuli@redhat.com>	2025-04-28 13:16:29 -04:00
Rahul Tuli	758115b8c6	Apply patch from @winglian Signed-off-by: Rahul Tuli <rtuli@redhat.com>	2025-04-28 13:16:29 -04:00
Rahul Tuli	0dc1da5876	Add: line about further optimizations using llmcompressor Signed-off-by: Rahul Tuli <rtuli@redhat.com>	2025-04-28 13:16:29 -04:00
Rahul Tuli	f3e876dbfc	Address Review Comments: * deleted redundant docs/llm_compressor.qmd * incorporated feedback in integration README.md * added llmcompressor integration to docs/custom_integrations.qmd Signed-off-by: Rahul Tuli <rtuli@redhat.com>	2025-04-28 13:16:29 -04:00
Rahul Tuli	99c13ef60c	Add: .qmd file	2025-04-28 13:16:29 -04:00
Rahul Tuli	2c24434ee0	Tests, Style, Updates	2025-04-28 13:16:29 -04:00
Rahul Tuli	586268a0d7	Rebase and updates!	2025-04-28 13:16:29 -04:00
Rahul Tuli	b600e119b6	Add: `llm_compressor` integration documentation	2025-04-28 13:16:29 -04:00
Rahul Tuli	a8e5ba000e	Move: LLMCompressorPlugin into it's own submodule	2025-04-28 13:16:29 -04:00
Rahul Tuli	bc3dfa666d	Update model config	2025-04-28 13:16:29 -04:00
Rahul Tuli	4371f3459e	Use: absolute import	2025-04-28 13:16:29 -04:00
Rahul Tuli	cc58d5e072	Rename: sft.yaml to sparse-finetuning.yaml	2025-04-28 13:16:29 -04:00
Rahul Tuli	d197b054e3	Add: llcompressor installable	2025-04-28 13:16:29 -04:00
Rahul Tuli	7e1e153831	Address review comments from @markurtz	2025-04-28 13:16:29 -04:00
Rahul Tuli	42de3096cf	Apply suggestions from @markurtz Co-authored-by: Mark Kurtz <mark.j.kurtz@gmail.com>	2025-04-28 13:16:29 -04:00
Rahul Tuli	27758840a1	Update llmcompressor version to latest	2025-04-28 13:16:29 -04:00
Rahul Tuli	8dbf5c215a	Revert: TODO's	2025-04-28 13:16:29 -04:00
Rahul Tuli	6411ca3fe1	Use: warning over warn	2025-04-28 13:16:29 -04:00
Rahul Tuli	813809c54d	pre commit hooks	2025-04-28 13:16:29 -04:00
Rahul Tuli	af7cfdc30b	Add:llmcompressor instalable	2025-04-28 13:16:29 -04:00
Rahul Tuli	b76d2d1130	Update: review comments!	2025-04-28 13:16:29 -04:00
Rahul Tuli	7946f89df4	Add: SFTPlugin with llmcompressor	2025-04-28 13:16:29 -04:00
Dhruv Mullick	8b33ae1c4f	Fix bug in grpo reward module import (#2571 )	2025-04-28 00:31:56 -04:00
Wing Lian	dc4da4a7e2	update trl to 0.17.0 (#2560 ) * update trl to 0.17.0 * grpo + vllm no longer supported with 2.5.1 due to vllm constraints * disable VLLM_USE_V1 for ci * imporve handle killing off of multiprocessing vllm service * debug why this doesn't run in CI * increase vllm wait time * increase timeout to 5min * upgrade to vllm 0.8.4 * dump out the vllm log for debugging * use debug logging * increase vllm start timeout * use NVL instead * disable torch compile cache * revert some commented checks now that grpo tests are fixed * increase vllm timeoout back to 5min	2025-04-27 19:19:53 -04:00