move import of llmcompressor to reset session inside test

make sure to reset the session after each test
move decorator to test method instead of class
6 changed files with 104 additions and 26 deletions
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -261,6 +261,18 @@ jobs:
      fail-fast: false
      matrix:
        include:
+          - cuda: 124
+            cuda_version: 12.4.1
+            python_version: "3.11"
+            pytorch: 2.6.0
+            num_gpus: 1
+            axolotl_extras: llmcompressor
+          - cuda: 124
+            cuda_version: 12.4.1
+            python_version: "3.11"
+            pytorch: 2.4.1
+            num_gpus: 1
+            axolotl_extras:
          - cuda: 124
            cuda_version: 12.4.1
            python_version: "3.11"
--- a/docs/custom_integrations.qmd
+++ b/docs/custom_integrations.qmd
@@ -49,7 +49,8 @@ sections = [
    ("Knowledge Distillation (KD)", "kd"),
    ("Liger Kernels", "liger"),
    ("Language Model Evaluation Harness (LM Eval)", "lm_eval"),
-    ("Spectrum", "spectrum")
+    ("Spectrum", "spectrum"),
+    ("LLMCompressor", "llm_compressor")
 ]

 for section_name, folder_name in sections:
--- a/src/axolotl/integrations/llm_compressor/README.md
+++ b/src/axolotl/integrations/llm_compressor/README.md
@@ -45,6 +45,7 @@ llmcompressor:
            're:.*down_proj.weight',
          ]
          start: 0
+  save_compressed: true
 # ... (other training arguments)
 ```

@@ -52,19 +53,56 @@ This plugin **does not apply pruning or sparsification itself** — it is intend

 Pre-sparsified checkpoints can be:
 - Generated using [LLMCompressor](https://github.com/vllm-project/llm-compressor)
-- Or downloaded from [Neural Magic's Hugging Face page](https://huggingface.co/neuralmagic)
+- Downloaded from [Neural Magic's Hugging Face page](https://huggingface.co/neuralmagic)
+- Any custom LLM with compatible sparsity patterns that you've created yourself

 To learn more about writing and customizing LLMCompressor recipes, refer to the official documentation:
 [https://github.com/vllm-project/llm-compressor/blob/main/README.md](https://github.com/vllm-project/llm-compressor/blob/main/README.md)

+### Storage Optimization with save_compressed
+
+Setting `save_compressed: true` in your configuration enables saving models in a compressed format, which:
+- Reduces disk space usage by approximately 40%
+- Maintains compatibility with vLLM for accelerated inference
+- Maintains compatibility with llmcompressor for further optimization (example: quantization)
+
+This option is highly recommended when working with sparse models to maximize the benefits of model compression.
+
 ### Example Config

 See [`examples/llama-3/sparse-finetuning.yaml`](examples/llama-3/sparse-finetuning.yaml) for a complete example.

 ---

+## Inference with vLLM
+
+After fine-tuning your sparse model, you can leverage vLLM for efficient inference.
+You can also use LLMCompressor to apply additional quantization to your fine-tuned
+sparse model before inference for even greater performance benefits:
+
+```python
+from vllm import LLM, SamplingParams
+
+prompts = [
+    "Hello, my name is",
+    "The president of the United States is",
+    "The capital of France is",
+    "The future of AI is",
+]
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+llm = LLM("path/to/your/sparse/model")
+outputs = llm.generate(prompts, sampling_params)
+
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
+
+For more details on vLLM's capabilities and advanced configuration options, see the [official vLLM documentation](https://docs.vllm.ai/).
+
 ## Learn More

 For details on available sparsity and quantization schemes, fine-tuning recipes, and usage examples, visit the official LLMCompressor repository:

-👉 [https://github.com/vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)
+[https://github.com/vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)
--- a/src/axolotl/train.py
+++ b/src/axolotl/train.py
@@ -288,7 +288,19 @@ def save_trained_model(
                os.remove(os.path.join(cfg.output_dir, "model.safetensors"))
            except FileNotFoundError:
                pass
-    elif hasattr(cfg, "llmcompressor") and cfg.llmcompressor:
+    elif cfg.local_rank == 0:
+        if cfg.flash_optimum and BetterTransformer:
+            model = BetterTransformer.reverse(model)
+
+        if cfg.rl and cfg.adapter and not cfg.rl_adapter_ref_model:
+            trainer.model.save_pretrained(
+                cfg.output_dir, safe_serialization=safe_serialization
+            )
+
+        model.save_pretrained(cfg.output_dir, safe_serialization=safe_serialization)
+
+    if hasattr(cfg, "llmcompressor") and cfg.llmcompressor:
+        # TODO: add integration support so this can be implemented completely within the plugin
        from axolotl.integrations.llm_compressor.utils import (
            save_compressed_model,
        )
@@ -301,17 +313,6 @@ def save_trained_model(
            save_compressed=cfg.llmcompressor.save_compressed,
        )

-    elif cfg.local_rank == 0:
-        if cfg.flash_optimum and BetterTransformer:
-            model = BetterTransformer.reverse(model)
-
-        if cfg.rl and cfg.adapter and not cfg.rl_adapter_ref_model:
-            trainer.model.save_pretrained(
-                cfg.output_dir, safe_serialization=safe_serialization
-            )
-
-        model.save_pretrained(cfg.output_dir, safe_serialization=safe_serialization)
-

 def create_model_card(cfg: DictDefault, trainer: Trainer):
    """
--- a/tests/e2e/integrations/test_llm_compressor.py
+++ b/tests/e2e/integrations/test_llm_compressor.py
@@ -9,10 +9,14 @@ import pytest
 from axolotl.cli.args import TrainerCliArgs
 from axolotl.common.datasets import load_datasets
 from axolotl.train import train
-from axolotl.utils.config import normalize_config, prepare_plugins
+from axolotl.utils.config import normalize_config, prepare_plugins, validate_config
 from axolotl.utils.dict import DictDefault

-from tests.e2e.utils import check_model_output_exists, require_torch_2_4_1
+from tests.e2e.utils import (
+    check_model_output_exists,
+    require_llmcompressor,
+    require_torch_2_4_1,
+)

 MODELS = [
    "nm-testing/llama2.c-stories42M-pruned2.4-compressed",
@@ -31,10 +35,13 @@ class TestLLMCompressorIntegration:
    e2e tests for axolotl.integrations.llm_compressor.LLMCompressorPlugin
    """

+    @require_llmcompressor
    @require_torch_2_4_1
    def test_llmcompressor_plugin(
        self, temp_dir, base_model: str, save_compressed: bool
    ):
+        from llmcompressor import active_session
+
        # core cfg
        cfg = DictDefault(
            {
@@ -79,22 +86,23 @@ class TestLLMCompressorIntegration:
        )

        prepare_plugins(cfg)
+        cfg = validate_config(cfg)
        normalize_config(cfg)
        cli_args = TrainerCliArgs()
        dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)

-        train(cfg=cfg, dataset_meta=dataset_meta)
-        check_model_output_exists(temp_dir, cfg)
-        _check_llmcompressor_model_outputs(temp_dir, save_compressed)
+        try:
+            train(cfg=cfg, dataset_meta=dataset_meta)
+            check_model_output_exists(temp_dir, cfg)
+            _check_llmcompressor_model_outputs(temp_dir, save_compressed)
+        finally:
+            active_session().reset()


 def _check_llmcompressor_model_outputs(temp_dir, save_compressed):
-
-    # recipe.yaml should exist
-    assert (Path(temp_dir) / "recipe.yaml").exists()
-
-    # sparsity config exists if save_compressed
    if save_compressed:
+        assert (Path(temp_dir) / "recipe.yaml").exists()
+
        from compressed_tensors import ModelCompressor
        from compressed_tensors.config import Sparse24BitMaskConfig

--- a/tests/e2e/utils.py
+++ b/tests/e2e/utils.py
@@ -105,7 +105,25 @@ def require_vllm(test_case):
            return False

    return unittest.skipUnless(
-        is_vllm_installed(), "test requires a vllm to be installed"
+        is_vllm_installed(), "test requires vllm to be installed"
+    )(test_case)
+
+
+def require_llmcompressor(test_case):
+    """
+    Decorator marking a test that requires llmcompressor to be installed
+    """
+
+    def is_llmcompressor_installed():
+        try:
+            import llmcompressor  # pylint: disable=unused-import  # noqa: F401
+
+            return True
+        except ImportError:
+            return False
+
+    return unittest.skipUnless(
+        is_llmcompressor_installed(), "test requires llmcompressor to be installed"
    )(test_case)
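
Review note: after this change, `require_vllm` and `require_llmcompressor` in `tests/e2e/utils.py` differ only in the package they probe. A minimal sketch of a shared helper follows. The `require_package` factory is hypothetical and not part of this PR; it checks availability with `importlib.util.find_spec` instead of a real import, so for a top-level package name the check does not run the package's import-time code.

```python
import importlib.util
import unittest


def require_package(package_name):
    """Hypothetical factory: build a decorator that skips a test
    unless `package_name` can be found on the import path."""

    def is_installed():
        try:
            return importlib.util.find_spec(package_name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for missing parents or broken module specs
            return False

    def decorator(test_case):
        return unittest.skipUnless(
            is_installed(), f"test requires {package_name} to be installed"
        )(test_case)

    return decorator


# Same intent as the two decorators defined in tests/e2e/utils.py
require_llmcompressor = require_package("llmcompressor")
require_vllm = require_package("vllm")
```

Usage would be unchanged at the call site (`@require_llmcompressor` above the test method). The explicit `try: import ...` form in the diff has the advantage of also catching packages that are present but fail to import.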
Author	SHA1	Message	Date
Wing Lian	6affbb1f85	move import of llmcompressor to reset session inside test	2025-04-30 18:10:44 -04:00
Wing Lian	0ed4b4c310	make sure to reset the session after each test	2025-04-30 17:21:53 -04:00
Wing Lian	f4a0f496a0	move decorator to test method instead of class	2025-04-30 17:21:53 -04:00
Wing Lian	82b16bd040	split llmcompressor from vllm checks	2025-04-30 17:21:53 -04:00
Wing Lian	fd5c985038	additional fixes for docker and saving compressed	2025-04-30 17:21:53 -04:00
Rahul Tuli	5246aebc04	Fix: Test Signed-off-by: Rahul Tuli <rtuli@redhat.com>	2025-04-30 17:21:53 -04:00
Rahul Tuli	f4bcc71c86	Apply patch from @winglian Signed-off-by: Rahul Tuli <rtuli@redhat.com>	2025-04-30 17:21:53 -04:00
Rahul Tuli	3a9e172272	Add: line about further optimizations using llmcompressor Signed-off-by: Rahul Tuli <rtuli@redhat.com>	2025-04-30 17:21:53 -04:00
Rahul Tuli	372f0e137b	Address Review Comments: * deleted redundant docs/llm_compressor.qmd * incorporated feedback in integration README.md * added llmcompressor integration to docs/custom_integrations.qmd Signed-off-by: Rahul Tuli <rtuli@redhat.com>	2025-04-30 17:21:52 -04:00
Rahul Tuli	17dffec71d	Add: .qmd file	2025-04-30 17:21:52 -04:00
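
Review note: the three most recent commits move the session reset into a `try`/`finally` inside the test body. The same guarantee could be packaged as a reusable context manager, sketched below. Only the call `active_session().reset()` is taken from the diff. `FakeSession`, the local `active_session` stand-in, `clean_session`, and `run_training_step` are hypothetical names, used so the example runs without llmcompressor installed.

```python
from contextlib import contextmanager


class FakeSession:
    """Stand-in for the object returned by llmcompressor's active_session()."""

    def __init__(self):
        self.state = {}

    def reset(self):
        # Drop everything accumulated during a run
        self.state.clear()


_SESSION = FakeSession()


def active_session():
    """Hypothetical stand-in mirroring the call used in the test."""
    return _SESSION


@contextmanager
def clean_session():
    """Yield the active session and reset it on exit, even if the body raises."""
    try:
        yield active_session()
    finally:
        active_session().reset()


def run_training_step(fail=False):
    """Toy workload that mutates session state and optionally fails."""
    with clean_session() as session:
        session.state["recipe"] = "loaded"
        if fail:
            raise RuntimeError("training failed")
```

Whether the body succeeds or raises, the session is empty afterwards, so the parametrized test cases would not leak state into each other.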