Address Review Comments:

* deleted redundant docs/llm_compressor.qmd
* incorporated feedback in integration README.md
* added llmcompressor integration to docs/custom_integrations.qmd

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Rahul Tuli
2025-04-23 18:00:00 -04:00
parent e766a730ba
commit 20d48cd617
3 changed files with 40 additions and 101 deletions

docs/custom_integrations.qmd

@@ -49,7 +49,8 @@ sections = [
("Knowledge Distillation (KD)", "kd"),
("Liger Kernels", "liger"),
("Language Model Evaluation Harness (LM Eval)", "lm_eval"),
("Spectrum", "spectrum")
("Spectrum", "spectrum"),
("LLMCompressor", "llm_compressor")
]
for section_name, folder_name in sections:

docs/llm_compressor.qmd (deleted)

@@ -1,98 +0,0 @@
---
title: "LLMCompressor Sparse Fine-tuning"
format:
  html:
    toc: true
    toc-depth: 3
    number-sections: true
execute:
  enabled: false
---
# LLMCompressor Integration
Fine-tune sparsified models in Axolotl using [LLMCompressor](https://github.com/vllm-project/llm-compressor).
This integration enables fine-tuning of models **already sparsified** using LLMCompressor.
It hooks into Axolotl's training pipeline using the plugin system and maintains sparsity throughout the fine-tuning process.
---
## Requirements
- Install Axolotl with `llmcompressor` extras:
```bash
pip install "axolotl[llmcompressor]"
```
- Requires `llmcompressor >= 0.5.1`
This will install all required dependencies for sparse model fine-tuning.
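To double-check the environment, you can query the installed version (a quick sanity check, not part of the install itself):
```python
# Verify that the installed llmcompressor meets the >= 0.5.1 requirement.
from importlib.metadata import version

print(version("llmcompressor"))  # expect 0.5.1 or newer
```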
---
## Usage
To enable sparse fine-tuning with this integration, configure your Axolotl YAML like so:
```yaml
plugins:
  - axolotl.integrations.llm_compressor.LLMCompressorPlugin

llmcompressor:
  recipe:
    finetuning_stage:
      finetuning_modifiers:
        ConstantPruningModifier:
          targets: [
            're:.*q_proj.weight',
            're:.*k_proj.weight',
            're:.*v_proj.weight',
            're:.*o_proj.weight',
            're:.*gate_proj.weight',
            're:.*up_proj.weight',
            're:.*down_proj.weight',
          ]
          start: 0

# ... (other Axolotl training arguments)
```
::: {.callout-note}
This plugin **does not prune or sparsify the model**. It is only meant for **fine-tuning models that are already sparsified**.
:::
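The `re:` prefix marks each target as a regular expression matched against parameter names. As a rough illustration of what such a pattern selects (the exact matching logic lives inside LLMCompressor; the model id below is only an example), you can preview the matches yourself:
```python
# Sketch: preview which parameter names a 're:...' target would select.
# This approximates LLMCompressor's matching; the model id is illustrative.
import re

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")
pattern = re.compile(r".*q_proj.weight")

for name, _ in model.named_parameters():
    if pattern.match(name):
        print(name)  # e.g. model.layers.0.self_attn.q_proj.weight
```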
---
## Pre-Sparsified Checkpoints
You can use:
- Your own LLMCompressor-sparsified model
- Or one from [Neural Magic's Hugging Face page](https://huggingface.co/neuralmagic)
Refer to the [LLMCompressor README](https://github.com/vllm-project/llm-compressor/blob/main/README.md) to learn how to sparsify models or write custom recipes.
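A pre-sparsified checkpoint loads like any other Hugging Face model. A minimal sketch (the model id is illustrative; substitute any LLMCompressor-sparsified checkpoint, and note that compressed checkpoints need the `llmcompressor` extra installed to decompress on load):
```python
# Load a pre-sparsified checkpoint before fine-tuning (illustrative model id).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "neuralmagic/Sparse-Llama-3.1-8B-2of4"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
```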
---
## Example Config
A full working example is provided at:
```bash
examples/llama-3/sparse-finetuning.yaml
```
Run fine-tuning using:
```bash
axolotl train examples/llama-3/sparse-finetuning.yaml
```
---
## Learn More
Explore LLMCompressor capabilities, supported modifiers, and detailed examples:
👉 [LLMCompressor GitHub](https://github.com/vllm-project/llm-compressor)

README.md (llm_compressor integration)

@@ -45,6 +45,7 @@ llmcompressor:
            're:.*down_proj.weight',
          ]
          start: 0
  save_compressed: true
# ... (other training arguments)
```
@@ -52,19 +53,54 @@ This plugin **does not apply pruning or sparsification itself** — it is intend
Pre-sparsified checkpoints can be:
- Generated using [LLMCompressor](https://github.com/vllm-project/llm-compressor)
- Downloaded from [Neural Magic's Hugging Face page](https://huggingface.co/neuralmagic)
- Any custom LLM with compatible sparsity patterns that you've created yourself
To learn more about writing and customizing LLMCompressor recipes, refer to the official documentation:
[https://github.com/vllm-project/llm-compressor/blob/main/README.md](https://github.com/vllm-project/llm-compressor/blob/main/README.md)
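As a rough sketch of how such a checkpoint might be produced with LLMCompressor's one-shot flow (import paths and arguments follow the upstream examples and may differ between versions; the model id, dataset, and hyperparameters are illustrative):
```python
# Sketch: one-shot 2:4 sparsification with LLMCompressor (illustrative settings).
from llmcompressor.modifiers.obcq import SparseGPTModifier
from llmcompressor.transformers import oneshot

recipe = SparseGPTModifier(sparsity=0.5, mask_structure="2:4")

oneshot(
    model="meta-llama/Llama-3.2-1B",   # illustrative base model
    dataset="open_platypus",           # illustrative calibration dataset
    recipe=recipe,
    output_dir="Llama-3.2-1B-2of4-sparse",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```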
### Storage Optimization with `save_compressed`
Setting `save_compressed: true` in your configuration enables saving models in a compressed format, which:
- Reduces disk space usage by approximately 40%
- Maintains compatibility with vLLM for accelerated inference
- Maintains compatibility with `llmcompressor` for further optimization (e.g., quantization)
This option is highly recommended when working with sparse models to maximize the benefits of model compression.
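To verify the savings on your own checkpoints, one generic way (a sketch with illustrative paths) is to compare the on-disk footprint of a dense save against a compressed one:
```python
# Compare the on-disk size of two saved checkpoints (paths are illustrative).
from pathlib import Path

def dir_size_gb(path: str) -> float:
    """Total size of all files under `path`, in gigabytes."""
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file()) / 1e9

print(f"dense save:      {dir_size_gb('outputs/dense'):.2f} GB")
print(f"compressed save: {dir_size_gb('outputs/compressed'):.2f} GB")
```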
### Example Config
See [`examples/llama-3/sparse-finetuning.yaml`](examples/llama-3/sparse-finetuning.yaml) for a complete example.
---
## Inference with vLLM
After fine-tuning your sparse model, you can leverage vLLM for efficient inference:
```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Point vLLM at the fine-tuned sparse checkpoint.
llm = LLM("path/to/your/sparse/model")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
For more details on vLLM's capabilities and advanced configuration options, see the [official vLLM documentation](https://docs.vllm.ai/).
## Learn More
For details on available sparsity and quantization schemes, fine-tuning recipes, and usage examples, visit the official LLMCompressor repository:
[https://github.com/vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)