Address Review Comments:

* deleted redundant docs/llm_compressor.qmd
* incorporated feedback in integration README.md
* added llmcompressor integration to docs/custom_integrations.qmd

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
2025-04-23 18:00:00 -04:00
parent 99c13ef60c
commit f3e876dbfc
3 changed files with 40 additions and 101 deletions
--- a/docs/custom_integrations.qmd
+++ b/docs/custom_integrations.qmd
@@ -49,7 +49,8 @@ sections = [
    ("Knowledge Distillation (KD)", "kd"),
    ("Liger Kernels", "liger"),
    ("Language Model Evaluation Harness (LM Eval)", "lm_eval"),
-    ("Spectrum", "spectrum")
+    ("Spectrum", "spectrum"),
+    ("LLMCompressor", "llm_compressor")
 ]
 for section_name, folder_name in sections:
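The trailing comma added to the Spectrum entry is what lets the new LLMCompressor entry work. Inside a list literal Python joins lines, so `("Spectrum", "spectrum")` directly followed by `("LLMCompressor", "llm_compressor")` would parse as a call on the first tuple. A minimal sketch that evaluates both forms of the list; the helper `try_eval` and the two string constants are hypothetical, written only for this illustration:

```python
import warnings

# Two versions of the tail of the `sections` list touched by this hunk.
WITHOUT_COMMA = """[
    ("Spectrum", "spectrum")
    ("LLMCompressor", "llm_compressor")
]"""

WITH_COMMA = """[
    ("Spectrum", "spectrum"),
    ("LLMCompressor", "llm_compressor")
]"""


def try_eval(src):
    """Hypothetical helper: evaluate a list literal, returning (value, error)."""
    with warnings.catch_warnings():
        # CPython 3.8+ also emits a compile-time SyntaxWarning
        # ("perhaps you missed a comma?"); silence it so only the
        # runtime failure is reported.
        warnings.simplefilter("ignore", SyntaxWarning)
        try:
            return eval(src), None
        except TypeError as exc:
            return None, str(exc)


good, good_err = try_eval(WITH_COMMA)
bad, bad_err = try_eval(WITHOUT_COMMA)

# With the comma: (section_name, folder_name) pairs that the
# `for section_name, folder_name in sections:` loop can unpack.
for section_name, folder_name in good:
    print(section_name, "->", folder_name)

# Without the comma: the first tuple is "called" with the second.
print(bad_err)  # 'tuple' object is not callable
```

Without the comma the docs build would fail at runtime with a `TypeError`, not a parse error, which is why the missing comma is easy to overlook.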
--- a/docs/llm_compressor.qmd
+++ b/docs/llm_compressor.qmd
@@ -1,98 +0,0 @@
 ---
 title: "LLMCompressor Sparse Fine-tuning"
 format:
  html:
    toc: true
    toc-depth: 3
    number-sections: true
 execute:
  enabled: false
 ---
 # LLMCompressor Integration
 Fine-tune sparsified models in Axolotl using [LLMCompressor](https://github.com/vllm-project/llm-compressor).
 This integration enables fine-tuning of models **already sparsified** using LLMCompressor. 
 It hooks into Axolotl’s training pipeline using the plugin system and maintains sparsity throughout the fine-tuning process.
 ---
 ## Requirements
 - Install Axolotl with `llmcompressor` extras:
 ```bash
 pip install "axolotl[llmcompressor]"
 ```
 - Requires `llmcompressor >= 0.5.1`
 This will install all required dependencies for sparse model fine-tuning.
 ---
 ## Usage
 To enable sparse fine-tuning with this integration, configure your Axolotl YAML like so:
 ```yaml
 plugins:
  - axolotl.integrations.llm_compressor.LLMCompressorPlugin
 llmcompressor:
  recipe:
    finetuning_stage:
      finetuning_modifiers:
        ConstantPruningModifier:
          targets: [
            're:.*q_proj.weight',
            're:.*k_proj.weight',
            're:.*v_proj.weight',
            're:.*o_proj.weight',
            're:.*gate_proj.weight',
            're:.*up_proj.weight',
            're:.*down_proj.weight',
          ]
          start: 0
 # ... (other Axolotl training arguments)
 ```
 ::: {.callout-note}
 This plugin **does not prune or sparsify the model**. It is only meant for **fine-tuning models that are already sparsified**.
 :::
 ---
 ## Pre-Sparsified Checkpoints
 You can use:
 - Your own LLMCompressor-sparsified model
 - Or one from [Neural Magic's Hugging Face page](https://huggingface.co/neuralmagic)
 Refer to the [LLMCompressor README](https://github.com/vllm-project/llm-compressor/blob/main/README.md) to learn how to sparsify models or write custom recipes.
 ---
 ## Example Config
 A full working example is provided at:
 ```bash
 examples/llama-3/sparse-finetuning.yaml
 ```
 Run fine-tuning using:
 ```bash
 axolotl train examples/llama-3/sparse-finetuning.yaml
 ```
 ---
 ## Learn More
 Explore LLMCompressor capabilities, supported modifiers, and detailed examples:
 👉 [LLMCompressor GitHub](https://github.com/vllm-project/llm-compressor)
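The recipe's `targets` entries use a `re:` prefix to mark regular expressions. The following minimal sketch shows which Llama-style parameter names those seven patterns would select. It assumes each pattern is applied with Python's `re.match` against the full parameter name. The helper `matches_any` and the sample names are hypothetical, and this is not llmcompressor's own matching code:

```python
import re

# The `targets` list from the recipe above.
TARGETS = [
    "re:.*q_proj.weight",
    "re:.*k_proj.weight",
    "re:.*v_proj.weight",
    "re:.*o_proj.weight",
    "re:.*gate_proj.weight",
    "re:.*up_proj.weight",
    "re:.*down_proj.weight",
]


def matches_any(name, targets):
    """Hypothetical helper: True if `name` is selected by any target.

    Assumes "re:" marks a regex applied with re.match (anchored at the
    start of the name); any other entry is treated as an exact name.
    """
    for target in targets:
        if target.startswith("re:"):
            if re.match(target[len("re:"):], name):
                return True
        elif target == name:
            return True
    return False


# Illustrative parameter names in the Hugging Face Llama naming style.
names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.q_proj.bias",
    "model.layers.0.mlp.down_proj.weight",
    "model.layers.0.input_layernorm.weight",
    "lm_head.weight",
]

selected = [n for n in names if matches_any(n, TARGETS)]
for n in selected:
    print(n)
```

Under this assumption only the attention and MLP projection weights are selected. Embeddings, norms, and `lm_head` are not. The unescaped `.` before `weight` matches any character, which is harmless for these names but could be tightened to `\.`.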
--- a/src/axolotl/integrations/llm_compressor/README.md
+++ b/src/axolotl/integrations/llm_compressor/README.md
@@ -45,6 +45,7 @@ llmcompressor:
            're:.*down_proj.weight',
          ]
          start: 0
+  save_compressed: true
 # ... (other training arguments)
 ```
@@ -52,19 +53,54 @@ This plugin **does not apply pruning or sparsification itself** — it is intend
 Pre-sparsified checkpoints can be:
 - Generated using [LLMCompressor](https://github.com/vllm-project/llm-compressor)
- Or downloaded from [Neural Magic's Hugging Face page](https://huggingface.co/neuralmagic)
+- Downloaded from [Neural Magic's Hugging Face page](https://huggingface.co/neuralmagic)
 - Any custom LLM with compatible sparsity patterns that you've created yourself
 To learn more about writing and customizing LLMCompressor recipes, refer to the official documentation:
 [https://github.com/vllm-project/llm-compressor/blob/main/README.md](https://github.com/vllm-project/llm-compressor/blob/main/README.md)
 ### Storage Optimization with save_compressed
 Setting `save_compressed: true` in your configuration enables saving models in a compressed format, which:
 - Reduces disk space usage by approximately 40%
 - Maintains compatibility with vLLM for accelerated inference
 - Maintains compatibility with llmcompressor for further optimization (example: quantization)
 This option is highly recommended when working with sparse models to maximize the benefits of model compression.
 ### Example Config
 See [`examples/llama-3/sparse-finetuning.yaml`](examples/llama-3/sparse-finetuning.yaml) for a complete example.
 ---
 ## Inference with vLLM
 After fine-tuning your sparse model, you can leverage vLLM for efficient inference:
 ```python
 from vllm import LLM, SamplingParams
 prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
 ]
 sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
 llm = LLM("path/to/your/sparse/model")
 outputs = llm.generate(prompts, sampling_params)
 for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
 ```
 For more details on vLLM's capabilities and advanced configuration options, see the [official vLLM documentation](https://docs.vllm.ai/).
 ## Learn More
 For details on available sparsity and quantization schemes, fine-tuning recipes, and usage examples, visit the official LLMCompressor repository:
-👉 [https://github.com/vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)
+[https://github.com/vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)
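A simple per-tensor estimate is consistent with the README's "approximately 40%" figure for `save_compressed`. The estimate assumes 50% sparsity (for example, a 2:4 pattern) on a 16-bit weight. It also assumes the tensor is stored as its nonzero values plus one mask bit per element. Both are assumptions for illustration, not a description of llmcompressor's actual on-disk format. Under them, a sparsified tensor needs 9 bits per element instead of 16. The function name `bitmask_storage_ratio` is hypothetical:

```python
def bitmask_storage_ratio(sparsity, value_bits=16, mask_bits=1):
    """Hypothetical estimate of compressed size / dense size for one tensor.

    Assumes only nonzero values are kept (value_bits each) plus a
    mask_bits-per-element bitmask recording their positions.
    """
    compressed_bits = (1 - sparsity) * value_bits + mask_bits
    return compressed_bits / value_bits


ratio = bitmask_storage_ratio(0.5)  # assume half of the weights are zero
saved = f"{1 - ratio:.2%}"
print(f"compressed/dense = {ratio}, saved = {saved}")  # 0.5625, 43.75%
```

Layers that stay dense (embeddings, norms, `lm_head`) pull the whole-checkpoint saving below this per-tensor number. Treat it as an upper bound under the stated assumptions.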