diff --git a/docs/custom_integrations.qmd b/docs/custom_integrations.qmd
index cb4aef9ca..023f09732 100644
--- a/docs/custom_integrations.qmd
+++ b/docs/custom_integrations.qmd
@@ -49,7 +49,8 @@ sections = [
     ("Knowledge Distillation (KD)", "kd"),
     ("Liger Kernels", "liger"),
     ("Language Model Evaluation Harness (LM Eval)", "lm_eval"),
-    ("Spectrum", "spectrum")
+    ("Spectrum", "spectrum"),
+    ("LLMCompressor", "llm_compressor")
 ]
 
 for section_name, folder_name in sections:
diff --git a/docs/llm_compressor.qmd b/docs/llm_compressor.qmd
deleted file mode 100644
index 60b685973..000000000
--- a/docs/llm_compressor.qmd
+++ /dev/null
@@ -1,98 +0,0 @@
----
-title: "LLMCompressor Sparse Fine-tuning"
-format:
-  html:
-    toc: true
-    toc-depth: 3
-    number-sections: true
-execute:
-  enabled: false
----
-
-# LLMCompressor Integration
-
-Fine-tune sparsified models in Axolotl using [LLMCompressor](https://github.com/vllm-project/llm-compressor).
-
-This integration enables fine-tuning of models **already sparsified** using LLMCompressor.
-It hooks into Axolotl’s training pipeline using the plugin system and maintains sparsity throughout the fine-tuning process.
-
----
-
-## Requirements
-
-- Install Axolotl with `llmcompressor` extras:
-
-```bash
-pip install "axolotl[llmcompressor]"
-```
-
-- Requires `llmcompressor >= 0.5.1`
-
-This will install all required dependencies for sparse model fine-tuning.
-
----
-
-## Usage
-
-To enable sparse fine-tuning with this integration, configure your Axolotl YAML like so:
-
-```yaml
-plugins:
-  - axolotl.integrations.llm_compressor.LLMCompressorPlugin
-
-llmcompressor:
-  recipe:
-    finetuning_stage:
-      finetuning_modifiers:
-        ConstantPruningModifier:
-          targets: [
-            're:.*q_proj.weight',
-            're:.*k_proj.weight',
-            're:.*v_proj.weight',
-            're:.*o_proj.weight',
-            're:.*gate_proj.weight',
-            're:.*up_proj.weight',
-            're:.*down_proj.weight',
-          ]
-          start: 0
-# ... (other Axolotl training arguments)
-```
-
-::: {.callout-note}
-This plugin **does not prune or sparsify the model**. It is only meant for **fine-tuning models that are already sparsified**.
-:::
-
----
-
-## Pre-Sparsified Checkpoints
-
-You can use:
-
-- Your own LLMCompressor-sparsified model
-- Or one from [Neural Magic's Hugging Face page](https://huggingface.co/neuralmagic)
-
-Refer to the [LLMCompressor README](https://github.com/vllm-project/llm-compressor/blob/main/README.md) to learn how to sparsify models or write custom recipes.
-
----
-
-## Example Config
-
-A full working example is provided at:
-
-```bash
-examples/llama-3/sparse-finetuning.yaml
-```
-
-Run fine-tuning using:
-
-```bash
-axolotl train examples/llama-3/sparse-finetuning.yaml
-```
-
----
-
-## Learn More
-
-Explore LLMCompressor capabilities, supported modifiers, and detailed examples:
-
-👉 [LLMCompressor GitHub](https://github.com/vllm-project/llm-compressor)
\ No newline at end of file
diff --git a/src/axolotl/integrations/llm_compressor/README.md b/src/axolotl/integrations/llm_compressor/README.md
index a86a89c51..a087f37a8 100644
--- a/src/axolotl/integrations/llm_compressor/README.md
+++ b/src/axolotl/integrations/llm_compressor/README.md
@@ -45,6 +45,9 @@ llmcompressor:
             're:.*down_proj.weight',
           ]
          start: 0
+  save_compressed: true
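+  # Saves checkpoints in compressed form (see "Storage Optimization with
+  # save_compressed" below).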
 # ... (other training arguments)
 ```
 
@@ -52,19 +55,71 @@ This plugin **does not apply pruning or sparsification itself** — it is intend
 Pre-sparsified checkpoints can be:
 
 - Generated using [LLMCompressor](https://github.com/vllm-project/llm-compressor)
-- Or downloaded from [Neural Magic's Hugging Face page](https://huggingface.co/neuralmagic)
+- Downloaded from [Neural Magic's Hugging Face page](https://huggingface.co/neuralmagic)
+- Created yourself, provided the sparsity pattern is compatible with LLMCompressor
 
 To learn more about writing and customizing LLMCompressor recipes, refer to the official documentation:
 [https://github.com/vllm-project/llm-compressor/blob/main/README.md](https://github.com/vllm-project/llm-compressor/blob/main/README.md)
 
+### Storage Optimization with `save_compressed`
+
+Setting `save_compressed: true` in your configuration saves the model in a compressed format, which:
+- Reduces disk space usage by approximately 40%
+- Maintains compatibility with vLLM for accelerated inference
+- Maintains compatibility with LLMCompressor for further optimization (e.g., quantization)
+
+This option is highly recommended when working with sparse models to maximize the benefits of model compression.
+
 ### Example Config
 
 See [`examples/llama-3/sparse-finetuning.yaml`](examples/llama-3/sparse-finetuning.yaml) for a complete example.
 
 ---
 
+## Inference with vLLM
+
+After fine-tuning your sparse model, you can leverage vLLM for efficient inference:
+
+```python
+from vllm import LLM, SamplingParams
+
+prompts = [
+    "Hello, my name is",
+    "The president of the United States is",
+    "The capital of France is",
+    "The future of AI is",
+]
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+llm = LLM("path/to/your/sparse/model")
+outputs = llm.generate(prompts, sampling_params)
+
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
+
+For more details on vLLM's capabilities and advanced configuration options, see the [official vLLM documentation](https://docs.vllm.ai/).
+
 ## Learn More
 
 For details on available sparsity and quantization schemes, fine-tuning recipes, and usage examples, visit the official LLMCompressor repository:
 
-👉 [https://github.com/vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)
+[https://github.com/vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)
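+
+### Verifying Sparsity After Fine-tuning
+
+As a quick sanity check, you can confirm that the pruned projections stayed sparse through training. The snippet below is a minimal sketch using plain `transformers` (it assumes the checkpoint was saved uncompressed, i.e. `save_compressed: false`, and reuses the placeholder model path from above):
+
+```python
+from transformers import AutoModelForCausalLM
+
+# Load the fine-tuned checkpoint (placeholder path).
+model = AutoModelForCausalLM.from_pretrained("path/to/your/sparse/model")
+
+# Report the fraction of zero weights in each pruned projection matrix.
+for name, param in model.named_parameters():
+    if name.endswith("proj.weight"):
+        zero_fraction = (param == 0).float().mean().item()
+        print(f"{name}: {zero_fraction:.2%} zeros")
+```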