# LLM Compressor

`llm-compressor` is an easy-to-use library for optimizing models for deployment with `vllm`, including:

* Comprehensive set of quantization algorithms, including weight-only and activation quantization
* Seamless integration with Hugging Face models and repositories
* `safetensors`-based file format compatible with `vllm`

*(Figure: LLM Compressor Flow)*

### Supported Formats

* Mixed Precision: W4A16, W8A16
* Activation Quantization: W8A8 (int8 and fp8)
* 2:4 Semi-structured Sparsity
* Unstructured Sparsity

### Supported Algorithms

* PTQ (Post-Training Quantization)
* GPTQ
* SmoothQuant
* SparseGPT

## Installation

`llm-compressor` can be installed from source via a git clone and a local pip install.

```bash
git clone https://github.com/vllm-project/llm-compressor.git
pip install -e llm-compressor
```

## Quick Tour

The following snippet is a minimal example of 4-bit weight-only quantization via GPTQ, followed by inference with `TinyLlama/TinyLlama-1.1B-Chat-v1.0`. Note that the model can be swapped for any local or remote HF-compatible checkpoint, and the `recipe` can be changed to target different quantization algorithms or formats.

### Compression

Compression is applied by selecting an algorithm (here, GPTQ) and calling the `oneshot` API.

```python
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization.gptq import GPTQModifier

# Set parameters for the GPTQ algorithm - target Linear layer weights at 4 bits
recipe = GPTQModifier(scheme="W4A16", targets="Linear", ignore=["lm_head"])

# Apply the GPTQ algorithm, using the open_platypus dataset for calibration.
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    dataset="open_platypus",
    recipe=recipe,
    save_compressed=True,
    output_dir="llama-compressed-quickstart",
    overwrite_output_dir=True,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

### Inference with vLLM

The checkpoint is ready to run with vLLM (after installing it via `pip install vllm`).

```python
from vllm import LLM

model = LLM("llama-compressed-quickstart")
output = model.generate("I love 4 bit models because")
```

## End-to-End Examples

The `llm-compressor` library provides a rich feature set for model compression. Below are examples and documentation for a few key flows:

* [`Meta-Llama-3-8B-Instruct` W4A16 With GPTQ](examples/quantization_w4a16)
* [`Meta-Llama-3-8B-Instruct` W8A8-Int8 With GPTQ and SmoothQuant](examples/quantization_w8a8_int8)
* [`Meta-Llama-3-8B-Instruct` W8A8-Fp8 With PTQ](examples/quantization_w8a8_fp8)

For a minimal code sketch that combines SmoothQuant and GPTQ into a single recipe, see the end of this README. If you have any questions or requests, open an [issue](https://github.com/vllm-project/llm-compressor/issues) and we will add an example or documentation.

## Contribute

We appreciate contributions to the code, examples, integrations, and documentation, as well as bug reports and feature requests! [Learn how here](CONTRIBUTING.md).
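
As a complement to the W8A8-Int8 example linked above, here is a minimal sketch of a recipe that chains SmoothQuant and GPTQ using the same `oneshot` API as the Quick Tour. It assumes a `SmoothQuantModifier` importable from `llmcompressor.modifiers.smoothquant`; the smoothing strength, output directory, and calibration settings are illustrative rather than prescriptive.

```python
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.modifiers.quantization.gptq import GPTQModifier

# Illustrative recipe: smooth activation outliers first, then quantize
# Linear weights and activations to 8-bit integers (W8A8), skipping lm_head.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"]),
]

# Calibrate and save a compressed checkpoint, mirroring the Quick Tour settings.
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    dataset="open_platypus",
    recipe=recipe,
    save_compressed=True,
    output_dir="tinyllama-w8a8-int8",  # hypothetical output path
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

The resulting directory can then be served with vLLM exactly as in the Quick Tour, e.g. `LLM("tinyllama-w8a8-int8")`.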