llm-compressor Support

This guide describes how to use LMDeploy's TurboMind inference engine to run models quantized with the vllm-project/llm-compressor tool.

Currently supported llm-compressor quantization types include:

  • int4 quantization (e.g., AWQ, GPTQ)

These quantized models can run via the TurboMind engine on the following NVIDIA GPU architectures:

Compute Capability | Micro-architecture | GPUs
7.0                | Volta              | V100
7.2                | Volta              | Jetson Xavier
7.5                | Turing             | GeForce RTX 20 series, T4
8.0                | Ampere             | A100, A800, A30
8.6                | Ampere             | GeForce RTX 30 series, A40, A10
8.7                | Ampere             | Jetson Orin
8.9                | Ada Lovelace       | GeForce RTX 40 series, L40, L20
9.0                | Hopper             | H20, H200, H100, GH200
12.0               | Blackwell          | GeForce RTX 50 series
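
To check whether your GPU falls within these ranges, you can query its compute capability with PyTorch (a minimal sketch, assuming torch with CUDA support is installed):

import torch

# Print the compute capability of the current CUDA device, e.g. (8, 0) for A100
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")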

LMDeploy will continue to follow up and expand support for the llm-compressor project.

The remainder of this document consists of the following sections:

  • Model Quantization
  • Model Deployment
  • Accuracy Evaluation

Model Quantization

llm-compressor provides a rich set of model quantization examples. Please refer to its tutorials, choose a quantization algorithm supported by LMDeploy, and quantize your model accordingly.
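
For instance, a one-shot W4A16 (GPTQ) run follows the pattern below. This is a minimal sketch modeled on llm-compressor's examples; the model id, calibration dataset, and calibration settings are illustrative:

# Minimal one-shot W4A16 (GPTQ) sketch modeled on llm-compressor's examples;
# the model id, dataset, and calibration settings below are illustrative.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
oneshot(
    model="Qwen/Qwen3-8B",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="./qwen3_8b_w4a16",
    max_seq_length=2048,
    num_calibration_samples=512,
)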

LMDeploy also provides a built-in script that performs AWQ quantization of Qwen3-30B-A3B with llm-compressor, for your reference:

# Create conda environment
conda create -n lmdeploy python=3.10 -y
conda activate lmdeploy

# Install llm-compressor
pip install llm-compressor

# Clone lmdeploy source code and run the quantization example
git clone https://github.com/InternLM/lmdeploy
cd lmdeploy
python examples/lite/qwen3_30b_a3b_awq.py --work-dir ./qwen3_30b_a3b_awq

In the following sections, we will use this quantized model as an example to introduce model deployment and accuracy evaluation methods.

Model Deployment

Offline Inference

With the quantized model, offline batch processing can be implemented with just a few lines of code:

from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig()
with pipeline("./qwen3_30b_a3b_4bit", backend_config=engine_config) as pipe:
    response = pipe(["Hi, pls intro yourself", "Shanghai is"])
    print(response)

For a detailed introduction to the pipeline, please refer to the pipeline documentation.
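
Sampling behavior can be tuned per request via GenerationConfig. Below is a minimal sketch; the parameter values are illustrative:

from lmdeploy import GenerationConfig, TurbomindEngineConfig, pipeline

# Illustrative sampling parameters; adjust to your use case
gen_config = GenerationConfig(max_new_tokens=256, temperature=0.8, top_p=0.95)
with pipeline("./qwen3_30b_a3b_awq", backend_config=TurbomindEngineConfig()) as pipe:
    response = pipe(["Shanghai is"], gen_config=gen_config)
    print(response)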

Online Serving

LMDeploy's api_server can launch the model as a service with a single command, exposing RESTful APIs compatible with the OpenAI interface. Below is an example of starting the service:

lmdeploy serve api_server ./qwen3_30b_a3b_awq --backend turbomind

The default service port is 23333. After the server starts, you can access the service via the OpenAI SDK. For command arguments and methods to access the service, please read this document.
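
For example, a chat completion request can be issued with the OpenAI Python SDK as follows (a minimal sketch, assuming the server runs locally on the default port):

from openai import OpenAI

# Point the client at the local api_server; the api_key is a placeholder
client = OpenAI(api_key="EMPTY", base_url="http://0.0.0.0:23333/v1")
model_name = client.models.list().data[0].id  # the model served above
response = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": "Hi, pls intro yourself"}],
)
print(response.choices[0].message.content)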

Accuracy Evaluation

After deploying AWQ symmetric/asymmetric quantized models of Qwen3-8B (dense) and Qwen3-30B-A3B (MoE) as services via LMDeploy, we evaluated their accuracy on several academic datasets using opencompass. The results indicate that for Qwen3-8B, asymmetric quantization generally outperforms symmetric quantization, whereas Qwen3-30B-A3B shows no substantial difference between the two. Relative to BF16, Qwen3-8B exhibits a smaller accuracy gap than Qwen3-30B-A3B under both schemes. Accuracy drops noticeably on long-output datasets such as aime2025 (avg 17,635 output tokens) and LCBCodeGeneration (avg 14,157 tokens), while on medium/short-output datasets such as ifeval (avg 1,885 tokens) and mmlu_pro (avg 2,826 tokens) it stays close to BF16.

dataset           | Qwen3-8B bf16 | Qwen3-8B awq sym | Qwen3-8B awq asym | Qwen3-30B-A3B bf16 | Qwen3-30B-A3B awq sym | Qwen3-30B-A3B awq asym
ifeval            | 85.58         | 83.73            | 85.77             | 86.32              | 84.10                 | 84.29
hle               | 5.05          | 5.05             | 5.24              | 7.00               | 5.47                  | 5.65
gpqa              | 59.97         | 56.57            | 59.47             | 61.74              | 57.95                 | 57.07
aime2025          | 69.48         | 64.38            | 63.96             | 73.44              | 64.79                 | 66.67
mmlu_pro          | 73.69         | 71.73            | 72.34             | 77.85              | 75.77                 | 75.69
LCBCodeGeneration | 50.86         | 44.10            | 46.95             | 56.67              | 50.86                 | 49.24

For reproduction methods, please refer to this document.