This guide aims to introduce how to use LMDeploy's TurboMind inference engine to run models quantized by the vllm-project/llm-compressor tool.
Currently supported llm-compressor quantization types include:
- int4 quantization (e.g., AWQ, GPTQ)
These quantized models can run via the TurboMind engine on the following NVIDIA GPU architectures:
| Compute Capability | Micro-architecture | GPUs |
|---|---|---|
| 7.0 | Volta | V100 |
| 7.2 | Volta | Jetson Xavier |
| 7.5 | Turing | GeForce RTX 20 series, T4 |
| 8.0 | Ampere | A100, A800, A30 |
| 8.6 | Ampere | GeForce RTX 30 series, A40, A10 |
| 8.7 | Ampere | Jetson Orin |
| 8.9 | Ada Lovelace | GeForce RTX 40 series, L40, L20 |
| 9.0 | Hopper | H20, H200, H100, GH200 |
| 12.0 | Blackwell | GeForce RTX 50 series |
LMDeploy will continue to follow up and expand support for the llm-compressor project.
The remainder of this document consists of the following sections:
llm-compressor provides a wealth of model quantization examples. Please refer to its tutorials to select a quantization algorithm supported by LMDeploy to complete your model quantization work.
LMDeploy also provides a built-in script for AWQ quantization of Qwen3-30B-A3B using llm-compressor for your reference:
# Create conda environment
conda create -n lmdeploy python=3.10 -y
conda activate lmdeploy
# Install llm-compressor
pip install llm-compressor
# Clone lmdeploy source code and run the quantization example
git clone https://github.com/InternLM/lmdeploy
cd lmdeploy
python examples/lite/qwen3_30b_a3b_awq.py --work-dir ./qwen3_30b_a3b_awqIn the following sections, we will use this quantized model as an example to introduce model deployment and accuracy evaluation methods.
With the quantized model, offline batch processing can be implemented with just a few lines of code:
from lmdeploy import pipeline, TurbomindEngineConfig
engine_config = TurbomindEngineConfig()
with pipeline("./qwen3_30b_a3b_4bit", backend_config=engine_config) as pipe:
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)For a detailed introduction to the pipeline, please refer to here.
LMDeploy api_server supports encapsulating the model as a service with a single command. The provided RESTful APIs are compatible with OpenAI interfaces. Below is an example of starting the service:
lmdeploy serve api_server ./qwen3_30b_a3b_4bit --backend turbomindThe default service port is 23333. After the server starts, you can access the service via the OpenAI SDK. For command arguments and methods to access the service, please read this document.
Aftering deploying AWQ symmetric/asymmetric quantized models of Qwen3-8B (Dense) and Qwen3-30B-A3B (MoE) as services via LMDeploy, we evaluated their accuracy on several academic datasets using opencompass. Results indicate that, for Qwen3-8B, asymmetric quantization generally outperforms symmetric quantization, while Qwen3-30B-A3B shows no substantial difference between symmetric and asymmetric quantization. Compared with BF16, Qwen3-8B shows a smaller accuracy gap under both symmetric and asymmetric quantization than Qwen3-30B-A3B. Compared with BF16, accuracy drops significantly on long-output datasets such as aime2025 (avg 17,635 tokens) and LCB (avg 14,157 tokens), while on medium/short-output datasets like ifeval (avg 1,885 tokens) and mmlu_pro (avg 2,826 tokens), the accuracy is as expected.
| dataset | Qwen3-8B | Qwen3-30B-A3B | ||||
|---|---|---|---|---|---|---|
| bf16 | awq sym | awq asym | bf16 | awq sym | awq asym | |
| ifeval | 85.58 | 83.73 | 85.77 | 86.32 | 84.10 | 84.29 |
| hle | 5.05 | 5.05 | 5.24 | 7.00 | 5.47 | 5.65 |
| gpqa | 59.97 | 56.57 | 59.47 | 61.74 | 57.95 | 57.07 |
| aime2025 | 69.48 | 64.38 | 63.96 | 73.44 | 64.79 | 66.67 |
| mmlu_pro | 73.69 | 71.73 | 72.34 | 77.85 | 75.77 | 75.69 |
| LCBCodeGeneration | 50.86 | 44.10 | 46.95 | 56.67 | 50.86 | 49.24 |
For reproduction methods, please refer to this document.