[Bug]: qwen2.5 kv-cache quantization, KeyError: 'layers.0.self_attn.qkv_proj.kv_cache_offset' #2144

@briobol

Your current environment

The output of `python collect_env.py`

Collecting environment information...
PyTorch version: 2.5.1
Is debug build: False

OS: Ubuntu 22.04.5 LTS (aarch64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 4.0.3
Libc version: glibc-2.35

Python version: 3.10.17 (main, May 8 2025, 07:18:04) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-4.19.90-vhulk2211.3.0.h1912.eulerosv2r10.aarch64-aarch64-with-glibc2.35

CPU:
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: HiSilicon
BIOS Vendor ID: HiSilicon
Model name: Kunpeng-920
BIOS Model name: HUAWEI Kunpeng 920 5250
Model: 0
Thread(s) per core: 1
Core(s) per socket: 48
Socket(s): 4
Stepping: 0x1
BogoMIPS: 200.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs
L1d cache: 12 MiB (192 instances)
L1i cache: 12 MiB (192 instances)
L2 cache: 96 MiB (192 instances)
L3 cache: 192 MiB (8 instances)
NUMA node(s): 8
NUMA node0 CPU(s): 0-23
NUMA node1 CPU(s): 24-47
NUMA node2 CPU(s): 48-71
NUMA node3 CPU(s): 72-95
NUMA node4 CPU(s): 96-119
NUMA node5 CPU(s): 120-143
NUMA node6 CPU(s): 144-167
NUMA node7 CPU(s): 168-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pyzmq==27.0.0
[pip3] torch==2.5.1
[pip3] torch-npu==2.5.1.post1.dev20250619
[pip3] torchvision==0.20.1
[pip3] transformers==4.53.3
[conda] Could not collect
vLLM Version: 0.9.2
vLLM Ascend Version: 0.1.dev1+g3aa3b46 (git sha: 3aa3b46)

ENV Variables:
ATB_OPSRUNNER_KERNEL_CACHE_TILING_SIZE=10240
ATB_OPSRUNNER_KERNEL_CACHE_LOCAL_COUNT=1
ATB_STREAM_SYNC_EVERY_RUNNER_ENABLE=0
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ATB_OPSRUNNER_SETUP_CACHE_ENABLE=1
ATB_WORKSPACE_MEM_ALLOC_GLOBAL=0
ATB_DEVICE_TILING_BUFFER_BLOCK_NUM=32
ATB_STREAM_SYNC_EVERY_KERNEL_ENABLE=0
ATB_OPSRUNNER_KERNEL_CACHE_GLOABL_COUNT=5
VLLM_USE_MODELSCOPE=true
PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
ATB_HOME_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0
ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest
ATB_COMPARE_TILING_EVERY_KERNEL=0
ASCEND_OPP_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp
LD_LIBRARY_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:
ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_OPSRUNNER_KERNEL_CACHE_TYPE=3
ATB_RUNNER_POOL_SIZE=64
ATB_STREAM_SYNC_EVERY_OPERATION_ENABLE=0
ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_MATMUL_SHUFFLE_K_ENABLE=1
ATB_LAUNCH_KERNEL_WITH_TILING=1
ATB_WORKSPACE_MEM_ALLOC_ALG_TYPE=1
VLLM_USE_V1=1
ATB_HOST_TILING_BUFFER_BLOCK_NUM=128
ATB_SHARE_MEMORY_NAME_SUFFIX=
TORCH_DEVICE_BACKEND_AUTOLOAD=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1

NPU:
+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc2 Version: 24.1.rc2 |
+---------------------------+---------------+----------------------------------------------------+
| NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page)|
| Chip | Bus-Id | AICore(%) Memory-Usage(MB) HBM-Usage(MB) |
+===========================+===============+====================================================+
| 0 910B4 | OK | 84.1 38 0 / 0 |
| 0 | 0000:C1:00.0 | 0 0 / 0 2881 / 32768 |
+===========================+===============+====================================================+
| 1 910B4 | OK | 87.9 41 0 / 0 |
| 0 | 0000:01:00.0 | 0 0 / 0 2853 / 32768 |
+===========================+===============+====================================================+
| 2 910B4 | OK | 86.5 37 0 / 0 |
| 0 | 0000:C2:00.0 | 0 0 / 0 2850 / 32768 |
+===========================+===============+====================================================+
| 3 910B4 | OK | 91.4 40 0 / 0 |
| 0 | 0000:02:00.0 | 0 0 / 0 2851 / 32768 |
+===========================+===============+====================================================+
| 4 910B4 | OK | 90.4 37 0 / 0 |
| 0 | 0000:81:00.0 | 0 0 / 0 2853 / 32768 |
+===========================+===============+====================================================+
| 5 910B4 | OK | 86.3 40 0 / 0 |
| 0 | 0000:41:00.0 | 0 0 / 0 2853 / 32768 |
+===========================+===============+====================================================+
| 6 910B4 | OK | 96.0 39 0 / 0 |
| 0 | 0000:82:00.0 | 0 0 / 0 2852 / 32768 |
+===========================+===============+====================================================+
| 7 910B4 | OK | 89.2 39 0 / 0 |
| 0 | 0000:42:00.0 | 0 0 / 0 2850 / 32768 |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU Chip | Process id | Process name | Process memory(MB) |
+===========================+===============+====================================================+
| No running processes found in NPU 0 |
+===========================+===============+====================================================+
| No running processes found in NPU 1 |
+===========================+===============+====================================================+
| No running processes found in NPU 2 |
+===========================+===============+====================================================+
| No running processes found in NPU 3 |
+===========================+===============+====================================================+
| No running processes found in NPU 4 |
+===========================+===============+====================================================+
| No running processes found in NPU 5 |
+===========================+===============+====================================================+
| No running processes found in NPU 6 |
+===========================+===============+====================================================+
| No running processes found in NPU 7 |
+===========================+===============+====================================================+

CANN:
package_name=Ascend-cann-toolkit
version=8.1.RC1
innerversion=V100R001C21SPC001B238
compatible_version=[V100R001C15],[V100R001C18],[V100R001C19],[V100R001C20],[V100R001C21]
arch=aarch64
os=linux
path=/usr/local/Ascend/ascend-toolkit/8.1.RC1/aarch64-linux

🐛 Describe the bug

Docker image used: url

Docker container creation command:
docker run --privileged -itd -u root --rm --name <name> --ipc=host \
--privileged=true --net=host \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-e "ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7" \
-e "MAX_MEMORY_GB=55" \
-e "VLLM_USE_MODELSCOPE=true" \
-e "VLLM_USE_V1=1" \
-e "PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256" \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
-v /usr/local/sbin/:/usr/local/sbin/ \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
-v /var/log/npu/conf/slog/slog.conf:/var/log/npu/conf/slog/slog.conf \
-v /var/log/npu/slog/:/var/log/npu/slog \
-v /var/log/npu/profiling/:/var/log/npu/profiling \
-v /var/log/npu/dump/:/var/log/npu/dump \
-v /var/log/npu/:/usr/slog \
-v /lib/modules:/lib/modules \
vllm-ascend:v0.9.2rc1 \
/bin/bash

The model was quantized with msmodelslim as described in the documentation, using version modelslim-VLLM-8.1.RC1.b020_001:

msit# git branch
* (HEAD detached at modelslim-VLLM-8.1.RC1.b020_001)

Quantization script:
python3 msit/msmodelslim/example/Qwen/quant_qwen.py \
        --model_path <path>/Qwen2.5-32B-Instruct/ \
        --save_directory <path> \
        --calib_file <path>msit/msmodelslim/example/Qwen/calib_data/calib_prompt.jsonl \
        --anti_calib_file <path>/msit/msmodelslim/example/Qwen/calib_data/anti_prompt.jsonl \
        --w_bit 8 \
        --a_bit 8 \
        --anti_method m2 \
        --is_lowbit False \
        --act_method 1 \
        --w_sym True \
        --do_smooth False \
        --use_kvcache_quant True \
        --use_sigma False \
        --open_outlier True \
        --is_dynamic False \
        --group_size 128 \
        --device_type npu \
        --disable_last_linear True \
        --model_type qwen2.5

vLLM launch command:
vllm serve <path> \
        --host 127.0.0.1 \
        --port 8010 \
        --served-model-name qwen2 \
        --gpu-memory-utilization 0.85 \
        --quantization ascend \
        --tensor-parallel-size 4 

vllm serve output log:
(VllmWorker rank=3 pid=22679) DEBUG 07-31 10:45:55 [utils.py:183] Loaded weight lm_head.weight with shape torch.Size([38016, 5120])
(VllmWorker rank=2 pid=22583) ERROR 07-31 10:45:56 [multiproc_executor.py:487] WorkerProc failed to start.
(VllmWorker rank=2 pid=22583) ERROR 07-31 10:45:56 [multiproc_executor.py:487] Traceback (most recent call last):
(VllmWorker rank=2 pid=22583) ERROR 07-31 10:45:56 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 461, in worker_main
(VllmWorker rank=2 pid=22583) ERROR 07-31 10:45:56 [multiproc_executor.py:487]     worker = WorkerProc(*args, **kwargs)
(VllmWorker rank=2 pid=22583) ERROR 07-31 10:45:56 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 358, in __init__
(VllmWorker rank=2 pid=22583) ERROR 07-31 10:45:56 [multiproc_executor.py:487]     self.worker.load_model()
(VllmWorker rank=2 pid=22583) ERROR 07-31 10:45:56 [multiproc_executor.py:487]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker_v1.py", line 213, in load_model
(VllmWorker rank=2 pid=22583) ERROR 07-31 10:45:56 [multiproc_executor.py:487]     self.model_runner.load_model()
(VllmWorker rank=2 pid=22583) ERROR 07-31 10:45:56 [multiproc_executor.py:487]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1814, in load_model
(VllmWorker rank=2 pid=22583) ERROR 07-31 10:45:56 [multiproc_executor.py:487]     self.model = get_model(vllm_config=self.vllm_config)
(VllmWorker rank=2 pid=22583) ERROR 07-31 10:45:56 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/model_executor/model_loader/__init__.py", line 59, in get_model
(VllmWorker rank=2 pid=22583) ERROR 07-31 10:45:56 [multiproc_executor.py:487]     return loader.load_model(vllm_config=vllm_config,
(VllmWorker rank=2 pid=22583) ERROR 07-31 10:45:56 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/model_executor/model_loader/base_loader.py", line 41, in load_model
(VllmWorker rank=2 pid=22583) ERROR 07-31 10:45:56 [multiproc_executor.py:487]     self.load_weights(model, model_config)
(VllmWorker rank=2 pid=22583) ERROR 07-31 10:45:56 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/model_executor/model_loader/default_loader.py", line 269, in load_weights
(VllmWorker rank=2 pid=22583) ERROR 07-31 10:45:56 [multiproc_executor.py:487]     loaded_weights = model.load_weights(
(VllmWorker rank=2 pid=22583) ERROR 07-31 10:45:56 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 498, in load_weights
(VllmWorker rank=2 pid=22583) ERROR 07-31 10:45:56 [multiproc_executor.py:487]     return loader.load_weights(weights)
(VllmWorker rank=2 pid=22583) ERROR 07-31 10:45:56 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/model_executor/models/utils.py", line 291, in load_weights
(VllmWorker rank=2 pid=22583) ERROR 07-31 10:45:56 [multiproc_executor.py:487]     autoloaded_weights = set(self._load_module("", self.module, weights))
(VllmWorker rank=2 pid=22583) ERROR 07-31 10:45:56 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/model_executor/models/utils.py", line 249, in _load_module
(VllmWorker rank=2 pid=22583) ERROR 07-31 10:45:56 [multiproc_executor.py:487]     yield from self._load_module(prefix,
(VllmWorker rank=2 pid=22583) ERROR 07-31 10:45:56 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/model_executor/models/utils.py", line 222, in _load_module
(VllmWorker rank=2 pid=22583) ERROR 07-31 10:45:56 [multiproc_executor.py:487]     loaded_params = module_load_weights(weights)
(VllmWorker rank=2 pid=22583) ERROR 07-31 10:45:56 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 403, in load_weights
(VllmWorker rank=2 pid=22583) ERROR 07-31 10:45:56 [multiproc_executor.py:487]     param = params_dict[name]
(VllmWorker rank=2 pid=22583) ERROR 07-31 10:45:56 [multiproc_executor.py:487] KeyError: 'layers.0.self_attn.qkv_proj.kv_cache_offset'

vllm_serve_kv_cache_offset.log
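
The KeyError above comes from `param = params_dict[name]`, so the loader received a tensor name that has no matching model parameter. A minimal sketch for listing which entries of the quant model description end in `kv_cache_offset` is given here. It assumes the description is a flat JSON object whose keys are tensor names, which this report does not confirm; `find_entries_with_suffix` is a hypothetical helper name, not part of vLLM or msmodelslim.

```python
import json
from pathlib import Path


def find_entries_with_suffix(path, suffixes=("kv_cache_offset",)):
    """Return the sorted description keys that end with one of `suffixes`.

    Assumption: the file is a flat JSON object keyed by tensor name.
    """
    description = json.loads(Path(path).read_text(encoding="utf-8"))
    # str.endswith accepts a tuple, so several suffixes can be checked at once.
    return sorted(name for name in description if name.endswith(tuple(suffixes)))
```

Calling `find_entries_with_suffix("<path>/quant_model_description.json")` would show whether the quantized checkpoint declares KV-cache offset tensors at all, and under which module names.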

Steps to reproduce:
To work with a quantized KV cache (see the attached quant model description), vllm-ascend/vllm_ascend/quantization/quantizer.py has to be changed first.

If the quant model description contains the fields "fa_quant_type": null or "kv_quant_type": "C8", startup eventually fails with

NotImplementedError: Currently, vLLM Ascend only supports following quant types:['W8A8', 'W8A8_DYNAMIC', 'C8']

Printing the quant type that the quantizer receives shows None.
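
One way to see where the None comes from is to tally the values in the quant model description before starting vLLM. The sketch below assumes the description is a flat JSON object (not confirmed by this report); `summarize_quant_description` is a hypothetical helper name.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_quant_description(path):
    """Count how often each value occurs in a quant model description file."""
    description = json.loads(Path(path).read_text(encoding="utf-8"))
    # json.loads maps JSON null to Python None, so null entries are counted
    # under the None key. Non-string values are grouped by their type name
    # so that unhashable values (lists, dicts) do not break the counter.
    return Counter(
        value if isinstance(value, (str, type(None))) else type(value).__name__
        for value in description.values()
    )
```

A nonzero count under `None` in the result would point at the null fields mentioned above.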

Error with the default (unmodified) code:
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487] WorkerProc failed to start.
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487] Traceback (most recent call last):
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]   File "/vllm-workspace/vllm-ascend/vllm_ascend/quantization/quantizer.py", line 51, in get_quantizer
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]     module = importlib.import_module("mindie_turbo")
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]   File "/usr/local/python3.10.17/lib/python3.10/importlib/__init__.py", line 126, in import_module
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]     return _bootstrap._gcd_import(name[level:], package, level)
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]   File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]   File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]   File "<frozen importlib._bootstrap>", line 1004, in _find_and_load_unlocked
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487] ModuleNotFoundError: No module named 'mindie_turbo'
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487] 
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487] During handling of the above exception, another exception occurred:
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487] 
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487] Traceback (most recent call last):
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 461, in worker_main
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]     worker = WorkerProc(*args, **kwargs)
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 358, in __init__
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]     self.worker.load_model()
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker_v1.py", line 213, in load_model
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]     self.model_runner.load_model()
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1814, in load_model
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]     self.model = get_model(vllm_config=self.vllm_config)
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/model_executor/model_loader/__init__.py", line 59, in get_model
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]     return loader.load_model(vllm_config=vllm_config,
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/model_executor/model_loader/base_loader.py", line 38, in load_model
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]     model = initialize_model(vllm_config=vllm_config,
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/model_executor/model_loader/utils.py", line 64, in initialize_model
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]     return model_class(vllm_config=vllm_config, prefix=prefix)
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 448, in __init__
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]     self.model = Qwen2Model(vllm_config=vllm_config,
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/compilation/decorators.py", line 152, in __init__
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]     old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 317, in __init__
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]     self.start_layer, self.end_layer, self.layers = make_layers(
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/model_executor/models/utils.py", line 639, in make_layers
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]     [PPMissingLayer() for _ in range(start_layer)] + [
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/model_executor/models/utils.py", line 640, in <listcomp>
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 319, in <lambda>
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]     lambda prefix: decoder_layer_type(config=config,
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 216, in __init__
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]     self.self_attn = Qwen2Attention(
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 162, in __init__
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]     self.attn = Attention(
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/attention/layer.py", line 112, in __init__
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]     quant_method = quant_config.get_quant_method(
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]   File "/vllm-workspace/vllm-ascend/vllm_ascend/quantization/quant_config.py", line 103, in get_quant_method
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]     return AscendKVCacheMethod(self, prefix)
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]   File "/vllm-workspace/vllm-ascend/vllm_ascend/quantization/quant_config.py", line 227, in __init__
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]     self.quantizer = AscendQuantizer.get_quantizer(
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]   File "/vllm-workspace/vllm-ascend/vllm_ascend/quantization/quantizer.py", line 56, in get_quantizer
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]     return VLLMAscendQuantizer.get_quantizer(quant_config, prefix,
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]   File "/vllm-workspace/vllm-ascend/vllm_ascend/quantization/quantizer.py", line 268, in get_quantizer
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487]     raise NotImplementedError("Currently, vLLM Ascend only supports following quant types:" \
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487] NotImplementedError: Currently, vLLM Ascend only supports following quant types:['W8A8', 'W8A8_DYNAMIC', 'C8']
(VllmWorker rank=3 pid=24267) ERROR 07-31 10:51:32 [multiproc_executor.py:487] Your quant type is: None
(VllmWorker rank=0 pid=24146) INFO 07-31 10:51:32 [quantizer.py:89] Using the vLLM Ascend Quantizer version now!
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487] WorkerProc failed to start.
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487] Traceback (most recent call last):
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]   File "/vllm-workspace/vllm-ascend/vllm_ascend/quantization/quantizer.py", line 51, in get_quantizer
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]     module = importlib.import_module("mindie_turbo")
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]   File "/usr/local/python3.10.17/lib/python3.10/importlib/__init__.py", line 126, in import_module
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]     return _bootstrap._gcd_import(name[level:], package, level)
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]   File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]   File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]   File "<frozen importlib._bootstrap>", line 1004, in _find_and_load_unlocked
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487] ModuleNotFoundError: No module named 'mindie_turbo'
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487] 
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487] During handling of the above exception, another exception occurred:
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487] 
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487] Traceback (most recent call last):
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 461, in worker_main
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]     worker = WorkerProc(*args, **kwargs)
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 358, in __init__
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]     self.worker.load_model()
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker_v1.py", line 213, in load_model
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]     self.model_runner.load_model()
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1814, in load_model
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]     self.model = get_model(vllm_config=self.vllm_config)
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/model_executor/model_loader/__init__.py", line 59, in get_model
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]     return loader.load_model(vllm_config=vllm_config,
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/model_executor/model_loader/base_loader.py", line 38, in load_model
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]     model = initialize_model(vllm_config=vllm_config,
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/model_executor/model_loader/utils.py", line 64, in initialize_model
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]     return model_class(vllm_config=vllm_config, prefix=prefix)
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 448, in __init__
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]     self.model = Qwen2Model(vllm_config=vllm_config,
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/compilation/decorators.py", line 152, in __init__
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]     old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 317, in __init__
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]     self.start_layer, self.end_layer, self.layers = make_layers(
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/model_executor/models/utils.py", line 639, in make_layers
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]     [PPMissingLayer() for _ in range(start_layer)] + [
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/model_executor/models/utils.py", line 640, in <listcomp>
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 319, in <lambda>
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]     lambda prefix: decoder_layer_type(config=config,
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 216, in __init__
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]     self.self_attn = Qwen2Attention(
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 162, in __init__
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]     self.attn = Attention(
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]   File "/vllm-workspace/vllm/vllm/attention/layer.py", line 112, in __init__
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]     quant_method = quant_config.get_quant_method(
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]   File "/vllm-workspace/vllm-ascend/vllm_ascend/quantization/quant_config.py", line 103, in get_quant_method
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]     return AscendKVCacheMethod(self, prefix)
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]   File "/vllm-workspace/vllm-ascend/vllm_ascend/quantization/quant_config.py", line 227, in __init__
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]     self.quantizer = AscendQuantizer.get_quantizer(
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]   File "/vllm-workspace/vllm-ascend/vllm_ascend/quantization/quantizer.py", line 56, in get_quantizer
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]     return VLLMAscendQuantizer.get_quantizer(quant_config, prefix,
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]   File "/vllm-workspace/vllm-ascend/vllm_ascend/quantization/quantizer.py", line 268, in get_quantizer
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487]     raise NotImplementedError("Currently, vLLM Ascend only supports following quant types:" \
(VllmWorker rank=0 pid=24146) ERROR 07-31 10:51:33 [multiproc_executor.py:487] NotImplementedError: Currently, vLLM Ascend only supports following quant types:['W8A8', 'W8A8_DYNAMIC', 'C8']

vllm_serve_NotImplementedError.log

Alternatively, manually remove `kv_quant_type` and `fa_quant_type` from quant_model_description.json if they are null.
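
The manual removal can be scripted. This is a minimal sketch, assuming `kv_quant_type` and `fa_quant_type` are top-level keys of a flat JSON object; `drop_null_quant_keys` is a hypothetical helper name. It only deletes a key when its value is null, and it rewrites the file in place, so keep a backup of the original.

```python
import json
from pathlib import Path

# Keys named in this report; they are dropped only when their value is null.
NULLABLE_KEYS = ("kv_quant_type", "fa_quant_type")


def drop_null_quant_keys(path, keys=NULLABLE_KEYS):
    """Delete the given top-level keys if they are null; return removed keys."""
    path = Path(path)
    description = json.loads(path.read_text(encoding="utf-8"))
    removed = [key for key in keys if key in description and description[key] is None]
    for key in removed:
        del description[key]
    if removed:
        # Rewrite the file only when something actually changed.
        path.write_text(json.dumps(description, indent=2), encoding="utf-8")
    return removed
```

For example, `drop_null_quant_keys("<path>/quant_model_description.json")` returns the list of keys it removed, or an empty list if the file was left untouched.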
