flashinfer-ai/flashinfer #1350 (Open)
Labels: bug (Something isn't working)
Description
Your current environment
Model: Qwen3-32B FP8 quant
docker vllm-openai v0.10.0 (https://hub.docker.com/layers/vllm/vllm-openai/v0.10.0/images/sha256-af9dc182ee24be77a81ade64a15aa73250440a81224b9c4b7df897d025410b30)
flashinfer v0.2.8rc1
GPU: 1 x H100
🐛 Describe the bug
Entrypoint:
VLLM_ATTENTION_BACKEND=FLASHINFER python3 -m vllm.entrypoints.openai.api_server \
--model /data/model/ \
--served-model-name Qwen3-32B \
--disable-log-requests \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--reasoning-parser qwen3
After a number of successful requests, the server crashes with the following error:
ERROR 07-29 11:19:43 [core.py:634] EngineCore encountered a fatal error.
ERROR 07-29 11:19:43 [core.py:634] Traceback (most recent call last):
ERROR 07-29 11:19:43 [core.py:634] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 625, in run_engine_core
ERROR 07-29 11:19:43 [core.py:634] engine_core.run_busy_loop()
ERROR 07-29 11:19:43 [core.py:634] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 652, in run_busy_loop
ERROR 07-29 11:19:43 [core.py:634] self._process_engine_step()
ERROR 07-29 11:19:43 [core.py:634] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 677, in _process_engine_step
ERROR 07-29 11:19:43 [core.py:634] outputs, model_executed = self.step_fn()
ERROR 07-29 11:19:43 [core.py:634] ^^^^^^^^^^^^^^
ERROR 07-29 11:19:43 [core.py:634] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 267, in step
ERROR 07-29 11:19:43 [core.py:634] model_output = self.execute_model_with_error_logging(
ERROR 07-29 11:19:43 [core.py:634] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-29 11:19:43 [core.py:634] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 253, in execute_model_with_error_logging
ERROR 07-29 11:19:43 [core.py:634] raise err
ERROR 07-29 11:19:43 [core.py:634] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 244, in execute_model_with_error_logging
ERROR 07-29 11:19:43 [core.py:634] return model_fn(scheduler_output)
ERROR 07-29 11:19:43 [core.py:634] ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-29 11:19:43 [core.py:634] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 87, in execute_model
ERROR 07-29 11:19:43 [core.py:634] output = self.collective_rpc("execute_model",
ERROR 07-29 11:19:43 [core.py:634] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-29 11:19:43 [core.py:634] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
ERROR 07-29 11:19:43 [core.py:634] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 07-29 11:19:43 [core.py:634] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-29 11:19:43 [core.py:634] File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 2985, in run_method
ERROR 07-29 11:19:43 [core.py:634] return func(*args, **kwargs)
ERROR 07-29 11:19:43 [core.py:634] ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-29 11:19:43 [core.py:634] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 07-29 11:19:43 [core.py:634] return func(*args, **kwargs)
ERROR 07-29 11:19:43 [core.py:634] ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-29 11:19:43 [core.py:634] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 337, in execute_model
ERROR 07-29 11:19:43 [core.py:634] output = self.model_runner.execute_model(scheduler_output,
ERROR 07-29 11:19:43 [core.py:634] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-29 11:19:43 [core.py:634] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 07-29 11:19:43 [core.py:634] return func(*args, **kwargs)
ERROR 07-29 11:19:43 [core.py:634] ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-29 11:19:43 [core.py:634] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1366, in execute_model
ERROR 07-29 11:19:43 [core.py:634] self._prepare_inputs(scheduler_output))
ERROR 07-29 11:19:43 [core.py:634] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-29 11:19:43 [core.py:634] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 786, in _prepare_inputs
ERROR 07-29 11:19:43 [core.py:634] attn_metadata_i = (builder.build(
ERROR 07-29 11:19:43 [core.py:634] ^^^^^^^^^^^^^^
ERROR 07-29 11:19:43 [core.py:634] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flashinfer.py", line 470, in build
ERROR 07-29 11:19:43 [core.py:634] self._plan(num_prefills, num_decodes, attn_metadata)
ERROR 07-29 11:19:43 [core.py:634] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flashinfer.py", line 290, in _plan
ERROR 07-29 11:19:43 [core.py:634] attn_metadata.cascade_wrapper.plan(
ERROR 07-29 11:19:43 [core.py:634] TypeError: MultiLevelCascadeAttentionWrapper.plan() got an unexpected keyword argument 'kv_data_type'
Process EngineCore_0:
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 636, in run_engine_core
raise e
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 625, in run_engine_core
engine_core.run_busy_loop()
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 652, in run_busy_loop
self._process_engine_step()
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 677, in _process_engine_step
outputs, model_executed = self.step_fn()
^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 267, in step
model_output = self.execute_model_with_error_logging(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 253, in execute_model_with_error_logging
raise err
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 244, in execute_model_with_error_logging
return model_fn(scheduler_output)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 87, in execute_model
output = self.collective_rpc("execute_model",
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 2985, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 337, in execute_model
output = self.model_runner.execute_model(scheduler_output,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1366, in execute_model
self._prepare_inputs(scheduler_output))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 786, in _prepare_inputs
attn_metadata_i = (builder.build(
^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flashinfer.py", line 470, in build
self._plan(num_prefills, num_decodes, attn_metadata)
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flashinfer.py", line 290, in _plan
attn_metadata.cascade_wrapper.plan(
TypeError: MultiLevelCascadeAttentionWrapper.plan() got an unexpected keyword argument 'kv_data_type'
ERROR 07-29 11:19:43 [async_llm.py:416] AsyncLLM output_handler failed.
ERROR 07-29 11:19:43 [async_llm.py:416] Traceback (most recent call last):
ERROR 07-29 11:19:43 [async_llm.py:416] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 375, in output_handler
ERROR 07-29 11:19:43 [async_llm.py:416] outputs = await engine_core.get_output_async()
ERROR 07-29 11:19:43 [async_llm.py:416] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-29 11:19:43 [async_llm.py:416] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 751, in get_output_async
ERROR 07-29 11:19:43 [async_llm.py:416] raise self._format_exception(outputs) from None
ERROR 07-29 11:19:43 [async_llm.py:416] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
...
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [7]
I suspect this was introduced here:
https://github.com/vllm-project/vllm/blame/6d8d0a24c02bfd84d46b3016b865a44f048ae84b/vllm/v1/attention/backends/flashinfer.py#L313
The failing call is in the cascade-attention planning path, which would explain why the crash only appears after some successful requests (presumably once vLLM's heuristic opts into cascade attention). I cannot find any FlashInfer release in which MultiLevelCascadeAttentionWrapper.plan() accepts a kv_data_type keyword argument.
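For what it's worth, the mismatch can be confirmed against the installed build with a quick signature check (a minimal sketch, assuming MultiLevelCascadeAttentionWrapper is importable from the top-level flashinfer package as in the FlashInfer docs):

import inspect
from flashinfer import MultiLevelCascadeAttentionWrapper

# If vLLM's call were valid, plan() would list kv_data_type among its parameters.
sig = inspect.signature(MultiLevelCascadeAttentionWrapper.plan)
print("kv_data_type" in sig.parameters)  # expected False, given the TypeError above

As a stopgap, a monkeypatch that drops the unsupported keyword before it reaches FlashInfer might keep the engine from crashing. This is a hypothetical workaround of my own, not a proper fix: silently discarding kv_data_type may be semantically wrong for an FP8 KV cache; it only avoids the TypeError:

from flashinfer import MultiLevelCascadeAttentionWrapper

_orig_plan = MultiLevelCascadeAttentionWrapper.plan

def _plan_compat(self, *args, **kwargs):
    # Discard the keyword that this FlashInfer build's plan() does not accept.
    kwargs.pop("kv_data_type", None)
    return _orig_plan(self, *args, **kwargs)

MultiLevelCascadeAttentionWrapper.plan = _plan_compat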
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.