
[Bug]: vllm==0.10.0 + flashinfer, MultiLevelCascadeAttentionWrapper.plan() got an unexpected keyword argument 'kv_data_type' #21822

@celidos

Description

Your current environment

Model: Qwen3-32B (FP8 quantized)
Docker image: vllm/vllm-openai:v0.10.0 (https://hub.docker.com/layers/vllm/vllm-openai/v0.10.0/images/sha256-af9dc182ee24be77a81ade64a15aa73250440a81224b9c4b7df897d025410b30)
FlashInfer: v0.2.8rc1
GPU: 1 x H100

🐛 Describe the bug

Entrypoint:

VLLM_ATTENTION_BACKEND=FLASHINFER python3 -m vllm.entrypoints.openai.api_server \
    --model /data/model/ \
    --served-model-name Qwen3-32B \
    --disable-log-requests \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --reasoning-parser qwen3

After a number of successful requests, the engine crashes with the following error:

ERROR 07-29 11:19:43 [core.py:634] EngineCore encountered a fatal error.
ERROR 07-29 11:19:43 [core.py:634] Traceback (most recent call last):
ERROR 07-29 11:19:43 [core.py:634]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 625, in run_engine_core
ERROR 07-29 11:19:43 [core.py:634]     engine_core.run_busy_loop()
ERROR 07-29 11:19:43 [core.py:634]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 652, in run_busy_loop
ERROR 07-29 11:19:43 [core.py:634]     self._process_engine_step()
ERROR 07-29 11:19:43 [core.py:634]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 677, in _process_engine_step
ERROR 07-29 11:19:43 [core.py:634]     outputs, model_executed = self.step_fn()
ERROR 07-29 11:19:43 [core.py:634]                               ^^^^^^^^^^^^^^
ERROR 07-29 11:19:43 [core.py:634]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 267, in step
ERROR 07-29 11:19:43 [core.py:634]     model_output = self.execute_model_with_error_logging(
ERROR 07-29 11:19:43 [core.py:634]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-29 11:19:43 [core.py:634]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 253, in execute_model_with_error_logging
ERROR 07-29 11:19:43 [core.py:634]     raise err
ERROR 07-29 11:19:43 [core.py:634]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 244, in execute_model_with_error_logging
ERROR 07-29 11:19:43 [core.py:634]     return model_fn(scheduler_output)
ERROR 07-29 11:19:43 [core.py:634]            ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-29 11:19:43 [core.py:634]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 87, in execute_model
ERROR 07-29 11:19:43 [core.py:634]     output = self.collective_rpc("execute_model",
ERROR 07-29 11:19:43 [core.py:634]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-29 11:19:43 [core.py:634]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
ERROR 07-29 11:19:43 [core.py:634]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 07-29 11:19:43 [core.py:634]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-29 11:19:43 [core.py:634]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 2985, in run_method
ERROR 07-29 11:19:43 [core.py:634]     return func(*args, **kwargs)
ERROR 07-29 11:19:43 [core.py:634]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-29 11:19:43 [core.py:634]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 07-29 11:19:43 [core.py:634]     return func(*args, **kwargs)
ERROR 07-29 11:19:43 [core.py:634]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-29 11:19:43 [core.py:634]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 337, in execute_model
ERROR 07-29 11:19:43 [core.py:634]     output = self.model_runner.execute_model(scheduler_output,
ERROR 07-29 11:19:43 [core.py:634]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-29 11:19:43 [core.py:634]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 07-29 11:19:43 [core.py:634]     return func(*args, **kwargs)
ERROR 07-29 11:19:43 [core.py:634]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-29 11:19:43 [core.py:634]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1366, in execute_model
ERROR 07-29 11:19:43 [core.py:634]     self._prepare_inputs(scheduler_output))
ERROR 07-29 11:19:43 [core.py:634]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-29 11:19:43 [core.py:634]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 786, in _prepare_inputs
ERROR 07-29 11:19:43 [core.py:634]     attn_metadata_i = (builder.build(
ERROR 07-29 11:19:43 [core.py:634]                        ^^^^^^^^^^^^^^
ERROR 07-29 11:19:43 [core.py:634]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flashinfer.py", line 470, in build
ERROR 07-29 11:19:43 [core.py:634]     self._plan(num_prefills, num_decodes, attn_metadata)
ERROR 07-29 11:19:43 [core.py:634]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flashinfer.py", line 290, in _plan
ERROR 07-29 11:19:43 [core.py:634]     attn_metadata.cascade_wrapper.plan(
ERROR 07-29 11:19:43 [core.py:634] TypeError: MultiLevelCascadeAttentionWrapper.plan() got an unexpected keyword argument 'kv_data_type'
Process EngineCore_0:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 636, in run_engine_core
    raise e
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 625, in run_engine_core
    engine_core.run_busy_loop()
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 652, in run_busy_loop
    self._process_engine_step()
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 677, in _process_engine_step
    outputs, model_executed = self.step_fn()
                              ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 267, in step
    model_output = self.execute_model_with_error_logging(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 253, in execute_model_with_error_logging
    raise err
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 244, in execute_model_with_error_logging
    return model_fn(scheduler_output)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 87, in execute_model
    output = self.collective_rpc("execute_model",
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 2985, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 337, in execute_model
    output = self.model_runner.execute_model(scheduler_output,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1366, in execute_model
    self._prepare_inputs(scheduler_output))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 786, in _prepare_inputs
    attn_metadata_i = (builder.build(
                       ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flashinfer.py", line 470, in build
    self._plan(num_prefills, num_decodes, attn_metadata)
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flashinfer.py", line 290, in _plan
    attn_metadata.cascade_wrapper.plan(
TypeError: MultiLevelCascadeAttentionWrapper.plan() got an unexpected keyword argument 'kv_data_type'
ERROR 07-29 11:19:43 [async_llm.py:416] AsyncLLM output_handler failed.
ERROR 07-29 11:19:43 [async_llm.py:416] Traceback (most recent call last):
ERROR 07-29 11:19:43 [async_llm.py:416]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 375, in output_handler
ERROR 07-29 11:19:43 [async_llm.py:416]     outputs = await engine_core.get_output_async()
ERROR 07-29 11:19:43 [async_llm.py:416]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-29 11:19:43 [async_llm.py:416]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 751, in get_output_async
ERROR 07-29 11:19:43 [async_llm.py:416]     raise self._format_exception(outputs) from None
ERROR 07-29 11:19:43 [async_llm.py:416] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
...
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [7]

I suspect the regression was introduced here:
https://github.com/vllm-project/vllm/blame/6d8d0a24c02bfd84d46b3016b865a44f048ae84b/vllm/v1/attention/backends/flashinfer.py#L313
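The failure mode can be reproduced in isolation; a minimal sketch, using a stub class standing in for FlashInfer's wrapper (not the real class), that mirrors the vLLM call site passing a keyword argument the callee does not declare:

```python
# Stub standing in for flashinfer.MultiLevelCascadeAttentionWrapper: its
# plan() signature, like the real one in cascade.py, has no kv_data_type
# parameter and no **kwargs catch-all.
class StubCascadeWrapper:
    def plan(self, page_size, q_data_type=None):
        return "planned"

wrapper = StubCascadeWrapper()
try:
    # Mirrors the call in vllm/v1/attention/backends/flashinfer.py:_plan,
    # which forwards kv_data_type unconditionally.
    wrapper.plan(page_size=16, kv_data_type="fp8_e4m3")
except TypeError as exc:
    print(exc)  # ... got an unexpected keyword argument 'kv_data_type'
```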

I cannot find any released version of FlashInfer in which MultiLevelCascadeAttentionWrapper.plan() accepts a kv_data_type keyword argument:

https://github.com/flashinfer-ai/flashinfer/blob/8608c3755024d617566cd2e9f4617f30ca03f8ff/flashinfer/cascade.py#L402
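Until the call site is fixed, one defensive pattern is to introspect the callee's signature and drop unsupported keyword arguments before calling. This is a hypothetical caller-side shim (`call_with_supported_kwargs` is not vLLM or FlashInfer code), sketched to show how a `plan()` call could survive across FlashInfer versions that do or do not declare `kv_data_type`:

```python
import inspect

def call_with_supported_kwargs(fn, *args, **kwargs):
    """Call fn, silently dropping kwargs its signature does not declare.

    If fn takes **kwargs itself, everything is forwarded unchanged.
    """
    params = inspect.signature(fn).parameters
    has_var_kw = any(p.kind is inspect.Parameter.VAR_KEYWORD
                     for p in params.values())
    if not has_var_kw:
        kwargs = {k: v for k, v in kwargs.items() if k in params}
    return fn(*args, **kwargs)

# Stub with the same shape as the problematic callee: no kv_data_type param.
class StubCascadeWrapper:
    def plan(self, page_size):
        return f"planned(page_size={page_size})"

w = StubCascadeWrapper()
# A direct w.plan(page_size=16, kv_data_type="fp8") call would raise
# TypeError; the shim drops the unsupported kwarg instead.
print(call_with_supported_kwargs(w.plan, page_size=16, kv_data_type="fp8"))
# -> planned(page_size=16)
```

A cleaner long-term fix is of course a version check in vLLM itself, but signature introspection avoids hard-coding which FlashInfer release added which parameter.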

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Metadata

Labels: bug (Something isn't working)