[Bug]: Add speculative decoding metrics for HPU #1921

@sungwook-son

Description

Your current environment

In the code below, is_cuda_alike() evaluates to False on HPU, so the function returns None before any metrics are collected.
As a result, speculative decoding metrics are silently disabled on HPU.

def maybe_collect_rejsample_metrics(
        self, k: int) -> Optional[SpecDecodeWorkerMetrics]:
    # Skip for any platform that doesn't have device Event
    if current_platform.Event is None:
        return None

    if not current_platform.is_cuda_alike():
        return None

    # If a copy was initiated in the previous call, collect and return.
    if self._in_flight_copy is not None:
        ready_event = self._in_flight_copy
        self._in_flight_copy = None
        return self._collect_rejsample_metrics(k, ready_event)

    # Otherwise, check if we should start a new copy.
    if self._should_collect_rejsample_metrics(self._timer()):
        assert self._in_flight_copy is None
        self._in_flight_copy = self._copy_rejsample_metrics_async()

    return None
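A minimal, self-contained sketch of the gating logic above, using a stub in place of vllm's current_platform (the stub class and the metrics_enabled helper are illustrative, not vllm's API). It shows how widening the platform check to also accept HPU would let the function proceed past the early return, which is one possible direction for a fix:

```python
# Illustrative stand-in for vllm's current_platform; names here are
# assumptions for the sketch, not the real vllm Platform interface.
from typing import Optional


class StubPlatform:
    def __init__(self, device_type: str, has_event: bool = True):
        self.device_type = device_type
        # maybe_collect_rejsample_metrics first checks `Event is None`.
        self.Event = object() if has_event else None

    def is_cuda_alike(self) -> bool:
        return self.device_type in ("cuda", "rocm")

    def is_hpu(self) -> bool:
        return self.device_type == "hpu"


def metrics_enabled(platform: StubPlatform) -> Optional[bool]:
    """Mirrors the early-return checks in maybe_collect_rejsample_metrics.

    Returning None means metrics are silently disabled, which is what
    happens on HPU today; the widened check below would let HPU reach
    the collection path.
    """
    if platform.Event is None:
        return None
    # Proposed widening: accept HPU in addition to CUDA-alike platforms.
    if not (platform.is_cuda_alike() or platform.is_hpu()):
        return None
    return True


print(metrics_enabled(StubPlatform("hpu")))  # proceeds with the widened check
print(metrics_enabled(StubPlatform("cpu")))  # still disabled
```

Whether the HPU backend's Event and async-copy machinery behave like CUDA's (so the downstream _copy_rejsample_metrics_async path works) would need to be verified separately.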

🐛 Describe the bug

export VLLM_CONTIGUOUS_PA=false
export VLLM_SKIP_WARMUP=true
export PT_HPU_LAZY_MODE=1
export PT_HPU_ENABLE_LAZY_COLLECTIVES=true

python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 8000 \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --seed 42 -tp 4 \
    --speculative_config '{"model": "meta-llama/Llama-3.1-8B-Instruct", "num_speculative_tokens": 5, "target_parallel_config": 1}' \
    --gpu_memory_utilization 0.95 --max-model-len 16384
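One way to observe the symptom after starting the server above: query the Prometheus endpoint and look for the spec-decode metric family. The host/port match the repro command; the grep pattern is an assumption about the metric naming, not an exhaustive list.

```shell
# After sending a few completions, spec-decode metrics should appear in
# /metrics on CUDA; on HPU they are absent because the collection path
# returns None early.
curl -s --max-time 2 http://localhost:8000/metrics | grep -i spec_decode \
  || echo "no spec_decode metrics exported"
```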

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
