### Your current environment
In the lines referenced below, `current_platform.is_cuda_alike()` evaluates to False on HPU (the repro below sets the `PT_HPU_*` variables), so `maybe_collect_rejsample_metrics()` returns None early. As a result, speculative decoding metrics are silently disabled.
vllm-fork/vllm/spec_decode/metrics.py, lines 99 to 119 at bef3660:

```python
def maybe_collect_rejsample_metrics(
        self, k: int) -> Optional[SpecDecodeWorkerMetrics]:
    # Skip for any platform that doesn't have device Event
    if current_platform.Event is None:
        return None

    if not current_platform.is_cuda_alike():
        return None

    # If a copy was initiated in the previous call, collect and return.
    if self._in_flight_copy is not None:
        ready_event = self._in_flight_copy
        self._in_flight_copy = None
        return self._collect_rejsample_metrics(k, ready_event)

    # Otherwise, check if we should start a new copy.
    if self._should_collect_rejsample_metrics(self._timer()):
        assert self._in_flight_copy is None
        self._in_flight_copy = self._copy_rejsample_metrics_async()

    return None
```
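Since the first guard already skips any platform that lacks a device Event, one possible direction (a sketch only, not a proposed patch) would be to let HPU past the second check as well. The `is_hpu()` call below is an assumption about the platform interface and should be verified against the fork:

```python
# Sketch: relax the CUDA/ROCm-only guard in maybe_collect_rejsample_metrics().
# Assumption: current_platform exposes is_hpu(); verify against the fork.
if current_platform.Event is None:
    return None  # no device Event: metrics genuinely unsupported
if not (current_platform.is_cuda_alike() or current_platform.is_hpu()):
    return None  # previously, only CUDA/ROCm platforms passed this check
```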
### 🐛 Describe the bug
```bash
export VLLM_CONTIGUOUS_PA=false
export VLLM_SKIP_WARMUP=true
export PT_HPU_LAZY_MODE=1
export PT_HPU_ENABLE_LAZY_COLLECTIVES=true

python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 8000 \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --seed 42 -tp 4 \
    --speculative_config '{"model": "meta-llama/Llama-3.1-8B-Instruct", "num_speculative_tokens": 5, "target_parallel_config": 1}' \
    --gpu_memory_utilization 0.95 --max-model-len 16384
```
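To confirm that the early return is the culprit, the two guard conditions can be inspected directly in the serving environment (assuming `vllm` is importable there):

```python
# Run inside the same environment that serves the model.
from vllm.platforms import current_platform

# On HPU this is expected to print False, which triggers the early
# `return None` and silently disables speculative decoding metrics.
print(current_platform.is_cuda_alike())

# The first guard passes only when the platform provides a device Event.
print(current_platform.Event)
```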
### Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.