Your current environment
(output of `python collect_env.py` not provided)
🐛 Describe the bug
B200 machine.
Current top-of-tree (ToT) performance:
```
Maximum request concurrency: 512
============ Serving Benchmark Result ============
Successful requests:           2560
Benchmark duration (s):        241.27
Total input tokens:            2621440
Total generated tokens:        2621440
Request throughput (req/s):    10.61
Output token throughput (tok/s): 10865.38
Total Token throughput (tok/s):  21730.75
---------------Time to First Token----------------
Mean TTFT (ms):                938.20
Median TTFT (ms):              416.17
P99 TTFT (ms):                 6024.36
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                43.62
Median TPOT (ms):              42.69
P99 TPOT (ms):                 65.47
---------------Inter-token Latency----------------
Mean ITL (ms):                 903.22
Median ITL (ms):               870.80
P99 ITL (ms):                  1903.97
```
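As a sanity check on the numbers above, the reported throughputs follow directly from the token counts and benchmark duration (up to rounding of the printed duration). A minimal sketch, using only values copied from the result block:

```python
# Values copied from the Serving Benchmark Result block above.
total_generated_tokens = 2_621_440
benchmark_duration_s = 241.27
successful_requests = 2560

# Throughput is tokens (or requests) divided by wall-clock duration.
output_tok_per_s = total_generated_tokens / benchmark_duration_s  # ≈ 10865 tok/s
req_per_s = successful_requests / benchmark_duration_s            # ≈ 10.61 req/s

print(f"Output token throughput: {output_tok_per_s:.2f} tok/s")
print(f"Request throughput: {req_per_s:.2f} req/s")
```

The small difference from the reported 10865.38 tok/s comes from the duration being rounded to two decimals in the printed output.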
We bisected the regression to PR #27922.
Before PR #27922, this setup reached 17096.80 tok/s output token throughput.
@njhill
Benjamin told me you already had the fix: #29542.
I tried PR #29542, but it still does not recover the 17096.80 tok/s; it only reached 11986.50 tok/s.
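To put the gap in perspective, a quick calculation of the regression sizes from the three throughput numbers quoted above:

```python
# Throughput numbers quoted in this issue (tok/s).
baseline = 17096.80   # before PR #27922
tot = 10865.38        # current ToT
with_fix = 11986.50   # with PR #29542 applied

tot_regression = (baseline - tot) / baseline          # ≈ 36.4% below baseline
remaining_gap = (baseline - with_fix) / baseline      # ≈ 29.9% below baseline

print(f"ToT regression vs baseline: {tot_regression:.1%}")
print(f"Remaining gap with PR #29542: {remaining_gap:.1%}")
```

So the proposed fix only recovers a small fraction of the lost throughput.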
Repro command

Server-side:

```shell
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
python3 -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 --port 8087 \
  --model openai/gpt-oss-120b --tokenizer openai/gpt-oss-120b \
  --dtype auto --kv-cache-dtype fp8 \
  --tensor-parallel-size 1 --pipeline-parallel-size 1 --data-parallel-size 1 \
  --swap-space 16 --max-num-seqs 1024 --trust-remote-code \
  --max-model-len 2058 --gpu-memory-utilization 0.9 \
  --max-num-batched-tokens 8192 --no-enable-prefix-caching --async-scheduling \
  --compilation_config.pass_config.enable_fi_allreduce_fusion true \
  --compilation_config.pass_config.enable_noop true \
  --compilation_config.max_cudagraph_capture_size 2048 \
  --speculative_config.method eagle3 \
  --speculative_config.model nvidia/gpt-oss-120b-Eagle3 \
  --speculative_config.num_speculative_tokens 3
```
Client-side:

```shell
python3 benchmark_serving.py --backend vllm --host 0.0.0.0 --port 8087 \
  --model openai/gpt-oss-120b --num-prompts 2560 --trust-remote-code \
  --ignore-eos --max-concurrency 512 \
  --random-input-len 1024 --random-output-len 1024 --random-range-ratio 1.0 \
  --use-chat-template --dataset-name random \
  --save-result --result-filename 1128_benchmark_serving_results.json
```
Note: `benchmark_serving.py` is from the following repo:

```shell
git clone https://github.com/kimbochen/bench_serving.git
pip install pandas datasets --break-system-packages
```
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.