Your current environment
(output of `python collect_env.py` not provided)
🐛 Describe the bug
B200 machine.
Current top-of-tree (ToT) performance:
```
Maximum request concurrency: 512
============ Serving Benchmark Result ============
Successful requests:           2560
Benchmark duration (s):        241.27
Total input tokens:            2621440
Total generated tokens:        2621440
Request throughput (req/s):    10.61
Output token throughput (tok/s): 10865.38
Total Token throughput (tok/s):  21730.75
---------------Time to First Token----------------
Mean TTFT (ms):                938.20
Median TTFT (ms):              416.17
P99 TTFT (ms):                 6024.36
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                43.62
Median TPOT (ms):              42.69
P99 TPOT (ms):                 65.47
---------------Inter-token Latency----------------
Mean ITL (ms):                 903.22
Median ITL (ms):               870.80
P99 ITL (ms):                  1903.97
```
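As a sanity check on the numbers above, the reported throughputs follow directly from the token counts and benchmark duration (up to rounding of the printed duration). A minimal sketch, using only values copied from the result block:

```python
# Values copied from the Serving Benchmark Result block above.
total_generated_tokens = 2_621_440
benchmark_duration_s = 241.27
successful_requests = 2560

# Throughput is tokens (or requests) divided by wall-clock duration.
output_tok_per_s = total_generated_tokens / benchmark_duration_s  # ≈ 10865 tok/s
req_per_s = successful_requests / benchmark_duration_s            # ≈ 10.61 req/s

print(f"Output token throughput: {output_tok_per_s:.2f} tok/s")
print(f"Request throughput: {req_per_s:.2f} req/s")
```

The small difference from the reported 10865.38 tok/s comes from the duration being rounded to two decimals in the printed output.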
We bisected the regression to PR #27922.
Before PR #27922, this setup reached 17096.80 tok/s output token throughput.
@njhill
Benjamin told me you already had the fix: #29542.
I tried PR #29542, but it still does not recover the 17096.80 tok/s; it only reached 11986.50 tok/s.
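To put the gap in perspective, a quick calculation of the regression sizes from the three throughput numbers quoted above:

```python
# Throughput numbers quoted in this issue (tok/s).
baseline = 17096.80   # before PR #27922
tot = 10865.38        # current ToT
with_fix = 11986.50   # with PR #29542 applied

tot_regression = (baseline - tot) / baseline          # ≈ 36.4% below baseline
remaining_gap = (baseline - with_fix) / baseline      # ≈ 29.9% below baseline

print(f"ToT regression vs baseline: {tot_regression:.1%}")
print(f"Remaining gap with PR #29542: {remaining_gap:.1%}")
```

So the proposed fix only recovers a small fraction of the lost throughput.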
Repro command

Server-side:

```shell
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
python3 -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 --port 8087 \
  --model openai/gpt-oss-120b --tokenizer openai/gpt-oss-120b \
  --dtype auto --kv-cache-dtype fp8 \
  --tensor-parallel-size 1 --pipeline-parallel-size 1 --data-parallel-size 1 \
  --swap-space 16 --max-num-seqs 1024 --trust-remote-code \
  --max-model-len 2058 --gpu-memory-utilization 0.9 \
  --max-num-batched-tokens 8192 --no-enable-prefix-caching --async-scheduling \
  --compilation_config.pass_config.enable_fi_allreduce_fusion true \
  --compilation_config.pass_config.enable_noop true \
  --compilation_config.max_cudagraph_capture_size 2048 \
  --speculative_config.method eagle3 \
  --speculative_config.model nvidia/gpt-oss-120b-Eagle3 \
  --speculative_config.num_speculative_tokens 3
```
Client-side:

```shell
python3 benchmark_serving.py --backend vllm --host 0.0.0.0 --port 8087 \
  --model openai/gpt-oss-120b --num-prompts 2560 --trust-remote-code \
  --ignore-eos --max-concurrency 512 \
  --random-input-len 1024 --random-output-len 1024 --random-range-ratio 1.0 \
  --use-chat-template --dataset-name random \
  --save-result --result-filename 1128_benchmark_serving_results.json
```
Note: `benchmark_serving.py` is from the following repo:

```shell
git clone https://github.com/kimbochen/bench_serving.git
pip install pandas datasets --break-system-packages
```
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.