Commit 074e448

add docs and expansion test.
Signed-off-by: LiuBingyu <liubingyu62@gmail.com>
1 parent 6970ccb

2 files changed: 179 additions & 160 deletions

File: docs/configuration/pd_disaggregation.md (112 additions & 128 deletions)

PD disaggregation splits the Qwen3-Omni thinker into separate prefill and decode
stages so prompt processing and token generation can run on different workers.

After the config refactor, PD is no longer launched from a separate legacy
`stage_configs/*.yaml` file. Instead, it is enabled from the deploy config via
the `pd_separation` section in
[`vllm_omni/deploy/qwen3_omni_moe.yaml`](gh-file:vllm_omni/deploy/qwen3_omni_moe.yaml).

## Current Config-Based Flow

At runtime, the config system does the following when
`pd_separation.enabled: true`:

1. Load the normal 3-stage Qwen3-Omni pipeline and deploy config.
2. Dynamically split the thinker into:
   - stage `0`: thinker prefill
   - stage `1`: thinker decode
3. Shift the downstream stages by one index:
   - talker: `1 -> 2`
   - code2wav: `2 -> 3`
4. Inject `is_prefill_only`, `is_decode_only`, and `kv_transfer_config` into the
   resolved runtime stage configs.
5. Reuse the existing PD detection / routing logic in the engine.

So the user-facing deploy file stays single-source, but the resolved runtime
config becomes a 4-stage PD pipeline.
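
To make the expansion concrete, the resolved runtime pipeline looks roughly
like the sketch below. This is only an illustration of the flow above, not a
file you write; the exact layout of the merged config is internal to the
config system.

```yaml
# Illustrative resolved stage list after PD expansion (shape assumed):
- stage_id: 0            # thinker prefill
  is_prefill_only: true
  kv_transfer_config: {kv_role: kv_producer}
- stage_id: 1            # thinker decode
  is_decode_only: true
  kv_transfer_config: {kv_role: kv_consumer}
- stage_id: 2            # talker, shifted from 1
- stage_id: 3            # code2wav, shifted from 2
```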

## Requirements

- 3+ GPUs for the common layout:
  - prefill on GPU `0`
  - decode on GPU `1`
  - talker + code2wav on GPU `2`
- A KV connector supported by vLLM, such as `MooncakeConnector`
- Matching `tensor_parallel_size` on the prefill and decode thinker stages
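
A quick way to confirm that the three-GPU layout is actually available before
enabling PD (assuming an NVIDIA machine with `nvidia-smi` on the path):

```bash
# Expect at least GPU indices 0, 1, and 2 in the output.
nvidia-smi --query-gpu=index,name,memory.total --format=csv
```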

## How to Enable PD

PD is enabled from the existing bundled deploy config:

- `vllm_omni/deploy/qwen3_omni_moe.yaml`

No additional user-facing YAML is required. The intent of the config refactor
is to keep Qwen3-Omni on a single deploy config and switch PD on through the
`pd_separation` section in that file.

Edit `vllm_omni/deploy/qwen3_omni_moe.yaml` and enable or tune:

```yaml
pd_separation:
  enabled: true
  async_chunk: false
  target_stage_id: 0
  stages:
    - role: prefill
      max_num_seqs: 16
      devices: "0"
      tensor_parallel_size: 1
      engine_extras:
        kv_transfer_config:
          kv_connector: "MooncakeConnector"
          kv_role: "kv_producer"
          kv_rank: 0
          kv_parallel_size: 2
          kv_connector_extra_config:
            mooncake_bootstrap_port: 25201
    - role: decode
      max_num_seqs: 64
      devices: "1"
      tensor_parallel_size: 1
      engine_extras:
        kv_transfer_config:
          kv_connector: "MooncakeConnector"
          kv_role: "kv_consumer"
          kv_rank: 1
          kv_parallel_size: 2
          kv_connector_extra_config:
            mooncake_bootstrap_port: 25202

stages:
  - stage_id: 1
    devices: "2"
  - stage_id: 2
    devices: "2"
```

Notes:

- `target_stage_id: 0` means the original thinker is the stage being split.
- `async_chunk: false` matches the current PD path.
- The top-level `stages` overrides above keep the common 3-GPU layout:
  - the original talker (`stage_id: 1`) stays on GPU `2`
  - the original code2wav (`stage_id: 2`) stays on GPU `2`
- After PD expansion, these become runtime stages `2` and `3`.
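
If you scale up the thinker, keep `tensor_parallel_size` identical on the
prefill and decode roles, as the requirements above state. A hypothetical
two-way-TP variant (illustrative device indices, not a layout tested here):

```yaml
pd_separation:
  enabled: true
  target_stage_id: 0
  stages:
    - role: prefill
      devices: "0,1"
      tensor_parallel_size: 2   # must match the decode role below
    - role: decode
      devices: "2,3"
      tensor_parallel_size: 2
```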

## Launching with Config-Based PD

```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
  --deploy-config vllm_omni/deploy/qwen3_omni_moe.yaml
```

If you edit the bundled deploy file in place, the explicit `--deploy-config`
flag is optional as long as the runtime resolves the default deploy config for
the model.
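
Once the server is up, any OpenAI-compatible request exercises the expanded
4-stage pipeline end to end. A minimal smoke check (assuming the standard
OpenAI-compatible endpoint that `vllm serve` exposes):

```bash
curl -s http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
        "messages": [{"role": "user", "content": "Say hello."}]
      }'
```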

## Tests

At the moment, the PD-aware tests are these three files:

- `tests/e2e/online_serving/test_qwen3_omni.py`
- `tests/e2e/online_serving/test_qwen3_omni_expansion.py`
- `tests/entrypoints/test_pd_disaggregation.py`

### 1. Online serving E2E

Both online-serving test files use `VLLM_TEST_PD_MODE=1` to switch from the
default 2-GPU config to the PD deploy overlay and 3-GPU layout.

Run `test_qwen3_omni.py`:

```bash
VLLM_TEST_PD_MODE=1 CUDA_VISIBLE_DEVICES=0,1,2 \
  pytest tests/e2e/online_serving/test_qwen3_omni.py -q
```

Run `test_qwen3_omni_expansion.py`:

```bash
VLLM_TEST_PD_MODE=1 CUDA_VISIBLE_DEVICES=0,1,2 \
  pytest tests/e2e/online_serving/test_qwen3_omni_expansion.py -q
```

### 2. PD unit / entrypoint coverage

`test_pd_disaggregation.py` no longer requires the old PD YAML. It builds a
temporary deploy overlay inside the test process only, enables
`pd_separation`, and then verifies that the merged runtime config becomes a
valid 4-stage PD pipeline. The temporary file is a test helper, not a
user-facing config artifact.

```bash
pytest tests/entrypoints/test_pd_disaggregation.py -q
```
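
For orientation, the temporary overlay that test builds would be roughly of
the following shape. This is a hypothetical sketch inferred from the
description above; the exact keys the test helper writes are not shown in this
commit.

```yaml
# Hypothetical test-only overlay (never shipped as a user-facing config):
pd_separation:
  enabled: true
  target_stage_id: 0
  stages:
    - role: prefill
    - role: decode
```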

## Operational Notes
