PD disaggregation splits the Qwen3-Omni thinker into separate prefill and decode
stages so prompt processing and token generation can run on different workers.

After the config refactor, PD is no longer launched from a separate legacy
`stage_configs/*.yaml` file. Instead, it is enabled from the deploy config via
the `pd_separation` section in
[`vllm_omni/deploy/qwen3_omni_moe.yaml`](gh-file:vllm_omni/deploy/qwen3_omni_moe.yaml).

## Current Config-Based Flow

At runtime, the config system does the following when
`pd_separation.enabled: true`:

1. Load the normal 3-stage Qwen3-Omni pipeline and deploy config.
2. Dynamically split the thinker into:
    - stage `0`: thinker prefill
    - stage `1`: thinker decode
3. Shift the downstream stages by one index:
    - talker: `1 -> 2`
    - code2wav: `2 -> 3`
4. Inject `is_prefill_only`, `is_decode_only`, and `kv_transfer_config` into the
   resolved runtime stage configs.
5. Reuse the existing PD detection / routing logic in the engine.

So the user-facing deploy file stays single-source, but the resolved runtime
config becomes a 4-stage PD pipeline.
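
The expansion in steps 2-4 is essentially a data transformation over the stage
list. The sketch below illustrates it with a hypothetical helper
(`expand_pd_stages`) and simplified stage dicts; the real resolver in
`vllm_omni` carries more fields and validation.

```python
# Illustrative sketch only: hypothetical helper and simplified stage dicts,
# not the actual vllm_omni resolver.
from copy import deepcopy


def expand_pd_stages(stages: list[dict], pd_cfg: dict) -> list[dict]:
    """Split the target stage into prefill/decode and reindex the rest."""
    if not pd_cfg.get("enabled"):
        return stages

    target = pd_cfg.get("target_stage_id", 0)
    overrides = {s["role"]: s for s in pd_cfg["stages"]}

    def make(role: str, flag: str) -> dict:
        stage = deepcopy(stages[target])
        stage[flag] = True  # is_prefill_only / is_decode_only
        stage["devices"] = overrides[role]["devices"]
        # engine_extras carries kv_transfer_config and similar engine fields.
        stage.update(overrides[role].get("engine_extras", {}))
        return stage

    expanded = (
        stages[:target]
        + [make("prefill", "is_prefill_only"), make("decode", "is_decode_only")]
        + deepcopy(stages[target + 1:])  # talker and code2wav shift by one
    )
    for i, stage in enumerate(expanded):  # renumber stage_id 0..N-1
        stage["stage_id"] = i
    return expanded
```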

## Requirements

- 3+ GPUs for the common layout:
    - prefill on GPU `0`
    - decode on GPU `1`
    - talker + code2wav on GPU `2`
- A KV connector supported by vLLM, such as `MooncakeConnector`
- Matching `tensor_parallel_size` on the prefill and decode thinker stages

## How to Enable PD

PD is enabled from the existing bundled deploy config:

- `vllm_omni/deploy/qwen3_omni_moe.yaml`

No additional user-facing YAML is required. The intent of the config refactor
is to keep Qwen3-Omni on a single deploy config and switch PD on through the
`pd_separation` section in that file.

Edit `vllm_omni/deploy/qwen3_omni_moe.yaml` and enable or tune the following:

```yaml
pd_separation:
  enabled: true
  async_chunk: false
  target_stage_id: 0
  stages:
    - role: prefill
      max_num_seqs: 16
      devices: "0"
      tensor_parallel_size: 1
      engine_extras:
        kv_transfer_config:
          kv_connector: "MooncakeConnector"
          kv_role: "kv_producer"
          kv_rank: 0
          kv_parallel_size: 2
          kv_connector_extra_config:
            mooncake_bootstrap_port: 25201
    - role: decode
      max_num_seqs: 64
      devices: "1"
      tensor_parallel_size: 1
      engine_extras:
        kv_transfer_config:
          kv_connector: "MooncakeConnector"
          kv_role: "kv_consumer"
          kv_rank: 1
          kv_parallel_size: 2
          kv_connector_extra_config:
            mooncake_bootstrap_port: 25202

stages:
  - stage_id: 1
    devices: "2"
  - stage_id: 2
    devices: "2"
```

Notes:

- `target_stage_id: 0` means the original thinker is the stage being split.
- `async_chunk: false` matches the current PD path.
- The top-level `stages` overrides above keep the common 3-GPU layout:
    - the original talker (`stage_id: 1`) stays on GPU `2`
    - the original code2wav (`stage_id: 2`) stays on GPU `2`
- After PD expansion, these become runtime stage `2` and stage `3`.
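
Putting the notes together, the resolved runtime pipeline looks roughly like
this (a simplified snapshot for illustration; the real resolved config carries
the full engine arguments):

```python
# Simplified snapshot of the resolved 4-stage PD pipeline (illustrative only).
resolved_stages = [
    {"stage_id": 0, "model_stage": "thinker", "is_prefill_only": True,
     "devices": "0", "kv_role": "kv_producer"},
    {"stage_id": 1, "model_stage": "thinker", "is_decode_only": True,
     "devices": "1", "kv_role": "kv_consumer"},
    {"stage_id": 2, "model_stage": "talker", "devices": "2"},
    {"stage_id": 3, "model_stage": "code2wav", "devices": "2"},
]
```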

## Launching with Config-Based PD

```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
  --deploy-config vllm_omni/deploy/qwen3_omni_moe.yaml
```

If you edit the bundled deploy file in place, the explicit `--deploy-config`
flag is optional as long as the runtime resolves the default deploy config for
the model.

## Tests

At the moment, the PD-aware tests live in these three files:

- `tests/e2e/online_serving/test_qwen3_omni.py`
- `tests/e2e/online_serving/test_qwen3_omni_expansion.py`
- `tests/entrypoints/test_pd_disaggregation.py`

### 1. Online serving E2E

Both online-serving test files use `VLLM_TEST_PD_MODE=1` to switch from the
default 2-GPU config to the PD deploy overlay and 3-GPU layout.

Run `test_qwen3_omni.py`:

```bash
VLLM_TEST_PD_MODE=1 CUDA_VISIBLE_DEVICES=0,1,2 \
pytest tests/e2e/online_serving/test_qwen3_omni.py -q
```

Run `test_qwen3_omni_expansion.py`:

```bash
VLLM_TEST_PD_MODE=1 CUDA_VISIBLE_DEVICES=0,1,2 \
pytest tests/e2e/online_serving/test_qwen3_omni_expansion.py -q
```

### 2. PD unit / entrypoint coverage

`test_pd_disaggregation.py` no longer requires the old PD YAML. It builds a
temporary deploy overlay inside the test process, enables `pd_separation`, and
then verifies that the merged runtime config becomes a valid 4-stage PD
pipeline. The temporary file is a test helper, not a user-facing config
artifact.
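
The shape of that check can be sketched with the toy `expand_pd_stages` helper
from the flow section above (the temp-file plumbing and the real config API are
elided, so this is illustrative, not the actual test):

```python
# Toy version of the PD expansion check (illustrative; the real test drives
# the actual config system and writes a temporary overlay file).
def test_pd_expansion_shape():
    overlay = {
        "enabled": True,
        "target_stage_id": 0,
        "stages": [
            {"role": "prefill", "devices": "0"},
            {"role": "decode", "devices": "1"},
        ],
    }
    base = [
        {"stage_id": 0, "model_stage": "thinker"},
        {"stage_id": 1, "model_stage": "talker", "devices": "2"},
        {"stage_id": 2, "model_stage": "code2wav", "devices": "2"},
    ]
    resolved = expand_pd_stages(base, overlay)

    assert [s["stage_id"] for s in resolved] == [0, 1, 2, 3]
    assert resolved[0]["is_prefill_only"] and resolved[1]["is_decode_only"]
    assert resolved[2]["model_stage"] == "talker"
```

Run the real test with: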

```bash
pytest tests/entrypoints/test_pd_disaggregation.py -q
```

## Operational Notes