PD disaggregation splits the Qwen3-Omni thinker into separate prefill and decode
stages so prompt processing and token generation can run on different workers.

After the config refactor, PD is no longer launched from a separate legacy
`stage_configs/*.yaml` file. Instead, it is enabled from the deploy config via
the `pd_separation` section in
[`vllm_omni/deploy/qwen3_omni_moe.yaml`](gh-file:vllm_omni/deploy/qwen3_omni_moe.yaml).

## Current Config-Based Flow

At runtime, the config system does the following when
`pd_separation.enabled: true`:

1. Load the normal 3-stage Qwen3-Omni pipeline and deploy config.
2. Dynamically split the thinker into:
    - stage `0`: thinker prefill
    - stage `1`: thinker decode
3. Shift the downstream stages by one index:
    - talker: `1 -> 2`
    - code2wav: `2 -> 3`
4. Inject `is_prefill_only`, `is_decode_only`, and `kv_transfer_config` into the
   resolved runtime stage configs.
5. Reuse the existing PD detection / routing logic in the engine.

So the user-facing deploy file stays single-source, but the resolved runtime
config becomes a 4-stage PD pipeline.
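
The expansion in steps 2-4 is essentially a data transformation over the stage
list. The sketch below illustrates it with a hypothetical helper
(`expand_pd_stages`) and simplified stage dicts; the real resolver in
`vllm_omni` carries more fields and validation.

```python
# Illustrative sketch only: hypothetical helper and simplified stage dicts,
# not the actual vllm_omni resolver.
from copy import deepcopy


def expand_pd_stages(stages: list[dict], pd_cfg: dict) -> list[dict]:
    """Split the target stage into prefill/decode and reindex the rest."""
    if not pd_cfg.get("enabled"):
        return stages

    target = pd_cfg.get("target_stage_id", 0)
    overrides = {s["role"]: s for s in pd_cfg["stages"]}

    def make(role: str, flag: str) -> dict:
        stage = deepcopy(stages[target])
        stage[flag] = True  # is_prefill_only / is_decode_only
        stage["devices"] = overrides[role]["devices"]
        # engine_extras carries kv_transfer_config and similar engine fields.
        stage.update(overrides[role].get("engine_extras", {}))
        return stage

    expanded = (
        stages[:target]
        + [make("prefill", "is_prefill_only"), make("decode", "is_decode_only")]
        + deepcopy(stages[target + 1:])  # talker and code2wav shift by one
    )
    for i, stage in enumerate(expanded):  # renumber stage_id 0..N-1
        stage["stage_id"] = i
    return expanded
```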

## Requirements

- 3+ GPUs for the common layout:
    - prefill on GPU `0`
    - decode on GPU `1`
    - talker + code2wav on GPU `2`
- A KV connector supported by vLLM, such as `MooncakeConnector`
- Matching `tensor_parallel_size` on the prefill and decode thinker stages

## How to Enable PD

PD is enabled from the existing bundled deploy config:

- `vllm_omni/deploy/qwen3_omni_moe.yaml`

No additional user-facing YAML is required. The intent of the config refactor
is to keep Qwen3-Omni on a single deploy config and switch PD on through the
`pd_separation` section in that file.

Edit `vllm_omni/deploy/qwen3_omni_moe.yaml` and enable or tune the following:

```yaml
pd_separation:
  enabled: true
  async_chunk: false
  target_stage_id: 0
  stages:
    - role: prefill
      max_num_seqs: 16
      devices: "0"
      tensor_parallel_size: 1
      engine_extras:
        kv_transfer_config:
          kv_connector: "MooncakeConnector"
          kv_role: "kv_producer"
          kv_rank: 0
          kv_parallel_size: 2
          kv_connector_extra_config:
            mooncake_bootstrap_port: 25201
    - role: decode
      max_num_seqs: 64
      devices: "1"
      tensor_parallel_size: 1
      engine_extras:
        kv_transfer_config:
          kv_connector: "MooncakeConnector"
          kv_role: "kv_consumer"
          kv_rank: 1
          kv_parallel_size: 2
          kv_connector_extra_config:
            mooncake_bootstrap_port: 25202

stages:
  - stage_id: 1
    devices: "2"
  - stage_id: 2
    devices: "2"
```

Notes:

- `target_stage_id: 0` means the original thinker is the stage being split.
- `async_chunk: false` matches the current PD path.
- The top-level `stages` overrides above keep the common 3-GPU layout:
    - the original talker (`stage_id: 1`) stays on GPU `2`
    - the original code2wav (`stage_id: 2`) stays on GPU `2`
- After PD expansion, these become runtime stage `2` and stage `3`.
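
Putting the notes together, the resolved runtime pipeline looks roughly like
this (a simplified snapshot for illustration; the real resolved config carries
the full engine arguments):

```python
# Simplified snapshot of the resolved 4-stage PD pipeline (illustrative only).
resolved_stages = [
    {"stage_id": 0, "model_stage": "thinker", "is_prefill_only": True,
     "devices": "0", "kv_role": "kv_producer"},
    {"stage_id": 1, "model_stage": "thinker", "is_decode_only": True,
     "devices": "1", "kv_role": "kv_consumer"},
    {"stage_id": 2, "model_stage": "talker", "devices": "2"},
    {"stage_id": 3, "model_stage": "code2wav", "devices": "2"},
]
```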

## Launching with Config-Based PD

```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
  --deploy-config vllm_omni/deploy/qwen3_omni_moe.yaml
```

If you edit the bundled deploy file in place, the explicit `--deploy-config`
flag is optional as long as the runtime resolves the default deploy config for
the model.

## Tests

At the moment, the PD-aware tests live in these three files:

- `tests/e2e/online_serving/test_qwen3_omni.py`
- `tests/e2e/online_serving/test_qwen3_omni_expansion.py`
- `tests/entrypoints/test_pd_disaggregation.py`

### 1. Online serving E2E

Both online-serving test files use `VLLM_TEST_PD_MODE=1` to switch from the
default 2-GPU config to the PD deploy overlay and 3-GPU layout.

Run `test_qwen3_omni.py`:

```bash
VLLM_TEST_PD_MODE=1 CUDA_VISIBLE_DEVICES=0,1,2 \
pytest tests/e2e/online_serving/test_qwen3_omni.py -q
```

Run `test_qwen3_omni_expansion.py`:

```bash
VLLM_TEST_PD_MODE=1 CUDA_VISIBLE_DEVICES=0,1,2 \
pytest tests/e2e/online_serving/test_qwen3_omni_expansion.py -q
```

### 2. PD unit / entrypoint coverage

`test_pd_disaggregation.py` no longer requires the old PD YAML. It builds a
temporary deploy overlay inside the test process, enables `pd_separation`, and
then verifies that the merged runtime config becomes a valid 4-stage PD
pipeline. The temporary file is a test helper, not a user-facing config
artifact.
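
The shape of that check can be sketched with the toy `expand_pd_stages` helper
from the flow section above (the temp-file plumbing and the real config API are
elided, so this is illustrative, not the actual test):

```python
# Toy version of the PD expansion check (illustrative; the real test drives
# the actual config system and writes a temporary overlay file).
def test_pd_expansion_shape():
    overlay = {
        "enabled": True,
        "target_stage_id": 0,
        "stages": [
            {"role": "prefill", "devices": "0"},
            {"role": "decode", "devices": "1"},
        ],
    }
    base = [
        {"stage_id": 0, "model_stage": "thinker"},
        {"stage_id": 1, "model_stage": "talker", "devices": "2"},
        {"stage_id": 2, "model_stage": "code2wav", "devices": "2"},
    ]
    resolved = expand_pd_stages(base, overlay)

    assert [s["stage_id"] for s in resolved] == [0, 1, 2, 3]
    assert resolved[0]["is_prefill_only"] and resolved[1]["is_decode_only"]
    assert resolved[2]["model_stage"] == "talker"
```

Run the real test with: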

```bash
pytest tests/entrypoints/test_pd_disaggregation.py -q
```

## Operational Notes