2 changes: 1 addition & 1 deletion .buildkite/test-nightly.yml
@@ -23,7 +23,7 @@ steps:
- image: 936637512419.dkr.ecr.us-west-2.amazonaws.com/vllm-ci-pull-through-cache/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
resources:
limits:
nvidia.com/gpu: 2
nvidia.com/gpu: 3
Collaborator
I suggest you add a separate label for the PD test rather than running all omni tests with 3 cards.


If you separate the job for the PD test, maybe you can use the job names `Omni · Function Test with 2 H100` and `Omni · Function Test with 3 H100`.

volumeMounts:
- name: devshm
mountPath: /dev/shm
2 changes: 1 addition & 1 deletion .buildkite/test-ready.yml
@@ -193,7 +193,7 @@ steps:
- image: 936637512419.dkr.ecr.us-west-2.amazonaws.com/vllm-ci-pull-through-cache/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
resources:
limits:
nvidia.com/gpu: 2
nvidia.com/gpu: 3
volumeMounts:
- name: devshm
mountPath: /dev/shm
290 changes: 164 additions & 126 deletions docs/configuration/pd_disaggregation.md
@@ -3,162 +3,200 @@
PD disaggregation splits the Qwen3-Omni thinker into separate prefill and decode
stages so prompt processing and token generation can run on different workers.

This is documented as a stage-config recipe instead of a bundled YAML because the
deployment-specific values usually change per environment:

- GPU placement
- `tensor_parallel_size`
- connector backend and connector ports
- connector IPs or bootstrap addresses

Start from the [default Qwen3-Omni stage config](gh-file:vllm_omni/deploy/qwen3_omni_moe.yaml)
and copy it to your own file, for example `qwen3_omni_pd.yaml`. Then apply the
changes below.
After the config refactor, PD is no longer launched from a separate legacy
`stage_configs/*.yaml` file. Instead, it is enabled from the deploy config via
the `pd_disaggregation` section in
[`vllm_omni/deploy/qwen3_omni_moe.yaml`](gh-file:vllm_omni/deploy/qwen3_omni_moe.yaml).

## Current Config-Based Flow

At runtime, the config system does the following when
`pd_disaggregation.enabled: true`:

1. Load the normal 3-stage Qwen3-Omni pipeline + deploy config.
2. Dynamically split the thinker into:
- stage `0`: thinker prefill
- stage `1`: thinker decode
3. Shift downstream stages by one index:
- talker: `1 -> 2`
- code2wav: `2 -> 3`
4. Inject `is_prefill_only`, `is_decode_only`, and `kv_transfer_config` into the
resolved runtime stage configs.
5. Reuse the existing PD detection / routing logic in the engine.

So the user-facing deploy file stays single-source, but the resolved runtime
config becomes a 4-stage PD pipeline.
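The split-and-shift step above can be sketched roughly as follows. This is an illustrative sketch only; `expand_pd_stages` and the dict fields are stand-ins, not the actual vllm_omni internals:

```python
import copy


def expand_pd_stages(stages: list[dict], target_stage_id: int = 0) -> list[dict]:
    """Sketch: split the target (thinker) stage into prefill + decode
    and shift every downstream stage index up by one."""
    thinker = stages[target_stage_id]

    prefill = copy.deepcopy(thinker)
    prefill["stage_id"] = target_stage_id
    prefill["is_prefill_only"] = True  # only processes the prompt and exports KV

    decode = copy.deepcopy(thinker)
    decode["stage_id"] = target_stage_id + 1
    decode["is_decode_only"] = True  # resumes generation from remote KV

    shifted = []
    for stage in stages[target_stage_id + 1:]:
        stage = copy.deepcopy(stage)
        stage["stage_id"] += 1  # talker: 1 -> 2, code2wav: 2 -> 3
        shifted.append(stage)

    return stages[:target_stage_id] + [prefill, decode] + shifted


pipeline = [
    {"stage_id": 0, "name": "thinker"},
    {"stage_id": 1, "name": "talker"},
    {"stage_id": 2, "name": "code2wav"},
]
expanded = expand_pd_stages(pipeline)
print([(s["stage_id"], s["name"]) for s in expanded])
# [(0, 'thinker'), (1, 'thinker'), (2, 'talker'), (3, 'code2wav')]
```

The user-facing config never sees this expansion; only the resolved runtime config carries the four stages.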

## Requirements

- 3+ GPUs for a basic layout: prefill, decode, and talker+code2wav
- 3+ GPUs for the common layout:
- prefill on GPU `0`
- decode on GPU `1`
- talker + code2wav on GPU `2`
- A KV connector supported by vLLM, such as `MooncakeConnector`
- Matching `tensor_parallel_size` on the prefill and decode thinker stages

## 1. Split the thinker into prefill and decode stages
## How to Enable PD

PD is enabled from the existing bundled deploy config:

- `vllm_omni/deploy/qwen3_omni_moe.yaml`

Replace the original thinker stage with two stages:
No additional user-facing YAML is required. The intent of the config refactor
is to keep Qwen3-Omni on a single deploy config and switch PD on through the
`pd_disaggregation` section in that file.

Edit `vllm_omni/deploy/qwen3_omni_moe.yaml` and enable / tune:

```yaml
stage_args:
- stage_id: 0
stage_type: llm
is_prefill_only: true
runtime:
devices: "0"
engine_args:
pd_disaggregation:
enabled: true
async_chunk: false
target_stage_id: 0
stages:
- role: prefill
max_num_seqs: 16
model_stage: thinker
model_arch: Qwen3OmniMoeForConditionalGeneration
worker_type: ar
scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
gpu_memory_utilization: 0.9
enforce_eager: true
trust_remote_code: true
engine_output_type: latent
distributed_executor_backend: "mp"
enable_prefix_caching: false
max_num_batched_tokens: 32768
hf_config_name: thinker_config
devices: "0"
tensor_parallel_size: 1
kv_transfer_config:
kv_connector: "MooncakeConnector"
kv_role: "kv_producer"
kv_rank: 0
kv_parallel_size: 2
kv_connector_extra_config:
mooncake_bootstrap_port: 25201
final_output: false
is_comprehension: true
default_sampling_params:
temperature: 0.4
top_p: 0.9
top_k: 1
max_tokens: 2048
seed: 42
detokenize: True
repetition_penalty: 1.05

- stage_id: 1
stage_type: llm
is_decode_only: true
runtime:
devices: "1"
engine_args:
engine_extras:
kv_transfer_config:
kv_connector: "MooncakeConnector"
kv_role: "kv_producer"
kv_rank: 0
kv_parallel_size: 2
kv_connector_extra_config:
mooncake_bootstrap_port: 25201
- role: decode
max_num_seqs: 64
model_stage: thinker
model_arch: Qwen3OmniMoeForConditionalGeneration
worker_type: ar
scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
gpu_memory_utilization: 0.9
enforce_eager: true
trust_remote_code: true
engine_output_type: latent
distributed_executor_backend: "mp"
enable_prefix_caching: false
max_num_batched_tokens: 32768
hf_config_name: thinker_config
devices: "1"
tensor_parallel_size: 1
kv_transfer_config:
kv_connector: "MooncakeConnector"
kv_role: "kv_consumer"
kv_rank: 1
kv_parallel_size: 2
kv_connector_extra_config:
mooncake_bootstrap_port: 25202
engine_input_source: [0]
final_output: true
final_output_type: text
is_comprehension: true
default_sampling_params:
temperature: 0.4
top_p: 0.9
top_k: 1
max_tokens: 2048
seed: 42
detokenize: True
repetition_penalty: 1.05
engine_extras:
kv_transfer_config:
kv_connector: "MooncakeConnector"
kv_role: "kv_consumer"
kv_rank: 1
kv_parallel_size: 2
kv_connector_extra_config:
mooncake_bootstrap_port: 25202
stage_overrides:
- stage_id: 1
devices: "2"
- stage_id: 2
devices: "2"
```

Notes:

- `is_prefill_only: true` marks the thinker stage that only saves KV.
- `is_decode_only: true` marks the thinker stage that resumes from remote KV.
- `kv_transfer_config` is required on both stages.
- The orchestrator forces the prefill stage to run with `max_tokens=1`, so the
prefill side only processes the prompt and exports KV.
- `target_stage_id: 0` means the original thinker is the stage being split.
- `async_chunk: false` matches the current PD path.
- The `pd_disaggregation.stage_overrides` block keeps the common 3-GPU layout:
- original talker (`stage_id: 1`) stays on GPU `2`
- original code2wav (`stage_id: 2`) stays on GPU `2`
- After PD expansion, these become runtime stage `2` and stage `3`.
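A quick sanity check for the paired `kv_transfer_config` blocks can be sketched like this; `check_kv_transfer_pair` is a hypothetical helper for illustration, not part of vllm_omni:

```python
def check_kv_transfer_pair(prefill_cfg: dict, decode_cfg: dict) -> None:
    """Verify the prefill/decode kv_transfer_config blocks form a valid
    producer/consumer pair for a 1P1D layout."""
    assert prefill_cfg["kv_role"] == "kv_producer"
    assert decode_cfg["kv_role"] == "kv_consumer"
    # Both sides must use the same connector backend.
    assert prefill_cfg["kv_connector"] == decode_cfg["kv_connector"]
    # Producer and consumer agree on the KV world size and hold distinct ranks.
    assert prefill_cfg["kv_parallel_size"] == decode_cfg["kv_parallel_size"] == 2
    assert {prefill_cfg["kv_rank"], decode_cfg["kv_rank"]} == {0, 1}


check_kv_transfer_pair(
    {"kv_connector": "MooncakeConnector", "kv_role": "kv_producer",
     "kv_rank": 0, "kv_parallel_size": 2},
    {"kv_connector": "MooncakeConnector", "kv_role": "kv_consumer",
     "kv_rank": 1, "kv_parallel_size": 2},
)
print("kv_transfer_config pair OK")
```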

## Launching with Config-Based PD

## 2. Shift the downstream stages by one index
```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
--deploy-config vllm_omni/deploy/qwen3_omni_moe.yaml
```

After inserting the extra thinker stage, renumber the remaining stages:
If you edit the bundled deploy file in place, the explicit `--deploy-config`
flag is optional as long as the runtime resolves the default deploy config for
the model.

```yaml
- stage_id: 2
  runtime:
    devices: "2"
  engine_input_source: [1]
  custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker

- stage_id: 3
  runtime:
    devices: "2"
  engine_args:
    max_num_seqs: 1
  engine_input_source: [2]
  custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav
```

You can also enable PD from CLI without editing the YAML:

```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
  --deploy-config vllm_omni/deploy/qwen3_omni_moe.yaml \
  --enable-pd-disaggregation
```

Compared with the default Qwen3-Omni config:
`--enable-pd-disaggregation` overrides the deploy YAML's
`pd_disaggregation.enabled` value for that launch only. Because the CLI flag is
declared with `argparse.BooleanOptionalAction`, both forms are supported:

- the talker becomes stage `2` instead of stage `1`
- the code2wav stage becomes stage `3` instead of stage `2`
- the talker now reads from decode stage `1`
- `--enable-pd-disaggregation`: force PD on for this run
- `--no-enable-pd-disaggregation`: force PD off for this run

## 3. Add runtime edges for the four-stage pipeline
When the flag is omitted, the YAML value stays in effect.
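This tri-state behavior (force on / force off / defer to YAML) is exactly what `argparse.BooleanOptionalAction` with `default=None` provides. A minimal, self-contained reproduction (the `yaml_default` value here is a stand-in for whatever the deploy YAML says):

```python
import argparse

parser = argparse.ArgumentParser()
# default=None lets "flag omitted" fall through to the YAML value.
parser.add_argument(
    "--enable-pd-disaggregation",
    action=argparse.BooleanOptionalAction,
    default=None,
)

yaml_default = True  # pretend the deploy YAML sets pd_disaggregation.enabled: true

for argv in ([], ["--enable-pd-disaggregation"], ["--no-enable-pd-disaggregation"]):
    args = parser.parse_args(argv)
    cli_value = args.enable_pd_disaggregation
    enabled = yaml_default if cli_value is None else cli_value
    print(argv, "->", enabled)
```

Note that `BooleanOptionalAction` requires Python 3.9 or newer.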

```yaml
runtime:
  enabled: true
  edges:
    - from: 0
      to: 1
    - from: 1
      to: 2
    - from: 2
      to: 3
```
To tune the generated prefill/decode runtime stages from CLI, reuse
`--stage-overrides` after PD is enabled. In the resolved 4-stage runtime config,
stage `0` is prefill and stage `1` is decode:

```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
--deploy-config vllm_omni/deploy/qwen3_omni_moe.yaml \
--enable-pd-disaggregation \
--stage-overrides '{"0":{"max_num_seqs":8},"1":{"max_num_seqs":32}}'
```
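The `--stage-overrides` payload is plain JSON keyed by resolved stage id. A merge of that shape can be sketched as follows; whether the real implementation merges shallowly or deeply is an assumption here, and `apply_stage_overrides` is an illustrative name:

```python
import json


def apply_stage_overrides(stages: list[dict], overrides_json: str) -> list[dict]:
    """Sketch: shallow-update each stage dict whose stage_id appears
    as a string key in the overrides JSON object."""
    overrides = json.loads(overrides_json)
    for stage in stages:
        patch = overrides.get(str(stage["stage_id"]))
        if patch:
            stage.update(patch)
    return stages


stages = [
    {"stage_id": 0, "role": "prefill", "max_num_seqs": 16},
    {"stage_id": 1, "role": "decode", "max_num_seqs": 64},
]
apply_stage_overrides(stages, '{"0":{"max_num_seqs":8},"1":{"max_num_seqs":32}}')
print(stages[0]["max_num_seqs"], stages[1]["max_num_seqs"])  # 8 32
```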

## 4. Launch with your custom config
## Tests

At the moment, PD-aware coverage lives in these three test files:

- `tests/e2e/online_serving/test_qwen3_omni.py`
- `tests/e2e/online_serving/test_qwen3_omni_expansion.py`
- `tests/entrypoints/test_pd_disaggregation.py`

### 1. Online serving E2E

Both online-serving test files include the regular 2-GPU cases and the PD
3-GPU case in the same parametrized suite. The Qwen3-Omni coverage currently
uses these modes:

- `default`: non-PD, 2-GPU layout
- `async_chunk`: non-PD async-chunk path, 2-GPU layout
- `pd_default`: PD disaggregation, 3-GPU layout

No `VLLM_TEST_PD_MODE` environment variable is needed. The tests select the
desired mode directly from the parametrized config path, and all online-serving
Qwen3-Omni cases launch through the stage CLI harness (`use_stage_cli=True`).

Run `test_qwen3_omni.py`:

```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
  --stage-configs-path /path/to/qwen3_omni_pd.yaml
```

```bash
pytest -s -v tests/e2e/online_serving/test_qwen3_omni.py \
  -m "advanced_model" --run-level "advanced_model"
```

Run `test_qwen3_omni_expansion.py`:

```bash
pytest -s -v tests/e2e/online_serving/test_qwen3_omni_expansion.py \
-m "advanced_model" --run-level "advanced_model"
```

Run a single expansion case, for example:

```bash
pytest -s -v tests/e2e/online_serving/test_qwen3_omni_expansion.py \
-k "test_audio_in_video_002" \
-m "advanced_model" --run-level "advanced_model"
```

### 2. PD unit / entrypoint coverage

`test_pd_disaggregation.py` does not require the old PD YAML anymore. It builds
a temporary deploy overlay inside the test process only, enables
`pd_disaggregation`, then verifies that the merged runtime config becomes a valid
4-stage PD pipeline. This temporary file is a test helper, not a user-facing
config artifact.

```bash
pytest tests/entrypoints/test_pd_disaggregation.py -q
```
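The overlay-and-verify pattern the test uses can be sketched roughly like this. The file name and the expansion logic are illustrative (the real test writes a YAML overlay; JSON stands in here to keep the sketch dependency-free):

```python
import json
import tempfile
from pathlib import Path

# Build a temporary deploy overlay that only flips pd_disaggregation on.
overlay = {"pd_disaggregation": {"enabled": True, "target_stage_id": 0}}

with tempfile.TemporaryDirectory() as tmp:
    overlay_path = Path(tmp) / "pd_overlay.json"
    overlay_path.write_text(json.dumps(overlay))

    # The test process loads the overlay and merges it into the deploy config.
    pd = json.loads(overlay_path.read_text())["pd_disaggregation"]
    assert pd["enabled"] is True

    # With PD enabled, the resolved pipeline grows from 3 stages to 4.
    base = ["thinker", "talker", "code2wav"]
    resolved = ["thinker-prefill", "thinker-decode"] + base[1:] if pd["enabled"] else base
    assert len(resolved) == 4

print("merged runtime config resolves to a 4-stage PD pipeline")
```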

### 3. DFX perf benchmark

To run the Qwen3-Omni PD performance benchmark config added under
`tests/dfx/perf/tests/`, use:

```bash
pytest tests/dfx/perf/scripts/run_benchmark.py \
--test-config-file tests/dfx/perf/tests/test_qwen_omni_pd.json -s
```

## Operational Notes
13 changes: 11 additions & 2 deletions tests/dfx/conftest.py
@@ -7,7 +7,7 @@

import pytest

from tests.helpers.stage_config import modify_stage_config
from tests.helpers.stage_config import get_deploy_config_path, modify_stage_config


def load_configs(config_path: str) -> list[dict[str, Any]]:
@@ -85,7 +85,16 @@ def create_unique_server_params(
model = server_params["model"]
stage_config_name = server_params.get("stage_config_name")
if stage_config_name:
    stage_config_path = str(stage_configs_dir / stage_config_name)
    stage_config_path = None
    raw_stage_config = Path(stage_config_name)
    if raw_stage_config.is_absolute():
        stage_config_path = str(raw_stage_config)
    else:
        local_stage_config = stage_configs_dir / stage_config_name
        if local_stage_config.exists():
            stage_config_path = str(local_stage_config)
        else:
            stage_config_path = get_deploy_config_path(stage_config_name)
    delete = server_params.get("delete", None)
    update = server_params.get("update", None)
    stage_config_path = modify_stage(stage_config_path, update, delete)
2 changes: 1 addition & 1 deletion tests/dfx/perf/scripts/run_benchmark.py
@@ -43,7 +43,7 @@ def _get_config_file_from_argv() -> str | None:
if CONFIG_FILE_PATH is None:
    print(
        "No --test-config-file in argv, using default: tests/dfx/perf/tests/test_qwen_omni.json "
        "(override with e.g. --test-config-file tests/dfx/perf/tests/test_tts.json)"
Collaborator
revert this

        "(override with e.g. --test-config-file tests/dfx/perf/tests/test_qwen_omni_pd.json)"
    )
    CONFIG_FILE_PATH = _DEFAULT_CONFIG_FILE
