2 changes: 1 addition & 1 deletion .buildkite/test-nightly.yml
@@ -23,7 +23,7 @@ steps:
- image: 936637512419.dkr.ecr.us-west-2.amazonaws.com/vllm-ci-pull-through-cache/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
resources:
limits:
nvidia.com/gpu: 2
nvidia.com/gpu: 3
Collaborator
I suggest you add a separate label for the PD test rather than running all omni tests with 3 cards.


If you separate the job for the PD test, maybe you can use the job names `Omni · Function Test with 2 H100` and `Omni · Function Test with 3 H100`.

volumeMounts:
- name: devshm
mountPath: /dev/shm
2 changes: 1 addition & 1 deletion .buildkite/test-ready.yml
@@ -193,7 +193,7 @@ steps:
- image: 936637512419.dkr.ecr.us-west-2.amazonaws.com/vllm-ci-pull-through-cache/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
resources:
limits:
nvidia.com/gpu: 2
nvidia.com/gpu: 3
volumeMounts:
- name: devshm
mountPath: /dev/shm
290 changes: 164 additions & 126 deletions docs/configuration/pd_disaggregation.md
@@ -3,162 +3,200 @@
PD disaggregation splits the Qwen3-Omni thinker into separate prefill and decode
stages so prompt processing and token generation can run on different workers.

This is documented as a stage-config recipe instead of a bundled YAML because the
deployment-specific values usually change per environment:

- GPU placement
- `tensor_parallel_size`
- connector backend and connector ports
- connector IPs or bootstrap addresses

Start from the [default Qwen3-Omni stage config](gh-file:vllm_omni/deploy/qwen3_omni_moe.yaml)
and copy it to your own file, for example `qwen3_omni_pd.yaml`. Then apply the
changes below.
After the config refactor, PD is no longer launched from a separate legacy
`stage_configs/*.yaml` file. Instead, it is enabled from the deploy config via
the `pd_disaggregation` section in
[`vllm_omni/deploy/qwen3_omni_moe.yaml`](gh-file:vllm_omni/deploy/qwen3_omni_moe.yaml).

## Current Config-Based Flow

At runtime, the config system does the following when
`pd_disaggregation.enabled: true`:

1. Load the normal 3-stage Qwen3-Omni pipeline + deploy config.
2. Dynamically split the thinker into:
- stage `0`: thinker prefill
- stage `1`: thinker decode
3. Shift downstream stages by one index:
- talker: `1 -> 2`
- code2wav: `2 -> 3`
4. Inject `is_prefill_only`, `is_decode_only`, and `kv_transfer_config` into the
resolved runtime stage configs.
5. Reuse the existing PD detection / routing logic in the engine.

So the user-facing deploy file stays single-source, but the resolved runtime
config becomes a 4-stage PD pipeline.
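The split-and-shift step above can be sketched roughly as follows. This is an illustrative sketch only; `expand_pd_stages` and the dict fields are stand-ins, not the actual vllm_omni internals:

```python
import copy


def expand_pd_stages(stages: list[dict], target_stage_id: int = 0) -> list[dict]:
    """Sketch: split the target (thinker) stage into prefill + decode
    and shift every downstream stage index up by one."""
    thinker = stages[target_stage_id]

    prefill = copy.deepcopy(thinker)
    prefill["stage_id"] = target_stage_id
    prefill["is_prefill_only"] = True  # only processes the prompt and exports KV

    decode = copy.deepcopy(thinker)
    decode["stage_id"] = target_stage_id + 1
    decode["is_decode_only"] = True  # resumes generation from remote KV

    shifted = []
    for stage in stages[target_stage_id + 1:]:
        stage = copy.deepcopy(stage)
        stage["stage_id"] += 1  # talker: 1 -> 2, code2wav: 2 -> 3
        shifted.append(stage)

    return stages[:target_stage_id] + [prefill, decode] + shifted


pipeline = [
    {"stage_id": 0, "name": "thinker"},
    {"stage_id": 1, "name": "talker"},
    {"stage_id": 2, "name": "code2wav"},
]
expanded = expand_pd_stages(pipeline)
print([(s["stage_id"], s["name"]) for s in expanded])
# [(0, 'thinker'), (1, 'thinker'), (2, 'talker'), (3, 'code2wav')]
```

The user-facing config never sees this expansion; only the resolved runtime config carries the four stages.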

## Requirements

- 3+ GPUs for a basic layout: prefill, decode, and talker+code2wav
- 3+ GPUs for the common layout:
- prefill on GPU `0`
- decode on GPU `1`
- talker + code2wav on GPU `2`
- A KV connector supported by vLLM, such as `MooncakeConnector`
- Matching `tensor_parallel_size` on the prefill and decode thinker stages

## 1. Split the thinker into prefill and decode stages
## How to Enable PD

PD is enabled from the existing bundled deploy config:

- `vllm_omni/deploy/qwen3_omni_moe.yaml`

Replace the original thinker stage with two stages:
No additional user-facing YAML is required. The intent of the config refactor
is to keep Qwen3-Omni on a single deploy config and switch PD on through the
`pd_disaggregation` section in that file.

Edit `vllm_omni/deploy/qwen3_omni_moe.yaml` and enable / tune:

```yaml
stage_args:
- stage_id: 0
stage_type: llm
is_prefill_only: true
runtime:
devices: "0"
engine_args:
pd_disaggregation:
enabled: true
async_chunk: false
target_stage_id: 0
stages:
- role: prefill
max_num_seqs: 16
model_stage: thinker
model_arch: Qwen3OmniMoeForConditionalGeneration
worker_type: ar
scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
gpu_memory_utilization: 0.9
enforce_eager: true
trust_remote_code: true
engine_output_type: latent
distributed_executor_backend: "mp"
enable_prefix_caching: false
max_num_batched_tokens: 32768
hf_config_name: thinker_config
devices: "0"
tensor_parallel_size: 1
kv_transfer_config:
kv_connector: "MooncakeConnector"
kv_role: "kv_producer"
kv_rank: 0
kv_parallel_size: 2
kv_connector_extra_config:
mooncake_bootstrap_port: 25201
final_output: false
is_comprehension: true
default_sampling_params:
temperature: 0.4
top_p: 0.9
top_k: 1
max_tokens: 2048
seed: 42
detokenize: True
repetition_penalty: 1.05

- stage_id: 1
stage_type: llm
is_decode_only: true
runtime:
devices: "1"
engine_args:
engine_extras:
kv_transfer_config:
kv_connector: "MooncakeConnector"
kv_role: "kv_producer"
kv_rank: 0
kv_parallel_size: 2
kv_connector_extra_config:
mooncake_bootstrap_port: 25201
- role: decode
max_num_seqs: 64
model_stage: thinker
model_arch: Qwen3OmniMoeForConditionalGeneration
worker_type: ar
scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
gpu_memory_utilization: 0.9
enforce_eager: true
trust_remote_code: true
engine_output_type: latent
distributed_executor_backend: "mp"
enable_prefix_caching: false
max_num_batched_tokens: 32768
hf_config_name: thinker_config
devices: "1"
tensor_parallel_size: 1
kv_transfer_config:
kv_connector: "MooncakeConnector"
kv_role: "kv_consumer"
kv_rank: 1
kv_parallel_size: 2
kv_connector_extra_config:
mooncake_bootstrap_port: 25202
engine_input_source: [0]
final_output: true
final_output_type: text
is_comprehension: true
default_sampling_params:
temperature: 0.4
top_p: 0.9
top_k: 1
max_tokens: 2048
seed: 42
detokenize: True
repetition_penalty: 1.05
engine_extras:
kv_transfer_config:
kv_connector: "MooncakeConnector"
kv_role: "kv_consumer"
kv_rank: 1
kv_parallel_size: 2
kv_connector_extra_config:
mooncake_bootstrap_port: 25202
stage_overrides:
- stage_id: 1
devices: "2"
- stage_id: 2
devices: "2"
```

Notes:

- `is_prefill_only: true` marks the thinker stage that only saves KV.
- `is_decode_only: true` marks the thinker stage that resumes from remote KV.
- `kv_transfer_config` is required on both stages.
- The orchestrator forces the prefill stage to run with `max_tokens=1`, so the
prefill side only processes the prompt and exports KV.
- `target_stage_id: 0` means the original thinker is the stage being split.
- `async_chunk: false` matches the current PD path.
- The `pd_disaggregation.stage_overrides` block keeps the common 3-GPU layout:
- original talker (`stage_id: 1`) stays on GPU `2`
- original code2wav (`stage_id: 2`) stays on GPU `2`
- After PD expansion, these become runtime stage `2` and stage `3`.
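A quick sanity check for the paired `kv_transfer_config` blocks can be sketched like this; `check_kv_transfer_pair` is a hypothetical helper for illustration, not part of vllm_omni:

```python
def check_kv_transfer_pair(prefill_cfg: dict, decode_cfg: dict) -> None:
    """Verify the prefill/decode kv_transfer_config blocks form a valid
    producer/consumer pair for a 1P1D layout."""
    assert prefill_cfg["kv_role"] == "kv_producer"
    assert decode_cfg["kv_role"] == "kv_consumer"
    # Both sides must use the same connector backend.
    assert prefill_cfg["kv_connector"] == decode_cfg["kv_connector"]
    # Producer and consumer agree on the KV world size and hold distinct ranks.
    assert prefill_cfg["kv_parallel_size"] == decode_cfg["kv_parallel_size"] == 2
    assert {prefill_cfg["kv_rank"], decode_cfg["kv_rank"]} == {0, 1}


check_kv_transfer_pair(
    {"kv_connector": "MooncakeConnector", "kv_role": "kv_producer",
     "kv_rank": 0, "kv_parallel_size": 2},
    {"kv_connector": "MooncakeConnector", "kv_role": "kv_consumer",
     "kv_rank": 1, "kv_parallel_size": 2},
)
print("kv_transfer_config pair OK")
```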

## Launching with Config-Based PD

## 2. Shift the downstream stages by one index
```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
--deploy-config vllm_omni/deploy/qwen3_omni_moe.yaml
```

After inserting the extra thinker stage, renumber the remaining stages:
If you edit the bundled deploy file in place, the explicit `--deploy-config`
flag is optional as long as the runtime resolves the default deploy config for
the model.

```yaml
- stage_id: 2
  runtime:
    devices: "2"
  engine_input_source: [1]
  custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker

- stage_id: 3
  runtime:
    devices: "2"
  engine_args:
    max_num_seqs: 1
  engine_input_source: [2]
  custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav
```

You can also enable PD from CLI without editing the YAML:

```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
  --deploy-config vllm_omni/deploy/qwen3_omni_moe.yaml \
  --enable-pd-disaggregation
```

Compared with the default Qwen3-Omni config:
`--enable-pd-disaggregation` overrides the deploy YAML's
`pd_disaggregation.enabled` value for that launch only. Because the CLI flag is
declared with `argparse.BooleanOptionalAction`, both forms are supported:

- the talker becomes stage `2` instead of stage `1`
- the code2wav stage becomes stage `3` instead of stage `2`
- the talker now reads from decode stage `1`
- `--enable-pd-disaggregation`: force PD on for this run
- `--no-enable-pd-disaggregation`: force PD off for this run

## 3. Add runtime edges for the four-stage pipeline
When the flag is omitted, the YAML value stays in effect.
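This tri-state behavior (force on / force off / defer to YAML) is exactly what `argparse.BooleanOptionalAction` with `default=None` provides. A minimal, self-contained reproduction (the `yaml_default` value here is a stand-in for whatever the deploy YAML says):

```python
import argparse

parser = argparse.ArgumentParser()
# default=None lets "flag omitted" fall through to the YAML value.
parser.add_argument(
    "--enable-pd-disaggregation",
    action=argparse.BooleanOptionalAction,
    default=None,
)

yaml_default = True  # pretend the deploy YAML sets pd_disaggregation.enabled: true

for argv in ([], ["--enable-pd-disaggregation"], ["--no-enable-pd-disaggregation"]):
    args = parser.parse_args(argv)
    cli_value = args.enable_pd_disaggregation
    enabled = yaml_default if cli_value is None else cli_value
    print(argv, "->", enabled)
```

Note that `BooleanOptionalAction` requires Python 3.9 or newer.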

```yaml
runtime:
  enabled: true
  edges:
    - from: 0
      to: 1
    - from: 1
      to: 2
    - from: 2
      to: 3
```
To tune the generated prefill/decode runtime stages from CLI, reuse
`--stage-overrides` after PD is enabled. In the resolved 4-stage runtime config,
stage `0` is prefill and stage `1` is decode:

```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
--deploy-config vllm_omni/deploy/qwen3_omni_moe.yaml \
--enable-pd-disaggregation \
--stage-overrides '{"0":{"max_num_seqs":8},"1":{"max_num_seqs":32}}'
```
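The `--stage-overrides` payload is plain JSON keyed by resolved stage id. A merge of that shape can be sketched as follows; whether the real implementation merges shallowly or deeply is an assumption here, and `apply_stage_overrides` is an illustrative name:

```python
import json


def apply_stage_overrides(stages: list[dict], overrides_json: str) -> list[dict]:
    """Sketch: shallow-update each stage dict whose stage_id appears
    as a string key in the overrides JSON object."""
    overrides = json.loads(overrides_json)
    for stage in stages:
        patch = overrides.get(str(stage["stage_id"]))
        if patch:
            stage.update(patch)
    return stages


stages = [
    {"stage_id": 0, "role": "prefill", "max_num_seqs": 16},
    {"stage_id": 1, "role": "decode", "max_num_seqs": 64},
]
apply_stage_overrides(stages, '{"0":{"max_num_seqs":8},"1":{"max_num_seqs":32}}')
print(stages[0]["max_num_seqs"], stages[1]["max_num_seqs"])  # 8 32
```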

## 4. Launch with your custom config
## Tests

At the moment, PD-aware coverage lives in these three test files:

- `tests/e2e/online_serving/test_qwen3_omni.py`
- `tests/e2e/online_serving/test_qwen3_omni_expansion.py`
- `tests/entrypoints/test_pd_disaggregation.py`

### 1. Online serving E2E

Both online-serving test files include the regular 2-GPU cases and the PD
3-GPU case in the same parametrized suite. The Qwen3-Omni coverage currently
uses these modes:

- `default`: non-PD, 2-GPU layout
- `async_chunk`: non-PD async-chunk path, 2-GPU layout
- `pd_default`: PD disaggregation, 3-GPU layout

No `VLLM_TEST_PD_MODE` environment variable is needed. The tests select the
desired mode directly from the parametrized config path, and all online-serving
Qwen3-Omni cases launch through the stage CLI harness (`use_stage_cli=True`).

Run `test_qwen3_omni.py`:

```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
  --stage-configs-path /path/to/qwen3_omni_pd.yaml
```

```bash
pytest -s -v tests/e2e/online_serving/test_qwen3_omni.py \
  -m "advanced_model" --run-level "advanced_model"
```

Run `test_qwen3_omni_expansion.py`:

```bash
pytest -s -v tests/e2e/online_serving/test_qwen3_omni_expansion.py \
-m "advanced_model" --run-level "advanced_model"
```

Run a single expansion case, for example:

```bash
pytest -s -v tests/e2e/online_serving/test_qwen3_omni_expansion.py \
-k "test_audio_in_video_002" \
-m "advanced_model" --run-level "advanced_model"
```

### 2. PD unit / entrypoint coverage

`test_pd_disaggregation.py` does not require the old PD YAML anymore. It builds
a temporary deploy overlay inside the test process only, enables
`pd_disaggregation`, then verifies that the merged runtime config becomes a valid
4-stage PD pipeline. This temporary file is a test helper, not a user-facing
config artifact.

```bash
pytest tests/entrypoints/test_pd_disaggregation.py -q
```
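The overlay-and-verify pattern the test uses can be sketched roughly like this. The file name and the expansion logic are illustrative (the real test writes a YAML overlay; JSON stands in here to keep the sketch dependency-free):

```python
import json
import tempfile
from pathlib import Path

# Build a temporary deploy overlay that only flips pd_disaggregation on.
overlay = {"pd_disaggregation": {"enabled": True, "target_stage_id": 0}}

with tempfile.TemporaryDirectory() as tmp:
    overlay_path = Path(tmp) / "pd_overlay.json"
    overlay_path.write_text(json.dumps(overlay))

    # The test process loads the overlay and merges it into the deploy config.
    pd = json.loads(overlay_path.read_text())["pd_disaggregation"]
    assert pd["enabled"] is True

    # With PD enabled, the resolved pipeline grows from 3 stages to 4.
    base = ["thinker", "talker", "code2wav"]
    resolved = ["thinker-prefill", "thinker-decode"] + base[1:] if pd["enabled"] else base
    assert len(resolved) == 4

print("merged runtime config resolves to a 4-stage PD pipeline")
```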

### 3. DFX perf benchmark

To run the Qwen3-Omni PD performance benchmark config added under
`tests/dfx/perf/tests/`, use:

```bash
pytest tests/dfx/perf/scripts/run_benchmark.py \
--test-config-file tests/dfx/perf/tests/test_qwen_omni_pd.json -s
```

## Operational Notes
13 changes: 11 additions & 2 deletions tests/dfx/conftest.py
@@ -7,7 +7,7 @@

import pytest

from tests.helpers.stage_config import modify_stage_config
from tests.helpers.stage_config import get_deploy_config_path, modify_stage_config


def load_configs(config_path: str) -> list[dict[str, Any]]:
@@ -85,7 +85,16 @@ def create_unique_server_params(
model = server_params["model"]
stage_config_name = server_params.get("stage_config_name")
if stage_config_name:
    stage_config_path = str(stage_configs_dir / stage_config_name)
    stage_config_path = None
    raw_stage_config = Path(stage_config_name)
    if raw_stage_config.is_absolute():
        stage_config_path = str(raw_stage_config)
    else:
        local_stage_config = stage_configs_dir / stage_config_name
        if local_stage_config.exists():
            stage_config_path = str(local_stage_config)
        else:
            stage_config_path = get_deploy_config_path(stage_config_name)
    delete = server_params.get("delete", None)
    update = server_params.get("update", None)
    stage_config_path = modify_stage(stage_config_path, update, delete)
2 changes: 1 addition & 1 deletion tests/dfx/perf/scripts/run_benchmark.py
@@ -43,7 +43,7 @@ def _get_config_file_from_argv() -> str | None:
if CONFIG_FILE_PATH is None:
    print(
        "No --test-config-file in argv, using default: tests/dfx/perf/tests/test_qwen_omni.json "
        "(override with e.g. --test-config-file tests/dfx/perf/tests/test_tts.json)"
Collaborator
revert this

        "(override with e.g. --test-config-file tests/dfx/perf/tests/test_qwen_omni_pd.json)"
    )
    CONFIG_FILE_PATH = _DEFAULT_CONFIG_FILE
