[None][feat] Disagg coordinator + orchestrator fleet by reasonsolo · Pull Request #15905 · NVIDIA/TensorRT-LLM

reasonsolo · 2026-07-03T04:51:25Z

TL;DR:

Add num_workers: N (with N > 1) to the disagg config YAML to enable the multi-process orchestrator fleet.
One possible drawback: /perf_metrics is still served by a single orchestrator, so in multi-process mode you may need to poll /perf_metrics several times until it returns empty data.
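
The polling workaround for /perf_metrics could look roughly like the sketch below. It is only an illustration: `drain_perf_metrics`, `fetch`, `max_polls` and `empty_streak` are hypothetical names, and `fetch` stands in for whatever HTTP call the client uses to GET /perf_metrics and decode the response into a list. The `empty_streak` knob is an assumption on top of the PR text, for callers who do not want to trust a single empty reply.

```python
from typing import Callable, List


def drain_perf_metrics(
    fetch: Callable[[], List[dict]],
    max_polls: int = 16,
    empty_streak: int = 1,
) -> List[dict]:
    """Poll `fetch` until it returns `empty_streak` empty batches in a row.

    `fetch` is a hypothetical callable wrapping one GET on /perf_metrics.
    `max_polls` bounds the loop so a misbehaving server cannot hang the caller.
    """
    collected: List[dict] = []
    empties = 0
    for _ in range(max_polls):
        batch = fetch()
        if batch:
            collected.extend(batch)
            empties = 0  # data arrived, so restart the empty-reply count
        else:
            empties += 1
            if empties >= empty_streak:
                break
    return collected
```

With the default `empty_streak=1` this matches the description above: keep polling until the endpoint gives empty data.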

Implement a coordinator/worker model:

disaggregated (num_workers >1) becomes a pure coordinator on port-1 that owns the ctx/gen routers, readiness, cluster/worker events, and the centralized ZMQ ingest bind, and serves only the internal /select, /finish, /cluster_info, /health API (coordinator_server.py).
The worker fleet runs via one uvicorn process group (workers=N) over a shared listening socket on the public port, rebuilt from a stateless import-string factory (create_worker_app); uvicorn owns supervision + graceful shutdown.
Placement is split on Router: extract_routing_key (client/worker side) + select_by_key / finish_by_handle (coordinator side). Round-robin -> empty key, conversation -> conversation_id (handle-based load release), centralized -> block hashes. The worker holds a CoordinatorHttpRouter that posts the key to the coordinator; single-process calls the router directly.
New DisaggCoordinator abstraction (disagg_coordinator.py): DisaggCoordinatorService (in-process, owns routers) and CoordinatorClient (worker, delegates over HTTP). OpenAIDisaggregatedService reads ctx_router/gen_router off the coordinator and drives get_next_server / finish_request uniformly, so serving is identical in both modes.
Remove router_http_server.py, disagg_app.py, RemoteHttpRouter and the remote_http router type, plus their tests; add test_coordinator_worker.py.
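
To make the placement split concrete, here is a minimal sketch of a conversation-style router with the three methods named above. Only the method names (`extract_routing_key`, `select_by_key`, `finish_by_handle`) come from this PR; the class name, the integer handle, the least-loaded policy and the data structures are assumptions for illustration, not the actual TensorRT-LLM implementation.

```python
from typing import Dict, List, Tuple


class ConversationRouterSketch:
    """Hypothetical router: sticky per conversation, load released by handle."""

    def __init__(self, servers: List[str]) -> None:
        self._servers = list(servers)
        self._load: Dict[str, int] = {s: 0 for s in self._servers}
        self._sticky: Dict[str, str] = {}   # conversation_id -> server
        self._handles: Dict[int, str] = {}  # handle -> server holding the load
        self._next_handle = 0

    # Worker side: cheap, stateless, needs only the request.
    def extract_routing_key(self, request: dict) -> str:
        return request.get("conversation_id", "")

    # Coordinator side: owns all mutable placement state.
    def select_by_key(self, key: str) -> Tuple[str, int]:
        server = self._sticky.get(key)
        if server is None:
            # Assumed policy: pick the least-loaded server for a new conversation.
            server = min(self._servers, key=lambda s: self._load[s])
            if key:
                self._sticky[key] = server
        self._load[server] += 1
        self._next_handle += 1
        self._handles[self._next_handle] = server
        return server, self._next_handle

    # Coordinator side: release the load that a given selection added.
    def finish_by_handle(self, handle: int) -> None:
        server = self._handles.pop(handle, None)
        if server is not None:
            self._load[server] -= 1
```

In fleet mode the worker would compute the key locally and post it to the coordinator, which runs `select_by_key` and later `finish_by_handle`; in single-process mode the same object can serve both sides directly.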

Verified in-container (gb200): test_coordinator_worker 2/2 across 5 repeats, test_per_rank_routing + test_centralized_kv_cache_router + test_openai_disagg_service 76 passed.


PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

…agg service

Replace the WEB_CONCURRENCY multi-worker no-op (and the router_http_server / disagg_app scaffolding) with a coordinator/worker model:

- disaggregated (WEB_CONCURRENCY>1) becomes a pure coordinator on port-1 that owns the ctx/gen routers, readiness, cluster/worker events, and the centralized ZMQ ingest bind, and serves only the internal /select, /finish, /cluster_info, /health API (coordinator_server.py).
- The worker fleet runs via one uvicorn process group (workers=N) over a shared listening socket on the public port, rebuilt from a stateless import-string factory (create_worker_app); uvicorn owns supervision + graceful shutdown.
- Placement is split on Router: extract_routing_key (client/worker side) + select_by_key / finish_by_handle (coordinator side). Round-robin -> empty key, conversation -> conversation_id (handle-based load release), centralized -> block hashes. The worker holds a CoordinatorHttpRouter that posts the key to the coordinator; single-process calls the router directly.
- New DisaggCoordinator abstraction (disagg_coordinator.py): DisaggCoordinatorService (in-process, owns routers) and CoordinatorClient (worker, delegates over HTTP). OpenAIDisaggregatedService reads ctx_router/gen_router off the coordinator and drives get_next_server / finish_request uniformly, so serving is identical in both modes.
- Remove router_http_server.py, disagg_app.py, RemoteHttpRouter and the remote_http router type, plus their tests; add test_coordinator_worker.py.

Verified in-container (gb200): test_coordinator_worker 2/2 across 5 repeats, test_per_rank_routing + test_centralized_kv_cache_router + test_openai_disagg_service 76 passed.

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>

This reverts commit 027ba10.

This reverts commit 2297701.

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>

… per-worker

The disagg request id was minted locally in each fleet worker via get_global_disagg_request_id(node_id). With num_workers>1 all workers share the same node_id and each keeps its own counter starting at 0, so the snowflake ids (timestamp, machine_id, counter) collide across workers.

The ctx->gen KV-cache transceiver keys transfers by disagg id, so colliding ids make transfers clash and never complete: the gen engine's IndexMapper fills with stuck DISAGG_GENERATION_TRANS_IN_PROGRESS requests (all slots in use), new requests can't allocate KV and retry forever, and fleet throughput collapses (~2 req/s).

Both _send_disagg_request_ctx_first and _gen_first now fetch the id from the single coordinator (await self._coordinator.get_disagg_request_id()) -- owner issues in-process, delegating fleet workers fetch over HTTP (/disagg_request_id, already wired). Single issuer => globally unique ids.

Also rename the service's self._cluster -> self._coordinator to match coordinator_server, and drop the now unused get_global_disagg_request_id import.

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
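
The collision described in this commit message can be reproduced with a toy snowflake generator. The sketch below is not `get_global_disagg_request_id`: the 10-bit machine field and 12-bit counter are the classic snowflake layout and are assumed here, and `SnowflakeIssuer` and `make_id` are hypothetical names.

```python
import itertools


def make_id(timestamp_ms: int, machine_id: int, counter: int) -> int:
    """Pack (timestamp, machine_id, counter) into one integer (assumed layout)."""
    return (timestamp_ms << 22) | ((machine_id & 0x3FF) << 12) | (counter & 0xFFF)


class SnowflakeIssuer:
    """Hypothetical id issuer with a private counter that starts at 0."""

    def __init__(self, machine_id: int) -> None:
        self._machine_id = machine_id
        self._counter = itertools.count()

    def next_id(self, timestamp_ms: int) -> int:
        return make_id(timestamp_ms, self._machine_id, next(self._counter))


# Two fleet workers sharing one node_id, issuing ids in the same millisecond.
worker_a = SnowflakeIssuer(machine_id=7)
worker_b = SnowflakeIssuer(machine_id=7)
ids_a = [worker_a.next_id(1_000) for _ in range(3)]
ids_b = [worker_b.next_id(1_000) for _ in range(3)]
collisions = set(ids_a) & set(ids_b)  # every id collides

# A single issuer (the coordinator's role) hands out the same six ids uniquely.
coordinator = SnowflakeIssuer(machine_id=7)
ids_single = [coordinator.next_id(1_000) for _ in range(6)]
```

Because both workers feed identical (timestamp, machine_id, counter) triples into the packer, their outputs match exactly; routing every request through one counter removes the overlap without changing the id format.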

… time

CoordinatorDelegatingRouter.get_next_server was overwriting the generation request's disagg_request_id with a fresh coordinator-issued id (sent req_id=None to /select, then wrote body["req_id"] back onto the request). But the ctx worker already registered its KV-cache transfer TxSession under the id the request carried from the ctx phase.

Overwriting it makes the gen transceiver wait on a key the ctx side never registered: the transfer never completes, gen requests stay DISAGG_GENERATION_TRANS_IN_PROGRESS, the gen IndexMapper fills (No free IndexMapper slots), and fleet throughput collapses to ~2 req/s. (Single-process num_workers=1 never hit this: no delegating router, id never rewritten.)

Restore the last-good behavior: _request_id returns disagg_request_id for context and ctx_request_id for generation (the inherited ctx id), and get_next_server sends that id as the /select key WITHOUT rewriting the request. Placement never changes the disagg id, so the ctx<->gen KV transfer key stays consistent.

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
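
The restored behavior in this commit message amounts to "read the id, never write it". A minimal sketch follows, assuming requests are plain dicts and that `select` is a hypothetical callable standing in for the POST to /select; the real router works on typed request objects and these function names are illustrative only.

```python
from typing import Callable


def request_id_for_select(request: dict, phase: str) -> int:
    """Pick the id used as the /select key without touching the request."""
    if phase == "context":
        return request["disagg_request_id"]
    if phase == "generation":
        # The generation request inherits the id minted for the ctx phase.
        return request["ctx_request_id"]
    raise ValueError(f"unknown phase: {phase!r}")


def get_next_server(request: dict, phase: str, select: Callable[[int], str]) -> str:
    """Ask the coordinator for a server; the request is left unmodified."""
    req_id = request_id_for_select(request, phase)
    return select(req_id)
```

Keeping placement read-only with respect to the request means the KV-cache transfer key registered by the ctx side is the same key the gen side later waits on.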

github-actions Bot assigned reasonsolo Jul 3, 2026

reasonsolo force-pushed the feat/deepseek_v4_coordinator_disagg branch 2 times, most recently from 930de75 to fb6a7ff Compare July 3, 2026 05:25

generate request id from coordinator

027ba10

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>

reasonsolo force-pushed the feat/deepseek_v4_coordinator_disagg branch from fb6a7ff to 027ba10 Compare July 3, 2026 05:27

reasonsolo added 5 commits July 2, 2026 22:46

Revert " generate request id from coordinator"

2297701

This reverts commit 027ba10.

Reapply " generate request id from coordinator"

4effcbe

This reverts commit 2297701.

remove llm_id changes

e197aea

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>

reasonsolo wants to merge 7 commits into NVIDIA:feat/deepseek_v4 from reasonsolo:feat/deepseek_v4_coordinator_disagg
