[None][feat] Disagg coordinator + orchestrator fleet #15905

Draft
reasonsolo wants to merge 7 commits into NVIDIA:feat/deepseek_v4 from reasonsolo:feat/deepseek_v4_coordinator_disagg

@reasonsolo reasonsolo commented Jul 3, 2026

Collaborator

TL;DR:

  1. Add num_workers: N (with N > 1) to the disagg config YAML to enable the multiprocess orchestrator fleet.
  2. One possible drawback: each /perf_metrics request is still served by a single orchestrator process. With multiple processes, you may need to poll /perf_metrics several times, until it returns empty data.
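
The polling workaround in item 2 could look roughly like the sketch below. It assumes /perf_metrics returns a list that is drained on read, which this description does not confirm; `drain_perf_metrics`, `fetch`, and `max_polls` are hypothetical names, and `fetch` stands in for an HTTP GET against the public port.

```python
from typing import Callable, List


def drain_perf_metrics(fetch: Callable[[], list], max_polls: int = 32) -> List[dict]:
    """Poll until one response comes back empty, collecting every batch seen.

    `fetch` is a hypothetical stand-in for GET /perf_metrics; each call may be
    answered by a different orchestrator process in the fleet.
    """
    collected: List[dict] = []
    for _ in range(max_polls):
        batch = fetch()
        if not batch:
            # Stopping rule taken from the description: stop on empty data.
            break
        collected.extend(batch)
    return collected


# Fake responses standing in for successive HTTP replies (illustrative data only).
_fake_batches = iter([[{"request_id": 1}], [{"request_id": 2}, {"request_id": 3}], []])
metrics = drain_perf_metrics(lambda: next(_fake_batches))
```

The `max_polls` bound is a design choice of this sketch so that a client never loops forever if some process keeps producing data.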

Implement a coordinator/worker model:

  • disaggregated (num_workers >1) becomes a pure coordinator on port-1 that owns the ctx/gen routers, readiness, cluster/worker events, and the centralized ZMQ ingest bind, and serves only the internal /select, /finish, /cluster_info, /health API (coordinator_server.py).
  • The worker fleet runs via one uvicorn process group (workers=N) over a shared listening socket on the public port, rebuilt from a stateless import-string factory (create_worker_app); uvicorn owns supervision + graceful shutdown.
  • Placement is split on Router: extract_routing_key (client/worker side) + select_by_key / finish_by_handle (coordinator side). Round-robin -> empty key, conversation -> conversation_id (handle-based load release), centralized -> block hashes. The worker holds a CoordinatorHttpRouter that posts the key to the coordinator; single-process calls the router directly.
  • New DisaggCoordinator abstraction (disagg_coordinator.py): DisaggCoordinatorService (in-process, owns routers) and CoordinatorClient (worker, delegates over HTTP). OpenAIDisaggregatedService reads ctx_router/gen_router off the coordinator and drives get_next_server / finish_request uniformly, so serving is identical in both modes.
  • Remove router_http_server.py, disagg_app.py, RemoteHttpRouter and the remote_http router type, plus their tests; add test_coordinator_worker.py.
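
The placement split described above can be sketched as follows. Only the three method names come from this PR; the class names, signatures, and load-tracking details are assumptions for illustration, and the centralized (block hash) variant is not shown.

```python
import itertools
from typing import Dict, List, Tuple


class SketchRoundRobinRouter:
    """Hypothetical round-robin router: the routing key is always empty."""

    def __init__(self, servers: List[str]):
        self._servers = list(servers)
        self._next = itertools.count()

    def extract_routing_key(self, request: dict) -> str:
        # Worker side: nothing about the request influences placement.
        return ""

    def select_by_key(self, key: str) -> Tuple[str, int]:
        # Coordinator side: pick the next server; the index doubles as a handle.
        idx = next(self._next)
        return self._servers[idx % len(self._servers)], idx

    def finish_by_handle(self, handle: int) -> None:
        # Round-robin tracks no load, so there is nothing to release.
        pass


class SketchConversationRouter:
    """Hypothetical conversation router with handle-based load release."""

    def __init__(self, servers: List[str]):
        self._servers = list(servers)
        self._load: Dict[str, int] = {s: 0 for s in servers}
        self._sticky: Dict[str, str] = {}
        self._handles: Dict[int, str] = {}
        self._next_handle = itertools.count()

    def extract_routing_key(self, request: dict) -> str:
        # Worker side: only the conversation id has to cross the process boundary.
        return str(request.get("conversation_id", ""))

    def select_by_key(self, key: str) -> Tuple[str, int]:
        # Coordinator side: reuse the server bound to this conversation,
        # otherwise pick the least-loaded one and remember the binding.
        server = self._sticky.get(key)
        if server is None:
            server = min(self._servers, key=lambda s: self._load[s])
            if key:
                self._sticky[key] = server
        self._load[server] += 1
        handle = next(self._next_handle)
        self._handles[handle] = server
        return server, handle

    def finish_by_handle(self, handle: int) -> None:
        # Release the load that was charged when this handle was issued.
        server = self._handles.pop(handle)
        self._load[server] -= 1


router = SketchConversationRouter(["ctx-0", "ctx-1"])
key = router.extract_routing_key({"conversation_id": "abc"})
first_server, first_handle = router.select_by_key(key)
second_server, second_handle = router.select_by_key(key)
router.finish_by_handle(first_handle)
router.finish_by_handle(second_handle)
final_load = dict(router._load)
```

Keeping key extraction on the worker means the coordinator never needs the full request body, only a small key and, later, the handle.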

Verified in-container (gb200): test_coordinator_worker 2/2 across 5 repeats, test_per_rank_routing + test_centralized_kv_cache_router + test_openai_disagg_service 76 passed.


…agg service

Replace the WEB_CONCURRENCY multi-worker no-op (and the router_http_server /
disagg_app scaffolding) with a coordinator/worker model:

- disaggregated (WEB_CONCURRENCY>1) becomes a pure coordinator on port-1 that
  owns the ctx/gen routers, readiness, cluster/worker events, and the
  centralized ZMQ ingest bind, and serves only the internal /select, /finish,
  /cluster_info, /health API (coordinator_server.py).
- The worker fleet runs via one uvicorn process group (workers=N) over a shared
  listening socket on the public port, rebuilt from a stateless import-string
  factory (create_worker_app); uvicorn owns supervision + graceful shutdown.
- Placement is split on Router: extract_routing_key (client/worker side) +
  select_by_key / finish_by_handle (coordinator side). Round-robin -> empty key,
  conversation -> conversation_id (handle-based load release), centralized ->
  block hashes. The worker holds a CoordinatorHttpRouter that posts the key to
  the coordinator; single-process calls the router directly.
- New DisaggCoordinator abstraction (disagg_coordinator.py): DisaggCoordinatorService
  (in-process, owns routers) and CoordinatorClient (worker, delegates over HTTP).
  OpenAIDisaggregatedService reads ctx_router/gen_router off the coordinator and
  drives get_next_server / finish_request uniformly, so serving is identical in
  both modes.
- Remove router_http_server.py, disagg_app.py, RemoteHttpRouter and the
  remote_http router type, plus their tests; add test_coordinator_worker.py.
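
A minimal sketch of the owner/delegate shape described in the DisaggCoordinator bullet, under stated assumptions: the `Sketch*` classes, the `handle_post` helper, and the JSON field names are hypothetical, and an in-memory callable stands in for the HTTP transport to /select and /finish.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Awaitable, Callable, Dict, List, Tuple


class SketchCoordinator(ABC):
    """Hypothetical common interface, so serving code is mode-agnostic."""

    @abstractmethod
    async def get_next_server(self, key: str) -> Tuple[str, int]: ...

    @abstractmethod
    async def finish_request(self, handle: int) -> None: ...


class SketchCoordinatorService(SketchCoordinator):
    """In-process owner: holds the routing state itself."""

    def __init__(self, servers: List[str]):
        self._servers = list(servers)
        self._inflight: Dict[int, str] = {}
        self._counter = 0

    async def get_next_server(self, key: str) -> Tuple[str, int]:
        handle = self._counter
        self._counter += 1
        server = self._servers[handle % len(self._servers)]
        self._inflight[handle] = server
        return server, handle

    async def finish_request(self, handle: int) -> None:
        self._inflight.pop(handle, None)

    async def handle_post(self, path: str, body: dict) -> dict:
        # Stand-in for the internal HTTP API (assumed request/response shapes).
        if path == "/select":
            server, handle = await self.get_next_server(body["key"])
            return {"server": server, "handle": handle}
        if path == "/finish":
            await self.finish_request(body["handle"])
            return {}
        raise ValueError(f"unknown path: {path}")


class SketchCoordinatorClient(SketchCoordinator):
    """Fleet worker side: delegates every decision to the owner."""

    def __init__(self, post: Callable[[str, dict], Awaitable[dict]]):
        self._post = post

    async def get_next_server(self, key: str) -> Tuple[str, int]:
        reply = await self._post("/select", {"key": key})
        return reply["server"], reply["handle"]

    async def finish_request(self, handle: int) -> None:
        await self._post("/finish", {"handle": handle})


async def serve_one(coordinator: SketchCoordinator, key: str) -> str:
    # Identical serving path regardless of which implementation is plugged in.
    server, handle = await coordinator.get_next_server(key)
    try:
        return server
    finally:
        await coordinator.finish_request(handle)


owner = SketchCoordinatorService(["gen-0", "gen-1"])
client = SketchCoordinatorClient(owner.handle_post)
direct = asyncio.run(serve_one(owner, ""))
delegated = asyncio.run(serve_one(client, ""))
```

Because `serve_one` depends only on the abstract interface, the single-process and fleet modes differ only in which object is constructed at startup.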

Verified in-container (gb200): test_coordinator_worker 2/2 across 5 repeats,
test_per_rank_routing + test_centralized_kv_cache_router + test_openai_disagg_service
76 passed.

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
@reasonsolo reasonsolo force-pushed the feat/deepseek_v4_coordinator_disagg branch 2 times, most recently from 930de75 to fb6a7ff Compare July 3, 2026 05:25
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
@reasonsolo reasonsolo force-pushed the feat/deepseek_v4_coordinator_disagg branch from fb6a7ff to 027ba10 Compare July 3, 2026 05:27
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
… per-worker

The disagg request id was minted locally in each fleet worker via
get_global_disagg_request_id(node_id). With num_workers>1 all workers share the
same node_id and each keeps its own counter starting at 0, so the snowflake ids
(timestamp, machine_id, counter) collide across workers. The ctx->gen KV-cache
transceiver keys transfers by disagg id, so colliding ids make transfers clash
and never complete: the gen engine's IndexMapper fills with stuck
DISAGG_GENERATION_TRANS_IN_PROGRESS requests (all slots in use), new requests
can't allocate KV and retry forever, and fleet throughput collapses (~2 req/s).

Both _send_disagg_request_ctx_first and _gen_first now fetch the id from the
single coordinator (await self._coordinator.get_disagg_request_id()) -- owner
issues in-process, delegating fleet workers fetch over HTTP (/disagg_request_id,
already wired). Single issuer => globally unique ids. Also rename the service's
self._cluster -> self._coordinator to match coordinator_server, and drop the now
unused get_global_disagg_request_id import.
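
The collision and the single-issuer fix can be reproduced with a toy generator. The bit layout below (timestamp, 10-bit machine id, 12-bit counter) and the `SketchIdIssuer` class are assumptions for illustration, not the layout used by get_global_disagg_request_id; the timestamp is pinned to mimic requests arriving in the same millisecond.

```python
from itertools import count


class SketchIdIssuer:
    """Hypothetical snowflake-style issuer with a private counter."""

    def __init__(self, node_id: int):
        self._node_id = node_id
        self._counter = count()  # every instance starts at 0

    def next_id(self, timestamp_ms: int) -> int:
        # Assumed layout: timestamp | 10-bit machine id | 12-bit counter.
        machine_bits = (self._node_id & 0x3FF) << 12
        counter_bits = next(self._counter) & 0xFFF
        return (timestamp_ms << 22) | machine_bits | counter_bits


NOW_MS = 1_700_000_000_000  # fixed so both workers mint ids in the same millisecond

# Before the fix: two fleet workers share node_id and each has its own counter.
worker_a = SketchIdIssuer(node_id=7)
worker_b = SketchIdIssuer(node_id=7)
ids_a = [worker_a.next_id(NOW_MS) for _ in range(3)]
ids_b = [worker_b.next_id(NOW_MS) for _ in range(3)]
collisions = set(ids_a) & set(ids_b)

# After the fix: one issuer (the coordinator) serves every worker.
coordinator_issuer = SketchIdIssuer(node_id=7)
central_ids = [coordinator_issuer.next_id(NOW_MS) for _ in range(6)]
```

With per-worker issuers every id minted in the same millisecond collides; with one issuer the counter alone keeps the ids distinct.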

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
… time

CoordinatorDelegatingRouter.get_next_server was overwriting the generation
request's disagg_request_id with a fresh coordinator-issued id (sent req_id=None
to /select, then wrote body["req_id"] back onto the request). But the ctx worker
already registered its KV-cache transfer TxSession under the id the request
carried from the ctx phase. Overwriting it makes the gen transceiver wait on a
key the ctx side never registered: the transfer never completes, gen requests
stay DISAGG_GENERATION_TRANS_IN_PROGRESS, the gen IndexMapper fills (No free
IndexMapper slots), and fleet throughput collapses to ~2 req/s. (Single-process
num_workers=1 never hit this: no delegating router, id never rewritten.)

Restore the last-good behavior: _request_id returns disagg_request_id for context
and ctx_request_id for generation (the inherited ctx id), and get_next_server
sends that id as the /select key WITHOUT rewriting the request. Placement never
changes the disagg id, so the ctx<->gen KV transfer key stays consistent.
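
The restored behavior can be sketched as below, assuming a simplified request object. `SketchRequest`, `request_id_for`, and the `post_select` callable are hypothetical; only the field names disagg_request_id and ctx_request_id and the /select key come from the message above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SketchRequest:
    """Hypothetical stand-in for the request object passed to the router."""
    disagg_request_id: Optional[int] = None
    ctx_request_id: Optional[int] = None


def request_id_for(request: SketchRequest, phase: str) -> Optional[int]:
    # Context phase uses its own id; generation uses the id inherited from ctx.
    if phase == "context":
        return request.disagg_request_id
    return request.ctx_request_id


def get_next_server(request: SketchRequest, phase: str,
                    post_select: Callable[[dict], dict]) -> str:
    # Send the existing id as the /select key and read back only the server.
    # Any id in the reply is deliberately ignored: placement must not rewrite
    # the id that the ctx side already registered its KV transfer under.
    reply = post_select({"req_id": request_id_for(request, phase)})
    return reply["server"]


sent: List[dict] = []


def fake_select(body: dict) -> dict:
    # Fake coordinator that also returns a fresh id, as the buggy path consumed.
    sent.append(body)
    return {"server": "gen-0", "req_id": 999}


gen_request = SketchRequest(ctx_request_id=42)
chosen = get_next_server(gen_request, "generation", fake_select)
```

After the call, `gen_request.ctx_request_id` is still 42, so the gen transceiver waits on the same key the ctx worker registered.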

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>