[None][feat] Disagg coordinator + orchestrator fleet #15905
Draft
reasonsolo wants to merge 7 commits into
…agg service

Replace the WEB_CONCURRENCY multi-worker no-op (and the router_http_server / disagg_app scaffolding) with a coordinator/worker model:

- disaggregated (WEB_CONCURRENCY>1) becomes a pure coordinator on port-1 that owns the ctx/gen routers, readiness, cluster/worker events, and the centralized ZMQ ingest bind, and serves only the internal /select, /finish, /cluster_info, /health API (coordinator_server.py).
- The worker fleet runs via one uvicorn process group (workers=N) over a shared listening socket on the public port, rebuilt from a stateless import-string factory (create_worker_app); uvicorn owns supervision + graceful shutdown.
- Placement is split on Router: extract_routing_key (client/worker side) + select_by_key / finish_by_handle (coordinator side). Round-robin -> empty key, conversation -> conversation_id (handle-based load release), centralized -> block hashes. The worker holds a CoordinatorHttpRouter that posts the key to the coordinator; single-process calls the router directly.
- New DisaggCoordinator abstraction (disagg_coordinator.py): DisaggCoordinatorService (in-process, owns routers) and CoordinatorClient (worker, delegates over HTTP). OpenAIDisaggregatedService reads ctx_router/gen_router off the coordinator and drives get_next_server / finish_request uniformly, so serving is identical in both modes.
- Remove router_http_server.py, disagg_app.py, RemoteHttpRouter and the remote_http router type, plus their tests; add test_coordinator_worker.py.

Verified in-container (gb200): test_coordinator_worker 2/2 across 5 repeats, test_per_rank_routing + test_centralized_kv_cache_router + test_openai_disagg_service 76 passed.

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
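The placement split described in this commit (key extraction on the worker, selection and handle-based release on the coordinator) can be pictured with a minimal in-process sketch. Only the names extract_routing_key, select_by_key and finish_by_handle come from the commit message; the ToyCoordinator class, its load-tracking scheme, and the function signatures are assumptions made for illustration and are not the PR's implementation. The centralized (block hash) policy is left out.

```python
import itertools
from typing import Dict, List, Tuple


def extract_routing_key(policy: str, request: dict) -> str:
    """Worker side: reduce a request to the key the coordinator needs."""
    if policy == "round_robin":
        return ""  # round-robin needs no per-request information
    if policy == "conversation":
        return str(request["conversation_id"])
    raise ValueError(f"policy not covered by this sketch: {policy}")


class ToyCoordinator:
    """Coordinator side (hypothetical): owns placement state for one router."""

    def __init__(self, servers: List[str]) -> None:
        self._servers = list(servers)
        self._round_robin = itertools.cycle(self._servers)
        self._load: Dict[str, int] = {s: 0 for s in self._servers}
        self._sticky: Dict[str, str] = {}    # conversation key -> server
        self._handles: Dict[int, str] = {}   # handle -> server
        self._next_handle = itertools.count(1)

    def select_by_key(self, key: str) -> Tuple[str, int]:
        """Pick a server for the key and return (server, handle)."""
        if key == "":
            server = next(self._round_robin)
        else:
            server = self._sticky.get(key)
            if server is None:
                # First request of a conversation: take the least-loaded server.
                server = min(self._servers, key=lambda s: self._load[s])
                self._sticky[key] = server
        handle = next(self._next_handle)
        self._handles[handle] = server
        self._load[server] += 1
        return server, handle

    def finish_by_handle(self, handle: int) -> None:
        """Release the load recorded for a finished request."""
        server = self._handles.pop(handle)
        self._load[server] -= 1

    def load(self, server: str) -> int:
        return self._load[server]
```

In the multi-worker mode the commit describes, the call to select_by_key would travel over HTTP (/select) instead of being a direct method call; the handle returned here stands in for whatever the coordinator uses to release load on /finish.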
Force-pushed from 930de75 to fb6a7ff
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Force-pushed from fb6a7ff to 027ba10
This reverts commit 027ba10.
This reverts commit 2297701.
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
… per-worker

The disagg request id was minted locally in each fleet worker via get_global_disagg_request_id(node_id). With num_workers>1 all workers share the same node_id and each keeps its own counter starting at 0, so the snowflake ids (timestamp, machine_id, counter) collide across workers.

The ctx->gen KV-cache transceiver keys transfers by disagg id, so colliding ids make transfers clash and never complete: the gen engine's IndexMapper fills with stuck DISAGG_GENERATION_TRANS_IN_PROGRESS requests (all slots in use), new requests can't allocate KV and retry forever, and fleet throughput collapses (~2 req/s).

Both _send_disagg_request_ctx_first and _gen_first now fetch the id from the single coordinator (await self._coordinator.get_disagg_request_id()): the owner issues in-process, and delegating fleet workers fetch over HTTP (/disagg_request_id, already wired). Single issuer => globally unique ids.

Also rename the service's self._cluster -> self._coordinator to match coordinator_server, and drop the now unused get_global_disagg_request_id import.

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
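The collision described in this commit can be reproduced with a toy issuer. This is a sketch under stated assumptions: the id is modeled as a plain (timestamp, machine_id, counter) tuple and the timestamp is passed in explicitly so the behavior is deterministic; make_issuer is a hypothetical helper, not get_global_disagg_request_id or the coordinator's real issuer.

```python
import itertools
from typing import Callable, Tuple

SnowflakeId = Tuple[int, int, int]  # (timestamp_ms, machine_id, counter)


def make_issuer(machine_id: int) -> Callable[[int], SnowflakeId]:
    """Return an issuer with its own counter starting at 0 (hypothetical)."""
    counter = itertools.count()

    def issue(timestamp_ms: int) -> SnowflakeId:
        return (timestamp_ms, machine_id, next(counter))

    return issue


# Per-worker minting: two workers on one node share machine_id and each
# starts counting from 0, so ids minted in the same millisecond are equal.
worker_a = make_issuer(machine_id=7)
worker_b = make_issuer(machine_id=7)
ids_a = [worker_a(1000) for _ in range(3)]
ids_b = [worker_b(1000) for _ in range(3)]
colliding = set(ids_a) & set(ids_b)

# Single issuer: one counter serves every worker, so ids stay distinct
# even when all requests arrive in the same millisecond.
coordinator = make_issuer(machine_id=7)
ids_single = [coordinator(1000) for _ in range(6)]
```

With per-worker issuers every id in ids_a also appears in ids_b, while ids_single contains six distinct values, which is the property the single coordinator issuer restores.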
… time

CoordinatorDelegatingRouter.get_next_server was overwriting the generation request's disagg_request_id with a fresh coordinator-issued id (sent req_id=None to /select, then wrote body["req_id"] back onto the request). But the ctx worker already registered its KV-cache transfer TxSession under the id the request carried from the ctx phase. Overwriting it makes the gen transceiver wait on a key the ctx side never registered: the transfer never completes, gen requests stay DISAGG_GENERATION_TRANS_IN_PROGRESS, the gen IndexMapper fills (No free IndexMapper slots), and fleet throughput collapses to ~2 req/s. (Single-process num_workers=1 never hit this: no delegating router, id never rewritten.)

Restore the last-good behavior: _request_id returns disagg_request_id for context and ctx_request_id for generation (the inherited ctx id), and get_next_server sends that id as the /select key WITHOUT rewriting the request. Placement never changes the disagg id, so the ctx<->gen KV transfer key stays consistent.

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
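The restored behavior amounts to reading the id that matches the phase and leaving the request untouched. The sketch below is an illustration only: ToyRequest, request_id_for and build_select_payload are hypothetical names, the flat field layout is an assumption about where disagg_request_id and ctx_request_id live, and the "key" payload field is invented for the example (only "req_id" appears in the commit message).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyRequest:
    # Assumed flat layout; the real request may nest these fields elsewhere.
    disagg_request_id: Optional[int] = None  # id carried by a context request
    ctx_request_id: Optional[int] = None     # ctx id inherited by a generation request


def request_id_for(request: ToyRequest, phase: str) -> Optional[int]:
    """Pick the id for the current phase without minting a new one."""
    if phase == "context":
        return request.disagg_request_id
    if phase == "generation":
        return request.ctx_request_id
    raise ValueError(f"unknown phase: {phase}")


def build_select_payload(request: ToyRequest, phase: str, routing_key: str) -> dict:
    """Build the body for /select; the request itself is only read, never mutated."""
    return {"req_id": request_id_for(request, phase), "key": routing_key}


gen_request = ToyRequest(ctx_request_id=42)
payload = build_select_payload(gen_request, "generation", routing_key="")
```

Because build_select_payload only reads from the request, the id under which the ctx side registered its transfer is still the one the gen side looks up afterwards.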
TL;DR:
Implement a coordinator/worker model:
Verified in-container (gb200): test_coordinator_worker 2/2 across 5 repeats, test_per_rank_routing + test_centralized_kv_cache_router + test_openai_disagg_service 76 passed.
@coderabbitai summary
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.