[None][perf] serve: opt-in msgspec msgpack transport for disagg orchestrator->worker request body #15910

Open
Tabrizian wants to merge 2 commits into NVIDIA:feat/deepseek_v4 from Tabrizian:perf/disagg-msgspec

@Tabrizian

What

Opt-in msgspec msgpack transport for the disaggregated orchestrator→worker request body, as an alternative to the orjson request-body parse (#15690).

Instead of speeding up the JSON parse (orjson), this removes JSON (de)serialization from the disagg serving path entirely: the orchestrator encodes the forwarded request dict as msgpack via msgspec.msgpack, and the worker decodes it.

How

  • openai_client.py (orchestrator): when enabled, the body is msgspec.msgpack.Encoder().encode(request.model_dump(mode='json', exclude_unset=True)) and is POSTed with Content-Type: application/msgpack; otherwise the existing model_dump_json + application/json path is used.
  • openai_server.py (worker): a custom APIRoute/Request that decodes application/msgpack bodies with msgspec and falls back to stdlib json for anything else (content-type gated).
  • requirements.txt: add msgspec.
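
The two code paths above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: encode_body and decode_body are hypothetical helper names, a plain dict stands in for the result of request.model_dump(mode='json', exclude_unset=True), and json.dumps stands in for the existing model_dump_json path. Only msgspec.msgpack.Encoder and msgspec.msgpack.Decoder are real library calls.

```python
import json
from typing import Any


def encode_body(payload: dict, use_msgpack: bool) -> tuple[bytes, str]:
    """Orchestrator side (hypothetical helper): return (body, content_type)."""
    if use_msgpack:
        # Lazy import so the JSON path never touches msgspec.
        import msgspec

        return msgspec.msgpack.Encoder().encode(payload), "application/msgpack"
    # Stand-in for the existing model_dump_json path.
    return json.dumps(payload).encode("utf-8"), "application/json"


def decode_body(body: bytes, content_type: str) -> Any:
    """Worker side (hypothetical helper): content-type gated decode."""
    if content_type == "application/msgpack":
        import msgspec

        return msgspec.msgpack.Decoder().decode(body)
    # Anything else falls back to stdlib json.
    return json.loads(body)
```

With use_msgpack=False neither function imports msgspec, which mirrors the "no import" property claimed for the default path.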

Enabling

Set TRTLLM_SERVE_ENABLE_MSGSPEC=1 on both the orchestrator and the worker. It is off by default, in which case the JSON path is byte-for-byte unchanged (no route override, no import).
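
A flag check of this kind might look like the sketch below. The variable name comes from the description above; the helper name msgspec_enabled and the rule that only the exact value "1" enables the transport are assumptions, since the PR text does not show how the variable is parsed.

```python
import os
from typing import Mapping


def msgspec_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    """Hypothetical helper: True only when the opt-in flag is exactly "1"."""
    # Assumption: unset or any other value leaves the JSON path in place.
    return environ.get("TRTLLM_SERVE_ENABLE_MSGSPEC", "0") == "1"
```

Because the same variable must be set on both processes, a mismatch (orchestrator on, worker off) would leave the worker parsing msgpack bytes as JSON, so the check needs to agree on both sides.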

Why

For DeepSeek-V4 agentic loads, JSON (de)serialization of the ~40k-token request body on the single serving event loop is a dispatch bottleneck. #15690 addressed it with a faster JSON parser (orjson); this is the msgpack alternative, so the two can be A/B tested. It affects only the orchestrator↔worker leg; the client API is unchanged.

…>worker request body

Alternative to the orjson request-body parse (NVIDIA#15690): instead of speeding up the
JSON parse, encode the orchestrator->worker forwarded body as msgpack via msgspec
and decode it on the worker, removing JSON (de)serialization from the disagg
serving path. Opt-in via TRTLLM_SERVE_ENABLE_MSGSPEC=1 (set on both orchestrator
and worker); off by default the JSON path is byte-for-byte unchanged.

- openai_client.py (orchestrator): when enabled, encode the request dict with
  msgspec.msgpack and POST Content-Type: application/msgpack.
- openai_server.py (worker): decode application/msgpack bodies with msgspec,
  falling back to stdlib json otherwise (content-type gated).
- requirements.txt: add msgspec.

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
@Tabrizian Tabrizian requested review from a team as code owners July 3, 2026 07:48
@Tabrizian Tabrizian requested review from hchings and removed request for a team July 3, 2026 07:48
…e json)

FastAPI only sends the body through Request.json() when the Content-Type subtype
is json/+json; application/msgpack bypassed it, so pydantic received the raw
msgpack bytes ('bytes' object has no attribute 'get'). Keep Content-Type
application/json and flag msgpack with X-TRTLLM-Msgpack: 1; the worker decodes
with msgspec when that header is set.

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
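
The header-gated scheme from this second commit can be sketched as below. The header name and value are taken from the commit message; build_headers and parse_body are hypothetical helpers, and the FastAPI wiring (where the worker actually hooks the decode in) is not shown in the PR text, so it is left out.

```python
import json
from typing import Any, Mapping

MSGPACK_HEADER = "x-trtllm-msgpack"  # header from the commit message, lowercased


def build_headers(use_msgpack: bool) -> dict[str, str]:
    """Orchestrator side: Content-Type stays application/json in both modes."""
    headers = {"Content-Type": "application/json"}
    if use_msgpack:
        headers["X-TRTLLM-Msgpack"] = "1"  # flags the body as msgpack
    return headers


def parse_body(headers: Mapping[str, str], body: bytes) -> Any:
    """Worker side: decode with msgspec only when the flag header is set."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    if lowered.get(MSGPACK_HEADER) == "1":
        import msgspec  # lazy import, only on the opt-in path

        return msgspec.msgpack.decode(body)
    return json.loads(body)
```

Keeping the JSON content type means the framework still routes the body through its JSON parsing hook, and the extra header tells that hook which decoder to use.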