[DeepSeek-V4] Overlap scheduler + chunked prefill deadlocks in sparse-MLA ctx metadata (device→host sync at _compute_ctx_compressed_position_ids) #15684

Description

@Thachnh

System Info

Summary

On DeepSeek-V4 (sparse-MLA), enabling the overlap scheduler together with chunked prefill deadlocks the engine. It serves a handful of requests, then hangs permanently — GPU utilization drops to 0%, the engine stops making progress, and the server stops responding (including /health). The container stays up; it is a hang, not a crash.

The only stable configuration we have found is disable_overlap_scheduler: true. This costs the ~10–20% throughput/ITL the overlap scheduler would otherwise provide, so a fix would be valuable.

Reproduction

Serve DeepSeek-V4-Flash with the overlap scheduler enabled and chunked prefill on:

trtllm-serve <DeepSeek-V4-Flash> --backend pytorch \
  --tp_size 1 --max_seq_len 131072 --max_num_tokens 16384 \
  --max_batch_size 128 --enable_chunked_prefill \
  --extra_llm_api_options '{
    "disable_overlap_scheduler": false,
    "enable_chunked_prefill": true,
    "kv_cache_config": {"dtype": "fp8", "tokens_per_block": 128, "enable_block_reuse": true},
    "scheduler_config": {"capacity_scheduler_policy": "MAX_UTILIZATION"},
    "speculative_config": {"decoding_type": "MTP", "num_nextn_predict_layers": 2},
    "cuda_graph_config": {"enable_padding": true},
    "moe_config": {"backend": "TRTLLM"}
  }'

Send a few concurrent /v1/completions requests with mixed prompt lengths (enough to engage chunked prefill and a not-yet-warmed sparse-MLA kernel shape).

With disable_overlap_scheduler: true the exact same workload runs indefinitely. Flipping only that one flag is the difference between healthy and hung.
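The request pattern described above can be driven with a short client. This is a minimal sketch, not the script used for the report. The base URL, the model name, the prompt sizes, and the helper names (`build_payloads`, `post_completion`, `health_ok`, `run`) are all assumptions for illustration. Prompt lengths are counted in repeated words, which only roughly tracks token count.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:8000"  # assumption: local trtllm-serve address
MODEL = "DeepSeek-V4-Flash"         # assumption: served model name


def build_payloads(prompt_word_counts, max_tokens=256):
    """Build one /v1/completions body per prompt length (in repeated words)."""
    return [
        {"model": MODEL, "prompt": "hello " * n, "max_tokens": max_tokens, "stream": False}
        for n in prompt_word_counts
    ]


def post_completion(payload, timeout_s=120.0):
    """POST one completion request and return the HTTP status code."""
    req = urllib.request.Request(
        BASE_URL + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return resp.status


def health_ok(timeout_s=5.0):
    """Return True if /health answers 200 within the timeout, else False."""
    try:
        with urllib.request.urlopen(BASE_URL + "/health", timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:  # covers connection errors, HTTP errors, and timeouts
        return False


def run(prompt_word_counts=(200, 4000, 12000, 30000), workers=4):
    """Send the mixed-length prompts concurrently, then probe /health once."""
    payloads = build_payloads(prompt_word_counts)
    statuses = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(post_completion, p) for p in payloads]
        for fut in futures:
            try:
                statuses.append(fut.result())
            except OSError as exc:
                statuses.append(repr(exc))
    return statuses, health_ok()
```

Calling `run()` in a loop against the overlap-enabled server should eventually produce request timeouts followed by `health_ok()` returning `False`, if the hang reproduces as described. The longest prompts are meant to exceed `--max_num_tokens 16384` so that chunked prefill is engaged.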

Observed behavior

The engine serves ~2 requests, then wedges. Captured signature right at the hang:

[TRT-LLM] [I] [serve] Completion request: {... "max_tokens": 16384, "stream": true ...}
Stopping polling loop
"POST /v1/completions HTTP/1.1" 200 OK
[TRT-LLM] [W] [_torch] [AutoTuner] trtllm::mhc_pre_mapping using the fallback tactic, due to cache miss on input shapes=(torch.Size([9817, 16384]), ...)
[TRT-LLM] [W] [_torch] [AutoTuner] trtllm::mhc_fused_hc using the fallback tactic, due to cache miss on input shapes=(torch.Size([9817, 4096]), ...)
[TRT-LLM] [I] [serve] Completion request: {...}
Stopping polling loop
"POST /v1/completions HTTP/1.1" 200 OK
<no further output: engine wedged, GPU at 0% util, /health times out>

  • GPU util = 0%, memory still resident → the engine is blocked waiting, not computing.
  • A mid-serving AutoTuner cache-miss/fallback on the sparse-MLA mhc_* kernels (mhc_pre_mapping, mhc_fused_hc) immediately precedes the wedge — i.e. the hang is correlated with first-touch work on the sparse-MLA hybrid-cache path while overlap has the next step in flight.
  • /health and all in-flight requests hang afterward.

Engine-worker stack at the hang (py-spy). The executor worker thread is active+gil — blocked holding the GIL inside the sparse-MLA context-metadata prep, driven by the overlap executor loop:

Thread (active+gil): engine worker
    _compute_ctx_compressed_position_ids   (sparse/deepseek_v4/deepseek_v4.py:929)   <-- stuck here
    prepare_compressed_kv_metadata         (sparse/deepseek_v4/deepseek_v4.py:734)
    prepare                                (sparse/deepseek_v4/deepseek_v4.py:694)
    _prepare_tp_inputs                     (pyexecutor/model_engine.py:3196)
    _prepare_inputs                        (pyexecutor/model_engine.py:3886)
    forward                                (pyexecutor/model_engine.py:4101)
    _forward_step                          (pyexecutor/py_executor.py:4031)
    _executor_loop_overlap                 (pyexecutor/py_executor.py:2937)   <-- overlap loop
    _event_loop_wrapper                    (pyexecutor/py_executor.py:809)

Everything else (the proxy's kv_event_processor, /health, all in-flight requests) is then blocked waiting on this worker, which is why the whole server goes unresponsive.

Root cause (pinpointed)

The worker is stuck at _compute_ctx_compressed_position_ids line 929, which sizes a tensor from a device scalar:

# _compute_ctx_compressed_position_ids — "Context-only compressed position IDs (eager, data-dependent shapes)."
total_ctx_comp = cu_new_comp[num_contexts]                              # 0-dim CUDA tensor
ctx_idx = torch.arange(total_ctx_comp, dtype=torch.int32, device=device)  # <-- line 929: arange(end=CUDA scalar)

torch.arange(end=<CUDA tensor>) must read total_ctx_comp host-side to determine the output length → an implicit device→host sync on a data-dependent shape, issued from inside _prepare_inputs (called from forward, per the stack above). The same prepare path (prepare_compressed_kv_metadata) also does explicit .item() syncs in the prefill branch (cu_new[num_contexts-1].item(), …max().item(), the gen-offset …[num_contexts].item()).
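To make the implicit sync concrete, here is a minimal sketch contrasting the pattern at line 929 with one possible sync-free shape. It is not the TensorRT-LLM code and not a proposed patch. The function names and the `max_total` bound are hypothetical, and the padded variant assumes the caller knows a static upper bound on the host (for example one derived from max_num_tokens) and that downstream code can consume a validity mask instead of a compacted tensor.

```python
import torch


def ctx_indices_synced(cu_new_comp: torch.Tensor, num_contexts: int) -> torch.Tensor:
    """Pattern from the issue: the output length comes from a 0-dim tensor.

    On a CUDA tensor, arange has to copy the scalar to the host to size the
    result, which is an implicit device-to-host sync.
    """
    total_ctx_comp = cu_new_comp[num_contexts]  # 0-dim tensor
    return torch.arange(total_ctx_comp, dtype=torch.int32, device=cu_new_comp.device)


def ctx_indices_padded(cu_new_comp: torch.Tensor, num_contexts: int, max_total: int):
    """Hypothetical alternative: static length plus an on-device validity mask.

    max_total is a Python int known on the host, so no device value is read
    to size the tensor. The comparison below runs on the tensor's own device.
    """
    total_ctx_comp = cu_new_comp[num_contexts]  # stays on device
    idx = torch.arange(max_total, dtype=torch.int32, device=cu_new_comp.device)
    valid = idx < total_ctx_comp  # True for the first total_ctx_comp entries
    return idx, valid
```

The trade-off is extra work on padded entries in exchange for a shape that does not depend on device data. Compacting with `idx[valid]` would reintroduce a data-dependent shape, so consumers would need to apply the mask instead. When hunting for other implicit syncs on a CUDA build, `torch.cuda.set_sync_debug_mode("warn")` may help surface them.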

Under _executor_loop_overlap, the scheduler launches step N+1's work on the overlap stream while step N's _prepare_inputs performs this host sync. Chunked prefill re-enters the context path on every chunk, so the hazard recurs until the prepare-side sync and the overlap-stream work deadlock — consistent with the worker sitting active+gil in torch.arange while GPU utilization is 0%. CUDA_LAUNCH_BLOCKING=1 hides the hang, which also points to a cross-stream ordering/sync issue.

Note the gen path was deliberately hardened against exactly this — comments such as "pre-extracted by the caller to avoid tensor-scalar .item() inside the compiled function" and a graph-safe variant exist — but the context/prefill path was not, and that's where it hangs.
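The "pre-extracted by the caller" idea can be sketched for the context path as well. This assumes the per-request compressed lengths are already available as Python ints on the host before launch, which the issue does not confirm. The helper names and the `new_comp_lens` argument are hypothetical.

```python
from itertools import accumulate


def host_cumulative_lengths(new_comp_lens):
    """Exclusive-prefix-style cumulative sums, computed entirely on the host.

    new_comp_lens: per-request compressed lengths as Python ints (assumption).
    Returns [0, l0, l0 + l1, ...], mirroring a cu_new_comp layout.
    """
    return [0, *accumulate(new_comp_lens)]


def total_ctx_comp_on_host(new_comp_lens, num_contexts):
    """Total compressed length over the first num_contexts requests, as an int."""
    return host_cumulative_lengths(new_comp_lens)[num_contexts]
```

Because the result is a plain Python int, passing it as the `end` of `torch.arange` sizes the output without reading any device memory. Whether the scheduler actually holds these lengths host-side at that point is something only the maintainers can confirm.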

Expected behavior

Overlap scheduler + chunked prefill should run without deadlock on DeepSeek-V4, as it does for dense models. The context-phase compressed-position/length metadata should be computed without host syncs on the hot path (mirroring the gen-path fix), or the overlap scheduler should not launch the dependent step before the metadata sync resolves.

Workaround

Set disable_overlap_scheduler: true (what we run in production). Stable, at a throughput cost.
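For A/B runs it can help to generate both option sets from one source so that only the single flag differs. The sketch below copies the options from the reproduction command; the helper names `with_overlap` and `to_cli_arg` are made up for illustration.

```python
import copy
import json

# Options copied from the reproduction command in this issue.
REPRO_OPTIONS = {
    "disable_overlap_scheduler": False,
    "enable_chunked_prefill": True,
    "kv_cache_config": {"dtype": "fp8", "tokens_per_block": 128, "enable_block_reuse": True},
    "scheduler_config": {"capacity_scheduler_policy": "MAX_UTILIZATION"},
    "speculative_config": {"decoding_type": "MTP", "num_nextn_predict_layers": 2},
    "cuda_graph_config": {"enable_padding": True},
    "moe_config": {"backend": "TRTLLM"},
}


def with_overlap(options, enabled):
    """Return a copy of options with only the overlap-scheduler flag changed."""
    out = copy.deepcopy(options)
    out["disable_overlap_scheduler"] = not enabled
    return out


def to_cli_arg(options):
    """Serialize options as compact JSON for --extra_llm_api_options."""
    return json.dumps(options, separators=(",", ":"))
```

`to_cli_arg(with_overlap(REPRO_OPTIONS, enabled=False))` yields the workaround configuration, and `enabled=True` yields the hanging one, with every other setting identical.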

Metadata

Labels: Pytorch<NV> (Pytorch backend related issues)