System Info
Summary
On DeepSeek-V4 (sparse-MLA), enabling the overlap scheduler together with chunked prefill deadlocks the engine. It serves a handful of requests, then hangs permanently — GPU utilization drops to 0%, the engine stops making progress, and the server stops responding (including /health). The container stays up; it is a hang, not a crash.
The only stable configuration we have found is disable_overlap_scheduler: true. This costs the ~10–20% throughput/ITL the overlap scheduler would otherwise provide, so a fix would be valuable.
Reproduction
Serve DeepSeek-V4-Flash with the overlap scheduler enabled and chunked prefill on:
trtllm-serve <DeepSeek-V4-Flash> --backend pytorch \
--tp_size 1 --max_seq_len 131072 --max_num_tokens 16384 \
--max_batch_size 128 --enable_chunked_prefill \
--extra_llm_api_options '{
"disable_overlap_scheduler": false,
"enable_chunked_prefill": true,
"kv_cache_config": {"dtype": "fp8", "tokens_per_block": 128, "enable_block_reuse": true},
"scheduler_config": {"capacity_scheduler_policy": "MAX_UTILIZATION"},
"speculative_config": {"decoding_type": "MTP", "num_nextn_predict_layers": 2},
"cuda_graph_config": {"enable_padding": true},
"moe_config": {"backend": "TRTLLM"}
}'
Send a few concurrent /v1/completions requests with mixed prompt lengths (enough to engage chunked prefill and a not-yet-warmed sparse-MLA kernel shape).
With disable_overlap_scheduler: true the exact same workload runs indefinitely. Flipping only that one flag is the difference between healthy and hung.
Observed behavior
The engine serves ~2 requests, then wedges. Captured signature right at the hang:
[TRT-LLM] [I] [serve] Completion request: {... "max_tokens": 16384, "stream": true ...}
Stopping polling loop
"POST /v1/completions HTTP/1.1" 200 OK
[TRT-LLM] [W] [_torch] [AutoTuner] trtllm::mhc_pre_mapping using the fallback tactic, due to cache miss on input shapes=(torch.Size([9817, 16384]), ...)
[TRT-LLM] [W] [_torch] [AutoTuner] trtllm::mhc_fused_hc using the fallback tactic, due to cache miss on input shapes=(torch.Size([9817, 4096]), ...)
[TRT-LLM] [I] [serve] Completion request: {...}
Stopping polling loop
"POST /v1/completions HTTP/1.1" 200 OK
<no further output — engine wedged, GPU at 0% util, /health times out>
- GPU util = 0%, memory still resident → the engine is blocked waiting, not computing.
- A mid-serving AutoTuner cache-miss/fallback on the sparse-MLA
mhc_* kernels (mhc_pre_mapping, mhc_fused_hc) immediately precedes the wedge — i.e. the hang is correlated with first-touch work on the sparse-MLA hybrid-cache path while overlap has the next step in flight.
/health and all in-flight requests hang afterward.
Engine-worker stack at the hang (py-spy). The executor worker thread is active+gil — blocked holding the GIL inside the sparse-MLA context-metadata prep, driven by the overlap executor loop:
Thread (active+gil): engine worker
_compute_ctx_compressed_position_ids (sparse/deepseek_v4/deepseek_v4.py:929) <-- stuck here
prepare_compressed_kv_metadata (sparse/deepseek_v4/deepseek_v4.py:734)
prepare (sparse/deepseek_v4/deepseek_v4.py:694)
_prepare_tp_inputs (pyexecutor/model_engine.py:3196)
_prepare_inputs (pyexecutor/model_engine.py:3886)
forward (pyexecutor/model_engine.py:4101)
_forward_step (pyexecutor/py_executor.py:4031)
_executor_loop_overlap (pyexecutor/py_executor.py:2937) <-- overlap loop
_event_loop_wrapper (pyexecutor/py_executor.py:809)
Everything else (the proxy's kv_event_processor, /health, all in-flight requests) is then blocked waiting on this worker, which is why the whole server goes unresponsive.
Root cause (pinpointed)
The worker is stuck at _compute_ctx_compressed_position_ids line 929, which sizes a tensor from a device scalar:
# _compute_ctx_compressed_position_ids — "Context-only compressed position IDs (eager, data-dependent shapes)."
total_ctx_comp = cu_new_comp[num_contexts] # 0-dim CUDA tensor
ctx_idx = torch.arange(total_ctx_comp, dtype=torch.int32, device=device) # <-- line 929: arange(end=CUDA scalar)
torch.arange(end=<CUDA tensor>) must read total_ctx_comp host-side to determine the output length → an implicit device→host sync on a data-dependent shape, issued from inside _prepare_inputs → forward. The same prepare path (prepare_compressed_kv_metadata) also does explicit .item() syncs in the prefill branch (cu_new[num_contexts-1].item(), …max().item(), the gen-offset …[num_contexts].item()).
Under _executor_loop_overlap, the scheduler launches step N+1's work on the overlap stream while step N's _prepare_inputs performs this host sync. Chunked prefill re-enters the context path on every chunk, so the hazard recurs until the prepare-side sync and the overlap-stream work deadlock — consistent with the worker sitting active+gil in torch.arange while GPU utilization is 0%. CUDA_LAUNCH_BLOCKING=1 hides the hang, which also points to a cross-stream ordering/sync issue.
Note the gen path was deliberately hardened against exactly this — comments such as "pre-extracted by the caller to avoid tensor-scalar .item() inside the compiled function" and a graph-safe variant exist — but the context/prefill path was not, and that's where it hangs.
Expected behavior
Overlap scheduler + chunked prefill should run without deadlock on DeepSeek-V4, as it does for dense models. The context-phase compressed-position/length metadata should be computed without host syncs on the hot path (mirroring the gen-path fix), or the overlap scheduler should not launch the dependent step before the metadata sync resolves.
Workaround
Set disable_overlap_scheduler: true (what we run in production). Stable, at a throughput cost.
System Info
feat/deepseek_v4). The deadlocking code path below is upstream sparse-MLA code (tensorrt_llm/_torch/attention_backend/sparse/deepseek_v4/deepseek_v4.py).deepseek-ai/DeepSeek-V4-Flash(sparse-MLA + MTP), TP1,max_seq_len=131072Summary
On DeepSeek-V4 (sparse-MLA), enabling the overlap scheduler together with chunked prefill deadlocks the engine. It serves a handful of requests, then hangs permanently — GPU utilization drops to 0%, the engine stops making progress, and the server stops responding (including
/health). The container stays up; it is a hang, not a crash.The only stable configuration we have found is
disable_overlap_scheduler: true. This costs the ~10–20% throughput/ITL the overlap scheduler would otherwise provide, so a fix would be valuable.Reproduction
Serve DeepSeek-V4-Flash with the overlap scheduler enabled and chunked prefill on:
Send a few concurrent
/v1/completionsrequests with mixed prompt lengths (enough to engage chunked prefill and a not-yet-warmed sparse-MLA kernel shape).With
disable_overlap_scheduler: truethe exact same workload runs indefinitely. Flipping only that one flag is the difference between healthy and hung.Observed behavior
The engine serves ~2 requests, then wedges. Captured signature right at the hang:
mhc_*kernels (mhc_pre_mapping,mhc_fused_hc) immediately precedes the wedge — i.e. the hang is correlated with first-touch work on the sparse-MLA hybrid-cache path while overlap has the next step in flight./healthand all in-flight requests hang afterward.Engine-worker stack at the hang (py-spy). The executor worker thread is
active+gil— blocked holding the GIL inside the sparse-MLA context-metadata prep, driven by the overlap executor loop:Everything else (the proxy's
kv_event_processor,/health, all in-flight requests) is then blocked waiting on this worker, which is why the whole server goes unresponsive.Root cause (pinpointed)
The worker is stuck at
_compute_ctx_compressed_position_idsline 929, which sizes a tensor from a device scalar:torch.arange(end=<CUDA tensor>)must readtotal_ctx_comphost-side to determine the output length → an implicit device→host sync on a data-dependent shape, issued from inside_prepare_inputs→forward. The same prepare path (prepare_compressed_kv_metadata) also does explicit.item()syncs in the prefill branch (cu_new[num_contexts-1].item(),…max().item(), the gen-offset…[num_contexts].item()).Under
_executor_loop_overlap, the scheduler launches step N+1's work on the overlap stream while step N's_prepare_inputsperforms this host sync. Chunked prefill re-enters the context path on every chunk, so the hazard recurs until the prepare-side sync and the overlap-stream work deadlock — consistent with the worker sittingactive+gilintorch.arangewhile GPU utilization is 0%.CUDA_LAUNCH_BLOCKING=1hides the hang, which also points to a cross-stream ordering/sync issue.Note the gen path was deliberately hardened against exactly this — comments such as "pre-extracted by the caller to avoid tensor-scalar
.item()inside the compiled function" and a graph-safe variant exist — but the context/prefill path was not, and that's where it hangs.Expected behavior
Overlap scheduler + chunked prefill should run without deadlock on DeepSeek-V4, as it does for dense models. The context-phase compressed-position/length metadata should be computed without host syncs on the hot path (mirroring the gen-path fix), or the overlap scheduler should not launch the dependent step before the metadata sync resolves.
Workaround
Set
disable_overlap_scheduler: true(what we run in production). Stable, at a throughput cost.