[DeepSeek-V4] Overlap scheduler + chunked prefill deadlocks in sparse-MLA ctx metadata (device→host sync at _compute_ctx_compressed_position_ids) #15684

### System Info

- **GPU**: 1x NVIDIA B300 (also reproduced earlier on B300 TP2/TP4)
- **TensorRT-LLM**: PyTorch backend, build based on the DeepSeek-V4 support branch (≈ v1.3.0rc / `feat/deepseek_v4`). The deadlocking code path below is upstream sparse-MLA code (`tensorrt_llm/_torch/attention_backend/sparse/deepseek_v4/deepseek_v4.py`).
- **Model**: `deepseek-ai/DeepSeek-V4-Flash` (sparse-MLA + MTP), TP1, `max_seq_len=131072`
- **Related**: companion to #15639 (same model, different bug). This one is the **overlap-scheduler deadlock**.

### Summary

On DeepSeek-V4 (sparse-MLA), enabling the **overlap scheduler together with chunked prefill** deadlocks the engine. It serves a few requests (about two in our runs), then **hangs permanently**: GPU utilization drops to **0%**, the engine stops making progress, and the server stops responding (including `/health`). The container stays up; it is a hang, not a crash.

The only stable configuration we have found is `disable_overlap_scheduler: true`. This forfeits the ~10–20% throughput/ITL improvement the overlap scheduler would otherwise provide, so a fix would be valuable.

### Reproduction

Serve DeepSeek-V4-Flash with the overlap scheduler **enabled** and chunked prefill on:

```bash
trtllm-serve <DeepSeek-V4-Flash> --backend pytorch \
  --tp_size 1 --max_seq_len 131072 --max_num_tokens 16384 \
  --max_batch_size 128 --enable_chunked_prefill \
  --extra_llm_api_options '{
    "disable_overlap_scheduler": false,
    "enable_chunked_prefill": true,
    "kv_cache_config": {"dtype": "fp8", "tokens_per_block": 128, "enable_block_reuse": true},
    "scheduler_config": {"capacity_scheduler_policy": "MAX_UTILIZATION"},
    "speculative_config": {"decoding_type": "MTP", "num_nextn_predict_layers": 2},
    "cuda_graph_config": {"enable_padding": true},
    "moe_config": {"backend": "TRTLLM"}
  }'
```

Send a few concurrent `/v1/completions` requests with mixed prompt lengths (enough to engage chunked prefill and to hit a sparse-MLA kernel shape the AutoTuner has not yet warmed).

**With `disable_overlap_scheduler: true` the exact same workload runs indefinitely.** Flipping only that one flag is the difference between healthy and hung.
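
A minimal load-generator sketch for the request step above. The base URL, the prompt sizes, and the helper names are assumptions and not part of the original setup; `model` must match whatever name the server registered. The sketch sends non-streaming requests for simplicity, whereas the captured log shows `"stream": true`.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:8000"  # assumption: adjust to where trtllm-serve listens


def build_payloads(model, word_counts, max_tokens=256):
    """Build /v1/completions bodies with deliberately mixed prompt lengths."""
    return [
        {
            "model": model,
            # Repeating a short word is a crude way to control prompt length.
            "prompt": " ".join(["hello"] * n),
            "max_tokens": max_tokens,
            "stream": False,
        }
        for n in word_counts
    ]


def send(payload, timeout=120.0):
    """POST one completion request; return the HTTP status or the error text."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            resp.read()
            return resp.status
    except Exception as exc:  # a timeout here is the symptom being reproduced
        return repr(exc)


def run(model, word_counts=(300, 9000, 1200, 20000, 600, 15000), workers=6):
    """Fire all requests concurrently and collect one result per request."""
    payloads = build_payloads(model, word_counts)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, payloads))
```

Calling `run("<model name>")` a few times in a row should be enough to tell the two configurations apart: a healthy server returns `200` for every request, while a wedged one returns timeout errors.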

### Observed behavior

The engine serves ~2 requests, then wedges. Captured signature right at the hang:

```
[TRT-LLM] [I] [serve] Completion request: {... "max_tokens": 16384, "stream": true ...}
Stopping polling loop
"POST /v1/completions HTTP/1.1" 200 OK
[TRT-LLM] [W] [_torch] [AutoTuner] trtllm::mhc_pre_mapping using the fallback tactic, due to cache miss on input shapes=(torch.Size([9817, 16384]), ...)
[TRT-LLM] [W] [_torch] [AutoTuner] trtllm::mhc_fused_hc using the fallback tactic, due to cache miss on input shapes=(torch.Size([9817, 4096]), ...)
[TRT-LLM] [I] [serve] Completion request: {...}
Stopping polling loop
"POST /v1/completions HTTP/1.1" 200 OK
<no further output — engine wedged, GPU at 0% util, /health times out>
```

- **GPU util = 0%**, memory still resident → the engine is blocked waiting, not computing.
- A **mid-serving AutoTuner cache-miss/fallback on the sparse-MLA `mhc_*` kernels** (`mhc_pre_mapping`, `mhc_fused_hc`) immediately precedes the wedge — i.e. the hang is correlated with first-touch work on the sparse-MLA hybrid-cache path while overlap has the next step in flight.
- `/health` and all in-flight requests hang afterward.

**Engine-worker stack at the hang (py-spy).** The executor worker thread is `active+gil` — blocked holding the GIL inside the sparse-MLA context-metadata prep, driven by the overlap executor loop:

```
Thread (active+gil): engine worker
    _compute_ctx_compressed_position_ids   (sparse/deepseek_v4/deepseek_v4.py:929)   <-- stuck here
    prepare_compressed_kv_metadata         (sparse/deepseek_v4/deepseek_v4.py:734)
    prepare                                (sparse/deepseek_v4/deepseek_v4.py:694)
    _prepare_tp_inputs                     (pyexecutor/model_engine.py:3196)
    _prepare_inputs                        (pyexecutor/model_engine.py:3886)
    forward                                (pyexecutor/model_engine.py:4101)
    _forward_step                          (pyexecutor/py_executor.py:4031)
    _executor_loop_overlap                 (pyexecutor/py_executor.py:2937)   <-- overlap loop
    _event_loop_wrapper                    (pyexecutor/py_executor.py:809)
```

Everything else (the proxy's `kv_event_processor`, `/health`, all in-flight requests) is then blocked waiting on this worker, which is why the whole server goes unresponsive.
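
Because the server stays up but stops answering, the hang is easiest to catch with a small watchdog. The sketch below polls `/health` and returns once several consecutive probes fail; the function names, the interval, and the failure threshold are illustrative choices, not anything from the TensorRT-LLM tooling.

```python
import time
import urllib.request


def is_healthy(url, timeout=5.0):
    """Return True only if GET `url` answers with HTTP 200 within `timeout`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:  # refused connection, timeout, or non-2xx status
        return False


def wait_for_hang(url, interval=10.0, max_failures=3, probe=is_healthy):
    """Poll until `max_failures` consecutive probes fail; return elapsed seconds."""
    start = time.monotonic()
    failures = 0
    while failures < max_failures:
        # Any successful probe resets the streak of failures.
        failures = 0 if probe(url) else failures + 1
        if failures < max_failures:
            time.sleep(interval)
    return time.monotonic() - start
```

Once `wait_for_hang("http://localhost:8000/health")` returns (the URL is an assumption), running `py-spy dump --pid <worker pid>` against the engine worker captures a stack like the one shown above.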

### Root cause (pinpointed)

The worker is stuck at `_compute_ctx_compressed_position_ids` **line 929**, which sizes a tensor from a **device scalar**:

```python
# _compute_ctx_compressed_position_ids — "Context-only compressed position IDs (eager, data-dependent shapes)."
total_ctx_comp = cu_new_comp[num_contexts]                              # 0-dim CUDA tensor
ctx_idx = torch.arange(total_ctx_comp, dtype=torch.int32, device=device)  # <-- line 929: arange(end=CUDA scalar)
```

`torch.arange(end=<CUDA tensor>)` must read `total_ctx_comp` **host-side** to determine the output length → an **implicit device→host sync on a data-dependent shape**, issued from inside `_prepare_inputs` → `forward`. The same prepare path (`prepare_compressed_kv_metadata`) also does explicit `.item()` syncs in the prefill branch (`cu_new[num_contexts-1].item()`, `…max().item()`, the gen-offset `…[num_contexts].item()`).
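
The implicit sync can be checked in isolation. The sketch below reproduces the `arange` pattern outside the engine and uses PyTorch's sync debug mode, which should raise on synchronizing CUDA calls when set to `"error"`. The helper names are made up for illustration, and on a machine without CUDA the check simply reports `None`.

```python
import torch


def arange_from_scalar(total, device):
    """Mirror the pattern at line 929: `end` is a 0-dim tensor, not a Python int."""
    return torch.arange(total, dtype=torch.int32, device=device)


def triggers_host_sync(fn):
    """Return True if `fn` makes a synchronizing CUDA call, None without CUDA."""
    if not torch.cuda.is_available():
        return None
    previous = torch.cuda.get_sync_debug_mode()
    torch.cuda.set_sync_debug_mode("error")  # synchronizing calls now raise
    try:
        fn()
        return False
    except RuntimeError:
        return True
    finally:
        torch.cuda.set_sync_debug_mode(previous)  # restore the caller's setting


if torch.cuda.is_available():
    total = torch.tensor(5, device="cuda")  # stands in for cu_new_comp[num_contexts]
    print(triggers_host_sync(lambda: arange_from_scalar(total, total.device)))
```

Setting the mode to `"warn"` instead of `"error"` around a full `prepare()` call is one way to list the remaining sync points in the prefill branch without aborting at the first one.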

Under `_executor_loop_overlap`, the scheduler launches step N+1's work on the overlap stream while step N's `_prepare_inputs` performs this host sync. **Chunked prefill re-enters the context path on every chunk**, so the hazard recurs until the prepare-side sync and the overlap-stream work deadlock — consistent with the worker sitting `active+gil` in `torch.arange` while **GPU utilization is 0%**. `CUDA_LAUNCH_BLOCKING=1` hides the hang, which also points to a cross-stream ordering/sync issue.

Note the **gen path was deliberately hardened against exactly this** — comments such as *"pre-extracted by the caller to avoid tensor-scalar `.item()` inside the compiled function"* and a graph-safe variant exist — but the **context/prefill path was not**, and that's where it hangs.

### Expected behavior

Overlap scheduler + chunked prefill should run without deadlock on DeepSeek-V4, as it does for dense models. The context-phase compressed-position/length metadata should be computed without host syncs on the hot path (mirroring the gen-path fix), or the overlap scheduler should not launch the dependent step before the metadata sync resolves.
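
One way to read "computed without host syncs" is sketched below: derive the compressed-token counts from sequence lengths that are already available on the host, so the `arange` length is a plain Python int. The compression rule (`compress_ratio`, one compressed token per full block) and all function and parameter names are hypothetical; the real rule lives in `deepseek_v4.py` and may differ.

```python
import torch


def ctx_compressed_totals(cached_lens, new_lens, compress_ratio):
    """Host-side cumulative count of newly produced compressed tokens per context.

    Hypothetical rule: a request holding `n` tokens has `n // compress_ratio`
    compressed tokens, so a chunk adds the difference before and after it.
    """
    cu = [0]
    for cached, new in zip(cached_lens, new_lens):
        added = (cached + new) // compress_ratio - cached // compress_ratio
        cu.append(cu[-1] + added)
    return cu  # plain Python ints, no device tensor involved


def ctx_index(cached_lens, new_lens, compress_ratio, device):
    """Build the context index with a length known on the host."""
    cu = ctx_compressed_totals(cached_lens, new_lens, compress_ratio)
    total = cu[-1]  # Python int: torch.arange needs no device-to-host read
    return torch.arange(total, dtype=torch.int32, device=device), cu


# Two context requests: a fresh 10-token prompt and a 5-token chunk after 6 cached tokens.
idx, cu = ctx_index([0, 6], [10, 5], compress_ratio=4, device="cpu")
```

The design point is only that the length passed to `torch.arange` comes from host data the caller already has, which mirrors the "pre-extracted by the caller" approach quoted for the gen path.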

### Workaround

Set `disable_overlap_scheduler: true` (what we run in production). This is stable, but it gives up the ~10–20% throughput/ITL the overlap scheduler would otherwise provide.

