[None][test] Dump sibling worker logs from disagg BENCHMARK pytest by chenfeiz0326 · Pull Request #15912 · NVIDIA/TensorRT-LLM

chenfeiz0326 · 2026-07-03T08:28:36Z

Summary

For disagg perf-sanity tests, CI only surfaces the BENCHMARK srun step's stdout, so failures in the CTX/GEN/DISAGG_SERVER workers or their nested trtllm-serve subprocesses require SSHing to the cluster to fetch log files. This PR streams every sibling log (gen_server_*.log, ctx_server_*.log, disagg_server.log, trtllm-serve.CTX_*.log, trtllm-serve.GEN_*.log, trtllm-serve.DISAGG_SERVER.*.log) into BENCHMARK's stdout at teardown, so a single CI log has everything needed for debugging.

Coordination (option C from design)

Each worker (CTX_x / GEN_x / DISAGG_SERVER) writes a pytest_done.<TYPE>.<server_idx>.txt marker at the end of its test_e2e (in a finally — runs on both success and failure).
BENCHMARK polls for all expected markers, falling back to log-file size stability (3 consecutive same-size 3s polls) so a crashed worker that never wrote a marker doesn't block. Bounded by a 180s max wait — BENCHMARK never hangs.
After detection, a short time.sleep(3) covers the srun &> file redirect flush.
Full log contents are then streamed to stdout with clear ----- BEGIN/END ----- fences.

Runs in test_e2e's finally block so logs are dumped on both success and failure. report_error's existing error-tail (baked into the exception message) is untouched.

Only active for runtime == "multi_node_disagg_server"; aggregated tests are unaffected.

Test plan

Run a disagg e2e perf sanity test (single ctx, single gen) in CI; confirm BENCHMARK's stdout contains full CTX/GEN/DISAGG_SERVER worker logs and their nested trtllm-serve.* logs after the benchmark result summary.
Run a disagg gen_only test; confirm CTX logs are correctly skipped (no ctx workers spawned).
Simulate a worker crash (e.g. force an early failure in CTX_0); confirm BENCHMARK falls back to size-stability path and still dumps available logs within the 180s cap.
Confirm aggregated perf sanity tests (aggr_upload-* and aggr_upload-ctx_only-*) are unaffected.

Summary by CodeRabbit

Bug Fixes
- Improved reliability for distributed performance runs by better coordinating completion across worker processes.
- Added safer handling for logs and completion signals so results are collected more consistently, even when some workers finish at different times.

For disagg perf-sanity tests, CI only surfaces the BENCHMARK srun step's stdout, so failures in the CTX/GEN/DISAGG_SERVER workers or their nested trtllm-serve subprocesses require SSHing to fetch log files. Stream those sibling logs into BENCHMARK's stdout at teardown so a single CI log has everything needed for debugging. Coordination: - Each worker (CTX_x / GEN_x / DISAGG_SERVER) writes a pytest_done marker at the end of its test_e2e. - BENCHMARK polls for all expected markers, falling back to log-file size stability so a crashed worker doesn't block, bounded by a 180s max wait. - After detection, a short flush sleep covers the final srun `&> file` redirect flush, then full log contents are streamed to stdout with clear BEGIN/END fences. Runs in the test_e2e finally block so logs are dumped on both success and failure. Signed-off-by: chenfeiz <chenfeiz@nvidia.com>

coderabbitai · 2026-07-03T08:35:30Z

📝 Walkthrough

Walkthrough

This PR adds sibling-log coordination for disaggregated 4-srun perf sanity tests. New constants and helper functions compute per-worker marker paths, write completion markers, wait for sibling completion via markers or log-size stability, and dump sibling logs. The test_e2e flow is updated to use these helpers based on worker role.

Changes

Disagg Sibling Log Coordination

Layer / File(s)	Summary
Sibling log coordination utilities `tests/integration/defs/perf/test_perf_sanity.py`	Adds `sys` import and new constants/functions (`_pytest_done_marker_path`, `_expected_sibling_marker_paths`, `_write_pytest_done_marker`, `_wait_for_sibling_completion`, `_dump_sibling_logs`) for marker-based and log-stability-based completion detection and log dumping.
test_e2e disagg control flow integration `tests/integration/defs/perf/test_perf_sanity.py`	Updates `test_e2e` to detect disagg multi-node runs, skip perf parsing/upload for non-BENCHMARK workers, and in a `finally` block write done markers (non-BENCHMARK) or wait/sleep/dump sibling logs (BENCHMARK).

Estimated code review effort: 3 (Moderate) | ~25 minutes

Sequence Diagram(s)

sequenceDiagram
  participant NonBenchmarkWorker
  participant BenchmarkWorker
  participant Filesystem

  NonBenchmarkWorker->>Filesystem: write pytest_done marker file
  BenchmarkWorker->>Filesystem: poll for marker files
  BenchmarkWorker->>Filesystem: check sibling log size stability
  Filesystem-->>BenchmarkWorker: markers present or logs stable / timeout
  BenchmarkWorker->>BenchmarkWorker: sleep to allow stdout flush
  BenchmarkWorker->>Filesystem: read sibling log contents
  BenchmarkWorker->>BenchmarkWorker: dump logs to stdout

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title is concise and clearly describes the main change: streaming sibling worker logs from disagg BENCHMARK pytest runs.
Description check	✅ Passed	The description covers the issue, solution, scope, and test plan, matching the template well enough despite minor section differences.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

🧹 Nitpick comments (1)

tests/integration/defs/perf/test_perf_sanity.py (1)
2379-2385: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Narrow the broad except Exception to OSError.

Both this marker write and the log read at Line 2464 only perform filesystem/IO operations, so OSError captures the intended best-effort failure modes while still surfacing genuine programming errors. As per coding guidelines: "use specific exception types, not bare except" / "limit the except to the smallest set of errors possible".
♻️ Proposed change (apply similarly at Line 2464)
-    except Exception as e:
+    except OSError as e:
         print_info(f"Failed to write pytest_done marker: {e}")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/defs/perf/test_perf_sanity.py` around lines 2379 - 2385,
The filesystem-only failure handling around _pytest_done_marker_path and the
marker write/read logic is too broad; replace the generic except Exception
blocks with OSError so only I/O-related failures are swallowed. Update the
try/except in the pytest_done marker write path and apply the same narrowing to
the log read helper near the other mentioned location, keeping the existing
print_info fallback behavior intact.
Sources: Coding guidelines, Linters/SAST tools

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/integration/defs/perf/test_perf_sanity.py`:
- Around line 2379-2385: The filesystem-only failure handling around
_pytest_done_marker_path and the marker write/read logic is too broad; replace
the generic except Exception blocks with OSError so only I/O-related failures
are swallowed. Update the try/except in the pytest_done marker write path and
apply the same narrowing to the log read helper near the other mentioned
location, keeping the existing print_info fallback behavior intact.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 30443650-aefe-46d8-9562-62132f2f50a4

📥 Commits

Reviewing files that changed from the base of the PR and between ca9c439 and 0afe71d.

📒 Files selected for processing (1)

tests/integration/defs/perf/test_perf_sanity.py

Extract the pytest_done marker + log-dump helpers out of the already-large test_perf_sanity.py into tests/test_common/error_utils.py so the perf-sanity test module isn't further bloated by disagg-specific coordination code. Signed-off-by: chenfeiz <chenfeiz@nvidia.com>

Signed-off-by: chenfeiz <chenfeiz@nvidia.com>

chenfeiz0326 · 2026-07-03T09:15:18Z

/bot run --disable-fail-fast --stage-list "DGX_B200-8_GPUs-PyTorch-PerfSanity-Post-Merge-1,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-4,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-1"

tensorrt-cicd · 2026-07-03T09:15:41Z

PR_Github #57437 [ run ] triggered by Bot. Commit: 11aa7d4 Link to invocation

fredricz-20070104 · 2026-07-03T09:48:40Z

Hi @chenfeiz0326 Would you please dump all of the logs to a separate file instead of “slurm-*.out”. Since the slurm log file has its own business logic, it indicates whether the slurm job failed due to submission error or other issues. Use a separate file to track all of the disagg logs is a much better idea and more easy to work with LLMs and Agents.

tensorrt-cicd · 2026-07-03T09:53:35Z

PR_Github #57437 [ run ] completed with state FAILURE. Commit: 11aa7d4
/LLM/main/L0_MergeRequest_PR pipeline #46176 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

chenfeiz0326 requested a review from a team as a code owner July 3, 2026 08:28

github-actions Bot assigned chenfeiz0326 Jul 3, 2026

coderabbitai Bot reviewed Jul 3, 2026

View reviewed changes

chenfeiz0326 requested review from Shixiaowei02, fredricz-20070104 and ruodil July 3, 2026 09:02

[None][test] Shorten sibling-log wait timeout to 120s

11aa7d4

Signed-off-by: chenfeiz <chenfeiz@nvidia.com>

chenfeiz0326 requested a review from yufeiwu-nv July 3, 2026 09:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[None][test] Dump sibling worker logs from disagg BENCHMARK pytest#15912

[None][test] Dump sibling worker logs from disagg BENCHMARK pytest#15912
chenfeiz0326 wants to merge 3 commits into
NVIDIA:mainfrom
chenfeiz0326:feat/disagg-logs-in-benchmark

chenfeiz0326 commented Jul 3, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jul 3, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

chenfeiz0326 commented Jul 3, 2026

Uh oh!

tensorrt-cicd commented Jul 3, 2026

Uh oh!

fredricz-20070104 commented Jul 3, 2026

Uh oh!

tensorrt-cicd commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

chenfeiz0326 commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Coordination (option C from design)

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

chenfeiz0326 commented Jul 3, 2026

Uh oh!

tensorrt-cicd commented Jul 3, 2026

Uh oh!

fredricz-20070104 commented Jul 3, 2026

Uh oh!

tensorrt-cicd commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

chenfeiz0326 commented Jul 3, 2026 •

edited

Loading

coderabbitai Bot commented Jul 3, 2026 •

edited

Loading