Skip to content

[None][test] Dump sibling worker logs from disagg BENCHMARK pytest#15912

Open
chenfeiz0326 wants to merge 3 commits into
NVIDIA:mainfrom
chenfeiz0326:feat/disagg-logs-in-benchmark
Open

[None][test] Dump sibling worker logs from disagg BENCHMARK pytest#15912
chenfeiz0326 wants to merge 3 commits into
NVIDIA:mainfrom
chenfeiz0326:feat/disagg-logs-in-benchmark

Conversation

@chenfeiz0326

@chenfeiz0326 chenfeiz0326 commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator

Summary

For disagg perf-sanity tests, CI only surfaces the BENCHMARK srun step's stdout, so failures in the CTX/GEN/DISAGG_SERVER workers or their nested trtllm-serve subprocesses require SSHing to the cluster to fetch log files. This PR streams every sibling log (gen_server_*.log, ctx_server_*.log, disagg_server.log, trtllm-serve.CTX_*.log, trtllm-serve.GEN_*.log, trtllm-serve.DISAGG_SERVER.*.log) into BENCHMARK's stdout at teardown, so a single CI log has everything needed for debugging.

Coordination (option C from design)

  • Each worker (CTX_x / GEN_x / DISAGG_SERVER) writes a pytest_done.<TYPE>.<server_idx>.txt marker at the end of its test_e2e (in a finally — runs on both success and failure).
  • BENCHMARK polls for all expected markers, falling back to log-file size stability (3 consecutive same-size 3s polls) so a crashed worker that never wrote a marker doesn't block. Bounded by a 180s max wait — BENCHMARK never hangs.
  • After detection, a short time.sleep(3) covers the srun &> file redirect flush.
  • Full log contents are then streamed to stdout with clear ----- BEGIN/END ----- fences.

Runs in test_e2e's finally block so logs are dumped on both success and failure. report_error's existing error-tail (baked into the exception message) is untouched.

Only active for runtime == "multi_node_disagg_server"; aggregated tests are unaffected.

Test plan

  • Run a disagg e2e perf sanity test (single ctx, single gen) in CI; confirm BENCHMARK's stdout contains full CTX/GEN/DISAGG_SERVER worker logs and their nested trtllm-serve.* logs after the benchmark result summary.
  • Run a disagg gen_only test; confirm CTX logs are correctly skipped (no ctx workers spawned).
  • Simulate a worker crash (e.g. force an early failure in CTX_0); confirm BENCHMARK falls back to size-stability path and still dumps available logs within the 180s cap.
  • Confirm aggregated perf sanity tests (aggr_upload-* and aggr_upload-ctx_only-*) are unaffected.

Summary by CodeRabbit

  • Bug Fixes
    • Improved reliability for distributed performance runs by better coordinating completion across worker processes.
    • Added safer handling for logs and completion signals so results are collected more consistently, even when some workers finish at different times.

For disagg perf-sanity tests, CI only surfaces the BENCHMARK srun step's
stdout, so failures in the CTX/GEN/DISAGG_SERVER workers or their nested
trtllm-serve subprocesses require SSHing to fetch log files. Stream those
sibling logs into BENCHMARK's stdout at teardown so a single CI log has
everything needed for debugging.

Coordination:
  - Each worker (CTX_x / GEN_x / DISAGG_SERVER) writes a pytest_done marker
    at the end of its test_e2e.
  - BENCHMARK polls for all expected markers, falling back to log-file
    size stability so a crashed worker doesn't block, bounded by a 180s
    max wait.
  - After detection, a short flush sleep covers the final srun `&> file`
    redirect flush, then full log contents are streamed to stdout with
    clear BEGIN/END fences.

Runs in the test_e2e finally block so logs are dumped on both success
and failure.

Signed-off-by: chenfeiz <chenfeiz@nvidia.com>
@chenfeiz0326 chenfeiz0326 requested a review from a team as a code owner July 3, 2026 08:28
@coderabbitai

coderabbitai Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

This PR adds sibling-log coordination for disaggregated 4-srun perf sanity tests. New constants and helper functions compute per-worker marker paths, write completion markers, wait for sibling completion via markers or log-size stability, and dump sibling logs. The test_e2e flow is updated to use these helpers based on worker role.

Changes

Disagg Sibling Log Coordination

Layer / File(s) Summary
Sibling log coordination utilities
tests/integration/defs/perf/test_perf_sanity.py
Adds sys import and new constants/functions (_pytest_done_marker_path, _expected_sibling_marker_paths, _write_pytest_done_marker, _wait_for_sibling_completion, _dump_sibling_logs) for marker-based and log-stability-based completion detection and log dumping.
test_e2e disagg control flow integration
tests/integration/defs/perf/test_perf_sanity.py
Updates test_e2e to detect disagg multi-node runs, skip perf parsing/upload for non-BENCHMARK workers, and in a finally block write done markers (non-BENCHMARK) or wait/sleep/dump sibling logs (BENCHMARK).

Estimated code review effort: 3 (Moderate) | ~25 minutes

Sequence Diagram(s)

sequenceDiagram
  participant NonBenchmarkWorker
  participant BenchmarkWorker
  participant Filesystem

  NonBenchmarkWorker->>Filesystem: write pytest_done marker file
  BenchmarkWorker->>Filesystem: poll for marker files
  BenchmarkWorker->>Filesystem: check sibling log size stability
  Filesystem-->>BenchmarkWorker: markers present or logs stable / timeout
  BenchmarkWorker->>BenchmarkWorker: sleep to allow stdout flush
  BenchmarkWorker->>Filesystem: read sibling log contents
  BenchmarkWorker->>BenchmarkWorker: dump logs to stdout
Loading
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title is concise and clearly describes the main change: streaming sibling worker logs from disagg BENCHMARK pytest runs.
Description check ✅ Passed The description covers the issue, solution, scope, and test plan, matching the template well enough despite minor section differences.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests/integration/defs/perf/test_perf_sanity.py (1)

2379-2385: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Narrow the broad except Exception to OSError.

Both this marker write and the log read at Line 2464 only perform filesystem/IO operations, so OSError captures the intended best-effort failure modes while still surfacing genuine programming errors. As per coding guidelines: "use specific exception types, not bare except" / "limit the except to the smallest set of errors possible".

♻️ Proposed change (apply similarly at Line 2464)
-    except Exception as e:
+    except OSError as e:
         print_info(f"Failed to write pytest_done marker: {e}")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/defs/perf/test_perf_sanity.py` around lines 2379 - 2385,
The filesystem-only failure handling around _pytest_done_marker_path and the
marker write/read logic is too broad; replace the generic except Exception
blocks with OSError so only I/O-related failures are swallowed. Update the
try/except in the pytest_done marker write path and apply the same narrowing to
the log read helper near the other mentioned location, keeping the existing
print_info fallback behavior intact.

Sources: Coding guidelines, Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/integration/defs/perf/test_perf_sanity.py`:
- Around line 2379-2385: The filesystem-only failure handling around
_pytest_done_marker_path and the marker write/read logic is too broad; replace
the generic except Exception blocks with OSError so only I/O-related failures
are swallowed. Update the try/except in the pytest_done marker write path and
apply the same narrowing to the log read helper near the other mentioned
location, keeping the existing print_info fallback behavior intact.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 30443650-aefe-46d8-9562-62132f2f50a4

📥 Commits

Reviewing files that changed from the base of the PR and between ca9c439 and 0afe71d.

📒 Files selected for processing (1)
  • tests/integration/defs/perf/test_perf_sanity.py

Extract the pytest_done marker + log-dump helpers out of the already-large
test_perf_sanity.py into tests/test_common/error_utils.py so the perf-sanity
test module isn't further bloated by disagg-specific coordination code.

Signed-off-by: chenfeiz <chenfeiz@nvidia.com>
Signed-off-by: chenfeiz <chenfeiz@nvidia.com>
@chenfeiz0326 chenfeiz0326 requested a review from yufeiwu-nv July 3, 2026 09:08
@chenfeiz0326

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast --stage-list "DGX_B200-8_GPUs-PyTorch-PerfSanity-Post-Merge-1,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-4,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-1"

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57437 [ run ] triggered by Bot. Commit: 11aa7d4 Link to invocation

@fredricz-20070104

Copy link
Copy Markdown
Collaborator

Hi @chenfeiz0326 Would you please dump all of the logs to a separate file instead of “slurm-*.out”. Since the slurm log file has its own business logic, it indicates whether the slurm job failed due to submission error or other issues. Use a separate file to track all of the disagg logs is a much better idea and more easy to work with LLMs and Agents.

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57437 [ run ] completed with state FAILURE. Commit: 11aa7d4
/LLM/main/L0_MergeRequest_PR pipeline #46176 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants