[None][test] Dump sibling worker logs from disagg BENCHMARK pytest#15912
[None][test] Dump sibling worker logs from disagg BENCHMARK pytest#15912chenfeiz0326 wants to merge 3 commits into
Conversation
For disagg perf-sanity tests, CI only surfaces the BENCHMARK srun step's
stdout, so failures in the CTX/GEN/DISAGG_SERVER workers or their nested
trtllm-serve subprocesses require SSHing to fetch log files. Stream those
sibling logs into BENCHMARK's stdout at teardown so a single CI log has
everything needed for debugging.
Coordination:
- Each worker (CTX_x / GEN_x / DISAGG_SERVER) writes a pytest_done marker
at the end of its test_e2e.
- BENCHMARK polls for all expected markers, falling back to log-file
size stability so a crashed worker doesn't block, bounded by a 180s
max wait.
- After detection, a short flush sleep covers the final srun `&> file`
redirect flush, then full log contents are streamed to stdout with
clear BEGIN/END fences.
Runs in the test_e2e finally block so logs are dumped on both success
and failure.
Signed-off-by: chenfeiz <chenfeiz@nvidia.com>
📝 WalkthroughWalkthroughThis PR adds sibling-log coordination for disaggregated 4-srun perf sanity tests. New constants and helper functions compute per-worker marker paths, write completion markers, wait for sibling completion via markers or log-size stability, and dump sibling logs. The ChangesDisagg Sibling Log Coordination
Estimated code review effort: 3 (Moderate) | ~25 minutes Sequence Diagram(s)sequenceDiagram
participant NonBenchmarkWorker
participant BenchmarkWorker
participant Filesystem
NonBenchmarkWorker->>Filesystem: write pytest_done marker file
BenchmarkWorker->>Filesystem: poll for marker files
BenchmarkWorker->>Filesystem: check sibling log size stability
Filesystem-->>BenchmarkWorker: markers present or logs stable / timeout
BenchmarkWorker->>BenchmarkWorker: sleep to allow stdout flush
BenchmarkWorker->>Filesystem: read sibling log contents
BenchmarkWorker->>BenchmarkWorker: dump logs to stdout
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/integration/defs/perf/test_perf_sanity.py (1)
2379-2385: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winNarrow the broad
except ExceptiontoOSError.Both this marker write and the log read at Line 2464 only perform filesystem/IO operations, so
OSErrorcaptures the intended best-effort failure modes while still surfacing genuine programming errors. As per coding guidelines: "use specific exception types, not bareexcept" / "limit the except to the smallest set of errors possible".♻️ Proposed change (apply similarly at Line 2464)
- except Exception as e: + except OSError as e: print_info(f"Failed to write pytest_done marker: {e}")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/defs/perf/test_perf_sanity.py` around lines 2379 - 2385, The filesystem-only failure handling around _pytest_done_marker_path and the marker write/read logic is too broad; replace the generic except Exception blocks with OSError so only I/O-related failures are swallowed. Update the try/except in the pytest_done marker write path and apply the same narrowing to the log read helper near the other mentioned location, keeping the existing print_info fallback behavior intact.Sources: Coding guidelines, Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/integration/defs/perf/test_perf_sanity.py`:
- Around line 2379-2385: The filesystem-only failure handling around
_pytest_done_marker_path and the marker write/read logic is too broad; replace
the generic except Exception blocks with OSError so only I/O-related failures
are swallowed. Update the try/except in the pytest_done marker write path and
apply the same narrowing to the log read helper near the other mentioned
location, keeping the existing print_info fallback behavior intact.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 30443650-aefe-46d8-9562-62132f2f50a4
📒 Files selected for processing (1)
tests/integration/defs/perf/test_perf_sanity.py
Extract the pytest_done marker + log-dump helpers out of the already-large test_perf_sanity.py into tests/test_common/error_utils.py so the perf-sanity test module isn't further bloated by disagg-specific coordination code. Signed-off-by: chenfeiz <chenfeiz@nvidia.com>
Signed-off-by: chenfeiz <chenfeiz@nvidia.com>
|
/bot run --disable-fail-fast --stage-list "DGX_B200-8_GPUs-PyTorch-PerfSanity-Post-Merge-1,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-4,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-1" |
|
PR_Github #57437 [ run ] triggered by Bot. Commit: |
|
Hi @chenfeiz0326 Would you please dump all of the logs to a separate file instead of “slurm-*.out”. Since the slurm log file has its own business logic, it indicates whether the slurm job failed due to submission error or other issues. Use a separate file to track all of the disagg logs is a much better idea and more easy to work with LLMs and Agents. |
|
PR_Github #57437 [ run ] completed with state
|
Summary
For disagg perf-sanity tests, CI only surfaces the BENCHMARK srun step's stdout, so failures in the CTX/GEN/DISAGG_SERVER workers or their nested
trtllm-servesubprocesses require SSHing to the cluster to fetch log files. This PR streams every sibling log (gen_server_*.log,ctx_server_*.log,disagg_server.log,trtllm-serve.CTX_*.log,trtllm-serve.GEN_*.log,trtllm-serve.DISAGG_SERVER.*.log) into BENCHMARK's stdout at teardown, so a single CI log has everything needed for debugging.Coordination (option C from design)
CTX_x/GEN_x/DISAGG_SERVER) writes apytest_done.<TYPE>.<server_idx>.txtmarker at the end of itstest_e2e(in afinally— runs on both success and failure).time.sleep(3)covers the srun&> fileredirect flush.----- BEGIN/END -----fences.Runs in
test_e2e'sfinallyblock so logs are dumped on both success and failure.report_error's existing error-tail (baked into the exception message) is untouched.Only active for
runtime == "multi_node_disagg_server"; aggregated tests are unaffected.Test plan
e2eperf sanity test (single ctx, single gen) in CI; confirm BENCHMARK's stdout contains full CTX/GEN/DISAGG_SERVER worker logs and their nestedtrtllm-serve.*logs after the benchmark result summary.gen_onlytest; confirm CTX logs are correctly skipped (no ctx workers spawned).aggr_upload-*andaggr_upload-ctx_only-*) are unaffected.Summary by CodeRabbit