Skip to content

[None][test] Add opt-in background prefetch of test MPI sessions and model page cache#15908

Open
sunnyqgg wants to merge 6 commits into
NVIDIA:mainfrom
sunnyqgg:test_overlap_prefectch
Open

[None][test] Add opt-in background prefetch of test MPI sessions and model page cache#15908
sunnyqgg wants to merge 6 commits into
NVIDIA:mainfrom
sunnyqgg:test_overlap_prefectch

Conversation

@sunnyqgg

@sunnyqgg sunnyqgg commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator

Summary

Multi-GPU LLM-API tests pay ~50-65s per test to spawn an MpiPoolSession whose workers import tensorrt_llm, plus a cold read of the model weights. Both are pure CPU/IO work, so they run in background threads while the PREVIOUS test still owns the GPUs — hiding the cost behind its runtime.

Enabled by default (TRTLLM_TEST_PREFETCH_SESSION=0 to disable). No production code is changed; everything lives under tests/.

Three ways tests benefit

  1. Shadow mode — zero test changes (default). install_pool_factory() patches the three MpiPoolSession construction seams (executor/proxy.py, executor/rpc_proxy.py, llmapi/llm.py, all in their mpi_session is None branches) so bare LLM(...) tests consume prefetched pools automatically. Whenever a multi-GPU pool is built or handed over, a spare pool of the same size is built in the background for the next test. A size miss is discarded — the sync build is no slower than without prefetch. Tests passing their own _mpi_session (shared/grouped pools) are never intercepted.
  2. Precise mode: @pytest.mark.prefetch_session(N) + the prefetched_mpi_session fixture, for suites that know their boundaries.
  3. Weight page-cache warming: @pytest.mark.prefetch_model_dir(path) pre-reads the NEXT test's weight files (fires even between tests sharing a session).

Safety

  • Env-snapshot guard: workers freeze their env at spawn; a prefetched pool is discarded if TRTLLM*/TLLM*/NCCL*/CUDA*/UCX*/OMPI* vars changed since it was built.
  • Prefetched workers are asserted CUDA-clean before handover.
  • Any background failure, size miss, or 180s join timeout falls back to a synchronous build — no path is slower than today.
  • pytest_sessionfinish disposes any unconsumed shadow pool.

Measured benefit (4xB200, DeepSeek-V3-Lite nvfp4 / Llama-3.1-8B+EAGLE3)

Scenario OFF ON Benefit
Group boundary, same model: pool spawn 53.2s 0.0s fully hidden
4 ungrouped 1-GPU tests, chained: boundary pool cost 168.4s 71.2s -58% (wall -32%)
Different-model boundary: LLM create (14.2 GiB warmed, fully overlapped) 48.5s 14.7s -70%

Savings per boundary = min(pool build time, previous test duration).

Changes

  • tests/test_common/session_prefetcher.py — prefetcher, shadow-mode factory, env guard, page-cache warming (new)
  • tests/unittest/llmapi/conftest.py — hook wiring, factory install, marker registration, fixture (new)
  • tests/unittest/llmapi/test_session_prefetcher.py — pure-logic unit tests, no MPI/GPU needed (new)

Test plan

  • pytest tests/unittest/llmapi/test_session_prefetcher.py — 14 passed (container, B200 node)
  • A/B experiments above (3 scenarios, OFF/ON back-to-back on the same node)
  • CI

…model page cache

Multi-GPU LLM-API tests pay ~50-65s per test to spawn an MPI pool whose
workers import tensorrt_llm, plus a cold read of the model weights. Both
are pure CPU/IO work, so they can be done in background threads while the
PREVIOUS test still owns the GPUs, hiding the cost behind its runtime.

This adds an opt-in (TRTLLM_TEST_PREFETCH_SESSION=1, default off) test
infra module:
- SessionPrefetcher: at the last test of a session block, spawns the next
  block's MpiPoolSession in the background (workers import tensorrt_llm
  but must not touch CUDA before handover, asserted at take()).
- warm_page_cache(): pre-reads the NEXT test's weight files so the kernel
  page cache is hot when that test loads weights; fires whenever the next
  test declares a different prefetch_model_dir, even within a block.
- pytest wiring in tests/unittest/llmapi/conftest.py plus pure-logic unit
  tests (no MPI/GPU needed).

Measured on 4xB200 (DeepSeek-V3-Lite nvfp4 and Llama-3.1-8B+EAGLE3):
- group boundary, same model: 53.2s pool spawn -> 0s (fully hidden)
- 4 ungrouped one-GPU tests, chained: boundary pool cost 168.4s -> 71.2s,
  total wall -32%
- different-model boundary: warming 14.2 GiB fully overlapped, LLM create
  48.5s -> 14.7s (-70%)
Savings per boundary = min(pool build time, previous test duration).

Signed-off-by: qgai <qgai@nvidia.com>
@sunnyqgg

sunnyqgg commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@coderabbitai

coderabbitai Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds a new test utility module implementing optional background session prefetching for multi-GPU LLM-API tests, enabled via an environment variable. It prebuilds MPI worker pools and warms OS page cache for model weight files, wires into pytest hooks, and includes a pure-logic test suite.

Changes

Session Prefetcher

Layer / File(s) Summary
Prefetcher core logic
tests/test_common/session_prefetcher.py
Implements warm_page_cache, spec/model-dir helper functions, the SessionPrefetcher class (background MPI pool build, page-cache warming, take() handover with locking), and the PREFETCHER singleton.
Pytest hook wiring
tests/unittest/llmapi/conftest.py
Registers prefetch_session/prefetch_model_dir markers and connects pytest_collection_modifyitems/pytest_runtest_setup to PREFETCHER.
Prefetcher unit tests
tests/unittest/llmapi/test_session_prefetcher.py
Adds fake marker/item helpers and tests for disabled state, spec dedup/boundary builds, model-switch warming, mismatch handling, cls.n_gpus fallback, and warm_page_cache file reading.

Estimated code review effort: 3 (Moderate) | ~25 minutes

Sequence Diagram(s)

sequenceDiagram
  participant PytestHook
  participant SessionPrefetcher
  participant MpiPoolSession
  participant PageCache

  PytestHook->>SessionPrefetcher: on_collection(items)
  PytestHook->>SessionPrefetcher: on_test_setup(item)
  SessionPrefetcher->>SessionPrefetcher: derive spec and model_dir
  alt spec boundary reached
    SessionPrefetcher->>MpiPoolSession: build pool in background thread
  end
  alt model dir changed
    SessionPrefetcher->>PageCache: warm_page_cache(model_dir) in background thread
  end
  PytestHook->>SessionPrefetcher: take(spec)
  SessionPrefetcher->>SessionPrefetcher: join background thread
  alt spec matches
    SessionPrefetcher-->>PytestHook: return prefetched session
  else mismatch
    SessionPrefetcher->>MpiPoolSession: shutdown discarded session
    SessionPrefetcher-->>PytestHook: return None
  end
Loading
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title clearly matches the PR's main change: adding opt-in background prefetch for test MPI sessions and model cache warming.
Description check ✅ Passed The description covers the summary, implementation details, and test plan, but it does not follow the template sections exactly.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
tests/unittest/llmapi/test_session_prefetcher.py (3)

79-79: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Move import time to module top-level.

Importing inline inside a test function is non-idiomatic; place it with the other imports at the top of the file.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/llmapi/test_session_prefetcher.py` at line 79, The test file
has an inline import of time inside a test rather than at module scope. Move the
import into the top-level imports alongside the other module imports in
test_session_prefetcher so the test stays idiomatic and the import location is
consistent.

Source: Coding guidelines


71-90: 🩺 Stability & Availability | 🔵 Trivial | 💤 Low value

Polling-with-sleep for async assertion; consider tightening.

Waiting up to 5s (100 * 0.05s) for the background warm call to land works but adds latency risk in CI if the thread is ever delayed, and is slower than necessary on the happy path teardown. Since the warm thread handle isn't exposed by SessionPrefetcher.on_test_setup (per the core module context), polling is the only option available from this test file without touching the core module. Given tests/** coverage guidance, this is acceptable but worth a lighter-weight signal (e.g., a threading.Event set inside the monkeypatched _warm lambda) for a more deterministic and faster test.

♻️ Suggested refactor using an Event instead of polling
 def test_model_switch_triggers_warm_even_within_block(prefetcher):
+    warmed_event = threading.Event()
+    orig_warmed = prefetcher.warmed
+    def _record(d):
+        orig_warmed.append(d)
+        warmed_event.set()
     items = [
         _FakeItem(spec=2, model_dir="/models/a"),
         _FakeItem(spec=2, model_dir="/models/b"),
     ]
     prefetcher.on_collection(items)
     prefetcher.on_test_setup(items[0])
-    # Warm threads are fire-and-forget; poll briefly for the recorded call.
-    import time
-
-    for _ in range(100):
-        if prefetcher.warmed:
-            break
-        time.sleep(0.05)
+    assert warmed_event.wait(timeout=5)
     assert prefetcher.warmed == ["/models/b"]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/llmapi/test_session_prefetcher.py` around lines 71 - 90, This
test waits on background warming with a long polling loop, which makes the
assertion slower and less deterministic. Update
test_model_switch_triggers_warm_even_within_block to use a tighter
synchronization signal, such as a threading.Event set by the monkeypatched _warm
path, so the test can wait directly for the warm call instead of polling
prefetcher.warmed. Keep the existing SessionPrefetcher.on_collection and
SessionPrefetcher.on_test_setup coverage, but replace the sleep loop with a
faster explicit wait and preserve the deduplication assertion.

47-47: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Add return type annotations to test functions.

None of the test/helper functions in this file annotate a return type. As per coding guidelines, "Always annotate functions with return types (use None if no return)."

Also applies to: 55-55, 62-62, 71-71, 92-92, 98-98, 107-107

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/llmapi/test_session_prefetcher.py` at line 47, Add return type
annotations to the test/helper functions in test_session_prefetcher.py so each
function explicitly declares a return of None. Update the definitions for the
affected test functions, including test_disabled_is_noop and the other
referenced tests in this file, to follow the coding guideline of always
annotating return types even when no value is returned.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/unittest/llmapi/test_session_prefetcher.py`:
- Around line 92-95: Add coverage for the happy path in
SessionPrefetcher.take(): the current test only exercises the spec-mismatch
branch by setting _built_spec and _built_session directly. Add a test around
take() that sets matching built state, verifies it returns the built session,
and then verifies the internal state is cleared so a second take() with the same
spec returns None; use the existing SessionPrefetcher fixture and the take
method to validate the consume-once behavior.

---

Nitpick comments:
In `@tests/unittest/llmapi/test_session_prefetcher.py`:
- Line 79: The test file has an inline import of time inside a test rather than
at module scope. Move the import into the top-level imports alongside the other
module imports in test_session_prefetcher so the test stays idiomatic and the
import location is consistent.
- Around line 71-90: This test waits on background warming with a long polling
loop, which makes the assertion slower and less deterministic. Update
test_model_switch_triggers_warm_even_within_block to use a tighter
synchronization signal, such as a threading.Event set by the monkeypatched _warm
path, so the test can wait directly for the warm call instead of polling
prefetcher.warmed. Keep the existing SessionPrefetcher.on_collection and
SessionPrefetcher.on_test_setup coverage, but replace the sleep loop with a
faster explicit wait and preserve the deduplication assertion.
- Line 47: Add return type annotations to the test/helper functions in
test_session_prefetcher.py so each function explicitly declares a return of
None. Update the definitions for the affected test functions, including
test_disabled_is_noop and the other referenced tests in this file, to follow the
coding guideline of always annotating return types even when no value is
returned.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c80a006d-33b2-4126-81c5-bb510a689648

📥 Commits

Reviewing files that changed from the base of the PR and between 695dd1d and 83c6254.

📒 Files selected for processing (3)
  • tests/test_common/session_prefetcher.py
  • tests/unittest/llmapi/conftest.py
  • tests/unittest/llmapi/test_session_prefetcher.py

Comment on lines +92 to +95
def test_take_spec_mismatch_returns_none(prefetcher):
prefetcher._built_spec = 4
prefetcher._built_session = None
assert prefetcher.take(2) is None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Missing coverage: take() success path is untested.

The only take() test covers the spec-mismatch branch by directly poking _built_spec/_built_session. There's no test verifying that when the requested spec matches _built_spec, take() returns the built session and correctly resets internal state (so a second take() call with the same spec doesn't return a stale session). This is a core behavioral contract of the prefetch handover and should be covered.

As per path instructions, coverage here is insufficient — suggest adding a test like:

def test_take_returns_and_resets_on_match(prefetcher):
    prefetcher._built_spec = 4
    prefetcher._built_session = "fake-session"
    assert prefetcher.take(4) == "fake-session"
    assert prefetcher.take(4) is None  # already consumed
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/llmapi/test_session_prefetcher.py` around lines 92 - 95, Add
coverage for the happy path in SessionPrefetcher.take(): the current test only
exercises the spec-mismatch branch by setting _built_spec and _built_session
directly. Add a test around take() that sets matching built state, verifies it
returns the built session, and then verifies the internal state is cleared so a
second take() with the same spec returns None; use the existing
SessionPrefetcher fixture and the take method to validate the consume-once
behavior.

Source: Path instructions

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57386 [ run ] triggered by Bot. Commit: 83c6254 Link to invocation

…prefetched pools

Tests opt in with @pytest.mark.prefetch_session(N) + the fixture, then pass
it via LLM(..., _mpi_session=...). The fixture hands over the pool built in
the background during the previous test, or builds one synchronously when
nothing was prefetched. This makes the PR self-contained: detection, build
and handover all live here.

Signed-off-by: qgai <qgai@nvidia.com>
@sunnyqgg

sunnyqgg commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57394 [ run ] triggered by Bot. Commit: f7a120c Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57386 [ run ] completed with state ABORTED. Commit: 83c6254
LLM/main/L0_MergeRequest_PR #46133 (Blue Ocean) completed with status: ABORTED

Link to invocation

…ow mode

- Default on (TRTLLM_TEST_PREFETCH_SESSION=0 to disable); with no consumer
  the prefetcher never fires, so plain suites are unaffected.
- Shadow mode: install_pool_factory() patches the three MpiPoolSession
  construction seams (proxy, rpc_proxy, llm) so bare LLM(...) tests consume
  prefetched pools with zero test changes. Whenever a multi-GPU pool is
  built or handed over, a spare pool of the same size is built in the
  background for the next test; a size miss is discarded and the sync
  build is no slower than without prefetch.
- Env-snapshot guard: workers freeze their environment at spawn, so a
  prefetched pool is discarded if TRTLLM*/TLLM*/NCCL*/CUDA*/UCX*/OMPI*
  vars changed since it was built.
- take() join timeout tightened to 180s (slowest legitimate build measured
  ~117s); on a genuine hang we fall back to a synchronous build.
- pytest_sessionfinish disposes any unconsumed shadow pool.

Signed-off-by: qgai <qgai@nvidia.com>
@sunnyqgg

sunnyqgg commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57402 [ run ] triggered by Bot. Commit: 0d46445 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57394 [ run ] completed with state ABORTED. Commit: f7a120c

Link to invocation

The default single-GPU path also spawns a 1-worker MpiPoolSession
(executor.py _create_ipc_executor with mpi_session=None -> proxy.py),
paying the same ~50s spawn+import as multi-GPU pools (measured 45-57s
for 1-worker pools on B200). Excluding n_workers==1 from the factory
left the largest test population without the benefit.

Signed-off-by: qgai <qgai@nvidia.com>
@sunnyqgg

sunnyqgg commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57407 [ run ] triggered by Bot. Commit: 214f115 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57402 [ run ] completed with state ABORTED. Commit: 0d46445

Link to invocation

@sunnyqgg

sunnyqgg commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

…hadow mode

Validated on existing tests (test_llm_pytorch.py, zero test changes) on a
B200 node: prefetch handed over a 1-worker pool and the 2-test run went
157.1s -> 126.3s (-20%), all results identical to baseline.

Fixes from that run:
- threadleak: the prefetched pool's MPIPoolExecutor manager thread
  (Thread-N (_manager_spawn)) and the prefetcher's own named threads
  legitimately outlive the test they start under; exclude them in
  tests/unittest/pytest.ini and name all prefetcher threads
  session-prefetch-*.
- CUDA-clean assert: some library versions initialize an idle CUDA
  context at tensorrt_llm import time, so the worker-side hard assert is
  unsatisfiable there. Downgrade to a report + log note; the idle context
  (no kernels, no allocations) safely coexists with the running test, as
  all A/B experiments demonstrated.

Signed-off-by: qgai <qgai@nvidia.com>
@sunnyqgg

sunnyqgg commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57418 [ run ] triggered by Bot. Commit: af40b59 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57420 [ run ] triggered by Bot. Commit: af40b59 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57418 [ run ] completed with state ABORTED. Commit: af40b59

Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57407 [ run ] completed with state ABORTED. Commit: 214f115

Link to invocation

@sunnyqgg

sunnyqgg commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

…mand-driven install

Cover every test tree with one plugin module instead of per-directory
conftests: both rootdirs (tests/unittest, tests/integration/defs) load
test_common.session_prefetcher_hooks via 'addopts = -p ...' in their
pytest.ini, so any pytest invocation of either tree gets the factory.

Demand-driven by construction: install_pool_factory_if_loaded() only
patches tensorrt_llm executor modules ALREADY imported by the collected
tests, so suites that never create MPI pools pay nothing - not even the
tensorrt_llm import. Tests that do create pools hit the patched seam and
prefetch automatically; there is nothing to scan or annotate.

The llmapi/conftest.py wiring is superseded by the plugin and removed.
Verified end-to-end on unmodified tests (2-test run 157.1s -> 132.3s,
handover logged, results identical); 14 unit tests pass.

Signed-off-by: qgai <qgai@nvidia.com>
@sunnyqgg

sunnyqgg commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57434 [ run ] triggered by Bot. Commit: c5b72cf Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57436 [ run ] triggered by Bot. Commit: c5b72cf Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57434 [ run ] completed with state ABORTED. Commit: c5b72cf

Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57420 [ run ] completed with state ABORTED. Commit: af40b59

Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57436 [ run ] completed with state SUCCESS. Commit: c5b72cf
/LLM/main/L0_MergeRequest_PR pipeline #46175 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants