[None][test] Add opt-in background prefetch of test MPI sessions and model page cache by sunnyqgg · Pull Request #15908 · NVIDIA/TensorRT-LLM

sunnyqgg · 2026-07-03T06:32:57Z

Summary

Multi-GPU LLM-API tests pay ~50-65s per test to spawn an MpiPoolSession whose workers import tensorrt_llm, plus a cold read of the model weights. Both are pure CPU/IO work, so they run in background threads while the PREVIOUS test still owns the GPUs — hiding the cost behind its runtime.

Enabled by default (TRTLLM_TEST_PREFETCH_SESSION=0 to disable). No production code is changed; everything lives under tests/.

Three ways tests benefit

Shadow mode — zero test changes (default). install_pool_factory() patches the three MpiPoolSession construction seams (executor/proxy.py, executor/rpc_proxy.py, llmapi/llm.py, all in their mpi_session is None branches) so bare LLM(...) tests consume prefetched pools automatically. Whenever a multi-GPU pool is built or handed over, a spare pool of the same size is built in the background for the next test. A size miss is discarded — the sync build is no slower than without prefetch. Tests passing their own _mpi_session (shared/grouped pools) are never intercepted.
Precise mode: @pytest.mark.prefetch_session(N) + the prefetched_mpi_session fixture, for suites that know their boundaries.
Weight page-cache warming: @pytest.mark.prefetch_model_dir(path) pre-reads the NEXT test's weight files (fires even between tests sharing a session).

Safety

Env-snapshot guard: workers freeze their env at spawn; a prefetched pool is discarded if TRTLLM*/TLLM*/NCCL*/CUDA*/UCX*/OMPI* vars changed since it was built.
Prefetched workers are asserted CUDA-clean before handover.
Any background failure, size miss, or 180s join timeout falls back to a synchronous build — no path is slower than today.
pytest_sessionfinish disposes any unconsumed shadow pool.

Measured benefit (4xB200, DeepSeek-V3-Lite nvfp4 / Llama-3.1-8B+EAGLE3)

Scenario	OFF	ON	Benefit
Group boundary, same model: pool spawn	53.2s	0.0s	fully hidden
4 ungrouped 1-GPU tests, chained: boundary pool cost	168.4s	71.2s	-58% (wall -32%)
Different-model boundary: LLM create (14.2 GiB warmed, fully overlapped)	48.5s	14.7s	-70%

Savings per boundary = min(pool build time, previous test duration).

Changes

tests/test_common/session_prefetcher.py — prefetcher, shadow-mode factory, env guard, page-cache warming (new)
tests/unittest/llmapi/conftest.py — hook wiring, factory install, marker registration, fixture (new)
tests/unittest/llmapi/test_session_prefetcher.py — pure-logic unit tests, no MPI/GPU needed (new)

Test plan

pytest tests/unittest/llmapi/test_session_prefetcher.py — 14 passed (container, B200 node)
A/B experiments above (3 scenarios, OFF/ON back-to-back on the same node)
CI

…model page cache Multi-GPU LLM-API tests pay ~50-65s per test to spawn an MPI pool whose workers import tensorrt_llm, plus a cold read of the model weights. Both are pure CPU/IO work, so they can be done in background threads while the PREVIOUS test still owns the GPUs, hiding the cost behind its runtime. This adds an opt-in (TRTLLM_TEST_PREFETCH_SESSION=1, default off) test infra module: - SessionPrefetcher: at the last test of a session block, spawns the next block's MpiPoolSession in the background (workers import tensorrt_llm but must not touch CUDA before handover, asserted at take()). - warm_page_cache(): pre-reads the NEXT test's weight files so the kernel page cache is hot when that test loads weights; fires whenever the next test declares a different prefetch_model_dir, even within a block. - pytest wiring in tests/unittest/llmapi/conftest.py plus pure-logic unit tests (no MPI/GPU needed). Measured on 4xB200 (DeepSeek-V3-Lite nvfp4 and Llama-3.1-8B+EAGLE3): - group boundary, same model: 53.2s pool spawn -> 0s (fully hidden) - 4 ungrouped one-GPU tests, chained: boundary pool cost 168.4s -> 71.2s, total wall -32% - different-model boundary: warming 14.2 GiB fully overlapped, LLM create 48.5s -> 14.7s (-70%) Savings per boundary = min(pool build time, previous test duration). Signed-off-by: qgai <qgai@nvidia.com>

sunnyqgg · 2026-07-03T06:33:04Z

/bot run --disable-fail-fast

coderabbitai · 2026-07-03T06:38:22Z

📝 Walkthrough

Walkthrough

Adds a new test utility module implementing optional background session prefetching for multi-GPU LLM-API tests, enabled via an environment variable. It prebuilds MPI worker pools and warms OS page cache for model weight files, wires into pytest hooks, and includes a pure-logic test suite.

Changes

Session Prefetcher

Layer / File(s)	Summary
Prefetcher core logic `tests/test_common/session_prefetcher.py`	Implements `warm_page_cache`, spec/model-dir helper functions, the `SessionPrefetcher` class (background MPI pool build, page-cache warming, `take()` handover with locking), and the `PREFETCHER` singleton.
Pytest hook wiring `tests/unittest/llmapi/conftest.py`	Registers `prefetch_session`/`prefetch_model_dir` markers and connects `pytest_collection_modifyitems`/`pytest_runtest_setup` to `PREFETCHER`.
Prefetcher unit tests `tests/unittest/llmapi/test_session_prefetcher.py`	Adds fake marker/item helpers and tests for disabled state, spec dedup/boundary builds, model-switch warming, mismatch handling, `cls.n_gpus` fallback, and `warm_page_cache` file reading.

Estimated code review effort: 3 (Moderate) | ~25 minutes

Sequence Diagram(s)

sequenceDiagram
  participant PytestHook
  participant SessionPrefetcher
  participant MpiPoolSession
  participant PageCache

  PytestHook->>SessionPrefetcher: on_collection(items)
  PytestHook->>SessionPrefetcher: on_test_setup(item)
  SessionPrefetcher->>SessionPrefetcher: derive spec and model_dir
  alt spec boundary reached
    SessionPrefetcher->>MpiPoolSession: build pool in background thread
  end
  alt model dir changed
    SessionPrefetcher->>PageCache: warm_page_cache(model_dir) in background thread
  end
  PytestHook->>SessionPrefetcher: take(spec)
  SessionPrefetcher->>SessionPrefetcher: join background thread
  alt spec matches
    SessionPrefetcher-->>PytestHook: return prefetched session
  else mismatch
    SessionPrefetcher->>MpiPoolSession: shutdown discarded session
    SessionPrefetcher-->>PytestHook: return None
  end

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title clearly matches the PR's main change: adding opt-in background prefetch for test MPI sessions and model cache warming.
Description check	✅ Passed	The description covers the summary, implementation details, and test plan, but it does not follow the template sections exactly.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

tests/unittest/llmapi/test_session_prefetcher.py (3)
79-79: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Move import time to module top-level.

Importing inline inside a test function is non-idiomatic; place it with the other imports at the top of the file.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/llmapi/test_session_prefetcher.py` at line 79, The test file
has an inline import of time inside a test rather than at module scope. Move the
import into the top-level imports alongside the other module imports in
test_session_prefetcher so the test stays idiomatic and the import location is
consistent.
Source: Coding guidelines

71-90: 🩺 Stability & Availability | 🔵 Trivial | 💤 Low value

Polling-with-sleep for async assertion; consider tightening.

Waiting up to 5s (100 * 0.05s) for the background warm call to land works but adds latency risk in CI if the thread is ever delayed, and is slower than necessary on the happy path teardown. Since the warm thread handle isn't exposed by SessionPrefetcher.on_test_setup (per the core module context), polling is the only option available from this test file without touching the core module. Given tests/** coverage guidance, this is acceptable but worth a lighter-weight signal (e.g., a threading.Event set inside the monkeypatched _warm lambda) for a more deterministic and faster test.
♻️ Suggested refactor using an Event instead of polling
 def test_model_switch_triggers_warm_even_within_block(prefetcher):
+    warmed_event = threading.Event()
+    orig_warmed = prefetcher.warmed
+    def _record(d):
+        orig_warmed.append(d)
+        warmed_event.set()
     items = [
         _FakeItem(spec=2, model_dir="/models/a"),
         _FakeItem(spec=2, model_dir="/models/b"),
     ]
     prefetcher.on_collection(items)
     prefetcher.on_test_setup(items[0])
-    # Warm threads are fire-and-forget; poll briefly for the recorded call.
-    import time
-
-    for _ in range(100):
-        if prefetcher.warmed:
-            break
-        time.sleep(0.05)
+    assert warmed_event.wait(timeout=5)
     assert prefetcher.warmed == ["/models/b"]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/llmapi/test_session_prefetcher.py` around lines 71 - 90, This
test waits on background warming with a long polling loop, which makes the
assertion slower and less deterministic. Update
test_model_switch_triggers_warm_even_within_block to use a tighter
synchronization signal, such as a threading.Event set by the monkeypatched _warm
path, so the test can wait directly for the warm call instead of polling
prefetcher.warmed. Keep the existing SessionPrefetcher.on_collection and
SessionPrefetcher.on_test_setup coverage, but replace the sleep loop with a
faster explicit wait and preserve the deduplication assertion.
47-47: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Add return type annotations to test functions.

None of the test/helper functions in this file annotate a return type. As per coding guidelines, "Always annotate functions with return types (use None if no return)."

Also applies to: 55-55, 62-62, 71-71, 92-92, 98-98, 107-107
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/llmapi/test_session_prefetcher.py` at line 47, Add return type
annotations to the test/helper functions in test_session_prefetcher.py so each
function explicitly declares a return of None. Update the definitions for the
affected test functions, including test_disabled_is_noop and the other
referenced tests in this file, to follow the coding guideline of always
annotating return types even when no value is returned.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/unittest/llmapi/test_session_prefetcher.py`:
- Around line 92-95: Add coverage for the happy path in
SessionPrefetcher.take(): the current test only exercises the spec-mismatch
branch by setting _built_spec and _built_session directly. Add a test around
take() that sets matching built state, verifies it returns the built session,
and then verifies the internal state is cleared so a second take() with the same
spec returns None; use the existing SessionPrefetcher fixture and the take
method to validate the consume-once behavior.

---

Nitpick comments:
In `@tests/unittest/llmapi/test_session_prefetcher.py`:
- Line 79: The test file has an inline import of time inside a test rather than
at module scope. Move the import into the top-level imports alongside the other
module imports in test_session_prefetcher so the test stays idiomatic and the
import location is consistent.
- Around line 71-90: This test waits on background warming with a long polling
loop, which makes the assertion slower and less deterministic. Update
test_model_switch_triggers_warm_even_within_block to use a tighter
synchronization signal, such as a threading.Event set by the monkeypatched _warm
path, so the test can wait directly for the warm call instead of polling
prefetcher.warmed. Keep the existing SessionPrefetcher.on_collection and
SessionPrefetcher.on_test_setup coverage, but replace the sleep loop with a
faster explicit wait and preserve the deduplication assertion.
- Line 47: Add return type annotations to the test/helper functions in
test_session_prefetcher.py so each function explicitly declares a return of
None. Update the definitions for the affected test functions, including
test_disabled_is_noop and the other referenced tests in this file, to follow the
coding guideline of always annotating return types even when no value is
returned.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c80a006d-33b2-4126-81c5-bb510a689648

📥 Commits

Reviewing files that changed from the base of the PR and between 695dd1d and 83c6254.

📒 Files selected for processing (3)

tests/test_common/session_prefetcher.py
tests/unittest/llmapi/conftest.py
tests/unittest/llmapi/test_session_prefetcher.py

coderabbitai · 2026-07-03T06:38:25Z

+def test_take_spec_mismatch_returns_none(prefetcher):
+    prefetcher._built_spec = 4
+    prefetcher._built_session = None
+    assert prefetcher.take(2) is None


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Missing coverage: take() success path is untested.

The only take() test covers the spec-mismatch branch by directly poking _built_spec/_built_session. There's no test verifying that when the requested spec matches _built_spec, take() returns the built session and correctly resets internal state (so a second take() call with the same spec doesn't return a stale session). This is a core behavioral contract of the prefetch handover and should be covered.

As per path instructions, coverage here is insufficient — suggest adding a test like:

def test_take_returns_and_resets_on_match(prefetcher): prefetcher._built_spec = 4 prefetcher._built_session = "fake-session" assert prefetcher.take(4) == "fake-session" assert prefetcher.take(4) is None # already consumed

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/llmapi/test_session_prefetcher.py` around lines 92 - 95, Add coverage for the happy path in SessionPrefetcher.take(): the current test only exercises the spec-mismatch branch by setting _built_spec and _built_session directly. Add a test around take() that sets matching built state, verifies it returns the built session, and then verifies the internal state is cleared so a second take() with the same spec returns None; use the existing SessionPrefetcher fixture and the take method to validate the consume-once behavior.

Source: Path instructions

tensorrt-cicd · 2026-07-03T06:38:51Z

PR_Github #57386 [ run ] triggered by Bot. Commit: 83c6254 Link to invocation

…prefetched pools Tests opt in with @pytest.mark.prefetch_session(N) + the fixture, then pass it via LLM(..., _mpi_session=...). The fixture hands over the pool built in the background during the previous test, or builds one synchronously when nothing was prefetched. This makes the PR self-contained: detection, build and handover all live here. Signed-off-by: qgai <qgai@nvidia.com>

sunnyqgg · 2026-07-03T06:49:25Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-07-03T06:55:03Z

PR_Github #57394 [ run ] triggered by Bot. Commit: f7a120c Link to invocation

tensorrt-cicd · 2026-07-03T06:58:29Z

PR_Github #57386 [ run ] completed with state ABORTED. Commit: 83c6254
LLM/main/L0_MergeRequest_PR #46133 (Blue Ocean) completed with status: ABORTED

Link to invocation

…ow mode - Default on (TRTLLM_TEST_PREFETCH_SESSION=0 to disable); with no consumer the prefetcher never fires, so plain suites are unaffected. - Shadow mode: install_pool_factory() patches the three MpiPoolSession construction seams (proxy, rpc_proxy, llm) so bare LLM(...) tests consume prefetched pools with zero test changes. Whenever a multi-GPU pool is built or handed over, a spare pool of the same size is built in the background for the next test; a size miss is discarded and the sync build is no slower than without prefetch. - Env-snapshot guard: workers freeze their environment at spawn, so a prefetched pool is discarded if TRTLLM*/TLLM*/NCCL*/CUDA*/UCX*/OMPI* vars changed since it was built. - take() join timeout tightened to 180s (slowest legitimate build measured ~117s); on a genuine hang we fall back to a synchronous build. - pytest_sessionfinish disposes any unconsumed shadow pool. Signed-off-by: qgai <qgai@nvidia.com>

sunnyqgg · 2026-07-03T07:42:23Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-07-03T07:48:52Z

PR_Github #57402 [ run ] triggered by Bot. Commit: 0d46445 Link to invocation

tensorrt-cicd · 2026-07-03T07:53:46Z

PR_Github #57394 [ run ] completed with state ABORTED. Commit: f7a120c

Link to invocation

The default single-GPU path also spawns a 1-worker MpiPoolSession (executor.py _create_ipc_executor with mpi_session=None -> proxy.py), paying the same ~50s spawn+import as multi-GPU pools (measured 45-57s for 1-worker pools on B200). Excluding n_workers==1 from the factory left the largest test population without the benefit. Signed-off-by: qgai <qgai@nvidia.com>

sunnyqgg · 2026-07-03T07:54:32Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-07-03T08:03:24Z

PR_Github #57407 [ run ] triggered by Bot. Commit: 214f115 Link to invocation

tensorrt-cicd · 2026-07-03T08:07:00Z

PR_Github #57402 [ run ] completed with state ABORTED. Commit: 0d46445

Link to invocation

sunnyqgg · 2026-07-03T08:30:58Z

/bot run --disable-fail-fast

…hadow mode Validated on existing tests (test_llm_pytorch.py, zero test changes) on a B200 node: prefetch handed over a 1-worker pool and the 2-test run went 157.1s -> 126.3s (-20%), all results identical to baseline. Fixes from that run: - threadleak: the prefetched pool's MPIPoolExecutor manager thread (Thread-N (_manager_spawn)) and the prefetcher's own named threads legitimately outlive the test they start under; exclude them in tests/unittest/pytest.ini and name all prefetcher threads session-prefetch-*. - CUDA-clean assert: some library versions initialize an idle CUDA context at tensorrt_llm import time, so the worker-side hard assert is unsatisfiable there. Downgrade to a report + log note; the idle context (no kernels, no allocations) safely coexists with the running test, as all A/B experiments demonstrated. Signed-off-by: qgai <qgai@nvidia.com>

sunnyqgg · 2026-07-03T08:31:57Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-07-03T08:36:31Z

PR_Github #57418 [ run ] triggered by Bot. Commit: af40b59 Link to invocation

tensorrt-cicd · 2026-07-03T08:38:24Z

PR_Github #57420 [ run ] triggered by Bot. Commit: af40b59 Link to invocation

tensorrt-cicd · 2026-07-03T08:38:28Z

PR_Github #57418 [ run ] completed with state ABORTED. Commit: af40b59

Link to invocation

tensorrt-cicd · 2026-07-03T08:38:51Z

PR_Github #57407 [ run ] completed with state ABORTED. Commit: 214f115

Link to invocation

sunnyqgg · 2026-07-03T09:08:18Z

/bot run --disable-fail-fast

…mand-driven install Cover every test tree with one plugin module instead of per-directory conftests: both rootdirs (tests/unittest, tests/integration/defs) load test_common.session_prefetcher_hooks via 'addopts = -p ...' in their pytest.ini, so any pytest invocation of either tree gets the factory. Demand-driven by construction: install_pool_factory_if_loaded() only patches tensorrt_llm executor modules ALREADY imported by the collected tests, so suites that never create MPI pools pay nothing - not even the tensorrt_llm import. Tests that do create pools hit the patched seam and prefetch automatically; there is nothing to scan or annotate. The llmapi/conftest.py wiring is superseded by the plugin and removed. Verified end-to-end on unmodified tests (2-test run 157.1s -> 132.3s, handover logged, results identical); 14 unit tests pass. Signed-off-by: qgai <qgai@nvidia.com>

sunnyqgg · 2026-07-03T09:09:14Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-07-03T09:14:11Z

PR_Github #57434 [ run ] triggered by Bot. Commit: c5b72cf Link to invocation

tensorrt-cicd · 2026-07-03T09:15:11Z

PR_Github #57436 [ run ] triggered by Bot. Commit: c5b72cf Link to invocation

tensorrt-cicd · 2026-07-03T09:15:14Z

PR_Github #57434 [ run ] completed with state ABORTED. Commit: c5b72cf

Link to invocation

tensorrt-cicd · 2026-07-03T09:15:38Z

PR_Github #57420 [ run ] completed with state ABORTED. Commit: af40b59

Link to invocation

tensorrt-cicd · 2026-07-03T13:24:02Z

PR_Github #57436 [ run ] completed with state SUCCESS. Commit: c5b72cf
/LLM/main/L0_MergeRequest_PR pipeline #46175 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

github-actions Bot assigned sunnyqgg Jul 3, 2026

coderabbitai Bot reviewed Jul 3, 2026

View reviewed changes

Uh oh!

Conversation

sunnyqgg commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Three ways tests benefit

Safety

Measured benefit (4xB200, DeepSeek-V3-Lite nvfp4 / Llama-3.1-8B+EAGLE3)

Changes

Test plan

Uh oh!

sunnyqgg commented Jul 3, 2026

Uh oh!

coderabbitai Bot commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

tensorrt-cicd commented Jul 3, 2026

Uh oh!

sunnyqgg commented Jul 3, 2026

Uh oh!

tensorrt-cicd commented Jul 3, 2026

Uh oh!

tensorrt-cicd commented Jul 3, 2026

Uh oh!

sunnyqgg commented Jul 3, 2026

Uh oh!

tensorrt-cicd commented Jul 3, 2026

Uh oh!

tensorrt-cicd commented Jul 3, 2026

Uh oh!

sunnyqgg commented Jul 3, 2026

Uh oh!

tensorrt-cicd commented Jul 3, 2026

Uh oh!

tensorrt-cicd commented Jul 3, 2026

Uh oh!

sunnyqgg commented Jul 3, 2026

Uh oh!

sunnyqgg commented Jul 3, 2026

Uh oh!

tensorrt-cicd commented Jul 3, 2026

Uh oh!

tensorrt-cicd commented Jul 3, 2026

Uh oh!

tensorrt-cicd commented Jul 3, 2026

Uh oh!

tensorrt-cicd commented Jul 3, 2026

Uh oh!

sunnyqgg commented Jul 3, 2026

Uh oh!

sunnyqgg commented Jul 3, 2026

Uh oh!

tensorrt-cicd commented Jul 3, 2026

Uh oh!

tensorrt-cicd commented Jul 3, 2026

Uh oh!

tensorrt-cicd commented Jul 3, 2026

Uh oh!

tensorrt-cicd commented Jul 3, 2026

Uh oh!

tensorrt-cicd commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sunnyqgg commented Jul 3, 2026 •

edited

Loading

coderabbitai Bot commented Jul 3, 2026 •

edited

Loading