[None][test] Add opt-in background prefetch of test MPI sessions and model page cache#15908
[None][test] Add opt-in background prefetch of test MPI sessions and model page cache#15908sunnyqgg wants to merge 6 commits into
Conversation
…model page cache Multi-GPU LLM-API tests pay ~50-65s per test to spawn an MPI pool whose workers import tensorrt_llm, plus a cold read of the model weights. Both are pure CPU/IO work, so they can be done in background threads while the PREVIOUS test still owns the GPUs, hiding the cost behind its runtime. This adds an opt-in (TRTLLM_TEST_PREFETCH_SESSION=1, default off) test infra module: - SessionPrefetcher: at the last test of a session block, spawns the next block's MpiPoolSession in the background (workers import tensorrt_llm but must not touch CUDA before handover, asserted at take()). - warm_page_cache(): pre-reads the NEXT test's weight files so the kernel page cache is hot when that test loads weights; fires whenever the next test declares a different prefetch_model_dir, even within a block. - pytest wiring in tests/unittest/llmapi/conftest.py plus pure-logic unit tests (no MPI/GPU needed). Measured on 4xB200 (DeepSeek-V3-Lite nvfp4 and Llama-3.1-8B+EAGLE3): - group boundary, same model: 53.2s pool spawn -> 0s (fully hidden) - 4 ungrouped one-GPU tests, chained: boundary pool cost 168.4s -> 71.2s, total wall -32% - different-model boundary: warming 14.2 GiB fully overlapped, LLM create 48.5s -> 14.7s (-70%) Savings per boundary = min(pool build time, previous test duration). Signed-off-by: qgai <qgai@nvidia.com>
|
/bot run --disable-fail-fast |
📝 WalkthroughWalkthroughAdds a new test utility module implementing optional background session prefetching for multi-GPU LLM-API tests, enabled via an environment variable. It prebuilds MPI worker pools and warms OS page cache for model weight files, wires into pytest hooks, and includes a pure-logic test suite. ChangesSession Prefetcher
Estimated code review effort: 3 (Moderate) | ~25 minutes Sequence Diagram(s)sequenceDiagram
participant PytestHook
participant SessionPrefetcher
participant MpiPoolSession
participant PageCache
PytestHook->>SessionPrefetcher: on_collection(items)
PytestHook->>SessionPrefetcher: on_test_setup(item)
SessionPrefetcher->>SessionPrefetcher: derive spec and model_dir
alt spec boundary reached
SessionPrefetcher->>MpiPoolSession: build pool in background thread
end
alt model dir changed
SessionPrefetcher->>PageCache: warm_page_cache(model_dir) in background thread
end
PytestHook->>SessionPrefetcher: take(spec)
SessionPrefetcher->>SessionPrefetcher: join background thread
alt spec matches
SessionPrefetcher-->>PytestHook: return prefetched session
else mismatch
SessionPrefetcher->>MpiPoolSession: shutdown discarded session
SessionPrefetcher-->>PytestHook: return None
end
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
tests/unittest/llmapi/test_session_prefetcher.py (3)
79-79: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueMove
import timeto module top-level.Importing inline inside a test function is non-idiomatic; place it with the other imports at the top of the file.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/llmapi/test_session_prefetcher.py` at line 79, The test file has an inline import of time inside a test rather than at module scope. Move the import into the top-level imports alongside the other module imports in test_session_prefetcher so the test stays idiomatic and the import location is consistent.Source: Coding guidelines
71-90: 🩺 Stability & Availability | 🔵 Trivial | 💤 Low valuePolling-with-sleep for async assertion; consider tightening.
Waiting up to 5s (
100 * 0.05s) for the background warm call to land works but adds latency risk in CI if the thread is ever delayed, and is slower than necessary on the happy path teardown. Since the warm thread handle isn't exposed bySessionPrefetcher.on_test_setup(per the core module context), polling is the only option available from this test file without touching the core module. Giventests/**coverage guidance, this is acceptable but worth a lighter-weight signal (e.g., athreading.Eventset inside the monkeypatched_warmlambda) for a more deterministic and faster test.♻️ Suggested refactor using an Event instead of polling
def test_model_switch_triggers_warm_even_within_block(prefetcher): + warmed_event = threading.Event() + orig_warmed = prefetcher.warmed + def _record(d): + orig_warmed.append(d) + warmed_event.set() items = [ _FakeItem(spec=2, model_dir="/models/a"), _FakeItem(spec=2, model_dir="/models/b"), ] prefetcher.on_collection(items) prefetcher.on_test_setup(items[0]) - # Warm threads are fire-and-forget; poll briefly for the recorded call. - import time - - for _ in range(100): - if prefetcher.warmed: - break - time.sleep(0.05) + assert warmed_event.wait(timeout=5) assert prefetcher.warmed == ["/models/b"]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/llmapi/test_session_prefetcher.py` around lines 71 - 90, This test waits on background warming with a long polling loop, which makes the assertion slower and less deterministic. Update test_model_switch_triggers_warm_even_within_block to use a tighter synchronization signal, such as a threading.Event set by the monkeypatched _warm path, so the test can wait directly for the warm call instead of polling prefetcher.warmed. Keep the existing SessionPrefetcher.on_collection and SessionPrefetcher.on_test_setup coverage, but replace the sleep loop with a faster explicit wait and preserve the deduplication assertion.
47-47: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueAdd return type annotations to test functions.
None of the test/helper functions in this file annotate a return type. As per coding guidelines, "Always annotate functions with return types (use
Noneif no return)."Also applies to: 55-55, 62-62, 71-71, 92-92, 98-98, 107-107
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/llmapi/test_session_prefetcher.py` at line 47, Add return type annotations to the test/helper functions in test_session_prefetcher.py so each function explicitly declares a return of None. Update the definitions for the affected test functions, including test_disabled_is_noop and the other referenced tests in this file, to follow the coding guideline of always annotating return types even when no value is returned.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/unittest/llmapi/test_session_prefetcher.py`:
- Around line 92-95: Add coverage for the happy path in
SessionPrefetcher.take(): the current test only exercises the spec-mismatch
branch by setting _built_spec and _built_session directly. Add a test around
take() that sets matching built state, verifies it returns the built session,
and then verifies the internal state is cleared so a second take() with the same
spec returns None; use the existing SessionPrefetcher fixture and the take
method to validate the consume-once behavior.
---
Nitpick comments:
In `@tests/unittest/llmapi/test_session_prefetcher.py`:
- Line 79: The test file has an inline import of time inside a test rather than
at module scope. Move the import into the top-level imports alongside the other
module imports in test_session_prefetcher so the test stays idiomatic and the
import location is consistent.
- Around line 71-90: This test waits on background warming with a long polling
loop, which makes the assertion slower and less deterministic. Update
test_model_switch_triggers_warm_even_within_block to use a tighter
synchronization signal, such as a threading.Event set by the monkeypatched _warm
path, so the test can wait directly for the warm call instead of polling
prefetcher.warmed. Keep the existing SessionPrefetcher.on_collection and
SessionPrefetcher.on_test_setup coverage, but replace the sleep loop with a
faster explicit wait and preserve the deduplication assertion.
- Line 47: Add return type annotations to the test/helper functions in
test_session_prefetcher.py so each function explicitly declares a return of
None. Update the definitions for the affected test functions, including
test_disabled_is_noop and the other referenced tests in this file, to follow the
coding guideline of always annotating return types even when no value is
returned.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c80a006d-33b2-4126-81c5-bb510a689648
📒 Files selected for processing (3)
tests/test_common/session_prefetcher.pytests/unittest/llmapi/conftest.pytests/unittest/llmapi/test_session_prefetcher.py
| def test_take_spec_mismatch_returns_none(prefetcher): | ||
| prefetcher._built_spec = 4 | ||
| prefetcher._built_session = None | ||
| assert prefetcher.take(2) is None |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Missing coverage: take() success path is untested.
The only take() test covers the spec-mismatch branch by directly poking _built_spec/_built_session. There's no test verifying that when the requested spec matches _built_spec, take() returns the built session and correctly resets internal state (so a second take() call with the same spec doesn't return a stale session). This is a core behavioral contract of the prefetch handover and should be covered.
As per path instructions, coverage here is insufficient — suggest adding a test like:
def test_take_returns_and_resets_on_match(prefetcher):
prefetcher._built_spec = 4
prefetcher._built_session = "fake-session"
assert prefetcher.take(4) == "fake-session"
assert prefetcher.take(4) is None # already consumed🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unittest/llmapi/test_session_prefetcher.py` around lines 92 - 95, Add
coverage for the happy path in SessionPrefetcher.take(): the current test only
exercises the spec-mismatch branch by setting _built_spec and _built_session
directly. Add a test around take() that sets matching built state, verifies it
returns the built session, and then verifies the internal state is cleared so a
second take() with the same spec returns None; use the existing
SessionPrefetcher fixture and the take method to validate the consume-once
behavior.
Source: Path instructions
|
PR_Github #57386 [ run ] triggered by Bot. Commit: |
…prefetched pools Tests opt in with @pytest.mark.prefetch_session(N) + the fixture, then pass it via LLM(..., _mpi_session=...). The fixture hands over the pool built in the background during the previous test, or builds one synchronously when nothing was prefetched. This makes the PR self-contained: detection, build and handover all live here. Signed-off-by: qgai <qgai@nvidia.com>
|
/bot run --disable-fail-fast |
|
PR_Github #57394 [ run ] triggered by Bot. Commit: |
|
PR_Github #57386 [ run ] completed with state |
…ow mode - Default on (TRTLLM_TEST_PREFETCH_SESSION=0 to disable); with no consumer the prefetcher never fires, so plain suites are unaffected. - Shadow mode: install_pool_factory() patches the three MpiPoolSession construction seams (proxy, rpc_proxy, llm) so bare LLM(...) tests consume prefetched pools with zero test changes. Whenever a multi-GPU pool is built or handed over, a spare pool of the same size is built in the background for the next test; a size miss is discarded and the sync build is no slower than without prefetch. - Env-snapshot guard: workers freeze their environment at spawn, so a prefetched pool is discarded if TRTLLM*/TLLM*/NCCL*/CUDA*/UCX*/OMPI* vars changed since it was built. - take() join timeout tightened to 180s (slowest legitimate build measured ~117s); on a genuine hang we fall back to a synchronous build. - pytest_sessionfinish disposes any unconsumed shadow pool. Signed-off-by: qgai <qgai@nvidia.com>
|
/bot run --disable-fail-fast |
|
PR_Github #57402 [ run ] triggered by Bot. Commit: |
|
PR_Github #57394 [ run ] completed with state |
The default single-GPU path also spawns a 1-worker MpiPoolSession (executor.py _create_ipc_executor with mpi_session=None -> proxy.py), paying the same ~50s spawn+import as multi-GPU pools (measured 45-57s for 1-worker pools on B200). Excluding n_workers==1 from the factory left the largest test population without the benefit. Signed-off-by: qgai <qgai@nvidia.com>
|
/bot run --disable-fail-fast |
|
PR_Github #57407 [ run ] triggered by Bot. Commit: |
|
PR_Github #57402 [ run ] completed with state |
|
/bot run --disable-fail-fast |
…hadow mode Validated on existing tests (test_llm_pytorch.py, zero test changes) on a B200 node: prefetch handed over a 1-worker pool and the 2-test run went 157.1s -> 126.3s (-20%), all results identical to baseline. Fixes from that run: - threadleak: the prefetched pool's MPIPoolExecutor manager thread (Thread-N (_manager_spawn)) and the prefetcher's own named threads legitimately outlive the test they start under; exclude them in tests/unittest/pytest.ini and name all prefetcher threads session-prefetch-*. - CUDA-clean assert: some library versions initialize an idle CUDA context at tensorrt_llm import time, so the worker-side hard assert is unsatisfiable there. Downgrade to a report + log note; the idle context (no kernels, no allocations) safely coexists with the running test, as all A/B experiments demonstrated. Signed-off-by: qgai <qgai@nvidia.com>
|
/bot run --disable-fail-fast |
|
PR_Github #57418 [ run ] triggered by Bot. Commit: |
|
PR_Github #57420 [ run ] triggered by Bot. Commit: |
|
PR_Github #57418 [ run ] completed with state |
|
PR_Github #57407 [ run ] completed with state |
|
/bot run --disable-fail-fast |
…mand-driven install Cover every test tree with one plugin module instead of per-directory conftests: both rootdirs (tests/unittest, tests/integration/defs) load test_common.session_prefetcher_hooks via 'addopts = -p ...' in their pytest.ini, so any pytest invocation of either tree gets the factory. Demand-driven by construction: install_pool_factory_if_loaded() only patches tensorrt_llm executor modules ALREADY imported by the collected tests, so suites that never create MPI pools pay nothing - not even the tensorrt_llm import. Tests that do create pools hit the patched seam and prefetch automatically; there is nothing to scan or annotate. The llmapi/conftest.py wiring is superseded by the plugin and removed. Verified end-to-end on unmodified tests (2-test run 157.1s -> 132.3s, handover logged, results identical); 14 unit tests pass. Signed-off-by: qgai <qgai@nvidia.com>
|
/bot run --disable-fail-fast |
|
PR_Github #57434 [ run ] triggered by Bot. Commit: |
|
PR_Github #57436 [ run ] triggered by Bot. Commit: |
|
PR_Github #57434 [ run ] completed with state |
|
PR_Github #57420 [ run ] completed with state |
|
PR_Github #57436 [ run ] completed with state
|
Summary
Multi-GPU LLM-API tests pay ~50-65s per test to spawn an
MpiPoolSessionwhose workers importtensorrt_llm, plus a cold read of the model weights. Both are pure CPU/IO work, so they run in background threads while the PREVIOUS test still owns the GPUs — hiding the cost behind its runtime.Enabled by default (
TRTLLM_TEST_PREFETCH_SESSION=0to disable). No production code is changed; everything lives undertests/.Three ways tests benefit
install_pool_factory()patches the threeMpiPoolSessionconstruction seams (executor/proxy.py,executor/rpc_proxy.py,llmapi/llm.py, all in theirmpi_session is Nonebranches) so bareLLM(...)tests consume prefetched pools automatically. Whenever a multi-GPU pool is built or handed over, a spare pool of the same size is built in the background for the next test. A size miss is discarded — the sync build is no slower than without prefetch. Tests passing their own_mpi_session(shared/grouped pools) are never intercepted.@pytest.mark.prefetch_session(N)+ theprefetched_mpi_sessionfixture, for suites that know their boundaries.@pytest.mark.prefetch_model_dir(path)pre-reads the NEXT test's weight files (fires even between tests sharing a session).Safety
TRTLLM*/TLLM*/NCCL*/CUDA*/UCX*/OMPI*vars changed since it was built.pytest_sessionfinishdisposes any unconsumed shadow pool.Measured benefit (4xB200, DeepSeek-V3-Lite nvfp4 / Llama-3.1-8B+EAGLE3)
Savings per boundary = min(pool build time, previous test duration).
Changes
tests/test_common/session_prefetcher.py— prefetcher, shadow-mode factory, env guard, page-cache warming (new)tests/unittest/llmapi/conftest.py— hook wiring, factory install, marker registration, fixture (new)tests/unittest/llmapi/test_session_prefetcher.py— pure-logic unit tests, no MPI/GPU needed (new)Test plan
pytest tests/unittest/llmapi/test_session_prefetcher.py— 14 passed (container, B200 node)