[https://nvbugs/6412133][fix] Only populate self.all_weights[self.device_id] at construction; lazily…#15921
[https://nvbugs/6412133][fix] Only populate self.all_weights[self.device_id] at construction; lazily…#15921trtllm-agent wants to merge 1 commit into
self.all_weights[self.device_id] at construction; lazily…#15921Conversation
…Handles
RefNVFP4ModelWithIPCHandles._quantize_and_replicate_weights eagerly
replicated all NVFP4 quantized tensors to every visible CUDA device
(range(torch.cuda.device_count())). On a 4-GPU CI runner, cuda:2 and
cuda:3 held a full copy of the 30B-A3B weights even though the TP=2
test only exposes cuda:0/1 via get_weight_ipc_handles_serialized([0, 1]).
Those unused replicas persisted across parametrize IDs and occasionally
triggered SIGABRT ("Test terminated unexpectedly") during subsequent
test setup.
Only populate self.device_id at construction; materialize additional
replicas lazily on the first request in get_weight_ipc_handles_serialized
and cache them so per-filter partial updates aren't slower.
Also remove the corresponding waives.txt entry.
Signed-off-by: trtllm-agent <296075020+trtllm-agent@users.noreply.github.com>
📝 WalkthroughWalkthroughModifies ChangesNVFP4 weight replication fix
Estimated code review effort: 2 (Simple) | ~10 minutes Sequence Diagram(s)sequenceDiagram
participant Test as TestCode
participant RefModel as RefNVFP4ModelWithIPCHandles
participant GPUOwner as device_id GPU
participant GPUTarget as Requested GPU
Test->>RefModel: _quantize_and_replicate_weights()
RefModel->>GPUOwner: store model_weights on self.device_id only
Test->>RefModel: get_weight_ipc_handles_serialized(device)
alt device not in self.all_weights
RefModel->>GPUOwner: read tensors
RefModel->>GPUTarget: copy tensors to cuda:{device}
end
RefModel-->>Test: serialized IPC handles
Possibly related PRs
Suggested reviewers: 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/unittest/_torch/ray_orchestrator/multi_gpu/test_llm_update_weights_multi_gpu.py (1)
423-435: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winPass explicit
device_idshere —tests/unittest/_torch/ray_orchestrator/multi_gpu/test_llm_update_weights_multi_gpu.py:720-723still falls back totorch.cuda.device_count(), so this path can materialize weights on every visible GPU and keep the full-replication behavior alive.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/ray_orchestrator/multi_gpu/test_llm_update_weights_multi_gpu.py` around lines 423 - 435, The fallback in get_weight_ipc_handles_serialized still uses torch.cuda.device_count(), which can replicate weights across every visible GPU. Update this path to require and pass through the explicit device_ids from the caller, and use that list instead of deriving all CUDA devices when building all_weights so the serialization stays limited to the intended devices.
🧹 Nitpick comments (2)
tests/unittest/_torch/ray_orchestrator/multi_gpu/test_llm_update_weights_multi_gpu.py (2)
285-330: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winStale docstring: no longer "replicates across devices".
The docstring still says the method quantizes weights "and replicate[s] across devices," but per Line 330 only
self.device_idis populated now; replication is deferred toget_weight_ipc_handles_serialized. Worth updating to avoid confusing future readers about where replication actually happens.📝 Suggested docstring tweak
def _quantize_and_replicate_weights(self): - """Quantize linear weights to NVFP4 and replicate across devices. + """Quantize linear weights to NVFP4 for the owning device. + + Cross-device replicas are created lazily in + ``get_weight_ipc_handles_serialized`` when requested. Projections in the same fusion group (qkv, gate_up) share a unified weight_scale_2 computed from their joint amax, matching the behavior of verl's TRTLLMNVFP4QuantizerHelper. """🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/ray_orchestrator/multi_gpu/test_llm_update_weights_multi_gpu.py` around lines 285 - 330, Update the _quantize_and_replicate_weights docstring to match the current behavior: it no longer replicates weights across devices in this method, since only self.all_weights[self.device_id] is populated and replication is deferred to get_weight_ipc_handles_serialized. Adjust the wording around the method purpose and the “replicate across devices” claim so readers are not misled.
219-230: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winConsider adding a regression assertion for the lazy-replication behavior.
As per path instructions ("Act as a QA engineer reviewing test changes and coverage… suggest concrete list file names and whether coverage is sufficient"): this change fixes an intermittent SIGABRT tied to stale replicas persisting on unused devices, but there's no visible assertion that
self.all_weightsstays scoped to only the owning device until requested (e.g.assert list(ref_model.all_weights) == [device_id]right after__init__, and that a new key appears only afterget_weight_ipc_handles_serialized([other_device])). Without it, coverage for this specific fix is insufficient — a future refactor could silently reintroduce eager replication without any test failing.Also applies to: 423-447
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/ray_orchestrator/multi_gpu/test_llm_update_weights_multi_gpu.py` around lines 219 - 230, Add a regression assertion in the multi-GPU test to verify lazy replication stays scoped to the owning device after initialization. In the __init__-related setup for the ref model, assert that self.all_weights contains only the current device key immediately after construction, then call get_weight_ipc_handles_serialized(...) for another device and assert the new key appears only at that point. Use the existing ref_model/all_weights and get_weight_ipc_handles_serialized symbols in the test so this behavior is covered and a future eager-replication refactor will fail.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In
`@tests/unittest/_torch/ray_orchestrator/multi_gpu/test_llm_update_weights_multi_gpu.py`:
- Around line 423-435: The fallback in get_weight_ipc_handles_serialized still
uses torch.cuda.device_count(), which can replicate weights across every visible
GPU. Update this path to require and pass through the explicit device_ids from
the caller, and use that list instead of deriving all CUDA devices when building
all_weights so the serialization stays limited to the intended devices.
---
Nitpick comments:
In
`@tests/unittest/_torch/ray_orchestrator/multi_gpu/test_llm_update_weights_multi_gpu.py`:
- Around line 285-330: Update the _quantize_and_replicate_weights docstring to
match the current behavior: it no longer replicates weights across devices in
this method, since only self.all_weights[self.device_id] is populated and
replication is deferred to get_weight_ipc_handles_serialized. Adjust the wording
around the method purpose and the “replicate across devices” claim so readers
are not misled.
- Around line 219-230: Add a regression assertion in the multi-GPU test to
verify lazy replication stays scoped to the owning device after initialization.
In the __init__-related setup for the ref model, assert that self.all_weights
contains only the current device key immediately after construction, then call
get_weight_ipc_handles_serialized(...) for another device and assert the new key
appears only at that point. Use the existing ref_model/all_weights and
get_weight_ipc_handles_serialized symbols in the test so this behavior is
covered and a future eager-replication refactor will fail.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7d746a5c-4a95-4a89-bfbe-b892ba23831d
📒 Files selected for processing (2)
tests/integration/test_lists/waives.txttests/unittest/_torch/ray_orchestrator/multi_gpu/test_llm_update_weights_multi_gpu.py
💤 Files with no reviewable changes (1)
- tests/integration/test_lists/waives.txt
Summary
_quantize_and_replicate_weightseagerly replicates NVFP4 quantized weights to every visible CUDA device (0..3), so cuda:2/3 retain stale replicas across parametrized test runs on the 4-GPU CI runner, occasionally SIGABRT-ing subsequent test setup.self.all_weights[self.device_id]at construction; lazily materialize per-device replicas on the firstget_weight_ipc_handles_serialized(device_ids=[...])call and cache them so partial-update paths aren't slower. Also removed the waives.txt entry.Test plan
Links
Summary by CodeRabbit