Skip to content

[https://nvbugs/6412133][fix] Only populate self.all_weights[self.device_id] at construction; lazily…#15921

Open
trtllm-agent wants to merge 1 commit into
NVIDIA:mainfrom
tensorrt-cicd:repair-bot-bug6412133
Open

[https://nvbugs/6412133][fix] Only populate self.all_weights[self.device_id] at construction; lazily…#15921
trtllm-agent wants to merge 1 commit into
NVIDIA:mainfrom
tensorrt-cicd:repair-bot-bug6412133

Conversation

@trtllm-agent

@trtllm-agent trtllm-agent commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Root cause: _quantize_and_replicate_weights eagerly replicates NVFP4 quantized weights to every visible CUDA device (0..3), so cuda:2/3 retain stale replicas across parametrized test runs on the 4-GPU CI runner, occasionally SIGABRT-ing subsequent test setup.
  • Fix: Only populate self.all_weights[self.device_id] at construction; lazily materialize per-device replicas on the first get_weight_ipc_handles_serialized(device_ids=[...]) call and cache them so partial-update paths aren't slower. Also removed the waives.txt entry.
  • Automated fix generated by repair-bot

Test plan

  • Verify fix on the same GPU type as the original failure
  • Check for regressions in related tests

Links

Summary by CodeRabbit

  • Bug Fixes
    • Improved multi-GPU weight handling so quantized weights are copied only when needed, helping avoid missing-device issues during model updates.
    • Fixed on-demand GPU weight availability when serializing inter-process handles, which should make multi-GPU test runs more reliable.
  • Chores
    • Removed an outdated waived test entry.

…Handles

RefNVFP4ModelWithIPCHandles._quantize_and_replicate_weights eagerly
replicated all NVFP4 quantized tensors to every visible CUDA device
(range(torch.cuda.device_count())). On a 4-GPU CI runner, cuda:2 and
cuda:3 held a full copy of the 30B-A3B weights even though the TP=2
test only exposes cuda:0/1 via get_weight_ipc_handles_serialized([0, 1]).
Those unused replicas persisted across parametrize IDs and occasionally
triggered SIGABRT ("Test terminated unexpectedly") during subsequent
test setup.

Only populate self.device_id at construction; materialize additional
replicas lazily on the first request in get_weight_ipc_handles_serialized
and cache them so per-filter partial updates aren't slower.

Also remove the corresponding waives.txt entry.

Signed-off-by: trtllm-agent <296075020+trtllm-agent@users.noreply.github.com>
@coderabbitai

coderabbitai Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Modifies RefNVFP4ModelWithIPCHandles in the multi-GPU NVFP4 test module to store quantized weights only on the owning device instead of eagerly replicating to all GPUs, and adds on-demand cross-GPU copying when IPC handles are requested. Removes a previously skipped test waiver entry.

Changes

NVFP4 weight replication fix

Layer / File(s) Summary
Defer and lazily replicate quantized weights
tests/unittest/_torch/ray_orchestrator/multi_gpu/test_llm_update_weights_multi_gpu.py
Quantized weights are stored only on self.device_id during quantization; get_weight_ipc_handles_serialized now copies tensors to the requested device on demand when not already present before serializing IPC handles.
Remove obsolete test waiver
tests/integration/test_lists/waives.txt
Removes the SKIP waiver entry for test_llm_update_weights_nvfp4[fp8-Qwen3/Qwen3-30B-A3B] previously tied to a tracked bug.

Estimated code review effort: 2 (Simple) | ~10 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Test as TestCode
  participant RefModel as RefNVFP4ModelWithIPCHandles
  participant GPUOwner as device_id GPU
  participant GPUTarget as Requested GPU

  Test->>RefModel: _quantize_and_replicate_weights()
  RefModel->>GPUOwner: store model_weights on self.device_id only

  Test->>RefModel: get_weight_ipc_handles_serialized(device)
  alt device not in self.all_weights
    RefModel->>GPUOwner: read tensors
    RefModel->>GPUTarget: copy tensors to cuda:{device}
  end
  RefModel-->>Test: serialized IPC handles
Loading

Possibly related PRs

  • NVIDIA/TensorRT-LLM#15898: Both PRs modify tests/integration/test_lists/waives.txt by changing waived/skipped integration test entries.
  • NVIDIA/TensorRT-LLM#15914: Directly overlaps by addressing the same NVFP4 multi-GPU test waiver tied to NVBUG 6412133.

Suggested reviewers: yuxianq, dongxuy04, pcastonguay

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title matches the bug fix and lazy weight population change and includes the NVBugs reference.
Description check ✅ Passed The description explains the root cause, fix, test plan, and bug link, though it uses different headings than the template.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/unittest/_torch/ray_orchestrator/multi_gpu/test_llm_update_weights_multi_gpu.py (1)

423-435: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Pass explicit device_ids heretests/unittest/_torch/ray_orchestrator/multi_gpu/test_llm_update_weights_multi_gpu.py:720-723 still falls back to torch.cuda.device_count(), so this path can materialize weights on every visible GPU and keep the full-replication behavior alive.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/unittest/_torch/ray_orchestrator/multi_gpu/test_llm_update_weights_multi_gpu.py`
around lines 423 - 435, The fallback in get_weight_ipc_handles_serialized still
uses torch.cuda.device_count(), which can replicate weights across every visible
GPU. Update this path to require and pass through the explicit device_ids from
the caller, and use that list instead of deriving all CUDA devices when building
all_weights so the serialization stays limited to the intended devices.
🧹 Nitpick comments (2)
tests/unittest/_torch/ray_orchestrator/multi_gpu/test_llm_update_weights_multi_gpu.py (2)

285-330: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Stale docstring: no longer "replicates across devices".

The docstring still says the method quantizes weights "and replicate[s] across devices," but per Line 330 only self.device_id is populated now; replication is deferred to get_weight_ipc_handles_serialized. Worth updating to avoid confusing future readers about where replication actually happens.

📝 Suggested docstring tweak
     def _quantize_and_replicate_weights(self):
-        """Quantize linear weights to NVFP4 and replicate across devices.
+        """Quantize linear weights to NVFP4 for the owning device.
+
+        Cross-device replicas are created lazily in
+        ``get_weight_ipc_handles_serialized`` when requested.
 
         Projections in the same fusion group (qkv, gate_up) share a unified
         weight_scale_2 computed from their joint amax, matching the behavior
         of verl's TRTLLMNVFP4QuantizerHelper.
         """
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/unittest/_torch/ray_orchestrator/multi_gpu/test_llm_update_weights_multi_gpu.py`
around lines 285 - 330, Update the _quantize_and_replicate_weights docstring to
match the current behavior: it no longer replicates weights across devices in
this method, since only self.all_weights[self.device_id] is populated and
replication is deferred to get_weight_ipc_handles_serialized. Adjust the wording
around the method purpose and the “replicate across devices” claim so readers
are not misled.

219-230: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Consider adding a regression assertion for the lazy-replication behavior.

As per path instructions ("Act as a QA engineer reviewing test changes and coverage… suggest concrete list file names and whether coverage is sufficient"): this change fixes an intermittent SIGABRT tied to stale replicas persisting on unused devices, but there's no visible assertion that self.all_weights stays scoped to only the owning device until requested (e.g. assert list(ref_model.all_weights) == [device_id] right after __init__, and that a new key appears only after get_weight_ipc_handles_serialized([other_device])). Without it, coverage for this specific fix is insufficient — a future refactor could silently reintroduce eager replication without any test failing.

Also applies to: 423-447

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/unittest/_torch/ray_orchestrator/multi_gpu/test_llm_update_weights_multi_gpu.py`
around lines 219 - 230, Add a regression assertion in the multi-GPU test to
verify lazy replication stays scoped to the owning device after initialization.
In the __init__-related setup for the ref model, assert that self.all_weights
contains only the current device key immediately after construction, then call
get_weight_ipc_handles_serialized(...) for another device and assert the new key
appears only at that point. Use the existing ref_model/all_weights and
get_weight_ipc_handles_serialized symbols in the test so this behavior is
covered and a future eager-replication refactor will fail.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In
`@tests/unittest/_torch/ray_orchestrator/multi_gpu/test_llm_update_weights_multi_gpu.py`:
- Around line 423-435: The fallback in get_weight_ipc_handles_serialized still
uses torch.cuda.device_count(), which can replicate weights across every visible
GPU. Update this path to require and pass through the explicit device_ids from
the caller, and use that list instead of deriving all CUDA devices when building
all_weights so the serialization stays limited to the intended devices.

---

Nitpick comments:
In
`@tests/unittest/_torch/ray_orchestrator/multi_gpu/test_llm_update_weights_multi_gpu.py`:
- Around line 285-330: Update the _quantize_and_replicate_weights docstring to
match the current behavior: it no longer replicates weights across devices in
this method, since only self.all_weights[self.device_id] is populated and
replication is deferred to get_weight_ipc_handles_serialized. Adjust the wording
around the method purpose and the “replicate across devices” claim so readers
are not misled.
- Around line 219-230: Add a regression assertion in the multi-GPU test to
verify lazy replication stays scoped to the owning device after initialization.
In the __init__-related setup for the ref model, assert that self.all_weights
contains only the current device key immediately after construction, then call
get_weight_ipc_handles_serialized(...) for another device and assert the new key
appears only at that point. Use the existing ref_model/all_weights and
get_weight_ipc_handles_serialized symbols in the test so this behavior is
covered and a future eager-replication refactor will fail.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7d746a5c-4a95-4a89-bfbe-b892ba23831d

📥 Commits

Reviewing files that changed from the base of the PR and between 3b23a9f and 91f177f.

📒 Files selected for processing (2)
  • tests/integration/test_lists/waives.txt
  • tests/unittest/_torch/ray_orchestrator/multi_gpu/test_llm_update_weights_multi_gpu.py
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/waives.txt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants