[BugFix] Fix CPU Offloading Bug with UVA #32991
base: main
Conversation
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Code Review
This pull request addresses a bug with CPU offloading using Unified Virtual Addressing (UVA), where weights re-initialized during quantization were not being correctly re-offloaded. The changes introduce a mechanism to track and re-offload these replaced tensors, ensuring the offloading works as expected. The modification in csrc/cuda_view.cu to use a custom deleter for managing tensor lifetimes is a robust improvement. My review includes one suggestion to optimize the re-offloading logic to avoid unnecessary data copies for non-replaced parameters, which will improve efficiency.
if name in uva_offloaded_parameters:
    assert pin_memory, "UVA offloaded parameters must be pinned"
    cpu_data = p.data.to("cpu").pin_memory()
    p.data = get_cuda_view_from_cpu_tensor(cpu_data)
    p._vllm_is_uva_offloaded = True
The current logic for re-offloading UVA parameters is correct but could be more efficient. It re-offloads any parameter that was originally UVA-offloaded, even if it was not replaced within the device_loading_context. This results in unnecessary data copies for parameters that are already correctly offloaded.
To optimize this, we should only re-offload parameters that were originally UVA-offloaded but have since been replaced (i.e., they no longer have the _vllm_is_uva_offloaded attribute). This makes the check more specific and avoids unnecessary work.
Suggested change:

Before:
if name in uva_offloaded_parameters:
    assert pin_memory, "UVA offloaded parameters must be pinned"
    cpu_data = p.data.to("cpu").pin_memory()
    p.data = get_cuda_view_from_cpu_tensor(cpu_data)
    p._vllm_is_uva_offloaded = True

After:
if name in uva_offloaded_parameters and not getattr(
        p, "_vllm_is_uva_offloaded", False):
    assert pin_memory, "UVA offloaded parameters must be pinned"
    cpu_data = p.data.to("cpu").pin_memory()
    p.data = get_cuda_view_from_cpu_tensor(cpu_data)
    p._vllm_is_uva_offloaded = True
Purpose
This PR fixes a bug in the CPU offloading logic where weights re-initialized during quantization processing are not being correctly re-offloaded to the CPU when using Unified Virtual Addressing (UVA).
Bug:
During process_weights_after_loading, certain quantization methods replace the original model weights with newly initialized tensors (e.g., here). These new tensors are created as device tensors, and the current logic fails to re-offload them to CPU memory. As a result, weights that were supposed to be offloaded end up on the GPU, undoing the effect of offloading.

In contrast, the non-UVA code path (currently disabled) handles this correctly by re-offloading weights by name. This PR brings the same behavior to the UVA path.
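For context, a minimal sketch of the re-offloading approach. The helper and attribute names (get_cuda_view_from_cpu_tensor, _vllm_is_uva_offloaded, uva_offloaded_parameters) come from this PR's diff; the reoffload_uva_parameters wrapper, its signature, and the import path are illustrative assumptions rather than vLLM's actual code.

# Hedged sketch; see the assumptions noted above.
import torch

from vllm.utils import get_cuda_view_from_cpu_tensor  # import path assumed


def reoffload_uva_parameters(module: torch.nn.Module,
                             uva_offloaded_parameters: set[str],
                             pin_memory: bool = True) -> None:
    """Re-offload parameters that quantization replaced with device tensors."""
    for name, p in module.named_parameters():
        # Only touch parameters that were originally UVA-offloaded but lost
        # their offloaded status when process_weights_after_loading swapped
        # in a freshly initialized device tensor.
        if name in uva_offloaded_parameters and not getattr(
                p, "_vllm_is_uva_offloaded", False):
            assert pin_memory, "UVA offloaded parameters must be pinned"
            # Move the new weight to pinned host memory ...
            cpu_data = p.data.to("cpu").pin_memory()
            # ... and expose it to the GPU again as a zero-copy UVA view
            # (name of the view helper taken from the diff).
            p.data = get_cuda_view_from_cpu_tensor(cpu_data)
            p._vllm_is_uva_offloaded = True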
Test Plan
As an example of the bug, running DeepSeek-R1 NVFP4 (~350 GB) on a GB300 (288 GB) with 200 GB of CPU offloading results in an OOM:
Current main:
python3 -m vllm.entrypoints.openai.api_server \
    --model nvidia/DeepSeek-R1-NVFP4 \
    --trust-remote-code \
    --cpu-offload-gb 200 \
    --load-format dummy

Test Result
With this PR, the above configuration runs without OOM.
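For offline testing, roughly the same configuration can be expressed through vLLM's Python API. This is a hedged sketch: cpu_offload_gb, trust_remote_code, and load_format are existing engine arguments, and the values simply mirror the server command above.

from vllm import LLM

# Mirrors the api_server command above; values are the example's, not
# recommendations.
llm = LLM(
    model="nvidia/DeepSeek-R1-NVFP4",
    trust_remote_code=True,
    cpu_offload_gb=200,
    load_format="dummy",
)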
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.