[Bugfix][VLM] Fix transformers backend embed_multimodal for Qwen2.5-VL profiling #32969
base: main
Conversation
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Code Review
The pull request effectively addresses the RuntimeError encountered during Qwen2.5-VL profiling by correctly handling the dimensions of vision_embeddings before splitting. The changes to flatten the embeddings to 2D, manage size mismatches during profiling, and remove redundant operations are well-implemented and directly resolve the reported issue. This improves the robustness and correctness of the multimodal embedding process.
cc @tjtanaa @DarkLight1337 Let me know if you can help review this 😃
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```python
if actual_size < total_expected:
    repeat_factor = (
        total_expected + actual_size - 1
    ) // actual_size
```
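For reference, the ceiling division above computes the smallest repeat count whose product covers the expected number of rows. Below is a minimal sketch of how such a factor is typically applied during profiling; the repeat-and-truncate step, the concrete sizes, and the hidden dimension are illustrative, not the exact PR code:

```python
import torch

# Illustrative profiling scenario: the dummy encoder output has a single row,
# but the split sizes derived from num_image_patches expect 9800 rows in total.
vision_embeddings = torch.zeros(1, 1280)  # hypothetical [actual_size, hidden_dim]
total_expected = 4900 + 4900

actual_size = vision_embeddings.shape[0]
if actual_size < total_expected:
    # Ceiling division: smallest k such that k * actual_size >= total_expected.
    repeat_factor = (total_expected + actual_size - 1) // actual_size
    # Repeat along dim 0, then truncate to exactly the expected row count.
    vision_embeddings = vision_embeddings.repeat(repeat_factor, 1)[:total_expected]

assert vision_embeddings.shape[0] == total_expected  # 9800 rows, splittable as [4900, 4900]
```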
Division by zero if vision embeddings tensor is empty
Low Severity
The repeat_factor calculation divides by actual_size without checking if it's zero. If vision_embeddings.shape[0] is 0 (empty tensor) while total_expected > 0, this causes a ZeroDivisionError. The code enters the if actual_size < total_expected block since 0 < positive is true, then attempts (total_expected + 0 - 1) // 0.
I think failing fast is the better approach here - an empty tensor from the vision encoder indicates an upstream bug that shouldn't be silently papered over. Added an explicit check with a clear error message.
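A minimal sketch of such a fail-fast guard, reusing the variable names from the diff above; the exact error message in the PR may differ:

```python
if actual_size == 0:
    # An empty vision embeddings tensor means the encoder produced no output
    # rows while patches were expected; surface this upstream bug immediately
    # instead of dividing by zero below.
    raise ValueError(
        f"Got an empty vision embeddings tensor, but {total_expected} "
        "embedding rows were expected."
    )
```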
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Fixes

`RuntimeError: split_with_sizes expects split_sizes to sum exactly to 1 (input tensor's size at dimension 0), but got split_sizes=[4900, 4900]` when running Qwen2.5-VL with the transformers backend.

Root Cause
During memory profiling, vLLM creates dummy encoder outputs with minimal size (shape `[1, hidden_dim]`). The original `embed_multimodal` code called `unsqueeze(0)` on 2D tensors, then attempted `torch.split()` with sizes derived from `num_image_patches` (e.g., `[4900, 4900]`), causing a dimension mismatch.
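A minimal sketch that reproduces the mismatch and shows the 2D flattening described in the fix below; the hidden size, patch counts, and variable names are illustrative rather than the actual `embed_multimodal` code:

```python
import torch

hidden_dim = 1280
num_image_patches = [4900, 4900]

# During profiling the dummy encoder output is tiny: shape [1, hidden_dim].
dummy_embeds = torch.zeros(1, hidden_dim)

# Old behavior: unsqueeze(0) turns the 2D tensor into [1, 1, hidden_dim], and
# splitting dim 0 into [4900, 4900] raises, because dim 0 has size 1:
#   torch.split(dummy_embeds.unsqueeze(0), num_image_patches, dim=0)

# New behavior: flatten any leading dimensions down to [num_rows, hidden_dim]
# first, and only split once the row count matches the expected patch counts.
embeds_2d = dummy_embeds.view(-1, hidden_dim)
if embeds_2d.shape[0] == sum(num_image_patches):
    per_image = torch.split(embeds_2d, num_image_patches, dim=0)
```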
Fix

- Replaced `unsqueeze(0)` with a proper `view(-1, hidden_dim)` for 3D-to-2D flattening.
- Removed the redundant `flatten(start_dim=0, end_dim=-2)` loop (a no-op on 2D tensors).

Testing