[Bugfix][Nixl] Fix full prefix cache hit bug #18632
base: main
Conversation
Signed-off-by: [email protected] <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Signed-off-by: [email protected] <[email protected]>
@njhill - can you let me know if this works okay with multi-connector?
```diff
+# If remote_blocks and num_external_tokens = 0, we have
+# a full prefix cache hit on the D worker. We need to call
+# send_notif in _read_blocks to free the memory on the P.
+local_block_ids = (blocks.get_unhashed_block_ids()
+                   if num_external_tokens > 0 else [])
 # Get unhashed blocks to pull from remote.
 self._reqs_need_recv[request.request_id] = (
-    request, blocks.get_unhashed_block_ids())
+    request, local_block_ids)
```
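For context, a minimal sketch of the receive-side behavior this comment describes. `send_notif` is mentioned in the diff above; everything else (function and helper names, signatures) is an illustrative assumption, not the PR's actual code:

```python
def read_blocks_or_notify(connector, request_id: str, dst_engine_id: str,
                          local_block_ids: list[int]) -> None:
    """Sketch of the _read_blocks fix: even with no blocks to pull,
    the prefill worker must still be told to free its copy."""
    if not local_block_ids:
        # Full prefix cache hit on the decode (D) worker: nothing to
        # transfer, but the prefill (P) worker still holds the blocks.
        connector.send_notif(dst_engine_id, request_id)  # assumed helper
        return
    # Normal path: issue NIXL reads for the unhashed local blocks.
    connector.start_reads(request_id, local_block_ids)  # assumed helper
```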
@robertgshaw2-redhat I'm still not sure that this part or the change to always call `update_state_after_alloc` is needed. I'd already added logic for this case in `get_num_new_matched_tokens` above:

vllm/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py, lines 215 to 222 in f203673:

```python
# NOTE: if count is 0 here, we have less than block_size
# tokens to pull after subtracting the local prefix cache hit.
# The remote only sends fully computed blocks, so there is
# nothing to transfer but we still need to notify the
# prefill worker so that the remote blocks are freed.
if all(p in params for p in ("remote_engine_id", "remote_host",
                             "remote_port")):
    self._reqs_need_recv[request.request_id] = (request, [])
```
I can see that the other two fixes below in `build_connector_meta` and `_read_blocks` are of course needed though.

If you think it's better to have this logic in this method then we can remove it from the other one. But again I feel it's logically clearer to not call `update_state_after_alloc` if 0 was returned from `get_num_new_matched_tokens`.
I think that `get_num_new_matched_tokens` should be a pure function. Adding a side effect to it is surprising given the name of the method, and we get different behavior depending on whether or not the request is able to be scheduled. This issue is actually causing a bug right now:

- If `allocate_slots` returns None, the request will remain in the waiting queue. This will cause us to add the request to `reqs_need_recv` more than once, and as a result we will call `_read_blocks` twice, which will do a double free on the P worker side. Similarly, this will happen if the request is preempted (it will get re-added to waiting). This is because we are not properly updating the request to have `do_remote_prefill=False` when it is added to `reqs_need_recv` from the `get_num_new_matched_tokens` function.

This is all just evidence that putting a side effect into this function is not a good idea. `update_state_after_alloc` is where we should handle everything related to `reqs_need_recv`, so we have a single place where all the logic is handled.
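To make the failure mode concrete, here is a simplified, hypothetical sketch of the scheduling flow. `allocate_slots`, `get_num_new_matched_tokens`, and `update_state_after_alloc` are taken from the discussion above, but the loop structure is an illustrative assumption, not the actual vLLM scheduler:

```python
from collections import deque

def schedule_waiting(waiting: deque, connector, kv_cache_manager) -> None:
    while waiting:
        request = waiting[0]
        # BUG (pre-fix): this call had a side effect that appended the
        # request to _reqs_need_recv, so it fires on *every* attempt.
        count, _load_async = connector.get_num_new_matched_tokens(
            request, request.num_computed_tokens)
        new_blocks = kv_cache_manager.allocate_slots(request, count)
        if new_blocks is None:
            # Request stays in waiting; the next pass repeats the side
            # effect, _read_blocks then runs twice, and the P worker's
            # blocks are freed twice.
            break
        connector.update_state_after_alloc(request, new_blocks, count)
        waiting.popleft()
```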
I removed those lines from `get_num_new_matched_tokens`.
@robertgshaw2-redhat that makes sense, I agree about the pure function thing. I did also notice that this could result in a double free on the P worker side in the case that it can't be scheduled, which isn't ideal (though I think it would probably be harmless).

But to me, thinking from the POV of a generic connector interface, it still feels a bit odd given the connector isn't offering any tokens. I guess we should very clearly document the semantics and expectations for the interface.

A related quirk is that in the async load case, I think `update_state_after_alloc` will currently be called twice for a request (a second time once the request moves out of `WAITING_FOR_REMOTE_KVS`).
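Given those concerns, the interface contract being argued for could be documented roughly as follows. `KVConnectorBase_V1` matches the v1 connector module referenced above, but the docstring wording is illustrative, not from the PR:

```python
class KVConnectorBase_V1:
    def get_num_new_matched_tokens(self, request, num_computed_tokens):
        """Pure query: report how many tokens the connector can supply
        beyond the local prefix cache hit. Must not mutate connector
        state; may be called repeatedly for the same request."""
        raise NotImplementedError

    def update_state_after_alloc(self, request, blocks,
                                 num_external_tokens):
        """Single mutation point, called after allocate_slots succeeds.
        All _reqs_need_recv bookkeeping belongs here."""
        raise NotImplementedError
```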
Signed-off-by: [email protected] <[email protected]>
```diff
@@ -212,15 +212,6 @@ def get_num_new_matched_tokens(
         if count > 0:
             return count, True

-        # NOTE: if count is 0 here, we have less than block_size
```
This is now handled in `update_state_after_alloc`.
@robertgshaw2-redhat changes will be needed to multi-connector too; I've pushed them to a branch, feel free to pull into this PR: njhill@4150a41
LGTM, with the multi-connector changes
SUMMARY: when there is a full prefix cache hit on the D worker, we never call `send_notif` to free the blocks on the P worker, since we skip calling `update_state_after_alloc`. This PR fixes that.
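Putting the threads of the review together, a hedged sketch of what the fixed `update_state_after_alloc` could look like (reconstructed from the diff and comments above, not copied from the PR):

```python
def update_state_after_alloc(self, request, blocks,
                             num_external_tokens: int) -> None:
    params = request.kv_transfer_params
    if not params or not params.get("do_remote_prefill"):
        return
    if all(p in params for p in ("remote_engine_id", "remote_host",
                                 "remote_port")):
        # num_external_tokens == 0 means a full prefix cache hit on the
        # D worker: record an empty block list so _read_blocks still
        # sends the notification that frees the blocks on the P worker.
        local_block_ids = (blocks.get_unhashed_block_ids()
                           if num_external_tokens > 0 else [])
        self._reqs_need_recv[request.request_id] = (request,
                                                    local_block_ids)
    # Clear the flag so a preempted or re-scheduled request is not
    # added to _reqs_need_recv again (avoiding the double free).
    params["do_remote_prefill"] = False
```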