Skip to content

SVM bind_to_queue and unbind_queue cannot be safely used #645

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
inducer opened this issue Oct 21, 2022 · 5 comments
Open

SVM bind_to_queue and unbind_queue cannot be safely used #645

inducer opened this issue Oct 21, 2022 · 5 comments
Labels

Comments

@inducer
Copy link
Owner

inducer commented Oct 21, 2022

Since they modify the allocation in place, they're unusable e.g. in the array. Consider the expression ary2 = ary.with_queue(None), which one might think should unbind underlying SVM from the queue. But ary is still ailve, and both ary and ary2 refer to the same allocation.

Instead, we need some sort of "holder" object that inserts a queue into a list of queues that need to be synchronized with when freeing the SVM.

cc @matthiasdiener

@inducer inducer added the bug label Oct 21, 2022
@ayush-kulkarni
Copy link

ayush-kulkarni commented May 7, 2025

Is unbind_queue now called unbind_from_queue?

Also would you be able to clarify this question a little bit more? I'm struggling to understand what exactly the issue is and what a "holder" object is.

@inducer
Copy link
Owner Author

inducer commented May 8, 2025

Is unbind_queue now called unbind_from_queue?

Yes.

Also would you be able to clarify this question a little bit more? I'm struggling to understand what exactly the issue is and what a "holder" object is.

Sure. Take a look at the OpenCL documentation for clEnqueueSVMFree (and ask yourself why it exists, i.e. why clSVMFree would be inappropriate). Now ask yourself how this would interoperate with an array type in which

  • multiple Array objects can hold a reference to the same allocation, and
  • each of those objects may have a different built-in queue.

Which queues will need to be waited on before deallocation?

@ayush-kulkarni
Copy link

Sure. Take a look at the OpenCL documentation for clEnqueueSVMFree (and ask yourself why it exists, i.e. why clSVMFree would be inappropriate). Now ask yourself how this would interoperate with an array type in which

  • multiple Array objects can hold a reference to the same allocation, and
  • each of those objects may have a different built-in queue.

Thank you for the clarification, I'm starting to understand it better now.

Which queues will need to be waited on before deallocation?

So the queues that will need to be waited on before deallocation would be the ones that still have pending SVM tasks to do otherwise we would be releasing the memory of the overlying queue while not doing the same to the underlying one, right?

And if I'm understanding this right then for our example ary2 = ary.with_queue(None) the issue is that we're clearing ary2's memory, however ary's memory is still pending to be released (and never gets released), which causes the bug.

In order to fix this we would need a stack-like structure, so that when we call the command ary2 = ary.with_queue(None) it first checks if there are any underlying calls to the SVM that need to be completed, and if so it pushes those onto the stack and would keep doing this until there aren't any more underlying calls. Then it would start popping the stack and start deallocating memory, thus achieving the right order of deallocation. Would an implementation like this work?

@inducer
Copy link
Owner Author

inducer commented May 8, 2025

So the queues that will need to be waited on before deallocation would be the ones that still have pending SVM tasks to do

There's no practical way to detect which queues those are.

overlying queue [...] underlying one

Not sure what you mean, please define.

And if I'm understanding this right then for our example ary2 = ary.with_queue(None) the issue is that we're clearing ary2's memory, however ary's memory is still pending to be released (and never gets released), which causes the bug.

ary and ary2 refer to the same memory (or really, the same svm_allocation). As it stands right now, that SVM allocation will be associated with ary2's queue (and wait for its work to complete), but not that of ary. There currently is code to insert a synchronization point into ary's queue, so that work before bind_to_queue in ary's queue gets waited on:

pyopencl/src/wrap_cl.hpp

Lines 3869 to 3877 in ca1de86

if (m_queue.data() != queue.data())
{
// make sure synchronization promises stay valid in new queue
cl_event evt;
PYOPENCL_CALL_GUARDED(clEnqueueMarker, (m_queue.data(), &evt));
PYOPENCL_CALL_GUARDED(clEnqueueMarkerWithWaitList,
(queue.data(), 1, &evt, nullptr));
}

But that's insufficient, since more work could be enqueued on ary's queue after that time.

So, in summary, svm_allocation needs to keep track of not one queue but rather a multi-set of queues currently bound to it. At deallocation, it must enqueue markers in all of them and have the deallocation wait for all those markers.

@ayush-kulkarni
Copy link

ary and ary2 refer to the same memory (or really, the same svm_allocation). As it stands right now, that SVM allocation will be associated with ary2's queue (and wait for its work to complete), but not that of ary. There currently is code to insert a synchronization point into ary's queue, so that work before bind_to_queue in ary's queue gets waited on:

Oh okay thank you for correcting me!

Over the past days I have been tinkering with the code a little to get a hold of what the functions do and to overall just understand pyopencl better. However, I still find it a little difficult to fully understand the codebase (as this is my first time diving deep into one like this). At first it was a little overwhelming to navigate through all the files and try to understand what each piece was doing, but it got a little better the more time I spent on it. Would you be able to guide me on how to better understand the source code so that I can try fixing this issue better? I am always eager to learn more and can pick things up quick and do not plan on giving up on fixing this issue!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants