Multi-gpu KNN build for UMAP using all-neighbors API #6654

Merged: 41 commits into rapidsai:branch-25.06 on May 29, 2025

Conversation

@jinsolp (Contributor) commented May 8, 2025

Description

Allows multi-GPU KNN graph building in UMAP using the all-neighbors API.

PRs that need to be merged before this one

Changes in cuML UMAP usage

from pylibraft.common import DeviceResourcesSNMG
from cuml.manifold import UMAP

# To use multiple GPUs, pass an SNMG (single-node multi-GPU) handle
multigpu_handle = DeviceResourcesSNMG()
umap_nnd = UMAP(handle=multigpu_handle,
                build_algo="nn_descent",
                build_kwds={"nnd_n_nearest_clusters": 2,
                            "nnd_n_clusters": 8,
                            "nnd_graph_degree": 32,
                            "nnd_max_iterations": 20})

Closes #6729

@jinsolp jinsolp requested review from a team as code owners May 8, 2025 20:44
@jinsolp jinsolp requested review from dantegd and vyasr May 8, 2025 20:44
@github-actions github-actions bot added the Cython / Python and CUDA/C++ labels May 8, 2025
@divyegala (Member) left a comment

The hints in the Python docs are awesome. But again, there is no way for a UMAP user to choose a multi-GPU build. We must not enable this automatically.

@jinsolp (Contributor, Author) commented May 12, 2025

@divyegala

But again, there is no way for a UMAP user to choose a multi-GPU build. We must not enable this automatically.

I am working on a multi-GPU resource on the RAFT side. The current plan is to wrap it in pylibraft and advise users to explicitly pass it as the handle if they want to use multiple GPUs, like this:

from pylibraft.common import DeviceResourcesSNMG
from cuml.manifold import UMAP

# To use multiple GPUs
multigpu_handle = DeviceResourcesSNMG()
# Or indicate which GPUs to use
multigpu_handle = DeviceResourcesSNMG([2, 4, 6, 7])

# Pass the multi-GPU handle as the handle to UMAP
umap_nnd = UMAP(handle=multigpu_handle,
                build_algo="nn_descent",
                build_kwds={"n_nearest_clusters": 2,
                            "n_clusters": 8,
                            "nn_descent": {"graph_degree": 32, "max_iterations": 20}})

If no handle is passed, the default handle is used, which results in a single-GPU run.
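
For contrast, a minimal single-GPU sketch (the cuml.manifold.UMAP import path and the example data are assumptions for illustration):

import numpy as np
from cuml.manifold import UMAP

X = np.random.rand(5000, 32).astype(np.float32)  # illustrative data

# No handle passed: the default handle is used, so the build runs on a single GPU
umap_single = UMAP(build_algo="nn_descent")
embedding = umap_single.fit_transform(X)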

Do you think there is a better way to let users do a multi-gpu build? Any advice would be greatly appreciated : )

@divyegala (Member) commented

@jinsolp thanks for the explanation, that API looks good to me. Opt-in behavior is perfect.

We need to be very explicit about the fact that this is not multi-GPU UMAP but rather a multi-GPU KNN step in UMAP. That's because users of cuml-dask expect their data to be distributed across GPUs while we still produce a learned ML model, but in this use case the data will need to be on a single GPU for the other parts of UMAP.

@jinsolp jinsolp force-pushed the umap-use-all-neighbors branch from d2383e9 to 888dd7a May 12, 2025 22:16
@jinsolp jinsolp removed request for dantegd and vyasr May 12, 2025 22:34
@divyegala (Member) left a comment

Single-GPU C++ and Python LGTM, but I'd like @csadorf to assign a Python reviewer as well. For multi-GPU, I'd like to see tests added perhaps in the dask module to ensure the environment has enough GPUs.

@viclafargue (Contributor) left a comment

Thanks @jinsolp! Just two small comments.

@csadorf csadorf self-requested a review May 13, 2025 16:38
@csadorf (Contributor) commented May 13, 2025

I would like to review this PR prior to merge.

@jinsolp (Contributor, Author) commented May 21, 2025

[After discussing offline]
For now, we decided to stick with the current nnd_-prefixed build_kwds without any nested kwds, since we only expose the NN Descent build for UMAP. This PR is no longer a breaking change and will just add an additional nnd_n_nearest_clusters key to build_kwds.
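
For illustration, the resulting call shape then mirrors the PR description (a sketch; the cuml.manifold.UMAP import path is an assumption):

from pylibraft.common import DeviceResourcesSNMG
from cuml.manifold import UMAP

umap_nnd = UMAP(
    handle=DeviceResourcesSNMG(),      # opt-in multi-GPU KNN build
    build_algo="nn_descent",
    build_kwds={
        "nnd_n_nearest_clusters": 2,   # the one new key added by this PR
        "nnd_n_clusters": 8,
        "nnd_graph_degree": 32,
        "nnd_max_iterations": 20,
    },
)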

@csadorf csadorf added the non-breaking label and removed the breaking label May 22, 2025
@divyegala (Member) left a comment

C++ approval only. LGTM!

@csadorf (Contributor) commented May 27, 2025

@jcrist Can you approve this PR assuming that your concerns were addressed?

@github-actions github-actions bot removed the CMake label May 27, 2025
@jinsolp (Contributor, Author) commented May 27, 2025

Changed the docs in the .pyx file (content mostly unchanged, style adjusted) because of a docs build failure in CI.
cc @csadorf

@divyegala (Member) left a comment

This PR is blocked on some changes from rapidsai/cuvs#944

@csadorf (Contributor) left a comment

A combination of nnd_n_clusters > 1 and data_on_host=False (the default) will currently break user code, because batching is not supported on device.

The agreed mitigation approach is to auto-set data_on_host=True with a deprecation warning in 25.06, and then to require data_on_host=True in combination with nnd_n_clusters > 1 as of 25.08.
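
For example, a sketch of the safe combination (assuming UMAP.fit exposes a data_on_host flag, as in recent cuML releases):

import numpy as np
from cuml.manifold import UMAP

X = np.random.rand(10000, 64).astype(np.float32)  # host (NumPy) data

umap_batched = UMAP(build_algo="nn_descent",
                    build_kwds={"nnd_n_clusters": 8})
# Batched NN Descent partitions the data into clusters and requires host data;
# with nnd_n_clusters > 1, the device-input default (data_on_host=False) breaks.
umap_batched.fit(X, data_on_host=True)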

@csadorf csadorf dismissed divyegala’s stale review May 28, 2025 21:00

No longer blocked.

@csadorf (Contributor) left a comment

Just a tiny suggestion for language, otherwise LGTM.

@divyegala (Member) commented

/merge

@rapids-bot rapids-bot bot merged commit c1a572d into rapidsai:branch-25.06 May 29, 2025
92 of 93 checks passed
Labels
CUDA/C++ · Cython / Python · feature request · non-breaking

Development

Successfully merging this pull request may close these issues:
Multi-GPU KNN graph construction in UMAP

6 participants