[CUDA][Performance] Add radix select implementation for efficient partition operations #3117
Lyxot wants to merge 18 commits into ml-explore:main from
Conversation
I got the following benchmark results on the 4070 Super:

Performance is mostly OK, but some dtypes (notably float32) still need further optimization.

Benchmark results may vary with hardware; further testing is required.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
@zcbenz Could you please review this PR?
@Lyxot Thanks for your contributions. I'm not familiar with GPU radix sort, so I'll have to do some homework before I can review, and I'm currently stuck on a hard problem, so it will take some time before I can look into this. Maybe other maintainers can take a look before I do.
Hi @Lyxot, thanks for the PR! I think part of my comment on #3069 also applies here. In short, I think a PR that addresses a smaller use case but is consistently better than the fallback would be much better. It would be shorter, the code would be simpler, and more importantly we wouldn't have to either accept regressions or build some complicated heuristic for routing to the fallback. My suggestion to begin with is to only tackle the use case of small axes that fit in shared memory. That would cover, for instance, MoE expert selection, since the number of tokens can vary from 1 to tens of thousands but the axis is fairly small, 8 to a few hundred. This is also a use case where the current implementation is slow (as is also the case in #3069).
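The MoE expert-selection use case described above can be illustrated with a plain-Python sketch (an illustration of the shape regime, not MLX code; the function name and shapes are made up):

```python
def select_experts(logits, k):
    """For each token (row), return indices of the k largest logits.

    The partitioned axis (number of experts) is small, while the number
    of rows (tokens) can range from 1 to tens of thousands -- exactly
    the shape regime a small-axis shared-memory kernel targets.
    """
    return [
        sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        for row in logits
    ]

# One token, four experts, pick the top 2:
select_experts([[0.1, 2.0, -1.0, 3.0]], 2)  # -> [[3, 1]]
```

Unlike this ordered sketch, `argpartition` leaves the order of the selected k elements unspecified, which is what makes a select (rather than a full sort) sufficient.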
@angeloskath I tuned the small kernel, which fits in shared memory. If you prefer a simpler scope for this PR, I can remove the large-kernel path and keep only the small kernel, with a fallback to merge sort for the rest.
Current performance of the small kernel is:
Fix two correctness issues in CUDA radix partition/argpartition:
- In the large contiguous radix path, stop deriving row bases from `row * min(non-axis stride)` and compute row offsets with `elem_to_loc(...)` using non-axis shape/strides (matching merge-sort indexing behavior).
- Keep stride arguments 64-bit end-to-end in radix-select kernels and launches (remove narrowing to `int` and the related `INT32_MAX` guard).

This fixes incorrect row addressing for valid contiguous non-linear layouts (e.g. column-major with axis=0) and avoids silent misindexing on large strides.
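The indexing fix in this commit can be illustrated with a plain-Python model of `elem_to_loc`-style addressing (a sketch, not the actual CUDA helper; the example layout is made up):

```python
def elem_to_loc(elem, shape, strides):
    """Map a linear element index to a memory offset via shape/strides."""
    loc = 0
    for dim in range(len(shape) - 1, -1, -1):
        loc += (elem % shape[dim]) * strides[dim]
        elem //= shape[dim]
    return loc

# Row-major (2, 3, 4) tensor partitioned along axis=1: the non-axis
# metadata is shape (2, 4) with strides (12, 1). Row 5 (i=1, k=1)
# lives at offset 12*1 + 1*1 = 13, while a naive base of
# `row * min(non-axis stride)` would give 5 * 1 = 5 -- the wrong row.
elem_to_loc(5, (2, 4), (12, 1))  # -> 13
```

This is why the commit switches row addressing to the shape/stride walk instead of scaling by the minimum non-axis stride.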
Eliminate MAX_NDIM-based rank limits in CUDA radix partition/argpartition by switching radix kernels from fixed-size __grid_constant__ shape/stride params to dynamic device pointers for non-axis metadata. Changes: - Update radix kernels to take dynamic NC metadata pointers: - radix_select_small_nc_kernel - radix_select_large_streaming_kernel - radix_select_large_streaming_nc_kernel - In gpu_radix_partition_small/gpu_radix_partition_large: - allocate device buffers for nc_shape/in_nc_strides/out_nc_strides - copy host metadata with cudaMemcpyAsync - pass pointers into kernel launches - Remove MAX_NDIM-dependent routing so high-rank tensors can still use radix partition path. - Keep stride handling 64-bit end-to-end in radix launches/kernels. Also slightly widens fallback-model threshold range (without changing max_rows).
remove fallback strategy based on estimated shared-memory usage and device limits
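A routing decision like the one this commit removes could, in principle, look like the following sketch (the function name, the 48 KiB default, and the per-element cost model are assumptions for illustration, not the PR's actual code):

```python
def fits_in_shared_memory(axis_size, value_bytes, index_bytes=4,
                          smem_limit=48 * 1024):
    """Hypothetical check: can one row's values plus their indices be
    staged in shared memory? (48 KiB is a common default per-block
    shared-memory limit on NVIDIA GPUs.)"""
    needed = axis_size * (value_bytes + index_bytes)
    return needed <= smem_limit

fits_in_shared_memory(256, 4)      # small axis: True
fits_in_shared_memory(1 << 20, 4)  # huge axis: False
```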
@angeloskath I've narrowed the scope of this PR. This version now only targets the small-axis case that fits in shared memory and falls back to merge sort for the remaining cases. I also removed the large-kernel path to keep the implementation smaller and avoid extra routing/heuristic complexity. Could you please take another look?
Proposed changes
This adds a CUDA radix-select based path for `argpartition`/`partition` and introduces multi-block-per-row and multi-row-per-block strategies for shapes where normal radix select underperforms. #3064

What changed
- `mlx/backend/cuda/device/radix_select.cuh`: radix select kernels (`blocks_per_row`, `rows_per_block`)
- `mlx/backend/cuda/sort.cu`: `ArgPartition::eval_gpu`/`Partition::eval_gpu` now call the radix partition path
- `benchmarks/python/radix_select_bench.py`: correctness checks, determinism checks, and performance sweep utilities

Checklist
Put an x in the boxes that apply.
- [x] I have run `pre-commit run --all-files` to format my code / installed pre-commit prior to committing changes
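For readers unfamiliar with the technique, the core of radix select can be sketched in a few lines of Python (a radix-2 CPU model for clarity; the actual CUDA kernels presumably use wider digits, per-block histograms, and shared memory):

```python
def radix_select(values, k, bits=32):
    """Return the k-th smallest (0-indexed) of non-negative integers.

    Scan bits from most to least significant; at each step, the count
    of elements with a 0 in the current bit tells us whether the
    answer lies in the 0-bucket or the 1-bucket, so we keep only that
    bucket. Unlike a sort, each pass discards most candidates.
    """
    candidates = list(values)
    for b in range(bits - 1, -1, -1):
        zeros = [v for v in candidates if not (v >> b) & 1]
        if k < len(zeros):
            candidates = zeros
        else:
            k -= len(zeros)
            candidates = [v for v in candidates if (v >> b) & 1]
    return candidates[0]
```

Floating-point keys can be handled by a standard order-preserving bit transform, which is presumably why some dtypes in the benchmarks behave differently than integers.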