Need for Speed

Compare Kernels written in Different GPU Programming Languages

Need for Speed is a benchmarking and comparison framework for GPU kernel implementations. Aim of this project is write kernels in different GPU programming languages, and compare their correctness and performance against PyTorch ops as the baseline. The project currently includes Tiled GEMM (General Matrix Multiplication) kernels implemented in Triton and Mojo, with more languages and kernels planned for the future.

Available Kernels

The table lists all the kernels and the programming languages they are implemented in. Each kernel is designed to perform the same mathematical operation (GEMM) but may use different optimization techniques and memory access patterns based on the capabilities of the language:

Language	Kernel	Description
Triton	Tiled GEMM	Auto-tuned tiled kernel with FP32 accumulation, Tensor Core utilization, and grouped program IDs for L2 cache locality
Mojo	Tiled GEMM	Shared-memory tiled kernel with async DRAM-to-SRAM copies, barrier synchronization, and FP32 accumulation
Mojo	Naive GEMM	One-thread-per-element baseline using global memory only
cuBLAS	`torch.matmul`	Reference baseline via PyTorch's cuBLAS integration

Kernel source locations:

kernels/
└── gemm/
    ├── triton/
    │   ├── wrapper.py                  # Python wrapper
    │   └── kernels/matmul.py           # Triton kernel implementation
    └── mojo/
        ├── wrapper.py                  # Python wrapper
        └── kernels/
            ├── matmul.mojo             # Tiled GEMM kernel
            └── matmul_naive.mojo       # Naive GEMM kernel

How Kernels Are Launched

Each kernel implementation has a Python wrapper (wrapper.py) that provides a consistent interface. The wrappers handle:

Input validation (CUDA device checks, shape compatibility)
Memory layout enforcement (contiguous tensors)
Grid/block computation
Kernel dispatch

Both Triton and Mojo wrappers are decorated with @torch.compile to enable PyTorch's JIT compilation and optimization pipeline.

Running Individual Kernels

Each wrapper includes a main() function that runs a small correctness check:

# Run the Triton kernel
python -m kernels.gemm.triton.wrapper

# Run the Mojo kernels
python -m kernels.gemm.mojo.wrapper

Running Tests

The test suite validates correctness across multiple matrix sizes (64, 128, 256, 1024) using identity matrices, zero matrices, and random inputs, all checked against torch.matmul.

# Run all tests
bash scripts/run_test.sh --all

# Run only Triton tests
bash scripts/run_test.sh --triton

# Run only Mojo tests
bash scripts/run_test.sh --mojo

Note: Mojo tests require the Mojo package to be compiled first. The run_test.sh script handles this automatically. To compile manually: mojo package kernels/gemm/mojo/kernels -o kernels/gemm/mojo/kernels/mojo_kernel.mojopkg

Running Benchmarks

Benchmarks measure kernel execution time and compute TFLOPS for square matrices at sizes 1024, 2048, and 4096, comparing each kernel against cuBLAS.

bash scripts/run_benchmark.sh

Results are printed as a formatted table and saved as JSON to benchmarks/results/.

Setup

Prerequisites: Python >= 3.10, NVIDIA GPU with CUDA support, uv package manager.

# Set up the development environment
bash scripts/setup_dev.sh

This creates a virtual environment, installs PyTorch, Triton, the Modular SDK (for Mojo), and all other dependencies.

Contributing

Contributions are welcome! This project is designed to be extended with various compute kernels written in additional GPU programming languages. If you have experience with any other GPU programming language, we encourage you to contribute a kernel implementation.

To add a new kernel:

Create a new directory under kernels/gemm/<language>/ with a wrapper.py that exposes a matmul function matching the existing interface.
Add corresponding tests in tests/.
Register the new kernel in benchmarks/benchmark.py so it is included in benchmark runs.

Please see the existing Triton and Mojo implementations as reference for the expected structure and conventions.

License

This project is licensed under the MIT License. See LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
benchmarks		benchmarks
kernels/gemm		kernels/gemm
scripts		scripts
tests		tests
.flake8		.flake8
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Need for Speed

Available Kernels

How Kernels Are Launched

Running Individual Kernels

Running Tests

Running Benchmarks

Setup

Contributing

License

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Need for Speed

Available Kernels

How Kernels Are Launched

Running Individual Kernels

Running Tests

Running Benchmarks

Setup

Contributing

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages