Thank you for your interest in contributing to AITER! We are building a high-performance inference runtime optimized for AMD GPUs with ROCm. Our community welcomes contributions of all kinds, whether you're fixing bugs, optimizing kernels, adding new operators, or improving documentation.
There are several ways you can contribute to AITER:
- Report Issues: Identify and report bugs, performance issues, or unexpected behavior
- Add Operators: Request or implement new operators for LLM inference
- Optimize Kernels: Improve existing HIP/CK/Triton kernel performance
- Hardware Support: Extend support for new AMD GPU architectures (MI200, MI300, etc.)
- Documentation: Improve docs, add tutorials, or write performance guides
- Code Review: Review pull requests and provide constructive feedback
- Community Support: Answer questions and help other users
We also encourage you to share your experiences with AITER in blog posts, social media, or conference talks. If AITER helps your project, please consider starring our repository!
Not sure where to start? Check out these tasks:
- Good First Issues: Simple bugs or small enhancements
- Help Wanted: Features or optimizations that need community help
- Kernel Optimization: Performance improvement opportunities for existing operators
- New Operator Requests: Missing operators needed for new models
Before contributing, ensure you have:
- AMD GPU: MI200, MI300 series, or compatible ROCm hardware
- ROCm: Version 5.7+ installed and configured
- Python: 3.9, 3.10, 3.11, or 3.12
- PyTorch: ROCm-enabled PyTorch 2.0+
- Git: For version control
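The Python version requirement above can be sanity-checked programmatically. This is a small illustrative helper, not part of AITER:

```python
import sys

# Supported Python minor versions listed in the prerequisites
SUPPORTED = {(3, 9), (3, 10), (3, 11), (3, 12)}


def python_supported(version_info=sys.version_info) -> bool:
    """Return True if the given Python version matches a supported minor version."""
    return (version_info[0], version_info[1]) in SUPPORTED


print(python_supported((3, 11, 4)))  # True
print(python_supported((3, 8, 10)))  # False
```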
Fork the AITER repository to your GitHub account, then clone it:
```bash
git clone https://github.com/<your_username>/aiter.git
cd aiter
git remote add upstream https://github.com/ROCm/aiter.git  # Add upstream remote
```

We recommend using a virtual environment:

```bash
# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Upgrade pip
pip install --upgrade pip
```

For Python-only development:

```bash
# Install with precompiled kernels
PREBUILD_KERNELS=1 GPU_ARCHS="gfx942;gfx950" python setup.py develop
```

For full development (Python + HIP/CK kernels):

```bash
# Build all kernels from source
pip install -e .
```
---
## Code Quality
### Pre-commit Hooks
AITER uses `pre-commit` to maintain code quality. Install it before making changes:
```bash
pip install black==25.1.0 ruff==0.11.11
apt install clang-format # For C++/HIP code formatting
# Install pre-commit hooks
bash ./.githooks/install
```

This will automatically:
- Format Python code with `black`
- Lint Python code with `ruff`
- Format C++/HIP code with `clang-format`
- Check for common issues
To manually run linters:

```bash
# Python formatting
black aiter/ op_tests/

# Python linting
ruff check aiter/ op_tests/

# C++/HIP formatting
find csrc/ -name "*.cu" -o -name "*.h" -o -name "*.cpp" | xargs clang-format -i
```

Python Code:
- Follow PEP 8
- Use type hints for function signatures
- Maximum line length: 88 characters (black default)
- Use descriptive variable names
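As a style illustration of these guidelines (type hints, lines under 88 characters, descriptive names), here is a plain-Python RMSNorm reference; it is an example only, not AITER's implementation:

```python
import math


def rms_norm_reference(
    values: list[float], weights: list[float], eps: float = 1e-6
) -> list[float]:
    """Reference RMSNorm: x * w / sqrt(mean(x^2) + eps)."""
    mean_square = sum(v * v for v in values) / len(values)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(values, weights)]


print(rms_norm_reference([3.0, 4.0], [1.0, 1.0]))  # approx [0.8485, 1.1314]
```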
C++/HIP Code:
- Follow the Google C++ Style Guide
- Use `snake_case` for functions and variables
- Use `PascalCase` for classes
- Comment complex kernel logic and optimizations
- Always document performance-critical sections
Kernel Development:
- Optimize for memory bandwidth (most operators are memory-bound)
- Use vectorized loads/stores when possible (vec4, vec8, vec16)
- Minimize global memory accesses
- Document arithmetic intensity and roofline analysis
- Include performance benchmarks in PR description
- Minimize external dependencies - avoid adding new third-party libraries
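The memory-bound claim above follows from arithmetic intensity: an operator is memory-bound when its FLOPs/Byte falls below the hardware ridge point. A back-of-the-envelope sketch (the 3 FLOPs/element count for RMSNorm is a rough assumption, not a measured figure):

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs per byte of global memory traffic."""
    return flops / bytes_moved


def is_memory_bound(ai: float, peak_flops: float, peak_bw: float) -> bool:
    """An operator is memory-bound if its AI is below the ridge point."""
    ridge_point = peak_flops / peak_bw
    return ai < ridge_point


# RMSNorm on an m x n fp16 tensor: ~3 FLOPs/element (assumed),
# reading the input and writing the output at 2 bytes per element.
m, n = 1024, 4096
flops = 3 * m * n
bytes_moved = 2 * 2 * m * n
ai = arithmetic_intensity(flops, bytes_moved)  # 0.75 FLOPs/Byte
print(is_memory_bound(ai, peak_flops=380e12, peak_bw=3.2e12))  # True
```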
AITER tests are standalone Python scripts in the `op_tests/` directory:

```bash
# Run all tests using the CI script
bash .github/scripts/aiter_test.sh

# Run a specific test file directly
python op_tests/test_rmsnorm2d.py

# Run with specific parameters
python op_tests/test_rmsnorm2d.py --dtype bf16 --m 1024 --n 4096

# Run Triton-specific tests
python op_tests/triton_tests/normalization/test_rmsnorm.py

# Run multi-GPU tests
MULTIGPU=TRUE bash .github/scripts/aiter_test.sh
```

When adding new features or fixing bugs, include tests. A typical standalone test script looks like this:
```python
# op_tests/test_new_operator.py
import argparse

import torch

from aiter import dtypes
from aiter.ops import new_operator
from aiter.test_common import checkAllclose, perftest


@perftest()
def run_reference(input, param):
    """Reference implementation."""
    # Your reference implementation
    return expected_output


@perftest()
def run_aiter(input, param):
    """AITER optimized implementation."""
    return new_operator(input, param)


def test_new_operator(dtype, m, n):
    """Test operator correctness and performance."""
    input_tensor = torch.randn(m, n, dtype=dtype, device="cuda")
    param = 1.0  # Example operator parameter; replace as needed

    # Run both implementations
    expected = run_reference(input_tensor, param)
    output = run_aiter(input_tensor, param)

    # Check correctness
    checkAllclose(output, expected, rtol=1e-3, atol=1e-3)
    print(f"✓ Test passed for dtype={dtype}, m={m}, n={n}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-d", "--dtype", type=str, default=None,
                        help="Data type (e.g., bf16, fp16)")
    parser.add_argument("-m", type=int, default=None, help="M dimension")
    parser.add_argument("-n", type=int, default=None, help="N dimension")
    args = parser.parse_args()

    # Define test configurations
    l_dtype = [torch.float16, torch.bfloat16]
    l_m = [1024, 2048, 4096]
    l_n = [4096, 8192]
    if args.dtype:
        l_dtype = [dtypes.d_dtypes[args.dtype]]
    if args.m:
        l_m = [args.m]
    if args.n:
        l_n = [args.n]

    # Run tests
    for dtype in l_dtype:
        for m in l_m:
            for n in l_n:
                test_new_operator(dtype, m, n)
```

Key Features:
- Use the `@perftest()` decorator for performance timing
- Use `checkAllclose()` for numerical comparison
- Support command-line arguments for flexible testing
- Print clear pass/fail messages
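The comparison rule behind `checkAllclose` is the same shape as `torch.allclose`. A simplified pure-Python sketch of that rule (not the actual `aiter.test_common` implementation):

```python
def all_close(actual, expected, rtol=1e-3, atol=1e-3):
    """Elementwise |a - e| <= atol + rtol * |e|, as in torch.allclose."""
    return all(abs(a - e) <= atol + rtol * abs(e)
               for a, e in zip(actual, expected))


print(all_close([1.0005, 2.0], [1.0, 2.0]))  # True: 0.0005 <= 0.002
print(all_close([1.1, 2.0], [1.0, 2.0]))     # False: 0.1 > 0.002
```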
If you don't have access to specific AMD GPU models, mention this in your PR. Our CI system will run tests on:
- MI300X (gfx942)
- MI350X (gfx950)
When developing new operators:
- Minimize External Dependencies: Avoid introducing new third-party libraries whenever possible
- Prefer using existing dependencies: PyTorch, ROCm/HIP, Composable Kernel (CK), Triton
- If a new dependency is absolutely necessary, discuss it in the PR and provide strong justification
- Consider implementing functionality from scratch if the dependency is small or simple
- Avoid dependencies that are not well-maintained or AMD GPU-specific
When developing or modifying HIP kernels:

1. Use the JIT Compilation System:

```python
from aiter.jit import compile_ops

@compile_ops(
    srcs=["path/to/kernel.cu"],
    extra_hip_flags=["-O3", "-DCK_TILE_FMHA_FWD_FAST_EXP2=1"],
)
def my_operator(input: torch.Tensor) -> torch.Tensor:
    return torch_ops.my_operator_kernel(input)
```

2. Profile Memory Access Patterns:

```bash
# Use AITER_LOG_MORE=1 to analyze kernel performance
AITER_LOG_MORE=1 python3 op_tests/test_gemm_a8w8.py
```

3. Check the Roofline Model:
   - Document arithmetic intensity (FLOPs/Byte)
   - Compare against hardware peak (MI300X: ~380 TFLOPS FP16, 3.2 TB/s)
   - Explain whether the kernel is compute-bound or memory-bound

4. Optimize for Coalesced Access:

```cpp
// Good: Coalesced access
vec8_t<scalar_t> data = vectorized_ptr[blockIdx.x * vec_size + threadIdx.x];

// Bad: Strided access
scalar_t data = ptr[threadIdx.x * stride];  // Avoid when stride is large
```
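The roofline comparison above can be computed directly. A small illustrative calculation using the MI300X figures quoted in this guide (~380 TFLOPS FP16, 3.2 TB/s); this is a sketch, not AITER code:

```python
def attainable_tflops(ai: float, peak_tflops: float, peak_bw_tbs: float) -> float:
    """Roofline model: performance is capped by min(compute peak, AI * bandwidth)."""
    return min(peak_tflops, ai * peak_bw_tbs)


# Ridge point: AI below this value means the kernel is memory-bound
ridge = 380 / 3.2  # ~118.75 FLOPs/Byte on MI300X

print(attainable_tflops(0.83, 380, 3.2))    # ~2.66 TFLOPS: memory-bound
print(attainable_tflops(200.0, 380, 3.2))   # 380 TFLOPS: compute-bound
```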
When integrating CK tiles:

1. Generate CK Instances:

```bash
cd 3rdparty/composable_kernel/example/ck_tile/01_fmha
python generate.py -d fwd --receipt 200 --filter "*bf16*" --output_dir /tmp/ck_gen
```

2. Add to the JIT Pipeline:

```python
blob_gen_cmd = [
    f"{CK_DIR}/example/ck_tile/01_fmha/generate.py -d fwd --filter {filter_pattern} --output_dir {{}}"
]
```

3. Register with PyTorch:

```cpp
TORCH_LIBRARY(aiter, m) {
    m.def("my_ck_op(Tensor input) -> Tensor");
}
```
For Triton kernels:

1. Use Appropriate Block Sizes:

```python
@triton.jit
def kernel(input_ptr, output_ptr, BLOCK_SIZE: tl.constexpr):
    # Use BLOCK_SIZE between 128-1024 for AMD GPUs
    pass
```

2. Enable Software Pipelining:

```python
for blk_idx in tl.range(0, n_blocks, num_stages=2):  # Enable pipelining
    data = tl.load(ptr + blk_idx * BLOCK_SIZE)
```

3. Use Cache Modifiers:

```python
# For data that will be reused
x = tl.load(input_ptr, cache_modifier=".cg")  # Cache globally
```
Always benchmark your changes:
```bash
# Benchmark single operator
python op_tests/op_benchmarks/triton/bench_rmsnorm.py
```
For kernel PRs, include in description:
- Hardware: GPU model (e.g., MI300X)
- Baseline: Performance before changes
- Optimized: Performance after changes
- Improvement: Percentage gain
- Bandwidth Utilization: % of peak memory bandwidth
- Roofline Analysis: Where operator sits on roofline model
Example:

```markdown
## Performance Results (MI300X)

| Config | Baseline | Optimized | Improvement |
|--------|----------|-----------|-------------|
| FP16, BS=1024, HS=4096 | 180 μs | 150 μs | 16.7% |
| BF16, BS=2048, HS=8192 | 720 μs | 600 μs | 16.7% |

Bandwidth Utilization: 78% of peak (2.5 TB/s / 3.2 TB/s)
Arithmetic Intensity: 0.83 FLOPs/Byte (memory-bound as expected)
```
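The two headline numbers in such a report are simple ratios; this sketch reproduces the figures from the example above:

```python
def improvement_pct(baseline_us: float, optimized_us: float) -> float:
    """Percentage reduction in kernel time relative to baseline."""
    return (baseline_us - optimized_us) / baseline_us * 100.0


def bandwidth_utilization(achieved_tbs: float, peak_tbs: float) -> float:
    """Achieved memory bandwidth as a percentage of hardware peak."""
    return achieved_tbs / peak_tbs * 100.0


print(round(improvement_pct(180, 150), 1))        # 16.7, as in the table above
print(round(bandwidth_utilization(2.5, 3.2), 1))  # 78.1 (% of MI300X peak)
```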
AITER uses deepwiki for documentation:
https://deepwiki.com/ROCm/aiter
Ensure your PR:
- Passes all pre-commit hooks
- Includes relevant tests
- Updates documentation if needed
- Includes performance benchmarks (for kernel changes)
- Has a clear, descriptive title
Use one of these prefixes:

- `[Bugfix]` - Bug fixes
- `[Feature]` - New features or operators
- `[Kernel]` - Kernel optimizations or new kernels
- `[HIP]` - HIP-specific changes
- `[CK]` - Composable Kernel integration
- `[Triton]` - Triton kernel changes
- `[JIT]` - JIT compilation system changes
- `[Perf]` - Performance optimizations
- `[Doc]` - Documentation improvements
- `[Test]` - Test additions or fixes
- `[CI]` - CI/CD improvements
- `[Hardware]` - Hardware-specific changes (e.g., `[Hardware][MI300X]`)
- `[Misc]` - Miscellaneous changes

Examples:

- `[Kernel][Perf] Optimize RMSNorm for MI300X using vec16 loads`
- `[Feature] Add PagedAttention operator with CK backend`
- `[Bugfix] Fix numerical instability in FP16 softmax`
PR Description Template:

```markdown
## Summary
Brief description of changes

## Motivation
Why is this change needed?

## Changes
- Detailed list of changes
- Impact on existing code

## Performance (if applicable)
| Configuration | Before | After | Improvement |
|---------------|--------|-------|-------------|
| ... | ... | ... | ... |

## Testing
- [ ] Unit tests added/updated
- [ ] Performance benchmarks run
- [ ] Tested on MI250X
- [ ] Tested on MI300X

## Documentation
- [ ] Docstrings updated
- [ ] User guide updated
- [ ] Performance guide updated

## Dependencies
- [ ] No new third-party dependencies added
- [ ] If new dependencies added, justification provided

## Breaking Changes
List any breaking changes and migration guide
```

- Initial Review: A maintainer will review within 3-5 business days
- Feedback: Address comments and push updates
- Approval: After approval, CI will run full test suite
- Merge: Once CI passes, a maintainer will merge
If your PR is urgent or hasn't been reviewed, ping maintainers on the issue or PR.
When adding support for a new model:
- Identify required operators
- Check if operators exist in AITER
- Implement missing operators
- Add model configuration
- Add accuracy tests
- Benchmark performance
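Step 2 above (checking operator coverage) can be sketched as a small helper; `missing_operators` and the stand-in namespace below are hypothetical, not AITER APIs (a real check would inspect `aiter.ops`):

```python
from types import SimpleNamespace


def missing_operators(ops_module, required_ops):
    """Return the required operator names not exposed by the given module."""
    return [name for name in required_ops if not hasattr(ops_module, name)]


# Stand-in for an operator namespace, used here only for illustration
fake_ops = SimpleNamespace(rmsnorm2d=object(), gemm_a8w8=object())
print(missing_operators(fake_ops, ["rmsnorm2d", "gemm_a8w8", "paged_attention"]))
# ['paged_attention']
```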
For kernel optimizations:
- Profile baseline performance with `rocprof`
- Identify bottleneck (memory-bound vs compute-bound)
- Apply optimizations:
- Increase vectorization width
- Improve memory coalescing
- Use shared memory effectively
- Enable software pipelining
- Verify correctness with tests
- Document performance improvement
For new AMD GPU architectures:
- Check architecture features (cache sizes, VGPR count, LDS size)
- Tune kernel parameters for new hardware
- Update `get_gfx()` detection
- Add hardware-specific code paths if needed
- Update CI to test on new hardware
- Issues: GitHub Issues
- Discussions: GitHub Discussions
We follow the Contributor Covenant Code of Conduct. Please be respectful and constructive in all interactions.
By contributing to AITER, you certify that your contribution was created in whole or in part by you and that you have the right to submit it under the MIT License, as specified in this project's LICENSE file. Sign off each commit to certify this:

```bash
git commit -s -m "Your commit message"
```

This adds a Signed-off-by line to your commit message.

Tip: Enable automatic sign-off in your Git config:

```bash
git config --global format.signoff true
```

AITER uses a custom JIT system for compiling kernels. When modifying it:
- Adding New Modules: Update `optCompilerConfig.json`
- Dynamic Configuration: Use `gen_func` in the `@compile_ops` decorator
- Blob Generation: Add code generation commands to `blob_gen_cmd`
Q: My kernel compiles but is slower than expected. What should I check?
A:
- Check memory access patterns (use `rocprof`)
- Verify coalescing (check `MemUnitBusy`)
- Measure bandwidth utilization
- Compare against the roofline model
Q: How do I test on hardware I don't have access to?
A: Submit your PR and mention hardware limitations. Our CI will test on MI250X and MI300X.
Q: When should I use HIP vs CK vs Triton?
A:
- HIP: Maximum control, hardware-specific optimizations
- CK: High-performance tile-based GEMM-like operations
- Triton: Rapid prototyping, easier to write
Q: My PR conflicts with main branch. How do I resolve?
A:
```bash
git fetch upstream
git rebase upstream/main
# Resolve conflicts
```

For more details, refer to the JIT compilation system documentation and comments in the relevant source files.

Q: Can I add a new third-party library dependency?
A: We strongly prefer to avoid new dependencies. If you believe a new library is necessary:
- Check if the functionality can be implemented using existing dependencies (PyTorch, HIP, CK, Triton)
- Consider implementing the feature from scratch if it's relatively simple
- If the dependency is essential, provide strong justification in your PR:
- Why existing solutions don't work
- Performance benefits or features it provides
- Library maintenance status and AMD GPU support
- Impact on build time and binary size
Thank you for contributing to AITER! Your contributions help make AITER the best inference runtime for AMD GPUs. Whether you're optimizing a single kernel or adding a major feature, we appreciate your effort and dedication to the project.
Happy coding! 🚀