Thank you for your interest in contributing to AITER! We are building a high-performance inference runtime optimized for AMD GPUs with ROCm. Our community welcomes contributions of all kinds, whether you're fixing bugs, optimizing kernels, adding new operators, or improving documentation.
There are several ways you can contribute to AITER:
- Report Issues: Identify and report bugs, performance issues, or unexpected behavior
- Add Operators: Request or implement new operators for LLM inference
- Optimize Kernels: Improve existing HIP/CK/Triton kernel performance
- Hardware Support: Extend support for new AMD GPU architectures (MI200, MI300, etc.)
- Documentation: Improve docs, add tutorials, or write performance guides
- Code Review: Review pull requests and provide constructive feedback
- Community Support: Answer questions and help other users
We also encourage you to share your experiences with AITER in blog posts, social media, or conference talks. If AITER helps your project, please consider starring our repository!
Not sure where to start? Check out these tasks:
- Good First Issues: Simple bugs or small enhancements
- Help Wanted: Features or optimizations that need community help
- Kernel Optimization: Performance improvement opportunities for existing operators
- New Operator Requests: Missing operators needed for new models
Before contributing, ensure you have:
- AMD GPU: MI200, MI300 series, or compatible ROCm hardware
- ROCm: Version 5.7+ installed and configured
- Python: 3.9, 3.10, 3.11, or 3.12
- PyTorch: ROCm-enabled PyTorch 2.0+
- Git: For version control
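The Python version requirement above can be sanity-checked programmatically. This is a small illustrative helper, not part of AITER:

```python
import sys

# Supported Python minor versions listed in the prerequisites
SUPPORTED = {(3, 9), (3, 10), (3, 11), (3, 12)}


def python_supported(version_info=sys.version_info) -> bool:
    """Return True if the given Python version matches a supported minor version."""
    return (version_info[0], version_info[1]) in SUPPORTED


print(python_supported((3, 11, 4)))  # True
print(python_supported((3, 8, 10)))  # False
```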
Fork the AITER repository to your GitHub account, then clone it:
```bash
git clone https://github.com/<your_username>/aiter.git
cd aiter
git remote add upstream https://github.com/ROCm/aiter.git  # Add upstream remote
```

We recommend using a virtual environment:

```bash
# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Upgrade pip
pip install --upgrade pip
```

For Python-only development:

```bash
# Install with precompiled kernels
PREBUILD_KERNELS=1 GPU_ARCHS="gfx942;gfx950" python setup.py develop
```

For full development (Python + HIP/CK kernels):

```bash
# Build all kernels from source
pip install -e .
```
---
## Code Quality
### Pre-commit Hooks
AITER uses `pre-commit` to maintain code quality. Install it before making changes:
```bash
pip install black==25.1.0 ruff==0.11.11
apt install clang-format # For C++/HIP code formatting
# Install pre-commit hooks
bash ./.githooks/install
```

This will automatically:
- Format Python code with `black`
- Lint Python code with `ruff`
- Format C++/HIP code with `clang-format`
- Check for common issues
To manually run linters:

```bash
# Python formatting
black aiter/ op_tests/

# Python linting
ruff check aiter/ op_tests/

# C++/HIP formatting
find csrc/ -name "*.cu" -o -name "*.h" -o -name "*.cpp" | xargs clang-format -i
```

Python Code:
- Follow PEP 8
- Use type hints for function signatures
- Maximum line length: 88 characters (black default)
- Use descriptive variable names
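As a style illustration of these guidelines (type hints, lines under 88 characters, descriptive names), here is a plain-Python RMSNorm reference; it is an example only, not AITER's implementation:

```python
import math


def rms_norm_reference(
    values: list[float], weights: list[float], eps: float = 1e-6
) -> list[float]:
    """Reference RMSNorm: x * w / sqrt(mean(x^2) + eps)."""
    mean_square = sum(v * v for v in values) / len(values)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(values, weights)]


print(rms_norm_reference([3.0, 4.0], [1.0, 1.0]))  # approx [0.8485, 1.1314]
```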
C++/HIP Code:
- Follow the Google C++ Style Guide
- Use `snake_case` for functions and variables
- Use `PascalCase` for classes
- Comment complex kernel logic and optimizations
- Always document performance-critical sections
Kernel Development:
- Optimize for memory bandwidth (most operators are memory-bound)
- Use vectorized loads/stores when possible (vec4, vec8, vec16)
- Minimize global memory accesses
- Document arithmetic intensity and roofline analysis
- Include performance benchmarks in PR description
- Minimize external dependencies - avoid adding new third-party libraries
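The memory-bound claim above follows from arithmetic intensity: an operator is memory-bound when its FLOPs/Byte falls below the hardware ridge point. A back-of-the-envelope sketch (the 3 FLOPs/element count for RMSNorm is a rough assumption, not a measured figure):

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs per byte of global memory traffic."""
    return flops / bytes_moved


def is_memory_bound(ai: float, peak_flops: float, peak_bw: float) -> bool:
    """An operator is memory-bound if its AI is below the ridge point."""
    ridge_point = peak_flops / peak_bw
    return ai < ridge_point


# RMSNorm on an m x n fp16 tensor: ~3 FLOPs/element (assumed),
# reading the input and writing the output at 2 bytes per element.
m, n = 1024, 4096
flops = 3 * m * n
bytes_moved = 2 * 2 * m * n
ai = arithmetic_intensity(flops, bytes_moved)  # 0.75 FLOPs/Byte
print(is_memory_bound(ai, peak_flops=380e12, peak_bw=3.2e12))  # True
```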
AITER tests are standalone Python scripts in the `op_tests/` directory:

```bash
# Run all tests using the CI script
bash .github/scripts/aiter_test.sh

# Run a specific test file directly
python op_tests/test_rmsnorm2d.py

# Run with specific parameters
python op_tests/test_rmsnorm2d.py --dtype bf16 --m 1024 --n 4096

# Run Triton-specific tests
python op_tests/triton_tests/normalization/test_rmsnorm.py

# Run multi-GPU tests
MULTIGPU=TRUE bash .github/scripts/aiter_test.sh
```

When adding new features or fixing bugs, include tests. A typical standalone test script looks like this:
```python
# op_tests/test_new_operator.py
import argparse

import torch

from aiter import dtypes
from aiter.ops import new_operator
from aiter.test_common import checkAllclose, perftest


@perftest()
def run_reference(input, param):
    """Reference implementation."""
    # Your reference implementation
    return expected_output


@perftest()
def run_aiter(input, param):
    """AITER optimized implementation."""
    return new_operator(input, param)


def test_new_operator(dtype, m, n):
    """Test operator correctness and performance."""
    input_tensor = torch.randn(m, n, dtype=dtype, device="cuda")
    param = 1.0  # Example operator parameter; replace as needed

    # Run both implementations
    expected = run_reference(input_tensor, param)
    output = run_aiter(input_tensor, param)

    # Check correctness
    checkAllclose(output, expected, rtol=1e-3, atol=1e-3)
    print(f"✓ Test passed for dtype={dtype}, m={m}, n={n}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-d", "--dtype", type=str, default=None,
                        help="Data type (e.g., bf16, fp16)")
    parser.add_argument("-m", type=int, default=None, help="M dimension")
    parser.add_argument("-n", type=int, default=None, help="N dimension")
    args = parser.parse_args()

    # Define test configurations
    l_dtype = [torch.float16, torch.bfloat16]
    l_m = [1024, 2048, 4096]
    l_n = [4096, 8192]
    if args.dtype:
        l_dtype = [dtypes.d_dtypes[args.dtype]]
    if args.m:
        l_m = [args.m]
    if args.n:
        l_n = [args.n]

    # Run tests
    for dtype in l_dtype:
        for m in l_m:
            for n in l_n:
                test_new_operator(dtype, m, n)
```

Key Features:
- Use the `@perftest()` decorator for performance timing
- Use `checkAllclose()` for numerical comparison
- Support command-line arguments for flexible testing
- Print clear pass/fail messages
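The comparison rule behind `checkAllclose` is the same shape as `torch.allclose`. A simplified pure-Python sketch of that rule (not the actual `aiter.test_common` implementation):

```python
def all_close(actual, expected, rtol=1e-3, atol=1e-3):
    """Elementwise |a - e| <= atol + rtol * |e|, as in torch.allclose."""
    return all(abs(a - e) <= atol + rtol * abs(e)
               for a, e in zip(actual, expected))


print(all_close([1.0005, 2.0], [1.0, 2.0]))  # True: 0.0005 <= 0.002
print(all_close([1.1, 2.0], [1.0, 2.0]))     # False: 0.1 > 0.002
```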
If you don't have access to specific AMD GPU models, mention this in your PR. Our CI system will run tests on:
- MI300X (gfx942)
- MI350X (gfx950)
When developing new operators:
- Minimize External Dependencies: Avoid introducing new third-party libraries whenever possible
- Prefer using existing dependencies: PyTorch, ROCm/HIP, Composable Kernel (CK), Triton
- If a new dependency is absolutely necessary, discuss it in the PR and provide strong justification
- Consider implementing functionality from scratch if the dependency is small or simple
- Avoid dependencies that are not well-maintained or AMD GPU-specific
When developing or modifying HIP kernels:

1. Use the JIT Compilation System:

```python
from aiter.jit import compile_ops

@compile_ops(
    srcs=["path/to/kernel.cu"],
    extra_hip_flags=["-O3", "-DCK_TILE_FMHA_FWD_FAST_EXP2=1"],
)
def my_operator(input: torch.Tensor) -> torch.Tensor:
    return torch_ops.my_operator_kernel(input)
```

2. Profile Memory Access Patterns:

```bash
# Use AITER_LOG_MORE=1 to analyze kernel performance
AITER_LOG_MORE=1 python3 op_tests/test_gemm_a8w8.py
```

3. Check the Roofline Model:
   - Document arithmetic intensity (FLOPs/Byte)
   - Compare against hardware peak (MI300X: ~380 TFLOPS FP16, 3.2 TB/s)
   - Explain whether the kernel is compute-bound or memory-bound

4. Optimize for Coalesced Access:

```cpp
// Good: Coalesced access
vec8_t<scalar_t> data = vectorized_ptr[blockIdx.x * vec_size + threadIdx.x];

// Bad: Strided access
scalar_t data = ptr[threadIdx.x * stride];  // Avoid when stride is large
```
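The roofline comparison above can be computed directly. A small illustrative calculation using the MI300X figures quoted in this guide (~380 TFLOPS FP16, 3.2 TB/s); this is a sketch, not AITER code:

```python
def attainable_tflops(ai: float, peak_tflops: float, peak_bw_tbs: float) -> float:
    """Roofline model: performance is capped by min(compute peak, AI * bandwidth)."""
    return min(peak_tflops, ai * peak_bw_tbs)


# Ridge point: AI below this value means the kernel is memory-bound
ridge = 380 / 3.2  # ~118.75 FLOPs/Byte on MI300X

print(attainable_tflops(0.83, 380, 3.2))    # ~2.66 TFLOPS: memory-bound
print(attainable_tflops(200.0, 380, 3.2))   # 380 TFLOPS: compute-bound
```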
When integrating CK tiles:

1. Generate CK Instances:

```bash
cd 3rdparty/composable_kernel/example/ck_tile/01_fmha
python generate.py -d fwd --receipt 200 --filter "*bf16*" --output_dir /tmp/ck_gen
```

2. Add to the JIT Pipeline:

```python
blob_gen_cmd = [
    f"{CK_DIR}/example/ck_tile/01_fmha/generate.py -d fwd --filter {filter_pattern} --output_dir {{}}"
]
```

3. Register with PyTorch:

```cpp
TORCH_LIBRARY(aiter, m) {
    m.def("my_ck_op(Tensor input) -> Tensor");
}
```
For Triton kernels:

1. Use Appropriate Block Sizes:

```python
@triton.jit
def kernel(input_ptr, output_ptr, BLOCK_SIZE: tl.constexpr):
    # Use BLOCK_SIZE between 128-1024 for AMD GPUs
    pass
```

2. Enable Software Pipelining:

```python
for blk_idx in tl.range(0, n_blocks, num_stages=2):  # Enable pipelining
    data = tl.load(ptr + blk_idx * BLOCK_SIZE)
```

3. Use Cache Modifiers:

```python
# For data that will be reused
x = tl.load(input_ptr, cache_modifier=".cg")  # Cache globally
```
Always benchmark your changes:
```bash
# Benchmark single operator
python op_tests/op_benchmarks/triton/bench_rmsnorm.py
```
For kernel PRs, include in description:
- Hardware: GPU model (e.g., MI300X)
- Baseline: Performance before changes
- Optimized: Performance after changes
- Improvement: Percentage gain
- Bandwidth Utilization: % of peak memory bandwidth
- Roofline Analysis: Where operator sits on roofline model
Example:

```markdown
## Performance Results (MI300X)

| Config | Baseline | Optimized | Improvement |
|--------|----------|-----------|-------------|
| FP16, BS=1024, HS=4096 | 180 μs | 150 μs | 16.7% |
| BF16, BS=2048, HS=8192 | 720 μs | 600 μs | 16.7% |

Bandwidth Utilization: 78% of peak (2.5 TB/s / 3.2 TB/s)
Arithmetic Intensity: 0.83 FLOPs/Byte (memory-bound as expected)
```
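The two headline numbers in such a report are simple ratios; this sketch reproduces the figures from the example above:

```python
def improvement_pct(baseline_us: float, optimized_us: float) -> float:
    """Percentage reduction in kernel time relative to baseline."""
    return (baseline_us - optimized_us) / baseline_us * 100.0


def bandwidth_utilization(achieved_tbs: float, peak_tbs: float) -> float:
    """Achieved memory bandwidth as a percentage of hardware peak."""
    return achieved_tbs / peak_tbs * 100.0


print(round(improvement_pct(180, 150), 1))        # 16.7, as in the table above
print(round(bandwidth_utilization(2.5, 3.2), 1))  # 78.1 (% of MI300X peak)
```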
AITER uses deepwiki for documentation:
https://deepwiki.com/ROCm/aiter
Ensure your PR:
- Passes all pre-commit hooks
- Includes relevant tests
- Updates documentation if needed
- Includes performance benchmarks (for kernel changes)
- Has a clear, descriptive title
Use one of these prefixes:

- `[Bugfix]` - Bug fixes
- `[Feature]` - New features or operators
- `[Kernel]` - Kernel optimizations or new kernels
- `[HIP]` - HIP-specific changes
- `[CK]` - Composable Kernel integration
- `[Triton]` - Triton kernel changes
- `[JIT]` - JIT compilation system changes
- `[Perf]` - Performance optimizations
- `[Doc]` - Documentation improvements
- `[Test]` - Test additions or fixes
- `[CI]` - CI/CD improvements
- `[Hardware]` - Hardware-specific changes (e.g., `[Hardware][MI300X]`)
- `[Misc]` - Miscellaneous changes

Examples:

- `[Kernel][Perf] Optimize RMSNorm for MI300X using vec16 loads`
- `[Feature] Add PagedAttention operator with CK backend`
- `[Bugfix] Fix numerical instability in FP16 softmax`
PR Description Template:

```markdown
## Summary
Brief description of changes

## Motivation
Why is this change needed?

## Changes
- Detailed list of changes
- Impact on existing code

## Performance (if applicable)
| Configuration | Before | After | Improvement |
|---------------|--------|-------|-------------|
| ... | ... | ... | ... |

## Testing
- [ ] Unit tests added/updated
- [ ] Performance benchmarks run
- [ ] Tested on MI250X
- [ ] Tested on MI300X

## Documentation
- [ ] Docstrings updated
- [ ] User guide updated
- [ ] Performance guide updated

## Dependencies
- [ ] No new third-party dependencies added
- [ ] If new dependencies added, justification provided

## Breaking Changes
List any breaking changes and migration guide
```

- Initial Review: A maintainer will review within 3-5 business days
- Feedback: Address comments and push updates
- Approval: After approval, CI will run full test suite
- Merge: Once CI passes, a maintainer will merge
If your PR is urgent or hasn't been reviewed, ping maintainers on the issue or PR.
When adding support for a new model:
- Identify required operators
- Check if operators exist in AITER
- Implement missing operators
- Add model configuration
- Add accuracy tests
- Benchmark performance
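Step 2 above (checking operator coverage) can be sketched as a small helper; `missing_operators` and the stand-in namespace below are hypothetical, not AITER APIs (a real check would inspect `aiter.ops`):

```python
from types import SimpleNamespace


def missing_operators(ops_module, required_ops):
    """Return the required operator names not exposed by the given module."""
    return [name for name in required_ops if not hasattr(ops_module, name)]


# Stand-in for an operator namespace, used here only for illustration
fake_ops = SimpleNamespace(rmsnorm2d=object(), gemm_a8w8=object())
print(missing_operators(fake_ops, ["rmsnorm2d", "gemm_a8w8", "paged_attention"]))
# ['paged_attention']
```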
For kernel optimizations:
- Profile baseline performance with `rocprof`
- Identify bottleneck (memory-bound vs compute-bound)
- Apply optimizations:
- Increase vectorization width
- Improve memory coalescing
- Use shared memory effectively
- Enable software pipelining
- Verify correctness with tests
- Document performance improvement
For new AMD GPU architectures:
- Check architecture features (cache sizes, VGPR count, LDS size)
- Tune kernel parameters for new hardware
- Update `get_gfx()` detection
- Add hardware-specific code paths if needed
- Update CI to test on new hardware
- Issues: GitHub Issues
- Discussions: GitHub Discussions
We follow the Contributor Covenant Code of Conduct. Please be respectful and constructive in all interactions.
By contributing to AITER, you certify that your contribution was created in whole or in part by you and that you have the right to submit it under the MIT License, as specified in this project's LICENSE file. Sign off each commit to certify this:

```bash
git commit -s -m "Your commit message"
```

This adds a Signed-off-by line to your commit message.

Tip: Enable automatic sign-off in your Git config:

```bash
git config --global format.signoff true
```

AITER uses a custom JIT system for compiling kernels. When modifying it:
- Adding New Modules: Update `optCompilerConfig.json`
- Dynamic Configuration: Use `gen_func` in the `@compile_ops` decorator
- Blob Generation: Add code generation commands to `blob_gen_cmd`
Q: My kernel compiles but is slower than expected. What should I check?
A:
- Check memory access patterns (use `rocprof`)
- Verify coalescing (check `MemUnitBusy`)
- Measure bandwidth utilization
- Compare against the roofline model
Q: How do I test on hardware I don't have access to?
A: Submit your PR and mention hardware limitations. Our CI will test on MI250X and MI300X.
Q: When should I use HIP vs CK vs Triton?
A:
- HIP: Maximum control, hardware-specific optimizations
- CK: High-performance tile-based GEMM-like operations
- Triton: Rapid prototyping, easier to write
Q: My PR conflicts with main branch. How do I resolve?
A:
```bash
git fetch upstream
git rebase upstream/main
# Resolve conflicts
```

For more details, refer to the JIT compilation system documentation and comments in the relevant source files.

Q: Can I add a new third-party library dependency?
A: We strongly prefer to avoid new dependencies. If you believe a new library is necessary:
- Check if the functionality can be implemented using existing dependencies (PyTorch, HIP, CK, Triton)
- Consider implementing the feature from scratch if it's relatively simple
- If the dependency is essential, provide strong justification in your PR:
- Why existing solutions don't work
- Performance benefits or features it provides
- Library maintenance status and AMD GPU support
- Impact on build time and binary size
Thank you for contributing to AITER! Your contributions help make AITER the best inference runtime for AMD GPUs. Whether you're optimizing a single kernel or adding a major feature, we appreciate your effort and dedication to the project.
Happy coding! 🚀