Quickstart
This guide will get you started with AITER in 5 minutes.
Installation
# Install from source
git clone --recursive https://github.com/ROCm/aiter.git
cd aiter
python3 setup.py develop
Verify Installation
import aiter
import torch
# Verify AITER is working
print(f"PyTorch version: {torch.__version__}")
# On ROCm builds of PyTorch, torch.cuda maps to the AMD GPU backend
print(f"ROCm available: {torch.cuda.is_available()}")
# Try importing a key function
from aiter import flash_attn_func
print("AITER loaded successfully!")
First Example: Flash Attention
Here’s a simple example using AITER’s optimized attention kernel:
import torch
import aiter
# Input tensors (batch_size=2, seq_len=1024, num_heads=16, head_dim=64)
batch_size, seq_len, num_heads, head_dim = 2, 1024, 16, 64
query = torch.randn(batch_size, seq_len, num_heads, head_dim,
                    device='cuda', dtype=torch.float16)
key = torch.randn(batch_size, seq_len, num_heads, head_dim,
                  device='cuda', dtype=torch.float16)
value = torch.randn(batch_size, seq_len, num_heads, head_dim,
                    device='cuda', dtype=torch.float16)
# Run optimized flash attention
output = aiter.flash_attn_func(query, key, value, causal=True)
print(f"Output shape: {output.shape}")
# Output shape: torch.Size([2, 1024, 16, 64])
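To make the kernel's semantics concrete, here is a plain NumPy sketch (CPU-only, for illustration) of what causal attention computes: softmax(QKᵀ/√d)·V with a mask so position i only attends to positions ≤ i. The helper name `causal_attention_ref` is ours, not part of AITER; AITER's kernel computes the same result far faster on GPU.

```python
import numpy as np

def causal_attention_ref(q, k, v):
    """Naive causal attention: softmax(q @ k^T / sqrt(d)) @ v, per head."""
    b, s, h, d = q.shape
    # Move heads in front of the sequence dim: (b, h, s, d)
    q, k, v = (np.transpose(t, (0, 2, 1, 3)) for t in (q, k, v))
    scores = q @ np.transpose(k, (0, 1, 3, 2)) / np.sqrt(d)
    # Causal mask: positions above the diagonal are disallowed
    mask = np.triu(np.ones((s, s), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    # Numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ v
    return np.transpose(out, (0, 2, 1, 3))  # back to (b, s, h, d)

q = np.random.randn(2, 8, 4, 16).astype(np.float32)
k = np.random.randn(2, 8, 4, 16).astype(np.float32)
v = np.random.randn(2, 8, 4, 16).astype(np.float32)
out = causal_attention_ref(q, k, v)
print(out.shape)  # (2, 8, 4, 16)
```

Because of the causal mask, the output at position 0 is exactly `v` at position 0 — a quick sanity check when comparing against an optimized kernel.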
Variable-Length Sequences
AITER excels at handling variable-length sequences with page tables:
import torch
import aiter
# Queries for two sequences (batch_size=2 to match the page table below)
query = torch.randn(2, 2048, 16, 64, device='cuda', dtype=torch.float16)
# Page table configuration (see tutorials for details)
page_table = torch.tensor([[0, 1, 2], [3, 4, 5]], device='cuda', dtype=torch.int32)
# KV cache in paged format
kv_cache = torch.randn(6, 16, 128, 64, device='cuda', dtype=torch.float16)
# Variable-length attention with page tables
output = aiter.flash_attn_with_kvcache(
    query, kv_cache, page_table,
    block_size=128, causal=True
)
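The core idea behind paged KV caching is the indirection: the cache stores fixed-size physical blocks, and each sequence's page-table row lists which blocks hold its tokens, in logical order. The toy NumPy sketch below (our own layout and helper name `gather_sequence`, not AITER's internal format) shows that gather; the real kernel does this lookup inside the attention computation instead of materializing the sequence.

```python
import numpy as np

# Hypothetical paged layout: (num_blocks, block_size, num_heads, head_dim).
# page_table[b] lists the physical blocks holding sequence b's tokens.
block_size, num_blocks, num_heads, head_dim = 4, 6, 2, 8
kv_cache = np.random.randn(num_blocks, block_size, num_heads, head_dim)
page_table = np.array([[0, 1, 2],
                       [3, 4, 5]])  # two sequences, three blocks each

def gather_sequence(kv_cache, page_table, seq_idx, seq_len):
    """Reassemble one sequence's contiguous KV tensor from its pages."""
    blocks = kv_cache[page_table[seq_idx]]          # (n_blocks, block_size, h, d)
    flat = blocks.reshape(-1, num_heads, head_dim)  # concatenate the blocks
    return flat[:seq_len]                           # trim padding in the last block

seq = gather_sequence(kv_cache, page_table, seq_idx=0, seq_len=10)
print(seq.shape)  # (10, 2, 8)
```

Sequences of different lengths simply use different numbers of blocks, so no padding to a common maximum length is needed.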
Mixture of Experts (MoE)
Efficient grouped GEMM for MoE layers:
import torch
import aiter
# MoE routing - select the top-2 experts for each token
num_tokens = 4096
num_experts = 8
hidden_dim = 512
ffn_dim = 2048
top_k = 2
# Input tokens
x = torch.randn(num_tokens, hidden_dim, device='cuda', dtype=torch.float16)
# Expert weights for all experts
w1 = torch.randn(num_experts, hidden_dim, ffn_dim, device='cuda', dtype=torch.float16)
w2 = torch.randn(num_experts, ffn_dim, hidden_dim, device='cuda', dtype=torch.float16)
# Router logits and expert selection
router_logits = torch.randn(num_tokens, num_experts, device='cuda', dtype=torch.float16)
# Fused MoE operation (gate + up projection + down projection)
output = aiter.fmoe(
    x, w1, w2, router_logits,
    topk=top_k,
    renormalize=True
)
print(f"MoE output shape: {output.shape}") # [4096, 512]
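The routing logic the fused kernel implements can be written out in a few lines of NumPy. This reference (our own helper `moe_ref`; the activation here is a plain ReLU for simplicity, whereas the fused kernel's gating may differ) routes each token to its top-k experts by router probability, renormalizes those k weights to sum to 1, and mixes the expert FFN outputs:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_ref(x, w1, w2, router_logits, topk, renormalize=True):
    """Route each token to its top-k experts and mix their FFN outputs."""
    probs = softmax(router_logits)
    # Top-k expert indices per token (order within the k is irrelevant)
    topk_idx = np.argsort(probs, axis=-1)[:, -topk:]
    topk_w = np.take_along_axis(probs, topk_idx, axis=-1)
    if renormalize:
        topk_w = topk_w / topk_w.sum(axis=-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e_idx, w in zip(topk_idx[t], topk_w[t]):
            h = np.maximum(x[t] @ w1[e_idx], 0.0)  # up projection + ReLU
            out[t] += w * (h @ w2[e_idx])          # weighted down projection
    return out

num_tokens, num_experts, hidden, ffn = 6, 4, 8, 16
x = np.random.randn(num_tokens, hidden).astype(np.float32)
w1 = np.random.randn(num_experts, hidden, ffn).astype(np.float32)
w2 = np.random.randn(num_experts, ffn, hidden).astype(np.float32)
logits = np.random.randn(num_tokens, num_experts).astype(np.float32)
out = moe_ref(x, w1, w2, logits, topk=2)
print(out.shape)  # (6, 8)
```

With `topk=num_experts`, this reduces to a dense softmax mixture over all experts, which is a useful correctness check; the fused kernel's advantage is doing the sparse routing and grouped GEMMs without the Python loop.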
RMSNorm
Optimized normalization for LLM inference:
import torch
import aiter
# Input tensor (batch_size, seq_len, hidden_dim)
x = torch.randn(2, 1024, 4096, device='cuda', dtype=torch.float16)
# Weight for normalization
weight = torch.ones(4096, device='cuda', dtype=torch.float16)
# Fast RMSNorm
output = aiter.rmsnorm(x, weight, eps=1e-6)
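RMSNorm itself is a small formula: divide each vector by the root-mean-square of its elements (plus `eps` for stability), then scale by a learned weight. A NumPy reference (our helper name `rmsnorm_ref`, not AITER's API) makes the math explicit:

```python
import numpy as np

def rmsnorm_ref(x, weight, eps=1e-6):
    """RMSNorm over the last dim: x / sqrt(mean(x^2) + eps) * weight."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

x = np.random.randn(2, 4, 8).astype(np.float32)
weight = np.ones(8, dtype=np.float32)
y = rmsnorm_ref(x, weight)
print(y.shape)  # (2, 4, 8)
```

Unlike LayerNorm, there is no mean subtraction and no bias, which is why a single fused kernel handles it so cheaply. With `weight` set to all ones, every output row has RMS ≈ 1.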
Performance Tips
- Use FP16/BF16: AITER kernels are optimized for half precision
- Enable compilation: set PREBUILD_KERNELS=2 for inference workloads
- Batch when possible: larger batches better utilize the GPU
- Profile first: use the ROCm profiler to identify bottlenecks
# Example: Profile your workload
rocprof --stats python your_script.py
Next Steps
tutorials/attention - Deep dive into attention mechanisms
tutorials/moe - Learn about MoE optimizations
tutorials/variable_length - Handle variable-length sequences
Attention Operations - Full API reference
benchmarks - Performance comparisons
Common Issues
- ImportError: No module named 'aiter'
Make sure ROCm libraries are in your library path:
export LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH
- RuntimeError: No AMD GPU found
Verify GPU is accessible:
rocm-smi
rocminfo | grep gfx
- Compilation errors during first run
JIT compilation may take time on first use. Pre-compile kernels:
PREBUILD_KERNELS=2 GPU_ARCHS="native" python3 setup.py install
Get Help
Documentation: https://doc.aiter.amd.com
GitHub Issues: https://github.com/ROCm/aiter/issues
ROCm Community: https://github.com/ROCm/ROCm/discussions