Quickstart
==========

This guide will get you started with AITER in 5 minutes.

Installation
------------

.. code-block:: bash

   # Install from source
   git clone --recursive https://github.com/ROCm/aiter.git
   cd aiter
   python3 setup.py develop

Verify Installation
-------------------

.. code-block:: python

   import aiter
   import torch

   # Verify AITER is working
   print(f"PyTorch version: {torch.__version__}")
   print(f"ROCm available: {torch.cuda.is_available()}")

   # Try importing a key function
   from aiter import flash_attn_func
   print("AITER loaded successfully!")

First Example: Flash Attention
------------------------------

Here's a simple example using AITER's optimized attention kernel:

.. code-block:: python

   import torch
   import aiter

   # Input tensors (batch_size=2, seq_len=1024, num_heads=16, head_dim=64)
   batch_size, seq_len, num_heads, head_dim = 2, 1024, 16, 64

   query = torch.randn(batch_size, seq_len, num_heads, head_dim,
                       device='cuda', dtype=torch.float16)
   key = torch.randn(batch_size, seq_len, num_heads, head_dim,
                     device='cuda', dtype=torch.float16)
   value = torch.randn(batch_size, seq_len, num_heads, head_dim,
                       device='cuda', dtype=torch.float16)

   # Run optimized flash attention
   output = aiter.flash_attn_func(query, key, value, causal=True)

   print(f"Output shape: {output.shape}")
   # Output shape: torch.Size([2, 1024, 16, 64])

Variable-Length Sequences
-------------------------

AITER excels at handling variable-length sequences with page tables:

.. code-block:: python

   import torch
   import aiter

   # Queries for two sequences in the batch
   query = torch.randn(2, 2048, 16, 64, device='cuda', dtype=torch.float16)

   # Page table: each row maps one sequence to its KV-cache blocks
   # (see tutorials for details)
   page_table = torch.tensor([[0, 1, 2], [3, 4, 5]],
                             device='cuda', dtype=torch.int32)

   # KV cache in paged format (6 blocks, block size 128)
   kv_cache = torch.randn(6, 16, 128, 64, device='cuda', dtype=torch.float16)

   # Variable-length attention with page tables
   output = aiter.flash_attn_with_kvcache(
       query,
       kv_cache,
       page_table,
       block_size=128,
       causal=True
   )

Mixture of Experts (MoE)
------------------------

Efficient grouped GEMM for MoE layers:

.. code-block:: python

   import torch
   import aiter

   # MoE routing - select the top-2 experts for each token
   num_tokens = 4096
   num_experts = 8
   hidden_dim = 512
   ffn_dim = 2048
   top_k = 2

   # Input tokens
   x = torch.randn(num_tokens, hidden_dim, device='cuda', dtype=torch.float16)

   # Expert weights for all experts
   w1 = torch.randn(num_experts, hidden_dim, ffn_dim,
                    device='cuda', dtype=torch.float16)
   w2 = torch.randn(num_experts, ffn_dim, hidden_dim,
                    device='cuda', dtype=torch.float16)

   # Router logits for expert selection
   router_logits = torch.randn(num_tokens, num_experts,
                               device='cuda', dtype=torch.float16)

   # Fused MoE operation (gate + up projection + down projection)
   output = aiter.fmoe(
       x,
       w1,
       w2,
       router_logits,
       topk=top_k,
       renormalize=True
   )

   print(f"MoE output shape: {output.shape}")  # [4096, 512]

RMSNorm
-------

Optimized normalization for LLM inference:

.. code-block:: python

   import torch
   import aiter

   # Input tensor (batch_size, seq_len, hidden_dim)
   x = torch.randn(2, 1024, 4096, device='cuda', dtype=torch.float16)

   # Weight for normalization
   weight = torch.ones(4096, device='cuda', dtype=torch.float16)

   # Fast RMSNorm
   output = aiter.rmsnorm(x, weight, eps=1e-6)
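For a quick sanity check, you can compare the kernel's output against a plain-PyTorch
reference. The sketch below is illustrative and not part of the AITER API: it assumes
``aiter.rmsnorm`` (as called above) implements the standard RMSNorm formula
``y = x / sqrt(mean(x**2) + eps) * weight``.

.. code-block:: python

   import torch
   import aiter

   x = torch.randn(2, 1024, 4096, device='cuda', dtype=torch.float16)
   weight = torch.ones(4096, device='cuda', dtype=torch.float16)
   eps = 1e-6

   # AITER kernel (signature taken from the example above)
   out_aiter = aiter.rmsnorm(x, weight, eps=eps)

   # Plain-PyTorch RMSNorm reference, computed in fp32 for stability
   x32 = x.float()
   out_ref = (x32 * torch.rsqrt(x32.pow(2).mean(-1, keepdim=True) + eps)).to(x.dtype) * weight

   # Half-precision kernels will not match bit-for-bit; use a loose tolerance
   print(torch.allclose(out_aiter, out_ref, atol=1e-2, rtol=1e-2))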
Performance Tips
----------------

1. **Use FP16/BF16**: AITER kernels are optimized for half precision.
2. **Enable compilation**: Set ``PREBUILD_KERNELS=2`` for inference workloads.
3. **Batch when possible**: Larger batches make better use of the GPU.
4. **Profile first**: Use the ROCm profiler to identify bottlenecks.

.. code-block:: bash

   # Example: Profile your workload
   rocprof --stats python your_script.py

Next Steps
----------

* :doc:`tutorials/attention` - Deep dive into attention mechanisms
* :doc:`tutorials/moe` - Learn about MoE optimizations
* :doc:`tutorials/variable_length` - Handle variable-length sequences
* :doc:`api/attention` - Full API reference
* :doc:`benchmarks` - Performance comparisons

Common Issues
-------------

**ImportError: No module named 'aiter'**

Make sure the ROCm libraries are in your library path:

.. code-block:: bash

   export LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH

**RuntimeError: No AMD GPU found**

Verify that the GPU is accessible:

.. code-block:: bash

   rocm-smi
   rocminfo | grep gfx

**Compilation errors during first run**

JIT compilation may take time on first use. Pre-compile the kernels:

.. code-block:: bash

   PREBUILD_KERNELS=2 GPU_ARCHS="native" python3 setup.py install

Get Help
--------

* **Documentation**: https://doc.aiter.amd.com
* **GitHub Issues**: https://github.com/ROCm/aiter/issues
* **ROCm Community**: https://github.com/ROCm/ROCm/discussions