AITER Documentation

AITER (AI Tensor Engine for ROCm) is AMD's high-performance AI operator library for ROCm, providing optimized kernels for inference and training workloads.


Why AITER?

  • High Performance: Optimized kernels using Triton, Composable Kernel (CK), and hand-written assembly

  • Comprehensive: Supports both inference and training workloads

  • Flexible: C++ and Python APIs for easy integration

  • AMD Optimized: Built specifically for AMD GPUs and the ROCm platform

Quick Start

Installation

# pip install aiter  (coming soon; not yet available)

# For now, install from source:
git clone --recursive https://github.com/ROCm/aiter.git
cd aiter
python3 setup.py develop

Quick Example

import aiter
import torch

# Example: Flash Attention
# TODO: Add actual example code

Core Features

Attention Kernels

  • Multi-Head Attention (MHA): Standard attention with optimized implementations

  • Multi-Latent Attention (MLA): DeepSeek-style latent attention

  • Paged Attention: Efficient KV-cache management for serving
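As a rough illustration of what these kernels compute, the sketch below implements single-query scaled dot-product attention and a paged KV-cache lookup in plain Python. It does not use AITER's API. The names `attend` and `gather_paged`, and the block-table layout (logical block index mapped to a physical block), are assumptions made for this example; AITER's actual cache layout may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-query scaled dot-product attention (reference only)."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def gather_paged(cache_blocks, block_table, num_tokens, block_size):
    """Collect one sequence's cached vectors from non-contiguous blocks.

    Hypothetical layout: block_table[i] is the physical block that holds
    logical tokens i * block_size .. (i + 1) * block_size - 1.
    """
    gathered = []
    for pos in range(num_tokens):
        block = cache_blocks[block_table[pos // block_size]]
        gathered.append(block[pos % block_size])
    return gathered


# Three cached tokens stored in physical blocks 2 and 0 (block 1 is unrelated).
key_blocks = [[[1.0, 0.0], [0.0, 0.0]],
              [[5.0, 5.0], [5.0, 5.0]],
              [[1.0, 0.0], [1.0, 0.0]]]
value_blocks = [[[6.0, 3.0], [0.0, 0.0]],
                [[7.0, 7.0], [7.0, 7.0]],
                [[0.0, 0.0], [3.0, 0.0]]]
keys = gather_paged(key_blocks, [2, 0], num_tokens=3, block_size=2)
values = gather_paged(value_blocks, [2, 0], num_tokens=3, block_size=2)
output = attend([1.0, 0.0], keys, values)
```

Because all three keys are identical here, the attention weights are uniform and `output` is the mean of the gathered values. An optimized kernel would read the cache blocks in place rather than materializing `keys` and `values` first.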

GEMM Operations

  • Mixed Precision GEMM: FP16, BF16, FP8, INT4 support

  • Tuned GEMM: Pre-tuned configurations for common shapes

  • Fused Operations: GEMM with activation fusion
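To make "GEMM with activation fusion" concrete, here is a minimal plain-Python sketch in which the bias add and a ReLU are applied to each accumulator before it is written out, instead of in separate passes over the result. The function name `gemm_bias_relu` and the choice of ReLU are assumptions for illustration only; this is not AITER's API.

```python
def gemm_bias_relu(a, b, bias):
    """Compute relu(a @ b + bias) for row-major lists of lists.

    The bias and activation are applied while each output element is
    produced, so no intermediate a @ b matrix is stored.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    out = []
    for i in range(rows):
        row = []
        for j in range(cols):
            acc = bias[j]
            for k in range(inner):
                acc += a[i][k] * b[k][j]
            row.append(max(acc, 0.0))  # fused activation
        out.append(row)
    return out


result = gemm_bias_relu(
    [[1.0, 2.0], [3.0, 4.0]],
    [[1.0, 0.0], [0.0, -1.0]],
    [0.0, 1.0],
)
```

On a GPU the same idea saves a round trip through memory: the activation runs on values that are still in registers.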

Mixture of Experts (MoE)

  • Fused MoE: Optimized expert routing and computation

  • Multiple Routing: Support for various routing strategies

  • Quantized Experts: FP8 and INT4 expert weights
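The routing step can be sketched in plain Python as follows. This assumes one common strategy, top-k selection with softmax weights renormalized over the selected experts; the names `route_top_k` and `moe_forward` are hypothetical and do not correspond to AITER functions.

```python
import math


def route_top_k(logits, k):
    """Return [(expert_index, weight)] for the k largest router logits.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, logits, experts, k=2):
    """Weighted sum of the selected experts' outputs for one token."""
    out = [0.0] * len(x)  # assumes every expert preserves the input size
    for index, weight in route_top_k(logits, k):
        y = experts[index](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Toy experts: expert s multiplies its input by s.
experts = [lambda v, s=s: [s * e for e in v] for s in range(4)]
routes = route_top_k([0.0, 2.0, 2.0, -1.0], k=2)
token_out = moe_forward([1.0, 2.0], [0.0, 2.0, 2.0, -1.0], experts, k=2)
```

A fused MoE kernel performs the selection, the expert computation, and the weighted combination together rather than as separate steps per expert.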

Normalization

  • RMSNorm: Root mean square normalization

  • LayerNorm: Standard layer normalization

  • Fused Variants: Combined with other operations
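For reference, the two normalizations can be written in a few lines of plain Python. This is only a sketch of the standard definitions, not AITER's API; the function names and default `eps` values are assumptions chosen for the example.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by weight."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def layer_norm(x, weight, bias, eps=1e-5):
    """Subtract the mean, divide by the standard deviation, then scale and shift."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(variance + eps)
    return [(v - mean) * inv_std * w + b for v, w, b in zip(x, weight, bias)]


rms_out = rms_norm([3.0, 4.0], [1.0, 1.0], eps=0.0)
ln_out = layer_norm([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
```

RMSNorm skips the mean subtraction, which is the main difference between the two.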

Other Operators

  • RoPE: Rotary position embeddings

  • Quantization: BF16/FP16 → FP8/INT4 conversion

  • Element-wise: Optimized basic operations

  • Communication: AllReduce and collective operations via Triton/Iris
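Two of the operators listed above, RoPE and quantization, can be sketched in plain Python. Both functions are illustrative assumptions rather than AITER's API: `apply_rope` uses the interleaved-pair convention (other implementations rotate the two halves of the vector instead), and `quantize_int4` uses a single symmetric absmax scale for the whole tensor, where real kernels often use per-group or per-channel scales.

```python
import math


def apply_rope(x, position, base=10000.0):
    """Rotate consecutive pairs (x[i], x[i+1]) by position-dependent angles."""
    dim = len(x)  # assumed to be even
    out = list(x)
    for i in range(0, dim, 2):
        angle = position * base ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out


def quantize_int4(values):
    """Symmetric per-tensor quantization to integers in [-8, 7]."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 7.0 if absmax > 0 else 1.0
    quantized = [max(-8, min(7, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map quantized integers back to approximate floats."""
    return [q * scale for q in quantized]


rotated = apply_rope([1.0, 0.0, 0.5, -0.5], position=3)
codes, scale = quantize_int4([-3.5, 0.0, 1.0, 3.5])
restored = dequantize(codes, scale)
```

The rotation preserves the vector's length, and the round trip through `quantize_int4` and `dequantize` is exact only for values that are integer multiples of the scale.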

GPU Support

AITER supports AMD GPUs with the following architectures:

Architecture   gfx Target   Example GPUs           ROCm Version
CDNA 2         gfx90a       MI210, MI250, MI250X   ROCm 5.0+
CDNA 3         gfx942       MI300A, MI300X         ROCm 6.0+
CDNA 3.5       gfx950       MI350X (upcoming)      ROCm 6.3+
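The support matrix above can be turned into a small lookup, for example to fail early on an unsupported device. The sketch below is an assumption-laden helper, not part of AITER: `SUPPORTED_TARGETS` and `lookup_architecture` are hypothetical names, and the entries simply mirror the table.

```python
# gfx target -> architecture, copied from the support table in this page.
SUPPORTED_TARGETS = {
    "gfx90a": "CDNA 2",
    "gfx942": "CDNA 3",
    "gfx950": "CDNA 3.5",
}


def lookup_architecture(arch_name):
    """Return the architecture for a gfx target string, or None if unlisted.

    Target strings may carry feature suffixes such as
    "gfx942:sramecc+:xnack-", so only the part before the first ":" is used.
    """
    target = arch_name.split(":", 1)[0].strip().lower()
    return SUPPORTED_TARGETS.get(target)


architecture = lookup_architecture("gfx942:sramecc+:xnack-")
```

On a ROCm build of PyTorch, the target string can typically be read from `torch.cuda.get_device_properties(0).gcnArchName`; treat that attribute as an assumption and check it against the PyTorch version in use.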
