Core Operators
==============

RMSNorm
-------

.. autofunction:: aiter.rmsnorm

Root Mean Square Layer Normalization, commonly used in LLMs like Llama.

**Parameters:**

* **x** (*torch.Tensor*) - Input tensor of shape ``(..., hidden_dim)``
* **weight** (*torch.Tensor*) - Scaling weights of shape ``(hidden_dim,)``
* **eps** (*float*, optional) - Epsilon for numerical stability. Default: ``1e-6``

**Returns:**

* **output** (*torch.Tensor*) - Normalized tensor with the same shape as the input

**Example:**

.. code-block:: python

    import torch
    import aiter

    x = torch.randn(2, 1024, 4096, device='cuda', dtype=torch.float16)
    weight = torch.ones(4096, device='cuda', dtype=torch.float16)

    output = aiter.rmsnorm(x, weight, eps=1e-6)

LayerNorm
---------

.. autofunction:: aiter.layernorm

Standard layer normalization with optional bias.

**Parameters:**

* **x** (*torch.Tensor*) - Input tensor ``(..., hidden_dim)``
* **weight** (*torch.Tensor*) - Weights ``(hidden_dim,)``
* **bias** (*torch.Tensor*, optional) - Bias ``(hidden_dim,)``
* **eps** (*float*, optional) - Epsilon. Default: ``1e-5``

**Returns:**

* **output** (*torch.Tensor*) - Normalized output

SoftMax
-------

.. autofunction:: aiter.softmax

Optimized softmax operation with optional masking.

**Parameters:**

* **x** (*torch.Tensor*) - Input tensor
* **dim** (*int*) - Dimension along which softmax is applied
* **mask** (*torch.Tensor*, optional) - Attention mask

**Returns:**

* **output** (*torch.Tensor*) - Softmax output

GELU
----

.. autofunction:: aiter.gelu

GELU activation function, with an optional fast ``tanh`` approximation.

**Parameters:**

* **x** (*torch.Tensor*) - Input tensor
* **approximate** (*str*, optional) - Approximation method. Options: ``'none'``, ``'tanh'``. Default: ``'none'``

**Returns:**

* **output** (*torch.Tensor*) - GELU output

**Example:**

.. code-block:: python

    import torch
    import aiter

    x = torch.randn(2, 1024, 4096, device='cuda', dtype=torch.float16)

    # Exact GELU
    output_exact = aiter.gelu(x)

    # Fast approximate GELU
    output_approx = aiter.gelu(x, approximate='tanh')

SwiGLU
------

.. autofunction:: aiter.swiglu

Swish-Gated Linear Unit activation.

**Parameters:**

* **x** (*torch.Tensor*) - Input tensor ``(..., 2 * hidden_dim)``
* **dim** (*int*, optional) - Dimension to split along. Default: ``-1``

**Returns:**

* **output** (*torch.Tensor*) - SwiGLU output ``(..., hidden_dim)``

Rotary Position Embedding (RoPE)
--------------------------------

.. autofunction:: aiter.apply_rotary_pos_emb

Apply rotary position embeddings to query and key tensors.

**Parameters:**

* **q** (*torch.Tensor*) - Query tensor ``(batch, seq_len, num_heads, head_dim)``
* **k** (*torch.Tensor*) - Key tensor ``(batch, seq_len, num_heads, head_dim)``
* **cos** (*torch.Tensor*) - Cosine embeddings ``(seq_len, head_dim // 2)``
* **sin** (*torch.Tensor*) - Sine embeddings ``(seq_len, head_dim // 2)``
* **position_ids** (*torch.Tensor*, optional) - Position indices

**Returns:**

* **q_rot** (*torch.Tensor*) - Rotated query
* **k_rot** (*torch.Tensor*) - Rotated key

**Example:**

.. code-block:: python

    import torch
    import aiter

    seq_len, head_dim = 1024, 64
    q = torch.randn(2, seq_len, 16, head_dim, device='cuda', dtype=torch.float16)
    k = torch.randn(2, seq_len, 16, head_dim, device='cuda', dtype=torch.float16)

    # Precompute RoPE embeddings
    cos, sin = aiter.precompute_rope_embeddings(seq_len, head_dim)

    # Apply rotation
    q_rot, k_rot = aiter.apply_rotary_pos_emb(q, k, cos, sin)

Sampling Operations
-------------------

Top-K Sampling
^^^^^^^^^^^^^^

.. autofunction:: aiter.top_k_sampling

Sample from the top-``k`` logits.

**Parameters:**

* **logits** (*torch.Tensor*) - Logits ``(batch, vocab_size)``
* **k** (*int*) - Number of top candidates
* **temperature** (*float*, optional) - Sampling temperature. Default: ``1.0``

**Returns:**

* **tokens** (*torch.Tensor*) - Sampled token IDs ``(batch,)``

Top-P (Nucleus) Sampling
^^^^^^^^^^^^^^^^^^^^^^^^

.. autofunction:: aiter.top_p_sampling

Nucleus sampling with a cumulative probability threshold.

**Parameters:**

* **logits** (*torch.Tensor*) - Logits ``(batch, vocab_size)``
* **p** (*float*) - Cumulative probability threshold (0.0 to 1.0)
* **temperature** (*float*, optional) - Temperature. Default: ``1.0``

**Returns:**

* **tokens** (*torch.Tensor*) - Sampled tokens ``(batch,)``
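**Example:**

Neither sampling function has a usage example above, so the following is a
minimal sketch exercising both. It assumes only the signatures documented in
this section; the batch size, vocabulary size, and sampling settings are
arbitrary illustrations, and the sketch has not been validated against a
specific ``aiter`` release.

.. code-block:: python

    import torch
    import aiter

    # Logits for a batch of 4 sequences over a 32k vocabulary
    logits = torch.randn(4, 32000, device='cuda', dtype=torch.float16)

    # Top-K: keep the 50 highest-probability candidates, then sample
    tokens_topk = aiter.top_k_sampling(logits, k=50, temperature=0.8)

    # Top-P: sample from the smallest candidate set whose cumulative
    # probability exceeds 0.9
    tokens_topp = aiter.top_p_sampling(logits, p=0.9, temperature=0.8)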
Performance Notes
-----------------

All operators are optimized for AMD GPUs:

* **FP16/BF16 preferred**: Best performance on MI300X
* **Large batches**: Better GPU utilization
* **Fused operations**: Many ops fused into single kernels
* **In-place when possible**: Reduces memory allocations

Supported Data Types
--------------------

.. list-table::
   :header-rows: 1
   :widths: 30 25 25 20

   * - Operator
     - FP32
     - FP16
     - BF16
   * - rmsnorm
     - ✓
     - ✓ (fastest)
     - ✓
   * - layernorm
     - ✓
     - ✓ (fastest)
     - ✓
   * - gelu
     - ✓
     - ✓ (fastest)
     - ✓
   * - swiglu
     - ✓
     - ✓ (fastest)
     - ✓
   * - apply_rotary_pos_emb
     - ✓
     - ✓ (fastest)
     - ✓
   * - sampling ops
     - ✓
     - ✓
     - ✓

See Also
--------

* :doc:`../tutorials/normalization` - Normalization tutorial
* :doc:`../tutorials/custom_ops` - Adding custom operators
* :doc:`gemm` - Matrix multiplication operations