Iris

First-Class Multi-GPU Programming Experience in Triton
What is Iris?
Iris is a Triton-based framework for Remote Memory Access (RMA) operations. It provides SHMEM-like APIs for multi-GPU programming directly within Triton. Iris' goal is to make multi-GPU programming a first-class citizen in Triton while retaining Triton's programmability and performance.
Key Features
SHMEM-like RMA: remote loads, stores, and atomics between GPUs, expressed directly in Triton
Simple and Intuitive API: writing a multi-GPU program is as easy as writing a single-GPU one
Triton-based: Iris is built on top of Triton and inherits Triton's performance and capabilities
Quick Start
The recommended way to get started is using Docker Compose:
# Clone the repository
git clone https://github.com/ROCm/iris.git
cd iris
# Start the development container
docker compose up --build -d
# Attach to the running container
docker attach iris-dev
# Install Iris in development mode
cd iris && pip install -e .
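Once installed, a quick smoke test is to confirm the package imports cleanly: running python -c "import iris" inside the container should exit without errors.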
Run Your First Example
Here’s a simple example showing how to perform remote memory operations between GPUs using Iris:
import torch
import triton
import triton.language as tl

import iris


# Device-side APIs
@triton.jit
def kernel(buffer, buffer_size: tl.constexpr, block_size: tl.constexpr, heap_bases_ptr):
    # Compute the start index of this block
    pid = tl.program_id(0)
    block_start = pid * block_size
    offsets = block_start + tl.arange(0, block_size)

    # Guard against out-of-bounds accesses
    mask = offsets < buffer_size

    # Store 1 in the target rank's buffer at each offset
    source_rank = 0
    target_rank = 1
    iris.store(buffer + offsets, 1,
               source_rank, target_rank,
               heap_bases_ptr, mask=mask)


# Iris initialization
heap_size = 2**30  # 1 GiB symmetric heap for inter-GPU communication
iris_ctx = iris.iris(heap_size)
cur_rank = iris_ctx.get_rank()

# Iris tensor allocation
buffer_size = 4096  # 4K-element buffer
buffer = iris_ctx.zeros(buffer_size, device="cuda", dtype=torch.float32)

# Launch the kernel on rank 0
block_size = 1024
grid = lambda meta: (triton.cdiv(buffer_size, meta["block_size"]),)
source_rank = 0
if cur_rank == source_rank:
    kernel[grid](
        buffer,
        buffer_size,
        block_size,
        iris_ctx.get_heap_bases(),
    )

# Synchronize all ranks
iris_ctx.barrier()
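The example is SPMD: every rank runs the same script, rank 0 performs the remote write, and the barrier ensures the write has landed before anyone reads. A minimal follow-up check (a sketch, assuming iris_ctx.zeros returns an ordinary torch.Tensor backed by the symmetric heap) could be:

# After the barrier, rank 1's local buffer holds the values rank 0 wrote remotely.
target_rank = 1
if cur_rank == target_rank:
    assert torch.all(buffer == 1.0), "remote store did not land"

Depending on your setup, the script is launched with one process per GPU (for example via mpirun or torchrun; see the Installation Guide for the exact invocation).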
For more examples, see the Examples page with ready-to-run scripts and usage patterns.
For other setup methods, see the Installation Guide.
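Under the hood, each rank allocates its tensors at the same offset inside a symmetric heap, so a pointer that is valid on one rank can be translated to the matching location on another. Conceptually (a simplified sketch, not Iris' exact implementation), the translation behind iris.store looks like:

def translate(ptr, source_rank, target_rank, heap_bases):
    # Offset of the pointer within the source rank's symmetric heap
    offset = ptr - heap_bases[source_rank]
    # Same offset, rebased onto the target rank's heap
    return heap_bases[target_rank] + offset

This is why the kernel takes heap_bases_ptr: it is the per-rank table of heap base addresses that makes the translation possible. The Programming Model page below covers this in detail.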
Documentation Structure
📚 Getting Started
Installation: Set up Iris on your system
Examples: Working code examples
Contributing: How to contribute
🧠 Conceptual
Programming Model: How Iris works
Fine-grained Overlap: GEMM & communication overlap
📖 Reference
API Reference: Structured API documentation
Initialization & Helpers: Factory and helper methods
Tensor Creation: Tensor-like APIs on Iris
Triton Device Functions: Load/store and atomics
Supported GPUs
Iris currently supports:
AMD Instinct MI300X, MI350X, and MI355X
Note: Iris may also work on other ROCm-compatible AMD GPUs.
Roadmap
We plan to extend Iris with the following features:
Extended GPU Support: Testing and optimization for other AMD GPUs
RDMA Support: Multi-node execution using Remote Direct Memory Access (RDMA) to scale across machines
End-to-End Integration: Comprehensive examples covering various use cases and end-to-end patterns
Community & Support
GitHub Discussions
Join the GitHub Discussions to ask questions, share ideas, and connect with the Iris community.
GitHub Issues
Found a bug or have a feature request? Report it on GitHub Issues.
Contributing
Want to contribute to Iris? Check out the Contributing Guide to learn how you can help make Iris better for everyone.
Ready to start your multi-GPU journey? Begin with the Installation Guide!