Iris

First-Class Multi-GPU Programming Experience in Triton

What is Iris?

Iris is a Triton-based framework for Remote Memory Access (RMA) operations. It provides SHMEM-like APIs for multi-GPU programming within Triton. Iris’ goal is to make multi-GPU programming a first-class citizen in Triton while retaining Triton’s programmability and performance.

Key Features

  • SHMEM-like RMA: Iris provides SHMEM-like Remote Memory Access support inside Triton kernels

  • Simple and Intuitive API: Iris’ RMA APIs make writing multi-GPU programs as easy as writing single-GPU programs (see the sketch after this list)

  • Triton-based: Iris is built on top of Triton and inherits Triton’s performance and capabilities
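
To make the second point concrete, here is a minimal device-side sketch. It reuses the iris.store call from the Quick Start example below; the kernel name, parameter names, and the constant 42 are illustrative, not part of Iris’ API:

import triton
import triton.language as tl
import iris

@triton.jit
def fill_remote(ptr, n, block_size: tl.constexpr, heap_bases):
    offsets = tl.program_id(0) * block_size + tl.arange(0, block_size)
    mask = offsets < n
    # A single-GPU Triton kernel would write locally:
    #   tl.store(ptr + offsets, 42, mask=mask)
    # With Iris, the same pattern writes into rank 1's buffer from rank 0:
    iris.store(ptr + offsets, 42, 0, 1, heap_bases, mask=mask)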

Quick Start

The recommended way to get started is using Docker Compose:

# Clone the repository
git clone https://github.com/ROCm/iris.git
cd iris

# Start the development container
docker compose up --build -d

# Attach to the running container
docker attach iris-dev

# Install Iris in development mode
cd iris && pip install -e .
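
As a quick sanity check (not an official verification step), you can confirm the package imports inside the container:

# Minimal import check; run after `pip install -e .`
import iris
print("Iris imported from:", iris.__file__)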

Run Your First Example

Here’s a simple example showing how to perform remote memory operations between GPUs using Iris. In it, rank 0 stores the value 1 into rank 1’s buffer:

import torch
import triton
import triton.language as tl
import iris

# Device-side APIs
@triton.jit
def kernel(buffer, buffer_size: tl.constexpr, block_size: tl.constexpr, heap_bases_ptr):
    # Compute start index of this block
    pid = tl.program_id(0)
    block_start = pid * block_size
    offsets = block_start + tl.arange(0, block_size)

    # Guard for out-of-bounds accesses
    mask = offsets < buffer_size

    # Remotely store 1 into the target rank's buffer at each offset
    source_rank = 0
    target_rank = 1
    iris.store(buffer + offsets, 1,
               source_rank, target_rank,
               heap_bases_ptr, mask=mask)

# Iris initialization
heap_size = 2**30   # 1GiB symmetric heap for inter-GPU communication
iris_ctx = iris.iris(heap_size)
cur_rank = iris_ctx.get_rank()

# Iris tensor allocation
buffer_size = 4096  # 4K elements buffer
buffer = iris_ctx.zeros(buffer_size, device="cuda", dtype=torch.float32)

# Launch the kernel on rank 0
block_size = 1024
grid = lambda meta: (triton.cdiv(buffer_size, meta["block_size"]),)
source_rank = 0
if cur_rank == source_rank:
    kernel[grid](
        buffer,
        buffer_size,
        block_size,
        iris_ctx.get_heap_bases(),
    )

# Synchronize all ranks
iris_ctx.barrier()
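
After the barrier, rank 1’s buffer should contain the values written remotely by rank 0. A hypothetical follow-up check (not part of the original example) could look like this:

# Hypothetical verification: confirm rank 0's remote store landed on rank 1
if cur_rank == 1:
    assert torch.all(buffer == 1), "remote store from rank 0 did not arrive"

Like other SPMD programs, this script is meant to be launched with one process per GPU; see the Examples page for the exact launch commands.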

For more examples, see the Examples page with ready-to-run scripts and usage patterns.

For other setup methods, see the Installation Guide.

Documentation Structure

The documentation is organized into three sections: 📚 Getting Started, 🧠 Conceptual, and 📖 Reference.

Supported GPUs

Iris currently supports:

  • AMD Instinct MI300X, MI350X, and MI355X

Note: Iris may also work on other ROCm-compatible AMD GPUs.

Roadmap

We plan to extend Iris with the following features:

  • Extended GPU Support: Testing and optimization for other AMD GPUs

  • RDMA Support: Multi-node support using Remote Direct Memory Access (RDMA) for distributed computing across multiple machines

  • End-to-End Integration: Comprehensive examples covering common use cases and end-to-end integration patterns

Community & Support

GitHub Discussions

Join the GitHub Discussions to ask questions, share ideas, and connect with the Iris community.

GitHub Issues

Found a bug or have a feature request? Report it on GitHub Issues.

Contributing

Want to contribute to Iris? Check out the Contributing Guide to learn how you can help make Iris better for everyone.


Ready to start your multi-GPU journey? Begin with the Installation Guide!