Taxonomy of Multi-GPU Programming Patterns#

This page describes the taxonomy of programming patterns for fine-grained multi-GPU computation and communication overlap, focusing on GEMM + All-Scatter as a representative example.

Background: Contemporary vs. Iris Approaches#

Bulk Synchronous Programming Model (Contemporary)#

The traditional approach, exemplified by communication libraries such as RCCL, follows a stop-and-go pattern:

Contemporary Bulk Synchronous Approach
Contemporary approach showing CPU-initiated control with separate compute and RCCL kernels requiring host-device synchronization.

Limitations:

  • CPU-initiated control path

  • Requires host and device synchronization

  • Stop-and-go execution creates inefficiencies

  • Cannot achieve fine-grained overlap

Iris Approach#

Iris provides device-side primitives that enable a natural, work-group-centric programming model:

World of Iris
Iris approach with GPU-initiated control where communication begins as soon as data is produced, enabling fine-grained overlap.

@triton.jit
def kernel(...):
    # Compute at work group granularity
    result = compute(...)
    
    # Store to remote GPU (no intermediate buffers)
    iris.store(buffer + offsets, result, cur_rank, remote_rank, heap_bases)
    
    # Continue with more compute
    # Load from remote GPU
    data = iris.load(buffer + offsets, cur_rank, remote_rank, heap_bases)

Advantages:

  • No intermediate buffers required

  • No synchronization overhead between compute and communication

  • Remote communication may begin as soon as the data is produced

  • Fine-grained control over resource allocation

Programming Pattern Taxonomy#

The fundamental insight is that achieving peak performance requires computation-communication overlap. Iris provides multiple programming patterns to achieve this overlap, each with different trade-offs.

Pattern Classification#

Iris Taxonomy Overview
Figure 1: Complete taxonomy of fused and unfused patterns in Iris, illustrated with an All-Scatter example across multiple GPUs.

Pattern Details#

1. Unfused Bulk Synchronous#

Description: Traditional approach with separate compute and communication kernels, synchronized at the host.

Example Implementation: examples/12_gemm_all_scatter_bulk_synchronous

Characteristics:

  • Two separate kernel launches

  • Hard barrier (CPU synchronization) between kernels

  • All CUs assigned to compute, then all to communication

  • No overlap between computation and communication

Unfused Bulk Synchronous Pattern
Figure 2: Bulk synchronous execution showing complete separation of compute (GEMM) and communication phases with synchronization barriers.

Use Case: Baseline implementation; matches RCCL-style collectives
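
For reference, a minimal host-driven sketch of this baseline in plain PyTorch with the RCCL backend (tensor names are illustrative and the process group is assumed to be initialized; the all-scatter is expressed as an all_gather from the receiving side):

import torch
import torch.distributed as dist

# Kernel 1: compute the local GEMM shard
partial = torch.matmul(a_local, b)

# The collective is enqueued after the GEMM on the same stream, so
# communication cannot begin until every tile has been computed.
shards = [torch.empty_like(partial) for _ in range(dist.get_world_size())]
dist.all_gather(shards, partial)  # Kernel 2: RCCL communication
result = torch.cat(shards, dim=0)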

Performance: Provides slight gains through kernel specialization but lacks overlap benefits

Bulk Synchronous Performance
Figure 3: Performance comparison showing bulk synchronous (purple) vs. torch+rccl (orange) across various matrix sizes.


2. Unfused Producer-Consumer#

Description: Separate compute and communication kernels running concurrently with fine-grained synchronization.

Example Implementation: examples/11_gemm_all_scatter_producer_consumer

Characteristics:

  • Two kernels launched on different streams to establish concurrency

  • GPU partitioned between compute CUs and communication CUs

  • Producer (compute) releases tiles as they complete

  • Consumer (communication) spins waiting for tiles, then consumes them

  • Uses atomic operations with acquire/release semantics for synchronization

Unfused Producer-Consumer Pattern
Figure 4: Producer-consumer pattern showing concurrent GEMM (producer) and All-Scatter (consumer) kernels with lock-based synchronization. Note the overlap between tile computation and communication.

Implementation Pattern:

# Producer side
compute_tile(...)
# Release pattern - notify tile is ready
iris.atomic_cas(flag + tile_id, 0, 1, consumer_rank, 
                sem="release", scope="sys")

# Consumer side
# Acquire pattern - wait for tile
done = 0
while done == 0:
    done = iris.atomic_cas(flag + tile_id, 1, 0, consumer_rank,
                          sem="acquire", scope="sys")
# Consume data
data = iris.load(buffer + offsets, consumer_rank, mask=mask)

Key Decision: GPU partitioning (e.g., 256 CUs for compute, 48 CUs for communication)
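
A host-side sketch of this partitioning (kernel names and grid sizes below are illustrative, not the exact example code) launches the two persistent kernels on separate streams, one work group per CU:

import torch

compute_stream = torch.cuda.Stream()
comm_stream = torch.cuda.Stream()

# Producer: persistent GEMM kernel pinned to 256 work groups
with torch.cuda.stream(compute_stream):
    producer_gemm_kernel[(256,)](...)

# Consumer: all-scatter kernel on the remaining 48 work groups,
# spinning on the per-tile flags released by the producer
with torch.cuda.stream(comm_stream):
    consumer_scatter_kernel[(48,)](...)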

Performance: Achieves significant speedups (up to 2.5x) through compute-communication overlap

Producer-Consumer Performance
Figure 5: Performance comparison showing producer-consumer (pink) achieving up to 2.5x speedup compared to bulk synchronous approaches.


3. Fused Sequential#

Description: Single über-kernel that computes then communicates sequentially.

Example Implementation: examples/07_gemm_all_scatter

Characteristics:

  • Single kernel launch

  • Data stays in registers (no HBM writeback between phases)

  • Each work group: compute tile → scatter tile → repeat

  • No explicit synchronization needed within work group

Fused Sequential Pattern
Figure 6: Fused sequential pattern showing GEMM and communication tightly coupled within a single kernel, with data remaining in registers between phases.

Advantages:

  • Simplest fused implementation

  • Avoids HBM round-trip for intermediate data

  • Reduces memory bandwidth pressure

Disadvantages:

  • Increased register pressure

  • Tail effect amplified (idle resources at kernel end)

  • No concurrent execution of compute and communication

Use Case: Problems where register pressure is manageable and simplicity is valued
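
A rough sketch of the kernel structure (reusing the iris.store signature shown earlier; compute_gemm_tile, num_tiles, and world_size are hypothetical placeholders for the real example code):

@triton.jit
def fused_sequential_kernel(...):
    pid = tl.program_id(0)
    # Each work group walks its share of output tiles
    for tile_id in range(pid, num_tiles, tl.num_programs(0)):
        # Compute one GEMM tile; the result stays in registers
        result = compute_gemm_tile(tile_id, ...)
        # Scatter the tile to every rank directly from registers
        for remote_rank in range(world_size):
            iris.store(buffer + offsets, result, cur_rank, remote_rank, heap_bases)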

Fused Sequential Performance
Figure 7: Performance comparison including fused sequential (light purple) showing competitive performance with reduced memory bandwidth usage.


4. Fused Producer-Consumer with Work Group Specialization#

Description: Single kernel where work groups specialize as either producers or consumers at launch.

Example Implementation: examples/10_gemm_all_scatter_wg_specialization

Characteristics:

  • Single kernel with branching at entry: if pid < num_gem_cus

  • Producer work groups: compute tiles and release locks

  • Consumer work groups: acquire locks and scatter tiles

  • Fine-grained synchronization using atomics with GPU scope

  • Producer-consumer relationship within a single kernel

Fused Work Group Specialization Pattern
Figure 8: Work group specialization pattern showing a single kernel with work groups specialized as producers (CU₀-CU₂) or consumers (CU₃-CU₄), using spinlocks for fine-grained synchronization.

Implementation Pattern:

@triton.jit
def fused_kernel(...):
    pid = tl.program_id(0)
    if pid < num_gem_cus:
        # Producer: compute a GEMM tile and store the result
        result = compute_gemm_tile(...)
        store_result(result, ...)
        # Release the lock: signal that the tile is ready
        iris.atomic_cas(flag, 0, 1, sem="release", scope="gpu")
    else:
        # Consumer: spin until the tile is released, then scatter it
        acquired = 0
        while acquired == 0:
            acquired = iris.atomic_cas(flag, 1, 0, sem="acquire", scope="gpu")
        scatter_tile(...)
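
The corresponding launch (a sketch with assumed work-group counts) sizes a single grid to cover both roles and passes the partition point to the kernel:

# Single grid covering producers and consumers; num_gem_cus marks the split
num_gem_cus = 256   # work groups running the GEMM (assumed partition)
num_comm_cus = 48   # work groups running the all-scatter
fused_kernel[(num_gem_cus + num_comm_cus,)](..., num_gem_cus=num_gem_cus)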

Advantages:

  • Better resource utilization (single kernel overhead)

  • Potential locality benefits (producer and consumer co-resident)

  • No stream synchronization overhead

Performance: Achieves strong speedups (up to 2x) for certain problem shapes

Work Group Specialization Performance
Figure 9: Comprehensive performance comparison showing all patterns. Work group specialization (dark purple) achieves the best performance for most problem sizes, demonstrating up to 60% improvement over producer-consumer patterns.


5. Fused Producer-Consumer with Wavefront Specialization#

Description: Specialization at the wavefront level within work groups (future work).

Status: Not yet implementable in standard Triton; requires Gluon

Potential Benefits:

  • Even finer-grained resource partitioning

  • Better cache locality

  • Reduced synchronization overhead


Design Philosophy#

Iris embraces a core principle: Provide low-level device APIs that enable rapid exploration of all these patterns.

Rather than prescribing one “best” approach, Iris empowers developers to:

  • Quickly prototype different patterns (examples implemented in ~10 minutes to 1 hour)

  • Experiment with GPU partitioning strategies

  • Optimize for specific problem characteristics

  • Discover novel patterns not yet in this taxonomy

The goal is to make fine-grained multi-GPU programmability a first-class citizen in Triton, maintaining both programmability and performance.

See Also#