Collective Communication Operations#

Warning

The Gluon API is experimental and may undergo breaking changes in future releases.

Collective communication operations are accessible via the ccl attribute on the IrisGluon instance (e.g. ctx.ccl.all_to_all(...)).

all_to_all#

CCL.all_to_all(output_tensor, input_tensor, group=None, async_op=False, config=None)#

All-to-all collective operation.

Each rank sends a tensor chunk to each other rank and receives a tensor chunk from each other rank. Input/output tensors should have shape (M, N * world_size) where each chunk of N columns corresponds to one rank.

Parameters:
  • output_tensor – Output tensor of shape (M, N * world_size)

  • input_tensor – Input tensor of shape (M, N * world_size)

  • group – ProcessGroup or None. If None, uses all ranks in shmem context. Default: None.

  • async_op – If False, performs a barrier at the end. If True, returns immediately. Default: False.

  • config – Config instance with kernel parameters (default: None). If None, uses default Config values. Set config.use_gluon=True to use the Gluon implementation with traffic shaping.

Example

>>> shmem = iris_gluon.iris()
>>> shmem.ccl.all_to_all(output_tensor, input_tensor)
>>> # Custom configuration with Gluon traffic shaping
>>> from iris.ccl import Config
>>> config = Config(use_gluon=True, block_size_m=128, block_size_n=32)
>>> shmem.ccl.all_to_all(output_tensor, input_tensor, config=config)
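The chunk exchange described above can be sketched as a single-process NumPy reference. This is an illustration of the all_to_all data movement (column chunk j of rank r's input lands in column chunk r of rank j's output), not part of the Iris API; the function name and the list-of-ranks simulation are made up for clarity:

```python
import numpy as np

world_size, M, N = 4, 2, 3
rng = np.random.default_rng(0)
# inputs[r] plays the role of rank r's input tensor of shape (M, N * world_size).
inputs = [rng.standard_normal((M, N * world_size)) for _ in range(world_size)]

def all_to_all_reference(inputs, world_size, N):
    # Rank r sends its column chunk j (columns j*N:(j+1)*N) to rank j.
    # Rank r's output chunk i is therefore the chunk rank i addressed to r.
    outputs = []
    for r in range(world_size):
        out = np.concatenate(
            [inputs[i][:, r * N:(r + 1) * N] for i in range(world_size)],
            axis=1,
        )
        outputs.append(out)
    return outputs

outputs = all_to_all_reference(inputs, world_size, N)
```

Each simulated output keeps the (M, N * world_size) shape, so the reference can be compared chunk-by-chunk against a real multi-rank run.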

all_gather#

CCL.all_gather(output_tensor, input_tensor, group=None, async_op=False, config=None)#

All-gather collective operation.

Each rank sends its input tensor to all ranks, and all ranks receive and concatenate all input tensors along dimension 0 (rows), matching torch.distributed.all_gather_into_tensor behavior.

Parameters:
  • output_tensor – Output tensor of shape (world_size * M, N) - will contain concatenated inputs

  • input_tensor – Input tensor of shape (M, N) - local rank’s data to send

  • group – ProcessGroup or None. If None, uses all ranks in shmem context. Default: None.

  • async_op – If False, performs a barrier at the end. If True, returns immediately. Default: False.

  • config – Config instance with kernel parameters (default: None). If None, uses default Config values.

Example

>>> shmem = iris_gluon.iris()
>>> # Input: (M, N), Output: (world_size * M, N)
>>> shmem.ccl.all_gather(output_tensor, input_tensor)
>>> # Custom configuration
>>> from iris.ccl import Config
>>> config = Config(block_size_m=128, block_size_n=32)
>>> shmem.ccl.all_gather(output_tensor, input_tensor, config=config)
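The row-wise concatenation semantics can be sketched as a single-process NumPy reference. The function name and the list-of-ranks simulation are illustrative only; the real operation runs across ranks and writes into the preallocated output tensor:

```python
import numpy as np

world_size, M, N = 4, 2, 3
rng = np.random.default_rng(0)
# inputs[r] plays the role of rank r's local (M, N) input tensor.
inputs = [rng.standard_normal((M, N)) for _ in range(world_size)]

def all_gather_reference(inputs):
    # Every rank receives the concatenation of all inputs along dim 0,
    # matching torch.distributed.all_gather_into_tensor: rows
    # r*M:(r+1)*M of the output hold rank r's input.
    gathered = np.concatenate(inputs, axis=0)
    return [gathered.copy() for _ in inputs]

outputs = all_gather_reference(inputs)
```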

reduce_scatter#

CCL.reduce_scatter(output_tensor, input_tensor, op=None, group=None, async_op=False, config=None)#

Reduce-scatter collective operation.

Each rank reduces its assigned tiles from all ranks’ inputs and stores the result only to its own output tensor. This is similar to all-reduce but without broadcasting the result to all ranks.

Parameters:
  • output_tensor – Output tensor of shape (M, N) - will contain reduced tiles for this rank

  • input_tensor – Input tensor of shape (M, N) - local rank’s partial data

  • op – Reduction operation to apply. Currently only ReduceOp.SUM is supported; None is treated as ReduceOp.SUM. Default: None.

  • group – ProcessGroup or None. If None, uses all ranks in shmem context. Default: None.

  • async_op – If False, performs a barrier at the end. If True, returns immediately. Default: False.

  • config – Config instance with kernel parameters (default: None). If None, uses default Config values. Only supports reduce_scatter_variant="two_shot".

Example

>>> shmem = iris_gluon.iris()
>>> shmem.ccl.reduce_scatter(output_tensor, input_tensor)
>>> # Custom configuration
>>> from iris.ccl import Config
>>> config = Config(reduce_scatter_variant="two_shot", all_reduce_distribution=1)
>>> shmem.ccl.reduce_scatter(output_tensor, input_tensor, config=config)
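The reduction semantics can be sketched as a single-process NumPy reference. Note the hedges: the sketch assumes tiles are assigned to ranks as contiguous row blocks and that M divides evenly by world_size; the real kernel's tile-to-rank mapping is an implementation detail, and the real API writes results into a full (M, N) output buffer rather than returning just the owned block:

```python
import numpy as np

world_size, N = 4, 3
M = 8  # assumed divisible by world_size for this sketch
rng = np.random.default_rng(0)
# inputs[r] plays the role of rank r's (M, N) partial data.
inputs = [rng.standard_normal((M, N)) for _ in range(world_size)]

def reduce_scatter_reference(inputs, world_size):
    # Elementwise sum across all ranks' partials (ReduceOp.SUM,
    # the only supported reduction).
    total = np.sum(np.stack(inputs), axis=0)
    rows = total.shape[0] // world_size
    # Hypothetical tile assignment: rank r owns contiguous row block r.
    return [total[r * rows:(r + 1) * rows] for r in range(world_size)]

blocks = reduce_scatter_reference(inputs, world_size)
```

Unlike all-reduce, each rank ends up holding only its assigned slice of the reduced result; combining reduce_scatter with all_gather recovers the full all-reduce.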