Device Functions

Warning

The Gluon API is experimental and may undergo breaking changes in future releases.

Device-side functions provided by Iris Gluon for remote memory operations and atomics. These methods are part of the IrisDeviceCtx aggregate used within Gluon kernels.

Iris Gluon: Gluon-based Multi-GPU Communication Framework

This module provides a Gluon-based implementation of Iris that uses the @aggregate decorator with Gluon’s @gluon.jit to encapsulate the Iris backend struct, eliminating the need to pass heap_bases around manually.

Key Features:
  • Uses Gluon’s @gluon.jit decorator for device-side methods

  • Encapsulates heap_bases and rank info in IrisDeviceCtx aggregate

  • Provides same functionality as original Iris with improved ergonomics

Example

>>> import iris.iris_gluon as iris_gl
>>> from triton.experimental import gluon
>>> from triton.experimental.gluon import language as gl
>>> ctx = iris_gl.iris(heap_size=2**30)  # 1GB heap
>>> context_tensor = ctx.get_device_context()  # Get context tensor
>>>
>>> @gluon.jit
>>> def kernel(IrisDeviceCtx: gl.constexpr, context_tensor, buffer):
>>>     ctx = IrisDeviceCtx.initialize(context_tensor)
>>>     data = ctx.load(buffer, 1)  # Read buffer from rank 1
class IrisDeviceCtx(cur_rank, num_ranks, heap_bases)[source]

Gluon device-side context that decodes the tensor from Iris.get_device_context().

This aggregate encapsulates the heap_bases pointer and provides device-side methods for memory operations and atomics using Gluon.

Parameters:
  • cur_rank (tensor)

  • num_ranks (tensor)

  • heap_bases (tensor)

cur_rank

Current rank ID

Type:

tensor

num_ranks

Total number of ranks

Type:

tensor

heap_bases

Pointer to array of heap base addresses for all ranks

Type:

tensor

__init__(cur_rank, num_ranks, heap_bases)[source]
initialize(context_tensor)[source]

Initialize IrisDeviceCtx from the encoded tensor.

The context tensor has the format: [cur_rank, num_ranks, heap_base_0, heap_base_1, …]

Parameters:

context_tensor – Pointer to encoded context data

Returns:

Initialized device context

Return type:

IrisDeviceCtx

load(pointer, from_rank, mask=None)[source]

Loads a value from the specified rank’s memory location to the current rank.

Parameters:
  • pointer – Pointer in the from_rank’s address space

  • from_rank – The rank ID from which to read the data

  • mask – Optional mask for conditional loading

Returns:

The loaded value from the target memory location

Example

>>> # Load from rank 1 to current rank
>>> data = ctx.load(buffer + offsets, 1, mask=mask)
store(pointer, value, to_rank, mask=None)[source]

Writes data from the current rank to the specified rank’s memory location.

Parameters:
  • pointer – Pointer in the current rank’s address space

  • value – The value to store

  • to_rank – The rank ID to which the data will be written

  • mask – Optional mask for conditional storing

Example

>>> # Store from current rank to rank 1
>>> ctx.store(buffer + offsets, values, 1, mask=mask)
get(from_ptr, to_ptr, from_rank, mask=None)[source]

Copies data from the specified rank’s memory to the current rank’s local memory.

Parameters:
  • from_ptr – Pointer to remote memory in from_rank’s address space

  • to_ptr – Pointer to local memory in current rank

  • from_rank – The rank ID from which to read the data

  • mask – Optional mask for conditional operations

Example

>>> # Copy from rank 1 to current rank's local memory
>>> ctx.get(remote_ptr + offsets, local_ptr + offsets, 1, mask=mask)
put(from_ptr, to_ptr, to_rank, mask=None)[source]

Copies data from the current rank’s local memory to the specified rank’s memory.

Parameters:
  • from_ptr – Pointer to local memory in current rank

  • to_ptr – Pointer to remote memory in to_rank’s address space

  • to_rank – The rank ID to which the data will be written

  • mask – Optional mask for conditional operations

Example

>>> # Copy from current rank's local memory to rank 1
>>> ctx.put(local_ptr + offsets, remote_ptr + offsets, 1, mask=mask)
copy(src_ptr, dst_ptr, from_rank, to_rank, mask=None)[source]

Copies data from the specified rank’s memory into the destination rank’s memory.

This function performs the transfer by translating src_ptr from the from_rank’s address space to the to_rank’s address space, performing a masked load from the translated source, and storing the loaded data to dst_ptr in the to_rank memory location. If from_rank and to_rank are the same, this function performs a local copy operation. It is undefined behaviour if neither from_rank nor to_rank is the cur_rank.

Parameters:
  • src_ptr – Pointer in the from_rank’s local memory from which to read data

  • dst_ptr – Pointer in the to_rank’s local memory where the data will be written

  • from_rank – The rank ID that owns src_ptr (source rank)

  • to_rank – The rank ID that will receive the data (destination rank)

  • mask – Optional mask for conditional operations

Example

>>> # Copy from rank 1 to rank 0 (current rank must be either 1 or 0)
>>> ctx.copy(remote_ptr + offsets, local_ptr + offsets, 1, 0, mask=mask)
atomic_add(pointer, val, to_rank, mask=None, sem=None, scope=None)[source]

Performs an atomic add at the specified rank’s memory location.

Parameters:
  • pointer – The memory location in the current rank’s address space

  • val – The value to add

  • to_rank – The rank ID to which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Atomically add to rank 1's memory
>>> old_val = ctx.atomic_add(buffer, 5, 1)
atomic_sub(pointer, val, to_rank, mask=None, sem=None, scope=None)[source]

Atomically subtracts data from the specified rank’s memory location.

Parameters:
  • pointer – Pointer in the current rank’s address space

  • val – The value to subtract

  • to_rank – The rank ID to which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Atomically subtract from rank 1's memory
>>> old_val = ctx.atomic_sub(buffer, 3, 1)
atomic_cas(pointer, cmp, val, to_rank, sem=None, scope=None)[source]

Atomically compares the value at the specified rank’s memory location with cmp and, if they are equal, replaces it with val.

Parameters:
  • pointer – Pointer in the current rank’s address space

  • cmp – The expected value to compare

  • val – The new value to write if comparison succeeds

  • to_rank – The rank ID to which the atomic operation will be performed

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Compare-and-swap on rank 1's memory
>>> old_val = ctx.atomic_cas(flag + pid, 0, 1, 1, sem="release", scope="sys")
atomic_xchg(pointer, val, to_rank, mask=None, sem=None, scope=None)[source]

Performs an atomic exchange at the specified rank’s memory location.

Parameters:
  • pointer – The memory location in the current rank’s address space

  • val – The value to exchange

  • to_rank – The rank ID to which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Exchange value with rank 1's memory
>>> old_val = ctx.atomic_xchg(buffer, 99, 1)
atomic_xor(pointer, val, to_rank, mask=None, sem=None, scope=None)[source]

Performs an atomic bitwise XOR at the specified rank’s memory location.

Parameters:
  • pointer – The memory location in the current rank’s address space

  • val – The value to XOR with the stored value

  • to_rank – The rank ID to which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Atomically XOR with rank 1's memory
>>> old_val = ctx.atomic_xor(buffer, 0xFF, 1)
atomic_and(pointer, val, to_rank, mask=None, sem=None, scope=None)[source]

Performs an atomic bitwise AND at the specified rank’s memory location.

Parameters:
  • pointer – The memory location in the current rank’s address space

  • val – The value to AND with the stored value

  • to_rank – The rank ID to which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Atomically AND with rank 1's memory
>>> old_val = ctx.atomic_and(buffer, 0x0F, 1)
atomic_or(pointer, val, to_rank, mask=None, sem=None, scope=None)[source]

Performs an atomic bitwise OR at the specified rank’s memory location.

Parameters:
  • pointer – The memory location in the current rank’s address space

  • val – The value to OR with the stored value

  • to_rank – The rank ID to which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Atomically OR with rank 1's memory
>>> old_val = ctx.atomic_or(buffer, 0xF0, 1)
atomic_min(pointer, val, to_rank, mask=None, sem=None, scope=None)[source]

Performs an atomic min at the specified rank’s memory location.

Parameters:
  • pointer – The memory location in the current rank’s address space

  • val – The value to compare and potentially store

  • to_rank – The rank ID to which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Atomically compute minimum with rank 1's memory
>>> old_val = ctx.atomic_min(buffer, 10, 1)
atomic_max(pointer, val, to_rank, mask=None, sem=None, scope=None)[source]

Performs an atomic max at the specified rank’s memory location.

Parameters:
  • pointer – The memory location in the current rank’s address space

  • val – The value to compare and potentially store

  • to_rank – The rank ID to which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Atomically compute maximum with rank 1's memory
>>> old_val = ctx.atomic_max(buffer, 100, 1)
class IrisGluon(heap_size=1073741824)[source]

Gluon-based Iris class for multi-GPU communication and memory management.

This class provides the same functionality as the original Iris class but uses Gluon’s @aggregate decorator to encapsulate the backend state.

Parameters:

heap_size (int) – Size of the symmetric heap in bytes. Default: 1GB (2^30)

Example

>>> import torch
>>> import iris.iris_gluon as iris_gluon
>>> ctx = iris_gluon.iris(heap_size=2**31)  # 2GB heap
>>> backend = ctx.get_backend()  # Legacy method; returns the device context tensor
>>> tensor = ctx.zeros(1000, 1000, dtype=torch.float32)
__init__(heap_size=1073741824)[source]
debug(message)[source]

Log a debug message with rank information.

info(message)[source]

Log an info message with rank information.

warning(message)[source]

Log a warning message with rank information.

error(message)[source]

Log an error message with rank information.

get_device_context()[source]

Get the device context tensor for Gluon kernels.

Returns a tensor encoding: [cur_rank, num_ranks, heap_base_0, heap_base_1, …]

Returns:

Encoded context data as int64 tensor on device

Return type:

torch.Tensor

Example

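>>> import iris.iris_gluon as iris_gluon
>>> from triton.experimental import gluon
>>> from triton.experimental.gluon import language as gl
>>> ctx = iris_gluon.iris()
>>> context_tensor = ctx.get_device_context()
>>>
>>> @gluon.jit
>>> def kernel(IrisDeviceCtx: gl.constexpr, context_tensor, buffer):
>>>     ctx = IrisDeviceCtx.initialize(context_tensor)
>>>     data = ctx.load(buffer, 1)  # Read buffer from rank 1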
get_backend()[source]

Legacy method for backward compatibility. Use get_device_context() for Gluon kernels.

Returns:

Device context tensor

Return type:

torch.Tensor

get_heap_bases()[source]

Return the tensor of symmetric heap base addresses for all ranks.

Returns:

A 1D tensor of uint64 heap base addresses

Return type:

torch.Tensor

barrier()[source]

Synchronize all ranks using a distributed barrier.

get_device()[source]

Get the underlying device where the Iris symmetric heap resides.

Returns:

The CUDA device of Iris-managed memory

Return type:

torch.device

get_cu_count()[source]

Get the number of compute units (CUs) for the current GPU.

Returns:

Number of compute units on this rank’s GPU

Return type:

int

get_rank()[source]

Get the current rank ID.

Returns:

The current rank ID

Return type:

int

get_num_ranks()[source]

Get the total number of ranks.

Returns:

The total number of ranks in the distributed system

Return type:

int

broadcast(data, src_rank=0)[source]

Broadcast data from source rank to all ranks.

Parameters:
  • data – Data to broadcast (scalar or tensor)

  • src_rank – Source rank for broadcast (default: 0)

Returns:

The broadcasted data

zeros(*size, out=None, dtype=None, layout=torch.strided, device=None, requires_grad=False)[source]

Create a tensor filled with zeros on the symmetric heap.

Parameters:
  • size – Shape of the tensor

  • dtype – Data type (default: torch.float32)

  • device – Device (must match Iris device)

  • layout – Layout (default: torch.strided)

  • requires_grad – Whether to track gradients

Returns:

Zero-initialized tensor on the symmetric heap

Return type:

torch.Tensor

ones(*size, out=None, dtype=None, layout=torch.strided, device=None, requires_grad=False)[source]

Returns a tensor filled with the scalar value 1, with the shape defined by the variable argument size. The tensor is allocated on the Iris symmetric heap.

Parameters:

*size (int...) – a sequence of integers defining the shape of the output tensor. Can be a variable number of arguments or a collection like a list or tuple.

Keyword Arguments:
  • out (Tensor, optional) – the output tensor.

  • dtype (torch.dtype, optional) – the desired data type of returned tensor. Default: if None, uses a global default (see torch.set_default_dtype()).

  • layout (torch.layout, optional) – the desired layout of returned Tensor. Default: torch.strided. Note: Iris tensors always use torch.strided regardless of this parameter.

  • device (torch.device, optional) – the desired device of returned tensor. Default: if None, uses the current device for the default tensor type.

  • requires_grad (bool, optional) – If autograd should record operations on the returned tensor. Default: False.

Example

>>> ctx = iris_gluon.iris(1 << 20)
>>> tensor = ctx.ones(2, 3)
>>> print(tensor.shape)  # torch.Size([2, 3])
>>> print(tensor[0])  # tensor([1., 1., 1.], device='cuda:0')
full(size, fill_value, *, out=None, dtype=None, layout=torch.strided, device=None, requires_grad=False)[source]

Creates a tensor of size size filled with fill_value. The tensor’s dtype is inferred from fill_value. The tensor is allocated on the Iris symmetric heap.

Parameters:
  • size (int...) – a list, tuple, or torch.Size of integers defining the shape of the output tensor.

  • fill_value (Scalar) – the value to fill the output tensor with.

Keyword Arguments:
  • out (Tensor, optional) – the output tensor.

  • dtype (torch.dtype, optional) – the desired data type of returned tensor. Default: if None, uses a global default (see torch.set_default_dtype()).

  • layout (torch.layout, optional) – the desired layout of returned Tensor. Default: torch.strided. Note: Iris tensors always use torch.strided regardless of this parameter.

  • device (torch.device, optional) – the desired device of returned tensor. Default: if None, uses the current device for the default tensor type.

  • requires_grad (bool, optional) – If autograd should record operations on the returned tensor. Default: False.

Example

>>> ctx = iris_gluon.iris(1 << 20)
>>> tensor = ctx.full((2, 3), 3.14)
>>> print(tensor.shape)  # torch.Size([2, 3])
>>> print(tensor[0])  # tensor([3.1400, 3.1400, 3.1400], device='cuda:0')
zeros_like(input, *, dtype=None, layout=None, device=None, requires_grad=False, memory_format=torch.preserve_format)[source]

Returns a tensor filled with the scalar value 0, with the same size as input, allocated on the Iris symmetric heap.

Parameters:

input (Tensor) – the size of input will determine size of the output tensor.

Keyword Arguments:
  • dtype (torch.dtype, optional) – the desired data type of returned Tensor. Default: if None, defaults to the dtype of input.

  • layout (torch.layout, optional) – the desired layout of returned tensor. Default: if None, defaults to the layout of input. Note: Iris tensors are always contiguous (strided).

  • device (torch.device, optional) – the desired device of returned tensor. Default: if None, defaults to the device of input. Must be compatible with this Iris instance.

  • requires_grad (bool, optional) – If autograd should record operations on the returned tensor. Default: False.

  • memory_format (torch.memory_format, optional) – the desired memory format of returned Tensor. Default: torch.preserve_format.

Example

>>> ctx = iris_gluon.iris(1 << 20)
>>> input_tensor = ctx.ones(2, 3)
>>> zeros_tensor = ctx.zeros_like(input_tensor)
>>> print(zeros_tensor.shape)  # torch.Size([2, 3])
iris(heap_size=1073741824)[source]

Create and return a Gluon-based Iris instance with the specified heap size.

Parameters:

heap_size (int) – Size of the heap in bytes. Defaults to 1GB.

Returns:

An initialized Gluon-based Iris instance

Return type:

IrisGluon

Example

>>> import iris.iris_gluon as iris_gl
>>> ctx = iris_gl.iris(2**30)  # 1GB heap
>>> backend = ctx.get_backend()
>>> tensor = ctx.zeros(1024, 1024)

Initialization

initialize

IrisDeviceCtx.initialize(context_tensor)[source]

Initialize IrisDeviceCtx from the encoded tensor.

The context tensor has the format: [cur_rank, num_ranks, heap_base_0, heap_base_1, …]

Parameters:

context_tensor – Pointer to encoded context data

Returns:

Initialized device context

Return type:

IrisDeviceCtx
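
The documented layout can be illustrated on the host with plain Python. This is only a sketch of the format described above, not the Gluon implementation; the helper `decode_context`, the `DecodedContext` type, and the example addresses are hypothetical.

```python
from typing import List, NamedTuple


class DecodedContext(NamedTuple):
    cur_rank: int
    num_ranks: int
    heap_bases: List[int]


def decode_context(encoded: List[int]) -> DecodedContext:
    """Split [cur_rank, num_ranks, heap_base_0, ...] into named fields."""
    if len(encoded) < 2:
        raise ValueError("context needs at least cur_rank and num_ranks")
    cur_rank, num_ranks = encoded[0], encoded[1]
    heap_bases = list(encoded[2:2 + num_ranks])
    if len(heap_bases) != num_ranks:
        raise ValueError("expected one heap base per rank")
    return DecodedContext(cur_rank, num_ranks, heap_bases)


# Made-up values: rank 1 of 2, with illustrative heap base addresses.
example = decode_context([1, 2, 0x7F0000000000, 0x7F1000000000])
```

The length check mirrors the implied invariant that the tensor carries exactly one heap base per rank.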

Memory transfer operations

load

IrisDeviceCtx.load(pointer, from_rank, mask=None)[source]

Loads a value from the specified rank’s memory location to the current rank.

Parameters:
  • pointer – Pointer in the from_rank’s address space

  • from_rank – The rank ID from which to read the data

  • mask – Optional mask for conditional loading

Returns:

The loaded value from the target memory location

Example

>>> # Load from rank 1 to current rank
>>> data = ctx.load(buffer + offsets, 1, mask=mask)

store

IrisDeviceCtx.store(pointer, value, to_rank, mask=None)[source]

Writes data from the current rank to the specified rank’s memory location.

Parameters:
  • pointer – Pointer in the current rank’s address space

  • value – The value to store

  • to_rank – The rank ID to which the data will be written

  • mask – Optional mask for conditional storing

Example

>>> # Store from current rank to rank 1
>>> ctx.store(buffer + offsets, values, 1, mask=mask)
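
A host-side model may help show what the mask argument does. The sketch below assumes the usual per-lane convention, in which a lane writes only when its mask entry is true. `masked_store` and the buffer contents are hypothetical and stand in for the device operation.

```python
from typing import List


def masked_store(memory: List[int], offsets: List[int],
                 values: List[int], mask: List[bool]) -> None:
    """Write values[i] to memory[offsets[i]] only where mask[i] is True."""
    for offset, value, keep in zip(offsets, values, mask):
        if keep:
            memory[offset] = value


# Hypothetical remote buffer of 4 elements; the block covers 6 offsets,
# so the mask disables the two lanes that would fall out of bounds.
remote = [0, 0, 0, 0]
offsets = list(range(6))
mask = [o < len(remote) for o in offsets]
masked_store(remote, offsets, [10, 11, 12, 13, 14, 15], mask)
```

Building the mask from a bounds check, as here, is the common way to handle a block that is larger than the remaining data.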

copy

IrisDeviceCtx.copy(src_ptr, dst_ptr, from_rank, to_rank, mask=None)[source]

Copies data from the specified rank’s memory into the destination rank’s memory.

This function performs the transfer by translating src_ptr from the from_rank’s address space to the to_rank’s address space, performing a masked load from the translated source, and storing the loaded data to dst_ptr in the to_rank memory location. If from_rank and to_rank are the same, this function performs a local copy operation. It is undefined behaviour if neither from_rank nor to_rank is the cur_rank.

Parameters:
  • src_ptr – Pointer in the from_rank’s local memory from which to read data

  • dst_ptr – Pointer in the to_rank’s local memory where the data will be written

  • from_rank – The rank ID that owns src_ptr (source rank)

  • to_rank – The rank ID that will receive the data (destination rank)

  • mask – Optional mask for conditional operations

Example

>>> # Copy from rank 1 to rank 0 (current rank must be either 1 or 0)
>>> ctx.copy(remote_ptr + offsets, local_ptr + offsets, 1, 0, mask=mask)
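
The address translation mentioned above can be sketched as simple arithmetic. This assumes a symmetric heap in which an object sits at the same offset from the heap base on every rank; the `translate` helper and the base addresses are hypothetical and only illustrate the idea.

```python
from typing import List


def translate(ptr: int, from_rank: int, to_rank: int,
              heap_bases: List[int]) -> int:
    """Map an address in from_rank's heap to the same offset in to_rank's heap."""
    offset = ptr - heap_bases[from_rank]
    return heap_bases[to_rank] + offset


heap_bases = [0x1000, 0x9000]   # made-up base addresses for ranks 0 and 1
ptr_on_rank1 = 0x9000 + 64      # 64 bytes into rank 1's heap
ptr_on_rank0 = translate(ptr_on_rank1, from_rank=1, to_rank=0,
                         heap_bases=heap_bases)
```

When from_rank equals to_rank the offset is added back to the same base, so the pointer is unchanged, which matches the local-copy case described above.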

get

IrisDeviceCtx.get(from_ptr, to_ptr, from_rank, mask=None)[source]

Copies data from the specified rank’s memory to the current rank’s local memory.

Parameters:
  • from_ptr – Pointer to remote memory in from_rank’s address space

  • to_ptr – Pointer to local memory in current rank

  • from_rank – The rank ID from which to read the data

  • mask – Optional mask for conditional operations

Example

>>> # Copy from rank 1 to current rank's local memory
>>> ctx.get(remote_ptr + offsets, local_ptr + offsets, 1, mask=mask)

put

IrisDeviceCtx.put(from_ptr, to_ptr, to_rank, mask=None)[source]

Copies data from the current rank’s local memory to the specified rank’s memory.

Parameters:
  • from_ptr – Pointer to local memory in current rank

  • to_ptr – Pointer to remote memory in to_rank’s address space

  • to_rank – The rank ID to which the data will be written

  • mask – Optional mask for conditional operations

Example

>>> # Copy from current rank's local memory to rank 1
>>> ctx.put(local_ptr + offsets, remote_ptr + offsets, 1, mask=mask)
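
The direction of get and put is easy to mix up, so here is a small host-side model. It simulates two ranks with Python lists and uses element offsets instead of pointers; the functions are hypothetical stand-ins, not the Iris API.

```python
from typing import Dict, List

# Two simulated ranks, each with its own small heap (made-up contents).
heaps: Dict[int, List[int]] = {0: [1, 2, 3, 4], 1: [0, 0, 0, 0]}


def put(cur_rank: int, from_off: int, to_off: int, to_rank: int, n: int) -> None:
    """Copy n elements from cur_rank's heap into to_rank's heap."""
    data = heaps[cur_rank][from_off:from_off + n]   # local load
    heaps[to_rank][to_off:to_off + n] = data        # remote store


def get(cur_rank: int, from_off: int, to_off: int, from_rank: int, n: int) -> None:
    """Copy n elements from from_rank's heap into cur_rank's heap."""
    data = heaps[from_rank][from_off:from_off + n]  # remote load
    heaps[cur_rank][to_off:to_off + n] = data       # local store


put(cur_rank=0, from_off=0, to_off=2, to_rank=1, n=2)    # rank 0 pushes [1, 2]
get(cur_rank=0, from_off=2, to_off=3, from_rank=1, n=1)  # rank 0 pulls one element
```

In both calls the current rank initiates the transfer; only the side that is remote changes.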

Atomic operations

atomic_add

IrisDeviceCtx.atomic_add(pointer, val, to_rank, mask=None, sem=None, scope=None)[source]

Performs an atomic add at the specified rank’s memory location.

Parameters:
  • pointer – The memory location in the current rank’s address space

  • val – The value to add

  • to_rank – The rank ID to which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Atomically add to rank 1's memory
>>> old_val = ctx.atomic_add(buffer, 5, 1)
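
The read-modify-write atomics on this page all return the value that was stored before the operation. The sketch below models that contract on the host, with a lock standing in for hardware atomicity; `fetch_and_op` and `worker` are hypothetical names, and nothing here reflects how the device code is implemented.

```python
import operator
import threading
from typing import Callable, List

_lock = threading.Lock()  # stands in for hardware atomicity


def fetch_and_op(memory: List[int], index: int, val: int,
                 op: Callable[[int, int], int]) -> int:
    """Apply op(old, val) to memory[index] atomically and return old."""
    with _lock:
        old = memory[index]
        memory[index] = op(old, val)
        return old


counter = [0]


def worker() -> None:
    # Each thread performs 1000 atomic increments on the shared counter.
    for _ in range(1000):
        fetch_and_op(counter, 0, 1, operator.add)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All increments are visible, and the call returns the pre-update value.
old = fetch_and_op(counter, 0, 5, operator.add)
```

Passing `operator.sub`, `operator.and_`, `operator.or_`, `operator.xor`, `min`, or `max` as `op` models the other operations in the same way.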

atomic_sub

IrisDeviceCtx.atomic_sub(pointer, val, to_rank, mask=None, sem=None, scope=None)[source]

Atomically subtracts data from the specified rank’s memory location.

Parameters:
  • pointer – Pointer in the current rank’s address space

  • val – The value to subtract

  • to_rank – The rank ID to which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Atomically subtract from rank 1's memory
>>> old_val = ctx.atomic_sub(buffer, 3, 1)

atomic_cas

IrisDeviceCtx.atomic_cas(pointer, cmp, val, to_rank, sem=None, scope=None)[source]

Atomically compares the value at the specified rank’s memory location with cmp and, if they are equal, replaces it with val.

Parameters:
  • pointer – Pointer in the current rank’s address space

  • cmp – The expected value to compare

  • val – The new value to write if comparison succeeds

  • to_rank – The rank ID to which the atomic operation will be performed

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Compare-and-swap on rank 1's memory
>>> old_val = ctx.atomic_cas(flag + pid, 0, 1, 1, sem="release", scope="sys")
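
Compare-and-swap is typically used to claim a flag: the caller succeeded if the returned value equals the expected one. The following host-side sketch models that pattern under the assumption of standard CAS semantics; `compare_and_swap` is a hypothetical helper and a lock stands in for hardware atomicity.

```python
import threading
from typing import List

_lock = threading.Lock()  # stands in for hardware atomicity


def compare_and_swap(memory: List[int], index: int, cmp: int, val: int) -> int:
    """Write val to memory[index] only if it currently equals cmp; return old."""
    with _lock:
        old = memory[index]
        if old == cmp:
            memory[index] = val
        return old


flag = [0]                                # 0 = free, 1 = taken
first = compare_and_swap(flag, 0, 0, 1)   # flag was 0, so it becomes 1
second = compare_and_swap(flag, 0, 0, 1)  # flag is already 1, so no write
acquired_first = first == 0
acquired_second = second == 0
```

Only the first caller observes the expected value 0, so only that caller treats the flag as acquired.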

atomic_xchg

IrisDeviceCtx.atomic_xchg(pointer, val, to_rank, mask=None, sem=None, scope=None)[source]

Performs an atomic exchange at the specified rank’s memory location.

Parameters:
  • pointer – The memory location in the current rank’s address space

  • val – The value to exchange

  • to_rank – The rank ID to which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Exchange value with rank 1's memory
>>> old_val = ctx.atomic_xchg(buffer, 99, 1)

atomic_xor

IrisDeviceCtx.atomic_xor(pointer, val, to_rank, mask=None, sem=None, scope=None)[source]

Performs an atomic bitwise XOR at the specified rank’s memory location.

Parameters:
  • pointer – The memory location in the current rank’s address space

  • val – The value to XOR with the stored value

  • to_rank – The rank ID to which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Atomically XOR with rank 1's memory
>>> old_val = ctx.atomic_xor(buffer, 0xFF, 1)

atomic_and

IrisDeviceCtx.atomic_and(pointer, val, to_rank, mask=None, sem=None, scope=None)[source]

Performs an atomic bitwise AND at the specified rank’s memory location.

Parameters:
  • pointer – The memory location in the current rank’s address space

  • val – The value to AND with the stored value

  • to_rank – The rank ID to which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Atomically AND with rank 1's memory
>>> old_val = ctx.atomic_and(buffer, 0x0F, 1)

atomic_or

IrisDeviceCtx.atomic_or(pointer, val, to_rank, mask=None, sem=None, scope=None)[source]

Performs an atomic bitwise OR at the specified rank’s memory location.

Parameters:
  • pointer – The memory location in the current rank’s address space

  • val – The value to OR with the stored value

  • to_rank – The rank ID to which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Atomically OR with rank 1's memory
>>> old_val = ctx.atomic_or(buffer, 0xF0, 1)

atomic_min

IrisDeviceCtx.atomic_min(pointer, val, to_rank, mask=None, sem=None, scope=None)[source]

Performs an atomic min at the specified rank’s memory location.

Parameters:
  • pointer – The memory location in the current rank’s address space

  • val – The value to compare and potentially store

  • to_rank – The rank ID to which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Atomically compute minimum with rank 1's memory
>>> old_val = ctx.atomic_min(buffer, 10, 1)

atomic_max

IrisDeviceCtx.atomic_max(pointer, val, to_rank, mask=None, sem=None, scope=None)[source]

Performs an atomic max at the specified rank’s memory location.

Parameters:
  • pointer – The memory location in the current rank’s address space

  • val – The value to compare and potentially store

  • to_rank – The rank ID to which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Atomically compute maximum with rank 1's memory
>>> old_val = ctx.atomic_max(buffer, 100, 1)