Device Functions

Warning

The Gluon API is experimental and may undergo breaking changes in future releases.

This page documents the device-side functions that Iris Gluon provides for remote memory operations and atomics. These methods belong to the IrisDeviceCtx aggregate, which is used inside Gluon kernels.

Initialization

initialize

static IrisDeviceCtx.initialize(context_tensor)

Initialize IrisDeviceCtx from the encoded tensor.

The context tensor has the format: [cur_rank, num_ranks, heap_base_0, heap_base_1, …]

Parameters:

context_tensor – Pointer to encoded context data

Returns:

Initialized device context

Return type:

IrisDeviceCtx
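
To make the layout above concrete, the following host-side sketch splits a plain Python list with the same ordering into named fields. DecodedContext and decode_context are hypothetical names used only for this illustration, not part of Iris, and the sketch assumes the tensor carries exactly one heap base per rank.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class DecodedContext:
    """Hypothetical host-side stand-in; this is not IrisDeviceCtx."""
    cur_rank: int
    num_ranks: int
    heap_bases: List[int]


def decode_context(encoded: Sequence[int]) -> DecodedContext:
    """Split [cur_rank, num_ranks, heap_base_0, ...] into named fields."""
    if len(encoded) < 2:
        raise ValueError("need at least cur_rank and num_ranks")
    cur_rank, num_ranks = int(encoded[0]), int(encoded[1])
    # Assumption: one heap base per rank follows the two header entries.
    heap_bases = [int(b) for b in encoded[2:2 + num_ranks]]
    if len(heap_bases) != num_ranks:
        raise ValueError("expected one heap base per rank")
    if not 0 <= cur_rank < num_ranks:
        raise ValueError("cur_rank out of range")
    return DecodedContext(cur_rank, num_ranks, heap_bases)


# Made-up values: rank 1 of 2, with two invented heap base addresses.
ctx = decode_context([1, 2, 0x1000, 0x9000])
print(ctx.cur_rank, ctx.num_ranks, [hex(b) for b in ctx.heap_bases])
```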

Memory transfer operations

load

IrisDeviceCtx.load(pointer, from_rank, mask=None)

Loads a value from the specified rank’s memory location to the current rank.

Parameters:
  • pointer – Pointer in the from_rank’s address space

  • from_rank – The rank ID from which to read the data

  • mask – Optional mask for conditional loading

Returns:

The loaded value from the target memory location

Example

>>> # Load from rank 1 to current rank
>>> data = ctx.load(buffer + offsets, 1, mask=mask)

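The role of mask in a remote load can be pictured with a small pure-Python model. This is only a sketch under stated assumptions: each rank's memory is a Python list, pointers are replaced by integer offsets, and model_load is a hypothetical helper rather than an Iris function. Lanes that are masked off are reported as None here to mark their value as unspecified; what the real operation returns for those lanes is not stated in this reference.

```python
from typing import Dict, List, Optional, Sequence


def model_load(heaps: Dict[int, List[int]], offsets: Sequence[int], from_rank: int,
               mask: Optional[Sequence[bool]] = None) -> List[Optional[int]]:
    """Read heaps[from_rank][offset] for every lane whose mask bit is set."""
    if mask is None:
        mask = [True] * len(offsets)
    # Masked-off lanes never touch memory; None marks their result as unspecified.
    return [heaps[from_rank][o] if m else None for o, m in zip(offsets, mask)]


heaps = {0: [0, 0, 0, 0], 1: [10, 11, 12, 13]}
offsets = [2, 3, 4, 5]            # the last two lanes fall outside the buffer
mask = [o < 4 for o in offsets]   # guard the out-of-range lanes
data = model_load(heaps, offsets, from_rank=1, mask=mask)
print(data)  # [12, 13, None, None]
```

Without the mask, the last two lanes would index past the end of the list, which is the host-side analogue of an out-of-bounds access.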
store

IrisDeviceCtx.store(pointer, value, to_rank, mask=None)

Writes data from the current rank to the specified rank’s memory location.

Parameters:
  • pointer – Pointer in the current rank’s address space

  • value – The value to store

  • to_rank – The rank ID to which the data will be written

  • mask – Optional mask for conditional storing

Example

>>> # Store from current rank to rank 1
>>> ctx.store(buffer + offsets, values, 1, mask=mask)

copy

IrisDeviceCtx.copy(src_ptr, dst_ptr, from_rank, to_rank, mask=None)

Copies data from the specified rank’s memory into the destination rank’s memory.

The transfer proceeds in three steps: src_ptr is translated from the from_rank’s address space to the to_rank’s address space, a masked load reads from the translated source, and the loaded data is stored to dst_ptr in the to_rank’s memory. If from_rank and to_rank are the same, the call performs a local copy. The behaviour is undefined if neither from_rank nor to_rank is the current rank (cur_rank).

Parameters:
  • src_ptr – Pointer in the from_rank’s local memory from which to read data

  • dst_ptr – Pointer in the to_rank’s local memory where the data will be written

  • from_rank – The rank ID that owns src_ptr (source rank)

  • to_rank – The rank ID that will receive the data (destination rank)

  • mask – Optional mask for conditional operations

Example

>>> # Copy from rank 1 to rank 0 (current rank must be either 1 or 0)
>>> ctx.copy(remote_ptr + offsets, local_ptr + offsets, 1, 0, mask=mask)

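The pointer translation mentioned in the description of copy can be sketched as plain integer arithmetic. The sketch assumes a symmetric-heap layout in which an object sits at the same offset from the heap base on every rank; that layout, the translate helper, and the base addresses below are assumptions made for illustration, not details confirmed by this reference.

```python
from typing import Sequence


def translate(ptr: int, from_rank: int, to_rank: int, heap_bases: Sequence[int]) -> int:
    """Rebase ptr from from_rank's heap onto to_rank's heap, keeping the offset."""
    offset = ptr - heap_bases[from_rank]   # position inside the source heap
    return heap_bases[to_rank] + offset    # same position inside the target heap


heap_bases = [0x1000, 0x9000]    # made-up base addresses for ranks 0 and 1
ptr_on_rank1 = 0x9000 + 0x40     # an address 0x40 bytes into rank 1's heap
ptr_on_rank0 = translate(ptr_on_rank1, from_rank=1, to_rank=0, heap_bases=heap_bases)
print(hex(ptr_on_rank0))  # 0x1040

# Translating within a single rank leaves the address unchanged.
same = translate(ptr_on_rank0, from_rank=0, to_rank=0, heap_bases=heap_bases)
```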
get

IrisDeviceCtx.get(from_ptr, to_ptr, from_rank, mask=None)

Copies data from the specified rank’s memory to the current rank’s local memory.

Parameters:
  • from_ptr – Pointer to remote memory in from_rank’s address space

  • to_ptr – Pointer to local memory in current rank

  • from_rank – The rank ID from which to read the data

  • mask – Optional mask for conditional operations

Example

>>> # Copy from rank 1 to current rank's local memory
>>> ctx.get(remote_ptr + offsets, local_ptr + offsets, 1, mask=mask)

put

IrisDeviceCtx.put(from_ptr, to_ptr, to_rank, mask=None)

Copies data from the current rank’s local memory to the specified rank’s memory.

Parameters:
  • from_ptr – Pointer to local memory in current rank

  • to_ptr – Pointer to remote memory in to_rank’s address space

  • to_rank – The rank ID to which the data will be written

  • mask – Optional mask for conditional operations

Example

>>> # Copy from current rank's local memory to rank 1
>>> ctx.put(local_ptr + offsets, remote_ptr + offsets, 1, mask=mask)

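The direction of get and put is easy to mix up, so the following pure-Python model may help. It is a sketch, not the Iris implementation: HeapModel is a hypothetical class, each rank's memory is a list, and pointers are replaced by integer offsets.

```python
from typing import List, Optional, Sequence


class HeapModel:
    """Hypothetical stand-in for per-rank heaps; not the Iris API."""

    def __init__(self, cur_rank: int, num_ranks: int, size: int) -> None:
        self.cur_rank = cur_rank
        self.heaps: List[List[int]] = [[0] * size for _ in range(num_ranks)]

    def _copy(self, src_rank: int, src_offsets: Sequence[int], dst_rank: int,
              dst_offsets: Sequence[int], mask: Optional[Sequence[bool]]) -> None:
        if mask is None:
            mask = [True] * len(src_offsets)
        for src, dst, keep in zip(src_offsets, dst_offsets, mask):
            if keep:  # masked-off lanes are skipped entirely
                self.heaps[dst_rank][dst] = self.heaps[src_rank][src]

    def put(self, from_offsets, to_offsets, to_rank, mask=None) -> None:
        """Current rank's memory to to_rank's memory."""
        self._copy(self.cur_rank, from_offsets, to_rank, to_offsets, mask)

    def get(self, from_offsets, to_offsets, from_rank, mask=None) -> None:
        """from_rank's memory to the current rank's memory."""
        self._copy(from_rank, from_offsets, self.cur_rank, to_offsets, mask)


model = HeapModel(cur_rank=0, num_ranks=2, size=4)
model.heaps[0] = [1, 2, 3, 4]
offsets = [0, 1, 2, 3]
model.put(offsets, offsets, to_rank=1, mask=[True, True, False, False])
print(model.heaps[1])  # [1, 2, 0, 0]

model.heaps[1][3] = 9
model.get([3], [0], from_rank=1)
print(model.heaps[0])  # [9, 2, 3, 4]
```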
Atomic operations

atomic_add

IrisDeviceCtx.atomic_add(pointer, val, to_rank, mask=None, sem=None, scope=None)

Performs an atomic add at the specified rank’s memory location.

Parameters:
  • pointer – The memory location in the current rank’s address space

  • val – The value to add

  • to_rank – The rank ID on which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Atomically add to rank 1's memory
>>> old_val = ctx.atomic_add(buffer, 5, 1)

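Because atomic_add returns the value held before the addition, concurrent callers each observe a distinct old value. The host-side sketch below imitates that fetch-and-add behaviour with Python threads. AtomicCell is a hypothetical class, the lock merely stands in for hardware atomicity, and the sem and scope parameters are not modelled.

```python
import threading
from typing import List


class AtomicCell:
    """Hypothetical model of one memory location with fetch-and-add."""

    def __init__(self, value: int = 0) -> None:
        self._value = value
        self._lock = threading.Lock()

    def fetch_add(self, val: int) -> int:
        with self._lock:          # read-modify-write happens as one step
            old = self._value
            self._value = old + val
            return old            # value before the addition

    def read(self) -> int:
        with self._lock:
            return self._value


counter = AtomicCell(0)
tickets: List[int] = []
tickets_lock = threading.Lock()


def worker() -> None:
    old = counter.fetch_add(1)
    with tickets_lock:
        tickets.append(old)


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(tickets))   # every worker saw a different old value
print(counter.read())
```

This is the property that makes a fetch-and-add usable as a ticket or slot allocator shared between ranks.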
atomic_sub

IrisDeviceCtx.atomic_sub(pointer, val, to_rank, mask=None, sem=None, scope=None)

Atomically subtracts val from the value at the specified rank’s memory location.

Parameters:
  • pointer – Pointer in the current rank’s address space

  • val – The value to subtract

  • to_rank – The rank ID on which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Atomically subtract from rank 1's memory
>>> old_val = ctx.atomic_sub(buffer, 3, 1)

atomic_cas

IrisDeviceCtx.atomic_cas(pointer, cmp, val, to_rank, sem=None, scope=None)

Atomically compares the value at the specified rank’s memory location with cmp and, if they are equal, replaces it with val.

Parameters:
  • pointer – Pointer in the current rank’s address space

  • cmp – The expected value to compare

  • val – The new value to write if comparison succeeds

  • to_rank – The rank ID on which the atomic operation will be performed

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Compare-and-swap on rank 1's memory
>>> old_val = ctx.atomic_cas(flag + pid, 0, 1, 1, sem="release", scope="sys")

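Since atomic_cas returns the previous value, a caller can tell whether its swap took effect by comparing that return value with cmp. The sequential model below illustrates the pattern. model_cas is a hypothetical helper operating on a Python list; the real operation is indivisible, which this single-threaded sketch does not need to reproduce.

```python
from typing import List


def model_cas(memory: List[int], index: int, cmp: int, val: int) -> int:
    """Write val only if memory[index] equals cmp; always return the old value."""
    old = memory[index]
    if old == cmp:
        memory[index] = val
    return old


flag = [0]                           # 0 means free, 1 means taken

first = model_cas(flag, 0, 0, 1)     # flag was 0, so the swap happens
second = model_cas(flag, 0, 0, 1)    # flag is already 1, so nothing changes

print(first == 0)    # True: the first caller acquired the flag
print(second == 0)   # False: the second caller did not
```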
atomic_xchg

IrisDeviceCtx.atomic_xchg(pointer, val, to_rank, mask=None, sem=None, scope=None)

Performs an atomic exchange at the specified rank’s memory location.

Parameters:
  • pointer – The memory location in the current rank’s address space

  • val – The value to exchange

  • to_rank – The rank ID on which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Exchange value with rank 1's memory
>>> old_val = ctx.atomic_xchg(buffer, 99, 1)

atomic_xor

IrisDeviceCtx.atomic_xor(pointer, val, to_rank, mask=None, sem=None, scope=None)

Performs an atomic XOR at the specified rank’s memory location.

Parameters:
  • pointer – The memory location in the current rank’s address space

  • val – The value to XOR with the stored value

  • to_rank – The rank ID on which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Atomically XOR with rank 1's memory
>>> old_val = ctx.atomic_xor(buffer, 0xFF, 1)

atomic_and

IrisDeviceCtx.atomic_and(pointer, val, to_rank, mask=None, sem=None, scope=None)

Performs an atomic AND at the specified rank’s memory location.

Parameters:
  • pointer – The memory location in the current rank’s address space

  • val – The value to AND with the stored value

  • to_rank – The rank ID on which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Atomically AND with rank 1's memory
>>> old_val = ctx.atomic_and(buffer, 0x0F, 1)

atomic_or

IrisDeviceCtx.atomic_or(pointer, val, to_rank, mask=None, sem=None, scope=None)

Performs an atomic OR at the specified rank’s memory location.

Parameters:
  • pointer – The memory location in the current rank’s address space

  • val – The value to OR with the stored value

  • to_rank – The rank ID on which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Atomically OR with rank 1's memory
>>> old_val = ctx.atomic_or(buffer, 0xF0, 1)

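The bitwise atomics (atomic_xor, atomic_and, atomic_or) all follow the same fetch-and-modify shape: combine the stored value with val, write the result back, and return the old value. The sequential sketch below walks one memory cell through the three operand values used in the examples. model_fetch_op is a hypothetical helper, and the starting value 0xAA is invented for illustration.

```python
import operator
from typing import Callable, List


def model_fetch_op(memory: List[int], index: int, val: int,
                   op: Callable[[int, int], int]) -> int:
    """Apply op(stored, val) in place and return the value stored before."""
    old = memory[index]
    memory[index] = op(old, val)
    return old


mem = [0xAA]                                           # invented starting value

old_xor = model_fetch_op(mem, 0, 0xFF, operator.xor)   # 0xAA ^ 0xFF leaves 0x55
old_and = model_fetch_op(mem, 0, 0x0F, operator.and_)  # 0x55 & 0x0F leaves 0x05
old_or = model_fetch_op(mem, 0, 0xF0, operator.or_)    # 0x05 | 0xF0 leaves 0xF5

print(hex(old_xor), hex(old_and), hex(old_or), hex(mem[0]))
# 0xaa 0x55 0x5 0xf5
```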
atomic_min

IrisDeviceCtx.atomic_min(pointer, val, to_rank, mask=None, sem=None, scope=None)

Performs an atomic min at the specified rank’s memory location.

Parameters:
  • pointer – The memory location in the current rank’s address space

  • val – The value to compare and potentially store

  • to_rank – The rank ID on which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Atomically compute minimum with rank 1's memory
>>> old_val = ctx.atomic_min(buffer, 10, 1)

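For atomic_min and its counterpart atomic_max, the stored value changes only when val wins the comparison, while the return value is always the previous contents. A sequential sketch with hypothetical helpers model_fetch_min and model_fetch_max, and an invented starting value of 50:

```python
from typing import List


def model_fetch_min(memory: List[int], index: int, val: int) -> int:
    """Store min(stored, val) and return the value stored before."""
    old = memory[index]
    memory[index] = min(old, val)
    return old


def model_fetch_max(memory: List[int], index: int, val: int) -> int:
    """Store max(stored, val) and return the value stored before."""
    old = memory[index]
    memory[index] = max(old, val)
    return old


mem = [50]                                  # invented starting value

old_a = model_fetch_min(mem, 0, 10)         # 10 < 50, so memory becomes 10
old_b = model_fetch_max(mem, 0, 100)        # 100 > 10, so memory becomes 100
old_c = model_fetch_min(mem, 0, 500)        # 500 > 100, so memory is unchanged

print(old_a, old_b, old_c, mem[0])  # 50 10 100 100
```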
atomic_max

IrisDeviceCtx.atomic_max(pointer, val, to_rank, mask=None, sem=None, scope=None)

Performs an atomic max at the specified rank’s memory location.

Parameters:
  • pointer – The memory location in the current rank’s address space

  • val – The value to compare and potentially store

  • to_rank – The rank ID on which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Atomically compute maximum with rank 1's memory
>>> old_val = ctx.atomic_max(buffer, 100, 1)