Device Functions

Warning

The Gluon API is experimental and may undergo breaking changes in future releases.

Device-side functions provided by Iris Gluon for remote memory operations and atomics. These methods are part of the IrisDeviceCtx aggregate used within Gluon kernels.

Initialization

initialize

static IrisDeviceCtx.initialize(context_tensor, tracing=False)

Initialize IrisDeviceCtx from the encoded tensor.

The context tensor has the format: [cur_rank, num_ranks, heap_base_0, heap_base_1, ..., trace_info...]

If tracing is enabled on the host (via shmem.tracing.enable()), the context tensor also contains tracing buffer pointers after the heap bases.

Parameters:
  • context_tensor – Pointer to encoded context data

  • tracing (constexpr) – Enable event tracing (default: False)

Returns:

Initialized device context

Return type:

IrisDeviceCtx
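
The context tensor layout described above can be illustrated with a small host-side sketch. It is plain Python, not Iris code: decode_context and the sample values are hypothetical, and the real tensor is decoded on the device by IrisDeviceCtx.initialize. The sketch assumes only the documented ordering, with one heap base per rank following the two header fields.

```python
def decode_context(context, tracing=False):
    """Split a flat context list into its documented fields (illustrative only)."""
    cur_rank = context[0]
    num_ranks = context[1]
    # One heap base per rank follows the two header fields.
    heap_bases = context[2:2 + num_ranks]
    # Anything after the heap bases is tracing data, read only when tracing is on.
    trace_info = context[2 + num_ranks:] if tracing else []
    return cur_rank, num_ranks, heap_bases, trace_info


# Hypothetical values: rank 1 of 2, with made-up heap base addresses.
example = [1, 2, 0x1000, 0x2000]
cur_rank, num_ranks, heap_bases, trace_info = decode_context(example)
```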

Memory transfer operations

load

IrisDeviceCtx.load(pointer, from_rank, mask=None, other=None, cache_modifier=None, volatile=False)

Loads a value from the specified rank’s memory location to the current rank.

Parameters:
  • pointer – Pointer in the from_rank’s address space

  • from_rank – The rank ID from which to read the data

  • mask – Optional mask for conditional loading

  • other – Value to return for masked-out elements. If not provided, the result for masked-out elements is undefined.

  • cache_modifier (str, optional) –

    Controls cache behavior of the load.

    Supported values:
    • None (default): Same as ".ca". Uses cache at all levels (CU, L2, LLC) with LRU policy.

    • ".ca": Cache at all levels (CU, L2, LLC) with LRU policy.

    • ".cg": Bypasses the CU (L1) cache, streams through L2, and may hit in LLC, but the line is not retained or inserted.

    • ".cv": Bypasses all GPU caches (CU and L2) and fetches directly from system memory. If the data exists in the LLC, it may hit, but the line is not retained or inserted. Ensures global coherence by invalidating stale GPU cache lines.

  • volatile (bool, optional) – If True, disables compiler optimizations that could reorder or eliminate the load. Defaults to False.

Returns:

The loaded value from the target memory location

Example

>>> # Load from rank 1 to current rank
>>> data = ctx.load(buffer + offsets, 1, mask=mask)
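
The interaction between mask and other can be modelled on the host. The following is a plain-Python sketch rather than Gluon code; masked_load and its list-based memory are hypothetical stand-ins. It assumes that a masked-out lane performs no memory access and yields other, and it uses None to stand in for the undefined result when other is omitted.

```python
def masked_load(memory, offsets, mask=None, other=None):
    """Host-side model of a masked load over a list standing in for memory."""
    if mask is None:
        mask = [True] * len(offsets)
    result = []
    for offset, active in zip(offsets, mask):
        # Masked-out lanes never touch memory, so an out-of-range offset is harmless.
        result.append(memory[offset] if active else other)
    return result


remote = [10, 20, 30]           # pretend this is rank 1's buffer
offsets = [0, 1, 2, 3]          # the last offset is out of bounds
mask = [o < len(remote) for o in offsets]
data = masked_load(remote, offsets, mask=mask, other=0)
```

Supplying other gives every lane a defined value, which is usually what a later reduction or comparison over data needs.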

store

IrisDeviceCtx.store(pointer, value, to_rank, mask=None, cache_modifier=None)

Writes data from the current rank to the specified rank’s memory location.

Parameters:
  • pointer – Pointer in the current rank’s address space

  • value – The value to store

  • to_rank – The rank ID to which the data will be written

  • mask – Optional mask for conditional storing

  • cache_modifier (str, optional) –

    Controls cache behavior of the store. Supported values are:

    • None (default): Same as ".wb". Uses write-back caching at all levels (CU, L2, LLC) with LRU policy.

    • ".wb": Write-back. Write-allocate on L1 miss, inserted into caches and written back later.

    • ".cg": Cache Global. Equivalent to ".wb": stored through L1 → L2 → LLC under LRU.

    • ".cs": Cache Streaming. Bypasses L1, streamed through L2, not retained in LLC.

    • ".wt": Write-Through. Bypasses L1 and L2 (coherent cache bypass), may hit in LLC with LRU.

Example

>>> # Store from current rank to rank 1
>>> ctx.store(buffer + offsets, values, 1, mask=mask)
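
A masked store can be modelled the same way on the host. This sketch is illustrative only: masked_store and the heaps dictionary of per-rank lists are hypothetical, and the model assumes that masked-out lanes leave the destination memory untouched.

```python
def masked_store(heaps, to_rank, offsets, values, mask=None):
    """Host-side model: write values into to_rank's buffer where mask is true."""
    if mask is None:
        mask = [True] * len(offsets)
    target = heaps[to_rank]
    for offset, value, active in zip(offsets, values, mask):
        if active:
            target[offset] = value  # masked-out lanes leave memory untouched


# Two hypothetical ranks, each with a four-element buffer.
heaps = {0: [0, 0, 0, 0], 1: [0, 0, 0, 0]}
masked_store(heaps, 1, [0, 1, 2, 3], [7, 8, 9, 10], mask=[True, True, False, True])
```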

copy

IrisDeviceCtx.copy(src_ptr, dst_ptr, from_rank, to_rank, mask=None, other=None, load_cache_modifier=None, store_cache_modifier=None)

Copies data from the specified rank’s memory into the destination rank’s memory.

This function performs the transfer by translating src_ptr from the from_rank’s address space to the to_rank’s address space, performing a masked load from the translated source, and storing the loaded data to dst_ptr in the to_rank memory location. If from_rank and to_rank are the same, this function performs a local copy operation. It is undefined behaviour if neither from_rank nor to_rank is the cur_rank.

Parameters:
  • src_ptr – Pointer in the from_rank’s local memory from which to read data

  • dst_ptr – Pointer in the to_rank’s local memory where the data will be written

  • from_rank – The rank ID that owns src_ptr (source rank)

  • to_rank – The rank ID that will receive the data (destination rank)

  • mask – Optional mask for conditional operations

  • other – Value to return for masked-out elements during the load operation. If not provided, the result for masked-out elements is undefined.

  • load_cache_modifier (str, optional) –

    Controls cache behavior of the load. Supported values are:

    • None (default): Same as ".ca". Uses cache at all levels (CU, L2, LLC) with LRU policy.

    • ".ca": Cache at all levels (CU, L2, LLC) with LRU policy.

    • ".cg": Bypasses the CU (L1) cache, streams through L2, and may hit in LLC, but the line is not retained or inserted.

    • ".cv": Bypasses all GPU caches (CU and L2) and fetches directly from system memory. If the data exists in the LLC, it may hit, but the line is not retained or inserted.

  • store_cache_modifier (str, optional) –

    Controls cache behavior of the store. Supported values are:

    • None (default): Same as ".wb". Uses write-back caching at all levels (CU, L2, LLC) with LRU policy.

    • ".wb": Write-back. Write-allocate on L1 miss, inserted into caches and written back later.

    • ".cg": Cache Global. Equivalent to ".wb": stored through L1 → L2 → LLC under LRU.

    • ".cs": Cache Streaming. Bypasses L1, streamed through L2, not retained in LLC.

    • ".wt": Write-Through. Bypasses L1 and L2 (coherent cache bypass), may hit in LLC with LRU.

Example

>>> # Copy from rank 1 to rank 0 (current rank must be either 1 or 0)
>>> ctx.copy(remote_ptr + offsets, local_ptr + offsets, 1, 0, mask=mask)
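
The pointer translation described above can be sketched with integer addresses. This is an illustrative host-side model, not the Iris implementation: translate is a hypothetical helper, the heap bases are made-up values, and the sketch assumes every rank lays out its heap identically, so that an offset from one rank's heap base identifies the same buffer on another rank.

```python
def translate(ptr, from_rank, to_rank, heap_bases):
    """Map an address in from_rank's heap to the matching address in to_rank's heap."""
    offset = ptr - heap_bases[from_rank]   # position relative to the source heap
    return heap_bases[to_rank] + offset    # same position in the destination heap


heap_bases = [0x1000, 0x8000]              # hypothetical bases for ranks 0 and 1
src_on_rank1 = 0x8040
src_seen_from_rank0 = translate(src_on_rank1, 1, 0, heap_bases)
```

When from_rank and to_rank are equal, the translation is the identity, which corresponds to the local-copy case.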

get

IrisDeviceCtx.get(from_ptr, to_ptr, from_rank, mask=None, other=None, load_cache_modifier=None, store_cache_modifier=None)

Copies data from the specified rank’s memory to the current rank’s local memory.

Parameters:
  • from_ptr – Pointer to remote memory in from_rank’s address space

  • to_ptr – Pointer to local memory in current rank

  • from_rank – The rank ID from which to read the data

  • mask – Optional mask for conditional operations

  • other – Value to return for masked-out elements during the load operation. If not provided, the result for masked-out elements is undefined.

  • load_cache_modifier (str, optional) –

    Controls cache behavior of the load. Supported values are:

    • None (default): Same as ".ca". Uses cache at all levels (CU, L2, LLC) with LRU policy.

    • ".ca": Cache at all levels (CU, L2, LLC) with LRU policy.

    • ".cg": Bypasses the CU (L1) cache, streams through L2, and may hit in LLC, but the line is not retained or inserted.

    • ".cv": Bypasses all GPU caches (CU and L2) and fetches directly from system memory. If the data exists in the LLC, it may hit, but the line is not retained or inserted.

  • store_cache_modifier (str, optional) –

    Controls cache behavior of the store. Supported values are:

    • None (default): Same as ".wb". Uses write-back caching at all levels (CU, L2, LLC) with LRU policy.

    • ".wb": Write-back. Write-allocate on L1 miss, inserted into caches and written back later.

    • ".cg": Cache Global. Equivalent to ".wb": stored through L1 → L2 → LLC under LRU.

    • ".cs": Cache Streaming. Bypasses L1, streamed through L2, not retained in LLC.

    • ".wt": Write-Through. Bypasses L1 and L2 (coherent cache bypass), may hit in LLC with LRU.

Example

>>> # Copy from rank 1 to current rank's local memory
>>> ctx.get(remote_ptr + offsets, local_ptr + offsets, 1, mask=mask)

put

IrisDeviceCtx.put(from_ptr, to_ptr, to_rank, mask=None, other=None, load_cache_modifier=None, store_cache_modifier=None)

Copies data from the current rank’s local memory to the specified rank’s memory.

Parameters:
  • from_ptr – Pointer to local memory in current rank

  • to_ptr – Pointer to remote memory in to_rank’s address space

  • to_rank – The rank ID to which the data will be written

  • mask – Optional mask for conditional operations

  • other – Value to return for masked-out elements during the load operation. If not provided, the result for masked-out elements is undefined.

  • load_cache_modifier (str, optional) –

    Controls cache behavior of the load. Supported values are:

    • None (default): Same as ".ca". Uses cache at all levels (CU, L2, LLC) with LRU policy.

    • ".ca": Cache at all levels (CU, L2, LLC) with LRU policy.

    • ".cg": Bypasses the CU (L1) cache, streams through L2, and may hit in LLC, but the line is not retained or inserted.

    • ".cv": Bypasses all GPU caches (CU and L2) and fetches directly from system memory. If the data exists in the LLC, it may hit, but the line is not retained or inserted.

  • store_cache_modifier (str, optional) –

    Controls cache behavior of the store. Supported values are:

    • None (default): Same as ".wb". Uses write-back caching at all levels (CU, L2, LLC) with LRU policy.

    • ".wb": Write-back. Write-allocate on L1 miss, inserted into caches and written back later.

    • ".cg": Cache Global. Equivalent to ".wb": stored through L1 → L2 → LLC under LRU.

    • ".cs": Cache Streaming. Bypasses L1, streamed through L2, not retained in LLC.

    • ".wt": Write-Through. Bypasses L1 and L2 (coherent cache bypass), may hit in LLC with LRU.

Example

>>> # Copy from current rank's local memory to rank 1
>>> ctx.put(local_ptr + offsets, remote_ptr + offsets, 1, mask=mask)
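
get and put mirror each other: each reads from one rank and writes to another, and they differ only in which side is remote. The sketch below models that on the host with plain Python lists. The functions get and put and the heaps dictionary are hypothetical stand-ins for illustration, not the Gluon methods, and the model assumes that masked-out lanes leave the destination unchanged.

```python
def get(heaps, cur_rank, from_rank, offsets, mask):
    """Model: read from_rank's buffer, write into cur_rank's buffer."""
    for offset, active in zip(offsets, mask):
        if active:
            heaps[cur_rank][offset] = heaps[from_rank][offset]


def put(heaps, cur_rank, to_rank, offsets, mask):
    """Model: read cur_rank's buffer, write into to_rank's buffer."""
    for offset, active in zip(offsets, mask):
        if active:
            heaps[to_rank][offset] = heaps[cur_rank][offset]


heaps = {0: [1, 2, 3], 1: [0, 0, 0]}                # hypothetical per-rank buffers
put(heaps, 0, 1, [0, 1, 2], [True, False, True])    # rank 0 pushes to rank 1
after_put = list(heaps[1])
heaps[0] = [9, 9, 9]
get(heaps, 1, 0, [0, 1, 2], [False, True, False])   # rank 1 pulls from rank 0
after_get = list(heaps[1])
```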

Atomic operations

atomic_add

IrisDeviceCtx.atomic_add(pointer, val, to_rank, mask=None, sem=None, scope=None)

Performs an atomic add at the specified rank’s memory location.

Parameters:
  • pointer – The memory location in the current rank’s address space

  • val – The value to add

  • to_rank – The rank ID on which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Atomically add to rank 1's memory
>>> old_val = ctx.atomic_add(buffer, 5, 1)
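
Because atomic_add returns the value held before the update, concurrent callers each observe a distinct old value. The following host-side sketch models that with Python threads and a lock. AtomicCell is a hypothetical class used only for illustration, and the lock stands in for the atomicity that the real operation provides.

```python
import threading


class AtomicCell:
    """Host-side model of one memory word that supports fetch-and-add."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def atomic_add(self, val):
        with self._lock:              # the lock models an indivisible update
            old = self._value
            self._value = old + val
            return old                # value before the add, as documented

    def read(self):
        with self._lock:
            return self._value


counter = AtomicCell(0)
tickets = []                          # old values seen by each worker


def worker():
    tickets.append(counter.atomic_add(5))


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each worker receives a unique multiple of 5, which is why a fetch-and-add is often used to hand out slots or tickets.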

atomic_sub

IrisDeviceCtx.atomic_sub(pointer, val, to_rank, mask=None, sem=None, scope=None)

Atomically subtracts a value from the value at the specified rank’s memory location.

Parameters:
  • pointer – Pointer in the current rank’s address space

  • val – The value to subtract

  • to_rank – The rank ID on which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Atomically subtract from rank 1's memory
>>> old_val = ctx.atomic_sub(buffer, 3, 1)

atomic_cas

IrisDeviceCtx.atomic_cas(pointer, cmp, val, to_rank, sem=None, scope=None)

Atomically compares the value at the specified rank’s memory location with cmp and, if they are equal, replaces it with val.

Parameters:
  • pointer – Pointer in the current rank’s address space

  • cmp – The expected value to compare against

  • val – The new value to write if the comparison succeeds

  • to_rank – The rank ID on which the atomic operation will be performed

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Compare-and-swap on rank 1's memory
>>> old_val = ctx.atomic_cas(flag + pid, 0, 1, 1, sem="release", scope="sys")
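
A common use of compare-and-swap is claiming a flag so that exactly one caller wins. The sketch below is a host-side model in plain Python, not Gluon code. This atomic_cas is a hypothetical function over a list, and the module-level lock stands in for hardware atomicity. A caller knows it won when the returned old value equals the value it expected.

```python
import threading

_lock = threading.Lock()   # stands in for hardware atomicity


def atomic_cas(memory, index, cmp, val):
    """Model: write val only if memory[index] equals cmp; always return the old value."""
    with _lock:
        old = memory[index]
        if old == cmp:
            memory[index] = val
        return old


flags = [0]                               # 0 means free, 1 means claimed
first = atomic_cas(flags, 0, 0, 1)        # succeeds: flag was 0
second = atomic_cas(flags, 0, 0, 1)       # fails: flag is already 1
claimed_first = first == 0
claimed_second = second == 0
```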

atomic_xchg

IrisDeviceCtx.atomic_xchg(pointer, val, to_rank, mask=None, sem=None, scope=None)

Performs an atomic exchange at the specified rank’s memory location.

Parameters:
  • pointer – The memory location in the current rank’s address space

  • val – The value to exchange

  • to_rank – The rank ID on which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Exchange value with rank 1's memory
>>> old_val = ctx.atomic_xchg(buffer, 99, 1)

atomic_xor

IrisDeviceCtx.atomic_xor(pointer, val, to_rank, mask=None, sem=None, scope=None)

Performs an atomic bitwise XOR at the specified rank’s memory location.

Parameters:
  • pointer – The memory location in the current rank’s address space

  • val – The value to XOR with the stored value

  • to_rank – The rank ID on which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Atomically XOR with rank 1's memory
>>> old_val = ctx.atomic_xor(buffer, 0xFF, 1)

atomic_and

IrisDeviceCtx.atomic_and(pointer, val, to_rank, mask=None, sem=None, scope=None)

Performs an atomic bitwise AND at the specified rank’s memory location.

Parameters:
  • pointer – The memory location in the current rank’s address space

  • val – The value to AND with the stored value

  • to_rank – The rank ID on which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Atomically AND with rank 1's memory
>>> old_val = ctx.atomic_and(buffer, 0x0F, 1)

atomic_or

IrisDeviceCtx.atomic_or(pointer, val, to_rank, mask=None, sem=None, scope=None)

Performs an atomic bitwise OR at the specified rank’s memory location.

Parameters:
  • pointer – The memory location in the current rank’s address space

  • val – The value to OR with the stored value

  • to_rank – The rank ID on which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Atomically OR with rank 1's memory
>>> old_val = ctx.atomic_or(buffer, 0xF0, 1)
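
The three bitwise atomics (atomic_xor, atomic_and, atomic_or) all return the previous value and differ only in the operator applied. The following worked example is plain Python with a hypothetical fetch_op helper and a single-threaded list standing in for remote memory. It reuses the operands from the examples in this section; the starting value 0xAA is made up.

```python
import operator


def fetch_op(memory, index, val, op):
    """Model: apply op to memory[index] and val, store the result, return the old value."""
    old = memory[index]
    memory[index] = op(old, val)
    return old


word = [0xAA]                                         # hypothetical starting value
old_xor = fetch_op(word, 0, 0xFF, operator.xor)       # stores 0xAA ^ 0xFF = 0x55
old_and = fetch_op(word, 0, 0x0F, operator.and_)      # stores 0x55 & 0x0F = 0x05
old_or = fetch_op(word, 0, 0xF0, operator.or_)        # stores 0x05 | 0xF0 = 0xF5
```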

atomic_min

IrisDeviceCtx.atomic_min(pointer, val, to_rank, mask=None, sem=None, scope=None)

Performs an atomic min at the specified rank’s memory location.

Parameters:
  • pointer – The memory location in the current rank’s address space

  • val – The value to compare and potentially store

  • to_rank – The rank ID on which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Atomically compute minimum with rank 1's memory
>>> old_val = ctx.atomic_min(buffer, 10, 1)

atomic_max

IrisDeviceCtx.atomic_max(pointer, val, to_rank, mask=None, sem=None, scope=None)

Performs an atomic max at the specified rank’s memory location.

Parameters:
  • pointer – The memory location in the current rank’s address space

  • val – The value to compare and potentially store

  • to_rank – The rank ID on which the atomic operation will be performed

  • mask – Optional mask for conditional operations

  • sem – Memory semantics (acquire, release, acq_rel, relaxed)

  • scope – Scope of synchronization (gpu, cta, sys)

Returns:

The value at the memory location before the atomic operation

Example

>>> # Atomically compute maximum with rank 1's memory
>>> old_val = ctx.atomic_max(buffer, 100, 1)
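
atomic_min and atomic_max change memory only when the new value is smaller (or larger) than the current one, but they return the previous value either way. A minimal host-side sketch of that behaviour follows. These plain-Python functions are hypothetical models over a list, run single-threaded, and are not the Gluon methods.

```python
def atomic_min(memory, index, val):
    """Model: keep the smaller of the stored value and val; return the old value."""
    old = memory[index]
    memory[index] = min(old, val)
    return old


def atomic_max(memory, index, val):
    """Model: keep the larger of the stored value and val; return the old value."""
    old = memory[index]
    memory[index] = max(old, val)
    return old


lowest = [50]
old_a = atomic_min(lowest, 0, 10)     # 10 < 50, so 10 is stored
old_b = atomic_min(lowest, 0, 30)     # 30 > 10, so memory is unchanged
highest = [50]
old_c = atomic_max(highest, 0, 100)   # 100 > 50, so 100 is stored
```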