Device Functions#
Warning
The Gluon API is experimental and may undergo breaking changes in future releases.
Device-side functions provided by Iris Gluon for remote memory operations and atomics. These methods are part of the IrisDeviceCtx aggregate used within Gluon kernels.
Initialization#
initialize#
- static IrisDeviceCtx.initialize(context_tensor, tracing=False)[source]
Initialize IrisDeviceCtx from the encoded tensor.
The context tensor has the format:
[cur_rank, num_ranks, heap_base_0, heap_base_1, ..., trace_info...]
If tracing is enabled on the host (via shmem.tracing.enable()), the context tensor also contains tracing buffer pointers after the heap bases.
- Parameters:
context_tensor – Pointer to encoded context data
tracing (constexpr) – Enable event tracing (constexpr, default: False)
- Returns:
Initialized device context
- Return type:
IrisDeviceCtx
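The layout above can be illustrated with a small host-side sketch in plain Python. This is for intuition only; the real context tensor is built by the Iris host runtime, and the helper names here are hypothetical:

```python
# Hypothetical illustration of the context tensor layout described above:
# [cur_rank, num_ranks, heap_base_0, ..., heap_base_{n-1}, trace_info...]

def encode_context(cur_rank, heap_bases, trace_info=()):
    """Pack rank info and per-rank heap base addresses into a flat list."""
    num_ranks = len(heap_bases)
    return [cur_rank, num_ranks, *heap_bases, *trace_info]

def decode_context(ctx):
    """Recover the fields that IrisDeviceCtx.initialize reads."""
    cur_rank, num_ranks = ctx[0], ctx[1]
    heap_bases = ctx[2:2 + num_ranks]        # one heap base per rank
    trace_info = ctx[2 + num_ranks:]         # present only if tracing is on
    return cur_rank, num_ranks, heap_bases, trace_info

ctx = encode_context(0, [0x7000, 0x8000])
cur_rank, num_ranks, heap_bases, trace_info = decode_context(ctx)
```

Decoding simply reverses the fixed layout: two scalars, then `num_ranks` heap bases, then any remaining tracing pointers.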
Memory transfer operations#
load#
- IrisDeviceCtx.load(pointer, from_rank, mask=None, other=None, cache_modifier=None, volatile=False)[source]
Loads a value from the specified rank’s memory location to the current rank.
- Parameters:
pointer – Pointer in the from_rank’s address space
from_rank – The rank ID from which to read the data
mask – Optional mask for conditional loading
other – Value to return for masked-out elements. If not provided, the result for masked-out elements is undefined.
cache_modifier (str, optional) –
Controls cache behavior of the load.
- Supported values:
None: (default) — Same as “.ca”. Uses cache at all levels (CU, L2, LLC) with LRU policy.
”.ca”: Cache at all levels (CU, L2, LLC) with LRU policy.
”.cg”: Bypasses the CU (L1) cache, streams through L2, and may hit in LLC but the line is not retained or inserted.
”.cv”: Bypasses all GPU caches (CU and L2) and fetches directly from system memory. If data exists in the LLC, it may hit, but is not retained or inserted. Ensures global coherence by invalidating stale GPU cache lines.
volatile (bool, optional) – If True, disables compiler optimizations that could reorder or eliminate the load. Defaults to False.
- Returns:
The loaded value from the target memory location
Example
>>> # Load from rank 1 to current rank
>>> data = ctx.load(buffer + offsets, 1, mask=mask)
store#
- IrisDeviceCtx.store(pointer, value, to_rank, mask=None, cache_modifier=None)[source]
Writes data from the current rank to the specified rank’s memory location.
- Parameters:
pointer – Pointer in the current rank’s address space
value – The value to store
to_rank – The rank ID to which the data will be written
mask – Optional mask for conditional storing
cache_modifier (str, optional) –
Controls cache behavior of the store. Supported values are:
None: (default) — Same as “.wb”. Uses write-back caching at all levels (CU, L2, LLC) with LRU policy.
”.wb”: Write-back. Write-allocate on L1 miss, inserted into caches and written back later.
”.cg”: Cache Global. Equivalent to “.wb” — stored through L1 → L2 → LLC under LRU.
”.cs”: Cache Streaming. Bypasses L1, streamed through L2, not retained in LLC.
”.wt”: Write-Through. Bypasses L1 and L2 (coherent cache bypass), may hit in LLC with LRU.
Example
>>> # Store from current rank to rank 1
>>> ctx.store(buffer + offsets, values, 1, mask=mask)
copy#
- IrisDeviceCtx.copy(src_ptr, dst_ptr, from_rank, to_rank, mask=None, other=None, load_cache_modifier=None, store_cache_modifier=None)[source]
Copies data from the specified rank’s memory into the destination rank’s memory.
This function performs the transfer by translating src_ptr from the from_rank’s address space to the to_rank’s address space, performing a masked load from the translated source, and storing the loaded data to dst_ptr in the to_rank’s memory. If from_rank and to_rank are the same, this performs a local copy. Behavior is undefined if neither from_rank nor to_rank is the current rank (cur_rank).
- Parameters:
src_ptr – Pointer in the from_rank’s local memory from which to read data
dst_ptr – Pointer in the to_rank’s local memory where the data will be written
from_rank – The rank ID that owns src_ptr (source rank)
to_rank – The rank ID that will receive the data (destination rank)
mask – Optional mask for conditional operations
other – Value to return for masked-out elements during the load operation. If not provided, the result for masked-out elements is undefined.
load_cache_modifier (str, optional) –
Controls cache behavior of the load. Supported values are:
None: (default) — Same as “.ca”. Uses cache at all levels (CU, L2, LLC) with LRU policy.
”.ca”: Cache at all levels (CU, L2, LLC) with LRU policy.
”.cg”: Bypasses the CU (L1) cache, streams through L2, and may hit in LLC but the line is not retained or inserted.
”.cv”: Bypasses all GPU caches (CU and L2) and fetches directly from system memory. If data exists in the LLC, it may hit, but is not retained or inserted.
store_cache_modifier (str, optional) –
Controls cache behavior of the store. Supported values are:
None: (default) — Same as “.wb”. Uses write-back caching at all levels (CU, L2, LLC) with LRU policy.
”.wb”: Write-back. Write-allocate on L1 miss, inserted into caches and written back later.
”.cg”: Cache Global. Equivalent to “.wb” — stored through L1 → L2 → LLC under LRU.
”.cs”: Cache Streaming. Bypasses L1, streamed through L2, not retained in LLC.
”.wt”: Write-Through. Bypasses L1 and L2 (coherent cache bypass), may hit in LLC with LRU.
Example
>>> # Copy from rank 1 to rank 0 (current rank must be either 1 or 0)
>>> ctx.copy(remote_ptr + offsets, local_ptr + offsets, 1, 0, mask=mask)
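The address-space translation that copy relies on can be modeled in plain Python. This is an illustrative sketch of how symmetric-heap libraries typically rebase pointers between ranks, not the actual Iris implementation:

```python
# Illustrative model of symmetric-heap pointer translation: every rank's heap
# holds the same allocations at the same offsets, so an address is moved
# between address spaces by swapping heap bases while keeping the offset.

def translate(ptr, from_rank, to_rank, heap_bases):
    """Rebase ptr from from_rank's heap into to_rank's heap."""
    offset = ptr - heap_bases[from_rank]
    return heap_bases[to_rank] + offset

heap_bases = [0x1000, 0x9000]        # hypothetical per-rank heap bases
local_ptr = heap_bases[0] + 0x40     # offset 0x40 inside rank 0's heap
remote_ptr = translate(local_ptr, 0, 1, heap_bases)
assert remote_ptr == heap_bases[1] + 0x40
```

Because only the base changes, translation is symmetric: translating back with the ranks swapped recovers the original pointer.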
get#
- IrisDeviceCtx.get(from_ptr, to_ptr, from_rank, mask=None, other=None, load_cache_modifier=None, store_cache_modifier=None)[source]
Copies data from the specified rank’s memory to the current rank’s local memory.
- Parameters:
from_ptr – Pointer to remote memory in from_rank’s address space
to_ptr – Pointer to local memory in current rank
from_rank – The rank ID from which to read the data
mask – Optional mask for conditional operations
other – Value to return for masked-out elements during the load operation. If not provided, the result for masked-out elements is undefined.
load_cache_modifier (str, optional) –
Controls cache behavior of the load. Supported values are:
None: (default) — Same as “.ca”. Uses cache at all levels (CU, L2, LLC) with LRU policy.
”.ca”: Cache at all levels (CU, L2, LLC) with LRU policy.
”.cg”: Bypasses the CU (L1) cache, streams through L2, and may hit in LLC but the line is not retained or inserted.
”.cv”: Bypasses all GPU caches (CU and L2) and fetches directly from system memory. If data exists in the LLC, it may hit, but is not retained or inserted.
store_cache_modifier (str, optional) –
Controls cache behavior of the store. Supported values are:
None: (default) — Same as “.wb”. Uses write-back caching at all levels (CU, L2, LLC) with LRU policy.
”.wb”: Write-back. Write-allocate on L1 miss, inserted into caches and written back later.
”.cg”: Cache Global. Equivalent to “.wb” — stored through L1 → L2 → LLC under LRU.
”.cs”: Cache Streaming. Bypasses L1, streamed through L2, not retained in LLC.
”.wt”: Write-Through. Bypasses L1 and L2 (coherent cache bypass), may hit in LLC with LRU.
Example
>>> # Copy from rank 1 to current rank's local memory
>>> ctx.get(remote_ptr + offsets, local_ptr + offsets, 1, mask=mask)
put#
- IrisDeviceCtx.put(from_ptr, to_ptr, to_rank, mask=None, other=None, load_cache_modifier=None, store_cache_modifier=None)[source]
Copies data from the current rank’s local memory to the specified rank’s memory.
- Parameters:
from_ptr – Pointer to local memory in current rank
to_ptr – Pointer to remote memory in to_rank’s address space
to_rank – The rank ID to which the data will be written
mask – Optional mask for conditional operations
other – Value to return for masked-out elements during the load operation. If not provided, the result for masked-out elements is undefined.
load_cache_modifier (str, optional) –
Controls cache behavior of the load. Supported values are:
None: (default) — Same as “.ca”. Uses cache at all levels (CU, L2, LLC) with LRU policy.
”.ca”: Cache at all levels (CU, L2, LLC) with LRU policy.
”.cg”: Bypasses the CU (L1) cache, streams through L2, and may hit in LLC but the line is not retained or inserted.
”.cv”: Bypasses all GPU caches (CU and L2) and fetches directly from system memory. If data exists in the LLC, it may hit, but is not retained or inserted.
store_cache_modifier (str, optional) –
Controls cache behavior of the store. Supported values are:
None: (default) — Same as “.wb”. Uses write-back caching at all levels (CU, L2, LLC) with LRU policy.
”.wb”: Write-back. Write-allocate on L1 miss, inserted into caches and written back later.
”.cg”: Cache Global. Equivalent to “.wb” — stored through L1 → L2 → LLC under LRU.
”.cs”: Cache Streaming. Bypasses L1, streamed through L2, not retained in LLC.
”.wt”: Write-Through. Bypasses L1 and L2 (coherent cache bypass), may hit in LLC with LRU.
Example
>>> # Copy from current rank's local memory to rank 1
>>> ctx.put(local_ptr + offsets, remote_ptr + offsets, 1, mask=mask)
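Conceptually, get and put are copy with one side pinned to the current rank: get pulls remote data local, put pushes local data remote. A minimal plain-Python model (hypothetical, for intuition only, with per-rank memory as lists):

```python
# Plain-Python model of the copy/get/put relationship. Memory is modeled as
# one list per rank; idx selects which elements transfer, mask (if given)
# suppresses individual elements, mirroring the masked device operations.

def copy(mem, src_rank, dst_rank, idx, mask=None):
    for i in idx:
        if mask is None or mask[i]:
            mem[dst_rank][i] = mem[src_rank][i]

def get(mem, cur_rank, from_rank, idx, mask=None):
    # Remote -> local: source is the remote rank, destination is us.
    copy(mem, from_rank, cur_rank, idx, mask)

def put(mem, cur_rank, to_rank, idx, mask=None):
    # Local -> remote: source is us, destination is the remote rank.
    copy(mem, cur_rank, to_rank, idx, mask)

mem = {0: [0, 0, 0], 1: [7, 8, 9]}
get(mem, 0, 1, range(3))     # pull rank 1's data into rank 0
assert mem[0] == [7, 8, 9]
```

This also makes the undefined-behavior rule for copy intuitive: the current rank must be one endpoint, because it is the rank actually executing the load and store.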
Atomic operations#
atomic_add#
- IrisDeviceCtx.atomic_add(pointer, val, to_rank, mask=None, sem=None, scope=None)[source]
Performs an atomic add at the specified rank’s memory location.
- Parameters:
pointer – The memory location in the current rank’s address space
val – The value to add
to_rank – The rank ID on which the atomic operation is performed
mask – Optional mask for conditional operations
sem – Memory semantics (acquire, release, acq_rel, relaxed)
scope – Scope of synchronization (gpu, cta, sys)
- Returns:
The value at the memory location before the atomic operation
Example
>>> # Atomically add to rank 1's memory
>>> old_val = ctx.atomic_add(buffer, 5, 1)
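All the atomics below share one convention: they return the value held at the location before the operation. A CPU-side sketch (illustration only, not GPU code) of why that fetch-and-add return value is useful, e.g. for claiming unique slots in a shared buffer:

```python
# CPU-side model of fetch-and-add semantics: the atomic returns the value
# *before* the addition, so concurrent callers each observe a distinct old
# value and can use it as a unique index.

class Cell:
    def __init__(self, v=0):
        self.v = v

    def atomic_add(self, val):
        old, self.v = self.v, self.v + val
        return old          # value before the add, as the device API does

counter = Cell(0)
slots = [counter.atomic_add(1) for _ in range(4)]
assert slots == [0, 1, 2, 3]   # each caller gets a distinct slot
```

On the device, the same pattern appears as `slot = ctx.atomic_add(counter_ptr, 1, owner_rank)` with the hardware guaranteeing that concurrent adds serialize.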
atomic_sub#
- IrisDeviceCtx.atomic_sub(pointer, val, to_rank, mask=None, sem=None, scope=None)[source]
Atomically subtracts data from the specified rank’s memory location.
- Parameters:
pointer – Pointer in the current rank’s address space
val – The value to subtract
to_rank – The rank ID on which the atomic operation is performed
mask – Optional mask for conditional operations
sem – Memory semantics (acquire, release, acq_rel, relaxed)
scope – Scope of synchronization (gpu, cta, sys)
- Returns:
The value at the memory location before the atomic operation
Example
>>> # Atomically subtract from rank 1's memory
>>> old_val = ctx.atomic_sub(buffer, 3, 1)
atomic_cas#
- IrisDeviceCtx.atomic_cas(pointer, cmp, val, to_rank, sem=None, scope=None)[source]
Atomically compares and exchanges the specified rank’s memory location.
- Parameters:
pointer – Pointer in the current rank’s address space
cmp – The expected value to compare
val – The new value to write if comparison succeeds
to_rank – The rank ID on which the atomic operation is performed
sem – Memory semantics (acquire, release, acq_rel, relaxed)
scope – Scope of synchronization (gpu, cta, sys)
- Returns:
The value at the memory location before the atomic operation
Example
>>> # Compare-and-swap on rank 1's memory
>>> old_val = ctx.atomic_cas(flag + pid, 0, 1, 1, sem="release", scope="sys")
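The return value tells the caller whether the exchange happened: the CAS succeeded exactly when the returned old value equals cmp. A plain-Python model of that semantics (illustration only, not GPU code):

```python
# Illustrative compare-and-swap model: write val only if the location
# currently holds cmp, and always return the prior value so the caller can
# tell whether it won the exchange (e.g. when racing to set a ready flag).

def atomic_cas(mem, addr, cmp, val):
    old = mem[addr]
    if old == cmp:
        mem[addr] = val
    return old

flags = [0]
won = atomic_cas(flags, 0, 0, 1) == 0    # first caller flips 0 -> 1
lost = atomic_cas(flags, 0, 0, 1) == 0   # second caller sees 1, fails
assert won and not lost and flags[0] == 1
```

In the doctest above, sem="release" orders prior stores before the flag becomes visible, and scope="sys" makes that ordering visible across the whole system rather than just the local GPU.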
atomic_xchg#
- IrisDeviceCtx.atomic_xchg(pointer, val, to_rank, mask=None, sem=None, scope=None)[source]
Performs an atomic exchange at the specified rank’s memory location.
- Parameters:
pointer – The memory location in the current rank’s address space
val – The value to exchange
to_rank – The rank ID on which the atomic operation is performed
mask – Optional mask for conditional operations
sem – Memory semantics (acquire, release, acq_rel, relaxed)
scope – Scope of synchronization (gpu, cta, sys)
- Returns:
The value at the memory location before the atomic operation
Example
>>> # Exchange value with rank 1's memory
>>> old_val = ctx.atomic_xchg(buffer, 99, 1)
atomic_xor#
- IrisDeviceCtx.atomic_xor(pointer, val, to_rank, mask=None, sem=None, scope=None)[source]
Performs an atomic XOR at the specified rank’s memory location.
- Parameters:
pointer – The memory location in the current rank’s address space
val – The value to XOR with the destination
to_rank – The rank ID on which the atomic operation is performed
mask – Optional mask for conditional operations
sem – Memory semantics (acquire, release, acq_rel, relaxed)
scope – Scope of synchronization (gpu, cta, sys)
- Returns:
The value at the memory location before the atomic operation
Example
>>> # Atomically XOR with rank 1's memory
>>> old_val = ctx.atomic_xor(buffer, 0xFF, 1)
atomic_and#
- IrisDeviceCtx.atomic_and(pointer, val, to_rank, mask=None, sem=None, scope=None)[source]
Performs an atomic AND at the specified rank’s memory location.
- Parameters:
pointer – The memory location in the current rank’s address space
val – The value to AND with the destination
to_rank – The rank ID on which the atomic operation is performed
mask – Optional mask for conditional operations
sem – Memory semantics (acquire, release, acq_rel, relaxed)
scope – Scope of synchronization (gpu, cta, sys)
- Returns:
The value at the memory location before the atomic operation
Example
>>> # Atomically AND with rank 1's memory
>>> old_val = ctx.atomic_and(buffer, 0x0F, 1)
atomic_or#
- IrisDeviceCtx.atomic_or(pointer, val, to_rank, mask=None, sem=None, scope=None)[source]
Performs an atomic OR at the specified rank’s memory location.
- Parameters:
pointer – The memory location in the current rank’s address space
val – The value to OR with the destination
to_rank – The rank ID on which the atomic operation is performed
mask – Optional mask for conditional operations
sem – Memory semantics (acquire, release, acq_rel, relaxed)
scope – Scope of synchronization (gpu, cta, sys)
- Returns:
The value at the memory location before the atomic operation
Example
>>> # Atomically OR with rank 1's memory
>>> old_val = ctx.atomic_or(buffer, 0xF0, 1)
atomic_min#
- IrisDeviceCtx.atomic_min(pointer, val, to_rank, mask=None, sem=None, scope=None)[source]
Performs an atomic min at the specified rank’s memory location.
- Parameters:
pointer – The memory location in the current rank’s address space
val – The value to compare and potentially store
to_rank – The rank ID on which the atomic operation is performed
mask – Optional mask for conditional operations
sem – Memory semantics (acquire, release, acq_rel, relaxed)
scope – Scope of synchronization (gpu, cta, sys)
- Returns:
The value at the memory location before the atomic operation
Example
>>> # Atomically compute minimum with rank 1's memory
>>> old_val = ctx.atomic_min(buffer, 10, 1)
atomic_max#
- IrisDeviceCtx.atomic_max(pointer, val, to_rank, mask=None, sem=None, scope=None)[source]
Performs an atomic max at the specified rank’s memory location.
- Parameters:
pointer – The memory location in the current rank’s address space
val – The value to compare and potentially store
to_rank – The rank ID on which the atomic operation is performed
mask – Optional mask for conditional operations
sem – Memory semantics (acquire, release, acq_rel, relaxed)
scope – Scope of synchronization (gpu, cta, sys)
- Returns:
The value at the memory location before the atomic operation
Example
>>> # Atomically compute maximum with rank 1's memory
>>> old_val = ctx.atomic_max(buffer, 100, 1)