# CuTe Layout Algebra Reference for FlyDSL

> FlyDSL implements the CuTe layout algebra for AMD GPUs. This guide covers the mathematical foundations of the layout algebra and how FlyDSL exposes them through its Python API.

The CuTe layout algebra was introduced in the [CUTLASS](https://github.com/NVIDIA/cutlass) C++ library under BSD-3-Clause license (`include/cute/`). FlyDSL adopts the same algebraic framework — shapes, strides, coordinate mappings, products, and divides — and provides a Python API targeting AMD ROCm/HIP GPUs via MLIR.

---

## 1. Overview

### 1.1 What is the CuTe Layout Algebra?

The CuTe layout algebra is a mathematical framework for describing multidimensional data layouts as compositions of shapes, strides, and coordinate transformations. It provides:

- **Layouts** as first-class objects: a pair `(Shape, Stride)` that maps logical coordinates to physical offsets
- **Algebraic operations**: composition, complement, products, and divides that transform layouts while preserving correctness
- **Tiling and partitioning**: systematic decomposition of data across threads, warps/wavefronts, and blocks

The algebra is defined in the C++ headers of CUTLASS (BSD-3-Clause):
- `include/cute/layout.hpp` — Layout type, shape/stride types, core operations
- `include/cute/tensor.hpp` — Tensor type (pointer + layout)
- `include/cute/algorithm/` — Copy, GEMM, and other algorithmic building blocks
- `include/cute/numeric/integral_constant.hpp` — Compile-time integer constants

A pure-Python reference implementation also exists in PyTorch:
- `torch/distributed/_pycute/layout.py` — Layout class with all algebra operations

### 1.2 FlyDSL as an AMD Implementation

FlyDSL implements the CuTe layout algebra for AMD GPUs through the Fly MLIR dialect:

| Aspect | CuTe C++ (CUTLASS) | FlyDSL |
|---|---|---|
| **Language** | C++ templates | Python + MLIR emission |
| **Hardware** | NVIDIA CUDA GPUs | AMD ROCm/HIP GPUs |
| **IR backend** | C++ templates → CUDA/PTX | Fly MLIR dialect → ROCDL → HSACO |
| **Kernel model** | C++ kernel functions | `@flyc.kernel` + `@flyc.jit` |
| **Memory model** | GMEM → SMEM → RMEM | GMEM → LDS → VGPR |
| **Compilation** | nvcc / CUTLASS build | Python → MLIR → ROCDL → HSACO binary |
| **Wave/Warp size** | 32 threads (warp) | 64 threads (wavefront) |

---

## 2. Layout Algebra Fundamentals

### 2.1 Core Types

A **Layout** is defined by a pair `(Shape, Stride)`:

| Concept | Mathematical Definition | FlyDSL API |
|---|---|---|
| **Shape** | Tuple of positive integers describing dimensions | `fx.make_shape(M, N)` |
| **Stride** | Tuple of integers describing step sizes per dimension | `fx.make_stride(s0, s1)` |
| **Layout** | Pair `(Shape, Stride)` defining a coordinate → index mapping | `fx.make_layout(shape, stride)` |
| **Coord** | Tuple of integers identifying a position in logical space | `fx.make_coord(i, j)` |

> **Reference:** `include/cute/layout.hpp` — `Layout<Shape, Stride>` template class.

**FlyDSL example:**
```python
import flydsl.expr as fx

shape = fx.make_shape(128, 64)
stride = fx.make_stride(1, 128)    # Column-major
layout = fx.make_layout(shape, stride)
coord = fx.make_coord(3, 5)
```

### 2.2 Query Operations

| Operation | Formula | FlyDSL API |
|---|---|---|
| **size** | `product(shape)` — total number of elements | `fx.size(layout)` |
| **cosize** | `max(index) + 1` — size of the codomain | `fx.cosize(layout)` |
| **rank** | Number of modes (top-level dimensions) | `fx.rank(layout)` |
| **size of mode i** | `shape[i]` | `fx.get(fx.get_shape(layout), i)` |

> **Reference:** `include/cute/layout.hpp` — `size()`, `cosize()`, `rank()` functions.

### 2.3 Coordinate Mapping

The fundamental operation of a layout is mapping a logical coordinate to a physical index:

```
index = crd2idx(coord, shape, stride) = dot(coord, stride)
```

For a layout `L = ((S0, S1), (d0, d1))` and coordinate `(c0, c1)`:

```
index = c0 * d0 + c1 * d1
```

The inverse operation recovers a coordinate from a linear index:

```
coord = idx2crd(index, shape, stride)
```

| Operation | Definition | FlyDSL API |
|---|---|---|
| **crd2idx** | `coord → index = sum(c_i * d_i)` | `fx.crd2idx(coord, layout)` |
| **idx2crd** | `index → coord` (successive div/mod by shape elements) | `fx.idx2crd(idx, layout)` |

> **Reference:** `include/cute/layout.hpp` — `crd2idx()`, `idx2crd()`.

### 2.4 Layout Algebra Operations

All operations below are defined mathematically in the CuTe algebra and implemented in FlyDSL with identical semantics.

#### Composition

Given layouts `A = (S_A, d_A)` and `B = (S_B, d_B)`, the composition `A ∘ B` creates a new layout where B's indices are fed through A:

```
(A ∘ B)(c) = A(B(c))
```

FlyDSL: `fx.composition(A, B)`

> **Reference:** `include/cute/layout.hpp` — `composition()`.

#### Complement

The complement of layout `A` with respect to a codomain size `M` produces a layout `B` such that `(A, B)` together cover `[0, M)`:

FlyDSL: `fx.complement(layout, cotarget)`

> **Reference:** `include/cute/layout.hpp` — `complement()`.

#### Coalesce

Merges adjacent modes with compatible strides into a single mode, producing a simplified but functionally equivalent layout:

FlyDSL: `fx.coalesce(layout)`

> **Reference:** `include/cute/layout.hpp` — `coalesce()`.

#### Products

Products combine two layouts to create higher-rank layouts. They differ in how the result modes are organized:

| Product | Description | FlyDSL API |
|---|---|---|
| **Logical Product** | Append B's modes as new outer modes of A | `fx.logical_product(A, B)` |
| **Zipped Product** | Like logical, but zip inner modes together | `fx.zipped_product(A, B)` |
| **Tiled Product** | Like logical, but group by tile | `fx.tiled_product(A, B)` |
| **Flat Product** | Flatten all result modes | `fx.flat_product(A, B)` |
| **Raked Product** | Interleave A and B elements (raked distribution) | `fx.raked_product(A, B)` |
| **Blocked Product** | Block A elements together, then B (blocked distribution) | `fx.blocked_product(A, B)` |

> **Reference:** `include/cute/layout.hpp` — `logical_product()`, `zipped_product()`, `tiled_product()`, `flat_product()`, `raked_product()`, `blocked_product()`.

#### Divides

Divides decompose a layout by a tiler, creating a hierarchical layout with "tile" and "remainder" modes:

| Divide | Description | FlyDSL API |
|---|---|---|
| **Logical Divide** | Split A by tiler, keep full mode hierarchy | `fx.logical_divide(A, tiler)` |
| **Zipped Divide** | Like logical, but zip tile modes | `fx.zipped_divide(A, tiler)` |
| **Tiled Divide** | Like logical, but group by tile | `fx.tiled_divide(A, tiler)` |
| **Flat Divide** | Flatten tile and remainder modes | `fx.flat_divide(A, tiler)` |

> **Reference:** `include/cute/layout.hpp` — `logical_divide()`, `zipped_divide()`, `tiled_divide()`, `flat_divide()`.

#### Partitioning Utilities

| Operation | Description | FlyDSL API |
|---|---|---|
| **local_partition** | Partition a layout among threads/tiles | `fx.local_partition(layout, ...)` |
| **local_tile** | Extract a tile from a layout | `fx.local_tile(layout, ...)` |

> **Reference:** `include/cute/algorithm/` — `local_partition.hpp`, `local_tile.hpp`.

---

## 3. FlyDSL Kernel Development

FlyDSL kernels are defined using `@flyc.kernel` for GPU device functions and `@flyc.jit` for host-side launch wrappers:

```python
import flydsl.compiler as flyc
import flydsl.expr as fx

@flyc.kernel
def my_kernel(
    A: fx.Tensor,
    B: fx.Tensor,
    C: fx.Tensor,
    block_dim: fx.Constexpr[int],
):
    tid = fx.thread_idx.x
    bid = fx.block_idx.x
    # Kernel body — use layout algebra here
    ...

@flyc.jit
def launch(
    A: fx.Tensor, B: fx.Tensor, C,
    n: fx.Int32,
    stream: fx.Stream = fx.Stream(None),
):
    my_kernel(A, B, C, block_dim).launch(
        grid=(grid_x, 1, 1), block=(block_dim, 1, 1), stream=stream,
    )
```

**Key elements:**
- `@flyc.kernel` decorator compiles the function body into GPU IR via AST rewriting
- `@flyc.jit` decorator wraps a host-side function that constructs and launches kernels
- `fx.Tensor` denotes a GPU tensor argument
- `fx.Constexpr[int]` denotes a compile-time constant (affects cache key)
- `fx.Int32` denotes a dynamic int32 argument
- `fx.Stream` denotes a GPU stream argument

---

## 4. Thread and Block Hierarchy

GPU kernels organize threads into a hierarchy of blocks and grids. FlyDSL provides direct access to thread/block indices:

| Concept | FlyDSL API | Description |
|---|---|---|
| Thread index | `fx.thread_idx.x` | Thread index within block |
| Block index | `fx.block_idx.x` | Block index within grid |
| Block dimension | `fx.block_dim.x` | Number of threads per block |

Supported dimensions: `.x`, `.y`, `.z`.

**Hardware mapping (NVIDIA → AMD):**

| NVIDIA Concept | AMD Concept | Notes |
|---|---|---|
| Warp (32 threads) | Wavefront (64 threads) | Fundamental SIMD unit |
| Thread Block | Workgroup | Cooperative thread group |
| SM (Streaming Multiprocessor) | CU (Compute Unit) | Processing unit |
| Tensor Core (HMMA/GMMA) | MFMA (Matrix Fused Multiply-Add) | Matrix math unit |
| CUDA Core | Shader Processor | Scalar ALU |

---

## 5. Tensor Creation and Memory

### 5.1 Tensor Construction

FlyDSL provides tensor operations with layout-aware partitioning:

```python
import flydsl.expr as fx

# Create a buffer tensor from a tensor argument (AMD buffer descriptor)
A = fx.rocdl.make_buffer_tensor(A)

# Partition using layout algebra
tA = fx.logical_divide(A, fx.make_layout(block_dim, 1))
tA = fx.slice(tA, (None, bid))

# Register fragment
frag = fx.make_fragment_like(partition_src)
```

### 5.2 Memory Hierarchy

| Level | NVIDIA | AMD | Typical Size |
|---|---|---|---|
| Global Memory (GMEM) | Global Memory | Global Memory (HBM) | GBs |
| Shared/Local Memory | SMEM (48–228 KB) | LDS (64–160 KB) | Per-CU |
| Register File | RMEM (256 KB/SM) | VGPR (512 KB/CU) | Per-thread |
| L2 Cache | L2 Cache | L2 Cache | MBs |

**LDS allocation in FlyDSL:**
```python
from flydsl.utils.smem_allocator import SmemAllocator

allocator = SmemAllocator(ctx, arch="gfx942")
lds_gen = allocator.allocate_array(T.f16(), num_elems=128*64)
allocator.finalize()

base = allocator.get_base()
lds_ptr = lds_gen(base)
```

### 5.3 Swizzling (Bank Conflict Avoidance)

Swizzling remaps addresses to avoid bank conflicts in shared/local memory. FlyDSL provides XOR-based swizzling at 16-byte granularity:

```python
col_swizzled = fx.swizzle_xor16(row, col_bytes, k_blocks16)
```

The swizzle function XORs the row index into the column address at 16-byte boundaries, distributing accesses across LDS banks.

---

## 6. Data Movement

### 6.1 Copy Atoms and Tiled Copies

FlyDSL uses the CuTe copy abstraction: a **copy atom** defines a single thread's copy capability, and a **tiled copy** distributes the atom across all threads:

```python
import flydsl.expr as fx

# Create copy atom (e.g., 128-bit buffer copy)
copy_atom = fx.make_copy_atom(fx.rocdl.BufferCopy128b(), fx.Float32)

# Create tiled copy via raked product
thr_layout = fx.make_layout((4, 1), (1, 1))
val_layout = fx.make_layout((1, 8), (1, 1))
layout_thr_val = fx.raked_product(thr_layout, val_layout)
tile_mn = fx.make_tile(4, 8)
tiled_copy = fx.make_tiled_copy(copy_atom, layout_thr_val, tile_mn)

# Get thread slice
thr_copy = tiled_copy.get_slice(tid)
src_partition = thr_copy.partition_S(src_tensor)
dst_partition = thr_copy.partition_D(dst_tensor)

# Execute copy
fx.copy(copy_atom, src_partition, dst_partition)
```

### 6.2 Buffer Loads (AMD-specific)

AMD GPUs provide buffer load instructions for efficient global memory access. FlyDSL exposes these via the ``rocdl`` submodule:

```python
import flydsl.expr as fx

# Create a buffer tensor from a tensor argument (uses AMD buffer descriptor)
A_buf = fx.rocdl.make_buffer_tensor(A)

# Use buffer copy atoms for efficient memory access
copy_atom = fx.make_copy_atom(fx.rocdl.BufferCopy128b(), fx.Float32)
```

---

## 7. Compute Operations (MFMA)

AMD GPUs use MFMA (Matrix Fused Multiply-Add) instructions for matrix math. FlyDSL provides direct access to MFMA intrinsics:

```python
import flydsl.expr as fx

# Create MFMA atom (16x16x4 FP32)
mma_atom = fx.make_mma_atom(fx.rocdl.MFMA(16, 16, 4, fx.Float32))
tiled_mma = fx.make_tiled_mma(mma_atom, fx.make_layout((2, 2, 1), (1, 2, 0)))

# Partition and execute GEMM
thr_mma = tiled_mma.thr_slice(tid)
frag_A = thr_mma.make_fragment_A(partition_A)
frag_B = thr_mma.make_fragment_B(partition_B)
frag_C = thr_mma.make_fragment_C(partition_C)
fx.gemm(mma_atom, frag_C, frag_A, frag_B, frag_C)
```

**MFMA instruction reference (AMD CDNA):**

| Instruction | Data Type | M×N×K | Architecture |
|---|---|---|---|
| `mfma_f32_16x16x16f16` | FP16 | 16×16×16 | GFX942+ |
| `mfma_f32_16x16x32_fp8_fp8` | FP8 | 16×16×32 | GFX942+ |
| `mfma_i32_16x16x32_i8` | INT8 | 16×16×32 | GFX942+ |
| `mfma_f32_32x32x8f16` | FP16 | 32×32×8 | GFX942+ |
| `mfma_scale_x128` | MXFP4 | 16×16×128 | GFX950 |

**K64-byte micro-step pattern (2× K32 per step):**
```python
for ku in range(tile_k_bytes // 64):
    a_val = lds_load_pack_k32(...)   # Load A from LDS
    b_val = load_b_pack_k32(...)     # Load B from GMEM
    c_acc = rocdl.mfma_f32_16x16x32_fp8_fp8(a_val, b_val, c_acc)
    # second half
    a_val2 = lds_load_pack_k32(...)
    b_val2 = load_b_pack_k32(...)
    c_acc = rocdl.mfma_f32_16x16x32_fp8_fp8(a_val2, b_val2, c_acc)
```

---

## 8. Synchronization

| FlyDSL API | Description |
|---|---|
| `gpu.barrier()` | Workgroup-level barrier (equivalent to `__syncthreads`) |

```python
import flydsl.expr as fx
fx.gpu.barrier()
```

---

## 9. Compilation and Execution

### 9.1 Compilation Pipeline

FlyDSL compiles Python → MLIR IR → ROCDL dialect → HSACO binary:

```python
import flydsl.compiler as flyc
import flydsl.expr as fx

# Define kernel and launch wrapper
@flyc.kernel
def my_kernel(A: fx.Tensor, B: fx.Tensor, C: fx.Tensor, ...):
    ...

@flyc.jit
def launch(A: fx.Tensor, B: fx.Tensor, C, ...,
           stream: fx.Stream = fx.Stream(None)):
    my_kernel(A, B, C, ...).launch(
        grid=(...), block=(...), stream=stream,
    )

# Call the jit function — compilation happens automatically on first call
launch(A_torch, B_torch, C_torch, ..., stream=torch.cuda.Stream())
```

### 9.2 Environment Variables

| Variable | Description |
|---|---|
| `ARCH` | Target architecture (e.g., `gfx942`, `gfx950`) |
| `FLYDSL_DUMP_IR=1` | Dump intermediate MLIR IR |
| `FLYDSL_DUMP_DIR=/path` | IR dump location |
| `FLYDSL_COMPILE_ONLY=1` | Skip execution, compile only |
| `FLYDSL_NO_CACHE=1` | Disable compilation cache |
| `FLYDSL_RUNTIME_CACHE_DIR=/path` | Cache directory (default: `~/.flydsl/cache/`) |

---

## 10. Complete Example: GEMM with Layout Algebra

This example shows how layout algebra concepts come together in a FlyDSL GEMM kernel. The layout algebra handles data distribution across threads and memory hierarchy; MFMA instructions handle the compute.

```python
import torch
import flydsl.compiler as flyc
import flydsl.expr as fx

block_m, block_n, block_k = 64, 64, 8

@flyc.kernel
def gemm_kernel(A: fx.Tensor, B: fx.Tensor, C: fx.Tensor):
    tid = fx.thread_idx.x
    bid = fx.block_idx.x

    tileA = fx.make_tile(block_m, block_k)
    tileB = fx.make_tile(block_n, block_k)
    tileC = fx.make_tile(block_m, block_n)

    A = fx.rocdl.make_buffer_tensor(A)
    B = fx.rocdl.make_buffer_tensor(B)
    C = fx.rocdl.make_buffer_tensor(C)

    bA = fx.slice(fx.zipped_divide(A, tileA), (None, bid))
    bB = fx.slice(fx.zipped_divide(B, tileB), (None, bid))
    bC = fx.slice(fx.zipped_divide(C, tileC), (None, bid))

    # MFMA atom setup
    mma_atom = fx.make_mma_atom(fx.rocdl.MFMA(16, 16, 4, fx.Float32))
    tiled_mma = fx.make_tiled_mma(mma_atom, fx.make_layout((2, 2, 1), (1, 2, 0)))
    thr_mma = tiled_mma.thr_slice(tid)

    # Tiled copies for A, B, C
    copy_atom = fx.make_copy_atom(fx.rocdl.BufferCopy32b(), fx.Float32)
    tiled_copy_A = fx.make_tiled_copy_A(copy_atom, tiled_mma)
    tiled_copy_B = fx.make_tiled_copy_B(copy_atom, tiled_mma)

    # Partition and copy to register fragments
    frag_A = thr_mma.make_fragment_A(thr_mma.partition_A(bA))
    frag_B = thr_mma.make_fragment_B(thr_mma.partition_B(bB))
    frag_C = thr_mma.make_fragment_C(thr_mma.partition_C(bC))

    # ... copy data to fragments, then GEMM ...
    fx.gemm(mma_atom, frag_C, frag_A, frag_B, frag_C)

    # Store result back to C
    # ...

@flyc.jit
def tiledMma(A: fx.Tensor, B: fx.Tensor, C: fx.Tensor,
             stream: fx.Stream = fx.Stream(None)):
    gemm_kernel(A, B, C).launch(grid=(1, 1, 1), block=(256, 1, 1), stream=stream)
```

See `examples/03-tiledMma.py` for a complete working GEMM example, and
`kernels/preshuffle_gemm.py` for a production-quality GEMM implementation.

---

## 11. References

### CuTe Layout Algebra (BSD-3-Clause)
- **C++ headers:** [CUTLASS `include/cute/`](https://github.com/NVIDIA/cutlass/tree/main/include/cute)
  - `layout.hpp` — Layout type, all algebra operations
  - `tensor.hpp` — Tensor type (pointer + layout)
  - `algorithm/` — Copy, GEMM, partitioning algorithms
- **GTC presentations:** "CuTe: A Layout Algebra for CUTLASS" — mathematical foundations and design rationale
- **PyCute reference:** `torch/distributed/_pycute/layout.py` — pure-Python layout algebra (open source, PyTorch)

### FlyDSL Source Files
- `python/flydsl/expr/` — Layout algebra and expression API (`primitive.py`, `derived.py`, etc.)
- `python/flydsl/expr/rocdl/` — ROCDL-specific operations
- `python/flydsl/compiler/` — JIT compilation pipeline (`kernel_function.py`, `jit_function.py`)
- `python/flydsl/utils/smem_allocator.py` — SmemAllocator
- `examples/01-vectorAdd.py` — VecAdd example with layout algebra
- `examples/02-tiledCopy.py` — Tiled copy example
- `examples/03-tiledMma.py` — Tiled MFMA GEMM example
- `kernels/preshuffle_gemm.py` — Production GEMM implementation
- `kernels/preshuffle_gemm_flyc.py` — GEMM using `@flyc.kernel` API