Kernel Development

This tutorial covers advanced kernel development techniques in FlyDSL, including tiled data movement, MFMA instructions, shared memory, and performance optimization.

Tiled Copies

FlyDSL uses a hierarchical tiling model to partition data across blocks, warps, and threads:

import flydsl.compiler as flyc
import flydsl.expr as fx

@flyc.kernel
def copy_kernel(src: fx.Tensor, dst: fx.Tensor):
    tid = fx.thread_idx.x

    # Define thread and value layouts
    thr_layout = fx.make_layout((4, 1), (1, 1))
    val_layout = fx.make_layout((1, 8), (1, 1))

    # Create a copy atom (e.g., 128-bit buffer copy)
    copy_atom = fx.make_copy_atom(fx.rocdl.BufferCopy128b(), fx.Float32)

    # Build the tiled copy descriptor via raked product
    layout_thr_val = fx.raked_product(thr_layout, val_layout)
    tile_mn = fx.make_tile(4, 8)
    tiled_copy = fx.make_tiled_copy(copy_atom, layout_thr_val, tile_mn)

    # Partition a tensor for this thread
    thr_copy = tiled_copy.get_slice(tid)
    partition_src = thr_copy.partition_S(block_tile)
    partition_dst = thr_copy.partition_D(fragment)

    # Execute copy
    fx.copy(copy_atom, partition_src, partition_dst)

See examples/02-tiledCopy.py for a complete working example.

MFMA Instructions

For matrix operations, FlyDSL supports AMD’s Matrix Fused Multiply-Add (MFMA) instructions via make_mma_atom and make_tiled_mma:

import flydsl.compiler as flyc
import flydsl.expr as fx

@flyc.kernel
def mfma_kernel(A: fx.Tensor, B: fx.Tensor, C: fx.Tensor):
    tid = fx.thread_idx.x

    # Create an MFMA atom (16x16x4 FP32)
    mma_atom = fx.make_mma_atom(fx.rocdl.MFMA(16, 16, 4, fx.Float32))
    tiled_mma = fx.make_tiled_mma(mma_atom, fx.make_layout((2, 2, 1), (1, 2, 0)))

    # Partition A, B, C for this thread
    thr_mma = tiled_mma.thr_slice(tid)
    frag_A = thr_mma.make_fragment_A(partition_A)
    frag_B = thr_mma.make_fragment_B(partition_B)
    frag_C = thr_mma.make_fragment_C(partition_C)

    # Execute GEMM
    fx.gemm(mma_atom, frag_C, frag_A, frag_B, frag_C)

See examples/03-tiledMma.py for a complete GEMM example and kernels/gemm/preshuffle_gemm.py for a production GEMM implementation with LDS pipeline.

Shared Memory (LDS)

FlyDSL provides explicit control over Local Data Share (LDS) allocation and data movement:

Allocate LDS buffers with appropriate padding to avoid bank conflicts
Use cooperative loads to fill LDS from global memory
Synchronize with barriers before consuming LDS data

See kernels/gemm/preshuffle_gemm.py for LDS double-buffering patterns.

Performance Optimization

Key optimization techniques demonstrated in the pre-built kernels:

LDS double-buffering: Overlap compute with data movement (preshuffle_gemm)
Buffer tensor operations: Hardware bounds-checked memory access (rocdl.make_buffer_tensor)
Software pipelining: Hide memory latency with multi-stage pipelines
Pre-shuffled weights: Avoid runtime layout transformations for MFMA

Reference Implementations

Study these kernels for real-world patterns:

kernels/gemm/preshuffle_gemm.py – MFMA + LDS pipeline GEMM
kernels/norm/softmax_kernel.py – online numerically stable softmax
kernels/norm/layernorm_kernel.py – fused normalization
kernels/attention/pa_decode_fp8.py – paged attention decode with FP8