Kernel Development
This tutorial covers advanced kernel development techniques in FlyDSL, including tiled data movement, MFMA instructions, shared memory, and performance optimization.
Tiled Copies
FlyDSL uses a hierarchical tiling model to partition data across blocks, warps, and threads:
import flydsl.expr as fx
# Define thread and value layouts
thr_layout = fx.make_layout((4, 1), (1, 1))
val_layout = fx.make_layout((1, 8), (1, 1))
# Create a copy atom (e.g., 128-bit buffer copy)
copy_atom = fx.make_copy_atom(fx.rocdl.BufferCopy128b(), fx.Float32)
# Build the tiled copy descriptor via raked product
layout_thr_val = fx.raked_product(thr_layout, val_layout)
tile_mn = fx.make_tile(4, 8)
tiled_copy = fx.make_tiled_copy(copy_atom, layout_thr_val, tile_mn)
# Partition a tensor for this thread
# (tid, block_tile, and fragment are assumed to be defined by the enclosing kernel)
thr_copy = tiled_copy.get_slice(tid)
partition_src = thr_copy.partition_S(block_tile)
partition_dst = thr_copy.partition_D(fragment)
# Execute copy
fx.copy(copy_atom, partition_src, partition_dst)
See examples/02-tiledCopy.py for a complete working example.
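To make the thread/value split concrete, the following is a minimal pure-Python sketch of which tile elements each thread would own. It calls no FlyDSL API. It assumes that, with a (4, 1) thread layout and a (1, 8) value layout over a 4x8 tile, thread t owns row t and its 8 values run along that row; the names owner, partition, and COPIES_PER_THREAD are hypothetical and exist only for this illustration.

```python
# Pure-Python model of a thread-value layout; no FlyDSL calls are used.
TILE_M, TILE_N = 4, 8                  # tile shape, as in make_tile(4, 8)
NUM_THREADS, VALS_PER_THREAD = 4, 8    # from thr_layout (4, 1) and val_layout (1, 8)

# 8 Float32 values are 256 bits, so under this assumption a 128-bit copy
# would have to be issued twice per thread to move the whole fragment.
COPIES_PER_THREAD = (VALS_PER_THREAD * 32) // 128


def owner(m, n):
    """Map tile coordinate (m, n) to (thread id, value index).

    Assumption: thread t owns row t, and value index j is column j.
    """
    return m, n


def partition(tile, tid):
    """Return the values of `tile` that thread `tid` would copy, in value order."""
    owned = [(owner(m, n)[1], tile[m][n])
             for m in range(TILE_M)
             for n in range(TILE_N)
             if owner(m, n)[0] == tid]
    return [value for _, value in sorted(owned)]


# A 4x8 tile whose element at (m, n) holds its row-major offset.
tile = [[m * TILE_N + n for n in range(TILE_N)] for m in range(TILE_M)]
fragments = [partition(tile, tid) for tid in range(NUM_THREADS)]
```

Every element of the tile lands in exactly one fragment, which is the property a tiled copy relies on: the threads of a block cover the tile together without overlap.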
MFMA Instructions
For matrix operations, FlyDSL supports AMD’s Matrix Fused Multiply-Add (MFMA)
instructions via make_mma_atom and make_tiled_mma:
import flydsl.expr as fx
# Create an MFMA atom (16x16x4 FP32)
mma_atom = fx.make_mma_atom(fx.rocdl.MFMA(16, 16, 4, fx.Float32))
tiled_mma = fx.make_tiled_mma(mma_atom, fx.make_layout((2, 2, 1), (1, 2, 0)))
# Partition A, B, C for this thread
# (tid and partition_A/B/C are assumed to be defined by the enclosing kernel)
thr_mma = tiled_mma.thr_slice(tid)
frag_A = thr_mma.make_fragment_A(partition_A)
frag_B = thr_mma.make_fragment_B(partition_B)
frag_C = thr_mma.make_fragment_C(partition_C)
# Execute GEMM
fx.gemm(mma_atom, frag_C, frag_A, frag_B, frag_C)
See examples/03-tiledMma.py for a complete GEMM example and
kernels/preshuffle_gemm.py for a production GEMM implementation with
an LDS pipeline.
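As a reference for what the tiled MMA computes, here is a small pure-Python sketch of the arithmetic only. One 16x16x4 atom is modelled as D = A x B + C on a 16x4 A block, a 4x16 B block, and a 16x16 accumulator. The 2x2 arrangement of atoms over a 32x32 output is an assumption about how the (2, 2, 1) atom layout tiles M and N, and the sketch says nothing about how values are distributed across lanes. The names mfma_atom, block, and tiled_gemm are hypothetical.

```python
ATOM_M, ATOM_N, ATOM_K = 16, 16, 4   # shape of one MFMA atom
ATOMS_M, ATOMS_N = 2, 2              # assumed 2x2 arrangement of atoms in M and N


def mfma_atom(c, a, b):
    """Return a @ b + c for a 16x4 `a`, a 4x16 `b`, and a 16x16 `c`."""
    return [[c[i][j] + sum(a[i][k] * b[k][j] for k in range(ATOM_K))
             for j in range(ATOM_N)]
            for i in range(ATOM_M)]


def block(mat, r0, c0, rows, cols):
    """Copy the rows x cols sub-matrix of `mat` whose top-left corner is (r0, c0)."""
    return [row[c0:c0 + cols] for row in mat[r0:r0 + rows]]


def tiled_gemm(a, b):
    """Compute a @ b for a 32xK `a` and a Kx32 `b`, one atom at a time."""
    k_total = len(b)
    if k_total % ATOM_K != 0:
        raise ValueError("K must be a multiple of the atom's K dimension")
    m_total, n_total = ATOM_M * ATOMS_M, ATOM_N * ATOMS_N
    c = [[0.0] * n_total for _ in range(m_total)]
    for am in range(ATOMS_M):
        for an in range(ATOMS_N):
            r0, c0 = am * ATOM_M, an * ATOM_N
            acc = block(c, r0, c0, ATOM_M, ATOM_N)
            # Step along K, feeding the accumulator back in as C each time.
            for k0 in range(0, k_total, ATOM_K):
                acc = mfma_atom(acc,
                                block(a, r0, k0, ATOM_M, ATOM_K),
                                block(b, k0, c0, ATOM_K, ATOM_N))
            for i in range(ATOM_M):
                c[r0 + i][c0:c0 + ATOM_N] = acc[i]
    return c
```

Passing the accumulator as both the C input and the result mirrors the call shape of fx.gemm above, where frag_C appears as both destination and addend.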
Performance Optimization
Key optimization techniques demonstrated in the pre-built kernels:
LDS double-buffering: Overlap compute with data movement (preshuffle_gemm)
Buffer tensor operations: Hardware bounds-checked memory access (rocdl.make_buffer_tensor)
Software pipelining: Hide memory latency with multi-stage pipelines
Pre-shuffled weights: Avoid runtime layout transformations for MFMA
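The double-buffering schedule in the list above can be sketched in plain Python. This is only a model of the ordering, not the code in preshuffle_gemm: Python runs the steps sequentially, whereas on the GPU the load of the next tile would be in flight while the current tile is computed. The functions load_tile, compute, and double_buffered_sum_of_squares are hypothetical stand-ins.

```python
def load_tile(global_mem, tile_idx, tile_size):
    """Stand-in for copying one tile from global memory into an LDS buffer."""
    start = tile_idx * tile_size
    return global_mem[start:start + tile_size]


def compute(tile):
    """Stand-in for the math done on one tile (here: sum of squares)."""
    return sum(x * x for x in tile)


def double_buffered_sum_of_squares(global_mem, tile_size):
    """Process `global_mem` tile by tile with two alternating buffers.

    Returns the result and a trace of (operation, tile index, buffer index).
    """
    num_tiles = len(global_mem) // tile_size
    buffers = [None, None]
    trace = []
    total = 0
    if num_tiles == 0:
        return total, trace

    # Prologue: fill buffer 0 before the main loop starts.
    buffers[0] = load_tile(global_mem, 0, tile_size)
    trace.append(("load", 0, 0))

    for t in range(num_tiles):
        cur, nxt = t % 2, (t + 1) % 2
        # Issue the load of the next tile into the buffer that is not in use.
        if t + 1 < num_tiles:
            buffers[nxt] = load_tile(global_mem, t + 1, tile_size)
            trace.append(("load", t + 1, nxt))
        # Compute on the current buffer while the next load would be in flight.
        total += compute(buffers[cur])
        trace.append(("compute", t, cur))
    return total, trace
```

The point of the two buffers is that a load never targets the buffer being read by the current compute step, so the two can overlap without a hazard. A multi-stage software pipeline generalizes the same idea to more than two buffers.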
Reference Implementations
Study these kernels for real-world patterns:
kernels/preshuffle_gemm.py – MFMA + LDS pipeline GEMM
kernels/preshuffle_gemm_flyc.py – GEMM using the new @flyc.kernel API
kernels/softmax_kernel.py – online numerically stable softmax
kernels/layernorm_kernel.py – fused normalization
kernels/pa_decode_fp8.py – paged attention decode with FP8
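For readers unfamiliar with the term, the following is a minimal scalar sketch of the general online softmax technique: the running maximum and the running sum of exponentials are maintained in a single pass, and the sum is rescaled whenever the maximum grows. It is not taken from kernels/softmax_kernel.py and makes no claim about how that kernel is structured. The function name online_softmax is hypothetical, and the inputs are assumed to be finite.

```python
import math


def online_softmax(xs):
    """Softmax of finite values `xs`, with max and sum found in one pass."""
    running_max = -math.inf
    running_sum = 0.0
    for x in xs:
        new_max = max(running_max, x)
        # Rescale the old sum to the new maximum, then add the new term.
        running_sum = (running_sum * math.exp(running_max - new_max)
                       + math.exp(x - new_max))
        running_max = new_max
    # Second loop only normalizes; no further reduction is needed.
    return [math.exp(x - running_max) / running_sum for x in xs]
```

Subtracting the maximum before exponentiating keeps every exponent at or below zero, so inputs such as 1000.0 do not overflow, and folding the max and sum into one pass removes a full read of the input compared with the two-pass formulation.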
See also
Kernel Authoring Guide – comprehensive kernel authoring reference
Pre-built Kernel Library Guide – all pre-built kernels with configuration details
Testing & Benchmarking Guide – how to test and benchmark kernels