Kernel Development

This tutorial covers advanced kernel development techniques in FlyDSL, including tiled data movement, MFMA instructions, shared memory, and performance optimization.

Tiled Copies

FlyDSL uses a hierarchical tiling model to partition data across blocks, warps, and threads:

import flydsl.expr as fx

# Define thread and value layouts
thr_layout = fx.make_layout((4, 1), (1, 1))
val_layout = fx.make_layout((1, 8), (1, 1))

# Create a copy atom (e.g., 128-bit buffer copy)
copy_atom = fx.make_copy_atom(fx.rocdl.BufferCopy128b(), fx.Float32)

# Build the tiled copy descriptor via raked product
layout_thr_val = fx.raked_product(thr_layout, val_layout)
tile_mn = fx.make_tile(4, 8)
tiled_copy = fx.make_tiled_copy(copy_atom, layout_thr_val, tile_mn)

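# Partition a tensor for this thread.
# tid, block_tile (source tile), and fragment (destination) are assumed to be defined earlier in the kernel.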
thr_copy = tiled_copy.get_slice(tid)
partition_src = thr_copy.partition_S(block_tile)
partition_dst = thr_copy.partition_D(fragment)

# Execute copy
fx.copy(copy_atom, partition_src, partition_dst)

See examples/02-tiledCopy.py for a complete working example.
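
To make the thread/value split concrete, the following plain-Python sketch models which coordinates of the 4x8 tile each thread would own. It does not call FlyDSL. The helper names owned_coords and tile_owner_map are hypothetical, and the sketch assumes that the thread layout indexes rows, the value layout indexes columns, and threads are numbered with M varying fastest. For these particular shapes the interleaving convention of the product does not change the result.

```python
# Conceptual model in plain Python; it does not call FlyDSL.
THR_SHAPE = (4, 1)  # threads along (M, N), mirroring thr_layout
VAL_SHAPE = (1, 8)  # values per thread along (M, N), mirroring val_layout


def owned_coords(tid, thr_shape=THR_SHAPE, val_shape=VAL_SHAPE):
    """Return the (m, n) tile coordinates assigned to thread `tid`.

    Hypothetical helper: assumes threads are numbered with M varying fastest
    and that each thread's values are contiguous within a mode.
    """
    thr_m, _thr_n = thr_shape
    val_m, val_n = val_shape
    tm, tn = tid % thr_m, tid // thr_m
    return [
        (tm * val_m + vm, tn * val_n + vn)
        for vm in range(val_m)
        for vn in range(val_n)
    ]


def tile_owner_map(thr_shape=THR_SHAPE, val_shape=VAL_SHAPE):
    """Map every (m, n) coordinate of the tile to the thread that owns it."""
    num_threads = thr_shape[0] * thr_shape[1]
    return {
        coord: tid
        for tid in range(num_threads)
        for coord in owned_coords(tid, thr_shape, val_shape)
    }
```

Under these assumptions each of the four threads owns one full row of eight values, and the 32 coordinates of the tile are covered exactly once.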

MFMA Instructions

For matrix operations, FlyDSL supports AMD’s Matrix Fused Multiply-Add (MFMA) instructions via make_mma_atom and make_tiled_mma:

import flydsl.expr as fx

# Create an MFMA atom (16x16x4 FP32)
mma_atom = fx.make_mma_atom(fx.rocdl.MFMA(16, 16, 4, fx.Float32))
tiled_mma = fx.make_tiled_mma(mma_atom, fx.make_layout((2, 2, 1), (1, 2, 0)))

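# Build this thread's register fragments for A, B, and C.
# partition_A, partition_B, and partition_C are assumed to be the already-partitioned operand tiles.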
thr_mma = tiled_mma.thr_slice(tid)
frag_A = thr_mma.make_fragment_A(partition_A)
frag_B = thr_mma.make_fragment_B(partition_B)
frag_C = thr_mma.make_fragment_C(partition_C)

# Execute GEMM
fx.gemm(mma_atom, frag_C, frag_A, frag_B, frag_C)

See examples/03-tiledMma.py for a complete GEMM example and kernels/preshuffle_gemm.py for a production GEMM implementation with an LDS pipeline.
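
As a reference for what the tiled MMA computes, here is a plain-Python sketch of the arithmetic only: a 16x16x4 atom applied repeatedly along K, then replicated over the output. It says nothing about how FlyDSL distributes values across lanes or registers. The function names mfma_atom and tiled_gemm are hypothetical, and the sketch assumes M and N are multiples of 16 and K is a multiple of 4.

```python
ATOM_M, ATOM_N, ATOM_K = 16, 16, 4  # shape of one MFMA atom from the text


def mfma_atom(c_tile, a_tile, b_tile):
    """Reference semantics of one atom: return A @ B + C for a 16x16x4 tile."""
    return [
        [
            c_tile[i][j] + sum(a_tile[i][k] * b_tile[k][j] for k in range(ATOM_K))
            for j in range(ATOM_N)
        ]
        for i in range(ATOM_M)
    ]


def tiled_gemm(a, b, m, n, k):
    """Compute C = A @ B by sweeping the atom over M, N, and K.

    Hypothetical helper; assumes m and n are multiples of 16 and k of 4.
    """
    c = [[0.0] * n for _ in range(m)]
    for m0 in range(0, m, ATOM_M):
        for n0 in range(0, n, ATOM_N):
            # The accumulator stays live across the whole K loop.
            acc = [[0.0] * ATOM_N for _ in range(ATOM_M)]
            for k0 in range(0, k, ATOM_K):
                a_tile = [row[k0:k0 + ATOM_K] for row in a[m0:m0 + ATOM_M]]
                b_tile = [row[n0:n0 + ATOM_N] for row in b[k0:k0 + ATOM_K]]
                acc = mfma_atom(acc, a_tile, b_tile)
            for i in range(ATOM_M):
                c[m0 + i][n0:n0 + ATOM_N] = acc[i]
    return c
```

A 32x32 output with K = 8 uses a 2x2 grid of atoms and two K steps per atom, which is one plausible reading of the (2, 2, 1) atom layout above.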

Shared Memory (LDS)

FlyDSL provides explicit control over Local Data Share (LDS) allocation and data movement:

  1. Allocate LDS buffers with appropriate padding to avoid bank conflicts

  2. Use cooperative loads to fill LDS from global memory

  3. Synchronize with barriers before consuming LDS data

See kernels/preshuffle_gemm.py for LDS double-buffering patterns.
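
The padding in step 1 can be illustrated with a small bank-index model in plain Python. This is a sketch under stated assumptions, not a description of FlyDSL's allocator: it assumes 32 banks that are each one 4-byte word wide (the real bank count depends on the GPU), and the helper names bank_of and max_conflict are hypothetical.

```python
NUM_BANKS = 32  # assumption: 32 banks, each one 4-byte word wide


def bank_of(row, col, row_stride_words):
    """Bank index of element (row, col) in a row-major LDS tile."""
    return (row * row_stride_words + col) % NUM_BANKS


def max_conflict(rows, col, row_stride_words):
    """Largest number of rows whose element in column `col` shares a bank.

    A result of 1 means a column-wise access by `rows` threads is conflict free.
    """
    counts = {}
    for r in range(rows):
        b = bank_of(r, col, row_stride_words)
        counts[b] = counts.get(b, 0) + 1
    return max(counts.values())
```

With a row stride of 32 words, all 32 rows of a column fall in the same bank. Padding each row by one word (stride 33) spreads them over 32 distinct banks, at the cost of one unused word per row.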

Performance Optimization

Key optimization techniques demonstrated in the pre-built kernels:

  • LDS double-buffering: Overlap compute with data movement (preshuffle_gemm)

  • Buffer tensor operations: Hardware bounds-checked memory access (rocdl.make_buffer_tensor)

  • Software pipelining: Hide memory latency with multi-stage pipelines

  • Pre-shuffled weights: Avoid runtime layout transformations for MFMA
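
The ordering behind double-buffering can be sketched in plain Python. The example below is a sequential model of the schedule only: on a GPU the load of the next tile and the compute on the current tile would overlap, whereas here they simply appear in issue order. The function name double_buffered_sum is hypothetical, and summing a tile stands in for the real compute step.

```python
def double_buffered_sum(tiles):
    """Sum all tiles through two staging buffers.

    Returns (total, log), where log records ("load" | "compute", tile, buffer)
    in issue order. Hypothetical helper for illustration.
    """
    log = []
    total = 0
    if not tiles:
        return total, log

    buffers = [None, None]
    # Prologue: fill buffer 0 before the main loop starts.
    buffers[0] = list(tiles[0])
    log.append(("load", 0, 0))

    for i in range(len(tiles)):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < len(tiles):
            # Issue the next load into the idle buffer before computing.
            buffers[nxt] = list(tiles[i + 1])
            log.append(("load", i + 1, nxt))
        total += sum(buffers[cur])
        log.append(("compute", i, cur))
    return total, log
```

Each load targets the buffer that the current compute step is not reading, so a buffer is never overwritten while it is still in use. A real kernel would also place a barrier between filling a buffer and consuming it.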

Reference Implementations

Study these kernels for real-world patterns:

  • kernels/preshuffle_gemm.py – MFMA + LDS pipeline GEMM

  • kernels/preshuffle_gemm_flyc.py – GEMM using the new @flyc.kernel API

  • kernels/softmax_kernel.py – online numerically stable softmax

  • kernels/layernorm_kernel.py – fused normalization

  • kernels/pa_decode_fp8.py – paged attention decode with FP8
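
For readers unfamiliar with the "online" softmax mentioned above, here is a minimal scalar sketch of the general technique in plain Python. It is not taken from kernels/softmax_kernel.py, whose implementation may differ. The function name online_softmax is hypothetical, and inputs are assumed to be finite.

```python
import math


def online_softmax(xs):
    """Softmax with the max and the normalizer computed in a single pass.

    Hypothetical reference helper; assumes every input is finite.
    """
    running_max = float("-inf")
    running_sum = 0.0
    for x in xs:
        new_max = max(running_max, x)
        # Rescale the partial sum to the new max, then add the current term.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
    # Second loop only normalizes; no further reduction is needed.
    return [math.exp(x - running_max) / running_sum for x in xs]
```

Because every exponent is taken relative to the running maximum, inputs such as 1000.0 do not overflow, and the reduction needs one pass over the data instead of separate max and sum passes.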

See also