Kernel Development
==================

This tutorial covers advanced kernel development techniques in FlyDSL, including tiled data movement, MFMA instructions, shared memory, and performance optimization.

Tiled Copies
-------------

FlyDSL uses a hierarchical tiling model to partition data across blocks, warps, and threads:

.. code-block:: python

   import flydsl.expr as fx

   # Define thread and value layouts
   thr_layout = fx.make_layout((4, 1), (1, 1))
   val_layout = fx.make_layout((1, 8), (1, 1))

   # Create a copy atom (e.g., 128-bit buffer copy)
   copy_atom = fx.make_copy_atom(fx.rocdl.BufferCopy128b(), fx.Float32)

   # Build the tiled copy descriptor via raked product
   layout_thr_val = fx.raked_product(thr_layout, val_layout)
   tile_mn = fx.make_tile(4, 8)
   tiled_copy = fx.make_tiled_copy(copy_atom, layout_thr_val, tile_mn)

   # Partition a tensor for this thread
   thr_copy = tiled_copy.get_slice(tid)
   partition_src = thr_copy.partition_S(block_tile)
   partition_dst = thr_copy.partition_D(fragment)

   # Execute copy
   fx.copy(copy_atom, partition_src, partition_dst)

See ``examples/02-tiledCopy.py`` for a complete working example.

MFMA Instructions
-----------------

For matrix operations, FlyDSL supports AMD's Matrix Fused Multiply-Add (MFMA) instructions via ``make_mma_atom`` and ``make_tiled_mma``:

.. code-block:: python

   import flydsl.expr as fx

   # Create an MFMA atom (16x16x4 FP32)
   mma_atom = fx.make_mma_atom(fx.rocdl.MFMA(16, 16, 4, fx.Float32))
   tiled_mma = fx.make_tiled_mma(mma_atom, fx.make_layout((2, 2, 1), (1, 2, 0)))

   # Partition A, B, C for this thread
   thr_mma = tiled_mma.thr_slice(tid)
   frag_A = thr_mma.make_fragment_A(partition_A)
   frag_B = thr_mma.make_fragment_B(partition_B)
   frag_C = thr_mma.make_fragment_C(partition_C)

   # Execute GEMM
   fx.gemm(mma_atom, frag_C, frag_A, frag_B, frag_C)

See ``examples/03-tiledMma.py`` for a complete GEMM example and ``kernels/preshuffle_gemm.py`` for a production GEMM implementation with LDS pipeline.

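
To make the arithmetic behind a 16x16x4 atom concrete, the following plain-Python sketch computes what one such step produces (``D = A @ B + C`` with a 16x4 ``A``, a 4x16 ``B``, and a 16x16 accumulator) and shows how a longer K dimension is covered by repeating that step. It does not use FlyDSL and ignores how the hardware distributes operands across lanes; the helper names ``mfma_16x16x4`` and ``gemm_tile`` are hypothetical and exist only for this illustration.

.. code-block:: python

   def mfma_16x16x4(a, b, c):
       """Reference for one 16x16x4 step: returns a @ b + c.

       a is 16x4, b is 4x16, c is 16x16 (nested lists of floats).
       """
       m, k, n = 16, 4, 16
       return [
           [c[i][j] + sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
           for i in range(m)
       ]


   def gemm_tile(a, b, k_total):
       """Accumulate a (16 x k_total) by (k_total x 16) product in K-steps of 4."""
       acc = [[0.0] * 16 for _ in range(16)]
       for k0 in range(0, k_total, 4):
           a_blk = [row[k0:k0 + 4] for row in a]  # 16x4 slice of A
           b_blk = b[k0:k0 + 4]                   # 4x16 slice of B
           acc = mfma_16x16x4(a_blk, b_blk, acc)  # accumulator is reused as C
       return acc


   # Small integer-valued inputs keep the float arithmetic exact.
   K = 8
   A = [[float((i + p) % 3) for p in range(K)] for i in range(16)]
   B = [[float((p * j) % 5) for j in range(16)] for p in range(K)]
   C = gemm_tile(A, B, K)

   # Check against a direct product over the full K dimension.
   expected = [
       [sum(A[i][p] * B[p][j] for p in range(K)) for j in range(16)]
       for i in range(16)
   ]
   assert C == expected

Passing the accumulator back in as ``c`` mirrors the way ``frag_C`` appears as both the output and the addend in the ``fx.gemm`` call above.
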

Shared Memory (LDS)
--------------------

FlyDSL provides explicit control over Local Data Share (LDS) allocation and data movement:

1. Allocate LDS buffers with appropriate padding to avoid bank conflicts
2. Use cooperative loads to fill LDS from global memory
3. Synchronize with barriers before consuming LDS data

See ``kernels/preshuffle_gemm.py`` for LDS double-buffering patterns.

Performance Optimization
------------------------

Key optimization techniques demonstrated in the pre-built kernels:

- **LDS double-buffering**: Overlap compute with data movement (``preshuffle_gemm``)
- **Buffer tensor operations**: Hardware bounds-checked memory access (``rocdl.make_buffer_tensor``)
- **Software pipelining**: Hide memory latency with multi-stage pipelines
- **Pre-shuffled weights**: Avoid runtime layout transformations for MFMA

Reference Implementations
-------------------------

Study these kernels for real-world patterns:

- ``kernels/preshuffle_gemm.py`` -- MFMA + LDS pipeline GEMM
- ``kernels/preshuffle_gemm_flyc.py`` -- GEMM using new ``@flyc.kernel`` API
- ``kernels/softmax_kernel.py`` -- online numerically stable softmax
- ``kernels/layernorm_kernel.py`` -- fused normalization
- ``kernels/pa_decode_fp8.py`` -- paged attention decode with FP8

.. seealso::

   - :doc:`../kernel_authoring_guide` -- comprehensive kernel authoring reference
   - :doc:`../prebuilt_kernels_guide` -- all pre-built kernels with configuration details
   - :doc:`../testing_benchmarking_guide` -- how to test and benchmark kernels
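
As a small illustration of the padding step listed under Shared Memory (LDS), the plain-Python sketch below counts how many threads land on the same LDS bank when each thread reads one element from a different row of a row-major tile. It does not use FlyDSL. The bank count of 32 four-byte banks and the 16x32 tile shape are assumptions chosen for the example, and ``max_bank_conflict`` is a hypothetical helper; check the target architecture's documentation for the real bank geometry.

.. code-block:: python

   from collections import Counter

   NUM_BANKS = 32  # assumption: 32 banks, each 4 bytes wide
   ROWS = 16       # threads reading one element each, one per row
   COLS = 32       # row length in 4-byte elements


   def max_bank_conflict(row_stride, num_banks=NUM_BANKS, rows=ROWS, col=0):
       """Largest number of threads mapped to one bank when thread r reads
       element (r, col) of a row-major tile with the given row stride."""
       banks = Counter((r * row_stride + col) % num_banks for r in range(rows))
       return max(banks.values())


   unpadded = max_bank_conflict(row_stride=COLS)    # stride 32: every row maps to bank 0
   padded = max_bank_conflict(row_stride=COLS + 1)  # stride 33: rows spread over banks
   print(unpadded, padded)  # prints "16 1"

Padding each row by one element changes the stride from a multiple of the bank count to one that is coprime with it, so the column access no longer serializes on a single bank, at the cost of one unused element per row.
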