Pre-built Kernels

FlyDSL ships with a collection of pre-built GPU kernels in the kernels/ directory. These serve as both ready-to-use components and reference implementations for kernel development.

GEMM Kernels

  • kernels.preshuffle_gemm – MFMA-based GEMM with an LDS pipeline and pre-shuffled weights (FP8, INT8, BF16)

  • kernels.preshuffle_gemm_flyc – Preshuffle GEMM using the new @flyc.kernel API

  • kernels.mixed_preshuffle_gemm – Mixed-precision GEMM with pre-shuffled layouts

  • kernels.blockscale_preshuffle_gemm – Block-scale (MXFP4) preshuffle GEMM

MoE (Mixture-of-Experts) Kernels

  • kernels.moe_gemm_2stage – MoE GEMM with a 2-stage pipeline (stage1 + stage2)

  • kernels.mixed_moe_gemm_2stage – Mixed-precision MoE GEMM

  • kernels.moe_blockscale_2stage – MoE with block-scale quantization (MXFP4)

  • kernels.moe_reduce – MoE reduction kernel that sums over the topk dimension (Y[t, d] = sum(X[t, :, d])). Supports optional masking and the f16, bf16, and f32 data types. Compiled via compile_moe_reduction().
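
The reduction Y[t, d] = sum(X[t, :, d]) can be checked against a small host-side reference. The following is a minimal pure-Python sketch, not the FlyDSL kernel itself: the function name moe_reduce_reference is hypothetical, and the mask semantics (a falsy mask[t][k] drops that expert's contribution) are an assumption made for illustration.

```python
def moe_reduce_reference(x, mask=None):
    """Sum over the topk axis: y[t][d] = sum over k of x[t][k][d].

    x    : nested list shaped [tokens][topk][dim]
    mask : optional nested list shaped [tokens][topk]; entries that are
           falsy are skipped (assumed semantics, for illustration only)
    """
    out = []
    for t, experts in enumerate(x):
        dim = len(experts[0])
        row = [0.0] * dim
        for k, vec in enumerate(experts):
            # Skip masked-out expert slots when a mask is supplied.
            if mask is not None and not mask[t][k]:
                continue
            for d in range(dim):
                row[d] += vec[d]
        out.append(row)
    return out


# Two tokens, topk = 2, model dim = 2.
x = [[[1.0, 2.0], [3.0, 4.0]],
     [[5.0, 6.0], [7.0, 8.0]]]
y = moe_reduce_reference(x)
y_masked = moe_reduce_reference(x, mask=[[1, 0], [0, 1]])
```

A reference like this is useful mainly for validating kernel output on small shapes; it says nothing about how the GPU kernel tiles or vectorizes the work.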

Paged Attention

  • kernels.pa_decode_fp8 – Paged attention decode kernel with FP8 support

Normalization

  • kernels.layernorm_kernel – Layer normalization

  • kernels.rmsnorm_kernel – RMS normalization
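
For orientation, the two normalizations differ only in whether the mean is subtracted. The sketch below shows the standard textbook definitions over a single row in pure Python. The function names, the eps default, and the omission of learned scale and shift parameters are assumptions for illustration; the actual kernel signatures may differ.

```python
import math


def layernorm_reference(x, eps=1e-5):
    """Layer normalization of one row: (x - mean) / sqrt(var + eps)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in x]


def rmsnorm_reference(x, eps=1e-5):
    """RMS normalization of one row: x / sqrt(mean(x^2) + eps)."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms for v in x]


row = [1.0, 2.0, 3.0, 4.0]
ln = layernorm_reference(row)
rn = rmsnorm_reference(row)
```

RMS normalization skips the mean computation, so it needs one reduction per row instead of two.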

Softmax

  • kernels.softmax_kernel – Numerically stable softmax
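
"Numerically stable" usually means subtracting the row maximum before exponentiating, so that the largest exponent is exp(0) = 1 and nothing overflows. A minimal pure-Python sketch of that common technique follows. The function name is hypothetical, and whether the FlyDSL kernel uses exactly this formulation is an assumption.

```python
import math


def softmax_reference(x):
    """Max-subtracted softmax over one row."""
    m = max(x)
    # Every exponent is <= 0 after the shift, so exp() cannot overflow.
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


# A naive exp(1000.0) would raise OverflowError in Python;
# the shifted form handles it without trouble.
probs = softmax_reference([1000.0, 1000.0])
```

The shift does not change the result mathematically, because the factor exp(-m) cancels between numerator and denominator.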

Reduction

  • kernels.reduce – Warp-level reduction utilities (warp_reduce_sum, warp_reduce_max)
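
Warp-level reductions are commonly built from a butterfly (XOR) shuffle: at each step every lane combines its value with the lane whose index differs in one bit, so after log2(width) steps all lanes hold the full result. The host-side simulation below illustrates that general pattern only. The name butterfly_reduce is hypothetical, and it is an assumption that warp_reduce_sum and warp_reduce_max follow this scheme.

```python
import operator


def butterfly_reduce(lanes, op):
    """Simulate an XOR-shuffle reduction across the lanes of one warp.

    lanes : per-lane input values; the count must be a power of two
    op    : associative, commutative binary function (e.g. add, max)
    Returns the per-lane values after the final step.
    """
    width = len(lanes)
    if width == 0 or width & (width - 1):
        raise ValueError("lane count must be a power of two")
    vals = list(lanes)
    offset = width // 2
    while offset >= 1:
        # Each lane i reads its partner lane i ^ offset, as a shuffle would.
        vals = [op(vals[i], vals[i ^ offset]) for i in range(width)]
        offset //= 2
    return vals


# 64 lanes holding the values 0..63.
sums = butterfly_reduce(list(range(64)), operator.add)
maxes = butterfly_reduce(list(range(64)), max)
```

With 64 lanes the loop runs six steps, and every lane ends up holding the same reduced value, so no separate broadcast is needed.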

Utilities

  • kernels.kernels_common – Shared constants and helper functions

  • kernels.layout_utils – Layout utility functions

  • kernels.mfma_epilogues – MFMA epilogue patterns (store, accumulate, scale)

  • kernels.mfma_preshuffle_pipeline – Shared MFMA preshuffle helpers (B layout builder, K32 pack loads) used by preshuffle GEMM and MoE kernels

See also

See the Pre-built Kernel Library Guide for detailed usage and configuration of each kernel.