Pre-built Kernels
=================

FlyDSL ships with a collection of pre-built GPU kernels in the ``kernels/``
directory. These serve as both ready-to-use components and reference
implementations for kernel development.

GEMM Kernels
------------

- ``kernels.preshuffle_gemm`` -- MFMA-based GEMM with LDS pipeline and pre-shuffled weights (FP8, INT8, BF16)
- ``kernels.preshuffle_gemm_flyc`` -- Preshuffle GEMM using the new ``@flyc.kernel`` API
- ``kernels.mixed_preshuffle_gemm`` -- Mixed-precision GEMM with pre-shuffled layouts
- ``kernels.blockscale_preshuffle_gemm`` -- Block-scale (MXFP4) preshuffle GEMM

MoE (Mixture-of-Experts) Kernels
--------------------------------

- ``kernels.moe_gemm_2stage`` -- MoE GEMM with 2-stage pipeline (stage1 + stage2)
- ``kernels.mixed_moe_gemm_2stage`` -- Mixed-precision MoE GEMM
- ``kernels.moe_blockscale_2stage`` -- MoE with block-scale quantization (MXFP4)
- ``kernels.moe_reduce`` -- MoE reduction kernel that sums over the topk dimension
  (``Y[t, d] = sum(X[t, :, d])``). Supports optional masking and the f16, bf16,
  and f32 data types; compiled via ``compile_moe_reduction()``.
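
The reduction formula for ``kernels.moe_reduce`` can be illustrated with a
plain-Python reference. This is a minimal sketch for checking results on the
host, not the kernel's implementation: the function name
``moe_reduce_reference`` is hypothetical, and the mask semantics (a falsy
entry skips that topk slot) are an assumption, since the masking behavior is
not specified here.

.. code-block:: python

   def moe_reduce_reference(x, mask=None):
       """Host-side reference for Y[t, d] = sum(X[t, :, d]).

       x is a nested list shaped [tokens][topk][dim].
       mask (optional, assumed semantics) is shaped [tokens][topk];
       slots whose mask entry is falsy are left out of the sum.
       """
       out = []
       for t, per_token in enumerate(x):
           dim = len(per_token[0])
           row = [0.0] * dim
           for k, vec in enumerate(per_token):
               if mask is not None and not mask[t][k]:
                   continue  # skip masked-out expert slots
               for d in range(dim):
                   row[d] += vec[d]
           out.append(row)
       return out

   # Two tokens, topk = 2, dim = 2.
   x = [[[1.0, 2.0], [3.0, 4.0]],
        [[5.0, 6.0], [7.0, 8.0]]]
   y = moe_reduce_reference(x)

A reference like this is mainly useful for comparing kernel output against a
known-good result within a tolerance suited to the chosen data type.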

Paged Attention
---------------

- ``kernels.pa_decode_fp8`` -- Paged attention decode kernel with FP8 support

Normalization
-------------

- ``kernels.layernorm_kernel`` -- Layer normalization
- ``kernels.rmsnorm_kernel`` -- RMS normalization
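
For orientation, the two normalizations can be written as row-wise reference
functions in plain Python. This is a sketch of the standard definitions only;
the function names, the ``eps`` default, and the presence of ``gamma`` and
``beta`` parameters are assumptions, and the actual kernel signatures may
differ.

.. code-block:: python

   import math

   def layernorm_reference(row, gamma, beta, eps=1e-5):
       """Standard layer norm over one row: center, scale by 1/std, then affine."""
       n = len(row)
       mean = sum(row) / n
       var = sum((v - mean) ** 2 for v in row) / n  # biased (population) variance
       inv_std = 1.0 / math.sqrt(var + eps)
       return [(v - mean) * inv_std * g + b for v, g, b in zip(row, gamma, beta)]

   def rmsnorm_reference(row, gamma, eps=1e-5):
       """Standard RMS norm over one row: scale by 1/rms, no mean subtraction."""
       n = len(row)
       mean_sq = sum(v * v for v in row) / n
       inv_rms = 1.0 / math.sqrt(mean_sq + eps)
       return [v * inv_rms * g for v, g in zip(row, gamma)]

   ln = layernorm_reference([1.0, 2.0, 3.0, 4.0], [1.0] * 4, [0.0] * 4)
   rn = rmsnorm_reference([3.0, 4.0], [1.0, 1.0], eps=0.0)

The practical difference is that RMS normalization skips the mean
subtraction, which removes one reduction pass over the row.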

Softmax
-------

- ``kernels.softmax_kernel`` -- Numerically stable softmax
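
"Numerically stable" usually means subtracting the row maximum before
exponentiating, so that the largest exponent is zero and ``exp`` cannot
overflow. The sketch below shows that technique in plain Python; the function
name is hypothetical, and whether the kernel uses exactly this formulation is
an assumption.

.. code-block:: python

   import math

   def stable_softmax(row):
       """Softmax over one row using the max-subtraction trick."""
       m = max(row)                             # shift so the largest input maps to exp(0)
       exps = [math.exp(v - m) for v in row]    # every exponent is <= 0, so no overflow
       total = sum(exps)
       return [e / total for e in exps]

   # math.exp(1000.0) alone would raise OverflowError; the shifted form does not.
   probs = stable_softmax([1000.0, 1000.0])

The shift does not change the result mathematically, because the common
factor ``exp(-m)`` cancels between numerator and denominator.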

Reduction
---------

- ``kernels.reduce`` -- Warp-level reduction utilities (``warp_reduce_sum``, ``warp_reduce_max``)
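
One common pattern for warp-level reductions is a butterfly (XOR) exchange,
in which each lane combines its value with a partner lane at a halving
offset. The host-side simulation below illustrates that pattern only; it is
not FlyDSL code, the function name is hypothetical, and whether
``warp_reduce_sum`` and ``warp_reduce_max`` use this particular exchange
scheme is an assumption.

.. code-block:: python

   def simulate_warp_reduce(lane_values, combine):
       """Simulate a butterfly reduction across the lanes of one warp.

       After log2(width) steps, every lane holds the full reduction.
       """
       width = len(lane_values)
       assert width > 0 and width & (width - 1) == 0, "lane count must be a power of two"
       vals = list(lane_values)
       offset = width // 2
       while offset > 0:
           # Each lane reads its partner (lane XOR offset) from the previous step.
           vals = [combine(vals[lane], vals[lane ^ offset]) for lane in range(width)]
           offset //= 2
       return vals

   lanes = list(range(64))  # 64 lanes chosen for illustration
   sums = simulate_warp_reduce(lanes, lambda a, b: a + b)
   maxes = simulate_warp_reduce(lanes, max)

Because the combine step is passed in, the same exchange structure covers
both sum and max reductions.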

Utilities
---------

- ``kernels.kernels_common`` -- Shared constants and helper functions
- ``kernels.layout_utils`` -- Layout utility functions
- ``kernels.mfma_epilogues`` -- MFMA epilogue patterns (store, accumulate, scale)
- ``kernels.mfma_preshuffle_pipeline`` -- Shared MFMA preshuffle helpers (B layout builder, K32 pack loads) used by preshuffle GEMM and MoE kernels

.. seealso:: :doc:`../prebuilt_kernels_guide` for detailed usage and configuration of each kernel.