Compiler & Pipeline

FlyDSL includes a JIT compiler that traces Python kernel functions into MLIR and lowers them through the Fly dialect pipeline to GPU binaries.

@flyc.kernel and @flyc.jit

The primary API for defining and compiling kernels:

import flydsl.compiler as flyc
import flydsl.expr as fx

@flyc.kernel
def my_kernel(A: fx.Tensor, B: fx.Tensor, n: fx.Constexpr[int]):
    tid = fx.thread_idx.x
    bid = fx.block_idx.x
    # ... kernel body using layout ops ...

@flyc.jit
def launch(A: fx.Tensor, B: fx.Tensor, n: fx.Constexpr[int],
           stream: fx.Stream = fx.Stream(None)):
    grid_x = (n + 255) // 256  # enough 256-thread blocks to cover n elements
    my_kernel(A, B, n).launch(
        grid=(grid_x, 1, 1),
        block=(256, 1, 1),
        stream=stream,
    )

  • @flyc.kernel compiles the function body into a gpu.func inside a gpu.module. It uses AST rewriting to trace Python code into MLIR IR.

  • @flyc.jit wraps a host-side function that constructs and launches kernels. On first call it triggers JIT compilation; subsequent calls with the same type signature use a cached compiled artifact.
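
The tracing idea behind @flyc.kernel can be illustrated with a self-contained sketch (the names Value and kernel below are hypothetical, not FlyDSL's implementation): a decorator calls the function on proxy values whose operators record IR ops instead of computing results.

```python
# Toy decorator-based tracer (hypothetical names, not the real FlyDSL internals).

class Value:
    """Proxy value: arithmetic on it records ops instead of computing them."""
    def __init__(self, name, trace):
        self.name = name
        self.trace = trace

    def __add__(self, other):
        result = Value(f"v{len(self.trace)}", self.trace)
        # Record an op in program order rather than evaluating it.
        self.trace.append(("arith.addi", self.name, other.name, result.name))
        return result

def kernel(fn):
    """Toy @kernel: run the function on proxies, return the recorded ops."""
    def trace_fn(*arg_names):
        ops = []
        args = [Value(n, ops) for n in arg_names]
        fn(*args)
        return ops
    return trace_fn

@kernel
def add3(a, b, c):
    a + b + c  # traced, not executed

ops = add3("a", "b", "c")
# ops holds two arith.addi records: (a + b) -> v0, (v0 + c) -> v1
```

The real compiler rewrites the AST rather than relying on operator overloading alone, but the effect is the same: Python control flow drives construction of an MLIR op list.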

Compilation Flow

On first call, @flyc.jit runs the following pipeline:

  1. AST Rewriting: The Python source is parsed and rewritten to emit MLIR ops.

  2. MLIR Module Construction: The kernel body is traced into gpu, arith, scf, memref, and fly dialect ops.

  3. Fly Pass Pipeline: The module is lowered through a series of MLIR passes:

    • gpu-kernel-outlining

    • fly-canonicalize

    • fly-layout-lowering

    • convert-fly-to-rocdl

    • canonicalize + cse

    • gpu.module(convert-gpu-to-rocdl{...})

    • rocdl-attach-target{chip=gfxNNN}

    • gpu-to-llvm, convert-arith/func-to-llvm

    • gpu-module-to-binary{format=fatbin}

  4. Cached Artifact: The compiled binary is cached to disk (~/.flydsl/cache/) keyed by the compiler toolchain hash and kernel type signature.
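
The cache key in step 4 can be thought of as a stable hash over the toolchain and the call's type signature. A rough sketch, assuming a simple hash-to-filename scheme (the field layout and file naming here are illustrative, not FlyDSL's actual internals):

```python
import hashlib
import os

def cache_path(toolchain_hash: str, type_signature: tuple) -> str:
    """Derive a per-kernel cache file path from toolchain + type signature.
    Illustrative only; the real key layout is internal to FlyDSL."""
    key = hashlib.sha256(
        (toolchain_hash + "|" + repr(type_signature)).encode()
    ).hexdigest()
    return os.path.join(os.path.expanduser("~/.flydsl/cache"), key + ".bin")

p = cache_path("rocm-6.1-abcdef", ("Tensor<f32>", "Tensor<f32>", "int"))
```

Because the toolchain hash is part of the key, upgrading the compiler invalidates stale binaries automatically; a new type signature likewise forces a fresh compile.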

Tensor Arguments

Use flyc.from_dlpack to convert PyTorch tensors into FlyDSL tensor descriptors with layout metadata:

import flydsl.compiler as flyc

tA = flyc.from_dlpack(torch_tensor).mark_layout_dynamic(
    leading_dim=0, divisibility=4
)
tB = flyc.from_dlpack(other_tensor)  # convert every tensor argument
launch(tA, tB, n, stream=torch.cuda.Stream())
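
One plausible reading of the divisibility hint is a host-side invariant: the leading dimension's extent must be a multiple of the stated value, which is what lets the compiler emit vectorized accesses along that dimension. A sketch of that check (the helper name is hypothetical, and this is an assumed interpretation, not FlyDSL's documented contract):

```python
def check_divisibility(shape, leading_dim: int, divisibility: int) -> bool:
    """Hypothetical helper: True if the leading dimension's extent is a
    multiple of `divisibility`. Assumed interpretation of the hint, used
    here only to illustrate the kind of invariant the compiler relies on."""
    return shape[leading_dim] % divisibility == 0

# A (1024, 512) tensor satisfies divisibility=4 on dim 0; 1023 would not.
ok = check_divisibility((1024, 512), leading_dim=0, divisibility=4)
```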

Buffer Operations

The flydsl.expr.buffer_ops module provides high-level Python wrappers for AMD CDNA3/CDNA4 buffer load/store operations. Buffer operations use a scalar base pointer (held in SGPRs) plus per-thread offsets, giving efficient global memory access with hardware bounds checking.
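
The hardware bounds checking can be modeled as: an access whose offset falls outside the buffer's extent returns zero rather than faulting. A self-contained model of that behavior (plain Python, not the FlyDSL API):

```python
def buffer_load(base: list, num_records: int, offset: int):
    """Model of a CDNA buffer load: out-of-range offsets return 0 instead
    of faulting, mirroring the hardware's bounds-checked access."""
    if 0 <= offset < num_records:
        return base[offset]
    return 0  # hardware clamps out-of-bounds buffer loads to zero

data = [10, 20, 30, 40]
in_bounds = buffer_load(data, num_records=4, offset=2)  # 30
oob = buffer_load(data, num_records=4, offset=7)        # 0, no fault
```

This is why buffer operations can skip explicit edge-of-tensor masking that plain pointer loads would require.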

ROCDL Operations

The flydsl.expr.rocdl module provides AMD-specific operations:

  • fx.rocdl.make_buffer_tensor – create buffer resource descriptor from tensor

  • fx.rocdl.BufferCopy32b / BufferCopy128b – buffer copy atoms

  • fx.rocdl.MFMA – MFMA instruction atoms (e.g., MFMA(16, 16, 4, fx.Float32))
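
An MFMA atom such as MFMA(16, 16, 4, fx.Float32) names the tile shape of one matrix-fused-multiply-add step, D = A·B + C with M=16, N=16, K=4. A reference model of that per-tile math (this sketches the shape semantics only, not the actual rocdl intrinsic):

```python
def mfma_tile(A, B, C, M=16, N=16, K=4):
    """Reference model of one MFMA step:
    D[m][n] = C[m][n] + sum_k A[m][k] * B[k][n].
    Models the 16x16x4 tile shape only, not the real hardware instruction."""
    return [[C[m][n] + sum(A[m][k] * B[k][n] for k in range(K))
             for n in range(N)]
            for m in range(M)]

A = [[1.0] * 4 for _ in range(16)]   # 16x4 operand tile
B = [[1.0] * 16 for _ in range(4)]   # 4x16 operand tile
C = [[0.0] * 16 for _ in range(16)]  # 16x16 accumulator tile
D = mfma_tile(A, B, C)
# each D element accumulates K=4 products of 1.0 -> 4.0
```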

fly-opt CLI

The fly-opt tool is a command-line interface for running MLIR passes on .mlir files:

fly-opt --fly-canonicalize input.mlir
fly-opt --fly-layout-lowering input.mlir
fly-opt --help