Compiler & Pipeline
FlyDSL includes a JIT compiler that traces Python kernel functions into MLIR and lowers them through the Fly dialect pipeline to GPU binaries.
@flyc.kernel and @flyc.jit
The primary API for defining and compiling kernels:
import flydsl.compiler as flyc
import flydsl.expr as fx

@flyc.kernel
def my_kernel(A: fx.Tensor, B: fx.Tensor, n: fx.Constexpr[int]):
    tid = fx.thread_idx.x
    bid = fx.block_idx.x
    # ... kernel body using layout ops ...

@flyc.jit
def launch(A: fx.Tensor, B: fx.Tensor, n: fx.Constexpr[int],
           stream: fx.Stream = fx.Stream(None)):
    my_kernel(A, B, n).launch(
        grid=(grid_x, 1, 1),
        block=(256, 1, 1),
        stream=stream,
    )
@flyc.kernel compiles the function body into a gpu.func inside a gpu.module. It uses AST rewriting to trace Python code into MLIR IR.
@flyc.jit wraps a host-side function that constructs and launches kernels. On first call it triggers JIT compilation; subsequent calls with the same type signature use a cached compiled artifact.
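The compile-once, reuse-per-type-signature behaviour described above can be illustrated with a small pure-Python sketch. It is not FlyDSL's implementation: `cached_by_signature`, `fake_compile`, and `fake_launch` are hypothetical names, and the "signature" here is simply the tuple of Python argument types.

```python
from typing import Any, Callable, Dict, Tuple

Signature = Tuple[type, ...]


def cached_by_signature(compile_fn: Callable[[Signature], Callable[..., Any]]):
    """Hypothetical helper: compile once per argument-type signature."""
    cache: Dict[Signature, Callable[..., Any]] = {}

    def call(*args: Any) -> Any:
        signature = tuple(type(a) for a in args)
        if signature not in cache:
            # First call with this signature: pay the compilation cost.
            cache[signature] = compile_fn(signature)
        # Later calls with the same signature reuse the cached artifact.
        return cache[signature](*args)

    return call


compile_count = 0


def fake_compile(signature: Signature) -> Callable[..., Any]:
    # Stand-in for the real JIT pipeline; it only counts invocations.
    global compile_count
    compile_count += 1
    return lambda *args: sum(args)


fake_launch = cached_by_signature(fake_compile)
fake_launch(1, 2)      # compiles for (int, int)
fake_launch(3, 4)      # same signature, cache hit
fake_launch(1.0, 2.0)  # new signature (float, float), compiles again
print(compile_count)   # 2
```

Changing only argument values does not trigger recompilation in this model; changing an argument's type does.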
Compilation Flow
On first call, @flyc.jit runs the following pipeline:
AST Rewriting: The Python source is parsed and rewritten to emit MLIR ops.
MLIR Module Construction: Kernel body is traced into gpu, arith, scf, memref, and fly dialect ops.
Fly Pass Pipeline: The module is lowered through a series of MLIR passes:
    gpu-kernel-outlining
    fly-canonicalize
    fly-layout-lowering
    convert-fly-to-rocdl
    canonicalize + cse
    gpu.module(convert-gpu-to-rocdl{...})
    rocdl-attach-target{chip=gfxNNN}
    gpu-to-llvm, convert-arith/func-to-llvm
    gpu-module-to-binary{format=fatbin}
Cached Artifact: The compiled binary is cached to disk (~/.flydsl/cache/) keyed by the compiler toolchain hash and kernel type signature.
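As a rough illustration of the last step, the sketch below derives a cache file path from a toolchain hash and a kernel type signature. Only the cache directory comes from the text above; the key format, the SHA-256 digest, the `.bin` suffix, and the signature strings are assumptions made for the example, not FlyDSL's actual scheme.

```python
import hashlib
from pathlib import Path
from typing import Sequence

# Cache location named in the text; everything else below is illustrative.
CACHE_DIR = Path.home() / ".flydsl" / "cache"


def cache_key(toolchain_hash: str, type_signature: Sequence[str]) -> str:
    """Hypothetical key: digest of the toolchain hash plus the type signature."""
    payload = toolchain_hash + "|" + "|".join(type_signature)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cache_path(toolchain_hash: str, type_signature: Sequence[str],
               cache_dir: Path = CACHE_DIR) -> Path:
    # The ".bin" suffix is an assumption for this sketch.
    return cache_dir / (cache_key(toolchain_hash, type_signature) + ".bin")


sig = ("Tensor<f32>", "Tensor<f32>", "Constexpr[int]=1024")
same_a = cache_path("toolchain-abc", sig)
same_b = cache_path("toolchain-abc", sig)
other = cache_path("toolchain-def", sig)
print(same_a == same_b)  # True: same toolchain and signature hit the same entry
print(same_a == other)   # False: a new toolchain hash invalidates the entry
```

Including the toolchain hash in the key means that upgrading the compiler naturally bypasses stale binaries rather than reusing them.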
Tensor Arguments
Use flyc.from_dlpack to convert PyTorch tensors into FlyDSL tensor
descriptors with layout metadata:
import torch
import flydsl.compiler as flyc

# torch_tensor is an existing torch.Tensor; B and n are prepared by the caller.
tA = flyc.from_dlpack(torch_tensor).mark_layout_dynamic(
    leading_dim=0, divisibility=4
)
launch(tA, B, n, stream=torch.cuda.Stream())
Buffer Operations
The flydsl.expr.buffer_ops module provides high-level Python wrappers for
AMD CDNA3/CDNA4 buffer load/store operations. Buffer operations combine a
scalar base pointer held in SGPRs with per-thread offsets, giving efficient
global memory access with hardware bounds checking.
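To make the bounds-checking behaviour concrete, here is a toy CPU model. It does not use the flydsl.expr.buffer_ops API; `BufferResource`, `buffer_load`, and `buffer_store` are hypothetical names. It assumes the commonly documented AMD buffer semantics in which an out-of-range load returns zero and an out-of-range store is dropped; the exact behaviour on real hardware depends on the descriptor settings.

```python
from typing import List


class BufferResource:
    """Toy descriptor: one shared base buffer plus a valid-record count."""

    def __init__(self, data: List[float], num_records: int) -> None:
        self.data = data                # shared by all threads (the scalar part)
        self.num_records = num_records  # accesses at or beyond this are out of range


def buffer_load(rsrc: BufferResource, offset: int) -> float:
    # In-range loads read memory; out-of-range loads yield 0 (assumed semantics).
    if 0 <= offset < rsrc.num_records:
        return rsrc.data[offset]
    return 0.0


def buffer_store(rsrc: BufferResource, offset: int, value: float) -> None:
    # Out-of-range stores are silently discarded (assumed semantics).
    if 0 <= offset < rsrc.num_records:
        rsrc.data[offset] = value


memory = [1.0, 2.0, 3.0, 4.0]
rsrc = BufferResource(memory, num_records=3)  # only the first 3 elements are valid

# Each "thread" supplies its own offset against the shared descriptor.
loaded = [buffer_load(rsrc, tid) for tid in range(5)]
print(loaded)     # [1.0, 2.0, 3.0, 0.0, 0.0]

buffer_store(rsrc, 3, 99.0)  # out of range, ignored
print(memory[3])  # 4.0
```

Because the check lives in the access itself, a kernel modelled this way needs no explicit `if tid < n` guard around tail elements.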
ROCDL Operations
The flydsl.expr.rocdl module provides AMD-specific operations:
fx.rocdl.make_buffer_tensor – create buffer resource descriptor from tensor
fx.rocdl.BufferCopy32b / BufferCopy128b – buffer copy atoms
fx.rocdl.MFMA – MFMA instruction atoms (e.g., MFMA(16, 16, 4, fx.Float32))
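For orientation, the following pure-Python reference shows the arithmetic a 16x16x4 matrix fused multiply-add tile performs, D = A·B + C. It assumes the three integers in MFMA(16, 16, 4, fx.Float32) denote the M, N, and K tile dimensions; that reading, and the helper name `mfma_reference`, are assumptions for this sketch and not part of the FlyDSL API.

```python
from typing import List

Matrix = List[List[float]]


def mfma_reference(a: Matrix, b: Matrix, c: Matrix,
                   m: int = 16, n: int = 16, k: int = 4) -> Matrix:
    """Compute D = A @ B + C for an m x k tile A, k x n tile B, m x n accumulator C."""
    d = [row[:] for row in c]  # copy so the input accumulator is left untouched
    for i in range(m):
        for j in range(n):
            acc = c[i][j]
            for p in range(k):
                acc += a[i][p] * b[p][j]
            d[i][j] = acc
    return d


M, N, K = 16, 16, 4
A = [[1.0] * K for _ in range(M)]  # 16 x 4
B = [[2.0] * N for _ in range(K)]  # 4 x 16
C = [[0.5] * N for _ in range(M)]  # 16 x 16 accumulator

D = mfma_reference(A, B, C)
print(D[0][0])  # 8.5 = 0.5 + 4 * (1.0 * 2.0)
```

On the GPU the whole tile is produced cooperatively by one wavefront in a single instruction; the triple loop here only documents the result.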
fly-opt CLI
The fly-opt tool is a command-line interface for running MLIR passes on
.mlir files:
fly-opt --fly-canonicalize input.mlir
fly-opt --fly-layout-lowering input.mlir
fly-opt --help
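When fly-opt is driven from a script, the invocations above can be wrapped in a small Python helper. This sketch uses only the flags shown in this section. Passing several pass flags in one invocation and reading the transformed module from standard output follow the usual mlir-opt conventions and are assumptions here; `fly_opt_argv` and `run_fly_opt` are hypothetical names.

```python
import shutil
import subprocess
from typing import List, Sequence


def fly_opt_argv(passes: Sequence[str], input_path: str) -> List[str]:
    """Build the argument list, e.g. ['fly-opt', '--fly-canonicalize', 'input.mlir']."""
    return ["fly-opt", *[f"--{name}" for name in passes], input_path]


def run_fly_opt(passes: Sequence[str], input_path: str) -> str:
    """Run fly-opt and return its standard output (assumed to hold the result IR)."""
    if shutil.which("fly-opt") is None:
        raise FileNotFoundError("fly-opt was not found on PATH")
    result = subprocess.run(
        fly_opt_argv(passes, input_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if a pass fails
    )
    return result.stdout


print(fly_opt_argv(["fly-canonicalize"], "input.mlir"))
# ['fly-opt', '--fly-canonicalize', 'input.mlir']
```

Building the argument list separately from running it keeps the command easy to log and to test without the tool installed.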