Compiler & Pipeline

FlyDSL includes a JIT compiler that traces Python kernel functions into MLIR and lowers them through the Fly dialect pipeline to GPU binaries.

@flyc.kernel and @flyc.jit

The primary API for defining and compiling kernels:

import flydsl.compiler as flyc
import flydsl.expr as fx

@flyc.kernel
def my_kernel(A: fx.Tensor, B: fx.Tensor, n: fx.Constexpr[int]):
    tid = fx.thread_idx.x
    bid = fx.block_idx.x
    # ... kernel body using layout ops ...

@flyc.jit
def launch(A: fx.Tensor, B: fx.Tensor, n: fx.Constexpr[int],
           stream: fx.Stream = fx.Stream(None)):
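    # Assumption: one thread per element, so round n up to whole 256-thread blocks.
    grid_x = (n + 255) // 256
    my_kernel(A, B, n).launch(
        grid=(grid_x, 1, 1),
        block=(256, 1, 1),
        stream=stream,
    )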

  • @flyc.kernel compiles the function body into a gpu.func inside a gpu.module. It uses AST rewriting to trace the Python code into MLIR.

  • @flyc.jit wraps a host-side function that constructs and launches kernels. On first call it triggers JIT compilation; subsequent calls with the same type signature use a cached compiled artifact.
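The caching behavior described for @flyc.jit can be illustrated with a toy decorator. This is only a sketch of the idea, not FlyDSL's implementation: `jit_sketch` and `compile_count` are hypothetical names, and the cache key here is assumed to be just the tuple of argument type names.

```python
import functools

def jit_sketch(fn):
    """Toy stand-in for a JIT decorator: 'compile' once per argument-type signature."""
    cache = {}

    @functools.wraps(fn)
    def wrapper(*args):
        # Assumed cache key: the names of the argument types.
        key = tuple(type(a).__name__ for a in args)
        if key not in cache:
            # A real JIT would trace, lower, and store a compiled binary here.
            wrapper.compile_count += 1
            cache[key] = fn
        return cache[key](*args)

    wrapper.compile_count = 0
    return wrapper

@jit_sketch
def add(a, b):
    return a + b

add(1, 2)       # first (int, int) call: triggers "compilation"
add(3, 4)       # same signature: served from the cache
add(1.0, 2.0)   # new (float, float) signature: compiles again
```

After these three calls the counter has advanced twice, once per distinct signature.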

Compilation Flow

On first call, @flyc.jit runs the following pipeline:

  1. AST Rewriting: The Python source is parsed and rewritten to emit MLIR ops.

  2. MLIR Module Construction: The kernel body is traced into fly, gpu, arith, scf, memref, and vector dialect ops.

  3. Fly Pass Pipeline: The module is lowered through three pass stages, defined in RocmBackend._pipeline_parts() (python/flydsl/compiler/backends/rocm.py). See Architecture & Compilation Pipeline Guide §3 for the per-pass table.

    1. pre_binary_fragments (Fly → ROCDL):

      • fly-rewrite-func-signature

      • fly-canonicalize

      • fly-layout-lowering

      • fly-int-swizzle-simplify

      • canonicalize

      • fly-convert-atom-call-to-ssa-form

      • fly-promote-regmem-to-vectorssa

      • convert-fly-to-rocdl

      • canonicalize

      • gpu.module(convert-scf-to-cf, cse, convert-gpu-to-rocdl{chipset=gfxNNN ...}, fly-rocdl-cluster-attr)

    2. binary_prep_fragments (→ LLVM):

      • rocdl-attach-target{chip=gfxNNN ...}

      • convert-scf-to-cf

      • convert-cf-to-llvm

      • gpu-to-llvm{use-bare-pointers-...=true}

      • convert-vector-to-llvm

      • convert-arith-to-llvm

      • convert-func-to-llvm

      • reconcile-unrealized-casts

      • ensure-debug-info-scope-on-llvm-func (optional, gated by FLYDSL_DEBUG_ENABLE_DEBUG_INFO)

    3. binary_fragment:

      • gpu-module-to-binary{format=fatbin opts="..."}

  4. Cached Artifact: The compiled binary is cached to disk (~/.flydsl/cache/) keyed by the compiler toolchain hash and kernel type signature.
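The last two steps can be sketched as follows: the three stage fragments are joined into one MLIR textual pass pipeline, and a cache location is derived from the toolchain hash and type signature. The helper names `build_pipeline` and `cache_path`, the abbreviated pass lists, and the SHA-256 key format are all assumptions made for illustration; the actual logic lives in RocmBackend and may differ.

```python
import hashlib
from pathlib import Path

# Abbreviated stage lists for illustration; options on the passes are omitted.
PRE_BINARY = ["fly-canonicalize", "fly-layout-lowering", "convert-fly-to-rocdl"]
BINARY_PREP = ["convert-scf-to-cf", "convert-func-to-llvm", "reconcile-unrealized-casts"]
BINARY = ["gpu-module-to-binary{format=fatbin}"]

def build_pipeline(*stages):
    """Join stage fragments into one MLIR textual pass pipeline string."""
    passes = [p for stage in stages for p in stage]
    return "builtin.module(" + ",".join(passes) + ")"

def cache_path(toolchain_hash, type_signature, root=None):
    """Hypothetical cache key: SHA-256 over toolchain hash and type signature."""
    root = Path.home() / ".flydsl" / "cache" if root is None else Path(root)
    digest = hashlib.sha256(f"{toolchain_hash}|{type_signature}".encode()).hexdigest()
    return root / digest

pipeline = build_pipeline(PRE_BINARY, BINARY_PREP, BINARY)
path_a = cache_path("toolchain-abc", "(Tensor[f32], Tensor[f32], 1024)")
path_b = cache_path("toolchain-abc", "(Tensor[f32], Tensor[f32], 2048)")
```

Because the key covers both inputs, changing either the toolchain or the signature yields a different cache entry, which is why a toolchain upgrade would not reuse stale binaries under this scheme.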

Tensor Arguments

Use flyc.from_dlpack to convert PyTorch tensors into FlyDSL tensor descriptors with layout metadata:

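import torch
import flydsl.compiler as flyc

# torch_tensor, B, and n are assumed to be defined by the caller;
# launch is the @flyc.jit function defined earlier.
tA = flyc.from_dlpack(torch_tensor).mark_layout_dynamic(
    leading_dim=0, divisibility=4
)
launch(tA, B, n, stream=torch.cuda.Stream())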

Buffer Operations

The flydsl.expr.buffer_ops module provides high-level Python wrappers for AMD CDNA3/CDNA4 buffer load/store operations. Buffer operations combine a scalar base pointer (held in SGPRs) with per-thread offsets, giving efficient global memory access with hardware bounds checking.
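To make the bounds-checking behavior concrete, here is a plain-Python model of a buffer load and store. It does not use flydsl.expr.buffer_ops; the function names are hypothetical, offsets are counted in elements rather than bytes for simplicity, and it assumes the usual AMD buffer semantics in which out-of-range loads return zero and out-of-range stores are dropped.

```python
def buffer_load(memory, base, offsets, num_records):
    """Model a buffer load: one shared base, one offset per thread.

    Lanes whose offset falls outside [0, num_records) read 0 instead of faulting.
    """
    return [memory[base + off] if 0 <= off < num_records else 0 for off in offsets]

def buffer_store(memory, base, offsets, values, num_records):
    """Model a buffer store: out-of-range lanes are silently dropped."""
    for off, val in zip(offsets, values):
        if 0 <= off < num_records:
            memory[base + off] = val

memory = list(range(10, 18))                        # 8 elements: 10..17
loaded = buffer_load(memory, 0, [0, 3, 8, 100], 8)  # last two lanes are out of range
buffer_store(memory, 0, [1, 9], [-1, -2], 8)        # second lane is out of range
```

In this model a kernel can issue loads for a full tile without a per-thread `if` guard on the tail, since the bounds check is folded into the access itself.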

ROCDL Operations

The flydsl.expr.rocdl module provides AMD-specific operations:

  • fx.rocdl.make_buffer_tensor – creates a buffer resource descriptor from a tensor

  • fx.rocdl.BufferCopy32b / BufferCopy128b – buffer copy atoms

  • fx.rocdl.MFMA – MFMA instruction atoms (e.g., MFMA(16, 16, 4, fx.Float32))
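As a reference for what an MFMA atom computes, the sketch below gives the matrix fused multiply-add D = A·B + C in plain Python. It assumes that the arguments in MFMA(16, 16, 4, fx.Float32) denote the M, N, K tile sizes and the element type; that reading is not confirmed by the text above, and `mfma_reference` is a hypothetical helper rather than part of flydsl.expr.rocdl.

```python
def mfma_reference(A, B, C, m=16, n=16, k=4):
    """Reference D = A @ B + C for an m x k by k x n tile (pure Python)."""
    return [
        [C[i][j] + sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]

# All-ones inputs: every output element accumulates k products of 1.0 * 1.0.
A = [[1.0] * 4 for _ in range(16)]    # 16 x 4
B = [[1.0] * 16 for _ in range(4)]    # 4 x 16
C = [[0.0] * 16 for _ in range(16)]   # 16 x 16 accumulator
D = mfma_reference(A, B, C)
```

On hardware the tile is distributed across the lanes of a wavefront; this model ignores that distribution and shows only the arithmetic.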

fly-opt CLI

The fly-opt tool is a command-line interface for running MLIR passes on .mlir files:

fly-opt --fly-canonicalize input.mlir
fly-opt --fly-layout-lowering input.mlir
fly-opt --help