Compiler & Pipeline

FlyDSL includes a JIT compiler that traces Python kernel functions into MLIR and lowers them through the Fly dialect pipeline to GPU binaries.

@flyc.kernel and @flyc.jit

The primary API for defining and compiling kernels:

import flydsl.compiler as flyc
import flydsl.expr as fx

@flyc.kernel
def my_kernel(A: fx.Tensor, B: fx.Tensor, n: fx.Constexpr[int]):
    tid = fx.thread_idx.x
    bid = fx.block_idx.x
    # ... kernel body using layout ops ...

@flyc.jit
def launch(A: fx.Tensor, B: fx.Tensor, n: fx.Constexpr[int],
           stream: fx.Stream = fx.Stream(None)):
    grid_x = (n + 255) // 256  # enough 256-thread blocks to cover n elements
    my_kernel(A, B, n).launch(
        grid=(grid_x, 1, 1),
        block=(256, 1, 1),
        stream=stream,
    )

  • @flyc.kernel compiles the function body into a gpu.func inside a gpu.module. It uses AST rewriting to trace Python code into MLIR IR.

  • @flyc.jit wraps a host-side function that constructs and launches kernels. On first call it triggers JIT compilation; subsequent calls with the same type signature use a cached compiled artifact.
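
The tracing idea behind @flyc.kernel can be illustrated with a self-contained sketch (the names Value and kernel below are hypothetical, not FlyDSL's implementation): a decorator calls the function on proxy values whose operators record IR ops instead of computing results.

```python
# Toy decorator-based tracer (hypothetical names, not the real FlyDSL internals).

class Value:
    """Proxy value: arithmetic on it records ops instead of computing them."""
    def __init__(self, name, trace):
        self.name = name
        self.trace = trace

    def __add__(self, other):
        result = Value(f"v{len(self.trace)}", self.trace)
        # Record an op in program order rather than evaluating it.
        self.trace.append(("arith.addi", self.name, other.name, result.name))
        return result

def kernel(fn):
    """Toy @kernel: run the function on proxies, return the recorded ops."""
    def trace_fn(*arg_names):
        ops = []
        args = [Value(n, ops) for n in arg_names]
        fn(*args)
        return ops
    return trace_fn

@kernel
def add3(a, b, c):
    a + b + c  # traced, not executed

ops = add3("a", "b", "c")
# ops holds two arith.addi records: (a + b) -> v0, (v0 + c) -> v1
```

The real compiler rewrites the AST rather than relying on operator overloading alone, but the effect is the same: Python control flow drives construction of an MLIR op list.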

Compilation Flow

On first call, @flyc.jit runs the following pipeline:

  1. AST Rewriting: The Python source is parsed and rewritten to emit MLIR ops.

  2. MLIR Module Construction: The kernel body is traced into gpu, arith, scf, memref, and fly dialect ops.

  3. Fly Pass Pipeline: The module is lowered through a series of MLIR passes:

    • gpu-kernel-outlining

    • fly-canonicalize

    • fly-layout-lowering

    • convert-fly-to-rocdl

    • canonicalize + cse

    • gpu.module(convert-gpu-to-rocdl{...})

    • rocdl-attach-target{chip=gfxNNN}

    • gpu-to-llvm, convert-arith/func-to-llvm

    • gpu-module-to-binary{format=fatbin}

  4. Cached Artifact: The compiled binary is cached to disk (~/.flydsl/cache/) keyed by the compiler toolchain hash and kernel type signature.
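
The cache key in step 4 can be thought of as a stable hash over the toolchain and the call's type signature. A rough sketch, assuming a simple hash-to-filename scheme (the field layout and file naming here are illustrative, not FlyDSL's actual internals):

```python
import hashlib
import os

def cache_path(toolchain_hash: str, type_signature: tuple) -> str:
    """Derive a per-kernel cache file path from toolchain + type signature.
    Illustrative only; the real key layout is internal to FlyDSL."""
    key = hashlib.sha256(
        (toolchain_hash + "|" + repr(type_signature)).encode()
    ).hexdigest()
    return os.path.join(os.path.expanduser("~/.flydsl/cache"), key + ".bin")

p = cache_path("rocm-6.1-abcdef", ("Tensor<f32>", "Tensor<f32>", "int"))
```

Because the toolchain hash is part of the key, upgrading the compiler invalidates stale binaries automatically; a new type signature likewise forces a fresh compile.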

Tensor Arguments

Use flyc.from_dlpack to convert PyTorch tensors into FlyDSL tensor descriptors with layout metadata:

import flydsl.compiler as flyc

tA = flyc.from_dlpack(torch_tensor).mark_layout_dynamic(
    leading_dim=0, divisibility=4
)
tB = flyc.from_dlpack(other_tensor)  # convert every tensor argument
launch(tA, tB, n, stream=torch.cuda.Stream())
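
One plausible reading of the divisibility hint is a host-side invariant: the leading dimension's extent must be a multiple of the stated value, which is what lets the compiler emit vectorized accesses along that dimension. A sketch of that check (the helper name is hypothetical, and this is an assumed interpretation, not FlyDSL's documented contract):

```python
def check_divisibility(shape, leading_dim: int, divisibility: int) -> bool:
    """Hypothetical helper: True if the leading dimension's extent is a
    multiple of `divisibility`. Assumed interpretation of the hint, used
    here only to illustrate the kind of invariant the compiler relies on."""
    return shape[leading_dim] % divisibility == 0

# A (1024, 512) tensor satisfies divisibility=4 on dim 0; 1023 would not.
ok = check_divisibility((1024, 512), leading_dim=0, divisibility=4)
```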

Buffer Operations

The flydsl.expr.buffer_ops module provides high-level Python wrappers for AMD CDNA3/CDNA4 buffer load/store operations. Buffer operations use a scalar base pointer (held in SGPRs) plus per-thread offsets, giving efficient global memory access with hardware bounds checking.
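
The hardware bounds checking can be modeled as: an access whose offset falls outside the buffer's extent returns zero rather than faulting. A self-contained model of that behavior (plain Python, not the FlyDSL API):

```python
def buffer_load(base: list, num_records: int, offset: int):
    """Model of a CDNA buffer load: out-of-range offsets return 0 instead
    of faulting, mirroring the hardware's bounds-checked access."""
    if 0 <= offset < num_records:
        return base[offset]
    return 0  # hardware clamps out-of-bounds buffer loads to zero

data = [10, 20, 30, 40]
in_bounds = buffer_load(data, num_records=4, offset=2)  # 30
oob = buffer_load(data, num_records=4, offset=7)        # 0, no fault
```

This is why buffer operations can skip explicit edge-of-tensor masking that plain pointer loads would require.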

ROCDL Operations

The flydsl.expr.rocdl module provides AMD-specific operations:

  • fx.rocdl.make_buffer_tensor – create buffer resource descriptor from tensor

  • fx.rocdl.BufferCopy32b / BufferCopy128b – buffer copy atoms

  • fx.rocdl.MFMA – MFMA instruction atoms (e.g., MFMA(16, 16, 4, fx.Float32))
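
An MFMA atom such as MFMA(16, 16, 4, fx.Float32) names the tile shape of one matrix-fused-multiply-add step, D = A·B + C with M=16, N=16, K=4. A reference model of that per-tile math (this sketches the shape semantics only, not the actual rocdl intrinsic):

```python
def mfma_tile(A, B, C, M=16, N=16, K=4):
    """Reference model of one MFMA step:
    D[m][n] = C[m][n] + sum_k A[m][k] * B[k][n].
    Models the 16x16x4 tile shape only, not the real hardware instruction."""
    return [[C[m][n] + sum(A[m][k] * B[k][n] for k in range(K))
             for n in range(N)]
            for m in range(M)]

A = [[1.0] * 4 for _ in range(16)]   # 16x4 operand tile
B = [[1.0] * 16 for _ in range(4)]   # 4x16 operand tile
C = [[0.0] * 16 for _ in range(16)]  # 16x16 accumulator tile
D = mfma_tile(A, B, C)
# each D element accumulates K=4 products of 1.0 -> 4.0
```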

fly-opt CLI

The fly-opt tool is a command-line interface for running MLIR passes on .mlir files:

fly-opt --fly-canonicalize input.mlir
fly-opt --fly-layout-lowering input.mlir
fly-opt --help