Quick Start

This guide walks through writing, compiling, and running a simple GPU kernel with FlyDSL.

A Minimal Vector Add Kernel

The following example demonstrates the core FlyDSL workflow: define a kernel with @flyc.kernel, use layout algebra to partition data, then launch with @flyc.jit.

import torch
import flydsl.compiler as flyc
import flydsl.expr as fx

@flyc.kernel
def vectorAddKernel(
    A: fx.Tensor, B: fx.Tensor, C: fx.Tensor,
    block_dim: fx.Constexpr[int],
):
    bid = fx.block_idx.x
    tid = fx.thread_idx.x

    # Partition tensors by block using layout algebra
    tA = fx.logical_divide(A, fx.make_layout(block_dim, 1))
    tB = fx.logical_divide(B, fx.make_layout(block_dim, 1))
    tC = fx.logical_divide(C, fx.make_layout(block_dim, 1))

    tA = fx.slice(tA, (None, bid))
    tB = fx.slice(tB, (None, bid))
    tC = fx.slice(tC, (None, bid))

    # Allocate register fragments, load, compute, store
    RABTy = fx.MemRefType.get(fx.T.f32(), fx.LayoutType.get(1, 1),
                              fx.AddressSpace.Register)
    copyAtom = fx.make_copy_atom(fx.UniversalCopy32b(), fx.Float32)
    rA = fx.memref_alloca(RABTy, fx.make_layout(1, 1))
    rB = fx.memref_alloca(RABTy, fx.make_layout(1, 1))
    rC = fx.memref_alloca(RABTy, fx.make_layout(1, 1))

    fx.copy_atom_call(copyAtom, fx.slice(tA, (None, tid)), rA)
    fx.copy_atom_call(copyAtom, fx.slice(tB, (None, tid)), rB)

    vC = fx.arith.addf(fx.memref_load_vec(rA), fx.memref_load_vec(rB))
    fx.memref_store_vec(vC, rC)
    fx.copy_atom_call(copyAtom, rC, fx.slice(tC, (None, tid)))

@flyc.jit
def vectorAdd(
    A: fx.Tensor, B: fx.Tensor, C,
    n: fx.Int32,
    const_n: fx.Constexpr[int],
    stream: fx.Stream = fx.Stream(None),
):
    block_dim = 64
    grid_x = (n + block_dim - 1) // block_dim
    vectorAddKernel(A, B, C, block_dim).launch(
        grid=(grid_x, 1, 1), block=[block_dim, 1, 1], stream=stream,
    )

# Usage
n = 128
A = torch.randint(0, 10, (n,), dtype=torch.float32).cuda()
B = torch.randint(0, 10, (n,), dtype=torch.float32).cuda()
C = torch.zeros(n, dtype=torch.float32).cuda()
vectorAdd(A, B, C, n, n + 1, stream=torch.cuda.Stream())
torch.cuda.synchronize()
print("Result correct:", torch.allclose(C, A + B))

See examples/01-vectorAdd.py for the complete implementation with CUDA Graph capture support.

Key Concepts

  1. @flyc.kernel: Decorator that compiles a Python function into GPU IR.

  2. @flyc.jit: Decorator for host-side JIT functions that launch kernels.

  3. fx.Tensor / fx.Constexpr: Type annotations for kernel/JIT arguments.

  4. Layout algebra: make_layout, logical_divide, zipped_divide, raked_product – express data partitioning across the GPU hierarchy.

  5. Copy atoms: make_copy_atom, make_tiled_copy – vectorized data movement with layout-aware partitioning.

  6. Automatic caching: Compiled kernels are cached to disk (~/.flydsl/cache/) and reused on subsequent calls.

Compilation Pipeline

On first call, @flyc.jit traces the Python function into an MLIR module, then compiles it through the Fly MLIR pipeline:

Python Function (@flyc.kernel / @flyc.jit)
        │
        ▼  AST Rewriting + Tracing
   MLIR Module (gpu, arith, scf, memref dialects)
        │
        ▼  MlirCompiler.compile()
   ┌────────────────────────────────────────────────┐
   │  gpu-kernel-outlining                          │
   │  fly-canonicalize                              │
   │  fly-layout-lowering                           │
   │  convert-fly-to-rocdl                          │
   │  canonicalize + cse                            │
   │  gpu.module(convert-gpu-to-rocdl{...})         │
   │  rocdl-attach-target{chip=gfxNNN}              │
   │  gpu-to-llvm → convert-arith/func-to-llvm      │
   │  gpu-module-to-binary{format=fatbin}           │
   └────────────────────────────────────────────────┘
        │
        ▼
   Cached Compiled Artifact (ExecutionEngine)

AOT Pre-compilation

FlyDSL supports ahead-of-time (AOT) compilation of kernels for deployment without JIT overhead. The tests/python/examples/aot_example.py script demonstrates pre-compiling preshuffle GEMM kernels into a cache directory:

# Pre-compile with default configurations (auto-detect GPU arch)
python tests/python/examples/aot_example.py

# Pre-compile and verify by running kernels on GPU
python tests/python/examples/aot_example.py --run_kernel

# Custom cache directory
FLYDSL_RUNTIME_CACHE_DIR=/my/cache python tests/python/examples/aot_example.py

At runtime, compiled kernels are loaded from the cache automatically when FLYDSL_RUNTIME_CACHE_DIR is set.

Next Steps