Basic Usage

This tutorial covers the fundamentals of using FlyDSL to write and run GPU kernels.

Setting Up the Environment

After installing FlyDSL, ensure the Python path is configured:

export PYTHONPATH=$(pwd)/build-fly/python_packages:$(pwd):$PYTHONPATH

Understanding Layouts

Layouts are the core abstraction in FlyDSL. A layout maps logical coordinates to physical memory indices using a (Shape, Stride) pair.

Note

All fx.* operations (make_layout, make_shape, etc.) generate MLIR IR under the hood and must be called inside a @flyc.kernel or @flyc.jit function. They cannot be used in bare Python scripts.

import flydsl.compiler as flyc
import flydsl.expr as fx

@flyc.kernel
def my_kernel(data: fx.Tensor):
    # Create a 2D layout: 8 rows x 16 columns, column-major
    layout = fx.make_layout((8, 16), (1, 8))
    # Index = dot(Coord, Stride) = i*1 + j*8

Layout operations (all used inside kernel/jit functions):

fx.size(layout) – total number of elements
fx.rank(layout) – number of dimensions
fx.crd2idx(coord, layout) – coordinate to linear index
fx.idx2crd(index, layout) – linear index to coordinate

Defining a Kernel

Kernels are defined using the @flyc.kernel decorator:

import flydsl.compiler as flyc
import flydsl.expr as fx

@flyc.kernel
def my_kernel(
    data: fx.Tensor,
    n: fx.Constexpr[int],
):
    tid = fx.thread_idx.x
    bid = fx.block_idx.x
    # ... kernel body using layout ops

Key points:

@flyc.kernel compiles the function body into GPU IR via AST rewriting
fx.Tensor denotes a GPU tensor argument
fx.Constexpr[int] denotes a compile-time constant (affects cache key)
GPU intrinsics are accessed via fx.thread_idx, fx.block_idx, etc.

Launching Kernels

Kernels are launched via @flyc.jit host-side functions:

@flyc.jit
def launch(
    data: fx.Tensor,
    n: fx.Constexpr[int],
    stream: fx.Stream = fx.Stream(None),
):
    my_kernel(data, n).launch(
        grid=(grid_x, 1, 1),
        block=(256, 1, 1),
        stream=stream,
    )

# Usage with PyTorch
import torch
data = torch.randn(1024, dtype=torch.float32).cuda()
launch(data, 1024, stream=torch.cuda.Stream())
torch.cuda.synchronize()

Next Steps

Kernel Development – advanced kernel techniques
Layout Algebra Guide – deep dive into the layout system
Kernel Authoring Guide – comprehensive kernel authoring reference