Compiler & Pipeline =================== FlyDSL includes a JIT compiler that traces Python kernel functions into MLIR and lowers them through the Fly dialect pipeline to GPU binaries. ``@flyc.kernel`` and ``@flyc.jit`` ------------------------------------ The primary API for defining and compiling kernels: .. code-block:: python import flydsl.compiler as flyc import flydsl.expr as fx @flyc.kernel def my_kernel(A: fx.Tensor, B: fx.Tensor, n: fx.Constexpr[int]): tid = fx.thread_idx.x bid = fx.block_idx.x # ... kernel body using layout ops ... @flyc.jit def launch(A: fx.Tensor, B: fx.Tensor, n: fx.Constexpr[int], stream: fx.Stream = fx.Stream(None)): my_kernel(A, B, n).launch( grid=(grid_x, 1, 1), block=(256, 1, 1), stream=stream, ) - ``@flyc.kernel`` compiles the function body into a ``gpu.func`` inside a ``gpu.module``. It uses AST rewriting to trace Python code into MLIR IR. - ``@flyc.jit`` wraps a host-side function that constructs and launches kernels. On first call it triggers JIT compilation; subsequent calls with the same type signature use a cached compiled artifact. Compilation Flow ----------------- On first call, ``@flyc.jit`` runs the following pipeline: 1. **AST Rewriting**: The Python source is parsed and rewritten to emit MLIR ops. 2. **MLIR Module Construction**: Kernel body is traced into ``gpu``, ``arith``, ``scf``, ``memref``, and ``fly`` dialect ops. 3. **Fly Pass Pipeline**: The module is lowered through a series of MLIR passes: - ``gpu-kernel-outlining`` - ``fly-canonicalize`` - ``fly-layout-lowering`` - ``convert-fly-to-rocdl`` - ``canonicalize`` + ``cse`` - ``gpu.module(convert-gpu-to-rocdl{...})`` - ``rocdl-attach-target{chip=gfxNNN}`` - ``gpu-to-llvm``, ``convert-arith/func-to-llvm`` - ``gpu-module-to-binary{format=fatbin}`` 4. **Cached Artifact**: The compiled binary is cached to disk (``~/.flydsl/cache/``) keyed by the compiler toolchain hash and kernel type signature. Tensor Arguments ----------------- Use ``flyc.from_dlpack`` to convert PyTorch tensors into FlyDSL tensor descriptors with layout metadata: .. code-block:: python import flydsl.compiler as flyc tA = flyc.from_dlpack(torch_tensor).mark_layout_dynamic( leading_dim=0, divisibility=4 ) launch(tA, B, n, stream=torch.cuda.Stream()) Buffer Operations ----------------- The ``flydsl.expr.buffer_ops`` module provides high-level Python wrappers for AMD CDNA3/CDNA4 buffer load/store operations. Buffer operations use a scalar base pointer (SGPRs) and per-thread offsets for efficient global memory access with hardware bounds checking. ROCDL Operations ----------------- The ``flydsl.expr.rocdl`` module provides AMD-specific operations: - **fx.rocdl.make_buffer_tensor** -- create buffer resource descriptor from tensor - **fx.rocdl.BufferCopy32b** / **BufferCopy128b** -- buffer copy atoms - **fx.rocdl.MFMA** -- MFMA instruction atoms (e.g., ``MFMA(16, 16, 4, fx.Float32)``) fly-opt CLI ------------ The ``fly-opt`` tool is a command-line interface for running MLIR passes on ``.mlir`` files: .. code-block:: bash fly-opt --fly-canonicalize input.mlir fly-opt --fly-layout-lowering input.mlir fly-opt --help