Testing & Benchmarking Guide

Test infrastructure, running tests, benchmark harness, writing new tests, and performance measurement.

Quick Reference

Category

Location

Requires GPU

Description

MLIR lit tests

tests/mlir/{LayoutAlgebra,Conversion,Transforms}/

No

Verify Fly dialect lowering

Python IR tests

tests/pyir/test_*.py

No

Python-based MLIR generation + lowering

GPU kernel tests

tests/kernels/test_*.py

Yes

Full compilation → GPU execution

AOT examples

tests/python/examples/

Varies

AOT pre-compilation examples

Run GEMM tests:

bash scripts/run_tests.sh

Run benchmarks:

bash scripts/run_benchmark.sh

1. Test Categories

1.1 MLIR Lit Tests (tests/mlir/)

MLIR-based tests organized by category, verified using the fly-opt tool. Validates that Fly dialect operations lower correctly to standard MLIR dialects without needing a GPU.

Directories:

Directory

Tests

Description

LayoutAlgebra/

coalesce.mlir, composition.mlir, construction.mlir, coordinate.mlir, divide.mlir, int_tuple.mlir, product.mlir, size_cosize.mlir

Layout algebra operations

Conversion/

fly_gpu_to_llvm.mlir, gpu_ops.mlir, memref_alloca.mlir, memref_ops.mlir, mma_atom.mlir, pointer_ops.mlir, type_conversion.mlir

Dialect conversion passes

Transforms/

canonicalize.mlir, layout_lowering.mlir

Transformation passes

Running individually:

# Build fly-opt first if needed
cmake --build build-fly --target fly-opt -j$(nproc)

# Run a single test
build-fly/bin/fly-opt --fly-canonicalize tests/mlir/LayoutAlgebra/construction.mlir

1.2 Python IR Tests (tests/pyir/)

Python-based tests that generate MLIR IR using the FlyDSL Python API and verify the IR structure and lowering. No GPU execution required.

Files:

Test File

Description

test_layout_algebra.py

Layout algebra: coalesce, composition, divide, product, complement

test_rocir_print.py

IR printing for Fly dialect ops

test_static_vs_dynamic.py

Static vs dynamic value handling

Running individually:

python tests/pyir/test_layout_algebra.py

1.3 GPU Kernel Tests (tests/kernels/)

Full end-to-end tests: compile FlyDSL kernels, execute on GPU, validate against PyTorch reference.

Files:

Test File

Kernel

Description

test_vec_add.py

VecAdd

Vector addition (C = A + B)

test_softmax.py

Softmax

Row-wise softmax

test_layernorm.py

LayerNorm

Layer normalization

test_rmsnorm.py

RMSNorm

RMS normalization

test_preshuffle_gemm.py

GEMM

Preshuffle MFMA GEMM (fp8/int8/int4/bf16)

test_blockscale_preshuffle_gemm.py

GEMM

Block-scale (MXFP4) preshuffle GEMM

test_moe_gemm.py

MoE GEMM

Mixture-of-Experts GEMM

test_moe_blockscale.py

MoE

MoE with block-scale quantization

test_moe_reduce.py

MoE Reduce

MoE reduction kernel

test_pa.py

Paged Attn

Paged attention decode

test_quant.py

Quantization

Quantization ops

test_ref.py

Reference

Reference implementations

Running individually:

python tests/kernels/test_softmax.py
python tests/kernels/test_preshuffle_gemm.py --in_dtype fp8 -M 16 -N 5120 -K 8192

1.4 AOT Examples (tests/python/examples/)

AOT pre-compilation examples:

tests/python/examples/
└── aot_example.py      # AOT pre-compilation for preshuffle GEMM

2. Test Runner Scripts

2.1 scripts/run_tests.sh

Runs the preshuffle GEMM test suite via pytest:

bash scripts/run_tests.sh

Features:

  • Auto-discovers build directory (build-fly/)

  • Sets up PYTHONPATH and LD_LIBRARY_PATH

  • Runs pytest tests/kernels/test_preshuffle_gemm.py

  • By default skips large_shape-marked tests (set RUN_TESTS_FULL=1 for all)

  • Outputs pass/fail summary

Environment setup:

PYTHONPATH="${BUILD_DIR}/python_packages:${REPO_ROOT}:${PYTHONPATH}"
LD_LIBRARY_PATH="${MLIR_LIBS_DIR}:${LD_LIBRARY_PATH}"

2.2 scripts/run_benchmark.sh

Specialized benchmarking harness for performance characterization.

Default configurations:

# Softmax/LayerNorm: "M,N,dtype"
SOFTMAX_SHAPES='32768,8192,bf16'
LAYERNORM_SHAPES='32768,8192,bf16'

# Preshuffle GEMM: "dtype,M,N,K,tile_m,tile_n,tile_k"
GEMM_SHAPES='
fp8,16,40960,5120,16,128,256
fp8,16,77824,5120,16,128,256
fp8,5120,5120,8320,64,256,128
fp8,9728,8192,8320,64,256,128
int8,9728,8192,8320,64,256,128
int4,9728,8192,8320,64,256,128
bf16,5120,5120,8320,64,256,128
'

# FP4 GEMM (gfx950 only): "M,N,K,tile_m,tile_n,tile_k"
GEMM_FP4_SHAPES='8192,8192,8192,64,128,256'

Selective execution:

bash scripts/run_benchmark.sh                    # default: GEMM only
bash scripts/run_benchmark.sh softmax             # only softmax
bash scripts/run_benchmark.sh gemm moe            # GEMM and MoE
bash scripts/run_benchmark.sh --only softmax,layernorm
bash scripts/run_benchmark.sh --list              # list available ops

Output format: Tabular with TB/s and TFLOPS columns:

op             shape                              dtype       TB/s    TFLOPS
-------------- ---------------------------------- ---------- ---------- ----------
gemm           16x40960x5120                      fp8         1.234     56.789

Logs: Written to ${BENCH_LOG_DIR:-/tmp/flydsl_bench}/


3. Pytest Configuration

3.1 tests/conftest.py

Pytest configuration with MLIR context fixtures for the Fly dialect.

Fixtures:

@pytest.fixture
def ctx():
    """Fresh MLIR context per test with dialects registered."""
    # Creates Context, yields object with: ctx.context, ctx.module, ctx.location

@pytest.fixture
def module(ctx):
    """Provides ctx.module."""

@pytest.fixture
def insert_point(ctx):
    """Sets insertion point to module body."""

Build discovery: Supports multiple build layouts:

  • build-fly/python_packages (preferred)

  • build/python_packages/flydsl (fallback)

Session hook: Prevents pytest exit code 5 (no tests collected) from being treated as failure.


4. Performance Measurement

4.1 tests/test_common.py

Core performance testing utilities (adapted from AIter).

perftest() decorator:

@perftest(num_iters=20, num_warmup=3, testGraph=False, num_rotate_args=0)
def my_kernel_test(Input, Output):
    # Kernel invocation
    ...

Features:

  • Device memory profiling to determine rotation count

  • Torch CUDA event timing

  • HIPGraph capture mode (testGraph=True)

  • Cache-aware iteration calculation

checkAllclose() function:

checkAllclose(output, reference, rtol=1e-2, atol=1e-2, tol_err_ratio=0.05)

Returns a mismatch ratio in [0, 1] (0 = pass).

verify_output() function:

verify_output(c_out, c_ref, atol=1e-2, rtol=1e-2, msg='')

High-level validation wrapper around checkAllclose.

4.2 tests/kernels/benchmark_common.py

Shared benchmark harness for performance comparison.

Key functions:

# Measure device time (torch CUDA events)
gpu_us = bench_gpu_us_torch(fn, warmup=20, iters=200)

5. Compilation Utilities (tests/utils.py)

compile_to_hsaco()

Standalone compilation path for tests:

from tests.utils import compile_to_hsaco

hsaco = compile_to_hsaco(mlir_module, kernel_name="my_kernel")

Pipeline stages:

  1. Fly coordinate lowering

  2. fly-to-standard lowering

  3. canonicalize + cse

  4. Attach ROCDL target (auto-detect GPU arch)

  5. convert-gpu-to-rocdl (SCF→CF, bare pointer memref)

  6. gpu-to-llvm + lower-to-llvm

  7. gpu-module-to-binary

Weight Utilities

from tests.utils import pertoken_quant, shuffle_weight

# Per-token quantization (handles NaN/Inf)
quantized, scales = pertoken_quant(tensor, dtype=torch.float8_e4m3fnuz)

# Weight preshuffle for MFMA (layout 16x16)
shuffled = shuffle_weight(weight, layout=(16, 16))

6. Writing New Tests

6.1 PyIR Test Pattern (No GPU)

# tests/pyir/test_my_feature.py
import flydsl.expr as fx
from flydsl.expr.typing import T

def test_my_layout_op(ctx, insert_point):
    shape = fx.make_shape(4, 8)
    stride = fx.make_stride(8, 1)
    layout = fx.make_layout(shape, stride)
    result = fx.size(layout)
    ir_str = str(ctx.module)
    assert "fly.make_layout" in ir_str

6.2 GPU Kernel Test Pattern (New API)

# tests/kernels/test_my_kernel.py
import torch
import flydsl.compiler as flyc
import flydsl.expr as fx
from flydsl.expr import arith, gpu
from tests.test_common import checkAllclose

@flyc.kernel
def my_kernel(A: fx.Tensor, B: fx.Tensor, N: fx.Constexpr[int]):
    tid = gpu.thread_idx.x
    bid = gpu.block_idx.x
    # ... kernel body ...

@flyc.jit
def launch(A: fx.Tensor, B: fx.Tensor, N: fx.Constexpr[int],
           stream: fx.Stream = fx.Stream(None)):
    my_kernel(A, B, N).launch(grid=(N // 256,), block=(256,), stream=stream)

def test_my_kernel():
    N = 1024
    A = torch.randn(N, device="cuda", dtype=torch.float32)
    B = torch.empty(N, device="cuda", dtype=torch.float32)

    launch(A, B, N)

    # Reference
    ref = A  # or some computation

    # Validate
    err = checkAllclose(B, ref, rtol=1e-2, atol=1e-2)
    assert err == 0, f"Mismatch: {err * 100:.2f}%"

6.3 Benchmark Test Pattern

from tests.kernels.benchmark_common import bench_gpu_us_torch

def benchmark_my_kernel():
    # Setup
    launch_fn = compile_my_kernel(...)

    def run():
        launch_fn(input_tensor, output_tensor)

    # Measure
    gpu_us = bench_gpu_us_torch(run, warmup=20, iters=200)

    # Compute metrics
    total_bytes = 2 * M * N * elem_size
    bandwidth_tbs = total_bytes / (gpu_us * 1e-6) / 1e12
    print(f"Time: {gpu_us:.1f} us, Bandwidth: {bandwidth_tbs:.2f} TB/s")

7. GEMM Test CLI Arguments

The test_preshuffle_gemm.py test supports extensive CLI configuration:

python tests/kernels/test_preshuffle_gemm.py \
    --in_dtype fp8 \
    -M 16 -N 5120 -K 8192 \
    --tile_m 16 --tile_n 128 --tile_k 256 \
    --lds_stage 2 \
    --num_iters 20 \
    --num_warmup 3 \
    --no_aiter_bench \
    --test_graph        # or -tg for HIPGraph mode
    --wfp4              # FP4 weight path (gfx950 only)

8. Test Configuration via Environment Variables

Variable

Used By

Description

ROCDSL_SOFTMAX_SHAPES

test_softmax.py

Override softmax test shapes ("M,N,dtype;...")

ROCDSL_LAYERNORM_SHAPES

test_layernorm.py

Override layernorm test shapes

FLYDSL_DUMP_IR

Compiler

Dump intermediate IR at each pipeline stage

FLYDSL_DUMP_DIR

Compiler

IR dump directory (default: ~/.flydsl/debug)

FLYDSL_RUNTIME_CACHE_DIR

Compiler

Cache directory (default: ~/.flydsl/cache)

RUN_TESTS_FULL

run_tests.sh

Set to 1 to run all parametrized cases

BENCH_LOG_DIR

run_benchmark.sh

Benchmark log directory (default: /tmp/flydsl_bench)


9. IR Dump Workflow

Via MlirCompiler

FLYDSL_DUMP_IR=1 FLYDSL_DUMP_DIR=./dumps python my_test.py

Produces numbered .mlir files per pipeline stage plus final_isa.s.

Dedicated IR Dump Script

bash scripts/dumpir.sh

10. Source Files

File

Description

scripts/run_tests.sh

GEMM test runner (pytest)

scripts/run_benchmark.sh

Benchmark harness with configurable shapes

scripts/dumpir.sh

IR dump helper script

tests/conftest.py

Pytest fixtures (MLIR context, module, insert point)

tests/test_common.py

perftest(), checkAllclose(), verify_output()

tests/utils.py

compile_to_hsaco(), pertoken_quant(), shuffle_weight()

tests/kernels/benchmark_common.py

bench_gpu_us_torch(), benchmark harness

tests/mlir/{LayoutAlgebra,Conversion,Transforms}/

MLIR lit tests (18 files)

tests/pyir/test_*.py

Python IR generation tests (3 files)

tests/kernels/test_*.py

GPU kernel tests (12 files)

tests/python/examples/

AOT pre-compilation examples