Testing & Benchmarking Guide

This guide covers the test infrastructure, how to run tests, the benchmark harness, how to write new tests, and how to measure performance.

Quick Reference

Category	Location	Requires GPU	Description
MLIR lit tests	`tests/mlir/{LayoutAlgebra,Conversion,Transforms}/`	No	Verify Fly dialect lowering
Python tests	`tests/python/examples/`	Varies	Python-based MLIR generation + AOT examples
GPU kernel tests	`tests/kernels/test_*.py`	Yes	Full compilation → GPU execution
AOT examples	`tests/python/examples/`	Varies	AOT pre-compilation examples

Run the full test suite:

bash scripts/run_tests.sh

Run benchmarks:

bash scripts/run_benchmark.sh

1. Test Categories

1.1 MLIR Lit Tests (`tests/mlir/`)

These MLIR-based tests are organized by category and run through the `fly-opt` tool. They validate that Fly dialect operations lower correctly to standard MLIR dialects, without needing a GPU.

Directories:

Directory	Tests	Description
`LayoutAlgebra/`	`coalesce.mlir`, `composition.mlir`, `construction.mlir`, `coordinate.mlir`, `divide.mlir`, `int_tuple.mlir`, `product.mlir`, `size_cosize.mlir`	Layout algebra operations
`Conversion/`	`gpu_ops.mlir`, `memref_alloca.mlir`, `memref_ops.mlir`, `mma_atom.mlir`, `pointer_ops.mlir`, `type_conversion.mlir`	Dialect conversion passes
`Transforms/`	`canonicalize.mlir`, `layout_lowering.mlir`	Transformation passes

Running individually:

# Build fly-opt first if needed
cmake --build build-fly --target fly-opt -j$(nproc)

# Run a single test
build-fly/bin/fly-opt --fly-canonicalize tests/mlir/LayoutAlgebra/construction.mlir

1.2 Python Tests (`tests/python/`)

Python-based tests including AOT pre-compilation examples.

Running:

python tests/python/examples/aot_example.py

1.3 GPU Kernel Tests (`tests/kernels/`)

Full end-to-end tests: compile FlyDSL kernels, execute on GPU, validate against PyTorch reference.

Files:

Test File	Kernel	Description
`test_vec_add.py`	VecAdd	Vector addition (C = A + B)
`test_softmax.py`	Softmax	Row-wise softmax
`test_layernorm.py`	LayerNorm	Layer normalization
`test_rmsnorm.py`	RMSNorm	RMS normalization
`test_preshuffle_gemm.py`	GEMM	Preshuffle MFMA GEMM (fp8/int8/fp16/bf16)
`test_blockscale_preshuffle_gemm.py`	GEMM	Block-scale (MXFP4) preshuffle GEMM
`test_moe_gemm.py`	MoE GEMM	Mixture-of-Experts GEMM
`test_moe_blockscale.py`	MoE	MoE with block-scale quantization
`test_moe_reduce.py`	MoE Reduce	MoE reduction kernel
`test_pa.py`	Paged Attn	Paged attention decode
`test_quant.py`	Quantization	Quantization ops
`test_ref.py`	Reference	Reference implementations

Running individually:

python tests/kernels/test_softmax.py
python tests/kernels/test_preshuffle_gemm.py --in_dtype fp8 -M 16 -N 5120 -K 8192

1.4 AOT Examples (`tests/python/examples/`)

AOT pre-compilation examples:

tests/python/examples/
└── aot_example.py      # AOT pre-compilation for preshuffle GEMM

2. Test Runner Scripts

2.1 `scripts/run_tests.sh`

Runs the full FlyDSL test suite:

bash scripts/run_tests.sh

Features:

Auto-discovers the build directory (`build-fly/`)
Auto-selects the GPU with the most free VRAM when `HIP_VISIBLE_DEVICES` is unset
Sets up `PYTHONPATH` and `LD_LIBRARY_PATH`, and exports `FLYDSL_RUN_QUANT=1`
Runs pytest over `tests/kernels/`, `tests/unit/`, `tests/system/`, and `tests/python/examples/`
Runs the standalone `examples/` scripts and the MLIR FileCheck tests (`tests/mlir/`)
Skips `large_shape`-marked tests by default (set `RUN_TESTS_FULL=1` to run all)
Fail-fast: exits on the first failure
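The GPU auto-selection rule can be sketched in Python as follows. This is only an illustration of the logic described above, not the script's actual implementation: `pick_gpu` and its `free_bytes_by_index` argument are hypothetical names, and how the script queries free VRAM is not shown here.

```python
import os


def pick_gpu(free_bytes_by_index, env=None):
    """Return the device index to expose, as a string.

    free_bytes_by_index is a hypothetical mapping {gpu_index: free_vram_bytes}.
    """
    env = os.environ if env is None else env
    # An explicit user choice always wins (an empty value counts as unset).
    if env.get("HIP_VISIBLE_DEVICES"):
        return env["HIP_VISIBLE_DEVICES"]
    if not free_bytes_by_index:
        raise ValueError("no GPUs reported")
    # Otherwise pick the index with the largest amount of free memory.
    best = max(free_bytes_by_index, key=free_bytes_by_index.get)
    return str(best)
```

Passing `env` explicitly keeps the function easy to exercise without touching the real environment.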

Environment setup:

PYTHONPATH="${BUILD_DIR}/python_packages:${REPO_ROOT}:${PYTHONPATH}"
LD_LIBRARY_PATH="${MLIR_LIBS_DIR}:${LD_LIBRARY_PATH}"

2.2 `scripts/run_benchmark.sh`

Specialized benchmarking harness for performance characterization.

Default configurations:

# Softmax/LayerNorm: "M,N,dtype"
SOFTMAX_SHAPES='32768,8192,bf16'
LAYERNORM_SHAPES='32768,8192,bf16'

# Preshuffle GEMM: "dtype,M,N,K,tile_m,tile_n,tile_k"
GEMM_SHAPES='
fp8,16,40960,5120,16,128,256
fp8,16,77824,5120,16,128,256
fp8,5120,5120,8320,64,256,128
fp8,9728,8192,8320,64,256,128
int8,9728,8192,8320,64,256,128
bf16,5120,5120,8320,64,256,128
'

# FP4 GEMM (gfx950 only): "M,N,K,tile_m,tile_n,tile_k"
GEMM_FP4_SHAPES='8192,8192,8192,64,128,256'
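To make the GEMM shape format concrete, here is a minimal Python sketch that parses a `"dtype,M,N,K,tile_m,tile_n,tile_k"` list like the one above. The script itself is a shell script and parses these strings in its own way; `GemmShape` and `parse_gemm_shapes` are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class GemmShape(NamedTuple):
    dtype: str
    M: int
    N: int
    K: int
    tile_m: int
    tile_n: int
    tile_k: int


def parse_gemm_shapes(text):
    """Parse one "dtype,M,N,K,tile_m,tile_n,tile_k" entry per non-empty line."""
    shapes = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines separate nothing; skip them
        dtype, *dims = [field.strip() for field in line.split(",")]
        if len(dims) != 6:
            raise ValueError(f"expected 7 fields, got {len(dims) + 1}: {line!r}")
        shapes.append(GemmShape(dtype, *map(int, dims)))
    return shapes


GEMM_SHAPES = """
fp8,16,40960,5120,16,128,256
bf16,5120,5120,8320,64,256,128
"""

for s in parse_gemm_shapes(GEMM_SHAPES):
    print(s.dtype, f"{s.M}x{s.N}x{s.K}", f"tile={s.tile_m}x{s.tile_n}x{s.tile_k}")
```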

Selective execution:

bash scripts/run_benchmark.sh                    # default: GEMM only
bash scripts/run_benchmark.sh softmax             # only softmax
bash scripts/run_benchmark.sh gemm moe            # GEMM and MoE
bash scripts/run_benchmark.sh --only softmax,layernorm
bash scripts/run_benchmark.sh --list              # list available ops

Output format: Tabular with TB/s and TFLOPS columns:

op             shape                              dtype       TB/s    TFLOPS
-------------- ---------------------------------- ---------- ---------- ----------
gemm           16x40960x5120                      fp8         1.234     56.789
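As a rough guide to how the two metric columns relate to a measured time, the sketch below computes TB/s and TFLOPS for a GEMM. It assumes the usual 2·M·N·K FLOP count and a byte count of one read of A and B plus one write of C; the harness may count bytes differently. `gemm_metrics`, `format_row`, and the column widths are illustrative assumptions, not taken from the script.

```python
def gemm_metrics(M, N, K, gpu_us, in_bytes=1, out_bytes=2):
    """Return (TB/s, TFLOPS) for an M x N x K GEMM that took gpu_us microseconds.

    in_bytes / out_bytes are the element sizes of the inputs and the output
    (for example 1 for fp8 inputs and 2 for a bf16 output).
    """
    seconds = gpu_us * 1e-6
    flops = 2 * M * N * K  # one multiply and one add per inner-product term
    total_bytes = (M * K + N * K) * in_bytes + M * N * out_bytes
    return total_bytes / seconds / 1e12, flops / seconds / 1e12


def format_row(op, shape, dtype, tbs, tflops):
    """Format one result row; widths are an assumption matching the rule line above."""
    return f"{op:<14} {shape:<34} {dtype:<10} {tbs:>10.3f} {tflops:>10.3f}"


tbs, tflops = gemm_metrics(16, 40960, 5120, gpu_us=100.0)
print(format_row("gemm", "16x40960x5120", "fp8", tbs, tflops))
```

The 100 us timing in the usage line is a made-up input for demonstration, not a measured result.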

Logs: Written to ${BENCH_LOG_DIR:-/tmp/flydsl_bench}/

3. Pytest Configuration

3.1 `tests/conftest.py`

Pytest configuration with MLIR context fixtures for the Fly dialect.

Fixtures:

@pytest.fixture
def ctx():
    """Fresh MLIR context per test with dialects registered."""
    # Creates Context, yields object with: ctx.context, ctx.module, ctx.location

@pytest.fixture
def module(ctx):
    """Provides ctx.module."""

@pytest.fixture
def insert_point(ctx):
    """Sets insertion point to module body."""

Build discovery: Supports multiple build layouts:

build-fly/python_packages (preferred)
build/python_packages/flydsl (fallback)
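The preferred-then-fallback lookup can be sketched as below. This is a minimal illustration using the two layouts listed above; `find_python_packages` is a hypothetical helper, and the real `conftest.py` may resolve paths differently.

```python
from pathlib import Path

# Candidate locations relative to the repository root, in priority order.
CANDIDATES = ("build-fly/python_packages", "build/python_packages/flydsl")


def find_python_packages(repo_root):
    """Return the first candidate directory that exists, or None."""
    root = Path(repo_root)
    for rel in CANDIDATES:
        candidate = root / rel
        if candidate.is_dir():
            return candidate
    return None


if __name__ == "__main__":
    print(find_python_packages("."))
```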

Session hook: Prevents pytest exit code 5 (no tests collected) from being treated as failure.
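A common way to get this behavior is pytest's standard `pytest_sessionfinish` hook. The sketch below shows that pattern; whether `tests/conftest.py` is written exactly this way is an assumption.

```python
NO_TESTS_COLLECTED = 5  # same value as pytest.ExitCode.NO_TESTS_COLLECTED


def pytest_sessionfinish(session, exitstatus):
    """Report success when a run collected no tests (for example, all deselected)."""
    if exitstatus == NO_TESTS_COLLECTED:
        session.exitstatus = 0
```

Placed in `conftest.py`, pytest calls this hook once at the end of the session, and real failures keep their original exit code.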

4. Performance Measurement

4.1 `tests/test_common.py`

Core performance testing utilities (adapted from AIter).

perftest() decorator:

@perftest(num_iters=20, num_warmup=3, testGraph=False, num_rotate_args=0)
def my_kernel_test(Input, Output):
    # Kernel invocation
    ...

Features:

Device memory profiling to determine rotation count
Torch CUDA event timing
HIPGraph capture mode (testGraph=True)
Cache-aware iteration calculation
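The warmup-then-measure structure behind such a decorator can be sketched with the standard library. This is a simplified stand-in, not the real `perftest()`: it uses a host-side timer instead of CUDA events, does no argument rotation or graph capture, and the `(result, avg_us)` return shape is an assumption made for this example.

```python
import functools
import time


def timed(num_iters=20, num_warmup=3):
    """Decorator: run warmup calls, then return (result, average microseconds per call)."""
    if num_iters < 1:
        raise ValueError("num_iters must be at least 1")

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Warmup calls are executed but not timed.
            for _ in range(num_warmup):
                fn(*args, **kwargs)
            start = time.perf_counter()
            for _ in range(num_iters):
                result = fn(*args, **kwargs)
            elapsed = time.perf_counter() - start
            return result, elapsed / num_iters * 1e6

        return wrapper

    return decorator


@timed(num_iters=5, num_warmup=1)
def add(a, b):
    return a + b


value, avg_us = add(2, 3)
print(f"result={value}, avg={avg_us:.2f} us")
```

For GPU kernels a host timer only measures launch overhead unless the device is synchronized, which is why the real utility relies on device events.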

checkAllclose() function:

checkAllclose(output, reference, rtol=1e-2, atol=1e-2, tol_err_ratio=0.05)

Returns a mismatch ratio in [0, 1] (0 = pass).
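To illustrate what a mismatch ratio means, here is a pure-Python sketch over flat lists. It assumes the common `|out - ref| <= atol + rtol * |ref|` closeness test and does not model `tol_err_ratio`; the real `checkAllclose()` operates on tensors and may differ in detail. `mismatch_ratio` is a hypothetical name.

```python
import math


def _is_close(out, ref, rtol, atol):
    # Non-finite values only match when they are equal (same-signed infinity);
    # NaN never compares equal, so it always counts as a mismatch.
    if not (math.isfinite(out) and math.isfinite(ref)):
        return out == ref
    return abs(out - ref) <= atol + rtol * abs(ref)


def mismatch_ratio(output, reference, rtol=1e-2, atol=1e-2):
    """Return the fraction of elements outside tolerance, in [0, 1]."""
    if len(output) != len(reference):
        raise ValueError("output and reference must have the same length")
    if not output:
        return 0.0
    bad = sum(1 for o, r in zip(output, reference) if not _is_close(o, r, rtol, atol))
    return bad / len(output)


print(mismatch_ratio([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0]))
```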

verify_output() function:

verify_output(c_out, c_ref, atol=1e-2, rtol=1e-2, msg='')

High-level validation wrapper around checkAllclose.

4.2 `tests/kernels/benchmark_common.py`

Shared benchmark harness for performance comparison.

Key functions:

# Measure device time (torch CUDA events)
gpu_us = bench_gpu_us_torch(fn, warmup=20, iters=200)

5. Test Utilities (`tests/utils.py`)

Weight Utilities

from tests.utils import pertoken_quant, shuffle_weight

# Per-token quantization (handles NaN/Inf)
quantized, scales = pertoken_quant(tensor, dtype=torch.float8_e4m3fnuz)

# Weight preshuffle for MFMA (layout 16x16)
shuffled = shuffle_weight(weight, layout=(16, 16))

6. Writing New Tests

6.1 PyIR Test Pattern (No GPU)

# tests/python/test_my_feature.py
import flydsl.expr as fx
from flydsl.expr.typing import T

def test_my_layout_op(ctx, insert_point):
    shape = fx.make_shape(4, 8)
    stride = fx.make_stride(8, 1)
    layout = fx.make_layout(shape, stride)
    result = fx.size(layout)
    ir_str = str(ctx.module)
    assert "fly.make_layout" in ir_str

6.2 GPU Kernel Test Pattern (New API)

# tests/kernels/test_my_kernel.py
import torch
import flydsl.compiler as flyc
import flydsl.expr as fx
from flydsl.expr import gpu
from tests.test_common import checkAllclose

@flyc.kernel
def my_kernel(A: fx.Tensor, B: fx.Tensor, N: fx.Constexpr[int]):
    tid = gpu.thread_idx.x
    bid = gpu.block_idx.x
    # ... kernel body ...

@flyc.jit
def launch(A: fx.Tensor, B: fx.Tensor, N: fx.Constexpr[int],
           stream: fx.Stream = fx.Stream(None)):
    my_kernel(A, B, N).launch(grid=(N // 256,), block=(256,), stream=stream)

def test_my_kernel():
    N = 1024
    A = torch.randn(N, device="cuda", dtype=torch.float32)
    B = torch.empty(N, device="cuda", dtype=torch.float32)

    launch(A, B, N)

    # Reference
    ref = A  # or some computation

    # Validate
    err = checkAllclose(B, ref, rtol=1e-2, atol=1e-2)
    assert err == 0, f"Mismatch: {err * 100:.2f}%"

6.3 Benchmark Test Pattern

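import torch
from tests.kernels.benchmark_common import bench_gpu_us_torch

def benchmark_my_kernel(M=32768, N=8192, dtype=torch.bfloat16):
    # Setup: compile_my_kernel is a placeholder for your kernel's compile step
    launch_fn = compile_my_kernel(...)
    input_tensor = torch.randn(M, N, device="cuda", dtype=dtype)
    output_tensor = torch.empty_like(input_tensor)

    def run():
        launch_fn(input_tensor, output_tensor)

    # Measure device time in microseconds
    gpu_us = bench_gpu_us_torch(run, warmup=20, iters=200)

    # Compute metrics: one read plus one write of an M x N tensor
    elem_size = input_tensor.element_size()
    total_bytes = 2 * M * N * elem_size
    bandwidth_tbs = total_bytes / (gpu_us * 1e-6) / 1e12
    print(f"Time: {gpu_us:.1f} us, Bandwidth: {bandwidth_tbs:.2f} TB/s")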

7. GEMM Test CLI Arguments

The test_preshuffle_gemm.py test supports extensive CLI configuration:

python tests/kernels/test_preshuffle_gemm.py \
    --in_dtype fp8 \
    -M 16 -N 5120 -K 8192 \
    --tile_m 16 --tile_n 128 --tile_k 256 \
    --lds_stage 2 \
    --num_iters 20 \
    --num_warmup 3 \
    --no_aiter_bench \
    --test_graph

# Notes:
#   -tg      short form of --test_graph (HIPGraph mode)
#   --wfp4   FP4 weight path (gfx950 only)

8. Test Configuration via Environment Variables

Variable	Used By	Description
`ROCDSL_SOFTMAX_SHAPES`	`test_softmax.py`	Override softmax test shapes (`"M,N,dtype;..."`)
`ROCDSL_LAYERNORM_SHAPES`	`test_layernorm.py`	Override layernorm test shapes
`FLYDSL_DUMP_IR`	Compiler	Dump intermediate IR at each pipeline stage
`FLYDSL_DUMP_DIR`	Compiler	IR dump directory (default: `~/.flydsl/debug`)
`FLYDSL_RUNTIME_CACHE_DIR`	Compiler	Cache directory (default: `~/.flydsl/cache`)
`RUN_TESTS_FULL`	`run_tests.sh`	Set to `1` to run all parametrized cases
`BENCH_LOG_DIR`	`run_benchmark.sh`	Benchmark log directory (default: `/tmp/flydsl_bench`)
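The `"M,N,dtype;..."` override format can be read with a few lines of Python. The sketch below is an assumption about how a test might consume such a variable, falling back to a default when it is unset; `shapes_from_env` is a hypothetical helper, not part of the test suite.

```python
import os


def shapes_from_env(var, default, env=None):
    """Parse "M,N,dtype;M,N,dtype;..." from an environment variable, or use default."""
    env = os.environ if env is None else env
    raw = env.get(var) or default  # unset or empty falls back to the default
    shapes = []
    for item in raw.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing semicolon
        m, n, dtype = (part.strip() for part in item.split(","))
        shapes.append((int(m), int(n), dtype))
    return shapes


print(shapes_from_env("ROCDSL_SOFTMAX_SHAPES", "32768,8192,bf16"))
```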

9. IR Dump Workflow

Via `MlirCompiler`

FLYDSL_DUMP_IR=1 FLYDSL_DUMP_DIR=./dumps python my_test.py

This produces one numbered `.mlir` file per pipeline stage, plus `final_isa.s`.
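When inspecting a dump, it helps to list the stages in pipeline order. The sketch below assumes that each `.mlir` file name starts with its stage number and that the files sit directly in the dump directory; neither detail is confirmed here, and `list_ir_dumps` is a hypothetical helper.

```python
import re
from pathlib import Path


def list_ir_dumps(dump_dir):
    """Return .mlir dumps sorted by leading number, followed by final_isa.s if present."""
    root = Path(dump_dir)

    def stage_key(path):
        # Sort numerically so that stage 10 comes after stage 2;
        # files without a leading number go last.
        match = re.match(r"(\d+)", path.name)
        return (int(match.group(1)) if match else float("inf"), path.name)

    files = sorted(root.glob("*.mlir"), key=stage_key)
    isa = root / "final_isa.s"
    if isa.is_file():
        files.append(isa)
    return files


if __name__ == "__main__":
    for path in list_ir_dumps("./dumps"):
        print(path.name)
```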

Dedicated IR Dump Script

bash scripts/dumpir.sh

10. Source Files

File	Description
`scripts/run_tests.sh`	Full test runner (pytest + examples + FileCheck)
`scripts/run_benchmark.sh`	Benchmark harness with configurable shapes
`scripts/dumpir.sh`	IR dump helper script
`tests/conftest.py`	Pytest fixtures (MLIR context, module, insert point)
`tests/test_common.py`	`perftest()`, `checkAllclose()`, `verify_output()`
`tests/utils.py`	`pertoken_quant()`, `shuffle_weight()`
`tests/kernels/benchmark_common.py`	`bench_gpu_us_torch()`, benchmark harness
`tests/mlir/{LayoutAlgebra,Conversion,Transforms}/`	MLIR lit tests (16 files)
`tests/python/examples/`	Python AOT examples
`tests/kernels/test_*.py`	GPU kernel tests (12 files)