# Testing & Benchmarking Guide

Test infrastructure, running tests, benchmark harness, writing new tests, and performance measurement.
## Quick Reference

| Category | Location | Requires GPU | Description |
|---|---|---|---|
| MLIR lit tests | `tests/mlir/` | No | Verify Fly dialect lowering |
| Python IR tests | `tests/pyir/` | No | Python-based MLIR generation + lowering |
| GPU kernel tests | `tests/kernels/` | Yes | Full compilation → GPU execution |
| AOT examples | `tests/python/examples/` | Varies | AOT pre-compilation examples |

Run GEMM tests:

```bash
bash scripts/run_tests.sh
```

Run benchmarks:

```bash
bash scripts/run_benchmark.sh
```
## 1. Test Categories

### 1.1 MLIR Lit Tests (tests/mlir/)

MLIR-based tests organized by category and verified with the fly-opt tool. They validate that Fly dialect operations lower correctly to standard MLIR dialects without needing a GPU.

Directories:

| Directory | Tests | Description |
|---|---|---|
| `LayoutAlgebra/` | | Layout algebra operations |
| | | Dialect conversion passes |
| | | Transformation passes |

Running individually:

```bash
# Build fly-opt first if needed
cmake --build build-fly --target fly-opt -j$(nproc)

# Run a single test
build-fly/bin/fly-opt --fly-canonicalize tests/mlir/LayoutAlgebra/construction.mlir
```
### 1.2 Python IR Tests (tests/pyir/)

Python-based tests that generate MLIR IR through the FlyDSL Python API and verify the IR structure and lowering. No GPU execution required.

Files:

| Test File | Description |
|---|---|
| `test_layout_algebra.py` | Layout algebra: coalesce, composition, divide, product, complement |
| | IR printing for Fly dialect ops |
| | Static vs dynamic value handling |

Running individually:

```bash
python tests/pyir/test_layout_algebra.py
```
### 1.3 GPU Kernel Tests (tests/kernels/)

Full end-to-end tests: compile FlyDSL kernels, execute them on the GPU, and validate against PyTorch references.

Files:

| Test File | Kernel | Description |
|---|---|---|
| | VecAdd | Vector addition (C = A + B) |
| `test_softmax.py` | Softmax | Row-wise softmax |
| | LayerNorm | Layer normalization |
| | RMSNorm | RMS normalization |
| `test_preshuffle_gemm.py` | GEMM | Preshuffle MFMA GEMM (fp8/int8/int4/bf16) |
| | GEMM | Block-scale (MXFP4) preshuffle GEMM |
| | MoE GEMM | Mixture-of-Experts GEMM |
| | MoE | MoE with block-scale quantization |
| | MoE Reduce | MoE reduction kernel |
| | Paged Attn | Paged attention decode |
| | Quantization | Quantization ops |
| | Reference | Reference implementations |

Running individually:

```bash
python tests/kernels/test_softmax.py
python tests/kernels/test_preshuffle_gemm.py --in_dtype fp8 -M 16 -N 5120 -K 8192
```
### 1.4 AOT Examples (tests/python/examples/)

AOT pre-compilation examples:

```
tests/python/examples/
└── aot_example.py    # AOT pre-compilation for preshuffle GEMM
```
## 2. Test Runner Scripts

### 2.1 scripts/run_tests.sh

Runs the preshuffle GEMM test suite via pytest:

```bash
bash scripts/run_tests.sh
```

Features:

- Auto-discovers the build directory (`build-fly/`)
- Sets up `PYTHONPATH` and `LD_LIBRARY_PATH`
- Runs `pytest tests/kernels/test_preshuffle_gemm.py`
- Skips `large_shape`-marked tests by default (set `RUN_TESTS_FULL=1` to run all)
- Outputs a pass/fail summary

Environment setup:

```bash
PYTHONPATH="${BUILD_DIR}/python_packages:${REPO_ROOT}:${PYTHONPATH}"
LD_LIBRARY_PATH="${MLIR_LIBS_DIR}:${LD_LIBRARY_PATH}"
```
### 2.2 scripts/run_benchmark.sh

Specialized benchmarking harness for performance characterization.

Default configurations:

```bash
# Softmax/LayerNorm: "M,N,dtype"
SOFTMAX_SHAPES='32768,8192,bf16'
LAYERNORM_SHAPES='32768,8192,bf16'

# Preshuffle GEMM: "dtype,M,N,K,tile_m,tile_n,tile_k"
GEMM_SHAPES='
fp8,16,40960,5120,16,128,256
fp8,16,77824,5120,16,128,256
fp8,5120,5120,8320,64,256,128
fp8,9728,8192,8320,64,256,128
int8,9728,8192,8320,64,256,128
int4,9728,8192,8320,64,256,128
bf16,5120,5120,8320,64,256,128
'

# FP4 GEMM (gfx950 only): "M,N,K,tile_m,tile_n,tile_k"
GEMM_FP4_SHAPES='8192,8192,8192,64,128,256'
```
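Each shape entry is a plain CSV line. As a sketch of how such a spec can be consumed programmatically (the script itself does this in bash; the helper name `parse_gemm_shapes` is hypothetical, not part of the repo):

```python
# Hypothetical Python equivalent of the bash shape-string parsing.
GEMM_SHAPES = """
fp8,16,40960,5120,16,128,256
bf16,5120,5120,8320,64,256,128
"""

def parse_gemm_shapes(spec: str):
    """Parse 'dtype,M,N,K,tile_m,tile_n,tile_k' lines into tuples."""
    shapes = []
    for line in spec.strip().splitlines():
        dtype, *nums = line.strip().split(",")
        shapes.append((dtype, *map(int, nums)))
    return shapes

print(parse_gemm_shapes(GEMM_SHAPES)[0])
# → ('fp8', 16, 40960, 5120, 16, 128, 256)
```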
Selective execution:

```bash
bash scripts/run_benchmark.sh                    # default: GEMM only
bash scripts/run_benchmark.sh softmax            # only softmax
bash scripts/run_benchmark.sh gemm moe           # GEMM and MoE
bash scripts/run_benchmark.sh --only softmax,layernorm
bash scripts/run_benchmark.sh --list             # list available ops
```

Output format: tabular, with TB/s and TFLOPS columns:

```
op             shape                              dtype      TB/s       TFLOPS
-------------- ---------------------------------- ---------- ---------- ----------
gemm           16x40960x5120                      fp8        1.234      56.789
```

Logs: written to `${BENCH_LOG_DIR:-/tmp/flydsl_bench}/`
## 3. Pytest Configuration

### 3.1 tests/conftest.py

Pytest configuration with MLIR context fixtures for the Fly dialect.

Fixtures:

```python
@pytest.fixture
def ctx():
    """Fresh MLIR context per test with dialects registered."""
    # Creates a Context; yields an object with: ctx.context, ctx.module, ctx.location

@pytest.fixture
def module(ctx):
    """Provides ctx.module."""

@pytest.fixture
def insert_point(ctx):
    """Sets the insertion point to the module body."""
```

Build discovery: supports multiple build layouts:

- `build-fly/python_packages` (preferred)
- `build/python_packages/flydsl` (fallback)

Session hook: prevents pytest exit code 5 (no tests collected) from being treated as a failure.
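Such a hook is typically a `pytest_sessionfinish` implementation along these lines (a hedged sketch, not the actual code in `tests/conftest.py`):

```python
# Hypothetical sketch of the exit-code-5 remapping hook.
def pytest_sessionfinish(session, exitstatus):
    # pytest returns 5 (ExitCode.NO_TESTS_COLLECTED) when nothing was
    # collected; remap it to 0 so an empty selection is not a CI failure.
    if exitstatus == 5:
        session.exitstatus = 0
```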
## 4. Performance Measurement

### 4.1 tests/test_common.py

Core performance-testing utilities (adapted from AIter).

`perftest()` decorator:

```python
@perftest(num_iters=20, num_warmup=3, testGraph=False, num_rotate_args=0)
def my_kernel_test(Input, Output):
    # Kernel invocation
    ...
```

Features:

- Device memory profiling to determine the rotation count
- Torch CUDA event timing
- HIPGraph capture mode (`testGraph=True`)
- Cache-aware iteration calculation

`checkAllclose()` function:

```python
checkAllclose(output, reference, rtol=1e-2, atol=1e-2, tol_err_ratio=0.05)
```

Returns a mismatch ratio in [0, 1] (0 = pass).

`verify_output()` function:

```python
verify_output(c_out, c_ref, atol=1e-2, rtol=1e-2, msg='')
```

High-level validation wrapper around `checkAllclose`.
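The mismatch-ratio idea can be illustrated with a pure-Python sketch (an illustrative re-implementation, not the actual `checkAllclose`, which operates on torch tensors and also takes a `tol_err_ratio` threshold):

```python
def mismatch_ratio(output, reference, rtol=1e-2, atol=1e-2):
    """Fraction of elements failing |a - b| <= atol + rtol * |b|.
    Illustrative sketch of the allclose-style tolerance test."""
    failures = sum(
        1 for a, b in zip(output, reference)
        if abs(a - b) > atol + rtol * abs(b)
    )
    return failures / len(reference)

print(mismatch_ratio([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # → 0.0
```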
### 4.2 tests/kernels/benchmark_common.py

Shared benchmark harness for performance comparison.

Key functions:

```python
# Measure device time (torch CUDA events)
gpu_us = bench_gpu_us_torch(fn, warmup=20, iters=200)
```
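For reference, the same warmup-then-measure pattern looks like this on the host side (a wall-clock sketch only; `bench_gpu_us_torch` uses torch CUDA events, which is what you want for asynchronous GPU work):

```python
import time

def bench_us_wallclock(fn, warmup=20, iters=200):
    """Average wall-clock microseconds per call after warmup.
    Sketch of the warmup/measure pattern; not a substitute for
    CUDA-event timing of device kernels."""
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters * 1e6
```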
## 5. Compilation Utilities (tests/utils.py)

### compile_to_hsaco()

Standalone compilation path for tests:

```python
from tests.utils import compile_to_hsaco

hsaco = compile_to_hsaco(mlir_module, kernel_name="my_kernel")
```

Pipeline stages:

1. Fly coordinate lowering
2. `fly-to-standard` lowering
3. `canonicalize` + `cse`
4. Attach ROCDL target (auto-detect GPU arch)
5. `convert-gpu-to-rocdl` (SCF→CF, bare pointer memref)
6. `gpu-to-llvm` + `lower-to-llvm`
7. `gpu-module-to-binary`

### Weight Utilities

```python
from tests.utils import pertoken_quant, shuffle_weight

# Per-token quantization (handles NaN/Inf)
quantized, scales = pertoken_quant(tensor, dtype=torch.float8_e4m3fnuz)

# Weight preshuffle for MFMA (layout 16x16)
shuffled = shuffle_weight(weight, layout=(16, 16))
```
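Per-token quantization reduces to one scale per row. A generic pure-Python sketch of that idea (the helper name and `qmax` default are assumptions; the real `pertoken_quant` works on torch tensors with fp8 dtypes and additionally handles NaN/Inf):

```python
def pertoken_quant_sketch(rows, qmax=448.0):
    """Symmetric per-row quantization: scale = amax / qmax, q = x / scale.
    Generic sketch, not the repository implementation."""
    quantized, scales = [], []
    for row in rows:
        amax = max(abs(x) for x in row) or 1.0  # avoid divide-by-zero
        scale = amax / qmax
        scales.append(scale)
        quantized.append([x / scale for x in row])
    return quantized, scales
```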
## 6. Writing New Tests

### 6.1 PyIR Test Pattern (No GPU)

```python
# tests/pyir/test_my_feature.py
import flydsl.expr as fx
from flydsl.expr.typing import T

def test_my_layout_op(ctx, insert_point):
    shape = fx.make_shape(4, 8)
    stride = fx.make_stride(8, 1)
    layout = fx.make_layout(shape, stride)
    result = fx.size(layout)

    ir_str = str(ctx.module)
    assert "fly.make_layout" in ir_str
```
### 6.2 GPU Kernel Test Pattern (New API)

```python
# tests/kernels/test_my_kernel.py
import torch
import flydsl.compiler as flyc
import flydsl.expr as fx
from flydsl.expr import arith, gpu
from tests.test_common import checkAllclose

@flyc.kernel
def my_kernel(A: fx.Tensor, B: fx.Tensor, N: fx.Constexpr[int]):
    tid = gpu.thread_idx.x
    bid = gpu.block_idx.x
    # ... kernel body ...

@flyc.jit
def launch(A: fx.Tensor, B: fx.Tensor, N: fx.Constexpr[int],
           stream: fx.Stream = fx.Stream(None)):
    my_kernel(A, B, N).launch(grid=(N // 256,), block=(256,), stream=stream)

def test_my_kernel():
    N = 1024
    A = torch.randn(N, device="cuda", dtype=torch.float32)
    B = torch.empty(N, device="cuda", dtype=torch.float32)
    launch(A, B, N)

    # Reference
    ref = A  # or some computation

    # Validate
    err = checkAllclose(B, ref, rtol=1e-2, atol=1e-2)
    assert err == 0, f"Mismatch: {err * 100:.2f}%"
```
### 6.3 Benchmark Test Pattern

```python
from tests.kernels.benchmark_common import bench_gpu_us_torch

def benchmark_my_kernel():
    # Setup
    launch_fn = compile_my_kernel(...)

    def run():
        launch_fn(input_tensor, output_tensor)

    # Measure
    gpu_us = bench_gpu_us_torch(run, warmup=20, iters=200)

    # Compute metrics
    total_bytes = 2 * M * N * elem_size
    bandwidth_tbs = total_bytes / (gpu_us * 1e-6) / 1e12
    print(f"Time: {gpu_us:.1f} us, Bandwidth: {bandwidth_tbs:.2f} TB/s")
```
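The same metric arithmetic generalizes to GEMM. A small hypothetical helper, using `2*M*N*K` flops and a naive A+B+C byte-traffic model (both are modelling assumptions, not necessarily what the repo harness computes):

```python
def gemm_metrics(gpu_us, M, N, K, elem_size=1):
    """Return (TB/s, TFLOPS) for a GEMM measured at gpu_us microseconds.
    Assumes one MAC = 2 flops and naive read-A, read-B, write-C traffic."""
    secs = gpu_us * 1e-6
    flops = 2 * M * N * K
    bytes_moved = (M * K + K * N + M * N) * elem_size
    return bytes_moved / secs / 1e12, flops / secs / 1e12
```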
## 7. GEMM Test CLI Arguments

The test_preshuffle_gemm.py test supports extensive CLI configuration:

```bash
python tests/kernels/test_preshuffle_gemm.py \
    --in_dtype fp8 \
    -M 16 -N 5120 -K 8192 \
    --tile_m 16 --tile_n 128 --tile_k 256 \
    --lds_stage 2 \
    --num_iters 20 \
    --num_warmup 3 \
    --no_aiter_bench \
    --test_graph \
    --wfp4
```

`--test_graph` (or `-tg`) enables HIPGraph mode; `--wfp4` selects the FP4 weight path (gfx950 only).
## 8. Test Configuration via Environment Variables

| Variable | Used By | Description |
|---|---|---|
| `SOFTMAX_SHAPES` | scripts/run_benchmark.sh | Override softmax test shapes |
| `LAYERNORM_SHAPES` | scripts/run_benchmark.sh | Override layernorm test shapes |
| `FLYDSL_DUMP_IR` | Compiler | Dump intermediate IR at each pipeline stage |
| `FLYDSL_DUMP_DIR` | Compiler | IR dump directory (default: …) |
| | Compiler | Cache directory (default: …) |
| `RUN_TESTS_FULL` | scripts/run_tests.sh | Set to `1` to also run `large_shape`-marked tests |
| `BENCH_LOG_DIR` | scripts/run_benchmark.sh | Benchmark log directory (default: `/tmp/flydsl_bench`) |
## 9. IR Dump Workflow

### Via MlirCompiler

```bash
FLYDSL_DUMP_IR=1 FLYDSL_DUMP_DIR=./dumps python my_test.py
```

Produces numbered `.mlir` files per pipeline stage plus `final_isa.s`.

### Dedicated IR Dump Script

```bash
bash scripts/dumpir.sh
```
## 10. Source Files

| File | Description |
|---|---|
| `scripts/run_tests.sh` | GEMM test runner (pytest) |
| `scripts/run_benchmark.sh` | Benchmark harness with configurable shapes |
| `scripts/dumpir.sh` | IR dump helper script |
| `tests/conftest.py` | Pytest fixtures (MLIR context, module, insert point) |
| `tests/test_common.py` | Performance-testing utilities (perftest, checkAllclose) |
| `tests/kernels/benchmark_common.py` | Shared benchmark harness |
| `tests/utils.py` | Compilation and weight utilities |
| `tests/mlir/` | MLIR lit tests (18 files) |
| `tests/pyir/` | Python IR generation tests (3 files) |
| `tests/kernels/` | GPU kernel tests (12 files) |
| `tests/python/examples/` | AOT pre-compilation examples |