# Testing & Benchmarking Guide

> Test infrastructure, running tests, benchmark harness, writing new tests, and performance measurement.

## Quick Reference

| Category | Location | Requires GPU | Description |
|---|---|---|---|
| **MLIR lit tests** | `tests/mlir/{LayoutAlgebra,Conversion,Transforms}/` | No | Verify Fly dialect lowering |
| **Python IR tests** | `tests/pyir/test_*.py` | No | Python-based MLIR generation + lowering |
| **GPU kernel tests** | `tests/kernels/test_*.py` | Yes | Full compilation → GPU execution |
| **AOT examples** | `tests/python/examples/` | Varies | AOT pre-compilation examples |

**Run GEMM tests:**
```bash
bash scripts/run_tests.sh
```

**Run benchmarks:**
```bash
bash scripts/run_benchmark.sh
```

---

## 1. Test Categories

### 1.1 MLIR Lit Tests (`tests/mlir/`)

MLIR tests, organized by category and run through the `fly-opt` tool. They validate that Fly dialect operations lower correctly to standard MLIR dialects, with no GPU required.

**Directories:**

| Directory | Tests | Description |
|---|---|---|
| `LayoutAlgebra/` | `coalesce.mlir`, `composition.mlir`, `construction.mlir`, `coordinate.mlir`, `divide.mlir`, `int_tuple.mlir`, `product.mlir`, `size_cosize.mlir` | Layout algebra operations |
| `Conversion/` | `fly_gpu_to_llvm.mlir`, `gpu_ops.mlir`, `memref_alloca.mlir`, `memref_ops.mlir`, `mma_atom.mlir`, `pointer_ops.mlir`, `type_conversion.mlir` | Dialect conversion passes |
| `Transforms/` | `canonicalize.mlir`, `layout_lowering.mlir` | Transformation passes |

**Running individually:**
```bash
# Build fly-opt first if needed
cmake --build build-fly --target fly-opt -j$(nproc)

# Run a single test
build-fly/bin/fly-opt --fly-canonicalize tests/mlir/LayoutAlgebra/construction.mlir
```

### 1.2 Python IR Tests (`tests/pyir/`)

Python-based tests that generate MLIR IR using the FlyDSL Python API and verify the IR structure and lowering. No GPU execution required.

**Files:**

| Test File | Description |
|---|---|
| `test_layout_algebra.py` | Layout algebra: coalesce, composition, divide, product, complement |
| `test_rocir_print.py` | IR printing for Fly dialect ops |
| `test_static_vs_dynamic.py` | Static vs dynamic value handling |

**Running individually:**
```bash
python tests/pyir/test_layout_algebra.py
```

### 1.3 GPU Kernel Tests (`tests/kernels/`)

End-to-end tests that compile FlyDSL kernels, execute them on the GPU, and validate the results against a PyTorch reference.

**Files:**

| Test File | Kernel | Description |
|---|---|---|
| `test_vec_add.py` | VecAdd | Vector addition (C = A + B) |
| `test_softmax.py` | Softmax | Row-wise softmax |
| `test_layernorm.py` | LayerNorm | Layer normalization |
| `test_rmsnorm.py` | RMSNorm | RMS normalization |
| `test_preshuffle_gemm.py` | GEMM | Preshuffle MFMA GEMM (fp8/int8/int4/bf16) |
| `test_blockscale_preshuffle_gemm.py` | GEMM | Block-scale (MXFP4) preshuffle GEMM |
| `test_moe_gemm.py` | MoE GEMM | Mixture-of-Experts GEMM |
| `test_moe_blockscale.py` | MoE | MoE with block-scale quantization |
| `test_moe_reduce.py` | MoE Reduce | MoE reduction kernel |
| `test_pa.py` | Paged Attn | Paged attention decode |
| `test_quant.py` | Quantization | Quantization ops |
| `test_ref.py` | Reference | Reference implementations |

**Running individually:**
```bash
python tests/kernels/test_softmax.py
python tests/kernels/test_preshuffle_gemm.py --in_dtype fp8 -M 16 -N 5120 -K 8192
```

### 1.4 AOT Examples (`tests/python/examples/`)

AOT pre-compilation examples:

```
tests/python/examples/
└── aot_example.py      # AOT pre-compilation for preshuffle GEMM
```

---

## 2. Test Runner Scripts

### 2.1 `scripts/run_tests.sh`

Runs the preshuffle GEMM test suite via pytest:

```bash
bash scripts/run_tests.sh
```

**Features:**
- Auto-discovers build directory (`build-fly/`)
- Sets up `PYTHONPATH` and `LD_LIBRARY_PATH`
- Runs `pytest tests/kernels/test_preshuffle_gemm.py`
- By default skips `large_shape`-marked tests (set `RUN_TESTS_FULL=1` for all)
- Outputs pass/fail summary
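
The default-versus-full selection can be pictured as a small piece of argument-building logic. The sketch below is an assumption about how such a runner might behave, not a transcription of `run_tests.sh`: the helper name `build_pytest_args` is hypothetical, and deselecting with `-m "not large_shape"` is one standard pytest mechanism that may differ from what the script actually does.

```python
import os

TEST_FILE = "tests/kernels/test_preshuffle_gemm.py"


def build_pytest_args(env=None):
    """Return the pytest argument list for a default or full run (hypothetical helper)."""
    env = os.environ if env is None else env
    args = [TEST_FILE, "-v"]
    # Assumption: large shapes are deselected by marker expression unless
    # RUN_TESTS_FULL=1 is set.
    if env.get("RUN_TESTS_FULL") != "1":
        args += ["-m", "not large_shape"]
    return args
```

Passing the resulting list to `pytest.main(...)` or joining it into a shell command would give the two modes described above.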

**Environment setup:**
```bash
PYTHONPATH="${BUILD_DIR}/python_packages:${REPO_ROOT}:${PYTHONPATH}"
LD_LIBRARY_PATH="${MLIR_LIBS_DIR}:${LD_LIBRARY_PATH}"
```

### 2.2 `scripts/run_benchmark.sh`

Benchmark harness that runs selected ops over configurable shapes and reports throughput in TB/s and TFLOPS.

**Default configurations:**
```bash
# Softmax/LayerNorm: "M,N,dtype"
SOFTMAX_SHAPES='32768,8192,bf16'
LAYERNORM_SHAPES='32768,8192,bf16'

# Preshuffle GEMM: "dtype,M,N,K,tile_m,tile_n,tile_k"
GEMM_SHAPES='
fp8,16,40960,5120,16,128,256
fp8,16,77824,5120,16,128,256
fp8,5120,5120,8320,64,256,128
fp8,9728,8192,8320,64,256,128
int8,9728,8192,8320,64,256,128
int4,9728,8192,8320,64,256,128
bf16,5120,5120,8320,64,256,128
'

# FP4 GEMM (gfx950 only): "M,N,K,tile_m,tile_n,tile_k"
GEMM_FP4_SHAPES='8192,8192,8192,64,128,256'
```

**Selective execution:**
```bash
bash scripts/run_benchmark.sh                    # default: GEMM only
bash scripts/run_benchmark.sh softmax             # only softmax
bash scripts/run_benchmark.sh gemm moe            # GEMM and MoE
bash scripts/run_benchmark.sh --only softmax,layernorm
bash scripts/run_benchmark.sh --list              # list available ops
```

**Output format:** Tabular with TB/s and TFLOPS columns:
```
op             shape                              dtype       TB/s    TFLOPS
-------------- ---------------------------------- ---------- ---------- ----------
gemm           16x40960x5120                      fp8         1.234     56.789
```

**Logs:** Written to `${BENCH_LOG_DIR:-/tmp/flydsl_bench}/`

---

## 3. Pytest Configuration

### 3.1 `tests/conftest.py`

Pytest configuration with MLIR context fixtures for the Fly dialect.

**Fixtures:**

```python
@pytest.fixture
def ctx():
    """Fresh MLIR context per test with dialects registered."""
    # Creates Context, yields object with: ctx.context, ctx.module, ctx.location

@pytest.fixture
def module(ctx):
    """Provides ctx.module."""

@pytest.fixture
def insert_point(ctx):
    """Sets insertion point to module body."""
```

**Build discovery:** Supports multiple build layouts:
- `build-fly/python_packages` (preferred)
- `build/python_packages/flydsl` (fallback)
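
A rough sketch of this preferred-then-fallback lookup follows. The helper names are hypothetical, and exactly which directory `conftest.py` adds to `sys.path` for each layout is an assumption.

```python
import sys
from pathlib import Path

# Candidate layouts in priority order, relative to the repository root.
CANDIDATES = ("build-fly/python_packages", "build/python_packages/flydsl")


def find_python_packages(repo_root):
    """Return the first existing candidate directory, or None if none exists."""
    for rel in CANDIDATES:
        path = Path(repo_root) / rel
        if path.is_dir():
            return path
    return None


def add_build_to_sys_path(repo_root):
    """Prepend the discovered package directory to sys.path (hypothetical helper)."""
    found = find_python_packages(repo_root)
    if found is not None and str(found) not in sys.path:
        sys.path.insert(0, str(found))
    return found
```

Returning `None` instead of raising lets the caller decide whether a missing build is an error or a reason to skip.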

**Session hook:** Prevents pytest exit code 5 (no tests collected) from being treated as failure.
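
One common way to get this behavior is pytest's `pytest_sessionfinish` hook, sketched below. Whether `tests/conftest.py` uses this exact hook is an assumption; the constant is spelled out numerically so the snippet reads without importing pytest.

```python
NO_TESTS_COLLECTED = 5  # numeric value of pytest.ExitCode.NO_TESTS_COLLECTED


def pytest_sessionfinish(session, exitstatus):
    """Report success when pytest collected no tests."""
    if exitstatus == NO_TESTS_COLLECTED:
        session.exitstatus = 0
```

This matters on machines without a GPU, where every test in a directory may be skipped or deselected and CI would otherwise report a failure.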

---

## 4. Performance Measurement

### 4.1 `tests/test_common.py`

Core performance testing utilities (adapted from AIter).

**`perftest()` decorator:**
```python
@perftest(num_iters=20, num_warmup=3, testGraph=False, num_rotate_args=0)
def my_kernel_test(Input, Output):
    # Kernel invocation
    ...
```

**Features:**
- Device memory profiling to determine rotation count
- Torch CUDA event timing
- HIPGraph capture mode (`testGraph=True`)
- Cache-aware iteration calculation

**`checkAllclose()` function:**
```python
checkAllclose(output, reference, rtol=1e-2, atol=1e-2, tol_err_ratio=0.05)
```
Returns a mismatch ratio in [0, 1] (0 = pass).
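
As an illustration of what a mismatch ratio means, the sketch below counts elements that violate the usual `allclose` rule, `|out - ref| <= atol + rtol * |ref|`, on plain Python lists. It is not the implementation of `checkAllclose`; the function name is hypothetical, and how `tol_err_ratio` enters the real function is not modeled here.

```python
def mismatch_ratio(output, reference, rtol=1e-2, atol=1e-2):
    """Fraction of elements outside |out - ref| <= atol + rtol * |ref|."""
    if len(output) != len(reference):
        raise ValueError("output and reference must have the same length")
    if not output:
        return 0.0
    bad = sum(
        1 for out, ref in zip(output, reference)
        if not abs(out - ref) <= atol + rtol * abs(ref)  # NaN counts as a mismatch
    )
    return bad / len(output)


ratio = mismatch_ratio([1.0, 2.0, 3.0, 10.0], [1.0, 2.0, 3.0, 4.0])
assert ratio == 0.25  # one of four elements is out of tolerance
```

A ratio rather than a boolean makes it possible to tolerate a small fraction of outliers, which is useful for low-precision dtypes such as fp8.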

**`verify_output()` function:**
```python
verify_output(c_out, c_ref, atol=1e-2, rtol=1e-2, msg='')
```
High-level validation wrapper around `checkAllclose`.

### 4.2 `tests/kernels/benchmark_common.py`

Shared benchmark harness for performance comparison.

**Key functions:**
```python
# Measure device time (torch CUDA events)
gpu_us = bench_gpu_us_torch(fn, warmup=20, iters=200)
```

---

## 5. Compilation Utilities (`tests/utils.py`)

### `compile_to_hsaco()`

Standalone compilation path for tests:

```python
from tests.utils import compile_to_hsaco

hsaco = compile_to_hsaco(mlir_module, kernel_name="my_kernel")
```

**Pipeline stages:**
1. Fly coordinate lowering
2. `fly-to-standard` lowering
3. `canonicalize` + `cse`
4. Attach ROCDL target (auto-detect GPU arch)
5. `convert-gpu-to-rocdl` (SCF→CF, bare pointer memref)
6. `gpu-to-llvm` + `lower-to-llvm`
7. `gpu-module-to-binary`

### Weight Utilities

```python
from tests.utils import pertoken_quant, shuffle_weight

# Per-token quantization (handles NaN/Inf)
quantized, scales = pertoken_quant(tensor, dtype=torch.float8_e4m3fnuz)

# Weight preshuffle for MFMA (layout 16x16)
shuffled = shuffle_weight(weight, layout=(16, 16))
```

---

## 6. Writing New Tests

### 6.1 PyIR Test Pattern (No GPU)

```python
# tests/pyir/test_my_feature.py
import flydsl.expr as fx
from flydsl.expr.typing import T

def test_my_layout_op(ctx, insert_point):
    shape = fx.make_shape(4, 8)
    stride = fx.make_stride(8, 1)
    layout = fx.make_layout(shape, stride)
    result = fx.size(layout)
    ir_str = str(ctx.module)
    assert "fly.make_layout" in ir_str
```

### 6.2 GPU Kernel Test Pattern (New API)

```python
# tests/kernels/test_my_kernel.py
import torch
import flydsl.compiler as flyc
import flydsl.expr as fx
from flydsl.expr import arith, gpu
from tests.test_common import checkAllclose

@flyc.kernel
def my_kernel(A: fx.Tensor, B: fx.Tensor, N: fx.Constexpr[int]):
    tid = gpu.thread_idx.x
    bid = gpu.block_idx.x
    # ... kernel body ...

@flyc.jit
def launch(A: fx.Tensor, B: fx.Tensor, N: fx.Constexpr[int],
           stream: fx.Stream = fx.Stream(None)):
    my_kernel(A, B, N).launch(grid=(N // 256,), block=(256,), stream=stream)

def test_my_kernel():
    N = 1024
    A = torch.randn(N, device="cuda", dtype=torch.float32)
    B = torch.empty(N, device="cuda", dtype=torch.float32)

    launch(A, B, N)

    # Reference
    ref = A  # or some computation

    # Validate
    err = checkAllclose(B, ref, rtol=1e-2, atol=1e-2)
    assert err == 0, f"Mismatch: {err * 100:.2f}%"
```

### 6.3 Benchmark Test Pattern

```python
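import torch
from tests.kernels.benchmark_common import bench_gpu_us_torch

def benchmark_my_kernel(M=32768, N=8192, dtype=torch.bfloat16):
    # Setup: compile_my_kernel is a placeholder for your own compile step
    launch_fn = compile_my_kernel(...)
    input_tensor = torch.randn(M, N, device="cuda", dtype=dtype)
    output_tensor = torch.empty_like(input_tensor)

    def run():
        launch_fn(input_tensor, output_tensor)

    # Measure average device time per call, in microseconds
    gpu_us = bench_gpu_us_torch(run, warmup=20, iters=200)

    # Compute metrics: one read of the input plus one write of the output
    elem_size = input_tensor.element_size()
    total_bytes = 2 * M * N * elem_size
    bandwidth_tbs = total_bytes / (gpu_us * 1e-6) / 1e12
    print(f"Time: {gpu_us:.1f} us, Bandwidth: {bandwidth_tbs:.2f} TB/s")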
```

---

## 7. GEMM Test CLI Arguments

The `test_preshuffle_gemm.py` test supports extensive CLI configuration:

```bash
# --test_graph (or -tg) enables HIPGraph mode.
# --wfp4 selects the FP4 weight path (gfx950 only); drop it on other GPUs.
python tests/kernels/test_preshuffle_gemm.py \
    --in_dtype fp8 \
    -M 16 -N 5120 -K 8192 \
    --tile_m 16 --tile_n 128 --tile_k 256 \
    --lds_stage 2 \
    --num_iters 20 \
    --num_warmup 3 \
    --no_aiter_bench \
    --test_graph \
    --wfp4
```

---

## 8. Test Configuration via Environment Variables

| Variable | Used By | Description |
|---|---|---|
| `ROCDSL_SOFTMAX_SHAPES` | `test_softmax.py` | Override softmax test shapes (`"M,N,dtype;..."`) |
| `ROCDSL_LAYERNORM_SHAPES` | `test_layernorm.py` | Override layernorm test shapes |
| `FLYDSL_DUMP_IR` | Compiler | Dump intermediate IR at each pipeline stage |
| `FLYDSL_DUMP_DIR` | Compiler | IR dump directory (default: `~/.flydsl/debug`) |
| `FLYDSL_RUNTIME_CACHE_DIR` | Compiler | Cache directory (default: `~/.flydsl/cache`) |
| `RUN_TESTS_FULL` | `run_tests.sh` | Set to `1` to run all parametrized cases |
| `BENCH_LOG_DIR` | `run_benchmark.sh` | Benchmark log directory (default: `/tmp/flydsl_bench`) |
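
The shape overrides use a `"M,N,dtype;..."` string, which can be read with a few lines of parsing. The sketch below shows one way to do it; the function name is hypothetical, and the actual parsing in `test_softmax.py` and `test_layernorm.py` may differ (for example in how whitespace or a trailing `;` is handled).

```python
import os


def shapes_from_env(var_name, default):
    """Parse 'M,N,dtype;M,N,dtype;...' from an environment variable."""
    raw = os.environ.get(var_name, default)
    shapes = []
    for entry in raw.split(";"):
        entry = entry.strip()
        if not entry:  # tolerate a trailing ';'
            continue
        m, n, dtype = (field.strip() for field in entry.split(","))
        shapes.append((int(m), int(n), dtype))
    return shapes
```

With this format, a run such as `ROCDSL_SOFTMAX_SHAPES="1024,4096,bf16;32768,8192,bf16" python tests/kernels/test_softmax.py` would cover two shapes in one invocation.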

---

## 9. IR Dump Workflow

### Via `MlirCompiler`

```bash
FLYDSL_DUMP_IR=1 FLYDSL_DUMP_DIR=./dumps python my_test.py
```

Produces numbered `.mlir` files per pipeline stage plus `final_isa.s`.

### Dedicated IR Dump Script

```bash
bash scripts/dumpir.sh
```

---

## 10. Source Files

| File | Description |
|---|---|
| `scripts/run_tests.sh` | GEMM test runner (pytest) |
| `scripts/run_benchmark.sh` | Benchmark harness with configurable shapes |
| `scripts/dumpir.sh` | IR dump helper script |
| `tests/conftest.py` | Pytest fixtures (MLIR context, module, insert point) |
| `tests/test_common.py` | `perftest()`, `checkAllclose()`, `verify_output()` |
| `tests/utils.py` | `compile_to_hsaco()`, `pertoken_quant()`, `shuffle_weight()` |
| `tests/kernels/benchmark_common.py` | `bench_gpu_us_torch()`, benchmark harness |
| `tests/mlir/{LayoutAlgebra,Conversion,Transforms}/` | MLIR lit tests (17 files) |
| `tests/pyir/test_*.py` | Python IR generation tests (3 files) |
| `tests/kernels/test_*.py` | GPU kernel tests (12 files) |
| `tests/python/examples/` | AOT pre-compilation examples |