Architecture#

tritonBLAS delivers high-performance matrix multiplication through a layered architecture: two simple entry points, an analytical model that picks kernel configurations, and a small set of specialized kernels underneath.

At a Glance#

Design Philosophy

No autotuning required. tritonBLAS uses an analytical model to predict optimal configurations instantly, eliminating the overhead and unpredictability of traditional autotuning approaches.

┌─────────────────────────────────────────────────────────────────────┐
│                         Your Application                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   ┌─────────────────────┐       ┌─────────────────────────────┐     │
│   │      matmul()       │       │        matmul_lt()          │     │
│   │   Simple & Quick    │       │   Peak Performance API      │     │
│   └──────────┬──────────┘       └──────────────┬──────────────┘     │
│              │                                  │                   │
│              └──────────────┬───────────────────┘                   │
│                             ▼                                       │
│   ┌─────────────────────────────────────────────────────────────┐   │
│   │                   Analytical Model                          │   │
│   │          Instant configuration prediction                   │   │
│   └─────────────────────────────────────────────────────────────┘   │
│                             │                                       │
│              ┌──────────────┼──────────────┐                        │
│              ▼              ▼              ▼                        │
│   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐              │
│   │  Persistent  │  │   Stream-K   │  │ Specialized  │              │
│   │    GEMM      │  │    GEMM      │  │   Kernels    │              │
│   └──────────────┘  └──────────────┘  └──────────────┘              │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
                    ┌──────────────────┐
                    │   Triton + ROCm  │
                    │    GPU Runtime   │
                    └──────────────────┘

The Two APIs#

tritonBLAS offers two paths to high performance:

              🚀 matmul()                      matmul_lt()
Purpose       Drop-in replacement              Peak performance
Config        Automatic                        Reusable selector
Best for      Prototyping, varied workloads    Production, repeated operations

matmul() — Simple & Quick#

Perfect for quick integration. Just swap in tritonblas.matmul() and get automatic optimization:

import tritonblas

tritonblas.matmul(A, B, C)
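
For example, a complete call might look like the following; the use of PyTorch tensors, and the shapes and dtypes chosen here, are illustrative assumptions rather than requirements of the API:

import torch
import tritonblas

# Illustrative operands on the GPU; C receives the product of A and B.
A = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
B = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
C = torch.empty(4096, 4096, device="cuda", dtype=torch.float16)

# The analytical model picks the kernel configuration from M, N, K and the dtypes.
tritonblas.matmul(A, B, C)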

matmul_lt() — Peak Performance#

Maximum throughput with reusable configurations. Inspired by hipBLASLt/cuBLASLt:

import tritonblas

# Create configuration once
selector = tritonblas.MatmulHeuristicResult(m, n, k, a_dtype, b_dtype, c_dtype)

# Reuse for maximum performance
tritonblas.matmul_lt(A, B, C, selector)

How It Works#

Standard Path#

When you call matmul(), tritonBLAS selects and launches the right kernel automatically:

   Your Code          Analytical Model         Optimal Kernel
       │                    │                       │
       │  matmul(A, B, C)   │                       │
       │ ──────────────────►│                       │
       │                    │  Analyze M, N, K      │
       │                    │  + dtypes + hardware  │
       │                    │                       │
       │                    │  Select config        │
       │                    │ ─────────────────────►│
       │                    │                       │  Execute
       │◄───────────────────────────────────────────│
       │         Result C                           │

Optimized Path#

With matmul_lt(), you control when configuration happens:

                         ┌──────────────────────────────────────┐
                         │  MatmulHeuristicResult(m, n, k, ...) │
                         │                                      │
                         │  ► Analyzed once                     │
                         │  ► Stored in selector                │
                         │  ► Reused for all calls              │
                         └───────────────┬──────────────────────┘
                                         │
        ┌──────────────────┬─────────────┼─────────────┬──────────────────┐
        ▼                  ▼             ▼             ▼                  ▼
   matmul_lt()        matmul_lt()   matmul_lt()   matmul_lt()        matmul_lt()
        │                  │             │             │                  │
        ▼                  ▼             ▼             ▼                  ▼
   ┌─────────┐        ┌─────────┐   ┌─────────┐   ┌─────────┐        ┌─────────┐
   │ Result  │        │ Result  │   │ Result  │   │ Result  │        │ Result  │
   └─────────┘        └─────────┘   └─────────┘   └─────────┘        └─────────┘
   
   No recompilation. No reconfiguration. Maximum throughput.
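
In code, the reuse pattern above might look like this sketch (the problem size, dtypes, iteration count, and passing torch dtypes directly to MatmulHeuristicResult are illustrative assumptions):

import torch
import tritonblas

m = n = k = 8192  # one fixed problem shape, reused many times (illustrative)
A = torch.randn(m, k, device="cuda", dtype=torch.float16)
B = torch.randn(k, n, device="cuda", dtype=torch.float16)
C = torch.empty(m, n, device="cuda", dtype=torch.float16)

# Analyzed once, stored in the selector...
selector = tritonblas.MatmulHeuristicResult(m, n, k, A.dtype, B.dtype, C.dtype)

# ...then reused for every call with this shape: no per-call configuration cost.
for _ in range(1000):
    tritonblas.matmul_lt(A, B, C, selector)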

Kernel Architecture#

tritonBLAS includes several kernel implementations:

Persistent GEMM#

The workhorse kernel for most workloads:

  • Persistent threads: Workgroups stay alive to process multiple tiles (see the sketch after this list)

  • Tiled computation: Optimized block sizes for GPU cache hierarchy

  • Multi-XCD aware: Chiplet-optimized scheduling for MI300X
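
The persistent pattern amounts to launching a fixed pool of workgroups and letting each stride through the grid of output tiles. A plain-Python illustration of the scheduling idea (this is the scheduling only, not the actual Triton kernel):

# Illustration only: each of the P workgroups processes tiles P apart,
# staying alive until the whole grid is covered.
def persistent_schedule(num_tiles, num_workgroups):
    tiles = {wg: [] for wg in range(num_workgroups)}
    for tile in range(num_tiles):
        tiles[tile % num_workgroups].append(tile)
    return tiles

# 10 tiles over 4 workgroups:
# {0: [0, 4, 8], 1: [1, 5, 9], 2: [2, 6], 3: [3, 7]}
print(persistent_schedule(10, 4))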

Stream-K GEMM#

For better load balancing on irregular shapes:

  • Fine-grained work distribution: Splits work at the K-iteration level (see the sketch after this list)

  • Automatic tail handling: No wasted compute on partial tiles

  • Enable with: enable_streamk=True
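
The key idea is that work is divided over the flattened (tile × K-iteration) space rather than over whole tiles, so every workgroup gets a nearly equal share even when the tile count does not divide evenly. A plain-Python sketch of the partitioning arithmetic (a simplification, not tritonBLAS's implementation):

# Illustration only: split the global iteration space evenly. Boundaries may
# fall mid-tile, which is why Stream-K needs a fix-up step to combine
# partial results -- the "automatic tail handling" above.
def streamk_ranges(num_tiles, iters_per_tile, num_workgroups):
    total = num_tiles * iters_per_tile
    base, extra = divmod(total, num_workgroups)
    ranges, start = [], 0
    for wg in range(num_workgroups):
        count = base + (1 if wg < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges

# 10 tiles x 8 K-iterations over 3 workgroups:
# [(0, 27), (27, 54), (54, 80)] -- two boundaries fall inside tiles.
print(streamk_ranges(10, 8, 3))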

Specialized Kernels#

  • FP4 GEMM: 4-bit floating point for extreme compression

  • A8W8 GEMM: INT8 quantized inference with scale factors (illustrated below)

  • FP8 GEMM: 8-bit floating point for efficient training
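
To make the scale-factor arithmetic concrete, here is the generic A8W8 computation in plain PyTorch (semantics only, with illustrative values; this is not tritonBLAS's kernel or API):

import torch

# Illustration only: INT8 operands with per-tensor quantization scales.
A8 = torch.randint(-128, 128, (64, 32), dtype=torch.int8)
W8 = torch.randint(-128, 128, (32, 16), dtype=torch.int8)
scale_a, scale_w = 0.02, 0.01

# Accumulate in int32 to avoid overflow, then rescale to floating point.
acc = A8.to(torch.int32) @ W8.to(torch.int32)
C = acc.to(torch.float32) * (scale_a * scale_w)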


The Stages Module#

For kernel developers, tritonBLAS provides composable building blocks:

┌─────────────────────────────────────────────────────────────────┐
│                        Your Custom Kernel                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────────────┐    │
│  │ GemmContext │   │ScheduleCtx  │   │ InputView/OutputView│    │
│  │             │   │             │   │                     │    │
│  │ Block sizes │   │ Tile loop   │   │ Matrix access       │    │
│  │ K-loop      │   │ Work dist.  │   │ Pointer math        │    │
│  │ Accumulator │   │ Stream-K    │   │ Bounds checking     │    │
│  └─────────────┘   └─────────────┘   └─────────────────────┘    │
│                                                                 │
│  ┌─────────────┐   ┌─────────────┐                              │
│  │  ScaleView  │   │  BiasView   │                              │
│  │             │   │             │                              │
│  │ Quantization│   │ Bias add    │                              │
│  │ scales      │   │ epilogue    │                              │
│  └─────────────┘   └─────────────┘                              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
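
As a rough mental model of how responsibility is divided, here are plain-Python stand-ins for the roles in the diagram (hypothetical fields chosen for illustration; the real building blocks live on the Triton side, and their actual signatures are in the Stages Reference):

from dataclasses import dataclass

# Illustration only: these mirror the diagram's roles, not the real API.
@dataclass
class GemmContext:
    # Tiling parameters and accumulator setup for the K-loop.
    block_m: int
    block_n: int
    block_k: int

@dataclass
class ScheduleCtx:
    # Decides which output tiles a workgroup owns (persistent or Stream-K).
    num_tiles: int
    num_workgroups: int

@dataclass
class InputView:
    # Wraps one operand: pointer math, strides, and bounds checking.
    shape: tuple
    strides: tuple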

See the Stages Reference for details.


Learn More#