Core API Reference#
This documentation is automatically generated from source code docstrings.
tritonblas Module#
Matrix Multiplication Functions#
matmul#
matmul_lt#
matmul_a8w8#
matmul_a8w8_lt#
matmul_fp4#
- matmul_fp4(a, b, c, a_scales, b_scales, block_m=None, block_n=None, block_k=None, group_size_m=8, num_warps=8, num_stages=2)[source]#
FP4 matrix multiplication: C = A @ B
- Parameters:
a (torch.Tensor) – Input matrix A in FP4 format with shape (M, K//2), two FP4 values packed per uint8
b (torch.Tensor) – Input matrix B in FP4 format with shape (N, K//2), two FP4 values packed per uint8
c (torch.Tensor) – Output matrix C (M, N) in bfloat16 or float16
a_scales (torch.Tensor) – Scales for A in e8m0 format (M, K // 32)
b_scales (torch.Tensor) – Scales for B in e8m0 format (N, K // 32)
block_m (int) – Block size for M dimension
block_n (int) – Block size for N dimension
block_k (int) – Block size for K dimension (must be multiple of 64 for FP4)
group_size_m (int) – Group size for M dimension tiling
num_warps (int) – Number of warps per thread block (default: 8)
num_stages (int) – Number of pipeline stages (default: 2)
- Returns:
Output matrix C
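The shape and packing constraints above can be sketched without the library itself. The following is an illustrative helper, not part of tritonblas: the nibble order used in `pack_fp4_pair` is an assumption for demonstration, while the shapes in `expected_shapes` follow directly from the documented signature (packed inputs of width K//2, one e8m0 scale per 32-element block, block_k a multiple of 64).

```python
def pack_fp4_pair(lo_nibble: int, hi_nibble: int) -> int:
    """Pack two 4-bit FP4 codes into one uint8 byte.

    Nibble order (low nibble first) is an assumption for illustration,
    not a documented tritonblas convention.
    """
    assert 0 <= lo_nibble < 16 and 0 <= hi_nibble < 16
    return (hi_nibble << 4) | lo_nibble


def expected_shapes(m: int, n: int, k: int) -> dict:
    """Tensor shapes implied by the matmul_fp4 signature above."""
    # block_k must be a multiple of 64 for FP4, so K itself must be too.
    assert k % 64 == 0, "K must be a multiple of 64 for FP4"
    return {
        "a": (m, k // 2),          # two FP4 values packed per uint8
        "b": (n, k // 2),
        "a_scales": (m, k // 32),  # one e8m0 scale per 32-element block
        "b_scales": (n, k // 32),
        "c": (m, n),               # bfloat16 or float16 output
    }
```

For example, a 128 x 256 x 64 problem needs a packed A of shape (128, 32) and A scales of shape (128, 2).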
Configuration Classes#
OrigamiMatmulSelector#
- class OrigamiMatmulSelector(m, n, k, a_dtype, b_dtype, out_dtype, device, mx_block_size=0, streamk=False)[source]#
Bases: object
- Parameters:
- dtype_to_str = {torch.bfloat16: 'bf16', torch.complex128: 'c64', torch.complex64: 'c32', torch.float16: 'f16', torch.float32: 'f32', torch.float64: 'f64', torch.float8_e4m3fn: 'f8', torch.float8_e4m3fnuz: 'f8', torch.float8_e5m2: 'f8', torch.float8_e5m2fnuz: 'f8', torch.int32: 'i32', torch.int8: 'i8'}#
- property block_m#
- property block_n#
- property block_k#
- property group_m#
- property num_sms#
- property waves_per_eu#
- property even_k#
- property sk_grid#
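The `dtype_to_str` table above keys kernel configurations by a short dtype tag; notably, all four float8 variants collapse to the single tag `'f8'`. A plain-Python mirror of that mapping (an illustration only, using dtype names as strings so it runs without torch; the real attribute maps torch dtype objects):

```python
# Mirror of the documented dtype_to_str table, keyed by dtype name
# strings instead of torch dtype objects (illustration only).
DTYPE_TO_STR = {
    "torch.bfloat16": "bf16",
    "torch.complex128": "c64",
    "torch.complex64": "c32",
    "torch.float16": "f16",
    "torch.float32": "f32",
    "torch.float64": "f64",
    "torch.float8_e4m3fn": "f8",
    "torch.float8_e4m3fnuz": "f8",
    "torch.float8_e5m2": "f8",
    "torch.float8_e5m2fnuz": "f8",
    "torch.int32": "i32",
    "torch.int8": "i8",
}


def dtype_str(name: str) -> str:
    """Look up the short tag used when selecting a kernel config."""
    return DTYPE_TO_STR[name]
```

Collapsing the float8 variants means one tuned configuration can serve every 8-bit float encoding, since tile-size selection depends on element width rather than the exact bit layout.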