ATOM Configuration Guide

ATOM (AiTer Optimized Model) is AMD’s lightweight LLM inference engine built on AITER kernels for ROCm/HIP GPUs. This guide documents every configuration class, CLI flag, and environment variable that controls ATOM’s runtime behaviour.


Quick Reference

| Config Class | Primary Purpose |
| --- | --- |
| Config | Master dataclass – model path, memory, TP size, scheduler limits, KV cache, profiler, and references to all sub-configs |
| CompilationConfig | Compilation level (0-3), CUDA graph capture sizes, piecewise splitting ops, inductor settings |
| CompilationLevel | Integer constants for the four compilation levels |
| CUDAGraphMode | Enum controlling how CUDA graphs are captured (none / piecewise / full / hybrid) |
| QuantizationConfig | Layer-wise quantization orchestrator: global config, per-layer overrides, exclude lists, layer name remapping |
| LayerQuantConfig | Per-layer quantization parameters: quant type, dtype, dynamic flag, method |
| ParallelConfig | Data-parallel size, rank, master IP/port |
| SpeculativeConfig | Speculative decoding method, draft model, number of speculative tokens |
| KVCacheConfig / KVCacheTensor | Per-layer KV cache tensor descriptors (k/v caches and scales) |
| SamplingParams | Temperature, max tokens, stop strings, ignore-EOS flag |
| EngineArgs | CLI argument parser that builds a Config for LLMEngine |


1. Master Configuration (Config)

Defined in atom/config.py. The root dataclass that the engine consumes.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| model | str | (required) | HuggingFace model name or local path |
| trust_remote_code | bool | False | Trust remote code when loading the model from HuggingFace |
| max_num_batched_tokens | int | 16384 | Maximum number of tokens batched together per scheduler step |
| scheduler_delay_factor | float | 0.0 | Multiplicative delay (factor x previous prompt latency) before scheduling the next prompt |
| max_num_seqs | int | 512 | Maximum number of sequences batched together |
| max_model_len | int \| None | None | Maximum context length; defaults to hf_config.max_position_embeddings (capped by it when set) |
| gpu_memory_utilization | float | 0.9 | Fraction of GPU memory available for KV cache and weights (0.0 – 1.0) |
| tensor_parallel_size | int | 1 | Number of tensor-parallel GPUs (1 – 8) |
| enforce_eager | bool | False | Disable compilation and CUDA graphs; run in eager mode |
| parallel_config | ParallelConfig | ParallelConfig() | Data-parallel configuration (see Section 4) |
| kv_cache_block_size | int | 16 | Block size for paged KV cache; must be a multiple of 16 or exactly 1 |
| num_kvcache_blocks | int | -1 | Number of KV cache blocks (-1 = auto) |
| kv_cache_dtype | str | "bf16" | KV cache data type ("bf16" or "fp8") |
| enable_prefix_caching | bool | False | Enable prefix caching to reuse KV blocks across requests sharing the same prefix |
| port | int | 8006 | Engine internal communication port |
| torch_profiler_dir | str \| None | os.getenv("ATOM_TORCH_PROFILER_DIR", None) | Directory for saving PyTorch profiler traces; created if it does not exist |
| compilation_config | CompilationConfig | CompilationConfig() | Compilation and CUDA graph settings (see Section 2) |
| quant_config | QuantizationConfig | (auto-detected) | Quantization settings; auto-detected from the HuggingFace config during __post_init__ via QuantizationConfig(hf_config) (see Section 3) |
| asyncio_mode | bool | False | Enable asyncio-based engine loop |
| load_dummy | bool | False | Skip loading model weights (for benchmarking / testing) |
| enable_expert_parallel | bool | False | Enable Expert Parallelism for MoE models |
| master_addr | str | "127.0.0.1" | Master address for distributed communication |
| graph_bs | Optional[list[int]] | None | Explicit list of batch sizes for CUDA graph capture; derived from compilation_config during init |
| enable_dp_attention | bool | False | Enable data-parallel attention |
| torch_dtype | torch.dtype | (computed) | Inferred from hf_config.torch_dtype; falls back to torch.bfloat16 |
| speculative_config | Optional[SpeculativeConfig] | None | Speculative decoding configuration (see Section 5) |
| bos_token_id | int | -1 | Beginning-of-sequence token ID (-1 = use model default) |
| eos_token_id | int | -1 | End-of-sequence token ID (-1 = use model default) |
| stop_token_ids | list[int] | [] | Additional stop token IDs; populated from GenerationConfig.eos_token_id during init |
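The kv_cache_block_size constraint above ("a multiple of 16 or exactly 1") can be sketched as a standalone check. This is an illustrative helper, not ATOM's actual validation code:

```python
def check_kv_cache_block_size(block_size: int) -> None:
    # Documented constraint: block size must be exactly 1, or a positive
    # multiple of 16 (illustrative sketch, not ATOM's validator).
    if block_size != 1 and (block_size <= 0 or block_size % 16 != 0):
        raise ValueError(
            f"kv_cache_block_size must be 1 or a multiple of 16, got {block_size}"
        )

check_kv_cache_block_size(16)  # default value: valid
check_kv_cache_block_size(1)   # the special-cased value: valid
```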

Auto-derived fields (set in __post_init__, not user-supplied):

| Field | Type | Description |
| --- | --- | --- |
| hf_config | PretrainedConfig | Loaded automatically via get_hf_config(model) |
| generation_config | GenerationConfig | Loaded automatically via get_generation_config(model) |


2. Compilation Configuration (CompilationConfig)

Defined in atom/config.py. Controls torch.compile and CUDA graph behaviour.

2.1 Compilation Levels (CompilationLevel)

| Constant | Value | Description |
| --- | --- | --- |
| NO_COMPILATION | 0 | No compilation – pure eager execution |
| DYNAMO_AS_IS | 1 | Use torch.compile / TorchDynamo as-is |
| DYNAMO_ONCE | 2 | TorchDynamo with a single compilation pass |
| PIECEWISE | 3 | Piecewise compilation with CUDA graph capture (recommended for production) |

2.2 CompilationConfig Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| level | int | 0 | Compilation level (see table above); must be 0 – 3 |
| use_cudagraph | bool | True | Whether to use CUDA graphs |
| cudagraph_capture_sizes | Optional[list[int]] | None | Explicit list of batch sizes for CUDA graph capture; overrides cuda_graph_sizes when set |
| cuda_graph_sizes | list[int] | [] (post-init: [512]) | CUDA graph sizing strategy: one value N generates [1, 2, 4, 8] + range(16, N + 1, 16); multiple values are used as-is; empty defaults to [512] |
| debug_dump_path | str | "" | Path to dump debug / compilation information |
| cache_dir | str | "" | Directory for compilation caches |
| use_inductor | bool | True | Enable the TorchInductor backend |
| cudagraph_mode | Optional[CUDAGraphMode] | None | CUDA graph capture mode (see below); set to PIECEWISE automatically at level 3 |
| splitting_ops | Optional[list[str]] | None | Ops that split the graph into sub-graphs for piecewise compilation; auto-populated at level 3 with ["aiter.unified_attention_with_output", "aiter.mla_attention"] |
| cudagraph_copy_inputs | bool | False | Copy input tensors into internally managed buffers before CUDA graph replay; only effective in PIECEWISE mode |
| compile_sizes | Optional[list[Union[int, str]]] | None | Sizes to compile for inductor; accepts integers and the string "cudagraph_capture_sizes" |
| inductor_compile_config | dict | {} | Additional configuration passed to the inductor backend |
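The cuda_graph_sizes strategy can be illustrated with a small standalone function. This mirrors the documented post-init behaviour; it is a sketch, not ATOM's actual code:

```python
def expand_cuda_graph_sizes(sizes: list[int]) -> list[int]:
    """Sketch of the documented cuda_graph_sizes strategy."""
    if not sizes:                 # empty -> defaults to [512]
        sizes = [512]
    if len(sizes) == 1:           # one value N -> [1, 2, 4, 8] + range(16, N + 1, 16)
        n = sizes[0]
        return [1, 2, 4, 8] + list(range(16, n + 1, 16))
    return list(sizes)            # multiple values are used as-is

print(expand_cuda_graph_sizes([64]))  # [1, 2, 4, 8, 16, 32, 48, 64]
```

With the default empty list, this yields graph sizes up to 512 in steps of 16, which is why capture memory grows with the single-value setting.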

2.3 CUDA Graph Mode (CUDAGraphMode)

| Mode | Value | Description |
| --- | --- | --- |
| NONE | 0 | No CUDA graph capture |
| PIECEWISE | 1 | Piecewise CUDA graphs – attention ops stay outside the graph for flexibility (default at level 3) |
| FULL | 2 | Full CUDA graph capture for all batches; best for small models / short prompts |
| FULL_DECODE_ONLY | (FULL, NONE) | Full CUDA graphs for decode batches only; mixed prefill-decode runs without graphs (useful in P/D setups) |
| FULL_AND_PIECEWISE | (FULL, PIECEWISE) | Full graphs for decode, piecewise for prefill/mixed – most performant mode for most models |

Helper methods on CUDAGraphMode:

  • decode_mode() – returns the mode used for pure decode batches.

  • mixed_mode() – returns the mode used for mixed prefill-decode batches.

  • requires_piecewise_compilation() – whether the mode needs piecewise compilation.

  • has_full_cudagraphs() – whether the mode includes full CUDA graph capture.

  • separate_routine() – whether decode and mixed batches use different routines.
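The composite-mode semantics can be sketched as a lookup from each mode to its (decode, mixed) pair, matching the Value column in the table above. Strings stand in for the real enum; this is illustrative, not ATOM's implementation:

```python
# Each mode maps to (decode_mode, mixed_mode), per the table above.
MODE_ROUTINES = {
    "NONE":               ("NONE", "NONE"),
    "PIECEWISE":          ("PIECEWISE", "PIECEWISE"),
    "FULL":               ("FULL", "FULL"),
    "FULL_DECODE_ONLY":   ("FULL", "NONE"),
    "FULL_AND_PIECEWISE": ("FULL", "PIECEWISE"),
}

def decode_mode(mode: str) -> str:
    return MODE_ROUTINES[mode][0]

def mixed_mode(mode: str) -> str:
    return MODE_ROUTINES[mode][1]

def separate_routine(mode: str) -> bool:
    # Decode and mixed batches need different routines when the pair differs.
    decode, mixed = MODE_ROUTINES[mode]
    return decode != mixed
```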


3. Quantization Configuration (QuantizationConfig & LayerQuantConfig)

Defined in atom/config.py. The quantization system uses two classes:

  • QuantizationConfig – the top-level orchestrator that holds a global config, per-layer overrides, and exclusion lists. It is not a dict subclass.

  • LayerQuantConfig(dict) – a dict subclass that stores the concrete quantization parameters for a single layer (or as the global default).

3.1 LayerQuantConfig Fields

LayerQuantConfig extends dict. Fields are stored and accessed as dictionary keys (e.g., cfg["quant_type"]).

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| quant_type | QuantType | QuantType.No | Quantization granularity (see below) |
| quant_dtype | torch.dtype | torch.bfloat16 | Data type for quantized weights |
| is_dynamic | bool | True | Use dynamic quantization (scales computed at runtime) |
| quant_method | str | "" | Quantization method (e.g., "quark", "compressed-tensors") |

3.2 QuantizationConfig Attributes

| Attribute | Type | Description |
| --- | --- | --- |
| torch_dtype | torch.dtype | The model’s default dtype (from hf_config.torch_dtype) |
| hf_quant_config | dict \| None | Raw quantization_config dict from the HuggingFace config |
| global_quant_config | LayerQuantConfig | Default quantization config applied to all layers |
| layer_quant_config | dict[str, LayerQuantConfig] | Per-layer overrides keyed by layer name pattern (supports fnmatch globs like "*.mlp.*") |
| exclude_layers | list[str] | Layer names excluded from quantization (supports exact match and "re:" regex prefix) |
| quant_method | str | Top-level quantization method name (e.g., "quark", "compressed-tensors") |

Key methods:

| Method | Description |
| --- | --- |
| get_name() | Returns the quantization method name |
| get_layer_quant_config(layer_name) | Returns the LayerQuantConfig for a layer: checks exclusions first, then per-layer overrides, then falls back to the global config |
| should_ignore_layer_quant(layer_name) | Returns True if the layer is in the exclusion list |
| remap_layer_name(hf_config, packed_modules_mapping) | Remaps layer names for packed/fused modules (e.g., q_a_proj → fused_qkv_a_proj for DeepSeek) |
| compute_hash() | Returns a SHA-256 hash of the quantization config for cache invalidation |
| parse_quark_config_dict(config) | Parses a quark-format config dict into a LayerQuantConfig |
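The lookup order of get_layer_quant_config (exclusions, then per-layer globs, then the global default) can be sketched with a standalone helper. The function and argument names here are hypothetical; ATOM's real implementation lives in atom/config.py:

```python
import fnmatch
import re

def resolve_layer_quant(layer_name, global_cfg, layer_overrides, exclude_layers):
    """Illustrative sketch of the documented lookup order."""
    # 1. Exclusions first: exact names, or patterns with a "re:" regex prefix.
    for pattern in exclude_layers:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], layer_name):
                return None        # excluded: layer stays unquantized
        elif pattern == layer_name:
            return None
    # 2. Per-layer overrides, keyed by fnmatch-style glob patterns.
    for pattern, cfg in layer_overrides.items():
        if fnmatch.fnmatch(layer_name, pattern):
            return cfg
    # 3. Fall back to the global default config.
    return global_cfg

cfg = resolve_layer_quant(
    "model.layers.0.mlp.gate_proj",
    global_cfg={"quant_dtype": "fp8"},
    layer_overrides={"*.mlp.*": {"quant_dtype": "fp4"}},
    exclude_layers=["lm_head"],
)
print(cfg)  # {'quant_dtype': 'fp4'} - the MLP glob wins over the global default
```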

3.3 QuantType Values (from AITER)

| Value | Description |
| --- | --- |
| QuantType.No | No quantization |
| QuantType.per_Token | Per-token / per-channel quantization |
| QuantType.per_1x128 | Block quantization with group size 128 |
| QuantType.per_1x32 | Block quantization with group size 32 |
| QuantType.per_128x128 | Large 2D block quantization (remapped to per_1x128 in MoE kernels) |
| QuantType.per_Tensor | Per-tensor quantization |

3.4 Supported Quantization Dtypes

| Dtype | AITER Key | Notes |
| --- | --- | --- |
| FP8 (E4M3) | "fp8" | 8-bit floating point |
| MXFP4 | "fp4x2" | Microscaling FP4; forces QuantType.per_1x32 |
| INT8 | "i8" | 8-bit integer |
| INT4 | "i4x2" | 4-bit integer (packed) |

3.5 Auto-Detection from HuggingFace

During Config.__post_init__, ATOM constructs QuantizationConfig(hf_config) which reads hf_config.quantization_config and automatically determines quantization parameters:

For quark models (quant_method == "quark"):

  1. Parses global_quant_config dict via parse_quark_config_dict() to produce the global LayerQuantConfig.

  2. Parses each entry in layer_quant_config dict to produce per-layer overrides.

  3. Reads the "exclude" list for excluded layers.

  4. Within each config dict, weight.qscheme determines quant_type ("per_channel" → per_Token, "per_tensor" → per_Tensor, "per_group" → per_1x32), and weight.dtype determines quant_dtype.

  5. input_tensors.is_dynamic controls dynamic quantization (defaults to True if absent).
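The quark parsing steps above can be sketched as a small standalone function. Strings stand in for the AITER QuantType enum, and the dict shape is simplified; this is illustrative, not ATOM's actual parser:

```python
# Documented qscheme -> QuantType mapping (string stand-ins for the enum).
QSCHEME_TO_QUANT_TYPE = {
    "per_channel": "per_Token",
    "per_tensor": "per_Tensor",
    "per_group": "per_1x32",
}

def parse_quark_weight(cfg: dict) -> dict:
    """Illustrative parse of one quark-format layer config dict."""
    weight = cfg.get("weight", {})
    input_tensors = cfg.get("input_tensors") or {}
    return {
        "quant_type": QSCHEME_TO_QUANT_TYPE.get(weight.get("qscheme"), "No"),
        "quant_dtype": weight.get("dtype"),
        # input_tensors.is_dynamic controls dynamic quantization,
        # defaulting to True when absent (per the steps above).
        "is_dynamic": input_tensors.get("is_dynamic", True),
    }
```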

For other models (compressed-tensors, etc.):

  1. If quant_method == "compressed-tensors" or channel quantization is detected, sets per_Token.

  2. If weight_block_size or group_size is found: group size 128 maps to per_1x128, group size 32 maps to per_1x32.

  3. Otherwise falls back to per_Tensor.

  4. The dtype is parsed from fields like dtype, weight_dtype, or quant_method looking for fp8, fp4, mxfp4, int8, int4, or num_bits.

  5. If activation_scheme is "static", is_dynamic is set to False.

  6. Excluded layers are read from the "ignore" key.

3.6 Layer-Level Quantization Dispatch

Linear layers, MoE layers, and fused ops call quant_config.get_layer_quant_config(prefix) to obtain the appropriate LayerQuantConfig for their position in the model. This enables mixed-precision quantization where different layers can have different quant types and dtypes (e.g., FP8 for attention, FP4 for MLP).


4. Parallel Configuration (ParallelConfig)

Defined in atom/config.py. Controls data parallelism. Environment variables (Section 8) override defaults when set.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| data_parallel_size | int | 1 | Number of data-parallel groups; overridden by the ATOM_DP_SIZE env var |
| data_parallel_size_local | int | 1 | Number of local data-parallel groups |
| data_parallel_rank | int | 0 | Rank within the data-parallel group; overridden by ATOM_DP_RANK |
| data_parallel_rank_local | Optional[int] | None | Local rank within the data-parallel group (SPMD mode); overridden by ATOM_DP_RANK_LOCAL |
| data_parallel_master_port | int | 29500 | Port used by the data-parallel master for process group initialization |
| data_parallel_base_port | int | get_open_port() | Base port for data-parallel communication (dynamically assigned) |
| data_parallel_master_ip | str | "127.0.0.1" | IP address of the data-parallel master |

Computed properties:

  • world_size – set during init, equals TP x PP.

  • world_size_across_dp – equals world_size * data_parallel_size.
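With concrete (illustrative, non-default) values, the world-size arithmetic works out as:

```python
# Illustrative deployment: 8-way tensor parallel, no pipeline parallel,
# 2 data-parallel groups. These are example values, not ATOM defaults.
tensor_parallel_size = 8
pipeline_parallel_size = 1
data_parallel_size = 2

world_size = tensor_parallel_size * pipeline_parallel_size       # GPUs per DP group
world_size_across_dp = world_size * data_parallel_size           # total GPUs

print(world_size, world_size_across_dp)  # 8 16
```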


5. Speculative Decoding Configuration (SpeculativeConfig)

Defined in atom/config.py. Currently only the Multi-Token Prediction (MTP) method with num_speculative_tokens=1 is supported.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| method | Optional[str] | "" | Speculative decoding method; currently only "mtp" is accepted |
| model | Optional[str] | None | Draft model name or path (typically the same as the target model for MTP) |
| num_speculative_tokens | Optional[int] | None | Number of speculative tokens per iteration; must be 1 |
| draft_model_hf_config | Optional[PretrainedConfig] | None | HuggingFace config for the draft model; auto-loaded from model when None |

Post-init behaviour:

  • Loads draft_model_hf_config from model if not provided.

  • For DeepSeek V3 / MTP models: overrides model_type to "deepseek_mtp", sets n_predict=1 and num_nextn_predict_layers=1, and switches architectures to ["DeepSeekMTPModel"].

  • Config.__post_init__ raises ValueError if num_speculative_tokens != 1.


6. Sampling Parameters (SamplingParams)

Defined in atom/sampling_params.py. Passed per-request to control generation.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| temperature | float | 1.0 | Sampling temperature; lower values make output more deterministic |
| max_tokens | int | 64 | Maximum number of tokens to generate |
| ignore_eos | bool | False | Continue generating past the EOS token |
| stop_strings | Optional[list[str]] | None | List of strings that trigger generation to stop |
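A minimal sketch of how these fields combine per request. The dataclass below is a local stand-in that mirrors the table; the real class is defined in atom/sampling_params.py:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SamplingParams:
    # Defaults mirror the table above (stand-in, not the real class).
    temperature: float = 1.0
    max_tokens: int = 64
    ignore_eos: bool = False
    stop_strings: Optional[list] = None

# Near-deterministic decoding that stops at a custom delimiter:
params = SamplingParams(temperature=0.2, max_tokens=256, stop_strings=["###"])
```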


7. CLI Arguments (EngineArgs)

Defined in atom/model_engine/arg_utils.py. The EngineArgs dataclass exposes all flags via add_cli_args() and converts them into a Config via create_engine().

| Flag | Short | Type | Default | Description |
| --- | --- | --- | --- | --- |
| --model | | str | "Qwen/Qwen3-0.6B" | Model name or path |
| --trust-remote-code | | flag | False | Trust remote code when loading the model |
| --tensor-parallel-size | -tp | int | 1 | Tensor parallel size |
| --data-parallel-size | -dp | int | 1 | Data parallel size |
| --enforce-eager | | flag | False | Enforce eager mode execution |
| --enable_prefix_caching | | flag | False | Enable prefix caching |
| --port | | int | 8006 | Engine internal port |
| --kv_cache_dtype | | str | "bf16" | KV cache dtype; choices: bf16, fp8 |
| --block-size | | int | 16 | KV cache block size (maps to kv_cache_block_size) |
| --max-model-len | | int | None | Maximum model context length; defaults to hf_config.max_position_embeddings |
| --cudagraph-capture-sizes | | str | "[1,2,4,8,16,32,48,64,128,256]" | CUDA graph capture sizes as a Python list string |
| --level | | int | 3 | Compilation level (0 – 3) |
| --load_dummy | | flag | False | Skip loading model weights |
| --enable-expert-parallel | | flag | False | Enable Expert Parallelism (EP MoE) |
| --torch-profiler-dir | | str | None | Directory for torch profiler traces |
| --enable-dp-attention | | flag | False | Enable DP attention |
| --method | | str | None | Speculative method; choices: mtp |
| --num-speculative-tokens | | int | 1 | Number of speculative tokens per iteration |
| --max-num-batched-tokens | | int | 16384 | Maximum number of tokens to batch in the async engine |
| --max-num-seqs | | int | 512 | Maximum number of sequences to batch together |
| --gpu-memory-utilization | | float | 0.9 | Fraction of GPU memory to use (0.0 – 1.0) |
| --scheduler-delay-factor | | float | 0.0 | Delay factor multiplied by previous prompt latency before scheduling the next prompt |

Example:

```shell
python -m atom.entrypoint \
    --model deepseek-ai/DeepSeek-R1 \
    --tensor-parallel-size 8 \
    --level 3 \
    --cudagraph-capture-sizes "[1,2,4,8,16,32,64,128,256]" \
    --kv_cache_dtype fp8 \
    --gpu-memory-utilization 0.92 \
    --max-num-seqs 256
```

8. Environment Variables

8.1 Variables Registered in atom/utils/envs.py

All variables use lazy evaluation. Boolean variables treat "1" as True and anything else (including unset) as False, unless noted otherwise.

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| ATOM_DP_RANK | int | 0 | Data-parallel rank of this process |
| ATOM_DP_RANK_LOCAL | int | 0 | Local data-parallel rank (for SPMD mode) |
| ATOM_DP_SIZE | int | 1 | Total number of data-parallel groups |
| ATOM_DP_MASTER_IP | str | "127.0.0.1" | IP address of the data-parallel master |
| ATOM_DP_MASTER_PORT | int | 29500 | Port of the data-parallel master |
| ~~ATOM_ENFORCE_EAGER~~ | – | – | Removed; use the CLI flag --enforce-eager instead |
| ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION | bool | False | Enable QK-norm + RoPE + cache + quant fusion; enable for Qwen3-MoE models |
| ATOM_USE_TRITON_GEMM | bool | False | Use Triton-based GEMM kernels instead of the default backends |
| ATOM_USE_TRITON_MXFP4_BMM | bool | False | Use Triton-based MXFP4 batched matrix multiply |
| ATOM_ENABLE_DS_INPUT_RMSNORM_QUANT_FUSION | bool | True | Enable fused input RMSNorm + quantization for DeepSeek models |
| ATOM_ENABLE_DS_QKNORM_QUANT_FUSION | bool | True | Enable fused QK-norm + quantization for DeepSeek models |
| ATOM_ENABLE_ALLREDUCE_RMSNORM_FUSION | bool | True | Enable fused all-reduce + RMSNorm kernel |
| ATOM_LLAMA_ENABLE_AITER_TRITON_FUSED_RMSNORM_QUANT | bool | True | Enable AITER Triton fused RMSNorm + quantization for LLaMA models |
| ATOM_LLAMA_ENABLE_AITER_TRITON_FUSED_SILU_MUL_QUANT | bool | True | Enable AITER Triton fused SiLU + multiply + quantization for LLaMA models |
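The boolean convention described above (only the string "1" is truthy; unset falls back to the variable's default) can be sketched as a small helper. This is an illustrative stand-in, not the actual code in atom/utils/envs.py:

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    # Sketch of the documented convention: an unset variable returns the
    # registered default; a set variable is True only when it equals "1".
    val = os.environ.get(name)
    if val is None:
        return default
    return val == "1"
```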

8.2 Additional Environment Variables (Used Outside envs.py)

| Variable | Type | Default | Where Used | Description |
| --- | --- | --- | --- | --- |
| ATOM_TORCH_PROFILER_DIR | str | None | atom/config.py (Config.torch_profiler_dir) | Directory for PyTorch profiler output; sets the default for Config.torch_profiler_dir |
| ATOM_PROFILER_MORE | str | "0" | atom/model_engine/model_runner.py | Set to "1" to enable detailed profiling (record_shapes, with_stack, profile_memory) |
| HF_TOKEN | str | None | atom/config.py (get_hf_config) | HuggingFace authentication token for gated model downloads |


9. Decision Tree – Choosing a Compilation Level

Start
  |
  v
Is this a debugging / development run?
  |-- Yes --> Level 0 (NO_COMPILATION) or --enforce-eager
  |
  v
Do you need torch.compile but no graph splitting?
  |-- Yes, one-shot compile --> Level 2 (DYNAMO_ONCE)
  |-- Yes, keep Dynamo default --> Level 1 (DYNAMO_AS_IS)
  |
  v
Production inference on ROCm/HIP GPU?
  |-- Yes --> Level 3 (PIECEWISE) [default in EngineArgs]
              - Auto-sets CUDAGraphMode.PIECEWISE
              - Auto-populates splitting_ops for attention ops
              - Pair with --cudagraph-capture-sizes for your batch profile
  |
  v
Need maximum decode throughput?
  |-- Yes --> Level 3 + set cudagraph_mode to FULL_AND_PIECEWISE
              (full graphs for decode, piecewise for prefill)

Rules of thumb:

  • Level 3 is the default for EngineArgs and is recommended for most production workloads.

  • Level 0 / --enforce-eager is useful for debugging, profiling, or when CUDA graphs are incompatible with your model.

  • Match --cudagraph-capture-sizes to your expected batch sizes for optimal memory usage and launch latency.

  • When using --enable-dp-attention or Expert Parallelism (--enable-expert-parallel), level 3 is still recommended.


Source Files

| File | Description |
| --- | --- |
| atom/config.py | Config, CompilationConfig, CompilationLevel, CUDAGraphMode, LayerQuantConfig, QuantizationConfig, ParallelConfig, SpeculativeConfig, KVCacheTensor, KVCacheConfig, get_hf_config |
| atom/utils/envs.py | All ATOM_* environment variable definitions with lazy evaluation |
| atom/model_engine/arg_utils.py | EngineArgs dataclass and CLI argument parser |
| atom/sampling_params.py | SamplingParams dataclass |
| atom/model_engine/model_runner.py | Uses ATOM_PROFILER_MORE and ATOM_TORCH_PROFILER_DIR for profiling |