ATOM Configuration Guide
ATOM (AiTer Optimized Model) is AMD’s lightweight LLM inference engine built on AITER kernels for ROCm/HIP GPUs. This guide documents every configuration class, CLI flag, and environment variable that controls ATOM’s runtime behaviour.
Quick Reference
| Config Class | Primary Purpose |
|---|---|
| `Config` | Master dataclass – model path, memory, TP size, scheduler limits, KV cache, profiler, and references to all sub-configs |
| `CompilationConfig` | Compilation level (0-3), CUDA graph capture sizes, piecewise splitting ops, inductor settings |
| `CompilationLevel` | Integer constants for the four compilation levels |
| `CUDAGraphMode` | Enum controlling how CUDA graphs are captured (none / piecewise / full / hybrid) |
| `QuantizationConfig` | Layer-wise quantization orchestrator: global config, per-layer overrides, exclude lists, layer name remapping |
| `LayerQuantConfig` | Per-layer quantization parameters: quant type, dtype, dynamic flag, method |
| `ParallelConfig` | Data-parallel size, rank, master IP/port |
| `SpeculativeConfig` | Speculative decoding method, draft model, number of speculative tokens |
| | Per-layer KV cache tensor descriptors (k/v caches and scales) |
| `SamplingParams` | Temperature, max tokens, stop strings, ignore-EOS flag |
| `EngineArgs` | CLI argument parser that builds a `Config` |
1. Master Configuration (Config)
Defined in `atom/config.py`. The root dataclass that the engine consumes.
| Field | Type | Default | Description |
|---|---|---|---|
| | | (required) | HuggingFace model name or local path |
| | | | Trust remote code when loading the model from HuggingFace |
| | | | Maximum number of tokens batched together per scheduler step |
| | | | Multiplicative delay (factor × previous prompt latency) before scheduling the next prompt |
| | | | Maximum number of sequences batched together |
| | | | Maximum context length; defaults to … |
| | | | Fraction of GPU memory available for KV cache and weights (0.0 – 1.0) |
| | | | Number of tensor-parallel GPUs (1 – 8) |
| | | | Disable compilation and CUDA graphs; run in eager mode |
| | | | Data-parallel configuration (see Section 4) |
| | | | Block size for paged KV cache; must be a multiple of 16 or exactly 1 |
| | | | Number of KV cache blocks … |
| | | | KV cache data type … |
| | | | Enable prefix caching to reuse KV blocks across requests sharing the same prefix |
| | | | Engine internal communication port |
| | | | Directory for saving PyTorch profiler traces; created if it does not exist |
| | | | Compilation and CUDA graph settings (see Section 2) |
| | | (auto-detected) | Quantization settings; auto-detected from the HuggingFace config during `Config.__post_init__` |
| | | | Enable the asyncio-based engine loop |
| | | | Skip loading model weights (for benchmarking / testing) |
| | | | Enable Expert Parallelism for MoE models |
| | | | Master address for distributed communication |
| | | | Explicit list of batch sizes for CUDA graph capture; derived from … |
| | | | Enable data-parallel attention |
| | | (computed) | Inferred from … |
| | | | Speculative decoding configuration (see Section 5) |
| | | | Beginning-of-sequence token ID … |
| | | | End-of-sequence token ID … |
| | | | Additional stop token IDs; populated from … |
Auto-derived fields (set in `__post_init__`, not user-supplied):

| Field | Type | Description |
|---|---|---|
| | | Loaded automatically via … |
| | | Loaded automatically via … |
2. Compilation Configuration (CompilationConfig)
Defined in atom/config.py. Controls torch.compile and CUDA graph behaviour.
2.1 Compilation Levels (CompilationLevel)
| Constant | Value | Description |
|---|---|---|
| `NO_COMPILATION` | 0 | No compilation – pure eager execution |
| `DYNAMO_AS_IS` | 1 | Use torch.compile / TorchDynamo as-is |
| `DYNAMO_ONCE` | 2 | TorchDynamo with a single compilation pass |
| `PIECEWISE` | 3 | Piecewise compilation with CUDA graph capture (recommended for production) |
2.2 CompilationConfig Fields
| Field | Type | Default | Description |
|---|---|---|---|
| | | | Compilation level (see table above); must be 0 – 3 |
| | | | Whether to use CUDA graphs |
| | | | Explicit list of batch sizes for CUDA graph capture; overrides … |
| | | | CUDA graph sizing strategy: 1 value generates … |
| | | | Path to dump debug / compilation information |
| | | | Directory for compilation caches |
| | | | Enable TorchInductor backend |
| `cudagraph_mode` | | | CUDA graph capture mode (see below); set to … |
| `splitting_ops` | | | Ops that split the graph into sub-graphs for piecewise compilation; auto-populated at level 3 with attention ops |
| | | | Copy input tensors into internally managed buffers before CUDA graph replay; only effective in PIECEWISE mode |
| | | | Sizes to compile for inductor; accepts integers and the string … |
| | | | Additional configuration passed to the inductor backend |
2.3 CUDA Graph Mode (CUDAGraphMode)
| Mode | Value | Description |
|---|---|---|
| | | No CUDA graph capture |
| `PIECEWISE` | | Piecewise CUDA graphs – attention ops stay outside the graph for flexibility (default at level 3) |
| | | Full CUDA graph capture for all batches; best for small models / short prompts |
| | | Full CUDA graphs for decode batches only; mixed prefill-decode runs without graphs (useful in P/D setups) |
| `FULL_AND_PIECEWISE` | | Full graphs for decode, piecewise for prefill/mixed – most performant mode for most models |
Helper methods on `CUDAGraphMode`:

- `decode_mode()` – returns the mode used for pure decode batches.
- `mixed_mode()` – returns the mode used for mixed prefill-decode batches.
- `requires_piecewise_compilation()` – whether the mode needs piecewise compilation.
- `has_full_cudagraphs()` – whether the mode includes full CUDA graph capture.
- `separate_routine()` – whether decode and mixed batches use different routines.
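The helper semantics can be sketched as a self-contained enum. Only `PIECEWISE` and `FULL_AND_PIECEWISE` are named in this guide; the other member names, the integer values, and the exact method bodies are assumptions that merely reproduce the behaviour the mode table describes.

```python
from enum import Enum

# Illustrative sketch of CUDAGraphMode. Member names other than PIECEWISE
# and FULL_AND_PIECEWISE, the values, and method bodies are assumptions.
class CUDAGraphMode(Enum):
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    FULL_DECODE_ONLY = 3
    FULL_AND_PIECEWISE = 4

    def decode_mode(self):
        # Pure-decode batches use full graphs in every FULL* variant.
        full = (CUDAGraphMode.FULL, CUDAGraphMode.FULL_DECODE_ONLY,
                CUDAGraphMode.FULL_AND_PIECEWISE)
        return CUDAGraphMode.FULL if self in full else self

    def mixed_mode(self):
        # Mixed prefill-decode batches fall back to piecewise or no graphs.
        if self == CUDAGraphMode.FULL_AND_PIECEWISE:
            return CUDAGraphMode.PIECEWISE
        if self == CUDAGraphMode.FULL_DECODE_ONLY:
            return CUDAGraphMode.NONE
        return self

    def requires_piecewise_compilation(self):
        return CUDAGraphMode.PIECEWISE in (self.decode_mode(), self.mixed_mode())

    def has_full_cudagraphs(self):
        return self.decode_mode() == CUDAGraphMode.FULL

    def separate_routine(self):
        return self.decode_mode() != self.mixed_mode()
```

For example, `FULL_AND_PIECEWISE` reports full graphs for decode, piecewise for mixed batches, and therefore two separate routines.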
3. Quantization Configuration (QuantizationConfig & LayerQuantConfig)
Defined in `atom/config.py`. The quantization system uses two classes:

- `QuantizationConfig` – the top-level orchestrator that holds a global config, per-layer overrides, and exclusion lists. It is not a `dict` subclass.
- `LayerQuantConfig(dict)` – a `dict` subclass that stores the concrete quantization parameters for a single layer (or as the global default).
3.1 LayerQuantConfig Fields
`LayerQuantConfig` extends `dict`. Fields are stored and accessed as dictionary keys (e.g., `cfg["quant_type"]`).
| Key | Type | Default | Description |
|---|---|---|---|
| `quant_type` | | | Quantization granularity (see below) |
| `quant_dtype` | | | Data type for quantized weights |
| `is_dynamic` | | | Use dynamic quantization (scales computed at runtime) |
| | | | Quantization method (e.g., …) |
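A minimal sketch of the dict-subclass behaviour: parameters are stored and read as ordinary dictionary keys. Only `quant_type`, `quant_dtype`, and `is_dynamic` are named in this guide; the defaults chosen below are assumptions.

```python
# Illustrative LayerQuantConfig sketch: a dict subclass whose keys hold the
# per-layer parameters. Defaults here are assumptions, not ATOM's.
class LayerQuantConfig(dict):
    def __init__(self, **kwargs):
        params = {"quant_type": None, "quant_dtype": None, "is_dynamic": True}
        params.update(kwargs)
        super().__init__(params)

cfg = LayerQuantConfig(quant_type="per_Token", quant_dtype="fp8")
assert cfg["quant_type"] == "per_Token"  # dict-style access, as in the guide
```

Because it is a plain `dict`, a layer config serializes and compares like any mapping, which keeps per-layer overrides easy to merge.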
3.2 QuantizationConfig Attributes
| Attribute | Type | Description |
|---|---|---|
| | | The model’s default dtype (from …) |
| | | Raw … |
| | | Default quantization config applied to all layers |
| | | Per-layer overrides keyed by layer name pattern (supports fnmatch globs like …) |
| | | Layer names excluded from quantization (supports exact match and …) |
| | | Top-level quantization method name (e.g., …) |
Key methods:
| Method | Description |
|---|---|
| | Returns the quantization method name |
| | Returns the … |
| | Returns … |
| | Remaps layer names for packed/fused modules (e.g., …) |
| | Returns a SHA-256 hash of the quantization config for cache invalidation |
| `parse_quark_config_dict()` | Parses a quark-format config dict into a `LayerQuantConfig` |
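The cache-invalidation hash can be illustrated generically: serialize the config deterministically, then hash it. The guide only states that a SHA-256 hash is returned; the serialization below (sorted-key JSON) is an assumption, not ATOM's actual implementation.

```python
import hashlib
import json

# Illustrative cache-invalidation hash: canonicalize the quantization config
# as sorted-key JSON, then SHA-256 it. The real serialization may differ.
def quant_config_hash(config: dict) -> str:
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Sorting the keys makes the hash insensitive to dict ordering, so two logically identical configs always produce the same cache key.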
3.3 QuantType Values (from AITER)
| Value | Description |
|---|---|
| | No quantization |
| `per_Token` | Per-token / per-channel quantization |
| `per_1x128` | Block quantization with group size 128 |
| `per_1x32` | Block quantization with group size 32 |
| | Large 2D block quantization (remapped to …) |
| `per_Tensor` | Per-tensor quantization |
3.4 Supported Quantization Dtypes
| Dtype | AITER Key | Notes |
|---|---|---|
| FP8 (E4M3) | | 8-bit floating point |
| MXFP4 | | Microscaling FP4; forces … |
| INT8 | | 8-bit integer |
| INT4 | | 4-bit integer (packed) |
3.5 Auto-Detection from HuggingFace
During `Config.__post_init__`, ATOM constructs `QuantizationConfig(hf_config)`, which reads `hf_config.quantization_config` and automatically determines quantization parameters.
For quark models (`quant_method == "quark"`):

- Parses the `global_quant_config` dict via `parse_quark_config_dict()` to produce the global `LayerQuantConfig`.
- Parses each entry in the `layer_quant_config` dict to produce per-layer overrides.
- Reads the `"exclude"` list for excluded layers.
- Within each config dict, `weight.qscheme` determines `quant_type` (`"per_channel"` → `per_Token`, `"per_tensor"` → `per_Tensor`, `"per_group"` → `per_1x32`) and `weight.dtype` determines `quant_dtype`; `input_tensors.is_dynamic` controls dynamic quantization (defaults to `True` if absent).
For other models (compressed-tensors, etc.):

- If `quant_method == "compressed-tensors"` or channel quantization is detected, sets `per_Token`.
- If `weight_block_size` or `group_size` is found, group size 128 maps to `per_1x128` and group size 32 maps to `per_1x32`.
- Otherwise falls back to `per_Tensor`.
- The dtype is parsed from fields such as `dtype`, `weight_dtype`, or `quant_method`, looking for `fp8`, `fp4`, `mxfp4`, `int8`, `int4`, or `num_bits`.
- If `activation_scheme` is `"static"`, `is_dynamic` is set to `False`.
- Excluded layers are read from the `"ignore"` key.
3.6 Layer-Level Quantization Dispatch
Linear layers, MoE layers, and fused ops call `quant_config.get_layer_quant_config(prefix)` to obtain the appropriate `LayerQuantConfig` for their position in the model. This enables mixed-precision quantization, where different layers can have different quant types and dtypes (e.g., FP8 for attention, FP4 for MLP).
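The resolution order implied above (exclusions first, then per-layer glob overrides, then the global default) can be sketched with `fnmatch`. The helper name, signature, and example layer names are illustrative, not ATOM's actual code.

```python
from fnmatch import fnmatch

# Illustrative per-layer dispatch: exclusions win, then the first matching
# glob override, then the global default. Names here are hypothetical.
def resolve_layer_config(prefix, global_cfg, layer_overrides, exclude):
    if any(prefix == pat or fnmatch(prefix, pat) for pat in exclude):
        return None  # excluded layers stay unquantized
    for pattern, cfg in layer_overrides.items():
        if fnmatch(prefix, pattern):
            return cfg
    return global_cfg

overrides = {"model.layers.*.mlp.*": {"quant_dtype": "fp4"}}
mlp_cfg = resolve_layer_config("model.layers.3.mlp.gate_proj",
                               {"quant_dtype": "fp8"}, overrides, ["lm_head"])
```

Here the MLP projection picks up the FP4 override while attention layers would fall through to the FP8 global default, matching the mixed-precision example in the text.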
4. Parallel Configuration (ParallelConfig)
Defined in atom/config.py. Controls data parallelism. Environment variables
(Section 8) override defaults when set.
| Field | Type | Default | Description |
|---|---|---|---|
| `data_parallel_size` | | | Number of data-parallel groups; overridden by … |
| | | | Number of local data-parallel groups |
| | | | Rank within the data-parallel group; overridden by … |
| | | | Local rank within the data-parallel group (SPMD mode); overridden by … |
| | | | Port used by the data-parallel master for process group initialization |
| | | | Base port for data-parallel communication (dynamically assigned) |
| | | | IP address of the data-parallel master |
Computed properties:

- `world_size` – set during init; equals TP × PP.
- `world_size_across_dp` – `world_size * data_parallel_size`.
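The derived sizes can be sketched as properties on a small dataclass. Only the `world_size_across_dp` formula is stated in this guide; the `tp_size`/`pp_size` field names are assumptions.

```python
from dataclasses import dataclass

# Sketch of the derived parallel sizes. tp_size/pp_size names are assumed;
# world_size_across_dp follows the formula given in the guide.
@dataclass
class MiniParallelConfig:
    data_parallel_size: int = 1
    tp_size: int = 1
    pp_size: int = 1

    @property
    def world_size(self) -> int:
        return self.tp_size * self.pp_size  # TP x PP

    @property
    def world_size_across_dp(self) -> int:
        return self.world_size * self.data_parallel_size

p = MiniParallelConfig(data_parallel_size=2, tp_size=8)
```

With TP=8 and DP=2, each DP replica spans 8 GPUs and the full deployment spans 16.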
5. Speculative Decoding Configuration (SpeculativeConfig)
Defined in `atom/config.py`. Currently only the Multi-Token Prediction (MTP) method with `num_speculative_tokens=1` is supported.
| Field | Type | Default | Description |
|---|---|---|---|
| | | | Speculative decoding method; currently only MTP is supported |
| | | | Draft model name or path (typically the same as the target model for MTP) |
| `num_speculative_tokens` | | | Number of speculative tokens per iteration; must be 1 |
| `draft_model_hf_config` | | | HuggingFace config for the draft model; auto-loaded from `model` |
Post-init behaviour:

- Loads `draft_model_hf_config` from `model` if not provided.
- For DeepSeek V3 / MTP models: overrides `model_type` to `"deepseek_mtp"`, sets `n_predict=1` and `num_nextn_predict_layers=1`, and switches the architectures list to `["DeepSeekMTPModel"]`.
- `Config.__post_init__` raises `ValueError` if `num_speculative_tokens != 1`.
6. Sampling Parameters (SamplingParams)
Defined in atom/sampling_params.py. Passed per-request to control generation.
| Field | Type | Default | Description |
|---|---|---|---|
| | | | Sampling temperature; lower values make output more deterministic |
| | | | Maximum number of tokens to generate |
| | | | Continue generating past the EOS token |
| | | | List of strings that trigger generation to stop |
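A per-request parameter object of this shape can be sketched as a dataclass. The field names and defaults below are assumptions (the guide's identifier column was lost), chosen to match the four behaviours the table describes.

```python
from dataclasses import dataclass, field

# Illustrative SamplingParams sketch; field names and defaults are assumed.
@dataclass
class MiniSamplingParams:
    temperature: float = 1.0        # lower => more deterministic
    max_tokens: int = 16            # generation length cap
    ignore_eos: bool = False        # keep generating past EOS
    stop: list[str] = field(default_factory=list)  # stop strings

# A greedy request capped at 128 tokens that stops on a sentinel string.
greedy = MiniSamplingParams(temperature=0.0, max_tokens=128, stop=["</s>"])
```

Using `default_factory=list` avoids the classic mutable-default pitfall, so each request gets its own stop list.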
7. CLI Arguments (EngineArgs)
Defined in `atom/model_engine/arg_utils.py`. The `EngineArgs` dataclass exposes all flags via `add_cli_args()` and converts them into a `Config` via `create_engine()`.
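The dataclass-to-argparse pattern can be sketched as follows. Only the `EngineArgs`, `add_cli_args()`, and `create_engine()` names come from this guide; the trimmed-down flag set, the `from_cli_args` helper, and all signatures below are assumptions for illustration.

```python
import argparse
from dataclasses import dataclass

# Illustrative EngineArgs pattern: a dataclass whose fields are exposed as
# argparse flags. Flag set trimmed to two examples; helper names assumed.
@dataclass
class MiniEngineArgs:
    model: str = ""
    tensor_parallel_size: int = 1

    @staticmethod
    def add_cli_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
        parser.add_argument("--model", type=str, required=True)
        parser.add_argument("--tensor-parallel-size", type=int, default=1)
        return parser

    @classmethod
    def from_cli_args(cls, args: argparse.Namespace) -> "MiniEngineArgs":
        return cls(model=args.model,
                   tensor_parallel_size=args.tensor_parallel_size)

parser = MiniEngineArgs.add_cli_args(argparse.ArgumentParser())
engine_args = MiniEngineArgs.from_cli_args(
    parser.parse_args(["--model", "deepseek-ai/DeepSeek-R1",
                       "--tensor-parallel-size", "8"]))
```

In ATOM the parsed arguments are then turned into a full `Config` by `create_engine()`.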
| Flag | Short | Type | Default | Description |
|---|---|---|---|---|
| `--model` | | | | Model name or path |
| | | flag | | Trust remote code when loading model |
| `--tensor-parallel-size` | | | | Tensor parallel size |
| | | | | Data parallel size |
| `--enforce-eager` | | flag | | Enforce eager mode execution |
| | | flag | | Enable prefix caching |
| | | | | Engine internal port |
| `--kv_cache_dtype` | | | | KV cache dtype; choices: … |
| | | | | KV cache block size (maps to …) |
| | | | | Maximum model context length; defaults to … |
| `--cudagraph-capture-sizes` | | | | CUDA graph capture sizes as a Python list string |
| `--level` | | | | Compilation level (0 – 3) |
| | | flag | | Skip loading model weights |
| `--enable-expert-parallel` | | flag | | Enable Expert Parallelism (EP MoE) |
| | | | | Directory for torch profiler traces |
| `--enable-dp-attention` | | flag | | Enable DP attention |
| | | | | Speculative method; choices: … |
| | | | | Number of speculative tokens per iteration |
| | | | | Maximum number of tokens to batch in the async engine |
| `--max-num-seqs` | | | | Maximum number of sequences to batch together |
| `--gpu-memory-utilization` | | | | Fraction of GPU memory to use (0.0 – 1.0) |
| | | | | Delay factor multiplied by previous prompt latency before scheduling next prompt |
Example:

```shell
python -m atom.entrypoint \
    --model deepseek-ai/DeepSeek-R1 \
    --tensor-parallel-size 8 \
    --level 3 \
    --cudagraph-capture-sizes "[1,2,4,8,16,32,64,128,256]" \
    --kv_cache_dtype fp8 \
    --gpu-memory-utilization 0.92 \
    --max-num-seqs 256
```
8. Environment Variables
8.1 Variables Registered in atom/utils/envs.py
All variables are evaluated lazily. Boolean variables treat `"1"` as `True` and anything else (including unset) as `False`, unless noted otherwise.
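The lazy-evaluation scheme can be sketched as a table of zero-argument lambdas consulted on access. The `EXAMPLE_*` variable names below are placeholders, not ATOM's real variables; only the `"1"`-means-`True` convention comes from the guide.

```python
import os

# Illustrative lazy env-var registry: each entry is a thunk evaluated at
# access time, not import time. Variable names here are placeholders.
_ENV = {
    "EXAMPLE_BOOL_VAR": lambda: os.environ.get("EXAMPLE_BOOL_VAR") == "1",
    "EXAMPLE_INT_VAR": lambda: int(os.environ.get("EXAMPLE_INT_VAR", "0")),
}

def get_env(name):
    # Re-reads the environment on every call, so changes made after import
    # are still picked up.
    return _ENV[name]()

os.environ["EXAMPLE_BOOL_VAR"] = "1"
```

Because the thunks run at call time, setting a variable after the module is imported still takes effect, which is the point of lazy evaluation here.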
| Variable | Type | Default | Description |
|---|---|---|---|
| | | | Data-parallel rank of this process |
| | | | Local data-parallel rank (for SPMD mode) |
| | | | Total number of data-parallel groups |
| | | | IP address of the data-parallel master |
| | | | Port of the data-parallel master |
| ~~…~~ | | | Removed; use the CLI flag … instead |
| | | | Enable QK-norm + RoPE + cache + quant fusion; enable for Qwen3-MoE models |
| | | | Use Triton-based GEMM kernels instead of default backends |
| | | | Use Triton-based MXFP4 batched matrix multiply |
| | | | Enable fused input RMSNorm + quantization for DeepSeek models |
| | | | Enable fused QK-norm + quantization for DeepSeek models |
| | | | Enable fused all-reduce + RMSNorm kernel |
| | | | Enable AITER Triton fused RMSNorm + quantization for LLaMA models |
| | | | Enable AITER Triton fused SiLU + multiply + quantization for LLaMA models |
8.2 Additional Environment Variables (Used Outside envs.py)
| Variable | Type | Default | Where Used | Description |
|---|---|---|---|---|
| | | | | Directory for PyTorch profiler output; sets the default for … |
| | | | | Set to … |
| | | | | HuggingFace authentication token for gated model downloads |
9. Decision Tree – Choosing a Compilation Level
```
Start
  |
  v
Is this a debugging / development run?
  |-- Yes --> Level 0 (NO_COMPILATION) or --enforce-eager
  |
  v
Do you need torch.compile but no graph splitting?
  |-- Yes, one-shot compile    --> Level 2 (DYNAMO_ONCE)
  |-- Yes, keep Dynamo default --> Level 1 (DYNAMO_AS_IS)
  |
  v
Production inference on ROCm/HIP GPU?
  |-- Yes --> Level 3 (PIECEWISE) [default in EngineArgs]
  |           - Auto-sets CUDAGraphMode.PIECEWISE
  |           - Auto-populates splitting_ops for attention ops
  |           - Pair with --cudagraph-capture-sizes for your batch profile
  |
  v
Need maximum decode throughput?
  |-- Yes --> Level 3 + set cudagraph_mode to FULL_AND_PIECEWISE
              (full graphs for decode, piecewise for prefill)
```
Rules of thumb:

- Level 3 is the default for `EngineArgs` and is recommended for most production workloads.
- Level 0 / `--enforce-eager` is useful for debugging, profiling, or when CUDA graphs are incompatible with your model.
- Match `--cudagraph-capture-sizes` to your expected batch sizes for optimal memory usage and launch latency.
- When using `--enable-dp-attention` or Expert Parallelism (`--enable-expert-parallel`), level 3 is still recommended.
Source Files
| File | Description |
|---|---|
| `atom/config.py` | All configuration classes: `Config`, `CompilationConfig`, `CompilationLevel`, `CUDAGraphMode`, `QuantizationConfig`, `LayerQuantConfig`, `ParallelConfig`, `SpeculativeConfig` |
| `atom/sampling_params.py` | `SamplingParams` |
| `atom/model_engine/arg_utils.py` | `EngineArgs` and CLI argument handling |
| `atom/utils/envs.py` | Environment variable registry (lazily evaluated) |