# ATOM Configuration Guide

ATOM (AiTer Optimized Model) is AMD's lightweight LLM inference engine built on [AITER](https://github.com/ROCm/aiter) kernels for ROCm/HIP GPUs. This guide documents every configuration class, CLI flag, and environment variable that controls ATOM's runtime behaviour.

---

## Quick Reference

| Config Class | Primary Purpose |
|---|---|
| `Config` | Master dataclass -- model path, memory, TP size, scheduler limits, KV cache, profiler, and references to all sub-configs |
| `CompilationConfig` | Compilation level (0-3), CUDA graph capture sizes, piecewise splitting ops, inductor settings |
| `CompilationLevel` | Integer constants for the four compilation levels |
| `CUDAGraphMode` | Enum controlling how CUDA graphs are captured (none / piecewise / full / hybrid) |
| `QuantizationConfig` | Layer-wise quantization orchestrator: global config, per-layer overrides, exclude lists, layer name remapping |
| `LayerQuantConfig` | Per-layer quantization parameters: quant type, dtype, dynamic flag, method |
| `ParallelConfig` | Data-parallel size, rank, master IP/port |
| `SpeculativeConfig` | Speculative decoding method, draft model, number of speculative tokens |
| `KVCacheConfig` / `KVCacheTensor` | Per-layer KV cache tensor descriptors (k/v caches and scales) |
| `SamplingParams` | Temperature, max tokens, stop strings, ignore-EOS flag |
| `EngineArgs` | CLI argument parser that builds a `Config` for `LLMEngine` |

---

## 1. Master Configuration (`Config`)

Defined in `atom/config.py`. The root dataclass that the engine consumes.

| Field | Type | Default | Description |
|---|---|---|---|
| `model` | `str` | *(required)* | HuggingFace model name or local path |
| `trust_remote_code` | `bool` | `False` | Trust remote code when loading the model from HuggingFace |
| `max_num_batched_tokens` | `int` | `16384` | Maximum number of tokens batched together per scheduler step |
| `scheduler_delay_factor` | `float` | `0.0` | Multiplicative delay (factor x previous prompt latency) before scheduling the next prompt |
| `max_num_seqs` | `int` | `512` | Maximum number of sequences batched together |
| `max_model_len` | `int \| None` | `None` | Maximum context length; defaults to `hf_config.max_position_embeddings` (capped by it when set) |
| `gpu_memory_utilization` | `float` | `0.9` | Fraction of GPU memory available for KV cache and weights (0.0 -- 1.0) |
| `tensor_parallel_size` | `int` | `1` | Number of tensor-parallel GPUs (1 -- 8) |
| `enforce_eager` | `bool` | `False` | Disable compilation and CUDA graphs; run in eager mode |
| `parallel_config` | `ParallelConfig` | `ParallelConfig()` | Data-parallel configuration (see Section 4) |
| `kv_cache_block_size` | `int` | `16` | Block size for paged KV cache; must be a multiple of 16 or exactly 1 |
| `num_kvcache_blocks` | `int` | `-1` | Number of KV cache blocks (`-1` = auto) |
| `kv_cache_dtype` | `str` | `"bf16"` | KV cache data type (`"bf16"` or `"fp8"`) |
| `enable_prefix_caching` | `bool` | `False` | Enable prefix caching to reuse KV blocks across requests sharing the same prefix |
| `port` | `int` | `8006` | Engine internal communication port |
| `torch_profiler_dir` | `str \| None` | `os.getenv("ATOM_TORCH_PROFILER_DIR", None)` | Directory for saving PyTorch profiler traces; creates the directory if it does not exist |
| `compilation_config` | `CompilationConfig` | `CompilationConfig()` | Compilation and CUDA graph settings (see Section 2) |
| `quant_config` | `QuantizationConfig` | *(auto-detected)* | Quantization settings; auto-detected from HuggingFace config during `__post_init__` via `QuantizationConfig(hf_config)` (see Section 3) |
| `asyncio_mode` | `bool` | `False` | Enable asyncio-based engine loop |
| `load_dummy` | `bool` | `False` | Skip loading model weights (for benchmarking / testing) |
| `enable_expert_parallel` | `bool` | `False` | Enable Expert Parallelism for MoE models |
| `master_addr` | `str` | `"127.0.0.1"` | Master address for distributed communication |
| `graph_bs` | `Optional[list[int]]` | `None` | Explicit list of batch sizes for CUDA graph capture; derived from `compilation_config` during init |
| `enable_dp_attention` | `bool` | `False` | Enable data-parallel attention |
| `torch_dtype` | `torch.dtype` | *(computed)* | Inferred from `hf_config.torch_dtype`; falls back to `torch.bfloat16` |
| `speculative_config` | `Optional[SpeculativeConfig]` | `None` | Speculative decoding configuration (see Section 5) |
| `bos_token_id` | `int` | `-1` | Beginning-of-sequence token ID (`-1` = use model default) |
| `eos_token_id` | `int` | `-1` | End-of-sequence token ID (`-1` = use model default) |
| `stop_token_ids` | `list[int]` | `[]` | Additional stop token IDs; populated from `GenerationConfig.eos_token_id` during init |

**Auto-derived fields** (set in `__post_init__`, not user-supplied):

| Field | Type | Description |
|---|---|---|
| `hf_config` | `PretrainedConfig` | Loaded automatically via `get_hf_config(model)` |
| `generation_config` | `GenerationConfig` | Loaded automatically via `get_generation_config(model)` |
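For orientation, here is a minimal sketch of constructing the dataclass directly in Python (most deployments go through `EngineArgs` instead; see Section 7). The import path follows the Source Files table; anything beyond the fields documented above is an assumption.

```python
from atom.config import Config  # path per the Source Files table

# Only user-facing fields from the table above are set here; __post_init__
# is documented to load hf_config / generation_config and to auto-detect
# quant_config, so those are not passed explicitly.
config = Config(
    model="Qwen/Qwen3-0.6B",
    tensor_parallel_size=1,
    max_num_seqs=256,
    gpu_memory_utilization=0.9,
    kv_cache_dtype="bf16",
    enable_prefix_caching=True,
)

print(config.torch_dtype)              # computed from hf_config.torch_dtype
print(config.quant_config.get_name())  # auto-detected quantization method
```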
---

## 2. Compilation Configuration (`CompilationConfig`)

Defined in `atom/config.py`. Controls torch.compile and CUDA graph behaviour.

### 2.1 Compilation Levels (`CompilationLevel`)

| Constant | Value | Description |
|---|---|---|
| `NO_COMPILATION` | `0` | No compilation -- pure eager execution |
| `DYNAMO_AS_IS` | `1` | Use torch.compile / TorchDynamo as-is |
| `DYNAMO_ONCE` | `2` | TorchDynamo with a single compilation pass |
| `PIECEWISE` | `3` | Piecewise compilation with CUDA graph capture (recommended for production) |

### 2.2 `CompilationConfig` Fields

| Field | Type | Default | Description |
|---|---|---|---|
| `level` | `int` | `0` | Compilation level (see table above); must be 0 -- 3 |
| `use_cudagraph` | `bool` | `True` | Whether to use CUDA graphs |
| `cudagraph_capture_sizes` | `Optional[list[int]]` | `None` | Explicit list of batch sizes for CUDA graph capture; overrides `cuda_graph_sizes` when set |
| `cuda_graph_sizes` | `list[int]` | `[]` (post-init: `[512]`) | CUDA graph sizing strategy: 1 value generates `[1, 2, 4, 8] + range(16, N + 1, 16)`; multiple values are used as-is; empty defaults to `[512]` (see the sketch after this table) |
| `debug_dump_path` | `str` | `""` | Path to dump debug / compilation information |
| `cache_dir` | `str` | `""` | Directory for compilation caches |
| `use_inductor` | `bool` | `True` | Enable TorchInductor backend |
| `cudagraph_mode` | `Optional[CUDAGraphMode]` | `None` | CUDA graph capture mode (see below); set to `PIECEWISE` automatically at level 3 |
| `splitting_ops` | `Optional[list[str]]` | `None` | Ops that split the graph into sub-graphs for piecewise compilation; auto-populated at level 3 with `["aiter.unified_attention_with_output", "aiter.mla_attention"]` |
| `cudagraph_copy_inputs` | `bool` | `False` | Copy input tensors into internally managed buffers before CUDA graph replay; only effective in PIECEWISE mode |
| `compile_sizes` | `Optional[list[Union[int, str]]]` | `None` | Sizes to compile for inductor; accepts integers and the string `"cudagraph_capture_sizes"` |
| `inductor_compile_config` | `dict` | `{}` | Additional configuration passed to the inductor backend |
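The sizing rule for `cuda_graph_sizes` is easiest to see in code. The helper below is an illustrative restatement of the rule in the table, not ATOM's actual implementation, and how the empty/default case expands is an assumption.

```python
def expand_cuda_graph_sizes(cuda_graph_sizes: list[int]) -> list[int]:
    """Illustrative restatement of the documented cuda_graph_sizes strategy."""
    if not cuda_graph_sizes:
        cuda_graph_sizes = [512]         # post-init default (assumed to expand like a single value)
    if len(cuda_graph_sizes) > 1:
        return sorted(cuda_graph_sizes)  # multiple values: used as-is
    n = cuda_graph_sizes[0]              # single value N: [1, 2, 4, 8] + range(16, N + 1, 16)
    return [1, 2, 4, 8] + list(range(16, n + 1, 16))

print(expand_cuda_graph_sizes([64]))     # [1, 2, 4, 8, 16, 32, 48, 64]
```

Note that `cudagraph_capture_sizes`, when set, bypasses this strategy entirely.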
### 2.3 CUDA Graph Mode (`CUDAGraphMode`)

| Mode | Value | Description |
|---|---|---|
| `NONE` | `0` | No CUDA graph capture |
| `PIECEWISE` | `1` | Piecewise CUDA graphs -- attention ops stay outside the graph for flexibility (default at level 3) |
| `FULL` | `2` | Full CUDA graph capture for all batches; best for small models / short prompts |
| `FULL_DECODE_ONLY` | `(FULL, NONE)` | Full CUDA graphs for decode batches only; mixed prefill-decode runs without graphs (useful in P/D setups) |
| `FULL_AND_PIECEWISE` | `(FULL, PIECEWISE)` | Full graphs for decode, piecewise for prefill/mixed -- most performant mode for most models |

Helper methods on `CUDAGraphMode` (see the usage sketch after this list):

- `decode_mode()` -- returns the mode used for pure decode batches.
- `mixed_mode()` -- returns the mode used for mixed prefill-decode batches.
- `requires_piecewise_compilation()` -- whether the mode needs piecewise compilation.
- `has_full_cudagraphs()` -- whether the mode includes full CUDA graph capture.
- `separate_routine()` -- whether decode and mixed batches use different routines.
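To make the hybrid modes concrete, here is a usage sketch based on the descriptions above; the import path and the exact return values of the helpers are assumptions drawn from this table, not verified against the source.

```python
from atom.config import CUDAGraphMode  # import path assumed

mode = CUDAGraphMode.FULL_AND_PIECEWISE  # value (FULL, PIECEWISE)

# Per the table: full graphs for pure decode batches, piecewise graphs for
# mixed prefill-decode batches, hence two separate capture/replay routines.
assert mode.decode_mode() == CUDAGraphMode.FULL
assert mode.mixed_mode() == CUDAGraphMode.PIECEWISE
assert mode.has_full_cudagraphs()
assert mode.separate_routine()
```

`FULL_DECODE_ONLY` decomposes the same way, except that `mixed_mode()` returns `NONE`, so mixed batches run without graphs.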
---

## 3. Quantization Configuration (`QuantizationConfig` & `LayerQuantConfig`)

Defined in `atom/config.py`. The quantization system uses two classes:

- **`QuantizationConfig`** -- the top-level orchestrator that holds a global config, per-layer overrides, and exclusion lists. It is **not** a `dict` subclass.
- **`LayerQuantConfig(dict)`** -- a `dict` subclass that stores the concrete quantization parameters for a single layer (or as the global default).

### 3.1 `LayerQuantConfig` Fields

`LayerQuantConfig` extends `dict`. Fields are stored and accessed as dictionary keys (e.g., `cfg["quant_type"]`).

| Key | Type | Default | Description |
|---|---|---|---|
| `quant_type` | `QuantType` | `QuantType.No` | Quantization granularity (see below) |
| `quant_dtype` | `torch.dtype` | `torch.bfloat16` | Data type for quantized weights |
| `is_dynamic` | `bool` | `True` | Use dynamic quantization (scales computed at runtime) |
| `quant_method` | `str` | `""` | Quantization method (e.g., `"quark"`, `"compressed-tensors"`) |

### 3.2 `QuantizationConfig` Attributes

| Attribute | Type | Description |
|---|---|---|
| `torch_dtype` | `torch.dtype` | The model's default dtype (from `hf_config.torch_dtype`) |
| `hf_quant_config` | `dict \| None` | Raw `quantization_config` dict from HuggingFace config |
| `global_quant_config` | `LayerQuantConfig` | Default quantization config applied to all layers |
| `layer_quant_config` | `dict[str, LayerQuantConfig]` | Per-layer overrides keyed by layer name pattern (supports fnmatch globs like `"*.mlp.*"`) |
| `exclude_layers` | `list[str]` | Layer names excluded from quantization (supports exact match and `"re:"` regex prefix) |
| `quant_method` | `str` | Top-level quantization method name (e.g., `"quark"`, `"compressed-tensors"`) |

Key methods:

| Method | Description |
|---|---|
| `get_name()` | Returns the quantization method name |
| `get_layer_quant_config(layer_name)` | Returns the `LayerQuantConfig` for a layer: checks exclusions first, then per-layer overrides, then falls back to the global config |
| `should_ignore_layer_quant(layer_name)` | Returns `True` if the layer is in the exclusion list |
| `remap_layer_name(hf_config, packed_modules_mapping)` | Remaps layer names for packed/fused modules (e.g., `q_a_proj` → `fused_qkv_a_proj` for DeepSeek) |
| `compute_hash()` | Returns a SHA-256 hash of the quantization config for cache invalidation |
| `parse_quark_config_dict(config)` | Parses a quark-format config dict into a `LayerQuantConfig` |

### 3.3 `QuantType` Values (from AITER)

| Value | Description |
|---|---|
| `QuantType.No` | No quantization |
| `QuantType.per_Token` | Per-token / per-channel quantization |
| `QuantType.per_1x128` | Block quantization with group size 128 |
| `QuantType.per_1x32` | Block quantization with group size 32 |
| `QuantType.per_128x128` | Large 2D block quantization (remapped to `per_1x128` in MoE kernels) |
| `QuantType.per_Tensor` | Per-tensor quantization |

### 3.4 Supported Quantization Dtypes

| Dtype | AITER Key | Notes |
|---|---|---|
| FP8 (E4M3) | `"fp8"` | 8-bit floating point |
| MXFP4 | `"fp4x2"` | Microscaling FP4; forces `QuantType.per_1x32` |
| INT8 | `"i8"` | 8-bit integer |
| INT4 | `"i4x2"` | 4-bit integer (packed) |

### 3.5 Auto-Detection from HuggingFace

During `Config.__post_init__`, ATOM constructs `QuantizationConfig(hf_config)`, which reads `hf_config.quantization_config` and automatically determines quantization parameters.

**For quark models** (`quant_method == "quark"`):

1. Parses the `global_quant_config` dict via `parse_quark_config_dict()` to produce the global `LayerQuantConfig`.
2. Parses each entry in the `layer_quant_config` dict to produce per-layer overrides.
3. Reads the `"exclude"` list for excluded layers.
4. Within each config dict, `weight.qscheme` determines `quant_type` (`"per_channel"` → `per_Token`, `"per_tensor"` → `per_Tensor`, `"per_group"` → `per_1x32`), and `weight.dtype` determines `quant_dtype` (see the mapping sketch after this list).
5. `input_tensors.is_dynamic` controls dynamic quantization (defaults to `True` if absent).
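A small sketch of the `weight.qscheme` mapping from step 4. The helper below is hypothetical and only mirrors the documented mapping (the real parsing lives in `parse_quark_config_dict()`); the `QuantType` import path is assumed.

```python
from aiter import QuantType  # import path assumed; values listed in Section 3.3

# Hypothetical helper mirroring step 4 above.
QSCHEME_TO_QUANT_TYPE = {
    "per_channel": QuantType.per_Token,
    "per_tensor": QuantType.per_Tensor,
    "per_group": QuantType.per_1x32,
}

def quant_type_from_qscheme(qscheme: str) -> QuantType:
    # Unknown or missing qscheme falls back to "no quantization".
    return QSCHEME_TO_QUANT_TYPE.get(qscheme, QuantType.No)

print(quant_type_from_qscheme("per_channel"))  # QuantType.per_Token
```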
**For other models** (compressed-tensors, etc.):

1. If `quant_method == "compressed-tensors"` or channel quantization is detected, sets `per_Token`.
2. If `weight_block_size` or `group_size` is found: group size 128 maps to `per_1x128`, group size 32 maps to `per_1x32`.
3. Otherwise falls back to `per_Tensor`.
4. The dtype is parsed from fields like `dtype`, `weight_dtype`, or `quant_method`, looking for `fp8`, `fp4`, `mxfp4`, `int8`, `int4`, or `num_bits`.
5. If `activation_scheme` is `"static"`, `is_dynamic` is set to `False`.
6. Excluded layers are read from the `"ignore"` key.

### 3.6 Layer-Level Quantization Dispatch

Linear layers, MoE layers, and fused ops call `quant_config.get_layer_quant_config(prefix)` to obtain the appropriate `LayerQuantConfig` for their position in the model. This enables mixed-precision quantization where different layers can have different quant types and dtypes (e.g., FP8 for attention, FP4 for MLP).

---

## 4. Parallel Configuration (`ParallelConfig`)

Defined in `atom/config.py`. Controls data parallelism. Environment variables (Section 8) override defaults when set.

| Field | Type | Default | Description |
|---|---|---|---|
| `data_parallel_size` | `int` | `1` | Number of data-parallel groups; overridden by `ATOM_DP_SIZE` env var |
| `data_parallel_size_local` | `int` | `1` | Number of local data-parallel groups |
| `data_parallel_rank` | `int` | `0` | Rank within the data-parallel group; overridden by `ATOM_DP_RANK` |
| `data_parallel_rank_local` | `Optional[int]` | `None` | Local rank within the data-parallel group (SPMD mode); overridden by `ATOM_DP_RANK_LOCAL` |
| `data_parallel_master_port` | `int` | `29500` | Port used by the data-parallel master for process group initialization |
| `data_parallel_base_port` | `int` | `get_open_port()` | Base port for data-parallel communication (dynamically assigned) |
| `data_parallel_master_ip` | `str` | `"127.0.0.1"` | IP address of the data-parallel master |

**Computed properties:**

- `world_size` -- set during init, equals TP x PP.
- `world_size_across_dp` -- `world_size * data_parallel_size`.

---

## 5. Speculative Decoding Configuration (`SpeculativeConfig`)

Defined in `atom/config.py`. Currently only the Multi-Token Prediction (MTP) method with `num_speculative_tokens=1` is supported.

| Field | Type | Default | Description |
|---|---|---|---|
| `method` | `Optional[str]` | `""` | Speculative decoding method; currently only `"mtp"` is accepted |
| `model` | `Optional[str]` | `None` | Draft model name or path (typically the same as the target model for MTP) |
| `num_speculative_tokens` | `Optional[int]` | `None` | Number of speculative tokens per iteration; **must be `1`** |
| `draft_model_hf_config` | `Optional[PretrainedConfig]` | `None` | HuggingFace config for the draft model; auto-loaded from `model` when `None` |

**Post-init behaviour:**

- Loads `draft_model_hf_config` from `model` if not provided.
- For DeepSeek V3 / MTP models: overrides `model_type` to `"deepseek_mtp"`, sets `n_predict=1` and `num_nextn_predict_layers=1`, and switches architectures to `["DeepSeekMTPModel"]`.
- `Config.__post_init__` raises `ValueError` if `num_speculative_tokens != 1`.

---

## 6. Sampling Parameters (`SamplingParams`)

Defined in `atom/sampling_params.py`. Passed per-request to control generation.

| Field | Type | Default | Description |
|---|---|---|---|
| `temperature` | `float` | `1.0` | Sampling temperature; lower values make output more deterministic |
| `max_tokens` | `int` | `64` | Maximum number of tokens to generate |
| `ignore_eos` | `bool` | `False` | Continue generating past the EOS token |
| `stop_strings` | `Optional[list[str]]` | `None` | List of strings that trigger generation to stop |
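A short construction example using the fields above (import path per the Source Files table); the request-submission API that consumes it is outside the scope of this guide.

```python
from atom.sampling_params import SamplingParams

# Per-request sampling settings; unset fields keep the defaults listed above.
params = SamplingParams(
    temperature=0.7,              # lower values -> more deterministic output
    max_tokens=256,
    ignore_eos=False,
    stop_strings=["</answer>"],   # stop generation when this string appears
)
```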
---

## 7. CLI Arguments (`EngineArgs`)

Defined in `atom/model_engine/arg_utils.py`. The `EngineArgs` dataclass exposes all flags via `add_cli_args()` and converts them into a `Config` via `create_engine()`.

| Flag | Short | Type | Default | Description |
|---|---|---|---|---|
| `--model` | | `str` | `"Qwen/Qwen3-0.6B"` | Model name or path |
| `--trust-remote-code` | | flag | `False` | Trust remote code when loading model |
| `--tensor-parallel-size` | `-tp` | `int` | `1` | Tensor parallel size |
| `--data-parallel-size` | `-dp` | `int` | `1` | Data parallel size |
| `--enforce-eager` | | flag | `False` | Enforce eager mode execution |
| `--enable_prefix_caching` | | flag | `False` | Enable prefix caching |
| `--port` | | `int` | `8006` | Engine internal port |
| `--kv_cache_dtype` | | `str` | `"bf16"` | KV cache dtype; choices: `bf16`, `fp8` |
| `--block-size` | | `int` | `16` | KV cache block size (maps to `kv_cache_block_size`) |
| `--max-model-len` | | `int` | `None` | Maximum model context length; defaults to `hf_config.max_position_embeddings` |
| `--cudagraph-capture-sizes` | | `str` | `"[1,2,4,8,16,32,48,64,128,256]"` | CUDA graph capture sizes as a Python list string |
| `--level` | | `int` | `3` | Compilation level (0 -- 3) |
| `--load_dummy` | | flag | `False` | Skip loading model weights |
| `--enable-expert-parallel` | | flag | `False` | Enable Expert Parallelism (EP MoE) |
| `--torch-profiler-dir` | | `str` | `None` | Directory for torch profiler traces |
| `--enable-dp-attention` | | flag | `False` | Enable DP attention |
| `--method` | | `str` | `None` | Speculative method; choices: `mtp` |
| `--num-speculative-tokens` | | `int` | `1` | Number of speculative tokens per iteration |
| `--max-num-batched-tokens` | | `int` | `16384` | Maximum number of tokens to batch in the async engine |
| `--max-num-seqs` | | `int` | `512` | Maximum number of sequences to batch together |
| `--gpu-memory-utilization` | | `float` | `0.9` | Fraction of GPU memory to use (0.0 -- 1.0) |
| `--scheduler-delay-factor` | | `float` | `0.0` | Delay factor multiplied by previous prompt latency before scheduling the next prompt |

**Example:**

```bash
python -m atom.entrypoint \
    --model deepseek-ai/DeepSeek-R1 \
    --tensor-parallel-size 8 \
    --level 3 \
    --cudagraph-capture-sizes "[1,2,4,8,16,32,64,128,256]" \
    --kv_cache_dtype fp8 \
    --gpu-memory-utilization 0.92 \
    --max-num-seqs 256
```

---

## 8. Environment Variables

### 8.1 Variables Registered in `atom/utils/envs.py`

All variables use lazy evaluation. Boolean variables treat `"1"` as `True` and anything else (including unset) as `False`, unless noted otherwise.

| Variable | Type | Default | Description |
|---|---|---|---|
| `ATOM_DP_RANK` | `int` | `0` | Data-parallel rank of this process |
| `ATOM_DP_RANK_LOCAL` | `int` | `0` | Local data-parallel rank (for SPMD mode) |
| `ATOM_DP_SIZE` | `int` | `1` | Total number of data-parallel groups |
| `ATOM_DP_MASTER_IP` | `str` | `"127.0.0.1"` | IP address of the data-parallel master |
| `ATOM_DP_MASTER_PORT` | `int` | `29500` | Port of the data-parallel master |
| ~~`ATOM_ENFORCE_EAGER`~~ | | | Removed. Use CLI flag `--enforce-eager` instead. |
| `ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION` | `bool` | `False` | Enable QK-norm + RoPE + cache + quant fusion; enable for Qwen3-MoE models |
| `ATOM_USE_TRITON_GEMM` | `bool` | `False` | Use Triton-based GEMM kernels instead of default backends |
| `ATOM_USE_TRITON_MXFP4_BMM` | `bool` | `False` | Use Triton-based MXFP4 batched matrix multiply |
| `ATOM_ENABLE_DS_INPUT_RMSNORM_QUANT_FUSION` | `bool` | `True` | Enable fused input RMSNorm + quantization for DeepSeek models |
| `ATOM_ENABLE_DS_QKNORM_QUANT_FUSION` | `bool` | `True` | Enable fused QK-norm + quantization for DeepSeek models |
| `ATOM_ENABLE_ALLREDUCE_RMSNORM_FUSION` | `bool` | `True` | Enable fused all-reduce + RMSNorm kernel |
| `ATOM_LLAMA_ENABLE_AITER_TRITON_FUSED_RMSNORM_QUANT` | `bool` | `True` | Enable AITER Triton fused RMSNorm + quantization for LLaMA models |
| `ATOM_LLAMA_ENABLE_AITER_TRITON_FUSED_SILU_MUL_QUANT` | `bool` | `True` | Enable AITER Triton fused SiLU + multiply + quantization for LLaMA models |

### 8.2 Additional Environment Variables (Used Outside `envs.py`)

| Variable | Type | Default | Where Used | Description |
|---|---|---|---|---|
| `ATOM_TORCH_PROFILER_DIR` | `str` | `None` | `atom/config.py` (`Config.torch_profiler_dir`) | Directory for PyTorch profiler output; sets the default for `Config.torch_profiler_dir` |
| `ATOM_PROFILER_MORE` | `str` | `"0"` | `atom/model_engine/model_runner.py` | Set to `"1"` to enable detailed profiling (`record_shapes`, `with_stack`, `profile_memory`) |
| `HF_TOKEN` | `str` | `None` | `atom/config.py` (`get_hf_config`) | HuggingFace authentication token for gated model downloads |
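As an illustration, a rank-0 worker of a two-way data-parallel deployment might export these variables before launching the engine. The values are illustrative only; the entrypoint matches the example in Section 7.

```bash
# Illustrative: DP rank 0 of 2, with detailed torch profiling enabled.
export ATOM_DP_SIZE=2
export ATOM_DP_RANK=0
export ATOM_DP_MASTER_IP=10.0.0.1
export ATOM_DP_MASTER_PORT=29500
export ATOM_TORCH_PROFILER_DIR=/tmp/atom_traces
export ATOM_PROFILER_MORE=1

python -m atom.entrypoint --model Qwen/Qwen3-0.6B --data-parallel-size 2
```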
---

## 9. Decision Tree -- Choosing a Compilation Level

```
Start
  |
  v
Is this a debugging / development run?
  |-- Yes --> Level 0 (NO_COMPILATION) or --enforce-eager
  |
  v
Do you need torch.compile but no graph splitting?
  |-- Yes, one-shot compile --> Level 2 (DYNAMO_ONCE)
  |-- Yes, keep Dynamo default --> Level 1 (DYNAMO_AS_IS)
  |
  v
Production inference on ROCm/HIP GPU?
  |-- Yes --> Level 3 (PIECEWISE)  [default in EngineArgs]
  |           - Auto-sets CUDAGraphMode.PIECEWISE
  |           - Auto-populates splitting_ops for attention ops
  |           - Pair with --cudagraph-capture-sizes for your batch profile
  |
  v
Need maximum decode throughput?
  |-- Yes --> Level 3 + set cudagraph_mode to FULL_AND_PIECEWISE
              (full graphs for decode, piecewise for prefill)
```

**Rules of thumb:**

- **Level 3** is the default for `EngineArgs` and is recommended for most production workloads.
- **Level 0** / `--enforce-eager` is useful for debugging, profiling, or when CUDA graphs are incompatible with your model.
- Match `--cudagraph-capture-sizes` to your expected batch sizes for optimal memory usage and launch latency.
- When using `--enable-dp-attention` or Expert Parallelism (`--enable-expert-parallel`), level 3 is still recommended.

---

## Source Files

| File | Description |
|---|---|
| `atom/config.py` | `Config`, `CompilationConfig`, `CompilationLevel`, `CUDAGraphMode`, `LayerQuantConfig`, `QuantizationConfig`, `ParallelConfig`, `SpeculativeConfig`, `KVCacheTensor`, `KVCacheConfig`, `get_hf_config` |
| `atom/utils/envs.py` | All `ATOM_*` environment variable definitions with lazy evaluation |
| `atom/model_engine/arg_utils.py` | `EngineArgs` dataclass and CLI argument parser |
| `atom/sampling_params.py` | `SamplingParams` dataclass |
| `atom/model_engine/model_runner.py` | Uses `ATOM_PROFILER_MORE` and `ATOM_TORCH_PROFILER_DIR` for profiling |