ATOM Architecture Guide

Quick Reference

Class	Import	Purpose
`LLMEngine`	`from atom.model_engine.llm_engine import LLMEngine`	User-facing inference API
`InputOutputProcessor`	`from atom.model_engine.llm_engine import InputOutputProcessor`	Tokenize/detokenize, TTFT/TPOT stats
`CoreManager`	`from atom.model_engine.engine_core_mgr import CoreManager`	Multi-process orchestration via ZMQ
`EngineCore`	`from atom.model_engine.engine_core import EngineCore`	Per-process engine loop
`DPEngineCoreProc`	`from atom.model_engine.engine_core import DPEngineCoreProc`	Data-parallel engine core variant
`ModelRunner`	`from atom.model_engine.model_runner import ModelRunner`	Per-GPU model execution
`Scheduler`	`from atom.model_engine.scheduler import Scheduler`	Prefill-first request scheduling
`BlockManager`	`from atom.model_engine.block_manager import BlockManager`	KV cache block allocation
`Sequence`	`from atom.model_engine.sequence import Sequence`	Request state and token tracking
`ForwardContext`	`from atom.utils.forward_context import ForwardContext`	Global forward pass metadata
`Config`	`from atom.config import Config`	Master configuration dataclass

1. System Overview

ATOM (AiTer Optimized Model) is AMD’s lightweight LLM inference engine, inspired by vLLM’s architecture and built on the AITER kernel library for ROCm/HIP GPUs.

Key design principles:

Multi-process architecture – each engine core runs in its own process, with ZMQ-based IPC connecting the user-facing API to one or more GPU workers.
AITER-native execution – model forward passes use AITER’s optimized attention, MoE, sampling, and communication kernels rather than generic PyTorch operators.
CUDA graph acceleration – decode batches are captured into CUDA graphs and replayed, which removes most of the per-step kernel launch overhead.
Prefill-first scheduling – the scheduler prioritizes prompt prefills before decode steps, following vLLM’s continuous batching strategy.
Speculative decoding – optional EAGLE/MTP (Multi-Token Prediction) draft models propose tokens that are verified via rejection sampling.

2. Component Architecture

LLMEngine (user-facing API)
├── InputOutputProcessor (tokenize/detokenize, TTFT/TPOT stats)
├── CoreManager (multi-process orchestration via ZMQ)
│   └── EngineCore (one per DP rank, runs in its own process)
│       ├── ModelRunner (per-GPU execution via AsyncIOProcManager)
│       │   ├── Model (Qwen3, Llama, DeepSeek, Mixtral, etc.)
│       │   ├── Sampler / RejectionSampler
│       │   └── EagleProposer (optional MTP draft)
│       └── Scheduler
│           └── BlockManager (KV cache block management)
└── Config (master configuration)

Supported model architectures (registered in support_model_arch_dict, a module-level dict in model_runner.py):

Architecture key	Implementation
`Qwen3ForCausalLM`	`atom.models.qwen3.Qwen3ForCausalLM`
`Qwen3MoeForCausalLM`	`atom.models.qwen3_moe.Qwen3MoeForCausalLM`
`LlamaForCausalLM`	`atom.models.llama.LlamaForCausalLM`
`MixtralForCausalLM`	`atom.models.mixtral.MixtralForCausalLM`
`DeepseekV3ForCausalLM`	`atom.models.deepseek_v2.DeepseekV2ForCausalLM`
`DeepseekV32ForCausalLM`	`atom.models.deepseek_v2.DeepseekV2ForCausalLM`
`GptOssForCausalLM`	`atom.models.gpt_oss.GptOssForCausalLM`
`Glm4MoeForCausalLM`	`atom.models.glm4_moe.Glm4MoeForCausalLM`
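
The table above maps architecture keys to implementation classes, and two keys can share one class (both DeepSeek keys resolve to `DeepseekV2ForCausalLM`). The following is a minimal sketch of how such a registry lookup could work, not ATOM's actual code. It assumes the registry stores dotted `module.ClassName` strings; the real shape of `support_model_arch_dict` may differ. `DEMO_ARCH_REGISTRY` and `resolve_architecture` are hypothetical names, and stdlib classes stand in for `atom.models.*` so the sketch runs without ATOM installed.

```python
import importlib

# Hypothetical registry: architecture key -> dotted "module.ClassName" path.
# Two keys deliberately share one implementation, mirroring the DeepSeek rows.
DEMO_ARCH_REGISTRY = {
    "DemoV3ForCausalLM": "collections.OrderedDict",
    "DemoV32ForCausalLM": "collections.OrderedDict",
    "DemoOtherForCausalLM": "fractions.Fraction",
}


def resolve_architecture(arch_key, registry=DEMO_ARCH_REGISTRY):
    """Return the class registered for arch_key, importing its module lazily."""
    try:
        dotted_path = registry[arch_key]
    except KeyError:
        supported = ", ".join(sorted(registry))
        raise ValueError(
            f"Unsupported architecture {arch_key!r}; supported: {supported}"
        ) from None
    # Split "package.module.ClassName" into module path and class name.
    module_name, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)
```

Importing lazily means only the module for the requested architecture is loaded, and an unknown key fails early with the list of supported keys.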

3. Request Lifecycle

A request flows through the system in ten steps:

1. LLMEngine.add_request() / generate() – the user submits a list of prompts (strings or pre-tokenized token IDs) together with SamplingParams.
2. InputOutputProcessor.preprocess() – each prompt is tokenized via the HuggingFace tokenizer. A Sequence object is created to track the request’s state, timing, and block allocation. arrive_time is recorded.
3. CoreManager.add_request() – the list of Sequence objects is serialized with pickle and sent over a ZMQ ROUTER socket. When multiple DP ranks are active, requests are distributed round-robin.
4. EngineCore.process_input_sockets() – an I/O thread on the EngineCore process receives the serialized data on a ZMQ DEALER socket, deserializes it, and places the sequences into the input_queue.
5. EngineCore.busy_loop() – the main execution loop pulls from input_queue via pull_and_process_input_queue(), feeds new sequences into the scheduler, and repeatedly calls _process_engine_step() until all work is done.
6. Scheduler.schedule() – implements prefill-first scheduling. Waiting sequences are scheduled for prefill if they fit within max_num_seqs and max_num_batched_tokens and the BlockManager can allocate blocks. If no prefills are pending, running sequences are batched for decode. The scheduler returns a ScheduledBatch and the corresponding sequence map.
7. ModelRunner.forward() – executes the three-phase forward pass:
   - prepare_model() – assembles input IDs (handling deferred output from previous steps), builds attention metadata, and gathers sampling temperatures.
   - run_model() – runs the model forward. Prefill and large batches run eagerly; decode batches replay captured CUDA graphs. Returns logits and hidden states.
   - postprocess() – samples tokens (or runs rejection sampling for speculative decoding), prepares deferred output via tokenIDProcessor, and optionally proposes draft tokens through EagleProposer.
8. Scheduler.postprocess() – appends sampled tokens to each Sequence, records first_token_time, checks stop conditions (EOS, stop token IDs, stop token sequences, max_tokens), and moves finished sequences out of the running queue. The BlockManager deallocates blocks for finished sequences.
9. Output via ZMQ – finished sequences are placed on the output_queue. A dedicated output thread serializes them and sends them over a ZMQ PUSH socket back to the CoreManager, which receives them on a PULL socket and places them in outputs_queue.
10. InputOutputProcessor.postprocess() – detokenizes completed sequences, computes TTFT (Time To First Token) and TPOT (Time Per Output Token), and returns structured output dictionaries.
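
The prefill-first rule in Scheduler.schedule() can be illustrated with a small sketch. This is not ATOM's scheduler: `ToySequence`, `ToyBlockPool`, and this `schedule` function are hypothetical, the block pool only counts free blocks, and `max_num_seqs` is assumed to cap the number of sequences in one batch. Decode-time block growth and preemption are omitted.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ToySequence:
    seq_id: int
    num_prompt_tokens: int


class ToyBlockPool:
    """Hypothetical stand-in for BlockManager: counts free fixed-size KV blocks."""

    def __init__(self, num_blocks, block_size):
        self.free_blocks = num_blocks
        self.block_size = block_size

    def blocks_needed(self, num_tokens):
        return -(-num_tokens // self.block_size)  # ceiling division

    def can_allocate(self, num_blocks):
        return num_blocks <= self.free_blocks

    def allocate(self, num_blocks):
        self.free_blocks -= num_blocks


def schedule(waiting, running, pool, max_num_seqs, max_num_batched_tokens):
    """Return ("prefill", seqs) if any waiting sequence fits, else ("decode", seqs)."""
    prefill_batch = []
    batched_tokens = 0
    while waiting and len(prefill_batch) < max_num_seqs:
        seq = waiting[0]
        needed = pool.blocks_needed(seq.num_prompt_tokens)
        over_budget = batched_tokens + seq.num_prompt_tokens > max_num_batched_tokens
        if over_budget or not pool.can_allocate(needed):
            break  # keep FIFO order: stop at the first sequence that does not fit
        pool.allocate(needed)
        waiting.popleft()
        running.append(seq)
        prefill_batch.append(seq)
        batched_tokens += seq.num_prompt_tokens
    if prefill_batch:
        return "prefill", prefill_batch
    # No prefill could be scheduled: batch the running sequences for decode.
    return "decode", list(running)[:max_num_seqs]
```

With four free blocks of eight tokens and three 10-token prompts, the first call prefills two sequences and exhausts the pool; the second call cannot admit the third prompt, so it falls through to a decode batch over the two running sequences.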

4. Forward Context Pattern

ATOM uses a module-level global ForwardContext to pass per-step metadata across CUDA graph boundaries without threading it through function signatures.

Core dataclasses (defined in atom/utils/forward_context.py):

ForwardContext – top-level container holding:
- attn_metadata (AttentionMetaData) – cumulative sequence lengths, block tables, slot mappings, and backend-specific metadata.
- context (Context) – positions, prefill flag, batch size, graph batch size, draft flag.
- dp_metadata (DPMetadata) – cross-DP-rank token counts and cumulative sums.
- spec_decode_metadata (SpecDecodeMetadata) – draft token IDs, target/bonus logits indices.
- kv_cache_data (dict[str, KVCacheTensor]) – per-layer KV cache tensor references.
Context – lightweight struct: positions, is_prefill, batch_size, graph_bs, is_draft.
DPMetadata – data parallel metadata with num_tokens_across_dp() (all-reduce), max_tokens_across_dp, and chunked_sizes() context manager.

Global accessors:

Function	Purpose
`set_forward_context(attn_metadata, atom_config, context, ...)`	Set the global context before a forward pass
`get_forward_context()`	Retrieve the current context (used by attention backends)
`reset_forward_context()`	Clear after forward pass completes
`set_kv_cache_data(kv_cache_data)`	Register KV cache tensors at initialization

This pattern enables stateless dispatch: attention backends and model operators call get_forward_context() to access metadata without requiring it as a function parameter, which is critical for CUDA graph compatibility.
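
A minimal sketch of the pattern follows. It is a simplification under stated assumptions: the dataclasses carry only a few of the fields listed above, the setter takes two arguments rather than the fuller signature in the table, and `toy_attention_backend` is a hypothetical consumer standing in for a real attention backend.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Context:
    positions: Any = None
    is_prefill: bool = False
    batch_size: int = 0
    graph_bs: int = 0
    is_draft: bool = False


@dataclass
class ForwardContext:
    attn_metadata: Any = None
    context: Optional[Context] = None


# Module-level global: one context per process, replaced before each forward pass.
_forward_context = ForwardContext()


def set_forward_context(attn_metadata, context):
    """Simplified setter; the real function takes more arguments."""
    global _forward_context
    _forward_context = ForwardContext(attn_metadata=attn_metadata, context=context)


def get_forward_context():
    return _forward_context


def reset_forward_context():
    global _forward_context
    _forward_context = ForwardContext()


def toy_attention_backend(query):
    """Hypothetical operator: reads metadata from the global, not from arguments."""
    ctx = get_forward_context()
    mode = "prefill" if ctx.context.is_prefill else "decode"
    num_tables = len(ctx.attn_metadata["block_tables"])
    return f"{mode} over {len(query)} tokens using {num_tables} block table(s)"
```

Because the operator's signature contains only tensors-like inputs, the same call can be captured once and replayed while the runner updates the metadata behind it between steps.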

5. Multi-Process Architecture

ATOM uses a multi-process design with ZMQ sockets for inter-process communication:

                        ┌──────────────────────────────────┐
                        │          LLMEngine               │
                        │  ┌────────────────────────────┐  │
                        │  │    CoreManager              │  │
                        │  │                             │  │
                        │  │  ROUTER ──────► DEALER      │  │
                        │  │  (input)        (per rank)  │  │
                        │  │                             │  │
                        │  │  PULL ◄─────── PUSH         │  │
                        │  │  (output)      (per rank)   │  │
                        │  └────────────────────────────┘  │
                        └──────────────────────────────────┘
                              │                    ▲
                     pickle   │                    │  pickle
                              ▼                    │
               ┌──────────────────────────────────────┐
               │         EngineCore (Process)          │
               │                                       │
               │  input_queue ──► busy_loop             │
               │                    │                   │
               │  ┌─────────────────▼───────────────┐  │
               │  │  AsyncIOProcManager              │  │
               │  │  ┌────────────────────────────┐  │  │
               │  │  │ ModelRunner (TP rank 0)     │  │  │
               │  │  │ ModelRunner (TP rank 1)     │  │  │
               │  │  │ ...                         │  │  │
               │  │  └────────────────────────────┘  │  │
               │  └──────────────────────────────────┘  │
               │                                       │
               │  Scheduler + BlockManager             │
               └──────────────────────────────────────┘

Socket types:

Socket	Type	Direction	Purpose
Input	`ROUTER` (CoreManager) / `DEALER` (EngineCore)	CoreManager -> EngineCore	Send requests and control commands
Output	`PUSH` (EngineCore) / `PULL` (CoreManager)	EngineCore -> CoreManager	Return finished sequences and stream outputs
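
The input path combines two ideas: round-robin assignment across DP ranks and pickle serialization of the payload. The sketch below shows both without any sockets, so it runs with the standard library alone. It assumes round-robin is applied per sequence; ATOM may instead rotate per `add_request()` call. `dispatch_round_robin` is a hypothetical helper, and plain dicts stand in for Sequence objects.

```python
import pickle


def dispatch_round_robin(sequences, num_dp_ranks, start_rank=0):
    """Assign sequences to DP ranks in rotation and pickle each rank's batch.

    Returns {rank: payload_bytes} for every rank that received work. In a real
    deployment each payload would be sent on the ROUTER socket to that rank's
    DEALER socket; here the bytes are simply returned.
    """
    per_rank = {rank: [] for rank in range(num_dp_ranks)}
    for offset, seq in enumerate(sequences):
        per_rank[(start_rank + offset) % num_dp_ranks].append(seq)
    return {rank: pickle.dumps(batch) for rank, batch in per_rank.items() if batch}


def receive(payload):
    """What the EngineCore side does after reading bytes off its socket."""
    return pickle.loads(payload)
```

For example, five requests spread over two ranks give rank 0 the requests at positions 0, 2, and 4 and rank 1 the requests at positions 1 and 3. Passing the next `start_rank` on each call keeps the rotation fair across calls.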

Process hierarchy:

CoreManager spawns one EngineCore process per DP rank using multiprocessing.Process.
Each EngineCore creates an AsyncIOProcManager, which in turn spawns one subprocess per TP rank.
Each ModelRunner subprocess initializes the distributed environment via AITER’s init_dist_env(), setting up NCCL communication across TP ranks.

Data-parallel variant (DPEngineCoreProc):

When data_parallel_size > 1, each EngineCore process is a DPEngineCoreProc that synchronizes with other DP ranks via torch.distributed.all_reduce on a Gloo process group. The busy_loop() override ensures all DP ranks stay in lockstep: if one rank has a prefill batch while another does not, the idle rank executes a dummy prefill (dummy_prefill_execution()) to keep NCCL collectives synchronized.
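
The lockstep decision can be sketched as follows. This is an illustration, not the DPEngineCoreProc implementation: `all_reduce_max` simulates the collective in plain Python (a real run would use torch.distributed.all_reduce on the Gloo group), the choice of a MAX reduction over a "has prefill" flag is an assumption, and `plan_dp_step` is a hypothetical helper that computes every rank's action in one place.

```python
def all_reduce_max(values):
    """Simulated all-reduce: every rank ends up holding the maximum value."""
    reduced = max(values)
    return [reduced] * len(values)


def plan_dp_step(has_prefill_per_rank):
    """Return the action each DP rank takes this step.

    A rank with real prefill work runs it. A rank without prefill work runs a
    dummy prefill if any other rank is prefilling, so that all ranks enter the
    same collectives. If no rank is prefilling, every rank decodes.
    """
    local_flags = [int(flag) for flag in has_prefill_per_rank]
    global_flags = all_reduce_max(local_flags)
    plan = []
    for local, any_rank_prefilling in zip(local_flags, global_flags):
        if local:
            plan.append("prefill")
        elif any_rank_prefilling:
            plan.append("dummy_prefill")
        else:
            plan.append("decode")
    return plan
```

The key property is that each rank decides from the reduced flag rather than from its local state alone, so no rank can skip a step that its peers are executing.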

6. Sequence Lifecycle

The Sequence class (in atom/model_engine/sequence.py) is the central data structure tracking a single request through the engine.

Key fields:

Field	Type	Purpose
`id`	`int`	Auto-incrementing unique identifier
`token_ids`	`list[int]`	Full token sequence (prompt + generated)
`num_prompt_tokens`	`int`	Length of the original prompt
`num_tokens`	`int` (property)	Total length including generated tokens
`block_table`	`list[int]`	KV cache block IDs allocated to this sequence
`status`	`SequenceStatus`	Current lifecycle state
`type`	`SequenceType`	Current execution type
`temperature`	`float`	Sampling temperature
`max_tokens`	`int`	Maximum completion length
`arrive_time`	`float`	Timestamp when request entered the system
`first_token_time`	`float`	Timestamp of first generated token (for TTFT)
`leave_time`	`float`	Timestamp when request finished (for TPOT)
`spec_token_ids`	`list[int]`	Speculative/draft token IDs for MTP
`stream_callback`	`Callable`	Optional per-token streaming callback
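
The three timestamps are enough to derive both latency metrics. The sketch below uses the common definitions, which are an assumption here: TTFT is the gap between arrival and the first token, and TPOT averages the remaining time over the tokens generated after the first. ATOM's exact formula may differ. `compute_latency_stats` is a hypothetical helper.

```python
def compute_latency_stats(arrive_time, first_token_time, leave_time, num_completion_tokens):
    """Derive TTFT and TPOT (both in seconds) from Sequence-style timestamps."""
    ttft = first_token_time - arrive_time
    if num_completion_tokens > 1:
        # The first token is covered by TTFT, so average over the rest.
        tpot = (leave_time - first_token_time) / (num_completion_tokens - 1)
    else:
        tpot = 0.0  # a single-token completion has no inter-token interval
    return {"ttft": ttft, "tpot": tpot}
```

Guarding the single-token case avoids a division by zero for requests that stop immediately, for example when the first sampled token is EOS.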

Status transitions:

WAITING ──(scheduled for prefill)──► RUNNING ──(stop condition met)──► FINISHED
   ▲                                    │
   └────────(preempted by scheduler)────┘

SequenceStatus.WAITING – queued in the scheduler’s waiting deque, awaiting block allocation.
SequenceStatus.RUNNING – actively being processed (prefill or decode).
SequenceStatus.FINISHED – stop condition met (EOS, stop token, stop sequence, or max_tokens). Blocks are deallocated.
SequenceStatus.EXIT_ENGINE – sentinel status used to signal engine shutdown.
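
The transitions and stop conditions above can be sketched as a small state machine. The enum member names come from this section, but their values, the `ALLOWED_TRANSITIONS` table, and the `transition` and `stop_reason` helpers are hypothetical. The order in which stop conditions are checked is also an assumption.

```python
from enum import Enum, auto


class SequenceStatus(Enum):
    WAITING = auto()
    RUNNING = auto()
    FINISHED = auto()
    EXIT_ENGINE = auto()  # shutdown sentinel, not part of the request lifecycle


# Edges taken from the status diagram: schedule, preempt, and finish.
ALLOWED_TRANSITIONS = {
    SequenceStatus.WAITING: {SequenceStatus.RUNNING},
    SequenceStatus.RUNNING: {SequenceStatus.WAITING, SequenceStatus.FINISHED},
    SequenceStatus.FINISHED: set(),
}


def transition(current, new):
    """Return the new status, or raise if the diagram has no such edge."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"Illegal transition {current.name} -> {new.name}")
    return new


def stop_reason(token_ids, num_prompt_tokens, max_tokens, eos_token_id,
                stop_token_ids=(), stop_sequences=()):
    """Return why a sequence should finish, or None if it should keep running."""
    completion = token_ids[num_prompt_tokens:]
    if not completion:
        return None
    last = completion[-1]
    if last == eos_token_id:
        return "eos"
    if last in stop_token_ids:
        return "stop_token"
    for stop in stop_sequences:
        # Match a multi-token stop sequence against the completion's suffix.
        if stop and completion[-len(stop):] == list(stop):
            return "stop_sequence"
    if len(completion) >= max_tokens:
        return "max_tokens"
    return None
```

A scheduler built on these helpers would call `stop_reason` after appending each sampled token and move the sequence to FINISHED whenever it returns a non-None reason.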

Execution types:

SequenceType.DUMMY – initial state before scheduling.
SequenceType.PREFILL – prompt processing phase (all prompt tokens in one batch).
SequenceType.DECODE – autoregressive token generation (one or more tokens per step with MTP).

Source Files

File	Description
`atom/model_engine/llm_engine.py`	`LLMEngine` user-facing API, `InputOutputProcessor` for tokenization/detokenization and TTFT/TPOT statistics
`atom/model_engine/engine_core.py`	`EngineCore` main execution loop, `DPEngineCoreProc` data-parallel variant, `EngineCoreRequestType` message protocol
`atom/model_engine/engine_core_mgr.py`	`CoreManager` ZMQ orchestration, process launching, round-robin DP dispatch
`atom/model_engine/model_runner.py`	`ModelRunner` per-GPU execution (model loading, CUDA graph capture, forward pass), `tokenIDProcessor` deferred output handling
`atom/model_engine/scheduler.py`	`Scheduler` prefill-first scheduling, `ScheduledBatch` batch descriptor, `ScheduledBatchOutput` forward results
`atom/model_engine/sequence.py`	`Sequence` request state, `SequenceStatus` and `SequenceType` enums
`atom/model_engine/block_manager.py`	`BlockManager` KV cache block allocation with optional prefix caching
`atom/model_engine/request.py`	`RequestOutput` dataclass for streaming callbacks
`atom/model_engine/async_proc.py`	`AsyncIOProcManager` and `AsyncIOProc` for spawning and managing ModelRunner subprocesses
`atom/utils/forward_context.py`	`ForwardContext`, `Context`, `DPMetadata`, `SpecDecodeMetadata`, `AttentionMetaData` dataclasses and global accessors
`atom/config.py`	`Config` master configuration, `ParallelConfig`, `CompilationConfig`, `LayerQuantConfig`, `QuantizationConfig`, `SpeculativeConfig`, `KVCacheTensor`