ATOM Architecture Guide
Quick Reference
| Class | Import | Purpose |
|---|---|---|
| `LLMEngine` | `from atom.model_engine.llm_engine import LLMEngine` | User-facing inference API |
| `InputOutputProcessor` | `from atom.model_engine.llm_engine import InputOutputProcessor` | Tokenize/detokenize, TTFT/TPOT stats |
| `CoreManager` | `from atom.model_engine.engine_core_mgr import CoreManager` | Multi-process orchestration via ZMQ |
| `EngineCore` | `from atom.model_engine.engine_core import EngineCore` | Per-process engine loop |
| `DPEngineCoreProc` | `from atom.model_engine.engine_core import DPEngineCoreProc` | Data-parallel engine core variant |
| `ModelRunner` | `from atom.model_engine.model_runner import ModelRunner` | Per-GPU model execution |
| `Scheduler` | `from atom.model_engine.scheduler import Scheduler` | Prefill-first request scheduling |
| `BlockManager` | `from atom.model_engine.block_manager import BlockManager` | KV cache block allocation |
| `Sequence` | `from atom.model_engine.sequence import Sequence` | Request state and token tracking |
| `ForwardContext` | `from atom.utils.forward_context import ForwardContext` | Global forward-pass metadata |
| `Config` | `from atom.config import Config` | Master configuration dataclass |
1. System Overview
ATOM (AiTer Optimized Model) is AMD’s lightweight LLM inference engine, inspired by vLLM’s architecture and built on the AITER kernel library for ROCm/HIP GPUs.
Key design principles:
Multi-process architecture – each engine core runs in its own process, with ZMQ-based IPC connecting the user-facing API to one or more GPU workers.
AITER-native execution – model forward passes use AITER’s optimized attention, MoE, sampling, and communication kernels rather than generic PyTorch operators.
CUDA graph acceleration – decode batches are captured into CUDA graphs for replay, eliminating per-step kernel launch overhead.
Prefill-first scheduling – the scheduler prioritizes prompt prefills before decode steps, following vLLM’s continuous batching strategy.
Speculative decoding – optional EAGLE/MTP (Multi-Token Prediction) draft models propose tokens that are verified via rejection sampling.
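The draft-then-verify idea can be illustrated with a greedy-verification sketch. This is a deliberate simplification, not ATOM's implementation: ATOM verifies draft tokens with rejection sampling over full distributions, while the sketch below accepts the longest prefix of draft tokens that matches the target model's greedy choices, then emits one corrected (or bonus) token from the target.

```python
# Illustrative draft-then-verify step (simplified; ATOM's RejectionSampler
# works on probability distributions, not greedy token IDs).

def verify_draft(draft_tokens, target_greedy):
    """Accept the longest matching prefix, then one token from the target.

    draft_tokens:  k tokens proposed by the draft model.
    target_greedy: k+1 greedy tokens from the target model, where
                   target_greedy[i] is the target's choice at position i
                   (the last entry is the "bonus" token).
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if tok == target_greedy[i]:
            accepted.append(tok)
        else:
            break
    # The target's token at the first mismatch (or the bonus token when
    # every draft was accepted) is always emitted, so each step produces
    # at least one token, matching plain autoregressive decoding.
    accepted.append(target_greedy[len(accepted)])
    return accepted
```

Because at least one target token is emitted per step, the worst case degrades to ordinary decoding, while matching drafts let one verification pass yield several tokens.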
2. Component Architecture
LLMEngine (user-facing API)
├── InputOutputProcessor (tokenize/detokenize, TTFT/TPOT stats)
├── CoreManager (multi-process orchestration via ZMQ)
│ └── EngineCore (one per DP rank, runs in its own process)
│ ├── ModelRunner (per-GPU execution via AsyncIOProcManager)
│ │ ├── Model (Qwen3, Llama, DeepSeek, Mixtral, etc.)
│ │ ├── Sampler / RejectionSampler
│ │ └── EagleProposer (optional MTP draft)
│ └── Scheduler
│ └── BlockManager (KV cache block management)
└── Config (master configuration)
Supported model architectures are registered in `support_model_arch_dict`, a module-level dict in `model_runner.py`, which maps each architecture key (covering the Qwen3, Llama, DeepSeek, and Mixtral families, among others) to its implementation class.
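The `BlockManager`'s role in the tree above can be sketched as a free-list allocator over fixed-size KV cache blocks. This is a minimal illustration under assumed names (`SimpleBlockManager`, `can_allocate`, `allocate`, `free` are hypothetical), not ATOM's actual API:

```python
# Minimal sketch of paged KV-cache block allocation (hypothetical API,
# not ATOM's BlockManager). Each sequence owns a block table: a list of
# fixed-size block IDs drawn from a shared free list.

class SimpleBlockManager:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))      # shared pool
        self.block_tables = {}                          # seq_id -> block IDs

    def can_allocate(self, num_tokens: int) -> bool:
        needed = -(-num_tokens // self.block_size)      # ceiling division
        return needed <= len(self.free_blocks)

    def allocate(self, seq_id: int, num_tokens: int):
        needed = -(-num_tokens // self.block_size)
        if needed > len(self.free_blocks):
            raise RuntimeError("out of KV cache blocks")
        blocks = [self.free_blocks.pop() for _ in range(needed)]
        self.block_tables[seq_id] = blocks
        return blocks

    def free(self, seq_id: int) -> None:
        # Return a finished sequence's blocks to the shared pool.
        self.free_blocks.extend(self.block_tables.pop(seq_id))
```

The scheduler consults `can_allocate`-style checks before admitting a prefill, which is what bounds how many sequences can run concurrently.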
3. Request Lifecycle
A request flows through the system in ten steps:
1. `LLMEngine.add_request()` / `generate()` – the user submits a list of prompts (strings or pre-tokenized token IDs) together with `SamplingParams`.
2. `InputOutputProcessor.preprocess()` – each prompt is tokenized via the HuggingFace tokenizer. A `Sequence` object is created to track the request's state, timing, and block allocation. `arrive_time` is recorded.
3. `CoreManager.add_request()` – the list of `Sequence` objects is serialized with `pickle` and sent over a ZMQ `ROUTER` socket. When multiple DP ranks are active, requests are distributed round-robin.
4. `EngineCore.process_input_sockets()` – an I/O thread on the `EngineCore` process receives the serialized data on a ZMQ `DEALER` socket, deserializes it, and places the sequences into the `input_queue`.
5. `EngineCore.busy_loop()` – the main execution loop pulls from `input_queue` via `pull_and_process_input_queue()`, feeds new sequences into the scheduler, and repeatedly calls `_process_engine_step()` until all work is done.
6. `Scheduler.schedule()` – implements prefill-first scheduling. Waiting sequences are scheduled for prefill if they fit within `max_num_seqs` and `max_num_batched_tokens` and the `BlockManager` can allocate blocks. If no prefills are pending, running sequences are batched for decode. The scheduler returns a `ScheduledBatch` and the corresponding sequence map.
7. `ModelRunner.forward()` – executes the three-phase forward pass:
   - `prepare_model()` – assembles input IDs (handling deferred output from previous steps), builds attention metadata, and gathers sampling temperatures.
   - `run_model()` – runs the model forward. Prefill and large batches run eagerly; decode batches replay captured CUDA graphs. Returns logits and hidden states.
   - `postprocess()` – samples tokens (or runs rejection sampling for speculative decoding), prepares deferred output via `tokenIDProcessor`, and optionally proposes draft tokens through `EagleProposer`.
8. `Scheduler.postprocess()` – appends sampled tokens to each `Sequence`, records `first_token_time`, checks stop conditions (EOS, stop token IDs, stop token sequences, `max_tokens`), and moves finished sequences out of the running queue. The `BlockManager` deallocates blocks for finished sequences.
9. Output via ZMQ – finished sequences are placed on the `output_queue`. A dedicated output thread serializes them and sends them over a ZMQ `PUSH` socket back to the `CoreManager`, which receives them on a `PULL` socket and places them in `outputs_queue`.
10. `InputOutputProcessor.postprocess()` – detokenizes completed sequences, computes TTFT (Time To First Token) and TPOT (Time Per Output Token), and returns structured output dictionaries.
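The prefill-first policy in step 6 can be sketched as a single scheduling step. This is a toy model, not ATOM's `Scheduler`: `schedule_step` and the dict-based sequences are hypothetical, and real scheduling also consults the `BlockManager` before admitting a prefill.

```python
# Toy prefill-first scheduling step (a sketch, not ATOM's Scheduler).
# Waiting sequences are admitted for prefill while they fit under
# max_num_seqs and max_num_batched_tokens; only when no prefill can be
# admitted is a decode batch formed from the running sequences.
from collections import deque

def schedule_step(waiting, running, max_num_seqs, max_num_batched_tokens):
    prefill, budget = [], max_num_batched_tokens
    while (waiting
           and len(running) + len(prefill) < max_num_seqs
           and waiting[0]["prompt_len"] <= budget):
        seq = waiting.popleft()
        budget -= seq["prompt_len"]
        prefill.append(seq)
    if prefill:
        running.extend(prefill)
        return "prefill", prefill
    # Decode step: one new token per running sequence.
    return "decode", list(running)
```

Note how a prompt that exceeds the remaining token budget blocks further admissions for that step, so oversized prompts wait until a step where they fit.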
4. Forward Context Pattern
ATOM uses a module-level global ForwardContext to pass metadata through CUDA graph boundaries without threading it as function parameters.
Core dataclasses (defined in atom/utils/forward_context.py):
- `ForwardContext` – top-level container holding:
  - `attn_metadata` (`AttentionMetaData`) – cumulative sequence lengths, block tables, slot mappings, and backend-specific metadata.
  - `context` (`Context`) – positions, prefill flag, batch size, graph batch size, draft flag.
  - `dp_metadata` (`DPMetadata`) – cross-DP-rank token counts and cumulative sums.
  - `spec_decode_metadata` (`SpecDecodeMetadata`) – draft token IDs, target/bonus logits indices.
  - `kv_cache_data` (`dict[str, KVCacheTensor]`) – per-layer KV cache tensor references.
- `Context` – lightweight struct: `positions`, `is_prefill`, `batch_size`, `graph_bs`, `is_draft`.
- `DPMetadata` – data parallel metadata with `num_tokens_across_dp()` (all-reduce), `max_tokens_across_dp`, and a `chunked_sizes()` context manager.
Global accessors:
| Function | Purpose |
|---|---|
|  | Set the global context before a forward pass |
| `get_forward_context()` | Retrieve the current context (used by attention backends) |
|  | Clear after the forward pass completes |
|  | Register KV cache tensors at initialization |
This pattern enables stateless dispatch: attention backends and model operators call get_forward_context() to access metadata without requiring it as a function parameter, which is critical for CUDA graph compatibility.
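A stripped-down version of this pattern, assuming a single global slot guarded by set/get/reset helpers, looks like the following. Only `get_forward_context()` is named in the text above; `ForwardContext`'s fields here are reduced to three, and `set_forward_context` / `reset_forward_context` are assumed names for the setter and cleaner:

```python
# Sketch of the module-level forward-context pattern (simplified; the
# real dataclasses in atom/utils/forward_context.py carry far more
# metadata). A single module-global slot is set before the forward pass,
# read by any operator via a free function, and cleared afterwards.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ForwardContext:
    positions: List[int]
    is_prefill: bool
    batch_size: int

_forward_context: Optional[ForwardContext] = None

def set_forward_context(ctx: ForwardContext) -> None:
    global _forward_context
    _forward_context = ctx

def get_forward_context() -> ForwardContext:
    # Called from inside attention backends: no metadata parameter is
    # threaded through the model's forward() signature, which keeps the
    # captured CUDA graph's input signature fixed across replays.
    assert _forward_context is not None, "no forward context set"
    return _forward_context

def reset_forward_context() -> None:
    global _forward_context
    _forward_context = None
```

The design trade-off is the usual one for globals: operators stay signature-stable (good for graph capture), at the cost of requiring strict set/reset discipline around each forward pass.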
5. Multi-Process Architecture
ATOM uses a multi-process design with ZMQ sockets for inter-process communication:
┌──────────────────────────────────┐
│ LLMEngine │
│ ┌────────────────────────────┐ │
│ │ CoreManager │ │
│ │ │ │
│ │ ROUTER ──────► DEALER │ │
│ │ (input) (per rank) │ │
│ │ │ │
│ │ PULL ◄─────── PUSH │ │
│ │ (output) (per rank) │ │
│ └────────────────────────────┘ │
└──────────────────────────────────┘
│ ▲
pickle │ │ pickle
▼ │
┌──────────────────────────────────────┐
│ EngineCore (Process) │
│ │
│ input_queue ──► busy_loop │
│ │ │
│ ┌─────────────────▼───────────────┐ │
│ │ AsyncIOProcManager │ │
│ │ ┌────────────────────────────┐ │ │
│ │ │ ModelRunner (TP rank 0) │ │ │
│ │ │ ModelRunner (TP rank 1) │ │ │
│ │ │ ... │ │ │
│ │ └────────────────────────────┘ │ │
│ └──────────────────────────────────┘ │
│ │
│ Scheduler + BlockManager │
└──────────────────────────────────────┘
Socket types:
| Socket | Type | Direction | Purpose |
|---|---|---|---|
| Input | `ROUTER` → `DEALER` | CoreManager → EngineCore | Send requests and control commands |
| Output | `PUSH` → `PULL` | EngineCore → CoreManager | Return finished sequences and stream outputs |
Process hierarchy:
1. `CoreManager` spawns one `EngineCore` process per DP rank using `multiprocessing.Process`.
2. Each `EngineCore` creates an `AsyncIOProcManager`, which in turn spawns one subprocess per TP rank.
3. Each `ModelRunner` subprocess initializes AITER's distributed environment via `init_dist_env()`, setting up NCCL communication across TP ranks.
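The fan-out/fan-in topology above can be sketched with standard-library pieces. This uses threads and queues as stand-ins for ATOM's processes and ZMQ sockets (the names `engine_core` and `run_demo` are hypothetical): one inbox per "rank" mirrors the ROUTER → DEALER direction, a single shared outbox mirrors the many-PUSH → one-PULL direction, and requests travel pickled, as in the real engine.

```python
# Simplified rank fan-out: threads + queues in place of processes + ZMQ.
# Illustrates the message topology only, not ATOM's actual transport.
import pickle
import queue
import threading

def engine_core(rank, inbox, outbox):
    while True:
        msg = pickle.loads(inbox.get())        # requests arrive serialized
        if msg == "shutdown":                  # control command
            return
        outbox.put((rank, msg.upper()))        # stand-in for generated output

def run_demo(num_ranks, prompts):
    inboxes = [queue.Queue() for _ in range(num_ranks)]
    outbox = queue.Queue()
    workers = [threading.Thread(target=engine_core, args=(r, inboxes[r], outbox))
               for r in range(num_ranks)]
    for w in workers:
        w.start()
    for i, p in enumerate(prompts):            # round-robin, like CoreManager
        inboxes[i % num_ranks].put(pickle.dumps(p))
    results = [outbox.get() for _ in prompts]
    for q in inboxes:
        q.put(pickle.dumps("shutdown"))
    for w in workers:
        w.join()
    return sorted(out for _, out in results)
```

The real design uses separate processes so each rank owns its GPUs and Python interpreter; the queue-per-rank input with a single merged output stream is the part this sketch preserves.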
Data-parallel variant (DPEngineCoreProc):
When data_parallel_size > 1, each EngineCore process is a DPEngineCoreProc that synchronizes with other DP ranks via torch.distributed.all_reduce on a Gloo process group. The busy_loop() override ensures all DP ranks stay in lockstep: if one rank has a prefill batch while another does not, the idle rank executes a dummy prefill (dummy_prefill_execution()) to keep NCCL collectives synchronized.
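The lockstep rule reduces to a small decision per rank, sketched below in pure Python (`plan_step` is a hypothetical helper; the real code all-reduces over a Gloo process group with `torch.distributed` rather than taking a list):

```python
# Sketch of the DP lockstep rule. Every rank learns the max prefill size
# across ranks (stand-in for all_reduce with MAX); a rank with no real
# prefill work must still run a dummy prefill so that NCCL collectives
# issued inside the model forward stay matched across ranks.

def plan_step(local_prefill_tokens):
    """local_prefill_tokens[r] = tokens rank r wants to prefill this step."""
    global_max = max(local_prefill_tokens)     # stand-in for all_reduce(MAX)
    actions = []
    for tokens in local_prefill_tokens:
        if tokens > 0:
            actions.append("prefill")
        elif global_max > 0:
            actions.append("dummy_prefill")    # keep collectives in lockstep
        else:
            actions.append("decode")
    return actions
```

Without the dummy execution, a rank skipping the prefill path would never reach the collective calls its peers are blocked on, deadlocking the group.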
6. Sequence Lifecycle
The Sequence class (in atom/model_engine/sequence.py) is the central data structure tracking a single request through the engine.
Key fields:
| Field | Type | Purpose |
|---|---|---|
|  |  | Auto-incrementing unique identifier |
|  |  | Full token sequence (prompt + generated) |
|  |  | Length of the original prompt |
|  |  | Total length including generated tokens |
|  |  | KV cache block IDs allocated to this sequence |
|  | `SequenceStatus` | Current lifecycle state |
|  | `SequenceType` | Current execution type |
|  |  | Sampling temperature |
| `max_tokens` |  | Maximum completion length |
| `arrive_time` |  | Timestamp when the request entered the system |
| `first_token_time` |  | Timestamp of the first generated token (for TTFT) |
|  |  | Timestamp when the request finished (for TPOT) |
|  |  | Speculative/draft token IDs for MTP |
|  |  | Optional per-token streaming callback |
Status transitions:
WAITING ──(scheduled for prefill)──► RUNNING ──(stop condition met)──► FINISHED
▲ │
└────────(preempted by scheduler)────┘
- `SequenceStatus.WAITING` – queued in the scheduler's waiting deque, awaiting block allocation.
- `SequenceStatus.RUNNING` – actively being processed (prefill or decode).
- `SequenceStatus.FINISHED` – stop condition met (EOS, stop token, stop sequence, or `max_tokens`). Blocks are deallocated.
- `SequenceStatus.EXIT_ENGINE` – sentinel status used to signal engine shutdown.
Execution types:
- `SequenceType.DUMMY` – initial state before scheduling.
- `SequenceType.PREFILL` – prompt processing phase (all prompt tokens in one batch).
- `SequenceType.DECODE` – autoregressive token generation (one or more tokens per step with MTP).
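The transition diagram above can be encoded as a small status machine. This toy omits `EXIT_ENGINE` and everything else the real `Sequence` carries (tokens, timing, blocks); `ToySequence` and `_ALLOWED` are illustrative names only:

```python
# Toy status machine mirroring the WAITING/RUNNING/FINISHED transitions
# (a sketch; the real Sequence class also tracks tokens, timing, and
# block allocation, plus the EXIT_ENGINE shutdown sentinel).
from enum import Enum, auto

class SequenceStatus(Enum):
    WAITING = auto()
    RUNNING = auto()
    FINISHED = auto()

_ALLOWED = {
    (SequenceStatus.WAITING, SequenceStatus.RUNNING),   # scheduled for prefill
    (SequenceStatus.RUNNING, SequenceStatus.WAITING),   # preempted by scheduler
    (SequenceStatus.RUNNING, SequenceStatus.FINISHED),  # stop condition met
}

class ToySequence:
    def __init__(self):
        self.status = SequenceStatus.WAITING

    def transition(self, new_status):
        if (self.status, new_status) not in _ALLOWED:
            raise ValueError(f"illegal transition {self.status} -> {new_status}")
        self.status = new_status
```

Encoding the legal edges explicitly makes preemption (RUNNING back to WAITING) an ordinary transition rather than a special case, and rejects shortcuts such as WAITING straight to FINISHED.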