ATOM Architecture Guide

Quick Reference

Class

Import

Purpose

LLMEngine

from atom.model_engine.llm_engine import LLMEngine

User-facing inference API

InputOutputProcessor

from atom.model_engine.llm_engine import InputOutputProcessor

Tokenize/detokenize, TTFT/TPOT stats

CoreManager

from atom.model_engine.engine_core_mgr import CoreManager

Multi-process orchestration via ZMQ

EngineCore

from atom.model_engine.engine_core import EngineCore

Per-process engine loop

DPEngineCoreProc

from atom.model_engine.engine_core import DPEngineCoreProc

Data-parallel engine core variant

ModelRunner

from atom.model_engine.model_runner import ModelRunner

Per-GPU model execution

Scheduler

from atom.model_engine.scheduler import Scheduler

Prefill-first request scheduling

BlockManager

from atom.model_engine.block_manager import BlockManager

KV cache block allocation

Sequence

from atom.model_engine.sequence import Sequence

Request state and token tracking

ForwardContext

from atom.utils.forward_context import ForwardContext

Global forward pass metadata

Config

from atom.config import Config

Master configuration dataclass


1. System Overview

ATOM (AiTer Optimized Model) is AMD’s lightweight LLM inference engine, inspired by vLLM’s architecture and built on the AITER kernel library for ROCm/HIP GPUs.

Key design principles:

  • Multi-process architecture – each engine core runs in its own process, with ZMQ-based IPC connecting the user-facing API to one or more GPU workers.

  • AITER-native execution – model forward passes use AITER’s optimized attention, MoE, sampling, and communication kernels rather than generic PyTorch operators.

  • CUDA graph acceleration – decode batches are captured into CUDA graphs for replay, eliminating per-step kernel launch overhead.

  • Prefill-first scheduling – the scheduler prioritizes prompt prefills before decode steps, following vLLM’s continuous batching strategy.

  • Speculative decoding – optional EAGLE/MTP (Multi-Token Prediction) draft models propose tokens that are verified via rejection sampling.


2. Component Architecture

LLMEngine (user-facing API)
├── InputOutputProcessor (tokenize/detokenize, TTFT/TPOT stats)
├── CoreManager (multi-process orchestration via ZMQ)
│   └── EngineCore (one per DP rank, runs in its own process)
│       ├── ModelRunner (per-GPU execution via AsyncIOProcManager)
│       │   ├── Model (Qwen3, Llama, DeepSeek, Mixtral, etc.)
│       │   ├── Sampler / RejectionSampler
│       │   └── EagleProposer (optional MTP draft)
│       └── Scheduler
│           └── BlockManager (KV cache block management)
└── Config (master configuration)

Supported model architectures (registered in support_model_arch_dict, a module-level dict in model_runner.py):

Architecture key

Implementation

Qwen3ForCausalLM

atom.models.qwen3.Qwen3ForCausalLM

Qwen3MoeForCausalLM

atom.models.qwen3_moe.Qwen3MoeForCausalLM

LlamaForCausalLM

atom.models.llama.LlamaForCausalLM

MixtralForCausalLM

atom.models.mixtral.MixtralForCausalLM

DeepseekV3ForCausalLM

atom.models.deepseek_v2.DeepseekV2ForCausalLM

DeepseekV32ForCausalLM

atom.models.deepseek_v2.DeepseekV2ForCausalLM

GptOssForCausalLM

atom.models.gpt_oss.GptOssForCausalLM

Glm4MoeForCausalLM

atom.models.glm4_moe.Glm4MoeForCausalLM


3. Request Lifecycle

A request flows through the system in ten steps:

  1. LLMEngine.add_request() / generate() – the user submits a list of prompts (strings or pre-tokenized token IDs) together with SamplingParams.

  2. InputOutputProcessor.preprocess() – each prompt is tokenized via the HuggingFace tokenizer. A Sequence object is created to track the request’s state, timing, and block allocation. arrive_time is recorded.

  3. CoreManager.add_request() – the list of Sequence objects is serialized with pickle and sent over a ZMQ ROUTER socket. When multiple DP ranks are active, requests are distributed round-robin.

  4. EngineCore.process_input_sockets() – an I/O thread on the EngineCore process receives the serialized data on a ZMQ DEALER socket, deserializes it, and places the sequences into the input_queue.

  5. EngineCore.busy_loop() – the main execution loop pulls from input_queue via pull_and_process_input_queue(), feeds new sequences into the scheduler, and repeatedly calls _process_engine_step() until all work is done.

  6. Scheduler.schedule() – implements prefill-first scheduling. Waiting sequences are scheduled for prefill if they fit within max_num_seqs and max_num_batched_tokens and the BlockManager can allocate blocks. If no prefills are pending, running sequences are batched for decode. The scheduler returns a ScheduledBatch and the corresponding sequence map.

  7. ModelRunner.forward() – executes the three-phase forward pass:

    • prepare_model() – assembles input IDs (handling deferred output from previous steps), builds attention metadata, and gathers sampling temperatures.

    • run_model() – runs the model forward. Prefill and large batches run eagerly; decode batches replay captured CUDA graphs. Returns logits and hidden states.

    • postprocess() – samples tokens (or runs rejection sampling for speculative decoding), prepares deferred output via tokenIDProcessor, and optionally proposes draft tokens through EagleProposer.

  8. Scheduler.postprocess() – appends sampled tokens to each Sequence, records first_token_time, checks stop conditions (EOS, stop token IDs, stop token sequences, max_tokens), and moves finished sequences out of the running queue. The BlockManager deallocates blocks for finished sequences.

  9. Output via ZMQ – finished sequences are placed on the output_queue. A dedicated output thread serializes them and sends them over a ZMQ PUSH socket back to the CoreManager, which receives them on a PULL socket and places them in outputs_queue.

  10. InputOutputProcessor.postprocess() – detokenizes completed sequences, computes TTFT (Time To First Token) and TPOT (Time Per Output Token), and returns structured output dictionaries.


4. Forward Context Pattern

ATOM uses a module-level global ForwardContext to pass metadata through CUDA graph boundaries without threading it as function parameters.

Core dataclasses (defined in atom/utils/forward_context.py):

  • ForwardContext – top-level container holding:

    • attn_metadata (AttentionMetaData) – cumulative sequence lengths, block tables, slot mappings, and backend-specific metadata.

    • context (Context) – positions, prefill flag, batch size, graph batch size, draft flag.

    • dp_metadata (DPMetadata) – cross-DP-rank token counts and cumulative sums.

    • spec_decode_metadata (SpecDecodeMetadata) – draft token IDs, target/bonus logits indices.

    • kv_cache_data (dict[str, KVCacheTensor]) – per-layer KV cache tensor references.

  • Context – lightweight struct: positions, is_prefill, batch_size, graph_bs, is_draft.

  • DPMetadata – data parallel metadata with num_tokens_across_dp() (all-reduce), max_tokens_across_dp, and chunked_sizes() context manager.

Global accessors:

Function

Purpose

set_forward_context(attn_metadata, atom_config, context, ...)

Set the global context before a forward pass

get_forward_context()

Retrieve the current context (used by attention backends)

reset_forward_context()

Clear after forward pass completes

set_kv_cache_data(kv_cache_data)

Register KV cache tensors at initialization

This pattern enables stateless dispatch: attention backends and model operators call get_forward_context() to access metadata without requiring it as a function parameter, which is critical for CUDA graph compatibility.


5. Multi-Process Architecture

ATOM uses a multi-process design with ZMQ sockets for inter-process communication:

                        ┌──────────────────────────────────┐
                        │          LLMEngine               │
                        │  ┌────────────────────────────┐  │
                        │  │    CoreManager              │  │
                        │  │                             │  │
                        │  │  ROUTER ──────► DEALER      │  │
                        │  │  (input)        (per rank)  │  │
                        │  │                             │  │
                        │  │  PULL ◄─────── PUSH         │  │
                        │  │  (output)      (per rank)   │  │
                        │  └────────────────────────────┘  │
                        └──────────────────────────────────┘
                              │                    ▲
                     pickle   │                    │  pickle
                              ▼                    │
               ┌──────────────────────────────────────┐
               │         EngineCore (Process)          │
               │                                       │
               │  input_queue ──► busy_loop             │
               │                    │                   │
               │  ┌─────────────────▼───────────────┐  │
               │  │  AsyncIOProcManager              │  │
               │  │  ┌────────────────────────────┐  │  │
               │  │  │ ModelRunner (TP rank 0)     │  │  │
               │  │  │ ModelRunner (TP rank 1)     │  │  │
               │  │  │ ...                         │  │  │
               │  │  └────────────────────────────┘  │  │
               │  └──────────────────────────────────┘  │
               │                                       │
               │  Scheduler + BlockManager             │
               └──────────────────────────────────────┘

Socket types:

Socket

Type

Direction

Purpose

Input

ROUTER (CoreManager) / DEALER (EngineCore)

CoreManager -> EngineCore

Send requests and control commands

Output

PUSH (EngineCore) / PULL (CoreManager)

EngineCore -> CoreManager

Return finished sequences and stream outputs

Process hierarchy:

  • CoreManager spawns one EngineCore process per DP rank using multiprocessing.Process.

  • Each EngineCore creates an AsyncIOProcManager, which in turn spawns one subprocess per TP rank.

  • Each ModelRunner subprocess initializes AITER’s distributed environment via init_dist_env() from AITER, setting up NCCL communication across TP ranks.

Data-parallel variant (DPEngineCoreProc):

When data_parallel_size > 1, each EngineCore process is a DPEngineCoreProc that synchronizes with other DP ranks via torch.distributed.all_reduce on a Gloo process group. The busy_loop() override ensures all DP ranks stay in lockstep: if one rank has a prefill batch while another does not, the idle rank executes a dummy prefill (dummy_prefill_execution()) to keep NCCL collectives synchronized.


6. Sequence Lifecycle

The Sequence class (in atom/model_engine/sequence.py) is the central data structure tracking a single request through the engine.

Key fields:

Field

Type

Purpose

id

int

Auto-incrementing unique identifier

token_ids

list[int]

Full token sequence (prompt + generated)

num_prompt_tokens

int

Length of the original prompt

num_tokens

int (property)

Total length including generated tokens

block_table

list[int]

KV cache block IDs allocated to this sequence

status

SequenceStatus

Current lifecycle state

type

SequenceType

Current execution type

temperature

float

Sampling temperature

max_tokens

int

Maximum completion length

arrive_time

float

Timestamp when request entered the system

first_token_time

float

Timestamp of first generated token (for TTFT)

leave_time

float

Timestamp when request finished (for TPOT)

spec_token_ids

list[int]

Speculative/draft token IDs for MTP

stream_callback

Callable

Optional per-token streaming callback

Status transitions:

WAITING ──(scheduled for prefill)──► RUNNING ──(stop condition met)──► FINISHED
   ▲                                    │
   └────────(preempted by scheduler)────┘
  • SequenceStatus.WAITING – queued in the scheduler’s waiting deque, awaiting block allocation.

  • SequenceStatus.RUNNING – actively being processed (prefill or decode).

  • SequenceStatus.FINISHED – stop condition met (EOS, stop token, stop sequence, or max_tokens). Blocks are deallocated.

  • SequenceStatus.EXIT_ENGINE – sentinel status used to signal engine shutdown.

Execution types:

  • SequenceType.DUMMY – initial state before scheduling.

  • SequenceType.PREFILL – prompt processing phase (all prompt tokens in one batch).

  • SequenceType.DECODE – autoregressive token generation (one or more tokens per step with MTP).


Source Files

File

Description

atom/model_engine/llm_engine.py

LLMEngine user-facing API, InputOutputProcessor for tokenization/detokenization and TTFT/TPOT statistics

atom/model_engine/engine_core.py

EngineCore main execution loop, DPEngineCoreProc data-parallel variant, EngineCoreRequestType message protocol

atom/model_engine/engine_core_mgr.py

CoreManager ZMQ orchestration, process launching, round-robin DP dispatch

atom/model_engine/model_runner.py

ModelRunner per-GPU execution (model loading, CUDA graph capture, forward pass), tokenIDProcessor deferred output handling

atom/model_engine/scheduler.py

Scheduler prefill-first scheduling, ScheduledBatch batch descriptor, ScheduledBatchOutput forward results

atom/model_engine/sequence.py

Sequence request state, SequenceStatus and SequenceType enums

atom/model_engine/block_manager.py

BlockManager KV cache block allocation with optional prefix caching

atom/model_engine/request.py

RequestOutput dataclass for streaming callbacks

atom/model_engine/async_proc.py

AsyncIOProcManager and AsyncIOProc for spawning and managing ModelRunner subprocesses

atom/utils/forward_context.py

ForwardContext, Context, DPMetadata, SpecDecodeMetadata, AttentionMetaData dataclasses and global accessors

atom/config.py

Config master configuration, ParallelConfig, CompilationConfig, LayerQuantConfig, QuantizationConfig, SpeculativeConfig, KVCacheTensor