# ATOM Architecture Guide

> **Quick Reference**
>
> | Class | Import | Purpose |
> |-------|--------|---------|
> | `LLMEngine` | `from atom.model_engine.llm_engine import LLMEngine` | User-facing inference API |
> | `InputOutputProcessor` | `from atom.model_engine.llm_engine import InputOutputProcessor` | Tokenize/detokenize, TTFT/TPOT stats |
> | `CoreManager` | `from atom.model_engine.engine_core_mgr import CoreManager` | Multi-process orchestration via ZMQ |
> | `EngineCore` | `from atom.model_engine.engine_core import EngineCore` | Per-process engine loop |
> | `DPEngineCoreProc` | `from atom.model_engine.engine_core import DPEngineCoreProc` | Data-parallel engine core variant |
> | `ModelRunner` | `from atom.model_engine.model_runner import ModelRunner` | Per-GPU model execution |
> | `Scheduler` | `from atom.model_engine.scheduler import Scheduler` | Prefill-first request scheduling |
> | `BlockManager` | `from atom.model_engine.block_manager import BlockManager` | KV cache block allocation |
> | `Sequence` | `from atom.model_engine.sequence import Sequence` | Request state and token tracking |
> | `ForwardContext` | `from atom.utils.forward_context import ForwardContext` | Global forward pass metadata |
> | `Config` | `from atom.config import Config` | Master configuration dataclass |

---

## 1. System Overview

ATOM (AiTer Optimized Model) is AMD's lightweight LLM inference engine, inspired by vLLM's architecture and built on the [AITER](https://github.com/ROCm/aiter) kernel library for ROCm/HIP GPUs.

Key design principles:

- **Multi-process architecture** -- each engine core runs in its own process, with ZMQ-based IPC connecting the user-facing API to one or more GPU workers.
- **AITER-native execution** -- model forward passes use AITER's optimized attention, MoE, sampling, and communication kernels rather than generic PyTorch operators.
- **CUDA graph acceleration** -- decode batches are captured into CUDA graphs for replay, eliminating per-step kernel launch overhead.
- **Prefill-first scheduling** -- the scheduler prioritizes prompt prefills before decode steps, following vLLM's continuous batching strategy.
- **Speculative decoding** -- optional EAGLE/MTP (Multi-Token Prediction) draft models propose tokens that are verified via rejection sampling.

---

## 2. Component Architecture

```
LLMEngine (user-facing API)
├── InputOutputProcessor (tokenize/detokenize, TTFT/TPOT stats)
├── CoreManager (multi-process orchestration via ZMQ)
│   └── EngineCore (one per DP rank, runs in its own process)
│       ├── ModelRunner (per-GPU execution via AsyncIOProcManager)
│       │   ├── Model (Qwen3, Llama, DeepSeek, Mixtral, etc.)
│       │   ├── Sampler / RejectionSampler
│       │   └── EagleProposer (optional MTP draft)
│       └── Scheduler
│           └── BlockManager (KV cache block management)
└── Config (master configuration)
```

**Supported model architectures** (registered in `support_model_arch_dict`, a module-level dict in `model_runner.py`):

| Architecture key | Implementation |
|---|---|
| `Qwen3ForCausalLM` | `atom.models.qwen3.Qwen3ForCausalLM` |
| `Qwen3MoeForCausalLM` | `atom.models.qwen3_moe.Qwen3MoeForCausalLM` |
| `LlamaForCausalLM` | `atom.models.llama.LlamaForCausalLM` |
| `MixtralForCausalLM` | `atom.models.mixtral.MixtralForCausalLM` |
| `DeepseekV3ForCausalLM` | `atom.models.deepseek_v2.DeepseekV2ForCausalLM` |
| `DeepseekV32ForCausalLM` | `atom.models.deepseek_v2.DeepseekV2ForCausalLM` |
| `GptOssForCausalLM` | `atom.models.gpt_oss.GptOssForCausalLM` |
| `Glm4MoeForCausalLM` | `atom.models.glm4_moe.Glm4MoeForCausalLM` |

---

## 3. Request Lifecycle

A request flows through the system in ten steps:

1. **`LLMEngine.add_request()` / `generate()`** -- the user submits a list of prompts (strings or pre-tokenized token IDs) together with `SamplingParams`.

2. **`InputOutputProcessor.preprocess()`** -- each prompt is tokenized via the HuggingFace tokenizer. A `Sequence` object is created to track the request's state, timing, and block allocation. `arrive_time` is recorded.

3. **`CoreManager.add_request()`** -- the list of `Sequence` objects is serialized with `pickle` and sent over a ZMQ `ROUTER` socket. When multiple DP ranks are active, requests are distributed round-robin.

4. **`EngineCore.process_input_sockets()`** -- an I/O thread on the `EngineCore` process receives the serialized data on a ZMQ `DEALER` socket, deserializes it, and places the sequences into the `input_queue`.

5. **`EngineCore.busy_loop()`** -- the main execution loop pulls from `input_queue` via `pull_and_process_input_queue()`, feeds new sequences into the scheduler, and repeatedly calls `_process_engine_step()` until all work is done.

6. **`Scheduler.schedule()`** -- implements prefill-first scheduling. Waiting sequences are scheduled for prefill if they fit within `max_num_seqs` and `max_num_batched_tokens` and the `BlockManager` can allocate blocks. If no prefills are pending, running sequences are batched for decode. The scheduler returns a `ScheduledBatch` and the corresponding sequence map.

7. **`ModelRunner.forward()`** -- executes the three-phase forward pass:
   - `prepare_model()` -- assembles input IDs (handling deferred output from previous steps), builds attention metadata, and gathers sampling temperatures.
   - `run_model()` -- runs the model forward. Prefill and large batches run eagerly; decode batches replay captured CUDA graphs. Returns logits and hidden states.
   - `postprocess()` -- samples tokens (or runs rejection sampling for speculative decoding), prepares deferred output via `tokenIDProcessor`, and optionally proposes draft tokens through `EagleProposer`.

8. **`Scheduler.postprocess()`** -- appends sampled tokens to each `Sequence`, records `first_token_time`, checks stop conditions (EOS, stop token IDs, stop token sequences, `max_tokens`), and moves finished sequences out of the running queue. The `BlockManager` deallocates blocks for finished sequences.

9. **Output via ZMQ** -- finished sequences are placed on the `output_queue`. A dedicated output thread serializes them and sends them over a ZMQ `PUSH` socket back to the `CoreManager`, which receives them on a `PULL` socket and places them in `outputs_queue`.

10. **`InputOutputProcessor.postprocess()`** -- detokenizes completed sequences, computes TTFT (Time To First Token) and TPOT (Time Per Output Token), and returns structured output dictionaries.

---

## 4. Forward Context Pattern

ATOM uses a module-level global `ForwardContext` to pass metadata through CUDA graph boundaries without threading it as function parameters.

**Core dataclasses** (defined in `atom/utils/forward_context.py`):

- **`ForwardContext`** -- top-level container holding:
  - `attn_metadata` (`AttentionMetaData`) -- cumulative sequence lengths, block tables, slot mappings, and backend-specific metadata.
  - `context` (`Context`) -- positions, prefill flag, batch size, graph batch size, draft flag.
  - `dp_metadata` (`DPMetadata`) -- cross-DP-rank token counts and cumulative sums.
  - `spec_decode_metadata` (`SpecDecodeMetadata`) -- draft token IDs, target/bonus logits indices.
  - `kv_cache_data` (`dict[str, KVCacheTensor]`) -- per-layer KV cache tensor references.

- **`Context`** -- lightweight struct: `positions`, `is_prefill`, `batch_size`, `graph_bs`, `is_draft`.

- **`DPMetadata`** -- data parallel metadata with `num_tokens_across_dp()` (all-reduce), `max_tokens_across_dp`, and `chunked_sizes()` context manager.

**Global accessors:**

| Function | Purpose |
|---|---|
| `set_forward_context(attn_metadata, atom_config, context, ...)` | Set the global context before a forward pass |
| `get_forward_context()` | Retrieve the current context (used by attention backends) |
| `reset_forward_context()` | Clear after forward pass completes |
| `set_kv_cache_data(kv_cache_data)` | Register KV cache tensors at initialization |

This pattern enables stateless dispatch: attention backends and model operators call `get_forward_context()` to access metadata without requiring it as a function parameter, which is critical for CUDA graph compatibility.

---

## 5. Multi-Process Architecture

ATOM uses a multi-process design with ZMQ sockets for inter-process communication:

```
                        ┌──────────────────────────────────┐
                        │          LLMEngine               │
                        │  ┌────────────────────────────┐  │
                        │  │    CoreManager              │  │
                        │  │                             │  │
                        │  │  ROUTER ──────► DEALER      │  │
                        │  │  (input)        (per rank)  │  │
                        │  │                             │  │
                        │  │  PULL ◄─────── PUSH         │  │
                        │  │  (output)      (per rank)   │  │
                        │  └────────────────────────────┘  │
                        └──────────────────────────────────┘
                              │                    ▲
                     pickle   │                    │  pickle
                              ▼                    │
               ┌──────────────────────────────────────┐
               │         EngineCore (Process)          │
               │                                       │
               │  input_queue ──► busy_loop             │
               │                    │                   │
               │  ┌─────────────────▼───────────────┐  │
               │  │  AsyncIOProcManager              │  │
               │  │  ┌────────────────────────────┐  │  │
               │  │  │ ModelRunner (TP rank 0)     │  │  │
               │  │  │ ModelRunner (TP rank 1)     │  │  │
               │  │  │ ...                         │  │  │
               │  │  └────────────────────────────┘  │  │
               │  └──────────────────────────────────┘  │
               │                                       │
               │  Scheduler + BlockManager             │
               └──────────────────────────────────────┘
```

**Socket types:**

| Socket | Type | Direction | Purpose |
|---|---|---|---|
| Input | `ROUTER` (CoreManager) / `DEALER` (EngineCore) | CoreManager -> EngineCore | Send requests and control commands |
| Output | `PUSH` (EngineCore) / `PULL` (CoreManager) | EngineCore -> CoreManager | Return finished sequences and stream outputs |

**Process hierarchy:**

- **`CoreManager`** spawns one `EngineCore` process per DP rank using `multiprocessing.Process`.
- Each **`EngineCore`** creates an `AsyncIOProcManager`, which in turn spawns one subprocess per TP rank.
- Each **`ModelRunner`** subprocess initializes AITER's distributed environment via `init_dist_env()` from AITER, setting up NCCL communication across TP ranks.

**Data-parallel variant** (`DPEngineCoreProc`):

When `data_parallel_size > 1`, each EngineCore process is a `DPEngineCoreProc` that synchronizes with other DP ranks via `torch.distributed.all_reduce` on a Gloo process group. The `busy_loop()` override ensures all DP ranks stay in lockstep: if one rank has a prefill batch while another does not, the idle rank executes a dummy prefill (`dummy_prefill_execution()`) to keep NCCL collectives synchronized.

---

## 6. Sequence Lifecycle

The `Sequence` class (in `atom/model_engine/sequence.py`) is the central data structure tracking a single request through the engine.

**Key fields:**

| Field | Type | Purpose |
|---|---|---|
| `id` | `int` | Auto-incrementing unique identifier |
| `token_ids` | `list[int]` | Full token sequence (prompt + generated) |
| `num_prompt_tokens` | `int` | Length of the original prompt |
| `num_tokens` | `int` (property) | Total length including generated tokens |
| `block_table` | `list[int]` | KV cache block IDs allocated to this sequence |
| `status` | `SequenceStatus` | Current lifecycle state |
| `type` | `SequenceType` | Current execution type |
| `temperature` | `float` | Sampling temperature |
| `max_tokens` | `int` | Maximum completion length |
| `arrive_time` | `float` | Timestamp when request entered the system |
| `first_token_time` | `float` | Timestamp of first generated token (for TTFT) |
| `leave_time` | `float` | Timestamp when request finished (for TPOT) |
| `spec_token_ids` | `list[int]` | Speculative/draft token IDs for MTP |
| `stream_callback` | `Callable` | Optional per-token streaming callback |

**Status transitions:**

```
WAITING ──(scheduled for prefill)──► RUNNING ──(stop condition met)──► FINISHED
   ▲                                    │
   └────────(preempted by scheduler)────┘
```

- `SequenceStatus.WAITING` -- queued in the scheduler's waiting deque, awaiting block allocation.
- `SequenceStatus.RUNNING` -- actively being processed (prefill or decode).
- `SequenceStatus.FINISHED` -- stop condition met (EOS, stop token, stop sequence, or `max_tokens`). Blocks are deallocated.
- `SequenceStatus.EXIT_ENGINE` -- sentinel status used to signal engine shutdown.

**Execution types:**

- `SequenceType.DUMMY` -- initial state before scheduling.
- `SequenceType.PREFILL` -- prompt processing phase (all prompt tokens in one batch).
- `SequenceType.DECODE` -- autoregressive token generation (one or more tokens per step with MTP).

---

## Source Files

| File | Description |
|------|-------------|
| `atom/model_engine/llm_engine.py` | `LLMEngine` user-facing API, `InputOutputProcessor` for tokenization/detokenization and TTFT/TPOT statistics |
| `atom/model_engine/engine_core.py` | `EngineCore` main execution loop, `DPEngineCoreProc` data-parallel variant, `EngineCoreRequestType` message protocol |
| `atom/model_engine/engine_core_mgr.py` | `CoreManager` ZMQ orchestration, process launching, round-robin DP dispatch |
| `atom/model_engine/model_runner.py` | `ModelRunner` per-GPU execution (model loading, CUDA graph capture, forward pass), `tokenIDProcessor` deferred output handling |
| `atom/model_engine/scheduler.py` | `Scheduler` prefill-first scheduling, `ScheduledBatch` batch descriptor, `ScheduledBatchOutput` forward results |
| `atom/model_engine/sequence.py` | `Sequence` request state, `SequenceStatus` and `SequenceType` enums |
| `atom/model_engine/block_manager.py` | `BlockManager` KV cache block allocation with optional prefix caching |
| `atom/model_engine/request.py` | `RequestOutput` dataclass for streaming callbacks |
| `atom/model_engine/async_proc.py` | `AsyncIOProcManager` and `AsyncIOProc` for spawning and managing ModelRunner subprocesses |
| `atom/utils/forward_context.py` | `ForwardContext`, `Context`, `DPMetadata`, `SpecDecodeMetadata`, `AttentionMetaData` dataclasses and global accessors |
| `atom/config.py` | `Config` master configuration, `ParallelConfig`, `CompilationConfig`, `LayerQuantConfig`, `QuantizationConfig`, `SpeculativeConfig`, `KVCacheTensor` |