ATOM Serving & Benchmarking Guide

ATOM (AITER Optimized Model) is AMD’s lightweight LLM inference engine built on AITER kernels for ROCm/HIP GPUs. This guide covers the OpenAI-compatible serving API, programmatic engine usage, benchmarking tools, profiling, and speculative decoding.


Quick Reference

# Start the OpenAI-compatible server
python -m atom.entrypoints.openai_server --model <model_name_or_path> --kv_cache_dtype fp8

# Run the online serving benchmark
python -m atom.benchmarks.benchmark_serving \
    --backend vllm --model <model_name_or_path> \
    --base-url http://localhost:8000 \
    --dataset-name random --random-input-len 1024 --random-output-len 128 \
    --num-prompts 1000 --request-rate inf --ignore-eos

# Simple inference example
python -m atom.examples.simple_inference --model <model_name_or_path> --kv_cache_dtype fp8

# Offline profiling
python -m atom.examples.profile_offline --model <model_name_or_path> --kv_cache_dtype fp8

# Accuracy validation with lm-eval
lm_eval --model local-completions \
    --model_args model=<model>,base_url=http://localhost:8000/v1/completions,num_concurrent=64,max_retries=3,tokenized_requests=False \
    --tasks gsm8k --num_fewshot 5

1. OpenAI-Compatible Server

The server is implemented in atom/entrypoints/openai_server.py using FastAPI and Uvicorn. It exposes OpenAI-compatible HTTP endpoints so that existing clients (curl, OpenAI SDK, lm-eval) work without modification.

1.1 Endpoints

| Method | Path | Description |
|--------|------|-------------|
| POST | /v1/chat/completions | Chat completion (ChatCompletionRequest -> ChatCompletionResponse) |
| POST | /v1/completions | Text completion (CompletionRequest -> CompletionResponse) |
| GET | /v1/models | List available models |
| GET | /health | Health check (returns {"status": "ok"}) |
| POST | /start_profile | Start torch profiler on the engine |
| POST | /stop_profile | Stop torch profiler and flush traces |
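
For example, a quick liveness probe against a local server:

curl http://localhost:8000/health
# -> {"status": "ok"}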

1.2 Request Models

ChatCompletionRequest fields:

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| model | Optional[str] | None | Model name (validated against the loaded model) |
| messages | Optional[List[ChatMessage]] | None | List of chat messages (role, content) |
| prompt | Optional[List[ChatMessage]] | None | Alias for messages |
| temperature | Optional[float] | 1.0 | Sampling temperature |
| top_p | Optional[float] | 1.0 | Nucleus sampling threshold |
| max_tokens | Optional[int] | 256 | Maximum tokens to generate |
| stop | Optional[List[str]] | None | Stop strings |
| ignore_eos | Optional[bool] | False | Ignore end-of-sequence token |
| stream | Optional[bool] | False | Enable server-sent events streaming |
| seed | Optional[int] | None | Random seed |

CompletionRequest fields:

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| model | Optional[str] | None | Model name |
| prompt | str | (required) | Text prompt |
| temperature | Optional[float] | 1.0 | Sampling temperature |
| top_p | Optional[float] | 1.0 | Nucleus sampling threshold |
| max_tokens | Optional[int] | 256 | Maximum tokens to generate |
| stop | Optional[List[str]] | None | Stop strings |
| ignore_eos | Optional[bool] | False | Ignore end-of-sequence token |
| stream | Optional[bool] | False | Enable SSE streaming |

1.3 Response Models

Both ChatCompletionResponse and CompletionResponse include:

  • id – unique request identifier (e.g. chatcmpl-<uuid> or cmpl-<uuid>)

  • object – "chat.completion" or "text_completion"

  • created – Unix timestamp

  • model – model name

  • choices – list of generated completions

  • usage – token counts (prompt_tokens, completion_tokens, total_tokens) plus ttft_s, tpot_s, and latency_s timing fields

Streaming responses use the SSE (Server-Sent Events) protocol with data: [DONE]\n\n as the termination signal.
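
A minimal Python consumer for the streaming completions endpoint (a sketch: it assumes each SSE chunk carries its text delta in choices[0]["text"], the standard OpenAI layout):

import json
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"prompt": "The capital of France is", "max_tokens": 64, "stream": True},
    stream=True,
)
for line in resp.iter_lines():
    if not line:
        continue  # SSE events are separated by blank lines
    payload = line.decode().removeprefix("data: ")
    if payload == "[DONE]":  # termination signal documented above
        break
    chunk = json.loads(payload)
    print(chunk["choices"][0]["text"], end="", flush=True)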

1.4 Server Startup

python -m atom.entrypoints.openai_server \
    --model <model_name_or_path> \
    --kv_cache_dtype fp8 \
    --host 0.0.0.0 \
    --server-port 8000

Server-specific CLI arguments:

| Argument | Default | Description |
|----------|---------|-------------|
| --host | 0.0.0.0 | Bind address |
| --server-port | 8000 | HTTP port (note: --port is for internal engine communication) |

All EngineArgs arguments are also accepted (see Section 7 for the full list).

1.5 Example: curl

# Non-streaming chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128
  }'

# Streaming text completion
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "The capital of France is",
    "max_tokens": 64,
    "stream": true
  }'
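
Because the endpoints follow the OpenAI schema, the official OpenAI Python SDK can also be pointed at the server. A minimal sketch (the api_key is a placeholder; whether the server validates it is not covered here):

from openai import OpenAI

# Point the SDK at the local ATOM server instead of api.openai.com
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)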

2. Programmatic API (LLMEngine)

The LLMEngine class in atom/model_engine/llm_engine.py provides a Python-native interface for inference without running an HTTP server.

2.1 Initialization

from atom import LLMEngine, SamplingParams

engine = LLMEngine(model="deepseek-ai/DeepSeek-R1", kv_cache_dtype="fp8",
                   tensor_parallel_size=8)

LLMEngine.__init__(model, **kwargs) accepts all Config field names as keyword arguments (e.g. tensor_parallel_size, kv_cache_dtype, max_model_len, data_parallel_size, gpu_memory_utilization).

2.2 SamplingParams

Defined in atom/sampling_params.py:

from dataclasses import dataclass
from typing import Optional

@dataclass
class SamplingParams:
    temperature: float = 1.0
    max_tokens: int = 64
    ignore_eos: bool = False
    stop_strings: Optional[list[str]] = None

2.3 Core Methods

| Method | Signature | Description |
|--------|-----------|-------------|
| generate | (prompts: list[str], sampling_params) -> list[dict] | Synchronous batch generation; blocks until all prompts complete |
| add_request | (prompt_or_tokens_list, sampling_params_list, stream_callback=None) | Submit requests for asynchronous processing |
| step | () -> list[Sequence] | Retrieve completed sequences |
| is_finished | () -> bool | Check whether all pending requests have completed |
| start_profile | () | Start torch profiler on all workers |
| stop_profile | () | Stop torch profiler and write traces |
| print_mtp_statistics | () | Print speculative decoding acceptance statistics |

2.4 Synchronous Generation Example

from atom import LLMEngine, SamplingParams

engine = LLMEngine(model="meta-llama/Meta-Llama-3-8B", kv_cache_dtype="fp8")
params = SamplingParams(temperature=0.6, max_tokens=256)

outputs = engine.generate(["Explain quantum computing in simple terms."], params)
for out in outputs:
    print(out["text"])

Each output dictionary contains: text, token_ids, latency, finish_reason, num_tokens_input, num_tokens_output, ttft, and tpot.
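
For instance, the timing fields can be read straight off each dictionary (assuming ttft and tpot are reported in seconds):

for out in outputs:
    # per-request token counts and latency metrics
    print(f"in={out['num_tokens_input']} out={out['num_tokens_output']} "
          f"ttft={out['ttft']:.3f}s tpot={out['tpot']:.4f}s "
          f"finish={out['finish_reason']}")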

2.5 Asynchronous / Streaming Usage

engine.add_request(
    prompt_or_tokens_list=["Hello world", "How are you?"],
    sampling_params_list=SamplingParams(temperature=0.8, max_tokens=128),
    stream_callback=my_callback,  # called per-token with RequestOutput
)

while not engine.is_finished():
    completed = engine.step()
    # process completed sequences
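
A minimal stream_callback sketch (the exact RequestOutput schema is defined by the engine; this simply collects whatever it receives):

streamed = []

def my_callback(output):
    # Invoked once per generated token with a RequestOutput;
    # collect it here or forward it to a client as an SSE event.
    streamed.append(output)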

3. Simple Inference

The atom/examples/simple_inference.py script provides a quick way to validate model loading and generation.

3.1 Usage

python -m atom.examples.simple_inference \
    --model meta-llama/Meta-Llama-3-8B \
    --kv_cache_dtype fp8 \
    --temperature 0.6

3.2 What It Does

  1. Parses all EngineArgs plus --temperature (default 0.6).

  2. Creates an LLMEngine via EngineArgs.from_cli_args(args).create_engine().

  3. Applies the model’s chat template to four built-in prompts (English and Chinese) with enable_thinking=True.

  4. Runs a warmup generation, then generates completions for the batch.

  5. Calls llm.print_mtp_statistics() to report speculative decoding stats (if MTP is enabled).


4. Benchmarking

ATOM ships a comprehensive online serving benchmark in atom/benchmarks/benchmark_serving.py (adapted from vLLM’s benchmarking tooling).

4.1 Metrics

The BenchmarkMetrics dataclass tracks:

| Metric | Abbreviation | Description |
|--------|--------------|-------------|
| Time to First Token | TTFT | Latency from request submission to the first generated token |
| Time per Output Token | TPOT | Average latency per output token (excluding the first) |
| Inter-Token Latency | ITL | Latency between successive output tokens |
| End-to-End Latency | E2EL | Total latency from request send to full response receipt |
| Request Throughput | | Completed requests per second |
| Output Token Throughput | | Generated tokens per second |
| Total Token Throughput | | (input + output) tokens per second |
| Request Goodput | | Requests per second meeting SLO targets |
For each latency metric, mean, median, standard deviation, and configurable percentiles (default: P99) are reported.
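
To make the definitions concrete, here is how these quantities relate to per-token receipt timestamps (an illustrative sketch, not the benchmark's exact code):

import statistics

def summarize(send_time: float, token_times: list[float]) -> dict:
    # token_times: absolute receipt time of each output token
    ttft = token_times[0] - send_time
    e2el = token_times[-1] - send_time
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    # TPOT averages over output tokens excluding the first
    tpot = (e2el - ttft) / (len(token_times) - 1) if len(token_times) > 1 else 0.0
    return {
        "ttft": ttft,
        "tpot": tpot,
        "itl_mean": statistics.mean(itl) if itl else 0.0,
        "e2el": e2el,
    }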

4.2 Key CLI Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| --backend | vllm | Backend type. Choices: tgi, vllm, lmdeploy, deepspeed-mii, openai, openai-chat, tensorrt-llm, scalellm, sglang |
| --model | (required) | Model name or path |
| --base-url | None | Server base URL (e.g. http://localhost:8000) |
| --host | 127.0.0.1 | Server host (used when --base-url is not set) |
| --port | 8000 | Server port (used when --base-url is not set) |
| --endpoint | /v1/completions | API endpoint path |
| --dataset-name | sharegpt | Dataset type: sharegpt, burstgpt, sonnet, random, hf |
| --dataset-path | None | Path to dataset file or HuggingFace dataset ID |
| --num-prompts | 1000 | Number of prompts to benchmark |
| --request-rate | inf | Requests per second (inf = send all at once) |
| --burstiness | 1.0 | Burstiness factor (1.0 = Poisson process) |
| --max-concurrency | None | Maximum concurrent requests |
| --ignore-eos | False | Ignore EOS token in generation |
| --save-result | False | Save results to JSON |
| --result-dir | None | Directory for result JSON files |
| --result-filename | None | Custom filename for results |
| --percentile-metrics | ttft,tpot,itl | Comma-separated metrics to report percentiles for |
| --metric-percentiles | 99 | Comma-separated percentile values (e.g. 25,50,75,99) |
| --goodput | None | SLO targets as KEY:VALUE pairs (e.g. ttft:100 tpot:50) |
| --profile | False | Enable torch profiler during the benchmark run |
| --tokenizer | None | Custom tokenizer name or path |
| --seed | 0 | Random seed |
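
For intuition on --request-rate and --burstiness: with burstiness 1.0, inter-arrival times are exponential (a Poisson process); lower values make arrivals burstier. A sketch of gamma-distributed inter-arrival sampling in the style of vLLM's benchmark, from which this tool is adapted (ATOM's exact implementation may differ):

import numpy as np

def interarrival_times(n: int, request_rate: float, burstiness: float = 1.0) -> np.ndarray:
    # Gamma(shape=burstiness, scale=1/(rate*burstiness)) has mean 1/rate;
    # shape = 1.0 reduces to the exponential distribution (Poisson arrivals).
    if request_rate == float("inf"):
        return np.zeros(n)  # inf: dispatch all requests immediately
    scale = 1.0 / (request_rate * burstiness)
    return np.random.gamma(shape=burstiness, scale=scale, size=n)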

Random dataset options:

| Argument | Default | Description |
|----------|---------|-------------|
| --random-input-len | 1024 | Input token length |
| --random-output-len | 128 | Output token length |
| --random-range-ratio | 1.0 | Length variation ratio |
| --random-prefix-len | 0 | Fixed prefix token length |
| --use-chat-template | False | Apply chat template to random prompts |

4.3 Backend Request Functions

Defined in atom/benchmarks/backend_request_func.py:

| Backend Key | Function | Protocol |
|-------------|----------|----------|
| vllm | async_request_openai_completions | OpenAI Completions API (streaming) |
| openai | async_request_openai_completions | OpenAI Completions API (streaming) |
| openai-chat | async_request_openai_chat_completions | OpenAI Chat Completions API (streaming) |
| tgi | async_request_tgi | TGI generate_stream |
| tensorrt-llm | async_request_trt_llm | TRT-LLM generate_stream |
| deepspeed-mii | async_request_deepspeed_mii | DeepSpeed-MII |
| lmdeploy | async_request_openai_completions | OpenAI Completions API |
| scalellm | async_request_openai_completions | OpenAI Completions API |
| sglang | async_request_openai_completions | OpenAI Completions API |
Each function uses RequestFuncInput and returns a RequestFuncOutput with timing data (ttft, itl, latency, tpot).

4.4 Full Benchmark Example

# 1. Start the server
python -m atom.entrypoints.openai_server \
    --kv_cache_dtype fp8 -tp 8 --model deepseek-ai/DeepSeek-R1

# 2. Run benchmark
MODEL=deepseek-ai/DeepSeek-R1
ISL=1024
OSL=1024
CONC=128
PORT=8000
RESULT_FILENAME=Deepseek-R1-result

python -m atom.benchmarks.benchmark_serving \
    --model=$MODEL --backend=vllm --base-url=http://localhost:$PORT \
    --dataset-name=random \
    --random-input-len=$ISL --random-output-len=$OSL \
    --random-range-ratio 0.8 \
    --num-prompts=$(( $CONC * 10 )) \
    --max-concurrency=$CONC \
    --request-rate=inf --ignore-eos \
    --save-result --percentile-metrics="ttft,tpot,itl,e2el" \
    --result-dir=./ --result-filename=$RESULT_FILENAME.json

5. Profiling

ATOM supports PyTorch profiling via environment variables, HTTP endpoints, and the programmatic API.

5.1 Configuration

| Mechanism | Description |
|-----------|-------------|
| --torch-profiler-dir <dir> | CLI arg to set the trace output directory |
| ATOM_TORCH_PROFILER_DIR env var | Sets the default torch_profiler_dir in Config |
| ATOM_PROFILER_MORE=1 env var | Enables detailed profiling: record_shapes, with_stack, profile_memory |
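
For example, detailed profiling can be enabled entirely through the environment (paths here are illustrative):

export ATOM_TORCH_PROFILER_DIR=./profiler_traces
ATOM_PROFILER_MORE=1 python -m atom.examples.simple_inference --model Qwen/Qwen3-0.6B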

When a profiler directory is configured, each worker saves traces to a rank-specific subdirectory:

  • Multi-GPU with DP: {profiler_dir}/dp{dp_rank}_tp{rank}/

  • Single-GPU / TP-only: {profiler_dir}/rank_{rank}/

Traces are saved in gzip-compressed TensorBoard format and can be viewed with tensorboard --logdir <profiler_dir> or Chrome’s chrome://tracing.

5.2 Online Profiling (HTTP)

While the server is running, start and stop profiling with HTTP requests:

# Start profiling
curl -s -S -X POST http://127.0.0.1:8000/start_profile

# ... run your workload ...

# Stop profiling and flush traces
curl -s -S -X POST http://127.0.0.1:8000/stop_profile

The server must be started with --torch-profiler-dir or with ATOM_TORCH_PROFILER_DIR set for these endpoints to produce traces.

5.3 Programmatic Profiling

engine = LLMEngine(model="Qwen/Qwen3-0.6B", torch_profiler_dir="./traces")

engine.start_profile()
outputs = engine.generate(prompts, sampling_params)
engine.stop_profile()
# Traces written to ./traces/rank_0/

5.4 Offline Profiling Script

atom/examples/profile_offline.py provides a self-contained offline profiling workflow:

python -m atom.examples.profile_offline \
    --model Qwen/Qwen3-0.6B \
    --kv_cache_dtype fp8 \
    --torch-profiler-dir ./profiler_traces \
    --input-length 128 \
    --output-length 32 \
    --bs 4

Script-specific arguments:

| Argument | Default | Description |
|----------|---------|-------------|
| --input-length | 128 | Approximate input prompt length in tokens |
| --output-length | 32 | Output generation length in tokens |
| --bs | 1 | Batch size (number of parallel requests) |
| --random-input | False | Use random token input instead of predefined text |

If --torch-profiler-dir is not specified, the script defaults to ./profiler_traces.

5.5 Profiling During Benchmarks

The benchmark tool can trigger profiling automatically via --profile:

python -m atom.benchmarks.benchmark_serving \
    --model <model> --backend vllm \
    --base-url http://localhost:8000 \
    --dataset-name random --num-prompts 100 \
    --profile

This sends POST /start_profile before the benchmark and POST /stop_profile after completion.


6. Speculative Decoding (MTP)

ATOM supports Multi-Token Prediction (MTP) for DeepSeek models using the Eagle-style speculative decoding framework.

6.1 Architecture

  • EagleProposer (atom/spec_decode/eagle.py): Loads and runs the draft (MTP) model to propose speculative tokens. Supports the DeepSeekMTPModel architecture via DeepSeekMTP.

  • RejectionSampler (atom/model_ops/rejection_sampler.py): Implements greedy rejection sampling with a Triton kernel. Compares draft token IDs against target model argmax and accepts matching prefixes; appends a bonus token if all drafts are accepted.

6.2 Configuration

Enable MTP via CLI arguments:

python -m atom.entrypoints.openai_server \
    --model deepseek-ai/DeepSeek-R1 \
    --kv_cache_dtype fp8 -tp 8 \
    --method mtp \
    --num-speculative-tokens 1

| Argument | Default | Description |
|----------|---------|-------------|
| --method | None | Speculative method; currently only mtp is supported |
| --num-speculative-tokens | 1 | Number of draft tokens per iteration (the draft model runs this many autoregressive steps) |

6.3 MTP Statistics

ATOM tracks acceptance statistics at runtime:

  • total_draft_tokens: Total number of draft tokens proposed

  • total_accepted_tokens: Number of draft tokens accepted by rejection sampling

  • acceptance_rate: Ratio of accepted to draft tokens

Statistics are logged every 1000 draft tokens and can be printed on demand:

engine.print_mtp_statistics()

Example output:

MTP Statistics:
  Total draft tokens: 5000
  Accepted tokens:    4250
  Acceptance rate:    85.00%

6.4 How Rejection Sampling Works

  1. The draft model autoregressively generates num_speculative_tokens draft tokens using argmax decoding.

  2. The target model verifies all draft tokens in a single forward pass.

  3. The rejection_greedy_sample_kernel (Triton) compares each draft token against the target model’s argmax:

    • If they match, the token is accepted.

    • On the first mismatch, the target model’s token replaces it and all subsequent draft tokens are discarded.

    • If all draft tokens match, a bonus token from the target model is appended.
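
The same acceptance rule, rendered as plain Python for a single sequence (illustrative only; the real implementation is a batched Triton kernel):

def greedy_rejection_sample(draft_tokens: list[int],
                            target_argmax: list[int],
                            bonus_token: int) -> list[int]:
    # target_argmax[i] is the target model's argmax at draft position i
    accepted: list[int] = []
    for draft, target in zip(draft_tokens, target_argmax):
        if draft == target:
            accepted.append(draft)   # match: accept the draft token
        else:
            accepted.append(target)  # mismatch: substitute the target token
            return accepted          # and discard all remaining drafts
    accepted.append(bonus_token)     # every draft matched: append bonus token
    return accepted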


7. Deployment Examples

7.1 Single-GPU

python -m atom.entrypoints.openai_server \
    --model Qwen/Qwen3-0.6B \
    --kv_cache_dtype fp8

7.2 Multi-GPU with Tensor Parallelism

python -m atom.entrypoints.openai_server \
    --model deepseek-ai/DeepSeek-R1 \
    --kv_cache_dtype fp8 \
    -tp 8

7.3 Docker Deployment

# Pull the ROCm PyTorch image
docker pull rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.8.0

# Launch container
docker run -it --network=host \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    -v $HOME:/home/$USER \
    -v /mnt:/mnt \
    -v /data:/data \
    --shm-size=16G \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.8.0

# Inside the container
pip install amd-aiter
git clone https://github.com/ROCm/ATOM.git && cd ATOM && pip install .

# Start serving
python -m atom.entrypoints.openai_server \
    --model deepseek-ai/DeepSeek-R1 \
    --kv_cache_dtype fp8 -tp 8

7.4 Engine CLI Arguments (EngineArgs)

These arguments are available for all entrypoints (server, examples, and any script using EngineArgs.add_cli_args):

| Argument | Default | Description |
|----------|---------|-------------|
| --model | Qwen/Qwen3-0.6B | Model name or path |
| --trust-remote-code | False | Trust remote code from HuggingFace |
| --tensor-parallel-size, -tp | 1 | Tensor parallel size |
| --data-parallel-size, -dp | 1 | Data parallel size |
| --enforce-eager | False | Disable CUDA graph capture; use eager execution |
| --enable_prefix_caching | False | Enable prefix caching |
| --port | 8006 | Internal engine communication port |
| --kv_cache_dtype | bf16 | KV cache dtype: bf16 or fp8 |
| --block-size | 16 | KV cache block size |
| --max-model-len | None | Maximum context length (defaults to HF config) |
| --max-num-batched-tokens | 16384 | Maximum tokens per batch |
| --max-num-seqs | 512 | Maximum sequences per batch |
| --gpu-memory-utilization | 0.9 | GPU memory utilization (0.0 to 1.0) |
| --scheduler-delay-factor | 0.0 | Delay factor before scheduling next prompt |
| --cudagraph-capture-sizes | [1,2,4,...,256] | Batch sizes for CUDA graph capture |
| --level | 3 | Compilation level (0-3); 3 = torch.compile |
| --load_dummy | False | Skip loading model weights (for testing) |
| --enable-expert-parallel | False | Enable expert parallelism for MoE |
| --enable-dp-attention | False | Enable data-parallel attention |
| --torch-profiler-dir | None | Directory for torch profiler traces |
| --method | None | Speculative decoding method (mtp) |
| --num-speculative-tokens | 1 | Number of speculative tokens per step |


8. Accuracy Validation

ATOM supports accuracy validation through the lm-eval framework via the OpenAI-compatible API.

8.1 Setup

pip install "lm-eval[api]"

8.2 Run Evaluation

Start an ATOM server, then run lm-eval against it:

# Start server
python -m atom.entrypoints.openai_server \
    --model meta-llama/Meta-Llama-3-8B \
    --kv_cache_dtype fp8

# Run evaluation
lm_eval --model local-completions \
    --model_args model=meta-llama/Meta-Llama-3-8B,base_url=http://localhost:8000/v1/completions,num_concurrent=64,max_retries=3,tokenized_requests=False \
    --tasks gsm8k \
    --num_fewshot 5

Any lm-eval task can be used. The local-completions model type sends requests to the /v1/completions endpoint, making it compatible with the ATOM server without modification.
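
For chat-formatted evaluation, lm-eval's local-chat-completions model type can target the /v1/chat/completions endpoint instead (a sketch; flag names follow recent lm-eval releases, so verify against your installed version):

lm_eval --model local-chat-completions \
    --model_args model=<model>,base_url=http://localhost:8000/v1/chat/completions,num_concurrent=64 \
    --tasks gsm8k --apply_chat_template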


Source Files

| File | Description |
|------|-------------|
| atom/entrypoints/openai_server.py | OpenAI-compatible API server (FastAPI + Uvicorn) |
| atom/model_engine/llm_engine.py | LLMEngine programmatic API |
| atom/sampling_params.py | SamplingParams dataclass |
| atom/model_engine/arg_utils.py | EngineArgs CLI argument definitions and engine factory |
| atom/examples/simple_inference.py | Simple batch inference example |
| atom/examples/profile_offline.py | Offline profiling tool |
| atom/benchmarks/benchmark_serving.py | Online serving benchmark (BenchmarkMetrics, dataset sampling, result reporting) |
| atom/benchmarks/backend_request_func.py | Async HTTP request functions for each backend (RequestFuncInput, RequestFuncOutput, ASYNC_REQUEST_FUNCS) |
| atom/benchmarks/benchmark_utils.py | convert_to_pytorch_benchmark_format utility |
| atom/spec_decode/eagle.py | EagleProposer – MTP draft model for DeepSeek speculative decoding |
| atom/model_ops/rejection_sampler.py | RejectionSampler with Triton greedy rejection kernel |
| atom/config.py | Config, CompilationConfig, SpeculativeConfig dataclasses |
| atom/model_engine/model_runner.py | ModelRunner with start_profiler/stop_profiler and MTP statistics |