ATOM Serving & Benchmarking Guide
ATOM (AiTer Optimized Model) is AMD’s lightweight LLM inference engine built on AITER kernels for ROCm/HIP GPUs. This guide covers the OpenAI-compatible serving API, programmatic engine usage, benchmarking tools, profiling, and speculative decoding.
Quick Reference
# Start the OpenAI-compatible server
python -m atom.entrypoints.openai_server --model <model_name_or_path> --kv_cache_dtype fp8
# Run the online serving benchmark
python -m atom.benchmarks.benchmark_serving \
--backend vllm --model <model_name_or_path> \
--base-url http://localhost:8000 \
--dataset-name random --random-input-len 1024 --random-output-len 128 \
--num-prompts 1000 --request-rate inf --ignore-eos
# Simple inference example
python -m atom.examples.simple_inference --model <model_name_or_path> --kv_cache_dtype fp8
# Offline profiling
python -m atom.examples.profile_offline --model <model_name_or_path> --kv_cache_dtype fp8
# Accuracy validation with lm-eval
lm_eval --model local-completions \
--model_args model=<model>,base_url=http://localhost:8000/v1/completions,num_concurrent=64,max_retries=3,tokenized_requests=False \
--tasks gsm8k --num_fewshot 5
1. OpenAI-Compatible Server
The server is implemented in atom/entrypoints/openai_server.py using FastAPI
and Uvicorn. It exposes OpenAI-compatible HTTP endpoints so that existing
clients (curl, OpenAI SDK, lm-eval) work without modification.
1.1 Endpoints
| Method | Path | Description |
|---|---|---|
| POST | /v1/chat/completions | Chat completion (ChatCompletionRequest -> ChatCompletionResponse) |
| POST | /v1/completions | Text completion (CompletionRequest -> CompletionResponse) |
|  |  | List available models |
|  |  | Health check (returns |
| POST | /start_profile | Start torch profiler on the engine |
| POST | /stop_profile | Stop torch profiler and flush traces |
1.2 Request Models
ChatCompletionRequest fields:
| Field | Type | Default | Description |
|---|---|---|---|
| model |  |  | Model name (validated against the loaded model) |
| messages |  |  | List of chat messages ( |
|  |  |  | Alias for |
|  |  |  | Sampling temperature |
|  |  |  | Nucleus sampling threshold |
| max_tokens |  |  | Maximum tokens to generate |
|  |  |  | Stop strings |
|  |  |  | Ignore end-of-sequence token |
|  |  |  | Enable server-sent events streaming |
|  |  |  | Random seed |
CompletionRequest fields:
| Field | Type | Default | Description |
|---|---|---|---|
|  |  |  | Model name |
| prompt |  | (required) | Text prompt |
|  |  |  | Sampling temperature |
|  |  |  | Nucleus sampling threshold |
| max_tokens |  |  | Maximum tokens to generate |
|  |  |  | Stop strings |
|  |  |  | Ignore end-of-sequence token |
| stream |  |  | Enable SSE streaming |
1.3 Response Models
Both ChatCompletionResponse and CompletionResponse include:
id – unique request identifier (e.g. chatcmpl-<uuid> or cmpl-<uuid>)
object – "chat.completion" or "text_completion"
created – Unix timestamp
model – model name
choices – list of generated completions
usage – token counts (prompt_tokens, completion_tokens, total_tokens) plus ttft_s, tpot_s, and latency_s timing fields
Streaming responses use the SSE (Server-Sent Events) protocol with
data: [DONE]\n\n as the termination signal.
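As a minimal sketch of consuming such a stream, the parser below takes already-decoded lines and yields JSON payloads until it sees the `[DONE]` sentinel. `iter_sse_payloads` is a hypothetical helper, and the chunk layout in the sample (`choices[0].text`) is an assumption modeled on the OpenAI completions format rather than something this guide confirms for ATOM.

```python
import json
from typing import Iterable, Iterator


def iter_sse_payloads(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON payloads from SSE lines until the [DONE] sentinel."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # termination signal: stop reading
        yield json.loads(payload)


# Hypothetical stream; the chunk layout is assumed, not taken from ATOM.
sample = [
    'data: {"choices": [{"text": "Par"}]}',
    "",
    'data: {"choices": [{"text": "is"}]}',
    "",
    "data: [DONE]",
    'data: {"choices": [{"text": "ignored"}]}',
]
text = "".join(chunk["choices"][0]["text"] for chunk in iter_sse_payloads(sample))
print(text)
```

Anything after the sentinel is ignored, so a client can close the connection as soon as the generator returns.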
1.4 Server Startup
python -m atom.entrypoints.openai_server \
--model <model_name_or_path> \
--kv_cache_dtype fp8 \
--host 0.0.0.0 \
--server-port 8000
Server-specific CLI arguments:
| Argument | Default | Description |
|---|---|---|
| --host |  | Bind address |
| --server-port |  | HTTP port (note: |
All EngineArgs arguments are also accepted (see Section 7.4 for the full list).
1.5 Example: curl
# Non-streaming chat completion
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 128
}'
# Streaming text completion
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "The capital of France is",
"max_tokens": 64,
"stream": true
}'
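The same chat request can be issued from Python with only the standard library. The sketch below builds the request without sending it, so it runs without a server. `build_chat_request` is a hypothetical helper, not part of ATOM.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 128) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/chat/completions."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("http://localhost:8000", "deepseek-ai/DeepSeek-R1", "Hello!")
print(req.get_method(), req.full_url)

# With a server running, send it like this:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```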
2. Programmatic API (LLMEngine)
The LLMEngine class in atom/model_engine/llm_engine.py provides a
Python-native interface for inference without running an HTTP server.
2.1 Initialization
from atom import LLMEngine, SamplingParams

engine = LLMEngine(model="deepseek-ai/DeepSeek-R1", kv_cache_dtype="fp8",
                   tensor_parallel_size=8)
LLMEngine.__init__(model, **kwargs) accepts all Config field names as
keyword arguments (e.g. tensor_parallel_size, kv_cache_dtype,
max_model_len, data_parallel_size, gpu_memory_utilization).
2.2 SamplingParams
Defined in atom/sampling_params.py:
from dataclasses import dataclass
from typing import Optional

@dataclass
class SamplingParams:
    temperature: float = 1.0
    max_tokens: int = 64
    ignore_eos: bool = False
    stop_strings: Optional[list[str]] = None
2.3 Core Methods
| Method | Signature | Description |
|---|---|---|
| generate |  | Synchronous batch generation; blocks until all prompts complete |
| add_request |  | Submit requests for asynchronous processing |
| step |  | Retrieve completed sequences |
| is_finished |  | Check whether all pending requests have completed |
| start_profile |  | Start torch profiler on all workers |
| stop_profile |  | Stop torch profiler and write traces |
| print_mtp_statistics |  | Print speculative decoding acceptance statistics |
2.4 Synchronous Generation Example
from atom import LLMEngine, SamplingParams
engine = LLMEngine(model="meta-llama/Meta-Llama-3-8B", kv_cache_dtype="fp8")
params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = engine.generate(["Explain quantum computing in simple terms."], params)
for out in outputs:
    print(out["text"])
Each output dictionary contains: text, token_ids, latency,
finish_reason, num_tokens_input, num_tokens_output, ttft, and tpot.
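As an illustration of working with these dictionaries, the sketch below aggregates a batch into a few summary numbers. It assumes `latency` and `ttft` are in seconds and that requests passed to one `generate()` call run concurrently. Neither assumption is confirmed here, and `summarize` is a hypothetical helper.

```python
from statistics import mean


def summarize(outputs: list[dict]) -> dict:
    """Aggregate per-request output dictionaries into batch-level numbers."""
    total_out = sum(o["num_tokens_output"] for o in outputs)
    total_in = sum(o["num_tokens_input"] for o in outputs)
    # Assumption: requests ran concurrently, so the slowest request
    # approximates the wall-clock time of the whole batch.
    wall = max(o["latency"] for o in outputs)
    return {
        "input_tokens": total_in,
        "output_tokens": total_out,
        "mean_ttft": mean(o["ttft"] for o in outputs),
        "output_tokens_per_s": total_out / wall,
    }


# Made-up values for illustration; real dictionaries come from engine.generate().
fake_outputs = [
    {"num_tokens_input": 10, "num_tokens_output": 100, "latency": 2.0, "ttft": 0.10},
    {"num_tokens_input": 12, "num_tokens_output": 60, "latency": 1.5, "ttft": 0.20},
]
print(summarize(fake_outputs))
```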
2.5 Asynchronous / Streaming Usage
def my_callback(output):
    # Invoked once per generated token with a RequestOutput
    print(output)

engine.add_request(
    prompt_or_tokens_list=["Hello world", "How are you?"],
    sampling_params_list=SamplingParams(temperature=0.8, max_tokens=128),
    stream_callback=my_callback,  # called per-token with RequestOutput
)
while not engine.is_finished():
    completed = engine.step()
    # process completed sequences
3. Simple Inference
The atom/examples/simple_inference.py script provides a quick way to validate
model loading and generation.
3.1 Usage
python -m atom.examples.simple_inference \
--model meta-llama/Meta-Llama-3-8B \
--kv_cache_dtype fp8 \
--temperature 0.6
3.2 What It Does
1. Parses all EngineArgs plus --temperature (default 0.6).
2. Creates an LLMEngine via EngineArgs.from_cli_args(args).create_engine().
3. Applies the model’s chat template to four built-in prompts (English and Chinese) with enable_thinking=True.
4. Runs a warmup generation, then generates completions for the batch.
5. Calls llm.print_mtp_statistics() to report speculative decoding stats (if MTP is enabled).
4. Benchmarking
ATOM ships a comprehensive online serving benchmark in
atom/benchmarks/benchmark_serving.py (adapted from vLLM’s benchmarking
tooling).
4.1 Metrics
The BenchmarkMetrics dataclass tracks:
| Metric | Abbreviation | Description |
|---|---|---|
| Time to First Token | TTFT | Latency from request submission to the first generated token |
| Time per Output Token | TPOT | Average latency per output token (excluding the first) |
| Inter-Token Latency | ITL | Latency between successive output tokens |
| End-to-End Latency | E2EL | Total latency from request send to full response receipt |
| Request Throughput | – | Completed requests per second |
| Output Token Throughput | – | Generated tokens per second |
| Total Token Throughput | – | (input + output) tokens per second |
| Request Goodput | – | Requests per second meeting SLO targets |
For each latency metric, mean, median, standard deviation, and configurable percentiles (default: P99) are reported.
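To make the definitions concrete, here is a minimal sketch that derives the four latency metrics from raw token arrival times and computes goodput against a set of limits. `RequestTiming`, `latency_metrics`, and `goodput` are hypothetical names, not the benchmark's own code. The arrival time of the last token stands in for "full response receipt".

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Timestamps in seconds for one streamed request (hypothetical container)."""
    sent_at: float
    token_times: list[float]  # arrival time of each output token


def latency_metrics(t: RequestTiming) -> dict:
    """Derive TTFT, TPOT, ITL and E2EL from raw token arrival times."""
    ttft = t.token_times[0] - t.sent_at
    e2el = t.token_times[-1] - t.sent_at
    # ITL: gaps between successive output tokens
    itl = [b - a for a, b in zip(t.token_times, t.token_times[1:])]
    n = len(t.token_times)
    # TPOT: time after the first token, averaged over the remaining tokens
    tpot = (e2el - ttft) / (n - 1) if n > 1 else 0.0
    return {"ttft": ttft, "tpot": tpot, "itl": itl, "e2el": e2el}


def goodput(results: list[dict], duration_s: float, slo: dict) -> float:
    """Requests per second whose metrics all stay within the SLO limits."""
    good = sum(all(r[k] <= limit for k, limit in slo.items()) for r in results)
    return good / duration_s


timing = RequestTiming(sent_at=0.0, token_times=[0.5, 0.75, 1.0, 1.25, 1.5])
m = latency_metrics(timing)
print(m)
print(goodput([m], duration_s=2.0, slo={"ttft": 1.0, "tpot": 0.5}))
```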
4.2 Key CLI Arguments
| Argument | Default | Description |
|---|---|---|
| --backend |  | Backend type. Choices: |
| --model | (required) | Model name or path |
| --base-url |  | Server base URL (e.g. |
|  |  | Server host (used when |
|  |  | Server port (used when |
|  |  | API endpoint path |
| --dataset-name |  | Dataset type: |
|  |  | Path to dataset file or HuggingFace dataset ID |
| --num-prompts |  | Number of prompts to benchmark |
| --request-rate |  | Requests per second ( |
|  |  | Burstiness factor (1.0 = Poisson process) |
| --max-concurrency |  | Maximum concurrent requests |
| --ignore-eos |  | Ignore EOS token in generation |
| --save-result |  | Save results to JSON |
| --result-dir |  | Directory for result JSON files |
| --result-filename |  | Custom filename for results |
| --percentile-metrics |  | Comma-separated metrics to report percentiles for |
|  |  | Comma-separated percentile values (e.g. |
|  |  | SLO targets as |
| --profile |  | Enable torch profiler during the benchmark run |
|  |  | Custom tokenizer name or path |
|  |  | Random seed |
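The request-rate and burstiness settings together determine when requests are submitted. A minimal sketch of one way to generate such a schedule follows, assuming ATOM keeps the scheme of the vLLM tooling it was adapted from: gaps drawn from a gamma distribution whose shape is the burstiness factor. `inter_arrival_times` is a hypothetical helper.

```python
import random


def inter_arrival_times(n: int, request_rate: float, burstiness: float = 1.0,
                        seed: int = 0) -> list[float]:
    """Sample n gaps (seconds) between consecutive request submissions."""
    if request_rate == float("inf"):
        return [0.0] * n  # infinite rate: submit everything at once
    rng = random.Random(seed)
    # Gamma with shape=burstiness and scale=1/(rate*burstiness) has mean 1/rate;
    # shape 1.0 reduces to the exponential gaps of a Poisson process.
    scale = 1.0 / (request_rate * burstiness)
    return [rng.gammavariate(burstiness, scale) for _ in range(n)]


gaps = inter_arrival_times(10_000, request_rate=4.0)
print(sum(gaps) / len(gaps))  # close to 1 / 4.0
```

Under this scheme the mean gap is always 1/rate. A burstiness below 1.0 clusters requests into bursts and a value above 1.0 spaces them more evenly.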
Random dataset options:
| Argument | Default | Description |
|---|---|---|
| --random-input-len |  | Input token length |
| --random-output-len |  | Output token length |
| --random-range-ratio |  | Length variation ratio |
|  |  | Fixed prefix token length |
|  |  | Apply chat template to random prompts |
4.3 Backend Request Functions
Defined in atom/benchmarks/backend_request_func.py:
| Backend Key | Function | Protocol |
|---|---|---|
|  |  | OpenAI Completions API (streaming) |
|  |  | OpenAI Completions API (streaming) |
|  |  | OpenAI Chat Completions API (streaming) |
|  |  | TGI |
|  |  | TRT-LLM |
|  |  | DeepSpeed-MII |
|  |  | OpenAI Completions API |
|  |  | OpenAI Completions API |
|  |  | OpenAI Completions API |
Each function takes a RequestFuncInput and returns a RequestFuncOutput with
timing data (ttft, itl, latency, tpot).
4.4 Full Benchmark Example
# 1. Start the server
python -m atom.entrypoints.openai_server \
--kv_cache_dtype fp8 -tp 8 --model deepseek-ai/DeepSeek-R1
# 2. Run benchmark
MODEL=deepseek-ai/DeepSeek-R1
ISL=1024
OSL=1024
CONC=128
PORT=8000
RESULT_FILENAME=Deepseek-R1-result
python -m atom.benchmarks.benchmark_serving \
--model=$MODEL --backend=vllm --base-url=http://localhost:$PORT \
--dataset-name=random \
--random-input-len=$ISL --random-output-len=$OSL \
--random-range-ratio 0.8 \
--num-prompts=$(( $CONC * 10 )) \
--max-concurrency=$CONC \
--request-rate=inf --ignore-eos \
--save-result --percentile-metrics="ttft,tpot,itl,e2el" \
--result-dir=./ --result-filename=$RESULT_FILENAME.json
5. Profiling
ATOM supports PyTorch profiling via environment variables, HTTP endpoints, and the programmatic API.
5.1 Configuration
| Mechanism | Description |
|---|---|
| --torch-profiler-dir | CLI arg to set the trace output directory |
| ATOM_TORCH_PROFILER_DIR | Sets the default |
|  | Enables detailed profiling: |
When a profiler directory is configured, each worker saves traces to a rank-specific subdirectory:
Multi-GPU with DP: {profiler_dir}/dp{dp_rank}_tp{rank}/
Single-GPU / TP-only: {profiler_dir}/rank_{rank}/
Traces are saved in gzip-compressed TensorBoard format and can be viewed with
tensorboard --logdir <profiler_dir> or Chrome’s chrome://tracing.
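The directory layout above can be expressed as a small path helper, which is handy when a script needs to locate the traces of a particular worker. This is a sketch only. `trace_subdir` is a hypothetical function, and passing `dp_rank=None` to mean "no data parallelism" is an assumption made for the example.

```python
from pathlib import Path
from typing import Optional


def trace_subdir(profiler_dir: str, rank: int, dp_rank: Optional[int] = None) -> Path:
    """Return the rank-specific trace directory for one worker."""
    if dp_rank is not None:
        # Data-parallel layout: dp{dp_rank}_tp{rank}
        return Path(profiler_dir) / f"dp{dp_rank}_tp{rank}"
    # Single-GPU or tensor-parallel-only layout: rank_{rank}
    return Path(profiler_dir) / f"rank_{rank}"


print(trace_subdir("./traces", rank=0))
print(trace_subdir("./traces", rank=3, dp_rank=1))
```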
5.2 Online Profiling (HTTP)
While the server is running, start and stop profiling with HTTP requests:
# Start profiling
curl -s -S -X POST http://127.0.0.1:8000/start_profile
# ... run your workload ...
# Stop profiling and flush traces
curl -s -S -X POST http://127.0.0.1:8000/stop_profile
The server must be started with --torch-profiler-dir or with
ATOM_TORCH_PROFILER_DIR set for these endpoints to produce traces.
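When driving the server from Python, the start/stop pair fits naturally into a context manager, so that profiling is stopped even if the workload raises. The sketch below uses only the standard library. `http_post` and `profiled` are hypothetical helpers, and the dry run swaps in a recording stub so the example executes without a live server.

```python
import urllib.request
from contextlib import contextmanager
from typing import Callable, Iterator


def http_post(url: str) -> int:
    """POST with an empty body and return the HTTP status code."""
    req = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(req) as resp:
        return resp.status


@contextmanager
def profiled(base_url: str, post: Callable[[str], int] = http_post) -> Iterator[None]:
    """Start profiling on entry and always stop it on exit."""
    post(base_url + "/start_profile")
    try:
        yield
    finally:
        post(base_url + "/stop_profile")  # flush traces even if the workload fails


# Dry run with a recording stub instead of a live server.
calls: list[str] = []
with profiled("http://127.0.0.1:8000", post=lambda url: calls.append(url) or 200):
    pass  # run your workload here
print(calls)
```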
5.3 Programmatic Profiling
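from atom import LLMEngine, SamplingParams

prompts = ["Explain quantum computing in simple terms."]
sampling_params = SamplingParams(temperature=0.6, max_tokens=64)

engine = LLMEngine(model="Qwen/Qwen3-0.6B", torch_profiler_dir="./traces")
engine.start_profile()
outputs = engine.generate(prompts, sampling_params)
engine.stop_profile()
# Traces written to ./traces/rank_0/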
5.4 Offline Profiling Script
atom/examples/profile_offline.py provides a self-contained offline profiling
workflow:
python -m atom.examples.profile_offline \
--model Qwen/Qwen3-0.6B \
--kv_cache_dtype fp8 \
--torch-profiler-dir ./profiler_traces \
--input-length 128 \
--output-length 32 \
--bs 4
Script-specific arguments:
| Argument | Default | Description |
|---|---|---|
| --input-length |  | Approximate input prompt length in tokens |
| --output-length |  | Output generation length in tokens |
| --bs |  | Batch size (number of parallel requests) |
|  |  | Use random token input instead of predefined text |
If --torch-profiler-dir is not specified, the script defaults to
./profiler_traces.
5.5 Profiling During Benchmarks
The benchmark tool can trigger profiling automatically via --profile:
python -m atom.benchmarks.benchmark_serving \
--model <model> --backend vllm \
--base-url http://localhost:8000 \
--dataset-name random --num-prompts 100 \
--profile
This sends POST /start_profile before the benchmark and
POST /stop_profile after completion.
6. Speculative Decoding (MTP)
ATOM supports Multi-Token Prediction (MTP) for DeepSeek models using an Eagle-style speculative decoding framework.
6.1 Architecture
EagleProposer (atom/spec_decode/eagle.py): Loads and runs the draft (MTP) model to propose speculative tokens. Supports the DeepSeekMTPModel architecture via DeepSeekMTP.
RejectionSampler (atom/model_ops/rejection_sampler.py): Implements greedy rejection sampling with a Triton kernel. Compares draft token IDs against target model argmax and accepts matching prefixes; appends a bonus token if all drafts are accepted.
6.2 Configuration
Enable MTP via CLI arguments:
python -m atom.entrypoints.openai_server \
--model deepseek-ai/DeepSeek-R1 \
--kv_cache_dtype fp8 -tp 8 \
--method mtp \
--num-speculative-tokens 1
| Argument | Default | Description |
|---|---|---|
| --method |  | Speculative method; currently only |
| --num-speculative-tokens |  | Number of draft tokens per iteration (draft model runs this many autoregressive steps) |
6.3 MTP Statistics
ATOM tracks acceptance statistics at runtime:
total_draft_tokens: Total number of draft tokens proposed
total_accepted_tokens: Number of draft tokens accepted by rejection sampling
acceptance_rate: Ratio of accepted to draft tokens
Statistics are logged every 1000 draft tokens and can be printed on demand:
engine.print_mtp_statistics()
Example output:
MTP Statistics:
Total draft tokens: 5000
Accepted tokens: 4250
Acceptance rate: 85.00%
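The relationship between the three statistics is simple enough to sketch directly. The counter below is a hypothetical stand-in, not ATOM's internal class. It reproduces the acceptance rate shown in the example output from the two totals.

```python
from dataclasses import dataclass


@dataclass
class MTPStats:
    """Hypothetical counter mirroring the three statistics listed above."""
    total_draft_tokens: int = 0
    total_accepted_tokens: int = 0

    def update(self, num_draft: int, num_accepted: int) -> None:
        """Record the outcome of one or more speculative steps."""
        self.total_draft_tokens += num_draft
        self.total_accepted_tokens += num_accepted

    @property
    def acceptance_rate(self) -> float:
        if self.total_draft_tokens == 0:
            return 0.0  # avoid division by zero before any drafts
        return self.total_accepted_tokens / self.total_draft_tokens


stats = MTPStats()
stats.update(num_draft=5000, num_accepted=4250)
print(f"Acceptance rate: {stats.acceptance_rate:.2%}")
```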
6.4 How Rejection Sampling Works
1. The draft model generates num_speculative_tokens token predictions autoregressively using argmax.
2. The target model verifies all draft tokens in a single forward pass.
3. The rejection_greedy_sample_kernel (Triton) compares each draft token against the target model’s argmax:
   - If they match, the token is accepted.
   - On the first mismatch, the target model’s token replaces it and all subsequent draft tokens are discarded.
4. If all draft tokens match, a bonus token from the target model is appended.
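The verification steps above can be sketched in plain Python for a single sequence. This is a reference illustration of the greedy accept/reject logic, not the Triton kernel itself. `greedy_verify` is a hypothetical name, and it assumes the target model's argmax is available for every draft position plus one extra position for the bonus token.

```python
def greedy_verify(draft_tokens: list[int], target_argmax: list[int]) -> list[int]:
    """Return the tokens emitted for one speculative step.

    target_argmax must hold len(draft_tokens) + 1 entries: the target model's
    argmax at every draft position plus one extra position for the bonus token.
    """
    if len(target_argmax) != len(draft_tokens) + 1:
        raise ValueError("target_argmax needs one more entry than draft_tokens")
    out: list[int] = []
    for draft, target in zip(draft_tokens, target_argmax):
        if draft != target:
            out.append(target)  # first mismatch: take the target's token and stop
            return out
        out.append(draft)  # match: accept the draft token
    out.append(target_argmax[-1])  # all drafts accepted: append the bonus token
    return out


print(greedy_verify([5, 9, 2], [5, 9, 2, 7]))  # all accepted plus bonus
print(greedy_verify([5, 9, 2], [5, 4, 2, 7]))  # mismatch at position 1
```

Every step therefore emits at least one token (the target's own choice) and at most one more than the number of draft tokens.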
7. Deployment Examples
7.1 Single-GPU
python -m atom.entrypoints.openai_server \
--model Qwen/Qwen3-0.6B \
--kv_cache_dtype fp8
7.2 Multi-GPU with Tensor Parallelism
python -m atom.entrypoints.openai_server \
--model deepseek-ai/DeepSeek-R1 \
--kv_cache_dtype fp8 \
-tp 8
7.3 Docker Deployment
# Pull the ROCm PyTorch image
docker pull rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.8.0
# Launch container
docker run -it --network=host \
--device=/dev/kfd \
--device=/dev/dri \
--group-add video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-v $HOME:/home/$USER \
-v /mnt:/mnt \
-v /data:/data \
--shm-size=16G \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.8.0
# Inside the container
pip install amd-aiter
git clone https://github.com/ROCm/ATOM.git && cd ATOM && pip install .
# Start serving
python -m atom.entrypoints.openai_server \
--model deepseek-ai/DeepSeek-R1 \
--kv_cache_dtype fp8 -tp 8
7.4 Engine CLI Arguments (EngineArgs)
These arguments are available for all entrypoints (server, examples, and any
script using EngineArgs.add_cli_args):
| Argument | Default | Description |
|---|---|---|
| --model |  | Model name or path |
|  |  | Trust remote code from HuggingFace |
| -tp |  | Tensor parallel size |
|  |  | Data parallel size |
|  |  | Disable CUDA graph capture; use eager execution |
|  |  | Enable prefix caching |
|  |  | Internal engine communication port |
| --kv_cache_dtype |  | KV cache dtype: |
|  |  | KV cache block size |
|  |  | Maximum context length (defaults to HF config) |
|  |  | Maximum tokens per batch |
|  |  | Maximum sequences per batch |
|  |  | GPU memory utilization (0.0 to 1.0) |
|  |  | Delay factor before scheduling next prompt |
|  |  | Batch sizes for CUDA graph capture |
|  |  | Compilation level (0-3); 3 = torch.compile |
|  |  | Skip loading model weights (for testing) |
|  |  | Enable expert parallelism for MoE |
|  |  | Enable data-parallel attention |
| --torch-profiler-dir |  | Directory for torch profiler traces |
| --method |  | Speculative decoding method ( |
| --num-speculative-tokens |  | Number of speculative tokens per step |
8. Accuracy Validation
ATOM supports accuracy validation through the lm-eval framework via the OpenAI-compatible API.
8.1 Setup
pip install "lm-eval[api]"
8.2 Run Evaluation
Start an ATOM server, then run lm-eval against it:
# Start server
python -m atom.entrypoints.openai_server \
--model meta-llama/Meta-Llama-3-8B \
--kv_cache_dtype fp8
# Run evaluation
lm_eval --model local-completions \
--model_args model=meta-llama/Meta-Llama-3-8B,base_url=http://localhost:8000/v1/completions,num_concurrent=64,max_retries=3,tokenized_requests=False \
--tasks gsm8k \
--num_fewshot 5
Any lm-eval task can be used. The local-completions model type sends
requests to the /v1/completions endpoint, making it compatible with the ATOM
server without modification.
Source Files
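| File | Description |
|---|---|
| atom/entrypoints/openai_server.py | OpenAI-compatible API server (FastAPI + Uvicorn) |
| atom/examples/simple_inference.py | Simple batch inference example |
| atom/examples/profile_offline.py | Offline profiling tool |
| atom/benchmarks/benchmark_serving.py | Online serving benchmark ( |
| atom/benchmarks/backend_request_func.py | Async HTTP request functions for each backend ( |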