# ATOM Serving & Benchmarking Guide

ATOM (AiTer Optimized Model) is AMD's lightweight LLM inference engine built on [AITER](https://github.com/ROCm/aiter) kernels for ROCm/HIP GPUs. This guide covers the OpenAI-compatible serving API, programmatic engine usage, benchmarking tools, profiling, and speculative decoding.

---

## Quick Reference

```bash
# Start the OpenAI-compatible server
python -m atom.entrypoints.openai_server --model <model> --kv_cache_dtype fp8

# Run the online serving benchmark
python -m atom.benchmarks.benchmark_serving \
    --backend vllm --model <model> \
    --base-url http://localhost:8000 \
    --dataset-name random --random-input-len 1024 --random-output-len 128 \
    --num-prompts 1000 --request-rate inf --ignore-eos

# Simple inference example
python -m atom.examples.simple_inference --model <model> --kv_cache_dtype fp8

# Offline profiling
python -m atom.examples.profile_offline --model <model> --kv_cache_dtype fp8

# Accuracy validation with lm-eval
lm_eval --model local-completions \
    --model_args model=<model>,base_url=http://localhost:8000/v1/completions,num_concurrent=64,max_retries=3,tokenized_requests=False \
    --tasks gsm8k --num_fewshot 5
```

---

## 1. OpenAI-Compatible Server

The server is implemented in `atom/entrypoints/openai_server.py` using FastAPI and Uvicorn. It exposes OpenAI-compatible HTTP endpoints so that existing clients (curl, OpenAI SDK, lm-eval) work without modification.

### 1.1 Endpoints

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/v1/chat/completions` | Chat completion (`ChatCompletionRequest` -> `ChatCompletionResponse`) |
| `POST` | `/v1/completions` | Text completion (`CompletionRequest` -> `CompletionResponse`) |
| `GET` | `/v1/models` | List available models |
| `GET` | `/health` | Health check (returns `{"status": "ok"}`) |
| `POST` | `/start_profile` | Start torch profiler on the engine |
| `POST` | `/stop_profile` | Stop torch profiler and flush traces |

### 1.2 Request Models

**ChatCompletionRequest** fields:

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `model` | `Optional[str]` | `None` | Model name (validated against the loaded model) |
| `messages` | `Optional[List[ChatMessage]]` | `None` | List of chat messages (`role`, `content`) |
| `prompt` | `Optional[List[ChatMessage]]` | `None` | Alias for `messages` |
| `temperature` | `Optional[float]` | `1.0` | Sampling temperature |
| `top_p` | `Optional[float]` | `1.0` | Nucleus sampling threshold |
| `max_tokens` | `Optional[int]` | `256` | Maximum tokens to generate |
| `stop` | `Optional[List[str]]` | `None` | Stop strings |
| `ignore_eos` | `Optional[bool]` | `False` | Ignore end-of-sequence token |
| `stream` | `Optional[bool]` | `False` | Enable server-sent events streaming |
| `seed` | `Optional[int]` | `None` | Random seed |

**CompletionRequest** fields:

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `model` | `Optional[str]` | `None` | Model name |
| `prompt` | `str` | (required) | Text prompt |
| `temperature` | `Optional[float]` | `1.0` | Sampling temperature |
| `top_p` | `Optional[float]` | `1.0` | Nucleus sampling threshold |
| `max_tokens` | `Optional[int]` | `256` | Maximum tokens to generate |
| `stop` | `Optional[List[str]]` | `None` | Stop strings |
| `ignore_eos` | `Optional[bool]` | `False` | Ignore end-of-sequence token |
| `stream` | `Optional[bool]` | `False` | Enable SSE streaming |

### 1.3 Response Models

Both `ChatCompletionResponse` and `CompletionResponse` include:

- `id` -- unique request identifier (e.g. `chatcmpl-` or `cmpl-` prefixed)
- `object` -- `"chat.completion"` or `"text_completion"`
- `created` -- Unix timestamp
- `model` -- model name
- `choices` -- list of generated completions
- `usage` -- token counts (`prompt_tokens`, `completion_tokens`, `total_tokens`) plus `ttft_s`, `tpot_s`, and `latency_s` timing fields

Streaming responses use the SSE (Server-Sent Events) protocol with `data: [DONE]\n\n` as the termination signal.

### 1.4 Server Startup

```bash
python -m atom.entrypoints.openai_server \
    --model <model> \
    --kv_cache_dtype fp8 \
    --host 0.0.0.0 \
    --server-port 8000
```

Server-specific CLI arguments:

| Argument | Default | Description |
|----------|---------|-------------|
| `--host` | `0.0.0.0` | Bind address |
| `--server-port` | `8000` | HTTP port (note: `--port` is for internal engine communication) |

All `EngineArgs` arguments are also accepted (see Section 7 for the full list).

### 1.5 Example: curl

```bash
# Non-streaming chat completion
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-ai/DeepSeek-R1",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 128
    }'

# Streaming text completion
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "The capital of France is",
        "max_tokens": 64,
        "stream": true
    }'
```

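Because the endpoints follow the OpenAI wire format, the official `openai` Python SDK should also work unmodified. A minimal sketch (assumes `pip install openai`; the `api_key` value is a dummy, on the assumption that the server does not check it):

```python
from openai import OpenAI

# Point the SDK at the local ATOM server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Non-streaming chat completion.
resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)

# Streaming text completion; the SDK consumes the SSE stream until [DONE].
stream = client.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    prompt="The capital of France is",
    max_tokens=64,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].text, end="", flush=True)
print()
```
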
---

## 2. Programmatic API (LLMEngine)

The `LLMEngine` class in `atom/model_engine/llm_engine.py` provides a Python-native interface for inference without running an HTTP server.

### 2.1 Initialization

```python
from atom import LLMEngine, SamplingParams

engine = LLMEngine(model="deepseek-ai/DeepSeek-R1",
                   kv_cache_dtype="fp8",
                   tensor_parallel_size=8)
```

`LLMEngine.__init__(model, **kwargs)` accepts all `Config` field names as keyword arguments (e.g. `tensor_parallel_size`, `kv_cache_dtype`, `max_model_len`, `data_parallel_size`, `gpu_memory_utilization`).

### 2.2 SamplingParams

Defined in `atom/sampling_params.py`:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SamplingParams:
    temperature: float = 1.0
    max_tokens: int = 64
    ignore_eos: bool = False
    stop_strings: Optional[list[str]] = None
```

### 2.3 Core Methods

| Method | Signature | Description |
|--------|-----------|-------------|
| `generate` | `(prompts: list[str], sampling_params) -> list[dict]` | Synchronous batch generation; blocks until all prompts complete |
| `add_request` | `(prompt_or_tokens_list, sampling_params_list, stream_callback=None)` | Submit requests for asynchronous processing |
| `step` | `() -> list[Sequence]` | Retrieve completed sequences |
| `is_finished` | `() -> bool` | Check whether all pending requests have completed |
| `start_profile` | `()` | Start torch profiler on all workers |
| `stop_profile` | `()` | Stop torch profiler and write traces |
| `print_mtp_statistics` | `()` | Print speculative decoding acceptance statistics |

### 2.4 Synchronous Generation Example

```python
from atom import LLMEngine, SamplingParams

engine = LLMEngine(model="meta-llama/Meta-Llama-3-8B", kv_cache_dtype="fp8")
params = SamplingParams(temperature=0.6, max_tokens=256)

outputs = engine.generate(["Explain quantum computing in simple terms."], params)
for out in outputs:
    print(out["text"])
```

Each output dictionary contains: `text`, `token_ids`, `latency`, `finish_reason`, `num_tokens_input`, `num_tokens_output`, `ttft`, and `tpot`.

### 2.5 Asynchronous / Streaming Usage

```python
engine.add_request(
    prompt_or_tokens_list=["Hello world", "How are you?"],
    sampling_params_list=SamplingParams(temperature=0.8, max_tokens=128),
    stream_callback=my_callback,  # called per-token with RequestOutput
)

while not engine.is_finished():
    completed = engine.step()
    # process completed sequences
```

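Putting this together: the callback fires once per generated token with a `RequestOutput`. Its fields are not enumerated in this guide, so the callback below just prints the object, a deliberately conservative placeholder. A self-contained sketch:

```python
from atom import LLMEngine, SamplingParams

engine = LLMEngine(model="Qwen/Qwen3-0.6B", kv_cache_dtype="fp8")

def my_callback(output):
    # `output` is the per-token RequestOutput. Its exact schema is not
    # documented here, so print the whole object; replace this with a
    # field access once you know the dataclass layout.
    print(output, flush=True)

engine.add_request(
    prompt_or_tokens_list=["Hello world", "How are you?"],
    sampling_params_list=SamplingParams(temperature=0.8, max_tokens=128),
    stream_callback=my_callback,
)

# Drain the engine: step() returns finished Sequence objects as they complete.
finished = []
while not engine.is_finished():
    finished.extend(engine.step())
print(f"{len(finished)} sequences completed")
```
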
---

## 3. Simple Inference

The `atom/examples/simple_inference.py` script provides a quick way to validate model loading and generation.

### 3.1 Usage

```bash
python -m atom.examples.simple_inference \
    --model meta-llama/Meta-Llama-3-8B \
    --kv_cache_dtype fp8 \
    --temperature 0.6
```

### 3.2 What It Does

1. Parses all `EngineArgs` plus `--temperature` (default `0.6`).
2. Creates an `LLMEngine` via `EngineArgs.from_cli_args(args).create_engine()`.
3. Applies the model's chat template to four built-in prompts (English and Chinese) with `enable_thinking=True`.
4. Runs a warmup generation, then generates completions for the batch.
5. Calls `llm.print_mtp_statistics()` to report speculative decoding stats (if MTP is enabled).

---

## 4. Benchmarking

ATOM ships a comprehensive online serving benchmark in `atom/benchmarks/benchmark_serving.py` (adapted from vLLM's benchmarking tooling).

### 4.1 Metrics

The `BenchmarkMetrics` dataclass tracks:

| Metric | Abbreviation | Description |
|--------|--------------|-------------|
| Time to First Token | **TTFT** | Latency from request submission to the first generated token |
| Time per Output Token | **TPOT** | Average latency per output token (excluding the first) |
| Inter-Token Latency | **ITL** | Latency between successive output tokens |
| End-to-End Latency | **E2EL** | Total latency from request send to full response receipt |
| Request Throughput | -- | Completed requests per second |
| Output Token Throughput | -- | Generated tokens per second |
| Total Token Throughput | -- | (input + output) tokens per second |
| Request Goodput | -- | Requests per second meeting SLO targets |

For each latency metric, mean, median, standard deviation, and configurable percentiles (default: P99) are reported.

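To make these definitions concrete, here is how the per-request latency metrics relate for a single streamed response (an illustrative sketch with hypothetical timestamps, not the `BenchmarkMetrics` internals):

```python
# Per-request timing for one streamed response (all times in seconds).
# send_time: when the request was sent; token_times: arrival time of each
# output token. Values are hypothetical, for illustration only.
send_time = 0.00
token_times = [0.25, 0.30, 0.36, 0.41, 0.47]  # 5 output tokens

ttft = token_times[0] - send_time     # TTFT: latency to the first token
e2el = token_times[-1] - send_time    # E2EL: total request latency
# ITL: one sample per gap between successive output tokens.
itl = [b - a for a, b in zip(token_times, token_times[1:])]
# TPOT: average per-token latency, excluding the first token.
tpot = (e2el - ttft) / (len(token_times) - 1)

print(f"TTFT={ttft:.3f}s  TPOT={tpot:.3f}s  E2EL={e2el:.3f}s")
print("ITL samples:", [f"{x:.3f}" for x in itl])
```
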
### 4.2 Key CLI Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| `--backend` | `vllm` | Backend type. Choices: `tgi`, `vllm`, `lmdeploy`, `deepspeed-mii`, `openai`, `openai-chat`, `tensorrt-llm`, `scalellm`, `sglang` |
| `--model` | (required) | Model name or path |
| `--base-url` | `None` | Server base URL (e.g. `http://localhost:8000`) |
| `--host` | `127.0.0.1` | Server host (used when `--base-url` is not set) |
| `--port` | `8000` | Server port (used when `--base-url` is not set) |
| `--endpoint` | `/v1/completions` | API endpoint path |
| `--dataset-name` | `sharegpt` | Dataset type: `sharegpt`, `burstgpt`, `sonnet`, `random`, `hf` |
| `--dataset-path` | `None` | Path to dataset file or HuggingFace dataset ID |
| `--num-prompts` | `1000` | Number of prompts to benchmark |
| `--request-rate` | `inf` | Requests per second (`inf` = send all at once) |
| `--burstiness` | `1.0` | Burstiness factor (1.0 = Poisson process) |
| `--max-concurrency` | `None` | Maximum concurrent requests |
| `--ignore-eos` | `False` | Ignore EOS token in generation |
| `--save-result` | `False` | Save results to JSON |
| `--result-dir` | `None` | Directory for result JSON files |
| `--result-filename` | `None` | Custom filename for results |
| `--percentile-metrics` | `ttft,tpot,itl` | Comma-separated metrics to report percentiles for |
| `--metric-percentiles` | `99` | Comma-separated percentile values (e.g. `25,50,75,99`) |
| `--goodput` | `None` | SLO targets as `KEY:VALUE` pairs (e.g. `ttft:100 tpot:50`) |
| `--profile` | `False` | Enable torch profiler during the benchmark run |
| `--tokenizer` | `None` | Custom tokenizer name or path |
| `--seed` | `0` | Random seed |

**Random dataset options:**

| Argument | Default | Description |
|----------|---------|-------------|
| `--random-input-len` | `1024` | Input token length |
| `--random-output-len` | `128` | Output token length |
| `--random-range-ratio` | `1.0` | Length variation ratio |
| `--random-prefix-len` | `0` | Fixed prefix token length |
| `--use-chat-template` | `False` | Apply chat template to random prompts |

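The `--request-rate` and `--burstiness` arguments above interact. In the vLLM benchmark this tool is adapted from, inter-arrival gaps are drawn from a Gamma distribution whose shape parameter is the burstiness factor; that behavior is assumed here to carry over to ATOM. A sketch of the arrival process (`arrival_gaps` is a hypothetical helper; the real tool special-cases `--request-rate inf` by sending all requests immediately):

```python
import numpy as np

def arrival_gaps(request_rate: float, burstiness: float, n: int,
                 seed: int = 0) -> np.ndarray:
    """Gamma-distributed inter-arrival gaps with mean 1 / request_rate.

    burstiness == 1.0 reduces to exponential gaps (a Poisson process);
    values < 1.0 give burstier traffic, values > 1.0 smoother traffic.
    """
    rng = np.random.default_rng(seed)
    scale = 1.0 / (request_rate * burstiness)
    return rng.gamma(shape=burstiness, scale=scale, size=n)

gaps = arrival_gaps(request_rate=10.0, burstiness=1.0, n=1000)
print(f"mean gap: {gaps.mean():.3f}s (target {1 / 10.0:.3f}s)")
```
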
### 4.3 Backend Request Functions

Defined in `atom/benchmarks/backend_request_func.py`:

| Backend Key | Function | Protocol |
|-------------|----------|----------|
| `vllm` | `async_request_openai_completions` | OpenAI Completions API (streaming) |
| `openai` | `async_request_openai_completions` | OpenAI Completions API (streaming) |
| `openai-chat` | `async_request_openai_chat_completions` | OpenAI Chat Completions API (streaming) |
| `tgi` | `async_request_tgi` | TGI `generate_stream` |
| `tensorrt-llm` | `async_request_trt_llm` | TRT-LLM `generate_stream` |
| `deepspeed-mii` | `async_request_deepspeed_mii` | DeepSpeed-MII |
| `lmdeploy` | `async_request_openai_completions` | OpenAI Completions API |
| `scalellm` | `async_request_openai_completions` | OpenAI Completions API |
| `sglang` | `async_request_openai_completions` | OpenAI Completions API |

Each function uses `RequestFuncInput` and returns a `RequestFuncOutput` with timing data (`ttft`, `itl`, `latency`, `tpot`).

### 4.4 Full Benchmark Example

```bash
# 1. Start the server
python -m atom.entrypoints.openai_server \
    --kv_cache_dtype fp8 -tp 8 --model deepseek-ai/DeepSeek-R1

# 2. Run benchmark
MODEL=deepseek-ai/DeepSeek-R1
ISL=1024
OSL=1024
CONC=128
PORT=8000
RESULT_FILENAME=Deepseek-R1-result

python -m atom.benchmarks.benchmark_serving \
    --model=$MODEL --backend=vllm --base-url=http://localhost:$PORT \
    --dataset-name=random \
    --random-input-len=$ISL --random-output-len=$OSL \
    --random-range-ratio 0.8 \
    --num-prompts=$(( $CONC * 10 )) \
    --max-concurrency=$CONC \
    --request-rate=inf --ignore-eos \
    --save-result --percentile-metrics="ttft,tpot,itl,e2el" \
    --result-dir=./ --result-filename=$RESULT_FILENAME.json
```

---

## 5. Profiling

ATOM supports PyTorch profiling via environment variables, HTTP endpoints, and the programmatic API.

### 5.1 Configuration

| Mechanism | Description |
|-----------|-------------|
| `--torch-profiler-dir <dir>` | CLI arg to set the trace output directory |
| `ATOM_TORCH_PROFILER_DIR` env var | Sets the default `torch_profiler_dir` in `Config` |
| `ATOM_PROFILER_MORE=1` env var | Enables detailed profiling: `record_shapes`, `with_stack`, `profile_memory` |

When a profiler directory is configured, each worker saves traces to a rank-specific subdirectory:

- Multi-GPU with DP: `{profiler_dir}/dp{dp_rank}_tp{rank}/`
- Single-GPU / TP-only: `{profiler_dir}/rank_{rank}/`

Traces are saved in gzip-compressed TensorBoard format and can be viewed with `tensorboard --logdir <dir>` or Chrome's `chrome://tracing`.

### 5.2 Online Profiling (HTTP)

While the server is running, start and stop profiling with HTTP requests:

```bash
# Start profiling
curl -s -S -X POST http://127.0.0.1:8000/start_profile

# ... run your workload ...

# Stop profiling and flush traces
curl -s -S -X POST http://127.0.0.1:8000/stop_profile
```

The server must be started with `--torch-profiler-dir` or with `ATOM_TORCH_PROFILER_DIR` set for these endpoints to produce traces.

### 5.3 Programmatic Profiling

```python
from atom import LLMEngine, SamplingParams

engine = LLMEngine(model="Qwen/Qwen3-0.6B", torch_profiler_dir="./traces")

engine.start_profile()
outputs = engine.generate(["Hello"], SamplingParams(max_tokens=32))
engine.stop_profile()

# Traces written to ./traces/rank_0/
```

### 5.4 Offline Profiling Script

`atom/examples/profile_offline.py` provides a self-contained offline profiling workflow:

```bash
python -m atom.examples.profile_offline \
    --model Qwen/Qwen3-0.6B \
    --kv_cache_dtype fp8 \
    --torch-profiler-dir ./profiler_traces \
    --input-length 128 \
    --output-length 32 \
    --bs 4
```

Script-specific arguments:

| Argument | Default | Description |
|----------|---------|-------------|
| `--input-length` | `128` | Approximate input prompt length in tokens |
| `--output-length` | `32` | Output generation length in tokens |
| `--bs` | `1` | Batch size (number of parallel requests) |
| `--random-input` | `False` | Use random token input instead of predefined text |

If `--torch-profiler-dir` is not specified, the script defaults to `./profiler_traces`.

### 5.5 Profiling During Benchmarks

The benchmark tool can trigger profiling automatically via `--profile`:

```bash
python -m atom.benchmarks.benchmark_serving \
    --model <model> --backend vllm \
    --base-url http://localhost:8000 \
    --dataset-name random --num-prompts 100 \
    --profile
```

This sends `POST /start_profile` before the benchmark and `POST /stop_profile` after completion.

---

## 6. Speculative Decoding (MTP)

ATOM supports Multi-Token Prediction (MTP) for DeepSeek models using the Eagle-style speculative decoding framework.

### 6.1 Architecture

- **EagleProposer** (`atom/spec_decode/eagle.py`): Loads and runs the draft (MTP) model to propose speculative tokens. Supports the `DeepSeekMTPModel` architecture via `DeepSeekMTP`.
- **RejectionSampler** (`atom/model_ops/rejection_sampler.py`): Implements greedy rejection sampling with a Triton kernel. Compares draft token IDs against the target model's argmax and accepts matching prefixes; appends a bonus token if all drafts are accepted.

### 6.2 Configuration

Enable MTP via CLI arguments:

```bash
python -m atom.entrypoints.openai_server \
    --model deepseek-ai/DeepSeek-R1 \
    --kv_cache_dtype fp8 -tp 8 \
    --method mtp \
    --num-speculative-tokens 1
```

| Argument | Default | Description |
|----------|---------|-------------|
| `--method` | `None` | Speculative method; currently only `mtp` is supported |
| `--num-speculative-tokens` | `1` | Number of draft tokens per iteration (draft model runs this many autoregressive steps) |

### 6.3 MTP Statistics

ATOM tracks acceptance statistics at runtime:

- **total_draft_tokens**: Total number of draft tokens proposed
- **total_accepted_tokens**: Number of draft tokens accepted by rejection sampling
- **acceptance_rate**: Ratio of accepted to draft tokens

Statistics are logged every 1000 draft tokens and can be printed on demand:

```python
engine.print_mtp_statistics()
```

Example output:

```
MTP Statistics:
Total draft tokens: 5000
Accepted tokens: 4250
Acceptance rate: 85.00%
```

### 6.4 How Rejection Sampling Works

1. The draft model generates `num_speculative_tokens` token predictions autoregressively using argmax.
2. The target model verifies all draft tokens in a single forward pass.
3. The `rejection_greedy_sample_kernel` (Triton) compares each draft token against the target model's argmax:
   - If they match, the token is accepted.
   - On the first mismatch, the target model's token replaces it and all subsequent draft tokens are discarded.
   - If all draft tokens match, a bonus token from the target model is appended.

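The production kernel is Triton, but the accept/reject rule above is easy to restate in plain Python. A minimal reference sketch (illustrative only, not the `rejection_greedy_sample_kernel` implementation; `greedy_rejection_sample` is a hypothetical name):

```python
def greedy_rejection_sample(draft_tokens: list[int],
                            target_argmax: list[int],
                            bonus_token: int) -> list[int]:
    """Accept the longest prefix of draft tokens matching the target model's
    argmax; on the first mismatch, substitute the target token and stop.
    If every draft token matches, append the bonus token.

    target_argmax[i] is the target model's argmax at draft position i.
    """
    accepted = []
    for draft, target in zip(draft_tokens, target_argmax):
        if draft == target:
            accepted.append(draft)      # match: keep the draft token
        else:
            accepted.append(target)     # mismatch: target token wins,
            return accepted             # remaining drafts are discarded
    return accepted + [bonus_token]     # all accepted: add the bonus token

# Example: 2 of 3 drafts accepted; the target's token replaces the third.
print(greedy_rejection_sample([5, 9, 2], [5, 9, 7], bonus_token=11))  # [5, 9, 7]
```
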
---

## 7. Deployment Examples

### 7.1 Single-GPU

```bash
python -m atom.entrypoints.openai_server \
    --model Qwen/Qwen3-0.6B \
    --kv_cache_dtype fp8
```

### 7.2 Multi-GPU with Tensor Parallelism

```bash
python -m atom.entrypoints.openai_server \
    --model deepseek-ai/DeepSeek-R1 \
    --kv_cache_dtype fp8 \
    -tp 8
```

### 7.3 Docker Deployment

```bash
# Pull the ROCm PyTorch image
docker pull rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.8.0

# Launch container
docker run -it --network=host \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    -v $HOME:/home/$USER \
    -v /mnt:/mnt \
    -v /data:/data \
    --shm-size=16G \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.8.0

# Inside the container
pip install amd-aiter
git clone https://github.com/ROCm/ATOM.git && cd ATOM && pip install .

# Start serving
python -m atom.entrypoints.openai_server \
    --model deepseek-ai/DeepSeek-R1 \
    --kv_cache_dtype fp8 -tp 8
```

### 7.4 Engine CLI Arguments (EngineArgs)

These arguments are available for all entrypoints (server, examples, and any script using `EngineArgs.add_cli_args`):

| Argument | Default | Description |
|----------|---------|-------------|
| `--model` | `Qwen/Qwen3-0.6B` | Model name or path |
| `--trust-remote-code` | `False` | Trust remote code from HuggingFace |
| `--tensor-parallel-size`, `-tp` | `1` | Tensor parallel size |
| `--data-parallel-size`, `-dp` | `1` | Data parallel size |
| `--enforce-eager` | `False` | Disable CUDA graph capture; use eager execution |
| `--enable_prefix_caching` | `False` | Enable prefix caching |
| `--port` | `8006` | Internal engine communication port |
| `--kv_cache_dtype` | `bf16` | KV cache dtype: `bf16` or `fp8` |
| `--block-size` | `16` | KV cache block size |
| `--max-model-len` | `None` | Maximum context length (defaults to HF config) |
| `--max-num-batched-tokens` | `16384` | Maximum tokens per batch |
| `--max-num-seqs` | `512` | Maximum sequences per batch |
| `--gpu-memory-utilization` | `0.9` | GPU memory utilization (0.0 to 1.0) |
| `--scheduler-delay-factor` | `0.0` | Delay factor before scheduling next prompt |
| `--cudagraph-capture-sizes` | `[1,2,4,...,256]` | Batch sizes for CUDA graph capture |
| `--level` | `3` | Compilation level (0-3); 3 = torch.compile |
| `--load_dummy` | `False` | Skip loading model weights (for testing) |
| `--enable-expert-parallel` | `False` | Enable expert parallelism for MoE |
| `--enable-dp-attention` | `False` | Enable data-parallel attention |
| `--torch-profiler-dir` | `None` | Directory for torch profiler traces |
| `--method` | `None` | Speculative decoding method (`mtp`) |
| `--num-speculative-tokens` | `1` | Number of speculative tokens per step |

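Custom scripts can reuse the same flag set. A minimal sketch built from the pieces this guide documents (`EngineArgs.add_cli_args` and `EngineArgs.from_cli_args(...).create_engine()` from Section 3.2), assuming `EngineArgs` is importable from `atom/model_engine/arg_utils.py` as listed in the Source Files table:

```python
import argparse

from atom import SamplingParams
from atom.model_engine.arg_utils import EngineArgs

# Reuse the full EngineArgs flag set (Section 7.4) plus one custom flag.
parser = argparse.ArgumentParser(description="Custom ATOM batch runner")
EngineArgs.add_cli_args(parser)
parser.add_argument("--prompt", default="Hello!", help="Prompt to run")
args = parser.parse_args()

# Same construction path simple_inference.py uses (Section 3.2).
engine = EngineArgs.from_cli_args(args).create_engine()
outputs = engine.generate([args.prompt], SamplingParams(max_tokens=64))
print(outputs[0]["text"])
```
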
---

## 8. Accuracy Validation

ATOM supports accuracy validation through the [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness) framework via the OpenAI-compatible API.

### 8.1 Setup

```bash
pip install "lm-eval[api]"
```

### 8.2 Run Evaluation

Start an ATOM server, then run lm-eval against it:

```bash
# Start server
python -m atom.entrypoints.openai_server \
    --model meta-llama/Meta-Llama-3-8B \
    --kv_cache_dtype fp8

# Run evaluation
lm_eval --model local-completions \
    --model_args model=meta-llama/Meta-Llama-3-8B,base_url=http://localhost:8000/v1/completions,num_concurrent=64,max_retries=3,tokenized_requests=False \
    --tasks gsm8k \
    --num_fewshot 5
```

Any lm-eval task can be used. The `local-completions` model type sends requests to the `/v1/completions` endpoint, making it compatible with the ATOM server without modification.

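Model loading can take a while, and lm-eval fails fast if it starts before the server is ready. A small readiness gate that polls the `/health` endpoint from Section 1.1 (a sketch assuming the `requests` package is installed; `wait_for_server` is a hypothetical helper):

```python
import time

import requests

def wait_for_server(base_url: str = "http://localhost:8000",
                    timeout_s: float = 600.0) -> None:
    """Block until GET /health returns {"status": "ok"} or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            r = requests.get(f"{base_url}/health", timeout=5)
            if r.status_code == 200 and r.json().get("status") == "ok":
                return
        except requests.ConnectionError:
            pass  # server not accepting connections yet
        time.sleep(2)
    raise TimeoutError(f"server at {base_url} not healthy after {timeout_s}s")

wait_for_server()
print("server ready; safe to launch lm_eval")
```
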
---

## Source Files

| File | Description |
|------|-------------|
| `atom/entrypoints/openai_server.py` | OpenAI-compatible API server (FastAPI + Uvicorn) |
| `atom/model_engine/llm_engine.py` | `LLMEngine` programmatic API |
| `atom/sampling_params.py` | `SamplingParams` dataclass |
| `atom/model_engine/arg_utils.py` | `EngineArgs` CLI argument definitions and engine factory |
| `atom/examples/simple_inference.py` | Simple batch inference example |
| `atom/examples/profile_offline.py` | Offline profiling tool |
| `atom/benchmarks/benchmark_serving.py` | Online serving benchmark (`BenchmarkMetrics`, dataset sampling, result reporting) |
| `atom/benchmarks/backend_request_func.py` | Async HTTP request functions for each backend (`RequestFuncInput`, `RequestFuncOutput`, `ASYNC_REQUEST_FUNCS`) |
| `atom/benchmarks/benchmark_utils.py` | `convert_to_pytorch_benchmark_format` utility |
| `atom/spec_decode/eagle.py` | `EagleProposer` -- MTP draft model for DeepSeek speculative decoding |
| `atom/model_ops/rejection_sampler.py` | `RejectionSampler` with Triton greedy rejection kernel |
| `atom/config.py` | `Config`, `CompilationConfig`, `SpeculativeConfig` dataclasses |
| `atom/model_engine/model_runner.py` | `ModelRunner` with `start_profiler`/`stop_profiler` and MTP statistics |