# Quickstart

This guide will get you started with ATOM in 5 minutes.
## Serving a Model
```python
from atom import LLMEngine, SamplingParams

# Load the model
llm = LLMEngine(
    model="meta-llama/Llama-2-7b-hf",
    gpu_memory_utilization=0.9,
    max_model_len=4096,
)

# Create sampling parameters
sampling_params = SamplingParams(max_tokens=50, temperature=0.8)

# Generate text (note: prompts must be a list)
outputs = llm.generate(["Hello, my name is"], sampling_params)
print(outputs[0])
```
## Batch Inference
```python
from atom import LLMEngine, SamplingParams

llm = LLMEngine(model="meta-llama/Llama-2-7b-hf")

# Batch of prompts
prompts = [
    "The capital of France is",
    "The largest ocean is",
    "Python is a",
]

# Create sampling parameters
sampling_params = SamplingParams(max_tokens=20, temperature=0.7)

# Generate in batch; outputs is a list of strings
outputs = llm.generate(prompts, sampling_params)

for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}")
    print(f"Output: {output}\n")
```
## Distributed Serving

Multi-GPU serving with tensor parallelism:
```python
from atom import LLMEngine, SamplingParams

# Shard the model across 4 GPUs with tensor parallelism
llm = LLMEngine(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(max_tokens=100, temperature=0.7)
outputs = llm.generate(["Tell me about AMD GPUs"], sampling_params)
print(outputs[0])
```
## API Server

Start a RESTful API server:
```shell
python -m atom.entrypoints.openai_server \
    --model meta-llama/Llama-2-7b-hf \
    --host 0.0.0.0 \
    --port 8000
```
Query the server:
```python
import requests

response = requests.post(
    "http://localhost:8000/generate",
    json={
        "prompt": "Hello, world!",
        "max_tokens": 50,
    },
)
print(response.json()["text"])
```
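When scripting against the server, it can help to build and sanity-check the request body in one place before sending it. A minimal sketch, assuming the field names from the example above (`build_generate_payload` is a hypothetical helper, not part of ATOM):

```python
def build_generate_payload(prompt, max_tokens=50, temperature=None):
    """Build the JSON body for the /generate endpoint shown above."""
    if not prompt:
        raise ValueError("prompt must be a non-empty string")
    payload = {"prompt": prompt, "max_tokens": max_tokens}
    # temperature is optional; omit it to use the server's default
    if temperature is not None:
        payload["temperature"] = temperature
    return payload
```

Usage: `requests.post("http://localhost:8000/generate", json=build_generate_payload("Hello, world!"))`.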
## Performance Tips

- **GPU Memory**: Set `gpu_memory_utilization` to 0.9-0.95, leaving a little headroom for activations.
- **Batch Size**: Increase `max_num_batched_tokens` for higher throughput.
- **KV Cache**: Tune `block_size` to match your workload.
- **Compilation**: Enable CUDAGraph capture for repeated inference.
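The tips above can be combined into a single throughput-oriented engine configuration. A hedged sketch: the parameter names follow this guide, but the values are illustrative starting points, so check your ATOM version for exact names and defaults:

```python
from atom import LLMEngine

# Throughput-oriented configuration (values are illustrative, not tuned)
llm = LLMEngine(
    model="meta-llama/Llama-2-7b-hf",
    gpu_memory_utilization=0.95,   # GPU memory: high, but with some headroom
    max_num_batched_tokens=8192,   # batch size: more tokens per step for throughput
    block_size=16,                 # KV cache: block granularity, tune per workload
)
```

CUDAGraph enablement is version-specific; see the configuration guide linked below.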
## Next Steps

- ATOM Architecture Guide - Understand the ATOM architecture
- ATOM Configuration Guide - Configure ATOM for your workload
- ATOM Serving & Benchmarking Guide - Measure performance