Quickstart

This guide will get you started with ATOM in 5 minutes.

Serving a Model

from atom import LLMEngine, SamplingParams

# Load model
llm = LLMEngine(
    model="meta-llama/Llama-2-7b-hf",
    gpu_memory_utilization=0.9,
    max_model_len=4096
)

# Create sampling parameters
sampling_params = SamplingParams(max_tokens=50, temperature=0.8)

# Generate text (note: prompts must be a list)
outputs = llm.generate(["Hello, my name is"], sampling_params)
print(outputs[0])
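
The temperature of 0.8 above adds randomness to sampling, so repeated runs can return different text. If ATOM follows the common convention (an assumption; this guide does not state it), temperature=0.0 selects greedy decoding, making output deterministic:

# Assumption: temperature=0.0 means greedy (deterministic) decoding,
# as in most sampling APIs. Reuses llm and SamplingParams from above.
greedy_params = SamplingParams(max_tokens=50, temperature=0.0)
outputs = llm.generate(["Hello, my name is"], greedy_params)
print(outputs[0])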

Batch Inference

from atom import LLMEngine, SamplingParams

llm = LLMEngine(model="meta-llama/Llama-2-7b-hf")

# Batch prompts
prompts = [
    "The capital of France is",
    "The largest ocean is",
    "Python is a"
]

# Create sampling parameters
sampling_params = SamplingParams(max_tokens=20, temperature=0.7)

# Generate in batch
outputs = llm.generate(prompts, sampling_params)

# outputs is a list of strings, one completion per prompt, in prompt order
for i, output in enumerate(outputs):
    print(f"Prompt: {prompts[i]}")
    print(f"Output: {output}\n")

Distributed Serving

Serve larger models across multiple GPUs with tensor parallelism:

from atom import LLMEngine, SamplingParams

# Use 4 GPUs with tensor parallelism
llm = LLMEngine(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.95
)

sampling_params = SamplingParams(max_tokens=100, temperature=0.7)
outputs = llm.generate(["Tell me about AMD GPUs"], sampling_params)
print(outputs[0])
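
To size tensor parallelism from the available hardware, you can query the device count with PyTorch; ROCm builds of PyTorch report AMD GPUs through the same call. A sketch, assuming (as is typical for tensor parallelism, though not stated here) that tensor_parallel_size must evenly divide the model's attention-head count:

import torch
from atom import LLMEngine

# torch.cuda.device_count() also counts AMD GPUs under ROCm builds of
# PyTorch. Assumption: one tensor-parallel rank per GPU, and the rank
# count must divide the model's attention-head count.
num_gpus = torch.cuda.device_count()

llm = LLMEngine(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=num_gpus,
    gpu_memory_utilization=0.95
)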

API Server

Start a RESTful API server:

python -m atom.entrypoints.openai_server \
    --model meta-llama/Llama-2-7b-hf \
    --host 0.0.0.0 \
    --port 8000

Query the server:

import requests

response = requests.post(
    "http://localhost:8000/generate",
    json={
        "prompt": "Hello, world!",
        "max_tokens": 50
    }
)

print(response.json()["text"])
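
The entrypoint is named openai_server, which suggests (though this guide does not confirm) that OpenAI-compatible routes such as /v1/completions may also be available alongside /generate. If so, a query might look like:

import requests

# Assumption: an OpenAI-style /v1/completions route exists, and the
# response follows the OpenAI completions schema. The documented route
# for this server is /generate, shown above.
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Llama-2-7b-hf",
        "prompt": "Hello, world!",
        "max_tokens": 50
    }
)

print(response.json()["choices"][0]["text"])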

Performance Tips

  1. GPU Memory: Set gpu_memory_utilization to 0.9-0.95 so most of GPU memory is available for model weights and the KV cache; lower it if other processes share the GPU.

  2. Batch Size: Increase max_num_batched_tokens to raise throughput, at some cost in per-request latency.

  3. KV Cache: Tune block_size to your workload; it sets the allocation granularity of the KV cache.

  4. Compilation: Enable CUDAGraph for repeated inference to cut kernel-launch overhead.

A sketch combining these settings follows.
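
This is a minimal sketch only, assuming gpu_memory_utilization, max_num_batched_tokens, and block_size (all named in the tips above) are LLMEngine keyword arguments; the values are illustrative, not tuned:

from atom import LLMEngine

llm = LLMEngine(
    model="meta-llama/Llama-2-7b-hf",
    gpu_memory_utilization=0.92,    # tip 1: keep within 0.9-0.95
    max_num_batched_tokens=8192,    # tip 2: larger values favor throughput
    block_size=16                   # tip 3: KV-cache allocation granularity
)
# Tip 4: this guide mentions enabling CUDAGraph but does not name the
# engine option that controls it; check ATOM's engine arguments.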

Next Steps