Quickstart
==========

This guide gets you started with ATOM in five minutes.

Serving a Model
---------------

.. code-block:: python

    from atom import LLMEngine, SamplingParams

    # Load the model
    llm = LLMEngine(
        model="meta-llama/Llama-2-7b-hf",
        gpu_memory_utilization=0.9,
        max_model_len=4096
    )

    # Create sampling parameters
    sampling_params = SamplingParams(max_tokens=50, temperature=0.8)

    # Generate text (note: prompts must be a list)
    outputs = llm.generate(["Hello, my name is"], sampling_params)
    print(outputs[0])

Batch Inference
---------------

.. code-block:: python

    from atom import LLMEngine, SamplingParams

    llm = LLMEngine(model="meta-llama/Llama-2-7b-hf")

    # Batch prompts
    prompts = [
        "The capital of France is",
        "The largest ocean is",
        "Python is a"
    ]

    # Create sampling parameters
    sampling_params = SamplingParams(max_tokens=20, temperature=0.7)

    # Generate in batch; outputs is a list of strings,
    # one per prompt, in the same order
    outputs = llm.generate(prompts, sampling_params)

    for i, output in enumerate(outputs):
        print(f"Prompt: {prompts[i]}")
        print(f"Output: {output}\n")

Distributed Serving
-------------------

Serve models that do not fit on a single GPU by sharding them with tensor parallelism:

.. code-block:: python

    from atom import LLMEngine, SamplingParams

    # Shard the model across 4 GPUs with tensor parallelism
    llm = LLMEngine(
        model="meta-llama/Llama-2-70b-hf",
        tensor_parallel_size=4,
        gpu_memory_utilization=0.95
    )

    sampling_params = SamplingParams(max_tokens=100, temperature=0.7)
    outputs = llm.generate(["Tell me about AMD GPUs"], sampling_params)
    print(outputs[0])

API Server
----------

Start a RESTful API server:

.. code-block:: bash

    python -m atom.entrypoints.openai_server \
        --model meta-llama/Llama-2-7b-hf \
        --host 0.0.0.0 \
        --port 8000

Query the server:

.. code-block:: python

    import requests

    response = requests.post(
        "http://localhost:8000/generate",
        json={
            "prompt": "Hello, world!",
            "max_tokens": 50
        }
    )
    print(response.json()["text"])

Performance Tips
----------------

1. **GPU memory**: Set ``gpu_memory_utilization`` to 0.9-0.95 so most of the GPU is available for model weights and KV cache
2. **Batch size**: Increase ``max_num_batched_tokens`` to trade per-request latency for throughput
3. **KV cache**: Tune ``block_size`` (the KV-cache block granularity) to match your workload
4. **Compilation**: Enable CUDAGraph for repeated inference

A combined tuning sketch appears at the end of this guide.

Next Steps
----------

* :doc:`architecture_guide` - Understand ATOM's architecture
* :doc:`configuration_guide` - Configure ATOM for your workload
* :doc:`serving_benchmarking_guide` - Measure performance
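
Tuning Sketch
-------------

A minimal sketch combining tips 1-3 above. It assumes ``max_num_batched_tokens`` and ``block_size`` are accepted directly by the ``LLMEngine`` constructor, which this guide does not show; check :doc:`configuration_guide` for the actual parameter names. The CUDAGraph toggle from tip 4 is omitted because its flag is not named here.

.. code-block:: python

    from atom import LLMEngine, SamplingParams

    # Assumed constructor arguments: max_num_batched_tokens and block_size
    # are taken from the Performance Tips above; verify them against the
    # configuration guide before relying on this snippet.
    llm = LLMEngine(
        model="meta-llama/Llama-2-7b-hf",
        gpu_memory_utilization=0.9,    # Tip 1: 0.9-0.95, leaving a little headroom
        max_num_batched_tokens=8192,   # Tip 2: larger batches, higher throughput
        block_size=16,                 # Tip 3: KV-cache block granularity
    )

    sampling_params = SamplingParams(max_tokens=64, temperature=0.7)
    outputs = llm.generate(["The key to high GPU utilization is"], sampling_params)
    print(outputs[0])

The example values (8192 batched tokens, block size 16) are illustrative starting points, not recommendations; measure your own workload as described in :doc:`serving_benchmarking_guide` before settling on them.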