Getting Started

  • Installation
    • Requirements
    • Installation Methods
      • From Source
      • Docker Installation
    • Environment Variables
    • Verification
    • Troubleshooting
  • Quickstart
    • Serving a Model
    • Batch Inference
    • Distributed Serving
    • API Server
    • Performance Tips
    • Next Steps

User Guides

  • ATOM Architecture Guide
    • 1. System Overview
    • 2. Component Architecture
    • 3. Request Lifecycle
    • 4. Forward Context Pattern
    • 5. Multi-Process Architecture
    • 6. Sequence Lifecycle
    • Source Files
  • ATOM Configuration Guide
    • Quick Reference
    • 1. Master Configuration (Config)
    • 2. Compilation Configuration (CompilationConfig)
      • 2.1 Compilation Levels (CompilationLevel)
      • 2.2 CompilationConfig Fields
      • 2.3 CUDA Graph Mode (CUDAGraphMode)
    • 3. Quantization Configuration (QuantizationConfig & LayerQuantConfig)
      • 3.1 LayerQuantConfig Fields
      • 3.2 QuantizationConfig Attributes
      • 3.3 QuantType Values (from AITER)
      • 3.4 Supported Quantization Dtypes
      • 3.5 Auto-Detection from HuggingFace
      • 3.6 Layer-Level Quantization Dispatch
    • 4. Parallel Configuration (ParallelConfig)
    • 5. Speculative Decoding Configuration (SpeculativeConfig)
    • 6. Sampling Parameters (SamplingParams)
    • 7. CLI Arguments (EngineArgs)
    • 8. Environment Variables
      • 8.1 Variables Registered in atom/utils/envs.py
      • 8.2 Additional Environment Variables (Used Outside envs.py)
    • 9. Decision Tree – Choosing a Compilation Level
    • Source Files
  • ATOM Model Support Guide
    • Quick Reference
    • 1. Supported Model Architectures
    • 2. Model Architecture Details
      • Qwen3 (Qwen3ForCausalLM)
      • Qwen3-MoE (Qwen3MoeForCausalLM)
      • Llama (LlamaForCausalLM)
      • Mixtral (MixtralForCausalLM)
      • DeepSeek V2/V3 (DeepseekV2ForCausalLM)
      • DeepSeek MTP (DeepSeekMTP)
      • GPT-OSS (GptOssForCausalLM)
      • GLM4-MoE (Glm4MoeForCausalLM)
    • 3. Weight Loading
      • Function Signature
      • Loading Flow
      • Layers Beyond num_hidden_layers
    • 4. Adding a New Model
      • Step 1: Create the Model File
      • Step 2: Implement Layer Classes
      • Step 3: Implement the Model and CausalLM Classes
      • Step 4: Register the Model
      • Step 5: Handle Weight Loading
    • 5. Model-Specific Optimizations
      • Llama: Fused RMSNorm+Quant and SiLU+Mul+Quant
      • DeepSeek V2/V3: MLA + Fused Input Norm + QK Norm Fusion
      • Qwen3-MoE: QK Norm + RoPE + Cache + Quant Fusion
      • MTP: DeepSeek Multi-Token Prediction
    • Source Files
  • ATOM Model Operations Guide
    • Quick Reference
    • 1. AITER Integration Overview
      • AITER Kernel Mapping Table
    • 2. Linear Operations
      • 2.1 Class Hierarchy
      • 2.2 Quantization Dispatch
      • 2.3 Tensor Parallel Sharding
      • 2.4 Weight Processing
    • 3. Attention Operations
      • 3.1 Base: Attention (base_attention.py)
      • 3.2 Multi-Head Attention (attention_mha.py)
      • 3.3 Multi-head Latent Attention (attention_mla.py)
      • 3.4 Backend Abstraction (attentions/backends.py)
      • 3.5 KV Cache Operations
    • 4. Mixture of Experts (MoE)
      • 4.1 FusedMoE Class (moe.py)
      • 4.2 Quantization Methods
      • 4.3 TopK Routing (topK.py)
      • 4.4 FusedMoEParallelConfig
      • 4.5 MORI Integration (fused_moe/mori_prepare_finalize.py)
      • 4.6 MoE Quantization Config (fused_moe/config.py)
      • 4.7 Triton MoE Fallback (fused_moe_triton.py)
    • 5. Normalization
      • 5.1 RMSNorm (layernorm.py)
      • 5.2 LayerNorm (layernorm.py)
    • 6. Activation Functions
      • 6.1 SiluAndMul (activation.py)
    • 7. Embedding & Output Head
      • 7.1 VocabParallelEmbedding (embed_head.py)
      • 7.2 ParallelLMHead (embed_head.py)
    • 8. Rotary Position Embedding (RoPE)
      • 8.1 RotaryEmbedding (rotary_embedding.py)
      • 8.2 get_rope() Factory
      • 8.3 Integration in Attention
    • 9. Sampling
      • 9.1 Sampler (sampler.py)
      • 9.2 RejectionSampler (rejection_sampler.py)
    • 10. Fused Kernel Chains
    • Source Files
      • atom/model_ops/
      • atom/model_ops/attentions/
      • atom/model_ops/fused_moe/
      • atom/utils/
  • ATOM Scheduling & KV Cache Guide
    • Quick Reference
    • 1. Scheduling Algorithm
      • 1.1 Scheduler Initialization
      • 1.2 Schedule Flow
      • 1.3 Delay Factor
      • 1.4 Preemption
    • 2. ScheduledBatch Structure
      • 2.1 Constructor Signature
      • 2.2 Fields
      • 2.3 ScheduledBatchOutput
    • 3. Block Manager
      • 3.1 Block Class
      • 3.2 BlockManager Initialization
      • 3.3 Allocation (allocate)
      • 3.4 Deallocation (deallocate)
      • 3.5 Can-Allocate and Can-Append Checks
      • 3.6 May-Append (Decode Extension)
    • 4. Prefix Caching
      • 4.1 Hash Function
      • 4.2 Hash Chaining
      • 4.3 Cache Lookup During Allocation
      • 4.4 Reference Counting
      • 4.5 Enabling Prefix Caching
    • 5. Postprocessing
      • 5.1 Signature
      • 5.2 Token Appending
      • 5.3 Stop Condition Checking
      • 5.4 Stream Output
      • 5.5 Sequence Cleanup
      • 5.6 Placeholder Insertion
    • 6. Speculative Decoding Integration
      • 6.1 Scheduler Tracking
      • 6.2 Draft Tokens in Scheduling
      • 6.3 Acceptance Statistics
      • 6.4 Draft Token Storage on Sequences
    • 7. Sequence Management
      • 7.1 Constructor
      • 7.2 Core Fields
      • 7.3 Timing Fields
      • 7.4 Computed Properties
      • 7.5 num_tokens Setter
      • 7.6 Lifecycle
      • 7.7 SequenceStatus Enum
      • 7.8 SequenceType Enum
    • Source Files
  • ATOM Distributed Inference Guide
    • Quick Reference
    • 1. Tensor Parallelism (TP)
      • Weight Sharding
      • Process Group Initialization
      • AllReduce
      • Configuration
    • 2. Data Parallelism (DP)
      • Architecture
      • DP Process Group Initialization
      • Synchronized Busy Loop
      • Dummy Batch Execution
      • Device Assignment
      • DPMetadata
      • CoreManager (DP Orchestration)
      • Configuration
    • 3. Expert Parallelism (EP)
      • FusedMoEParallelConfig
      • Expert Distribution
      • MORI Communication
      • Configuration
    • 4. Environment Variables
    • 5. Multi-GPU Deployment Examples
      • DeepSeek-R1 on 8 GPUs (TP8)
      • Qwen3-235B-A22B on 8 GPUs (TP8 + EP)
      • Kimi-K2-Thinking on 4 GPUs (TP4)
    • 6. Combined Parallelism Strategies
      • TP Only (Dense Models)
      • TP + EP (MoE Models)
      • TP + DP (Dense Throughput)
      • TP + DP + EP (MoE Throughput)
      • DP Attention Mode
    • Source Files
  • ATOM Compilation & CUDA Graphs Guide
    • 1. Compilation Levels
      • Level 0 – NO_COMPILATION
      • Level 1 – DYNAMO_AS_IS
      • Level 2 – DYNAMO_ONCE
      • Level 3 – PIECEWISE (Production Default)
    • 2. CUDA Graph Modes
      • NONE (value: 0)
      • PIECEWISE (value: 1)
      • FULL (value: 2)
      • FULL_DECODE_ONLY (value: (FULL, NONE))
      • FULL_AND_PIECEWISE (value: (FULL, PIECEWISE))
      • Helper Methods
    • 3. CUDA Graph Capture
      • Capture Flow
      • Graph Keying
      • Graph Pool Sharing
      • Default Capture Sizes
      • Graph Replay in run_model()
    • 4. Piecewise Compilation
      • Splitting Operations
      • Compilation Pipeline
      • Cache Management
    • 5. Forward Context & Stateless Dispatch
      • ForwardContext Fields
      • Lifecycle
      • Context Dataclass
      • Integration with CUDA Graphs
    • 6. Compiler Backend
      • CompilerManager
      • CompilerInterface
      • InductorAdaptor
      • InductorStandaloneAdaptor
      • VllmBackend
      • @support_torch_compile Decorator
      • Custom Op Registration
    • 7. Configuration Options
    • 8. Decision Tree
      • Common Configurations
    • Source Files
  • ATOM Serving & Benchmarking Guide
    • Quick Reference
    • 1. OpenAI-Compatible Server
      • 1.1 Endpoints
      • 1.2 Request Models
      • 1.3 Response Models
      • 1.4 Server Startup
      • 1.5 Example: curl
    • 2. Programmatic API (LLMEngine)
      • 2.1 Initialization
      • 2.2 SamplingParams
      • 2.3 Core Methods
      • 2.4 Synchronous Generation Example
      • 2.5 Asynchronous / Streaming Usage
    • 3. Simple Inference
      • 3.1 Usage
      • 3.2 What It Does
    • 4. Benchmarking
      • 4.1 Metrics
      • 4.2 Key CLI Arguments
      • 4.3 Backend Request Functions
      • 4.4 Full Benchmark Example
    • 5. Profiling
      • 5.1 Configuration
      • 5.2 Online Profiling (HTTP)
      • 5.3 Programmatic Profiling
      • 5.4 Offline Profiling Script
      • 5.5 Profiling During Benchmarks
    • 6. Speculative Decoding (MTP)
      • 6.1 Architecture
      • 6.2 Configuration
      • 6.3 MTP Statistics
      • 6.4 How Rejection Sampling Works
    • 7. Deployment Examples
      • 7.1 Single-GPU
      • 7.2 Multi-GPU with Tensor Parallelism
      • 7.3 Docker Deployment
      • 7.4 Engine CLI Arguments (EngineArgs)
    • 8. Accuracy Validation
      • 8.1 Setup
      • 8.2 Run Evaluation
    • Source Files

API Reference

  • Serving API
    • LLMEngine Class
      • Methods
        • generate()
    • SamplingParams
    • Return Values
    • Example
  • Supported Models
    • Llama Models
    • GPT Models
    • Mixtral
    • Other Architectures
    • Model Configuration
    • Performance by Model Size
    • Quantization
© Copyright 2026, AMD.