# ATOM vLLM Plugin Backend

ATOM can work as the vLLM out-of-tree (OOT) plugin backend — installed as a separate Python package and plugged into vLLM through vLLM's official plugin interfaces. This keeps the integration clean while letting ATOM reuse the mature serving and runtime features already provided by vLLM.

This integration follows the direction described in the [RFC to enable ATOM as a vLLM out-of-tree platform](https://github.com/ROCm/ATOM/issues/201). The high-level idea is that vLLM remains the framework-level runtime, while ATOM focuses on model-level and kernel-level optimization for AMD GPUs. In this mode, ATOM serves as the optimized execution backend and an incubation layer for new kernels, fusions, and model implementations before they are mature enough to be upstreamed.

## 1. Architecture

### 1.1 Design overview

In practice, the responsibilities are split as follows:

| Layer | Responsibility |
|---|---|
| vLLM | API server, CLI, engine, scheduler, worker orchestration, cache management, and framework-level features |
| ATOM | Platform plugin, model registry overrides, model wrappers, attention backends, and the optimized execution path built around ATOM/AITER integrations |
| Integration boundary | vLLM calls the official plugin hooks, while ATOM implements the required platform and model interfaces without changing vLLM source |

This relationship is important: ATOM is not replacing vLLM as a serving framework. Instead, ATOM plugs optimized model execution components into the extension points that vLLM already exposes.
### 1.2 How it works

When the `atom` package is installed in the same Python environment as `vllm`, two entry points are exposed following the official vLLM plugin convention:

```toml
[project.entry-points."vllm.platform_plugins"]
atom = "atom.plugin.vllm.register:register_platform"

[project.entry-points."vllm.general_plugins"]
atom_model_registry = "atom.plugin.vllm.register:register_model"
```

During `vllm serve` startup, vLLM scans installed Python packages, loads these entry points, and activates the ATOM hooks:

- `register_platform()` returns `atom.plugin.vllm.platform.ATOMPlatform`, so vLLM resolves `current_platform` to the ATOM platform.
- `register_model()` updates selected vLLM `ModelRegistry` entries to ATOM wrappers such as `ATOMForCausalLM` and `ATOMMoEForCausalLM`.
- When vLLM constructs attention layers, `ATOMPlatform.get_attn_backend_cls()` returns `atom.model_ops.attentions.aiter_attention.AiterBackend` or `atom.model_ops.attentions.aiter_mla.AiterMLABackend`.
- When a supported model is instantiated, the ATOM wrapper creates the ATOM plugin config, initializes the ATOM/AITER runtime state, and constructs the ATOM model implementation.
- vLLM continues to drive request scheduling and serving, while the hot model execution path runs through ATOM model code, ATOM attention backends, and AITER-backed kernels.

### 1.3 Plugin lifecycle

```
vLLM startup
│
├─ 1. register_platform()
│   ├─ _set_framework_backbone("vllm")
│   └─ return "atom.plugin.vllm.platform.ATOMPlatform"
│
├─ 2. register_model()
│   ├─ Override ModelRegistry for supported architectures
│   ├─ patch_vllm_mla_attention()
│   └─ Patch Attention.process_weights_after_loading
│
├─ 3. vLLM loads model → ATOMModelBase.__init__()
│   ├─ generate_atom_config_for_plugin_mode(vllm_config)
│   │   └─ _generate_atom_config_from_vllm_config()
│   │       ├─ Build PluginConfig (vLLM-specific fields)
│   │       └─ Build ATOM Config (model, TP, KV cache, etc.)
│   ├─ set_attn_cls() → ops.Attention = PagedAttention
│   ├─ init_aiter_dist() → initialize AITER distributed env
│   └─ Construct ATOM model (e.g., DeepseekV3ForCausalLM)
│
├─ 4. ATOMPlatform.get_attn_backend_cls()
│   ├─ MLA model → AiterMLABackend
│   └─ MHA model → AiterBackend
│
└─ 5. Forward pass
    ├─ vLLM calls ATOMModelBase.forward()
    ├─ Delegates to self.model(input_ids, positions, ...)
    └─ Attention uses ATOM's AITER kernels via plugin decorators
```

### 1.4 Key Modules

| Module | Purpose |
|---|---|
| `atom.plugin.vllm.register` | vLLM plugin entry points for platform and model registration |
| `atom.plugin.vllm.platform` | The ATOM platform class exposed to vLLM |
| `atom.plugin.vllm.model_wrapper` | ATOM model wrappers used by vLLM model construction |
| `atom.model_ops.attentions.aiter_attention` | ATOM MHA attention backend for vLLM plugin mode |
| `atom.model_ops.attentions.aiter_mla` | ATOM MLA attention backend for vLLM plugin mode |

### 1.5 Component Diagram

```
atom/plugin/
├── __init__.py          # Public API: is_vllm, is_plugin_mode
├── prepare.py           # Framework detection and state management
├── config.py            # PluginConfig + vLLM-to-ATOM config translation
├── register.py          # set_attn_cls, init_aiter_dist
├── attention.py         # vLLM attention metadata builders and backend decorators
├── attention_mha.py     # MHA (PagedAttention) plugin-mode decorator
├── attention_mla.py     # MLA plugin-mode methods and decorator
├── moe.py               # FusedMoE decorator for plugin mode
└── vllm/
    ├── __init__.py      # vLLM sub-package exports
    ├── register.py      # register_platform(), register_model()
    ├── platform.py      # ATOMPlatform (RocmPlatform subclass)
    ├── model_wrapper.py # ATOMModelBase, ATOMForCausalLM, ATOMMoEForCausalLM
    └── mla_patch.py     # Patches vLLM MLAAttention for ATOM MLA integration
```

---

## 2. Configuration Translation

When vLLM constructs an ATOM model, `generate_atom_config_for_plugin_mode()` translates vLLM's `VllmConfig` into an ATOM `Config`.
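In spirit, this is a field-by-field copy. A minimal sketch with trimmed stand-in objects (`AtomConfig` here is a hypothetical, reduced stand-in for ATOM's real `Config`; the field names follow the mapping table in section 2.2):

```python
from dataclasses import dataclass
from types import SimpleNamespace

@dataclass
class AtomConfig:
    # Hypothetical, trimmed stand-in for ATOM's Config.
    model: str
    max_num_batched_tokens: int
    max_num_seqs: int
    tensor_parallel_size: int
    kv_cache_block_size: int
    enforce_eager: bool = True  # always True in plugin mode

def translate(vllm_config) -> AtomConfig:
    # Sketch of the translation step: copy vLLM's scheduling, caching,
    # and parallelism decisions into the ATOM-side config.
    return AtomConfig(
        model=vllm_config.model_config.model,
        max_num_batched_tokens=vllm_config.scheduler_config.max_num_batched_tokens,
        max_num_seqs=vllm_config.scheduler_config.max_num_seqs,
        tensor_parallel_size=vllm_config.parallel_config.tensor_parallel_size,
        kv_cache_block_size=vllm_config.cache_config.block_size,
    )

# Smoke demo with fake stand-ins for the vLLM sub-configs:
fake = SimpleNamespace(
    model_config=SimpleNamespace(model="deepseek-ai/DeepSeek-R1"),
    scheduler_config=SimpleNamespace(max_num_batched_tokens=8192, max_num_seqs=256),
    parallel_config=SimpleNamespace(tensor_parallel_size=8),
    cache_config=SimpleNamespace(block_size=16),
)
cfg = translate(fake)
```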
The translation preserves vLLM's scheduling, caching, and parallelism decisions while injecting ATOM-specific compilation and plugin settings.

### 2.1 `PluginConfig` Fields

| Field | Type | Default | Description |
|---|---|---|---|
| `model_config` | `Any` | `None` | vLLM's model config object |
| `rank` | `int` | `0` | Current process rank |
| `is_plugin_mode` | `bool` | `False` | Always `True` when running as a plugin |
| `is_vllm` | `bool` | `False` | `True` when running inside vLLM |
| `vllm_scheduler_config` | `Any` | `None` | vLLM scheduler config |
| `vllm_cache_config` | `Any` | `None` | vLLM cache config |
| `vllm_quant_config` | `Any` | `None` | vLLM quantization config |
| `vllm_use_atom_attention` | `bool` | `False` | Whether ATOM attention is active |

### 2.2 vLLM Config Mapping

The following table shows how vLLM config fields map to ATOM `Config` fields:

| ATOM `Config` Field | Source (vLLM) |
|---|---|
| `model` | `model_config.model` |
| `max_num_batched_tokens` | `scheduler_config.max_num_batched_tokens` |
| `max_num_seqs` | `scheduler_config.max_num_seqs` |
| `max_model_len` | `model_config.max_model_len` (or `scheduler_config.max_model_len`) |
| `gpu_memory_utilization` | `cache_config.gpu_memory_utilization` |
| `tensor_parallel_size` | `parallel_config.tensor_parallel_size` |
| `kv_cache_block_size` | `cache_config.block_size` |
| `num_kvcache_blocks` | `cache_config.num_gpu_blocks` |
| `kv_cache_dtype` | `cache_config.cache_dtype` |
| `enable_prefix_caching` | `cache_config.enable_prefix_caching` |
| `enable_expert_parallel` | `parallel_config.enable_expert_parallel` |
| `compilation_config.level` | `compilation_config.mode` |
| `enforce_eager` | Always `True` (ATOM does not use its own CUDA graph logic in plugin mode) |

**CUDA graphs vs torch.compile:**

- **CUDA graphs** — In plugin mode, ATOM sets `enforce_eager=True` and `use_cudagraph=False` in its own `Config`, meaning ATOM's CUDA graph capture and replay logic are completely disabled.
  CUDA graph management is fully delegated to vLLM — vLLM decides when to capture, which batch sizes to graph, and how to replay. ATOM's attention backends cooperate by implementing `build_for_cudagraph_capture()` so that vLLM can capture ATOM kernels inside its own CUDA graphs.

- **torch.compile** — In contrast, torch.compile is handled entirely by ATOM, not by vLLM. ATOM's `@support_torch_compile` decorator wraps each model's `forward` method and routes compilation through ATOM's own `VllmBackend`. The compilation level is derived from vLLM's `compilation_config.mode` (e.g., `PIECEWISE`), but the actual compilation pipeline — including graph splitting, Inductor invocation, and compiled-graph caching — is ATOM's own implementation.

Graph splitting is a key difference: ATOM splits the `torch.fx` graph at attention boundaries (the `unified_attention` op registered by vLLM) so that each piecewise subgraph can be compiled and cached independently. This split strategy is defined in ATOM's `split_graph()` / `_split_judge_func()` and is independent of vLLM's compilation backend.

---

## 3. Attention Integration

vLLM's OOT plugin interface allows an external platform to supply its own attention backend. ATOM hooks into this by overriding `ATOMPlatform.get_attn_backend_cls()` — the only contract point between vLLM and the plugin for attention dispatch.

### 3.1 How the Backend Is Selected

When vLLM resolves the attention backend for a model, it calls the platform's `get_attn_backend_cls()`. ATOM's implementation returns one of two backends based on the model's attention type:

| Model Attention Type | Returned Backend | Example Models |
|---|---|---|
| MLA (`use_mla == True`) | `AiterMLABackend` | DeepSeek-R1, Kimi-K2 |
| Standard MHA | `AiterBackend` | Qwen3, Llama |

Setting `ATOM_DISABLE_VLLM_PLUGIN_ATTENTION=1` causes `ATOMPlatform` to delegate back to the parent `RocmPlatform.get_attn_backend_cls()`, restoring vLLM's built-in ROCm attention path.

## 4. Supported Models

Currently, the plugin backend supports the following model architectures:

| HF architecture | ATOM model implementation | Model family example |
|---|---|---|
| `Qwen3ForCausalLM` | `atom.models.qwen3.Qwen3ForCausalLM` | Qwen3 dense |
| `Qwen3MoeForCausalLM` | `atom.models.qwen3_moe.Qwen3MoeForCausalLM` | Qwen3 MoE |
| `GptOssForCausalLM` | `atom.models.gpt_oss.GptOssForCausalLM` | GPT-OSS |
| `DeepseekV3ForCausalLM` | `atom.models.deepseek_v2.DeepseekV3ForCausalLM` | DeepSeek-R1 / DeepSeek V3 / Kimi-K2 style models |
| `Glm4MoeForCausalLM` | `atom.models.glm4_moe.Glm4MoeForCausalLM` | GLM-4-MoE |

`Kimi-K2` is also supported. Although it is usually loaded with `--trust-remote-code`, it shares the same DeepSeek-style MLA+MoE architecture path and reuses `atom.models.deepseek_v2.DeepseekV3ForCausalLM` in the ATOM vLLM OOT backend.

---

## 5. Installation and Quick Start

### 5.1 Prerequisites

- AMD Instinct MI300X / MI300A / MI355X GPUs

### 5.2 Set Up the Environment

The recommended approach is to pull an official ATOM + vLLM Docker image from [Docker Hub](https://hub.docker.com/r/rocm/atom-dev/tags?name=vllm). These images ship with ROCm, PyTorch, AITER, ATOM, and a compatible vLLM build pre-installed — no manual dependency management is required.

Pull the latest OOT image:

```bash
docker pull rocm/atom-dev:vllm-latest
```

If you need an OOT Docker image for a specific vLLM version or a specific release date, browse the available tags on [Docker Hub](https://hub.docker.com/r/rocm/atom-dev/tags) and pull the exact tag you need. For example, to pull the OOT image built against vLLM `0.17.0` on `2026-03-15`:

```bash
docker pull rocm/atom-dev:vllm-v0.17.0-nightly_20260315
```

### 5.3 Launch vLLM with ATOM Plugin

The ATOM vLLM plugin backend keeps the standard vLLM CLI, server APIs, and general usage flow compatible with upstream vLLM.
For general server options, OpenAI-compatible API usage, and client patterns, refer to the [official vLLM documentation](https://docs.vllm.ai/en/latest/).

```bash
vllm serve ${model} \
    --host localhost \
    --port 8000 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --trust-remote-code \
    --gpu_memory_utilization 0.9 \
    --async-scheduling \
    --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
    --kv-cache-dtype fp8 \
    --no-enable-prefix-caching
```

ATOM will log its activation at startup:

```
INFO atom: Register model DeepseekV3ForCausalLM to vLLM with atom.plugin.vllm.model_wrapper:ATOMMoEForCausalLM
INFO atom: Use atom attention backend
```

### 5.4 Benchmark Serving

Use the standard `vllm bench` commands for performance benchmarking:

```bash
vllm bench serve \
    --host localhost \
    --port 8000 \
    --model ${model} \
    --dataset-name random \
    --random-input-len 8000 \
    --random-output-len 1000 \
    --max-concurrency 64 \
    --num-prompts 640 \
    --trust_remote_code \
    --percentile-metrics ttft,tpot,itl,e2el
```

### 5.5 Enable Profiling

To collect profiles, build the profiler configuration recommended by vLLM and pass it to `vllm serve` with `--profiler-config "$profiler_config"`:

```bash
profiler_dir=./
profiler_config=$(printf '{"profiler":"torch","torch_profiler_dir":"%s","torch_profiler_with_stack":true,"torch_profiler_record_shapes":true}' \
    "${profiler_dir}")
```

### 5.6 Disable ATOM Plugin

This is intended for **debugging only**. When the ATOM plugin is disabled, vLLM falls back to its built-in ROCm path, which may encounter version mismatches with the AITER library bundled in the environment.

To run pure vLLM without ATOM, set environment variables before launching:

```bash
# Disable the entire ATOM plugin (platform + models)
export ATOM_DISABLE_VLLM_PLUGIN=1

# Or disable only ATOM attention (keep ATOM models but use vLLM attention)
export ATOM_DISABLE_VLLM_PLUGIN_ATTENTION=1
```

---

## 6. Environment Variables

| Variable | Type | Default | Description |
|---|---|---|---|
| `ATOM_DISABLE_VLLM_PLUGIN` | bool | `0` (false) | Set to `1` to disable the entire ATOM vLLM plugin (platform + model registration). vLLM runs in pure ROCm mode. |
| `ATOM_DISABLE_VLLM_PLUGIN_ATTENTION` | bool | `0` (false) | Set to `1` to disable only ATOM's attention backends. ATOM models are still used, but attention falls back to vLLM's default ROCm backend. |
| `ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION` | bool | `0` (false) | Enable QK-norm + RoPE + cache + quant fusion in attention. Recommended for Qwen3-MoE models. |
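All three variables are 0/1 flags. As a minimal sketch (not ATOM's actual parsing code), a consumer would typically read them like this:

```python
import os

def env_flag(name: str, default: str = "0") -> bool:
    # Interpret a 0/1-style environment variable as a boolean;
    # a hypothetical helper, shown for illustration only.
    return os.environ.get(name, default).strip().lower() in ("1", "true", "yes")

# Example: mirror the two debug switches from section 5.6.
disable_plugin = env_flag("ATOM_DISABLE_VLLM_PLUGIN")
disable_attention_only = env_flag("ATOM_DISABLE_VLLM_PLUGIN_ATTENTION")
```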