vLLM-ATOM backend

ATOM can work as the vLLM-ATOM out-of-tree (OOT) backend — installed as a separate Python package and plugged into vLLM through vLLM’s official plugin interfaces. This keeps the integration clean while letting ATOM reuse the mature serving and runtime features already provided by vLLM.

This integration follows the direction described in the RFC to enable ATOM as a vLLM out-of-tree platform. The high-level idea is that vLLM remains the framework-level runtime, while ATOM focuses on model-level and kernel-level optimization for AMD GPUs. In this mode, ATOM serves as the optimized execution backend and an incubation layer for new kernels, fusions, and model implementations before they are mature enough to be upstreamed.

Architecture

Design overview

In practice, the responsibilities are split as follows:

Layer	Responsibility
vLLM	API server, CLI, engine, scheduler, worker orchestration, cache management, and framework-level features
ATOM	Platform plugin, model registry overrides, model wrappers, attention backends, and the optimized execution path built around ATOM/AITER integrations
Integration boundary	vLLM calls the official plugin hooks, while ATOM implements the required platform and model interfaces without changing vLLM source

This relationship is important: ATOM is not replacing vLLM as a serving framework. Instead, ATOM plugs optimized model execution components into the extension points that vLLM already exposes.

How it works

When the atom package is installed in the same Python environment as vllm, two entry points are exposed following the official vLLM plugin convention:

[project.entry-points."vllm.platform_plugins"]
atom = "atom.plugin.vllm.register:register_platform"

[project.entry-points."vllm.general_plugins"]
atom_model_registry = "atom.plugin.vllm.register:register_model"

During vllm serve startup, vLLM scans installed Python packages, loads these entry points, and activates the ATOM hooks:

register_platform() returns atom.plugin.vllm.platform.ATOMPlatform, so vLLM resolves current_platform to the ATOM platform.
register_model() updates selected vLLM ModelRegistry entries to ATOM wrappers such as ATOMForCausalLM and ATOMMoEForCausalLM.
When vLLM constructs attention layers, ATOMPlatform.get_attn_backend_cls() returns atom.model_ops.attentions.aiter_attention.AiterBackend or atom.model_ops.attentions.aiter_mla.AiterMLABackend.
When a supported model is instantiated, the ATOM wrapper creates the ATOM plugin config, initializes the ATOM/AITER runtime state, and constructs the ATOM model implementation.
vLLM continues to drive request scheduling and serving, while the hot model execution path runs through ATOM model code, ATOM attention backends, and AITER-backed kernels.

Plugin lifecycle

vLLM startup
│
├─ 1. register_platform()
│     ├─ _set_framework_backbone("vllm")
│     └─ return "atom.plugin.vllm.platform.ATOMPlatform"
│
├─ 2. register_model()
│     └─ Override ModelRegistry for supported architectures
│
├─ 3. vLLM loads model → ATOMModelBase.__init__()
│     ├─ generate_atom_config_for_plugin_mode(vllm_config)
│     │     └─ _generate_atom_config_from_vllm_config()
│     │           ├─ Build PluginConfig (vLLM-specific fields)
│     │           └─ Build ATOM Config (model, TP, KV cache, etc.)
│     ├─ Attention dispatcher selects AttentionForVllm
│     ├─ init_aiter_dist() → initialize AITER distributed env
│     └─ Construct ATOM model (e.g., DeepseekV3ForCausalLM)
│
├─ 4. AttentionForVllm layer construction
│     ├─ MLA model → AiterMlaBackendForVllm
│     └─ MHA model → AiterMhaBackendForVllm
│
└─ 5. Forward pass
      ├─ vLLM calls ATOMModelBase.forward()
      ├─ Delegates to self.model(input_ids, positions, ...)
      └─ AttentionForVllm uses ATOM's AITER kernels

Key modules

Module	Purpose
`atom.plugin.vllm.register`	vLLM plugin entry points for platform and model registration
`atom.plugin.vllm.platform`	The ATOM platform class exposed to vLLM
`atom.plugin.vllm.model_wrapper`	ATOM model wrappers used by vLLM model construction
`atom.plugin.vllm.attention`	vLLM-specific ATOM attention layers, backends, metadata, and impls

Component diagram

atom/plugin/
├── __init__.py              # Public API: is_vllm, is_plugin_mode
├── prepare.py               # Framework detection and state management
├── config.py                # PluginConfig + vLLM-to-ATOM config translation
├── register.py              # set_attn_cls, init_aiter_dist
└── vllm/
    ├── __init__.py           # vLLM sub-package exports
    ├── register.py           # register_platform(), register_model()
    ├── platform.py           # ATOMPlatform (RocmPlatform subclass)
    ├── model_wrapper.py      # ATOMModelBase, ATOMForCausalLM, ATOMMoEForCausalLM
    ├── moe.py                # vLLM-specific FusedMoE name adaptation
    └── attention/            # vLLM-specific ATOM attention stack

Configuration translation

When vLLM constructs an ATOM model, generate_atom_config_for_plugin_mode() translates vLLM’s VllmConfig into an ATOM Config. The translation preserves vLLM’s scheduling, caching, and parallelism decisions while injecting ATOM-specific compilation and plugin settings.

`PluginConfig` fields

Field	Type	Default	Description
`model_config`	`Any`	`None`	vLLM’s model config object
`rank`	`int`	`0`	Current process rank
`is_plugin_mode`	`bool`	`False`	Always `True` when running as a plugin
`is_vllm`	`bool`	`False`	`True` when running inside vLLM
`vllm_scheduler_config`	`Any`	`None`	vLLM scheduler config
`vllm_cache_config`	`Any`	`None`	vLLM cache config
`vllm_quant_config`	`Any`	`None`	vLLM quantization config

vLLM config mapping

The following table shows how vLLM config fields map to ATOM Config fields:

ATOM `Config` Field	Source (vLLM)
`model`	`model_config.model`
`max_num_batched_tokens`	`scheduler_config.max_num_batched_tokens`
`max_num_seqs`	`scheduler_config.max_num_seqs`
`max_model_len`	`model_config.max_model_len` (or `scheduler_config.max_model_len`)
`gpu_memory_utilization`	`cache_config.gpu_memory_utilization`
`tensor_parallel_size`	`parallel_config.tensor_parallel_size`
`kv_cache_block_size`	`cache_config.block_size`
`num_kvcache_blocks`	`cache_config.num_gpu_blocks`
`kv_cache_dtype`	`cache_config.cache_dtype`
`enable_prefix_caching`	`cache_config.enable_prefix_caching`
`enable_expert_parallel`	`parallel_config.enable_expert_parallel`
`compilation_config.level`	`compilation_config.mode`
`enforce_eager`	Always `True` (ATOM does not use its own CUDA graph logic in plugin mode)

CUDA graphs vs torch.compile:

CUDA graphs — In plugin mode, ATOM sets enforce_eager=True and use_cudagraph=False in its own Config, meaning ATOM’s CUDA graph capture and replay logic are completely disabled. CUDA graph management is fully delegated to vLLM — vLLM decides when to capture, which batch sizes to graph, and how to replay. ATOM’s attention backends cooperate by implementing build_for_cudagraph_capture() so that vLLM can capture ATOM kernels inside its own CUDA graphs.
torch.compile — In contrast, torch.compile is handled entirely by ATOM, not by vLLM. ATOM’s @support_torch_compile decorator wraps each model’s forward method and routes compilation through ATOM’s own VllmBackend. The compilation level is derived from vLLM’s compilation_config.mode (e.g., PIECEWISE), but the actual compilation pipeline — including graph splitting, Inductor invocation, and compiled-graph caching — is ATOM’s own implementation.

Graph splitting is a key difference: ATOM splits the torch.fx graph at attention boundaries (the unified_attention op registered by vLLM) so that each piecewise subgraph can be compiled and cached independently. This split strategy is defined in ATOM’s split_graph() / _split_judge_func() and is independent of vLLM’s compilation backend.

Attention integration

ATOM’s vLLM plugin constructs ATOM-owned attention layers directly from the ATOM model implementation. These layers implement vLLM’s AttentionLayerBase contract and return their ATOM-vLLM backend via get_attn_backend().

How the backend is selected

When vLLM discovers attention-like layers through static_forward_context, each ATOM-vLLM attention layer returns its backend based on the model’s attention type:

Model Attention Type	Returned Backend	Example Models
MLA (`use_mla == True`)	`AiterMlaBackendForVllm`	DeepSeek-R1, Kimi-K2
Standard MHA	`AiterMhaBackendForVllm`	Qwen3, Llama

ATOM-vLLM no longer constructs vLLM’s standard Attention / MLAAttention, so there is no supported mode to disable only ATOM attention while keeping ATOM models active. Use ATOM_DISABLE_VLLM_PLUGIN=1 for pure vLLM.

Supported models

Currently, the plugin backend supports the following model architectures:

HF architecture	ATOM model implementation	Model family example
`Qwen3ForCausalLM`	`atom.models.qwen3.Qwen3ForCausalLM`	Qwen3 dense
`Qwen3MoeForCausalLM`	`atom.models.qwen3_moe.Qwen3MoeForCausalLM`	Qwen3 MoE
`GptOssForCausalLM`	`atom.models.gpt_oss.GptOssForCausalLM`	GPT-OSS
`DeepseekV3ForCausalLM`	`atom.models.deepseek_v2.DeepseekV3ForCausalLM`	DeepSeek-R1 / DeepSeek V3 / Kimi-K2 style models
`Glm4MoeForCausalLM`	`atom.models.glm4_moe.Glm4MoeForCausalLM`	GLM-4-MoE

Kimi-K2 is also supported. Although it is usually loaded with --trust-remote-code, it shares the same DeepSeek-style MLA+MoE architecture path and reuses atom.models.deepseek_v2.DeepseekV3ForCausalLM in the ATOM vLLM OOT backend.

Installation and quick start

Prerequisites

AMD Instinct MI300X / MI300A / MI355X GPUs

Set up the environment

The recommended approach is to pull an official ATOM + vLLM Docker image from Docker Hub. These images ship with ROCm, PyTorch, AITER, ATOM, and a compatible vLLM build pre-installed — no manual dependency management is required.

Pull the latest OOT image:

docker pull rocm/atom-dev:vllm-latest

If you need an OOT docker image for a specific vLLM version or a specific release date, browse the available tags on Docker Hub and pull the exact tag you need there. For example, to pull the OOT docker adapted to vLLM 0.17.0 on 2026-03-15:

docker pull rocm/atom-dev:vllm-v0.17.0-nightly_20260315

Launch vLLM with ATOM plugin

The vLLM-ATOM backend keeps the standard vLLM CLI, server APIs, and general usage flow compatible with upstream vLLM. For general server options, OpenAI-compatible API usage, and client patterns, refer to the official vLLM documentation.

vllm serve ${model} \
    --host localhost \
    --port 8000 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --trust-remote-code \
    --gpu_memory_utilization 0.9 \
    --async-scheduling \
    --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
    --kv-cache-dtype fp8 \
    --no-enable-prefix-caching

ATOM will log its activation at startup:

INFO atom: Register model DeepseekV3ForCausalLM to vLLM with atom.plugin.vllm.model_wrapper:ATOMMoEForCausalLM
INFO atom: Use atom attention backend

Benchmark serving

Users can use the default vllm bench commands for performance benchmarking.

vllm bench serve \
    --host localhost \
    --port 8000 \
    --model ${model} \
    --dataset-name random \
    --random-input-len 8000 \
    --random-output-len 1000 \
    --max-concurrency 64 \
    --num-prompts 640 \
    --trust_remote_code \
    --percentile-metrics ttft,tpot,itl,e2el

Enable profiling

If you want to collect profiles, add the recommended commands by vLLM with --profiler-config "$profiler_config".

profiler_dir=./

profiler_config=$(printf '{"profiler":"torch","torch_profiler_dir":"%s","torch_profiler_with_stack":true,"torch_profiler_record_shapes":true}' \
    "${profiler_dir}")

Disable ATOM plugin

This is intended for debugging only. When the ATOM plugin is disabled, vLLM falls back to its built-in ROCm path, which may encounter version mismatches with the AITER library bundled in the environment. To run pure vLLM without ATOM, set environment variables before launching:

# Disable the entire ATOM plugin (platform + models)
export ATOM_DISABLE_VLLM_PLUGIN=1

Environment variables

Variable	Type	Default	Description
`ATOM_DISABLE_VLLM_PLUGIN`	bool	`0` (false)	Set to `1` to disable the entire ATOM vLLM plugin (platform + model registration). vLLM runs in pure ROCm mode.
`ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION`	bool	`0` (false)	Enable QK-norm + RoPE + cache + quant fusion in attention. Recommended for Qwen3-MoE models.

Online quantization

Online (load-time) quantization is supported in plugin mode. Since vllm serve cannot take ATOM’s --online_quant_config CLI flag, pass the same JSON through vLLM’s --additional-config under the online_quant_config key:

vllm serve <model> ... \
    --additional-config '{"online_quant_config": {"global_quant_config": "ptpc_fp8", "layer_quant_config": {"*expert*": "mxfp4"}, "exclude_layer": ["lm_head", "*.gate.*"]}}'

See the online quantization guide for the full schema.