EAGLE 3 / EAGLE 3.1 speculative decoding

ATOM supports the EAGLE-family speculative decoding methods through the --method eagle3 CLI flag. This guide covers both the original EAGLE 3 path (in production since the K2.5 release) and the EAGLE 3.1 extensions added in May 2026 (fc_norm, post-norm hidden state feedback, MLA draft variant).

For the framework that hosts both methods, see serving_benchmarking_guide.md § 6.

EAGLE 3 vs EAGLE 3.1

EAGLE 3 is a self-distilled drafter that runs one transformer decoder layer per speculative step over the concatenation of an embedding tensor and a target-side hidden state. The drafter shares the target’s tokenizer and emits its own auxiliary projection (fc) over a few selected target hidden layers (eagle_aux_hidden_state_layer_ids).

EAGLE 3.1 keeps the same --method eagle3 code path and adds two backward-compatible toggles in the drafter HF config. They are detected via getattr(...) with the EAGLE 3 default, so loading an existing EAGLE 3 checkpoint (no new fields) takes the original code path byte-for-byte.

Field on draft `config.json`	Type / default	Behavior when set
`target_hidden_size`	`int`, default `hidden_size`	Sizes the `fc` input fan-in per aux slot. Lets the draft project from a target whose hidden dim differs from the draft’s.
`num_aux_hidden_states`	`int`, default `len(eagle_aux_hidden_state_layer_ids)` (else `3`)	Number of target hidden-state slots concatenated into the `fc` input. Multiplies the `fc` fan-in.
`fc_norm`	`bool`, default `False`	When `True`, each aux slot is RMSNormed (one `RMSNorm` per slot) before concatenation and `fc`. When `False`, the legacy raw-concat path runs.
`norm_output`	`bool`, default `False`	When `True`, the drafter feeds its post-norm hidden state to its own next speculative step instead of the pre-norm residual. The logits path always applies the final norm once.

Together these target attention drift: under long contexts or out-of-distribution prompts, the drafter’s attention shifts away from the sink token as speculative depth grows, and acceptance length collapses. EAGLE 3.1 reports up to 2× acceptance length on long-context production workloads and up to 2.03× per-user output throughput at concurrency 1 on Kimi-K2.6-NVFP4 + vLLM TP=4 (GB200). See the vLLM blog post and the original integration in vLLM PR #42764.

Drafter architectures

ATOM ships two EAGLE 3 drafter classes, both routed through atom.spec_decode.eagle.support_eagle_model_arch_dict based on the draft HF config’s architectures[0] after Config.hf_config_override rewrites it to the internal name:

Target stack	Draft HF arch (in `config.json`)	Internal class (in `support_eagle_model_arch_dict`)	Source file	KV cache
MHA / GQA (Llama-style)	`LlamaForCausalLMEagle3`	`Eagle3LlamaModel`	`atom/models/eagle3_llama.py`	Independent draft pool via `Eagle3DraftBuilder`
MLA (DeepSeek V3 / K2.6)	`Eagle3DeepseekV2ForCausalLM`	`Eagle3DeepseekMLAModel`	`atom/models/eagle3_deepseek_mla.py`	Piggybacks the target’s MLA KV pool at `layer_id = num_hidden_layers`

Both classes implement the same set of EAGLE 3.1 toggles. The MLA variant reuses the MLA attention block from atom/models/deepseek_v2.py rather than re-implementing it. Because its KV shape is identical to the target’s, no parallel Eagle3MLADraftBuilder is needed — the draft slots into the target pool as one additional layer (accounted for by num_nextn_predict_layers).

Drafter forward contract

To support post-norm feedback, forward() on EAGLE 3.1 drafters returns a tuple:

hidden_for_logits, hidden_for_next_step = model(input_ids, positions, hidden_states)

hidden_for_logits: pre-norm hidden state. The final norm + lm_head in compute_logits() applies on top of this, exactly as in EAGLE 3.
hidden_for_next_step: post-norm hidden state when norm_output=True, otherwise identical to hidden_for_logits. This is what the propose loop feeds back as hidden_states on the next speculative step.

EagleProposer.propose() (atom/spec_decode/eagle.py) unpacks the tuple via isinstance(model_out, tuple), so legacy drafters returning a single tensor continue to work without changes.

Usage

K2.6 + EAGLE 3.1 MLA draft

python -m atom.entrypoints.openai_server \
    --model /path/to/Kimi-K2.6-MXFP4 \
    --trust-remote-code \
    -tp 8 \
    --kv_cache_dtype fp8 \
    --method eagle3 \
    --num-speculative-tokens 3 \
    --draft-model lightseekorg/kimi-k2.6-eagle3.1-mla

The draft config.json should expose fc_norm: true, norm_output: true, kv_lora_rank (signals MLA), and an eagle_config.eagle_aux_hidden_state_layer_ids list. These drive the routing and toggles automatically — no extra CLI flags.

K2.5 + legacy EAGLE 3 draft (backward-compat)

python -m atom.entrypoints.openai_server \
    --model /path/to/Kimi-K2.5-MXFP4 \
    --trust-remote-code \
    -tp 8 \
    --kv_cache_dtype fp8 \
    --method eagle3 \
    --num-speculative-tokens 3 \
    --draft-model /path/to/kimi-k2.5-eagle3

The legacy draft’s config.json has none of the new fields, so all getattr toggles fall through to their EAGLE 3 defaults. Behavior is byte-equivalent to pre-EAGLE-3.1 ATOM.

Runtime acceptance stats

ATOM’s scheduler already emits per-window acceptance via the SpecStats class (atom/model_engine/scheduler.py). With num_speculative_tokens=3, a [MTP Stats] line appears every 1000 decode steps in the server log:

[MTP Stats] acceptance: 59.62%  avg toks/fwd: 2.79  dist 0/1/2/3: 21.69/19.53/16.97/41.80%

No instrumentation needed — read these directly from the server’s stdout log (docker logs <container> or the tee target you launched it under).

Reference: 8× MI355X validation

Measured 2026-05-27 on the rocm/atom-dev:latest container, TP=8, MXFP4 targets, gsm8k 5-shot, 1319 samples.

Setup	gsm8k 5-shot	Acceptance	Avg toks/fwd	Dist 0/1/2/3
K2.5 MXFP4 + `kimi-k2.5-eagle3` (legacy path)	0.9356 (±0.0068)	67.96%	3.04	14.95 / 16.30 / 18.68 / 50.07%
K2.6 MXFP4 + `kimi-k2.6-eagle3.1-mla`	0.9393 (±0.0066)	59.62%	2.79	21.69 / 19.53 / 16.97 / 41.80%

The K2.5 legacy run is the backward-compat probe — it must match the pre-EAGLE-3.1 K2.5 baseline. The K2.6 run shows EAGLE 3.1 functional on the new MLA draft without regressing correctness vs. the legacy path.

The lower acceptance rate on K2.6 vs K2.5 is not a regression — the two spec pairs are not directly comparable (different targets, different draft training maturity), and gsm8k is a short-context structured-reasoning task rather than the long-context production workload EAGLE 3.1 was designed for. A direct K2.6-target ablation requires an additional K2.5-style EAGLE 3 draft sized to K2.6, which is not yet publicly available.

Implementation notes for contributors

Adding a new MLA target’s EAGLE 3.1 draft: register an HF arch rewrite in Config.hf_config_override (atom/config.py), add the entry to support_eagle_model_arch_dict (atom/spec_decode/eagle.py), and reuse Eagle3DeepseekMLAModel if the target’s MLA head dims match. Sizing per draft is automatic via the four getattr toggles in § 1.
atom_config copy semantics in EagleProposer.__init__: the draft config is constructed by shallow-copying atom_config and then isolating hf_config + compilation_config. Avoid copy.deepcopy(atom_config) — models like K2.6 attach cuda.Stream objects to the config tree from custom HF modeling code, and Stream cannot be pickled.
mtp_start_layer_idx dispatch in atom/model_engine/model_runner.py branches on speculative_config.method == "eagle3", not on the presence of eagle3_draft_builder. MLA drafts piggyback the target pool and so have no draft builder, but they still need the layer-index offset.
Do not modify @support_torch_compile-decorated model files for EAGLE-related changes — instrument at call sites (EagleProposer.propose, ModelRunner.run_model) instead. See the project CLAUDE.md.

References

vLLM integration PR (merged main, ships v0.22.0): https://github.com/vllm-project/vllm/pull/42764
TorchSpec training PR: https://github.com/lightseekorg/TorchSpec/pull/97
vLLM blog: https://vllm.ai/blog/2026-05-26-eagle-3-1
Kimi K2.6 EAGLE 3.1 draft model: https://huggingface.co/lightseekorg/kimi-k2.6-eagle3.1-mla
Original EAGLE repository: https://github.com/SafeAILab/EAGLE