# Online Quantization Guide

ATOM can quantize or re-quantize model weights while loading them by passing
`--online_quant_config` to the engine. The source checkpoint stays on disk
unchanged; quantization happens in memory inside `process_weights_after_loading`
right after the loader finishes copying tensors.

This guide covers when to use online quantization, the full configuration
syntax, ready-to-run recipes for the most common model families, how to verify
the result, and troubleshooting tips. For the dataclass-level field reference,
see [`configuration_guide.md` § 3.7](./configuration_guide.md#37-online-quantization-at-load-time).

---

## 1. When to use online quantization

Use online quantization when one of the following holds:

- The model only ships an unquantized (BF16/FP16) or FP8-block checkpoint, and
  you want to evaluate a different runtime format (e.g. MXFP4 experts) without
  rebuilding the checkpoint offline.
- You want to sweep mixed-precision recipes (different formats for attention vs.
  MoE experts vs. shared experts) on the same source weights.
- You need a quick A/B between FP8 and MXFP4 on the same model without
  downloading two separate Hugging Face repos.

Prefer an offline pre-quantized checkpoint (e.g. `amd/DeepSeek-R1-0528-MXFP4`)
when one already exists for your target format — it has lower load time,
deterministic per-layer assignment, and no online quantization overhead on
every restart.

### Supported source-checkpoint formats

Online quantization is only activated when the source model's `quant_method`
is one of:

| Source `quant_method` | Behavior |
|---|---|
| _(none, i.e. BF16/FP16 model)_ | Quantized directly from float weights. |
| `fp8` (block FP8, `QuantType.per_1x128`) | FP8 block weights are dequantized to BF16 first, then re-quantized. |
| `mxfp4` | **Not re-quantized.** Source MXFP4 weights are currently passed through unchanged — there is no dequant path for `per_1x32`, so the requested target format does not take effect on these layers. |

---

## 2. Configuration syntax

The flag accepts a single JSON object with three optional fields:

```bash
--online_quant_config '{
  "global_quant_config": "ptpc_fp8",
  "layer_quant_config": {"*expert*": "mxfp4"},
  "exclude_layer": ["lm_head", "*.gate.*"]
}'
```

| Field | Type | Description |
|---|---|---|
| `global_quant_config` | `str` | Default target format applied to every Linear / MoE layer. Omit (or pass `""`) to leave non-matching layers at their source precision. |
| `layer_quant_config` | `dict[str, str]` | Per-layer target overrides. Keys are fnmatch-style globs such as `"*expert*"`, `"*.mlp.gate_proj"`. Matched layers override `global_quant_config`. |
| `exclude_layer` | `str` \| `list[str]` | Layer name patterns to leave at source precision. Supports exact match and glob (`*`). Prefer a JSON list when excluding more than one pattern. |

Resolution order for a given layer name:

1. If it matches `exclude_layer` → not quantized.
2. Otherwise, first matching `layer_quant_config` pattern (in dict order).
3. Otherwise, fall back to `global_quant_config`.
4. If `global_quant_config` is also empty, the layer keeps its source format.

### 2.1 Target formats

Only two target formats are currently supported. Any other string (for example
`ptpc_i8`, `mxi4`, `mxfp8`) will either be rejected by the JSON parser or
trigger an assertion in the loader when the layer's weight is quantized.

| Format string | Underlying `QuantType` | Weight dtype |
|---|---|---|
| `ptpc_fp8` | `QuantType.per_Token` | `torch.float8_e4m3fn` |
| `mxfp4` | `QuantType.per_1x32` | packed FP4 (`torch.float4_e2m1fn_x2`, group size 32) |

### 2.2 Picking the right pattern

ATOM's resolver runs against the **fully-qualified layer name** as reported by
`model.named_modules()`. Useful patterns:

| Pattern | Matches | Why |
|---|---|---|
| `"*expert*"` | MoE expert weights (e.g. `model.layers.3.mlp.experts`) | Substring match on the fused expert module. |
| `"*.gate.*"` | MoE router / gate Linear | Always exclude — quantizing the router destroys top-k accuracy. |
| `"lm_head"` | Output projection | Always exclude — kept at source precision avoids logit-distribution shift. |
| `"*shared_expert*"` | Shared experts in DeepSeek / Qwen3 MoE | Keep at higher precision if you see accuracy regressions. |

---

## 3. Recipes

The four recipes below are the configurations validated in
[ROCm/ATOM#653](https://github.com/ROCm/ATOM/pull/653). Each has been A/B
tested against its offline-quantized equivalent on gsm8k accuracy and
ISL=1024 / OSL=1024 / concurrency=128 throughput.

All commands assume you are inside the standard ATOM container
(`docker pull rocm/atom:latest`).

### 3.1 Qwen3-30B-A3B-Thinking-2507 — full per-token FP8

BF16 source → every Linear and the fused expert module quantized to
`ptpc_fp8`. The matching offline checkpoint is
`amd/Qwen3-30B-A3B-Thinking-2507-ptpc`.

```bash
python -m atom.entrypoints.openai_server \
  --model Qwen/Qwen3-30B-A3B-Thinking-2507 \
  -tp 4 \
  --online_quant_config '{
    "global_quant_config": "ptpc_fp8",
    "exclude_layer": ["lm_head", "*.gate.*"]
  }'
```

### 3.2 Qwen3-235B-A22B-Instruct-2507 — full MXFP4

BF16 source → every Linear (including experts) quantized to `mxfp4`, served
with expert parallel.

```bash
python -m atom.entrypoints.openai_server \
  --model Qwen/Qwen3-235B-A22B-Instruct-2507 \
  -tp 2 --enable-expert-parallel \
  --online_quant_config '{
    "global_quant_config": "mxfp4",
    "exclude_layer": ["lm_head", "*.gate.*"]
  }'
```

### 3.3 DeepSeek-R1-0528 — FP8 attention + MXFP4 experts

FP8 source → non-expert Linear stays at `ptpc_fp8`, fused MoE experts are
downgraded to `mxfp4`. The matching offline checkpoint layout is
`amd/DeepSeek-R1-0528-MXFP4`.

```bash
python -m atom.entrypoints.openai_server \
  --model deepseek-ai/DeepSeek-R1-0528 \
  --enforce-eager -tp 8 \
  --online_quant_config '{
    "global_quant_config": "ptpc_fp8",
    "layer_quant_config": {"*expert*": "mxfp4"},
    "exclude_layer": ["lm_head", "*.gate.*"]
  }'
```

`--enforce-eager` mirrors the configuration used by the PR's accuracy
reproduction. Drop it to get full CUDA-graph throughput; it does not affect
the online quantization output.

### 3.4 DeepSeek-R1-0528 + MTP-3 — FP8 attention + MXFP4 experts

Same online quantization recipe as § 3.3, layered with MTP-3 speculative
decoding for ~2.5× lower TPOT.

```bash
python -m atom.entrypoints.openai_server \
  --model deepseek-ai/DeepSeek-R1-0528 \
  --enforce-eager -tp 8 \
  --method mtp --num-speculative-tokens 3 \
  --online_quant_config '{
    "global_quant_config": "ptpc_fp8",
    "layer_quant_config": {"*expert*": "mxfp4"},
    "exclude_layer": ["lm_head", "*.gate.*"]
  }'
```

`--method mtp --num-speculative-tokens 3` is independent of online
quantization — it can be added to any of the recipes above without changing
the `--online_quant_config` JSON.

---

## 4. Verifying the result

When online quantization runs, rank 0 writes
`online_quant_info_<timestamp>_<ns>.json` to:

1. `$ATOM_TORCH_PROFILER_DIR` if the env var is set, otherwise
2. the current working directory.

A representative payload:

```json
{
  "model": "Qwen/Qwen3-30B-A3B-Thinking-2507",
  "online_quant_config": {
    "global_quant_config": "ptpc_fp8",
    "exclude_layer": ["lm_head", "*.gate.*"]
  },
  "elapsed_seconds": 2.343,
  "num_layers": 144,
  "layers": [
    {
      "layer": "model.layers.0.self_attn.qkv_proj",
      "quant_type": "per_Token",
      "quant_dtype": "torch.float8_e4m3fn"
    },
    {
      "layer": "model.layers.0.mlp.experts",
      "quant_type": "per_Token",
      "quant_dtype": "torch.float8_e4m3fn"
    }
  ]
}
```

Things to check:

- `num_layers` matches your expectation. For a Qwen3 MoE with 48 transformer
  blocks you should see `48 × 3 = 144` entries (qkv_proj + o_proj + experts).
  A drastically smaller count usually means a typo in the pattern made
  everything fall into `exclude_layer`.
- Per-layer `quant_type` / `quant_dtype` reflect the format you intended for
  that pattern. The mapping is:

  | Format string | `quant_type` | `quant_dtype` |
  |---|---|---|
  | `ptpc_fp8` | `per_Token` | `torch.float8_e4m3fn` |
  | `mxfp4` | `per_1x32` | `torch.uint8` (packed FP4x2) |

- `elapsed_seconds` indicates the post-loading processing time on rank 0. A
  large jump from one restart to another with the same config usually points
  to a TP gather being triggered (see § 5.2).

The runtime also logs a one-line summary in the server log:

```
Weight post-processing done: 2.34 seconds, 144 layers online-quantized
Online quantization info saved to /root/online_quant_info_20260525_033839_112444436.json
```

---

## 5. Notes and gotchas

### 5.1 When online quantization activates

`--online_quant_config` is only applied when the source checkpoint's
`quant_method` is unquantized or per-block FP8 (see § 1).

### 5.2 Tensor-parallel behavior

Tensor-parallel weights are gathered onto a single rank before quantization
**only** when local quantization would produce different scales than quantizing
the full unpartitioned weight. Concretely:

- `ptpc_fp8` (`per_Token`): scales are per output channel and the channel
  dimension is exactly what TP shards on, so quantization is done locally with
  no gather.
- `mxfp4` (`per_1x32`): scales are within 32-element blocks along the input
  dimension; for `RowParallelLinear` this requires a gather on the input dim
  before quantization, then re-sharding. This is the most expensive case.

If load time grows linearly with TP size, your recipe is hitting the gather
path.

### 5.3 Only Linear and fused MoE modules are quantized

Modules whose weights are not loaded through ATOM's `LinearMethodBase` or
`FusedMoEMethodBase` paths are skipped silently. In practice this means
embeddings, layernorms, attention bias, and any custom op kept in BF16 will not
appear in `online_quant_info_*.json` — that is expected.

### 5.4 Compile cache

The compile cache (`/root/.cache/atom/*`) is keyed on the full quantization
config hash. Switching `--online_quant_config` between runs will trigger a
recompile on first startup. If you are iterating rapidly:

```bash
rm -rf /root/.cache/atom/*
```

### 5.5 Always exclude the MoE gate

The MoE router (`*.gate.*`) is a tiny Linear that produces top-k routing
logits. Quantizing it consistently produces large accuracy drops on every MoE
model we have measured. Keep it in the exclude list unless you have a specific
reason not to.

---