# ATOM Configuration Guide

ATOM (AiTer Optimized Model) is AMD's lightweight LLM inference engine built on [AITER](https://github.com/ROCm/aiter) kernels for ROCm/HIP GPUs. This guide documents every configuration class, CLI flag, and environment variable that controls ATOM's runtime behaviour.

---

## Quick Reference

| Config Class | Primary Purpose |
|---|---|
| `Config` | Master dataclass -- model path, memory, TP size, scheduler limits, KV cache, profiler, and references to all sub-configs |
| `CompilationConfig` | Compilation level (0-3), CUDA graph capture sizes, piecewise splitting ops, inductor settings |
| `CompilationLevel` | Integer constants for the four compilation levels |
| `CUDAGraphMode` | Enum controlling how CUDA graphs are captured (none / piecewise / full / hybrid) |
| `QuantizationConfig` | Layer-wise quantization orchestrator: global config, per-layer overrides, exclude lists, layer name remapping |
| `LayerQuantConfig` | Per-layer quantization parameters: quant type, dtype, dynamic flag, method |
| `ParallelConfig` | Data-parallel size, rank, master IP/port |
| `SpeculativeConfig` | Speculative decoding method, draft model, number of speculative tokens |
| `KVCacheConfig` / `KVCacheTensor` | Per-layer KV cache tensor descriptors (k/v caches and scales) |
| `SamplingParams` | Temperature, max tokens, stop strings, ignore-EOS flag |
| `EngineArgs` | CLI argument parser that builds a `Config` for `LLMEngine` |

---

## 1. Master Configuration (`Config`)

Defined in `atom/config.py`. The root dataclass that the engine consumes.

| Field | Type | Default | Description |
|---|---|---|---|
| `model` | `str` | *(required)* | HuggingFace model name or local path |
| `trust_remote_code` | `bool` | `False` | Trust remote code when loading the model from HuggingFace |
| `max_num_batched_tokens` | `int` | `16384` | Maximum number of tokens batched together per scheduler step |
| `scheduler_delay_factor` | `float` | `0.0` | Multiplicative delay (factor x previous prompt latency) before scheduling the next prompt |
| `max_num_seqs` | `int` | `512` | Maximum number of sequences batched together |
| `max_model_len` | `int \| None` | `None` | Maximum context length; defaults to `hf_config.max_position_embeddings` (capped by it when set) |
| `gpu_memory_utilization` | `float` | `0.9` | Fraction of GPU memory available for KV cache and weights (0.0 -- 1.0) |
| `tensor_parallel_size` | `int` | `1` | Number of tensor-parallel GPUs (1 -- 8) |
| `enforce_eager` | `bool` | `False` | Disable compilation and CUDA graphs; run in eager mode |
| `parallel_config` | `ParallelConfig` | `ParallelConfig()` | Data-parallel configuration (see Section 4) |
| `kv_cache_block_size` | `int` | `16` | Block size for paged KV cache; must be a multiple of 16 or exactly 1 |
| `num_kvcache_blocks` | `int` | `-1` | Number of KV cache blocks (`-1` = auto) |
| `kv_cache_dtype` | `str` | `"bf16"` | KV cache data type (`"bf16"` or `"fp8"`) |
| `enable_prefix_caching` | `bool` | `False` | Enable prefix caching to reuse KV blocks across requests sharing the same prefix |
| `port` | `int` | `8006` | Engine internal communication port |
| `torch_profiler_dir` | `str \| None` | `os.getenv("ATOM_TORCH_PROFILER_DIR", None)` | Directory for saving PyTorch profiler traces; creates the directory if it does not exist |
| `compilation_config` | `CompilationConfig` | `CompilationConfig()` | Compilation and CUDA graph settings (see Section 2) |
| `quant_config` | `QuantizationConfig` | *(auto-detected)* | Quantization settings; auto-detected from HuggingFace config during `__post_init__` via `QuantizationConfig(hf_config)` (see Section 3) |
| `asyncio_mode` | `bool` | `False` | Enable asyncio-based engine loop |
| `load_dummy` | `bool` | `False` | Skip loading model weights (for benchmarking / testing) |
| `enable_expert_parallel` | `bool` | `False` | Enable Expert Parallelism for MoE models |
| `master_addr` | `str` | `"127.0.0.1"` | Master address for distributed communication |
| `graph_bs` | `Optional[list[int]]` | `None` | Explicit list of batch sizes for CUDA graph capture; derived from `compilation_config` during init |
| `enable_dp_attention` | `bool` | `False` | Enable data-parallel attention |
| `torch_dtype` | `torch.dtype` | *(computed)* | Inferred from `hf_config.torch_dtype`; falls back to `torch.bfloat16` |
| `speculative_config` | `Optional[SpeculativeConfig]` | `None` | Speculative decoding configuration (see Section 5) |
| `bos_token_id` | `int` | `-1` | Beginning-of-sequence token ID (`-1` = use model default) |
| `eos_token_id` | `int` | `-1` | End-of-sequence token ID (`-1` = use model default) |
| `stop_token_ids` | `list[int]` | `[]` | Additional stop token IDs; populated from `GenerationConfig.eos_token_id` during init |

**Auto-derived fields** (set in `__post_init__`, not user-supplied):

| Field | Type | Description |
|---|---|---|
| `hf_config` | `PretrainedConfig` | Loaded automatically via `get_hf_config(model)` |
| `generation_config` | `GenerationConfig` | Loaded automatically via `get_generation_config(model)` |
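For orientation, here is a minimal sketch of constructing the dataclass directly in Python (most deployments go through `EngineArgs` instead; see Section 7). The import path follows the Source Files table; anything beyond the fields documented above is an assumption.

```python
from atom.config import Config  # path per the Source Files table

# Only user-facing fields from the table above are set here; __post_init__
# is documented to load hf_config / generation_config and to auto-detect
# quant_config, so those are not passed explicitly.
config = Config(
    model="Qwen/Qwen3-0.6B",
    tensor_parallel_size=1,
    max_num_seqs=256,
    gpu_memory_utilization=0.9,
    kv_cache_dtype="bf16",
    enable_prefix_caching=True,
)

print(config.torch_dtype)              # computed from hf_config.torch_dtype
print(config.quant_config.get_name())  # auto-detected quantization method
```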
---

## 2. Compilation Configuration (`CompilationConfig`)

Defined in `atom/config.py`. Controls torch.compile and CUDA graph behaviour.

### 2.1 Compilation Levels (`CompilationLevel`)

| Constant | Value | Description |
|---|---|---|
| `NO_COMPILATION` | `0` | No compilation -- pure eager execution |
| `DYNAMO_AS_IS` | `1` | Use torch.compile / TorchDynamo as-is |
| `DYNAMO_ONCE` | `2` | TorchDynamo with a single compilation pass |
| `PIECEWISE` | `3` | Piecewise compilation with CUDA graph capture (recommended for production) |

### 2.2 `CompilationConfig` Fields

| Field | Type | Default | Description |
|---|---|---|---|
| `level` | `int` | `0` | Compilation level (see table above); must be 0 -- 3 |
| `use_cudagraph` | `bool` | `True` | Whether to use CUDA graphs |
| `cudagraph_capture_sizes` | `Optional[list[int]]` | `None` | Explicit list of batch sizes for CUDA graph capture; overrides `cuda_graph_sizes` when set |
| `cuda_graph_sizes` | `list[int]` | `[]` (post-init: `[512]`) | CUDA graph sizing strategy: 1 value generates `[1, 2, 4, 8] + range(16, N + 1, 16)`; multiple values are used as-is; empty defaults to `[512]` (see the sketch after this table) |
| `debug_dump_path` | `str` | `""` | Path to dump debug / compilation information |
| `cache_dir` | `str` | `""` | Directory for compilation caches |
| `use_inductor` | `bool` | `True` | Enable TorchInductor backend |
| `cudagraph_mode` | `Optional[CUDAGraphMode]` | `None` | CUDA graph capture mode (see below); set to `PIECEWISE` automatically at level 3 |
| `splitting_ops` | `Optional[list[str]]` | `None` | Ops that split the graph into sub-graphs for piecewise compilation; auto-populated at level 3 with `["aiter.unified_attention_with_output", "aiter.mla_attention"]` |
| `cudagraph_copy_inputs` | `bool` | `False` | Copy input tensors into internally managed buffers before CUDA graph replay; only effective in PIECEWISE mode |
| `compile_sizes` | `Optional[list[Union[int, str]]]` | `None` | Sizes to compile for inductor; accepts integers and the string `"cudagraph_capture_sizes"` |
| `inductor_compile_config` | `dict` | `{}` | Additional configuration passed to the inductor backend |
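The sizing rule for `cuda_graph_sizes` is easiest to see in code. The helper below is an illustrative restatement of the rule in the table, not ATOM's actual implementation, and how the empty/default case expands is an assumption.

```python
def expand_cuda_graph_sizes(cuda_graph_sizes: list[int]) -> list[int]:
    """Illustrative restatement of the documented cuda_graph_sizes strategy."""
    if not cuda_graph_sizes:
        cuda_graph_sizes = [512]         # post-init default (assumed to expand like a single value)
    if len(cuda_graph_sizes) > 1:
        return sorted(cuda_graph_sizes)  # multiple values: used as-is
    n = cuda_graph_sizes[0]              # single value N: [1, 2, 4, 8] + range(16, N + 1, 16)
    return [1, 2, 4, 8] + list(range(16, n + 1, 16))

print(expand_cuda_graph_sizes([64]))     # [1, 2, 4, 8, 16, 32, 48, 64]
```

Note that `cudagraph_capture_sizes`, when set, bypasses this strategy entirely.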
### 2.3 CUDA Graph Mode (`CUDAGraphMode`)

| Mode | Value | Description |
|---|---|---|
| `NONE` | `0` | No CUDA graph capture |
| `PIECEWISE` | `1` | Piecewise CUDA graphs -- attention ops stay outside the graph for flexibility (default at level 3) |
| `FULL` | `2` | Full CUDA graph capture for all batches; best for small models / short prompts |
| `FULL_DECODE_ONLY` | `(FULL, NONE)` | Full CUDA graphs for decode batches only; mixed prefill-decode runs without graphs (useful in P/D setups) |
| `FULL_AND_PIECEWISE` | `(FULL, PIECEWISE)` | Full graphs for decode, piecewise for prefill/mixed -- most performant mode for most models |

Helper methods on `CUDAGraphMode` (see the usage sketch after this list):

- `decode_mode()` -- returns the mode used for pure decode batches.
- `mixed_mode()` -- returns the mode used for mixed prefill-decode batches.
- `requires_piecewise_compilation()` -- whether the mode needs piecewise compilation.
- `has_full_cudagraphs()` -- whether the mode includes full CUDA graph capture.
- `separate_routine()` -- whether decode and mixed batches use different routines.
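To make the hybrid modes concrete, here is a usage sketch based on the descriptions above; the import path and the exact return values of the helpers are assumptions drawn from this table, not verified against the source.

```python
from atom.config import CUDAGraphMode  # import path assumed

mode = CUDAGraphMode.FULL_AND_PIECEWISE  # value (FULL, PIECEWISE)

# Per the table: full graphs for pure decode batches, piecewise graphs for
# mixed prefill-decode batches, hence two separate capture/replay routines.
assert mode.decode_mode() == CUDAGraphMode.FULL
assert mode.mixed_mode() == CUDAGraphMode.PIECEWISE
assert mode.has_full_cudagraphs()
assert mode.separate_routine()
```

`FULL_DECODE_ONLY` decomposes the same way, except that `mixed_mode()` returns `NONE`, so mixed batches run without graphs.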
---

## 3. Quantization Configuration (`QuantizationConfig` & `LayerQuantConfig`)

Defined in `atom/config.py`. The quantization system uses two classes:

- **`QuantizationConfig`** -- the top-level orchestrator that holds a global config, per-layer overrides, and exclusion lists. It is **not** a `dict` subclass.
- **`LayerQuantConfig(dict)`** -- a `dict` subclass that stores the concrete quantization parameters for a single layer (or as the global default).

### 3.1 `LayerQuantConfig` Fields

`LayerQuantConfig` extends `dict`. Fields are stored and accessed as dictionary keys (e.g., `cfg["quant_type"]`).

| Key | Type | Default | Description |
|---|---|---|---|
| `quant_type` | `QuantType` | `QuantType.No` | Quantization granularity (see below) |
| `quant_dtype` | `torch.dtype` | `torch.bfloat16` | Data type for quantized weights |
| `is_dynamic` | `bool` | `True` | Use dynamic quantization (scales computed at runtime) |
| `quant_method` | `str` | `""` | Quantization method (e.g., `"quark"`, `"compressed-tensors"`) |

### 3.2 `QuantizationConfig` Attributes

| Attribute | Type | Description |
|---|---|---|
| `torch_dtype` | `torch.dtype` | The model's default dtype (from `hf_config.torch_dtype`) |
| `hf_quant_config` | `dict \| None` | Raw `quantization_config` dict from HuggingFace config |
| `global_quant_config` | `LayerQuantConfig` | Default quantization config applied to all layers |
| `layer_quant_config` | `dict[str, LayerQuantConfig]` | Per-layer overrides keyed by layer name pattern (supports fnmatch globs like `"*.mlp.*"`) |
| `exclude_layers` | `list[str]` | Layer names excluded from quantization (supports exact match and `"re:"` regex prefix) |
| `quant_method` | `str` | Top-level quantization method name (e.g., `"quark"`, `"compressed-tensors"`) |

Key methods:

| Method | Description |
|---|---|
| `get_name()` | Returns the quantization method name |
| `get_layer_quant_config(layer_name)` | Returns the `LayerQuantConfig` for a layer: checks exclusions first, then per-layer overrides, then falls back to the global config |
| `should_ignore_layer_quant(layer_name)` | Returns `True` if the layer is in the exclusion list |
| `remap_layer_name(hf_config, packed_modules_mapping)` | Remaps layer names for packed/fused modules (e.g., `q_a_proj` → `fused_qkv_a_proj` for DeepSeek) |
| `compute_hash()` | Returns a SHA-256 hash of the quantization config for cache invalidation |
| `parse_quark_config_dict(config)` | Parses a quark-format config dict into a `LayerQuantConfig` |

### 3.3 `QuantType` Values (from AITER)

| Value | Description |
|---|---|
| `QuantType.No` | No quantization |
| `QuantType.per_Token` | Per-token / per-channel quantization |
| `QuantType.per_1x128` | Block quantization with group size 128 |
| `QuantType.per_1x32` | Block quantization with group size 32 |
| `QuantType.per_128x128` | Large 2D block quantization (remapped to `per_1x128` in MoE kernels) |
| `QuantType.per_Tensor` | Per-tensor quantization |

### 3.4 Supported Quantization Dtypes

| Dtype | AITER Key | Notes |
|---|---|---|
| FP8 (E4M3) | `"fp8"` | 8-bit floating point |
| MXFP4 | `"fp4x2"` | Microscaling FP4; forces `QuantType.per_1x32` |
| INT8 | `"i8"` | 8-bit integer |
| INT4 | `"i4x2"` | 4-bit integer (packed) |

### 3.5 Auto-Detection from HuggingFace

During `Config.__post_init__`, ATOM constructs `QuantizationConfig(hf_config)`, which reads `hf_config.quantization_config` and automatically determines quantization parameters.

**For quark models** (`quant_method == "quark"`):

1. Parses the `global_quant_config` dict via `parse_quark_config_dict()` to produce the global `LayerQuantConfig`.
2. Parses each entry in the `layer_quant_config` dict to produce per-layer overrides.
3. Reads the `"exclude"` list for excluded layers.
4. Within each config dict, `weight.qscheme` determines `quant_type` (`"per_channel"` → `per_Token`, `"per_tensor"` → `per_Tensor`, `"per_group"` → `per_1x32`), and `weight.dtype` determines `quant_dtype` (see the mapping sketch after this list).
5. `input_tensors.is_dynamic` controls dynamic quantization (defaults to `True` if absent).
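A small sketch of the `weight.qscheme` mapping from step 4. The helper below is hypothetical and only mirrors the documented mapping (the real parsing lives in `parse_quark_config_dict()`); the `QuantType` import path is assumed.

```python
from aiter import QuantType  # import path assumed; values listed in Section 3.3

# Hypothetical helper mirroring step 4 above.
QSCHEME_TO_QUANT_TYPE = {
    "per_channel": QuantType.per_Token,
    "per_tensor": QuantType.per_Tensor,
    "per_group": QuantType.per_1x32,
}

def quant_type_from_qscheme(qscheme: str) -> QuantType:
    # Unknown or missing qscheme falls back to "no quantization".
    return QSCHEME_TO_QUANT_TYPE.get(qscheme, QuantType.No)

print(quant_type_from_qscheme("per_channel"))  # QuantType.per_Token
```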
**For other models** (compressed-tensors, etc.):

1. If `quant_method == "compressed-tensors"` or channel quantization is detected, sets `per_Token`.
2. If `weight_block_size` or `group_size` is found: group size 128 maps to `per_1x128`, group size 32 maps to `per_1x32`.
3. Otherwise falls back to `per_Tensor`.
4. The dtype is parsed from fields like `dtype`, `weight_dtype`, or `quant_method`, looking for `fp8`, `fp4`, `mxfp4`, `int8`, `int4`, or `num_bits`.
5. If `activation_scheme` is `"static"`, `is_dynamic` is set to `False`.
6. Excluded layers are read from the `"ignore"` key.

### 3.6 Layer-Level Quantization Dispatch

Linear layers, MoE layers, and fused ops call `quant_config.get_layer_quant_config(prefix)` to obtain the appropriate `LayerQuantConfig` for their position in the model. This enables mixed-precision quantization where different layers can have different quant types and dtypes (e.g., FP8 for attention, FP4 for MLP).

---

## 4. Parallel Configuration (`ParallelConfig`)

Defined in `atom/config.py`. Controls data parallelism. Environment variables (Section 8) override defaults when set.

| Field | Type | Default | Description |
|---|---|---|---|
| `data_parallel_size` | `int` | `1` | Number of data-parallel groups; overridden by `ATOM_DP_SIZE` env var |
| `data_parallel_size_local` | `int` | `1` | Number of local data-parallel groups |
| `data_parallel_rank` | `int` | `0` | Rank within the data-parallel group; overridden by `ATOM_DP_RANK` |
| `data_parallel_rank_local` | `Optional[int]` | `None` | Local rank within the data-parallel group (SPMD mode); overridden by `ATOM_DP_RANK_LOCAL` |
| `data_parallel_master_port` | `int` | `29500` | Port used by the data-parallel master for process group initialization |
| `data_parallel_base_port` | `int` | `get_open_port()` | Base port for data-parallel communication (dynamically assigned) |
| `data_parallel_master_ip` | `str` | `"127.0.0.1"` | IP address of the data-parallel master |

**Computed properties:**

- `world_size` -- set during init, equals TP x PP.
- `world_size_across_dp` -- `world_size * data_parallel_size`.

---

## 5. Speculative Decoding Configuration (`SpeculativeConfig`)

Defined in `atom/config.py`. Currently only the Multi-Token Prediction (MTP) method with `num_speculative_tokens=1` is supported.

| Field | Type | Default | Description |
|---|---|---|---|
| `method` | `Optional[str]` | `""` | Speculative decoding method; currently only `"mtp"` is accepted |
| `model` | `Optional[str]` | `None` | Draft model name or path (typically the same as the target model for MTP) |
| `num_speculative_tokens` | `Optional[int]` | `None` | Number of speculative tokens per iteration; **must be `1`** |
| `draft_model_hf_config` | `Optional[PretrainedConfig]` | `None` | HuggingFace config for the draft model; auto-loaded from `model` when `None` |

**Post-init behaviour:**

- Loads `draft_model_hf_config` from `model` if not provided.
- For DeepSeek V3 / MTP models: overrides `model_type` to `"deepseek_mtp"`, sets `n_predict=1` and `num_nextn_predict_layers=1`, and switches architectures to `["DeepSeekMTPModel"]`.
- `Config.__post_init__` raises `ValueError` if `num_speculative_tokens != 1`.

---

## 6. Sampling Parameters (`SamplingParams`)

Defined in `atom/sampling_params.py`. Passed per-request to control generation.

| Field | Type | Default | Description |
|---|---|---|---|
| `temperature` | `float` | `1.0` | Sampling temperature; lower values make output more deterministic |
| `max_tokens` | `int` | `64` | Maximum number of tokens to generate |
| `ignore_eos` | `bool` | `False` | Continue generating past the EOS token |
| `stop_strings` | `Optional[list[str]]` | `None` | List of strings that trigger generation to stop |
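A short construction example using the fields above (import path per the Source Files table); the request-submission API that consumes it is outside the scope of this guide.

```python
from atom.sampling_params import SamplingParams

# Per-request sampling settings; unset fields keep the defaults listed above.
params = SamplingParams(
    temperature=0.7,              # lower values -> more deterministic output
    max_tokens=256,
    ignore_eos=False,
    stop_strings=["</answer>"],   # stop generation when this string appears
)
```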
---

## 7. CLI Arguments (`EngineArgs`)

Defined in `atom/model_engine/arg_utils.py`. The `EngineArgs` dataclass exposes all flags via `add_cli_args()` and converts them into a `Config` via `create_engine()`.

| Flag | Short | Type | Default | Description |
|---|---|---|---|---|
| `--model` | | `str` | `"Qwen/Qwen3-0.6B"` | Model name or path |
| `--trust-remote-code` | | flag | `False` | Trust remote code when loading model |
| `--tensor-parallel-size` | `-tp` | `int` | `1` | Tensor parallel size |
| `--data-parallel-size` | `-dp` | `int` | `1` | Data parallel size |
| `--enforce-eager` | | flag | `False` | Enforce eager mode execution |
| `--enable_prefix_caching` | | flag | `False` | Enable prefix caching |
| `--port` | | `int` | `8006` | Engine internal port |
| `--kv_cache_dtype` | | `str` | `"bf16"` | KV cache dtype; choices: `bf16`, `fp8` |
| `--block-size` | | `int` | `16` | KV cache block size (maps to `kv_cache_block_size`) |
| `--max-model-len` | | `int` | `None` | Maximum model context length; defaults to `hf_config.max_position_embeddings` |
| `--cudagraph-capture-sizes` | | `str` | `"[1,2,4,8,16,32,48,64,128,256]"` | CUDA graph capture sizes as a Python list string |
| `--level` | | `int` | `3` | Compilation level (0 -- 3) |
| `--load_dummy` | | flag | `False` | Skip loading model weights |
| `--enable-expert-parallel` | | flag | `False` | Enable Expert Parallelism (EP MoE) |
| `--torch-profiler-dir` | | `str` | `None` | Directory for torch profiler traces |
| `--enable-dp-attention` | | flag | `False` | Enable DP attention |
| `--method` | | `str` | `None` | Speculative method; choices: `mtp` |
| `--num-speculative-tokens` | | `int` | `1` | Number of speculative tokens per iteration |
| `--max-num-batched-tokens` | | `int` | `16384` | Maximum number of tokens to batch in the async engine |
| `--max-num-seqs` | | `int` | `512` | Maximum number of sequences to batch together |
| `--gpu-memory-utilization` | | `float` | `0.9` | Fraction of GPU memory to use (0.0 -- 1.0) |
| `--scheduler-delay-factor` | | `float` | `0.0` | Delay factor multiplied by previous prompt latency before scheduling the next prompt |

**Example:**

```bash
python -m atom.entrypoint \
    --model deepseek-ai/DeepSeek-R1 \
    --tensor-parallel-size 8 \
    --level 3 \
    --cudagraph-capture-sizes "[1,2,4,8,16,32,64,128,256]" \
    --kv_cache_dtype fp8 \
    --gpu-memory-utilization 0.92 \
    --max-num-seqs 256
```

---

## 8. Environment Variables

### 8.1 Variables Registered in `atom/utils/envs.py`

All variables use lazy evaluation. Boolean variables treat `"1"` as `True` and anything else (including unset) as `False`, unless noted otherwise.

| Variable | Type | Default | Description |
|---|---|---|---|
| `ATOM_DP_RANK` | `int` | `0` | Data-parallel rank of this process |
| `ATOM_DP_RANK_LOCAL` | `int` | `0` | Local data-parallel rank (for SPMD mode) |
| `ATOM_DP_SIZE` | `int` | `1` | Total number of data-parallel groups |
| `ATOM_DP_MASTER_IP` | `str` | `"127.0.0.1"` | IP address of the data-parallel master |
| `ATOM_DP_MASTER_PORT` | `int` | `29500` | Port of the data-parallel master |
| ~~`ATOM_ENFORCE_EAGER`~~ | | | Removed. Use CLI flag `--enforce-eager` instead. |
| `ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION` | `bool` | `False` | Enable QK-norm + RoPE + cache + quant fusion; enable for Qwen3-MoE models |
| `ATOM_USE_TRITON_GEMM` | `bool` | `False` | Use Triton-based GEMM kernels instead of default backends |
| `ATOM_USE_TRITON_MXFP4_BMM` | `bool` | `False` | Use Triton-based MXFP4 batched matrix multiply |
| `ATOM_ENABLE_DS_INPUT_RMSNORM_QUANT_FUSION` | `bool` | `True` | Enable fused input RMSNorm + quantization for DeepSeek models |
| `ATOM_ENABLE_DS_QKNORM_QUANT_FUSION` | `bool` | `True` | Enable fused QK-norm + quantization for DeepSeek models |
| `ATOM_ENABLE_ALLREDUCE_RMSNORM_FUSION` | `bool` | `True` | Enable fused all-reduce + RMSNorm kernel |
| `ATOM_LLAMA_ENABLE_AITER_TRITON_FUSED_RMSNORM_QUANT` | `bool` | `True` | Enable AITER Triton fused RMSNorm + quantization for LLaMA models |
| `ATOM_LLAMA_ENABLE_AITER_TRITON_FUSED_SILU_MUL_QUANT` | `bool` | `True` | Enable AITER Triton fused SiLU + multiply + quantization for LLaMA models |

### 8.2 Additional Environment Variables (Used Outside `envs.py`)

| Variable | Type | Default | Where Used | Description |
|---|---|---|---|---|
| `ATOM_TORCH_PROFILER_DIR` | `str` | `None` | `atom/config.py` (`Config.torch_profiler_dir`) | Directory for PyTorch profiler output; sets the default for `Config.torch_profiler_dir` |
| `ATOM_PROFILER_MORE` | `str` | `"0"` | `atom/model_engine/model_runner.py` | Set to `"1"` to enable detailed profiling (`record_shapes`, `with_stack`, `profile_memory`) |
| `HF_TOKEN` | `str` | `None` | `atom/config.py` (`get_hf_config`) | HuggingFace authentication token for gated model downloads |
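As an illustration, a rank-0 worker of a two-way data-parallel deployment might export these variables before launching the engine. The values are illustrative only; the entrypoint matches the example in Section 7.

```bash
# Illustrative: DP rank 0 of 2, with detailed torch profiling enabled.
export ATOM_DP_SIZE=2
export ATOM_DP_RANK=0
export ATOM_DP_MASTER_IP=10.0.0.1
export ATOM_DP_MASTER_PORT=29500
export ATOM_TORCH_PROFILER_DIR=/tmp/atom_traces
export ATOM_PROFILER_MORE=1

python -m atom.entrypoint --model Qwen/Qwen3-0.6B --data-parallel-size 2
```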
---

## 9. Decision Tree -- Choosing a Compilation Level

```
Start
  |
  v
Is this a debugging / development run?
  |-- Yes --> Level 0 (NO_COMPILATION) or --enforce-eager
  |
  v
Do you need torch.compile but no graph splitting?
  |-- Yes, one-shot compile --> Level 2 (DYNAMO_ONCE)
  |-- Yes, keep Dynamo default --> Level 1 (DYNAMO_AS_IS)
  |
  v
Production inference on ROCm/HIP GPU?
  |-- Yes --> Level 3 (PIECEWISE)  [default in EngineArgs]
  |           - Auto-sets CUDAGraphMode.PIECEWISE
  |           - Auto-populates splitting_ops for attention ops
  |           - Pair with --cudagraph-capture-sizes for your batch profile
  |
  v
Need maximum decode throughput?
  |-- Yes --> Level 3 + set cudagraph_mode to FULL_AND_PIECEWISE
              (full graphs for decode, piecewise for prefill)
```

**Rules of thumb:**

- **Level 3** is the default for `EngineArgs` and is recommended for most production workloads.
- **Level 0** / `--enforce-eager` is useful for debugging, profiling, or when CUDA graphs are incompatible with your model.
- Match `--cudagraph-capture-sizes` to your expected batch sizes for optimal memory usage and launch latency.
- When using `--enable-dp-attention` or Expert Parallelism (`--enable-expert-parallel`), level 3 is still recommended.

---

## Source Files

| File | Description |
|---|---|
| `atom/config.py` | `Config`, `CompilationConfig`, `CompilationLevel`, `CUDAGraphMode`, `LayerQuantConfig`, `QuantizationConfig`, `ParallelConfig`, `SpeculativeConfig`, `KVCacheTensor`, `KVCacheConfig`, `get_hf_config` |
| `atom/utils/envs.py` | All `ATOM_*` environment variable definitions with lazy evaluation |
| `atom/model_engine/arg_utils.py` | `EngineArgs` dataclass and CLI argument parser |
| `atom/sampling_params.py` | `SamplingParams` dataclass |
| `atom/model_engine/model_runner.py` | Uses `ATOM_PROFILER_MORE` and `ATOM_TORCH_PROFILER_DIR` for profiling |