madengine — Codebase Wiki

AI/ML model automation & benchmarking platform for local Docker, Kubernetes, and SLURM. A Typer-based CLI that discovers models, builds Docker images, runs them across compute targets, and writes structured performance results.

Entry point: src/madengine/cli/app.py::cli_main → console script madengine registered in pyproject.toml.

v2.1.0 (2026-05-28) | Python ≥ 3.8 | 5-layer CLI | Local · K8s · SLURM · slurm_multi | Typer + Rich | ROCm & CUDA | Jinja2 templates

Overview

What madengine does

Discover — finds model definitions from models.json or dynamic scripts, resolves tags
Build — calls docker build for each model, writes build_manifest.json
Run — reads manifest, infers compute target, dispatches containers, writes perf.csv
Report — converts perf.csv to HTML or email; uploads to MongoDB

All four stages share a single --additional-context configuration spine that controls GPU vendor, deployment type, launcher, profiling tools, and environment variables.

What's new in v2.1.0

slurm_multi — self-managed multi-node SLURM launcher for workloads with per-node Docker (e.g. SGLang Disagg)
--use-image [auto] / --build-on-compute — new madengine build modes
Docker --build-context tools= — shared tool APIs accessible in every Dockerfile
Local MAD_MULTI_NODE_RUNNER — Megatron / DeepSpeed / TorchTitan now work on local Docker
SLURM env-var escaping — double-quote escaping preserves spaces & paths

Quick start

# 1. Install
pip install -e ".[dev]"

# 2. Discover available models
madengine discover --tags dummy

# 3. Build + run (single command)
madengine run --tags dummy \
  --additional-context '{"gpu_vendor":"AMD","guest_os":"UBUNTU"}'

# 4. Build only, then run from manifest
madengine build --tags llama3 --registry registry.example.com/ml
madengine run --manifest-file build_manifest.json \
  --additional-context '{"docker_gpus":"0,1,2,3"}'

Local mode: no k8s or slurm key in context → ContainerRunner (local Docker).

# Single-node K8s (minimal — defaults applied from presets/k8s/)
madengine run --tags llama3 \
  --additional-context '{"k8s":{"gpu_count":4}}'

# Multi-node vLLM on K8s
madengine run --tags vllm-serve \
  --additional-context '{
    "k8s": {"namespace":"ml-team","gpu_count":8},
    "distributed": {"launcher":"vllm","nnodes":2,"nproc_per_node":4}
  }'

# K8s with NFS data PVC and secrets
madengine run --tags model \
  --additional-context '{
    "k8s": {"namespace":"ml","gpu_count":8,"data_storage_class":"nfs-banff"},
    "secrets": {"HF_TOKEN":"hf_xxx","WANDB_API_KEY":"yyy"}
  }'

Presence of "k8s" or "kubernetes" key → KubernetesDeployment. Requires the Kubernetes extra: pip install -e ".[kubernetes]" (the ".[all]" extra also includes it).

# Single-node SLURM (build on login node, deploy via sbatch)
madengine build --tags llama3 --registry registry.example.com/ml
madengine run --manifest-file build_manifest.json \
  --additional-context '{
    "slurm": {"partition":"gpu","nodes":1,"gpus_per_node":8,"time":"12:00:00"}
  }'

# Multi-node torchrun
madengine run --manifest-file build_manifest.json \
  --additional-context '{
    "slurm": {"partition":"gpu","nodes":4,"gpus_per_node":8,"time":"24:00:00"},
    "distributed": {"launcher":"torchrun","nnodes":4,"nproc_per_node":8}
  }'

# DeepSpeed with reservation
madengine run --manifest-file build_manifest.json \
  --additional-context '{
    "slurm": {"partition":"gpu","nodes":8,"gpus_per_node":8,
              "time":"48:00:00","reservation":"ml-training"},
    "distributed": {"launcher":"deepspeed","nnodes":8,"nproc_per_node":8}
  }'

Presence of "slurm" key → SlurmDeployment. Generates sbatch wrapper from Jinja2 template.

# SGLang Disaggregated (3+ nodes: proxy + prefill + decode)
madengine run --tags pyt_sglang_disagg_qwen3-32b \
  --additional-context '{
    "slurm": {"partition":"gpu","nodes":3,"gpus_per_node":8,"time":"02:00:00"},
    "distributed": {"launcher":"slurm_multi"}
  }'

# Build options for slurm_multi models:
# Option A — use pre-built registry image (skip local build)
madengine build --tags pyt_sglang_disagg --use-image registry.io/sglang:latest

# Option B — auto-resolve DOCKER_IMAGE_NAME from model card
madengine build --tags pyt_sglang_disagg --use-image auto

# Option C — build on compute node, push, then run pulls in parallel
madengine build --tags pyt_sglang_disagg \
  --registry registry.io/ml --build-on-compute

slurm_multi bypasses the standard sbatch template: the model's own .slurm script runs directly on the head node so srun/scontrol work inside it.

# Store configuration in a JSON file and reference it
cat > my_run.json <<'EOF'
{
  "gpu_vendor": "AMD",
  "guest_os": "UBUNTU",
  "slurm": {
    "partition": "gpu",
    "nodes": 4,
    "gpus_per_node": 8,
    "time": "24:00:00",
    "exclusive": true
  },
  "distributed": {
    "launcher": "torchrun",
    "nnodes": 4,
    "nproc_per_node": 8,
    "backend": "nccl"
  },
  "env_vars": {
    "NCCL_DEBUG": "WARN",
    "HSA_ENABLE_SDMA": "0"
  },
  "tools": [{"name": "rocprofv3_compute"}]
}
EOF

madengine run --tags llama3 --additional-context-file my_run.json

--additional-context-file and --additional-context are mutually exclusive. The file is parsed as JSON (not ast.literal_eval).

Install & dev

Setup

# Base install (includes dev tools)
pip install -e ".[dev]"

# With Kubernetes support
pip install -e ".[all]"

# Enable pre-commit hooks
pre-commit install

Optional extras

Extra	Adds
`[dev]`	pytest, black, flake8, mypy, isort, pre-commit
`[kubernetes]`	`kubernetes>=28.0.0`, pyyaml
`[all]`	dev + kubernetes

Test & quality

pytest                           # all tests
pytest tests/unit/ -v            # unit only
pytest tests/unit/test_slurm_multi.py -v
pytest --cov=src/madengine --cov-report=html
pytest -m "not slow"             # skip slow tests
pytest -m "unit and amd"         # combined markers

black src/ tests/
isort src/ tests/
flake8 src/ tests/
mypy src/madengine
pre-commit run --all-files

5-layer architecture

Each layer talks only to the layers below it. Layers are color-coded throughout this wiki.

Layer legend: CLI · Orchestration · Deployment · Execution · Core · Utils · Reporting

Layer	Path	Responsibilities	Key types
CLI	src/madengine/cli/	Typer app, 5 commands, argument validation, Rich output, exit-code mapping.	`app.py`, `commands/{build,run,discover,report,database}.py`, `constants.ExitCode`
Orchestration	src/madengine/orchestration/	Discover → build → run pipeline. Decides whether to dispatch locally or to a deployment backend.	`BuildOrchestrator`, `RunOrchestrator`, `image_filtering.py`
Deployment	src/madengine/deployment/	Factory + Template Method pattern. K8s/SLURM concrete deployments, preset merging, Jinja2 templates, monitoring.	`DeploymentFactory`, `BaseDeployment`, `KubernetesDeployment`, `SlurmDeployment`, `ConfigLoader`
Execution	src/madengine/execution/	Local Docker build/run, log scanning, timeout resolution, perf parsing, self-managed launcher bypass.	`ContainerRunner`, `DockerBuilder`, `container_runner_helpers`
Core	src/madengine/core/	Cross-cutting primitives: context merging & GPU detection, shell execution, Docker wrapper, error hierarchy, auth, timeout.	`Context`, `Console`, `Docker`, `MADEngineError`, `load_credentials`
Utils	src/madengine/utils/	Model discovery, GPU vendor abstraction, ROCm path resolution, config parsing.	`DiscoverModels`, `gpu_tool_factory`, `rocm_path_resolver`, `ConfigParser`
Reporting	src/madengine/reporting/	perf.csv writers, HTML/email report generation. Database upload in src/madengine/database/.	`update_perf_csv`, `csv_to_html`, `csv_to_email`, `mongodb.py`

Architecture diagram

Key data flows

Build flow

madengine build → BuildOrchestrator.execute()
Context(build_only_mode=True) — GPU vendor / arch detection skipped unless detect_local_gpu_arch=True
ConfigLoader.load_config() merges user config over preset defaults (SLURM or K8s); user values win
DiscoverModels resolves --tags from root models.json, scripts/{dir}/models.json, or scripts/{dir}/get_models_json.py
slurm_multi gate: if model uses slurm_multi and no --registry/--use-image given → auto-resolves DOCKER_IMAGE_NAME from model card or raises ConfigurationError
DockerBuilder.build_all_models() — passes --build-context tools=scripts/common/tools if that dir exists
After registry push: sets DOCKER_IMAGE_NAME in manifest env_vars for parallel SLURM pull
Writes build_manifest.json

Run flow

madengine run → RunOrchestrator.execute()
If manifest exists: skip build; else trigger _build_phase()
Context(build_only_mode=False) — full GPU detection, ROCm path resolution
_load_and_merge_manifest() — runtime context overrides manifest deployment_config
Target inference: "k8s"/"kubernetes" → K8s · "slurm" → SLURM · neither → local
_copy_scripts() — populates scripts/common/{pre_scripts,post_scripts,tools} from madengine package
Dispatch: ContainerRunner (local) or DeploymentFactory.create() (SLURM/K8s)
Results → perf.csv / perf_entry.csv
_cleanup_model_dir_copies() — removes populated scripts/common/ files

SLURM job flow (inside sbatch)

sbatch script sets MASTER_ADDR (via scontrol), WORLD_SIZE, NNODES, node-local GPU visibility
Multi-node: generates a task script per node; runs via srun bash $TASK_SCRIPT — each node calls madengine run with local manifest
Single-node: creates local manifest with deployment_config.target="docker", calls madengine run
Each node's madengine run → ContainerRunner → docker run with SLURM env vars injected
Results collected from per-node perf.csv and aggregated

CLI — `discover`

Lists and validates model definitions without building or running.

madengine discover [OPTIONS]

  --tags TEXT              Comma-separated tags/names to filter  [required]
  --verbose / --no-verbose Show full model JSON  [default: no-verbose]

Tag syntax

Pattern	Example	Meaning
Simple tag	`--tags llama3`	Any model with tag `llama3`
Multiple tags	`--tags llama3,vllm`	Any model matching any listed tag
All models	`--tags all`	Every discovered model
Scoped (exact dir)	`--tags MAD/llama3`	Only from `scripts/MAD/` subdirectory
Dynamic + args	`--tags dummy3:dummy_3:batch=512`	Dynamic model with arg override
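
The tag patterns in the table can be illustrated with a small parser. This is only a sketch: the function name parse_tags, the returned dictionary shape, and the reading of colon-separated segments (plain names first, then key=value overrides) are assumptions made for illustration. The real resolution happens in DiscoverModels and may differ.

```python
from typing import Dict, List


def parse_tags(raw: str) -> List[Dict[str, object]]:
    """Split a --tags value into one selector per comma-separated entry."""
    selectors: List[Dict[str, object]] = []
    for entry in (part.strip() for part in raw.split(",")):
        if not entry:
            continue
        # Colon-separated segments: plain names, then key=value overrides.
        segments = entry.split(":")
        names = [s for s in segments if "=" not in s]
        args = dict(s.split("=", 1) for s in segments if "=" in s)
        if not names:
            raise ValueError(f"tag entry has no name: {entry!r}")
        # A "dir/tag" name restricts the match to one scripts/ subdirectory.
        scope, _, tag = names[0].rpartition("/")
        selectors.append({
            "scope": scope or None,
            "tag": tag,
            "extra_names": names[1:],
            "args": args,
            "match_all": entry == "all",
        })
    return selectors


if __name__ == "__main__":
    for selector in parse_tags("llama3,MAD/llama3,dummy3:dummy_3:batch=512"):
        print(selector)
```

Keeping the scope, the tag, and the argument overrides as separate fields lets a caller apply each filter independently.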

Discovery sources (checked in order per directory)

Root models.json
scripts/{dir}/models.json (static list)
scripts/{dir}/get_models_json.py — dynamic; must export list_models() → List[CustomModel]

CLI — `build`

Builds Docker images for discovered models and writes build_manifest.json.

madengine build [OPTIONS]

  --tags TEXT                    Tags to select models (mutually exclusive with --batch-manifest)
  --batch-manifest FILE          JSON file of multiple tag groups to build in sequence
  --registry TEXT                Push built images to this registry URL
  --target-archs TEXT            Comma-separated GPU arch list (e.g. "gfx90a,gfx942")
  --use-image [IMAGE|auto]       Skip local build; use named image or auto-resolve from model card
  --build-on-compute             Build on SLURM compute node + push (requires --registry)
  --additional-context TEXT      Python dict / JSON string of context overrides
  --additional-context-file FILE Path to a JSON context file (mutually exclusive with --additional-context)
  --clean-docker-cache           Pass --no-cache to docker build
  --manifest-output FILE         Output path for build_manifest.json  [default: build_manifest.json]
  --summary-output FILE          Output path for build summary JSON
  --live-output / --no-live-output   Stream docker build output line by line  [default: no-live-output]
  --verbose / --no-verbose

Flag constraints ("vs" marks a mutually exclusive pair):

--batch-manifest vs --tags
--use-image vs --registry
--use-image vs --build-on-compute
--build-on-compute requires --registry
--additional-context-file vs --additional-context

`--use-image` modes

Invocation	Behavior
`--use-image auto`	Reads `DOCKER_IMAGE_NAME` from model card `env_vars`
`--use-image registry.io/img:tag`	Uses the explicit image name; skips all Docker build steps

CLI — `run`

Runs models from a manifest (building first if needed) and writes perf.csv.

madengine run [OPTIONS]

  --tags TEXT                    Select models (triggers build if no manifest)
  --manifest-file FILE           Use existing manifest; skip build  [default: build_manifest.json]
  --registry TEXT                Registry for image pull auth
  --timeout INT                  Seconds per model; -1=7200s default, 0=disabled
  --additional-context TEXT      Python dict or JSON string
  --additional-context-file FILE JSON file (mutually exclusive with --additional-context)
  --keep-alive                   Leave container running after model completes
  --keep-model-dir               Do not clean up model directory copy
  --clean-docker-cache           Remove docker image before pull (SLURM mode)
  --skip-model-run               Build/pull only; skip execution
  --manifest-output FILE
  --summary-output FILE
  --live-output / --no-live-output  Stream container output  [default: no-live-output]
  --output FILE                  Redirect container stdout to file
  --tools-json-file-name FILE    Tools config  [default: ./scripts/common/tools.json]
  --generate-sys-env-details / --no-generate-sys-env-details
  --force-mirror-local           Force ContainerRunner even in SLURM/K8s context
  --disable-skip-gpu-arch        Ignore skip_gpu_arch model field
  --cleanup-perf                 Remove existing perf.csv before run
  --verbose / --no-verbose

Timeout resolution

Value	Resolved timeout
`-1` (default)	7200 s (2 hours), unless the model card sets `timeout`
`0`	Disabled (no timeout)
model card `timeout` field	Used only when the CLI value is the default (`-1`)
Explicit positive int	That many seconds; overrides the model card
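
The resolution rules in this table can be written as a short function. This is a minimal sketch, not the project's code: the name resolve_timeout and the choice to return None for "disabled" are assumptions.

```python
from typing import Optional

DEFAULT_TIMEOUT_S = 7200  # 2 hours, the documented default


def resolve_timeout(cli_timeout: int = -1,
                    model_timeout: Optional[int] = None) -> Optional[int]:
    """Return the effective timeout in seconds, or None when disabled."""
    if cli_timeout == 0:
        # 0 disables the timeout entirely.
        return None
    if cli_timeout > 0:
        # An explicit positive CLI value overrides the model card.
        return cli_timeout
    if cli_timeout == -1:
        # CLI left at its default: use the model card, else the 2-hour default.
        return model_timeout if model_timeout is not None else DEFAULT_TIMEOUT_S
    raise ValueError(f"unsupported --timeout value: {cli_timeout}")


if __name__ == "__main__":
    print(resolve_timeout())                 # CLI default, no model card value
    print(resolve_timeout(-1, 14400))        # model card value applies
    print(resolve_timeout(600, 14400))       # explicit CLI value wins
    print(resolve_timeout(0, 14400))         # disabled
```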

CLI — `report` & `database`

report

# Convert perf.csv to HTML
madengine report to-html --csv-file perf.csv

# Generate consolidated email report
madengine report to-email \
  --directory ./results \
  --output run_results.html

Source: cli/commands/report.py → reporting/csv_to_html.py, reporting/csv_to_email.py

database

madengine database \
  --csv-file perf.csv \
  --database-name benchmarks \
  --collection-name runs

Reads from env: MONGO_HOST, MONGO_PORT, MONGO_USER, MONGO_PASSWORD, MONGO_AUTH_SOURCE, MONGO_TIMEOUT_MS.

Source: cli/commands/database.py → database/mongodb.py

Exit codes (CI contract)

Defined in src/madengine/cli/constants.py::ExitCode. Use these in CI pipelines instead of log scraping.

Code	Name	Meaning
`0`	SUCCESS	All operations succeeded.
`1`	FAILURE	General / unhandled failure (keyboard interrupt, unexpected exception).
`2`	BUILD_FAILURE	One or more Docker image builds failed.
`3`	RUN_FAILURE	One or more model runs failed. Results still written to `perf.csv` with `STATUS=FAILURE`.
`4`	INVALID_ARGS	Argument validation rejected the invocation.

In Jenkins, use madengine run … 2>&1 | tee madengine.log with bash -o pipefail so tee doesn't swallow the exit code.
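
A CI wrapper written in Python can branch on the exit code directly, which avoids the pipe problem altogether. The sketch below mirrors the documented code values in a local enum; the NEXT_STEP policy text and the helper names describe and run_and_describe are hypothetical, and the child process is a stand-in for a real madengine run invocation.

```python
import subprocess
import sys
from enum import IntEnum
from typing import Sequence


class ExitCode(IntEnum):
    """Local mirror of the documented exit codes."""
    SUCCESS = 0
    FAILURE = 1
    BUILD_FAILURE = 2
    RUN_FAILURE = 3
    INVALID_ARGS = 4


# Hypothetical CI policy: what a pipeline step might report for each code.
NEXT_STEP = {
    ExitCode.SUCCESS: "publish results",
    ExitCode.FAILURE: "inspect the full log for an unexpected error",
    ExitCode.BUILD_FAILURE: "inspect docker build output",
    ExitCode.RUN_FAILURE: "inspect perf.csv rows with STATUS=FAILURE",
    ExitCode.INVALID_ARGS: "fix the command line in the pipeline definition",
}


def describe(returncode: int) -> str:
    """Translate a process return code into a CI-facing message."""
    try:
        code = ExitCode(returncode)
    except ValueError:
        return f"unknown exit code {returncode}"
    return f"{code.name}: {NEXT_STEP[code]}"


def run_and_describe(command: Sequence[str]) -> str:
    """Run a command without a shell pipe, so its exit code is never masked."""
    completed = subprocess.run(command, check=False)
    return describe(completed.returncode)


if __name__ == "__main__":
    # Stand-in for `madengine run ...`: a child process that exits with code 3.
    print(run_and_describe([sys.executable, "-c", "raise SystemExit(3)"]))
```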

`additional_context` — configuration spine

--additional-context accepts a Python dict string (parsed with ast.literal_eval, not json.loads); --additional-context-file accepts a JSON file. In both cases the resulting dict is deep-merged into Context.ctx alongside system-detected values.

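Gotcha (Python dict, not JSON): pass '{"key":"val"}' or "{'key':'val'}". A JSON object built only from strings, numbers, lists, and nested objects is also a valid Python literal. Booleans are the exception: ast.literal_eval accepts True/False/None and rejects JSON's true/false/null, while a JSON parser does the opposite. The shell itself does not alter either spelling. When a value must be boolean, the unambiguous route is a JSON context file (--additional-context-file) with true/false. Single-quote the whole argument so the shell passes the inner double quotes through unchanged.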
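
The difference between the two parsers is easy to check with the standard library alone. The helper names and sample strings below are illustrative; nothing here calls madengine.

```python
import ast
import json


def is_python_literal(text: str) -> bool:
    """True if ast.literal_eval can parse the string."""
    try:
        ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return False
    return True


def is_json(text: str) -> bool:
    """True if json.loads can parse the string."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


SAMPLES = [
    '{"gpu_vendor": "AMD", "docker_gpus": "0,1"}',  # valid for both parsers
    "{'gpu_vendor': 'AMD'}",                         # Python literal only
    '{"slurm": {"exclusive": true}}',                # JSON only
    '{"slurm": {"exclusive": True}}',                # Python literal only
]

if __name__ == "__main__":
    for sample in SAMPLES:
        print(f"{sample:45} python={is_python_literal(sample)} json={is_json(sample)}")
```

Strings that use only double-quoted keys, strings, and numbers parse the same way under both functions, which is why they are the safest choice on the command line.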

Key	Type	Subsystem	Description & example
`gpu_vendor`	string	Core	Override GPU vendor detection. `"AMD"` or `"NVIDIA"`. Defaults to `"AMD"` if not set and auto-detect fails.
`guest_os`	string	Core	Container OS for package manager selection. `"UBUNTU"` or `"CENTOS"`. Affects rocEnvTool installer selection.
`MAD_ROCM_PATH`	string	Core	Override host ROCm root path (e.g. `"/opt/rocm-6.2"`). Takes priority over auto-detection and `ROCM_PATH` env.
`docker_env_vars`	dict	Exec	Env vars injected as `--env` into `docker run`. Keys are validated with `_ENV_KEY_RE`. Special: `docker_env_vars.MAD_ROCM_PATH` overrides in-container ROCm root independently of host.
`docker_build_arg`	dict	Exec	Extra `--build-arg KEY=VAL` flags passed to `docker build`.
`docker_gpus`	string	Exec	Comma-separated GPU indices to expose, or `"all"`. E.g. `"0,1,2,3"`.
`docker_cpus`	string	Exec	CPU affinity string for `--cpuset-cpus`. E.g. `"0-15"`.
`docker_mounts`	dict	Exec	Extra volume mounts. E.g. `{"host_path":"/data","container_path":"/mnt/data"}`.
`docker_image` / `MAD_CONTAINER_IMAGE`	string	Orch	Skip build entirely; use this image for all models. Creates a synthetic manifest.
`k8s` / `kubernetes`	dict	Deploy	Selects Kubernetes deployment. See K8s config section for sub-keys.
`slurm`	dict	Deploy	Selects SLURM deployment. See SLURM config section for sub-keys.
`distributed`	dict	Deploy	Distributed launcher configuration. `launcher`, `nnodes`, `nproc_per_node`, `backend`, `port`. See Per-launcher config.
`distributed.launcher`	string	Deploy	`"torchrun"`, `"deepspeed"`, `"megatron"`, `"torchtitan"`, `"primus"`, `"vllm"`, `"sglang"`, `"sglang_disagg"`, `"slurm_multi"`/`"slurm-multi"`.
`distributed.sglang_disagg`	dict	Deploy	Fine-tune prefill/decode node split. `{"prefill_nodes":1,"decode_nodes":2}`. Default ~40% prefill, rest decode. Min 3 nodes total.
`vllm`	dict	Deploy	vLLM-specific config (tensor/pipeline parallelism, model, etc.).
`primus`	dict	Deploy	Primus-specific config. `config_path`, `cli_extra`, `backend`.
`secrets`	dict	Deploy	K8s only. Auto-converted to a K8s `Secret` and mounted as env vars. E.g. `{"HF_TOKEN":"hf_xxx"}`.
`tools`	list	Exec	Profiling/tracing tools. Each item: `{"name":"rocprofv3_compute"}`. Stackable. See Profiling tools.
`rocenv_mode`	string	Exec	`"lite"` (default) or `"full"`. Full mode runs lshw/dmidecode/dmesg/modinfo, installs missing tools per `guest_os`.
`pre_scripts`	list	Exec	Scripts to run inside the container before the model script.
`post_scripts`	list	Exec	Scripts to run inside the container after the model script.
`encapsulate_script`	string	Exec	Script prepended to the model run command (wraps the whole execution).
`log_error_pattern_scan`	bool	Exec	Set `false` to disable post-run log substring error detection. Useful when pytest/JUnit is authoritative.
`log_error_patterns`	list	Exec	Replace the default error patterns list entirely. Each string is matched as substring in log lines.
`log_error_benign_patterns`	list	Exec	Literal substrings that mark a matching log line as benign (not an error).
`env_vars`	dict	Deploy	Top-level env vars merged into deployment config (SLURM script / K8s job manifest).
`gen_sys_env_details`	bool	Exec	Enable/disable rocEnvTool system environment collection. Default: `true`.
`debug`	bool	Deploy	Enable debug-level logging in deployment templates.

SLURM sub-keys (`slurm` dict)

Key	Default (from preset)	Description
`partition`	`"amd-rccl"`	SLURM partition name.
`nodes`	`1`	Number of nodes to allocate.
`gpus_per_node`	`8`	GPUs per node.
`time`	`"24:00:00"`	Wall time limit (HH:MM:SS).
`exclusive`	`true`	Request exclusive node access.
`nodelist`	—	Pin to specific nodes. Also skips node health preflight check.
`exclude`	—	Nodes to exclude.
`constraint`	—	Node feature constraints.
`reservation`	—	SLURM reservation name. Forwarded to srun health/cleanup commands.
`qos`	—	Quality of service.
`account`	—	SLURM account for billing.
`modules`	`[]`	List of environment modules to load before job.
`output_dir`	CWD	Directory for SLURM log/output files.
`network_interface`	—	Network interface for NCCL/RCCL (e.g. `"ib0"`).
`shared_workspace`	—	Shared filesystem path accessible from all nodes.

Kubernetes sub-keys (`k8s` dict)

Key	Default	Description
`namespace`	`"default"`	Kubernetes namespace.
`gpu_count`	—	Number of GPUs per pod.
`gpu_resource_name`	`"amd.com/gpu"`	K8s GPU resource type. Auto-set by GPU-vendor preset.
`image_pull_policy`	`"Always"`	K8s imagePullPolicy.
`kubeconfig`	`"~/.kube/config"`	Path to kubeconfig.
`data_storage_class`	`"nfs-banff"`	Storage class for data PVC. Falls back to `nfs_storage_class` then `storage_class`.
`storage_class`	`"nfs-banff"`	Generic storage class fallback.
`memory`	`"64Gi"`	Container memory request.
`memory_limit`	`"128Gi"`	Container memory limit.
`cpu`	`"16"`	CPU request.
`cpu_limit`	`"32"`	CPU limit.
`host_ipc`	`false`	Enable hostIPC (needed for multi-node NCCL).
`backoff_limit`	`3`	K8s Job backoffLimit (retries).
`ttl_seconds_after_finished`	`null`	Auto-delete job after N seconds.
`recreate_shared_data_pvc`	`false`	Re-create data PVC even if it already exists.
`secrets.strategy`	`"from_local_credentials"`	How to load K8s image pull secrets.
`secrets.image_pull_secret_names`	`[]`	Existing K8s secret names to use as image pull secrets.

Model definition — `models.json`

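Each model definition lives in a models.json file (or is returned by get_models_json.py::list_models()). Fields map to the CustomModel dataclass in utils/discover_models.py. The // comments in the example below are annotations only; JSON does not allow comments, so omit them in a real file.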

{
  "name": "llama3-8b-train",          // Unique model identifier
  "dockerfile": "docker/Dockerfile.ubuntu.amd",
  "dockercontext": ".",               // Build context dir (relative to scripts dir)
  "scripts": "scripts/llama3/train.sh",
  "url": "https://github.com/org/repo",
  "cred": "hf_token",                 // Credential key from credential.json
  "owner": "ml-team",
  "data": "llama3-dataset",           // Data identifier for DataProvider
  "n_gpus": "8",                      // "-1" = all available; "0" = CPU-only
  "timeout": 14400,                   // Seconds; overridden by --timeout CLI flag
  "training_precision": "bf16",
  "tags": ["llama3", "training", "amd"],
  "args": "--batch-size 4 --seq-len 4096",
  "multiple_results": "results.csv",  // CSV file with multiple perf rows
  "skip_gpu_arch": "gfx908,gfx1100", // Comma-list of archs to skip this model on
  "additional_docker_run_options": "--shm-size 64g",
  "distributed": {
    "launcher": "torchrun",
    "nnodes": 2,
    "nproc_per_node": 8
  },
  "env_vars": {
    "HF_TOKEN": "auto",              // Injected into container env
    "DOCKER_IMAGE_NAME": "reg/img"   // Used by slurm_multi parallel pull
  }
}

Key field notes

Field	Notes
`n_gpus`	`"-1"` = use all GPUs on the host (`MAD_SYSTEM_NGPUS`). Positive int = that many GPUs. Used for perf CSV metadata.
`timeout`	Used when CLI `--timeout=-1` (default). Explicit CLI value always wins.
`skip_gpu_arch`	Comma-separated GPU arch names (e.g. `"gfx908,A100"`). Model is skipped if detected arch matches. Disable with `--disable-skip-gpu-arch`.
`multiple_results`	Path to CSV file (relative to model dir) with per-result rows that are appended to `perf.csv` individually.
`DOCKER_IMAGE_NAME` in `env_vars`	Required for `slurm_multi`: specifies the registry image for parallel `srun docker pull` on compute nodes. Also set automatically by `DockerBuilder` after a successful push.

Build manifest — `build_manifest.json`

Written by madengine build, consumed by madengine run. Pass with --manifest-file.

{
  "built_images": {
    "ci-llama3_Dockerfile.ubuntu.amd": {
      "docker_image": "registry.io/ml/ci-llama3:sha256-abc",
      "docker_sha":   "sha256:abc123",
      "build_duration": 183.4
    }
  },
  "built_models": {
    "ci-llama3_Dockerfile.ubuntu.amd": {
      "name":          "llama3-8b-train",
      "dockerfile":    "docker/Dockerfile.ubuntu.amd",
      "docker_image":  "ci-llama3_Dockerfile.ubuntu.amd",
      "docker_sha":    "sha256:abc123",
      "build_duration": 183.4,
      "scripts":       "scripts/llama3/train.sh",
      "args":          "--batch-size 4",
      "tags":          ["llama3","training"],
      "n_gpus":        "8",
      "timeout":       14400,
      "skip_gpu_arch": "",
      "multiple_results": "",
      "distributed":   {"launcher":"torchrun","nnodes":2,"nproc_per_node":8},
      "env_vars":      {"DOCKER_IMAGE_NAME":"registry.io/ml/ci-llama3:sha256-abc"},
      "built_on_compute": false
    }
  },
  "context": {
    "gpu_vendor": "AMD",
    "guest_os":   "UBUNTU",
    "docker_env_vars": {"MAD_GPU_VENDOR":"AMD","MAD_SYSTEM_NGPUS":"8"},
    "docker_build_arg": {}
  },
  "deployment_config": {
    "target":  "slurm",
    "slurm":   {"partition":"gpu","nodes":4,"gpus_per_node":8,"time":"24:00:00"},
    "distributed": {"launcher":"torchrun","nnodes":4,"nproc_per_node":8},
    "env_vars": {"NCCL_DEBUG":"WARN"},
    "debug": false
  },
  "summary": {"total":1,"success":1,"failed":0}
}

Merging at runtime: values in deployment_config are merged into the runtime context at startup. Keys in --additional-context take precedence over deployment_config.

Deployment target inference

No explicit deploy field needed. RunOrchestrator._infer_deployment_target() inspects the merged context:

Context condition	Target	Class	Path
`"k8s"` or `"kubernetes"` key present	Kubernetes	`KubernetesDeployment`	deployment/kubernetes.py
`"slurm"` key present	SLURM	`SlurmDeployment`	deployment/slurm.py
Neither	Local Docker	`ContainerRunner`	execution/container_runner.py
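
The inference table reduces to a few key checks. The sketch below is an assumption-laden illustration, not RunOrchestrator._infer_deployment_target() itself: the function names, the returned strings, the force_local parameter (standing in for --force-mirror-local), and the choice to test the Kubernetes keys before the SLURM key are all hypothetical.

```python
from typing import Any, Mapping


def infer_deployment_target(context: Mapping[str, Any],
                            force_local: bool = False) -> str:
    """Pick a target from the keys present in the merged context."""
    if force_local:
        # Mirrors the documented override: always run locally.
        return "local"
    if "k8s" in context or "kubernetes" in context:
        return "k8s"
    if "slurm" in context:
        return "slurm"
    return "local"


def uses_slurm_multi(context: Mapping[str, Any]) -> bool:
    """True when the launcher is slurm_multi (hyphenated alias accepted)."""
    launcher = str(context.get("distributed", {}).get("launcher", ""))
    return launcher.replace("-", "_") == "slurm_multi"


if __name__ == "__main__":
    ctx = {"slurm": {"nodes": 3}, "distributed": {"launcher": "slurm-multi"}}
    print(infer_deployment_target(ctx), uses_slurm_multi(ctx))
```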

Within SLURM deployment, if distributed.launcher == "slurm_multi" (or "slurm-multi"), SlurmDeployment.prepare() takes the slurm_multi path instead of generating the standard Jinja2 template.

Force local: use --force-mirror-local on madengine run to always use ContainerRunner even when slurm/k8s keys are in context.

SLURM deployment

Implemented in src/madengine/deployment/slurm.py. Generates an sbatch script from a Jinja2 template at src/madengine/deployment/templates/slurm/job.sh.j2.

Preset merge order

ConfigLoader.load_slurm_config() applies three layers (last wins):

presets/slurm/defaults.json — base defaults for all SLURM runs
presets/slurm/profiles/single-node.json or multi-node.json — profile selected by nodes count
User-supplied slurm / distributed / env_vars keys
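
A "last wins" layered merge can be sketched with a recursive dictionary merge. The helper names deep_merge and merge_layers are hypothetical, and treating lists as whole values that get replaced (rather than concatenated) is an assumption; ConfigLoader may handle them differently. The sample values are trimmed from the presets shown in this section.

```python
import copy
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where override wins; nested dicts merge recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists are replaced outright (assumption).
            merged[key] = copy.deepcopy(value)
    return merged


def merge_layers(*layers: Dict[str, Any]) -> Dict[str, Any]:
    """Apply layers left to right, so the last layer has the final say."""
    result: Dict[str, Any] = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result


if __name__ == "__main__":
    defaults = {"slurm": {"partition": "amd-rccl", "nodes": 1, "exclusive": True},
                "env_vars": {"OMP_NUM_THREADS": "8"}}
    profile = {"slurm": {"nodes": 2}, "env_vars": {"NCCL_DEBUG": "WARN"}}
    user = {"slurm": {"partition": "gpu", "nodes": 4}}
    print(merge_layers(defaults, profile, user))
```

Because each layer only overrides the keys it names, a user config that sets just partition and nodes still inherits exclusive and the preset environment variables.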

presets/slurm/defaults.json — base preset contents

{
  "gpu_vendor": "AMD",
  "guest_os": "UBUNTU",
  "debug": false,
  "slurm": {
    "partition": "amd-rccl",
    "nodes": 1,
    "gpus_per_node": 8,
    "time": "24:00:00",
    "exclusive": true,
    "modules": []
  },
  "distributed": {
    "backend": "nccl",
    "port": 29500
  },
  "env_vars": {
    "OMP_NUM_THREADS": "8",
    "MIOPEN_FIND_MODE": "1",
    "MIOPEN_USER_DB_PATH": "/tmp/.miopen"
  }
}

presets/slurm/profiles/multi-node.json — additional env vars for multi-node

{
  "slurm": {"nodes": 2, "gpus_per_node": 8, "time": "24:00:00"},
  "distributed": {"backend": "nccl", "port": 29500},
  "env_vars": {
    "NCCL_DEBUG": "WARN",
    "NCCL_DEBUG_SUBSYS": "INIT",
    "NCCL_IB_DISABLE": "0",
    "NCCL_SOCKET_IFNAME": "ib0",
    "TORCH_NCCL_HIGH_PRIORITY": "1",
    "GPU_MAX_HW_QUEUES": "8",
    "TORCH_NCCL_ASYNC_ERROR_HANDLING": "1",
    "NCCL_TIMEOUT": "1200",
    "HSA_ENABLE_SDMA": "0",
    "HSA_FORCE_FINE_GRAIN_PCIE": "1",
    "RCCL_ENABLE_HIPGRAPH": "0"
  }
}

What the SLURM job script does

Sets MASTER_ADDR via scontrol show hostnames, MASTER_PORT, WORLD_SIZE, NNODES
Sets per-node HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES / CUDA_VISIBLE_DEVICES (vLLM/SGLang: only HIP_VISIBLE_DEVICES)
Sets MIOPEN_USER_DB_PATH per-process: /tmp/.miopen/node_${SLURM_PROCID}_rank_${LOCAL_RANK:-0}
Sets TORCH_ELASTIC_RDZV_TIMEOUT=3600 for PyTorch elastic
Sets MAD_DEPLOYMENT_TYPE=slurm, MAD_SLURM_JOB_ID, MAD_NODE_RANK, MAD_IN_SLURM_JOB=1
Multi-node: generates per-node task script; runs via srun bash $TASK_SCRIPT
Single-node: creates synthetic manifest with deployment_config.target="docker" and calls madengine run

Node health preflight

SlurmNodeSelector runs a health-check srun before the main job. The check is skipped when slurm.nodelist is set. If slurm.reservation is set, it is forwarded to the srun health-check and cleanup commands.

Monitoring

Polls squeue every 30 seconds. Terminal states: COMPLETED, FAILED, CANCELLED — a scancel'd job will not loop forever.
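
The polling behaviour can be sketched as a loop that stops on any terminal state. Everything here is illustrative: wait_for_job is a hypothetical name, and the job-state lookup is injected as a callable (in practice it would wrap an squeue query) so the loop can be exercised without a cluster.

```python
import time
from typing import Callable

# Terminal states named in the monitoring description.
TERMINAL_STATES = frozenset({"COMPLETED", "FAILED", "CANCELLED"})


def wait_for_job(get_state: Callable[[], str],
                 poll_interval_s: float = 30.0,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the job reaches a terminal state, then return that state."""
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            # Treating CANCELLED as terminal is what stops an endless loop
            # after scancel.
            return state
        sleep(poll_interval_s)


if __name__ == "__main__":
    # Simulated state sequence; sleep is stubbed out so the demo is instant.
    states = iter(["PENDING", "RUNNING", "CANCELLED"])
    print(wait_for_job(lambda: next(states), sleep=lambda _seconds: None))
```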

SLURM inside existing allocation (salloc): if SLURM_JOB_ID is set and the launcher is slurm_multi, madengine runs the wrapper script directly with bash instead of nesting a new sbatch. Other launchers still submit via sbatch even inside salloc.

Kubernetes deployment

Implemented in src/madengine/deployment/kubernetes.py and 6 focused mixin modules (refactored in v2.0.3). Requires pip install -e ".[kubernetes]".

Mixin modules

Module	Concern
k8s_pvc.py	PVC lifecycle. Storage-class fallback: `data_storage_class` → `nfs_storage_class` → `storage_class`. Default: `"nfs-banff"`.
k8s_results.py	Log/artifact collection, perf aggregation. Uses shared `collector_pod_name()` helper — truncated `collector-{id[:15]}` to stay within K8s name limits.
k8s_scripts.py	Script extraction, ConfigMap building. Carries `rocenv_mode` and `guest_os` into the ConfigMap.
k8s_template_context.py	Assembles Jinja2 template context dict passed to `job.yaml.j2`.
kubernetes_launcher_mixin.py	Selects the right Jinja2 template per launcher type.
k8s_secrets.py	Converts `additional_context.secrets` dict to K8s `Secret` objects mounted as env vars.
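
The storage-class fallback chain described for k8s_pvc.py can be written as an ordered key lookup. This is a sketch under assumptions: resolve_data_storage_class is a hypothetical name, and treating empty strings as "not set" is a guess about the real behaviour.

```python
from typing import Any, Mapping

DEFAULT_STORAGE_CLASS = "nfs-banff"
# Keys are tried in this order; the first non-empty value wins.
FALLBACK_KEYS = ("data_storage_class", "nfs_storage_class", "storage_class")


def resolve_data_storage_class(k8s_config: Mapping[str, Any]) -> str:
    """Return the storage class to use for the data PVC."""
    for key in FALLBACK_KEYS:
        value = k8s_config.get(key)
        if value:
            return str(value)
    return DEFAULT_STORAGE_CLASS


if __name__ == "__main__":
    print(resolve_data_storage_class({"storage_class": "fast-ssd"}))
    print(resolve_data_storage_class({}))
```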

Preset merge order

ConfigLoader.load_k8s_config() applies five layers (last wins):

presets/k8s/defaults.json — base defaults
presets/k8s/gpu-vendors/amd.json or nvidia.json — GPU resource name
presets/k8s/gpu-vendors/amd-multi-gpu.json — AMD multi-GPU NCCL env vars (only if AMD + multi-GPU)
presets/k8s/profiles/single-gpu.json, multi-gpu.json, or multi-node.json
User config

presets/k8s/defaults.json — base preset contents

{
  "k8s": {
    "kubeconfig": "~/.kube/config",
    "namespace": "default",
    "image_pull_policy": "Always",
    "backoff_limit": 3,
    "ttl_seconds_after_finished": null,
    "nfs_storage_class": "nfs-banff",
    "storage_class": "nfs-banff",
    "data_storage_class": "nfs-banff",
    "recreate_shared_data_pvc": false,
    "secrets": {
      "strategy": "from_local_credentials",
      "image_pull_secret_names": [],
      "runtime_secret_name": null
    }
  },
  "env_vars": {"OMP_NUM_THREADS": "8"}
}

presets/k8s/gpu-vendors/amd-multi-gpu.json — AMD multi-GPU NCCL env vars

{
  "env_vars": {
    "NCCL_DEBUG": "WARN",
    "NCCL_IB_DISABLE": "0",
    "NCCL_SOCKET_IFNAME": "ib0",
    "TORCH_NCCL_HIGH_PRIORITY": "1",
    "GPU_MAX_HW_QUEUES": "8",
    "HSA_ENABLE_SDMA": "0",
    "MIOPEN_FIND_MODE": "1",
    "MIOPEN_USER_DB_PATH": "/tmp/.miopen",
    "HSA_FORCE_FINE_GRAIN_PCIE": "1",
    "RCCL_ENABLE_HIPGRAPH": "0"
  }
}

Known issue: in multi-node K8s jobs, a node may show FAILED in the results table even when the pod succeeded — this occurs when the kubelet returns 502 between job completion and log collection. PVC artifacts are still collected. Check kubectl describe pod <pod>.

Secrets management

# Pass secrets via additional_context
madengine run --tags llm-serve \
  --additional-context '{
    "k8s": {"namespace":"ml","gpu_count":8},
    "secrets": {"HF_TOKEN":"hf_xxx","WANDB_API_KEY":"yyy","S3_KEY":"zzz"}
  }'

Secrets in additional_context.secrets are auto-converted to a K8s Secret object and mounted as environment variables in the job pod. They are never written to perf.csv or build logs.

slurm_multi launcher (merged in v2.1.0)

What it is

An escape-hatch SLURM launcher for workloads that orchestrate their own per-node Docker containers via srun — for example SGLang Disaggregated (proxy + prefill + decode) or any topology that needs to call srun/scontrol from inside the job step.

Generates a wrapper SBATCH that runs the model's own .slurm (or .sh) script directly on the head node on baremetal — no outer container — so the workload can spawn its own per-node containers without nesting.

How to select it

{
  "slurm": {
    "partition": "gpu",
    "nodes": 3,
    "gpus_per_node": 8,
    "time": "02:00:00"
  },
  "distributed": {
    "launcher": "slurm_multi"
  }
}

Alias "slurm-multi" (hyphen) is also accepted and normalized automatically.

Build modes

Mode	Flag	Behavior
Use prebuilt image	`--use-image registry.io/img:tag`	Skip local build. Uses explicit image.
Auto-resolve from model card	`--use-image auto`	Reads `env_vars.DOCKER_IMAGE_NAME` from model card.
Build on compute	`--build-on-compute --registry reg.io/ml`	Builds on SLURM compute node, pushes to registry. Manifest sets `built_on_compute: true`. Run phase pulls in parallel on all nodes.
Implicit fallback	no flags	If model card has `DOCKER_IMAGE_NAME`, auto-uses it. Otherwise raises `ConfigurationError` listing options.

Execution paths

sbatch (default): the wrapper SBATCH script is submitted to SLURM. The head node runs srun docker pull on all nodes in parallel, then runs the model's script.
bash-in-salloc: if the SLURM_JOB_ID environment variable is set (that is, inside an existing salloc allocation), the launcher runs the wrapper synchronously with bash and sets DeploymentResult.skip_monitoring=True so the monitor poll is skipped.
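
A minimal sketch of the path selection described above. PathChoice and choose_execution_path are hypothetical names; only the SLURM_JOB_ID check and the skip_monitoring flag come from the text.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass
class PathChoice:
    command: List[str]     # how the wrapper script is started
    skip_monitoring: bool  # mirrors DeploymentResult.skip_monitoring


def choose_execution_path(
    wrapper_script: str, environ: Optional[Mapping[str, str]] = None
) -> PathChoice:
    env = os.environ if environ is None else environ
    if env.get("SLURM_JOB_ID"):
        # Already inside an allocation: run synchronously, nothing to poll.
        return PathChoice(["bash", wrapper_script], skip_monitoring=True)
    # No allocation yet: submit to the scheduler and let the monitor poll the job.
    return PathChoice(["sbatch", wrapper_script], skip_monitoring=False)


print(choose_execution_path("wrapper.sh", {"SLURM_JOB_ID": "12345"}))
```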

Results aggregation

_collect_slurm_multi_results() reads per-job CSV from /shared_inference/$USER/$JOBID/perf.csv and writes those rows into cwd/perf.csv (copy if absent, append data rows if present). This ensures display_performance_table and madengine report to-html find results without extra arguments.
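
The "copy if absent, append data rows if present" rule can be sketched as follows. This is an illustration, not the body of _collect_slurm_multi_results(); the function name merge_perf_csv is hypothetical, and it assumes both files share the same header row.

```python
import csv
import shutil
from pathlib import Path


def merge_perf_csv(job_csv: Path, dest_csv: Path) -> int:
    """Merge a per-job perf.csv into dest_csv and return the number of data rows added."""
    with job_csv.open(newline="") as f:
        rows = list(csv.reader(f))
    data_rows = rows[1:]  # everything after the header
    if not dest_csv.exists():
        # First result for this working directory: keep header and rows as-is.
        shutil.copyfile(job_csv, dest_csv)
        return len(data_rows)
    # Destination already has a header, so append data rows only.
    with dest_csv.open("a", newline="") as f:
        csv.writer(f, lineterminator="\n").writerows(data_rows)
    return len(data_rows)
```

Called once per finished job, this leaves a single header at the top of cwd/perf.csv regardless of how many jobs contributed rows.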

Local self-managed execution

When slurm_multi is detected in a non-SLURM context (e.g. local Docker mode), ContainerRunner._run_self_managed() runs the model's script directly on the host. Env vars from model card and additional_context are injected; keys are logged without values to avoid leaking credentials.
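
The keys-only logging rule might look like the sketch below. The function names and the logger name are hypothetical, and the precedence (additional_context overriding the model card) is an assumption rather than documented behavior.

```python
import logging
from typing import Dict, Mapping

logger = logging.getLogger("madengine.sketch")  # hypothetical logger name


def describe_injected(model_env: Mapping[str, str], context_env: Mapping[str, str]) -> str:
    """Build a log message that names injected variables but never their values."""
    keys = sorted(set(model_env) | set(context_env))
    return "Injected env vars: " + ", ".join(keys)


def build_script_env(
    host_env: Mapping[str, str],
    model_env: Mapping[str, str],
    context_env: Mapping[str, str],
) -> Dict[str, str]:
    env = dict(host_env)
    env.update(model_env)
    env.update(context_env)  # assumption: additional_context overrides the model card
    logger.info(describe_injected(model_env, context_env))
    return env


print(describe_injected({"MODEL": "llama3"}, {"HF_TOKEN": "hf_xxx"}))
```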

Docker `--build-context tools=` (v2.1.0)

What it does

Every docker build issued by DockerBuilder now passes --build-context tools=scripts/common/tools when that directory exists. Dockerfiles can pull shared helper scripts from the named context:

# In any model Dockerfile
COPY --from=tools rocm_smi/*.py /opt/mad/tools/rocm_smi/
COPY --from=tools gpu_info/*.py /opt/mad/tools/

Eliminates duplication of shared APIs across model Dockerfiles.

Conditional emission (PR #134)

The flag is only added when scripts/common/tools/ exists at build time. Builds in MAD projects without a tools directory do not receive the flag and will not fail.

Implementation: single guarded block in execution/docker_builder.py.
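
A guarded block of this kind could be sketched as below. The function docker_build_command and its parameters are hypothetical; only the flag, the directory path, and the "emit only if the directory exists" rule come from the text.

```python
from pathlib import Path
from typing import List


def docker_build_command(dockerfile: str, image: str, project_root: str = ".") -> List[str]:
    """Assemble a docker build command, adding the tools context only when available."""
    cmd = ["docker", "build", "-f", dockerfile, "-t", image]
    tools_dir = Path(project_root) / "scripts" / "common" / "tools"
    if tools_dir.is_dir():
        # Named build context: Dockerfiles can then use COPY --from=tools
        cmd += ["--build-context", f"tools={tools_dir}"]
    cmd.append(str(project_root))  # main build context
    return cmd


print(docker_build_command("docker/model.Dockerfile", "model:latest"))
```

Because the check happens when the command is assembled, a project without scripts/common/tools/ gets an ordinary docker build invocation.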

SLURM fix in same PR: switched from shlex.quote() to double-quote escaping in slurm.py env-var generation so spaces and paths in values survive correctly in the sbatch script.
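
For illustration, double-quote escaping of an env-var value for a bash script can be sketched like this. The helper names are hypothetical, and the set of escaped characters (backslash, double quote, dollar sign, backtick) is an assumption about what the generator needs; the actual code in slurm.py may escape a different set.

```python
def double_quote(value: str) -> str:
    """Wrap value in double quotes, escaping characters bash treats specially there."""
    # Backslash must be handled first so later escapes are not doubled.
    for ch in ("\\", '"', "$", "`"):
        value = value.replace(ch, "\\" + ch)
    return f'"{value}"'


def export_line(name: str, value: str) -> str:
    """Render one export statement for the generated sbatch script."""
    return f"export {name}={double_quote(value)}"


print(export_line("DATA_DIR", "/data/my models"))
# prints: export DATA_DIR="/data/my models"
```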

Launcher matrix

Launcher	Local	K8s	SLURM	Type	Notes
`torchrun`	✅	✅	✅	Train	DDP / FSDP, elastic rendezvous.
`megatron` / `megatron-lm`	✅	✅	✅	Train	TP + PP parallelism; sets TP/PP/CP size env vars.
`torchtitan`	✅	✅	✅	Train	FSDP2 + TP + PP + CP; Llama 3.1 8B–405B.
`deepspeed`	✅	✅	✅	Train	ZeRO, pipeline parallelism; dynamic hostfile from SLURM.
`vllm`	✅	✅	✅	Infer	PagedAttention; each node self-managing (no torchrun wrapper).
`sglang`	✅	✅	✅	Infer	RadixAttention, structured gen; each node self-managing.
`sglang_disagg`	❌	✅	✅	Infer	Disaggregated prefill/decode; min 3 nodes (1 proxy + ≥1P + ≥1D).
`primus`	✅	✅	✅	Train	Megatron / TorchTitan / MaxText via Primus YAML config.
`slurm_multi`	✅ (self-mgd)	❌	✅	Meta	Bypasses template; model's own SLURM script on head node.

Per-launcher configuration

torchrun: standard PyTorch distributed launcher. Generates: torchrun --nnodes=N --nproc_per_node=G --node_rank=R --master_addr=ADDR --master_port=PORT, where N is the node count and G is the number of processes (GPUs) per node.

{
  "slurm": {"partition":"gpu","nodes":4,"gpus_per_node":8,"time":"24:00:00"},
  "distributed": {
    "launcher": "torchrun",
    "nnodes": 4,
    "nproc_per_node": 8,
    "backend": "nccl",
    "port": 29500
  },
  "env_vars": {
    "NCCL_DEBUG": "WARN",
    "HSA_ENABLE_SDMA": "0",
    "TORCH_NCCL_ASYNC_ERROR_HANDLING": "1"
  }
}

Local: MAD_MULTI_NODE_RUNNER is set to torchrun --standalone --nproc_per_node=N (single-node only).
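
The two command shapes above can be assembled as in the following sketch. The function names are hypothetical; the torchrun flags are the ones quoted in this section.

```python
def torchrun_command(
    nnodes: int,
    nproc_per_node: int,
    node_rank: int,
    master_addr: str,
    master_port: int = 29500,
) -> str:
    """Multi-node form, one invocation per node."""
    return (
        f"torchrun --nnodes={nnodes} --nproc_per_node={nproc_per_node} "
        f"--node_rank={node_rank} --master_addr={master_addr} "
        f"--master_port={master_port}"
    )


def local_torchrun_command(nproc_per_node: int) -> str:
    """Single-node form used for local Docker runs."""
    return f"torchrun --standalone --nproc_per_node={nproc_per_node}"


print(torchrun_command(4, 8, 0, "node001"))
print(local_torchrun_command(8))
```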

megatron: uses torchrun under the hood and sets the TENSOR_MODEL_PARALLEL_SIZE, PIPELINE_MODEL_PARALLEL_SIZE, and CONTEXT_PARALLEL_SIZE env vars for the Megatron script to read.

{
  "slurm": {"partition":"gpu","nodes":8,"gpus_per_node":8,"time":"48:00:00"},
  "distributed": {
    "launcher": "megatron",
    "nnodes": 8,
    "nproc_per_node": 8
  },
  "env_vars": {
    "TENSOR_MODEL_PARALLEL_SIZE": "4",
    "PIPELINE_MODEL_PARALLEL_SIZE": "2",
    "CONTEXT_PARALLEL_SIZE": "1",
    "NCCL_IB_DISABLE": "0"
  }
}

torchtitan: FSDP2 + TP + PP + CP. Sets TORCHTITAN_TENSOR_PARALLEL_SIZE, TORCHTITAN_PIPELINE_PARALLEL_SIZE, TORCHTITAN_FSDP_ENABLED, TORCHTITAN_CONTEXT_PARALLEL_SIZE.

{
  "slurm": {"partition":"gpu","nodes":4,"gpus_per_node":8,"time":"24:00:00"},
  "distributed": {
    "launcher": "torchtitan",
    "nnodes": 4,
    "nproc_per_node": 8
  },
  "env_vars": {
    "TORCHTITAN_TENSOR_PARALLEL_SIZE": "2",
    "TORCHTITAN_FSDP_ENABLED": "true"
  }
}

deepspeed: DeepSpeed with dynamic SLURM hostfile generation. Generates: deepspeed --hostfile=/tmp/hostfile …

{
  "slurm": {
    "partition": "gpu",
    "nodes": 8,
    "gpus_per_node": 8,
    "time": "48:00:00",
    "reservation": "ml-priority"
  },
  "distributed": {
    "launcher": "deepspeed",
    "nnodes": 8,
    "nproc_per_node": 8,
    "backend": "nccl"
  },
  "env_vars": {
    "NCCL_DEBUG": "WARN",
    "HSA_ENABLE_SDMA": "0"
  }
}

vllm: each node runs independently (no torchrun). Sets VLLM_TENSOR_PARALLEL_SIZE, VLLM_PIPELINE_PARALLEL_SIZE, VLLM_DISTRIBUTED_BACKEND. Only HIP_VISIBLE_DEVICES is set (not ROCR_VISIBLE_DEVICES/CUDA_VISIBLE_DEVICES) to avoid conflict with Ray.

{
  "slurm": {"partition":"gpu","nodes":2,"gpus_per_node":8,"time":"12:00:00"},
  "distributed": {
    "launcher": "vllm",
    "nnodes": 2,
    "nproc_per_node": 8
  },
  "env_vars": {
    "VLLM_TENSOR_PARALLEL_SIZE": "8",
    "VLLM_PIPELINE_PARALLEL_SIZE": "2"
  }
}

AMD+Ray gotcha: RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES is automatically overridden to "" when HIP_VISIBLE_DEVICES is set, preventing the rocm/vllm image from ignoring GPU visibility.

sglang: standard SGLang (RadixAttention, structured generation). Each node is self-managing. Sets SGLANG_TENSOR_PARALLEL_SIZE, SGLANG_PIPELINE_PARALLEL_SIZE.

{
  "slurm": {"partition":"gpu","nodes":2,"gpus_per_node":8,"time":"06:00:00"},
  "distributed": {
    "launcher": "sglang",
    "nnodes": 2,
    "nproc_per_node": 8
  },
  "env_vars": {
    "SGLANG_TENSOR_PARALLEL_SIZE": "8"
  }
}

sglang_disagg: disaggregated prefill + decode topology. Minimum 3 nodes: 1 proxy + ≥1 prefill + ≥1 decode. Default node split: roughly 40% prefill, the rest decode.

{
  "slurm": {
    "partition": "gpu",
    "nodes": 5,
    "gpus_per_node": 8,
    "time": "04:00:00"
  },
  "distributed": {
    "launcher": "sglang_disagg",
    "nnodes": 5,
    "nproc_per_node": 8,
    "sglang_disagg": {
      "prefill_nodes": 2,
      "decode_nodes": 2
    }
  },
  "env_vars": {
    "SGLANG_TP_SIZE": "8"
  }
}

Sets: SGLANG_DISAGG_MODE, SGLANG_DISAGG_PREFILL_NODES, SGLANG_DISAGG_DECODE_NODES, SGLANG_DISAGG_TOTAL_NODES, SGLANG_NODE_IPS, SGLANG_NODE_RANK.
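
The node arithmetic for sglang_disagg can be sketched as below. The rounding rule, and the choice to apply the 40% share to the nodes left after the proxy, are assumptions for illustration; the real default split may be computed differently. The function name is hypothetical.

```python
from typing import Optional, Tuple


def split_disagg_nodes(
    total_nodes: int,
    prefill_nodes: Optional[int] = None,
    decode_nodes: Optional[int] = None,
    prefill_fraction: float = 0.4,
) -> Tuple[int, int, int]:
    """Return (proxy, prefill, decode) node counts."""
    if total_nodes < 3:
        raise ValueError("sglang_disagg needs at least 3 nodes")
    workers = total_nodes - 1  # one node is reserved for the proxy
    if prefill_nodes is None and decode_nodes is None:
        # Assumed default: about 40% of worker nodes, at least one of each role.
        prefill_nodes = min(max(1, round(workers * prefill_fraction)), workers - 1)
        decode_nodes = workers - prefill_nodes
    elif prefill_nodes is None or decode_nodes is None:
        raise ValueError("set both prefill_nodes and decode_nodes, or neither")
    if prefill_nodes < 1 or decode_nodes < 1 or prefill_nodes + decode_nodes != workers:
        raise ValueError("prefill + decode must equal total_nodes - 1, with at least 1 each")
    return 1, prefill_nodes, decode_nodes


print(split_disagg_nodes(5, prefill_nodes=2, decode_nodes=2))  # prints: (1, 2, 2)
```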

Config recipes

Complete working configurations for common scenarios.

Local — single GPU, AMD

madengine run --tags llama3 \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "docker_gpus": "0"
  }'

Local — all 8 GPUs, with Megatron env vars

madengine run --tags megatron-llama3 \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "docker_env_vars": {
      "TENSOR_MODEL_PARALLEL_SIZE": "4",
      "PIPELINE_MODEL_PARALLEL_SIZE": "2"
    }
  }'

SLURM — single node torchrun

cat > slurm-single.json <<'EOF'
{
  "slurm": {
    "partition": "amd-gpu",
    "nodes": 1,
    "gpus_per_node": 8,
    "time": "12:00:00",
    "exclusive": true
  },
  "distributed": {
    "launcher": "torchrun",
    "nnodes": 1,
    "nproc_per_node": 8
  }
}
EOF
madengine build --tags llama3 --registry registry.example.com/ml
madengine run --manifest-file build_manifest.json \
  --additional-context-file slurm-single.json

SLURM — 4-node DeepSpeed with reservation

cat > slurm-multi.json <<'EOF'
{
  "slurm": {
    "partition": "amd-gpu",
    "nodes": 4,
    "gpus_per_node": 8,
    "time": "24:00:00",
    "exclusive": true,
    "reservation": "ml-training-q1",
    "network_interface": "ib0"
  },
  "distributed": {
    "launcher": "deepspeed",
    "nnodes": 4,
    "nproc_per_node": 8,
    "backend": "nccl"
  },
  "env_vars": {
    "NCCL_IB_DISABLE": "0",
    "NCCL_SOCKET_IFNAME": "ib0",
    "NCCL_DEBUG": "WARN",
    "HSA_ENABLE_SDMA": "0"
  }
}
EOF
madengine run --manifest-file build_manifest.json \
  --additional-context-file slurm-multi.json

K8s — single pod, 4 AMD GPUs

madengine run --tags llama3-infer \
  --additional-context '{
    "k8s": {
      "namespace": "ml-team",
      "gpu_count": 4
    }
  }'

K8s — multi-node vLLM with HF secret

madengine run --tags vllm-llama3-70b \
  --additional-context '{
    "k8s": {
      "namespace": "ml-team",
      "gpu_count": 8,
      "host_ipc": true,
      "data_storage_class": "nfs-banff"
    },
    "distributed": {
      "launcher": "vllm",
      "nnodes": 2,
      "nproc_per_node": 8
    },
    "secrets": {"HF_TOKEN": "hf_xxxxxxx"},
    "env_vars": {
      "VLLM_TENSOR_PARALLEL_SIZE": "8",
      "VLLM_PIPELINE_PARALLEL_SIZE": "2"
    }
  }'

SLURM — SGLang Disagg (3 nodes: 1 proxy + 1P + 1D)

madengine build --tags pyt_sglang_disagg --use-image registry.io/sglang:v0.4

madengine run --manifest-file build_manifest.json \
  --additional-context '{
    "slurm": {
      "partition": "amd-gpu",
      "nodes": 3,
      "gpus_per_node": 8,
      "time": "04:00:00"
    },
    "distributed": {
      "launcher": "slurm_multi"
    }
  }'

Local run with ROCm compute profiling

madengine run --tags llama3 \
  --additional-context '{
    "gpu_vendor": "AMD",
    "tools": [
      {"name": "rocprofv3_compute"}
    ],
    "rocenv_mode": "full"
  }'

Stack multiple profilers:

  "tools": [
    {"name": "rocprofv3_compute"},
    {"name": "rccl_trace"},
    {"name": "gpu_info_power_profiler"}
  ]

Profiling & tracing tools

Enable via --additional-context '{"tools":[{"name":"…"}]}'. Tools are stackable — list multiple objects. Implemented in scripts/common/tools/ and execution/container_runner.py::apply_tools().

Do not combine rocm_trace_lite with rocprof / rocprofv3_* in the same run — they conflict at the kernel-tracing level.
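
A pre-flight check for this conflict could look like the sketch below. It is not part of madengine; the function name is hypothetical, and treating every tool whose name starts with rocm_trace_lite (including rocm_trace_lite_default) as conflicting is an assumption.

```python
from typing import Dict, List


def check_tool_conflicts(tools: List[Dict[str, str]]) -> List[str]:
    """Return the tool names, or raise if RTL and rocprof tools are combined."""
    names = [tool["name"] for tool in tools]
    rtl = [n for n in names if n.startswith("rocm_trace_lite")]
    rocprof = [n for n in names if n == "rocprof" or n.startswith("rocprofv3_")]
    if rtl and rocprof:
        raise ValueError(
            f"conflicting tools in one run: {', '.join(rtl)} vs {', '.join(rocprof)}"
        )
    return names


print(check_tool_conflicts([{"name": "rocprofv3_compute"}, {"name": "rccl_trace"}]))
```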

Tool name	Purpose	Output	Notes
`rocprof`	Legacy GPU kernel profiling	Kernel timings / occupancy CSVs	Use `rocprofv3_*` on ROCm ≥ 7.0
`rocprofv3_compute`	Compute-bound kernels	ALU, wave execution metrics	ROCm ≥ 7.0
`rocprofv3_memory`	Memory-bound workloads	Cache hits, bandwidth
`rocprofv3_communication`	Multi-GPU communication	RCCL traces
`rocprofv3_full`	Comprehensive (all metrics)	All counters	High overhead — short runs only
`rocprofv3_lightweight`	Minimal overhead tracing	HIP API + kernel traces
`rocprofv3_perfetto`	Perfetto UI traces	Perfetto JSON for ui.perfetto.dev
`rocprofv3_api_overhead`	API call timing	Per-API timing report
`rocprofv3_pc_sampling`	Kernel hotspot identification	PC sample histograms
`rocm_trace_lite`	RTL lite dispatch trace	`rocm_trace_lite_output/trace.db`	Pinned GitHub release wheel by default
`rocm_trace_lite_default`	RTL default mode	Same paths, broader coverage	v2.0.3+
`rocblas_trace`	rocBLAS call tracing	Per-library log
`miopen_trace`	MIOpen call tracing	Per-library log
`tensile_trace`	Tensile call tracing	Per-library log
`rccl_trace`	RCCL communication tracing	Per-library log
`gpu_info_power_profiler`	Power consumption over time	CSV time series
`gpu_info_vram_profiler`	VRAM usage over time	CSV time series
`therock_check`	TheRock ROCm stack validation	Detection report	Identifies apt vs TheRock install

rocm_trace_lite wheel control

Env var	Effect
`ROCM_TRACE_LITE_FOLLOW_LATEST=1`	Always pull the latest wheel from GitHub
`ROCM_TRACE_LITE_WHEEL_URL=https://…`	Use a specific wheel URL (air-gapped installs)

rocEnvTool modes

Mode (`rocenv_mode`)	Collects
`"lite"` (default)	Basic ROCm info, GPU topology, driver version
`"full"`	All of lite + lshw, dmidecode, dmesg, modinfo; best-effort installs missing tools per `guest_os`

ROCm path resolution

Implemented in src/madengine/utils/rocm_path_resolver.py and src/madengine/core/context.py. The host and in-container ROCm roots are resolved by two independent chains, each checked in the order listed.

Host path (build & tools)

MAD_ROCM_PATH in --additional-context
Auto-detect: /opt/rocm, versioned /opt/rocm-*, TheRock (rocm-sdk + markers)
rocminfo / amd-smi / rocm-smi location on PATH
ROCM_PATH environment variable
/opt/rocm fallback (with warning)

Set MAD_AUTO_ROCM_PATH=0 to disable scanning and use only env var / default.
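
Read as a priority chain, the host-side lookup can be sketched as follows. This is an illustration under assumptions, not the code in rocm_path_resolver.py: the function name, the scan callable (standing in for the /opt/rocm, /opt/rocm-* and TheRock probes), and deriving the root as the parent of a tool's bin directory are all hypothetical, as is the assumption that MAD_AUTO_ROCM_PATH=0 skips both the directory scan and the PATH lookup.

```python
import shutil
from pathlib import PurePosixPath
from typing import Callable, Iterable, Mapping, Optional, Tuple


def resolve_host_rocm_path(
    context: Mapping[str, str],
    environ: Mapping[str, str],
    scan: Callable[[], Iterable[str]] = lambda: (),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Tuple[str, str]:
    """Return (rocm_root, source) using the first step that yields a path."""
    if context.get("MAD_ROCM_PATH"):
        return context["MAD_ROCM_PATH"], "additional_context"
    if environ.get("MAD_AUTO_ROCM_PATH", "1") != "0":
        detected = next(iter(scan()), None)  # first install found by the scan
        if detected:
            return detected, "auto-detect"
        for tool in ("rocminfo", "amd-smi", "rocm-smi"):
            found = which(tool)
            if found:
                # <root>/bin/<tool> -> <root>
                return str(PurePosixPath(found).parent.parent), "PATH"
    if environ.get("ROCM_PATH"):
        return environ["ROCM_PATH"], "ROCM_PATH"
    return "/opt/rocm", "fallback"  # the real resolver also emits a warning here


print(resolve_host_rocm_path({"MAD_ROCM_PATH": "/opt/rocm-custom"}, {}))
```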

In-container path (AMD Docker runs)

docker_env_vars.MAD_ROCM_PATH in additional_context
ROCM_PATH / ROCM_HOME from image OCI config (docker image inspect)
In-image shell probe (docker run --rm image env)
/opt/rocm fallback with warning

The run-phase env table prints host vs container ROCm root, installation type (apt / therock / unknown), and version side-by-side.

renderD mapping: ROCm < 6.4.1 uses the legacy unique_id method; ROCm 6.4.1 and later use amd-smi node_id. The gpu_renderDs context key maps each GPU index to its /dev/dri/renderD number. The mapping guards against None entries on restricted ROCm installs.
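
The version cutoff is easiest to get right by comparing numeric tuples rather than strings. A minimal sketch, with a hypothetical function name and return labels chosen only for illustration:

```python
import re


def renderd_method(rocm_version: str) -> str:
    """Pick the renderD mapping method for a ROCm version string."""
    numbers = [int(n) for n in re.findall(r"\d+", rocm_version)[:3]]
    numbers += [0] * (3 - len(numbers))  # pad "7.0" to (7, 0, 0)
    return "amd-smi node_id" if tuple(numbers) >= (6, 4, 1) else "unique_id"


print(renderd_method("6.4.0"))
print(renderd_method("6.4.1"))
```

Tuple comparison orders versions component by component, which a plain string comparison does not do once a component has more than one digit.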

Environment variables

Read by madengine at runtime

Variable	Module	Purpose
`MAD_ROCM_PATH`	context.py	Override ROCm root on host. Priority 1.
`ROCM_PATH`	core/constants.py	Fallback ROCm root. Priority 3.
`MAD_AUTO_ROCM_PATH`	rocm_path_resolver	Set `0` to disable auto-scan.
`MODEL_DIR`	core/constants.py	Working directory for model scripts. Default: `.`
`MAD_VERBOSE_CONFIG`	core/constants.py	Enable verbose config output.
`MAD_SETUP_MODEL_DIR`	core/constants.py	Trigger model directory setup.
`MAD_SECRETS*`	context.py	Any env var with this prefix is automatically copied to `docker_build_arg` AND `docker_env_vars`.
`MAD_DOCKERHUB_USER`	build_orchestrator	Docker Hub username for registry auth.
`MAD_DOCKERHUB_PASSWORD`	build_orchestrator	Docker Hub password for registry auth.
`SLURM_JOB_ID`	slurm.py	Detect existing SLURM allocation (triggers bash-in-salloc for slurm_multi).
`SLURM_NNODES`, `SLURM_NPROCS`	container_runner	Read in SLURM job to resolve GPU count per node.
`NPROC_PER_NODE`, `GPUS_PER_NODE`	container_runner	Injected by SLURM template; read by ContainerRunner to set up docker run GPU args.
`MONGO_HOST`, `MONGO_PORT`	database/mongodb.py	MongoDB connection.
`MONGO_USER`, `MONGO_PASSWORD`	database/mongodb.py	MongoDB credentials.
`MONGO_AUTH_SOURCE`, `MONGO_TIMEOUT_MS`	database/mongodb.py	MongoDB auth source and timeout.
`NAS_NODES`	core/constants.py	NAS node config (JSON string).
`MAD_AWS_S3`	core/constants.py	AWS S3 credentials (JSON: `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, …).
`MAD_MINIO`	core/constants.py	MinIO credentials (JSON: `MINIO_ENDPOINT`, `AWS_ENDPOINT_URL_S3`, …).
`PUBLIC_GITHUB_ROCM_KEY`	core/constants.py	GitHub ROCm key (JSON).
`ROCM_TRACE_LITE_FOLLOW_LATEST`	tools	Set `1` to always pull latest RTL wheel.
`ROCM_TRACE_LITE_WHEEL_URL`	tools	Override RTL wheel URL (air-gapped installs).
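
The MAD_SECRETS* row above describes a prefix-based copy into two context keys. A minimal sketch of that behavior, with a hypothetical function name (the actual logic is in context.py):

```python
from typing import Any, Dict, List, Mapping


def propagate_secrets(environ: Mapping[str, str], context: Dict[str, Any]) -> List[str]:
    """Copy MAD_SECRETS* variables into both Docker build args and run-time env vars."""
    secrets = {k: v for k, v in environ.items() if k.startswith("MAD_SECRETS")}
    context.setdefault("docker_build_arg", {}).update(secrets)
    context.setdefault("docker_env_vars", {}).update(secrets)
    return sorted(secrets)  # key names only, safe to log


ctx: Dict[str, Any] = {}
copied = propagate_secrets({"MAD_SECRETS_HF": "hf_xxx", "HOME": "/root"}, ctx)
print(copied)  # prints: ['MAD_SECRETS_HF']
```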

Set by madengine in Docker containers

Variable	Set by	Value / source
`MAD_GPU_VENDOR`	context.py	`"AMD"` or `"NVIDIA"`
`MAD_SYSTEM_NGPUS`	context.py	Total GPU count on host
`MAD_SYSTEM_GPU_ARCHITECTURE`	context.py	GPU arch string (e.g. `"gfx90a"`)
`MAD_SYSTEM_HIP_VERSION`	context.py	HIP version string
`MAD_SYSTEM_GPU_PRODUCT_NAME`	context.py	GPU product name
`MAD_GUEST_OS`	container_runner	`"UBUNTU"` or `"CENTOS"`
`MAD_RUNTIME_NGPUS`	container_runner	GPU count allocated for this specific run
`MAD_MULTI_NODE_RUNNER`	container_runner	Distributed launcher command (e.g. `torchrun --standalone --nproc_per_node=8`)
`MAD_MODEL_NAME`	container_runner	Model name from model definition
`MAD_OUTPUT_CSV`	container_runner	Path for `multiple_results` CSV output
`ROCM_PATH`	container_runner	Resolved in-container ROCm root
`JENKINS_BUILD_NUMBER`	container_runner	CI build number (from shell env if set)
`RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES`	container_runner	Force-set to `""` when `HIP_VISIBLE_DEVICES` is active (AMD+Ray fix)

Set by SLURM job script (`job.sh.j2`)

Variable	Value
`MAD_DEPLOYMENT_TYPE`	`"slurm"`
`MAD_SLURM_JOB_ID`	SLURM job ID
`MAD_NODE_RANK`	This node's rank (0-indexed)
`MAD_TOTAL_NODES`	Total node count
`MAD_IN_SLURM_JOB`	`"1"`
`MAD_LAUNCHER_TYPE`	Launcher type string
`MASTER_ADDR`	Head node hostname (via scontrol)
`MASTER_PORT`	Communication port (default 29500)
`WORLD_SIZE`	Total GPU processes (nodes × GPUs/node)
`NNODES`	Node count
`GPUS_PER_NODE`	GPU count per node
`NODE_RANK`	This node's rank
`TORCH_ELASTIC_RDZV_TIMEOUT`	`3600`
`MIOPEN_USER_DB_PATH`	`/tmp/.miopen/node_${SLURM_PROCID}_rank_${LOCAL_RANK:-0}`
`HIP_VISIBLE_DEVICES`	GPU indices for this node's processes
`ROCR_VISIBLE_DEVICES`	GPU indices (not set for Ray-based launchers)
`CUDA_VISIBLE_DEVICES`	GPU indices (not set for Ray-based launchers)
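
The distributed-training variables in the table above follow from three inputs: node count, GPUs per node, and this node's rank. The sketch below shows the derivation in Python for illustration only; in madengine these values are exported by the job.sh.j2 template, and the function name here is hypothetical.

```python
from typing import Dict


def slurm_dist_env(
    nodes: int,
    gpus_per_node: int,
    node_rank: int,
    master_addr: str,
    master_port: int = 29500,
) -> Dict[str, str]:
    """Derive the rendezvous variables a launcher expects on each node."""
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        "NNODES": str(nodes),
        "GPUS_PER_NODE": str(gpus_per_node),
        "NODE_RANK": str(node_rank),
        "WORLD_SIZE": str(nodes * gpus_per_node),  # nodes × GPUs/node
    }


print(slurm_dist_env(4, 8, 0, "node001")["WORLD_SIZE"])  # prints: 32
```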

Error types

Defined in src/madengine/core/errors.py. All inherit from MADEngineError(Exception) which carries: message, category, context (ErrorContext dataclass), cause, recoverable, suggestions (list). Rich panels are used for display.

Class	Category	When raised
`ValidationError`	VALIDATION	Invalid CLI args, model field values, context key types.
`NetworkError`	CONNECTION	Registry connectivity, pull failures, MongoDB connection.
`AuthenticationError`	AUTHENTICATION	Registry login failure, invalid credentials format.
`ExecutionError`	RUNTIME	Container run failure, script non-zero exit, timeout. (`RuntimeError` is an alias.)
`BuildError`	BUILD	Docker build failure.
`DiscoveryError`	DISCOVERY	models.json parse failure, tag not found, no models matched.
`OrchestrationError`	ORCHESTRATION	Manifest load failure, incompatible build/run state.
`RunnerError`	RUNNER	ContainerRunner internal failure.
`ConfigurationError`	CONFIGURATION	slurm_multi registry gate violation, conflicting flags, missing required config.
`DeploymentTimeoutError`	TIMEOUT	SLURM/K8s job exceeded wall time.
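
The shape of the hierarchy can be sketched as below. Only the attribute names (message, category, context, cause, recoverable, suggestions) come from the description above; the ErrorContext fields, the enum values, and the constructor signature are hypothetical and may not match core/errors.py.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional


class ErrorCategory(Enum):
    CONFIGURATION = "configuration"
    TIMEOUT = "timeout"


@dataclass
class ErrorContext:
    # Hypothetical fields; the real dataclass may carry different ones.
    operation: str = ""
    details: Dict[str, Any] = field(default_factory=dict)


class MADEngineError(Exception):
    def __init__(
        self,
        message: str,
        category: ErrorCategory,
        context: Optional[ErrorContext] = None,
        cause: Optional[BaseException] = None,
        recoverable: bool = False,
        suggestions: Optional[List[str]] = None,
    ) -> None:
        super().__init__(message)
        self.message = message
        self.category = category
        self.context = context or ErrorContext()
        self.cause = cause
        self.recoverable = recoverable
        self.suggestions = suggestions or []


class ConfigurationError(MADEngineError):
    def __init__(self, message: str, **kwargs: Any) -> None:
        super().__init__(message, ErrorCategory.CONFIGURATION, **kwargs)


try:
    raise ConfigurationError(
        "no image source for slurm_multi",
        suggestions=["pass --use-image IMAGE", "pass --build-on-compute with --registry"],
    )
except MADEngineError as err:
    print(err.category.name, err.suggestions)
```

Catching the base class, as the last lines do, lets a caller handle every madengine error in one place while still branching on category.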

Module reference

Layer	Path	Contents
CLI	cli/app.py	Typer app, `cli_main` entry, `--version`, Rich traceback install.
CLI	cli/commands/build.py	`madengine build`: registry, batch, `--use-image`, `--build-on-compute`, mutex validation.
CLI	cli/commands/run.py	`madengine run`: manifest loading, all run flags, `--force-mirror-local`, `--cleanup-perf`.
CLI	cli/commands/discover.py	Model discovery command, scoped tag parsing.
CLI	cli/commands/report.py	`report to-html` / `to-email` sub-app.
CLI	cli/commands/database.py	MongoDB upload command.
CLI	cli/constants.py	`ExitCode` enum, `DEFAULT_MANIFEST_FILE`, `DEFAULT_PERF_OUTPUT`, `DEFAULT_TIMEOUT=-1`.
CLI	cli/validators.py	Argument validation: `validate_additional_context()`, `create_args_namespace()`.
Orch	orchestration/build_orchestrator.py	`BuildOrchestrator.execute()`: discover → context → build → registry gate → manifest. slurm_multi use-image / build-on-compute paths.
Orch	orchestration/run_orchestrator.py	`RunOrchestrator.execute()`: manifest loading, target inference, script copy/cleanup, local/distributed dispatch.
Orch	orchestration/image_filtering.py	Filters manifest entries by GPU vendor, GPU arch, `skip_gpu_arch` field.
Dep	deployment/factory.py	`DeploymentFactory.create()`. Registers `SlurmDeployment` + `KubernetesDeployment`. `UserWarning` if kubernetes package missing.
Dep	deployment/base.py	`BaseDeployment` (Template Method), `DeploymentConfig`, `DeploymentResult` (incl. `skip_monitoring`), `DeploymentStatus`, `PERFORMANCE_LOG_PATTERN`.
Dep	deployment/kubernetes.py	`KubernetesDeployment`: composes 6 mixins, orchestrates K8s job lifecycle.
Dep	deployment/k8s_pvc.py	PVC creation/deletion, storage-class fallback chain.
Dep	deployment/k8s_results.py	Log/artifact collection, perf aggregation, `collector_pod_name()`.
Dep	deployment/k8s_scripts.py	Script extraction, ConfigMap building (`rocenv_mode`, `guest_os`).
Dep	deployment/k8s_template_context.py	Assembles Jinja2 template context for K8s jobs.
Dep	deployment/k8s_secrets.py	`secrets` dict → K8s Secret objects.
Dep	deployment/k8s_names.py	Name truncation/sanitization helpers for K8s resource names.
Dep	deployment/kubernetes_launcher_mixin.py	Selects Jinja2 template per launcher; sets `MAD_MULTI_NODE_RUNNER` for K8s pods.
Dep	deployment/slurm.py	`SlurmDeployment`: template prep, sbatch submit, bash-in-salloc, slurm_multi dispatch, monitoring, results collection.
Dep	deployment/slurm_node_selector.py	`SlurmNodeSelector`: health/cleanup srun, `reservation` parameter, node preflight.
Dep	deployment/common.py	Shared helpers: `VALID_LAUNCHERS`, slurm_multi wrapper assembly, launcher normalization.
Dep	deployment/config_loader.py	`ConfigLoader`: deep-merge, preset loading, target inference. `env_vars` merged recursively (not replaced).
Dep	deployment/primus_backend.py	Primus YAML / backend selection helper.
Dep	deployment/presets/slurm/defaults.json	SLURM base preset.
Dep	deployment/presets/slurm/profiles/	`single-node.json`, `multi-node.json`.
Dep	deployment/presets/k8s/defaults.json	K8s base preset.
Dep	deployment/presets/k8s/gpu-vendors/	`amd.json`, `nvidia.json`, `amd-multi-gpu.json`.
Dep	deployment/presets/k8s/profiles/	`single-gpu.json`, `multi-gpu.json`, `multi-node.json`.
Dep	deployment/templates/slurm/job.sh.j2	Main sbatch template (~822 lines). Sets all SLURM env vars, runs srun task scripts.
Dep	deployment/templates/kubernetes/	K8s YAML templates: `configmap.yaml.j2`, `job.yaml.j2`, `pvc.yaml.j2`, `pvc-data.yaml.j2`, `service.yaml.j2`.
Exec	execution/container_runner.py	`ContainerRunner`: local docker run, AMD/NVIDIA run options, env injection, tools, perf parsing, `_run_self_managed()`, `_generate_local_launcher_command()`.
Exec	execution/container_runner_helpers.py	Log error pattern scan, `resolve_run_timeout()`, `make_run_log_file_path()`.
Exec	execution/docker_builder.py	`DockerBuilder`: build args, `--build-context tools=` (conditional), registry push, DOCKER_IMAGE_NAME injection into manifest.
Exec	execution/dockerfile_utils.py	Dockerfile parsing: GPU vendor from filename + FROM line.
Core	core/context.py	`Context`: `ast.literal_eval` parse, GPU vendor/arch detection, ROCm path resolution, `MAD_SECRETS*` propagation, renderD mapping.
Core	core/additional_context_defaults.py	Default values merged before user context: `DEFAULT_GPU_VENDOR="AMD"`, `DEFAULT_GUEST_OS="UBUNTU"`.
Core	core/console.py	`Console`: Rich-backed shell executor, live output, timeout, `secret=True` for credential commands.
Core	core/docker.py	`Docker` wrapper: `shlex.quote()` on every interpolation, auto stop/remove on `__del__`.
Core	core/errors.py	10-type error hierarchy, `ErrorCategory`, `ErrorContext`, `ErrorHandler`, Rich panel display.
Core	core/auth.py	`load_credentials()`, `login_to_registry()` using `--password-stdin` + `MAD_REGISTRY_PASSWORD`.
Core	core/timeout.py	`Timeout` context manager; guards `signal.alarm(None)` when seconds is 0/None.
Core	core/dataprovider.py	`Data` abstraction: local / NAS / S3 / MinIO.
Util	utils/discover_models.py	`DiscoverModels`: root, dir, dynamic discovery; scoped vs unscoped tags; `CustomModel` dataclass.
Util	utils/gpu_tool_factory.py	Singleton `get_gpu_tool_manager(vendor, rocm_path)`; auto-detects vendor.
Util	utils/gpu_validator.py	`GPUVendor` enum, `ROCmValidator`, `NVIDIAValidator`, `GPUValidationResult`.
Util	utils/rocm_path_resolver.py	Host + in-container ROCm path resolution chains.
Util	utils/therock_markers.py	Shared TheRock detection markers (rocm-sdk, layout probes).
Util	utils/config_parser.py	`ConfigParser`: 5-level config file resolution, CSV/JSON/YAML loading, multi-row result matching.
Util	utils/session_tracker.py	Session start/marker tracking.
Rep	reporting/update_perf_csv.py	Writes/appends `perf.csv` and `perf_entry.csv`. `PERF_CSV_HEADER` (28 columns).
Rep	reporting/csv_to_html.py	HTML performance report generation.
Rep	reporting/csv_to_email.py	Email-friendly consolidated report.
Rep	reporting/update_perf_super.py	Superset-shaped perf rollups.
DB	database/mongodb.py	`MongoDBConfig.from_env()`, `UploadOptions`, `UploadResult`; upsert + batch upload.
Scripts	scripts/common/pre_scripts/rocEnvTool/	`rocenv_tool.py`, `csv_parser.py`, `console.py` — TheRock-compatible env capture (lite + full modes).
Scripts	scripts/common/tools/	GPU info profilers, amd_smi / rocm_smi utils, rtl_trace wrapper, library tracers (rocblas, miopen, rccl, tensile).
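
The ConfigLoader row above notes that env_vars is merged recursively rather than replaced. A generic deep-merge sketch shows what that means in practice; it is not the ConfigLoader implementation, and the function name is hypothetical.

```python
from typing import Any, Dict, Mapping


def deep_merge(base: Mapping[str, Any], override: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a new dict where nested dicts are merged key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # recurse into nested dicts
        else:
            merged[key] = value  # scalars and lists are replaced
    return merged


preset = {"env_vars": {"NCCL_DEBUG": "WARN", "HSA_ENABLE_SDMA": "0"}}
user = {"env_vars": {"NCCL_DEBUG": "INFO"}}
print(deep_merge(preset, user))
# prints: {'env_vars': {'NCCL_DEBUG': 'INFO', 'HSA_ENABLE_SDMA': '0'}}
```

With a shallow merge the user's env_vars dict would replace the preset's entirely and HSA_ENABLE_SDMA would be lost.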

Test layout

unit/

Fast, isolated, mocked. Key files: test_slurm_multi.py, test_shell_quoting.py, test_error_handling.py, test_k8s.py, test_rocm_path.py, test_validators.py, test_deployment.py, test_container_runner.py.

integration/

Real Docker / GPU / platform calls. Includes test_docker_integration.py, test_container_execution.py, test_gpu_management.py, test_orchestrator_workflows.py, test_profiling_tools_config.py.

e2e/

Full workflows: test_build_workflows.py, test_run_workflows.py, test_profiling_workflows.py, test_data_workflows.py, test_execution_features.py, test_scripting_workflows.py.

Marker	What it selects
`unit`	Fast unit tests with no external deps
`integration`	Tests requiring Docker / real GPU calls
`e2e`	Full end-to-end workflow tests
`slow`	Long-running tests
`gpu`	Requires GPU hardware
`amd` / `nvidia`	Vendor-specific tests
`cpu`	CPU-only tests
`requires_docker`	Tests requiring Docker daemon
`requires_models`	Tests requiring model files to be present

Pytest config lives solely in [tool.pytest.ini_options] in pyproject.toml (minversion=7.0).

Contributing & code style

Style rules

Formatting: Black (line-length 88), targets py3.8–py3.11
Imports: isort with profile="black"; first-party = madengine
Lint: flake8 + mypy (strict equality, warn unused) + bandit (skips B101)
Docstrings: Google style; type hints required for public functions
Conventional commits: feat:, fix:, docs:, test:, refactor:, style:, perf:, chore:

Security rules

Use shlex.quote() on every shell interpolation of user-controlled values (image names, paths, container names, build-args)
Registry passwords via --password-stdin (not command-line args); env var MAD_REGISTRY_PASSWORD
Credential JSON must be a dict object — validated at load time (ConfigurationError on wrong type)
MIOPEN_USER_DB_PATH is filtered from deployment_config to prevent leaking temp paths
Never log secret values — log keys only

Changelog

[2.1.0] — 2026-05-28

Added

slurm_multi self-managed SLURM launcher (PRs #130, #126): alias slurm-multi, parallel docker pull, bash-in-salloc path, _run_self_managed() for local mode
madengine build --use-image [IMAGE|auto] — skip local build
madengine build --build-on-compute — build on compute node + push
slurm_multi registry gate with structured ConfigurationError
DeploymentResult.skip_monitoring for synchronous deploy paths
SlurmNodeSelector.reservation parameter
DockerBuilder: --build-context tools= (conditional on dir existence, PR #131 + #134)
Local MAD_MULTI_NODE_RUNNER via ContainerRunner._generate_local_launcher_command() (PR #126)
Model card distributed/slurm auto-merged into manifest deployment_config
DOCKER_IMAGE_NAME injection into manifest env_vars after successful registry push

Changed

SLURM env-var escaping: double-quote instead of shlex.quote to preserve spaces/paths (PR #134)
Early DiscoverModels result cached and reused for actual build (no duplicate get_models_json.py runs)
E2E test cleanup defaults include build_manifest.json + perf artefacts

[2.0.3] — 2026-05-26

rocEnvTool "full" mode (lshw, dmidecode, dmesg, modinfo)
K8s monolith decomposed into 6 focused mixin modules
Generic storage_class fallback; default preset nfs-banff
rocm_trace_lite_default tool (RTL default mode)
Security: shlex.quote() on every shell interpolation
Collector pod name mismatch fix (shared collector_pod_name() helper)
CANCELLED added to terminal-state set
Local MAD_MULTI_NODE_RUNNER for Docker local (_generate_local_launcher_command())

[2.0.2] / [2.0.1]

Host ROCm auto-detection via priority chain; in-container ROCm resolved independently
TheRock (rocm-sdk) layout support
GPU arch auto-detection injected into Docker build args
Model discovery: scope-based tag selection replaces strict flag
Registry password via --password-stdin + env var
credential.json type validation
Unified PERFORMANCE_LOG_PATTERN across local + deployment paths
Run-phase host/container env table printed at startup

[2.0.0] — 2026-04-09 — Complete rewrite

Unified madengine CLI; legacy mad-* removed
5-layer architecture (CLI / Orchestration / Deployment / Execution / Core)
Factory + Template Method patterns; DeploymentFactory, BaseDeployment, ConfigLoader
Multi-target deployment: presets + Jinja2 templates per launcher
Launcher matrix: torchrun / DeepSpeed / Megatron / TorchTitan / Primus / vLLM / SGLang
Log error pattern scanning; --skip-model-run; batch build manifest
Structured errors (10 types) with Rich panels; fixed exit codes
SLURM nodelist pinning; K8s Secrets management; data provider abstraction

madengine wiki · v2.1.0 (2026-05-28) · branch develop · PRs #126 #130 #131 #133 #134
Structured as a single self-contained HTML file for easy sharing — no server needed, open directly in a browser. Inspired by the Claude Code HTML blog post: richer information density than Markdown, tabbed navigation, live filters, easy to share via one URL. Print works (sidebar hidden via CSS media query).
