madengine — Codebase Wiki

AI/ML model automation & benchmarking platform for local Docker, Kubernetes, and SLURM. A Typer-based CLI that discovers models, builds Docker images, runs them across compute targets, and writes structured performance results.

Entry point: src/madengine/cli/app.py::cli_main → console script madengine registered in pyproject.toml.

v2.1.0 (2026-05-28) · Python ≥ 3.8 · 5-layer CLI · Local, K8s, SLURM, slurm_multi · Typer + Rich · ROCm & CUDA · Jinja2 templates

Overview

What madengine does

  1. Discover — finds model definitions from models.json or dynamic scripts, resolves tags
  2. Build — calls docker build for each model, writes build_manifest.json
  3. Run — reads manifest, infers compute target, dispatches containers, writes perf.csv
  4. Report — converts perf.csv to HTML or email; uploads to MongoDB

All four stages share a single --additional-context configuration spine that controls GPU vendor, deployment type, launcher, profiling tools, and environment variables.

What's new in v2.1.0

  • slurm_multi — self-managed multi-node SLURM launcher for workloads with per-node Docker (e.g. SGLang Disagg)
  • --use-image [auto] / --build-on-compute — new madengine build modes
  • Docker --build-context tools= — shared tool APIs accessible in every Dockerfile
  • Local MAD_MULTI_NODE_RUNNER — Megatron / DeepSpeed / TorchTitan now work on local Docker
  • SLURM env-var escaping — double-quote escaping preserves spaces & paths

Quick start

# 1. Install
pip install -e ".[dev]"

# 2. Discover available models
madengine discover --tags dummy

# 3. Build + run (single command)
madengine run --tags dummy \
  --additional-context '{"gpu_vendor":"AMD","guest_os":"UBUNTU"}'

# 4. Build only, then run from manifest
madengine build --tags llama3 --registry registry.example.com/ml
madengine run --manifest-file build_manifest.json \
  --additional-context '{"docker_gpus":"0,1,2,3"}'

Local mode: no k8s or slurm key in context → ContainerRunner (local Docker).

# Single-node K8s (minimal — defaults applied from presets/k8s/)
madengine run --tags llama3 \
  --additional-context '{"k8s":{"gpu_count":4}}'

# Multi-node vLLM on K8s
madengine run --tags vllm-serve \
  --additional-context '{
    "k8s": {"namespace":"ml-team","gpu_count":8},
    "distributed": {"launcher":"vllm","nnodes":2,"nproc_per_node":4}
  }'

# K8s with NFS data PVC and secrets
madengine run --tags model \
  --additional-context '{
    "k8s": {"namespace":"ml","gpu_count":8,"data_storage_class":"nfs-banff"},
    "secrets": {"HF_TOKEN":"hf_xxx","WANDB_API_KEY":"yyy"}
  }'

Presence of "k8s" or "kubernetes" key → KubernetesDeployment. Requires pip install -e ".[all]".

# Single-node SLURM (build on login node, deploy via sbatch)
madengine build --tags llama3 --registry registry.example.com/ml
madengine run --manifest-file build_manifest.json \
  --additional-context '{
    "slurm": {"partition":"gpu","nodes":1,"gpus_per_node":8,"time":"12:00:00"}
  }'

# Multi-node torchrun
madengine run --manifest-file build_manifest.json \
  --additional-context '{
    "slurm": {"partition":"gpu","nodes":4,"gpus_per_node":8,"time":"24:00:00"},
    "distributed": {"launcher":"torchrun","nnodes":4,"nproc_per_node":8}
  }'

# DeepSpeed with reservation
madengine run --manifest-file build_manifest.json \
  --additional-context '{
    "slurm": {"partition":"gpu","nodes":8,"gpus_per_node":8,
              "time":"48:00:00","reservation":"ml-training"},
    "distributed": {"launcher":"deepspeed","nnodes":8,"nproc_per_node":8}
  }'

Presence of "slurm" key → SlurmDeployment. Generates sbatch wrapper from Jinja2 template.

# SGLang Disaggregated (3+ nodes: proxy + prefill + decode)
madengine run --tags pyt_sglang_disagg_qwen3-32b \
  --additional-context '{
    "slurm": {"partition":"gpu","nodes":3,"gpus_per_node":8,"time":"02:00:00"},
    "distributed": {"launcher":"slurm_multi"}
  }'

# Build options for slurm_multi models:
# Option A — use pre-built registry image (skip local build)
madengine build --tags pyt_sglang_disagg --use-image registry.io/sglang:latest

# Option B — auto-resolve DOCKER_IMAGE_NAME from model card
madengine build --tags pyt_sglang_disagg --use-image auto

# Option C — build on compute node, push, then run pulls in parallel
madengine build --tags pyt_sglang_disagg \
  --registry registry.io/ml --build-on-compute

slurm_multi bypasses the standard sbatch template: the model's own .slurm script runs directly on the head node so srun/scontrol work inside it.

# Store configuration in a JSON file and reference it
cat > my_run.json <<'EOF'
{
  "gpu_vendor": "AMD",
  "guest_os": "UBUNTU",
  "slurm": {
    "partition": "gpu",
    "nodes": 4,
    "gpus_per_node": 8,
    "time": "24:00:00",
    "exclusive": true
  },
  "distributed": {
    "launcher": "torchrun",
    "nnodes": 4,
    "nproc_per_node": 8,
    "backend": "nccl"
  },
  "env_vars": {
    "NCCL_DEBUG": "WARN",
    "HSA_ENABLE_SDMA": "0"
  },
  "tools": [{"name": "rocprofv3_compute"}]
}
EOF

madengine run --tags llama3 --additional-context-file my_run.json

--additional-context-file and --additional-context are mutually exclusive. The file is parsed as JSON (not ast.literal_eval).

Install & dev

Setup

# Base install (includes dev tools)
pip install -e ".[dev]"

# With Kubernetes support
pip install -e ".[all]"

# Enable pre-commit hooks
pre-commit install

Optional extras

Extra | Adds
[dev] | pytest, black, flake8, mypy, isort, pre-commit
[kubernetes] | kubernetes>=28.0.0, pyyaml
[all] | dev + kubernetes

Test & quality

pytest                           # all tests
pytest tests/unit/ -v            # unit only
pytest tests/unit/test_slurm_multi.py -v
pytest --cov=src/madengine --cov-report=html
pytest -m "not slow"             # skip slow tests
pytest -m "unit and amd"         # combined markers

black src/ tests/
isort src/ tests/
flake8 src/ tests/
mypy src/madengine
pre-commit run --all-files

5-layer architecture

Each layer talks only to the layers below it. Layers are color-coded throughout this wiki.

Layer | Path | Responsibilities | Key types
CLI | src/madengine/cli/ | Typer app, 5 commands, argument validation, Rich output, exit-code mapping. | app.py, commands/{build,run,discover,report,database}.py, constants.ExitCode
Orchestration | src/madengine/orchestration/ | Discover → build → run pipeline. Decides whether to dispatch locally or to a deployment backend. | BuildOrchestrator, RunOrchestrator, image_filtering.py
Deployment | src/madengine/deployment/ | Factory + Template Method pattern. K8s/SLURM concrete deployments, preset merging, Jinja2 templates, monitoring. | DeploymentFactory, BaseDeployment, KubernetesDeployment, SlurmDeployment, ConfigLoader
Execution | src/madengine/execution/ | Local Docker build/run, log scanning, timeout resolution, perf parsing, self-managed launcher bypass. | ContainerRunner, DockerBuilder, container_runner_helpers
Core | src/madengine/core/ | Cross-cutting primitives: context merging & GPU detection, shell execution, Docker wrapper, error hierarchy, auth, timeout. | Context, Console, Docker, MADEngineError, load_credentials
Utils | src/madengine/utils/ | Model discovery, GPU vendor abstraction, ROCm path resolution, config parsing. | DiscoverModels, gpu_tool_factory, rocm_path_resolver, ConfigParser
Reporting | src/madengine/reporting/ | perf.csv writers, HTML/email report generation. Database upload in src/madengine/database/. | update_perf_csv, csv_to_html, csv_to_email, mongodb.py

Architecture diagram

CLI (Typer + Rich)
  Commands: discover · build · run · report · database
  ExitCode: SUCCESS=0 · BUILD_FAILURE=2 · RUN_FAILURE=3 · INVALID_ARGS=4

Orchestration
  BuildOrchestrator: DiscoverModels → DockerBuilder → build_manifest.json
  RunOrchestrator: load manifest → merge context → infer target → dispatch
  image_filtering: GPU arch / vendor tag selection

Deployment (DeploymentFactory, target inferred from context keys)
  no k8s/slurm → local · "k8s"/"kubernetes" → K8s Jobs · "slurm" → SLURM sbatch · distributed.launcher="slurm_multi" → self-managed
  Local (ContainerRunner): docker run + perf.csv
  KubernetesDeployment: K8s Jobs, PVCs, Secrets
  SlurmDeployment: sbatch · Jinja2 template
  slurm_multi (2.1.0): head-node script + srun pull

Core: Context · Console · Docker · MADEngineError · auth · timeout
Utils: DiscoverModels · gpu_tool_factory · rocm_path_resolver · ConfigParser
Reporting: perf.csv · perf_entry.csv · csv_to_html · csv_to_email
Database: MongoDB upload · MongoDBConfig.from_env()

Key data flows

Build flow

  1. madengine build → BuildOrchestrator.execute()
  2. Context(build_only_mode=True) — GPU vendor / arch detection skipped unless detect_local_gpu_arch=True
  3. ConfigLoader.load_config() fills in preset defaults (SLURM or K8s) beneath the user config; user-supplied values win
  4. DiscoverModels resolves --tags from root models.json, scripts/{dir}/models.json, or scripts/{dir}/get_models_json.py
  5. slurm_multi gate: if model uses slurm_multi and no --registry/--use-image given → auto-resolves DOCKER_IMAGE_NAME from model card or raises ConfigurationError
  6. DockerBuilder.build_all_models() — passes --build-context tools=scripts/common/tools if that dir exists
  7. After registry push: sets DOCKER_IMAGE_NAME in manifest env_vars for parallel SLURM pull
  8. Writes build_manifest.json

Run flow

  1. madengine run → RunOrchestrator.execute()
  2. If manifest exists: skip build; else trigger _build_phase()
  3. Context(build_only_mode=False) — full GPU detection, ROCm path resolution
  4. _load_and_merge_manifest() — runtime context overrides manifest deployment_config
  5. Target inference: "k8s"/"kubernetes" → K8s · "slurm" → SLURM · neither → local
  6. _copy_scripts() — populates scripts/common/{pre_scripts,post_scripts,tools} from madengine package
  7. Dispatch: ContainerRunner (local) or DeploymentFactory.create() (SLURM/K8s)
  8. Results → perf.csv / perf_entry.csv
  9. _cleanup_model_dir_copies() — removes populated scripts/common/ files

SLURM job flow (inside sbatch)

  1. sbatch script sets MASTER_ADDR (via scontrol), WORLD_SIZE, NNODES, node-local GPU visibility
  2. Multi-node: generates a task script per node; runs via srun bash $TASK_SCRIPT — each node calls madengine run with local manifest
  3. Single-node: creates local manifest with deployment_config.target="docker", calls madengine run
  4. Each node's madengine run → ContainerRunner → docker run, with SLURM env vars injected
  5. Results collected from per-node perf.csv and aggregated

CLI — discover

Lists and validates model definitions without building or running.

madengine discover [OPTIONS]

  --tags TEXT              Comma-separated tags/names to filter  [required]
  --verbose / --no-verbose Show full model JSON  [default: no-verbose]

Tag syntax

Pattern | Example | Meaning
Simple tag | --tags llama3 | Any model with tag llama3
Multiple tags | --tags llama3,vllm | Any model matching any listed tag
All models | --tags all | Every discovered model
Scoped (exact dir) | --tags MAD/llama3 | Only from scripts/MAD/ subdirectory
Dynamic + args | --tags dummy3:dummy_3:batch=512 | Dynamic model with arg override

Discovery sources (checked in order per directory)

  1. Root models.json
  2. scripts/{dir}/models.json (static list)
  3. scripts/{dir}/get_models_json.py — dynamic; must export list_models() → List[CustomModel]

CLI — build

Builds Docker images for discovered models and writes build_manifest.json.

madengine build [OPTIONS]

  --tags TEXT                    Tags to select models (mutually exclusive with --batch-manifest)
  --batch-manifest FILE          JSON file of multiple tag groups to build in sequence
  --registry TEXT                Push built images to this registry URL
  --target-archs TEXT            Comma-separated GPU arch list (e.g. "gfx90a,gfx942")
  --use-image [IMAGE|auto]       Skip local build; use named image or auto-resolve from model card
  --build-on-compute             Build on SLURM compute node + push (requires --registry)
  --additional-context TEXT      Python dict / JSON string of context overrides
  --additional-context-file FILE Path to a JSON context file (mutually exclusive with --additional-context)
  --clean-docker-cache           Pass --no-cache to docker build
  --manifest-output FILE         Output path for build_manifest.json  [default: build_manifest.json]
  --summary-output FILE          Output path for build summary JSON
  --live-output / --no-live-output   Stream docker build output line by line  [default: no-live-output]
  --verbose / --no-verbose

Option constraints (mutual exclusions and requirements):
  • --batch-manifest vs --tags
  • --use-image vs --registry
  • --use-image vs --build-on-compute
  • --build-on-compute requires --registry
  • --additional-context-file vs --additional-context
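
The constraints listed above can be checked before any build work starts. The following is a minimal sketch of such a check, not madengine's actual validator: the function name check_build_options and its keyword arguments are hypothetical and simply mirror the flag names.

```python
from typing import List, Optional


def check_build_options(
    tags: Optional[str] = None,
    batch_manifest: Optional[str] = None,
    registry: Optional[str] = None,
    use_image: Optional[str] = None,
    build_on_compute: bool = False,
    additional_context: Optional[str] = None,
    additional_context_file: Optional[str] = None,
) -> List[str]:
    """Return constraint violations; an empty list means the flags are compatible.

    Hypothetical helper that encodes only the rules documented for madengine build.
    """
    errors: List[str] = []
    # Each entry: (flag name, value, conflicting flag name, conflicting value).
    exclusive_pairs = [
        ("--batch-manifest", batch_manifest, "--tags", tags),
        ("--use-image", use_image, "--registry", registry),
        ("--use-image", use_image, "--build-on-compute", build_on_compute),
        ("--additional-context-file", additional_context_file,
         "--additional-context", additional_context),
    ]
    for left_name, left, right_name, right in exclusive_pairs:
        if left and right:
            errors.append(f"{left_name} cannot be combined with {right_name}")
    # The one "requires" rule in the list.
    if build_on_compute and not registry:
        errors.append("--build-on-compute requires --registry")
    return errors
```

A CLI layer would presumably map a non-empty result to the INVALID_ARGS exit code; that mapping is an assumption here.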

--use-image modes

Invocation | Behavior
--use-image auto | Reads DOCKER_IMAGE_NAME from model card env_vars
--use-image registry.io/img:tag | Uses the explicit image name; skips all Docker build steps

CLI — run

Runs models from a manifest (build if needed) and writes perf.csv.

madengine run [OPTIONS]

  --tags TEXT                    Select models (triggers build if no manifest)
  --manifest-file FILE           Use existing manifest; skip build  [default: build_manifest.json]
  --registry TEXT                Registry for image pull auth
  --timeout INT                  Seconds per model; -1=7200s default, 0=disabled
  --additional-context TEXT      Python dict or JSON string
  --additional-context-file FILE JSON file (mutually exclusive with --additional-context)
  --keep-alive                   Leave container running after model completes
  --keep-model-dir               Do not clean up model directory copy
  --clean-docker-cache           Remove docker image before pull (SLURM mode)
  --skip-model-run               Build/pull only; skip execution
  --manifest-output FILE
  --summary-output FILE
  --live-output / --no-live-output  Stream container output  [default: no-live-output]
  --output FILE                  Redirect container stdout to file
  --tools-json-file-name FILE    Tools config  [default: ./scripts/common/tools.json]
  --generate-sys-env-details / --no-generate-sys-env-details
  --force-mirror-local           Force ContainerRunner even in SLURM/K8s context
  --disable-skip-gpu-arch        Ignore skip_gpu_arch model field
  --cleanup-perf                 Remove existing perf.csv before run
  --verbose / --no-verbose

Timeout resolution

Value | Resolved timeout
-1 (default) | 7200 s (2 hours), unless the model card sets timeout
0 | Disabled (no timeout)
model card timeout field | Used when the CLI value is the default (-1)
Explicit positive int | That many seconds; overrides the model card
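
The precedence in the table can be written as one small function. This is a sketch under stated assumptions: resolve_timeout and DEFAULT_TIMEOUT_S are hypothetical names, and negative CLI values other than -1 are treated like the default because the table does not cover them.

```python
from typing import Optional

DEFAULT_TIMEOUT_S = 7200  # documented fallback: 2 hours


def resolve_timeout(cli_timeout: int = -1,
                    model_timeout: Optional[int] = None) -> Optional[int]:
    """Return the effective timeout in seconds, or None when timeouts are disabled."""
    if cli_timeout == 0:
        return None  # 0 disables the timeout
    if cli_timeout > 0:
        return cli_timeout  # explicit CLI value overrides the model card
    # CLI left at the default (-1): the model card may supply its own value.
    if model_timeout is not None and model_timeout > 0:
        return model_timeout
    return DEFAULT_TIMEOUT_S
```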

CLI — report & database

report

# Convert perf.csv to HTML
madengine report to-html --csv-file perf.csv

# Generate consolidated email report
madengine report to-email \
  --directory ./results \
  --output run_results.html

Source: cli/commands/report.py → reporting/csv_to_html.py, reporting/csv_to_email.py

database

madengine database \
  --csv-file perf.csv \
  --database-name benchmarks \
  --collection-name runs

Reads from env: MONGO_HOST, MONGO_PORT, MONGO_USER, MONGO_PASSWORD, MONGO_AUTH_SOURCE, MONGO_TIMEOUT_MS.

Source: cli/commands/database.py → database/mongodb.py

Exit codes (CI contract)

Defined in src/madengine/cli/constants.py::ExitCode. Use these in CI pipelines instead of log scraping.

Code | Name | Meaning
0 | SUCCESS | All operations succeeded.
1 | FAILURE | General / unhandled failure (keyboard interrupt, unexpected exception).
2 | BUILD_FAILURE | One or more Docker image builds failed.
3 | RUN_FAILURE | One or more model runs failed. Results still written to perf.csv with STATUS=FAILURE.
4 | INVALID_ARGS | Argument validation rejected the invocation.

In Jenkins, use madengine run … 2>&1 | tee madengine.log with bash -o pipefail so tee doesn't swallow the exit code.
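
A CI wrapper can branch on these codes directly. The sketch below mirrors the table in a local IntEnum rather than importing madengine's own ExitCode; describe_exit and should_publish_results are hypothetical helper names.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Local mirror of the documented exit codes (not imported from madengine)."""
    SUCCESS = 0
    FAILURE = 1
    BUILD_FAILURE = 2
    RUN_FAILURE = 3
    INVALID_ARGS = 4


def describe_exit(returncode: int) -> str:
    """Map a process return code to its documented name."""
    try:
        return ExitCode(returncode).name
    except ValueError:
        return f"UNKNOWN({returncode})"


def should_publish_results(returncode: int) -> bool:
    """perf.csv is documented as written on success and on RUN_FAILURE."""
    return returncode in (ExitCode.SUCCESS, ExitCode.RUN_FAILURE)
```

In a pipeline, the return code would typically come from subprocess.run([...]).returncode for the madengine invocation.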

additional_context — configuration spine

--additional-context takes a Python dict string (parsed with ast.literal_eval, not json.loads); --additional-context-file takes a JSON file. In both cases the resulting dict is deep-merged into Context.ctx alongside system-detected values.

Gotcha — Python dict, not JSON: pass '{"key":"val"}' (valid JSON is also valid Python) or "{'key':'val'}". Do not use True/False as unquoted Python booleans in shell — shell expansion will fail. Use true/false (JSON) or single-quote the whole argument.
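
The difference between the two parsers is easy to check with the standard library alone. This sketch only shows how ast.literal_eval and json.loads behave on the same strings; it does not reproduce madengine's own parsing code, and the helper names are hypothetical.

```python
import ast
import json


def parses_as_python_literal(text: str) -> bool:
    """True if ast.literal_eval accepts the string."""
    try:
        ast.literal_eval(text)
        return True
    except (ValueError, SyntaxError):
        return False


def parses_as_json(text: str) -> bool:
    """True if json.loads accepts the string."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


samples = [
    '{"gpu_vendor": "AMD"}',   # accepted by both parsers
    "{'gpu_vendor': 'AMD'}",   # Python literal only (single quotes)
    '{"debug": True}',         # Python literal only (capitalized boolean)
    '{"debug": true}',         # JSON only (lowercase boolean)
]
for sample in samples:
    print(sample, parses_as_python_literal(sample), parses_as_json(sample))
```

Strings that use only quoted keys, strings, numbers, lists, and nested dicts are accepted by both parsers, which makes them the safest form to pass on the command line.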

Key | Type | Subsystem | Description & example
gpu_vendor | string | Core | Override GPU vendor detection. "AMD" or "NVIDIA". Defaults to "AMD" if not set and auto-detect fails.
guest_os | string | Core | Container OS for package manager selection. "UBUNTU" or "CENTOS". Affects rocEnvTool installer selection.
MAD_ROCM_PATH | string | Core | Override host ROCm root path (e.g. "/opt/rocm-6.2"). Takes priority over auto-detection and ROCM_PATH env.
docker_env_vars | dict | Exec | Env vars injected as --env into docker run. Keys are validated with _ENV_KEY_RE. Special: docker_env_vars.MAD_ROCM_PATH overrides in-container ROCm root independently of host.
docker_build_arg | dict | Exec | Extra --build-arg KEY=VAL flags passed to docker build.
docker_gpus | string | Exec | Comma-separated GPU indices to expose, or "all". E.g. "0,1,2,3".
docker_cpus | string | Exec | CPU affinity string for --cpuset-cpus. E.g. "0-15".
docker_mounts | dict | Exec | Extra volume mounts. E.g. {"host_path":"/data","container_path":"/mnt/data"}.
docker_image / MAD_CONTAINER_IMAGE | string | Orch | Skip build entirely; use this image for all models. Creates a synthetic manifest.
k8s / kubernetes | dict | Deploy | Selects Kubernetes deployment. See K8s config section for sub-keys.
slurm | dict | Deploy | Selects SLURM deployment. See SLURM config section for sub-keys.
distributed | dict | Deploy | Distributed launcher configuration. launcher, nnodes, nproc_per_node, backend, port. See Per-launcher config.
distributed.launcher | string | Deploy | "torchrun", "deepspeed", "megatron", "torchtitan", "primus", "vllm", "sglang", "sglang_disagg", "slurm_multi"/"slurm-multi".
distributed.sglang_disagg | dict | Deploy | Fine-tune prefill/decode node split. {"prefill_nodes":1,"decode_nodes":2}. Default ~40% prefill, rest decode. Min 3 nodes total.
vllm | dict | Deploy | vLLM-specific config (tensor/pipeline parallelism, model, etc.).
primus | dict | Deploy | Primus-specific config. config_path, cli_extra, backend.
secrets | dict | Deploy | K8s only. Auto-converted to a K8s Secret and mounted as env vars. E.g. {"HF_TOKEN":"hf_xxx"}.
tools | list | Exec | Profiling/tracing tools. Each item: {"name":"rocprofv3_compute"}. Stackable. See Profiling tools.
rocenv_mode | string | Exec | "lite" (default) or "full". Full mode runs lshw/dmidecode/dmesg/modinfo, installs missing tools per guest_os.
pre_scripts | list | Exec | Scripts to run inside the container before the model script.
post_scripts | list | Exec | Scripts to run inside the container after the model script.
encapsulate_script | string | Exec | Script prepended to the model run command (wraps the whole execution).
log_error_pattern_scan | bool | Exec | Set false to disable post-run log substring error detection. Useful when pytest/JUnit is authoritative.
log_error_patterns | list | Exec | Replace the default error patterns list entirely. Each string is matched as substring in log lines.
log_error_benign_patterns | list | Exec | Literal substrings that mark a matching log line as benign (not an error).
env_vars | dict | Deploy | Top-level env vars merged into deployment config (SLURM script / K8s job manifest).
gen_sys_env_details | bool | Exec | Enable/disable rocEnvTool system environment collection. Default: true.
debug | bool | Deploy | Enable debug-level logging in deployment templates.

SLURM sub-keys (slurm dict)

Key | Default (from preset) | Description
partition | "amd-rccl" | SLURM partition name.
nodes | 1 | Number of nodes to allocate.
gpus_per_node | 8 | GPUs per node.
time | "24:00:00" | Wall time limit (HH:MM:SS).
exclusive | true | Request exclusive node access.
nodelist | (none) | Pin to specific nodes. Also skips node health preflight check.
exclude | (none) | Nodes to exclude.
constraint | (none) | Node feature constraints.
reservation | (none) | SLURM reservation name. Forwarded to srun health/cleanup commands.
qos | (none) | Quality of service.
account | (none) | SLURM account for billing.
modules | [] | List of environment modules to load before job.
output_dir | CWD | Directory for SLURM log/output files.
network_interface | (none) | Network interface for NCCL/RCCL (e.g. "ib0").
shared_workspace | (none) | Shared filesystem path accessible from all nodes.

Kubernetes sub-keys (k8s dict)

Key | Default | Description
namespace | "default" | Kubernetes namespace.
gpu_count | (none) | Number of GPUs per pod.
gpu_resource_name | "amd.com/gpu" | K8s GPU resource type. Auto-set by GPU-vendor preset.
image_pull_policy | "Always" | K8s imagePullPolicy.
kubeconfig | "~/.kube/config" | Path to kubeconfig.
data_storage_class | "nfs-banff" | Storage class for data PVC. Falls back to nfs_storage_class then storage_class.
storage_class | "nfs-banff" | Generic storage class fallback.
memory | "64Gi" | Container memory request.
memory_limit | "128Gi" | Container memory limit.
cpu | "16" | CPU request.
cpu_limit | "32" | CPU limit.
host_ipc | false | Enable hostIPC (needed for multi-node NCCL).
backoff_limit | 3 | K8s Job backoffLimit (retries).
ttl_seconds_after_finished | null | Auto-delete job after N seconds.
recreate_shared_data_pvc | false | Re-create data PVC even if it already exists.
secrets.strategy | "from_local_credentials" | How to load K8s image pull secrets.
secrets.image_pull_secret_names | [] | Existing K8s secret names to use as image pull secrets.

Model definition — models.json

Each model definition lives in a models.json file (or is returned by get_models_json.py::list_models()). Fields map to the CustomModel dataclass in utils/discover_models.py.

{
  "name": "llama3-8b-train",          // Unique model identifier
  "dockerfile": "docker/Dockerfile.ubuntu.amd",
  "dockercontext": ".",               // Build context dir (relative to scripts dir)
  "scripts": "scripts/llama3/train.sh",
  "url": "https://github.com/org/repo",
  "cred": "hf_token",                 // Credential key from credential.json
  "owner": "ml-team",
  "data": "llama3-dataset",           // Data identifier for DataProvider
  "n_gpus": "8",                      // "-1" = all available; "0" = CPU-only
  "timeout": 14400,                   // Seconds; overridden by --timeout CLI flag
  "training_precision": "bf16",
  "tags": ["llama3", "training", "amd"],
  "args": "--batch-size 4 --seq-len 4096",
  "multiple_results": "results.csv",  // CSV file with multiple perf rows
  "skip_gpu_arch": "gfx908,gfx1100", // Comma-list of archs to skip this model on
  "additional_docker_run_options": "--shm-size 64g",
  "distributed": {
    "launcher": "torchrun",
    "nnodes": 2,
    "nproc_per_node": 8
  },
  "env_vars": {
    "HF_TOKEN": "auto",              // Injected into container env
    "DOCKER_IMAGE_NAME": "reg/img"   // Used by slurm_multi parallel pull
  }
}

Key field notes

Field | Notes
n_gpus | "-1" = use all GPUs on the host (MAD_SYSTEM_NGPUS). Positive int = that many GPUs. Used for perf CSV metadata.
timeout | Used when CLI --timeout=-1 (default). Explicit CLI value always wins.
skip_gpu_arch | Comma-separated GPU arch names (e.g. "gfx908,A100"). Model is skipped if detected arch matches. Disable with --disable-skip-gpu-arch.
multiple_results | Path to CSV file (relative to model dir) with per-result rows that are appended to perf.csv individually.
DOCKER_IMAGE_NAME in env_vars | Required for slurm_multi: specifies the registry image for parallel srun docker pull on compute nodes. Also set automatically by DockerBuilder after a successful push.
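
As an illustration of the skip_gpu_arch field, the check could look like the sketch below. The exact matching rule is not specified here, so this version assumes a case-insensitive exact match against each comma-separated entry; should_skip_model is a hypothetical name.

```python
def should_skip_model(skip_gpu_arch: str, detected_arch: str,
                      disable_skip: bool = False) -> bool:
    """Return True when the model should be skipped on the detected GPU arch.

    Assumption: entries are compared case-insensitively and must match exactly.
    disable_skip stands in for the --disable-skip-gpu-arch flag.
    """
    if disable_skip:
        return False
    skipped = {arch.strip().lower() for arch in skip_gpu_arch.split(",") if arch.strip()}
    return detected_arch.strip().lower() in skipped
```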

Build manifest — build_manifest.json

Written by madengine build, consumed by madengine run. Pass with --manifest-file.

{
  "built_images": {
    "ci-llama3_Dockerfile.ubuntu.amd": {
      "docker_image": "registry.io/ml/ci-llama3:sha256-abc",
      "docker_sha":   "sha256:abc123",
      "build_duration": 183.4
    }
  },
  "built_models": {
    "ci-llama3_Dockerfile.ubuntu.amd": {
      "name":          "llama3-8b-train",
      "dockerfile":    "docker/Dockerfile.ubuntu.amd",
      "docker_image":  "ci-llama3_Dockerfile.ubuntu.amd",
      "docker_sha":    "sha256:abc123",
      "build_duration": 183.4,
      "scripts":       "scripts/llama3/train.sh",
      "args":          "--batch-size 4",
      "tags":          ["llama3","training"],
      "n_gpus":        "8",
      "timeout":       14400,
      "skip_gpu_arch": "",
      "multiple_results": "",
      "distributed":   {"launcher":"torchrun","nnodes":2,"nproc_per_node":8},
      "env_vars":      {"DOCKER_IMAGE_NAME":"registry.io/ml/ci-llama3:sha256-abc"},
      "built_on_compute": false
    }
  },
  "context": {
    "gpu_vendor": "AMD",
    "guest_os":   "UBUNTU",
    "docker_env_vars": {"MAD_GPU_VENDOR":"AMD","MAD_SYSTEM_NGPUS":"8"},
    "docker_build_arg": {}
  },
  "deployment_config": {
    "target":  "slurm",
    "slurm":   {"partition":"gpu","nodes":4,"gpus_per_node":8,"time":"24:00:00"},
    "distributed": {"launcher":"torchrun","nnodes":4,"nproc_per_node":8},
    "env_vars": {"NCCL_DEBUG":"WARN"},
    "debug": false
  },
  "summary": {"total":1,"success":1,"failed":0}
}

Merging at runtime: values in deployment_config are merged into the runtime context at startup. Keys in --additional-context take precedence over deployment_config.

Deployment target inference

No explicit deploy field needed. RunOrchestrator._infer_deployment_target() inspects the merged context:

Context condition | Target | Class | Path
"k8s" or "kubernetes" key present | Kubernetes | KubernetesDeployment | deployment/kubernetes.py
"slurm" key present | SLURM | SlurmDeployment | deployment/slurm.py
Neither | Local Docker | ContainerRunner | execution/container_runner.py

Within SLURM deployment, if distributed.launcher == "slurm_multi" (or "slurm-multi"), SlurmDeployment.prepare() takes the slurm_multi path instead of generating the standard Jinja2 template.

Force local: use --force-mirror-local on madengine run to always use ContainerRunner even when slurm/k8s keys are in context.
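
The inference rules above reduce to a few key lookups. The following is a sketch, not the body of RunOrchestrator._infer_deployment_target(): the function names are hypothetical, and checking the K8s keys before "slurm" when both are present is an assumption based only on the order in which the rules are listed.

```python
from typing import Any, Dict


def infer_target(context: Dict[str, Any], force_local: bool = False) -> str:
    """Return "k8s", "slurm", or "local" from the merged context keys."""
    if force_local:  # stands in for --force-mirror-local
        return "local"
    if "k8s" in context or "kubernetes" in context:
        return "k8s"
    if "slurm" in context:
        return "slurm"
    return "local"


def uses_slurm_multi(context: Dict[str, Any]) -> bool:
    """True when distributed.launcher is slurm_multi or its hyphenated alias."""
    launcher = str(context.get("distributed", {}).get("launcher", ""))
    return launcher.replace("-", "_") == "slurm_multi"
```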

SLURM deployment

Implemented in src/madengine/deployment/slurm.py. Generates an sbatch script from a Jinja2 template at src/madengine/deployment/templates/slurm/job.sh.j2.

Preset merge order

ConfigLoader.load_slurm_config() applies three layers (last wins):

  1. presets/slurm/defaults.json — base defaults for all SLURM runs
  2. presets/slurm/profiles/single-node.json or multi-node.json — profile selected by nodes count
  3. User-supplied slurm / distributed / env_vars keys

presets/slurm/defaults.json (base preset contents):
{
  "gpu_vendor": "AMD",
  "guest_os": "UBUNTU",
  "debug": false,
  "slurm": {
    "partition": "amd-rccl",
    "nodes": 1,
    "gpus_per_node": 8,
    "time": "24:00:00",
    "exclusive": true,
    "modules": []
  },
  "distributed": {
    "backend": "nccl",
    "port": 29500
  },
  "env_vars": {
    "OMP_NUM_THREADS": "8",
    "MIOPEN_FIND_MODE": "1",
    "MIOPEN_USER_DB_PATH": "/tmp/.miopen"
  }
}

presets/slurm/profiles/multi-node.json (additional env vars for multi-node):
{
  "slurm": {"nodes": 2, "gpus_per_node": 8, "time": "24:00:00"},
  "distributed": {"backend": "nccl", "port": 29500},
  "env_vars": {
    "NCCL_DEBUG": "WARN",
    "NCCL_DEBUG_SUBSYS": "INIT",
    "NCCL_IB_DISABLE": "0",
    "NCCL_SOCKET_IFNAME": "ib0",
    "TORCH_NCCL_HIGH_PRIORITY": "1",
    "GPU_MAX_HW_QUEUES": "8",
    "TORCH_NCCL_ASYNC_ERROR_HANDLING": "1",
    "NCCL_TIMEOUT": "1200",
    "HSA_ENABLE_SDMA": "0",
    "HSA_FORCE_FINE_GRAIN_PCIE": "1",
    "RCCL_ENABLE_HIPGRAPH": "0"
  }
}
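
The "last wins" layering can be pictured as a recursive dict merge. This is a generic sketch of the technique using trimmed-down values from the presets above; deep_merge is a hypothetical helper, not ConfigLoader's actual code.

```python
import copy
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict in which override wins; nested dicts merge key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Layer 1: base defaults (subset of presets/slurm/defaults.json).
defaults = {
    "slurm": {"partition": "amd-rccl", "nodes": 1, "gpus_per_node": 8, "time": "24:00:00"},
    "env_vars": {"OMP_NUM_THREADS": "8"},
}
# Layer 2: multi-node profile (subset).
multi_node = {"slurm": {"nodes": 2}, "env_vars": {"NCCL_DEBUG": "WARN"}}
# Layer 3: user-supplied keys.
user = {"slurm": {"partition": "gpu", "nodes": 4}}

config = deep_merge(deep_merge(defaults, multi_node), user)
print(config["slurm"])     # partition and nodes come from the user layer
print(config["env_vars"])  # env vars from both preset layers survive
```

Because each merge returns a copy, the preset dicts are left unchanged and can be reused for the next run.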

What the SLURM job script does

  • Sets MASTER_ADDR via scontrol show hostnames, MASTER_PORT, WORLD_SIZE, NNODES
  • Sets per-node HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES / CUDA_VISIBLE_DEVICES (vLLM/SGLang: only HIP_VISIBLE_DEVICES)
  • Sets MIOPEN_USER_DB_PATH per-process: /tmp/.miopen/node_${SLURM_PROCID}_rank_${LOCAL_RANK:-0}
  • Sets TORCH_ELASTIC_RDZV_TIMEOUT=3600 for PyTorch elastic
  • Sets MAD_DEPLOYMENT_TYPE=slurm, MAD_SLURM_JOB_ID, MAD_NODE_RANK, MAD_IN_SLURM_JOB=1
  • Multi-node: generates per-node task script; runs via srun bash $TASK_SCRIPT
  • Single-node: creates synthetic manifest with deployment_config.target="docker" and calls madengine run

Node health preflight

SlurmNodeSelector runs a health-check srun before the main job unless slurm.nodelist is set (then skipped). Supports slurm.reservation forwarded to srun commands.

Monitoring

Polls squeue every 30 seconds. Terminal states: COMPLETED, FAILED, CANCELLED — a scancel'd job will not loop forever.
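
A polling loop with an explicit set of terminal states might look like the sketch below. It is illustrative only: get_state is a hypothetical callable standing in for the squeue query, and the sleep function is injectable so the loop can be exercised without waiting.

```python
import time
from typing import Callable

TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}


def wait_for_job(get_state: Callable[[], str],
                 poll_interval_s: float = 30.0,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the job reaches a terminal state and return that state."""
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state  # includes CANCELLED, so a scancel'd job ends the loop
        sleep(poll_interval_s)
```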

SLURM inside existing allocation (salloc): if SLURM_JOB_ID is set and the launcher is slurm_multi, madengine runs the wrapper script directly with bash instead of nesting a new sbatch. Other launchers still submit via sbatch even inside salloc.

Kubernetes deployment

Implemented in src/madengine/deployment/kubernetes.py and 6 focused mixin modules (refactored in v2.0.3). Requires pip install -e ".[kubernetes]".

Mixin modules

Module | Concern
k8s_pvc.py | PVC lifecycle. Storage-class fallback: data_storage_class → nfs_storage_class → storage_class. Default: "nfs-banff".
k8s_results.py | Log/artifact collection, perf aggregation. Uses the shared collector_pod_name() helper, which truncates names to collector-{id[:15]} to stay within K8s name limits.
k8s_scripts.py | Script extraction, ConfigMap building. Carries rocenv_mode and guest_os into the ConfigMap.
k8s_template_context.py | Assembles Jinja2 template context dict passed to job.yaml.j2.
kubernetes_launcher_mixin.py | Selects the right Jinja2 template per launcher type.
k8s_secrets.py | Converts additional_context.secrets dict to K8s Secret objects mounted as env vars.

Preset merge order

ConfigLoader.load_k8s_config() applies five layers (last wins):

  1. presets/k8s/defaults.json — base defaults
  2. presets/k8s/gpu-vendors/amd.json or nvidia.json — GPU resource name
  3. presets/k8s/gpu-vendors/amd-multi-gpu.json — AMD multi-GPU NCCL env vars (only if AMD + multi-GPU)
  4. presets/k8s/profiles/single-gpu.json, multi-gpu.json, or multi-node.json
  5. User config

presets/k8s/defaults.json (base preset contents):
{
  "k8s": {
    "kubeconfig": "~/.kube/config",
    "namespace": "default",
    "image_pull_policy": "Always",
    "backoff_limit": 3,
    "ttl_seconds_after_finished": null,
    "nfs_storage_class": "nfs-banff",
    "storage_class": "nfs-banff",
    "data_storage_class": "nfs-banff",
    "recreate_shared_data_pvc": false,
    "secrets": {
      "strategy": "from_local_credentials",
      "image_pull_secret_names": [],
      "runtime_secret_name": null
    }
  },
  "env_vars": {"OMP_NUM_THREADS": "8"}
}

presets/k8s/gpu-vendors/amd-multi-gpu.json (AMD multi-GPU NCCL env vars):
{
  "env_vars": {
    "NCCL_DEBUG": "WARN",
    "NCCL_IB_DISABLE": "0",
    "NCCL_SOCKET_IFNAME": "ib0",
    "TORCH_NCCL_HIGH_PRIORITY": "1",
    "GPU_MAX_HW_QUEUES": "8",
    "HSA_ENABLE_SDMA": "0",
    "MIOPEN_FIND_MODE": "1",
    "MIOPEN_USER_DB_PATH": "/tmp/.miopen",
    "HSA_FORCE_FINE_GRAIN_PCIE": "1",
    "RCCL_ENABLE_HIPGRAPH": "0"
  }
}

Known issue: in multi-node K8s jobs, a node may show FAILED in the results table even when the pod succeeded; this occurs when the kubelet returns 502 between job completion and log collection. PVC artifacts are still collected. Check kubectl describe pod <pod>.

Secrets management

# Pass secrets via additional_context
madengine run --tags llm-serve \
  --additional-context '{
    "k8s": {"namespace":"ml","gpu_count":8},
    "secrets": {"HF_TOKEN":"hf_xxx","WANDB_API_KEY":"yyy","S3_KEY":"zzz"}
  }'

Secrets in additional_context.secrets are auto-converted to a K8s Secret object and mounted as environment variables in the job pod. They are never written to perf.csv or build logs.
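
To make the conversion concrete, here is a sketch of the kind of objects involved, built as plain dicts. It assumes an Opaque Secret with base64-encoded data and one secretKeyRef env entry per key; the helper names and the secret name are hypothetical, and the real k8s_secrets.py may structure this differently.

```python
import base64
from typing import Any, Dict, List


def build_secret_manifest(name: str, namespace: str,
                          secrets: Dict[str, str]) -> Dict[str, Any]:
    """Build a Kubernetes Secret manifest; values under data must be base64."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {
            key: base64.b64encode(value.encode("utf-8")).decode("ascii")
            for key, value in secrets.items()
        },
    }


def secret_env_entries(name: str, secrets: Dict[str, str]) -> List[Dict[str, Any]]:
    """Build container env entries that read each key from the Secret."""
    return [
        {"name": key, "valueFrom": {"secretKeyRef": {"name": name, "key": key}}}
        for key in sorted(secrets)
    ]


# "madengine-run-secrets" is a made-up name for illustration.
manifest = build_secret_manifest("madengine-run-secrets", "ml", {"HF_TOKEN": "hf_xxx"})
env = secret_env_entries("madengine-run-secrets", {"HF_TOKEN": "hf_xxx"})
```

Referencing the Secret through secretKeyRef keeps the plain values out of the Job spec itself.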

slurm_multi launcher (merged in v2.1.0)

What it is

An escape-hatch SLURM launcher for workloads that orchestrate their own per-node Docker containers via srun — for example SGLang Disaggregated (proxy + prefill + decode) or any topology that needs to call srun/scontrol from inside the job step.

madengine generates a wrapper sbatch script that runs the model's own .slurm (or .sh) script directly on the head node on bare metal (no outer container), so the workload can spawn its own per-node containers without nesting.

How to select it

{
  "slurm": {
    "partition": "gpu",
    "nodes": 3,
    "gpus_per_node": 8,
    "time": "02:00:00"
  },
  "distributed": {
    "launcher": "slurm_multi"
  }
}

Alias "slurm-multi" (hyphen) is also accepted and normalized automatically.
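
A minimal sketch of such alias normalization, assuming a simple lookup table. LAUNCHER_ALIASES and normalize_launcher are hypothetical names; only the slurm-multi alias comes from the text above.

```python
# Hypothetical alias table; only the "slurm-multi" entry is taken from the text.
LAUNCHER_ALIASES = {"slurm-multi": "slurm_multi"}


def normalize_launcher(name: str) -> str:
    """Trim and lower-case a launcher name, then map known aliases to canonical form."""
    key = name.strip().lower()
    return LAUNCHER_ALIASES.get(key, key)


print(normalize_launcher("slurm-multi"))
```

A lookup table is used here rather than a blanket hyphen-to-underscore replacement so that other hyphenated names are left untouched.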

Build modes

| Mode | Flag | Behavior |
| --- | --- | --- |
| Use prebuilt image | --use-image registry.io/img:tag | Skip local build. Uses explicit image. |
| Auto-resolve from model card | --use-image auto | Reads env_vars.DOCKER_IMAGE_NAME from model card. |
| Build on compute | --build-on-compute --registry reg.io/ml | Builds on SLURM compute node, pushes to registry. Manifest sets built_on_compute: true. Run phase pulls in parallel on all nodes. |
| Implicit fallback | no flags | If model card has DOCKER_IMAGE_NAME, auto-uses it. Otherwise raises ConfigurationError listing options. |

Execution paths

  • sbatch (default): wrapper SBATCH submitted to SLURM. Head node calls srun docker pull on all nodes in parallel, then runs the model's script.
  • bash-in-salloc: if SLURM_JOB_ID env var is set (inside existing salloc), the launcher runs the wrapper synchronously with bash. Sets DeploymentResult.skip_monitoring=True so the monitor poll is skipped.
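
The choice between the two paths can be sketched as a check on SLURM_JOB_ID. ExecutionPlan and plan_wrapper_execution are hypothetical names used for illustration; the real logic lives in the SLURM deployment code and may differ in detail.

```python
from dataclasses import dataclass
from typing import List, Mapping


@dataclass
class ExecutionPlan:
    command: List[str]      # argv used to start the wrapper script
    skip_monitoring: bool   # True when the wrapper runs synchronously


def plan_wrapper_execution(wrapper_path: str, env: Mapping[str, str]) -> ExecutionPlan:
    """Run with bash inside an existing allocation, otherwise submit via sbatch."""
    if env.get("SLURM_JOB_ID"):
        return ExecutionPlan(["bash", wrapper_path], skip_monitoring=True)
    return ExecutionPlan(["sbatch", wrapper_path], skip_monitoring=False)


inside_salloc = plan_wrapper_execution("wrapper.sh", {"SLURM_JOB_ID": "12345"})
fresh_submit = plan_wrapper_execution("wrapper.sh", {})
```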

Results aggregation

_collect_slurm_multi_results() reads per-job CSV from /shared_inference/$USER/$JOBID/perf.csv and writes those rows into cwd/perf.csv (copy if absent, append data rows if present). This ensures display_performance_table and madengine report to-html find results without extra arguments.
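
The copy-or-append behavior can be sketched as follows. This is an illustration under stated assumptions, not the body of _collect_slurm_multi_results(): merge_perf_csv is a hypothetical helper, the two-column CSV is a stand-in for the real perf.csv schema, and both files are assumed to share the same header and end with a newline.

```python
import csv
import shutil
import tempfile
from pathlib import Path


def merge_perf_csv(job_csv: Path, target_csv: Path) -> int:
    """Copy job_csv to target_csv if absent; otherwise append its data rows.

    Returns the number of data rows contributed by job_csv.
    """
    with job_csv.open(newline="") as src:
        rows = list(csv.reader(src))
    data_rows = rows[1:]  # rows[0] is the header line
    if not target_csv.exists():
        shutil.copyfile(job_csv, target_csv)
        return len(data_rows)
    with target_csv.open("a", newline="") as dst:
        csv.writer(dst, lineterminator="\n").writerows(data_rows)
    return len(data_rows)


with tempfile.TemporaryDirectory() as tmp:
    job = Path(tmp) / "job_perf.csv"
    job.write_text("model,performance\nllama3,1234.5\n")
    target = Path(tmp) / "perf.csv"
    merge_perf_csv(job, target)  # first call copies the file, header included
    merge_perf_csv(job, target)  # second call appends only the data row
    merged_text = target.read_text()
```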

Local self-managed execution

When slurm_multi is detected in a non-SLURM context (e.g. local Docker mode), ContainerRunner._run_self_managed() runs the model's script directly on the host. Env vars from model card and additional_context are injected; keys are logged without values to avoid leaking credentials.
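
A small sketch of key-only logging for the injected environment. The function name and the merge precedence (additional_context overriding the model card) are assumptions made for illustration.

```python
from typing import Dict, Mapping, Tuple


def merge_env_for_run(
    model_env: Mapping[str, str], context_env: Mapping[str, str]
) -> Tuple[Dict[str, str], str]:
    """Merge env vars and build a log message that lists keys but no values."""
    merged = {**model_env, **context_env}  # assumption: additional_context wins
    message = "Injected env vars: " + ", ".join(sorted(merged))
    return merged, message


merged_env, log_message = merge_env_for_run(
    {"MODEL_NAME": "demo"}, {"HF_TOKEN": "hf_xxx"}
)
print(log_message)
```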

Docker --build-context tools= (v2.1.0)

What it does

Every docker build issued by DockerBuilder now passes --build-context tools=scripts/common/tools when that directory exists. Dockerfiles can pull shared helper scripts from the named context:

# In any model Dockerfile
COPY --from=tools rocm_smi/*.py /opt/mad/tools/rocm_smi/
COPY --from=tools gpu_info/*.py /opt/mad/tools/

Eliminates duplication of shared APIs across model Dockerfiles.

Conditional emission (PR #134)

The flag is only added when scripts/common/tools/ exists at build time. Builds in MAD projects without a tools directory do not receive the flag and will not fail.

Implementation: single guarded block in execution/docker_builder.py.
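
The guard can be pictured as below. docker_build_command is a hypothetical helper rather than the actual DockerBuilder code; --build-context NAME=PATH is the standard BuildKit flag for named build contexts.

```python
from pathlib import Path
from typing import List


def docker_build_command(
    image: str,
    dockerfile: str,
    context_dir: str,
    tools_dir: str = "scripts/common/tools",
) -> List[str]:
    """Assemble a docker build argv, adding the tools context only if it exists."""
    cmd = ["docker", "build", "-t", image, "-f", dockerfile]
    if Path(tools_dir).is_dir():
        cmd += ["--build-context", f"tools={tools_dir}"]
    cmd.append(context_dir)
    return cmd


# "." always exists, so it stands in for a project that has a tools directory.
with_tools = docker_build_command("img:tag", "Dockerfile", ".", tools_dir=".")
without_tools = docker_build_command("img:tag", "Dockerfile", ".", tools_dir="no/such/dir")
```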

SLURM fix in same PR: switched from shlex.quote() to double-quote escaping in slurm.py env-var generation so spaces and paths in values survive correctly in the sbatch script.
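
To illustrate double-quote escaping for an sbatch export line, here is a minimal sketch. export_line is a hypothetical helper. It escapes the four characters that stay special inside bash double quotes (backslash, double quote, dollar sign, backtick); whether madengine escapes all four, or deliberately leaves $ expandable, is not stated in this document.

```python
import re

# Characters that keep a special meaning inside bash double quotes.
_DQ_SPECIALS = re.compile(r'([\\"$`])')


def export_line(key: str, value: str) -> str:
    """Render `export KEY="value"` with the value escaped for double quotes."""
    escaped = _DQ_SPECIALS.sub(r"\\\1", value)
    return f'export {key}="{escaped}"'


print(export_line("OUTPUT_DIR", "/data/my run"))
print(export_line("MSG", 'say "hi" $USER'))
```

Spaces and slashes pass through unchanged, so a path such as /data/my run reaches the job exactly as written.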

Launcher matrix

| Launcher | Local | K8s | SLURM | Type | Notes |
| --- | --- | --- | --- | --- | --- |
| torchrun | | | | Train | DDP / FSDP, elastic rendezvous. |
| megatron / megatron-lm | | | | Train | TP + PP parallelism; sets TP/PP/CP size env vars. |
| torchtitan | | | | Train | FSDP2 + TP + PP + CP; Llama 3.1 8B–405B. |
| deepspeed | | | | Train | ZeRO, pipeline parallelism; dynamic hostfile from SLURM. |
| vllm | | | | Infer | PagedAttention; each node self-managing (no torchrun wrapper). |
| sglang | | | | Infer | RadixAttention, structured gen; each node self-managing. |
| sglang_disagg | | | | Infer | Disaggregated prefill/decode; min 3 nodes (1 proxy + ≥1P + ≥1D). |
| primus | | | | Train | Megatron / TorchTitan / MaxText via Primus YAML config. |
| slurm_multi (self-mgd) | | | | Meta | Bypasses template; model's own SLURM script on head node. |

Per-launcher configuration

torchrun: standard PyTorch distributed launcher. Generates: torchrun --nnodes=N --nproc_per_node=N --node_rank=R --master_addr=ADDR --master_port=PORT

{
  "slurm": {"partition":"gpu","nodes":4,"gpus_per_node":8,"time":"24:00:00"},
  "distributed": {
    "launcher": "torchrun",
    "nnodes": 4,
    "nproc_per_node": 8,
    "backend": "nccl",
    "port": 29500
  },
  "env_vars": {
    "NCCL_DEBUG": "WARN",
    "HSA_ENABLE_SDMA": "0",
    "TORCH_NCCL_ASYNC_ERROR_HANDLING": "1"
  }
}

Local: MAD_MULTI_NODE_RUNNER is set to torchrun --standalone --nproc_per_node=N (single-node only).
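
The two command shapes above can be sketched with a small builder. torchrun_command is a hypothetical helper; the flags are standard torchrun options. Treating every single-node case as --standalone is an assumption of this sketch.

```python
from typing import List, Optional


def torchrun_command(
    nnodes: int,
    nproc_per_node: int,
    node_rank: int = 0,
    master_addr: Optional[str] = None,
    master_port: int = 29500,
) -> List[str]:
    """Build a torchrun argv for single-node (standalone) or multi-node runs."""
    if nnodes == 1:
        return ["torchrun", "--standalone", f"--nproc_per_node={nproc_per_node}"]
    if not master_addr:
        raise ValueError("multi-node torchrun needs a master_addr")
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
    ]


local_runner = " ".join(torchrun_command(1, 8))
multi_node = torchrun_command(4, 8, node_rank=2, master_addr="node01")
```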

megatron / megatron-lm: uses torchrun under the hood; sets TENSOR_MODEL_PARALLEL_SIZE, PIPELINE_MODEL_PARALLEL_SIZE, CONTEXT_PARALLEL_SIZE env vars for the Megatron script to read.

{
  "slurm": {"partition":"gpu","nodes":8,"gpus_per_node":8,"time":"48:00:00"},
  "distributed": {
    "launcher": "megatron",
    "nnodes": 8,
    "nproc_per_node": 8
  },
  "env_vars": {
    "TENSOR_MODEL_PARALLEL_SIZE": "4",
    "PIPELINE_MODEL_PARALLEL_SIZE": "2",
    "CONTEXT_PARALLEL_SIZE": "1",
    "NCCL_IB_DISABLE": "0"
  }
}
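
A quick sanity check for a configuration like the one above: the product TP × PP × CP must divide the total number of GPU processes, and the quotient is the data-parallel size. The helper name is hypothetical.

```python
def data_parallel_size(nnodes: int, nproc_per_node: int, tp: int, pp: int, cp: int = 1) -> int:
    """Return the data-parallel size implied by the model-parallel sizes."""
    world_size = nnodes * nproc_per_node
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by TP*PP*CP = {model_parallel}"
        )
    return world_size // model_parallel


# 8 nodes x 8 GPUs = 64 processes; TP=4, PP=2, CP=1 uses 8 per model replica.
dp = data_parallel_size(nnodes=8, nproc_per_node=8, tp=4, pp=2, cp=1)
print(dp)  # 8
```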

torchtitan: FSDP2 + TP + PP + CP. Sets TORCHTITAN_TENSOR_PARALLEL_SIZE, TORCHTITAN_PIPELINE_PARALLEL_SIZE, TORCHTITAN_FSDP_ENABLED, TORCHTITAN_CONTEXT_PARALLEL_SIZE.

{
  "slurm": {"partition":"gpu","nodes":4,"gpus_per_node":8,"time":"24:00:00"},
  "distributed": {
    "launcher": "torchtitan",
    "nnodes": 4,
    "nproc_per_node": 8
  },
  "env_vars": {
    "TORCHTITAN_TENSOR_PARALLEL_SIZE": "2",
    "TORCHTITAN_FSDP_ENABLED": "true"
  }
}

deepspeed: DeepSpeed with dynamic SLURM hostfile generation. Generates: deepspeed --hostfile=/tmp/hostfile …

{
  "slurm": {
    "partition": "gpu",
    "nodes": 8,
    "gpus_per_node": 8,
    "time": "48:00:00",
    "reservation": "ml-priority"
  },
  "distributed": {
    "launcher": "deepspeed",
    "nnodes": 8,
    "nproc_per_node": 8,
    "backend": "nccl"
  },
  "env_vars": {
    "NCCL_DEBUG": "WARN",
    "HSA_ENABLE_SDMA": "0"
  }
}
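
A DeepSpeed hostfile lists one host per line in the form "hostname slots=N". The sketch below renders one from a list of hostnames; render_hostfile is a hypothetical helper, and in a SLURM job the hostnames would typically come from scontrol show hostnames "$SLURM_JOB_NODELIST".

```python
from typing import Iterable


def render_hostfile(hostnames: Iterable[str], slots_per_node: int) -> str:
    """Render DeepSpeed hostfile content: one 'host slots=N' line per node."""
    return "".join(f"{host} slots={slots_per_node}\n" for host in hostnames)


hostfile_text = render_hostfile(["node01", "node02"], slots_per_node=8)
print(hostfile_text, end="")
```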

vllm: each node runs independently (no torchrun). Sets VLLM_TENSOR_PARALLEL_SIZE, VLLM_PIPELINE_PARALLEL_SIZE, VLLM_DISTRIBUTED_BACKEND. Only HIP_VISIBLE_DEVICES is set (not ROCR_VISIBLE_DEVICES/CUDA_VISIBLE_DEVICES) to avoid conflict with Ray.

{
  "slurm": {"partition":"gpu","nodes":2,"gpus_per_node":8,"time":"12:00:00"},
  "distributed": {
    "launcher": "vllm",
    "nnodes": 2,
    "nproc_per_node": 8
  },
  "env_vars": {
    "VLLM_TENSOR_PARALLEL_SIZE": "8",
    "VLLM_PIPELINE_PARALLEL_SIZE": "2"
  }
}
AMD+Ray gotcha: RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES is automatically overridden to "" when HIP_VISIBLE_DEVICES is set, preventing the rocm/vllm image from ignoring GPU visibility.
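
The visibility rules described for vLLM can be sketched as follows. gpu_visibility_env is a hypothetical helper, and the membership of RAY_BASED_LAUNCHERS is an assumption: only vllm is named in this document as affected.

```python
from typing import Dict, Sequence

# Assumption: only vllm is treated as Ray-based in this sketch.
RAY_BASED_LAUNCHERS = {"vllm"}


def gpu_visibility_env(launcher: str, gpu_ids: Sequence[int]) -> Dict[str, str]:
    """Return GPU visibility env vars for an AMD run of the given launcher."""
    ids = ",".join(str(i) for i in gpu_ids)
    env = {
        "HIP_VISIBLE_DEVICES": ids,
        # Forced to "" whenever HIP_VISIBLE_DEVICES is set (AMD+Ray fix).
        "RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES": "",
    }
    if launcher not in RAY_BASED_LAUNCHERS:
        env["ROCR_VISIBLE_DEVICES"] = ids
        env["CUDA_VISIBLE_DEVICES"] = ids
    return env


vllm_env = gpu_visibility_env("vllm", [0, 1, 2, 3])
torchrun_env = gpu_visibility_env("torchrun", [0, 1])
```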

sglang: standard SGLang (RadixAttention, structured gen). Each node self-managing. Sets SGLANG_TENSOR_PARALLEL_SIZE, SGLANG_PIPELINE_PARALLEL_SIZE.

{
  "slurm": {"partition":"gpu","nodes":2,"gpus_per_node":8,"time":"06:00:00"},
  "distributed": {
    "launcher": "sglang",
    "nnodes": 2,
    "nproc_per_node": 8
  },
  "env_vars": {
    "SGLANG_TENSOR_PARALLEL_SIZE": "8"
  }
}

sglang_disagg: disaggregated prefill + decode topology. Minimum 3 nodes: 1 proxy + ≥1 prefill + ≥1 decode. Node split: default ~40% prefill, rest decode.

{
  "slurm": {
    "partition": "gpu",
    "nodes": 5,
    "gpus_per_node": 8,
    "time": "04:00:00"
  },
  "distributed": {
    "launcher": "sglang_disagg",
    "nnodes": 5,
    "nproc_per_node": 8,
    "sglang_disagg": {
      "prefill_nodes": 2,
      "decode_nodes": 2
    }
  },
  "env_vars": {
    "SGLANG_TP_SIZE": "8"
  }
}

Sets: SGLANG_DISAGG_MODE, SGLANG_DISAGG_PREFILL_NODES, SGLANG_DISAGG_DECODE_NODES, SGLANG_DISAGG_TOTAL_NODES, SGLANG_NODE_IPS, SGLANG_NODE_RANK.
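
One way to picture the node split is sketched below. split_disagg_nodes is a hypothetical helper, and the exact default rule is an assumption: this sketch applies the 40% share to the worker nodes (total minus the proxy) and rounds to the nearest integer, which may not match madengine's rounding.

```python
from typing import Dict, Optional


def split_disagg_nodes(
    total_nodes: int,
    prefill_nodes: Optional[int] = None,
    decode_nodes: Optional[int] = None,
    prefill_fraction: float = 0.4,
) -> Dict[str, int]:
    """Split nodes into 1 proxy plus prefill and decode groups."""
    if total_nodes < 3:
        raise ValueError("sglang_disagg needs at least 3 nodes")
    workers = total_nodes - 1  # one node is reserved for the proxy
    if prefill_nodes is None and decode_nodes is None:
        # Assumed default: ~40% of workers, keeping at least one node per role.
        prefill_nodes = min(workers - 1, max(1, round(workers * prefill_fraction)))
        decode_nodes = workers - prefill_nodes
    if prefill_nodes is None or decode_nodes is None:
        raise ValueError("set both prefill_nodes and decode_nodes, or neither")
    if prefill_nodes < 1 or decode_nodes < 1 or prefill_nodes + decode_nodes != workers:
        raise ValueError("prefill + decode must equal total_nodes - 1, each >= 1")
    return {"proxy": 1, "prefill": prefill_nodes, "decode": decode_nodes}


default_split = split_disagg_nodes(5)
explicit_split = split_disagg_nodes(5, prefill_nodes=2, decode_nodes=2)
```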

Config recipes

Complete working configurations for common scenarios.

Local — single GPU, AMD

madengine run --tags llama3 \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "docker_gpus": "0"
  }'

Local — all 8 GPUs, with Megatron env vars

madengine run --tags megatron-llama3 \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "docker_env_vars": {
      "TENSOR_MODEL_PARALLEL_SIZE": "4",
      "PIPELINE_MODEL_PARALLEL_SIZE": "2"
    }
  }'

SLURM — single node torchrun

cat > slurm-single.json <<'EOF'
{
  "slurm": {
    "partition": "amd-gpu",
    "nodes": 1,
    "gpus_per_node": 8,
    "time": "12:00:00",
    "exclusive": true
  },
  "distributed": {
    "launcher": "torchrun",
    "nnodes": 1,
    "nproc_per_node": 8
  }
}
EOF
madengine build --tags llama3 --registry registry.example.com/ml
madengine run --manifest-file build_manifest.json \
  --additional-context-file slurm-single.json

SLURM — 4-node DeepSpeed with reservation

cat > slurm-multi.json <<'EOF'
{
  "slurm": {
    "partition": "amd-gpu",
    "nodes": 4,
    "gpus_per_node": 8,
    "time": "24:00:00",
    "exclusive": true,
    "reservation": "ml-training-q1",
    "network_interface": "ib0"
  },
  "distributed": {
    "launcher": "deepspeed",
    "nnodes": 4,
    "nproc_per_node": 8,
    "backend": "nccl"
  },
  "env_vars": {
    "NCCL_IB_DISABLE": "0",
    "NCCL_SOCKET_IFNAME": "ib0",
    "NCCL_DEBUG": "WARN",
    "HSA_ENABLE_SDMA": "0"
  }
}
EOF
madengine run --manifest-file build_manifest.json \
  --additional-context-file slurm-multi.json

K8s — single pod, 4 AMD GPUs

madengine run --tags llama3-infer \
  --additional-context '{
    "k8s": {
      "namespace": "ml-team",
      "gpu_count": 4
    }
  }'

K8s — multi-node vLLM with HF secret

madengine run --tags vllm-llama3-70b \
  --additional-context '{
    "k8s": {
      "namespace": "ml-team",
      "gpu_count": 8,
      "host_ipc": true,
      "data_storage_class": "nfs-banff"
    },
    "distributed": {
      "launcher": "vllm",
      "nnodes": 2,
      "nproc_per_node": 8
    },
    "secrets": {"HF_TOKEN": "hf_xxxxxxx"},
    "env_vars": {
      "VLLM_TENSOR_PARALLEL_SIZE": "8",
      "VLLM_PIPELINE_PARALLEL_SIZE": "2"
    }
  }'

SLURM — SGLang Disagg (3 nodes: 1 proxy + 1P + 1D)

madengine build --tags pyt_sglang_disagg --use-image registry.io/sglang:v0.4

madengine run --manifest-file build_manifest.json \
  --additional-context '{
    "slurm": {
      "partition": "amd-gpu",
      "nodes": 3,
      "gpus_per_node": 8,
      "time": "04:00:00"
    },
    "distributed": {
      "launcher": "slurm_multi"
    }
  }'

Local run with ROCm compute profiling

madengine run --tags llama3 \
  --additional-context '{
    "gpu_vendor": "AMD",
    "tools": [
      {"name": "rocprofv3_compute"}
    ],
    "rocenv_mode": "full"
  }'

Stack multiple profilers:

  "tools": [
    {"name": "rocprofv3_compute"},
    {"name": "rccl_trace"},
    {"name": "gpu_info_power_profiler"}
  ]

Profiling & tracing tools

Enable via --additional-context '{"tools":[{"name":"…"}]}'. Tools are stackable — list multiple objects. Implemented in scripts/common/tools/ and execution/container_runner.py::apply_tools().

Do not combine rocm_trace_lite with rocprof / rocprofv3_* in the same run: they conflict at the kernel-tracing level.

| Tool name | Purpose | Output | Notes |
| --- | --- | --- | --- |
| rocprof | Legacy GPU kernel profiling | Kernel timings / occupancy CSVs | Use rocprofv3_* on ROCm ≥ 7.0 |
| rocprofv3_compute | Compute-bound kernels | ALU, wave execution metrics | ROCm ≥ 7.0 |
| rocprofv3_memory | Memory-bound workloads | Cache hits, bandwidth | |
| rocprofv3_communication | Multi-GPU communication | RCCL traces | |
| rocprofv3_full | Comprehensive (all metrics) | All counters | High overhead; short runs only |
| rocprofv3_lightweight | Minimal overhead tracing | HIP API + kernel traces | |
| rocprofv3_perfetto | Perfetto UI traces | Perfetto JSON for ui.perfetto.dev | |
| rocprofv3_api_overhead | API call timing | Per-API timing report | |
| rocprofv3_pc_sampling | Kernel hotspot identification | PC sample histograms | |
| rocm_trace_lite | RTL lite dispatch trace | rocm_trace_lite_output/trace.db | Pinned GitHub release wheel by default |
| rocm_trace_lite_default | RTL default mode | Same paths, broader coverage | v2.0.3+ |
| rocblas_trace | rocBLAS call tracing | Per-library log | |
| miopen_trace | MIOpen call tracing | Per-library log | |
| tensile_trace | Tensile call tracing | Per-library log | |
| rccl_trace | RCCL communication tracing | Per-library log | |
| gpu_info_power_profiler | Power consumption over time | CSV time series | |
| gpu_info_vram_profiler | VRAM usage over time | CSV time series | |
| therock_check | TheRock ROCm stack validation | Detection report | Identifies apt vs TheRock install |
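
The conflict rule for rocm_trace_lite and the rocprof family can be checked before a run with something like the sketch below. check_tool_conflicts is a hypothetical helper, not part of apply_tools(); treating rocm_trace_lite_default as conflicting too is an assumption.

```python
from typing import Dict, List


def check_tool_conflicts(tools: List[Dict[str, str]]) -> List[str]:
    """Return the tool names, raising if RTL and rocprof tools are combined."""
    names = [tool["name"] for tool in tools]
    # Assumption: every rocm_trace_lite* variant conflicts with rocprof tools.
    rtl = [n for n in names if n.startswith("rocm_trace_lite")]
    rocprof = [n for n in names if n == "rocprof" or n.startswith("rocprofv3_")]
    if rtl and rocprof:
        raise ValueError(f"conflicting tools: {rtl + rocprof}")
    return names


stacked = check_tool_conflicts(
    [{"name": "rocprofv3_compute"}, {"name": "rccl_trace"}, {"name": "gpu_info_power_profiler"}]
)
```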

rocm_trace_lite wheel control

| Env var | Effect |
| --- | --- |
| ROCM_TRACE_LITE_FOLLOW_LATEST=1 | Always pull the latest wheel from GitHub |
| ROCM_TRACE_LITE_WHEEL_URL=https://… | Use a specific wheel URL (air-gapped installs) |

rocEnvTool modes

| Mode (rocenv_mode) | Collects |
| --- | --- |
| "lite" (default) | Basic ROCm info, GPU topology, driver version |
| "full" | All of lite + lshw, dmidecode, dmesg, modinfo; best-effort installs missing tools per guest_os |

ROCm path resolution

Implemented in src/madengine/utils/rocm_path_resolver.py and src/madengine/core/context.py. Two independent resolution chains run in parallel.

Host path (build & tools)

  1. MAD_ROCM_PATH in --additional-context
  2. Auto-detect: /opt/rocm, versioned /opt/rocm-*, TheRock (rocm-sdk + markers)
  3. rocminfo / amd-smi / rocm-smi location on PATH
  4. ROCM_PATH environment variable
  5. /opt/rocm fallback (with warning)

Set MAD_AUTO_ROCM_PATH=0 to disable scanning and use only env var / default.
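
The host priority chain can be sketched as a function that tries each source in order. This is not the code in rocm_path_resolver.py: the function name is hypothetical, and the two probe callables stand in for the directory scan (step 2) and the PATH lookup (step 3) so the sketch stays self-contained.

```python
import warnings
from typing import Callable, Mapping, Optional

Probe = Callable[[], Optional[str]]


def resolve_host_rocm_path(
    context: Mapping[str, str],
    env: Mapping[str, str],
    auto_detect: Probe,
    tool_on_path: Probe,
) -> str:
    """Walk the host ROCm resolution chain and return the first hit."""
    explicit = context.get("MAD_ROCM_PATH")  # step 1: additional-context override
    if explicit:
        return explicit
    if env.get("MAD_AUTO_ROCM_PATH", "1") != "0":  # steps 2-3 unless scanning is off
        for probe in (auto_detect, tool_on_path):
            found = probe()
            if found:
                return found
    if env.get("ROCM_PATH"):  # step 4: environment variable
        return env["ROCM_PATH"]
    warnings.warn("ROCm root not found; falling back to /opt/rocm")  # step 5
    return "/opt/rocm"


scan = lambda: "/opt/rocm-6.4.1"  # pretend the directory scan found this
nothing = lambda: None
from_scan = resolve_host_rocm_path({}, {"ROCM_PATH": "/env/rocm"}, scan, nothing)
scan_disabled = resolve_host_rocm_path(
    {}, {"ROCM_PATH": "/env/rocm", "MAD_AUTO_ROCM_PATH": "0"}, scan, nothing
)
```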

In-container path (AMD Docker runs)

  1. docker_env_vars.MAD_ROCM_PATH in additional_context
  2. ROCM_PATH / ROCM_HOME from image OCI config (docker image inspect)
  3. In-image shell probe (docker run --rm image env)
  4. /opt/rocm fallback with warning

The run-phase env table prints host vs container ROCm root, installation type (apt / therock / unknown), and version side-by-side.

renderD mapping: ROCm < 6.4.1 uses legacy unique_id method; 6.4.1+ uses amd-smi node_id. The gpu_renderDs context key maps GPU index → /dev/dri/renderD number. Guards against None entries on restricted ROCm installs.
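
The version switch can be expressed as a tuple comparison. render_mapping_method is a hypothetical name, and the returned strings are labels for the two methods described above, not actual madengine identifiers.

```python
import re
from typing import Tuple


def parse_rocm_version(version: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from strings such as '6.4.1' or '7.0'."""
    parts = [int(p) for p in re.findall(r"\d+", version)[:3]]
    while len(parts) < 3:
        parts.append(0)  # treat missing components as zero
    return parts[0], parts[1], parts[2]


def render_mapping_method(version: str) -> str:
    """Pick the renderD mapping method for a ROCm version."""
    return "node_id" if parse_rocm_version(version) >= (6, 4, 1) else "unique_id"


print(render_mapping_method("6.4.0"), render_mapping_method("6.4.1"))
```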

Environment variables

Read by madengine at runtime

| Variable | Module | Purpose |
| --- | --- | --- |
| MAD_ROCM_PATH | context.py | Override ROCm root on host. Priority 1. |
| ROCM_PATH | core/constants.py | Fallback ROCm root, used after MAD_ROCM_PATH and auto-detection. |
| MAD_AUTO_ROCM_PATH | rocm_path_resolver | Set 0 to disable auto-scan. |
| MODEL_DIR | core/constants.py | Working directory for model scripts. Default: . (current directory) |
| MAD_VERBOSE_CONFIG | core/constants.py | Enable verbose config output. |
| MAD_SETUP_MODEL_DIR | core/constants.py | Trigger model directory setup. |
| MAD_SECRETS* | context.py | Any env var with this prefix is automatically copied to docker_build_arg AND docker_env_vars. |
| MAD_DOCKERHUB_USER | build_orchestrator | Docker Hub username for registry auth. |
| MAD_DOCKERHUB_PASSWORD | build_orchestrator | Docker Hub password for registry auth. |
| SLURM_JOB_ID | slurm.py | Detect existing SLURM allocation (triggers bash-in-salloc for slurm_multi). |
| SLURM_NNODES, SLURM_NPROCS | container_runner | Read in SLURM job to resolve GPU count per node. |
| NPROC_PER_NODE, GPUS_PER_NODE | container_runner | Injected by SLURM template; read by ContainerRunner to set up docker run GPU args. |
| MONGO_HOST, MONGO_PORT | database/mongodb.py | MongoDB connection. |
| MONGO_USER, MONGO_PASSWORD | database/mongodb.py | MongoDB credentials. |
| MONGO_AUTH_SOURCE, MONGO_TIMEOUT_MS | database/mongodb.py | MongoDB auth source and timeout. |
| NAS_NODES | core/constants.py | NAS node config (JSON string). |
| MAD_AWS_S3 | core/constants.py | AWS S3 credentials (JSON: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, …). |
| MAD_MINIO | core/constants.py | MinIO credentials (JSON: MINIO_ENDPOINT, AWS_ENDPOINT_URL_S3, …). |
| PUBLIC_GITHUB_ROCM_KEY | core/constants.py | GitHub ROCm key (JSON). |
| ROCM_TRACE_LITE_FOLLOW_LATEST | tools | Set 1 to always pull latest RTL wheel. |
| ROCM_TRACE_LITE_WHEEL_URL | tools | Override RTL wheel URL (air-gapped installs). |
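
The MAD_SECRETS* propagation listed above amounts to a prefix filter applied to the process environment. The sketch below is illustrative only; propagate_mad_secrets is a hypothetical name, and the example variable is made up.

```python
from typing import Dict, Mapping


def propagate_mad_secrets(env: Mapping[str, str], context: Dict[str, dict]) -> Dict[str, dict]:
    """Copy every MAD_SECRETS* variable into both build args and run env vars."""
    secrets = {k: v for k, v in env.items() if k.startswith("MAD_SECRETS")}
    context.setdefault("docker_build_arg", {}).update(secrets)
    context.setdefault("docker_env_vars", {}).update(secrets)
    return context


# MAD_SECRETS_HFTOKEN is a made-up variable name used only for this example.
ctx = propagate_mad_secrets(
    {"MAD_SECRETS_HFTOKEN": "hf_xxx", "HOME": "/root"},
    {"docker_env_vars": {"OMP_NUM_THREADS": "8"}},
)
```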

Set by madengine in Docker containers

| Variable | Set by | Value / source |
| --- | --- | --- |
| MAD_GPU_VENDOR | context.py | "AMD" or "NVIDIA" |
| MAD_SYSTEM_NGPUS | context.py | Total GPU count on host |
| MAD_SYSTEM_GPU_ARCHITECTURE | context.py | GPU arch string (e.g. "gfx90a") |
| MAD_SYSTEM_HIP_VERSION | context.py | HIP version string |
| MAD_SYSTEM_GPU_PRODUCT_NAME | context.py | GPU product name |
| MAD_GUEST_OS | container_runner | "UBUNTU" or "CENTOS" |
| MAD_RUNTIME_NGPUS | container_runner | GPU count allocated for this specific run |
| MAD_MULTI_NODE_RUNNER | container_runner | Distributed launcher command (e.g. torchrun --standalone --nproc_per_node=8) |
| MAD_MODEL_NAME | container_runner | Model name from model definition |
| MAD_OUTPUT_CSV | container_runner | Path for multiple_results CSV output |
| ROCM_PATH | container_runner | Resolved in-container ROCm root |
| JENKINS_BUILD_NUMBER | container_runner | CI build number (from shell env if set) |
| RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES | container_runner | Force-set to "" when HIP_VISIBLE_DEVICES is active (AMD+Ray fix) |

Set by SLURM job script (job.sh.j2)

| Variable | Value |
| --- | --- |
| MAD_DEPLOYMENT_TYPE | "slurm" |
| MAD_SLURM_JOB_ID | SLURM job ID |
| MAD_NODE_RANK | This node's rank (0-indexed) |
| MAD_TOTAL_NODES | Total node count |
| MAD_IN_SLURM_JOB | "1" |
| MAD_LAUNCHER_TYPE | Launcher type string |
| MASTER_ADDR | Head node hostname (via scontrol) |
| MASTER_PORT | Communication port (default 29500) |
| WORLD_SIZE | Total GPU processes (nodes × GPUs/node) |
| NNODES | Node count |
| GPUS_PER_NODE | GPU count per node |
| NODE_RANK | This node's rank |
| TORCH_ELASTIC_RDZV_TIMEOUT | 3600 |
| MIOPEN_USER_DB_PATH | /tmp/.miopen/node_${SLURM_PROCID}_rank_${LOCAL_RANK:-0} |
| HIP_VISIBLE_DEVICES | GPU indices for this node's processes |
| ROCR_VISIBLE_DEVICES | GPU indices (not set for Ray-based launchers) |
| CUDA_VISIBLE_DEVICES | GPU indices (not set for Ray-based launchers) |
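
As a worked example of how the rendezvous values relate, the sketch below derives them in Python for one node. The real values are produced by the job.sh.j2 template in shell; slurm_distributed_env is a hypothetical helper that mirrors only the arithmetic and the MIOpen path pattern listed above.

```python
from typing import Dict


def slurm_distributed_env(
    nodes: int,
    gpus_per_node: int,
    node_rank: int,
    master_addr: str,
    master_port: int = 29500,
    local_rank: int = 0,
) -> Dict[str, str]:
    """Derive the distributed env vars for one node of a SLURM job."""
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        "NNODES": str(nodes),
        "GPUS_PER_NODE": str(gpus_per_node),
        "NODE_RANK": str(node_rank),
        "WORLD_SIZE": str(nodes * gpus_per_node),  # nodes x GPUs per node
        # Assumes SLURM_PROCID equals the node rank (one task per node).
        "MIOPEN_USER_DB_PATH": f"/tmp/.miopen/node_{node_rank}_rank_{local_rank}",
    }


node_env = slurm_distributed_env(nodes=4, gpus_per_node=8, node_rank=1, master_addr="node01")
```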

Error types

Defined in src/madengine/core/errors.py. All inherit from MADEngineError(Exception) which carries: message, category, context (ErrorContext dataclass), cause, recoverable, suggestions (list). Rich panels are used for display.

| Class | Category | When raised |
| --- | --- | --- |
| ValidationError | VALIDATION | Invalid CLI args, model field values, context key types. |
| NetworkError | CONNECTION | Registry connectivity, pull failures, MongoDB connection. |
| AuthenticationError | AUTHENTICATION | Registry login failure, invalid credentials format. |
| ExecutionError | RUNTIME | Container run failure, script non-zero exit, timeout. (RuntimeError is an alias.) |
| BuildError | BUILD | Docker build failure. |
| DiscoveryError | DISCOVERY | models.json parse failure, tag not found, no models matched. |
| OrchestrationError | ORCHESTRATION | Manifest load failure, incompatible build/run state. |
| RunnerError | RUNNER | ContainerRunner internal failure. |
| ConfigurationError | CONFIGURATION | slurm_multi registry gate violation, conflicting flags, missing required config. |
| DeploymentTimeoutError | TIMEOUT | SLURM/K8s job exceeded wall time. |
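
The shape of such a hierarchy can be sketched as below. This is not the code in core/errors.py: the ErrorContext fields are hypothetical, only two categories are shown, and the constructor signature is an assumption based on the attribute list above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ErrorCategory(Enum):
    CONFIGURATION = "configuration"
    BUILD = "build"


@dataclass
class ErrorContext:
    operation: str                   # hypothetical field
    component: Optional[str] = None  # hypothetical field


class MADEngineError(Exception):
    """Base error carrying a category, context, cause, and recovery hints."""

    def __init__(
        self,
        message: str,
        category: ErrorCategory,
        context: Optional[ErrorContext] = None,
        cause: Optional[BaseException] = None,
        recoverable: bool = False,
        suggestions: Optional[List[str]] = None,
    ) -> None:
        super().__init__(message)
        self.message = message
        self.category = category
        self.context = context
        self.cause = cause
        self.recoverable = recoverable
        self.suggestions = suggestions or []


class ConfigurationError(MADEngineError):
    def __init__(self, message: str, **kwargs) -> None:
        super().__init__(message, ErrorCategory.CONFIGURATION, **kwargs)


err = ConfigurationError(
    "slurm_multi needs an image source",
    suggestions=["pass --use-image", "pass --build-on-compute"],
)
```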

Module reference

| Layer | Path | Contents |
| --- | --- | --- |
| CLI | cli/app.py | Typer app, cli_main entry, --version, Rich traceback install. |
| CLI | cli/commands/build.py | madengine build: registry, batch, --use-image, --build-on-compute, mutex validation. |
| CLI | cli/commands/run.py | madengine run: manifest loading, all run flags, --force-mirror-local, --cleanup-perf. |
| CLI | cli/commands/discover.py | Model discovery command, scoped tag parsing. |
| CLI | cli/commands/report.py | report to-html / to-email sub-app. |
| CLI | cli/commands/database.py | MongoDB upload command. |
| CLI | cli/constants.py | ExitCode enum, DEFAULT_MANIFEST_FILE, DEFAULT_PERF_OUTPUT, DEFAULT_TIMEOUT=-1. |
| CLI | cli/validators.py | Argument validation: validate_additional_context(), create_args_namespace(). |
| Orch | orchestration/build_orchestrator.py | BuildOrchestrator.execute(): discover → context → build → registry gate → manifest. slurm_multi use-image / build-on-compute paths. |
| Orch | orchestration/run_orchestrator.py | RunOrchestrator.execute(): manifest loading, target inference, script copy/cleanup, local/distributed dispatch. |
| Orch | orchestration/image_filtering.py | Filters manifest entries by GPU vendor, GPU arch, skip_gpu_arch field. |
| Dep | deployment/factory.py | DeploymentFactory.create(). Registers SlurmDeployment + KubernetesDeployment. UserWarning if kubernetes package missing. |
| Dep | deployment/base.py | BaseDeployment (Template Method), DeploymentConfig, DeploymentResult (incl. skip_monitoring), DeploymentStatus, PERFORMANCE_LOG_PATTERN. |
| Dep | deployment/kubernetes.py | KubernetesDeployment: composes 6 mixins, orchestrates K8s job lifecycle. |
| Dep | deployment/k8s_pvc.py | PVC creation/deletion, storage-class fallback chain. |
| Dep | deployment/k8s_results.py | Log/artifact collection, perf aggregation, collector_pod_name(). |
| Dep | deployment/k8s_scripts.py | Script extraction, ConfigMap building (rocenv_mode, guest_os). |
| Dep | deployment/k8s_template_context.py | Assembles Jinja2 template context for K8s jobs. |
| Dep | deployment/k8s_secrets.py | secrets dict → K8s Secret objects. |
| Dep | deployment/k8s_names.py | Name truncation/sanitization helpers for K8s resource names. |
| Dep | deployment/kubernetes_launcher_mixin.py | Selects Jinja2 template per launcher; sets MAD_MULTI_NODE_RUNNER for K8s pods. |
| Dep | deployment/slurm.py | SlurmDeployment: template prep, sbatch submit, bash-in-salloc, slurm_multi dispatch, monitoring, results collection. |
| Dep | deployment/slurm_node_selector.py | SlurmNodeSelector: health/cleanup srun, reservation parameter, node preflight. |
| Dep | deployment/common.py | Shared helpers: VALID_LAUNCHERS, slurm_multi wrapper assembly, launcher normalization. |
| Dep | deployment/config_loader.py | ConfigLoader: deep-merge, preset loading, target inference. env_vars merged recursively (not replaced). |
| Dep | deployment/primus_backend.py | Primus YAML / backend selection helper. |
| Dep | deployment/presets/slurm/defaults.json | SLURM base preset. |
| Dep | deployment/presets/slurm/profiles/ | single-node.json, multi-node.json. |
| Dep | deployment/presets/k8s/defaults.json | K8s base preset. |
| Dep | deployment/presets/k8s/gpu-vendors/ | amd.json, nvidia.json, amd-multi-gpu.json. |
| Dep | deployment/presets/k8s/profiles/ | single-gpu.json, multi-gpu.json, multi-node.json. |
| Dep | deployment/templates/slurm/job.sh.j2 | Main sbatch template (~822 lines). Sets all SLURM env vars, runs srun task scripts. |
| Dep | deployment/templates/kubernetes/ | K8s YAML templates: configmap.yaml.j2, job.yaml.j2, pvc.yaml.j2, pvc-data.yaml.j2, service.yaml.j2. |
| Exec | execution/container_runner.py | ContainerRunner: local docker run, AMD/NVIDIA run options, env injection, tools, perf parsing, _run_self_managed(), _generate_local_launcher_command(). |
| Exec | execution/container_runner_helpers.py | Log error pattern scan, resolve_run_timeout(), make_run_log_file_path(). |
| Exec | execution/docker_builder.py | DockerBuilder: build args, --build-context tools= (conditional), registry push, DOCKER_IMAGE_NAME injection into manifest. |
| Exec | execution/dockerfile_utils.py | Dockerfile parsing: GPU vendor from filename + FROM line. |
| Core | core/context.py | Context: ast.literal_eval parse, GPU vendor/arch detection, ROCm path resolution, MAD_SECRETS* propagation, renderD mapping. |
| Core | core/additional_context_defaults.py | Default values merged before user context: DEFAULT_GPU_VENDOR="AMD", DEFAULT_GUEST_OS="UBUNTU". |
| Core | core/console.py | Console: Rich-backed shell executor, live output, timeout, secret=True for credential commands. |
| Core | core/docker.py | Docker wrapper: shlex.quote() on every interpolation, auto stop/remove on __del__. |
| Core | core/errors.py | 10-type error hierarchy, ErrorCategory, ErrorContext, ErrorHandler, Rich panel display. |
| Core | core/auth.py | load_credentials(), login_to_registry() using --password-stdin + MAD_REGISTRY_PASSWORD. |
| Core | core/timeout.py | Timeout context manager; guards signal.alarm(None) when seconds is 0/None. |
| Core | core/dataprovider.py | Data abstraction: local / NAS / S3 / MinIO. |
| Util | utils/discover_models.py | DiscoverModels: root, dir, dynamic discovery; scoped vs unscoped tags; CustomModel dataclass. |
| Util | utils/gpu_tool_factory.py | Singleton get_gpu_tool_manager(vendor, rocm_path); auto-detects vendor. |
| Util | utils/gpu_validator.py | GPUVendor enum, ROCmValidator, NVIDIAValidator, GPUValidationResult. |
| Util | utils/rocm_path_resolver.py | Host + in-container ROCm path resolution chains. |
| Util | utils/therock_markers.py | Shared TheRock detection markers (rocm-sdk, layout probes). |
| Util | utils/config_parser.py | ConfigParser: 5-level config file resolution, CSV/JSON/YAML loading, multi-row result matching. |
| Util | utils/session_tracker.py | Session start/marker tracking. |
| Rep | reporting/update_perf_csv.py | Writes/appends perf.csv and perf_entry.csv. PERF_CSV_HEADER (28 columns). |
| Rep | reporting/csv_to_html.py | HTML performance report generation. |
| Rep | reporting/csv_to_email.py | Email-friendly consolidated report. |
| Rep | reporting/update_perf_super.py | Superset-shaped perf rollups. |
| DB | database/mongodb.py | MongoDBConfig.from_env(), UploadOptions, UploadResult; upsert + batch upload. |
| Scripts | scripts/common/pre_scripts/rocEnvTool/ | rocenv_tool.py, csv_parser.py, console.py: TheRock-compatible env capture (lite + full modes). |
| Scripts | scripts/common/tools/ | GPU info profilers, amd_smi / rocm_smi utils, rtl_trace wrapper, library tracers (rocblas, miopen, rccl, tensile). |
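
The recursive merge noted for ConfigLoader (env_vars merged, not replaced) follows the usual deep-merge pattern, sketched below. deep_merge is a generic illustration and not the ConfigLoader implementation.

```python
import copy
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where nested dicts are merged key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)  # scalars and lists are replaced
    return merged


preset = {"env_vars": {"NCCL_DEBUG": "WARN", "HSA_ENABLE_SDMA": "0"}}
user = {"env_vars": {"NCCL_DEBUG": "INFO"}, "slurm": {"nodes": 4}}
effective = deep_merge(preset, user)
```

With a plain dict update, the user's env_vars would have wiped out HSA_ENABLE_SDMA from the preset; the recursive merge keeps it.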

Test layout

unit/

Fast, isolated, mocked. Key files: test_slurm_multi.py, test_shell_quoting.py, test_error_handling.py, test_k8s.py, test_rocm_path.py, test_validators.py, test_deployment.py, test_container_runner.py.

integration/

Real Docker / GPU / platform calls. Includes test_docker_integration.py, test_container_execution.py, test_gpu_management.py, test_orchestrator_workflows.py, test_profiling_tools_config.py.

e2e/

Full workflows: test_build_workflows.py, test_run_workflows.py, test_profiling_workflows.py, test_data_workflows.py, test_execution_features.py, test_scripting_workflows.py.

| Marker | What it selects |
| --- | --- |
| unit | Fast unit tests with no external deps |
| integration | Tests requiring Docker / real GPU calls |
| e2e | Full end-to-end workflow tests |
| slow | Long-running tests |
| gpu | Requires GPU hardware |
| amd / nvidia | Vendor-specific tests |
| cpu | CPU-only tests |
| requires_docker | Tests requiring Docker daemon |
| requires_models | Tests requiring model files to be present |

Pytest config lives solely in [tool.pytest.ini_options] in pyproject.toml (minversion=7.0).

Contributing & code style

Style rules

  • Formatting: Black (line-length 88), targets py3.8–py3.11
  • Imports: isort with profile="black"; first-party = madengine
  • Lint: flake8 + mypy (strict equality, warn unused) + bandit (skips B101)
  • Docstrings: Google style; type hints required for public functions
  • Conventional commits: feat:, fix:, docs:, test:, refactor:, style:, perf:, chore:

Security rules

  • Use shlex.quote() on every shell interpolation of user-controlled values (image names, paths, container names, build-args)
  • Registry passwords via --password-stdin (not command-line args); env var MAD_REGISTRY_PASSWORD
  • Credential JSON must be a dict object — validated at load time (ConfigurationError on wrong type)
  • MIOPEN_USER_DB_PATH is filtered from deployment_config to prevent leaking temp paths
  • Never log secret values — log keys only
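
Two of these rules are easy to show in a few lines. The helpers below are hypothetical and only illustrate the pattern: shlex.quote() for any value interpolated into a shell string, and docker login --password-stdin so the password never appears in the argument list.

```python
import shlex
import subprocess
from typing import Callable


def docker_pull_command(image: str) -> str:
    """Build a shell command string with the image name safely quoted."""
    return f"docker pull {shlex.quote(image)}"


def registry_login(
    registry: str,
    username: str,
    password: str,
    run: Callable = subprocess.run,
):
    """Log in with the password supplied on stdin instead of argv."""
    cmd = ["docker", "login", "--username", username, "--password-stdin", registry]
    return run(cmd, input=password.encode("utf-8"), check=True)


# A malicious image name stays a single quoted argument.
quoted = docker_pull_command("img; rm -rf /")

# Dry run with a fake runner so no real docker call is made.
calls = []
registry_login(
    "registry.example.com", "bot", "s3cret",
    run=lambda cmd, **kw: calls.append((cmd, kw)),
)
```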

Changelog

[2.1.0] — 2026-05-28

Added

  • slurm_multi self-managed SLURM launcher (PRs #130, #126): alias slurm-multi, parallel docker pull, bash-in-salloc path, _run_self_managed() for local mode
  • madengine build --use-image [IMAGE|auto] — skip local build
  • madengine build --build-on-compute — build on compute node + push
  • slurm_multi registry gate with structured ConfigurationError
  • DeploymentResult.skip_monitoring for synchronous deploy paths
  • SlurmNodeSelector.reservation parameter
  • DockerBuilder: --build-context tools= (conditional on dir existence, PR #131 + #134)
  • Local MAD_MULTI_NODE_RUNNER via ContainerRunner._generate_local_launcher_command() (PR #126)
  • Model card distributed/slurm auto-merged into manifest deployment_config
  • DOCKER_IMAGE_NAME injection into manifest env_vars after successful registry push

Changed

  • SLURM env-var escaping: double-quote instead of shlex.quote to preserve spaces/paths (PR #134)
  • Early DiscoverModels result cached and reused for actual build (no duplicate get_models_json.py runs)
  • E2E test cleanup defaults include build_manifest.json + perf artefacts

[2.0.3] — 2026-05-26

  • rocEnvTool "full" mode (lshw, dmidecode, dmesg, modinfo)
  • K8s monolith decomposed into 6 focused mixin modules
  • Generic storage_class fallback; default preset nfs-banff
  • rocm_trace_lite_default tool (RTL default mode)
  • Security: shlex.quote() on every shell interpolation
  • Collector pod name mismatch fix (shared collector_pod_name() helper)
  • CANCELLED added to terminal-state set
  • Local MAD_MULTI_NODE_RUNNER for Docker local (_generate_local_launcher_command())

[2.0.2] / [2.0.1]

  • Host ROCm auto-detection via priority chain; in-container ROCm resolved independently
  • TheRock (rocm-sdk) layout support
  • GPU arch auto-detection injected into Docker build args
  • Model discovery: scope-based tag selection replaces strict flag
  • Registry password via --password-stdin + env var
  • credential.json type validation
  • Unified PERFORMANCE_LOG_PATTERN across local + deployment paths
  • Run-phase host/container env table printed at startup

[2.0.0] — 2026-04-09 — Complete rewrite

  • Unified madengine CLI; legacy mad-* removed
  • 5-layer architecture (CLI / Orchestration / Deployment / Execution / Core)
  • Factory + Template Method patterns; DeploymentFactory, BaseDeployment, ConfigLoader
  • Multi-target deployment: presets + Jinja2 templates per launcher
  • Launcher matrix: torchrun / DeepSpeed / Megatron / TorchTitan / Primus / vLLM / SGLang
  • Log error pattern scanning; --skip-model-run; batch build manifest
  • Structured errors (10 types) with Rich panels; fixed exit codes
  • SLURM nodelist pinning; K8s Secrets management; data provider abstraction