madengine — Codebase Wiki
AI/ML model automation & benchmarking platform for local Docker, Kubernetes, and SLURM. A Typer-based CLI that discovers models, builds Docker images, runs them across compute targets, and writes structured performance results.
Entry point: src/madengine/cli/app.py::cli_main
→ console script madengine registered in pyproject.toml.
Overview
What madengine does
- Discover: finds model definitions from models.json or dynamic scripts and resolves tags
- Build: calls docker build for each model and writes build_manifest.json
- Run: reads the manifest, infers the compute target, dispatches containers, and writes perf.csv
- Report: converts perf.csv to HTML or email; uploads to MongoDB
All four stages share a single --additional-context configuration spine that controls
GPU vendor, deployment type, launcher, profiling tools, and environment variables.
What's new in v2.1.0
- slurm_multi: self-managed multi-node SLURM launcher for workloads with per-node Docker (e.g. SGLang Disagg)
- --use-image [auto] / --build-on-compute: new madengine build modes
- Docker --build-context tools=: shared tool APIs accessible in every Dockerfile
- Local MAD_MULTI_NODE_RUNNER: Megatron / DeepSpeed / TorchTitan now work on local Docker
- SLURM env-var escaping: double-quote escaping preserves spaces & paths
Quick start
# 1. Install
pip install -e ".[dev]"
# 2. Discover available models
madengine discover --tags dummy
# 3. Build + run (single command)
madengine run --tags dummy \
--additional-context '{"gpu_vendor":"AMD","guest_os":"UBUNTU"}'
# 4. Build only, then run from manifest
madengine build --tags llama3 --registry registry.example.com/ml
madengine run --manifest-file build_manifest.json \
--additional-context '{"docker_gpus":"0,1,2,3"}'
Local mode: no k8s or slurm key in context → ContainerRunner (local Docker).
# Single-node K8s (minimal — defaults applied from presets/k8s/)
madengine run --tags llama3 \
--additional-context '{"k8s":{"gpu_count":4}}'
# Multi-node vLLM on K8s
madengine run --tags vllm-serve \
--additional-context '{
"k8s": {"namespace":"ml-team","gpu_count":8},
"distributed": {"launcher":"vllm","nnodes":2,"nproc_per_node":4}
}'
# K8s with NFS data PVC and secrets
madengine run --tags model \
--additional-context '{
"k8s": {"namespace":"ml","gpu_count":8,"data_storage_class":"nfs-banff"},
"secrets": {"HF_TOKEN":"hf_xxx","WANDB_API_KEY":"yyy"}
}'
Presence of "k8s" or "kubernetes" key → KubernetesDeployment. Requires pip install -e ".[all]".
# Single-node SLURM (build on login node, deploy via sbatch)
madengine build --tags llama3 --registry registry.example.com/ml
madengine run --manifest-file build_manifest.json \
--additional-context '{
"slurm": {"partition":"gpu","nodes":1,"gpus_per_node":8,"time":"12:00:00"}
}'
# Multi-node torchrun
madengine run --manifest-file build_manifest.json \
--additional-context '{
"slurm": {"partition":"gpu","nodes":4,"gpus_per_node":8,"time":"24:00:00"},
"distributed": {"launcher":"torchrun","nnodes":4,"nproc_per_node":8}
}'
# DeepSpeed with reservation
madengine run --manifest-file build_manifest.json \
--additional-context '{
"slurm": {"partition":"gpu","nodes":8,"gpus_per_node":8,
"time":"48:00:00","reservation":"ml-training"},
"distributed": {"launcher":"deepspeed","nnodes":8,"nproc_per_node":8}
}'
Presence of "slurm" key → SlurmDeployment. Generates sbatch wrapper from Jinja2 template.
# SGLang Disaggregated (3+ nodes: proxy + prefill + decode)
madengine run --tags pyt_sglang_disagg_qwen3-32b \
--additional-context '{
"slurm": {"partition":"gpu","nodes":3,"gpus_per_node":8,"time":"02:00:00"},
"distributed": {"launcher":"slurm_multi"}
}'
# Build options for slurm_multi models:
# Option A — use pre-built registry image (skip local build)
madengine build --tags pyt_sglang_disagg --use-image registry.io/sglang:latest
# Option B — auto-resolve DOCKER_IMAGE_NAME from model card
madengine build --tags pyt_sglang_disagg --use-image auto
# Option C — build on compute node, push, then run pulls in parallel
madengine build --tags pyt_sglang_disagg \
--registry registry.io/ml --build-on-compute
slurm_multi bypasses the standard sbatch template: the model's own .slurm script runs directly on the head node so srun/scontrol work inside it.
# Store configuration in a JSON file and reference it
cat > my_run.json <<'EOF'
{
"gpu_vendor": "AMD",
"guest_os": "UBUNTU",
"slurm": {
"partition": "gpu",
"nodes": 4,
"gpus_per_node": 8,
"time": "24:00:00",
"exclusive": true
},
"distributed": {
"launcher": "torchrun",
"nnodes": 4,
"nproc_per_node": 8,
"backend": "nccl"
},
"env_vars": {
"NCCL_DEBUG": "WARN",
"HSA_ENABLE_SDMA": "0"
},
"tools": [{"name": "rocprofv3_compute"}]
}
EOF
madengine run --tags llama3 --additional-context-file my_run.json
--additional-context-file and --additional-context are mutually exclusive. The file is parsed as JSON (not ast.literal_eval).
Install & dev
Setup
# Base install (includes dev tools)
pip install -e ".[dev]"
# With Kubernetes support
pip install -e ".[all]"
# Enable pre-commit hooks
pre-commit install
Optional extras
| Extra | Adds |
|---|---|
| [dev] | pytest, black, flake8, mypy, isort, pre-commit |
| [kubernetes] | kubernetes>=28.0.0, pyyaml |
| [all] | dev + kubernetes |
Test & quality
pytest # all tests
pytest tests/unit/ -v # unit only
pytest tests/unit/test_slurm_multi.py -v
pytest --cov=src/madengine --cov-report=html
pytest -m "not slow" # skip slow tests
pytest -m "unit and amd" # combined markers
black src/ tests/
isort src/ tests/
flake8 src/ tests/
mypy src/madengine
pre-commit run --all-files
5-layer architecture
Each layer talks only to the layers below it. Layers are color-coded throughout this wiki.
| Layer | Path | Responsibilities | Key types |
|---|---|---|---|
| CLI | src/madengine/cli/ | Typer app, 5 commands, argument validation, Rich output, exit-code mapping. | app.py, commands/{build,run,discover,report,database}.py, constants.ExitCode |
| Orchestration | src/madengine/orchestration/ | Discover → build → run pipeline. Decides whether to dispatch locally or to a deployment backend. | BuildOrchestrator, RunOrchestrator, image_filtering.py |
| Deployment | src/madengine/deployment/ | Factory + Template Method pattern. K8s/SLURM concrete deployments, preset merging, Jinja2 templates, monitoring. | DeploymentFactory, BaseDeployment, KubernetesDeployment, SlurmDeployment, ConfigLoader |
| Execution | src/madengine/execution/ | Local Docker build/run, log scanning, timeout resolution, perf parsing, self-managed launcher bypass. | ContainerRunner, DockerBuilder, container_runner_helpers |
| Core | src/madengine/core/ | Cross-cutting primitives: context merging & GPU detection, shell execution, Docker wrapper, error hierarchy, auth, timeout. | Context, Console, Docker, MADEngineError, load_credentials |
| Utils | src/madengine/utils/ | Model discovery, GPU vendor abstraction, ROCm path resolution, config parsing. | DiscoverModels, gpu_tool_factory, rocm_path_resolver, ConfigParser |
| Reporting | src/madengine/reporting/ | perf.csv writers, HTML/email report generation. Database upload in src/madengine/database/. | update_perf_csv, csv_to_html, csv_to_email, mongodb.py |
Architecture diagram
Key data flows
Build flow
- madengine build → BuildOrchestrator.execute()
- Context(build_only_mode=True): GPU vendor / arch detection is skipped unless detect_local_gpu_arch=True
- ConfigLoader.load_config() merges preset defaults (SLURM or K8s) with the user config; user values win
- DiscoverModels resolves --tags from root models.json, scripts/{dir}/models.json, or scripts/{dir}/get_models_json.py
- slurm_multi gate: if the model uses slurm_multi and no --registry / --use-image is given → auto-resolves DOCKER_IMAGE_NAME from the model card or raises ConfigurationError
- DockerBuilder.build_all_models(): passes --build-context tools=scripts/common/tools if that dir exists
- After registry push: sets DOCKER_IMAGE_NAME in manifest env_vars for parallel SLURM pull
- Writes build_manifest.json
Run flow
- madengine run → RunOrchestrator.execute()
- If manifest exists: skip build; else trigger _build_phase()
- Context(build_only_mode=False): full GPU detection, ROCm path resolution
- _load_and_merge_manifest(): runtime context overrides manifest deployment_config
- Target inference: "k8s" / "kubernetes" → K8s · "slurm" → SLURM · neither → local
- _copy_scripts(): populates scripts/common/{pre_scripts,post_scripts,tools} from the madengine package
- Dispatch: ContainerRunner (local) or DeploymentFactory.create() (SLURM/K8s)
- Results → perf.csv / perf_entry.csv
- _cleanup_model_dir_copies(): removes populated scripts/common/ files
SLURM job flow (inside sbatch)
- sbatch script sets MASTER_ADDR (via scontrol), WORLD_SIZE, NNODES, node-local GPU visibility
- Multi-node: generates a task script per node; runs via srun bash $TASK_SCRIPT, and each node calls madengine run with a local manifest
- Single-node: creates a local manifest with deployment_config.target="docker", then calls madengine run
- Each node's madengine run → ContainerRunner → docker run with SLURM env vars injected
- Results are collected from per-node perf.csv and aggregated
CLI — discover
Lists and validates model definitions without building or running.
madengine discover [OPTIONS]
--tags TEXT Comma-separated tags/names to filter [required]
--verbose / --no-verbose Show full model JSON [default: no-verbose]
Tag syntax
| Pattern | Example | Meaning |
|---|---|---|
| Simple tag | --tags llama3 | Any model with tag llama3 |
| Multiple tags | --tags llama3,vllm | Any model matching any listed tag |
| All models | --tags all | Every discovered model |
| Scoped (exact dir) | --tags MAD/llama3 | Only from scripts/MAD/ subdirectory |
| Dynamic + args | --tags dummy3:dummy_3:batch=512 | Dynamic model with arg override |
Discovery sources (checked in order per directory)
- Root models.json
- scripts/{dir}/models.json (static list)
- scripts/{dir}/get_models_json.py: dynamic; must export list_models() → List[CustomModel]
CLI — build
Builds Docker images for discovered models and writes build_manifest.json.
madengine build [OPTIONS]
--tags TEXT Tags to select models (mutually exclusive with --batch-manifest)
--batch-manifest FILE JSON file of multiple tag groups to build in sequence
--registry TEXT Push built images to this registry URL
--target-archs TEXT Comma-separated GPU arch list (e.g. "gfx90a,gfx942")
--use-image [IMAGE|auto] Skip local build; use named image or auto-resolve from model card
--build-on-compute Build on SLURM compute node + push (requires --registry)
--additional-context TEXT Python dict / JSON string of context overrides
--additional-context-file FILE Path to a JSON context file (mutually exclusive with --additional-context)
--clean-docker-cache Pass --no-cache to docker build
--manifest-output FILE Output path for build_manifest.json [default: build_manifest.json]
--summary-output FILE Output path for build summary JSON
--live-output / --no-live-output Stream docker build output line by line [default: no-live-output]
--verbose / --no-verbose
Flag constraints:
- --batch-manifest vs --tags
- --use-image vs --registry
- --use-image vs --build-on-compute
- --build-on-compute requires --registry
- --additional-context-file vs --additional-context
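The flag constraints above can be expressed as a small validation routine. This is a sketch only: check_build_flags is a hypothetical helper, not madengine's actual validation code, and the error messages are illustrative.

```python
from typing import List, Optional


def check_build_flags(
    tags: Optional[str] = None,
    batch_manifest: Optional[str] = None,
    registry: Optional[str] = None,
    use_image: Optional[str] = None,
    build_on_compute: bool = False,
    additional_context: Optional[str] = None,
    additional_context_file: Optional[str] = None,
) -> List[str]:
    """Return constraint violations; an empty list means the flags are valid."""
    errors = []
    if tags and batch_manifest:
        errors.append("--batch-manifest and --tags are mutually exclusive")
    if use_image and registry:
        errors.append("--use-image and --registry are mutually exclusive")
    if use_image and build_on_compute:
        errors.append("--use-image and --build-on-compute are mutually exclusive")
    if build_on_compute and not registry:
        errors.append("--build-on-compute requires --registry")
    if additional_context and additional_context_file:
        errors.append(
            "--additional-context-file and --additional-context are mutually exclusive"
        )
    return errors
```

A CLI would typically map a non-empty result to the INVALID_ARGS exit code.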
--use-image modes
| Invocation | Behavior |
|---|---|
| --use-image auto | Reads DOCKER_IMAGE_NAME from model card env_vars |
| --use-image registry.io/img:tag | Uses the explicit image name; skips all Docker build steps |
CLI — run
Runs models from a manifest (build if needed) and writes perf.csv.
madengine run [OPTIONS]
--tags TEXT Select models (triggers build if no manifest)
--manifest-file FILE Use existing manifest; skip build [default: build_manifest.json]
--registry TEXT Registry for image pull auth
--timeout INT Seconds per model; -1=7200s default, 0=disabled
--additional-context TEXT Python dict or JSON string
--additional-context-file FILE JSON file (mutually exclusive with --additional-context)
--keep-alive Leave container running after model completes
--keep-model-dir Do not clean up model directory copy
--clean-docker-cache Remove docker image before pull (SLURM mode)
--skip-model-run Build/pull only; skip execution
--manifest-output FILE
--summary-output FILE
--live-output / --no-live-output Stream container output [default: no-live-output]
--output FILE Redirect container stdout to file
--tools-json-file-name FILE Tools config [default: ./scripts/common/tools.json]
--generate-sys-env-details / --no-generate-sys-env-details
--force-mirror-local Force ContainerRunner even in SLURM/K8s context
--disable-skip-gpu-arch Ignore skip_gpu_arch model field
--cleanup-perf Remove existing perf.csv before run
--verbose / --no-verbose
Timeout resolution
| Value | Resolved timeout |
|---|---|
| -1 (default) | 7200 s (2 hours), unless the model card sets timeout |
| 0 | Disabled (no timeout) |
| model card timeout field | Used when CLI is default (-1) |
| Explicit positive int | That many seconds, overrides model card |
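The precedence in the table can be sketched as a single function. resolve_timeout is a hypothetical name, and treating every negative CLI value like -1 is an assumption; the real resolution logic may handle edge cases differently.

```python
from typing import Optional

DEFAULT_TIMEOUT_S = 7200  # fallback when neither CLI nor model card sets a value


def resolve_timeout(cli_timeout: int = -1,
                    model_timeout: Optional[int] = None) -> Optional[int]:
    """Return the effective timeout in seconds, or None when disabled."""
    if cli_timeout == 0:
        return None  # 0 disables the timeout entirely
    if cli_timeout > 0:
        return cli_timeout  # explicit CLI value overrides the model card
    # CLI left at its default (-1): prefer the model card, then the fallback.
    if model_timeout is not None:
        return model_timeout
    return DEFAULT_TIMEOUT_S
```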
CLI — report & database
report
# Convert perf.csv to HTML
madengine report to-html --csv-file perf.csv
# Generate consolidated email report
madengine report to-email \
--directory ./results \
--output run_results.html
Source: cli/commands/report.py → reporting/csv_to_html.py, reporting/csv_to_email.py
database
madengine database \
--csv-file perf.csv \
--database-name benchmarks \
--collection-name runs
Reads from env: MONGO_HOST, MONGO_PORT, MONGO_USER, MONGO_PASSWORD, MONGO_AUTH_SOURCE, MONGO_TIMEOUT_MS.
Source: cli/commands/database.py → database/mongodb.py
Exit codes CI contract
Defined in src/madengine/cli/constants.py::ExitCode. Use these in CI pipelines instead of log scraping.
| Code | Name | Meaning |
|---|---|---|
| 0 | SUCCESS | All operations succeeded. |
| 1 | FAILURE | General / unhandled failure (keyboard interrupt, unexpected exception). |
| 2 | BUILD_FAILURE | One or more Docker image builds failed. |
| 3 | RUN_FAILURE | One or more model runs failed. Results still written to perf.csv with STATUS=FAILURE. |
| 4 | INVALID_ARGS | Argument validation rejected the invocation. |
When piping output, e.g. madengine run … 2>&1 | tee madengine.log, run under bash -o pipefail so tee doesn't swallow the exit code.
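A CI step can branch on the exit code instead of scraping logs. The sketch below mirrors the table in a local enum (the real one is in cli/constants.py); classify and run_and_classify are hypothetical helper names.

```python
import subprocess
from enum import IntEnum
from typing import List


class ExitCode(IntEnum):
    """Local mirror of the exit-code table above."""
    SUCCESS = 0
    FAILURE = 1
    BUILD_FAILURE = 2
    RUN_FAILURE = 3
    INVALID_ARGS = 4


def classify(returncode: int) -> str:
    """Map a process return code to its documented name."""
    try:
        return ExitCode(returncode).name
    except ValueError:
        return f"UNKNOWN({returncode})"


def run_and_classify(argv: List[str]) -> str:
    """Run a command and report which documented outcome it produced."""
    return classify(subprocess.run(argv).returncode)
```

For example, run_and_classify(["madengine", "run", "--tags", "dummy"]) would let a pipeline retry on RUN_FAILURE while failing fast on BUILD_FAILURE.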
additional_context — configuration spine
--additional-context accepts a Python dict string (parsed with ast.literal_eval, not json.loads) or --additional-context-file accepts a JSON file. The dict is deep-merged into Context.ctx alongside system-detected values.
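Quoting: pass '{"key":"val"}' (JSON that uses only strings, numbers, lists, and objects is also a valid Python literal) or "{'key':'val'}". Single-quote the whole argument where possible so the shell passes it through unchanged. Because ast.literal_eval recognizes True/False/None but not JSON's true/false/null, booleans are safest in a file passed via --additional-context-file, which is parsed as JSON.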
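The difference between the two parsers is easy to check with the standard library. This only illustrates ast.literal_eval and json.loads themselves; madengine's own parsing may add fallbacks that are not described here.

```python
import ast
import json

json_style = '{"gpu_vendor": "AMD", "docker_gpus": "0,1"}'
python_style = "{'gpu_vendor': 'AMD', 'debug': True}"
json_bool = '{"debug": true}'

# Strings, numbers, lists and dicts parse identically under both parsers.
same = ast.literal_eval(json_style) == json.loads(json_style)

# Single-quoted keys and Python booleans are accepted by literal_eval.
py_ctx = ast.literal_eval(python_style)

# JSON's lowercase `true` is a bare name to Python, so literal_eval rejects it.
try:
    ast.literal_eval(json_bool)
    json_bool_ok = True
except ValueError:
    json_bool_ok = False
```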
| Key | Type | Subsystem | Description & example |
|---|---|---|---|
| gpu_vendor | string | Core | Override GPU vendor detection. "AMD" or "NVIDIA". Defaults to "AMD" if not set and auto-detect fails. |
| guest_os | string | Core | Container OS for package manager selection. "UBUNTU" or "CENTOS". Affects rocEnvTool installer selection. |
| MAD_ROCM_PATH | string | Core | Override host ROCm root path (e.g. "/opt/rocm-6.2"). Takes priority over auto-detection and ROCM_PATH env. |
| docker_env_vars | dict | Exec | Env vars injected as --env into docker run. Keys are validated with _ENV_KEY_RE. Special: docker_env_vars.MAD_ROCM_PATH overrides in-container ROCm root independently of host. |
| docker_build_arg | dict | Exec | Extra --build-arg KEY=VAL flags passed to docker build. |
| docker_gpus | string | Exec | Comma-separated GPU indices to expose, or "all". E.g. "0,1,2,3". |
| docker_cpus | string | Exec | CPU affinity string for --cpuset-cpus. E.g. "0-15". |
| docker_mounts | dict | Exec | Extra volume mounts. E.g. {"host_path":"/data","container_path":"/mnt/data"}. |
| docker_image / MAD_CONTAINER_IMAGE | string | Orch | Skip build entirely; use this image for all models. Creates a synthetic manifest. |
| k8s / kubernetes | dict | Deploy | Selects Kubernetes deployment. See K8s config section for sub-keys. |
| slurm | dict | Deploy | Selects SLURM deployment. See SLURM config section for sub-keys. |
| distributed | dict | Deploy | Distributed launcher configuration. launcher, nnodes, nproc_per_node, backend, port. See Per-launcher config. |
| distributed.launcher | string | Deploy | "torchrun", "deepspeed", "megatron", "torchtitan", "primus", "vllm", "sglang", "sglang_disagg", "slurm_multi"/"slurm-multi". |
| distributed.sglang_disagg | dict | Deploy | Fine-tune prefill/decode node split. {"prefill_nodes":1,"decode_nodes":2}. Default ~40% prefill, rest decode. Min 3 nodes total. |
| vllm | dict | Deploy | vLLM-specific config (tensor/pipeline parallelism, model, etc.). |
| primus | dict | Deploy | Primus-specific config. config_path, cli_extra, backend. |
| secrets | dict | Deploy | K8s only. Auto-converted to a K8s Secret and mounted as env vars. E.g. {"HF_TOKEN":"hf_xxx"}. |
| tools | list | Exec | Profiling/tracing tools. Each item: {"name":"rocprofv3_compute"}. Stackable. See Profiling tools. |
| rocenv_mode | string | Exec | "lite" (default) or "full". Full mode runs lshw/dmidecode/dmesg/modinfo, installs missing tools per guest_os. |
| pre_scripts | list | Exec | Scripts to run inside the container before the model script. |
| post_scripts | list | Exec | Scripts to run inside the container after the model script. |
| encapsulate_script | string | Exec | Script prepended to the model run command (wraps the whole execution). |
| log_error_pattern_scan | bool | Exec | Set false to disable post-run log substring error detection. Useful when pytest/JUnit is authoritative. |
| log_error_patterns | list | Exec | Replace the default error patterns list entirely. Each string is matched as substring in log lines. |
| log_error_benign_patterns | list | Exec | Literal substrings that mark a matching log line as benign (not an error). |
| env_vars | dict | Deploy | Top-level env vars merged into deployment config (SLURM script / K8s job manifest). |
| gen_sys_env_details | bool | Exec | Enable/disable rocEnvTool system environment collection. Default: true. |
| debug | bool | Deploy | Enable debug-level logging in deployment templates. |
SLURM sub-keys (slurm dict)
| Key | Default (from preset) | Description |
|---|---|---|
| partition | "amd-rccl" | SLURM partition name. |
| nodes | 1 | Number of nodes to allocate. |
| gpus_per_node | 8 | GPUs per node. |
| time | "24:00:00" | Wall time limit (HH:MM:SS). |
| exclusive | true | Request exclusive node access. |
| nodelist | (none) | Pin to specific nodes. Also skips node health preflight check. |
| exclude | (none) | Nodes to exclude. |
| constraint | (none) | Node feature constraints. |
| reservation | (none) | SLURM reservation name. Forwarded to srun health/cleanup commands. |
| qos | (none) | Quality of service. |
| account | (none) | SLURM account for billing. |
| modules | [] | List of environment modules to load before job. |
| output_dir | CWD | Directory for SLURM log/output files. |
| network_interface | (none) | Network interface for NCCL/RCCL (e.g. "ib0"). |
| shared_workspace | (none) | Shared filesystem path accessible from all nodes. |
Kubernetes sub-keys (k8s dict)
| Key | Default | Description |
|---|---|---|
| namespace | "default" | Kubernetes namespace. |
| gpu_count | (none) | Number of GPUs per pod. |
| gpu_resource_name | "amd.com/gpu" | K8s GPU resource type. Auto-set by GPU-vendor preset. |
| image_pull_policy | "Always" | K8s imagePullPolicy. |
| kubeconfig | "~/.kube/config" | Path to kubeconfig. |
| data_storage_class | "nfs-banff" | Storage class for data PVC. Falls back to nfs_storage_class then storage_class. |
| storage_class | "nfs-banff" | Generic storage class fallback. |
| memory | "64Gi" | Container memory request. |
| memory_limit | "128Gi" | Container memory limit. |
| cpu | "16" | CPU request. |
| cpu_limit | "32" | CPU limit. |
| host_ipc | false | Enable hostIPC (needed for multi-node NCCL). |
| backoff_limit | 3 | K8s Job backoffLimit (retries). |
| ttl_seconds_after_finished | null | Auto-delete job after N seconds. |
| recreate_shared_data_pvc | false | Re-create data PVC even if it already exists. |
| secrets.strategy | "from_local_credentials" | How to load K8s image pull secrets. |
| secrets.image_pull_secret_names | [] | Existing K8s secret names to use as image pull secrets. |
Model definition — models.json
Each model definition lives in a models.json file (or is returned by get_models_json.py::list_models()). Fields map to the CustomModel dataclass in utils/discover_models.py.
{
"name": "llama3-8b-train", // Unique model identifier
"dockerfile": "docker/Dockerfile.ubuntu.amd",
"dockercontext": ".", // Build context dir (relative to scripts dir)
"scripts": "scripts/llama3/train.sh",
"url": "https://github.com/org/repo",
"cred": "hf_token", // Credential key from credential.json
"owner": "ml-team",
"data": "llama3-dataset", // Data identifier for DataProvider
"n_gpus": "8", // "-1" = all available; "0" = CPU-only
"timeout": 14400, // Seconds; overridden by --timeout CLI flag
"training_precision": "bf16",
"tags": ["llama3", "training", "amd"],
"args": "--batch-size 4 --seq-len 4096",
"multiple_results": "results.csv", // CSV file with multiple perf rows
"skip_gpu_arch": "gfx908,gfx1100", // Comma-list of archs to skip this model on
"additional_docker_run_options": "--shm-size 64g",
"distributed": {
"launcher": "torchrun",
"nnodes": 2,
"nproc_per_node": 8
},
"env_vars": {
"HF_TOKEN": "auto", // Injected into container env
"DOCKER_IMAGE_NAME": "reg/img" // Used by slurm_multi parallel pull
}
}
Key field notes
| Field | Notes |
|---|---|
| n_gpus | "-1" = use all GPUs on the host (MAD_SYSTEM_NGPUS). Positive int = that many GPUs. Used for perf CSV metadata. |
| timeout | Used when CLI --timeout=-1 (default). Explicit CLI value always wins. |
| skip_gpu_arch | Comma-separated GPU arch names (e.g. "gfx908,A100"). Model is skipped if detected arch matches. Disable with --disable-skip-gpu-arch. |
| multiple_results | Path to CSV file (relative to model dir) with per-result rows that are appended to perf.csv individually. |
| DOCKER_IMAGE_NAME in env_vars | Required for slurm_multi: specifies the registry image for parallel srun docker pull on compute nodes. Also set automatically by DockerBuilder after a successful push. |
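The skip_gpu_arch rule can be sketched as follows. should_skip is a hypothetical helper, and exact string matching is an assumption: the real check may normalize or partially match arch names.

```python
def should_skip(skip_gpu_arch: str, detected_arch: str,
                disable_skip: bool = False) -> bool:
    """Decide whether a model is skipped on the detected GPU architecture."""
    if disable_skip:
        # Corresponds to --disable-skip-gpu-arch: never skip.
        return False
    skipped = {arch.strip() for arch in skip_gpu_arch.split(",") if arch.strip()}
    return detected_arch in skipped
```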
Build manifest — build_manifest.json
Written by madengine build, consumed by madengine run. Pass with --manifest-file.
{
"built_images": {
"ci-llama3_Dockerfile.ubuntu.amd": {
"docker_image": "registry.io/ml/ci-llama3:sha256-abc",
"docker_sha": "sha256:abc123",
"build_duration": 183.4
}
},
"built_models": {
"ci-llama3_Dockerfile.ubuntu.amd": {
"name": "llama3-8b-train",
"dockerfile": "docker/Dockerfile.ubuntu.amd",
"docker_image": "ci-llama3_Dockerfile.ubuntu.amd",
"docker_sha": "sha256:abc123",
"build_duration": 183.4,
"scripts": "scripts/llama3/train.sh",
"args": "--batch-size 4",
"tags": ["llama3","training"],
"n_gpus": "8",
"timeout": 14400,
"skip_gpu_arch": "",
"multiple_results": "",
"distributed": {"launcher":"torchrun","nnodes":2,"nproc_per_node":8},
"env_vars": {"DOCKER_IMAGE_NAME":"registry.io/ml/ci-llama3:sha256-abc"},
"built_on_compute": false
}
},
"context": {
"gpu_vendor": "AMD",
"guest_os": "UBUNTU",
"docker_env_vars": {"MAD_GPU_VENDOR":"AMD","MAD_SYSTEM_NGPUS":"8"},
"docker_build_arg": {}
},
"deployment_config": {
"target": "slurm",
"slurm": {"partition":"gpu","nodes":4,"gpus_per_node":8,"time":"24:00:00"},
"distributed": {"launcher":"torchrun","nnodes":4,"nproc_per_node":8},
"env_vars": {"NCCL_DEBUG":"WARN"},
"debug": false
},
"summary": {"total":1,"success":1,"failed":0}
}
Keys in deployment_config are merged into the runtime context at startup. Keys in --additional-context take precedence over deployment_config.
Deployment target inference
No explicit deploy field needed. RunOrchestrator._infer_deployment_target() inspects the merged context:
| Context condition | Target | Class | Path |
|---|---|---|---|
| "k8s" or "kubernetes" key present | Kubernetes | KubernetesDeployment | deployment/kubernetes.py |
| "slurm" key present | SLURM | SlurmDeployment | deployment/slurm.py |
| Neither | Local Docker | ContainerRunner | execution/container_runner.py |
Within SLURM deployment, if distributed.launcher == "slurm_multi" (or "slurm-multi"), SlurmDeployment.prepare() takes the slurm_multi path instead of generating the standard Jinja2 template.
Pass --force-mirror-local to madengine run to always use ContainerRunner, even when slurm/k8s keys are in context.
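The inference rule amounts to a few key checks on the merged context. The function below is a sketch, not the body of RunOrchestrator._infer_deployment_target(); checking the K8s keys before "slurm" follows the table order and is an assumption about what happens when both are present.

```python
from typing import Any, Dict


def infer_target(context: Dict[str, Any], force_local: bool = False) -> str:
    """Return "k8s", "slurm", or "local" for a merged context dict."""
    if force_local:
        # Mirrors --force-mirror-local: always run through ContainerRunner.
        return "local"
    if "k8s" in context or "kubernetes" in context:
        return "k8s"
    if "slurm" in context:
        return "slurm"
    return "local"
```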
SLURM deployment
Implemented in src/madengine/deployment/slurm.py. Generates an sbatch script from a Jinja2 template at src/madengine/deployment/templates/slurm/job.sh.j2.
Preset merge order
ConfigLoader.load_slurm_config() applies three layers (last wins):
- presets/slurm/defaults.json: base defaults for all SLURM runs
- presets/slurm/profiles/single-node.json or multi-node.json: profile selected by nodes count
- User-supplied slurm / distributed / env_vars keys
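The three-layer merge can be sketched with a recursive dict merge. The preset dicts below are abridged, the single-node profile is left empty because its contents are not shown in this wiki, and selecting the multi-node profile when nodes > 1 is an assumption about how ConfigLoader picks the profile.

```python
import copy
from typing import Any, Dict

DEFAULTS = {
    "slurm": {"partition": "amd-rccl", "nodes": 1,
              "gpus_per_node": 8, "time": "24:00:00"},
    "env_vars": {"OMP_NUM_THREADS": "8"},
}
SINGLE_NODE: Dict[str, Any] = {}  # placeholder; real contents not shown here
MULTI_NODE = {
    "slurm": {"nodes": 2},
    "env_vars": {"NCCL_DEBUG": "WARN", "NCCL_SOCKET_IFNAME": "ib0"},
}


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where override wins, merging nested dicts key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def load_slurm_config(user: Dict[str, Any]) -> Dict[str, Any]:
    """Apply defaults, then the profile, then user keys (last wins)."""
    nodes = user.get("slurm", {}).get("nodes", DEFAULTS["slurm"]["nodes"])
    profile = MULTI_NODE if nodes > 1 else SINGLE_NODE
    return deep_merge(deep_merge(DEFAULTS, profile), user)
```

With user input {"slurm": {"partition": "gpu", "nodes": 4}}, the result keeps the default time and gpus_per_node, takes the user's partition and node count, and picks up the multi-node NCCL variables.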
presets/slurm/defaults.json — base preset contents
{
"gpu_vendor": "AMD",
"guest_os": "UBUNTU",
"debug": false,
"slurm": {
"partition": "amd-rccl",
"nodes": 1,
"gpus_per_node": 8,
"time": "24:00:00",
"exclusive": true,
"modules": []
},
"distributed": {
"backend": "nccl",
"port": 29500
},
"env_vars": {
"OMP_NUM_THREADS": "8",
"MIOPEN_FIND_MODE": "1",
"MIOPEN_USER_DB_PATH": "/tmp/.miopen"
}
}
presets/slurm/profiles/multi-node.json — additional env vars for multi-node
{
"slurm": {"nodes": 2, "gpus_per_node": 8, "time": "24:00:00"},
"distributed": {"backend": "nccl", "port": 29500},
"env_vars": {
"NCCL_DEBUG": "WARN",
"NCCL_DEBUG_SUBSYS": "INIT",
"NCCL_IB_DISABLE": "0",
"NCCL_SOCKET_IFNAME": "ib0",
"TORCH_NCCL_HIGH_PRIORITY": "1",
"GPU_MAX_HW_QUEUES": "8",
"TORCH_NCCL_ASYNC_ERROR_HANDLING": "1",
"NCCL_TIMEOUT": "1200",
"HSA_ENABLE_SDMA": "0",
"HSA_FORCE_FINE_GRAIN_PCIE": "1",
"RCCL_ENABLE_HIPGRAPH": "0"
}
}
What the SLURM job script does
- Sets MASTER_ADDR via scontrol show hostnames, MASTER_PORT, WORLD_SIZE, NNODES
- Sets per-node HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES / CUDA_VISIBLE_DEVICES (vLLM/SGLang: only HIP_VISIBLE_DEVICES)
- Sets MIOPEN_USER_DB_PATH per-process: /tmp/.miopen/node_${SLURM_PROCID}_rank_${LOCAL_RANK:-0}
- Sets TORCH_ELASTIC_RDZV_TIMEOUT=3600 for PyTorch elastic
- Sets MAD_DEPLOYMENT_TYPE=slurm, MAD_SLURM_JOB_ID, MAD_NODE_RANK, MAD_IN_SLURM_JOB=1
- Multi-node: generates per-node task script; runs via srun bash $TASK_SCRIPT
- Single-node: creates synthetic manifest with deployment_config.target="docker" and calls madengine run
Node health preflight
SlurmNodeSelector runs a health-check srun before the main job unless slurm.nodelist is set (then skipped). Supports slurm.reservation forwarded to srun commands.
Monitoring
Polls squeue every 30 seconds. Terminal states: COMPLETED, FAILED, CANCELLED — a scancel'd job will not loop forever.
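The polling loop can be sketched as below. get_state is a hypothetical callable standing in for whatever queries squeue, and the sleep function is injectable so the loop can be exercised without waiting; the real monitor may track additional states.

```python
import time
from typing import Callable

TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}


def wait_for_job(
    get_state: Callable[[], str],
    poll_interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal state and return that state."""
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            # Includes CANCELLED, so a scancel'd job ends the loop.
            return state
        sleep(poll_interval_s)
```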
Inside an existing allocation (salloc): if SLURM_JOB_ID is set and the launcher is slurm_multi, madengine runs the wrapper script directly with bash instead of nesting a new sbatch. Other launchers still submit via sbatch even inside salloc.
Kubernetes deployment
Implemented in src/madengine/deployment/kubernetes.py and 6 focused mixin modules (refactored in v2.0.3). Requires pip install -e ".[kubernetes]".
Mixin modules
| Module | Concern |
|---|---|
| k8s_pvc.py | PVC lifecycle. Storage-class fallback: data_storage_class → nfs_storage_class → storage_class. Default: "nfs-banff". |
| k8s_results.py | Log/artifact collection, perf aggregation. Uses shared collector_pod_name() helper — truncated collector-{id[:15]} to stay within K8s name limits. |
| k8s_scripts.py | Script extraction, ConfigMap building. Carries rocenv_mode and guest_os into the ConfigMap. |
| k8s_template_context.py | Assembles Jinja2 template context dict passed to job.yaml.j2. |
| kubernetes_launcher_mixin.py | Selects the right Jinja2 template per launcher type. |
| k8s_secrets.py | Converts additional_context.secrets dict to K8s Secret objects mounted as env vars. |
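Two of the rules in the table are simple enough to sketch directly: the storage-class fallback chain and the truncated collector pod name. Both functions are illustrations written from the descriptions above, not copies of the mixin code; resolve_storage_class is a hypothetical name.

```python
from typing import Any, Dict


def resolve_storage_class(k8s: Dict[str, Any]) -> str:
    """Walk data_storage_class, nfs_storage_class, storage_class, then default."""
    for key in ("data_storage_class", "nfs_storage_class", "storage_class"):
        value = k8s.get(key)
        if value:
            return value
    return "nfs-banff"


def collector_pod_name(job_id: str) -> str:
    """Keep at most 15 characters of the id so the name stays short."""
    return f"collector-{job_id[:15]}"
```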
Preset merge order
ConfigLoader.load_k8s_config() applies five layers (last wins):
- presets/k8s/defaults.json: base defaults
- presets/k8s/gpu-vendors/amd.json or nvidia.json: GPU resource name
- presets/k8s/gpu-vendors/amd-multi-gpu.json: AMD multi-GPU NCCL env vars (only if AMD + multi-GPU)
- presets/k8s/profiles/single-gpu.json, multi-gpu.json, or multi-node.json
- User config
presets/k8s/defaults.json — base preset contents
{
"k8s": {
"kubeconfig": "~/.kube/config",
"namespace": "default",
"image_pull_policy": "Always",
"backoff_limit": 3,
"ttl_seconds_after_finished": null,
"nfs_storage_class": "nfs-banff",
"storage_class": "nfs-banff",
"data_storage_class": "nfs-banff",
"recreate_shared_data_pvc": false,
"secrets": {
"strategy": "from_local_credentials",
"image_pull_secret_names": [],
"runtime_secret_name": null
}
},
"env_vars": {"OMP_NUM_THREADS": "8"}
}
presets/k8s/gpu-vendors/amd-multi-gpu.json — AMD multi-GPU NCCL env vars
{
"env_vars": {
"NCCL_DEBUG": "WARN",
"NCCL_IB_DISABLE": "0",
"NCCL_SOCKET_IFNAME": "ib0",
"TORCH_NCCL_HIGH_PRIORITY": "1",
"GPU_MAX_HW_QUEUES": "8",
"HSA_ENABLE_SDMA": "0",
"MIOPEN_FIND_MODE": "1",
"MIOPEN_USER_DB_PATH": "/tmp/.miopen",
"HSA_FORCE_FINE_GRAIN_PCIE": "1",
"RCCL_ENABLE_HIPGRAPH": "0"
}
}
A job can show FAILED in the results table even when the pod succeeded. This occurs when the kubelet returns 502 between job completion and log collection. PVC artifacts are still collected. Check kubectl describe pod <pod>.
Secrets management
# Pass secrets via additional_context
madengine run --tags llm-serve \
--additional-context '{
"k8s": {"namespace":"ml","gpu_count":8},
"secrets": {"HF_TOKEN":"hf_xxx","WANDB_API_KEY":"yyy","S3_KEY":"zzz"}
}'
Secrets in additional_context.secrets are auto-converted to a K8s Secret object and mounted as environment variables in the job pod. They are never written to perf.csv or build logs.
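As a rough illustration of the conversion, the sketch below builds a Kubernetes Secret manifest as a plain dict and the container fragment that exposes it as environment variables. The helper names and the secret name are hypothetical; the actual k8s_secrets.py may build these objects through the Kubernetes client instead.

```python
import base64
from typing import Any, Dict


def build_secret_manifest(name: str, namespace: str,
                          secrets: Dict[str, str]) -> Dict[str, Any]:
    """Build an Opaque Secret; values under `data` must be base64-encoded."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {
            key: base64.b64encode(value.encode("utf-8")).decode("ascii")
            for key, value in secrets.items()
        },
    }


def env_from_secret(name: str) -> Dict[str, Any]:
    """Container spec fragment that maps every secret key to an env var."""
    return {"envFrom": [{"secretRef": {"name": name}}]}
```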
slurm_multi launcher merged in v2.1.0
What it is
An escape-hatch SLURM launcher for workloads that orchestrate their own per-node Docker containers via srun — for example SGLang Disaggregated (proxy + prefill + decode) or any topology that needs to call srun/scontrol from inside the job step.
Generates a wrapper SBATCH that runs the model's own .slurm (or .sh) script directly on the head node on baremetal — no outer container — so the workload can spawn its own per-node containers without nesting.
How to select it
{
"slurm": {
"partition": "gpu",
"nodes": 3,
"gpus_per_node": 8,
"time": "02:00:00"
},
"distributed": {
"launcher": "slurm_multi"
}
}
Alias "slurm-multi" (hyphen) is also accepted and normalized automatically.
Build modes
| Mode | Flag | Behavior |
|---|---|---|
| Use prebuilt image | --use-image registry.io/img:tag | Skip local build. Uses explicit image. |
| Auto-resolve from model card | --use-image auto | Reads env_vars.DOCKER_IMAGE_NAME from model card. |
| Build on compute | --build-on-compute --registry reg.io/ml | Builds on SLURM compute node, pushes to registry. Manifest sets built_on_compute: true. Run phase pulls in parallel on all nodes. |
| Implicit fallback | no flags | If model card has DOCKER_IMAGE_NAME, auto-uses it. Otherwise raises ConfigurationError listing options. |
Execution paths
- sbatch (default): wrapper SBATCH submitted to SLURM. Head node calls srun docker pull on all nodes in parallel, then runs the model's script.
- bash-in-salloc: if SLURM_JOB_ID env var is set (inside existing salloc), the launcher runs the wrapper synchronously with bash. Sets DeploymentResult.skip_monitoring=True so the monitor poll is skipped.
Results aggregation
_collect_slurm_multi_results() reads per-job CSV from /shared_inference/$USER/$JOBID/perf.csv and writes those rows into cwd/perf.csv (copy if absent, append data rows if present). This ensures display_performance_table and madengine report to-html find results without extra arguments.
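The copy-or-append behavior can be sketched as follows. merge_perf_csv is a hypothetical helper, and it assumes both files share the same column order and that the target ends with a newline; the real collector may reconcile headers more carefully.

```python
import csv
import shutil
from pathlib import Path


def merge_perf_csv(job_csv: Path, target_csv: Path) -> None:
    """Copy the job CSV if the target is absent, else append its data rows."""
    if not target_csv.exists():
        shutil.copyfile(job_csv, target_csv)
        return
    with job_csv.open(newline="") as src:
        data_rows = list(csv.reader(src))[1:]  # drop the header row
    with target_csv.open("a", newline="") as dst:
        csv.writer(dst, lineterminator="\n").writerows(data_rows)
```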
Local self-managed execution
When slurm_multi is detected in a non-SLURM context (e.g. local Docker mode), ContainerRunner._run_self_managed() runs the model's script directly on the host. Env vars from model card and additional_context are injected; keys are logged without values to avoid leaking credentials.
Docker --build-context tools= v2.1.0
What it does
Every docker build issued by DockerBuilder now passes --build-context tools=scripts/common/tools when that directory exists. Dockerfiles can pull shared helper scripts from the named context:
# In any model Dockerfile
COPY --from=tools rocm_smi/*.py /opt/mad/tools/rocm_smi/
COPY --from=tools gpu_info/*.py /opt/mad/tools/
Eliminates duplication of shared APIs across model Dockerfiles.
Conditional emission (PR #134)
The flag is only added when scripts/common/tools/ exists at build time. Builds in MAD projects without a tools directory do not receive the flag and will not fail.
Implementation: single guarded block in execution/docker_builder.py.
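The guard might look roughly like the sketch below. `build_context_args` is a hypothetical helper, and it assumes `docker build` runs from the project root so the relative path resolves; the actual block in docker_builder.py may differ.

```python
from pathlib import Path
from typing import List


def build_context_args(
    project_root: Path, tools_rel: str = "scripts/common/tools"
) -> List[str]:
    """Return extra docker build arguments, or [] when the directory is absent."""
    if (project_root / tools_rel).is_dir():
        # Named build context: Dockerfiles can then use COPY --from=tools.
        return ["--build-context", f"tools={tools_rel}"]
    # No tools directory: emit nothing so the build is unaffected.
    return []
```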
SLURM fix in same PR: switched from shlex.quote() to double-quote escaping in slurm.py env-var generation so spaces and paths in values survive correctly in the sbatch script.
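For illustration, double-quote escaping for a generated shell script can be sketched as below. The set of escaped characters (backslash, double quote, dollar sign, backtick) is an assumption based on POSIX shell quoting rules; the source does not say exactly which characters slurm.py escapes, and an implementation could leave `$` unescaped on purpose to allow expansion on the compute node.

```python
import re

# Characters that keep a special meaning inside double quotes in POSIX shells.
_DQ_SPECIAL = re.compile(r'([\\"$`])')


def double_quote(value: str) -> str:
    """Wrap value in double quotes, backslash-escaping the special characters."""
    return '"' + _DQ_SPECIAL.sub(r"\\\1", value) + '"'


def export_line(name: str, value: str) -> str:
    """Render one export statement for a generated job script."""
    return f"export {name}={double_quote(value)}"
```

For example, `export_line("DATA_DIR", "/mnt/my data")` yields `export DATA_DIR="/mnt/my data"`, keeping the space inside a single shell word.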
Launcher matrix
| Launcher | Local | K8s | SLURM | Type | Notes |
|---|---|---|---|---|---|
| torchrun | ✅ | ✅ | ✅ | Train | DDP / FSDP, elastic rendezvous. |
| megatron / megatron-lm | ✅ | ✅ | ✅ | Train | TP + PP parallelism; sets TP/PP/CP size env vars. |
| torchtitan | ✅ | ✅ | ✅ | Train | FSDP2 + TP + PP + CP; Llama 3.1 8B–405B. |
| deepspeed | ✅ | ✅ | ✅ | Train | ZeRO, pipeline parallelism; dynamic hostfile from SLURM. |
| vllm | ✅ | ✅ | ✅ | Infer | PagedAttention; each node self-managing (no torchrun wrapper). |
| sglang | ✅ | ✅ | ✅ | Infer | RadixAttention, structured gen; each node self-managing. |
| sglang_disagg | ❌ | ✅ | ✅ | Infer | Disaggregated prefill/decode; min 3 nodes (1 proxy + ≥1P + ≥1D). |
| primus | ✅ | ✅ | ✅ | Train | Megatron / TorchTitan / MaxText via Primus YAML config. |
| slurm_multi | ✅ (self-mgd) | ❌ | ✅ | Meta | Bypasses template; model's own SLURM script on head node. |
Per-launcher configuration
torchrun: standard PyTorch distributed launcher. Generates: torchrun --nnodes=N --nproc_per_node=G --node_rank=R --master_addr=ADDR --master_port=PORT (N = node count, G = processes per node).
{
"slurm": {"partition":"gpu","nodes":4,"gpus_per_node":8,"time":"24:00:00"},
"distributed": {
"launcher": "torchrun",
"nnodes": 4,
"nproc_per_node": 8,
"backend": "nccl",
"port": 29500
},
"env_vars": {
"NCCL_DEBUG": "WARN",
"HSA_ENABLE_SDMA": "0",
"TORCH_NCCL_ASYNC_ERROR_HANDLING": "1"
}
}
Local: MAD_MULTI_NODE_RUNNER is set to torchrun --standalone --nproc_per_node=N (single-node only).
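The two command shapes can be sketched as plain string builders. The function names are hypothetical; the flags are standard torchrun options, and the default port mirrors the 29500 used in the config above.

```python
def torchrun_command(
    nnodes: int,
    nproc_per_node: int,
    node_rank: int,
    master_addr: str,
    master_port: int = 29500,
) -> str:
    """Build the multi-node torchrun invocation for one node."""
    parts = [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
    ]
    return " ".join(parts)


def local_torchrun_command(nproc_per_node: int) -> str:
    """Build the single-node form used for local Docker runs."""
    return f"torchrun --standalone --nproc_per_node={nproc_per_node}"
```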
megatron: uses torchrun under the hood; sets the TENSOR_MODEL_PARALLEL_SIZE, PIPELINE_MODEL_PARALLEL_SIZE, and CONTEXT_PARALLEL_SIZE env vars for the Megatron script to read.
{
"slurm": {"partition":"gpu","nodes":8,"gpus_per_node":8,"time":"48:00:00"},
"distributed": {
"launcher": "megatron",
"nnodes": 8,
"nproc_per_node": 8
},
"env_vars": {
"TENSOR_MODEL_PARALLEL_SIZE": "4",
"PIPELINE_MODEL_PARALLEL_SIZE": "2",
"CONTEXT_PARALLEL_SIZE": "1",
"NCCL_IB_DISABLE": "0"
}
}
torchtitan: FSDP2 + TP + PP + CP. Sets TORCHTITAN_TENSOR_PARALLEL_SIZE, TORCHTITAN_PIPELINE_PARALLEL_SIZE, TORCHTITAN_FSDP_ENABLED, TORCHTITAN_CONTEXT_PARALLEL_SIZE.
{
"slurm": {"partition":"gpu","nodes":4,"gpus_per_node":8,"time":"24:00:00"},
"distributed": {
"launcher": "torchtitan",
"nnodes": 4,
"nproc_per_node": 8
},
"env_vars": {
"TORCHTITAN_TENSOR_PARALLEL_SIZE": "2",
"TORCHTITAN_FSDP_ENABLED": "true"
}
}
deepspeed: DeepSpeed with dynamic SLURM hostfile generation. Generates: deepspeed --hostfile=/tmp/hostfile …
{
"slurm": {
"partition": "gpu",
"nodes": 8,
"gpus_per_node": 8,
"time": "48:00:00",
"reservation": "ml-priority"
},
"distributed": {
"launcher": "deepspeed",
"nnodes": 8,
"nproc_per_node": 8,
"backend": "nccl"
},
"env_vars": {
"NCCL_DEBUG": "WARN",
"HSA_ENABLE_SDMA": "0"
}
}
vllm: each node runs independently (no torchrun). Sets VLLM_TENSOR_PARALLEL_SIZE, VLLM_PIPELINE_PARALLEL_SIZE, VLLM_DISTRIBUTED_BACKEND. Only HIP_VISIBLE_DEVICES is set (not ROCR_VISIBLE_DEVICES / CUDA_VISIBLE_DEVICES) to avoid conflicts with Ray.
{
"slurm": {"partition":"gpu","nodes":2,"gpus_per_node":8,"time":"12:00:00"},
"distributed": {
"launcher": "vllm",
"nnodes": 2,
"nproc_per_node": 8
},
"env_vars": {
"VLLM_TENSOR_PARALLEL_SIZE": "8",
"VLLM_PIPELINE_PARALLEL_SIZE": "2"
}
}
RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES is automatically overridden to "" when HIP_VISIBLE_DEVICES is set, preventing the rocm/vllm image from ignoring GPU visibility.
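A minimal sketch of that override, assuming it is applied to the final container environment dict; `apply_ray_hip_fix` is a hypothetical name.

```python
from typing import Dict, Mapping


def apply_ray_hip_fix(env: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of env with the Ray override applied when needed."""
    out = dict(env)
    if out.get("HIP_VISIBLE_DEVICES"):
        # Force an empty value so any setting baked into the image is overridden.
        out["RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES"] = ""
    return out
```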
sglang: standard SGLang (RadixAttention, structured gen). Each node is self-managing. Sets SGLANG_TENSOR_PARALLEL_SIZE, SGLANG_PIPELINE_PARALLEL_SIZE.
{
"slurm": {"partition":"gpu","nodes":2,"gpus_per_node":8,"time":"06:00:00"},
"distributed": {
"launcher": "sglang",
"nnodes": 2,
"nproc_per_node": 8
},
"env_vars": {
"SGLANG_TENSOR_PARALLEL_SIZE": "8"
}
}
sglang_disagg: disaggregated prefill + decode topology. Minimum 3 nodes: 1 proxy + ≥1 prefill + ≥1 decode. Node split: default ~40% prefill, rest decode.
{
"slurm": {
"partition": "gpu",
"nodes": 5,
"gpus_per_node": 8,
"time": "04:00:00"
},
"distributed": {
"launcher": "sglang_disagg",
"nnodes": 5,
"nproc_per_node": 8,
"sglang_disagg": {
"prefill_nodes": 2,
"decode_nodes": 2
}
},
"env_vars": {
"SGLANG_TP_SIZE": "8"
}
}
Sets: SGLANG_DISAGG_MODE, SGLANG_DISAGG_PREFILL_NODES, SGLANG_DISAGG_DECODE_NODES, SGLANG_DISAGG_TOTAL_NODES, SGLANG_NODE_IPS, SGLANG_NODE_RANK.
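One way to picture the node split is the sketch below. It is illustrative only: it assumes the ~40% share is taken from the non-proxy nodes and rounded half up, neither of which the text above specifies, and `split_disagg_nodes` is a hypothetical name.

```python
from typing import Optional, Tuple


def split_disagg_nodes(
    total_nodes: int,
    prefill_nodes: Optional[int] = None,
    decode_nodes: Optional[int] = None,
) -> Tuple[int, int, int]:
    """Return (proxy, prefill, decode) node counts."""
    if total_nodes < 3:
        raise ValueError("sglang_disagg needs at least 3 nodes")
    workers = total_nodes - 1  # one node is reserved for the proxy
    if prefill_nodes is None or decode_nodes is None:
        # Assumed default: ~40% of workers, rounded half up, at least 1 per role.
        prefill = min(max((4 * workers + 5) // 10, 1), workers - 1)
        decode = workers - prefill
    else:
        prefill, decode = prefill_nodes, decode_nodes
        if prefill < 1 or decode < 1 or prefill + decode != workers:
            raise ValueError("prefill + decode must equal total_nodes - 1")
    return 1, prefill, decode
```

Under these assumptions a 5-node job splits into 1 proxy, 2 prefill, and 2 decode nodes, which matches the explicit `prefill_nodes` / `decode_nodes` values in the config above.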
Config recipes
Complete working configurations for common scenarios.
Local — single GPU, AMD
madengine run --tags llama3 \
--additional-context '{
"gpu_vendor": "AMD",
"guest_os": "UBUNTU",
"docker_gpus": "0"
}'
Local — all 8 GPUs, with Megatron env vars
madengine run --tags megatron-llama3 \
--additional-context '{
"gpu_vendor": "AMD",
"guest_os": "UBUNTU",
"docker_env_vars": {
"TENSOR_MODEL_PARALLEL_SIZE": "4",
"PIPELINE_MODEL_PARALLEL_SIZE": "2"
}
}'
SLURM — single node torchrun
cat > slurm-single.json <<'EOF'
{
"slurm": {
"partition": "amd-gpu",
"nodes": 1,
"gpus_per_node": 8,
"time": "12:00:00",
"exclusive": true
},
"distributed": {
"launcher": "torchrun",
"nnodes": 1,
"nproc_per_node": 8
}
}
EOF
madengine build --tags llama3 --registry registry.example.com/ml
madengine run --manifest-file build_manifest.json \
--additional-context-file slurm-single.json
SLURM — 4-node DeepSpeed with reservation
cat > slurm-multi.json <<'EOF'
{
"slurm": {
"partition": "amd-gpu",
"nodes": 4,
"gpus_per_node": 8,
"time": "24:00:00",
"exclusive": true,
"reservation": "ml-training-q1",
"network_interface": "ib0"
},
"distributed": {
"launcher": "deepspeed",
"nnodes": 4,
"nproc_per_node": 8,
"backend": "nccl"
},
"env_vars": {
"NCCL_IB_DISABLE": "0",
"NCCL_SOCKET_IFNAME": "ib0",
"NCCL_DEBUG": "WARN",
"HSA_ENABLE_SDMA": "0"
}
}
EOF
madengine run --manifest-file build_manifest.json \
--additional-context-file slurm-multi.json
K8s — single pod, 4 AMD GPUs
madengine run --tags llama3-infer \
--additional-context '{
"k8s": {
"namespace": "ml-team",
"gpu_count": 4
}
}'
K8s — multi-node vLLM with HF secret
madengine run --tags vllm-llama3-70b \
--additional-context '{
"k8s": {
"namespace": "ml-team",
"gpu_count": 8,
"host_ipc": true,
"data_storage_class": "nfs-banff"
},
"distributed": {
"launcher": "vllm",
"nnodes": 2,
"nproc_per_node": 8
},
"secrets": {"HF_TOKEN": "hf_xxxxxxx"},
"env_vars": {
"VLLM_TENSOR_PARALLEL_SIZE": "8",
"VLLM_PIPELINE_PARALLEL_SIZE": "2"
}
}'
SLURM — SGLang Disagg (3 nodes: 1 proxy + 1P + 1D)
madengine build --tags pyt_sglang_disagg --use-image registry.io/sglang:v0.4
madengine run --manifest-file build_manifest.json \
--additional-context '{
"slurm": {
"partition": "amd-gpu",
"nodes": 3,
"gpus_per_node": 8,
"time": "04:00:00"
},
"distributed": {
"launcher": "slurm_multi"
}
}'
Local run with ROCm compute profiling
madengine run --tags llama3 \
--additional-context '{
"gpu_vendor": "AMD",
"tools": [
{"name": "rocprofv3_compute"}
],
"rocenv_mode": "full"
}'
Stack multiple profilers:
"tools": [
{"name": "rocprofv3_compute"},
{"name": "rccl_trace"},
{"name": "gpu_info_power_profiler"}
]
Profiling & tracing tools
Enable via --additional-context '{"tools":[{"name":"…"}]}'. Tools are stackable — list multiple objects. Implemented in scripts/common/tools/ and execution/container_runner.py::apply_tools().
Do not combine rocm_trace_lite with rocprof / rocprofv3_* in the same run; they conflict at the kernel-tracing level.
| Tool name | Purpose | Output location | Notes |
|---|---|---|---|
| rocprof | Legacy GPU kernel profiling | Kernel timings / occupancy CSVs | Use rocprofv3_* on ROCm ≥ 7.0 |
| rocprofv3_compute | Compute-bound kernels | ALU, wave execution metrics | ROCm ≥ 7.0 |
| rocprofv3_memory | Memory-bound workloads | Cache hits, bandwidth | |
| rocprofv3_communication | Multi-GPU communication | RCCL traces | |
| rocprofv3_full | Comprehensive (all metrics) | All counters | High overhead; short runs only |
| rocprofv3_lightweight | Minimal overhead tracing | HIP API + kernel traces | |
| rocprofv3_perfetto | Perfetto UI traces | Perfetto JSON for ui.perfetto.dev | |
| rocprofv3_api_overhead | API call timing | Per-API timing report | |
| rocprofv3_pc_sampling | Kernel hotspot identification | PC sample histograms | |
| rocm_trace_lite | RTL lite dispatch trace | rocm_trace_lite_output/trace.db | Pinned GitHub release wheel by default |
| rocm_trace_lite_default | RTL default mode | Same paths, broader coverage | v2.0.3+ |
| rocblas_trace | rocBLAS call tracing | Per-library log | |
| miopen_trace | MIOpen call tracing | Per-library log | |
| tensile_trace | Tensile call tracing | Per-library log | |
| rccl_trace | RCCL communication tracing | Per-library log | |
| gpu_info_power_profiler | Power consumption over time | CSV time series | |
| gpu_info_vram_profiler | VRAM usage over time | CSV time series | |
| therock_check | TheRock ROCm stack validation | Detection report | Identifies apt vs TheRock install |
rocm_trace_lite wheel control
| Env var | Effect |
|---|---|
| ROCM_TRACE_LITE_FOLLOW_LATEST=1 | Always pull the latest wheel from GitHub |
| ROCM_TRACE_LITE_WHEEL_URL=https://… | Use a specific wheel URL (air-gapped installs) |
rocEnvTool modes
| Mode (rocenv_mode) | Collects |
|---|---|
| "lite" (default) | Basic ROCm info, GPU topology, driver version |
| "full" | All of lite + lshw, dmidecode, dmesg, modinfo; best-effort installs missing tools per guest_os |
ROCm path resolution
Implemented in src/madengine/utils/rocm_path_resolver.py and src/madengine/core/context.py. Two independent resolution chains run in parallel.
Host path (build & tools)
- MAD_ROCM_PATH in --additional-context
- Auto-detect: /opt/rocm, versioned /opt/rocm-*, TheRock (rocm-sdk + markers), rocminfo / amd-smi / rocm-smi location on PATH
- ROCM_PATH environment variable
- /opt/rocm fallback (with warning)
Set MAD_AUTO_ROCM_PATH=0 to disable scanning and use only env var / default.
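The host chain can be sketched as a first-match-wins function. This is an assumption-laden illustration: the step order follows the list above with the PATH probe treated as part of auto-detection, `resolve_host_rocm_path` is a hypothetical name, and `detect` stands in for the real filesystem scan in rocm_path_resolver.py.

```python
from typing import Callable, Mapping, Optional, Tuple


def resolve_host_rocm_path(
    context: Mapping[str, str],
    env: Mapping[str, str],
    detect: Callable[[], Optional[str]],
) -> Tuple[str, str]:
    """Return (rocm_root, source) using the first step that yields a path."""
    if context.get("MAD_ROCM_PATH"):
        return context["MAD_ROCM_PATH"], "additional-context"
    # MAD_AUTO_ROCM_PATH=0 skips the scan entirely.
    if env.get("MAD_AUTO_ROCM_PATH", "1") != "0":
        found = detect()
        if found:
            return found, "auto-detect"
    if env.get("ROCM_PATH"):
        return env["ROCM_PATH"], "ROCM_PATH"
    # Last resort; a real implementation would also emit a warning here.
    return "/opt/rocm", "fallback"
```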
In-container path (AMD Docker runs)
- docker_env_vars.MAD_ROCM_PATH in additional_context
- ROCM_PATH / ROCM_HOME from image OCI config (docker image inspect)
- In-image shell probe (docker run --rm image env)
- /opt/rocm fallback with warning
The run-phase env table prints host vs container ROCm root, installation type (apt / therock / unknown), and version side-by-side.
unique_id method; 6.4.1+ uses amd-smi node_id. The gpu_renderDs context key maps GPU index → /dev/dri/renderD number. Guards against None entries on restricted ROCm installs.
Environment variables
Read by madengine at runtime
| Variable | Module | Purpose |
|---|---|---|
| MAD_ROCM_PATH | context.py | Override ROCm root on host. Priority 1. |
| ROCM_PATH | core/constants.py | Fallback ROCm root. Priority 3. |
| MAD_AUTO_ROCM_PATH | rocm_path_resolver | Set 0 to disable auto-scan. |
| MODEL_DIR | core/constants.py | Working directory for model scripts. Default: . |
| MAD_VERBOSE_CONFIG | core/constants.py | Enable verbose config output. |
| MAD_SETUP_MODEL_DIR | core/constants.py | Trigger model directory setup. |
| MAD_SECRETS* | context.py | Any env var with this prefix is automatically copied to docker_build_arg AND docker_env_vars. |
| MAD_DOCKERHUB_USER | build_orchestrator | Docker Hub username for registry auth. |
| MAD_DOCKERHUB_PASSWORD | build_orchestrator | Docker Hub password for registry auth. |
| SLURM_JOB_ID | slurm.py | Detect existing SLURM allocation (triggers bash-in-salloc for slurm_multi). |
| SLURM_NNODES, SLURM_NPROCS | container_runner | Read in SLURM job to resolve GPU count per node. |
| NPROC_PER_NODE, GPUS_PER_NODE | container_runner | Injected by SLURM template; read by ContainerRunner to set up docker run GPU args. |
| MONGO_HOST, MONGO_PORT | database/mongodb.py | MongoDB connection. |
| MONGO_USER, MONGO_PASSWORD | database/mongodb.py | MongoDB credentials. |
| MONGO_AUTH_SOURCE, MONGO_TIMEOUT_MS | database/mongodb.py | MongoDB auth source and timeout. |
| NAS_NODES | core/constants.py | NAS node config (JSON string). |
| MAD_AWS_S3 | core/constants.py | AWS S3 credentials (JSON: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, …). |
| MAD_MINIO | core/constants.py | MinIO credentials (JSON: MINIO_ENDPOINT, AWS_ENDPOINT_URL_S3, …). |
| PUBLIC_GITHUB_ROCM_KEY | core/constants.py | GitHub ROCm key (JSON). |
| ROCM_TRACE_LITE_FOLLOW_LATEST | tools | Set 1 to always pull latest RTL wheel. |
| ROCM_TRACE_LITE_WHEEL_URL | tools | Override RTL wheel URL (air-gapped installs). |
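The MAD_SECRETS* row describes a prefix-based copy into two context keys. A minimal sketch, assuming the host value wins over any existing entry (the source does not state precedence); `propagate_mad_secrets` is a hypothetical name.

```python
from typing import Any, Dict, Mapping


def propagate_mad_secrets(
    env: Mapping[str, str], context: Mapping[str, Any]
) -> Dict[str, Any]:
    """Copy every MAD_SECRETS* variable into both Docker-facing dicts."""
    out = dict(context)
    build_args = dict(out.get("docker_build_arg", {}))
    run_env = dict(out.get("docker_env_vars", {}))
    for name, value in env.items():
        if name.startswith("MAD_SECRETS"):
            build_args[name] = value  # available during docker build
            run_env[name] = value     # available during docker run
    out["docker_build_arg"] = build_args
    out["docker_env_vars"] = run_env
    return out
```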
Set by madengine in Docker containers
| Variable | Set by | Value / source |
|---|---|---|
| MAD_GPU_VENDOR | context.py | "AMD" or "NVIDIA" |
| MAD_SYSTEM_NGPUS | context.py | Total GPU count on host |
| MAD_SYSTEM_GPU_ARCHITECTURE | context.py | GPU arch string (e.g. "gfx90a") |
| MAD_SYSTEM_HIP_VERSION | context.py | HIP version string |
| MAD_SYSTEM_GPU_PRODUCT_NAME | context.py | GPU product name |
| MAD_GUEST_OS | container_runner | "UBUNTU" or "CENTOS" |
| MAD_RUNTIME_NGPUS | container_runner | GPU count allocated for this specific run |
| MAD_MULTI_NODE_RUNNER | container_runner | Distributed launcher command (e.g. torchrun --standalone --nproc_per_node=8) |
| MAD_MODEL_NAME | container_runner | Model name from model definition |
| MAD_OUTPUT_CSV | container_runner | Path for multiple_results CSV output |
| ROCM_PATH | container_runner | Resolved in-container ROCm root |
| JENKINS_BUILD_NUMBER | container_runner | CI build number (from shell env if set) |
| RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES | container_runner | Force-set to "" when HIP_VISIBLE_DEVICES is active (AMD+Ray fix) |
Set by SLURM job script (job.sh.j2)
| Variable | Value |
|---|---|
| MAD_DEPLOYMENT_TYPE | "slurm" |
| MAD_SLURM_JOB_ID | SLURM job ID |
| MAD_NODE_RANK | This node's rank (0-indexed) |
| MAD_TOTAL_NODES | Total node count |
| MAD_IN_SLURM_JOB | "1" |
| MAD_LAUNCHER_TYPE | Launcher type string |
| MASTER_ADDR | Head node hostname (via scontrol) |
| MASTER_PORT | Communication port (default 29500) |
| WORLD_SIZE | Total GPU processes (nodes × GPUs/node) |
| NNODES | Node count |
| GPUS_PER_NODE | GPU count per node |
| NODE_RANK | This node's rank |
| TORCH_ELASTIC_RDZV_TIMEOUT | 3600 |
| MIOPEN_USER_DB_PATH | /tmp/.miopen/node_${SLURM_PROCID}_rank_${LOCAL_RANK:-0} |
| HIP_VISIBLE_DEVICES | GPU indices for this node's processes |
| ROCR_VISIBLE_DEVICES | GPU indices (not set for Ray-based launchers) |
| CUDA_VISIBLE_DEVICES | GPU indices (not set for Ray-based launchers) |
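To make the rendezvous arithmetic concrete, the sketch below derives the per-node values from the job shape, with WORLD_SIZE computed as nodes × GPUs per node. `slurm_rendezvous_env` is a hypothetical helper; the real values are set in the job.sh.j2 template, not in Python.

```python
from typing import Dict


def slurm_rendezvous_env(
    nnodes: int,
    gpus_per_node: int,
    node_rank: int,
    master_addr: str,
    master_port: int = 29500,
) -> Dict[str, str]:
    """Return the rendezvous-related variables for one node as strings."""
    if not 0 <= node_rank < nnodes:
        raise ValueError("node_rank must be in [0, nnodes)")
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        "NNODES": str(nnodes),
        "GPUS_PER_NODE": str(gpus_per_node),
        "NODE_RANK": str(node_rank),
        # One process per GPU across all nodes.
        "WORLD_SIZE": str(nnodes * gpus_per_node),
    }
```

For a 4-node job with 8 GPUs per node this gives a WORLD_SIZE of 32.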
Error types
Defined in src/madengine/core/errors.py. All inherit from MADEngineError(Exception) which carries: message, category, context (ErrorContext dataclass), cause, recoverable, suggestions (list). Rich panels are used for display.
| Class | Category | When raised |
|---|---|---|
| ValidationError | VALIDATION | Invalid CLI args, model field values, context key types. |
| NetworkError | CONNECTION | Registry connectivity, pull failures, MongoDB connection. |
| AuthenticationError | AUTHENTICATION | Registry login failure, invalid credentials format. |
| ExecutionError | RUNTIME | Container run failure, script non-zero exit, timeout. (RuntimeError is an alias.) |
| BuildError | BUILD | Docker build failure. |
| DiscoveryError | DISCOVERY | models.json parse failure, tag not found, no models matched. |
| OrchestrationError | ORCHESTRATION | Manifest load failure, incompatible build/run state. |
| RunnerError | RUNNER | ContainerRunner internal failure. |
| ConfigurationError | CONFIGURATION | slurm_multi registry gate violation, conflicting flags, missing required config. |
| DeploymentTimeoutError | TIMEOUT | SLURM/K8s job exceeded wall time. |
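The shape of such a hierarchy can be sketched as below. This is not the code in core/errors.py: the class and attribute names mirror the description above, but the enum values, the `ErrorContext` field, and the constructor defaults are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ErrorCategory(Enum):
    CONFIGURATION = "configuration"  # assumed value
    BUILD = "build"                  # assumed value


@dataclass
class ErrorContext:
    operation: str = ""  # hypothetical field


class MADEngineError(Exception):
    def __init__(
        self,
        message: str,
        category: ErrorCategory,
        context: Optional[ErrorContext] = None,
        cause: Optional[BaseException] = None,
        recoverable: bool = False,
        suggestions: Optional[List[str]] = None,
    ) -> None:
        super().__init__(message)
        self.message = message
        self.category = category
        self.context = context or ErrorContext()
        self.cause = cause
        self.recoverable = recoverable
        self.suggestions = list(suggestions or [])


class ConfigurationError(MADEngineError):
    def __init__(self, message: str, **kwargs) -> None:
        # Each subclass pins its own category.
        super().__init__(message, ErrorCategory.CONFIGURATION, **kwargs)
```

A caller can then catch `MADEngineError` once and branch on `category` or show `suggestions` to the user.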
Module reference
| Layer | Path | Contents |
|---|---|---|
| CLI | cli/app.py | Typer app, cli_main entry, --version, Rich traceback install. |
| CLI | cli/commands/build.py | madengine build: registry, batch, --use-image, --build-on-compute, mutex validation. |
| CLI | cli/commands/run.py | madengine run: manifest loading, all run flags, --force-mirror-local, --cleanup-perf. |
| CLI | cli/commands/discover.py | Model discovery command, scoped tag parsing. |
| CLI | cli/commands/report.py | report to-html / to-email sub-app. |
| CLI | cli/commands/database.py | MongoDB upload command. |
| CLI | cli/constants.py | ExitCode enum, DEFAULT_MANIFEST_FILE, DEFAULT_PERF_OUTPUT, DEFAULT_TIMEOUT=-1. |
| CLI | cli/validators.py | Argument validation: validate_additional_context(), create_args_namespace(). |
| Orch | orchestration/build_orchestrator.py | BuildOrchestrator.execute(): discover → context → build → registry gate → manifest. slurm_multi use-image / build-on-compute paths. |
| Orch | orchestration/run_orchestrator.py | RunOrchestrator.execute(): manifest loading, target inference, script copy/cleanup, local/distributed dispatch. |
| Orch | orchestration/image_filtering.py | Filters manifest entries by GPU vendor, GPU arch, skip_gpu_arch field. |
| Dep | deployment/factory.py | DeploymentFactory.create(). Registers SlurmDeployment + KubernetesDeployment. UserWarning if kubernetes package missing. |
| Dep | deployment/base.py | BaseDeployment (Template Method), DeploymentConfig, DeploymentResult (incl. skip_monitoring), DeploymentStatus, PERFORMANCE_LOG_PATTERN. |
| Dep | deployment/kubernetes.py | KubernetesDeployment: composes 6 mixins, orchestrates K8s job lifecycle. |
| Dep | deployment/k8s_pvc.py | PVC creation/deletion, storage-class fallback chain. |
| Dep | deployment/k8s_results.py | Log/artifact collection, perf aggregation, collector_pod_name(). |
| Dep | deployment/k8s_scripts.py | Script extraction, ConfigMap building (rocenv_mode, guest_os). |
| Dep | deployment/k8s_template_context.py | Assembles Jinja2 template context for K8s jobs. |
| Dep | deployment/k8s_secrets.py | secrets dict → K8s Secret objects. |
| Dep | deployment/k8s_names.py | Name truncation/sanitization helpers for K8s resource names. |
| Dep | deployment/kubernetes_launcher_mixin.py | Selects Jinja2 template per launcher; sets MAD_MULTI_NODE_RUNNER for K8s pods. |
| Dep | deployment/slurm.py | SlurmDeployment: template prep, sbatch submit, bash-in-salloc, slurm_multi dispatch, monitoring, results collection. |
| Dep | deployment/slurm_node_selector.py | SlurmNodeSelector: health/cleanup srun, reservation parameter, node preflight. |
| Dep | deployment/common.py | Shared helpers: VALID_LAUNCHERS, slurm_multi wrapper assembly, launcher normalization. |
| Dep | deployment/config_loader.py | ConfigLoader: deep-merge, preset loading, target inference. env_vars merged recursively (not replaced). |
| Dep | deployment/primus_backend.py | Primus YAML / backend selection helper. |
| Dep | deployment/presets/slurm/defaults.json | SLURM base preset. |
| Dep | deployment/presets/slurm/profiles/ | single-node.json, multi-node.json. |
| Dep | deployment/presets/k8s/defaults.json | K8s base preset. |
| Dep | deployment/presets/k8s/gpu-vendors/ | amd.json, nvidia.json, amd-multi-gpu.json. |
| Dep | deployment/presets/k8s/profiles/ | single-gpu.json, multi-gpu.json, multi-node.json. |
| Dep | deployment/templates/slurm/job.sh.j2 | Main sbatch template (~822 lines). Sets all SLURM env vars, runs srun task scripts. |
| Dep | deployment/templates/kubernetes/ | K8s YAML templates: configmap.yaml.j2, job.yaml.j2, pvc.yaml.j2, pvc-data.yaml.j2, service.yaml.j2. |
| Exec | execution/container_runner.py | ContainerRunner: local docker run, AMD/NVIDIA run options, env injection, tools, perf parsing, _run_self_managed(), _generate_local_launcher_command(). |
| Exec | execution/container_runner_helpers.py | Log error pattern scan, resolve_run_timeout(), make_run_log_file_path(). |
| Exec | execution/docker_builder.py | DockerBuilder: build args, --build-context tools= (conditional), registry push, DOCKER_IMAGE_NAME injection into manifest. |
| Exec | execution/dockerfile_utils.py | Dockerfile parsing: GPU vendor from filename + FROM line. |
| Core | core/context.py | Context: ast.literal_eval parse, GPU vendor/arch detection, ROCm path resolution, MAD_SECRETS* propagation, renderD mapping. |
| Core | core/additional_context_defaults.py | Default values merged before user context: DEFAULT_GPU_VENDOR="AMD", DEFAULT_GUEST_OS="UBUNTU". |
| Core | core/console.py | Console: Rich-backed shell executor, live output, timeout, secret=True for credential commands. |
| Core | core/docker.py | Docker wrapper: shlex.quote() on every interpolation, auto stop/remove on __del__. |
| Core | core/errors.py | 10-type error hierarchy, ErrorCategory, ErrorContext, ErrorHandler, Rich panel display. |
| Core | core/auth.py | load_credentials(), login_to_registry() using --password-stdin + MAD_REGISTRY_PASSWORD. |
| Core | core/timeout.py | Timeout context manager; guards signal.alarm(None) when seconds is 0/None. |
| Core | core/dataprovider.py | Data abstraction: local / NAS / S3 / MinIO. |
| Util | utils/discover_models.py | DiscoverModels: root, dir, dynamic discovery; scoped vs unscoped tags; CustomModel dataclass. |
| Util | utils/gpu_tool_factory.py | Singleton get_gpu_tool_manager(vendor, rocm_path); auto-detects vendor. |
| Util | utils/gpu_validator.py | GPUVendor enum, ROCmValidator, NVIDIAValidator, GPUValidationResult. |
| Util | utils/rocm_path_resolver.py | Host + in-container ROCm path resolution chains. |
| Util | utils/therock_markers.py | Shared TheRock detection markers (rocm-sdk, layout probes). |
| Util | utils/config_parser.py | ConfigParser: 5-level config file resolution, CSV/JSON/YAML loading, multi-row result matching. |
| Util | utils/session_tracker.py | Session start/marker tracking. |
| Rep | reporting/update_perf_csv.py | Writes/appends perf.csv and perf_entry.csv. PERF_CSV_HEADER (28 columns). |
| Rep | reporting/csv_to_html.py | HTML performance report generation. |
| Rep | reporting/csv_to_email.py | Email-friendly consolidated report. |
| Rep | reporting/update_perf_super.py | Superset-shaped perf rollups. |
| DB | database/mongodb.py | MongoDBConfig.from_env(), UploadOptions, UploadResult; upsert + batch upload. |
| Scripts | scripts/common/pre_scripts/rocEnvTool/ | rocenv_tool.py, csv_parser.py, console.py — TheRock-compatible env capture (lite + full modes). |
| Scripts | scripts/common/tools/ | GPU info profilers, amd_smi / rocm_smi utils, rtl_trace wrapper, library tracers (rocblas, miopen, rccl, tensile). |
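The ConfigLoader row mentions that env_vars are merged recursively rather than replaced. A generic deep-merge sketch of that behavior follows; `deep_merge` is a hypothetical name and the real loader may treat lists or None values differently.

```python
from typing import Any, Dict, Mapping


def deep_merge(base: Mapping[str, Any], override: Mapping[str, Any]) -> Dict[str, Any]:
    """Return base updated with override, merging nested dicts key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: recurse so sibling keys survive.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


preset = {"env_vars": {"NCCL_DEBUG": "WARN"}, "slurm": {"nodes": 1}}
user = {"env_vars": {"HSA_ENABLE_SDMA": "0"}, "slurm": {"nodes": 4}}
result = deep_merge(preset, user)
```

Here the preset's NCCL_DEBUG is kept alongside the user's HSA_ENABLE_SDMA, while the scalar `nodes` value is overridden.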
Test layout
unit/
Fast, isolated, mocked. Key files: test_slurm_multi.py, test_shell_quoting.py, test_error_handling.py, test_k8s.py, test_rocm_path.py, test_validators.py, test_deployment.py, test_container_runner.py.
integration/
Real Docker / GPU / platform calls. Includes test_docker_integration.py, test_container_execution.py, test_gpu_management.py, test_orchestrator_workflows.py, test_profiling_tools_config.py.
e2e/
Full workflows: test_build_workflows.py, test_run_workflows.py, test_profiling_workflows.py, test_data_workflows.py, test_execution_features.py, test_scripting_workflows.py.
| Marker | What it selects |
|---|---|
| unit | Fast unit tests with no external deps |
| integration | Tests requiring Docker / real GPU calls |
| e2e | Full end-to-end workflow tests |
| slow | Long-running tests |
| gpu | Requires GPU hardware |
| amd / nvidia | Vendor-specific tests |
| cpu | CPU-only tests |
| requires_docker | Tests requiring Docker daemon |
| requires_models | Tests requiring model files to be present |
Pytest config lives solely in [tool.pytest.ini_options] in pyproject.toml (minversion=7.0).
Contributing & code style
Style rules
- Formatting: Black (line-length 88), targets py3.8–py3.11
- Imports: isort with profile="black"; first-party = madengine
- Lint: flake8 + mypy (strict equality, warn unused) + bandit (skips B101)
- Docstrings: Google style; type hints required for public functions
- Conventional commits: feat:, fix:, docs:, test:, refactor:, style:, perf:, chore:
Security rules
- Use shlex.quote() on every shell interpolation of user-controlled values (image names, paths, container names, build-args)
- Registry passwords via --password-stdin (not command-line args); env var MAD_REGISTRY_PASSWORD
- Credential JSON must be a dict object, validated at load time (ConfigurationError on wrong type)
- MIOPEN_USER_DB_PATH is filtered from deployment_config to prevent leaking temp paths
- Never log secret values; log keys only
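As an illustration of the first rule, the sketch below quotes a user-controlled container name before it is interpolated into a shell command. `docker_rm_command` is a hypothetical helper; `shlex.quote()` is the standard-library call named in the rule.

```python
import shlex


def docker_rm_command(container_name: str) -> str:
    """Build a shell command with the container name safely quoted."""
    # shlex.quote leaves safe names untouched and single-quotes anything else.
    return f"docker rm -f {shlex.quote(container_name)}"
```

A hostile name such as `my job; rm -rf /` ends up inside single quotes and is passed to docker as one argument instead of being run as a second command.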
Changelog
[2.1.0] — 2026-05-28
Added
- slurm_multi self-managed SLURM launcher (PRs #130, #126): alias slurm-multi, parallel docker pull, bash-in-salloc path, _run_self_managed() for local mode
- madengine build --use-image [IMAGE|auto]: skip local build
- madengine build --build-on-compute: build on compute node + push
- slurm_multi registry gate with structured ConfigurationError
- DeploymentResult.skip_monitoring for synchronous deploy paths
- SlurmNodeSelector.reservation parameter
- DockerBuilder: --build-context tools= (conditional on dir existence, PR #131 + #134)
- Local MAD_MULTI_NODE_RUNNER via ContainerRunner._generate_local_launcher_command() (PR #126)
- Model card distributed / slurm auto-merged into manifest deployment_config
- DOCKER_IMAGE_NAME injection into manifest env_vars after successful registry push
Changed
- SLURM env-var escaping: double-quote instead of shlex.quote to preserve spaces/paths (PR #134)
- Early DiscoverModels result cached and reused for actual build (no duplicate get_models_json.py runs)
- E2E test cleanup defaults include build_manifest.json + perf artefacts
[2.0.3] — 2026-05-26
- rocEnvTool "full" mode (lshw, dmidecode, dmesg, modinfo)
- K8s monolith decomposed into 6 focused mixin modules
- Generic storage_class fallback; default preset nfs-banff
- rocm_trace_lite_default tool (RTL default mode)
- Security: shlex.quote() on every shell interpolation
- Collector pod name mismatch fix (shared collector_pod_name() helper)
- CANCELLED added to terminal-state set
- Local MAD_MULTI_NODE_RUNNER for Docker local (_generate_local_launcher_command())
[2.0.2] / [2.0.1]
- Host ROCm auto-detection via priority chain; in-container ROCm resolved independently
- TheRock (rocm-sdk) layout support
- GPU arch auto-detection injected into Docker build args
- Model discovery: scope-based tag selection replaces strict flag
- Registry password via --password-stdin + env var
- credential.json type validation
- Unified PERFORMANCE_LOG_PATTERN across local + deployment paths
- Run-phase host/container env table printed at startup
[2.0.0] — 2026-04-09 — Complete rewrite
- Unified madengine CLI; legacy mad-* removed
- 5-layer architecture (CLI / Orchestration / Deployment / Execution / Core)
- Factory + Template Method patterns; DeploymentFactory, BaseDeployment, ConfigLoader
- Multi-target deployment: presets + Jinja2 templates per launcher
- Launcher matrix: torchrun / DeepSpeed / Megatron / TorchTitan / Primus / vLLM / SGLang
- Log error pattern scanning; --skip-model-run; batch build manifest
- Structured errors (10 types) with Rich panels; fixed exit codes
- SLURM nodelist pinning; K8s Secrets management; data provider abstraction