madengine — Codebase Wiki
AI/ML model automation and benchmarking platform for local Docker, Kubernetes and SLURM. This wiki reflects branch
develop. madengine is a streamlined CLI tool for running and benchmarking AI models on ROCm GPUs, offering a production‑ready workflow for local single node or remote multi node execution with integrated performance monitoring.
Overview
What it does
madengine is a Typer-based CLI (madengine) that discovers models from a
MAD package, builds Docker images, and runs them either locally or on distributed
backends (Kubernetes, SLURM). It writes performance results to perf.csv
and can generate HTML reports or upload to MongoDB.
Entry point: src/madengine/cli/app.py::cli_main
(registered as the madengine console script in pyproject.toml).
Why this branch matters
The add_slurm_multi_launcher branch adds a self-managed multi-node SLURM launcher
so that workloads with their own per-node Docker orchestration (e.g. SGLang Disaggregated
prefill + decode + proxy) can run via a thin wrapper SBATCH that does not nest Docker
inside the job step. It adds --use-image / --build-on-compute build modes,
a registry gate, parallel image pull, and a bash-in-salloc execution path.
Quick start
# Install
pip install -e ".[dev]"
# Discover models
madengine discover --tags dummy
# Run locally (build + run)
madengine run --tags dummy \
--additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'
# Minimal K8s config — defaults applied automatically
madengine run --tags model \
--additional-context '{"k8s": {"gpu_count": 2}}'
# Multi-node vLLM
madengine run --tags model --additional-context '{
"k8s": {"namespace": "ml-team", "gpu_count": 8},
"distributed": {"launcher":"vllm","nnodes":2,"nproc_per_node":4}
}'
# Build phase (login node or CI) then deploy
madengine build --tags model --registry gcr.io/myproject
madengine run --manifest-file build_manifest.json \
--additional-context '{
"slurm":{"partition":"gpu","nodes":4,"gpus_per_node":8,"time":"24:00:00"},
"distributed":{"launcher":"torchtitan","nnodes":4,"nproc_per_node":8}
}'
# slurm_multi — for workloads that run their own docker via srun
madengine run --tags pyt_sglang_disagg_qwen3-32b_short \
--additional-context '{
"slurm":{"partition":"gpu","nodes":3,"gpus_per_node":8,"time":"02:00:00"},
"distributed":{"launcher":"slurm_multi"}
}'
# Build on a compute node, push, then have run pull in parallel
madengine build --tags model --build-on-compute --registry myreg.io/team
# or skip build entirely and use a pre-baked image
madengine build --tags model --use-image auto
Install & dev
Setup
pip install -e ".[dev]" # base + dev
pip install -e ".[all]" # + kubernetes
pre-commit install
Test & quality
pytest # all tests
pytest tests/unit/test_slurm_multi.py -v
pytest --cov=src/madengine --cov-report=html
pytest -m "not slow"
black src/ tests/ && isort src/ tests/
flake8 src/ tests/
mypy src/madengine
pre-commit run --all-files
5-layer architecture
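Each layer talks only to the one below it. The five layers are CLI, Orchestration, Deployment, Execution, and Core; the Utils and Reporting rows in the table list packages that sit outside those five.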
| Layer | Path | Responsibilities | Key types |
|---|---|---|---|
| CLI | src/madengine/cli/ | Typer app, command parsing, Rich output, exit-code mapping. | app.py, commands/{build,run,discover,report,database}.py, constants.ExitCode |
| Orchestration | src/madengine/orchestration/ | Discover → build → run pipeline. Decides whether to dispatch locally or to a deployment. | BuildOrchestrator, RunOrchestrator, image_filtering.py |
| Deployment | src/madengine/deployment/ | Factory + K8s/SLURM concrete deployments, preset merging, Jinja2 templates, monitoring. | DeploymentFactory, BaseDeployment, KubernetesDeployment, SlurmDeployment |
| Execution | src/madengine/execution/ | Local Docker build/run, log scanning, timeout resolution, perf parsing. | ContainerRunner, DockerBuilder, container_runner_helpers.py |
| Core | src/madengine/core/ | Cross-cutting primitives: context merging, console, docker wrapper, errors, auth, timeout. | Context, Console, Docker, MADEngineError, load_credentials |
| Utils | src/madengine/utils/ | Discovery, GPU vendor abstraction, ROCm path resolution, config parsing. | DiscoverModels, gpu_tool_factory, rocm_path_resolver, ConfigParser |
| Reporting | src/madengine/reporting/ | perf.csv writers, HTML/email report generation. | update_perf_csv, csv_to_html, csv_to_email |
Key data flows
Build flow
- madengine build → BuildOrchestrator.execute()
- DiscoverModels resolves --tags against the MAD package (root models.json, scripts/{dir}/models.json, or scripts/{dir}/get_models_json.py).
- Each model is materialised through Context (system + user additional_context) and passed to DockerBuilder.
- Optionally tags & pushes to --registry.
- Writes build_manifest.json consumed by run.
Special build modes on this branch:
- --use-image [IMAGE|auto]: skip local build, use a prebuilt image (auto resolves env_vars.DOCKER_IMAGE_NAME from the model card). Mutually exclusive with --registry and --build-on-compute.
- --build-on-compute: build on a SLURM compute node and push to --registry; manifest carries built_on_compute: true.
Run flow
- madengine run → RunOrchestrator loads existing manifest or triggers a build.
- Target inference (Convention over Configuration):
  - "k8s" / "kubernetes" in context → KubernetesDeployment
  - "slurm" in context → SlurmDeployment
  - distributed.launcher == "slurm_multi" → slurm_multi path
  - neither → ContainerRunner (local Docker)
- scripts/common/ is populated from the package (pre_scripts, post_scripts, tools) and cleaned up afterwards.
- Per-model results parsed via PERFORMANCE_LOG_PATTERN and appended to perf.csv / perf_entry.csv. Failed runs are still recorded with STATUS=FAILURE.
additional_context — the configuration spine
--additional-context accepts a JSON or Python-dict string (parsed with
ast.literal_eval(), not json.loads) or a path to a JSON file.
It is merged into Context.ctx alongside system-detected values
(GPU vendor, architecture, OS, ROCm path). Specific keys drive different subsystems.
| Key | Where it goes | What it does |
|---|---|---|
| gpu_vendor | Core | AMD or NVIDIA. Defaults to AMD if missing. |
| guest_os | Core | UBUNTU or CENTOS; selects package manager for in-container installs. |
| MAD_ROCM_PATH | Core | Override host ROCm root (top-level only). |
| docker_env_vars | Execution | Env vars injected into the container. docker_env_vars.MAD_ROCM_PATH overrides in-container ROCm root independently of host. |
| docker_gpus | Execution | Comma-separated list of GPU indices, or all. |
| k8s / kubernetes | Deployment | Selects K8s. Merged with preset defaults; supports namespace, gpu_count, storage class fallback chain (data_storage_class → nfs_storage_class → storage_class). |
| slurm | Deployment | Selects SLURM. Supports partition, nodes, gpus_per_node, time, exclusive, reservation, nodelist. Setting nodelist also skips automatic node health preflight. |
| distributed.launcher | Deployment | torchrun, deepspeed, megatron, torchtitan, primus, vllm, sglang, sglang_disagg, slurm_multi / slurm-multi. |
| distributed.nnodes / nproc_per_node | Deployment | Topology hints for launcher templates. |
| tools | Execution | List of profilers/tracers to enable, e.g. [{"name":"rocprofv3_compute"}]. |
| rocenv_mode | Execution | "lite" (default) or "full"; full collects lshw / dmidecode / dmesg / modinfo and makes a best-effort install of missing tools per guest_os. |
| log_error_pattern_scan | Execution | false disables post-run log substring scan (use when pytest/JUnit is authoritative). |
| log_error_patterns / log_error_benign_patterns | Execution | Override or extend the failure-substring lists. |
| pre_scripts / post_scripts | Execution | Custom scripts to run before/after the model. |
| secrets | Deployment (K8s) | Auto-converted to a K8s Secret and mounted as env vars. |
Context parses inline strings with ast.literal_eval(), so pass a Python dict literal.
In a shell, wrap the whole argument in single quotes and use double quotes inside.
JSON text parses too, provided it contains no true, false, or null: those tokens are not Python literals, so write True, False, or None in inline strings.
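The difference between the two parsers is easy to check in isolation. The sketch below assumes the inline string reaches ast.literal_eval() unchanged, as described above; parse_inline_context is a hypothetical helper written for this illustration, not a madengine function.

```python
import ast
import json


def parse_inline_context(text: str) -> dict:
    """Hypothetical helper: parse an inline context string with ast.literal_eval."""
    value = ast.literal_eval(text)
    if not isinstance(value, dict):
        raise ValueError("additional context must be a dict literal")
    return value


# JSON-style text without booleans or null is also a valid Python literal.
ctx = parse_inline_context('{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}')

# A Python dict repr with single quotes and a Python boolean parses as well.
py_ctx = parse_inline_context("{'log_error_pattern_scan': False}")

# JSON booleans are not Python literals, so literal_eval raises ValueError.
try:
    parse_inline_context('{"log_error_pattern_scan": false}')
    json_bool_accepted = True
except ValueError:
    json_bool_accepted = False

# json.loads accepts the same text and maps false to Python's False.
json_ctx = json.loads('{"log_error_pattern_scan": false}')
```

A context file on disk is a separate case: this sketch only covers strings passed inline on the command line.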
CLI commands
| Command | Source | Purpose | Notable flags |
|---|---|---|---|
| discover | cli/commands/discover.py | List/validate models matching tags. | --tags (scoped: MAD/foo, dynamic: dummy3:dummy_3:batch=512) |
| build | cli/commands/build.py | Build Docker images; write build_manifest.json. | --registry, --target-archs, --batch-manifest, --clean-docker-cache, --use-image (new), --build-on-compute (new) |
| run | cli/commands/run.py | Run models from manifest or trigger a build first. | --manifest-file, --additional-context[-file], --skip-model-run, --live-output, --keep-alive, --verbose, --timeout |
| report | cli/commands/report.py | Convert perf CSVs to HTML/email. | Sub-apps: to-html --csv-file …, to-email --directory … |
| database | cli/commands/database.py | Upload perf CSV to MongoDB. | --csv-file, --database-name, --collection-name (uses MONGO_HOST/USER/PASSWORD env) |
Exit codes (CI contract)
From src/madengine/cli/constants.py::ExitCode. Use these in pipelines instead of log scraping.
| Code | Name | Meaning |
|---|---|---|
| 0 | SUCCESS | All operations succeeded. |
| 1 | FAILURE | General/unhandled failure. |
| 2 | BUILD_FAILURE | One or more image builds failed. |
| 3 | RUN_FAILURE | One or more model runs failed (still written to perf.csv with status FAILURE). |
| 4 | INVALID_ARGS | Argument validation rejected the invocation. |
When a CI step captures logs with ... 2>&1 | tee madengine.run.log, run it under bash -o pipefail
so the step's exit code is still madengine's, not tee's.
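A pipeline written in Python can branch on these codes directly. The enum below is a local mirror of the table above, written for illustration; the real ExitCode in cli/constants.py may be defined differently, and describe and run_step are hypothetical helpers.

```python
import subprocess
from enum import IntEnum
from typing import List


class ExitCode(IntEnum):
    """Local mirror of the exit-code table; not imported from madengine."""

    SUCCESS = 0
    FAILURE = 1
    BUILD_FAILURE = 2
    RUN_FAILURE = 3
    INVALID_ARGS = 4


def describe(returncode: int) -> str:
    """Map a process return code to its documented name."""
    try:
        return ExitCode(returncode).name
    except ValueError:
        # Codes outside the table (for example, signals) are reported as-is.
        return f"UNKNOWN({returncode})"


def run_step(argv: List[str]) -> str:
    """Run a command without raising on failure and classify its exit code."""
    completed = subprocess.run(argv, check=False)
    return describe(completed.returncode)
```

For example, run_step(["madengine", "run", "--tags", "dummy"]) would return "RUN_FAILURE" when the command exits with 3, which lets a CI job retry or skip stages without reading logs.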
Deployment target inference
No explicit deploy field exists. The factory inspects additional_context:
| Trigger | Class | Source |
|---|---|---|
| no k8s/slurm key | Local ContainerRunner | execution/container_runner.py |
| "k8s" or "kubernetes" key | KubernetesDeployment | deployment/kubernetes.py |
| "slurm" key | SlurmDeployment | deployment/slurm.py |
| distributed.launcher == "slurm_multi" | slurm_multi path (within Slurm) | deployment/slurm.py + common.py |
The mixin deployment/kubernetes_launcher_mixin.py selects the correct Jinja2 template under src/madengine/deployment/templates/{kubernetes,slurm}/ per launcher.
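The trigger table can be read as a short decision function. This is a sketch only: infer_target is a hypothetical name, the order in which k8s and slurm keys are checked when both are present is an assumption, and so is the choice to require a "slurm" key before the slurm_multi path applies.

```python
from typing import Any, Dict


def infer_target(additional_context: Dict[str, Any]) -> str:
    """Return a label for the deployment target implied by the context keys."""
    distributed = additional_context.get("distributed") or {}
    # The hyphenated alias "slurm-multi" is treated like "slurm_multi".
    launcher = str(distributed.get("launcher", "")).replace("-", "_")

    # Assumption: Kubernetes keys are checked before the SLURM key.
    if "k8s" in additional_context or "kubernetes" in additional_context:
        return "kubernetes"
    if "slurm" in additional_context:
        # Assumption: slurm_multi is a sub-path of the SLURM deployment.
        return "slurm_multi" if launcher == "slurm_multi" else "slurm"
    # Neither key present: fall back to local Docker execution.
    return "local"
```

With this reading, a context holding only {"k8s": {"gpu_count": 2}} selects Kubernetes, and an empty context runs locally.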
slurm_multi launcher (branch focus)
What it is
A minimal-but-additive SLURM launcher for workloads that orchestrate their own per-node
Docker containers via srun — for example SGLang Disaggregated (proxy +
prefill + decode topologies) or anything that needs to call srun / scontrol from
inside the job script.
The launcher generates a wrapper SBATCH script that runs the model's .slurm script
directly on bare metal (not inside a container), so the workload can spawn its own
per-node containers without the outer job step holding a container open.
How to pick it
{
"slurm": {"partition":"gpu","nodes":3,"gpus_per_node":8,"time":"02:00:00"},
"distributed": {"launcher": "slurm_multi"}
}
The hyphenated alias "slurm-multi" is also accepted as the launcher value.
Honors model-card + context slurm fields:
partition, nodes, gpus_per_node, time,
exclusive, reservation, nodelist.
Build modes added with this launcher
| Mode | Flag | Behaviour |
|---|---|---|
| Local build (default) | — | Normal madengine build. |
| Use prebuilt image | --use-image [IMAGE\|auto] | Skip local build. auto resolves to the model card's env_vars.DOCKER_IMAGE_NAME. Mutually exclusive with --registry and --build-on-compute. |
| Build on compute | --build-on-compute (requires --registry) | Build on a SLURM compute node, push to registry; manifest sets built_on_compute: true. run then does parallel srun docker pull on all allocated nodes. |
| Implicit auto-use-image | none | If build finds a slurm_multi model and none of --registry / --use-image / --build-on-compute is set, it either auto-resolves the model card's DOCKER_IMAGE_NAME or raises a structured ConfigurationError listing the four supported options. |
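The flag rules in the table amount to a small validation step. The sketch below is one hypothetical way to express them: validate_build_flags and the local ConfigurationError stand-in are illustrative names rather than madengine's actual validator, and the implicit auto-use-image case is left out.

```python
from typing import Optional


class ConfigurationError(ValueError):
    """Local stand-in for madengine's ConfigurationError (illustration only)."""


def validate_build_flags(
    use_image: Optional[str],
    build_on_compute: bool,
    registry: Optional[str],
) -> str:
    """Return the build mode implied by the flags, or raise on a bad combination."""
    # --use-image cannot be combined with --registry or --build-on-compute.
    if use_image is not None and (registry or build_on_compute):
        raise ConfigurationError(
            "--use-image is mutually exclusive with --registry and --build-on-compute"
        )
    # --build-on-compute pushes the image, so it needs a registry to push to.
    if build_on_compute and not registry:
        raise ConfigurationError("--build-on-compute requires --registry")
    if use_image is not None:
        return "use_image"
    if build_on_compute:
        return "build_on_compute"
    return "local_build"
```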
Execution paths
- sbatch (default): wrapper SBATCH submitted to SLURM.
- bash-in-salloc: when SLURM_JOB_ID is already set (inside an existing salloc), the slurm_multi launcher runs the wrapper synchronously with bash instead of nesting sbatch. Other launchers keep using sbatch even inside salloc. Uses DeploymentResult.skip_monitoring=True to skip the monitor poll.
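The choice between the two paths depends only on the launcher and on whether SLURM_JOB_ID is set. A minimal sketch of that decision follows; SubmissionPlan and plan_submission are hypothetical names, and the real launcher builds its command elsewhere in the deployment code.

```python
import os
from typing import List, Mapping, NamedTuple, Optional


class SubmissionPlan(NamedTuple):
    """How the wrapper script would be started (illustrative structure)."""

    argv: List[str]
    skip_monitoring: bool


def plan_submission(
    wrapper_path: str,
    launcher: str,
    env: Optional[Mapping[str, str]] = None,
) -> SubmissionPlan:
    """Pick sbatch or synchronous bash for the wrapper script."""
    env = os.environ if env is None else env
    inside_allocation = bool(env.get("SLURM_JOB_ID"))
    if launcher == "slurm_multi" and inside_allocation:
        # Already inside salloc: run synchronously, nothing left to poll.
        return SubmissionPlan(["bash", wrapper_path], skip_monitoring=True)
    # Every other case submits a batch job that is monitored afterwards.
    return SubmissionPlan(["sbatch", wrapper_path], skip_monitoring=False)
```

Passing env explicitly keeps the decision testable without touching the process environment.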
Results aggregation
_collect_slurm_multi_results reads the per-job CSV at
/shared_inference/$USER/$JOBID/perf.csv and now also writes those rows
into cwd/perf.csv (copy if absent, append data rows if present), so the default
reporter (display_performance_table) finds them without extra args. Local + classic-SLURM
flows are unchanged.
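The copy-if-absent, append-if-present rule can be sketched with the standard csv module. This is not the body of _collect_slurm_multi_results; aggregate_perf_csv is a hypothetical function, and it assumes both files share the same header and column order and that an existing destination file ends with a newline.

```python
import csv
import shutil
from pathlib import Path


def aggregate_perf_csv(job_csv: Path, dest_csv: Path) -> int:
    """Merge a per-job perf CSV into dest_csv and return the number of data rows."""
    with job_csv.open(newline="") as src:
        rows = list(csv.reader(src))
    if not rows:
        return 0
    data_rows = rows[1:]  # everything after the header line

    if not dest_csv.exists():
        # First job: copy the whole file so the header comes along.
        shutil.copyfile(job_csv, dest_csv)
        return len(data_rows)

    # Later jobs: append data rows only, so the header is not repeated.
    with dest_csv.open("a", newline="") as dst:
        csv.writer(dst).writerows(data_rows)
    return len(data_rows)
```

Called once per finished job, this leaves a single header line in the destination followed by the rows of every job.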
Tests & examples
- tests/unit/test_slurm_multi.py: registry membership, hyphen alias normalization, env_vars-export contract against MAD-private PR #186's pyt_sglang_disagg_qwen3-32b_short model card.
- examples/slurm-configs/minimal/slurm-multi-minimal.json: reference config.
Recent commits on this branch (most recent first)
2e8f1a4 Merge remote-tracking branch 'upstream/develop' into add_slurm_multi_launcher
68d0bf3 fix(slurm_multi): address Copilot review on PR #124
dc3bc48 docs(slurm_multi): CHANGELOG entry + forward-compat TODO on --use-image
e84506a fix(slurm_multi): aggregate per-job perf.csv into cwd for dashboard reporter
e281e7e fix(deployment): add skip_monitoring to DeploymentResult for slurm_multi bash branch
f7af062 test(slurm_multi): contract tests + minimal example config
8a5e174 feat(cli): expose --use-image and --build-on-compute on madengine build
bd371fe feat(orchestration): build_on_compute, registry gate, parallel pull for slurm_multi
941d56d feat(deployment): add slurm_multi launcher (minimal additive)
Kubernetes deployment
Decomposed (v2.0.3) into focused mixins composed by KubernetesDeployment:
| Module | Concern |
|---|---|
| k8s_pvc.py | PVC lifecycle (data PVC, single-node results PVC). |
| k8s_results.py | Log/artifact collection, performance aggregation. Uses the shared collector_pod_name() helper so cleanup matches the truncated collector-{deployment_id[:15]} name. |
| k8s_scripts.py | Script extraction, ConfigMap building. |
| k8s_template_context.py | Jinja2 template context assembly. |
| kubernetes_launcher_mixin.py | Per-launcher template selection. |
| k8s_secrets.py | secrets dict → K8s Secret objects → env vars. |
| k8s_pvc.py | Storage-class fallback: data_storage_class → nfs_storage_class → storage_class; single_node_results_storage_class → local_path_storage_class → storage_class. Default bundled preset: storage_class: "nfs-banff". |
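Both fallback chains in the last row follow the same first-match rule. The sketch below shows that rule under the assumption that a missing or empty value means "not set"; resolve_storage_class and the two chain constants are hypothetical names, not the k8s_pvc.py API.

```python
from typing import Any, Mapping, Optional, Sequence

# Key order taken from the fallback chains described in the table.
DATA_CHAIN = ("data_storage_class", "nfs_storage_class", "storage_class")
RESULTS_CHAIN = (
    "single_node_results_storage_class",
    "local_path_storage_class",
    "storage_class",
)


def resolve_storage_class(
    k8s_config: Mapping[str, Any], chain: Sequence[str]
) -> Optional[str]:
    """Return the first non-empty value found along the fallback chain."""
    for key in chain:
        value = k8s_config.get(key)
        if value:  # assumption: None and "" both count as unset
            return str(value)
    return None
```

With only storage_class: "nfs-banff" configured, both chains resolve to "nfs-banff"; setting data_storage_class overrides it for the data PVC alone.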
A job can show FAILED in the results table even though the pod actually succeeded.
This happens when the kubelet returns 502 between job completion and log collection,
so madengine cannot parse perf metrics. PVC artifacts are still collected.
Check kubectl describe pod <pod>.
Launcher matrix
| Launcher | Local | K8s | SLURM | Type | Notes |
|---|---|---|---|---|---|
| torchrun | ✅ | ✅ | ✅ | Train | DDP / FSDP, elastic. |
| DeepSpeed | ✅ | ✅ | ✅ | Train | ZeRO, pipeline parallelism. |
| Megatron-LM | ✅ | ✅ | ✅ | Train | TP + PP, large transformers. |
| TorchTitan | ✅ | ✅ | ✅ | Train | FSDP2 + TP + PP + CP, Llama 3.1 8B–405B. |
| Primus | ✅ | ✅ | ✅ | Train | Megatron / TorchTitan / MaxText via Primus YAML. |
| vLLM | ✅ | ✅ | ✅ | Infer | v1 engine, PagedAttention. |
| SGLang | ✅ | ✅ | ✅ | Infer | RadixAttention, structured gen. |
| SGLang Disagg | ❌ | ✅ | ✅ | Infer | Disagg prefill/decode, Mooncake, 3+ nodes. |
| slurm_multi (branch) | ❌ | ❌ | ✅ | Meta | Self-managed multi-node SLURM wrapper for workloads with their own per-node container orchestration. |
Profiling & tracing
Enable via --additional-context '{"tools":[{"name":"…"}]}'. Stackable.
| Tool | Purpose | Output |
|---|---|---|
| rocprof | Legacy GPU kernel profiling | Kernel timings/occupancy |
| rocprofv3_compute | Compute-bound (ROCm ≥ 7.0) | ALU, wave execution |
| rocprofv3_memory | Memory-bound | Cache hits, bandwidth |
| rocprofv3_communication | Multi-GPU | RCCL traces |
| rocprofv3_full | Comprehensive | All metrics, high overhead |
| rocprofv3_lightweight | Minimal overhead | HIP + kernel traces |
| rocprofv3_perfetto | Perfetto UI traces | Perfetto JSON |
| rocprofv3_api_overhead | API call timing | API timings |
| rocprofv3_pc_sampling | Kernel hotspots | PC sample histograms |
| rocm_trace_lite | RTL lite dispatch trace | rocm_trace_lite_output/trace.db |
| rocm_trace_lite_default | RTL default mode | Same paths, broader coverage |
| rocblas_trace / miopen_trace / tensile_trace / rccl_trace | Library call tracing | Per-library log |
| gpu_info_power_profiler / gpu_info_vram_profiler | Power / VRAM over time | CSV time series |
| therock_check | TheRock ROCm validation | Detection report |
Do not combine rocm_trace_lite with rocprof /
rocprofv3_* in the same run. RTL installs from a pinned GitHub release wheel by
default; set ROCM_TRACE_LITE_FOLLOW_LATEST=1 or
ROCM_TRACE_LITE_WHEEL_URL=… for latest / air-gapped installs.
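When several tools are stacked from a script, generating the --additional-context value avoids hand-quoting mistakes. The helper below is a hypothetical convenience, not part of madengine; it relies on the fact that a JSON object made only of strings, lists, and dicts is also a valid Python literal.

```python
import json
import shlex
from typing import Iterable


def tools_context_arg(tool_names: Iterable[str]) -> str:
    """Build a shell-quoted --additional-context value enabling the given tools."""
    context = {"tools": [{"name": name} for name in tool_names]}
    # json.dumps emits double-quoted strings; shlex.quote wraps them for the shell.
    return shlex.quote(json.dumps(context))


# Example: two tools from the table above, stacked in one run.
command = (
    "madengine run --tags dummy --additional-context "
    + tools_context_arg(["rocprofv3_compute", "gpu_info_power_profiler"])
)
print(command)
```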
ROCm path resolution
Implemented in src/madengine/utils/rocm_path_resolver.py.
Host (build & tools)
- Top-level MAD_ROCM_PATH in --additional-context
- Auto-detect: /opt/rocm, /opt/rocm-*, TheRock rocm-sdk + markers, then rocminfo / amd-smi / rocm-smi on PATH
- ROCM_PATH env var
- /opt/rocm fallback
Set MAD_AUTO_ROCM_PATH=0 to disable scanning and use only the env var/default.
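The host chain can be sketched as a first-match lookup. This is a simplified illustration, not the code in rocm_path_resolver.py: resolve_host_rocm_path is a hypothetical name, and the filesystem and PATH scanning is replaced by an injected autodetect callable so the ordering stays visible.

```python
from typing import Any, Callable, Mapping, Optional


def resolve_host_rocm_path(
    additional_context: Mapping[str, Any],
    env: Mapping[str, str],
    autodetect: Callable[[], Optional[str]],
) -> str:
    """Walk the host priority chain and return the first ROCm root found."""
    # 1. An explicit top-level MAD_ROCM_PATH in the additional context wins.
    explicit = additional_context.get("MAD_ROCM_PATH")
    if explicit:
        return str(explicit)
    # 2. Auto-detection, unless MAD_AUTO_ROCM_PATH=0 disables scanning.
    if env.get("MAD_AUTO_ROCM_PATH", "1") != "0":
        detected = autodetect()
        if detected:
            return detected
    # 3. The ROCM_PATH environment variable, then 4. the /opt/rocm default.
    return env.get("ROCM_PATH") or "/opt/rocm"
```

In real use, env would be os.environ and autodetect would wrap the directory and tool probing listed above.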
In-container (AMD Docker runs)
- docker_env_vars.MAD_ROCM_PATH (consumed; not forwarded as-is)
- ROCM_PATH / ROCM_HOME from image OCI config (docker image inspect)
- In-image shell probe (docker run --rm)
- /opt/rocm with a warning
The run-phase environment table prints host vs container installation type
(apt / therock / unknown), ROCm/CUDA root, and version side-by-side.
Module reference
| Layer | Path | What it contains |
|---|---|---|
| CLI | cli/app.py | Typer app, cli_main entry, --version handling, rich traceback install. |
| CLI | cli/commands/build.py | madengine build command, registry options, batch builds, --use-image/--build-on-compute. |
| CLI | cli/commands/run.py | madengine run command, manifest loading, --skip-model-run. |
| CLI | cli/commands/discover.py | Model discovery command. |
| CLI | cli/commands/report.py | report to-html / to-email sub-app. |
| CLI | cli/commands/database.py | MongoDB upload command. |
| CLI | cli/constants.py | ExitCode enum. |
| CLI | cli/validators.py | Argument validation. |
| Orch | orchestration/build_orchestrator.py | BuildOrchestrator.execute(), discover → build, registry login, batch manifest, slurm_multi registry gate. |
| Orch | orchestration/run_orchestrator.py | RunOrchestrator, build phase, target inference, local Docker dispatch, slurm_multi result aggregation. |
| Orch | orchestration/image_filtering.py | Target-arch / tag filtering of manifest entries. |
| Dep | deployment/factory.py | DeploymentFactory.create(), registers SlurmDeployment + KubernetesDeployment; UserWarning if kubernetes pkg missing. |
| Dep | deployment/base.py | BaseDeployment, DeploymentConfig, DeploymentResult (incl. skip_monitoring), DeploymentStatus, PERFORMANCE_LOG_PATTERN, terminal states (COMPLETED/FAILED/CANCELLED). |
| Dep | deployment/kubernetes.py | Composes K8s mixins; orchestrates job lifecycle. |
| Dep | deployment/k8s_pvc.py | PVC creation/deletion + storage-class resolution. |
| Dep | deployment/k8s_results.py | Log/artifact collection, perf aggregation; collector_pod_name(). |
| Dep | deployment/k8s_scripts.py | Script extraction, ConfigMap building (carries rocenv_mode, guest_os). |
| Dep | deployment/k8s_template_context.py | Assembles Jinja2 template context. |
| Dep | deployment/k8s_secrets.py | secrets → K8s Secret objects. |
| Dep | deployment/k8s_names.py | Name truncation/sanitization helpers. |
| Dep | deployment/kubernetes_launcher_mixin.py | Selects K8s template per launcher. |
| Dep | deployment/slurm.py | SlurmDeployment; classic SLURM path; routes to slurm_multi when launcher matches. |
| Dep | deployment/slurm_node_selector.py | SlurmNodeSelector health/cleanup srun, supports reservation. |
| Dep | deployment/primus_backend.py | Primus YAML / backend selection. |
| Dep | deployment/common.py | Shared deployment helpers, slurm_multi wrapper assembly. |
| Dep | deployment/config_loader.py | Loads and deep-merges preset JSON with user config. |
| Dep | deployment/presets/{k8s,slurm}/defaults.json | Default values auto-merged with minimal user configs. |
| Dep | deployment/templates/{kubernetes,slurm}/ | Jinja2 templates per launcher. |
| Exec | execution/container_runner.py | ContainerRunner: local docker run, env injection (MAD_GUEST_OS, MAD_OUTPUT_CSV), tools wiring, perf parsing. |
| Exec | execution/container_runner_helpers.py | Log error pattern scan, timeout resolution. |
| Exec | execution/docker_builder.py | DockerBuilder: build args (incl. MAD_SYSTEM_GPU_ARCHITECTURE), push/tag, shell-quoted everywhere. |
| Exec | execution/dockerfile_utils.py | Dockerfile parsing helpers. |
| Core | core/context.py | Context: ast.literal_eval parse, system detect, GPU vendor/arch, ROCm path; guards against None kfd_renderDs entries on restricted ROCm. |
| Core | core/additional_context_defaults.py | Default values merged into context. |
| Core | core/console.py | Console: Rich-backed shell wrapper, live output mode. |
| Core | core/docker.py | Docker wrapper; shlex.quote() on every interpolation. |
| Core | core/errors.py | MADEngineError + 10 typed errors; create_error_context; Rich panels. |
| Core | core/auth.py | load_credentials(), login_to_registry() (uses --password-stdin + MAD_REGISTRY_PASSWORD env). |
| Core | core/timeout.py | Timeout context manager; guards signal.alarm(None) when seconds is 0/None. |
| Core | core/constants.py | Misc core constants. |
| Core | core/dataprovider.py | Data: local / NAS / S3 / MinIO abstraction. |
| Util | utils/discover_models.py | DiscoverModels: root, dir, or dynamic discovery; scoped vs unscoped tags. |
| Util | utils/gpu_tool_factory.py | Returns AMD or NVIDIA tool manager based on vendor. |
| Util | utils/gpu_tool_manager.py | Abstract GPU tool manager interface. |
| Util | utils/rocm_tool_manager.py | AMD/ROCm implementation. |
| Util | utils/nvidia_tool_manager.py | NVIDIA implementation. |
| Util | utils/gpu_validator.py | ROCm install detection, GPU vendor detection. |
| Util | utils/gpu_config.py | GPU configuration helpers. |
| Util | utils/rocm_path_resolver.py | Host/in-container ROCm root resolver. |
| Util | utils/therock_markers.py | Shared TheRock detection markers. |
| Util | utils/config_parser.py | ConfigParser: parses additional context + tools config. |
| Util | utils/path_utils.py | Path helpers. |
| Util | utils/session_tracker.py | Session start/marker tracking. |
| Util | utils/ops.py | Misc operations. |
| Util | utils/log_formatting.py | Log formatting helpers. |
| Util | utils/run_details.py | Run metadata helpers. |
| Rep | reporting/update_perf_csv.py | Writes/appends to perf.csv and perf_entry.csv. |
| Rep | reporting/csv_to_html.py | HTML report generation. |
| Rep | reporting/csv_to_email.py | Email-friendly consolidated report. |
| Rep | reporting/update_perf_super.py | Superset-shaped perf rollups. |
| DB | database/mongodb.py | MongoDB connection + insert; uses datetime.now(timezone.utc). |
| Scripts | scripts/common/pre_scripts/rocEnvTool/ | rocenv_tool.py, csv_parser.py, console.py — TheRock-compatible env capture (lite + full modes). |
| Scripts | scripts/common/tools/ | GPU info profilers, amd_smi / rocm_smi utils, rtl_trace wrapper, library tracers. |
Test layout
unit/
Fast, isolated, mocked. ~28 modules including test_slurm_multi.py, test_shell_quoting.py, test_error_handling.py, test_k8s.py, test_rocm_path.py, test_validators.py.
integration/
Real Docker / GPU / platform calls. Includes test_docker_integration.py, test_container_execution.py, test_gpu_management.py, test_orchestrator_workflows.py, test_profiling_tools_config.py.
e2e/
Full workflows: test_build_workflows.py, test_run_workflows.py, test_profiling_workflows.py, test_data_workflows.py, test_execution_features.py, test_scripting_workflows.py.
Pytest config lives solely in [tool.pytest.ini_options] in pyproject.toml (minversion=7.0). Markers: unit, integration, e2e, slow, gpu, amd, nvidia, cpu, requires_docker, requires_models.
Contributing & code style
- Formatting: Black (line-length 88), targets py38–py311.
- Imports: isort with profile = "black"; first-party = madengine.
- Lint: flake8 + mypy (strict equality, warn unused, etc.) + bandit (skips B101).
- Docstrings: Google style; type hints for public functions.
- Conventional commits: feat:, fix:, docs:, test:, refactor:, style:, perf:, chore:.
- Pre-commit: pip install pre-commit && pre-commit install.
Recent notable changes
[Unreleased] — slurm_multi launcher
- New slurm_multi SLURM launcher; slurm-multi alias accepted.
- madengine build --use-image [IMAGE|auto] and --build-on-compute.
- Build registry gate with structured ConfigurationError.
- bash-in-salloc execution path when SLURM_JOB_ID is already set.
- DeploymentResult.skip_monitoring for synchronous deploys.
- SlurmNodeSelector accepts a reservation parameter.
- perf.csv aggregation into cwd so the default reporter sees per-job rows.
- Contract tests + minimal example config.
[2.0.3] — rocEnvTool full mode, K8s refactor, security
- K8s monolith decomposed into k8s_pvc / k8s_results / k8s_scripts / k8s_template_context mixins.
- rocEnvTool "full" mode (lshw, dmidecode, dmesg, modinfo) with guest_os-native installers.
- Generic storage_class fallback added; default preset now nfs-banff.
- rocm_trace_lite_default tool (RTL default mode).
- Security: shlex.quote() on every shell interpolation in core/docker.py, container_runner.py, docker_builder.py, run_orchestrator.py.
- Collector pod name mismatch fix (truncated collector-{id[:15]} shared helper).
- RPD pre-script: xxd install + sudo/root branch fixes.
- CANCELLED added to terminal-state set so scancel'd jobs don't loop forever.
- Context guards against None kfd_renderDs on restricted ROCm.
[2.0.2] / [2.0.1] — credential validation, ROCm auto-detect, GPU arch
- load_credentials() validates JSON object type, raises ConfigurationError.
- Host ROCm auto-detection via priority chain; in-container ROCm resolved independently.
- TheRock layout support (rocm-sdk + markers).
- GPU arch auto-detection injected into Docker build args for full-run mode.
- Model discovery: scope-based tag selection replaces strict flag.
- Shared login_to_registry, centralised credential loading.
- Registry password via env + --password-stdin (no more /proc exposure).
- Unified PERFORMANCE_LOG_PATTERN across local + deployment paths.
[2.0.0] — Complete rewrite
- Unified madengine CLI; legacy mad-* removed.
- 5-layer architecture (CLI / Orchestration / Deployment / Execution / Core).
- Multi-target deployment via factory + presets + Jinja2 templates.
- Launcher mixin with torchrun / DeepSpeed / Megatron-LM / TorchTitan / Primus / vLLM / SGLang.
- Log error pattern scanning; --skip-model-run; batch build manifest.
- SLURM nodelist pinning; K8s Secrets management.
- Structured errors (10 types) with Rich panels; fixed exit codes.
- RuntimeError renamed to ExecutionError (alias preserved).