madengine — Codebase Wiki

AI/ML model automation and benchmarking platform for local Docker, Kubernetes, and SLURM. This wiki reflects branch develop. madengine is a streamlined CLI tool for running and benchmarking AI models on ROCm GPUs. It offers a production-ready workflow for local single-node or remote multi-node execution, with integrated performance monitoring.

branch: develop · Python ≥ 3.8 · 5-layer CLI · Local / K8s / SLURM / slurm_multi · Typer + Rich · ROCm & CUDA

Overview

What it does

madengine is a Typer-based CLI (madengine) that discovers models from a MAD package, builds Docker images, and runs them either locally or on distributed backends (Kubernetes, SLURM). It writes performance results to perf.csv and can generate HTML reports or upload to MongoDB.

Entry point: src/madengine/cli/app.py::cli_main (registered as the madengine console script in pyproject.toml).

Why this branch matters

The add_slurm_multi_launcher branch adds a self-managed multi-node SLURM launcher so that workloads with their own per-node Docker orchestration (e.g. SGLang Disaggregated prefill + decode + proxy) can run via a thin wrapper SBATCH that does not nest Docker inside the job step. It adds --use-image / --build-on-compute build modes, a registry gate, parallel image pull, and a bash-in-salloc execution path.

Quick start

# Install
pip install -e ".[dev]"

# Discover models
madengine discover --tags dummy

# Run locally (build + run)
madengine run --tags dummy \
  --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'

# Minimal K8s config (defaults applied automatically)
madengine run --tags model \
  --additional-context '{"k8s": {"gpu_count": 2}}'

# Multi-node vLLM
madengine run --tags model --additional-context '{
  "k8s": {"namespace": "ml-team", "gpu_count": 8},
  "distributed": {"launcher":"vllm","nnodes":2,"nproc_per_node":4}
}'

# Build phase (login node or CI), then deploy
madengine build --tags model --registry gcr.io/myproject

madengine run --manifest-file build_manifest.json \
  --additional-context '{
    "slurm":{"partition":"gpu","nodes":4,"gpus_per_node":8,"time":"24:00:00"},
    "distributed":{"launcher":"torchtitan","nnodes":4,"nproc_per_node":8}
  }'

# slurm_multi: for workloads that run their own docker via srun
madengine run --tags pyt_sglang_disagg_qwen3-32b_short \
  --additional-context '{
    "slurm":{"partition":"gpu","nodes":3,"gpus_per_node":8,"time":"02:00:00"},
    "distributed":{"launcher":"slurm_multi"}
  }'

# Build on a compute node, push, then have run pull in parallel
madengine build --tags model --build-on-compute --registry myreg.io/team
# or skip build entirely and use a pre-baked image
madengine build --tags model --use-image auto

Install & dev

Setup

pip install -e ".[dev]"      # base + dev
pip install -e ".[all]"      # + kubernetes
pre-commit install

Test & quality

pytest                            # all tests
pytest tests/unit/test_slurm_multi.py -v
pytest --cov=src/madengine --cov-report=html
pytest -m "not slow"
black src/ tests/ && isort src/ tests/
flake8 src/ tests/
mypy src/madengine
pre-commit run --all-files

5-layer architecture

Each layer talks only to the one below it. Layers are color-coded throughout this wiki.

| Layer | Path | Responsibilities | Key types |
| --- | --- | --- | --- |
| CLI | src/madengine/cli/ | Typer app, command parsing, Rich output, exit-code mapping. | app.py, commands/{build,run,discover,report,database}.py, constants.ExitCode |
| Orchestration | src/madengine/orchestration/ | Discover → build → run pipeline. Decides whether to dispatch locally or to a deployment. | BuildOrchestrator, RunOrchestrator, image_filtering.py |
| Deployment | src/madengine/deployment/ | Factory + K8s/SLURM concrete deployments, preset merging, Jinja2 templates, monitoring. | DeploymentFactory, BaseDeployment, KubernetesDeployment, SlurmDeployment |
| Execution | src/madengine/execution/ | Local Docker build/run, log scanning, timeout resolution, perf parsing. | ContainerRunner, DockerBuilder, container_runner_helpers.py |
| Core | src/madengine/core/ | Cross-cutting primitives: context merging, console, docker wrapper, errors, auth, timeout. | Context, Console, Docker, MADEngineError, load_credentials |
| Utils | src/madengine/utils/ | Discovery, GPU vendor abstraction, ROCm path resolution, config parsing. | DiscoverModels, gpu_tool_factory, rocm_path_resolver, ConfigParser |
| Reporting | src/madengine/reporting/ | perf.csv writers, HTML/email report generation. | update_perf_csv, csv_to_html, csv_to_email |

Architecture diagram

  • CLI (Typer + Rich): discover · build · run · report · database → ExitCode { SUCCESS=0, BUILD_FAILURE=2, RUN_FAILURE=3, INVALID_ARGS=4 }
  • Orchestration
    • BuildOrchestrator: DiscoverModels → DockerBuilder → manifest
    • RunOrchestrator: load manifest → infer target → dispatch
    • image_filtering: arch/tag selection
  • Deployment (DeploymentFactory, inferred target): no key → local Docker · "k8s"/"kubernetes" → K8s Jobs · "slurm" → SLURM · distributed.launcher = "slurm_multi" → self-managed
    • Local (ContainerRunner)
    • KubernetesDeployment
    • SlurmDeployment
    • slurm_multi (this branch)
  • Launchers (training + inference): torchrun · DeepSpeed · Megatron-LM · TorchTitan · Primus · vLLM · SGLang · SGLang Disagg
  • Reporting: perf.csv · perf_entry.csv · csv_to_html · csv_to_email; report to-html · report to-email
  • Database: MongoDB upload (madengine database …)

Key data flows

Build flow

  1. madengine build → BuildOrchestrator.execute()
  2. DiscoverModels resolves --tags against the MAD package (root models.json, scripts/{dir}/models.json, or scripts/{dir}/get_models_json.py).
  3. Each model is materialised through Context (system + user additional_context) and passed to DockerBuilder.
  4. Optionally tags & pushes to --registry.
  5. Writes build_manifest.json consumed by run.

Special build modes on this branch:

  • --use-image [IMAGE|auto] — skip local build, use a prebuilt image (auto resolves env_vars.DOCKER_IMAGE_NAME from the model card). Mutually exclusive with --registry and --build-on-compute.
  • --build-on-compute — build on a SLURM compute node and push to --registry; manifest carries built_on_compute: true.
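
The flag rules above (--use-image excludes --registry and --build-on-compute, and --build-on-compute needs a registry to push to) can be sketched as a small validator. This is an illustration only: validate_build_flags and the stand-in ConfigurationError class are hypothetical names, not madengine's actual code, and the mode strings it returns are invented for the example.

```python
from typing import Optional


class ConfigurationError(ValueError):
    """Hypothetical stand-in for madengine's ConfigurationError."""


def validate_build_flags(
    use_image: Optional[str] = None,
    build_on_compute: bool = False,
    registry: Optional[str] = None,
) -> str:
    """Return an illustrative build-mode label, or raise on a flag conflict."""
    # --use-image cannot be combined with either of the other two flags.
    if use_image is not None and (registry or build_on_compute):
        raise ConfigurationError(
            "--use-image cannot be combined with --registry or --build-on-compute"
        )
    # Building on a compute node only makes sense with a registry to push to.
    if build_on_compute and not registry:
        raise ConfigurationError("--build-on-compute requires --registry")
    if use_image is not None:
        return "use_image"
    if build_on_compute:
        return "build_on_compute"
    return "local_build"
```

Checking the conflicts before any Docker work starts is what lets an invocation fail fast with INVALID_ARGS-style feedback instead of failing midway through a build.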

Run flow

  1. madengine run → RunOrchestrator loads an existing manifest or triggers a build.
  2. Target inference (Convention over Configuration):
    • "k8s"/"kubernetes" in context → KubernetesDeployment
    • "slurm" in context → SlurmDeployment
    • distributed.launcher == "slurm_multi" → slurm_multi path
    • neither → ContainerRunner (local Docker)
  3. scripts/common/ is populated from the package (pre_scripts, post_scripts, tools) and cleaned up afterwards.
  4. Per-model results parsed via PERFORMANCE_LOG_PATTERN and appended to perf.csv/perf_entry.csv. Failed runs are still recorded with STATUS=FAILURE.

additional_context — the configuration spine

--additional-context accepts a JSON or Python-dict string (parsed with ast.literal_eval(), not json.loads) or a path to a JSON file. It is merged into Context.ctx alongside system-detected values (GPU vendor, architecture, OS, ROCm path). Specific keys drive different subsystems.

| Key | Where it goes | What it does |
| --- | --- | --- |
| gpu_vendor | Core | AMD or NVIDIA. Defaults to AMD if missing. |
| guest_os | Core | UBUNTU or CENTOS; selects package manager for in-container installs. |
| MAD_ROCM_PATH | Core | Override host ROCm root (top-level only). |
| docker_env_vars | Execution | Env vars injected into the container. docker_env_vars.MAD_ROCM_PATH overrides in-container ROCm root independently of host. |
| docker_gpus | Execution | Comma-separated list of GPU indices, or all. |
| k8s / kubernetes | Deployment | Selects K8s. Merged with preset defaults; supports namespace, gpu_count, storage class fallback chain (data_storage_class → nfs_storage_class → storage_class). |
| slurm | Deployment | Selects SLURM. partition, nodes, gpus_per_node, time, exclusive, reservation, nodelist. Setting nodelist also skips automatic node health preflight. |
| distributed.launcher | Deployment | torchrun, deepspeed, megatron, torchtitan, primus, vllm, sglang, sglang_disagg, slurm_multi / slurm-multi. |
| distributed.nnodes / nproc_per_node | Deployment | Topology hints for launcher templates. |
| tools | Execution | List of profilers/tracers to enable, e.g. [{"name":"rocprofv3_compute"}]. |
| rocenv_mode | Execution | "lite" (default) or "full". Full mode collects lshw / dmidecode / dmesg / modinfo and best-effort installs missing tools per guest_os. |
| log_error_pattern_scan | Execution | false disables post-run log substring scan (use when pytest/JUnit is authoritative). |
| log_error_patterns / log_error_benign_patterns | Execution | Override or extend the failure-substring lists. |
| pre_scripts / post_scripts | Execution | Custom scripts to run before/after the model. |
| secrets | Deployment (K8s) | Auto-converted to a K8s Secret and mounted as env vars. |

Gotcha: Context parses the string with ast.literal_eval(), so pass a Python dict literal. In a shell, wrap the whole argument in single quotes and use double quotes inside. JSON made only of objects, arrays, strings, and numbers is also a valid Python literal and parses unchanged. The JSON keywords true, false, and null are not Python literals, so ast.literal_eval() on its own rejects them; if a value fails to parse, use the Python spellings True, False, and None.
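
To make the parsing behaviour concrete, here is a minimal sketch of accepting either a dict-literal string or a path to a JSON file. The function name parse_additional_context is hypothetical and the file-versus-literal heuristic is an assumption made for illustration; madengine's own Context may differ.

```python
import ast
import json
from pathlib import Path
from typing import Any, Dict


def parse_additional_context(arg: str) -> Dict[str, Any]:
    """Parse a dict-literal string, or load a JSON file if `arg` is a path."""
    # Assumption: anything that does not look like a dict literal and names
    # an existing file is treated as a JSON file.
    if not arg.lstrip().startswith("{") and Path(arg).is_file():
        return json.loads(Path(arg).read_text())
    # ast.literal_eval only evaluates literals, never arbitrary expressions.
    value = ast.literal_eval(arg)
    if not isinstance(value, dict):
        raise ValueError("additional context must be a dict")
    return value


if __name__ == "__main__":
    print(parse_additional_context('{"gpu_vendor": "AMD", "k8s": {"gpu_count": 2}}'))
    try:
        parse_additional_context('{"log_error_pattern_scan": false}')
    except ValueError as exc:
        # JSON's lowercase `false` is not a Python literal.
        print("rejected:", type(exc).__name__)
```

The second call shows the gotcha: the string is valid JSON but fails under ast.literal_eval because of the lowercase keyword.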

CLI commands

| Command | Source | Purpose | Notable flags |
| --- | --- | --- | --- |
| discover | cli/commands/discover.py | List/validate models matching tags. | --tags (scoped: MAD/foo, dynamic: dummy3:dummy_3:batch=512) |
| build | cli/commands/build.py | Build Docker images; write build_manifest.json. | --registry, --target-archs, --batch-manifest, --clean-docker-cache, --use-image (new), --build-on-compute (new) |
| run | cli/commands/run.py | Run models from manifest or trigger a build first. | --manifest-file, --additional-context[-file], --skip-model-run, --live-output, --keep-alive, --verbose, --timeout |
| report | cli/commands/report.py | Convert perf CSVs to HTML/email. | Sub-apps: to-html --csv-file …, to-email --directory … |
| database | cli/commands/database.py | Upload perf CSV to MongoDB. | --csv-file, --database-name, --collection-name (uses MONGO_HOST/USER/PASSWORD env) |

Exit codes (CI contract)

From src/madengine/cli/constants.py::ExitCode. Use these in pipelines instead of log scraping.

| Code | Name | Meaning |
| --- | --- | --- |
| 0 | SUCCESS | All operations succeeded. |
| 1 | FAILURE | General/unhandled failure. |
| 2 | BUILD_FAILURE | One or more image builds failed. |
| 3 | RUN_FAILURE | One or more model runs failed (still written to perf.csv with status FAILURE). |
| 4 | INVALID_ARGS | Argument validation rejected the invocation. |

In Jenkins, use ... 2>&1 | tee madengine.run.log with bash -o pipefail so the step's exit code is still madengine's, not tee's.
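
A pipeline step written in Python can branch on these codes directly. The sketch below mirrors the table in a local IntEnum rather than importing madengine's own ExitCode, so the values are copied from this page; describe_exit and run_madengine are hypothetical helper names.

```python
import subprocess
from enum import IntEnum
from typing import List


class ExitCode(IntEnum):
    """Local mirror of the exit-code table (values copied from this page)."""

    SUCCESS = 0
    FAILURE = 1
    BUILD_FAILURE = 2
    RUN_FAILURE = 3
    INVALID_ARGS = 4


def describe_exit(returncode: int) -> str:
    """Map a process return code to its symbolic name."""
    try:
        return ExitCode(returncode).name
    except ValueError:
        # Anything outside the table (e.g. a signal) is reported as unknown.
        return f"UNKNOWN({returncode})"


def run_madengine(args: List[str]) -> str:
    """Run the madengine CLI and return the symbolic exit name."""
    completed = subprocess.run(["madengine", *args], check=False)
    return describe_exit(completed.returncode)
```

For example, a CI job could retry only when run_madengine([...]) returns "RUN_FAILURE" and fail immediately on "INVALID_ARGS".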

Deployment target inference

No explicit deploy field exists. The factory inspects additional_context:

| Trigger | Class | Source |
| --- | --- | --- |
| no k8s/slurm key | Local ContainerRunner | execution/container_runner.py |
| "k8s" or "kubernetes" key | KubernetesDeployment | deployment/kubernetes.py |
| "slurm" key | SlurmDeployment | deployment/slurm.py |
| distributed.launcher == "slurm_multi" | slurm_multi path (within Slurm) | deployment/slurm.py + common.py |
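
The inference rules in the table can be sketched as a single function. This is not DeploymentFactory's code: infer_target is a hypothetical name, the returned labels are invented, and two details are assumptions, namely that the K8s keys are checked before "slurm" and that slurm_multi only applies when a "slurm" key is present.

```python
from typing import Any, Dict

# Both spellings are listed as accepted launcher names.
SLURM_MULTI_ALIASES = {"slurm_multi", "slurm-multi"}


def infer_target(ctx: Dict[str, Any]) -> str:
    """Return an illustrative target label for an additional_context dict."""
    launcher = (ctx.get("distributed") or {}).get("launcher", "")
    # Assumption: K8s keys win if both K8s and SLURM keys are present.
    if "k8s" in ctx or "kubernetes" in ctx:
        return "kubernetes"
    if "slurm" in ctx:
        # slurm_multi is a sub-path of the SLURM deployment.
        return "slurm_multi" if launcher in SLURM_MULTI_ALIASES else "slurm"
    return "local"
```

Because the target falls out of which keys are present, a config file can be moved between backends by adding or removing a single top-level block.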

The mixin deployment/kubernetes_launcher_mixin.py selects the correct Jinja2 template under src/madengine/deployment/templates/{kubernetes,slurm}/ per launcher.

slurm_multi launcher (branch focus)

What it is

A minimal-but-additive SLURM launcher for workloads that orchestrate their own per-node Docker containers via srun — for example SGLang Disaggregated (proxy + prefill + decode topologies) or anything that needs to call srun / scontrol from inside the job script.

The launcher generates a wrapper SBATCH script that runs the model's .slurm script directly on bare metal (not inside a container), so the workload can spawn its own per-node containers without the outer job step holding a container open.

How to pick it

{
  "slurm": {"partition":"gpu","nodes":3,"gpus_per_node":8,"time":"02:00:00"},
  "distributed": {"launcher": "slurm_multi"}
}

The hyphenated spelling "slurm-multi" is accepted as an alias.

Honors model-card + context slurm fields: partition, nodes, gpus_per_node, time, exclusive, reservation, nodelist.

Build modes added with this launcher

| Mode | Flag | Behaviour |
| --- | --- | --- |
| Local build (default) | none | Normal madengine build. |
| Use prebuilt image | --use-image [IMAGE \| auto] | Skip local build. auto resolves to the model card's env_vars.DOCKER_IMAGE_NAME. Mutually exclusive with --registry and --build-on-compute. |
| Build on compute | --build-on-compute (requires --registry) | Build on a SLURM compute node, push to registry; manifest sets built_on_compute: true. run then does parallel srun docker pull on all allocated nodes. |
| Implicit auto-use-image | none | If build finds a slurm_multi model and none of --registry / --use-image / --build-on-compute is set, it either auto-resolves the model card's DOCKER_IMAGE_NAME or raises a structured ConfigurationError listing the four supported options. |
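
The registry gate in the last row can be read as the decision procedure below. It is a sketch under assumptions: resolve_prebuilt_image and the stand-in ConfigurationError are hypothetical names, the model card is modelled as a plain dict, and the error text is illustrative rather than the message madengine prints.

```python
from typing import Any, Dict, Optional


class ConfigurationError(ValueError):
    """Hypothetical stand-in for madengine's ConfigurationError."""


def resolve_prebuilt_image(
    model_card: Dict[str, Any],
    use_image: Optional[str] = None,
    registry: Optional[str] = None,
    build_on_compute: bool = False,
) -> Optional[str]:
    """Return a prebuilt image name, or None when an image will be built."""
    card_image = (model_card.get("env_vars") or {}).get("DOCKER_IMAGE_NAME")
    if use_image is not None:
        if use_image != "auto":
            return use_image  # explicit image name given on the CLI
        if not card_image:
            raise ConfigurationError(
                "--use-image auto needs env_vars.DOCKER_IMAGE_NAME in the model card"
            )
        return card_image
    if registry or build_on_compute:
        return None  # an explicit build path was chosen
    # Implicit auto-use-image: fall back to the model card if it names an image.
    if card_image:
        return card_image
    raise ConfigurationError(
        "no image source: pass --registry, --use-image or --build-on-compute, "
        "or set env_vars.DOCKER_IMAGE_NAME in the model card"
    )
```

Raising here, at build time, means a slurm_multi job is never submitted without a known image for the compute nodes to pull.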

Execution paths

  • sbatch (default): wrapper SBATCH submitted to SLURM.
  • bash-in-salloc: when SLURM_JOB_ID is already set (inside an existing salloc), the slurm_multi launcher runs the wrapper synchronously with bash instead of nesting sbatch. Other launchers keep using sbatch even inside salloc. Uses DeploymentResult.skip_monitoring=True to skip the monitor poll.
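
A minimal sketch of the choice between the two paths follows, assuming the only inputs are the launcher name and the SLURM_JOB_ID environment variable. submit_command is a hypothetical helper; the real launcher also sets DeploymentResult.skip_monitoring, which is not modelled here.

```python
import os
from typing import List, Mapping, Optional


def submit_command(
    launcher: str,
    wrapper: str,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Pick `bash` or `sbatch` for the generated wrapper script."""
    env = os.environ if env is None else env
    # Normalise the hyphenated alias to the canonical launcher name.
    normalized = launcher.replace("-", "_")
    # Inside an existing allocation, only slurm_multi runs synchronously.
    if normalized == "slurm_multi" and env.get("SLURM_JOB_ID"):
        return ["bash", wrapper]
    return ["sbatch", wrapper]
```

Passing env explicitly keeps the decision testable without touching the real process environment.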

Results aggregation

_collect_slurm_multi_results reads the per-job CSV at /shared_inference/$USER/$JOBID/perf.csv and now also writes those rows into cwd/perf.csv (copy if absent, append data rows if present), so the default reporter (display_performance_table) finds them without extra args. Local + classic-SLURM flows are unchanged.
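
The "copy if absent, append data rows if present" rule can be illustrated as follows. This is not the body of _collect_slurm_multi_results: aggregate_perf_csv is a hypothetical name, and the sketch assumes both files share the same header and contain no quoted multi-line fields.

```python
import shutil
from pathlib import Path


def aggregate_perf_csv(job_csv: Path, cwd_csv: Path) -> int:
    """Merge a per-job perf.csv into cwd/perf.csv; return data rows added."""
    lines = job_csv.read_text().splitlines()
    # Everything after the header line counts as a data row.
    data_rows = [line for line in lines[1:] if line.strip()]
    if not cwd_csv.exists():
        # First job: keep the header by copying the file verbatim.
        shutil.copyfile(job_csv, cwd_csv)
        return len(data_rows)
    existing = cwd_csv.read_text()
    with cwd_csv.open("a") as out:
        # Guard against a target file that lacks a trailing newline.
        if existing and not existing.endswith("\n"):
            out.write("\n")
        for row in data_rows:
            out.write(row + "\n")
    return len(data_rows)
```

Skipping the header on append is what keeps the combined file readable by a reporter that expects exactly one header row.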

Tests & examples

  • tests/unit/test_slurm_multi.py — registry membership, hyphen alias normalization, env_vars-export contract against MAD-private PR #186's pyt_sglang_disagg_qwen3-32b_short model card.
  • examples/slurm-configs/minimal/slurm-multi-minimal.json — reference config.
Recent commits on this branch (most recent first)
2e8f1a4 Merge remote-tracking branch 'upstream/develop' into add_slurm_multi_launcher
68d0bf3 fix(slurm_multi): address Copilot review on PR #124
dc3bc48 docs(slurm_multi): CHANGELOG entry + forward-compat TODO on --use-image
e84506a fix(slurm_multi): aggregate per-job perf.csv into cwd for dashboard reporter
e281e7e fix(deployment): add skip_monitoring to DeploymentResult for slurm_multi bash branch
f7af062 test(slurm_multi): contract tests + minimal example config
8a5e174 feat(cli): expose --use-image and --build-on-compute on madengine build
bd371fe feat(orchestration): build_on_compute, registry gate, parallel pull for slurm_multi
941d56d feat(deployment): add slurm_multi launcher (minimal additive)

Kubernetes deployment

Decomposed (v2.0.3) into focused mixins composed by KubernetesDeployment:

| Module | Concern |
| --- | --- |
| k8s_pvc.py | PVC lifecycle (data PVC, single-node results PVC). |
| k8s_results.py | Log/artifact collection, performance aggregation. Uses the shared collector_pod_name() helper so cleanup matches the truncated collector-{deployment_id[:15]} name. |
| k8s_scripts.py | Script extraction, ConfigMap building. |
| k8s_template_context.py | Jinja2 template context assembly. |
| kubernetes_launcher_mixin.py | Per-launcher template selection. |
| k8s_secrets.py | secrets dict → K8s Secret objects → env vars. |
| k8s_pvc.py | Storage-class fallback: data_storage_class → nfs_storage_class → storage_class; single_node_results_storage_class → local_path_storage_class → storage_class. Default bundled preset: storage_class: "nfs-banff". |

Known issue: in multi-node K8s jobs a node may report FAILED in the results table even though the pod actually succeeded. This happens when the kubelet returns 502 between job completion and log collection, so madengine cannot parse perf metrics. PVC artifacts are still collected. Check kubectl describe pod <pod>.
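
The two storage-class fallback chains from the table can be expressed as a first-non-empty lookup. The key names come from the table; resolve_storage_class and the chain constants are hypothetical names, and the k8s config is modelled as a plain dict after preset merging.

```python
from typing import Any, Dict, Optional, Sequence

# Key order taken from the fallback chains described in the table.
DATA_CHAIN = ("data_storage_class", "nfs_storage_class", "storage_class")
RESULTS_CHAIN = (
    "single_node_results_storage_class",
    "local_path_storage_class",
    "storage_class",
)


def resolve_storage_class(
    k8s_cfg: Dict[str, Any], chain: Sequence[str]
) -> Optional[str]:
    """Return the first non-empty storage class along the chain."""
    for key in chain:
        value = k8s_cfg.get(key)
        if value:
            return value
    return None
```

With only the bundled default set, both chains resolve to the same generic class; a more specific key overrides it for just the data or just the results PVC.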

Launcher matrix

| Launcher | Type | Notes |
| --- | --- | --- |
| torchrun | Train | DDP / FSDP, elastic. |
| DeepSpeed | Train | ZeRO, pipeline parallelism. |
| Megatron-LM | Train | TP + PP, large transformers. |
| TorchTitan | Train | FSDP2 + TP + PP + CP, Llama 3.1 8B–405B. |
| Primus | Train | Megatron / TorchTitan / MaxText via Primus YAML. |
| vLLM | Infer | v1 engine, PagedAttention. |
| SGLang | Infer | RadixAttention, structured gen. |
| SGLang Disagg | Infer | Disagg prefill/decode, Mooncake, 3+ nodes. |
| slurm_multi (branch) | Meta | Self-managed multi-node SLURM wrapper for workloads with their own per-node container orchestration. |

Profiling & tracing

Enable via --additional-context '{"tools":[{"name":"…"}]}'. Stackable.

| Tool | Purpose | Output |
| --- | --- | --- |
| rocprof | Legacy GPU kernel profiling | Kernel timings/occupancy |
| rocprofv3_compute | Compute-bound (ROCm ≥ 7.0) | ALU, wave execution |
| rocprofv3_memory | Memory-bound | Cache hits, bandwidth |
| rocprofv3_communication | Multi-GPU | RCCL traces |
| rocprofv3_full | Comprehensive | All metrics, high overhead |
| rocprofv3_lightweight | Minimal overhead | HIP + kernel traces |
| rocprofv3_perfetto | Perfetto UI traces | Perfetto JSON |
| rocprofv3_api_overhead | API call timing | API timings |
| rocprofv3_pc_sampling | Kernel hotspots | PC sample histograms |
| rocm_trace_lite | RTL lite dispatch trace | rocm_trace_lite_output/trace.db |
| rocm_trace_lite_default | RTL default mode | Same paths, broader coverage |
| rocblas_trace / miopen_trace / tensile_trace / rccl_trace | Library call tracing | Per-library log |
| gpu_info_power_profiler / gpu_info_vram_profiler | Power / VRAM over time | CSV time series |
| therock_check | TheRock ROCm validation | Detection report |

Do not combine rocm_trace_lite with rocprof / rocprofv3_* in the same run. RTL installs from a pinned GitHub release wheel by default; set ROCM_TRACE_LITE_FOLLOW_LATEST=1 to follow the latest release, or ROCM_TRACE_LITE_WHEEL_URL=… for air-gapped installs.
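
Since tools are stackable, a quick pre-flight check for the one forbidden combination can save a wasted run. The sketch below is an illustration, not part of madengine: find_tool_conflicts is a hypothetical helper, and it matches only the exact name rocm_trace_lite because the note does not say whether rocm_trace_lite_default is affected.

```python
import json
from typing import Dict, List


def find_tool_conflicts(tools: List[Dict[str, str]]) -> List[str]:
    """Return rocprof-family tool names that clash with rocm_trace_lite."""
    names = [tool["name"] for tool in tools]
    # Assumption: only the exact name "rocm_trace_lite" triggers the rule.
    if "rocm_trace_lite" not in names:
        return []
    return [n for n in names if n == "rocprof" or n.startswith("rocprofv3_")]


if __name__ == "__main__":
    tools = [{"name": "rocprofv3_compute"}, {"name": "gpu_info_power_profiler"}]
    if not find_tool_conflicts(tools):
        # Build the value to pass to --additional-context.
        print(json.dumps({"tools": tools}))
```

An empty result means the list is safe to serialise and pass on the command line.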

ROCm path resolution

Implemented in src/madengine/utils/rocm_path_resolver.py.

Host (build & tools)

  1. Top-level MAD_ROCM_PATH in --additional-context
  2. Auto-detect: /opt/rocm, /opt/rocm-*, TheRock rocm-sdk + markers, then rocminfo / amd-smi / rocm-smi on PATH
  3. ROCM_PATH env var
  4. /opt/rocm fallback

Set MAD_AUTO_ROCM_PATH=0 to disable scanning and use only the env var/default.
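
The host priority chain can be sketched as below. This is not the code in rocm_path_resolver.py: resolve_host_rocm_path is a hypothetical name, the filesystem and PATH scan is abstracted behind an injected autodetect callable, and the sketch assumes an explicit MAD_ROCM_PATH in the context still wins when scanning is disabled.

```python
from typing import Any, Callable, Mapping, Optional


def resolve_host_rocm_path(
    ctx: Mapping[str, Any],
    env: Mapping[str, str],
    autodetect: Callable[[], Optional[str]],
) -> str:
    """Resolve the host ROCm root following the documented priority order."""
    # 1. Explicit top-level override from --additional-context.
    override = ctx.get("MAD_ROCM_PATH")
    if override:
        return override
    # 2. Auto-detection, unless disabled with MAD_AUTO_ROCM_PATH=0.
    if env.get("MAD_AUTO_ROCM_PATH") != "0":
        detected = autodetect()
        if detected:
            return detected
    # 3. ROCM_PATH env var, then 4. the /opt/rocm fallback.
    return env.get("ROCM_PATH") or "/opt/rocm"
```

Injecting autodetect keeps the ordering logic separate from the scan itself, so the chain can be exercised on a machine without ROCm installed.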

In-container (AMD Docker runs)

  1. docker_env_vars.MAD_ROCM_PATH (consumed; not forwarded as-is)
  2. ROCM_PATH/ROCM_HOME from image OCI config (docker image inspect)
  3. In-image shell probe (docker run --rm)
  4. /opt/rocm with a warning

The run-phase environment table prints host vs container installation type (apt / therock / unknown), ROCm/CUDA root, and version side-by-side.

Module reference

| Layer | Path | What it contains |
| --- | --- | --- |
| CLI | cli/app.py | Typer app, cli_main entry, --version handling, rich traceback install. |
| CLI | cli/commands/build.py | madengine build command, registry options, batch builds, --use-image/--build-on-compute. |
| CLI | cli/commands/run.py | madengine run command, manifest loading, --skip-model-run. |
| CLI | cli/commands/discover.py | Model discovery command. |
| CLI | cli/commands/report.py | report to-html / to-email sub-app. |
| CLI | cli/commands/database.py | MongoDB upload command. |
| CLI | cli/constants.py | ExitCode enum. |
| CLI | cli/validators.py | Argument validation. |
| Orch | orchestration/build_orchestrator.py | BuildOrchestrator.execute(), discover → build, registry login, batch manifest, slurm_multi registry gate. |
| Orch | orchestration/run_orchestrator.py | RunOrchestrator, build phase, target inference, local Docker dispatch, slurm_multi result aggregation. |
| Orch | orchestration/image_filtering.py | Target-arch / tag filtering of manifest entries. |
| Dep | deployment/factory.py | DeploymentFactory.create(), registers SlurmDeployment + KubernetesDeployment; UserWarning if kubernetes pkg missing. |
| Dep | deployment/base.py | BaseDeployment, DeploymentConfig, DeploymentResult (incl. skip_monitoring), DeploymentStatus, PERFORMANCE_LOG_PATTERN, terminal states (COMPLETED/FAILED/CANCELLED). |
| Dep | deployment/kubernetes.py | Composes K8s mixins; orchestrates job lifecycle. |
| Dep | deployment/k8s_pvc.py | PVC creation/deletion + storage-class resolution. |
| Dep | deployment/k8s_results.py | Log/artifact collection, perf aggregation; collector_pod_name(). |
| Dep | deployment/k8s_scripts.py | Script extraction, ConfigMap building (carries rocenv_mode, guest_os). |
| Dep | deployment/k8s_template_context.py | Assembles Jinja2 template context. |
| Dep | deployment/k8s_secrets.py | secrets → K8s Secret objects. |
| Dep | deployment/k8s_names.py | Name truncation/sanitization helpers. |
| Dep | deployment/kubernetes_launcher_mixin.py | Selects K8s template per launcher. |
| Dep | deployment/slurm.py | SlurmDeployment; classic SLURM path; routes to slurm_multi when launcher matches. |
| Dep | deployment/slurm_node_selector.py | SlurmNodeSelector health/cleanup srun, supports reservation. |
| Dep | deployment/primus_backend.py | Primus YAML / backend selection. |
| Dep | deployment/common.py | Shared deployment helpers, slurm_multi wrapper assembly. |
| Dep | deployment/config_loader.py | Loads and deep-merges preset JSON with user config. |
| Dep | deployment/presets/{k8s,slurm}/defaults.json | Default values auto-merged with minimal user configs. |
| Dep | deployment/templates/{kubernetes,slurm}/ | Jinja2 templates per launcher. |
| Exec | execution/container_runner.py | ContainerRunner: local docker run, env injection (MAD_GUEST_OS, MAD_OUTPUT_CSV), tools wiring, perf parsing. |
| Exec | execution/container_runner_helpers.py | Log error pattern scan, timeout resolution. |
| Exec | execution/docker_builder.py | DockerBuilder: build args (incl. MAD_SYSTEM_GPU_ARCHITECTURE), push/tag, shell-quoted everywhere. |
| Exec | execution/dockerfile_utils.py | Dockerfile parsing helpers. |
| Core | core/context.py | Context: ast.literal_eval parse, system detect, GPU vendor/arch, ROCm path; guards against None kfd_renderDs entries on restricted ROCm. |
| Core | core/additional_context_defaults.py | Default values merged into context. |
| Core | core/console.py | Console: Rich-backed shell wrapper, live output mode. |
| Core | core/docker.py | Docker wrapper; shlex.quote() on every interpolation. |
| Core | core/errors.py | MADEngineError + 10 typed errors; create_error_context; Rich panels. |
| Core | core/auth.py | load_credentials(), login_to_registry() (uses --password-stdin + MAD_REGISTRY_PASSWORD env). |
| Core | core/timeout.py | Timeout context manager; guards signal.alarm(None) when seconds is 0/None. |
| Core | core/constants.py | Misc core constants. |
| Core | core/dataprovider.py | Data: local / NAS / S3 / MinIO abstraction. |
| Util | utils/discover_models.py | DiscoverModels: root, dir, or dynamic discovery; scoped vs unscoped tags. |
| Util | utils/gpu_tool_factory.py | Returns AMD or NVIDIA tool manager based on vendor. |
| Util | utils/gpu_tool_manager.py | Abstract GPU tool manager interface. |
| Util | utils/rocm_tool_manager.py | AMD/ROCm implementation. |
| Util | utils/nvidia_tool_manager.py | NVIDIA implementation. |
| Util | utils/gpu_validator.py | ROCm install detection, GPU vendor detection. |
| Util | utils/gpu_config.py | GPU configuration helpers. |
| Util | utils/rocm_path_resolver.py | Host/in-container ROCm root resolver. |
| Util | utils/therock_markers.py | Shared TheRock detection markers. |
| Util | utils/config_parser.py | ConfigParser: parses additional context + tools config. |
| Util | utils/path_utils.py | Path helpers. |
| Util | utils/session_tracker.py | Session start/marker tracking. |
| Util | utils/ops.py | Misc operations. |
| Util | utils/log_formatting.py | Log formatting helpers. |
| Util | utils/run_details.py | Run metadata helpers. |
| Rep | reporting/update_perf_csv.py | Writes/appends to perf.csv and perf_entry.csv. |
| Rep | reporting/csv_to_html.py | HTML report generation. |
| Rep | reporting/csv_to_email.py | Email-friendly consolidated report. |
| Rep | reporting/update_perf_super.py | Superset-shaped perf rollups. |
| DB | database/mongodb.py | MongoDB connection + insert; uses datetime.now(timezone.utc). |
| Scripts | scripts/common/pre_scripts/rocEnvTool/ | rocenv_tool.py, csv_parser.py, console.py: TheRock-compatible env capture (lite + full modes). |
| Scripts | scripts/common/tools/ | GPU info profilers, amd_smi / rocm_smi utils, rtl_trace wrapper, library tracers. |

Test layout

unit/

Fast, isolated, mocked. ~28 modules including test_slurm_multi.py, test_shell_quoting.py, test_error_handling.py, test_k8s.py, test_rocm_path.py, test_validators.py.

integration/

Real Docker / GPU / platform calls. Includes test_docker_integration.py, test_container_execution.py, test_gpu_management.py, test_orchestrator_workflows.py, test_profiling_tools_config.py.

e2e/

Full workflows: test_build_workflows.py, test_run_workflows.py, test_profiling_workflows.py, test_data_workflows.py, test_execution_features.py, test_scripting_workflows.py.

Pytest config lives solely in [tool.pytest.ini_options] in pyproject.toml (minversion=7.0). Markers: unit, integration, e2e, slow, gpu, amd, nvidia, cpu, requires_docker, requires_models.

Contributing & code style

  • Formatting: Black (line-length 88), targets py38–py311.
  • Imports: isort with profile = "black"; first-party = madengine.
  • Lint: flake8 + mypy (strict equality, warn unused, etc.) + bandit (skips B101).
  • Docstrings: Google style; type hints for public functions.
  • Conventional commits: feat:, fix:, docs:, test:, refactor:, style:, perf:, chore:.
  • Pre-commit: pip install pre-commit && pre-commit install.

Recent notable changes

[Unreleased] — slurm_multi launcher
  • New slurm_multi SLURM launcher; slurm-multi alias accepted.
  • madengine build --use-image [IMAGE|auto] and --build-on-compute.
  • Build registry gate with structured ConfigurationError.
  • bash-in-salloc execution path when SLURM_JOB_ID is already set.
  • DeploymentResult.skip_monitoring for synchronous deploys.
  • SlurmNodeSelector accepts a reservation parameter.
  • perf.csv aggregation into cwd so the default reporter sees per-job rows.
  • Contract tests + minimal example config.
[2.0.3] — rocEnvTool full mode, K8s refactor, security
  • K8s monolith decomposed into k8s_pvc/k8s_results/k8s_scripts/k8s_template_context mixins.
  • rocEnvTool "full" mode (lshw, dmidecode, dmesg, modinfo) with guest_os-native installers.
  • Generic storage_class fallback added; default preset now nfs-banff.
  • rocm_trace_lite_default tool (RTL default mode).
  • Security: shlex.quote() on every shell interpolation in core/docker.py, container_runner.py, docker_builder.py, run_orchestrator.py.
  • Collector pod name mismatch fix (truncated collector-{id[:15]} shared helper).
  • RPD pre-script: xxd install + sudo/root branch fixes.
  • CANCELLED added to terminal-state set so scancel'd jobs don't loop forever.
  • Context guards against None kfd_renderDs on restricted ROCm.
[2.0.2] / [2.0.1] — credential validation, ROCm auto-detect, GPU arch
  • load_credentials() validates JSON object type, raises ConfigurationError.
  • Host ROCm auto-detection via priority chain; in-container ROCm resolved independently.
  • TheRock layout support (rocm-sdk + markers).
  • GPU arch auto-detection injected into Docker build args for full-run mode.
  • Model discovery: scope-based tag selection replaces strict flag.
  • Shared login_to_registry, centralised credential loading.
  • Registry password via env + --password-stdin (no more /proc exposure).
  • Unified PERFORMANCE_LOG_PATTERN across local + deployment paths.
[2.0.0] — Complete rewrite
  • Unified madengine CLI; legacy mad-* removed.
  • 5-layer architecture (CLI / Orchestration / Deployment / Execution / Core).
  • Multi-target deployment via factory + presets + Jinja2 templates.
  • Launcher mixin with torchrun / DeepSpeed / Megatron-LM / TorchTitan / Primus / vLLM / SGLang.
  • Log error pattern scanning; --skip-model-run; batch build manifest.
  • SLURM nodelist pinning; K8s Secrets management.
  • Structured errors (10 types) with Rich panels; fixed exit codes.
  • RuntimeError renamed to ExecutionError (alias preserved).