Site-Specific Instructions
ORNL
This section provides instructions for running user-mode Omnistat on ORNL's Frontier supercomputer using the versions pre-installed by AMD Research.
Prerequisites:

- User account on Frontier
- Familiarity with SLURM job submission

Running jobs on Frontier
Omnistat is preinstalled on Frontier, so users need only set up their module environment appropriately and add several commands to their SLURM job scripts.
The following video demonstrates a complete Omnistat workflow on Frontier using an interactive 2-node job. You’ll see how to: load the Omnistat module and initialize data collection, monitor system metrics during application execution, stop data collection, and generate a performance report.
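As a rough sketch, the same workflow in an interactive allocation looks like the following; the project ID is a placeholder, and the commands mirror the batch job scripts shown later in this section:

```bash
# Request an interactive 2-node allocation (project ID is a placeholder)
salloc -A <project_id> -N 2 -t 01:00:00

# Load the preinstalled wrapper module and start data collection
ml use /autofs/nccs-svm1_sw/crusher/amdsw/modules
ml omnistat-wrapper
${OMNISTAT_WRAPPER} usermode --start --interval 1.0

# Run the GPU application while metrics are collected
srun ./your_application

# Stop collection and generate a performance report for this job
${OMNISTAT_WRAPPER} usermode --stopexporters
${OMNISTAT_WRAPPER} query --interval 1.0 --job ${SLURM_JOB_ID} --pdf omnistat.${SLURM_JOB_ID}.pdf
${OMNISTAT_WRAPPER} usermode --stopserver
```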
The version of Omnistat installed on Frontier also includes a convenience wrapper for teams running their own Python stack who wish to avoid polluting the application's Python environment. The SLURM job script below shows use of the wrapper utility before and after executing a GPU application:
```bash
#!/bin/bash
#SBATCH -A <project_id>
#SBATCH -J omnistat
#SBATCH -N 2
#SBATCH -t 01:00:00
#SBATCH -S 0

# Setup and launch Omnistat (wrapper version)
ml use /autofs/nccs-svm1_sw/crusher/amdsw/modules
ml omnistat-wrapper
${OMNISTAT_WRAPPER} usermode --start --interval 1.0

# Your GPU application here
srun ./your_application

# Tear down Omnistat: stop exporters, generate a PDF report for this job,
# then shut down the local database server
${OMNISTAT_WRAPPER} usermode --stopexporters
${OMNISTAT_WRAPPER} query --interval 1.0 --job ${SLURM_JOB_ID} --pdf omnistat.${SLURM_JOB_ID}.pdf
${OMNISTAT_WRAPPER} usermode --stopserver
```
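To run the example, submit the script with sbatch; once the job completes, the PDF report generated by the query step appears in the submission directory (the script name below is illustrative):

```bash
sbatch omnistat_job.sbatch   # hypothetical script name
# After completion, inspect the generated report:
ls omnistat.*.pdf
```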
Storage
By default, Omnistat databases are stored in Lustre, under the `/lustre/orion/${SLURM_JOB_ACCOUNT}/scratch/${USER}/omnistat/${SLURM_JOB_ID}` directory. It's possible to override the default path using the `OMNISTAT_VICTORIA_DATADIR` environment variable, as highlighted in the following example.
```bash
#!/bin/bash
#SBATCH -A <project_id>
#SBATCH -J omnistat
#SBATCH -N 2
#SBATCH -t 01:00:00
#SBATCH -S 0

# Setup and launch Omnistat (wrapper version), storing the database in
# node-local /tmp instead of the default Lustre location
ml use /autofs/nccs-svm1_sw/crusher/amdsw/modules
ml omnistat-wrapper
export OMNISTAT_VICTORIA_DATADIR=/tmp/omnistat/${SLURM_JOB_ID}
${OMNISTAT_WRAPPER} usermode --start --interval 1.0

# Your GPU application here
srun ./your_application

# Tear down Omnistat
${OMNISTAT_WRAPPER} usermode --stopexporters
${OMNISTAT_WRAPPER} query --interval 1.0 --job ${SLURM_JOB_ID} --pdf omnistat.${SLURM_JOB_ID}.pdf
${OMNISTAT_WRAPPER} usermode --stopserver

# Preserve the node-local data before the allocation ends
mv /tmp/omnistat/${SLURM_JOB_ID} data_omnistat.${SLURM_JOB_ID}
```
Note: Omnistat databases require `flock` support, which is available in Lustre and local filesystems like `/tmp`. Data can't be stored directly under Frontier's `$HOME`, but it can be moved there after the job completes.
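If in doubt about whether a target filesystem supports flock, a quick check with the flock utility from util-linux is one simple test (the path below is a placeholder):

```bash
# Non-zero exit (or an error) suggests the filesystem lacks flock support;
# /path/to/target is a placeholder for the directory you want to test.
touch /path/to/target/.flock_test
flock --nonblock /path/to/target/.flock_test true && echo "flock supported"
rm /path/to/target/.flock_test
```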
Data Analysis
After job completion, transfer the archived Omnistat data to your local machine for analysis using the Docker environment described in the user-mode guide.
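For example, the relocated data directory from the previous script could be packed on Frontier and pulled to a workstation through an OLCF data transfer node; the username, hostname, and paths below are illustrative:

```bash
# On Frontier: pack the relocated Omnistat data directory (job ID shown as
# a placeholder)
tar czf omnistat_data.<jobid>.tar.gz data_omnistat.<jobid>

# On your local machine: fetch and unpack the archive (username, host, and
# remote path are illustrative)
scp <user>@dtn.ccs.ornl.gov:/path/to/omnistat_data.<jobid>.tar.gz .
tar xzf omnistat_data.<jobid>.tar.gz
```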
Additional Metrics
Frontier exposes an additional site‑specific collector beyond the standard set documented in the main metrics overview.
Vendor Counters
The vendor collector ingests additional telemetry made
possible by site‑specific platform integrations. On Frontier, this collector leverages the HPE
Cray pm_counters
interface and translates raw counter files into metrics
that distinguish cumulative energy and instantaneous power samples for
different node-level components. Note that GPU metrics are indexed by an accel
label and this index
may differ from the ordering used by ROCm.
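The raw counters behind these metrics can be inspected directly on a compute node; the sketch below is a read-only illustration, and the exact set of files under pm_counters varies by platform:

```bash
# Dump all HPE Cray pm_counters readings on the current node (run inside a
# job allocation, e.g. via srun). File names vary across Cray EX platforms.
for f in /sys/cray/pm_counters/*; do
    printf '%-20s %s\n' "$(basename "$f")" "$(cat "$f")"
done
```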
Collector: `enable_vendor_counters`
| Node Metric | Description |
|---|---|
| | Total node energy consumption (J). Labels: |
| | Instantaneous total node power draw (W). Labels: |
| | Cumulative CPU energy (J). Labels: |
| | Instantaneous CPU power (W). Labels: |
| | Cumulative system memory energy (J). Labels: |
| | Instantaneous system memory power (W). Labels: |
| GPU Metric | Description |
|---|---|
| | Cumulative accelerator (GPU) energy (J) for each device. Labels: `accel` |
| | Instantaneous accelerator (GPU) power (W) for each device. Labels: `accel` |
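Because the energy metrics are monotonically increasing counters, average power over a window can be derived from two energy samples. The sketch below illustrates the idea against the raw node energy counter, assuming the first field of the file is the joule count:

```bash
# Derive average node power over a 10-second window from the cumulative
# energy counter (assumes the first whitespace-separated field is joules)
e1=$(awk '{print $1}' /sys/cray/pm_counters/energy)
sleep 10
e2=$(awk '{print $1}' /sys/cray/pm_counters/energy)
echo "average node power: $(( (e2 - e1) / 10 )) W"
```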