ORNL
This section provides instructions for running user-mode Omnistat on ORNL’s Frontier supercomputer with pre-installed versions from AMD Research.
Prerequisites:
User account on Frontier
Familiarity with SLURM job submission
Running jobs on Frontier
Omnistat is preinstalled on Frontier, so users only need to set up their module environment and add a few commands to their SLURM job scripts.
The following video demonstrates a complete Omnistat workflow on Frontier using an interactive 2-node job. You’ll see how to: load the Omnistat module and initialize data collection, monitor system metrics during application execution, stop data collection, and generate a performance report.
The version of Omnistat installed on Frontier also includes a convenience wrapper for teams that run their own Python stack and want to keep Omnistat from interfering with the application’s Python environment. The following SLURM job script uses the wrapper before and after executing a GPU application:
#!/bin/bash
#SBATCH -A <project_id>
#SBATCH -J omnistat
#SBATCH -N 2
#SBATCH -t 01:00:00
#SBATCH -S 0
# Setup and launch Omnistat (wrapper version)
ml use /sw/frontier/amdsw/modulefiles
ml omnistat-wrapper
${OMNISTAT_WRAPPER} usermode --start --interval 1.0
# Your GPU application here
srun ./your_application
# Tear down Omnistat
${OMNISTAT_WRAPPER} usermode --stopexporters
${OMNISTAT_WRAPPER} query --interval 1.0 --job ${SLURM_JOB_ID} --pdf omnistat.${SLURM_JOB_ID}.pdf
${OMNISTAT_WRAPPER} usermode --stopserver
Storage
By default, Omnistat databases are stored in Lustre, under the
/lustre/orion/$(SLURM_JOB_ACCOUNT)/scratch/$(USER)/omnistat/$(SLURM_JOB_ID)
directory. It’s possible to override the default path using the
OMNISTAT_VICTORIA_DATADIR environment variable as highlighted in the following example.
#!/bin/bash
#SBATCH -A <project_id>
#SBATCH -J omnistat
#SBATCH -N 2
#SBATCH -t 01:00:00
#SBATCH -S 0
# Setup and launch Omnistat (wrapper version)
ml use /sw/frontier/amdsw/modulefiles
ml omnistat-wrapper
export OMNISTAT_VICTORIA_DATADIR=/tmp/omnistat/${SLURM_JOB_ID}
${OMNISTAT_WRAPPER} usermode --start --interval 1.0
# Your GPU application here
srun ./your_application
# Tear down Omnistat
${OMNISTAT_WRAPPER} usermode --stopexporters
${OMNISTAT_WRAPPER} query --interval 1.0 --job ${SLURM_JOB_ID} --pdf omnistat.${SLURM_JOB_ID}.pdf
${OMNISTAT_WRAPPER} usermode --stopserver
mv /tmp/omnistat/${SLURM_JOB_ID} data_omnistat.${SLURM_JOB_ID}
Note
Omnistat databases require flock support, which is available in Lustre and
local filesystems like /tmp. Data can’t be stored directly under Frontier’s
$HOME, but it can be moved there after running.
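Before pointing OMNISTAT_VICTORIA_DATADIR at an unfamiliar filesystem, it can help to confirm that the location accepts flock-style locks. The following is a minimal sketch, assuming the util-linux flock utility is available on the node; supports_flock is a hypothetical helper and not part of Omnistat.

```shell
#!/bin/bash
# Hypothetical helper (not part of Omnistat): return 0 if files in the given
# directory can be locked with flock, non-zero otherwise.
supports_flock() {
    local dir="$1"
    local probe status
    mkdir -p "${dir}" || return 1
    probe=$(mktemp "${dir}/.flock_probe.XXXXXX") || return 1
    # Try to take a non-blocking exclusive lock on the probe file and run
    # a no-op command while holding it.
    flock -n "${probe}" true
    status=$?
    rm -f "${probe}"
    return ${status}
}

# Example use in a job script, before starting Omnistat:
#   supports_flock "${OMNISTAT_VICTORIA_DATADIR}" || echo "flock not supported" >&2
```

The check only creates and removes a temporary probe file, so it leaves the target directory otherwise unchanged.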
Data Analysis
After job completion, transfer the archived Omnistat data to your local machine for analysis using the Docker environment described in the user-mode guide.
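One way to package a database directory for transfer is a compressed tarball. The sketch below assumes the data was left in a directory such as data_omnistat.${SLURM_JOB_ID}, as in the storage example; archive_omnistat_data is a hypothetical helper, and the user-mode guide may expect a different layout.

```shell
#!/bin/bash
# Hypothetical helper (not part of Omnistat): create <dir>.tar.gz next to the
# given data directory and print the archive path.
archive_omnistat_data() {
    local data_dir="${1%/}"          # drop a trailing slash, if any
    local archive="${data_dir}.tar.gz"
    if [ ! -d "${data_dir}" ]; then
        echo "no such directory: ${data_dir}" >&2
        return 1
    fi
    # Store paths relative to the parent so the archive unpacks into a
    # single top-level directory.
    tar -czf "${archive}" \
        -C "$(dirname "${data_dir}")" "$(basename "${data_dir}")" || return 1
    echo "${archive}"
}

# Example use at the end of a job script:
#   archive_omnistat_data "data_omnistat.${SLURM_JOB_ID}"
```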
Additional Features
Vendor Counters
Frontier exposes an additional site‑specific collector beyond the standard set
documented in the main metrics overview. The vendor collector ingests additional telemetry made
possible by site‑specific platform integrations. On Frontier, this collector leverages the HPE
Cray pm_counters interface and translates raw counter files into metrics
that distinguish cumulative energy and instantaneous power samples for
different node-level components. Note that GPU metrics are indexed by an accel label and this index
may differ from the ordering used by ROCm.
Collector: enable_vendor_counters
| Node Metric | Description |
|---|---|
|  | Total node energy consumption (J). Labels: |
|  | Instantaneous total node power draw (W). Labels: |
|  | Cumulative CPU energy (J). Labels: |
|  | Instantaneous CPU power (W). Labels: |
|  | Cumulative system memory energy (J). Labels: |
|  | Instantaneous system memory power (W). Labels: |

| GPU Metric | Description |
|---|---|
|  | Cumulative accelerator (GPU) energy (J) for each device. Labels: |
|  | Instantaneous accelerator (GPU) power (W) for each device. Labels: |
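The distinction between cumulative energy and instantaneous power matters when post-processing: average power over a window is the difference between two energy readings divided by the elapsed time. The sketch below illustrates that arithmetic. It assumes raw counter files hold a line of the form "<value> <unit> <timestamp> us" (commonly found under /sys/cray/pm_counters on HPE Cray systems, but not confirmed by this guide); both function names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the numeric value from a pm_counters-style file.
# Assumed line format: "<value> <unit> <timestamp> us", e.g. "212 W 1700000000 us".
read_counter_value() {
    awk '{ print $1; exit }' "$1"
}

# Hypothetical helper: average power (W) from two cumulative energy readings
# (J) taken interval_s seconds apart, computed as (e1 - e0) / interval_s.
average_power_watts() {
    local e0="$1" e1="$2" interval_s="$3"
    awk -v e0="${e0}" -v e1="${e1}" -v dt="${interval_s}" \
        'BEGIN { if (dt <= 0) exit 1; printf "%.1f\n", (e1 - e0) / dt }'
}

# Example: 600 J consumed over 2 s corresponds to an average of 300 W.
#   average_power_watts 1000 1600 2
```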
Kernel Tracing
Kernel tracing is available on Frontier starting
with Omnistat 1.12.0. It works by loading a tracing library
(libomnistat_trace.so) into the application’s runtime environment to
intercept GPU kernel dispatches. Pre-built tracing libraries are currently
provided for ROCm 6.4.0 through 7.2.0.
To enable kernel tracing:
Create a custom configuration file with enable_kernel_trace = True, using the default Frontier configuration file as a starting point.
Set OMNISTAT_CONFIG to point to your custom file in the job script.
Set ROCP_TOOL_LIBRARIES to the pre-built tracing library matching the ROCm version used to build your application. Pre-built libraries are available under ${OMNISTAT_DIR}/build-trace-rocm-<version>/. This variable must be set before launching the application.
#!/bin/bash
#SBATCH -A <project_id>
#SBATCH -J omnistat
#SBATCH -N 2
#SBATCH -t 01:00:00
#SBATCH -S 0
# Load ROCm (must match the version used to build your application)
ml rocm/7.2.0
# Setup and launch Omnistat (wrapper version)
ml use /sw/frontier/amdsw/modulefiles
ml omnistat-wrapper
export OMNISTAT_CONFIG=/path/to/omnistat.config
${OMNISTAT_WRAPPER} usermode --start --interval 1.0
# Enable Omnistat kernel tracing
export ROCP_TOOL_LIBRARIES=${OMNISTAT_DIR}/build-trace-rocm-${CRAY_ROCM_VERSION}/libomnistat_trace.so
# Your GPU application here
srun ./your_application
# Tear down Omnistat
${OMNISTAT_WRAPPER} usermode --stopexporters
${OMNISTAT_WRAPPER} query --job ${SLURM_JOB_ID} --interval 1.0
${OMNISTAT_WRAPPER} usermode --stopserver
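Because pre-built tracing libraries exist only for certain ROCm versions, a job script may want to fail early when no library matches the loaded ROCm module. The following is a minimal sketch under that assumption; find_trace_library is a hypothetical helper and not part of Omnistat.

```shell
#!/bin/bash
# Hypothetical helper (not part of Omnistat): print the path of the pre-built
# tracing library for a given ROCm version, or fail if it does not exist.
find_trace_library() {
    local omnistat_dir="$1" rocm_version="$2"
    local lib="${omnistat_dir}/build-trace-rocm-${rocm_version}/libomnistat_trace.so"
    if [ ! -f "${lib}" ]; then
        echo "no pre-built tracing library for ROCm ${rocm_version}: ${lib}" >&2
        return 1
    fi
    echo "${lib}"
}

# Example use in a job script, after loading the rocm and omnistat-wrapper
# modules and before launching the application:
#   ROCP_TOOL_LIBRARIES=$(find_trace_library "${OMNISTAT_DIR}" "${CRAY_ROCM_VERSION}") || exit 1
#   export ROCP_TOOL_LIBRARIES
```

Stopping the job before the application starts avoids a run that silently collects no kernel traces.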