ORNL

This section provides instructions for running user-mode Omnistat on ORNL’s Frontier supercomputer using the versions preinstalled by AMD Research.

Prerequisites:

  • User account on Frontier

  • Familiarity with SLURM job submission

Running jobs on Frontier

Omnistat is preinstalled on Frontier, so users only need to set up their module environment and add a few commands to their SLURM job scripts.

The following video demonstrates a complete Omnistat workflow on Frontier using an interactive 2-node job. You’ll see how to: load the Omnistat module and initialize data collection, monitor system metrics during application execution, stop data collection, and generate a performance report.

The version of Omnistat installed on Frontier also includes a convenience wrapper for teams that run their own Python stack and want to keep Omnistat isolated from the application’s Python environment. The SLURM job script below uses the wrapper before and after executing a GPU application:

Listing 16 Example Frontier SLURM job using wrapper, highlighting changes needed to run Omnistat
#!/bin/bash
#SBATCH -A <project_id>
#SBATCH -J omnistat
#SBATCH -N 2
#SBATCH -t 01:00:00
#SBATCH -S 0

# Setup and launch Omnistat (wrapper version)
ml use /sw/frontier/amdsw/modulefiles
ml omnistat-wrapper
${OMNISTAT_WRAPPER} usermode --start --interval 1.0

# Your GPU application here
srun ./your_application

# Tear down Omnistat
${OMNISTAT_WRAPPER} usermode --stopexporters
${OMNISTAT_WRAPPER} query --interval 1.0 --job ${SLURM_JOB_ID} --pdf omnistat.${SLURM_JOB_ID}.pdf
${OMNISTAT_WRAPPER} usermode --stopserver

Storage

By default, Omnistat databases are stored in Lustre, under the /lustre/orion/${SLURM_JOB_ACCOUNT}/scratch/${USER}/omnistat/${SLURM_JOB_ID} directory. The default path can be overridden with the OMNISTAT_VICTORIA_DATADIR environment variable, as shown in the following example.

Listing 17 Storing data under /tmp and copying it after running
#!/bin/bash
#SBATCH -A <project_id>
#SBATCH -J omnistat
#SBATCH -N 2
#SBATCH -t 01:00:00
#SBATCH -S 0

# Setup and launch Omnistat (wrapper version)
ml use /sw/frontier/amdsw/modulefiles
ml omnistat-wrapper
export OMNISTAT_VICTORIA_DATADIR=/tmp/omnistat/${SLURM_JOB_ID}
${OMNISTAT_WRAPPER} usermode --start --interval 1.0

# Your GPU application here
srun ./your_application

# Tear down Omnistat
${OMNISTAT_WRAPPER} usermode --stopexporters
${OMNISTAT_WRAPPER} query --interval 1.0 --job ${SLURM_JOB_ID} --pdf omnistat.${SLURM_JOB_ID}.pdf
${OMNISTAT_WRAPPER} usermode --stopserver
mv /tmp/omnistat/${SLURM_JOB_ID} data_omnistat.${SLURM_JOB_ID}

Note

Omnistat databases require flock support, which is available in Lustre and local filesystems like /tmp. Data can’t be stored directly under Frontier’s $HOME, but it can be moved there after running.
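
Before pointing OMNISTAT_VICTORIA_DATADIR at an unfamiliar filesystem, it can help to probe whether locks work there. The following is a minimal sketch, assuming the flock(1) utility from util-linux is available on the node; the function name supports_flock is hypothetical and not part of Omnistat.

```shell
#!/bin/bash
# Hypothetical helper (not part of Omnistat): return 0 if an exclusive,
# non-blocking lock can be taken on a scratch file inside the given
# directory, and non-zero otherwise.
supports_flock() {
    local dir="$1" probe rc
    mkdir -p "${dir}" || return 1
    probe="$(mktemp "${dir}/.flock_probe.XXXXXX")" || return 1
    # flock(1) locks the probe file, runs `true`, then releases the lock.
    if flock --nonblock "${probe}" true; then
        rc=0
    else
        rc=1
    fi
    rm -f "${probe}"
    return "${rc}"
}

# Possible use in a job script, before starting Omnistat:
#   supports_flock "${OMNISTAT_VICTORIA_DATADIR}" || echo "no flock support" >&2
```

A successful probe only shows that a lock could be taken at that moment; it does not guarantee that the filesystem is otherwise suitable for the database.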

Data Analysis

After job completion, transfer the archived Omnistat data to your local machine for analysis using the Docker environment described in the user-mode guide.
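
One way to prepare the data for transfer is to pack the job's database directory into a single compressed tarball. The sketch below assumes GNU tar; the function name archive_omnistat_data and the file names in the usage comment are illustrative only.

```shell
#!/bin/bash
# Hypothetical helper: pack one job's Omnistat data directory into a
# gzip-compressed tarball. The directory is stored under its own name so it
# unpacks into a single folder on the destination machine.
archive_omnistat_data() {
    local data_dir="$1" archive="$2"
    if [[ ! -d "${data_dir}" ]]; then
        echo "no such directory: ${data_dir}" >&2
        return 1
    fi
    tar -czf "${archive}" \
        -C "$(dirname "${data_dir}")" "$(basename "${data_dir}")"
}

# Possible use after the job finishes (names are examples):
#   archive_omnistat_data "data_omnistat.${SLURM_JOB_ID}" "omnistat.${SLURM_JOB_ID}.tar.gz"
```

The resulting archive can then be copied off Frontier with whatever transfer tool your project normally uses.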

Additional Features

Vendor Counters

Frontier exposes an additional site‑specific collector beyond the standard set documented in the main metrics overview. The vendor collector ingests additional telemetry made possible by site‑specific platform integrations. On Frontier, this collector leverages the HPE Cray pm_counters interface and translates raw counter files into metrics that distinguish cumulative energy and instantaneous power samples for different node-level components. Note that GPU metrics are indexed by an accel label and this index may differ from the ordering used by ROCm.

Collector: enable_vendor_counters

Node Metric                             Description
omnistat_vendor_energy_joules           Total node energy consumption (J). Labels: vendor.
omnistat_vendor_power_watts             Instantaneous total node power draw (W). Labels: vendor.
omnistat_vendor_cpu_energy_joules       Cumulative CPU energy (J). Labels: vendor.
omnistat_vendor_cpu_power_watts         Instantaneous CPU power (W). Labels: vendor.
omnistat_vendor_memory_energy_joules    Cumulative system memory energy (J). Labels: vendor.
omnistat_vendor_memory_power_watts      Instantaneous system memory power (W). Labels: vendor.

GPU Metric                              Description
omnistat_vendor_accel_energy_joules     Cumulative accelerator (GPU) energy (J) for each device. Labels: vendor, accel.
omnistat_vendor_accel_power_watts       Instantaneous accelerator (GPU) power (W) for each device. Labels: vendor, accel.
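
To illustrate how the cumulative energy metrics relate to the instantaneous power metrics, the sketch below reads a raw counter value and derives average power from two energy readings. It is not Omnistat's implementation. It assumes that counter files live under /sys/cray/pm_counters and hold a single line of the form "value unit timestamp", which should be verified on the system; both function names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the numeric value from a counter file.
# ASSUMPTION: the file holds one line whose first field is the value,
# e.g. "1000 J 1700000000000000 us".
read_pm_counter() {
    local file="$1"
    awk 'NR == 1 { print $1 }' "${file}"
}

# Average power (W) between two cumulative energy readings (J) taken
# dt seconds apart: P = (E_end - E_start) / dt.
average_power_watts() {
    local e_start="$1" e_end="$2" dt="$3"
    awk -v a="${e_start}" -v b="${e_end}" -v dt="${dt}" \
        'BEGIN { if (dt <= 0) exit 1; printf "%.1f\n", (b - a) / dt }'
}

# Possible use on a compute node (path is an assumption):
#   e0="$(read_pm_counter /sys/cray/pm_counters/energy)"; sleep 10
#   e1="$(read_pm_counter /sys/cray/pm_counters/energy)"
#   average_power_watts "${e0}" "${e1}" 10
```

The same arithmetic applies to the omnistat_vendor_*_energy_joules metrics: the difference between two samples divided by the elapsed time gives the average power over that window, which smooths out spikes that instantaneous power samples may catch or miss.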

Kernel Tracing

Kernel tracing is available on Frontier starting with Omnistat 1.12.0. It works by loading a tracing library (libomnistat_trace.so) into the application’s runtime environment to intercept GPU kernel dispatches. Pre-built tracing libraries are currently provided for ROCm 6.4.0 through 7.2.0.

To enable kernel tracing:

  1. Create a custom configuration file with enable_kernel_trace = True, using the default Frontier configuration file as a starting point.

  2. Set OMNISTAT_CONFIG to point to your custom file in the job script.

  3. Set ROCP_TOOL_LIBRARIES to the pre-built tracing library matching the ROCm version used to build your application. Pre-built libraries are available under ${OMNISTAT_DIR}/build-trace-rocm-<version>/. This variable must be set before launching the application.
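
Step 1 can be scripted by copying the default configuration and flipping the option in place. The following is a rough sketch, assuming the default file already contains an enable_kernel_trace line; if it does not, the helper stops rather than guessing which section the option belongs in. The function name set_config_option and the paths in the usage comment are placeholders.

```shell
#!/bin/bash
# Hypothetical helper: copy a config file to a new location, rewriting the
# line that defines `key` so it reads `key = value`. Returns non-zero if the
# key is not present in the source file. Values containing `|` are not
# supported because `|` is used as the sed delimiter.
set_config_option() {
    local src="$1" dst="$2" key="$3" value="$4"
    if ! grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "${src}"; then
        echo "${key} not found in ${src}; add it by hand" >&2
        return 1
    fi
    sed -E "s|^[[:space:]]*${key}[[:space:]]*=.*|${key} = ${value}|" \
        "${src}" > "${dst}"
}

# Possible use (both paths are placeholders):
#   set_config_option /path/to/default.config "${PWD}/omnistat.config" enable_kernel_trace True
#   export OMNISTAT_CONFIG="${PWD}/omnistat.config"
```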

Listing 18 Example Frontier SLURM job with kernel tracing enabled
#!/bin/bash
#SBATCH -A <project_id>
#SBATCH -J omnistat
#SBATCH -N 2
#SBATCH -t 01:00:00
#SBATCH -S 0

# Load ROCm (must match the version used to build your application)
ml rocm/7.2.0

# Setup and launch Omnistat (wrapper version)
ml use /sw/frontier/amdsw/modulefiles
ml omnistat-wrapper
export OMNISTAT_CONFIG=/path/to/omnistat.config
${OMNISTAT_WRAPPER} usermode --start --interval 1.0

# Enable Omnistat kernel tracing
export ROCP_TOOL_LIBRARIES=${OMNISTAT_DIR}/build-trace-rocm-${CRAY_ROCM_VERSION}/libomnistat_trace.so

# Your GPU application here
srun ./your_application

# Tear down Omnistat
${OMNISTAT_WRAPPER} usermode --stopexporters
${OMNISTAT_WRAPPER} query --job ${SLURM_JOB_ID} --interval 1.0
${OMNISTAT_WRAPPER} usermode --stopserver
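
Because pre-built tracing libraries exist only for certain ROCm versions, a job script may want to confirm that the library is present before exporting ROCP_TOOL_LIBRARIES. A minimal sketch follows; the helper name trace_library_path is hypothetical, and the directory layout is the one described in step 3 above.

```shell
#!/bin/bash
# Hypothetical helper: print the path of the pre-built tracing library for
# the given ROCm version, or fail with a message if no such file exists.
trace_library_path() {
    local omnistat_dir="$1" rocm_version="$2"
    local lib="${omnistat_dir}/build-trace-rocm-${rocm_version}/libomnistat_trace.so"
    if [[ ! -f "${lib}" ]]; then
        echo "no tracing library for ROCm ${rocm_version} at ${lib}" >&2
        return 1
    fi
    printf '%s\n' "${lib}"
}

# Possible use in place of the plain export in the listing:
#   if lib="$(trace_library_path "${OMNISTAT_DIR}" "${CRAY_ROCM_VERSION}")"; then
#       export ROCP_TOOL_LIBRARIES="${lib}"
#   fi
```

With this guard, a job built against an unsupported ROCm version still runs, just without kernel tracing, and the missing library is reported on standard error.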