Metrics Available

Omnistat supports multiple embedded data collectors to aggregate a large collection of metrics from a variety of system sources. Many of the available data collectors are optional and can be enabled via runtime configuration settings (e.g. via omnistat.default). The sections and tables that follow serve to outline major data collector variants, their associated runtime configuration control options, and a comprehensive list of specific metric names defined for each collector.

Note that Omnistat metrics generally fall into one of the two following types:

  • Node-level metrics: These are reported once per node and are designated with a Node Metric heading.

  • GPU-level metrics: These are reported for each individual GPU on a node and include a card label to distinguish between them. These metric types are denoted with a GPU Metric heading.

ROCm

This core data collector provides essential metrics for monitoring AMD Instinct™ GPUs covering utilization, memory usage, power consumption, frequencies, and temperature. These metrics can be collected using the ROCm System Management Interface (ROCm SMI) or the AMD System Management Interface (AMD SMI) and are fundamental for assessing GPU health and performance.

Collector: enable_rocm_smi or enable_amd_smi

Node Metric

Description

rocm_num_gpus

Number of GPUs in the node.

GPU Metric

Description

rocm_version_info

GPU model and versioning information for GPU driver and VBIOS. Labels: driver_ver, vbios, type, schema.

rocm_utilization_percentage

GPU utilization (%).

rocm_vram_used_percentage

Memory utilization (%).

rocm_average_socket_power_watts

Average socket power (W).

rocm_sclk_clock_mhz

GPU clock speed (MHz).

rocm_mclk_clock_mhz

Memory clock speed (MHz).

rocm_temperature_celsius

GPU temperature (°C). Labels: location.

rocm_temperature_memory_celsius

Memory temperature (°C). Labels: location.

Resource Manager

The resource manager data collector links system-level monitoring data with specific jobs running on the system. This is essential for attributing resource usage to individual users or applications.

Collector: enable_rms

Node Metric

Description

rmsjob_info

Resource manager info metric tracking running jobs. When a job is running, the jobid label is different than the empty string. Labels: jobid, user, partition, nodes, batchflag, jobstep, type.

Annotations

The resource manager collector optionally allows users to add application-level context to Omnistat metrics using the omnistat-annotate tool. This is useful for marking specific events or phases within an application, such as the start and end of a computation, making it easier to correlate performance data with application behavior.

Collector: enable_rms
Collector options: enable_annotations

Node Metric

Description

rmsjob_annotations

User-provided annotations. Labels: jobid, marker.

RAS

The RAS (Reliability, Availability, Serviceability) collection mechanism is an optional capability of the ROCm data collectors and provides information about ECC errors in different GPU blocks. There are three types of ECC errors available for tracking:

  • Correctable: Single-bit errors that are automatically corrected by the hardware. These do not cause data corruption or affect functionality.

  • Uncorrectable: Multi-bit errors that cannot be corrected by the hardware. These can lead to data corruption and system instability.

  • Deferred: Multi-bit errors that cannot be corrected by the hardware but can be flagged or isolated. These need to be handled to ensure data integrity and system stability.

Collectors: enable_rocm_smi or enable_amd_smi, enable_ras_ecc

GPU Metric

Description

rocm_ras_umc_correctable_count

Correctable errors in the Unified Memory Controller block.

rocm_ras_sdma_correctable_count

Correctable errors in the System Direct Memory Access block.

rocm_ras_gfx_correctable_count

Correctable errors in the Graphics Processing Unit block.

rocm_ras_mmhub_correctable_count

Correctable errors in the Multi Media Hub block.

rocm_ras_pcie_bif_correctable_count

Correctable errors in the PCIe Bifurcation block.

rocm_ras_hdp_correctable_count

Correctable errors in the Host Data Path block.

rocm_ras_xgmi_wafl_correctable_count

Correctable errors in the External Global Memory Interconnect block.

rocm_ras_umc_uncorrectable_count

Uncorrectable errors in the Unified Memory Controller block.

rocm_ras_sdma_uncorrectable_count

Uncorrectable errors in the System Direct Memory Access block.

rocm_ras_gfx_uncorrectable_count

Uncorrectable errors in the Graphics Processing Unit block.

rocm_ras_mmhub_uncorrectable_count

Uncorrectable errors in the Multi Media Hub block.

rocm_ras_pcie_bif_uncorrectable_count

Uncorrectable errors in the PCIe Bifurcation block.

rocm_ras_hdp_uncorrectable_count

Uncorrectable errors in the Host Data Path block.

rocm_ras_xgmi_wafl_uncorrectable_count

Uncorrectable errors in the External Global Memory Interconnect block.

rocm_ras_umc_deferred_count

Deferred[1] errors in the Unified Memory Controller block.

rocm_ras_sdma_deferred_count

Deferred[1] errors in the System Direct Memory Access block.

rocm_ras_gfx_deferred_count

Deferred[1] errors in the Graphics Processing Unit block.

rocm_ras_mmhub_deferred_count

Deferred[1] errors in the Multi Media Hub block.

rocm_ras_pcie_bif_deferred_count

Deferred[1] errors in the PCIe Bifurcation block.

rocm_ras_hdp_deferred_count

Deferred[1] errors in the Host Data Path block.

rocm_ras_xgmi_wafl_deferred_count

Deferred[1] errors in the External Global Memory Interconnect block.

Occupancy

The occupancy collection mechanism is another optional capability of the ROCm data collectors that provides insight to help understand how the GPU’s compute units (CUs) are being utilized. It represents the ratio of active wavefronts to the maximum number of wavefronts that a CU can handle simultaneously.

Collectors: enable_rocm_smi or enable_amd_smi, enable_cu_occupancy

GPU Metric

Description

rocm_num_compute_units

Number of compute units.

rocm_compute_unit_occupancy

Number of used compute units.

xGMI

The xGMI (External Global Memory Interconnect) data collector provides metrics for monitoring the total data transferred over the GPU-to-GPU high-speed interconnect. These metrics accumulate over time and are reset upon driver load.

Collectors: enable_rocm_smi or enable_amd_smi, enable_xgmi

GPU Metric

Description

rocm_xgmi_total_read_kilobytes

Total data read from all xGMI links (KB).

rocm_xgmi_total_write_kilobytes

Total data written to all xGMI links (KB).

VCN

The VCN (Video Core Next) collection mechanism is an optional capability of the AMD SMI data collector that provides metrics for monitoring video decoding operations on AMD GPUs. GPUs may contain multiple VCN engines to handle parallel video decoding workloads.

Note

The VCN collector requires enabling the AMD SMI collector (enable_amd_smi). It is not supported by the ROCm SMI collector (enable_rocm_smi).

Collectors: enable_amd_smi, enable_vcn

GPU Metric

Description

rocm_average_decoder_utilization_percentage

Decoder utilization averaged across all engines in the GPU (%).

ROCprofiler

The ROCprofiler data collector provides access to low-level GPU hardware counters for in-depth performance analysis. Counters are collected by sampling the GPUs at the device level with minimal impact on application performance. The collection is configured through the profile option in the configuration file.

Each profile defines a sampling mode and a set of counters to be collected:

  • sampling_mode: This option controls how counter sets are distributed across the available GPUs:

    • constant: Assigns one set of counters to all GPUs.

    • gpu-id: Cyclically assigns sets of counters to GPU IDs in all nodes. The number of sets of counters must not exceed the number of GPUs per node.

    • periodic: Rotates all GPUs through multiple counter sets, changing the active counter set after every sample. When this mode is enabled, counter values are reset at each sampling interval and not accumulated.

  • counters: This option accepts one or more sets of counters formatted as a flat or nested JSON list. For a complete list of supported counters, see the ROCm documentation.

Listing 2 Example profile to collect free-running and active cycles on all GPUs
 [omnistat.collectors.rocprofiler.cycles]
 sampling_mode = constant
 counters = ["GRBM_COUNT", "GRBM_GUI_ACTIVE"]
Listing 3 Example profile to collect HBM reads and writes from different GPU IDs
 [omnistat.collectors.rocprofiler.hbm]
 sampling_mode = gpu-id
 counters = [["FETCH_SIZE"], ["WRITE_SIZE"]]

The ROCprofiler data collector requires building the ROCprofiler extension.

To ensure all performance counters are collected correctly, the collector has the following requirements depending on how Omnistat is executed:

  • System mode: Run Omnistat with the CAP_PERFMON capability enabled.

  • User mode: Set the HSA_TOOLS_LIB environment variable in the application’s runtime environment.

    export HSA_TOOLS_LIB=/opt/rocm/lib/librocprofiler64.so
    

Collector: enable_rocprofiler
Collector options: profile

GPU Metric

Description

omnistat_hardware_counter

GPU hardware counter value from ROCprofiler. Labels: source, name.

Network

The network data collector enables metrics providing information about data transfers for each network interface detected in the host platform. Currently supported network types include Ethernet, Infiniband, and Slingshot.

Collector: enable_network

Node Metric

Description

omnistat_network_tx_bytes

Total bytes transmitted by network interface. Labels: device_class, interface.

omnistat_network_rx_bytes

Total bytes received by network interface. Labels: device_class, interface.