Metrics Available

Omnistat includes multiple embedded data collectors that aggregate metrics from a variety of system sources. Many of the collectors are optional and can be enabled through runtime configuration settings (e.g. in omnistat.default). The sections and tables that follow outline the major data collectors, their associated runtime configuration options, and the specific metric names each collector defines.
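As a rough illustration of how collectors might be toggled, the sketch below parses a configuration fragment with Python's configparser and lists the options that are switched on. The INI layout and the [omnistat.collectors] section name are assumptions made for this example; only the option names (enable_rocm_smi, enable_amd_smi, enable_rms, enable_network) are taken from this page. Check the omnistat.default file shipped with your installation for the actual layout.

```python
import configparser

# Hypothetical configuration fragment: the section name and INI layout are
# assumptions for illustration; the option names are those listed on this page.
SAMPLE_CONFIG = """
[omnistat.collectors]
enable_rocm_smi = True
enable_amd_smi = False
enable_rms = True
enable_network = False
"""


def enabled_collectors(config_text, section="omnistat.collectors"):
    """Return the sorted names of enable_* options that are set to true."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return sorted(
        name
        for name in parser[section]
        if name.startswith("enable_") and parser[section].getboolean(name)
    )


print(enabled_collectors(SAMPLE_CONFIG))
```

With the sample fragment, only enable_rms and enable_rocm_smi are reported as enabled.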

Omnistat metrics generally fall into one of two types:

  • Node-level metrics: These are reported once per node and are designated with a Node Metric heading.

  • GPU-level metrics: These are reported for each individual GPU on a node and include a card label to distinguish between GPUs. They are designated with a GPU Metric heading.

ROCm

This core data collector provides essential metrics for monitoring AMD Instinct™ GPUs covering utilization, memory usage, power consumption, frequencies, and temperature. These metrics can be collected using the ROCm System Management Interface (ROCm SMI) or the AMD System Management Interface (AMD SMI) and are fundamental for assessing GPU health and performance.

Collector: enable_rocm_smi or enable_amd_smi

| Node Metric | Description |
| --- | --- |
| rocm_num_gpus | Number of GPUs in the node. |

| GPU Metric | Description |
| --- | --- |
| rocm_version_info | GPU model and versioning information for GPU driver and VBIOS. Labels: driver_ver, vbios, type, schema. |
| rocm_utilization_percentage | GPU utilization (%). |
| rocm_vram_used_percentage | Memory utilization (%). |
| rocm_average_socket_power_watts | Average socket power (W). |
| rocm_sclk_clock_mhz | GPU clock speed (MHz). |
| rocm_mclk_clock_mhz | Memory clock speed (MHz). |
| rocm_temperature_celsius | GPU temperature (°C). Labels: location. |
| rocm_temperature_memory_celsius | Memory temperature (°C). Labels: location. |
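To illustrate how the card label separates per-GPU samples, the following sketch parses a few sample lines and averages GPU utilization across a node. It assumes the metrics are exposed in Prometheus text exposition format, which this page does not state explicitly, and the sample values are invented for the example.

```python
import re

# Invented sample in Prometheus text exposition format (an assumption about
# how the metrics are exposed); the values are illustrative only.
SAMPLE = """\
rocm_utilization_percentage{card="0"} 100.0
rocm_utilization_percentage{card="1"} 50.0
rocm_vram_used_percentage{card="0"} 25.0
rocm_vram_used_percentage{card="1"} 75.0
"""

# Matches lines of the form: metric_name{card="N"} value
LINE_RE = re.compile(r'^(\w+)\{card="(\d+)"\}\s+(\S+)$')


def per_card(text, metric):
    """Map each card label to the sample value of the given metric."""
    values = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match and match.group(1) == metric:
            values[match.group(2)] = float(match.group(3))
    return values


utilization = per_card(SAMPLE, "rocm_utilization_percentage")
node_mean = sum(utilization.values()) / len(utilization)
print(utilization, node_mean)
```

The regular expression only handles a single card label, which keeps the sketch short; metrics that carry additional labels (such as location) would need a fuller parser.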

Resource Manager

The resource manager data collector links system-level monitoring data with specific jobs running on the system. This is essential for attributing resource usage to individual users or applications.

Collector: enable_rms

| Node Metric | Description |
| --- | --- |
| rmsjob_info | Resource manager info metric tracking running jobs. When a job is running, the jobid label is set to a non-empty string. Labels: jobid, user, partition, nodes, batchflag, jobstep, type. |
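The jobid convention can be checked with a small helper. The sketch below treats a label set with an empty jobid as an idle node; the label values are invented for illustration, and only the label names come from the rmsjob_info description.

```python
def running_job(labels):
    """Return the job ID from rmsjob_info labels, or None when the node is idle."""
    jobid = labels.get("jobid", "")
    return jobid if jobid != "" else None


# Invented label sets for illustration; only the label names are from this page.
idle = {"jobid": "", "user": "", "partition": ""}
busy = {"jobid": "12345", "user": "alice", "partition": "gpu"}

print(running_job(idle), running_job(busy))
```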

Annotations

The resource manager collector optionally allows users to add application-level context to Omnistat metrics using the omnistat-annotate tool. This is useful for marking specific events or phases within an application, such as the start and end of a computation, making it easier to correlate performance data with application behavior.

Collector: enable_rms
Collector options: enable_annotations

| Node Metric | Description |
| --- | --- |
| rmsjob_annotations | User-provided annotations. Labels: jobid, marker. |

RAS

The RAS (Reliability, Availability, Serviceability) collection mechanism is an optional capability of the ROCm data collectors and provides information about ECC errors in different GPU blocks. There are three types of ECC errors available for tracking:

  • Correctable: Single-bit errors that are automatically corrected by the hardware. These do not cause data corruption or affect functionality.

  • Uncorrectable: Multi-bit errors that cannot be corrected by the hardware. These can lead to data corruption and system instability.

  • Deferred: Multi-bit errors that cannot be corrected by the hardware but can be flagged or isolated. These need to be handled to ensure data integrity and system stability.

Collectors: enable_rocm_smi or enable_amd_smi, together with enable_ras_ecc

| GPU Metric | Description |
| --- | --- |
| rocm_ras_umc_correctable_count | Correctable errors in the Unified Memory Controller block. |
| rocm_ras_sdma_correctable_count | Correctable errors in the System Direct Memory Access block. |
| rocm_ras_gfx_correctable_count | Correctable errors in the Graphics Processing Unit block. |
| rocm_ras_mmhub_correctable_count | Correctable errors in the Multi Media Hub block. |
| rocm_ras_pcie_bif_correctable_count | Correctable errors in the PCIe Bifurcation block. |
| rocm_ras_hdp_correctable_count | Correctable errors in the Host Data Path block. |
| rocm_ras_xgmi_wafl_correctable_count | Correctable errors in the External Global Memory Interconnect block. |
| rocm_ras_umc_uncorrectable_count | Uncorrectable errors in the Unified Memory Controller block. |
| rocm_ras_sdma_uncorrectable_count | Uncorrectable errors in the System Direct Memory Access block. |
| rocm_ras_gfx_uncorrectable_count | Uncorrectable errors in the Graphics Processing Unit block. |
| rocm_ras_mmhub_uncorrectable_count | Uncorrectable errors in the Multi Media Hub block. |
| rocm_ras_pcie_bif_uncorrectable_count | Uncorrectable errors in the PCIe Bifurcation block. |
| rocm_ras_hdp_uncorrectable_count | Uncorrectable errors in the Host Data Path block. |
| rocm_ras_xgmi_wafl_uncorrectable_count | Uncorrectable errors in the External Global Memory Interconnect block. |
| rocm_ras_umc_deferred_count | Deferred[1] errors in the Unified Memory Controller block. |
| rocm_ras_sdma_deferred_count | Deferred[1] errors in the System Direct Memory Access block. |
| rocm_ras_gfx_deferred_count | Deferred[1] errors in the Graphics Processing Unit block. |
| rocm_ras_mmhub_deferred_count | Deferred[1] errors in the Multi Media Hub block. |
| rocm_ras_pcie_bif_deferred_count | Deferred[1] errors in the PCIe Bifurcation block. |
| rocm_ras_hdp_deferred_count | Deferred[1] errors in the Host Data Path block. |
| rocm_ras_xgmi_wafl_deferred_count | Deferred[1] errors in the External Global Memory Interconnect block. |
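The RAS metric names follow a regular pattern, rocm_ras_<block>_<type>_count, so the full list can be generated rather than typed out. The sketch below builds the names and sums the uncorrectable counts from a mapping of metric names to values. The helper names are hypothetical and the sample values are invented for illustration.

```python
from itertools import product

# Block and error-type identifiers as they appear in the metric names above.
BLOCKS = ["umc", "sdma", "gfx", "mmhub", "pcie_bif", "hdp", "xgmi_wafl"]
ERROR_TYPES = ["correctable", "uncorrectable", "deferred"]


def ras_metric_names():
    """Build every rocm_ras_<block>_<type>_count name, grouped by error type."""
    return [
        f"rocm_ras_{block}_{kind}_count"
        for kind, block in product(ERROR_TYPES, BLOCKS)
    ]


def total_uncorrectable(samples):
    """Sum the uncorrectable counts found in a {metric name: value} mapping."""
    return sum(
        value
        for name, value in samples.items()
        if name.startswith("rocm_ras_") and name.endswith("_uncorrectable_count")
    )


# Invented sample values for a single GPU.
sample = {
    "rocm_ras_umc_correctable_count": 3,
    "rocm_ras_umc_uncorrectable_count": 1,
    "rocm_ras_gfx_uncorrectable_count": 2,
}

print(len(ras_metric_names()), total_uncorrectable(sample))
```

A nonzero uncorrectable total is usually the value worth alerting on, since those errors can lead to data corruption.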

Occupancy

The occupancy collection mechanism is another optional capability of the ROCm data collectors. It shows how the GPU’s compute units (CUs) are being utilized. Occupancy represents the ratio of active wavefronts to the maximum number of wavefronts that a CU can handle simultaneously.

Collectors: enable_rocm_smi or enable_amd_smi, together with enable_cu_occupancy

| GPU Metric | Description |
| --- | --- |
| rocm_num_compute_units | Number of compute units. |
| rocm_compute_unit_occupancy | Number of used compute units. |
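A per-GPU occupancy percentage can be derived from the two metrics above by dividing the used compute units by the total. This is a sketch of one plausible derived quantity, not a metric that Omnistat itself defines, and the sample values are invented.

```python
def occupancy_percent(used_cus, total_cus):
    """Return the share of compute units in use as a percentage."""
    if total_cus <= 0:
        raise ValueError("total_cus must be positive")
    return 100.0 * used_cus / total_cus


# Invented sample: rocm_compute_unit_occupancy = 76, rocm_num_compute_units = 304.
print(occupancy_percent(76, 304))
```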

Network

The network data collector reports data-transfer metrics for each network interface detected on the host platform. Currently supported network types include Ethernet, InfiniBand, and Slingshot.

Collector: enable_network

| Node Metric | Description |
| --- | --- |
| omnistat_network_tx_bytes | Total bytes transmitted by network interface. Labels: device_class, interface. |
| omnistat_network_rx_bytes | Total bytes received by network interface. Labels: device_class, interface. |
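Because both metrics are running totals, throughput has to be derived from the difference between two samples. The sketch below shows that calculation. The function name is hypothetical, the timestamps and byte counts are invented, and returning None when the counter appears to have reset is a simplifying assumption.

```python
def throughput_bytes_per_sec(prev_bytes, prev_time, curr_bytes, curr_time):
    """Average transfer rate between two samples of a cumulative byte counter."""
    elapsed = curr_time - prev_time
    if elapsed <= 0 or curr_bytes < prev_bytes:
        return None  # out-of-order samples or a counter reset
    return (curr_bytes - prev_bytes) / elapsed


# Invented samples of omnistat_network_tx_bytes taken 10 seconds apart.
print(throughput_bytes_per_sec(1_000_000, 0.0, 6_000_000, 10.0))
```

The same calculation applies to omnistat_network_rx_bytes, evaluated separately for each device_class and interface label combination.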