Metrics Available
Omnistat includes multiple embedded data collectors that aggregate metrics from a variety of system sources. Many of these collectors are optional and can be enabled through runtime configuration settings (e.g. via omnistat.default). The sections and tables that follow outline the major data collector variants, their associated runtime configuration options, and a comprehensive list of the metric names defined for each collector.
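As an illustration of how the runtime options named on this page might be inspected, the following minimal sketch parses an INI-style configuration and lists the collectors that are switched on. The section name `omnistat.collectors`, the sample values, and the helper `enabled_collectors` are assumptions made for this example and are not taken from this page; check the omnistat.default file shipped with your installation for the actual layout.

```python
import configparser

# Hypothetical configuration text. The section name and the values are
# assumptions for illustration; only the enable_* option names appear
# in this document.
SAMPLE_CONFIG = """
[omnistat.collectors]
enable_rocm_smi = True
enable_amd_smi = False
enable_rms = True
enable_network = False
"""


def enabled_collectors(config_text, section="omnistat.collectors"):
    """Return the sorted enable_* option names that are set to a true value."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if not parser.has_section(section):
        return []
    return sorted(
        name
        for name in parser.options(section)
        if name.startswith("enable_") and parser.getboolean(section, name)
    )


if __name__ == "__main__":
    print(enabled_collectors(SAMPLE_CONFIG))
```

Reading the file with the standard library keeps the check independent of Omnistat itself, so it can run on a node before the collector is started.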
Note that Omnistat metrics generally fall into one of the two following types:
Node-level metrics: These are reported once per node and are designated with a Node Metric heading.
GPU-level metrics: These are reported for each individual GPU on a node and include a card label to distinguish between them. These metric types are denoted with a GPU Metric heading.
ROCm
This core data collector provides essential metrics for monitoring AMD Instinct™ GPUs covering utilization, memory usage, power consumption, frequencies, and temperature. These metrics can be collected using the ROCm System Management Interface (ROCm SMI) or the AMD System Management Interface (AMD SMI) and are fundamental for assessing GPU health and performance.
Collector: enable_rocm_smi or enable_amd_smi
| Node Metric | Description |
|---|---|
| | Number of GPUs in the node. |
| GPU Metric | Description |
|---|---|
| | GPU model and versioning information for GPU driver and VBIOS. Labels: |
| | GPU utilization (%). |
| | Memory utilization (%). |
| | Average socket power (W). |
| | GPU clock speed (MHz). |
| | Memory clock speed (MHz). |
| | GPU temperature (°C). Labels: |
| | Memory temperature (°C). Labels: |
Resource Manager
The resource manager data collector links system-level monitoring data with specific jobs running on the system. This is essential for attributing resource usage to individual users or applications.
Collector: enable_rms
| Node Metric | Description |
|---|---|
| | Resource manager info metric tracking running jobs. When a job is running, the |
Annotations
The resource manager collector optionally allows users to add application-level context to Omnistat metrics using the omnistat-annotate tool. This is useful for marking specific events or phases within an application, such as the start and end of a computation, making it easier to correlate performance data with application behavior.
Collector: enable_rms
Collector options: enable_annotations
| Node Metric | Description |
|---|---|
| | User-provided annotations. Labels: |
RAS
The RAS (Reliability, Availability, Serviceability) collection mechanism is an optional capability of the ROCm data collectors and provides information about ECC errors in different GPU blocks. There are three types of ECC errors available for tracking:
Correctable: Single-bit errors that are automatically corrected by the hardware. These do not cause data corruption or affect functionality.
Uncorrectable: Multi-bit errors that cannot be corrected by the hardware. These can lead to data corruption and system instability.
Deferred: Multi-bit errors that cannot be corrected by the hardware but can be flagged or isolated. These need to be handled to ensure data integrity and system stability.
Collectors: enable_rocm_smi or enable_amd_smi, enable_ras_ecc
| GPU Metric | Description |
|---|---|
| | Correctable errors in the Unified Memory Controller block. |
| | Correctable errors in the System Direct Memory Access block. |
| | Correctable errors in the Graphics Processing Unit block. |
| | Correctable errors in the Multi Media Hub block. |
| | Correctable errors in the PCIe Bifurcation block. |
| | Correctable errors in the Host Data Path block. |
| | Correctable errors in the External Global Memory Interconnect block. |
| | Uncorrectable errors in the Unified Memory Controller block. |
| | Uncorrectable errors in the System Direct Memory Access block. |
| | Uncorrectable errors in the Graphics Processing Unit block. |
| | Uncorrectable errors in the Multi Media Hub block. |
| | Uncorrectable errors in the PCIe Bifurcation block. |
| | Uncorrectable errors in the Host Data Path block. |
| | Uncorrectable errors in the External Global Memory Interconnect block. |
| | Deferred[1] errors in the Unified Memory Controller block. |
| | Deferred[1] errors in the System Direct Memory Access block. |
| | Deferred[1] errors in the Graphics Processing Unit block. |
| | Deferred[1] errors in the Multi Media Hub block. |
| | Deferred[1] errors in the PCIe Bifurcation block. |
| | Deferred[1] errors in the Host Data Path block. |
| | Deferred[1] errors in the External Global Memory Interconnect block. |
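The per-block counters above can be rolled up per GPU to decide which cards deserve a closer look. The sketch below is one possible way to do that and is not part of Omnistat: the record layout `(card, block, kind, count)`, the short block names, and both helper functions are hypothetical. It treats any uncorrectable or deferred error as needing attention, following the descriptions of the three error types given earlier in this section.

```python
from collections import defaultdict

# The three ECC error types described in this section.
ERROR_KINDS = ("correctable", "uncorrectable", "deferred")


def summarize_ecc(records):
    """Sum error counts per card and error type across all GPU blocks.

    records: iterable of (card, block, kind, count) tuples, a hypothetical
    layout chosen for this example.
    """
    totals = defaultdict(lambda: dict.fromkeys(ERROR_KINDS, 0))
    for card, _block, kind, count in records:
        if kind not in ERROR_KINDS:
            raise ValueError(f"unknown ECC error type: {kind}")
        totals[card][kind] += count
    return dict(totals)


def cards_needing_attention(records):
    """Return sorted card ids with at least one uncorrectable or deferred error."""
    totals = summarize_ecc(records)
    return sorted(
        card
        for card, counts in totals.items()
        if counts["uncorrectable"] > 0 or counts["deferred"] > 0
    )


if __name__ == "__main__":
    # Made-up sample values, not measurements.
    sample = [
        ("0", "umc", "correctable", 3),
        ("0", "sdma", "correctable", 1),
        ("1", "umc", "uncorrectable", 1),
        ("2", "hdp", "deferred", 2),
    ]
    print(summarize_ecc(sample))
    print(cards_needing_attention(sample))
```

Correctable errors are only summed here, since the hardware corrects them automatically; a site may still want to alert on an unusually high correctable count.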
Occupancy
The occupancy collection mechanism is another optional capability of the ROCm data collectors. It provides insight into how the GPU’s compute units (CUs) are being utilized. Occupancy represents the ratio of active wavefronts to the maximum number of wavefronts that a CU can handle simultaneously.
Collectors: enable_rocm_smi or enable_amd_smi, enable_cu_occupancy
| GPU Metric | Description |
|---|---|
| | Number of compute units. |
| | Number of used compute units. |
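The two values in the table above can be combined into a single percentage per GPU. The following is a minimal sketch, assuming the used and total compute unit counts have already been read for one card; the function name and the sample numbers are hypothetical. Note that this is simply the fraction of compute units in use as derived from the two reported counts.

```python
def cu_occupancy_percent(used_cus, total_cus):
    """Return the share of compute units in use, as a percentage."""
    if total_cus <= 0:
        raise ValueError("total_cus must be positive")
    if not 0 <= used_cus <= total_cus:
        raise ValueError("used_cus must be between 0 and total_cus")
    return 100.0 * used_cus / total_cus


if __name__ == "__main__":
    # Made-up sample values: 76 of 304 compute units in use.
    print(cu_occupancy_percent(76, 304))
```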
Network
The network data collector provides metrics on data transfers for each network interface detected on the host platform. Currently supported network types include Ethernet, InfiniBand, and Slingshot.
Collector: enable_network
| Node Metric | Description |
|---|---|
| | Total bytes transmitted by network interface. Labels: |
| | Total bytes received by network interface. Labels: |
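Because these metrics report running totals, a transfer rate has to be derived from two samples taken some time apart. The sketch below shows one way to do so under stated assumptions: the function name is hypothetical, the samples are plain byte totals for a single interface, and a decrease between samples is treated as a counter reset (for example after a restart), which is a simplification.

```python
def throughput_bytes_per_second(prev_bytes, curr_bytes, interval_seconds):
    """Estimate the average transfer rate between two byte-total samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_bytes - prev_bytes
    if delta < 0:
        # The total went backwards, so assume the counter restarted from
        # zero and count only the bytes seen since the reset.
        delta = curr_bytes
    return delta / interval_seconds


if __name__ == "__main__":
    # Made-up sample values: 4000 bytes transferred over 2 seconds.
    print(throughput_bytes_per_second(1000, 5000, 2.0))
```

The same calculation applies to both the transmitted and the received totals; monitoring backends that store these metrics typically offer an equivalent rate function.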