Common Metrics Recommended for Grafana

The following common metrics are recommended for MATRIXX Engine Grafana dashboards.

Node/Pod CPU and Memory

Node and pod CPU and memory usage should stay below a threshold derived from the bill of materials. To determine whether your system meets this criterion, MATRIXX Support recommends using the metrics described in Table 1, Node/Pod CPU and Memory Metrics Recommended for Grafana.

Table 1. Node/Pod CPU and Memory Metrics Recommended for Grafana
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| node_cpu_usage_seconds_total | Custom | | The cumulative CPU time consumed by the node in core-seconds. |
| node_memory_working_set_bytes | Custom | | The current working set of the node in bytes. |
| pod_cpu_usage_seconds_total | Custom | pod, namespace | The cumulative CPU time consumed by the pod in core-seconds. |
| pod_memory_working_set_bytes | Custom | pod, namespace | The current working set of the pod in bytes. |

For Kubernetes metrics reference, see the Kubernetes documentation.
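
Because node_cpu_usage_seconds_total and pod_cpu_usage_seconds_total are cumulative, two samples must be turned into a rate before usage can be compared against a threshold. The following Python sketch shows that arithmetic outside Grafana. The Sample type, the function names, and the example threshold are hypothetical; the real threshold comes from your bill of materials.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One scrape of a cumulative counter (hypothetical helper type)."""
    timestamp_s: float  # scrape time in seconds
    value: float        # counter value, for example core-seconds


def average_cores_used(earlier: Sample, later: Sample) -> float:
    """Return the average number of CPU cores consumed between two scrapes."""
    elapsed = later.timestamp_s - earlier.timestamp_s
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = later.value - earlier.value
    if delta < 0:
        # Assumption: a smaller value means the counter restarted from zero,
        # so the later value is taken as the increase since the reset.
        delta = later.value
    return delta / elapsed


def exceeds_threshold(cores_used: float, cores_available: float,
                      threshold: float) -> bool:
    """Return True if the used fraction of available cores is above threshold."""
    if cores_available <= 0:
        raise ValueError("cores_available must be positive")
    return cores_used / cores_available > threshold
```

For example, a counter that grows from 100 to 220 core-seconds over 60 seconds corresponds to an average of 2 cores in use during that interval.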

Per Process CPU and Memory Usage

Per Process CPU and Memory Usage Metrics Recommended for Grafana describes metrics to help troubleshoot busy processes or memory leaks.

Table 2. Per Process CPU and Memory Usage Metrics Recommended for Grafana
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| sysServiceSystemCpuTime | Gauge | sysServiceStatsServiceId: service ID | The service process system CPU time in OS jiffies. |
| sysServiceTotalCpuTime | Gauge | sysServiceStatsServiceId: service ID | The service process total CPU time in OS jiffies. |
| sysServiceUserCpuTime | Gauge | sysServiceStatsServiceId: service ID | The service process user CPU time in OS jiffies. |
| sysServiceResidentSetSizeKb | Gauge | sysServiceStatsServiceId: service ID | The service process resident set size in kilobytes. |
| sysServiceVirtualMemorySizeKb | Gauge | sysServiceStatsServiceId: service ID | The service process virtual memory size in kilobytes. |
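
As a rough illustration of how these values can be read, the Python sketch below converts a change in sysServiceTotalCpuTime into a CPU percentage and flags a resident set size that only ever grows. It assumes that the CPU-time values accumulate over the life of the process and that one second equals 100 jiffies, a common Linux default that this document does not confirm for your platform. Both function names are hypothetical.

```python
from typing import Sequence


def cpu_percent_from_jiffies(jiffies_start: float, jiffies_end: float,
                             interval_s: float,
                             ticks_per_second: int = 100) -> float:
    """Convert a jiffies delta over interval_s seconds into percent of one core.

    ticks_per_second=100 is an assumption; on Linux the actual value can be
    read with os.sysconf("SC_CLK_TCK").
    """
    if interval_s <= 0 or ticks_per_second <= 0:
        raise ValueError("interval_s and ticks_per_second must be positive")
    cpu_seconds = (jiffies_end - jiffies_start) / ticks_per_second
    return 100.0 * cpu_seconds / interval_s


def is_steadily_growing(values_kb: Sequence[float]) -> bool:
    """Return True if the samples never shrink and end higher than they start.

    A crude hint of a possible memory leak, not proof of one.
    """
    if len(values_kb) < 2:
        return False
    never_shrinks = all(b >= a for a, b in zip(values_kb, values_kb[1:]))
    return never_shrinks and values_kb[-1] > values_kb[0]
```

A process whose total CPU time rises by 3000 jiffies in 60 seconds would, under the 100-jiffies-per-second assumption, be using about half of one core.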

Shared Memory Stats

Shared Memory Stats Recommended for Grafana describes metrics that show whether you have enough free shared memory (sysTotalMemoryPoolSizeMb - sysTotalMemoryPoolInUseMb) for projected database growth and for any run-time allocation caused by a load spike.

Table 3. Shared Memory Stats Recommended for Grafana
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| sysTotalMemoryPoolInUseMb | Gauge | | The total size (in megabytes) of the system memory dedicated to databases, buffer pools (mtxbufs), and shared memory multi-queues that is in use. |
| sysTotalMemoryPoolSizeMb | Gauge | | The total size (in megabytes) of the system memory dedicated to databases, buffer pools (mtxbufs), and shared memory multi-queues. |
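
The free shared memory calculation can be sketched in Python as follows. The function names and the projected_growth_mb and spike_reserve_mb parameters are hypothetical; the sizing figures themselves must come from your own capacity planning.

```python
def free_shared_memory_mb(pool_size_mb: float, pool_in_use_mb: float) -> float:
    """Free shared memory: sysTotalMemoryPoolSizeMb - sysTotalMemoryPoolInUseMb."""
    return pool_size_mb - pool_in_use_mb


def has_headroom(pool_size_mb: float, pool_in_use_mb: float,
                 projected_growth_mb: float,
                 spike_reserve_mb: float = 0.0) -> bool:
    """Return True if free memory covers projected growth plus a spike reserve."""
    free_mb = free_shared_memory_mb(pool_size_mb, pool_in_use_mb)
    return free_mb >= projected_growth_mb + spike_reserve_mb
```

With a 65536 MB pool and 40000 MB in use, 25536 MB is free, which would cover 20000 MB of projected growth plus a 5000 MB reserve but not a 6000 MB reserve.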

Disk Stats

Disk Stats Recommended for Grafana describes metrics to help verify that both local (fast-shared) storage and shared storage have enough free space.

Table 4. Disk Stats Recommended for Grafana
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| statSysInfoDiskAvailableMb | Gauge | statSysInfoDiskIdStr: either local (fast-shared) or shared | The disk available space in megabytes. |
| statSysInfoDiskAvailablePct | Gauge | statSysInfoDiskIdStr: either local (fast-shared) or shared | The disk available space as a percentage. |

Queue Stats

The queue full count should be zero. It might be nonzero after a sudden load spike, but it should not increase constantly. The queue current count should be zero most of the time. A high current count means that the task cannot process messages fast enough. If the queue is in the request processing path, this might cause higher message latency. Queue Stats Recommended for Grafana describes metrics to identify queue issues.

Table 5. Queue Stats Recommended for Grafana
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| sysQueueStatsFullCount | Gauge | sysQueueStatsServiceId: service ID; queueName: the name of the queue | The number of times the queue was full. |
| sysQueueStatsCurrentCount | Gauge | sysQueueStatsServiceId: service ID; queueName: the name of the queue | The current number of queued messages waiting to be processed. |
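
The distinction between a one-time spike and a constantly increasing full count can be sketched in Python as shown below. The function names and the three labels are hypothetical, and "constantly increasing" is interpreted here as an increase in every interval of the sampled window, which is an assumption rather than a MATRIXX definition.

```python
from typing import Sequence


def full_count_trend(samples: Sequence[int]) -> str:
    """Classify a window of sysQueueStatsFullCount samples, oldest first.

    Returns "healthy" if every sample is zero, "growing" if the count rose
    in every interval, and "spike" for any other nonzero window.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to judge a trend")
    if all(value == 0 for value in samples):
        return "healthy"
    rose_every_interval = all(b > a for a, b in zip(samples, samples[1:]))
    return "growing" if rose_every_interval else "spike"


def zero_fraction(current_counts: Sequence[int]) -> float:
    """Fraction of sysQueueStatsCurrentCount samples that are zero."""
    if not current_counts:
        raise ValueError("need at least one sample")
    zeros = sum(1 for value in current_counts if value == 0)
    return zeros / len(current_counts)
```

A window such as 0, 5, 5, 5 is classified as a spike, whereas 1, 2, 4, 9 is classified as growing and would warrant investigation.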