Common Metrics Recommended for Grafana

Node/Pod CPU and Memory

Node and pod CPU and memory usage should stay below a threshold derived from the bill of materials. To determine whether your system meets this criterion, MATRIXX Support recommends using the metrics described in Node/Pod CPU and Memory Metrics Recommended for Grafana.

Table 1. Node/Pod CPU and Memory Metrics Recommended for Grafana
Metric	Type	Labels	Description
node_cpu_usage_seconds_total	Custom		The cumulative CPU time consumed by the node in core-seconds.
node_memory_working_set_bytes	Custom		The current working set of the node in bytes.
pod_cpu_usage_seconds_total	Custom	pod, namespace	The cumulative CPU time consumed by the pod in core-seconds.
pod_memory_working_set_bytes	Custom	pod, namespace	The current working set of the pod in bytes.

For a reference of Kubernetes metrics, see the Kubernetes documentation.
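
Because node_cpu_usage_seconds_total and pod_cpu_usage_seconds_total are cumulative, a threshold check needs the rate of change between two samples rather than the raw value. The following Python sketch shows one way to do that arithmetic. The function names, the 60-second interval, and the 80% threshold are illustrative assumptions and are not values taken from the bill of materials.

```python
def cpu_cores_used(prev_seconds, curr_seconds, interval_seconds):
    """Average number of cores in use between two samples of a
    cumulative core-seconds counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_seconds - prev_seconds
    if delta < 0:
        # Assumption: a smaller value means the counter was reset
        # (for example, after a restart), so count from zero.
        delta = curr_seconds
    return delta / interval_seconds


def exceeds_threshold(cores_used, allocated_cores, threshold_fraction):
    """True if usage is above the given fraction of the allocated cores."""
    return cores_used > allocated_cores * threshold_fraction


# Two samples taken 60 seconds apart (hypothetical values).
used = cpu_cores_used(100.0, 130.0, 60.0)   # 30 core-seconds over 60 s
print(used, exceeds_threshold(used, allocated_cores=2, threshold_fraction=0.8))
```

In a Grafana dashboard the same calculation is normally expressed in the data source's query language; the sketch only makes the underlying arithmetic explicit.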

Per Process CPU and Memory Usage

Per Process CPU and Memory Usage Metrics Recommended for Grafana describes metrics to help troubleshoot busy processes or memory leaks.

Table 2. Per Process CPU and Memory Usage Metrics Recommended for Grafana
Metric	Type	Labels	Description
sysServiceSystemCpuTime	Gauge	sysServiceStatsServiceId: service ID	The service process system CPU time in OS jiffies.
sysServiceTotalCpuTime	Gauge	sysServiceStatsServiceId: service ID	The service process total CPU time in OS jiffies.
sysServiceUserCpuTime	Gauge	sysServiceStatsServiceId: service ID	The service process user CPU time in OS jiffies.
sysServiceResidentSetSizeKb	Gauge	sysServiceStatsServiceId: service ID	The service process resident set size in kilobytes.
sysServiceVirtualMemorySizeKb	Gauge	sysServiceStatsServiceId: service ID	The service process virtual memory size in kilobytes.
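
The CPU time metrics are reported in OS jiffies, so they have to be converted before they can be read as a percentage of one core. The following Python sketch shows the conversion and a simple heuristic for spotting steadily growing resident memory. It assumes that the CPU time values accumulate over the life of the process and that the host uses 100 ticks per second, which is common on Linux but should be confirmed on the target system (for example, with os.sysconf("SC_CLK_TCK")). The function names and the growth heuristic are hypothetical.

```python
def cpu_percent(prev_jiffies, curr_jiffies, interval_seconds, ticks_per_second=100):
    """Percentage of one core used between two jiffy samples.

    ticks_per_second=100 is an assumption; check the value on the host.
    """
    if interval_seconds <= 0 or ticks_per_second <= 0:
        raise ValueError("interval_seconds and ticks_per_second must be positive")
    cpu_seconds = (curr_jiffies - prev_jiffies) / ticks_per_second
    return 100.0 * cpu_seconds / interval_seconds


def rss_growing(samples_kb, min_growth_kb=0):
    """True if the resident set size never decreases across the samples
    and the total growth is larger than min_growth_kb."""
    if len(samples_kb) < 2:
        return False
    non_decreasing = all(b >= a for a, b in zip(samples_kb, samples_kb[1:]))
    return non_decreasing and (samples_kb[-1] - samples_kb[0]) > min_growth_kb


# sysServiceTotalCpuTime sampled 60 seconds apart (hypothetical values).
print(cpu_percent(1000, 4000, 60))
# sysServiceResidentSetSizeKb sampled over time (hypothetical values).
print(rss_growing([204800, 204900, 215000, 230000], min_growth_kb=10240))
```

A monotonically growing resident set size is only a hint of a leak; a process that is still warming up or loading data shows the same pattern.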

Shared Memory Stats

Shared Memory Stats Recommended for Grafana describes metrics that show whether you have enough free shared memory (sysTotalMemoryPoolSizeMb - sysTotalMemoryPoolInUseMb) for projected database growth or for run-time allocation caused by a load spike.

Table 3. Shared Memory Stats Recommended for Grafana
Metric	Type	Labels	Description
sysTotalMemoryPoolInUseMb	Gauge		The total size (in megabytes) of the system memory dedicated to databases, buffer pools (mtxbufs), and shared memory multi-queues that is in use.
sysTotalMemoryPoolSizeMb	Gauge		The total size (in megabytes) of the system memory dedicated to databases, buffer pools (mtxbufs), and shared memory multi-queues.
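
The free shared memory calculation can be sketched in Python as follows. The function name and the sample values are hypothetical; only the subtraction of sysTotalMemoryPoolInUseMb from sysTotalMemoryPoolSizeMb comes from the description above.

```python
def shared_memory_headroom(pool_size_mb, pool_in_use_mb):
    """Return (free_mb, free_pct) for the shared memory pool."""
    if pool_size_mb <= 0:
        raise ValueError("pool_size_mb must be positive")
    free_mb = pool_size_mb - pool_in_use_mb
    free_pct = 100.0 * free_mb / pool_size_mb
    return free_mb, free_pct


# Hypothetical readings of sysTotalMemoryPoolSizeMb and sysTotalMemoryPoolInUseMb.
free_mb, free_pct = shared_memory_headroom(8192, 6144)
print(free_mb, free_pct)
```

How much headroom is enough depends on the projected database growth and expected load spikes for the deployment.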

Disk Stats

Disk Stats Recommended for Grafana describes metrics to help verify that local (fast-shared) and shared storage have enough free space.

Table 4. Disk Stats Recommended for Grafana
Metric	Type	Labels	Description
statSysInfoDiskAvailableMb	Gauge	statSysInfoDiskIdStr: either local (fast-shared) or shared	The available disk space in megabytes.
statSysInfoDiskAvailablePct	Gauge	statSysInfoDiskIdStr: either local (fast-shared) or shared	The available disk space as a percentage.
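
A free-space check on statSysInfoDiskAvailablePct can be sketched in Python as follows. The dictionary keys stand in for the statSysInfoDiskIdStr label values, and both the function name and the 10% threshold are assumptions chosen for illustration.

```python
def disks_low_on_space(available_pct_by_disk, min_available_pct):
    """Return the sorted IDs of disks whose available percentage
    is below min_available_pct."""
    return sorted(
        disk_id
        for disk_id, pct in available_pct_by_disk.items()
        if pct < min_available_pct
    )


# Hypothetical readings keyed by disk ID.
readings = {"local": 42.0, "shared": 8.5}
print(disks_low_on_space(readings, min_available_pct=10.0))
```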

Queue Stats

The queue full count should be zero. It might be nonzero after a sudden load spike, but it should not increase constantly. The queue current count should be zero most of the time. A high current count means that the task cannot process messages fast enough. If the queue is in the request processing path, this might increase message latency. Queue Stats Recommended for Grafana describes metrics to identify queue issues.

Table 5. Queue Stats Recommended for Grafana
Metric	Type	Labels	Description
sysQueueStatsFullCount	Gauge	sysQueueStatsServiceId: service ID; queueName: the name of the queue	The number of times the queue was full.
sysQueueStatsCurrentCount	Gauge	sysQueueStatsServiceId: service ID; queueName: the name of the queue	The current number of queued messages waiting to be processed.
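
The distinction between a one-time spike and a constantly increasing full count can be sketched in Python as follows. The function names and sample series are hypothetical, and "constantly increasing" is interpreted here as rising in every sampling interval, which is one possible reading rather than a MATRIXX definition.

```python
def full_count_keeps_increasing(samples):
    """True if sysQueueStatsFullCount rose in every interval of the series."""
    return len(samples) >= 2 and all(b > a for a, b in zip(samples, samples[1:]))


def fraction_nonzero(samples):
    """Fraction of sysQueueStatsCurrentCount samples that are above zero."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if s > 0) / len(samples)


# A load spike: the full count jumps once and then stays flat.
print(full_count_keeps_increasing([0, 0, 3, 3, 3]))
# A persistent problem: the full count rises in every interval.
print(full_count_keeps_increasing([1, 2, 4, 7]))
# The current count is nonzero in one of four samples.
print(fraction_nonzero([0, 0, 5, 0]))
```

A queue whose current count is nonzero in most samples is a candidate for closer inspection, especially if it is in the request processing path.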