Common Metrics Recommended for Grafana
The following common metrics are recommended for MATRIXX Engine Grafana dashboards.
Node/Pod CPU and Memory
Node and pod CPU and memory usage should stay below a threshold based on the bill of materials. To determine whether your system meets this criterion, MATRIXX Support recommends using the metrics described in Node/Pod CPU and Memory Metrics Recommended for Grafana.
Metric | Type | Labels | Description |
---|---|---|---|
node_cpu_usage_seconds_total | Custom | | The cumulative CPU time consumed by the node in core-seconds. |
node_memory_working_set_bytes | Custom | | The current working set of the node in bytes. |
pod_cpu_usage_seconds_total | Custom | pod, namespace | The cumulative CPU time consumed by the pod in core-seconds. |
pod_memory_working_set_bytes | Custom | pod, namespace | The current working set of the pod in bytes. |
For Kubernetes metrics reference, see the Kubernetes documentation.
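Because the CPU metrics are cumulative counters in core-seconds, a dashboard compares the change between two samples rather than the raw value. The following is a minimal Python sketch of that calculation, not part of MATRIXX or Kubernetes; the function names, the sample values, and the threshold fraction are hypothetical, and the real threshold comes from your bill of materials.

```python
def cpu_cores_used(prev_seconds: float, curr_seconds: float,
                   interval_seconds: float) -> float:
    """Average number of cores in use between two samples of a
    cumulative core-seconds counter such as pod_cpu_usage_seconds_total."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_seconds - prev_seconds
    if delta < 0:
        # Assumption: a decrease means the counter was reset (for example,
        # after a restart), so the current value is the usage since the reset.
        delta = curr_seconds
    return delta / interval_seconds


def exceeds_threshold(used: float, capacity: float,
                      threshold_fraction: float) -> bool:
    """True if usage is above the given fraction of capacity."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return used / capacity > threshold_fraction


# Hypothetical samples taken 30 seconds apart on a 4-core node.
cores = cpu_cores_used(100.0, 160.0, 30.0)
print(cores, exceeds_threshold(cores, capacity=4.0, threshold_fraction=0.8))
```

The same threshold check applies to the working set metrics, which are already point-in-time values in bytes and need no delta.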
Per Process CPU and Memory Usage
Per Process CPU and Memory Usage Metrics Recommended for Grafana describes metrics to help troubleshoot busy processes or memory leaks.
Metric | Type | Labels | Description |
---|---|---|---|
sysServiceSystemCpuTime | Gauge | sysServiceStatsServiceId: service ID | The service process system CPU time in OS jiffies. |
sysServiceTotalCpuTime | Gauge | sysServiceStatsServiceId: service ID | The service process total CPU time in OS jiffies. |
sysServiceUserCpuTime | Gauge | sysServiceStatsServiceId: service ID | The service process user CPU time in OS jiffies. |
sysServiceResidentSetSizeKb | Gauge | sysServiceStatsServiceId: service ID | The service process resident set size in kilobytes. |
sysServiceVirtualMemorySizeKb | Gauge | sysServiceStatsServiceId: service ID | The service process virtual memory size in kilobytes. |
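The CPU time metrics are reported in OS jiffies, so they must be converted to seconds before they can be read as a utilization figure. The sketch below, in Python, shows one way to do that and to measure resident set growth over a window. The helper names are hypothetical, and the default of 100 jiffies per second is an assumption (a common Linux value); confirm the tick rate on your own hosts.

```python
from typing import Sequence


def cpu_percent(prev_jiffies: int, curr_jiffies: int,
                interval_seconds: float,
                ticks_per_second: int = 100) -> float:
    """Percentage of one core used between two samples of a jiffies value
    such as sysServiceTotalCpuTime.

    ticks_per_second is an assumed tick rate, not a value from the source.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    cpu_seconds = (curr_jiffies - prev_jiffies) / ticks_per_second
    return 100.0 * cpu_seconds / interval_seconds


def rss_growth_kb(samples: Sequence[int]) -> int:
    """Change in sysServiceResidentSetSizeKb between the first and last sample."""
    if len(samples) < 2:
        return 0
    return samples[-1] - samples[0]


# Hypothetical samples for one service ID, taken 30 seconds apart.
print(cpu_percent(1000, 2500, 30.0))
print(rss_growth_kb([204800, 204900, 205300]))
```

A resident set size that keeps growing across long windows while load is steady is one possible sign of a leak; a single positive delta is not conclusive.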
Shared Memory Stats
Shared Memory Stats Recommended for Grafana describes metrics to show that you have enough free shared memory (sysTotalMemoryPoolSizeMb - sysTotalMemoryPoolInUseMb) for projected database growth or any run-time allocation due to a load spike.
Metric | Type | Labels | Description |
---|---|---|---|
sysTotalMemoryPoolInUseMb | Gauge | | The total size (in megabytes) of the system memory dedicated to databases, buffer pools (mtxbufs), and shared memory multi-queues that is in use. |
sysTotalMemoryPoolSizeMb | Gauge | | The total size (in megabytes) of the system memory dedicated to databases, buffer pools (mtxbufs), and shared memory multi-queues. |
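The free shared memory calculation (sysTotalMemoryPoolSizeMb - sysTotalMemoryPoolInUseMb) can be sketched in Python as follows. The function names and the projected growth figure are hypothetical; how much headroom to require depends on your own growth estimate.

```python
def free_shared_memory_mb(pool_size_mb: float, pool_in_use_mb: float) -> float:
    """Free shared memory: sysTotalMemoryPoolSizeMb - sysTotalMemoryPoolInUseMb."""
    return pool_size_mb - pool_in_use_mb


def has_headroom(pool_size_mb: float, pool_in_use_mb: float,
                 projected_growth_mb: float) -> bool:
    """True if free shared memory covers the projected growth (an assumed input)."""
    return free_shared_memory_mb(pool_size_mb, pool_in_use_mb) >= projected_growth_mb


# Hypothetical values: a 64 GB pool with 40 GB in use.
print(free_shared_memory_mb(65536, 40960))
print(has_headroom(65536, 40960, projected_growth_mb=20000))
```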
Disk Stats
Disk Stats Recommended for Grafana describes metrics to help verify that local (fast-shared) and shared storage has enough free space.
Metric | Type | Labels | Description |
---|---|---|---|
statSysInfoDiskAvailableMb | Gauge | statSysInfoDiskIdStr: either local (fast-shared) or shared | The disk available space in MB. |
statSysInfoDiskAvailablePct | Gauge | statSysInfoDiskIdStr: either local (fast-shared) or shared | The disk available space as a percentage. |
Queue Stats
The queue full count should be zero. It might be nonzero after a sudden load spike, but it should not increase constantly. The queue current count should be zero most of the time. A high current count means that the task cannot process messages fast enough; if the queue is in the request processing path, this might cause higher message latency. Queue Stats Recommended for Grafana describes metrics to identify queue issues.
Metric | Type | Labels | Description |
---|---|---|---|
sysQueueStatsFullCount | Gauge | sysQueueStatsServiceId: service ID; queueName: the name of the queue | The number of times the queue was full. |
sysQueueStatsCurrentCount | Gauge | sysQueueStatsServiceId: service ID; queueName: the name of the queue | The current number of queued messages waiting to be processed. |
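The two queue conditions described above (a full count that keeps increasing, and a current count that is nonzero too often) can be checked over a window of samples. This Python sketch shows one possible interpretation; the function names are hypothetical, and treating "constantly increasing" as "strictly greater at every sample" is an assumption that you might relax for your own alerting.

```python
from typing import Sequence


def full_count_keeps_rising(samples: Sequence[int]) -> bool:
    """True if sysQueueStatsFullCount grew at every step of the window.

    A single jump followed by a flat line (a load spike) returns False.
    """
    return len(samples) >= 2 and all(
        later > earlier for earlier, later in zip(samples, samples[1:])
    )


def nonzero_fraction(samples: Sequence[int]) -> float:
    """Fraction of sysQueueStatsCurrentCount samples that were above zero."""
    if not samples:
        return 0.0
    return sum(1 for value in samples if value > 0) / len(samples)


# Hypothetical windows for one service ID and queue name.
print(full_count_keeps_rising([0, 0, 3, 3, 3]))  # spike, then stable
print(full_count_keeps_rising([1, 2, 4, 7]))     # growing at every sample
print(nonzero_fraction([0, 0, 5, 0]))
```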