Out-of-Bounds Engine System Resource Metrics

Use the mtx-engine-system-health-check container in the engine-health-checker pod to monitor the system resources that the engine pods use, such as CPU, memory, and disk space, and to alert users when a monitored metric is outside its predefined bounds.

Note: The mtx-engine-system-health-check container requires that Kubernetes Metrics Server is installed in your cluster. For installation instructions, see the Kubernetes Metrics Server documentation at the Kubernetes Metrics Server GitHub web page.

The mtx-engine-system-health-check container periodically scrapes the metrics that the user lists in a boundary file and compares each value with the bounds that the user defines in the same file. If a metric value is outside its specified boundary, an error message is logged. Users can also retrieve these out-of-bounds metrics from the Prometheus server running inside the mtx-engine-system-health-check container.
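
The comparison described above can be sketched in Python as follows. This is a minimal illustration, not the container's actual implementation. The function name check_bounds is hypothetical, and treating a value that is exactly equal to a boundary as in bounds is an assumption; this page does not state whether boundaries are inclusive.

```python
from typing import Optional


def check_bounds(value: float,
                 lower: Optional[float] = None,
                 upper: Optional[float] = None) -> bool:
    """Return True when value is outside the configured bounds.

    Hypothetical helper: either bound may be omitted, and a value equal
    to a bound is treated as in bounds (an assumption).
    """
    if lower is not None and value < lower:
        return True
    if upper is not None and value > upper:
        return True
    return False


# Idle CPU of 12 against a lower boundary of 20 is out of bounds.
print(check_bounds(12, lower=20))   # True
# User CPU of 45 against an upper boundary of 60 is in bounds.
print(check_bounds(45, upper=60))   # False
```

A metric with only a lower boundary is flagged when it drops too low (for example, free memory), and a metric with only an upper boundary is flagged when it climbs too high (for example, disk used percentage).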

Users can compare values from the engine pod's Prometheus Agent and from the Kubernetes API, as listed in Table 1:
Table 1. Metrics Comparable for Out of Bounds

Location                  Metric
------------------------  ------------------------------------------------------------
CPU (Node level)          statSysInfoCpuUsgIdlePrc
                          statSysInfoCpuUsgUserPrc
                          statSysInfoCpuUsgSysPrc
Memory (Node level)       statSysInfoPhysicalMemoryTotalMb
                          statSysInfoPhysicalMemoryFreeMb
                          statSysInfoPhysicalMemoryAvailableMb
Disk (local and shared)   statSysInfoDiskTotalSizeMb
                          statSysInfoDiskAvailableMb
                          statSysInfoDiskUsedMb
                          statSysInfoDiskAvailablePct
                          statSysInfoDiskUsedPct
Common                    sysTotalMemoryPoolSizeMb (pod level shared memory allocated)
                          sysTotalMemoryPoolInUseMb (pod level shared memory used)
Kubernetes API            podCpuUsage
                          podMemoryUsage

For the disk metrics, local is fast-shared storage and shared is shared storage.


The following example boundary file for the engine pod's Prometheus Agent metrics includes all the supported metrics. Each active entry names the metric to monitor and its boundary; you can define lower boundaries and upper boundaries:
metric-boundary-prometheus-agent.yaml:
- MetricName: statSysInfoCpuUsgIdlePrc
  MetricLowerBoundary: 20
- MetricName: statSysInfoCpuUsgUserPrc
  MetricUpperBoundary: 60
- MetricName: statSysInfoCpuUsgSysPrc
  MetricUpperBoundary: 20
- MetricName: statSysInfoPhysicalMemoryFreeMb
  MetricLowerBoundary: 64
- MetricName: statSysInfoDiskAvailablePct{statSysInfoDiskIdStr="shared"}
  MetricLowerBoundary: 10
- MetricName: statSysInfoDiskAvailablePct{statSysInfoDiskIdStr="local"}
  MetricLowerBoundary: 10
- MetricName: statSysInfoDiskUsedPct{statSysInfoDiskIdStr="shared"}
  MetricUpperBoundary: 90
- MetricName: statSysInfoDiskUsedPct{statSysInfoDiskIdStr="local"}
  MetricUpperBoundary: 90

#- MetricName: statSysInfoPhysicalMemoryTotalMb
#- MetricName: statSysInfoPhysicalMemoryAvailableMb
#- MetricName: statSysInfoDiskTotalSizeMb{statSysInfoDiskIdStr="shared"}
#- MetricName: statSysInfoDiskTotalSizeMb{statSysInfoDiskIdStr="local"}
#- MetricName: statSysInfoDiskAvailableMb{statSysInfoDiskIdStr="shared"}
#- MetricName: statSysInfoDiskAvailableMb{statSysInfoDiskIdStr="local"}
#- MetricName: statSysInfoDiskUsedMb{statSysInfoDiskIdStr="shared"}
#- MetricName: statSysInfoDiskUsedMb{statSysInfoDiskIdStr="local"}
#- MetricName: sysTotalMemoryPoolSizeMb
#- MetricName: sysTotalMemoryPoolInUseMb
Note: Comment out the metrics you do not want to compare to boundaries.
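
To illustrate how entries like these could be evaluated, the following Python sketch splits a MetricName that carries a label selector (such as statSysInfoDiskUsedPct{statSysInfoDiskIdStr="local"}) into its name and labels, then checks sample values against the boundaries. Everything here is an assumption made for illustration: the helper names split_metric_name and violations, the shape of the samples dictionary, and the sample values are hypothetical, and the entries are written as Python dictionaries rather than loaded from the YAML file.

```python
import re

# Two entries that mirror the boundary file above, written as plain
# dicts so the sketch does not depend on a YAML library.
BOUNDARIES = [
    {"MetricName": "statSysInfoCpuUsgIdlePrc", "MetricLowerBoundary": 20},
    {"MetricName": 'statSysInfoDiskUsedPct{statSysInfoDiskIdStr="local"}',
     "MetricUpperBoundary": 90},
]

_SELECTOR = re.compile(r'^(?P<name>[A-Za-z_:][\w:]*)(?:\{(?P<labels>.*)\})?$')
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def split_metric_name(metric_name):
    """Split 'name{label="v"}' into (name, {label: v})."""
    match = _SELECTOR.match(metric_name.strip())
    if match is None:
        raise ValueError(f"unrecognized metric name: {metric_name!r}")
    labels = dict(_LABEL.findall(match.group("labels") or ""))
    return match.group("name"), labels


def violations(boundaries, samples):
    """Return the MetricName of every entry whose sample is out of bounds.

    samples maps (name, sorted tuple of label pairs) to a numeric value.
    """
    found = []
    for entry in boundaries:
        name, labels = split_metric_name(entry["MetricName"])
        value = samples.get((name, tuple(sorted(labels.items()))))
        if value is None:
            continue  # no sample available for this metric
        lower = entry.get("MetricLowerBoundary")
        upper = entry.get("MetricUpperBoundary")
        too_low = lower is not None and value < lower
        too_high = upper is not None and value > upper
        if too_low or too_high:
            found.append(entry["MetricName"])
    return found


# Hypothetical sample values, not real engine output.
samples = {
    ("statSysInfoCpuUsgIdlePrc", ()): 35.0,
    ("statSysInfoDiskUsedPct", (("statSysInfoDiskIdStr", "local"),)): 94.5,
}
print(violations(BOUNDARIES, samples))
```

With these sample values, only the local disk entry is reported, because 94.5 exceeds its upper boundary of 90 while the idle CPU value of 35.0 stays above its lower boundary of 20.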

The following example includes the supported metrics for a Kubernetes API boundary file:

metric-boundary-k8s.yaml:
#- MetricName: podCpuUsage
#- MetricName: podMemoryUsage

Output Information

The Prometheus server in the mtx-engine-system-health-check container listens on port 18087. To reach this port from your local machine so that you can scrape the out-of-bounds metrics and their values, run the following command:
kubectl port-forward pod/engine-health-checker-xxxxxxx -n matrixx 18087:18087

Where engine-health-checker-xxxxxxx is the pod's complete name. The xxxxxxx suffix is generated by Kubernetes to make the pod name unique.

The following is an output example:
custom_metric_out_of_boundary{containerName="ctr-1",podName="ckpt-s1e1-0",source="K8SClientApi",theMetricName="podMemoryUsage"} 2608
custom_metric_out_of_boundary{containerName="ctr-1",podName="proc-s1e1-0",source="K8SClientApi",theMetricName="podMemoryUsage"} 102596
custom_metric_out_of_boundary{containerName="ctr-1",podName="proc-s1e1-1",source="K8SClientApi",theMetricName="podMemoryUsage"} 2596
custom_metric_out_of_boundary{containerName="ctr-1",podName="publ-s1e1-0",source="K8SClientApi",theMetricName="podMemoryUsage"} 2612
custom_metric_out_of_boundary{containerName="tralb-1",podName="tralb-proc-s1e1-0",source="K8SClientApi",theMetricName="podMemoryUsage"} 333240
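
Output lines like these can be post-processed with a short script. The following Python sketch parses each line into a metric name, a label dictionary, and a numeric value. It is only an illustration: the function name parse_sample is hypothetical, and the regular expressions assume the simple line shape shown in the example above (no escaped quotes in label values and no trailing timestamp).

```python
import re

_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][\w:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line):
    """Parse one 'name{labels} value' line into (name, labels, value)."""
    match = _LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized sample line: {line!r}")
    labels = dict(_LABEL.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))


# Two lines copied from the output example above.
OUTPUT = '''
custom_metric_out_of_boundary{containerName="ctr-1",podName="ckpt-s1e1-0",source="K8SClientApi",theMetricName="podMemoryUsage"} 2608
custom_metric_out_of_boundary{containerName="tralb-1",podName="tralb-proc-s1e1-0",source="K8SClientApi",theMetricName="podMemoryUsage"} 333240
'''

for line in OUTPUT.strip().splitlines():
    name, labels, value = parse_sample(line)
    # Report which pod and metric is out of bounds, with the value.
    print(labels["podName"], labels["theMetricName"], value)
```

The podName and theMetricName labels identify which pod and which monitored metric is out of bounds, and the trailing number is the value that was scraped.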