Out-of-Bounds Engine System Resource Metrics
Use the mtx-engine-system-health-check container in the engine-health-checker pod to monitor the system resources that the engine pods use, including CPU and disk space, and to alert users when a monitored metric falls outside its predefined bounds.
The mtx-engine-system-health-check container periodically scrapes the metrics that the user has selected and compares each value with the bounds that the user defined in the boundaries file. If a metric value is outside its specified boundary, an error message is logged. Users can also retrieve these out-of-bounds metrics from the Prometheus server running inside the mtx-engine-system-health-check container.
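The comparison described above can be sketched as follows. This is a minimal illustration, not the container's actual implementation: the `Boundary` class and the `is_out_of_bounds` and `check` functions are hypothetical names, and the sketch assumes that a value exactly equal to a boundary is still in bounds, which the source does not state.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Boundary:
    """One boundary entry (hypothetical type); either limit may be omitted."""
    metric_name: str
    lower: Optional[float] = None  # mirrors MetricLowerBoundary
    upper: Optional[float] = None  # mirrors MetricUpperBoundary


def is_out_of_bounds(value: float, boundary: Boundary) -> bool:
    """Return True when value is below the lower or above the upper limit.

    Assumption: a value equal to a limit counts as in bounds.
    """
    if boundary.lower is not None and value < boundary.lower:
        return True
    if boundary.upper is not None and value > boundary.upper:
        return True
    return False


def check(scraped: Dict[str, float], boundaries: List[Boundary]) -> List[str]:
    """Compare scraped values with their boundaries; one message per violation."""
    messages = []
    for b in boundaries:
        if b.metric_name not in scraped:
            continue  # this metric was not scraped in the current cycle
        value = scraped[b.metric_name]
        if is_out_of_bounds(value, b):
            messages.append(
                f"{b.metric_name}={value} is out of bounds "
                f"(lower={b.lower}, upper={b.upper})"
            )
    return messages
```

For example, with a lower boundary of 20 on statSysInfoCpuUsgIdlePrc, a scraped value of 12 produces one message, while a value of 35 produces none.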
Location | Metric |
---|---|
CPU (Node level) | statSysInfoCpuUsgIdlePrc, statSysInfoCpuUsgUserPrc, statSysInfoCpuUsgSysPrc |
Memory (Node level) | statSysInfoPhysicalMemoryTotalMb, statSysInfoPhysicalMemoryFreeMb, statSysInfoPhysicalMemoryAvailableMb |
Disk (local and shared), where local is fast-shared storage and shared is shared storage | statSysInfoDiskTotalSizeMb, statSysInfoDiskAvailableMb, statSysInfoDiskUsedMb, statSysInfoDiskAvailablePct, statSysInfoDiskUsedPct |
Common | sysTotalMemoryPoolSizeMb (pod-level shared memory allocated), sysTotalMemoryPoolInUseMb (pod-level shared memory used) |
Kubernetes API | podCpuUsage, podMemoryUsage |
The following example includes the supported metrics for a Prometheus agent boundary file:
metric-boundary-prometheus-agent.yaml:
- MetricName: statSysInfoCpuUsgIdlePrc
  MetricLowerBoundary: 20
- MetricName: statSysInfoCpuUsgUserPrc
  MetricUpperBoundary: 60
- MetricName: statSysInfoCpuUsgSysPrc
  MetricUpperBoundary: 20
- MetricName: statSysInfoPhysicalMemoryFreeMb
  MetricLowerBoundary: 64
- MetricName: statSysInfoDiskAvailablePct{statSysInfoDiskIdStr="shared"}
  MetricLowerBoundary: 10
- MetricName: statSysInfoDiskAvailablePct{statSysInfoDiskIdStr="local"}
  MetricLowerBoundary: 10
- MetricName: statSysInfoDiskUsedPct{statSysInfoDiskIdStr="shared"}
  MetricUpperBoundary: 90
- MetricName: statSysInfoDiskUsedPct{statSysInfoDiskIdStr="local"}
  MetricUpperBoundary: 90
#- MetricName: statSysInfoPhysicalMemoryTotalMb
#- MetricName: statSysInfoPhysicalMemoryAvailableMb
#- MetricName: statSysInfoDiskTotalSizeMb{statSysInfoDiskIdStr="shared"}
#- MetricName: statSysInfoDiskTotalSizeMb{statSysInfoDiskIdStr="local"}
#- MetricName: statSysInfoDiskAvailableMb{statSysInfoDiskIdStr="shared"}
#- MetricName: statSysInfoDiskAvailableMb{statSysInfoDiskIdStr="local"}
#- MetricName: statSysInfoDiskUsedMb{statSysInfoDiskIdStr="shared"}
#- MetricName: statSysInfoDiskUsedMb{statSysInfoDiskIdStr="local"}
#- MetricName: sysTotalMemoryPoolSizeMb
#- MetricName: sysTotalMemoryPoolInUseMb
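To make the layout of the boundary file concrete, the following sketch reads entries of this shape into a list. It is a hand-written reader for this flat layout only, not a general YAML parser and not the loader that the container uses. The `parse_boundaries` function is a hypothetical name, and treating lines that start with `#` as disabled metrics is an assumption.

```python
from typing import Dict, List, Union

Entry = Dict[str, Union[str, float, None]]


def parse_boundaries(text: str) -> List[Entry]:
    """Read '- MetricName: ...' entries with optional lower/upper boundaries."""
    entries: List[Entry] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or commented-out metric
        if line.startswith("- "):
            # A leading dash starts a new entry.
            entries.append({
                "MetricName": None,
                "MetricLowerBoundary": None,
                "MetricUpperBoundary": None,
            })
            line = line[2:].strip()
        if not entries:
            continue  # ignore anything before the first entry
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a key: value line
        key, value = key.strip(), value.strip()
        if key == "MetricName":
            entries[-1][key] = value
        elif key in ("MetricLowerBoundary", "MetricUpperBoundary"):
            entries[-1][key] = float(value)
    return entries


SAMPLE = '''\
- MetricName: statSysInfoCpuUsgIdlePrc
  MetricLowerBoundary: 20
- MetricName: statSysInfoDiskUsedPct{statSysInfoDiskIdStr="shared"}
  MetricUpperBoundary: 90
#- MetricName: sysTotalMemoryPoolSizeMb
'''

if __name__ == "__main__":
    for entry in parse_boundaries(SAMPLE):
        print(entry)
```

The label selector in braces stays part of the metric name, so the shared and local disks can carry separate boundaries.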
The following example includes the supported metrics for a Kubernetes API boundary file:
metric-boundary-k8s.yaml:
#- MetricName: podCpuUsage
#- MetricName: podMemoryUsage
Output Information
Forward port 18087 of the engine-health-checker pod to the local machine:
kubectl port-forward pod/engine-health-checker-xxxxxxx -n matrixx 18087:18087
Where xxxxxxx is the suffix that completes the pod's name. Kubernetes generates this value to make the pod name unique.
The following is an example of out-of-bounds metrics:
custom_metric_out_of_boundary{containerName="ctr-1",podName="ckpt-s1e1-0",source="K8SClientApi",theMetricName="podMemoryUsage"} 2608
custom_metric_out_of_boundary{containerName="ctr-1",podName="proc-s1e1-0",source="K8SClientApi",theMetricName="podMemoryUsage"} 102596
custom_metric_out_of_boundary{containerName="ctr-1",podName="proc-s1e1-1",source="K8SClientApi",theMetricName="podMemoryUsage"} 2596
custom_metric_out_of_boundary{containerName="ctr-1",podName="publ-s1e1-0",source="K8SClientApi",theMetricName="podMemoryUsage"} 2612
custom_metric_out_of_boundary{containerName="tralb-1",podName="tralb-proc-s1e1-0",source="K8SClientApi",theMetricName="podMemoryUsage"} 333240
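Output lines of this form can be processed with a short script. The sketch below parses text that is already in hand and extracts the labels and value of each custom_metric_out_of_boundary sample. The `parse_out_of_bounds` function is a hypothetical name, the sketch does not fetch anything from the pod, and the simple regular expressions assume that label values contain no escaped quotes.

```python
import re
from typing import Dict, List, Tuple

# metric_name{label="value",...} value [optional timestamp]
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}'
    r'\s+(?P<value>\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_out_of_bounds(text: str) -> List[Tuple[Dict[str, str], float]]:
    """Return (labels, value) for each custom_metric_out_of_boundary line."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != "custom_metric_out_of_boundary":
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        samples.append((labels, float(match.group("value"))))
    return samples


SAMPLE = '''\
custom_metric_out_of_boundary{containerName="ctr-1",podName="ckpt-s1e1-0",source="K8SClientApi",theMetricName="podMemoryUsage"} 2608
custom_metric_out_of_boundary{containerName="ctr-1",podName="proc-s1e1-0",source="K8SClientApi",theMetricName="podMemoryUsage"} 102596
'''

if __name__ == "__main__":
    for labels, value in parse_out_of_bounds(SAMPLE):
        print(labels["podName"], labels["theMetricName"], value)
```

Grouping the result by podName or theMetricName is then a matter of ordinary dictionary handling.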