Alert Management Setup
Perform the following tasks to set up alert management for SNMP traps using Prometheus and Alertmanager:
Note: Prometheus must already be deployed in the monitoring namespace. For more information, see the discussion about monitoring with Prometheus and Grafana.
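To verify this prerequisite, you can list the Prometheus resources and pods in that namespace. The commands below are a quick check and assume that Prometheus was installed with Prometheus Operator (for example, through the kube-prometheus-stack Helm chart):
> kubectl get prometheus --namespace monitoring
> kubectl get pods --namespace monitoring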
Enable Monitoring and SNMP Exporter
Prometheus monitoring and the SNMP exporter process must be enabled in the Helm chart, as shown in the following example configuration.
global:
  # monitoring - enable/disable prometheus monitoring and configure ServiceMonitor labels
  # note: engine/tra/network-enabler pods also require 'snmp-exporter' to be enabled
  # note: monitoring requires Prometheus-Operator to be installed separately
  monitoring:
    # enable prometheus service monitoring
    enabled: true

# SNMP Exporter
snmp-exporter:
  enabled: true
  # Default to 2 instances to support High Availability when performing a rolling update
  replicaCount: 2
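After enabling these settings, apply the updated values by upgrading the MATRIXX Helm release and confirm that the snmp-exporter pods are running. The release name, chart reference, namespace, and values file name below are placeholders; substitute the ones used in your deployment:
> helm upgrade matrixx matrixx/ --namespace matrixx -f values.yaml
> kubectl get pods --namespace matrixx | grep snmp-exporter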
Deploy matrixx-snmpnotifier
Before deploying snmp-notifier, ensure that:
- The matrixx-snmpnotifier image is available.
- The monitoring namespace is configured.
- You know the destination SNMP server host address and port, which are set with the SNMP_DESTINATION environment variable. By default, this is set to 127.0.0.1:1162.
- The snmpnotifier-values.yaml file contains the required configuration; see the sketch after this list. The SNMP version is set to V2c by default. For more information about the environment variables you can set, see the discussion about SNMP notifier environment variables.
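The following is a minimal sketch of the general shape such a values file might take. The env key and the example destination address are illustrative assumptions, not the published chart schema; verify the exact structure against the chart's own values.yaml and the SNMP notifier environment variables discussion.
# Hypothetical sketch only; key names are assumptions, not the published chart schema.
env:
  SNMP_DESTINATION: "10.0.0.50:162"   # address and port of the SNMP trap receiver
  SNMP_VERSION: "V2c"                 # SNMP version (V2c is the default)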
Deploy snmp-notifier in the monitoring namespace using the following command:
> helm install snmpnotifier snmp-notifier/ --namespace monitoring -f snmpnotifier-values.yaml
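After the installation completes, confirm that the snmp-notifier pod is running and note the service name and port; these are used in the Alertmanager webhook URL configured in the next step:
> kubectl get pods --namespace monitoring | grep snmp-notifier
> kubectl get svc --namespace monitoring | grep snmp-notifier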
Configure Prometheus Alertmanager
Configure the alert rules and OID mapping to use with Prometheus Alertmanager in the prom-stack-values.yaml configuration file.
Note: The SNMP webhook URL defined in the alertmanager.receivers section must be set to the URL of matrixx-snmpnotifier deployed in the MATRIXX cluster through the matrixx-snmpnotifier Helm chart. Use the following format:
servicename.namespace.svc.cluster.local:9464/alerts
For example:
alertmanager:
  receivers:
  - name: 'null'
  - name: 'snmp_notifier'
    webhook_configs:
    - send_resolved: true
      url: http://snmp-notifier-snmp-notifier.monitoring.svc.cluster.local:9464/alerts
The following example shows the alert rules and OID mapping configuration in the prom-stack-values.yaml file.
grafana:
  adminPassword: admin
  service:
    port: 3000
kubelet:
  serviceMonitor:
    https: false
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
## Configuration for alertmanager
## ref: https://prometheus.io/docs/alerting/alertmanager/
##
alertmanager: # Optional setting
  persistentVolume:
    storageClass: gp2
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['job']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'snmp_notifier'
      routes:
      - match:
          alertname: Watchdog
        receiver: 'null'
    receivers:
    - name: 'null'
    - name: 'snmp_notifier'
      webhook_configs:
      - send_resolved: true
        url: http://snmp-notifier-snmp-notifier.monitoring.svc.cluster.local:9464/alerts
    templates:
    - '/etc/alertmanager/config/*.tmpl'
server: # Optional settings
  persistentVolume:
    storageClass: gp2
## Provide custom recording or alerting rules to be deployed into the cluster.
##
additionalPrometheusRulesMap:
  matrixx-rules:
    groups:
    - name: matrixx-services
      rules:
      - alert: sysClusterNodeJoinedCkpt
        expr: kube_pod_container_status_running{namespace="matrixx", pod=~"ckpt.+."} == 1
        for: 5m
        labels:
          severity: "info"
          type: service
          oid: "1.3.6.1.4.1.35838.1.1.2.1.8"
        annotations:
          message: '{{ $labels.namespace }}\{{ $labels.pod }} has joined the cluster.'
      - alert: sysClusterNodeJoinedPubl
        expr: kube_pod_container_status_running{namespace="matrixx", pod=~"publ.+."} == 1
        for: 5m
        labels:
          severity: "info"
          type: service
          oid: "1.3.6.1.4.1.35838.1.1.2.1.8"
        annotations:
          message: '{{ $labels.namespace }}\{{ $labels.pod }} has joined the cluster.'
      - alert: sysClusterNodeJoinedProc
        expr: kube_pod_container_status_running{namespace="matrixx", pod=~"proc.+."} == 1
        for: 5m
        labels:
          severity: "info"
          type: service
          oid: "1.3.6.1.4.1.35838.1.1.2.1.8"
        annotations:
          message: '{{ $labels.namespace }}\{{ $labels.pod }} has joined the cluster.'
      - alert: sysClusterNodeExitedCkpt
        expr: kube_pod_container_status_terminated{namespace="matrixx", pod=~"ckpt.+."} == 1
        for: 5m
        labels:
          severity: "critical"
          type: service
          oid: "1.3.6.1.4.1.35838.1.1.2.1.9"
        annotations:
          message: '{{ $labels.namespace }}\{{ $labels.pod }} has exited the cluster.'
      - alert: sysClusterNodeExitedPubl
        expr: kube_pod_container_status_terminated{namespace="matrixx", pod=~"publ.+."} == 1
        for: 5m
        labels:
          severity: "critical"
          type: service
          oid: "1.3.6.1.4.1.35838.1.1.2.1.9"
        annotations:
          message: '{{ $labels.namespace }}\{{ $labels.pod }} has exited the cluster.'
      - alert: sysClusterNodeExitedProc
        expr: kube_pod_container_status_terminated{namespace="matrixx", pod=~"proc.+."} == 1
        for: 5m
        labels:
          severity: "critical"
          type: service
          oid: "1.3.6.1.4.1.35838.1.1.2.1.9"
        annotations:
          message: '{{ $labels.namespace }}\{{ $labels.pod }} has exited the cluster.'
      - alert: sysClusterNodeServiceUp
        expr: count (up{namespace="matrixx", container="ctr-1"}) BY (pod, service, namespace, instance) == 1
        for: 5m
        labels:
          severity: "info"
          type: service
          oid: "1.3.6.1.4.1.35838.1.1.2.1.10"
        annotations:
          message: '{{ $labels.namespace }}\{{ $labels.service }} on cluster is up.'
      - alert: sysClusterNodeServiceDown
        expr: count (up{namespace="matrixx", container="ctr-1"}) BY (pod, service, namespace, instance) < 1
        for: 5m
        labels:
          severity: "critical"
          type: service
          oid: "1.3.6.1.4.1.35838.1.1.2.1.11"
        annotations:
          message: '{{ $labels.namespace }}\{{ $labels.service }} on cluster is down.'
      - alert: sysTraNodeServiceUpAlert
        expr: count (up{namespace="matrixx", instance=~"tra.+-.+."}) BY (pod, service, namespace, instance) == 1
        for: 5m
        labels:
          severity: "info"
          type: service
          oid: "1.3.6.1.4.1.35838.1.2.1.1.4.1"
        annotations:
          message: '{{ $labels.namespace }}\{{ $labels.service }} on cluster is up.'
      - alert: sysTraNodeServiceDownAlert
        expr: count (up{namespace="matrixx", instance=~"tra.+-.+."}) BY (pod, service, namespace, instance) < 1
        for: 5m
        labels:
          severity: "critical"
          type: service
          oid: "1.3.6.1.4.1.35838.1.2.1.1.4.1"
        annotations:
          message: '{{ $labels.namespace }}\{{ $labels.service }} on cluster is down.'
      - alert: sysClusterPeerActiveError
        expr: sysPeerClusterClusterState{instance=~"proc-.+.", namespace="matrixx", sysPeerClusterEngineId="1"} == 8 AND sysPeerClusterClusterState{instance=~"proc-.+.", namespace="matrixx", sysPeerClusterEngineId="2"} == 8
        for: 5m
        labels:
          severity: "error"
          type: service
          oid: "1.3.6.1.4.1.35838.1.1.2.1.12"
        annotations:
          message: 'Both engines are in Active state.'
      - alert: sysClusterPeerConnected
        expr: sysPeerClusterClusterState{instance=~"proc-.+.", namespace="matrixx", sysPeerClusterEngineId="1"} == 8 AND sysPeerClusterClusterState{instance=~"proc-.+.", namespace="matrixx", sysPeerClusterEngineId="2"} == 6
        for: 5m
        labels:
          severity: "info"
          type: service
          oid: "1.3.6.1.4.1.35838.1.1.2.1.13"
        annotations:
          message: 'Second engine is in Standby state.'
      - alert: sysClusterPeerDisconnected
        expr: count(absent(sysPeerClusterClusterState{instance=~"proc-.+.", namespace="matrixx", sysPeerClusterEngineId="2"}) OR absent(sysPeerClusterClusterState{instance=~"publ-.+.", namespace="matrixx", sysPeerClusterEngineId="2"}) OR absent(sysPeerClusterClusterState{instance=~"ckpt-.+.", namespace="matrixx", sysPeerClusterEngineId="2"})) == 1
        for: 5m
        labels:
          severity: "critical"
          type: service
          oid: "1.3.6.1.4.1.35838.1.1.2.1.14"
        annotations:
          message: 'Peer cluster is not connected.'
      - alert: sysProcessingErrorAlert
        expr: increase(sysProcessingErrors{instance=~"proc-.*"}[5m]) > 50
        for: 5m
        labels:
          severity: "critical"
          type: service
          oid: "1.3.6.1.4.1.35838.1.4.2.1.7"
        annotations:
          message: 'Processing error threshold breached. Current count is {{ printf "%.4g" $value }} for "{{ $labels.namespace }}"\"{{ $labels.pod }}"'
      - alert: sysMemoryAvailableThresholdCrossingAlert
        expr: sysMemoryAvailableThresholdMb < 30
        for: 5m
        labels:
          severity: "warning"
          type: service
          oid: "1.3.6.1.4.1.35838.1.4.2.1.8"
        annotations:
          message: 'Available memory threshold breached. Current level is {{ printf "%.4g" $value }} MB for "{{ $labels.namespace }}"\"{{ $labels.pod }}"'
      - alert: txnDatabaseMemoryUsedThresholdCrossingAlert
        expr: 100 * ( txnDatabaseMemoryFreeKb + txnDatabaseMemoryUsedKb + txnDatabaseMemoryReclaimableKb ) / ( txnDatabaseMemoryFreeKb + txnDatabaseMemoryUsedKb + txnDatabaseMemoryReclaimableKb + (totalMemoryPoolSizeMb - totalMemoryPoolInUseMb) * 1024 ) > 75
        for: 5m
        labels:
          severity: "warning"
          type: service
          oid: "1.3.6.1.4.1.35838.1.1.2.5.1"
        annotations:
          message: 'Transaction database memory usage threshold breached. Current usage is {{ printf "%.4g" $value }}% for "{{ $labels.namespace }}"\"{{ $labels.pod }}"'
      - alert: txnGtcOutOfSyncAlert
        expr: (sum by (pod, namespace, txnReplayEngineId, txnReplayClusterId) (txnReplayCurrentGlobalTxnCounter)) - (sum by (pod, namespace, txnReplayEngineId, txnReplayClusterId) (txnReplayLastReplayGlobalTxnCounter)) > 100000
        for: 5m
        labels:
          severity: "critical"
          type: service
          oid: "1.3.6.1.4.1.35838.1.1.2.5.3"
        annotations:
          message: 'GTC value gap threshold breached. Current count is {{ printf "%.4g" $value }} for "{{ $labels.namespace }}"\"{{ $labels.pod }}" txnReplayEngineId::txnReplayClusterId: "{{ $labels.txnReplayEngineId }}"::"{{ $labels.txnReplayClusterId }}"'
      - alert: MemoryUsageAlert
        expr: sysTotalMemoryPoolInUseMb / sysTotalMemoryPoolSizeMb * 100 > 60
        for: 10m
        labels:
          severity: "warning"
          type: service
          oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
        annotations:
          message: 'Usage of system memory dedicated to databases and buffer pools (mtx bufs) on {{ $labels.instance }} is higher than 60%. Current usage is at {{ printf "%.4g" $value }}%.'
      - alert: EngineMemoryUsageAlert
        expr: 100 * (1 - ((avg_over_time(statSysInfoPhysicalMemoryFreeMb[24h]) + avg_over_time(statSysInfoPhysicalMemoryCachedMb[24h]) + avg_over_time(statSysInfoPhysicalMemoryBuffersMb[24h])) / avg_over_time(statSysInfoPhysicalMemoryTotalMb[24h]))) > 60
        for: 10m
        labels:
          severity: "warning"
          oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
        annotations:
          message: 'Engine memory usage on instance {{ $labels.instance }} is high. Current usage is at {{ printf "%.4g" $value }}%.'
      - alert: EngineDiskUsageHighAlert
        expr: statSysInfoDiskAvailablePct < 60
        for: 5m
        labels:
          severity: "warning"
          oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
        annotations:
          message: 'Disk usage is high on {{ $labels.pod }}. Currently {{ printf "%.4g" $value }}% is available.'
      - alert: NodeHeartbeatMsgLostAlert
        expr: 100 * (sysClusterNodeHeartbeatMsgReceivedCount / sysClusterNodeHeartbeatMsgSentCount) < 99
        for: 5m
        labels:
          severity: "critical"
          oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
        annotations:
          message: 'System is losing node heartbeat messages. Current percentage loss is: {{ printf "%.4g" $value }}% for service {{ $labels.service }}.'
          summary: Node heartbeat message lost alert.
      - alert: EngineOneStateAlert
        expr: avg(sysPeerClusterClusterState{instance=~"proc-s1.*", sysPeerClusterClusterId="1", sysPeerClusterEngineId="1"}) != 8 and avg(sysPeerClusterClusterState{instance=~"proc-s1.*", sysPeerClusterClusterId="1", sysPeerClusterEngineId="1"}) != 6
        for: 5m
        labels:
          severity: "critical"
          oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
        annotations:
          message: 'Engine 1 is not in Active or Standby state.'
      - alert: EngineTwoStateAlert
        expr: avg(sysPeerClusterClusterState{instance=~"proc-s1.*", sysPeerClusterClusterId="1", sysPeerClusterEngineId="2"}) != 8 and avg(sysPeerClusterClusterState{instance=~"proc-s1.*", sysPeerClusterClusterId="1", sysPeerClusterEngineId="2"}) != 6
        for: 5m
        labels:
          severity: "critical"
          oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
        annotations:
          message: 'Engine 2 is not in Active or Standby state.'
      - alert: SiteStatusAlert
        #expr: count(absent(sysClusterEngineActiveDateTime{subdomain="subdomain-1S"}) OR absent(sysClusterEngineActiveDateTime{subdomain="subdomain-2S"}) OR absent(sysClusterEngineActiveDateTime{subdomain="subdomain-3S"})) == 3
        expr: count(absent(sysClusterEngineActiveDateTime{pod=~"ckpt-.+."}) OR absent(sysClusterEngineActiveDateTime{pod=~"publ-.+."}) OR absent(sysClusterEngineActiveDateTime{pod=~"proc-.+."})) == 3
        for: 5m
        labels:
          severity: "critical"
          oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
        annotations:
          message: 'Site is down.'
      - alert: SecondaryEngineNotInStandbyAlert
        expr: sysPeerClusterClusterState{instance=~"proc-.+.", namespace="matrixx", sysPeerClusterEngineId="1"} == 8 AND sysPeerClusterClusterState{instance=~"proc-.+.", namespace="matrixx", sysPeerClusterEngineId="2"} != 6
        for: 5m
        labels:
          severity: "critical"
          oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
        annotations:
          message: 'Second engine is not in Standby state.'
      - alert: SystemCpuUsageAlert
        expr: rate(system_cpu_usage[5m]) * 100 > 80
        for: 5m
        labels:
          severity: "critical"
          oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
        annotations:
          message: 'CPU usage for {{ $labels.application }} is critical. Current percentage is: {{ printf "%.4g" $value }}%.'
          summary: CPU usage critical alert.
      - alert: TransactionThresholdAlert
        expr: increase(txnMsgCount[5m]) > 500
        for: 5m
        labels:
          severity: "warning"
          oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
        annotations:
          message: 'More than 500 transactions occurred within a 5-minute period.'
      - alert: ActiveMQStatusAlert
        expr: org_apache_activemq_Broker_Active != 1
        for: 5m
        labels:
          severity: "critical"
          oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
        annotations:
          message: 'ActiveMQ is down.'
      - alert: diamConnectionStatsReceivedErrors
        expr: increase(diamConnectionStatsReceivedErrorCount[5m]) > 10
        for: 5m
        labels:
          severity: "warning"
          oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
        annotations:
          message: 'More than 10 errors encountered within 5 minutes while reading Diameter data.'
      - alert: diamConnectionStatsSentError
        expr: increase(diamConnectionStatsSentErrorCount[5m]) > 10
        for: 5m
        labels:
          severity: "warning"
          oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
        annotations:
          message: 'More than 10 errors encountered within 5 minutes while sending Diameter data.'
      - alert: diamReceivedErrorLimit
        expr: 100 * ( diamConnectionStatsReceivedErrorCount / diamConnectionStatsReceivedMsgCount ) > 50
        for: 5m
        labels:
          severity: "warning"
          oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
        annotations:
          message: 'Received error threshold reached.'
      - alert: diamSentErrorLimit
        expr: 100 * ( diamConnectionStatsSentErrorCount / diamConnectionStatsSentMsgCount ) > 50
        for: 5m
        labels:
          severity: "warning"
          oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
        annotations:
          message: 'Sent error threshold reached.'
      - alert: GatewayProxyFailureAlert
        expr: 100 * (mtx_proxy_error_count_total / mtx_proxy_request_count_total) > 50
        for: 5m
        labels:
          severity: "critical"
          oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
        annotations:
          message: 'Gateway Proxy error threshold reached. Error rate is more than 50%.'
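Apply the updated configuration by upgrading the Prometheus stack release and confirm that the MATRIXX rules are loaded. The release name prom-stack and the prometheus-community/kube-prometheus-stack chart reference below are assumptions based on this example; substitute the release and chart used in your environment:
> helm upgrade prom-stack prometheus-community/kube-prometheus-stack --namespace monitoring -f prom-stack-values.yaml
> kubectl get prometheusrules --namespace monitoring | grep matrixx-rules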