Alert Management Setup

Perform the following tasks to set up alert management using Prometheus and Alertmanager for SNMP traps:

Note: Prometheus must already be deployed in the monitoring namespace. For more information, see the discussion about monitoring with Prometheus and Grafana.
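To confirm that Prometheus is running in the monitoring namespace, list the pods there, for example:
> kubectl get pods --namespace monitoring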

Enable Monitoring and SNMP Exporter

Prometheus monitoring and the SNMP exporter process must be enabled in the Helm chart as shown in the following example configuration.
global:
  # monitoring - enable/disable prometheus monitoring and configure ServiceMonitor labels
  # note: engine/tra/network-enabler pods also require 'snmp-exporter' to be enabled
  # note: monitoring requires Prometheus-Operator to be installed separately
  monitoring:
    # enable prometheus service monitoring
    enabled: true
 
# SNMP Exporter
snmp-exporter:
  enabled: true
  # Default to 2 instances to support High Availability when performing a rolling update
  replicaCount: 2
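For example, applying the updated values to an existing MATRIXX Helm release might look like the following command, where the release name matrixx, the chart reference, and the values file name matrixx-values.yaml are placeholders for your own deployment.
> helm upgrade matrixx <matrixx-chart> --namespace matrixx -f matrixx-values.yaml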

Deploy matrixx-snmpnotifier

Before deploying snmp-notifier, ensure that:
  • The matrixx-snmpnotifier image is available.
  • The monitoring namespace is configured.
  • You know the destination SNMP server host address, which is set with the SNMP_DESTINATION environment variable. This is set to 127.0.0.1:1162 by default.
  • The snmpnotifier-values.yaml file contains the required configuration, as shown in the example after this list. The SNMP version is set to V2c by default. For more information about the environment variables you can set, see the discussion about SNMP notifier environment variables.
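The following is a minimal sketch of a snmpnotifier-values.yaml file. The exact keys depend on the matrixx-snmpnotifier Helm chart, so the image and env key names shown here are assumptions; check them against the chart's default values.
image:
  repository: <registry>/matrixx-snmpnotifier   # assumed key: registry path to the matrixx-snmpnotifier image
  tag: <version>
env:
  # Destination SNMP trap receiver (host:port); overrides the 127.0.0.1:1162 default.
  SNMP_DESTINATION: "snmp-server.example.com:1162"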
Deploy snmp-notifier in the monitoring namespace using the following command:
> helm install snmpnotifier snmp-notifier/ --namespace monitoring -f snmpnotifier-values.yaml
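To verify the deployment, check the release status, for example:
> helm status snmpnotifier --namespace monitoring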

Configure Prometheus Alertmanager

Configure the alert rules and OID mapping to use with Prometheus Alertmanager in the prom-stack-values.yaml configuration file.
Note: The SNMP webhook URL defined in the alertmanager.receivers section must be set to the URL of the matrixx-snmpnotifier service deployed in the MATRIXX cluster through the matrixx-snmpnotifier Helm chart. Use the format http://servicename.namespace.svc.cluster.local:9464/alerts. For example:
alertmanager:
   receivers:
      - name: 'null'
      - name: 'snmp_notifier'
        webhook_configs:
          - send_resolved: true
            url: http://snmp-notifier-snmp-notifier.monitoring.svc.cluster.local:9464/alerts
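If you are unsure of the service name or port to use in the webhook URL, list the services in the monitoring namespace, for example:
> kubectl get svc --namespace monitoring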
The following example shows alert rules and OID mapping configuration in the prom-stack-values.yaml file.
grafana:
  adminPassword: admin
  service:
    port: 3000
kubelet:
  serviceMonitor:
    https: false
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
 
## Configuration for alertmanager
## ref: https://prometheus.io/docs/alerting/alertmanager/
##
alertmanager:       # Optional setting
  persistentVolume:
    storageClass: gp2
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['job']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'snmp_notifier'
      routes:
        - match:
            alertname: Watchdog
          receiver: 'null'
    receivers:
      - name: 'null'
      - name: 'snmp_notifier'
        webhook_configs:
          - send_resolved: true
            url: http://snmp-notifier-snmp-notifier.monitoring.svc.cluster.local:9464/alerts
    templates:
      - '/etc/alertmanager/config/*.tmpl'
 
server:            # Optional settings 
  persistentVolume:
    storageClass: gp2
 
## Provide custom recording or alerting rules to be deployed into the cluster.
##
additionalPrometheusRulesMap:
  matrixx-rules:
    groups:
    - name: matrixx-services
      rules:
        - alert: sysClusterNodeJoinedCkpt
          expr: kube_pod_container_status_running{namespace="matrixx", pod=~"ckpt.+."} == 1
          for: 5m
          labels:
            severity: "info"
            type: service
            oid: "1.3.6.1.4.1.35838.1.1.2.1.8"
          annotations:
            message: '{{ $labels.namespace }}\{{ $labels.pod }} has joined the cluster.'
 
        - alert: sysClusterNodeJoinedPubl
          expr: kube_pod_container_status_running{namespace="matrixx", pod=~"publ.+."} == 1
          for: 5m
          labels:
            severity: "info"
            type: service
            oid: "1.3.6.1.4.1.35838.1.1.2.1.8"
          annotations:
            message: '{{ $labels.namespace }}\{{ $labels.pod }} has joined the cluster.'
 
        - alert: sysClusterNodeJoinedProc
          expr: kube_pod_container_status_running{namespace="matrixx", pod=~"proc.+."} == 1
          for: 5m
          labels:
            severity: "info"
            type: service
            oid: "1.3.6.1.4.1.35838.1.1.2.1.8"
          annotations:
            message: '{{ $labels.namespace }}\{{ $labels.pod }} has joined the cluster.'
 
        - alert: sysClusterNodeExitedCkpt
          expr: kube_pod_container_status_terminated{namespace="matrixx", pod=~"ckpt.+."} == 1
          for: 5m
          labels:
            severity: "critical"
            type: service
            oid: "1.3.6.1.4.1.35838.1.1.2.1.9"
          annotations:
            message: '{{ $labels.namespace }}\{{ $labels.pod }} has exited the cluster.'
 
        - alert: sysClusterNodeExitedPubl
          expr: kube_pod_container_status_terminated{namespace="matrixx", pod=~"publ.+."} == 1
          for: 5m
          labels:
            severity: "critical"
            type: service
            oid: "1.3.6.1.4.1.35838.1.1.2.1.9"
          annotations:
            message: '{{ $labels.namespace }}\{{ $labels.pod }} has exited the cluster.'
 
        - alert: sysClusterNodeExitedProc
          expr: kube_pod_container_status_terminated{namespace="matrixx", pod=~"proc.+."} == 1
          for: 5m
          labels:
            severity: "critical"
            type: service
            oid: "1.3.6.1.4.1.35838.1.1.2.1.9"
          annotations:
            message: '{{ $labels.namespace }}\{{ $labels.pod }} has exited the cluster.'
 
        - alert: sysClusterNodeServiceUp
          expr: count (up{namespace="matrixx", container="ctr-1"}) BY (pod,service, namespace, instance) == 1
          for: 5m
          labels:
            severity: "info"
            type: service
            oid: "1.3.6.1.4.1.35838.1.1.2.1.10"
          annotations:
            message: '{{ $labels.namespace }}\{{$labels.service}} on cluster is up.'
 
        - alert: sysClusterNodeServiceDown
          expr: count (up{namespace="matrixx", container="ctr-1"}) BY (pod,service, namespace, instance) < 1
          for: 5m
          labels:
            severity: "critical"
            type: service
            oid: "1.3.6.1.4.1.35838.1.1.2.1.11"
          annotations:
            message: '{{ $labels.namespace }}\{{$labels.service}} on cluster is Down.'
 
        - alert: sysTraNodeServiceUpAlert
          expr: count (up{namespace="matrixx", instance=~"tra.+-.+."}) BY (pod,service, namespace, instance) == 1
          for: 5m
          labels:
            severity: "info"
            type: service
            oid: "1.3.6.1.4.1.35838.1.2.1.1.4.1"
          annotations:
            message: '{{ $labels.namespace }}\{{$labels.service}} on cluster is up.'
 
        - alert: sysTraNodeServiceDownAlert
          expr: count (up{namespace="matrixx", instance=~"tra.+-.+."}) BY (pod,service, namespace, instance) < 1
          for: 5m
          labels:
            severity: "critical"
            type: service
            oid: "1.3.6.1.4.1.35838.1.2.1.1.4.1"
          annotations:
            message: '{{ $labels.namespace }}\{{$labels.service}} on cluster is Down.'
 
        - alert: sysClusterPeerActiveError
          expr: sysPeerClusterClusterState{instance=~"proc-.+.", namespace="matrixx", sysPeerClusterEngineId="1"} == 8 AND sysPeerClusterClusterState{instance=~"proc-.+.", namespace="matrixx", sysPeerClusterEngineId="2"} == 8
          for: 5m
          labels:
            severity: "error"
            type: service
            oid: "1.3.6.1.4.1.35838.1.1.2.1.12"
          annotations:
            message: 'Both engines are in Active state.'
 
        - alert: sysClusterPeerConnected
          expr: sysPeerClusterClusterState{instance=~"proc-.+.", namespace="matrixx", sysPeerClusterEngineId="1"} == 8 AND sysPeerClusterClusterState{instance=~"proc-.+.", namespace="matrixx", sysPeerClusterEngineId="2"} == 6
          for: 5m
          labels:
            severity: "info"
            type: service
            oid: "1.3.6.1.4.1.35838.1.1.2.1.13"
          annotations:
            message: 'Second Engine is in Standby state.'
 
        - alert: sysClusterPeerDisconnected
          expr: count(absent(sysPeerClusterClusterState{instance=~"proc-.+.", namespace="matrixx", sysPeerClusterEngineId="2"}) OR absent(sysPeerClusterClusterState{instance=~"publ-.+.", namespace="matrixx", sysPeerClusterEngineId="2"}) OR absent(sysPeerClusterClusterState{instance=~"ckpt-.+.", namespace="matrixx", sysPeerClusterEngineId="2"})) == 1
          for: 5m
          labels:
            severity: "critical"
            type: service
            oid: "1.3.6.1.4.1.35838.1.1.2.1.14"
          annotations:
            message: 'Peer Cluster is not Connected.'
 
        - alert: sysProcessingErrorAlert
          expr: increase(sysProcessingErrors{instance=~"proc-.*"}[5m]) > 50
          for: 5m
          labels:
            severity: "critical"
            type: service
            oid: "1.3.6.1.4.1.35838.1.4.2.1.7"
          annotations:
            message: 'Processing Error threshold breached. Current count is {{ printf "%.4g" $value }} for "{{ $labels.namespace }}"\"{{ $labels.pod }}"'
 
        - alert: sysMemoryAvailableThresholdCrossingAlert
          expr: sysMemoryAvailableThresholdMb  <  30
          for: 5m
          labels:
            severity: "warning"
            type: service
            oid: "1.3.6.1.4.1.35838.1.4.2.1.8"
          annotations:
            message: 'Available memory threshold breached. Current level is {{ printf "%.4g" $value }} MB for "{{ $labels.namespace }}"\"{{ $labels.pod }}"'
 
        - alert: txnDatabaseMemoryUsedThresholdCrossingAlert
          expr: 100 * ( txnDatabaseMemoryFreeKb + txnDatabaseMemoryUsedKb + txnDatabaseMemoryReclaimableKb ) / ( txnDatabaseMemoryFreeKb + txnDatabaseMemoryUsedKb + txnDatabaseMemoryReclaimableKb + (totalMemoryPoolSizeMb - totalMemoryPoolInUseMb) * 1024 ) > 75
          for: 5m
          labels:
            severity: "warning"
            type: service
            oid: "1.3.6.1.4.1.35838.1.1.2.5.1"
          annotations:
            message: 'Transaction database memory usage threshold breached. Current usage is {{ printf "%.4g" $value }}% for "{{ $labels.namespace }}"\"{{ $labels.pod }}"'
 
        - alert: txnGtcOutOfSyncAlert
          expr: (sum by (pod, namespace, txnReplayEngineId, txnReplayClusterId) (txnReplayCurrentGlobalTxnCounter)) - (sum by (pod, namespace, txnReplayEngineId, txnReplayClusterId) (txnReplayLastReplayGlobalTxnCounter)) > 100000
          for: 5m
          labels:
            severity: "critical"
            type: service
            oid: "1.3.6.1.4.1.35838.1.1.2.5.3"
          annotations:
            message: 'GTC value gap threshold breached. Current gap is {{ printf "%.4g" $value }} for "{{ $labels.namespace }}"\"{{ $labels.pod }}" txnReplayEngineId::txnReplayClusterId: "{{ $labels.txnReplayEngineId }}"::"{{ $labels.txnReplayClusterId }}"'
 
        - alert: MemoryUsageAlert
          expr: sysTotalMemoryPoolInUseMb/sysTotalMemoryPoolSizeMb * 100 > 60
          for: 10m
          labels:
            severity: "warning"
            type: service
            oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
          annotations:
            message: 'Usage of system memory dedicated to databases and buffer pools (mtx bufs) on {{ $labels.instance }} is higher than 60%. Current usage is at {{ printf "%.4g" $value }}%.'
 
        - alert: EngineMemoryUsageAlert
          expr: 100 *(1 - ((avg_over_time(statSysInfoPhysicalMemoryFreeMb[24h]) + avg_over_time(statSysInfoPhysicalMemoryCachedMb[24h]) + avg_over_time(statSysInfoPhysicalMemoryBuffersMb[24h])) / avg_over_time(statSysInfoPhysicalMemoryTotalMb[24h]))) > 60
          for: 10m
          labels:
            severity: "warning"
            oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
          annotations:
            message: 'Engine memory usage on instance {{ $labels.instance }} is high. Current usage is at {{ printf "%.4g" $value }}%.'
 
        - alert: EngineDiskUsageHighAlert
          expr: statSysInfoDiskAvailablePct < 60
          for: 5m
          labels:
            severity: "warning"
            oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
          annotations:
            message: 'Disk usage is high on {{ $labels.pod }}. Currently {{ printf "%.4g" $value }}% is available.'
 
        - alert: NodeHeartbeatMsgLostAlert
          expr: 100 *(sysClusterNodeHeartbeatMsgReceivedCount / sysClusterNodeHeartbeatMsgSentCount) < 99
          for: 5m
          labels:
            severity: "critical"
            oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
          annotations:
            message: 'System is losing node heartbeat messages. Current percentage loss is {{ printf "%.4g" $value }}% for service {{ $labels.service }}.'
            summary: Node heartbeat message loss alert.
 
        - alert: EngineOneStateAlert
          expr: avg(sysPeerClusterClusterState{instance=~"proc-s1.*", sysPeerClusterClusterId="1",sysPeerClusterEngineId="1"}) != 8 and avg(sysPeerClusterClusterState{instance=~"proc-s1.*", sysPeerClusterClusterId="1",sysPeerClusterEngineId="1"}) != 6
          for: 5m
          labels:
            severity: "critical"
            oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
          annotations:
            message: 'Engine 1 is not in Active or Standby state.'
 
        - alert: EngineTwoStateAlert
          expr: avg(sysPeerClusterClusterState{instance=~"proc-s1.*", sysPeerClusterClusterId="1",sysPeerClusterEngineId="2"}) != 8 and avg(sysPeerClusterClusterState{instance=~"proc-s1.*", sysPeerClusterClusterId="1",sysPeerClusterEngineId="2"}) != 6
          for: 5m
          labels:
            severity: "critical"
            oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
          annotations:
            message: 'Engine 2 is not in Active or Standby state.'
 
        - alert: SiteStatusAlert
          #expr: count(absent(sysClusterEngineActiveDateTime{subdomain="subdomain-1S"}) OR absent(sysClusterEngineActiveDateTime{subdomain="subdomain-2S"}) OR absent(sysClusterEngineActiveDateTime{subdomain="subdomain-3S"})) == 3
          expr: count(absent(sysClusterEngineActiveDateTime{pod=~"ckpt-.+."}) OR absent(sysClusterEngineActiveDateTime{pod=~"publ-.+."}) OR absent(sysClusterEngineActiveDateTime{pod=~"proc-.+."})) == 3
          for: 5m
          labels:
            severity: "critical"
            oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
          annotations:
            message: 'Site is Down.'
 
        - alert: SecondaryEngineNotInStandbyAlert
          expr: sysPeerClusterClusterState{instance=~"proc-.+.", namespace="matrixx", sysPeerClusterEngineId="1"} == 8 AND sysPeerClusterClusterState{instance=~"proc-.+.", namespace="matrixx", sysPeerClusterEngineId="2"} != 6
          for: 5m
          labels:
            severity: "critical"
            oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
          annotations:
            message: 'Second Engine is not in Standby state.'
 
        - alert: SystemCpuUsageAlert
          expr: rate(system_cpu_usage[5m]) * 100 > 80
          for: 5m
          labels:
            severity: "critical"
            oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
          annotations:
            message: 'CPU usage for {{ $labels.application }} is Critical. Current percentage is: {{ printf "%.4g" $value }}%.'
            summary: CPU Usage Critical Alert.
 
        - alert: TransactionThresholdAlert
          expr: increase(txnMsgCount[5m]) > 500
          for: 5m
          labels:
            severity: "warning"
            oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
          annotations:
            message: 'More than 500 transactions occurred within a 5-minute period.'
 
        - alert: ActiveMQStatusAlert
          expr: org_apache_activemq_Broker_Active != 1
          for: 5m
          labels:
            severity: "critical"
            oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
          annotations:
            message: 'ActiveMQ is down.'
 
        - alert: diamConnectionStatsReceivedErrors
          expr: increase(diamConnectionStatsReceivedErrorCount[5m]) > 10
          for: 5m
          labels:
            severity: "warning"
            oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
          annotations:
            message: 'More than 10 errors encountered within 5 minutes while reading Diameter data.'
 
        - alert: diamConnectionStatsSentError
          expr: increase(diamConnectionStatsSentErrorCount[5m]) > 10
          for: 5m
          labels:
            severity: "warning"
            oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
          annotations:
            message: 'More than 10 errors encountered within 5 minutes while sending Diameter data.'
 
        - alert: diamReceivedErrorLimit
          expr: 100 * ( diamConnectionStatsReceivedErrorCount / diamConnectionStatsReceivedMsgCount ) > 50
          for: 5m
          labels:
            severity: "warning"
            oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
          annotations:
            message: 'Received error threshold reached.'
 
        - alert: diamSentErrorLimit
          expr: 100 * ( diamConnectionStatsSentErrorCount /  diamConnectionStatsSentMsgCount) > 50
          for: 5m
          labels:
            severity: "warning"
            oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
          annotations:
            message: 'Sent error threshold reached.'
 
        - alert: GatewayProxyFailureAlert
          expr: 100 * (mtx_proxy_error_count_total / mtx_proxy_request_count_total) > 50
          for: 5m
          labels:
            severity: "critical"
            oid: "1.3.6.1.4.1.35838.1.4.2.1.9"
          annotations:
            message: 'Gateway Proxy error threshold reached. Error rate is more than 50%.'
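After updating prom-stack-values.yaml, apply it to the Prometheus deployment. As a sketch, assuming Prometheus was installed from the kube-prometheus-stack chart as a release named prometheus (both names are assumptions; substitute your own release and chart), the upgrade might look like the following command.
> helm upgrade prometheus prometheus-community/kube-prometheus-stack --namespace monitoring -f prom-stack-values.yaml
Once the updated rules are loaded, firing alerts routed to the snmp_notifier receiver are posted to the snmp-notifier webhook and forwarded as SNMP traps to the configured destination.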