System Notifications

System notifications are sent when the value of a variable associated with a trap crosses a defined threshold.

For traps based on counters, the SNMP agent periodically polls each specified variable. It takes the difference between the previous and current values of the variable and compares that difference with the threshold. If the value has moved above the threshold, the SNMP agent sends a trap raising an alarm. If the value has moved below the threshold, the agent sends a trap clearing the alarm. For traps based on gauges, the previous value is not needed. The current value is compared with the threshold.
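
The following is an illustration of this evaluation logic only; the function names and return strings are hypothetical and are not part of the MATRIXX agent or MIB.

# Illustrative sketch of the agent-side threshold checks described above.
def check_counter(previous, current, threshold, alarm_raised):
    """Counter-based trap: compare the difference between polls with the threshold."""
    delta = current - previous
    if delta > threshold and not alarm_raised:
        return "send trap: raise alarm"
    if delta <= threshold and alarm_raised:
        return "send trap: clear alarm"
    return "no trap"

def check_gauge(current, threshold, alarm_raised):
    """Gauge-based trap: the current value alone is compared with the threshold."""
    if current > threshold and not alarm_raised:
        return "send trap: raise alarm"
    if current <= threshold and alarm_raised:
        return "send trap: clear alarm"
    return "no trap"

# A counter delta of 120 against a threshold of 100 raises an alarm;
# a later delta of 80 clears it.
print(check_counter(previous=1000, current=1120, threshold=100, alarm_raised=False))
print(check_counter(previous=1120, current=1200, threshold=100, alarm_raised=True))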

Note: When objects are quarantined, messages are written to the mtx_debug.log file on the failed pod, to the /var/log/messages file on the failed pod, and to the mtx_debug.log file on the highest-numbered processing pod.

Table 1, MIB System Notifications, describes each trap, its severity and result, and the recommended action.

Table 1. MIB System Notifications
Trap | Description | Severity / Result | Action

SNMP Agent

sysAgentStart

sysAgentShutdown

Indicates the SNMP agent has started running or is shutting down.

Messages are:
  • The agent has started running.
  • The agent is shutting down.

Major:

Notifications and alarms are not sent when the SNMP agent is not running.

Restart the MATRIXX server running System Monitor.

Service Restart

sysServiceRestart

Indicates that MTX services have been automatically restarted due to a detected software failure.

Messages are:
  • MATRIXX services have been restarted.
  • Cleared by Operator Initiated action.

Major:

Possible service outage and degradation while the service restarts and is verified to be working correctly.

Review /var/log/mtx/mtx_debug.log to determine the cause of the failure.

Server Reboot

sysServerReboot

Indicates that an automatic reboot of the server machine has been initiated due to a detected software failure.

Messages are:
  • The server machine is rebooting.
  • Cleared by Operator Initiated action.

Major:

Possible service outage and degradation while the server restarts.

Investigate why the pod restarted:
  1. Verify the system rebooted and the uptime:
    • # last reboot
    • # uptime
  2. Review the following logs to determine what occurred at the time, or just before, the system reboot:
    • /var/log/mtx/mtx_debug.log
    • /var/log/messages
    • /var/log/boot logs
  3. Escalate to <operation responsible> if required.

Service Down

sysServiceDown

A critical system service shut down. The system is not operational. The notification contains the failed service name.

Messages are:
  • The system service <service> has shut down. The system is not operational.
  • Cleared by Operator Initiated action.

Critical:

Service disruption.

Requires immediate attention.

You should:
  • Escalate to Level 3.
  • Investigate why the pod shut down by logging onto the host and looking at the following logs to determine what occurred at the time, or just before, the system shut down:
    • /var/log/mtx/mtx_debug.log
    • /var/log/messages
    • /var/log/boot logs

Processing Error

sysProcessingErrorAlert

A sysProcessingErrorAlert notification is sent when the value of sysProcessingErrors changes. The agent generates at most one such notification event in any five-second period; if additional processing errors occur within this five-second "throttling" period, the agent suppresses those notification events.

An NMS should periodically check the value of sysProcessingErrors to detect any missed sysProcessingErrors notification events.
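
A minimal sketch of such a poll-based check follows; the watch_processing_errors function, its parameters, and the canned sample values are illustrative assumptions, and the SNMP read itself is left to whatever wrapper (get_value) your NMS provides. Only the sysProcessingErrors object name comes from the MIB.

import time

def watch_processing_errors(get_value, poll_interval_s=60, polls=None):
    """Poll sysProcessingErrors and report any change between polls. A change
    means processing errors occurred even if the corresponding notification
    was suppressed by the agent's five-second throttling window.
    get_value is any callable returning the current sysProcessingErrors value,
    for example a wrapper around your SNMP library's GET operation (assumed,
    not shown here)."""
    last_seen = get_value()
    completed = 0
    while polls is None or completed < polls:
        time.sleep(poll_interval_s)
        current = get_value()
        if current != last_seen:
            print(f"sysProcessingErrors changed: {last_seen} -> {current}")
            last_seen = current
        completed += 1

# Demo with canned values standing in for real SNMP reads:
samples = iter([0, 0, 3, 3, 7])
watch_processing_errors(lambda: next(samples), poll_interval_s=0, polls=4)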

Messages are:
  • The processing error count has changed to <processing count since last restart>.
  • Cleared by Operator Initiated action.

Minor:

Possible service degradation.

Investigate the cause of the processing errors:
  1. Review /var/log/mtx/mtx_debug.log to view the processing error messages and resolve the issue.
  2. Escalate to <operation responsible> if required.

Diameter Performance Threshold

sysThresholdCrossingAlert

A sysThresholdCrossingAlert notification is sent when a monitored Diameter performance gauge, such as the average or maximum system response time, reaches or crosses a threshold value. At most one threshold crossing notification event can be generated in any monitoring interval.

The initial default average response time threshold is 100 milliseconds, and the initial default maximum response time threshold is 500 milliseconds.
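
For illustration, here is a minimal sketch of the gauge comparison and the one-alert-per-interval limit using the default values quoted above; the function and constant names are assumptions, not MATRIXX configuration parameters.

# Defaults quoted above; both thresholds are configurable.
AVG_RESPONSE_MS_THRESHOLD = 100
MAX_RESPONSE_MS_THRESHOLD = 500

def evaluate_interval(avg_response_ms, max_response_ms, alert_sent_this_interval):
    """Evaluate one monitoring interval. At most one sysThresholdCrossingAlert
    is generated per interval, so further crossings in the same interval are
    suppressed. Returns (gauge, value, threshold) when an alert would be sent."""
    if alert_sent_this_interval:
        return None
    if avg_response_ms >= AVG_RESPONSE_MS_THRESHOLD:
        return ("average response time", avg_response_ms, AVG_RESPONSE_MS_THRESHOLD)
    if max_response_ms >= MAX_RESPONSE_MS_THRESHOLD:
        return ("maximum response time", max_response_ms, MAX_RESPONSE_MS_THRESHOLD)
    return None

# An average of 120 ms crosses the 100 ms default, so one alert is reported
# for this interval.
print(evaluate_interval(avg_response_ms=120, max_response_ms=300,
                        alert_sent_this_interval=False))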

Messages are:
  • Performance gauge <Threshold ID> value <value> has crossed threshold value of <threshold value>.
  • Cleared by Operator Initiated action.

Major:

Possible service degradation.

Review /var/log/mtx/mtx_debug.log to determine whether any processing errors are causing the threshold crossing.

Check the hardware monitoring statistics using Splunk reporting and directly on the pods.

Cluster Node Status

sysClusterNodeJoined

sysClusterNodeExited

A sysClusterNodeJoined notification is sent by any member node of a cluster when it detects that a new node has joined the cluster.

A sysClusterNodeExited notification is sent by any member node of a cluster when it detects that a node has exited the cluster.

Messages are:
  • Node <node IP address> has left the cluster.
  • Node <node IP address> has joined the cluster.

Critical:

Possible service degradation and disruption.

A node leaving a cluster could be the result of a node restarting or maintenance being performed. You should investigate the cause of the change in cluster node status by logging onto the host and analyzing messages in the following logs:
  • /var/log/mtx/mtx_debug.log
  • /var/log/messages

Cluster Node Service State

sysClusterNodeServiceUp

sysClusterNodeServiceDown

A sysClusterNodeServiceUp notification is sent by a cluster member (pod) when it is ready to process service requests (such as Diameter requests).

A sysClusterNodeServiceDown notification is sent by a pod when it stops processing service requests (such as Diameter requests).

Messages are:
  • A cluster node has stopped processing requests.
  • A cluster node has started processing requests.

Major:

Possible service degradation and disruption.

This error can occur for multiple reasons. Investigate the cause of the change in cluster node status by logging onto the host and analyzing messages in the following logs:
  • /var/log/mtx/mtx_debug.log
  • /var/log/messages

Cluster Peer Active Error

sysClusterPeerActiveError

A sysClusterPeerActiveError notification is sent when both the primary and secondary clusters are in the HA ACTIVE state at the same time. The address object contains a Virtual IPv4 (VIP) address of the peer cluster.

Messages are:
  • Primary and secondary clusters are both ACTIVE. Peer management node is <Management Node IP address>.
  • Cleared by Operator Initiated action.

Critical:

Service disruption.

Requires immediate attention.

You should:
  • Escalate to Level 3.
  • Investigate the cause of the change in cluster peer status by logging onto the host and analyzing messages in the following logs:
    • /var/log/mtx/mtx_debug.log
    • /var/log/messages

Cluster Peer Connection State

sysClusterPeerConnected

sysClusterPeerDisconnected

A sysClusterPeerConnected notification is sent when a remote HA peer cluster becomes connected to, and reachable from, this cluster. The address object contains a Virtual IP (VIP) address of the remote peer cluster.

A sysClusterPeerDisconnected notification is sent when a remote HA peer cluster becomes disconnected or unreachable from this cluster. The address object contains a Virtual IP (VIP) address of the remote peer cluster.

Messages are:
  • Peer cluster <IP address> has become disconnected.
  • Peer cluster <IP address> has connected.

Critical:

Service degradation and disruption.

Communication between the two engines was lost. This could be the result of the engine restarting or maintenance being performed.

Investigate the cause of the change in cluster peer status by logging onto the host and analyzing messages in the following logs:
  • /var/log/mtx/mtx_debug.log
  • /var/log/messages

Cluster HA Status

sysClusterClusterStateChange

A sysClusterClusterStateChange notification is sent when the HA state of a cluster changes. The notification contains the cluster ID and the current HA state of the cluster.

Message is:
  • HA state of cluster <Id> has changed to <status>.

Critical:

Possible service degradation and disruption.

This could be the result of the engine restarting or maintenance being performed.

Attention: If the reason is unknown, investigate immediately.

Investigate the cause of the change in cluster HA status by logging onto the host and analyzing messages in the following logs:
  • /var/log/mtx/mtx_debug.log
  • /var/log/messages
sysClusterHaPeerActiveStateConflict

A sysClusterHaPeerActiveStateConflict notification is sent when both peer clusters are in the HA ACTIVE state, and a split-brain condition has occurred. This could have been the result of communication problems between the Traffic Routing Agent and the network.

Critical:

Possible data corruption.

This trap indicates that a split-brain condition has occurred. If this happens, the TRA DR is designed to automatically correct the problem; one of the TRA DR redundant pair resolves the conflict.
If no TRA DR is present, or for some reason it cannot correct the problem, you will see an error message like this in mtx_debug.log:
LM_CRITI 33260|33289 2015-04-18 19:20:29.143410 [cluster_manager_2:1:1:1(4526.28250)] |
FsmClusterHaStateActive::handleClockTick: peer cluster ACTIVE state conflict: peer cluster HA state=ACTIVE @cluster=1:1

In this case, MATRIXX suggests that you stop both standby engines (E2 and E3), restart engine E2 first, and then E3.
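
If you want to scan for this condition in the log directly, a minimal sketch follows. It assumes the default /var/log/mtx/mtx_debug.log path and the message text quoted above; it is an illustration, not a MATRIXX-provided tool.

# Illustrative scan of mtx_debug.log for the ACTIVE state conflict message
# shown above. LOG_PATH and PATTERN reflect the path and text in this section.
LOG_PATH = "/var/log/mtx/mtx_debug.log"
PATTERN = "peer cluster ACTIVE state conflict"

def find_active_state_conflicts(path=LOG_PATH):
    hits = []
    with open(path, errors="replace") as log:
        for line in log:
            if "LM_CRITI" in line and PATTERN in line:
                hits.append(line.rstrip())
    return hits

for hit in find_active_state_conflicts():
    print(hit)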

sysClusterHaPeerClusterIdReplayConflict

A sysClusterHaPeerClusterIdReplayConflict notification is sent when two HA STANDBY clusters have the same transaction replay source (the same HA peer cluster ID).

This can happen in a three-engine environment when the network between the two standby clusters (E2 and E3) fails. In such cases, E3 changes its transaction replay source to E1, which causes the active engine (E1) to send its transactions to that cluster (E3) to replay, instead of continuing to send them to E2. This can cause data corruption.

Critical:

Possible data corruption.

If you get this notification or see the following critical message in the mtx_debug.log file, MATRIXX suggests stopping both standby engines (E2 and E3), restarting E2, then restarting E3.
LM_CRITI 1758|1769 2016-09-15 16:19:40.092068 [snmp_agent_1:1:1:1(4751.39735)] |
SysPeerClusterTable::update: Standby clusters 3:1 and 2:1 have the same replay source id 1:1. Note: this message will not be repeated for 5 seconds
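
For illustration only, the following sketch mirrors the check described in that message: it flags any replay source that more than one standby cluster reports. The cluster IDs and mapping below are hypothetical values, not read from a real engine.

from collections import defaultdict

# Hypothetical view of the standby clusters and their replay sources after the
# network failure described above: both E2 (2:1) and E3 (3:1) replay from E1 (1:1).
standby_replay_sources = {
    "2:1": "1:1",
    "3:1": "1:1",
}

# Group standby clusters by replay source and flag any source used by more
# than one standby cluster, which is the conflict this trap reports.
clusters_by_source = defaultdict(list)
for cluster, source in standby_replay_sources.items():
    clusters_by_source[source].append(cluster)

for source, clusters in clusters_by_source.items():
    if len(clusters) > 1:
        print(f"Standby clusters {' and '.join(clusters)} have the same replay source id {source}")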

Database Memory Usage

txnDatabaseMemoryUsedThresholdCrossingAlert

This notification is sent when the percentage of available memory for a database crosses a configured threshold value. The percentage is based on the total memory allocated versus the total memory available for further allocation for that database.

Messages are:
  • Database <database name> (pool <database pool id>) is using <percentage used>% of memory.
  • Cleared by Operator Initiated action.

Major:

Possible service degradation and disruption.

Investigate the cause of the change:
  • Check the hardware monitoring statistics using Splunk reporting and directly on each pod.
  • Log onto the host and analyze messages in the following logs:
    • /var/log/mtx/mtx_debug.log
    • /var/log/messages

System Memory Usage

sysMemoryAvailableThresholdMb

This notification is sent when the amount of total system memory available for allocation drops to a configured value. The default value is 50 MB.

Major:

Possible service disruption.

Recycle servers to reclaim memory and possibly add more memory to the server.

Route Cache Usage

rcUsageThresholdCross

Indicates that route-cache table usage has crossed the specified percentage threshold of the maximum number of records allowed.

Messages are:
  • rcUsageThresholdTableName = <route-cache table>
  • rcUsageThresholdPercentage = <percentage crossed>
  • Cleared by Operator Initiated action.
Note: The route-cache table is the applicable route-cache table, such as subscriber_id.db or session_id.db.

Major:

Possible service disruption.

Check the severity of the usage threshold crossed, and check the Route Cache watermark log messages in /var/log/mtx/mtx_debug.log.

You may need to adjust the size of the Route Cache.

rcUsageThresholdClear

Indicates that the number of records in the route-cache table has dropped below the specified usage threshold percentage.

Messages are:
  • rcUsageThresholdTableName = <route-cache table>
  • rcUsageThresholdPercentage = <percentage cleared>
  • Cleared by Operator Initiated action.
Note: The route-cache table is the applicable route-cache table, such as subscriber_id.db or session_id.db.