System Notifications
System notifications are sent when the value of a variable associated with a trap moves above a defined threshold.
For traps based on counters, the SNMP agent periodically polls each specified variable and takes the difference between the previous and current values of the variable. If that difference has risen above the threshold, the SNMP agent sends a trap raising an alarm; if it has fallen back below the threshold, the agent sends a trap clearing the alarm. For traps based on gauges, the previous value is not needed; the current value is compared directly with the threshold.
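As a rough illustration of this polling logic, the following Python sketch shows how a counter-based check compares the delta between polls with a threshold, while a gauge-based check compares the current value directly. This is a minimal sketch under stated assumptions: the function names, the example variable name, and the trap-sending stub are hypothetical placeholders, not MATRIXX SNMP agent APIs.

```python
# Minimal sketch of the counter vs. gauge threshold checks described above.
# send_trap() is a hypothetical stand-in for the agent's trap sender.

def send_trap(name: str, raise_alarm: bool) -> None:
    print(f"{name}: {'raise' if raise_alarm else 'clear'} alarm")

def check_counter(name, previous, current, threshold, alarm_raised):
    """Counter-based trap: compare the difference between polls with the threshold."""
    delta = current - previous
    if delta > threshold and not alarm_raised:
        send_trap(name, raise_alarm=True)    # difference moved above the threshold
        return True
    if delta <= threshold and alarm_raised:
        send_trap(name, raise_alarm=False)   # difference dropped back below the threshold
        return False
    return alarm_raised

def check_gauge(name, current, threshold, alarm_raised):
    """Gauge-based trap: no previous value needed; compare the current value directly."""
    if current > threshold and not alarm_raised:
        send_trap(name, raise_alarm=True)
        return True
    if current <= threshold and alarm_raised:
        send_trap(name, raise_alarm=False)
        return False
    return alarm_raised

# Example polling cycle for a hypothetical counter variable.
alarm = False
previous = 1000
for current in (1400, 2100, 2200):  # three successive polls
    alarm = check_counter("exampleCounter", previous, current,
                          threshold=500, alarm_raised=alarm)
    previous = current
```

In this example the second poll raises the alarm (delta 700 exceeds the threshold of 500) and the third poll clears it, mirroring the raise/clear behavior described above.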
The following table, MIB System Notifications, describes each trap, its severity and result, and the recommended action.
| Trap | Description | Severity / Result | Action |
| --- | --- | --- | --- |
| **SNMP Agent** | | | |
| sysAgentStart, sysAgentShutdown | Indicates that the SNMP agent has started running or is shutting down. Messages are: | Major: Notifications and alarms are not sent when the SNMP agent is not running. | Restart the MATRIXX server running System Monitor. |
| **Service Restart** | | | |
| sysServiceRestart | Indicates that MTX services have been automatically restarted due to a detected software failure. Messages are: | Major: Possible service outage and degradation while the service restarts and is verified to be working correctly. | Review /var/log/mtx/mtx_debug.log to determine the cause of the failure. |
| **Server Reboot** | | | |
| sysServerReboot | Indicates that an automatic reboot of the server machine has been initiated due to a detected software failure. Messages are: | Major: Possible service outage and degradation while the server restarts. | Investigate why the pod restarted: |
| **Service Down** | | | |
| sysServiceDown | A critical system service shut down. The system is not operational. The notification contains the failed service name. Messages are: | Critical: Service disruption. | Requires immediate attention. You should: |
| **Processing Error** | | | |
| sysProcessingErrorAlert | A sysProcessingErrorAlert notification is sent when the value of sysProcessingErrors changes. The agent generates at most one sysProcessingErrors notification event in any five-second period; additional processing errors that occur within this throttling period are suppressed by the agent. An NMS should periodically check the value of sysProcessingErrors to detect any missed notification events (a sketch of this throttling behavior follows the table). Messages are: | Minor: Possible service degradation. | Investigate why the pod shut down: |
| **Diameter Performance Threshold** | | | |
| sysThresholdCrossingAlert | A sysThresholdCrossingAlert notification is sent when a monitored Diameter performance gauge, such as the average or maximum system response time, reaches or crosses a threshold value. At most one threshold crossing notification event can be generated in any monitoring interval. The initial default average response time threshold is 100 milliseconds and the maximum response time threshold is 500 milliseconds. Messages are: | Major: Possible service degradation. | Review /var/log/mtx/mtx_debug.log to determine whether any processing errors are causing the threshold crossing. Look at the hardware monitoring statistics using Splunk Reporting and directly on the pods. |
| **Cluster Node Status** | | | |
| sysClusterNodeJoined, sysClusterNodeExited | A sysClusterNodeJoined notification is sent by any member node of a cluster when it detects that a new node has joined the cluster. A sysClusterNodeExited notification is sent by any member node when it detects that a node has exited the cluster. Messages are: | Critical: Possible service degradation and disruption. | A node leaving a cluster could be the result of a node restarting or maintenance being performed. Investigate the cause of the change in cluster node status by logging onto the host and analyzing messages in the following logs: |
| **Cluster Node Service State** | | | |
| sysClusterNodeServiceUp, sysClusterNodeServiceDown | A sysClusterNodeServiceUp notification is sent by a cluster member (pod) when it is ready to process service requests (such as Diameter requests). A sysClusterNodeServiceDown notification is sent by a pod when it stops processing service requests. Messages are: | Major: Possible service degradation and disruption. | This error can occur for multiple reasons. Investigate the cause of the change in cluster node status by logging onto the host and analyzing messages in the following logs: |
| **Cluster Peer Active Error** | | | |
| sysClusterPeerActiveError | A sysClusterPeerActiveError notification is sent when both the primary and secondary clusters are in the HA ACTIVE state at the same time. The address object contains the Virtual IPv4 (VIP) address of the peer cluster. Messages are: | Critical: Service disruption. | Requires immediate attention. You should: |
| **Cluster Peer Connection State** | | | |
| sysClusterPeerConnected, sysClusterPeerDisconnected | A sysClusterPeerConnected notification is sent when a remote HA peer cluster becomes connected to, and reachable from, this cluster. A sysClusterPeerDisconnected notification is sent when a remote HA peer cluster is disconnected or unreachable from this cluster. In both cases, the address object contains the Virtual IP (VIP) address of the remote peer cluster. Messages are: | Critical: Service degradation and disruption. | Communication between the two engines was lost. This could be the result of the engine restarting or maintenance being performed. Investigate the cause of the change in cluster peer status by logging onto the host and analyzing messages in the following logs: |
| **Cluster HA Status** | | | |
| sysClusterClusterStateChange | A sysClusterClusterStateChange notification is sent when the HA state of a cluster has changed. The notification contains the ID of the cluster and the current value of its HA state. Message is: | Critical: Possible service degradation and disruption. | This could be the result of the engine restarting or maintenance being performed. Attention: If the reason is unknown, investigate immediately. Investigate the cause of the change in cluster peer status by logging onto the host and analyzing messages in the following logs: |
| sysClusterHaPeerActiveStateConflict | A sysClusterHaPeerActiveStateConflict notification is sent when both peer clusters are in the HA ACTIVE state and a split-brain condition has occurred. This could be the result of communication problems between the Traffic Routing Agent and the network. | Critical: Possible data corruption. | This trap indicates that a split-brain condition has occurred. If this happens, the TRA DR is designed to correct the problem automatically, and one of the TRA DR redundant pair resolves it. If no TRA DR is present, or for some reason it cannot correct the problem, you will see an error message like this in mtx_debug.log: In this case, MATRIXX suggests that you stop both standby engines (E2 and E3), restart engine E2 first, and then E3. |
| sysClusterHaPeerClusterIdReplayConflict | Sent when two HA STANDBY clusters have the same transaction replay source (the same HA peer cluster ID). This can happen in a three-engine environment when the network between the two standby clusters (E2 and E3) fails. In such cases, E3 changes its transaction replay source to E1, which causes the active engine (E1) to send its transactions to E3 for replay instead of continuing to send them to E2. This can cause data corruption. | Critical: Possible data corruption. | If you get this notification or see the following critical message in the mtx_debug.log file, MATRIXX suggests stopping both standby engines (E2 and E3), restarting E2, and then restarting E3. |
| **Database Memory Usage** | | | |
| txnDatabaseMemoryUsedThresholdCrossingAlert | This notification is sent when the percentage of available memory for a database crosses a configured threshold value. The value is based on the total memory allocated versus the total memory available for further allocation for that database (an illustrative calculation follows the table). Messages are: | Major: Possible service degradation and disruption. | Investigate the cause of the change: |
| **System Memory Usage** | | | |
| sysMemoryAvailableThresholdMb | This notification is sent when the amount of total system memory that is available for allocation reaches a configured value. The default value is 50 MB. | Major: Possible service disruption. | Recycle servers to reclaim memory, and possibly add more memory to the server. |
| **Route Cache Usage** | | | |
| rcUsageThresholdCross | Indicates that route-cache table usage has crossed the specified percentage threshold of the maximum number of records allowed. Note: The route-cache table is the applicable route-cache table, such as subscriber_id.db or session_id.db. Messages are: | Major: Possible service disruption. | Check the severity of the usage threshold that was crossed, and check the Route Cache watermark log messages in /var/log/mtx/mtx_debug.log. You might need to adjust the size of the Route Cache. |
| rcUsageThresholdClear | Indicates that the number of records in the route-cache table has dropped below the specified usage threshold percentage. Note: The route-cache table is the applicable route-cache table, such as subscriber_id.db or session_id.db. Messages are: | | |
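The Processing Error row above notes that the agent sends at most one sysProcessingErrorAlert in any five-second period and suppresses additional events within that window. The sketch below is an assumption-based illustration of that throttling behavior in Python; the class and method names are hypothetical and are not part of the MATRIXX agent.

```python
import time

class ThrottledErrorNotifier:
    """Illustrative five-second notification throttling, as described for
    sysProcessingErrorAlert. Hypothetical sketch, not MATRIXX agent code."""

    def __init__(self, window_seconds: float = 5.0):
        self.window_seconds = window_seconds
        self.last_sent = float("-inf")
        self.error_count = 0              # plays the role of sysProcessingErrors

    def record_error(self) -> None:
        """Count every error, but emit at most one notification per window."""
        self.error_count += 1
        now = time.monotonic()
        if now - self.last_sent >= self.window_seconds:
            self.last_sent = now
            self.send_notification()
        # Errors inside the window are counted but not notified; an NMS can
        # detect them later by comparing successive polls of error_count.

    def send_notification(self) -> None:
        print(f"sysProcessingErrorAlert sent (errors so far: {self.error_count})")

notifier = ThrottledErrorNotifier()
for _ in range(3):
    notifier.record_error()               # only the first call produces a notification
```

Because events inside the window are suppressed, an NMS should poll the underlying error counter periodically, as the table recommends, rather than relying on notifications alone.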
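The Database Memory Usage row states that the threshold is evaluated against a percentage derived from the total memory allocated versus the total memory still available for further allocation in that database. The exact formula is not given here, so the helper below is only one plausible reading of that description, with hypothetical names and values.

```python
def database_memory_available_pct(allocated_bytes: int, available_bytes: int) -> float:
    """Illustrative only: percentage of database memory still available for allocation,
    assuming the ratio is available / (allocated + available)."""
    total = allocated_bytes + available_bytes
    return 100.0 * available_bytes / total if total else 0.0

# Example: 7.5 GB already allocated, 2.5 GB still available -> 25.0% available.
print(database_memory_available_pct(7_500_000_000, 2_500_000_000))
```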