Using a Generic SNMP Trap

Any process can send a generic SNMP trap, not only the standard Process Controller, Cluster Manager, SNMP agent, or TRA. The trap carries error information, with specific alerts that map back to specific error numbers. Use the sysGenericErrorMessage SNMP trap to send a system-level alert with message text in the payload.

Generic Trap Locations

Configure the trap generation period in mtx_config.xml:

<snmp_agent>
  <trap_generate_period_msec>10000</trap_generate_period_msec>
</snmp_agent>

Change this value to control how often a trap can be generated; the default is 10000 milliseconds (10 seconds). A trap of the same type is generated at most once in this period.
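The once-per-period behavior can be pictured with a small throttle. This is a minimal sketch only, not the SNMP agent's implementation: the class name TrapThrottle, the use of the trap name as the "type" key, and the exact handling of the period boundary are all assumptions made for illustration.

```python
import time


class TrapThrottle:
    """Hypothetical helper: allow one trap per type per period."""

    def __init__(self, period_msec=10000, clock=time.monotonic):
        self.period_sec = period_msec / 1000.0
        self.clock = clock
        self._last_sent = {}  # trap type -> time the last trap was allowed

    def should_send(self, trap_type):
        """Return True if no trap of this type was allowed within the period."""
        now = self.clock()
        last = self._last_sent.get(trap_type)
        if last is not None and now - last < self.period_sec:
            return False  # same type seen too recently, suppress it
        self._last_sent[trap_type] = now
        return True


# Usage with a fake clock so the example does not need to sleep.
fake_now = [0.0]
throttle = TrapThrottle(period_msec=10000, clock=lambda: fake_now[0])

first = throttle.should_send("sysGenericErrorMessage")   # True
fake_now[0] = 5.0
second = throttle.should_send("sysGenericErrorMessage")  # False, only 5 s elapsed
fake_now[0] = 10.0
third = throttle.should_send("sysGenericErrorMessage")   # True, period elapsed
```

A shorter period means repeated errors surface sooner at the cost of more traps; a longer period reduces noise but can hide how often an error recurs.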

For more information about mtx_config.xml, see the discussion about MATRIXX configuration specification (mtx_config.xml) in MATRIXX Installation and Upgrade.
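To check which period is currently configured, the value can be read with a standard XML parser. The sketch below assumes the snmp_agent element appears somewhere in the file as shown above; the enclosing config root element in the sample and the function name read_trap_period_msec are hypothetical, and the fallback uses the documented 10000 ms default.

```python
import xml.etree.ElementTree as ET

DEFAULT_PERIOD_MSEC = 10000  # documented default (10 seconds)


def read_trap_period_msec(xml_text):
    """Return trap_generate_period_msec from the first snmp_agent element that sets it."""
    root = ET.fromstring(xml_text)
    # iter() also yields the root itself if its tag matches.
    for agent in root.iter("snmp_agent"):
        value = agent.findtext("trap_generate_period_msec")
        if value is not None and value.strip():
            return int(value.strip())
    return DEFAULT_PERIOD_MSEC


# Sample input; the <config> wrapper is an assumption for illustration only.
sample = """<config>
  <snmp_agent>
    <trap_generate_period_msec>30000</trap_generate_period_msec>
  </snmp_agent>
</config>"""

period = read_trap_period_msec(sample)
```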

SNMP uses the generic trap locations listed in Table 1.

Table 1. Generic Trap Locations
Component/Module Task::Function Message Action
Component/Module: MtxChrg
Task::Function: AbortThreadData.onThreadTimeout
Message: Thread linuxThreadId_ has exceeded threadQuarantineTimeoutInMillis_ ms timeout while processing message. Placing server into quarantine.
Example:
2022-06-10 23:46:51.213401 FROM localhost:
------------------------------------------
DISMAN-EVENT-MIB::sysUpTimeInstance      = 0:0:01:28.69
SNMPv2-MIB::snmpTrapOID                  = MATRIXX-COMMON-MIB::sysGenericErrorMessage
MATRIXX-COMMON-MIB::sysGenericErrorText  = b'Thread 23953 has exceeded 2000ms timeout while processing message. Placing blade into quarantine.'
Action: Check messages/OIDs in the log on the pod for latency issues.

Message: Quarantining thread linuxThreadId_ would exceed server limit of threadQuarantineLimit_ quarantined threads. Terminating server.
Action: Check messages/OIDs in the log on the pod for latency issues and check system health.
Component/Module: MtxEventLoader
Task::Function: EventLoaderDispatcherTask::checkIdleGtcTimeouts
Message: Have not received a GTC in the last idleGtcErrorTimeout_.count() minutes.
Example:
2022-04-21 13:00:09.655332 FROM localhost:
------------------------------------------
DISMAN-EVENT-MIB::sysUpTimeInstance       = 0:23:36:24.02
SNMPv2-MIB::snmpTrapOID                   = MATRIXX-COMMON-MIB::sysGenericErrorMessage
MATRIXX-COMMON-MIB::sysGenericErrorText   = b'Have not received a GTC in the last 5 minutes.'

2022-04-21 13:00:10.808405 FROM localhost:
------------------------------------------
DISMAN-EVENT-MIB::sysUpTimeInstance       = 0:23:36:25.18
SNMPv2-MIB::snmpTrapOID                   = MATRIXX-COMMON-MIB::sysProcessingErrorAlert
MATRIXX-COMMON-MIB::sysProcessingErrors   = 283
Action: Check system health.

Task::Function: EventLoaderDispatcherTask::dispatcherLoop
Message: Failed to read Event Repository for missing GTC ranges. This can happen when a publishing pod becomes active. The Dispatcher reads the LoaderTraceCollection for any gaps to fill.
Action: Check MongoDB.
Component/Module: MtxStream
Task::Function: MefV2GeneratorTask::publishMefv2FilesToTarget
Message: Could not publish event files: ::strerror(savedErrno) savedErrno
and
Could not publish event files. Exit status= publishCommand.getExitStatus().
Example:
DISMAN-EVENT-MIB::sysUpTimeInstance      = 0:0:02:22.45
SNMPv2-MIB::snmpTrapOID                  = MATRIXX-COMMON-MIB::sysGenericErrorMessage
MATRIXX-COMMON-MIB::sysGenericErrorText  = b'Could not publish event files. Exit status=255'
Action: Check the publishing target.

Task::Function: MefV2GeneratorTask::createPublishedMefList
Message: MEFv2 event recovery. Could not execute create_published_mef_list.py on publish target: ::strerror(savedErrno) savedErrno
and
MEFv2 event recovery. Could not execute create_published_mef_list.py on publish target publishTargetHostName_. Error due to errString.
Action: Requires manual MEFv2 recovery.

Task::Function: MefV2GeneratorTask::pubTriggerCallbackHandler
Message: Mef V2 Publisher did not make any progress for kPubMonitorTimeoutMillis milliseconds.
Action: Check system health.
Component/Module: MtxTrafficMgr
Task::Function: CmpLeaderNodePool::getNextSvcStateOnNodeUp
Message: "duplicate CMP " << str << " nodes, count=" << count << FQN
Example:
DISMAN-EVENT-MIB::sysUpTimeInstance      = 0:0:04:01.45
SNMPv2-MIB::snmpTrapOID                  = MATRIXX-COMMON-MIB::sysGenericErrorMessage
MATRIXX-COMMON-MIB::sysGenericErrorText  = b'duplicate CMP leader nodes, count=2; fqn=poolLeader'
Action: Restart the previous active publishing pod.
Component/Module: MtxTxn
Task::Function: CheckpointWriterTask::writeCheckpoint
Message: The checkpointing server is out-of-sync with the last system snapshot. Please check for other errors to determine why. A duplicate Checkpoint was created for GTC= prevCkptGtc_.
Example:
2022-06-03 19:46:49.667491 FROM localhost:
------------------------------------------
DISMAN-EVENT-MIB::sysUpTimeInstance      = 0:4:30:19.70
SNMPv2-MIB::snmpTrapOID                  = MATRIXX-COMMON-MIB::sysGenericErrorMessage
MATRIXX-COMMON-MIB::sysGenericErrorText  = b'The checkpointing server is out-of-sync with the last system snapshot. Please check for other errors (in this log?) to determine why. A duplicate Checkpoint was created for GTC=7059770'
Action: Check system health.
Task::Function: TransactionManagerTask::resolvePendingTransactionIfAny
Message: Number of retries to resolve transaction ID txnID, GTC=txnCtxP->getGlobalTxnCounter() reaches maximum value resolveTxnMaxRetries_.
Action: Restart the pod.
Task::Function: TransactionManagerTask::handleSharedStorageEvent
Message: Failed to execute nfs unmount from Standby server= myBladeId. Note: Please unmount nfs and mount shared storage manually.
Action: Unmount NFS and mount shared storage.

Message: Failed to mount the shared storage even after fsck on Active publishing server= myBladeId. Note: Please manually mount the shared storage.
Action: Mount shared storage.
Task::Function: TransactionSortedLoggingTask::logWriteBufferAbrtCbHandler
Message: TransactionSortedLoggingTask::logWriteBufferAbrtCbHandler:atPtr->getStepString(), Step: atPtr->getStepString(). Timeout: timeoutMs msec.
Action: Restart the publishing cluster.

Task::Function: TransactionSortedLoggingTask::diskWriteAbrtCbHandler
Message: TransactionSortedLoggingTask::diskWriteAbrtCbHandler: atPtr->getStepString(), Step: atPtr->getStepString(). Timeout: timeoutMs msec.
Action: Restart the engine.
Task::Function: TransactionStreamTask::peerClusterHaStateUpdated
Message: Got Publishing cluster cl::name(toClusterHaState) state, aborting transaction stream. To start transaction stream need to restart the publishing cluster.
Note: This can happen during high load when LogWriteBuffer is not available.
Example:
2022-06-03 13:36:20.172439 FROM localhost:
------------------------------------------
DISMAN-EVENT-MIB::sysUpTimeInstance      = 0:20:57:18.36
SNMPv2-MIB::snmpTrapOID                  = MATRIXX-COMMON-MIB::sysGenericErrorMessage
MATRIXX-COMMON-MIB::sysGenericErrorText  = b'Got Publishing cluster FAILED state, aborting transaction stream. Note: To start transaction stream need to restart the publishing cluster.'
Task::Function: TransactionStreamTask::handleTxnStreamClusterStateMsg
Message: Got HA peer engine= haPeerEcbId cl::name(clusterState) state, aborting transaction stream. To start transaction stream need to restart the engine.
Note: This means the sorted transaction log writing to the local disk is slow. Verify if any non-MATRIXX processes are writing to disk.
Task::Function: TransactionManagerTask::coordinatorCommit
Message: Fatal error in committing transaction, NACK this transaction.
Note: Only when the server is not shut down.
Action: Start the other server, engine, or cluster before restarting this server.
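When triaging received traps, it can help to pull the fields out of the logged trap text and look up the matching action from Table 1. The following is a rough sketch under several assumptions: the trap text has one "name = value" pair per line as in the examples above (wrapped values are not handled), the function names parse_trap and suggest_action are hypothetical, and only three of the table's messages are encoded, with their placeholders replaced by regular expressions.

```python
import re

# One "MIB::object = value" pair per line, as in the example trap output.
VARBIND = re.compile(r"^\s*(\S+::\S+)\s*=\s*(.*?)\s*$")

# A few (message pattern, action) pairs copied from Table 1.
ACTIONS = [
    (re.compile(r"Have not received a GTC in the last \d+ minutes"),
     "Check system health."),
    (re.compile(r"Could not publish event files"),
     "Check the publishing target."),
    (re.compile(r"duplicate CMP .* nodes, count=\d+"),
     "Restart the previous active publishing pod."),
]


def parse_trap(text):
    """Return a dict of object name -> value for one logged trap."""
    fields = {}
    for line in text.splitlines():
        match = VARBIND.match(line)
        if not match:
            continue  # timestamp, "FROM ..." and separator lines are skipped
        value = match.group(2)
        if value.startswith("b'") and value.endswith("'"):
            value = value[2:-1]  # drop the b'...' wrapper around message text
        fields[match.group(1)] = value
    return fields


def suggest_action(fields):
    """Return the Table 1 action for the trap's error text, or None."""
    text = fields.get("MATRIXX-COMMON-MIB::sysGenericErrorText", "")
    for pattern, action in ACTIONS:
        if pattern.search(text):
            return action
    return None


sample = """2022-04-21 13:00:09.655332 FROM localhost:
------------------------------------------
DISMAN-EVENT-MIB::sysUpTimeInstance       = 0:23:36:24.02
SNMPv2-MIB::snmpTrapOID                   = MATRIXX-COMMON-MIB::sysGenericErrorMessage
MATRIXX-COMMON-MIB::sysGenericErrorText   = b'Have not received a GTC in the last 5 minutes.'
"""

fields = parse_trap(sample)
action = suggest_action(fields)
```

Matching on the message text rather than on the trap OID is deliberate here, because every location in Table 1 sends the same sysGenericErrorMessage trap and only the text distinguishes them.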