MATRIXX Engine SNMP Statistic Monitoring Recommendations
This topic lists the MATRIXX SNMP statistics you should actively monitor. These statistics are important indicators of the efficiency of your MATRIXX implementation.
The table Important MATRIXX Engine SNMP Statistics to Monitor describes these statistics.
SNMP Statistic | Description |
---|---|
chrgServerGroup.chrgServerRetryCount | The number of times the Charging Server retries transactions in the Transaction Server. This statistic is an indication of object-level contention in parallel transactions and can help isolate performance issues that are not otherwise visible. Performance degradation and possible transaction failure can occur if there are too many Charging Server retries due to contention. |
chrgServerGroup.chrgServerRejectCount | The number of messages that the Charging Server rejected because the configured maximum transaction retry limit was reached. The default limit is 10. A high count can indicate an overloaded server or product/pricing issues that result in contradictory transactions. The response message to the network contains the error code associated with the rejection. |
sysClusterMemberTable.sysClusterMemberMgmtIpAddress | The address of a cluster member node at the management interface, in a printable format. The address type can be IPv4, IPv6, or DNS. |
sysClusterMemberTable.sysClusterMemberNodeState | The service state ID of a member node. This value should always be ACTIVE. |
sysPeerClusterTable.sysPeerClusterClusterState | The HA state of a peer cluster. One of:
The HA state NONE is used by the Traffic Routing Agent to identify when only one cluster (engine) is configured for a MATRIXX Engine environment. It does not influence engine operations. |
| The amount of database growth, in kilobytes, in the last minute, hour, and day. Small growth indicates that the current database sizes are sufficient to handle the system activity. Large growth indicates that many new objects are being added to the associated database, which forces the addition of new segments. In such cases, more memory might need to be added. Extremely large growth might indicate a problem, such as too much load being introduced for the system configuration. If this occurs often, throttling the incoming event rate might be a solution. These statistics can also serve as an informative data point. For instance, if a new service is rolled out and attracts many new subscribers, the subscriber database will likely add new segments. This is normal and can be one way to gauge the uptake of the service. |
txnDatabaseSegmentStatsTable.objectMemorySurplusKb | The amount of memory over-allocated in the system because objects have decreased in size. You can clean up surplus memory by setting the advanced database configuration parameters for object relocation thresholds. These parameters set thresholds at which an object is relocated to a new space and the previously allocated memory is freed. For more information, see the discussion about advanced database configuration in MATRIXX Configuration. |
txnDatabaseIndexStatsTable.txnDatabaseIndexPoolId | A database pool ID. Values are:
|
txnDatabaseIndexStatsTable.txnDatabaseIndexTypeId | The unique ID of a database object type. |
txnDatabaseIndexStatsTable.txnDatabaseIndexCount | The number of database objects of a given type. For the subscriber, activity, balance set, and event databases (5, 19, 20, and 52), compare the values with those of the STANDBY site. The difference between the values should be less than X, where X is subject to the implementation (for example, a difference of less than 10). For the schedule database (31), the value itself should be less than X, where X is subject to the implementation. An increasing value indicates that notifications are not being sent. |
txnDatabaseObjectStatsTable.txnDatabaseObjectMaximumCount | The maximum number of database objects for a given type. |
diamGroup.diamTotalStatsMalformedRequests | The total number of malformed Diameter packets received. |
diamGroup.diamTotalStatsPermanentFailures | Number of permanent failures returned. |
diamGroup.diamTotalStatsProtocolErrors | Total number of protocol errors returned to peers, but not including redirects. |
diamGroup.diamTotalStatsTransientFailures | Number of transient failures returned. |
diamGroup.diamTotalStatsTransportDown | Number of unexpected transport failures. |
diamGroup.diamTotalStatsUnknownTypes | The number of Diameter packets of unknown type which were received. |
diamResultCodeStatsTable.diamResultCodeStatsCmdCode | A Diameter Command-Code identifying request and response types. |
diamResultCodeStatsTable.diamResultCodeStatsResultCode | A Diameter Result-Code identifier. |
diamResultCodeStatsTable.diamResultCodeStatsResultCodeOut | The total number of responses sent for a given application ID, Command-Code, and Result-Code since the Diameter server started. |
chrgGroup.chrgNotificationOut | The count of unique notifications that have been sent from the engine. Notifications that have been sent multiple times (for retries) are counted only once. This value should always increment within a time window of X, where X is subject to the implementation. |
chrgGroup.chrgNotificationMaxRetryFailures | The count of notifications that have been discarded because they have been sent multiple times, but the remote messaging system has not acknowledged reception of the notification. |
chrgGroup.chrgNotificationNotificationServerFailed | The count of notifications that the remote messaging system failed to deliver. |
chrgGroup.chrgNotificationNotificationServerFiltered | The count of notifications that have been filtered by the remote messaging system. |
sysQueueStatsTable.sysQueueStatsFullCount | The maximum number of elements ever present in the queue since system start time. Note: Exclude the checkpoint line item. |
sysQueueStatsTable.sysQueueStatsMaxCount | The number of times a message queue was filled to capacity during the collection interval. If the QueueName is not a message queue, this value is zero (0). Note: Exclude the checkpoint line item. |
txnReplayStatsTable.txnReplayPendingFileCount | The current number of outstanding checkpoint files to replay, including those currently being replayed and those queued for replay on the publishing pod. An alarm triggers if the value does not return to 0 within a time window of X, where X is subject to the frequency of monitoring the trap. |
txnReplayStatsTable.txnReplayCurrentTransactionCount | The current number of outstanding transactions to replay. This value includes the replay messages queued for replay and those queued for sorting before being replayed. An alarm triggers if the value does not return to 0 within a time window of X, where X is subject to the frequency of monitoring the trap. |
txnReplayStatsTable.txnReplayCurrentTransactionBatchCount | The current number of outstanding transaction batches to replay. An alarm triggers if the value does not return to 0 within a time window of X, where X is subject to the frequency of monitoring the trap. |
txnReplayStatsTable.txnReplayObjectCount | The current number of outstanding database objects to replay. An alarm triggers if the value does not return to 0 within a time window of X, where X is subject to the frequency of monitoring the trap. |
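
Several of the recommendations in the table reduce to simple checks on values that a monitoring system has already polled. The following is a minimal sketch of those checks, not a MATRIXX-provided tool. It assumes the values were collected by some SNMP poller (not shown) and stored as time-ordered `(timestamp, value)` samples, oldest first. The function names, the sample format, and the threshold constant are hypothetical, and every X is left to the implementation, as the table states.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical threshold: the table leaves X to each implementation and
# only gives "difference < 10" as an example.
MAX_INDEX_COUNT_DIFF = 10

# One polled sample: (timestamp in seconds, counter value), oldest first.
Sample = Tuple[float, int]


def nodes_not_active(states: Dict[str, str]) -> List[str]:
    """Return the nodes whose sysClusterMemberNodeState is not ACTIVE."""
    return sorted(node for node, state in states.items() if state != "ACTIVE")


def index_count_diverged(active: int, standby: int,
                         max_diff: int = MAX_INDEX_COUNT_DIFF) -> bool:
    """True when the active and standby txnDatabaseIndexCount values
    differ by max_diff or more."""
    return abs(active - standby) >= max_diff


def stuck_above_zero(samples: Sequence[Sample], window_s: float) -> bool:
    """True when a replay counter (for example txnReplayPendingFileCount)
    never read 0 during the last window_s seconds."""
    if not samples:
        return False
    latest_t = samples[-1][0]
    if latest_t - samples[0][0] < window_s:
        return False  # not enough history to cover the window yet
    recent = [v for t, v in samples if latest_t - t <= window_s]
    return all(v != 0 for v in recent)


def stalled(samples: Sequence[Sample], window_s: float) -> bool:
    """True when a counter that should keep incrementing (for example
    chrgNotificationOut) did not grow over the last window_s seconds."""
    if not samples:
        return False
    latest_t, latest_v = samples[-1]
    # Values that are at least window_s older than the latest sample.
    older = [v for t, v in samples if latest_t - t >= window_s]
    return bool(older) and latest_v <= older[-1]
```

For example, `stuck_above_zero(samples, 300)` would flag a replay counter that has stayed non-zero for five minutes of polling. The window lengths and thresholds shown are placeholders to tune per deployment.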