MATRIXX Engine SNMP Statistic Monitoring Recommendations

This topic lists the MATRIXX SNMP statistics you should actively monitor. These statistics are important indicators of the efficiency of your MATRIXX implementation.

Table 1 describes the statistics to monitor.

Table 1. Important MATRIXX Engine SNMP Statistics to Monitor
SNMP Statistic Description
chrgServerGroup.chrgServerRetryCount The number of times the Charging Server retries transactions in the Transaction Server. This statistic is an indication of object-level contention in parallel transactions and can help isolate performance issues that are not otherwise visible.

Performance degradation and possible transaction failure can occur if there are too many Charging Server retries due to contention.

chrgServerGroup.chrgServerRejectCount The number of messages that the Charging Server rejected because the configured maximum transaction retry limit was reached. The default limit is 10. A high count can indicate an overloaded server, or product or pricing issues that result in contradictory transactions.

The response message to the network contains the error code associated with the rejection.

sysClusterMemberTable.sysClusterMemberMgmtIpAddress The IP address of a cluster member node at the management interface, in a printable format. The address type can be IPv4, IPv6, or DNS.
sysClusterMemberTable.sysClusterMemberNodeState The service state ID of a member node. This value should always be ACTIVE.
sysPeerClusterTable.sysPeerClusterClusterState The HA state of a peer cluster. One of:
unknown(0)
start(1)
pre-init(2)
init(3)
post-init(4)
standby-sync(5)
standby(6)
active-sync(7)
active(8)
exit(10)
stop(11)
final(12)
failed(13)
none(14)
  • On the active cluster, the peer cluster state should be STANDBY.
  • On a standby cluster, in a two-engine installation, the peer cluster state should be ACTIVE. In a three-engine installation, the peer cluster state should be ACTIVE or STANDBY, depending on the HA state of the cluster it is supporting.

The HA state NONE is used by the Traffic Routing Agent to identify when only one cluster (engine) is configured for a MATRIXX Engine environment. It does not influence engine operations.
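The expected peer states described above can be checked mechanically. The following is a minimal sketch, not a MATRIXX API: the names PeerClusterState, expected_peer_states, and peer_state_ok are hypothetical, and the three-engine case is simplified to accept either ACTIVE or STANDBY without inspecting which cluster the standby is supporting. How the raw integer is fetched over SNMP is outside the scope of the sketch.

```python
from enum import IntEnum


class PeerClusterState(IntEnum):
    """HA states as listed for sysPeerClusterClusterState (no value 9 is listed)."""
    UNKNOWN = 0
    START = 1
    PRE_INIT = 2
    INIT = 3
    POST_INIT = 4
    STANDBY_SYNC = 5
    STANDBY = 6
    ACTIVE_SYNC = 7
    ACTIVE = 8
    EXIT = 10
    STOP = 11
    FINAL = 12
    FAILED = 13
    NONE = 14


def expected_peer_states(local_is_active: bool, engine_count: int = 2) -> set:
    """Return the peer states considered healthy from the local cluster's view."""
    if local_is_active:
        # On the active cluster, the peer should be STANDBY.
        return {PeerClusterState.STANDBY}
    if engine_count >= 3:
        # Simplification (assumption): accept either state in a three-engine setup.
        return {PeerClusterState.ACTIVE, PeerClusterState.STANDBY}
    # On a standby cluster in a two-engine setup, the peer should be ACTIVE.
    return {PeerClusterState.ACTIVE}


def peer_state_ok(raw_value: int, local_is_active: bool, engine_count: int = 2) -> bool:
    """Check a raw SNMP integer against the expected peer states."""
    try:
        state = PeerClusterState(raw_value)
    except ValueError:
        # A value outside the documented list is treated as unhealthy.
        return False
    return state in expected_peer_states(local_is_active, engine_count)
```

For example, peer_state_ok(6, local_is_active=True) accepts a STANDBY peer seen from the active cluster, while the same call with the value 13 (failed) does not.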

  • txnDatabaseStatsTable.txnDatabaseMemoryGrowthLastMinuteInKb
  • txnDatabaseStatsTable.txnDatabaseMemoryGrowthLastHourInKb
  • txnDatabaseStatsTable.txnDatabaseMemoryGrowthLastDayInKb
The amount of database growth, in kilobytes, over the last minute, hour, and day.

If the growth is small, it indicates that the current sizes of the databases are sufficient to handle the system activity.

If the growth is large, it indicates that many new objects are being added to the associated database, which forces the addition of new segments. In such cases, more memory might need to be added.

If the growth is extremely large, it might indicate a problem, such as too much load being introduced for the system configuration. If this occurs often, throttling the incoming event rate might be a solution.

These statistics can also serve as an informative data point. For instance, if a new service is rolled out and attracts a lot of new subscribers, the subscriber database will likely add new segments. This is normal and might be a method of determining service attraction.
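The small, large, and extremely large categories above are not given numeric boundaries, so any automated check has to supply its own. The following minimal sketch assumes two site-specific thresholds (large_kb and extreme_kb); both the thresholds and the function name classify_growth are hypothetical and must be tuned per deployment.

```python
def classify_growth(growth_kb: int, large_kb: int, extreme_kb: int) -> str:
    """Bucket a txnDatabaseMemoryGrowth* reading using site-specific thresholds.

    large_kb and extreme_kb are assumptions chosen by the operator; the
    documentation does not define numeric boundaries.
    """
    if large_kb > extreme_kb:
        raise ValueError("large_kb must not exceed extreme_kb")
    if growth_kb >= extreme_kb:
        return "extreme"  # possible overload; consider throttling incoming events
    if growth_kb >= large_kb:
        return "large"    # new segments being added; more memory might be needed
    return "small"        # current database sizes are sufficient
```

The same function can be applied to the per-minute, per-hour, and per-day readings with different threshold pairs.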
txnDatabaseSegmentStatsTable.objectMemorySurplusKb The amount of memory over-allocated in the system due to objects reducing in size. You can clean up surplus memory by setting the advanced database configuration parameters for object relocation thresholds. These parameters set thresholds at which an object is relocated to a new space and the previously allocated memory is freed up.

For more information, see the discussion about advanced database configuration in MATRIXX Configuration.

txnDatabaseIndexStatsTable.txnDatabaseIndexPoolId A database pool ID.
Values are:
  • 5 — Subscriber Database: all
  • 19 — Activity Database: all
  • 20 — Balance Set Database: all
  • 31 — Schedule Database: all
  • 52 — Event Database
txnDatabaseIndexStatsTable.txnDatabaseIndexTypeId The unique ID of a database object type.
txnDatabaseIndexStatsTable.txnDatabaseIndexCount The number of database objects of a given type.

For the subscriber, activity, balance set, and event databases (5, 19, 20, and 52), compare the values with those on the STANDBY site. The difference between the values should be < X, where X depends on the implementation. For example, difference < 10.

For the schedule database (31), the actual value should be < X, where X depends on the implementation. An increasing value indicates that notifications are not being sent.
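The two comparisons above can be sketched as a single check. This is an illustration under stated assumptions: the counts are assumed to have already been collected into dictionaries keyed by (pool ID, type ID), and the names check_index_counts, max_diff, and max_schedule are hypothetical stand-ins for the implementation-specific X values.

```python
REPLICATED_POOLS = {5, 19, 20, 52}  # subscriber, activity, balance set, event
SCHEDULE_POOL = 31                  # schedule database


def check_index_counts(active, standby, max_diff, max_schedule):
    """Return a list of (pool_id, type_id, reason) findings.

    active and standby map (pool_id, type_id) -> txnDatabaseIndexCount.
    max_diff and max_schedule are the site-specific X thresholds (assumptions).
    """
    findings = []
    for (pool_id, type_id), count in sorted(active.items()):
        if pool_id in REPLICATED_POOLS:
            peer_count = standby.get((pool_id, type_id))
            if peer_count is None:
                findings.append((pool_id, type_id, "missing on standby"))
            elif abs(count - peer_count) >= max_diff:
                # The difference should be strictly less than X.
                findings.append((pool_id, type_id, "active/standby difference too large"))
        elif pool_id == SCHEDULE_POOL and count >= max_schedule:
            # The schedule count itself should be strictly less than X.
            findings.append((pool_id, type_id, "schedule count too high"))
    return findings
```

An empty result means every compared count is within its threshold.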

txnDatabaseObjectStatsTable.txnDatabaseObjectMaximumCount The maximum number of database objects for a given type.
diamGroup.diamTotalStatsMalformedRequests The total number of malformed Diameter packets received.
diamGroup.diamTotalStatsPermanentFailures Number of permanent failures returned.
diamGroup.diamTotalStatsProtocolErrors Total number of protocol errors returned to peers, but not including redirects.
diamGroup.diamTotalStatsTransientFailures Number of transient failures returned.
diamGroup.diamTotalStatsTransportDown Number of unexpected transport failures.
diamGroup.diamTotalStatsUnknownTypes The number of Diameter packets of unknown type which were received.
diamResultCodeStatsTable.diamResultCodeStatsCmdCode A Diameter Command-Code identifying request and response types.
diamResultCodeStatsTable.diamResultCodeStatsResultCode A Diameter Result-Code identifier.
diamResultCodeStatsTable.diamResultCodeStatsResultCodeOut Total number of responses of a given application ID, Command-Code, and Result-Code sent since the start of the Diameter server.
chrgGroup.chrgNotificationOut The count of unique notifications that have been sent from the engine. Notifications that have been sent multiple times (for retries) are counted only once.

This value should always increment within time window X (where X depends on the implementation).
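A check that a counter has incremented within a window can be sketched as follows. The function name counter_incremented and the sample format, a list of (timestamp in seconds, counter value) pairs ordered oldest first, are assumptions for illustration; window_seconds stands in for the implementation-specific X.

```python
def counter_incremented(samples, window_seconds, now):
    """Return True if the counter rose between the first and last sample in the window.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Returns False when fewer than two samples fall inside the window, because
    an increment cannot be confirmed from a single reading.
    """
    in_window = [
        value for ts, value in samples
        if now - window_seconds <= ts <= now
    ]
    if len(in_window) < 2:
        return False
    return in_window[-1] > in_window[0]
```

A counter reset, for example after a restart, makes the last value lower than the first and is reported as no increment, so a caller might want to treat that case separately.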

chrgGroup.chrgNotificationMaxRetryFailures The count of notifications that have been discarded because they have been sent multiple times, but the remote messaging system has not acknowledged reception of the notification.
chrgGroup.chrgNotificationNotificationServerFailed The count of notifications that the remote messaging system failed to deliver.
chrgGroup.chrgNotificationNotificationServerFiltered The count of notifications that have been filtered by the remote messaging system.
sysQueueStatsTable.sysQueueStatsFullCount The number of times a message queue was filled to capacity during the collection interval. If the QueueName is not a message queue, this value is zero (0).
Note: Exclude the checkpoint line item.
sysQueueStatsTable.sysQueueStatsMaxCount The maximum number of elements ever present in the queue since system start time.
Note: Exclude the checkpoint line item.
txnReplayStatsTable.txnReplayPendingFileCount The current number of outstanding checkpoint files to replay, including those currently being replayed and those queued for replay on the publishing pod.

An alarm triggers if the number does not equal 0 within a time window of X, where X depends on how frequently the trap is monitored.

txnReplayStatsTable.txnReplayCurrentTransactionCount The current number of outstanding transactions to replay. This value includes the number of replay messages queued for replay and those queued for sorting, before being replayed.

An alarm triggers if the number does not equal 0 within a time window of X, where X depends on how frequently the trap is monitored.

txnReplayStatsTable.txnReplayCurrentTransactionBatchCount The current number of outstanding transaction batches to replay.

An alarm triggers if the number does not equal 0 within a time window of X, where X depends on how frequently the trap is monitored.

txnReplayStatsTable.txnReplayObjectCount The current number of outstanding database objects to replay.

An alarm triggers if the number does not equal 0 within a time window of X, where X depends on how frequently the trap is monitored.
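The alarm condition shared by the txnReplayStatsTable statistics can be sketched as a check that a value has stayed above 0 for at least the window X. The function name replay_alarm and the sample format, a list of (timestamp in seconds, value) pairs ordered oldest first, are assumptions for illustration and are not part of the product.

```python
def replay_alarm(samples, window_seconds, now):
    """Return True if the value has been non-zero for at least window_seconds.

    samples: list of (timestamp_seconds, value), oldest first.
    window_seconds: the site-specific window X (an assumption of this sketch).
    """
    nonzero_since = None
    for ts, value in samples:
        if value == 0:
            # The backlog drained; restart the clock.
            nonzero_since = None
        elif nonzero_since is None:
            # First non-zero reading since the last drain.
            nonzero_since = ts
    return nonzero_since is not None and now - nonzero_since >= window_seconds
```

The same check applies unchanged to the pending file count, transaction count, transaction batch count, and object count.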