MATRIXX Engine SNMP Statistic Monitoring Recommendations

Table 1. Important MATRIXX Engine SNMP Statistics to Monitor

chrgServerGroup.chrgServerRetryCount

The number of times the Charging Server retries transactions in the Transaction Server. This statistic is an indication of object-level contention in parallel transactions and can help isolate performance issues that are not otherwise visible.

Performance degradation and possible transaction failure can occur if there are too many Charging Server retries due to contention.

chrgServerGroup.chrgServerRejectCount

The number of messages that were rejected by the Charging Server because the configured maximum transaction retry limit was reached. The default is 10. A high count can indicate an overloaded server or product/pricing issues that result in contradictory transactions.

The response message to the network contains the error code associated with the rejection.
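Because both Charging Server statistics are counts, a monitor would typically watch how much they change between polls rather than their absolute values. The following is a minimal sketch under the assumption that the values are cumulative counters that may restart from zero; that behavior is not stated here, and the function names and sample values are hypothetical.

```python
def counter_delta(previous: int, current: int) -> int:
    """Return the increase of a cumulative counter between two polls.

    Assumption: a current value lower than the previous one means the
    counter was reset (for example by a restart), so the current value
    is treated as the whole increase.
    """
    if current < previous:
        return current
    return current - previous


def per_second_rate(previous: int, current: int, interval_seconds: float) -> float:
    """Convert a counter increase into an average per-second rate."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return counter_delta(previous, current) / interval_seconds


# Hypothetical samples of chrgServerRetryCount taken 60 seconds apart.
retry_rate = per_second_rate(previous=1200, current=1500, interval_seconds=60)
print(f"retries per second: {retry_rate:.1f}")
```

The same helper can be applied to chrgServerRejectCount; what counts as too many retries or rejections depends on the deployment.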

sysClusterMemberTable.sysClusterMemberMgmtIpAddress

The IP address of a cluster member node's management interface, in a printable format. The address type can be IPv4, IPv6, or DNS.

sysClusterMemberTable.sysClusterMemberNodeState

The service state ID of a member node. This value should always be ACTIVE.

sysPeerClusterTable.sysPeerClusterClusterState

The HA state of a peer cluster. One of:

unknown(0)
start(1)
pre-init(2)
init(3)
post-init(4)
standby-sync(5)
standby(6)
active-sync(7)
active(8)
exit(10)
stop(11)
final(12)
failed(13)
none(14)

On the active cluster, the peer cluster state should be STANDBY.
On a standby cluster, in a two-engine installation, the peer cluster state should be ACTIVE. In a three-engine installation, the peer cluster state should be ACTIVE or STANDBY, depending on the HA state of the cluster it is supporting.

The HA state NONE is used by the Traffic Routing Agent to identify when only one cluster (engine) is configured for a MATRIXX Engine environment. It does not influence engine operations.
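The expectations above can be encoded as a small lookup. This is a sketch only: the state codes are taken from the list above, while the `HaState` enum, the function names, and the `engine_count` parameter are hypothetical and not part of the MIB.

```python
from enum import IntEnum


class HaState(IntEnum):
    """HA state codes as listed for sysPeerClusterClusterState."""
    UNKNOWN = 0
    START = 1
    PRE_INIT = 2
    INIT = 3
    POST_INIT = 4
    STANDBY_SYNC = 5
    STANDBY = 6
    ACTIVE_SYNC = 7
    ACTIVE = 8
    EXIT = 10
    STOP = 11
    FINAL = 12
    FAILED = 13
    NONE = 14


def expected_peer_states(local_state: HaState, engine_count: int) -> set:
    """Return the peer states considered healthy for the local cluster."""
    if local_state == HaState.ACTIVE:
        return {HaState.STANDBY}
    if local_state == HaState.STANDBY:
        if engine_count == 2:
            return {HaState.ACTIVE}
        if engine_count == 3:
            return {HaState.ACTIVE, HaState.STANDBY}
    raise ValueError("no expectation defined for this combination")


def peer_state_ok(local_state: HaState, peer_state_code: int, engine_count: int) -> bool:
    """Check a polled peer state code against the expected states."""
    try:
        peer_state = HaState(peer_state_code)
    except ValueError:
        # A code that is not in the documented list is never healthy.
        return False
    return peer_state in expected_peer_states(local_state, engine_count)


# Hypothetical poll: the local cluster is active and the peer reports standby(6).
print(peer_state_ok(HaState.ACTIVE, 6, engine_count=2))
```

In the three-engine case the check accepts either state, because deciding which one is correct requires knowing which cluster the standby is supporting.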

txnDatabaseStatsTable.txnDatabaseMemoryGrowthLastMinuteInKb
txnDatabaseStatsTable.txnDatabaseMemoryGrowthLastHourInKb
txnDatabaseStatsTable.txnDatabaseMemoryGrowthLastDayInKb

The amount of database memory growth, in kilobytes, over the last minute, hour, and day.

If the growth is small, it indicates that the current sizes of the databases are sufficient to handle the system activity.

If the growth is large, it indicates that the associated database is growing because many new objects are being added, which forces the addition of new segments. In such cases, more memory might need to be added.

If the growth is extremely large, it might indicate a problem, such as too much load being introduced for the system configuration. If this occurs often, throttling the incoming event rate might be a solution.

These statistics can also serve as an informative data point. For instance, if a new service is rolled out and attracts a lot of new subscribers, the subscriber database will likely add new segments. This is normal and might be a method of determining service attraction.
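The small, large, and extremely large bands described above can be expressed as a simple classification. This is a sketch only: the thresholds are not defined here and must be chosen per deployment, so the function name and the numbers in the example are hypothetical.

```python
def classify_growth(growth_kb: int, large_kb: int, extreme_kb: int) -> str:
    """Place a memory growth reading into one of three bands.

    large_kb and extreme_kb are site-specific thresholds (assumptions);
    nothing in the statistic itself defines them.
    """
    if large_kb >= extreme_kb:
        raise ValueError("large_kb must be below extreme_kb")
    if growth_kb >= extreme_kb:
        return "extreme"
    if growth_kb >= large_kb:
        return "large"
    return "small"


# Hypothetical reading of txnDatabaseMemoryGrowthLastHourInKb.
band = classify_growth(growth_kb=250_000, large_kb=100_000, extreme_kb=1_000_000)
print(band)
```

Separate thresholds for the minute, hour, and day statistics are likely needed, since the same absolute growth means different things over different periods.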

txnDatabaseSegmentStatsTable.objectMemorySurplusKb

The amount of memory over-allocated in the system due to objects reducing in size. You can clean up surplus memory by setting the advanced database configuration parameters for object relocation thresholds. These parameters set thresholds at which an object is relocated to a new space and the previously allocated memory is freed up.

For more information, see the discussion about advanced database configuration in MATRIXX Configuration.

txnDatabaseIndexStatsTable.txnDatabaseIndexPoolId

A database pool ID.

Values are:

5 — Subscriber Database: all
19 — Activity Database: all
20 — Balance Set Database: all
31 — Schedule Database: all
52 — Event Database

txnDatabaseIndexStatsTable.txnDatabaseIndexTypeId

The unique ID of a database object type.

txnDatabaseIndexStatsTable.txnDatabaseIndexCount

The number of database objects of a given type.

For the subscriber, activity, balance set, and event databases (5, 19, 20, and 52), compare the values with the STANDBY site. The difference of the values should be < X, where X is subject to implementation. For example, difference < 10.

For the schedule database (31), the actual value should be < X, where X is subject to implementation. An increasing value indicates that notifications are not being sent.
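The two checks above can be sketched as follows. The pool IDs come from the txnDatabaseIndexPoolId list; the data layout (a dictionary keyed by pool ID and type ID), the function name, and both thresholds are hypothetical, since X is implementation-specific.

```python
COMPARED_POOLS = {5, 19, 20, 52}  # subscriber, activity, balance set, event
SCHEDULE_POOL = 31


def check_index_counts(active, standby, max_difference, max_schedule_count):
    """Return (pool_id, type_id, reason) tuples for counts outside their limits.

    active and standby map (pool_id, type_id) to txnDatabaseIndexCount values
    polled from the active and standby sites.
    """
    findings = []
    for (pool_id, type_id), count in sorted(active.items()):
        if pool_id in COMPARED_POOLS:
            # A type missing on the standby site is treated as a count of 0.
            other = standby.get((pool_id, type_id), 0)
            if abs(count - other) >= max_difference:
                findings.append((pool_id, type_id, "standby difference"))
        elif pool_id == SCHEDULE_POOL and count >= max_schedule_count:
            findings.append((pool_id, type_id, "schedule count"))
    return findings


# Hypothetical polled values.
active = {(5, 1): 1000, (19, 2): 500, (31, 7): 120}
standby = {(5, 1): 998, (19, 2): 470}
findings = check_index_counts(active, standby, max_difference=10, max_schedule_count=100)
print(findings)
```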

txnDatabaseObjectStatsTable.txnDatabaseObjectMaximumCount

The maximum number of database objects for a given type.

diamGroup.diamTotalStatsMalformedRequests

The total number of malformed Diameter packets received.

diamGroup.diamTotalStatsPermanentFailures

Number of permanent failures returned.

diamGroup.diamTotalStatsProtocolErrors

Total number of protocol errors returned to peers, but not including redirects.

diamGroup.diamTotalStatsTransientFailures

Number of transient failures returned.

diamGroup.diamTotalStatsTransportDown

Number of unexpected transport failures.

diamGroup.diamTotalStatsUnknownTypes

The number of Diameter packets of unknown type received.

diamResultCodeStatsTable.diamResultCodeStatsCmdCode

A Diameter Command-Code identifying request and response types.

diamResultCodeStatsTable.diamResultCodeStatsResultCode

A Diameter Result-Code identifier.

diamResultCodeStatsTable.diamResultCodeStatsResultCodeOut

Total number of responses sent for a given application ID, command code, and result code since the Diameter server started.
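Rows from this table can be rolled up by result-code class to see the overall mix of outcomes. The sketch below relies on the Diameter base protocol convention that the thousands digit of a Result-Code gives its class (1xxx informational, 2xxx success, 3xxx protocol error, 4xxx transient failure, 5xxx permanent failure). The row layout, the function name, and the sample values are hypothetical.

```python
from collections import Counter

RESULT_CODE_CLASSES = {
    1: "informational",
    2: "success",
    3: "protocol error",
    4: "transient failure",
    5: "permanent failure",
}


def summarize_by_class(rows):
    """Sum response counts per Result-Code class.

    rows is an iterable of (cmd_code, result_code, responses_sent) tuples,
    one per polled diamResultCodeStatsTable row.
    """
    totals = Counter()
    for _cmd_code, result_code, responses_sent in rows:
        label = RESULT_CODE_CLASSES.get(result_code // 1000, "other")
        totals[label] += responses_sent
    return dict(totals)


# Hypothetical rows for command code 272 (Credit-Control).
rows = [(272, 2001, 9500), (272, 4012, 40), (272, 5030, 7), (272, 5031, 3)]
summary = summarize_by_class(rows)
print(summary)
```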

chrgGroup.chrgNotificationOut

The count of unique notifications that have been sent from the engine. Notifications that have been sent multiple times (for retries) are counted only once.

This value should always increment within time window X, where X is subject to the implementation.
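A stalled counter can be detected by checking whether the value changed at all across the samples that fall inside the window. This is a sketch only: the sample format, the function name, and the 300-second window are hypothetical.

```python
def has_changed_in_window(samples, window_seconds, now):
    """Return True if the counter changed during the window ending at `now`.

    samples is a list of (timestamp_seconds, counter_value) pairs. Any change
    counts as activity, which also covers a counter that was reset.
    """
    in_window = [value for ts, value in samples if 0 <= now - ts <= window_seconds]
    if len(in_window) < 2:
        raise ValueError("need at least two samples inside the window")
    return len(set(in_window)) > 1


# Hypothetical polls of chrgNotificationOut every 60 seconds.
samples = [(0, 4100), (60, 4100), (120, 4100), (180, 4100)]
stalled = not has_changed_in_window(samples, window_seconds=300, now=180)
print("notifications stalled:", stalled)
```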

chrgGroup.chrgNotificationMaxRetryFailures

The count of notifications that have been discarded because they have been sent multiple times, but the remote messaging system has not acknowledged reception of the notification.

chrgGroup.chrgNotificationNotificationServerFailed

The count of notifications that the remote messaging system failed to deliver.

chrgGroup.chrgNotificationNotificationServerFiltered

The count of notifications that have been filtered by the remote messaging system.

sysQueueStatsTable.sysQueueStatsFullCount

The number of times a message queue was filled to capacity during the collection interval. If the QueueName is not a message queue, this value is zero (0).

Note: Exclude the checkpoint line item.

sysQueueStatsTable.sysQueueStatsMaxCount

The maximum number of elements ever present in the queue since system start time.

Note: Exclude the checkpoint line item.

txnReplayStatsTable.txnReplayPendingFileCount

The current number of outstanding checkpoint files to replay, including those currently being replayed and those queued for replay on the publishing pod.

An alarm triggers if the number is not 0 within a time window of X, where X is subject to the frequency of monitoring the trap.

txnReplayStatsTable.txnReplayCurrentTransactionCount

The current number of outstanding transactions to replay. This value includes the number of replay messages queued for replay and those queued for sorting, before being replayed.

An alarm triggers if the number is not 0 within a time window of X, where X is subject to the frequency of monitoring the trap.

txnReplayStatsTable.txnReplayCurrentTransactionBatchCount

The current number of outstanding transaction batches to replay.

An alarm triggers if the number is not 0 within a time window of X, where X is subject to the frequency of monitoring the trap.

txnReplayStatsTable.txnReplayObjectCount

The current number of outstanding database objects to replay.

An alarm triggers if the number is not 0 within a time window of X, where X is subject to the frequency of monitoring the trap.
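The alarm condition shared by the four replay statistics can be read as "the value never reached 0 during the window". The sketch below implements that reading, which is an interpretation rather than a documented algorithm; the sample format, the function name, and the window length are hypothetical.

```python
def replay_backlog_alarm(samples, window_seconds, now):
    """Return True if the statistic stayed non-zero for the whole window.

    samples is a list of (timestamp_seconds, value) pairs for one replay
    statistic, such as txnReplayPendingFileCount.
    """
    in_window = [value for ts, value in samples if 0 <= now - ts <= window_seconds]
    if not in_window:
        raise ValueError("no samples inside the window")
    # A single reading of 0 inside the window clears the alarm.
    return all(value != 0 for value in in_window)


# Hypothetical polls taken every 30 seconds.
backlog = [(0, 3), (30, 2), (60, 4), (90, 1)]
drained = [(0, 3), (30, 0), (60, 2), (90, 1)]
print(replay_backlog_alarm(backlog, window_seconds=120, now=90))
print(replay_backlog_alarm(drained, window_seconds=120, now=90))
```

The window should span several polling intervals; with a single sample per window, any momentary backlog would raise the alarm.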