Engine Metrics Recommended for Grafana

The following MATRIXX Engine metrics are recommended for MATRIXX Engine Grafana dashboards.

Charging Server

The Charging Server retry count should be low compared to the total messages processed. If the retry count is high, it might impact processing performance. Charging Server Metrics Recommended for Grafana describes recommended metrics.

Table 1. Charging Server Metrics Recommended for Grafana
Metric	Type	Labels	Description
chrgServerRetryCount	Gauge		The number of times the Charging Server retries transactions in the Transaction Server. This is an indication of object-level contention in parallel transactions and can help isolate performance issues that are not otherwise visible.
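
The comparison described above, retries relative to total messages processed, can be sketched as a simple ratio check. This is a minimal illustration, not a MATRIXX-provided tool: the source does not name the metric that supplies the total message count, so `messages_processed` is a hypothetical input, and the `max_ratio` threshold of 1% is an assumed value to tune for your deployment.

```python
def retry_ratio(retry_count: int, messages_processed: int) -> float:
    """Return retries as a fraction of messages processed (0.0 when idle)."""
    if messages_processed <= 0:
        return 0.0
    return retry_count / messages_processed


def retry_count_is_high(retry_count: int, messages_processed: int,
                        max_ratio: float = 0.01) -> bool:
    """Flag a retry count that is high relative to total messages.

    retry_count is a chrgServerRetryCount reading. messages_processed and
    max_ratio are assumptions for illustration; neither comes from the
    metric reference itself.
    """
    return retry_ratio(retry_count, messages_processed) > max_ratio
```

For example, 50 retries against 100,000 messages stays well under the assumed threshold, while 5,000 retries against the same total would be flagged.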

Cluster Manager

The Grafana dashboard should indicate that neither the node nor the cluster is down. Cluster Manager Metrics Recommended for Grafana describes recommended metrics.

Table 2. Cluster Manager Metrics Recommended for Grafana
Metric	Type	Labels	Description
sysClusterMemberNodeState	Gauge	sysClusterMemberNodeId: Node ID	A service state ID of a member node.
sysPeerClusterClusterState	Gauge	sysPeerClusterEngineId: Engine ID sysPeerClusterClusterId: Cluster ID	The cluster state of the peer cluster.
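
A state check over sysClusterMemberNodeState readings might look like the following sketch. The numeric state IDs are not listed in this section, so the set of healthy states is passed in by the caller, and the values used in the usage note are placeholders rather than real MATRIXX state IDs.

```python
from typing import Dict, List, Set


def unhealthy_nodes(node_states: Dict[str, int],
                    healthy_states: Set[int]) -> List[str]:
    """Return the node IDs whose state is not in the healthy set.

    node_states maps a sysClusterMemberNodeId label value to the
    sysClusterMemberNodeState gauge value. healthy_states must be supplied
    by the caller; this sketch does not know the real state IDs.
    """
    return sorted(node_id for node_id, state in node_states.items()
                  if state not in healthy_states)
```

With placeholder values, `unhealthy_nodes({"1": 7, "2": 7, "3": 2}, {7})` reports node `"3"`. The same function can be applied to sysPeerClusterClusterState readings keyed by cluster ID.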

Diameter Gateway

Use your Grafana dashboard to verify that Diameter Gateway average and maximum current response times are less than an expected value. The failed result code count should not be high. Diameter Gateway Metrics Recommended for Grafana describes recommended metrics.

Table 3. Diameter Gateway Metrics Recommended for Grafana
Metric	Type	Labels	Description
diamPduStatsCurrentRespTimeAvg	Gauge	diamPduStatsApplicationId: application ID diamPduStatsCmdCode: command code	The average response time, in microseconds, measured during the last completed monitoring interval for a given application ID/command code. The response time measured is the internal system response time; it does not include network latencies.
diamPduStatsCurrentRespTimeMax	Gauge	diamPduStatsApplicationId: application ID diamPduStatsCmdCode: command code	The maximum response time, in microseconds, measured during the last completed monitoring interval for a given application ID/command code. The response time measured is the internal system response time; it does not include network latencies.
diamResultCodeStatsResultCodeOut	Gauge	diamResultCodeStatsApplicationId: application ID diamResultCodeStatsCmdCode: command code diamResultCodeStatsResultCode: result code	The total number of responses sent for a given application ID/command code/result code since the start of the Diameter server.
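
The two checks described above can be sketched as follows. This example assumes the standard Diameter result code classes from RFC 6733, where 3xxx, 4xxx, and 5xxx codes indicate protocol errors, transient failures, and permanent failures. The response time limits are caller-supplied assumptions, since the expected values depend on the deployment.

```python
from typing import Dict, Tuple

# Key: (application ID, command code, result code), matching the labels of
# diamResultCodeStatsResultCodeOut. Value: number of responses sent.
ResultCodeCounts = Dict[Tuple[int, int, int], int]


def failed_response_fraction(counts: ResultCodeCounts) -> float:
    """Return the fraction of responses whose result code is 3000 or above."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    failed = sum(n for (_app, _cmd, code), n in counts.items() if code >= 3000)
    return failed / total


def response_times_ok(avg_us: float, max_us: float,
                      avg_limit_us: float, max_limit_us: float) -> bool:
    """Check average and maximum response times (microseconds) against limits.

    avg_us and max_us are diamPduStatsCurrentRespTimeAvg and
    diamPduStatsCurrentRespTimeMax readings. The limits are assumptions.
    """
    return avg_us < avg_limit_us and max_us < max_limit_us
```

For instance, 990 responses with result code 2001 and 10 with result code 5012 for the same application ID and command code give a failed fraction of 1%.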

Event Loader

Event Loader Metrics Recommended for Grafana describes the Event Loader metrics recommended for Grafana.

Table 4. Event Loader Metrics Recommended for Grafana
Metric	Type	Description
eventLoaderMaxReplayGlobalTxnCounter	Gauge	The highest global transaction counter replayed on the publishing server. The maximum replayed global transaction counter (GTC) on the publishing cluster in the last engine should be moving and closely in sync with txnGlobalTxnCounterStatsCurrentCount.
eventLoaderLastLoadedGlobalTxnCounter	Gauge	The last global transaction counter the Event Loader successfully loaded to the Event Repository. The last loaded GTC should be moving and closely in sync with eventLoaderMaxReplayGlobalTxnCounter.
eventLoaderMefBacklogCount	Gauge	The MEF backlog count for Event Loader. The backlog count should be zero during normal processing.
eventLoaderMefRejectedCount	Gauge	The number of MEFs rejected. The rejected count should be zero during normal processing.
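
The expectations in the table above can be combined into one health check, sketched below. The `max_gtc_lag` default of 1000 is an assumed tolerance for "closely in sync", not a documented value, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EventLoaderSample:
    max_replay_gtc: int   # eventLoaderMaxReplayGlobalTxnCounter
    last_loaded_gtc: int  # eventLoaderLastLoadedGlobalTxnCounter
    mef_backlog: int      # eventLoaderMefBacklogCount
    mef_rejected: int     # eventLoaderMefRejectedCount


def event_loader_warnings(sample: EventLoaderSample,
                          max_gtc_lag: int = 1000) -> List[str]:
    """Return one warning per expectation that the sample violates."""
    warnings: List[str] = []
    lag = sample.max_replay_gtc - sample.last_loaded_gtc
    if lag > max_gtc_lag:  # max_gtc_lag is an assumed tolerance
        warnings.append(f"last loaded GTC trails max replayed GTC by {lag}")
    if sample.mef_backlog != 0:
        warnings.append(f"MEF backlog count is {sample.mef_backlog}")
    if sample.mef_rejected != 0:
        warnings.append(f"MEF rejected count is {sample.mef_rejected}")
    return warnings
```

An empty list means the sample meets all three expectations. Checking that the counters are moving requires more than one sample and is not covered by this sketch.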

The MongoDB Cluster should be monitored. If any node is down, the CleanUp task does not run, and MEFs accumulate in the processed directory.

The LoaderStatsCollection in the Event Repository retains the number of records processed and loaded (and total bytes), accumulated processing times for different steps, and the number of MEFs loaded.

Event Stream Server

Event Stream Server metrics recommended for Grafana are described in Event Stream Server Metrics Recommended for Grafana.

Table 5. Event Stream Server Metrics Recommended for Grafana
Metric	Type	Labels	Description
streamMaxReplayGlobalTxnCounter	Gauge		The highest global transaction counter replayed on the publishing server. The maximum replayed GTC on the publishing cluster in the last engine should be moving and closely in sync with txnGlobalTxnCounterStatsCurrentCount.
streamSefLastWrittenGlobalTxnCounter	Gauge		The last global transaction counter the event writer wrote to SEF. The last written GTC should be moving and closely in sync with txnGlobalTxnCounterStatsCurrentCount.
streamMefLastPublishedGlobalTxnCounter	Gauge		The last global transaction counter the MEFv2 generator published to target. The last published GTC should be moving and closely in sync with streamSefLastWrittenGlobalTxnCounter.
streamConnectionStatsCursor	Gauge	streamConnectionStatsIndex: stream session ID filter: The filter string for this stream	The stream cursor. The stream cursor should be moving and closely in sync with streamMaxReplayGlobalTxnCounter.
streamConnectionStatsEventCount	Gauge	streamConnectionStatsIndex: stream session ID filter: The filter string for this stream	The number of events sent. Event count information might help with troubleshooting performance issues.
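
Several descriptions above ask that a counter be "moving and closely in sync" with another counter. One way to interpret that over scraped samples is sketched below. Both function names and the `max_lag` tolerance are assumptions, and how many samples to compare is a tuning choice.

```python
from typing import Sequence


def is_moving(samples: Sequence[int]) -> bool:
    """True when the counter advanced between the first and last sample."""
    return len(samples) >= 2 and samples[-1] > samples[0]


def in_sync(leader: int, follower: int, max_lag: int) -> bool:
    """True when the follower trails the leader by at most max_lag.

    Example pairing from the table: leader is streamSefLastWrittenGlobalTxnCounter
    and follower is streamMefLastPublishedGlobalTxnCounter. max_lag is an
    assumed tolerance.
    """
    return leader - follower <= max_lag
```

A counter that reads `[100, 150, 220]` over three scrapes is moving, while `[100, 100, 100]` is not. Note that a counter legitimately stops moving when there is no traffic, so a stalled reading is a prompt to investigate rather than proof of a fault.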

Transaction Server

Transaction Server metrics recommended for Grafana are described in Transaction Server Metrics Recommended for Grafana.

Table 6. Transaction Server Metrics Recommended for Grafana
Metric	Type	Labels	Description
txnDatabaseObjectCount	Gauge	txnDatabaseObjectPoolId: pool ID txnDatabaseObjectName: object name	The number of database objects of a given type. The object count should be closely synchronized between all engine pods. The checkpointing pod might be slightly behind because the replay is file-based, not real-time. It might also be behind when the checkpoint creation is running. The object count should be stable or show reasonable growth. Any sudden object count increase might indicate an issue in the system, for example in a schedule database, alert database, activity database, or event database.
txnDatabaseSegmentObjectMemorySurplusKb	Gauge	txnDatabaseObjectPoolId: pool ID txnDatabaseSegmentPoolId: segment ID	The amount of memory pre-allocated for existing objects. A high object memory surplus wastes database memory. The average/initial object size can be adjusted in the configuration.
txnDatabaseTimerIndexNumOfFarInserts	Gauge	txnDatabaseTimerIndexPoolId: pool ID	The number of inserts into the far-term area of the Timer index. Any operation in timer index far storage might indicate a performance issue.
txnDatabaseTimerIndexNumOfFarMoves	Gauge	txnDatabaseTimerIndexPoolId: pool ID	The number of entries moved from the far-term to near-term area of the Timer index. Any operation in timer index far storage might indicate a performance issue.
txnDatabaseTimerIndexNumOfFarRemoves	Gauge	txnDatabaseTimerIndexPoolId: pool ID	The number of removes from the far-term area of the Timer index. Any operation in timer index far storage might indicate a performance issue.
txnReplayCurrentGlobalTxnCounter	Gauge	txnReplayEngineId: engine ID txnReplayClusterId: cluster ID	The current global transaction counter. On the active engine processing cluster, verify that the publishing cluster and the standby engine are in sync.
txnReplayLastReplayGlobalTxnCounter	Gauge	txnReplayEngineId: engine ID txnReplayClusterId: cluster ID	The last replayed global transaction counter from a transaction stream replay destination. On the standby engine processing cluster, verify that the publishing cluster and the next standby engine (if applicable) are in sync.
txnCheckpointState	Gauge	txnCheckpointBladeId: blade ID	The checkpoint state. Use this metric to verify that the txnReplayLastReplayGlobalTxnCounter is moving on the checkpoint pod except during checkpoint creation.
txnBusinessCollisionCount	Gauge		The total number of business collisions for a transaction. A high business collision count might indicate a performance issue.
txnCurrentTxnCount	Gauge		The current in-progress/pending transaction count. A high current transaction count might indicate a performance issue.
txnEffectiveTxnCountPerSecond	Gauge		The effective transactions-per-second rate logged. A high effective transaction count per second might indicate a load spike.
txnStaleMessageRejectedCount	Gauge		The count of messages not processed by the Transaction Server because they were already too old when they were received. A high rejected stale message count might indicate a performance issue.
txnGlobalTxnCounterStatsCurrentCount	Gauge	txnGlobalTxnCounterStatsTaskName: task name	The current occupied entries between the low and high GTC range in this GTC sorter. The GTC sorter current count should be low during normal processing.
txnGlobalTxnCounterStatsMaxInUseSize	Gauge	txnGlobalTxnCounterStatsTaskName: task name	The maximum low-to-high GTC range used in this GTC sorter since the task started. The GTC maximum in-use size should be less than the maximum size.
txnGlobalTxnCounterStatsMaxSize	Gauge	txnGlobalTxnCounterStatsTaskName: task name	The configured maximum GTC sorter size.
txnLoggingInUseWriteBufferCount	Gauge	txnLoggingBladeId: blade ID	The current in-use write buffers count. This count should be low.
txnLoggingMaxInUseWriteBufferCount	Gauge	txnLoggingBladeId: blade ID	The maximum in-use write buffers count. This should not reach the maximum number of write buffers.
txnLoggingLastWrittenGtc	Gauge	txnLoggingBladeId: blade ID	The last global transaction counter written to the transaction log file. This value should be moving and closely in sync with txnReplayCurrentGlobalTxnCounter.
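
Two of the checks in the table above reduce to simple arithmetic: how close an in-use value is to its configured maximum, and whether an object count jumped between samples. The following sketch illustrates both. The function names are hypothetical, and the 10% growth threshold is an assumed value, since what counts as "reasonable growth" depends on the deployment.

```python
def utilization(in_use: int, maximum: int) -> float:
    """Return in_use as a fraction of maximum.

    Example pairing: txnGlobalTxnCounterStatsMaxInUseSize against
    txnGlobalTxnCounterStatsMaxSize. A result approaching 1.0 means the
    GTC sorter has little headroom left.
    """
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return in_use / maximum


def sudden_increase(previous: int, current: int,
                    max_growth: float = 0.10) -> bool:
    """Flag a txnDatabaseObjectCount jump between two samples.

    max_growth is an assumed threshold (10% per sampling interval).
    """
    if previous <= 0:
        return current > 0
    return (current - previous) / previous > max_growth
```

For example, `utilization(512, 2048)` is 0.25, and an object count that grows from 1000 to 1500 between samples is flagged, while growth from 1000 to 1050 is not. The same `utilization` function applies to txnLoggingMaxInUseWriteBufferCount if the configured number of write buffers is known.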