Engine Metrics Recommended for Grafana

The following MATRIXX Engine metrics are recommended for MATRIXX Engine Grafana dashboards.

Charging Server

The Charging Server retry count should be low relative to the total number of messages processed. A high retry count might degrade processing performance. Table 1 describes the recommended metrics.

Table 1. Charging Server Metrics Recommended for Grafana
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| chrgServerRetryCount | Gauge | | The number of times the Charging Server retries transactions in the Transaction Server. This is an indication of object-level contention in parallel transactions and can help isolate performance issues that are not otherwise visible. |
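
The guidance that the retry count should be low compared to the total messages processed can be expressed as a ratio. The following is a minimal sketch, not part of MATRIXX: the function names, the `total_messages` input (this page does not name the metric that supplies it), and the 1% default threshold are assumptions chosen for illustration.

```python
def retry_ratio(retry_count: float, total_messages: float) -> float:
    """Return retries as a fraction of messages processed (0.0 when idle)."""
    if total_messages <= 0:
        return 0.0
    return retry_count / total_messages


def retry_ratio_is_high(retry_count: float, total_messages: float,
                        threshold: float = 0.01) -> bool:
    # The threshold is a hypothetical value; tune it for your deployment.
    return retry_ratio(retry_count, total_messages) > threshold
```

For example, 50 retries against 1000 processed messages gives a ratio of 0.05, which exceeds the illustrative threshold.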

Cluster Manager

The Grafana dashboard should show that neither the node nor the cluster is down. Table 2 describes the recommended metrics.

Table 2. Cluster Manager Metrics Recommended for Grafana
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| sysClusterMemberNodeState | Gauge | sysClusterMemberNodeId: Node ID | The service state ID of a member node. |
| sysPeerClusterClusterState | Gauge | sysPeerClusterEngineId: Engine ID; sysPeerClusterClusterId: Cluster ID | The cluster state of the peer cluster. |

Diameter Gateway

Use your Grafana dashboard to verify that the Diameter Gateway average and maximum current response times stay below their expected values and that the count of failed result codes remains low. Table 3 describes the recommended metrics.

Table 3. Diameter Gateway Metrics Recommended for Grafana
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| diamPduStatsCurrentRespTimeAvg | Gauge | diamPduStatsApplicationId: application ID; diamPduStatsCmdCode: command code | The average response time, in microseconds, measured during the last completed monitoring interval for a given application ID/command code. The response time measured is the internal system response time; it does not include network latencies. |
| diamPduStatsCurrentRespTimeMax | Gauge | diamPduStatsApplicationId: application ID; diamPduStatsCmdCode: command code | The maximum response time, in microseconds, measured during the last completed monitoring interval for a given application ID/command code. The response time measured is the internal system response time; it does not include network latencies. |
| diamResultCodeStatsResultCodeOut | Gauge | diamResultCodeStatsApplicationId: application ID; diamResultCodeStatsCmdCode: command code; diamResultCodeStatsResultCode: result code | The total number of responses sent for a given application ID/command code/result code since the Diameter server started. |
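
Because diamResultCodeStatsResultCodeOut is a running total since server start, judging whether the failed result code count is high means comparing two samples. The sketch below is one way to do that, under stated assumptions: the `failed_share` function and the dictionary layout keyed by (application ID, command code, result code) are hypothetical, and any result code outside the 2xxx success class defined by the Diameter base protocol is counted as a failure, which your deployment might define differently.

```python
from typing import Dict, Tuple

# Key: (application ID, command code, result code), mirroring the three labels.
Sample = Dict[Tuple[int, int, int], int]


def failed_share(previous: Sample, current: Sample) -> float:
    """Fraction of responses sent between two samples that were not 2xxx."""
    total = 0
    failed = 0
    for key, count in current.items():
        # Totals are cumulative, so the interval count is the difference.
        # max() avoids a negative delta if the total reset after a restart.
        delta = max(count - previous.get(key, 0), 0)
        total += delta
        result_code = key[2]
        if not 2000 <= result_code <= 2999:
            failed += delta
    return failed / total if total else 0.0
```

A dashboard panel or alert rule could apply the same difference-then-ratio logic in its own query language rather than in Python.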

Event Loader

Table 4 describes the Event Loader metrics recommended for Grafana.

Table 4. Event Loader Metrics Recommended for Grafana
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| eventLoaderMaxReplayGlobalTxnCounter | Gauge | | The highest global transaction counter replayed on the publishing server. The maximum replayed global transaction counter (GTC) on the publishing cluster in the last engine should be moving and closely in sync with txnGlobalTxnCounterStatsCurrentCount. |
| eventLoaderLastLoadedGlobalTxnCounter | Gauge | | The last global transaction counter the Event Loader successfully loaded to the Event Repository. The last loaded GTC should be moving and closely in sync with eventLoaderMaxReplayGlobalTxnCounter. |
| eventLoaderMefBacklogCount | Gauge | | The MEF backlog count for Event Loader. The backlog count should be zero during normal processing. |
| eventLoaderMefRejectedCount | Gauge | | The number of MEFs rejected. The rejected count should be zero during normal processing. |
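
The phrase "moving and closely in sync" in the table above combines two checks: the counter advances over time, and it stays within some distance of the counter it follows. A minimal sketch of both checks follows. The function names and the `max_lag` tolerance are assumptions for illustration; this page does not state an acceptable lag.

```python
from typing import Sequence


def is_moving(samples: Sequence[int]) -> bool:
    """True when the newest sample is ahead of the oldest one in the window."""
    return len(samples) >= 2 and samples[-1] > samples[0]


def in_sync(leader: int, follower: int, max_lag: int) -> bool:
    """True when the two counters differ by at most max_lag.

    abs() tolerates the follower appearing slightly ahead, which can
    happen when the two values are sampled at different instants.
    """
    return abs(leader - follower) <= max_lag
```

For instance, `leader` could be a sample of eventLoaderMaxReplayGlobalTxnCounter and `follower` a sample of eventLoaderLastLoadedGlobalTxnCounter taken at about the same time.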

Also monitor the MongoDB cluster. If any node is down, the CleanUp task does not run, and MEFs accumulate in the processed directory.

The LoaderStatsCollection in the Event Repository retains the number of records processed and loaded (and total bytes), accumulated processing times for different steps, and the number of MEFs loaded.

Event Stream Server

Table 5 describes the Event Stream Server metrics recommended for Grafana.

Table 5. Event Stream Server Metrics Recommended for Grafana
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| streamMaxReplayGlobalTxnCounter | Gauge | | The highest global transaction counter replayed on the publishing server. The maximum replayed GTC on the publishing cluster in the last engine should be moving and closely in sync with txnGlobalTxnCounterStatsCurrentCount. |
| streamSefLastWrittenGlobalTxnCounter | Gauge | | The last global transaction counter the event writer wrote to SEF. The last written GTC should be moving and closely in sync with txnGlobalTxnCounterStatsCurrentCount. |
| streamMefLastPublishedGlobalTxnCounter | Gauge | | The last global transaction counter the MEFv2 generator published to the target. The last published GTC should be moving and closely in sync with streamSefLastWrittenGlobalTxnCounter. |
| streamConnectionStatsCursor | Gauge | streamConnectionStatsIndex: stream session ID; filter: the filter string for this stream | The stream cursor. The stream cursor should be moving and closely in sync with streamMaxReplayGlobalTxnCounter. |
| streamConnectionStatsEventCount | Gauge | streamConnectionStatsIndex: stream session ID; filter: the filter string for this stream | The number of events sent. Event count information might help with troubleshooting performance issues. |
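
Because streamConnectionStatsCursor is reported per stream session, a dashboard might list only the sessions that have fallen behind streamMaxReplayGlobalTxnCounter. The sketch below illustrates that idea. The `lagging_streams` function, the use of strings for session IDs, and the `max_lag` tolerance are assumptions, not values taken from the product.

```python
from typing import Dict, List


def lagging_streams(max_replay_gtc: int, cursors: Dict[str, int],
                    max_lag: int) -> List[str]:
    """Return session IDs whose cursor trails max_replay_gtc by more than max_lag."""
    return sorted(
        session_id
        for session_id, cursor in cursors.items()
        if max_replay_gtc - cursor > max_lag
    )
```

An empty result means every stream cursor is within the chosen tolerance.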

Transaction Server

Table 6 describes the Transaction Server metrics recommended for Grafana.

Table 6. Transaction Server Metrics Recommended for Grafana
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| txnDatabaseObjectCount | Gauge | txnDatabaseObjectPoolId: pool ID; txnDatabaseObjectName: object name | The number of database objects of a given type. The object count should be closely synchronized between all engine pods. The checkpointing pod might be slightly behind because the replay is file-based, not real-time. It might also be behind while checkpoint creation is running. The object count should be stable or show reasonable growth. Any sudden object count increase might indicate an issue in the system, for example in a schedule database, alert database, activity database, or event database. |
| txnDatabaseSegmentObjectMemorySurplusKb | Gauge | txnDatabaseObjectPoolId: pool ID; txnDatabaseSegmentPoolId: segment ID | The amount of memory pre-allocated for existing objects. A high object memory surplus means that database memory is being wasted. The average/initial object size can be adjusted in the configuration. |
| txnDatabaseTimerIndexNumOfFarInserts | Gauge | txnDatabaseTimerIndexPoolId: pool ID | The number of inserts into the far-term area of the Timer index. Any operation in timer index far storage might indicate a performance issue. |
| txnDatabaseTimerIndexNumOfFarMoves | Gauge | txnDatabaseTimerIndexPoolId: pool ID | The number of entries moved from the far-term to the near-term area of the Timer index. Any operation in timer index far storage might indicate a performance issue. |
| txnDatabaseTimerIndexNumOfFarRemoves | Gauge | txnDatabaseTimerIndexPoolId: pool ID | The number of removes from the far-term area of the Timer index. Any operation in timer index far storage might indicate a performance issue. |
| txnReplayCurrentGlobalTxnCounter | Gauge | txnReplayEngineId: engine ID; txnReplayClusterId: cluster ID | The current global transaction counter. On the active engine processing cluster, verify that the publishing cluster and the standby engine are in sync. |
| txnReplayLastReplayGlobalTxnCounter | Gauge | txnReplayEngineId: engine ID; txnReplayClusterId: cluster ID | The last replayed global transaction counter from a transaction stream replay destination. On the standby engine processing cluster, verify that the publishing cluster and the next standby engine (if applicable) are in sync. |
| txnCheckpointState | Gauge | txnCheckpointBladeId: blade ID | The checkpoint state. Use this metric to verify that txnReplayLastReplayGlobalTxnCounter is moving on the checkpoint pod except during checkpoint creation. |
| txnBusinessCollisionCount | Gauge | | The total number of business collisions for a transaction. A high business collision count might indicate a performance issue. |
| txnCurrentTxnCount | Gauge | | The current in-progress/pending transaction count. A high current transaction count might indicate a performance issue. |
| txnEffectiveTxnCountPerSecond | Gauge | | The effective transactions-per-second rate logged. A high effective transaction count per second might indicate a load spike. |
| txnStaleMessageRejectedCount | Gauge | | The count of messages not processed by the Transaction Server because they were already too old when they were received. A high rejected stale message count might indicate a performance issue. |
| txnGlobalTxnCounterStatsCurrentCount | Gauge | txnGlobalTxnCounterStatsTaskName: task name | The number of currently occupied entries between the low and high GTC range in this GTC sorter. The GTC sorter current count should be low during normal processing. |
| txnGlobalTxnCounterStatsMaxInUseSize | Gauge | txnGlobalTxnCounterStatsTaskName: task name | The maximum range between the low and high GTC used in this GTC sorter since the task started. The GTC maximum in-use size should be less than the maximum size. |
| txnGlobalTxnCounterStatsMaxSize | Gauge | txnGlobalTxnCounterStatsTaskName: task name | The configured maximum GTC sorter size. |
| txnLoggingInUseWriteBufferCount | Gauge | txnLoggingBladeId: blade ID | The current in-use write buffers count. This count should be low. |
| txnLoggingMaxInUseWriteBufferCount | Gauge | txnLoggingBladeId: blade ID | The maximum in-use write buffers count. This should not reach the maximum number of write buffers. |
| txnLoggingLastWrittenGtc | Gauge | txnLoggingBladeId: blade ID | The last global transaction counter written to the transaction log file. This value should be moving and closely in sync with txnReplayCurrentGlobalTxnCounter. |
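
The txnDatabaseObjectCount guidance implies two comparisons: across engine pods at one point in time, and across time for a single pod. The following sketch shows one way to quantify each. The function names and the pod names used as dictionary keys are hypothetical, and what counts as "closely synchronized" or a "sudden" increase is left to the operator because this page gives no numeric limits.

```python
from typing import Dict


def pod_spread(counts_by_pod: Dict[str, int]) -> int:
    """Difference between the largest and smallest object count across pods."""
    if not counts_by_pod:
        return 0
    return max(counts_by_pod.values()) - min(counts_by_pod.values())


def growth_ratio(previous: int, current: int) -> float:
    """Relative change between two samples; 0.0 when there is no baseline."""
    if previous <= 0:
        return 0.0
    return (current - previous) / previous
```

A small spread is expected when the checkpointing pod is slightly behind, so a check built on `pod_spread` would likely need a tolerance rather than requiring zero.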