Engine Metrics Recommended for Grafana
The following MATRIXX Engine metrics are recommended for MATRIXX Engine Grafana dashboards.
Charging Server
The Charging Server retry count should be low compared to the total messages processed. If the retry count is high, it might impact processing performance. Charging Server Metrics Recommended for Grafana describes recommended metrics.
Metric | Type | Labels | Description |
---|---|---|---|
chrgServerRetryCount | Gauge | | The number of times the Charging Server retries transactions in the Transaction Server. This is an indication of object-level contention in parallel transactions and can help isolate performance issues that are not otherwise visible. |
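As a rough illustration of this check, the sketch below polls Prometheus for chrgServerRetryCount and warns when the retry count grows quickly. The Prometheus URL, the use of the Python requests library, and the threshold are assumptions; compare the growth against your own total message throughput to decide what "low" means for your deployment.

```python
# Minimal sketch, assuming metrics are scraped into Prometheus at PROM_URL.
# The threshold is illustrative only.
import requests

PROM_URL = "http://prometheus:9090"   # assumed Prometheus endpoint
MAX_RETRIES_PER_5M = 100              # assumed acceptable retry growth

def instant_query(expr):
    """Run a PromQL instant query and return the result series."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr})
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# chrgServerRetryCount is exposed as a gauge, so use delta() to see how much
# it grew over the last five minutes.
for series in instant_query("delta(chrgServerRetryCount[5m])"):
    growth = float(series["value"][1])
    if growth > MAX_RETRIES_PER_5M:
        print(f"High Charging Server retry growth: {growth:.0f} in 5m "
              f"({series['metric']})")
```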
Cluster Manager
Use your Grafana dashboard to verify that the node and cluster are not down. Cluster Manager Metrics Recommended for Grafana describes recommended metrics.
Metric | Type | Labels | Description |
---|---|---|---|
sysClusterMemberNodeState | Gauge | sysClusterMemberNodeId: Node ID | A service state ID of a member node. |
sysPeerClusterClusterState | Gauge | sysPeerClusterEngineId: Engine ID; sysPeerClusterClusterId: Cluster ID | The cluster state of the peer cluster. |
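A minimal "nothing is down" check over these two metrics might look like the sketch below. The Prometheus endpoint and the state value treated as healthy are assumptions: the numeric encoding of node and cluster service states is deployment-specific, so set UP_STATE to whatever your dashboards treat as up.

```python
# Minimal sketch, assuming metrics are scraped into Prometheus at PROM_URL.
# UP_STATE is an assumed healthy state ID; verify the encoding in your system.
import requests

PROM_URL = "http://prometheus:9090"   # assumed Prometheus endpoint
UP_STATE = 1                          # assumed "up" service state ID

def instant_query(expr):
    """Run a PromQL instant query and return the result series."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr})
    resp.raise_for_status()
    return resp.json()["data"]["result"]

for metric in ("sysClusterMemberNodeState", "sysPeerClusterClusterState"):
    for series in instant_query(metric):
        state = float(series["value"][1])
        if state != UP_STATE:
            print(f"{metric} = {state:.0f} for {series['metric']}")
```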
Diameter Gateway
Use your Grafana dashboard to verify that Diameter Gateway average and maximum current response times are less than an expected value. The failed result code count should not be high. Diameter Gateway Metrics Recommended for Grafana describes recommended metrics.
Metric | Type | Labels | Description |
---|---|---|---|
diamPduStatsCurrentRespTimeAvg | Gauge | diamPduStatsApplicationId: application ID; diamPduStatsCmdCode: command code | The average response time, in microseconds, measured during the last completed monitoring interval for a given application ID/command code. The response time measured is the internal system response time; it does not include network latencies. |
diamPduStatsCurrentRespTimeMax | Gauge | diamPduStatsApplicationId: application ID; diamPduStatsCmdCode: command code | The maximum response time, in microseconds, measured during the last completed monitoring interval for a given application ID/command code. The response time measured is the internal system response time; it does not include network latencies. |
diamResultCodeStatsResultCodeOut | Gauge | diamResultCodeStatsApplicationId: application ID; diamResultCodeStatsCmdCode: command code; diamResultCodeStatsResultCode: result code | Total number of responses of a given application ID/command code/result code sent since the start of the Diameter server. |
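The response-time and result-code checks described above could be scripted as in the sketch below. The Prometheus endpoint, the latency budget, and the treatment of non-2xxx result codes as failures are assumptions (2001 and 2002 are the standard Diameter success codes); response times are in microseconds per the table.

```python
# Minimal sketch, assuming metrics are scraped into Prometheus at PROM_URL.
import requests

PROM_URL = "http://prometheus:9090"   # assumed Prometheus endpoint
MAX_AVG_RESP_US = 5000                # assumed latency budget in microseconds

def instant_query(expr):
    """Run a PromQL instant query and return the result series."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr})
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Flag application ID/command code pairs whose average response time exceeds
# the budget; repeat with diamPduStatsCurrentRespTimeMax for a stricter check.
for series in instant_query(f"diamPduStatsCurrentRespTimeAvg > {MAX_AVG_RESP_US}"):
    print("Slow Diameter command:", series["metric"])

# Sum responses whose result code is outside the 2xxx success range
# (an assumed definition of "failed"; tune the regex to your needs).
failed = instant_query(
    'sum(diamResultCodeStatsResultCodeOut{diamResultCodeStatsResultCode!~"2.*"})')
count = float(failed[0]["value"][1]) if failed else 0.0
print(f"Non-2xxx responses since Diameter server start: {count:.0f}")
```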
Event Loader
Event Loader Metrics Recommended for Grafana describes the Event Loader metrics recommended for Grafana.
Metric | Type | Labels | Description |
---|---|---|---|
eventLoaderMaxReplayGlobalTxnCounter | Gauge | | The highest global transaction counter replayed on the publishing server. The maximum replayed global transaction counter (GTC) on the publishing cluster in the last engine should be moving and closely in sync with txnGlobalTxnCounterStatsCurrentCount. |
eventLoaderLastLoadedGlobalTxnCounter | Gauge | | The last global transaction counter the Event Loader successfully loaded to the Event Repository. The last loaded GTC should be moving and closely in sync with eventLoaderMaxReplayGlobalTxnCounter. |
eventLoaderMefBacklogCount | Gauge | | The MEF backlog count for Event Loader. The backlog count should be zero during normal processing. |
eventLoaderMefRejectedCount | Gauge | | The number of MEFs rejected. The rejected count should be zero during normal processing. |
The MongoDB Cluster should be monitored. If any node is down, the CleanUp task does not run, and MEFs accumulate in the processed directory.
The LoaderStatsCollection in the Event Repository retains the number of records processed and loaded (and total bytes), accumulated processing times for different steps, and the number of MEFs loaded.
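The lag and backlog expectations above translate into a simple check like the sketch below. The Prometheus endpoint and the acceptable GTC lag are assumptions; the metric names come from the table.

```python
# Minimal sketch, assuming metrics are scraped into Prometheus at PROM_URL.
import requests

PROM_URL = "http://prometheus:9090"   # assumed Prometheus endpoint
MAX_GTC_LAG = 1000                    # assumed acceptable replay-to-load lag

def instant_value(expr, default=0.0):
    """Run a PromQL instant query and return the first value as a float."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else default

# The last loaded GTC should track the maximum replayed GTC closely.
lag = instant_value("eventLoaderMaxReplayGlobalTxnCounter"
                    " - eventLoaderLastLoadedGlobalTxnCounter")
if lag > MAX_GTC_LAG:
    print(f"Event Loader is {lag:.0f} GTCs behind the replay counter")

# Both the backlog and rejected counts should be zero in normal processing.
for metric in ("eventLoaderMefBacklogCount", "eventLoaderMefRejectedCount"):
    value = instant_value(metric)
    if value > 0:
        print(f"{metric} is {value:.0f}; expected 0 during normal processing")
```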
Event Stream Server
Event Stream Server metrics recommended for Grafana are described in Event Stream Server Metrics Recommended for Grafana.
Metric | Type | Labels | Description |
---|---|---|---|
streamMaxReplayGlobalTxnCounter | Gauge | | The highest global transaction counter replayed on the publishing server. The maximum replayed GTC on the publishing cluster in the last engine should be moving and closely in sync with txnGlobalTxnCounterStatsCurrentCount. |
streamSefLastWrittenGlobalTxnCounter | Gauge | | The last global transaction counter the event writer wrote to SEF. The last written GTC should be moving and closely in sync with txnGlobalTxnCounterStatsCurrentCount. |
streamMefLastPublishedGlobalTxnCounter | Gauge | | The last global transaction counter the MEFv2 generator published to the target. The last published GTC should be moving and closely in sync with streamSefLastWrittenGlobalTxnCounter. |
streamConnectionStatsCursor | Gauge | streamConnectionStatsIndex: stream session ID; filter: the filter string for this stream | The stream cursor. The stream cursor should be moving and closely in sync with streamMaxReplayGlobalTxnCounter. |
streamConnectionStatsEventCount | Gauge | streamConnectionStatsIndex: stream session ID; filter: the filter string for this stream | The number of events sent. Event count information might help with troubleshooting performance issues. |
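A per-stream cursor lag check built from these metrics might look like the sketch below. The Prometheus endpoint and the acceptable lag are assumptions.

```python
# Minimal sketch, assuming metrics are scraped into Prometheus at PROM_URL.
import requests

PROM_URL = "http://prometheus:9090"   # assumed Prometheus endpoint
MAX_CURSOR_LAG = 1000                 # assumed acceptable cursor lag in GTCs

def instant_query(expr):
    """Run a PromQL instant query and return the result series."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr})
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Highest replayed GTC on the publishing server.
replay = instant_query("max(streamMaxReplayGlobalTxnCounter)")
max_replay_gtc = float(replay[0]["value"][1]) if replay else 0.0

# Each stream cursor should follow the replay counter closely.
for series in instant_query("streamConnectionStatsCursor"):
    lag = max_replay_gtc - float(series["value"][1])
    if lag > MAX_CURSOR_LAG:
        session = series["metric"].get("streamConnectionStatsIndex", "?")
        print(f"Stream session {session} cursor lags by {lag:.0f} GTCs")
```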
Transaction Server
Transaction Server metrics recommended for Grafana are described in Transaction Server Metrics Recommended for Grafana.
Metric | Type | Labels | Description |
---|---|---|---|
txnDatabaseObjectCount | Gauge | txnDatabaseObjectPoolId: pool ID; txnDatabaseObjectName: object name | The number of database objects of a given type. The object count should be closely synchronized between all engine pods. The checkpointing pod might be slightly behind because the replay is file-based, not real-time. It might also be behind when checkpoint creation is running. The object count should be stable or show reasonable growth. Any sudden object count increase might indicate an issue in the system, for example in a schedule database, alert database, activity database, or event database. |
txnDatabaseSegmentObjectMemorySurplusKb | Gauge | txnDatabaseObjectPoolId: pool ID; txnDatabaseSegmentPoolId: segment ID | The amount of memory, in KB, pre-allocated for existing objects. A high object memory surplus wastes database memory; the average/initial object size can be adjusted in the configuration. |
txnDatabaseTimerIndexNumOfFarInserts | Gauge | txnDatabaseTimerIndexPoolId: pool ID | The number of inserts into the far-term area of the Timer index. Any operation in timer index far storage might indicate a performance issue. |
txnDatabaseTimerIndexNumOfFarMoves | Gauge | txnDatabaseTimerIndexPoolId: pool ID | The number of entries moved from the far-term to near-term area of the Timer index. Any operation in timer index far storage might indicate a performance issue. |
txnDatabaseTimerIndexNumOfFarRemoves | Gauge | txnDatabaseTimerIndexPoolId: pool ID | The number of removes from the far-term area of the Timer index. Any operation in timer index far storage might indicate a performance issue. |
txnReplayCurrentGlobalTxnCounter | Gauge | txnReplayEngineId: engine ID; txnReplayClusterId: cluster ID | The current global transaction counter. On the active engine processing cluster, verify that the publishing cluster and the standby engine are in sync. |
txnReplayLastReplayGlobalTxnCounter | Gauge | txnReplayEngineId: engine ID; txnReplayClusterId: cluster ID | The current last replayed global transaction counter from a transaction stream replay destination. On the standby engine processing cluster, verify that the publishing cluster and the next standby engine (if applicable) are in sync. |
txnCheckpointState | Gauge | txnCheckpointBladeId: blade ID | The checkpoint state. Use this metric to verify that the txnReplayLastReplayGlobalTxnCounter is moving on the checkpoint pod except during checkpoint creation. |
txnBusinessCollisionCount | Gauge | | The total number of business collisions for a transaction. A high business collision count might indicate a performance issue. |
txnCurrentTxnCount | Gauge | | The current in-progress/pending transaction count. A high current transaction count might indicate a performance issue. |
txnEffectiveTxnCountPerSecond | Gauge | | The effective transactions-per-second rate logged. A high effective transaction count per second might indicate a load spike. |
txnStaleMessageRejectedCount | Gauge | | The count of messages not processed by the Transaction Server because they were already too old when they were received. A high rejected stale message count might indicate a performance issue. |
txnGlobalTxnCounterStatsCurrentCount | Gauge | txnGlobalTxnCounterStatsTaskName: task name | The number of occupied entries between the low and high GTC range in this GTC sorter. The GTC sorter current count should be low during normal processing. |
txnGlobalTxnCounterStatsMaxInUseSize | Gauge | txnGlobalTxnCounterStatsTaskName: task name | The maximum low-to-high GTC range used in this GTC sorter since the task started. The maximum in-use size should stay below the configured maximum size (txnGlobalTxnCounterStatsMaxSize). |
txnGlobalTxnCounterStatsMaxSize | Gauge | txnGlobalTxnCounterStatsTaskName: task name | The configured maximum GTC sorter size. |
txnLoggingInUseWriteBufferCount | Gauge | txnLoggingBladeId: blade ID | The current in-use write buffers count. This count should be low. |
txnLoggingMaxInUseWriteBufferCount | Gauge | txnLoggingBladeId: blade ID | The maximum in-use write buffers count. This should not reach the maximum number of write buffers. |
txnLoggingLastWrittenGtc | Gauge | txnLoggingBladeId: blade ID | The last global transaction counter written to the transaction log file. This value should be moving and closely in sync with txnReplayCurrentGlobalTxnCounter. |
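Several of the expectations in this table can be checked with a few PromQL queries, as in the sketch below: timer-index far-area activity and GTC sorter occupancy against its configured maximum. The Prometheus endpoint and the warning ratio are assumptions.

```python
# Minimal sketch, assuming metrics are scraped into Prometheus at PROM_URL.
import requests

PROM_URL = "http://prometheus:9090"   # assumed Prometheus endpoint
SORTER_WARN_RATIO = 0.8               # assumed warning ratio of the max size

def instant_query(expr):
    """Run a PromQL instant query and return the result series."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr})
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Any activity in the far-term area of the Timer index can indicate a
# performance issue, so report every non-zero far-area counter.
for metric in ("txnDatabaseTimerIndexNumOfFarInserts",
               "txnDatabaseTimerIndexNumOfFarMoves",
               "txnDatabaseTimerIndexNumOfFarRemoves"):
    for series in instant_query(f"{metric} > 0"):
        print("Timer index far-area activity:", series["metric"])

# GTC sorter occupancy should stay well below the configured maximum size.
sorter = ("txnGlobalTxnCounterStatsCurrentCount"
          f" / txnGlobalTxnCounterStatsMaxSize > {SORTER_WARN_RATIO}")
for series in instant_query(sorter):
    print("GTC sorter occupancy is high:", series["metric"])
```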