Monitor Transaction Replay Progress on the Standby Engine

Run the print_blade_stats.py script with the -R option to display the number of outstanding checkpoint log files during local transaction replay and the number of outstanding replay batches during remote replay when completing an InitDatabase request.

About this task

When the standby cluster is running, the number of outstanding batches to replay should be less than or equal to the number of processing servers, and the number of outstanding checkpoint files to process should be zero. After a failover operation or engine startup, there can be one or more outstanding checkpoints to process. The checkpoint value returns to zero after the cluster restores the databases from a checkpoint. Note that when a standby cluster is configured but not running, the outstanding batch value is zero.

If you are monitoring replay statistics during runtime operations, perform this task on either server in the active processing cluster. If you are monitoring replay statistics when you are first starting the standby engine, perform this task on the server in the active processing cluster with the lowest server ID. This is the server that receives the InitDatabase request from the standby cluster.

Procedure

In a terminal, enter the following command to view the transaction replay statistics, where bladeId is the ID of the processing server in the active cluster.

print_blade_stats.py -b bladeId -R
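
For example, to display the replay statistics for the processing server with blade ID 1 (an illustrative value):

print_blade_stats.py -b 1 -R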

Results

For an example of the output, see the discussion about print_blade_stats.py.

For an engine to change from STANDBY to ACTIVE, the checkpoint replay file count must be 0. If you try to activate the engine before the count reaches 0, the operation is rejected.
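
If you prefer to watch the counters rather than rerun the command by hand, the following is a minimal sketch of a polling loop. It assumes print_blade_stats.py is on the PATH of the server where you run it; the blade ID and polling interval are illustrative only, and the output must still be read against the format described in the print_blade_stats.py discussion.

#!/usr/bin/env python3
# Sketch: periodically display transaction replay statistics for one blade.
# Assumes print_blade_stats.py is on the PATH; the blade ID and interval are examples.
import subprocess
import time

BLADE_ID = "1"          # example only; use the ID of a processing server in the active cluster
INTERVAL_SECONDS = 30   # example polling interval

while True:
    # Run the documented command and show its output unchanged.
    subprocess.run(["print_blade_stats.py", "-b", BLADE_ID, "-R"], check=False)
    time.sleep(INTERVAL_SECONDS)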

Important: If transaction replay has failed on a STANDBY cluster, the failed transactions are logged to the ${MTX_SHARED_DIR}/bad directory and an error containing the string "failed to replay transaction" is written to mtx_debug.log. Monitor this directory because these transactions must be reprocessed. To reprocess failed transactions, restart the STANDBY cluster to re-sync its data with the ACTIVE cluster.
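
As a convenience, the following is a minimal sketch that checks for failed replay programmatically. It assumes the MTX_SHARED_DIR environment variable is set where it runs; the mtx_debug.log path used below is an example only, not a documented location.

#!/usr/bin/env python3
# Sketch: report failed transaction replay on a STANDBY cluster.
# Assumes MTX_SHARED_DIR is set; the mtx_debug.log path is an example, not a documented location.
import os
from pathlib import Path

bad_dir = Path(os.environ["MTX_SHARED_DIR"]) / "bad"
log_path = Path("mtx_debug.log")  # replace with the actual location of mtx_debug.log

# List any failed transaction files in the bad directory.
bad_files = [entry for entry in bad_dir.iterdir() if entry.is_file()] if bad_dir.is_dir() else []
if bad_files:
    print(f"{len(bad_files)} failed transaction file(s) found in {bad_dir}")

# Report log lines that record replay failures.
if log_path.is_file():
    with log_path.open(errors="replace") as log:
        for line in log:
            if "failed to replay transaction" in line:
                print(line.rstrip())

If either check reports failures, restart the STANDBY cluster to re-sync its data with the ACTIVE cluster.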

For more information about the MATRIXX environment variables, see the discussion about container directories and environment variables in MATRIXX Installation and Upgrade.