Monitoring MATRIXX Engines
Use SNMP to monitor a MATRIXX Digital Commerce environment to confirm that the MATRIXX Engines are functioning properly in an active-standby HA configuration and that, during an engine failover, the switch-over operation works as designed. MATRIXX Digital Commerce also provides statistics for other components that indicate whether those components are running correctly and that help you diagnose problems.
Using the MATRIXX Automated Statistics
This topic describes how and where to gather statistics for specific MATRIXX Digital Commerce components. These are in addition to the automatic system processing statistics gathered using the capture_diagstats_tofile.py script. For details, see the discussion about collecting MATRIXX diagnostic statistics in Architecture Overview.
Using SNMP
The statistics collected for cluster-level HA counters are defined in the MATRIXX MIB files and are saved to the Statistics database. If the SNMP agent finds abnormal values during runtime, it sets a trap that notifies your third-party Network Operation Center (NOC) and also writes the information to the MATRIXX system log.
MATRIXX provides a Cluster Manager command-line client, cluster_mgr_cli.py, that operators can run remotely to check the HA state of a cluster in a geo-redundant environment. The cluster_mgr_cli.py client is located in the ${MTX_BIN_DIR} directory and requires access to MATRIXX Engine to run.
For details about the MATRIXX Digital Commerce environment variables, see the discussion about installation directories, navigation shortcuts, and environment variables in Installation and Configuration.
You use the capture_statistics_tofile.py, print_blade_stats.py, and print_snmp_stats.py scripts, and the net-snmp-utils snmpwalk utility, to gather statistics from the SNMP agent.
Collecting Statistics from MATRIXX Engine
MATRIXX provides scripts for monitoring the MATRIXX Engine. All scripts must be run as user mtx and can run on a single blade or across all blades in a cluster. Table 1 shows the statistics you can collect for a given component and the script or command to use. Sometimes, you can use more than one script or command.
For more information about these scripts, see the discussion about Analytical and Informational Scripts.
Engine Statistic Category | Example Statistics | Script to Use |
---|---|---|
Engine blade servers | | analyze_process.py |
| A server (node) usage level, expressed as a whole percentage. For details, see the discussion about System Monitor in Architecture Overview. | Use print_blade_stats.py --system_monitor to get a snapshot of processing server usage. |
Engine blade status | Blade runtime state: started or stopped. | check_blade.py |
Engine | Engine statistics (in the order displayed). | print_blade_stats.py |
HA Clusters | Peer states and ongoing cluster HA state changes, from the time the script first runs until it is stopped. Information includes the time at which each HA state change occurred. | monitor_stats.py |
Cluster runtime status | Started or stopped. | check_cluster.py |
Engine runtime status | Started or stopped. | check_engine.py |
Engine topology | Software topology information, such as sites, domains, load balancers, shared storage, MATRIXX Engines, clusters, and blades. This information is retrieved from the mtx_config.xml file, not from the current, running configuration. | print_topology.py |
Engine performance | Performance statistics, such as CPU, network, processor, and kernel statistics. | start_collecting_stats.py, stop_collecting_stats.py |
While gathering statistics, the SNMP agent might write warning messages similar to the following to the log:
LM_WARN 16830|16840 2014-02-14 19:06:06.254953 [snmp_agent_1:1:1:1(4510.21715)] StatsManager::read: entry at storage ID 13:1:1:5 is in unexpected state. initial start revision: 1
LM_WARN 16830|16840 2014-02-14 19:06:06.254987 [snmp_agent_1:1:1:1(4510.21715)] StatsMonitorTask::svc: failed to read stats container from database
This error can be ignored; the SNMP agent continues attempting to gather the statistics.

Collecting Statistics from Other Components
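Because these warnings are benign, a log-scraping tool can safely filter them out before alerting. The following is a minimal sketch, assuming the message formats shown in the sample lines above; the function name and pattern list are illustrative, not part of the MATRIXX tooling:

```python
import re

# Patterns for the ignorable SNMP-agent warnings shown in the sample log above.
BENIGN_PATTERNS = [
    re.compile(r"StatsManager::read: entry at storage ID \S+ is in unexpected state"),
    re.compile(r"StatsMonitorTask::svc: failed to read stats container from database"),
]

def is_benign_snmp_warning(line: str) -> bool:
    """Return True if the log line is one of the ignorable SNMP-agent warnings."""
    return line.startswith("LM_WARN") and any(p.search(line) for p in BENIGN_PATTERNS)
```

A monitoring pipeline could drop lines for which this returns True and escalate everything else at LM_ERROR level or above.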
MATRIXX also provides tools to collect SNMP statistics from the non-MATRIXX Engine components. Table 2 shows the statistics you can collect and the tool to use. For more information about each script, including its syntax and which user must run it, see the discussion about Analytical and Informational Scripts. See your net-snmp-utils documentation for information about the snmpwalk command.
Component | Example Statistics | Tool to Use |
---|---|---|
Network Enabler | Statistics for payload, UDT, XUDT, and XUDTS processing, and SCCP statistics for available point codes (PCs) and SSNs. Run this script on the Network Enabler, not the MATRIXX Engine. | print_snmp_stats.py |
Notification Server | Statistics for the number of notifications, the number of notifications in a specific time period, notifications scheduled but not delivered, the number of alarms, and latency problems. | snmpwalk command |
Payment Gateway | Statistics on the number of successful and unsuccessful payment registrations, payment requests, rejected requests, and latency problems. | snmpwalk command |
Gateway Proxy | Statistics on the number of incoming requests, failed requests, successfully processed requests, rejected requests, and latency problems. | print_blade_stats.py and the snmpwalk command |
Route Cache Controller | Statistics for both the subscriber MDB and session MDB. Run this script on the Route Cache server, not the MATRIXX Engine. | print_snmp_stats.py |
MATRIXX Peer Manager | Debug name, server address, and local peer ID. For each peer: debug name, peer ID, address, state, time when connected, and time when disconnected (if available). | print_snmp_stats.py, print_blade_stats.py |
Route Cache Agent | Debug name, server address, version, MPM name, and diagnostic counters. Run this script on the Route Cache server, not the MATRIXX Engine. | print_snmp_stats.py, print_blade_stats.py |
RS Gateway | Statistics on the number of total, incoming, rejected, valid, invalid, and failed requests, and latency and response-time information. | snmpwalk command |
Service Statistics | Processes, errors, and memory and CPU usage. | print_snmp_stats.py |
System Statistics | Monitoring intervals, processing errors, and memory pool information. | print_snmp_stats.py |
Traffic Routing Agent | Statistics for communication. Run this script on the TRA server, not the MATRIXX Engine. Note: PDU counters measuring TRA statistics are updated for application-level virtual server instances only, such as Diameter and MDC. When the TRA operates in a mode that does not route that kind of traffic, statistics for those virtual servers are reported as zeros. | print_snmp_stats.py |
Diameter | Downstream and upstream PDUs, and downstream route lookups. | print_snmp_stats.py |
Domain Pools | Pool server names, monitors and monitor ports, and balance methods. | print_snmp_stats.py |
Domain Pool Nodes | Pool names, IDs, and addresses. | print_snmp_stats.py |
Virtual IPs | Configuration information. | print_snmp_stats.py |
Virtual Servers | Configuration information. | print_snmp_stats.py |
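For the components monitored with snmpwalk, the command prints one `OID = TYPE: value` line per object, which is easy to post-process. The following is a minimal parsing sketch; the OIDs in the sample are generic net-snmp examples, not MATRIXX MIB objects:

```python
def parse_snmpwalk_output(text: str) -> dict:
    """Parse `OID = TYPE: value` lines as printed by net-snmp's snmpwalk."""
    results = {}
    for line in text.splitlines():
        if " = " not in line:
            continue
        oid, _, rhs = line.partition(" = ")
        vtype, sep, value = rhs.partition(": ")
        if sep:
            results[oid.strip()] = (vtype.strip(), value.strip())
        else:
            # Some values (e.g. empty strings) have no TYPE prefix.
            results[oid.strip()] = (None, rhs.strip())
    return results

# Generic sample output, for illustration only.
sample = """\
SNMPv2-MIB::sysUpTime.0 = Timeticks: (1234567) 3:25:45.67
IF-MIB::ifInOctets.1 = Counter32: 987654
"""
stats = parse_snmpwalk_output(sample)
```

In practice you would feed this the captured stdout of an snmpwalk run against the component's SNMP agent and then compare the parsed counters against your alerting thresholds.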
Viewing MATRIXX Logs
The Cluster Manager on each blade logs messages related to peer cluster states and the Transaction Server on each blade logs messages related to processing readiness. The log level for these messages is INFO.
Information to Retrieve | Script to Use |
---|---|
Critical errors written to core files. | analyze_core_files.py |
Engine, cluster, and blade processing information and error messages. | check_error_logs_on_blade.py |
Error messages for a specified MATRIXX process, for example, the Charging Server. | split_mtx_debug_log.py |
If the ACTIVE cluster becomes unavailable or shuts down, messages similar to the following are written to the log.
LM_ERROR 17479|17483 2013-05-15 19:50:08.943950 [cluster_manager_2:1:1:1(4510.21715)] FsmHaPeerClusterStateStandbyActive::handleClockTick: peer cluster HA state FAILED @10.10.15.15:4800
LM_INFO 17479|17483 2013-05-15 19:50:08.944013 [cluster_manager_2:1:1:1(4510.21715)] FsmHaPeerClusterStateBase::onExit: exited state STANDBY_ACTIVE
LM_INFO 17479|17483 2013-05-15 19:50:08.944075 [cluster_manager_2:1:1:1(4510.21715)] FsmHaPeerClusterStateBase::onEntry: entered state ACTIVE_UNKNOWN
LM_INFO 17479|17483 2013-05-15 19:50:08.944096 [cluster_manager_2:1:1:1(4510.21715)] LocalClusterStatus::setPeerClusterHaStates: new cluster HA state=ACTIVE, previous=STANDBY
LM_INFO 17479|17483 2013-05-15 19:50:08.944126 [cluster_manager_2:1:1:1(4510.21715)] LocalClusterStatus::setPeerClusterHaStates: new peer cluster HA state=UNKNOWN, previous=ACTIVE
LM_INFO 17479|17483 2013-05-15 19:50:08.944181 [cluster_manager_2:1:1:1(4510.21715)] FsmHaPeerClusterStateBase::setPeerClusterHaStates: set peer cluster HA states: this=ACTIVE, peer=UNKNOWN @ ACTIVE_UNKNOWN
LM_ERROR 17479|17483 2013-05-15 19:50:08.944282 [cluster_manager_2:1:1:1(4510.21715)] FsmHaPeerClusterStateBase::generatePeerDisconnectedSnmpTrap: disconnected from cluster @10.10.15.15:4800 @ ACTIVE_UNKNOWN
LM_INFO 17338|17387 2013-05-15 19:50:08.945376 [transaction_server_2:1:1:1(4510.21715)] TopologyManager::setPeerClusterHaState: HA state of peer cluster: UNKNOWN
LM_INFO 17338|17387 2013-05-15 19:50:08.945481 [transaction_server_2:1:1:1(4510.21715)] FsmTxnSvrStateBase::handleClusterHaStateUpdate: received a cluster HA state update event in state 'ready'
LM_INFO 17338|17387 2013-05-15 19:50:08.945524 [transaction_server_2:1:1:1(4510.21715)] FsmTxnSvrStateBase::onExit: exited state 'ready'
LM_INFO 17338|17387 2013-05-15 19:50:08.945543 [transaction_server_2:1:1:1(4510.21715)] FsmTxnSvrStateBase::onEntry: entered state 'replay syncing'
LM_INFO 17338|17379 2013-05-15 19:50:09.961740 [transaction_server_2:1:1:1(4510.21715)] FsmTxnSvrStateBase::onEntry: entered state 'ready'
In the mtx_debug.log example above, 4510 indicates the release version and 21715 indicates the Subversion revision.
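The failover excerpt above can be summarized programmatically, for example to raise an alert when the local cluster transitions from STANDBY to ACTIVE. The following is a minimal sketch, assuming only the message formats shown above; the function name is illustrative:

```python
import re

# The (release.revision) token, e.g. (4510.21715), from the log prefix above.
VERSION_RE = re.compile(r"\((\d+)\.(\d+)\)\]")
# Local-cluster transitions logged by LocalClusterStatus::setPeerClusterHaStates.
# (Deliberately does not match the separate "new peer cluster HA state" messages.)
STATE_RE = re.compile(r"new cluster HA state=(\w+), previous=(\w+)")

def summarize_failover(log_lines):
    """Return ((release, revision), [(previous_state, new_state), ...])."""
    version = None
    transitions = []
    for line in log_lines:
        if version is None:
            m = VERSION_RE.search(line)
            if m:
                version = (m.group(1), m.group(2))
        m = STATE_RE.search(line)
        if m:
            transitions.append((m.group(2), m.group(1)))
    return version, transitions
```

Feeding it the lines from the example above yields the version tuple and the STANDBY-to-ACTIVE transition that marks the failover.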
Viewing Transaction Replay Statistics
If a transaction cannot be replayed, messages similar to the following are written to the log.
LM_ERROR 58149|58434 2014-01-27 15:55:03.893604 [transaction_server_1:1:1:1(4510.21715)] TXN1-TransactionManager:transaction_manager_task::TransactionManagerTask::handleReplayResponse: failed to replay transaction with transaction ID [6:-:1:359309]|[0:0:0:0:0:0]|1306775747|0
LM_INFO 58149|58434 2014-01-27 15:55:03.940780 [transaction_server_1:1:1:1(4510.21715)] TXN1-TransactionManager:transaction_manager_task::TransactionManagerTask::handleReplayResponse: successfully wrote this transaction to /mnt/mtx/shared/txnlogs/bad/sched_db_439.log.bad
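Failed transactions are written to files named like sched_db_439.log.bad in the bad transaction log directory shown in the message above. A simple scheduled check can flag any such files so operators know reprocessing is required. This is a minimal sketch; the helper name is illustrative:

```python
from pathlib import Path

def find_bad_transaction_logs(bad_dir):
    """Return sorted paths of *.bad transaction log files under bad_dir."""
    return sorted(str(p) for p in Path(bad_dir).glob("*.bad"))
```

A cron job or monitoring agent could call this against the shared bad directory and alert when the returned list is non-empty.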
It is important to monitor the ${MTX_SHARED_DIR}/bad directory because these transactions must be reprocessed to re-synchronize the STANDBY cluster's data with that of the ACTIVE cluster. To reprocess failed transactions, the STANDBY cluster must be restarted. For more information, see the discussion about restarting a cluster. For information about viewing replay statistics, see the discussion about monitoring transaction replay progress.

Monitoring the MATRIXX Engines Remotely
MATRIXX Digital Commerce supports monitoring MATRIXX Engines with the third-party, open source Prometheus monitoring software. Prometheus works with the Grafana third-party graphing software to present a user-friendly view of the monitored data. MATRIXX Digital Commerce supplies example configuration files that work with Prometheus and Grafana out of the box; to use them, specify your engine locations in the configuration files. This release includes these default MATRIXX/Grafana dashboards:
- Alerts and Physical Memory.
- Line graphs for Diameter PDU Statistics.
- Tables for cluster state, peer cluster state, buffer pool, and shared buffer pool statistics.
- Line graphs for cluster state, peer cluster state, buffer pool, and shared buffer pool statistics.
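Specifying engine locations follows the standard Prometheus scrape configuration format. The fragment below is a hypothetical sketch only: the job name, hostname, port, and scrape interval are placeholders, not values from the MATRIXX-supplied example files, which you should consult for the actual endpoints.

```yaml
# Hypothetical Prometheus scrape job for a MATRIXX Engine.
# Replace the target with the engine metrics endpoint from the
# MATRIXX-supplied example configuration files.
scrape_configs:
  - job_name: matrixx-engine        # placeholder job name
    scrape_interval: 15s
    static_configs:
      - targets:
          - engine-s1e1.example.com:9090   # placeholder host:port
```

Once Prometheus is scraping the engines, point the supplied Grafana dashboards at the Prometheus data source to view the alerts, tables, and line graphs listed above.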