Monitoring MATRIXX Engines
Use SNMP to monitor a MATRIXX Digital Commerce environment to confirm that the MATRIXX Engines are functioning properly in an active-standby HA configuration and that, during an engine failover, the switch-over operation works as designed. MATRIXX Digital Commerce also provides statistics for other components that indicate whether those components are running correctly and that help you diagnose problems.
Using the MATRIXX Automated Statistics
This topic describes how and where to gather statistics for specific MATRIXX Digital Commerce components. These are in addition to the automatic system processing statistics gathered using the capture_diagstats_tofile.py script. For details, see the discussion about collecting MATRIXX diagnostic statistics in Architecture Overview.
Using SNMP
The statistics collected for cluster-level HA counters are defined in the MATRIXX MIB files and are saved to the Statistics database. If the SNMP agent finds abnormal values during runtime, it sets a trap that notifies your third-party Network Operation Center (NOC) and also writes the information to the MATRIXX system log.
MATRIXX provides a Cluster Manager command-line client, cluster_mgr_cli.py, that operators can run remotely to check the HA state of a cluster in a geo-redundant environment. The cluster_mgr_cli.py client is located in the ${MTX_BIN_DIR} directory and requires access to MATRIXX Engine to run.
For details about the MATRIXX Digital Commerce environment variables, see the discussion about installation directories, navigation shortcuts, and environment variables in Installation and Configuration.
You use the capture_statistics_tofile.py, print_blade_stats.py, and print_snmp_stats.py scripts, and the net-snmp-utils snmpwalk utility, to gather statistics from the SNMP agent.
Collecting Statistics from MATRIXX Engine
MATRIXX provides scripts for monitoring the MATRIXX Engine. All scripts must be run as user mtx and can run on a single blade or across all blades in a cluster. Table 1 shows the statistics you can collect for a given component and the script or command to use. Sometimes, you can use more than one script or command.
For more information about these scripts, see the discussion about Analytical and Informational Scripts.
Engine Statistic Category | Example Statistics | Script to Use |
---|---|---|
Engine blade servers | | analyze_process.py |
| A server (node) usage level, expressed as a whole percentage. For details, see the discussion about System Monitor in Architecture Overview. | Use print_blade_stats.py --system_monitor to get a snapshot of processing server usage. |
Engine blade status | Blade runtime state: started or stopped. | check_blade.py |
Engine | Engine statistics (in the order displayed). | print_blade_stats.py |
HA Clusters | Peer states and ongoing cluster HA state changes, from the time the script first runs until it is stopped. Information includes the time at which each HA state change occurred. | monitor_stats.py |
Cluster runtime status | Started or stopped. | check_cluster.py |
Engine runtime status | Started or stopped. | check_engine.py |
Engine topology | Software topology information, such as sites, domains, load balancers, shared storage, MATRIXX Engines, clusters, and blades. This information is retrieved from the mtx_config.xml file, not from the current, running configuration. | print_topology.py |
Engine performance | Performance statistics, such as CPU, network, processor, and kernel statistics. | start_collecting_stats.py, stop_collecting_stats.py |
While gathering statistics, the SNMP agent might write warning messages similar to the following to the log:
LM_WARN 16830|16840 2014-02-14 19:06:06.254953 [snmp_agent_1:1:1:1(4510.21715)] StatsManager::read: entry at storage ID 13:1:1:5 is in unexpected state. initial start revision: 1
LM_WARN 16830|16840 2014-02-14 19:06:06.254987 [snmp_agent_1:1:1:1(4510.21715)] StatsMonitorTask::svc: failed to read stats container from database
This error can be ignored; the SNMP agent continues attempting to gather the statistics.

Collecting Statistics from Other Components
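Because these warnings are benign, a log-scraping tool can safely filter them out before alerting. The following is a minimal sketch, assuming the message formats shown in the sample lines above; the function name and pattern list are illustrative, not part of the MATRIXX tooling:

```python
import re

# Patterns for the ignorable SNMP-agent warnings shown in the sample log above.
BENIGN_PATTERNS = [
    re.compile(r"StatsManager::read: entry at storage ID \S+ is in unexpected state"),
    re.compile(r"StatsMonitorTask::svc: failed to read stats container from database"),
]

def is_benign_snmp_warning(line: str) -> bool:
    """Return True if the log line is one of the ignorable SNMP-agent warnings."""
    return line.startswith("LM_WARN") and any(p.search(line) for p in BENIGN_PATTERNS)
```

A monitoring pipeline could drop lines for which this returns True and escalate everything else at LM_ERROR level or above.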
MATRIXX also provides tools to collect SNMP statistics from the non-MATRIXX Engine components. Table 2 shows the statistics you can collect and the tool to use. For more information about each script, including its syntax and which user must run it, see the discussion about Analytical and Informational Scripts. See your net-snmp-utils documentation for information about the snmpwalk command.
Component | Example Statistics | Tool to Use |
---|---|---|
Network Enabler | Statistics for payload, UDT, XUDT, and XUDTS processing, and SCCP statistics for available point codes (PCs) and SSNs. Run this script on the Network Enabler, not the MATRIXX Engine. | print_snmp_stats.py |
Notification Server | Statistics for the number of notifications, the number of notifications in a specific time period, notifications scheduled but not delivered, the number of alarms, and latency problems. | snmpwalk command |
Payment Gateway | Statistics on the number of successful and unsuccessful payment registrations, payment requests, rejected requests, and latency problems. | snmpwalk command |
Gateway Proxy | Statistics on the number of incoming requests, failed requests, successfully processed requests, rejected requests, and latency problems. | print_blade_stats.py and the snmpwalk command |
Route Cache Controller | Statistics for both the subscriber MDB and session MDB. Run this script on the Route Cache server, not the MATRIXX Engine. | print_snmp_stats.py |
MATRIXX Peer Manager | Debug name, server address, and local peer ID. For each peer: debug name, peer ID, address, state, time when connected, and time when disconnected (if available). | print_snmp_stats.py, print_blade_stats.py |
Route Cache Agent | Debug name, server address, version, MPM name, and diagnostic counters. Run this script on the Route Cache server, not the MATRIXX Engine. | print_snmp_stats.py, print_blade_stats.py |
RS Gateway | Statistics on the number of total, incoming, rejected, valid, invalid, and failed requests, and latency and response-time information. | snmpwalk command |
Service Statistics | Processes, errors, and memory and CPU usage. | print_snmp_stats.py |
System Statistics | Monitoring intervals, processing errors, and memory pool information. | print_snmp_stats.py |
Traffic Routing Agent | Statistics for communication. Run this script on the TRA server, not the MATRIXX Engine. Note: PDU counters measuring TRA statistics are updated for application-level virtual server instances only, such as Diameter and MDC. When the TRA operates in a mode that does not route that kind of traffic, statistics for those virtual servers are reported as zeros. | print_snmp_stats.py |
Diameter | Downstream and upstream PDUs, and downstream route lookups. | print_snmp_stats.py |
Domain Pools | Pool server names, monitors and monitor ports, and balance methods. | print_snmp_stats.py |
Domain Pool Nodes | Pool names, IDs, and addresses. | print_snmp_stats.py |
Virtual IPs | Configuration information. | print_snmp_stats.py |
Virtual Servers | Configuration information. | print_snmp_stats.py |
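For the components monitored with snmpwalk, the command prints one `OID = TYPE: value` line per object, which is easy to post-process. The following is a minimal parsing sketch; the OIDs in the sample are generic net-snmp examples, not MATRIXX MIB objects:

```python
def parse_snmpwalk_output(text: str) -> dict:
    """Parse `OID = TYPE: value` lines as printed by net-snmp's snmpwalk."""
    results = {}
    for line in text.splitlines():
        if " = " not in line:
            continue
        oid, _, rhs = line.partition(" = ")
        vtype, sep, value = rhs.partition(": ")
        if sep:
            results[oid.strip()] = (vtype.strip(), value.strip())
        else:
            # Some values (e.g. empty strings) have no TYPE prefix.
            results[oid.strip()] = (None, rhs.strip())
    return results

# Generic sample output, for illustration only.
sample = """\
SNMPv2-MIB::sysUpTime.0 = Timeticks: (1234567) 3:25:45.67
IF-MIB::ifInOctets.1 = Counter32: 987654
"""
stats = parse_snmpwalk_output(sample)
```

In practice you would feed this the captured stdout of an snmpwalk run against the component's SNMP agent and then compare the parsed counters against your alerting thresholds.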
Viewing MATRIXX Logs
The Cluster Manager on each blade logs messages related to peer cluster states and the Transaction Server on each blade logs messages related to processing readiness. The log level for these messages is INFO.
Information to Retrieve | Script to Use |
---|---|
Critical errors written to core files. | analyze_core_files.py |
Engine, cluster, and blade processing information and error messages. | check_error_logs_on_blade.py |
Error messages for a specified MATRIXX process, for example, the Charging Server. | split_mtx_debug_log.py |
If the ACTIVE cluster becomes unavailable or shuts down, messages similar to the following are written to the log.
LM_ERROR 17479|17483 2013-05-15 19:50:08.943950 [cluster_manager_2:1:1:1(4510.21715)] FsmHaPeerClusterStateStandbyActive::handleClockTick: peer cluster HA state FAILED @10.10.15.15:4800
LM_INFO 17479|17483 2013-05-15 19:50:08.944013 [cluster_manager_2:1:1:1(4510.21715)] FsmHaPeerClusterStateBase::onExit: exited state STANDBY_ACTIVE
LM_INFO 17479|17483 2013-05-15 19:50:08.944075 [cluster_manager_2:1:1:1(4510.21715)] FsmHaPeerClusterStateBase::onEntry: entered state ACTIVE_UNKNOWN
LM_INFO 17479|17483 2013-05-15 19:50:08.944096 [cluster_manager_2:1:1:1(4510.21715)] LocalClusterStatus::setPeerClusterHaStates: new cluster HA state=ACTIVE, previous=STANDBY
LM_INFO 17479|17483 2013-05-15 19:50:08.944126 [cluster_manager_2:1:1:1(4510.21715)] LocalClusterStatus::setPeerClusterHaStates: new peer cluster HA state=UNKNOWN, previous=ACTIVE
LM_INFO 17479|17483 2013-05-15 19:50:08.944181 [cluster_manager_2:1:1:1(4510.21715)] FsmHaPeerClusterStateBase::setPeerClusterHaStates: set peer cluster HA states: this=ACTIVE, peer=UNKNOWN @ ACTIVE_UNKNOWN
LM_ERROR 17479|17483 2013-05-15 19:50:08.944282 [cluster_manager_2:1:1:1(4510.21715)] FsmHaPeerClusterStateBase::generatePeerDisconnectedSnmpTrap: disconnected from cluster @10.10.15.15:4800 @ ACTIVE_UNKNOWN
LM_INFO 17338|17387 2013-05-15 19:50:08.945376 [transaction_server_2:1:1:1(4510.21715)] TopologyManager::setPeerClusterHaState: HA state of peer cluster: UNKNOWN
LM_INFO 17338|17387 2013-05-15 19:50:08.945481 [transaction_server_2:1:1:1(4510.21715)] FsmTxnSvrStateBase::handleClusterHaStateUpdate: received a cluster HA state update event in state 'ready'
LM_INFO 17338|17387 2013-05-15 19:50:08.945524 [transaction_server_2:1:1:1(4510.21715)] FsmTxnSvrStateBase::onExit: exited state 'ready'
LM_INFO 17338|17387 2013-05-15 19:50:08.945543 [transaction_server_2:1:1:1(4510.21715)] FsmTxnSvrStateBase::onEntry: entered state 'replay syncing'
LM_INFO 17338|17379 2013-05-15 19:50:09.961740 [transaction_server_2:1:1:1(4510.21715)] FsmTxnSvrStateBase::onEntry: entered state 'ready'
In the mtx_debug.log example above, 4510 indicates the release version and 21715 indicates the Subversion revision.
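The failover excerpt above can be summarized programmatically, for example to raise an alert when the local cluster transitions from STANDBY to ACTIVE. The following is a minimal sketch, assuming only the message formats shown above; the function name is illustrative:

```python
import re

# The (release.revision) token, e.g. (4510.21715), from the log prefix above.
VERSION_RE = re.compile(r"\((\d+)\.(\d+)\)\]")
# Local-cluster transitions logged by LocalClusterStatus::setPeerClusterHaStates.
# (Deliberately does not match the separate "new peer cluster HA state" messages.)
STATE_RE = re.compile(r"new cluster HA state=(\w+), previous=(\w+)")

def summarize_failover(log_lines):
    """Return ((release, revision), [(previous_state, new_state), ...])."""
    version = None
    transitions = []
    for line in log_lines:
        if version is None:
            m = VERSION_RE.search(line)
            if m:
                version = (m.group(1), m.group(2))
        m = STATE_RE.search(line)
        if m:
            transitions.append((m.group(2), m.group(1)))
    return version, transitions
```

Feeding it the lines from the example above yields the version tuple and the STANDBY-to-ACTIVE transition that marks the failover.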
Viewing Transaction Replay Statistics
If a transaction cannot be replayed, messages similar to the following are written to the log.
LM_ERROR 58149|58434 2014-01-27 15:55:03.893604 [transaction_server_1:1:1:1(4510.21715)] TXN1-TransactionManager:transaction_manager_task::TransactionManagerTask::handleReplayResponse: failed to replay transaction with transaction ID [6:-:1:359309]|[0:0:0:0:0:0]|1306775747|0
LM_INFO 58149|58434 2014-01-27 15:55:03.940780 [transaction_server_1:1:1:1(4510.21715)] TXN1-TransactionManager:transaction_manager_task::TransactionManagerTask::handleReplayResponse: successfully wrote this transaction to /mnt/mtx/shared/txnlogs/bad/sched_db_439.log.bad
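Failed transactions are written to files named like sched_db_439.log.bad in the bad transaction log directory shown in the message above. A simple scheduled check can flag any such files so operators know reprocessing is required. This is a minimal sketch; the helper name is illustrative:

```python
from pathlib import Path

def find_bad_transaction_logs(bad_dir):
    """Return sorted paths of *.bad transaction log files under bad_dir."""
    return sorted(str(p) for p in Path(bad_dir).glob("*.bad"))
```

A cron job or monitoring agent could call this against the shared bad directory and alert when the returned list is non-empty.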
It is important to monitor the ${MTX_SHARED_DIR}/bad directory because these transactions must be reprocessed to re-synchronize the STANDBY cluster's data with that of the ACTIVE cluster. To reprocess failed transactions, the STANDBY cluster must be restarted. For more information, see the discussion about restarting a cluster. For information about viewing replay statistics, see the discussion about monitoring transaction replay progress.

Monitoring the MATRIXX Engines Remotely
MATRIXX Digital Commerce supports monitoring MATRIXX Engines with the third-party, open source Prometheus monitoring software. Prometheus works with the Grafana third-party graphing software to present a user-friendly view of the monitored data. MATRIXX Digital Commerce supplies example configuration files that work with Prometheus and Grafana out of the box; to use them, specify your engine locations in the configuration files. This release includes these default MATRIXX/Grafana dashboards:
- Alerts and Physical Memory.
- Line graphs for Diameter PDU Statistics.
- Tables for cluster state, peer cluster state, buffer pool, and shared buffer pool statistics.
- Line graphs for cluster state, peer cluster state, buffer pool, and shared buffer pool statistics.
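Specifying engine locations follows the standard Prometheus scrape configuration format. The fragment below is a hypothetical sketch only: the job name, hostname, port, and scrape interval are placeholders, not values from the MATRIXX-supplied example files, which you should consult for the actual endpoints.

```yaml
# Hypothetical Prometheus scrape job for a MATRIXX Engine.
# Replace the target with the engine metrics endpoint from the
# MATRIXX-supplied example configuration files.
scrape_configs:
  - job_name: matrixx-engine        # placeholder job name
    scrape_interval: 15s
    static_configs:
      - targets:
          - engine-s1e1.example.com:9090   # placeholder host:port
```

Once Prometheus is scraping the engines, point the supplied Grafana dashboards at the Prometheus data source to view the alerts, tables, and line graphs listed above.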