print_blade_stats.py

The print_blade_stats.py script displays statistics for MATRIXX Engine, in-memory databases, cluster and server states, system memory, notifications, Task Manager, Diameter, MDC Gateway, CCF, checkpoints, system monitor, and others. Checkpoint statistics include the server ID, the last or in-progress checkpoint state, checkpoint start and end times, the type of checkpoint (ad hoc, fast restart), the related Global Transaction Counter (GTC), and the checkpoint name and path. For a description of all SNMP statistics, see the discussion about all MATRIXX SNMP statistics in MATRIXX Monitoring and Logging.

How to Use this Script

Use this script to collect debugging information for specific MATRIXX components that you think might have an issue. MATRIXX provides an automated mechanism that runs the capture_diagstats_tofile.py script on a configurable schedule to capture the same statistics that print_blade_stats.py does. This automated mechanism uses fewer system resources and is configured to run more efficiently, and the output is better suited to coordinating with other diagnostic tools.

Note: Running print_blade_stats.py using a cron job is inefficient. MATRIXX Support recommends that you use the output from the capture_diagstats_tofile.py script instead.
Note: In bare metal environments, MATRIXX components and scripts, such as the processing server or print_blade_stats.py, run on a server, sometimes called a blade. To avoid confusion, some scripts and code samples that refer to blades have not changed in cloud native environments. When possible, MATRIXX documentation uses cloud native terminology, for example, when referring to processing pods.

Syntax

print_blade_stats.py [ -h | -a | -A | -B | -C | -E | -d | -D | -F | -G | -H | -I | -J | -K | -l | -L | -M | -N | -O | -P | --peer_manager | -Q | -r seconds | -R | --rca | --route_cache_agent | -S | --stream | --system_monitor | -T | -U | -v | -V | -W | -X | -Y | -Z ]

An asterisk (*) character next to a pricing database indicates that the database is active.

Options

The print_blade_stats.py script has the following options in addition to the general server command line options.

When run without any of the following options, the print_blade_stats.py script prints all statistics for the server. It can be run with a subset of the following options to print specific statistics.

For descriptions of the server command line options, see the discussion about command line options for server scripts in MATRIXX Administration.
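
As a minimal sketch of selecting a subset of statistics, the following Python helper assembles an argument list from the option flags documented below. The helper name build_stats_command and the assumption that the script can be found by name are hypothetical. The helper only builds the command and does not run it.

```python
def build_stats_command(flags, repeat_seconds=None, script="print_blade_stats.py"):
    """Return an argv list for print_blade_stats.py (hypothetical helper).

    flags: option strings documented for the script, for example "-D".
    repeat_seconds: if given, adds "-r seconds" so the output is reprinted.
    """
    argv = [script]
    argv.extend(flags)
    if repeat_seconds is not None:
        argv.extend(["-r", str(repeat_seconds)])
    return argv


# Diameter and queue statistics, reprinted every 10 seconds.
command = build_stats_command(["-D", "-Q"], repeat_seconds=10)
print(" ".join(command))
```

On a host where the script is installed, the resulting list could be passed to subprocess.run.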

-h, --help
Print help information for this script.
-a
Print AbsoluteTimer service statistics for various system services and tasks. Statistics for each system task include the number of times AbsoluteTimer passed, the last delay time in microseconds and the related timestamp, and the maximum delay time in microseconds and the related timestamp. These statistics can help you monitor processing threads and whether system tasks run on time or if there might be a pattern of system latency.
-A, --map_call_out
Print Mobile Application Part (MAP) statistics for MAP-ATI and MAP-SRI call-out requests. The statistics include the number of requests made and successful responses returned, the number of timeouts, and the number of notifications for failed messages.
-B, --database
Print database segment, memory, object, index, timer index, and OID index statistics. If a database has compression enabled, this option also prints compression statistics for the database. The compression ratio is equal to the uncompressed size of an object divided by the compressed size. For example, a wallet object compression ratio of 1.5 means that the uncompressed wallet object would be 1.5 times the size of the compressed object.
The Expired Inserts statistic indicates the number of object IDs that were triggered but have not been removed after five minutes. Zero Time Inserts lists the objects with a trigger time of 0. Average Far Time, in microseconds, is the average trigger time for insertions that had a trigger time of longer than one hour. One hour is the main duration window of the timer index.
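
The compression ratio arithmetic described for this option can be sketched as follows. The function name and the byte sizes are hypothetical values chosen for illustration and are not taken from real script output.

```python
def compression_ratio(uncompressed_bytes, compressed_bytes):
    """Uncompressed size divided by compressed size."""
    if compressed_bytes <= 0:
        raise ValueError("compressed size must be positive")
    return uncompressed_bytes / compressed_bytes


# Hypothetical sizes: a 600-byte object stored in 400 bytes.
ratio = compression_ratio(600, 400)
print(ratio)  # 1.5

# Working backward: estimated uncompressed size from the ratio.
estimated_uncompressed = 400 * ratio
print(estimated_uncompressed)  # 600.0
```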
-C, --cluster_stats
Print cluster-level information for the local cluster and peer clusters. Local cluster information includes the node ID, service role (processing, publishing, or checkpointing), node state, and IP address. Peer and processing cluster information includes the engine ID, cluster ID, cluster state, system schema version, and fully qualified cluster ID (engine:cluster). If this script is run from the processing cluster leader, it also identifies the cluster leader ID. If this script is run from a processing server that is not the cluster leader, the statistics header is returned with a note indicating that the statistics are available only from the processing server that is the cluster leader. The cluster peer is the cluster that the local cluster is receiving transactions from and replaying. If the cluster does not have an HA peer or is not supporting another cluster (it is the active cluster), the fully qualified ID is 0:0.
The Engine Up Time and Engine Active Times are listed in dateTtime format. The engine active time refers to the lead processing server of the active MATRIXX Engine.
-d, --debug

Debug flag. If this option is specified, extra messages are printed to help with debugging this script. By default, the script does not run in debug mode.

-D, --diameter
Print Diameter SNMP PDU table statistics, Diameter Gateway error result statistics, and latency and connection statistics. Diameter Gateway error result statistics include the total requests received, total responses sent, average response time, and maximum response time for each Diameter application and command-code combination.

Other information provided includes:

  • Malformed Requests — The total number of malformed requests, for example, receiving a non-Diameter packet.
  • Permanent Failures — The total number of permanent failures, for example, any 5xxx Result-Code (per RFC-6733). This total is not incremented.
  • Protocol Errors — The total number of protocol errors, for example, any 3xxx Result-Code (per RFC-6733). This total is not incremented.
  • Transient Failures — The total number of transient failures, for example, any 4xxx Result-Code (per RFC-6733).
  • Transport Down — The total number of transport down errors. This total is not incremented.
  • Unknown Types — The total number of unknown types errors. This total is incremented when a packet is received that is not mapped to a MATRIXX Data Container (MDC) in the diameter_dictionary.xml file.

Diameter Gateway latency statistics are recorded for latency buckets (which are time segments), maximum message latency per connection, and Diameter Gateway-related tasks. For each task, the statistics include the total and average latencies. Connection statistics include the number of bytes sent and received, number of messages sent and received, and number of errors that occurred.

When printing the Diameter statistics, print_blade_stats.py uses a hard-coded dictionary to get the description for each Diameter PDU statistic. The descriptions refer to the IANA Diameter assignments for the Application IDs and Command Codes. The script uses the Application ID and the command value as a key into the dictionary. If a match is found, it is used. If a match is not found, the script looks for a match in the diameter_dictionary.xml file, where you can add Application IDs or commands. The hard-coded dictionary has the following values:

'0:257': 'common:CE',
'0:258': 'common:RA',
'0:274': 'common:AS',
'0:275': 'common:ST',
'0:280': 'common:DW',
'0:282': 'common:DP',
'1:265': 'nasreq:AA',
'3:271': 'accounting:AC',
'4:258': 'credit-control:RA',
'4:272': 'credit-control:CC',
'16777217:306': 'Sh:UD',
'16777217:307': 'Sh:PU',
'16777217:308': 'Sh:SN',
'16777217:309': 'Sh:PN',
'16777236:258': 'Rx:RA',
'16777236:265': 'Rx:AA',
'16777236:274': 'Rx:AS',
'16777236:275': 'Rx:ST',
'16777238:258': 'Gx:RA',
'16777238:272': 'Gx:CC',
'16777302:8388635': 'Sy:SL',
'16777302:8388636': 'Sy:SN',
'16777302:275': 'Sy:ST',
'33686018:430': 'private:mdc', 
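
The lookup order described for the Diameter descriptions can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: the function name describe_pdu is hypothetical, only a subset of the hard-coded values is repeated, and a plain dictionary stands in for entries read from the diameter_dictionary.xml file.

```python
# A subset of the hard-coded values listed above.
HARD_CODED = {
    "0:257": "common:CE",
    "4:272": "credit-control:CC",
    "16777238:272": "Gx:CC",
}


def describe_pdu(application_id, command_code, xml_entries):
    """Return a description for a Diameter PDU, or None if unknown.

    xml_entries is a plain dict standing in for names read from
    diameter_dictionary.xml; how the real script parses that file is
    not covered here.
    """
    key = f"{application_id}:{command_code}"
    if key in HARD_CODED:
        return HARD_CODED[key]
    return xml_entries.get(key)


# Hypothetical custom entry, for illustration only.
custom = {"99999:1": "example:XX"}
print(describe_pdu(4, 272, custom))    # found in the hard-coded dictionary
print(describe_pdu(99999, 1, custom))  # found in the stand-in XML entries
print(describe_pdu(1, 2, custom))      # no match in either place
```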
-E, --event_loader
Print Event Loader statistics. The statistics include the number of database errors that were logged after the Event Loader started, the number of MATRIXX Event Files (MEFs) in the backlog that are ready to be loaded (but are not yet loaded) into the Event Repository, the number of MEFs loaded (Mef Loaded), number of MEFs rejected (Mef Rejected), number of events loaded (Events Loaded), the latest event time from the last loaded MEF (Last Event Time), and GTC statistics, including:
  • Max Available GTC — The highest GTC available for reading and loading.
  • Last Processed GTC — The GTC of the work order that was most recently processed.
  • Last Loaded GTC — The GTC of the work order that was most recently loaded to the Event Repository.
Note: This option does not run on an active processing server.
-F, --signalling
Print Signaling Network statistics. The statistics include the signaling link name, state of the link, received rate limit and number of delivery errors that were logged, and number of messages sent and received.
-G, --charging
Print the Charging Server statistics, including average, minimum, and maximum latencies when processing messages, average number of transactions processed per second, number of duplicate messages encountered, number of transactions rejected due to collisions, and number of transactions retried. This information also includes message retry information, such as minimum, maximum, and average wait times and the message count for a given retry count.
-H, --call_start
Print callback call start statistics for the number of successful and failed callback call start attempts.
-I, --ussd_call_back
Print USSD callback statistics for the number of successful and failed callback requests.
-J, --tcap
Print TCAP (Transaction Capabilities Application Part) statistics, including the number of TCAP protocol messages sent and received, number of messages not sent due to an error, and number of messages rejected.
-K, --task_manager
Print Task Manager statistics for managing the schedule database, including notifications, recurring processing, event cleanup, and session cleanup.
Notification statistics include the number of scans since the engine started, whether the current server is the Notification Server, whether there is an active scan in progress, the number of notifications sent since the engine started, and the number of full scans since the engine started.
The other statistics include the number of cleanup scans completed for event objects, recurring processing objects, and session objects and the number of objects processed in the current scan for each of these operations. If a scan is not in progress, the statistic displays N. Also, the statistics show whether the current server is the server performing the cleanup scan.
The Scan Enabled, Scan In Process, Total Scan Objects, Total Scan Time, Average Latency in micros, Max Latency in micros, and Max Max Latency in micros are updated during the scan. The other statistics are only updated at the end of the scan.

The latency for an object is defined as the time the object is processed minus the trigger time, or minus the modified time if the trigger time is not available.

Max Latency in micros is the maximum latency for the current or most recent scan. Max Max Latency in micros is the maximum latency in any scan.
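
The latency definition and the two maximum values can be sketched as follows. The function name and all timestamps are hypothetical; the sketch only restates the arithmetic described above and does not show how Task Manager computes these values internally.

```python
def object_latency_micros(processed_time, trigger_time=None, modified_time=None):
    """Processed time minus trigger time, or minus modified time when no
    trigger time is available. All values are in microseconds."""
    reference = trigger_time if trigger_time is not None else modified_time
    if reference is None:
        raise ValueError("need a trigger time or a modified time")
    return processed_time - reference


# Hypothetical timestamps for the objects of two scans.
scan_1 = [object_latency_micros(1_000_500, trigger_time=1_000_000),
          object_latency_micros(1_002_000, modified_time=1_001_800)]
scan_2 = [object_latency_micros(2_000_120, trigger_time=2_000_000)]

max_latency = max(scan_2)               # current or most recent scan
max_max_latency = max(scan_1 + scan_2)  # any scan so far
print(max_latency, max_max_latency)     # 120 500
```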

-l, --ldap
Print LDAP (Lightweight Directory Access Protocol) Gateway statistics for the number of successful LDAP requests and responses. Statistics include the following:
  • Sent – The request sent from MATRIXX Engine to the LDAP Gateway.
  • Received – The response received by MATRIXX Engine from the LDAP Gateway.
  • Error – An error when trying to send to the LDAP Gateway.
  • Timeout – The LDAP Gateway has not responded in time.
  • Hit – The LDAP server located the appropriate record in its database.
  • Miss – The LDAP server failed to locate the appropriate record in its database.
  • Serv-Err – The LDAP Gateway received an error from the LDAP server.
  • Serv-Timeout – The LDAP Gateway timed out the LDAP server, as there was no response.
-L, --camel_gateway
Print CAMEL (Customized Applications for Mobile network Enhanced Logic) Gateway statistics for the number of charging sessions started and ended.
-M, --sms_charging
Print CAP3 (CAMEL Application Part 3) SMS statistics, such as the number of valid and invalid SMS operations, SMS messages for which charging was applied immediately, SMS messages for which a reservation was made, and rejected and failed SMS operations.
-N, --notifications
Print notification processing statistics, such as the number of unique notification messages sent, acknowledgments received, and failures due to maximum retry timeouts, address failures, and socket failures.
-O, --tsan
Print TSAN (Temporary Subscriber Access Number) statistics for the CAP1 re-origination service, such as the number of TSAN requests, timeouts, successful releases, and bad messages.
-P, --pools
Print memory pool and shared buffer pool (large buffer and huge buffer) statistics for each database.
--peer_manager
The MATRIXX Peer Manager is a networking infrastructure function for managing TCP connectivity between MATRIXX services. Each running instance reports its current peers' connectivity state along with server side information. This option prints MATRIXX Peer Manager statistics, including the debug name, server address, and local peer ID. For each peer, it prints the debug name, peer ID, address, state, time when connected, and time when disconnected.
-Q, --queues
Print queue statistics for Charging Server, Transaction Server, Diameter Gateway, and MDC Gateway. The statistics include the queue sizes, maximum reached size, number of times the queues were full or empty, and information about the number of messages read in each queue.
-r seconds, --repeat_seconds seconds

Reprint the statistics every specified number of seconds.

-R, --replay
Print the following transaction replay statistics. These statistics are only meaningful when run on the active cluster. After the active cluster is started, the Destination Cluster ID column in the output shows its own processing cluster ID and has a nonzero Checkpoint Replay File Count. After the database restore completes, this column is not displayed.
Note: The Current Replay Batch Count and Current Replay Txn Count statistics are only useful during MATRIXX Engine start-up. These statistics indicate how many files or objects must still be replayed to start the engine. When you are performing a cold restart on an active engine, these statistics list the number of files that must be replayed before the engine is available to process transactions. When you are starting a standby engine, these statistics list the number of objects that must be replayed before the engine is available to process transactions.
  • For real-time replay on a standby cluster:
    • The GTC that the server is processing.
    • The GTC that is being replayed.
  • For synchronization of a standby cluster when it starts:
    • The number of outstanding transaction batches to replay. This value must always be equal to or less than the number of processing servers if the engines are in perfect synchronization.
    • The number of outstanding transactions to replay.
    • The number of outstanding database objects that must be replayed on a standby cluster to get its databases up-to-date. Every database has a number of objects to replay. After all objects in one database are sent, a new count of another database's objects starts until all databases have completed synchronization. The new count adds to the existing object count that has not finished replay (the object count does not go to zero before replaying objects from the next database). The Database Replay Object Count is a nonzero value until the standby cluster finishes the database initialization process. The Checkpoint Replay File Count is not used when a standby cluster is starting.
    • The GTC that the server is processing.
    • The GTC that is being replayed.
  • For a cold start-up of an engine (either after a complete system failure or start-up of a standalone system), prints the number of outstanding checkpoint files and transaction log files to replay when the engine restores its databases from a checkpoint. The Checkpoint Replay File Count is nonzero only when the cluster starts and restores from a checkpoint. After the database is restored, the value is always 0. This file count depends on the number of real-time replay batches that are outstanding at that moment, including those being replayed and those queued for replay on the publishing server. The Checkpoint Replay File Count value is always zero when the second engine is starting.

To support two standby clusters, the SNMP Object ID (OID) for monitoring real-time replay statistics to a standby cluster is "txnReplayCurrentTransactionBatchCount.engineId.clusterId", where engineId and clusterId are the engine ID and cluster ID of the standby cluster to watch.
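
As a small sketch of the naming pattern quoted above, the following helper fills in the engine ID and cluster ID for each standby cluster to watch. The helper name and the example IDs are hypothetical, and the sketch only formats the name; it does not query SNMP.

```python
def replay_batch_count_oid(engine_id, cluster_id):
    """Build the object name in the form quoted above (hypothetical helper)."""
    return f"txnReplayCurrentTransactionBatchCount.{engine_id}.{cluster_id}"


# Hypothetical standby clusters, given as (engine ID, cluster ID).
standby_clusters = [(2, 1), (3, 1)]
names = [replay_batch_count_oid(e, c) for e, c in standby_clusters]
for name in names:
    print(name)
```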

--rca, --route_cache_agent
Print Route Cache Agent statistics, including the debug name, server address version, MPM name, and diagnostic counters.
-S, --services
Print MATRIXX service statistics, such as the service process ID, number of errors, memory usage, and CPU usage.
--stream
When event streaming is enabled on the engine, use this option to print the internal stream statistics of the Event Stream Server, including GTC Sorter, SEF Writer, Stream Publisher, and MEF Publisher.
  • GTC Sorter Statistics:
    • Low GTC — The transaction with the lowest GTC that the sorter is waiting to receive to process.
    • High GTC — The transaction with the highest GTC the sorter has received and processed.
    • Current Count — The number of transactions in the sorter. When the current count is zero (0), the Low GTC column has no meaning.
    • Max Count — The highest number of transactions that has ever been in the sorter at one time.
    • Max Size — The maximum number of transactions that the sorter can hold. If this number is exceeded, the sorter does not process.
  • SEF Writer Statistics:
    • Last Processed GTC — The GTC of the work order most recently processed.
    • Last Written GTC — The GTC of the work order most recently written to a Streamed Event File (SEF) or MEF.
  • Stream Publisher Statistics:
    • Max Available GTC — The highest GTC available for reading, writing, or publishing.
    • Client Connections — The total number of current client connections.
    • Buffer Count — The number of configured memory buffers. Buffers send and receive requests and responses, and they are shared across all connections.
    • Free Buffer Count — The number of remaining buffers that are still available.
  • Connection Statistics:
    • Session ID — The session ID being read when this utility was run.
    • Role — The Event Streaming Framework HA role (Leader/Non-leader).
    • Cursor — The cursor being processed.
    • Last time — The time of last event transmission by Event Stream Server to Event Streaming Framework.
    • Last Count — The number of events included in the last transmission to Event Streaming Framework.
    • Total Count — The total number of events sent by a specific stream.
    • ReqEvents — The maximum number of events requested by Event Streaming Framework.
    • ReqBytes — The maximum number of bytes requested by Event Streaming Framework.
    • Filter — A string representing the stream event filter, for example, CancelEvent or ChargeEvent.
  • MEF Publisher Statistics:
    • Max Available GTC — The highest GTC available for reading, writing, or publishing.
    • Last Processed GTC — The GTC of the work order most recently processed.
    • Last Written GTC — The GTC of the work order most recently written to a SEF or MEF.
    • Last To Be Published GTC — The GTC most recently transferred to a local directory from where the records can be published to a remote target.
    • Last Published GTC — The GTC of the work order most recently published to the destination.
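
To picture how the GTC Sorter statistics listed for this option relate to each other, the following toy sorter tracks comparable counters. It is an illustration only and not the Event Stream Server implementation: the class name, the assumption that GTCs are consecutive integers, and the choice to raise an error when the sorter is full are all hypothetical.

```python
import heapq


class GtcSorter:
    """Toy sorter that releases items in GTC order (illustration only)."""

    def __init__(self, first_gtc, max_size):
        self.low_gtc = first_gtc  # next GTC the sorter is waiting for
        self.high_gtc = None      # highest GTC received so far
        self.max_count = 0        # most items ever held at one time
        self.max_size = max_size  # capacity of the sorter
        self._heap = []

    @property
    def current_count(self):
        return len(self._heap)

    def receive(self, gtc):
        """Accept one GTC; return the GTCs that can now be released in order."""
        if len(self._heap) >= self.max_size:
            raise OverflowError("sorter is full")
        heapq.heappush(self._heap, gtc)
        if self.high_gtc is None or gtc > self.high_gtc:
            self.high_gtc = gtc
        self.max_count = max(self.max_count, len(self._heap))
        released = []
        while self._heap and self._heap[0] == self.low_gtc:
            released.append(heapq.heappop(self._heap))
            self.low_gtc += 1
        return released


sorter = GtcSorter(first_gtc=100, max_size=8)
print(sorter.receive(102))  # [] because the sorter still waits for 100
print(sorter.receive(101))  # []
print(sorter.receive(100))  # [100, 101, 102]
print(sorter.low_gtc, sorter.high_gtc, sorter.current_count, sorter.max_count)
```

In this sketch, a Current Count that stays above zero means the sorter is holding items while it waits for the GTC shown as Low GTC.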
--system_monitor
Displays the System Monitor current node usage level, node service state (ok or oos (out of service)), monitored objects, and the last transition time. For details about System Monitor, see the discussion about MATRIXX System Monitor in MATRIXX Architecture.
-T, --txn
Print transaction number and GTC statistics in different tables. These statistics must be monitored.
The general transaction statistics table includes:
  • The unique sequence IDs for business and non-business events.
    Note: The format of the unique sequence ID displayed by print_blade_stats.py is different from the format in the transaction log file. In the transaction log file, it is displayed as the unique sequence ID plus a higher event type bit set.
  • ID of the transaction protocol leader within a cluster and related statistics.
  • The number of transactions that must be retried due to parallel commit collisions.
  • The number of transactions that must be retried due to business-level collisions.
  • The number of in-progress or pending transactions.
  • The maximum number of in-progress or pending transactions.
  • The total number of transaction messages logged since the engine started.
  • The average transaction size.
  • The maximum transaction size.
  • The effective transaction count per second.
    Note: If you are using Transaction Protocol, this value is the total transactions per second processed by the Transaction Server (not the transactions per second per server).
The Current Transaction Count value stabilizes based on the transaction workload. If this count continuously increases, it indicates that there is an increasing number of outstanding and pending transactions in the system.
The GTC Sorter Stats and Stream Stats track these statistics:
  • Low GTC — Both sorters show this as the next GTC number that it expects to receive and is waiting for.
  • High GTC — Transaction Counter Stats shows this as the next GTC that it includes in its sorter window. Stream Stats shows this as the highest GTC already in its sorter.
  • Current Count — The number of GTCs in the sorter.
  • Max Count — The largest number of GTCs that the sorter has had at any time.
  • Max Size — Transaction Counter Stats shows this as the largest part of the sorter that it has ever used to hold all the GTCs (including place holders) at any time. The Stream Stats does not have this statistic.
  • Max Size — The maximum allocated sorter size.
If the number of pending transactions is too large, the server might shut down if it runs out of memory.
Note: When a pending transaction is resolved after a retry, a message like this one is written to the mtx_debug.log file:
LM_INFO 19090|19138 2015-06-30 14:39:08.008947 [transaction_server_1:1:1:1(4700.33153)] | 
TransactionCtxFactory::Release: pending transaction with transaction ID: [6:-:1:14901]|[1:1:2:1:0:1]|575|0 is resolved after 2 retries
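
As a minimal sketch, a message of this kind could be picked out of mtx_debug.log with a regular expression. The pattern is derived only from the sample above and assumes that the message appears on a single line in the log file and always uses the wording "retries". The function name parse_resolved is hypothetical.

```python
import re

# Pattern based solely on the sample message shown above.
RESOLVED = re.compile(
    r"pending transaction with transaction ID: (?P<txn_id>\S+) "
    r"is resolved after (?P<retries>\d+) retries"
)


def parse_resolved(line):
    """Return (transaction ID, retry count), or None if the line does not match."""
    match = RESOLVED.search(line)
    if match is None:
        return None
    return match.group("txn_id"), int(match.group("retries"))


sample = (
    "LM_INFO 19090|19138 2015-06-30 14:39:08.008947 "
    "[transaction_server_1:1:1:1(4700.33153)] | "
    "TransactionCtxFactory::Release: pending transaction with transaction ID: "
    "[6:-:1:14901]|[1:1:2:1:0:1]|575|0 is resolved after 2 retries"
)
print(parse_resolved(sample))
```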
The GTC statistics table includes general transaction manager tasks and transaction log file statistics.
-U, --ussd_call_out
Print Unstructured Supplementary Service Data (USSD) statistics for MAP-USSD Notify call-out requests to send USSD notifications. The statistics include the number of requests made and successful responses returned, number of timeouts, and number of notifications for failed messages.
-v, --print_version
Print the RPM version number when printing other statistics. The default value is True. To omit the version number from the output, run the script with any options and specify -v 0.
-V, --voice_charging
Print Voice Charging statistics, including the number of valid and invalid IDPs received, number of ApplyCharging messages sent and ApplyChargingReport messages received, number of calls for which quota was granted, number of free calls, number of rejected calls, and the number of disconnected, busy, or abandoned calls. Voice Charging statistics also include announcement and VXML script statistics such as the number of announcements or scripts attempted, number of announcements or scripts completed, and number of failed announcements.
-W, --mdc_gateway
Print MDC Gateway statistics, including latency statistics and connection statistics. Latency statistics are recorded for latency buckets (which are time segments), maximum message latency per connection, and MDC Gateway-related tasks, such as Transaction Manager prepare and commit tasks. For each task, the statistics include the total and average latencies. Connection statistics include the number of bytes sent and received, number of messages sent and received, and number of errors.
-X, --ussd_in
Print USSD incoming service statistics for MAP Process-UnstructuredSS-Request messages. The statistics include the number of requests made and notifications sent.
-Y, --system
Print system-level information, including the logical server ID, monitoring interval, number of processing errors, average response time to the network, total amount of system memory allocated for MATRIXX databases and work buffers, amount of memory in use, and cluster heartbeat information. Heartbeat information includes the number of heartbeats sent, received, and missed by the server.
-Z, --disk_usage
Print disk usage statistics. This option directs print_blade_stats.py to display disk usage statistics for the SSD and SAN storage that it gets from the global.storage_layout.local_directory and global.storage_layout.shared_directory entries, respectively, defined in the mtx_config.xml file. If the output result is an empty table, those entries are missing or configured incorrectly.
Note: Output from the print_blade_stats.py script always includes the MATRIXX schema version and the date and time the script was run. Where applicable, the server being reported on is also indicated in engine-cluster-server notation. That initial output is omitted from some of these examples.