Bypass Affected Engines and Collect Information

The first steps when restarting all MATRIXX Engines in a sub-domain are to bypass traffic for the affected engines and assess the impact of the outage.

About this task

Most of these tasks can be performed in parallel to optimize recovery time.

Procedure

  1. Initiate a traffic bypass of the affected sub-domain and confirm it is working.
  2. If applicable, identify any traffic that is not bypassed, by node or service type (data, voice, SMS, or other).
  3. Identify the most complete dataset (checkpoint and transaction logs), usually from the most recently active engine. Use the get_latest_checkpoint.py command to determine the latest available checkpoint (see the checkpoint sketch after this procedure).
  4. Provide the dataset to the primary engine so that it is used on restart.
  5. Verify that the engine-level Traffic Routing Agents (TRA-PROCs and TRA-PUBs) are running, using print_tra_cluster_status.py (see the TRA status sketch after this procedure).
  6. Start any TRA-PROC or TRA-PUB instances not already running.
  7. Assess the impact on service and the subscriber base by type and scale, share the assessment with MATRIXX Support, and provide updates as new information becomes available.
    This step does not need to be completed before restarting the engines, but it should be started.
  8. Investigate any symptoms present in the system when the engines went down. For example, search debug logs with a command similar to the following:
    grep LM_CRITI mtx_debug.log

    Also check /var/log/messages for OS-related issues; a combined log-scan sketch follows this procedure.

  9. Collect key logs and other data (for example, mtx_debug.log, messages files, print_blade_stats.py output, atop data, transaction logs, and tcpdump captures) and share them with MATRIXX Support to enable prompt root cause analysis. A collection sketch follows this procedure.
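
Example command sketches

For step 3, the following is a minimal shell sketch for comparing checkpoints across engines. The hostnames (engine1-pub, engine2-pub) and the mtx user are placeholders, and get_latest_checkpoint.py is shown without arguments; the options it accepts can vary by MATRIXX release, so check the command help for your installation.

    # Run on each engine (hostnames and user are placeholders) and compare
    # the reported checkpoint times; the newest checkpoint identifies the
    # most recently active engine.
    for host in engine1-pub engine2-pub; do
        echo "== $host =="
        ssh mtx@"$host" 'get_latest_checkpoint.py'
    done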
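
For steps 5 and 6, a similar sketch checks engine-level TRA status on each TRA node. Hostnames and user are again placeholders, and print_tra_cluster_status.py is shown without arguments; consult the command help on your system for the exact options and output.

    # Check TRA-PROC and TRA-PUB status on each TRA node (placeholders).
    for host in tra1 tra2; do
        echo "== $host =="
        ssh mtx@"$host" 'print_tra_cluster_status.py'
    done
    # Start any TRA-PROC or TRA-PUB instance reported as not running,
    # using the start procedure documented for your MATRIXX release.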
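
For step 8, this sketch scans the engine debug log for critical entries and /var/log/messages for OS-level problems. The /var/log/mtx path for mtx_debug.log is an assumption; use the actual log location on your nodes.

    # Critical MATRIXX engine messages (adjust the path to where
    # mtx_debug.log lives on your nodes).
    grep LM_CRITI /var/log/mtx/mtx_debug.log

    # Common OS-level failure signatures around the outage window.
    grep -iE 'error|oom|panic|segfault' /var/log/messages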
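
For step 9, this sketch bundles key diagnostic files into a single archive for MATRIXX Support. All paths and the archive name are assumptions; include whatever checkpoint, transaction log, atop, and tcpdump locations apply to your deployment.

    # Capture current engine statistics before bundling.
    print_blade_stats.py > /tmp/blade_stats_$(hostname).txt

    # Bundle the diagnostics (paths are assumptions; adjust as needed).
    tar czf /tmp/mtx_outage_$(hostname)_$(date +%Y%m%d%H%M).tgz \
        /var/log/mtx/mtx_debug.log* \
        /var/log/messages* \
        /tmp/blade_stats_$(hostname).txt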