Bypass Affected Engines and Collect Information

The first steps when restarting all MATRIXX Engines in a sub-domain are to bypass traffic for the affected engines and assess the impact of the outage.

About this task

Most of these tasks can be performed in parallel to optimize recovery time.

Procedure

  1. Initiate a traffic bypass of the affected sub-domain and confirm it is working.
  2. If applicable, identify any traffic that is not bypassed, by node or service type (data, voice, SMS, or other).
  3. Identify the most complete dataset (checkpoint and transaction logs), usually from the most recently active engine. Use the get_latest_checkpoint.py command to determine the latest available checkpoint (see the checkpoint sketch after this procedure).
  4. Provide the dataset to the primary engine so that it is used on restart.
  5. Verify that the engine-level Traffic Routing Agents (TRA-PROCs and TRA-PUBs) are running, using print_tra_cluster_status.py (see the TRA status sketch after this procedure).
  6. Start any TRA-PROC or TRA-PUB instances not already running.
  7. Assess the impact on service and the subscriber base by type and scale, share the assessment with MATRIXX Support, and provide updates as new information becomes available.
    This step does not need to be completed before restarting the engines, but it should be started.
  8. Investigate any symptoms present in the system when the engines went down. For example, search debug logs with a command similar to the following:
    grep LM_CRITI mtx_debug.log

    Also check /var/log/messages for OS-related issues; a combined log-scan sketch follows this procedure.

  9. Collect key logs and other data (for example, mtx_debug.log, messages files, print_blade_stats.py output, atop data, transaction logs, and tcpdump captures) and share them with MATRIXX Support to enable prompt root cause analysis. A collection sketch follows this procedure.
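
Example command sketches

For step 3, the following is a minimal shell sketch for comparing checkpoints across engines. The hostnames (engine1-pub, engine2-pub) and the mtx user are placeholders, and get_latest_checkpoint.py is shown without arguments; the options it accepts can vary by MATRIXX release, so check the command help for your installation.

    # Run on each engine (hostnames and user are placeholders) and compare
    # the reported checkpoint times; the newest checkpoint identifies the
    # most recently active engine.
    for host in engine1-pub engine2-pub; do
        echo "== $host =="
        ssh mtx@"$host" 'get_latest_checkpoint.py'
    done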
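
For steps 5 and 6, a similar sketch checks engine-level TRA status on each TRA node. Hostnames and user are again placeholders, and print_tra_cluster_status.py is shown without arguments; consult the command help on your system for the exact options and output.

    # Check TRA-PROC and TRA-PUB status on each TRA node (placeholders).
    for host in tra1 tra2; do
        echo "== $host =="
        ssh mtx@"$host" 'print_tra_cluster_status.py'
    done
    # Start any TRA-PROC or TRA-PUB instance reported as not running,
    # using the start procedure documented for your MATRIXX release.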
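
For step 8, this sketch scans the engine debug log for critical entries and /var/log/messages for OS-level problems. The /var/log/mtx path for mtx_debug.log is an assumption; use the actual log location on your nodes.

    # Critical MATRIXX engine messages (adjust the path to where
    # mtx_debug.log lives on your nodes).
    grep LM_CRITI /var/log/mtx/mtx_debug.log

    # Common OS-level failure signatures around the outage window.
    grep -iE 'error|oom|panic|segfault' /var/log/messages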
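
For step 9, this sketch bundles key diagnostic files into a single archive for MATRIXX Support. All paths and the archive name are assumptions; include whatever checkpoint, transaction log, atop, and tcpdump locations apply to your deployment.

    # Capture current engine statistics before bundling.
    print_blade_stats.py > /tmp/blade_stats_$(hostname).txt

    # Bundle the diagnostics (paths are assumptions; adjust as needed).
    tar czf /tmp/mtx_outage_$(hostname)_$(date +%Y%m%d%H%M).tgz \
        /var/log/mtx/mtx_debug.log* \
        /var/log/messages* \
        /tmp/blade_stats_$(hostname).txt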