Cluster HA States

A cluster can change high availability (HA) states during engine startup, manual switchover, failover, and when the connection to a remote peer cluster is lost.

During runtime and failover operations, a cluster can be in any of the HA states listed in Table 1. The SNMP agent monitors the engine cluster and sends a notification when its state changes.
Table 1. Cluster HA States
State Description SNMP ID
UNKNOWN The state of the remote cluster is unknown. Either a connection to the remote peer was never established, or a connection was established and then lost within the failure detection timeout period (it is not yet considered failed). 0
START A cluster is in the process of starting, which involves waiting for a cluster quorum and other conditions to be met. 1
PRE_INIT A cluster is waiting for the latest pricing version to be loaded into the newly upgraded engine. A cluster enters this HA state only during a MATRIXX Engine software upgrade. 2
INIT A cluster is in the process of initializing its databases from a checkpoint file or a running cluster. The cluster transitions into an active or standby cluster.
Note: When a standby cluster is in the INIT state, a pending transactions message similar to the following might be written to the mtx_debug.log file:
LM_INFO 63530|16816 2015-06-01 19:17:28.793198  [transaction_server_2:1:1:1(4603.32039)] | 
TXN1-TransactionManager:transaction_manager_task::TransactionManagerTask::printPendingTxnSummary:
number of pending transactions per blade={blade 5=2}
This message does not indicate a problem. It is informational and can occur when a server in the standby cluster receives a parallel balance transaction to replay but has not yet received the checkpoint transaction for the same balance set object. Because the parallel balance transaction does not have an absolute balance to which the difference can be applied, the transaction is held as pending for a short time until the checkpoint transaction for that balance set object arrives. During this period, the message is recorded and indicates that the standby cluster is synchronizing with the active cluster. The pending transactions are resolved as the synchronization completes.
3
POST_INIT A cluster was upgraded to a new software version and is undergoing the schema upgrade and conversion transformations to handle any caveats before entering a standby state. A cluster enters this HA state only during a MATRIXX Engine software upgrade. 4
STANDBY_SYNC For a standby cluster, the servers are synchronizing their databases by replaying transactions. A standby cluster passes through this state during an engine startup, switchover, or failover.

For an active cluster, the servers are replaying transactions after an engine switchover to synchronize their databases.

The STANDBY_SYNC state precedes the STANDBY state.
5
STANDBY A cluster is ready to replay transaction logs.

During typical runtime operations, the cluster in the secondary engine is in a STANDBY HA state.

6
ACTIVE_SYNC The cluster was selected as the active cluster and is in the process of synchronizing its databases in real time from its queued replay transactions. This state is transitional from STANDBY to ACTIVE.

If an engine in a FAILED state is detected, a STANDBY engine transitions to ACTIVE_SYNC. If the FAILED engine has a processing pod that is still able to process requests, the ACTIVE_SYNC engine never detects that all transactions have completed. In that case, the ACTIVE_SYNC engine shuts down after a configurable timeout period. The duration of the timeout period is the product of the timer_tick_msec and cluster_active_conflict_timeout_count mtx_config.xml properties. For example, if the timer_tick_msec property is set to 50 (milliseconds), and the cluster_active_conflict_timeout_count property is set to 100, the ACTIVE_SYNC engine shuts down after 5 seconds.

7
ACTIVE A cluster is actively processing incoming network traffic.

During typical runtime operations, the cluster in the primary engine is in an ACTIVE HA state.

8
EXIT The servers in a cluster are exiting so the cluster can be stopped without causing quorum issues. 10
STOP A cluster is stopping. 11
FINAL A cluster is stopped. 12
FAILED A cluster had a connection to a remote peer cluster and lost the connection permanently. This condition occurs when a connection cannot be restored within the failure detection timeout period. The cluster then considers the peer cluster to have failed. 13
NONE A pseudo state added for the Traffic Routing Agent to identify an engine cluster. This value is used when no peer cluster is configured for a MATRIXX Engine environment. 14
OFFLINE A cluster is not stopped, but after the process of replaying transactions is completed, ports on Traffic Routing Agent load-balancing instances (TRA-PROCs) are blocked, so that the cluster is isolated from the rest of the topology. A peer cluster in a STANDBY HA state, if present, transitions to an ACTIVE HA state, as if the cluster in the OFFLINE state has been stopped. 15
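
The ACTIVE_SYNC shutdown timeout described in Table 1 is the product of two mtx_config.xml properties. The following Python sketch reproduces the arithmetic from that example. The function name is hypothetical, and the values are passed in directly rather than read from mtx_config.xml.

```python
def active_sync_timeout_seconds(timer_tick_msec: int,
                                cluster_active_conflict_timeout_count: int) -> float:
    """Return the ACTIVE_SYNC shutdown timeout in seconds.

    The timeout is the tick length (in milliseconds) multiplied by the
    number of ticks to wait, converted from milliseconds to seconds.
    """
    timeout_msec = timer_tick_msec * cluster_active_conflict_timeout_count
    return timeout_msec / 1000.0


if __name__ == "__main__":
    # Values taken from the example in the ACTIVE_SYNC description:
    # 50 ms per tick * 100 ticks = 5000 ms.
    print(active_sync_timeout_seconds(50, 100))  # prints 5.0
```

Keeping the result in seconds makes it easy to compare against the timeouts of whatever monitoring tool is watching the engine.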

For a complete list of MATRIXX SNMP statistics, see the discussion about SNMP statistics in MATRIXX Monitoring and Logging.
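
When handling SNMP notifications, it can help to translate the numeric SNMP ID back to the state name from Table 1. The following Python sketch shows one way to do that. The class and function names are hypothetical and are not part of any MATRIXX API. The value 9 does not appear in Table 1, so it is left unmapped here.

```python
from enum import IntEnum


class ClusterHaState(IntEnum):
    """Cluster HA states and their SNMP IDs, as listed in Table 1."""
    UNKNOWN = 0
    START = 1
    PRE_INIT = 2
    INIT = 3
    POST_INIT = 4
    STANDBY_SYNC = 5
    STANDBY = 6
    ACTIVE_SYNC = 7
    ACTIVE = 8
    # 9 is not listed in Table 1, so it is deliberately not defined here.
    EXIT = 10
    STOP = 11
    FINAL = 12
    FAILED = 13
    NONE = 14
    OFFLINE = 15


def describe_state(snmp_id: int) -> str:
    """Return the HA state name for an SNMP ID, or a marker for unlisted IDs."""
    try:
        return ClusterHaState(snmp_id).name
    except ValueError:
        return f"UNRECOGNIZED({snmp_id})"


if __name__ == "__main__":
    print(describe_state(8))   # prints ACTIVE
    print(describe_state(9))   # prints UNRECOGNIZED(9)
```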

You can use the print_blade_stats.py script, located in the ${MTX_BIN_DIR} directory of MATRIXX Engine, to monitor cluster HA states. The output includes the HA state of both clusters and information about the pods in the local cluster, including server ID, role, service state, and IP address. For example:

print_blade_stats.py -C -e 1 -c 2 -b 1
----------------------------------------------------------------
blade - 1:3:1 , version - 5081

time - Mon 2018-10-08T09:03:57
----------------------------------------------------------------


Cluster Stats
-------------
Node   Cluster        Service     Node  Mgmt      
  Id  LeaderId           Role    State  IP Address
========================================================
   1         1  checkpointing   active  127.0.0.1      

(On a standby processing blade the first time it is active)
Peer Cluster Stats
------------------
System Peer
Engine Cluster Cluster Schema    Cluster
    Id      Id   State Version     FQ Id   Cluster Up Time   Cluster Active Time
================================================================================================== 
     1       1 active    5110        0:0  2019-04-29T23:12:57                     0

(On standby processing blade which has been active)
Peer Cluster Stats
------------------
System Peer
Engine Cluster Cluster  Schema    Cluster
    Id      Id   State Version      FQ Id   Cluster Up Time     Cluster Active Time
==================================================================================================
     1       1  active    5110        0:0  2019-04-29T23:12:57    2019-04-27T07:01:12

Peer Cluster Stats
------------------
                                System     Peer
Engine  Cluster       Cluster   Schema  Cluster
    Id       Id         State  Version    FQ Id
===============================================
     1        1        active     5081      0:0


Processing Cluster Stats
------------------------
                                System     Peer
Engine  Cluster       Cluster   Schema  Cluster
    Id       Id         State  Version    FQ Id
===============================================
     1        3        active     5081      0:0
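
If you capture this output in a script, the cluster state can be pulled out of a section such as Peer Cluster Stats with a small text parser. The following Python sketch assumes the layout shown in the example above: a section title, column headers, a rule of "=" characters, and then a data row whose first three fields are the engine ID, cluster ID, and cluster state. The function name is hypothetical, and the real output might differ between releases.

```python
def parse_cluster_state(output, section="Peer Cluster Stats"):
    """Return (engine_id, cluster_id, state) from the first data row of a
    section, or None if the section or its data row is not found."""
    in_section = False
    past_rule = False
    for line in output.splitlines():
        stripped = line.strip()
        if not in_section:
            # Look for the exact section title, for example "Peer Cluster Stats".
            in_section = stripped == section
            continue
        if not past_rule:
            # Data rows start after the "=====" rule under the column headers.
            past_rule = bool(stripped) and set(stripped) == {"="}
            continue
        fields = stripped.split()
        if len(fields) < 3:
            return None  # section ended without a usable data row
        return int(fields[0]), int(fields[1]), fields[2]
    return None


# Sample text copied from the example output above.
SAMPLE = """\
Peer Cluster Stats
------------------
                                System     Peer
Engine  Cluster       Cluster   Schema  Cluster
    Id       Id         State  Version    FQ Id
===============================================
     1        1        active     5081      0:0


Processing Cluster Stats
------------------------
                                System     Peer
Engine  Cluster       Cluster   Schema  Cluster
    Id       Id         State  Version    FQ Id
===============================================
     1        3        active     5081      0:0
"""

if __name__ == "__main__":
    print(parse_cluster_state(SAMPLE))
    print(parse_cluster_state(SAMPLE, "Processing Cluster Stats"))
```

Matching the section title exactly keeps "Cluster Stats", "Peer Cluster Stats", and "Processing Cluster Stats" from being confused with one another.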