Cluster HA States
A cluster can change high availability (HA) states during engine startup, manual switchover, failover, and when the connection to a remote peer cluster is lost.
State | Description | SNMP ID |
---|---|---|
UNKNOWN | The state of the remote cluster is unknown. Either a connection to the remote peer was never established, or a connection was established and then lost within the failure detection timeout period (it is not yet considered failed). | 0 |
START | A cluster is in the process of starting, which involves waiting for a cluster quorum and other conditions to be met. | 1 |
PRE_INIT | A cluster is waiting for the latest pricing version to be loaded into the newly upgraded engine. This HA state is only transitioned to during a MATRIXX Engine software upgrade. | 2 |
INIT | A cluster is in the process of initializing its databases from a checkpoint file or a running cluster. The cluster transitions into an active or standby cluster. Note: When a standby cluster is in the INIT state, a pending transactions message might be written to the mtx_debug.log file. This message does not indicate a problem. It is informational and can occur when a server in the standby cluster receives a parallel balance transaction to replay but has not yet received the checkpoint transaction for the same balance set object. Because the parallel balance transaction does not have an absolute balance to which the difference can be applied, the transaction is saved as pending for a short time until the server receives the checkpoint transaction for this balance set object. During this period, the message is recorded and indicates that the standby cluster is synchronizing with the active cluster. The number of pending transactions resolves as the synchronization completes. | 3 |
POST_INIT | A cluster was upgraded to a new software version, and it is undergoing the schema upgrade and conversion transformations to handle any caveats before entering a standby state. This HA state is only transitioned to during a MATRIXX Engine software upgrade. | 4 |
STANDBY_SYNC | For a standby cluster, the servers are synchronizing their databases by replaying transactions. This state occurs during engine startup, switchover, or failover. For an active cluster, the servers are replaying transactions after an engine switchover to synchronize their databases. The STANDBY_SYNC state precedes the STANDBY state. | 5 |
STANDBY | A cluster is ready to replay transaction logs. During typical runtime operations, the cluster in the secondary engine is in a STANDBY HA state. | 6 |
ACTIVE_SYNC | The cluster was selected as the active cluster and is in the process of synchronizing its databases in real time from its queued replay transactions. This state is transitional from STANDBY to ACTIVE. If an engine in a FAILED state is detected, a STANDBY engine transitions to ACTIVE_SYNC. If the FAILED engine has a processing pod that is still able to process requests, the ACTIVE_SYNC engine never detects that all transactions have completed. In that case, the ACTIVE_SYNC engine shuts down after a configurable timeout period. The duration of the timeout period is the product of the | 7 |
ACTIVE | A cluster is actively processing incoming network traffic. During typical runtime operations, the cluster in the primary engine is in an ACTIVE HA state. | 8 |
EXIT | The servers in a cluster are exiting so the cluster can be stopped without causing quorum issues. | 10 |
STOP | A cluster is stopping. | 11 |
FINAL | A cluster is stopped. | 12 |
FAILED | A cluster had a connection to a remote peer cluster and lost the connection permanently. This condition occurs when a connection cannot be restored within the failure-detection timeout period. The peer cluster is viewed by the cluster as failed. | 13 |
NONE | A pseudo state added for the Traffic Routing Agent to identify an engine cluster. This value is used when no peer cluster is configured for a MATRIXX Engine environment. | 14 |
OFFLINE | A cluster is not stopped, but after the process of replaying transactions is completed, ports on Traffic Routing Agent load-balancing instances (TRA-PROCs) are blocked, so that the cluster is isolated from the rest of the topology. A peer cluster in a STANDBY HA state, if present, transitions to an ACTIVE HA state, as if the cluster in the OFFLINE state has been stopped. | 15 |
See the discussion about all MATRIXX SNMP statistics in MATRIXX Monitoring and Logging for a complete list of these statistics.
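When a monitoring tool reports only the numeric SNMP ID, a small lookup can translate it back to the state name. The following is a minimal sketch that encodes only the values listed in the table above; the names `ClusterHaState` and `describe_ha_state` are hypothetical and are not part of any MATRIXX API.

```python
from enum import IntEnum


class ClusterHaState(IntEnum):
    """HA states and SNMP IDs, copied from the table above (hypothetical helper)."""
    UNKNOWN = 0
    START = 1
    PRE_INIT = 2
    INIT = 3
    POST_INIT = 4
    STANDBY_SYNC = 5
    STANDBY = 6
    ACTIVE_SYNC = 7
    ACTIVE = 8
    # The table lists no state for SNMP ID 9, so none is defined here.
    EXIT = 10
    STOP = 11
    FINAL = 12
    FAILED = 13
    NONE = 14
    OFFLINE = 15


def describe_ha_state(snmp_id: int) -> str:
    """Return the state name for an SNMP ID, or a placeholder if the ID is not in the table."""
    try:
        return ClusterHaState(snmp_id).name
    except ValueError:
        return f"UNLISTED({snmp_id})"
```

For example, `describe_ha_state(6)` returns `"STANDBY"`, while an ID that does not appear in the table is reported as unlisted instead of raising an error.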
You can use the print_blade_stats.py script, located in MATRIXX Engine in the ${MTX_BIN_DIR} directory, to monitor cluster HA states. The information includes the HA state of both clusters and information about the pods in the local cluster, including server ID, role, service state, and IP address. For example:
print_blade_stats.py -C -e 1 -c 2 -b 1
----------------------------------------------------------------
blade - 1:3:1 , version - 5081
time - Mon 2018-10-08T09:03:57
----------------------------------------------------------------
Cluster Stats
-------------
Node Cluster Service Node Mgmt
Id LeaderId Role State IP Address
========================================================
1 1 checkpointing active 127.0.0.1
(On a standby processing blade the first time it is active)
Peer Cluster Stats
------------------
System Peer
Engine Cluster Cluster Schema Cluster
Id Id State Version FQ Id Cluster Up Time Cluster Active Time
==================================================================================================
1 1 active 5110 0:0 2019-04-29T23:12:57 0
(On standby processing blade which has been active)
Peer Cluster Stats
------------------
System Peer
Engine Cluster Cluster Schema Cluster
Id Id State Version FQ Id Cluster Up Time Cluster Active Time
==================================================================================================
1 1 active 5110 0:0 2019-04-29T23:12:57 2019-04-27T07:01:12
Peer Cluster Stats
------------------
System Peer
Engine Cluster Cluster Schema Cluster
Id Id State Version FQ Id
===============================================
1 1 active 5081 0:0
Processing Cluster Stats
------------------------
System Peer
Engine Cluster Cluster Schema Cluster
Id Id State Version FQ Id
===============================================
1 3 active 5081 0:0
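If you capture this output in a script, the cluster state can be pulled out of the text with simple line parsing. The following is a rough sketch, not a supported interface: it assumes the layout shown above, in which each "Peer Cluster Stats" or "Processing Cluster Stats" section has a ruler line of `=` characters followed by one whitespace-separated data row whose first three fields are engine ID, cluster ID, and state. The function name `parse_cluster_states` is hypothetical, and the output format may differ between releases.

```python
# Section titles whose data row starts with: engine ID, cluster ID, state.
WANTED_SECTIONS = {"Peer Cluster Stats", "Processing Cluster Stats"}


def parse_cluster_states(output: str) -> list:
    """Return one dict per data row found in the wanted sections of the output text."""
    rows = []
    section = None        # title of the section currently being read
    expect_row = False    # True when the next non-empty line should be a data row
    for raw in output.splitlines():
        line = raw.strip()
        if line.endswith("Cluster Stats"):
            section = line
            expect_row = False
        elif line and set(line) == {"="}:
            # A ruler of '=' characters precedes the data row.
            expect_row = section in WANTED_SECTIONS
        elif expect_row and line:
            fields = line.split()
            if len(fields) >= 3 and fields[0].isdigit() and fields[1].isdigit():
                rows.append({
                    "section": section,
                    "engine_id": int(fields[0]),
                    "cluster_id": int(fields[1]),
                    "state": fields[2],
                })
            expect_row = False
    return rows


sample = """Peer Cluster Stats
------------------
System Peer
Engine Cluster Cluster Schema Cluster
Id Id State Version FQ Id
===============================================
1 1 active 5081 0:0
Processing Cluster Stats
------------------------
System Peer
Engine Cluster Cluster Schema Cluster
Id Id State Version FQ Id
===============================================
1 3 active 5081 0:0
"""

for row in parse_cluster_states(sample):
    print(row["section"], row["engine_id"], row["cluster_id"], row["state"])
```

The parser deliberately ignores the local "Cluster Stats" section, whose row has a different column order (node ID, leader ID, role, service state, IP address).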