Cluster State Transitions
All cluster state changes during runtime are trapped by the SNMP Agent, written to the statistics database, and sent as a notification to the Network Operation Center (NOC). The states enable operators to determine where in the failover process each cluster is.
State Transitions During Initial System Start Up
During initial start up operations, the cluster initializing as the ACTIVE cluster transitions through the following states before it is ready to accept incoming network traffic, including Diameter messages and API requests.
- UNKNOWN — The state of the remote peer cluster is unknown.
- START — The cluster is in the process of starting. During this state, it is waiting for a cluster quorum and other conditions to be met.
- PRE_INIT — The cluster enters the first phase of cluster database initialization/synchronization, either from a checkpoint file or from a running peer cluster.
- INIT — The cluster enters the second phase of cluster database initialization/synchronization.
- POST_INIT — The cluster enters the final phase of cluster database initialization/synchronization.
- ACTIVE — The cluster is ready for real-time processing.
The cluster initializing as the STANDBY cluster transitions through the following states:
- START — The cluster is in the process of starting.
- PRE_INIT — The cluster enters the first phase of cluster database initialization/synchronization from the peer for which it replays transactions.
- INIT — The cluster enters the second phase of cluster database initialization/synchronization.
- POST_INIT — The cluster enters the final phase of cluster database initialization/synchronization.
- STANDBY — The cluster is not actively processing network payload but is synchronizing its database in real-time from the active cluster. It is ready to become the ACTIVE cluster in case of a cluster failover or disaster recovery event.
State Transitions During Engine Failover
- INIT, ACTIVE, or ACTIVE_SYNC — The standby cluster can communicate with the active cluster.
- UNKNOWN — Connection to the active cluster has been lost and its state is unknown.
- FAILED — Connection to the peer cluster (which can be the STANDBY or the ACTIVE cluster) has been lost for longer than the configurable cluster failure detection timeout, and the peer cluster has been designated as failed.
- STANDBY — The cluster is actively synchronizing its databases in real-time by replaying transactions received from the active cluster and is ready to become ACTIVE in the case of a failover.
- ACTIVE_SYNC — The cluster has been selected as ACTIVE and is completing its real-time database synchronization from its queued replay transactions. This is a transitional state from STANDBY to ACTIVE and has a very short duration.
- ACTIVE — The databases are up-to-date and the cluster is ready to process incoming network traffic.
- START — The cluster is in the process of starting.
- PRE_INIT, INIT, POST_INIT — The cluster is started and initializes its databases from the peer cluster for which it replays transactions.
- STANDBY — The cluster is actively synchronizing its databases in real-time by replaying transactions received from the active cluster. It is ready to become the ACTIVE cluster in the case of a failover.
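
The difference between the UNKNOWN and FAILED peer states in this section comes down to how long the connection has been lost relative to the failure detection timeout. The sketch below illustrates that rule only. The function name, its parameters, and the 30-second timeout are assumptions made for the example and do not reflect the Cluster Manager's actual implementation or configuration.

```python
def peer_state_view(
    last_reported: str, seconds_without_contact: float, failure_timeout: float
) -> str:
    """Return how a cluster would view its peer, per the descriptions above.

    last_reported           -- last state received from the peer, e.g. "ACTIVE"
    seconds_without_contact -- 0 while the connection is up (hypothetical input)
    failure_timeout         -- failure detection timeout in seconds (hypothetical)
    """
    if seconds_without_contact <= 0:
        # The peer is reachable, so its reported state is trusted.
        return last_reported
    if seconds_without_contact > failure_timeout:
        # Contact has been lost for longer than the timeout.
        return "FAILED"
    # Contact is lost, but not yet for long enough to declare a failure.
    return "UNKNOWN"


for elapsed in (0, 5, 31):
    print(elapsed, peer_state_view("ACTIVE", elapsed, failure_timeout=30))
```

With a 30-second timeout, the loop reports ACTIVE at 0 seconds, UNKNOWN at 5 seconds, and FAILED at 31 seconds.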
State Transitions During an Online Upgrade
STANDBY Cluster State Transitions During an Online Upgrade
During initial start up operations after a STANDBY cluster has been upgraded, the cluster transitions through the following states to enter a STANDBY state, indicating it is ready to accept and replay transactions from a peer cluster.
- START — The cluster is in the process of starting. During this state, it is waiting for a cluster quorum and other conditions to be met.
- PRE_INIT — The cluster is waiting for the latest pricing version to be loaded into the newly upgraded engine.
- INIT — The cluster has been started and is initializing its databases from a running peer cluster.
- POST_INIT — The cluster has been upgraded to a new software version and is undergoing the schema upgrade and conversion transformations needed to handle any caveats before entering the STANDBY state.
- STANDBY — The cluster is actively synchronizing its databases in real-time by replaying transactions received from its transaction peer cluster. It is ready to become the ACTIVE cluster in the case of a failover or switchover operation.
ACTIVE Cluster State Transitions During an Online Upgrade
After upgrading a STANDBY cluster and setting the ACTIVE cluster to OFFLINE with cluster_mgr_cli.py -t target setto offline_cluster, the ACTIVE cluster transitions through the following states.
- OFFLINE — The cluster is not stopped, but after the process of replaying transactions is completed, ports on the Traffic Routing Agent load-balancing instances (TRA-PROCs) are blocked, so that the cluster is isolated from the rest of the topology.
If you then run stop_cluster.py on the offline cluster, the following transitions occur:
- EXIT — Pods in the cluster are exiting so the cluster can be stopped without causing quorum issues.
- STOP — The cluster is stopping.
- FINAL — The cluster is stopped.
- UNKNOWN — The state of the cluster is unknown.
State Transition Diagram
A cluster can exit a state due to a fatal cluster failure or an administrative shut down operation. When this occurs, the cluster transitions from its current state (for example, ACTIVE) to the EXIT state and then to the FINAL state. The FINAL state is reached when the finite state machine (FSM) terminates.
Peer Cluster State Transition Diagram
If a cluster is stopped in an orderly manner, either manually or by the Process Controller, the peer cluster does not enter the FAILED state.
Transition Triggers
Transitions in the diagrams are labeled with the symbolic names of the events that trigger them (and sometimes also with a condition that must be satisfied for the event to trigger a transition). The Cluster State Transition Triggers table contains more information about the events that trigger the state changes.
Event Symbolic Name | Description |
---|---|
control_node_ready | A cluster control node has been selected and formation of the cluster begins. This event is generated internally by the Cluster Manager. |
cluster_quorum_ready | The cluster contains enough active nodes in order to have a quorum, which is the minimum number of nodes that can form a fully functional cluster. This event is generated internally by the Cluster Manager. |
txn_sync_in_progress | Transaction database synchronization is in progress. This applies only to states which represent cluster database initialization or synchronization activity (PRE_INIT, INIT, POST_INIT, STANDBY_SYNC, ACTIVE_SYNC). This event is sent as a message by the Transaction Server to the Cluster Manager. |
txn_sync_completed | Transaction database synchronization has been completed. This applies only to states which represent cluster database initialization or synchronization activity (PRE_INIT, INIT, POST_INIT, STANDBY_SYNC, ACTIVE_SYNC). This event is sent as a message by the Transaction Server to the Cluster Manager. |
failover | The remote HA peer that was active has been declared as FAILED. This standby cluster transitions to the ACTIVE state. This event is generated internally by the Cluster Manager based on the peer cluster condition/state. |
switchover | An administrator issued an ACTIVE cluster switchover command. |
admin_shutdown | An administrator issued a cluster shutdown command. |
fatal_cluster_failure | A fatal local cluster failure has been detected by the Cluster Manager. This event is generated internally by the Cluster Manager based on its monitoring of local cluster failures. |
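
To make the relationship between events and states concrete, the sketch below models a small part of an event-driven state machine. It covers only transitions stated in the text of this section: a failover moves a STANDBY cluster to ACTIVE_SYNC, a completed synchronization moves it on to ACTIVE, and a shutdown or fatal failure moves a running cluster to EXIT. The dictionary layout and the next_state function are hypothetical, and the real FSM has additional transitions and conditions that are not reproduced here.

```python
# Events that take a running cluster out of its current state.
SHUTDOWN_EVENTS = {"admin_shutdown", "fatal_cluster_failure"}

# States in which the cluster is already shutting down or stopped.
TERMINATING_STATES = {"EXIT", "STOP", "FINAL"}

# (current state, event) -> next state. Partial and illustrative only.
TRANSITIONS = {
    ("STANDBY", "failover"): "ACTIVE_SYNC",
    ("ACTIVE_SYNC", "txn_sync_completed"): "ACTIVE",
}


def next_state(state: str, event: str) -> str:
    """Return the state that follows `state` when `event` occurs.

    Pairs that are not modeled leave the state unchanged, which is a
    simplification made for this sketch.
    """
    if state in TERMINATING_STATES:
        return state
    if event in SHUTDOWN_EVENTS:
        return "EXIT"
    return TRANSITIONS.get((state, event), state)


state = "STANDBY"
for event in ("failover", "txn_sync_in_progress", "txn_sync_completed"):
    state = next_state(state, event)
    print(event, "->", state)
```

Keeping the transitions in a lookup table rather than in nested conditionals makes it easy to compare the modeled behavior against the diagrams one arrow at a time.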