Cluster State Transitions

All cluster state changes during runtime are trapped by the SNMP Agent, written to the statistics database, and sent as a notification to the Network Operation Center (NOC). The states enable operators to determine where in the failover process each cluster is.

State Transitions During Initial System Start Up

During initial start up operations, the cluster initializing as the ACTIVE cluster transitions through the following states before it is ready to accept incoming network traffic, including Diameter messages and API requests.

  1. UNKNOWN — The state of the remote peer cluster is unknown.
  2. START — The cluster is in the process of starting. During this state, it is waiting for a cluster quorum and other conditions to be met.
  3. PRE_INIT — The cluster enters the first phase of cluster database initialization/synchronization, either from a checkpoint file or from a running peer cluster.
  4. INIT — The cluster enters the second phase of cluster database initialization/synchronization.
  5. POST_INIT — The cluster enters the final phase of cluster database initialization/synchronization.
  6. ACTIVE — The cluster is ready for real-time processing.
The cluster initializing as a STANDBY cluster transitions through the following states before it is ready to replay transactions.
  1. START — The cluster is in the process of starting.
  2. PRE_INIT — The cluster enters the first phase of cluster database initialization/synchronization from the peer for which it replays transactions.
  3. INIT — The cluster enters the second phase of cluster database initialization/synchronization.
  4. POST_INIT — The cluster enters the final phase of cluster database initialization/synchronization.
  5. STANDBY — The cluster is not actively processing network payload but is synchronizing its database in real-time from the active cluster. It is ready to become the ACTIVE cluster in case of a cluster failover or disaster recovery event.
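
The two startup sequences above can be written down as ordered lists, which makes it straightforward to work out how far a starting cluster has progressed from the states it has reported. The following is a minimal sketch, not product code: the state names come from this section, while ACTIVE_STARTUP, STANDBY_STARTUP, startup_progress, and is_ready are hypothetical names introduced only for illustration.

```python
# Illustrative sketch only. State names are taken from this section;
# every other name here is hypothetical and not part of the product.
ACTIVE_STARTUP = ["UNKNOWN", "START", "PRE_INIT", "INIT", "POST_INIT", "ACTIVE"]
STANDBY_STARTUP = ["START", "PRE_INIT", "INIT", "POST_INIT", "STANDBY"]


def startup_progress(observed, expected):
    """Return (steps_completed, total_steps) for an observed list of states.

    Raises ValueError if the observed states do not follow the expected
    sequence in order, starting from its first state.
    """
    if len(observed) > len(expected) or observed != expected[:len(observed)]:
        raise ValueError(f"unexpected state sequence: {observed}")
    return len(observed), len(expected)


def is_ready(observed, expected):
    """Return True once the cluster has reached the last state of the sequence."""
    done, total = startup_progress(observed, expected)
    return done == total
```

For example, startup_progress(["START", "PRE_INIT"], STANDBY_STARTUP) returns (2, 5), meaning the standby cluster has reported two of the five states it passes through before it can replay transactions.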

State Transitions During Engine Failover

When the standby cluster cannot communicate with the active cluster, a failover operation is automatically initiated. In such cases, the standby cluster sees the following state transitions for the active cluster:
  1. INIT, ACTIVE, or ACTIVE_SYNC — The standby cluster can communicate with the active cluster.
  2. UNKNOWN — Connection to the active cluster has been lost and its state is unknown.
  3. FAILED — Connection to the peer cluster (which can be the STANDBY or the ACTIVE cluster) has been lost for longer than the configurable cluster failure detection timeout, and the peer cluster has been designated as failed.
At this point, a standby cluster transitions through the following states to become the active processing cluster:
  1. STANDBY — The cluster is actively synchronizing its databases in real-time by replaying transactions received from the active cluster and is ready to become ACTIVE in the case of a failover.
  2. ACTIVE_SYNC — The cluster has been selected as ACTIVE and is completing its real-time database synchronization from its queued replay transactions. This is a transitional state from STANDBY to ACTIVE and has a very short duration.
  3. ACTIVE — The databases are up-to-date and the cluster is ready to process incoming network traffic.
At this point, the failed cluster is in an UNKNOWN state because the connection could not be restored. When the engine is started again, its cluster transitions through the following states to become a standby cluster:
  1. START — The cluster is in the process of starting.
  2. PRE_INIT, INIT, POST_INIT — The cluster is started and initializes its databases from the peer cluster for which it replays transactions.
  3. STANDBY — The cluster is actively synchronizing its databases in real-time by replaying transactions received from the active cluster. It is ready to become the ACTIVE cluster in the case of a failover.
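
The failure detection rule described in this section (UNKNOWN as soon as the connection is lost, FAILED once the loss exceeds the configurable timeout) can be sketched as follows. This is an illustration under assumptions, not the Cluster Manager's implementation: observed_peer_state and should_fail_over are hypothetical helpers, and measuring the outage in seconds is an assumption made for the example.

```python
# Illustrative sketch only; function and parameter names are hypothetical.
def observed_peer_state(last_reported_state, seconds_since_contact, failure_timeout):
    """Return the state a monitoring cluster assigns to its peer.

    last_reported_state: last state received from the peer while connected.
    seconds_since_contact: time since the peer was last reachable (0 if connected).
    failure_timeout: the configurable cluster failure detection timeout.
    """
    if seconds_since_contact <= 0:
        return last_reported_state  # still connected, trust the reported state
    if seconds_since_contact > failure_timeout:
        return "FAILED"  # outage is longer than the detection timeout
    return "UNKNOWN"  # connection lost, but not yet designated as failed


def should_fail_over(own_state, peer_state):
    """A standby cluster takes over only once its peer is designated FAILED."""
    return own_state == "STANDBY" and peer_state == "FAILED"
```

With a hypothetical timeout of 10 seconds, a peer that has been unreachable for 5 seconds is reported as UNKNOWN and a peer unreachable for 11 seconds is reported as FAILED. Only the second case causes the standby cluster to begin its STANDBY, ACTIVE_SYNC, ACTIVE sequence.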

State Transitions During an Online Upgrade

STANDBY Cluster State Transitions During an Online Upgrade

During initial start up operations after a STANDBY cluster has been upgraded, the cluster transitions through the following states to enter a STANDBY state, indicating it is ready to accept and replay transactions from a peer cluster.

  1. START — The cluster is in the process of starting. During this state, it is waiting for a cluster quorum and other conditions to be met.
  2. PRE_INIT — The cluster is waiting for the latest pricing version to be loaded into the newly upgraded engine.
  3. INIT — The cluster has been started and is initializing its databases from a running peer cluster.
  4. POST_INIT — The cluster has been upgraded to a new software version and is performing the schema upgrade and conversion transformations needed to handle any caveats before entering the STANDBY state.
  5. STANDBY — The cluster is actively synchronizing its databases in real-time by replaying transactions received from its transaction peer cluster. It is ready to become the ACTIVE cluster in the case of a failover or switchover operation.

ACTIVE Cluster State Transitions During an Online Upgrade

After upgrading a STANDBY cluster and setting the ACTIVE cluster to OFFLINE with cluster_mgr_cli.py -t target setto offline_cluster, the ACTIVE cluster transitions through the following states.

  1. OFFLINE — The cluster is not stopped, but after the process of replaying transactions is completed, ports on the Traffic Routing Agent load-balancing instances (TRA-PROCs) are blocked, so that the cluster is isolated from the rest of the topology.

    If you then run stop_cluster.py on the offline cluster, the following transitions occur:

  2. EXIT — Pods in the cluster are exiting so the cluster can be stopped without causing quorum issues.
  3. STOP — The cluster is stopping.
  4. FINAL — The cluster is stopped.
  5. UNKNOWN — The state of the cluster is unknown.

State Transition Diagram

A cluster can exit a state because of a fatal cluster failure or an administrative shutdown operation. When this occurs, the cluster transitions from its current state (for example, ACTIVE) to the EXIT state and then to the FINAL state. The FINAL state is reached when the finite state machine (FSM) terminates.

Peer Cluster State Transition Diagram

Each cluster in an HA peer pair monitors the HA state of its remote peer in real time. Figure 1 shows the peer states that a monitoring cluster can observe when its peer shuts down in a disorderly manner (fails).

Figure 1. Peer Cluster State Transitions

If a cluster is stopped in an orderly manner, either manually or by the Process Controller, the FAILED state is not present.
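
As a rough illustration of this distinction, a monitoring script could classify a peer shutdown from the history of peer states it recorded. This is a sketch under assumptions: classify_peer_shutdown is a hypothetical helper, and it relies only on the rule stated above that FAILED is observed for a disorderly shutdown and is absent for an orderly one.

```python
# Illustrative sketch only; classify_peer_shutdown is a hypothetical helper.
def classify_peer_shutdown(observed_peer_states):
    """Classify a peer shutdown from the peer states a monitoring cluster recorded.

    Returns "disorderly" if FAILED was observed, otherwise "orderly".
    """
    return "disorderly" if "FAILED" in observed_peer_states else "orderly"
```

For the failover case described earlier, where the standby cluster sees its peer go from ACTIVE to UNKNOWN to FAILED, classify_peer_shutdown(["ACTIVE", "UNKNOWN", "FAILED"]) returns "disorderly".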

Transition Triggers

Transitions in the diagrams are labeled with the symbolic names of the events that trigger them, and sometimes also with a condition that must be satisfied for the event to trigger the transition. Table 1 describes the events that trigger the state changes.

Table 1. Cluster State Transition Triggers

Event Symbolic Name | Description
control_node_ready | A cluster control has been selected and formation of the cluster begins. This event is generated internally by the Cluster Manager.
cluster_quorum_ready | The cluster contains enough active nodes in order to have a quorum, which is the minimum number of nodes that can form a fully functional cluster. This event is generated internally by the Cluster Manager.
txn_sync_in_progress | Transaction database synchronization is in progress. This applies only to states which represent cluster database initialization or synchronization activity (PRE_INIT, INIT, POST_INIT, STANDBY_SYNC, ACTIVE_SYNC). This event is sent as a message by the Transaction Server to the Cluster Manager.
txn_sync_completed | Transaction database synchronization has been completed. This applies only to states which represent cluster database initialization or synchronization activity (PRE_INIT, INIT, POST_INIT, STANDBY_SYNC, ACTIVE_SYNC). This event is sent as a message by the Transaction Server to the Cluster Manager.
failover | The remote HA peer that was active has been declared as FAILED. This standby cluster transitions to the ACTIVE state. This event is generated internally by Cluster Manager based on peer cluster condition/state.
switchover | An administrator issued an ACTIVE cluster switchover command.
admin_shutdown | An administrator issued a cluster shutdown command.
fatal_cluster_failure | A fatal local cluster failure has been detected by Cluster Manager. This event is generated internally by Cluster Manager based on its monitoring of local cluster failures.
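
Event-driven transitions of this kind are commonly modeled as a lookup from a (state, event) pair to the next state. The sketch below is an assumption-laden illustration, not the product's state machine: it encodes only the transitions this section describes in text (failover from STANDBY through ACTIVE_SYNC to ACTIVE, and shutdown or fatal failure leading to EXIT), and TRANSITIONS, SHUTDOWN_EVENTS, and next_state are hypothetical names. The pairing of txn_sync_completed with the move from ACTIVE_SYNC to ACTIVE is inferred from the state descriptions and is also an assumption.

```python
# Illustrative sketch only. The real state machine has more transitions than
# the few listed here; all names other than states and events are hypothetical.
SHUTDOWN_EVENTS = {"admin_shutdown", "fatal_cluster_failure"}

TRANSITIONS = {
    ("STANDBY", "failover"): "ACTIVE_SYNC",
    # Assumption: completing the queued replay ends the ACTIVE_SYNC state.
    ("ACTIVE_SYNC", "txn_sync_completed"): "ACTIVE",
}


def next_state(state, event):
    """Return the state that follows `state` when `event` occurs."""
    if event in SHUTDOWN_EVENTS and state not in ("EXIT", "FINAL"):
        return "EXIT"  # FINAL follows when the FSM terminates
    # Pairs that are not listed leave the state unchanged in this sketch.
    return TRANSITIONS.get((state, event), state)
```

Starting from STANDBY and applying failover followed by txn_sync_completed yields ACTIVE_SYNC and then ACTIVE, which matches the failover sequence described earlier in this section.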