Cluster Management

The MATRIXX high availability (HA) functionality provides for continuous operation and integrity of a cluster as a whole if one or more MATRIXX Engine servers in the cluster fail. It also provides HA for active and standby clusters running in a geographically redundant or distributed environment.

The Cluster Manager HA functionality covers complex failure scenarios, including network, hardware, and software failures that partition a whole cluster into separate sub-clusters, and the switch-over from a failed ACTIVE processing cluster to the STANDBY cluster.

Intra-Cluster HA

The Cluster Manager, using the Cluster Manager Protocol (CMP), monitors the processing and publishing servers on the Parallel-MATRIXX™ protocol, so it knows when a server is added or removed. When this happens, the Cluster Manager sends a message over CMP requesting that all MATRIXX Engine servers running on the protocol update their topology map, so that transaction processing is not interrupted and data integrity is not compromised. If a server is added, the running Transaction Servers add the new server to their topology map and include the server in parallel-transaction processing.
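The membership update described above can be pictured with a small sketch. This is an illustration only, not the CMP wire protocol or any MATRIXX API: the class names `ClusterManager` and `TransactionServer`, the event names `added` and `removed`, and the in-memory set used as a topology map are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TransactionServer:
    """Hypothetical server that keeps its own copy of the topology map."""
    name: str
    topology: set = field(default_factory=set)

    def apply(self, event: str, server: str) -> None:
        # Apply one membership change announced by the Cluster Manager.
        if event == "added":
            self.topology.add(server)
        elif event == "removed":
            self.topology.discard(server)
        else:
            raise ValueError(f"unknown event: {event}")


class ClusterManager:
    """Hypothetical monitor that broadcasts membership changes (assumed behavior)."""

    def __init__(self) -> None:
        self.servers = {}

    def add_server(self, name: str) -> None:
        # The new server starts with the current membership plus itself.
        self.servers[name] = TransactionServer(name, set(self.servers) | {name})
        self._broadcast("added", name)

    def remove_server(self, name: str) -> None:
        self.servers.pop(name, None)
        self._broadcast("removed", name)

    def _broadcast(self, event: str, name: str) -> None:
        # Every running server updates its own topology map.
        for server in self.servers.values():
            server.apply(event, name)


def demo_topology() -> dict:
    cm = ClusterManager()
    for name in ("s1", "s2", "s3"):
        cm.add_server(name)
    cm.remove_server("s2")
    return {n: sorted(s.topology) for n, s in cm.servers.items()}
```

After `s2` is removed, each remaining server in this sketch holds the same two-member map, which is the property the update message is meant to preserve.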

MATRIXX Engine processing server clusters are deployed in ACTIVE-ACTIVE pairs and the publishing server clusters in ACTIVE-STANDBY pairs. In both cases, if a server becomes unavailable, the Cluster Manager directs the other server to take over without human intervention. Any pause in processing or publishing is momentary.

MATRIXX Engine clusters are deployed ACTIVE-STANDBY or ACTIVE-STANDBY-STANDBY. If the TRA-SI/DR detects an active-active conflict between engines, it keeps the appropriate engine active and shuts down the other, so that only one engine is active at a time.
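As a rough sketch of the single-active rule, the function below resolves a conflict between engines that both report ACTIVE. The text does not say how the "appropriate" engine is chosen, so the `priority` list here is purely an assumption for illustration, as are the function name and the `STOPPED` state label.

```python
def resolve_active_conflict(states: dict, priority: list) -> dict:
    """Return a copy of `states` with at most one ACTIVE engine.

    ASSUMPTION: the engine that appears first in `priority` is the one
    kept active. The real selection rule is not described in the source.
    """
    unknown = set(states) - set(priority)
    if unknown:
        raise ValueError(f"engines missing from priority list: {sorted(unknown)}")

    # Active engines, ordered by the assumed priority.
    active = [e for e in priority if states.get(e) == "ACTIVE"]
    resolved = dict(states)
    # Keep the first active engine; shut down every other active engine.
    for engine in active[1:]:
        resolved[engine] = "STOPPED"
    return resolved
```

For example, `resolve_active_conflict({"E1": "ACTIVE", "E2": "ACTIVE"}, ["E2", "E1"])` keeps `E2` active and stops `E1`. If no conflict exists, the states are returned unchanged.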

Inter-Cluster DR

MATRIXX Engine processing and publishing clusters are typically deployed in ACTIVE-STANDBY HA pairs to offer inter-cluster HA and support disaster recovery (DR). Processing clusters can also be deployed in a three-cluster HA model, where one cluster is in an ACTIVE state and two clusters are in a STANDBY state. Regardless of the HA model, only the active cluster can process network payload traffic.

The ACTIVE and STANDBY HA states of the clusters are dynamic. They can change due to an engine failover, an administrative shutdown of the active engine, or a forced switch-over, as when the clusters are commanded to exchange their ACTIVE and STANDBY states using a CLI command. To ensure that failover can occur in any of these situations, the Cluster Manager in each processing cluster monitors the availability of the Cluster Manager in the other processing and publishing clusters. To offer an accurate view of any cluster, one Cluster Manager in each cluster is designated as the lead during engine startup and is responsible for communicating status to the Cluster Managers in the peer clusters. If the MATRIXX Engine server running the lead Cluster Manager fails, the lead role switches to another healthy Cluster Manager in the cluster, so that inter-cluster HA monitoring is not interrupted.
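The hand-over of the lead role can be sketched as follows. The text does not describe how the lead is designated or which healthy Cluster Manager takes over, so choosing the first healthy name in sorted order is an assumption for the example, as is the `choose_lead` function itself.

```python
from typing import Optional


def choose_lead(managers: dict, current_lead: Optional[str] = None) -> Optional[str]:
    """Pick the lead Cluster Manager for one cluster.

    `managers` maps a Cluster Manager name to True (healthy) or False.
    ASSUMPTION: a healthy current lead keeps the role; otherwise the first
    healthy manager in name order takes over. Returns None if none is healthy.
    """
    if current_lead is not None and managers.get(current_lead, False):
        return current_lead
    healthy = sorted(name for name, ok in managers.items() if ok)
    return healthy[0] if healthy else None
```

With `{"cm1": False, "cm2": True, "cm3": True}` and a current lead of `cm1`, this sketch moves the role to `cm2`, so status reporting to the peer clusters continues from a healthy server.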

If the lead Cluster Manager in the active cluster identifies an issue with a standby cluster, it can stop that standby cluster so processing errors do not occur. If the lead Cluster Manager on a standby cluster identifies an issue with the active cluster, it can stop the active cluster, which changes the HA state of the standby cluster to ACTIVE so processing can continue.
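The second case, in which a standby cluster stops a faulty active cluster and takes over, can be sketched as a small state transition. The state label `STOPPED`, the function name, and the rule that the first standby in `order` is promoted in a three-cluster deployment are assumptions for illustration; the text does not say which standby is chosen.

```python
ACTIVE, STANDBY, STOPPED = "ACTIVE", "STANDBY", "STOPPED"


def stop_active_and_promote(states: dict, order: list) -> dict:
    """Stop the ACTIVE cluster and promote one STANDBY cluster.

    ASSUMPTION: the first STANDBY cluster in `order` is promoted.
    """
    result = dict(states)
    # Stop whichever cluster is currently active.
    for name in order:
        if result[name] == ACTIVE:
            result[name] = STOPPED
    # Promote exactly one standby so that processing can continue.
    for name in order:
        if result[name] == STANDBY:
            result[name] = ACTIVE
            break
    return result
```

Starting from `{"A": ACTIVE, "B": STANDBY, "C": STANDBY}` with order `["A", "B", "C"]`, the sketch yields `A` stopped, `B` active, and `C` still standby, so only one cluster processes traffic at any time.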

For information about configuring Cluster Manager quorum and fencing behavior, see the discussion about Cluster Manager configuration in MATRIXX Configuration.