cluster_mgr_cli.py

The cluster_mgr_cli.py script provides a simple command line client that can be run on an off-engine server such as a Network Operation Center (NOC) to manage certain cluster operations. This script can retrieve cluster and peer cluster HA states, low-level information about cluster status, the Cluster Manager leader, and the cluster schema version. This script can also shut down a target cluster.

The cluster_mgr_cli.py script identifies a cluster by the Traffic Routing Agent virtual IP address (VIP) and port of the cluster_control virtual server. The script can run on an off-engine server, such as a Network Operation Center (NOC) server, or directly on a server in MATRIXX Engine. If the script is run on an off-engine server, that server must have Python installed and the cluster_mgr_cli.py file copied to a local directory. The schema version of the script must match the schema version of the target cluster. Cluster HA States lists the available cluster HA states.
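The script addresses a cluster by a target of the form ipaddress:cli_port (for example, 10.10.1.1:4800). A wrapper that drives cluster_mgr_cli.py might validate this format before invoking the script; the following is a minimal sketch, and the parse_target helper is hypothetical, not part of the script:

```python
# Hypothetical helper for validating an "ipaddress:cli_port" target string
# before passing it to cluster_mgr_cli.py. Not part of the script itself.

def parse_target(target):
    """Split 'ipaddress:cli_port' into (ipaddress, port), validating the port."""
    host, sep, port_text = target.rpartition(":")
    if not sep or not host:
        raise ValueError("expected 'ipaddress:cli_port', got %r" % target)
    port = int(port_text)  # raises ValueError if the port is not numeric
    if not 0 < port < 65536:
        raise ValueError("port out of range: %d" % port)
    return host, port

print(parse_target("10.10.1.1:4800"))
```

A wrapper would then pass the validated string unchanged as the -t argument.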
Table 1. Cluster HA States

UNKNOWN (SNMP ID 0)
The state of the remote cluster is unknown. Either a connection to the remote peer was never established, or a connection was established and then lost within the failure-detection timeout period (the peer is not yet considered failed).

START (SNMP ID 1)
A cluster is in the process of starting, which involves waiting for a cluster quorum and other conditions to be met.

PRE_INIT (SNMP ID 2)
A cluster is waiting for the latest pricing version to be loaded into the newly upgraded engine. This HA state is transitioned to only during a MATRIXX Engine software upgrade.

INIT (SNMP ID 3)
A cluster is in the process of initializing its databases from a checkpoint file or from a running cluster. The cluster then transitions into an active or standby cluster.
Note: When a standby cluster is in the INIT state, a pending-transactions message similar to the following might be written to the mtx_debug.log file:
LM_INFO 63530|16816 2015-06-01 19:17:28.793198  [transaction_server_2:1:1:1(4603.32039)] | 
TXN1-TransactionManager:transaction_manager_task::TransactionManagerTask::printPendingTxnSummary:
number of pending transactions per blade={blade 5=2}
This message does not indicate a problem. It is informational and can occur when a server in the standby cluster receives a parallel balance transaction to replay but has not yet received the checkpoint transaction for the same balance set object. Because the parallel balance transaction has no absolute balance against which to apply the difference, the transaction is saved as pending until the checkpoint transaction for that balance set object arrives. During this period, the message is recorded and indicates that the standby cluster is synchronizing with the active cluster. The number of pending transactions resolves as the synchronization completes.

POST_INIT (SNMP ID 4)
A cluster was upgraded to a new software version and is undergoing the schema upgrade and conversion transformations needed to handle any caveats before entering a standby state. This HA state is transitioned to only during a MATRIXX Engine software upgrade.

STANDBY_SYNC (SNMP ID 5)
For a standby cluster, the servers are synchronizing their databases by replaying transactions. This is a transitional state during an engine start-up, switchover, or failover.
For an active cluster, the servers are replaying transactions after an engine switchover to synchronize their databases.
The STANDBY_SYNC state precedes the STANDBY state.

STANDBY (SNMP ID 6)
A cluster is ready to replay transaction logs. During typical runtime operations, the cluster in the secondary engine is in the STANDBY HA state.

ACTIVE_SYNC (SNMP ID 7)
The cluster was selected as the active cluster and is synchronizing its databases in real time from its queued replay transactions. This is a transitional state between STANDBY and ACTIVE.
If an engine in a FAILED state is detected, a STANDBY engine transitions to ACTIVE_SYNC. If the FAILED engine has a processing pod that can still process requests, the ACTIVE_SYNC engine never detects that all transactions have completed; in that case, the ACTIVE_SYNC engine shuts down after a configurable timeout period. The duration of the timeout period is the product of the timer_tick_msec and cluster_active_conflict_timeout_count mtx_config.xml properties. For example, if timer_tick_msec is set to 50 (milliseconds) and cluster_active_conflict_timeout_count is set to 100, the ACTIVE_SYNC engine shuts down after 5 seconds.

ACTIVE (SNMP ID 8)
A cluster is actively processing incoming network traffic. During typical runtime operations, the cluster in the primary engine is in the ACTIVE HA state.

EXIT (SNMP ID 10)
The servers in a cluster are exiting so that the cluster can be stopped without causing quorum issues.

STOP (SNMP ID 11)
A cluster is stopping.

FINAL (SNMP ID 12)
A cluster is stopped.

FAILED (SNMP ID 13)
A cluster had a connection to a remote peer cluster and lost the connection permanently. This condition occurs when a connection cannot be restored within the failure-detection timeout period. The cluster views the peer cluster as failed.

NONE (SNMP ID 14)
A pseudo-state added for the Traffic Routing Agent to identify an engine cluster. This value is used when no peer cluster is configured for a MATRIXX Engine environment.

OFFLINE (SNMP ID 15)
A cluster is not stopped, but after transaction replay completes, ports on Traffic Routing Agent load-balancing instances (TRA-PROCs) are blocked so that the cluster is isolated from the rest of the topology. A peer cluster in the STANDBY HA state, if present, transitions to the ACTIVE HA state, as if the OFFLINE cluster had been stopped.
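The ACTIVE_SYNC shutdown timeout described in Table 1 is the product of two mtx_config.xml properties. A quick way to sanity-check a configuration value is shown below; the property names come from the text, the helper function itself is illustrative:

```python
# ACTIVE_SYNC shutdown timeout = timer_tick_msec * cluster_active_conflict_timeout_count.
# The values below reproduce the example from Table 1 (50 ms * 100 ticks = 5 seconds).

def active_sync_timeout_seconds(timer_tick_msec, conflict_timeout_count):
    """Return the ACTIVE_SYNC shutdown timeout in seconds."""
    return timer_tick_msec * conflict_timeout_count / 1000.0

print(active_sync_timeout_seconds(50, 100))  # 5.0
```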
Restriction: Several cluster_mgr_cli.py script command options are for internal use only and are subject to change without notice. See the following section for the list of supported options.

Syntax

/opt/mtx/bin/cluster_mgr_cli.py [-h] -t target [get cluster_state | get cluster_ha_state | get excluded_nodes | clear excluded_nodes | get schema_version | set offline_cluster | clear offline_cluster | get peer_clusters | shutdown cluster | switchover active_cluster]

Supported Options

-h, --help
Prints help information about this script.
-t, --target ipaddress:cli_port
The Traffic Routing Agent virtual IP address (VIP) and port of the cluster_control virtual server for an engine cluster.
get cluster_state
Prints the HA state (SNMP ID) of the target cluster as the integer ID defined in the SNMP MIB file.
get cluster_ha_state
An alias for the get cluster_state command for backward compatibility.
get excluded_nodes
Prints the node IDs of the servers that could not be fenced off from the cluster and were therefore added to the Cluster Management Protocol (CMP) block list.
clear excluded_nodes
Removes the listed nodes from the CMP block list, enabling them to rejoin the cluster and CMP. Returns 0 upon success.
get schema_version
Prints the schema version of the target cluster.
set offline_cluster
Puts the cluster into an offline, isolated state without fully stopping it. A standby peer cluster, if present, becomes active. During a multi-cluster upgrade, the offline state allows the cluster to be restored faster than a full restart would.
clear offline_cluster
Restores the cluster to an online, non-isolated state.

Unsupported Options

get peer_clusters
Prints information about each peer cluster in an active-standby HA configuration, including the cluster ID, cluster state, cluster substate, schema version, and peer cluster ID.
Use the get cluster_state command instead of this command.
shutdown cluster
Requests an orderly shutdown of the target cluster. Returns 0 upon success.
Warning: Stopping a cluster results in an engine failover.
switchover active_cluster
Requests a switchover of the active and standby peer clusters. Returns 0 upon success.

Use the activate_engine.py or activate_cluster.py script instead of this command. These scripts perform more checks before initiating a switchover operation.

Important: If your production environment has three running engines, you cannot switch the active and standby states of two clusters. You must first stop the engine that is not part of the switchover operation.

Display the HA State of a Cluster

Display the HA state of the cluster using VIP 10.10.1.1:
cluster_mgr_cli.py -t 10.10.1.1:4800 get cluster_state
8
This output shows that the cluster is in the ACTIVE HA state.
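A monitoring script consuming this integer output might translate it back to a state name using the SNMP IDs from Table 1. The following is a minimal sketch; the HA_STATES mapping and state_name helper are illustrative, not part of cluster_mgr_cli.py:

```python
# SNMP ID -> HA state name, per Table 1 (note that ID 9 is unused).
HA_STATES = {
    0: "UNKNOWN", 1: "START", 2: "PRE_INIT", 3: "INIT", 4: "POST_INIT",
    5: "STANDBY_SYNC", 6: "STANDBY", 7: "ACTIVE_SYNC", 8: "ACTIVE",
    10: "EXIT", 11: "STOP", 12: "FINAL", 13: "FAILED", 14: "NONE",
    15: "OFFLINE",
}

def state_name(snmp_id):
    """Translate the integer printed by 'get cluster_state' into a state name."""
    return HA_STATES.get(int(snmp_id), "UNRECOGNIZED")

print(state_name("8"))  # ACTIVE
```

Feeding the script's stdout (here, 8) through such a mapping yields the ACTIVE state shown in the example above.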