Sub-Domain Health Check Brain Recovery Operations

The subdomain-health-check-brain attempts recovery from four conditions. A condition that has failed prevents the next decision from being considered.

The subdomain-health-check-brain tracks restarts in a brain-engine-state file written to the subdomain-health-checker directory in shared logging storage. If persistent storage is not enabled, the file is saved in the subdomain-health-check-brain pod.

The subdomain-health-check-brain uses this file to limit how many times restarts are performed; without this tracking, a MATRIXX Engine might be restarted continuously for a recurring condition. The file is deleted when the subdomain-health-check-brain starts and is updated whenever the subdomain-health-check-brain restarts an engine or the publishing cluster of an engine during recovery. How long the saved state is kept is controlled by the engineStateExpiry Helm configuration property: when set to a nonzero value, the saved engine state expires after the specified number of minutes. If the state has not expired, an engine can be restarted only if the saved restart count is less than the maximum allowed, which by default is one restart for an engine and two restarts for the publishing cluster.
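
The following is a minimal sketch of how such a restart-limit check might work. The brain-engine-state file name and the default limits come from the description above; the file path, field names, and helper functions are illustrative assumptions, not the actual implementation.

```python
import json
import time
from pathlib import Path

# Defaults taken from the description above; the real property names and
# file layout used by the subdomain-health-check-brain may differ.
ENGINE_STATE_EXPIRY_MINUTES = 30   # engineStateExpiry Helm property (0 = never expire)
MAX_ENGINE_RESTARTS = 1            # default: one restart for an engine
MAX_PUBLISHING_RESTARTS = 2        # default: two restarts for the publishing cluster

# Hypothetical location of the brain-engine-state file in shared logging storage.
STATE_FILE = Path("/logs/subdomain-health-checker/brain-engine-state")


def load_state() -> dict:
    """Read the saved engine state, treating a missing file as empty state."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {}


def may_restart(engine: str, target: str = "engine") -> bool:
    """Return True if the saved restart count still allows restarting the target."""
    entry = load_state().get(engine, {})
    updated = entry.get("updated", 0)
    # An expired saved state no longer limits restarts.
    if ENGINE_STATE_EXPIRY_MINUTES and time.time() - updated > ENGINE_STATE_EXPIRY_MINUTES * 60:
        return True
    limit = MAX_ENGINE_RESTARTS if target == "engine" else MAX_PUBLISHING_RESTARTS
    return entry.get(f"{target}_restarts", 0) < limit


def record_restart(engine: str, target: str = "engine") -> None:
    """Update the saved state whenever an engine or its publishing cluster is restarted."""
    state = load_state()
    entry = state.setdefault(engine, {})
    entry[f"{target}_restarts"] = entry.get(f"{target}_restarts", 0) + 1
    entry["updated"] = time.time()
    STATE_FILE.write_text(json.dumps(state))
```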

Recovery from Engine Failure

The subdomain-health-check-brain does nothing until the engines have either started or failed to start after the maximum number of attempts. Engine states are mapped to an engine failure decision table to determine a course of recovery actions. When an engine is down, it is restarted if auto-healing is disabled for it and it is not the active engine. The exception is a full outage, during which auto-healing is stopped and an automated process restarts the engine with the highest GTC as the active engine and the remaining engines as standby engines.

If any engine has failed and does not meet the conditions to be restarted, the subdomain-health-check-brain takes no further steps, because gtc-sync-health-check requires all engines to be started before it can detect GTC gaps.
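
As a rough illustration, the engine failure decision described above might be expressed as follows. The engine attributes, helper names, and action strings are assumptions made for this sketch; they are not the actual decision table.

```python
from dataclasses import dataclass


@dataclass
class Engine:
    name: str
    is_down: bool
    is_active: bool
    auto_healing_disabled: bool
    gtc: int  # global transaction counter


def engine_failure_actions(engines: list[Engine]) -> list[str]:
    """Return the recovery actions implied by the engine failure decision table (sketch)."""
    if all(e.is_down for e in engines):
        # Full outage: restart the engine with the highest GTC as the active engine
        # and the remaining engines as standby engines.
        by_gtc = sorted(engines, key=lambda e: e.gtc, reverse=True)
        return ([f"restart {by_gtc[0].name} as active"]
                + [f"restart {e.name} as standby" for e in by_gtc[1:]])

    actions = []
    for engine in engines:
        # A down engine is restarted only when auto-healing is disabled for it
        # and it is not the active engine.
        if engine.is_down and engine.auto_healing_disabled and not engine.is_active:
            actions.append(f"restart {engine.name}")
    return actions
```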

Recovery from Engine Standby GTC Out of Sync

When gtc-sync-health-check notifies the subdomain-health-check-brain through the gtc-sync-state file that a standby engine is out of sync, the subdomain-health-check-brain uses a decision table to determine what action to take. The action depends on which standby engine is out of sync, but always involves restarting the standby engine.

Note: For three-engine deployments, both standby engines must be stopped if the active and first standby engines are out of sync.
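
The decision table itself is not detailed here; the following sketch only encodes what the text above states, using hypothetical engine names and action strings.

```python
def standby_out_of_sync_actions(out_of_sync_standbys: list[str],
                                active_out_of_sync: bool,
                                num_engines: int) -> list[str]:
    """Sketch of the standby GTC out-of-sync decision described above."""
    # Three-engine note above: if the active and first standby engines are
    # out of sync, both standby engines must be stopped.
    if num_engines == 3 and active_out_of_sync and "standby1" in out_of_sync_standbys:
        return ["stop standby1", "stop standby2"]
    # Otherwise the action always involves restarting the out-of-sync standby engine.
    return [f"restart {name}" for name in out_of_sync_standbys]
```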

Recovery from Processing and Publishing GTC Out of Sync

When gtc-sync-health-check notifies the subdomain-health-check-brain through the gtc-sync-state file that a publishing cluster is out of sync, the subdomain-health-check-brain stops and starts the publishing cluster. The cluster is restarted a maximum of two times, after which the engine of the failed cluster is restarted once.
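
A simplified view of that escalation, using hypothetical in-memory counters in place of the brain-engine-state file and the default limits described earlier:

```python
# Hypothetical in-memory counters standing in for the brain-engine-state file.
publishing_restarts: dict[str, int] = {}
engine_restarts: dict[str, int] = {}

MAX_PUBLISHING_RESTARTS = 2   # publishing cluster is restarted at most twice
MAX_ENGINE_RESTARTS = 1       # then the engine itself is restarted once


def publishing_out_of_sync_action(engine: str) -> str:
    """Sketch of the publishing cluster GTC out-of-sync escalation described above."""
    if publishing_restarts.get(engine, 0) < MAX_PUBLISHING_RESTARTS:
        publishing_restarts[engine] = publishing_restarts.get(engine, 0) + 1
        return f"stop and start the publishing cluster of {engine}"
    if engine_restarts.get(engine, 0) < MAX_ENGINE_RESTARTS:
        engine_restarts[engine] = engine_restarts.get(engine, 0) + 1
        return f"restart engine {engine}"
    return "no action: restart limits reached"
```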

Recovery from Publishing Cluster Failure

If more than one engine is deployed and a publishing cluster failure is detected while auto-healing is disabled, the engine with the failed publishing cluster is restarted.
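
This last rule is small enough to express directly; the function and parameter names below are illustrative only.

```python
def publishing_cluster_failure_action(num_engines: int,
                                      auto_healing_disabled: bool,
                                      failed_engine: str) -> str | None:
    """Restart the engine with the failed publishing cluster, per the rule above (sketch)."""
    if num_engines > 1 and auto_healing_disabled:
        return f"restart engine {failed_engine}"
    return None
```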