Sub-Domain Health Checks
The sub-domain health checkers are deployed per sub-domain and are responsible for checking MATRIXX Engine health within the sub-domain. Two of the sub-domain health checker components are the gtc-sync-health-check and subdomain-health-check-brain containers.
GTC Sync Health Check
The gtc-sync-health-check does nothing unless one of the following conditions is met:
- All engines have been started.
- An active engine is running with any other engines disabled after the maximum configured number of attempts to auto-heal.
The GTC sync health check reports an error if either of the following conditions arises:
- The GTC value stays constant for the configured period of time. For this condition to apply, all values detected during the period must be identical and nonzero, and at least one of the current GTC and the last replayed GTC must be static.
- The GTC value exceeds the configured maximum value and does not decrease.
Multiple errors can be detected during the same period of time, for example in different engines. The errors are reported to the sub-domain health check brain container, which decides what action to take, if any.
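The two error conditions can be illustrated with a minimal Python sketch. This is not the checker's actual implementation: the `GtcSample` structure, the function names, and the error labels are all hypothetical, and the sketch assumes that "does not decrease" means no sample in the period is lower than the one before it.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class GtcSample:
    """One observation taken during the configured period (hypothetical shape)."""
    gtc_value: int          # the value the check monitors
    current_gtc: int
    last_replayed_gtc: int


def _static(values: Sequence[int]) -> bool:
    """True when every value in the sequence is the same."""
    return len(set(values)) == 1


def constant_gtc_error(samples: Sequence[GtcSample]) -> bool:
    """Identical, nonzero values, with current or last replayed GTC static."""
    if len(samples) < 2:
        return False
    values = [s.gtc_value for s in samples]
    if not _static(values) or values[0] == 0:
        return False
    return (_static([s.current_gtc for s in samples])
            or _static([s.last_replayed_gtc for s in samples]))


def max_exceeded_error(samples: Sequence[GtcSample], max_value: int) -> bool:
    """Every value is above the maximum and the series never decreases (assumed reading)."""
    if len(samples) < 2:
        return False
    values = [s.gtc_value for s in samples]
    if any(v <= max_value for v in values):
        return False
    return all(later >= earlier for earlier, later in zip(values, values[1:]))


def detect_errors(samples_by_engine: Dict[str, List[GtcSample]],
                  max_value: int) -> Dict[str, List[str]]:
    """Collect the errors per engine; several engines can report in one period."""
    errors: Dict[str, List[str]] = {}
    for engine, samples in samples_by_engine.items():
        found = []
        if constant_gtc_error(samples):
            found.append("constant_gtc")
        if max_exceeded_error(samples, max_value):
            found.append("max_exceeded")
        if found:
            errors[engine] = found
    return errors
```

In this sketch, the per-engine dictionary returned by `detect_errors` stands in for whatever report the checker sends to the brain container.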
Sub-Domain Health Check Brain
The subdomain-health-check-brain does nothing unless one of the following conditions is met:
- All engines have been started or disabled after the maximum configured number of attempts to auto-heal.
- An active engine is running with any other engines disabled after the maximum configured number of attempts to auto-heal.
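The gating rule above can be written as a small predicate. The following Python sketch is only an illustration: the `EngineStatus` fields and function names are hypothetical and are not taken from the product.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EngineStatus:
    """Hypothetical per-engine view used by this sketch."""
    started: bool = False
    active: bool = False
    # True when the engine was disabled after the maximum configured
    # number of auto-heal attempts.
    heal_attempts_exhausted: bool = False


def all_started_or_disabled(engines: Sequence[EngineStatus]) -> bool:
    """First condition: every engine is either started or disabled."""
    return bool(engines) and all(
        e.started or e.heal_attempts_exhausted for e in engines
    )


def active_with_rest_disabled(engines: Sequence[EngineStatus]) -> bool:
    """Second condition: one active engine, every other engine disabled."""
    for i, engine in enumerate(engines):
        others = [o for j, o in enumerate(engines) if j != i]
        if engine.active and all(o.heal_attempts_exhausted for o in others):
            return True
    return False


def brain_may_act(engines: Sequence[EngineStatus]) -> bool:
    """The brain proceeds only when at least one condition holds."""
    return all_started_or_disabled(engines) or active_with_rest_disabled(engines)
```

For example, an engine that is still starting (neither started nor disabled) keeps `brain_may_act` false unless another engine is active and all remaining engines are disabled.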
The subdomain-health-check-brain has two modes of operation:
- (Default) Recovery from engine and GTC out-of-sync errors.
- A dry-run mode where no action is taken in response to engine and GTC out-of-sync errors except for logging.
The brain checks for the following conditions in order:
- Engine failure (where there is more than one deployed engine) when the auto-healing retry limit has been exceeded.
- Engine standby GTC out-of-sync.
- Processing to publishing cluster GTC out-of-sync.
- Publishing cluster failure.
Only one condition is recovered at a time. For example, if the standby engine is out of sync, only that condition is recovered, even if there is also a publishing cluster failure.
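The ordered evaluation and the two modes of operation can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the brain's real code: the condition labels, `run_cycle`, and the `recover` callback are hypothetical, and the sketch assumes that dry-run mode also stops at the first detected condition.

```python
import logging
from typing import Callable, Dict, Optional

logger = logging.getLogger("brain_sketch")

# Conditions in the order the brain checks them (labels are hypothetical).
CONDITION_ORDER = [
    "engine_failure",
    "standby_gtc_out_of_sync",
    "processing_to_publishing_gtc_out_of_sync",
    "publishing_cluster_failure",
]


def run_cycle(detected: Dict[str, bool],
              recover: Callable[[str], None],
              dry_run: bool = False) -> Optional[str]:
    """Handle at most one condition: the first one detected, in order.

    In dry-run mode the condition is only logged and `recover` is not called.
    Returns the handled condition, or None when nothing was detected.
    """
    for condition in CONDITION_ORDER:
        if detected.get(condition, False):
            if dry_run:
                logger.warning("dry run: %s detected, no action taken", condition)
            else:
                recover(condition)
            return condition
    return None
```

With both a standby GTC out-of-sync error and a publishing cluster failure present, `run_cycle` recovers only the standby condition; the publishing cluster failure would be picked up in a later cycle if it persists.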