Master Cluster Failure
When a cluster that is hosting Topology Operator masters (topology-operator, subdomain-operator, and pricing-operator pods) goes down, there are no masters present to communicate with the topology-agent instances in surviving clusters.
In a three-cluster deployment with the following distribution across clusters and namespaces:
- The masters and agents in cluster 1 are in namespace matrixx-operators.
- The agents in cluster 2 are in namespace matrixx-operators.
- Engine s1e1 in cluster 1 is in namespace matrixx-engine-s1.
- Engine s1e2 in cluster 2 is in namespace matrixx-engine-s1.
When cluster 1 goes down, the masters are no longer present to communicate with cluster 2. Similarly, engine s1e2 cannot communicate with engine s1e1.
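For example, you can observe the failure from a surviving cluster: the agents in cluster 2 are still running, while requests against the failed cluster's context time out or are refused. The context and namespace names below are the ones used in this example:
kubectl --context context2 get pods -n matrixx-operators
kubectl --context context1 get pods -n matrixx-operators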
This situation is not as serious as you might expect. When a helm install or helm upgrade is performed, the topology-operator pod is responsible for creating the other operators and the resources they require. After that, the topology-operator is idle, waiting for the next Helm upgrade to be performed. This means that, once the engine(s) have been created, the day-to-day management of an engine does not require the topology-operator pod to even be present. Auto-healing is managed by the engine-operator and pod-monitor pods, which are local to the engine they are managing.
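As a sanity check, you can confirm that these local auto-healing pods are still running in cluster 2 while the masters are down. The context and namespace names are the ones from this example, and the exact pod names depend on your release:
kubectl --context context2 get pods -n matrixx-engine-s1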
After recreating the masters and engine s1e1 in cluster 3, the new deployment will be:
- The masters and agents in cluster 3 are in namespace matrixx-operators.
- The agents in cluster 2 are in namespace matrixx-operators.
- Engine s1e1 in cluster 3 is in namespace matrixx-engine-s1.
- Engine s1e2 in cluster 2 is in namespace matrixx-engine-s1.
As well as performing new Helm installs in cluster 3, you must perform Helm upgrades in cluster 2 to update the configuration of the releases running there.
Create the needed cluster 3 namespaces and install the engines, agents, and masters for the new deployment in the following order:
kubectl --context context3 create ns matrixx-operators
kubectl --context context3 create ns matrixx-engine-s1
helm --kube-context context2 upgrade mtx-engine-s1 matrixx/matrixx -n matrixx-engine-s1 -f base.yaml -f topology-recover.yaml -f cluster2.yaml --version matrixx_version
helm --kube-context context3 install mtx-engine-s1 matrixx/matrixx -n matrixx-engine-s1 -f base.yaml -f topology-recover.yaml -f cluster3.yaml --version matrixx_version
helm --kube-context context2 upgrade mtx-operators matrixx/matrixx -n matrixx-operators -f base.yaml -f topology-recover.yaml -f cluster2.yaml --version matrixx_version
helm --kube-context context3 install mtx-operators matrixx/matrixx -n matrixx-operators -f base.yaml -f topology-recover.yaml -f cluster3.yaml --version matrixx_version
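After the installs and upgrades complete, you can confirm that the masters and engine s1e1 have been recreated in cluster 3. The context, release, and namespace names below are the ones used in this example:
helm --kube-context context3 list -n matrixx-operators
helm --kube-context context3 list -n matrixx-engine-s1
kubectl --context context3 get pods -n matrixx-operators
kubectl --context context3 get pods -n matrixx-engine-s1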
Where topology-recover.yaml has the following contents:
engine:
  enabled: true
global:
  topology:
    operators:
      master:
        context: context3
        namespace: matrixx-operators
      agents:
        - context: context3
          namespace: matrixx-operators
          externalAddress: 10.10.10.300
          auth:
            basic:
              username: username3
              password: password3
        - context: context2
          namespace: matrixx-operators
          externalAddress: 10.10.10.200
          auth:
            basic:
              username: username2
              password: password2
    domains:
      - subdomains:
          - pricing:
              fileName: mtx_pricing_matrixxOne.xml
              image:
                name: example-pricing-sideloader
                version: "version"
            engines:
              - context: context3
                namespace: matrixx-engine-s1
                checkpointing:
                  replicaCount: 1
                processing:
                  externalAddress: 10.10.10.301
                  replicaCount: 2
                  tralb:
                    replicaCount: 2
                publishing:
                  externalAddress: 10.10.10.302
                  replicaCount: 2
                  tralb:
                    replicaCount: 2
              - context: context2
                namespace: matrixx-engine-s1
                checkpointing:
                  replicaCount: 1
                processing:
                  externalAddress: 10.10.10.201
                  replicaCount: 2
                  tralb:
                    replicaCount: 2
                publishing:
                  externalAddress: 10.10.10.202
                  replicaCount: 2
                  tralb:
                    replicaCount: 2
pricing-controller:
  enabled: true
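To confirm that the cluster 2 releases picked up the new master location after the upgrades, you can inspect the user-supplied values that Helm applied. The release and namespace names are the ones from this example:
helm --kube-context context2 get values mtx-operators -n matrixx-operators
helm --kube-context context2 get values mtx-engine-s1 -n matrixx-engine-s1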