Non-Master Cluster Failure
When a cluster that does not host the Topology Operator masters (topology-operator, subdomain-operator, and pricing-operator pods) fails, the masters report errors because they cannot communicate with the topology-agents in the failed cluster.
For example, consider a three-cluster deployment with the following initial distribution across clusters and namespaces:
- The masters and agents in cluster 1 are in namespace matrixx-operators.
- The agents in cluster 2 are in namespace matrixx-operators.
- Engine s1e1 in cluster 1 is in namespace matrixx-engine-s1.
- Engine s1e2 in cluster 2 is in namespace matrixx-engine-s1.
To recover from the failure of cluster 2, you recreate engine s1e2 on cluster 3. The new deployment distribution is:
- The masters and agents in cluster 1 are in namespace matrixx-operators.
- The agents in cluster 3 are in namespace matrixx-operators.
- Engine s1e1 in cluster 1 is in namespace matrixx-engine-s1.
- Engine s1e2 in cluster 3 is in namespace matrixx-engine-s1.
In addition to performing new Helm installs in cluster 3, you must perform Helm upgrades in cluster 1 so that the existing releases there pick up the updated topology configuration.
Create the required cluster 3 namespaces, and then install or upgrade the engines, agents, and masters for the new deployment in the following order:
kubectl --context context3 create ns matrixx-operators
kubectl --context context3 create ns matrixx-engine-s1
helm --kube-context context1 upgrade mtx-engine-s1 matrixx/matrixx -n matrixx-engine-s1 -f base.yaml -f topology-recover.yaml -f cluster1.yaml --version matrixx_version
helm --kube-context context3 install mtx-engine-s1 matrixx/matrixx -n matrixx-engine-s1 -f base.yaml -f topology-recover.yaml -f cluster3.yaml --version matrixx_version
helm --kube-context context3 install mtx-operators matrixx/matrixx -n matrixx-operators -f base.yaml -f topology-recover.yaml -f cluster3.yaml --version matrixx_version
helm --kube-context context1 upgrade mtx-operators matrixx/matrixx -n matrixx-operators -f base.yaml -f topology-recover.yaml -f cluster1.yaml --version matrixx_version
Where topology-recover.yaml has the following contents:
engine:
  enabled: true
global:
  topology:
    operators:
      master:
        context: context1
        namespace: matrixx-operators
      agents:
        - context: context1
          namespace: matrixx-operators
          externalAddress: 10.10.10.100
          auth:
            basic:
              username: username1
              password: password1
        - context: context3
          namespace: matrixx-operators
          externalAddress: 10.10.10.300
          auth:
            basic:
              username: username3
              password: password3
    domains:
      - subdomains:
          - pricing:
              fileName: mtx_pricing_matrixxOne.xml
              image:
                name: example-pricing-sideloader
                version: "version"
            engines:
              - context: context1
                namespace: matrixx-engine-s1
                checkpointing:
                  replicaCount: 1
                processing:
                  externalAddress: 10.10.10.101
                  replicaCount: 2
                  tralb:
                    replicaCount: 2
                publishing:
                  externalAddress: 10.10.10.102
                  replicaCount: 2
                  tralb:
                    replicaCount: 2
              - context: context3
                namespace: matrixx-engine-s1
                checkpointing:
                  replicaCount: 1
                processing:
                  externalAddress: 10.10.10.301
                  replicaCount: 2
                  tralb:
                    replicaCount: 2
                publishing:
                  externalAddress: 10.10.10.302
                  replicaCount: 2
                  tralb:
                    replicaCount: 2
pricing-controller:
  enabled: true
Where cluster3.yaml has the following contents:
global:
  topology:
    operators:
      currentContext: context3
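
Because the order of the namespace, install, and upgrade commands matters, it can help to keep them together in one script. The following is a minimal sketch only, not part of the product: the DRY_RUN and MATRIXX_VERSION variables and the run and recover_cluster3 helper names are hypothetical and introduced here for illustration. The commands themselves are the ones listed earlier in this section. By default the script only prints each command, so that the sequence can be reviewed before anything is changed.

```shell
#!/bin/sh
# Sketch: issue the cluster 3 recovery commands in the documented order.
# Assumptions (not from the product documentation):
#   DRY_RUN         - 1 (default) prints each command; 0 executes it.
#   MATRIXX_VERSION - stands in for the matrixx_version placeholder.
set -eu

DRY_RUN="${DRY_RUN:-1}"
MATRIXX_VERSION="${MATRIXX_VERSION:-matrixx_version}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

recover_cluster3() {
  # 1. Create the namespaces in the replacement cluster.
  run kubectl --context context3 create ns matrixx-operators
  run kubectl --context context3 create ns matrixx-engine-s1

  # 2. Upgrade the existing engine release in cluster 1.
  run helm --kube-context context1 upgrade mtx-engine-s1 matrixx/matrixx \
    -n matrixx-engine-s1 -f base.yaml -f topology-recover.yaml \
    -f cluster1.yaml --version "$MATRIXX_VERSION"

  # 3. Install the engine and then the agents in cluster 3.
  run helm --kube-context context3 install mtx-engine-s1 matrixx/matrixx \
    -n matrixx-engine-s1 -f base.yaml -f topology-recover.yaml \
    -f cluster3.yaml --version "$MATRIXX_VERSION"
  run helm --kube-context context3 install mtx-operators matrixx/matrixx \
    -n matrixx-operators -f base.yaml -f topology-recover.yaml \
    -f cluster3.yaml --version "$MATRIXX_VERSION"

  # 4. Upgrade the masters and agents release in cluster 1 last.
  run helm --kube-context context1 upgrade mtx-operators matrixx/matrixx \
    -n matrixx-operators -f base.yaml -f topology-recover.yaml \
    -f cluster1.yaml --version "$MATRIXX_VERSION"
}

recover_cluster3
```

With set -e in effect, running the script with DRY_RUN=0 stops at the first command that fails, so a failed install in cluster 3 is not followed by the final upgrade in cluster 1.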