Master Cluster Failure

When a cluster that is hosting Topology Operator masters (topology-operator, subdomain-operator, and pricing-operator pods) goes down, there are no masters present to communicate with the topology-agent instances in surviving clusters.

Consider a three-cluster deployment with the following distribution across clusters and namespaces:

  • The masters and agents in cluster 1 are in namespace matrixx-operators.
  • The agents in cluster 2 are in namespace matrixx-operators.
  • Engine s1e1 in cluster 1 is in namespace matrixx-engine-s1.
  • Engine s1e2 in cluster 2 is in namespace matrixx-engine-s1.
  • Cluster 3 does not yet host any MATRIXX components.

When cluster 1 goes down, the masters are no longer present to communicate with the agents in cluster 2. Similarly, engine s1e2 can no longer communicate with engine s1e1.

This situation is not as serious as you might expect. When a helm install or helm upgrade is performed, the topology-operator pod is responsible for creating the other operators and the resources they require. After that, the topology-operator sits idle, waiting for the next Helm upgrade. This means that, once the engines have been created, day-to-day management of an engine does not require the topology-operator pod to be present at all. Auto-healing is managed by the engine-operator and pod-monitor pods, which are local to the engine they manage.
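
For example, while the master cluster is down, you can confirm that the surviving agents and engine are still running and self-healing by inspecting their pods directly (a quick sanity check, using the context and namespace names from this example):

kubectl --context context2 get pods -n matrixx-operators
kubectl --context context2 get pods -n matrixx-engine-s1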

After recreating the masters and engine s1e1 in cluster 3, the new deployment will be:

  • The masters and agents in cluster 3 are in namespace matrixx-operators.
  • The agents in cluster 2 are in namespace matrixx-operators.
  • Engine s1e1 in cluster 3 is in namespace matrixx-engine-s1.
  • Engine s1e2 in cluster 2 is in namespace matrixx-engine-s1.

As well as performing new Helm installs in cluster 3, you must perform Helm upgrades in cluster 2 to update the configuration of the surviving agents and engine.

Create the needed cluster 3 namespaces and install the engines, agents, and masters for the new deployment in the following order:

# Create the cluster 3 namespaces:
kubectl --context context3 create ns matrixx-operators
kubectl --context context3 create ns matrixx-engine-s1

# Update the surviving engine s1e2 in cluster 2, then recreate engine s1e1 in cluster 3:
helm --kube-context context2 upgrade mtx-engine-s1 matrixx/matrixx -n matrixx-engine-s1 -f base.yaml -f topology-recover.yaml -f cluster2.yaml --version matrixx_version
helm --kube-context context3 install mtx-engine-s1 matrixx/matrixx -n matrixx-engine-s1 -f base.yaml -f topology-recover.yaml -f cluster3.yaml --version matrixx_version

# Update the surviving agents in cluster 2, then recreate the masters and agents in cluster 3:
helm --kube-context context2 upgrade mtx-operators matrixx/matrixx -n matrixx-operators -f base.yaml -f topology-recover.yaml -f cluster2.yaml --version matrixx_version
helm --kube-context context3 install mtx-operators matrixx/matrixx -n matrixx-operators -f base.yaml -f topology-recover.yaml -f cluster3.yaml --version matrixx_version
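
Once the namespaces exist and the Helm commands have completed, you can watch the recreated masters, agents, and engine come up in cluster 3 (again using the example context and namespace names):

kubectl --context context3 get pods -n matrixx-operators
kubectl --context context3 get pods -n matrixx-engine-s1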

The topology-recover.yaml file referenced in these commands has the following contents:

engine:
  enabled: true
  
global:
  topology:
    operators:
      master:
        context: context3
        namespace: matrixx-operators
      agents:
      - context: context3
        namespace: matrixx-operators
        externalAddress: 10.10.10.300
        auth:
          basic:
            username: username3
            password: password3
      - context: context2
        namespace: matrixx-operators
        externalAddress: 10.10.10.200
        auth:
          basic:
            username: username2
            password: password2
    domains:
    - subdomains:
      - pricing:
          fileName: mtx_pricing_matrixxOne.xml
          image:
            name: example-pricing-sideloader
            version: "version"
        engines:
        - context: context3
          namespace: matrixx-engine-s1
          checkpointing:
            replicaCount: 1
          processing:
            externalAddress: 10.10.10.301
            replicaCount: 2
            tralb:
              replicaCount: 2
          publishing:
            externalAddress: 10.10.10.302
            replicaCount: 2
            tralb:
              replicaCount: 2
        - context: context2
          namespace: matrixx-engine-s1
          checkpointing:
            replicaCount: 1
          processing:
            externalAddress: 10.10.10.201
            replicaCount: 2
            tralb:
              replicaCount: 2
          publishing:
            externalAddress: 10.10.10.202
            replicaCount: 2
            tralb:
              replicaCount: 2
  
pricing-controller:
  enabled: true
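
As a final sanity check, you can confirm the values each Helm release is now using with helm get values (shown here with the release names, contexts, and namespaces from this example):

helm --kube-context context3 get values mtx-operators -n matrixx-operators
helm --kube-context context2 get values mtx-engine-s1 -n matrixx-engine-s1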