Non-Master Cluster Failure

When a cluster that does not host the Topology Operator masters (the topology-operator, subdomain-operator, and pricing-operator pods) fails, the masters report errors because they cannot communicate with the topology-agents in the failed cluster.

Consider a three-cluster deployment with the following initial distribution across clusters and namespaces:

  • The masters and agents in cluster 1 are in namespace matrixx-operators.
  • The agents in cluster 2 are in namespace matrixx-operators.
  • Engine s1e1 in cluster 1 is in namespace matrixx-engine-s1.
  • Engine s1e2 in cluster 2 is in namespace matrixx-engine-s1.

After cluster 2 fails, you recreate engine s1e2 on cluster 3 to recover. The new deployment distribution is:

  • The masters and agents in cluster 1 are in namespace matrixx-operators.
  • The agents in cluster 3 are in namespace matrixx-operators.
  • Engine s1e1 in cluster 1 is in namespace matrixx-engine-s1.
  • Engine s1e2 in cluster 3 is in namespace matrixx-engine-s1.
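
Before running any recovery commands, it can help to confirm that the kubeconfig in use defines a context for each surviving cluster. The following is a minimal sketch, assuming the contexts are named context1 and context3 as in the commands in this section. The missing_contexts function is a hypothetical helper written for illustration and is not part of the product.

```shell
#!/bin/sh
# Hypothetical helper: print each required context name that is absent from
# the list of context names read from stdin (one name per line).
missing_contexts() {
  available=$(cat)
  for ctx in "$@"; do
    # -x matches the whole line, -F treats the name as a fixed string.
    printf '%s\n' "$available" | grep -qxF -- "$ctx" || printf '%s\n' "$ctx"
  done
}

# Typical use (requires kubectl and a kubeconfig); prints nothing when both
# contexts are defined:
#   kubectl config get-contexts -o name | missing_contexts context1 context3
```

Reading the context list from stdin keeps the helper independent of kubectl, so it can also be run against a saved list of context names.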

In addition to performing new Helm installs in cluster 3, you must perform Helm upgrades in cluster 1 to update the configuration of its existing releases.

Create the required cluster 3 namespaces, then install or upgrade the engines, agents, and masters for the new deployment in the following order:

kubectl --context context3 create ns matrixx-operators
kubectl --context context3 create ns matrixx-engine-s1
helm --kube-context context1 upgrade mtx-engine-s1 matrixx/matrixx -n matrixx-engine-s1 -f base.yaml -f topology-recover.yaml -f cluster1.yaml --version matrixx_version
helm --kube-context context3 install mtx-engine-s1 matrixx/matrixx -n matrixx-engine-s1 -f base.yaml -f topology-recover.yaml -f cluster3.yaml --version matrixx_version
helm --kube-context context3 install mtx-operators matrixx/matrixx -n matrixx-operators -f base.yaml -f topology-recover.yaml -f cluster3.yaml --version matrixx_version
helm --kube-context context1 upgrade mtx-operators matrixx/matrixx -n matrixx-operators -f base.yaml -f topology-recover.yaml -f cluster1.yaml --version matrixx_version
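
Because the order of these commands matters, one option is to wrap them in a small script that stops at the first failure. The following is a sketch only: the run, helm_step, and recover functions and the DRY_RUN and VERSION variables are hypothetical names introduced here, and matrixx_version remains a placeholder. By default the script only prints the commands so that the sequence can be reviewed first.

```shell
#!/bin/sh
# Sketch: issue the recovery commands in the documented order.
# DRY_RUN defaults to 1, so commands are printed rather than executed.
DRY_RUN="${DRY_RUN:-1}"
VERSION="${VERSION:-matrixx_version}"   # placeholder chart version

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# helm_step ACTION CONTEXT RELEASE NAMESPACE CLUSTER_VALUES_FILE
helm_step() {
  run helm --kube-context "$2" "$1" "$3" matrixx/matrixx -n "$4" \
    -f base.yaml -f topology-recover.yaml -f "$5" --version "$VERSION"
}

# Steps are chained with && so that a failed step stops the sequence.
recover() {
  run kubectl --context context3 create ns matrixx-operators &&
  run kubectl --context context3 create ns matrixx-engine-s1 &&
  helm_step upgrade context1 mtx-engine-s1 matrixx-engine-s1 cluster1.yaml &&
  helm_step install context3 mtx-engine-s1 matrixx-engine-s1 cluster3.yaml &&
  helm_step install context3 mtx-operators matrixx-operators cluster3.yaml &&
  helm_step upgrade context1 mtx-operators matrixx-operators cluster1.yaml
}

recover
```

Setting DRY_RUN=0 in the environment executes the commands instead of printing them.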

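Where topology-recover.yaml has the following contents (the addresses, credentials, and version strings are illustrative placeholders; addresses such as 10.10.10.300 are not valid IPv4 addresses and must be replaced with the values for your environment):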

engine:
  enabled: true
  
global:
  topology:
    operators:
      master:
        context: context1
        namespace: matrixx-operators
      agents:
      - context: context1
        namespace: matrixx-operators
        externalAddress: 10.10.10.100
        auth:
          basic:
            username: username1
            password: password1
      - context: context3
        namespace: matrixx-operators
        externalAddress: 10.10.10.300
        auth:
          basic:
            username: username3
            password: password3
    domains:
    - subdomains:
      - pricing:
          fileName: mtx_pricing_matrixxOne.xml
          image:
            name: example-pricing-sideloader
            version: "version"
        engines:
        - context: context1
          namespace: matrixx-engine-s1
          checkpointing:
            replicaCount: 1
          processing:
            externalAddress: 10.10.10.101
            replicaCount: 2
            tralb:
              replicaCount: 2
          publishing:
            externalAddress: 10.10.10.102
            replicaCount: 2
            tralb:
              replicaCount: 2
        - context: context3
          namespace: matrixx-engine-s1
          checkpointing:
            replicaCount: 1
          processing:
            externalAddress: 10.10.10.301
            replicaCount: 2
            tralb:
              replicaCount: 2
          publishing:
            externalAddress: 10.10.10.302
            replicaCount: 2
            tralb:
              replicaCount: 2
  
pricing-controller:
  enabled: true
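
A quick sanity check on this file is that it no longer references the failed cluster. The sketch below lists the distinct context values found in a values file. The contexts_in function is a hypothetical helper, and its sed pattern assumes the simple "context: name" layout used in the file above; it is not a general YAML parser.

```shell
#!/bin/sh
# Hypothetical helper: print the distinct kube contexts referenced by
# "context:" keys in a values file. Keys such as currentContext are ignored
# because the pattern requires the key to be exactly "context".
contexts_in() {
  sed -n 's/^[[:space:]-]*context:[[:space:]]*//p' "$1" | sort -u
}

# Run the check only when the file is present in the current directory.
if [ -f topology-recover.yaml ]; then
  contexts_in topology-recover.yaml
fi
```

For the file as shown, this should list only context1 and context3. Any other context in the output indicates a leftover reference to the failed cluster.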

Where cluster3.yaml has the following contents:

global:
  topology:
    operators:
      currentContext: context3
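
The commands also pass cluster1.yaml, which is not shown in this section. Assuming that it mirrors cluster3.yaml and differs only in the context name, both files could be generated as in the following sketch. The write_cluster_values function is a hypothetical helper, and the assumption about the contents of cluster1.yaml should be checked against the existing file for cluster 1.

```shell
#!/bin/sh
# Hypothetical helper: write a per-cluster values file that sets only
# global.topology.operators.currentContext.
# Usage: write_cluster_values CONTEXT FILE
write_cluster_values() {
  cat > "$2" <<EOF
global:
  topology:
    operators:
      currentContext: $1
EOF
}

# Example calls (the cluster1.yaml contents are an assumption, see above):
#   write_cluster_values context3 cluster3.yaml
#   write_cluster_values context1 cluster1.yaml
```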