Debugging and Failure Recovery

The MATRIXX Engine controller and MATRIXX Engine manager pods create, configure, and start the MATRIXX Engine(s) in a deployment.

MATRIXX Engine Controller

The MATRIXX Engine controller (engine-controller, in the matrixx namespace) is effectively a Kubernetes controller, in that it processes events and reacts to them as needed. When deploying, the controller creates all of the needed services, ConfigMaps, and pods.

If an expected object is not created, check the engine-controller logs for diagnostic information with a command similar to the following:

kubectl logs engine-controller-7667c465c-srs4n --namespace matrixx

The exact name of the engine controller pod in your installation will be different from the name in the example.
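
Because the pod name suffix varies between installations, it can help to look the name up rather than type it. The following is a minimal sketch, not part of the product: find_engine_controller is a hypothetical helper, and it assumes the controller pod name begins with engine-controller- as in the example above.

```shell
# Hypothetical helper: read pod names on stdin (one per line, as printed by
# `kubectl get pods -o name`, which prefixes each name with "pod/") and
# print the first one whose name starts with "engine-controller-".
find_engine_controller() {
  sed 's|^pod/||' | grep '^engine-controller-' | head -n 1
}

# Usage against a live cluster (assumes kubectl access to the namespace):
#   pod=$(kubectl get pods --namespace matrixx -o name | find_engine_controller)
#   kubectl logs "$pod" --namespace matrixx
```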

When a pod in the engine fails and is restarted, engine-controller revives the engine manager to re-run the engine start-up sequence.

MATRIXX Engine Manager

The MATRIXX Engine manager (mgr-engine-s1e1 in a single-engine deployment; mgr-engine-s1e1 and mgr-engine-s1e2 in a dual-engine deployment) runs the engine start-up sequence, which consists of the create_config.py and start_engine.py scripts. If the engine does not start, check the logs from mgr-engine-sXeY (where X is the subdomain ID and Y is the engine ID) for diagnostic information with the following command:

kubectl logs mgr-engine-sXeY --namespace matrixx
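
The sXeY placeholder can be filled in mechanically from the two IDs. The sketch below only illustrates the naming pattern described above; mgr_engine_pod is a hypothetical helper, not a MATRIXX tool.

```shell
# Hypothetical helper: build the engine manager pod name from a
# subdomain ID ($1) and an engine ID ($2), following the mgr-engine-sXeY
# pattern.
mgr_engine_pod() {
  printf 'mgr-engine-s%se%s\n' "$1" "$2"
}

# Usage against a live cluster, for subdomain 1, engine 2:
#   kubectl logs "$(mgr_engine_pod 1 2)" --namespace matrixx
```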

When the start-up sequence is complete, the MATRIXX Engine manager pod is in a completed state. If an engine pod becomes unresponsive, the MATRIXX Engine manager begins the start-up sequence again to restart the pod.

For each pod in the engine, including TRA-PROC and TRA-PUB pods, a Kubernetes liveness probe confirms pod activity. If this probe fails, Kubernetes restarts the containers in the unresponsive pod. Upon receipt of the failure event, the MATRIXX Engine controller revives the MATRIXX Engine manager to begin the engine start-up sequence.
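
One way to see whether Kubernetes has restarted any containers is to look at the RESTARTS column of kubectl get pods. The following is a rough sketch under that assumption: restarted_pods is a hypothetical helper, and it relies on the default column layout (NAME, READY, STATUS, RESTARTS, AGE), which may differ if custom output options are used.

```shell
# Hypothetical helper: read `kubectl get pods` output on stdin and print
# the name and restart count of every pod whose RESTARTS value (fourth
# column) is greater than zero. The header line is skipped automatically
# because the text "RESTARTS" evaluates to 0 in a numeric context.
restarted_pods() {
  awk '$4 + 0 > 0 { print $1, $4 }'
}

# Usage against a live cluster:
#   kubectl get pods --namespace matrixx | restarted_pods
# To see why a listed pod restarted, inspect its events:
#   kubectl describe pod <pod-name> --namespace matrixx
```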

Traffic Routing Agent Manager

The Traffic Routing Agent (TRA) manager (mgr-tra-access_group, where access_group is the name of the access group, in the matrixx namespace) runs the TRA start-up sequence: it runs the create_config.py script to generate the tra_config.xml and tra_config_network_topology.xml files, and runs the start_tra_node.py script.

If the TRA does not start, check the logs from mgr-tra-access_group for diagnostic information with a command similar to the following, which uses the default access group name, ag1:

kubectl logs mgr-tra-ag1 --namespace matrixx

When the start-up sequence is complete, the TRA manager pod is in a completed state.

Note: The access group name is specified with the global.accessGroup.id property in your values.yaml file. The default value is ag1.
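
If your installation overrides the access group name, the TRA manager pod name changes accordingly. As a small illustration of the naming pattern, the hypothetical helper below (mgr_tra_pod, not a MATRIXX tool) falls back to ag1 when no name is given.

```shell
# Hypothetical helper: build the TRA manager pod name from an access
# group name ($1), defaulting to "ag1" when no argument is supplied.
mgr_tra_pod() {
  printf 'mgr-tra-%s\n' "${1:-ag1}"
}

# Usage against a live cluster, for an access group named ag2:
#   kubectl logs "$(mgr_tra_pod ag2)" --namespace matrixx
```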

If a TRA pod becomes unresponsive, the TRA manager begins the start-up sequence again to restart the pod. For each TRA pod, a probe confirms pod activity. If the probe fails, Kubernetes restarts the container in the unresponsive pod. Upon receipt of the failure event, the MATRIXX Engine controller revives the TRA manager to begin the TRA start-up sequence.