Debugging and Failure Recovery

By default, MATRIXX components store log files in persistent storage on the node, in a persistent volume (PV) at the location specified with the global.storage.localStorageDir property (/home/data by default).
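
For example, to confirm the configured value in a running deployment, you can inspect the installed Helm values; the release name in this example is a placeholder for your installation:

helm get values <release-name> --namespace matrixx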

Topology Operator

In Topology Operator-based deployments, Topology Operator manages MATRIXX Engine pods, performing all necessary create, update, and delete actions on MtxSubdomain CRs. For each MtxSubdomain CR, it creates, updates, or deletes sub-domain-level resources such as subdomain-operator and pricing-operator pods. For each MtxEngine CR, it creates, updates, or deletes engine-level resources such as engine-operator pods.
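
For example, to see the CRs that Topology Operator acts on, you can list them with kubectl; the lowercase plural resource names shown here are assumed from the CR kinds and might differ in your installation:

kubectl get mtxsubdomains --namespace matrixx
kubectl get mtxengines --namespace matrixx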

Topology Operator Log Output Directories (Table 1) describes the locations in shared logging storage where each component stores log files. Each directory is a subdirectory of the directory specified with the global.storage.localStorageDir property.

Table 1. Topology Operator Log Output Directories

Component            Directory
topology-operator    topology-operator
subdomain-operator   subdomain-operator-s<subdomainId>
pricing-operator     pricing-operator-s<subdomainId>
engine-operator      engine-operator-s<subdomainId>e<engineId>
pod-monitor          pod-monitor-s<subdomainId>e<engineId>
cluster-monitor      cluster-monitor-s<subdomainId>e<engineId>
topology-agent       A directory with the same name as the pod, for example: topology-agent-8fcc57755-2nn64
engine-starter       engine-starter-s<subdomainId>e<engineId>
engine-stopper       engine-stopper-s<subdomainId>e<engineId>
pre-update           topology-operator-pre-update
pre-delete           topology-operator-pre-delete

The topology-operator, subdomain-operator, pricing-operator, engine-operator, pod-monitor, cluster-monitor, and topology-agent logs rotate daily, with a maximum log file size of 100 MB. The engine-starter, engine-stopper, pre-update, and pre-delete components log to a new file each time they run.
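
As a quick check, you can list the log directories from any pod that mounts the shared logging volume; the pod name and mount path in this example are assumptions and depend on your deployment:

kubectl exec <topology-operator-pod> --namespace matrixx -- ls /home/data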

Engine Operator

In Engine Operator-based deployments, Engine Operator is effectively a Kubernetes controller: it processes events and reacts to them as needed. During deployment, the operator creates all required services, ConfigMaps, and pods. By default, Engine Operator logs are stored at /home/data/engine-operator-s<subdomainId>e<engineId>/mtx-engine-opr<index>.log.
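
For example, to follow the Engine Operator log for sub-domain 1, engine 1, you might run a command such as the following from a pod that mounts the logging volume; the pod name and log index here are assumptions:

kubectl exec <engine-operator-pod> --namespace matrixx -- tail -f /home/data/engine-operator-s1e1/mtx-engine-opr1.log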

When a pod in the engine fails and is restarted, Engine Operator revives Engine Manager to re-run the engine start-up sequence.

Important: Engine Operator is deprecated in this release and will be removed in a future release of MATRIXX.

Engine Manager

Engine Manager (mgr-engine-sXeY) runs the engine start-up sequence, which runs the create_config.py and start_engine.py scripts. If the engine does not start, check the log files at /home/data/mgr_engine-s<subdomainId>e<engineId>/mtx-engine-mgr<index>.log for diagnostic information.
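
You can also view the Engine Manager pod output directly with kubectl; for example, for sub-domain 1, engine 1, assuming the pod name follows the mgr-engine-sXeY pattern and the matrixx namespace:

kubectl logs mgr-engine-s1e1 --namespace matrixx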

When the start-up sequence is complete, the Engine Manager pod is in a completed state. If an engine pod becomes unresponsive, the Engine Manager begins the start-up sequence again to restart the pod.

For each pod in the engine, including TRA-PROC and TRA-PUB pods, a Kubernetes liveness probe confirms pod activity. If this probe fails, Kubernetes restarts the containers in the unresponsive pod. Upon receipt of the failure event, the Engine Operator revives Engine Manager to begin the engine start-up sequence.
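
To see liveness probe failures and the resulting container restarts, describe the affected pod and check its events and restart count; the pod name here is a placeholder:

kubectl describe pod <engine-pod-name> --namespace matrixx
kubectl get pods --namespace matrixx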

TRA-DR/TRA-RT Pods

The TRA-DR and TRA-RT pods run the TRA start-up sequence themselves, running the create_config.py script to generate the tra_config.xml and tra_config_network_topology.xml files, and then running the start_tra_node.py script.
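
If you need to confirm the generated configuration, you can read the files from a running TRA pod; the file path in this example is an assumption and might differ in your deployment:

kubectl exec tra-dr-ag1-0 --namespace matrixx -- cat /opt/mtx/conf/tra_config.xml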

If a TRA pod does not start, check the logs of the TRA-DR or TRA-RT pods for diagnostic information with the following commands:

kubectl logs tra-dr-ag1-0 --namespace matrixx
kubectl logs tra-rt-ag1-0 --namespace matrixx

When the start-up sequence is complete, the TRA pod is in a completed state.
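
You can confirm the pod state with kubectl, for example:

kubectl get pods tra-dr-ag1-0 tra-rt-ag1-0 --namespace matrixx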

Note: The access group name is specified with the global.accessGroup.id property in your Helm values file. The default value is ag1.
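
For example, to use a different access group name, you could set the property at install or upgrade time; the release and chart references and the ag2 value are placeholders:

helm upgrade <release-name> <chart> --namespace matrixx --reuse-values --set global.accessGroup.id=ag2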

For each TRA pod, a probe confirms pod activity. If the probe fails, Kubernetes restarts the container in the unresponsive pod, and the restarted pod runs the TRA start-up sequence again.
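
If a TRA container has been restarted, the logs from the previous container instance can usually still be retrieved, for example:

kubectl logs tra-dr-ag1-0 --previous --namespace matrixx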