Debugging and Failure Recovery
By default, MATRIXX components store log files on the node in a persistent volume (PV). The location is set with the global.storage.localStorageDir property, which defaults to /home/data.
Topology Operator
In Topology Operator-based deployments, Topology Operator manages MATRIXX Engine pods, performing all necessary create, update, or delete actions on MtxSubdomain CRs. It creates, updates, or deletes sub-domain-level resources for each MtxSubdomain CR such as subdomain-operator and pricing-operator pods. It also creates, updates, or deletes engine-level resources for each MtxEngine CR such as engine-operator pods.
Topology Operator Log Output Directories describes the locations in shared logging storage where each component stores log files. Each directory is a subdirectory of the directory specified with the global.storage.localStorageDir property.
Component | Directory |
---|---|
topology-operator | topology-operator |
subdomain-operator | subdomain-operator-s<subdomainId> |
pricing-operator | pricing-operator-s<subdomainId> |
engine-operator | engine-operator-s<subdomainId>e<engineId> |
pod-monitor | pod-monitor-s<subdomainId>e<engineId> |
cluster-monitor | cluster-monitor-s<subdomainId>e<engineId> |
topology-agent | Same name as the pod, for example topology-agent-8fcc57755-2nn64 |
engine-starter | engine-starter-s<subdomainId>e<engineId> |
engine-stopper | engine-stopper-s<subdomainId>e<engineId> |
pre-update | topology-operator-pre-update |
pre-delete | topology-operator-pre-delete |
The topology-operator, subdomain-operator, pricing-operator, engine-operator, pod-monitor, cluster-monitor, and topology-agent logs rotate daily, with a maximum file size of 100 MB. The engine-starter, engine-stopper, pre-update, and pre-delete components write to a new log file each time they run.
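The naming patterns in the table can be turned into a small lookup helper when you need to find a component's logs quickly. The following is a minimal sketch, not part of MATRIXX: the `log_dir` function and the `BASE_DIR` variable are hypothetical names, and `BASE_DIR` is assumed to match the value of global.storage.localStorageDir. The topology-agent component is left out because its directory takes the pod name rather than a fixed pattern.

```shell
#!/bin/sh
# Hypothetical helper: prints the log directory for a component, following the
# naming patterns in the Topology Operator Log Output Directories table.
# BASE_DIR is assumed to mirror global.storage.localStorageDir.
BASE_DIR="${BASE_DIR:-/home/data}"

log_dir() {
  component="$1"   # component name from the table
  subdomain="$2"   # subdomain ID (ignored where not needed)
  engine="$3"      # engine ID (ignored where not needed)
  case "$component" in
    topology-operator)
      echo "${BASE_DIR}/topology-operator" ;;
    subdomain-operator|pricing-operator)
      echo "${BASE_DIR}/${component}-s${subdomain}" ;;
    engine-operator|pod-monitor|cluster-monitor|engine-starter|engine-stopper)
      echo "${BASE_DIR}/${component}-s${subdomain}e${engine}" ;;
    pre-update|pre-delete)
      echo "${BASE_DIR}/topology-operator-${component}" ;;
    *)
      echo "unknown component: ${component}" >&2
      return 1 ;;
  esac
}
```

For example, `ls "$(log_dir engine-operator 1 1)"` would list the Engine Operator log files for subdomain 1, engine 1, assuming those IDs exist in your deployment.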
Engine Operator
In Engine Operator-based deployments, Engine Operator is effectively a Kubernetes controller, in that it processes events and reacts to them as needed. When deploying, the operator creates all the needed services, ConfigMaps, and pods. By default, Engine Operator logs are stored at /home/data/engine-operator-s<subdomainId>e<engineId>/mtx-engine-opr<index>.log.
When a pod in the engine fails and is restarted, Engine Operator revives the engine manager to re-run the engine start-up sequence.
Engine Manager
Engine Manager (mgr-engine-sXeY) runs the engine start-up sequence, running the create_config.py and start_engine.py scripts. If the engine does not start, check log files at /home/data/mgr_engine-s<subdomainId>e<engineId>/mtx-engine-mgr<index>.log for diagnostic information.
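Because the log file name carries an index, it can help to pick the most recently written file before reading it. The sketch below assumes you have shell access to the node that hosts the log directory; `latest_log` is a hypothetical helper, and the subdomain and engine IDs in the usage comment (s1e1) are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: prints the most recently modified file that matches
# <dir>/<prefix>*.log, or nothing if no file matches.
latest_log() {
  dir="$1"      # log directory
  prefix="$2"   # file name prefix, for example mtx-engine-mgr
  ls -t "$dir"/"$prefix"*.log 2>/dev/null | head -n 1
}

# Usage (run on the node; s1e1 is a placeholder for your subdomain and engine IDs):
#   tail -n 50 "$(latest_log /home/data/mgr_engine-s1e1 mtx-engine-mgr)"
```

The same helper should work for the Engine Operator logs by passing that component's directory and the mtx-engine-opr prefix.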
When the start-up sequence is complete, the Engine Manager pod is in a completed state. If an engine pod becomes unresponsive, the Engine Manager begins the start-up sequence again to restart the pod.
For each pod in the engine, including TRA-PROC and TRA-PUB pods, a Kubernetes liveness probe confirms pod activity. If this probe fails, Kubernetes restarts the containers in the unresponsive pod. Upon receipt of the failure event, the Engine Operator revives Engine Manager to begin the engine start-up sequence.
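One way to see whether liveness-probe restarts have occurred is to look at the RESTARTS column reported by `kubectl get pods`. The following sketch filters that output for pods with a non-zero restart count. `restarted_pods` is a hypothetical helper, and it assumes the default column layout (NAME, READY, STATUS, RESTARTS, AGE).

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints "<pod> <restarts>" for each pod whose RESTARTS count is above zero.
# Assumes the default columns: NAME READY STATUS RESTARTS AGE.
restarted_pods() {
  awk '$4 + 0 > 0 { print $1, $4 }'
}

# Usage (requires cluster access):
#   kubectl get pods --namespace matrixx --no-headers | restarted_pods
```

For a pod that appears in the output, `kubectl describe pod <pod-name> --namespace matrixx` lists recent events, which typically include any liveness probe failures.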
TRA-DR/TRA-RT Pods
The TRA-DR and TRA-RT pods run the TRA start-up sequence on themselves, running the create_config.py script to generate the tra_config.xml and tra_config_network_topology.xml files and running the start_tra_node.py script.
If a TRA pod does not start, check the logs of the TRA-DR or TRA-RT pods for diagnostic information with the following commands:
kubectl logs tra-dr-ag1-0 --namespace matrixx
kubectl logs tra-rt-ag1-0 --namespace matrixx
When the start-up sequence is complete, the TRA pod is in a completed state.
In the commands above, ag1 is the access group ID, which is set with the global.accessGroup.id property in your Helm values file. The default value is ag1.
For each TRA pod, a probe confirms pod activity. If the probe fails, Kubernetes restarts the container in the unresponsive pod, and the new pod runs the TRA start-up sequence.
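After a probe-triggered restart, the logs of the failed container are often more useful than those of its replacement. The sketch below builds a TRA pod name from the access group ID and shows, in a comment, how it might be combined with the `--previous` flag of `kubectl logs`, which returns the logs of the previous container instance. `tra_pod_name` is a hypothetical helper, and the name pattern (tra-<type>-<accessGroupId>-<ordinal>) is inferred from the example pod names in this section.

```shell
#!/bin/sh
# Hypothetical helper: builds a TRA pod name following the pattern seen in the
# example commands (tra-dr-ag1-0, tra-rt-ag1-0).
tra_pod_name() {
  tra_type="$1"          # dr or rt
  access_group="${2:-ag1}"  # value of global.accessGroup.id (ag1 by default)
  ordinal="${3:-0}"      # pod ordinal, assumed to start at 0
  echo "tra-${tra_type}-${access_group}-${ordinal}"
}

# Usage (requires cluster access):
#   kubectl logs "$(tra_pod_name dr)" --namespace matrixx --previous
```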