Reconciliation Errors

Operators sometimes log errors while they reconcile the current state with the target state. In certain cases, these errors can be safely ignored.

Consider the following error from the engine-operator logs:

YYYY-MM-DD HH:MM:SS.SSS | ERROR | engine-operator | Failed to update MtxEngine Status | {"MtxEngine": "matrixx/engine-s1e1", "error": "Operation cannot be fulfilled on mtxengines.matrixx.matrixx.com \"engine-s1e1\": the object has been modified; please apply your changes to the latest version and try again"}
YYYY-MM-DD HH:MM:SS.SSS | ERROR | engine-operator | Reconciler error | {"error": "Operation cannot be fulfilled on mtxengines.matrixx.matrixx.com \"engine-s1e1\": the object has been modified; please apply your changes to the latest version and try again"}

This is an example of an error that can be ignored. The reason becomes clear when the error is viewed in the context of the surrounding log entries:

YYYY-MM-DD HH:MM:SS.SSS | INFO | engine-operator | Entering reconcile loop | {"MtxEngine": "matrixx/engine-s1e1"}
YYYY-MM-DD HH:MM:SS.SSS | INFO | engine-operator | Setting MtxEngine Status.State and Status.StateTimestamp | {"MtxEngine": "matrixx/engine-s1e1", "State": "waiting-start", "StateTimestamp": "YYYY-MM-DD HH:MM:SS +0000 UTC"}
YYYY-MM-DD HH:MM:SS.SSS | INFO | engine-operator | Updating MtxEngine Status | {"MtxEngine": "matrixx/engine-s1e1"}
YYYY-MM-DD HH:MM:SS.SSS | INFO | engine-operator | Updated MtxEngine Status | {"MtxEngine": "matrixx/engine-s1e1"}
YYYY-MM-DD HH:MM:SS.SSS | INFO | engine-operator | Entering reconcile loop | {"MtxEngine": "matrixx/engine-s1e1"}
YYYY-MM-DD HH:MM:SS.SSS | INFO | engine-operator | Setting MtxEngine Status.State and Status.StateTimestamp | {"MtxEngine": "matrixx/engine-s1e1", "State": "waiting-start", "StateTimestamp": "YYYY-MM-DD HH:MM:SS +0000 UTC"}
YYYY-MM-DD HH:MM:SS.SSS | INFO | engine-operator | Updating MtxEngine Status | {"MtxEngine": "matrixx/engine-s1e1"}
YYYY-MM-DD HH:MM:SS.SSS | ERROR | engine-operator | Failed to update MtxEngine Status | {"MtxEngine": "matrixx/engine-s1e1", "error": "Operation cannot be fulfilled on mtxengines.matrixx.matrixx.com \"engine-s1e1\": the object has been modified; please apply your changes to the latest version and try again"}
YYYY-MM-DD HH:MM:SS.SSS | ERROR | engine-operator | Reconciler error | {"error": "Operation cannot be fulfilled on mtxengines.matrixx.matrixx.com \"engine-s1e1\": the object has been modified; please apply your changes to the latest version and try again"}
YYYY-MM-DD HH:MM:SS.SSS | INFO | engine-operator | Entering reconcile loop | {"MtxEngine": "matrixx/engine-s1e1"}
YYYY-MM-DD HH:MM:SS.SSS | INFO | engine-operator | Nothing to do | {"MtxEngine": "matrixx/engine-s1e1", "State": "waiting-start"}

You can see that:

The engine-operator entered the reconcile loop, attempted to update the status.state of the MtxEngine custom resource (CR) to waiting-start, and succeeded.
The engine-operator entered the reconcile loop again, attempted to update the status.state of the MtxEngine CR to waiting-start, and failed.
The engine-operator entered the reconcile loop a third time, saw that the status.state of the MtxEngine CR was already waiting-start, and found nothing to do because it was waiting for the pod-monitor.

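This appears to be a race condition between the engine-operator and the Kubernetes API. The error text is the standard Kubernetes response to an update that is based on an out-of-date version of an object. The three reconcile passes play out as follows: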

The engine-operator reads the current state and updates the MtxEngine CR as expected.
The engine-operator reads the current state but receives the old (pre-update) MtxEngine CR. It attempts the same update again, which fails because the engine-operator is acting on out-of-date information.
The engine-operator reads the current state, receives the new (post-update) MtxEngine CR, and handles it as expected.
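
The sequence above can be modelled in a few lines of Python. This is a minimal sketch under stated assumptions, not the engine-operator's implementation: FakeApiServer, Conflict, reconcile, and run_example are hypothetical names, and the integer resourceVersion is a simplification of the version field that Kubernetes compares when it decides whether to reject an update.

```python
class Conflict(Exception):
    """Raised when an update is based on an out-of-date copy of the object."""


class FakeApiServer:
    """Toy stand-in for the Kubernetes API holding a single object (assumption)."""

    def __init__(self):
        self.obj = {"resourceVersion": 1, "state": "created"}

    def get(self):
        # Return a copy so callers cannot mutate the stored object directly.
        return dict(self.obj)

    def update_status(self, obj):
        # Accept the write only if the caller saw the latest version.
        if obj["resourceVersion"] != self.obj["resourceVersion"]:
            raise Conflict("the object has been modified")
        self.obj = {
            "resourceVersion": obj["resourceVersion"] + 1,
            "state": obj["state"],
        }


def reconcile(server, observed, target="waiting-start"):
    """One reconcile pass, driven by the copy of the object that was observed."""
    if observed["state"] == target:
        return "Nothing to do"
    desired = dict(observed, state=target)
    try:
        server.update_status(desired)
    except Conflict:
        return "Failed to update"
    return "Updated"


def run_example():
    server = FakeApiServer()
    stale = server.get()  # copy taken before the first update lands
    return [
        reconcile(server, server.get()),  # pass 1: fresh read, update succeeds
        reconcile(server, stale),         # pass 2: pre-update copy, conflict
        reconcile(server, server.get()),  # pass 3: fresh read, already in target state
    ]
```

Calling run_example() reproduces the three outcomes seen in the log excerpt: a successful update, a failed update, and a pass with nothing to do. The failed pass changes nothing on the server, which is why the error is harmless in this case.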

Despite the error, the operator recovers and continues as expected. Reconciliation errors are a problem only if an operator encounters the same error repeatedly. After many successive failed attempts at the same action, the operator eventually stops trying. In that case, manual intervention is required to diagnose and fix the underlying issue. If the fix is not picked up automatically, you might also need to delete the operator pod to trigger a new reconciliation.
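
To tell an isolated error from a repeating one, it can help to count how many reconcile attempts in a row end with the same error. The following sketch assumes the pipe-separated log layout shown in the excerpts above (timestamp, level, component, message, JSON payload). The function names and the threshold of 5 are illustrative assumptions; the actual number of attempts after which an operator stops retrying is not stated here.

```python
import json


def parse_line(line):
    """Split 'timestamp | LEVEL | component | message | {json}' into a dict."""
    parts = [part.strip() for part in line.split(" | ", 4)]
    if len(parts) != 5:
        return None  # not in the expected layout
    _timestamp, level, component, message, payload = parts
    try:
        fields = json.loads(payload)
    except json.JSONDecodeError:
        fields = {}
    return {"level": level, "component": component,
            "message": message, "fields": fields}


def longest_error_streak(lines):
    """Longest run of consecutive reconcile attempts ending in the same error."""
    longest = streak = 0
    last_error = None
    failed = True  # no attempt seen yet, so there is nothing to reset
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue
        if record["message"] == "Entering reconcile loop":
            # The previous attempt finished without an error: the run is broken.
            if not failed:
                streak, last_error = 0, None
            failed = False
        elif record["message"] == "Reconciler error":
            error = record["fields"].get("error")
            streak = streak + 1 if error == last_error else 1
            last_error = error
            failed = True
            longest = max(longest, streak)
    return longest


def needs_attention(lines, threshold=5):
    """True when the same error repeats at least `threshold` times (assumed value)."""
    return longest_error_streak(lines) >= threshold
```

Applied to the excerpt earlier in this section, the longest streak is 1, because the attempt after the failure completes without error. A streak that keeps growing for the same resource is the pattern that warrants investigation.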