Address High-Priority Errors

Address any errors encountered during primary engine start-up.

About this task

Most of these tasks can be performed in parallel to optimize recovery time.

Procedure

  1. Triage and address debug log errors.
    Focus on high-priority anomalies that are likely to cause problems when traffic bypass is lifted, and defer lower-priority errors until later. Watch for errors whose type or volume is inconsistent with the accepted, known business-as-usual (BAU) errors. To identify anomalies, compare current error levels with the levels characteristic of your installation, accounting for time of day and other circumstances.
    Note: Characterization of accepted error levels for your installation should already be documented as part of recommended operational practices.
  2. If an engine restart is required because of critical errors, first stop the secondary (standby) engine if it has been started. This avoids fail-over to an out-of-sync engine. Addressing high-priority errors in this scenario may include verifying that checkpoint and transaction logs remain correctly in place.
  3. Monitor resource usage and take action where monitoring indicates potential problems for post-bypass stability.
  4. Use print_blade_stats.py to review system statistics and identify problematic anomalies.
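
Example

The comparison against accepted error levels described in step 1 could be scripted roughly as follows. This is a minimal sketch, not part of the product: the log line layout (`<timestamp> ERROR <TYPE> <message>`), the error type names, the baseline counts, and the `tolerance` factor are all assumptions made for illustration. Substitute the format of your own debug log and the baseline documented for your installation.

```python
from collections import Counter

# Hypothetical baseline: accepted (BAU) error counts for the comparison window,
# keyed by error type. Real values come from your installation's documentation.
BASELINE = {"TIMEOUT": 40, "RETRY": 120}


def count_errors(log_lines):
    """Count errors by type.

    Assumes each line looks like '<timestamp> ERROR <TYPE> <message>';
    this layout is an assumption, not the actual debug log format.
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split(maxsplit=3)
        if len(parts) >= 3 and parts[1] == "ERROR":
            counts[parts[2]] += 1
    return counts


def find_anomalies(counts, baseline, tolerance=1.5):
    """Return (type, count, reason) for error types that are absent from the
    baseline or exceed baseline * tolerance, highest volume first."""
    anomalies = []
    for err_type, seen in counts.items():
        expected = baseline.get(err_type)
        if expected is None:
            anomalies.append((err_type, seen, "not in baseline"))
        elif seen > expected * tolerance:
            anomalies.append((err_type, seen, f"exceeds baseline of {expected}"))
    return sorted(anomalies, key=lambda a: a[1], reverse=True)
```

Passing the lines of a debug log extract to `count_errors` and the result to `find_anomalies` yields a short list to triage first. Error types that stay within the tolerance are omitted, which matches the guidance to defer lower-priority errors. A baseline that varies by time of day would need one table per period.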

What to do next

Repeat these steps until all issues are resolved and sanity testing confirms the resolution.