Analyzing the Standby Cluster Initialization Phase

When a cluster is in the HA state INIT, it is initializing its databases, either from a checkpoint or transaction log file (when the cluster is becoming the standby cluster) or by replaying the in-memory databases (when the cluster is becoming the active cluster). During this time, messages are written to the mtx_debug.log file. Some of these messages indicate errors and others are informational. The following subsections describe these messages.

Cannot Initialize Databases

If the databases cannot be initialized, the replay operation fails. When this occurs, LM_CRITI messages similar to the following are written to the log:
LM_CRITI 29584|29769 2016-01-14 15:47:57.236605 [transaction_server_2:1:1:1(4700.36371)] | TXN1-TransactionManager:transaction_manager_task::TransactionManagerTask::handleInitDatabaseMsg: Could not initialize databases: result=14, message =DataContainer: size=1131, bufSize=1430, bufMaxSize=32768, this=0x7f78dc4c2a48
  descName=MtxInitDatabaseMsg(363,4700,3), descriptorPtr=0x1921ba0, version=1, flags=0, fields=8
  RDM id=6:-:2:4928, mtxBufPtr=0x7f78dc4c2a00, dataPtr=0x7f78dc4c2a8a, baseContainerPtr=0x7f78dc4c2ac1
  idx name                           type      L A M P offset  maxSz value
    0 FromEcbsmiId                   UINT64    0 0 0 1      0      8 36046457925533697 ([2:1:1:1:0:1])
    1 InitDatabaseOp                 UINT32    0 0 0 1      8      4 2
    2 InitDatabaseResult             UINT32    0 0 0 1     12      4 14
    3 InitDatabaseResultDetail       UINT32    0 0 0 1     16      4 0
    4 InitDatabaseResultText         STRING    0 0 0 1    530      0 (size=180, data=TXN1-CheckpointManager:checkpoint_manager_task::CheckpointManagerTask::waitForReplayToComplete: Replay to [2:1:1:1:0:1] was aborted with 5281658 objects remaining to be completed.)
    5 InitDatabaseResultFieldKey     FIELD_KEY 0 0 0 1     20     35 0.0:0.0(Unknown)
    6 InitDatabaseResultData         BLOB      0 0 0 0      0      0
    7 MaxObjectIdList                OBJECT_ID 1 0 0 0      0      8
---BaseDataContainer---
DataContainer: size=828, bufSize=1430, bufMaxSize=32768, this=0x7f78dc4c2ac1
  descName=MtxWorkOrderMsg(140,4700,3), descriptorPtr=0x1ae2fb0, version=1, flags=0, fields=25
  RDM id=6:-:2:4928, mtxBufPtr=0x7f78dc4c2a00, dataPtr=0x7f78dc4c2b58, baseContainerPtr=0x7f78dc4c2bc0
  idx name                           type      L A M P offset  maxSz value
    0 ReplayTimeArray                UINT64    0 1 0 0      0      8
    1 TxnId                          STRUCT    0 0 0 0      0      0
    2 TxnResult                      UINT32    0 0 0 0      0      4
    3 TxnParticipantSet              UINT64    0 0 0 0      4      8
    4 TxnInQueueId                   UINT32    0 0 0 0     12      4
    5 TxnToEcbsmiId                  UINT64    0 0 0 0     16      8
    6 ApplicationMsgRdmId            UINT64    0 0 0 0     24      8
    7 TxnConditionList               STRUCT    1 0 0 0      0      0
    8 TxnActionList                  STRUCT    1 0 0 0      0      0
    9 ResendCount                    UINT32    0 0 0 0     32      4
   10 LogFileTime                    UINT32    0 0 0 0     36      4
   11 LogFileSequenceId              UINT32    0 0 0 0     40      4
   12 LogBufferId                    UINT32    0 0 0 0     44      4
   13 TotalTxnCountInLogBuffer       UINT32    0 0 0 0     48      4
   14 TotalTxnCountInLogFile         UINT32    0 0 0 0     52      4
   15 ReplayEcbsmiId                 UINT64    0 0 0 0     56      8
   16 ReplayContextId                UINT32    0 0 0 0     64      4
   17 SourceLogEcbsmiId              UINT64    0 0 0 0     68      8
   18 SourceLogFileTime              UINT32    0 0 0 0     76      4
   19 SourceLogFileSequenceId        UINT32    0 0 0 0     80      4
   20 ResendTargetParticipantSet     UINT64    0 0 0 0     84      8
   21 GlobalTxnCounter               UINT64    0 0 0 1     92      8 193604888
   22 TxnAuditList                   STRUCT    1 0 0 0      0      0
   23 Flags                          UINT32    0 0 0 0    100      4
 ---BaseDataContainer---
DataContainer: size=573, bufSize=1430, bufMaxSize=32768, this=0x7f78dc4c2bc0
  descName=MtxMsg(93,4700,3), descriptorPtr=0x1a493e0, version=1, flags=0, fields=19
  RDM id=6:-:2:4928, mtxBufPtr=0x7f78dc4c2a00, dataPtr=0x7f78dc4c2c39, baseContainerPtr=0
  idx name                           type      L A M P offset  maxSz value
    0 ReceiveTime                    DATETIME  0 0 0 1      0     12 2016-01-14T23:47:57.236015Z
    1 GatewaySocketId                INT32     0 0 0 1     12      4 268697600
    2 Op                             UINT32    0 0 0 0     16      4
    3 TimeArray                      UINT64    0 1 0 1    580      8 {maxElements=39:1452815277236015, 1452815277236015, 1452815277236366, , , , , , , 1452815277236368, 1452815277236444, , , , , , , , , , , , , , , , , , , , , , , , , , , , }
    4 ChrgInQueueId                  UINT32    0 0 0 1     20      4 1
    5 MtxParticipantTimeInfoList     STRUCT    1 0 0 0      0      0
    6 TxnMsgRdmId                    UINT64    0 0 0 0     24      8
    7 Result                         UINT32    0 0 0 0     32      4
    8 ResultDetail                   UINT32    0 0 0 0     36      4
    9 ResultText                     STRING    0 0 0 0      0      0
   10 ResultFieldKey                 FIELD_KEY 0 0 0 0     40     35
   11 ResultData                     BLOB      0 0 0 0      0      0
   12 ProxySocketId                  INT32     0 0 0 0     75      4
   13 HopByHopId                     UINT32    0 0 0 1     79      4 308
   14 EndToEndId                     UINT32    0 0 0 1     83      4 308
   15 AfterTxnMsgRdmIdList           UINT64    1 0 0 0      0      8
   16 OriginalResult                 UINT32    0 0 0 0     87      4
   17 DiamResult                     UINT32    0 0 0 0     91      4
   18 TraceFlags                     UINT32    0 0 0 0     95      4
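
The following is a minimal sketch, not a supported tool, of how such failures could be picked out of mtx_debug.log. The regular expressions are inferred only from the sample message above, so the header layout, the "result=" text, and the "objects remaining" wording are assumptions that may not hold for other releases. The function and field names (find_init_failures, objects_remaining) are hypothetical.

```python
import re

# Header layout inferred from the sample above (an assumption):
# LEVEL pid|tid YYYY-MM-DD HH:MM:SS.ffffff [process(version)] | text
HEADER = re.compile(
    r"^(?P<level>LM_[A-Z]+)\s+(?P<pid>\d+)\|(?P<tid>\d+)\s+"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s*"
    r"\[(?P<process>[^\](]+)\((?P<version>[^)]*)\)\]\s*\|\s*(?P<text>.*)$"
)
INIT_FAILURE = re.compile(r"Could not initialize databases: result=(?P<result>\d+)")
REPLAY_ABORTED = re.compile(r"was aborted with (?P<remaining>\d+) objects remaining")


def find_init_failures(lines):
    """Return one dict per LM_CRITI database initialization failure."""
    failures = []
    current = None  # failure whose continuation lines are being read
    for line in lines:
        header = HEADER.match(line)
        if header:
            # A new log message starts here; stop reading the previous dump.
            current = None
            failure = INIT_FAILURE.search(header.group("text"))
            if header.group("level") == "LM_CRITI" and failure:
                current = {
                    "timestamp": header.group("timestamp"),
                    "process": header.group("process"),
                    "result": int(failure.group("result")),
                    "objects_remaining": None,
                }
                failures.append(current)
        elif current is not None and current["objects_remaining"] is None:
            # Continuation line of the container dump (for example,
            # the InitDatabaseResultText field).
            aborted = REPLAY_ABORTED.search(line)
            if aborted:
                current["objects_remaining"] = int(aborted.group("remaining"))
    return failures
```

Passing an open file object for the log (for example, the result of open() on the mtx_debug.log path used in your installation) works because the function only iterates over lines. For the sample above, it would report result 14 and 5281658 objects remaining.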

Failure to Join the Topology

If a server tries to join the topology while transactions are pending during the INIT state, the join operation fails. In such cases, errors similar to the following are logged on an existing server:
LM_ERROR 23803|43133 2015-06-01 19:30:24.735193 [transaction_server_2:1:2:1(4603.32039)] | 
TXN2-TransactionManager:transaction_manager_task::TransactionManagerTask::waitUntilAllTransactionsCompleted:
failed to wait for all transactions that have old topology without [id: 8, ECBSMI: [2:1:8:1:0:1]] to complete
In addition, errors similar to the following are logged on the server trying to join the cluster:
LM_ERROR 8224|27799 2015-06-01 19:45:22.006419 [transaction_server_2:1:8:1(4603.32039)] | FsmTxnSvrStateBase::handleKeepAliveEvent: timed out in database synchronization
LM_ERROR 30699|30707 2015-06-01 19:45:22.006877 [cluster_manager_2:1:8:1(4603.32039)] | FsmNodeStateBase::transitionOnFatalError: invalid or fatal event 'TransactionServiceFailure' in state SYNC @state=SYNC
In such cases, wait until the cluster is synchronized and then try to add the server again.
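
As a rough illustration, the two error signatures above could be detected in collected log text with a sketch like the following. The patterns are taken only from the sample messages in this section and are assumptions about the general message format. The function name scan_join_failures and the result keys are hypothetical.

```python
import re

# Logged on an existing server; wording copied from the sample above.
JOINING_SERVER = re.compile(
    r"old topology without \[id: (?P<id>\d+), ECBSMI: \[(?P<ecbsmi>[\d:]+)\]\]"
)
# Logged on the server that is trying to join.
SYNC_TIMEOUT = "timed out in database synchronization"


def scan_join_failures(log_text):
    """Summarize join-failure signatures found in a block of log text."""
    # Collapse all whitespace so that a message wrapped across several
    # lines is matched as if it were on a single line.
    flat = " ".join(log_text.split())
    blocked = [
        (int(match.group("id")), match.group("ecbsmi"))
        for match in JOINING_SERVER.finditer(flat)
    ]
    return {"blocked_joins": blocked, "sync_timeouts": flat.count(SYNC_TIMEOUT)}
```

A non-empty blocked_joins list or a non-zero sync_timeouts count suggests that the join was attempted before the cluster finished synchronizing, which is the case the retry advice above addresses.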

Pending Transactions

If a processing server on the cluster that is transitioning to the standby cluster receives a parallel balance transaction to replay before it has received the checkpoint transaction for the same balance set object, it saves the parallel balance transaction in a pending state for a short period of time. When the processing server receives the checkpoint transaction for this balance set object, that checkpoint transaction provides the absolute balance to which the pending difference is applied. When this occurs, LM_INFO messages similar to the following are written to the log. These messages do not indicate runtime issues. Instead, they indicate that the standby cluster is catching up to the active cluster:

LM_INFO 63530|16816 2015-06-01 19:17:28.793198 [transaction_server_2:1:1:1(4603.32039)] | 
TXN1-TransactionManager:transaction_manager_task::TransactionManagerTask::printPendingTxnSummary:
number of pending transactions per blade={blade 5=2}
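
To track whether the standby cluster is catching up, the per-blade counts in this summary could be extracted with a sketch such as the following. It assumes that every entry inside the braces has the form "blade N=M", as in the single-blade sample above. How multiple blades are separated is not shown in the sample, so the code does not depend on the separator. The function name pending_per_blade is hypothetical.

```python
import re

# Summary text copied from the sample message above.
SUMMARY = re.compile(
    r"number of pending transactions per blade=\{(?P<body>[^}]*)\}"
)
# One "blade N=M" entry; the separator between entries is ignored.
ENTRY = re.compile(r"blade\s+(\d+)\s*=\s*(\d+)")


def pending_per_blade(log_text):
    """Return {blade_id: pending_count} for the first summary found."""
    summary = SUMMARY.search(log_text)
    if summary is None:
        return {}
    return {
        int(blade): int(count)
        for blade, count in ENTRY.findall(summary.group("body"))
    }
```

Counts that shrink across successive summaries are consistent with the standby cluster catching up.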