kaikulimu
Collaborator

@kaikulimu kaikulimu commented Sep 30, 2025

Today in the Cluster FSM, the leader becomes healed when the CSL commit of the first leader advisory succeeds, at which point it initializes the queue key info map on the cluster thread. Near-identical logic exists on the follower node.

At the same time, the CSL commit callback fires and triggers the onPartitionPrimaryAssignment observer, which jumpstarts the Partition FSM. The Partition FSM will eventually attempt to open the FileStore on the partition thread, and doing so accesses the queue key info map. Because that access is not ordered after the map's initialization on the cluster thread, there is a small chance of a race condition, so let's fix it.

The fix consists of two parts:

  1. Let the Cluster FSM be the one to jumpstart the Partition FSMs. Hence, the action do_stopWatchDog_initializeQueueKeyInfoMap becomes do_stopWatchDog_initializeQueueKeyInfoMap_jumpstartPartitionFSMs.
  2. Remove the actions do_storePartitionInfo and do_clearPartitionInfo from the Partition FSMs. The logic is cleaner if we update the partition information before triggering the corresponding transitions in the Partition FSMs.
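
To illustrate the intended ordering, here is a minimal single-threaded sketch, not the actual mqbc code. Every name in it (ClusterModel, healAndJumpstart, onPartitionEvent, and so on) is hypothetical, and the partition thread is modeled as a plain task queue rather than a real dispatcher. The point is only that the cluster-side action finishes initializing its state and storing the partition info before it posts anything to the partition side:

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; these are NOT mqbc types.
struct PartitionInfo {
    int primaryNodeId;
    int primaryLeaseId;
};

struct ClusterModel {
    bool queueKeyInfoMapReady = false;                 // models the queue key info map
    std::map<int, PartitionInfo> partitions;           // partition info kept by the cluster
    std::deque<std::function<void()>> partitionQueue;  // models the partition thread
    std::vector<std::string> trace;                    // observed order of steps
};

// Models the Partition FSM starting up and trying to open its file store,
// which needs the queue key info map and the stored partition info.
void onPartitionEvent(ClusterModel& c, int partitionId) {
    const bool ok = c.queueKeyInfoMapReady && c.partitions.count(partitionId) > 0;
    c.trace.push_back(std::string(ok ? "open-ok:" : "open-raced:") +
                      std::to_string(partitionId));
}

// Models the combined cluster-side action: initialize the map, store the
// partition info, and only then jumpstart the Partition FSMs.
void healAndJumpstart(ClusterModel& c,
                      const std::map<int, PartitionInfo>& advisory) {
    c.queueKeyInfoMapReady = true;  // step 1: initialize the queue key info map
    c.trace.push_back("cluster-healed");
    c.partitions = advisory;        // step 2: store partition info up front
    for (const auto& entry : advisory) {
        const int partitionId = entry.first;
        // step 3: post the event last, so it can only run after steps 1 and 2
        c.partitionQueue.push_back(
            [&c, partitionId] { onPartitionEvent(c, partitionId); });
    }
}

// Runs everything posted to the modeled partition thread, in FIFO order.
void drainPartitionQueue(ClusterModel& c) {
    while (!c.partitionQueue.empty()) {
        std::function<void()> task = std::move(c.partitionQueue.front());
        c.partitionQueue.pop_front();
        task();
    }
}

// Drives the sketch with two partitions whose primary is node 2.
std::vector<std::string> runSketch() {
    ClusterModel c;
    healAndJumpstart(c, {{0, {2, 2}}, {1, {2, 2}}});
    drainPartitionQueue(c);
    return c.trace;
}
```

Calling runSketch() yields a trace in which "cluster-healed" precedes every partition entry. In the real code the two sides run on different threads, so the guarantee would have to come from the dispatcher's enqueue semantics rather than from a local queue as assumed here.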

After the fix, we can observe that the events happen in the correct order -- Cluster FSM becomes healed before any Partition FSM can start:

(TODO Update below logs with the logs from the latest fix)

30SEP2025_19:19:03.026 (6167932928) INFO mqbc_clusterstatemanager.cpp:1051 Cluster (c2x2): Committed advisory: [ rId = NULL choice = [ clusterMessage = [ choice = [ leaderAdvisory = [ sequenceNumber = [ electorTerm = 2 sequenceNumber = 1 ] partitions = [ [ partitionId = 0 primaryNodeId = 2 primaryLeaseId = 2 ] [ partitionId = 1 primaryNodeId = 2 primaryLeaseId = 2 ] [ partitionId = 2 primaryNodeId = 2 primaryLeaseId = 2 ] [ partitionId = 3 primaryNodeId = 2 primaryLeaseId = 2 ] ] queues = [ ] ] ] ] ] ], with status 'SUCCESS'
30SEP2025_19:19:03.026 (6167932928) INFO mqbc_clusterfsm.cpp:98 Cluster FSM on Event 'CSL_CMT_SUCCESS', transition: State 'LDR_HEALING_STG2' =>  State 'LDR_HEALED'
30SEP2025_19:19:03.027 (6167932928) INFO mqbc_storagemanager.cpp:392 Cluster (c2x2) Partition [0]: Self Transition to Primary in the Partition FSM.
30SEP2025_19:19:03.027 (6166212608) INFO mqbc_partitionfsm.cpp:76 Partition FSM for Partition [0] on Event 'DETECT_SELF_PRIMARY', transition: State 'UNKNOWN' =>  State 'PRIMARY_HEALING_STG1'

@kaikulimu kaikulimu requested a review from a team as a code owner September 30, 2025 19:26
@kaikulimu kaikulimu changed the title from "Fix mqbc: Cluster FSM must heal before starting Partition FSMs" to "WIP Fix mqbc: Cluster FSM must heal before starting Partition FSMs" Sep 30, 2025
@kaikulimu kaikulimu self-assigned this Sep 30, 2025
Signed-off-by: Yuan Jing Vincent Yan <[email protected]>