WIP Fix mqbc: Cluster FSM must heal before starting Partition FSMs #951
Today, in the Cluster FSM, the leader becomes healed when the CSL commit of the first leader advisory succeeds, at which point it initializes the queue key info map on the cluster thread. Near-identical logic exists on the follower node.
At the same time, the CSL commit callback fires and triggers the `onPartitionPrimaryAssignment` observer, which jumpstarts the Partition FSM. The Partition FSM will eventually attempt to open the FileStore on the partition thread, which accesses the queue key info map. Because the map is initialized on the cluster thread while the FileStore is opened on the partition thread, there is a small chance of a race condition. This PR fixes it.

The fix consists of two parts:
1. `do_stopWatchDog_initializeQueueKeyInfoMap` becomes `do_stopWatchDog_initializeQueueKeyInfoMap_jumpstartPartitionFSMs`.
2. `do_storePartitionInfo` and `do_clearPartitionInfo` will be removed from the Partition FSMs. The logic is cleaner if we update the partition information before triggering the corresponding transitions in the Partition FSMs.

After the fix, the events happen in the correct order, with the Cluster FSM becoming healed before any Partition FSM can start:
(TODO Update below logs with the logs from the latest fix)
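
For reviewers, the ordering this change is meant to guarantee can be sketched roughly as follows. This is a simplified, single-threaded illustration and not the actual mqbc code: `ClusterSketch`, `queueKeyInfoMapReady`, `openFileStore`, and the `events` log are hypothetical stand-ins, and the real jumpstart presumably dispatches work to the partition threads rather than calling into them directly.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the cluster state; not a real mqbc class.
struct ClusterSketch {
    bool                     queueKeyInfoMapReady = false;
    std::vector<std::string> events;  // order in which steps ran

    // Stand-in for a Partition FSM opening its FileStore, which reads the
    // queue key info map.  Returns whether the map was ready at that point.
    bool openFileStore(int partitionId)
    {
        events.push_back("partition " + std::to_string(partitionId) +
                         (queueKeyInfoMapReady ? " opened" : " raced"));
        return queueKeyInfoMapReady;
    }

    // Mirrors the shape of the combined action: the Cluster FSM finishes
    // healing (watchdog stopped, map initialized) and only then jumpstarts
    // the Partition FSMs.
    void stopWatchDog_initializeQueueKeyInfoMap_jumpstartPartitionFSMs(
        int numPartitions)
    {
        events.push_back("watchdog stopped");
        queueKeyInfoMapReady = true;
        events.push_back("queue key info map initialized");
        for (int pid = 0; pid < numPartitions; ++pid) {
            openFileStore(pid);  // assumed to be asynchronous in practice
        }
    }
};
```

In this sketch, calling `openFileStore` before the combined action has run records a "raced" event, which corresponds to the window the fix is intended to close.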