Replace member pre-check does not support non-3-replica env #738

@Besroy

Description

In a 5-replica group, the replace member pre-check allows the operation to start even when there are only 2 active peers. However, once the active member being replaced is flipped to a learner, it no longer participates in elections, so the commit quorum requirement can no longer be met. Consequently, subsequent logs (e.g., set_priority) cannot be committed, and the replace member operation is canceled on timeout, returning an INVALID_ARG error.
Consider the following example:

  1. Five replicas, M1 through M5, exist. Replicas M1, M2, and M3 have a committed LSN of 1000. Replicas M4 and M5 have a committed LSN of 100. M1 is the leader.
  2. M2 is selected for removal and set as a learner with LSN 1001. After the flip is complete, M1, M2, and M3 have LSN 1001, while M4 and M5 have LSN 100.
  3. M2's priority is set to 0 with LSN 1002, but this change cannot be committed because only two active peers remain, which is not enough for a quorum. (Only M1 and M3 are active. M4 and M5 are behind, and M2 is a learner.)
  4. The learner status check fails because the learner's priority is still non-zero when the timeout expires, so the replace member operation is cancelled.
    More details in issue51
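
The quorum arithmetic behind the example can be sketched as follows. This is only an illustrative model, not the project's code: the `Replica` type and the `majority`, `can_commit`, and `precheck_replace_member` helpers are hypothetical names, and the model assumes that a commit needs acks from a majority of voting (non-learner) members and that a peer can ack an entry only if it already holds the preceding LSN.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    name: str
    lsn: int               # last LSN this replica holds
    learner: bool = False  # assumption: learners do not count toward quorum


def majority(voters: int) -> int:
    """Smallest number of acks that forms a majority of `voters`."""
    return voters // 2 + 1


def can_commit(replicas: List[Replica], lsn: int) -> bool:
    """True if enough caught-up voting members can ack the entry at `lsn`."""
    voters = [r for r in replicas if not r.learner]
    # Simplification: a voter can ack `lsn` only if it already holds lsn - 1.
    acks = sum(1 for r in voters if r.lsn >= lsn - 1)
    return acks >= majority(len(voters))


def precheck_replace_member(replicas: List[Replica], member_out: str) -> bool:
    """Hypothetical pre-check: can logs still commit once member_out is a learner?"""
    after = [Replica(r.name, r.lsn, r.learner or r.name == member_out)
             for r in replicas]
    next_lsn = max(r.lsn for r in replicas) + 1
    return can_commit(after, next_lsn)


# Step 1: M1..M3 are at LSN 1000, M4 and M5 lag at LSN 100.
group = [Replica("M1", 1000), Replica("M2", 1000), Replica("M3", 1000),
         Replica("M4", 100), Replica("M5", 100)]

# A pre-check that accounts for the flip would reject the operation up front.
print(precheck_replace_member(group, "M2"))  # False

# Step 2: the flip at LSN 1001 commits, since 3 of 5 voters are caught up.
print(can_commit(group, 1001))  # True
for r in group[:3]:
    r.lsn = 1001
group[1].learner = True  # M2 is now a learner

# Step 3: set_priority at LSN 1002 has only 2 acks (M1, M3) out of 4 voters.
print(can_commit(group, 1002))  # False
```

Under these assumptions, a pre-check that evaluates the quorum as it will be after the flip, rather than before it, would refuse to start the replacement in this scenario instead of letting it time out.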
