Description
Overview of the Issue
We've run into a case where a PlannedReparentShard operation failed due to a timeout while trying to close the TableGC pool during the DemotePrimary step. The tablet is now in an undefined state and can't be demoted, because every attempt fails with the same timeout error during DemotePrimary.
Debugging this, we believe the cause is that TableGC re-uses the same Pool instance. This is problematic: if Pool.Close fails once, the pool is left in an undefined state, and re-opening and closing it will keep failing because the active connection count on the pool instance is out of sync with the "actual" pool state.
I feel like re-using a connection pool instance after calling .Close, especially if that .Close call ran into a timeout, is not safe.
I believe after calling .Close, we should remove any references to the pool, and create a new pool instance when the pool needs to be reopened so that we start with a fresh connection pool instance.
Reproduction Steps
N/A
Binary Version
v19+
Operating System and Environment details
N/A
Log Fragments
N/A