
Bug Report: Connection Pool Reuse leads to consistent PRS failure #18202

@arthurschreiber


Overview of the Issue

We've run into a case where a PlannedReparentShard (PRS) operation failed because closing the TableGC pool timed out during the DemotePrimary step. The tablet is now in an undefined state and can't be demoted: every subsequent DemotePrimary attempt fails with the same timeout error.

Debugging this, we believe the cause is that TableGC reuses the same Pool instance. This is problematic: if Pool.Close fails once, the pool is left in an undefined state, and reopening and closing it will keep failing because the active connection count on the pool instance is out of sync with the "actual" pool state.

I feel like reusing a connection pool instance after calling .Close is not safe, especially if that .Close call ran into a timeout.

I believe that after calling .Close we should remove any references to the pool and create a new pool instance when the pool needs to be reopened, so that we always start from a fresh connection pool.
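A minimal sketch of that idea in Go, under stated assumptions: `connPool`, `newConnPool`, and `owner` are hypothetical stand-ins (with `owner` playing the role TableGC has here), and the failing `Close` merely simulates the timeout. This is not the actual Vitess code or a proposed patch.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// connPool is a hypothetical stand-in for a connection pool.
type connPool struct{ active int }

func newConnPool() *connPool { return &connPool{} }

// Close fails while connections are still borrowed (simulating a timeout).
func (p *connPool) Close() error {
	if p.active > 0 {
		return errors.New("close timed out")
	}
	return nil
}

// owner is a hypothetical component that opens and closes a pool
// repeatedly over its lifetime.
type owner struct {
	mu   sync.Mutex
	pool *connPool
}

// Open creates a fresh pool if none is currently held.
func (o *owner) Open() {
	o.mu.Lock()
	defer o.mu.Unlock()
	if o.pool == nil {
		o.pool = newConnPool()
	}
}

// Close drops the reference before closing, so a failed Close can never
// leave a half-closed pool behind for the next Open to reuse.
func (o *owner) Close() error {
	o.mu.Lock()
	defer o.mu.Unlock()
	if o.pool == nil {
		return nil
	}
	p := o.pool
	o.pool = nil
	return p.Close()
}

func main() {
	o := &owner{}
	o.Open()
	o.pool.active = 1 // simulate a connection that is never returned

	fmt.Println("first close:", o.Close()) // fails, but the reference is gone

	o.Open() // a brand new pool with a clean count
	fmt.Println("second close:", o.Close())
}
```

In this sketch the first close still fails, but the second one succeeds because it operates on a new instance. The trade-off is that connections still held by the discarded pool are no longer tracked by the owner, so a real fix would need to decide how those get cleaned up.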

Reproduction Steps

N/A

Binary Version

v19+

Operating System and Environment details

N/A

Log Fragments

N/A
