Status: Open
Labels: keeper (ClickHouse Keeper issues)
Description
I am running a 3-node Keeper setup on my k8s cluster. If the Keeper pods get restarted for any reason, they lose quorum: no node manages to become leader, which blocks the remaining pods from starting.
NAME READY STATUS RESTARTS AGE
clickhouse-keeper-0 1/1 Running 0 48s
clickhouse-keeper-1 0/1 Error 1 (3s ago) 27s
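When the pods flap like this, it can help to ask each Keeper instance for its Raft role directly. The sketch below uses the ZooKeeper-style four-letter `mntr` command, which ClickHouse Keeper serves on the client port (2181 in this setup); the pod hostnames are taken from the `kubectl` output above and may need adjusting for your Service/DNS layout.

```shell
#!/bin/sh
# Ask a Keeper instance for its Raft role via the four-letter "mntr" command.
# Prints the zk_server_state line (leader/follower/standalone) or "no response"
# when the pod is down or not yet serving.
check_keeper() {
  host=$1
  echo mntr | nc -w 2 "$host" 2181 2>/dev/null | grep zk_server_state \
    || echo "no response"
}

for pod in clickhouse-keeper-0 clickhouse-keeper-1 clickhouse-keeper-2; do
  printf '%s: %s\n' "$pod" "$(check_keeper "$pod")"
done
```

In a healthy 3-node ensemble exactly one pod should report `zk_server_state leader`; here you would expect keeper-0 to answer with no leader state and keeper-1 to give no response at all.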
Logs of clickhouse-keeper-0:
cipant;1'\''' -h clickhouse-keeper -p 2181
Coordination::Exception: All connection tries failed while connecting to ZooKeeper. nodes: 34.118.236.244:2181
Poco::Exception. Code: 1000, e.code() = 111, Connection refused (version 26.2.1.1139 (official build)), 34.118.236.244:2181
Poco::Exception. Code: 1000, e.code() = 111, Connection refused (version 26.2.1.1139 (official build)), 34.118.236.244:2181
Poco::Exception. Code: 1000, e.code() = 111, Connection refused (version 26.2.1.1139 (official build)), 34.118.236.244:2181
2026.03.03 06:32:48.072377 [ 45 ] {} <Warning> RaftInstance: Election timeout, initiate leader election
2026.03.03 06:32:48.072393 [ 45 ] {} <Information> RaftInstance: [PRIORITY] decay, target 1 -> 1, mine 1
2026.03.03 06:32:48.072405 [ 45 ] {} <Information> RaftInstance: [ELECTION TIMEOUT] current role: candidate, log last term 53, state term 2, target p 1, my p 1, hb dead, pre-vote NOT done
2026.03.03 06:32:48.072428 [ 45 ] {} <Information> RaftInstance: reset RPC client for peer 3
2026.03.03 06:32:48.072500 [ 45 ] {} <Warning> RaftInstance: total 1 nodes (including this node) responded for pre-vote (term 2, live 0, dead 1), at least 2 nodes should respond. failure count 71
2026.03.03 06:32:48.072504 [ 45 ] {} <Information> RaftInstance: [PRE-VOTE INIT] my id 1, my role candidate, term 2, log idx 905449542, log term 53, priority (target 1 / mine 1)
2026.03.03 06:32:49.639658 [ 36 ] {} <Warning> RaftInstance: Election timeout, initiate leader election
2026.03.03 06:32:49.639679 [ 36 ] {} <Information> RaftInstance: [PRIORITY] decay, target 1 -> 1, mine 1
2026.03.03 06:32:49.639695 [ 36 ] {} <Information> RaftInstance: [ELECTION TIMEOUT] current role: candidate, log last term 53, state term 2, target p 1, my p 1, hb dead, pre-vote NOT done
2026.03.03 06:32:49.639703 [ 36 ] {} <Information> RaftInstance: reset RPC client for peer 3
2026.03.03 06:32:49.639762 [ 36 ] {} <Warning> RaftInstance: total 1 nodes (including this node) responded for pre-vote (term 2, live 0, dead 1), at least 2 nodes should respond. failure count 72
2026.03.03 06:32:49.639769 [ 36 ] {} <Information> RaftInstance: [PRE-VOTE INIT] my id 1, my role candidate, term 2, log idx 905449542, log term 53, priority (target 1 / mine 1)
2026.03.03 06:32:50.843923 [ 42 ] {} <Warning> RaftInstance: Election timeout, initiate leader election
2026.03.03 06:32:50.843944 [ 42 ] {} <Information> RaftInstance: [PRIORITY] decay, target 1 -> 1, mine 1
2026.03.03 06:32:50.843959 [ 42 ] {} <Information> RaftInstance: [ELECTION TIMEOUT] current role: candidate, log last term 53, state term 2, target p 1, my p 1, hb dead, pre-vote NOT done
2026.03.03 06:32:50.843967 [ 42 ] {} <Information> RaftInstance: reset RPC client for peer 3
2026.03.03 06:32:50.844017 [ 42 ] {} <Warning> RaftInstance: total 1 nodes (including this node) responded for pre-vote (term 2, live 0, dead 1), at least 2 nodes should respond. failure count 73
2026.03.03 06:32:50.844023 [ 42 ] {} <Information> RaftInstance: [PRE-VOTE INIT] my id 1, my role candidate, term 2, log idx 905449542, log term 53, priority (target 1 / mine 1)
Logs of clickhouse-keeper-1:
2026.03.03 06:33:02.408916 [ 23 ] {} <Information> KeeperContext: Keeper feature flag TRY_REMOVE: enabled
2026.03.03 06:33:02.408918 [ 23 ] {} <Information> KeeperContext: Keeper feature flag LIST_WITH_STAT_AND_DATA: enabled
2026.03.03 06:33:02.410730 [ 23 ] {} <Trace> KeeperSnapshotManager: Reading from disk LocalSnapshotDisk
2026.03.03 06:33:02.410843 [ 23 ] {} <Trace> KeeperSnapshotManager: No snapshots were found on LocalSnapshotDisk
2026.03.03 06:33:02.411684 [ 23 ] {} <Debug> KeeperDispatcher: Shutting down storage dispatcher
2026.03.03 06:33:02.412134 [ 23 ] {} <Debug> KeeperSnapshotManagerS3: Shutting down KeeperSnapshotManagerS3
2026.03.03 06:33:02.412330 [ 23 ] {} <Information> KeeperSnapshotManagerS3: KeeperSnapshotManagerS3 shut down
2026.03.03 06:33:02.412351 [ 23 ] {} <Debug> KeeperDispatcher: Dispatcher shut down
2026.03.03 06:33:02.412558 [ 23 ] {} <Trace> Context: Shutting down named sessions
2026.03.03 06:33:02.412579 [ 23 ] {} <Trace> Context: Shutting down object storage queue streaming
2026.03.03 06:33:02.412585 [ 23 ] {} <Debug> ObjectStorageQueueFactory: There are no queue storages to shutdown
2026.03.03 06:33:02.412594 [ 23 ] {} <Trace> Context: Shutting down database catalog
2026.03.03 06:33:02.412599 [ 23 ] {} <Trace> DatabaseCatalog: Shutting down system logs
2026.03.03 06:33:02.412602 [ 23 ] {} <Trace> DatabaseCatalog: Shutting down system databases
2026.03.03 06:33:02.412624 [ 23 ] {} <Trace> Context: Shutting down caches
2026.03.03 06:33:02.412632 [ 23 ] {} <Trace> Context: Shutting down AccessControl
2026.03.03 06:33:02.412685 [ 23 ] {} <Debug> Context: Destructing remote fs threadpool reader
2026.03.03 06:33:02.412690 [ 23 ] {} <Debug> Context: Destructing local fs threadpool reader
2026.03.03 06:33:02.412693 [ 23 ] {} <Debug> Context: Destructing local fs threadpool reader
2026.03.03 06:33:02.413978 [ 23 ] {} <Information> Application: Waiting for background threads
2026.03.03 06:33:02.416623 [ 23 ] {} <Information> Application: Background threads finished in 2 ms
2026.03.03 06:33:02.417336 [ 23 ] {} <Error> Application: Code: 568. DB::Exception: At least one of servers should be able to start as leader (without <start_as_follower>). (RAFT_ERROR), Stack trace (when copying this message, always include the lines below):
0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x00000000115c14ea
1. DB::Exception::Exception(String&&, int, String, bool) @ 0x0000000009324ece
2. DB::Exception::Exception(PreformattedMessage&&, int) @ 0x0000000009324909
3. DB::Exception::Exception<>(int, FormatStringHelperImpl<>) @ 0x00000000115c0df6
4. DB::KeeperStateManager::parseServersConfiguration(Poco::Util::AbstractConfiguration const&, bool, bool) const @ 0x0000000014b52bdf
5. DB::KeeperStateManager::KeeperStateManager(int, String const&, String const&, Poco::Util::AbstractConfiguration const&, std::shared_ptr<DB::KeeperContext>) @ 0x0000000014b55093
6. DB::KeeperServer::KeeperServer(std::shared_ptr<DB::KeeperConfigurationAndSettings> const&, Poco::Util::AbstractConfiguration const&, ConcurrentBoundedQueue<DB::KeeperResponseForSession>&, ConcurrentBoundedQueue<DB::CreateSnapshotTask>&, std::shared_ptr<DB::KeeperContext>, DB::KeeperSnapshotManagerS3&, std::function<void (unsigned long, DB::KeeperRequestForSession const&)>) @ 0x0000000014ab2dfd
7. DB::KeeperDispatcher::initialize(Poco::Util::AbstractConfiguration const&, bool, bool, std::shared_ptr<DB::Macros const> const&) @ 0x0000000014a81542
8. DB::Context::initializeKeeperDispatcher(bool) const @ 0x0000000012f6f94f
9. DB::Keeper::main(std::vector<String, std::allocator<String>> const&) @ 0x000000000931ba79
10. Poco::Util::Application::run() @ 0x00000000177ebab1
11. DB::Keeper::run() @ 0x000000000931893a
12. mainEntryClickHouseKeeper(int, char**) @ 0x000000000931742d
13. main @ 0x0000000009315e5d
14. __pow_finite @ 0x0000000000029d90
15. __libc_start_main @ 0x0000000000029e40
16. _start @ 0x00000000068fdc6e
(version 26.2.1.1139 (official build))
2026.03.03 06:33:02.417387 [ 23 ] {} <Information> Application: shutting down
2026.03.03 06:33:02.417395 [ 23 ] {} <Debug> Application: Uninitializing subsystem: Logging Subsystem
2026.03.03 06:33:02.417772 [ 27 ] {} <Trace> BaseDaemon: Received signal -2
2026.03.03 06:33:02.417809 [ 27 ] {} <Information> BaseDaemon: Stop SignalListener thread
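The fatal error in keeper-1's log pinpoints the failure: `parseServersConfiguration` throws `RAFT_ERROR` when every `<server>` in `<raft_configuration>` carries `<start_as_follower>`, i.e. no node is permitted to bootstrap as leader. A minimal sketch of a 3-node `keeper_config.xml` that avoids this (hostnames, ports, and server ids below are illustrative, not taken from this cluster) might look like:

```xml
<clickhouse>
    <keeper_server>
        <tcp_port>2181</tcp_port>
        <server_id>1</server_id> <!-- unique per pod -->
        <raft_configuration>
            <!-- At least one server must NOT set <start_as_follower>,
                 otherwise Keeper aborts with RAFT_ERROR at startup. -->
            <server>
                <id>1</id>
                <hostname>clickhouse-keeper-0.clickhouse-keeper</hostname>
                <port>9234</port>
            </server>
            <server>
                <id>2</id>
                <hostname>clickhouse-keeper-1.clickhouse-keeper</hostname>
                <port>9234</port>
            </server>
            <server>
                <id>3</id>
                <hostname>clickhouse-keeper-2.clickhouse-keeper</hostname>
                <port>9234</port>
            </server>
        </raft_configuration>
    </keeper_server>
</clickhouse>
```

Also worth noting: keeper-1 reports `No snapshots were found on LocalSnapshotDisk` right before the crash, so whatever tooling regenerates the configuration on pod restart is worth inspecting as the likely source of the `<start_as_follower>` flags.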