-
Couldn't load subscription status.
- Fork 928
Description
Describe the bug
We have a test case in which 2 out 3 primaries are deleted in the same time - short gap between deletion in reality. In majority of cases this gap is enough for cluster to recover (time for failovers). Anyway in failure runs (1/50) I am observing a strange behavior in which multiple nodes have same IP and PORT: 127.0.0.1:6380. It doesn't seam right. Have a look at the logs below which I have.
I1010 23:52:09.100471 18:M Address updated for node dad564b61f8758c0d29328b5803edcdaa0811c6f (), now 127.0.0.1:6380
I1010 23:52:09.100939 d933a2d1b785fa934f1a5efae9243a937b522ea1 127.0.0.1:6380@16380 myself,master - 0 0 9 connected 10923-16383
I1010 23:52:09.100947 756d2c3a79c1dfba5df86ccff020e0a72475a47d 192.168.37.165:6380@16380 master - 0 1760140328554 7 connected 5462-10922
I1010 23:52:09.100950 7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe 192.168.51.41:6380@16380 slave 756d2c3a79c1dfba5df86ccff020e0a72475a47d 0 1760140328554 7 connected
I1010 23:52:09.100926 dad564b61f8758c0d29328b5803edcdaa0811c6f 127.0.0.1:6380@16380 slave a6126962691ab37568773919497b6498254f26fd 1760140328152 1760140328000 13 disconnected
I1010 23:52:09.100943 a6126962691ab37568773919497b6498254f26fd 192.168.132.112:6380@16380 master - 0 1760140328857 13 connected 0-5461
To reproduce
N/A. I will try to reproduce the issue with lower logging level so the GOSSIPs will be visible, but maybe what I already shared gives any clue if this can be improved.
Expected behavior
I think there should be no node in the cluster with same IP and PORT.
Additional information
Full log from affected pod:
I1010 23:40:39.716347 18:S Start of election delayed for 0 milliseconds (rank #0, offset 650332392).
I1010 23:40:39.716473 18:S Starting a failover election for epoch 9, node config epoch is 8
I1010 23:40:39.716279 18:S Forced failover user request accepted (user request from 'id=141 addr=192.168.5.54:37738 laddr=192.168.5.52:6380 fd=44 name=*redacted* age=0 idle=0 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=20 multi-mem=0 rbs=16384 rbp=16384 obl=0
oll=0 omem=0 tot-mem=17324 events=r cmd=cluster|failover user=*redacted* redir=-1 resp=2 lib-name= lib-ver= tot-net-in=232 tot-net-out=2971 tot-cmds=5').
I1010 23:40:39.720222 18:M Discarding previously cached primary state.
I1010 23:40:39.720208 18:S configEpoch set to 9 after successful failover
I1010 23:40:39.720185 18:S Failover election won: I'm the new primary.
I1010 23:40:39.720214 18:M Connection with primary lost.
I1010 23:40:39.720218 18:M Caching the disconnected primary state.
I1010 23:40:39.720228 18:M Setting secondary replication ID to fc2c29faebd3533e0f3b15052d651cb656853440, valid up to offset: 650332393. New replication ID is aa2dd4f2e298e5a3122a1f1646b7de61218adb0b
I1010 23:40:39.723631
I1010 23:40:39.723589 dad564b61f8758c0d29328b5803edcdaa0811c6f 192.168.51.35:6380@16380 slave 6f038b2fa88fe9c6bbcd9981db7c7ea458b42d44 0 1760139639092 1 connected
I1010 23:40:39.723618 d933a2d1b785fa934f1a5efae9243a937b522ea1 192.168.5.52:6380@16380 myself,master - 0 0 9 connected 10923-16383
I1010 23:40:39.723625 756d2c3a79c1dfba5df86ccff020e0a72475a47d 192.168.37.165:6380@16380 master - 0 1760139639596 7 connected 5462-10922
I1010 23:40:39.723628 6f038b2fa88fe9c6bbcd9981db7c7ea458b42d44 192.168.132.116:6380@16380 master - 0 1760139639193 1 connected 0-5461
I1010 23:40:39.723613 256f23c60e389a6edfd5a657ca2c143e50c42830 192.168.51.38:6380@16380 master - 0 1760139639596 8 connected
I1010 23:40:39.723621 a6126962691ab37568773919497b6498254f26fd 192.168.132.112:6380@16380 slave 756d2c3a79c1dfba5df86ccff020e0a72475a47d 0 1760139639193 7 connected
I1010 23:40:39.732682 18:M Partial resynchronization not accepted: Requested offset for second ID was 650332995, but I can reply up to 650332393
I1010 23:40:39.732651 18:M Replica 192.168.51.38:6380 asks for synchronization
I1010 23:40:39.732691 18:M Replica 192.168.51.38:6380 is capable of dual channel synchronization, and partial sync isn't possible. Full sync will continue with dedicated RDB channel.
I1010 23:40:39.740857 18:M Starting BGSAVE for SYNC with target: replicas sockets using: dual-channel
I1010 23:40:39.740838 18:M Replica 192.168.51.38:6380 asks for synchronization
I1010 23:40:39.740863 18:M Sending to replica 192.168.51.38:6380 RDB end offset 650332392 and client-id 147
I1010 23:40:39.745611 18:M Background RDB transfer started by pid 32 to direct socket to replica
I1010 23:40:39.748410 18:M Replica 192.168.51.38:6380 asks for synchronization
I1010 23:40:39.748450 18:M Partial resynchronization request from 192.168.51.38:6380 accepted. Sending 0 bytes of backlog starting from offset 650332393.
I1010 23:40:39.764787 18:M Connection with replica client id #143 lost.
I1010 23:40:39.765286 18:M Replica 192.168.51.38:6380 rdb channel disconnected.
W1010 23:40:39.797651 18:M Background RDB transfer error
I1010 23:40:42.176969 18:M Successfully completed handshake with cba6f96b9c5d2cf97ff2112a77967a0c982873be ()
W1010 23:40:42.176934 18:M Failover election in progress for epoch 9, but received a claim from node cba6f96b9c5d2cf97ff2112a77967a0c982873be () with an equal or higher epoch 10. Resetting the election since we cannot win an election in the past.
I1010 23:40:42.177122 18:M Address updated for node cba6f96b9c5d2cf97ff2112a77967a0c982873be (), now 192.168.37.138:6380
I1010 23:40:42.180548 756d2c3a79c1dfba5df86ccff020e0a72475a47d 192.168.37.165:6380@16380 master - 0 1760139641648 7 connected 5462-10922
I1010 23:40:42.180503 cba6f96b9c5d2cf97ff2112a77967a0c982873be 192.168.37.138:6380@16380 master - 0 1760139642157 10 disconnected
I1010 23:40:42.180530 dad564b61f8758c0d29328b5803edcdaa0811c6f 192.168.51.35:6380@16380 slave 6f038b2fa88fe9c6bbcd9981db7c7ea458b42d44 0 1760139641648 1 connected
I1010 23:40:42.180544 a6126962691ab37568773919497b6498254f26fd 192.168.132.112:6380@16380 slave 756d2c3a79c1dfba5df86ccff020e0a72475a47d 0 1760139641211 7 connected
I1010 23:40:42.180539 d933a2d1b785fa934f1a5efae9243a937b522ea1 192.168.5.52:6380@16380 myself,master - 0 0 9 connected 10923-16383
I1010 23:40:42.180551 6f038b2fa88fe9c6bbcd9981db7c7ea458b42d44 192.168.132.116:6380@16380 master - 0 1760139642000 1 connected 0-5461
I1010 23:40:42.180554
I1010 23:40:42.255927 18:M Node cba6f96b9c5d2cf97ff2112a77967a0c982873be () is no longer primary of shard b97108364721ecad6dfcc1932290c9ba2e321bca; removed all 0 slot(s) it used to own
I1010 23:40:42.255958 18:M Node cba6f96b9c5d2cf97ff2112a77967a0c982873be () is now part of shard b24ffc0709aee18910992a47668946085bc63936
I1010 23:40:42.255964 18:M Node cba6f96b9c5d2cf97ff2112a77967a0c982873be () is now a replica of node d933a2d1b785fa934f1a5efae9243a937b522ea1 () in shard b24ffc0709aee18910992a47668946085bc63936
I1010 23:40:42.259767 18:M Replica 192.168.37.138:6380 is capable of dual channel synchronization, and partial sync isn't possible. Full sync will continue with dedicated RDB channel.
I1010 23:40:42.259761 18:M Full resync requested by replica 192.168.37.138:6380
I1010 23:40:42.259749 18:M Replica 192.168.37.138:6380 asks for synchronization
I1010 23:40:42.263598 18:M Replica 192.168.37.138:6380 asks for synchronization
I1010 23:40:42.263622 18:M Starting BGSAVE for SYNC with target: replicas sockets using: dual-channel
I1010 23:40:42.263627 18:M Sending to replica 192.168.37.138:6380 RDB end offset 650348316 and client-id 186
I1010 23:40:42.267637 18:M Background RDB transfer started by pid 33 to direct socket to replica
I1010 23:40:42.269417 18:M Partial resynchronization request from 192.168.37.138:6380 accepted. Sending 0 bytes of backlog starting from offset 650348317.
I1010 23:40:42.269370 18:M Replica 192.168.37.138:6380 asks for synchronization
I1010 23:40:42.344211 log catcher is delaying logging cluster state
I1010 23:40:42.343905 18:M Replica 192.168.37.138:6380 rdb channel disconnected.
I1010 23:40:42.344175 18:M Synchronization with replica 192.168.37.138:6380 succeeded
I1010 23:40:42.345081 33:C Fork CoW for RDB: current 3 MB, peak 3 MB, average 2 MB
I1010 23:40:42.352481 18:M Background RDB transfer terminated with success
I1010 23:40:43.345034 dad564b61f8758c0d29328b5803edcdaa0811c6f 192.168.51.35:6380@16380 slave 6f038b2fa88fe9c6bbcd9981db7c7ea458b42d44 0 1760139643058 1 connected
I1010 23:40:43.345039 d933a2d1b785fa934f1a5efae9243a937b522ea1 192.168.5.52:6380@16380 myself,master - 0 0 9 connected 10923-16383
I1010 23:40:43.345044 a6126962691ab37568773919497b6498254f26fd 192.168.132.112:6380@16380 slave 756d2c3a79c1dfba5df86ccff020e0a72475a47d 0 1760139643058 7 connected
I1010 23:40:43.345051 6f038b2fa88fe9c6bbcd9981db7c7ea458b42d44 192.168.132.116:6380@16380 master - 0 1760139643000 1 connected 0-5461
I1010 23:40:43.345054
I1010 23:40:43.345001 cba6f96b9c5d2cf97ff2112a77967a0c982873be 192.168.37.138:6380@16380 slave d933a2d1b785fa934f1a5efae9243a937b522ea1 0 1760139643260 9 connected
I1010 23:40:43.345048 756d2c3a79c1dfba5df86ccff020e0a72475a47d 192.168.37.165:6380@16380 master - 0 1760139643058 7 connected 5462-10922
I1010 23:42:27.966860 18:M Failover auth granted to dad564b61f8758c0d29328b5803edcdaa0811c6f () for epoch 11
I1010 23:42:32.145391 18:M Node 7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe () is now a replica of node dad564b61f8758c0d29328b5803edcdaa0811c6f () in shard bec51287c31b192c3a4a1936493523ca0882f87d
I1010 23:42:32.145377 18:M Node 7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe () is no longer primary of shard 14d7218516be3bff8259ee30bc3b6b57fa835673; removed all 0 slot(s) it used to own
I1010 23:42:32.145385 18:M Node 7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe () is now part of shard bec51287c31b192c3a4a1936493523ca0882f87d
I1010 23:42:32.145339 18:M Successfully completed handshake with 7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe ()
I1010 23:42:32.438524 d933a2d1b785fa934f1a5efae9243a937b522ea1 192.168.5.52:6380@16380 myself,master - 0 0 9 connected 10923-16383
I1010 23:42:32.438530 a6126962691ab37568773919497b6498254f26fd 192.168.132.112:6380@16380 slave 756d2c3a79c1dfba5df86ccff020e0a72475a47d 0 1760139752129 7 connected
I1010 23:42:32.438461 cba6f96b9c5d2cf97ff2112a77967a0c982873be 192.168.37.138:6380@16380 slave d933a2d1b785fa934f1a5efae9243a937b522ea1 0 1760139752000 9 connected
I1010 23:42:32.438535 756d2c3a79c1dfba5df86ccff020e0a72475a47d 192.168.37.165:6380@16380 master - 0 1760139752028 7 connected 5462-10922
I1010 23:42:32.438540 7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe 192.168.51.41:6380@16380 slave dad564b61f8758c0d29328b5803edcdaa0811c6f 0 1760139752432 11 disconnected
I1010 23:42:32.437782 18:M Address updated for node 7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe (), now 192.168.51.41:6380
I1010 23:42:32.438544
I1010 23:42:32.438490 dad564b61f8758c0d29328b5803edcdaa0811c6f 192.168.51.35:6380@16380 master - 0 1760139752331 11 connected 0-5461
I1010 23:42:34.403031 18:M Node 7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe () is now a replica of node 756d2c3a79c1dfba5df86ccff020e0a72475a47d () in shard 2f93c933f455fdff696bd292b7d9b7330ee83314
I1010 23:42:34.404257 18:M Node a6126962691ab37568773919497b6498254f26fd () is now a replica of node dad564b61f8758c0d29328b5803edcdaa0811c6f () in shard bec51287c31b192c3a4a1936493523ca0882f87d
I1010 23:52:07.940500 1 signal.go:30] Signal received: terminated
I1010 23:52:07.940665 1 main.go:38] Shutdown handler called
E1010 23:52:07.940792 1 redisnode.go:325] http server stop error: http: Server closed
I1010 23:52:07.975305 1 admin.go:321] Failover by address: 192.168.5.52:6380
I1010 23:52:07.984725 1 admin.go:345] Starting failover for node d933a2d1b785fa934f1a5efae9243a937b522ea1
W1010 23:52:08.019923 1 clusterstatus.go:223] inconsistent cluster view
I1010 23:52:08.019964 1 clusterstatus.go:224] consolidated signature:
I1010 23:52:08.021038 1 clusterstatus.go:231] node 756d signature:
I1010 23:52:08.021026 1 clusterstatus.go:227]
I1010 23:52:08.020986 1 clusterstatus.go:227] 756d [5462-10922]
I1010 23:52:08.023412 1 clusterstatus.go:234] d933 [10923-16383]
I1010 23:52:08.023416 1 clusterstatus.go:234]
I1010 23:52:08.023426 1 clusterstatus.go:231] node 7fcf signature:
I1010 23:52:08.023407 1 clusterstatus.go:234] dad5 [0-5461]
I1010 23:52:08.023376 1 clusterstatus.go:234] 756d [5462-10922]
I1010 23:52:08.029074 1 clusterstatus.go:234] 756d [5462-10922]
I1010 23:52:08.029084 1 clusterstatus.go:234] d933 [10923-16383]
I1010 23:52:08.029031 1 clusterstatus.go:234] dad5 [0-5461]
I1010 23:52:08.029091 1 clusterstatus.go:234]
I1010 23:52:08.029103 1 clusterstatus.go:231] node a612 signature:
I1010 23:52:08.031447 1 clusterstatus.go:234]
I1010 23:52:08.031411 1 clusterstatus.go:234] 756d [5462-10922]
I1010 23:52:08.031439 1 clusterstatus.go:234] d933 [10923-16383]
I1010 23:52:08.031444 1 clusterstatus.go:234] dad5 [0-5461]
I1010 23:52:08.031454 1 clusterstatus.go:231] node cba6 signature:
I1010 23:52:08.034195 1 admin.go:351] Found 1 slave nodes
I1010 23:52:08.034182 1 clusterstatus.go:234] d933 [10923-16383]
I1010 23:52:08.034186 1 clusterstatus.go:234] 756d [5462-10922]
I1010 23:52:08.034202 1 admin.go:367] Attempting failover for slave node cba6f96b9c5d2cf97ff2112a77967a0c982873be
I1010 23:52:08.034157 1 clusterstatus.go:234] dad5 [0-5461]
I1010 23:52:08.034189 1 clusterstatus.go:234]
I1010 23:52:08.039138 18:M Failover auth granted to a6126962691ab37568773919497b6498254f26fd () for epoch 13
I1010 23:52:08.051635 1 admin.go:389] Failover command successfully sent to cba6f96b9c5d2cf97ff2112a77967a0c982873be
I1010 23:52:08.051768 18:M Manual failover requested by replica cba6f96b9c5d2cf97ff2112a77967a0c982873be ().
I1010 23:52:08.051678 1 admin.go:397] Starting to wait for slot transfer from master=(d933a2d1b785fa934f1a5efae9243a937b522ea1, 192.168.5.52:6380) to replica=(cba6f96b9c5d2cf97ff2112a77967a0c982873be, 192.168.37.138:6380) to complete, total slots to transfer: 5461
W1010 23:52:08.059995 18:M Failover auth denied to cba6f96b9c5d2cf97ff2112a77967a0c982873be : can't vote about this primary before 3997 milliseconds
I1010 23:52:08.059971 18:M Failover auth granted to cba6f96b9c5d2cf97ff2112a77967a0c982873be () for epoch 14
I1010 23:52:08.060182 1 admin.go:417] Waiting for failover to complete slot transfer...
I1010 23:52:08.091436 18:M Node dad564b61f8758c0d29328b5803edcdaa0811c6f () is now a replica of node a6126962691ab37568773919497b6498254f26fd () in shard bec51287c31b192c3a4a1936493523ca0882f87d
I1010 23:52:08.091404 18:M Address updated for node dad564b61f8758c0d29328b5803edcdaa0811c6f (), now 127.0.0.1:6380
I1010 23:52:08.096713 d933a2d1b785fa934f1a5efae9243a937b522ea1 127.0.0.1:6380@16380 myself,master - 0 0 9 connected 10923-16383
I1010 23:52:08.096732
I1010 23:52:08.096708 dad564b61f8758c0d29328b5803edcdaa0811c6f 127.0.0.1:6380@16380 slave a6126962691ab37568773919497b6498254f26fd 0 1760140327847 13 disconnected
I1010 23:52:08.096717 a6126962691ab37568773919497b6498254f26fd 192.168.132.112:6380@16380 master - 0 1760140327140 13 connected 0-5461
I1010 23:52:08.096723 7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe 192.168.51.41:6380@16380 slave 756d2c3a79c1dfba5df86ccff020e0a72475a47d 0 1760140327544 7 connected
I1010 23:52:08.096720 756d2c3a79c1dfba5df86ccff020e0a72475a47d 192.168.37.165:6380@16380 master - 0 1760140327544 7 connected 5462-10922
I1010 23:52:08.096687 cba6f96b9c5d2cf97ff2112a77967a0c982873be 192.168.37.138:6380@16380 slave d933a2d1b785fa934f1a5efae9243a937b522ea1 0 1760140328049 9 connected
I1010 23:52:08.151389 18:M Connection with replica 192.168.37.138:6380 lost.
I1010 23:52:08.154041 18:M PONG contains mismatching sender ID. About node dad564b61f8758c0d29328b5803edcdaa0811c6f () in shard bec51287c31b192c3a4a1936493523ca0882f87d added 1532191 ms ago, having flags 3074
I1010 23:52:09.100904 cba6f96b9c5d2cf97ff2112a77967a0c982873be 192.168.37.138:6380@16380 slave d933a2d1b785fa934f1a5efae9243a937b522ea1 1760140329058 1760140328152 9 connected
I1010 23:52:09.100471 18:M Address updated for node dad564b61f8758c0d29328b5803edcdaa0811c6f (), now 127.0.0.1:6380
I1010 23:52:09.100939 d933a2d1b785fa934f1a5efae9243a937b522ea1 127.0.0.1:6380@16380 myself,master - 0 0 9 connected 10923-16383
I1010 23:52:09.100947 756d2c3a79c1dfba5df86ccff020e0a72475a47d 192.168.37.165:6380@16380 master - 0 1760140328554 7 connected 5462-10922
I1010 23:52:09.100950 7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe 192.168.51.41:6380@16380 slave 756d2c3a79c1dfba5df86ccff020e0a72475a47d 0 1760140328554 7 connected
I1010 23:52:09.100926 dad564b61f8758c0d29328b5803edcdaa0811c6f 127.0.0.1:6380@16380 slave a6126962691ab37568773919497b6498254f26fd 1760140328152 1760140328000 13 disconnected
I1010 23:52:09.100943 a6126962691ab37568773919497b6498254f26fd 192.168.132.112:6380@16380 master - 0 1760140328857 13 connected 0-5461
I1010 23:52:09.100954
I1010 23:52:09.165598 18:M PONG contains mismatching sender ID. About node dad564b61f8758c0d29328b5803edcdaa0811c6f () in shard bec51287c31b192c3a4a1936493523ca0882f87d added 1533203 ms ago, having flags 3074
I1010 23:52:10.109863 7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe 192.168.51.41:6380@16380 slave 756d2c3a79c1dfba5df86ccff020e0a72475a47d 0 1760140329868 7 connected
I1010 23:52:10.109853 d933a2d1b785fa934f1a5efae9243a937b522ea1 127.0.0.1:6380@16380 myself,master - 0 0 9 connected 10923-16383
I1010 23:52:10.109374 18:M Address updated for node dad564b61f8758c0d29328b5803edcdaa0811c6f (), now 127.0.0.1:6380
I1010 23:52:10.109856 a6126962691ab37568773919497b6498254f26fd 192.168.132.112:6380@16380 master - 0 1760140330069 13 connected 0-5461
I1010 23:52:10.109860 756d2c3a79c1dfba5df86ccff020e0a72475a47d 192.168.37.165:6380@16380 master - 0 1760140329565 7 connected 5462-10922
I1010 23:52:10.109867
I1010 23:52:10.109824 cba6f96b9c5d2cf97ff2112a77967a0c982873be 192.168.37.138:6380@16380 slave d933a2d1b785fa934f1a5efae9243a937b522ea1 1760140330068 1760140328152 9 connected
I1010 23:52:10.109847 dad564b61f8758c0d29328b5803edcdaa0811c6f 127.0.0.1:6380@16380 slave a6126962691ab37568773919497b6498254f26fd 1760140328152 1760140328000 13 disconnected
I1010 23:52:10.176490 18:M PONG contains mismatching sender ID. About node dad564b61f8758c0d29328b5803edcdaa0811c6f () in shard bec51287c31b192c3a4a1936493523ca0882f87d added 1534214 ms ago, having flags 3074
E1010 23:52:10.261057 1 redisnode.go:448] Failover error: FailoverError: failed to contact replica node when checking progress
E1010 23:52:10.261026 1 admin.go:1022] Failed to get node at 192.168.37.138:6380: unable to retrieve Node Info for the node with address 192.168.37.138:6380 due to error failed to perform action ["CLUSTER" "NODES"], err: read tcp 192.168.5.52:43784->192.168.37.138:6380: i/o timeout
E1010 23:52:10.261245 1 redisnode.go:170] error happened during fail over and forget: cannot failover by address 192.168.5.52:6380: FailoverError: failed to contact replica node when checking progress
I1010 23:52:10.261258 1 main.go:42] fail over done, shut down
I1010 23:52:10.261291 1 main.go:66] redisnode.Run() stopped gracefully
I1010 23:52:10.261280 1 redisnode.go:157] received Stop Signal...
BR,
Seb.