Skip to content

[BUG] Multiple nodes with same IP and PORT #2768

@esebole

Description

@esebole

Describe the bug

We have a test case in which 2 out 3 primaries are deleted in the same time - short gap between deletion in reality. In majority of cases this gap is enough for cluster to recover (time for failovers). Anyway in failure runs (1/50) I am observing a strange behavior in which multiple nodes have same IP and PORT: 127.0.0.1:6380. It doesn't seam right. Have a look at the logs below which I have.

I1010 23:52:09.100471  18:M Address updated for node dad564b61f8758c0d29328b5803edcdaa0811c6f (), now 127.0.0.1:6380
I1010 23:52:09.100939  d933a2d1b785fa934f1a5efae9243a937b522ea1 127.0.0.1:6380@16380 myself,master - 0 0 9 connected 10923-16383
I1010 23:52:09.100947  756d2c3a79c1dfba5df86ccff020e0a72475a47d 192.168.37.165:6380@16380 master - 0 1760140328554 7 connected 5462-10922
I1010 23:52:09.100950  7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe 192.168.51.41:6380@16380 slave 756d2c3a79c1dfba5df86ccff020e0a72475a47d 0 1760140328554 7 connected
I1010 23:52:09.100926  dad564b61f8758c0d29328b5803edcdaa0811c6f 127.0.0.1:6380@16380 slave a6126962691ab37568773919497b6498254f26fd 1760140328152 1760140328000 13 disconnected
I1010 23:52:09.100943  a6126962691ab37568773919497b6498254f26fd 192.168.132.112:6380@16380 master - 0 1760140328857 13 connected 0-5461

To reproduce

N/A. I will try to reproduce the issue with lower logging level so the GOSSIPs will be visible, but maybe what I already shared gives any clue if this can be improved.

Expected behavior

I think there should be no node in the cluster with same IP and PORT.

Additional information

Full log from affected pod:

I1010 23:40:39.716347  18:S Start of election delayed for 0 milliseconds (rank #0, offset 650332392).
I1010 23:40:39.716473  18:S Starting a failover election for epoch 9, node config epoch is 8
I1010 23:40:39.716279  18:S Forced failover user request accepted (user request from 'id=141 addr=192.168.5.54:37738 laddr=192.168.5.52:6380 fd=44 name=*redacted* age=0 idle=0 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=20 multi-mem=0 rbs=16384 rbp=16384 obl=0
 oll=0 omem=0 tot-mem=17324 events=r cmd=cluster|failover user=*redacted* redir=-1 resp=2 lib-name= lib-ver= tot-net-in=232 tot-net-out=2971 tot-cmds=5').
I1010 23:40:39.720222  18:M Discarding previously cached primary state.
I1010 23:40:39.720208  18:S configEpoch set to 9 after successful failover
I1010 23:40:39.720185  18:S Failover election won: I'm the new primary.
I1010 23:40:39.720214  18:M Connection with primary lost.
I1010 23:40:39.720218  18:M Caching the disconnected primary state.
I1010 23:40:39.720228  18:M Setting secondary replication ID to fc2c29faebd3533e0f3b15052d651cb656853440, valid up to offset: 650332393. New replication ID is aa2dd4f2e298e5a3122a1f1646b7de61218adb0b
I1010 23:40:39.723631 
I1010 23:40:39.723589  dad564b61f8758c0d29328b5803edcdaa0811c6f 192.168.51.35:6380@16380 slave 6f038b2fa88fe9c6bbcd9981db7c7ea458b42d44 0 1760139639092 1 connected
I1010 23:40:39.723618  d933a2d1b785fa934f1a5efae9243a937b522ea1 192.168.5.52:6380@16380 myself,master - 0 0 9 connected 10923-16383
I1010 23:40:39.723625  756d2c3a79c1dfba5df86ccff020e0a72475a47d 192.168.37.165:6380@16380 master - 0 1760139639596 7 connected 5462-10922
I1010 23:40:39.723628  6f038b2fa88fe9c6bbcd9981db7c7ea458b42d44 192.168.132.116:6380@16380 master - 0 1760139639193 1 connected 0-5461
I1010 23:40:39.723613  256f23c60e389a6edfd5a657ca2c143e50c42830 192.168.51.38:6380@16380 master - 0 1760139639596 8 connected
I1010 23:40:39.723621  a6126962691ab37568773919497b6498254f26fd 192.168.132.112:6380@16380 slave 756d2c3a79c1dfba5df86ccff020e0a72475a47d 0 1760139639193 7 connected
I1010 23:40:39.732682  18:M Partial resynchronization not accepted: Requested offset for second ID was 650332995, but I can reply up to 650332393
I1010 23:40:39.732651  18:M Replica 192.168.51.38:6380 asks for synchronization
I1010 23:40:39.732691  18:M Replica 192.168.51.38:6380 is capable of dual channel synchronization, and partial sync isn't possible. Full sync will continue with dedicated RDB channel.
I1010 23:40:39.740857  18:M Starting BGSAVE for SYNC with target: replicas sockets using: dual-channel
I1010 23:40:39.740838  18:M Replica 192.168.51.38:6380 asks for synchronization
I1010 23:40:39.740863  18:M Sending to replica 192.168.51.38:6380 RDB end offset 650332392 and client-id 147
I1010 23:40:39.745611  18:M Background RDB transfer started by pid 32 to direct socket to replica
I1010 23:40:39.748410  18:M Replica 192.168.51.38:6380 asks for synchronization
I1010 23:40:39.748450  18:M Partial resynchronization request from 192.168.51.38:6380 accepted. Sending 0 bytes of backlog starting from offset 650332393.
I1010 23:40:39.764787  18:M Connection with replica client id #143 lost.
I1010 23:40:39.765286  18:M Replica 192.168.51.38:6380 rdb channel disconnected.
W1010 23:40:39.797651  18:M Background RDB transfer error
I1010 23:40:42.176969  18:M Successfully completed handshake with cba6f96b9c5d2cf97ff2112a77967a0c982873be ()
W1010 23:40:42.176934  18:M Failover election in progress for epoch 9, but received a claim from node cba6f96b9c5d2cf97ff2112a77967a0c982873be () with an equal or higher epoch 10. Resetting the election since we cannot win an election in the past.
I1010 23:40:42.177122  18:M Address updated for node cba6f96b9c5d2cf97ff2112a77967a0c982873be (), now 192.168.37.138:6380
I1010 23:40:42.180548  756d2c3a79c1dfba5df86ccff020e0a72475a47d 192.168.37.165:6380@16380 master - 0 1760139641648 7 connected 5462-10922
I1010 23:40:42.180503  cba6f96b9c5d2cf97ff2112a77967a0c982873be 192.168.37.138:6380@16380 master - 0 1760139642157 10 disconnected
I1010 23:40:42.180530  dad564b61f8758c0d29328b5803edcdaa0811c6f 192.168.51.35:6380@16380 slave 6f038b2fa88fe9c6bbcd9981db7c7ea458b42d44 0 1760139641648 1 connected
I1010 23:40:42.180544  a6126962691ab37568773919497b6498254f26fd 192.168.132.112:6380@16380 slave 756d2c3a79c1dfba5df86ccff020e0a72475a47d 0 1760139641211 7 connected
I1010 23:40:42.180539  d933a2d1b785fa934f1a5efae9243a937b522ea1 192.168.5.52:6380@16380 myself,master - 0 0 9 connected 10923-16383
I1010 23:40:42.180551  6f038b2fa88fe9c6bbcd9981db7c7ea458b42d44 192.168.132.116:6380@16380 master - 0 1760139642000 1 connected 0-5461
I1010 23:40:42.180554 
I1010 23:40:42.255927  18:M Node cba6f96b9c5d2cf97ff2112a77967a0c982873be () is no longer primary of shard b97108364721ecad6dfcc1932290c9ba2e321bca; removed all 0 slot(s) it used to own
I1010 23:40:42.255958  18:M Node cba6f96b9c5d2cf97ff2112a77967a0c982873be () is now part of shard b24ffc0709aee18910992a47668946085bc63936
I1010 23:40:42.255964  18:M Node cba6f96b9c5d2cf97ff2112a77967a0c982873be () is now a replica of node d933a2d1b785fa934f1a5efae9243a937b522ea1 () in shard b24ffc0709aee18910992a47668946085bc63936
I1010 23:40:42.259767  18:M Replica 192.168.37.138:6380 is capable of dual channel synchronization, and partial sync isn't possible. Full sync will continue with dedicated RDB channel.
I1010 23:40:42.259761  18:M Full resync requested by replica 192.168.37.138:6380
I1010 23:40:42.259749  18:M Replica 192.168.37.138:6380 asks for synchronization
I1010 23:40:42.263598  18:M Replica 192.168.37.138:6380 asks for synchronization
I1010 23:40:42.263622  18:M Starting BGSAVE for SYNC with target: replicas sockets using: dual-channel
I1010 23:40:42.263627  18:M Sending to replica 192.168.37.138:6380 RDB end offset 650348316 and client-id 186
I1010 23:40:42.267637  18:M Background RDB transfer started by pid 33 to direct socket to replica
I1010 23:40:42.269417  18:M Partial resynchronization request from 192.168.37.138:6380 accepted. Sending 0 bytes of backlog starting from offset 650348317.
I1010 23:40:42.269370  18:M Replica 192.168.37.138:6380 asks for synchronization
I1010 23:40:42.344211  log catcher is delaying logging cluster state
I1010 23:40:42.343905  18:M Replica 192.168.37.138:6380 rdb channel disconnected.
I1010 23:40:42.344175  18:M Synchronization with replica 192.168.37.138:6380 succeeded
I1010 23:40:42.345081  33:C Fork CoW for RDB: current 3 MB, peak 3 MB, average 2 MB
I1010 23:40:42.352481  18:M Background RDB transfer terminated with success
I1010 23:40:43.345034  dad564b61f8758c0d29328b5803edcdaa0811c6f 192.168.51.35:6380@16380 slave 6f038b2fa88fe9c6bbcd9981db7c7ea458b42d44 0 1760139643058 1 connected
I1010 23:40:43.345039  d933a2d1b785fa934f1a5efae9243a937b522ea1 192.168.5.52:6380@16380 myself,master - 0 0 9 connected 10923-16383
I1010 23:40:43.345044  a6126962691ab37568773919497b6498254f26fd 192.168.132.112:6380@16380 slave 756d2c3a79c1dfba5df86ccff020e0a72475a47d 0 1760139643058 7 connected
I1010 23:40:43.345051  6f038b2fa88fe9c6bbcd9981db7c7ea458b42d44 192.168.132.116:6380@16380 master - 0 1760139643000 1 connected 0-5461
I1010 23:40:43.345054 
I1010 23:40:43.345001  cba6f96b9c5d2cf97ff2112a77967a0c982873be 192.168.37.138:6380@16380 slave d933a2d1b785fa934f1a5efae9243a937b522ea1 0 1760139643260 9 connected
I1010 23:40:43.345048  756d2c3a79c1dfba5df86ccff020e0a72475a47d 192.168.37.165:6380@16380 master - 0 1760139643058 7 connected 5462-10922
I1010 23:42:27.966860  18:M Failover auth granted to dad564b61f8758c0d29328b5803edcdaa0811c6f () for epoch 11
I1010 23:42:32.145391  18:M Node 7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe () is now a replica of node dad564b61f8758c0d29328b5803edcdaa0811c6f () in shard bec51287c31b192c3a4a1936493523ca0882f87d
I1010 23:42:32.145377  18:M Node 7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe () is no longer primary of shard 14d7218516be3bff8259ee30bc3b6b57fa835673; removed all 0 slot(s) it used to own
I1010 23:42:32.145385  18:M Node 7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe () is now part of shard bec51287c31b192c3a4a1936493523ca0882f87d
I1010 23:42:32.145339  18:M Successfully completed handshake with 7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe ()
I1010 23:42:32.438524  d933a2d1b785fa934f1a5efae9243a937b522ea1 192.168.5.52:6380@16380 myself,master - 0 0 9 connected 10923-16383
I1010 23:42:32.438530  a6126962691ab37568773919497b6498254f26fd 192.168.132.112:6380@16380 slave 756d2c3a79c1dfba5df86ccff020e0a72475a47d 0 1760139752129 7 connected
I1010 23:42:32.438461  cba6f96b9c5d2cf97ff2112a77967a0c982873be 192.168.37.138:6380@16380 slave d933a2d1b785fa934f1a5efae9243a937b522ea1 0 1760139752000 9 connected
I1010 23:42:32.438535  756d2c3a79c1dfba5df86ccff020e0a72475a47d 192.168.37.165:6380@16380 master - 0 1760139752028 7 connected 5462-10922
I1010 23:42:32.438540  7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe 192.168.51.41:6380@16380 slave dad564b61f8758c0d29328b5803edcdaa0811c6f 0 1760139752432 11 disconnected
I1010 23:42:32.437782  18:M Address updated for node 7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe (), now 192.168.51.41:6380
I1010 23:42:32.438544 
I1010 23:42:32.438490  dad564b61f8758c0d29328b5803edcdaa0811c6f 192.168.51.35:6380@16380 master - 0 1760139752331 11 connected 0-5461
I1010 23:42:34.403031  18:M Node 7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe () is now a replica of node 756d2c3a79c1dfba5df86ccff020e0a72475a47d () in shard 2f93c933f455fdff696bd292b7d9b7330ee83314
I1010 23:42:34.404257  18:M Node a6126962691ab37568773919497b6498254f26fd () is now a replica of node dad564b61f8758c0d29328b5803edcdaa0811c6f () in shard bec51287c31b192c3a4a1936493523ca0882f87d
I1010 23:52:07.940500       1 signal.go:30] Signal received: terminated
I1010 23:52:07.940665       1 main.go:38] Shutdown handler called
E1010 23:52:07.940792       1 redisnode.go:325] http server stop error: http: Server closed
I1010 23:52:07.975305       1 admin.go:321] Failover by address: 192.168.5.52:6380
I1010 23:52:07.984725       1 admin.go:345] Starting failover for node d933a2d1b785fa934f1a5efae9243a937b522ea1
W1010 23:52:08.019923       1 clusterstatus.go:223] inconsistent cluster view
I1010 23:52:08.019964       1 clusterstatus.go:224] consolidated signature:
I1010 23:52:08.021038       1 clusterstatus.go:231] node 756d signature:
I1010 23:52:08.021026       1 clusterstatus.go:227]
I1010 23:52:08.020986       1 clusterstatus.go:227] 756d [5462-10922]
I1010 23:52:08.023412       1 clusterstatus.go:234] d933 [10923-16383]
I1010 23:52:08.023416       1 clusterstatus.go:234]
I1010 23:52:08.023426       1 clusterstatus.go:231] node 7fcf signature:
I1010 23:52:08.023407       1 clusterstatus.go:234] dad5 [0-5461]
I1010 23:52:08.023376       1 clusterstatus.go:234] 756d [5462-10922]
I1010 23:52:08.029074       1 clusterstatus.go:234] 756d [5462-10922]
I1010 23:52:08.029084       1 clusterstatus.go:234] d933 [10923-16383]
I1010 23:52:08.029031       1 clusterstatus.go:234] dad5 [0-5461]
I1010 23:52:08.029091       1 clusterstatus.go:234]
I1010 23:52:08.029103       1 clusterstatus.go:231] node a612 signature:
I1010 23:52:08.031447       1 clusterstatus.go:234]
I1010 23:52:08.031411       1 clusterstatus.go:234] 756d [5462-10922]
I1010 23:52:08.031439       1 clusterstatus.go:234] d933 [10923-16383]
I1010 23:52:08.031444       1 clusterstatus.go:234] dad5 [0-5461]
I1010 23:52:08.031454       1 clusterstatus.go:231] node cba6 signature:
I1010 23:52:08.034195       1 admin.go:351] Found 1 slave nodes
I1010 23:52:08.034182       1 clusterstatus.go:234] d933 [10923-16383]
I1010 23:52:08.034186       1 clusterstatus.go:234] 756d [5462-10922]
I1010 23:52:08.034202       1 admin.go:367] Attempting failover for slave node cba6f96b9c5d2cf97ff2112a77967a0c982873be
I1010 23:52:08.034157       1 clusterstatus.go:234] dad5 [0-5461]
I1010 23:52:08.034189       1 clusterstatus.go:234]
I1010 23:52:08.039138  18:M Failover auth granted to a6126962691ab37568773919497b6498254f26fd () for epoch 13
I1010 23:52:08.051635       1 admin.go:389] Failover command successfully sent to cba6f96b9c5d2cf97ff2112a77967a0c982873be
I1010 23:52:08.051768  18:M Manual failover requested by replica cba6f96b9c5d2cf97ff2112a77967a0c982873be ().
I1010 23:52:08.051678       1 admin.go:397] Starting to wait for slot transfer from master=(d933a2d1b785fa934f1a5efae9243a937b522ea1, 192.168.5.52:6380) to replica=(cba6f96b9c5d2cf97ff2112a77967a0c982873be, 192.168.37.138:6380) to complete, total slots to transfer: 5461
W1010 23:52:08.059995  18:M Failover auth denied to cba6f96b9c5d2cf97ff2112a77967a0c982873be : can't vote about this primary before 3997 milliseconds
I1010 23:52:08.059971  18:M Failover auth granted to cba6f96b9c5d2cf97ff2112a77967a0c982873be () for epoch 14
I1010 23:52:08.060182       1 admin.go:417] Waiting for failover to complete slot transfer...
I1010 23:52:08.091436  18:M Node dad564b61f8758c0d29328b5803edcdaa0811c6f () is now a replica of node a6126962691ab37568773919497b6498254f26fd () in shard bec51287c31b192c3a4a1936493523ca0882f87d
I1010 23:52:08.091404  18:M Address updated for node dad564b61f8758c0d29328b5803edcdaa0811c6f (), now 127.0.0.1:6380
I1010 23:52:08.096713  d933a2d1b785fa934f1a5efae9243a937b522ea1 127.0.0.1:6380@16380 myself,master - 0 0 9 connected 10923-16383
I1010 23:52:08.096732 
I1010 23:52:08.096708  dad564b61f8758c0d29328b5803edcdaa0811c6f 127.0.0.1:6380@16380 slave a6126962691ab37568773919497b6498254f26fd 0 1760140327847 13 disconnected
I1010 23:52:08.096717  a6126962691ab37568773919497b6498254f26fd 192.168.132.112:6380@16380 master - 0 1760140327140 13 connected 0-5461
I1010 23:52:08.096723  7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe 192.168.51.41:6380@16380 slave 756d2c3a79c1dfba5df86ccff020e0a72475a47d 0 1760140327544 7 connected
I1010 23:52:08.096720  756d2c3a79c1dfba5df86ccff020e0a72475a47d 192.168.37.165:6380@16380 master - 0 1760140327544 7 connected 5462-10922
I1010 23:52:08.096687  cba6f96b9c5d2cf97ff2112a77967a0c982873be 192.168.37.138:6380@16380 slave d933a2d1b785fa934f1a5efae9243a937b522ea1 0 1760140328049 9 connected
I1010 23:52:08.151389  18:M Connection with replica 192.168.37.138:6380 lost.
I1010 23:52:08.154041  18:M PONG contains mismatching sender ID. About node dad564b61f8758c0d29328b5803edcdaa0811c6f () in shard bec51287c31b192c3a4a1936493523ca0882f87d added 1532191 ms ago, having flags 3074
I1010 23:52:09.100904  cba6f96b9c5d2cf97ff2112a77967a0c982873be 192.168.37.138:6380@16380 slave d933a2d1b785fa934f1a5efae9243a937b522ea1 1760140329058 1760140328152 9 connected
I1010 23:52:09.100471  18:M Address updated for node dad564b61f8758c0d29328b5803edcdaa0811c6f (), now 127.0.0.1:6380
I1010 23:52:09.100939  d933a2d1b785fa934f1a5efae9243a937b522ea1 127.0.0.1:6380@16380 myself,master - 0 0 9 connected 10923-16383
I1010 23:52:09.100947  756d2c3a79c1dfba5df86ccff020e0a72475a47d 192.168.37.165:6380@16380 master - 0 1760140328554 7 connected 5462-10922
I1010 23:52:09.100950  7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe 192.168.51.41:6380@16380 slave 756d2c3a79c1dfba5df86ccff020e0a72475a47d 0 1760140328554 7 connected
I1010 23:52:09.100926  dad564b61f8758c0d29328b5803edcdaa0811c6f 127.0.0.1:6380@16380 slave a6126962691ab37568773919497b6498254f26fd 1760140328152 1760140328000 13 disconnected
I1010 23:52:09.100943  a6126962691ab37568773919497b6498254f26fd 192.168.132.112:6380@16380 master - 0 1760140328857 13 connected 0-5461
I1010 23:52:09.100954 
I1010 23:52:09.165598  18:M PONG contains mismatching sender ID. About node dad564b61f8758c0d29328b5803edcdaa0811c6f () in shard bec51287c31b192c3a4a1936493523ca0882f87d added 1533203 ms ago, having flags 3074
I1010 23:52:10.109863  7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe 192.168.51.41:6380@16380 slave 756d2c3a79c1dfba5df86ccff020e0a72475a47d 0 1760140329868 7 connected
I1010 23:52:10.109853  d933a2d1b785fa934f1a5efae9243a937b522ea1 127.0.0.1:6380@16380 myself,master - 0 0 9 connected 10923-16383
I1010 23:52:10.109374  18:M Address updated for node dad564b61f8758c0d29328b5803edcdaa0811c6f (), now 127.0.0.1:6380
I1010 23:52:10.109856  a6126962691ab37568773919497b6498254f26fd 192.168.132.112:6380@16380 master - 0 1760140330069 13 connected 0-5461
I1010 23:52:10.109860  756d2c3a79c1dfba5df86ccff020e0a72475a47d 192.168.37.165:6380@16380 master - 0 1760140329565 7 connected 5462-10922
I1010 23:52:10.109867 
I1010 23:52:10.109824  cba6f96b9c5d2cf97ff2112a77967a0c982873be 192.168.37.138:6380@16380 slave d933a2d1b785fa934f1a5efae9243a937b522ea1 1760140330068 1760140328152 9 connected
I1010 23:52:10.109847  dad564b61f8758c0d29328b5803edcdaa0811c6f 127.0.0.1:6380@16380 slave a6126962691ab37568773919497b6498254f26fd 1760140328152 1760140328000 13 disconnected
I1010 23:52:10.176490  18:M PONG contains mismatching sender ID. About node dad564b61f8758c0d29328b5803edcdaa0811c6f () in shard bec51287c31b192c3a4a1936493523ca0882f87d added 1534214 ms ago, having flags 3074
E1010 23:52:10.261057       1 redisnode.go:448] Failover error: FailoverError: failed to contact replica node when checking progress
E1010 23:52:10.261026       1 admin.go:1022] Failed to get node at 192.168.37.138:6380: unable to retrieve Node Info for the node with address 192.168.37.138:6380 due to error failed to perform action ["CLUSTER" "NODES"], err: read tcp 192.168.5.52:43784->192.168.37.138:6380: i/o timeout
E1010 23:52:10.261245       1 redisnode.go:170] error happened during fail over and forget: cannot failover by address 192.168.5.52:6380: FailoverError: failed to contact replica node when checking progress
I1010 23:52:10.261258       1 main.go:42] fail over done, shut down
I1010 23:52:10.261291       1 main.go:66] redisnode.Run() stopped gracefully
I1010 23:52:10.261280       1 redisnode.go:157] received Stop Signal...

BR,
Seb.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions