[BUG] Multiple nodes with same IP and PORT

**Describe the bug**

We have a test case in which 2 out 3 primaries are deleted in the same time - short gap between deletion in reality. In majority of cases this gap is enough for cluster to recover (time for failovers). Anyway in failure runs (1/50) I am observing a strange behavior in which multiple nodes have same IP and PORT: 127.0.0.1:6380. It doesn't seam right. Have a look at the logs below which I have.

```
I1010 23:52:09.100471  18:M Address updated for node dad564b61f8758c0d29328b5803edcdaa0811c6f (), now 127.0.0.1:6380
I1010 23:52:09.100939  d933a2d1b785fa934f1a5efae9243a937b522ea1 127.0.0.1:6380@16380 myself,master - 0 0 9 connected 10923-16383
I1010 23:52:09.100947  756d2c3a79c1dfba5df86ccff020e0a72475a47d 192.168.37.165:6380@16380 master - 0 1760140328554 7 connected 5462-10922
I1010 23:52:09.100950  7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe 192.168.51.41:6380@16380 slave 756d2c3a79c1dfba5df86ccff020e0a72475a47d 0 1760140328554 7 connected
I1010 23:52:09.100926  dad564b61f8758c0d29328b5803edcdaa0811c6f 127.0.0.1:6380@16380 slave a6126962691ab37568773919497b6498254f26fd 1760140328152 1760140328000 13 disconnected
I1010 23:52:09.100943  a6126962691ab37568773919497b6498254f26fd 192.168.132.112:6380@16380 master - 0 1760140328857 13 connected 0-5461
```

**To reproduce**

N/A.  I will try to reproduce the issue with lower logging level so the GOSSIPs will be visible, but maybe what I already shared gives any clue if this can be improved.


**Expected behavior**

I think there should be no node in the cluster with same IP and PORT.

**Additional information**

Full log from affected pod:

```
I1010 23:40:39.716347  18:S Start of election delayed for 0 milliseconds (rank #0, offset 650332392).
I1010 23:40:39.716473  18:S Starting a failover election for epoch 9, node config epoch is 8
I1010 23:40:39.716279  18:S Forced failover user request accepted (user request from 'id=141 addr=192.168.5.54:37738 laddr=192.168.5.52:6380 fd=44 name=*redacted* age=0 idle=0 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=20 multi-mem=0 rbs=16384 rbp=16384 obl=0
 oll=0 omem=0 tot-mem=17324 events=r cmd=cluster|failover user=*redacted* redir=-1 resp=2 lib-name= lib-ver= tot-net-in=232 tot-net-out=2971 tot-cmds=5').
I1010 23:40:39.720222  18:M Discarding previously cached primary state.
I1010 23:40:39.720208  18:S configEpoch set to 9 after successful failover
I1010 23:40:39.720185  18:S Failover election won: I'm the new primary.
I1010 23:40:39.720214  18:M Connection with primary lost.
I1010 23:40:39.720218  18:M Caching the disconnected primary state.
I1010 23:40:39.720228  18:M Setting secondary replication ID to fc2c29faebd3533e0f3b15052d651cb656853440, valid up to offset: 650332393. New replication ID is aa2dd4f2e298e5a3122a1f1646b7de61218adb0b
I1010 23:40:39.723631 
I1010 23:40:39.723589  dad564b61f8758c0d29328b5803edcdaa0811c6f 192.168.51.35:6380@16380 slave 6f038b2fa88fe9c6bbcd9981db7c7ea458b42d44 0 1760139639092 1 connected
I1010 23:40:39.723618  d933a2d1b785fa934f1a5efae9243a937b522ea1 192.168.5.52:6380@16380 myself,master - 0 0 9 connected 10923-16383
I1010 23:40:39.723625  756d2c3a79c1dfba5df86ccff020e0a72475a47d 192.168.37.165:6380@16380 master - 0 1760139639596 7 connected 5462-10922
I1010 23:40:39.723628  6f038b2fa88fe9c6bbcd9981db7c7ea458b42d44 192.168.132.116:6380@16380 master - 0 1760139639193 1 connected 0-5461
I1010 23:40:39.723613  256f23c60e389a6edfd5a657ca2c143e50c42830 192.168.51.38:6380@16380 master - 0 1760139639596 8 connected
I1010 23:40:39.723621  a6126962691ab37568773919497b6498254f26fd 192.168.132.112:6380@16380 slave 756d2c3a79c1dfba5df86ccff020e0a72475a47d 0 1760139639193 7 connected
I1010 23:40:39.732682  18:M Partial resynchronization not accepted: Requested offset for second ID was 650332995, but I can reply up to 650332393
I1010 23:40:39.732651  18:M Replica 192.168.51.38:6380 asks for synchronization
I1010 23:40:39.732691  18:M Replica 192.168.51.38:6380 is capable of dual channel synchronization, and partial sync isn't possible. Full sync will continue with dedicated RDB channel.
I1010 23:40:39.740857  18:M Starting BGSAVE for SYNC with target: replicas sockets using: dual-channel
I1010 23:40:39.740838  18:M Replica 192.168.51.38:6380 asks for synchronization
I1010 23:40:39.740863  18:M Sending to replica 192.168.51.38:6380 RDB end offset 650332392 and client-id 147
I1010 23:40:39.745611  18:M Background RDB transfer started by pid 32 to direct socket to replica
I1010 23:40:39.748410  18:M Replica 192.168.51.38:6380 asks for synchronization
I1010 23:40:39.748450  18:M Partial resynchronization request from 192.168.51.38:6380 accepted. Sending 0 bytes of backlog starting from offset 650332393.
I1010 23:40:39.764787  18:M Connection with replica client id #143 lost.
I1010 23:40:39.765286  18:M Replica 192.168.51.38:6380 rdb channel disconnected.
W1010 23:40:39.797651  18:M Background RDB transfer error
I1010 23:40:42.176969  18:M Successfully completed handshake with cba6f96b9c5d2cf97ff2112a77967a0c982873be ()
W1010 23:40:42.176934  18:M Failover election in progress for epoch 9, but received a claim from node cba6f96b9c5d2cf97ff2112a77967a0c982873be () with an equal or higher epoch 10. Resetting the election since we cannot win an election in the past.
I1010 23:40:42.177122  18:M Address updated for node cba6f96b9c5d2cf97ff2112a77967a0c982873be (), now 192.168.37.138:6380
I1010 23:40:42.180548  756d2c3a79c1dfba5df86ccff020e0a72475a47d 192.168.37.165:6380@16380 master - 0 1760139641648 7 connected 5462-10922
I1010 23:40:42.180503  cba6f96b9c5d2cf97ff2112a77967a0c982873be 192.168.37.138:6380@16380 master - 0 1760139642157 10 disconnected
I1010 23:40:42.180530  dad564b61f8758c0d29328b5803edcdaa0811c6f 192.168.51.35:6380@16380 slave 6f038b2fa88fe9c6bbcd9981db7c7ea458b42d44 0 1760139641648 1 connected
I1010 23:40:42.180544  a6126962691ab37568773919497b6498254f26fd 192.168.132.112:6380@16380 slave 756d2c3a79c1dfba5df86ccff020e0a72475a47d 0 1760139641211 7 connected
I1010 23:40:42.180539  d933a2d1b785fa934f1a5efae9243a937b522ea1 192.168.5.52:6380@16380 myself,master - 0 0 9 connected 10923-16383
I1010 23:40:42.180551  6f038b2fa88fe9c6bbcd9981db7c7ea458b42d44 192.168.132.116:6380@16380 master - 0 1760139642000 1 connected 0-5461
I1010 23:40:42.180554 
I1010 23:40:42.255927  18:M Node cba6f96b9c5d2cf97ff2112a77967a0c982873be () is no longer primary of shard b97108364721ecad6dfcc1932290c9ba2e321bca; removed all 0 slot(s) it used to own
I1010 23:40:42.255958  18:M Node cba6f96b9c5d2cf97ff2112a77967a0c982873be () is now part of shard b24ffc0709aee18910992a47668946085bc63936
I1010 23:40:42.255964  18:M Node cba6f96b9c5d2cf97ff2112a77967a0c982873be () is now a replica of node d933a2d1b785fa934f1a5efae9243a937b522ea1 () in shard b24ffc0709aee18910992a47668946085bc63936
I1010 23:40:42.259767  18:M Replica 192.168.37.138:6380 is capable of dual channel synchronization, and partial sync isn't possible. Full sync will continue with dedicated RDB channel.
I1010 23:40:42.259761  18:M Full resync requested by replica 192.168.37.138:6380
I1010 23:40:42.259749  18:M Replica 192.168.37.138:6380 asks for synchronization
I1010 23:40:42.263598  18:M Replica 192.168.37.138:6380 asks for synchronization
I1010 23:40:42.263622  18:M Starting BGSAVE for SYNC with target: replicas sockets using: dual-channel
I1010 23:40:42.263627  18:M Sending to replica 192.168.37.138:6380 RDB end offset 650348316 and client-id 186
I1010 23:40:42.267637  18:M Background RDB transfer started by pid 33 to direct socket to replica
I1010 23:40:42.269417  18:M Partial resynchronization request from 192.168.37.138:6380 accepted. Sending 0 bytes of backlog starting from offset 650348317.
I1010 23:40:42.269370  18:M Replica 192.168.37.138:6380 asks for synchronization
I1010 23:40:42.344211  log catcher is delaying logging cluster state
I1010 23:40:42.343905  18:M Replica 192.168.37.138:6380 rdb channel disconnected.
I1010 23:40:42.344175  18:M Synchronization with replica 192.168.37.138:6380 succeeded
I1010 23:40:42.345081  33:C Fork CoW for RDB: current 3 MB, peak 3 MB, average 2 MB
I1010 23:40:42.352481  18:M Background RDB transfer terminated with success
I1010 23:40:43.345034  dad564b61f8758c0d29328b5803edcdaa0811c6f 192.168.51.35:6380@16380 slave 6f038b2fa88fe9c6bbcd9981db7c7ea458b42d44 0 1760139643058 1 connected
I1010 23:40:43.345039  d933a2d1b785fa934f1a5efae9243a937b522ea1 192.168.5.52:6380@16380 myself,master - 0 0 9 connected 10923-16383
I1010 23:40:43.345044  a6126962691ab37568773919497b6498254f26fd 192.168.132.112:6380@16380 slave 756d2c3a79c1dfba5df86ccff020e0a72475a47d 0 1760139643058 7 connected
I1010 23:40:43.345051  6f038b2fa88fe9c6bbcd9981db7c7ea458b42d44 192.168.132.116:6380@16380 master - 0 1760139643000 1 connected 0-5461
I1010 23:40:43.345054 
I1010 23:40:43.345001  cba6f96b9c5d2cf97ff2112a77967a0c982873be 192.168.37.138:6380@16380 slave d933a2d1b785fa934f1a5efae9243a937b522ea1 0 1760139643260 9 connected
I1010 23:40:43.345048  756d2c3a79c1dfba5df86ccff020e0a72475a47d 192.168.37.165:6380@16380 master - 0 1760139643058 7 connected 5462-10922
I1010 23:42:27.966860  18:M Failover auth granted to dad564b61f8758c0d29328b5803edcdaa0811c6f () for epoch 11
I1010 23:42:32.145391  18:M Node 7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe () is now a replica of node dad564b61f8758c0d29328b5803edcdaa0811c6f () in shard bec51287c31b192c3a4a1936493523ca0882f87d
I1010 23:42:32.145377  18:M Node 7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe () is no longer primary of shard 14d7218516be3bff8259ee30bc3b6b57fa835673; removed all 0 slot(s) it used to own
I1010 23:42:32.145385  18:M Node 7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe () is now part of shard bec51287c31b192c3a4a1936493523ca0882f87d
I1010 23:42:32.145339  18:M Successfully completed handshake with 7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe ()
I1010 23:42:32.438524  d933a2d1b785fa934f1a5efae9243a937b522ea1 192.168.5.52:6380@16380 myself,master - 0 0 9 connected 10923-16383
I1010 23:42:32.438530  a6126962691ab37568773919497b6498254f26fd 192.168.132.112:6380@16380 slave 756d2c3a79c1dfba5df86ccff020e0a72475a47d 0 1760139752129 7 connected
I1010 23:42:32.438461  cba6f96b9c5d2cf97ff2112a77967a0c982873be 192.168.37.138:6380@16380 slave d933a2d1b785fa934f1a5efae9243a937b522ea1 0 1760139752000 9 connected
I1010 23:42:32.438535  756d2c3a79c1dfba5df86ccff020e0a72475a47d 192.168.37.165:6380@16380 master - 0 1760139752028 7 connected 5462-10922
I1010 23:42:32.438540  7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe 192.168.51.41:6380@16380 slave dad564b61f8758c0d29328b5803edcdaa0811c6f 0 1760139752432 11 disconnected
I1010 23:42:32.437782  18:M Address updated for node 7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe (), now 192.168.51.41:6380
I1010 23:42:32.438544 
I1010 23:42:32.438490  dad564b61f8758c0d29328b5803edcdaa0811c6f 192.168.51.35:6380@16380 master - 0 1760139752331 11 connected 0-5461
I1010 23:42:34.403031  18:M Node 7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe () is now a replica of node 756d2c3a79c1dfba5df86ccff020e0a72475a47d () in shard 2f93c933f455fdff696bd292b7d9b7330ee83314
I1010 23:42:34.404257  18:M Node a6126962691ab37568773919497b6498254f26fd () is now a replica of node dad564b61f8758c0d29328b5803edcdaa0811c6f () in shard bec51287c31b192c3a4a1936493523ca0882f87d
I1010 23:52:07.940500       1 signal.go:30] Signal received: terminated
I1010 23:52:07.940665       1 main.go:38] Shutdown handler called
E1010 23:52:07.940792       1 redisnode.go:325] http server stop error: http: Server closed
I1010 23:52:07.975305       1 admin.go:321] Failover by address: 192.168.5.52:6380
I1010 23:52:07.984725       1 admin.go:345] Starting failover for node d933a2d1b785fa934f1a5efae9243a937b522ea1
W1010 23:52:08.019923       1 clusterstatus.go:223] inconsistent cluster view
I1010 23:52:08.019964       1 clusterstatus.go:224] consolidated signature:
I1010 23:52:08.021038       1 clusterstatus.go:231] node 756d signature:
I1010 23:52:08.021026       1 clusterstatus.go:227]
I1010 23:52:08.020986       1 clusterstatus.go:227] 756d [5462-10922]
I1010 23:52:08.023412       1 clusterstatus.go:234] d933 [10923-16383]
I1010 23:52:08.023416       1 clusterstatus.go:234]
I1010 23:52:08.023426       1 clusterstatus.go:231] node 7fcf signature:
I1010 23:52:08.023407       1 clusterstatus.go:234] dad5 [0-5461]
I1010 23:52:08.023376       1 clusterstatus.go:234] 756d [5462-10922]
I1010 23:52:08.029074       1 clusterstatus.go:234] 756d [5462-10922]
I1010 23:52:08.029084       1 clusterstatus.go:234] d933 [10923-16383]
I1010 23:52:08.029031       1 clusterstatus.go:234] dad5 [0-5461]
I1010 23:52:08.029091       1 clusterstatus.go:234]
I1010 23:52:08.029103       1 clusterstatus.go:231] node a612 signature:
I1010 23:52:08.031447       1 clusterstatus.go:234]
I1010 23:52:08.031411       1 clusterstatus.go:234] 756d [5462-10922]
I1010 23:52:08.031439       1 clusterstatus.go:234] d933 [10923-16383]
I1010 23:52:08.031444       1 clusterstatus.go:234] dad5 [0-5461]
I1010 23:52:08.031454       1 clusterstatus.go:231] node cba6 signature:
I1010 23:52:08.034195       1 admin.go:351] Found 1 slave nodes
I1010 23:52:08.034182       1 clusterstatus.go:234] d933 [10923-16383]
I1010 23:52:08.034186       1 clusterstatus.go:234] 756d [5462-10922]
I1010 23:52:08.034202       1 admin.go:367] Attempting failover for slave node cba6f96b9c5d2cf97ff2112a77967a0c982873be
I1010 23:52:08.034157       1 clusterstatus.go:234] dad5 [0-5461]
I1010 23:52:08.034189       1 clusterstatus.go:234]
I1010 23:52:08.039138  18:M Failover auth granted to a6126962691ab37568773919497b6498254f26fd () for epoch 13
I1010 23:52:08.051635       1 admin.go:389] Failover command successfully sent to cba6f96b9c5d2cf97ff2112a77967a0c982873be
I1010 23:52:08.051768  18:M Manual failover requested by replica cba6f96b9c5d2cf97ff2112a77967a0c982873be ().
I1010 23:52:08.051678       1 admin.go:397] Starting to wait for slot transfer from master=(d933a2d1b785fa934f1a5efae9243a937b522ea1, 192.168.5.52:6380) to replica=(cba6f96b9c5d2cf97ff2112a77967a0c982873be, 192.168.37.138:6380) to complete, total slots to transfer: 5461
W1010 23:52:08.059995  18:M Failover auth denied to cba6f96b9c5d2cf97ff2112a77967a0c982873be : can't vote about this primary before 3997 milliseconds
I1010 23:52:08.059971  18:M Failover auth granted to cba6f96b9c5d2cf97ff2112a77967a0c982873be () for epoch 14
I1010 23:52:08.060182       1 admin.go:417] Waiting for failover to complete slot transfer...
I1010 23:52:08.091436  18:M Node dad564b61f8758c0d29328b5803edcdaa0811c6f () is now a replica of node a6126962691ab37568773919497b6498254f26fd () in shard bec51287c31b192c3a4a1936493523ca0882f87d
I1010 23:52:08.091404  18:M Address updated for node dad564b61f8758c0d29328b5803edcdaa0811c6f (), now 127.0.0.1:6380
I1010 23:52:08.096713  d933a2d1b785fa934f1a5efae9243a937b522ea1 127.0.0.1:6380@16380 myself,master - 0 0 9 connected 10923-16383
I1010 23:52:08.096732 
I1010 23:52:08.096708  dad564b61f8758c0d29328b5803edcdaa0811c6f 127.0.0.1:6380@16380 slave a6126962691ab37568773919497b6498254f26fd 0 1760140327847 13 disconnected
I1010 23:52:08.096717  a6126962691ab37568773919497b6498254f26fd 192.168.132.112:6380@16380 master - 0 1760140327140 13 connected 0-5461
I1010 23:52:08.096723  7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe 192.168.51.41:6380@16380 slave 756d2c3a79c1dfba5df86ccff020e0a72475a47d 0 1760140327544 7 connected
I1010 23:52:08.096720  756d2c3a79c1dfba5df86ccff020e0a72475a47d 192.168.37.165:6380@16380 master - 0 1760140327544 7 connected 5462-10922
I1010 23:52:08.096687  cba6f96b9c5d2cf97ff2112a77967a0c982873be 192.168.37.138:6380@16380 slave d933a2d1b785fa934f1a5efae9243a937b522ea1 0 1760140328049 9 connected
I1010 23:52:08.151389  18:M Connection with replica 192.168.37.138:6380 lost.
I1010 23:52:08.154041  18:M PONG contains mismatching sender ID. About node dad564b61f8758c0d29328b5803edcdaa0811c6f () in shard bec51287c31b192c3a4a1936493523ca0882f87d added 1532191 ms ago, having flags 3074
I1010 23:52:09.100904  cba6f96b9c5d2cf97ff2112a77967a0c982873be 192.168.37.138:6380@16380 slave d933a2d1b785fa934f1a5efae9243a937b522ea1 1760140329058 1760140328152 9 connected
I1010 23:52:09.100471  18:M Address updated for node dad564b61f8758c0d29328b5803edcdaa0811c6f (), now 127.0.0.1:6380
I1010 23:52:09.100939  d933a2d1b785fa934f1a5efae9243a937b522ea1 127.0.0.1:6380@16380 myself,master - 0 0 9 connected 10923-16383
I1010 23:52:09.100947  756d2c3a79c1dfba5df86ccff020e0a72475a47d 192.168.37.165:6380@16380 master - 0 1760140328554 7 connected 5462-10922
I1010 23:52:09.100950  7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe 192.168.51.41:6380@16380 slave 756d2c3a79c1dfba5df86ccff020e0a72475a47d 0 1760140328554 7 connected
I1010 23:52:09.100926  dad564b61f8758c0d29328b5803edcdaa0811c6f 127.0.0.1:6380@16380 slave a6126962691ab37568773919497b6498254f26fd 1760140328152 1760140328000 13 disconnected
I1010 23:52:09.100943  a6126962691ab37568773919497b6498254f26fd 192.168.132.112:6380@16380 master - 0 1760140328857 13 connected 0-5461
I1010 23:52:09.100954 
I1010 23:52:09.165598  18:M PONG contains mismatching sender ID. About node dad564b61f8758c0d29328b5803edcdaa0811c6f () in shard bec51287c31b192c3a4a1936493523ca0882f87d added 1533203 ms ago, having flags 3074
I1010 23:52:10.109863  7fcfe27413cbc06f71897ccd5f9ff4b1083cfbbe 192.168.51.41:6380@16380 slave 756d2c3a79c1dfba5df86ccff020e0a72475a47d 0 1760140329868 7 connected
I1010 23:52:10.109853  d933a2d1b785fa934f1a5efae9243a937b522ea1 127.0.0.1:6380@16380 myself,master - 0 0 9 connected 10923-16383
I1010 23:52:10.109374  18:M Address updated for node dad564b61f8758c0d29328b5803edcdaa0811c6f (), now 127.0.0.1:6380
I1010 23:52:10.109856  a6126962691ab37568773919497b6498254f26fd 192.168.132.112:6380@16380 master - 0 1760140330069 13 connected 0-5461
I1010 23:52:10.109860  756d2c3a79c1dfba5df86ccff020e0a72475a47d 192.168.37.165:6380@16380 master - 0 1760140329565 7 connected 5462-10922
I1010 23:52:10.109867 
I1010 23:52:10.109824  cba6f96b9c5d2cf97ff2112a77967a0c982873be 192.168.37.138:6380@16380 slave d933a2d1b785fa934f1a5efae9243a937b522ea1 1760140330068 1760140328152 9 connected
I1010 23:52:10.109847  dad564b61f8758c0d29328b5803edcdaa0811c6f 127.0.0.1:6380@16380 slave a6126962691ab37568773919497b6498254f26fd 1760140328152 1760140328000 13 disconnected
I1010 23:52:10.176490  18:M PONG contains mismatching sender ID. About node dad564b61f8758c0d29328b5803edcdaa0811c6f () in shard bec51287c31b192c3a4a1936493523ca0882f87d added 1534214 ms ago, having flags 3074
E1010 23:52:10.261057       1 redisnode.go:448] Failover error: FailoverError: failed to contact replica node when checking progress
E1010 23:52:10.261026       1 admin.go:1022] Failed to get node at 192.168.37.138:6380: unable to retrieve Node Info for the node with address 192.168.37.138:6380 due to error failed to perform action ["CLUSTER" "NODES"], err: read tcp 192.168.5.52:43784->192.168.37.138:6380: i/o timeout
E1010 23:52:10.261245       1 redisnode.go:170] error happened during fail over and forget: cannot failover by address 192.168.5.52:6380: FailoverError: failed to contact replica node when checking progress
I1010 23:52:10.261258       1 main.go:42] fail over done, shut down
I1010 23:52:10.261291       1 main.go:66] redisnode.Run() stopped gracefully
I1010 23:52:10.261280       1 redisnode.go:157] received Stop Signal...

```

BR,
Seb.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[BUG] Multiple nodes with same IP and PORT #2768

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

[BUG] Multiple nodes with same IP and PORT #2768

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions