Description
Describe the bug
Description
When disconnecting and reconnecting routers, if one router is busy and the timing right, it can fail the reconnection and display
ERROR rx-0 ThreadId(06) zenoh::net::routing::hat::router: Received router declaration with unknown routing context id 0
Expected
The reconnection goes well and messages are routed between the two hosts.
My setup
- Two computers, A and B, connected through Wi-Fi.
- Each has a router configured to connect to each others in TCP.
- A has two z_pub (the modified examples/z_pub), publishing blocking messages at 50Hz.
- A and B each have one z_sub (I believe the A z_sub is not needed).
- No shm.
Sometimes, when disconnecting and reconnecting B from the Wi-Fi, the following error messages appears in A's router logs, and no message is routed between the two machines:
2025-04-03T16:11:51.643967Z ERROR rx-0 ThreadId(06) zenoh::net::routing::hat::router: Received router declaration with unknown routing context id 0
2025-04-03T16:11:51.643984Z ERROR rx-0 ThreadId(06) zenoh::net::routing::hat::router: Received router declaration with unknown routing context id 0
2025-04-03T16:11:51.643997Z ERROR rx-0 ThreadId(06) zenoh::net::routing::hat::router: Received router declaration with unknown routing context id 0
Here are TCPDumps from A and B, as well as the router's logs.
logs. I had to deconnect and reconnect B a bunch of times for the bug to trigger. The bug happens after the reconnection at 16:11:48.888Z
router-a01.txt.gz
router-b01.txt.gz
capture-a.pcap.gz
capture-b.pcap.gz
Analysis and hypothesis
Here is an extract of what I think are the relevant events, as seen from the router A:
16:11:34Z B disconnected from Wi-Fi.
16:11:48.149233Z RX task failed: tcp/10.42.0.1:60920 => tcp/10.42.0.120:7447: expired after 10000 milliseconds.
16:11:48.149269Z Closing transport with peer: b01
16:11:48.888Z B connected to Wi-Fi.
16:11:51.544858Z [Routers network] Received from b01 raw: [LinkState { psid: 1, sn: 9, zid: Some(a01), whatami: Some(Router), locators: Some([tcp/10.42.0.1:7447, tcp/192.168.1.44:7447, tcp/10.45.0.1:7447]), li\
nks: [0, 0] }]
16:11:51.643951Z Face{9, b01} Declare subscriber 0 (demo/example/**)
16:11:51.643967Z ERROR: Received router declaration with unknown routing context id 0.
16:11:52.684869Z A:60920 sends a TCP retransmission to B:7447.
16:11:52.689312Z B:7447 sends RST to A:60920.
16:11:52.689584Z Write error on TCP link 10.42.0.1:60920 => 10.42.0.120:7447: Connection reset by peer (os error 104).
16:11:52.689734Z A closes Face{8, b01} as a result of the TCP RST.
I see two weird thing:
- The LinkState message sent by b01 shows two links, [0, 0]. But that might not be relevant.
- Face{8, b01} is closed after opening Face{9, b01}.
In fact, if I run grep 'Face{., b01' router-a01.txt
, I see this:
16:11:35.545777Z DEBUG net-0 ThreadId(03) zenoh::net::routing::router: New Face{8, b01}
16:11:35.647248Z DEBUG rx-0 ThreadId(06) zenoh::net::routing::dispatcher::pubsub: Face{8, b01} Declare subscriber 0 (demo/example/**)
16:11:35.647276Z DEBUG rx-0 ThreadId(06) zenoh::net::routing::dispatcher::pubsub: Face{8, b01} Declare subscriber 0 (@/b01/router/config/**)
16:11:35.647291Z DEBUG rx-0 ThreadId(06) zenoh::net::routing::dispatcher::queries: Face{8, b01} Declare queryable 0 (@/b01/router/**)
16:11:51.541409Z DEBUG net-0 ThreadId(03) zenoh::net::routing::router: New Face{9, b01}
16:11:51.643951Z DEBUG rx-0 ThreadId(06) zenoh::net::routing::dispatcher::pubsub: Face{9, b01} Declare subscriber 0 (demo/example/**)
16:11:51.643978Z DEBUG rx-0 ThreadId(06) zenoh::net::routing::dispatcher::pubsub: Face{9, b01} Declare subscriber 0 (@/b01/router/config/**)
16:11:51.643991Z DEBUG rx-0 ThreadId(06) zenoh::net::routing::dispatcher::queries: Face{9, b01} Declare queryable 0 (@/b01/router/**)
16:11:52.689734Z DEBUG rx-0 ThreadId(06) zenoh::net::routing::dispatcher::face: Face{8, b01} Close
16:12:08.667186Z DEBUG rx-0 ThreadId(06) zenoh::net::routing::dispatcher::face: Face{9, b01} Close
For some reason, perhaps because the RX threads were busy, Face{8. b01}
was not properly closed when b01 reconnected, leading to the simultaneous existence of both Face{8, b01}
and Face{9, b01}
.
This might be related, or have the same fix as #1876.
To reproduce
- Reproduce a network similar to the one given in example.
- Run a zenohd router on each computer with the given configuration.
- Run at least two z_pub on one side, modified to publish at 50Hz. Use the given configuration file with
--mode peer
. - Run a z_sub on the other computer.
- Optional: Run a z_sub on the first computer, but I'm not sure if it is necessary.
- Disconnect B from the network.
- As soon as A's routers shows
RX task failed: [...] expired after 10000 milliseconds
, reconnect B to the network. - Repeat 6 and 8 until you see the error
Received router declaration with unknown routing context id
on A's router. On my production setup, this happens almost every time, but with this simplified setup it happens around 1 out of 4 times.
The configuration file: conf.json
I also created a bash file to ease the reproduction if you want.
To use it:
- On both computers:
- Set the
interface
env variable to the network interface used to transmit to the other computer. - Have
zenohd
,z_pub
andz_sub
in your PATH.
- Set the
- On computer A, run
./unknown-session.sh a
. - On computer B, run
./unknown-session.sh b
. - After a few seconds, disconnect B from the network.
- As soon as A's routers shows
RX task failed: [...] expired after 10000 milliseconds
, reconnect B to the network. - Repeat 3 and 4 until you see the error
Received router declaration with unknown routing context id
on A's router.
System info
- Platform: Ubuntu 20.04.
- CPU: Intel i7-9750H.
- Zenoh: 1.3.1, with default features, in release.