Skip to content

Reconnection can fail with Received router declaration with unknown routing context id #1886

Open
@Hugal31

Description

@Hugal31

Describe the bug

Description

When disconnecting and reconnecting routers, if one router is busy and the timing right, it can fail the reconnection and display

ERROR  rx-0 ThreadId(06) zenoh::net::routing::hat::router: Received router declaration with unknown routing context id 0

Expected

The reconnection goes well and messages are routed between the two hosts.

My setup

Setup with A and B computer, A has one router, two z_pub and one z_sub, B has one router and one z_sub

  • Two computers, A and B, connected through Wi-Fi.
  • Each has a router configured to connect to each others in TCP.
  • A has two z_pub (the modified examples/z_pub), publishing blocking messages at 50Hz.
  • A and B each have one z_sub (I believe the A z_sub is not needed).
  • No shm.

Sometimes, when disconnecting and reconnecting B from the Wi-Fi, the following error messages appears in A's router logs, and no message is routed between the two machines:

2025-04-03T16:11:51.643967Z ERROR  rx-0 ThreadId(06) zenoh::net::routing::hat::router: Received router declaration with unknown routing context id 0
2025-04-03T16:11:51.643984Z ERROR  rx-0 ThreadId(06) zenoh::net::routing::hat::router: Received router declaration with unknown routing context id 0
2025-04-03T16:11:51.643997Z ERROR  rx-0 ThreadId(06) zenoh::net::routing::hat::router: Received router declaration with unknown routing context id 0

Here are TCPDumps from A and B, as well as the router's logs.
logs. I had to deconnect and reconnect B a bunch of times for the bug to trigger. The bug happens after the reconnection at 16:11:48.888Z

router-a01.txt.gz
router-b01.txt.gz
capture-a.pcap.gz
capture-b.pcap.gz

Analysis and hypothesis

Here is an extract of what I think are the relevant events, as seen from the router A:

16:11:34Z           B disconnected from Wi-Fi.
16:11:48.149233Z    RX task failed: tcp/10.42.0.1:60920 => tcp/10.42.0.120:7447: expired after 10000 milliseconds.
16:11:48.149269Z    Closing transport with peer: b01
16:11:48.888Z       B connected to Wi-Fi.
16:11:51.544858Z    [Routers network] Received from b01 raw: [LinkState { psid: 1, sn: 9, zid: Some(a01), whatami: Some(Router), locators: Some([tcp/10.42.0.1:7447, tcp/192.168.1.44:7447, tcp/10.45.0.1:7447]), li\
nks: [0, 0] }]
16:11:51.643951Z    Face{9, b01} Declare subscriber 0 (demo/example/**)
16:11:51.643967Z    ERROR: Received router declaration with unknown routing context id 0.
16:11:52.684869Z    A:60920 sends a TCP retransmission to B:7447.
16:11:52.689312Z    B:7447 sends RST to A:60920.
16:11:52.689584Z    Write error on TCP link 10.42.0.1:60920 => 10.42.0.120:7447: Connection reset by peer (os error 104).
16:11:52.689734Z    A closes Face{8, b01} as a result of the TCP RST.

I see two weird thing:

  1. The LinkState message sent by b01 shows two links, [0, 0]. But that might not be relevant.
  2. Face{8, b01} is closed after opening Face{9, b01}.

In fact, if I run grep 'Face{., b01' router-a01.txt, I see this:

16:11:35.545777Z DEBUG net-0 ThreadId(03) zenoh::net::routing::router: New Face{8, b01}
16:11:35.647248Z DEBUG  rx-0 ThreadId(06) zenoh::net::routing::dispatcher::pubsub: Face{8, b01} Declare subscriber 0 (demo/example/**)
16:11:35.647276Z DEBUG  rx-0 ThreadId(06) zenoh::net::routing::dispatcher::pubsub: Face{8, b01} Declare subscriber 0 (@/b01/router/config/**)
16:11:35.647291Z DEBUG  rx-0 ThreadId(06) zenoh::net::routing::dispatcher::queries: Face{8, b01} Declare queryable 0 (@/b01/router/**)
16:11:51.541409Z DEBUG net-0 ThreadId(03) zenoh::net::routing::router: New Face{9, b01}
16:11:51.643951Z DEBUG  rx-0 ThreadId(06) zenoh::net::routing::dispatcher::pubsub: Face{9, b01} Declare subscriber 0 (demo/example/**)
16:11:51.643978Z DEBUG  rx-0 ThreadId(06) zenoh::net::routing::dispatcher::pubsub: Face{9, b01} Declare subscriber 0 (@/b01/router/config/**)
16:11:51.643991Z DEBUG  rx-0 ThreadId(06) zenoh::net::routing::dispatcher::queries: Face{9, b01} Declare queryable 0 (@/b01/router/**)
16:11:52.689734Z DEBUG  rx-0 ThreadId(06) zenoh::net::routing::dispatcher::face: Face{8, b01} Close
16:12:08.667186Z DEBUG  rx-0 ThreadId(06) zenoh::net::routing::dispatcher::face: Face{9, b01} Close

For some reason, perhaps because the RX threads were busy, Face{8. b01} was not properly closed when b01 reconnected, leading to the simultaneous existence of both Face{8, b01} and Face{9, b01}.

This might be related, or have the same fix as #1876.

To reproduce

  1. Reproduce a network similar to the one given in example.
  2. Run a zenohd router on each computer with the given configuration.
  3. Run at least two z_pub on one side, modified to publish at 50Hz. Use the given configuration file with --mode peer.
  4. Run a z_sub on the other computer.
  5. Optional: Run a z_sub on the first computer, but I'm not sure if it is necessary.
  6. Disconnect B from the network.
  7. As soon as A's routers shows RX task failed: [...] expired after 10000 milliseconds, reconnect B to the network.
  8. Repeat 6 and 8 until you see the error Received router declaration with unknown routing context id on A's router. On my production setup, this happens almost every time, but with this simplified setup it happens around 1 out of 4 times.

The configuration file: conf.json

I also created a bash file to ease the reproduction if you want.

unknown-session.sh.gz

To use it:

  1. On both computers:
    • Set the interface env variable to the network interface used to transmit to the other computer.
    • Have zenohd, z_pub and z_sub in your PATH.
  2. On computer A, run ./unknown-session.sh a.
  3. On computer B, run ./unknown-session.sh b.
  4. After a few seconds, disconnect B from the network.
  5. As soon as A's routers shows RX task failed: [...] expired after 10000 milliseconds, reconnect B to the network.
  6. Repeat 3 and 4 until you see the error Received router declaration with unknown routing context id on A's router.

System info

  • Platform: Ubuntu 20.04.
  • CPU: Intel i7-9750H.
  • Zenoh: 1.3.1, with default features, in release.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions