
ricardolleite and others added 11 commits June 16, 2025 18:08
Summary
Remove use of linger-with-timeout in tse, which fixes blocking socket
operations
Details
The 5s linger time introduced in ade0d62 makes sense to avoid
TIME_WAIT, but causes shutdown/close operations to block
In an ECU, the following happens:

- ECU 1 has TCP connections with another ECU 2
- ECU 1 has pending send data
- ECU 2 stops responding entirely on that TCP connection, for whatever reason
- wait_until_sent times out after 10s because no sent data is acknowledged
- the TCP connection is closed due to the timeout, but the close blocks for 5s

That blockage occurs at a critical spot (see tse::stop), which happens
to hold 3 mutexes, one of which (connections_mutex_?) is quite important.
That eventually causes vsomeipd to block completely.
Solve this by setting a 0s linger time, which makes close non-blocking
again and forces a TCP RST on every connection close.
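A minimal Boost.Asio sketch of the idea (the function is hypothetical, not the actual vsomeip change):

```cpp
#include <boost/asio.hpp>

// With SO_LINGER enabled and a timeout of 0, close() returns immediately and
// the kernel sends a TCP RST instead of lingering on unacknowledged data.
void close_with_reset(boost::asio::ip::tcp::socket& _socket) {
    boost::system::error_code its_error;
    // linger(enabled, timeout): enabled with a 0s timeout
    _socket.set_option(boost::asio::socket_base::linger(true, 0), its_error);
    _socket.close(its_error);
}
```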
Summary
reset multicast_id_ when leaving multicast group
Details
The issue was caused by a mismatch between the multicast_id_ of received messages
and the one expected by udp_server_endpoint. This caused the log
"Didn't receive a multicast SD message for 1100 ms" to show, triggering a
leave and subsequent re-join of the multicast group. As the multicast_id_ was never reset,
it would stay in this loop forever.
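A minimal sketch of the fix; the member name follows the PR text, the surrounding struct is invented for illustration:

```cpp
#include <cstdint>

struct udp_server_endpoint_sketch {
    std::uint8_t multicast_id_ {0};

    void leave_multicast_group() {
        // ... drop the socket's multicast membership ...
        // Reset the id so messages received after the next join match the
        // expected id again, instead of re-triggering the leave/join loop.
        multicast_id_ = 0;
    }
};
```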
Summary
is_sending_ set to false when socket is not open
Details
After the STR, it is sometimes observed that some applications are unable to register again.
When send_queued is called but the socket is not open,
the log 'ltcei::send_queued: try to send while socket was not open'
appears and send_cbk is not called, leaving is_sending_ set to true.
After this point, send_cbk is no longer called, no further sends occur on the endpoint, and registration
is no longer possible.
To solve this, is_sending_ is reset when the socket is closed, allowing sends to happen again.
Changes:

- local_tcp_client_endpoint_impl::send_queued -> set is_sending_ to false if the socket is closed
- local_uds_client_endpoint_impl::send_queued -> set is_sending_ to false if the socket is closed
- tcp_client_endpoint_impl::send_queued -> set is_sending_ to false if the socket is closed
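A simplified sketch of the pattern applied in all three endpoints (not the literal diff; socket_open_ stands in for the real socket check):

```cpp
#include <mutex>

struct client_endpoint_sketch {
    std::mutex mutex_;
    bool is_sending_ {false};
    bool socket_open_ {false};  // stands in for socket_->is_open()

    void send_queued() {
        std::lock_guard<std::mutex> its_lock(mutex_);
        if (!socket_open_) {
            // Previously is_sending_ stayed true here, so send_cbk was never
            // called again and the endpoint could neither send nor register.
            is_sending_ = false;
            return;
        }
        // ... async_write(...), with send_cbk clearing is_sending_ on completion ...
    }
};
```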
Summary
fix big_payload_test_local_tcp_queue_limited
Details
As the consumer app was acting as the routing manager, it would handle all vsomeip
commands to the service. This caused some messages to be present in the queue when
the big payload message was added which, together with the previously mentioned
messages, would exceed the max queue size and not be sent. This then caused the provider
to not receive all expected messages and to be stuck in a loop.
Also added a change to add_known_client so that the order of calling config_command
and routing_info_command does not matter. This change only has an impact when the
client adds itself as a known_client and was made to safeguard its hostname.
Summary
Add SO_REUSEADDR to the non-local routing host check
Details
If there is no routing host name given, the first application becomes
the routing host. With non-local routing (e.g., IPC via TCP), there is a
race:

- an earlier vsomeip application was killed/stopped abruptly, leaving a TIME_WAIT socket
- a newer vsomeip application starts, but cannot become the routing host because of said socket

Fix this by adding SO_REUSEADDR to the non-local routing host check
This issue explains the subscribe_notify_test_* failures seen in 1
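A hedged sketch of what such a check can look like (function name and structure are assumptions, not the vsomeip code):

```cpp
#include <boost/asio.hpp>
#include <cstdint>

// Probe whether the routing port can be bound; SO_REUSEADDR lets the bind
// succeed over a TIME_WAIT socket left by an abruptly stopped application.
bool can_become_routing_host(boost::asio::io_context& _io, std::uint16_t _port) {
    boost::system::error_code its_error;
    boost::asio::ip::tcp::acceptor its_acceptor(_io);
    its_acceptor.open(boost::asio::ip::tcp::v4(), its_error);
    if (its_error)
        return false;
    its_acceptor.set_option(boost::asio::socket_base::reuse_address(true), its_error);
    its_acceptor.bind({boost::asio::ip::tcp::v4(), _port}, its_error);
    return !its_error;
}
```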
Summary
is_sending_ set to false on restart
Details
After the STR, it is sometimes observed that some applications are unable to register again.
When an application has messages in the queue and is in the process of sending them, the is_sending_ flag
is set to true. However, if a suspend is received at that moment, the following error occurs:
'rmc::handle_client_error: Client 0x6381 handles a client error(0000) with host, will reconnect'.
The reconnect() -> restart() process is triggered, but the client is unable to register again because is_sending_
remains set to true, preventing any communication from being sent after the Resume.
To resolve this issue, is_sending_ is reset in restart(), thus allowing messages to be sent again.
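A simplified sketch of the restart() part of the fix (struct and surrounding logic are illustrative):

```cpp
#include <mutex>

struct client_endpoint_restart_sketch {
    std::mutex mutex_;
    bool is_sending_ {false};

    void restart() {
        {
            std::lock_guard<std::mutex> its_lock(mutex_);
            // A send was in flight when the suspend hit; clear the stale
            // marker so the reconnected client can register again.
            is_sending_ = false;
        }
        // ... shutdown_and_close() and reconnect to the routing host ...
    }
};
```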

This expands on #689.
Details
In the context of intra-ECU communication, if a client loses the
connection to a service provider, then it should re-request service
availability triggers from the routing host for the provided services.
This would allow the client to handle the error and re-establish the
connection to the service provider; however, it does not appear that
this mechanism is working correctly.
These changes attempt to fix the issue by searching through
local_services_ for the services offered by the provider, instead of
searching through the services to which the client is subscribed.
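A hedged sketch of the new lookup; the map layout is an assumption simplified from the PR text:

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using service_t  = std::uint16_t;
using instance_t = std::uint16_t;
using client_t   = std::uint16_t;

// Collect the services offered by the disconnected provider from
// local_services_, rather than walking the client's own subscriptions.
std::vector<std::pair<service_t, instance_t>> offered_by(
        const std::map<service_t, std::map<instance_t, client_t>>& _local_services,
        client_t _provider) {
    std::vector<std::pair<service_t, instance_t>> its_result;
    for (const auto& [its_service, its_instances] : _local_services)
        for (const auto& [its_instance, its_client] : its_instances)
            if (its_client == _provider)
                its_result.emplace_back(its_service, its_instance);
    return its_result;
}
```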
Summary
Synchronization issue found during the stop process when one side of the connection closes
earlier than the other, allowing the port to be reused while there is still an active connection.
Details
This issue occurs in a very specific scenario:

- The application starts the stop process, closing connections with the routing manager, and the routing manager also starts closing connections with the application.
- The routing manager still has messages in the queue on the routing manager client -> application server connection, and waits in 10ms cycles for the queue to become empty.
- The application closes the application server -> routing manager client connection and sends an EOF to the routing manager, making port XXXXX available for new connections.
- A new application tries to connect on port XXXXX because it is free, but fails because the routing manager still has an active connection to port XXXXX.
- After the error, it enters a loop and cannot recover.

Issue 1: A port is assigned to a new application but there is still a connection to that port for
the previous application
A condition variable was implemented to notify the stop process to exit the 10ms wait if an EOF is received
in receive_cbk, since the queue is cleared in this case.

- local_tcp_client_endpoint_impl.hpp: Added bool is_stopping_ and condition variable queue_cv_.
- local_tcp_client_endpoint_impl.cpp: Updated receive_cbk to notify stop() if the stop process is in progress and an EOF is received.
- Updated stop() to wait for either 10ms or until the queue is empty, instead of just waiting 10ms.
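A hedged sketch of the mechanism (member names follow the PR text; the queue type is simplified):

```cpp
#include <chrono>
#include <condition_variable>
#include <deque>
#include <mutex>

struct endpoint_stop_sketch {
    std::mutex mutex_;
    std::condition_variable queue_cv_;
    std::deque<int> queue_;  // simplified stand-in for the send queue
    bool is_stopping_ {false};

    void stop() {
        std::unique_lock<std::mutex> its_lock(mutex_);
        is_stopping_ = true;
        // Wait at most 10ms, but leave immediately once the queue is empty,
        // e.g. because the peer's EOF cleared it.
        queue_cv_.wait_for(its_lock, std::chrono::milliseconds(10),
                           [this] { return queue_.empty(); });
        // ... shutdown_and_close the socket ...
    }

    void on_eof_received() {  // called from receive_cbk
        std::lock_guard<std::mutex> its_lock(mutex_);
        queue_.clear();
        if (is_stopping_)
            queue_cv_.notify_all();
    }
};
```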

It was possible to reproduce this issue, and with this change the socket is
closed immediately, preventing the issue from occurring.

Issue 2: The application does not recover the connection after the register timeout
client_endpoint_impl::cancel_and_connect_cbk was updated to only report the connection as successfully
terminated when cancel_and_connect_cbk is called without error. Previously, it was reporting the connection
as successfully terminated even when there was a connection error, preventing the endpoint from being
restarted.
It was possible to reproduce this issue and with this change, the issue was resolved.
After the register timeout, the client can now recover the connection.
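A hedged sketch of the Issue 2 change (signature simplified; the flag stands in for the endpoint's internal state):

```cpp
#include <boost/system/error_code.hpp>

void cancel_and_connect_cbk_sketch(const boost::system::error_code& _error,
                                   bool& _connection_terminated) {
    if (!_error) {
        _connection_terminated = true;   // clean termination, as before
    } else {
        // Previously the connection was also reported as terminated here,
        // which prevented the endpoint from being restarted.
        _connection_terminated = false;
    }
}
```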
Summary
Ensure "shutdown_and_close" is called on "stop"
Details
A call to "stop" sets the "is_stopped_" flag and cancels the running operations on the
sockets of the endpoint. In case the operations are scheduled but not yet running,
their callbacks are called with "operation_aborted" and call "shutdown_and_close",
which handles the sockets and, in case of a restart, reinitializes and starts the endpoint.
In case the operation is already running (executing its callback) at the point in time
"cancel" is called, we currently detect that the endpoint is stopped by checking the flag,
but do not call "shutdown_and_close". As "shutdown_and_close" is not triggered, the
endpoint stays stopped and also cannot be restarted by another call to "restart",
because no operation was scheduled that could be cancelled. To avoid this, this change
also calls "shutdown_and_close" on the regular path of the callbacks.
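A hedged sketch of the callback change (struct and handler are illustrative, not the vsomeip sources):

```cpp
#include <boost/asio.hpp>
#include <cstddef>

struct endpoint_cbk_sketch {
    bool is_stopped_ {false};

    void shutdown_and_close() { /* handle sockets, restart if requested */ }

    void receive_cbk(const boost::system::error_code& _error, std::size_t) {
        if (_error == boost::asio::error::operation_aborted) {
            shutdown_and_close();  // operation was cancelled before running
            return;
        }
        if (is_stopped_) {
            // New: the callback was already running when "cancel" was called,
            // so trigger shutdown_and_close on the regular path as well.
            shutdown_and_close();
            return;
        }
        // ... process received data and schedule the next receive ...
    }
};
```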
Summary
This reverts commit 52e1bef.
Details
Issue description
In short, it is related to the missing initial notification we usually send on internal communication
(routing_manager_base::notify_one_current_value) after a subscription has been accepted.
At some point, coming from Suspend and after entering Resume mode (in the middle of LC),
the subscription seems to still exist. We suspect routing_manager_base::insert_subscription
was returning false, i.e., the insertion fails (see event::add_subscriber).
As a result, the notification won't be sent.
Expected sequence of events

1. Provider receives a subscribe request from a local client (routing_manager_client::on_message)
2. Provider accepts the subscription
   https://github.com/COVESA/vsomeip/blob/932a88a841771cdcab85b15a21fcaefdf0403d4c/implementation/routing/src/routing_manager_client.cpp#L1573
3. Provider attempts to insert the client_ into eventgroups_
   https://github.com/COVESA/vsomeip/blob/932a88a841771cdcab85b15a21fcaefdf0403d4c/implementation/routing/src/routing_manager_base.cpp#L794
   https://github.com/COVESA/vsomeip/blob/932a88a841771cdcab85b15a21fcaefdf0403d4c/implementation/routing/src/event.cpp#L619
4. Provider does not send any notification

So it seems that, for some reason, ::unsubscribe was not executed during ::remove_local.
That is the reason for moving the unsubscribe back to the io context instead of the
dispatcher. The assumption is that the dispatcher is at some point slower, preventing
::unsubscribe from being executed.
https://github.com/COVESA/vsomeip/blob/932a88a841771cdcab85b15a21fcaefdf0403d4c/implementation/routing/src/routing_manager_base.cpp#L1209
Changes
This PR reverts 52e1bef, which changed the operations to run on the
Dispatcher's context instead of IO.
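A hedged sketch of the revert's effect (names are assumptions; the real call sites differ):

```cpp
#include <boost/asio.hpp>
#include <functional>
#include <utility>

// Post the unsubscribe work to the io context again, instead of queuing it
// behind potentially slow dispatcher handlers.
void schedule_unsubscribe(boost::asio::io_context& _io,
                          std::function<void()> _unsubscribe) {
    boost::asio::post(_io, std::move(_unsubscribe));
}
```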
Summary
Ensuring the lifetime of message_buffer_ptr
Details
The backtrace showed a nullptr shared_ptr among the bound arguments
(_M_bound_args) passed into the callable object (send_cbk).
```
std::_Head_base<3, std::shared_ptr<std::vector<unsigned char, std::allocator<unsigned char> > >, false>> = {
 _M_head_impl = {
   std::__shared_ptr<std::vector<unsigned char, std::allocator<unsigned char> >, ...> = {
     std::__shared_ptr_access<std::vector<unsigned char, std::allocator<unsigned char> >, ...> = { },
     _M_ptr = 0x0,
     _M_refcount = {
       _M_pi = 0x0
     }
   }
 }
}

// key part
_M_ptr = 0x0
_M_refcount = {_M_pi = 0x0}
```
This PR attempts to capture the shared_ptr in the completion handler to ensure
that the object stays alive during the async operation.
By explicitly capturing the shared_ptr, its reference count is guaranteed to be
incremented (keeping the underlying object alive). This should prevent the
shared_ptr from going out of scope and being destroyed while the async operation
is still pending.
Even if the original shared_ptr is destroyed, the callable object can still
access the message_buffer_t.
The std::bind usage in the async_write call in local_uds_client_endpoint_impl::send_queued
was replaced with a lambda.
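A hedged sketch of the replacement (signatures simplified): the completion handler is a lambda capturing the buffer's shared_ptr by value, so the reference count keeps the buffer alive until the write completes:

```cpp
#include <boost/asio.hpp>
#include <cstdint>
#include <memory>
#include <vector>

using message_buffer_t = std::vector<std::uint8_t>;

void send_queued_sketch(boost::asio::local::stream_protocol::socket& _socket,
                        std::shared_ptr<message_buffer_t> _buffer) {
    boost::asio::async_write(
        _socket, boost::asio::buffer(*_buffer),
        [_buffer](const boost::system::error_code& _error, std::size_t _bytes) {
            // _buffer is held by the capture; it cannot be null here even if
            // the caller's shared_ptr was destroyed in the meantime.
            // ... forward to send_cbk(_error, _bytes) ...
            (void)_error;
            (void)_bytes;
        });
}
```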
duartenfonseca merged commit 384f94b into master Jun 17, 2025 (2 checks passed).