
ricardolleite and others added 11 commits June 16, 2025 18:08
Summary
Remove use of linger-with-timeout in tse, which fixes blocking socket
operations
Details
The 5s linger time introduced in ade0d62 makes sense to avoid
TIME_WAIT, but causes shutdown/close operations to block
In an ECU, the following happens:

- ECU 1 has TCP connections with another ECU 2
- ECU 1 has pending send data
- ECU 2 stops responding entirely on that TCP connection, for whatever reason
- wait_until_sent times out after 10s because no sent data is acknowledged
- the TCP connection is closed due to the timeout, but the close blocks for 5s

That blockage occurs at a critical spot (see tse::stop), which happens
to hold 3 mutexes, one of which (connections_mutex_?) is quite important.
That eventually causes vsomeipd to block completely.
Solve this by setting a 0s linger time, which makes close non-blocking
again and forces a TCP RST on every connection close.
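A minimal Boost.Asio sketch of the idea (the function is hypothetical, not the actual vsomeip change):

```cpp
#include <boost/asio.hpp>

// With SO_LINGER enabled and a timeout of 0, close() returns immediately and
// the kernel sends a TCP RST instead of lingering on unacknowledged data.
void close_with_reset(boost::asio::ip::tcp::socket& _socket) {
    boost::system::error_code its_error;
    // linger(enabled, timeout): enabled with a 0s timeout
    _socket.set_option(boost::asio::socket_base::linger(true, 0), its_error);
    _socket.close(its_error);
}
```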
Summary
reset multicast_id_ when leaving multicast group
Details
The issue was caused by a mismatch between the multicast_id_ of received messages
and the one expected by udp_server_endpoint. This caused the log
"Didn't receive a multicast SD message for 1100 ms" to show, triggering a
leave and subsequent re-join of the multicast group. As the multicast_id_ was never reset,
it would stay in this loop forever.
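A minimal sketch of the fix; the member name follows the PR text, the surrounding struct is invented for illustration:

```cpp
#include <cstdint>

struct udp_server_endpoint_sketch {
    std::uint8_t multicast_id_ {0};

    void leave_multicast_group() {
        // ... drop the socket's multicast membership ...
        // Reset the id so messages received after the next join match the
        // expected id again, instead of re-triggering the leave/join loop.
        multicast_id_ = 0;
    }
};
```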
Summary
is_sending_ set to false when socket is not open
Details
After the STR, it is sometimes observed that some applications are unable to register again.
When send_queued is called but the socket is not open,
the log 'ltcei::send_queued: try to send while socket was not open'
appears and send_cbk is not called, leaving is_sending_ set to true.
After this point, send_cbk is no longer called, no further sends occur on the endpoint, and registration
is no longer possible.
To solve this, is_sending_ is reset when the socket is closed, allowing sends to happen again.
Changes:

- local_tcp_client_endpoint_impl::send_queued -> set is_sending_ to false if the socket is closed
- local_uds_client_endpoint_impl::send_queued -> set is_sending_ to false if the socket is closed
- tcp_client_endpoint_impl::send_queued -> set is_sending_ to false if the socket is closed
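A simplified sketch of the pattern applied in all three endpoints (not the literal diff; socket_open_ stands in for the real socket check):

```cpp
#include <mutex>

struct client_endpoint_sketch {
    std::mutex mutex_;
    bool is_sending_ {false};
    bool socket_open_ {false};  // stands in for socket_->is_open()

    void send_queued() {
        std::lock_guard<std::mutex> its_lock(mutex_);
        if (!socket_open_) {
            // Previously is_sending_ stayed true here, so send_cbk was never
            // called again and the endpoint could neither send nor register.
            is_sending_ = false;
            return;
        }
        // ... async_write(...), with send_cbk clearing is_sending_ on completion ...
    }
};
```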
Summary
fix big_payload_test_local_tcp_queue_limited
Details
As the consumer app was acting as the routing manager, it would handle all vsomeip
commands to the service. This caused some messages to be present in the queue when
the big payload message was added which, together with the previously mentioned
messages, would exceed the max queue size and not be sent. This then caused the provider
to not receive all expected messages and to be stuck in a loop.
Also added a change to add_known_client so that the order of calling config_command
and routing_info_command does not matter. This change only has an impact when the
client adds itself as a known_client and was made to safeguard its hostname.
Summary
Add SO_REUSEADDR to the non-local routing host check
Details
If there is no routing host name given, the first application becomes
the routing host. With non-local routing (e.g., IPC via TCP), there is a
race:

- an earlier vsomeip application was killed/stopped abruptly, leaving a TIME_WAIT socket
- a newer vsomeip application starts, but cannot become the routing host because of said socket

Fix this by adding SO_REUSEADDR to the non-local routing host check
This issue explains the subscribe_notify_test_* failures seen in 1
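A hedged sketch of what such a check can look like (function name and structure are assumptions, not the vsomeip code):

```cpp
#include <boost/asio.hpp>
#include <cstdint>

// Probe whether the routing port can be bound; SO_REUSEADDR lets the bind
// succeed over a TIME_WAIT socket left by an abruptly stopped application.
bool can_become_routing_host(boost::asio::io_context& _io, std::uint16_t _port) {
    boost::system::error_code its_error;
    boost::asio::ip::tcp::acceptor its_acceptor(_io);
    its_acceptor.open(boost::asio::ip::tcp::v4(), its_error);
    if (its_error)
        return false;
    its_acceptor.set_option(boost::asio::socket_base::reuse_address(true), its_error);
    its_acceptor.bind({boost::asio::ip::tcp::v4(), _port}, its_error);
    return !its_error;
}
```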
Summary
is_sending_ set to false on restart
Details
After the STR, it is sometimes observed that some applications are unable to register again.
When an application has messages in the queue and is in the process of sending them, the is_sending_ flag
is set to true. However, if a suspend is received at that moment, the following error occurs:
'rmc::handle_client_error: Client 0x6381 handles a client error(0000) with host, will reconnect'.
The reconnect() -> restart() process is triggered, but the client is unable to register again because is_sending_
remains set to true, preventing any communication from being sent after the Resume.
To resolve this issue, is_sending_ is reset in restart(), thus allowing messages to be sent again.
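A simplified sketch of the restart() part of the fix (struct and surrounding logic are illustrative):

```cpp
#include <mutex>

struct client_endpoint_restart_sketch {
    std::mutex mutex_;
    bool is_sending_ {false};

    void restart() {
        {
            std::lock_guard<std::mutex> its_lock(mutex_);
            // A send was in flight when the suspend hit; clear the stale
            // marker so the reconnected client can register again.
            is_sending_ = false;
        }
        // ... shutdown_and_close() and reconnect to the routing host ...
    }
};
```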

This expands on #689.
Details
In the context of intra-ECU communication, if a client loses the
connection to a service provider, then it should re-request service
availability triggers from the routing host for the provided services.
This would allow the client to handle the error and re-establish the
connection to the service provider; however, it does not appear that
this mechanism is working correctly.
These changes attempt to fix the issue by searching through
local_services_ for the services offered by the provider, instead of
searching through the services to which the client is subscribed.
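A hedged sketch of the new lookup; the map layout is an assumption simplified from the PR text:

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using service_t  = std::uint16_t;
using instance_t = std::uint16_t;
using client_t   = std::uint16_t;

// Collect the services offered by the disconnected provider from
// local_services_, rather than walking the client's own subscriptions.
std::vector<std::pair<service_t, instance_t>> offered_by(
        const std::map<service_t, std::map<instance_t, client_t>>& _local_services,
        client_t _provider) {
    std::vector<std::pair<service_t, instance_t>> its_result;
    for (const auto& [its_service, its_instances] : _local_services)
        for (const auto& [its_instance, its_client] : its_instances)
            if (its_client == _provider)
                its_result.emplace_back(its_service, its_instance);
    return its_result;
}
```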
Summary
Synchronization issue found during the stop process when one side of the connection closes
earlier than the other, allowing the port to be reused while there is still an active connection.
Details
This issue occurs in a very specific scenario:

- The application starts the stop process, closing connections with the routing manager, and the routing manager also starts closing connections with the application.
- The routing manager still has messages in the queue on the routing manager client -> application server connection, and waits in 10ms cycles for the queue to become empty.
- The application closes the application server -> routing manager client connection and sends an EOF to the routing manager, making port XXXXX available for new connections.
- A new application tries to connect on port XXXXX because it is free, but fails because the routing manager still has an active connection to port XXXXX.
- After the error, it enters a loop and cannot recover.

Issue 1: A port is assigned to a new application but there is still a connection to that port for
the previous application
A condition variable was implemented to notify the stop process to exit the 10ms wait if an EOF is received
in receive_cbk, since the queue is cleared in this case.

- local_tcp_client_endpoint_impl.hpp: Added bool is_stopping_ and condition variable queue_cv_.
- local_tcp_client_endpoint_impl.cpp: Updated receive_cbk to notify stop() if the stop process is in progress and an EOF is received.
- Updated stop() to wait for either 10ms or until the queue is empty, instead of just waiting 10ms.
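A hedged sketch of the mechanism (member names follow the PR text; the queue type is simplified):

```cpp
#include <chrono>
#include <condition_variable>
#include <deque>
#include <mutex>

struct endpoint_stop_sketch {
    std::mutex mutex_;
    std::condition_variable queue_cv_;
    std::deque<int> queue_;  // simplified stand-in for the send queue
    bool is_stopping_ {false};

    void stop() {
        std::unique_lock<std::mutex> its_lock(mutex_);
        is_stopping_ = true;
        // Wait at most 10ms, but leave immediately once the queue is empty,
        // e.g. because the peer's EOF cleared it.
        queue_cv_.wait_for(its_lock, std::chrono::milliseconds(10),
                           [this] { return queue_.empty(); });
        // ... shutdown_and_close the socket ...
    }

    void on_eof_received() {  // called from receive_cbk
        std::lock_guard<std::mutex> its_lock(mutex_);
        queue_.clear();
        if (is_stopping_)
            queue_cv_.notify_all();
    }
};
```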

It was possible to reproduce this issue, and with this change the socket is
closed immediately, preventing the issue from occurring.

Issue 2: The application does not recover the connection after the register timeout
client_endpoint_impl::cancel_and_connect_cbk was updated to only report the connection as successfully
terminated when cancel_and_connect_cbk is called without error. Previously, it was reporting the connection
as successfully terminated even when there was a connection error, preventing the endpoint from being
restarted.
It was possible to reproduce this issue and with this change, the issue was resolved.
After the register timeout, the client can now recover the connection.
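A hedged sketch of the Issue 2 change (signature simplified; the flag stands in for the endpoint's internal state):

```cpp
#include <boost/system/error_code.hpp>

void cancel_and_connect_cbk_sketch(const boost::system::error_code& _error,
                                   bool& _connection_terminated) {
    if (!_error) {
        _connection_terminated = true;   // clean termination, as before
    } else {
        // Previously the connection was also reported as terminated here,
        // which prevented the endpoint from being restarted.
        _connection_terminated = false;
    }
}
```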
Summary
Ensure "shutdown_and_close" is called on "stop"
Details
A call to "stop" sets the "is_stopped_" flag and cancels the running operations on the
sockets of the endpoint. In case the operations are scheduled but not yet running,
their callbacks are called with "operation_aborted" and call "shutdown_and_close",
which handles the sockets and, in case of a restart, reinitializes and starts the endpoint.
In case the operation is already running (executing its callback) at the point in time
"cancel" is called, we currently detect that the endpoint is stopped by checking the flag,
but do not call "shutdown_and_close". As "shutdown_and_close" is not triggered, the
endpoint stays stopped and also cannot be restarted by another call to "restart",
because no operation was scheduled that could be cancelled. To avoid this, this change
also calls "shutdown_and_close" on the regular path of the callbacks.
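A hedged sketch of the callback change (struct and handler are illustrative, not the vsomeip sources):

```cpp
#include <boost/asio.hpp>
#include <cstddef>

struct endpoint_cbk_sketch {
    bool is_stopped_ {false};

    void shutdown_and_close() { /* handle sockets, restart if requested */ }

    void receive_cbk(const boost::system::error_code& _error, std::size_t) {
        if (_error == boost::asio::error::operation_aborted) {
            shutdown_and_close();  // operation was cancelled before running
            return;
        }
        if (is_stopped_) {
            // New: the callback was already running when "cancel" was called,
            // so trigger shutdown_and_close on the regular path as well.
            shutdown_and_close();
            return;
        }
        // ... process received data and schedule the next receive ...
    }
};
```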
Summary
This reverts commit 52e1bef.
Details
Issue description
In short, it is related to the missing initial notification we usually send on internal communication
(routing_manager_base::notify_one_current_value) after a subscription has been accepted.
At some point, coming from Suspend and after entering Resume mode (in the middle of LC),
the subscription seems to still exist. We suspect routing_manager_base::insert_subscription
was returning false, i.e., the insertion fails (see event::add_subscriber).
As a result, the notification won't be sent.
Expected sequence of events

1. Provider receives a subscribe request from a local client (routing_manager_client::on_message)
2. Provider accepts the subscription
   https://github.com/COVESA/vsomeip/blob/932a88a841771cdcab85b15a21fcaefdf0403d4c/implementation/routing/src/routing_manager_client.cpp#L1573
3. Provider attempts to insert the client_ into eventgroups_
   https://github.com/COVESA/vsomeip/blob/932a88a841771cdcab85b15a21fcaefdf0403d4c/implementation/routing/src/routing_manager_base.cpp#L794
   https://github.com/COVESA/vsomeip/blob/932a88a841771cdcab85b15a21fcaefdf0403d4c/implementation/routing/src/event.cpp#L619
4. Provider does not send any notification

So it seems that, for some reason, ::unsubscribe was not executed during ::remove_local.
That is the reason for moving the unsubscribe back to the io context instead of the
dispatcher. The assumption is that the dispatcher is at some point slower, preventing
::unsubscribe from being executed.
https://github.com/COVESA/vsomeip/blob/932a88a841771cdcab85b15a21fcaefdf0403d4c/implementation/routing/src/routing_manager_base.cpp#L1209
Changes
This PR reverts 52e1bef, which changed the operations to run on the
Dispatcher's context instead of IO.
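A hedged sketch of the revert's effect (names are assumptions; the real call sites differ):

```cpp
#include <boost/asio.hpp>
#include <functional>
#include <utility>

// Post the unsubscribe work to the io context again, instead of queuing it
// behind potentially slow dispatcher handlers.
void schedule_unsubscribe(boost::asio::io_context& _io,
                          std::function<void()> _unsubscribe) {
    boost::asio::post(_io, std::move(_unsubscribe));
}
```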
Summary
Ensuring the lifetime of message_buffer_ptr
Details
The backtrace showed a nullptr shared_ptr among the bound arguments
(_M_bound_args) passed into the callable object (send_cbk).
```
std::_Head_base<3, std::shared_ptr<std::vector<unsigned char, std::allocator<unsigned char> > >, false>> = {
 _M_head_impl = {
   std::__shared_ptr<std::vector<unsigned char, std::allocator<unsigned char> >, ...> = {
     std::__shared_ptr_access<std::vector<unsigned char, std::allocator<unsigned char> >, ...> = { },
     _M_ptr = 0x0,
     _M_refcount = {
       _M_pi = 0x0
     }
   }
 }
}

// key part
_M_ptr = 0x0
_M_refcount = {_M_pi = 0x0}
```
This PR attempts to capture the shared_ptr in the completion handler to ensure
that the object stays alive during the async operation.
By explicitly capturing the shared_ptr, its reference count is guaranteed to be
incremented (keeping the underlying object alive). This should prevent the
shared_ptr from going out of scope and being destroyed while the async operation
is still pending.
Even if the original shared_ptr is destroyed, the callable object can still
access the message_buffer_t.
The std::bind usage in the async_write call in local_uds_client_endpoint_impl::send_queued
was replaced with a lambda.
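A hedged sketch of the replacement (signatures simplified): the completion handler is a lambda capturing the buffer's shared_ptr by value, so the reference count keeps the buffer alive until the write completes:

```cpp
#include <boost/asio.hpp>
#include <cstdint>
#include <memory>
#include <vector>

using message_buffer_t = std::vector<std::uint8_t>;

void send_queued_sketch(boost::asio::local::stream_protocol::socket& _socket,
                        std::shared_ptr<message_buffer_t> _buffer) {
    boost::asio::async_write(
        _socket, boost::asio::buffer(*_buffer),
        [_buffer](const boost::system::error_code& _error, std::size_t _bytes) {
            // _buffer is held by the capture; it cannot be null here even if
            // the caller's shared_ptr was destroyed in the meantime.
            // ... forward to send_cbk(_error, _bytes) ...
            (void)_error;
            (void)_bytes;
        });
}
```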
duartenfonseca merged commit 384f94b into master Jun 17, 2025 (2 checks passed).