3.5.6 new commits #903

duartenfonseca · 2025-05-22T14:49:28Z

No description provided.

Summary Do not call shutdown_and_close_socket if socket state is connecting Details If the socket is already reconnecting, a failed send attempt could trigger a second reconnect attempt at the same time.

Summary Convert pending_sd_offers_ from vector to set Details Issue: While the external interface is off or multicast is not set, if the internal interface is toggled and services are re-offered, pending_sd_offers_ continues to grow with repeated service entries.
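To illustrate why a set stops the growth, here is a minimal sketch. The type aliases and the helper function names are hypothetical and are not taken from the vsomeip sources; only the vector-versus-set behaviour is the point.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Hypothetical aliases for illustration; the real vsomeip types may differ.
using service_t = std::uint16_t;
using instance_t = std::uint16_t;
using offer_key = std::pair<service_t, instance_t>;

// Re-offer the same service three times, as repeated interface toggles would.
inline std::size_t pending_count_with_vector() {
    std::vector<offer_key> pending;
    for (int i = 0; i < 3; ++i) {
        pending.emplace_back(0x1234, 0x0001); // every re-offer adds an entry
    }
    return pending.size();
}

inline std::size_t pending_count_with_set() {
    std::set<offer_key> pending;
    for (int i = 0; i < 3; ++i) {
        pending.emplace(0x1234, 0x0001); // duplicates are silently rejected
    }
    return pending.size();
}
```

With the vector, the container holds one entry per re-offer; with the set, it holds one entry per distinct service/instance pair.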

Summary Lock requests_to_debounce_mutex_ when setting request_debounce_timer_running_ to false Details Fix an issue introduced in 761c052, where request_debounce_timer_running_ was reset without being protected by requests_to_debounce_mutex_. Between releasing the mutexes and resetting request_debounce_timer_running_, new requests could be added to requests_to_debounce_. These requests would then be left unprocessed until another entry was added to requests_to_debounce_.
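The pattern behind this fix can be sketched as follows. This is a simplified, assumption-laden model: the class `request_debouncer`, its members, and the use of `int` as a request type are all hypothetical, and the real timer handling is omitted. It only shows the flag being reset inside the same critical section that drains the pending requests.

```cpp
#include <mutex>
#include <vector>

// Hypothetical model of a debounce queue; not the vsomeip implementation.
class request_debouncer {
public:
    // Returns true when the caller has to start the debounce timer.
    bool add(int request) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push_back(request);
        if (timer_running_) {
            return false; // a running timer will pick this request up
        }
        timer_running_ = true;
        return true;
    }

    // Timer callback: drain the batch and clear the flag under ONE lock,
    // so no request can slip in between the two steps.
    std::vector<int> on_timer_expired() {
        std::lock_guard<std::mutex> lock(mutex_);
        std::vector<int> batch;
        batch.swap(pending_);
        timer_running_ = false;
        return batch;
    }

private:
    std::mutex mutex_;
    std::vector<int> pending_;
    bool timer_running_ = false;
};
```

If the flag were cleared after the lock had been released, an `add()` in that window would see `timer_running_ == true`, skip starting a timer, and leave its request stranded.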

Summary Fix REMOTE_ERROR on response Details Increase mutex_ scope to include the was_not_connected flag on connect. Reset the is_sending flag when failing to trigger send_queued. Reset the was_not_connected_ flag when send_queue checks that the socket was not open.

Summary Lock registration_state_mutex_ before sending subscribes and unsubscribes Details Without locking registration_state_mutex_, a subscription might not be sent if the client was still in the registration process and state_ had the value ST_DEREGISTERED. The same mutex is now also locked in the unsubscribe function.

Summary Remove the get_mutex_ from runtime_impl. Details On exit, a global static mutex can be destroyed before the static runtime_impl instance, causing tombstones/crashes when an io thread triggers the termination of an application. This mutex is not needed, as the static initialization of the shared_ptr is atomic. Valgrind's Helgrind reports this as a data race because it does not consider atomic operations that are not POSIX. From https://valgrind.org/docs/manual/hg-manual.html#hg-manual.effective-use: "Do not roll your own threading primitives (mutexes, etc) from combinations of the Linux futex syscall, atomic counters, etc."
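As a rough illustration of why no extra mutex is needed, the sketch below relies on the C++11 guarantee that a function-local static is initialized exactly once, even with concurrent callers. The class name `runtime_sketch` is hypothetical, and whether runtime_impl uses a function-local static in exactly this form is an assumption.

```cpp
#include <memory>

// Hypothetical stand-in for a singleton-style runtime class.
class runtime_sketch {
public:
    static std::shared_ptr<runtime_sketch> get() {
        // Since C++11, initialization of a function-local static is
        // thread-safe: concurrent callers block until it has completed.
        static std::shared_ptr<runtime_sketch> the_runtime =
            std::make_shared<runtime_sketch>();
        return the_runtime;
    }
};
```

Every call returns a copy of the same shared_ptr, and there is no separate global mutex whose destruction order relative to the instance could matter at exit.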

Summary Make data lookups more efficient on the common hot paths. Details Replace many instances of nested std::map<std::map< used for representing a service_t / instance_t pair with a single std::unordered_map with packed arguments. This reduces lookup cost from two O(log(n)) tree lookups to a single hash lookup that is O(1) on average. This functionality is wrapped in a new service_instance_map type.
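A minimal sketch of the packed-key idea, assuming service_t and instance_t are both 16-bit values. The helper `pack` and the class `service_instance_map_sketch` are hypothetical names for illustration and do not reproduce the actual service_instance_map interface.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Assumed 16-bit identifiers; the aliases are illustrative.
using service_t = std::uint16_t;
using instance_t = std::uint16_t;

// Pack both identifiers into one 32-bit key: service in the high half.
constexpr std::uint32_t pack(service_t service, instance_t instance) {
    return (static_cast<std::uint32_t>(service) << 16) | instance;
}

template <typename V>
class service_instance_map_sketch {
public:
    void insert(service_t service, instance_t instance, V value) {
        data_[pack(service, instance)] = std::move(value);
    }

    // Returns nullptr when the pair is unknown.
    const V* find(service_t service, instance_t instance) const {
        auto it = data_.find(pack(service, instance));
        return it == data_.end() ? nullptr : &it->second;
    }

    std::size_t size() const { return data_.size(); }

private:
    std::unordered_map<std::uint32_t, V> data_; // one hash lookup per query
};
```

A single hash lookup replaces the outer-map lookup followed by the inner-map lookup, at the cost of losing the sorted iteration order that std::map provides.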

Summary Add config for wait_route_netlink_notification Details The root cause is still unclear. The only thing known is that, for certain ECUs, when the external interface is toggled mid-STR, vsomeip-lib gets stuck in the DELAYED_RESUME state. This state is only exited to RESUME once the vsomeip-lib netlink socket has received the LINK_UP and NEW ROUTE notifications. This PR adds a configuration option for the netlink connector to not reset the sd_route_set_ flag once the interface goes down. This bypasses the wait for the NEW ROUTE notification.

Summary Implemented config_command in handle_requests Details Implemented the config_command in handle_requests to fix NSM logs with an empty "Container ID" string. The container ID was empty because the config_command was not received by the application (service), only between the clients and the daemon, which did not provide the container IDs to the application. To resolve the issue and avoid the problems caused by dd25ea1, the config_command was implemented to be sent from the daemon to the application (service), ensuring that the config_command message is only sent after the routing_info is sent, thereby guaranteeing that the routing_info is processed first.

Summary Stop find/offer debounce timers when the service discovery is stopped Details In suspend, the service discovery module is stopped. In this state, no SOME/IP-SD messages should be sent to the network. However, FindService and StopOffer messages are still sent while the service discovery is suspended. The cause is the debounce timers: they are not stopped together with the service discovery and can therefore still trigger the sending of messages. This PR stops the debounce timers together with all other timers when the service discovery is stopped.

Summary Enable global configuration of request debounce time

Summary Subscription handlers shall run within dispatcher contexts, not within io contexts Details Subscription handlers are currently executed within io contexts. This must be changed to ensure applications cannot block the io threads. Therefore, the handler is encapsulated in a sync_handler so that it can be dispatched by the dispatcher threads.
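The general hand-off can be sketched with a small queue. This is not the sync_handler mechanism itself; `dispatcher_queue_sketch` and its methods are hypothetical and only show the idea that the io side enqueues work while a dispatcher thread executes it.

```cpp
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical hand-off queue between io threads and dispatcher threads.
class dispatcher_queue_sketch {
public:
    // Called from an io thread: only enqueues, never runs application code.
    void post(std::function<void()> handler) {
        std::lock_guard<std::mutex> lock(mutex_);
        handlers_.push_back(std::move(handler));
    }

    // Called from a dispatcher thread: runs one queued handler, if any.
    bool run_one() {
        std::function<void()> handler;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (handlers_.empty()) {
                return false;
            }
            handler = std::move(handlers_.front());
            handlers_.pop_front();
        }
        handler(); // executed outside the lock, so it may block safely
        return true;
    }

private:
    std::mutex mutex_;
    std::deque<std::function<void()>> handlers_;
};
```

A slow or blocking subscription handler then stalls only the dispatcher thread that picked it up, while the io thread returns immediately after `post()`.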

Summary Support multiple IPsec activation files per connection Details As we neither want additional dependencies nor different versions, we need to add the ability to define and check multiple activation files per connection.

Summary Address several issues in the UDP server endpoints, mainly ones causing BAD FILE DESCRIPTORS and problems when joining multicast.

Details

ISSUE 1: The initial issue was detected in DLT logs showing a bad file descriptor in the send_cbk of the uds_server_endpoint, for the multicast addresses.

Cause: The issue was introduced in vsomeip-lib version 3.5.5 by cf2d68d. The main change was the reintroduction of the server_endpoint restart within endpoint_manager_impl::resume, plus the introduction of the endpoint stop() within the restart function. The changes were necessary to clear the endpoint buffers, so that the new lifecycle would not carry messages, such as subscribes, from previous lifecycles. In past versions this step was done in the suspend function instead of resume. However, certain environments would run into sockets being stuck in accept4 (high CPU load), because the suspend would happen at a non-deterministic time, leaving sockets in an unknown state.

Problem: The main problem introduced was the stopping of the UDP server endpoint. Unlike TCP_server::stop (shutdown + close), the UDP stop is asynchronous: it only calls cancel on the unicast and multicast sockets. The stop itself happens later, when the on_unicast_receive() / on_multicast_receive() callbacks receive the operation_aborted error with the is_stopped flag set. This caused a data race between the stop performed within the receive_cbk operation and the start() called in the restart function, leaving closed sockets after start. In the end, any send on those sockets results in a bad_file_descriptor error.

Solution: A state machine was created to cascade stop -> init + start. The goal is to ensure that the udp_server endpoint waits for both the unicast and the multicast stop; only after both are stopped, and the restarting flag is set, are init and start called.

ISSUE 2: The previous solution exposed another issue: the join/leave of the multicast group was now being done by 2 threads. The resume of the routing_manager_impl has 2 main steps:

    ep_mgr_->resume() // resume all endpoints
        client endpoint->restart() // UDP + TCP
        server endpoint->restart() // UDP + TCP
            For UDP:
                stop() = socket->cancel()
                    on_unicast_receive() operation_abort
                    on_multicast_receive() operation_abort
                init()
                    add multicast option // this step is async and will start the multicast_receives
                start() // this step is sync and will start the unicast_receives
    discovery_->start()
        sd_endpoint->join() // just like the add multicast option, this step is async

Problem: Both discovery::start and ep_mgr::resume now race for the multicast join.

Solution: The two paths had to be synchronized.

ISSUE 3: If the multicast join failed, nothing was done.

Problem: The multicast join has many steps, each of which can fail. The implementation has always been to ignore the failure and return. We believe this only became a problem now because of the introduction of stop() within udp_server::restart. (Still needs more investigation.)

Solution: If the join fails, the endpoint_manager now queues a leave + join again and keeps retrying. (This can still lead to an infinite loop, but it is not clear what should be done otherwise.)
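The stop -> init + start cascade described under ISSUE 1 can be sketched as a small state machine. This is a single-threaded illustration under stated assumptions: the class `udp_restart_sketch`, its method names, and the injected callback are hypothetical, and the synchronization a real endpoint needs is left out.

```cpp
#include <functional>
#include <utility>

// Hypothetical model: init + start runs only after BOTH receive paths
// have reported that their socket operation was aborted.
class udp_restart_sketch {
public:
    explicit udp_restart_sketch(std::function<void()> init_and_start)
        : init_and_start_(std::move(init_and_start)) {}

    // Request a restart; the real code would cancel both sockets here.
    void restart() {
        restarting_ = true;
        unicast_stopped_ = false;
        multicast_stopped_ = false;
    }

    // Invoked when the unicast receive completes with operation_aborted.
    void on_unicast_aborted() {
        unicast_stopped_ = true;
        maybe_start();
    }

    // Invoked when the multicast receive completes with operation_aborted.
    void on_multicast_aborted() {
        multicast_stopped_ = true;
        maybe_start();
    }

private:
    void maybe_start() {
        if (restarting_ && unicast_stopped_ && multicast_stopped_) {
            restarting_ = false; // make sure init + start runs only once
            init_and_start_();
        }
    }

    std::function<void()> init_and_start_;
    bool restarting_ = false;
    bool unicast_stopped_ = false;
    bool multicast_stopped_ = false;
};
```

Because start is gated on both abort notifications, it can no longer overlap with a stop that is still in flight on one of the sockets.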

Summary Allow daemon to send ADD_CLIENT to already connected clients. Details During STR, connected clients might handle partner client errors caused by loss of connectivity (TCP keep-alive). When handling such an error, the client removes the partner from its list of known clients. On resume, these clients receive the service availability, subscriptions, and so on, but do not respond, because they cannot create TCP sockets to communicate with a client that is not in their known client list. To be in the known client list, an application needs to receive an ADD_CLIENT routing info command from the daemon when an application either requests or registers events. The daemon did not send the message because both clients were already connected according to the host connection matrix map. To fix this issue, this PR proposes removing the is_connected check when handling requests or offers. Without it, applications are always notified, independent of the connection matrix state.

Rui Graça and others added 15 commits May 22, 2025 15:40

Dont call shutdown_and_close if socket is connecting

71bf94f

Fix REMOTE_ERROR on response

17f8858

Global request debounce time

ae1dfe4

Support multiple IPsec activation files per connection

213d608

duartenfonseca force-pushed the 3.5.6_new_commits branch from 1db88ad to 9074391 Compare May 22, 2025 15:11

duartenfonseca merged commit 932a88a into master May 22, 2025
2 checks passed
