@duartenfonseca
Collaborator

No description provided.

Rui Graça and others added 15 commits May 22, 2025 15:40
Summary
Do not call shutdown_and_close_socket if socket state is connecting
Details
If the socket is reconnecting, a failed attempt to send could cause two reconnect
attempts to run at the same time.
Summary
Convert pending_sd_offers_ from vector to set
Details
Issue:
While the external interface is off or multicast is not set, if the internal interface is toggled and
services are re-offered, pending_sd_offers_ keeps growing with repeated service entries.
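The effect of the container change can be sketched as follows. This is only an illustration: the type aliases, the (service, instance) key, and the helper functions are hypothetical stand-ins, not the actual vsomeip declarations.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the vsomeip identifier types.
using service_t = std::uint16_t;
using instance_t = std::uint16_t;
using offer_key = std::pair<service_t, instance_t>;

// The same service is re-offered `times` times while offers cannot be sent.
inline std::size_t pending_with_vector(int times) {
    const offer_key key{service_t{0x1234}, instance_t{0x0001}};
    std::vector<offer_key> pending;
    for (int i = 0; i < times; ++i)
        pending.push_back(key); // every re-offer adds another duplicate
    return pending.size();
}

inline std::size_t pending_with_set(int times) {
    const offer_key key{service_t{0x1234}, instance_t{0x0001}};
    std::set<offer_key> pending;
    for (int i = 0; i < times; ++i)
        pending.insert(key); // inserting an existing key leaves the set unchanged
    return pending.size();
}
```

With the vector, the pending list grows with each re-offer; with the set, it stays at one entry per distinct service.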
Summary
Lock requests_to_debounce_mutex_ when setting request_debounce_timer_running_ to false
Details
Fix issue introduced in 761c052,
where request_debounce_timer_running_ was being reset
without being protected by requests_to_debounce_mutex_.
This could lead to a situation where, between releasing the mutexes
and resetting request_debounce_timer_running_, new requests
could be added to requests_to_debounce_.
These requests would then be left unprocessed until a new entry was
added to requests_to_debounce_.
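A minimal sketch of the locking pattern, assuming a simplified debouncer. The class and its members are hypothetical and only mirror the idea that the queue and the "timer running" flag are read and written under the same mutex.

```cpp
#include <mutex>
#include <vector>

// Hypothetical, simplified debouncer; not the vsomeip implementation.
class request_debouncer_sketch {
public:
    // Queues a request. Returns true if the caller has to start the timer.
    bool add_request(int request) {
        std::lock_guard<std::mutex> lock(mutex_);
        requests_.push_back(request);
        if (timer_running_)
            return false; // the running timer will pick this request up
        timer_running_ = true;
        return true;
    }

    // Timer callback: drains the queue and resets the flag under the same
    // lock, so no request can be queued between the two steps.
    std::vector<int> on_timer_expired() {
        std::lock_guard<std::mutex> lock(mutex_);
        std::vector<int> batch;
        batch.swap(requests_);
        timer_running_ = false;
        return batch;
    }

private:
    std::mutex mutex_;
    std::vector<int> requests_;
    bool timer_running_ = false;
};
```

If the flag were reset after the lock had been released, a request added in that gap would see the flag still set, would not start a timer, and would wait until the next request arrives.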
Summary
Fix REMOTE_ERROR on response
Details
Increase mutex_ scope to include was_not_connected flag on connect
Reset is_sending flag when failing to trigger send_queued
Reset was_not_connected_ flag when send_queue checks that socket was not open
Summary
Lock registration_state_mutex_ before sending subscribes and
unsubscribes
Details
Without locking registration_state_mutex_, there could be a situation
where a subscription would not be sent if the client was still in the
registration process and state_ had the value ST_DEREGISTERED.
The same mutex is now also locked in the unsubscribe function.
Summary
Remove the get_mutex_ from runtime_impl.
Details
The issue is that, on exit, a global static mutex can be destroyed before the static class runtime_impl,
causing tombstones/crashes when an io thread triggers the termination of an application.
This mutex is not needed, as the static initialization of the shared_ptr is atomic.
Valgrind-helgrind reports this as a data race because it does not consider atomic operations
that are not POSIX.
From: https://valgrind.org/docs/manual/hg-manual.html#hg-manual.effective-use
"Do not roll your own threading primitives (mutexes, etc) from combinations of the Linux futex syscall,
atomic counters, etc. "
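The argument for dropping the mutex can be illustrated with a function-local static, whose initialization is guaranteed to be thread-safe since C++11. The sketch assumes the singleton is held this way; runtime_sketch is a hypothetical name, not the runtime_impl class.

```cpp
#include <memory>

// Hypothetical singleton holder; illustrates the language guarantee only.
class runtime_sketch {
public:
    static std::shared_ptr<runtime_sketch> get() {
        // Initialized exactly once, even if several threads call get()
        // concurrently, so no additional global mutex is required here.
        static std::shared_ptr<runtime_sketch> the_runtime =
            std::make_shared<runtime_sketch>();
        return the_runtime;
    }
};
```

Because no separate global mutex exists, there is also no static object whose destruction order relative to the runtime could matter on exit.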
Issue
Summary
Make data lookups more efficient on the common hot paths.
Details
Replace many instances of nested std::map<std::map< used for representing
a service_t / instance_t pair with a single std::unordered_map with packed
arguments. This reduces lookup complexity from O(2*log(n)) to O(1).
This functionality is wrapped in a new service_instance_map type.
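One way such a map could look is sketched below. The packing scheme (service in the upper 16 bits, instance in the lower 16 bits) and all names are assumptions made for illustration; the real service_instance_map may differ.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>

// Hypothetical stand-ins for the vsomeip identifier types.
using service_t = std::uint16_t;
using instance_t = std::uint16_t;

// Packs both 16-bit identifiers into one 32-bit key.
constexpr std::uint32_t pack_service_instance(service_t s, instance_t i) {
    return (static_cast<std::uint32_t>(s) << 16) | i;
}

template <typename Value>
class service_instance_map_sketch {
public:
    void put(service_t s, instance_t i, Value v) {
        map_[pack_service_instance(s, i)] = std::move(v);
    }

    // Returns a pointer to the stored value, or nullptr if there is none.
    const Value* find(service_t s, instance_t i) const {
        auto it = map_.find(pack_service_instance(s, i));
        return it == map_.end() ? nullptr : &it->second;
    }

private:
    std::unordered_map<std::uint32_t, Value> map_;
};
```

A single hash lookup replaces the two ordered-map lookups; the constant-time bound holds on average for std::unordered_map.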
Summary
Add config for wait_route_netlink_notification
Details
The root cause is still very unclear.
The only thing that is known is that, for certain ECUs, when the external interface is
toggled mid STR, vsomeip-lib would get stuck in the DELAYED_RESUME state.
This state is only exited to RESUME when the vsomeip-lib netlink socket has received the LINK_UP
and NEW ROUTE notifications.
This PR adds a configuration for the netlink connector to not reset the sd_route_set_ flag
once the interface goes down. This bypasses the wait for the NEW ROUTE notification.
Summary
Implemented config_command in handle_requests
Details
Implemented the config_command in handle_requests
to fix NSM logs with an empty "Container ID" string.
The container ID was empty because
the config_command was not received by
the application (service), only between the
clients and the daemon, which did not provide
the container IDs to the application.
To resolve the issue and avoid the problems
caused by dd25ea1,
the config_command was implemented to be
sent from the daemon to the application (service),
ensuring that the config_command message is only sent
after the routing_info is sent,
thereby guaranteeing that the routing_info is processed first.
Summary
Stop find/offer debounce timers when the service discovery is stopped
Details
In suspend, the service discovery module is stopped. In this state, no
SOMEIP/SD messages should be sent to the network. But it can be seen
that FindService and StopOffer messages are sent although the service
discovery is suspended. The reason is that
the debounce timers are not stopped when the service
discovery is stopped and can therefore trigger the sending of messages
even while the service discovery is stopped. This PR stops the debounce
timers together with all other timers when the service discovery is
stopped.
Summary
Enable global configuration of request debounce time
Summary
Subscription handlers shall run within dispatcher contexts, not within io contexts
Details
Subscription handlers are currently executed within io contexts. This must be changed to ensure
applications cannot block the io threads. Therefore, the handler is encapsulated in a
sync_handler so that it can be dispatched by the dispatcher threads.
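The hand-over from io thread to dispatcher thread can be sketched with a plain handler queue. This is a simplified illustration, assuming a queue of std::function objects; dispatcher_queue_sketch is a hypothetical class and does not reproduce the sync_handler type.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical queue between io threads and dispatcher threads.
class dispatcher_queue_sketch {
public:
    // Called from an io thread: only enqueues, never runs user code.
    void post(std::function<void()> handler) {
        std::lock_guard<std::mutex> lock(mutex_);
        handlers_.push_back(std::move(handler));
    }

    // Called from a dispatcher thread: runs all queued handlers and
    // returns how many were executed.
    std::size_t run_pending() {
        std::size_t executed = 0;
        for (;;) {
            std::function<void()> handler;
            {
                std::lock_guard<std::mutex> lock(mutex_);
                if (handlers_.empty())
                    break;
                handler = std::move(handlers_.front());
                handlers_.pop_front();
            }
            handler(); // user code runs outside the lock
            ++executed;
        }
        return executed;
    }

private:
    std::mutex mutex_;
    std::deque<std::function<void()>> handlers_;
};
```

A slow or blocking subscription handler then only delays the dispatcher thread that runs it, while the io thread returns immediately after post().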
Summary
Support multiple IPsec activation files per connection
Details
As we neither want additional dependencies nor different versions, we need to add
the ability to define and check multiple activation files per connection.
Summary
Several issues regarding UDP server endpoints are addressed by this PR,
mainly causing BAD FILE DESCRIPTORS and issues on joining multicast.
Details
ISSUE 1: The initial issue was detected in DLT logs showing a bad file descriptor on the send_cbk of the udp_server_endpoint
for the multicast addresses.
Cause: The issue was introduced in vsomeip-lib version 3.5.5 by cf2d68d.
The main change was the reintroduction of the server_endpoint restart
within endpoint_manager_impl::resume, plus the introduction of the endpoint stop() within the restart function.
The changes were necessary to clear the endpoint buffers,
so that the new lifecycle would not carry messages, such as subscribes, from previous lifecycles.
In past versions this step was done in the suspend function instead of resume.
However, certain environments would run into a state
of sockets being stuck in accept4 (high CPU load),
as the suspend would happen at a non-deterministic time, leaving sockets in an unknown state.
Problem: The main problem introduced was the stopping of the UDP server endpoint.
Unlike TCP_server::stop (shutdown + close), the UDP stop is an async stop.
It only calls cancel on the unicast and multicast sockets.
The stop itself only happens later, when the on_*_receive() callbacks receive the operation_aborted error
with the flag is_stopped enabled.
This triggered a data race between the stop performed within the receive_cbk operation and the start()
that was being called in the restart function,
meaning we are left with closed sockets after start.
In the end, any send triggered on those sockets results in a bad_file_descriptor error.
Solution: A state machine was created to cascade the stop -> init + start.
The goal was to ensure that the udp_server endpoint waits on both the unicast and multicast stop;
once both are stopped and the restarting flag is set, init and start can be called.
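The cascade described in the solution can be sketched as a small state holder. All names here are hypothetical, and the socket operations are replaced by a counter; the sketch only shows the rule that init + start run once both receive paths have reported their abort while a restart is pending.

```cpp
#include <mutex>

// Hypothetical restart sequencing; the real endpoint works on sockets.
class udp_restart_sketch {
public:
    // Requests a restart; the real code would cancel both sockets here.
    void restart() {
        std::lock_guard<std::mutex> lock(mutex_);
        restarting_ = true;
        unicast_stopped_ = false;
        multicast_stopped_ = false;
    }

    // Unicast receive callback saw operation_aborted.
    void on_unicast_aborted() {
        std::lock_guard<std::mutex> lock(mutex_);
        unicast_stopped_ = true;
        try_start_locked();
    }

    // Multicast receive callback saw operation_aborted.
    void on_multicast_aborted() {
        std::lock_guard<std::mutex> lock(mutex_);
        multicast_stopped_ = true;
        try_start_locked();
    }

    int start_count() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return start_count_;
    }

private:
    // Runs init + start only once both sockets are stopped during a restart.
    void try_start_locked() {
        if (restarting_ && unicast_stopped_ && multicast_stopped_) {
            restarting_ = false;
            ++start_count_; // stands in for init() followed by start()
        }
    }

    mutable std::mutex mutex_;
    bool restarting_ = false;
    bool unicast_stopped_ = false;
    bool multicast_stopped_ = false;
    int start_count_ = 0;
};
```

Whichever abort callback arrives last triggers the start, so start() can no longer run before a pending stop has closed the sockets.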
ISSUE 2: The previous solution made evident another issue: the join/leave of multicast group was now being
done by 2 threads.
The resume steps of the routing_manager_impl have the 2 main steps:
ep_mgr_->resume() // resume all endpoints
    client endpoint->restart()  // UDP + TCP
    server endpoint->restart()  // UDP + TCP
    For UDP:
        stop() = socket->cancel()
        on_unicast_receive() operation_abort
        on_multicast_receive() operation_abort
        init()
        add multicast option  // this step is async and will start the multicast_receives
        start() // this step is sync and will start the unicast_receives

discovery_->start()
    sd_endpoint->join()  // just like the add multicast option, this step is async


Problem: Now both discovery::start and ep_mgr::resume race for the multicast join.
Solution: The two join paths had to be synchronized.
ISSUE 3: If the multicast join failed, nothing was done.
Problem: Joining a multicast group has many steps, any of which can fail.
The implementation has always been to ignore the failure and return.
We believe that this only started to be a problem now because of the introduction of the stop() within the
udp_server::restart. (Still needs more investigation)
Solution: If the join fails, the endpoint_manager now queues the leave + join again
and keeps retrying. (This can still lead to an infinite loop, but it is not clear what should be done otherwise.)
Summary
Allow daemon to send ADD_CLIENT to already connected clients.
Details
During STR, connected clients might handle partner client errors due to loss of connectivity (TCP keep-alive).
When handling the error, the client removes the partner from its list of known clients. On resume, these clients
will receive the service availability, subscriptions, etc., but will not respond, as they are unable to
create TCP sockets to communicate with the other client because it is not in the known client list.
To be in the known client list, an application needs to receive an ADD_CLIENT routing info command from
the daemon when an application either requests or registers events. The daemon did not
send the message, as both clients were already connected according to the
host connection matrix map.
To fix this issue, this PR proposes removing the is_connected check when handling requests or
offers. Without it, applications will always be notified, independent of the connection matrix state.
@duartenfonseca duartenfonseca merged commit 932a88a into master May 22, 2025
2 checks passed

4 participants