@vvcarvalho-csw

No description provided.

donatellob and others added 28 commits October 24, 2025 15:08
Implement a clean shutdown of offer_test_service.
The application was interrupted using SIGTERM, and Valgrind
sometimes reported memory leaks. By performing a clean
shutdown, this error no longer occurs.
Fix race condition in test suspend_resume_test_initial when a notification is received after the unavailability.
The test failed because a notification was received while the service was unavailable.
2025-08-11 14:44:47.829423 suspend_resume_test_service [info] rmi::set_routing_state
      Set routing to suspend mode, diagnosis mode is inactive.
2025-08-11 14:44:47.829768 suspend_resume_test_client [debug] on_message: Received event.
2025-08-11 14:44:47.829864 suspend_resume_test_client [debug] on_availability: Test service is NOT available.
2025-08-11 14:44:47.829890 suspend_resume_test_client [debug] on_message: Received event.
2025-08-11 14:44:47.829894 suspend_resume_test_client [debug] [TEST-cli] HasReceived Changed, triggering cv

The last trace shows that the variable has_received_ was set to true.
After resume, the test expected to receive a notification, and it did, but the notification was ignored
because has_received_ had not been reset to false.
2025-08-11 14:44:52.829535 suspend_resume_test_service [info] rmi::set_routing_state
      Set routing to resume mode, diagnosis mode was inactive.
2025-08-11 14:44:52.840545 suspend_resume_test_client [debug] [TEST-cli] On availability will trigger cv
2025-08-11 14:44:52.840596 suspend_resume_test_client [debug] [TEST-cli] Service Available after susp/resume: r=0
2025-08-11 14:44:54.836073 suspend_resume_test_client [debug] on_message: Received event.

The test itself had a defect: it requested two notifications by issuing unsubscribe/subscribe commands
without waiting for confirmation of these operations. The results of these two operations interfered with the
expected sequence.
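The has_received_ problem described above can be illustrated with a minimal sketch. The class name notification_latch and its methods are hypothetical and are not taken from the test sources; only the flag name has_received_ comes from the log. The point is the explicit reset() that re-arms the flag before the post-resume phase.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical helper, not the actual test code: records whether a
// notification arrived and lets the test re-arm the flag explicitly.
class notification_latch {
public:
    // Called from the message handler when an event is received.
    void on_message() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            has_received_ = true;
        }
        cv_.notify_all();
    }

    // Re-arm before each test phase. Without this, a stale "true" left by
    // an earlier notification masks the notification the phase waits for.
    void reset() {
        std::lock_guard<std::mutex> lock(mutex_);
        has_received_ = false;
    }

    // Returns true if a notification arrived before the timeout expired.
    bool wait_for(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        return cv_.wait_for(lock, timeout, [this] { return has_received_; });
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool has_received_ = false;
};
```

In a test built this way, reset() would be called once the service is reported unavailable, so that only a notification received after resume satisfies the wait.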
Add a check of the test's return code and documentation
for debounce_filter_test
Add a check of the test's return code and documentation
for cyclic_event_tests
Add a check of the test's return code and documentation
for debounce_callback_test
Add a check of the test's return code and documentation
for debounce_frequency_test
Document offer_test_local test.
Add VSOMEIP_APPLICATION_NAME to the test executables
Reduces the number of cycles from 100 to 5.
Increases the test timeout to 180 seconds.
This test fails when run under Valgrind, which is a slower environment.
In the worst case, this test would take 6*100 seconds.
get_local_port would unnecessarily use the getsockname syscall, and did
so especially often when ACL is used.
While at it, remove set_local_port (shift the logic to on_bind_error
instead) and clean up dead code.
Unit tests of usei to reproduce time-related issues.
I tried to adapt a POC usei network test, but the network tests are blackbox tests,
and we cannot create routing_manager_stub or
endpoint_manager_impl. Changing the interface was considered,
but it would not pass review. Another solution was to use debug
symbols to find the private methods, but that would be too
much. For these reasons, unit tests are developed.
All Android builds also define linux, and plenty of portability guides
state the same (which makes sense, as Android uses a Linux kernel). This
eliminates most uses of ifdef ANDROID.
The test failed due to a delay of 6 ms, whereas the maximum expected delay was 5 ms.
Debug logs have been activated and high_resolution_clock has been replaced by
steady_clock. An issue was found in the code generating the messages: it has
a time drift.
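As an illustration of the time-drift issue, here is a minimal sketch of a periodic loop that does not accumulate drift. It is not the code from the test: the function name run_periodic and its parameters are hypothetical. It relies on two facts: std::chrono::steady_clock is monotonic, whereas high_resolution_clock may be an alias of system_clock in some standard libraries, and computing each deadline from the previous deadline (rather than from the current time) keeps the work duration from adding up across cycles.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper: runs `cycles` iterations spaced `period` apart and
// returns the total elapsed time. Each deadline is derived from the previous
// deadline, not from "now", so time spent in `work` does not add up as drift.
template <typename Work>
std::chrono::steady_clock::duration run_periodic(int cycles,
                                                 std::chrono::milliseconds period,
                                                 Work work) {
    using clock = std::chrono::steady_clock;  // monotonic clock
    const auto start = clock::now();
    auto next = start;
    for (int i = 0; i < cycles; ++i) {
        work(i);
        next += period;                        // absolute deadline of this cycle
        std::this_thread::sleep_until(next);   // never wakes before the deadline
    }
    return clock::now() - start;
}
```

A loop that instead sleeps for a fixed `period` after each message would lengthen every cycle by the time spent generating the message, which is one way a 5 ms budget can turn into 6 ms.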
Came across missing details
The fake_socket tests already exceed the 1MB limit, which is anyhow
hilariously small. Increase it by a few orders of magnitude.
4a0295b introduces a data race caught by the CI, where:

lusei restarts (which is stop+init+start)
init writes local_, which is the address of the acceptor socket (in
the case of lusei, something like /var/run/someip/vsomeip-1234)
however, even though stop closes the acceptor socket, an accept_cbk
can execute, which due to 4a0295b, also reads local_ for logging

The accept_cbk in the background is concerning and needs to be solved,
this commit just mitigates the race. And of course, both lusei and ltsei
have the same race, there are just more tests using lusei
add fair-sched to valgrind
Set valgrind with --fair-sched=yes. This allows created threads to run with
the same priority as the main application thread. Without this, valgrind stalled
application thread execution, and the app that started first would get
almost no CPU time.
See:
https://valgrind.org/docs/manual/manual-core.html
https://valgrind.org/docs/manual/manual-core.html#manual-core.pthreads_perf_sched
Enable compilation of usei_tests/ut_basic_tests with boost 1.87+
boost 1.87 retires asio::io_context::work. It was replaced by executor_work_guard.
Force remote subscriptions to be removed on host error.
In some ECU modes, an emergency shutdown is executed instead of a graceful shutdown.
This leads to vsomeipd not suspending; consequently, the vsomeip suspend
command is not propagated to clients.
Because the suspend command is not received and the unit
doesn't reboot, some clients do not clean up the remote subscriptions map that they manage.
This container is used to detect whether a client is a new subscriber to an event (initial
events are only sent to new subscribers).
This PR forces the cleanup in scenarios where a graceful shutdown is not performed, by
clearing the container when the client detects and handles a lost connection to the host.
As it was, the result from ctest when running offer_stop_offer_test_client
was not being evaluated. To fix this, keep track of the return codes of all
programs started.
Removed unnecessary code
Due to e8b6519 (netlink removal), libvsomeip receives an EADDRNOTAVAIL
on bind, in ltcei::connect, before the network interface is ready.
This by itself is expected and not a problem, but due to broken error
handling, libvsomeip continues with the connect logic and sends ASSIGN_CLIENT.
Sending ASSIGN_CLIENT of course fails (the connection is not open)
and causes a cascade of errors, which unfortunately only resolves on the
ASSIGN_CLIENT_ACK timeout, which takes 3s and heals the situation.
There are failures exactly because of these extra 3s.
While at it, promote ASSIGN_CLIENT/REGISTER_APPLICATION timeouts to
errors, as these should never happen.
Abort when incorrect state is detected for critical syscall/libc calls such as
recvfrom/sendto/epoll_wait. Under these situations, the process using libvsomeip
is incorrectly handling file descriptors (e.g. double close with invalid fd
value) leading to a libvsomeip state that is incorrect and not recoverable. The
abort will cause a core dump and platforms can decide on recovery strategies.
strerror is not thread safe in older glibc, and anyhow not guaranteed
to be thread safe in general.
Its output is not important, as any developer should recognize the term "errno";
therefore, remove all of the calls.
If an application closes the epollfd that libvsomeip is using
internally, libvsomeip IO threads will loop forever on an EBADF.
Therefore, react to this error from epoll_wait.
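A minimal, Linux-only sketch of the idea follows. The names poll_result and poll_once are hypothetical and do not come from libvsomeip; the commit message does not say which reaction was chosen, so the sketch only classifies the outcome and leaves the reaction to the caller. It assumes that EINTR is the only error worth retrying.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/epoll.h>
#include <unistd.h>

// Hypothetical classification of a single epoll_wait call.
enum class poll_result { ok, retry, fatal };

// Performs one epoll_wait and tells the caller whether retrying makes sense.
poll_result poll_once(int epfd, int timeout_ms) {
    epoll_event events[8];
    const int n = ::epoll_wait(epfd, events, 8, timeout_ms);
    if (n >= 0) {
        return poll_result::ok;      // zero or more events were returned
    }
    if (errno == EINTR) {
        return poll_result::retry;   // interrupted by a signal, harmless
    }
    // EBADF, EINVAL, EFAULT: the descriptor or arguments are unusable, so
    // calling epoll_wait again would fail the same way and spin forever.
    return poll_result::fatal;
}
```

An IO loop built on this would leave the loop (or escalate) on poll_result::fatal instead of calling epoll_wait again.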
The test is fragile, as there can be more client -> routing reconnects than
expected. Adjust the check and add a comment.
This test was created to ensure that all clients subscribed
to a remote service receive all notifications correctly
when another client on the same device, subscribed to the
same service, unsubscribes from it. Before the fix,
the routing manager would leave the multicast
group where the service sent its notifications while
clients were still subscribed to it, causing those clients to miss
notifications sent by the service.
Victor Carvalho and others added 19 commits October 24, 2025 15:08
Fix remaining race condition in udp_server_endpoint_impl during restart.
During a restart, a problem remained concerning buffer sharing between the stopped
streaming and the newly started streaming. The solution is to allocate buffers dynamically.
The unit test provided reproduces the problem and also detects an issue with
a shared pointer that was not protected.
Removed Services/multicast, as the multicast address and port must be
assigned to the eventgroup as different eventgroups of the same service can
use different multicast addresses/ports.
Removed Eventgroups/is_multicast, as eventgroup uses the multicast
address/port defined in the service node. This is superseded by the
eventgroup specific multicast address/port definition
Removed servicegroups, leftover from old versions of vsomeip.
Removed routing-client-ports, leftover from old versions of vsomeip.
Added Environment Variables info
Adapted configuration test to the new changes, as the deprecated json now
makes no sense to use.
Simplify the logger implementation and make it (mostly) lock-free.
Refactor the logger implementation so it is lock-free for logging to both DLT
and console. This should completely remove contention and significantly
improve performance in a multithreaded environment. However, one atomic
load per message logged remains, due to out-of-band (and repeated)
initialization of the logger. Fixing it would
enable additional simplifications and optimizations.
Only file logging still requires locking - there is no way around this,
without relying on POSIX specifics. However, this PR also should improve
performance here, as the log file no longer is opened, flushed, and closed
for every single message.
Also incorporates the more general optimization changes
that were supposed to be merged, but somehow got lost in time.
Change tcp_tw_reuse to 2 to have the default linux values used on most ECUs
tcp_tw_reuse is a Linux kernel parameter that enables the reuse of TCP sockets
in the TIME_WAIT state for new outgoing connections.
It has the following values:

0: Disable TIME_WAIT socket reuse
1: Enable TIME_WAIT socket reuse
2: Enable TIME_WAIT socket reuse for loopback only (Linux Default)
Fix invalid use of std::move
The mutex and the condition variables were moved and then used
again. This could explain the BLOCKING CALL that led to a failure.
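The commit message does not show the affected code. Since std::mutex and std::condition_variable are not movable themselves, the sketch below assumes they were held through a smart pointer; the worker type and both function names are hypothetical. It shows why using such a pointer after std::move is invalid: a moved-from std::shared_ptr is guaranteed to be empty.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical holder that keeps its own reference to a shared mutex.
struct worker {
    explicit worker(std::shared_ptr<std::mutex> m) : mutex_(std::move(m)) {}
    std::shared_ptr<std::mutex> mutex_;
};

// Buggy pattern: after std::move the caller's pointer is empty, so locking
// through it afterwards would dereference a null pointer.
bool caller_keeps_mutex_after_move() {
    auto mutex = std::make_shared<std::mutex>();
    worker w(std::move(mutex));
    return mutex != nullptr;  // false: ownership went to the worker
}

// Fixed pattern: pass a copy, so caller and worker share the same mutex.
bool caller_keeps_mutex_after_copy() {
    auto mutex = std::make_shared<std::mutex>();
    worker w(mutex);
    return mutex != nullptr && mutex.get() == w.mutex_.get();
}
```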
Use only console output, do not use dlt nor dlt-daemon, as there is anyhow a DISABLE_DLT option
While at it, enable trace level for all tests
These logs should enable better tracing of connection lifecycles.
On the client side the log is expanded to mention the remote address,
on the server side a new log is added to mention the remote connector.
Avoid a busy loop if a failure occurs during a multicast leave operation.
The leave failed because MACsec reconfigured the network interface
during normal execution. The endpoint_manager_impl doesn't
repeat the leave, but it repeats the join in a busy loop.
Performing the leave operation even if it was already done
avoids the "address already in use" error when the join is
done again.
The "address already in use" error cannot be ignored, because
it also happens during the restart case and must be handled
in that case.
This isn't enough to fully fix the busy loop, because the join
will return an error until the network interface is up again, so
a delay has been added.
One issue remains: the leave isn't repeated by
endpoint_manager_impl; only the join is repeated.
Handling this problem correctly would require recording
the join state as unknown, then managing this
new state and repeating the leave operation. That is too risky for
something that should not happen.
Turns out that boost::system::errc::broken_pipe !=
boost::asio::error::broken_pipe
This was very, very painful to find out. It is also generally true - no
system::errc will ever be equal to any asio::error, they are inherently
different error categories
Remove dead code, fix tests. Remove all uses of system::errc because I
never, ever want to deal with this again
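The category mismatch can be illustrated with the C++ standard library, whose std::error_code design follows the same model. This is a standard-library analog under that assumption, not the Boost code touched by the commit, and the helper names are hypothetical. Two error codes with the same numeric value still compare unequal when their categories differ; comparing against the portable std::errc enum goes through the category's equivalence check instead.

```cpp
#include <cassert>
#include <cerrno>
#include <system_error>

// Error as an OS-level API would report it: errno value, system category.
inline std::error_code system_broken_pipe() {
    return std::error_code(EPIPE, std::system_category());
}

// Error built from the portable errc enum: same value, generic category.
inline std::error_code generic_broken_pipe() {
    return std::make_error_code(std::errc::broken_pipe);
}

// Exact comparison: both the value and the category must match.
inline bool is_exact_match(const std::error_code& a, const std::error_code& b) {
    return a == b;
}

// Semantic comparison: asks the code's category whether the code is
// equivalent to the portable "broken pipe" condition.
inline bool is_broken_pipe(const std::error_code& ec) {
    return ec == std::errc::broken_pipe;
}
```

Code that compares two error_code objects directly, or that switches on the value alone, has to stick to one category consistently, which matches the commit's decision to remove one family of error codes altogether.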
It makes no sense to do so, because that causes libvsomeip to connect to
vsomeipd on application::init, before there is ever an io thread
executing events
If close is not called, the socket options can be overwritten by boost
which can lead to an increased TIME_WAIT value
With this change, the fake_socket required some changes in the receive_
handler clean up, to avoid lock order inversion
Remove duplicate shutdown closure of sockets.
Both the wait_until_sent and restart functions already call
shutdown_and_close_socket, so there is no need to call it
again before calling either of them.
This PR is addressing the following scenario:
cei::shutdown_and_close_socket_unlocked: socket shutdown error (107): Transport endpoint is not connected
endpoint > 0xffff980b8a80 socket state > 3
cei::shutdown_and_close_socket_unlocked: not recreating socket endpoint > 0xffff980b8a80 socket state > 0
cei::shutdown_and_close_socket_unlocked: socket was not open endpoint > 0xffff980b8a80 socket state > 1
cei::shutdown_and_close_socket_unlocked: socket has been reset endpoint > 0xffff980b8a80 socket state > 0
cei::shutdown_and_close_socket_unlocked: socket was not open endpoint > 0xffff980b8a80 socket state > 1
cei::shutdown_and_close_socket_unlocked: not recreating socket endpoint > 0xffff980b8a80 socket state > 0

Where 2 extra shutdowns are called.
Update local_endpoint ClientID on add_guest
This was the case:
There was an endpoint established between two applications A and B.
An STR occurred.
The endpoint still exists, but is disconnected. It tries to reconnect, and succeeds.
During this reconnection, application B re-registers to routing host and changes ClientID.
Afterwards app A receives ON_AVAILABLE from app B, with a new ClientID.
App A tries to connect to this new ClientID, which has the same address, and it fails with
Cannot assign requested address.
This fix proposes to update the local_endpoint map, to synchronize the ClientID with
what we are updating in the guests map (when 'add_guest' is called).
Work around compiler bug in GCC < 10 after recent logger optimizations.
GCC < 10 does not accept a struct with member initializers as a template argument for std::atomic.
This was fixed in later versions; however, as the affected struct is never used without explicit initialization,
just remove the initializers.
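A minimal sketch of the workaround pattern, assuming a small settings struct: the names log_settings, settings and set_level are hypothetical and not taken from the vsomeip logger. The struct has no default member initializers, and every construction site initializes it explicitly instead.

```cpp
#include <atomic>
#include <cassert>
#include <type_traits>

// Hypothetical logger settings. There are no default member initializers,
// so the type is also accepted as a std::atomic argument by older GCC.
struct log_settings {
    int level;
    bool console_enabled;
};

static_assert(std::is_trivially_copyable<log_settings>::value,
              "std::atomic<T> requires a trivially copyable T");

// The only construction site initializes every member explicitly, so
// dropping the member initializers leaves nothing indeterminate.
inline std::atomic<log_settings>& settings() {
    static std::atomic<log_settings> instance{log_settings{3, true}};
    return instance;
}

// Updates one field with a plain load/modify/store sequence. This is not
// atomic as a whole; it assumes a single writer.
inline void set_level(int level) {
    log_settings s = settings().load(std::memory_order_relaxed);
    s.level = level;
    settings().store(s, std::memory_order_relaxed);
}
```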
Silly one, likely there for years, but only happens if libdlt is used
For versions of libdlt that support privacy-aware logging,
ensure that all normal log messages are marked as public.
It was determined that vsomeip never logs sensitive data
through the normal logger.
Trace data containing message payloads (and thus potentially
privacy-relevant information) goes through a different code path and remains
marked as private for now.
Add the SO_REUSEPORT option to mitigate the "Address already in use" issue.
After resume, the TCP server that was closed during suspend received
an unexpected error: Address already in use. The reason is not clear,
and the DLT log shows that the stop was called. The workaround is to
use the option SO_REUSEPORT.
The same option is added to the UDP unicast server. The UDP multicast
server already has error handling and does not require it.
This PR also aims to improve logging, particularly by tracking init,
start and stop operations, and by monitoring active connections.
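For readers unfamiliar with the option, here is a minimal sketch using the raw POSIX socket API on Linux. The PR presumably sets the option through the library's own socket abstraction, which is not shown in the commit message, so the helper names below are hypothetical. The option has to be set before bind() to have any effect on that bind.

```cpp
#include <cassert>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Enables SO_REUSEPORT on a socket. Call this before bind().
inline bool enable_reuse_port(int fd) {
    const int on = 1;
    return ::setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on)) == 0;
}

// Reads the option back, mainly useful for logging and tests.
inline bool reuse_port_enabled(int fd) {
    int value = 0;
    socklen_t len = sizeof(value);
    if (::getsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &value, &len) != 0) {
        return false;
    }
    return value != 0;
}
```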
Update vsomeip-lib to v3.5.9
Update local_endpoint ClientID on add_guest
Make compile with GCC < 10 again
tce: Remove duplicate shutdown closure of sockets
Ensure socket closure on dtor
rmc: remove sender start on rmc::init
misc: fix use of non-asio error codes
multicast, add repeat delay
Expand logs to trace connections easier
Minor change, typo when declaring endianness structure
zuul: remove dlt
flaky, subscribe_notify_test_one_event_two_eventgroups_tcp
Change tcp_tw_reuse to 2
Avoid locking in and improve performance of the logger
Remove leftover configurations
fix race condition with on_message_received_unlocked
Create multicast group test
tests: fix flaky test allow_reconnects
NTF integration pipeline
misc: react on EBADF for epoll_wait
misc: remove use of strerror
Add support for VSOMEIP_ABORT_ON_CRIT_SYSCALL_ERROR
ltcei: fix connect logic
fix and document event_test
fix offer_stop_offer_test scripts
Force remote subscriptions to be removed on host error
Enable usei_tests/ut_basic_tests with boost 1.87+
Post results to artifactory for batch pipelines
add fair-sched to valgrind
Adds remote information to client endpoint warning logs
Adds vsomeip version to the cmake log
lsei: fix race
zuul: increase test output size
lsei: minor log improvement
flaky test, debounce_filter_test
build: cleanup ifdef ANDROID
usei, unit tests
endpoint: fix get_local_port, remove set_local_port
reduce time spent in test_restart_client_in_loop
zuul: suppress sonarqube on catch-all exceptions
document offer_test_local memcheck test
doc/fix debounce_frequency_test
fix/document debounce_callback_test scripts
fix/document cyclic_event_tests scripts
fix/document debounce_filter_test scripts
fix test suspend_resume_test_initial
fix shutdown of offer_test_service
Adds ets and verification job to check (non-voting)
add SO_REUSEPORT option + logging
Mark all normal log messages as public
misc: fix client-side-logging crash
@fcmonteiro fcmonteiro merged commit bb35c97 into COVESA:master Oct 28, 2025
4 checks passed
5 participants