Skip to content

Conversation

@vvcarvalho-csw
Copy link

@vvcarvalho-csw vvcarvalho-csw commented Sep 16, 2025

andrefesilva and others added 18 commits September 30, 2025 09:52
Fixes issues with suspend resume tests by introducing sleeps that helps
when processing data in systems with less resources.

Before these changes, there were multiple issues with timing when
threads were being created and SOME/IP packets were being received.
The first sleep is done in client side in the start.
This is done since in systems with less resources,
the runner_ thread was created after the condition variable (cv) in
start function was notified. What this lead to was that the cv in run
would never be notified since the wait would start after it was notified
in start.

The other change is done in service when it receives and unsubscribe.
Now the service sleeps for longer, since there was an issue where it
would no process the subscribes/unsubscribes from client in a correct
order, leading to the test failing. Now with the 3 seconds sleeps the
correct order is ensured and the test has not failed with this change.
std::filesystem is supported since C++17, no need for boost.

On my PC (WSL+Debian), boost::filesystem has a linking problem.
However, the main motivation is to modernise vsomeip.
Add configuration for TCP keepalive parameters, also for along
TCP_USER_TIMEOUT.
Covers the ON_UNAVAILABLE that applications see when underlying
connections break
Fix availability handler client deadlock.

Sporadically, the availability tests failed by timeout since the client
application didn't terminate properly and continued on loop trying to
connect to host. The client shutdown thread was blocked waiting for the
dispatch thread which was trying to acquire the client inter-process
mutex in the availability callback, which was held by the client in the
stop method.

Why sporadic? It depends on (non-deterministic) locking order of client
mutex between the dispatch thread and the stop method.
Adds a network test that checks whether the epsilon change function
is being correctly used to debounce outgoing events.
With TCP communication, netlink is used by a host vsomeip application to
determine when the "local" interface is ready.

Instead of using netlink (which is Linux-specific), use IP_FREEBIND
(also Linux-specific) in order to bind the local TCP ports regardless of
network state. It is much simpler, and bypasses the problem of waiting
for the local interface.

Same is done in rmc, see e8b6519
If an application (or libvsomeip itself) accidentally closes a
descriptor, following syscalls receive EBADF.

There are plenty of libraries (e.g., libsystemd, libudev, ...) that will
immediately SIGABRT on such a case, to signal application misbehavior
It is somewhat difficult to do the same in libvsomeip because of the use
of boost asio, but follow the same principle, and add close(2) and other
wrappers that react to EBADF.
Ensures that all accesses to targets_ are synchronised by mutex_.

There is a segfault when usei::send_queued_unlocked performs the following
operation:
const auto its_entry = _it->second.queue_.front();

The back trace is
\#1  __gnu_cxx::__atomic_add_dispatch (__val=1, __mem=0x402e2cbc6794f03)
\#2  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_add_ref_copy (this=0x402e2cbc6794efb)
\#3  std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count (__r=..., this=<synthetic pointer>)
\#4  std::__shared_ptr<std::vector<unsigned char, std::allocator<unsigned char> >, (__gnu_cxx::_Lock_policy)2>::__shared_ptr (this=<synthetic pointer>)
\#5  std::shared_ptr<std::vector<unsigned char, std::allocator<unsigned char> > >::shared_ptr (this=<synthetic pointer>)
\#6  std::pair<std::shared_ptr<std::vector<unsigned char, std::allocator<unsigned char> > >, unsigned int>::pair (this=<synthetic pointer>)
\#7  vsomeip_v3::udp_server_endpoint_impl::send_queued_unlocked (this=this@entry=0x7f84002180, _it=...)
\#8  vsomeip_v3::udp_server_endpoint_impl::send_queued (this=0x7f84002180, _it=...)
\#9  vsomeip_v3::server_endpoint_impl<boost::asio::ip::udp>::send_cbk (this=0x7f84002180,

This means that it, which is of type target_data_iterator_type, pointed to
corrupted data. It had been retrieved from server_endpoint_impl::send_cbk,
which obtained it as follows
    std::lock_guard<std::mutex> its_lock(mutex_);
    auto it = targets_.find(_key);
    if (it == targets_.end())
        return;

If all accesses to targets_ are synchronised using mutex_, it should not be
possible to corrupt the it pointer. This is what is done in this PR.
Remove leftovers from application signal handling logic, see
2709230.
While at it, improve a couple of logs.
Check if a service is available before sending a request to it.

This would prevent us from trying to process a request which might not
reach its destination, and allows us to immediately report an error to
the application.

Process offer service before on_availability callback being invoked,
with previous order, clients could send a request while the service is
still not added to the control map.

Npdu test is updated to enforce the clients to request services before
sending any request type message, without the request, the host will not
send the ADD_SERVICE routing info command and consequently on the client
side the service will constantly be assessed as unavailable.
Created a new CMake flag, ANDROID_CI_BUILD, for building with NDK.
This changes are only to be used on the CI, and with this, a check job
will verify if every PR builds for ANDROID.

This will not influence the regular ANDROID build.
Check that connection wasn't stopped before to repeat the receive.

After refactoring udp_server_endpoint_impl, there is a lifecycle_index
which is used to check whether the result of an asynchronous operation
must be ignored. The access to this variable cannot be synchronized
because  the call to on_message_received_unlocked must be lock-free.

But after calling on_message_received_unlocked, we must check again
the value of lifecycle_index before to repeat the asynchronous
operation. And this second check was missing.

Without this check, the asynchronous operation can be repeated, even if
a stop was requested in between.
Fixed a minor threading issue in the routing handler implementation.

Problem reported while executing the test offer_test_local in CI.
The function that automatically restored joins at startup was defective.

The function was faulty because it was performing invalid checks. The
need behind this function is unknown. It could be a performance
optimization. There were two options: delete it because it seems
redundant, or fix it. The second option has been chosen.

This function checked whether the group was joined successfully before.
But after a stop, this information is no longer valid. This check has
been removed.

It also checked whether the socket has been created previously. However,
the socket may have been created by set_multicast_option, in which case
we still need to join the groups. This check has been removed as well.
We can remove both checks because redundant join operations are
ignored by set_multicast_option.
Problem scenario:

Client successfully connects for Routing HOST;
Before the client starts the Registration process it receives EOF
(unknown reasons);
Client retries to reconnect with HOST, however host denies because its
connection with this client already exists;
Client gets stuck in this loop.

The main issue is that, because the TCP keep-alive is disabled, for
clients in the same VM/SOC, the HOST never discovers that it lost the
connection.

With keep-alive enabled, this process is recoverable, HOWEVER,
it takes more time, equal to what are the tcp_keekaplive settings.

Solution:

The proposed is to simply replace the previous connection, as it is a
faster recovery. Also adds test to replicate the issue. Adds to the
socket_manager the ability to allow the disconnection of  only one side
of the endpoint.
Summary
Infer routing information (client-id, guest address) for a local server
connection (e.g., ltsei or lusei) from the first received message
Details
It is necessary to fundamentally solve the following race condition:

client -> server connection breaks; implementation also breaks
server -> client as a reaction
client does REQUEST_SERVICE_ID
routing sends ROUTING_INFO+CONFIG_ID to both client+server
assume server receives ROUTING_INFO first
assume server -> client ECONNRESET is received now (and not earlier)
server erases ROUTING_INFO due to 5.
assume client receives ROUTING_INFO second, sends SUBSCRIBE to server
server does not respond due to lack of routing info

Fundamentally, this race (and similar) happens because:
a. routing information comes through routing -> server connection
b. routing information is cleaned based on client <-> server connection
The race can only be solved with strict order of events (difficult for
events such as connection breaks), or by sending the information in the
same channel. This commit does the later, using as a starting point work
already done in fe4e94c.
While at it:

add test coverage + "command-peeking" capability (based on Kai's work)
add logging of own client-id in emb
use a single registration thread, fixes a re-registration race
remove pending sub logic, no longer necessary with routing inference
enable connection break tests, fixed with this commit
format internal.hpp.in, was forgotten in
7e60ee2.
Victor Carvalho and others added 3 commits October 9, 2025 16:08
Tombstone fix:
In android boost::property_tree::ptree_error does not inherit from
std::exception.
Change the catch to "..." as we are also ignoring it.

Log fix:
The start function is called on every receive_cbk, so the log can not be there.

This PR moves log from the connection to the server_endpoint accept_cbk.
New log example:

ltsei::accept_cbk: server(169.254.87.2:31000) accepted connection with
client(169.254.87.2:31495), endpoint > 0x55555563de60

cmake fix:
update cmake_minimum_required to be the same every where.
Summary
Remove superfluous semicolon that causes compilation failure with -pedantic and certain toolchains.
Changes:
- Fix tombstone on android application load and spamming log
- routing: infer routing info
- Mitigate to add already existing connection
- Set test results artifact expiration
- udp-server: fix automatic rejoin operations at startup
- routing manager implementation, tiny threading issue
- fix regression introduced by usei refactoring
- zuul: make clang-format a voting check
- Android CI Build
- Check availability on send
- application_impl: minor cleanup
- control access to targets_
- misc: react to EBADF
- rms: drop use of netlink
- Revert "NO-TICKET: remove dependency to boost filesystem"
- Debounce Epsilon Tests
- Solve availability test deadlock
- rms: add missing ON_UNAVAILABLE log
- override host kernel TCP keepalive params
- remove dependency to boost filesystem
- suspend resume tests fix
- Fix E2E protection checker counter increment (#889)
- Fix queue size recheck condition
- app: fix thread names on ASSIGN_CLIENT_ACK_ID
- flaky test someip_application_init_test.multithread_init
- Document offer_stop_offer_test
- Remove superfluous semicolon
- fix github actions (#940)
@fcmonteiro fcmonteiro merged commit e89240c into COVESA:master Oct 13, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants