Skip to content

Conversation

@Jakio815
Copy link
Collaborator

@Jakio815 Jakio815 commented Jan 15, 2025

lf-lang/lingua-franca#2455

Summary

This PR adds a network interface layer to support interoperability in terms of network protocols and security using the net_abstraction.

The original code is very tied up with sockets, making it hard to add different network features.
However, I clarify that this PR does not remove all socket related functions out of the main flow from the RTI and federate, for two reasons.

  1. The RTI and federate support user-specified port numbers and also IP addresses for the federate.
  2. The MSG_TYPE_ADDRESS_ADVERTISEMENT and MSG_TYPE_ADDRESS_QUERY uses the port and IP address in the protocol itself.

We can completely remove socket-related code using #ifdef COMM_TYPE_TCP; however, I highlight that this PR is more concentrated on supporting end-to-end pluggable and interchangeable network security based on TCP. Thus, in this PR, I have not added the #ifdef guards, and have not changed the protocol for MSG_TYPE_ADDRESS_ADVERTISEMENT and MSG_TYPE_ADDRESS_QUERY.

Features

Plugin API for network

This PR creates a separately-compiled library for the network instead of compiling it as a part of the core runtime.
I followed the prior work on low_level_platform.h and platform.h.
All source files related to networking are moved to network/impl/src, and all headers are under network/api. There are separate CMakelists.txt for each.

Add COMM_TYPE target property.

The comm-type keyword is available in the C target as follows.

target C {
  comm-type: TCP
}

Currently only supports TCP, and plan to support SST for security.

Refactoring on clock-synchronization.

Cloxk synchronization still uses UDP. So, I left all clock-sync functions to directly use UDP sockets, and refactored these functions.

rti_remote.c

  • send_physical_clock() and handle_physical_clock_sync_message() : Uses boolean flag use_UDP when UDP socket used.

clock-sync.c

  • handle_T4_clock_sync_message() : Uses boolean flag use_UDP when UDP socket used.
  • handle_T1_clock_sync_message(): Use void* socket_or_net_abstraction as parameter to support both socket and network abstraction.

handle_address_ad() and handle_address_query()

There are no changes in the protocol.

As explained in the summary, MSG_TYPE_ADDRESS_ADVERTISEMENT and MSG_TYPE_ADDRESS_QUERY use port numbers and IP addresses in the protocol itself. To explain further, for physical connections or decentralized coordination, the federateA has to know the peer federateB's port number and IP address to directly connect to it. This is done by federateA sending a MSG_TYPE_ADDRESS_ADVERTISEMENT to the RTI it's port, and the RTI sends a MSG_TYPE_ADDRESS_QUERY_REPLY message to the peer federateB, including the port number and IP address. Thus, the port number and IP address itself cannot be encapsulated under the network abstraction layer, as it is included in the protocol.

Therefore, there are some get() and set() calls to the network interface.

  1. FedA - > RTI: MSG_TYPE_ADDRESS_ADVERTISEMENT: FedA calls get_my_port() to send its own port to the RTI.
  2. FedB -> RTI: MSG_TYPE_ADDRESS_QUERY : No changes.
  3. RTI -> FedB: MSG_TYPE_ADDRESS_QUERY_REPLY : RTI calls get_server_port() and get_ip_addr() to encode it to the message to send to FedB.
  4. FedB: Sets the received port and IP address by set_server_port() and set_server_host_name() to directly connect to FedA.

Minor logic change: Move getpeername() logic to accept_net_abstraction()

The RTI should know the connected federate(FedA)'s IP address, to pass the IP address to the other federate(FedB), as explained above.

Before, this was done in rti_remote.c's receive_and_check_fed_id_message(). I moved this to lf_socket_support.c's accept_net_abstraction(), because it looks more appropriate to set the connected peer's information inside this function.

One inefficiency that happens is that in decentralized coordination, the server federate does not need to save the connected peer federate's IP address. However, I think this will barely affect the performance.

Add default UDP port number as 15061.

The UDP port was usually set to the RTI's port + 1, in rti_remote.c's create_server() function call. However, I did not want to expose the port number in the create_server() interface, so I took the parameter port out from create_server().

Minor changes

  • The start_rti_server() function does not get the port as a parameter. The port will be saved in rti_remote_t;s user_specified_port, with a default value when not set up.

@Jakio815 Jakio815 changed the base branch from main to shutdown January 15, 2025 17:32
Jakio815 and others added 24 commits January 15, 2025 11:58
…create_clock_server, because there are no plans using other network stacks rather than UDP.
…e. && Change all read() write(), and shutdown() to use netdrv
…work driver is initialized, and send default values when not initialized.
@Jakio815 Jakio815 requested review from edwardalee and hokeun November 7, 2025 21:15
Copy link
Contributor

@edwardalee edwardalee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't made it all the way though, but it's looking pretty good! One issue is that while I think "network abstraction" is a good term to use for the general architecture, I find the resulting names unnecessarily long. I suggest dropping the "abstraction" in most cases. E.g.:

fed->fed_net_abstraction -> fed->net
write_to_net_abstraction -> write_to_net
write_to_net_abstraction_fail_on_error -> write_to_net_fail_on_error
read_from_net_abstraction_fail_on_error -> read_from_net_fail_on_error

@Jakio815
Copy link
Collaborator Author

Jakio815 commented Nov 14, 2025

I haven't made it all the way though, but it's looking pretty good! One issue is that while I think "network abstraction" is a good term to use for the general architecture, I find the resulting names unnecessarily long. I suggest dropping the "abstraction" in most cases. E.g.:

fed->fed_net_abstraction -> fed->net write_to_net_abstraction -> write_to_net write_to_net_abstraction_fail_on_error -> write_to_net_fail_on_error read_from_net_abstraction_fail_on_error -> read_from_net_fail_on_errorHello

Hello Prof. Lee, thank you for your comment. I shortened most of the names.
The one thing I haven't changed was the name of the type, net_abstraction_t.
Besides this, most of them are shortened.

net_abstractions_for_inbound_p2p_connections -> net_for_inbound_p2p_connections
net_abstractions_for_outbound_p2p_connections -> net_for_outbound_p2p_connections
lf_outbound_net_abstraction_mutex -> lf_outbound_net_mutex
fed->fed_net_abstraction --> fed->net
net_abstraction_to_RTI -> net_to_RTI
RTI_net_abstraction_listener -> RTI_net_listener


#include "util.h"
#include "socket_common.h"
#include "util.h" // LF_MUTEX_UNLOCK(), logging.h
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#include "util.h" // LF_MUTEX_UNLOCK(), logging.h
#include <logging.h>
#include "util.h" // LF_MUTEX_UNLOCK()

}

void send_physical_clock(unsigned char message_type, federate_info_t* fed, socket_type_t socket_type) {
void send_physical_clock(unsigned char message_type, federate_info_t* fed, bool use_UDP) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think socket_type == UDP is more readable than true. I recommend changing this back to using the socket_type_t enum unless there is a pressing reason to change this to bool.

}

int handle_T1_clock_sync_message(unsigned char* buffer, int socket, instant_t t2) {
int handle_T1_clock_sync_message(unsigned char* buffer, void* socket_or_net, instant_t t2, bool use_udp) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The argument name here is inconsistent with the one above: use_UDP vs. use_udp.

@hokeun hokeun changed the title Add Network Abstraction layer. Addition of network abstraction layer to separate socket implementation code from network communication logic Nov 27, 2025
@Jakio815
Copy link
Collaborator Author

Jakio815 commented Dec 22, 2025

Non-deterministically failing tests:

  • src/federated/DecentralizedMicrosteps.lf
  • src/docker/federated/DistributedStopDecentralizedContainerized.lf
  • src/federated/DecentralizedP2PUnbalancedTimeout.lf

All fixed.

@Jakio815 Jakio815 requested a review from edwardalee December 24, 2025 15:37
Copy link
Member

@hokeun hokeun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me! Just left one minor comment.

@Jakio815
Copy link
Collaborator Author

I haven't made it all the way though, but it's looking pretty good! One issue is that while I think "network abstraction" is a good term to use for the general architecture, I find the resulting names unnecessarily long. I suggest dropping the "abstraction" in most cases. E.g.:
fed->fed_net_abstraction -> fed->net write_to_net_abstraction -> write_to_net write_to_net_abstraction_fail_on_error -> write_to_net_fail_on_error read_from_net_abstraction_fail_on_error -> read_from_net_fail_on_errorHello

Hello Prof. Lee, thank you for your comment. I shortened most of the names. The one thing I haven't changed was the name of the type, net_abstraction_t. Besides this, most of them are shortened.

net_abstractions_for_inbound_p2p_connections -> net_for_inbound_p2p_connections net_abstractions_for_outbound_p2p_connections -> net_for_outbound_p2p_connections lf_outbound_net_abstraction_mutex -> lf_outbound_net_mutex fed->fed_net_abstraction --> fed->net net_abstraction_to_RTI -> net_to_RTI RTI_net_abstraction_listener -> RTI_net_listener

@edwardalee Hello, Professor. Do you mind taking a look once again? I finally fixed all CI tests failing problems.

@edwardalee
Copy link
Contributor

Will do later this week

Copy link
Contributor

@edwardalee edwardalee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the delay.

This is really a significant piece of work. This time, I focused on net_abstraction.h and I have a number of questions. Some of those I've entered in the code. Overall, this looks like a good first step, but the abstraction still looks pretty "leaky" to me. There are a lot of specifics that leak into the abstraction, like arguments that are there only because the RTI and the federates handle error conditions differently. It seems like this abstraction should not be making any distinction between RTI, federate, or any other user of the abstraction. Let the caller handle errors however they like.

Most importantly, net_abstraction.h needs some documentation on how to use it.
Writing such documentation might reveal some weaknesses in the abstraction design.

What you currently do is this, for example:

  // Create a network abstraction.
  net_abstraction_t net = initialize_net();
  // Set the received host name and port to the network abstraction.
  set_server_port(net, uport);
  set_server_hostname(net, hostname);
  // Create the client network abstraction.
  create_client(net);
  if (connect_to_net(net) < 0) {
    lf_print_error_and_exit("Failed to connect to federate.");
  }

(there is nearly identical code in connect_to_RTI)

What this is really doing is this:

  // Allocate a network abstraction.
  net_abstraction_t net = initialize_net();
  // Set the received host name and port to the network abstraction.
  ((socket_priv_t*)net)->server_port = uport;
  ((socket_priv_t*)net)->server_hostname = hostname;
  // Create the client network abstraction.
  create_client(net);
  if (connect_to_net(net) < 0) {
    lf_print_error_and_exit("Failed to connect to federate.");
  }

It is confusing to me to initialize, then initialize some more, then create, then connect.
Also, create_client is always followed by connect_to_net, so why are these two different functions? Maybe it be better to do something like this:

  socket_connection_parameters_t params;
  params.type = "TCP";
  params.port = uport;
  params.server_hostname = hostname;
  net_abstraction_t net = connect_to_net((net_params_t*)&params);
  if (net == NULL) {
    lf_print_error_and_exit("Failed to connect to federate.");
  }

The base type net_params_t would have whatever fields are needed by all uses, such as the type.

Better yet, this could be abstracted for TCP usage, so the user just does this:

  net_abstraction_t net = connect_via_TCP(hostname, uport);
  if (net == NULL) {
    lf_print_error_and_exit("Failed to connect to federate.");
  }

This function would then do what the previous code does.
The function connect_via_TCP would not be part of your API, hence not declared in net_abstraction.h.

More questions:

Is get_socket_priv_t really necessary? All it does is a cast, and it seems to obscure the code. It doesn't even avoid the segfault if the argument is NULL.

It's not clear what check_net_closed will do if called on a network interface that has not been opened. Maybe a better function would be is_net_open or is_connection_open?

Regarding MSG_TYPE_ADDRESS_ADVERTISEMENT and MSG_TYPE_ADDRESS_QUERY, perhaps in a future PR these could also be abstracted. The protocol could be augmented (or just changed) to support an advertisement that could be a simple string message, to be interpreted appropriately by the client. This could in fact allow for adaptable connections. E.g., the advertisement could be a string like: "TCP:my_ip_address:my_port"

For now, I don't think get_my_port or get_server_port or get_ip_addr or get_server_hostname should be part of this abstraction. The users of these functions already know what struct to cast to. Same for all the corresponding set functions.
If you want functions, make them static functions in rti_remote.c.

Comment on lines +319 to +326
// If use UDP, apply a jitter attenuator to the estimated clock error to prevent
// large jumps in the underlying clock.
// Note that estimated_clock_error is calculated using lf_time_physical() which includes
// the clock sync adjustment.
// Use of TCP socket means we are in the startup phase, so
// rather than adjust the clock offset, we simply set it to the
// estimated error.
adjustment = (socket_type == UDP) ? estimated_clock_error / _LF_CLOCK_SYNC_ATTENUATION : estimated_clock_error;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change does not change the original logic in any way. Maybe revert to the original?

@@ -1 +1 @@
master
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will need to point back to master when that merge happens.

Comment on lines +35 to +36
* Create a network abstraction server. This is such as a server socket which accepts connections.
* However this is only the creation of the server network abstraction.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Create a network abstraction server. This is such as a server socket which accepts connections.
* However this is only the creation of the server network abstraction.
* Create a network abstraction server. For example, this might be a server socket that accepts connections.


#include "socket_common.h"

typedef void* net_abstraction_t;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is unclear whether an instance of net_abstraction_t is a single connection or a collection of connections of a particular type. Perhaps clarify this with a comment, like this:

/**
 * Pointer to whatever data structure is used to maintain the state of a network connection or service.
 * For example, this could point to an integer socket identifier if TCP is being used for network connections.
 */

* However this is only the creation of the server network abstraction.
*
* @param net_abs Server's network abstraction.
* @param increment_port_on_retry Whether to increment the port on retry if binding fails.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This argument suggests a "leaky abstraction". The argument only applies to socket connections and is only used by the RTI on its rather funky mechanism where it increments the port if the DEFAULT_PORT is not available. This functionality could be moved to rti_remote.c, which has this:

if (create_server(rti_remote->rti_net, true)) {
    lf_print_error_system_failure("RTI failed to create TCP server: %s.", strerror(errno));
    return -1;
  };

If the retry functionality were put here and removed from deep inside socket_common.c, then this API could be cleaned up by removing this argument.

Comment on lines +49 to +53
* The implementation should include three steps.
* 1. Initialize the network abstraction of the connected federate.
* 2. Wait for the incoming connection request. This should block until the connection is successfully accepted.
* 3. Save the information in the connected network abstraction, such as the address of the connected peer, for future
* querying address.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These steps are unclear. The first step implies there is already a connected federate.
Also, why is this even referencing a federate or RTI? Shouldn't this abstraction be just about establishing a connection between two entities, without regard to what they are?

Also, this comment should state that create_server must be called first.

Also, I think the rti_chan argument should be removed and its functionality put where it is actually needed, which is in federate.c where accept_net is called. Shouldn't this handle checking the RTI and retrying? This may require another function like is_connected(net_abstraction_t) in your net_abstraction.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants