Engine Threads >1 "Failed to connect to remote host" #49

@GeorgKreuzmayr

Hello everyone,

When I set engine_threads to any value greater than 1, the client fails to connect to the server.

With engine_threads=2, the error occurs only occasionally, i.e. not for every port combination. With engine_threads=16, it occurs more often: it failed on every port combination I tried.

The error message on the client is:

ubuntu@ip-172-31-32-21:~$ ${MSG_GEN} --local_ip 172.31.32.121 --remote_ip 172.31.32.120 --msg_size 64 --msg_window 32
I20241103 18:50:38.682282     1 main.cc:332] Starting in client mode, request size 64
Checking for file descriptor...
Got a file descriptor!
ERROR: Failed to dequeue response from control queue.
F20241103 18:50:49.975369     1 main.cc:346] Check failed: ret == 0 Failed to connect to remote host. machnet_connect() error: Unknown error -1
*** Check failure stack trace: ***
    @     0x7fa3d8ce3f03  google::LogMessage::Fail()
    @     0x7fa3d8ce793c  google::LogMessage::SendToLog()
    @     0x7fa3d8ce39e7  google::LogMessage::Flush()
    @     0x7fa3d8ce509f  google::LogMessageFatal::~LogMessageFatal()
    @     0x562d0c932a28  main
    @     0x7fa3d8866d90  (unknown)

The server runs on another EC2 instance, started with this command:

ubuntu@ip-172-31-32-20:~$ ${MSG_GEN} --local_ip 172.31.32.120 --msg_size 64 

By contrast, with engine_threads=1 the connection succeeds:

ubuntu@ip-172-31-32-21:~$ ${MSG_GEN} --local_ip 172.31.32.121 --remote_ip 172.31.32.120 --msg_size 64 --msg_window 32
I20241103 18:06:00.837787     1 main.cc:332] Starting in client mode, request size 64
Checking for file descriptor...
Got a file descriptor!
I20241103 18:06:03.949545     1 main.cc:350] [CONNECTED] [172.31.32.121:1024 <-> 172.31.32.120:888]
I20241103 18:06:03.972815     7 main.cc:294] Client Loop: Starting.
TX/RX (msg/sec, Gbps): (0.0K/0.0K, 0.000/0.000). RTT (p50/99/99.9 us): 144/144/144
TX/RX (msg/sec, Gbps): (220.0K/220.0K, 0.113/0.113). RTT (p50/99/99.9 us): 143/177/195
TX/RX (msg/sec, Gbps): (220.0K/220.0K, 0.113/0.113). RTT (p50/99/99.9 us): 143/177/194
TX/RX (msg/sec, Gbps): (217.4K/217.4K, 0.111/0.111). RTT (p50/99/99.9 us): 143/179/543
TX/RX (msg/sec, Gbps): (220.0K/220.0K, 0.113/0.113). RTT (p50/99/99.9 us): 143/177/193
TX/RX (msg/sec, Gbps): (220.2K/220.2K, 0.113/0.113). RTT (p50/99/99.9 us): 143/177/190
TX/RX (msg/sec, Gbps): (220.1K/220.1K, 0.113/0.113). RTT (p50/99/99.9 us): 143/177/191
TX/RX (msg/sec, Gbps): (220.1K/220.1K, 0.113/0.113). RTT (p50/99/99.9 us): 143/177/189

In the commands above, MSG_GEN is defined as:

MSG_GEN="docker run -v /var/run/machnet:/var/run/machnet ghcr.io/microsoft/machnet/machnet:latest release_build/src/apps/msg_gen/msg_gen"

Setup: two EC2 instances of type c5n.18xlarge running kernel 6.5.0-1014-aws on Ubuntu 23.10.
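
To put numbers on "occasionally" versus "more often", the client logs from repeated runs could be tallied per engine_threads setting. The following is only a minimal sketch: it assumes each run's client output has been captured as a string, and it relies solely on the two log lines shown above (the `[CONNECTED]` line and the "Failed to connect to remote host" check failure). The helper names `classify`, `ports`, and `tally` are hypothetical and not part of Machnet or msg_gen.

```python
import re
from collections import Counter

# Matches the success line printed by msg_gen in the logs above, e.g.
# "[CONNECTED] [172.31.32.121:1024 <-> 172.31.32.120:888]"
CONNECTED = re.compile(r"\[CONNECTED\] \[(\S+):(\d+) <-> (\S+):(\d+)\]")
# Substring of the fatal check failure seen on the client.
FAILED = "Failed to connect to remote host"


def classify(log_text):
    """Return 'connected', 'failed', or 'unknown' for one client run's log."""
    if FAILED in log_text:
        return "failed"
    if CONNECTED.search(log_text):
        return "connected"
    return "unknown"


def ports(log_text):
    """Return (local_port, remote_port) from a successful run, else None."""
    match = CONNECTED.search(log_text)
    if match is None:
        return None
    return int(match.group(2)), int(match.group(4))


def tally(runs):
    """Count outcomes per setting.

    runs: iterable of (engine_threads, log_text) pairs.
    Returns a dict mapping engine_threads to a Counter of outcomes.
    """
    results = {}
    for threads, log_text in runs:
        results.setdefault(threads, Counter())[classify(log_text)] += 1
    return results


if __name__ == "__main__":
    # Sample lines copied from the logs in this report.
    ok = ("I20241103 18:06:03.949545     1 main.cc:350] [CONNECTED] "
          "[172.31.32.121:1024 <-> 172.31.32.120:888]")
    bad = ("F20241103 18:50:49.975369     1 main.cc:346] Check failed: ret == 0 "
           "Failed to connect to remote host. machnet_connect() error: Unknown error -1")
    print(tally([(1, ok), (2, bad)]))
    print(ports(ok))
```

Recording the port pair from each successful run alongside the failure counts would make it easier to see which port combinations connect at a given engine_threads value.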
