Describe the bug
After the aggregator is restarted, collaborators are unable to reconnect to it.
To Reproduce
Steps to reproduce the behavior:
- Start the experiment using the workspace "keras/tensorflow/mnist" for 30 rounds.
- Once a few rounds have completed, kill the aggregator process.
- Restart the aggregator process after a few seconds.
- The collaborator is not able to connect to the aggregator.
Failing pipeline run:
https://github.com/securefederatedai/openfl/actions/runs/15161448192/attempts/1?pr=1644
Logs.
collaborator1.log
collaborator2.log
aggregator.log
Error
[05:10:07] [connectionpool.py:868] WARNING Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fcc74b10af0>: Failed to establish a new connection: [Errno 111] Connection refused')': /experimental/v1/tasks/results
[05:10:07] [aggregator_client.py:415] ERROR Connection error: HTTPSConnectionPool(host='localhost', port=49955): Max retries exceeded with url: /experimental/v1/tasks/results (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fcc74b12920>: Failed to establish a new connection: [Errno 111] Connection refused'))
[05:10:07] [aggregator_client.py:417] ERROR Connection error details: HTTPSConnectionPool(host='localhost', port=49955): Max retries exceeded with url: /experimental/v1/tasks/results (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fcc74b12920>: Failed to establish a new connection: [Errno 111] Connection refused'))
[05:10:07] [aggregator_client.py:552] ERROR Failed to send task results for round 8: HTTPSConnectionPool(host='localhost', port=49955): Max retries exceeded with url: /experimental/v1/tasks/results (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fcc74b12920>: Failed to establish a new connection: [Errno 111] Connection refused'))
[05:10:07] [aggregator_client.py:553] ERROR Error type: ConnectionError
[05:10:07] [aggregator_client.py:554] ERROR Request headers were: {'Receiver': 'aggregator_plan.yaml_3a116919', 'Federation-UUID': 'plan.yaml_3a116919', 'Single-Col-Cert-CN': '', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'DENY', 'X-XSS-Protection': '1; mode=block', 'Sender': 'collaborator1', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains', 'Content-Type': 'application/x-protobuf-stream', 'Content-Length': '1723239'}
Expected behavior
The collaborator must be able to reconnect to the aggregator after the restart, and the experiment must continue smoothly.
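As an illustration of the expected behavior, here is a minimal sketch of retrying a send with exponential backoff, so that a short aggregator outage does not fail the round. This is not OpenFL's implementation: `send_with_retry`, its parameters, and the `send` callable are hypothetical names used for illustration only. The log above shows `Retry(total=0, ...)`, which suggests no retries were left when the connection was refused.

```python
import time


def send_with_retry(send, max_attempts=5, base_delay=1.0, max_delay=30.0,
                    retry_on=(ConnectionError,), sleep=time.sleep):
    """Call send() until it succeeds or max_attempts is exhausted.

    Hypothetical helper, not part of OpenFL. `send` is any zero-argument
    callable that performs the request. `retry_on` lists the exception
    types treated as transient; with the `requests` library one would
    pass `requests.exceptions.ConnectionError` here, since it is not a
    subclass of the builtin ConnectionError.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except retry_on:
            if attempt == max_attempts:
                # Out of attempts: surface the original error.
                raise
            # Wait before the next attempt, doubling the delay each time
            # but never exceeding max_delay.
            sleep(min(delay, max_delay))
            delay *= 2
```

A collaborator-side caller could wrap the call that posts task results in `send_with_retry`, choosing `max_attempts` and `max_delay` so the total wait covers the time the aggregator needs to restart.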
Additional Information
The test can be run locally as follows:
- Clone the openfl repository (git clone).
- Install it (pip install .).
- Install everything listed in test-requirements.txt.
python -m pytest -s tests/end_to_end/test_suites/tr_resiliency_tests.py -m task_runner_basic --model_name torch/mnist --tr_rest_api --num_rounds 30
If testing after #1644 is merged, use --tr_rest_protocol instead of --tr_rest_api:
python -m pytest -s tests/end_to_end/test_suites/tr_resiliency_tests.py -m task_runner_basic --model_name torch/mnist --tr_rest_protocol --num_rounds 30