Resiliency issue if "rest" transport protocol is used #1646

@payalcha

Describe the bug
After the aggregator is restarted, collaborators are not able to reconnect to it.

To Reproduce
Steps to reproduce the behavior:

  1. Start the experiment using workspace "keras/tensorflow/mnist" for 30 rounds.
  2. Once a few rounds have completed, kill the aggregator process.
  3. Restart the aggregator process after a few seconds.
  4. The collaborator is not able to connect to the aggregator.
    Failure in pipeline:
    https://github.com/securefederatedai/openfl/actions/runs/15161448192/attempts/1?pr=1644
    Logs:
    collaborator1.log
    collaborator2.log
    aggregator.log

Error

[05:10:07] [connectionpool.py:868] WARNING Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fcc74b10af0>: Failed to establish a new connection: [Errno 111] Connection refused')': /experimental/v1/tasks/results
[05:10:07] [aggregator_client.py:415] ERROR Connection error: HTTPSConnectionPool(host='localhost', port=49955): Max retries exceeded with url: /experimental/v1/tasks/results (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fcc74b12920>: Failed to establish a new connection: [Errno 111] Connection refused'))
[05:10:07] [aggregator_client.py:417] ERROR Connection error details: HTTPSConnectionPool(host='localhost', port=49955): Max retries exceeded with url: /experimental/v1/tasks/results (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fcc74b12920>: Failed to establish a new connection: [Errno 111] Connection refused'))
[05:10:07] [aggregator_client.py:552] ERROR Failed to send task results for round 8: HTTPSConnectionPool(host='localhost', port=49955): Max retries exceeded with url: /experimental/v1/tasks/results (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fcc74b12920>: Failed to establish a new connection: [Errno 111] Connection refused'))
[05:10:07] [aggregator_client.py:553] ERROR Error type: ConnectionError
[05:10:07] [aggregator_client.py:554] ERROR Request headers were: {'Receiver': 'aggregator_plan.yaml_3a116919', 'Federation-UUID': 'plan.yaml_3a116919', 'Single-Col-Cert-CN': '', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'DENY', 'X-XSS-Protection': '1; mode=block', 'Sender': 'collaborator1', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains', 'Content-Type': 'application/x-protobuf-stream', 'Content-Length': '1723239'}

Expected behavior
The collaborator must be able to reconnect to the aggregator after it restarts, and the experiment must continue to run smoothly.
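For illustration, the expected behavior amounts to the collaborator retrying with a growing delay while the aggregator is down, instead of giving up after the first refused connection. The following is a minimal sketch of that idea using only the standard library; it is not OpenFL's actual client code, and `call_with_backoff` and all of its parameters are hypothetical names. It retries on the built-in `ConnectionError` by default. The `ConnectionError` reported in the log above most likely comes from the HTTP library rather than the built-in class, so a real client would pass that library's exception type via `retry_on` (for example `requests.exceptions.ConnectionError`, assuming the client is built on `requests`).

```python
import time


def call_with_backoff(send, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      retry_on=(ConnectionError,), sleep=time.sleep):
    """Call send() until it succeeds or max_attempts is exhausted.

    Hypothetical helper, not part of OpenFL. Only exceptions listed in
    retry_on trigger a retry; anything else propagates immediately.
    The wait before retry number k is base_delay * 2**k, capped at max_delay.
    """
    for attempt in range(max_attempts):
        try:
            return send()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(min(base_delay * 2 ** attempt, max_delay))


if __name__ == "__main__":
    # Simulate an aggregator that refuses the first two connection attempts.
    state = {"calls": 0}

    def send_results():
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionRefusedError(111, "Connection refused")
        return "accepted"

    # Short delays so the demo finishes quickly.
    print(call_with_backoff(send_results, base_delay=0.01))
```

The total time covered by the retries should exceed the expected aggregator restart time; with the defaults above the collaborator waits 1 + 2 + 4 + 8 = 15 seconds across five attempts before giving up.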

Additional Information
The test can be run locally as follows:

  1. Clone the openfl repository (git clone).
  2. Install it (pip install .).
  3. Install everything in test-requirements.txt.
  4. Run the resiliency test:
    python -m pytest -s tests/end_to_end/test_suites/tr_resiliency_tests.py -m task_runner_basic --model_name torch/mnist --tr_rest_api --num_rounds 30

If testing after #1644 is merged, use --tr_rest_protocol instead of --tr_rest_api:
python -m pytest -s tests/end_to_end/test_suites/tr_resiliency_tests.py -m task_runner_basic --model_name torch/mnist --tr_rest_protocol --num_rounds 30
