fasterrcnn_resnet50_fpn Windows GPU tests failing on CUDA 11.6 #6589

@datumbox

🐛 Describe the bug

After CUDA 11.3 was removed and 11.6 became the default, the tests for fasterrcnn_resnet50_fpn started failing across all Python versions:

Traceback (most recent call last):
  File "C:\Users\circleci\project\test\test_models.py", line 775, in check_out
    _assert_expected(output, model_name, prec=prec)
  File "C:\Users\circleci\project\test\test_models.py", line 117, in _assert_expected
    torch.testing.assert_close(output, expected, rtol=rtol, atol=atol, check_dtype=False, check_device=False)
  File "C:\Users\circleci\project\env\lib\site-packages\torch\testing\_comparison.py", line 1342, in assert_close
    assert_equal(
  File "C:\Users\circleci\project\env\lib\site-packages\torch\testing\_comparison.py", line 1093, in assert_equal
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!

Mismatched elements: 61 / 80 (76.2%)
Greatest absolute difference: 182.7935028076172 at index (9, 1) (up to 0.01 allowed)
Greatest relative difference: inf at index (1, 0) (up to 0.01 allowed)

The failure occurred for item [0]['boxes']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\circleci\project\test\test_models.py", line 803, in test_detection_model
    full_validation &= check_out(out)
  File "C:\Users\circleci\project\test\test_models.py", line 783, in check_out
    torch.testing.assert_close(
  File "C:\Users\circleci\project\env\lib\site-packages\torch\testing\_comparison.py", line 1342, in assert_close
    assert_equal(
  File "C:\Users\circleci\project\env\lib\site-packages\torch\testing\_comparison.py", line 1093, in assert_equal
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!

Mismatched elements: 1 / 20 (5.0%)
Greatest absolute difference: 0.019545435905456543 at index (19,) (up to 0.01 allowed)
Greatest relative difference: 0.026511636142229424 at index (19,) (up to 0.01 allowed)

The failure occurs only on Windows. The Linux tests pass fine.
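
For reading the numbers in the log: as documented, torch.testing.assert_close treats an element as mismatched when |actual - expected| > atol + rtol * |expected|, and the relative difference is reported against the expected value, so a zero expected value with a non-zero actual shows up as inf. Below is a minimal pure-Python sketch of that rule, not the torch implementation. The helper names is_close and relative_difference are hypothetical, and the sample values are made up to resemble the score mismatch above; they are not taken from the test data.

```python
import math


def is_close(actual, expected, rtol=0.01, atol=0.01):
    # Documented closeness rule: |actual - expected| <= atol + rtol * |expected|
    return abs(actual - expected) <= atol + rtol * abs(expected)


def relative_difference(actual, expected):
    # Relative difference is measured against the expected value.
    # A zero expected value with a non-zero actual gives inf.
    diff = abs(actual - expected)
    if expected == 0:
        return math.inf if diff > 0 else 0.0
    return diff / abs(expected)


# Illustrative values only (assumed, not from the failing test).
expected_score, actual_score = 0.7372, 0.7567
print(is_close(actual_score, expected_score))
print(relative_difference(actual_score, expected_score))
print(relative_difference(5.0, 0.0))
```

With rtol = atol = 0.01, a score near 0.74 tolerates a difference of roughly 0.017, so an absolute difference of about 0.0195 fails by a small margin, whereas the box mismatch of over 180 is far outside any reasonable tolerance.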

Versions

TorchVision latest main branch

cc @atalman @malfet @ptrblck
