Skip to content

rclone: fix discard thread issues#92

Merged
ThomasWaldmann merged 5 commits intoborgbackup:masterfrom
alexandru-bagu:fix-rclone-hang
Feb 4, 2025
Merged

rclone: fix discard thread issues#92
ThomasWaldmann merged 5 commits intoborgbackup:masterfrom
alexandru-bagu:fix-rclone-hang

Conversation

@alexandru-bagu
Copy link
Contributor

This PR fixes two issues:

  1. because the while in the discard thread was not checking the value of self.process it would cause an exception to be raised in a thread where no exception handling is present
  2. on Cygwin the daemon thread would sometimes not start when thread.start() is called and borg would hang. unsure if it happens on linux though I doubt it.
  • keep the discard thread alive only while we have a valid self.process
  • mark thread as non daemon - this would cause borg to hang if the rclone process is not indeed closed which should not happen anyway

@ThomasWaldmann
Copy link
Member

ThomasWaldmann commented Jan 10, 2025

@alexandru-bagu so you practically tested this on cygwin? Thanks for the PR!

@alexandru-bagu
Copy link
Contributor Author

Yes, I did more than ~30 tests on Cygwin.
Without this PR borg would randomly hang at thread.start. Once I removed daemon=True, borg would not hang and would work correctly all the time.

@alexandru-bagu
Copy link
Contributor Author

I actually did a few more tests and it looks like borg is still hanging on thread.start() even with daemon=False. What is odd is that when the debug argument is provided it does not hang...

@alexandru-bagu
Copy link
Contributor Author

For whatever random reason, when starting a second thread using threading.thread.start() the 2nd would hang. Oddly enough, when --debug is used it does not hang which means it's got to be something with a process pipe in cygwin.
To make it work I decided to bypass piping the rclone's stderr output but that meant I needed to know the port beforehand and this introduces a race condition.
To make sure that rclone ends up running (if at all) with the correct port, I am selecting an open port first, providing that to rclone and then doing a noop request to make sure rclone is up and that it responds accordingly.
Additionally I added an environment variable to override the rclone binary name as in Cygwin I had issues where I would need to override the PATH environment variable to help it find the binary.

I will test this on Linux to make sure no bugs are introduced. My only concern (when comparing Windows with Linux) is that in Linux I know one can get an error with a port being in used because Linux keeps ports for a while even when the socket is closed. I attempted to fix this by using s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1) but I have yet to test it.

@alexandru-bagu alexandru-bagu force-pushed the fix-rclone-hang branch 2 times, most recently from 9b2d4b8 to df7a25f Compare January 14, 2025 18:10
@alexandru-bagu
Copy link
Contributor Author

I tested on a Oracle Linux 8.10 the changes and had to make one adjustment. On Windows urllib3 has a bigger timeout for starting a http request whilst on Linux it would instantly fail because rclone was not yet listening to the given port. As such I am waiting for the port to become available before trying the noop request. This version seems to work fine for both OS.
Additionally updated the README.rst to include information regarding the RCLONE_BINARY variable.

… on relying on rclone to pick one; add a way to specify the path to the rclone binary for custom installations
@alexandru-bagu
Copy link
Contributor Author

alexandru-bagu commented Jan 15, 2025

I opted not to catch an error now for self.noop("noop") as that now suggests there's an issue with rclone which could not be handled by _rpc's retry mechanism.

Copy link
Member

@ThomasWaldmann ThomasWaldmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nitpicks.

@ThomasWaldmann
Copy link
Member

@ncw Can you have a look, please?

@alexandru-bagu
Copy link
Contributor Author

alexandru-bagu commented Jan 19, 2025

I am actually wondering if it wouldn't be a good idea just to add an environment variable with a port to be used for rclone. If for some reason the port is already in use, then the user can decide whether to use another one or not. That way we can also get rid of the while loop.
Additionally, maybe being able to provide parameters to rclone wouldn't be a bad idea, something like a RCLONE_RC_EXTRA_ARGS. In my tests I found that some arguments might be necessary for optimizations.
@ncw Can you please chime in?

Copy link
Contributor

@ncw ncw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks reasonable to me.

If duplicating the find an open port code in Python lets us lose a thread then that is probably a good tradeoff as Python and threads aren't my favourite sandwich filling.

Though I'll note that the detection of an open port is racy - there is opportunity for another process to take it in between python closing it and rclone opening it. I think it is unlikely though.

I probably wouldn't bother with a variable to set a port number - that will break as soon as the user runs two copies of rclone.

@alexandru-bagu
Copy link
Contributor Author

Though I'll note that the detection of an open port is racy - there is opportunity for another process to take it in between python closing it and rclone opening it. I think it is unlikely though.

I considered this and I did a noop check. If that fails another port is considered. This will ensure that rclone or a software that implements rclone rc protocol would respond with a valid response.

Copy link
Member

@ThomasWaldmann ThomasWaldmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Tell me when ready, so I can merge it.

@alexandru-bagu
Copy link
Contributor Author

alexandru-bagu commented Jan 26, 2025

I did tests with a thread-pool with long-lived threads that would be kept until the program exits and there is room for concurrency improvements but I would leave this in another PR.
I think the parallelism should be handled either on a higher level or specific requirements/contracts should be set before implementing it.
I don't plan on adding anything else to this PR.

@alexandru-bagu
Copy link
Contributor Author

While developing an s3 backend for borgstore, I found that the exception thrown in get_rclone_backend would prevent my backend from being considered. Updated the code to log those errors instead and return None.

@alexandru-bagu
Copy link
Contributor Author

I moved the regex check before raising the exceptions. I believe we should still raise exceptions if rclone is required but is not available or is too old.

Copy link
Member

@ThomasWaldmann ThomasWaldmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@ThomasWaldmann ThomasWaldmann merged commit 16392a8 into borgbackup:master Feb 4, 2025
7 checks passed
@ThomasWaldmann
Copy link
Member

@alexandru-bagu Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants