automatically adjust stall detection period

Improve annex.stalldetection to handle remotes that update progress less
frequently than the configured time period.

In particular, this makes remotes that don't report progress but are
chunked work when transferring a single chunk takes longer than the
specified time period.

Any remotes that just have very low update granularity would also be
handled by this.

The change to Remote.Helper.Chunked avoids an extra progress update when
resuming an interrupted upload. In that case, the code saw first Nothing
and then Just the already transferred number of bytes, which defeated this
new heuristic. This change will mean that, when resuming an interrupted
upload to a chunked remote that does not do its own progress reporting, the
progress display does not start out displaying the amount sent so far,
until after the first chunk is sent. This behavior change does not seem
like a major problem.

About the scalefudgefactor, it seems reasonable to expect subsequent chunks
to take no more than 1.5 times as long as the first chunk to transfer.
Could set it to 1, but then any chunk taking a little longer would be
treated as a stall. 2 also seems a likely value. Even 10 might be fine?

Sponsored-by: Dartmouth College's DANDI project
Joey Hess 2024-01-18 17:11:56 -04:00
parent 8f655f7953
commit c2634e7df2
7 changed files with 153 additions and 19 deletions

@ -0,0 +1,40 @@
[[!comment format=mdwn
username="joey"
subject="""comment 1"""
date="2024-01-18T17:18:07Z"
content="""
It looks like you must have annex.stalldetection (or the per-remote config)
set. git-annex does not behave this way without that configuration.
What is it set to?

You are probably right in that it involves rclone special remote not
reporting transfer progress back to git-annex.

Normally, when a special remote does not do progress reporting,
git-annex does not do any stall detection, because there must have been
at least some previous progress update in order for it to detect a stall.

But when chunking is enabled (as it was in your case with 1 gb chunks),
git-annex itself updates the progress after each chunk. When the special
remote does not do any progress reporting, and chunk size is large, that
means that the progress will be updated very infrequently.

So for example, if it takes 2 minutes to upload a chunk, and you had
annex.stalldetection set to eg "10mb/1m", then in a chunk after the 1st one,
git-annex would wake up after 1 minute, see that no data seemed to have
been sent, and conclude there was a stall. You would need to change the
time period in the config to something less granular eg "100mb/10m"
to avoid that.

This might be a documentation problem, it may not be clear to the user
that "10mb/1m" is any different than "100mb/10m". And finding a value that
works depends on knowing details of how often the progress gets updated
for a given remote.

But, your transcript shows that the stall was detected on chunk 296.
(git-annex's chunk, rclone is doing its own chunking to dropbox)
So the annex.stalldetection configuration must have been one that
worked most of the time, for it to transfer the previous 295 chunks
without a stall having been detected. Unless this was a resume after
previous command(s) had uploaded those other chunks.
"""]]
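
The false stall described in comment 1 can be reproduced with a small simulation. This is only a sketch in Python, not git-annex's code: it assumes progress is reported solely at the end of each chunk, that the checker starts after the first update and wakes once per configured period, and that "mb" means 10^6 bytes. The function names are hypothetical.

```python
def bytes_reported(t, chunk_seconds, chunk_bytes, total_chunks):
    """Bytes the progress meter shows at time t (updated only per chunk)."""
    done = min(t // chunk_seconds, total_chunks)
    return done * chunk_bytes


def first_false_stall(chunk_seconds, chunk_bytes, total_chunks,
                      min_bytes, period_seconds):
    """Return the time (seconds) at which a naive checker reports a stall,
    or None if the transfer finishes without one.

    Assumption: the checker only starts once the first progress update
    has been seen, then wakes up every period_seconds.
    """
    total = chunk_bytes * total_chunks
    t = chunk_seconds  # time of the first progress update
    last = bytes_reported(t, chunk_seconds, chunk_bytes, total_chunks)
    while last < total:
        t += period_seconds
        now = bytes_reported(t, chunk_seconds, chunk_bytes, total_chunks)
        if now < total and now - last < min_bytes:
            return t  # the meter did not move enough within the period
        last = now
    return None


# 1 gb chunks that each take 2 minutes, five chunks in total.
print(first_false_stall(120, 10**9, 5, 10 * 10**6, 60))    # "10mb/1m"
print(first_false_stall(120, 10**9, 5, 100 * 10**6, 600))  # "100mb/10m"
```

Under these assumptions the "10mb/1m" setting trips one minute into the second chunk, while "100mb/10m" never does, because several chunks complete inside each 10 minute window.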

@ -0,0 +1,22 @@
[[!comment format=mdwn
username="joey"
subject="""comment 2"""
date="2024-01-18T17:50:21Z"
content="""
I think that what git-annex could do is detect when progress updates are
happening with too low a granularity for the annex.stalldetection
configuration.

When waiting for the first progress update, it can keep track of how much time
has elapsed. If annex.stalldetection is "10mb/2m" and it took 20 minutes to
get the first progress update, the granularity is clearly too low.

And then it could disregard the configuration, or suggest a better
configuration value, or adjust what it's expecting to match the
observed granularity.

(The stall detection auto-prober uses a similar heuristic to that already.
It observes the granularity and only if it's sufficiently low (an update
every 30 seconds or less) does it assume that 60 seconds without an update
may be a stall.)
"""]]
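
The check proposed in comment 2 amounts to comparing the time until the first progress update with the configured period. The following Python sketch illustrates that comparison only. `parse_rate` is a deliberately simplified, hypothetical parser (it is not git-annex's own, and it assumes SI units and a single size/duration pair), and `updates_too_infrequent` is likewise a name made up for illustration.

```python
import re

# Assumed unit tables; git-annex accepts more forms than this sketch does.
SIZE_UNITS = {"kb": 10**3, "mb": 10**6, "gb": 10**9}
TIME_UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_rate(value):
    """Parse a value shaped like "10mb/2m" into (bytes, seconds)."""
    m = re.fullmatch(r"([0-9.]+)(kb|mb|gb)/([0-9]+)(s|m|h)",
                     value.strip().lower())
    if m is None:
        raise ValueError(f"unsupported stall detection value: {value!r}")
    size, size_unit, count, time_unit = m.groups()
    min_bytes = round(float(size) * SIZE_UNITS[size_unit])
    period = int(count) * TIME_UNITS[time_unit]
    return min_bytes, period


def updates_too_infrequent(value, seconds_to_first_update):
    """True when the first progress update took longer than the period,
    so the configured value cannot be applied as-is."""
    _, period = parse_rate(value)
    return seconds_to_first_update > period


# The example from the comment: "10mb/2m", first update after 20 minutes.
print(updates_too_infrequent("10mb/2m", 20 * 60))
```

What to do once the check fires (disregard the setting, suggest a value, or rescale) is a separate decision, as the comment notes.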

@ -0,0 +1,20 @@
[[!comment format=mdwn
username="joey"
subject="""comment 3"""
date="2024-01-18T19:04:26Z"
content="""
To reproduce this (more or less), I modified the example.sh external
special remote to sleep for 10 seconds before each key store.

Set up a remote with chunk=1mb, and annex.stalldetection = "0.001mb/1s".
Uploading a 100 mb file, a stall is detected after the first chunk is
uploaded. As expected, since 1 second passed with no update.

When I resume the upload, the second chunk is uploaded and then a stall is
detected on the third. And so on.

I've implemented dynamic granularity scaling now, and in this test case, it notices
it took 11 seconds for the first chunk, and behaves as if it were
configured with annex.stalldetection of "0.022mb/22s". Which keeps it from
detecting a stall.
"""]]
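
The arithmetic behind the "0.022mb/22s" figure in comment 3 can be sketched as follows. This is an illustration in Python, not the Haskell code from this commit. The function name, the rule of leaving the setting untouched when the scale would be 1 or less, and the rounding up are all assumptions. A fudge factor of 2 is used by default only because it reproduces the numbers in the comment; the commit message discusses 1.5 as well.

```python
import math


def scale_stall_detection(min_bytes, period_seconds,
                          seconds_to_first_update, fudge_factor=2.0):
    """Scale a (bytes, seconds) stall detection setting to the observed
    time until the first progress update.

    Assumption: both parts are multiplied by the same scale, so the
    configured bandwidth rate is preserved.
    """
    scale = (seconds_to_first_update / period_seconds) * fudge_factor
    if scale <= 1:
        # Updates arrive often enough; keep the configured value.
        return min_bytes, period_seconds
    return math.ceil(min_bytes * scale), math.ceil(period_seconds * scale)


# "0.001mb/1s" is 1000 bytes per 1 second; the first chunk took 11 seconds.
print(scale_stall_detection(1000, 1, 11))       # (22000, 22)
print(scale_stall_detection(1000, 1, 11, 1.5))  # (16500, 17)
```

A larger factor tolerates slower later chunks but delays detection of a real stall by the same proportion, which is the trade-off the commit message weighs between 1, 1.5, 2 and 10.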