automatically adjust stall detection period

Improve annex.stalldetection to handle remotes that update progress less frequently than the configured time period.

In particular, this makes remotes that don't report progress but are chunked work when transferring a single chunk takes longer than the specified time period.

Any remotes that just have very low update granularity would also be handled by this.

The change to Remote.Helper.Chunked avoids an extra progress update when resuming an interrupted upload. In that case, the code saw first Nothing and then Just the already transferred number of bytes, which defeated this new heuristic.

This change will mean that, when resuming an interrupted upload to a chunked remote that does not do its own progress reporting, the progress display does not start out displaying the amount sent so far, until after the first chunk is sent. This behavior change does not seem like a major problem.

About the scalefudgefactor, it seems reasonable to expect subsequent chunks to take no more than 1.5 times as long as the first chunk to transfer. Could set it to 1, but then any chunk taking a little longer would be treated as a stall. 2 also seems a likely value. Even 10 might be fine?

Sponsored-by: Dartmouth College's DANDI project
2024-01-18 17:11:56 -04:00
commit c2634e7df2
parent 8f655f7953
7 changed files with 153 additions and 19 deletions
--- a/doc/bugs/too_aggressive_in_claiming_34Transfer_stalled3463/comment_1_db83f6a38cae36da89f0ab4ef83021d8._comment
+++ b/doc/bugs/too_aggressive_in_claiming_34Transfer_stalled3463/comment_1_db83f6a38cae36da89f0ab4ef83021d8._comment
@@ -0,0 +1,40 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 1"""
+ date="2024-01-18T17:18:07Z"
+ content="""
+It looks like you must have annex.stalldetection (or the per-remote config)
+set. git-annex does not behave this way without that configuration.
+What is it set to?
+
+You are probably right that it involves the rclone special remote not
+reporting transfer progress back to git-annex.
+
+Normally, when a special remote does not do progress reporting,
+git-annex does not do any stall detection, because there must have been
+at least some previous progress update in order for it to detect a stall.
+
+But when chunking is enabled (as it was in your case with 1 gb chunks),
+git-annex itself updates the progress after each chunk. When the special
+remote does not do any progress reporting, and chunk size is large, that
+means that the progress will be updated very infrequently. 
+
+So for example, if it takes 2 minutes to upload a chunk, and you had
+annex.stalldetection set to eg "10mb/1m", then in a chunk after the 1st one,
+git-annex would wake up after 1 minute, see that no data seemed to have
+been sent, and conclude there was a stall. You would need to change the
+time period in the config to something less granular eg "100mb/10m"
+to avoid that.
+
+This might be a documentation problem: it may not be clear to the user
+that "10mb/1m" is any different than "100mb/10m". And finding a value that
+works depends on knowing details of how often the progress gets updated
+for a given remote.
+
+But your transcript shows that the stall was detected on chunk 296.
+(git-annex's chunk; rclone is doing its own chunking to dropbox.)
+So the annex.stalldetection configuration must have been one that
+worked most of the time, for it to transfer the previous 295 chunks
+without a stall having been detected. Unless this was a resume after
+previous command(s) had uploaded those other chunks.
+"""]]
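The false stall described in comment 1 can be illustrated with a toy model. This Python sketch is not git-annex code: the function name, the polling model, and the decimal byte sizes (1 gb taken as 10^9 bytes, 10 mb as 10^7 bytes) are assumptions made for illustration. It assumes progress is reported only when a chunk completes, and that the detector starts watching after the first update.

```python
def first_false_stall(chunk_seconds, chunk_bytes, min_bytes, period_seconds, total_chunks):
    """Return the time (in seconds) at which a naive detector reports a stall,
    or None if the whole transfer completes without one.

    Hypothetical model: the transfer never actually stalls, but progress is
    only visible at chunk boundaries.
    """
    def reported(t):
        # Bytes visible to the detector at time t: completed chunks only.
        return min(t // chunk_seconds, total_chunks) * chunk_bytes

    # No stall detection is possible before the first progress update.
    t = chunk_seconds
    last = reported(t)
    end = chunk_seconds * total_chunks

    while t + period_seconds <= end:
        t += period_seconds
        now = reported(t)
        if now - last < min_bytes:
            return t  # looks like a stall, although data is still flowing
        last = now
    return None


# 2 minutes per 1 gb chunk, checked with "10mb/1m": false stall during chunk 2.
print(first_false_stall(120, 10**9, 10 * 10**6, 60, 20))
# The same transfer checked with "100mb/10m": no stall is reported.
print(first_false_stall(120, 10**9, 100 * 10**6, 600, 20))
```

In this model the two settings describe the same average rate, but only the longer period spans at least one chunk boundary per check, which is why it avoids the false positive.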
--- a/doc/bugs/too_aggressive_in_claiming_34Transfer_stalled3463/comment_2_2aeb065a257729e852055533aff04650._comment
+++ b/doc/bugs/too_aggressive_in_claiming_34Transfer_stalled3463/comment_2_2aeb065a257729e852055533aff04650._comment
@@ -0,0 +1,22 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 2"""
+ date="2024-01-18T17:50:21Z"
+ content="""
+I think that what git-annex could do is detect when progress updates are
+happening with too low a granularity for the annex.stalldetection
+configuration.
+
+When waiting for the first progress update, it can keep track of how much time
+has elapsed. If annex.stalldetection is "10mb/2m" and it took 20 minutes to
+get the first progress update, the granularity is clearly too low.
+
+And then it could disregard the configuration, or suggest a better
+configuration value, or adjust what it's expecting to match the
+observed granularity.
+
+(The stall detection auto-prober already uses a similar heuristic.
+It observes the granularity and only if updates are sufficiently frequent
+(an update at least every 30 seconds) does it assume that 60 seconds
+without an update may be a stall.)
+"""]]
--- a/doc/bugs/too_aggressive_in_claiming_34Transfer_stalled3463/comment_3_f8bd6a233d835bdc413bbf0127608431._comment
+++ b/doc/bugs/too_aggressive_in_claiming_34Transfer_stalled3463/comment_3_f8bd6a233d835bdc413bbf0127608431._comment
@@ -0,0 +1,20 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 3"""
+ date="2024-01-18T19:04:26Z"
+ content="""
+To reproduce this (more or less), I modified the example.sh external
+special remote to sleep for 10 seconds before each key store.
+Set up a remote with chunk=1mb, and annex.stalldetection = "0.001mb/1s".
+
+Uploading a 100 mb file, a stall is detected after the first chunk is
+uploaded. As expected, since 1 second passed with no update.
+
+When I resume the upload, the second chunk is uploaded and then a stall is
+detected on the third. And so on.
+
+I've implemented dynamic granularity scaling now, and in this test case, it notices
+it took 11 seconds for the first chunk, and behaves as if it were
+configured with annex.stalldetection of "0.022mb/22s". Which keeps it from
+detecting a stall.
+"""]]
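The scaling arithmetic in comment 3 can be sketched as follows. This is a minimal Python illustration, not the actual implementation: the function name `upscale`, its parameters, and the choice to round up are assumptions, and 0.001mb is taken as 1000 bytes. The fudge factor is left as a parameter, because the figures quoted in comment 3 correspond to an overall scale of 22 (11 seconds times 2), while the commit message discusses a value of 1.5.

```python
import math


def upscale(min_bytes, period_seconds, first_update_seconds, fudge_factor):
    """Scale a "min_bytes per period_seconds" stall threshold to match the
    observed time taken by the first progress update.

    Hypothetical sketch: if the first update arrived within the configured
    period, the configuration is returned unchanged.
    """
    if first_update_seconds <= period_seconds:
        return min_bytes, period_seconds
    # Stretch both size and duration by the same factor, so the required
    # average rate stays the same while the window grows.
    scale = (first_update_seconds / period_seconds) * fudge_factor
    return math.ceil(min_bytes * scale), math.ceil(period_seconds * scale)


# "0.001mb/1s" with a first chunk that took 11 seconds, fudge factor 2:
print(upscale(1000, 1, 11, 2))    # (22000, 22), i.e. "0.022mb/22s"
# The same observation with a fudge factor of 1.5:
print(upscale(1000, 1, 11, 1.5))  # (16500, 17)
```

Because size and duration are multiplied by the same scale, the minimum transfer rate implied by the configuration is preserved (up to rounding); only the window over which it is measured changes.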