tweak stall detection scaling

Refactored to allow offline experimentation, and ended up changing the
allowedvariation (aka fudge factor) to 3. 10 seems too high, and 1.5 too low.

Scale earlier, so even if the first chunk takes less than the configured
time period, allowance is made that later chunks might transfer slower.
Decided to use the same allowedvariation to decide when to start
scaling.

Smoothed the scaling out.

Some examples:

ghci> upscale (BwRate 10 (Duration 60)) 25
BwRate 13 (Duration {durationSeconds = 75})
-- A small scaling upwards after 1/3rd the time. Not noticable.
ghci> upscale (BwRate 10 (Duration 60)) 60
BwRate 30 (Duration {durationSeconds = 180})
-- At the configured time, 3x scaling.
ghci> upscale (BwRate 10 (Duration 60)) 120
BwRate 60 (Duration {durationSeconds = 360})
-- A typical upscaling, here a 1 minute duration became 6 minutes
-- due to the first chunk taking 2 minutes to transfer.
ghci> upscale (BwRate 10 (Duration 60)) 600
BwRate 300 (Duration {durationSeconds = 1800})
-- Here the first chunk took 10 minutes to transfer, so it will
-- take 30 minutes to detect a stall.

Sponsored-by: Dartmouth College's DANDI project
This commit is contained in:
Joey Hess 2024-01-19 12:46:36 -04:00
parent e61af28acf
commit df35f70801
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
2 changed files with 34 additions and 16 deletions

View file

@ -23,7 +23,7 @@ import Data.Time.Clock
detectStalls :: (Monad m, MonadIO m) => Maybe StallDetection -> TVar (Maybe BytesProcessed) -> m () -> m ()
detectStalls Nothing _ _ = noop
detectStalls (Just StallDetectionDisabled) _ _ = noop
detectStalls (Just (StallDetection (BwRate minsz duration))) metervar onstall = do
detectStalls (Just (StallDetection bwrate@(BwRate _minsz duration))) metervar onstall = do
-- If the progress is being updated, but less frequently than
-- the specified duration, a stall would be incorrectly detected.
--
@ -32,25 +32,15 @@ detectStalls (Just (StallDetection (BwRate minsz duration))) metervar onstall =
-- size. In that case, progress is only updated after each chunk.
--
-- So, wait for the first update, and see how long it takes.
-- It's longer than the duration, upscale the duration and minsz
-- accordingly.
-- When it's longer than the duration (or close to it),
-- upscale the duration and minsz accordingly.
starttime <- liftIO getCurrentTime
v <- waitforfirstupdate =<< readMeterVar metervar
endtime <- liftIO getCurrentTime
let timepassed = floor (endtime `diffUTCTime` starttime)
let (scaledminsz, scaledduration) = upscale timepassed
let BwRate scaledminsz scaledduration = upscale bwrate timepassed
detectStalls' scaledminsz scaledduration metervar onstall v
where
upscale timepassed
| timepassed > dsecs =
let scale = scaleamount timepassed
in (minsz * scale, Duration (dsecs * scale))
| otherwise = (minsz, duration)
scaleamount timepassed = max 1 $ floor $
(fromIntegral timepassed / fromIntegral (max dsecs 1))
* scalefudgefactor
scalefudgefactor = 1.5 :: Double
dsecs = durationSeconds duration
minwaitsecs = Seconds $
min 60 (fromIntegral (durationSeconds duration))
waitforfirstupdate startval = do
@ -59,7 +49,6 @@ detectStalls (Just (StallDetection (BwRate minsz duration))) metervar onstall =
if v > startval
then return v
else waitforfirstupdate startval
detectStalls (Just ProbeStallDetection) metervar onstall = do
-- Only do stall detection once the progress is confirmed to be
-- consistently updating. After the first update, it needs to
@ -120,3 +109,32 @@ readMeterVar
-> m (Maybe ByteSize)
readMeterVar metervar = liftIO $ atomically $
fmap fromBytesProcessed <$> readTVar metervar
-- Scale up the minsz and duration to match the observed time that passed
-- between progress updates. This allows for some variation in the transfer
-- rate causing later progress updates to happen less frequently.
upscale :: BwRate -> Integer -> BwRate
upscale input@(BwRate minsz duration) timepassedsecs
| timepassedsecs > dsecs `div` allowedvariation = BwRate
(ceiling (fromIntegral minsz * scale))
(Duration (ceiling (fromIntegral dsecs * scale)))
| otherwise = input
where
scale = max 1 $
(fromIntegral timepassedsecs / fromIntegral (max dsecs 1))
* fromIntegral allowedvariation
dsecs = durationSeconds duration
-- Setting this too low will make normal bandwidth variations be
-- considered to be stalls, while setting it too high will make
-- stalls not be detected for much longer than the expected
-- duration.
--
-- For example, a BwRate of 20MB/1m, when the first progress
-- update takes 10m to arrive, is scaled to 600MB/30m. That 30m
-- is a reasonable since only 3 chunks get sent in that amount of
-- time at that rate. If allowedvariation = 10, that would
-- be 2000MB/100m, which seems much too long to wait to detect a
-- stall.
allowedvariation = 3

View file

@ -1559,7 +1559,7 @@ Remotes are configured using these settings in `.git/config`.
period, the stall detection will be automically adjusted to use a longer
time period. For example, if the first progress update comes after 10
minutes, but annex.stalldetection is "1MB/1m", it will be treated as eg
"15MB/15m".
"30MB/30m".
Configuring stall detection can make git-annex use more resources. To be
able to cancel stalls, git-annex has to run transfers in separate