Increased the default annex.bloomaccuracy from 1000 to 10000000

This makes git annex unused use around 48 MB more memory than it did before,
but the massive increase in accuracy makes this worthwhile for all but the
smallest systems.

Also, I want to use the bloom filter for sync --all --content, to avoid
dropping files that the preferred content doesn't want, and 1/1000
false positives would be far too many in that use case, even if it were
acceptable for unused.
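
To make the false positive argument concrete, here is a minimal sketch of the textbook Bloom filter arithmetic. It is not git-annex's code: the function names are hypothetical, and the sizing formulas are the standard optimal ones, which may differ from what the bloom filter library actually allocates.

```haskell
-- Textbook Bloom filter arithmetic; a sketch, not git-annex's implementation.
-- All names here are hypothetical.

-- | Expected number of false positives when `lookups` keys that are NOT in
-- the filter are checked against a filter with a 1/accuracy error rate.
expectedFalsePositives :: Integer -> Integer -> Double
expectedFalsePositives accuracy lookups =
    fromIntegral lookups / fromIntegral accuracy

-- | Optimal number of bits for n keys at false positive rate p:
-- m = -n * ln p / (ln 2)^2
optimalBits :: Integer -> Double -> Integer
optimalBits n p =
    ceiling (negate (fromIntegral n) * log p / (log 2 ^ (2 :: Int)))

-- | Optimal number of hash functions for m bits and n keys:
-- k = (m / n) * ln 2, and at least 1.
optimalHashes :: Integer -> Integer -> Int
optimalHashes m n =
    max 1 (round (fromIntegral m / fromIntegral n * log 2 :: Double))
```

For example, checking a hypothetical 1000000 keys that are not in the filter gives about 1000 expected false positives at an accuracy of 1000, and about 0.1 at 10000000. Under the textbook formula the price is roughly 14.4 bits per key versus roughly 33.5 bits per key.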

Actual memory use numbers:

1000: 21.06user 3.42system 0:26.40elapsed 92%CPU (0avgtext+0avgdata 501552maxresident)k
1000000: 21.41user 3.55system 0:26.84elapsed 93%CPU (0avgtext+0avgdata 549496maxresident)k
10000000: 21.84user 3.52system 0:27.89elapsed 90%CPU (0avgtext+0avgdata 549920maxresident)k

Based on these numbers, 10 million seemed a better pick than 1 million.
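
The deltas behind that choice can be checked with a short sketch. The maxresident values (in KB, as reported by time) are copied from the output above; the names maxResidentKB and deltasKB are hypothetical.

```haskell
-- (annex.bloomaccuracy, maxresident in KB), copied from the time output above.
maxResidentKB :: [(Integer, Integer)]
maxResidentKB =
    [ (1000, 501552)
    , (1000000, 549496)
    , (10000000, 549920)
    ]

-- | Extra resident memory in KB of each setting, relative to the first row.
deltasKB :: [(Integer, Integer)] -> [(Integer, Integer)]
deltasKB [] = []
deltasKB rows@((_, base) : _) = [ (acc, kb - base) | (acc, kb) <- rows ]
```

Relative to 1000, the 1000000 setting adds 47944 KB and the 10000000 setting adds 48368 KB, so the last factor of ten in accuracy costs only 424 KB.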
Joey Hess 2015-06-16 17:58:15 -04:00
parent f7350b7c33
commit 8b74aec3ea
6 changed files with 76 additions and 56 deletions

@@ -16,7 +16,6 @@ import Data.Tuple
 import Data.Ord
 import Common.Annex
-import qualified Command.Unused
 import qualified Git
 import qualified Annex
 import qualified Remote
@@ -39,6 +38,8 @@ import Types.TrustLevel
 import Types.FileMatcher
 import qualified Limit
 import Messages.JSON (DualDisp(..))
+import Annex.BloomFilter
+import qualified Command.Unused
 -- a named computation that produces a statistic
 type Stat = StatState (Maybe (String, StatState String))
@@ -330,17 +331,17 @@ key_name k = simpleStat "key" $ pure $ key2file k
 bloom_info :: Stat
 bloom_info = simpleStat "bloom filter size" $ do
 	localkeys <- countKeys <$> cachedPresentData
-	capacity <- fromIntegral <$> lift Command.Unused.bloomCapacity
+	capacity <- fromIntegral <$> lift bloomCapacity
 	let note = aside $
 		if localkeys >= capacity
 		then "appears too small for this repository; adjust annex.bloomcapacity"
 		else showPercentage 1 (percentage capacity localkeys) ++ " full"
-	-- Two bloom filters are used at the same time, so double the size
-	-- of one.
+	-- Two bloom filters are used at the same time when running
+	-- git-annex unused, so double the size of one.
 	sizer <- lift mkSizer
 	size <- sizer memoryUnits False . (* 2) . fromIntegral . fst <$>
-		lift Command.Unused.bloomBitsHashes
+		lift bloomBitsHashes
 	return $ size ++ note
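
The fullness note that bloom_info builds can be illustrated in isolation. This is a simplified sketch: percentFull and fullnessNote are hypothetical stand-ins for git-annex's percentage and showPercentage helpers, and the exact output formatting is an assumption, not a claim about what git annex info prints.

```haskell
import Text.Printf (printf)

-- | Hypothetical stand-in: how full a filter with the given capacity is,
-- as a percentage of the number of locally present keys.
percentFull :: Integer -> Integer -> Double
percentFull capacity localkeys =
    100 * fromIntegral localkeys / fromIntegral capacity

-- | Mirrors the branch in bloom_info: warn once the key count reaches the
-- configured capacity, otherwise report how full the filter is.
-- The first guard also covers a capacity of 0, so no division by zero occurs
-- for non-negative key counts.
fullnessNote :: Integer -> Integer -> String
fullnessNote capacity localkeys
    | localkeys >= capacity =
        "appears too small for this repository; adjust annex.bloomcapacity"
    | otherwise = printf "%.1f%% full" (percentFull capacity localkeys)
```

With an assumed capacity of 500000 and 250000 local keys, fullnessNote reports the filter as half full; once the key count reaches the capacity it returns the warning instead.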