comments

parent 5891e10409
commit 8f508d4406

2 changed files with 56 additions and 0 deletions

@@ -0,0 +1,24 @@
[[!comment format=mdwn
 username="joey"
 subject="""comment 1"""
 date="2020-06-30T16:27:26Z"
 content="""
I am surprised by 30m, unless you have a very large number of files
or keys. Benchmarking in a repo with 100k files and keys, git-annex
sync --content took 7m, and with --all 14m. (both cold cache)

How many files and keys does git-annex info say are in your repository?

--all is slower because it does two passes, first over all files in the
current branch, and a second pass over all keys. (Necessary to handle
preferred content filename matching correctly.)
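
A rough sketch of that two-pass shape, only to illustrate the structure; this
is not git-annex's actual code, and the names, types, and example keys are
made up for the illustration:

    -- Illustrative Haskell sketch only; 'act' stands in for the per-key
    -- work (preferred content check, transfer, drop).
    type Key = String

    syncAll
      :: [(FilePath, Key)]                 -- files in the current branch
      -> [Key]                             -- every key in the repository
      -> (Maybe FilePath -> Key -> IO ())  -- per-key action; Just file when known
      -> IO ()
    syncAll branchFiles allKeys act = do
      -- Pass 1: files in the current branch. The filename is available,
      -- so preferred content expressions that match filenames work.
      mapM_ (\(f, k) -> act (Just f) k) branchFiles
      -- Pass 2: every key, with no filename to match against.
      mapM_ (act Nothing) allKeys

    main :: IO ()
    main = syncAll
      [("foo.txt", "SHA256E-s1--aaa")]
      ["SHA256E-s1--aaa", "SHA256E-s1--bbb"]
      (\mf k -> putStrLn (maybe "-" id mf ++ " " ++ k))

Note that keys already handled in pass 1 come around again in pass 2.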

git cat-file --batch-all-objects --unordered would not work, because
there would be no way to know if a particular blob was a current version of
the location log for a key, or an old version.
(For that matter, it doesn't even say what file the blob belongs to, so
it would have no idea what it's a location log for.)

[[design/caching_database]] talks about speedups by sqlite caching. But
that's a hard road.
"""]]

@@ -0,0 +1,32 @@
[[!comment format=mdwn
 username="joey"
 subject="""comment 2"""
 date="2020-06-30T18:53:52Z"
 content="""
I wonder if the second pass could avoid looking at the location log at all
for keys handled in the first pass. Currently it only checks the bloom
filter to skip dropping those keys, but not to skip transferring the keys. If it
also skipped transferring, it would not need to read the location log a second
time for the same key.

That would speed it up by around 2x, in fairly typical cases where there are a
lot of files, but not a lot of old versions of each file.

Problem with that is, it's a bloom filter, there can be false positives.
Currently a false positive means it does not drop a key that it should want
to, annoying maybe but unlikely to happen and not a big problem. But
consulting the bloom filter for transfers would risk it not making as many
copies of a key as it is configured to, which risks data loss, or at least not
having all desired data available after sync.

But, if it could use something other than a bloom filter to keep track
of the keys processed in the first pass, that would be a good optimisation.
Sqlite database maybe, have to consider the overhead of querying it. Just
keeping the keys in memory w/o a bloom filter maybe, and only use the bloom
filter if there are too many keys.
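
A minimal sketch of that hybrid idea, assuming an exact Data.Set of keys with
an illustrative cutoff of 200,000 entries before falling back to the existing
bloom filter; this is not git-annex code, and the names are invented for the
example:

    -- Sketch only: remember first-pass keys exactly, so pass 2 can skip
    -- them entirely; degrade to the bloom filter when the set gets too big.
    import qualified Data.Set as S

    type Key = String

    data SeenKeys
      = Exact (S.Set Key)  -- exact membership, no false positives
      | TooMany            -- stand-in for "fall back to the bloom filter"

    maxExactKeys :: Int
    maxExactKeys = 200000  -- illustrative cutoff, ~16 MB at ~80 bytes/key

    -- Called for each key handled in pass 1.
    record :: Key -> SeenKeys -> SeenKeys
    record k (Exact s)
      | S.size s < maxExactKeys = Exact (S.insert k s)
      | otherwise               = TooMany
    record _ TooMany            = TooMany

    -- Pass 2: only exact membership makes it safe to skip a key entirely
    -- (no location log read, no transfer, no drop).
    handledInPassOne :: SeenKeys -> Key -> Bool
    handledInPassOne (Exact s) k = k `S.member` s
    handledInPassOne TooMany   _ = False  -- keep current bloom-filter behaviour

    main :: IO ()
    main = print (handledInPassOne (record "k1" (Exact S.empty)) "k1")

With the exact set, a hit means the key was already handled in pass 1, so
skipping the transfer is safe; the bloom filter path would keep today's
drop-only behaviour.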

The bloom filter used currently uses around 16 mb of ram. A typical key is
80 bytes or so. So, up to around 200,000 keys in a set is probably the same
ballpark amount of ram. (32 mb would be needed to construct the
bloom filter from the set, probably.)
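(Back-of-the-envelope, from the numbers above: 200,000 keys × ~80 bytes ≈ 16 MB
for the exact set, the same ballpark as the bloom filter; while the bloom
filter is being built from the set, both would be live at once, hence ~32 MB.)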
"""]]