comments

parent 5891e10409
commit 8f508d4406

2 changed files with 56 additions and 0 deletions

@@ -0,0 +1,24 @@
[[!comment format=mdwn
 username="joey"
 subject="""comment 1"""
 date="2020-06-30T16:27:26Z"
 content="""
I am surprised by 30m, unless you have a very large number of files
or keys. Benchmarking in a repo with 100k files and keys, git-annex
sync --content took 7m, and with --all 14m. (both cold cache)

How many files and keys does git-annex info say are in your repository?

--all is slower because it does two passes, first over all files in the
current branch, and a second pass over all keys. (Necessary to handle
preferred content filename matching correctly.)
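
To illustrate the shape of that, here is a rough sketch with invented names
(none of these functions are git-annex's actual code):

    -- Illustrative sketch only: made-up names, not git-annex's code.
    newtype Key = Key String deriving (Eq, Ord, Show)

    syncContentAll
        :: [(FilePath, Key)]                -- files in the current branch, with their keys
        -> [Key]                            -- every key known to the git-annex branch
        -> (Maybe FilePath -> Key -> IO ()) -- apply preferred content, transfer/drop one key
        -> IO ()
    syncContentAll branchFiles allKeys act = do
        -- Pass 1: files in the current branch. Preferred content expressions
        -- can match on the filename, so the filename is passed along.
        mapM_ (\(f, k) -> act (Just f) k) branchFiles
        -- Pass 2: every key, including keys only referenced by old versions
        -- of files. No filename is available here.
        mapM_ (act Nothing) allKeys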

git cat-file --batch-all-objects --unordered would not work, because
there would be no way to know if a particular blob was a current version of
the location log for a key, or an old version.
(For that matter, it doesn't even say what file the blob belongs to, so
it would have no idea what it's a location log for.)

[[design/caching_database]] talks about speedups by sqlite caching. But
that's a hard road.
"""]]

@@ -0,0 +1,32 @@
[[!comment format=mdwn
 username="joey"
 subject="""comment 2"""
 date="2020-06-30T18:53:52Z"
 content="""
I wonder if the second pass could avoid looking at the location log at all
for keys handled in the first pass. Currently it only checks the bloom
filter to skip dropping those keys, but not to skip transferring the keys. If it
also skipped transferring, it would not need to read the location log a second
time for the same key.
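
A sketch of that control flow, with invented names (`handledInPass1` stands
for whatever membership test the first pass leaves behind; this is not
git-annex's actual code):

    newtype Key = Key String deriving (Eq, Ord, Show)

    secondPass
        :: (Key -> Bool)   -- was this key already transferred/dropped in pass 1?
        -> (Key -> IO ())  -- read the location log and transfer/drop as needed
        -> [Key]           -- every key enumerated by the second pass
        -> IO ()
    secondPass handledInPass1 processKey = mapM_ $ \k ->
        if handledInPass1 k
            then return ()    -- skip the key entirely: no drop, no transfer,
                              -- and no second read of its location log
            else processKey k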

That would speed it up by around 2x, in fairly typical cases where there are a
lot of files, but not a lot of old versions of each file.

Problem with that is, it's a bloom filter, there can be false positives.
Currently a false positive means it does not drop a key that it should want
to, annoying maybe but unlikely to happen and not a big problem. But
consulting the bloom filter for transfers would risk it not making as many
copies of a key as it is configured to, which risks data loss, or at least not
having all desired data available after sync.

But, if it could use something other than a bloom filter to keep track
of the keys processed in the first pass, that would be a good optimisation.
Sqlite database maybe, have to consider the overhead of querying it. Just
keeping the keys in memory w/o a bloom filter maybe, and only use the bloom
filter if there are too many keys.
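
One possible shape for that, sketched with a placeholder bloom filter type
(nothing here is git-annex's actual code, and the threshold is only a guess):

    import qualified Data.Set as S

    newtype Key = Key String deriving (Eq, Ord, Show)

    -- `bloom` is a stand-in for whatever bloom filter type is actually used;
    -- its insert/member/build operations are passed in rather than named here.
    data SeenKeys bloom
        = Exact (S.Set Key)  -- exact membership: safe for skipping transfers too
        | Overflowed bloom   -- too many keys: bloom filter, only safe for skipping drops

    exactLimit :: Int
    exactLimit = 200000      -- placeholder; see the ram estimate below

    record :: (bloom -> Key -> bloom) -> (S.Set Key -> bloom)
           -> Key -> SeenKeys bloom -> SeenKeys bloom
    record bloomInsert fromSet k (Exact s)
        | S.size s < exactLimit = Exact (S.insert k s)
        | otherwise             = Overflowed (bloomInsert (fromSet s) k)
    record bloomInsert _ k (Overflowed b) = Overflowed (bloomInsert b k)

    member :: (bloom -> Key -> Bool) -> Key -> SeenKeys bloom -> Bool
    member _ k (Exact s)                = k `S.member` s
    member bloomMember k (Overflowed b) = bloomMember b k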

The bloom filter used currently uses around 16 mb of ram. A typical key is
80 bytes or so. So, up to around 200,000 keys in a set is probably the same
ballpark amount of ram. (32 mb would be needed to construct the
bloom filter from the set, probably.)
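
Spelling out that ballpark arithmetic (estimates only, not measurements):

    bloomRamBytes, bytesPerKey, keysInSameRam :: Int
    bloomRamBytes = 16 * 1000 * 1000                 -- ~16 mb used by the bloom filter now
    bytesPerKey   = 80                               -- rough size of a typical key
    keysInSameRam = bloomRamBytes `div` bytesPerKey  -- = 200000 keys in the same ballpark of ram

    -- Building the bloom filter from such a set needs both live at once:
    -- ~16 mb (the set) + ~16 mb (the filter) ≈ 32 mb peak.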
"""]]