comment and a neat idea

Joey Hess 2023-05-30 15:42:34 -04:00
parent f9baf11e11
commit aaeae746f0
3 changed files with 75 additions and 0 deletions


@@ -0,0 +1,38 @@
[[!comment format=mdwn
username="joey"
subject="""comment 2"""
date="2023-05-30T18:49:34Z"
content="""
I think I see why a second sync re-writes the cidsdb with information from
the git-annex branch.

The first sync does write to the cidsdb at the same time it writes to the
git-annex branch. So it seems that it could, at the end, call
recordAnnexBranchTree to indicate that the information in the git-annex
branch has all been written to the cidsdb. That would avoid the second
sync doing extra work.
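
A minimal sketch of that bookkeeping, assuming made-up stand-ins (`Sha`,
`SyncState`, `needsUpdateFromLog`) rather than git-annex's real types;
only `recordAnnexBranchTree` is a name from the discussion:

```haskell
import Data.IORef

newtype Sha = Sha String deriving (Eq, Show)

-- Stand-in for state remembering which git-annex branch tree has
-- already been fully reflected into the cidsdb.
newtype SyncState = SyncState (IORef (Maybe Sha))

-- At the end of a sync that wrote every cid to both the cidsdb and
-- the git-annex branch, record the branch tree sha...
recordAnnexBranchTree :: SyncState -> Sha -> IO ()
recordAnnexBranchTree (SyncState ref) sha = writeIORef ref (Just sha)

-- ...so the next sync can skip updateFromLog when the branch tree
-- has not moved since.
needsUpdateFromLog :: SyncState -> Sha -> IO Bool
needsUpdateFromLog (SyncState ref) current = do
    recorded <- readIORef ref
    pure (recorded /= Just current)
```
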
But, there could be other processes running at the same time, and one of
them may update the git-annex branch, eg by merging a remote git-annex
branch into it. Any cid logs on that merged git-annex branch would not be
reflected in the cidsdb yet. If the sync then called
recordAnnexBranchTree, the cidsdb would never get updated with that merged
information.

I don't think there's a good way to prevent that situation, or to detect
it. So, it can't call recordAnnexBranchTree at the end, and has to do
extra work in updateFromLog at the beginning.
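
To make the problem concrete, here is that interleaving as a toy,
self-contained demonstration; none of this is git-annex code, just the
same kind of stand-ins as above:

```haskell
import Data.IORef

newtype Sha = Sha String deriving (Eq, Show)

main :: IO ()
main = do
    -- branch tree sha the cidsdb is believed to be current for
    recorded <- newIORef (Nothing :: Maybe Sha)
    -- the git-annex branch itself; the sync wrote its cids, tree is A
    branch <- newIORef (Sha "A")

    -- meanwhile, another process merges a remote git-annex branch:
    -- the tree moves to B, whose cid logs are not in the cidsdb
    writeIORef branch (Sha "B")

    -- the sync finishes and records the *current* tree, which is now B
    current <- readIORef branch
    writeIORef recorded (Just current)

    -- the next sync sees the recorded sha matches the branch, and skips
    -- updateFromLog, so B's merged cid logs never reach the cidsdb
    r <- readIORef recorded
    putStrLn ("skips updateFromLog: " ++ show (r == Just current))
```
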
What it could do instead is record a cid only to the git-annex branch,
not to the cidsdb. Then updateFromLog would not be doing extra work, but
necessary work. But, it also needs to read cids from the db during import,
and if it doesn't record a cid there, it won't know when a later file has
the same cid. So it will re-import it, which, for special remotes other
than directory, means an expensive second download of the content.
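
A sketch of why that lookup matters during import, with illustrative
types (the real cidsdb is sqlite, not a `Data.Map`, and these function
names are assumptions):

```haskell
import qualified Data.Map.Strict as M

newtype ContentIdentifier = ContentIdentifier String
    deriving (Eq, Ord, Show)
newtype Key = Key String deriving (Show)

type CidsDb = M.Map ContentIdentifier Key

-- Import one file: a hit in the cidsdb means this content was imported
-- before, so no download is needed; a miss means downloading and hashing.
importFile :: CidsDb -> ContentIdentifier -> IO (Key, CidsDb)
importFile db cid = case M.lookup cid db of
    Just key -> pure (key, db)           -- cid already known: cheap
    Nothing  -> do
        key <- downloadAndHash cid       -- expensive for most remotes
        pure (key, M.insert cid key db)  -- without this insert, a later
                                         -- file with the same cid would
                                         -- be downloaded all over again

-- Stub standing in for fetching the content and hashing it to a key.
downloadAndHash :: ContentIdentifier -> IO Key
downloadAndHash (ContentIdentifier c) = pure (Key ("SHA256--" ++ c))
```
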
Anyway, the extra work of re-writing the cidsdb is only done on the sync
immediately after the one that did import some new files. And it only
re-writes cids for the new files, not for unchanged files.

I'm not sure that this extra work is what the bug reporter was complaining
about, though.
"""]]


@@ -0,0 +1,8 @@
[[!comment format=mdwn
username="joey"
subject="""comment 3"""
date="2023-05-30T19:25:11Z"
content="""
See [[todo/speed_up_import_tree]] for a few ideas that could speed it up.
"""]]


@@ -0,0 +1,29 @@
Users sometimes expect `git-annex import --from remote` to be faster than
it is when importing hundreds of thousands of files, particularly from a
directory special remote.

I think that, generally, they're expecting something that is not
achievable. It is always going to be slower than using git in a repository
with that many files, because git operates at a lower level of abstraction
(the filesystem), and so has more optimisations available to it. (Also,
git has its own scalability limits with many files.)

Still, it would be good to find some ways to speed it up.
Hmm... What if it generated a git tree, where each file in the tree is
the sha1 hash of its ContentIdentifier? The tree can just be recorded
locally somewhere. It's ok if it gets garbage collected; it's only an
optimisation.

On the next sync, diff from the old tree to the new tree. It only needs
to import the changed files!
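
A sketch of that diffing idea, using a `Data.Map` to stand in for the
generated git tree; the names and the cid-hashing stub are assumptions,
not git-annex's API:

```haskell
import qualified Data.Map.Strict as M

newtype ContentIdentifier = ContentIdentifier String deriving (Eq, Show)
newtype Sha1 = Sha1 String deriving (Eq, Show)

-- Stub standing in for sha1-hashing a ContentIdentifier.
hashCid :: ContentIdentifier -> Sha1
hashCid (ContentIdentifier c) = Sha1 ("sha1:" ++ c)

-- The "tree": remote file paths mapped to hashed cids.
type CidTree = M.Map FilePath Sha1

buildTree :: [(FilePath, ContentIdentifier)] -> CidTree
buildTree listing = M.fromList [ (p, hashCid c) | (p, c) <- listing ]

-- Diff the new tree against the old one: only new or changed paths need
-- to go through the import machinery. (Deleted paths fall out of this
-- diff and would be handled separately.)
changedFiles :: CidTree -> CidTree -> [FilePath]
changedFiles old new = M.keys (M.differenceWith keepChanged new old)
  where
    keepChanged newsha oldsha
        | newsha == oldsha = Nothing      -- unchanged: drop
        | otherwise        = Just newsha  -- changed: keep
```

Listing the remote would still cost time proportional to the total number
of files, but the expensive per-file cid-to-key lookups would only happen
for the changed ones.
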
(That is assuming that ContentIdentifiers don't tend to sha1 collide. If
there was a collision, it would fail to import the new file. But it seems
a reasonable assumption, because git loses data on sha1 collisions anyway,
and ContentIdentifiers are no more likely to collide than the contents of
files, and probably less likely overall.)
Another idea would be to use something faster than sqlite to record the
cid to key mappings. Looking up those mappings is the main thing that
makes import slow when only a few files have changed and a large number
have not.
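
For comparison, a hedged sketch of one such alternative: load the whole
mapping into an in-memory hash map once per run, instead of a sqlite
query per file. The flat-file loader and all the names here are purely
illustrative, not a format git-annex uses:

```haskell
{-# LANGUAGE GeneralizedNewtypeDeriving #-}

import qualified Data.HashMap.Strict as HM
import Data.Hashable (Hashable)

newtype ContentIdentifier = ContentIdentifier String
    deriving (Eq, Hashable, Show)
newtype Key = Key String deriving (Show)

-- Load the cid -> key mapping once; after this, each of the hundreds
-- of thousands of per-file lookups is a cheap in-memory operation.
loadCidMap :: FilePath -> IO (HM.HashMap ContentIdentifier Key)
loadCidMap path = do
    ls <- lines <$> readFile path
    -- stand-in format: one "cid key" pair per line
    pure (HM.fromList [ (ContentIdentifier c, Key k)
                      | [c, k] <- map words ls ])

lookupKey :: HM.HashMap ContentIdentifier Key -> ContentIdentifier
          -> Maybe Key
lookupKey m cid = HM.lookup cid m
```

The tradeoff is startup time and memory proportional to the full mapping,
which sqlite avoids; whether that wins depends on how many lookups a
typical import does.
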
--[[Joey]]