comment and a neat idea
parent f9baf11e11
commit aaeae746f0
3 changed files with 75 additions and 0 deletions
@@ -0,0 +1,38 @@
[[!comment format=mdwn
username="joey"
subject="""comment 2"""
date="2023-05-30T18:49:34Z"
content="""
I think I see why a second sync re-writes the cidsdb with information from
the git-annex branch.

The first sync does write to the cidsdb at the same time it writes to the
git-annex branch. So, it seems that it could at the end call
recordAnnexBranchTree to indicate that the information in the git-annex
branch has all been written to the cidsdb. That would avoid the second sync
doing extra work.

But, there could be other processes running at the same time, and one of
them may update the git-annex branch, e.g. by merging a remote git-annex branch
into it. Any cids logs on that merged git-annex branch would not be
reflected in the cidsdb yet. If the sync then called
recordAnnexBranchTree, the cidsdb would never get updated with that merged
information.

I don't think there's a good way to prevent or detect that situation.
So, it can't call recordAnnexBranchTree at the end, and has to do extra
work in updateFromLog at the beginning.
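
To make the race concrete, here is a toy model in Python. It is not git-annex code: `branch`, `cidsdb`, `recorded`, and both functions are hypothetical stand-ins, and the real updateFromLog works on git trees rather than dicts. It only illustrates why recording the branch state at the end of a sync can hide a concurrent merge.

```python
# Toy model of the race, not git-annex code. All names are hypothetical:
# `branch` stands in for the cid logs on the git-annex branch, `cidsdb` for
# the local database, and `recorded` for the branch state that a call like
# recordAnnexBranchTree would mark as fully reflected in the cidsdb.

def update_from_log(branch, cidsdb, recorded):
    # Copy into the cidsdb only what the branch gained since `recorded`.
    for cid, key in branch.items():
        if cid not in recorded:
            cidsdb[cid] = key
    return dict(branch)  # the new recorded state

def sync_with_concurrent_merge(record_at_end):
    branch, cidsdb, recorded = {}, {}, {}
    # The sync imports a file, writing to both the branch and the cidsdb.
    branch["cid-a"] = "key-a"
    cidsdb["cid-a"] = "key-a"
    # Meanwhile, another process merges a remote git-annex branch.
    branch["cid-b"] = "key-b"
    if record_at_end:
        # Claims the cidsdb reflects the whole branch, which is no longer true.
        recorded = dict(branch)
    # The next sync starts by catching up from the branch.
    update_from_log(branch, cidsdb, recorded)
    return cidsdb

print(sync_with_concurrent_merge(record_at_end=True))   # cid-b never reaches the cidsdb
print(sync_with_concurrent_merge(record_at_end=False))  # cid-b is picked up
```

In this model, skipping the record step costs one redundant write of `cid-a` on the next sync, which is the extra work described above.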

What it could do is only record a cid to the git-annex branch, not to the
cidsdb. Then updateFromLog would not be doing extra work, but necessary
work. But, it also needs to read cids from the db during import, and if it
doesn't record a cid there, it won't know when a later file has the same
cid. So it will re-import it, which, for special remotes other than
directory, means an expensive second download of the content.
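
A small sketch of that trade-off, again as a toy Python model rather than git-annex code. The function, the dict-based `cidsdb`, and the `record_in_db` switch are all made up for illustration.

```python
# Toy model, not git-annex code. `listing` is a hypothetical list of
# (path, cid) pairs reported by the remote, and `cidsdb` is a plain dict
# standing in for the real database.

def import_files(listing, cidsdb, record_in_db):
    downloads = 0
    for path, cid in listing:
        if cid not in cidsdb:
            downloads += 1  # stands in for fetching the content from the remote
            if record_in_db:
                cidsdb[cid] = "key-for-" + cid
    return downloads

# Two files on the remote with identical content, so the same cid.
listing = [("a.txt", "cid-1"), ("copy-of-a.txt", "cid-1")]
print(import_files(listing, {}, record_in_db=True))   # 1
print(import_files(listing, {}, record_in_db=False))  # 2
```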

Anyway, the extra work of re-writing the cidsdb is only done on the sync
immediately after the one that did import some new files. And it only
re-writes cids for the new files, not for unchanged files.

I'm not sure that this extra work is what the bug reporter was complaining about though.
"""]]
@@ -0,0 +1,8 @@
[[!comment format=mdwn
username="joey"
subject="""comment 3"""
date="2023-05-30T19:25:11Z"
content="""
See [[todo/speed_up_import_tree]] for a few ideas of things that would
speed it up.
"""]]
29
doc/todo/speed_up_import_tree.mdwn
Normal file
@@ -0,0 +1,29 @@
Users sometimes expect `git-annex import --from remote` to be faster than
it is, when importing hundreds of thousands of files, particularly
from a directory special remote.

I think generally, they're expecting something that is not achievable.
It is always going to be slower than using git in a repository with that
many files, because git operates at a lower level of abstraction (the
filesystem), so has more optimisations available to it. (Also git has its
own scalability limits with many files.)

Still, it would be good to find some ways to speed it up.

Hmm... What if it generated a git tree, where each file in the tree is
a sha1 hash of the ContentIdentifier? The tree can just be recorded locally
somewhere. It's ok if it gets garbage collected; it's only an optimisation.
On the next sync, diff from the old to the new tree. It only needs to
import the changed files!
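
A rough sketch of the idea in Python, using dicts in place of real git trees. The function names and the example ContentIdentifier values are made up for illustration. A real implementation would presumably write an actual git tree and let git do the diffing.

```python
import hashlib

def cid_tree(listing):
    # listing: dict of path -> ContentIdentifier bytes, as reported by the remote.
    # Each "file" in the tree is just the sha1 of the ContentIdentifier.
    return {path: hashlib.sha1(cid).hexdigest() for path, cid in listing.items()}

def paths_to_import(old_tree, new_tree):
    # If the old tree was garbage collected (None), fall back to importing everything.
    old_tree = old_tree or {}
    changed = sorted(p for p, h in new_tree.items() if old_tree.get(p) != h)
    deleted = sorted(p for p in old_tree if p not in new_tree)
    return changed, deleted

old = cid_tree({"a.txt": b"cid-1", "b.txt": b"cid-2"})
new = cid_tree({"a.txt": b"cid-1", "b.txt": b"cid-2-changed", "c.txt": b"cid-3"})
print(paths_to_import(old, new))   # (['b.txt', 'c.txt'], [])
print(paths_to_import(None, new))  # (['a.txt', 'b.txt', 'c.txt'], [])
```

Unchanged files never reach the cid to key lookup at all, which is where the time goes when only a few files have changed.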

(That is assuming that ContentIdentifiers don't tend to sha1 collide.
If there was a collision it would fail to import the new file. But it seems
reasonable, because git loses data on sha1 collisions anyway, and ContentIdentifiers
are no more likely to collide than the content of files, and probably less
likely overall.)

Another idea would be to use something faster than sqlite to record the cid
to key mappings. Looking up those mappings is the main thing that makes
import slow when only a few files have changed and a large number have not.

--[[Joey]]