avoid import writing to cidsdb initially

Speed up importing trees from special remotes somewhat by avoiding
redundant writes to the sqlite database.

Before, import would write to both the git-annex branch and the sqlite
database. But then the next time it was run, needsUpdateFromLog would see
that the branch had changed, and so run updateFromLog, which would make
the same writes to the sqlite database a second time.

Now import writes only to the git-annex branch. The next time it's run,
needsUpdateFromLog sees that the branch has changed and so calls
updateFromLog, which updates the sqlite database.
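
A toy model of the before and after behavior may make the saving easier to
see. This is only a sketch in Python, not git-annex code: ToyRepo and
total_db_writes are hypothetical names, the branch is modeled as a list of
(cid, key) entries, and the "recorded branch tree" is modeled as a count of
entries already copied into the database.

```python
class ToyRepo:
    """Hypothetical stand-in for a repository; not the real data model."""

    def __init__(self):
        self.branch_log = []  # (cid, key) entries on the git-annex branch
        self.db = {}          # stand-in for the sqlite cidsdb
        self.db_seen = 0      # branch entries already reflected in the db
        self.db_writes = 0    # counts every write made to the db

    def needs_update_from_log(self):
        # True when the branch has entries the db has not been updated from
        return self.db_seen != len(self.branch_log)

    def update_from_log(self):
        # Replay only the entries added since the last update
        for cid, key in self.branch_log[self.db_seen:]:
            self.db[cid] = key
            self.db_writes += 1
        self.db_seen = len(self.branch_log)

    def run_import(self, new_cids, also_write_db):
        if self.needs_update_from_log():
            self.update_from_log()
        for cid, key in new_cids:
            self.branch_log.append((cid, key))
            if also_write_db:  # the old behavior
                self.db[cid] = key
                self.db_writes += 1


def total_db_writes(also_write_db):
    """Run an import of two files, then a second import with nothing new."""
    repo = ToyRepo()
    repo.run_import([("cid1", "key1"), ("cid2", "key2")], also_write_db)
    repo.run_import([], also_write_db)
    return repo.db_writes, repo.db
```

In this model the old behavior makes each database write twice across the
two runs, while the new behavior makes it once, and the database ends up
with the same contents either way.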

Why defer the write to the sqlite database like this? It seems that it
could write to the database as it goes, and at the end call
recordAnnexBranchTree to indicate that the information in the git-annex
branch has all been written to the cidsdb. That would avoid the second
import doing extra work.

But, there could be other processes running at the same time, and one of
them may update the git-annex branch, eg merging a remote git-annex branch
into it. Any cids logs on that merged git-annex branch would not be
reflected in the cidsdb yet. If the import then called
recordAnnexBranchTree, the cidsdb would never get updated with that merged
information.
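
The race can be shown with a similar toy model. Again this is a Python
sketch under assumptions, not git-annex code: ToyRepo, racy_import and the
recorded_len counter are hypothetical, and the concurrent merge is simulated
by appending entries to the branch before the import finishes.

```python
class ToyRepo:
    """Hypothetical stand-in for a repository; not the real data model."""

    def __init__(self):
        self.branch_log = []   # (cid, key) entries on the git-annex branch
        self.db = {}           # stand-in for the sqlite cidsdb
        self.recorded_len = 0  # stand-in for the recorded branch tree

    def needs_update_from_log(self):
        return self.recorded_len != len(self.branch_log)

    def record_annex_branch_tree(self):
        # Claims that everything now on the branch is already in the db
        self.recorded_len = len(self.branch_log)


def racy_import(repo, new_cids, concurrent_merge):
    """Import that writes the db as it goes and records the tree at the end."""
    for cid, key in new_cids:
        repo.branch_log.append((cid, key))
        repo.db[cid] = key
    # Another process merges a remote git-annex branch while import runs.
    # Its cid logs land on the branch but not in the db.
    repo.branch_log.extend(concurrent_merge)
    repo.record_annex_branch_tree()


repo = ToyRepo()
racy_import(repo, [("cid1", "key1")], concurrent_merge=[("cid9", "key9")])
```

After this run the model reports that no update is needed, yet cid9 is
missing from the database, which is the situation described above.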

I don't think there's a good way to prevent or to detect that situation.
So, it can't call recordAnnexBranchTree at the end. So it might as well
wait until the next run and do updateFromLog then. It could instead do
updateFromLog at the end, but it's going to check needsUpdateFromLog
at the beginning anyway.

Note that the database writes were queued, so there is already a cidmap
that is used to remember changes that the current process has made.
So, omitting database writes can't change the behavior of the current
process.

Also note that thirdpartypopulatedimport uses recordcidkeyindb, which
reflects what it already did. That code path does not use the cidmap,
but does not need to query it either. It might be possible to make that
code path also only update the git-annex branch and not the db, but I
haven't checked.

Sponsored-by: Noam Kremen on Patreon
Joey Hess 2023-05-30 17:05:28 -04:00
parent c1e415887a
commit f6aa097a39
3 changed files with 33 additions and 47 deletions

@@ -3,36 +3,14 @@
 subject="""comment 2"""
 date="2023-05-30T18:49:34Z"
 content="""
 I think I see why a second sync re-writes the cidsdb with information from
 the git-annex branch.
 > The other hit is recordContentIdentifier, which happens for
 > each recorded cid, due to updateFromLog. That seems unnecessary, because
 > the previous sync already recorded all the cids.
-The first sync does write to the cidsdb at the same time it writes to the
-git-annex branch. So, it seems that it could at the end call
-recordAnnexBranchTree to indicate that the information in the git-annex
-branch has all been written to the cidsdb. That would avoid the second sync
-doing extra work.
+I was able to eliminate that extra work. Now the first sync does not write
+to the cidsdb but only to the git-annex branch, and the second sync does
+the necessary work of updating the cidsdb from the git-annex branch.
-But, there could be other processes running at the same time, and one of
-them may update the git-annex branch, eg merging a remote git-annex branch
-into it. Any cids logs on that merged git-annex branch would not be
-reflected in the cidsdb yet. If the sync then called
-recordAnnexBranchTree, the cidsdb would never get updated with that merged
-information.
-I don't think there's a good way to prevent, or to detect that situation.
-So, it can't call recordAnnexBranchTree at the end, and has to do extra
-work in updateFromLog at the beginning.
-What it could do is, only record a cid to the git-annex branch, not to the
-cidsdb. Then the updateFromLog would not be doing extra work, but necessary
-work. But, it also needs to read cids from the db during import, and if it
-doesn't record a cid there, it won't know when a later file has the same
-cid. So it will re-import it. Which for other special remotes than
-directory, means an expensive second download of the content.
-Anyway, the extra work of re-writing the cidsdb is only done on the sync
-immediately after the one that did import some new files. And it only
-re-writes cids for the new files, not for unchanged files.
-I'm not sure that this extra work is what the bug reporter was complaining about though.
+I'm not sure that extra work is what the bug reporter was complaining
+about though.
 """]]