avoid import writing to cidsdb initially

Speed up importing trees from special remotes somewhat by avoiding
redundant writes to sqlite database.

Before, import would write to both the git-annex branch and also to the
sqlite database. But then the next time it was run, needsUpdateFromLog
would see the branch had changed, so run updateFromLog, which would make
the same writes to the sqlite database a second time.

Now import writes only to the git-annex branch. The next time it's run,
needsUpdateFromLog sees that the branch has changed and so calls
updateFromLog, which updates the sqlite database.

Why defer the write to the sqlite database like this? It seems that it
could write to the database as it goes, and at the end call
recordAnnexBranchTree to indicate that the information in the git-annex
branch has all been written to the cidsdb. That would avoid the second
import doing extra work.

But, there could be other processes running at the same time, and one of
them may update the git-annex branch, eg merging a remote git-annex branch
into it. Any cids logs on that merged git-annex branch would not be
reflected in the cidsdb yet. If the import then called
recordAnnexBranchTree, the cidsdb would never get updated with that merged
information.

I don't think there's a good way to prevent, or to detect that situation.
So, it can't call recordAnnexBranchTree at the end. So it might as well
wait until the next run and do updateFromLog then. It could instead do
updateFromLog at the end, but it's going to check needsUpdateFromLog at
the beginning anyway.

Note that the database writes were queued, so there is already a cidmap
that is used to remember changes that the current process has made. So,
omitting database writes can't change the behavior of the current process.

Also note that thirdpartypopulatedimport uses recordcidkeyindb, which
reflects what it already did. That code path does not use the cidmap, but
does not need to query it either. It might be possible to make that code
path also only update the git-annex branch and not the db, but I haven't
checked.

Sponsored-by: Noam Kremen on Patreon
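
The reasoning in the commit message can be illustrated with a toy model. This is only a sketch in Python, not git-annex's Haskell code: the `Repo` class, `run_import`, `total_db_writes` and `race_loses_merged_cid` are hypothetical names invented for illustration. The branch is modelled as a plain list and the cidsdb as a set, and the methods only mimic the roles the message gives to needsUpdateFromLog, updateFromLog and recordAnnexBranchTree.

```python
class Repo:
    """Toy repository: a git-annex branch log plus a cidsdb (hypothetical model)."""

    def __init__(self):
        self.branch_log = []   # cids recorded on the git-annex branch, in order
        self.db = set()        # cids stored in the sqlite cidsdb
        self.recorded = 0      # how much of branch_log the db claims to reflect

    def needs_update_from_log(self):
        # True when the branch has changed since the db last recorded it.
        return self.recorded != len(self.branch_log)

    def update_from_log(self):
        # Copy unseen branch entries into the db; return the number of db writes.
        pending = self.branch_log[self.recorded:]
        self.db.update(pending)
        self.record_annex_branch_tree()
        return len(pending)

    def record_annex_branch_tree(self):
        # Declare that the db reflects everything currently on the branch.
        self.recorded = len(self.branch_log)


def run_import(repo, cids, write_db):
    """One import run; returns how many db writes it performed."""
    writes = repo.update_from_log() if repo.needs_update_from_log() else 0
    cidmap = set()  # changes made by this process, consulted alongside the db
    for cid in cids:
        if cid in repo.db or cid in cidmap:
            continue  # already known, nothing to import
        cidmap.add(cid)
        repo.branch_log.append(cid)
        if write_db:  # old behaviour: also write to the db right away
            repo.db.add(cid)
            writes += 1
    return writes


def total_db_writes(write_db):
    """Two consecutive imports of the same three files; total db writes."""
    repo = Repo()
    first = run_import(repo, ["a", "b", "c"], write_db)
    second = run_import(repo, ["a", "b", "c"], write_db)  # nothing new to import
    return first + second


def race_loses_merged_cid():
    """Why the import cannot call record_annex_branch_tree at the end."""
    repo = Repo()
    repo.branch_log.append("a")
    repo.db.add("a")                 # import wrote "a" to both places
    repo.branch_log.append("m")      # concurrent merge adds "m" to the branch only
    repo.record_annex_branch_tree()  # import wrongly declares the db up to date
    return "m" not in repo.db and not repo.needs_update_from_log()
```

In this model, `total_db_writes(True)` counts each cid twice (once during the import, once again in the next run's update from the log), while `total_db_writes(False)` counts each cid once. `race_loses_merged_cid()` shows the merged cid being left out of the db with nothing remaining to trigger an update.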
2023-05-30 17:05:28 -04:00
commit f6aa097a39
parent c1e415887a
3 changed files with 33 additions and 47 deletions
--- a/doc/bugs/importtree_spends_hours_reading_cidsdb/comment_2_8f609e231fdf315d2b658d30d708d471._comment
+++ b/doc/bugs/importtree_spends_hours_reading_cidsdb/comment_2_8f609e231fdf315d2b658d30d708d471._comment
@@ -3,36 +3,14 @@
 subject="""comment 2"""
 date="2023-05-30T18:49:34Z"
 content="""
-I think I see why a second sync re-writes the cidsdb with information from
-the git-annex branch. 
+> The other hit is recordContentIdentifier, which happens for
+> each recorded cid, due to updateFromLog. That seems unnecessary, because
+> the previous sync already recorded all the cids.

-The first sync does write to the cidsdb at the same time it writes to the
-git-annex branch. So, it seems that it could at the end call 
-recordAnnexBranchTree to indicate that the information in the git-annex
-branch has all been written to the cidsdb. That would avoid the second sync
-doing extra work.
+I was able to eliminate that extra work. Now the first sync does not write
+to the cidsdb but only to the git-annex branch, and the second sync does
+the necessary work of updating the cidsdb from the git-annex branch.

-But, there could be other processes running at the same time, and one of
-them may update the git-annex branch, eg merging a remote git-annex branch
-into it. Any cids logs on that merged git-annex branch would not be
-reflected in the cidsdb yet. If the sync then called
-recordAnnexBranchTree, the cidsdb would never get updated with that merged
-information.
-
-I don't think there's a good way to prevent, or to detect that situation.
-So, it can't call recordAnnexBranchTree at the end, and has to do extra
-work in updateFromLog at the beginning.
-
-What it could do is, only record a cid to the git-annex branch, not to the
-cidsdb. Then the updateFromLog would not be doing extra work, but necessary
-work. But, it also needs to read cids from the db during import, and if it
-doesn't record a cid there, it won't know when a later file has the same
-cid. So it will re-import it. Which for other special remotes than
-directory, means an expensive second download of the content.
-
-Anyway, the extra work of re-writing the cidsdb is only done on the sync
-immediately after the one that did import some new files. And it only
-re-writes cids for the new files, not for unchanged files.
-
-I'm not sure that this extra work is what the bug reporter was complaining about though.
+I'm not sure that extra work is what the bug reporter was complaining
+about though.
 """]]