update RepoSize database from git-annex branch incrementally

The use of catObjectStream is optimally fast, although it might be
possible to combine this with git-annex branch merge to avoid some
redundant work.

In a benchmark, a git-annex branch that had 100000 files changed
took less than 1.88 seconds to run through this.
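
The incremental update described above can be sketched roughly as follows. This is a minimal illustration in Python, not git-annex's Haskell implementation: the `Change` tuple and `update_repo_sizes` are hypothetical names, and the sketch assumes each diff entry has already been parsed into the key's size and the sets of repositories holding it before and after.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical shape of one parsed diff entry:
# (key size in bytes, repos holding the key in the old branch,
#  repos holding the key in the new branch)
Change = Tuple[int, Set[str], Set[str]]


def update_repo_sizes(sizes: Dict[str, int],
                      changes: Iterable[Change]) -> Dict[str, int]:
    """Apply only the differences between two branch states to the
    per-repository sizes, instead of recalculating them from scratch."""
    updated = dict(sizes)
    for key_size, old_repos, new_repos in changes:
        # Repos that gained the key grow by its size.
        for repo in new_repos - old_repos:
            updated[repo] = updated.get(repo, 0) + key_size
        # Repos that dropped the key shrink by its size.
        for repo in old_repos - new_repos:
            updated[repo] = updated.get(repo, 0) - key_size
    return updated
```

The cost is proportional to the number of changed files rather than to the size of the whole branch, which is the point of diffing from the old branch sha to the new one.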
Joey Hess 2024-08-17 13:30:24 -04:00
parent 8239824d92
commit d09a005f2b
GPG key ID: DB12DB0FF05F8F38
9 changed files with 115 additions and 33 deletions


@ -30,19 +30,11 @@ Planned schedule of work:
## work notes
* Implement [[track_free_space_in_repos_via_git-annex_branch]]:
* updateRepoSizes incrementally when the git-annex branch sha in the
database is older than the current git-annex branch. Diff from old to
new branch to efficiently update.
Note ideas in above todo about doing this at git-annex branch merge
time to reuse the git diff done there.
* Concurrency issues with RepoSizes calculation and balanced content:
* What if 2 concurrent threads are considering sending two different
keys to a repo at the same time. It can hold either but not both.
It should avoid sending both in this situation. (Also discussed in
above todo)
It should avoid sending both in this situation.
* There can also be a race with 2 concurrent threads where one just
finished sending to a repo, but has not yet updated the location log.
@ -101,6 +93,7 @@ Planned schedule of work:
* Balanced preferred content basic implementation, including --rebalance
option.
* Implemented [[track_free_space_in_repos_via_git-annex_branch]]
## completed items for August's work on git-annex proxy support for exporttre


@ -0,0 +1,15 @@
When git-annex merges a remote into the git-annex branch, it uses
a CatFileHandle, making a query to get the contents of each file in the
diff. It would be faster for it to use catObjectStream.
[[!commit d010ab04be5a8d74fe85a2fa27a853784d1f9009]] saw a 2x-16x
improvement to a similar process.
Also, Database.ContentIdentifier.updateFromLog,
Database.ImportFeed.updateFromLog, and Annex.RepoSize.diffBranchRepoSizes
each do a similar diff and cat-file to update information cached from the
git-annex branch into a database. (diffBranchRepoSizes does use
catObjectStream, the others don't.)
It seems like it might be possible to
make merging the git-annex branch do these updates in passing, and reduce
the overhead of diff and cat-file by 4x. --[[Joey]]
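
One way to picture doing these updates "in passing" is a single walk over the diff that hands each entry to every interested consumer. The sketch below is in Python and purely illustrative: `dispatch_diff`, `DiffEntry`, and the `(wants, handle)` consumer pairs are hypothetical names, not part of git-annex, and how the diff entries are produced is left out.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical diff entry: (path, old content, new content)
DiffEntry = Tuple[str, bytes, bytes]
# A consumer is a (wants, handle) pair: a path filter and an update action.
Consumer = Tuple[Callable[[str], bool], Callable[[DiffEntry], None]]


def dispatch_diff(entries: Iterable[DiffEntry],
                  consumers: List[Consumer]) -> None:
    """Walk the diff once, passing each entry to every consumer whose
    filter accepts its path, rather than running one diff per consumer."""
    for entry in entries:
        path = entry[0]
        for wants, handle in consumers:
            if wants(path):
                handle(entry)
```

With four consumers registered (the merge itself plus the three database updates), the diff and the content lookups would each happen once instead of four times, assuming the consumers can all work from the same entries.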


@ -92,6 +92,9 @@ merge time. Those are less expensive than diffing the location logs only
because the logs they diff are less often used, and the work is only
done when relevant commands are run.
(Opened [[todo/optimise_git-annex_branch_merge_and_database_updates]]
about that possibility.)
## concurrency
Suppose a repository is almost full. Two concurrent threads or processes
@ -106,3 +109,6 @@ sizeOfDownloadsInProgress. It would be possible to make a
`sizeOfUploadsInProgressToRemote r` similarly.
[[!tag projects/openneuro]]
> Current status: This is implemented, but concurrency issues remain.
> --[[Joey]]
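
The remaining concurrency issue, where two threads each consider sending a different key to a repository that can hold either but not both, could in principle be handled by reserving space before a transfer starts. The following Python sketch is only an illustration under that assumption: `RepoSpace` and its methods are hypothetical, and it covers threads within one process only, not separate processes.

```python
import threading


class RepoSpace:
    """Tracks free space in one repository plus space reserved by
    transfers that are still in progress (hypothetical helper)."""

    def __init__(self, free_bytes: int) -> None:
        self._free = free_bytes
        self._reserved = 0
        self._lock = threading.Lock()

    def try_reserve(self, size: int) -> bool:
        """Reserve space for a transfer; refuse if in-progress transfers
        plus this one would exceed the free space."""
        with self._lock:
            if self._reserved + size > self._free:
                return False
            self._reserved += size
            return True

    def release(self, size: int) -> None:
        """Drop a reservation once the transfer has finished or failed."""
        with self._lock:
            self._reserved -= size
```

Because the check and the reservation happen under the same lock, the second of two competing threads sees the first one's reservation and declines to send its key.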