update RepoSize database from git-annex branch incrementally
The use of catObjectStream is optimally fast, although it might be possible to combine this with the git-annex branch merge to avoid some redundant work. In a benchmark, a git-annex branch that had 100000 files changed took less than 1.88 seconds to run through this.
parent 8239824d92
commit d09a005f2b
9 changed files with 115 additions and 33 deletions
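
To make the incremental update described in the commit message concrete, here is a minimal sketch in Python. It is not git-annex's Haskell implementation. The names `Change` and `apply_diff`, and the idea of representing each changed location log as "repos before" and "repos after" sets, are assumptions made for illustration only. It shows the general technique: rather than recomputing every repository's size, only the keys that differ between the old and new branch adjust the cached totals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """One key whose location log differs between the two branch versions.

    Hypothetical shape, invented for this sketch.
    """
    key_size: int          # size of the key's content in bytes
    old_repos: frozenset   # repos that held the key at the old branch sha
    new_repos: frozenset   # repos that hold the key at the new branch sha


def apply_diff(sizes, changes):
    """Return updated per-repo sizes, touching only the changed keys."""
    updated = dict(sizes)  # leave the cached mapping untouched
    for change in changes:
        # Repos that gained the key grow by its size.
        for repo in change.new_repos - change.old_repos:
            updated[repo] = updated.get(repo, 0) + change.key_size
        # Repos that dropped the key shrink by its size.
        for repo in change.old_repos - change.new_repos:
            updated[repo] = updated.get(repo, 0) - change.key_size
    return updated


cached = {"repo-a": 1000, "repo-b": 500}
changes = [
    # A 200 byte key was copied from repo-a to repo-b.
    Change(200, frozenset({"repo-a"}), frozenset({"repo-a", "repo-b"})),
    # A 300 byte key was dropped from repo-a.
    Change(300, frozenset({"repo-a"}), frozenset()),
]
new_sizes = apply_diff(cached, changes)
```

The cost of the update is proportional to the number of changed files rather than to the size of the whole branch, which is the property the benchmark in the commit message is measuring.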
@@ -30,19 +30,11 @@ Planned schedule of work:
## work notes

* Implement [[track_free_space_in_repos_via_git-annex_branch]]:

* updateRepoSizes incrementally when the git-annex branch sha in the
database is older than the current git-annex branch. Diff from old to
new branch to efficiently update.

Note ideas in above todo about doing this at git-annex branch merge
time to reuse the git diff done there.
* Concurrency issues with RepoSizes calculation and balanced content:

* What if 2 concurrent threads are considering sending two different
keys to a repo at the same time. It can hold either but not both.
It should avoid sending both in this situation. (Also discussed in
above todo)
It should avoid sending both in this situation.

* There can also be a race with 2 concurrent threads where one just
finished sending to a repo, but has not yet updated the location log.
@@ -101,6 +93,7 @@ Planned schedule of work:
* Balanced preferred content basic implementation, including --rebalance
option.
* Implemented [[track_free_space_in_repos_via_git-annex_branch]]

## completed items for August's work on git-annex proxy support for exporttre

@@ -0,0 +1,15 @@
When git-annex merges a remote into the git-annex branch, it uses
a CatFileHandle, making a query to get the contents of each file in the
diff. It would be faster for it to use catObjectStream.
[[!commit d010ab04be5a8d74fe85a2fa27a853784d1f9009]] saw a 2x-16x
improvement to a similar process.

Also, Database.ContentIdentifier.updateFromLog,
Database.ImportFeed.updateFromLog, and Annex.RepoSize.diffBranchRepoSizes
each do a similar diff and cat-file to update information cached from the
git-annex branch into a database. (diffBranchRepoSizes does use
catObjectStream, the others don't.)

It seems like it might be possible to
make merging the git-annex branch do these updates in passing, and reduce
the overhead of diff and cat-file 4x. --[[Joey]]
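
The idea in the todo above, walking the diff once and letting every cached database update itself in passing, can be sketched roughly as follows. This is a hypothetical Python illustration, not how git-annex is structured: `fan_out`, the file suffixes used to route entries, and the list-appending handlers are all invented for the example.

```python
def fan_out(diff_entries, consumers):
    """Walk the diff once, handing each changed file to every interested consumer.

    diff_entries: iterable of (path, content) pairs for changed files.
    consumers: list of (wants, handle) pairs, where wants(path) returns a bool
               and handle(path, content) records the update.
    """
    for path, content in diff_entries:
        for wants, handle in consumers:
            if wants(path):
                handle(path, content)


# Hypothetical consumers: each one just collects the files it cares about.
location_updates = []
cid_updates = []

consumers = [
    (lambda p: p.endswith(".log"),
     lambda p, c: location_updates.append((p, c))),
    (lambda p: p.endswith(".log.cid"),
     lambda p, c: cid_updates.append((p, c))),
]

# The diff is consumed as a one-shot iterator to show it is read only once.
diff = iter([
    ("abc/key1.log", "1 uuid-a"),
    ("abc/key1.log.cid", "cid-data"),
])
fan_out(diff, consumers)
```

Because the diff is read a single time, the expensive part (producing the diff and fetching file contents) is paid once no matter how many consumers are registered.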
@@ -92,6 +92,9 @@ merge time. Those are less expensive than diffing the location logs only
because the logs they diff are less often used, and the work is only
done when relevant commands are run.

(Opened [[todo/optimise_git-annex_branch_merge_and_database_updates]]
about that possibility.)

## concurrency

Suppose a repository is almost full. Two concurrent threads or processes
@@ -106,3 +109,6 @@ sizeOfDownloadsInProgress. It would be possible to make a
`sizeOfUploadsInProgressToRemote r` similarly.

[[!tag projects/openneuro]]

> Current status: This is implemented, but concurrency issues remain.
> --[[Joey]]
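
One common way to approach the concurrency issue noted in the work notes (two threads each deciding a nearly full repo can hold their key) is to reserve space atomically before sending. The Python sketch below illustrates that general technique only. It is not what git-annex does, and `RepoSpace`, `try_reserve`, and `finish` are hypothetical names. It also covers only threads within one process; coordinating separate processes would need shared state, which this sketch does not attempt.

```python
import threading


class RepoSpace:
    """Tracks free space in one repository, counting in-flight uploads."""

    def __init__(self, free_bytes):
        self._free = free_bytes
        self._reserved = 0
        self._lock = threading.Lock()

    def try_reserve(self, size):
        """Atomically claim space for an upload; False if it would not fit."""
        with self._lock:
            if self._reserved + size > self._free:
                return False
            self._reserved += size
            return True

    def finish(self, size, stored):
        """Release a reservation; if stored, the space is now really used."""
        with self._lock:
            self._reserved -= size
            if stored:
                self._free -= size


space = RepoSpace(free_bytes=100)
results = []
results_lock = threading.Lock()


def send(size):
    # Each thread checks and claims space in a single locked step.
    ok = space.try_reserve(size)
    with results_lock:
        results.append(ok)


# Two threads each want to send an 80 byte key; only one can fit.
threads = [threading.Thread(target=send, args=(80,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Holding the reservation until `finish` is called, after the location log has been updated, would also close the second race mentioned in the work notes, where a transfer has completed but is not yet recorded.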