Joey Hess 2024-03-08 13:43:31 -04:00
parent 1115fb1f9b
commit ad966e5e7b
GPG key ID: DB12DB0FF05F8F38
2 changed files with 41 additions and 12 deletions


@@ -7,7 +7,7 @@ that entirely:
other repo is not known to contain, but then repos will race and both get
the same file, or similarly if they are not communicating frequently.
So, let's add a new expression: `balanced_amoung(group)`
So, let's add a new expression: `balanced(group)`
This would work by taking the list of uuids of all repositories in the
group, and sorting them, which yields a list from 0..M-1 repositories.
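A minimal sketch of how such an expression might pick repositories, in Python. The function name `repos_wanting` and the use of SHA-256 to turn a key into the number N are assumptions for illustration only; this excerpt does not show how N is derived. The `copies` parameter corresponds to the `N+I mod M` variant described later in this file.

```python
import hashlib

def repos_wanting(key: str, group_uuids: list[str], copies: int = 1) -> list[str]:
    """Return the repositories in the group that would want this key."""
    # Sorting the uuids gives every repository the same 0..M-1 ordering.
    repos = sorted(group_uuids)
    m = len(repos)
    if m == 0:
        return []
    # Assumption: derive a stable number N from the key with a hash.
    n = int.from_bytes(hashlib.sha256(key.encode()).digest(), "big")
    # Take (N+I) mod M for I in 0..copies-1.
    return [repos[(n + i) % m] for i in range(min(copies, m))]
```

Because the result depends only on the key and the sorted uuid list, every repository computes the same answer without communicating.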
@@ -28,7 +28,7 @@ scheme stands, it's equally likely that adding repo3 will make repo1 and
repo2 want to swap files between them. So, we'll want to add some
precautions to avoid a lot of data moving around in this case:
((balanced_amoung(backup) and not (copies=backup:1)) or present
((balanced(backup) and not (copies=backup:1)) or present
So once file lands on a backup drive, it stays there, even if more backup
drives change the balancing.
@@ -74,7 +74,7 @@ a manual/scripted process.
What if we have 5 backup repos and want each file to land in 3 of them?
There's a simple change that can support that:
`balanced_amoung(group:3)`
`balanced(group:3)`
This works the same as before, but rather than just `N mod M`, take
`N+I mod M` where I is [0..2] to get the list of 3 repositories that want a


@@ -21,11 +21,7 @@ repos that had a maxsize recorded, essentially for free.
But 8 seconds is rather a long time to block a `git-annex push`
type command. Which would be needed if any remote's preferred content
expression used `balanced_amoung`.
It would help some to cache the calculated sizes in eq a sqlite db, update
the cache after sending or dropping content, and invalidate the cache when
git-annex branch update merges in a git-annex branch from elsewhere.
expression used the free space information.
Would it be possible to update incrementally from the previous git-annex
branch to the current one? That's essentially what `git-annex log
@@ -39,13 +35,46 @@ particular git-annex branch commit. We don't care about sizes at
intermediate points in time, which that command does calculate.
See [[todo/info_--size-history]] for the subtleties that had to be handled.
In particular, diffing from the previous git-annex branch commit to current may
In particular, comparing the previous git-annex branch commit to current may
yield lines that seem to indicate content was added to a repo, but in fact
that repo already had that content at the previous git-annex branch commit.
So it seems it would have to look up the location log's value at the
previous commit, either querying the git-annex branch or cached state.
that repo already had that content at the previous git-annex branch commit
and another log line was recorded elsewhere redundantly.
So it needs to look at the location log's value at the
previous commit in order to determine if a change to a log should be
counted.
Worst case, that's queries of the location log file for every single key.
If queried from git, that would be slow -- slower than `git-annex info`'s
streaming approach. If they were all cached in a sqlite database, it might
manage to be faster?
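To illustrate why the previous location log value is needed, here is a rough Python sketch that counts a size change only when a repository's presence actually flips between the two commits. It assumes location log lines of the form `timestamp 1|0 uuid` with the newest line per uuid winning; the function names are hypothetical and this is not git-annex's implementation.

```python
def parse_location_log(text: str) -> dict[str, bool]:
    """Map uuid -> present, keeping only the newest line for each uuid."""
    newest: dict[str, tuple[float, bool]] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        ts = float(parts[0].rstrip("s"))
        present = parts[1] == "1"
        uuid = parts[2]
        if uuid not in newest or ts >= newest[uuid][0]:
            newest[uuid] = (ts, present)
    return {uuid: present for uuid, (_, present) in newest.items()}

def size_deltas(old_log: str, new_log: str, key_size: int) -> dict[str, int]:
    """Per-repo size change for one key between two versions of its log."""
    old = parse_location_log(old_log)
    new = parse_location_log(new_log)
    deltas: dict[str, int] = {}
    for uuid in old.keys() | new.keys():
        was, now = old.get(uuid, False), new.get(uuid, False)
        if now and not was:
            deltas[uuid] = key_size      # content newly present
        elif was and not now:
            deltas[uuid] = -key_size     # content dropped
        # A redundant line for an already-present repo changes nothing.
    return deltas
```

With this approach a second "present" line for a repo that already had the content yields no delta, which is the redundant case described above.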
## incremental update via git diff
Could `git diff -U1000000` be used and the patch parsed to get the complete
old and new location log? (Assuming no log file ever reaches a million
lines.) I tried this in my big repo, and even diffing from the first
git-annex branch commit to the last took 7.54 seconds.
Compare that with the method used by `git-annex info`'s size gathering, of
dumping out the content of all files on the branch with `git ls-tree -r
git-annex |awk '{print $3}'|git cat-file --batch --buffer`, which only
takes 3 seconds. So, this is not ideal when diffing to too old a point.
Diffing in my big repo to the git-annex branch from 2020 takes 4 seconds.
... from 3 months ago takes 2 seconds.
... from 1 week ago takes 1 second.
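The patch parsing idea could look roughly like the following Python sketch. It assumes the input is the unified diff for a single log file, produced with enough context (such as `-U1000000`) that the whole file appears in the hunks; the function name `split_patch` is hypothetical.

```python
def split_patch(patch: str) -> tuple[list[str], list[str]]:
    """Rebuild the complete old and new file contents from a full-context patch."""
    old: list[str] = []
    new: list[str] = []
    in_hunk = False
    for line in patch.splitlines():
        if line.startswith("diff --git"):
            in_hunk = False          # file header lines follow, not content
            continue
        if line.startswith("@@"):
            in_hunk = True           # hunk body starts on the next line
            continue
        if not in_hunk or line.startswith("\\"):
            continue                 # skip headers and "\ No newline" notes
        tag, body = line[:1], line[1:]
        if tag in (" ", "-"):
            old.append(body)         # context and removed lines form the old file
        if tag in (" ", "+"):
            new.append(body)         # context and added lines form the new file
    return old, new
```

The two lists are the old and new location logs, which is all that is needed to decide whether a change should be counted.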
## incremental update when merging git-annex branch
When merging git-annex branch changes into .git/annex/index,
it already diffs between the branch and the index and uses `git cat-file`
to get both versions of the file in order to union merge them.
That's essentially the same information needed to do the incremental update
of the repo sizes. So could update sizes at the same time as merging the
git-annex branch. That would be essentially free!
Note that the use of `git cat-file` in union merge is not --buffer
streaming, so is slower than the patch parsing method that was discussed in
the previous section. So it might be possible to speed up git-annex branch
merging using patch parsing.