update
parent 1115fb1f9b
commit ad966e5e7b
2 changed files with 41 additions and 12 deletions
|
@ -7,7 +7,7 @@ that entirely:
|
|||
other repo is not known to contain, but then repos will race and both get
|
||||
the same file, or similarly if they are not communicating frequently.
|
||||
|
||||
So, let's add a new expression: `balanced_amoung(group)`
|
||||
So, let's add a new expression: `balanced(group)`
|
||||
|
||||
This would work by taking the list of uuids of all repositories in the
|
||||
group, and sorting them, which yields a list from 0..M-1 repositories.
|
||||
|
@ -28,7 +28,7 @@ scheme stands, it's equally likely that adding repo3 will make repo1 and
|
|||
repo2 want to swap files between them. So, we'll want to add some
|
||||
precautions to avoid a lot of data moving around in this case:
|
||||
|
||||
(balanced_amoung(backup) and not (copies=backup:1)) or present
|
||||
(balanced(backup) and not (copies=backup:1)) or present
|
||||
|
||||
So once a file lands on a backup drive, it stays there, even if more backup
|
||||
drives change the balancing.
|
||||
|
@ -74,7 +74,7 @@ a manual/scripted process.
|
|||
|
||||
What if we have 5 backup repos and want each file to land in 3 of them?
|
||||
There's a simple change that can support that:
|
||||
`balanced_amoung(group:3)`
|
||||
`balanced(group:3)`
|
||||
|
||||
This works the same as before, but rather than just `N mod M`, take
|
||||
`N+I mod M` where I is [0..2] to get the list of 3 repositories that want a
|
||||
|
|
|
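A minimal sketch of the `N+I mod M` selection, read as `(N+I) mod M`. How N is derived from a key is not shown in this hunk, so the SHA-256 conversion here is an assumption, and `repos_wanting` is a hypothetical name, not git-annex code:

```python
import hashlib


def repos_wanting(key: str, group_uuids: list[str], copies: int = 1) -> list[str]:
    """Pick the repositories that want `key` under a balanced(group:copies) style rule."""
    ordered = sorted(group_uuids)  # sorted uuids give a stable index 0..M-1
    m = len(ordered)
    if m == 0:
        return []
    # Assumption: N is the key hashed in some stable way; SHA-256 is illustrative only.
    n = int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest(), "big")
    # Never ask for more repositories than the group contains.
    count = min(copies, m)
    # (N + I) mod M for I in 0..count-1 yields `count` distinct, consecutive slots.
    return [ordered[(n + i) % m] for i in range(count)]
```

With 5 repositories and `copies=3`, every key maps to 3 distinct repositories, and the result does not depend on the order in which the uuids were listed.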
@ -21,11 +21,7 @@ repos that had a maxsize recorded, essentially for free.
|
|||
|
||||
But 8 seconds is rather a long time to block a `git-annex push`
|
||||
type command. Which would be needed if any remote's preferred content
|
||||
expression used `balanced_amoung`.
|
||||
|
||||
It would help some to cache the calculated sizes in eg a sqlite db, update
|
||||
the cache after sending or dropping content, and invalidate the cache when
|
||||
git-annex branch update merges in a git-annex branch from elsewhere.
|
||||
expression used the free space information.
|
||||
|
||||
Would it be possible to update incrementally from the previous git-annex
|
||||
branch to the current one? That's essentially what `git-annex log
|
||||
|
@ -39,13 +35,46 @@ particular git-annex branch commit. We don't care about sizes at
|
|||
intermediate points in time, which that command does calculate.
|
||||
|
||||
See [[todo/info_--size-history]] for the subtleties that had to be handled.
|
||||
In particular, diffing from the previous git-annex branch commit to current may
|
||||
In particular, comparing the previous git-annex branch commit to current may
|
||||
yield lines that seem to indicate content was added to a repo, but in fact
|
||||
that repo already had that content at the previous git-annex branch commit.
|
||||
So it seems it would have to look up the location log's value at the
|
||||
previous commit, either querying the git-annex branch or cached state.
|
||||
that repo already had that content at the previous git-annex branch commit
|
||||
and another log line was recorded elsewhere redundantly.
|
||||
So it needs to look at the location log's value at the
|
||||
previous commit in order to determine if a change to a log should be
|
||||
counted.
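A rough sketch of that check. It assumes location log lines of the form `<timestamp>s <1|0> <uuid>`, where the newest line per uuid wins; the function names and the `sizes` dict are hypothetical, and other status values are simply treated as not present:

```python
def present_uuids(log_text: str) -> set[str]:
    """Return the uuids that a location log says currently have the content."""
    latest: dict[str, tuple[float, str]] = {}
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        stamp, status, uuid = fields[0], fields[1], fields[2]
        ts = float(stamp.rstrip("s"))
        # Keep only the newest line for each uuid.
        if uuid not in latest or ts >= latest[uuid][0]:
            latest[uuid] = (ts, status)
    return {u for u, (_, status) in latest.items() if status == "1"}


def apply_log_change(sizes: dict[str, int], key_size: int,
                     old_log: str, new_log: str) -> None:
    """Adjust per-repo sizes only where presence really changed between commits."""
    old = present_uuids(old_log)
    new = present_uuids(new_log)
    for uuid in new - old:  # content was actually added
        sizes[uuid] = sizes.get(uuid, 0) + key_size
    for uuid in old - new:  # content was actually dropped
        sizes[uuid] = sizes.get(uuid, 0) - key_size
```

A redundant "present" line for a repo that already had the content leaves `new - old` empty for that repo, so it is not counted twice.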
|
||||
|
||||
Worst case, that's queries of the location log file for every single key.
|
||||
If queried from git, that would be slow -- slower than `git-annex info`'s
|
||||
streaming approach. If they were all cached in a sqlite database, it might
|
||||
manage to be faster?
|
||||
|
||||
## incremental update via git diff
|
||||
|
||||
Could `git diff -U1000000` be used and the patch parsed to get the complete
|
||||
old and new location log? (Assuming no log file ever reaches a million
|
||||
lines.) I tried this in my big repo, and even diffing from the first
|
||||
git-annex branch commit to the last took 7.54 seconds.
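As a sketch of the patch parsing idea: when the context is large enough to cover the whole file, the single hunk contains every line, so the old file is the context plus `-` lines and the new file is the context plus `+` lines. The function name is hypothetical and this only handles the patch for one file:

```python
def old_and_new_from_patch(patch: str) -> tuple[list[str], list[str]]:
    """Rebuild the complete old and new file contents from a full-context unified diff."""
    old: list[str] = []
    new: list[str] = []
    in_hunk = False
    for line in patch.splitlines():
        if line.startswith("diff --git"):
            in_hunk = False  # file header follows, not content
            continue
        if line.startswith("@@"):
            in_hunk = True  # content lines start after the hunk header
            continue
        if not in_hunk:
            continue  # skip "index", "---" and "+++" header lines
        if line.startswith(" "):
            old.append(line[1:])
            new.append(line[1:])
        elif line.startswith("-"):
            old.append(line[1:])
        elif line.startswith("+"):
            new.append(line[1:])
        # Anything else, such as "\ No newline at end of file", is ignored.
    return old, new
```

The two returned lists are the old and new location log for that file, which is what the size update needs.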
|
||||
|
||||
Compare that with the method used by `git-annex info`'s size gathering, of
|
||||
dumping out the content of all files on the branch with `git ls-tree -r
|
||||
git-annex |awk '{print $3}'|git cat-file --batch --buffer`, which only
|
||||
takes 3 seconds. So, this is not ideal when diffing to too old a point.
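For reference, `git cat-file --batch` writes each object as a header line `<oid> <type> <size>`, then the content, then a newline. A minimal reader for that stream might look like the following; the function name is hypothetical and the sample bytes are made up, not real objects:

```python
import io
from typing import BinaryIO, Iterator


def iter_batch_objects(stream: BinaryIO) -> Iterator[tuple[str, str, bytes]]:
    """Yield (oid, type, content) for each object in `git cat-file --batch` output."""
    while True:
        header = stream.readline()
        if not header:
            return  # end of stream
        parts = header.split()
        if len(parts) == 2 and parts[1] == b"missing":
            continue  # git reports "<oid> missing" for unknown objects
        oid, objtype, size = parts[0], parts[1], int(parts[2])
        content = stream.read(size)
        stream.read(1)  # consume the newline that follows the content
        yield oid.decode(), objtype.decode(), content


# Usage with made-up data in place of a real git pipe.
sample = io.BytesIO(b"abc blob 5\nhello\n")
objects = list(iter_batch_objects(sample))
```

In real use the stream would be the stdout of the `git cat-file` process rather than an in-memory buffer.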
|
||||
|
||||
Diffing in my big repo to the git-annex branch from 2020 takes 4 seconds.
|
||||
... from 3 months ago takes 2 seconds.
|
||||
... from 1 week ago takes 1 second.
|
||||
|
||||
## incremental update when merging git-annex branch
|
||||
|
||||
When merging git-annex branch changes into .git/annex/index,
|
||||
it already diffs between the branch and the index and uses `git cat-file`
|
||||
to get both versions of the file in order to union merge them.
|
||||
|
||||
That's essentially the same information needed to do the incremental update
|
||||
of the repo sizes. So could update sizes at the same time as merging the
|
||||
git-annex branch. That would be essentially free!
|
||||
|
||||
Note that the use of `git cat-file` in union merge is not --buffer
|
||||
streaming, so is slower than the patch parsing method that was discussed in
|
||||
the previous section. So it might be possible to speed up git-annex branch
|
||||
merging using patch parsing.
|
||||
|
|