update

parent 1115fb1f9b
commit ad966e5e7b
2 changed files with 41 additions and 12 deletions
@@ -7,7 +7,7 @@ that entirely:
 other repo is not know to contain, but then repos will race and both get
 the same file, or similarly if they are not communicating frequently.
 
-So, let's add a new expression: `balanced_amoung(group)`
+So, let's add a new expression: `balanced(group)`
 
 This would work by taking the list of uuids of all repositories in the
 group, and sorting them, which yields a list from 0..M-1 repositories.
@@ -28,7 +28,7 @@ scheme stands, it's equally likely that adding repo3 will make repo1 and
 repo2 want to swap files between them. So, we'll want to add some
 precautions to avoid a lot of data moving around in this case:
 
-((balanced_amoung(backup) and not (copies=backup:1)) or present
+((balanced(backup) and not (copies=backup:1)) or present
 
 So once file lands on a backup drive, it stays there, even if more backup
 drives change the balancing.
@@ -74,7 +74,7 @@ a manual/scripted process.
 
 What if we have 5 backup repos and want each file to land in 3 of them?
 There's a simple change that can support that:
-`balanced_amoung(group:3)`
+`balanced(group:3)`
 
 This works the same as before, but rather than just `N mod M`, take
 `N+I mod M` where I is [0..2] to get the list of 3 repositories that want a
|
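The selection rule in the hunks above (sort the group's uuids, then take `N+I mod M` for each of the wanted copies) could be sketched roughly as follows. This is only an illustration in Python, not git-annex's implementation. The diff does not show how `N` is derived from a key, so hashing the key name with SHA-256 is purely an assumption here, and the function name `balanced_repos` is hypothetical.

```python
import hashlib


def balanced_repos(key, group_uuids, copies=1):
    """Pick which repositories in a group would want a key.

    Hypothetical sketch: sorts the uuids so every repository computes the
    same ordering, then takes (N + I) mod M for I in [0 .. copies-1].
    """
    ordered = sorted(group_uuids)  # stable list of M repositories
    m = len(ordered)
    if m == 0:
        return []
    # ASSUMPTION: N comes from a hash of the key name. The diff does not
    # say how N is actually computed.
    n = int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest(), "big")
    # Never ask for more copies than there are repositories in the group.
    wanted = min(copies, m)
    return [ordered[(n + i) % m] for i in range(wanted)]
```

Calling `balanced_repos(key, uuids, copies=3)` with 5 uuids returns 3 distinct uuids that are adjacent (wrapping around) in the sorted list, which is the `balanced(group:3)` case described above.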
@@ -21,11 +21,7 @@ repos that had a maxsize recorded, essentially for free.
 
 But 8 seconds is rather a long time to block a `git-annex push`
 type command. Which would be needed if any remote's preferred content
-expression used `balanced_amoung`.
+expression used the free space information.
 
-It would help some to cache the calculated sizes in eq a sqlite db, update
-the cache after sending or dropping content, and invalidate the cache when
-git-annex branch update merges in a git-annex branch from elsewhere.
-
 Would it be possible to update incrementally from the previous git-annex
 branch to the current one? That's essentially what `git-annex log
@@ -39,13 +35,46 @@ particular git-annex branch commit. We don't care about sizes at
 intermediate points in time, which that command does calculate.
 
 See [[todo/info_--size-history]] for the subtleties that had to be handled.
-In particular, diffing from the previous git-annex branch commit to current may
+In particular, compating the previous git-annex branch commit to current may
 yield lines that seem to indicate content was added to a repo, but in fact
-that repo already had that content at the previous git-annex branch commit.
-So it seems it would have to look up the location log's value at the
-previous commit, either querying the git-annex branch or cached state.
+that repo already had that content at the previous git-annex branch commit
+and another log line was recorded elsewhere redundantly.
+So it needs to look at the location log's value at the
+previous commit in order to determine if a change to a log should be
+counted.
 
 Worst case, that's queries of the location log file for every single key.
 If queried from git, that would be slow -- slower than `git-annex info`'s
 streaming approach. If they were all cached in a sqlite database, it might
 manage to be faster?
+
+## incremental update via git diff
+
+Could `git diff -U1000000` be used and the patch parsed to get the complete
+old and new location log? (Assuming no log file ever reaches a million
+lines.) I tried this in my big repo, and even diffing from the first
+git-annex branch commit to the last took 7.54 seconds.
+
+Compare that with the method used by `git-annex info`'s size gathering, of
+dumping out the content of all files on the branch with `git ls-tree -r
+git-annex |awk '{print $3}'|git cat-file --batch --buffer`, which only
+takes 3 seconds. So, this is not ideal when diffing to too old a point.
+
+Diffing in my big repo to the git-annex branch from 2020 takes 4 seconds.
+... from 3 months ago takes 2 seconds.
+... from 1 week ago takes 1 second.
+
+## incremental update when merging git-annex branch
+
+When merging git-annex branch changes into .git/annex/index,
+it already diffs between the branch and the index and uses `git cat-file`
+to get both versions of the file in order to union merge them.
+
+That's essentially the same information needed to do the incremental update
+of the repo sizes. So could update sizes at the same time as merging the
+git-annex branch. That would be essentially free!
+
+Note that the use of `git cat-file` in union merge is not --buffer
+streaming, so is slower than the patch parsing method that was discussed in
+the previous section. So it might be possible to speed up git-annex branch
+merging using patch parsing.
|
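The counting rule added in this commit (a change to a location log only counts if the repository's state at the previous commit actually differs) can be sketched together with the patch-parsing idea. The Python below is a minimal illustration under stated assumptions, not git-annex's code. It assumes each location log line has the form `<timestamp>s <status> <uuid>` with status `1` meaning present, that the hunk body comes from a full-context diff such as `git diff -U1000000`, and that the key's size is already known. All function names are hypothetical.

```python
def split_full_context_hunk(hunk_lines):
    """Rebuild the complete old and new file from one full-context hunk body.

    Context lines (' ') belong to both versions, '-' lines only to the old
    one and '+' lines only to the new one.
    """
    old, new = [], []
    for line in hunk_lines:
        if line.startswith("\\"):  # "\ No newline at end of file" marker
            continue
        tag, text = (line[:1] or " "), line[1:]
        if tag in (" ", "-"):
            old.append(text)
        if tag in (" ", "+"):
            new.append(text)
    return old, new


def present_uuids(log_lines):
    """Return the uuids whose newest log line says the content is present.

    ASSUMPTION: lines look like "<timestamp>s <status> <uuid>".
    """
    latest = {}
    for line in log_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip anything that does not look like a log line
        timestamp = float(fields[0].rstrip("s"))
        status, uuid = fields[1], fields[2]
        if uuid not in latest or timestamp >= latest[uuid][0]:
            latest[uuid] = (timestamp, status)
    return {uuid for uuid, (_, status) in latest.items() if status == "1"}


def size_deltas(old_log, new_log, key_size):
    """Per-repository size change for one key between two log versions.

    A redundant "present" line for a repository that already had the
    content produces no delta, because its state did not change.
    """
    before = present_uuids(old_log)
    after = present_uuids(new_log)
    deltas = {uuid: key_size for uuid in after - before}
    deltas.update({uuid: -key_size for uuid in before - after})
    return deltas
```

For a hunk in which one repository gets a second, redundant "present" line and another repository newly gains the content, `size_deltas` reports a change only for the second repository. The same two inputs (old and new log) are what the union merge already fetches, so the function could equally be fed from there.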