update

2024-03-08 13:43:31 -04:00 · ad966e5e7b
commit ad966e5e7b
parent 1115fb1f9b
2 changed files with 41 additions and 12 deletions
--- a/doc/design/balanced_preferred_content.mdwn
+++ b/doc/design/balanced_preferred_content.mdwn
@@ -7,7 +7,7 @@ that entirely:
  other repo is not known to contain, but then repos will race and both get
  the same file, or similarly if they are not communicating frequently.
-So, let's add a new expression: `balanced_amoung(group)`
+So, let's add a new expression: `balanced(group)`
 This would work by taking the list of uuids of all repositories in the
 group, and sorting them, which yields a list from 0..M-1 repositories.
@@ -28,7 +28,7 @@ scheme stands, it's equally likely that adding repo3 will make repo1 and
 repo2 want to swap files between them. So, we'll want to add some
 precautions to avoid a lot of data moving around in this case:
-	((balanced_amoung(backup) and not (copies=backup:1)) or present
+	((balanced(backup) and not (copies=backup:1)) or present
 So once a file lands on a backup drive, it stays there, even if more backup
 drives change the balancing.
@@ -74,7 +74,7 @@ a manual/scripted process.
 What if we have 5 backup repos and want each file to land in 3 of them?
 There's a simple change that can support that:
-`balanced_amoung(group:3)`
+`balanced(group:3)`
 This works the same as before, but rather than just `N mod M`, take
 `N+I mod M` where I is [0..2] to get the list of 3 repositories that want a
--- a/doc/todo/track_free_space_in_repos_via_git-annex_branch.mdwn
+++ b/doc/todo/track_free_space_in_repos_via_git-annex_branch.mdwn
@@ -21,11 +21,7 @@ repos that had a maxsize recorded, essentially for free.
 But 8 seconds is rather a long time to block a `git-annex push`
 type command. Which would be needed if any remote's preferred content
-expression used `balanced_amoung`.
+expression used the free space information.
 It would help some to cache the calculated sizes in e.g. a sqlite db, update
 the cache after sending or dropping content, and invalidate the cache when
 git-annex branch update merges in a git-annex branch from elsewhere.
 Would it be possible to update incrementally from the previous git-annex
 branch to the current one? That's essentially what `git-annex log
@@ -39,13 +35,46 @@ particular git-annex branch commit. We don't care about sizes at
 intermediate points in time, which that command does calculate.
 See [[todo/info_--size-history]] for the subtleties that had to be handled.
-In particular, diffing from the previous git-annex branch commit to current may
+In particular, comparing the previous git-annex branch commit to current may
 yield lines that seem to indicate content was added to a repo, but in fact
-that repo already had that content at the previous git-annex branch commit.
+that repo already had that content at the previous git-annex branch commit
-So it seems it would have to look up the location log's value at the 
+and another log line was recorded elsewhere redundantly.
-previous commit, either querying the git-annex branch or cached state.
+So it needs to look at the location log's value at the 
 previous commit in order to determine if a change to a log should be
 counted.
 Worst case, that's queries of the location log file for every single key.
 If queried from git, that would be slow -- slower than `git-annex info`'s
 streaming approach. If they were all cached in a sqlite database, it might
 manage to be faster?
 ## incremental update via git diff
 Could `git diff -U1000000` be used and the patch parsed to get the complete
 old and new location log? (Assuming no log file ever reaches a million
 lines.) I tried this in my big repo, and even diffing from the first
 git-annex branch commit to the last took 7.54 seconds. 
 Compare that with the method used by `git-annex info`'s size gathering, of
 dumping out the content of all files on the branch with `git ls-tree -r
 git-annex |awk '{print $3}'|git cat-file --batch --buffer`, which only
 takes 3 seconds. So, this is not ideal when diffing to too old a point.
 Diffing in my big repo to the git-annex branch from 2020 takes 4 seconds.  
 ... from 3 months ago takes 2 seconds.  
 ... from 1 week ago takes 1 second.  
 ## incremental update when merging git-annex branch
 When merging git-annex branch changes into .git/annex/index, 
 it already diffs between the branch and the index and uses `git cat-file`
 to get both versions of the file in order to union merge them.
 That's essentially the same information needed to do the incremental update
 of the repo sizes. So could update sizes at the same time as merging the
 git-annex branch. That would be essentially free!
 Note that the use of `git cat-file` in union merge is not --buffer
 streaming, so is slower than the patch parsing method that was discussed in
 the previous section. So it might be possible to speed up git-annex branch
 merging using patch parsing.
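
The incremental update described above can be sketched roughly as follows. This is only an illustration in Python, not git-annex's actual code. It assumes each location log line has the form `timestamp status uuid`, where the timestamp may carry a trailing `s`, status `1` means present, and the newest line per uuid wins. The helper names `present_uuids` and `update_sizes` are hypothetical. The point it shows is that sizes are adjusted by comparing the set of repos holding the key at the old and new versions of the log, so a redundant "present" line does not get counted twice.

```python
from typing import Dict, Set


def present_uuids(log_text: str) -> Set[str]:
    """Return the uuids that hold the content according to a location log.

    Assumed line format (an assumption of this sketch):
        <timestamp>[s] <status> <uuid>
    Only status "1" counts as present; the newest line per uuid wins.
    """
    latest = {}  # uuid -> (timestamp, status)
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        ts = float(parts[0].rstrip("s"))
        status, uuid = parts[1], parts[2]
        if uuid not in latest or ts >= latest[uuid][0]:
            latest[uuid] = (ts, status)
    return {u for u, (_, s) in latest.items() if s == "1"}


def update_sizes(sizes: Dict[str, int], key_size: int,
                 old_log: str, new_log: str) -> Dict[str, int]:
    """Adjust per-repo sizes for one key, given its old and new location log.

    A repo is only credited if it did not already have the content at the
    previous commit, and only debited if it had it and no longer does.
    """
    old = present_uuids(old_log)
    new = present_uuids(new_log)
    for u in new - old:
        sizes[u] = sizes.get(u, 0) + key_size
    for u in old - new:
        sizes[u] = sizes.get(u, 0) - key_size
    return sizes


# Example: repoA already had the key; a redundant line for repoA was
# recorded elsewhere, and repoB newly received the key.
old_log = "1700000000.0s 1 repoA\n"
new_log = ("1700000000.0s 1 repoA\n"
           "1700000500.0s 1 repoA\n"
           "1700000600.0s 1 repoB\n")
sizes = update_sizes({"repoA": 100}, 50, old_log, new_log)
```

Because the comparison is between whole old and new logs rather than individual added lines, this needs both versions of each changed log file, which is what the union merge already fetches with `git cat-file`.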