update

2024-03-08 13:43:31 -04:00 · ad966e5e7b
commit ad966e5e7b
parent 1115fb1f9b
2 changed files with 41 additions and 12 deletions
--- a/doc/design/balanced_preferred_content.mdwn
+++ b/doc/design/balanced_preferred_content.mdwn
@@ -7,7 +7,7 @@ that entirely:
  other repo is not know to contain, but then repos will race and both get
  the same file, or similarly if they are not communicating frequently.

-So, let's add a new expression: `balanced_amoung(group)`
+So, let's add a new expression: `balanced(group)`

 This would work by taking the list of uuids of all repositories in the
 group, and sorting them, which yields a list from 0..M-1 repositories.
@@ -28,7 +28,7 @@ scheme stands, it's equally likely that adding repo3 will make repo1 and
 repo2 want to swap files between them. So, we'll want to add some
 precautions to avoid a lot of data moving around in this case:

-	((balanced_amoung(backup) and not (copies=backup:1)) or present
+	((balanced(backup) and not (copies=backup:1)) or present

 So once file lands on a backup drive, it stays there, even if more backup
 drives change the balancing.
@@ -74,7 +74,7 @@ a manual/scripted process.

 What if we have 5 backup repos and want each file to land in 3 of them?
 There's a simple change that can support that:
-`balanced_amoung(group:3)`
+`balanced(group:3)`

 This works the same as before, but rather than just `N mod M`, take
 `N+I mod M` where I is [0..2] to get the list of 3 repositories that want a
--- a/doc/todo/track_free_space_in_repos_via_git-annex_branch.mdwn
+++ b/doc/todo/track_free_space_in_repos_via_git-annex_branch.mdwn
@@ -21,11 +21,7 @@ repos that had a maxsize recorded, essentially for free.

 But 8 seconds is rather a long time to block a `git-annex push`
 type command. Which would be needed if any remote's preferred content
-expression used `balanced_amoung`.
-
-It would help some to cache the calculated sizes in eq a sqlite db, update
-the cache after sending or dropping content, and invalidate the cache when
-git-annex branch update merges in a git-annex branch from elsewhere.
+expression used the free space information.

 Would it be possible to update incrementally from the previous git-annex
 branch to the current one? That's essentially what `git-annex log
@@ -39,13 +35,46 @@ particular git-annex branch commit. We don't care about sizes at
 intermediate points in time, which that command does calculate.

 See [[todo/info_--size-history]] for the subtleties that had to be handled.
-In particular, diffing from the previous git-annex branch commit to current may
+In particular, comparing the previous git-annex branch commit to current may
 yield lines that seem to indicate content was added to a repo, but in fact
-that repo already had that content at the previous git-annex branch commit.
-So it seems it would have to look up the location log's value at the 
-previous commit, either querying the git-annex branch or cached state.
+that repo already had that content at the previous git-annex branch commit
+and another log line was recorded elsewhere redundantly.
+So it needs to look at the location log's value at the 
+previous commit in order to determine if a change to a log should be
+counted.

 Worst case, that's queries of the location log file for every single key.
 If queried from git, that would be slow -- slower than `git-annex info`'s
 streaming approach. If they were all cached in a sqlite database, it might
 manage to be faster?
+
+## incremental update via git diff
+
+Could `git diff -U1000000` be used and the patch parsed to get the complete
+old and new location log? (Assuming no log file ever reaches a million
+lines.) I tried this in my big repo, and even diffing from the first
+git-annex branch commit to the last took 7.54 seconds. 
+
+Compare that with the method used by `git-annex info`'s size gathering, of
+dumping out the content of all files on the branch with `git ls-tree -r
+git-annex |awk '{print $3}'|git cat-file --batch --buffer`, which only
+takes 3 seconds. So, this is not ideal when diffing to too old a point.
+
+Diffing in my big repo to the git-annex branch from 2020 takes 4 seconds.  
+... from 3 months ago takes 2 seconds.  
+... from 1 week ago takes 1 second.  
+
+## incremental update when merging git-annex branch
+
+When merging git-annex branch changes into .git/annex/index, 
+it already diffs between the branch and the index and uses `git cat-file`
+to get both versions of the file in order to union merge them.
+
+That's essentially the same information needed to do the incremental update
+of the repo sizes. So could update sizes at the same time as merging the
+git-annex branch. That would be essentially free!
+
+Note that the use of `git cat-file` in union merge is not --buffer
+streaming, so is slower than the patch parsing method that was discussed in
+the previous section. So it might be possible to speed up git-annex branch
+merging using patch parsing.
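
The incremental update discussed in the todo item can be illustrated with a minimal sketch. It is not git-annex's implementation. It assumes a simplified location log format of `timestamp status uuid` per line (status `1` meaning present), and the function names `present_repos` and `size_deltas` are hypothetical. Given the old and new versions of one key's location log, it counts a size change only when a repository's presence actually changed, so a redundant log line for content the repo already had contributes nothing.

```python
def present_repos(log_text):
    """Return the set of repo uuids whose newest log line says 'present'.

    Assumes each line is "<timestamp>s <status> <uuid>", which is a
    simplification made for this sketch.
    """
    latest = {}
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or unrecognised lines
        timestamp, status, uuid = parts
        ts = float(timestamp.rstrip("s"))
        # Keep only the most recent entry for each repository.
        if uuid not in latest or ts >= latest[uuid][0]:
            latest[uuid] = (ts, status == "1")
    return {uuid for uuid, (_, present) in latest.items() if present}


def size_deltas(old_log, new_log, key_size):
    """Map repo uuid -> change in bytes between two versions of a key's log."""
    before = present_repos(old_log)
    after = present_repos(new_log)
    deltas = {}
    for uuid in after - before:
        deltas[uuid] = key_size    # content newly added to this repo
    for uuid in before - after:
        deltas[uuid] = -key_size   # content dropped from this repo
    return deltas


if __name__ == "__main__":
    old = "1.0s 1 repoA\n"
    # repoA gets a redundant "present" line; repoB really gains the content.
    new = "1.0s 1 repoA\n2.0s 1 repoA\n3.0s 1 repoB\n"
    print(size_deltas(old, new, 100))
```

Comparing presence sets rather than counting added log lines is what lets the redundant `repoA` line be ignored. The old and new log contents could come either from parsing a wide-context patch or from the two versions already fetched during a union merge, as the two sections of the todo item describe.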