update

2024-03-08 13:43:31 -04:00 · 2024-03-08 13:43:31 -04:00 · ad966e5e7b
commit ad966e5e7b
parent 1115fb1f9b
2 changed files with 41 additions and 12 deletions
--- a/doc/todo/track_free_space_in_repos_via_git-annex_branch.mdwn
+++ b/doc/todo/track_free_space_in_repos_via_git-annex_branch.mdwn
@@ -21,11 +21,7 @@ repos that had a maxsize recorded, essentially for free.

 But 8 seconds is rather a long time to block a `git-annex push`
 type command. Which would be needed if any remote's preferred content
-expression used `balanced_amoung`.
-
-It would help some to cache the calculated sizes in eq a sqlite db, update
-the cache after sending or dropping content, and invalidate the cache when
-git-annex branch update merges in a git-annex branch from elsewhere.
+expression used the free space information.

 Would it be possible to update incrementally from the previous git-annex
 branch to the current one? That's essentially what `git-annex log
@@ -39,13 +35,46 @@ particular git-annex branch commit. We don't care about sizes at
 intermediate points in time, which that command does calculate.

 See [[todo/info_--size-history]] for the subtleties that had to be handled.
-In particular, diffing from the previous git-annex branch commit to current may
+In particular, comparing the previous git-annex branch commit to current may
 yield lines that seem to indicate content was added to a repo, but in fact
-that repo already had that content at the previous git-annex branch commit.
-So it seems it would have to look up the location log's value at the 
-previous commit, either querying the git-annex branch or cached state.
+that repo already had that content at the previous git-annex branch commit
+and another log line was recorded elsewhere redundantly.
+So it needs to look at the location log's value at the 
+previous commit in order to determine if a change to a log should be
+counted.

 Worst case, that's queries of the location log file for every single key.
 If queried from git, that would be slow -- slower than `git-annex info`'s
 streaming approach. If they were all cached in a sqlite database, it might
 manage to be faster?
+
+## incremental update via git diff
+
+Could `git diff -U1000000` be used and the patch parsed to get the complete
+old and new location log? (Assuming no log file ever reaches a million
+lines.) I tried this in my big repo, and even diffing from the first
+git-annex branch commit to the last took 7.54 seconds. 
+
+Compare that with the method used by `git-annex info`'s size gathering, of
+dumping out the content of all files on the branch with `git ls-tree -r
+git-annex |awk '{print $3}'|git cat-file --batch --buffer`, which only
+takes 3 seconds. So, this is not ideal when diffing to too old a point.
+
+Diffing in my big repo to the git-annex branch from 2020 takes 4 seconds.  
+... from 3 months ago takes 2 seconds.  
+... from 1 week ago takes 1 second.  
+
+## incremental update when merging git-annex branch
+
+When merging git-annex branch changes into .git/annex/index, 
+it already diffs between the branch and the index and uses `git cat-file`
+to get both versions of the file in order to union merge them.
+
+That's essentially the same information needed to do the incremental update
+of the repo sizes. So could update sizes at the same time as merging the
+git-annex branch. That would be essentially free!
+
+Note that the use of `git cat-file` in union merge is not --buffer
+streaming, so is slower than the patch parsing method that was discussed in
+the previous section. So it might be possible to speed up git-annex branch
+merging using patch parsing.
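
A minimal sketch of the incremental update discussed in this change, not git-annex's actual implementation. It assumes the body of a single full-context hunk (as `git diff -U1000000` would produce for one location log file) has already been extracted, and that each location log line has the form `<timestamp>s <status> <uuid>`, with status `1` meaning the content is present. The function names, the placeholder UUIDs `A` and `B`, and the `key_size` parameter are hypothetical.

```python
def split_hunk(hunk_body):
    """Rebuild the old and new versions of a file from one unified-diff
    hunk body that covers the whole file (full context)."""
    old, new = [], []
    for line in hunk_body.splitlines():
        if line.startswith("\\"):  # "\ No newline at end of file" marker
            continue
        tag, text = line[:1], line[1:]
        if tag in (" ", "-"):      # context and removed lines are in the old file
            old.append(text)
        if tag in (" ", "+"):      # context and added lines are in the new file
            new.append(text)
    return "\n".join(old), "\n".join(new)


def present_repos(log_text):
    """Return the set of repo UUIDs whose newest log line says 'present'.
    Assumed line format: '<timestamp>s <status> <uuid>'."""
    latest = {}
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        timestamp = float(fields[0].rstrip("s"))
        status, uuid = fields[1], fields[2]
        if uuid not in latest or timestamp >= latest[uuid][0]:
            latest[uuid] = (timestamp, status)
    return {uuid for uuid, (_, status) in latest.items() if status == "1"}


def size_deltas(old_log, new_log, key_size):
    """Per-repo size change for one key between two versions of its log.
    A redundant 'present' line for a repo that already had the content
    yields no delta, because only the before/after states are compared."""
    before, after = present_repos(old_log), present_repos(new_log)
    deltas = {uuid: key_size for uuid in after - before}
    deltas.update({uuid: -key_size for uuid in before - after})
    return deltas


# Repo A already had the key; the new log adds a redundant line for A
# and a genuinely new line for B.
hunk = " 1.0s 1 A\n+2.0s 1 A\n+3.0s 1 B\n"
old_log, new_log = split_hunk(hunk)
print(size_deltas(old_log, new_log, 100))
```

Only repo `B` is counted here. Comparing the two resolved states, rather than counting added log lines, is what avoids double-counting the redundant line for `A`.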