git-annex/doc/todo/speed_up_annex.adjustedbranchrefresh.mdwn
2020-11-23 13:00:20 -04:00

The --hide-missing and --unlock-present adjusted branches
depend on whether the content of a file is present. So after a get or drop,
the branch needs to be updated to reflect the change.

To avoid the user needing to do it manually, the
annex.adjustedbranchrefresh config automates it now. That is not yet
enabled by default, because it can be slow.

Can it be done efficiently enough to become default?
And how frequently can it update without being too slow?
After every file or after some number of files, or after processing
all files in a (sub-)tree, or only once at the end of a command?

The config was designed as a numeric scale with the hope that it might
later become a simple boolean if it got fast enough. Since git config
treats 0 as false and any value > 0 as true, the old values will still work.
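
The numeric-to-boolean compatibility can be checked against git's own boolean parsing. This is a minimal sketch, assuming a throwaway config file (the `cfg` variable is only for illustration). It shows how `git config --bool` reads the values, not how git-annex itself interprets the setting.

```shell
# Throwaway config file, so no real repository settings are touched.
cfg=$(mktemp)

# A numeric value, as set under the current scale.
git config --file "$cfg" annex.adjustedbranchrefresh 5
five_as_bool=$(git config --file "$cfg" --bool annex.adjustedbranchrefresh)

# Zero, the "false" end of the scale.
git config --file "$cfg" annex.adjustedbranchrefresh 0
zero_as_bool=$(git config --file "$cfg" --bool annex.adjustedbranchrefresh)

echo "5 reads as $five_as_bool, 0 reads as $zero_as_bool"
rm -f "$cfg"
```
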

Investigation of the obvious things that make it slow follows:

## efficient branch adjustment

updateAdjustedBranch re-adjusts the whole branch. That is inefficient for a
branch with a large number of files in it. We need a way to incrementally
adjust part of the branch.

(Or maybe not *need* because maybe it's acceptable for it to be slow; slow
is still better than manual, probably.)

Git.Tree.adjustTree recursively lists the tree and applies
the adjustment to each item in it. What's needed is a way to
adjust only the subtree containing a file, and then use
Git.Tree.graftTree to graft that into the existing tree. Seems quite
doable!

graftTree does use getTree, so buffers the whole tree object in memory.
adjustTree avoids doing that. I think it's probably ok though; git probably
also buffers tree objects in memory. Only the tree objects down to the
graft point need to be buffered.

Oh, also, it can cache a few tree objects to speed
this up more. Eg, after dropping foo/bar/1, buffer the tree objects for
foo and foo/bar, and reuse them when dropping foo/bar/2.
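
The subtree-plus-graft idea can be sketched with git plumbing alone. This is not Git.Tree.graftTree, only an illustration of the same shape: `graft_remove` and `subtree_of` are hypothetical helper names, and the sketch assumes plain filenames and a directory that keeps at least one other entry after the removal.

```shell
repo=$(mktemp -d)
git init --quiet "$repo"
cd "$repo"

# Small example tree: two files under foo/bar, one unrelated subtree.
mkdir -p foo/bar other
echo 1 > foo/bar/1
echo 2 > foo/bar/2
echo x > other/x
git add .
old_root=$(git write-tree)

# subtree_of ROOT DIR: print the tree object for DIR inside ROOT.
subtree_of() {
    if [ "$2" = . ]; then echo "$1"; else git rev-parse "$1:$2"; fi
}

# graft_remove ROOT PATH: print a new root tree that lacks PATH.
# Only the trees on the way down to PATH are listed and rewritten.
# Assumes filenames without tabs, newlines or characters git would quote.
graft_remove() {
    g_root=$1
    g_dir=$(dirname "$2")
    g_name=$(basename "$2")
    # Rewrite the tree that directly contains the file.
    # ($2 "") forces a string comparison, so "1" and "01" stay distinct.
    g_new=$(git ls-tree "$(subtree_of "$g_root" "$g_dir")" |
        awk -F '\t' -v n="$g_name" '($2 "") != n' | git mktree)
    # Graft the rewritten tree into each parent, up to the root.
    while [ "$g_dir" != . ]; do
        g_name=$(basename "$g_dir")
        g_dir=$(dirname "$g_dir")
        g_new=$(git ls-tree "$(subtree_of "$g_root" "$g_dir")" |
            awk -F '\t' -v n="$g_name" -v s="$g_new" \
                '($2 "") == n { print "040000 tree " s "\t" n; next } { print }' |
            git mktree)
    done
    echo "$g_new"
}

new_root=$(graft_remove "$old_root" foo/bar/1)
git ls-tree -r --name-only "$new_root"
```

The unrelated `other` subtree is never listed and keeps its object id in the new root. A cache of the trees for foo and foo/bar, as suggested above, would take the place of the repeated `subtree_of` lookups.
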

## efficient checkout

updateAdjustedBranch checks out the new version of the branch.

git checkout needs to diff the old and new tree, and while git is fast,
it's not magic. Consider if, after every file dropped, we checked out a new
branch that did not contain that file. Dropping can be quite fast,
thousands of files per second. How much would those git checkouts
slow it down?

I benchmarked to find out. (On an SSD)

    for x in $(seq 1 10000); do echo $x > $x; git add $x; git commit -m add; done
    time while git checkout --quiet HEAD^; do : ; done

    real    5m59.489s
    user    4m34.775s
    sys     3m39.665s

That is about 36 ms per checkout, far slower than the drops themselves.

Seems like git-annex could do the equivalent of checkout more quickly,
since it knows the part of the branch that changed. Eg, git rm the file,
and update the sha1 that HEAD points to.
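
A minimal sketch of that shortcut, using only git plumbing in a throwaway repository. The identity, file names and commit message are placeholders, and a real adjusted branch commit would be built by git-annex with its own parentage, so this only illustrates skipping the checkout.

```shell
repo=$(mktemp -d)
git init --quiet "$repo"
cd "$repo"
# Identity for the throwaway commits below.
export GIT_AUTHOR_NAME=example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=example GIT_COMMITTER_EMAIL=example@example.com

for x in 1 2 3; do echo "$x" > "$x"; done
git add .
git commit --quiet -m add

# "Drop" file 2: remove it from the work tree and the index directly...
git rm --quiet 2
# ...record the resulting tree as a new commit...
tree=$(git write-tree)
commit=$(git commit-tree "$tree" -p "$(git rev-parse HEAD)" -m "adjusted")
# ...and move the current branch to it, without running git checkout.
git update-ref HEAD "$commit"

# The work tree, index and HEAD now agree, so nothing is reported.
git status --short
```
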

## dangling git objects

Incremental adjusting would generate new trees each time, and a new commit.
If it's done a lot, those will pile up. They get deleted by gc, but
normally would only be deleted after some time. Perhaps git-annex could
delete them itself. (At least if they remain loose git objects; deleting
them once they reach a pack file seems hard.)

(The tree object buffer mentioned above suggests an approach: when an
object in the buffer gets replaced with another object for the same
tree position, delete the old object from .git/objects.)
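
Deleting a loose object amounts to removing one file. A minimal sketch in a throwaway repository, assuming the object is still loose and that nothing else references it (the sketch does not check the latter); the variable names are illustrative:

```shell
repo=$(mktemp -d)
git init --quiet "$repo"
cd "$repo"

# Create a tree object that nothing references, like a superseded
# tree from an earlier incremental adjustment.
blob=$(echo content | git hash-object -w --stdin)
old_tree=$(printf '100644 blob %s\tfile\n' "$blob" | git mktree)

# A loose object lives at objects/<first two hex digits>/<the rest>.
objdir=$(git rev-parse --git-path objects)
loose="$objdir/$(echo "$old_tree" | cut -c1-2)/$(echo "$old_tree" | cut -c3-)"

# Delete it only while it is still loose; once packed, this file is gone.
if [ -f "$loose" ]; then
    rm -f "$loose"
fi

# cat-file -e exits non-zero when the object no longer exists.
if git cat-file -e "$old_tree" 2>/dev/null; then
    echo "object still present"
else
    echo "object removed"
fi
```
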

A git repo containing 10000 commits, starting with 10000 files and ending
with 0, deleting 1 file per commit, weighs in at 99 MB. That's with
short filenames; more typical filenames would add to it. (It packs down to
9 MB, but that's just extra work for objects that will get gc'ed away
eventually.)

## slowness in writing git objects

git-annex uses git commit-tree to create new commits on the adjusted
branch. That is not batched, so running once per file may get slow.

And to write trees, it uses git mktree --batch. But, a new process is
started each time by Git.Tree.adjustTree (and other things).
Making that a long-running process would speed it up, probably.
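
To illustrate the difference, here is a minimal sketch in a throwaway repository that writes two trees with one process each and then with a single `git mktree --batch` process. The entry names `a` and `b` are placeholders, and how git-annex would keep such a process open is not shown.

```shell
repo=$(mktemp -d)
git init --quiet "$repo"
cd "$repo"
blob=$(echo content | git hash-object -w --stdin)

# One process per tree, as when a new git mktree is started each time.
tree_a=$(printf '100644 blob %s\ta\n' "$blob" | git mktree)
tree_b=$(printf '100644 blob %s\tb\n' "$blob" | git mktree)

# One process for both trees: in --batch mode a blank line ends a tree,
# and one object id is printed per tree.
batched=$(
    {
        printf '100644 blob %s\ta\n' "$blob"
        printf '\n'
        printf '100644 blob %s\tb\n' "$blob"
        printf '\n'
    } | git mktree --batch
)
echo "$batched"
```

Both approaches produce the same object ids; the batch form just avoids the per-tree process startup.
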

## queue flush slowness

The refresh uses Annex.Queue.flush to avoid a problem with dropping files:
the drop depopulates the pointer file, which git sees as modified until
restaged, and that prevents checking out the version of the branch where
the pointer file is a symlink. There must be a less expensive way to handle
that.

(All the extra "(recording state in git...)" messages when dropping
are due to it doing that too.)