git-annex/doc/todo/speed_up_annex.adjustedbranchrefresh.mdwn


2020-11-13 19:50:35 +00:00
The --hide-missing and --unlock-present adjusted branches
depend on whether the content of a file is present. So after a get or drop,
the branch needs to be updated to reflect the change. Currently this has to
be done manually, although git-annex sync does do it at the end. Can the
update be automated?

Of course, it could be updated on shutdown whenever content was transferred
or dropped. The question really is, can it be done efficiently enough that
it makes sense to do that? And for that matter, can it be done efficiently
enough to do it more frequently? After every file, or after some number of
files, or after processing all files in a (sub-)tree?
> Since the answer to that will change over time, let's make a config for
> it. annex.adjustedbranchrefresh
>
> The config can start out as a range from 0 up that indicates how
> infrequently to update the branch for this. With 0 being never refresh,
> 1 being refresh the minimum (once at shutdown), 2 being refresh after
> 1 file, 100 after every 99 files, etc.
> If refreshing gets twice as fast, divide the numbers by two, etc.
> If it becomes sufficiently fast that the overhead doesn't matter,
> it can change to a simple boolean. Since 0 is false and > 0 is true,
> in git config, the old values will still work. (done)
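Under that scheme, the mapping from config value to refresh behaviour might look like this (a hypothetical sketch of the semantics described above, not git-annex's actual code; `refresh_policy` is an illustrative name):

```shell
# n is the value of annex.adjustedbranchrefresh
refresh_policy() {
	n="$1"
	if [ "$n" -le 0 ]; then
		echo "never"
	elif [ "$n" -eq 1 ]; then
		echo "at shutdown"
	else
		echo "after every $((n - 1)) files"
	fi
}
```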
Investigation of the obvious things that make it slow follows:

## efficient branch adjustment
updateAdjustedBranch re-adjusts the whole branch. That is inefficient for a
branch with a large number of files in it. We need a way to incrementally
adjust part of the branch.

(Or maybe not *need*, because maybe it's acceptable for it to be slow; slow
is still better than manual, probably.)
Git.Tree.adjustTree recursively lists the tree and applies
the adjustment to each item in it. What's needed is a way to
adjust only the subtree containing a file, and then use
Git.Tree.graftTree to graft that into the existing tree. Seems quite
doable!

graftTree does use getTree, so it buffers the whole tree object in memory.
adjustTree avoids doing that. I think it's probably ok though; git probably
also buffers tree objects in memory, and only the tree objects down to the
graft point need to be buffered.

Oh, also, it could cache a few tree objects to speed
this up more. Eg, after dropping foo/bar/1, buffer foo's and foo/bar's
objects, and use them when dropping foo/bar/2.
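The graft idea can be sketched with plain git plumbing (an illustrative toy repo, not git-annex's actual code): after dropping foo/bar/1, only the foo/bar tree, the foo tree, and the root tree get rebuilt; everything else is reused by sha1.

```shell
set -e
dir=$(mktemp -d) && cd "$dir"
git init -q repo && cd repo
mkdir -p foo/bar
echo 1 > foo/bar/1; echo 2 > foo/bar/2; echo top > top
git add .
git -c user.email=t@e -c user.name=t commit -qm add

# "adjust" only the subtree foo/bar: build a new tree without file 1
newbar=$(git ls-tree HEAD:foo/bar | awk -F'\t' '$2 != "1"' | git mktree)

# graft it back up the path to the root, rebuilding only foo and the root
newfoo=$(printf '040000 tree %s\tbar\n' "$newbar" | git mktree)
topblob=$(git rev-parse HEAD:top)
newroot=$(printf '040000 tree %s\tfoo\n100644 blob %s\ttop\n' \
	"$newfoo" "$topblob" | git mktree)

git ls-tree -r "$newroot"   # foo/bar/1 is gone; foo/bar/2 and top remain
```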
## efficient checkout

updateAdjustedBranch checks out the new version of the branch.
git checkout needs to diff the old and new tree, and while git is fast,
it's not magic. Consider if, after every file dropped, we checked out a new
branch that did not contain that file. Dropping can be quite fast,
thousands of files per second. How much would those git checkouts
slow it down?
I benchmarked to find out. (On an SSD)

	for x in $(seq 1 10000); do echo $x > $x; git add $x; git commit -m add; done
	time while git checkout --quiet HEAD^; do : ; done

	real	5m59.489s
	user	4m34.775s
	sys	3m39.665s
Seems like git-annex could do the equivalent of checkout more quickly,
since it knows the part of the branch that changed. Eg, git rm the file,
and update the sha1 that HEAD points to.
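That shortcut might look roughly like this (an illustrative sketch in a toy repo; git-annex would do the equivalent internally rather than by shelling out):

```shell
set -e
dir=$(mktemp -d) && cd "$dir"
git init -q
echo a > a; echo b > b
git add .
git -c user.email=t@e -c user.name=t commit -qm add

# drop file "a": remove it from the index and worktree ourselves...
git rm -q -- a

# ...then build the new tree and commit directly and point HEAD at it,
# instead of asking git checkout to diff and update the whole worktree
newtree=$(git write-tree)
newcommit=$(git -c user.email=t@e -c user.name=t \
	commit-tree "$newtree" -p HEAD -m adjust)
git update-ref HEAD "$newcommit"
```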
## dangling git objects

Incremental adjusting would generate new trees each time, and a new commit.
If it's done a lot, those will pile up. They get deleted by gc, but
normally would only be deleted after some time. Perhaps git-annex could
delete them itself. (At least if they remain loose git objects; deleting
them once they reach a pack file seems hard.)

(The tree object buffer mentioned above suggests the approach of,
when an object in the buffer gets replaced with another object for the same
tree position, deleting the old object from .git/objects.)
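A superseded loose object lives at a predictable path, so deleting it directly is cheap. A sketch (assumes the object was never packed and is not referenced by anything):

```shell
set -e
dir=$(mktemp -d) && cd "$dir"
git init -q

# make a throwaway tree object that nothing references
blob=$(echo x | git hash-object -w --stdin)
tree=$(printf '100644 blob %s\tx\n' "$blob" | git mktree)

# a loose object is stored at .git/objects/<first 2 hex chars>/<rest>
loose=".git/objects/$(printf %.2s "$tree")/${tree#??}"
[ -f "$loose" ]
rm "$loose"
```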
2020-11-13 19:57:35 +00:00
A git repo containing 10000 commits, starting with 10000 files and ending
with 0, deleting 1 file per commit, weighs in at 99 MB. That's with
short filenames; more typical filenames would add to it. (It packs down to
9 MB, but that's just extra work for objects that will get gc'ed away
eventually.)
## slowness in writing git objects

git-annex uses git commit-tree to create new commits on the adjusted
branch. That is not batched, so running it once per file may get slow.

And to write trees, it uses git mktree --batch. But a new process is
started each time by Git.Tree.adjustTree (and other things).
Making that a long-running process would probably speed it up.
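mktree's --batch mode already lets one process build many trees: each tree's entries are terminated by a blank line, and one sha1 is printed per tree. A toy illustration:

```shell
set -e
dir=$(mktemp -d) && cd "$dir"
git init -q
blob=$(echo hi | git hash-object -w --stdin)

# two trees built by a single mktree process, separated by a blank line
trees=$(printf '100644 blob %s\ta\n\n100644 blob %s\tb\n' "$blob" "$blob" \
	| git mktree --batch)
echo "$trees"
```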
## queue flush slowness

It uses Annex.Queue.flush to avoid a problem with dropping files:
dropping depopulates the pointer file, which git sees as modified until
it's restaged, which then prevents checking out the version of the branch
where the pointer file is a symlink. There must be a less expensive way to
handle that.
(All the extra "(recording state in git...)" messages when dropping
are due to it doing that too.)
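One cheaper possibility (speculative, not necessarily what git-annex should do) is to restage only the affected path, refreshing its stat information, rather than flushing the whole queue. The effect is visible with plumbing: a file rewritten with identical content is stat-dirty until refreshed.

```shell
set -e
dir=$(mktemp -d) && cd "$dir"
git init -q
echo pointer > f
git add f
git -c user.email=t@e -c user.name=t commit -qm c

# simulate the pointer file being rewritten in place with the same content:
# only its stat info changes, but plumbing still reports it as modified
touch -t 200001010000 f
before=$(git diff-files -- f)   # non-empty: f looks stat-dirty

# refresh just that one path, instead of restaging everything
git update-index -q --refresh -- f
after=$(git diff-files -- f)    # empty: git knows f is unchanged
```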