git-annex/doc/todo/use_git-mktree_rather_than_index_file.mdwn

When git-annex is updating the git-annex branch, it currently
uses a separate index file. This adds overhead and complexity to the code.
Especially when there are many files, the index file gets large and writing
it gets slow.

It might be an improvement to use `git mktree --batch` to inject a
tree object into git, without using the index file. `git hash-object`
is already used to add the files to git. All that would be needed is to
generate an updated tree containing the new file(s), and then update each
parent tree up to the root tree. This new tree can then be committed using
`git commit-tree`.
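
A minimal sketch of that idea in Python, not git-annex's actual code. The helper names (`run_git`, `mktree_input`, `git_write_tree`, `set_path`, `commit_tree`) are made up for illustration, and the tree reader is passed in as a function so the rebuild logic can be tried without a repository. It uses plain `git mktree` once per tree; a long-running `git mktree --batch` process could replace `git_write_tree`.

```python
import subprocess

def run_git(args, stdin=""):
    """Run git in the current repository; return stdout without the trailing newline."""
    done = subprocess.run(["git", *args], input=stdin,
                          capture_output=True, text=True, check=True)
    return done.stdout.rstrip("\n")

def mktree_input(entries):
    """Format {name: (mode, type, sha)} in the ls-tree format that git mktree reads."""
    return "".join(f"{mode} {objtype} {sha}\t{name}\n"
                   for name, (mode, objtype, sha) in sorted(entries.items()))

def git_write_tree(entries):
    """Write one tree object with git mktree and return its sha."""
    return run_git(["mktree"], stdin=mktree_input(entries))

def set_path(read_tree, write_tree, tree_sha, parts, blob_sha):
    """Return the sha of a new tree that equals tree_sha, except that the file
    at parts (a list of path components) points at blob_sha.

    read_tree(sha) returns {name: (mode, type, sha)}; write_tree(entries)
    returns a sha. Only trees along the path are rewritten; sibling entries
    are carried over by sha.
    """
    entries = dict(read_tree(tree_sha)) if tree_sha else {}
    name = parts[0]
    if len(parts) == 1:
        entries[name] = ("100644", "blob", blob_sha)
    else:
        old = entries.get(name)
        child = old[2] if old and old[1] == "tree" else None
        new_child = set_path(read_tree, write_tree, child, parts[1:], blob_sha)
        entries[name] = ("040000", "tree", new_child)
    return write_tree(entries)

def commit_tree(tree_sha, parent_sha, message):
    """Create a commit object for the tree; the branch ref still has to be updated."""
    return run_git(["commit-tree", tree_sha, "-p", parent_sha, "-m", message])
```

With a blob sha from `git hash-object -w`, a call such as `set_path(reader, git_write_tree, old_root, path.split("/"), blob_sha)` would produce the new root tree, which `commit_tree` then wraps in a commit.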

The only thing I can see that might make this slow at all is reading the old
tree contents, in order to update them. This would need a `git ls-tree` for
each tree along the path; `git ls-tree` does not have a batch mode, so 4
processes would need to be spawned when generating a tree that changes 1 file.
For any repo that's not very small, that's probably still faster than
rewriting the index file.
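
To make the cost concrete, here is a rough Python sketch of reading the old trees, one `git ls-tree` process per tree level. The function names and the `a/b/c/file.log` path used below are illustrative assumptions, not git-annex's real layout or code.

```python
import subprocess

def parse_ls_tree(output):
    """Parse `git ls-tree -z` output into {name: (mode, type, sha)}."""
    entries = {}
    for record in output.split("\0"):
        if not record:
            continue
        meta, name = record.split("\t", 1)
        mode, objtype, sha = meta.split(" ")
        entries[name] = (mode, objtype, sha)
    return entries

def ls_tree(tree_sha):
    """Spawn one git ls-tree process and return that tree's direct entries."""
    done = subprocess.run(["git", "ls-tree", "-z", tree_sha],
                          capture_output=True, text=True, check=True)
    return parse_ls_tree(done.stdout)

def trees_along_path(root_sha, path, read_tree=ls_tree):
    """Read every existing tree between the root and the file at path.

    Returns one entries dict per tree read, root first. With the default
    reader, the length of the list is the number of git processes spawned.
    """
    trees = [read_tree(root_sha)]
    for name in path.split("/")[:-1]:
        entry = trees[-1].get(name)
        if entry is None or entry[1] != "tree":
            break  # directory does not exist yet, nothing more to read
        trees.append(read_tree(entry[2]))
    return trees
```

For a file three directories deep, such as `a/b/c/file.log`, this reads four trees (the root plus one per directory), so four processes per changed file with the default reader.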

Notes:

* The union merge code currently uses the index. No particular reason
  it needs to; that's just how the code is written, and it might be a large
  rewrite to change it.
* A new git-annex branch can be pushed into the repository at any time.
  The current code uses the index to detect when this happens, and
  union merges the new branch head into the index. Would need something
  like a `GIT_ANNEX_HEAD` ref to do the same if the index is removed.
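
One way the second note could work, sketched in Python under loud assumptions: `GIT_ANNEX_HEAD` is only the name floated above, and `resolve`, `plan` and `record_seen` are hypothetical helpers. The idea is to remember which branch head was last incorporated, and to union merge whenever the branch has moved since.

```python
import subprocess

BRANCH_REF = "refs/heads/git-annex"
SEEN_REF = "GIT_ANNEX_HEAD"  # hypothetical ref name, taken from the note above

def resolve(ref):
    """Return the sha a ref points at, or None if the ref does not exist."""
    done = subprocess.run(["git", "rev-parse", "--verify", "--quiet", ref],
                          capture_output=True, text=True)
    return done.stdout.strip() or None

def plan(branch_sha, seen_sha):
    """Decide what has to happen before new changes are committed."""
    if branch_sha is None:
        return "create"        # no git-annex branch yet
    if branch_sha == seen_sha:
        return "up-to-date"    # nothing was pushed in since the last update
    return "union-merge"       # branch moved (or was never recorded): merge first

def record_seen(new_sha, old_sha=None):
    """Point SEEN_REF at new_sha; if old_sha is given, git update-ref refuses
    the update unless the ref still holds old_sha."""
    args = ["git", "update-ref", SEEN_REF, new_sha]
    if old_sha:
        args.append(old_sha)
    subprocess.run(args, check=True)
```

A caller would run `plan(resolve(BRANCH_REF), resolve(SEEN_REF))` before each update, and call `record_seen` after committing.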

Thanks to sm for indirectly suggesting this. --[[Joey]]
possible optimisation idea 2015-09-03 20:52:43 +00:00