git-annex/doc/bugs/add_overwrite_race.mdwn

I was running `git-annex add` on 5 GB of files, and accidentally overwrote
some of the first ones, which it had already processed, while it was
running. This caused binary files to get staged in git, rather than the
annex pointers. 

Test case: 

	echo hi > 1
	dd if=/dev/urandom of=2 bs=1M count=1000
	(sleep 2s; rm 1; echo bye > 1) &
	git-annex add
	git diff --cached 1
	diff --git a/1 b/1
	new file mode 100644
	index 0000000..b023018
	--- /dev/null
	+++ b/1
	@@ -0,0 +1 @@
	+bye

This happens due to ingestAdd using addLink on the symlink, 
which just queues a "git add" of the file for later. In the
meantime, the symlink is replaced with something else, so git
adds that.
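
To make the window concrete, here is a toy model of the race in Python. It is
not git-annex code: `flush_queue`, `simulate_race` and the symlink target are
made-up names, and the "staging" is just reading the path. It only shows that
a deferred add sees whatever is at the path when the queue is flushed.

```python
import os
import tempfile

def flush_queue(directory, queue):
    """Stage each queued name by reading whatever is on disk right now."""
    staged = {}
    for name in queue:
        path = os.path.join(directory, name)
        if os.path.islink(path):
            staged[name] = ("symlink", os.readlink(path).encode())
        else:
            with open(path, "rb") as f:
                staged[name] = ("file", f.read())
    return staged

def simulate_race():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "1")
        # The file has been annexed: it is now a symlink (target is made up).
        os.symlink(".git/annex/objects/HYPOTHETICAL-KEY", path)
        queue = ["1"]  # staging is only queued, not done yet
        # Meanwhile the user overwrites the file, as in the test case.
        os.remove(path)
        with open(path, "wb") as f:
            f.write(b"bye\n")
        # The queue is flushed later and picks up the new content.
        return flush_queue(d, queue)["1"]
```

In this model `simulate_race()` returns the regular file content rather than
the symlink, which is the same outcome as the `git diff --cached` above.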

It seems that the solution will be to use update-index rather than git add.
Note that addLink has a comment about why it uses git add, which seems to
mostly be that it's faster. It also sometimes relies on git add to check
gitignore, although sometimes redundantly. Some of its callers may rely on
that, and would have to be changed to check gitignore first themselves.

Since adding a file to the annex also involves locking it down and
detecting modifications made while generating the key, update-index is
sufficient.
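
A rough sketch of why update-index avoids the race: the symlink can be staged
by object id, computed from the link target that git-annex already knows, so
nothing is read back from the work tree later. The helper names below are
hypothetical and this is not how git-annex implements it. It assumes a SHA-1
repository, and the blob would still need to be written to the object
database (for example with `git hash-object -w --stdin`) before the line is
fed to `git update-index --index-info`.

```python
import hashlib

SYMLINK_MODE = "120000"  # git's file mode for symlinks

def git_blob_id(data: bytes) -> str:
    """Object id git assigns to a blob with this content (SHA-1 repos)."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

def index_info_line(path: str, link_target: str) -> str:
    """One 'mode SP sha1 TAB path' line for git update-index --index-info."""
    blob = git_blob_id(link_target.encode())
    return f"{SYMLINK_MODE} {blob}\t{path}"
```

Since the id depends only on the link target, replacing the file in the work
tree after this point does not change what ends up in the index.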

> Update: This is done for `git-annex add`, using addSymlink. But addLink
> is still in use elsewhere, and those other users might also be subject to
> similar races.

When it's adding a file unlocked, it already stages the pointer file using
update-index instead so there is no overwrite problem there.

But there's a similar problem when it decides not to annex a file
and adds it to git. If the file content is overwritten at that point, it
will git add the new content, which may be large enough that it should have
been added to the annex after all. Test case for this:

	git config annex.largefiles largerthan=3b
	echo hi > 1
	dd if=/dev/urandom of=2 bs=1M count=1000
	(sleep 2s; rm 1; echo bye > 1) &
	git-annex add
	git diff --cached 1

Unsure how to fix this case yet. Maybe it needs to cache the inode,
hash the file content, then verify the inode did not change during
hashing, and then also use update-index.
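
The idea could be sketched like this in Python. It is only an illustration of
the check, not a proposed implementation: `stat_signature`,
`hash_if_unchanged` and the `during_hash` hook are hypothetical names, SHA-256
stands in for whatever hash is really used, and comparing inode, size and
mtime is an assumption about what "cache the inode" would cover.

```python
import hashlib
import os

def stat_signature(path):
    """Cheap fingerprint of the file: inode number, size and mtime."""
    st = os.stat(path)
    return (st.st_ino, st.st_size, st.st_mtime_ns)

def hash_if_unchanged(path, during_hash=None):
    """Hash the file, returning None if it changed while being hashed."""
    before = stat_signature(path)
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    if during_hash is not None:
        during_hash()  # hook to simulate a concurrent overwrite
    after = stat_signature(path)
    if before != after:
        return None  # caller should start over and re-decide annex vs git
    return h.hexdigest()
```

There is still a window between the second stat and staging, which is why the
hashed content would then be staged by id with update-index rather than by
path with git add.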


--[[Joey]]

bug report Sponsored-by: Luke Shumaker on Patreon 2022-06-14 16:51:04 +00:00

fix add overwrite race with git-annex add to annex

This is not a complete fix for all such races, only the one where a large
file gets changed while adding and gets added to git rather than to the
annex.

addLink needs to go away, any caller of it is probably subject to the same
kind of race. (Also, addLink itself fails to check gitignore when symlinks
are not supported.)

ingestAdd no longer checks gitignore. (It didn't check it consistently
before either, since there were cases where it did not run git add!) When
git-annex import calls it, it's already checked gitignore itself earlier.
When git-annex add calls it, it's usually on files found by
withFilesNotInGit, which handles checking ignores.

There was one other case, when git-annex add --batch calls it. In that case,
old git-annex behaved rather badly, it would seem to add the file, but git
add would later fail, leaving the file as an unstaged annex symlink. That
behavior has also been fixed.

Sponsored-by: Brett Eisenberg on Patreon 2022-06-14 17:20:42 +00:00