enable filter.annex.process in v9
This has tradeoffs, but is generally a win, and users for whom it makes
git add unacceptably slow can simply disable it again.
It needed to happen in an upgrade, since there are git-annex versions
that do not support it, and using such an old version with a v8
repository with filter.annex.process set will cause bad behavior.
By enabling it in v9, it's guaranteed that any git-annex version that
can use the repository does support it. This is not perfect protection
against problems, though: an old git-annex version, if used with a v9
repository, will cause git add to try to run git-annex filter-process,
which will fail. But at least the user is unlikely to have an old
git-annex in path if they are using a v9 repository, since it won't work
in that repository.
Sponsored-by: Dartmouth College's Datalad project
2022-01-21 17:11:18 +00:00
When `git-annex filter-process` is enabled (v9 and above), `git add` pipes
the content of files into it, but that content is thrown away, and the file
is read again by git-annex to generate a hash. It would improve performance
to hash the content provided via the pipe.

2021-11-09 17:30:52 +00:00
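The idea can be sketched as follows: rather than discarding the piped content and re-hashing the file from disk, hash each chunk as it arrives on the pipe. This is a minimal Python sketch under stated assumptions (git-annex itself is Haskell; the function name, the simplified key format, and the omission of git's actual filter-process protocol framing are all illustrative, not git-annex's real implementation):

```python
import hashlib
import io

def clean_with_incremental_hash(pipe, chunk_size=65536):
    """Consume file content from a stream, as git's filter protocol
    delivers it, hashing on the fly instead of throwing the piped
    data away and re-reading the file from disk.

    Hypothetical helper; the real key format (e.g. SHA256E keys) and
    protocol handling in git-annex are more involved.
    """
    h = hashlib.sha256()
    size = 0
    while True:
        chunk = pipe.read(chunk_size)
        if not chunk:
            break
        h.update(chunk)
        size += len(chunk)
    # A git-annex key names the object by digest and size; this is a
    # simplified stand-in for that naming scheme.
    return "SHA256-s%d--%s" % (size, h.hexdigest())
```

For example, `clean_with_incremental_hash(io.BytesIO(b"hello"))` yields a key derived from the stream in a single pass, with no second read of the file.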
When filter-process is not enabled, `git-annex smudge --clean` reads
the file to hash it, then reads it a second time to copy it into
.git/annex/objects. When annex.addunlocked is enabled, `git annex add`
does the same. It would improve performance to read once, and copy and
hash at the same time.
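Reading once while hashing and copying in the same pass can be sketched like this (a hypothetical Python helper, not git-annex's actual ingest code, which also handles locking, cleanup, and object permissions):

```python
import hashlib

def hash_and_copy(src_path, dest_path, chunk_size=65536):
    """Read the source file once, feeding each chunk both to the hash
    and to the destination file, instead of one pass to hash and a
    second pass to copy."""
    h = hashlib.sha256()
    with open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)  # incremental hash of this chunk
            dest.write(chunk)  # and copy it in the same pass
    return h.hexdigest()
```

The single pass halves the I/O when the file is not already in the disk cache; when it is cached, the second read is cheap, which is part of why the speedup here is uncertain.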
The `incrementalhash` branch has a start at implementing this.
I lost steam on that branch when I realized that it would need to
re-implement Annex.Ingest.ingest in order to populate
.git/annex/objects/. And it's not as simple as writing an object file
and moving it into place there, because annex.thin means a hard link should
be made, and if the filesystem supports CoW, that should be used rather
than writing the file again.
2021-11-29 18:00:32 +00:00
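The decision Annex.Ingest.ingest has to make when populating .git/annex/objects/ might be sketched as follows (a simplified, hypothetical Python helper; the real Haskell code also attempts a CoW clone where the filesystem supports it, and deals with locking and permissions, all omitted here):

```python
import os
import shutil

def populate_object(work_file, object_file, annex_thin=False):
    """Place a work-tree file's content into the annex object tree.

    With annex.thin, a hard link is made so the content is not
    duplicated on disk; otherwise the content is copied. A CoW clone
    (where the filesystem supports it) would avoid the second write
    of the data, but is omitted here for portability.
    """
    os.makedirs(os.path.dirname(object_file), exist_ok=True)
    if annex_thin:
        os.link(work_file, object_file)  # hard link, no data copied
    else:
        shutil.copyfile(work_file, object_file)  # full second write
```

This is why a plain "write the object file and move it into place" approach isn't enough: the hard-link and CoW paths never write the data a second time at all.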
A benchmark on Linux showed that `git add` of a 1 gb file
is about 5% slower with filter-process enabled than it is
with filter-process disabled. That's due to the piping overhead to
filter-process ([[todo/git_smudge_clean_interface_suboptiomal]]).
`git-annex add` with `annex.addunlocked` has similar performance
to `git add` with filter-process disabled.
`git-annex add` without `annex.addunlocked` is about 25% faster than those,
and only reads the file once. But it also does not copy the file, so of
course it's faster, and always will be.
The disk cache probably helps them a fair amount, unless it's too small
to hold the file. So it's not clear how much implementing this would
really speed them up.
This does not really affect default configurations.
Performance is only impacted when annex.addunlocked or
annex.largefiles is configured, and in a few cases
where an already annexed file is added by `git add` or `git commit -a`.
So, is the complication of implementing this worth it? Users who
need maximum speed can use `git-annex add`.