enable filter.annex.process in v9
This has tradeoffs, but is generally a win, and users for whom it makes
git add unacceptably slow can simply disable it again.
It needed to happen in an upgrade, since there are git-annex versions
that do not support it, and using such an old version with a v8
repository with filter.annex.process set will cause bad behavior.
By enabling it in v9, it's guaranteed that any git-annex version that
can use the repository does support it. This is not perfect protection
against problems, though: an old git-annex version, if used with a v9
repository, will cause git add to try to run git-annex filter-process,
which will fail. But at least the user is unlikely to have an old
git-annex in path if they are using a v9 repository, since it won't work
in that repository.
Sponsored-by: Dartmouth College's Datalad project
2022-01-21 17:11:18 +00:00
When `git-annex filter-process` is enabled (v9 and above), `git add` pipes
the content of files into it, but that content is thrown away, and the file
is read again by git-annex to generate a hash. It would improve performance
to hash the content provided via the pipe.

2021-11-09 17:30:52 +00:00
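The idea can be sketched as follows: rather than discarding the piped content and re-hashing the file from disk, hash each chunk as it arrives on the pipe. This is a minimal Python sketch under stated assumptions (git-annex itself is Haskell; the function name, the simplified key format, and the omission of git's actual filter-process protocol framing are all illustrative, not git-annex's real implementation):

```python
import hashlib
import io

def clean_with_incremental_hash(pipe, chunk_size=65536):
    """Consume file content from a stream, as git's filter protocol
    delivers it, hashing on the fly instead of throwing the piped
    data away and re-reading the file from disk.

    Hypothetical helper; the real key format (e.g. SHA256E keys) and
    protocol handling in git-annex are more involved.
    """
    h = hashlib.sha256()
    size = 0
    while True:
        chunk = pipe.read(chunk_size)
        if not chunk:
            break
        h.update(chunk)
        size += len(chunk)
    # A git-annex key names the object by digest and size; this is a
    # simplified stand-in for that naming scheme.
    return "SHA256-s%d--%s" % (size, h.hexdigest())
```

For example, `clean_with_incremental_hash(io.BytesIO(b"hello"))` yields a key derived from the stream in a single pass, with no second read of the file.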
When filter-process is not enabled, `git-annex smudge --clean` reads
the file to hash it, then reads it a second time to copy it into
.git/annex/objects. When annex.addunlocked is enabled, `git annex add`
does the same. It would improve performance to read once, and copy and
hash at the same time.
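Reading once while hashing and copying in the same pass can be sketched like this (a hypothetical Python helper, not git-annex's actual ingest code, which also handles locking, cleanup, and object permissions):

```python
import hashlib

def hash_and_copy(src_path, dest_path, chunk_size=65536):
    """Read the source file once, feeding each chunk both to the hash
    and to the destination file, instead of one pass to hash and a
    second pass to copy."""
    h = hashlib.sha256()
    with open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)  # incremental hash of this chunk
            dest.write(chunk)  # and copy it in the same pass
    return h.hexdigest()
```

The single pass halves the I/O when the file is not already in the disk cache; when it is cached, the second read is cheap, which is part of why the speedup here is uncertain.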
The `incrementalhash` branch has a start at implementing this.
I lost steam on that branch when I realized that it would need to
re-implement Annex.Ingest.ingest in order to populate
.git/annex/objects/. And it's not as simple as writing an object file
and moving it into place there, because annex.thin means a hard link should
be made, and if the filesystem supports CoW, that should be used rather
than writing the file again.
2021-11-29 18:00:32 +00:00
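The decision Annex.Ingest.ingest has to make when populating .git/annex/objects/ might be sketched as follows (a simplified, hypothetical Python helper; the real Haskell code also attempts a CoW clone where the filesystem supports it, and deals with locking and permissions, all omitted here):

```python
import os
import shutil

def populate_object(work_file, object_file, annex_thin=False):
    """Place a work-tree file's content into the annex object tree.

    With annex.thin, a hard link is made so the content is not
    duplicated on disk; otherwise the content is copied. A CoW clone
    (where the filesystem supports it) would avoid the second write
    of the data, but is omitted here for portability.
    """
    os.makedirs(os.path.dirname(object_file), exist_ok=True)
    if annex_thin:
        os.link(work_file, object_file)  # hard link, no data copied
    else:
        shutil.copyfile(work_file, object_file)  # full second write
```

This is why a plain "write the object file and move it into place" approach isn't enough: the hard-link and CoW paths never write the data a second time at all.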
A benchmark on Linux showed that `git add` of a 1 gb file
is about 5% slower with filter-process enabled than it is
with filter-process disabled. That's due to the piping overhead to
filter-process ([[todo/git_smudge_clean_interface_suboptiomal]]).
`git-annex add` with `annex.addunlocked` has similar performance
to `git add` with filter-process disabled.
`git-annex add` without `annex.addunlocked` is about 25% faster than those,
and only reads the file once. But it also does not copy the file, so of
course it's faster, and always will be.
The disk cache probably helps them a fair amount, unless it's too small
to hold the file. So it's not clear how much implementing this would
really speed them up.
This does not really affect default configurations.
Performance is only impacted when annex.addunlocked or
annex.largefiles is configured, and in a few cases
where an already annexed file is added by `git add` or `git commit -a`.
So, is the complication of implementing this worth it? Users who
need maximum speed can use `git-annex add`.