update for git-annex filter-process
parent 8dd91be867
commit b25a138e22
1 changed files with 19 additions and 53 deletions
@@ -16,7 +16,8 @@ to git. git-lfs uses it that way.

 The first problem with the interface was that it ran a command once per
 file. This was later fixed by extending it to support long-running filter
-processes, which git-lfs uses.
+processes, which git-lfs uses. git-annex can also use that interface,
+when `git-annex filter-process` is enabled, but it does not by default.

 A second problem with the interface, which affects git-lfs AFAIK, is that
 git buffers the output of the smudge filter in memory before updating the
@@ -29,14 +30,13 @@ filter in the usual way, as described below.
 A third problem with the interface is that piping large file contents
 between git and filters is inefficient. Seems this must affect git-lfs
 too, but perhaps it's used on less enormous data sets than git-annex.
-To avoid the problem, git-annex relies on a not very well documented trick: The
-clean filter is fed a possibly large file on stdin, but when it closes the FD
-without reading, git gets a SIGPIPE and stops reading and sending the
-file. Instead of reading from stdin, git-annex abuses the fact that git
-provides the clean filter with the work tree filename, and reads and cleans
-the file itself, more efficiently. But, this trick only works when the
-long-running filter process interface is not used, so git-annex can't use
-it.
+
+To avoid the problem, `git-annex smudge --clean` relies on a not very well
+documented trick: It is fed a possibly large file on stdin,
+but when it closes the FD without reading, git gets a SIGPIPE and stops
+reading and sending the file. Instead of reading from stdin, git-annex
+abuses the fact that git provides the clean filter with the work tree
+filename, and reads and cleans the file itself, more efficiently.

 git-lfs differs from git-annex in that all the large files in the
 repository are usually present in the working tree; it doesn't have a way
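
The stdin trick described in this hunk can be reproduced outside git. The following is a minimal sketch, not git-annex's implementation: the `CHILD` script is a hypothetical stand-in for the clean filter, the parent plays git's role, and `feed_filter` and `demo` are names invented for this illustration. It assumes a POSIX system. Python ignores SIGPIPE by default, so the parent sees the broken pipe as a `BrokenPipeError` rather than a signal.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical stand-in for the clean filter: close stdin unread,
# then process the file by name instead.
CHILD = r"""
import os, sys
os.close(0)                          # close stdin without reading any of it
print(os.path.getsize(sys.argv[1]))  # "clean" the file straight from disk
"""

def feed_filter(path, chunk=65536):
    """Pipe `path` to the stand-in filter.

    Returns (bytes_sent, pipe_broke, child_output).
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, path],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE,
    )
    sent, pipe_broke = 0, False
    with open(path, "rb") as f:
        try:
            while data := f.read(chunk):
                sent += os.write(proc.stdin.fileno(), data)
        except BrokenPipeError:
            # The reader closed its end, so the sender stops early.
            pipe_broke = True
    proc.stdin.close()
    output = proc.stdout.read().decode().strip()
    proc.stdout.close()
    proc.wait()
    return sent, pipe_broke, output

def demo(size=8 * 1024 * 1024):
    """Create a temporary file of `size` bytes and feed it to the filter."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\0" * size)
    try:
        return feed_filter(tmp.name)
    finally:
        os.unlink(tmp.name)

if __name__ == "__main__":
    print(demo())
```

Only about one pipe buffer's worth of data is written before the sender gives up, while the child still sees the whole file on disk. That is the saving the trick is after for large files.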
@@ -78,13 +78,19 @@ fit.

 And here's the consequences of git-annex's workarounds:

-* It doesn't use the long-running filter process interface, so `git add`
-of a lot of files runs the clean filter once per file, which is slower than it
-could be. Using `git-annex add` avoids this problem.
+* It doesn't use the long-running filter process interface by default,
+so `git add` of a lot of files runs `git-annex smudge --clean` once per file,
+which is slower than it could be. Using `git-annex add` avoids this problem.
+So does enabling `git-annex filter-process`.

 * After a git-annex get/drop or a git checkout or pull that affects a lot
 of files, the clean filter gets run once per file, which is again, slower
-than ideal.
+than ideal. Enabling `git-annex filter-process` can speed up git checkout
+or pull, but not git-annex get/drop.
+
+* When `git-annex filter-process` is enabled, it cannot use the trick
+described above that `git-annex smudge --clean` uses to avoid git
+piping the whole content of large files through it.

 * In a rare situation, git-annex would like to get git to run the clean
 filter, but it cannot because git has the index locked. So, git-annex has
@@ -109,43 +115,3 @@ The best fix would be to improve git's smudge/clean interface:

 * Allow clean filter to read work tree files itself, to avoid overhead of
 sending huge files through a pipe.
-
-[[!tag confirmed]]
-
-> Could it use the long-running filter process interface for
-> smudge, but not clean? It could behave like the smudge filter does now,
-> outputting the same pointer git feeds it, and deferring populating the
-> file until later. So large file content would not be sent through the
-> pipe in this case. It seems that the interface's handshake allows
-> it to claim only support for smudge, and not clean, but the docs don't
-> say if git falls back to running the clean filter in that case.
->
-> If this was possible, it should speed up git checkout etc which use the
-> smudge filter. --[[Joey]]
-
-> > Tried that, it looks like when the long-running filter process
-> > sends only capability=smudge, and not clean, git add does not
-> > fall back to running filter.annex.clean. So unfortunately,
-> > it's not that easy.
-> >
-> > To proceed down this path, git-annex would need to use
-> > the long-running filter process for clean too. So `git add`
-> > would end up piping the whole content of large files over to git-annex,
-> > which would have to read and discard that data. This would
-> > probably make git-annex twice as slow for large files, although
-> > it would speed up git add of many small files. git-annex add
-> > could be used to work around any speed impact.
-> > (The long-running-smudge branch has some preliminary work to doing
-> > this.)
-> >
-> > Or git could be extended
-> > with a capability in the protocol that lets the clean filter read the
-> > file content from disk, rather than having it piped into it. The
-> > long-running filter process protocol does have a design that would let
-> > it be extended that way.
-> >
-> > Or, I suppose git could be changed to run the clean filter when
-> > the long-running process does not support capability=clean. Maybe
-> > that would be more appealing to the git devs. Although since
-> > it currently does not run it, git-annex would need to somehow
-> > detect the old version of git and still work. --[[Joey]]
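
The removed discussion turns on the capability step of the long-running filter handshake. As a rough sketch of that step, assuming the pkt-line framing and the `capability=` lines documented in gitattributes(5), the exchange where a filter claims only smudge could look like the code below. The helper names `pkt_line`, `read_list` and `capability_reply` are invented for this illustration and are not git or git-annex APIs.

```python
import io

FLUSH = b"0000"  # flush packet: ends a list of pkt-lines

def pkt_line(text):
    """Encode one text line as a pkt-line.

    The 4 hex digits give the total length, including the digits themselves.
    """
    payload = text.encode() + b"\n"
    return b"%04x" % (len(payload) + 4) + payload

def read_list(stream):
    """Read pkt-lines up to the next flush packet; return the decoded lines."""
    lines = []
    while True:
        header = stream.read(4)
        if len(header) < 4 or header == FLUSH:
            return lines
        payload = stream.read(int(header, 16) - 4)
        lines.append(payload.decode().rstrip("\n"))

def capability_reply(offered, supported):
    """Build the filter's reply: the offered capabilities it supports."""
    chosen = [cap for cap in offered if cap in supported]
    return b"".join(pkt_line("capability=" + cap) for cap in chosen) + FLUSH

# What git sends after the version exchange (assumed from gitattributes(5)).
git_says = b"".join(
    pkt_line("capability=" + cap) for cap in ("clean", "smudge", "delay")
) + FLUSH
offered = [line.split("=", 1)[1] for line in read_list(io.BytesIO(git_says))]

# A filter that claims smudge only, as tried in the discussion above.
reply = capability_reply(offered, supported={"smudge"})
```

Per the discussion, when a filter replies like this, `git add` does not fall back to running filter.annex.clean. That is why the smudge-only approach was dropped.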