update for git-annex filter-process

2021-11-04 15:11:30 -04:00
commit b25a138e22
parent 8dd91be867
1 changed file with 19 additions and 53 deletions
--- a/doc/todo/git_smudge_clean_interface_suboptiomal.mdwn
+++ b/doc/todo/git_smudge_clean_interface_suboptiomal.mdwn
@@ -16,7 +16,8 @@ to git. git-lfs uses it that way.
 The first problem with the interface was that it ran a command once per
 file. This was later fixed by extending it to support long-running filter
-processes, which git-lfs uses.
+processes, which git-lfs uses. git-annex can also use that interface,
+when `git-annex filter-process` is enabled, but it does not by default.
 A second problem with the interface, which affects git-lfs AFAIK, is that
 git buffers the output of the smudge filter in memory before updating the
@@ -29,14 +30,13 @@ filter in the usual way, as described below.
 A third problem with the interface is that piping large file contents
 between git and filters is inefficient. Seems this must affect git-lfs
 too, but perhaps it's used on less enormous data sets than git-annex.
-To avoid the problem, git-annex relies on a not very well documented trick: The
-clean filter is fed a possibly large file on stdin, but when it closes the FD
-without reading. git gets a SIGPIPE and stops reading and sending the
-file. Instead of reading from stdin, git-annex abuses the fact that git
-provides the clean filter with the work tree filename, and reads and cleans
-the file itself, more efficiently. But, this trick only works when the
-long-running filter process interface is not used, so git-annex can't use
-it.
+
+To avoid the problem, `git-annex smudge --clean` relies on a not very well
+documented trick: It is fed a possibly large file on stdin,
+but when it closes the FD without reading. git gets a SIGPIPE and stops
+reading and sending the file. Instead of reading from stdin, git-annex
+abuses the fact that git provides the clean filter with the work tree
+filename, and reads and cleans the file itself, more efficiently.
 git-lfs differs from git-annex in that all the large files in the
 repository are usually present in the working tree; it doesn't have a way
@@ -78,13 +78,19 @@ fit.
 And here's the consequences of git-annex's workarounds:
-* It doesn't use the long-running filter process interface, so `git add`
+* It doesn't use the long-running filter process interface by default, 
-  of a lot of files runs the clean filter once per file, which is slower than it
+  so `git add` of a lot of files runs `git-annex smudge --clean` once per file,
-  could be. Using `git-annex add` avoids this problem.
+  which is slower than it could be. Using `git-annex add` avoids this problem.
+  So does enabling `git-annex filter-process`.
 * After a git-annex get/drop or a git checkout or pull that affects a lot
  of files, the clean filter gets run once per file, which is again, slower
-  than ideal.
+  than ideal. Enabling `git-annex filter-process` can speed up git checkout
+  or pull, but not git-annex get/drop.
+* When `git-annex filter-process` is enabled, it cannot use the trick
+  described above that `git-annex smudge --clean` uses to avoid git
+  piping the whole content of large files through it.
 * In a rare situation, git-annex would like to get git to run the clean
  filter, but it cannot because git has the index locked. So, git-annex has
@@ -109,43 +115,3 @@ The best fix would be to improve git's smudge/clean interface:
 * Allow clean filter to read work tree files itself, to avoid overhead of
  sending huge files through a pipe.
 [[!tag confirmed]]
-> Could it use the long-running filter process interface for
-> smudge, but not clean? It could behave like the smudge filter does now,
-> outputting the same pointer git feeds it, and deferring populating the
-> file until later. So large file content would not be sent through the
-> pipe in this case. It seems that the interface's handshake allows
-> it to claim only support for smudge, and not clean, but the docs don't
-> say if git falls back to running the clean filter in that case.
->
-> If this was possible, it should speed up git checkout etc which use the
-> smudge filter. --[[Joey]]
-> > Tried that, it looks like when the long-running filter process
-> > sends only capability=smudge, and not clean, git add does not
-> > fall back to running filter.annex.clean. So unfortunately,
-> > it's not that easy.
-> >
-> > To proceed down this path, git-annex would need to use
-> > the long-running filter process for clean too. So `git add`
-> > would end up piping the whole content of large files over to git-annex,
-> > which would have to read and discard that data. This would
-> > probably make git-annex twice as slow for large files, although
-> > it would speed up git add of many small files. git-annex add
-> > could be used to work around any speed impact.
-> > (The long-running-smudge branch has some preliminary work to doing
-> > this.)
-> >
-> > Or git could be extended
-> > with a capability in the protocol that lets the clean filter read the
-> > file content from disk, rather than having it piped into it. The
-> > long-running filter process protocol does have a design that would let
-> > it be extended that way.
-> >
-> > Or, I suppose git could be changed to run the clean filter when
-> > the long-running process does not support capability=clean. Maybe
-> > that would be more appealing to the git devs. Although since
-> > it currently does not run it, git-annex would need to somehow
-> > detect the old version of git and still work. --[[Joey]]
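
For readers unfamiliar with the trick the updated text attributes to `git-annex smudge --clean`, here is a rough Python sketch of the idea: the filter closes its stdin without reading it, then reads the work tree file by name. This is not git-annex's code (git-annex is written in Haskell). The function names `pointer_for` and `clean` and the `example-pointer/...` output format are all invented for illustration, and the sketch assumes git passes the work tree filename as the first argument.

```python
import hashlib
import os
import sys


def pointer_for(path, chunk_size=1 << 20):
    """Hash the work tree file directly and return a small pointer string.

    The pointer layout is hypothetical; it is not git-annex's key format.
    """
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            size += len(chunk)
    return "example-pointer/SHA256-s{}--{}\n".format(size, digest.hexdigest())


def clean(path, stdin_fd=0):
    """Act as a one-shot clean filter for the file at `path`."""
    # Close stdin without reading it. The writer on the other end of the
    # pipe (git, in the real setup) then gets SIGPIPE/EPIPE and stops
    # sending the file content.
    os.close(stdin_fd)
    # Read the content from the work tree instead of from the pipe.
    return pointer_for(path)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumed invocation: the filter command receives the filename
    # as its first argument.
    sys.stdout.write(clean(sys.argv[1]))
```

The sketch only covers the once-per-file filter. In the long-running filter process the content arrives over the protocol stream rather than a per-file stdin, which is why the text above says the trick is unavailable when `git-annex filter-process` is enabled.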