add for later

2018-10-22 13:00:42 -04:00 · 2018-10-22 13:00:42 -04:00 · 4ba9c6608e
commit 4ba9c6608e
parent c24e255de1
1 changed files with 115 additions and 0 deletions
--- a/doc/todo/git_smudge_clean_interface_suboptiomal.mdwn
+++ b/doc/todo/git_smudge_clean_interface_suboptiomal.mdwn
@@ -0,0 +1,115 @@
+(This is a draft todo item, for when [[todo/smudge]] gets closed. Some of
+what's described below is not implemented yet, so it's actually worse than
+this describes. --[[Joey]])
+
+git-annex uses git's smudge/clean interface to implement v6 unlocked
+files. However, the interface is suboptimal for git-annex's needs. While
+git-annex works around most of the problems with the interface, it can't
+avoid some consequences of this poor fit, and it has to do some surprising
+things to make it work as well as it does.
+
+First, how git's smudge/clean interface is meant to work: The smudge filter
+is run on the content of files as stored in a repo before they are written to
+the work tree, and can alter the content in arbitrary ways. The clean filter
+reverses the smudge filter, so git can use it to get the content to store
+in the repo. See gitattributes(5) for details.
+
+It was originally used for minor textual changes (e.g. line ending
+conversion), but it's general enough to be used to add large file support
+to git. git-lfs uses it that way.
+
+The first problem with the interface was that it ran a command once per
+file. This was later fixed by extending it to support long-running filter
+processes, which git-lfs uses.
+
+A second problem with the interface, which affects git-lfs AFAIK, is that
+git buffers the output of the smudge filter in memory before updating the
+working tree. If the smudge filter emits a large file, git can use a lot of
+memory. Of course, on modern computers the file needs to be hundreds of
+megabytes for this to be very noticeable. git-lfs may tend to be used with
+files smaller than that. git-annex avoids this problem by not using the smudge
+filter in the usual way, as described below.
+
+A third problem with the interface is that piping large file contents
+between git and filters is inefficient. It seems this must affect git-lfs
+too, but perhaps it's used on less enormous data sets than git-annex.
+To avoid the problem, git-annex relies on a not very well documented trick: The
+clean filter is fed a possibly large file on stdin, but when it closes the FD
+without reading, git gets a SIGPIPE and stops reading and sending the
+file. Instead of reading from stdin, git-annex abuses the fact that git
+provides the clean filter with the work tree filename, and reads and cleans
+the file itself, more efficiently. But this trick only works when the
+long-running filter process interface is not used, so git-annex can't use
+that interface.
+
+git-lfs differs from git-annex in that all the large files in the
+repository are usually present in the working tree; it doesn't have a way
+to drop content that is not wanted locally while keeping other content
+locally available, as git-annex does. And so it does not need to be able to
+get content like git-annex can do either. It also differs in that it uses a
+central server, which is trusted to retain content, so it doesn't try to
+avoid losing the local copy, which could be the only copy, as git-annex
+does. (All AFAIK; have not looked at git-lfs recently.) 
+
+Those properties of git-lfs make it fit fairly well into the smudge/clean
+interface. Conversely, the different properties of git-annex make it a poor
+fit. 
+
+* git-annex needs to be able to update the working tree itself,
+  to make large file content available or not available. But this would cause
+  git to think the file is modified. 
+  
+  The way git-annex works around this is to run git update-index on files
+  after updating them. Git then runs the clean filter, and the clean filter
+  tells git there's not been any real modification of the file.
+
+* git-annex needs to hard link from its object store to a work tree
+  file, to avoid keeping two copies of the file on disk while preventing
+  a rm or git checkout from deleting the only local copy. But the smudge
+  interface does not provide a way to update the worktree itself.
+
+  So, git-annex's smudge filter does not actually provide the large file
+  content. It just echoes back the file as checked into git, and
+  remembers that git wanted to check out that file.
+  git-annex installs post-checkout, post-merge, and pre-commit
+  hooks, which update the working tree files to make content from
+  git-annex available. Of course, that means git sees modifications to the
+  working tree, so git-annex then has to run git update-index on the files,
+  which runs the clean filter, as described above.
+  
+  (Not emitting large files from the smudge filter also avoids the problem with git
+  buffering the filter's output in memory, described earlier.)
+
+And here are the consequences of git-annex's workarounds:
+
+* It doesn't use the long-running filter process interface, so `git add`
+  of a lot of files runs the clean filter once per file, which is slower than it
+  could be. Using `git-annex add` avoids this problem.
+
+* After a git-annex get/drop or a git checkout or pull that affects a lot
+  of files, the clean filter gets run once per file, which is, again, slower
+  than ideal.
+
+* In a rare situation, git-annex would like to get git to run the clean
+  filter, but it cannot because git has the index locked. So, git-annex has
+  to print an ugly warning message saying that git status will show
+  modifications to files that are not really modified, and giving a command to
+  fix the git status display.
+
+* git does not run any hook after a `git stash` or `git reset --hard`,
+  so after these operations, annexed files remain unpopulated until 
+  the user runs `git annex fix`.
+
+The best fix would be to improve git's smudge/clean interface:
+
+* Add hooks that are run after every work tree update, or after `git stash`
+  and `git reset --hard`.
+
+* Avoid buffering smudge filter output in memory.
+
+* Allow smudge filter to modify the work tree itself.
+  (I developed a patch series for this in 2016, but it didn't land.
+  --[[Joey]])
+
+* Allow clean filter to read work tree files itself, to avoid overhead of
+  sending huge files through a pipe.
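
The clean filter trick described under the third problem (close stdin unread, then read the work tree file that git names) can be sketched roughly as below. This is only an illustration in Python, not git-annex's actual code: the `clean` function, the `sketch-pointer` output format, and the `sketch-clean` command name are all hypothetical, and it assumes the filter is configured per-file with the `%f` placeholder (something like `filter.sketch.clean = "sketch-clean %f"`) rather than as a long-running filter process.

```python
import hashlib
import sys

CHUNK = 1 << 20  # read the work tree file 1 MiB at a time


def clean(path, stdin):
    """Return pointer text for `path` without consuming `stdin`."""
    # Close our end of the pipe unread, so git stops sending the content.
    stdin.close()
    digest = hashlib.sha256()
    size = 0
    # Read the work tree file directly instead of through the pipe.
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            digest.update(block)
            size += len(block)
    # Hypothetical pointer format, invented for this sketch only.
    return "sketch-pointer sha256:%s size:%d\n" % (digest.hexdigest(), size)


def main(argv):
    # argv[1] is the work tree filename that git substitutes for %f.
    sys.stdout.write(clean(argv[1], sys.stdin.buffer))
    return 0


if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv))
```

Only the small pointer text travels back to git over the pipe; the large content is read once, straight from disk. As noted above, this depends on git invoking the filter once per file, which is why it cannot be combined with the long-running filter process interface.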