diff --git a/CHANGELOG b/CHANGELOG index fcc69d1ee8..7e12578c9a 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -6,6 +6,9 @@ git-annex (6.20180808) UNRELEASED; urgency=medium Affected commands: find, add, whereis, drop, copy, move, get * Make metadata --batch combined with matching options refuse to run, since it does not seem worth supporting that combination. + * v6 add: Take advantage of improved SIGPIPE handler in git 2.5 to + speed up the clean filter by not reading the file content from the + pipe. This also avoids git buffering the whole file content in memory. -- Joey Hess Wed, 08 Aug 2018 11:24:08 -0400 diff --git a/Command/Smudge.hs b/Command/Smudge.hs index 1644ee2577..cfd327c8d0 100644 --- a/Command/Smudge.hs +++ b/Command/Smudge.hs @@ -16,6 +16,7 @@ import Annex.Ingest import Annex.CatFile import Logs.Location import qualified Database.Keys +import qualified Git.BuildVersion import Git.FilePath import Backend @@ -68,7 +69,7 @@ smudge file = do -- Clean filter is fed file content on stdin, decides if a file -- should be stored in the annex, and outputs a pointer to its --- injested content. +-- injested content if so. Otherwise, the original content. clean :: FilePath -> CommandStart clean file = do b <- liftIO $ B.hGetContents stdin @@ -76,10 +77,18 @@ clean file = do then liftIO $ B.hPut stdout b else ifM (shouldAnnex file) ( do - -- Even though we ingest the actual file, - -- and not stdin, we need to consume all - -- stdin, or git will get annoyed. - B.length b `seq` return () + -- Before git 2.5, failing to consume all + -- stdin here would cause a SIGPIPE and + -- crash it. + -- Newer git catches the signal and + -- stops sending, which is much faster. + -- (Also, git seems to forget to free memory + -- when sending the file, so the less we + -- let it send, the less memory it will + -- waste.) + if Git.BuildVersion.older "2.5" + then B.length b `seq` return () + else liftIO $ hClose stdin -- Look up the backend that was used -- for this file before, so that when -- git re-cleans a file its backend does diff --git a/doc/todo/smudge.mdwn b/doc/todo/smudge.mdwn index 3a980b10a2..3ebb92b0ff 100644 --- a/doc/todo/smudge.mdwn +++ b/doc/todo/smudge.mdwn @@ -65,16 +65,16 @@ git-annex should use smudge/clean filters. * When git runs the smudge filter, it buffers all its output in ram before writing it to a file. So, checking out a branch with a large v6 unlocked files can cause git to use a lot of memory. - (This needs to be fixed in git, but my proposed interface in + + This needs to be fixed in git, but my proposed interface in would avoid the problem for git checkout, since it would use the new interface - and not the smudge filter.) + and not the smudge filter. -* When `git add` is run with a large file, it allocates memory for - the whole file content, even though it's only going - to stream it to the clean filter. My proposed smudge/clean - interface patch also fixed this problem, since it made git not read - the file at all. + Last verified with git 2.18 in 2018. + + To check: Does the long-running filter process interface have the same + problem? * Eventually (but not yet), make v6 the default for new repositories. Note that the assistant forces repos into direct mode; that will need to diff --git a/doc/todo/smudge/comment_12_a4712d510432931062406c4e256f04b1._comment b/doc/todo/smudge/comment_12_a4712d510432931062406c4e256f04b1._comment new file mode 100644 index 0000000000..cfa1be8bca --- /dev/null +++ b/doc/todo/smudge/comment_12_a4712d510432931062406c4e256f04b1._comment @@ -0,0 +1,11 @@ +[[!comment format=mdwn + username="joey" + subject="""re: Git 2.5 allows smudge filters to not read all of stdin""" + date="2018-08-09T22:11:00Z" + content=""" +@torarnv thanks for pointing that out.. I finally got around to verifying +that, and was able to speed up the smudge filter. Also this avoids the +problem that git for some reason buffers the whole file content in memory +when it sends it to the smudge filter, which is a pretty bad memory leak in git +that no longer affects this. +"""]]