massive v6 add speed/memory improvement
v6 add: Take advantage of improved SIGPIPE handler in git 2.5 to speed up the clean filter by not reading the file content from the pipe. This also avoids git buffering the whole file content in memory.

When built with an older git, still consumes stdin. If built with a newer git and used with an older one, it breaks, but that's acceptable -- checking the git version every time would make repeated smudge runs slow.

This commit was supported by the NSF-funded DataLad project.
parent 74551a430a
commit a96972015d
4 changed files with 35 additions and 12 deletions
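The version gate described in the commit message can be sketched in isolation. This is a minimal, hypothetical illustration and not the git-annex code: `buildGitVersion`, `parseVersion`, `olderThan`, `StdinAction`, `stdinAction` and `finishStdin` are names invented here, and the hard-coded version string is an assumption standing in for whatever git version was recorded at build time.

```haskell
import qualified Data.ByteString.Lazy as B
import Data.Char (isDigit)
import System.IO (hClose, stdin)

-- Hypothetical stand-in for the git version recorded at build time.
buildGitVersion :: String
buildGitVersion = "2.18.0"

-- Parse a dotted version such as "2.5" or "2.18.0" into its numeric
-- components, stopping at the first non-numeric component.
parseVersion :: String -> [Int]
parseVersion s = case span isDigit s of
  ("", _) -> []
  (ds, rest) -> read ds : case rest of
    ('.' : more) -> parseVersion more
    _ -> []

-- True when version a sorts strictly before version b.
olderThan :: String -> String -> Bool
olderThan a b = parseVersion a < parseVersion b

-- What the clean filter does with the rest of stdin.
data StdinAction = ConsumeAll | CloseEarly
  deriving (Eq, Show)

-- Before git 2.5 the filter has to drain stdin; from 2.5 on it can
-- close the pipe and let git stop sending.
stdinAction :: String -> StdinAction
stdinAction built
  | built `olderThan` "2.5" = ConsumeAll
  | otherwise = CloseEarly

finishStdin :: StdinAction -> IO ()
finishStdin ConsumeAll = do
  rest <- B.hGetContents stdin
  -- Force the lazy read so the whole input is actually consumed.
  B.length rest `seq` return ()
finishStdin CloseEarly = hClose stdin

main :: IO ()
main = finishStdin (stdinAction buildGitVersion)
```

The sketch compares parsed numeric components rather than raw strings because, as strings, "2.18" sorts before "2.5". Deciding from a build-time constant mirrors the trade-off in the commit message: no per-run version check, at the cost of breaking when the binary is later used with an older git.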
@@ -6,6 +6,9 @@ git-annex (6.20180808) UNRELEASED; urgency=medium
 Affected commands: find, add, whereis, drop, copy, move, get
 * Make metadata --batch combined with matching options refuse to run,
 since it does not seem worth supporting that combination.
+* v6 add: Take advantage of improved SIGPIPE handler in git 2.5 to
+speed up the clean filter by not reading the file content from the
+pipe. This also avoids git buffering the whole file content in memory.
 
 -- Joey Hess <id@joeyh.name> Wed, 08 Aug 2018 11:24:08 -0400
 
@@ -16,6 +16,7 @@ import Annex.Ingest
 import Annex.CatFile
 import Logs.Location
 import qualified Database.Keys
+import qualified Git.BuildVersion
 import Git.FilePath
 import Backend
 
@@ -68,7 +69,7 @@ smudge file = do
 
 -- Clean filter is fed file content on stdin, decides if a file
 -- should be stored in the annex, and outputs a pointer to its
--- injested content.
+-- injested content if so. Otherwise, the original content.
 clean :: FilePath -> CommandStart
 clean file = do
 b <- liftIO $ B.hGetContents stdin
@@ -76,10 +77,18 @@ clean file = do
 then liftIO $ B.hPut stdout b
 else ifM (shouldAnnex file)
 ( do
--- Even though we ingest the actual file,
--- and not stdin, we need to consume all
--- stdin, or git will get annoyed.
-B.length b `seq` return ()
+-- Before git 2.5, failing to consume all
+-- stdin here would cause a SIGPIPE and
+-- crash it.
+-- Newer git catches the signal and
+-- stops sending, which is much faster.
+-- (Also, git seems to forget to free memory
+-- when sending the file, so the less we
+-- let it send, the less memory it will
+-- waste.)
+if Git.BuildVersion.older "2.5"
+then B.length b `seq` return ()
+else liftIO $ hClose stdin
 -- Look up the backend that was used
 -- for this file before, so that when
 -- git re-cleans a file its backend does
@@ -65,16 +65,16 @@ git-annex should use smudge/clean filters.
 * When git runs the smudge filter, it buffers all its output in ram before
 writing it to a file. So, checking out a branch with a large v6 unlocked files
 can cause git to use a lot of memory.
-(This needs to be fixed in git, but my proposed interface in
+
+This needs to be fixed in git, but my proposed interface in
 <http://thread.gmane.org/gmane.comp.version-control.git/294425> would
 avoid the problem for git checkout, since it would use the new interface
-and not the smudge filter.)
+and not the smudge filter.
 
-* When `git add` is run with a large file, it allocates memory for
-the whole file content, even though it's only going
-to stream it to the clean filter. My proposed smudge/clean
-interface patch also fixed this problem, since it made git not read
-the file at all.
+Last verified with git 2.18 in 2018.
+
+To check: Does the long-running filter process interface have the same
+problem?
 
 * Eventually (but not yet), make v6 the default for new repositories.
 Note that the assistant forces repos into direct mode; that will need to
@@ -0,0 +1,11 @@
+[[!comment format=mdwn
+username="joey"
+subject="""re: Git 2.5 allows smudge filters to not read all of stdin"""
+date="2018-08-09T22:11:00Z"
+content="""
+@torarnv thanks for pointing that out.. I finally got around to verifying
+that, and was able to speed up the smudge filter. Also this avoids the
+problem that git for some reason buffers the whole file content in memory
+when it sends it to the smudge filter, which is a pretty bad memory leak in git
+that no longer affects this.
+"""]]