Merge branch 'master' of ssh://git-annex.branchable.com

Joey Hess 2014-07-04 15:28:58 -04:00
commit 2622fd3192
3 changed files with 47 additions and 0 deletions


@@ -0,0 +1,8 @@
[[!comment format=mdwn
username="http://joeyh.name/"
ip="209.250.56.55"
subject="comment 1"
date="2014-07-04T18:24:51Z"
content="""
The diff-filter=T comes from when Command.Add runs its pass to find unlocked files. It has finished adding all the files by then, so it must be either that pass or the git-annex branch commit that's running out of memory, I think.
"""]]


@@ -0,0 +1,12 @@
[[!comment format=mdwn
username="http://joeyh.name/"
ip="209.250.56.55"
subject="comment 2"
date="2014-07-04T18:36:49Z"
content="""
The diff-filter=T command does not seem to be the problem. It's not outputting a lot of files, and it should stream over them even if it were.
The last xargs appears to be at or near the problem. It is called by Annex.Content.saveState, which first does an Annex.Queue.flush, followed by an Annex.Branch.commit. I suspect the problem is the latter; at this point there are 2 million files in .git/annex/journal waiting to be committed to the git-annex branch.
In the same big repo, I can add one more file and reproduce the problem by running `git annex add $newfile`.
"""]]


@@ -0,0 +1,27 @@
[[!comment format=mdwn
username="http://joeyh.name/"
ip="209.250.56.55"
subject="comment 3"
date="2014-07-04T19:26:00Z"
content="""
Looking at the code, it's pretty clear why this is using a lot of memory:
<pre>
    fs <- getJournalFiles jl
    liftIO $ do
        h <- hashObjectStart g
        Git.UpdateIndex.streamUpdateIndex g
            [genstream dir h fs]
        hashObjectStop h
    return $ liftIO $ mapM_ (removeFile . (dir </>)) fs
</pre>
So the big list in `fs` has to be retained in memory after the files are streamed to update-index, in order for them to be deleted!
Fixing this is a bit tricky. New journal files can appear while this is going on, so it can't just run getJournalFiles a second time to get the files to clean up.
Maybe delete the file after it's been sent to git-update-index? But git-update-index is going to want to read the file, and we don't really know when it will choose to do so. It could wait a while after we've sent the filename to it, potentially.
Also, per [[!commit 750c4ac6c282d14d19f79e0711f858367da145e4]], we cannot delete the journal files until *after* the commit, or another git-annex process would see inconsistent data!
I actually think I am going to need to use a temp file to hold the list of files.
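
A minimal sketch of that temp-file idea, not the actual fix. The names `stageAndLog`, `cleanupFromLog`, and `stage` are hypothetical, `stage` stands in for whatever feeds a file to git-update-index, and the sketch assumes journal file names never contain newlines. It only saves memory if `names` is produced lazily (for example from a directory stream) rather than held elsewhere as a complete list.

```haskell
import Control.Monad (forM_, unless, when)
import System.Directory (doesFileExist, removeFile)
import System.FilePath ((</>))
import System.IO

-- Hypothetical helper: hand each journal file name to `stage` (a stand-in
-- for feeding git-update-index) and append the name to a log file, so the
-- list of names does not have to be retained for the later cleanup.
-- Assumes file names contain no newlines.
stageAndLog :: FilePath -> (FilePath -> IO ()) -> [FilePath] -> IO ()
stageAndLog logFile stage names =
    withFile logFile WriteMode $ \h ->
        forM_ names $ \name -> do
            stage name
            hPutStrLn h name

-- Hypothetical helper, to be run only after the commit: delete exactly the
-- files named in the log, one line at a time, then the log itself. Journal
-- files that appeared later are not in the log and are left alone.
cleanupFromLog :: FilePath -> FilePath -> IO ()
cleanupFromLog dir logFile = do
    withFile logFile ReadMode loop
    removeFile logFile
  where
    loop h = do
        eof <- hIsEOF h
        unless eof $ do
            name <- hGetLine h
            let path = dir </> name
            -- Tolerate a file that has already been removed.
            exists <- doesFileExist path
            when exists $ removeFile path
            loop h
```

The cleanup reads the log back a line at a time, so memory use stays constant regardless of how many journal files were staged, and only files that were actually staged get deleted.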
"""]]