diff --git a/doc/bugs/runs_of_of_memory_adding_2_million_files/comment_1_8b60b7816b9bf2c8cdd21b5cae431555._comment b/doc/bugs/runs_of_of_memory_adding_2_million_files/comment_1_8b60b7816b9bf2c8cdd21b5cae431555._comment
new file mode 100644
index 0000000000..d4b1c3fd8d
--- /dev/null
+++ b/doc/bugs/runs_of_of_memory_adding_2_million_files/comment_1_8b60b7816b9bf2c8cdd21b5cae431555._comment
@@ -0,0 +1,8 @@
+[[!comment format=mdwn
+ username="http://joeyh.name/"
+ ip="209.250.56.55"
+ subject="comment 1"
+ date="2014-07-04T18:24:51Z"
+ content="""
+The diff-filter=T comes from the pass Command.Add runs to find unlocked files. By then it has finished adding all the files, so it must be either that pass or the git-annex branch commit that is running out of memory, I think.
+"""]]
diff --git a/doc/bugs/runs_of_of_memory_adding_2_million_files/comment_2_32908da23e4fb38a7d20b765a5ab4656._comment b/doc/bugs/runs_of_of_memory_adding_2_million_files/comment_2_32908da23e4fb38a7d20b765a5ab4656._comment
new file mode 100644
index 0000000000..7f0e8ba704
--- /dev/null
+++ b/doc/bugs/runs_of_of_memory_adding_2_million_files/comment_2_32908da23e4fb38a7d20b765a5ab4656._comment
@@ -0,0 +1,12 @@
+[[!comment format=mdwn
+ username="http://joeyh.name/"
+ ip="209.250.56.55"
+ subject="comment 2"
+ date="2014-07-04T18:36:49Z"
+ content="""
+It does not seem to be the diff-filter=T command that is the problem. It's not outputting a lot of files, and even if it did, git-annex would stream over them.
+
+The last xargs appears to be at or near the problem. It is called by Annex.Content.saveState, which first does an Annex.Queue.flush, followed by an Annex.Branch.commit. I suspect the problem is the latter; at this point there are 2 million files in .git/annex/journal waiting to be committed to the git-annex branch.
+
+In the same big repo, I can add one more file and reproduce the problem by running `git annex add $newfile`.
+"""]]
diff --git a/doc/bugs/runs_of_of_memory_adding_2_million_files/comment_3_3cff88b50eb3872565bccbeb6ee15716._comment b/doc/bugs/runs_of_of_memory_adding_2_million_files/comment_3_3cff88b50eb3872565bccbeb6ee15716._comment
new file mode 100644
index 0000000000..7e2c5568d2
--- /dev/null
+++ b/doc/bugs/runs_of_of_memory_adding_2_million_files/comment_3_3cff88b50eb3872565bccbeb6ee15716._comment
@@ -0,0 +1,27 @@
+[[!comment format=mdwn
+ username="http://joeyh.name/"
+ ip="209.250.56.55"
+ subject="comment 3"
+ date="2014-07-04T19:26:00Z"
+ content="""
+Looking at the code, it's pretty clear why this is using a lot of memory:
+
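+        -- fs is the complete list of journal file names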
+        fs <- getJournalFiles jl
+        liftIO $ do
+                h <- hashObjectStart g
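+                -- the names in fs are streamed to git update-index here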
+                Git.UpdateIndex.streamUpdateIndex g
+                        [genstream dir h fs]
+                hashObjectStop h
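+        -- the cleanup action returned here closes over fs, so the whole
+        -- list stays live until it finally runs, after the commit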
+        return $ liftIO $ mapM_ (removeFile . (dir </>)) fs
+
+So the big list in `fs` has to be retained in memory after the files are streamed to update-index, in order for them to be deleted!
+
+Fixing this is a bit tricky. New journal files can appear while this is going on, so it can't just run getJournalFiles a second time to get the list of files to clean up.
+Maybe delete each file right after its name has been sent to git-update-index? But git-update-index is going to want to read the file, and we don't really know when it will choose to do so; it could potentially wait a while after we've sent it the filename.
+
+Also, per [[!commit 750c4ac6c282d14d19f79e0711f858367da145e4]], we cannot delete the journal files until *after* the commit, or another git-annex process would see inconsistent data!
+
+I actually think I am going to need to use a temp file to hold the list of files.
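+
+A very rough, untested sketch of that idea (the names here are illustrative,
+not the real git-annex functions; `send` stands in for feeding a path into
+the update-index stream above, and the temp file has to live outside the
+journal directory so it doesn't look like a journal file itself):
+
+        import Control.Monad (forM_)
+        import System.Directory (getDirectoryContents, removeFile)
+        import System.FilePath ((</>))
+        import System.IO (hClose, hPutStrLn, openTempFile)
+
+        -- Stream each journal file to the consumer, recording its name in
+        -- a temp file along the way.
+        stageJournalSketch :: FilePath -> FilePath -> (FilePath -> IO ()) -> IO (IO ())
+        stageJournalSketch tmpdir dir send = do
+                fs <- filter (`notElem` [".", ".."]) <$> getDirectoryContents dir
+                (tmpfile, tmph) <- openTempFile tmpdir "journalcleanup"
+                -- fs is consumed once here and referenced nowhere else, so
+                -- it can be garbage collected as the loop walks it
+                forM_ fs $ \f -> do
+                        send (dir </> f)
+                        hPutStrLn tmph f
+                hClose tmph
+                -- the returned cleanup action, run after the commit, gets
+                -- the names back from the temp file instead of closing over
+                -- fs; lazy readFile keeps this constant space too
+                return $ do
+                        cleanfs <- lines <$> readFile tmpfile
+                        mapM_ (removeFile . (dir </>)) cleanfs
+                        removeFile tmpfile
+"""]]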