Improve memory use of --all when using annex.private

This does not improve Annex.Branch.files at all, since it still uses ++ to
combine the lists, which forces all but the last one.

But when there are a lot of files in the private journal, it does keep
--all (or a bare repo) from buffering the filenames in memory.

See commit 653b719472 for prior discussion of
this buffering.
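
As a rough illustration of the two-list traversal this commit switches to in overBranchFileContents', here is a minimal standalone sketch. It walks the regular journal list first and the private journal list second, without joining them with ++. The names `Select`, `getNext`, `drainAll` and `selectLog` are hypothetical stand-ins, not git-annex's actual definitions, and the sketch only models the order of traversal, not the memory behaviour or the MVar used to hold the remaining lists between calls.

```haskell
import Data.List (isSuffixOf)

-- | Hypothetical stand-in for the selector passed in by the caller:
-- returns Just a value for files of interest, Nothing otherwise.
type Select a = FilePath -> Maybe a

-- | Find the next selected file, consuming the first list before the
-- second. Returns the selected value, the file, and the unconsumed
-- remainder of both lists.
getNext :: Select a -> [FilePath] -> [FilePath]
        -> Maybe (a, FilePath, [FilePath], [FilePath])
getNext _ [] [] = Nothing
getNext select (f:fs) pfs = case select f of
    Nothing -> getNext select fs pfs
    Just v -> Just (v, f, fs, pfs)
getNext select [] (pf:pfs) = case select pf of
    Nothing -> getNext select [] pfs
    Just v -> Just (v, pf, [], pfs)

-- | Repeatedly call getNext until both lists are exhausted, collecting
-- every selected file in the order it was reached.
drainAll :: Select a -> [FilePath] -> [FilePath] -> [(a, FilePath)]
drainAll select fs pfs = case getNext select fs pfs of
    Nothing -> []
    Just (v, f, fs', pfs') -> (v, f) : drainAll select fs' pfs'

-- | Example selector (an assumption for illustration only): pick files
-- ending in ".log" and return the name with that suffix removed.
selectLog :: Select String
selectLog f
    | ".log" `isSuffixOf` f = Just (take (length f - 4) f)
    | otherwise = Nothing
```

Calling `drainAll selectLog regular private` with two lists of file names returns the selected entries from `regular` followed by those from `private`, which is the same order the concatenated list would have given.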

Sponsored-by: Graham Spencer on Patreon
Joey Hess 2023-10-24 13:06:54 -04:00
parent 18f902efa9
commit 0da1d40cd4
GPG key ID: DB12DB0FF05F8F38
3 changed files with 42 additions and 25 deletions

@ -1,6 +1,6 @@
{- management of the git-annex branch
-
- Copyright 2011-2022 Joey Hess <id@joeyh.name>
- Copyright 2011-2023 Joey Hess <id@joeyh.name>
-
- Licensed under the GNU AGPL version 3 or higher.
-}
@ -597,21 +597,24 @@ files = do
then return Nothing
else do
(bfs, cleanup) <- branchFiles
jfs <- journalledFiles
pjfs <- journalledFilesPrivate
-- ++ forces the content of the first list to be
-- buffered in memory, so use journalledFiles,
-- which should be much smaller most of the time.
-- branchFiles will stream as the list is consumed.
l <- (++) <$> journalledFiles <*> pure bfs
let l = jfs ++ pjfs ++ bfs
return (Just (l, cleanup))
{- Lists all files currently in the journal. There may be duplicates in
- the list when using a private journal. -}
{- Lists all files currently in the journal, but not files in the private
- journal. -}
journalledFiles :: Annex [RawFilePath]
journalledFiles = ifM privateUUIDsKnown
( (++)
<$> getJournalledFilesStale gitAnnexPrivateJournalDir
<*> getJournalledFilesStale gitAnnexJournalDir
, getJournalledFilesStale gitAnnexJournalDir
journalledFiles = getJournalledFilesStale gitAnnexJournalDir
journalledFilesPrivate :: Annex [RawFilePath]
journalledFilesPrivate = ifM privateUUIDsKnown
( getJournalledFilesStale gitAnnexPrivateJournalDir
, return []
)
{- Files in the branch, not including any from journalled changes,
@ -992,8 +995,11 @@ overBranchFileContents' select go st = do
-- This can cause the action to be run a
-- second time with a file it already ran on.
| otherwise -> liftIO (tryTakeMVar buf) >>= \case
Nothing -> drain buf =<< journalledFiles
Just fs -> drain buf fs
Nothing -> do
jfs <- journalledFiles
pjfs <- journalledFilesPrivate
drain buf jfs pjfs
Just (jfs, pjfs) -> drain buf jfs pjfs
catObjectStreamLsTree l (select' . getTopFilePath . Git.LsTree.file) g go'
`finally` liftIO (void cleanup)
where
@ -1007,9 +1013,9 @@ overBranchFileContents' select go st = do
PossiblyStaleJournalledContent journalledcontent ->
Just (fromMaybe mempty branchcontent <> journalledcontent)
drain buf fs = case getnext fs of
Just (v, f, fs') -> do
liftIO $ putMVar buf fs'
drain buf fs pfs = case getnext fs pfs of
Just (v, f, fs', pfs') -> do
liftIO $ putMVar buf (fs', pfs')
content <- getJournalFileStale (GetPrivate True) f >>= \case
NoJournalledContent -> return Nothing
JournalledContent journalledcontent ->
@ -1022,13 +1028,16 @@ overBranchFileContents' select go st = do
return (Just (content <> journalledcontent))
return (Just (v, f, content))
Nothing -> do
liftIO $ putMVar buf []
liftIO $ putMVar buf ([], [])
return Nothing
getnext [] = Nothing
getnext (f:fs) = case select f of
Nothing -> getnext fs
Just v -> Just (v, f, fs)
getnext [] [] = Nothing
getnext (f:fs) pfs = case select f of
Nothing -> getnext fs pfs
Just v -> Just (v, f, fs, pfs)
getnext [] (pf:pfs) = case select pf of
Nothing -> getnext [] pfs
Just v -> Just (v, pf, [], pfs)
{- Check if the git-annex branch has been updated from the oldtree.
- If so, returns the tuple of the old and new trees. -}

@ -4,6 +4,7 @@ git-annex (10.20230927) UNRELEASED; urgency=medium
* Fix crash of enableremote when the special remote has embedcreds=yes.
* importfeed: Use caching database to avoid needing to list urls
on every run, and avoid using too much memory.
* Improve memory use of --all when using annex.private.
-- Joey Hess <id@joeyh.name> Tue, 10 Oct 2023 13:17:31 -0400

@ -1,16 +1,23 @@
Using --all, or running in a bare repo, as well as
`git annex unused` and `git annex info` all end up buffering the list of
all keys that have uncommitted journalled changes in memory.
This is due to Annex.Branch.files's call to getJournalledFilesStale which
reads all the files in the directory into a buffer.
`git annex unused --from=$remote` and `git annex info $remote`
buffer the list of keys that have uncommitted journalled changes
in memory. This is due to Annex.Branch.files, which reads all the
files in the journal into a buffer.
Note that the list of keys in the branch *does* stream in, so this
is only really a problem when using annex.alwayscommit=false to build
up big git-annex branch commits via the journal.
up big git-annex branch commits via the journal. Or using annex.private,
since the private journal can build up a lot of keys in it.
An attempt at making it stream via unsafeInterleaveIO failed miserably
and that is not the right approach. This would be a good place to use
ResourceT, but it might need some changes to the Annex monad to allow
combining the two. --[[Joey]]
> This used to also affect --all and using git-annex in a bare repo, but
> that was avoided by using the overBranchFileContents interface. This
> suggests that changing to that interface in unused and info would be a
> solution.
[[!tag confirmed]]
[[!meta title="improve memory usage of unused and info when the journal contains a lot of files"]]