From 56478e99acebb638121207bf48c06af1118544bc Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Wed, 21 Apr 2021 13:24:32 -0400 Subject: [PATCH 1/5] bug report --- doc/bugs/git-annex_branch_caching_bug.mdwn | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) create mode 100644 doc/bugs/git-annex_branch_caching_bug.mdwn diff --git a/doc/bugs/git-annex_branch_caching_bug.mdwn b/doc/bugs/git-annex_branch_caching_bug.mdwn new file mode 100644 index 0000000000..5c099f6fe5 --- /dev/null +++ b/doc/bugs/git-annex_branch_caching_bug.mdwn @@ -0,0 +1,22 @@ +If the journal contains a newer version of a log file than the git-annex +branch, and annex.alwayscommit=false so the branch is not getting updated, +the value from the journal can be ignored when reading that log file. + +In CmdLine.Seek, there is some code that precached location logs as an +optimisation. That streams info from the git-annex branch into the cache. +But it never checks for a journal file. + +One fix would be to just check the journal file too, but that would +probably slow down the optimisation and would not speed up anything. So I +think that the caching needs to note that it's got cached a value from the +git-annex branch, but that the journal file needs to be checked for before +using that cached data. + +Also in Cmdline.Seek, there is a LsTreeRecursive over the branch to handle +`--all`, and I think again that would mean it doesn't notice location +logs that are only in the journal. +Before that optimisation, it was using Logs.Location.loggedKeys, +which does look at the journal. + +(This is also a blocker for [[todo/hiding_a_repository]].) +--[[Joey]] From b470673e50f36cc39bd2513859102c38982c620a Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Wed, 21 Apr 2021 13:46:39 -0400 Subject: [PATCH 2/5] wip --- doc/bugs/git-annex_branch_caching_bug.mdwn | 13 +++++-------- 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/doc/bugs/git-annex_branch_caching_bug.mdwn b/doc/bugs/git-annex_branch_caching_bug.mdwn index 5c099f6fe5..829cd92a89 100644 --- a/doc/bugs/git-annex_branch_caching_bug.mdwn +++ b/doc/bugs/git-annex_branch_caching_bug.mdwn @@ -2,15 +2,12 @@ If the journal contains a newer version of a log file than the git-annex branch, and annex.alwayscommit=false so the branch is not getting updated, the value from the journal can be ignored when reading that log file. -In CmdLine.Seek, there is some code that precached location logs as an -optimisation. That streams info from the git-annex branch into the cache. -But it never checks for a journal file. +In CmdLine.Seek, there is some code that precaches location logs as an +optimisation (when using eg --copies). That streams info from the +git-annex branch into the cache. But it never checks for a journal file +with newer information. -One fix would be to just check the journal file too, but that would -probably slow down the optimisation and would not speed up anything. So I -think that the caching needs to note that it's got cached a value from the -git-annex branch, but that the journal file needs to be checked for before -using that cached data. +> fixed this Also in Cmdline.Seek, there is a LsTreeRecursive over the branch to handle `--all`, and I think again that would mean it doesn't notice location From 6eb3c0a6b4f329cbc3b1d93fcf69eeaab744f64e Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Wed, 21 Apr 2021 14:02:15 -0400 Subject: [PATCH 3/5] fix branch precacheing bug by checking journal Fix bug caused by recent optimisations that could make git-annex not see recently recorded status information when configured with annex.alwayscommit=false. When not using --all, precaching only gets triggered when the command actually needs location logs, and so there's no speed hit there. This is a minor speed hit for --all, because it precaches even when the location log is not actually going to be used, and so checking the journal is not necessary. It would have been possible to defer checking the journal until the cache gets used. But that would complicate the usual Branch.get code path with two different kinds of caches, and the speed hit is really minimal. A better way to speed up --all, later, would be to avoid precaching at all when the location log is not going to be used. --- Annex/Branch.hs | 15 ++++++++++++++- CHANGELOG | 3 +++ CmdLine/Seek.hs | 5 ++--- 3 files changed, 19 insertions(+), 4 deletions(-) diff --git a/Annex/Branch.hs b/Annex/Branch.hs index 3acb12975b..9b20c7a907 100644 --- a/Annex/Branch.hs +++ b/Annex/Branch.hs @@ -1,6 +1,6 @@ {- management of the git-annex branch - - - Copyright 2011-2020 Joey Hess + - Copyright 2011-2021 Joey Hess - - Licensed under the GNU AGPL version 3 or higher. -} @@ -30,6 +30,7 @@ module Annex.Branch ( rememberTreeish, performTransitions, withIndex, + precache, ) where import qualified Data.ByteString as B @@ -260,6 +261,18 @@ get file = getCache file >>= \case setCache file content return content +{- Used to cache the value of a file, which has been read from the branch + - using some optimised method. The journal has to be checked, in case + - it has a newer version of the file that has not reached the branch yet. + -} +precache :: RawFilePath -> L.ByteString -> Annex () +precache file branchcontent = do + st <- getState + content <- if journalIgnorable st + then pure branchcontent + else fromMaybe branchcontent <$> getJournalFileStale file + Annex.BranchState.setCache file content + {- Like get, but does not merge the branch, so the info returned may not - reflect changes in remotes. - (Changing the value this returns, and then merging is always the diff --git a/CHANGELOG b/CHANGELOG index cdbf2c004f..a9793f09d0 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -12,6 +12,9 @@ git-annex (8.20210331) UNRELEASED; urgency=medium * directory: When cp supports reflinks, use it. * init: Fix a crash when the repo's was cloned from a repo that had an adjusted branch checked out, and the origin remote is not named "origin". + * Fix bug caused by recent optimisations that could make git-annex not + see recently recorded status information when configured with + annex.alwayscommit=false. -- Joey Hess Thu, 01 Apr 2021 12:17:26 -0400 diff --git a/CmdLine/Seek.hs b/CmdLine/Seek.hs index 65156772e3..1245ec4d4d 100644 --- a/CmdLine/Seek.hs +++ b/CmdLine/Seek.hs @@ -44,7 +44,6 @@ import Annex.Concurrent import Annex.CheckIgnore import Annex.Action import qualified Annex.Branch -import qualified Annex.BranchState import qualified Database.Keys import qualified Utility.RawFilePath as R import Utility.Tuple @@ -288,7 +287,7 @@ withKeyOptions' ko auto mkkeyaction fallbackaction worktreeitems = do let go reader = liftIO reader >>= \case Nothing -> return () Just ((k, f), content) -> checktimelimit (discard reader) $ do - maybe noop (Annex.BranchState.setCache f) content + maybe noop (Annex.Branch.precache f) content keyaction Nothing (SeekInput [], k, mkActionItem k) go reader catObjectStreamLsTree l (getk . getTopFilePath . LsTree.file) g go @@ -395,7 +394,7 @@ seekFilteredKeys seeker listfs = do precachefinisher mi lreader checktimelimit = liftIO lreader >>= \case Just ((logf, (si, f), k), logcontent) -> checktimelimit discard $ do - maybe noop (Annex.BranchState.setCache logf) logcontent + maybe noop (Annex.Branch.precache logf) logcontent checkMatcherWhen mi (matcherNeedsLocationLog mi && not (matcherNeedsFileName mi)) (MatchingFile $ FileInfo f f (Just k)) From 74acf17a3158938c8034fdf58e5b05cfdabb8b9f Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Wed, 21 Apr 2021 14:19:58 -0400 Subject: [PATCH 4/5] refactoring --- Annex/Branch.hs | 27 +++++++++++++++++++++++++++ CmdLine/Seek.hs | 17 ++++------------- 2 files changed, 31 insertions(+), 13 deletions(-) diff --git a/Annex/Branch.hs b/Annex/Branch.hs index 9b20c7a907..73ab2ac10d 100644 --- a/Annex/Branch.hs +++ b/Annex/Branch.hs @@ -31,6 +31,7 @@ module Annex.Branch ( performTransitions, withIndex, precache, + overBranchFileContents, ) where import qualified Data.ByteString as B @@ -66,6 +67,7 @@ import Annex.HashObject import Git.Types (Ref(..), fromRef, fromRef', RefDate, TreeItemType(..)) import Git.FilePath import Annex.CatFile +import Git.CatFile (catObjectStreamLsTree) import Annex.Perms import Logs import Logs.Transitions @@ -752,3 +754,28 @@ rememberTreeishLocked treeish graftpoint jl = do -- say that the index contains c'. setIndexSha c' +{- Runs an action on the content of selected files from the branch. + - This is much faster than reading the content of each file in turn, + - because it lets git cat-file stream content as fast as it can run. + - + - The action is passed an IO action that it can repeatedly call to read + - the next file and its contents. When there are no more files, that + - action will return Nothing. + -} +overBranchFileContents + :: (RawFilePath -> Maybe v) + -> (IO (Maybe (v, RawFilePath, Maybe L.ByteString)) -> Annex ()) + -> Annex () +overBranchFileContents select go = do + void update + g <- Annex.gitRepo + (l, cleanup) <- inRepo $ Git.LsTree.lsTree + Git.LsTree.LsTreeRecursive + (Git.LsTree.LsTreeLong False) + fullname + let select' f = fmap (\v -> (v, f)) (select f) + let go' reader = go $ reader >>= \case + Nothing -> return Nothing + Just ((v, f), content) -> return (Just (v, f, content)) + catObjectStreamLsTree l (select' . getTopFilePath . Git.LsTree.file) g go' + liftIO $ void cleanup diff --git a/CmdLine/Seek.hs b/CmdLine/Seek.hs index 1245ec4d4d..3f0bfa8ce3 100644 --- a/CmdLine/Seek.hs +++ b/CmdLine/Seek.hs @@ -9,8 +9,6 @@ - Licensed under the GNU AGPL version 3 or higher. -} -{-# LANGUAGE TupleSections #-} - module CmdLine.Seek where import Annex.Common @@ -273,25 +271,18 @@ withKeyOptions' ko auto mkkeyaction fallbackaction worktreeitems = do checktimelimit <- mkCheckTimeLimit keyaction <- mkkeyaction config <- Annex.getGitConfig - g <- Annex.gitRepo - void Annex.Branch.update - (l, cleanup) <- inRepo $ LsTree.lsTree - LsTree.LsTreeRecursive - (LsTree.LsTreeLong False) - Annex.Branch.fullname - let getk f = fmap (,f) (locationLogFileKey config f) + let getk = locationLogFileKey config let discard reader = reader >>= \case Nothing -> noop Just _ -> discard reader let go reader = liftIO reader >>= \case - Nothing -> return () - Just ((k, f), content) -> checktimelimit (discard reader) $ do + Just (k, f, content) -> checktimelimit (discard reader) $ do maybe noop (Annex.Branch.precache f) content keyaction Nothing (SeekInput [], k, mkActionItem k) go reader - catObjectStreamLsTree l (getk . getTopFilePath . LsTree.file) g go - liftIO $ void cleanup + Nothing -> return () + Annex.Branch.overBranchFileContents getk go runkeyaction getks = do keyaction <- mkkeyaction From 653b7194727a27dfd185cdb0b752f8e190250fc4 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Wed, 21 Apr 2021 15:40:32 -0400 Subject: [PATCH 5/5] fix --all to include not yet committed files from the journal Fix bug caused by recent optimisations that could make git-annex not see recently recorded status information when configured with annex.alwayscommit=false. This does mean that --all can end up processing the same key more than once, but before the optimisations that introduced this bug, it used to also behave that way. So I didn't try to fix that; it's an edge case and anyway git-annex behaves well when run on the same key repeatedly. I am not too happy with the use of a MVar to buffer the list of files in the journal. I guess it doesn't defeat lazy streaming of the list, if that list is actually generated lazily, and anyway the size of the journal is normally capped and small, so if configs are changed to make it huge and this code path fire, git-annex using enough memory to buffer it all is not a large problem. --- Annex/Branch.hs | 50 ++++++++++++++++++---- CmdLine/Seek.hs | 14 +++--- doc/bugs/git-annex_branch_caching_bug.mdwn | 5 ++- doc/todo/hiding_a_repository.mdwn | 16 ++----- 4 files changed, 55 insertions(+), 30 deletions(-) diff --git a/Annex/Branch.hs b/Annex/Branch.hs index 73ab2ac10d..35a7b5e18b 100644 --- a/Annex/Branch.hs +++ b/Annex/Branch.hs @@ -43,6 +43,7 @@ import Data.Function import Data.Char import Data.ByteString.Builder import Control.Concurrent (threadDelay) +import Control.Concurrent.MVar import qualified System.FilePath.ByteString as P import Annex.Common @@ -756,26 +757,57 @@ rememberTreeishLocked treeish graftpoint jl = do {- Runs an action on the content of selected files from the branch. - This is much faster than reading the content of each file in turn, - - because it lets git cat-file stream content as fast as it can run. + - because it lets git cat-file stream content without blocking. - - - The action is passed an IO action that it can repeatedly call to read - - the next file and its contents. When there are no more files, that - - action will return Nothing. + - The action is passed a callback that it can repeatedly call to read + - the next file and its contents. When there are no more files, the + - callback will return Nothing. -} overBranchFileContents :: (RawFilePath -> Maybe v) - -> (IO (Maybe (v, RawFilePath, Maybe L.ByteString)) -> Annex ()) + -> (Annex (Maybe (v, RawFilePath, Maybe L.ByteString)) -> Annex ()) -> Annex () overBranchFileContents select go = do - void update + st <- update g <- Annex.gitRepo (l, cleanup) <- inRepo $ Git.LsTree.lsTree Git.LsTree.LsTreeRecursive (Git.LsTree.LsTreeLong False) fullname let select' f = fmap (\v -> (v, f)) (select f) - let go' reader = go $ reader >>= \case - Nothing -> return Nothing - Just ((v, f), content) -> return (Just (v, f, content)) + buf <- liftIO newEmptyMVar + let go' reader = go $ liftIO reader >>= \case + Just ((v, f), content) -> do + -- Check the journal if it did not get + -- committed to the branch + content' <- if journalIgnorable st + then pure content + else maybe content Just <$> getJournalFileStale f + return (Just (v, f, content')) + Nothing + | journalIgnorable st -> return Nothing + -- The journal did not get committed to the + -- branch, and may contain files that + -- are not present in the branch, which + -- need to be provided to the action still. + -- This can cause the action to be run a + -- second time with a file it already ran on. + | otherwise -> liftIO (tryTakeMVar buf) >>= \case + Nothing -> drain buf =<< getJournalledFilesStale + Just fs -> drain buf fs catObjectStreamLsTree l (select' . getTopFilePath . Git.LsTree.file) g go' liftIO $ void cleanup + where + getnext [] = Nothing + getnext (f:fs) = case select f of + Nothing -> getnext fs + Just v -> Just (v, f, fs) + + drain buf fs = case getnext fs of + Just (v, f, fs') -> do + liftIO $ putMVar buf fs' + content <- getJournalFileStale f + return (Just (v, f, content)) + Nothing -> do + liftIO $ putMVar buf [] + return Nothing diff --git a/CmdLine/Seek.hs b/CmdLine/Seek.hs index 3f0bfa8ce3..096c588485 100644 --- a/CmdLine/Seek.hs +++ b/CmdLine/Seek.hs @@ -276,7 +276,7 @@ withKeyOptions' ko auto mkkeyaction fallbackaction worktreeitems = do let discard reader = reader >>= \case Nothing -> noop Just _ -> discard reader - let go reader = liftIO reader >>= \case + let go reader = reader >>= \case Just (k, f, content) -> checktimelimit (discard reader) $ do maybe noop (Annex.Branch.precache f) content keyaction Nothing (SeekInput [], k, mkActionItem k) @@ -373,7 +373,7 @@ seekFilteredKeys seeker listfs = do liftIO $ void cleanup where finisher mi oreader checktimelimit = liftIO oreader >>= \case - Just ((si, f), content) -> checktimelimit discard $ do + Just ((si, f), content) -> checktimelimit (liftIO discard) $ do keyaction f mi content $ commandAction . startAction seeker si f finisher mi oreader checktimelimit @@ -384,7 +384,7 @@ seekFilteredKeys seeker listfs = do Just _ -> discard precachefinisher mi lreader checktimelimit = liftIO lreader >>= \case - Just ((logf, (si, f), k), logcontent) -> checktimelimit discard $ do + Just ((logf, (si, f), k), logcontent) -> checktimelimit (liftIO discard) $ do maybe noop (Annex.Branch.precache logf) logcontent checkMatcherWhen mi (matcherNeedsLocationLog mi && not (matcherNeedsFileName mi)) @@ -591,9 +591,9 @@ notSymlink :: RawFilePath -> IO Bool notSymlink f = liftIO $ not . isSymbolicLink <$> R.getSymbolicLinkStatus f {- Returns an action that, when there's a time limit, can be used - - to check it before processing a file. The IO action is run when over the - - time limit. -} -mkCheckTimeLimit :: Annex (IO () -> Annex () -> Annex ()) + - to check it before processing a file. The first action is run when over the + - time limit, otherwise the second action is run. -} +mkCheckTimeLimit :: Annex (Annex () -> Annex () -> Annex ()) mkCheckTimeLimit = Annex.getState Annex.timelimit >>= \case Nothing -> return $ \_ a -> a Just (duration, cutoff) -> return $ \cleanup a -> do @@ -602,6 +602,6 @@ mkCheckTimeLimit = Annex.getState Annex.timelimit >>= \case then do warning $ "Time limit (" ++ fromDuration duration ++ ") reached! Shutting down..." shutdown True - liftIO cleanup + cleanup liftIO $ exitWith $ ExitFailure 101 else a diff --git a/doc/bugs/git-annex_branch_caching_bug.mdwn b/doc/bugs/git-annex_branch_caching_bug.mdwn index 829cd92a89..c1d7adaefd 100644 --- a/doc/bugs/git-annex_branch_caching_bug.mdwn +++ b/doc/bugs/git-annex_branch_caching_bug.mdwn @@ -15,5 +15,8 @@ logs that are only in the journal. Before that optimisation, it was using Logs.Location.loggedKeys, which does look at the journal. +> fixed + (This is also a blocker for [[todo/hiding_a_repository]].) ---[[Joey]] + +[[done]] --[[Joey]] diff --git a/doc/todo/hiding_a_repository.mdwn b/doc/todo/hiding_a_repository.mdwn index 38d0e6e200..8fd380282d 100644 --- a/doc/todo/hiding_a_repository.mdwn +++ b/doc/todo/hiding_a_repository.mdwn @@ -151,19 +151,9 @@ later write. > No way to configure what repo is hidden yet. --[[Joey]] > > Implementation notes: -> -> * CmdLine.Seek precaches git-annex branch -> location logs, but that does not include private ones. Since they're -> cached, the private ones don't get read. Result is eg, whereis finds no -> copies. Either need to disable CmdLine.Seek precaching when there's -> hidden repos, or could make the cache indicate it's only of public -> info, so private info still gets read. -> * CmdLine.Seek contains a LsTreeRecursive over the branch to handle -> --all, and again that won't see private information, including even -> annexed files that are only present in the hidden repo. -> * (And I wonder, don't both the caches above already miss things in -> the journal?) -> * Any other direct accesses of the branch, not going through +> +> * [[bugs/git-annex_branch_caching_bug]] was a problem, now fixed. +> * Any other similar direct accesses of the branch, not going through > Annex.Branch, also need to be fixed (and may be missing journal files > already?) Command.ImportFeed.knownItems is one. Command.Log behavior > needs to be investigated, may be ok. And Logs.Web.withKnownUrls is another.