implement journalledRepoSizes

The plan is to run this when populating Annex.reposizes on demand.
That way Annex.reposizes will be up-to-date with the journal, crucially
including journal entries for private repositories, but also anything
that has been written to the journal by another process, especially
one that was run with annex.alwayscommit=false.

From there, Annex.reposizes can be kept up to date with changes made
by the running process.
Joey Hess 2024-08-14 13:46:44 -04:00
parent 8ac2685b33
commit 3e6eb2a58d
7 changed files with 148 additions and 66 deletions
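The diff below adds journalledRepoSizes and wires it up to calcBranchRepoSizes.
As a minimal sketch of the intended use when populating Annex.reposizes on
demand (mirroring the limitFullyBalanced change further down; the wrapper name
getRepoSizesOnDemand is hypothetical and not part of this commit):

    -- Sum sizes from the git-annex branch, then overlay journalled
    -- location log changes, so private repositories and uncommitted
    -- journal entries are reflected.
    getRepoSizesOnDemand :: Annex (M.Map UUID RepoSize)
    getRepoSizesOnDemand = do
        (sizemap, branchsha) <- calcBranchRepoSizes
        journalledRepoSizes sizemap branchsha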


@@ -21,6 +21,7 @@ module Annex.Branch (
 	updateTo,
 	get,
 	getHistorical,
+	getRef,
 	getUnmergedRefs,
 	RegardingUUID(..),
 	change,
@@ -39,6 +40,7 @@ module Annex.Branch (
 	UnmergedBranches(..),
 	overBranchFileContents,
 	overJournalFileContents,
+	combineStaleJournalWithBranch,
 	updatedFromTree,
 ) where
@@ -1010,7 +1012,7 @@ overBranchFileContents
 	-- and in this case it's also possible for the callback to be
 	-- passed some of the same file content repeatedly.
 	-> (RawFilePath -> Maybe v)
-	-> (Annex (Maybe (v, RawFilePath, Maybe L.ByteString)) -> Annex a)
+	-> (Annex (Maybe (v, RawFilePath, Maybe (L.ByteString, Maybe Bool))) -> Annex a)
 	-> Annex (UnmergedBranches (a, Git.Sha))
 overBranchFileContents ignorejournal select go = do
 	st <- update
@@ -1024,7 +1026,7 @@ overBranchFileContents ignorejournal select go = do
 overBranchFileContents'
 	:: (RawFilePath -> Maybe v)
-	-> (Annex (Maybe (v, RawFilePath, Maybe L.ByteString)) -> Annex a)
+	-> (Annex (Maybe (v, RawFilePath, Maybe (L.ByteString, Maybe Bool))) -> Annex a)
 	-> BranchState
 	-> Annex (a, Git.Sha)
 overBranchFileContents' select go st = do
@@ -1038,11 +1040,14 @@ overBranchFileContents' select go st = do
 	buf <- liftIO newEmptyMVar
 	let go' reader = go $ liftIO reader >>= \case
 		Just ((v, f), content) -> do
-			content' <- checkjournal f content
+			content' <- checkjournal f content >>= return . \case
+				Nothing -> Nothing
+				Just c -> Just (c, Just False)
 			return (Just (v, f, content'))
 		Nothing
 			| journalIgnorable st -> return Nothing
-			| otherwise -> overJournalFileContents' buf (handlestale branchsha) select
+			| otherwise ->
+				overJournalFileContents' buf (handlestale branchsha) select
 	res <- catObjectStreamLsTree l (select' . getTopFilePath . Git.LsTree.file) g go'
 		`finally` liftIO (void cleanup)
 	return (res, branchsha)
@@ -1059,29 +1064,33 @@ overBranchFileContents' select go st = do
 	handlestale branchsha f journalledcontent = do
 		-- This is expensive, but happens only when there is a
 		-- private journal file.
-		content <- getRef branchsha f
-		return (content <> journalledcontent)
+		branchcontent <- getRef branchsha f
+		return (combineStaleJournalWithBranch branchcontent journalledcontent, Just True)
+
+combineStaleJournalWithBranch :: L.ByteString -> L.ByteString -> L.ByteString
+combineStaleJournalWithBranch branchcontent journalledcontent =
+	branchcontent <> journalledcontent
 
 {- Like overBranchFileContents but only reads the content of journalled
-- files. Note that when there are private UUIDs, the journal files may
-- only include information about the private UUID, while information about
-- other UUIDs has been committed to the git-annex branch.
+- files.
 -}
 overJournalFileContents
-	:: (RawFilePath -> Maybe v)
-	-> (Annex (Maybe (v, RawFilePath, Maybe L.ByteString)) -> Annex a)
+	:: (RawFilePath -> L.ByteString -> Annex (L.ByteString, Maybe b))
+	-- ^ Called with the journalled file content when the journalled
+	-- content may be stale or lack information committed to the
+	-- git-annex branch.
+	-> (RawFilePath -> Maybe v)
+	-> (Annex (Maybe (v, RawFilePath, Maybe (L.ByteString, Maybe b))) -> Annex a)
 	-> Annex a
-overJournalFileContents select go = do
+overJournalFileContents handlestale select go = do
 	buf <- liftIO newEmptyMVar
 	go $ overJournalFileContents' buf handlestale select
-  where
-	handlestale _f journalledcontent = return journalledcontent
 
 overJournalFileContents'
 	:: MVar ([RawFilePath], [RawFilePath])
-	-> (RawFilePath -> L.ByteString -> Annex L.ByteString)
+	-> (RawFilePath -> L.ByteString -> Annex (L.ByteString, Maybe b))
 	-> (RawFilePath -> Maybe a)
-	-> Annex (Maybe (a, RawFilePath, Maybe L.ByteString))
+	-> Annex (Maybe (a, RawFilePath, (Maybe (L.ByteString, Maybe b))))
 overJournalFileContents' buf handlestale select =
 	liftIO (tryTakeMVar buf) >>= \case
 		Nothing -> do
@@ -1096,7 +1105,7 @@ overJournalFileContents' buf handlestale select =
 		content <- getJournalFileStale (GetPrivate True) f >>= \case
 			NoJournalledContent -> return Nothing
 			JournalledContent journalledcontent ->
-				return (Just journalledcontent)
+				return (Just (journalledcontent, Nothing))
 			PossiblyStaleJournalledContent journalledcontent ->
 				Just <$> handlestale f journalledcontent
 		return (Just (v, f, content))
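In the hunks above, the Maybe Bool that now accompanies file content records how the content was obtained: Just False when it came through checkjournal (the journal has already been consulted against the branch content), Nothing when it came from the journal alone, and Just True when possibly stale journal content was combined with the branch content by combineStaleJournalWithBranch.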


@@ -22,23 +22,43 @@ import qualified Data.Map.Strict as M
 -
 - The map includes the UUIDs of all known repositories, including
 - repositories that are empty.
+-
+- Note that private repositories, which do not get recorded in
+- the git-annex branch, will have 0 size. journalledRepoSizes
+- takes care of getting repo sizes for those.
 -}
 calcBranchRepoSizes :: Annex (M.Map UUID RepoSize, Sha)
 calcBranchRepoSizes = do
 	knownuuids <- M.keys <$> uuidDescMap
 	let startmap = M.fromList $ map (\u -> (u, RepoSize 0)) knownuuids
-	overLocationLogs True startmap accum >>= \case
+	overLocationLogs True startmap accumsizes >>= \case
 		UnmergedBranches v -> return v
 		NoUnmergedBranches v -> return v
   where
-	addksz ksz (Just (RepoSize sz)) = Just $ RepoSize $ sz + ksz
-	addksz ksz Nothing = Just $ RepoSize ksz
-	accum k locs m = return $
-		let sz = fromMaybe 0 $ fromKey keySize k
-		in foldl' (flip $ M.alter $ addksz sz) m locs
+	accumsizes k locs m = return $
+		foldl' (flip $ M.alter $ addKeyRepoSize k) m locs
 
 {- Given the RepoSizes calculated from the git-annex branch, updates it with
 - data from journalled location logs.
 -}
 journalledRepoSizes :: M.Map UUID RepoSize -> Sha -> Annex (M.Map UUID RepoSize)
-journalledRepoSizes m branchsha = undefined --- XXX
+journalledRepoSizes startmap branchsha =
+	overLocationLogsJournal startmap branchsha accumsizes
+  where
+	accumsizes k (newlocs, removedlocs) m = return $
+		let m' = foldl' (flip $ M.alter $ addKeyRepoSize k) m newlocs
+		in foldl' (flip $ M.alter $ removeKeyRepoSize k) m' removedlocs
+
+addKeyRepoSize :: Key -> Maybe RepoSize -> Maybe RepoSize
+addKeyRepoSize k mrs = case mrs of
+	Just (RepoSize sz) -> Just $ RepoSize $ sz + ksz
+	Nothing -> Just $ RepoSize ksz
+  where
+	ksz = fromMaybe 0 $ fromKey keySize k
+
+removeKeyRepoSize :: Key -> Maybe RepoSize -> Maybe RepoSize
+removeKeyRepoSize k mrs = case mrs of
+	Just (RepoSize sz) -> Just $ RepoSize $ sz - ksz
+	Nothing -> Nothing
+  where
+	ksz = fromMaybe 0 $ fromKey keySize k
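As a quick illustration of the new helpers, the following equalities hold (hypothetical values; k stands for a key whose keySize is 1048576 bytes):

    -- adding, then removing, a 1 MiB key for a repository currently at 10 MiB
    addKeyRepoSize k (Just (RepoSize 10485760))    == Just (RepoSize 11534336)
    removeKeyRepoSize k (Just (RepoSize 11534336)) == Just (RepoSize 10485760)
    -- a key added to a repository with no recorded size yet starts the count
    addKeyRepoSize k Nothing                       == Just (RepoSize 1048576)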


@@ -197,7 +197,7 @@ updateFromLog db@(ImportFeedDbHandle h) (oldtree, currtree)
 		| otherwise = Nothing
 	goscan reader = reader >>= \case
-		Just ((), f, Just content)
+		Just ((), f, Just (content, _))
 			| isUrlLog f -> do
 				knownurls (parseUrlLog content)
 				goscan reader


@@ -599,10 +599,11 @@ limitFullyBalanced mu getgroupmap groupname = Right $ MatchFiles
 		M.lookup g (uuidsByGroup gm)
 	maxsizes <- getMaxSizes
 	-- XXX do not calc this every time!
-	(sizemap, _sha) <- calcBranchRepoSizes
+	(sizemap, sha) <- calcBranchRepoSizes
+	sizemap' <- journalledRepoSizes sizemap sha
 	let keysize = fromMaybe 0 (fromKey keySize key)
 	currentlocs <- S.fromList <$> loggedLocations key
-	let hasspace u = case (M.lookup u maxsizes, M.lookup u sizemap) of
+	let hasspace u = case (M.lookup u maxsizes, M.lookup u sizemap') of
 		(Just (MaxSize maxsize), Just (RepoSize reposize)) ->
 			if u `S.member` currentlocs
 				then reposize <= maxsize
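The practical effect of this hunk is that the fullness check in limitFullyBalanced now runs against repository sizes that include journalled location changes (including those for private repositories), rather than only what has been committed to the git-annex branch.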


@@ -24,7 +24,6 @@ module Logs.Location (
 	loggedPreviousLocations,
 	loggedLocationsHistorical,
 	loggedLocationsRef,
-	parseLoggedLocations,
 	isKnownKey,
 	checkDead,
 	setDead,
@@ -35,6 +34,7 @@ module Logs.Location (
 	loggedKeysFor',
 	overLocationLogs,
 	overLocationLogs',
+	overLocationLogsJournal,
 ) where
 
 import Annex.Common
@@ -222,38 +222,96 @@ loggedKeysFor' u = loggedKeys' isthere
 overLocationLogs
 	:: Bool
 	-> v
-	-> (Key -> [UUID]
-		-> v
-		-> Annex v)
+	-> (Key -> [UUID] -> v -> Annex v)
 	-> Annex (Annex.Branch.UnmergedBranches (v, Sha))
 overLocationLogs ignorejournal v =
 	overLocationLogs' ignorejournal v (flip const)
 
 overLocationLogs'
 	:: Bool
 	-> v
-	-> (Annex (Maybe (Key, RawFilePath, Maybe L.ByteString)) -> Annex v -> Annex v)
+	-> (Annex (Maybe (Key, RawFilePath, Maybe (L.ByteString, Maybe Bool))) -> Annex v -> Annex v)
 	-> (Key -> [UUID] -> v -> Annex v)
 	-> Annex (Annex.Branch.UnmergedBranches (v, Sha))
-overLocationLogs' ignorejournal iv discarder keyaction = do
+overLocationLogs' ignorejournal =
+	overLocationLogsHelper
+		(Annex.Branch.overBranchFileContents ignorejournal)
+		(\locparser _ _ content -> pure (locparser (fst <$> content)))
+		True
+
+type LocChanges =
+	( S.Set UUID
+	-- ^ locations that are in the journal, but not in the
+	-- git-annex branch
+	, S.Set UUID
+	-- ^ locations that are in the git-annex branch,
+	-- but have been removed in the journal
+	)
+
+{- Like overLocationLogs, but only adds changes in journalled files
+ - compared with what was logged in the git-annex branch at the specified
+ - commit sha. -}
+overLocationLogsJournal
+	:: v
+	-> Sha
+	-> (Key -> LocChanges -> v -> Annex v)
+	-> Annex v
+overLocationLogsJournal v branchsha keyaction =
+	overLocationLogsHelper
+		(Annex.Branch.overJournalFileContents handlestale)
+		changedlocs
+		False
+		-- ^ do not precache journalled content, which may be stale
+		v (flip const) keyaction
+  where
+	handlestale _ journalcontent = return (journalcontent, Just True)
+
+	changedlocs locparser _key logf (Just (journalcontent, isstale)) = do
+		branchcontent <- Annex.Branch.getRef branchsha logf
+		let branchlocs = S.fromList $ locparser $ Just branchcontent
+		let journallocs = S.fromList $ locparser $ Just $ case isstale of
+			Just True -> Annex.Branch.combineStaleJournalWithBranch
+				branchcontent journalcontent
+			_ -> journalcontent
+		return
+			( S.difference journallocs branchlocs
+			, S.difference branchlocs journallocs
+			)
+	changedlocs _ _ _ Nothing = pure (S.empty, S.empty)
+
+overLocationLogsHelper
+	:: ( (RawFilePath -> Maybe Key)
+		-> (Annex (Maybe (Key, RawFilePath, Maybe (L.ByteString, Maybe b))) -> Annex v)
+		-> Annex a
+	   )
+	-> ((Maybe L.ByteString -> [UUID]) -> Key -> RawFilePath -> Maybe (L.ByteString, Maybe b) -> Annex u)
+	-> Bool
+	-> v
+	-> (Annex (Maybe (Key, RawFilePath, Maybe (L.ByteString, Maybe b))) -> Annex v -> Annex v)
+	-> (Key -> u -> v -> Annex v)
+	-> Annex a
+overLocationLogsHelper runner locparserrunner canprecache iv discarder keyaction = do
 	config <- Annex.getGitConfig
 	clusters <- getClusters
+	let locparser = maybe [] (parseLoggedLocations clusters)
 	let getk = locationLogFileKey config
 	let go v reader = reader >>= \case
 		Just (k, f, content) -> discarder reader $ do
 			-- precache to make checkDead fast, and also to
 			-- make any accesses done in keyaction fast.
-			maybe noop (Annex.Branch.precache f) content
+			when canprecache $
+				maybe noop (Annex.Branch.precache f . fst) content
 			ifM (checkDead k)
 				( go v reader
 				, do
-					!v' <- keyaction k (maybe [] (parseLoggedLocations clusters) content) v
+					!locs <- locparserrunner locparser k f content
+					!v' <- keyaction k locs v
 					go v' reader
 				)
 		Nothing -> return v
-	Annex.Branch.overBranchFileContents ignorejournal getk (go iv)
+	runner getk (go iv)
 
 -- Cannot import Logs.Cluster due to a cycle.
 -- Annex.clusters gets populated when starting up git-annex.
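To make the LocChanges computation concrete, a small self-contained sketch with strings standing in for UUIDs (illustrative only, not code from this commit):

    import qualified Data.Set as S

    -- the branch location log for a key lists u1 and u2;
    -- the journal for the same key lists u2 and u3
    branchlocs, journallocs :: S.Set String
    branchlocs  = S.fromList ["u1", "u2"]
    journallocs = S.fromList ["u2", "u3"]

    -- LocChanges pairs (locations added in the journal, locations removed in the journal)
    locchanges :: (S.Set String, S.Set String)
    locchanges =
        ( S.difference journallocs branchlocs  -- fromList ["u3"]
        , S.difference branchlocs journallocs  -- fromList ["u1"]
        )

journalledRepoSizes then adds the key's size for each location in the first set and subtracts it for each location in the second.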


@@ -6,7 +6,7 @@
 - A line of the log will look like: "date N INFO"
 - Where N=1 when the INFO is present, 0 otherwise.
 -
-- Copyright 2010-2022 Joey Hess <id@joeyh.name>
+- Copyright 2010-2024 Joey Hess <id@joeyh.name>
 -
 - Licensed under the GNU AGPL version 3 or higher.
 -}
@@ -20,6 +20,7 @@ module Logs.Presence (
 	presentLogInfo,
 	notPresentLogInfo,
 	historicalLogInfo,
+	parseLogInfo,
 ) where
 
 import Logs.Presence.Pure as X
@@ -28,6 +29,8 @@ import Annex.VectorClock
 import qualified Annex.Branch
 import Git.Types (RefDate)
+import qualified Data.ByteString.Lazy as L
 
 {- Adds to the log, removing any LogLines that are obsoleted. -}
 addLog :: Annex.Branch.RegardingUUID -> RawFilePath -> LogStatus -> LogInfo -> Annex ()
 addLog ru file logstatus loginfo =
@@ -82,5 +85,8 @@ notPresentLogInfo file = map info . filterNotPresent <$> readLog file
 - The date is formatted as shown in gitrevisions man page.
 -}
 historicalLogInfo :: RefDate -> RawFilePath -> Annex [LogInfo]
-historicalLogInfo refdate file = map info . filterPresent . parseLog
+historicalLogInfo refdate file = parseLogInfo
 	<$> Annex.Branch.getHistorical refdate file
+
+parseLogInfo :: L.ByteString -> [LogInfo]
+parseLogInfo = map info . filterPresent . parseLog


@@ -51,48 +51,36 @@ Planned schedule of work:
 
 * `git-annex info` can use maxsize to display how full repositories are
 
-* overBranchFileContents can improve its handling of journalled files
-  by first going over the branch, and then at the end, feeding
-  the journalled filenames into catObjectStream (run on the same branch
-  sha) to check if the file was in the branch. Only pass the journalled
-  file to the callback when it was not. This will avoid innaccuracies
-  in calcRepoSizes and git-annex info.
-
-  calcRepoSizes currently skips log files in private journals,
-  when they are for a key that does not appear in the git-annex branch.
-  It needs to include those.
-
 * Implement [[track_free_space_in_repos_via_git-annex_branch]]:
   * Goal is for limitFullyBalanced not to need to calcRepoSizes.
-  * Load Annex.reposizes from Database.RepoSizes on demand.
+  * Add git-annex branch sha to Database.RepoSizes.
   * When Annex.reposizes does not list the size of a UUID, and
    that UUID's size is needed eg for balanced preferred
    content, use calcRepoSizes and store in
    Database.RepoSizes.
+  * Load Annex.reposizes from Database.RepoSizes on demand,
+    supplimenting with journalledRepoSizes.
   * Update Annex.reposizes in Logs.Location.logChange,
    when it makes a change and when Annex.reposizes has a size
    for the UUID. So Annex.reposizes is kept up-to-date
    for each transfer and drop.
-  * Update Database.RepoSizes during merge of git-annex branch.
-    (Also update Annex.reposizes)
-  * On commit of git-annex branch, update Database.RepoSize to reflect
-    the size changes in the commit.
-    Probably cannot use Annex.reposizes for the values, since they must
-    match the sizes in the location log files being committed. Note
-    that other processes may journal location log changes, which will be
-    part of the commit. So need to read all the changed location logs,
-    and update Database.RepoSize accordingly.
-    Also private journals complicate this.
-    (Annex.reposizes can be updated to the resulting values.)
+  * When calling journalledRepoSizes make sure that the current
+    process is prevented from making changes to the journal in another
+    thread. Probably lock the journal? (No need to worry about changes made
+    by other processes; Annex.reposizes does not need to be kept current
+    with what other processes might be doing.)
+  * Update Database.RepoSizes incrementally during merge of
+    git-annex branch, and after commit of git-annex branch.
+    (Annex.reposizes can be updated to the resulting values as well.)
  * Perhaps: setRepoSize to 0 when initializing a new repo or a
    new special remote (but not when reinitializing),