implement getRepoSizes

At this point the RepoSize database is getting populated, and it
all seems to be working correctly. Incremental updates still need to be
done to make it performant.
This commit is contained in:
Joey Hess 2024-08-15 12:31:56 -04:00
parent bba23e7cc9
commit c200523bac
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
3 changed files with 60 additions and 49 deletions

View file

@ -5,17 +5,52 @@
- Licensed under the GNU AGPL version 3 or higher.
-}
module Annex.RepoSize where
module Annex.RepoSize (
getRepoSizes,
) where
import Annex.Common
import Annex.Branch (UnmergedBranches(..))
import qualified Annex
import Annex.Branch (UnmergedBranches(..), getBranch)
import Types.RepoSize
import qualified Database.RepoSize as Db
import Logs.Location
import Logs.UUID
import Git.Types (Sha)
import qualified Data.Map.Strict as M
{- Gets the repo size map. Cached for speed. -}
getRepoSizes :: Annex (M.Map UUID RepoSize)
getRepoSizes = maybe updateRepoSizes return =<< Annex.getState Annex.reposizes
{- Updates Annex.reposizes with current information from the git-annex
- branch, supplimented with journalled but not yet committed information.
-}
updateRepoSizes :: Annex (M.Map UUID RepoSize)
updateRepoSizes = bracket Db.openDb Db.closeDb $ \h -> do
(oldsizemap, moldbranchsha) <- liftIO $ Db.getRepoSizes h
case moldbranchsha of
Nothing -> calculatefromscratch h >>= set
Just oldbranchsha -> do
currbranchsha <- getBranch
if oldbranchsha == currbranchsha
then journalledRepoSizes oldsizemap oldbranchsha
>>= set
else do
-- XXX todo incremental update by diffing
-- from old to new branch.
calculatefromscratch h >>= set
where
calculatefromscratch h = do
(sizemap, branchsha) <- calcBranchRepoSizes
liftIO $ Db.setRepoSizes h sizemap branchsha
journalledRepoSizes sizemap branchsha
set sizemap = do
Annex.changeState $ \st -> st
{ Annex.reposizes = Just sizemap }
return sizemap
{- Sum up the sizes of all keys in all repositories, from the information
- in the git-annex branch, but not the journal. Retuns the sha of the
- branch commit that was used.

View file

@ -598,12 +598,10 @@ limitFullyBalanced mu getgroupmap groupname = Right $ MatchFiles
let groupmembers = fromMaybe S.empty $
M.lookup g (uuidsByGroup gm)
maxsizes <- getMaxSizes
-- XXX do not calc this every time!
(sizemap, sha) <- calcBranchRepoSizes
sizemap' <- journalledRepoSizes sizemap sha
sizemap <- getRepoSizes
let keysize = fromMaybe 0 (fromKey keySize key)
currentlocs <- S.fromList <$> loggedLocations key
let hasspace u = case (M.lookup u maxsizes, M.lookup u sizemap') of
let hasspace u = case (M.lookup u maxsizes, M.lookup u sizemap) of
(Just (MaxSize maxsize), Just (RepoSize reposize)) ->
if u `S.member` currentlocs
then reposize <= maxsize

View file

@ -30,6 +30,23 @@ Planned schedule of work:
## work notes
* Implement [[track_free_space_in_repos_via_git-annex_branch]]:
* Update Annex.reposizes in Logs.Location.logChange,
when it makes a change and when Annex.reposizes has a size
for the UUID. So Annex.reposizes is kept up-to-date
for each transfer and drop.
* When calling journalledRepoSizes make sure that the current
process is prevented from making changes to the journal in another
thread. Probably lock the journal? (No need to worry about changes made
by other processes; Annex.reposizes does not need to be kept current
with what other processes might be doing.)
* updateRepoSizes incrementally when the git-annex branch sha in the
database is older than the current git-annex branch. Diff from old to
new branch to efficiently update.
* `fullybalanced=foo:2` can get stuck in suboptimal situations. Eg,
when 2 out of 3 repositories are full, and the 3rd is mostly empty,
it is no longer possible to add new files to 2 repositories.
@ -45,52 +62,13 @@ Planned schedule of work:
Also note that "fullybalanced=foo:2" is not currently actually
implemented!
* implement size-based balancing, so all balanced repositories are around
the same percent full, either as the default or as another preferred
content expression.
* Make `git-annex info` use Annex.reposizes.
* `git-annex info` can use maxsize to display how full repositories are
* Implement [[track_free_space_in_repos_via_git-annex_branch]]:
* Goal is for limitFullyBalanced not to need to calcRepoSizes.
* When Annex.reposizes does not list the size of a UUID, and
that UUID's size is needed eg for balanced preferred
content, use calcRepoSizes and store in
Database.RepoSizes.
* Load Annex.reposizes from Database.RepoSizes on demand,
supplimenting with journalledRepoSizes.
* Update Annex.reposizes in Logs.Location.logChange,
when it makes a change and when Annex.reposizes has a size
for the UUID. So Annex.reposizes is kept up-to-date
for each transfer and drop.
* When calling journalledRepoSizes make sure that the current
process is prevented from making changes to the journal in another
thread. Probably lock the journal? (No need to worry about changes made
by other processes; Annex.reposizes does not need to be kept current
with what other processes might be doing.)
* Update Database.RepoSizes incrementally during merge of
git-annex branch, and after commit of git-annex branch.
(Also update Annex.reposizes)
(Annex.reposizes can be updated to the resulting values as well.)
* Perhaps: setRepoSize to 0 when initializing a new repo or a
new special remote (but not when reinitializing),
and also update Annex.reposizes for that uuid.
Whether it makes sense to do this will depend on how expensive
it is to update Database.RepoSize on git-annex branch merge and commit.
If it is not expensive, will want to track reposizes from the beginning
whenever possible, to avoid a later expensive read of the git-annex
branch to calculate the reposizes.
* Make `git-annex info` use Annex.reposizes.
* implement size-based balancing, so all balanced repositories are around
the same percent full, either as the default or as another preferred
content expression.
## completed items for August's work on balanced preferred content