only delete bundles on pushEmpty

This avoids some apparently otherwise unsolvable problems involving
races that resulted in the manifest listing bundles that were deleted.

Removed the annex-max-git-bundles config because it can't actually
result in deleting old bundles. It would still be possible to have a
config that controls how often to do a full push, which would avoid
needing to download too many bundles on clone, as well as needing to
checkpresent too many bundles in verifyManifest. But it would need a
different name and description.
Joey Hess 2024-05-21 10:41:48 -04:00
parent f544946b09
commit 3e7324bbcb
GPG key ID: DB12DB0FF05F8F38
6 changed files with 53 additions and 103 deletions

View file

@@ -272,9 +272,8 @@ fullPush' :: Manifest -> State -> Remote -> [Ref] -> Annex (Bool, State)
fullPush' oldmanifest st rmt refs = do
let bs = map Git.Bundle.fullBundleSpec refs
bundlekey <- generateAndUploadGitBundle rmt bs oldmanifest
oldmanifest' <- dropOldKeys rmt oldmanifest (/= bundlekey)
let manifest = mkManifest [bundlekey]
(inManifest oldmanifest ++ outManifest oldmanifest')
(inManifest oldmanifest ++ outManifest oldmanifest)
uploadManifest rmt manifest
return (True, st { manifestCache = Nothing })
@@ -291,17 +290,11 @@ guardPush st a = catchNonAsync a $ \ex -> do
incrementalPush :: State -> Remote -> M.Map Ref Sha -> M.Map Ref Sha -> Annex (Bool, State)
incrementalPush st rmt oldtrackingrefs newtrackingrefs = guardPush st $ do
oldmanifest <- maybe (downloadManifestWhenPresent rmt) pure (manifestCache st)
if length (inManifest oldmanifest) + 1 > remoteAnnexMaxGitBundles (Remote.gitconfig rmt)
then fullPush' oldmanifest st rmt (M.keys newtrackingrefs)
else go oldmanifest
bs <- calc [] (M.toList newtrackingrefs)
bundlekey <- generateAndUploadGitBundle rmt bs oldmanifest
uploadManifest rmt (oldmanifest <> mkManifest [bundlekey] [])
return (True, st { manifestCache = Nothing })
where
go oldmanifest = do
bs <- calc [] (M.toList newtrackingrefs)
bundlekey <- generateAndUploadGitBundle rmt bs oldmanifest
oldmanifest' <- dropOldKeys rmt oldmanifest (/= bundlekey)
uploadManifest rmt (oldmanifest' <> mkManifest [bundlekey] [])
return (True, st { manifestCache = Nothing })
calc c [] = return (reverse c)
calc c ((ref, sha):refs) = case M.lookup ref oldtrackingrefs of
Just oldsha
@@ -610,7 +603,7 @@ downloadManifest rmt = get mkmain >>= maybe (get mkbak) (pure . Just)
_ <- dl tmp
b <- liftIO (B.readFile tmp)
case parseManifest b of
Right m -> return (Just m)
Right m -> Just <$> verifyManifest rmt m
Left err -> giveup err
getexport _ [] = return Nothing
@@ -634,13 +627,6 @@ downloadManifest rmt = get mkmain >>= maybe (get mkbak) (pure . Just)
-- So this may be interrupted and leave the manifest key not present.
-- To deal with that, there is a backup manifest key. This takes care
-- to ensure that one of the two keys will always exist.
--
-- Once the manifest has been uploaded, attempts to drop all outManifest
-- keys. A failure to drop does not cause an error to be thrown, because
-- the push has already succeeded. Avoids re-uploading the manifest with
-- the dropped keys removed from outManifest, because dropping the keys
-- takes some time and another push may have already overwritten
-- the manifest in the meantime.
uploadManifest :: Remote -> Manifest -> Annex ()
uploadManifest rmt manifest = do
ok <- ifM (Remote.checkPresent rmt mkbak)
@@ -650,9 +636,8 @@ uploadManifest rmt manifest = do
-- This ensures that at no point are both deleted.
, put mkbak <&&> dropandput mkmain
)
if ok
then void $ dropOldKeys rmt manifest (const True)
else uploadfailed
unless ok
uploadfailed
where
mkmain = genManifestKey (Remote.uuid rmt)
mkbak = genBackupManifestKey (Remote.uuid rmt)
@@ -696,6 +681,17 @@ dropOldKeys rmt manifest p =
mkManifest (inManifest manifest)
<$> filterM (dropKey rmt) (filter p (outManifest manifest))
-- When pushEmpty raced with another push, it could result in the manifest
-- listing bundles that it deleted. Such a manifest has to be treated the
-- same as an empty manifest. To detect that, this checks that all the
-- bundles listed in the manifest still exist on the remote.
verifyManifest :: Remote -> Manifest -> Annex Manifest
verifyManifest rmt manifest =
ifM (allM (checkPresentGitBundle rmt) (inManifest manifest))
( return manifest
, return $ mkManifest [] (inManifest manifest <> outManifest manifest)
)
-- Downloads a git bundle to the annex objects directory, unless
-- the object file is already present. Returns the filename of the object
-- file.
@@ -732,6 +728,16 @@ downloadGitBundle rmt k = getKeyExportLocations rmt k >>= \case
rsp = Remote.retrievalSecurityPolicy rmt
vc = Remote.RemoteVerify rmt
-- Checks if a bundle is present. Throws errors if the remote cannot be
-- accessed.
checkPresentGitBundle :: Remote -> Key -> Annex Bool
checkPresentGitBundle rmt k =
getKeyExportLocations rmt k >>= \case
Nothing -> Remote.checkPresent rmt k
Just locs -> anyM checkexport locs
where
checkexport = Remote.checkPresentExport (Remote.exportActions rmt) k
-- Uploads a bundle or manifest object from the annex objects directory
-- to the remote.
--

View file

@@ -373,7 +373,6 @@ data RemoteGitConfig = RemoteGitConfig
, remoteAnnexBwLimitDownload :: Maybe BwRate
, remoteAnnexAllowUnverifiedDownloads :: Bool
, remoteAnnexConfigUUID :: Maybe UUID
, remoteAnnexMaxGitBundles :: Int
, remoteAnnexAllowEncryptedGitRepo :: Bool
{- These settings are specific to particular types of remotes
@@ -479,8 +478,6 @@ extractRemoteGitConfig r remotename = do
, remoteAnnexDdarRepo = getmaybe "ddarrepo"
, remoteAnnexHookType = notempty $ getmaybe "hooktype"
, remoteAnnexExternalType = notempty $ getmaybe "externaltype"
, remoteAnnexMaxGitBundles =
fromMaybe 100 (getmayberead "max-git-bundles")
, remoteAnnexAllowEncryptedGitRepo =
getbool "allow-encrypted-gitrepo" False
}

View file

@@ -1648,17 +1648,6 @@ Remotes are configured using these settings in `.git/config`.
remotes, and is set when using [[git-annex-initremote]](1) with the
`--private` option.
* `remote.<name>.annex-max-git-bundles`, `annex.max-git-bundles`
When using [[git-remote-annex]] to store a git repository in a special
remote, this configures how many separate git bundle objects to store
in the special remote before re-uploading a single git bundle that contains
the entire git repository.
The default is 100, which aims to avoid needing to re-upload too often,
while preventing a new clone from needing to download too many objects. Set to
0 to disable re-uploading.
* `remote.<name>.annex-allow-encrypted-gitrepo`
Setting this to true allows using [[git-remote-annex]] to push the git

View file

@@ -38,23 +38,29 @@ variables. Some special remotes may also need environment variables to be
set when pulling or pushing.
The git repository is stored in the special remote using special annex objects
with names starting with "GITMANIFEST--" and "GITBUNDLE--". For details about
with names starting with "GITMANIFEST" and "GITBUNDLE". For details about
how the git repository is stored, see
<https://git-annex.branchable.com/internals/git-remote-annex/>
Pushes to a special remote are usually done incrementally. However,
sometimes the whole git repository (but not the annex) needs to be
re-uploaded. That is done when force pushing a ref, or deleting a
ref from the remote. It's also done when too many git bundles
accumulate in the special remote, as configured by the
`remote.<name>.annex-max-git-bundles` git config.
ref from the remote.
The special remote accumulates one GITBUNDLE object per push, and old
objects are usually not deleted. This means that refs pushed to the special
remote can still be accessed even after deleting or overwriting them.
A push that deletes every ref from the special remote does delete all
the accumulated GITBUNDLE objects. But of course, making such a push
means that someone who clones from the special remote at that point in
time will see an empty remote.
Like any git repository, a git repository stored on a special remote can
have conflicting things pushed to it from different places. This mostly
works the same as any other git repository, eg a push that overwrites other
work will be prevented unless forced. However, it is possible, when
conflicting pushes are being done at the same time, for one of the pushes
to be overwritten by the other one. In this sitiation, the push will appear
to be overwritten by the other one. In this situation, the push will appear
to have succeeded, but pulling later will show the true situation.
# SEE ALSO

View file

@@ -43,6 +43,13 @@ stored in such a special remote, this procedure will work:
(Note that later bundles can update refs from the versions in previous
bundles.)
When the special remote is encrypted, the GITMANIFEST and GITBUNDLE will
also be encrypted. To decrypt those manually, see this
[[fairly simple shell script using standard tools|tips/Decrypting_files_in_special_remotes_without_git-annex]].
Note that, if a GITBUNDLE listed in the GITMANIFEST turns out not to exist,
a clone should treat this the same as if the GITMANIFEST were empty.
Bundle objects are deleted when a push is made to the remote that
deletes all refs from it, and in a race between such a push and another
push of some refs, it is possible for the GITMANIFEST to refer to deleted
bundles.
When the special remote is encrypted, both the names and content of
the GITMANIFEST and GITBUNDLE will also be encrypted. To
decrypt those manually, see this [[fairly simple shell script using standard tools|tips/Decrypting_files_in_special_remotes_without_git-annex]].

View file

@@ -12,6 +12,9 @@ This is implemented and working. Remaining todo list for it:
* Test incremental push edge cases involving checkprereq.
* Cloning a special remote with an empty manifest results in a repo where
git fetch fails, claiming the special remote is encrypted, when it's not.
* Cloning from an annex:: url with importtree=yes doesn't work
(with or without exporttree=yes). This is because the ContentIdentifier
db is not populated. It should be possible to work around this.
@@ -60,61 +63,3 @@ This is implemented and working. Remaining todo list for it:
They would have to use git fetch --prune to notice the deletion.
Once the user does notice, they can re-push their ref to recover.
Can this be improved?
## bundle deletion problems
Deleting bundles results in some problems involving races,
detailed below, that result in the manifest file listing a bundle that has
been deleted. Which breaks cloning, and is data loss, and so *must*
be solved before release.
* A race between an incremental push and a full push can result in
a bundle that the incremental push is based on being deleted by the full
push, and then incremental push's manifest file being written later.
Which will prevent cloning or some pulls from working.
A fix: Make each full push (and emptying push) also write to a fallback
manifest file that is only written by full pushes (and emptying pushes),
not incremental pushes. When fetching the main manifest file, always
check that all bundles mentioned in it are still in the remote. If any
are missing, fetch and use the fallback manifest file instead.
* A race between two full pushes can also result in the manifest file listing
a bundle that has been deleted:
Start with a full push of bundle A.
Then there are 2 racing full pushes X and Y, of bundle A and B
respectively. With this series of operations:
1. Y: write bundle B
2. Y: read manifest (listing A)
3. Y: write B to manifest
4. X: write bundle A
5. Y: delete bundle A
6. X: read manifest (listing B)
7. X: write A to manifest
8. X: delete bundle B
Which results in a manifest that lists A, but that bundle was deleted.
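The interleaving above can be checked with a tiny pure simulation (a sketch
with made-up names, not git-annex code): modeling the remote as a manifest
plus a set of stored bundle objects, and applying the eight steps in order,
leaves a manifest listing bundle A while no bundle object remains.

```haskell
import qualified Data.Map.Strict as M

-- Toy model of the remote: the manifest is a list of bundle names,
-- and bundles is the set of bundle objects actually stored.
data Remote = Remote { manifest :: [String], bundles :: M.Map String () }
  deriving Show

writeBundle :: String -> Remote -> Remote
writeBundle b r = r { bundles = M.insert b () (bundles r) }

deleteBundle :: String -> Remote -> Remote
deleteBundle b r = r { bundles = M.delete b (bundles r) }

writeManifest :: [String] -> Remote -> Remote
writeManifest m r = r { manifest = m }

-- The race from the text: starting from a full push of bundle A,
-- pushes Y (bundle B) and X (bundle A) interleave their operations.
raced :: Remote
raced = foldl (flip ($)) start steps
  where
    start = Remote ["A"] (M.fromList [("A", ())])
    steps =
      [ writeBundle "B"      -- Y: write bundle B
      , writeManifest ["B"]  -- Y: read manifest (listing A), write B to it
      , writeBundle "A"      -- X: write bundle A
      , deleteBundle "A"     -- Y: delete bundle A
      , writeManifest ["A"]  -- X: read manifest (listing B), write A to it
      , deleteBundle "B"     -- X: delete bundle B
      ]

main :: IO ()
main = print (manifest raced, M.keys (bundles raced))
```

Running it shows a manifest of `["A"]` alongside an empty set of bundle
objects, confirming the data-loss window.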
The problems above *could* be solved by not deleting bundles, but that is
unsatisfactory.
Old bundles could be deleted after some period of time. But a process can
be suspended at any point and resumed later, so the race windows can be
arbitrarily wide.
What if only emptying pushes delete bundles? If a manifest file refers to a
bundle that has been deleted, that can be treated the same as if the
manifest file was empty, because we know that, for that bundle to have been
deleted, there must have been an emptying push. So this would work.
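The detection rule this relies on can be sketched as a pure function
(hypothetical names, not the actual verifyManifest): given a predicate
saying which bundle objects are still present on the remote, a manifest
listing any missing bundle is treated as empty, since only an emptying
push could have deleted the bundle.

```haskell
-- If every listed bundle is still present, the manifest is trusted;
-- otherwise an emptying push must have happened, so treat it as empty.
effectiveManifest :: (String -> Bool) -> [String] -> [String]
effectiveManifest present listed
    | all present listed = listed
    | otherwise          = []

main :: IO ()
main = print (effectiveManifest (`elem` ["B"]) ["A", "B"])
```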
It is kind of a cop-out, because it requires the user to do an emptying
push from time to time. But by doing that, the user will expect that
someone who pulls at that point gets an empty repository.
Note that a race between an emptying push and a ref push will result in the
emptying push winning, so the ref push is lost. This is the same behavior
as can happen in a push race not involving deletion though, and any
improvements that are made to the UI around that will also help with this.