cache remotes via annex-speculate-present
Added remote.name.annex-speculate-present config that can be used to make cache remotes.

Implemented it in Remote.keyPossibilities, which is used by the get/move/copy/mirror commands, and nothing else. This way, things like whereis will not show content that's speculatively present.

The assistant and sync --content were not using Remote.keyPossibilities, and were changed to use it.

The efficiency hit should be small; Remote.keyPossibilities is only used before transferring a file, which is the expensive operation. And, it's only doing one lookup of the remoteList and a very cheap filter over it.

Note that, git-annex still updates the location log when copying content to a remote with annex-speculate-present set. In this case, the location tracking will indicate that content is present in the remote. This may not be wanted for caches, or may not be a real problem for them. TBD.

This commit was supported by the NSF-funded DataLad project.
parent 2884637cab
commit fd5a392006
7 changed files with 114 additions and 4 deletions
@@ -92,7 +92,7 @@ queueTransfersMatching matching reason schedule k f direction
 				filter (\r -> not (inset s r || Remote.readonly r))
 					(syncDataRemotes st)
 	  where
-		locs = S.fromList <$> Remote.keyLocations k
+		locs = S.fromList . map Remote.uuid <$> Remote.keyPossibilities k
 		inset s r = S.member (Remote.uuid r) s
 	gentransfer r = Transfer
 		{ transferDirection = direction
@@ -3,6 +3,8 @@ git-annex (6.20180720) UNRELEASED; urgency=medium
   * S3: Support credential-less download from remotes configured
     with public=yes exporttree=yes.
   * Fix reversion in display of http 404 errors.
+  * Added remote.name.annex-speculate-present config that can be used to
+    make cache remotes.
 
  -- Joey Hess <id@joeyh.name> Tue, 31 Jul 2018 12:14:11 -0400
 
@@ -616,7 +616,7 @@ seekSyncContent o rs = do
  -}
 syncFile :: Either (Maybe (Bloom Key)) (Key -> Annex ()) -> [Remote] -> AssociatedFile -> Key -> Annex Bool
 syncFile ebloom rs af k = onlyActionOn' k $ do
-	locs <- Remote.keyLocations k
+	locs <- map Remote.uuid <$> Remote.keyPossibilities k
 	let (have, lack) = partition (\r -> Remote.uuid r `elem` locs) rs
 
 	got <- anyM id =<< handleget have
12
Remote.hs
@@ -1,6 +1,6 @@
 {- git-annex remotes
  -
- - Copyright 2011 Joey Hess <id@joeyh.name>
+ - Copyright 2011-2018 Joey Hess <id@joeyh.name>
  -
  - Licensed under the GNU GPL version 3 or higher.
  -}
@@ -278,13 +278,21 @@ keyLocations key = trustExclude DeadTrusted =<< loggedLocations key
 
 {- Cost ordered lists of remotes that the location log indicates
  - may have a key.
+ -
+ - Also includes remotes with remoteAnnexSpeculatePresent set.
  -}
 keyPossibilities :: Key -> Annex [Remote]
 keyPossibilities key = do
 	u <- getUUID
 	-- uuids of all remotes that are recorded to have the key
 	locations <- filter (/= u) <$> keyLocations key
-	fst <$> remoteLocations locations []
+	speclocations <- map uuid
+		. filter (remoteAnnexSpeculatePresent . gitconfig)
+		<$> remoteList
+	-- there are unlikely to be many speclocations, so building a Set
+	-- is not worth the expense
+	let locations' = speclocations ++ filter (`notElem` speclocations) locations
+	fst <$> remoteLocations locations' []
 
 {- Given a list of locations of a key, and a list of all
  - trusted repositories, generates a cost-ordered list of
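The list handling in keyPossibilities can be tried in isolation with a small sketch. It is not git-annex's code: `Remote`, `LocationLog`, `loggedRemotes` and `possibleRemotes` are hypothetical stand-ins, UUIDs are plain strings, and cost ordering is modelled with a simple sort. It shows why a speculative remote is offered for transfers while a query that reads only the location log never lists it.

```haskell
import Data.List (sortOn)

-- Simplified stand-ins for git-annex's types. Every name here is
-- hypothetical and exists only for this illustration.
type UUID = String

data Remote = Remote
    { remoteName :: String
    , remoteUUID :: UUID
    , remoteCost :: Int
    , speculatePresent :: Bool -- models remote.<name>.annex-speculate-present
    } deriving (Show, Eq)

-- UUIDs that the location log records as holding a key.
type LocationLog = [UUID]

-- What a whereis-style query would consult: the location log only.
loggedRemotes :: LocationLog -> [Remote] -> [Remote]
loggedRemotes logged = filter ((`elem` logged) . remoteUUID)

-- What a transfer would consult: speculative remotes plus logged
-- locations (without duplicates, excluding the local repository),
-- ordered by cost.
possibleRemotes :: UUID -> LocationLog -> [Remote] -> [Remote]
possibleRemotes self logged remotes = sortOn remoteCost candidates
  where
    locations = filter (/= self) logged
    speclocations = map remoteUUID (filter speculatePresent remotes)
    locations' = speclocations ++ filter (`notElem` speclocations) locations
    candidates = filter ((`elem` locations') . remoteUUID) remotes

main :: IO ()
main = do
    let remotes =
            [ Remote "origin" "u-origin" 200 False
            , Remote "cache" "u-cache" 10 True
            , Remote "backup" "u-backup" 150 False
            ]
        logged = ["u-self", "u-origin"]
    -- The cache is absent from the log, so it is not reported as a location.
    print (map remoteName (loggedRemotes logged remotes))
    -- It is still a transfer candidate, and sorts first due to its low cost.
    print (map remoteName (possibleRemotes "u-self" logged remotes))
```

The `notElem` filter mirrors the choice noted in the source comment: with only a handful of speculative remotes, a linear scan is cheaper than building a Set.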
@@ -226,6 +226,7 @@ data RemoteGitConfig = RemoteGitConfig
 	, remoteAnnexStartCommand :: Maybe String
 	, remoteAnnexStopCommand :: Maybe String
 	, remoteAnnexAvailability :: Maybe Availability
+	, remoteAnnexSpeculatePresent :: Bool
 	, remoteAnnexBare :: Maybe Bool
 	, remoteAnnexRetry :: Maybe Integer
 	, remoteAnnexRetryDelay :: Maybe Seconds
@@ -281,6 +282,7 @@ extractRemoteGitConfig r remotename = do
 	, remoteAnnexStartCommand = notempty $ getmaybe "start-command"
 	, remoteAnnexStopCommand = notempty $ getmaybe "stop-command"
 	, remoteAnnexAvailability = getmayberead "availability"
+	, remoteAnnexSpeculatePresent = getbool "speculate-present" False
 	, remoteAnnexBare = getmaybebool "bare"
 	, remoteAnnexRetry = getmayberead "retry"
 	, remoteAnnexRetryDelay = Seconds
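Because the new field is read with a default of False, remotes that never set the option keep their previous behaviour. A minimal sketch of that kind of defaulted boolean lookup follows. `GitConfig`, `parseBool` and `remoteBool` are hypothetical names, the configuration is a plain association list, and the real lookup in git-annex may consult more than this one key.

```haskell
import Data.Char (toLower)
import Data.Maybe (fromMaybe)

-- A flat key/value view of git config. This is a hypothetical stand-in,
-- not how git-annex represents parsed configuration.
type GitConfig = [(String, String)]

-- Accepts the common git boolean spellings; anything else is rejected.
parseBool :: String -> Maybe Bool
parseBool s = case map toLower s of
    v | v `elem` ["true", "yes", "on", "1"] -> Just True
      | v `elem` ["false", "no", "off", "0"] -> Just False
      | otherwise -> Nothing

-- Looks up remote.<remotename>.annex-<setting>, falling back to the
-- default when the key is unset or its value is not a boolean.
remoteBool :: GitConfig -> String -> String -> Bool -> Bool
remoteBool cfg remotename setting def =
    fromMaybe def (lookup key cfg >>= parseBool)
  where
    key = "remote." ++ remotename ++ ".annex-" ++ setting

main :: IO ()
main = do
    let cfg = [("remote.cache.annex-speculate-present", "true")]
    -- Set explicitly for the cache remote.
    print (remoteBool cfg "cache" "speculate-present" False)
    -- Unset for origin, so the default applies.
    print (remoteBool cfg "origin" "speculate-present" False)
```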
@@ -1283,6 +1283,13 @@ Here are all the supported configuration settings.
   Can be used to tell git-annex whether a remote is LocallyAvailable
   or GloballyAvailable. Normally, git-annex determines this automatically.
 
+* `remote.<name>.annex-speculate-present`
+
+  Make git-annex speculate that this remote may contain the content of any
+  file, even though its normal location tracking does not indicate that it
+  does. This will cause git-annex to try to get all file contents from the
+  remote. Can be useful in setting up a caching remote.
+
 * `remote.<name>.annex-bare`
 
   Can be used to tell git-annex if a remote is a bare repository
91
doc/tips/local_caching_of_annexed_files.mdwn
Normal file
@@ -0,0 +1,91 @@
+Here's how to set up a local cache of annexed files, that can be used
+to avoid repeated downloads.
+
+An example use case: Your CI system is operating on a git-annex repository,
+so every time it runs it makes a fresh clone of the repository and uses
+`git-annex get` to download a lot of data into it.
+
+We'll create a cache repository, set it as a remote of the other git-annex
+repositories, and configure git-annex to check the cache first before other
+more expensive ways of retrieving content. The cache can be cleaned out
+whenever you like with simple unix commands.
+
+Some other nice properties -- When used on a system like BTRFS with COW
+support, content from the cache can populate multiple other repositories
+without using any additional disk space. And, git-annex repositories that
+are otherwise unrelated can share use of the cache if they happen to
+contain a common file.
+
+You'll need git-annex 6.20180802 or newer to follow these instructions.
+
+## creating the cache
+
+First let's create a new, empty git-annex repository. It will be put in
+~/.annex-cache in the example, but for best results, put it in the same
+filesystem as your other git-annex repositories.
+
+	git init ~/.annex-cache
+	cd ~/.annex-cache
+	git annex init
+	git config annex.hardlink true
+	git annex untrust here
+
+The cache does not need to be a git annex repository; any kind of special
+remote can be used as a cache too. But, using a git repository lets
+annex.hardlink be used to make hard links between the cache and
+repositories using it.
+
+The cache is made untrusted, because its contents can be cleaned at any
+time; other repositories should not trust it to retain content.
+
+## making repositories use the cache
+
+Now in each git-annex repository that you want to use the cache, add it as
+a remote, and configure it as follows:
+
+	cd my-repository
+	git remote add cache ~/.annex-cache
+	git config remote.cache.annex-speculate-present true
+	git config remote.cache.annex-cost 10
+	git config remote.cache.annex-pull false
+	git config remote.cache.annex-push false
+
+The annex-speculate-present setting is the essential part. It makes
+git-annex know that the cache repository may contain the content of any
+annexed file. So, when getting a file, git-annex will try the cache
+repository first.
+
+The low annex-cost makes git-annex try to get content from the cache remote
+before any other remotes.
+
+The annex-pull and annex-push settings prevent `git-annex sync` from
+pulling and pushing to the remote. The cache repository will remain an
+empty git repository (except for the content of annexed files). This means
+that the same cache can be used with multiple different git-annex
+repositories, without intermingling their git data. You should also avoid
+manual `git pull` and `git push` to the cache remote.
+
+## populating the cache
+
+For the cache to be used, you need to get file contents into it somehow.
+A simple way to do that is, in a git-annex repository that already
+contains the content of files:
+
+	git annex copy --to cache
+
+You could run that anytime after you get content. There are also ways to
+automate it, but getting some files into the cache manually is a good
+enough start.
+
+## cleaning the cache
+
+XXX find
+
+## automatically populating the cache
+
+XXX
+
+## more caches
+
+The example above used a local cache on the same system. However, it's also
+possible to have a cache repository shared among computers on a LAN.