git-annex/doc/tips/local_caching_of_annexed_files.mdwn
Joey Hess fd5a392006
cache remotes via annex-speculate-present
Added remote.name.annex-speculate-present config that can be used to
make cache remotes.

Implemented it in Remote.keyPossibilities, which is used by the
get/move/copy/mirror commands, and nothing else. This way, things like
whereis will not show content that's speculatively present.

The assistant and sync --content were not using Remote.keyPossibilities,
and were changed to use it.

The efficiency hit should be small; Remote.keyPossibilities is only
used before transferring a file, which is the expensive operation.
And, it's only doing one lookup of the remoteList and a very cheap
filter over it.

Note that, git-annex still updates the location log when copying content
to a remote with annex-speculate-present set. In this case, the location
tracking will indicate that content is present in the remote. This may
not be wanted for caches, or may not be a real problem for them. TBD.

This commit was supported by the NSF-funded DataLad project.
2018-08-01 14:28:05 -04:00


Here's how to set up a local cache of annexed files that can be used
to avoid repeated downloads.

An example use case: Your CI system is operating on a git-annex repository,
so every time it runs it makes a fresh clone of the repository and uses
`git-annex get` to download a lot of data into it.

We'll create a cache repository, set it as a remote of the other git-annex
repositories, and configure git-annex to check the cache first before other
more expensive ways of retrieving content. The cache can be cleaned out
whenever you like with simple unix commands.

Some other nice properties: When used on a filesystem like BTRFS with COW
support, content from the cache can populate multiple other repositories
without using any additional disk space. And git-annex repositories that
are otherwise unrelated can share use of the cache if they happen to
contain a common file.

You'll need git-annex 6.20180802 or newer to follow these instructions.
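
A script that relies on the cache may want to check that version requirement up front. Here is a minimal sketch of such a check. The helper name `annex_version_at_least` is made up for illustration and is not part of git-annex, and it assumes a `sort` that supports `-V` (version sort), as GNU coreutils does.

```shell
# Hypothetical helper (not part of git-annex): succeeds when version $1
# is the same as or newer than version $2.
# Assumes sort(1) supports -V (version sort), e.g. GNU coreutils.
annex_version_at_least() {
	have=${1%%-*}   # drop any "-g<commit>" style suffix
	want=$2
	# If the wanted version sorts first (or both are equal), then the
	# installed version is new enough.
	oldest=$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)
	[ "$oldest" = "$want" ]
}
```

It could be called as `annex_version_at_least "$(git annex version --raw)" 6.20180802`, printing a warning or aborting when it fails.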

## creating the cache

First let's create a new, empty git-annex repository. It will be put in
~/.annex-cache in the example, but for best results, put it in the same
filesystem as your other git-annex repositories.

    git init ~/.annex-cache
    cd ~/.annex-cache
    git annex init
    git config annex.hardlink true
    git annex untrust here

The cache does not need to be a git-annex repository; any kind of special
remote can be used as a cache too. But using a git repository lets
annex.hardlink be used to make hard links between the cache and
repositories using it.

The cache is made untrusted, because its contents can be cleaned at any
time; other repositories should not trust it to retain content.
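
Hard links cannot cross filesystem boundaries, which is why the cache works best on the same filesystem as the repositories that use it. The following sketch checks that for two existing directories using plain POSIX tools. The helper names `mount_point_of` and `same_filesystem` are hypothetical, and the comparison assumes mount point paths that contain no whitespace.

```shell
# Hypothetical helpers, not part of git-annex.
# Assumes mount point paths contain no whitespace.

# Print the mount point that holds path $1.
mount_point_of() {
	# df -P prints a header line, then one line whose last field is
	# the mount point.
	df -P "$1" | awk 'NR == 2 { print $NF }'
}

# Succeed when both paths live on the same mount, which is what hard
# links between them require. Fails if either path cannot be examined.
same_filesystem() {
	a=$(mount_point_of "$1")
	b=$(mount_point_of "$2")
	[ -n "$a" ] && [ "$a" = "$b" ]
}
```

For example, `same_filesystem ~/.annex-cache ~/my-repository || echo "hard links will not work here" >&2`, where `my-repository` stands for any repository that will use the cache.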

## making repositories use the cache

Now in each git-annex repository that you want to use the cache, add it as
a remote, and configure it as follows:

    cd my-repository
    git remote add cache ~/.annex-cache
    git config remote.cache.annex-speculate-present true
    git config remote.cache.annex-cost 10
    git config remote.cache.annex-pull false
    git config remote.cache.annex-push false

The annex-speculate-present setting is the essential part. It tells
git-annex that the cache repository may contain the content of any
annexed file. So, when getting a file, git-annex will try the cache
repository first.

The low annex-cost makes git-annex try to get content from the cache remote
before any other remotes.

The annex-pull and annex-push settings prevent `git-annex sync` from
pulling and pushing to the remote. The cache repository will remain an
empty git repository (except for the content of annexed files). This means
that the same cache can be used with multiple different git-annex
repositories, without intermingling their git data. You should also avoid
manual `git pull` and `git push` to the cache remote.
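
Since the same settings have to be repeated in every repository that uses the cache, it may be convenient to wrap them in a small shell function. This is only a sketch: the name `use_annex_cache` is made up for illustration, and the second argument (the cache location, defaulting to ~/.annex-cache as in this example) is an assumption about how you might want to parameterize it.

```shell
# Hypothetical helper: apply the cache settings from this tip to one
# repository. $1 is the repository, $2 the cache location
# (defaults to ~/.annex-cache, as in the example).
use_annex_cache() {
	repo=$1
	cache=${2:-$HOME/.annex-cache}
	git -C "$repo" remote add cache "$cache" &&
	git -C "$repo" config remote.cache.annex-speculate-present true &&
	git -C "$repo" config remote.cache.annex-cost 10 &&
	git -C "$repo" config remote.cache.annex-pull false &&
	git -C "$repo" config remote.cache.annex-push false
}
```

Running `use_annex_cache my-repository` then has the same effect as the commands listed earlier in this section.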

## populating the cache

For the cache to be used, you need to get file contents into it somehow.
A simple way to do that is, in a git-annex repository that already
contains the content of files:

    git annex copy --to cache

You could run that anytime after you get content. There are also ways to
automate it, but getting some files into the cache manually is a good
enough start.
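
For the CI use case from the introduction, getting content and then copying it to the cache can be combined into one step. The following is a rough sketch, assuming it runs inside a fresh clone that already has the `cache` remote configured as described above; `fetch_with_cache` is a hypothetical name, not a git-annex command.

```shell
# Hypothetical CI step: run inside a clone that already has the
# "cache" remote configured. Arguments are the files or directories
# to fetch, e.g. "." for everything below the current directory.
fetch_with_cache() {
	# Download content; with a low annex-cost the cache is tried
	# before other remotes.
	git annex get "$@" &&
	# Copy content that is now present locally into the cache, so
	# that later runs can find it there.
	git annex copy --to cache "$@"
}
```

A CI job would call `fetch_with_cache .` right after cloning and configuring the cache remote.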

## cleaning the cache

XXX find

## automatically populating the cache

XXX

## more caches

The example above used a local cache on the same system. However, it's also
possible to have a cache repository shared among computers on a LAN.