git-annex allows managing files with git, without checking the file
contents into git. This is useful when dealing with files larger than git
can currently easily handle, whether due to limitations in memory,
checksumming time, or disk space (only one copy need be stored of an
annexed file).

Even without file content tracking, being able to manage file metadata with
git, move files around and delete files with versioned directory trees, and use
branches and distributed clones, are all very handy reasons to use git. And
annexed files can co-exist in the same git repository with regularly versioned
files, which is convenient for maintaining code, Makefiles, etc. that are
associated with annexed files but that benefit from full revision control.

Enough broad picture, here's how it actually looks:

* `git annex add $file` moves the file into `.git/annex/`, replaces
  it with a symlink pointing at the annexed file, and then calls `git add`
  to version the *symlink*. (If the file has already been annexed, it does
  nothing.)
* You can move the symlink around, copy it, delete it, etc., and commit changes
  as desired using git. Reading the symlink gets you the annexed
  file content; if the content is not currently available, the link is
  broken.
* If you use normal git push/pull commands, the annexed file contents
  won't be sent, but the symlinks will be. So different clones of a repository
  can have different sets of annexed files available.
* `git annex push $repository` pushes *all* annexed files to the specified
  repository.
* `git annex pull $repository` pulls *all* annexed files from the specified
  repository.
* `git annex want $file` indicates that you want access to a file's
  content, without immediately transferring it.
* `git annex get $file` is used to transfer a specified file, and/or
  files previously indicated with `git annex want`. If a configured repository has it,
  or it is available from other key/value storage, it will be immediately
  downloaded.
* `git annex drop $file` indicates that you no longer want the file's
  content to be available in this repository.
* `git annex unannex $file` undoes a `git annex add`. But use `git annex drop`
  if you're just done with a file; only use `unannex` if you
  accidentally added a file.
* `git annex $file` is a shorthand. If the file
  is already known, it does `git annex get`, otherwise it does `git annex add`.

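The file shuffling that `git annex add` is described as doing can be sketched in a few lines of Python. This is only an illustration, not git-annex's implementation: the function name `annex_add` is hypothetical, the stored name is assumed to be the file's basename, and the final `git add` of the symlink is left out.

```python
import os
import shutil


def annex_add(repo, relpath):
    """Move a file into .git/annex/ and leave a relative symlink behind.

    Returns False (and does nothing) if the path is already a symlink,
    mirroring "if the file has already been annexed, it does nothing".
    """
    src = os.path.join(repo, relpath)
    if os.path.islink(src):
        return False
    annex_dir = os.path.join(repo, ".git", "annex")
    os.makedirs(annex_dir, exist_ok=True)
    # Assumption: the annexed content is stored under the file's basename.
    dest = os.path.join(annex_dir, os.path.basename(relpath))
    shutil.move(src, dest)
    # Point the symlink at the annexed content, relative to the link's directory.
    os.symlink(os.path.relpath(dest, os.path.dirname(src)), src)
    return True
```

After a call such as `annex_add(repo, "big.iso")`, reading `big.iso` follows the symlink to the content under `.git/annex/`.
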
## copies

git-annex can be configured to try to keep N copies of a file's content
available across all repositories. By default, N is 1 (configured by
`annex.numcopies`).

`git annex drop` first contacts all other configured repositories to
verify that N other copies of the file exist. If it cannot verify that
enough repositories have the content, it retains the file content
to avoid data loss.

For example, consider three repositories: Server, Laptop, and USB. Both Server
and USB have a copy of a file, and N=1. If on Laptop, you
`git annex get $file`, this will transfer it from either Server or USB
(depending on which is available), and there are now 3 copies of the file.

Suppose you want to free up space on Laptop again, and you `git annex drop` the file
there. If USB is connected, or Server can be contacted, git-annex can check
that one of them still has a copy of the file, and the content is removed from
Laptop. But if USB is currently disconnected, and Server also cannot be
contacted, it can't verify that it is safe to drop the file, and will
refuse to do so.

With N=2, in order to drop the file content from Laptop, it would need access
to both USB and Server.

Note that different repositories can be configured with different values of
N. So Laptop having N=2 does not prevent the number of copies from falling
to 1 if USB and Server have N=1 and hold the only copies of a file.

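The drop rule from the Server/Laptop/USB example can be written as a small check. This is a sketch under assumptions: the function name `can_drop` and the three-valued status (`True` for verified present, `False` for verified absent, `None` for unreachable) are illustrative and not taken from git-annex itself.

```python
def can_drop(numcopies, other_repos):
    """Decide whether content may be dropped from the local repository.

    other_repos maps a repository name to True (verified to have the
    content), False (verified not to have it) or None (cannot be
    contacted). Only verified copies count towards numcopies.
    """
    verified = [name for name, has in other_repos.items() if has is True]
    return len(verified) >= numcopies


# N=1, USB connected, Server unreachable: one verified copy is enough.
print(can_drop(1, {"USB": True, "Server": None}))
# N=1, neither can be contacted: refuse to drop.
print(can_drop(1, {"USB": None, "Server": None}))
# N=2 needs both USB and Server to be verified.
print(can_drop(2, {"USB": True, "Server": None}))
```
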
## the .git-annex directory

The `.git-annex` directory at the top of the repository is used to store
git-annex information that should be propagated between repositories.

## key/value storage

git-annex uses a key/value abstraction layer to allow file contents to be
stored in different ways. In theory, any key/value storage system could be
used to store the file contents, and git-annex would then retrieve them
as needed and put them in `.git/annex/`.

When a file is annexed, a key is generated from its content and/or metadata.
The file checked into git symlinks to the key. This key can later be used
to retrieve the file's content (its value). This key generation must be
stable for a given file content, name, and size.

Multiple pluggable backends are supported, and more than one can be used
to store different files' contents in a given repository.

* `file` -- This backend stores the file's content in
  `.git/annex/`, and assumes that any file with the same basename
  has the same content. So with this backend, files can be moved around,
  but should never be added to or changed. This is the default, and
  the least expensive backend.
* `sha1sum` -- This backend stores the file's content in
  `.git/annex/`, with a name based on its sha1 checksum. This backend allows
  modifications of files to be tracked. Its need to generate checksums
  can make it slow for large files.
* `url` -- This backend downloads the file's content from an external URL.

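To make the difference between the `file` and `sha1sum` backends concrete, here is a rough sketch of how each might derive a key. The function names and the exact key strings are assumptions for illustration; the text above only says that one keys on the basename and the other on the sha1 checksum.

```python
import hashlib
import os


def key_file_backend(path):
    """`file` backend sketch: the key is just the basename (cheap, no I/O)."""
    return os.path.basename(path)


def key_sha1sum_backend(path, chunk_size=1 << 20):
    """`sha1sum` backend sketch: the key is the sha1 of the content.

    The file is read in chunks so large files need not fit in memory,
    but every byte still has to be read, which is why this is slower.
    """
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Both functions are stable for unchanged input, but only the checksum key changes when the content does, which is what lets that backend track modifications.
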
## location tracking

git-annex keeps track of which repositories it last saw a file's content in.
This can be useful when using it for archiving with offline storage. When
you indicate you want a file, git-annex will tell you which repositories
have the file's content.

Location tracking information is stored in `.git-annex/$key.log`.
Repositories record their UUID and the date when they get or drop
a file's content. (Git is configured to use a union merge for this file,
so the lines may be in arbitrary order, but it will never conflict.)

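The reason a union merge is safe here is that each log line is self-contained, so the current state can be recomputed from the lines in any order. The sketch below assumes a hypothetical line format of `<timestamp> <1|0> <uuid>` (1 for get, 0 for drop); the real log format is not specified in this document, and the function names are illustrative.

```python
def union_merge(ours, theirs):
    """Keep every distinct line from both sides; nothing can conflict."""
    merged, seen = [], set()
    for line in list(ours) + list(theirs):
        if line not in seen:
            seen.add(line)
            merged.append(line)
    return merged


def current_locations(lines):
    """Return the UUIDs whose most recent record says the content is present.

    Assumed (hypothetical) line format: "<timestamp> <1|0> <uuid>".
    """
    latest = {}
    for line in lines:
        stamp, present, uuid = line.split(" ", 2)
        stamp = float(stamp)
        if uuid not in latest or stamp >= latest[uuid][0]:
            latest[uuid] = (stamp, present == "1")
    return {uuid for uuid, (_, present) in latest.items() if present}
```

Because only the newest record per UUID matters, the arbitrary line order produced by the merge does not change the answer.
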
The optional file `.git-annex/uuid.map` can be created to add a description
to a UUID. If git-annex needs a file from a repository and it cannot find
the repository among the remotes, it will use the description from this
file when asking for the repository to be made available. The file format
is a UUID, a space, and the rest of the line is its description. For
example:

    UUID d3d2474c-d5c3-11df-80a9-002170d25c55 USB drive in red enclosure
    UUID 60cf39c8-d5c6-11df-aa8b-93fda39008d6 my colocated server

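Reading such a file is a matter of splitting each line at the first space. The following is a minimal sketch, with a hypothetical function name `parse_uuid_map`. Note that the example lines start with a literal `UUID` word that the format description does not mention; the sketch tolerates that prefix, which is an assumption on its part.

```python
def parse_uuid_map(text):
    """Parse uuid.map content into a {uuid: description} dict."""
    descriptions = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Assumption: an optional leading "UUID " word, as in the example.
        if line.startswith("UUID "):
            line = line[len("UUID "):]
        # A UUID, a space, and the rest of the line is the description.
        uuid, _, description = line.partition(" ")
        descriptions[uuid] = description
    return descriptions
```

Given the two example lines, looking up `d3d2474c-d5c3-11df-80a9-002170d25c55` in the result yields `USB drive in red enclosure`.
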
## configuration

* `annex.uuid` -- a unique UUID for this repository
* `annex.numcopies` -- number of copies of files to keep (default: 1)
* `annex.backends` -- space-separated list of names of
  the key/value backends to use. The first listed is used to store
  new files. (default: file, sha1sum, url)
* `remote.<name>.annex-cost` -- When determining which repository to
  transfer annexed files from or to, ones with lower costs are preferred.
  The default cost is 100 for local repositories, and 200 for remote
  repositories. Note that other factors may be considered when pushing
  files to repositories, in particular, whether the repository is on
  a filesystem with sufficient free space.
* `remote.<name>.annex-uuid` -- git-annex caches UUIDs of repositories
  here.

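The cost preference described for `remote.<name>.annex-cost` amounts to sorting candidate repositories by cost, with the stated defaults filled in. The dictionary layout and function names below are illustrative assumptions, and the sketch ignores the free-space factor mentioned above.

```python
DEFAULT_LOCAL_COST = 100
DEFAULT_REMOTE_COST = 200


def effective_cost(remote):
    """Use the configured annex-cost if present, else the documented default."""
    if remote.get("cost") is not None:
        return remote["cost"]
    return DEFAULT_LOCAL_COST if remote["is_local"] else DEFAULT_REMOTE_COST


def order_by_cost(remotes):
    """Cheapest repositories first; ties keep their original order."""
    return sorted(remotes, key=effective_cost)


remotes = [
    {"name": "server", "is_local": False, "cost": None},
    {"name": "usb", "is_local": True, "cost": None},
    {"name": "mirror", "is_local": False, "cost": 50},
]
print([r["name"] for r in order_by_cost(remotes)])
```
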
## issues

### symlinks

If the symlink to annexed content is relative, moving it to a subdir will
break it. But if it's absolute, moving the git repo (or mounting its drive
elsewhere) will break it. Either:

* Use relative links, which need `git annex mv` to move them (or a
  post-commit hook that caches moves and updates links).
* Use absolute links, which need `git annex fixlinks` when the location
  changes; note that this would also mean that git would see the symlink
  targets changed and want to commit the change. And other clones of the
  repo would diverge and there would be conflicts on the symlink text. Ugh.

Hard links are not an option, because git would then happily commit the
file content. Among other reasons.

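The relative-link problem is easy to reproduce. The sketch below builds a throwaway directory laid out like a repository, moves a relative symlink into a subdirectory, and then recomputes the target the way a command like the proposed `git annex mv` would have to. The function name and file names are illustrative only.

```python
import os
import tempfile


def relative_link_demo():
    """Return (ok_before_move, ok_after_move, ok_after_fix) for a relative symlink."""
    with tempfile.TemporaryDirectory() as repo:
        annex = os.path.join(repo, ".git", "annex")
        os.makedirs(annex)
        content = os.path.join(annex, "key")
        with open(content, "w") as f:
            f.write("content")

        link = os.path.join(repo, "file")
        os.symlink(os.path.join(".git", "annex", "key"), link)
        ok_before = os.path.exists(link)  # exists() follows the symlink

        subdir = os.path.join(repo, "subdir")
        os.mkdir(subdir)
        moved = os.path.join(subdir, "file")
        os.rename(link, moved)  # moves the link itself, target text unchanged
        ok_after = os.path.exists(moved)

        # Fix: recompute the target relative to the link's new directory.
        os.remove(moved)
        os.symlink(os.path.relpath(content, subdir), moved)
        ok_fixed = os.path.exists(moved)
        return ok_before, ok_after, ok_fixed
```

On a system with symlink support, the link resolves before the move, dangles after it, and resolves again once the target is rewritten.
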
### free space determination

Need a way to tell how much free space is available on the disk containing
a given repository. The repository may be remote, so ssh may need to be
used.

Similarly, need a way to tell the size of a file before downloading it from
a remote, to check local disk space.

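For a repository on a locally mounted filesystem, one possible approach is the standard library's `shutil.disk_usage`. This sketch covers only the local case; the remote-over-ssh case and learning a file's size before download are not addressed, and the function name and `reserve_bytes` parameter are assumptions.

```python
import shutil


def has_room_for(repo_path, needed_bytes, reserve_bytes=0):
    """Check whether the filesystem holding repo_path can take needed_bytes.

    reserve_bytes is an optional safety margin to leave untouched.
    """
    free = shutil.disk_usage(repo_path).free
    return free - reserve_bytes >= needed_bytes
```

A caller would pass the expected size of the annexed content, for example `has_room_for(".", 4 * 1024**3)` before fetching a 4 GiB file.
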
### auto-drop files on rm

When `git rm` removes a file, its content should get dropped too. Of course,
it may not be dropped right away, depending on the number of copies available.

### branching

The use of `.git-annex` to store logs means that if a repo has branches
and the user switches between them, git-annex will see different logs in
the different branches, and so may miss info about what remotes have which
files (though it can re-learn). An alternative would be to
store the log data directly in the git repo as `pristine-tar` does.