git-annex/doc/git-annex.mdwn

git-annex allows managing files with git, without checking the file
contents into git. While that may seem paradoxical, it is useful when
dealing with files larger than git can currently easily handle, whether due
to limitations in memory, checksumming time, or disk space.

Even without file content tracking, being able to manage files with git,
move files around and delete files with versioned directory trees, and use
branches and distributed clones, are all very handy reasons to use git. And
annexed files can co-exist in the same git repository with regularly
versioned files, which is convenient for maintaining documents, Makefiles,
etc that are associated with annexed files but that benefit from full
revision control.

Enough broad picture, here's how it actually looks:

* `git annex add $file` moves the file into `.git/annex/`, and replaces
it with a symlink pointing at the annexed file, and then calls `git add`
to version the *symlink*. (If the file has already been annexed, it does
nothing.)
* If you use normal git push/pull commands, the annexed file content
won't be transferred, but the symlinks will be. So different clones of a
repository can have different sets of annexed files available.
* You can move the symlink around, copy it, delete it, etc, and commit changes
as desired using git. Reading the symlink will always get you the annexed
file content, or the link may be broken if the content is not currently
available.
* `git annex push $repository` pushes *all* annexed files to the specified
repository.
* `git annex pull $repository` pulls *all* annexed files from the specified
repository.
* `git annex want $file` indicates that you want access to a file's
content, without immediately transferring it.
* `git annex get $file` is used to transfer a specified file, and/or
files previously indicated with `git annex want`. If a configured
repository has it, or it is available from other key/value storage,
it will be immediately downloaded.
* `git annex drop $file` indicates that you no longer want the file's
content to be available in this repository.
* `git annex unannex $file` undoes a `git annex add`. But use `git annex drop`
if you're just done with a file; only use `unannex` if you
accidentally added a file.

Oh yeah, `$file` in the above can be any number of files or directories,
same as you'd pass to `git add` or `git rm`.
So `git annex add .` or `git annex get dir/` work fine.
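As a rough sketch of what `git annex add` does on disk, the following Python moves a file under `.git/annex/` and leaves a symlink behind. The function name `annex_add`, the use of the file's basename as the key, and the choice of a relative link are all assumptions made for illustration; the real command derives the key from a backend and also runs `git add` on the symlink.

```python
import os


def annex_add(repo, relpath):
    """Move repo/relpath into repo/.git/annex/ and symlink to it.

    Hypothetical helper, not git-annex's own code. Returns the symlink
    target, or None if the file was already annexed.
    """
    src = os.path.join(repo, relpath)
    if os.path.islink(src):
        return None  # already annexed: do nothing

    annex_dir = os.path.join(repo, ".git", "annex")
    os.makedirs(annex_dir, exist_ok=True)

    # Assumption: the key is simply the basename of the file.
    key = os.path.basename(relpath)
    dest = os.path.join(annex_dir, key)
    os.rename(src, dest)

    # Point the link at the content, relative to the link's directory.
    target = os.path.relpath(dest, os.path.dirname(src))
    os.symlink(target, src)
    return target
```

Reading the symlink afterwards yields the original content, and a second call on the same path returns `None` without touching anything.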
## copies
git-annex can be configured to try to keep N copies of a file's content
available across all repositories. By default, N is 1; it is configured by
`annex.numcopies`.

`git annex drop` attempts to check with other git remotes, to verify that N
copies of the file exist. If enough repositories cannot be verified to have
it, it will retain the file content to avoid data loss.

For example, consider three repositories: Server, Laptop, and USB. Both Server
and USB have a copy of a file, and N=1. If on Laptop, you `git annex get
$file`, this will transfer it from either Server or USB (depending on which
is available), and there are now 3 copies of the file.

Suppose you want to free up space on Laptop again, and you `git annex drop` the file
there. If USB is connected, or Server can be contacted, git-annex can check
that it still has a copy of the file, and the content is removed from
Laptop. But if USB is currently disconnected, and Server also cannot be
contacted, it can't verify that it is safe to drop the file, and will
refuse to do so.

With N=2, in order to drop the file content from Laptop, it would need access
to both USB and Server.

Note that different repositories can be configured with different values of
N. So just because Laptop has N=2, this does not prevent the number of
copies falling to 1, when USB and Server have N=1.
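The drop check described in this section can be sketched in a few lines of Python. This is only an illustration of the rule, not git-annex's code; `can_drop` and its parameters are hypothetical names.

```python
def can_drop(numcopies, reachable_remotes, remotes_with_content):
    """Return True if dropping the local copy is verifiably safe.

    A remote only counts if it can be contacted right now AND it
    holds the file's content. (Hypothetical helper for illustration.)
    """
    verified = [r for r in reachable_remotes if r in remotes_with_content]
    return len(verified) >= numcopies


holders = {"Server", "USB"}

# N=1: either USB or Server being reachable is enough.
print(can_drop(1, {"USB"}, holders))            # True
# N=1: nothing reachable, so the drop is refused.
print(can_drop(1, set(), holders))              # False
# N=2: both USB and Server must be reachable.
print(can_drop(2, {"USB"}, holders))            # False
print(can_drop(2, {"USB", "Server"}, holders))  # True
```

Since each repository evaluates the check with its own N, a drop on USB or Server with N=1 only needs one other verified copy, which is how the total can still fall to 1.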
## key/value storage
git-annex uses a key/value abstraction layer to allow file contents to be
stored in different ways. In theory, any key/value storage system could be
used to store the file contents, and git-annex would then retrieve them
as needed and put them in `.git/annex/`.

When a file is annexed, a key is generated from its content and/or metadata.
The file checked into git symlinks to the key. This key can later be used
to retrieve the file's content (its value). This key generation must be
stable for a given file content, name, and size.

Multiple pluggable backends are supported, and more than one can be used
to store different files' contents in a given repository.

* `WORM` ("Write Once, Read Many") -- This backend stores the file's content
only in `.git/annex/`, and assumes that any file with the same basename,
size, and modification time has the same content. So with this backend,
files can be moved around, but should never be added to or changed.
This is the default, and the least expensive backend.
* `SHA1` -- This backend stores the file's content in
`.git/annex/`, with a name based on its sha1 checksum. This backend allows
modifications of files to be tracked. Its need to generate checksums
can make it slow for large files.
* `URL` -- This backend downloads the file's content from an external URL.
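To make the difference between the `WORM` and `SHA1` backends concrete, here is a small Python sketch of how each might derive a key. The `BACKEND:...` key layout and the function names are assumptions for illustration only; what matters is which inputs each key depends on.

```python
import hashlib
import os


def worm_key(path):
    """Key from basename, size and modification time (content is not read)."""
    st = os.stat(path)
    # Hypothetical layout: WORM:<mtime>:<size>:<basename>
    return "WORM:{}:{}:{}".format(
        int(st.st_mtime), st.st_size, os.path.basename(path)
    )


def sha1_key(path, chunk_size=1 << 20):
    """Key from the SHA1 checksum of the content, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    # Hypothetical layout: SHA1:<hex digest>
    return "SHA1:" + digest.hexdigest()
```

`worm_key` only needs a `stat` call, which is why it is cheap, but it cannot notice a change that keeps name, size and mtime the same. `sha1_key` has to read every byte, so its cost grows with file size.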
## location tracking
git-annex keeps track of which repository it last saw a file's content in.
This can be useful when using it for archiving with offline storage. When
you indicate you want a file, git-annex will tell you which repositories
have the file's content. For example:

    # git annex get myfile
    git-annex: unable to get: myfile
    To get that file, need access to one of these remotes: usbdrive

Location tracking information is stored in `.git-annex/$key.log`.
Repositories record their UUID and the date when they get or drop
a file's content. (Git is configured to use a union merge for this file,
so the lines may be in arbitrary order, but it will never conflict.)
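Because a union merge can leave the log lines in any order, a reader of the log has to rely on the recorded dates rather than on line position. The Python sketch below shows one way to do that. The line format used here (`<timestamp> <1|0> <uuid>`, with `1` for get and `0` for drop) is an assumption for illustration, not the documented log format.

```python
def repositories_with_content(log_lines):
    """Return the set of UUIDs whose most recent entry is a 'get'.

    Assumed line format: "<unix timestamp> <1|0> <uuid>".
    The result does not depend on the order of the lines.
    """
    latest = {}
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        stamp, present, uuid = line.split(" ", 2)
        entry = (float(stamp), present == "1")
        # Keep the newest entry per repository; comparing whole tuples
        # makes ties resolve the same way whatever the line order.
        if uuid not in latest or entry > latest[uuid]:
            latest[uuid] = entry
    return {uuid for uuid, (_, has_it) in latest.items() if has_it}
```

Feeding the same lines in a different order gives the same set, which is the property a union-merged file needs.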
The optional file `.git-annex/uuid.log` can be created to add a description
to a UUID. If git-annex needs a file from some repository, and it cannot find
the repository among the remotes, it will use the description from this
file when asking for the repository to be made available. The file format
is a UUID, a space, and the rest of the line is its description. For
example:

    UUID d3d2474c-d5c3-11df-80a9-002170d25c55 USB drive in red enclosure
    UUID 60cf39c8-d5c6-11df-aa8b-93fda39008d6 my colocated server
## configuration
* `annex.uuid` -- a unique UUID for this repository
* `annex.numcopies` -- number of copies of files to keep (default: 1)
* `annex.backends` -- space-separated list of names of
the key/value backends to use. The first listed is used to store
new files. (default: "WORM SHA1 URL")
* `remote.<name>.annex-cost` -- When determining which repository to
transfer annexed files from or to, ones with lower costs are preferred.
The default cost is 100 for local repositories, and 200 for remote
repositories. Note that other factors may be considered when pushing
files to repositories, in particular, whether the repository is on
a filesystem with sufficient free space.
* `remote.<name>.annex-uuid` -- git-annex caches UUIDs of repositories
here.
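The effect of `remote.<name>.annex-cost` can be sketched as a simple sort. This Python example is only an illustration under stated assumptions: the dictionary layout and the `order_by_cost` name are hypothetical, and only the defaults of 100 and 200 come from the description above.

```python
DEFAULT_LOCAL_COST = 100   # default for local repositories
DEFAULT_REMOTE_COST = 200  # default for remote repositories


def order_by_cost(remotes):
    """Return remote names, cheapest first.

    Each remote is a dict with 'name', 'is_local' and an optional
    'cost' standing in for remote.<name>.annex-cost.
    """
    def cost(remote):
        if remote.get("cost") is not None:
            return remote["cost"]
        return DEFAULT_LOCAL_COST if remote["is_local"] else DEFAULT_REMOTE_COST

    # sorted() is stable, so equal-cost remotes keep their listed order.
    return [r["name"] for r in sorted(remotes, key=cost)]


remotes = [
    {"name": "server", "is_local": False},
    {"name": "usbdrive", "is_local": True},
    {"name": "mirror", "is_local": False, "cost": 50},
]
print(order_by_cost(remotes))  # ['mirror', 'usbdrive', 'server']
```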
## issues
### symlinks
If the symlink to annexed content is relative, moving it to a subdir will
break it. But if it's absolute, moving the git repo (or mounting its drive
elsewhere) will break it. Either:

* Use relative links and need `git annex mv` to move (or post-commit
hook that caches moves and updates links).
* Use absolute links and need `git annex fixlinks` when location changes;
note that would also mean that git would see the symlink targets changed
and want to commit the change. And, other clones of the repo would
diverge and there would be conflicts on the symlink text. Ugh.

Hard links are not an option, because git would then happily commit the
file content. Among other reasons.
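The relative-link problem is easy to see with plain path arithmetic. The Python sketch below works on repository-relative path strings only and does not touch the filesystem; `relative_target` and `resolve` are hypothetical helper names.

```python
import posixpath


def relative_target(link_path, content_path):
    """Symlink target that reaches content_path from link_path's directory.

    Both arguments are paths relative to the top of the repository.
    """
    start = posixpath.join("/", posixpath.dirname(link_path))
    return posixpath.relpath(posixpath.join("/", content_path), start)


def resolve(link_path, target):
    """Repository-relative path that a relative symlink ends up pointing at."""
    return posixpath.normpath(
        posixpath.join(posixpath.dirname(link_path), target)
    )


content = ".git/annex/big.iso"

# A link created at the top of the repository.
target = relative_target("big.iso", content)
print(target)                          # .git/annex/big.iso

# Move the link into sub/ without rewriting it: it now dangles.
print(resolve("sub/big.iso", target))  # sub/.git/annex/big.iso

# What the link would need to contain after the move.
print(relative_target("sub/big.iso", content))  # ../.git/annex/big.iso
```

A `git annex mv` or a post-commit hook would essentially have to recompute the target this way for every link that changed directory.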
### free space determination
Need a way to tell how much free space is available on the disk containing
a given repository. The repository may be remote, so ssh may need to be
used.

Similarly, need a way to tell the size of a file before downloading it from
remote, to check local disk space.
### auto-drop files on rm
When `git rm` removes a file, its content should get dropped too. Of course, it may
not be dropped right away, depending on number of copies available.
### branching
The use of `.git-annex` to store logs means that if a repo has branches
and the user switches between them, git-annex will see different logs in
the different branches, and so may miss info about which remotes have which
files (though it can re-learn). An alternative would be to
store the log data directly in the git repo as `pristine-tar` does.