git-annex allows managing files with git, without checking the file
contents into git. While that may seem paradoxical, it is useful when
dealing with files larger than git can currently easily handle, whether due
to limitations in memory, checksumming time, or disk space.

Even without file content tracking, being able to manage files with git,
move files around and delete files with versioned directory trees, and use
branches and distributed clones, are all very handy reasons to use git. And
annexed files can co-exist in the same git repository with regularly
versioned files, which is convenient for maintaining documents, Makefiles,
etc that are associated with annexed files but that benefit from full
revision control.

My motivation for git-annex was the growing number of external drives I
use. Some are used to archive data, others hold backups, and yet others
come with me when I'm away from home to carry data that doesn't fit on my
netbook. Maintaining all that was a nightmare: lots of ad-hoc moving files
around, rsyncing files (unison is too slow), and deleting multiple copies
of files from multiple places. I realized that what I needed was a form of
revision control where each drive was a repository, and where copying the
files around, and deciding which copies were safe to delete, was automated.
I posted about this to the VCS-home mailing list and got a great suggestion
to make it support arbitrary key-value stores, for more generality and
flexibility. A week of coding later, and git-annex is born.

Enough broad picture, here's how it actually looks:

* `git annex add $file` moves the file into `.git/annex/`, replaces
  it with a symlink pointing at the annexed file, and then calls `git add`
  to version the *symlink*. (If the file has already been annexed, it does
  nothing.)

  If you then use normal git push/pull commands, the annexed file content
  won't be transferred between repositories, but the symlinks will be.
  So different clones of a repository can have different sets of annexed
  files available.

  You can move the symlink around, copy it, delete it, etc, and commit changes
  as desired using git. Reading the symlink gets you the annexed file
  content, unless the content is not currently available, in which case
  the link is broken.
* `git annex get $file` is used to transfer a specified file from the
  backend storage to the current repository.
* `git annex drop $file` indicates that you no longer want the file's
  content to be available in this repository.
* `git annex fix $file` adjusts the symlink for the file to point to its
  content again. Use this if you've moved the file around.
* `git annex unannex $file` undoes a `git annex add`. But use `git annex drop`
  if you're just done with a file; only use `unannex` if you
  accidentally added a file. (You can also run this on all your annexed
  files come the Singularity. ;-)
* `git annex init "some description"` allows associating some description
  (such as "USB archive drive 1") with a repository. This can help with
  finding it later, see "location tracking" below.

Oh yeah, "$file" in the above can be any number of files, or directories,
same as you'd pass to "git add" or "git rm".
So "git annex add ." or "git annex get dir/" work fine.

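What `git annex add` does to the working tree, as described above, can be sketched in a few lines of Python. This is only an illustration of the move-and-symlink step, not git-annex's own code: the function name `annex_add` and the key `hypothetical-key` are made up for the example, the key is stored directly under `.git/annex/`, and the final `git add` of the symlink is left out.

```python
import os
import tempfile


def annex_add(repo, relpath, key):
    """Move repo/relpath to repo/.git/annex/<key> and leave a relative symlink behind."""
    src = os.path.join(repo, relpath)
    if os.path.islink(src):
        return  # already annexed: do nothing
    annex_dir = os.path.join(repo, ".git", "annex")
    os.makedirs(annex_dir, exist_ok=True)
    dest = os.path.join(annex_dir, key)
    os.rename(src, dest)
    # The link target is relative, so it survives the repository being moved.
    target = os.path.relpath(dest, os.path.dirname(src))
    os.symlink(target, src)


# Build a throwaway "repository" with one file in a subdirectory.
repo = tempfile.mkdtemp()
os.makedirs(os.path.join(repo, "docs"))
link = os.path.join(repo, "docs", "big.iso")
with open(link, "w") as f:
    f.write("pretend this is huge")

annex_add(repo, os.path.join("docs", "big.iso"), "hypothetical-key")
print(os.readlink(link))   # where the symlink points
with open(link) as f:
    print(f.read())        # reading through the symlink still yields the content
```

Only the symlink would be checked into git; the content stays behind in `.git/annex/`.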
## key-value storage

git-annex uses a key-value abstraction layer to allow file contents to be
stored in different ways. In theory, any key-value storage system could be
used to store the file contents, and git-annex would then retrieve them
as needed and put them in `.git/annex/`.

When a file is annexed, a key is generated from its content and/or metadata.
The file checked into git symlinks to the key. This key can later be used
to retrieve the file's content (its value). This key generation must be
stable for a given file content, name, and size.

Multiple pluggable backends are supported, and more than one can be used
to store different files' contents in a given repository.

* `WORM` ("Write Once, Read Many") -- This backend stores the file's content
  only in `.git/annex/`, and assumes that any file with the same basename,
  size, and modification time has the same content. So with this backend,
  files can be moved around, but should never be added to or changed.
  This is the default, and the least expensive backend.
* `SHA1` -- This backend stores the file's content in
  `.git/annex/`, with a name based on its sha1 checksum. This backend allows
  modifications of files to be tracked. Its need to generate checksums
  can make it slow for large files.
* `URL` -- This backend downloads the file's content from an external URL.

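To make the difference between the `WORM` and `SHA1` backends concrete, here is a minimal Python sketch of the two kinds of key generation. The key formats (`WORM:name:size:mtime` and `SHA1:hexdigest`) and the function names are assumptions made for the example, not the format git-annex actually writes.

```python
import hashlib
import os
import tempfile


def worm_key(path):
    """Key from basename, size and modification time; the content is never read."""
    st = os.stat(path)
    return "WORM:{}:{}:{}".format(os.path.basename(path), st.st_size, int(st.st_mtime))


def sha1_key(path):
    """Key from a SHA-1 checksum of the content, read in 1 MiB chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return "SHA1:" + digest.hexdigest()


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "photo.raw")
with open(path, "wb") as f:
    f.write(b"hello")
os.utime(path, (1000, 1000))  # pin the timestamps so the WORM key is reproducible

print(worm_key(path))
print(sha1_key(path))
```

The WORM key costs one `stat` call regardless of file size, which is why it is the cheap option; the SHA1 key has to read every byte, but changes whenever the content does.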
## copies

The WORM and SHA1 key-value backends store data inside your git repository.
It's important that data not get lost by an ill-thought `git annex drop`
command. So, when using those backends, git-annex can be configured to try
to keep N copies of a file's content available across all repositories. By
default, N is 1; it is configured by `annex.numcopies`.

`git annex drop` attempts to verify with other git remotes that N
copies of the file exist. If enough repositories cannot be verified to have
it, it will retain the file content to avoid data loss.

For example, consider three repositories: Server, Laptop, and USB. Both Server
and USB have a copy of a file, and N=1. If on Laptop, you
`git annex get $file`, this will transfer it from either Server or USB
(depending on which is available), and there are now 3 copies of the file.

Suppose you want to free up space on Laptop again, and you `git annex drop` the file
there. If USB is connected, or Server can be contacted, git-annex can check
that it still has a copy of the file, and the content is removed from
Laptop. But if USB is currently disconnected, and Server also cannot be
contacted, it can't verify that it is safe to drop the file, and will
refuse to do so.

With N=2, in order to drop the file content from Laptop, it would need access
to both USB and Server.

Note that different repositories can be configured with different values of
N. So just because Laptop has N=2, this does not prevent the number of
copies falling to 1, when USB and Server have N=1.

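The Server/Laptop/USB example above boils down to one rule: a drop is allowed only if at least N *other* repositories can be verified, right now, to hold the content. A small Python sketch of that rule follows; `can_drop` and its parameters are names invented for this illustration.

```python
def can_drop(numcopies, holders, reachable, here):
    """Return True if `here` may drop its copy.

    holders:   names of repositories believed to hold the content
    reachable: names of repositories that can be contacted right now
    """
    verified = [r for r in holders if r != here and r in reachable]
    return len(verified) >= numcopies


holders = {"Server", "USB", "Laptop"}

# N=1: one verifiable copy elsewhere is enough.
print(can_drop(1, holders, {"USB"}, "Laptop"))
# N=1, but USB is unplugged and Server is unreachable: refuse.
print(can_drop(1, holders, set(), "Laptop"))
# N=2: USB alone is not enough...
print(can_drop(2, holders, {"USB"}, "Laptop"))
# ...both USB and Server must be reachable.
print(can_drop(2, holders, {"USB", "Server"}, "Laptop"))
```

Note that the check uses the N configured in the repository doing the drop, which is why differing values of N across repositories can still let the total fall below what one of them wants.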
## location tracking

git-annex keeps track of which repositories it last saw a file's content in.
This location tracking information is stored in `.git-annex/$key.log`.
Repositories record their UUID and the date when they get or drop
a file's content. (Git is configured to use a union merge for this file,
so the lines may be in arbitrary order, but it will never conflict.)

This location tracking information is useful if you have multiple
repositories, and not all are always accessible. For example, perhaps one
is on a home file server, and you are away from home. Then git-annex can
tell you what git remote it needs access to in order to get a file:

    # git annex get myfile
    get myfile (need access to one of these remotes: home)
    git-annex: get myfile failed

Another way the location tracking comes in handy is if you put repositories
on removable USB drives, that might be archived away offline in a safe
place. In this sort of case, you probably don't have a git remote
configured for every USB drive. So git-annex may have to resort to talking
about repository UUIDs. If you have previously used "git annex init"
to attach descriptions to those repositories, it will include their
descriptions to help you with finding them:

    # git annex get myfile
    get myfile (No available git remotes have the file.)
      It has been seen before in these repositories:
        c0a28e06-d7ef-11df-885c-775af44f8882 -- USB archive drive 1
        e1938fee-d95b-11df-96cc-002170d25c55
    git-annex: get myfile failed

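The reason a union merge is safe for the location log can be shown with a short Python sketch. The line format used here (`timestamp status uuid`, with status `1` for "has the content" and `0` for "dropped it") is an assumption made for the example; the text above only says that the UUID and date are recorded.

```python
def union_merge(ours, theirs):
    """Union merge: keep every distinct line from both sides; order is irrelevant."""
    return sorted(set(ours) | set(theirs))


def current_locations(lines):
    """Return the UUIDs whose most recent entry says the content is present."""
    latest = {}
    for line in lines:
        timestamp, present, uuid = line.split()
        stamp = float(timestamp)
        if uuid not in latest or stamp > latest[uuid][0]:
            latest[uuid] = (stamp, present == "1")
    return {uuid for uuid, (_, present) in latest.items() if present}


usb1 = "c0a28e06-d7ef-11df-885c-775af44f8882"
usb2 = "e1938fee-d95b-11df-96cc-002170d25c55"

# Two clones that each appended to the log independently.
ours = ["1286000000 1 " + usb1, "1286000500 1 " + usb2]
theirs = ["1286000000 1 " + usb1, "1286000900 0 " + usb2]

merged = union_merge(ours, theirs)
for line in merged:
    print(line)
print(current_locations(merged))
```

Because each line carries its own timestamp, the merged file can be read in any order: the newest line per UUID wins, so the later drop on the second drive overrides its earlier get.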
## symlink farming commit hook

git-annex does use a lot of symlinks. Specifically, relative symlinks
that are checked into git. To allow you to move those around without
annoyance, git-annex can run as a post-commit hook. This way, you can `git mv`
a symlink to an annexed file, and as soon as you commit, it will be fixed
up.

`git annex init` tries to set up a post-commit hook that is itself a symlink
back to git-annex. If you want to have your own shell script in the post-commit
hook, just make it call `git annex` with no parameters. git-annex will detect
when it's run from a git hook and do the necessary fixups.

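Why a moved symlink needs fixing up, and what the fixup amounts to, can be sketched in Python. This is an illustration under assumptions, not the hook's real implementation: `fix_symlink` is a made-up name, the key is recovered from the last component of the old link target, and content is assumed to live directly under `.git/annex/`.

```python
import os
import tempfile


def fix_symlink(repo, link_relpath):
    """Repoint a moved symlink at repo/.git/annex/<key> using a fresh relative target."""
    link = os.path.join(repo, link_relpath)
    key = os.path.basename(os.readlink(link))  # the key is the last path component
    content = os.path.join(repo, ".git", "annex", key)
    wanted = os.path.relpath(content, os.path.dirname(link))
    if os.readlink(link) != wanted:
        os.remove(link)
        os.symlink(wanted, link)
    return wanted


repo = tempfile.mkdtemp()
annex = os.path.join(repo, ".git", "annex")
os.makedirs(annex)
with open(os.path.join(annex, "hypothetical-key"), "w") as f:
    f.write("annexed content")

# A correct link at the top level of the repository...
os.symlink(os.path.join(".git", "annex", "hypothetical-key"), os.path.join(repo, "file"))
# ...dangles once it is moved two directories down, as `git mv` would do.
os.makedirs(os.path.join(repo, "sub", "dir"))
os.rename(os.path.join(repo, "file"), os.path.join(repo, "sub", "dir", "file"))

moved = os.path.join("sub", "dir", "file")
print(os.path.exists(os.path.join(repo, moved)))  # False while the link dangles
print(fix_symlink(repo, moved))
print(os.path.exists(os.path.join(repo, moved)))  # True once it is repointed
```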
## configuration

* `annex.uuid` -- a unique UUID for this repository
* `annex.numcopies` -- number of copies of files to keep across all
  repositories (default: 1)
* `annex.backends` -- space-separated list of names of
  the key-value backends to use. The first listed is used to store
  new files. (default: "WORM SHA1 URL")
* `remote.<name>.annex-cost` -- When determining which repository to
  transfer annexed files from or to, ones with lower costs are preferred.
  The default cost is 100 for local repositories, and 200 for remote
  repositories. Note that other factors may be considered when pushing
  files to repositories, in particular, whether the repository is on
  a filesystem with sufficient free space.
* `remote.<name>.annex-uuid` -- git-annex caches UUIDs of repositories
  here.

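How `remote.<name>.annex-cost` influences the choice of repository can be sketched as a simple sort. The dictionaries below stand in for git remotes and are purely illustrative; only the default costs (100 local, 200 remote) and the "lower cost preferred" rule come from the list above.

```python
DEFAULT_LOCAL_COST = 100
DEFAULT_REMOTE_COST = 200


def remote_cost(remote):
    """Use the configured annex-cost if present, else the default for its kind."""
    if remote.get("annex-cost") is not None:
        return remote["annex-cost"]
    return DEFAULT_LOCAL_COST if remote["local"] else DEFAULT_REMOTE_COST


def by_preference(remotes):
    """Order remotes cheapest first; sorted() is stable, so ties keep their order."""
    return sorted(remotes, key=remote_cost)


remotes = [
    {"name": "server", "local": False},
    {"name": "usb", "local": True},
    {"name": "slowdisk", "local": True, "annex-cost": 300},
]
print([r["name"] for r in by_preference(remotes)])
```

Here the local USB drive is tried before the server, while the explicitly configured cost of 300 pushes the slow local disk to the end despite it being local.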
## issues

### free space determination

Need a way to tell how much free space is available on the disk containing
a given repository. The repository may be remote, so ssh may need to be
used.

Similarly, need a way to tell the size of a file before copying it from
a remote, to check local disk space.

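For the local half of this problem, one possible approach is sketched below in Python using the standard library's `shutil.disk_usage`. It does not address the remote case, where the check would have to run over ssh; the function name and the `reserve` parameter are inventions for the example.

```python
import shutil


def has_room_for(repo_path, file_size, reserve=0):
    """True if the filesystem holding repo_path has file_size + reserve bytes free."""
    free = shutil.disk_usage(repo_path).free
    return free >= file_size + reserve


# Would a 1 GiB file fit on the disk holding the current directory,
# while leaving 100 MiB untouched?
print(has_room_for(".", 1 << 30, reserve=100 << 20))
```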
### auto-drop on rm

When git-rm removes a file, its key should get dropped too. Of course, it
may not be dropped right away, depending on the number of copies available.

### branching

The use of `.git-annex` to store logs means that if a repo has branches
and the user switches between them, git-annex will see different logs in
the different branches, and so may miss info about what remotes have which
files (though it can re-learn).

An alternative would be to store the log data directly in the git repo
as `pristine-tar` does. The problem with that approach is that git won't merge
conflicting changes to log files if they are not in the currently checked
out branch.

It would be possible to use a branch with a tree like this, to avoid
conflicts:

    key/uuid/time/status

As long as new files are only added, and old timestamped files deleted,
there would be no conflicts.

A related problem though is the size of the tree objects git needs to
commit. Having the logs in a separate branch doesn't help with that.
As more keys are added, the tree object size will increase, and git will
take longer and longer to commit, and use more space. One way to deal with
this is simply by splitting the logs among subdirectories. Git then can
reuse trees for most directories. (Check: Does it still have to build
dup trees in memory?)

Another approach would be to have git-annex *delete* old logs. Keep logs
for the currently available files, or something like that. If other log
info is needed, look back through history to find the first occurrence of a
log. Maybe even look at other branches -- so if the logs were on master,
a new empty branch could be made and git-annex would still know where to
get keys in that branch.

Would have to be careful about conflicts when deleting and bringing back
files with the same name. And would need to avoid expensive searching through
all history to try to find an old log file.

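The `key/uuid/time/status` layout and the idea of splitting logs among subdirectories can be combined in a short Python sketch. Everything concrete here is an assumption for illustration: the two-character bucket taken from a SHA-1 of the key, the function names, and modelling a git tree as a set of paths. Keys are assumed not to contain `/`.

```python
import hashlib


def log_path(key, uuid, timestamp, status):
    """One empty file per event, under a bucket directory derived from the key."""
    bucket = hashlib.sha1(key.encode()).hexdigest()[:2]
    return "{}/{}/{}/{}/{}".format(bucket, key, uuid, timestamp, status)


def merge_trees(ours, theirs):
    """Each side only adds distinct paths, so a merge is a plain set union."""
    return ours | theirs


def prune(paths):
    """Delete old timestamped entries, keeping the newest per (key, uuid) pair."""
    newest = {}
    for path in paths:
        _bucket, key, uuid, timestamp, _status = path.split("/")
        slot = (key, uuid)
        if slot not in newest or int(timestamp) > newest[slot][0]:
            newest[slot] = (int(timestamp), path)
    return {path for _, path in newest.values()}


ours = {log_path("k1", "uuid-a", 100, "1")}
theirs = {log_path("k1", "uuid-a", 200, "0"), log_path("k1", "uuid-b", 150, "1")}

for path in sorted(prune(merge_trees(ours, theirs))):
    print(path)
```

Since no two events ever write the same path, two branches cannot produce conflicting edits to one file, and the bucket directory keeps any single tree object from growing with the total number of keys.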
## contact

Joey Hess <joey@kitenet.net>