Joey Hess 2011-10-17 13:56:36 -04:00
parent 66fa4c947c
commit 617bdc740f
12 changed files with 5 additions and 18 deletions

@ -0,0 +1,49 @@
[The Internet Archive](http://www.archive.org/) allows members to upload
collections using an Amazon S3
[compatible API](http://www.archive.org/help/abouts3.txt), and this can
be used with git-annex's [[special_remotes/S3]] support.
So, you can locally archive things with git-annex, define remotes that
correspond to "items" at the Internet Archive, and use git-annex to upload
your files there. Of course, your use of the Internet Archive must
comply with their [terms of service](http://www.archive.org/about/terms.php).
Sign up for an account, and get your access keys here:
<http://www.archive.org/account/s3.php>
# export AWS_ACCESS_KEY_ID=blahblah
# export AWS_SECRET_ACCESS_KEY=xxxxxxx
Specify `host=s3.us.archive.org` when doing `initremote` to set up
a remote at the Archive. This will enable a special Internet Archive mode:
Encryption is not allowed; you are required to specify a bucket name
rather than having git-annex pick a random one; and you can optionally
specify `x-archive-meta*` headers to add metadata as explained in their
[documentation](http://www.archive.org/help/abouts3.txt).
[[!template id=note text="""
/!\ There seems to be a bug in either hS3 or the archive that breaks
authentication when the bucket name contains spaces or upper-case letters.
Use all lowercase and no spaces when making the bucket with `initremote`.
"""]]
# git annex initremote archive-panama type=S3 \
host=s3.us.archive.org bucket=panama-canal-lock-blueprints \
x-archive-meta-mediatype=texts x-archive-meta-language=eng \
x-archive-meta-title="original Panama Canal lock design blueprints"
initremote archive-panama (Internet Archive mode) ok
# git annex describe archive-panama "a man, a plan, a canal: panama"
describe archive-panama ok
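Because of the authentication problem noted above, it can help to check a bucket name before passing it to `initremote`. The following is a minimal sketch, not part of git-annex: `valid_ia_bucket` is a hypothetical helper name, and it only enforces the two restrictions mentioned here (no spaces, no upper-case letters), not any other naming rules the Archive may have.

```shell
# Hypothetical helper (not part of git-annex): succeeds only if the
# bucket name is non-empty and has no whitespace or upper-case letters.
valid_ia_bucket() {
    case "$1" in
        "" | *[[:space:]]* | *[[:upper:]]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Example: check the name before using it with initremote.
bucket=panama-canal-lock-blueprints
if valid_ia_bucket "$bucket"; then
    echo "bucket name looks usable: $bucket"
else
    echo "bucket name has spaces or upper-case letters: $bucket" >&2
fi
```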
Then you can annex files and copy them to the remote as usual:
# git annex add photo1.jpeg --backend=SHA1E
add photo1.jpeg (checksum...) ok
# git annex copy photo1.jpeg --fast --to archive-panama
copy (to archive-panama...) ok
Note the use of the SHA1E [[backend|backends]]. It makes most sense
to use the WORM or SHA1E backend for files that will be stored in
the Internet Archive, since the key name will be exposed as the filename
there, and since the Archive does special processing of files based on
their extension.

@ -0,0 +1,16 @@
Maybe you started out using the WORM backend, and have now configured
git-annex to use SHA1. But files you added to the annex before still
use the WORM backend. There is a simple command that can migrate that
data:
# git annex migrate my_cool_big_file
migrate my_cool_big_file (checksum...) ok
You can only migrate files whose content is currently available. Other
files will be skipped.
After migrating a file to a new backend, the old content in the old backend
will still be present. That is necessary because multiple files
can point to the same content. The `git annex unused` subcommand can be
used to clear up that detritus later. Note that hard links are used,
to avoid wasting disk space.
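Since only files whose content is present can be migrated, one possible routine is to fetch a file first and then migrate it. This is a rough sketch under assumptions: `migrate_file` and the `ANNEX` variable are hypothetical names introduced here, and setting `ANNEX=echo` merely prints the commands instead of running them.

```shell
# Command used to invoke git-annex. ANNEX is a hypothetical override:
# ANNEX=echo prints the commands instead of running them.
ANNEX=${ANNEX:-git annex}

# Hypothetical helper: make sure the content is present, then migrate it.
# $ANNEX is deliberately unquoted so "git annex" splits into two words.
migrate_file() {
    $ANNEX get "$1" && $ANNEX migrate "$1"
}

# Usage, inside a git-annex repository:
#   migrate_file my_cool_big_file
#   git annex unused    # later, to find the leftover old content
```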

@ -0,0 +1,36 @@
git-annex has a powerful syntax for making it act on only certain files.
The simplest thing is to exclude some files, using wild cards:
git annex get --exclude '*.mp3' --exclude '*.ogg'
But you can also exclude files that git-annex's [[location_tracking]]
information indicates are present in a given repository. For example,
if you want to populate newarchive with files, but not those already
on oldarchive, you could do it like this:
git annex copy --not --in oldarchive --to newarchive
Without the --not, --in makes it act on files that *are* in the specified
repository. So, to drop the local copies of files that are also on oldarchive:
git annex drop --in oldarchive
Or maybe you're curious which files have a lot of copies (at least 7, say),
and also want to know which files have fewer than 2 copies:
git annex find --copies 7
git annex find --not --copies 2
The above are the simple examples of specifying what files git-annex
should act on. But you can specify anything you can dream up by combining
the things above, with --and --or -( and -). Those last two strange-looking
options are parentheses, for grouping other options. You will probably
have to escape them from your shell.
Here are the mp3 files that are in either of two repositories, but have
fewer than 3 copies:
git annex find --not --exclude '*.mp3' --and \
-\( --in usbdrive --or --in archive -\) --and \
--not --copies 3
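The parentheses can also be protected with quotes instead of backslashes. As a sketch, the hypothetical function below (`in_either_but_scarce`, with `ANNEX` as an assumed override for the command) wraps the same kind of grouped query as a reusable helper. The two repository names and the copy threshold are parameters.

```shell
# ANNEX is a hypothetical override; ANNEX=echo prints the command
# line instead of running git-annex.
ANNEX=${ANNEX:-git annex}

# Hypothetical helper: list files that are in either of two repositories
# but have fewer than the given number of copies.
# '-(' and '-)' are quoted so the shell passes them through untouched.
in_either_but_scarce() {
    $ANNEX find '-(' --in "$1" --or --in "$2" '-)' \
        --and --not --copies "$3"
}

# Usage, inside a git-annex repository:
#   in_either_but_scarce usbdrive archive 3
```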

@ -0,0 +1,19 @@
Suppose something goes wrong, and fsck puts all the files in lost+found.
It's actually very easy to recover from this disaster.
First, check out the git repository again. Then, in the new checkout:
$ mkdir recovered-content
$ sudo mv ../lost+found/* recovered-content
$ sudo chown -R you:you recovered-content
$ chmod -R u+w recovered-content
$ git annex add recovered-content
$ git rm recovered-content
$ git commit -m "recovered some content"
$ git annex fsck
The way that works is that when git-annex adds the same content that was in
the repository before, all the old links to that content start working
again. This works particularly well if the SHA* backends are used, but even
with the default backend it will work pretty well, as long as fsck
preserved the modification time of the files.

@ -0,0 +1,28 @@
Suppose you have a USB thumb drive and are using it as a git-annex
repository. You don't trust the drive, because you could lose it, or
accidentally run it through the laundry. Or, maybe you have a drive that
you know is dying, and you'd like to be warned if there are any files
on it not backed up somewhere else. Maybe the drive has already died
or been lost.
You can let git-annex know that you don't trust a repository, and it will
adjust its behavior to avoid relying on that repository's continued
availability.
# git annex untrust usbdrive
untrust usbdrive ok
Now when you do a fsck, you'll be warned appropriately:
# git annex fsck .
fsck my_big_file
Only these untrusted locations may have copies of this file!
05e296c4-2989-11e0-bf40-bad1535567fe -- portable USB drive
Back it up to trusted locations with git-annex copy.
failed
Also, git-annex will refuse to drop a file from elsewhere just because
it can see a copy on the untrusted repository.
It's also possible to tell git-annex that you have an unusually high
level of trust for a repository. See [[trust]] for details.

@ -0,0 +1,37 @@
git-annex extends git's usual remotes with some [[special_remotes]] that
are not git repositories. This way you can set up a remote using, say,
Amazon S3, and use git-annex to transfer files into the cloud.
First, export your S3 credentials:
# export ANNEX_S3_ACCESS_KEY_ID="08TJMT99S3511WOZEP91"
# export ANNEX_S3_SECRET_ACCESS_KEY="s3kr1t"
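A script that goes on to create the remote might first check that both variables are set. This is a minimal sketch; `require_env` is a hypothetical helper, not something git-annex provides.

```shell
# Hypothetical helper: succeed only if every named environment
# variable is set to a non-empty value. Reports each missing name.
require_env() {
    missing=0
    for name in "$@"; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing environment variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage, before setting up the S3 remote:
#   require_env ANNEX_S3_ACCESS_KEY_ID ANNEX_S3_SECRET_ACCESS_KEY || exit 1
```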
Now, create a gpg key, if you don't already have one. This will be used
to encrypt everything stored in S3, for your privacy. Once you have
a gpg key, run `gpg --list-secret-keys` to look up its key id, something
like "2512E3C7".
Next, create the S3 remote, and describe it.
# git annex initremote cloud type=S3 encryption=2512E3C7
initremote cloud (encryption setup with gpg key C910D9222512E3C7) (checking bucket) (creating bucket in US) (gpg) ok
# git annex describe cloud "at Amazon's US datacenter"
describe cloud ok
The configuration for the S3 remote is stored in git. So making another
repository use the same S3 remote is easy:
# cd /media/usb/annex
# git pull laptop
# git annex initremote cloud
initremote cloud (gpg) (checking bucket) ok
Now the remote can be used like any other remote.
# git annex copy my_cool_big_file --to cloud
copy my_cool_big_file (gpg) (checking cloud...) (to cloud...) ok
# git annex move video/hackity_hack_and_kaxxt.mov --to cloud
move video/hackity_hack_and_kaxxt.mov (checking cloud...) (to cloud...) ok
See [[special_remotes/S3]] for details.

@ -0,0 +1,11 @@
A handy alternative to the default [[backend|backends]] is the
SHA1 backend. This backend provides more git-style assurance that your data
has not been damaged. And the checksum means that when you add the same
content to the annex twice, only one copy need be stored in the backend.
The only reason it's not the default is that it needs to checksum
files when they're added to the annex, and this can slow things down
significantly for really big files. To make SHA1 the default, just
add something like this to `.gitattributes`:
* annex.backend=SHA1
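A script that sets this up might add the line only if it is not already there. This is a small sketch; `set_default_backend` is a hypothetical helper name, and its optional second argument (the attributes file to edit) exists only to make it easy to try outside a real repository.

```shell
# Hypothetical helper: append "* annex.backend=NAME" to a gitattributes
# file unless that exact line is already present.
set_default_backend() {
    line="* annex.backend=$1"
    file=${2:-.gitattributes}
    # -x matches whole lines, -F treats the "*" literally.
    grep -qxF -- "$line" "$file" 2>/dev/null ||
        printf '%s\n' "$line" >> "$file"
}

# Usage, at the top of a git-annex repository:
#   set_default_backend SHA1
```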

@ -0,0 +1,32 @@
The web can be used as a [[special_remote|special_remotes]] too.
# git annex addurl http://example.com/video.mpeg
addurl example.com_video.mpeg (downloading http://example.com/video.mpeg)
########################################################## 100.0%
ok
Now the file is downloaded, and has been added to the annex like any other
file. So it can be renamed, copied to other repositories, and so on.
Note that git-annex assumes that, if the web site does not 404, the file is
still present on the web, and this counts as one [[copy|copies]] of the
file. So it will let you remove your last copy, trusting it can be
downloaded again:
# git annex drop example.com_video.mpeg
drop example.com_video.mpeg (checking http://example.com/video.mpeg) ok
If you don't [[trust]] the web to this degree, just let git-annex know:
# git annex untrust web
untrust web ok
As a result, it will hang on to the file:
# git annex drop example.com_video.mpeg
drop example.com_video.mpeg (unsafe)
Could only verify the existence of 0 out of 1 necessary copies
Also these untrusted repositories may contain the file:
00000000-0000-0000-0000-000000000001 -- web
(Use --force to override this check, or adjust annex.numcopies.)
failed

@ -0,0 +1,19 @@
So you lost a thumb drive containing a git-annex repository. Or a hard
drive died, or some other misfortune has befallen your data.
Unless you configured backups, git-annex can't get your data back. But it
can help you deal with the loss.
First, go somewhere that knows about the lost repository, and mark it as
untrusted.
git annex untrust usbdrive
To remind yourself later what happened, you can change its description, too:
git annex describe usbdrive "USB drive lost in Timbuktu. Probably gone forever."
This retains the [[location_tracking]] information for the repository.
Maybe you'll find the drive later. Maybe that's impossible. Either way,
this lets git-annex tell you why a file is no longer accessible, and
it keeps git-annex from relying on that drive to hold any content.
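The two steps above could be bundled into one small function. This is only a sketch: `mark_lost` and the `ANNEX` variable are hypothetical names introduced here, and `ANNEX=echo` prints the commands rather than running them.

```shell
# ANNEX is a hypothetical override; ANNEX=echo prints the commands
# instead of running git-annex.
ANNEX=${ANNEX:-git annex}

# Hypothetical helper: stop trusting a repository and record why.
mark_lost() {
    $ANNEX untrust "$1" && $ANNEX describe "$1" "$2"
}

# Usage, in a repository that knows about the lost one:
#   mark_lost usbdrive "USB drive lost in Timbuktu. Probably gone forever."
```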