2011-05-16 06:07:59 +00:00

[The Internet Archive](http://www.archive.org/) allows members to upload
collections using an Amazon S3
[compatible API](http://www.archive.org/help/abouts3.txt), and this can
be used with git-annex's [[special_remotes/S3]] support.

So, you can locally archive things with git-annex, define remotes that
correspond to "items" at the Internet Archive, and use git-annex to upload
your files there. Of course, your use of the Internet Archive must
comply with their [terms of service](http://www.archive.org/about/terms.php).

A nice added feature is that whenever git-annex sends a file to the
Internet Archive, it records the file's URL, the same as if you'd run
`git annex addurl`. So any users who can clone your repository can download
the files from archive.org, without needing any login or password info.
This makes the Internet Archive a nice way to publish the large files
associated with a public git repository.

## webapp setup

Just go to "Add Another Repository", pick "Internet Archive",
and you're on your way.

## basic setup

Sign up for an account, and get your access keys here:
<http://www.archive.org/account/s3.php>

    # export AWS_ACCESS_KEY_ID=blahblah
    # export AWS_SECRET_ACCESS_KEY=xxxxxxx

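Since the remote is set up after exporting the keys, it can help to confirm
that both variables are actually present first. The following is only a
minimal sketch: `check_ia_credentials` is a hypothetical helper name, not
part of git-annex, and it checks only that the variables are non-empty, not
that the keys are valid.

```shell
# Hypothetical helper (not part of git-annex): succeed only if both
# credential variables are set and non-empty in the current shell.
check_ia_credentials() {
    missing=0
    for name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
        # Indirect lookup of the variable whose name is in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Run `check_ia_credentials` before setting up the remote; a non-zero exit
status means at least one key is missing. Note that the variables also need
to be exported, as shown above, for git-annex to see them.
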
Specify `host=s3.us.archive.org` when doing `initremote` to set up
a remote at the Archive. This will enable a special Internet Archive mode:
encryption is not allowed; you are required to specify a bucket name
rather than having git-annex pick a random one; and you can optionally
specify `x-archive-meta*` headers to add metadata as explained in their
[documentation](http://www.archive.org/help/abouts3.txt).

    # git annex initremote archive-panama type=S3 \
        host=s3.us.archive.org bucket=panama-canal-lock-blueprints \
        x-archive-meta-mediatype=texts x-archive-meta-language=eng \
        x-archive-meta-title="original Panama Canal lock design blueprints"
    initremote archive-panama (Internet Archive mode) ok
    # git annex describe archive-panama "a man, a plan, a canal: panama"
    describe archive-panama ok

Then you can annex files and copy them to the remote as usual:

    # git annex add photo1.jpeg --backend=SHA256E
    add photo1.jpeg (checksum...) ok
    # git annex copy photo1.jpeg --fast --to archive-panama
    copy (to archive-panama...) ok

Once a file has been stored on archive.org, it cannot be (easily) removed
from it. Also, `git annex whereis` will tell you a public URL for the file
on archive.org. (It may take a while for archive.org to make the file
publicly visible.)

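To give a rough idea of what such a public URL looks like, here is a small
sketch. It assumes archive.org's usual
`https://archive.org/download/<item>/<filename>` layout, with the bucket
name as the item and the git-annex key as the filename; `ia_download_url`
is a hypothetical helper, and the URL that `git annex whereis` prints is
the authoritative one.

```shell
# Hypothetical helper: build a download URL from an item (bucket) name
# and the filename stored in it. The URL layout is an assumption.
ia_download_url() {
    item=$1
    filename=$2
    printf 'https://archive.org/download/%s/%s\n' "$item" "$filename"
}
```

For example, `ia_download_url panama-canal-lock-blueprints "$key"` prints
the expected location of the file stored under the key in `$key`.
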
Note the use of the SHA256E [[backend|backends]] when adding files. That is
the default backend used by git-annex, but even if you don't normally use
it, it makes most sense to use the WORM or SHA256E backend for files that
will be stored in the Internet Archive, since the key name will be exposed
as the filename there, and since the Archive does special processing of
files based on their extension.

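To see why the extension survives in the filename, the sketch below
approximates how a SHA256E key is put together: the backend name, the file
size, the SHA-256 digest, and then the file's extension. This is a
simplification under stated assumptions: it keeps only the last extension
and does not reproduce git-annex's own rules for which extensions are
preserved. `annex_key_sketch` is a hypothetical name, and the code assumes
`sha256sum` from GNU coreutils is available.

```shell
# Sketch only: approximate the SHA256E key for a file.
# Assumed layout: SHA256E-s<size in bytes>--<sha256 hex digest><.extension>
annex_key_sketch() {
    file=$1
    # File size in bytes (tr strips padding that some wc versions add).
    size=$(wc -c < "$file" | tr -d ' ')
    # Hex digest is the first field of sha256sum's output.
    hash=$(sha256sum "$file" | cut -d' ' -f1)
    # Keep only the last extension of the base name, if there is one.
    base=${file##*/}
    case $base in
        *.*) ext=.${base##*.} ;;
        *)   ext= ;;
    esac
    printf 'SHA256E-s%s--%s%s\n' "$size" "$hash" "$ext"
}
```

For a file already in the annex, `git annex lookupkey photo1.jpeg` prints
the real key, which is what will appear as the filename at the Archive.
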
## publishing only one subdirectory

Perhaps you have a repository with lots of files in it, and only want
to publish some of them to a particular Internet Archive item. Of course
you can specify which files to send manually, but it's useful to
configure [[preferred_content]] settings so git-annex knows what content
you want to store in the Internet Archive.

One way to do this is using the "public" repository type.

    git annex enableremote archive-panama preferreddir=panama
    git annex wanted archive-panama standard
    git annex group archive-panama public

Now anything in a "panama" directory will be sent to that remote,
and anything else won't. You can use `git annex copy --auto` or the
assistant and it'll do the right thing.

When setting up an Internet Archive item using the webapp, this
configuration is done automatically, using an item name that the user
enters as the name of the subdirectory.