Joey Hess 2015-03-03 22:30:20 -04:00
parent 550a2fcac2
commit d45ec91a6b

@@ -1,6 +1,8 @@
This is a fairly detailed design proposal for using git-annex to build
<http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK>
[[!toc ]]
## sharding to scale
The IA contains some 24 million Items.
@@ -33,6 +35,10 @@ them.
* Add new shards as the IA continues to grow.
Question: How many files are in IA across all Items? It might be better
to use $item/$file rather than $item.tar as the unit that's stored in
the git-annex repository. This would need more shards.
## the IA git repository
We're building a pyramid of git-annex repositories, and at the tip
@@ -176,6 +182,23 @@ drill.
(Remember to turn off the fire alarm by running
`setpresentkey $key $iauuid 1`)
## shard servers
A server at the IA (or otherwise with a fast pipe) is needed to serve one or
more shards. Let's consider what this server needs to have on it:
* git and git-annex
* ssh server
* rsync
* The git repository for the shard. Probably a few hundred MB?
* The git update hook to filter out bad pushes.
* Some way to get the content of a given Item from the IA
when a client wants to download it. This probably means
generating the $item.tar file and buffering it to disk for a while.
* So, enough disk to buffer a reasonable number of items.
* Some way to learn when a new user has registered to access a shard,
so their ssh key is given access.
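
One way the last point might be handled is to append a restricted `authorized_keys` entry for each newly registered user. The following is only a sketch under stated assumptions: the function name `authorized_keys_line`, the example key, and the path `/srv/shard1` are hypothetical, and how the server learns about a registration is left open. It relies on the `command=` option of OpenSSH's `authorized_keys` format and on `git-annex-shell` honouring the `GIT_ANNEX_SHELL_DIRECTORY` environment variable to confine a key to one repository.

```python
import shlex


def authorized_keys_line(public_key, repo_dir):
    """Build one authorized_keys entry that confines a key to a single shard.

    public_key: the user's ssh public key, as a single line.
    repo_dir:   path of the shard's git-annex repository (hypothetical layout).
    """
    public_key = public_key.strip()
    if "\n" in public_key or "\r" in public_key:
        raise ValueError("public key must be a single line")
    # Keep the path simple so that it cannot break out of the quoted command.
    if any(ch in repo_dir for ch in '"\\\n\r'):
        raise ValueError("unsupported character in repository path")

    # Force every login with this key through git-annex-shell, limited to
    # the shard's directory. Double quotes are escaped for authorized_keys.
    command = (
        "env GIT_ANNEX_SHELL_DIRECTORY=" + shlex.quote(repo_dir)
        + ' git-annex-shell -c \\"$SSH_ORIGINAL_COMMAND\\"'
    )
    options = [
        'command="' + command + '"',
        "no-agent-forwarding",
        "no-port-forwarding",
        "no-X11-forwarding",
        "no-pty",
    ]
    return ",".join(options) + " " + public_key


if __name__ == "__main__":
    # Example key material is a placeholder, not a real key.
    print(authorized_keys_line("ssh-ed25519 AAAAC3Nza user@example", "/srv/shard1"))
```

The resulting line would be appended to the `authorized_keys` file of the account that serves the shard. Whether one account per shard or one per server is the better layout is not decided here.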
## other optional nice stuff
The user running a client can delete some or all of their files at any
@@ -226,8 +249,8 @@ this seems excessive).
There may be a thundering herd problem, where many clients end up
downloading the same Item at the same time, and more copies than necessary
result. The next `git annex sync --content` in some of the
redundant clients will notice this and drop that Item, and presumably
download some other Item. However, it might be good to rate limit the
number of concurrent downloads of a given item, to prevent this and perhaps
other issues. This could be done by a wrapper around git-annex shell or
perhaps a git-annex modification.
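
As a rough illustration of the wrapper idea, the per-item limit could be built from a fixed number of lock files per Item, each held with a non-blocking advisory lock for the duration of one download. This is a sketch, not a git-annex feature: `try_acquire_slot`, `release_slot`, the lock directory, and the limit are all hypothetical names and values, it works only on Unix-like systems, and it leaves out how a wrapper would work out which Item a request is for.

```python
import fcntl
import hashlib
import os


def try_acquire_slot(lock_dir, item, max_concurrent):
    """Try to claim one of max_concurrent download slots for item.

    Returns an open file object holding the lock, or None if every slot
    is already taken by another download of the same item.
    """
    os.makedirs(lock_dir, exist_ok=True)
    # Hash the item name so that any identifier maps to a safe file name.
    digest = hashlib.sha256(item.encode("utf-8")).hexdigest()
    for slot in range(max_concurrent):
        path = os.path.join(lock_dir, f"{digest}.{slot}.lock")
        handle = open(path, "w")
        try:
            # Non-blocking exclusive lock: fails at once if the slot is busy.
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            handle.close()
            continue
        return handle
    return None


def release_slot(handle):
    """Give the slot back; closing the file drops the advisory lock."""
    handle.close()
```

A wrapper would call `try_acquire_slot` before starting a transfer, refuse or delay the request when it returns `None`, and call `release_slot` once the transfer ends. Because the lock is tied to the open file, a crashed transfer frees its slot automatically when the process exits.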