242 lines
10 KiB
Markdown
242 lines
10 KiB
Markdown
This is a fairly detailed design proposal for using git-annex to build
|
|
<http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK>
|
|
|
|
[[!toc ]]
|
|
|
|
## end-user view
|
|
|
|
What the end user sees is a directory, with a .git subdirectory,
|
|
and 100 thousand little files (actually, they're broken symlinks, on
|
|
Linux/OSX). Over time, some of the symlinks start filling in with
|
|
"random" content from the IA.
|
|
|
|
The user can look at that content, or even delete files they don't want to
|
|
host.
|
|
|
|
The user can control how much total disk space the directory takes up.
|
|
(It will use around 100 mb when empty.)
|
|
|
|
## sharding to scale
|
|
|
|
The IA contains some 14 million Items. Inside these Items are 271 million
|
|
files. Around 177 million of those are available for download.
|
|
|
|
git repositories do not scale well in the 1-10 million file
|
|
range, and very badly above that. Storing all that in a git repository
|
|
would strain git's scalability badly.
|
|
|
|
Solution: Create multiple git repositories, and split the files
|
|
amoung them.
|
|
|
|
* If each git repository holds 100 thousand files, that is 1770
|
|
repositories, which is not an unmanagable number.
|
|
(For comparison, git.debian.org has 18500 repositories.)
|
|
|
|
* The IA is ~20 Petabytes large. Each shard would thus be around 1
|
|
terabyte in size, although this will vary considerably.
|
|
|
|
* Clients are assigned one or more shards, and clone those repositories.
|
|
|
|
* A client decides which files in its shard to back up, and does
|
|
so by running "git annex get" on them. This downloads the files
|
|
over http from the IA.
|
|
|
|
* A client will typically not back up its entire shard, but maybe
|
|
only 500 gb or less of it. Also, we want redundancy (LOCKSS)
|
|
-- say at least 3 copies of each file. So, a given shard will probably
|
|
have between 3 and 9 clients handling it.
|
|
|
|
* Add new shards as the IA continues to grow.
|
|
|
|
Problem: Need to get the checksums for the files, for git-annex
|
|
to use. The census published by the IA only has md5sums in it. While
|
|
git-annex can use md5sums, this allows bad actors to find md5 collisions
|
|
with files from the archive, and upload bogus files that checksum ok
|
|
when restoring.
|
|
|
|
## creating a shard
|
|
|
|
This is a simple matter of making a git repository and telling git-annex
|
|
the filenames and urls that belong in it.
|
|
|
|
A script can do this using the `git annex fromkey` and `git annex
|
|
registerurl` commands. Time to make such a repository with 100k files
|
|
is in the 10 minute range (faster on SSD or ramdisk).
|
|
|
|
## adding a client
|
|
|
|
When a client registers to participate:
|
|
|
|
1. Generate a UUID, which is assigned to this client, and send it to the
|
|
client, and assign that UUID to a particular shard.
|
|
2. Send the client an appropriate auth token (eg, a locked down ssh private
|
|
key) to let them access the shard's git repository (or all the shards).
|
|
3. Client clones its assigned shard git repository,
|
|
runs `git annex init reinit $UUID`.
|
|
|
|
Note that a client could be assigned to multiple shards, rather than just
|
|
one. Probably good to keep a pool of empty shards that have clients waiting
|
|
for new files to be added.
|
|
|
|
Note that we may want to enable direct mode in the client's clone,
|
|
because it lets the user easily delete files to free up space.
|
|
OTOH, direct mode is slow and less safe, so we might prefer to use indirect
|
|
mode, and then the client would need to use `git annex drop` if they
|
|
decided to remove content.
|
|
|
|
## distributing files
|
|
|
|
1. Client runs `git annex sync --content`, which downloads as many
|
|
files from the IA as will fit in their disk's free space
|
|
(leaving some configurable amount free in reserve by configuring
|
|
annex.diskreserve)
|
|
2. Note that [[numcopies|copies]] and [[preferred_content]] settings can be
|
|
used to make clients only want to download an file if it's not yet
|
|
reached the desired number of copies. Lots of flexibility here in
|
|
git-annex.
|
|
3. git-annex will push back to the server an updated git-annex branch,
|
|
which will record when it has successfully stored an file.
|
|
|
|
## bad actors
|
|
|
|
Clients can misbehave in probably many ways. The best defense for many
|
|
misbehaviors is to distribute files to enough different clients that we can
|
|
trust some of them.
|
|
|
|
The main git-annex specific misbehavior is that a client could try to push
|
|
garbage information back to the origin repository on the server.
|
|
|
|
To guard against this, the server will reject all pushes of branches other
|
|
than the git-annex branch, which is the only one clients need to modify.
|
|
|
|
Check pushes of the git-annex branch. There are only a few files that
|
|
clients can legitimately modify, and the modifications will always involve
|
|
that client's UUID, not some other client's UUID. Reject anything shady.
|
|
|
|
These checks can be done in a git `update` hook. Rough estimate is that
|
|
such a hook would be a couple hundred lines of code.
|
|
|
|
## verification
|
|
|
|
We want a lightweight verification process, to verify that a client still
|
|
has the data. This can be done using `git annex fsck`, which can be
|
|
configured to eg, check each file only once per month.
|
|
|
|
git-annex will need a modification here. Currently, a successful fsck
|
|
does not leave any trace in the git-annex branch that it happened. But
|
|
we want the server to track when a client is not fscking (the user probably
|
|
dropped out).
|
|
|
|
The modification is simple; just have a successful fsck
|
|
update the timestamp in the fscked file's location log.
|
|
It will probably take just a few hours to code.
|
|
|
|
With that change, the server can check for files that not enough clients
|
|
have verified they have recently, and distribute them to more clients.
|
|
|
|
(This is now implemented.)
|
|
|
|
Note that bad actors can lie about this verification; it's not a proof they
|
|
still have the file. But, a bad actor could prove they have a file, and
|
|
refuse to give it back if the IA needed to restore the backup, too.
|
|
|
|
## fire drill
|
|
|
|
If we really want to test how well the system is working, we need a fire
|
|
drill.
|
|
|
|
1. Pick some files that we'll assume the IA has lost in some disaster.
|
|
2. Look up the shard the file belongs to.
|
|
3. Get the git-annex key of the file, and tell git-annex it's been
|
|
lost from the IA, by running in its shard: `setpresentkey $key $iauuid 0`
|
|
4. The next time a client runs `git annex sync --content`, it will notice
|
|
that the IA repo doesn't have the file anymore. The client will then
|
|
send the file back to the origin repo.
|
|
5. To guard against bad actors, that restored file should be checked with
|
|
`git annex fsck`. If its checksum is good, it can be re-injected back
|
|
into the IA. (Or, the fire drill was successful.)
|
|
(Remember to turn off the fire alarm by running
|
|
`setpresentkey $key $iauuid 1`)
|
|
|
|
## shard servers
|
|
|
|
A server at the IA (or otherwise with a fast pipe) is needed to serve
|
|
the shards. One server can probably manage them all.
|
|
Let's consider what this server needs to have on it:
|
|
|
|
* git and git-annex
|
|
* ssh server
|
|
* The git repository for each shard. A few hundred mb per shard.
|
|
* The git update hook to filter out bad pushes.
|
|
* Some way to learn when a new user has registered to access a shard,
|
|
so their ssh key is given access.
|
|
|
|
## other optional nice stuff
|
|
|
|
The user running a client can delete some or all of their files at any
|
|
time, to free up disk space. The next time `git-annex sync` runs on the client,
|
|
it'll notice and let the server know, and other clients will then take
|
|
over storing it. (Or if the git-annex assistant is run on the client,
|
|
it would inform the server immediately.)
|
|
|
|
The user is also free to move files around (within the git repository
|
|
directory), modify files, view them, etc. This doesn't affect anyone else.
|
|
|
|
Offline storage is supported. As long as the user can spin it up from time
|
|
to time to run `git annex fsck`.
|
|
|
|
More advanced users might have multiple repositories on different disks.
|
|
Each has their own UUID, and they could move files around between them as
|
|
desired; this would be communicated back to the origin repository
|
|
automatically.
|
|
|
|
Shards could have themes, and users could request to be part of the
|
|
shard that includes Software, or Grateful Dead, etc. This might encourage
|
|
users to devote more resources.
|
|
|
|
Or, rather than doing a lucky dip and getting one or a couple shards,
|
|
a user could clone em all, and pick just which files to get.
|
|
|
|
The contents of files sometimes changes.
|
|
This can be reflected by updating the file in the git repository.
|
|
Clients will then download the new version of the file. (They will also
|
|
tend to retain the old version, although this can be dealt with by using
|
|
`git annex unused`).
|
|
|
|
Items sometimes go dark; this could be reflected by deleting the Item's
|
|
files from the repository. It's up to the clients what they do with the
|
|
content of such Items.
|
|
|
|
Client's repos could be put into groups to classify them. For example,
|
|
there could be groups per continent, or for trust levels, or whatever.
|
|
These can be used by [[preferred_content]] expressions to fine tune how
|
|
files are spread out amoung the available clients.
|
|
|
|
## other potential gotchas
|
|
|
|
If any single file is very large (eg, 10 terabytes), there may not be
|
|
any clients that can handle it. This could be dealt with by splitting up
|
|
the file into smaller chunks. Word is there is a single 2 tb item, and a few
|
|
more around 100 gb, so this is probably not a concern.
|
|
|
|
A client could add other files to its local repo, and git-annex branch
|
|
pushes would include junk data about those files. It should probably be
|
|
filtered out by the git update hook (rejecting the whole push because of
|
|
this seems excessive).
|
|
|
|
There may be a thundering herd problem, where many clients end up
|
|
downloading the same file at the same time, and more copies than neecessary
|
|
result. The next `git annex sync --content` in some of the
|
|
redundant clients will notice this and drop that file, and presumably
|
|
download some other file. It would be good to avoid this problem,
|
|
perhaps by having a new client initially download a random set of the
|
|
files in their shard that don't yet have enough copies.
|
|
|
|
With clients all fscking their part of a shard once a month,
|
|
that will increase the size of the git repository, with new distributed
|
|
fsck updates. I have run some test and this fsck overhead delta compresses
|
|
well. With a 10 thousand file repo and 100 clients all updating the
|
|
location log, the monthly fsck only added 1 mb to the repository size
|
|
(after `git gc --aggressive`). Should scale linearly with number of files
|
|
in repo. Note that `git annex forget` could be used to forget old
|
|
historical data if the repo grew too large from fsck updates.
|