Joey Hess 2015-03-06 16:24:01 -04:00
parent 3439ea4bdc
commit a383d880da

@@ -17,14 +17,15 @@ The user can control how much total disk space the directory takes up.
 
 ## sharding to scale
 
-The IA contains some 24 million Items.
+The IA contains some 14 million Items. Inside these Items are 271 million
+files.
 
 git repositories do not scale well in the 1-10 million file
 range, and very badly above that. Storing individual IA Items
 would strain git's scalability badly.
 
-Solution: Create multiple git repositories, and split the Items amoung
-them.
+Solution: Create multiple git repositories, and split the Items
+amoung them. Make a tarball of each Item.
 
 * Needs a map from an Item to its repository. (Could be stored in a
 database, or whatever.)
@@ -47,9 +48,22 @@ them.
 
 * Add new shards as the IA continues to grow.
 
-Question: How many files are in IA across all Items? It might be better
-to use $item/$file rather than $item.tar as the unit that's stored in
-the git-annex repository. This would need more shards.
+Or, the files could be checked directly into the repositories, not tarred up.
+With 100 thousand files per repository, it needs 2710 repositories.
+This seems much manageable than 10 thousand files in 27100 repositories.
+
+The big advantage of not tarring up files is that the url to the file
+can be added with `git annex addurl`, and then clients can download
+the content direct from the IA http servers, rather than needing to
+connect to a ssh server to get the tarballs. This simplifies and scales
+better for seeding the downloads. (Uploads still need that ssh server
+connection.)
+
+Problem: Would still need to get the checksums for the files, for git-annex
+to use. The census published by the IA only has md5sums in it. While
+git-annex can use md5sums, this allows bad actors to find md5 collisions
+with files from the archive, and upload bogus files that checksum ok
+when restoring.
 
 ## the IA git repository
 
@@ -274,14 +288,9 @@ perhaps a git-annex modification.
 
 With clients all fscking their part of a shard once a month,
 that will increase the size of the git repository, with new distributed
-fsck updates. Basically, it grows by one line per file in the shard,
-times the amount of redundancy that's been reached. So, a 10 thousand item
-shard with redundancy 3 will grow by 30000 lines per month. Line length
-for location log is 58 bytes, so that's 1.7 mb growth per month of the git
-repo. (That's for blobs, plus additional overhead for trees and commits.)
-However, git will delta compress most of it, so it might be
-significantly smaller. If the distributed fsck timestamps are all
-the same for a client, they will delta compress along with everything else.
-This could reduce the blob growth to a few dozen bytes per client per month.
-This is something to keep an eye on, especially since shipping large git
-repo changes to clients is not desirable.
+fsck updates. I have run some test and this fsck overhead delta compresses
+well. With a 10 thousand file repo and 100 clients all updating the
+location log, the monthly fsck only added 1 mb to the repository size
+(after `git gc --aggressive`). Should scale linearly with number of files
+in repo. Note that `git annex forget` could be used to forget old
+historical data if the repo grew too large from fsck updates.