parent
b7dfa38e1e
commit
69205d6b06
1 changed files with 9 additions and 0 deletions
9
doc/forum/Content-sensitive_chunking__63__.mdwn
Normal file
@@ -0,0 +1,9 @@
Bup, ddar, and borg all use content-sensitive chunking, but git and git-annex do not. Git uses delta compression, with heuristics for finding similar files. In git-annex, deduplication beyond storing a single copy of identical content is possible only in special remotes, because git-annex stores and tracks each file's content as a whole. Anyway, I'm posting this here not because I think git-annex should directly support content-sensitive chunking, but because it's an idea I can't get out of my head, and I think there are likely to be people in this group who are interested in it.
With content-sensitive chunking, there's always a trade-off between chunk size and deduplication. Bigger chunks mean less deduplication (more duplicate data gets stored), while smaller chunks mean more overhead for storing the chunks themselves. Even if chunks are concatenated into larger files to minimize filesystem overhead, you still have to store a collision-resistant hash of each chunk. My idea is this: because you have the chunks available, you can always recompute a chunk's hash if you need to. So instead of indexing on the full hash, index on only part of it, and identify the chunks themselves using base-128 integer IDs rather than hashes.
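
To make the ID part concrete, here is a minimal sketch. It assumes "base-128 integer IDs" means variable-length integers with 7 payload bits per byte and a continuation bit, as in LEB128 or Protocol Buffers varints. That reading, and the names `encode_id` and `decode_ids`, are my assumptions and not something the post specifies.

```python
def encode_id(n: int) -> bytes:
    """Encode a non-negative chunk ID as a little-endian base-128 varint."""
    if n < 0:
        raise ValueError("chunk IDs must be non-negative")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte of this ID
            return bytes(out)


def decode_ids(data: bytes) -> list:
    """Decode a concatenated run of varints back into a list of chunk IDs."""
    ids, value, shift = [], 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            ids.append(value)
            value, shift = 0, 0
    if shift:
        raise ValueError("truncated varint at end of input")
    return ids
```

Under this assumption, a file could be recorded as the concatenation of its chunks' encoded IDs, so chunks with low IDs cost one byte per use and the cost grows by one byte for every 7 bits of ID.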
Given that you only ever look up a chunk's hash in order to decide whether you need to store a new chunk, one possible approach is to use a trie that holds only enough bytes of each hash to make it unique within a given repository. Then, any time there's a false collision (a new chunk matches a stored prefix but its full hash differs), add enough additional bytes to make the stored prefixes unique in the presence of the new chunk. Use a cache to avoid repeatedly rehashing commonly matched chunks.
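
A rough sketch of that lookup, under assumptions of my own: SHA-256 as the collision-resistant hash, an in-memory list standing in for chunk storage, a nested-dict trie with one hash byte per level, and `functools.lru_cache` as the cache. The class `ChunkStore` and its methods are hypothetical names for illustration only.

```python
import hashlib
from functools import lru_cache


class ChunkStore:
    """Toy in-memory store; every name here is hypothetical."""

    def __init__(self):
        self.chunks = []  # chunk ID -> chunk bytes (stand-in for real storage)
        self.root = {}    # trie node: hash byte -> child dict, or int chunk ID at a leaf
        # Cache of recomputed hashes, so hot chunks are not rehashed on every match.
        self._hash_of = lru_cache(maxsize=4096)(self._rehash)

    def _rehash(self, chunk_id):
        return hashlib.sha256(self.chunks[chunk_id]).digest()

    def _store(self, data):
        self.chunks.append(data)
        return len(self.chunks) - 1

    def add(self, data):
        """Return the ID of the chunk equal to data, storing it if it is new."""
        digest = hashlib.sha256(data).digest()
        node = self.root
        for depth, byte in enumerate(digest):
            child = node.get(byte)
            if child is None:
                # No stored chunk shares this prefix, so the chunk is new.
                node[byte] = self._store(data)
                return node[byte]
            if isinstance(child, dict):
                node = child  # prefix still ambiguous: descend one more byte
                continue
            # Leaf: the prefix matches one stored chunk. Recompute its full hash.
            other = self._hash_of(child)
            if other == digest:
                return child  # true duplicate, nothing new to store
            # False collision: extend both prefixes until they differ.
            new_id = self._store(data)
            branch = node[byte] = {}
            d = depth + 1
            while other[d] == digest[d]:
                branch[digest[d]] = {}
                branch = branch[digest[d]]
                d += 1
            branch[other[d]] = child
            branch[digest[d]] = new_id
            return new_id
        raise AssertionError("unreachable: distinct SHA-256 digests must diverge")
```

Calling `store.add(chunk)` returns an existing ID for a duplicate and a fresh one otherwise. Note that the full hash of a stored chunk is only recomputed when a new chunk reaches its leaf, which is the case the cache is meant to keep cheap.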
One problem with this technique is that the per-chunk overhead increases with the number of chunks, but the growth is logarithmic, so it is slow, and for many types of content the savings from increased deduplication should grow similarly. For a trie, the average number of hash bytes stored per chunk would be about the base-256 log of the number of chunks: 2 bytes for up to 65,536 chunks, 3 for up to 16.8 million, 4 for up to about 4.3 billion, etc. The space needed to store a chunk ID is roughly the same, but you store the ID once for each time the chunk is used.
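
The arithmetic can be checked with a few lines. This sketch only evaluates the base-256 log estimate from the paragraph above and, assuming IDs are encoded with 7 payload bits per byte, the length of an ID. The function names are made up for illustration.

```python
import math


def avg_hash_bytes(num_chunks: int) -> float:
    """Base-256 log of the chunk count: the estimated hash bytes per chunk."""
    return math.log2(num_chunks) / 8


def varint_len(chunk_id: int) -> int:
    """Bytes needed for chunk_id at 7 payload bits per byte (assumed encoding)."""
    return max(1, math.ceil(chunk_id.bit_length() / 7))


for n in (256 ** 2, 256 ** 3, 256 ** 4):
    # Compare the hash-prefix estimate with the size of the largest ID in use.
    print(n, avg_hash_bytes(n), varint_len(n - 1))
```

Under the 7-bits-per-byte assumption an ID runs slightly longer than the hash prefix at the same chunk count (for example, 3 bytes rather than 2 for the last of 65,536 chunks), so "roughly the same" is the safe way to read the comparison.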
It occurs to me that I should propose this to the Datashards folks :)