document directory hashes

Joey Hess 2013-03-31 20:13:49 -04:00
parent f968a40e04
commit e0f3d1a3ba
2 changed files with 38 additions and 1 deletions


@@ -10,6 +10,7 @@ to the file content.
First there are two levels of directories used for hashing, to prevent
too many things ending up in any one directory.
See [[hashing]] for details.
Each subdirectory has the [[name_of_a_key|key_format]] in one of the
[[key-value_backends|backends]]. The file inside also has the name of the key.
@@ -107,7 +108,9 @@ somewhere else.
These log files record [[location_tracking]] information
for file contents. Again these are placed in two levels of subdirectories
-for hashing. The name of the key is the filename, and the content
+for hashing. See [[hashing]] for details.
+The name of the key is the filename, and the content
consists of a timestamp, either 1 (present) or 0 (not present), and
the UUID of the repository that has or lacks the file content.
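As a rough illustration of the log line layout described above, here is a minimal Python sketch that splits one line into its three fields. It assumes the fields are separated by whitespace and appear in the order given (timestamp, presence flag, UUID); the names `LocationLogLine` and `parse_location_log_line` are hypothetical and not part of git-annex, and the timestamp is kept as an opaque string because its exact format is not specified here.

```python
from typing import NamedTuple


class LocationLogLine(NamedTuple):
    timestamp: str   # kept verbatim; the exact timestamp format is not assumed
    present: bool    # True for "1" (present), False for "0" (not present)
    uuid: str        # UUID of the repository that has or lacks the content


def parse_location_log_line(line: str) -> LocationLogLine:
    """Parse one log line, assuming whitespace-separated fields."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("expected timestamp, status and UUID: %r" % line)
    timestamp, status, uuid = fields[0], fields[1], fields[2]
    if status not in ("0", "1"):
        raise ValueError("status must be 0 or 1: %r" % status)
    return LocationLogLine(timestamp, status == "1", uuid)
```

Calling `parse_location_log_line(line).present` then tells whether the repository named in that line is recorded as having the file content.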


@@ -0,0 +1,34 @@
In both the .git/annex directory and the git-annex branch, two levels of
hash directories are used, to avoid issues with too many files in one
directory.
Two separate hash methods are used. One, the old hash format, is only used
for non-bare git repositories. The other, the new hash format, is used for
bare git repositories, the git-annex branch, and on special remotes as
well.
## new hash format
This uses two directories, each with a three-letter name, such as "f87/4d5".
The directory names come from the md5sum of the [[key|key_format]].
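Note that the `md5sum` utility from coreutils only generates the same hash
when it is fed exactly the bytes of the key. Piping the key through `echo`
also hashes the trailing newline and so gives a different result; use
`echo -n` or `printf %s` instead. The md5 hash libraries for programming
languages will work too.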
For example:

    python3 -c 'import hashlib, sys; print(hashlib.md5(sys.argv[1].encode()).hexdigest())' "$key"
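The directory pair itself can be sketched in a few lines of Python. This is a minimal sketch, assuming the two three-letter names are simply the first six hex digits of the key's md5sum split into two groups; the text above only says the names "come from" the md5sum, so treat that split as an assumption. The function name `new_hash_dirs` is made up for illustration.

```python
import hashlib


def new_hash_dirs(key: str) -> str:
    """Return a two-level directory prefix such as "f87/4d5" for a key.

    Assumption: the names are hex digits 0-2 and 3-5 of md5(key).
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    # First three hex digits name the outer directory, the next three the inner.
    return digest[:3] + "/" + digest[3:6]
```

For instance, `new_hash_dirs("SHA256-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855")` returns the prefix for the key of an empty file under the SHA256 backend.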
## old hash format
This uses two directories, each with a two-letter name, such as "pX/1J".
It takes the md5sum of the key, but rather than a string, represents it as
four 32-bit words. Only the first word is used. It is converted into a string
by the same mechanism that would be used to encode a normal md5sum value into
a string, but where that would normally encode the bits using the 16
characters 0-9a-f, this instead uses the 32 characters
"0123456789zqjxkmvwgpfZQJXKMVWGPF".
The first two letters of the resulting string are the first directory, and
the next two are the second directory.
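The description above leaves several details open, so the following Python sketch is only one plausible reading and should be checked against a real non-bare repository before being relied on. It assumes that the first word is the first four digest bytes read as a little-endian integer, that each character is taken from the low five bits after shifting by a multiple of six bits, and that characters are swapped in pairs the way a hex encoder emits the high nibble first. None of these three points is stated in the text, and the name `old_hash_dirs` is hypothetical.

```python
import hashlib
import struct

# The 32-character alphabet given in the text.
OLD_HASH_CHARS = "0123456789zqjxkmvwgpfZQJXKMVWGPF"


def old_hash_dirs(key: str) -> str:
    """Return a two-level directory prefix such as "pX/1J" for a key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Assumption: the "first word" is the first 4 bytes, little-endian.
    (word,) = struct.unpack("<I", digest[:4])
    # Assumption: character i uses the low 5 bits of (word >> 6*i).
    letters = [OLD_HASH_CHARS[(word >> (6 * i)) & 31] for i in range(4)]
    # Assumption: characters are swapped pairwise, as in hex encoding.
    return letters[1] + letters[0] + "/" + letters[3] + letters[2]
```

Only four characters are needed for the two directory names, so the sketch stops after the first four rather than encoding the whole word.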