use SHA256 by default

To get old behavior, add a .gitattributes containing: * annex.backend=WORM

I feel that SHA256 is a better default for most people, as long as their
systems are fast enough that checksumming their files isn't a problem.
git-annex should default to preserving the integrity of data as well as git
does. Checksum backends also work better with editing files via
unlock/lock.

I considered just using SHA1, but since that hash is believed to be
close to being broken, and git-annex deals with large files, which would be
a perfect exploit medium, I decided to go with a SHA-2 hash.

SHA512 is annoyingly long when displayed, and git-annex displays it in a
few places (and notably it is shown in ls -l), so I picked the shorter
hash. I considered SHA224 as it's even shorter, but it feels a bit weird.

I expect git-annex will use SHA-3 at some point in the future, but
probably not soon!

Note that systems without a sha256sum (or sha256) program will fall back to
SHA1 as the default.
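
The fallback can be pictured with a minimal sketch. This is not the actual Backend.SHA code: `genSize` and `defaultSize` are hypothetical names, and availability is assumed to be a plain list lookup rather than the build-time detection done via Build.SysConfig. It only illustrates why putting 256 ahead of 1 in `sizes` makes SHA256 the default when it is usable, and SHA1 otherwise.

```haskell
import Data.Maybe (catMaybes, listToMaybe)

type SHASize = Int

-- Preference order from this commit: SHA256 first, then SHA1.
sizes :: [SHASize]
sizes = [256, 1, 512, 224, 384]

-- Hypothetical stand-in for genBackend: a size yields a backend only
-- if a matching checksum program is assumed to be available.
genSize :: [SHASize] -> SHASize -> Maybe SHASize
genSize available s
  | s `elem` available = Just s
  | otherwise = Nothing

-- The default is the first size, in preference order, that is usable.
defaultSize :: [SHASize] -> Maybe SHASize
defaultSize available =
  listToMaybe (catMaybes (map (genSize available) sizes))

main :: IO ()
main = do
  print (defaultSize [1, 256, 512]) -- sha256sum present: Just 256
  print (defaultSize [1])           -- only sha1sum present: Just 1
  print (defaultSize [])            -- no SHA programs: Nothing
```

In the real backend list, an empty SHA list would mean the next entry (WORM) comes first.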
Joey Hess 2011-11-04 15:21:45 -04:00
parent 1089e85d48
commit ef3457196a
8 changed files with 37 additions and 30 deletions


@ -26,12 +26,12 @@ import Types.Key
import qualified Types.Backend as B
-- When adding a new backend, import it here and add it to the list.
import qualified Backend.WORM
import qualified Backend.SHA
import qualified Backend.WORM
import qualified Backend.URL
list :: [Backend Annex]
list = Backend.WORM.backends ++ Backend.SHA.backends ++ Backend.URL.backends
list = Backend.SHA.backends ++ Backend.WORM.backends ++ Backend.URL.backends
{- List of backends in the order to try them when storing a new key. -}
orderedList :: Annex [Backend Annex]


@ -16,12 +16,12 @@ import qualified Build.SysConfig as SysConfig
type SHASize = Int
-- order is slightly significant; want SHA256 first, and more general
-- sizes earlier
sizes :: [Int]
sizes = [1, 256, 512, 224, 384]
sizes = [256, 1, 512, 224, 384]
backends :: [Backend Annex]
-- order is slightly significant; want sha1 first, and more general
-- sizes earlier
backends = catMaybes $ map genBackend sizes ++ map genBackendE sizes
genBackend :: SHASize -> Maybe (Backend Annex)

debian/changelog

@ -1,5 +1,8 @@
git-annex (3.20111026) UNRELEASED; urgency=low
* The default backend used when adding files to the annex is changed
from WORM to SHA256.
To get old behavior, add a .gitattributes containing: * annex.backend=WORM
* Sped up some operations on remotes that are on the same host.
* copy --to: Fixed leak when copying many files to a remote on the same
host.


@ -5,17 +5,19 @@ to retrieve the file's content (its value).
Multiple pluggable key-value backends are supported, and a single repository
can use different ones for different files.
* `WORM` ("Write Once, Read Many") This assumes that any file with
the same basename, size, and modification time has the same content.
This is the default, and the least expensive backend.
* `SHA1` -- This uses a key based on a sha1 checksum. This allows
* `SHA256` -- The default backend for new files. This allows
verifying that the file content is right, and can avoid duplicates of
files with the same content. Its need to generate checksums
can make it slower for large files.
* `SHA512`, `SHA384`, `SHA256`, `SHA224` -- Like SHA1, but larger
checksums. Mostly useful for the very paranoid, or anyone who is
researching checksum collisions and wants to annex their colliding data. ;)
* `SHA1E`, `SHA512E`, etc -- Variants that preserve filename extension as
can make it slower for large files.
* `WORM` ("Write Once, Read Many") This assumes that any file with
the same basename, size, and modification time has the same content.
This is the least expensive backend, recommended for really large
files or slow systems.
* `SHA512` -- Best currently available hash, for the very paranoid.
* `SHA1` -- Smaller hash than `SHA256` for those who want a checksum
but are not concerned about security.
* `SHA384`, `SHA224` -- Hashes for people who like unusual sizes.
* `SHA256E`, `SHA1E`, etc -- Variants that preserve filename extension as
part of the key. Useful for archival tasks where the filename extension
contains metadata that should be preserved.
@ -27,9 +29,11 @@ For finer control of what backend is used when adding different types of
files, the `.gitattributes` file can be used. The `annex.backend`
attribute can be set to the name of the backend to use for matching files.
For example, to use the SHA1 backend for sound files, which tend to be
smallish and might be modified or copied over time, you could set in
`.gitattributes`:
For example, to use the SHA256 backend for sound files, which tend to be
smallish and might be modified or copied over time,
while using the WORM backend for everything else, you could set
in `.gitattributes`:
*.mp3 annex.backend=SHA1
*.ogg annex.backend=SHA1
* annex.backend=WORM
*.mp3 annex.backend=SHA256
*.ogg annex.backend=SHA256


@ -2,8 +2,8 @@
# cp /tmp/big_file .
# cp /tmp/debian.iso .
# git annex add .
add big_file ok
add debian.iso ok
add big_file (checksum...) ok
add debian.iso (checksum...) ok
# git commit -a -m added
When you add a file to the annex and commit it, only a symlink to


@ -9,5 +9,5 @@ makes it very easy.
move my_cool_big_file (to usbdrive...) ok
# git annex move video/hackity_hack_and_kaxxt.mov --from fileserver
move video/hackity_hack_and_kaxxt.mov (from fileserver...)
WORM-s86050597-m1274316523--hackity_hack_and_kax 100% 82MB 199.1KB/s 07:02
SHA256-s86050597--6ae2688bc533437766a48aa19f2c06be14d1bab9c70b468af445d4f07b65f41e 100% 82MB 199.1KB/s 07:02
ok


@ -1,8 +1,8 @@
It's possible for data to accumulate in the annex that no files point to
anymore. One way it can happen is if you `git rm` a file without
first calling `git annex drop`. And, when you modify an annexed file, the old
content of the file remains in the annex. Another way is when migrating
between key-value [[backends|backend]].
It's possible for data to accumulate in the annex that no files in any
branch point to anymore. One way it can happen is if you `git rm` a file
without first calling `git annex drop`. And, when you modify an annexed
file, the old content of the file remains in the annex. Another way is when
migrating between key-value [[backends|backend]].
This might be historical data you want to preserve, so git-annex defaults to
preserving it. So from time to time, you may want to check for such data and
@ -12,8 +12,8 @@ eliminate it to save space.
unused . (checking for unused data...)
Some annexed data is no longer used by any files in the repository.
NUMBER KEY
1 WORM-s3-m1289672605--file
2 WORM-s14-m1289672605--file
1 SHA256-s86050597--6ae2688bc533437766a48aa19f2c06be14d1bab9c70b468af445d4f07b65f41e
2 SHA1-s14--f1358ec1873d57350e3dc62054dc232bc93c2bd1
(To see where data was previously used, try: git log --stat -S'KEY')
(To remove unwanted data: git-annex dropunused NUMBER)
ok
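
The key names in the listing above follow a visible pattern: backend name, `s` plus the size, an optional `m` plus the modification time (WORM only), then `--` and either the checksum or the file name. A minimal sketch of that layout, inferred only from the example keys shown here; the `Key` record and `render` function are hypothetical and are not git-annex's Types.Key:

```haskell
import Data.List (intercalate)

-- Hypothetical key description; field names are illustrative only.
data Key = Key
  { backendName :: String
  , keySize     :: Integer
  , keyMtime    :: Maybe Integer -- present in WORM keys only
  , keyName     :: String        -- checksum or file name
  }

-- Join the metadata fields with "-" and append "--" plus the name.
render :: Key -> String
render k = intercalate "-" fields ++ "--" ++ keyName k
  where
    fields = [backendName k, 's' : show (keySize k)]
          ++ maybe [] (\m -> ['m' : show m]) (keyMtime k)

main :: IO ()
main = do
  -- Old-style WORM key, as in the removed lines of this hunk.
  putStrLn (render (Key "WORM" 3 (Just 1289672605) "file"))
  -- Checksum key, as in the added lines of this hunk.
  putStrLn (render (Key "SHA1" 14 Nothing
                        "f1358ec1873d57350e3dc62054dc232bc93c2bd1"))
```

This shows why the new keys are longer when displayed: the name part is a full hex digest rather than a file name.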


@ -13,7 +13,7 @@ Now you can get files and they will be transferred (using `rsync` via `ssh`):
# git annex get my_cool_big_file
get my_cool_big_file (getting UUID for origin...) (from origin...)
WORM-s2159-m1285650548--my_cool_big_file 100% 2159 2.1KB/s 00:00
SHA256-s86050597--6ae2688bc533437766a48aa19f2c06be14d1bab9c70b468af445d4f07b65f41e 100% 2159 2.1KB/s 00:00
ok
When you drop files, git-annex will ssh over to the remote and make