From ef3457196ace3669ddfa93039f2d3c15baf54713 Mon Sep 17 00:00:00 2001
From: Joey Hess
Date: Fri, 4 Nov 2011 15:21:45 -0400
Subject: [PATCH] use SHA256 by default

To get old behavior, add a .gitattributes containing: * annex.backend=WORM

I feel that SHA256 is a better default for most people, as long as their
systems are fast enough that checksumming their files isn't a problem.
git-annex should default to preserving the integrity of data as well as git
does. Checksum backends also work better with editing files via
unlock/lock.

I considered just using SHA1, but since that hash is believed to be somewhat
near to being broken, and git-annex deals with large files which would be a
perfect exploit medium, I decided to go to a SHA-2 hash.

SHA512 is annoyingly long when displayed, and git-annex displays it in a
few places (and notably it is shown in ls -l), so I picked the shorter
hash. Considered SHA224 as it's even shorter, but feel it's a bit weird.

I expect git-annex will use SHA-3 at some point in the future, but
probably not soon!

Note that systems without a sha256sum (or sha256) program will fall back to
defaulting to SHA1.
---
 Backend.hs                                    |  4 +--
 Backend/SHA.hs                                |  6 ++--
 debian/changelog                              |  3 ++
 doc/backends.mdwn                             | 32 +++++++++++--------
 doc/walkthrough/adding_files.mdwn             |  4 +--
 ...ing_file_content_between_repositories.mdwn |  2 +-
 doc/walkthrough/unused_data.mdwn              | 14 ++++----
 doc/walkthrough/using_ssh_remotes.mdwn        |  2 +-
 8 files changed, 37 insertions(+), 30 deletions(-)

diff --git a/Backend.hs b/Backend.hs
index a09fc0e990..9a40e54598 100644
--- a/Backend.hs
+++ b/Backend.hs
@@ -26,12 +26,12 @@ import Types.Key
 import qualified Types.Backend as B
 
 -- When adding a new backend, import it here and add it to the list.
-import qualified Backend.WORM
 import qualified Backend.SHA
+import qualified Backend.WORM
 import qualified Backend.URL
 
 list :: [Backend Annex]
-list = Backend.WORM.backends ++ Backend.SHA.backends ++ Backend.URL.backends
+list = Backend.SHA.backends ++ Backend.WORM.backends ++ Backend.URL.backends
 
 {- List of backends in the order to try them when storing a new key. -}
 orderedList :: Annex [Backend Annex]
diff --git a/Backend/SHA.hs b/Backend/SHA.hs
index 3a54a8871b..d449821172 100644
--- a/Backend/SHA.hs
+++ b/Backend/SHA.hs
@@ -16,12 +16,12 @@ import qualified Build.SysConfig as SysConfig
 
 type SHASize = Int
 
+-- order is slightly significant; want SHA256 first, and more general
+-- sizes earlier
 sizes :: [Int]
-sizes = [1, 256, 512, 224, 384]
+sizes = [256, 1, 512, 224, 384]
 
 backends :: [Backend Annex]
--- order is slightly significant; want sha1 first, and more general
--- sizes earlier
 backends = catMaybes $ map genBackend sizes ++ map genBackendE sizes
 
 genBackend :: SHASize -> Maybe (Backend Annex)
diff --git a/debian/changelog b/debian/changelog
index e59b4f4048..e74a190ba5 100644
--- a/debian/changelog
+++ b/debian/changelog
@@ -1,5 +1,8 @@
 git-annex (3.20111026) UNRELEASED; urgency=low
 
+  * The default backend used when adding files to the annex is changed
+    from WORM to SHA256.
+    To get old behavior, add a .gitattributes containing: * annex.backend=WORM
   * Sped up some operations on remotes that are on the same host.
   * copy --to: Fixed leak when copying many files to a remote on the same
     host.
diff --git a/doc/backends.mdwn b/doc/backends.mdwn
index ebcdedc2a7..2030d107a3 100644
--- a/doc/backends.mdwn
+++ b/doc/backends.mdwn
@@ -5,17 +5,19 @@ to retrieve the file's content (its value).
 Multiple pluggable key-value backends are supported, and a single repository
 can use different ones for different files.
 
-* `WORM` ("Write Once, Read Many") This assumes that any file with
-  the same basename, size, and modification time has the same content.
-  This is the default, and the least expensive backend.
-* `SHA1` -- This uses a key based on a sha1 checksum. This allows
+* `SHA256` -- The default backend for new files. This allows
   verifying that the file content is right, and can avoid duplicates of
   files with the same content. Its need to generate checksums
-  can make it slower for large files.
-* `SHA512`, `SHA384`, `SHA256`, `SHA224` -- Like SHA1, but larger
-  checksums. Mostly useful for the very paranoid, or anyone who is
-  researching checksum collisions and wants to annex their colliding data. ;)
-* `SHA1E`, `SHA512E`, etc -- Variants that preserve filename extension as
+  can make it slower for large files. 
+* `WORM` ("Write Once, Read Many") This assumes that any file with
+  the same basename, size, and modification time has the same content.
+  This is the least expensive backend, recommended for really large
+  files or slow systems.
+* `SHA512` -- Best currently available hash, for the very paranoid.
+* `SHA1` -- Smaller hash than `SHA256` for those who want a checksum
+  but are not concerned about security.
+* `SHA384`, `SHA224` -- Hashes for people who like unusual sizes.
+* `SHA256E`, `SHA1E`, etc -- Variants that preserve filename extension as
   part of the key. Useful for archival tasks where the filename extension
   contains metadata that should be preserved.
 
@@ -27,9 +29,11 @@ For finer control of what backend is used when adding different types of
 files, the `.gitattributes` file can be used. The `annex.backend`
 attribute can be set to the name of the backend to use for matching files.
 
-For example, to use the SHA1 backend for sound files, which tend to be
-smallish and might be modified or copied over time, you could set in
-`.gitattributes`:
+For example, to use the SHA256 backend for sound files, which tend to be
+smallish and might be modified or copied over time,
+while using the WORM backend for everything else, you could set
+in `.gitattributes`:
 
-    *.mp3 annex.backend=SHA1
-    *.ogg annex.backend=SHA1
+    * annex.backend=WORM
+    *.mp3 annex.backend=SHA256
+    *.ogg annex.backend=SHA256
diff --git a/doc/walkthrough/adding_files.mdwn b/doc/walkthrough/adding_files.mdwn
index 77a7fbc154..d1b5a04f77 100644
--- a/doc/walkthrough/adding_files.mdwn
+++ b/doc/walkthrough/adding_files.mdwn
@@ -2,8 +2,8 @@
     # cp /tmp/big_file .
     # cp /tmp/debian.iso .
     # git annex add .
-    add big_file ok
-    add debian.iso ok
+    add big_file (checksum...) ok
+    add debian.iso (checksum...) ok
     # git commit -a -m added
 
 When you add a file to the annex and commit it, only a symlink to
diff --git a/doc/walkthrough/moving_file_content_between_repositories.mdwn b/doc/walkthrough/moving_file_content_between_repositories.mdwn
index 27dffe9138..3ffcc11750 100644
--- a/doc/walkthrough/moving_file_content_between_repositories.mdwn
+++ b/doc/walkthrough/moving_file_content_between_repositories.mdwn
@@ -9,5 +9,5 @@ makes it very easy.
     move my_cool_big_file (to usbdrive...) ok
     # git annex move video/hackity_hack_and_kaxxt.mov --from fileserver
     move video/hackity_hack_and_kaxxt.mov (from fileserver...)
-    WORM-s86050597-m1274316523--hackity_hack_and_kax 100% 82MB 199.1KB/s 07:02
+    SHA256-s86050597--6ae2688bc533437766a48aa19f2c06be14d1bab9c70b468af445d4f07b65f41e 100% 82MB 199.1KB/s 07:02
     ok
diff --git a/doc/walkthrough/unused_data.mdwn b/doc/walkthrough/unused_data.mdwn
index e142b576c0..bd6c398710 100644
--- a/doc/walkthrough/unused_data.mdwn
+++ b/doc/walkthrough/unused_data.mdwn
@@ -1,8 +1,8 @@
-It's possible for data to accumulate in the annex that no files point to
-anymore. One way it can happen is if you `git rm` a file without
-first calling `git annex drop`. And, when you modify an annexed file, the old
-content of the file remains in the annex. Another way is when migrating
-between key-value [[backends|backend]].
+It's possible for data to accumulate in the annex that no files in any
+branch point to anymore. One way it can happen is if you `git rm` a file
+without first calling `git annex drop`. And, when you modify an annexed
+file, the old content of the file remains in the annex. Another way is when
+migrating between key-value [[backends|backend]].
 
 This might be historical data you want to preserve, so git-annex defaults to
 preserving it. So from time to time, you may want to check for such data and
@@ -12,8 +12,8 @@ eliminate it to save space.
     unused . (checking for unused data...)
       Some annexed data is no longer used by any files in the repository.
         NUMBER  KEY
-        1       WORM-s3-m1289672605--file
-        2       WORM-s14-m1289672605--file
+        1       SHA256-s86050597--6ae2688bc533437766a48aa19f2c06be14d1bab9c70b468af445d4f07b65f41e
+        2       SHA1-s14--f1358ec1873d57350e3dc62054dc232bc93c2bd1
       (To see where data was previously used, try: git log --stat -S'KEY')
       (To remove unwanted data: git-annex dropunused NUMBER)
     ok
diff --git a/doc/walkthrough/using_ssh_remotes.mdwn b/doc/walkthrough/using_ssh_remotes.mdwn
index fbbbbe0701..60011a200b 100644
--- a/doc/walkthrough/using_ssh_remotes.mdwn
+++ b/doc/walkthrough/using_ssh_remotes.mdwn
@@ -13,7 +13,7 @@ Now you can get files and they will be transferred (using `rsync` via `ssh`):
 
     # git annex get my_cool_big_file
     get my_cool_big_file (getting UUID for origin...) (from origin...)
-    WORM-s2159-m1285650548--my_cool_big_file 100% 2159 2.1KB/s 00:00
+    SHA256-s86050597--6ae2688bc533437766a48aa19f2c06be14d1bab9c70b468af445d4f07b65f41e 100% 2159 2.1KB/s 00:00
     ok
 
 When you drop files, git-annex will ssh over to the remote and make