## What is hashdeep [hashdeep](http://md5deep.sourceforge.net/) is a handy tool that allows you to check file integrity across whole directory trees. It can detect renames and missing files, for example. ## How to use it with git-annex The general working principle of hashdeep is that it iterates over a set of files and produces a manifest that looks like this: $ hashdeep -r * %%%% HASHDEEP-1.0 %%%% size,md5,sha256,filename ## Invoked from: /home/jessek ## $ hashdeep -r archives bin lib doc 21508,6178d221a1714b7e2089565e997d6ad1,92caa3f5754b22ca792e4f8626362d2ef39596b080abfcfed951a86bee82bec3,/home/jessek/archives/foo-1.2.1.tar.gz 12292,116e77a5dc6af0996597f7bc1b9252a2,c2afc6aa8d5c094a7226db1695d99a37fa858548f5d09aad9e41badfc62b1d27,/home/jessek/archives/bar-0.9.tar.bz2 145684,4409c1e0b5995c290c2fc3d1d6d74bac,f56881fb277358c95ed3ddf64f28c4ff3f3937e636e17d6a26d42822b16fd4ed,/home/jessek/bin/ls Then this manifest can be used to check consistency of the files later. Because git-annex also uses hashes to identify files, it fits nicely with this pattern and I have used it to verify files that were outside of git-annex's control yet still from the repository. First, we produce the manifest file: ( echo '%%%% HASHDEEP-1.0' echo '%%%% size,sha256,filename' git annex find --format '${bytesize},${keyname},${file}\n' | sed 's/\.[^,]*,/,/' ) > manifest.txt Then this can be used to verify an external fileset with the following command: hashdeep -k manifest.txt -a -vv -e /mnt/ > result This will create a listing of every file that was moved, that is missing and so on. I have used this to audit corrupted files on my phone's microSD card as it turned out that about half of the files were corrupted for some mysterious reason: hashdeep: Audit failed Input files examined: 0 Known files expecting: 0 Files matched: 0 Files partially matched: 0 Files moved: 3411 New files found: 2179 Known files not found: 42117 The non-zero numbers are interesting: 3411 files were detected as being sane and just the filenames had changed. 2179 files were "new" which means that they were not in the original set. Since files were supposed to *only* come from the original set, this means those files were corrupt. Actually, that's not completely true: some files (JPG image files, namely) *were* created in the external fileset so I had to be careful to exclude those false positives by hand. The 42117 "known files not found" were files that were simply not transferred over to the phone for lack of space. This way, I was able to quickly find which files were corrupt and remove them. This created a list of files to remove: grep 'No match' result | grep -v '.jpg' | sed 's/: No match$//' And I used the following loop to remove the files one by one: grep 'No match' result | grep -v '.jpg' | sed 's/: No match$//' | while read file; do rm "$file" ; done Note the above is actually quite dangerous and you might want to insert an `echo` in there to avoid shenanigans, especially if you do not trust the filesystem. ## How else this might work Naturally, I could have imported all the files into git-annex and work only with git-annex to operate this. But because the files were renamed to some canonical version by the software transferring the file ([dSub](https://f-droid.org/en/packages/github.daneren2005.dsub/) and [Airsonic](https://airsonic.github.io/)), it would have been difficult to make a diff with the original set. This is on a (ex)fat filesystem too, which might make git-annex operation difficult. Yet I can't help but think this is something that [[git-annex-export]] should be able to do, but I am not sure it could deal with the renames. And I must say I have found it a little inconvenient to have to `initremote` to be able to use what are essentially ephemeral storage mountpoints. The above procedure reuses the best of both world: hashdeep does the fuzzy matching and git-annex provides the catalog of files. ## Future improvements It would be nice if [[git-annex-find]] would allow listing only the checksum, which would remove a potentially error-prone pattern substitution above (`sed 's/\.[^,]*,/,/'`). This is necessary because `${keyname}` includes the file extension which is expected with the `SHA256E` backend, but it is somewhat inconvenient to deal with. Of course, it would be pretty awesome if git-annex could output hashdeep-compatible catalogs out of the box: it would improve interoperability here... And the icing on cake would be a git-annex command (a variation of [[git-annex-import]]?) that would audit an external, non-annexed repository for consistency in the same way. Also note that hashdeep can operate in "chunk" mode which means that it can work across file boundaries, detecting partial matches, for example. This is something that, as far as I know, is impossible in git-annex as checksums are only file-based. This would be useful in eliminating the false positives by distinguishing the "this file is completely new" and "this file is corrupt" cases. ## Comments Those notes were provided by [[anarcat]] but would gladly welcome corrections and improvements.