114 lines
5.2 KiB
Markdown
114 lines
5.2 KiB
Markdown
## What is hashdeep
|
|
|
|
[hashdeep](http://md5deep.sourceforge.net/) is a handy tool that allows you to check file integrity
|
|
across whole directory trees. It can detect renames and missing files,
|
|
for example.
|
|
|
|
## How to use it with git-annex
|
|
|
|
The general working principle of hashdeep is that it iterates over a
|
|
set of files and produces a manifest that looks like this:
|
|
|
|
$ hashdeep -r *
|
|
%%%% HASHDEEP-1.0
|
|
%%%% size,md5,sha256,filename
|
|
## Invoked from: /home/jessek
|
|
## $ hashdeep -r archives bin lib doc
|
|
21508,6178d221a1714b7e2089565e997d6ad1,92caa3f5754b22ca792e4f8626362d2ef39596b080abfcfed951a86bee82bec3,/home/jessek/archives/foo-1.2.1.tar.gz
|
|
12292,116e77a5dc6af0996597f7bc1b9252a2,c2afc6aa8d5c094a7226db1695d99a37fa858548f5d09aad9e41badfc62b1d27,/home/jessek/archives/bar-0.9.tar.bz2
|
|
145684,4409c1e0b5995c290c2fc3d1d6d74bac,f56881fb277358c95ed3ddf64f28c4ff3f3937e636e17d6a26d42822b16fd4ed,/home/jessek/bin/ls
|
|
|
|
Then this manifest can be used to check consistency of the files
|
|
later. Because git-annex also uses hashes to identify files, it fits
|
|
nicely with this pattern and I have used it to verify files that were
|
|
outside of git-annex's control yet still from the repository. First,
|
|
we produce the manifest file:
|
|
|
|
(
|
|
echo '%%%% HASHDEEP-1.0'
|
|
echo '%%%% size,sha256,filename'
|
|
git annex find --format '${bytesize},${keyname},${file}\n' | sed 's/\.[^,]*,/,/'
|
|
) > manifest.txt
|
|
|
|
Then this can be used to verify an external fileset with the following
|
|
command:
|
|
|
|
hashdeep -k manifest.txt -a -vv -e -r /mnt/ > result
|
|
|
|
This will create a listing of every file that was moved, that is
|
|
missing and so on. I have used this to audit corrupted files on my
|
|
phone's microSD card as it turned out that about half of the files
|
|
were corrupted for some mysterious reason:
|
|
|
|
hashdeep: Audit failed
|
|
Input files examined: 0
|
|
Known files expecting: 0
|
|
Files matched: 0
|
|
Files partially matched: 0
|
|
Files moved: 3411
|
|
New files found: 2179
|
|
Known files not found: 42117
|
|
|
|
The non-zero numbers are interesting: 3411 files were detected as
|
|
being sane and just the filenames had changed. 2179 files were "new"
|
|
which means that they were not in the original set. Since files were
|
|
supposed to *only* come from the original set, this means those files
|
|
were corrupt. Actually, that's not completely true: some files (JPG
|
|
image files, namely) *were* created in the external fileset so I had
|
|
to be careful to exclude those false positives by hand. The 42117
|
|
"known files not found" were files that were simply not transferred
|
|
over to the phone for lack of space.
|
|
|
|
This way, I was able to quickly find which files were corrupt and
|
|
remove them. This created a list of files to remove:
|
|
|
|
grep 'No match' result | grep -v '.jpg' | sed 's/: No match$//'
|
|
|
|
And I used the following loop to remove the files one by one:
|
|
|
|
grep 'No match' result | grep -v '.jpg' | sed 's/: No match$//' | while read file; do rm "$file" ; done
|
|
|
|
Note the above is actually quite dangerous and you might want to
|
|
insert an `echo` in there to avoid shenanigans, especially if you do
|
|
not trust the filesystem.
|
|
|
|
## How else this might work
|
|
|
|
Naturally, I could have imported all the files into git-annex and work
|
|
only with git-annex to operate this. But because the files were
|
|
renamed to some canonical version by the software transferring the file
|
|
([dSub](https://f-droid.org/en/packages/github.daneren2005.dsub/) and [Airsonic](https://airsonic.github.io/)), it would have been difficult to make a
|
|
diff with the original set. This is on a (ex)fat filesystem too, which
|
|
might make git-annex operation difficult. Yet I can't help but think
|
|
this is something that [[git-annex-export]] should be able to do, but
|
|
I am not sure it could deal with the renames. And I must say I have
|
|
found it a little inconvenient to have to `initremote` to be able to
|
|
use what are essentially ephemeral storage mountpoints.
|
|
|
|
The above procedure reuses the best of both world: hashdeep does the
|
|
fuzzy matching and git-annex provides the catalog of files.
|
|
|
|
## Future improvements
|
|
|
|
It would be nice if [[git-annex-find]] would allow listing only the
|
|
checksum, which would remove a potentially error-prone pattern
|
|
substitution above (`sed 's/\.[^,]*,/,/'`). This is necessary because
|
|
`${keyname}` includes the file extension which is expected with the
|
|
`SHA256E` backend, but it is somewhat inconvenient to deal with. Of
|
|
course, it would be pretty awesome if git-annex could output
|
|
hashdeep-compatible catalogs out of the box: it would improve
|
|
interoperability here... And the icing on cake would be a git-annex
|
|
command (a variation of [[git-annex-import]]?) that would audit an
|
|
external, non-annexed repository for consistency in the same way.
|
|
|
|
Also note that hashdeep can operate in "chunk" mode which means that
|
|
it can work across file boundaries, detecting partial matches, for
|
|
example. This is something that, as far as I know, is impossible in
|
|
git-annex as checksums are only file-based. This would be useful in
|
|
eliminating the false positives by distinguishing the "this file is
|
|
completely new" and "this file is corrupt" cases.
|
|
|
|
## Comments
|
|
|
|
Those notes were provided by [[anarcat]] but would gladly welcome
|
|
corrections and improvements.
|