# hashdeep integration

## What is hashdeep

[hashdeep](http://md5deep.sourceforge.net/) is a handy tool that allows you to check file integrity across whole directory trees. It can detect renames and missing files, for example.

## How to use it with git-annex

The general working principle of hashdeep is that it iterates over a set of files and produces a manifest that looks like this:

    $ hashdeep -r *
    %%%% HASHDEEP-1.0
    %%%% size,md5,sha256,filename
    ## Invoked from: /home/jessek
    ## $ hashdeep -r archives bin lib doc
    21508,6178d221a1714b7e2089565e997d6ad1,92caa3f5754b22ca792e4f8626362d2ef39596b080abfcfed951a86bee82bec3,/home/jessek/archives/foo-1.2.1.tar.gz
    12292,116e77a5dc6af0996597f7bc1b9252a2,c2afc6aa8d5c094a7226db1695d99a37fa858548f5d09aad9e41badfc62b1d27,/home/jessek/archives/bar-0.9.tar.bz2
    145684,4409c1e0b5995c290c2fc3d1d6d74bac,f56881fb277358c95ed3ddf64f28c4ff3f3937e636e17d6a26d42822b16fd4ed,/home/jessek/bin/ls

Then this manifest can be used to check the consistency of the files later. Because git-annex also uses hashes to identify files, it fits nicely with this pattern, and I have used it to verify files that were outside of git-annex's control yet still came from the repository. First, we produce the manifest file:

    (
    echo '%%%% HASHDEEP-1.0'
    echo '%%%% size,sha256,filename'
    git annex find --format '${bytesize},${keyname},${file}\n' | sed 's/\.[^,]*,/,/'
    ) > manifest.txt

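For illustration, here is roughly what the resulting manifest.txt looks like for the sample tarball shown earlier (the exact path depends on where you run the command; the extension has already been stripped from the hash field):

    %%%% HASHDEEP-1.0
    %%%% size,sha256,filename
    21508,92caa3f5754b22ca792e4f8626362d2ef39596b080abfcfed951a86bee82bec3,archives/foo-1.2.1.tar.gz
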
Then this can be used to verify an external fileset with the following command:

    hashdeep -k manifest.txt -a -vv -e /mnt/ > result

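As a side note, hashdeep's exit status in audit mode reflects the outcome (zero when the audit passes, non-zero otherwise, if I read the manual right), so a minimal sketch of scripting the check could be:

    # run the audit; the detailed report lands in "result" either way
    if hashdeep -k manifest.txt -a -vv -e /mnt/ > result; then
        echo "audit passed"
    else
        echo "audit failed, see the result file for details"
    fi
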
This will create a listing of every file that was moved, that is missing, and so on. I have used this to audit corrupted files on my phone's microSD card, as it turned out that about half of the files were corrupted for some mysterious reason:

    hashdeep: Audit failed
    Input files examined: 0
    Known files expecting: 0
    Files matched: 0
    Files partially matched: 0
    Files moved: 3411
    New files found: 2179
    Known files not found: 42117

The non-zero numbers are interesting: 3411 files were detected as being sane, and just their filenames had changed. 2179 files were "new", which means they were not in the original set. Since files were supposed to *only* come from the original set, this means those files were corrupt. Actually, that's not completely true: some files (namely JPG image files) *were* created in the external fileset, so I had to be careful to exclude those false positives by hand. The 42117 "known files not found" were files that were simply not transferred over to the phone for lack of space.

This way, I was able to quickly find which files were corrupt and remove them. The following command created the list of files to remove:

    grep 'No match' result | grep -v '.jpg' | sed 's/: No match$//'

And I used the following loop to remove the files one by one:

    grep 'No match' result | grep -v '.jpg' | sed 's/: No match$//' | while read file; do rm "$file" ; done

Note that the above is actually quite dangerous, and you might want to insert an `echo` in there to avoid shenanigans, especially if you do not trust the filesystem.

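For example, a slightly more defensive sketch of the same loop (not from the original procedure) would anchor the `.jpg` pattern, since a bare `.` matches any character, and preserve whitespace and backslashes in filenames:

    # dry run first: prints the rm commands instead of executing them
    grep 'No match' result | grep -v '\.jpg$' | sed 's/: No match$//' |
        while IFS= read -r file; do echo rm "$file"; done
    # once the list looks right, drop the `echo` to actually delete
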
## How else this might work

Naturally, I could have imported all the files into git-annex and worked only with git-annex to do this. But because the files were renamed to some canonical version by the software transferring them ([dSub](https://f-droid.org/en/packages/github.daneren2005.dsub/) and [Airsonic](https://airsonic.github.io/)), it would have been difficult to diff against the original set. This is on a (ex)FAT filesystem too, which might make git-annex operation difficult. Yet I can't help but think this is something that [[git-annex-export]] should be able to do, though I am not sure it could deal with the renames. And I must say I have found it a little inconvenient to have to `initremote` to be able to use what are essentially ephemeral storage mountpoints.

The above procedure reuses the best of both worlds: hashdeep does the fuzzy matching and git-annex provides the catalog of files.

## Future improvements

It would be nice if [[git-annex-find]] allowed listing only the checksum, which would remove a potentially error-prone pattern substitution above (`sed 's/\.[^,]*,/,/'`). This is necessary because `${keyname}` includes the file extension, which is expected with the `SHA256E` backend, but it is somewhat inconvenient to deal with. Of course, it would be pretty awesome if git-annex could output hashdeep-compatible catalogs out of the box: it would improve interoperability here... And the icing on the cake would be a git-annex command (a variation of [[git-annex-import]]?) that would audit an external, non-annexed repository for consistency in the same way.

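To make that substitution concrete: with the `SHA256E` backend, the raw `--format` output for the sample file above carries the extension in the hash field, and the `sed` call strips it down to the bare checksum:

    # before: ${keyname} ends in the file extension
    21508,92caa3f5754b22ca792e4f8626362d2ef39596b080abfcfed951a86bee82bec3.tar.gz,archives/foo-1.2.1.tar.gz
    # after sed 's/\.[^,]*,/,/': just size,sha256,filename, as hashdeep expects
    21508,92caa3f5754b22ca792e4f8626362d2ef39596b080abfcfed951a86bee82bec3,archives/foo-1.2.1.tar.gz
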
Also note that hashdeep can operate in "chunk" mode, which means it can work across file boundaries, detecting partial matches, for example. This is something that, as far as I know, is impossible in git-annex, as checksums are only file-based. This would be useful for eliminating the false positives by distinguishing the "this file is completely new" and "this file is corrupt" cases.

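If I understand the md5deep suite correctly, this is the `-p` (piecewise) option, so a hypothetical chunked run could look something like this:

    # hash in 1 MB pieces so each piece gets its own manifest entry;
    # a single corrupt chunk then no longer hides the rest of the file
    # (assumes the -p size flag as documented in the md5deep/hashdeep suite)
    hashdeep -p 1m -r /mnt/ > chunked-manifest.txt
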
## Comments

These notes were provided by [[anarcat]], who would gladly welcome corrections and improvements.