Merge branch 'master' of ssh://git-annex.branchable.com

Joey Hess 2018-10-22 13:01:06 -04:00
commit d6b2468b4c
GPG key ID: DB12DB0FF05F8F38
4 changed files with 146 additions and 0 deletions


@ -0,0 +1,35 @@
I recently discovered (thanks to Paul Wise) the [Meow hash][]. The
TL;DR is that it's a fast non-cryptographic hash that might be useful for
git-annex. Here's their intro, quoted from the website:
[Meow hash]: https://mollyrocket.com/meowhash
> The Meow hash is a high-speed hash function named after the character
> Meow in [Meow the Infinite][]. We developed the hash function at
> [Molly Rocket][] for use in the asset pipeline of [1935][].
>
> Because we have to process hundreds of gigabytes of art assets to build
> game packages, we wanted a fast, non-cryptographic hash for use in
> change detection and deduplication. We had been using a cryptographic
> hash ([SHA-1][]), but it was
> unnecessarily slowing things down.
>
> To our surprise, we found a lack of published, well-optimized,
> large-data hash functions. Most hash work seems to focus on small input
> sizes (for things like dictionary lookup) or on cryptographic quality.
> We wanted the fastest possible hash that would be collision-free in
> practice (like SHA-1 was), and we didn't need any cryptographic security.
>
> We ended up creating Meow to fill this niche.
[1935]: https://molly1935.com/
[Molly Rocket]: https://mollyrocket.com/
[Meow the Infinite]: https://meowtheinfinite.com/
[SHA-1]: https://en.m.wikipedia.org/wiki/SHA-1
I don't have an immediate use case for this, but I think it could
be useful to speed up checks on larger files. The license is a
*little* weird but seems close enough to BSD to be acceptable.
I know it might sound like a conflict of interest, but I *swear* I am
not bringing this up only as an oblique feline reference. ;) -- [[anarcat]]


@ -0,0 +1,5 @@
It would be good if one could define custom external [[backends]], the way one can define custom external remotes. This would solve [[todo/consider_meow_backend]] but also have other uses. For instance, files sometimes contain details that are irrelevant to their semantics (e.g. comments) but still change the checksum; with a custom backend, one could "canonicalize" a file before computing the checksum.
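
To make the "canonicalize" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `XCANON` backend name, the `canonicalize` rule (drop `#` comment lines and trailing whitespace), and the key layout are hypothetical, and this is not an existing git-annex interface.

```python
import hashlib


def canonicalize(data: bytes) -> bytes:
    """Hypothetical canonical form: drop '#' comment lines and trailing whitespace."""
    kept = []
    for line in data.splitlines():
        if line.lstrip().startswith(b"#"):
            continue  # comments do not affect the file's meaning
        kept.append(line.rstrip())
    return b"\n".join(kept) + b"\n"


def canonical_key(data: bytes) -> str:
    """Build a key from the canonical form, so irrelevant edits keep the same key."""
    digest = hashlib.sha256(canonicalize(data)).hexdigest()
    # "XCANON" is a made-up backend name; the size field is omitted on purpose,
    # since the raw size changes whenever a comment changes.
    return f"XCANON--{digest}"
```

With this rule, two files that differ only in comments or trailing whitespace get the same key, while a change to the remaining content produces a different one.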
@joey pointed out a potential problem: "needing to deal with the backend being missing or failing to work could have wide repercussions in the code base." I wonder if there are ways around that. Suppose you specified a default backend to use in case a custom one was unavailable? Then you could always compute a key from a file, even if not with the intended backend. And once a key is stored in git-annex, most of git-annex treats the key as just a string. Even if the custom backend supports checksum verification, when its implementation is unavailable, keys from that backend could be treated like WORM/URL keys, which do not support checksum checking.
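
A rough sketch of that fallback behaviour, again in Python and purely as an assumption about how it might work: the `BACKENDS` registry, `generate_key`, `verify`, and the `XMEOW` name are all hypothetical and do not correspond to git-annex's actual code.

```python
import hashlib
from typing import Callable, Dict, Optional

Hasher = Callable[[bytes], str]

# Hypothetical registry of backends that are available on this machine.
BACKENDS: Dict[str, Hasher] = {
    "SHA256": lambda data: hashlib.sha256(data).hexdigest(),
}


def generate_key(data: bytes, preferred: str, fallback: str = "SHA256") -> str:
    """Use the preferred backend if it is available, else the default one."""
    name = preferred if preferred in BACKENDS else fallback
    return f"{name}--{BACKENDS[name](data)}"


def verify(key: str, data: bytes) -> Optional[bool]:
    """Return True/False when the backend is available, None when it is not.

    None means "cannot verify", which is how a key without checksum
    support (WORM/URL style) would be handled.
    """
    name, _, digest = key.partition("--")
    hasher = BACKENDS.get(name)
    if hasher is None:
        return None
    return hasher(data) == digest
```

Here a key can always be generated, and an existing key whose backend is missing is merely unverifiable rather than an error.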
Thoughts?