Avoiding using a callback simplifies this and should make it easier to
implement incremental checksumming, which will need to happen partly in
writeRetrievedContent and partly in retrieveChunks.
Following up to f44d4704c6,
I tried making updateIncremental pure, avoiding the IORef overhead.
That did not improve speed though. It did complicate the interface since
thunks needed to be forced to avoid leaking memory. So am not going with
that change.
Looking at Crypto.Hash.hashUpdate, it copies a byte array on each call,
compared with hashlazy that only uses 1 copy for the whole bytestring.
That could well explain a lot of the overhead discussed in the
abovementioned commit. Don't see any way to improve that while hashing
incrementally, except using bigger chunks should reduce its overhead.
Since 10x larger chunks did not, I'm kind of puzzled if it's really
what's affecting performance.
IORef rather than MVar sped up benchmark mentioned in last commit to
13.0s.
This makes me wonder if changing the interface to not need the IORef
either would improve speed further.
This benchmarks only slightly faster than the old git-annex. Eg, for a 1
gb file, 14.56s vs 15.57s. (On a ram disk; there would certianly be
more of an effect if the file was written to disk and didn't stay in
cache.)
Commenting out the updateIncremental calls make the same run in 6.31s.
May be that overhead in the implementation, other than the actual
checksumming, is slowing it down. Eg, MVar access.
(I also tried using 10x larger chunks, which did not change the speed.)
This is groundwork for calculating checksums while copying, rather than
in a separate pass, but that's not done yet. For now, avoid using rsync
(and cp on Windows), and instead read and write the file ourselves, with
resume handling.
Benchmarking vs old git-annex that used rsync, this is faster,
at least once the file size is larger than a couple of MB.
Changing to the P2P protocol broke this, because preseedTmp copies
the local copy of the object to the temp file, and then the P2P transfer
sees the right length file and uses it as-is.
When git-annex-shell is too old and rsync is used, it did verify the
content, and when the local repo does not have the object it did verify the
content.
This could perhaps have caused a hard link to be made when the content
of the object was modified. I don't think that actually happened,
because the annexed file would have to be unlocked, with annex.thin, for
the object to get modified, and in that case, a hard link is not made.
However, to be sure, run the check.
Note that it seemed best to run the check only once, although the
current implementation is fast and safe to run repeatedly.
Checksum as content is received from a remote git-annex repository, rather
than doing it in a second pass.
Not tested at all yet, but I imagine it will work!
Not implemented for any special remotes, and also not implemented for
copies from local remotes. It may be that, for local remotes, it will
suffice to use rsync, rely on its checksumming, and simply return Verified.
(It would still make a checksumming pass when cp is used for COW, I guess.)
As yet unused.
Backend.External could perhaps implement it too, although that would
involve sending chunks of data to it via a pipe or something, so likely
to be slow.
See my comment in the next commit for some details about why
Verified needs a hash with preimage resistance. As far as tahoe goes,
it's fully cryptographically secure.
I think that bup could also return Verified. However, the Retriever
interface does not currenly support that.