Leverage the new chunked remotes to automatically resume uploads.
Sort of like rsync, although of course not as efficient since this
needs to start at a chunk boundry.
But, unlike rsync, this method will work for S3, WebDAV, external
special remotes, etc, etc. Only directory special remotes so far,
but many more soon!
This implementation will also allow starting an upload from one repository,
interrupting it, and then resuming the upload to the same remote from
an entirely different repository.
Note that I added a comment that storeKey should atomically move the content
into place once it's all received. This was already an undocumented
requirement -- it's necessary for hasKey to work reliably. This resume code
just uses hasKey to find the first chunk that's missing.
Note that if there are two uploads of the same key to the same chunked remote,
one might resume at the point the other had gotten to, but both will then
redundantly upload. As before.
In the non-resume case, this adds one hasKey call per storeKey, and only
if the remote is configured to use chunks. Future work: Try to eliminate that
hasKey. Notice that eg, `git annex copy --to` checks if the key is present
before sending it, so is already running hasKey.. which could perhaps
be cached and reused.
However, this additional overhead is not very large compared with
transferring an entire large file, and the ability to resume
is certianly worth it. There is an optimisation in place for small files,
that avoids trying to resume if the whole file fits within one chunk.
This commit was sponsored by Georg Bauer.
Leverage the new chunked remotes to automatically resume downloads.
Sort of like rsync, although of course not as efficient since this
needs to start at a chunk boundry.
But, unlike rsync, this method will work for S3, WebDAV, external
special remotes, etc, etc. Only directory special remotes so far,
but many more soon!
This implementation will also properly handle starting a download
from one remote, interrupting, and resuming from another one, and so on.
(Resuming interrupted chunked uploads is similarly doable, although
slightly more expensive.)
This commit was sponsored by Thomas Djärv.
Needed for resuming from chunks.
Url keys are considered not stable. I considered treating url keys with a
known size as stable, but just don't feel that is enough information.
When chunk=0, always try the unchunked key first. This avoids the overhead
of needing to read the git-annex branch to find the chunkcount.
However, if the unchunked key is not present, go on and try the chunks.
Also, when removing a chunked key, update the chunkcounts even when
chunk=0.
No need to process each L.ByteString chunk, instead ask it to split.
Doesn't seem to have really sped things up much, but it also made the code
simpler.
Note that this does (and already did) buffer in memory. It seems that only
the directory special remote could take advantage of streaming chunks to
files w/o buffering, so probably won't add an interface to allow for that.
This will allow things like WebDAV to opean a single persistent connection
and reuse it for all the chunked data.
The crazy types allow for some nice code reuse.
Push it down from needing to be done in every Storer,
to being checked once inside ChunkedEncryptable.
Also, catch exceptions from PrepareStorer and PrepareRetriever,
just in case..
I'd have liked to keep these two concepts entirely separate,
but that are entagled: Storing a key in an encrypted and chunked remote
need to generate chunk keys, encrypt the keys, chunk the data, encrypt the
chunks, and send them to the remote. Similar for retrieval, etc.
So, here's an implemnetation of all of that.
The total win here is that every remote was implementing encrypted storage
and retrival, and now it can move into this single place. I expect this
to result in several hundred lines of code being removed from git-annex
eventually!
This commit was sponsored by Henrik Ahlgren.