This threw an unusual exception w/o an error message when probing to see if
the bucket exists yet. So rather than relying on tryS3, catch all
exceptions.
This does mean that it might get an exception for some transient network
error, think this means the bucket DNE yet, and try to create it, and then
fail when it already exists.
When uploading the last part of a file, which was 640229 bytes, S3 rejected
that part: "Your proposed upload is smaller than the minimum allowed size"
I don't know what the minimum is, but the fix is just to include the last
part into the previous part. Since this can result in a part that's
double-sized, use half-sized parts normally.
Unfortunately, I don't fully understand why it was leaking using the old
method of a lazy bytestring. I just know that it was leaking, despite
neither hGetUntilMetered nor byteStringPopper seeming to leak by
themselves.
The new method avoids the lazy bytestring, and simply reads chunks from the
handle and streams them out to the http socket.
Untested and not even compiled yet.
Testing should include checks that file content streams through without
buffering in memory.
Note that CL.consume causes all the etags to be buffered in memory.
This is probably nearly unavoidable, since a request has to be constructed
that contains the list of etags in its body. (While it might be possible to
stream generation of the body, that would entail making a http request that
dribbles out parts of the body as the multipart uploads complete, which is
not likely to work well..
To limit this being a problem, it's best for partsize to be set to some
suitably large value, like 1gb. Then a full terabyte file will need only
1024 etags to be stored, which will probably use around 1 mb of memory.
I'm a little stuck on getting the list of etags of the parts.
This seems to require taking the md5 of each part locally,
which doesn't get along well with lazily streaming in the part from the
file. It would need to read the file twice, or lose laziness and buffer a
whole part -- but parts might be quite large.
This seems to be a problem with the API provided; S3 is supposed to return
an etag, but that is not exposed. I have filed a bug:
https://github.com/aristidb/aws/issues/141
This is intended to let the user easily tell if a remote's creds are
coming from info embedded in the repository, or instead from the
environment, or perhaps are locally stored in a creds file.
This commit was sponsored by Frédéric Schütz.
Now `git annex info $remote` shows info specific to the type of the remote,
for example, it shows the rsync url.
Remote types that support encryption or chunking also include that in their
info.
This commit was sponsored by Ævar Arnfjörð Bjarmason.
encryptionSetup must be called before setRemoteCredPair. Otherwise,
the RemoteConfig doesn't have the cipher in it, and so no cipher is used to
encrypt the embedded creds.
This is a security fix for non-shared encryption methods!
For encryption=shared, there's no security problem, just an
inconsistentency in whether the embedded creds are encrypted.
This is very important to get right, so used some types to help ensure that
setRemoteCredPair is only run after encryptionSetup. Note that the external
special remote bypasses the type safety, since creds can be set after the
initial remote config, if the external special remote program requests it.
Also note that IA remotes never use encryption, so encryptionSetup is not
run for them at all, and again the type safety is bypassed.
This leaves two open questions:
1. What to do about S3 and glacier remotes that were set up
using encryption=pubkey/hybrid with embedcreds?
Such a git repo has a security hole embedded in it, and this needs to be
communicated to the user. Is the changelog enough?
2. enableremote won't work in such a repo, because git-annex will
try to decrypt the embedded creds, which are not encrypted, so fails.
This needs to be dealt with, especially for ecryption=shared repos,
which are not really broken, just inconsistently configured.
Noticing that problem for encryption=shared is what led to commit
fbdeeeed5f, which tried to
fix the problem by not decrypting the embedded creds.
This commit was sponsored by Josh Taylor.
Added a mkUnavailable method, which a Remote can use to generate a version
of itself that is not available. Implemented for several, but not yet all
remotes.
This allows testing that checkPresent properly throws an exceptions when
it cannot check if a key is present or not. It also allows testing that the
other methods don't throw exceptions in these circumstances.
This immediately found several bugs, which this commit also fixes!
* git remotes using ssh accidentially had checkPresent return
an exception, rather than throwing it
* The chunking code accidentially returned False rather than
propigating an exception when there were no chunks and
checkPresent threw an exception for the non-chunked key.
This commit was sponsored by Carlo Matteo Capocasa.
Implemented the Retriever.
Unfortunately, it is a fileRetriever and not a byteRetriever.
It should be possible to convert this to a byteRetiever, but I got stuck:
The conduit sink needs to process individual chunks, but a byteRetriever
needs to pass a single L.ByteString to its callback for processing. I
looked into using unsafeInerlaveIO to build up the bytestring lazily,
but the sink is already operating under conduit's inversion of control,
and does not run directly in IO anyway.
On the plus side, no more memory leak..
Fixes the memory leak on store.. the second oldest open git-annex bug!
Only retrieve remains to be converted.
This commit was sponsored by Scott Robinson.
Currently, initremote works, but not the other operations. They should be
fairly easy to add from this base.
Also, https://github.com/aristidb/aws/issues/119 blocks internet archive
support.
Note that since http-conduit is used, this also adds https support to S3.
Although git-annex encrypts everything anyway, so that may not be extremely
useful. It is not enabled by default, because existing S3 special remotes
have port=80 in their config. Setting port=443 will enable it.
This commit was sponsored by Daniel Brockman.
This will allow special remotes to eg, open a http connection and reuse it,
while checking if chunks are present, or removing chunks.
S3 and WebDAV both need this to support chunks with reasonable speed.
Note that a special remote might want to cache a http connection across
multiple requests. A simple case of this is that CheckPresent is typically
called before Store or Remove. A remote using this interface can certianly
use a Preparer that eg, uses a MVar to cache a http connection.
However, it's up to the remote to then deal with things like stale or
stalled http connections when eg, doing a series of downloads from a remote
and other places. There could be long delays between calls to a remote,
which could lead to eg, http connection stalls; the machine might even
move to a new network, etc.
It might be nice to improve this interface later to allow
the simple case without needing to handle the full complex case.
One way to do it would be to have a `Transaction SpecialRemote cache`,
where SpecialRemote contains methods for Storer, Retriever, Remover, and
CheckPresent, that all expect to be passed a `cache`.
I tend to prefer moving toward explicit exception handling, not away from
it, but in this case, I think there are good reasons to let checkPresent
throw exceptions:
1. They can all be caught in one place (Remote.hasKey), and we know
every possible exception is caught there now, which we didn't before.
2. It simplified the code of the Remotes. I think it makes sense for
Remotes to be able to be implemented without needing to worry about
catching exceptions inside them. (Mostly.)
3. Types.StoreRetrieve.Preparer can only work on things that return a
Bool, which all the other relevant remote methods already did.
I do not see a good way to generalize that type; my previous attempts
failed miserably.
Make the byteRetriever be passed the callback that consumes the bytestring.
This way, there's no worries about the lazy bytestring not all being read
when the resource that's creating it is closed.
Which in turn lets bup, ddar, and S3 each switch from using an unncessary
fileRetriver to a byteRetriever. So, more efficient on chunks and encrypted
files.
The only remaining fileRetrievers are hook and external, which really do
retrieve to files.