git-annex/doc/bugs/Glacier_remote_uploads_duplicates.mdwn

37 lines
3.5 KiB
Text
Raw Normal View History

2013-05-22 18:09:46 +00:00
### Please describe the problem.
Other references:
https://github.com/basak/glacier-cli/pull/19
http://git-annex.branchable.com/special_remotes/glacier/#comment-a2b05b8dc2d640ee498d90398f02931c
#### Background
* Glacier doesn't support keys that the client selects, unlike S3. If you upload to Glacier, Glacier assigns a unique ID, not the client.
* Glacier does support an "archive description" which is immutable. It also provides this "archive description" in an inventory listing, together with the unique IDs.
* An "archive description" is not a unique key. It's perfectly possible to upload multiple archives to Glacier with the same "archive description".
* glacier-cli uses the "archive description" field as an upload identifier, since the unique IDs are unfriendly to users. However, since they are potentially ambiguous identifiers, it also supports disambiguation using the ID itself. See "Addressing Archives" in README.md for details.
#### The Problem
This what I believe is happening in the two reports referenced above. When git-annex is used without `--trust-glacier`, it can end up uploading the same data multiple times. From git-annex's point of view, it cannot verify that the data is already in Glacier, so it uploads again, expecting an overwrite operation if the key is already in Glacier. Since glacier-cli maps the key to an "archive description" that can be duplicated, this is not what happens. Instead, a second archive is uploaded.
When git-annex later does a "checkpresent" operation, glacier-cli fails. This is because the request is ambiguous, since there are two archives in Glacier with the same "key". The error message could be better here, but I believe that the behaviour is correct.
#### Discussion
glacier-cli can find out what data Glacier claims to have using an inventory retrieval. However, this retrieval takes about four hours and can be out of date (eg. if someone else recently deleted the archive from another client). Thus, I can understand git-annex's desire not to trust this data or a cache of it.
However, whatever we do, it is impossible to map an "upload or overwrite on key X" type command to Glacier. We'll always end up with duplicates. Even if git-annex stored the Glacier archive IDs, there is no API to replace an existing archive with the same ID, and inventories are out of date even before we retrieve them.
#### Workaround
If the problem is as I think it is, always applying `--trust-glacier` should prevent the problem from occurring in most cases, since git-annex will run "checkpresent" and glacier-cli will confirm that the archive exists.
To fix the problem after it has occurred, it should be sufficient to delete duplicates using glacier-cli, since they _should_ be identical to each other. Some enhancement of the `glacier-cli archive list` command would help here.
2013-06-10 17:21:26 +00:00
Update 10 June 2013: I've pushed a `glacier-cli` update and helper script in commit `b68835`. This adds a `--force-ids` option to `glacier archive list`, with which the helper script `glacier-list-duplicates.sh` uses to identify duplicates that can be removed. If you're affected by this issue, I suggest that you use this helper to identify and fix your problem by removing the duplicates. Please do so carefully by checking that the output of the helper is correct before you use it to delete the duplicates. See the comments at the top of the helper script for usage information.
2013-06-11 14:39:08 +00:00
> [[fixed|done]], at least for the only well-working case for glacier, where
> only one repository can access glacier directly. --[[Joey]]