git-annex/doc/bugs/Improvements_to_S3_glacier_integration.mdwn
ijc@c69abafeb65fa2e784811fc549e9976a5cf4b903 414d3325ea
2021-04-25 10:15:28 +00:00

117 lines
7.3 KiB
Markdown

### Please describe the problem.
I have a git annex remote on s3 configured to push things to glacier rather than normal storage. Compared to regular s3 things in glacier are not immediately available and must be "restored" before they can be downloaded (the trade off is that data which is untouched long term is quite a lot cheaper per GB). I'm using the DEEP_ARCHIVE storage class (configured using the `storageclass` key in the remote's config, I didn't fiddle with the s3 bucket lifecycle at all). I think the following applies to any Glacier stored objects, the class just changes how long a restore will take.
My annexed objects are > 1GB and the s3 remote is chunked at 1GB granularity.
When I attempt to `git annex get` such an object the error message misleading refers to the unchunked path e.g. `/SHA256E-s6749536256--f76639fa11276b4045844e6110035c15e6803acc38d77847c2e4a2be1b1850ca.iso` rather than `/SHA256E-s6749536256-S1000000000-C1--f76639fa11276b4045844e6110035c15e6803acc38d77847c2e4a2be1b1850ca.iso` etc, which sent me down a blind alleyway for a bit. If I `git annex get -d` then I can see in the log that it tries the chunked path first and the falls back to the unchunked. It would be useful if the non-verbose error message listed the first attempt and the fallback. It would be even better if it could be aware enough of Glacier to point out that some list objects need to be manually restored in order to be retrieved.
In my quest to manually restore I could not for the life of me figure out (even going into the plumbing layers of git etc) how to retrieve a list of the chunks needed. I can get the key from `git annex info` easily enough and then `aws s3api list-objects --bucket <...> --prefix` to look for chunks of objects with the `SHA256E-s6749536256` prefix which works ok so long as all objects in the annex are different sizes -- AWS CLI seems to only lets you filter by prefix not a glob. I could probably list everything and extract what I wanted with `jq` but I _think_ there are cost implications to listing everything (although I might be wrong about that, and it wouldn't be a lot of money for my use case in any event).
Fixing those two minor issues (the error reporting and the ability to get the list of chunks) would be a massive improvement to the usability of S3/glacier remotes IMHO, especially if the output of the latter were consumable by scripts.
I will also include my manual steps to restore in the final section, it would be amazing of git annex could learn to do all this itself though...
### What steps will reproduce the problem?
Given an S3/glacier remote with chunked objects in it just a `git annex get` for an object in it will do.
### What version of git-annex are you using? On what operating system?
8.20210223-1 on Debian, from the Debian archive.
### Please provide any additional information below.
Issue with `git annex get` error logging:
[[!format sh """
$ git annex get OBJECT.iso
get OBJECT.iso (from s3...)
HttpExceptionRequest Request {
host = "<REDACTED>"
port = 80
secure = False
requestHeaders = [("Date","Sun, 25 Apr 2021 09:42:56 GMT"),("Authorization","<REDACTED>")]
path = "/SHA256E-s6749536256--f76639fa11276b4045844e6110035c15e6803acc38d77847c2e4a2be1b1850ca.iso"
queryString = ""
method = "GET"
proxy = Nothing
rawBody = False
redirectCount = 10
responseTimeout = ResponseTimeoutDefault
requestVersion = HTTP/1.1
}
(StatusCodeException (Response {responseStatus = Status {statusCode = 404, statusMessage = "Not Found"}, responseVersion = HTTP/1.1, responseHeaders = [("x-amz-request-id","YVCPYD2RW4QEWMWN"),("x-amz-id-2","6dJY8ceWlLOSNIyTTchniLm5+cvJLovbMZL44YjNmViGwfChQSmWLl6VI6E5sFNDbMpwUeBhpbA="),("Content-Type","application/xml"),("Transfer-Encoding","chunked"),("Date","Sun, 25 Apr 2021 09:42:55 GMT"),("Server","AmazonS3")], responseBody = (), responseCookieJar = CJ {expose = []}, responseClose' = ResponseClose}) "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>SHA256E-s6749536256--f76639fa11276b4045844e6110035c15e6803acc38d77847c2e4a2be1b1850ca.iso</Key><RequestId>YVCPYD2RW4QEWMWN</RequestId><HostId>6dJY8ceWlLOSNIyTTchniLm5+cvJLovbMZL44YjNmViGwfChQSmWLl6VI6E5sFNDbMpwUeBhpbA=</HostId></Error>")
Unable to access these remotes: s3
No other repository is known to contain the file.
(Note that these git remotes have annex-ignore set: origin)
failed
git-annex: get: 1 failed
$ git annex get -d OBJECT.iso
[...]
(from s3...)
[2021-04-25 10:43:44.098104491] Path: "/SHA256E-s6749536256-S1000000000-C1--f76639fa11276b4045844e6110035c15e6803acc38d77847c2e4a2be1b1850ca.iso"
[...]
[2021-04-25 10:43:44.200461839] Response status: Status {statusCode = 403, statusMessage = "Forbidden"}
[...]
[2021-04-25 10:43:44.238156623] Response status: Status {statusCode = 404, statusMessage = "Not Found"}
[...]
*** error message as above ***
"""]]
It's notable that the status codes differ, since the chunk is present but not currently accessible while the unchunked just isn't there.
Manually fetching things:
[[!format sh """
: 1. Figure out the key and the prefix on it:
$ git annex info OBJECT.iso
file: OBJECT.iso
size: 6.75 gigabytes
key: SHA256E-s6749536256--f76639fa11276b4045844e6110035c15e6803acc38d77847c2e4a2be1b1850ca.iso
present: false
: 2. Find the number of chunks using the prefix:
$ aws s3api list-objects --bucket <BUCKET> --prefix SHA256E-s6749536256 | jq '.Contents[].Key'
"SHA256E-s6749536256-S1000000000-C1--f76639fa11276b4045844e6110035c15e6803acc38d77847c2e4a2be1b1850ca.iso"
"SHA256E-s6749536256-S1000000000-C2--f76639fa11276b4045844e6110035c15e6803acc38d77847c2e4a2be1b1850ca.iso"
"SHA256E-s6749536256-S1000000000-C3--f76639fa11276b4045844e6110035c15e6803acc38d77847c2e4a2be1b1850ca.iso"
"SHA256E-s6749536256-S1000000000-C4--f76639fa11276b4045844e6110035c15e6803acc38d77847c2e4a2be1b1850ca.iso"
"SHA256E-s6749536256-S1000000000-C5--f76639fa11276b4045844e6110035c15e6803acc38d77847c2e4a2be1b1850ca.iso"
"SHA256E-s6749536256-S1000000000-C6--f76639fa11276b4045844e6110035c15e6803acc38d77847c2e4a2be1b1850ca.iso"
"SHA256E-s6749536256-S1000000000-C7--f76639fa11276b4045844e6110035c15e6803acc38d77847c2e4a2be1b1850ca.iso"
: 3. Request a restore of those chunks:
$ for i in $(seq 1 7) ; do aws s3api restore-object --bucket <BUCKET> --key SHA256E-s6749536256-S1000000000-C${i}--f76639fa11276b4045844e6110035c15e6803acc38d77847c2e4a2be1b1850ca.iso --restore-request Days=1 ; done
: 4. Poll for completion, for DEEP_ARCHIVE restores happen on the order of hours.
$ until
for i in $(seq 1 7) ; do aws s3api head-object --bucket <BUCKET> --key SHA256E-s6749536256-S1000000000-C${i}--f76639fa11276b4045844e6110035c15e6803acc38d77847c2e4a2be1b1850ca.iso ; done | jq -r .Restore
git annex get OBJECT.iso
do echo "$(date): sleeping..." ; sleep 1h; done
ongoing-request="true"
ongoing-request="true"
...eventually becoming
ongoing-request="false", expiry-date="Mon, 26 Apr 2021 00:00:00 GMT"
... for all objects then the "git annex get" succeeds and the loop exits
"""]]
### Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders)
Yes, for loads of stuff. It's awesome, thanks!