Merge branch 'master' into append
commit
2ce1eaf56a
6 changed files with 290 additions and 1 deletions
158
doc/bugs/S3_ACL_deprecation.mdwn
Normal file
|
@ -0,0 +1,158 @@
|
|||
### Please describe the problem.
|
||||
|
||||
Amazon has [deprecated ACLs](https://docs.aws.amazon.com/AmazonS3/latest/userguide/about-object-ownership.html)
|
||||
|
||||
> A majority of modern use cases in Amazon S3 no longer require the use of ACLs, and we recommend that you disable ACLs except in unusual circumstances where you need to control access for each object individually. With Object Ownership, you can disable ACLs and rely on policies for access control. When you disable ACLs, you can easily maintain a bucket with objects uploaded by different AWS accounts. You, as the bucket owner, own all the objects in the bucket and can manage access to them using policies.
|
||||
|
||||
They are encouraging everyone to [migrate to bucket policies](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-ownership-migrating-acls-prerequisites.html) instead.
|
||||
|
||||
But if such a bucket is public (meaning: writes need credentials, reads are open to the world):
|
||||
|
||||
* `public=yes` causes `git annex copy --to` to always set an ACL, which fails, and that in turn fails the entire upload
|
||||
* But setting `public=no` causes `publicurl` to be ignored by `git annex copy --from`, failing the download
|
||||
|
||||
#### Feature Request
|
||||
|
||||
|
||||
* If `public=yes`, instead of trying to set an ACL, first try `HTTP HEAD` on the newly uploaded object without using the AWS credentials. Only if that fails, fall back to trying to set an ACL using the credentials. And if you get AccessControlListNotSupported (i.e. the error due to BucketOwnerEnforced), then give a warning that the bucket policy is not configured for public access.
|
||||
* Make `publicurl` orthogonal to `public`: if set, `git annex copy --from` should _always_ use it unconditionally.
|
||||
* Update [the docs here](https://git-annex.branchable.com/special_remotes/S3/) to explain how to set up a public bucket policy as recommended by Amazon, and that `public=yes` will either try to confirm that the bucket policy is public, or will fall back to using (legacy) ACLs.
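
The decision flow requested in the first bullet could look roughly like the sketch below. This is only an illustration in Python, not git-annex code: the function names and the returned labels are made up for this example, and the only S3 detail taken from this report is the `AccessControlListNotSupported` error code.

```python
import urllib.error
import urllib.request


def anonymous_head_status(url, timeout=10):
    # Plain HEAD request; no AWS credentials are attached to it.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def after_upload(anonymous_status):
    # If the object is already world-readable, no ACL needs to be set.
    if anonymous_status == 200:
        return "done"
    return "try-acl"


def after_acl_attempt(s3_error_code):
    # s3_error_code is None when setting the ACL succeeded.
    if s3_error_code is None:
        return "done"
    if s3_error_code == "AccessControlListNotSupported":
        # Bucket owner enforced: warn instead of failing the upload.
        return "warn-policy-not-public"
    return "fail"
```

With this shape, a bucket that is public via a bucket policy never reaches the ACL call at all, and a bucket with ACLs disabled but no public policy produces a warning rather than a failed upload.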
|
||||
|
||||
|
||||
### What steps will reproduce the problem?
|
||||
|
||||
In a bucket I run, I reset the ACLs on that bucket to Amazon's default permissions:
|
||||
|
||||
* Bucket owner (your AWS account):
  * Objects:
    * List
    * Write
  * Bucket ACL (i.e. what ACLs are applied by default to all objects):
    * Read
    * Write
|
||||
|
||||
With that set, Amazon also let me set
|
||||
|
||||
> Object ownership: Bucket owner enforced
|
||||
|
||||
|
||||
This should be the **default configuration** for any new bucket created now, so you only need to do the above if you're migrating an existing bucket like I was; for reproducing, just creating an empty bucket should be enough.
|
||||
|
||||
It should look like this:
|
||||
|
||||
[[!img S3_bucket.png align="right" size="" alt=""]]
|
||||
|
||||
With that in place, I wrote and attached this Bucket Policy to make it public:
|
||||
|
||||
```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": [
                "arn:aws:s3:::BUCKET-NAME",
                "arn:aws:s3:::BUCKET-NAME/*"
            ]
        }
    ]
}
```
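
For anyone who wants to generate the same policy for their own bucket, here is a small Python sketch. The function name `public_read_policy` is just something chosen for this example; the policy content mirrors the one above.

```python
import json


def public_read_policy(bucket):
    # Same statement as the policy above, with the bucket name filled in.
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
            }
        ],
    }


if __name__ == "__main__":
    # BUCKET-NAME is a placeholder, as in the report.
    print(json.dumps(public_read_policy("BUCKET-NAME"), indent=4))
```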
|
||||
|
||||
[[!img S3_bucket-perms.png align="right" size="" alt=""]]
|
||||
|
||||
|
||||
I told `git-annex` about it with
|
||||
|
||||
```
git annex initremote amazon type=S3 bucket=BUCKET_NAME public=yes publicurl=https://BUCKET_NAME.s3.amazonaws.com autoenable=true
```
|
||||
|
||||
But, attempting to use the new setup failed:
|
||||
|
||||
```
|
||||
$ git annex copy --to amazon what.nii.gz
|
||||
copy what.nii.gz (checking amazon...) (to amazon...)
|
||||
41% 8.15 MiB 20 MiB/s 0s
|
||||
S3Error {s3StatusCode = Status {statusCode = 400, statusMessage = "Bad Request"}, s3ErrorCode = "AccessControlListNotSupported", s3ErrorMessage = "The bucket does not allow ACLs", s3ErrorResource = Nothing, s3ErrorHostId = Just "a6+ieujj4z3Z4P8ooA306DdbGAoxWDiXd6O2ZwjdfapGnuOGPyL5/WQ4UBEytR80FG+5b6xdlsM=", s3ErrorAccessKeyId = Nothing, s3ErrorStringToSign = Nothing, s3ErrorBucket = Nothing, s3ErrorEndpointRaw = Nothing, s3ErrorEndpoint = Nothing}
|
||||
|
||||
32% 6.43 MiB 16 MiB/s 0s
|
||||
S3Error {s3StatusCode = Status {statusCode = 400, statusMessage = "Bad Request"}, s3ErrorCode = "AccessControlListNotSupported", s3ErrorMessage = "The bucket does not allow ACLs", s3ErrorResource = Nothing, s3ErrorHostId = Just "bFOgMomROCOes9yI6HZHysQGoZaTbsPI5b7rHjcTI0wA8Yx5Dm1JOky9BvXvpcXxzY1kVt48FRQ=", s3ErrorAccessKeyId = Nothing, s3ErrorStringToSign = Nothing, s3ErrorBucket = Nothing, s3ErrorEndpointRaw = Nothing, s3ErrorEndpoint = Nothing}
|
||||
|
||||
37% 7.37 MiB 21 MiB/s 0s
|
||||
S3Error {s3StatusCode = Status {statusCode = 400, statusMessage = "Bad Request"}, s3ErrorCode = "AccessControlListNotSupported", s3ErrorMessage = "The bucket does not allow ACLs", s3ErrorResource = Nothing, s3ErrorHostId = Just "hqd4HRNk5yp3tKJ6yMhcECEpCjBw8qB6oTpKF3PaOsYFeVG0C+dGI06xq3zgmvnPoFUttI040sY=", s3ErrorAccessKeyId = Nothing, s3ErrorStringToSign = Nothing, s3ErrorBucket = Nothing, s3ErrorEndpointRaw = Nothing, s3ErrorEndpoint = Nothing}
|
||||
|
||||
39% 7.81 MiB 21 MiB/s 0s
|
||||
S3Error {s3StatusCode = Status {statusCode = 400, statusMessage = "Bad Request"}, s3ErrorCode = "AccessControlListNotSupported", s3ErrorMessage = "The bucket does not allow ACLs", s3ErrorResource = Nothing, s3ErrorHostId = Just "7m7wwG5woSPmICIuXr9QnBOEjUikuyzHSebMLuaNyZMc2Ki2vaqKpU9U+GOTYmR/NzFjOeyxngk=", s3ErrorAccessKeyId = Nothing, s3ErrorStringToSign = Nothing, s3ErrorBucket = Nothing, s3ErrorEndpointRaw = Nothing, s3ErrorEndpoint = Nothing}
|
||||
failed
|
||||
git-annex: copy: 1 failed
|
||||
```
|
||||
|
||||
|
||||
### What version of git-annex are you using? On what operating system?
|
||||
|
||||
|
||||
```
|
||||
$ git annex version
|
||||
git-annex version: 8.20210223
|
||||
build flags: Assistant Webapp Pairing Inotify DBus DesktopNotify TorrentParser MagicMime Feeds Testsuite S3 WebDAV
|
||||
dependency versions: aws-0.22 bloomfilter-2.0.1.0 cryptonite-0.26 DAV-1.3.4 feed-1.3.0.1 ghc-8.8.4 http-client-0.6.4.1 persistent-sqlite-2.10.6.2 torrent-10000.1.1 uuid-1.3.13 yesod-1.6.1.0
|
||||
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL X*
|
||||
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso borg hook external
|
||||
operating system: linux x86_64
|
||||
supported repository versions: 8
|
||||
upgrade supported from repository versions: 0 1 2 3 4 5 6 7
|
||||
local repository version: 8
|
||||
$ cat /etc/os-release
|
||||
PRETTY_NAME="Ubuntu 22.04 LTS"
|
||||
NAME="Ubuntu"
|
||||
VERSION_ID="22.04"
|
||||
VERSION="22.04 LTS (Jammy Jellyfish)"
|
||||
VERSION_CODENAME=jammy
|
||||
ID=ubuntu
|
||||
ID_LIKE=debian
|
||||
HOME_URL="https://www.ubuntu.com/"
|
||||
SUPPORT_URL="https://help.ubuntu.com/"
|
||||
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
|
||||
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
|
||||
UBUNTU_CODENAME=jammy
|
||||
```
|
||||
|
||||
### Please provide any additional information below.
|
||||
|
||||
|
||||
Disabling `public` allows uploads:
|
||||
|
||||
```
$ git annex enableremote amazon public=no
enableremote amazon ok
(recording state in git...)
$ git annex copy --to amazon what.nii.gz
copy what.nii.gz (checking amazon...) (to amazon...)
ok
```
|
||||
|
||||
But that then causes a new problem: downloads fail when other users try to fetch the updated dataset:
|
||||
|
||||
```
$ git annex get what.nii.gz
get what.nii.gz (from amazon...)

S3 bucket does not allow public access; Set both AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to use S3

cannot download content

Unable to access these remotes: amazon

(Note that these git remotes have annex-ignore set: origin)
failed
git-annex: get: 1 failed
```
|
||||
|
||||
### Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders)
|
||||
|
||||
We use git-annex to share large datasets with the scientific community at https://github.com/spine-generic/data-multi-subject !
|
||||
|
||||
|
|
@ -0,0 +1,9 @@
|
|||
[[!comment format=mdwn
|
||||
username="nick.guenther@e418ed3c763dff37995c2ed5da4232a7c6cee0a9"
|
||||
nickname="nick.guenther"
|
||||
avatar="http://cdn.libravatar.org/avatar/9e85c6ca61c3f877fef4f91c2bf6e278"
|
||||
subject="comment 38"
|
||||
date="2022-07-15T15:55:38Z"
|
||||
content="""
|
||||
Thanks for taking the time to direct me, Joey. I usually find myself getting lost in this wiki amongst all the old notes and worklogs and documentation about old features, so sometimes someone coming along with a sign post is a very kind help :)
|
||||
"""]]
|
|
@ -0,0 +1,39 @@
|
|||
[[!comment format=mdwn
|
||||
username="joey"
|
||||
subject="""comment 10"""
|
||||
date="2022-07-15T15:50:00Z"
|
||||
content="""
|
||||
Ran `git-annex whereis --quiet` over 10000 annexed files. With journal
|
||||
locking on read, it took 11.71 seconds. Without journal locking,
|
||||
it took 11.73 seconds. No speed difference.
|
||||
|
||||
And strace showed why: This only opened the journal directory once, noticed
|
||||
it was empty, and skipped ever trying to read any files from it! If there
|
||||
are files, it stages them and still manages to not need to read from the
|
||||
journal after that. Nice optimisation from earlier this year. :-)
|
||||
|
||||
I thought that --batch commands would still check the journal files,
|
||||
but surprisingly, they don't seem to. That was a bug:
|
||||
[[bugs/batch_commands_miss_journalled_changes_made_while_running]]
|
||||
|
||||
After fixing that, I benchmarked feeding 10000 filenames into `git-annex
|
||||
whereis --batch`. With journal locking on read, it took 18.43 seconds.
|
||||
Without journal locking, it took 17.22 seconds. Before that bug fix,
|
||||
with or without journal locking, it took 16.59 seconds.
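
As a quick check of what those timings mean in relative terms, here is a small Python sketch. The variable names are mine; the three numbers are the `--batch` timings quoted above.

```python
def overhead_percent(slower, faster):
    # Relative slowdown of `slower` compared to `faster`, in percent.
    return (slower - faster) / faster * 100.0


before_fix = 16.59  # seconds, before the --batch bug fix
no_lock = 17.22     # seconds, after the fix, without journal locking on read
with_lock = 18.43   # seconds, after the fix, with journal locking on read

locking_cost = round(overhead_percent(with_lock, no_lock), 1)
fix_cost = round(overhead_percent(no_lock, before_fix), 1)
total_cost = round(overhead_percent(with_lock, before_fix), 1)
print(locking_cost, fix_cost, total_cost)
```

That works out to roughly 7% for the locking itself, a bit under 4% for the bug fix, and about 11% for both together.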
|
||||
|
||||
So, if the slow down caused by journal locking on read is a problem for
|
||||
anyone, a mode could be added that makes --batch not check the journal for
|
||||
changes made after the command started. That would make it run as fast as
|
||||
before that bug fix.
|
||||
|
||||
There might be commands other than --batch commands that both read and
|
||||
write git-annex branch data, and so end up checking the journal on every
|
||||
read, since writing invalidates the above optimisation. Not sure what
|
||||
commands that would be, maybe `git-annex drop`? Anyway, such commands are
|
||||
probably doing more expensive things than locking the journal; they're not
|
||||
query commands.
|
||||
|
||||
That makes me ok with adding the locking on read, if needed for append.
|
||||
(Or similar added overheads to journal reads.)
|
||||
For now, I've committed it to the `append` branch.
|
||||
"""]]
|
|
@ -0,0 +1,73 @@
|
|||
[[!comment format=mdwn
|
||||
username="joey"
|
||||
subject="""comment 11"""
|
||||
date="2022-07-15T17:58:27Z"
|
||||
content="""
|
||||
The remaining problem with appending is crash safety. If an append is
|
||||
not atomic, a journal file could end up having a truncated line written to
|
||||
it.
|
||||
|
||||
That seems unlikely, but see the bugzilla page above; it can happen on a
|
||||
kill signal at least.
|
||||
|
||||
So, can append somehow be made atomic? How about this:
|
||||
|
||||
Make `.git/annex/journal-append/` which contains append files,
|
||||
that are the same as journal files, but in the process of being appended.
|
||||
And make it also contain size files, which contain a number, the size of
|
||||
the append file before anything got appended to it.
|
||||
|
||||
Then, to append to a journal file:
|
||||
|
||||
1. When the journal file exists, and the append file does not,
   move the journal file to the append file.
2. When the journal file does not exist and the append file does,
   truncate it to the size in the size file. (If the size file does not
   exist, skip truncating.)
3. When the journal and append file both exist, truncate the append file,
   and add the journal file's content to what is going to be appended.
   (This is in case an old git-annex wrote a new value to the
   journal file, not knowing about the append file.)
4. Write the current size of the append file to the size file.
5. Append to the append file.
6. Move the append file back to the journal file.
7. Delete the size file.
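
To make the procedure above concrete, here is a rough Python sketch of it. This is not git-annex code, and it ignores the journal lock entirely. Assumptions made for the example: the size file sits next to the append file with a `.size` suffix, and in the case where both files exist the append file is truncated to the recorded size (that is one reading of "truncate the append file"; the text does not say which size is meant).

```python
import os


def append_to_journal(journal_dir, append_dir, name, data):
    os.makedirs(journal_dir, exist_ok=True)
    os.makedirs(append_dir, exist_ok=True)
    journal = os.path.join(journal_dir, name)
    append = os.path.join(append_dir, name)
    size_file = append + ".size"  # hypothetical naming scheme

    pending = b""
    have_journal = os.path.exists(journal)
    have_append = os.path.exists(append)
    if have_journal and not have_append:
        # Normal case: take the journal file out of the journal.
        os.rename(journal, append)
    elif have_append:
        # An earlier append was interrupted: drop any partial write.
        if os.path.exists(size_file):
            with open(size_file) as f:
                os.truncate(append, int(f.read()))
        if have_journal:
            # An old git-annex wrote the journal file in the meantime.
            with open(journal, "rb") as f:
                pending = f.read()

    # Record the size before appending, so a crash can be undone.
    size = os.path.getsize(append) if os.path.exists(append) else 0
    with open(size_file, "w") as f:
        f.write(str(size))
        f.flush()
        os.fsync(f.fileno())

    with open(append, "ab") as f:
        f.write(pending + data)
        f.flush()
        os.fsync(f.fileno())

    # rename() replaces the journal file atomically on POSIX systems.
    os.replace(append, journal)
    os.remove(size_file)
```

The point of the size file is that a truncated line can only ever exist in the append file, never in the journal file itself, and can always be cut off again.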
|
||||
|
||||
When reading journalled files, it would need to also check the append
|
||||
file, and only read the recorded size. When both the append file and the
|
||||
journal file exist, it would read both and combine them. This change would
|
||||
slow down reads slightly, though as seen in comment #10, mostly only for
|
||||
--batch commands.
|
||||
|
||||
(It may not be necessary to lock on read actually. It can check for the
|
||||
append file and read the size file. If a write is happening at the
|
||||
same time, the size file may not exist yet, or may have been deleted
|
||||
already. In either case, reading the whole append file is ok.
|
||||
Should be possible to make this race-safe without locking.)
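
A reader along those lines might look like the Python sketch below. Again this is only an illustration: the `.size` suffix and the order in which the two files are combined (append file first, then journal file) are assumptions of this example, and it makes no attempt at the race-safety discussed above.

```python
import os


def read_journalled(journal_dir, append_dir, name):
    journal = os.path.join(journal_dir, name)
    append = os.path.join(append_dir, name)
    size_file = append + ".size"  # hypothetical naming scheme
    content = b""

    try:
        with open(append, "rb") as f:
            data = f.read()
    except FileNotFoundError:
        data = None
    if data is not None:
        try:
            with open(size_file) as f:
                # Only the part recorded before the append is trusted.
                data = data[: int(f.read())]
        except FileNotFoundError:
            # No size file: reading the whole append file is ok.
            pass
        content += data

    try:
        with open(journal, "rb") as f:
            content += f.read()
    except FileNotFoundError:
        pass
    return content
```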
|
||||
|
||||
When staging the journal, it would need to first handle any interrupted
|
||||
appends, by checking if any append files exist.
|
||||
|
||||
1. Truncate the append file to the value in the size file
|
||||
2. Read the value of the file from the branch.
|
||||
3. Append the value of the file from the branch to the append file.
|
||||
(This is to handle a case with old git-annex having written
|
||||
divergent data to the branch, see below.)
|
||||
4. Move the append file back to the journal.
|
||||
5. Delete the size file.
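
The recovery steps above, sketched in Python under the same assumptions as before (flat file names, a `.size` suffix for size files). `read_branch` is a hypothetical callable standing in for however the file's value is read from the git-annex branch; it is not a real git-annex function.

```python
import os


def recover_interrupted_appends(journal_dir, append_dir, read_branch):
    if not os.path.isdir(append_dir):
        return []
    os.makedirs(journal_dir, exist_ok=True)
    recovered = []
    for name in sorted(os.listdir(append_dir)):
        if name.endswith(".size"):
            continue
        append = os.path.join(append_dir, name)
        size_file = append + ".size"
        # Cut off anything written by the interrupted append.
        if os.path.exists(size_file):
            with open(size_file) as f:
                os.truncate(append, int(f.read()))
        # Add the branch's value, in case an old git-annex committed
        # divergent data while the append file was hidden from it.
        with open(append, "ab") as f:
            f.write(read_branch(name))
            f.flush()
            os.fsync(f.fileno())
        os.replace(append, os.path.join(journal_dir, name))
        if os.path.exists(size_file):
            os.remove(size_file)
        recovered.append(name)
    return recovered
```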
|
||||
|
||||
----
|
||||
|
||||
When a new git-annex is doing an append and an old git-annex is also in use,
|
||||
the old git-annex will not see files in the journal that are in the process
|
||||
of being appended to. So it might use out of date information for queries.
|
||||
When it's making a write, it has always read first with the journal locked,
|
||||
so it will block until the append is complete. So it will not use out of
|
||||
date information for writes.
|
||||
|
||||
Only when something was written to the journal, but not committed to the
|
||||
branch, and then an append happened but got interrupted will the old
|
||||
git-annex miss data. It will not see that data, and might make its own
|
||||
divergent changes, that get committed to the branch. The new git-annex
|
||||
will need to deal with this when handling interrupted appends.
|
||||
"""]]
|
|
@ -0,0 +1,10 @@
|
|||
[[!comment format=mdwn
|
||||
username="yarikoptic"
|
||||
avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4"
|
||||
subject="comment 12"
|
||||
date="2022-07-15T19:33:50Z"
|
||||
content="""
|
||||
> It will not see that data, and might make its own divergent changes, that get committed to the branch.
|
||||
|
||||
FWIW, I'm not sure when, or if, I would use that mode then, for fear of introducing inconsistencies. Or do you see a legitimate use case (like mine) where it should be safe?
|
||||
"""]]
|
|
@ -1,6 +1,6 @@
|
|||
Name: git-annex
|
||||
Version: 10.20220624
|
||||
Cabal-Version: >= 1.10
|
||||
Cabal-Version: 1.12
|
||||
License: AGPL-3
|
||||
Maintainer: Joey Hess <id@joeyh.name>
|
||||
Author: Joey Hess
|
||||
|
|