Merge branch 'master' into proxy

Joey Hess 2024-06-07 12:35:47 -04:00
commit 6568ba4904
3 changed files with 82 additions and 16 deletions


@@ -0,0 +1,33 @@
[[!comment format=mdwn
username="joey"
subject="""comment 7"""
date="2024-06-04T15:15:36Z"
content="""
Decoding the export.log, we have these events:

Tue Aug 4 13:44:10 2020 (PST): An export is run on an openneuro worker
sending to `s3-PRIVATE`, of b78b723042e6d7a967c806b52258e8554caa1696, which
is now lost to history. After that export completed, there was a subsequent
started but not completed export of
ae2937297eb1b4f6c9bfdfcf9d7a41b1adcea32e, also lost to history.

Fri Jan 19 21:04:26 2024: An export was run on the same worker, sending to
an `s3-PUBLIC` (not the current one, but one that has been marked dead and
forgotten), of ae2937297eb1b4f6c9bfdfcf9d7a41b1adcea32e. After that export
completed, there was a subsequent started but not completed export of
28b655e8207f916122bbcbd22c0369d86bb4ffc1.

Later the same day, an export was run on the same worker, sending to
`s3-PUBLIC` (the current one), of 28b655e8207f916122bbcbd22c0369d86bb4ffc1.
This export completed.

Interesting that two exports were apparently started but left incomplete.
This could have been because git-annex was interrupted, which would go some
way toward confirming my analysis of this bug. But it is also possible that
there was an error exporting one or more files.

According to Nell, the git history of main was rewritten to remove a large
file from git. The tree 28b655e8207f916122bbcbd22c0369d86bb4ffc1 appears
to still contain the large binary file. No commit in main references it.
It did get grafted into the git-annex branch, which is why it was not lost.
"""]]


@@ -189,24 +189,60 @@ The remote interface operates on object files stored on disk. See
[[todo/transitive_transfers]] for discussion of that problem. If proxies
get implemented, that problem should be revisited.

## chunking

When the proxy is in front of a special remote that is chunked,
where does the chunking happen? It could happen on the client, or on the
proxy.

Git remotes don't ever do chunking currently, so chunking on the client
would need changes there.

Also, a given upload via a proxy may get sent to several special remotes,
each with different chunk sizes, or perhaps some not chunked and some
chunked. For uploads to be efficient, chunking needs to happen on the proxy.
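
A rough sketch of that fan-out in Haskell (none of these names are real
git-annex types; `Node`, `fanOut`, and `chunksOf` are made up for
illustration): a single object received once by the proxy is split per
node, according to each node's chunk size.

```haskell
-- A minimal sketch, not git-annex code: one upload arriving at the proxy
-- may have to be chunked differently for each special remote behind it.
import qualified Data.ByteString as B

-- Hypothetical description of a remote behind the proxy.
data Node = Node
    { nodeName :: String
    , nodeChunkSize :: Maybe Int  -- Nothing = not chunked
    }

-- Split an object into fixed-size chunks (assumes a positive chunk size).
chunksOf :: Int -> B.ByteString -> [B.ByteString]
chunksOf n b
    | B.null b = []
    | otherwise = let (h, t) = B.splitAt n b in h : chunksOf n t

-- One upload received by the proxy fans out as differently chunked
-- streams, one per node; unchunked nodes get the object whole.
fanOut :: B.ByteString -> [Node] -> [(String, [B.ByteString])]
fanOut content = map $ \n ->
    ( nodeName n
    , maybe [content] (`chunksOf` content) (nodeChunkSize n)
    )
```

If the client did the chunking instead, it would have to repeat this work
once per destination, re-sending the object in each chunk size.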

## encryption

When the proxy is in front of a special remote that uses encryption, where
does the encryption happen? It could either happen on the client before
sending to the proxy, or the proxy could do the encryption since it
communicates with the special remote.

If the client does not want the proxy to see unencrypted data,
they would obviously prefer encryption happens locally.
(Chunking has the same problem.)

But, the proxy could be the only thing that has access to a security key
that is used in encrypting a special remote that's located behind it.
There's a security benefit there too.

So there are kind of two different perspectives here that can have
different opinions.

Also if encryption for a special remote behind a proxy happened
client-side, and the client relied on that, nothing would stop the proxy
from replacing that encrypted special remote with an unencrypted remote.
Then the client-side encryption would not happen, the user would not
notice, and the proxy could see their unencrypted content.

Of course, if a client really wanted to, they could make a special remote
that uses the remote behind the proxy as a key/value backend.
Then the client could encrypt locally.
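
As a sketch of that idea (these types are invented for illustration, not
git-annex's actual remote interface), such a wrapper only needs to sit
between the client and the key/value operations of the proxied remote:

```haskell
-- A minimal sketch, not git-annex code: a client-side encrypting layer
-- wrapped around the key/value-style interface a proxied remote presents,
-- so the proxy only ever sees ciphertext.
import qualified Data.ByteString as B

type Key = String                           -- stand-in for a git-annex key
type Cipher = B.ByteString -> B.ByteString  -- stand-in for a real cipher

-- The minimal key/value operations offered by the remote behind the proxy.
data KVRemote = KVRemote
    { kvStore    :: Key -> B.ByteString -> IO Bool
    , kvRetrieve :: Key -> IO (Maybe B.ByteString)
    }

-- Wrap a remote so content is encrypted before it ever leaves the client.
encryptedLayer :: Cipher -> Cipher -> KVRemote -> KVRemote
encryptedLayer encrypt decrypt r = KVRemote
    { kvStore    = \k content -> kvStore r k (encrypt content)
    , kvRetrieve = \k -> fmap (fmap decrypt) (kvRetrieve r k)
    }
```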

On the implementation side, git-annex's git remotes don't currently ever do
encryption. And special remotes don't communicate via the P2P protocol with
a git remote. So none of git-annex's existing remote implementations would
be able to handle client-side encryption.

There's potentially a layering problem here, because exactly how encryption
(or chunking) works can vary depending on the type of special remote.
Encrypted and chunked special remotes first chunk, then encrypt.
So if chunking happens on the proxy, encryption *must* also happen there.

So overall, it seems better to do proxy-side encryption. But it may be
worth adding a special remote that does its own client-side encryption
in front of the proxy.
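
To make the ordering constraint above concrete, a minimal sketch (again
with invented names, not git-annex's real chunking or crypto code):

```haskell
-- A minimal sketch, not git-annex code: chunk first, then encrypt each
-- chunk. Whichever side performs the chunking therefore also has to
-- perform the encryption, which argues for doing both on the proxy.
import qualified Data.ByteString as B

type Cipher = B.ByteString -> B.ByteString  -- stand-in for a real cipher

-- Split an object into fixed-size chunks (assumes a positive chunk size).
chunksOf :: Int -> B.ByteString -> [B.ByteString]
chunksOf n b
    | B.null b = []
    | otherwise = let (h, t) = B.splitAt n b in h : chunksOf n t

-- The order matters: encrypting first would hide the chunk boundaries
-- from whoever needs to produce them.
prepareForUpload :: Int -> Cipher -> B.ByteString -> [B.ByteString]
prepareForUpload chunkSize encrypt = map encrypt . chunksOf chunkSize
```

A client wanting end-to-end encryption despite this would need the kind of
wrapping special remote described earlier.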
## cycles


@@ -34,16 +34,13 @@ For June's work on [[design/passthrough_proxy]], implementation plan:
1. Add `git-annex updateproxy` command and remote.name.annex-proxy
configuration. (done)
2. Remote instantiation for proxies. (done)
3. Implement proxying in git-annex-shell.
4. Let `storeKey` return a list of UUIDs where content was stored,
and make proxies accept uploads directed at them, rather than a specific
instantiated remote, and fan out the upload to whatever nodes behind