DATA-PRESENT working for exporttree=yes remotes

Since the annex-tracking-branch is pushed first, git-annex has already
updated the export database by the time the DATA-PRESENT message arrives.
That means just using checkPresent is enough to verify that some file is
present on the special remote at the key's export location.

So, the simplest possible implementation of this happened to work!
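
Spelled out, the sequence that makes this work:

	1. The client uploads the file content directly to the special remote.
	2. The client commits a pointer file or symlink, and pushes the
	   annex-tracking-branch to the server.
	3. git-annex on the server updates the export database for the
	   pushed tree.
	4. The client sends DATA-PRESENT for the key.
	5. The proxy runs checkPresent, which finds the file at its export
	   location, and replies SUCCESS (or FAILURE if nothing is there).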

(I also tested this with chunked special remotes, and it works as long as
the chunk size used matches the configured chunk size. In that case, the
lack of a chunk log is not a problem. It's doubtful this will ever make
sense to use with a chunked special remote though, since that gets pretty
deep into re-implementing git-annex.)

Updated the client side upload tip with a missing step, and reorganized it
for clarity.
Joey Hess 2024-10-30 13:51:58 -04:00
parent 54dc1d6f6e
commit 126daf949d
3 changed files with 40 additions and 54 deletions


@@ -112,10 +112,7 @@ proxySpecialRemote protoversion r ihdl ohdl owaitv oclosedv mexportdb = go
 	go :: Annex ()
 	go = liftIO receivemessage >>= \case
 		Just (CHECKPRESENT k) -> do
-			tryNonAsync (Remote.checkPresent r k) >>= \case
-				Right True -> liftIO $ sendmessage SUCCESS
-				Right False -> liftIO $ sendmessage FAILURE
-				Left err -> liftIO $ propagateerror err
+			checkpresent k
 			go
 		Just (LOCKCONTENT _) -> do
 			-- Special remotes do not support locking content.
@@ -211,22 +208,14 @@ proxySpecialRemote protoversion r ihdl ohdl owaitv oclosedv mexportdb = go
 					nuketmp
 					giveup "protocol error"
 				else store >> nuketmp
-			Just DATA_PRESENT -> tryNonAsync (verifydatapresent k) >>= \case
-				Right True -> liftIO $ sendmessage SUCCESS
-				Right False -> liftIO $ sendmessage FAILURE
-				Left err -> liftIO $ propagateerror err
+			Just DATA_PRESENT -> checkpresent k
 			_ -> giveup "protocol error"
 
-	verifydatapresent k = case mexportdb of
-		Just exportdb -> liftIO (Export.getExportTree exportdb k) >>= \case
-			[] -> verifykey
-			-- XXX TODO check that one of the export locs is populated,
-			-- or for an annexobjects=yes special remote, the
-			-- annexobject file could be populated.
-			locs -> return True
-		Nothing -> verifykey
-	  where
-		verifykey = Remote.checkPresent r k
+	checkpresent k =
+		tryNonAsync (Remote.checkPresent r k) >>= \case
+			Right True -> liftIO $ sendmessage SUCCESS
+			Right False -> liftIO $ sendmessage FAILURE
+			Left err -> liftIO $ propagateerror err
 
 	storeput k af tmpfile = case mexportdb of
 		Just exportdb -> liftIO (Export.getExportTree exportdb k) >>= \case


@@ -21,38 +21,45 @@ and then you can upload whatever filenames you want to it, rather than
 needing to use the same filenames git-annex uses for storing keys in a S3
 bucket.
 
-Once the browser uploads the file to S3, you need to add a git-annex
-symlink or pointer file to the git repository. This can be done in the
-browser, using [js-git](https://github.com/creationix/js-git). Generating a
-git-annex key is not hard, just hash the file content before/while
-uploading it, and see [[internals/key_format]]. Write that to a pointer
-file, or make a symlink to the appropriate directory under
-.git/annex/objects (a bit harder). Commit it to git and push to your
-server using js-git.
-
-Now git-annex knows about the file. But it doesn't yet know it's been
-uploaded to the S3 special remote. To do this, you will need have your
-server set up to run git-annex. Set up the S3 special
-remote there. And make git-annex on the server a
+Along with the S3 bucket, you will need a server set up, which is where
+git-annex will run in a git repository. Set up the S3 special remote there.
+And make git-annex on the server a
 [[proxy|git-annex-updateproxy]] for the S3 special remote:
 
 	git-annex initremote s3 type=S3 exporttree=yes encryption=none bucket=mybucket
 	git config remote.s3.annex-proxy true
 	git-annex updateproxy
 
-For the web browser to be able to easily talk with git-annex on the server,
-you can run [[git-annex p2phttp|git-annex-p2phttp]].
-
 If the special remote is configured with exporttree=yes, be sure to also
 configure the annex-tracking-branch for it on the server:
 
 	git config remote.s3.annex-tracking-branch master
 
+Once the browser uploads the file to S3, you need to add a git-annex
+symlink or pointer file to the git repository. This can be done in the
+browser, using [js-git](https://github.com/creationix/js-git). Generating a
+git-annex key is not hard, just hash the file content before/while
+uploading it, and see [[internals/key_format]]. Write that to a pointer
+file, or make a symlink to the appropriate directory under
+.git/annex/objects (a bit harder). Commit it to git and push the branch
+("master" in this example) to your server using js-git.
+
+All that's left is to let git-annex know that the file has been uploaded to
+the S3 special remote. To accomplish this, the web browser will need to
+talk with git-annex on the server. The easy way to do that
+is to run [[git-annex p2phttp|git-annex-p2phttp]].
+
 The web browser will be speaking the [[doc/design/P2P_protocol_over_HTTP]].
 
 Make sure you have git-annex 10.20241031 or newer installed. That version
 extended the [[design/p2p_protocol]] with a `DATA-PRESENT` feature, which
 is just what you need.
 
-All the web browser needs to do is `POST /git-annex/$uuid/v4/put`
-with `data-present=true` included in the URL parameters, along with the
-key of the file that was added to the git repository.
-Replace `$uuid` with the UUID of the S3 special remote.
-You can look that up with eg `git config remote.s3.annex-uuid`.
+All the web browser needs to do, after uploading to S3 and pushing the git
+branch to the server, is `POST /git-annex/$uuid/v4/put` with
+`data-present=true` included in the URL parameters, along with the key of
+the file that was added to the git repository. Replace `$uuid` with the
+UUID of the S3 special remote. You can look that up with eg `git config
+remote.s3.annex-uuid`.
 
 When the git-annex HTTP server receives that request, since it is
 configured to be able to proxy for the S3 special remote, it will act the
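
To make the browser-side steps in the updated tip concrete, here is a
minimal TypeScript sketch of generating a SHA256E key and the pointer file
content. It assumes the key format described in [[internals/key_format]];
the function names are made up for illustration, the extension handling is
simplified (real git-annex limits how much of the extension is kept), and
the js-git commit and push are not shown.

	// Sketch: compute a git-annex SHA256E key for a File with Web Crypto,
	// and build the single-line pointer file content to commit via js-git.
	async function gitAnnexKey(file: File): Promise<string> {
	    const data = await file.arrayBuffer();
	    const digest = await crypto.subtle.digest("SHA-256", data);
	    const hex = Array.from(new Uint8Array(digest))
	        .map(b => b.toString(16).padStart(2, "0"))
	        .join("");
	    // SHA256E keys embed the file size and keep the filename extension.
	    const dot = file.name.indexOf(".");
	    const ext = dot === -1 ? "" : file.name.slice(dot);
	    return `SHA256E-s${file.size}--${hex}${ext}`;
	}

	// A pointer file is one line giving the object's location in the annex.
	function pointerFileContent(key: string): string {
	    return `/annex/objects/${key}\n`;
	}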
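
And a sketch of the final step, the `DATA-PRESENT` request itself. Only
`data-present=true` comes from the tip; the `key` parameter name, the
`serverUrl` argument, and any other required parameters are assumptions to
check against [[doc/design/P2P_protocol_over_HTTP]].

	// Sketch: after uploading to S3 and pushing the branch, tell the
	// git-annex HTTP server that the content is already present.
	async function reportDataPresent(
	    serverUrl: string, uuid: string, key: string
	): Promise<void> {
	    const url = `${serverUrl}/git-annex/${uuid}/v4/put` +
	        `?data-present=true&key=${encodeURIComponent(key)}`;
	    const response = await fetch(url, { method: "POST" });
	    if (!response.ok) {
	        throw new Error(`DATA-PRESENT request failed: ${response.status}`);
	    }
	}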


@@ -28,22 +28,6 @@ Planned schedule of work:
 
 ## remaining things to do in October
 
-* Streaming uploads to special remotes via the proxy. Possibly; if a
-  workable design can be developed. It seems difficult without changing the
-  external special remote protocol, unless a fifo is used. Make ORDERED
-  response in p2p protocol allow using a fifo?
-* Indirect uploads when proxying for special remote is an alternative that
-  would work for OpenNeuro's use case.
-* If not implementing upload streaming to proxied special remotes,
-  this needs to be addressed:
-
-  When an upload to a cluster is distributed to multiple special remotes,
-  a temporary file is written for each one, which may even happen in
-  parallel. This is a lot of extra work and may use excess disk space.
-  It should be possible to only write a single temp file.
-  (With streaming this wouldn't be an issue.)
-
 * Possibly some of the deferred items listed in following sections:
 
 ## items deferred until later for balanced preferred content and maxsize tracking
@@ -123,6 +107,12 @@ Planned schedule of work:
 * Support using a proxy when its url is a P2P address.
   (Eg tor-annex remotes.)
 
+* When an upload to a cluster is distributed to multiple special remotes,
+  a temporary file is written for each one, which may even happen in
+  parallel. This is a lot of extra work and may use excess disk space.
+  It should be possible to only write a single temp file.
+  (With streaming this wouldn't be an issue.)
+
 ## completed items for October's work on streaming through proxy to special remotes
 
 * Stream downloads through proxy for all special remotes that indicate