DATA-PRESENT working for exporttree=yes remotes

Since the annex-tracking-branch is pushed first, git-annex has already
updated the export database by the time the DATA-PRESENT message arrives.
That means just using checkPresent is enough to verify that some file is
present on the special remote at the key's export location.

So, the simplest possible implementation of this happened to work!
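
Spelled out, the sequence that makes this work:

	1. The client uploads the file content directly to the special remote.
	2. The client commits a pointer file or symlink, and pushes the
	   annex-tracking-branch to the server.
	3. git-annex on the server updates the export database for the
	   pushed tree.
	4. The client sends DATA-PRESENT for the key.
	5. The proxy runs checkPresent, which finds the file at its export
	   location, and replies SUCCESS (or FAILURE if nothing is there).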

(I also tested this with chunked special remotes, and it works as long as
the chunk size used matches the configured chunk size. In that case, the
lack of a chunk log is not a problem. It's doubtful this will ever make
sense to use with a chunked special remote though, since that gets pretty
deep into re-implementing git-annex.)

Updated the client side upload tip with a missing step, and reorganized it
for clarity.
Joey Hess 2024-10-30 13:51:58 -04:00
parent 54dc1d6f6e
commit 126daf949d
3 changed files with 40 additions and 54 deletions


@@ -112,10 +112,7 @@ proxySpecialRemote protoversion r ihdl ohdl owaitv oclosedv mexportdb = go
 	go :: Annex ()
 	go = liftIO receivemessage >>= \case
 		Just (CHECKPRESENT k) -> do
-			tryNonAsync (Remote.checkPresent r k) >>= \case
-				Right True -> liftIO $ sendmessage SUCCESS
-				Right False -> liftIO $ sendmessage FAILURE
-				Left err -> liftIO $ propagateerror err
+			checkpresent k
 			go
 		Just (LOCKCONTENT _) -> do
 			-- Special remotes do not support locking content.
@@ -211,22 +208,14 @@ proxySpecialRemote protoversion r ihdl ohdl owaitv oclosedv mexportdb = go
 					nuketmp
 					giveup "protocol error"
 				else store >> nuketmp
-			Just DATA_PRESENT -> tryNonAsync (verifydatapresent k) >>= \case
-				Right True -> liftIO $ sendmessage SUCCESS
-				Right False -> liftIO $ sendmessage FAILURE
-				Left err -> liftIO $ propagateerror err
+			Just DATA_PRESENT -> checkpresent k
 			_ -> giveup "protocol error"
 
-	verifydatapresent k = case mexportdb of
-		Just exportdb -> liftIO (Export.getExportTree exportdb k) >>= \case
-			[] -> verifykey
-			-- XXX TODO check that one of the export locs is populated,
-			-- or for an annexobjects=yes special remote, the
-			-- annexobject file could be populated.
-			locs -> return True
-		Nothing -> verifykey
-	  where
-		verifykey = Remote.checkPresent r k
+	checkpresent k =
+		tryNonAsync (Remote.checkPresent r k) >>= \case
+			Right True -> liftIO $ sendmessage SUCCESS
+			Right False -> liftIO $ sendmessage FAILURE
+			Left err -> liftIO $ propagateerror err
 
 	storeput k af tmpfile = case mexportdb of
 		Just exportdb -> liftIO (Export.getExportTree exportdb k) >>= \case


@@ -21,38 +21,45 @@ and then you can upload whatever filenames you want to it, rather than
 needing to use the same filenames git-annex uses for storing keys in a S3
 bucket.
 
-Once the browser uploads the file to S3, you need to add a git-annex
-symlink or pointer file to the git repository. This can be done in the
-browser, using [js-git](https://github.com/creationix/js-git). Generating a
-git-annex key is not hard, just hash the file content before/while
-uploading it, and see [[internals/key_format]]. Write that to a pointer
-file, or make a symlink to the appropriate directory under
-.git/annex/objects (a bit harder). Commit it to git and push to your
-server using js-git.
-
-Now git-annex knows about the file. But it doesn't yet know it's been
-uploaded to the S3 special remote. To do this, you will need have your
-server set up to run git-annex. Set up the S3 special
-remote there. And make git-annex on the server a
+Along with the S3 bucket, you will need a server set up, which is where
+git-annex will run in a git repository. Set up the S3 special remote there.
+And make git-annex on the server a
 [[proxy|git-annex-updateproxy]] for the S3 special remote:
 
 	git-annex initremote s3 type=S3 exporttree=yes encryption=none bucket=mybucket
 	git config remote.s3.annex-proxy true
 	git-annex updateproxy
 
-For the web browser to be able to easily talk with git-annex on the server,
-you can run [[git-annex p2phttp|git-annex-p2phttp]].
-
 If the special remote is configured with exporttree=yes, be sure to also
 configure the annex-tracking-branch for it on the server:
 
 	git config remote.s3.annex-tracking-branch master
 
+Once the browser uploads the file to S3, you need to add a git-annex
+symlink or pointer file to the git repository. This can be done in the
+browser, using [js-git](https://github.com/creationix/js-git). Generating a
+git-annex key is not hard, just hash the file content before/while
+uploading it, and see [[internals/key_format]]. Write that to a pointer
+file, or make a symlink to the appropriate directory under
+.git/annex/objects (a bit harder). Commit it to git and push the branch
+("master" in this example) to your server using js-git.
+
+All that's left is to let git-annex know that the file has been uploaded to
+the S3 special remote. To accomplish this, the web browser will need to
+talk with git-annex on the server. The easy way to do that
+is to run [[git-annex p2phttp|git-annex-p2phttp]].
+
 The web browser will be speaking the [[doc/design/P2P_protocol_over_HTTP]].
 
 Make sure you have git-annex 10.20241031 or newer installed. That version
 extended the [[design/p2p_protocol]] with a `DATA-PRESENT` feature, which
 is just what you need.
 
-All the web browser needs to do is `POST /git-annex/$uuid/v4/put`
-with `data-present=true` included in the URL parameters, along with the
-key of the file that was added to the git repository.
-Replace `$uuid` with the UUID of the S3 special remote.
-You can look that up with eg `git config remote.s3.annex-uuid`.
+All the web browser needs to do, after uploading to S3 and pushing the git
+branch to the server, is `POST /git-annex/$uuid/v4/put` with
+`data-present=true` included in the URL parameters, along with the key of
+the file that was added to the git repository. Replace `$uuid` with the
+UUID of the S3 special remote. You can look that up with eg `git config
+remote.s3.annex-uuid`.
 
 When the git-annex HTTP server receives that request, since it is
 configured to be able to proxy for the S3 special remote, it will act the
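
To make the browser-side steps in the updated tip concrete, here is a
minimal TypeScript sketch of generating a SHA256E key and the pointer file
content. It assumes the key format described in [[internals/key_format]];
the function names are made up for illustration, the extension handling is
simplified (real git-annex limits how much of the extension is kept), and
the js-git commit and push are not shown.

	// Sketch: compute a git-annex SHA256E key for a File with Web Crypto,
	// and build the single-line pointer file content to commit via js-git.
	async function gitAnnexKey(file: File): Promise<string> {
	    const data = await file.arrayBuffer();
	    const digest = await crypto.subtle.digest("SHA-256", data);
	    const hex = Array.from(new Uint8Array(digest))
	        .map(b => b.toString(16).padStart(2, "0"))
	        .join("");
	    // SHA256E keys embed the file size and keep the filename extension.
	    const dot = file.name.indexOf(".");
	    const ext = dot === -1 ? "" : file.name.slice(dot);
	    return `SHA256E-s${file.size}--${hex}${ext}`;
	}

	// A pointer file is one line giving the object's location in the annex.
	function pointerFileContent(key: string): string {
	    return `/annex/objects/${key}\n`;
	}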
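
And a sketch of the final step, the `DATA-PRESENT` request itself. Only
`data-present=true` comes from the tip; the `key` parameter name, the
`serverUrl` argument, and any other required parameters are assumptions to
check against [[doc/design/P2P_protocol_over_HTTP]].

	// Sketch: after uploading to S3 and pushing the branch, tell the
	// git-annex HTTP server that the content is already present.
	async function reportDataPresent(
	    serverUrl: string, uuid: string, key: string
	): Promise<void> {
	    const url = `${serverUrl}/git-annex/${uuid}/v4/put` +
	        `?data-present=true&key=${encodeURIComponent(key)}`;
	    const response = await fetch(url, { method: "POST" });
	    if (!response.ok) {
	        throw new Error(`DATA-PRESENT request failed: ${response.status}`);
	    }
	}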


@@ -28,22 +28,6 @@ Planned schedule of work:
 
 ## remaining things to do in October
 
-* Streaming uploads to special remotes via the proxy. Possibly; if a
-  workable design can be developed. It seems difficult without changing the
-  external special remote protocol, unless a fifo is used. Make ORDERED
-  response in p2p protocol allow using a fifo?
-* Indirect uploads when proxying for special remote is an alternative that
-  would work for OpenNeuro's use case.
-* If not implementing upload streaming to proxied special remotes,
-  this needs to be addressed:
-
-  When an upload to a cluster is distributed to multiple special remotes,
-  a temporary file is written for each one, which may even happen in
-  parallel. This is a lot of extra work and may use excess disk space.
-  It should be possible to only write a single temp file.
-  (With streaming this wouldn't be an issue.)
-
 * Possibly some of the deferred items listed in following sections:
 
 ## items deferred until later for balanced preferred content and maxsize tracking
@@ -123,6 +107,12 @@ Planned schedule of work:
 * Support using a proxy when its url is a P2P address.
   (Eg tor-annex remotes.)
 
+* When an upload to a cluster is distributed to multiple special remotes,
+  a temporary file is written for each one, which may even happen in
+  parallel. This is a lot of extra work and may use excess disk space.
+  It should be possible to only write a single temp file.
+  (With streaming this wouldn't be an issue.)
+
 ## completed items for October's work on streaming through proxy to special remotes
 
 * Stream downloads through proxy for all special remotes that indicate