tried a blind alley on streaming special remote download via proxy

This didn't work. In case I want to revisit, here's what I tried.

diff --git a/Annex/Proxy.hs b/Annex/Proxy.hs
index 48222872c1..e4e526d3dd 100644
--- a/Annex/Proxy.hs
+++ b/Annex/Proxy.hs
@@ -26,16 +26,21 @@ import Logs.UUID
 import Logs.Location
 import Utility.Tmp.Dir
 import Utility.Metered
+import Utility.ThreadScheduler
+import Utility.OpenFd
 import Git.Types
 import qualified Database.Export as Export

 import Control.Concurrent.STM
 import Control.Concurrent.Async
+import Control.Concurrent.MVar
 import qualified Data.ByteString as B
+import qualified Data.ByteString as BS
 import qualified Data.ByteString.Lazy as L
 import qualified System.FilePath.ByteString as P
 import qualified Data.Map as M
 import qualified Data.Set as S
+import System.IO.Unsafe

 proxyRemoteSide :: ProtocolVersion -> Bypass -> Remote -> Annex RemoteSide
 proxyRemoteSide clientmaxversion bypass r
@@ -240,21 +245,99 @@ proxySpecialRemote protoversion r ihdl ohdl owaitv oclosedv mexportdb = go
 		writeVerifyChunk iv h b
 		storetofile iv h (n - fromIntegral (B.length b)) bs

-	proxyget offset af k = withproxytmpfile k $ \tmpfile -> do
+	proxyget offset af k = withproxytmpfile k $ \tmpfile ->
+		let retrieve = tryNonAsync $ Remote.retrieveKeyFile r k af
+			(fromRawFilePath tmpfile) nullMeterUpdate vc
+		in case fromKey keySize k of
+			Just size | size > 0 -> do
+				cancelv <- liftIO newEmptyMVar
+				donev <- liftIO newEmptyMVar
+				streamer <- liftIO $ async $
+					streamdata offset tmpfile size cancelv donev
+				retrieve >>= \case
+					Right _ -> liftIO $ do
+						putMVar donev ()
+						wait streamer
+					Left err -> liftIO $ do
+						putMVar cancelv ()
+						wait streamer
+						propagateerror err
+			_ -> retrieve >>= \case
+				Right _ -> liftIO $ senddata offset tmpfile
+				Left err -> liftIO $ propagateerror err
+	  where
 		-- Don't verify the content from the remote,
 		-- because the client will do its own verification.
-		let vc = Remote.NoVerify
-		tryNonAsync (Remote.retrieveKeyFile r k af (fromRawFilePath tmpfile) nullMeterUpdate vc) >>= \case
-			Right _ -> liftIO $ senddata offset tmpfile
-			Left err -> liftIO $ propagateerror err
+		vc = Remote.NoVerify

+	streamdata (Offset offset) f size cancelv donev = do
+		sendlen offset size
+		waitforfile
+		x <- tryNonAsync $ do
+			fd <- openFdWithMode f ReadOnly Nothing defaultFileFlags
+			h <- fdToHandle fd
+			hSeek h AbsoluteSeek offset
+			senddata' h (getcontents size)
+		case x of
+			Left err -> do
+				throwM err
+			Right res -> return res
+	  where
+		-- The file doesn't exist at the start.
+		-- Wait for some data to be written to it as well,
+		-- in case an empty file is first created and then
+		-- overwritten. When there is an offset, wait for
+		-- the file to get that large. Note that this is not used
+		-- when the size is 0.
+		waitforfile = tryNonAsync (fromIntegral <$> getFileSize f) >>= \case
+			Right sz | sz > 0 && sz >= offset -> return ()
+			_ -> ifM (isEmptyMVar cancelv)
+				( do
+					threadDelaySeconds (Seconds 1)
+					waitforfile
+				, do
+					return ()
+				)
+
+		getcontents n h = unsafeInterleaveIO $ do
+			isdone <- isEmptyMVar donev <||> isEmptyMVar cancelv
+			c <- BS.hGet h defaultChunkSize
+			let n' = n - fromIntegral (BS.length c)
+			let c' = L.fromChunks [BS.take (fromIntegral n) c]
+			if BS.null c
+				then if isdone
+					then return mempty
+					else do
+						-- Wait for more data to be
+						-- written to the file.
+						threadDelaySeconds (Seconds 1)
+						getcontents n h
+				else if n' > 0
+					then do
+						-- unsafeInterleaveIO causes
+						-- this to be deferred until
+						-- data is read from the lazy
+						-- ByteString.
+						cs <- getcontents n' h
+						return $ L.append c' cs
+					else return c'
+
 	senddata (Offset offset) f = do
 		size <- fromIntegral <$> getFileSize f
-		let n = max 0 (size - offset)
-		sendmessage $ DATA (Len n)
+		sendlen offset size
 		withBinaryFile (fromRawFilePath f) ReadMode $ \h -> do
 			hSeek h AbsoluteSeek offset
-			sendbs =<< L.hGetContents h
+			senddata' h L.hGetContents
+
+	senddata' h getcontents = do
+			sendbs =<< getcontents h
 			-- Important to keep the handle open until
 			-- the client responds. The bytestring
 			-- could still be lazily streaming out to
@@ -272,6 +355,11 @@ proxySpecialRemote protoversion r ihdl ohdl owaitv oclosedv mexportdb = go
 				Just FAILURE -> return ()
 				Just _ -> giveup "protocol error"
 				Nothing -> return ()
+
+	sendlen offset size = do
+		let n = max 0 (size - offset)
+		sendmessage $ DATA (Len n)
+

 {- Check if this repository can proxy for a specified remote uuid,
  - and if so enable proxying for it. -}

@@ -364,6 +364,10 @@ remote to the usual temp object file on the proxy, but without moving that
to the annex object file at the end. As the temp object file grows, stream
the content out via the proxy.
> This needs the same process to read and write the same file, which is
> disallowed in Haskell (without going low-level in a way that seems
> difficult).
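
A minimal standalone demo of that restriction (the file name is
arbitrary): GHC refuses to open a Handle for reading on a file that the
same process already has open for writing.

    import Control.Exception (IOException, try)
    import System.IO

    main :: IO ()
    main = do
        w <- openFile "tmpobj" WriteMode
        -- GHC enforces multiple-reader/single-writer locking on
        -- Handles, so this throws "resource busy (file is locked)".
        r <- try (openFile "tmpobj" ReadMode) :: IO (Either IOException Handle)
        case r of
            Left e -> putStrLn ("read open failed: " ++ show e)
            Right h -> hClose h
        hClose w
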
Some special remotes will overwrite or truncate an existing temp object
file when starting a download. So the proxy should wait until the file is
growing to start streaming it.
@@ -383,6 +387,20 @@ stream downloads from such special remotes. So there will be a perhaps long
delay before the client sees their download start. Extend the P2P protocol
with a way to send pre-download progress perhaps?
> That seems pretty complicated. Alternatively, require that
> retrieveKeyFile only writes to the file in-order. Even the bittorrent
> special remote currently does, since it waits for the bittorrent download
> to complete before moving the file to the destination. All other
> special remotes built into git-annex are ok as well.
>
> Possibly some external special remote does not (eg maybe rclone in some
> situation)?
>
> This could be handled with a special remote protocol extension that asks
> the special remote to confirm if it retrieves in order. When a special
> remote does not support that extension, Remote.External can just download
> to a temp file and rename after download.
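
For illustration, a sketch of how that gating could look on the
git-annex side; the ORDERED name is invented here, and only the
EXTENSIONS handshake itself exists in the current external special
remote protocol.

    -- Hypothetical: pick a retrieval strategy based on whether the
    -- external special remote advertised an (invented) ORDERED
    -- extension during the EXTENSIONS handshake.
    data RetrieveStyle = StreamInPlace | TempFileThenRename

    chooseRetrieveStyle :: [String] -> RetrieveStyle
    chooseRetrieveStyle advertised
        | "ORDERED" `elem` advertised = StreamInPlace
        | otherwise = TempFileThenRename
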
A simple approach for proxying uploads is to buffer the upload to the temp
object file, and once it's complete (and hash verified), send it on to the
special remote(s). Then delete the temp object file. This has a problem that
@@ -412,6 +430,16 @@ another process could open the temp file and stream it out to its client.
But how to detect when the whole content has been received? Could check key
size, but what about unsized keys?
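
For sized keys the check is simple, using the same key accessors the
diff above uses; a sketch, assuming they live in git-annex's Types.Key:

    import Types.Key (Key, fromKey, keySize)

    -- Just a verdict when the key's size is recorded; Nothing for an
    -- unsized key, which needs some other end-of-transfer signal.
    wholeContentReceived :: Key -> Integer -> Maybe Bool
    wholeContentReceived k bytesreceived = case fromKey keySize k of
        Just size -> Just (bytesreceived >= size)
        Nothing -> Nothing
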
## special remotes using P2P protocol
Another way to handle proxying to special remotes would be to make some
special remotes speak the P2P protocol. Then the proxy can just proxy the
P2P protocol to them, the same as it does to git-annex remotes.
The difficulty with this though is that encryption and chunking are
implemented as transformations of special remotes, and would need to be
re-implemented on top of the P2P protocol.

## chunking
When the proxy is in front of a special remote that is chunked,

@@ -28,6 +28,23 @@ Planned schedule of work:
## work notes
* Currently working on streaming download via proxy from special remote.
* Tried implementing a background thread in the proxy that runs while
  retrieving a file, to stream it out as it comes in. That failed because
  reading from a file that the same process is writing to is prevented by
  Handle locking in Haskell. (This could be worked around by using an FD
  rather than a Handle, but that means reading from the FD and using
  packCString to make a ByteString; a sketch of that follows this list.)
  But also, remotes using fileRetriever retrieve to the temp object file
  before it is renamed to the requested file. In the case of a proxy,
  that is a different file, and so the proxy won't see the file until it
  has all been transferred and renamed.
* Could the P2P protocol be used as an alternate interface for a special
  remote? That would avoid needing temp files when proxying for special
  remotes, and would also support resume from an offset for the special
  remotes where that makes sense.
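
A sketch of that FD workaround, using fdReadBuf from the unix package
and packCStringLen (the length-taking variant); this is a hypothetical
helper, not code from the diff above:

    import qualified Data.ByteString as B
    import Foreign.Marshal.Alloc (allocaBytes)
    import Foreign.Ptr (castPtr)
    import System.Posix.IO (OpenMode(ReadOnly), closeFd, defaultFileFlags, fdReadBuf, openFd)

    -- Read up to n bytes from a file via a raw Fd, bypassing GHC's
    -- Handle locking, and copy the buffer into a strict ByteString.
    readViaFd :: FilePath -> Int -> IO B.ByteString
    readViaFd f n = do
        -- unix >= 2.8 signature; older unix takes an extra Maybe FileMode
        fd <- openFd f ReadOnly defaultFileFlags
        b <- allocaBytes n $ \buf -> do
            got <- fdReadBuf fd buf (fromIntegral n)
            B.packCStringLen (castPtr buf, fromIntegral got)
        closeFd fd
        return b
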
## completed items for September's work on proving behavior of preferred content

@@ -0,0 +1,13 @@
[[!comment format=mdwn
username="joey"
subject="""comment 10"""
date="2024-10-07T17:57:05Z"
content="""
Strictly speaking, it's possible to do better than `git-annex copy --from --to` currently does.
When git-annex is used as a proxy to a P2P remote, it streams the P2P
protocol from client to remote, and so needs no temp files.
So in a way, the P2P protocol is the real solution to this? Except special
remotes don't use the P2P protocol.
"""]]