Rather than use the filename provided by INPUT, which could come from user
input, and so could be something that looks like a dashed parameter,
use a .git/objects/<sha> filename.
This avoids user input passing through INPUT and back out, with the
resulting file path then being passed to a command, which could do
something unexpected with a dashed parameter or other special parameter.
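As a rough illustration (objectFilePath is a hypothetical helper, not
git-annex's actual code), a path built from the sha alone can never start
with a dash, no matter what the user named the file:

    import System.FilePath ((</>))

    -- Hypothetical helper: derive the file path purely from the sha,
    -- so user-chosen names never reach a command line.
    objectFilePath :: String -> FilePath
    objectFilePath sha = ".git" </> "objects" </> sha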
Added a note in the design about being careful when passing user input
to commands. Compute programs still have to be careful about that in
general, just not in this case.
Added annex.security.autoenable-compute-programs, and only allow
autoenabling special remotes whose compute programs are on that list.
The reason this is needed is that a user might have some compute programs
that are less safe to use than others. They might want to use an unsafe
one only with a single repository, where they are the only committer or
the other committers are trusted. They might be ok with other programs
being used by any repository, and if so they can add those to the list.
Another reason would be a user who has installed a compute program by
accident. Eg, it might be included with git-annex at some point, or
pulled in by some dependency. That user doesn't necessarily want that
compute program to be used in an autoenabled special remote.
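For example, to allow autoenabling remotes that use one particular
program (the program name here is hypothetical, and the space-separated
list format is an assumption):

    git config annex.security.autoenable-compute-programs "git-annex-compute-foo"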
Eg, a computation might be run in "foo/" and refer to "../bar" as an
input or output.
So, the subdir is part of the computation state.
Also, prevent input or output of files that are outside the git
repository. Of course, the program can access any file on disk if it
wants to; this is just a guard against mistakes. And it may also be
useful if the program communicates with something less trusted than
itself, eg a container image, so that input/output files communicated
by it cannot become a source of security problems.
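A minimal sketch of such a guard (not git-annex's actual implementation):
reject absolute paths, and any relative path whose ".." components would
climb above the repository top.

    import System.FilePath (isAbsolute, normalise, splitDirectories)

    -- True when a path, interpreted relative to the repository top,
    -- cannot escape the repository.
    insideRepo :: FilePath -> Bool
    insideRepo p
        | isAbsolute p = False
        | otherwise = go (0 :: Int) (splitDirectories (normalise p))
      where
        go _ [] = True
        go depth (c:cs)
            | c == ".." = depth >= 1 && go (depth - 1) cs
            | c == "." = go depth cs
            | otherwise = go (depth + 1) cs

So insideRepo "foo/bar" is True, while insideRepo "../bar" and
insideRepo "foo/../../bar" are both False. Symlinks are not considered,
which is fine for a guard against mistakes rather than malice.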
Changed the protocol docs because servant parses booleans in query
parameters as "true" and "false", not "1" and "0".
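Servant gets this behavior from the http-api-data package; a standalone
check of its Bool instance (assuming that package is installed) looks
like:

    {-# LANGUAGE OverloadedStrings #-}
    import Web.HttpApiData (parseQueryParamMaybe, toQueryParam)

    main :: IO ()
    main = do
        print (toQueryParam True)                          -- "true"
        print (parseQueryParamMaybe "true" :: Maybe Bool)  -- Just True
        print (parseQueryParamMaybe "1" :: Maybe Bool)     -- Nothing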
clientPut with datapresent=True is not used by git-annex, and I don't
anticipate it being used in git-annex, except for testing.
I've tested this by making clientPut be called with datapresent=True;
git-annex copy to a remote then succeeds once the object file has first
been manually copied to the remote. That would be a good test for the
test suite, but running the http server means exposing it to at least
localhost, and the test would fail if a real http server was already
running on that port.
I anticipate lots of external special remote programs will neglect
implementing this. Still, it's the right thing to do to assume that some
of them may write files out of order. Probably most external special
remotes will not be used with a proxy. When someone is using one with a
proxy, they can always get it fixed to send ORDERED.
This didn't work. In case I want to revisit, here's what I tried.
diff --git a/Annex/Proxy.hs b/Annex/Proxy.hs
index 48222872c1..e4e526d3dd 100644
--- a/Annex/Proxy.hs
+++ b/Annex/Proxy.hs
@@ -26,16 +26,21 @@ import Logs.UUID
import Logs.Location
import Utility.Tmp.Dir
import Utility.Metered
+import Utility.ThreadScheduler
+import Utility.OpenFd
import Git.Types
import qualified Database.Export as Export
import Control.Concurrent.STM
import Control.Concurrent.Async
+import Control.Concurrent.MVar
import qualified Data.ByteString as B
+import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as L
import qualified System.FilePath.ByteString as P
import qualified Data.Map as M
import qualified Data.Set as S
+import System.IO.Unsafe
proxyRemoteSide :: ProtocolVersion -> Bypass -> Remote -> Annex RemoteSide
proxyRemoteSide clientmaxversion bypass r
@@ -240,21 +245,99 @@ proxySpecialRemote protoversion r ihdl ohdl owaitv oclosedv mexportdb = go
writeVerifyChunk iv h b
storetofile iv h (n - fromIntegral (B.length b)) bs
- proxyget offset af k = withproxytmpfile k $ \tmpfile -> do
+ proxyget offset af k = withproxytmpfile k $ \tmpfile ->
+ let retrieve = tryNonAsync $ Remote.retrieveKeyFile r k af
+ (fromRawFilePath tmpfile) nullMeterUpdate vc
+ in case fromKey keySize k of
+ Just size | size > 0 -> do
+ cancelv <- liftIO newEmptyMVar
+ donev <- liftIO newEmptyMVar
+ streamer <- liftIO $ async $
+ streamdata offset tmpfile size cancelv donev
+ retrieve >>= \case
+ Right _ -> liftIO $ do
+ putMVar donev ()
+ wait streamer
+ Left err -> liftIO $ do
+ putMVar cancelv ()
+ wait streamer
+ propagateerror err
+ _ -> retrieve >>= \case
+ Right _ -> liftIO $ senddata offset tmpfile
+ Left err -> liftIO $ propagateerror err
+ where
-- Don't verify the content from the remote,
-- because the client will do its own verification.
- let vc = Remote.NoVerify
- tryNonAsync (Remote.retrieveKeyFile r k af (fromRawFilePath tmpfile) nullMeterUpdate vc) >>= \case
- Right _ -> liftIO $ senddata offset tmpfile
- Left err -> liftIO $ propagateerror err
+ vc = Remote.NoVerify
+ streamdata (Offset offset) f size cancelv donev = do
+ sendlen offset size
+ waitforfile
+ x <- tryNonAsync $ do
+ fd <- openFdWithMode f ReadOnly Nothing defaultFileFlags
+ h <- fdToHandle fd
+ hSeek h AbsoluteSeek offset
+ senddata' h (getcontents size)
+ case x of
+ Left err -> do
+ throwM err
+ Right res -> return res
+ where
+ -- The file doesn't exist at the start.
+ -- Wait for some data to be written to it as well,
+ -- in case an empty file is first created and then
+ -- overwritten. When there is an offset, wait for
+ -- the file to get that large. Note that this is not used
+ -- when the size is 0.
+ waitforfile = tryNonAsync (fromIntegral <$> getFileSize f) >>= \case
+ Right sz | sz > 0 && sz >= offset -> return ()
+ _ -> ifM (isEmptyMVar cancelv)
+ ( do
+ threadDelaySeconds (Seconds 1)
+ waitforfile
+ , do
+ return ()
+ )
+
+ getcontents n h = unsafeInterleaveIO $ do
+ isdone <- isEmptyMVar donev <||> isEmptyMVar cancelv
+ c <- BS.hGet h defaultChunkSize
+ let n' = n - fromIntegral (BS.length c)
+ let c' = L.fromChunks [BS.take (fromIntegral n) c]
+ if BS.null c
+ then if isdone
+ then return mempty
+ else do
+ -- Wait for more data to be
+ -- written to the file.
+ threadDelaySeconds (Seconds 1)
+ getcontents n h
+ else if n' > 0
+ then do
+ -- unsafeInterleaveIO causes
+ -- this to be deferred until
+ -- data is read from the lazy
+ -- ByteString.
+ cs <- getcontents n' h
+ return $ L.append c' cs
+ else return c'
+
senddata (Offset offset) f = do
size <- fromIntegral <$> getFileSize f
- let n = max 0 (size - offset)
- sendmessage $ DATA (Len n)
+ sendlen offset size
withBinaryFile (fromRawFilePath f) ReadMode $ \h -> do
hSeek h AbsoluteSeek offset
- sendbs =<< L.hGetContents h
+ senddata' h L.hGetContents
+
+ senddata' h getcontents = do
+ sendbs =<< getcontents h
-- Important to keep the handle open until
-- the client responds. The bytestring
-- could still be lazily streaming out to
@@ -272,6 +355,11 @@ proxySpecialRemote protoversion r ihdl ohdl owaitv oclosedv mexportdb = go
Just FAILURE -> return ()
Just _ -> giveup "protocol error"
Nothing -> return ()
+
+ sendlen offset size = do
+ let n = max 0 (size - offset)
+ sendmessage $ DATA (Len n)
+
{- Check if this repository can proxy for a specified remote uuid,
- and if so enable proxying for it. -}
This all works fine. But it doesn't check repository sizes yet, and
without repository size checking, once a repository gets full, there
will be no other repository that will want its files.
Use of sha2 seems unnecessary; probably adler32 or md5 or crc would have
been enough. Possibly just summing up the bytes of the key mod the number
of repositories would have sufficed. But sha2 is there, and probably
hardware accelerated. I doubt very much there is any security benefit
to using it though. If someone wants to construct a key that will be
balanced onto a given repository, sha2 is certainly not going to stop
them.
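A minimal sketch of the idea (the helper name is hypothetical, and this
is not git-annex's actual implementation): hash the key's name and reduce
it mod the number of repositories to pick one deterministically.

    import Crypto.Hash (SHA256 (..), hashWith)
    import Data.ByteArray (unpack)
    import qualified Data.ByteString.Char8 as B8

    -- Hypothetical helper: deterministically balance a key onto one of
    -- n repositories by reducing its sha256 digest mod n.
    pickRepository :: String -> Int -> Int
    pickRepository keyname nrepos =
        fromIntegral (digestval `mod` fromIntegral nrepos)
      where
        digestval :: Integer
        digestval = foldl (\acc w -> acc * 256 + fromIntegral w) 0
            (unpack (hashWith SHA256 (B8.pack keyname)))

Any hash that spreads keys uniformly would balance equally well, which is
why sha2 brings no security benefit here.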
This removes versionedExport, which was only used by the S3 special
remote. Instead, versionedexport=yes is a common way for remotes to
indicate that they are versioned.
proxyRequest was treating UNLOCKCONTENT as a separate request.
That made it possible for there to be two different connections to the
proxied remote, with LOCKCONTENT being sent to one, and UNLOCKCONTENT
to the other one, which is a protocol error.
git-annex testremote now passes against an http proxied remote.