2024-05-01 15:04:20 +00:00

This is a summary todo covering several subprojects, which would extend
git-annex to be able to use proxies which sit in front of a cluster of
repositories.

1. [[design/passthrough_proxy]]
2. [[design/p2p_protocol_over_http]]
3. [[design/balanced_preferred_content]]
4. [[todo/track_free_space_in_repos_via_git-annex_branch]]
5. [[todo/proving_preferred_content_behavior]]

## table of contents

[[!toc ]]

## planned schedule

Joey has received funding to work on this.
Planned schedule of work:

* June: git-annex proxies and clusters
* July: p2p protocol over http
* August, part 1: git-annex proxy support for exporttree
* August, part 2: balanced preferred content
* September: proving behavior of balanced preferred content with proxies
* October: streaming through proxy to special remotes (especially S3)

[[!tag projects/openneuro]]

## remaining things to do in October

* Streaming uploads to special remotes via the proxy. Possibly; if a
  workable design can be developed. It seems difficult without changing the
  external special remote protocol, unless a fifo is used. Make ORDERED
  response in p2p protocol allow using a fifo?

* Indirect uploads when proxying for special remote is an alternative that
  would work for OpenNeuro's use case.

* If not implementing upload streaming to proxied special remotes,
  this needs to be addressed:
  When an upload to a cluster is distributed to multiple special remotes,
  a temporary file is written for each one, which may even happen in
  parallel. This is a lot of extra work and may use excess disk space.
  It should be possible to only write a single temp file.
  (With streaming this wouldn't be an issue.)

tried a blind alley on streaming special remote download via proxy
This didn't work. In case I want to revisit, here's what I tried.
diff --git a/Annex/Proxy.hs b/Annex/Proxy.hs
index 48222872c1..e4e526d3dd 100644
--- a/Annex/Proxy.hs
+++ b/Annex/Proxy.hs
@@ -26,16 +26,21 @@ import Logs.UUID
import Logs.Location
import Utility.Tmp.Dir
import Utility.Metered
+import Utility.ThreadScheduler
+import Utility.OpenFd
import Git.Types
import qualified Database.Export as Export
import Control.Concurrent.STM
import Control.Concurrent.Async
+import Control.Concurrent.MVar
import qualified Data.ByteString as B
+import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as L
import qualified System.FilePath.ByteString as P
import qualified Data.Map as M
import qualified Data.Set as S
+import System.IO.Unsafe
proxyRemoteSide :: ProtocolVersion -> Bypass -> Remote -> Annex RemoteSide
proxyRemoteSide clientmaxversion bypass r
@@ -240,21 +245,99 @@ proxySpecialRemote protoversion r ihdl ohdl owaitv oclosedv mexportdb = go
writeVerifyChunk iv h b
storetofile iv h (n - fromIntegral (B.length b)) bs
- proxyget offset af k = withproxytmpfile k $ \tmpfile -> do
+ proxyget offset af k = withproxytmpfile k $ \tmpfile ->
+ let retrieve = tryNonAsync $ Remote.retrieveKeyFile r k af
+ (fromRawFilePath tmpfile) nullMeterUpdate vc
+ in case fromKey keySize k of
+ Just size | size > 0 -> do
+ cancelv <- liftIO newEmptyMVar
+ donev <- liftIO newEmptyMVar
+ streamer <- liftIO $ async $
+ streamdata offset tmpfile size cancelv donev
+ retrieve >>= \case
+ Right _ -> liftIO $ do
+ putMVar donev ()
+ wait streamer
+ Left err -> liftIO $ do
+ putMVar cancelv ()
+ wait streamer
+ propagateerror err
+ _ -> retrieve >>= \case
+ Right _ -> liftIO $ senddata offset tmpfile
+ Left err -> liftIO $ propagateerror err
+ where
-- Don't verify the content from the remote,
-- because the client will do its own verification.
- let vc = Remote.NoVerify
- tryNonAsync (Remote.retrieveKeyFile r k af (fromRawFilePath tmpfile) nullMeterUpdate vc) >>= \case
- Right _ -> liftIO $ senddata offset tmpfile
- Left err -> liftIO $ propagateerror err
+ vc = Remote.NoVerify
+ streamdata (Offset offset) f size cancelv donev = do
+ sendlen offset size
+ waitforfile
+ x <- tryNonAsync $ do
+ fd <- openFdWithMode f ReadOnly Nothing defaultFileFlags
+ h <- fdToHandle fd
+ hSeek h AbsoluteSeek offset
+ senddata' h (getcontents size)
+ case x of
+ Left err -> do
+ throwM err
+ Right res -> return res
+ where
+ -- The file doesn't exist at the start.
+ -- Wait for some data to be written to it as well,
+ -- in case an empty file is first created and then
+ -- overwritten. When there is an offset, wait for
+ -- the file to get that large. Note that this is not used
+ -- when the size is 0.
+ waitforfile = tryNonAsync (fromIntegral <$> getFileSize f) >>= \case
+ Right sz | sz > 0 && sz >= offset -> return ()
+ _ -> ifM (isEmptyMVar cancelv)
+ ( do
+ threadDelaySeconds (Seconds 1)
+ waitforfile
+ , do
+ return ()
+ )
+
+ getcontents n h = unsafeInterleaveIO $ do
+ isdone <- isEmptyMVar donev <||> isEmptyMVar cancelv
+ c <- BS.hGet h defaultChunkSize
+ let n' = n - fromIntegral (BS.length c)
+ let c' = L.fromChunks [BS.take (fromIntegral n) c]
+ if BS.null c
+ then if isdone
+ then return mempty
+ else do
+ -- Wait for more data to be
+ -- written to the file.
+ threadDelaySeconds (Seconds 1)
+ getcontents n h
+ else if n' > 0
+ then do
+ -- unsafeInterleaveIO causes
+ -- this to be deferred until
+ -- data is read from the lazy
+ -- ByteString.
+ cs <- getcontents n' h
+ return $ L.append c' cs
+ else return c'
+
senddata (Offset offset) f = do
size <- fromIntegral <$> getFileSize f
- let n = max 0 (size - offset)
- sendmessage $ DATA (Len n)
+ sendlen offset size
withBinaryFile (fromRawFilePath f) ReadMode $ \h -> do
hSeek h AbsoluteSeek offset
- sendbs =<< L.hGetContents h
+ senddata' h L.hGetContents
+
+ senddata' h getcontents = do
+ sendbs =<< getcontents h
-- Important to keep the handle open until
-- the client responds. The bytestring
-- could still be lazily streaming out to
@@ -272,6 +355,11 @@ proxySpecialRemote protoversion r ihdl ohdl owaitv oclosedv mexportdb = go
Just FAILURE -> return ()
Just _ -> giveup "protocol error"
Nothing -> return ()
+
+ sendlen offset size = do
+ let n = max 0 (size - offset)
+ sendmessage $ DATA (Len n)
+
{- Check if this repository can proxy for a specified remote uuid,
- and if so enable proxying for it. -}

* Possibly some of the deferred items listed in following sections:
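
The single temp file idea in the list above can be sketched as follows. This is only an illustration in Python, not git-annex code: `fan_out_upload` and the `StoreFn` callbacks are hypothetical stand-ins for the proxy receiving an upload and for each special remote's store action.

```python
import os
import tempfile
from typing import Callable, Dict, Iterable

# Hypothetical stand-in for a special remote's store action. It is given
# the path of a file holding the complete object and reports success.
StoreFn = Callable[[str], bool]


def fan_out_upload(chunks: Iterable[bytes], stores: Dict[str, StoreFn]) -> Dict[str, bool]:
    """Write the incoming upload to ONE temp file, then store it to each remote."""
    fd, path = tempfile.mkstemp(prefix="proxy-upload-")
    try:
        with os.fdopen(fd, "wb") as f:
            for chunk in chunks:
                f.write(chunk)
        # Every remote reads the same file, so disk use does not grow
        # with the number of remotes the upload is distributed to.
        return {name: store(path) for name, store in stores.items()}
    finally:
        # The temp file is removed even if one of the stores raises.
        os.unlink(path)
```

The stores run one after another here to keep the sketch short. Running them in parallel against the same read-only file would keep the single temp file property.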

## items deferred until later for balanced preferred content and maxsize tracking

* The assistant is using NoLiveUpdate, but it should be possible to plumb
  a LiveUpdate through it from preferred content checking to location log
  updating.

* `git-annex info` in the limitedcalc path in cachedAllRepoData
  double-counts redundant information from the journal due to using
  overLocationLogs. In the other path it does not (any more; it used to),
  and this should be fixed for consistency and correctness.

* getLiveRepoSizes has a filterM getRecentChange over the live updates.
  This could be optimised to a single sql join. There are usually not many
  live updates, but sometimes there will be a great many recent changes,
  so it might be worth doing this optimisation. Persistent is not capable
  of this; it would need a dependency added on esqueleto.
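
To illustrate the join optimisation from the last item, here is a minimal sketch using Python's `sqlite3` module. The table and column names are invented for the example and are not git-annex's actual schema. It also assumes the filter keeps live updates that have a matching recent change; the opposite filter would use a `LEFT JOIN` with an `IS NULL` test instead.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE live_updates (uuid TEXT, annex_key TEXT, size_change INTEGER)"
    )
    conn.execute("CREATE TABLE recent_changes (uuid TEXT, annex_key TEXT)")
    return conn


def live_updates_with_recent_change(conn: sqlite3.Connection):
    """One query for all live updates, instead of one lookup per live update."""
    return conn.execute(
        """
        SELECT l.uuid, l.annex_key, l.size_change
        FROM live_updates AS l
        JOIN recent_changes AS r
          ON r.uuid = l.uuid AND r.annex_key = l.annex_key
        ORDER BY l.uuid, l.annex_key
        """
    ).fetchall()
```

With many recent changes, the database can use an index on `recent_changes` rather than answering a separate query for each live update.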

## items deferred until later for p2p protocol over http

* Support proxying to git remotes that use annex+http urls. This needs a
  translation from P2P protocol to servant-client to P2P protocol.

* Should be possible to use a git-remote-annex annex::$uuid url as
  remote.foo.url with remote.foo.annexUrl using annex+http, and so
  not need a separate web server to serve the git repository. Doesn't work
  currently because git-remote-annex urls only support special remotes.
  It would need a new form of git-remote-annex url, eg:
  annex::$uuid?annex+http://example.com/git-annex/

* `git-annex p2phttp` could support systemd socket activation. This would
  allow making a systemd unit that listens on port 80.
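
For the socket activation item, the receiving side of the systemd protocol is small. The sketch below is in Python and is not how `git-annex p2phttp` works today. It follows the convention documented in sd_listen_fds(3): systemd sets `LISTEN_PID` and `LISTEN_FDS`, and the inherited descriptors start at 3. The function names and the fallback port are made up for the example.

```python
import os
import socket

SD_LISTEN_FDS_START = 3  # first inherited descriptor, per sd_listen_fds(3)


def inherited_fds(environ=os.environ, pid=None):
    """Return the descriptors passed by systemd, or [] if there are none."""
    pid = os.getpid() if pid is None else pid
    # LISTEN_PID guards against the variables leaking into child processes.
    if environ.get("LISTEN_PID") != str(pid):
        return []
    try:
        count = int(environ.get("LISTEN_FDS", "0"))
    except ValueError:
        return []
    return list(range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + max(count, 0)))


def listening_socket(fallback_port=8080):
    """Use the socket systemd opened if any, otherwise bind one ourselves."""
    fds = inherited_fds()
    if fds:
        # systemd already bound the port (eg port 80), so the server
        # itself never needs the privilege to bind it.
        return socket.socket(fileno=fds[0])
    return socket.create_server(("", fallback_port))
```

Because systemd binds the port, the server process can run unprivileged and still be reachable on port 80.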

## items deferred until later for [[design/passthrough_proxy]]

* Check annex.diskreserve when proxying for special remotes
  to avoid the proxy's disk filling up with the temporary object file
  cached there.

* Resuming an interrupted download from proxied special remote makes the proxy
  re-download the whole content. It could instead keep some of the
  object files around when the client does not send SUCCESS. This would
  use more disk, but the number kept could be limited to, eg, the last 2 or so.
  The design doc has some more thoughts about this.

* Getting a key from a cluster currently picks from among
  the lowest cost remotes at random. This could be smarter,
  eg prefer to avoid using remotes that are doing other transfers at the
  same time.

* The cost of a proxied node that is accessed via an intermediate gateway
  is currently the same as a node accessed via the cluster gateway.
  To fix this, there needs to be some way to tell how many hops through
  gateways it takes to get to a node. Currently the only way is to
  guess based on number of dashes in the node name, which is not satisfying.

  Even counting hops is not very satisfying, since one cluster gateway could
  be much more expensive to traverse than another one.

  If seriously tackling this, it might be worth making enough information
  available to use spanning tree protocol for routing inside clusters.

* Speed: A proxy to a local git repository spawns git-annex-shell
  to communicate with it. It would be more efficient to operate
  directly on the Remote. Especially when transferring content to/from it.
  But: When a cluster has several nodes that are local git repositories,
  and is sending data to all of them, this would need an alternative
  interface to `storeKey`, one which supports streaming chunks
  of a ByteString.

* Use `sendfile()` to avoid data copying overhead when
  `receiveBytes` is being fed right into `sendBytes`.
  Library to use:
  <https://hackage.haskell.org/package/hsyscall-0.4/docs/System-Syscall.html>

* Support using a proxy when its url is a P2P address.
  (Eg tor-annex remotes.)
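
The idea of picking a cluster node more smartly can be sketched like this, in Python rather than git-annex's Haskell. It keeps the current behavior (random choice among the lowest cost remotes) and only adds a preference for idle nodes. The `Node` type and its `active_transfers` field are assumptions; the proxy would need some way to track that number.

```python
import random
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Node:
    name: str
    cost: float
    active_transfers: int  # hypothetical: not tracked by git-annex today


def pick_node(nodes: Sequence[Node], rng=random) -> Node:
    """Pick a lowest cost node, preferring the ones doing the fewest transfers."""
    if not nodes:
        raise ValueError("no nodes have the key")
    lowest = min(n.cost for n in nodes)
    cheapest = [n for n in nodes if n.cost == lowest]
    idlest = min(n.active_transfers for n in cheapest)
    # Ties are still broken at random, as is done currently.
    return rng.choice([n for n in cheapest if n.active_transfers == idlest])
```

Cost stays the primary criterion, so a busy cheap node is still preferred over an idle expensive one. Whether that is the right trade-off is an open design question.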

## completed items for October's work on streaming through proxy to special remotes

* Stream downloads through proxy for all special remotes that indicate
  they download in order.
* Added ORDERED message to external special remote protocol.
* Added DATA-PRESENT and documented in
  [[tips/client_side_upload_to_a_special_remote]]

## completed items for September's work on proving behavior of preferred content

* Static analysis to detect "not present", "not balanced", and similar
  unstable preferred content expressions and avoid problems with them.
* Implemented `git-annex sim` command.
* Simulated a variety of repository networks, and random preferred content
  expressions, checking that a stable state is always reached.
* Fix bug that prevented anything being stored in an empty
  repository whose preferred content expression uses sizebalanced.
  (Identified via `git-annex sim`)

## completed items for August's work on balanced preferred content

* Balanced preferred content basic implementation, including --rebalance
  option.
* Implemented [[track_free_space_in_repos_via_git-annex_branch]]
* Implemented tracking of live changes to repository sizes.
* `git-annex maxsize`
* annex.fullybalancedthreshhold

## completed items for August's work on git-annex proxy support for exporttree

* Special remotes configured with exporttree=yes annexobjects=yes
  can store objects in .git/annex/objects, as well as an exported tree.

* Support proxying to special remotes configured with
  exporttree=yes annexobjects=yes.

* post-retrieve: When proxying is enabled for an exporttree=yes
  special remote and the configured remote.name.annex-tracking-branch
  is received, the tree is exported to the special remote.

* When getting from a P2P HTTP remote, prompt for credentials when
  required, instead of failing.

* Prevent `updateproxy` and `updatecluster` from adding
  an exporttree=yes special remote that does not have
  annexobjects=yes, to avoid foot shooting.

* Implement `git-annex export treeish --to=foo --from=bar`, which
  gets from bar as needed to send to foo. Make post-retrieve use
  `--to=r --from=r` to handle the multiple files case.

## completed items for July's work on p2p protocol over http

* HTTP P2P protocol design [[design/p2p_protocol_over_http]].

* addressed [[doc/todo/P2P_locking_connection_drop_safety]]

* implemented server and client for HTTP P2P protocol

* added git-annex p2phttp command to serve HTTP P2P protocol

* Make git-annex p2phttp support https.

* Allow using annex+http urls in remote.name.annexUrl

* Make http server support proxying.

* Make http server support serving a cluster.

## completed items for June's work on [[design/passthrough_proxy]]:

* UUID discovery via git-annex branch. Add a log file listing UUIDs
  accessible via proxy UUIDs. It also will contain the names
  of the remotes that the proxy is a proxy for,
  from the perspective of the proxy. (done)

* Add `git-annex updateproxy` command (done)

* Remote instantiation for proxies. (done)

* Implement git-annex-shell proxying to git remotes. (done)

* Proxy should update location tracking information for proxied remotes,
  so it is available to other users who sync with it. (done)

* Implement `git-annex initcluster` and `git-annex updatecluster` commands (done)

* Implement cluster UUID insertion on location log load, and removal
  on location log store. (done)

* Omit cluster UUIDs when constructing drop proofs, since lockcontent will
  always fail on a cluster. (done)

* Don't count cluster UUID as a copy in numcopies checking etc. (done)

* Tab complete proxied remotes and clusters in eg --from option. (done)

* Getting a key from a cluster should proxy from one of the nodes that has
  it. (done)

* Implement upload with fanout to multiple cluster nodes and reporting back
  additional UUIDs over P2P protocol. (done)

* Implement cluster drops, trying to remove from all nodes, and returning
  which UUIDs it was dropped from. (done)

* `git-annex testremote` works against proxied remote and cluster. (done)

* Avoid `git-annex sync --content` etc from operating on cluster nodes by
  default since syncing with a cluster implicitly syncs with its nodes. (done)

* On upload to cluster, send to nodes where it is preferred content, and not
  to other nodes. (done)

* Support annex.jobs for clusters. (done)

* Add `git-annex extendcluster` command and extend `git-annex updatecluster`
  to support clusters with multiple gateways. (done)

* Support proxying for a remote that is proxied by another gateway of
  a cluster. (done)

* Support distributed clusters: Make a proxy for a cluster repeat
  protocol messages on to any remotes that have the same UUID as
  the cluster. Needs extension to P2P protocol to avoid cycles.
  (done)

* Proxied cluster nodes should have slightly higher cost than the cluster
  gateway. (done)

* Basic support for proxying special remotes. (But not exporttree=yes ones
  yet.) (done)

* Tab complete remotes in all relevant commands (done)

* Display cluster and proxy information in git-annex info (done)
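
The cluster UUID handling from the list above (insertion on location log load, removal on store, and not counting a cluster as a copy) can be illustrated with a small Python sketch. It is a simplified model rather than the real implementation. In particular, it assumes a cluster counts as having a key when at least one of its nodes does, and the function names are invented.

```python
from typing import Dict, Set


def on_load(locations: Set[str], clusters: Dict[str, Set[str]]) -> Set[str]:
    """Add the UUID of every cluster that has a node among the locations."""
    present = {cluster for cluster, nodes in clusters.items() if nodes & locations}
    return locations | present


def on_store(locations: Set[str], clusters: Dict[str, Set[str]]) -> Set[str]:
    """Strip cluster UUIDs so only real repositories reach the log."""
    return locations - set(clusters)


def count_copies(locations: Set[str], clusters: Dict[str, Set[str]]) -> int:
    """A cluster UUID is a view onto its nodes, not an extra copy."""
    return len(on_store(locations, clusters))
```

Since the cluster UUID is derived from the nodes on every load, it never needs to be written to the location log, and a stale cluster entry cannot outlive the copies on its nodes.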