Merge branch 'proxy'

Joey Hess 2024-06-27 15:43:45 -04:00
commit c3f88923c0
GPG key ID: DB12DB0FF05F8F38
78 changed files with 3145 additions and 448 deletions


@ -11,7 +11,7 @@ repositories.
Joey has received funding to work on this.
Planned schedule of work:
* June: git-annex proxy
* June: git-annex proxies and clusters
* July, part 1: git-annex proxy support for exporttree
* July, part 2: p2p protocol over http
* August: balanced preferred content
@ -24,7 +24,49 @@ Planned schedule of work:
In development on the `proxy` branch.
For June's work on [[design/passthrough_proxy]], implementation plan:
For June's work on [[design/passthrough_proxy]], remaining todos:
* Since proxying to special remotes is not supported yet, and won't be for
the first release, make it fail in a reasonable way.
- or -
* Proxying for special remotes.
Including encryption and chunking. See design for issues.
# items deferred until later for [[design/passthrough_proxy]]
* Indirect uploads when proxying for special remote
(to be considered). See design.
* Getting a key from a cluster currently picks from among
the lowest cost remotes at random. This could be smarter,
eg prefer to avoid using remotes that are doing other transfers at the
same time.
* The cost of a proxied node that is accessed via an intermediate gateway
is currently the same as a node accessed via the cluster gateway.
To fix this, there needs to be some way to tell how many hops through
gateways it takes to get to a node. Currently the only way is to
guess based on number of dashes in the node name, which is not satisfying.
Even counting hops is not very satisfying, since one cluster gateway could
be much more expensive to traverse than another one.
If seriously tackling this, it might be worth making enough information
available to use spanning tree protocol for routing inside clusters.
* Optimise proxy speed. See design for ideas.
* Use `sendfile()` to avoid data copying overhead when
`receiveBytes` is being fed right into `sendBytes`.
Library to use:
<https://hackage.haskell.org/package/hsyscall-0.4/docs/System-Syscall.html>
* Support using a proxy when its url is a P2P address.
(Eg tor-annex remotes.)
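
The hop-guessing heuristic mentioned above (counting dashes in the node name) could look roughly like the following. This is only a sketch under stated assumptions: `guessHops` and `adjustedCost` are hypothetical names, not git-annex functions, and the per-hop penalty of 0.1 is an arbitrary illustrative value.

```haskell
-- Hypothetical sketch, not git-annex's actual code.

-- | Guess how many gateways sit between us and a proxied node,
-- assuming proxied remote names are joined with dashes
-- (eg "foo-bar-baz" is reached via foo and then bar).
guessHops :: String -> Int
guessHops = length . filter (== '-')

-- | Add a small, arbitrary penalty to a base cost for each guessed hop.
adjustedCost :: Double -> String -> Double
adjustedCost base name = base + 0.1 * fromIntegral (guessHops name)
```

One weakness of this guess is that a remote whose own name contains a dash, such as `my-server`, is counted as one hop away even when it is accessed directly.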
# completed items for June's work on [[design/passthrough_proxy]]:
* UUID discovery via git-annex branch. Add a log file listing UUIDs
accessible via proxy UUIDs. It also will contain the names
@ -40,7 +82,7 @@ For June's work on [[design/passthrough_proxy]], implementation plan:
* Proxy should update location tracking information for proxied remotes,
so it is available to other users who sync with it. (done)
* Implement `git-annex updatecluster` command (done)
* Implement `git-annex initcluster` and `git-annex updatecluster` commands (done)
* Implement cluster UUID insertion on location log load, and removal
on location log store. (done)
@ -48,66 +90,39 @@ For June's work on [[design/passthrough_proxy]], implementation plan:
* Omit cluster UUIDs when constructing drop proofs, since lockcontent will
always fail on a cluster. (done)
* Don't count cluster UUID as a copy. (done)
* Don't count cluster UUID as a copy in numcopies checking etc. (done)
* Tab complete proxied remotes and clusters in eg --from option. (done)
* Getting a key from a cluster should proxy from one of the nodes that has
it. (done)
* Getting a key from a cluster currently always selects the lowest cost
remote, and always the same remote if cost is the same. Should
round-robin among remotes, and prefer to avoid using remotes that
other git-annex processes are currently using.
* Implement upload with fanout and reporting back additional UUIDs over P2P
protocol. (done, but need to check for fencepost errors on resume of
incomplete upload with remotes at different points)
* On upload to cluster, send to nodes where it's preferred content, and not
to other nodes.
* Implement upload with fanout to multiple cluster nodes and reporting back
additional UUIDs over P2P protocol. (done)
* Implement cluster drops, trying to remove from all nodes, and returning
which UUIDs it was dropped from.
which UUIDs it was dropped from. (done)
Problem: May lock content on cluster
nodes to satisfy numcopies (rather than locking elsewhere) and so not be
able to drop from nodes. Avoid using cluster nodes when constructing drop
proof for cluster.
* `git-annex testremote` works against proxied remote and cluster. (done)
Problem: When nodes are special remotes, may
treat nodes as copies while dropping from cluster, and so violate
numcopies. (But not mincopies.)
* Avoid `git-annex sync --content` etc from operating on cluster nodes by
default since syncing with a cluster implicitly syncs with its nodes. (done)
Problem: `move --from cluster` in "does this make it worse"
check may fail to realize that dropping from multiple nodes does in fact
make it worse.
* On upload to cluster, send to nodes where it's preferred content, and not
to other nodes. (done)
* On upload to a cluster, as well as fanout to nodes, if the key is
preferred content of the proxy repository, store it there.
(But not when preferred content is not configured.)
And on download from a cluster, if the proxy repository has the content,
get it from there to avoid the overhead of proxying to a node.
* Support annex.jobs for clusters. (done)
* Basic proxying to special remote support (non-streaming).
* Add `git-annex extendcluster` command and extend `git-annex updatecluster`
to support clusters with multiple gateways. (done)
* Support proxies-of-proxies better, eg foo-bar-baz.
Currently, it does work, but have to run `git-annex updateproxy`
on foo in order for it to notice the bar-baz proxied remote exists,
and record it as foo-bar-baz. Make it skip recording proxies of
proxies like that, and instead automatically generate those from the log.
(With cycle prevention there of course.)
* Support proxying for a remote that is proxied by another gateway of
a cluster. (done)
* Cycle prevention including cluster-in-cluster cycles. See design.
* Support distributed clusters: Make a proxy for a cluster repeat
protocol messages on to any remotes that have the same UUID as
the cluster. Needs extension to P2P protocol to avoid cycles.
(done)
* Optimise proxy speed. See design for ideas.
* Use `sendfile()` to avoid data copying overhead when
`receiveBytes` is being fed right into `sendBytes`.
* Encryption and chunking. See design for issues.
* Indirect uploads (to be considered). See design.
* Support using a proxy when its url is a P2P address.
(Eg tor-annex remotes.)
* Proxied cluster nodes should have slightly higher cost than the cluster
gateway. (done)


@ -6,7 +6,7 @@ remotes.
So this todo remains open, but is now only concerned with
streaming an object that is being received from one remote out to another
remote without first needing to buffer the whole object on disk.
repository without first needing to buffer the whole object on disk.
git-annex's remote interface does not currently support that.
`retrieveKeyFile` stores the object into a file. And `storeKey`
@ -27,3 +27,7 @@ Receiving to a file, and sending from the same file as it grows is one
possibility, since that would handle buffering, and it might avoid needing
to change interfaces as much. It would still need a new interface since the
current one does not guarantee the file is written in-order.
A fifo is a possibility, but would certainly not work with remotes
that don't write to the file in-order. Also, resuming a download would not
work with a fifo, since the sending remote wouldn't know where to resume from.
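
To illustrate why in-order writes matter when sending from a file as it grows, here is a minimal sketch. The `Write` type and `sendableBytes` function are hypothetical and not part of git-annex's remote interface. The sketch assumes the receiving side could report each write as an offset and a length, which the current interface does not provide.

```haskell
import Data.List (sortOn)

-- | Hypothetical record of one write made while receiving an object:
-- (offset, length), both in bytes.
type Write = (Integer, Integer)

-- | Number of bytes, counted from the start of the file, that are
-- completely written and so could safely be sent onward.
-- Data written after a gap is not counted until the gap is filled.
sendableBytes :: [Write] -> Integer
sendableBytes = foldl extend 0 . sortOn fst
  where
    extend end (off, len)
      | off <= end = max end (off + len) -- touches or overlaps the prefix
      | otherwise  = end                 -- gap: cannot stream past it yet
```

With in-order writes the sendable prefix grows after every write, so the sender can follow closely behind the receiver. With out-of-order writes the prefix can stall at a gap, which is the case a fifo cannot handle at all.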