Merge branch 'proxy'
commit
c3f88923c0
78 changed files with 3145 additions and 448 deletions
|
@ -124,9 +124,16 @@ See [[todo/proving_preferred_content_behavior]].
|
|||
## rebalancing
|
||||
|
||||
In both the 3 of 5 use case and a split brain situation, it's possible for
|
||||
content to end up not optimally balanced between repositories. git-annex
|
||||
can be made to operate in a mode where it does additional work to rebalance
|
||||
repositories.
|
||||
content to end up not optimally balanced between repositories.
|
||||
|
||||
(There are also situations where a cluster node ends up without a copy
|
||||
of a file that is preferred content, or where adding a copy to a node
|
||||
would satisfy numcopies. This can happen eg, when a client sends a file
|
||||
to a single node rather than to the cluster. Rebalancing also will deal
|
||||
with those.)
|
||||
|
||||
git-annex can be made to operate in a mode where it does additional work
|
||||
to rebalance repositories.
|
||||
|
||||
This can be an option like --rebalance, that changes how the preferred content
|
||||
expression is evaluated. The user can choose where and when to run that.
|
||||
|
|
|
@ -40,8 +40,8 @@ The server responds with either its own UUID when authentication
|
|||
is successful. Or, it can fail the authentication, and close the
|
||||
connection.
|
||||
|
||||
AUTH_SUCCESS UUID
|
||||
AUTH_FAILURE
|
||||
AUTH-SUCCESS UUID
|
||||
AUTH-FAILURE
|
||||
|
||||
Note that authentication does not guarantee that the client is talking to
|
||||
who they expect to be talking to. This, and encryption of the connection,
|
||||
|
@ -64,6 +64,19 @@ that is less than or equal to the version the client sent:
|
|||
|
||||
Now both client and server should use version 1.
|
||||
|
||||
## Cluster cycle prevention
|
||||
|
||||
In protocol version 2, immediately after VERSION, the
|
||||
client can send an additional message that is used to
|
||||
prevent cycles when accessing clusters.
|
||||
|
||||
BYPASS UUID1 UUID2 ...
|
||||
|
||||
The UUIDs are cluster gateways to avoid connecting to when
|
||||
serving a cluster.
|
||||
|
||||
The server makes no response to this message.
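As a rough illustration of the message described above, a client could build the BYPASS line and a server could parse it as follows. The helper names `format_bypass` and `parse_bypass` are hypothetical, and the sketch assumes a space-separated, newline-terminated line; it is not taken from git-annex's implementation.

```python
from typing import List, Optional


def format_bypass(gateway_uuids: List[str]) -> str:
    """Build a BYPASS line listing the cluster gateways to avoid (hypothetical helper)."""
    return " ".join(["BYPASS"] + list(gateway_uuids)) + "\n"


def parse_bypass(line: str) -> Optional[List[str]]:
    """Return the UUIDs carried by a BYPASS line, or None for any other message."""
    tokens = line.split()
    if not tokens or tokens[0] != "BYPASS":
        return None
    return tokens[1:]
```

Since the server makes no response, a parser like this only needs to record the UUIDs for later use when serving a cluster.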
|
||||
|
||||
## Binary data
|
||||
|
||||
The protocol allows raw binary data to be sent. This is done
|
||||
|
@ -117,6 +130,10 @@ To remove a key's content from the server, the client sends:
|
|||
|
||||
The server responds with either SUCCESS or FAILURE.
|
||||
|
||||
In protocol version 2, the server can optionally reply with SUCCESS-PLUS
|
||||
or FAILURE-PLUS. Each has a subsequent list of UUIDs of repositories
|
||||
that the content was removed from.
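A client handling both the version 1 and version 2 replies might normalize them into one shape. This is only a sketch: `parse_remove_reply` is a hypothetical name, and it assumes the UUID list follows the keyword on the same line, separated by spaces.

```python
from typing import List, Tuple


def parse_remove_reply(line: str) -> Tuple[bool, List[str]]:
    """Return (succeeded, uuids) for a reply to a removal request.

    Assumes the UUIDs follow SUCCESS-PLUS or FAILURE-PLUS on the same line.
    """
    tokens = line.split()
    if not tokens:
        raise ValueError("empty reply")
    status, rest = tokens[0], tokens[1:]
    if status in ("SUCCESS", "FAILURE"):
        # Plain replies carry no UUID list.
        return status == "SUCCESS", []
    if status in ("SUCCESS-PLUS", "FAILURE-PLUS"):
        return status == "SUCCESS-PLUS", rest
    raise ValueError(f"unexpected reply: {status}")
```

Note that FAILURE-PLUS can still list UUIDs, so a caller should record those removals even when the overall result is a failure.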
|
||||
|
||||
## Storing content on the server
|
||||
|
||||
To store content on the server, the client sends:
|
||||
|
@ -132,7 +149,14 @@ spaces, since it's not the last token in the line. Use '%' to indicate
|
|||
whitespace.)
|
||||
|
||||
The server may respond with ALREADY-HAVE if it already
|
||||
had the content of that key. Otherwise, it responds with:
|
||||
had the content of that key.
|
||||
|
||||
In protocol version 2, the server can optionally reply with
|
||||
ALREADY-HAVE-PLUS. The subsequent list of UUIDs are additional
|
||||
UUIDs where the content is stored, in addition to the UUID where
|
||||
the client was going to send it.
|
||||
|
||||
Otherwise, it responds with:
|
||||
|
||||
PUT-FROM Offset
|
||||
|
||||
|
@ -152,6 +176,9 @@ was being sent.
|
|||
If the server successfully receives the data and stores the content,
|
||||
it replies with SUCCESS. Otherwise, FAILURE.
|
||||
|
||||
In protocol version 2, the server can optionally reply with SUCCESS-PLUS
|
||||
and a list of UUIDs where the content was stored.
|
||||
|
||||
## Getting content from the server
|
||||
|
||||
To get content from the server, the client sends:
|
||||
|
@ -192,6 +219,8 @@ its exit code.
|
|||
|
||||
CONNECTDONE ExitCode
|
||||
|
||||
After that, the server closes the connection.
|
||||
|
||||
## Change notification
|
||||
|
||||
The client can request to be notified when a ref in
|
||||
|
|
|
@ -35,7 +35,7 @@ For example (eliding the full HTTP responses, only showing the data):
|
|||
> Content-Length: ...
|
||||
>
|
||||
> AUTH 79a5a1f4-07e8-11ef-873d-97f93ca91925
|
||||
< AUTH_SUCCESS ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6
|
||||
< AUTH-SUCCESS ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6
|
||||
|
||||
> POST /git-annex HTTP/1.0
|
||||
> Content-Type: x-git-annex-p2p
|
||||
|
@ -80,7 +80,7 @@ correspond to each action in the P2P protocol.
|
|||
Something like this:
|
||||
|
||||
> GET /git-annex/v1/AUTH?clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925 HTTP/1.0
|
||||
< AUTH_SUCCESS ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6
|
||||
< AUTH-SUCCESS ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6
|
||||
|
||||
> GET /git-annex/v1/CHECKPRESENT?key=SHA1--foo&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.0
|
||||
> SUCCESS
|
||||
|
|
|
@ -219,11 +219,6 @@ And, if the proxy repository itself contains the requested key, it can send
|
|||
it directly. This allows the proxy repository to be primed with frequently
|
||||
accessed files when it has the space.
|
||||
|
||||
(Should uploads check preferred content of the proxy repository and also
|
||||
store a copy there when allowed? I think this would be ok, so long as when
|
||||
preferred content is not set, it does not default to storing content
|
||||
there.)
|
||||
|
||||
When a drop is requested from the cluster's UUID, git-annex-shell drops
|
||||
from all nodes, as well as from the proxy itself. Only indicating success
|
||||
if it is able to delete all copies from the cluster. This needs
|
||||
|
@ -238,6 +233,14 @@ always fail. Also, when constructing a drop proof for a cluster's UUID,
|
|||
the nodes of that cluster should be omitted, otherwise a drop from the
|
||||
cluster can lock content on individual nodes, causing the drop to fail.
|
||||
|
||||
Moving from a cluster is a special case because it may reduce the number
|
||||
of copies. So move's `willDropMakeItWorse` check needs to special case
|
||||
clusters. Since dropping from the cluster may remove content from any of
|
||||
its nodes, which may include copies on nodes that the local location log does
|
||||
not know about yet, the special case probably needs to always assume
|
||||
that dropping from a cluster in a move risks reducing numcopies,
|
||||
and so only allow it when a drop proof can be constructed.
|
||||
|
||||
Some commands like `git-annex whereis` will list content as being stored in
|
||||
the cluster, as well as on whichever of its nodes, and whereis currently
|
||||
says "n copies", but since the cluster doesn't count as a copy, that
|
||||
|
@ -279,9 +282,9 @@ configuration of the cluster. But the cluster is configured via the
|
|||
git-annex branch, particularly preferred content, and the proxy log, and
|
||||
the cluster log.
|
||||
|
||||
A user could, for example, make the cluster's frontend want all
|
||||
content, and so fill up its small disk. They could make a particular node
|
||||
not want any content. They could remove nodes from the cluster.
|
||||
A user could, for example, make a small cluster node want all content, and
|
||||
so fill up its small disk. They could make a particular node not want any
|
||||
content. They could remove nodes from the cluster.
|
||||
|
||||
One way to deal with this is for the cluster to reject git-annex branch
|
||||
pushes that make such changes. Or only allow them if they are signed with a
|
||||
|
@ -296,24 +299,43 @@ A remote will only be treated as a node of a cluster when the git
|
|||
configuration remote.name.annex-cluster-node is set, which will prevent
|
||||
creating clusters in places where they are not intended to be.
|
||||
|
||||
## distributed clusters
|
||||
|
||||
A cluster's nodes may be geographically distributed among several
|
||||
locations, which are effectively subclusters. To support this, an upload
|
||||
or removal sent to one frontend proxy of the cluster will be repeated to
|
||||
other frontend proxies that are remotes of that one and have the cluster's
|
||||
UUID.
|
||||
|
||||
This is better than supporting a cluster that is a node of another cluster,
|
||||
because rather than a hierarchical structure, this allows for organic
|
||||
structures of any shape. For example, there could be two frontends to a
|
||||
cluster, in different locations. An upload to either frontend fans out to
|
||||
its local nodes as well as over to the other frontend, and to its local
|
||||
nodes.
|
||||
|
||||
This does mean that cycles need to be prevented. See section below.
|
||||
|
||||
## speed
|
||||
|
||||
A passthrough proxy should be as fast as possible so as not to add overhead
|
||||
A proxy should be as fast as possible so as not to add overhead
|
||||
to a file retrieve, store, or checkpresent. This probably means that
|
||||
it keeps TCP connections open to each host in the cluster. It might use a
|
||||
it keeps TCP connections open to each host. It might use a
|
||||
protocol with less overhead than ssh.
|
||||
|
||||
In the case of checkpresent, it would be possible for the proxy to not
|
||||
communicate with the cluster to check that the data is still present on it.
|
||||
As long as all access is intermediated via the proxy, its git-annex branch
|
||||
could be relied on to always be correct, in theory. Proving that theory,
|
||||
making sure to account for all possible race conditions and other scenarios,
|
||||
would be necessary for such an optimisation.
|
||||
In the case of checkpresent, it would be possible for the gateway to not
|
||||
communicate with cluster nodes to check that the data is still present
|
||||
in the cluster. As long as all access is intermediated via a single gateway,
|
||||
its git-annex branch could be relied on to always be correct, in theory.
|
||||
Proving that theory, making sure to account for all possible race conditions
|
||||
and other scenarios, would be necessary for such an optimisation. This
|
||||
would not work for multi-gateway clusters unless the gateways were kept in
|
||||
sync about locations, which they currently are not.
|
||||
|
||||
Another way the proxy could speed things up is to cache some subset of
|
||||
content. Eg, analyze what files are typically requested, and store another
|
||||
copy of those on the proxy. Perhaps prioritize storing smaller files, where
|
||||
latency tends to swamp transfer speed.
|
||||
Another way the cluster gateway could speed things up is to cache some
|
||||
subset of content. Eg, analyze what files are typically requested, and
|
||||
store another copy of those on the proxy. Perhaps prioritize storing
|
||||
smaller files, where latency tends to swamp transfer speed.
|
||||
|
||||
## proxying to special remotes
|
||||
|
||||
|
@ -446,7 +468,7 @@ So overall, it seems better to do proxy-side encryption. But it may be
|
|||
worth adding a special remote that does its own client-side encryption
|
||||
in front of the proxy.
|
||||
|
||||
## cycles
|
||||
## cycles of proxies
|
||||
|
||||
A repo can advertise that it proxies for a repo which has the same uuid as
|
||||
itself. Or there can be a larger cycle involving a proxy that proxies to a
|
||||
|
@ -454,36 +476,43 @@ proxy, etc.
|
|||
|
||||
Since the proxied repo uuid is communicated to git-annex-shell via
|
||||
--uuid, a repo that advertises proxying for itself will be connected to
|
||||
with its own uuid. No proxying is done in this case. Same happens with a
|
||||
larger cycle.
|
||||
|
||||
Instantiating remotes needs to identify cycles and break them. Otherwise
|
||||
it would construct an infinite number of proxied remotes with names
|
||||
like "foo-foo-foo-foo-..." or "foo-bar-foo-bar-..."
|
||||
|
||||
Once `git-annex copy --to proxy` is implemented, and the proxy decides
|
||||
where to send content that is being sent directly to it, cycles will
|
||||
become an issue with that as well.
|
||||
with its own uuid. No proxying is done in that case.
|
||||
|
||||
What if repo A is a proxy and has repo B as a remote. Meanwhile, repo B is
|
||||
a proxy and has repo A as a remote?
|
||||
a proxy and has repo A as a remote? git-annex-shell on repo A will get
|
||||
A's uuid, and so will operate on it directly without proxying. So larger
|
||||
cycles are also not a problem on the proxy side.
|
||||
|
||||
An upload to repo A will start by checking if repo B wants the content and if so,
|
||||
start an upload to repo B. Then the same happens on repo B, leading it to
|
||||
start an upload to repo A.
|
||||
On the client side, instantiating remotes needs to identify cycles and
|
||||
break them. Otherwise it would construct an infinite number of proxied
|
||||
remotes with names like "foo-foo-foo-foo-..." or "foo-bar-foo-bar-..."
|
||||
|
||||
At this point, it might be possible for git-annex to detect the cycle,
|
||||
if the proxy uses a transfer lock file. If repo B or repo A had some other
|
||||
remote that is not part of a cycle, they could deposit the upload there and
|
||||
the upload still succeed. Otherwise the upload would fail, which is
|
||||
probably the best that can be done with such a broken configuration.
|
||||
## cycles of cluster proxies
|
||||
|
||||
So, it seems like proxies would need to take transfer locks for uploads,
|
||||
even though the content is being proxied to elsewhere.
|
||||
If a PUT or REMOVE message is sent to a proxy for a cluster, and that
|
||||
repository has a remote that is also a proxy for the same cluster,
|
||||
the message gets repeated on to it. This can lead to cycles, which have to
|
||||
be broken.
|
||||
|
||||
Dropping could have similar cycles with content presence locking, which
|
||||
needs to be thought through as well. A cycle of the actual dropContent
|
||||
operation might also be possible.
|
||||
To break the cycle, extend the P2P protocol with an additional message,
|
||||
like:
|
||||
|
||||
VIA uuid1 uuid2
|
||||
|
||||
This indicates to a proxy that the message has been received via the other
|
||||
listed proxies. It can then avoid repeating the message out via any of
|
||||
those proxies. When repeating a message out to another proxy, just add
|
||||
the UUID of the local repository to the list.
|
||||
|
||||
This will be an extension to the protocol, but so long as it's added in
|
||||
the same git-annex version that adds support for proxies, every cluster
|
||||
proxy will support it.
|
||||
|
||||
This avoids cycles, but it does not avoid situations where there are
|
||||
multiple paths through a proxy network that reach the same node. In such a
|
||||
situation, a REMOVE might happen twice (no problem) or a PUT be received
|
||||
twice from different paths (one of them would fail due to the other one
|
||||
taking the transfer lock).
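The behaviour described in this section can be simulated with a toy model. This is a sketch under assumptions: proxies are plain strings, `remotes` is a hypothetical map from each proxy to the other cluster proxies it has as remotes, and `repeat_message` stands in for handling a PUT or REMOVE.

```python
from typing import Dict, List


def repeat_message(proxy: str, via: List[str],
                   remotes: Dict[str, List[str]],
                   deliveries: List[str]) -> None:
    """Deliver a message to `proxy`, then repeat it to remotes not already in VIA."""
    deliveries.append(proxy)
    # When repeating the message onward, add the local UUID to the VIA list.
    next_via = via + [proxy]
    for remote in remotes.get(proxy, []):
        if remote in next_via:
            continue  # the message already passed through this proxy
        repeat_message(remote, next_via, remotes, deliveries)


# Two proxies that have each other as remotes: the cycle is broken.
cycle: List[str] = []
repeat_message("A", [], {"A": ["B"], "B": ["A"]}, cycle)

# Multiple paths to the same proxies: no infinite loop, but repeated deliveries.
multi: List[str] = []
repeat_message("A", [], {"A": ["B", "C"], "B": ["C"], "C": ["B"]}, multi)
```

In the second run, B and C are each reached twice by different paths, which is the multiple-path situation the text says VIA does not prevent.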
|
||||
|
||||
## exporttree=yes
|
||||
|
||||
|
|
44
doc/git-annex-extendcluster.mdwn
Normal file
|
@ -0,0 +1,44 @@
|
|||
# NAME
|
||||
|
||||
git-annex extendcluster - add an additional gateway to a cluster
|
||||
|
||||
# SYNOPSIS
|
||||
|
||||
git-annex extendcluster gateway clustername
|
||||
|
||||
# DESCRIPTION
|
||||
|
||||
This command is used to configure a repository to serve as an additional
|
||||
gateway to a cluster. It is run in that repository.
|
||||
|
||||
The repository this command is run in should have a remote that is a
|
||||
gateway to the cluster. The `gateway` parameter is the name of that remote.
|
||||
The `clustername` parameter is the name of the cluster.
|
||||
|
||||
The next step after running this command is to configure
|
||||
any additional cluster nodes that this gateway serves to the cluster,
|
||||
then run [[git-annex-updatecluster]]. See the documentation of that
|
||||
command for details about configuring nodes.
|
||||
|
||||
After running this command in the new gateway repository, it typically
|
||||
needs to be run in the other gateway repositories as well,
|
||||
after adding the new gateway repository as a remote.
|
||||
|
||||
# OPTIONS
|
||||
|
||||
* The [[git-annex-common-options]](1) can be used.
|
||||
|
||||
# SEE ALSO
|
||||
|
||||
[[git-annex]](1)
|
||||
[[git-annex-initcluster]](1)
|
||||
[[git-annex-updatecluster]](1)
|
||||
[[git-annex-updateproxy]](1)
|
||||
|
||||
<https://git-annex.branchable.com/tips/clusters/>
|
||||
|
||||
# AUTHOR
|
||||
|
||||
Joey Hess <id@joeyh.name>
|
||||
|
||||
Warning: Automatically converted into a man page by mdwn2man. Edit with care.
|
39
doc/git-annex-initcluster.mdwn
Normal file
|
@ -0,0 +1,39 @@
|
|||
# NAME
|
||||
|
||||
git-annex initcluster - initialize a new cluster
|
||||
|
||||
# SYNOPSIS
|
||||
|
||||
git-annex initcluster name [description]
|
||||
|
||||
# DESCRIPTION
|
||||
|
||||
This command initializes a new cluster with the specified name. If no
|
||||
description is provided, one will be set automatically.
|
||||
|
||||
This command should be run in the repository that will serve as the gateway
|
||||
to the cluster.
|
||||
|
||||
The next step after running this command is to configure
|
||||
the cluster nodes, then run [[git-annex-updatecluster]]. See the
|
||||
documentation of that command for details about configuring nodes.
|
||||
|
||||
# OPTIONS
|
||||
|
||||
* The [[git-annex-common-options]](1) can be used.
|
||||
|
||||
# SEE ALSO
|
||||
|
||||
[[git-annex]](1)
|
||||
[[git-annex-updatecluster]](1)
|
||||
[[git-annex-extendcluster]](1)
|
||||
[[git-annex-preferred-content]](1)
|
||||
[[git-annex-updateproxy]](1)
|
||||
|
||||
<https://git-annex.branchable.com/tips/clusters/>
|
||||
|
||||
# AUTHOR
|
||||
|
||||
Joey Hess <id@joeyh.name>
|
||||
|
||||
Warning: Automatically converted into a man page by mdwn2man. Edit with care.
|
|
@ -8,7 +8,7 @@ Each repository has a preferred content setting, which specifies content
|
|||
that the repository wants to have present. These settings can be configured
|
||||
using `git annex vicfg` or `git annex wanted`.
|
||||
They are used by the `--auto` option, by `git annex sync --content`,
|
||||
and by the git-annex assistant.
|
||||
by clusters, and by the git-annex assistant.
|
||||
|
||||
While preferred content expresses a preference, it can be overridden
|
||||
by simply using `git annex drop`. On the other hand, required content
|
||||
|
|
|
@ -9,7 +9,7 @@ git annex required `repository [expression]`
|
|||
# DESCRIPTION
|
||||
|
||||
When run with an expression, configures the content that is required
|
||||
to be held in the archive.
|
||||
to be held in the repository.
|
||||
|
||||
For example:
|
||||
|
||||
|
|
|
@ -86,7 +86,9 @@ first "/~/" or "/~user/" is expanded to the specified home directory.
|
|||
* --uuid=UUID
|
||||
|
||||
git-annex uses this to specify the UUID of the repository it was expecting
|
||||
git-annex-shell to access, as a sanity check.
|
||||
git-annex-shell to access. This is both a sanity check, and allows
|
||||
git-annex-shell to proxy access to remotes, when configured
|
||||
by [[git-annex-updateproxy]].
|
||||
|
||||
* Also the [[git-annex-common-options]](1) can be used.
|
||||
|
||||
|
|
43
doc/git-annex-updatecluster.mdwn
Normal file
|
@ -0,0 +1,43 @@
|
|||
# NAME
|
||||
|
||||
git-annex updatecluster - update records of cluster nodes
|
||||
|
||||
# SYNOPSIS
|
||||
|
||||
git-annex updatecluster
|
||||
|
||||
# DESCRIPTION
|
||||
|
||||
This command is used to record the nodes of a cluster in the git-annex
|
||||
branch, and set up proxying to the nodes. It should be run in the
|
||||
repository that will serve as a gateway to the cluster.
|
||||
|
||||
It looks at the git config `remote.name.annex-cluster-node` of
|
||||
each remote. When that is set to the name of a cluster that has been
|
||||
initialized with `git-annex initcluster`, the node will be recorded in the
|
||||
git-annex branch.
|
||||
|
||||
To remove a node from a cluster, unset `remote.name.annex-cluster-node`
|
||||
and run this command.
|
||||
|
||||
To add additional gateways to a cluster, after running this command,
|
||||
use [[git-annex-extendcluster]].
|
||||
|
||||
# OPTIONS
|
||||
|
||||
* The [[git-annex-common-options]](1) can be used.
|
||||
|
||||
# SEE ALSO
|
||||
|
||||
[[git-annex]](1)
|
||||
[[git-annex-initcluster]](1)
|
||||
[[git-annex-extendcluster]](1)
|
||||
[[git-annex-updateproxy]](1)
|
||||
|
||||
<https://git-annex.branchable.com/tips/clusters/>
|
||||
|
||||
# AUTHOR
|
||||
|
||||
Joey Hess <id@joeyh.name>
|
||||
|
||||
Warning: Automatically converted into a man page by mdwn2man. Edit with care.
|
44
doc/git-annex-updateproxy.mdwn
Normal file
|
@ -0,0 +1,44 @@
|
|||
# NAME
|
||||
|
||||
git-annex updateproxy - update records with proxy configuration
|
||||
|
||||
# SYNOPSIS
|
||||
|
||||
git annex updateproxy
|
||||
|
||||
# DESCRIPTION
|
||||
|
||||
A git-annex repository can act as a proxy for its remotes. That allows
|
||||
annexed content to be stored and removed from the proxy's remotes, by
|
||||
repositories that do not have a direct connection to the remotes.
|
||||
|
||||
By default, no proxying is done. To configure the local repository to act
|
||||
as a proxy for its remote named "foo", run
|
||||
`git config remote.foo.annex-proxy true`.
|
||||
|
||||
After setting or unsetting `remote.<name>.annex-proxy` git configurations,
|
||||
run `git-annex updateproxy` to record the proxy configuration in the
|
||||
git-annex branch. That tells other repositories about the proxy
|
||||
configuration.
|
||||
|
||||
Suppose, for example, that remote "work" has had this command run in
|
||||
it. Then after pulling from "work", git-annex will know about an
|
||||
additional remote, "work-foo". That remote will be accessed using "work" as
|
||||
a proxy.
|
||||
|
||||
Proxies can only be accessed via ssh.
|
||||
|
||||
# OPTIONS
|
||||
|
||||
* The [[git-annex-common-options]](1) can be used.
|
||||
|
||||
# SEE ALSO
|
||||
|
||||
[[git-annex]](1)
|
||||
[[git-annex-updatecluster]](1)
|
||||
|
||||
# AUTHOR
|
||||
|
||||
Joey Hess <id@joeyh.name>
|
||||
|
||||
Warning: Automatically converted into a man page by mdwn2man. Edit with care.
|
|
@ -9,7 +9,7 @@ git annex wanted `repository [expression]`
|
|||
# DESCRIPTION
|
||||
|
||||
When run with an expression, configures the content that is preferred
|
||||
to be held in the archive. See [[git-annex-preferred-content]](1)
|
||||
to be held in the repository. See [[git-annex-preferred-content]](1)
|
||||
|
||||
For example:
|
||||
|
||||
|
|
|
@ -252,7 +252,6 @@ content from the key-value store.
|
|||
|
||||
See [[git-annex-configremote]](1) for details.
|
||||
|
||||
|
||||
* `renameremote`
|
||||
|
||||
Renames a special remote.
|
||||
|
@ -327,6 +326,31 @@ content from the key-value store.
|
|||
|
||||
See [[git-annex-required]](1) for details.
|
||||
|
||||
* `initcluster`
|
||||
|
||||
Initializes a new cluster.
|
||||
|
||||
See [[git-annex-initcluster]](1) for details.
|
||||
|
||||
* `updatecluster`
|
||||
|
||||
Update records of cluster nodes.
|
||||
|
||||
See [[git-annex-updatecluster]](1) for details.
|
||||
|
||||
* `extendcluster`
|
||||
|
||||
Adds an additional gateway to a cluster.
|
||||
|
||||
See [[git-annex-extendcluster]](1) for details.
|
||||
|
||||
|
||||
* `updateproxy`
|
||||
|
||||
Update records with proxy configuration.
|
||||
|
||||
See [[git-annex-updateproxy]](1) for details.
|
||||
|
||||
* `schedule repository [expression]`
|
||||
|
||||
Get or set scheduled jobs.
|
||||
|
@ -1372,6 +1396,15 @@ repository, using [[git-annex-config]]. See its man page for a list.)
|
|||
set in global git configuration.
|
||||
For details, see <https://git-annex.branchable.com/tuning/>.
|
||||
|
||||
* `annex.cluster.<name>`
|
||||
|
||||
This is set to make the repository be a gateway to a cluster.
|
||||
The value is the cluster UUID. Note that cluster UUIDs are not
|
||||
the same as repository UUIDs, and a repository UUID cannot be used here.
|
||||
|
||||
Usually this is set up by running [[git-annex-initcluster]] or
|
||||
[[git-annex-extendcluster]].
|
||||
|
||||
# CONFIGURATION OF REMOTES
|
||||
|
||||
Remotes are configured using these settings in `.git/config`.
|
||||
|
@ -1640,6 +1673,38 @@ Remotes are configured using these settings in `.git/config`.
|
|||
content of any file, even though its normal location tracking does not
|
||||
indicate that it does. This will cause git-annex to try to get all file
|
||||
contents from the remote. Can be useful in setting up a caching remote.
|
||||
|
||||
* `remote.<name>.annex-proxy`
|
||||
|
||||
Set to "true" to make the local repository able to act as a proxy to this
|
||||
remote.
|
||||
|
||||
After configuring this, run [[git-annex-updateproxy]](1) to store
|
||||
the new configuration in the git-annex branch.
|
||||
|
||||
* `remote.<name>.annex-proxied-by`
|
||||
|
||||
Usually this is used internally, when git-annex sets up proxied remotes,
|
||||
and will not need to be configured. The value is the UUID of the
|
||||
git-annex repository that proxies access to this remote.
|
||||
|
||||
* `remote.<name>.annex-cluster-node`
|
||||
|
||||
Set to the name of a cluster to make this remote be part of
|
||||
the cluster. Names of multiple clusters can be separated by
|
||||
whitespace to make a remote be part of more than one cluster.
|
||||
|
||||
After configuring this, run [[git-annex-updatecluster]](1) to store
|
||||
the new configuration in the git-annex branch.
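To illustrate how whitespace-separated cluster names in this setting map remotes to clusters, here is a small sketch. The function `nodes_by_cluster` and the dictionary form of the git configuration are assumptions made for the example, not part of git-annex.

```python
from collections import defaultdict
from typing import Dict, List


def nodes_by_cluster(git_config: Dict[str, str]) -> Dict[str, List[str]]:
    """Group remote names by the clusters named in their annex-cluster-node setting."""
    prefix, suffix = "remote.", ".annex-cluster-node"
    clusters: Dict[str, List[str]] = defaultdict(list)
    for key, value in sorted(git_config.items()):
        if key.startswith(prefix) and key.endswith(suffix):
            remote = key[len(prefix):-len(suffix)]
            # A remote can be a node of several clusters, separated by whitespace.
            for cluster in value.split():
                clusters[cluster].append(remote)
    return dict(clusters)


example = nodes_by_cluster({
    "remote.node1.annex-cluster-node": "mycluster",
    "remote.node2.annex-cluster-node": "mycluster othercluster",
    "remote.node1.url": "/media/disk1/repo",
})
```

Here `node2` ends up in both clusters, while the unrelated `remote.node1.url` setting is ignored.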
|
||||
|
||||
* `remote.<name>.annex-cluster-gateway`
|
||||
|
||||
Set to the UUID of a cluster that this remote serves as a gateway for.
|
||||
Multiple UUIDs can be listed, separated by whitespace. When the local
|
||||
repository is also a gateway for that cluster, it will proxy for the
|
||||
nodes of the remote gateway.
|
||||
|
||||
Usually this is set up by running [[git-annex-extendcluster]].
|
||||
|
||||
* `remote.<name>.annex-private`
|
||||
|
||||
|
|
|
@ -288,7 +288,7 @@ For example:
|
|||
These log files store per-remote content identifiers for keys.
|
||||
A given key may have any number of content identifiers.
|
||||
|
||||
The format is a timestamp, followed by the uuid of the remote,
|
||||
The format is a timestamp, followed by the UUID of the remote,
|
||||
followed by the content identifiers which are separated by colons.
|
||||
If a content identifier contains a colon or \r or \n, it will be base64
|
||||
encoded. Base64 encoded values are indicated by prefixing them with "!".
|
||||
|
@ -308,6 +308,33 @@ For example, this logs that a remote has an object stored using both
|
|||
|
||||
(When those chunks are removed from the remote, the 9 is changed to 0.)
|
||||
|
||||
## `proxy.log`
|
||||
|
||||
Used to record what repositories are accessible via a proxy.
|
||||
|
||||
Each line starts with a timestamp, then the UUID of the repository
|
||||
that can serve as a proxy, and then a list of the remotes that it can
|
||||
proxy to, separated by spaces.
|
||||
|
||||
Each remote in the list consists of a repository's UUID,
|
||||
followed by a colon (`:`) and then a remote name.
|
||||
|
||||
For example:
|
||||
|
||||
1317929100.012345s e605dca6-446a-11e0-8b2a-002170d25c55 26339d22-446b-11e0-9101-002170d25c55:foo c076460c-2290-11ef-be53-b7f0d194c863:bar
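A line in this format could be parsed roughly as follows. The names `ProxyLogEntry` and `parse_proxy_log_line` are hypothetical, and the sketch assumes the timestamp always carries a trailing "s" as in the example above.

```python
from typing import Dict, NamedTuple


class ProxyLogEntry(NamedTuple):
    timestamp: float
    proxy_uuid: str
    remotes: Dict[str, str]  # proxied repository UUID -> remote name


def parse_proxy_log_line(line: str) -> ProxyLogEntry:
    """Parse one proxy.log line: timestamp, proxy UUID, then UUID:name pairs."""
    fields = line.split()
    if len(fields) < 2:
        raise ValueError("expected at least a timestamp and a proxy UUID")
    timestamp = float(fields[0].rstrip("s"))
    remotes = {}
    for item in fields[2:]:
        # Split at the first colon; the UUID itself contains no colon.
        uuid, _, name = item.partition(":")
        remotes[uuid] = name
    return ProxyLogEntry(timestamp, fields[1], remotes)


entry = parse_proxy_log_line(
    "1317929100.012345s e605dca6-446a-11e0-8b2a-002170d25c55 "
    "26339d22-446b-11e0-9101-002170d25c55:foo "
    "c076460c-2290-11ef-be53-b7f0d194c863:bar"
)
```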
|
||||
|
||||
## `cluster.log`
|
||||
|
||||
Used to record the UUIDs of clusters, and the UUIDs of the nodes
|
||||
comprising each cluster.
|
||||
|
||||
Each line starts with a timestamp, then the UUID of the cluster,
|
||||
followed by a list of the UUIDs of its nodes, separated by spaces.
|
||||
|
||||
For example:
|
||||
|
||||
1317929100.012345s 5b070cc8-29b8-11ef-80e1-0fd524be241b 5c0c97d2-29b8-11ef-b1d2-5f3d1c80940d 5c40375e-29b8-11ef-814d-872959d2c013
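The same kind of sketch works for this log. `ClusterLogEntry` and `parse_cluster_log_line` are hypothetical names, and the trailing "s" on the timestamp is assumed from the example line.

```python
from typing import List, NamedTuple


class ClusterLogEntry(NamedTuple):
    timestamp: float
    cluster_uuid: str
    node_uuids: List[str]


def parse_cluster_log_line(line: str) -> ClusterLogEntry:
    """Parse one cluster.log line: timestamp, cluster UUID, then node UUIDs."""
    fields = line.split()
    if len(fields) < 2:
        raise ValueError("expected at least a timestamp and a cluster UUID")
    return ClusterLogEntry(float(fields[0].rstrip("s")), fields[1], fields[2:])


cluster = parse_cluster_log_line(
    "1317929100.012345s 5b070cc8-29b8-11ef-80e1-0fd524be241b "
    "5c0c97d2-29b8-11ef-b1d2-5f3d1c80940d 5c40375e-29b8-11ef-814d-872959d2c013"
)
```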
|
||||
|
||||
## `schedule.log`
|
||||
|
||||
Used to record scheduled events, such as periodic fscks.
|
||||
|
|
|
@ -4,4 +4,10 @@
|
|||
* [[how_it_works]]
|
||||
* [[special_remotes]]
|
||||
* [[workflows|workflow]]
|
||||
* [[preferred_content]]
|
||||
* [[sync]]
|
||||
|
||||
### new features
|
||||
|
||||
* [[tips/clusters]]
|
||||
* [[git-remote-annex|tips/storing_a_git_repository_on_any_special_remote]]
|
||||
|
|
217
doc/tips/clusters.mdwn
Normal file
|
@ -0,0 +1,217 @@
|
|||
A cluster is a collection of git-annex repositories which are combined to
|
||||
form a single logical repository.
|
||||
|
||||
A cluster is accessed via a gateway repository. The gateway is not itself
|
||||
a node of the cluster.
|
||||
|
||||
[[!toc ]]
|
||||
|
||||
## using a cluster
|
||||
|
||||
To use a cluster, your repository needs to have its gateway configured as a
|
||||
remote. Clusters can currently only be accessed via ssh. This gateway
|
||||
remote is added the same as any other remote:
|
||||
|
||||
git remote add bigserver me@bigserver:annex
|
||||
|
||||
The gateway publishes information about the cluster to the git-annex
|
||||
branch. So you may need to fetch from it to learn about the cluster:
|
||||
|
||||
git fetch bigserver
|
||||
|
||||
That will make available an additional remote for the cluster, eg
|
||||
"bigserver-mycluster", as well as some remotes for each node eg
|
||||
"bigserver-node1", "bigserver-node2", etc.
|
||||
|
||||
You can get files from the cluster without caring which node they come
|
||||
from:
|
||||
|
||||
$ git-annex get foo --from bigserver-mycluster
|
||||
get foo (from bigserver-mycluster...) ok
|
||||
|
||||
And you can send files to the cluster, without caring what nodes
|
||||
they are stored to:
|
||||
|
||||
$ git-annex move bar --to bigserver-mycluster
|
||||
move bar (to bigserver-mycluster...) ok
|
||||
|
||||
In fact, a single upload like that can be sent to every node of the cluster
|
||||
at once, very efficiently.
|
||||
|
||||
$ git-annex whereis bar
|
||||
whereis bar (3 copies)
|
||||
acae2ff6-6c1e-8bec-b8b9-397a3755f397 -- [bigserver-mycluster]
|
||||
9f514001-6dc0-4d83-9af3-c64c96626892 -- node 1 [bigserver-node1]
|
||||
d81e0b28-612e-4d73-a4e6-6dabbb03aba1 -- node 2 [bigserver-node2]
|
||||
5657baca-2f11-11ef-ae1a-5b68c6321dd9 -- node 3 [bigserver-node3]
|
||||
|
||||
Notice that the file is shown as present in the cluster, as well as on
|
||||
individual nodes. But the cluster itself does not count as a copy of the file,
|
||||
so the 3 copies are the copies on individual nodes.
|
||||
|
||||
Most other git-annex commands that operate on repositories can also operate on
|
||||
clusters.
|
||||
|
||||
A cluster is not a git repository, and so `git pull bigserver-mycluster`
|
||||
will not work.
|
||||
|
||||
## preferred content of clusters
|
||||
|
||||
The preferred content of the cluster can be configured. This tells
|
||||
users what files the cluster as a whole should contain.
|
||||
|
||||
To configure the preferred content of a cluster, as well as other related
|
||||
things like [[groups|git-annex-group]] and [[required_content]], it's easiest
|
||||
to do the configuration in a repository that has the cluster as a remote.
|
||||
|
||||
For example:
|
||||
|
||||
$ git-annex wanted bigserver-mycluster standard
|
||||
$ git-annex group bigserver-mycluster archive
|
||||
|
||||
By default, when a file is uploaded to a cluster, it is stored on every node of
|
||||
the cluster. To control which nodes to store to, the [[preferred_content]] of
|
||||
each node can be configured.
|
||||
|
||||
It's also a good idea to configure the preferred content of the cluster's
|
||||
gateway. To avoid files redundantly being stored on the gateway
|
||||
(which remember, is not a node of the cluster), you might make it not want
|
||||
any files:
|
||||
|
||||
$ git-annex wanted bigserver nothing
|
||||
|
||||
## setting up a cluster
|
||||
|
||||
A new cluster first needs to be initialized. Run [[git-annex-initcluster]] in
|
||||
the repository that will serve as the cluster's gateway. In the example above,
|
||||
this was the "bigserver" repository.
|
||||
|
||||
$ git-annex initcluster mycluster
|
||||
|
||||
Once a cluster is initialized, the next step is to add nodes to it.
|
||||
To make a remote be a node of the cluster, configure
|
||||
`git config remote.name.annex-cluster-node`, setting it to the
|
||||
name of the cluster.
|
||||
|
||||
In the example above, the three cluster nodes were configured like this:
|
||||
|
||||
$ git remote add node1 /media/disk1/repo
|
||||
$ git remote add node2 /media/disk2/repo
|
||||
$ git remote add node3 /media/disk3/repo
|
||||
$ git config remote.node1.annex-cluster-node mycluster
|
||||
$ git config remote.node2.annex-cluster-node mycluster
|
||||
$ git config remote.node3.annex-cluster-node mycluster
|
||||
|
||||
Finally, run `git-annex updatecluster` to record the cluster configuration
|
||||
in the git-annex branch. That tells other repositories about the cluster.
|
||||
|
||||
$ git-annex updatecluster
|
||||
Added node node1 to cluster: mycluster
|
||||
Added node node2 to cluster: mycluster
|
||||
Added node node3 to cluster: mycluster
|
||||
Started proxying for node1
|
||||
Started proxying for node2
|
||||
Started proxying for node3
|
||||
|
||||
Operations that affect multiple nodes of a cluster can often be sped up by
|
||||
configuring annex.jobs in the repository that will serve the cluster to
|
||||
clients. In the example above, the nodes are all disk bound, so operating
|
||||
on more than one at a time will likely be faster.
|
||||
|
||||
$ git config annex.jobs cpus
|
||||
|
||||
## adding additional gateways to a cluster
|
||||
|
||||
A cluster can have more than one gateway. One way to use this is to
|
||||
make a cluster that is distributed across several locations.
|
||||
|
||||
Suppose you have a datacenter in AMS, and one in NYC. There
|
||||
will be a gateway in each datacenter which provides access to the nodes
|
||||
there. And the gateways will relay data between each other as well.
|
||||
|
||||
Start by setting up the cluster in Amsterdam. The process is the same
|
||||
as in the previous section.
|
||||
|
||||
AMS$ git-annex initcluster mycluster
|
||||
AMS$ git remote add node1 /media/disk1/repo
|
||||
AMS$ git remote add node2 /media/disk2/repo
|
||||
AMS$ git config remote.node1.annex-cluster-node mycluster
|
||||
AMS$ git config remote.node2.annex-cluster-node mycluster
|
||||
AMS$ git-annex updatecluster
|
||||
AMS$ git config annex.jobs cpus
|
||||
|
||||
Now in a clone of the same repository in NYC, add AMS as a git remote
|
||||
accessed with ssh:
|
||||
|
||||
NYC$ git remote add AMS me@amsterdam.example.com:annex
|
||||
NYC$ git fetch AMS
|
||||
|
||||
Setting up the cluster in NYC is different. Rather than using
|
||||
`git-annex initcluster` again (which would make a new, different
|
||||
cluster), we ask git-annex to extend the cluster from AMS:
|
||||
|
||||
NYC$ git-annex extendcluster AMS mycluster
|
||||
|
||||
The rest of the setup process for NYC is the same, except that different
|
||||
nodes are added.
|
||||
|
||||
NYC$ git remote add node3 /media/disk3/repo
|
||||
NYC$ git remote add node4 /media/disk4/repo
|
||||
NYC$ git config remote.node3.annex-cluster-node mycluster
|
||||
NYC$ git config remote.node4.annex-cluster-node mycluster
|
||||
NYC$ git-annex updatecluster
|
||||
NYC$ git config annex.jobs cpus
|
||||
|
||||
Finally, the AMS side of the cluster has to be updated, adding a git remote
|
||||
for NYC, and extending the cluster to there as well:
|
||||
|
||||
AMS$ git remote add NYC me@nyc.example.com:annex
|
||||
AMS$ git-annex sync NYC
|
||||
AMS$ git-annex extendcluster NYC mycluster
|
||||
|
||||
A user can now add either AMS or NYC as a remote, and will have access
|
||||
to the entire cluster as either `AMS-mycluster` or `NYC-mycluster`.
|
||||
|
||||
user$ git-annex move foo --to AMS-mycluster
|
||||
move foo (to AMS-mycluster...) ok
|
||||
|
||||
Looking at where files end up, all the nodes are visible, not only those
|
||||
served by the current gateway.
|
||||
|
||||
user$ git-annex whereis foo
|
||||
whereis foo (4 copies)
|
||||
acfc1cb2-b8d5-8393-b8dc-4a419ea38183 -- cluster mycluster [AMS-mycluster]
|
||||
11ab09a9-7448-45bd-ab81-3997780d00b3 -- node4 [AMS-NYC-node4]
|
||||
36197d0e-6d49-4213-8440-71cbb121e670 -- node2 [AMS-node2]
|
||||
43652651-1efa-442a-8333-eb346db31553 -- node3 [AMS-NYC-node3]
|
||||
7fb5a77b-77a3-4032-b3e5-536698e308b3 -- node1 [AMS-node1]
|
||||
ok
|
||||
|
||||
Notice that remotes for cluster nodes have names indicating the path through
|
||||
the cluster used to access them. For example, "AMS-NYC-node3" is accessed via
|
||||
the AMS gateway, which then relays to NYC where node3 is located.
|
||||
|
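The naming convention can be read mechanically, as in this sketch. It assumes that gateway and node names contain no dashes themselves, which git-annex does not guarantee, so treat it as a way to read the names rather than a reliable parser. `split_proxied_name` is a hypothetical helper.

```python
def split_proxied_name(remote_name):
    """Split a proxied remote name into (gateway path, node name).

    Assumes no gateway or node name contains a dash of its own.
    """
    parts = remote_name.split("-")
    return parts[:-1], parts[-1]


gateways, node = split_proxied_name("AMS-NYC-node3")
# gateways is the list of gateways traversed, in order; node is the final hop.
```

For "AMS-node1" the gateway path has a single entry, since node1 sits directly behind the AMS gateway.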
||||
## considerations for multi-gateway clusters
|
||||
|
||||
When a cluster has multiple gateways, nothing keeps the git repositories on
|
||||
the gateways in sync. A branch pushed to one gateway cannot be pulled
|
||||
from another one. And gateways only learn about the locations of
|
||||
keys that are uploaded to the cluster via them. So in the example above,
|
||||
after an upload to AMS-mycluster, NYC-mycluster will only know that the
|
||||
key is stored in its nodes, but won't know that it's stored in nodes
|
||||
behind AMS. So, it's best to have a single git repository that is synced
|
||||
with, or perhaps run [[git-annex-remotedaemon]] on each gateway to keep
|
||||
its git repository in sync with the other gateways.
|
||||
|
||||
Clusters can be constructed with any number of gateways, and any internal
|
||||
topology of connections between gateways. But there must always be a path
|
||||
from any gateway to all nodes of the cluster, otherwise a key cannot
|
||||
be stored to, or retrieved from, some nodes.
|
||||
|
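The reachability requirement can be checked on paper with a breadth-first search over the gateway topology. The sketch below is an illustration under assumptions: the topology is given as two plain dictionaries, and none of these function names exist in git-annex.

```python
from collections import deque


def reachable_nodes(start, gateway_links, gateway_nodes):
    """Return every node reachable from the gateway `start`."""
    seen = {start}
    queue = deque([start])
    nodes = set()
    while queue:
        gateway = queue.popleft()
        nodes |= set(gateway_nodes.get(gateway, ()))
        for neighbour in gateway_links.get(gateway, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return nodes


def gateways_missing_nodes(gateway_links, gateway_nodes):
    """Map each gateway to the set of nodes it cannot reach, if any."""
    all_nodes = set().union(*gateway_nodes.values())
    missing = {}
    for gateway in gateway_nodes:
        lacking = all_nodes - reachable_nodes(gateway, gateway_links, gateway_nodes)
        if lacking:
            missing[gateway] = lacking
    return missing


# The two-datacenter example: each gateway relays to the other.
nodes = {"AMS": ["node1", "node2"], "NYC": ["node3", "node4"]}
ok = gateways_missing_nodes({"AMS": ["NYC"], "NYC": ["AMS"]}, nodes)
# If only AMS had been extended to NYC, NYC could not reach the AMS nodes.
one_way = gateways_missing_nodes({"AMS": ["NYC"]}, nodes)
```

An empty result means every gateway has a path to every node. The one-way case corresponds to skipping the final step of extending the cluster on the other gateway.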
||||
It's best to avoid there being multiple paths to a node that go via
|
||||
different gateways, since all paths will be tried in parallel when eg,
|
||||
uploading a key to the cluster.
|
||||
|
||||
A breakdown in communication between gateways will temporarily split the
|
||||
cluster. When communication resumes, some keys may need to be copied to
|
||||
additional nodes.
|
|
@ -11,7 +11,7 @@ repositories.
|
|||
Joey has received funding to work on this.
|
||||
Planned schedule of work:
|
||||
|
||||
* June: git-annex proxy
|
||||
* June: git-annex proxies and clusters
|
||||
* July, part 1: git-annex proxy support for exporttree
|
||||
* July, part 2: p2p protocol over http
|
||||
* August: balanced preferred content
|
||||
|
@ -24,7 +24,49 @@ Planned schedule of work:
|
|||
|
||||
In development on the `proxy` branch.
|
||||
|
||||
For June's work on [[design/passthrough_proxy]], implementation plan:
|
||||
For June's work on [[design/passthrough_proxy]], remaining todos:
|
||||
|
||||
* Since proxying to special remotes is not supported yet, and won't be for
|
||||
the first release, make it fail in a reasonable way.
|
||||
|
||||
- or -
|
||||
|
||||
* Proxying for special remotes.
|
||||
Including encryption and chunking. See design for issues.
|
||||
|
||||
# items deferred until later for [[design/passthrough_proxy]]
|
||||
|
||||
* Indirect uploads when proxying for special remote
|
||||
(to be considered). See design.
|
||||
|
||||
* Getting a key from a cluster currently picks from among
|
||||
the lowest cost remotes at random. This could be smarter,
|
||||
eg prefer to avoid using remotes that are doing other transfers at the
|
||||
same time.
|
||||
|
||||
* The cost of a proxied node that is accessed via an intermediate gateway
|
||||
is currently the same as a node accessed via the cluster gateway.
|
||||
To fix this, there needs to be some way to tell how many hops through
|
||||
gateways it takes to get to a node. Currently the only way is to
|
||||
guess based on number of dashes in the node name, which is not satisfying.
|
||||
|
||||
Even counting hops is not very satisfying, since one cluster gateway could
|
||||
be much more expensive to traverse than another one.
|
||||
|
||||
If seriously tackling this, it might be worth making enough information
|
||||
available to use spanning tree protocol for routing inside clusters.
|
||||
|
||||
* Optimise proxy speed. See design for ideas.
|
||||
|
||||
* Use `sendfile()` to avoid data copying overhead when
|
||||
`receiveBytes` is being fed right into `sendBytes`.
|
||||
Library to use:
|
||||
<https://hackage.haskell.org/package/hsyscall-0.4/docs/System-Syscall.html>
|
||||
|
||||
* Support using a proxy when its url is a P2P address.
|
||||
(Eg tor-annex remotes.)
|
||||
|
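The deferred item above about getting a key from a cluster describes picking at random from among the lowest cost remotes. A minimal sketch of that selection, assuming costs are available as a simple mapping (the cost numbers and the `pick_remote` name are hypothetical, not git-annex's code):

```python
import random


def pick_remote(costs, rng=random):
    """Pick one remote at random from those sharing the lowest cost.

    costs maps remote name to a numeric cost; it must not be empty.
    """
    lowest = min(costs.values())
    candidates = sorted(name for name, cost in costs.items() if cost == lowest)
    return rng.choice(candidates)


# Hypothetical costs: node1 and node2 tie, node3 is more expensive.
costs = {"node1": 100, "node2": 100, "node3": 200}
chosen = pick_remote(costs)
```

A smarter version, as the item suggests, would also need to know which remotes are busy with other transfers. That information is not modelled here.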
||||
# completed items for June's work on [[design/passthrough_proxy]]:
|
||||
|
||||
* UUID discovery via git-annex branch. Add a log file listing UUIDs
|
||||
accessible via proxy UUIDs. It also will contain the names
|
||||
|
@ -40,7 +82,7 @@ For June's work on [[design/passthrough_proxy]], implementation plan:
|
|||
* Proxy should update location tracking information for proxied remotes,
|
||||
so it is available to other users who sync with it. (done)
|
||||
|
||||
* Implement `git-annex updatecluster` command (done)
|
||||
* Implement `git-annex initcluster` and `git-annex updatecluster` commands (done)
|
||||
|
||||
* Implement cluster UUID insertion on location log load, and removal
|
||||
on location log store. (done)
|
||||
|
@ -48,66 +90,39 @@ For June's work on [[design/passthrough_proxy]], implementation plan:
|
|||
* Omit cluster UUIDs when constructing drop proofs, since lockcontent will
|
||||
always fail on a cluster. (done)
|
||||
|
||||
* Don't count cluster UUID as a copy. (done)
|
||||
* Don't count cluster UUID as a copy in numcopies checking etc. (done)
|
||||
|
||||
* Tab complete proxied remotes and clusters in eg --from option. (done)
|
||||
|
||||
* Getting a key from a cluster should proxy from one of the nodes that has
|
||||
it. (done)
|
||||
|
||||
* Getting a key from a cluster currently always selects the lowest cost
|
||||
remote, and always the same remote if cost is the same. Should
|
||||
round-robin among remotes, and prefer to avoid using remotes that
|
||||
other git-annex processes are currently using.
|
||||
|
||||
* Implement upload with fanout and reporting back additional UUIDs over P2P
|
||||
protocol. (done, but need to check for fencepost errors on resume of
|
||||
incomplete upload with remotes at different points)
|
||||
|
||||
* On upload to cluster, send to nodes where it's preferred content, and not
|
||||
to other nodes.
|
||||
* Implement upload with fanout to multiple cluster nodes and reporting back
|
||||
additional UUIDs over P2P protocol. (done)
|
||||
|
||||
* Implement cluster drops, trying to remove from all nodes, and returning
|
||||
which UUIDs it was dropped from.
|
||||
which UUIDs it was dropped from. (done)
|
||||
|
||||
Problem: May lock content on cluster
|
||||
nodes to satisfy numcopies (rather than locking elsewhere) and so not be
|
||||
able to drop from nodes. Avoid using cluster nodes when constructing drop
|
||||
proof for cluster.
|
||||
* `git-annex testremote` works against proxied remote and cluster. (done)
|
||||
|
||||
Problem: When nodes are special remotes, may
|
||||
treat nodes as copies while dropping from cluster, and so violate
|
||||
numcopies. (But not mincopies.)
|
||||
* Avoid `git-annex sync --content` etc from operating on cluster nodes by
|
||||
default since syncing with a cluster implicitly syncs with its nodes. (done)
|
||||
|
||||
Problem: `move --from cluster` in "does this make it worse"
|
||||
check may fail to realize that dropping from multiple nodes does in fact
|
||||
make it worse.
|
||||
* On upload to cluster, send to nodes where it's preferred content, and not
|
||||
to other nodes. (done)
|
||||
|
||||
* On upload to a cluster, as well as fanout to nodes, if the key is
|
||||
preferred content of the proxy repository, store it there.
|
||||
(But not when preferred content is not configured.)
|
||||
And on download from a cluster, if the proxy repository has the content,
|
||||
get it from there to avoid the overhead of proxying to a node.
|
||||
* Support annex.jobs for clusters. (done)
|
||||
|
||||
* Basic proxying to special remote support (non-streaming).
|
||||
* Add `git-annex extendcluster` command and extend `git-annex updatecluster`
|
||||
to support clusters with multiple gateways. (done)
|
||||
|
||||
* Support proxies-of-proxies better, eg foo-bar-baz.
|
||||
Currently, it does work, but have to run `git-annex updateproxy`
|
||||
on foo in order for it to notice the bar-baz proxied remote exists,
|
||||
and record it as foo-bar-baz. Make it skip recording proxies of
|
||||
proxies like that, and instead automatically generate those from the log.
|
||||
(With cycle prevention there of course.)
|
||||
* Support proxying for a remote that is proxied by another gateway of
|
||||
a cluster. (done)
|
||||
|
||||
* Cycle prevention including cluster-in-cluster cycles. See design.
|
||||
* Support distributed clusters: Make a proxy for a cluster repeat
|
||||
protocol messages on to any remotes that have the same UUID as
|
||||
the cluster. Needs extension to P2P protocol to avoid cycles.
|
||||
(done)
|
||||
|
||||
* Optimise proxy speed. See design for ideas.
|
||||
|
||||
* Use `sendfile()` to avoid data copying overhead when
|
||||
`receiveBytes` is being fed right into `sendBytes`.
|
||||
|
||||
* Encryption and chunking. See design for issues.
|
||||
|
||||
* Indirect uploads (to be considered). See design.
|
||||
|
||||
* Support using a proxy when its url is a P2P address.
|
||||
(Eg tor-annex remotes.)
|
||||
* Proxied cluster nodes should have slightly higher cost than the cluster
|
||||
gateway. (done)
|
||||
|
|
|
@ -6,7 +6,7 @@ remotes.
|
|||
|
||||
So this todo remains open, but is now only concerned with
|
||||
streaming an object that is being received from one remote out to another
|
||||
remote without first needing to buffer the whole object on disk.
|
||||
repository without first needing to buffer the whole object on disk.
|
||||
|
||||
git-annex's remote interface does not currently support that.
|
||||
`retrieveKeyFile` stores the object into a file. And `storeKey`
|
||||
|
@ -27,3 +27,7 @@ Receiving to a file, and sending from the same file as it grows is one
|
|||
possibility, since that would handle buffering, and it might avoid needing
|
||||
to change interfaces as much. It would still need a new interface since the
|
||||
current one does not guarantee the file is written in-order.
|
||||
|
||||
A fifo is a possibility, but would certainly not work with remotes
|
||||
that don't write to the file in-order. Also resuming a download would not
|
||||
work with a fifo, since the sending remote wouldn't know where to resume from.
|
||||
|
|