Commit graph

44938 commits

Author SHA1 Message Date
Joey Hess
f916ce4b68
allow proxying to remotes that are nodes of clusters
fixes reversion in ca08f3fcc2
2024-06-18 17:02:23 -04:00
Joey Hess
f18740699e
P2P protocol version 2, adding SUCCESS-PLUS and ALREADY-HAVE-PLUS
Client side support for SUCCESS-PLUS and ALREADY-HAVE-PLUS
is complete, when a PUT stores to additional repositories
than the expected on, the location log is updated with the
additional UUIDs that contain the content.

Started implementing PUT fanout to multiple remotes for clusters.
It is untested, and I fear fencepost errors in the relative
offset calculations. And it is missing proxying for the protocol
after DATA.
2024-06-18 16:21:40 -04:00
Joey Hess
ca08f3fcc2
only proxy to a remote when remote.name.annex-proxy is set
Avoids someone writing to proxy.log and gaining access to remotes
of someone else's repository that they were not intended to be able
to proxy to.
2024-06-18 11:43:10 -04:00
Joey Hess
fb0fd78485
only use a remote as a node when git configuration is set
Avoids someone writing to cluster.log and nominating remotes
of someone else's repository as a cluster.
2024-06-18 11:37:38 -04:00
Joey Hess
f049156a03
checkpresent support for clusters
This assumes that the proxy for a cluster has up-to-date location
logs. If it didn't, it might proxy the checkpresent to a node that no
longer has the content, while some other node still does, and so
it would incorrectly appear that the cluster no longer contains the
content.

Since cluster UUIDs are not stored to location logs,
git-annex fsck --fast when claiming to fix a location log when
that occurred would not cause any problems. And presumably the location
tracking would later get sorted out.

At least usually, changes to the content of nodes goes via the proxy,
and it will update its location logs, so they will be accurate. However,
if there were multiple proxies to the same cluster, or nodes were
accessed directly (or via proxy to the node and not the cluster),
the proxy's location log could certainly be wrong.

(The location log access for GET has the same issues.)
2024-06-18 11:16:16 -04:00
Joey Hess
88d9a02f7c
initial, working support for getting from clusters
Currently tends to put all the load on a single node, which will need to
be improved.
2024-06-18 11:01:10 -04:00
Joey Hess
d34326ab76
factor out Annex.Proxy 2024-06-18 10:51:37 -04:00
Joey Hess
f0d6114286
refactor cluster code into own module 2024-06-18 10:36:04 -04:00
Joey Hess
8290f70978
update 2024-06-18 10:08:15 -04:00
Joey Hess
ef26470810
ProxySelector data type 2024-06-17 19:19:15 -04:00
Joey Hess
7a839a983a
preparing for cluster node selection
Support selecting what remote to proxy for each top-level P2P protocol
message.

This only needs to be extended now to support fanout to multiple
nodes for PUT and REMOVE, and with a remote that fails for
LOCKCONTENT and UNLOCKCONTENT.

But a good first step would be to implement CHECKPRESENT and GET for
clusters. Both should select a node that actually does have the content.
That will allow a cluster to work for GET even when location tracking is
out of date.
2024-06-17 15:51:10 -04:00
Joey Hess
291280ced2
started on git-annex-shell cluster support
Works down to P2P protocol.

The question now is, how to handle protocol version negotiation for
clusters? Connecting to each node to find their protocol versions and
using the lowest would be too expensive with a lot of nodes. So it seems
that the cluster needs to pick its own protocol version to use with the
client.

Then it can either negotiate that same version with the nodes when
it comes time to use them, or it can translate between multiple protocol
versions. That seems complicated. Thinking it would be ok to refuse to
use a node if it is not able to negotiate the same protocol version with
it as with the client. That will mean that sometimes need nodes to be
upgraded when upgrading the cluster's proxy. But protocol versions
rarely change.
2024-06-17 15:10:04 -04:00
Joey Hess
c7ad44e4d1
work toward supporting proxying to multiple remotes at once
For eg, upload fanout.

Delay connecting to a remote until it's needed. When there are many
proxied remotes, it would not do for the proxy to connect to each of
them on startup; that could take a long time.
2024-06-17 14:16:44 -04:00
Joey Hess
83a1db8d17
more specific type 2024-06-17 13:04:40 -04:00
Joey Hess
b72ccc6f0c
improve types 2024-06-17 12:44:08 -04:00
Joey Hess
e2fd2ee2bd
update 2024-06-17 09:31:44 -04:00
Joey Hess
3970bbb03b
Merge branch 'master' into proxy 2024-06-17 09:29:34 -04:00
Joey Hess
af79728ac3
tab complete special remotes
An oversight..

And with the work in progress proxy and cluster, there
can be additional remotes that are not listed in .git/config, but are
available. Making those more discoverable is another big benefit of
this.
2024-06-17 09:26:03 -04:00
Joey Hess
64afbb0b93
don't count clusters as copies, continued
Handled limitCopies, as well as everything using fromNumCopies and
fromMinCopies.

This should be everything, probably.

Note that, git-annex info displays a count of repositories, which still
includes cluster. I think that's ok. It would be possible to filter out
clusters there, but to the user they're pretty much just another
repository. The numcopies displayed by eg `git-annex info .` does not
include clusters.
2024-06-16 15:14:53 -04:00
Joey Hess
780367200b
remove dead nodes when loading the cluster log
This is to avoid inserting a cluster uuid into the location log when
only dead nodes in the cluster contain the content of a key.

One reason why this is necessary is Remote.keyLocations, which excludes
dead repositories from the list. But there are probably many more.

Implementing this was challenging, because Logs.Location importing
Logs.Cluster which imports Logs.Trust which imports Remote.List resulted
in an import cycle through several other modules.

Resorted to making Logs.Location not import Logs.Cluster, and instead
it assumes that Annex.clusters gets populated when necessary before it's
called.

That's done in Annex.Startup, which is run by the git-annex command
(but not other commands) at early startup in initialized repos. Or,
is run after initialization.

Note that is Remote.Git, it is unable to import Annex.Startup, because
Remote.Git importing Logs.Cluster leads the the same import cycle.
So ensureInitialized is not passed annexStartup in there.

Other commands, like git-annex-shell currently don't run annexStartup
either.

So there are cases where Logs.Location will not see clusters. So it won't add
any cluster UUIDs when loading the log. That's ok, the only reason to do
that is to make display of where objects are located include clusters,
and to make commands like git-annex get --from treat keys as being located
in a cluster. git-annex-shell certainly does not do anything like that,
and I'm pretty sure Remote.Git (and callers to Remote.Git.onLocalRepo)
don't either.
2024-06-16 14:39:44 -04:00
Joey Hess
36c6d8da69
don't count clusters as copies
Since the cluster UUID is inserted into the location log when the
location log lists a node as containing content.

Also avoid trying to lock content on cluster remotes. The cluster nodes
are also proxied, so that content can be locked on individual nodes, and
locking content on a cluster as a whole probably won't be implemented.

And made git-annex whereis use numcopies machinery for displaying its
count, so it won't count cluster UUIDs redundantly to nodes.
Other commands, like git-annex info that also display numcopies
information already used the numcopies machinery.

There is more to be done, fromNumCopies is sometimes used to get a
number that is compared with a list of UUIDs. And limitCopies doesn't
use numcopies machinery.
2024-06-16 14:17:56 -04:00
Joey Hess
b3370a191c
insert cluster UUIDs when loading location logs, and omit when saving
Inline isClusterUUID for speed.
2024-06-14 18:06:28 -04:00
Joey Hess
a4c9d4424c
remove Logs.Presence imports
When imported along with Logs.Location, it can be an unused import and
it won't warn, due to reexports. The point if this is really to show
that Logs.Presence is not widely used, outside Logs/
2024-06-14 17:27:34 -04:00
Joey Hess
570ceffe8d
broke out initcluster
One benefit of this is that a typo in annex-cluster-node config won't
init a new cluster.

Also it gets the cluster description set and is consistent with
initremote.
2024-06-14 17:23:11 -04:00
Joey Hess
bfe7f488d9
fogot to add 2024-06-14 16:37:17 -04:00
Joey Hess
846903e9bb
update todo list for this month
whew that's gonna be a lot
2024-06-14 15:23:43 -04:00
Joey Hess
2028ad02b8
add clusters to proxy log
Note that it's not defined what will happen if a cluster has the same
name as a remote that has proxying enabled.
2024-06-14 15:03:42 -04:00
Joey Hess
bbf261487d
add git-annex updatecluster command
Seems to work fine, making the right changes to the git-annex branch.
2024-06-14 15:02:01 -04:00
Joey Hess
2844230dfe
add git configs for clusters 2024-06-14 12:20:17 -04:00
Joey Hess
de1d795dfe
cache getClusters in Annex state 2024-06-14 11:16:01 -04:00
Joey Hess
da3c0115cb
make cluster UUIDs distinguishable from any other repository UUID
A cluster UUID is a version 8 UUID, with first octets 'a' and 'c'.
The rest of the content will be random.

This avoids a class of attack where the UUID of a repository is used as
the UUID of a cluster, which will prevent git-annex from updating
location logs for that repository. I don't know why someone would want
to do that, but let's prevent it.

Also, isClusterUUID make it easy to filter out cluster UUIDs when
writing the location logs.
2024-06-14 11:11:09 -04:00
Joey Hess
9895e6659d
update 2024-06-13 19:08:04 -04:00
Joey Hess
6d59118b29
unique uuid namespace for clusters 2024-06-13 17:56:53 -04:00
Joey Hess
aa56d433d5
implement cluster.log
Not used yet. (Or tested.)

I did consider making the log start with the uuid of the node, followed
by the cluster uuid (or uuids). That would perhaps mean a smaller write
to the git-annex branch when adding a node, but overall the log file
would be larger, and it will be read and cached near to startup on most
git-annex runs.
2024-06-13 16:00:58 -04:00
Joey Hess
d16e19b8ca
comment 2024-06-13 14:30:32 -04:00
Joey Hess
ebebc04273
comment 2024-06-13 13:40:04 -04:00
Joey Hess
6ea78ec867
partial reproducer 2024-06-13 13:03:38 -04:00
Joey Hess
01f5015f30
update 2024-06-13 11:44:39 -04:00
Joey Hess
5e0acd1842
more cluster thoughts 2024-06-13 10:48:31 -04:00
Joey Hess
90e3b8b44f
avoided the strangeness of the cluster's proxy location tracking being wrong 2024-06-13 10:34:19 -04:00
Joey Hess
ffd7c745ff
update 2024-06-13 06:49:36 -04:00
Joey Hess
d8daabe9ec
Merge branch 'master' of ssh://git-annex.branchable.com 2024-06-13 06:44:22 -04:00
Joey Hess
22a329c57e
copied over some changes from proxy branch 2024-06-13 06:43:59 -04:00
Joey Hess
3cc48279ad
more thoughts on clusters 2024-06-13 06:41:42 -04:00
Joey Hess
555d7e52d3
more thoughts on clusters 2024-06-12 17:30:55 -04:00
Joey Hess
0ebb107974
update 2024-06-12 15:21:23 -04:00
Joey Hess
46a1fcb3ea
avoid git syncing with instantiate proxied remotes
These remotes have no url configured, so git pull and push will fail.
git-annex sync --content etc can still sync with them otherwise.

Also, avoid git syncing twice with the same url. This is for cases where
a proxied remote has been manually configured and so does have a url.
Or perhaps proxied remotes will get configured like that automatically
later.
2024-06-12 15:10:03 -04:00
Joey Hess
a986a20034
designing clusters 2024-06-12 14:57:26 -04:00
Joey Hess
e70e3473b3
on cycles 2024-06-12 13:52:17 -04:00
Joey Hess
0ffb0a4d25
dash is legal in git remote names 2024-06-12 13:24:31 -04:00