Commit graph

45054 commits

Author SHA1 Message Date
nobodyinperson
724eb8a369 Suggest that 'git annex unused' reports total unused size 2024-06-21 16:30:09 +00:00
Joey Hess
53674e8abb
Merge branch 'master' into proxy 2024-06-20 11:20:26 -04:00
Joey Hess
53598e5154
merge from proxy branch 2024-06-20 11:20:16 -04:00
Joey Hess
d89ac8c6ee
Merge branch 'master' of ssh://git-annex.branchable.com 2024-06-20 11:03:30 -04:00
Joey Hess
9173095d11
add my distribits talk 2024-06-20 11:03:19 -04:00
Joey Hess
ff5fe4e759
clusters documentation 2024-06-20 10:57:43 -04:00
Joey Hess
032d3902d8
wording 2024-06-20 10:15:24 -04:00
Joey Hess
ecab2e03b9
working PUT fanout to multiple remotes for clusters
Still need to check for fencepost errors on resume when
different nodes have different amounts of data.
2024-06-20 10:04:26 -04:00
joris
b35be4b656 Added a comment 2024-06-20 09:58:05 +00:00
jochen.keil@38b1f86ab65128dab3e62e726403ceee4f5141bf
4da453e30c 2024-06-19 15:46:26 +00:00
Joey Hess
54307af8c0
more on proxying special remotes 2024-06-19 06:40:19 -04:00
Joey Hess
097ef9979c
towards a design for proxying to special remotes 2024-06-19 06:15:03 -04:00
Joey Hess
6eac3112e5
be quiet when reading cluster and proxy information at startup
I had a transfer of 3 files fail like this:

git-annex: transferrer protocol error: "(recording state in git...)"

The remote had stalldetection enabled, although I didn't see it stall.
So git-annex transferrer would have been started up. I guess that
one of these new git-annex branch reads, that happens early, caused
that message due to perhaps an uncommitted git-annex branch change.

Since the transferrer speaks a protocol over stdout, it needs to be
prevented from outputting other messages to stdout. Interestingly,
startupAnnex is run after prepRunCommand, so if a command requests quiet
output it would already be quiet. But the transferrer does not, instead
it calls Annex.setOutput SerializedOutput in its start action.
2024-06-18 21:31:32 -04:00
Joey Hess
f916ce4b68
allow proxying to remotes that are nodes of clusters
fixes reversion in ca08f3fcc2
2024-06-18 17:02:23 -04:00
Joey Hess
f18740699e
P2P protocol version 2, adding SUCCESS-PLUS and ALREADY-HAVE-PLUS
Client side support for SUCCESS-PLUS and ALREADY-HAVE-PLUS
is complete, when a PUT stores to additional repositories
than the expected on, the location log is updated with the
additional UUIDs that contain the content.

Started implementing PUT fanout to multiple remotes for clusters.
It is untested, and I fear fencepost errors in the relative
offset calculations. And it is missing proxying for the protocol
after DATA.
2024-06-18 16:21:40 -04:00
Joey Hess
ca08f3fcc2
only proxy to a remote when remote.name.annex-proxy is set
Avoids someone writing to proxy.log and gaining access to remotes
of someone else's repository that they were not intended to be able
to proxy to.
2024-06-18 11:43:10 -04:00
Joey Hess
fb0fd78485
only use a remote as a node when git configuration is set
Avoids someone writing to cluster.log and nominating remotes
of someone else's repository as a cluster.
2024-06-18 11:37:38 -04:00
Joey Hess
f049156a03
checkpresent support for clusters
This assumes that the proxy for a cluster has up-to-date location
logs. If it didn't, it might proxy the checkpresent to a node that no
longer has the content, while some other node still does, and so
it would incorrectly appear that the cluster no longer contains the
content.

Since cluster UUIDs are not stored to location logs,
git-annex fsck --fast when claiming to fix a location log when
that occurred would not cause any problems. And presumably the location
tracking would later get sorted out.

At least usually, changes to the content of nodes goes via the proxy,
and it will update its location logs, so they will be accurate. However,
if there were multiple proxies to the same cluster, or nodes were
accessed directly (or via proxy to the node and not the cluster),
the proxy's location log could certainly be wrong.

(The location log access for GET has the same issues.)
2024-06-18 11:16:16 -04:00
Joey Hess
88d9a02f7c
initial, working support for getting from clusters
Currently tends to put all the load on a single node, which will need to
be improved.
2024-06-18 11:01:10 -04:00
Joey Hess
d34326ab76
factor out Annex.Proxy 2024-06-18 10:51:37 -04:00
Joey Hess
f0d6114286
refactor cluster code into own module 2024-06-18 10:36:04 -04:00
Joey Hess
8290f70978
update 2024-06-18 10:08:15 -04:00
yarikoptic
28029d6668 original report / question 2024-06-18 13:57:23 +00:00
Joey Hess
ef26470810
ProxySelector data type 2024-06-17 19:19:15 -04:00
Joey Hess
7a839a983a
preparing for cluster node selection
Support selecting what remote to proxy for each top-level P2P protocol
message.

This only needs to be extended now to support fanout to multiple
nodes for PUT and REMOVE, and with a remote that fails for
LOCKCONTENT and UNLOCKCONTENT.

But a good first step would be to implement CHECKPRESENT and GET for
clusters. Both should select a node that actually does have the content.
That will allow a cluster to work for GET even when location tracking is
out of date.
2024-06-17 15:51:10 -04:00
Joey Hess
291280ced2
started on git-annex-shell cluster support
Works down to P2P protocol.

The question now is, how to handle protocol version negotiation for
clusters? Connecting to each node to find their protocol versions and
using the lowest would be too expensive with a lot of nodes. So it seems
that the cluster needs to pick its own protocol version to use with the
client.

Then it can either negotiate that same version with the nodes when
it comes time to use them, or it can translate between multiple protocol
versions. That seems complicated. Thinking it would be ok to refuse to
use a node if it is not able to negotiate the same protocol version with
it as with the client. That will mean that sometimes need nodes to be
upgraded when upgrading the cluster's proxy. But protocol versions
rarely change.
2024-06-17 15:10:04 -04:00
Joey Hess
c7ad44e4d1
work toward supporting proxying to multiple remotes at once
For eg, upload fanout.

Delay connecting to a remote until it's needed. When there are many
proxied remotes, it would not do for the proxy to connect to each of
them on startup; that could take a long time.
2024-06-17 14:16:44 -04:00
Joey Hess
83a1db8d17
more specific type 2024-06-17 13:04:40 -04:00
Joey Hess
b72ccc6f0c
improve types 2024-06-17 12:44:08 -04:00
Joey Hess
e2fd2ee2bd
update 2024-06-17 09:31:44 -04:00
Joey Hess
3970bbb03b
Merge branch 'master' into proxy 2024-06-17 09:29:34 -04:00
Joey Hess
af79728ac3
tab complete special remotes
An oversight..

And with the work in progress proxy and cluster, there
can be additional remotes that are not listed in .git/config, but are
available. Making those more discoverable is another big benefit of
this.
2024-06-17 09:26:03 -04:00
Joey Hess
64afbb0b93
don't count clusters as copies, continued
Handled limitCopies, as well as everything using fromNumCopies and
fromMinCopies.

This should be everything, probably.

Note that, git-annex info displays a count of repositories, which still
includes cluster. I think that's ok. It would be possible to filter out
clusters there, but to the user they're pretty much just another
repository. The numcopies displayed by eg `git-annex info .` does not
include clusters.
2024-06-16 15:14:53 -04:00
Joey Hess
780367200b
remove dead nodes when loading the cluster log
This is to avoid inserting a cluster uuid into the location log when
only dead nodes in the cluster contain the content of a key.

One reason why this is necessary is Remote.keyLocations, which excludes
dead repositories from the list. But there are probably many more.

Implementing this was challenging, because Logs.Location importing
Logs.Cluster which imports Logs.Trust which imports Remote.List resulted
in an import cycle through several other modules.

Resorted to making Logs.Location not import Logs.Cluster, and instead
it assumes that Annex.clusters gets populated when necessary before it's
called.

That's done in Annex.Startup, which is run by the git-annex command
(but not other commands) at early startup in initialized repos. Or,
is run after initialization.

Note that is Remote.Git, it is unable to import Annex.Startup, because
Remote.Git importing Logs.Cluster leads the the same import cycle.
So ensureInitialized is not passed annexStartup in there.

Other commands, like git-annex-shell currently don't run annexStartup
either.

So there are cases where Logs.Location will not see clusters. So it won't add
any cluster UUIDs when loading the log. That's ok, the only reason to do
that is to make display of where objects are located include clusters,
and to make commands like git-annex get --from treat keys as being located
in a cluster. git-annex-shell certainly does not do anything like that,
and I'm pretty sure Remote.Git (and callers to Remote.Git.onLocalRepo)
don't either.
2024-06-16 14:39:44 -04:00
Joey Hess
36c6d8da69
don't count clusters as copies
Since the cluster UUID is inserted into the location log when the
location log lists a node as containing content.

Also avoid trying to lock content on cluster remotes. The cluster nodes
are also proxied, so that content can be locked on individual nodes, and
locking content on a cluster as a whole probably won't be implemented.

And made git-annex whereis use numcopies machinery for displaying its
count, so it won't count cluster UUIDs redundantly to nodes.
Other commands, like git-annex info that also display numcopies
information already used the numcopies machinery.

There is more to be done, fromNumCopies is sometimes used to get a
number that is compared with a list of UUIDs. And limitCopies doesn't
use numcopies machinery.
2024-06-16 14:17:56 -04:00
beryllium@5bc3c32eb8156390f96e363e4ba38976567425ec
f707baf908 Added a comment 2024-06-15 07:37:07 +00:00
beryllium@5bc3c32eb8156390f96e363e4ba38976567425ec
0062ac1b49 Added a comment: Grafting? a special remote for tuned migration 2024-06-15 00:57:27 +00:00
Joey Hess
b3370a191c
insert cluster UUIDs when loading location logs, and omit when saving
Inline isClusterUUID for speed.
2024-06-14 18:06:28 -04:00
Joey Hess
a4c9d4424c
remove Logs.Presence imports
When imported along with Logs.Location, it can be an unused import and
it won't warn, due to reexports. The point if this is really to show
that Logs.Presence is not widely used, outside Logs/
2024-06-14 17:27:34 -04:00
Joey Hess
570ceffe8d
broke out initcluster
One benefit of this is that a typo in annex-cluster-node config won't
init a new cluster.

Also it gets the cluster description set and is consistent with
initremote.
2024-06-14 17:23:11 -04:00
Joey Hess
bfe7f488d9
fogot to add 2024-06-14 16:37:17 -04:00
Joey Hess
846903e9bb
update todo list for this month
whew that's gonna be a lot
2024-06-14 15:23:43 -04:00
Joey Hess
2028ad02b8
add clusters to proxy log
Note that it's not defined what will happen if a cluster has the same
name as a remote that has proxying enabled.
2024-06-14 15:03:42 -04:00
Joey Hess
bbf261487d
add git-annex updatecluster command
Seems to work fine, making the right changes to the git-annex branch.
2024-06-14 15:02:01 -04:00
Joey Hess
2844230dfe
add git configs for clusters 2024-06-14 12:20:17 -04:00
Joey Hess
de1d795dfe
cache getClusters in Annex state 2024-06-14 11:16:01 -04:00
Joey Hess
da3c0115cb
make cluster UUIDs distinguishable from any other repository UUID
A cluster UUID is a version 8 UUID, with first octets 'a' and 'c'.
The rest of the content will be random.

This avoids a class of attack where the UUID of a repository is used as
the UUID of a cluster, which will prevent git-annex from updating
location logs for that repository. I don't know why someone would want
to do that, but let's prevent it.

Also, isClusterUUID make it easy to filter out cluster UUIDs when
writing the location logs.
2024-06-14 11:11:09 -04:00
Joey Hess
9895e6659d
update 2024-06-13 19:08:04 -04:00
Joey Hess
6d59118b29
unique uuid namespace for clusters 2024-06-13 17:56:53 -04:00
Joey Hess
aa56d433d5
implement cluster.log
Not used yet. (Or tested.)

I did consider making the log start with the uuid of the node, followed
by the cluster uuid (or uuids). That would perhaps mean a smaller write
to the git-annex branch when adding a node, but overall the log file
would be larger, and it will be read and cached near to startup on most
git-annex runs.
2024-06-13 16:00:58 -04:00