The VIA extension is still needed to avoid some extra work and ugly
messages, but this is enough that it actually works.
This filters out the RemoteSides that are a proxied connection via a
remote gateway to the cluster.
The VIA extension will not filter those out, but will send VIA to them
on connect, which will cause the ones that are accessed via the listed
gateways to be filtered out.
Walking a tightrope between security and convenience here, because
git-annex-shell needs to only proxy for things when there has been
an explicit, local action to configure them.
In this case, the user has to have run `git-annex extendcluster`,
which now sets annex-cluster-gateway on the remote.
Note that any repositories that the gateway is recorded to
proxy for will be proxied onward. This is not limited to cluster nodes,
because checking the node log would not add any security; someone could
add any uuid to it. The gateway of course then does its own
checking to determine if it will allow proxying for the remote.
When there are multiple gateways to a cluster, this sets up proxying
for nodes that are accessed via a remote gateway.
Eg, when running in nyc and amsterdam is the remote gateway,
and it has node1 and node2, this sets up proxying for
amsterdam-node1 and amsterdam-node2. A client that has nyc as a remote
will see proxied remotes nyc-amsterdam-node1 and nyc-amsterdam-node2.
Just look at the existing proxied remotes that correspond to already
existing nodes of the cluster, and keep those nodes in the cluster.
While adding any remotes of the local repo that are configured as
cluster nodes. This allows removing cluster nodes from the local repo
and updating, without it also removing nodes provided by other gateways.
Rejected the idea of automatically instantiating remotes for proxies-of-proxies.
That needs cycle protection, while the current behavior, which happened
for free, is that running git-annex updateproxy on the proxy can be used
to configure it, but only for topologies that actually exist.
The problem with that idea is that the cluster's proxy is necessarily a
remote, and necessarily one that we'll want to sync with, since the git
repository is stored there. So when its preferred content wants a file,
and the cluster does too, the file will get uploaded to it as well as to
the cluster. With fanout, the upload to the cluster will populate the
proxy as well, avoiding a second upload. But only if the file is sent to
the cluster first. If it's sent to the proxy first, there will be two
uploads.
Another, lesser problem is that a repository can proxy for more than one
cluster. So when does it make sense to drop content from the repository?
It could be done when dropping from one cluster, but what of the other
one?
This complication was not necessary anyway. Instead, if it's desirable
to have some content accessed from close to the proxy, one of the
cluster nodes can just be put on the same filesystem as it. That will be
just as fast as storing the content on the proxy.
Except when no nodes want a file, it has to be stored somewhere, so
store it on all. Which is not really desirable, but neither is having to
pick one.
ProtoAssociatedFile deserialization is rather broken, and this could
possibly affect preferred content expressions that match on filenames.
The inability to roundtrip whitespace like tabs and newlines through is
not a problem because preferred content expressions can't be written
that match on whitespace such as a tab. For example:
joey@darkstar:~/tmp/bench/z>git-annex wanted origin-node2 'exclude=*CTRL-VTab*'
wanted origin-node2
git-annex: Parse error: Parse failure: near "*"
But, the filtering of control characters could perhaps be a problem. I think
that filtering is now obsolete, git-annex has comprehensive filtering of
control characters when displaying filenames, that happens at a higher level.
However, I don't want to risk a security hole so am leaving in that filtering
in ProtoAssociatedFile deserialization for now.
Avoid `git-annex sync --content` etc from operating on cluster nodes by default
since syncing with a cluster implicitly syncs with its nodes. This avoids a
lot of unncessary work when a cluster has a lot of nodes just in checking
if each node's preferred content is satisfied. And it avoids content
being sent to nodes individually, so instead syncing with clusters always
fanout uploads to nodes.
The downside is that there are situations where a cluster's preferred content
settings can be met, but those of its nodes are not. Or where a node does not
contain a key, but the cluster does, and there are not enough copies of the key
yet, so it would be desirable the send it there. I think that's an acceptable
tradeoff. These kind of situations are ones where the cluster itself should
probably be responsible for copying content to the node. Which it can do much
less expensively than a client can. Part of the balanced preferred content
design that I will be working on in a couple of months involves rebalancing
clusters, so I expect to revisit this.
The use of annex-sync config does allow running git-annex sync with a specific
node, or nodes, and it will sync with it. And it's also possible to set
annex-sync git configs to make it sync with a node by default. (Although that
will require setting up an explicit git remote for the node rather than relying
on the proxied remote.)
Logs.Cluster.Basic is needed because Remote.Git cannot import Logs.Cluster
due to a cycle. And the Annex.Startup load of clusters happens
too late for Remote.Git to use that. This does mean one redundant load
of the cluster log, though only when there is a proxy.
This makes git-annex sync and similar not treat proxied remotes as git
syncable remotes.
Also, display in git-annex info remote when the remote is proxied.
Loading the remote list a second time was removing all proxied remotes.
That happened because setting up the proxied remote added some config
fields to the in-memory git config, and on the second load, it saw those
configs and decided not to overwrite them with the proxy.
Now on the second load, that still happens. But now, the proxied
git configs are used to generate a remote same as if those configs were
all set. The reason that didn't happen before was twofold,
the gitremotes cache was not dropped, and the remote's url field was not
set correctly.
The problem with the remote's url field is that while it was marked as
proxy inherited, all other proxy inherited fields are annex- configs.
And the code to inherit didn't work for the url field.
Now it all works, but git-annex sync is left running git push/pull on
the proxied remote, which doesn't work. That still needs to be fixed.
Tested it with small chunk sizes (like 2) and resumes that were
eg 1 byte from the end of the file or beginning of file.
Also, git-annex testremote passes now against a cluster!
When the destination does not start with a copy, the cluster has one or
more copies. If more, dropping would reduce the number of copies, so
numcopies must be checked.
Considered checking how many nodes of the cluster contain a copy. If
only 1 node does, it could allow a move without checking numcopies.
The problem with that, though, is that other nodes of the cluster could
have copies that we don't know about. And dropping from a cluster tries
to drop from all nodes, so will drop even from those. So any drop from a
cluster can remove more than 1 copy.
Dropping from a cluster drops from every node of the cluster.
Including nodes that the cluster does not think have the content.
This is different from GET and CHECKPRESENT, which do trust the
cluster's location log. The difference is that removing from a cluster
should make 100% the content is gone from every node. So doing extra
work is ok. Compare with CHECKPRESENT where checking every node could
make it very expensive, and the worst that can happen in a false
negative is extra work being done.
Extended the P2P protocol with FAILURE-PLUS to handle the case where a
drop from one node succeeds, but a drop from another node fails. In that
case the entire cluster drop has failed.
Note that SUCCESS-PLUS is returned when dropping from a proxied remote
that is not a cluster, when the protocol version supports it. This is
because P2P.Proxy does not know when it's proxying for a single node
cluster vs for a remote that is not a cluster.
This is obviously necessary in order for dropping from a cluster to be able to
drop from all nodes.
It also avoids violating numcopies when a cluster node is a special remote.
If it were used in the drop proof, nothing would prevent the cluster from
dropping from it.
Client side support for SUCCESS-PLUS and ALREADY-HAVE-PLUS
is complete, when a PUT stores to additional repositories
than the expected on, the location log is updated with the
additional UUIDs that contain the content.
Started implementing PUT fanout to multiple remotes for clusters.
It is untested, and I fear fencepost errors in the relative
offset calculations. And it is missing proxying for the protocol
after DATA.
This assumes that the proxy for a cluster has up-to-date location
logs. If it didn't, it might proxy the checkpresent to a node that no
longer has the content, while some other node still does, and so
it would incorrectly appear that the cluster no longer contains the
content.
Since cluster UUIDs are not stored to location logs,
git-annex fsck --fast when claiming to fix a location log when
that occurred would not cause any problems. And presumably the location
tracking would later get sorted out.
At least usually, changes to the content of nodes goes via the proxy,
and it will update its location logs, so they will be accurate. However,
if there were multiple proxies to the same cluster, or nodes were
accessed directly (or via proxy to the node and not the cluster),
the proxy's location log could certainly be wrong.
(The location log access for GET has the same issues.)
Handled limitCopies, as well as everything using fromNumCopies and
fromMinCopies.
This should be everything, probably.
Note that, git-annex info displays a count of repositories, which still
includes cluster. I think that's ok. It would be possible to filter out
clusters there, but to the user they're pretty much just another
repository. The numcopies displayed by eg `git-annex info .` does not
include clusters.
This is to avoid inserting a cluster uuid into the location log when
only dead nodes in the cluster contain the content of a key.
One reason why this is necessary is Remote.keyLocations, which excludes
dead repositories from the list. But there are probably many more.
Implementing this was challenging, because Logs.Location importing
Logs.Cluster which imports Logs.Trust which imports Remote.List resulted
in an import cycle through several other modules.
Resorted to making Logs.Location not import Logs.Cluster, and instead
it assumes that Annex.clusters gets populated when necessary before it's
called.
That's done in Annex.Startup, which is run by the git-annex command
(but not other commands) at early startup in initialized repos. Or,
is run after initialization.
Note that is Remote.Git, it is unable to import Annex.Startup, because
Remote.Git importing Logs.Cluster leads the the same import cycle.
So ensureInitialized is not passed annexStartup in there.
Other commands, like git-annex-shell currently don't run annexStartup
either.
So there are cases where Logs.Location will not see clusters. So it won't add
any cluster UUIDs when loading the log. That's ok, the only reason to do
that is to make display of where objects are located include clusters,
and to make commands like git-annex get --from treat keys as being located
in a cluster. git-annex-shell certainly does not do anything like that,
and I'm pretty sure Remote.Git (and callers to Remote.Git.onLocalRepo)
don't either.
One benefit of this is that a typo in annex-cluster-node config won't
init a new cluster.
Also it gets the cluster description set and is consistent with
initremote.
Not used yet. (Or tested.)
I did consider making the log start with the uuid of the node, followed
by the cluster uuid (or uuids). That would perhaps mean a smaller write
to the git-annex branch when adding a node, but overall the log file
would be larger, and it will be read and cached near to startup on most
git-annex runs.
These remotes have no url configured, so git pull and push will fail.
git-annex sync --content etc can still sync with them otherwise.
Also, avoid git syncing twice with the same url. This is for cases where
a proxied remote has been manually configured and so does have a url.
Or perhaps proxied remotes will get configured like that automatically
later.
This does mean a redundant write to the git-annex branch. But,
it means that two clients can be using the same proxy, and after
one sends a file to a proxied remote, the other only has to pull from
the proxy to learn about that. It does not need to pull from every
remote behind the proxy (which it couldn't do anyway as git repo
access is not currently proxied).
Anyway, the overhead of this in git-annex branch writes is no worse
than eg, sending a file to a repository where git-annex assistant
is running, which then sends the file on to a remote, and updates
the git-annex branch then. Indeed, when the assistant also drops
the local copy, that results in more writes to the git-annex branch.
CONNECT is not supported by git-annex-shell p2pstdio, but for proxying
to tor-annex remotes, it will be supported, and will make a git pull/push
to a proxied remote work the same with that as it does over ssh,
eg it accesses the proxy's git repo not the proxied remote's git repo.
The p2p protocol docs say that NOTIFYCHANGES is not always supported,
and it looked annoying to implement it for this, and it also seems
pretty useless, so make it be a protocol error. git-annex remotedaemon
will already be getting change notifications from the proxy's git repo,
so there's no need to get additional redundant change notifications for
proxied remotes that would be for changes to the same git repo.
Prevent listProxied from listing anything when the proxy remote's
url is a local directory. Proxying does not work in that situation,
because the proxied remotes have the same url, and so git-annex-shell
is not run when accessing them, instead the proxy remote is accessed
directly.
I don't think there is any good way to support this. Even if the instantiated
git repos for the proxied remotes somehow used an url that caused it to use
git-annex-shell to access them, planned features like `git-annex copy --to
proxy` accepting a key and sending it on to nodes behind the proxy would not
work, since git-annex-shell is not used to access the proxy.
So it would need to use something to access the proxy that causes
git-annex-shell to be run and speaks P2P protocol over it. And we have that.
It's a ssh connection to localhost. Of course, it would be possible to
take ssh out of that mix, and swap in something that does not have
encryption overhead and authentication complications, but otherwise
behaves the same as ssh. And if the user wants to do that, GIT_SSH
does exist.
This just happened to work correctly. Rather surprisingly. It turns out
that openP2PSshConnection actually also supports local git remotes,
by just running git-annex-shell with the path to the remote.
Renamed "P2PSsh" to "P2PShell" to make this clear.
The almost identical code duplication between relayDATA and relayDATA'
is very annoying. I tried quite a few things to parameterize them, but
the type checker is having fits when I try it.
Memory use is small and constant; receiveBytes returns a lazy bytestring
and it does stream.
Comparing speed of a get of a 500 mb file over proxy from origin-origin,
vs from the same remote over a direct ssh:
joey@darkstar:~/tmp/bench/client>/usr/bin/time git-annex get bigfile --from origin-origin
get bigfile (from origin-origin...)
ok
(recording state in git...)
1.89user 0.67system 0:10.79elapsed 23%CPU (0avgtext+0avgdata 68716maxresident)k
0inputs+984320outputs (0major+10779minor)pagefaults 0swaps
joey@darkstar:~/tmp/bench/client>/usr/bin/time git-annex get bigfile --from direct-ssh
get bigfile (from direct-ssh...)
ok
1.79user 0.63system 0:10.49elapsed 23%CPU (0avgtext+0avgdata 65776maxresident)k
0inputs+1024312outputs (0major+9773minor)pagefaults 0swaps
So the proxy doesn't add much overhead even when run on the same machine as
the client and remote.
Still, piping receiveBytes into sendBytes like this does suggest that the proxy
could be made to use less CPU resouces by using `sendfile()`.
getRepoUUID looks at that, and was seeing the annex.uuid of the proxy.
Which caused it to unncessarily set the git config. Probably also would
have led to other problems.
For NotifyChanges and also for the fallthrough case where
git-annex-shell passes a command off to git-shell, proxying is currently
ignored. So every remote that is accessed via a proxy will be treated as
the same git repository.
Every other command listed in cmdsMap will need to check if
Annex.proxyremote is set, and if so handle the proxying appropriately.
Probably only P2PStdio will need to support proxying. For now,
everything else refuses to work when proxying.
The part of that I don't like is that there's the possibility a command
later gets added to the list that doesn't check proxying.
When proxying is not enabled, it's important that git-annex-shell not
leak information that it would not have exposed before. Such as the
names or uuids of remotes.
I decided that, in the case where a repository used to have proxying
enabled, but no longer supports any proxies, it's ok to give the user a
clear error message indicating that proxying is not configured, rather
than a confusing uuid mismatch message.
Similarly, if a repository has proxying enabled, but not for the
requested repository, give a clear error message.
A tricky thing here is how to handle the case where there is more than
one remote, with proxying enabled, with the specified uuid. One way to
handle that would be to plumb the proxyRemoteName all the way through
from the remote git-annex to git-annex-shell, eg as a field, and use
only a remote with the same name. That would be very intrusive though.
Instead, I decided to let the proxy pick which remote it uses to access
a given Remote. And so it picks the least expensive one.
The client after all doesn't necessarily know any details about the
proxy's configuration. This does mean though, that if the least
expensive remote is not accessible, but another remote would have
worked, an access via the proxy will fail.
They could be missing due to an interrupted git-annex at just the wrong
time during a prior graft, after which the tree objects got garbage
collected.
Or they could be missing because of manual messing with the git-annex
branch, eg resetting it to back before the graft commit.
Sponsored-by: Dartmouth College's OpenNeuro project
Fix a bug where interrupting git-annex while it is updating the git-annex
branch could lead to git fsck complaining about missing tree objects.
Interrupting git-annex while regraftexports is running in a transition
that is forgetting git-annex branch history would leave the
repository with a git-annex branch that did not contain the tree shas
listed in export.log. That lets those trees be garbage collected.
A subsequent run of the same transition then regrafts the trees listed
in export.log into the git-annex branch. But those trees have been lost.
Note that both sides of `if neednewlocalbranch` are atomic now. I had
thought only the True side needed to be, but I do think there may be
cases where the False side needs to be as well.
Sponsored-by: Dartmouth College's OpenNeuro project