A cluster UUID is a version 8 UUID, with first octets 'a' and 'c'.
The rest of the content will be random.
This avoids a class of attack where the UUID of a repository is used as
the UUID of a cluster, which will prevent git-annex from updating
location logs for that repository. I don't know why someone would want
to do that, but let's prevent it.
Also, isClusterUUID makes it easy to filter out cluster UUIDs when
writing the location logs.
Not used yet. (Or tested.)
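To make the format concrete, here is a minimal standalone sketch working on
UUID strings with the uuid package. It is not git-annex's actual
implementation (which operates on its own UUID type); the function names are
just illustrative.

    import Data.UUID (toString)
    import Data.UUID.V4 (nextRandom)

    -- Generate a random UUID, then force the leading 'a' and 'c' and set
    -- the version digit to '8'. In the standard 8-4-4-4-12 string layout,
    -- index 14 is the version digit.
    genClusterUUIDSketch :: IO String
    genClusterUUIDSketch = do
        u <- toString <$> nextRandom
        return ("ac" ++ take 12 (drop 2 u) ++ "8" ++ drop 15 u)

    -- Detecting a cluster UUID only needs to look at those fixed positions.
    isClusterUUIDSketch :: String -> Bool
    isClusterUUIDSketch u = take 2 u == "ac" && take 1 (drop 14 u) == "8"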
I did consider making the log start with the uuid of the node, followed
by the cluster uuid (or uuids). That would perhaps mean a smaller write
to the git-annex branch when adding a node, but overall the log file
would be larger, and it would be read and cached near startup on most
git-annex runs.
These remotes have no url configured, so git pull and push will fail,
but git-annex sync --content etc can still otherwise sync with them.
Also, avoid git syncing twice with the same url. This is for cases where
a proxied remote has been manually configured and so does have a url.
Or perhaps proxied remotes will get configured like that automatically
later.
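As a rough illustration of the url dedup (hypothetical types, not the actual
sync code), remotes that share a url are only git-synced once, while remotes
with no url are never treated as duplicates:

    import Data.List (nubBy)

    data SyncRemote = SyncRemote { syncName :: String, syncUrl :: Maybe String }

    -- Keep only the first remote for each url; remotes without a url are
    -- all kept, since they cannot be git-synced by url at all.
    dedupByUrl :: [SyncRemote] -> [SyncRemote]
    dedupByUrl = nubBy sameUrl
      where
        sameUrl a b = case (syncUrl a, syncUrl b) of
            (Just ua, Just ub) -> ua == ub
            _ -> False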
Since a proxied remote uses the proxy's git repo, this makes sense.
Although I don't think this config is ever used when accessing a remote
via git-annex-shell.
This does mean a redundant write to the git-annex branch. But,
it means that two clients can be using the same proxy, and after
one sends a file to a proxied remote, the other only has to pull from
the proxy to learn about that. It does not need to pull from every
remote behind the proxy (which it couldn't do anyway as git repo
access is not currently proxied).
Anyway, the overhead of this in git-annex branch writes is no worse
than eg, sending a file to a repository where git-annex assistant
is running, which then sends the file on to a remote, and updates
the git-annex branch then. Indeed, when the assistant also drops
the local copy, that results in more writes to the git-annex branch.
CONNECT is not supported by git-annex-shell p2pstdio, but it will be
supported when proxying to tor-annex remotes, and will make a git pull/push
to a proxied remote work the same over tor as it does over ssh: it accesses
the proxy's git repo, not the proxied remote's git repo.
The p2p protocol docs say that NOTIFYCHANGES is not always supported,
it looked annoying to implement for this, and it seems pretty useless
anyway, so make it a protocol error. git-annex remotedaemon
will already be getting change notifications from the proxy's git repo,
so there's no need to get additional redundant change notifications for
proxied remotes that would be for changes to the same git repo.
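Sketching the shape of that handling (purely illustrative message and action
types, not the real P2P protocol types):

    data ClientMessage = RequestChangeNotifications | OtherMessage String

    data ProxyAction = ProtocolError String | RelayToRemote ClientMessage

    -- The proxy refuses a change-notification request outright rather
    -- than trying to relay it to the proxied remote.
    handleClientMessage :: ClientMessage -> ProxyAction
    handleClientMessage RequestChangeNotifications =
        ProtocolError "change notification is not supported by this proxy"
    handleClientMessage other = RelayToRemote other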
Prevent listProxied from listing anything when the proxy remote's
url is a local directory. Proxying does not work in that situation,
because the proxied remotes have the same url, and so git-annex-shell
is not run when accessing them, instead the proxy remote is accessed
directly.
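A minimal sketch of the kind of guard involved, assuming the url is available
as a plain String (the real check would look at the parsed git repo):

    import Data.List (isPrefixOf)

    -- A url pointing at a local directory means git-annex-shell is not
    -- run to access the remote, so proxying cannot work through it.
    isLocalDirectoryUrl :: String -> Bool
    isLocalDirectoryUrl u =
        "file://" `isPrefixOf` u
            || "/" `isPrefixOf` u
            || "./" `isPrefixOf` u
            || "../" `isPrefixOf` u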
I don't think there is any good way to support this. Even if the instantiated
git repos for the proxied remotes somehow used an url that caused
git-annex-shell to be used to access them, planned features like `git-annex
copy --to proxy` accepting a key and sending it on to nodes behind the proxy
would not work, since git-annex-shell is not used to access the proxy.
So it would need to use something to access the proxy that causes
git-annex-shell to be run and speaks P2P protocol over it. And we have that.
It's an ssh connection to localhost. Of course, it would be possible to
take ssh out of that mix, and swap in something that does not have
encryption overhead and authentication complications, but otherwise
behaves the same as ssh. And if the user wants to do that, GIT_SSH
does exist.
This just happened to work correctly. Rather surprisingly. It turns out
that openP2PSshConnection actually also supports local git remotes,
by just running git-annex-shell with the path to the remote.
Renamed "P2PSsh" to "P2PShell" to make this clear.
The almost identical code duplication between relayDATA and relayDATA'
is very annoying. I tried quite a few things to parameterize them, but
the type checker is having fits when I try it.
Memory use is small and constant; receiveBytes returns a lazy bytestring
and it does stream.
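For illustration, this standalone sketch (not the actual relayDATA code,
which has to read an exact number of bytes since the protocol continues on
the same connection) shows why piping a lazy ByteString from one handle into
another runs in constant memory:

    import qualified Data.ByteString.Lazy as L
    import System.IO (Handle)

    -- Chunks are read from fromh on demand as they are written to toh
    -- and then dropped, so only one chunk is held in memory at a time.
    relayAll :: Handle -> Handle -> IO ()
    relayAll fromh toh = L.hGetContents fromh >>= L.hPut toh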
Comparing the speed of a get of a 500 MB file over the proxy from
origin-origin, vs from the same remote over a direct ssh connection:
    joey@darkstar:~/tmp/bench/client>/usr/bin/time git-annex get bigfile --from origin-origin
    get bigfile (from origin-origin...)
    ok
    (recording state in git...)
    1.89user 0.67system 0:10.79elapsed 23%CPU (0avgtext+0avgdata 68716maxresident)k
    0inputs+984320outputs (0major+10779minor)pagefaults 0swaps

    joey@darkstar:~/tmp/bench/client>/usr/bin/time git-annex get bigfile --from direct-ssh
    get bigfile (from direct-ssh...)
    ok
    1.79user 0.63system 0:10.49elapsed 23%CPU (0avgtext+0avgdata 65776maxresident)k
    0inputs+1024312outputs (0major+9773minor)pagefaults 0swaps
So the proxy doesn't add much overhead even when run on the same machine as
the client and remote.
Still, piping receiveBytes into sendBytes like this does suggest that the proxy
could be made to use less CPU resources by using `sendfile()`.
getRepoUUID looks at that, and was seeing the annex.uuid of the proxy,
which caused it to unnecessarily set the git config. It probably would
also have led to other problems.
Still need to implement GET and PUT, and will implement CONNECT and
NOTIFYCHANGE for completeness.
All ServerMode checking is implemented for the proxy.
There are two possible approaches for how the proxy sends back messages
from the remote to the client. One would be to have a background thread
that reads messages and sends them back as they come in. The other,
which is what has been implemented so far, is to read messages from the remote
at points where it is expected to send them, and relay back to the
client before reading the next message from the client. At this point,
I'm unsure which approach would be better.
The need for proxynoresponse to be used by UNLOCKCONTENT, for example,
builds protocol knowledge into the proxy which it would not need with
the other method.
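A toy model of the second approach (hypothetical types; the real proxy works
with the P2P protocol's typed messages), showing where that protocol
knowledge comes in:

    -- Whether the remote is expected to reply to a given message at all;
    -- UNLOCKCONTENT is an example of a message that gets no reply.
    data Expected = ExpectsReply | NoReply

    relayOne
        :: (String -> IO ())  -- send to remote
        -> IO String          -- receive from remote
        -> (String -> IO ())  -- send to client
        -> Expected
        -> String             -- message from client
        -> IO ()
    relayOne toRemote fromRemote toClient expected msg = do
        toRemote msg
        case expected of
            NoReply -> return ()
            ExpectsReply -> fromRemote >>= toClient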
connRepo is only used when relaying git upload-pack and receive-pack.
That's only supposed to be used when git-annex-remotedaemon is serving
git-remote-tor-annex connections over tor. But it was always set, and
so could possibly be used in other places.
Fixed by making connRepo optional in the P2P protocol interface.
In Command.EnableTor, it's not needed, because it only speaks the
protocol in order to check that it's able to connect back to itself via
the hidden service. So changed that to pass Nothing rather than the git
repo.
In Remote.Helper.Ssh, it's connecting to git-annex-shell p2pstdio,
so is making the requests, so will never need connRepo.
In git-annex-shell p2pstdio, it was accepting git upload-pack and
receive-pack requests over the P2P protocol, even though nothing sent
them. This is arguably a security hole, particularly if the user has
set environment variables like GIT_ANNEX_SHELL_LIMITED to prevent
git push/pull via git-annex-shell.
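The shape of the change, in illustrative form (hypothetical types, not the
actual P2P runner interface):

    newtype ConnRepo = ConnRepo FilePath

    -- When no repo is provided, git upload-pack/receive-pack requests
    -- received over the protocol fail cleanly instead of being served.
    serveGitRequest :: Maybe ConnRepo -> IO (Either String ())
    serveGitRequest Nothing =
        return (Left "git repository access is not available on this connection")
    serveGitRequest (Just (ConnRepo _dir)) =
        -- the real code would run git upload-pack or receive-pack in _dir
        return (Right ())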
For NotifyChanges and also for the fallthrough case where
git-annex-shell passes a command off to git-shell, proxying is currently
ignored. So every remote that is accessed via a proxy will be treated as
the same git repository.
Every other command listed in cmdsMap will need to check if
Annex.proxyremote is set, and if so handle the proxying appropriately.
Probably only P2PStdio will need to support proxying. For now,
everything else refuses to work when proxying.
The part of that I don't like is the possibility that a command later gets
added to the list without checking proxying.
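A tiny illustrative guard (hypothetical; the real check consults
Annex.proxyremote) for commands that do not support proxying:

    refuseWhenProxied :: Maybe String -> Either String ()
    refuseWhenProxied Nothing = Right ()
    refuseWhenProxied (Just remotename) =
        Left ("this command cannot be used when proxying for " ++ remotename)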
When proxying is not enabled, it's important that git-annex-shell not
leak information that it would not have exposed before. Such as the
names or uuids of remotes.
I decided that, in the case where a repository used to have proxying
enabled, but no longer supports any proxies, it's ok to give the user a
clear error message indicating that proxying is not configured, rather
than a confusing uuid mismatch message.
Similarly, if a repository has proxying enabled, but not for the
requested repository, give a clear error message.
A tricky thing here is how to handle the case where there is more than
one remote, with proxying enabled, with the specified uuid. One way to
handle that would be to plumb the proxyRemoteName all the way through
from the remote git-annex to git-annex-shell, eg as a field, and use
only a remote with the same name. That would be very intrusive though.
Instead, I decided to let the proxy pick which remote it uses to access
a given Remote. And so it picks the least expensive one.
The client after all doesn't necessarily know any details about the
proxy's configuration. This does mean though, that if the least
expensive remote is not accessible, but another remote would have
worked, an access via the proxy will fail.
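A sketch of that selection (hypothetical record and String uuids; the real
code works with git-annex's Remote type and cost values):

    import Data.List (sortOn)

    data ProxiedVia = ProxiedVia
        { viaName :: String
        , viaUUID :: String
        , viaCost :: Double
        }

    -- All remotes with proxying enabled for the requested uuid are
    -- candidates; the least expensive one is used to access it.
    selectProxyRemote :: String -> [ProxiedVia] -> Maybe ProxiedVia
    selectProxyRemote wantuuid rs =
        case sortOn viaCost (filter ((== wantuuid) . viaUUID) rs) of
            (r:_) -> Just r
            [] -> Nothing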