This handles the workflow where the branch is first pushed to the proxy,
and then files in the exported tree are later are copied to the proxied remote.
Turns out that the way the export log is structured, nothing needs
to be done to finalize the export once the last key is sent to it. Which
is great because that would have been a lot of complication. On
receiving the push, Command.Export runs and calls recordExportBeginning,
does as much as it can to update the export with the files currently
on it, and then calls recordExportUnderway. At that point, the
export.log records the export as "complete", but it's not really. And
that's fine. The same happens when using `git-annex export` when some
files are not available to send. Other repositories that have
access to the special remote can already retrieve files from it. As
the missing files get copied to the exported remote, all that needs
to be done is record each in the export db.
At this point, proxying to exporttree=yes annexobjects=yes special remotes
is fully working. Except for in the case where multiple files in the
tree use the same key, and the files are sent to the proxied remote
before pushing the tree.
It seems that even special remotes without annexobjects=yes will work if
used with the workflow where the git-annex branch is pushed before
copying files. But not with the `git-annex push` workflow.
It works when using git-annex sync/push/assist, or when manually sending
all content to the proxied remote before pushing to the proxy remote.
But when the push comes before the content is sent, sending content does
not update the exported tree.
(When possible, of course it may not be there, or it may get renamed from
there for another exported file first. Or the remote may not support
renames.)
This will avoids redundant uploads.
An example case where this is important: Proxying to a exporttree remote,
a file is uploaded to it but is not yet in an exported tree. When the
exported tree is pushed, the remote needs to be updated by exporting to
it. In this case, the proxy doesn't have a copy of the file, so it would
need to download it from annexobjects before uploading it to the final
location. With this optimisation, it can just rename it.
However: If a key is used twice in an exported tree, it seems a proxy
will need to download and reupload anyway. Unless a copy operation is
added to exporttree remotes..
This avoids needing to re-upload the file again to get it to the
annexobjects location, which git-annex sync was doing when it was
preferred content.
If the file is not preferred content, sync will drop it from the
annexobjects location.
If the file has been deleted from the tree, it will remain in the
annexobjects location until an unused/dropunused pass is done.
Decided not to use the annexobjects location for exportTempName.
There doesn't seem to be any actual benefit to doing that, because an
export that renames to exportTempName always renames it back from that
to another location.
Also the annexobjects directory won't actually help with the paired
rename issue.
The file in the annexobjects location may have been renamed from a
previously exported file that got deleted in a subsequent export.
Or it may be renamed to annexobjects temporarily before being renamed to
another name (to handle eg pairwise renames).
But, an exported file is not guaranteed to contain the content of the
key that the local repository last exported there. Another tree could
have been exported from elsewhere in the meantime.
So, files in annexobjects do not necessarily have the content of their
key. And so have to be strongly verified when retrieving. The same as
is done when retrieving exported files.
While usually uploading to a special remote does not verify the content,
the content in a repository is assumed to be valid, and there is no trust
boundary. But with a proxied special remote, there may be users who are
allowed to store objects, but are not really trusted.
Another way to look at this is it's the equivilant of git-annex-shell
checking the hash of received data, which it does (see StoreContent
implementation).
An interrupted PUT to cluster that has a node that is a special remote
over http left open the connection to the cluster, so the next request
opens another one. So did an interrupted PUT directly to the proxied
special remote over http.
proxySpecialRemote was stuck waiting for all the DATA. Its connection
remained open so it kept waiting.
In servePut, checktooshort handles closing the P2P connection
when too short a data is received from PUT. But, checktooshort was only
called after the protoaction, which is what runs the proxy, which is
what was getting stuck. Modified it to run as a background thread,
which waits for the tooshortv to be written to, which gather always does
once it gets to the end of the data received from the http client.
That makes proxyConnection's releaseconn run once all data is received
from the http client. Made it close the connection handles before
waiting on the asyncworker thread. This lets proxySpecialRemote finish
processing any data from the handle, and then it will give up,
more or less cleanly, if it didn't receive enough data.
I say "more or less cleanly" because with both sides of the P2P
connection taken down, some protocol unhappyness results. Which can lead
to some ugly debug messages. But also can cause the asyncworker thread
to throw an exception. So made withP2PConnections not crash when it
receives an exception from releaseconn.
This did have a small change to the behavior of an interrupted PUT when
proxying to a regular remote. proxyConnection has a protoerrorhandler
that closes the proxy connection on a protocol error. But the proxy
connection is also closed by checktooshort when it closes the P2P
connection. Closing the same proxy connection twice is not a problem,
it just results in duplicated debug messages about it.
An interrupted `git-annex copy --to` a cluster via the http server,
when repeated, failed. The http server output "transfer already in
progress, or unable to take transfer lock". Apparently a second
connection was opened to the cluster, because the first connection
never got shut down.
Turned out the problem was that when proxying to a cluster, it would read a
short ByteString from the client, and send that to the nodes. But that left the
nodes warning more. Meanwhile, the proxy was expecting a SUCCESS/FAILURE
message from the nodes. So it didn't return, and so the cluster connection
stayed open.
This fixes a problem when git-annex testremote is run against a cluster
accessed via the http server. Annex.Cluster uses the location log
to find nodes that contain a key when checking if the key is present or getting
it. Just after a key was stored to a cluster node, reading the location log
was not getting the UUID of that node.
Apparently the Annex action that wrote to the location log, and the one
that read from it were run with two different Annex states. The http server
does use several different Annex threads.
BranchState was part of the AnnexState, and so two threads could have
different BranchStates.
Moved BranchState to the AnnexRead, so all threads will see the common state.
This might possibly impact performance. If one thread is writing changes to the
branch, and another thread is reading from the branch, the writing thread will
now invalidate the BranchState's cache, which will cause the reading thread to
need to do extra work. But correctness is surely more important. If did is
found to have impacted performance, it could probably be dealt with by doing
smarter BranchState cache invalidation.
Another way this might impact performance is that the BranchState has a small
cache. If several threads were reading from the branch and relying on the value
they just read still being in the case, now a cache miss will be more likely.
Increasing the BranchState cache to the number of jobs might be a good
idea to amelorate that. But the cache is currently an innefficient list,
so making it large would need changes to the data types.
(Commit 4304f1b6ae dealt with a follow-on
effect of the bug fixed here.)
Wired it up and it seems to basically work, although the test suite is
not fully passing.
Note that --jobs currently gets multiplied by the number of nodes in the
cluster, which is probably not good.
proxyRequest was treating UNLOCKCONTENT as a separate request.
That made it possible for there to be two different connections to the
proxied remote, with LOCKCONTENT being sent to one, and UNLOCKCONTENT
to the other one. A protocol error.
git-annex testremote now passes against a http proxied remote.
sendExactly will now be sure to evaluate the whole lazy ByteString.
In this case, the lazy ByteString was exactly the right lenth.
But, it seems that L.take caused it to not actually be fully evaluated.
In servePut, this manifested as gather never being fully evaluated,
which caused the hang.
Very, very subtle, and horrible bug. Clearly the use of lazy ByteString
(or really just laziness) is at fault, and it would be very worth moving
to conduit or whatever to avoid this.
removeOldestProxyConnectionPool will be innefficient the larger the pool
is. A better data structure could be more efficient. Eg, make each value
in the pool include the timestamp of its oldest element, then the oldest
value can be found and modified, rather than rebuilding the whole Map.
But, for pools of a few hundred items, this should be fine. It's O(n*n log n)
or so.
Also, when more than 1 connection with the same pool key exists,
it's efficient even for larger pools, since removeOldestProxyConnectionPool
is not needed.
The default of 1 idle connection could perhaps be larger.. like the
number of jobs? Otoh, it seems good to ramp up and down the number of
connections, which does happen. With 1, there is at most one stale
connection, which might cause a request to fail.
The proxy always checks the protocol version of a remote before talking
to it in a version-specific way, so the protocol version in the ProxyParams
is the client's protocol version. The remote will always be at the same or
an older protocol version than the client.
Note that in relayDATAFinish, when the client is at protocol version 0,
the remote must thus be as well, and that's why its version is not
checked in the case for that.
With that clarified, it's evident that, in P2P.Http.State, there's no
need to look at the proxied remote's protocol version at all.
But, it's buggy: the server hangs without processing the VALIDITY,
and I can't seem to work out why. As far as I can see, storefile
is getting as far as running the validitycheck, which is supposed to
read that, but never does.
This is especially strange because what seems like the same protocol
doesn't hang when servePut runs it. This made me think that it needed
to use inAnnexWorker to be more like servePut, but that didn't help.
Another small problem with this is that it does create an empty
.git/annex/tmp/ file for the key. Since this will usually be used in
combination with servePut, that doesn't seem worth worrying about much.
This means that when the client sends a truncated data to indicate
invalidity, DATA is not passed the full expected data. That leaves the
P2P connection in a state where it cannot be reused. While so far, they
are not reused, they will be later when proxies are supported. So, have
to close the P2P connection in this situation.
Always truncate instead. The padding risked something not noticing the
content was bad and getting a file that was corrupted in a novel way
with the padding "X" at the end. A truncated file is better.
Made the data-length header required even for v0. This simplifies the
implementation, and doesn't preclude extra verification being done for
v0.
The connectionWaitVar is an ugly hack. In servePut, nothing waits
on the waitvar, and I could not find a good way to make anything wait on
it.
This came down to SendBytes waiting on the waitv. Nothing ever filled
it.
Only Annex.Proxy needs the waitv, and it handles filling it. So make it
optional.
Added v2-v0 endpoints. These are tedious, but will be needed in order to
use the HTTP protocol to proxy to repositories with older git-annex,
where git-annex-shell will be speaking an older version of the protocol.
Changed GET to use 422 when the content is not present. 404 is needed to
detect when a protocol version is not supported.
The error message is not displayed to the use, but this mirrors the
behavior when a regular get from a special remote fails. At least now
there is not a protocol error.
Now that storeKey can have a different object file passed to it, this
complication is not needed. This avoids a lot of strange situations,
and will also be needed if streaming is eventually supported.
Still needs some work.
The reason that the waitv is necessary is because without it,
runNet loops back around and reads the next protocol message. But it's
not finished reading the whole bytestring yet, and so it reads some part
of it.
Working, but lots of room for improvement...
Without streaming, so there is a delay before download begins as the
file is retreived from the special remote.
And when resuming it retrieves the whole file from the special remote
*again*.
Also, if the special remote throws an exception, currently it
shows as "protocol error".
This makes eg git-annex get default to using the cluster rather than an
arbitrary node, which is better UI.
The actual cost of accessing a proxied node vs using the cluster is
basically the same. But using the cluster allows smarter load-balancing
to be done on the cluster.
Before it was using a node that might have had a higher cost.
Also threw in a random selection from amoung the low cost nodes. Of
course this is a poor excuse for load balancing, but it's better than
nothing. Most of the time...
The VIA extension is still needed to avoid some extra work and ugly
messages, but this is enough that it actually works.
This filters out the RemoteSides that are a proxied connection via a
remote gateway to the cluster.
The VIA extension will not filter those out, but will send VIA to them
on connect, which will cause the ones that are accessed via the listed
gateways to be filtered out.
Walking a tightrope between security and convenience here, because
git-annex-shell needs to only proxy for things when there has been
an explicit, local action to configure them.
In this case, the user has to have run `git-annex extendcluster`,
which now sets annex-cluster-gateway on the remote.
Note that any repositories that the gateway is recorded to
proxy for will be proxied onward. This is not limited to cluster nodes,
because checking the node log would not add any security; someone could
add any uuid to it. The gateway of course then does its own
checking to determine if it will allow proxying for the remote.
Rejected the idea of automatically instantiating remotes for proxies-of-proxies.
That needs cycle protection, while the current behavior, which happened
for free, is that running git-annex updateproxy on the proxy can be used
to configure it, but only for topologies that actually exist.
The problem with that idea is that the cluster's proxy is necessarily a
remote, and necessarily one that we'll want to sync with, since the git
repository is stored there. So when its preferred content wants a file,
and the cluster does too, the file will get uploaded to it as well as to
the cluster. With fanout, the upload to the cluster will populate the
proxy as well, avoiding a second upload. But only if the file is sent to
the cluster first. If it's sent to the proxy first, there will be two
uploads.
Another, lesser problem is that a repository can proxy for more than one
cluster. So when does it make sense to drop content from the repository?
It could be done when dropping from one cluster, but what of the other
one?
This complication was not necessary anyway. Instead, if it's desirable
to have some content accessed from close to the proxy, one of the
cluster nodes can just be put on the same filesystem as it. That will be
just as fast as storing the content on the proxy.
Except when no nodes want a file, it has to be stored somewhere, so
store it on all. Which is not really desirable, but neither is having to
pick one.
ProtoAssociatedFile deserialization is rather broken, and this could
possibly affect preferred content expressions that match on filenames.
The inability to roundtrip whitespace like tabs and newlines through is
not a problem because preferred content expressions can't be written
that match on whitespace such as a tab. For example:
joey@darkstar:~/tmp/bench/z>git-annex wanted origin-node2 'exclude=*CTRL-VTab*'
wanted origin-node2
git-annex: Parse error: Parse failure: near "*"
But, the filtering of control characters could perhaps be a problem. I think
that filtering is now obsolete, git-annex has comprehensive filtering of
control characters when displaying filenames, that happens at a higher level.
However, I don't want to risk a security hole so am leaving in that filtering
in ProtoAssociatedFile deserialization for now.
Avoid `git-annex sync --content` etc from operating on cluster nodes by default
since syncing with a cluster implicitly syncs with its nodes. This avoids a
lot of unncessary work when a cluster has a lot of nodes just in checking
if each node's preferred content is satisfied. And it avoids content
being sent to nodes individually, so instead syncing with clusters always
fanout uploads to nodes.
The downside is that there are situations where a cluster's preferred content
settings can be met, but those of its nodes are not. Or where a node does not
contain a key, but the cluster does, and there are not enough copies of the key
yet, so it would be desirable the send it there. I think that's an acceptable
tradeoff. These kind of situations are ones where the cluster itself should
probably be responsible for copying content to the node. Which it can do much
less expensively than a client can. Part of the balanced preferred content
design that I will be working on in a couple of months involves rebalancing
clusters, so I expect to revisit this.
The use of annex-sync config does allow running git-annex sync with a specific
node, or nodes, and it will sync with it. And it's also possible to set
annex-sync git configs to make it sync with a node by default. (Although that
will require setting up an explicit git remote for the node rather than relying
on the proxied remote.)
Logs.Cluster.Basic is needed because Remote.Git cannot import Logs.Cluster
due to a cycle. And the Annex.Startup load of clusters happens
too late for Remote.Git to use that. This does mean one redundant load
of the cluster log, though only when there is a proxy.
This makes git-annex sync and similar not treat proxied remotes as git
syncable remotes.
Also, display in git-annex info remote when the remote is proxied.
Loading the remote list a second time was removing all proxied remotes.
That happened because setting up the proxied remote added some config
fields to the in-memory git config, and on the second load, it saw those
configs and decided not to overwrite them with the proxy.
Now on the second load, that still happens. But now, the proxied
git configs are used to generate a remote same as if those configs were
all set. The reason that didn't happen before was twofold,
the gitremotes cache was not dropped, and the remote's url field was not
set correctly.
The problem with the remote's url field is that while it was marked as
proxy inherited, all other proxy inherited fields are annex- configs.
And the code to inherit didn't work for the url field.
Now it all works, but git-annex sync is left running git push/pull on
the proxied remote, which doesn't work. That still needs to be fixed.
Tested it with small chunk sizes (like 2) and resumes that were
eg 1 byte from the end of the file or beginning of file.
Also, git-annex testremote passes now against a cluster!
When the destination does not start with a copy, the cluster has one or
more copies. If more, dropping would reduce the number of copies, so
numcopies must be checked.
Considered checking how many nodes of the cluster contain a copy. If
only 1 node does, it could allow a move without checking numcopies.
The problem with that, though, is that other nodes of the cluster could
have copies that we don't know about. And dropping from a cluster tries
to drop from all nodes, so will drop even from those. So any drop from a
cluster can remove more than 1 copy.
Dropping from a cluster drops from every node of the cluster.
Including nodes that the cluster does not think have the content.
This is different from GET and CHECKPRESENT, which do trust the
cluster's location log. The difference is that removing from a cluster
should make 100% the content is gone from every node. So doing extra
work is ok. Compare with CHECKPRESENT where checking every node could
make it very expensive, and the worst that can happen in a false
negative is extra work being done.
Extended the P2P protocol with FAILURE-PLUS to handle the case where a
drop from one node succeeds, but a drop from another node fails. In that
case the entire cluster drop has failed.
Note that SUCCESS-PLUS is returned when dropping from a proxied remote
that is not a cluster, when the protocol version supports it. This is
because P2P.Proxy does not know when it's proxying for a single node
cluster vs for a remote that is not a cluster.
This is obviously necessary in order for dropping from a cluster to be able to
drop from all nodes.
It also avoids violating numcopies when a cluster node is a special remote.
If it were used in the drop proof, nothing would prevent the cluster from
dropping from it.
Client side support for SUCCESS-PLUS and ALREADY-HAVE-PLUS
is complete, when a PUT stores to additional repositories
than the expected on, the location log is updated with the
additional UUIDs that contain the content.
Started implementing PUT fanout to multiple remotes for clusters.
It is untested, and I fear fencepost errors in the relative
offset calculations. And it is missing proxying for the protocol
after DATA.
This assumes that the proxy for a cluster has up-to-date location
logs. If it didn't, it might proxy the checkpresent to a node that no
longer has the content, while some other node still does, and so
it would incorrectly appear that the cluster no longer contains the
content.
Since cluster UUIDs are not stored to location logs,
git-annex fsck --fast when claiming to fix a location log when
that occurred would not cause any problems. And presumably the location
tracking would later get sorted out.
At least usually, changes to the content of nodes goes via the proxy,
and it will update its location logs, so they will be accurate. However,
if there were multiple proxies to the same cluster, or nodes were
accessed directly (or via proxy to the node and not the cluster),
the proxy's location log could certainly be wrong.
(The location log access for GET has the same issues.)
Handled limitCopies, as well as everything using fromNumCopies and
fromMinCopies.
This should be everything, probably.
Note that, git-annex info displays a count of repositories, which still
includes cluster. I think that's ok. It would be possible to filter out
clusters there, but to the user they're pretty much just another
repository. The numcopies displayed by eg `git-annex info .` does not
include clusters.
This is to avoid inserting a cluster uuid into the location log when
only dead nodes in the cluster contain the content of a key.
One reason why this is necessary is Remote.keyLocations, which excludes
dead repositories from the list. But there are probably many more.
Implementing this was challenging, because Logs.Location importing
Logs.Cluster which imports Logs.Trust which imports Remote.List resulted
in an import cycle through several other modules.
Resorted to making Logs.Location not import Logs.Cluster, and instead
it assumes that Annex.clusters gets populated when necessary before it's
called.
That's done in Annex.Startup, which is run by the git-annex command
(but not other commands) at early startup in initialized repos. Or,
is run after initialization.
Note that is Remote.Git, it is unable to import Annex.Startup, because
Remote.Git importing Logs.Cluster leads the the same import cycle.
So ensureInitialized is not passed annexStartup in there.
Other commands, like git-annex-shell currently don't run annexStartup
either.
So there are cases where Logs.Location will not see clusters. So it won't add
any cluster UUIDs when loading the log. That's ok, the only reason to do
that is to make display of where objects are located include clusters,
and to make commands like git-annex get --from treat keys as being located
in a cluster. git-annex-shell certainly does not do anything like that,
and I'm pretty sure Remote.Git (and callers to Remote.Git.onLocalRepo)
don't either.
These remotes have no url configured, so git pull and push will fail.
git-annex sync --content etc can still sync with them otherwise.
Also, avoid git syncing twice with the same url. This is for cases where
a proxied remote has been manually configured and so does have a url.
Or perhaps proxied remotes will get configured like that automatically
later.
This does mean a redundant write to the git-annex branch. But,
it means that two clients can be using the same proxy, and after
one sends a file to a proxied remote, the other only has to pull from
the proxy to learn about that. It does not need to pull from every
remote behind the proxy (which it couldn't do anyway as git repo
access is not currently proxied).
Anyway, the overhead of this in git-annex branch writes is no worse
than eg, sending a file to a repository where git-annex assistant
is running, which then sends the file on to a remote, and updates
the git-annex branch then. Indeed, when the assistant also drops
the local copy, that results in more writes to the git-annex branch.