Commit graph

4793 commits

Author SHA1 Message Date
Joey Hess
9984252ab5
P2P protocol is finalized 2024-07-22 19:50:08 -04:00
Joey Hess
e979e85bff
make serveKeepLocked check auth just to be safe 2024-07-22 19:15:52 -04:00
Joey Hess
f5dd7a8bc0
implemented serveLockContent (untested) 2024-07-22 17:38:42 -04:00
Joey Hess
b697c6b9da
fix TMVar left full crash affecting servePutOffset
Problem is that whatever is reading from the TMVar may not have read
from it yet before the client writes the next thing to it.
2024-07-22 15:48:46 -04:00
Joey Hess
3069e28dd8
implemented servePutOffset and clientPutOffset
But, it's buggy: the server hangs without processing the VALIDITY,
and I can't seem to work out why. As far as I can see, storefile
is getting as far as running the validitycheck, which is supposed to
read that, but never does.

This is especially strange because what seems like the same protocol
doesn't hang when servePut runs it. This made me think that it needed
to use inAnnexWorker to be more like servePut, but that didn't help.

Another small problem with this is that it does create an empty
.git/annex/tmp/ file for the key. Since this will usually be used in
combination with servePut, that doesn't seem worth worrying about much.
2024-07-22 15:04:10 -04:00
Joey Hess
b240a11b79
clientPut seeking to offset 2024-07-22 12:50:21 -04:00
Joey Hess
a01426b713
avoid padding in servePut
This means that when the client sends a truncated data to indicate
invalidity, DATA is not passed the full expected data. That leaves the
P2P connection in a state where it cannot be reused. While so far, they
are not reused, they will be later when proxies are supported. So, have
to close the P2P connection in this situation.
2024-07-22 12:30:30 -04:00
Joey Hess
efa0efdc44
avoid padding in clientPut
Instead truncate when necessary to indicate invalid content was sent.
Very similar to how serveGet handles it.
2024-07-22 11:47:24 -04:00
Joey Hess
72d0769ca5
avoid padding content in serveGet
Always truncate instead. The padding risked something not noticing the
content was bad and getting a file that was corrupted in a novel way
with the padding "X" at the end. A truncated file is better.
2024-07-22 11:19:52 -04:00
Joey Hess
4826a3745d
servePut and clientPut implementation
Made the data-length header required even for v0. This simplifies the
implementation, and doesn't preclude extra verification being done for
v0.

The connectionWaitVar is an ugly hack. In servePut, nothing waits
on the waitvar, and I could not find a good way to make anything wait on
it.
2024-07-22 10:27:44 -04:00
m.risse@77eac2c22d673d5f10305c0bade738ad74055f92
1ee30b29ee Added a comment 2024-07-21 12:38:12 +00:00
nobodyinperson
b920655acd Added a comment: Also Serveo.net 2024-07-19 15:21:19 +00:00
m.risse@77eac2c22d673d5f10305c0bade738ad74055f92
2878343354 2024-07-19 12:12:56 +00:00
mih
5bc00a55dd 2024-07-16 15:02:46 +00:00
Joey Hess
97a2d0e4fb
use worker pool in withLocalP2PConnections
This allows multiple clients to be handled at the same time.
2024-07-11 14:37:52 -04:00
Joey Hess
14e0f778b7
simplify 2024-07-11 11:50:44 -04:00
Joey Hess
2228d56db3
serveGet invalidation 2024-07-11 11:42:32 -04:00
Joey Hess
3b37b9e53f
fix serveGet hang
This came down to SendBytes waiting on the waitv. Nothing ever filled
it.

Only Annex.Proxy needs the waitv, and it handles filling it. So make it
optional.
2024-07-11 07:46:52 -04:00
benjamin.poldrack@d09ccff6d42dd20277610b59867cf7462927b8e3
a82a573f75 2024-07-11 07:47:27 +00:00
benjamin.poldrack@d09ccff6d42dd20277610b59867cf7462927b8e3
9ce207532e 2024-07-11 07:23:30 +00:00
Joey Hess
8cb1332407
update 2024-07-10 16:10:08 -04:00
Joey Hess
b5b3d8cde2
update 2024-07-09 14:30:50 -04:00
Joey Hess
751b8e0baf
implemented serveCheckPresent
Still need a way to run Proto though
2024-07-09 09:08:42 -04:00
Joey Hess
3f402a20a8
implement Locker 2024-07-08 21:00:10 -04:00
Joey Hess
838169ee86
status 2024-07-07 16:16:11 -04:00
Joey Hess
40306d3fcf
finalizing HTTP P2p protocol some more
Added v2-v0 endpoints. These are tedious, but will be needed in order to
use the HTTP protocol to proxy to repositories with older git-annex,
where git-annex-shell will be speaking an older version of the protocol.

Changed GET to use 422 when the content is not present. 404 is needed to
detect when a protocol version is not supported.
2024-07-05 15:34:58 -04:00
Joey Hess
0bfdc57d25
update 2024-07-04 15:18:06 -04:00
Joey Hess
f452bd448a
REMOVE-BEFORE and GETTIMESTAMP proxying
For clusters, the timestamps have to be translated, since each node can
have its own idea about what time it is. To translate a timestamp, the
proxy remembers what time it asked the node for a timestamp in
GETTIMESTAMP, and applies the delta as an offset in REMOVE-BEFORE.

This does mean that a remove from a cluster has to call GETTIMESTAMP on
every node before dropping from nodes. Not very efficient. Although
currently it tries to drop from every single node anyway, which is also
not very efficient.

I thought about caching the GETTIMESTAMP from the nodes on the first
call. That would improve efficiency. But, since monotonic clocks on
!Linux don't advance when the computer is suspended, consider what might
happen if one node was suspended for a while, then came back. Its
monotonic timestamp would end up behind where the proxying expects it to
be. Would that result in removing when it shouldn't, or refusing to
remove when it should? Have not thought it through. Either way, a
cluster behaving strangly for an extended period of time because one
of its nodes was briefly asleep doesn't seem like good behavior.
2024-07-04 15:09:34 -04:00
Joey Hess
99b7a0cfe9
use REMOVE-BEFORE in P2P protocol
Only clusters still need to be fixed to close this todo.
2024-07-04 13:47:38 -04:00
Joey Hess
1243af4a18
toward SafeDropProof expiry checking
Added Maybe POSIXTime to SafeDropProof, which gets set when the proof is
based on a LockedCopy. If there are several LockedCopies, it uses the
closest expiry time. That is not optimal, it may be that the proof
expires based on one LockedCopy but another one has not expired. But
that seems unlikely to really happen, and anyway the user can just
re-run a drop if it fails due to expiry.

Pass the SafeDropProof to removeKey, which is responsible for checking
it for expiry in situations where that could be a problem. Which really
only means in Remote.Git.

Made Remote.Git check expiry when dropping from a local remote.

Checking expiry when dropping from a P2P remote is not yet implemented.
P2P.Protocol.remove has SafeDropProof plumbed through to it for that
purpose.

Fixing the remaining 2 build warnings should complete this work.

Note that the use of a POSIXTime here means that if the clock gets set
forward while git-annex is in the middle of a drop, it may say that
dropping took too long. That seems ok. Less ok is that if the clock gets
turned back a sufficient amount (eg 5 minutes), proof expiry won't be
noticed. It might be better to use the Monotonic clock, but that doesn't
advance when a laptop is suspended, and while there is the linux
Boottime clock, that is not available on other systems. Perhaps a
combination of POSIXTime and the Monotonic clock could detect laptop
suspension and also detect clock being turned back?

There is a potential future flag day where
p2pDefaultLockContentRetentionDuration is not assumed, but is probed
using the P2P protocol, and peers that don't support it can no longer
produce a LockedCopy. Until that happens, when git-annex is
communicating with older peers there is a risk of data loss when
a ssh connection closes during LOCKCONTENT.
2024-07-04 12:39:06 -04:00
Joey Hess
f69661ab65
status 2024-07-03 17:04:12 -04:00
Joey Hess
44b3136fdf
update 2024-07-03 15:53:25 -04:00
Joey Hess
6a95eb08ce
status 2024-07-03 15:01:34 -04:00
Joey Hess
badcb502a4
todo 2024-07-03 13:15:09 -04:00
Joey Hess
24d63e8c8e
update 2024-07-02 18:04:29 -04:00
Joey Hess
b2a24a1669
update 2024-07-02 16:16:37 -04:00
Joey Hess
fbc4d549f3
reorder 2024-07-01 11:44:54 -04:00
Joey Hess
8db30323b0
update 2024-07-01 11:38:29 -04:00
Joey Hess
1e1584d34b
toc 2024-07-01 11:37:12 -04:00
Joey Hess
d9e66f7754
update 2024-07-01 11:33:07 -04:00
Joey Hess
f58a5f577d
update 2024-07-01 11:29:04 -04:00
Joey Hess
fa5e7463eb
fix display when proxied GET yields ERROR
The error message is not displayed to the use, but this mirrors the
behavior when a regular get from a special remote fails. At least now
there is not a protocol error.
2024-07-01 11:19:02 -04:00
Joey Hess
dce3848ad8
avoid populating proxy's object file when storing on special remote
Now that storeKey can have a different object file passed to it, this
complication is not needed. This avoids a lot of strange situations,
and will also be needed if streaming is eventually supported.
2024-07-01 10:53:49 -04:00
Joey Hess
0dfdc9f951
dup stdio handles for P2P proxy
Special remotes might output to stdout, or read from stdin, which would
mess up the P2P protocol. So dup the handles to avoid any such problem.
2024-07-01 10:06:29 -04:00
Joey Hess
0e19c1c9fa
todo 2024-06-28 17:14:18 -04:00
Joey Hess
711a5166e2
PUT to proxied special remote working
Still needs some work.

The reason that the waitv is necessary is because without it,
runNet loops back around and reads the next protocol message. But it's
not finished reading the whole bytestring yet, and so it reads some part
of it.
2024-06-28 17:10:58 -04:00
Joey Hess
2e5af38f86
GET from proxied special remote
Working, but lots of room for improvement...

Without streaming, so there is a delay before download begins as the
file is retreived from the special remote.

And when resuming it retrieves the whole file from the special remote
*again*.

Also, if the special remote throws an exception, currently it
shows as "protocol error".
2024-06-28 15:44:48 -04:00
Joey Hess
5b1971e2f8
merged the proxy branch into master! 2024-06-27 15:44:11 -04:00
Joey Hess
c3f88923c0
Merge branch 'proxy' 2024-06-27 15:43:45 -04:00
Joey Hess
85f4527d74
update 2024-06-27 15:28:10 -04:00
Joey Hess
20ef1262df
give proxied cluster nodes a higher cost than the cluster gateway
This makes eg git-annex get default to using the cluster rather than an
arbitrary node, which is better UI.

The actual cost of accessing a proxied node vs using the cluster is
basically the same. But using the cluster allows smarter load-balancing
to be done on the cluster.
2024-06-27 15:21:03 -04:00
Joey Hess
cf59d7f92c
GET and CHECKPRESENT amoung lowest cost cluster nodes
Before it was using a node that might have had a higher cost.

Also threw in a random selection from amoung the low cost nodes. Of
course this is a poor excuse for load balancing, but it's better than
nothing. Most of the time...
2024-06-27 14:36:55 -04:00
Joey Hess
dceb8dc776
update 2024-06-27 13:40:09 -04:00
Joey Hess
c9d63d74d8
remove viconfig item
it works when run on a client that has the cluster gateway as a remote,
just not when on the cluster gateway
2024-06-27 13:34:24 -04:00
Joey Hess
87a7eeac33
document various multi-gateway cluster considerations
Perhaps this will avoid me needing to eg, implement spanning tree
protocol. ;-)
2024-06-27 13:33:19 -04:00
Joey Hess
8e322f76bc
updates 2024-06-27 12:57:08 -04:00
Joey Hess
3dad9446ce
distributed cluster cycle prevention
Added BYPASS to P2P protocol, and use it to avoid cycling between
cluster gateways.

Distributed clusters are working well now!
2024-06-27 12:20:22 -04:00
Joey Hess
effaf51b1f
avoid loop between cluster gateways
The VIA extension is still needed to avoid some extra work and ugly
messages, but this is enough that it actually works.

This filters out the RemoteSides that are a proxied connection via a
remote gateway to the cluster.

The VIA extension will not filter those out, but will send VIA to them
on connect, which will cause the ones that are accessed via the listed
gateways to be filtered out.
2024-06-26 15:29:59 -04:00
Joey Hess
4172109c8d
support multi-gateway clusters
VIA extension still needed otherwise a copy to a cluster can loop
forever.
2024-06-26 15:07:03 -04:00
Joey Hess
07e899c9d3
git-annex-shell: proxy nodes located beyond remote cluster gateways
Walking a tightrope between security and convenience here, because
git-annex-shell needs to only proxy for things when there has been
an explicit, local action to configure them.

In this case, the user has to have run `git-annex extendcluster`,
which now sets annex-cluster-gateway on the remote.

Note that any repositories that the gateway is recorded to
proxy for will be proxied onward. This is not limited to cluster nodes,
because checking the node log would not add any security; someone could
add any uuid to it. The gateway of course then does its own
checking to determine if it will allow proxying for the remote.
2024-06-26 12:56:16 -04:00
Joey Hess
798d6f6a46
todo 2024-06-25 17:58:45 -04:00
Joey Hess
e3dd29409b
improve docs 2024-06-25 17:50:22 -04:00
Joey Hess
0a1001dbfb
update 2024-06-25 17:26:26 -04:00
Joey Hess
9a8dcb58cd
design for distributed clusters 2024-06-25 17:20:49 -04:00
Joey Hess
b9889917a3
thoughts on cycles
Rejected the idea of automatically instantiating remotes for proxies-of-proxies.
That needs cycle protection, while the current behavior, which happened
for free, is that running git-annex updateproxy on the proxy can be used
to configure it, but only for topologies that actually exist.
2024-06-25 15:32:11 -04:00
Joey Hess
cec2848e8a
support annex.jobs for clusters 2024-06-25 14:54:20 -04:00
Joey Hess
5ede109ae5
gave up on upload fanout to cluster's proxy
The problem with that idea is that the cluster's proxy is necessarily a
remote, and necessarily one that we'll want to sync with, since the git
repository is stored there. So when its preferred content wants a file,
and the cluster does too, the file will get uploaded to it as well as to
the cluster. With fanout, the upload to the cluster will populate the
proxy as well, avoiding a second upload. But only if the file is sent to
the cluster first. If it's sent to the proxy first, there will be two
uploads.

Another, lesser problem is that a repository can proxy for more than one
cluster. So when does it make sense to drop content from the repository?
It could be done when dropping from one cluster, but what of the other
one?

This complication was not necessary anyway. Instead, if it's desirable
to have some content accessed from close to the proxy, one of the
cluster nodes can just be put on the same filesystem as it. That will be
just as fast as storing the content on the proxy.
2024-06-25 13:35:12 -04:00
Joey Hess
1bfe7f8a53
honor preferred content settings of cluster nodes
Except when no nodes want a file, it has to be stored somewhere, so
store it on all. Which is not really desirable, but neither is having to
pick one.

ProtoAssociatedFile deserialization is rather broken, and this could
possibly affect preferred content expressions that match on filenames.

The inability to roundtrip whitespace like tabs and newlines through is
not a problem because preferred content expressions can't be written
that match on whitespace such as a tab. For example:

joey@darkstar:~/tmp/bench/z>git-annex wanted  origin-node2 'exclude=*CTRL-VTab*'
wanted origin-node2
git-annex: Parse error: Parse failure: near "*"

But, the filtering of control characters could perhaps be a problem. I think
that filtering is now obsolete, git-annex has comprehensive filtering of
control characters when displaying filenames, that happens at a higher level.
However, I don't want to risk a security hole so am leaving in that filtering
in ProtoAssociatedFile deserialization for now.
2024-06-25 11:43:09 -04:00
Joey Hess
202ea3ff2a
don't sync with cluster nodes by default
Avoid `git-annex sync --content` etc from operating on cluster nodes by default
since syncing with a cluster implicitly syncs with its nodes. This avoids a
lot of unncessary work when a cluster has a lot of nodes just in checking
if each node's preferred content is satisfied. And it avoids content
being sent to nodes individually, so instead syncing with clusters always
fanout uploads to nodes.

The downside is that there are situations where a cluster's preferred content
settings can be met, but those of its nodes are not. Or where a node does not
contain a key, but the cluster does, and there are not enough copies of the key
yet, so it would be desirable the send it there. I think that's an acceptable
tradeoff. These kind of situations are ones where the cluster itself should
probably be responsible for copying content to the node. Which it can do much
less expensively than a client can. Part of the balanced preferred content
design that I will be working on in a couple of months involves rebalancing
clusters, so I expect to revisit this.

The use of annex-sync config does allow running git-annex sync with a specific
node, or nodes, and it will sync with it. And it's also possible to set
annex-sync git configs to make it sync with a node by default. (Although that
will require setting up an explicit git remote for the node rather than relying
on the proxied remote.)

Logs.Cluster.Basic is needed because Remote.Git cannot import Logs.Cluster
due to a cycle. And the Annex.Startup load of clusters happens
too late for Remote.Git to use that. This does mean one redundant load
of the cluster log, though only when there is a proxy.
2024-06-25 10:24:38 -04:00
Joey Hess
b8016eeb65
add annex-proxied
This makes git-annex sync and similar not treat proxied remotes as git
syncable remotes.

Also, display in git-annex info remote when the remote is proxied.
2024-06-24 10:16:59 -04:00
Joey Hess
0c111fc96a
fix git-annex sync --content with proxied remotes
Loading the remote list a second time was removing all proxied remotes.
That happened because setting up the proxied remote added some config
fields to the in-memory git config, and on the second load, it saw those
configs and decided not to overwrite them with the proxy.

Now on the second load, that still happens. But now, the proxied
git configs are used to generate a remote same as if those configs were
all set. The reason that didn't happen before was twofold,
the gitremotes cache was not dropped, and the remote's url field was not
set correctly.

The problem with the remote's url field is that while it was marked as
proxy inherited, all other proxy inherited fields are annex- configs.
And the code to inherit didn't work for the url field.

Now it all works, but git-annex sync is left running git push/pull on
the proxied remote, which doesn't work. That still needs to be fixed.
2024-06-24 09:45:51 -04:00
Joey Hess
60413a2557
update 2024-06-23 16:38:01 -04:00
Joey Hess
5d8bdac38e
upload fanout resume seems free of fenceposts
Tested it with small chunk sizes (like 2) and resumes that were
eg 1 byte from the end of the file or beginning of file.

Also, git-annex testremote passes now against a cluster!
2024-06-23 16:22:39 -04:00
Joey Hess
9e070470f4
update 2024-06-23 12:48:22 -04:00
Joey Hess
3cd7969823
update 2024-06-23 12:31:00 -04:00
Joey Hess
d0aec8f623
always check numcopies when moving from cluster
When the destination does not start with a copy, the cluster has one or
more copies. If more, dropping would reduce the number of copies, so
numcopies must be checked.

Considered checking how many nodes of the cluster contain a copy. If
only 1 node does, it could allow a move without checking numcopies.
The problem with that, though, is that other nodes of the cluster could
have copies that we don't know about. And dropping from a cluster tries
to drop from all nodes, so will drop even from those. So any drop from a
cluster can remove more than 1 copy.
2024-06-23 12:00:50 -04:00
Joey Hess
ec5b6454f4
todo 2024-06-23 10:09:35 -04:00
Joey Hess
2762f9c4ce
fix location log update for copy to 1-node cluster 2024-06-23 09:53:33 -04:00
Joey Hess
5b332a87be
dropping from clusters
Dropping from a cluster drops from every node of the cluster.
Including nodes that the cluster does not think have the content.
This is different from GET and CHECKPRESENT, which do trust the
cluster's location log. The difference is that removing from a cluster
should make 100% the content is gone from every node. So doing extra
work is ok. Compare with CHECKPRESENT where checking every node could
make it very expensive, and the worst that can happen in a false
negative is extra work being done.

Extended the P2P protocol with FAILURE-PLUS to handle the case where a
drop from one node succeeds, but a drop from another node fails. In that
case the entire cluster drop has failed.

Note that SUCCESS-PLUS is returned when dropping from a proxied remote
that is not a cluster, when the protocol version supports it. This is
because P2P.Proxy does not know when it's proxying for a single node
cluster vs for a remote that is not a cluster.
2024-06-23 09:43:40 -04:00
Joey Hess
7bbd822a17
avoid using cluster nodes in drop proof when dropping from cluster
This is obviously necessary in order for dropping from a cluster to be able to
drop from all nodes.

It also avoids violating numcopies when a cluster node is a special remote.
If it were used in the drop proof, nothing would prevent the cluster from
dropping from it.
2024-06-23 06:20:11 -04:00
Joey Hess
5a4b4b59b9
update 2024-06-23 05:26:45 -04:00
nobodyinperson
724eb8a369 Suggest that 'git annex unused' reports total unused size 2024-06-21 16:30:09 +00:00
Joey Hess
53674e8abb
Merge branch 'master' into proxy 2024-06-20 11:20:26 -04:00
Joey Hess
53598e5154
merge from proxy branch 2024-06-20 11:20:16 -04:00
Joey Hess
032d3902d8
wording 2024-06-20 10:15:24 -04:00
joris
b35be4b656 Added a comment 2024-06-20 09:58:05 +00:00
Joey Hess
097ef9979c
towards a design for proxying to special remotes 2024-06-19 06:15:03 -04:00
Joey Hess
f18740699e
P2P protocol version 2, adding SUCCESS-PLUS and ALREADY-HAVE-PLUS
Client side support for SUCCESS-PLUS and ALREADY-HAVE-PLUS
is complete, when a PUT stores to additional repositories
than the expected on, the location log is updated with the
additional UUIDs that contain the content.

Started implementing PUT fanout to multiple remotes for clusters.
It is untested, and I fear fencepost errors in the relative
offset calculations. And it is missing proxying for the protocol
after DATA.
2024-06-18 16:21:40 -04:00
Joey Hess
f049156a03
checkpresent support for clusters
This assumes that the proxy for a cluster has up-to-date location
logs. If it didn't, it might proxy the checkpresent to a node that no
longer has the content, while some other node still does, and so
it would incorrectly appear that the cluster no longer contains the
content.

Since cluster UUIDs are not stored to location logs,
git-annex fsck --fast when claiming to fix a location log when
that occurred would not cause any problems. And presumably the location
tracking would later get sorted out.

At least usually, changes to the content of nodes goes via the proxy,
and it will update its location logs, so they will be accurate. However,
if there were multiple proxies to the same cluster, or nodes were
accessed directly (or via proxy to the node and not the cluster),
the proxy's location log could certainly be wrong.

(The location log access for GET has the same issues.)
2024-06-18 11:16:16 -04:00
Joey Hess
88d9a02f7c
initial, working support for getting from clusters
Currently tends to put all the load on a single node, which will need to
be improved.
2024-06-18 11:01:10 -04:00
Joey Hess
8290f70978
update 2024-06-18 10:08:15 -04:00
Joey Hess
e2fd2ee2bd
update 2024-06-17 09:31:44 -04:00
Joey Hess
3970bbb03b
Merge branch 'master' into proxy 2024-06-17 09:29:34 -04:00
Joey Hess
64afbb0b93
don't count clusters as copies, continued
Handled limitCopies, as well as everything using fromNumCopies and
fromMinCopies.

This should be everything, probably.

Note that, git-annex info displays a count of repositories, which still
includes cluster. I think that's ok. It would be possible to filter out
clusters there, but to the user they're pretty much just another
repository. The numcopies displayed by eg `git-annex info .` does not
include clusters.
2024-06-16 15:14:53 -04:00
Joey Hess
780367200b
remove dead nodes when loading the cluster log
This is to avoid inserting a cluster uuid into the location log when
only dead nodes in the cluster contain the content of a key.

One reason why this is necessary is Remote.keyLocations, which excludes
dead repositories from the list. But there are probably many more.

Implementing this was challenging, because Logs.Location importing
Logs.Cluster which imports Logs.Trust which imports Remote.List resulted
in an import cycle through several other modules.

Resorted to making Logs.Location not import Logs.Cluster, and instead
it assumes that Annex.clusters gets populated when necessary before it's
called.

That's done in Annex.Startup, which is run by the git-annex command
(but not other commands) at early startup in initialized repos. Or,
is run after initialization.

Note that is Remote.Git, it is unable to import Annex.Startup, because
Remote.Git importing Logs.Cluster leads the the same import cycle.
So ensureInitialized is not passed annexStartup in there.

Other commands, like git-annex-shell currently don't run annexStartup
either.

So there are cases where Logs.Location will not see clusters. So it won't add
any cluster UUIDs when loading the log. That's ok, the only reason to do
that is to make display of where objects are located include clusters,
and to make commands like git-annex get --from treat keys as being located
in a cluster. git-annex-shell certainly does not do anything like that,
and I'm pretty sure Remote.Git (and callers to Remote.Git.onLocalRepo)
don't either.
2024-06-16 14:39:44 -04:00
Joey Hess
b3370a191c
insert cluster UUIDs when loading location logs, and omit when saving
Inline isClusterUUID for speed.
2024-06-14 18:06:28 -04:00
Joey Hess
846903e9bb
update todo list for this month
whew that's gonna be a lot
2024-06-14 15:23:43 -04:00
Joey Hess
d8daabe9ec
Merge branch 'master' of ssh://git-annex.branchable.com 2024-06-13 06:44:22 -04:00
Joey Hess
22a329c57e
copied over some changes from proxy branch 2024-06-13 06:43:59 -04:00
Joey Hess
555d7e52d3
more thoughts on clusters 2024-06-12 17:30:55 -04:00
Joey Hess
0ebb107974
update 2024-06-12 15:21:23 -04:00
Joey Hess
46a1fcb3ea
avoid git syncing with instantiate proxied remotes
These remotes have no url configured, so git pull and push will fail.
git-annex sync --content etc can still sync with them otherwise.

Also, avoid git syncing twice with the same url. This is for cases where
a proxied remote has been manually configured and so does have a url.
Or perhaps proxied remotes will get configured like that automatically
later.
2024-06-12 15:10:03 -04:00
Joey Hess
a986a20034
designing clusters 2024-06-12 14:57:26 -04:00
Joey Hess
e70e3473b3
on cycles 2024-06-12 13:52:17 -04:00
Joey Hess
44464e4410
update 2024-06-12 12:37:14 -04:00
Joey Hess
67d1e2a459
updates 2024-06-12 12:02:25 -04:00
Joey Hess
dfdda95053
proxy updates location tracking information
This does mean a redundant write to the git-annex branch. But,
it means that two clients can be using the same proxy, and after
one sends a file to a proxied remote, the other only has to pull from
the proxy to learn about that. It does not need to pull from every
remote behind the proxy (which it couldn't do anyway as git repo
access is not currently proxied).

Anyway, the overhead of this in git-annex branch writes is no worse
than eg, sending a file to a repository where git-annex assistant
is running, which then sends the file on to a remote, and updates
the git-annex branch then. Indeed, when the assistant also drops
the local copy, that results in more writes to the git-annex branch.
2024-06-12 11:37:14 -04:00
Joey Hess
96853cd833
finish P2P protocol proxying
CONNECT is not supported by git-annex-shell p2pstdio, but for proxying
to tor-annex remotes, it will be supported, and will make a git pull/push
to a proxied remote work the same with that as it does over ssh,
eg it accesses the proxy's git repo not the proxied remote's git repo.

The p2p protocol docs say that NOTIFYCHANGES is not always supported,
and it looked annoying to implement it for this, and it also seems
pretty useless, so make it be a protocol error. git-annex remotedaemon
will already be getting change notifications from the proxy's git repo,
so there's no need to get additional redundant change notifications for
proxied remotes that would be for changes to the same git repo.
2024-06-12 10:40:51 -04:00
Joey Hess
f98605bce7
a local git remote cannot proxy
Prevent listProxied from listing anything when the proxy remote's
url is a local directory. Proxying does not work in that situation,
because the proxied remotes have the same url, and so git-annex-shell
is not run when accessing them, instead the proxy remote is accessed
directly.

I don't think there is any good way to support this. Even if the instantiated
git repos for the proxied remotes somehow used an url that caused it to use
git-annex-shell to access them, planned features like `git-annex copy --to
proxy` accepting a key and sending it on to nodes behind the proxy would not
work, since git-annex-shell is not used to access the proxy.

So it would need to use something to access the proxy that causes
git-annex-shell to be run and speaks P2P protocol over it. And we have that.
It's a ssh connection to localhost. Of course, it would be possible to
take ssh out of that mix, and swap in something that does not have
encryption overhead and authentication complications, but otherwise
behaves the same as ssh. And if the user wants to do that, GIT_SSH
does exist.
2024-06-12 10:16:04 -04:00
Joey Hess
c6e0710281
proxying to local git remotes works
This just happened to work correctly. Rather surprisingly. It turns out
that openP2PSshConnection actually also supports local git remotes,
by just running git-annex-shell with the path to the remote.

Renamed "P2PSsh" to "P2PShell" to make this clear.
2024-06-12 10:10:11 -04:00
yarikoptic
c6f2a5d372 TODO for log --key 2024-06-12 13:20:29 +00:00
Joey Hess
5beaffb412
proxying PUT now working
The almost identical code duplication between relayDATA and relayDATA'
is very annoying. I tried quite a few things to parameterize them, but
the type checker is having fits when I try it.
2024-06-11 16:56:52 -04:00
Joey Hess
ed4fda098b
todo 2024-06-11 15:15:58 -04:00
Joey Hess
a2f4a8eddf
proxying GET now working
Memory use is small and constant; receiveBytes returns a lazy bytestring
and it does stream.

Comparing speed of a get of a 500 mb file over proxy from origin-origin,
vs from the same remote over a direct ssh:

joey@darkstar:~/tmp/bench/client>/usr/bin/time git-annex get bigfile --from origin-origin
get bigfile (from origin-origin...)
ok
(recording state in git...)
1.89user 0.67system 0:10.79elapsed 23%CPU (0avgtext+0avgdata 68716maxresident)k
0inputs+984320outputs (0major+10779minor)pagefaults 0swaps

joey@darkstar:~/tmp/bench/client>/usr/bin/time git-annex get bigfile --from direct-ssh
get bigfile (from direct-ssh...)
ok
1.79user 0.63system 0:10.49elapsed 23%CPU (0avgtext+0avgdata 65776maxresident)k
0inputs+1024312outputs (0major+9773minor)pagefaults 0swaps

So the proxy doesn't add much overhead even when run on the same machine as
the client and remote.

Still, piping receiveBytes into sendBytes like this does suggest that the proxy
could be made to use less CPU resouces by using `sendfile()`.
2024-06-11 15:09:43 -04:00
Joey Hess
09b5e53f49
set annex.uuid in proxy's Repo
getRepoUUID looks at that, and was seeing the annex.uuid of the proxy.
Which caused it to unncessarily set the git config. Probably also would
have led to other problems.
2024-06-11 13:40:50 -04:00
Joey Hess
657a91527a
update 2024-06-11 13:22:03 -04:00
Joey Hess
d2e3c5c89f
update 2024-06-11 13:07:53 -04:00
Joey Hess
43ff697f25
update status and design work on proxy encryption and chunking 2024-06-07 12:35:04 -04:00
Joey Hess
058726ee86
next step identified 2024-06-06 18:06:45 -04:00
Joey Hess
d59383beaf
update 2024-06-06 17:25:22 -04:00
ruslan@302cb7f8d398fcce72f88b26b0c2f3a53aaf0bcd
ca687413ef Added a comment 2024-06-05 16:53:51 +00:00
Joey Hess
1761e971ee
status update after day 1 of new project 2024-06-04 14:55:54 -04:00
Joey Hess
3df70c5c0c
implementation plan 2024-06-04 07:51:33 -04:00
Joey Hess
6375e3be3b
recieved funding to work on this, which comes with a schedule 2024-06-04 06:53:59 -04:00
Joey Hess
5992e1729a
fixed by git release 2024-06-04 06:39:08 -04:00
Joey Hess
adf17f5038
Merge branch 'master' of ssh://git-annex.branchable.com 2024-05-30 13:26:44 -04:00
Joey Hess
0155abfba4
git-remote-annex: Support urls like annex::https://example.com/foo-repo
Using the usual url download machinery even allows these urls to need
http basic auth, which is prompted for with git-credential. Which opens
the possibility for urls that contain a secret to be used, eg the cipher
for encryption=shared. Although the user is currently on their own
constructing such an url, I do think it would work.

Limited to httpalso for now, for security reasons. Since both httpalso
(and retrieving this very url) is limited by the usual
annex.security.allowed-ip-addresses configs, it's not possible for an
attacker to make one of these urls that sets up a httpalso url that
opens the garage door. Which is one class of attacks to keep in mind
with this thing.

It seems that there could be either a git-config that allows other types
of special remotes to be set up this way, or special remotes could
indicate when they are safe. I do worry that the git-config would
encourage users to set it without thinking through the security
implications. One remote config might be safe to access this way, but
another config, for one with the same type, might not be. This will need
further thought, and real-world examples to decide what to do.
2024-05-30 12:24:16 -04:00
yarikoptic
d23ae92da8 Added a comment 2024-05-30 14:34:32 +00:00
yarikoptic
285a7ff3c3 Added a comment 2024-05-30 14:29:43 +00:00
Joey Hess
3f33616068
security 2024-05-29 22:55:06 -04:00
Joey Hess
efa684ab8a
todo 2024-05-29 18:21:17 -04:00
Joey Hess
09a0552489
split off todo, comment 2024-05-29 13:16:36 -04:00
Joey Hess
e19916f54b
add config-uuid to annex:: url for --sameas remotes
And use it to set annex-config-uuid in git config. This makes
using the origin special remote work after cloning.

Without the added Logs.Remote.configSet, instantiating the remote will
look at the annex-config-uuid's config in the remote log, which will be
empty, and so it will fail to find a special remote.

The added deletion of files in the alternatejournaldir is just to make
100% sure they don't get committed to the git-annex branch. Now that
they contain things that definitely should not be committed.
2024-05-29 12:50:00 -04:00
Joey Hess
bbf49c9de7
httpalso just worked, with one small issue to fix 2024-05-28 16:26:16 -04:00
Joey Hess
cb9f7b5646
update 2024-05-28 12:50:54 -04:00
Joey Hess
14443fd307
update 2024-05-28 12:46:56 -04:00
Joey Hess
e19f56e7d8
Merge branch 'master' of ssh://git-annex.branchable.com 2024-05-28 10:27:50 -04:00
Joey Hess
c6669990fb
update 2024-05-28 09:19:00 -04:00
m.risse@77eac2c22d673d5f10305c0bade738ad74055f92
bab6d3e58f Added a comment: Re: worktree provisioning 2024-05-28 12:06:39 +00:00
Joey Hess
c2483f6e6d
update 2024-05-27 22:44:35 -04:00
Joey Hess
0975e792ea
git-remote-annex: Fix error display on clone
cleanupInitialization gets run when an exception is thrown, so needs to
avoid throwing exceptions itself, as that would hide the error message
that the user needs to see.
2024-05-27 13:28:05 -04:00
Joey Hess
a766475d14
split out a todo 2024-05-27 12:50:46 -04:00
Joey Hess
e64add7cdf
git-remote-annex: support importrree=yes remotes
When exporttree=yes is also set. Probably it would also be possible to
support ones with only importtree=yes, by enabling exporttree=yes for
the remote only when using git-remote-annex, but let's keep this
simple... I'm not sure what gets recorded in .git/annex/ state
differently in the two cases that might cause a problem when doing that.

Note that the full annex:: urls generated and displayed for such a
remote omit the importree=yes. Which is ok, cloning from such an url
uses an exporttree=remote, but the git-annex branch doesn't get written
by this program, so once the real config is available from the git-annex
branch, it will still function as an importree=yes remote.
2024-05-27 12:35:42 -04:00
Joey Hess
19418e81ee
git-remote-annex: Display full url when using remote with the shorthand url 2024-05-24 17:15:31 -04:00
Joey Hess
04a256a0f8
work around git "defense in depth" breakage with git clone checking for hooks
This git bug also broke git-lfs, and I am confident it will be reverted
in the next release.

For now, cloning from an annex:: url wastes some bandwidth on the next
pull by not caching bundles locally.

If git doesn't fix this in the next version, I'd be tempted to rethink
whether bundle objects need to be cached locally. It would be possible to
instead remember which bundles have been seen and their heads, and
respond to the list command with the heads, and avoid unbundling them
agian in fetch. This might even be a useful performance improvement in
the latter case. It would be quite a complication to a currently simple
implementation though.
2024-05-24 15:49:53 -04:00
Joey Hess
6ccd09298b
convert srcref to a sha
This fixes pushing a new ref that is the same as something already
pushed. In findotherprereq, it compares two shas, which didn't work when
one is actually not a sha but a ref.

This is one of those cases where Sha being an alias for Ref makes it
hard to catch mistakes. One of these days those need to be
differentiated at the type level, but not today..
2024-05-24 15:33:35 -04:00
Joey Hess
96c66a7ca9
bug 2024-05-24 15:15:42 -04:00
Joey Hess
58301e40d2
sync with special remotes with an annex:: url
Check explicitly for an annex:: url, not just any url. While no built-in
special remotes set an url, except ones that can be synced with, it
seems possible that some external special remote sets an url for its own
use, but did not expect it to be used by git-annex sync et al.

The assistant also syncs with them.
2024-05-24 14:57:29 -04:00
Joey Hess
22bf23782f
initremote, enableremote: Added --with-url to enable using git-remote-annex
Also sets remote.name.fetch to a typical value, same as git remote add does.
2024-05-24 14:29:36 -04:00
Joey Hess
7d61a99da3
todo 2024-05-24 13:57:33 -04:00
Joey Hess
2670508b97
also broke git-remote-annex 2024-05-24 13:35:45 -04:00
Joey Hess
b792b128a0
verified checkprereq
The case documented in its comment worked in a test push and clone.
2024-05-24 13:06:29 -04:00
Joey Hess
1a3c60cc8e
git-remote-annex: avoid bundle object leakage in push race or interrupted push
Locally record the manifest before uploading it or any bundles,
and read it on the next push. Any bundles from the push that are
not included in the currently being pushed manifest will get added
to the outManifest, and so eventually get deleted.

This deals with an interrupted push that is not resumed and instead
something else is pushed. And it deals with a push race that overwrites
the manifest.

Of course, this can't help if one of those situations is followed by
the local repo being deleted. But that's equivilant to doing a git-annex
copy of a new annexed file to a special remote and then deleting the
special repo w/o pushing. In either case the special remote ends up with
a object in it that git-annex doesn't know about.
2024-05-24 12:47:32 -04:00
Joey Hess
264c51b4f4
comment 2024-05-22 06:06:18 -04:00
Joey Hess
4131e31f5c
PATH_MAX 2024-05-22 04:26:36 -04:00
Joey Hess
5fb307f1c5
comment 2024-05-21 17:47:55 -04:00
Joey Hess
938e714a11
bleh 2024-05-21 17:32:49 -04:00
Joey Hess
10a60183e1
guard pushEmpty 2024-05-21 12:12:44 -04:00
Joey Hess
14c79373c4
update 2024-05-21 12:05:44 -04:00
Joey Hess
b3d7ae51f0
fix edge case where git-annex branch does not have config for enabled special remote
One way this could happen is cloning an empty special remote.
A later fetch would then fail.
2024-05-21 11:27:49 -04:00
Joey Hess
3e7324bbcb
only delete bundles on pushEmpty
This avoids some apparently otherwise unsolveable problems involving
races that resulted in the manifest listing bundles that were deleted.

Removed the annex-max-git-bundles config because it can't actually
result in deleting old bundles. It would still be possible to have a
config that controls how often to do a full push, which would avoid
needing to download too many bundles on clone, as well as needing to
checkpresent too many bundles in verifyManifest. But it would need a
different name and description.
2024-05-21 11:13:27 -04:00
Joey Hess
f544946b09
update 2024-05-21 10:20:30 -04:00
Joey Hess
b042dfeb0e
emptying pushes only delete 2024-05-21 09:52:35 -04:00
Joey Hess
5d40759470
formalize problem description 2024-05-21 09:35:46 -04:00
Joey Hess
3a38520aac
avoid interrupted push leaving remote without a manifest
Added a backup manifest key, which is used if the main manifest key is
not present. When uploading a new Manifest, it makes sure that it never
drops one key except when the other key is present.

It's entirely possible for the two manifest keys to get out of sync, due
to races. The main one wins when it's present, it is possible for the
main one being dropped to expose the backup one, which has a different
push recorded.
2024-05-20 15:41:09 -04:00
Joey Hess
594ca2fd3a
update 2024-05-20 14:52:06 -04:00
Joey Hess
34a6db4f15
improve recovery from interrupted push
On push, first try to drop all outManifest keys listed in the current
manifest file, which resumes from an interrupted push that didn't
get a chance to delete those keys.

The new manifest gets its outManifest populated with the keys that were
in the old manifest, plus any of the keys that were unable to be
dropped.

Note that it would be possible for uploadManifest to skip dropping old
keys at all. The old keys would get dropped on the next push. But it
seems better to delete stuff immediately rather than waiting. And the
extra work is limited to push and typically is small.

A remote where dropKey always fails will result in an outManifest that
grows longer and longer. It would be possible to check if the remote
has appendonly = True and avoid populating the outManifest. Of course,
an appendonly remote will grow with every git push anyway. And currently
only Remote.GitLFS sets that, which can't be used as a git-remote-annex
remote anyway.
2024-05-20 13:49:45 -04:00
Joey Hess
ce60211881
add incremental vs full push race to todo
with plan to deal with it
2024-05-16 09:37:28 -04:00
Joey Hess
b1b6e35d4c
reorg todo 2024-05-15 17:41:55 -04:00
Joey Hess
adcebbae47
clean up git-remote-annex git-annex branch handling
Implemented alternateJournal, which git-remote-annex
uses to avoid any writes to the git-annex branch while setting up
a special remote from an annex:: url.

That prevents the remote.log from being overwritten with the special
remote configuration from the url, which might not be 100% the same as
the existing special remote configuration.

And it prevents an overwrite deleting of other stuff that was
already in the remote.log.

Also, when the branch was created by git-remote-annex, only delete it
at the end if nothing else has been written to it by another command.
This fixes the race condition described in
797f27ab05, where git-remote-annex
set up the branch and git-annex init and other commands were
run at the same time and their writes to the branch were lost.
2024-05-15 17:33:38 -04:00
Joey Hess
d24d8870c5
todo 2024-05-15 14:33:13 -04:00
Joey Hess
2dfffa0621
bugfix
When pushing branch foo, we don't want to delete other tracking
branches. In particular, a full push needs all the tracking branches.
2024-05-14 16:17:27 -04:00
Joey Hess
169e673ad4
result of some testing 2024-05-14 16:01:24 -04:00
Joey Hess
0722c504c5
update docs for git-remote-annex 2024-05-14 15:31:16 -04:00
Joey Hess
23c4125ed4
mention other commands shipped with git-annex in SEE ALSO in man page 2024-05-14 15:23:45 -04:00
Joey Hess
24af51e66d
git-annex unused --from remote skips its git-remote-annex keys
This turns out to only be necessary is edge cases. Most of the
time, git-annex unused --from remote doesn't see git-remote-annex keys
at all, because it does not record a location log for them.

On the other hand, git-annex unused does find them, since it does not
rely on the location log. And that's good because they're a local cache
that the user should be able to drop.

If, however, the user ran git-annex unused and then git-annex move
--unused --to remote, the keys would have a location log for that
remote. Then git-annex unused --from remote would see them, and would
consider them unused. Even when they are present on the special remote
they belong to. And that risks losing data if they drop the keys from
the special remote, but didn't expect it would delete git branches they
had pushed to it.

So, make git-annex unused --from skip git-remote-annex keys whose uuid
is the same as the remote.
2024-05-14 15:17:40 -04:00
Joey Hess
0bf72ef103
max-git-bundles config for git-remote-annex 2024-05-14 14:23:40 -04:00
Joey Hess
8ad768fdba
todo 2024-05-14 13:58:35 -04:00
Joey Hess
6f1039900d
prevent using git-remote-annex with unsuitable special remote configs
I hope to support importtree=yes eventually, but it does not currently
work.

Added remote.<name>.allow-encrypted-gitrepo that needs to be set to
allow using it with encrypted git repos.

Note that even encryption=pubkey uses a cipher stored in the git repo
to encrypt the keys stored in the remote. While it would be possible to
not encrypt the GITBUNDLE and GITMANIFEST keys, and then allow using
encryption=pubkey, it doesn't currently work, and that would be a
complication that I doubt is worth it.
2024-05-14 13:52:20 -04:00
Joey Hess
8bf6dab615
update 2024-05-13 14:42:25 -04:00
Joey Hess
ddf05c271b
fix cloning from an annex:: remote with exporttree=yes
Updating the remote list needs the config to be written to the git-annex
branch, which was not done for good reasons. While it would be possible
to instead use Remote.List.remoteGen without writing to the branch, I
already have a plan to discard git-annex branch writes made by
git-remote-annex, so the simplest fix is to write the config to the
branch.

Sponsored-by: k0ld on Patreon
2024-05-13 14:35:17 -04:00
Joey Hess
552b000ef1
update 2024-05-13 14:30:18 -04:00
Joey Hess
34eae54ff9
git-remote-annex support exporttree=yes remotes
Put the annex objects in .git/annex/objects/ inside the export remote.
This way, when importing from the remote, they will be filtered out.

Note that, when importtree=yes, content identifiers are used, and this
means that pushing to a remote updates the git-annex branch. Urk.
Will need to try to prevent that later, but I already had a todo about
that for other reasons.

Untested!

Sponsored-By: Brock Spratlen on Patreon
2024-05-13 11:48:00 -04:00
Joey Hess
3f848564ac
refuse to fetch from a remote that has no manifest
Otherwise, it can be confusing to clone from a wrong url, since it fails
to download a manifest and so appears as if the remote exists but is empty.

Sponsored-by: Jack Hill on Patreon
2024-05-13 09:47:21 -04:00
Joey Hess
97b309b56e
extend manifest with keys to be deleted
This will eventually be used to recover from an interrupted fullPush
and drop the old bundle keys it was unable to delete.

Sponsored-by: Luke T. Shumaker on Patreon
2024-05-13 09:09:33 -04:00
Joey Hess
dfb09ad1ad
preparing to merge git-remote-annex
Update its todo with remaining items.

Add changelog entry.

Simplified internals document to no longer be notes to myself, but
target users who want to understand how the data is stored
and might want to extract these repos manually.

Sponsored-by: Kevin Mueller on Patreon
2024-05-10 15:06:15 -04:00
Yaroslav Halchenko
9c2ab31549
Fix compatable typo (yet to add to codespell)
=== Do not change lines below ===
{
 "chain": [],
 "cmd": "git-sedi compatable compatible",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [],
 "pwd": "."
}
^^^ Do not change lines above ^^^
2024-05-01 15:46:25 -04:00
Joey Hess
cbaf2172ab
started on a design for P2P protocol over HTTP
Added to git-annex_proxies todo because this is something OpenNeuro
would need in order to use the git-annex proxy.

Sponsored-by: Dartmouth College's OpenNeuro project
2024-05-01 15:26:51 -04:00
Joey Hess
d28adebd6b
number list 2024-05-01 12:19:12 -04:00
Joey Hess
0d0c891ff9
add headers for tocs 2024-05-01 12:18:14 -04:00
Joey Hess
4cd2c980d2
toc 2024-05-01 12:14:59 -04:00
Joey Hess
e7333aa505
fix link 2024-05-01 11:08:57 -04:00
Joey Hess
a612fe7299
add todo linking to two design docs and some related todos
Tagging with projects/openneuro as Christopher Markiewicz has oked
them funding at least the initial design work on this.
2024-05-01 11:04:20 -04:00
Joey Hess
5b36e6b4fb
comments 2024-04-30 16:08:46 -04:00
Joey Hess
f3cca8a9f8
applied patch 2024-04-30 15:17:38 -04:00
Joey Hess
1f37d0b00d
promote comment to todo 2024-04-30 15:13:59 -04:00
Joey Hess
84611e7ee6
todo 2024-04-26 04:03:10 -04:00
Joey Hess
e3c5f0079d
Merge branch 'master' of ssh://git-annex.branchable.com 2024-04-25 17:01:32 -04:00
Joey Hess
d895df1010
update 2024-04-25 17:01:17 -04:00
ErrGe
cafa9af811 2024-04-22 15:37:04 +00:00