Commit graph

4721 commits

Author SHA1 Message Date
Joey Hess
5c36177e58
proxied exporttree=yes remotes are untrustworthy
This is not perfect because it does not handle versioned special
remotes, which should not be untrustworthy, but now are when proxied.

The implementation turned out to be easy, because the exporttree field
is a default field, so is available in RemoteConfig even for git
remotes.
2024-08-08 14:43:53 -04:00
Joey Hess
b23c7f769e
update 2024-08-08 14:25:18 -04:00
Joey Hess
9663888c77
update 2024-08-08 14:05:05 -04:00
Joey Hess
a2eb3b450a
post-receive: use the exporttree=yes remote as a source
This handles cases where a single key is used by multiple files in the
exported tree. When using `git-annex push`, the key's content gets
stored in the annexobjects location, and then when the branch is pushed,
it gets renamed from the annexobjects location to the first exported
file. For subsequent exported files, a copy of the content needs to be
made. This causes it to download the key from the remote in order to
upload another copy to it.

This is not needed when using `git push` followed by `git-annex copy --to`
the proxied remote, because the received key is stored at all export
locations then.

Also, fixed handling of the synced branch push, it was exporting master
when synced/master was pushed.

Note that currently, the first push to the remote does not see that it
is able to get a key from it in order to upload it back. It displays
"(not available)". The second push is able to. Since git-annex push
pushes first the synced branch and then the branch, this does end up
with a full export being made, but it is not quite right.
2024-08-08 13:49:53 -04:00
Joey Hess
7294d23d78
export: Added --from option
This is similar to git-annex copy --from --to, in that it downloads a
local copy, locks it for removal, uploads it, and drops it. Removal of
the temporary local copy is done without verifying numcopies for the
same reason as that command.

I do wonder, looking at this, if there's a race where the local copy
gets used as a copy to allow some other drop in the narrow window after
it is downloaded and before it gets locked for removal. That would need
some other repository to have an out of date location log that says the
repository contains a copy of the key, in order for it to try to use it
as a copy. If there is such a race, git-annex copy/move would also be
vulnerable to it. It would be better to lock it for removal before
starting to download it! That is possible in v10 repositories, which do
use a separate content lock file.

Note that, when the exported tree contains several files that use the
same key, it will be downloaded repeatedly, once per time needed to
upload it. It would be possible to avoid that extra work, but it would
complicate this since the local copy would need to be preserved, locked
for removal, until the end. Also, that would mean that interrupting the
export would leave possibly a lot of temporarily downloaded keys in the
local repository, while currently it can only leave one.
2024-08-08 12:08:55 -04:00
Joey Hess
01edd186e9
update proxied exporttree=yes remote on receive of sync branch
Since git-annex sync sends the sync branch first, and only displays the
output of the push to the sync branch, this makes git-annex
post-retrieve's output when updating the exported tree be visible when
syncing.

This also makes syncing with a non-bare repository still update the
exported tree, even when the checked out branch is not able to be
updated. The sync branch gets sent regardless.
2024-08-07 13:11:06 -04:00
Joey Hess
55adbb6694
avoid trying to export tree to proxied exporttree=yes remotes
This avoids a lot of ugly messages when syncing with such a remote.
The export tree happens on the proxy side.
2024-08-07 13:00:19 -04:00
Joey Hess
6d96734128
updateproxy, updatecluster check annexobjects=yes
updateproxy, updatecluster: Prevent using an exporttree=yes special remote
that does not have annexobjects=yes, since it will not work.
2024-08-07 12:27:24 -04:00
Joey Hess
8864a9e353
update 2024-08-07 11:49:53 -04:00
Joey Hess
1038567881
proxy stores received keys to known export locations
This handles the workflow where the branch is first pushed to the proxy,
and then files in the exported tree are later are copied to the proxied remote.

Turns out that the way the export log is structured, nothing needs
to be done to finalize the export once the last key is sent to it. Which
is great because that would have been a lot of complication. On
receiving the push, Command.Export runs and calls recordExportBeginning,
does as much as it can to update the export with the files currently
on it, and then calls recordExportUnderway. At that point, the
export.log records the export as "complete", but it's not really. And
that's fine. The same happens when using `git-annex export` when some
files are not available to send. Other repositories that have
access to the special remote can already retrieve files from it. As
the missing files get copied to the exported remote, all that needs
to be done is record each in the export db.

At this point, proxying to exporttree=yes annexobjects=yes special remotes
is fully working. Except for in the case where multiple files in the
tree use the same key, and the files are sent to the proxied remote
before pushing the tree.

It seems that even special remotes without annexobjects=yes will work if
used with the workflow where the git-annex branch is pushed before
copying files. But not with the `git-annex push` workflow.
2024-08-07 09:47:34 -04:00
Joey Hess
ba1cb517c0
update 2024-08-06 14:46:56 -04:00
Joey Hess
3289b1ad02
proxying to exporttree=yes annexobjects=yes basically working
It works when using git-annex sync/push/assist, or when manually sending
all content to the proxied remote before pushing to the proxy remote.
But when the push comes before the content is sent, sending content does
not update the exported tree.
2024-08-06 14:21:23 -04:00
Joey Hess
be5c86c248
refine 2024-08-06 12:15:18 -04:00
Joey Hess
4750ffbd3b
finalized design for proxying to exporttree=yes annexobjects=yes special remotes 2024-08-06 11:45:45 -04:00
Joey Hess
a535eaa176
rename from annexobjects location on export
(When possible, of course it may not be there, or it may get renamed from
there for another exported file first. Or the remote may not support
renames.)

This will avoids redundant uploads.

An example case where this is important: Proxying to a exporttree remote,
a file is uploaded to it but is not yet in an exported tree. When the
exported tree is pushed, the remote needs to be updated by exporting to
it. In this case, the proxy doesn't have a copy of the file, so it would
need to download it from annexobjects before uploading it to the final
location. With this optimisation, it can just rename it.

However: If a key is used twice in an exported tree, it seems a proxy
will need to download and reupload anyway. Unless a copy operation is
added to exporttree remotes..
2024-08-04 12:19:10 -04:00
Joey Hess
a3d96474f2
rename to annexobjects location on unexport
This avoids needing to re-upload the file again to get it to the
annexobjects location, which git-annex sync was doing when it was
preferred content.

If the file is not preferred content, sync will drop it from the
annexobjects location.

If the file has been deleted from the tree, it will remain in the
annexobjects location until an unused/dropunused pass is done.
2024-08-04 11:58:07 -04:00
Joey Hess
6b63449133
update
Decided not to use the annexobjects location for exportTempName.

There doesn't seem to be any actual benefit to doing that, because an
export that renames to exportTempName always renames it back from that
to another location.

Also the annexobjects directory won't actually help with the paired
rename issue.
2024-08-04 11:34:00 -04:00
Joey Hess
ee076b68f5
strong verification on retrieval from annexobjects location
The file in the annexobjects location may have been renamed from a
previously exported file that got deleted in a subsequent export.
Or it may be renamed to annexobjects temporarily before being renamed to
another name (to handle eg pairwise renames).

But, an exported file is not guaranteed to contain the content of the
key that the local repository last exported there. Another tree could
have been exported from elsewhere in the meantime.

So, files in annexobjects do not necessarily have the content of their
key. And so have to be strongly verified when retrieving. The same as
is done when retrieving exported files.
2024-08-04 11:24:21 -04:00
Joey Hess
fe01a1e7e1
design work on annexobjects remotes 2024-08-03 19:51:03 -04:00
Joey Hess
a4a06404d4
sync --content with annexobjects=true exporttree remotes 2024-08-03 11:39:23 -04:00
Joey Hess
9497bf7fdb
update 2024-08-02 18:50:57 -04:00
Joey Hess
9da2860812
Merge branch 'master' into exportreeplus 2024-08-02 18:45:44 -04:00
Joey Hess
c4352adf6a
in unexport, check for annexobjects presence before updating location log
The key may still be in the annexobjects location.
2024-08-02 18:43:10 -04:00
Joey Hess
34c10d082d
status 2024-08-02 14:15:05 -04:00
Joey Hess
83fa76733f
status 2024-08-02 14:10:34 -04:00
Joey Hess
28b29f63dc
initial support for annexobjects=yes
Works but some commands may need changes to support special remotes
configured this way.
2024-08-02 14:07:45 -04:00
Joey Hess
d52fd3cf83
update 2024-07-30 12:17:05 -04:00
Joey Hess
1500a9525d
todo 2024-07-30 11:58:44 -04:00
Joey Hess
1632beaf70
fix negative DATA when 1 node of a cluster has a partial transfer 2024-07-30 11:42:17 -04:00
Joey Hess
fcc052bed8
When proxying an upload to a special remote, verify the hash.
While usually uploading to a special remote does not verify the content,
the content in a repository is assumed to be valid, and there is no trust
boundary. But with a proxied special remote, there may be users who are
allowed to store objects, but are not really trusted.

Another way to look at this is it's the equivilant of git-annex-shell
checking the hash of received data, which it does (see StoreContent
implementation).
2024-07-29 13:40:51 -04:00
Joey Hess
6f20085a60
update 2024-07-29 11:25:07 -04:00
Joey Hess
4f3ae96666
cleanly close proxy connection on interrupted PUT
An interrupted PUT to cluster that has a node that is a special remote
over http left open the connection to the cluster, so the next request
opens another one. So did an interrupted PUT directly to the proxied
special remote over http.

proxySpecialRemote was stuck waiting for all the DATA. Its connection
remained open so it kept waiting.

In servePut, checktooshort handles closing the P2P connection
when too short a data is received from PUT. But, checktooshort was only
called after the protoaction, which is what runs the proxy, which is
what was getting stuck. Modified it to run as a background thread,
which waits for the tooshortv to be written to, which gather always does
once it gets to the end of the data received from the http client.

That makes proxyConnection's releaseconn run once all data is received
from the http client. Made it close the connection handles before
waiting on the asyncworker thread. This lets proxySpecialRemote finish
processing any data from the handle, and then it will give up,
more or less cleanly, if it didn't receive enough data.

I say "more or less cleanly" because with both sides of the P2P
connection taken down, some protocol unhappyness results. Which can lead
to some ugly debug messages. But also can cause the asyncworker thread
to throw an exception. So made withP2PConnections not crash when it
receives an exception from releaseconn.

This did have a small change to the behavior of an interrupted PUT when
proxying to a regular remote. proxyConnection has a protoerrorhandler
that closes the proxy connection on a protocol error. But the proxy
connection is also closed by checktooshort when it closes the P2P
connection. Closing the same proxy connection twice is not a problem,
it just results in duplicated debug messages about it.
2024-07-29 10:37:19 -04:00
Joey Hess
c8e7231f48
add debugging of opening and closing connections to proxies 2024-07-29 09:52:26 -04:00
Joey Hess
7ac8d36f38
idea 2024-07-29 09:11:27 -04:00
Joey Hess
bc9cc79e85
set remote's annexUrl automatically
When the remote repository's git config file
has annex.url set to an annex+http url.
2024-07-28 20:13:41 -04:00
Joey Hess
c87cfe1e00
todo 2024-07-28 17:29:32 -04:00
Joey Hess
ccbdaf0448
documentation for p2phttp 2024-07-28 17:19:27 -04:00
Joey Hess
dfe65b92c8
avoid repeatedly parsing the proxy log 2024-07-28 16:04:20 -04:00
Joey Hess
2fdec6b4e1
update 2024-07-28 15:55:24 -04:00
Joey Hess
ddabc138ec
todo 2024-07-28 15:41:31 -04:00
Joey Hess
cdc4bd7443
fix hang in PUT of large file to a special remote node of a cluster over http 2024-07-28 15:34:59 -04:00
Joey Hess
66679c9bb4
remove temp file after upload to special remote 2024-07-28 14:36:45 -04:00
Joey Hess
ccd102cd19
update 2024-07-28 14:22:44 -04:00
Joey Hess
5e205f215d
clean shut down of cluster connection when PUT is interrupted
An interrupted `git-annex copy --to` a cluster via the http server,
when repeated, failed. The http server output "transfer already in
progress, or unable to take transfer lock". Apparently a second
connection was opened to the cluster, because the first connection
never got shut down.

Turned out the problem was that when proxying to a cluster, it would read a
short ByteString from the client, and send that to the nodes. But that left the
nodes warning more. Meanwhile, the proxy was expecting a SUCCESS/FAILURE
message from the nodes. So it didn't return, and so the cluster connection
stayed open.
2024-07-28 14:20:11 -04:00
Joey Hess
bdde6d829c
fix http proxying for a local git remote with a relative path
git-annex-shell expects an absolute path
2024-07-28 13:35:51 -04:00
Joey Hess
41667ad36b
found some bugs with clusters 2024-07-28 13:00:05 -04:00
Joey Hess
770aac97a7
share single BranchState amoung all threads
This fixes a problem when git-annex testremote is run against a cluster
accessed via the http server. Annex.Cluster uses the location log
to find nodes that contain a key when checking if the key is present or getting
it. Just after a key was stored to a cluster node, reading the location log
was not getting the UUID of that node.

Apparently the Annex action that wrote to the location log, and the one
that read from it were run with two different Annex states. The http server
does use several different Annex threads.

BranchState was part of the AnnexState, and so two threads could have
different BranchStates.

Moved BranchState to the AnnexRead, so all threads will see the common state.

This might possibly impact performance. If one thread is writing changes to the
branch, and another thread is reading from the branch, the writing thread will
now invalidate the BranchState's cache, which will cause the reading thread to
need to do extra work. But correctness is surely more important. If did is
found to have impacted performance, it could probably be dealt with by doing
smarter BranchState cache invalidation.

Another way this might impact performance is that the BranchState has a small
cache. If several threads were reading from the branch and relying on the value
they just read still being in the case, now a cache miss will be more likely.
Increasing the BranchState cache to the number of jobs might be a good
idea to amelorate that. But the cache is currently an innefficient list,
so making it large would need changes to the data types.

(Commit 4304f1b6ae dealt with a follow-on
effect of the bug fixed here.)
2024-07-28 12:30:27 -04:00
Joey Hess
fbbedae497
add --clusterjobs option and default to 1
The default of 1 is not ideal at all, but it avoids an accidental M*N
causing so much concurrency it becomes unusable.
2024-07-28 10:36:22 -04:00
Joey Hess
1259ad89b6
cluster support in http API server
Wired it up and it seems to basically work, although the test suite is
not fully passing.

Note that --jobs currently gets multiplied by the number of nodes in the
cluster, which is probably not good.
2024-07-28 10:17:29 -04:00
Joey Hess
0cdd418407
tested shutdown of connection to http proxied special remote
I had worried it might not work properly, but it does, the endv works.
2024-07-28 09:17:47 -04:00