Commit graph

2979 commits

Author SHA1 Message Date
Joey Hess
74c6175795
fix serveGet early handle close
Needed that waitv after all..
2024-07-11 09:55:17 -04:00
Joey Hess
1e0f92a5a1
implemented serveGet and clientGet
Both are only at a bare proof of concept stage. Still need to deal with
signaling validity and invalidity, and checking it.

And there's a bad bug: After -JN*2 requests, another request hangs!

So, I think it's failing to free up the Annex worker at the end of the
request lifetime.

Perhaps I need to use this:

https://docs.servant.dev/en/stable/cookbook/managed-resource/ManagedResource.html
2024-07-10 16:06:39 -04:00
Joey Hess
f9b7ce7224
add Annex worker pool to P2PHttp
This will be needed for get and store, since those need to run Annex
actions.

withLocalP2PConnections will also probably use it.
2024-07-10 12:19:47 -04:00
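A minimal sketch of the worker-pool shape described above, using a TQueue of available workers; the names (WorkerPool, withWorker) are illustrative, not git-annex's actual API:

    import Control.Concurrent.STM
    import Control.Exception (bracket)

    -- A fixed-size pool, represented as a queue of available workers.
    newtype WorkerPool w = WorkerPool (TQueue w)

    newWorkerPool :: [w] -> IO (WorkerPool w)
    newWorkerPool ws = do
        q <- newTQueueIO
        mapM_ (atomically . writeTQueue q) ws
        return (WorkerPool q)

    -- Take a worker, run the action, and always put the worker back,
    -- even when the action throws. Failing to put a worker back is
    -- exactly the kind of leak that makes requests hang once the pool
    -- is exhausted, like the -JN*2 hang in the serveGet commit above.
    withWorker :: WorkerPool w -> (w -> IO a) -> IO a
    withWorker (WorkerPool q) = bracket
        (atomically (readTQueue q))
        (atomically . writeTQueue q)
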
Joey Hess
d4b9aea87b
implement gettimestamp 2024-07-10 10:23:10 -04:00
Joey Hess
7c588a5791
implement remove-before
The reason to use removeBeforeRemoteEndTime is twofold.

First, removeBefore sends two protocol commands. Currently, the HTTP
protocol runner only supports sending a single command per invocation.

Secondly, the http server gets a monotonic timestamp from the client. So
translating back to a POSIXTime would be annoying.

The timestamp flow with a proxy will be:

- client gets timestamp, which gets the monotonic timestamp from the
  proxied remote via the proxy. The timestamp is currently not
  proxied when there is a single proxy.
- client calls remove-before
- http server calls removeBeforeRemoteEndTime which sends REMOVE-BEFORE
  to the proxied remote.
2024-07-10 10:03:26 -04:00
Joey Hess
b8a26712c6
implement clientRemove
Tested removal.
2024-07-10 09:20:13 -04:00
Joey Hess
48f76cb3e8
implement serveRemove and send WWW-Authenticate header on auth failure 2024-07-10 09:13:01 -04:00
Joey Hess
97d0fc9b65
git-annex p2phttp options 2024-07-10 00:01:55 -04:00
Joey Hess
08371c3745
started on auth 2024-07-09 17:30:55 -04:00
Joey Hess
3d13521479
set up handles for p2phttp
Now it fully works.. for the first request. But then it gets stuck
waiting for the P2P protocol runner to shut down.
2024-07-09 13:50:42 -04:00
Joey Hess
edf8a3df2d
p2phttp is almost working for checkpresent
The server is fully running annex actions, only the P2PConnection is
wrong, currently using stdio.
2024-07-09 13:37:55 -04:00
Joey Hess
0bdee626ad
thread in a state 2024-07-08 14:00:23 -04:00
Joey Hess
82d66ede5e
convert lockcontent api to http long polling
Websockets would work, but the problem with using them for this is that
each lockcontent call is a separate websocket connection. And that's an
actual TCP connection. One TCP connection per file dropped would be too
expensive. With http long polling, regular http pipelining can be used,
so it will reuse a TCP connection.

Unfortunately, at least with servant, bi-directional streams with long
polling don't result in true bidirectional full duplex communication.
Servant processes the whole client body stream before generating the server
body stream. I think it's entirely possible to do full bi-directional
communication over http, but it would need changes to servant.

And, there's no way for the client to tell if the server successfully
locked the content, since the server will keep processing the client
stream no matter what.

So, added a new api endpoint, keeplocked. lockcontent will lock the key
for 10 minutes with retention lock, and then a call to keeplocked will
keep it locked for as long as needed. This does mean that there will
need to be a Map of locks by key, and I will probably want to add
some kind of lock identifier that lockcontent returns.
2024-07-08 12:57:46 -04:00
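A rough sketch of the proposed lock table, assuming illustrative types (Key, LockId) rather than git-annex's own; keeplocked would signal the TMVar when the client is done:

    import Control.Concurrent.STM
    import qualified Data.Map.Strict as M

    newtype Key = Key String deriving (Eq, Ord)
    newtype LockId = LockId Integer deriving (Eq, Ord)

    -- Locks held on behalf of HTTP clients, so that a later keeplocked
    -- call can find and extend the lock that lockcontent took.
    type LockTable = TVar (M.Map LockId (Key, TMVar ()))

    registerLock :: LockTable -> TVar Integer -> Key -> STM LockId
    registerLock tbl counter k = do
        n <- readTVar counter
        writeTVar counter (n + 1)
        v <- newEmptyTMVar
        modifyTVar' tbl (M.insert (LockId n) (k, v))
        return (LockId n)

    -- Signalled by keeplocked when the client no longer needs the lock.
    releaseLock :: LockTable -> LockId -> STM ()
    releaseLock tbl i = do
        m <- readTVar tbl
        case M.lookup i m of
            Just (_, v) -> putTMVar v ()
            Nothing -> return ()
        writeTVar tbl (M.delete i m)
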
Joey Hess
522700d1c4
implemented servant-client support for websockets 2024-07-08 07:44:59 -04:00
Joey Hess
1dbb5ec70d
servant API type is complete 2024-07-07 12:59:12 -04:00
Joey Hess
86ce3bf1e4
started servant implementation of HTTP P2P protocol 2024-07-07 12:08:10 -04:00
Joey Hess
f452bd448a
REMOVE-BEFORE and GETTIMESTAMP proxying
For clusters, the timestamps have to be translated, since each node can
have its own idea about what time it is. To translate a timestamp, the
proxy remembers what time it asked the node for a timestamp in
GETTIMESTAMP, and applies the delta as an offset in REMOVE-BEFORE.

This does mean that a remove from a cluster has to call GETTIMESTAMP on
every node before dropping from nodes. Not very efficient. Although
currently it tries to drop from every single node anyway, which is also
not very efficient.

I thought about caching the GETTIMESTAMP from the nodes on the first
call. That would improve efficiency. But, since monotonic clocks on
non-Linux systems don't advance when the computer is suspended, consider what might
happen if one node was suspended for a while, then came back. Its
monotonic timestamp would end up behind where the proxying expects it to
be. Would that result in removing when it shouldn't, or refusing to
remove when it should? Have not thought it through. Either way, a
cluster behaving strangely for an extended period of time because one
of its nodes was briefly asleep doesn't seem like good behavior.
2024-07-04 15:09:34 -04:00
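The translation described above amounts to remembering, per node, the offset between the proxy's clock and the node's clock at GETTIMESTAMP time, and adding it back in at REMOVE-BEFORE time. A simple sketch, with illustrative types:

    -- Seconds on some monotonic clock.
    newtype MonotonicTimestamp = MonotonicTimestamp Integer

    -- The difference between a node's clock and the proxy's clock,
    -- observed when the proxy ran GETTIMESTAMP on the node.
    newtype TimestampDelta = TimestampDelta Integer

    deltaFor :: MonotonicTimestamp -> MonotonicTimestamp -> TimestampDelta
    deltaFor (MonotonicTimestamp proxytime) (MonotonicTimestamp nodetime) =
        TimestampDelta (nodetime - proxytime)

    -- When relaying REMOVE-BEFORE, shift the client's timestamp onto
    -- the node's clock by applying the remembered delta.
    toNodeTime :: TimestampDelta -> MonotonicTimestamp -> MonotonicTimestamp
    toNodeTime (TimestampDelta d) (MonotonicTimestamp t) =
        MonotonicTimestamp (t + d)
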
Joey Hess
1243af4a18
toward SafeDropProof expiry checking
Added Maybe POSIXTime to SafeDropProof, which gets set when the proof is
based on a LockedCopy. If there are several LockedCopies, it uses the
closest expiry time. That is not optimal; it may be that the proof
expires based on one LockedCopy while another has not expired. But
that seems unlikely to really happen, and anyway the user can just
re-run a drop if it fails due to expiry.

Pass the SafeDropProof to removeKey, which is responsible for checking
it for expiry in situations where that could be a problem. Which really
only means in Remote.Git.

Made Remote.Git check expiry when dropping from a local remote.

Checking expiry when dropping from a P2P remote is not yet implemented.
P2P.Protocol.remove has SafeDropProof plumbed through to it for that
purpose.

Fixing the remaining 2 build warnings should complete this work.

Note that the use of a POSIXTime here means that if the clock gets set
forward while git-annex is in the middle of a drop, it may say that
dropping took too long. That seems ok. Less ok is that if the clock gets
turned back a sufficient amount (eg 5 minutes), proof expiry won't be
noticed. It might be better to use the Monotonic clock, but that doesn't
advance when a laptop is suspended, and while there is the linux
Boottime clock, that is not available on other systems. Perhaps a
combination of POSIXTime and the Monotonic clock could detect laptop
suspension and also detect clock being turned back?

There is a potential future flag day where
p2pDefaultLockContentRetentionDuration is not assumed, but is probed
using the P2P protocol, and peers that don't support it can no longer
produce a LockedCopy. Until that happens, when git-annex is
communicating with older peers there is a risk of data loss when
an ssh connection closes during LOCKCONTENT.
2024-07-04 12:39:06 -04:00
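A sketch of the expiry bookkeeping described above, with LockedCopy reduced to just an expiry time for illustration:

    import Data.Time.Clock.POSIX (POSIXTime, getPOSIXTime)

    newtype LockedCopy = LockedCopy { lockExpiry :: POSIXTime }

    -- Use the closest expiry among the locked copies backing the proof.
    proofExpiry :: [LockedCopy] -> Maybe POSIXTime
    proofExpiry [] = Nothing
    proofExpiry ls = Just (minimum (map lockExpiry ls))

    -- Before acting on a proof, check it has not expired. As noted
    -- above, this goes wrong when the clock is turned back far enough.
    checkProofNotExpired :: Maybe POSIXTime -> IO Bool
    checkProofNotExpired Nothing = return True
    checkProofNotExpired (Just expiry) = do
        now <- getPOSIXTime
        return (now < expiry)
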
Joey Hess
d2b27ca136
add content retention files
This allows lockContentShared to lock content for eg, 10 minutes and
if the process then gets terminated before it can unlock, the content
will remain locked for that amount of time.

The Windows implementation is not yet tested.

In P2P.Annex, a duration of 10 minutes is used. This way, when p2pstdio
or remotedaemon is serving the P2P protocol, and is asked to
LOCKCONTENT, and that process gets killed, the content will not be
subject to deletion. This is not a perfect solution to
doc/todo/P2P_locking_connection_drop_safety.mdwn yet, but it gets most
of the way there, without needing any P2P protocol changes.

This is only done in v10 and higher repositories (or on Windows). It
might be possible to backport it to v8 or earlier, but it would
complicate locking even further, and without a separate lock file, might
be hard. I think that by the time this fix reaches a given user, they
will probably have been running git-annex 10.x long enough that their v8
repositories will have upgraded to v10 after the 1 year wait. And it's
not as if git-annex hasn't already been subject to this problem (though
I have not heard of any data loss caused by it) for 6 years already, so
waiting another fraction of a year on top of however long it takes this
fix to reach users is unlikely to be a problem.
2024-07-03 14:58:39 -04:00
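The essential idea, sketched here with a plain timestamp file (the actual retention lock file format and locking details differ):

    import Data.Time.Clock.POSIX (getPOSIXTime)

    -- Record a time until which the content must be retained, so that
    -- if the locking process is killed, the content still will not be
    -- deleted before that time.
    writeRetentionTimestamp :: FilePath -> Integer -> IO ()
    writeRetentionTimestamp f seconds = do
        now <- getPOSIXTime
        writeFile f (show (truncate now + seconds :: Integer))

    -- Treat the lock as still held until the recorded time has passed.
    retentionStillHeld :: FilePath -> IO Bool
    retentionStillHeld f = do
        t <- read <$> readFile f
        now <- getPOSIXTime
        return (truncate now <= (t :: Integer))
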
Joey Hess
8fa4e25c1e
Merge branch 'master' into proxy-specialremotes 2024-07-01 11:23:21 -04:00
Joey Hess
8b5fc94d50
add optional object file location to storeKey
This will be used by the next commit to simplify the proxy.
2024-07-01 10:42:27 -04:00
Joey Hess
0dfdc9f951
dup stdio handles for P2P proxy
Special remotes might output to stdout, or read from stdin, which would
mess up the P2P protocol. So dup the handles to avoid any such problem.
2024-07-01 10:06:29 -04:00
Joey Hess
0033e6c0a6
Tab completion of many commands like info and trust now includes remotes
Especially useful with proxied remotes and clusters, where the user may not
be entirely familiar with the name and can learn by tab completion.
2024-06-30 12:39:18 -04:00
Joey Hess
62750f0102
shut down RemoteSides cleanly
Before, it just exited without actually shutting down the RemoteSides
when the client hung up.
2024-06-28 13:19:57 -04:00
Joey Hess
c3a785204e
support a P2PConnection that uses TMVars rather than Handles
This will allow having an internal thread speaking P2P protocol,
which will be needed to support proxying to external special remotes.

No serialization is done on the internal P2P protocol of course.

When a ByteString is being exchanged, it may or may not be exactly
the length indicated by DATA. While that has to be carefully managed
for the serialized P2P protocol, here it would require buffering the
whole lazy bytestring in memory to check its length when sending,
so it's better to do length checks on the receiving side.
2024-06-28 11:22:29 -04:00
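A minimal sketch of the TMVar-backed connection, with a stand-in Message type; the real P2PConnection and protocol types are richer:

    import Control.Concurrent.STM

    -- Stand-in for the P2P protocol's message type.
    data Message = SUCCESS | FAILURE | OTHER String

    -- An in-process connection: one TMVar per direction, carrying
    -- unserialized protocol messages between two threads.
    data TMVarConnection = TMVarConnection
        { toRemote :: TMVar Message
        , fromRemote :: TMVar Message
        }

    newTMVarConnection :: IO TMVarConnection
    newTMVarConnection = TMVarConnection
        <$> newEmptyTMVarIO
        <*> newEmptyTMVarIO

    send :: TMVarConnection -> Message -> IO ()
    send c = atomically . putTMVar (toRemote c)

    receive :: TMVarConnection -> IO Message
    receive = atomically . takeTMVar . fromRemote
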
Joey Hess
41a0817188
make extendcluster also updatecluster
This avoids the user forgetting to do it and simplifies the
documentation.
2024-06-27 15:34:45 -04:00
Joey Hess
cf59d7f92c
GET and CHECKPRESENT among lowest cost cluster nodes
Before it was using a node that might have had a higher cost.

Also threw in a random selection from among the low cost nodes. Of
course this is a poor excuse for load balancing, but it's better than
nothing. Most of the time...
2024-06-27 14:36:55 -04:00
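The selection described above, as a small sketch (Node and its cost field are illustrative):

    import System.Random (randomRIO)

    data Node = Node { nodeName :: String, nodeCost :: Int }

    -- Narrow to the nodes tied for lowest cost, then pick one at
    -- random. A poor excuse for load balancing, but better than
    -- always hitting the same node.
    pickLowestCost :: [Node] -> IO (Maybe Node)
    pickLowestCost [] = return Nothing
    pickLowestCost ns = do
        let lowest = minimum (map nodeCost ns)
        let candidates = filter ((== lowest) . nodeCost) ns
        i <- randomRIO (0, length candidates - 1)
        return (Just (candidates !! i))
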
Joey Hess
3dad9446ce
distributed cluster cycle prevention
Added BYPASS to P2P protocol, and use it to avoid cycling between
cluster gateways.

Distributed clusters are working well now!
2024-06-27 12:20:22 -04:00
Joey Hess
923953c9fe
fix cycle prevention code 2024-06-26 13:21:51 -04:00
Joey Hess
07e899c9d3
git-annex-shell: proxy nodes located beyond remote cluster gateways
Walking a tightrope between security and convenience here, because
git-annex-shell needs to only proxy for things when there has been
an explicit, local action to configure them.

In this case, the user has to have run `git-annex extendcluster`,
which now sets annex-cluster-gateway on the remote.

Note that any repositories that the gateway is recorded to
proxy for will be proxied onward. This is not limited to cluster nodes,
because checking the node log would not add any security; someone could
add any uuid to it. The gateway of course then does its own
checking to determine if it will allow proxying for the remote.
2024-06-26 12:56:16 -04:00
Joey Hess
1ec2fecf3f
set up proxies for cluster nodes that are themselves proxied via a remote
When there are multiple gateways to a cluster, this sets up proxying
for nodes that are accessed via a remote gateway.

Eg, when running in nyc and amsterdam is the remote gateway,
and it has node1 and node2, this sets up proxying for
amsterdam-node1 and amsterdam-node2. A client that has nyc as a remote
will see proxied remotes nyc-amsterdam-node1 and nyc-amsterdam-node2.
2024-06-26 11:24:55 -04:00
Joey Hess
02bf3ddc3f
updatecluster: support multiple gateways
Just look at the existing proxied remotes that correspond to already
existing nodes of the cluster, and keep those nodes in the cluster,
while adding any remotes of the local repo that are configured as
cluster nodes. This allows removing cluster nodes from the local repo
and updating, without it also removing nodes provided by other gateways.
2024-06-26 10:51:14 -04:00
Joey Hess
0b72b85df5
added git-annex extendcluster
This works, but updatecluster does not work yet in multi-gateway
clusters, nor do gateways relay to other gateways.
2024-06-26 10:26:54 -04:00
Joey Hess
cec2848e8a
support annex.jobs for clusters 2024-06-25 14:54:20 -04:00
Joey Hess
b8016eeb65
add annex-proxied
This makes git-annex sync and similar not treat proxied remotes as git
syncable remotes.

Also, display in git-annex info remote when the remote is proxied.
2024-06-24 10:16:59 -04:00
Joey Hess
bf6b309917
remove attempt to avoid git syncing with instantiated proxied remotes
It didn't work. Actually, sync was skipping those remotes due to a bug.
2024-06-24 09:35:24 -04:00
Joey Hess
d0aec8f623
always check numcopies when moving from cluster
When the destination does not start with a copy, the cluster has one or
more copies. If more, dropping would reduce the number of copies, so
numcopies must be checked.

Considered checking how many nodes of the cluster contain a copy. If
only 1 node does, it could allow a move without checking numcopies.
The problem with that, though, is that other nodes of the cluster could
have copies that we don't know about. And dropping from a cluster tries
to drop from all nodes, so will drop even from those. So any drop from a
cluster can remove more than 1 copy.
2024-06-23 12:00:50 -04:00
Joey Hess
5b332a87be
dropping from clusters
Dropping from a cluster drops from every node of the cluster.
Including nodes that the cluster does not think have the content.
This is different from GET and CHECKPRESENT, which do trust the
cluster's location log. The difference is that removing from a cluster
should make 100% sure the content is gone from every node. So doing extra
work is ok. Compare with CHECKPRESENT where checking every node could
make it very expensive, and the worst that can happen in a false
negative is extra work being done.

Extended the P2P protocol with FAILURE-PLUS to handle the case where a
drop from one node succeeds, but a drop from another node fails. In that
case the entire cluster drop has failed.

Note that SUCCESS-PLUS is returned when dropping from a proxied remote
that is not a cluster, when the protocol version supports it. This is
because P2P.Proxy does not know when it's proxying for a single node
cluster vs for a remote that is not a cluster.
2024-06-23 09:43:40 -04:00
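A sketch of how per-node drop results might fold into a single reply, assuming FAILURE-PLUS carries the UUIDs that did succeed, so location tracking can still be updated:

    newtype UUID = UUID String

    data RemoveResult
        = SuccessPlus [UUID]  -- removed everywhere; from these nodes
        | FailurePlus [UUID]  -- some node failed; these still succeeded

    -- If any node fails, the whole cluster drop has failed, but the
    -- client still learns which nodes did remove their copies.
    combineDrops :: [(UUID, Bool)] -> RemoveResult
    combineDrops results
        | all snd results = SuccessPlus removed
        | otherwise = FailurePlus removed
      where
        removed = [u | (u, ok) <- results, ok]
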
Joey Hess
f18740699e
P2P protocol version 2, adding SUCCESS-PLUS and ALREADY-HAVE-PLUS
Client side support for SUCCESS-PLUS and ALREADY-HAVE-PLUS
is complete; when a PUT stores to additional repositories
beyond the expected one, the location log is updated with the
additional UUIDs that contain the content.

Started implementing PUT fanout to multiple remotes for clusters.
It is untested, and I fear fencepost errors in the relative
offset calculations. And it is missing proxying for the protocol
after DATA.
2024-06-18 16:21:40 -04:00
Joey Hess
d34326ab76
factor out Annex.Proxy 2024-06-18 10:51:37 -04:00
Joey Hess
f0d6114286
refactor cluster code into own module 2024-06-18 10:36:04 -04:00
Joey Hess
ef26470810
ProxySelector data type 2024-06-17 19:19:15 -04:00
Joey Hess
7a839a983a
preparing for cluster node selection
Support selecting what remote to proxy for each top-level P2P protocol
message.

This only needs to be extended now to support fanout to multiple
nodes for PUT and REMOVE, and to handle a remote that fails, for
LOCKCONTENT and UNLOCKCONTENT.

But a good first step would be to implement CHECKPRESENT and GET for
clusters. Both should select a node that actually does have the content.
That will allow a cluster to work for GET even when location tracking is
out of date.
2024-06-17 15:51:10 -04:00
Joey Hess
291280ced2
started on git-annex-shell cluster support
Works down to P2P protocol.

The question now is, how to handle protocol version negotiation for
clusters? Connecting to each node to find their protocol versions and
using the lowest would be too expensive with a lot of nodes. So it seems
that the cluster needs to pick its own protocol version to use with the
client.

Then it can either negotiate that same version with the nodes when
it comes time to use them, or it can translate between multiple protocol
versions. That seems complicated. Thinking it would be ok to refuse to
use a node if it is not able to negotiate the same protocol version with
it as with the client. That will mean that sometimes nodes need to be
upgraded when upgrading the cluster's proxy. But protocol versions
rarely change.
2024-06-17 15:10:04 -04:00
Joey Hess
c7ad44e4d1
work toward supporting proxying to multiple remotes at once
For eg, upload fanout.

Delay connecting to a remote until it's needed. When there are many
proxied remotes, it would not do for the proxy to connect to each of
them on startup; that could take a long time.
2024-06-17 14:16:44 -04:00
Joey Hess
b72ccc6f0c
improve types 2024-06-17 12:44:08 -04:00
Joey Hess
64afbb0b93
don't count clusters as copies, continued
Handled limitCopies, as well as everything using fromNumCopies and
fromMinCopies.

This should be everything, probably.

Note that git-annex info displays a count of repositories, which still
includes clusters. I think that's ok. It would be possible to filter out
clusters there, but to the user they're pretty much just another
repository. The numcopies displayed by eg `git-annex info .` does not
include clusters.
2024-06-16 15:14:53 -04:00
Joey Hess
780367200b
remove dead nodes when loading the cluster log
This is to avoid inserting a cluster uuid into the location log when
only dead nodes in the cluster contain the content of a key.

One reason why this is necessary is Remote.keyLocations, which excludes
dead repositories from the list. But there are probably many more.

Implementing this was challenging, because Logs.Location importing
Logs.Cluster which imports Logs.Trust which imports Remote.List resulted
in an import cycle through several other modules.

Resorted to making Logs.Location not import Logs.Cluster, and instead
it assumes that Annex.clusters gets populated when necessary before it's
called.

That's done in Annex.Startup, which is run by the git-annex command
(but not other commands) at early startup in initialized repos, or
after initialization.

Note that Remote.Git is unable to import Annex.Startup, because
Remote.Git importing Logs.Cluster leads to the same import cycle.
So ensureInitialized is not passed annexStartup there.

Other commands, like git-annex-shell currently don't run annexStartup
either.

So there are cases where Logs.Location will not see clusters. So it won't add
any cluster UUIDs when loading the log. That's ok, the only reason to do
that is to make display of where objects are located include clusters,
and to make commands like git-annex get --from treat keys as being located
in a cluster. git-annex-shell certainly does not do anything like that,
and I'm pretty sure Remote.Git (and callers to Remote.Git.onLocalRepo)
don't either.
2024-06-16 14:39:44 -04:00
Joey Hess
36c6d8da69
don't count clusters as copies
Since the cluster UUID is inserted into the location log when the
location log lists a node as containing content.

Also avoid trying to lock content on cluster remotes. The cluster nodes
are also proxied, so that content can be locked on individual nodes, and
locking content on a cluster as a whole probably won't be implemented.

And made git-annex whereis use numcopies machinery for displaying its
count, so it won't count cluster UUIDs redundantly to nodes.
Other commands, like git-annex info that also display numcopies
information already used the numcopies machinery.

There is more to be done, fromNumCopies is sometimes used to get a
number that is compared with a list of UUIDs. And limitCopies doesn't
use numcopies machinery.
2024-06-16 14:17:56 -04:00
Joey Hess
a4c9d4424c
remove Logs.Presence imports
When imported along with Logs.Location, it can be an unused import and
it won't warn, due to reexports. The point of this is really to show
that Logs.Presence is not widely used outside Logs/.
2024-06-14 17:27:34 -04:00
Joey Hess
570ceffe8d
broke out initcluster
One benefit of this is that a typo in annex-cluster-node config won't
init a new cluster.

Also it gets the cluster description set and is consistent with
initremote.
2024-06-14 17:23:11 -04:00
Joey Hess
bfe7f488d9
forgot to add 2024-06-14 16:37:17 -04:00
Joey Hess
2028ad02b8
add clusters to proxy log
Note that it's not defined what will happen if a cluster has the same
name as a remote that has proxying enabled.
2024-06-14 15:03:42 -04:00
Joey Hess
bbf261487d
add git-annex updatecluster command
Seems to work fine, making the right changes to the git-annex branch.
2024-06-14 15:02:01 -04:00
Joey Hess
46a1fcb3ea
avoid git syncing with instantiated proxied remotes
These remotes have no url configured, so git pull and push will fail.
git-annex sync --content etc can still sync with them otherwise.

Also, avoid git syncing twice with the same url. This is for cases where
a proxied remote has been manually configured and so does have a url.
Or perhaps proxied remotes will get configured like that automatically
later.
2024-06-12 15:10:03 -04:00
Joey Hess
dfdda95053
proxy updates location tracking information
This does mean a redundant write to the git-annex branch. But,
it means that two clients can be using the same proxy, and after
one sends a file to a proxied remote, the other only has to pull from
the proxy to learn about that. It does not need to pull from every
remote behind the proxy (which it couldn't do anyway as git repo
access is not currently proxied).

Anyway, the overhead of this in git-annex branch writes is no worse
than eg, sending a file to a repository where git-annex assistant
is running, which then sends the file on to a remote, and updates
the git-annex branch then. Indeed, when the assistant also drops
the local copy, that results in more writes to the git-annex branch.
2024-06-12 11:37:14 -04:00
Joey Hess
c6e0710281
proxying to local git remotes works
This just happened to work correctly. Rather surprisingly. It turns out
that openP2PSshConnection actually also supports local git remotes,
by just running git-annex-shell with the path to the remote.

Renamed "P2PSsh" to "P2PShell" to make this clear.
2024-06-12 10:10:11 -04:00
Joey Hess
58d8ba5a4f
implement simple proxy actions (untested)
Still need to implement GET and PUT, and will implement CONNECT and
NOTIFYCHANGE for completeness.

All ServerMode checking is implemented for the proxy.

There are two possible approaches for how the proxy sends back messages
from the remote to the client. One would be to have a background thread
that reads messages and sends them back as they come in. The other,
which is being implemented so far, is to read messages from the remote
at points where it is expected to send them, and relay back to the
client before reading the next message from the client. At this point,
I'm unsure which approach would be better.

The need for proxynoresponse to be used by UNLOCKCONTENT, for example,
builds protocol knowledge into the proxy which it would not need with
the other method.
2024-06-11 12:56:20 -04:00
Joey Hess
92c83a417f
refactoring 2024-06-11 10:22:05 -04:00
Joey Hess
501d65eeab
started implementing git-annex-shell proxy
So far, it negotiates VERSION with both parties. This is a tricky dance.

Untested.
2024-06-10 18:01:36 -04:00
Joey Hess
649b87bedd
Merge branch 'master' into proxy 2024-06-10 14:26:18 -04:00
Joey Hess
9a8391078a
git-annex-shell: block relay requests
connRepo is only used when relaying git upload-pack and receive-pack.
That's only supposed to be used when git-annex-remotedaemon is serving
git-remote-tor-annex connections over tor. But, it was always set, and
so could possibly be used in other places.

Fixed by making connRepo optional in the P2P protocol interface.

In Command.EnableTor, it's not needed, because it only speaks the
protocol in order to check that it's able to connect back to itself via
the hidden service. So changed that to pass Nothing rather than the git
repo.

In Remote.Helper.Ssh, it's connecting to git-annex-shell p2pstdio,
so is making the requests, so will never need connRepo.

In git-annex-shell p2pstdio, it was accepting git upload-pack and
receive-pack requests over the P2P protocol, even though nothing sent
them. This is arguably a security hole, particularly if the user has
set environment variables like GIT_ANNEX_SHELL_LIMITED to prevent
git push/pull via git-annex-shell.
2024-06-10 14:16:27 -04:00
Joey Hess
f97f4b8bdb
Added updateproxy command and remote.name.annex-proxy configuration
So far this only records proxy information on the git-annex branch.
2024-06-04 14:52:03 -04:00
Joey Hess
98762a2f96
group: Added --list option
Seemed to make sense to exclude groups used only by dead repositories.
2024-05-29 13:37:35 -04:00
Joey Hess
19418e81ee
git-remote-annex: Display full url when using remote with the shorthand url 2024-05-24 17:15:31 -04:00
Joey Hess
58301e40d2
sync with special remotes with an annex:: url
Check explicitly for an annex:: url, not just any url. While no built-in
special remotes set an url, except ones that can be synced with, it
seems possible that some external special remote sets an url for its own
use, but did not expect it to be used by git-annex sync et al.

The assistant also syncs with them.
2024-05-24 14:57:29 -04:00
Joey Hess
22bf23782f
initremote, enableremote: Added --with-url to enable using git-remote-annex
Also sets remote.name.fetch to a typical value, same as git remote add does.
2024-05-24 14:29:36 -04:00
Joey Hess
434a88c368
Merge branch 'git-remote-annex' 2024-05-15 17:57:50 -04:00
Joey Hess
768cdee461
testremote: Really fsck downloaded objects
8844372c23 exposed a bug in testremote: it
was passing the serialized key, not the object file, to be checksummed.
2024-05-15 17:57:27 -04:00
Joey Hess
468de43d66
Merge branch 'master' into git-remote-annex 2024-05-15 17:49:12 -04:00
Joey Hess
24af51e66d
git-annex unused --from remote skips its git-remote-annex keys
This turns out to only be necessary in edge cases. Most of the
time, git-annex unused --from remote doesn't see git-remote-annex keys
at all, because it does not record a location log for them.

On the other hand, git-annex unused does find them, since it does not
rely on the location log. And that's good because they're a local cache
that the user should be able to drop.

If, however, the user ran git-annex unused and then git-annex move
--unused --to remote, the keys would have a location log for that
remote. Then git-annex unused --from remote would see them, and would
consider them unused. Even when they are present on the special remote
they belong to. And that risks losing data if they drop the keys from
the special remote, but didn't expect it would delete git branches they
had pushed to it.

So, make git-annex unused --from skip git-remote-annex keys whose uuid
is the same as the remote.
2024-05-14 15:17:40 -04:00
Joey Hess
0281f7f23e
Avoid the --fast option preventing checksumming in some cases where it was not supposed to
fsck --fast was intended to disable checksumming, but checksumming is done
after transfers too. Due to the check being in the non-incremental path,
it would only affect non-incremental checksumming during a transfer,
and I'm not 100% sure that it was a problem.

Also, when using an external backend that does checksumming, fsck --fast
didn't disable it and now does.
2024-05-12 21:36:48 -04:00
Joey Hess
05684bdd6c
fsck: Fix recent reversion that made it say it was checksumming files whose content is not present
Did not track down the commit that caused the problem, but git-annex
version 10.20240431 didn't behave that way.
2024-05-12 21:23:27 -04:00
Joey Hess
ff5193c6ad
Merge branch 'master' into git-remote-annex 2024-05-10 14:20:36 -04:00
Joey Hess
483887591d
working toward git-remote-annex using a special remote
Not quite there yet.

Also, changed the format of GITBUNDLE keys to use only one '-'
after the UUID. A sha256 does not contain that character, so can just
split at the last one.

Amusingly, the sha256 will probably not actually be verified. A git
bundle contains its own checksums that git uses to verify it. And if
someone wanted to replace the content of a GITBUNDLE object, they
could just edit the manifest to use a new one whose sha256 does verify.

Sponsored-by: Nicholas Golder-Manning
2024-05-06 16:28:04 -04:00
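Splitting at the last '-' is simple because a sha256 contains no '-'. A sketch (the surrounding key format details are elided):

    -- Split a GITBUNDLE key name into the UUID part and the sha256
    -- part at the last '-'. Works even though the UUID itself may
    -- contain '-' characters.
    splitGitBundleKey :: String -> Maybe (String, String)
    splitGitBundleKey s = case break (== '-') (reverse s) of
        (rsha, '-':ruuid) -> Just (reverse ruuid, reverse rsha)
        _ -> Nothing
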
Yaroslav Halchenko
87e2ae2014
run codespell throughout fixing typos automagically
=== Do not change lines below ===
{
 "chain": [],
 "cmd": "codespell -w",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [],
 "pwd": "."
}
^^^ Do not change lines above ^^^
2024-05-01 15:46:21 -04:00
Yaroslav Halchenko
d20ecff73c
Fix one ambiguous typo 2024-05-01 15:46:15 -04:00
Joey Hess
2c73845d90
multiple -m second try
Test suite passes this time. When committing the adjusted branch, use
the old method to make a message that old git-annex can consume. Also
made the code accept the new message, so that eventually
commitTreeExactMessage can be removed.

Sponsored-by: Kevin Mueller on Patreon
2024-04-09 12:56:47 -04:00
Joey Hess
a8dd85ea5a
Revert "multiple -m"
This reverts commit cee12f6a2f.

This commit broke git-annex init run in a repo that was cloned from a
repo with an adjusted branch checked out.

The problem is that findAdjustingCommit was not able to identify the
commit that created the adjusted branch. It seems that there is an extra
"\n" at the end of the commit message that it does not expect.

Since backwards compatibility needs to be maintained, cannot just make
findAdjustingCommit accept it with the "\n". Will have to instead
have one commitTree variant that uses the old method, and use it for
adjusted branch committing.
2024-04-02 17:29:07 -04:00
Joey Hess
cee12f6a2f
multiple -m
sync, assist, import: Allow -m option to be specified multiple times, to
provide additional paragraphs for the commit message.

The option parser didn't allow multiple -m before, so there is no risk of
behavior change breaking something that was for some reason using multiple
-m already.

Pass through to git commands, so that the method used to assemble the
paragraphs is whatever git does. Which might conceivably change in the
future.

Note that git commit-tree has supported -m since git 1.7.7. commitTree
was probably not using it since it predates that version. Since the
configure script prevents building git-annex with git older than 2.1,
there is no risk that it's not supported now.

Sponsored-by: Nicholas Golder-Manning on Patreon
2024-03-27 15:58:27 -04:00
Joey Hess
e23721f579
fix build warning
A recent change made plumbing the backend through fsck unnecessary.

Left fsck checking backend and skipping operating on key when it could
not find one. Not checking the backend would be a behavior change.
For example the command git-annex fsck --key FOO--bar does nothing
since FOO is not a known backend. If this were removed it would
instead go on and fsck it and warn that no copies exist of the key.
That behavior change seems like it would be fine, but I also have no
reason to make it.
2024-03-26 14:13:59 -04:00
Joey Hess
9bf4f2eb16
fix reversion in unexport when unable to rename
Bug introduced in commit 7cef5e8f35
2024-03-15 16:14:44 -04:00
Joey Hess
dd4c4bcd7a
fix build warning
A recent change made plumbing the backend through fsck unnecessary.

Left fsck checking backend and skipping operating on key when it could
not find one, although I'm not sure if that's necessary to support eg,
keys with unknown backend.
2024-03-09 13:50:30 -04:00
Joey Hess
7cef5e8f35
export tree: avoid confusing output about renaming files
When a file in the export is renamed, and the remote's renameExport
returned Nothing, renaming to the temp file would first say it was
renaming, and appear to succeed, but actually what it did was delete the
file. Then renaming from the temp file would not do anything, since the
temp file is not present on the remote. This appeared as if a file got
renamed to a temp file and left there.

Note that exporttree=yes importree=yes remotes have their usual
renameExport replaced with one that returns Nothing. (For reasons
explained in Remote.Helper.ExportImport.) So this happened
even with remotes that support renameExport.

Fix by letting renameExport = Nothing when it's not supported at all.
This avoids displaying the rename.

Sponsored-by: Graham Spencer on Patreon
2024-03-09 13:50:26 -04:00
Joey Hess
016d1bee88
add reregisterurl command
What this can currently be used for is only to change an url from being
used by a special remote to being used by the web remote.

This could have been a --move-from option to registerurl. But, that would
have complicated its option and --batch processing, and also would have
complicated unregisterurl, which is implemented on top of
Command.Registerurl. So, a separate command was actually less complicated
to implement.

The generic description of the command is because I want to make this
command a catch-all for other url updating kind of things, if there are
ever any more. Also because it was hard to come up with a good name for the
specific action. I considered `git-annex moveurl`, but that seems to
indicate data is perhaps actually being moved, and seems to sit at the same
level as addurl and rmurl, and this command is at the plumbing
level of registerurl and unregisterurl.

Sponsored-by: Dartmouth College's DANDI project
2024-03-05 15:06:14 -04:00
Joey Hess
e7652b0997
implement URL to VURL migration
This needs the content to be present in order to hash it. But it's not
possible for a module used by Backend.URL to call inAnnex because that
would entail a dependency loop. So instead, rely on the fact that
Command.Migrate calls inAnnex before performing a migration.

But, Command.ExamineKey calls fastMigrate and the key may or may not
exist, and it's not wanting to actually perform a migration in any case.
To handle that, had to add an additional value to fastMigrate to
indicate whether the content is inAnnex.

Factored generateEquivilantKey out of Remote.Web.

Note that migrateFromURLToVURL hardcodes use of the SHA256E backend.
It would have been difficult not to, given all the dependency loop
issues. But --backend and annex.backend are used to tell git-annex
migrate to use VURL in any case, so there's no config knob that
the user could expect to configure that.

Sponsored-by: Brock Spratlen on Patreon
2024-03-01 16:42:02 -04:00
Joey Hess
9c988ee607
handle multiple VURL checksums in one pass
git-annex fsck and some other commands that verify the content of a key
were using the non-incremental verification interface. But for VURL
urls, that interface is inefficient because when there are multiple
equivalent keys, it has to separately read and checksum for each key in
turn until one matches. It's more efficient for those to use the
incremental interface, since the file can be read a single time.

There's no real downside to using the incremental interface when available.

Note that more speedup could be had for VURL, if it were able to
calculate the checksum a single time and then compare it with the
equivalent keys' checksums, when the equivalent keys use the same type of
checksum.

Sponsored-by: k0ld on Patreon
2024-03-01 14:41:10 -04:00
Joey Hess
0ac8962b1b
fix comment typo 2024-03-01 14:12:21 -04:00
Joey Hess
cc17ac423b
implement isCryptographicallySecureKey for VURL
Considerable difficulty to work around an import cycle. Had to move the
list of backends (except for VURL) to Backend.Variety so VURL could use
it.

Sponsored-by: Kevin Mueller on Patreon
2024-02-29 17:26:35 -04:00
Joey Hess
0f7143d226
support VURL backend
Not yet implemented is recording hashes on download from web and
verifying hashes.

addurl --verifiable option added with -V short option because I
expect a lot of people will want to use this.

It seems likely that --verifiable will become the default eventually,
and possibly rather soon. While old git-annex versions don't support
VURL, that doesn't prevent using them with keys that use VURL. Of
course, they won't verify the content on transfer, and fsck will warn
that it doesn't know about VURL. So there's not much problem with
starting to use VURL even when interoperating with old versions.

Sponsored-by: Joshua Antonishen on Patreon
2024-02-29 13:48:51 -04:00
Joey Hess
3475b09c3e
pre-commit: Avoid committing the git-annex branch
Except when a commit is made in a view, which changes metadata.

Make the assistant commit the git-annex branch after git commit of working
tree changes.

This allows using the annex.commitmessage-command in the assistant to
generate a commit message for the git-annex branch that relies on state
gathered during the commit of the working tree. Eg, it might reuse the
commit message.

Note that, when not using the assistant, a git-annex add still commits
the git-annex branch, so such an annex.commitmessage-command setup would
not work then. But if someone is using the assistant and wants
programmatic control over commit messages, this is useful. Someone not
using the assistant can get the same result by using annex.alwayscommit=false
during the git-annex add, and git-annex merge after they git commit.

pre-commit was never really intended to commit the git-annex branch
(except after recording changed metadata), but the assistant did sort of
rely on it. It does later commit the git-annex branch before pushing to
remotes, but I didn't want to risk building up lots of uncommitted changes
to it if that didn't happen frequently.

Sponsored-by: the NIH-funded NICEMAN (ReproNim TR&D3) project
2024-02-12 14:42:11 -04:00
Joey Hess
fa9197560d
move commitStaged out of Command.Sync which no longer uses it
It's trivial enough that it's not worth factoring it out to somewhere
in common with Command.Undo and the assistant.

Sponsored-by: the NIH-funded NICEMAN (ReproNim TR&D3) project
2024-02-07 16:19:28 -04:00
Joey Hess
21123ba368
assistant, undo: When committing, let the usual git commit hooks run
Was doing a Git.Branch.commit for historical reasons to do with direct
mode, which no longer apply.

Note that the preCommitAnnexHook is no longer called in commitStaged
because git-annex installs a pre-commit hook that runs the pre-commit-annex
hook. And git commit will run the pre-commit hook.

Sponsored-by: the NIH-funded NICEMAN (ReproNim TR&D3) project
2024-02-07 16:15:35 -04:00
Joey Hess
6b38d0c427
addurl, importfeed: Added --raw-except option
--raw-except=web allows using yt-dlp but not any other special remotes.

Currently this option can only be used once, trying to use it repeatedly
will make option parsing fail. Perhaps it ought to support being used more
than once, but it seemed like an unlikely use case to need that.

Note that getParsed is called repeatedly when the option is used with
several urls. While implementing DeferredParseClass would avoid that
inefficiency, it didn't seem worth the added boilerplate since
getParsed only calls byNameWithUUID which does minimal work.

Sponsored-by: Dartmouth College's DANDI project
2024-02-05 15:16:25 -04:00
Joey Hess
2f3fe4d904
fix importfeed --force skip behavior reversion
importfeed --force: Don't treat it as a failure when an already downloaded
file exists. (Fixes a behavior change introduced in 10.20230626.)

04ee6c4c6b caused the reversion. Inside a CommandPerform, stop causes it
to fail. Before that commit, it was inside a CommandStart, where stop
causes it to skip.
2024-02-02 15:57:07 -04:00
Joey Hess
0c64cd30c2
compare urls irrespective of downloader
importfeed --force: Avoid creating duplicates of existing already
downloaded files when yt-dlp or a special remote was used.
2024-02-02 15:50:56 -04:00
Joey Hess
90db97d9a2
importfeed: Added --scrape option
Which uses yt-dlp to screen scrape the equivalent of an RSS feed.

Note that youtubedlscraped is a speed optimisation. Since yt-dlp found
the urls, we know it can download them. That avoids calling
youtubeDlSupported on each url, which makes --fast a lot faster.

Almost all the same metadata fields and file formatting fields are
populated, when yt-dlp is able to get the data. Note that yt-dlp has some
additional useful metadata that could be exposed. But, much of it is
specific to particular websites, and it would be hard to document on the
git-annex importfeed man page.

Sponsored-by: unqueued on Patreon
2024-01-30 15:37:29 -04:00
Joey Hess
d7949f8202
move Feed and Item out of ToDownload
This is groundwork for producing ToDownload in other ways, that may not
be entirely isomorphic with feeds. Eg by using yt-dlp.
2024-01-30 14:11:26 -04:00
Joey Hess
8e9ee31621
webapp: Added --port option, and annex.port config
The getSocket comment that mentioned using ":port"
in the hostname seems to have been incorrect or be out of date.
After all, the bug report came when the user first tried doing that,
and it didn't work.

Sponsored-by: the NIH-funded NICEMAN (ReproNim TR&D3) project
2024-01-25 14:08:36 -04:00
Joey Hess
e765d3e24c
import: --message/-m option 2024-01-18 12:41:44 -04:00
Joey Hess
f6cf2dec4c
disk free checking for unsized keys
Improve disk free space checking when transferring unsized keys to
local git remotes. Since the size of the object file is known, can
check that instead.

Getting unsized keys from local git remotes does not check the actual
object size. It would be harder to handle that direction because the size
check is run locally, before anything involving the remote is done. So it
doesn't know the size of the file on the remote.

Also, transferring unsized keys to other remotes, including ssh remotes and
p2p remotes, doesn't do disk size checking for unsized keys. This would need a
change in protocol.

(It does seem like it would be possible to implement the same thing for
directory special remotes though.)

In some sense, it might be better to not ever do disk free checking for
unsized keys, than to do it only sometimes. A user might notice this
direction working and consider it a bug that the other direction does not.
On the other hand, disk reserve checking is not implemented for most
special remotes at all, and yet it is implemented for a few, which is also
inconsistent, but best effort. And so doing this best effort seems to make
some sense. Fundamentally, if the user wants the size to always be checked,
they should not use unsized keys.

Sponsored-by: Brock Spratlen on Patreon
2024-01-16 14:29:10 -04:00
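The local-git-remote case reduces to comparing the object file's size against free space at the destination. A sketch, with the disk free probe passed in as a parameter since its implementation is platform specific:

    import System.Directory (getFileSize)

    -- For an unsized key, the size of the local object file is known,
    -- so check that against free space, minus the configured reserve.
    enoughDiskFor :: IO Integer -> FilePath -> Integer -> IO Bool
    enoughDiskFor diskFree objectfile reserve = do
        size <- getFileSize objectfile
        free <- diskFree
        return (free - reserve >= size)
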
Joey Hess
1e775ab83b
remove slightly incorrect comments 2023-12-29 13:23:27 -04:00
Joey Hess
a4a5ec6366
info: Added "annex sizes of repositories" table to the overall display
Thanks to previous work in 11cc9f1933,
this is almost entirely free, it only needs to do some additional map
lookups and math.

The strictness annotations keep the memory use from blowing up.

Sponsored-by: unqueued on Patreon
2023-12-29 12:09:30 -04:00
Joey Hess
64db927d73
optimisation 2023-12-29 10:51:05 -04:00
Joey Hess
6d789c9c81
sync, push: Avoid trying to send individual files to special remotes configured with importtree=yes exporttree=no
That will always fail. It already skipped doing this when exporttree=yes.
2023-12-26 15:56:58 -04:00
Joey Hess
86dbe9a825
migrate: support adding size back to URL keys
migrate: Support adding size to URL keys that were added with --relaxed, by
running eg: git-annex migrate --backend=URL foo

Since url keys cannot be generated, that used to fail. Make it notice that
the backend is not changed, and just get the size of the content.

Sponsored-by: Brock Spratlen on Patreon
2023-12-08 16:22:14 -04:00
Joey Hess
257f01729c
distributed migration for pull and sync --content
pull, sync: When operating on content, automatically hard link objects
that have been migrated.

Added annex.syncmigrations config that can be set to false to prevent
pull and sync from migrating object content.

I think that true is a good default for this config, because it avoids
users having to re-download migrated content or learning about migration.
But, some users will surely not like it, whether because it does take some
time (especially for the first git-annex branch scan when there is a long
history), or because they want to deal with it manually, or because their
filesystem doesn't support hard links and they don't want it to copy
objects.

Sponsored-by: k0ld on Patreon
2023-12-08 14:18:18 -04:00
Joey Hess
4ed71b34de
migrate --apply
And avoid migrate --update/--apply migrating when the new key was already
present in the repository, and got dropped. Luckily, the location log
allows distinguishing that from the new key never having been present!

That is mostly useful for --apply because otherwise dropped files would
keep coming back until the old objects were reaped as unused. But it
seemed to make sense to also do it for --update. for consistency in edge
cases if nothing else. One case where --update can use it is when one
branch got migrated earlier, and we dropped the file, and now another
branch has migrated the same file.

Sponsored-by: Jack Hill on Patreon
2023-12-08 13:23:46 -04:00
Joey Hess
51b974d9f0
skip distributed migration to insecure key when annex.securehashesonly is set
This only avoids extra work and a warning message. It seems likely that
in such a situation, the user does not want migrations to insecure
hashes, and so best to ignore them as much as possible. If
the user merges a branch that switches annexed files to an insecure
hash, they will notice that the file contents are unavailable,
and git-annex get will tell them the problem then. So it does not seem
useful to have migrate --update also complain about it.
2023-12-08 12:41:50 -04:00
Joey Hess
30c2728d65
always verify content in distributed migration
doc/todo/distributed_migration.mdwn discusses security of distributed
migration, and this was identified as necessary to do.
2023-12-07 20:05:42 -04:00
Joey Hess
62ce56c4ea
display filenames in migrate --update
Have to go to a lot of bother to find them, but I think it's worth it
for usability.

Sponsored-by: Luke T. Shumaker on Patreon
2023-12-07 18:00:09 -04:00
Joey Hess
abea01d9e0
migrate --update fully working
Could use some more testing.

When the old key is not present, Command.ReKey.linkKey' will return
False, so this handles that case ok.

But, I do wonder if distributed migration may need to deal with the old
key getting copied into the repository later. In that situation,
re-running migrate --update won't link it to the new key. It may be that
some users will need that. They can delete .git/annex/migrate.log and
run it again, but that is not a good user interface. Maybe either have
a way to re-run all distributed migrations, or record migrations
in a database and scan the db to find migrations to do in a future run?

Sponsored-by: Kevin Mueller on Patreon
2023-12-07 17:27:51 -04:00
Joey Hess
7c7c9912c1
migrate --update gets keys
The git log is outputting the diff, but this only looks at the new
files. When we have a new file, we can get the old filename by just
replacing "new" with "old". And then using branchFileRef to refer to it
allows catting the old key.

While this does have to skip past the old files in the diff, it's still
faster than calling git diff separately.

Sponsored-by: Nicholas Golder-Manning on Patreon
2023-12-07 17:25:56 -04:00
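A sketch of deriving the old file from the new one, assuming the migration tree puts corresponding keys under new/ and old/ directories as described:

    import Data.List (isPrefixOf)

    -- Each new key file in the diff has a corresponding old key file
    -- whose path differs only in the leading directory.
    oldFileForNew :: FilePath -> Maybe FilePath
    oldFileForNew p
        | "new/" `isPrefixOf` p = Just ("old/" ++ drop 4 p)
        | otherwise = Nothing
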
Joey Hess
f1ce15036f
started migrate --update
This is most of the way there, but not quite working.

The layout of migrate.tree/ needs to be changed to follow this approach.
git log will list all the files in tree order, so the new layout needs
to alternate old and new keys. Can that be done? git may not document
tree order, or may not preserve it here.

Alternatively, change to using git log --format=raw and extract
the tree header from that, then use
git diff --raw $tree:migrate.tree/old $tree:migrate.tree/new
That will be a little more expensive, but only when there are lots of
migrations.

Sponsored-by: Joshua Antonishen on Patreon
2023-12-07 15:50:52 -04:00
Joey Hess
0bd8b17b59
log migration trees to git-annex branch
This will allow distributed migration: Start a migration in one clone of
a repo, and then update other clones.

commitMigration is a bit of a bear.. There is some inversion of control
that needs some TMVars. Also streamLogFile's finalizer does not handle
recording the trees, so an interrupt at just the wrong time can cause
migration.log to be emptied but the git-annex branch not updated.

Sponsored-by: Graham Spencer on Patreon
2023-12-06 15:40:03 -04:00
Joey Hess
b55efc179a
add startAction parameter for KeySha
I have a use planned for this in Command.Migrate.

Sponsored-by: unqueued on Patreon
2023-12-06 13:28:02 -04:00
Joey Hess
0485dd3161
sync: Fix locking problems during merge when annex.pidlock is set
Presumably git merge sometimes needs to verify if a worktree file is
modified, and so will then run git-annex filter-process which would try to
take the pid lock. And for whatever reason, git-annex sync already had the
pidlock held. I have not replicated that, but it does make enough sense to
deploy the workaround.

Like I said back in commit 7bdb0cdc0d,

   Arguably, it would be better to have a way to make any process git-annex
   runs have the env var set. But then it would need to take the pid lock
   when running any and all processes, and that would be a problem when
   git-annex runs two processes concurrently. So, I'm left doing it ad-hoc
   in places where git-annex really does run a child process, directly
   or indirectly via a particular git command.

Sponsored-by: KDM on Patreon
2023-12-04 13:40:28 -04:00
Joey Hess
1e31bf8122
copy/move --from-anywhere --to remote
Implementation was simple because it's equivalent to
--from=foo --to remote for each other remote, followed by
--to remote when there's a local copy.

(Or, in the edge case of --from-anywhere --to=here,
it's the same as --to=here.)

Note that, when the local repo does not have a copy,
fromToPerform gets it from a remote, sends it to the destination,
and drops the local copy. Another call to that for a second remote
will notice that the dest now has a copy, and simply drop from the
second remote, avoiding a second transfer.

Also note that, when numcopies doesn't allow dropping it from
everywhere, it will drop it from the cheapest remotes first
(maybe not ideal) up to more expensive remotes, and finally from the local
repo. So the local repo will generally end up holding a copy. Maybe not
ideal in all cases either, but it seems no worse to do that than to end up
with a copy undropped from a remote.

And I'm not entirely happy with the output, eg:

	copy bigfile (from r3...) ok
	copy bigfile ok

That makes sense if you think of the second line as being
the same as what is output by `git-annex copy bigfile --to bar`,
but it's less clear in this context. Maybe add "(from here...)"?
Also the --json output doesn't have a machine-readable field for
the "from" uuid, and maybe it should?

Sponsored-by: Dartmouth College's DANDI project
2023-11-30 16:34:30 -04:00
Joey Hess
1654572bc1
fix --from overriding annex-ignore
Make git-annex get/copy/move --from foo override configuration of
remote.foo.annex-ignore, as documented.

This already worked for remotes supporting hasKeyCheap. For others though,
git-annex copy --from foo would silently not do anything, while
git-annex copy --to foo would use the annex-ignored remote.

Also improved the annex-ignore docs, to reflect that `git-annex get`
without --from will skip using annex-ignored remotes, for example.

Sponsored-by: Dartmouth College's DANDI project
2023-11-30 15:12:07 -04:00
Joey Hess
6e3bcbf4dd
Make git-annex copy --from --to --fast actually fast
Eg when the destination is logged as containing a file, skip
actively checking that it does contain it.

Note that --fast does not prevent other verifications of content
location that are done in a copy --from --to. Perhaps it could, but this
change will already avoid the real unnecessary work of operating on
files that are already in the remote.

And avoiding other verifications
might cause it to fail if the location log thinks that --to does not
contain the content but it does. Such complications with `git-annex copy
--to remote --fast` led to commit d006586cd0
which added a note that gets displayed when that fails, mentioning it
might be due to --fast being enabled.

copy --from --to is already complicated enough without needing to worry
about such edge cases, so continuing to doing some verification of
content location after the initial --fast filtering seems ok.

Sponsored-by: Dartmouth College's DANDI project
2023-11-17 17:37:58 -04:00
Joey Hess
96c0e720c5
small refactoring
Sponsored-by: Dartmouth College's DANDI project
2023-11-17 17:23:24 -04:00
Joey Hess
7a8393ce7d
Fix bug in git-annex copy --from --to
Caused it to skip files that were locally present.

Sponsored-by: Dartmouth College's DANDI project
2023-11-17 16:30:20 -04:00
Joey Hess
7d67229884
git-annex log --gnuplot
The gnuplot output is pretty good, but could still be improved with:

* more colors (repeating colors is confusing with a lot of repos)
* better positioning of the legend, making the plot wider and moving it
  from over top of the graph

Sponsored-by: Kevin Mueller on Patreon
2023-11-14 14:56:58 -04:00
Joey Hess
0fdc1a54db
git-annex log --received modifier option
Only counting received and not dropped makes this show the bandwidth of
data coming into the repository, although only in a sense. Since
git-annex branch updates only happen at the end of a command, and we
don't know when a command started, it's only an approximation of the
actual bandwidth. (A previous git-annex branch update may have
happened in a different repository.)

It would be possible to also add a --dropped option, but I don't know
how useful that would be?

Sponsored-by: Nicholas Golder-Manning on Patreon
2023-11-14 14:04:46 -04:00
Joey Hess
21e66ce209
rename --when to --interval
More accurately describes its behavior.
2023-11-14 11:45:16 -04:00
Joey Hess
cde081a025
guard against wrong timestamps in git log
For example, my sound repo has in the git-annex branch a commit from
2036, which is followed by one from 2034, in among commits from 2013.
Clearly there was a problem with the clock.

Since git log --date-order has a behavior of
"Show no parents before all of its children are shown", the data still
gets processed ok. The future timestamp just prevented displaying data
after that commit. It seems better, when the clock was wrong, to display
a wrong date, and then return to right dates.

It would be nice to filter out the wrong dates from display entirely,
but that seems it would need to buffer the whole output. This command is
too slow to buffer it all before displaying anything, and anyway this
kind of problem is probably rare.

Sponsored-by: Joshua Antonishen on Patreon
2023-11-13 15:06:04 -04:00
Joey Hess
2ab11fe06e
treat dead repos as 0 size
With this, git annex log --totalsizes can be compared with
git-annex info's "combined annex size of all repositories"
to double-check it works correctly.

In my sound repo, the two match.

In my big repo, the two report slightly different sizes,
with the former being 1.3 gb smaller than the latter.
I don't know the reason for this discrepancy. Given the 30+tb size of
the repo, it's a small difference.

It seems possible that a bug in an old version of git-annex could
explain it. Eg, if an old git-annex lost a line when updating trust.log
or a location log in a merge, git-annex info would see only what it
replaced it with, while git-annex log will see the previous value as
well.

Sponsored-by: Leon Schuermann on Patreon
2023-11-13 15:05:48 -04:00
Joey Hess
38b9ebc5fd
newtype MapLog
Noticed that the Semigroup instance of Map is not suitable to use
for MapLog. For example, it behaved like this:

ghci>  parseTrustLog "foo 1 timestamp=10\nfoo 2 timestamp=11" <> parseTrustLog "foo X timestamp=12"
fromList [(UUID "foo",LogEntry {changed = VectorClock 11s, value = SemiTrusted})]

Which was wrong, it lost the newer DeadTrusted value.

Luckily, nothing used that Semigroup when operating on a MapLog. And this
provides a safe instance.

Sponsored-by: Graham Spencer on Patreon
2023-11-13 14:37:22 -04:00
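A safe instance combines per-key and keeps the newer entry, rather than relying on Map's left-biased union. A sketch, with Integer standing in for git-annex's VectorClock:

    import qualified Data.Map.Strict as M

    data LogEntry v = LogEntry { changed :: Integer, value :: v }

    newtype MapLog f v = MapLog (M.Map f (LogEntry v))

    -- Combine per-key, keeping the entry with the newer vector clock,
    -- so merging never loses a newer value the way Map's (<>) did.
    instance Ord f => Semigroup (MapLog f v) where
        MapLog a <> MapLog b = MapLog (M.unionWith newer a b)
          where
            newer x y = if changed x >= changed y then x else y
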
Joey Hess
5d8b8a8ad0
git-annex log --totalsizes
Note that dead repositories are not yet handled so their sizes show as
nonzero after they are marked dead.

Sponsored-By: unqueued on Patreon
2023-11-13 13:15:36 -04:00
Joey Hess
dc02236c85
git-annex log --sizes
CSV format so it can be fed into a program to graph it.

Note that dead repositories are not yet handled, so their sizes show as
nonzero after they are marked dead.

Sponsored-By: k0ld on Patreon
2023-11-13 13:07:22 -04:00
Joey Hess
6203b8afba
the last line is for the current time 2023-11-10 17:37:55 -04:00
Joey Hess
574514545c
git-annex log --sizesof
This can take a lot of memory. I decided to violate the usual rule in
git-annex that it operate in constant memory no matter how many annexed
objects there are. In this case, it would be hard to be fast without using a big
map of the location logs. The main difficulty here is that there can be
many git-annex branches and it needs to display a consistent view at a
point in time, which means merging information from multiple git-annex
branches.

I have not checked if there are any laziness leaks in this code. It
takes 1 gb to run in my big repo, which is around what I estimated
before writing it.
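
A rough sketch of the shape of that map (hypothetical types, not the
actual implementation): per-repository key sets, updated from a stream
of location log changes merged across the git-annex branches, from
which each repository's size can be read off at any point in time.

import qualified Data.Map.Strict as M
import qualified Data.Set as S

type UUID = String
type Key = String

data LocationChange = LocationChange
    { repo :: UUID
    , keyChanged :: Key
    , present :: Bool  -- True = key added to repo, False = dropped
    }

applyChange :: M.Map UUID (S.Set Key) -> LocationChange -> M.Map UUID (S.Set Key)
applyChange m (LocationChange u k p)
    | p = M.insertWith S.union u (S.singleton k) m
    | otherwise = M.adjust (S.delete k) u m

-- Size of one repository, given a way to look up each key's size.
repoSize :: (Key -> Integer) -> M.Map UUID (S.Set Key) -> UUID -> Integer
repoSize keySize m u = maybe 0 (sum . map keySize . S.toList) (M.lookup u m)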

Two options that are documented are not yet implemented.

Small bug: With eg --when=1h, if the next change after 12:59 happens at
1:10, it displays at 12:00 and then at 1:10, and then waits until after
2:10 to display the next change. It ought to wait only until after 2:00.

Sponsored-by: Brock Spratlen on Patreon
2023-11-10 17:26:10 -04:00
Joey Hess
561c036664
split out generic git log parser
Sponsored-By: Jack Hill on Patreon
2023-11-10 15:40:03 -04:00
Joey Hess
11cc9f1933
info: Added calculation of combined annex size of all repositories
Factored out overLocationLogs from CmdLine.Seek, which can calculate this
pretty fast even in a large repo. In my big repo, the time to run git-annex
info went up from 1.33s to 8.5s.

Note that the "backend usage" stats are for annexed files in the working
tree only, not all annexed files. This new data source would let that be
changed, but that would be a confusing behavior change. And I cannot
retitle it either, out of fear something uses the current title (eg parsing
the json).

Also note that, while time says "402108maxresident" in my big repo now,
up from "54092maxresident", top shows RES holding constant at 64mb, where
it was 48mb before. So I don't think there is a memory leak. I tried using
deepseq to force full evaluation of addKeyCopies and memory use didn't
change, which also suggests no memory leak. And indeed, not calling
addKeyCopies at all resulted in the same memory use. Probably the increased
memory usage is buffering the stream of data from git in overLocationLogs.
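
For illustration only (a hypothetical helper, not addKeyCopies' real
signature): the accumulation amounts to summing each key's size times
the number of repositories holding a copy.

import Data.List (foldl')

combinedSize :: [(Integer, Int)] -> Integer  -- (key size, number of copies)
combinedSize = foldl' (\total (keysize, ncopies) ->
    total + keysize * fromIntegral ncopies) 0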

Sponsored-by: Brett Eisenberg on Patreon
2023-11-08 13:35:11 -04:00
Joey Hess
8768966d97
improve comments 2023-11-08 12:06:03 -04:00
Joey Hess
f8d35d9480
lookupkey: Sped up --batch
When the file is relative, it does not need to be passed
through git ls-files to normalize it.
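
Roughly (a hypothetical sketch, not the actual code): paths that are
already relative can be used as-is, and only absolute paths pay for the
normalization subprocess.

import System.FilePath (isRelative)

normalizeForLookup :: (FilePath -> IO FilePath) -> FilePath -> IO FilePath
normalizeForLookup viaGitLsFiles f
    | isRelative f = return f       -- fast path, no subprocess
    | otherwise = viaGitLsFiles f   -- normalize via git ls-files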

Sponsored-by: Kevin Mueller on Patreon
2023-10-30 14:59:09 -04:00
Joey Hess
d9fd205cbb
push RawFilePath down into Annex.ReplaceFile
Minor optimisation, but a win in every case, except for a couple where
it's a wash.

Note that replaceFile still takes a FilePath, because it needs to
operate on Chars to truncate unicode filenames properly.
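
An illustrative sketch of why (not the actual replaceFile code):
truncating a list of Chars always yields whole characters, while
truncating the encoded bytes can cut a multi-byte character in half.

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

truncateByChars :: Int -> String -> String
truncateByChars n = take n  -- always ends on a character boundary

truncateByBytes :: Int -> String -> B.ByteString
truncateByBytes n = B.take n . TE.encodeUtf8 . T.pack
-- eg truncateByBytes 1 "é" leaves a lone 0xc3 byte, an invalid name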
2023-10-26 13:36:49 -04:00
Joey Hess
c873586e14
eliminate s2w8 and w82s
Note that the use of s2w8 in genUUIDInNameSpace made it truncate unicode
characters. Luckily, genUUIDInNameSpace is only ever used on ASCII
strings as far as I can determine. In particular, git-remote-gcrypt's
gcrypt-id is an ASCII string.
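
Roughly what those helpers did (a reconstructed sketch, not the exact
source): converting each Char straight to a Word8 keeps only the low 8
bits of the codepoint, so anything above 255 is silently mangled.

import Data.Char (chr, ord)
import Data.Word (Word8)

s2w8 :: String -> [Word8]
s2w8 = map (fromIntegral . ord)  -- truncates codepoints > 255

w82s :: [Word8] -> String
w82s = map (chr . fromIntegral)
-- eg w82s (s2w8 "\955") is "\187"; the lambda comes back as a different char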
2023-10-26 13:12:57 -04:00
Joey Hess
8bde6101e3
sqlite database for importfeed
importfeed: Use caching database to avoid needing to list urls on every
run, and avoid using too much memory.

Benchmarking in my podcasts repo, importfeed got 1.42 seconds faster,
and memory use dropped from 203000k to 59408k.

Database.ImportFeed is Database.ContentIdentifier with the serial number
filed off. There is a bit of code duplication I would like to avoid,
particularly recordAnnexBranchTree and getAnnexBranchTree. But these use
the persistent sqlite tables, so despite the code being the same, they
cannot be factored out.

Since this database includes the contentidentifier metadata, it will be
slightly redundant if a sqlite database is ever added for metadata. I
did consider making such a generic database and using it for this. But,
that would then need importfeed to update both the url database and the
metadata database, which is twice as much work diffing the git-annex
branch trees. Or it would entangle updating the two databases in a
complex way.
So instead it seems better to optimise the database that
importfeed needs, and if the metadata database is used by another command,
use a little more disk space and do a little bit of redundant work to
update it.
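
A hypothetical sketch of the caching idea (invented names, not the
actual schema): the database records which git-annex branch tree it was
last built from, so a run only has to diff from that tree to the
current one instead of listing every url in the branch.

import qualified Data.Set as S

type Sha = String       -- git tree sha, simplified
type URLString = String

data UrlCache = UrlCache
    { cachedTree :: Sha
    , knownUrls :: S.Set URLString
    }

updateCache :: (Sha -> Sha -> S.Set URLString) -> Sha -> UrlCache -> UrlCache
updateCache diffUrls currentTree cache
    | cachedTree cache == currentTree = cache  -- cache is up to date
    | otherwise = UrlCache currentTree $
        knownUrls cache `S.union` diffUrls (cachedTree cache) currentTree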

Sponsored-by: unqueued on Patreon
2023-10-23 16:46:22 -04:00
Joey Hess
41f4d0bda9
enableremote: Avoid overwriting existing git remote when passed the uuid of a specialremote that was earlier initialized with the same name 2023-09-22 13:29:48 -04:00
Joey Hess
ef7c867238
fix some build warnings from ghc 9.4.6
It now notices that a RepoLocation may not be Local, in which case
pattern matching on Local wouldn't do.
2023-09-21 13:40:22 -04:00
Joey Hess
a147a31baa
fix some build warnings from ghc 9.4.6
For some reason it doesn't notice that req must be a Req, because
the toplevel function matched on that.
2023-09-21 13:38:36 -04:00
Joey Hess
a18e40bdd7
lookupkey: Added --ref option
Sponsored-by: Joshua Antonishen on Patreon
2023-09-12 12:49:11 -04:00
Joey Hess
7be8950138
propigateAdjustedCommits in seekExportContent
push: When on an adjusted branch, propagate changes to parent branch before
updating export remotes.

This is a somewhat redundant call to propigateAdjustedCommits, since it
also gets called at pushLocal time. That other one needs to come after
importing from importtree remotes though, and seekExportContent has to come
earlier, so I don't see a way to avoid doing it twice.

Note that git-annex sync also manages to avoid the problem; it's only
git-annex push that had the bug.

Sponsored-by: Leon Schuermann on Patreon
2023-09-11 14:54:26 -04:00
Joey Hess
aeaadb8eb8
improve warning message when unable to update export
A misleading message was displayed in several cases.

If the user has run eg:
git config remote.push-win-remote.annex-tracking-branch 'adjusted/main(unlocked)'
that is not supported, and git-annex now tells them it is not a valid
configuration. A user reported doing that, but I don't know if it's a
common point of confusion. If it is a common problem, a better message
would be possible, or it could convert back from the adjusted branch to
the actual branch.

Sponsored-by: Graham Spencer on Patreon
2023-09-11 14:21:36 -04:00
Joey Hess
49b97b0675
oldkeys: check associated files by default and add --unchecked
Removed the prior code that checked for keys used by current versions of
the files being acted on. It is redundant with the associated files
check (so long as the associated files database is always up-to-date,
which reconcileStaged should accomplish).

Sponsored-by: Luke T. Shumaker on Patreon
2023-08-23 13:46:41 -04:00
Joey Hess
5489c2cdd6
oldkeys --revision-range
Sponsored-by: Brett Eisenberg on Patreon
2023-08-22 15:00:29 -04:00
Joey Hess
cf8b30c914
oldkeys: New command that lists the keys used by old versions of a file
The tricky thing about this turned out to be handling renames and reverts.
For that, it has to make two passes over the git log. To avoid buffering a
possibly huge amount of log in memory (ie the whole git log of an entire
repository!), it runs git log twice rather than caching the output of the
first pass.

(It might be possible to speed this up by asking git log to show a diff,
and so avoid needing to use catKey.)
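
A hypothetical sketch of what the two passes compute: the first pass
gathers every key the file ever pointed to, the second gathers the keys
still in use by current versions (surviving renames and reverts), and
the old keys are the difference.

import qualified Data.Set as S

type Key = String  -- simplified stand-in

oldKeysOf :: S.Set Key -> S.Set Key -> [Key]
oldKeysOf everUsed stillUsed = S.toList (everUsed `S.difference` stillUsed)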

Sponsored-By: Brock Spratlen on Patreon
2023-08-22 14:51:06 -04:00
Joey Hess
379d58b499
diffdriver: Added --get option
Removed the dontCheck repoExists, because running it in a repo that has not
been initialized yet would update the location log with nouuid. And I guess
it's ok for it to only support running in git-annex repos.
2023-08-22 11:58:53 -04:00
Joey Hess
67c99a4db7
info: Added available to the info displayed for a remote
Sponsored-by: Kevin Mueller on Patreon
2023-08-16 14:52:58 -04:00