Commit graph

2224 commits

Joey Hess
467d80101a
improve handling of unmerged git-annex branches in readonly repo
git-annex info was displaying a message that didn't make sense in
context.

In calcRepoSizes, it seems better to return the information from the
git-annex branch, rather than giving up. Especially since balanced
preferred content uses it, and we can't just give up evaluating a
preferred content expression if git-annex is to be usable in such a
readonly repo.

Commit 6d7ecd9e5d nobly wanted git-annex
to behave the same with such unmerged branches as it does when it can
merge them. But for the purposes of preferred content, it seems to me
there's a sense that such an unmerged branch is the same as a remote we
have not pulled from. The balanced preferred content will either way
operate on outdated information, and so may not make the best choices.
2024-08-13 13:13:12 -04:00
Joey Hess
745bc5c547
take maxsize into account for balanced preferred content
This is very inefficient; it will need to be optimised not to
calculate the sizes of repos every time.

Also, fixed a bug in balancedPicker that caused it to pick a too high
index when some repos were excluded due to being full.
2024-08-13 11:00:20 -04:00
Joey Hess
99a126bebb
added reposize database
The idea is that upon a merge of the git-annex branch, or a commit to
the git-annex branch, the reposize database will be updated. So it
should always accurately reflect the location log sizes, but it will
often be behind the actual current sizes.

Annex.reposizes will start with the value from the database, and get
updated with each transfer, so it will reflect a process's best
understanding of the current sizes.

When there are multiple processes all transferring to the same repo,
Annex.reposizes will not reflect transfers made by the other processes
since the current process started. So when using balanced preferred
content, it may make suboptimal choices, including trying to transfer
content to the repo when another process has already filled it up.
But this is the same as if there are multiple processes running on
different machines, so is acceptable. The reposize will eventually
get an accurate value reflecting changes made by other processes or in
other repos.
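
A minimal sketch of that bookkeeping, assuming a simple Map (the names
here are hypothetical; the real reposize database and Annex.reposizes
differ):

import qualified Data.Map.Strict as M

type UUID = String
type RepoSizes = M.Map UUID Integer

-- Start from the sizes read out of the reposize database, then fold
-- in each transfer this process makes.
applyUpload :: UUID -> Integer -> RepoSizes -> RepoSizes
applyUpload u keysize = M.insertWith (+) u keysize

applyDrop :: UUID -> Integer -> RepoSizes -> RepoSizes
applyDrop u keysize = M.adjust (subtract keysize) u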
2024-08-12 11:19:58 -04:00
Joey Hess
bd5affa362
use hmac in balanced preferred content
This deals with the possible security problem that someone could make an
unusually low UUID and generate keys that are all constructed to hash to
a number that, mod the number of repositories in the group, == 0.
So balanced preferred content would always put those keys in the
repository with the low UUID as long as the group contains the
number of repositories that the attacker anticipated.
Presumably the attacker then holds the data for ransom? Dunno.

Anyway, the partial solution is to use HMAC (sha256) with all the UUIDs
combined together as the "secret", and the key as the "message". Now any
change in the set of UUIDs in a group will invalidate the attacker's
constructed keys from hashing to anything in particular.

Given that there are plenty of other things someone can do if they can
write to the repository -- including modifying preferred content so only
their repository wants files, and numcopies so other repositories drop
them -- this seems like safeguard enough.

Note that, in balancedPicker, combineduuids is memoized.
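
A minimal sketch of the scheme, assuming the cryptonite library
(pickRepoIndex is a hypothetical name; the real balancedPicker differs):

import Crypto.Hash.Algorithms (SHA256)
import Crypto.MAC.HMAC (HMAC, hmac)
import qualified Data.ByteArray as BA
import qualified Data.ByteString.Char8 as B
import Data.List (foldl')

-- HMAC-SHA256 with all the UUIDs combined as the secret and the key
-- as the message, reduced mod the number of repositories in the group.
pickRepoIndex :: [B.ByteString] -> B.ByteString -> Int
pickRepoIndex uuids key = fromInteger (v `mod` toInteger (length uuids))
  where
    secret = B.concat uuids
    mac = hmac secret key :: HMAC SHA256
    v = foldl' (\acc w -> acc * 256 + toInteger w) 0 (BA.unpack mac)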
2024-08-10 16:32:54 -04:00
Joey Hess
3ce2e95a5f
balanced preferred content and --rebalance
This all works fine. But it doesn't check repository sizes yet, and
without repository size checking, once a repository gets full, there
will be no other repository that will want its files.

Use of sha2 seems unnecessary; probably adler32 or md5 or crc would have
been enough. Possibly just summing up the bytes of the key mod the number
of repositories would have sufficed. But sha2 is there, and probably
hardware accelerated. I doubt very much there is any security benefit
to using it though. If someone wants to construct a key that will be
balanced onto a given repository, sha2 is certainly not going to stop
them.
2024-08-09 14:16:09 -04:00
Joey Hess
3ea835c7e8
proxied exporttree=yes versionedexport=yes remotes are not untrusted
This removes versionedExport, which was only used by the S3 special
remote. Instead, versionedexport=yes is a common way for remotes to
indicate that they are versioned.
2024-08-08 15:24:19 -04:00
Joey Hess
1038567881
proxy stores received keys to known export locations
This handles the workflow where the branch is first pushed to the proxy,
and then files in the exported tree are later copied to the proxied remote.

Turns out that the way the export log is structured, nothing needs
to be done to finalize the export once the last key is sent to it. Which
is great because that would have been a lot of complication. On
receiving the push, Command.Export runs and calls recordExportBeginning,
does as much as it can to update the export with the files currently
on it, and then calls recordExportUnderway. At that point, the
export.log records the export as "complete", but it's not really. And
that's fine. The same happens when using `git-annex export` when some
files are not available to send. Other repositories that have
access to the special remote can already retrieve files from it. As
the missing files get copied to the exported remote, all that needs
to be done is record each in the export db.

At this point, proxying to exporttree=yes annexobjects=yes special remotes
is fully working. Except for in the case where multiple files in the
tree use the same key, and the files are sent to the proxied remote
before pushing the tree.

It seems that even special remotes without annexobjects=yes will work if
used with the workflow where the git-annex branch is pushed before
copying files. But not with the `git-annex push` workflow.
2024-08-07 09:47:34 -04:00
Joey Hess
c53f61e93f
Merge branch 'master' into exportreeplus 2024-08-06 14:46:33 -04:00
Joey Hess
3cc03b4c96
fix file corruption when proxying an upload to a special remote
The file corruption consists of each chunk of the file being duplicated.
Since chunks are typically a fixed size, it would certainly be possible
to get from a corrupted file back to the original file. But this is still
bad data loss.

Reversion was in commit fcc052bed8.
Luckily that did not make the most recent release.
2024-08-06 14:41:19 -04:00
Joey Hess
3289b1ad02
proxying to exporttree=yes annexobjects=yes basically working
It works when using git-annex sync/push/assist, or when manually sending
all content to the proxied remote before pushing to the proxy remote.
But when the push comes before the content is sent, sending content does
not update the exported tree.
2024-08-06 14:21:23 -04:00
Joey Hess
a3d96474f2
rename to annexobjects location on unexport
This avoids needing to re-upload the file again to get it to the
annexobjects location, which git-annex sync was doing when it was
preferred content.

If the file is not preferred content, sync will drop it from the
annexobjects location.

If the file has been deleted from the tree, it will remain in the
annexobjects location until an unused/dropunused pass is done.
2024-08-04 11:58:07 -04:00
Joey Hess
28b29f63dc
initial support for annexobjects=yes
Works but some commands may need changes to support special remotes
configured this way.
2024-08-02 14:07:45 -04:00
Joey Hess
3a1f39fbdf
Avoid loading cluster log at startup
This fixes a problem with datalad's test suite, where loading the cluster
log happened to cause the git-annex branch commits to take a different
shape, with an additional commit.

It's also faster though, since many commands don't need the cluster log.

Just fill Annex.clusters with a thunk.

Sponsored-by: the NIH-funded NICEMAN (ReproNim TR&D3) project
2024-07-31 15:54:14 -04:00
Joey Hess
dc11c5f493
final fix to windows build 2024-07-29 16:32:24 -04:00
Joey Hess
be8e1ab512
more fixes to windows build for content retention files
Will probably build successfully now. Still untested.
2024-07-29 15:14:12 -04:00
Joey Hess
647bff9770
more fixes to windows build for content retention files 2024-07-29 13:58:40 -04:00
Joey Hess
fcc052bed8
When proxying an upload to a special remote, verify the hash.
While usually uploading to a special remote does not verify the content,
the content in a repository is assumed to be valid, and there is no trust
boundary. But with a proxied special remote, there may be users who are
allowed to store objects, but are not really trusted.

Another way to look at this is that it's the equivalent of git-annex-shell
checking the hash of received data, which it does (see StoreContent
implementation).
2024-07-29 13:40:51 -04:00
Joey Hess
f397296739
add missing do on windows 2024-07-29 12:54:52 -04:00
Joey Hess
0dc064a9ad
When proxying for a special remote, avoid unnecessary hashing
Like the comment says, the client will do its own verification. But it was
calling verifyKeyContentPostRetrieval, which was hashing the file.
2024-07-29 11:18:03 -04:00
Joey Hess
dfe65b92c8
avoid repeatedly parsing the proxy log 2024-07-28 16:04:20 -04:00
Joey Hess
cdc4bd7443
fix hang in PUT of large file to a special remote node of a cluster over http 2024-07-28 15:34:59 -04:00
Joey Hess
18ed4e5b20
use closedv rather than separate endv
Doesn't fix any known problem, but this way if the connection does get
closed, it will notice.
2024-07-28 15:11:31 -04:00
Joey Hess
66679c9bb4
remove temp file after upload to special remote 2024-07-28 14:36:45 -04:00
Joey Hess
6722a61a21
clusters need enableInteractiveBranchAccess
As seen in commit 770aac97a7, a cluster
relies on accurate location logs. If long-running processes are serving a
cluster, and one process puts a file, the other process needs to see
what nodes it was stored on when checking if the file is present.
2024-07-28 12:39:42 -04:00
Joey Hess
bd3d327d8a
smarter BranchState cache invalidation
Only invalidate a just-written file in the cache, not the whole cache.

This will avoid the possibly performance impact of cache invalidation
mentioned in commit 770aac97a7
2024-07-28 12:33:32 -04:00
Joey Hess
770aac97a7
share single BranchState among all threads
This fixes a problem when git-annex testremote is run against a cluster
accessed via the http server. Annex.Cluster uses the location log
to find nodes that contain a key when checking if the key is present or getting
it. Just after a key was stored to a cluster node, reading the location log
was not getting the UUID of that node.

Apparently the Annex action that wrote to the location log, and the one
that read from it were run with two different Annex states. The http server
does use several different Annex threads.

BranchState was part of the AnnexState, and so two threads could have
different BranchStates.

Moved BranchState to the AnnexRead, so all threads will see the common state.

This might possibly impact performance. If one thread is writing changes to the
branch, and another thread is reading from the branch, the writing thread will
now invalidate the BranchState's cache, which will cause the reading thread to
need to do extra work. But correctness is surely more important. If this is
found to have impacted performance, it could probably be dealt with by doing
smarter BranchState cache invalidation.

Another way this might impact performance is that the BranchState has a small
cache. If several threads were reading from the branch and relying on the value
they just read still being in the cache, now a cache miss will be more likely.
Increasing the BranchState cache to the number of jobs might be a good
idea to ameliorate that. But the cache is currently an inefficient list,
so making it large would need changes to the data types.

(Commit 4304f1b6ae dealt with a follow-on
effect of the bug fixed here.)
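
Schematically, the change amounts to something like this (types
simplified and hypothetical):

import Control.Concurrent.MVar

data BranchState = BranchState
    { cachedFiles :: [(FilePath, String)] -- the small cache
    }

-- Before: BranchState was a field of the per-thread AnnexState, so
-- two threads could see different branch caches.
-- After: it lives in the shared AnnexRead environment, behind an
-- MVar, so every thread sees the same state.
data AnnexRead = AnnexRead
    { branchState :: MVar BranchState
    }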
2024-07-28 12:30:27 -04:00
Joey Hess
fbbedae497
add --clusterjobs option and default to 1
The default of 1 is not ideal at all, but it avoids an accidental M*N
causing so much concurrency it becomes unusable.
2024-07-28 10:36:22 -04:00
Joey Hess
1259ad89b6
cluster support in http API server
Wired it up and it seems to basically work, although the test suite is
not fully passing.

Note that --jobs currently gets multiplied by the number of nodes in the
cluster, which is probably not good.
2024-07-28 10:17:29 -04:00
Joey Hess
8ec174408e
remove duplicate code 2024-07-28 09:35:09 -04:00
Joey Hess
ef8f24f28c
fix PUT to http proxied special remote
It was hanging because it never sent FAILURE in the INVALID case.
And putoffset always triggers the INVALID case.
2024-07-28 09:14:42 -04:00
Joey Hess
0fb86d2916
UNLOCKCONTENT is not a top-level request
proxyRequest was treating UNLOCKCONTENT as a separate request.
That made it possible for there to be two different connections to the
proxied remote, with LOCKCONTENT being sent to one, and UNLOCKCONTENT
to the other one. A protocol error.

git-annex testremote now passes against a http proxied remote.
2024-07-26 20:39:06 -04:00
Joey Hess
267a202e72
clean up after http p2p proxy GET is interrupted
There was an annex worker thread that did not get stopped.

It was stuck in ReceiveMessage from the P2PHandleTMVar.

Fixed by making P2PHandleTMVar closeable.

In serveGet, releaseP2PConnection has to come first, else the
annexworker may not shut down, if it's waiting to read from it.

In proxyConnection, call closeRemoteSide in order to wait for the ssh
process (for example).
2024-07-26 15:33:20 -04:00
Joey Hess
ad025b8e5e
clean up protocol version for proxying
The proxy always checks the protocol version of a remote before talking
to it in a version-specific way, so the protocol version in the ProxyParams
is the client's protocol version. The remote will always be at the same or
an older protocol version than the client.

Note that in relayDATAFinish, when the client is at protocol version 0,
the remote must thus be as well, and that's why its version is not
checked in the case for that.

With that clarified, it's evident that, in P2P.Http.State, there's no
need to look at the proxied remote's protocol version at all.
2024-07-26 13:49:05 -04:00
Joey Hess
cc1da2d516
http p2p proxy is now largely working 2024-07-26 10:44:10 -04:00
Joey Hess
b391756b32
remove some debugging 2024-07-25 21:36:10 -04:00
Joey Hess
6ef6ad808f
use a record to reduce the huge number of parameters 2024-07-25 15:23:18 -04:00
Joey Hess
3d14e2cf58
http server support for proxies, incomplete
Refactored git-annex-shell code so this can use checkCanProxy'.

At this point all that remains is opening a proxy connection,
and using a proxy connection.
2024-07-25 13:19:24 -04:00
Joey Hess
10f2c23fd7
fix slowloris timeout in hashing resume of download of large file
Hash the data that is already present in the file before connecting to
the http server.
2024-07-24 11:03:59 -04:00
Joey Hess
0594338a78
fix meter display on resume
It still needs to be offset, otherwise on resume from 80% it will
display 1%..20%.

Seems that this bug must have affected P2P.Annex as well where it runs
this code, but apparently it didn't affect it in a very user-visible
way. Maybe the transfer log file was updated incorrectly?
2024-07-24 10:28:48 -04:00
Joey Hess
7bd616e169
Remote.Git retrieveKeyFile works with annex+http urls
This includes a bugfix to serveGet; it hung at the end.
2024-07-24 10:28:44 -04:00
Joey Hess
a2d1844292
factor out resumeVerifyFromOffset 2024-07-24 09:29:44 -04:00
Joey Hess
b4d749cc91
Merge branch 'master' into httpproto 2024-07-23 21:17:06 -04:00
Joey Hess
f7404a64c0
Propagate --force to git-annex transferrer
And other child processes.
2024-07-23 21:16:56 -04:00
Joey Hess
4826a3745d
servePut and clientPut implementation
Made the data-length header required even for v0. This simplifies the
implementation, and doesn't preclude extra verification being done for
v0.

The connectionWaitVar is an ugly hack. In servePut, nothing waits
on the waitvar, and I could not find a good way to make anything wait on
it.
2024-07-22 10:27:44 -04:00
Joey Hess
74c6175795
fix serveGet early handle close
Needed that waitv after all..
2024-07-11 09:55:17 -04:00
Joey Hess
3b37b9e53f
fix serveGet hang
This came down to SendBytes waiting on the waitv. Nothing ever filled
it.

Only Annex.Proxy needs the waitv, and it handles filling it. So make it
optional.
2024-07-11 07:46:52 -04:00
Joey Hess
f9b7ce7224
add Annex worker pool to P2PHttp
This will be needed for get and store, since those need to run Annex
actions.

withLocalP2PConnections will also probably use it.
2024-07-10 12:19:47 -04:00
Joey Hess
f452bd448a
REMOVE-BEFORE and GETTIMESTAMP proxying
For clusters, the timestamps have to be translated, since each node can
have its own idea about what time it is. To translate a timestamp, the
proxy remembers what time it asked the node for a timestamp in
GETTIMESTAMP, and applies the delta as an offset in REMOVE-BEFORE.

This does mean that a remove from a cluster has to call GETTIMESTAMP on
every node before dropping from nodes. Not very efficient. Although
currently it tries to drop from every single node anyway, which is also
not very efficient.

I thought about caching the GETTIMESTAMP from the nodes on the first
call. That would improve efficiency. But, since monotonic clocks on
!Linux don't advance when the computer is suspended, consider what might
happen if one node was suspended for a while, then came back. Its
monotonic timestamp would end up behind where the proxying expects it to
be. Would that result in removing when it shouldn't, or refusing to
remove when it should? Have not thought it through. Either way, a
cluster behaving strangely for an extended period of time because one
of its nodes was briefly asleep doesn't seem like good behavior.
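
A sketch of the translation (the names are hypothetical; the real proxy
code differs):

import Data.Time.Clock.POSIX (POSIXTime)

-- Recorded when the proxy sends GETTIMESTAMP to a node.
data NodeClock = NodeClock
    { nodeAnswer :: POSIXTime -- the node's reported timestamp
    , askedAt    :: POSIXTime -- the proxy's own time when it asked
    }

-- Translate a proxy-side REMOVE-BEFORE timestamp into the node's
-- clock by applying the delta between the two clocks as an offset.
toNodeTime :: NodeClock -> POSIXTime -> POSIXTime
toNodeTime c t = t + (nodeAnswer c - askedAt c)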
2024-07-04 15:09:34 -04:00
Joey Hess
99b7a0cfe9
use REMOVE-BEFORE in P2P protocol
Only clusters still need to be fixed to close this todo.
2024-07-04 13:47:38 -04:00
Joey Hess
1243af4a18
toward SafeDropProof expiry checking
Added Maybe POSIXTime to SafeDropProof, which gets set when the proof is
based on a LockedCopy. If there are several LockedCopies, it uses the
closest expiry time. That is not optimal, it may be that the proof
expires based on one LockedCopy but another one has not expired. But
that seems unlikely to really happen, and anyway the user can just
re-run a drop if it fails due to expiry.

Pass the SafeDropProof to removeKey, which is responsible for checking
it for expiry in situations where that could be a problem. Which really
only means in Remote.Git.
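
A sketch of that expiry check (simplified; the real SafeDropProof
carries more than the expiry time):

import Data.Time.Clock.POSIX (POSIXTime, getPOSIXTime)

-- Nothing means the proof was not based on a LockedCopy, so there is
-- no expiry to check.
proofExpired :: Maybe POSIXTime -> IO Bool
proofExpired Nothing = return False
proofExpired (Just expiry) = do
    now <- getPOSIXTime
    return (now >= expiry)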

Made Remote.Git check expiry when dropping from a local remote.

Checking expiry when dropping from a P2P remote is not yet implemented.
P2P.Protocol.remove has SafeDropProof plumbed through to it for that
purpose.

Fixing the remaining 2 build warnings should complete this work.

Note that the use of a POSIXTime here means that if the clock gets set
forward while git-annex is in the middle of a drop, it may say that
dropping took too long. That seems ok. Less ok is that if the clock gets
turned back a sufficient amount (eg 5 minutes), proof expiry won't be
noticed. It might be better to use the Monotonic clock, but that doesn't
advance when a laptop is suspended, and while there is the linux
Boottime clock, that is not available on other systems. Perhaps a
combination of POSIXTime and the Monotonic clock could detect laptop
suspension and also detect clock being turned back?

There is a potential future flag day where
p2pDefaultLockContentRetentionDuration is not assumed, but is probed
using the P2P protocol, and peers that don't support it can no longer
produce a LockedCopy. Until that happens, when git-annex is
communicating with older peers there is a risk of data loss when
a ssh connection closes during LOCKCONTENT.
2024-07-04 12:39:06 -04:00
Joey Hess
d2b27ca136
add content retention files
This allows lockContentShared to lock content for eg, 10 minutes and
if the process then gets terminated before it can unlock, the content
will remain locked for that amount of time.
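
The idea, as a hedged sketch (hypothetical helpers, not git-annex's
actual retention file format):

import Data.Time.Clock.POSIX (getPOSIXTime)

-- Record an expiry time in a retention file; even if the locking
-- process is killed, other processes honor the lock until then.
writeRetention :: FilePath -> Integer -> IO ()
writeRetention f seconds = do
    now <- getPOSIXTime
    writeFile f (show (truncate now + seconds :: Integer))

retentionStillHeld :: FilePath -> IO Bool
retentionStillHeld f = do
    expiry <- read <$> readFile f
    now <- getPOSIXTime
    return (truncate now < (expiry :: Integer))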

The Windows implementation is not yet tested.

In P2P.Annex, a duration of 10 minutes is used. This way, when p2pstdio
or remotedaemon is serving the P2P protocol, and is asked to
LOCKCONTENT, and that process gets killed, the content will not be
subject to deletion. This is not a perfect solution to
doc/todo/P2P_locking_connection_drop_safety.mdwn yet, but it gets most
of the way there, without needing any P2P protocol changes.

This is only done in v10 and higher repositories (or on Windows). It
might be possible to backport it to v8 or earlier, but it would
complicate locking even further, and without a separate lock file, might
be hard. I think that by the time this fix reaches a given user, they
will probably have been running git-annex 10.x long enough that their v8
repositories will have upgraded to v10 after the 1 year wait. And it's
not as if git-annex hasn't already been subject to this problem (though
I have not heard of any data loss caused by it) for 6 years already, so
waiting another fraction of a year on top of however long it takes this
fix to reach users is unlikely to be a problem.
2024-07-03 14:58:39 -04:00
Joey Hess
2f2cc38c28
fix build on old ghc
getStdRandom used to be an IO action
2024-07-02 12:27:14 -04:00
Joey Hess
fa5e7463eb
fix display when proxied GET yields ERROR
The error message is not displayed to the user, but this mirrors the
behavior when a regular get from a special remote fails. At least now
there is not a protocol error.
2024-07-01 11:19:02 -04:00
Joey Hess
dce3848ad8
avoid populating proxy's object file when storing on special remote
Now that storeKey can have a different object file passed to it, this
complication is not needed. This avoids a lot of strange situations,
and will also be needed if streaming is eventually supported.
2024-07-01 10:53:49 -04:00
Joey Hess
8b5fc94d50
add optional object file location to storeKey
This will be used by the next commit to simplify the proxy.
2024-07-01 10:42:27 -04:00
Joey Hess
711a5166e2
PUT to proxied special remote working
Still needs some work.

The reason that the waitv is necessary is because without it,
runNet loops back around and reads the next protocol message. But it's
not finished reading the whole bytestring yet, and so it reads some part
of it.
2024-06-28 17:10:58 -04:00
Joey Hess
2e5af38f86
GET from proxied special remote
Working, but lots of room for improvement...

Without streaming, so there is a delay before download begins as the
file is retrieved from the special remote.

And when resuming it retrieves the whole file from the special remote
*again*.

Also, if the special remote throws an exception, currently it
shows as "protocol error".
2024-06-28 15:44:48 -04:00
Joey Hess
158d7bc933
fix handling of ERROR in response to REMOVE
This allows an error message from a proxied special remote to be
displayed to the client.

In the case where removal from several nodes of a cluster fails,
there can be several errors. What to do? I decided to only show
the first error to the user. Probably in this case the user is not in a
position to do anything about an error message, so best keep it simple.
If the problem with the first node is fixed, they'll see the error from
the next node.
2024-06-28 14:10:25 -04:00
Joey Hess
a6ea057f6b
fix handling of ERROR in response to CHECKPRESENT
That error is now rethrown on the client, so it will be displayed.

For example:

$ git-annex fsck x --fast --from AMS-dir
fsck x (special remote reports: directory /home/joey/tmp/bench2/dir is not accessible) failed

No protocol version check is needed. Because in order to talk to a
proxied special remote, the client has to be running the upcoming
git-annex release. Which has this fix in it.
2024-06-28 13:46:27 -04:00
Joey Hess
d3c75c003a
proxying special remotes
This is early, but already working for CHECKPRESENT.

However, when the special remote throws an exception on checkPresent,
this happens:

[2024-06-28 13:22:18.520884287] (P2P.IO) [ThreadId 4] P2P > ERROR directory /home/joey/tmp/bench2/dir is not accessible
[2024-06-28 13:22:18.521053135] (P2P.IO) [ThreadId 4] P2P < ERROR expected SUCCESS or FAILURE
git-annex: client error: expected SUCCESS or FAILURE
(fixing location log) p2pstdio: 1 failed

  ** Based on the location log, x
  ** was expected to be present, but its content is missing.
failed
2024-06-28 13:31:19 -04:00
Joey Hess
62750f0102
shut down RemoteSides cleanly
Before it just exited without actually shutting down the RemoteSides,
when the client hung up.
2024-06-28 13:19:57 -04:00
Joey Hess
cf59d7f92c
GET and CHECKPRESENT among lowest cost cluster nodes
Before it was using a node that might have had a higher cost.

Also threw in a random selection from among the low cost nodes. Of
course this is a poor excuse for load balancing, but it's better than
nothing. Most of the time...
2024-06-27 14:36:55 -04:00
Joey Hess
dabd05e547
remove a TODO marker
I have a todo item for this outside the code
2024-06-27 13:36:04 -04:00
Joey Hess
3dad9446ce
distributed cluster cycle prevention
Added BYPASS to P2P protocol, and use it to avoid cycling between
cluster gateways.

Distributed clusters are working well now!
2024-06-27 12:20:22 -04:00
Joey Hess
effaf51b1f
avoid loop between cluster gateways
The VIA extension is still needed to avoid some extra work and ugly
messages, but this is enough that it actually works.

This filters out the RemoteSides that are a proxied connection via a
remote gateway to the cluster.

The VIA extension will not filter those out, but will send VIA to them
on connect, which will cause the ones that are accessed via the listed
gateways to be filtered out.
2024-06-26 15:29:59 -04:00
Joey Hess
4172109c8d
support multi-gateway clusters
VIA extension still needed otherwise a copy to a cluster can loop
forever.
2024-06-26 15:07:03 -04:00
Joey Hess
cec2848e8a
support annex.jobs for clusters 2024-06-25 14:54:20 -04:00
Joey Hess
1bfe7f8a53
honor preferred content settings of cluster nodes
Except that when no nodes want a file, it has to be stored somewhere, so
it is stored on all of them. Which is not really desirable, but neither is having to
pick one.

ProtoAssociatedFile deserialization is rather broken, and this could
possibly affect preferred content expressions that match on filenames.

The inability to roundtrip whitespace like tabs and newlines through it is
not a problem because preferred content expressions can't be written
that match on whitespace such as a tab. For example:

joey@darkstar:~/tmp/bench/z>git-annex wanted  origin-node2 'exclude=*CTRL-VTab*'
wanted origin-node2
git-annex: Parse error: Parse failure: near "*"

But, the filtering of control characters could perhaps be a problem. I think
that filtering is now obsolete, git-annex has comprehensive filtering of
control characters when displaying filenames, that happens at a higher level.
However, I don't want to risk a security hole so am leaving in that filtering
in ProtoAssociatedFile deserialization for now.
2024-06-25 11:43:09 -04:00
Joey Hess
a23b0abf28
PUT to cluster send to all nodes rather than none
If the location log says all nodes contain content, pass in all nodes,
rather than none.

The location log can be wrong. While it's good to avoid unnecessary
connections to nodes that already contain a key, it would be bad to
refuse to accept an upload at all when the location log is wrong.

Also, passing in no nodes leaves the proxy in an untenable state. It
can't proxy to no nodes. So it closes the connection. Passing in all
nodes means it has to do the work to connect to all of them, and see
that they say they already have the content, and then it can tell the
client that.
2024-06-25 10:32:34 -04:00
Joey Hess
202ea3ff2a
don't sync with cluster nodes by default
Avoid `git-annex sync --content` etc from operating on cluster nodes by default
since syncing with a cluster implicitly syncs with its nodes. This avoids a
lot of unnecessary work when a cluster has a lot of nodes, just in checking
if each node's preferred content is satisfied. And it avoids content
being sent to nodes individually; instead, syncing with a cluster always
fans out uploads to its nodes.

The downside is that there are situations where a cluster's preferred content
settings can be met, but those of its nodes are not. Or where a node does not
contain a key, but the cluster does, and there are not enough copies of the key
yet, so it would be desirable to send it there. I think that's an acceptable
tradeoff. These kind of situations are ones where the cluster itself should
probably be responsible for copying content to the node. Which it can do much
less expensively than a client can. Part of the balanced preferred content
design that I will be working on in a couple of months involves rebalancing
clusters, so I expect to revisit this.

The use of annex-sync config does allow running git-annex sync with a specific
node, or nodes, and it will sync with it. And it's also possible to set
annex-sync git configs to make it sync with a node by default. (Although that
will require setting up an explicit git remote for the node rather than relying
on the proxied remote.)

Logs.Cluster.Basic is needed because Remote.Git cannot import Logs.Cluster
due to a cycle. And the Annex.Startup load of clusters happens
too late for Remote.Git to use that. This does mean one redundant load
of the cluster log, though only when there is a proxy.
2024-06-25 10:24:38 -04:00
Joey Hess
5b332a87be
dropping from clusters
Dropping from a cluster drops from every node of the cluster.
Including nodes that the cluster does not think have the content.
This is different from GET and CHECKPRESENT, which do trust the
cluster's location log. The difference is that removing from a cluster
should make 100% the content is gone from every node. So doing extra
work is ok. Compare with CHECKPRESENT where checking every node could
make it very expensive, and the worst that can happen in a false
negative is extra work being done.

Extended the P2P protocol with FAILURE-PLUS to handle the case where a
drop from one node succeeds, but a drop from another node fails. In that
case the entire cluster drop has failed.

Note that SUCCESS-PLUS is returned when dropping from a proxied remote
that is not a cluster, when the protocol version supports it. This is
because P2P.Proxy does not know when it's proxying for a single node
cluster vs for a remote that is not a cluster.
2024-06-23 09:43:40 -04:00
Joey Hess
7bbd822a17
avoid using cluster nodes in drop proof when dropping from cluster
This is obviously necessary in order for dropping from a cluster to be able to
drop from all nodes.

It also avoids violating numcopies when a cluster node is a special remote.
If it were used in the drop proof, nothing would prevent the cluster from
dropping from it.
2024-06-23 06:20:11 -04:00
Joey Hess
6eac3112e5
be quiet when reading cluster and proxy information at startup
I had a transfer of 3 files fail like this:

git-annex: transferrer protocol error: "(recording state in git...)"

The remote had stalldetection enabled, although I didn't see it stall.
So git-annex transferrer would have been started up. I guess that
one of these new git-annex branch reads, which happen early, caused
that message, perhaps due to an uncommitted git-annex branch change.

Since the transferrer speaks a protocol over stdout, it needs to be
prevented from outputting other messages to stdout. Interestingly,
startupAnnex is run after prepRunCommand, so if a command requests quiet
output it would already be quiet. But the transferrer does not, instead
it calls Annex.setOutput SerializedOutput in its start action.
2024-06-18 21:31:32 -04:00
Joey Hess
f18740699e
P2P protocol version 2, adding SUCCESS-PLUS and ALREADY-HAVE-PLUS
Client side support for SUCCESS-PLUS and ALREADY-HAVE-PLUS
is complete: when a PUT stores to more repositories
than the expected one, the location log is updated with the
additional UUIDs that contain the content.

Started implementing PUT fanout to multiple remotes for clusters.
It is untested, and I fear fencepost errors in the relative
offset calculations. And it is missing proxying for the protocol
after DATA.
2024-06-18 16:21:40 -04:00
Joey Hess
fb0fd78485
only use a remote as a node when git configuration is set
Avoids someone writing to cluster.log and nominating remotes
of someone else's repository as a cluster.
2024-06-18 11:37:38 -04:00
Joey Hess
f049156a03
checkpresent support for clusters
This assumes that the proxy for a cluster has up-to-date location
logs. If it didn't, it might proxy the checkpresent to a node that no
longer has the content, while some other node still does, and so
it would incorrectly appear that the cluster no longer contains the
content.

Since cluster UUIDs are not stored to location logs,
git-annex fsck --fast, when claiming to fix a location log after
that occurred, would not cause any problems. And presumably the location
tracking would later get sorted out.

At least usually, changes to the content of nodes go via the proxy,
and it will update its location logs, so they will be accurate. However,
if there were multiple proxies to the same cluster, or nodes were
accessed directly (or via proxy to the node and not the cluster),
the proxy's location log could certainly be wrong.

(The location log access for GET has the same issues.)
2024-06-18 11:16:16 -04:00
Joey Hess
88d9a02f7c
initial, working support for getting from clusters
Currently tends to put all the load on a single node, which will need to
be improved.
2024-06-18 11:01:10 -04:00
Joey Hess
d34326ab76
factor out Annex.Proxy 2024-06-18 10:51:37 -04:00
Joey Hess
f0d6114286
refactor cluster code into own module 2024-06-18 10:36:04 -04:00
Joey Hess
64afbb0b93
don't count clusters as copies, continued
Handled limitCopies, as well as everything using fromNumCopies and
fromMinCopies.

This should be everything, probably.

Note that git-annex info displays a count of repositories, which still
includes clusters. I think that's ok. It would be possible to filter out
clusters there, but to the user they're pretty much just another
repository. The numcopies displayed by eg `git-annex info .` does not
include clusters.
2024-06-16 15:14:53 -04:00
Joey Hess
780367200b
remove dead nodes when loading the cluster log
This is to avoid inserting a cluster uuid into the location log when
only dead nodes in the cluster contain the content of a key.

One reason why this is necessary is Remote.keyLocations, which excludes
dead repositories from the list. But there are probably many more.

Implementing this was challenging, because Logs.Location importing
Logs.Cluster which imports Logs.Trust which imports Remote.List resulted
in an import cycle through several other modules.

Resorted to making Logs.Location not import Logs.Cluster, and instead
it assumes that Annex.clusters gets populated when necessary before it's
called.

That's done in Annex.Startup, which is run by the git-annex command
(but not other commands) at early startup in initialized repos. Or,
it is run after initialization.

Note that in Remote.Git, it is unable to import Annex.Startup, because
Remote.Git importing Logs.Cluster leads to the same import cycle.
So ensureInitialized is not passed annexStartup in there.

Other commands, like git-annex-shell currently don't run annexStartup
either.

So there are cases where Logs.Location will not see clusters. So it won't add
any cluster UUIDs when loading the log. That's ok, the only reason to do
that is to make display of where objects are located include clusters,
and to make commands like git-annex get --from treat keys as being located
in a cluster. git-annex-shell certainly does not do anything like that,
and I'm pretty sure Remote.Git (and callers to Remote.Git.onLocalRepo)
don't either.
2024-06-16 14:39:44 -04:00
Joey Hess
36c6d8da69
don't count clusters as copies
Since the cluster UUID is inserted into the location log when the
location log lists a node as containing content.

Also avoid trying to lock content on cluster remotes. The cluster nodes
are also proxied, so that content can be locked on individual nodes, and
locking content on a cluster as a whole probably won't be implemented.

And made git-annex whereis use numcopies machinery for displaying its
count, so it won't count cluster UUIDs redundantly to nodes.
Other commands, like git-annex info that also display numcopies
information already used the numcopies machinery.

There is more to be done, fromNumCopies is sometimes used to get a
number that is compared with a list of UUIDs. And limitCopies doesn't
use numcopies machinery.
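
Conceptually, the counting change is (a hypothetical simplification):

import qualified Data.Set as S

type UUID = String

-- Drop cluster UUIDs before counting, so a cluster and its nodes are
-- never double-counted as independent copies.
countCopies :: S.Set UUID -> [UUID] -> Int
countCopies clusterUUIDs = length . filter (`S.notMember` clusterUUIDs)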
2024-06-16 14:17:56 -04:00
Joey Hess
a2f4a8eddf
proxying GET now working
Memory use is small and constant; receiveBytes returns a lazy bytestring
and it does stream.

Comparing speed of a get of a 500 mb file over proxy from origin-origin,
vs from the same remote over a direct ssh:

joey@darkstar:~/tmp/bench/client>/usr/bin/time git-annex get bigfile --from origin-origin
get bigfile (from origin-origin...)
ok
(recording state in git...)
1.89user 0.67system 0:10.79elapsed 23%CPU (0avgtext+0avgdata 68716maxresident)k
0inputs+984320outputs (0major+10779minor)pagefaults 0swaps

joey@darkstar:~/tmp/bench/client>/usr/bin/time git-annex get bigfile --from direct-ssh
get bigfile (from direct-ssh...)
ok
1.79user 0.63system 0:10.49elapsed 23%CPU (0avgtext+0avgdata 65776maxresident)k
0inputs+1024312outputs (0major+9773minor)pagefaults 0swaps

So the proxy doesn't add much overhead even when run on the same machine as
the client and remote.

Still, piping receiveBytes into sendBytes like this does suggest that the proxy
could be made to use less CPU resources by using `sendfile()`.
2024-06-11 15:09:43 -04:00
Joey Hess
09b5e53f49
set annex.uuid in proxy's Repo
getRepoUUID looks at that, and was seeing the annex.uuid of the proxy.
Which caused it to unnecessarily set the git config. Probably also would
have led to other problems.
2024-06-11 13:40:50 -04:00
Joey Hess
649b87bedd
Merge branch 'master' into proxy 2024-06-10 14:26:18 -04:00
Joey Hess
25a6ab6f11
Avoid grafting in export tree objects that are missing
They could be missing due to an interrupted git-annex at just the wrong
time during a prior graft, after which the tree objects got garbage
collected.

Or they could be missing because of manual messing with the git-annex
branch, eg resetting it to back before the graft commit.

Sponsored-by: Dartmouth College's OpenNeuro project
2024-06-07 16:51:50 -04:00
Joey Hess
b32c4c2e98
atomic git-annex branch update when regrafting in transition
Fix a bug where interrupting git-annex while it is updating the git-annex
branch could lead to git fsck complaining about missing tree objects.

Interrupting git-annex while regraftexports is running in a transition
that is forgetting git-annex branch history would leave the
repository with a git-annex branch that did not contain the tree shas
listed in export.log. That lets those trees be garbage collected.

A subsequent run of the same transition then regrafts the trees listed
in export.log into the git-annex branch. But those trees have been lost.

Note that both sides of `if neednewlocalbranch` are atomic now. I had
thought only the True side needed to be, but I do think there may be
cases where the False side needs to be as well.

Sponsored-by: Dartmouth College's OpenNeuro project
2024-06-07 16:34:10 -04:00
Joey Hess
b43c835def
instantiate remotes that are behind a proxy remote
Untested, but this should be close to working. The proxied remotes have
the same url but a different uuid. When talking to current
git-annex-shell, it will fail due to a uuid mismatch. Once it supports
proxies, it will know that the presented uuid is for a remote that it
proxies for.

The check for any git config settings for a remote with the same name as
the proxied remote is there for several reasons. One is security:
Writing a name to the proxy log should not cause changes to
how an existing, configured git remote operates in a different clone of
the repo.

It's possible that the user has been using a proxied remote, and decides
to set a git config for it. We can't tell the difference between that
scenario and an evil remote trying to eg, intercept a file upload
by replacing their remote with a proxied remote.

Also, if the user sets some git config, does it override the config
inherited from the proxy remote? Seems a difficult question. Luckily,
the above means we don't need to think through it.

This does mean though, that in order for a user to change the config of
a proxy remote, they have to manually set its annex-uuid and url, as
well as the config they want to change. They may also have to set any of
the inherited configs that they were relying on.
2024-06-06 17:15:32 -04:00
Joey Hess
3318d25c65
adjust unlocked execute bit handling
When building an adjusted unlocked branch, make pointer files executable
when the annex object file is executable.

This slows down git-annex adjust --unlock/--unlock-present by needing to
stat all annex object files in the tree. Probably not a significant
slowdown compared to other work they do, but I have not benchmarked.

I chose to leave git-annex adjust --unlock marked as stable, even though
get or drop of an object file can change whether it would make the pointer
file executable. Partly because making it unstable would slow down
re-adjustment, and partly for symmetry with the handling of an unlocked
pointer file that is executable when the content is dropped, which does not
remove its execute bit.
2024-05-28 12:39:42 -04:00
Joey Hess
19418e81ee
git-remote-annex: Display full url when using remote with the shorthand url 2024-05-24 17:15:31 -04:00
Joey Hess
adcebbae47
clean up git-remote-annex git-annex branch handling
Implemented alternateJournal, which git-remote-annex
uses to avoid any writes to the git-annex branch while setting up
a special remote from an annex:: url.

That prevents the remote.log from being overwritten with the special
remote configuration from the url, which might not be 100% the same as
the existing special remote configuration.

And it prevents an overwrite from deleting other stuff that was
already in the remote.log.

Also, when the branch was created by git-remote-annex, only delete it
at the end if nothing else has been written to it by another command.
This fixes the race condition described in
797f27ab05, where git-remote-annex
set up the branch and git-annex init and other commands were
run at the same time and their writes to the branch were lost.
2024-05-15 17:33:38 -04:00
Joey Hess
ff5193c6ad
Merge branch 'master' into git-remote-annex 2024-05-10 14:20:36 -04:00
Joey Hess
59fc2005ec
git clone support for git-remote-annex
Also support using annex:: urls that specify the whole special remote
config.

Both of these cases need a special remote to be initialized enough to
use it, which means writing to .git/config but not to the git-annex
branch. When cloning, the remote is left set up in .git/config,
so further use of it, by git-annex or git-remote-annex will work. When
using git with an annex:: url, a temporary remote is written to
.git/config, but then removed at the end.

While that's a little bit ugly, the fact is that the Remote interface
expects that it's ok to set git configs of the remote that is being
initialized. And it's nowhere near as ugly as the alternative of making
a temporary git repository and initializing the special remote in there.

Cloning from a repository that does not contain a git-annex branch and
then later running git-annex init is currently broken, although I've
gotten most of the way there to supporting it.
See cleanupInitialization FIXME.

Special shout out to git clone for running gitremote-helpers with
GIT_DIR set, but not in the git repository and with GIT_WORK_TREE not
set. Resulting in needing the fixupRepo hack.

Sponsored-by: unqueued on Patreon
2024-05-08 17:07:33 -04:00
Yaroslav Halchenko
87e2ae2014
run codespell throughout fixing typos automagically
=== Do not change lines below ===
{
 "chain": [],
 "cmd": "codespell -w",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [],
 "pwd": "."
}
^^^ Do not change lines above ^^^
2024-05-01 15:46:21 -04:00
Joey Hess
c410b2bb73
annex.maxextensions configuration
Controls how many filename extensions to preserve.

Sponsored-by: the NIH-funded NICEMAN (ReproNim TR&D3) project
2024-04-18 14:23:38 -04:00
Joey Hess
c64a73c7ea
startExternalAddonProcess add parameters
Not used yet but intended to support eg running "rclone gitannex"
2024-04-17 13:09:10 -04:00
Joey Hess
2c73845d90
multiple -m second try
Test suite passes this time. When committing the adjusted branch, use
the old method to make a message that old git-annex can consume. Also
made the code accept the new message, so that eventually
commitTreeExactMessage can be removed.

Sponsored-by: Kevin Mueller on Patreon
2024-04-09 12:56:47 -04:00
Joey Hess
a8dd85ea5a
Revert "multiple -m"
This reverts commit cee12f6a2f.

This commit broke git-annex init run in a repo that was cloned from a
repo with an adjusted branch checked out.

The problem is that findAdjustingCommit was not able to identify the
commit that created the adjusted branch. It seems that there is an extra
"\n" at the end of the commit message that it does not expect.

Since backwards compatibility needs to be maintained, cannot just make
findAdjustingCommit accept it with the "\n". Will have to instead
have one commitTree variant that uses the old method, and use it for
adjusted branch committing.
2024-04-02 17:29:07 -04:00
Joey Hess
cee12f6a2f
multiple -m
sync, assist, import: Allow -m option to be specified multiple times, to
provide additional paragraphs for the commit message.

The option parser didn't allow multiple -m before, so there is no risk of
behavior change breaking something that was for some reason using multiple
-m already.

Pass through to git commands, so that the method used to assemble the
paragraphs is whatever git does. Which might conceivably change in the
future.

Note that git commit-tree has supported -m since git 1.7.7. commitTree
was probably not using it since it predates that version. Since the
configure script prevents building git-annex with git older than 2.1,
there is no risk that it's not supported now.

Sponsored-by: Nicholas Golder-Manning on Patreon
2024-03-27 15:58:27 -04:00
Joey Hess
f601e06b90
avoid build warning on windows 2024-03-26 14:07:41 -04:00
Joey Hess
a69871491f
avoid build warning on windows
Since append was only exported by Annex.Common on unix, excluding it
from import caused a build warning on windows.
2024-03-26 13:16:33 -04:00
Joey Hess
f04d9574d6
fix transfer lock file for Download to not include uuid
While redundant concurrent transfers were already prevented in most
cases, it failed to prevent the case where two different repositories were
sending the same content to the same repository. By removing the uuid
from the transfer lock file for Download transfers, one repository
sending content will block the other one from also sending the same
content.

In order to interoperate with old git-annex, the old lock file is still
locked, as well as locking the new one. That added a lot of extra code
and work, and the plan is to eventually stop locking the old lock file,
at some point in time when an old git-annex process is unlikely to be
running at the same time.

Note that in the case of 2 repositories both doing eg
`git-annex copy foo --to origin`
the output is not that great:

copy b (to origin...)
  transfer already in progress, or unable to take transfer lock
git-annex: transfer already in progress, or unable to take transfer lock
97%   966.81 MiB      534 GiB/s 0sp2pstdio: 1 failed

  Lost connection (fd:14: hPutBuf: resource vanished (Broken pipe))

  Transfer failed

Perhaps that output could be cleaned up? Anyway, it's a lot better than letting
the redundant transfer happen and then failing with an obscure error about
a temp file, which is what it did before. And it seems users don't often
try to do this, since nobody ever reported this bug to me before.
(The "97%" there is actually how far along the *other* transfer is.)

Sponsored-by: Joshua Antonishen on Patreon
2024-03-25 14:47:46 -04:00
Joey Hess
62129f0b24
fix windows transfer lock check
If the lock file was not able to be exclusively locked, don't indicate
locking failed. I'm pretty sure this was a typo. It goes all the way
back to 891c85cd88 where locking was first
introduced on windows, and there's no indication of why it would make
sense to return True here.

Sponsored-by: Leon Schuermann on Patreon
2024-03-25 14:11:25 -04:00
Joey Hess
9c988ee607
handle multiple VURL checksums in one pass
git-annex fsck and some other commands that verify the content of a key
were using the non-incremental verification interface. But for VURL
urls, that interface is inefficient because when there are multiple
equivalent keys, it has to separately read and checksum for each key in
turn until one matches. It's more efficient for those to use the
incremental interface, since the file can be read a single time.
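
What the incremental interface enables, as a sketch
(IncrementalVerifier here is a hypothetical stand-in):

import qualified Data.ByteString as B

-- One incremental verifier per equivalent key; each chunk read from
-- the file is fed to all of them, so the file is read only once no
-- matter how many keys have to be tried.
type IncrementalVerifier = B.ByteString -> IO ()

feedAll :: [IncrementalVerifier] -> [B.ByteString] -> IO ()
feedAll verifiers = mapM_ (\chunk -> mapM_ ($ chunk) verifiers)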

There's no real downside to using the incremental interface when available.

Note that more speedup could be had for VURL, if it was able to
calculate the checksum a single time and then compare with the
equivalent keys' checksums. When the equivalent keys use the same type of
checksum.

Sponsored-by: k0ld on Patreon
2024-03-01 14:41:10 -04:00
Joey Hess
cc17ac423b
implement isCryptographicallySecureKey for VURL
Considerable difficulty to work around an import cycle. Had to move the
list of backends (except for VURL) to Backend.Variety so VURL could use
it.

Sponsored-by: Kevin Mueller on Patreon
2024-02-29 17:26:35 -04:00
Joey Hess
e7b7ea78af
lift isCryptographicallySecure to Annex
Needed for VURL backend.

Sponsored-by: Nicholas Golder-Manning on Patreon
2024-02-29 16:14:13 -04:00
Joey Hess
0f7143d226
support VURL backend
Not yet implemented is recording hashes on download from web and
verifying hashes.

addurl --verifiable option added with -V short option because I
expect a lot of people will want to use this.

It seems likely that --verifiable will become the default eventually,
and possibly rather soon. While old git-annex versions don't support
VURL, that doesn't prevent using them with keys that use VURL. Of
course, they won't verify the content on transfer, and fsck will warn
that it doesn't know about VURL. So there's not much problem with
starting to use VURL even when interoperating with old versions.

Sponsored-by: Joshua Antonishen on Patreon
2024-02-29 13:48:51 -04:00
Joey Hess
70cb41028e
Pass --no-warnings to yt-dlp
Noticed a warning with -J2 causing git-annex progress output to get slightly
messed up.

Error output would also probably do that, so perhaps it should capture
stderr and only display it when yt-dlp exited nonzero?

This option might also make sense for youtube-dl, I don't have an
installation handy anymore to check.
2024-02-19 18:35:57 -04:00
Joey Hess
68e99513f0
added annex.commitmessage-command config
Sponsored-by: the NIH-funded NICEMAN (ReproNim TR&D3) project
2024-02-12 14:35:22 -04:00
Joey Hess
90db97d9a2
importfeed: Added --scrape option
Which uses yt-dlp to screen scrape the equivalent of an RSS feed.

Note that youtubedlscraped is a speed optimisation. Since yt-dlp found
the urls, we know it can download them. That avoids calling
youtubeDlSupported on each url, which makes --fast a lot faster.

Almost all the same metadata fields and file formatting fields are
populated, when yt-dlp is able to get the data. Note that yt-dlp has some
additional useful metadata that could be exposed. But, much of it is
specific to particular websites, and it would be hard to document on the
git-annex importfeed man page.

Sponsored-by: unqueued on Patreon
2024-01-30 15:37:29 -04:00
Joey Hess
2114253eaf
update comment
The segfault seems to be fixed with git 2.43; I'm not sure what the
affected range was.
2024-01-20 11:25:22 -04:00
Joey Hess
20567e605a
add directional stalldetection and bwlimit configs
Sponsored-by: Dartmouth College's DANDI project
2024-01-19 15:27:53 -04:00
Joey Hess
8da85fd3a3
RawFilePath conversion
Sponsored-by: Dartmouth College's DANDI project
2024-01-19 14:26:21 -04:00
Joey Hess
703a70cafa
avoid watchFileSize running backward
This is groundwork for using watchFileSize for downloads from external
special remotes.

In Annex.Content.downloadUrl, this potentially avoids jitter in the
progress meter. When downloading with conduit, the meter gets updated based
on both the size of the file, and on the data flowing through conduit.
If that has not yet been flushed to the file, it seems possible for the
meter to run backwards when the meter is updated with the file size.
It's probably only a few kb of jitter, so may not be visible.
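
A sketch of the guard this implies (hypothetical; the real
watchFileSize code differs):

import Control.Monad (when)
import Data.IORef

-- Wrap a meter-update callback so the displayed byte count can only
-- move forward, absorbing jitter between the two update sources.
mkMonotonicMeter :: (Integer -> IO ()) -> IO (Integer -> IO ())
mkMonotonicMeter update = do
    highwater <- newIORef 0
    return $ \n -> do
        prev <- readIORef highwater
        when (n > prev) $ do
            writeIORef highwater n
            update n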

Sponsored-by: Dartmouth College's DANDI project
2024-01-19 14:11:27 -04:00
Joey Hess
df35f70801
tweak stall detection scaling
Refactored to allow offline experimentation, and ended up changing the
allowedvariation (aka fudge factor) to 3. 10 seems too high, and 1.5 too low.

Scale earlier, so even if the first chunk takes less than the configured
time period, allowance is made that later chunks might transfer slower.
Decided to use the same allowedvariation to decide when to start
scaling.

Smoothed the scaling out.

Some examples:

ghci> upscale (BwRate 10 (Duration 60)) 25
BwRate 13 (Duration {durationSeconds = 75})
-- A small scaling upwards after 1/3rd the time. Not noticeable.
ghci> upscale (BwRate 10 (Duration 60)) 60
BwRate 30 (Duration {durationSeconds = 180})
-- At the configured time, 3x scaling.
ghci> upscale (BwRate 10 (Duration 60)) 120
BwRate 60 (Duration {durationSeconds = 360})
-- A typical upscaling, here a 1 minute duration became 6 minutes
-- due to the first chunk taking 2 minutes to transfer.
ghci> upscale (BwRate 10 (Duration 60)) 600
BwRate 300 (Duration {durationSeconds = 1800})
-- Here the first chunk took 10 minutes to transfer, so it will
-- take 30 minutes to detect a stall.

Sponsored-by: Dartmouth College's DANDI project
2024-01-19 12:58:41 -04:00
Joey Hess
c2634e7df2
automatically adjust stall detection period
Improve annex.stalldetection to handle remotes that update progress less
frequently than the configured time period.

In particular, this makes remotes that don't report progress but are
chunked work when transferring a single chunk takes longer than the
specified time period.

Any remotes that just have very low update granularity would also be
handled by this.

The change to Remote.Helper.Chunked avoids an extra progress update when
resuming an interrupted upload. In that case, the code saw first Nothing
and then Just the already transferred number of bytes, which defeated this
new heuristic. This change will mean that, when resuming an interrupted
upload to a chunked remote that does not do its own progress reporting, the
progress display does not start out displaying the amount sent so far,
until after the first chunk is sent. This behavior change does not seem
like a major problem.

About the scalefudgefactor, it seems reasonable to expect subsequent chunks
to take no more than 1.5 times as long as the first chunk to transfer.
Could set it to 1, but then any chunk taking a little longer would be
treated as a stall. 2 also seems a likely value. Even 10 might be fine?

Sponsored-by: Dartmouth College's DANDI project
2024-01-18 17:12:10 -04:00
Joey Hess
f6cf2dec4c
disk free checking for unsized keys
Improve disk free space checking when transferring unsized keys to
local git remotes. Since the size of the object file is known, can
check that instead.

Getting unsized keys from local git remotes does not check the actual
object size. It would be harder to handle that direction because the size
check is run locally, before anything involving the remote is done. So it
doesn't know the size of the file on the remote.

Also, transferring unsized keys to other remotes, including ssh remotes and
p2p remotes don't do disk size checking for unsized keys. This would need a
change in protocol.

(It does seem like it would be possible to implement the same thing for
directory special remotes though.)

In some sense, it might be better to not ever do disk free checking for
unsized keys, than to do it only sometimes. A user might notice this
direction working and consider it a bug that the other direction does not.
On the other hand, disk reserve checking is not implemented for most
special remotes at all, and yet it is implemented for a few, which is also
inconsistent, but best effort. And so doing this best effort seems to make
some sense. Fundamentally, if the user wants the size to always be checked,
they should not use unsized keys.

Sponsored-by: Brock Spratlen on Patreon
2024-01-16 14:29:10 -04:00
Joey Hess
11b9069dc2
bump copyright year
after my first commit of 2024
2024-01-02 14:10:52 -04:00
Joey Hess
a5b9c2ca69
import: Sped up import from special remote when the imported tree is unchanged
I saw a nearly 2 minute speed up from this, in a repo with 56000 files,
some of which are preferred content of the special remote and others not.
In such a case, addBackExportExcluded has to do a lot of work, which is
unnecessary when the tree is unchanged.

When using sync --content, preferred content checking of that many files
takes about 1 minute. So this speeds up sync --content by 3x.
When using git-annex import, the speed up is much larger.
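
The short-circuit has roughly this shape (names and types are stand-ins
for illustration):

	-- toy stand-in for git-annex's own tree sha type
	newtype Sha = Sha String deriving Eq

	-- Compare tree shas and skip the expensive pass when nothing changed.
	reconcileImport :: Sha -> Sha -> IO () -> IO ()
	reconcileImport oldtree newtree addBackExcluded
	    | newtree == oldtree = pure () -- unchanged tree: nothing to add back
	    | otherwise = addBackExcluded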

Sponsored-by: Nicholas Golder-Manning on Patreon
2024-01-02 13:57:31 -04:00
Joey Hess
9a67ed0f10
importtree: support preferred content expressions needing keys
When importing from a special remote, support preferred content expressions
that use terms that match on keys (eg "present", "copies=1"). Such terms
are ignored when importing, since the key is not known yet.

When "standard" or "groupwanted" is used, the terms in those
expressions also get pruned accordingly.

This does allow setting preferred content to "not (copies=1)" to make a
special remote into a "source" type of repository. Importing from it will
import all files. Then exporting to it will drop all files from it.

In the case of setting preferred content to "present", it's pruned on
import, so everything gets imported from it. Then on export, it's applied,
and everything in it is left on it, and no new content is exported to it.

Since the old behavior on these preferred content expressions was for
importtree to error out, there's no backwards compatibility to worry about.
Except that sync/pull/etc will now import where before it errored out.
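
A toy model of the pruning; git-annex's real matcher types are more
involved, but the idea is that any subtree needing a key is treated as
matching while importing:

	data Term
	    = Present        -- needs a key
	    | Copies Int     -- needs a key
	    | Include String -- matches on the file name alone
	    | Not Term
	    | And Term Term
	    | Or Term Term
	    | MatchAll
	    deriving Show

	needsKey :: Term -> Bool
	needsKey Present = True
	needsKey (Copies _) = True
	needsKey (Not t) = needsKey t
	needsKey (And a b) = needsKey a || needsKey b
	needsKey (Or a b) = needsKey a || needsKey b
	needsKey _ = False

	-- At import time the key is unknown, so any key-dependent subtree
	-- is treated as matching; eg "not (copies=1)" imports everything.
	pruneForImport :: Term -> Term
	pruneForImport (And a b) = And (pruneForImport a) (pruneForImport b)
	pruneForImport (Or a b) = Or (pruneForImport a) (pruneForImport b)
	pruneForImport t
	    | needsKey t = MatchAll
	    | otherwise = t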
2023-12-18 16:27:59 -04:00
Joey Hess
eb59da9dd2
Lower precision of timestamps in git-annex branch
This can reduce the size of the branch by up to 8%. My test was
running git-annex add 1000 times, adding one file each time.
Lots of different high-resolution timestamps were recorded before;
after eliminating those and packing, the git repo was 8% smaller.

Due to the use of vector clocks, high resolution timestamps are
not necessary to make clear which information is most recent when
eg, a value is changed repeatedly in the same second. In such a
case, the vector clock will be advanced to the next second after
the last modification. For example, running
git-annex numcopies 1; git-annex numcopies 2
The first will record the current second, while the next records
the second after that even if it runs in the same second.
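
The advance rule, sketched with plain seconds standing in for the real
VectorClock type:

	-- When a new write's timestamp would not sort after the entry it
	-- replaces, advance to the following second instead.
	advanceClock :: Integer -> Integer -> Integer
	advanceClock lastclock now
	    | now > lastclock = now
	    | otherwise = lastclock + 1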

As for conflicting information written to two different clones of the
repository, this will make git-annex sometimes pick information that
was written earlier in a second over information written later in the
same second. Usually git-annex does not write conflicting information,
but there are some cases where it could. Eg, storing an object on a remote
can update the remote state log with some state. If two repos both store the
same object, and end up storing different remote state for some reason,
this can result in one that ran a tiny bit later winning. Such a situation
seems unlikely to be user visible. And a small amount of clock skew could
already result in such things.

The only case I can think of where this might be a user visible change
is if a configuration command like git-annex numcopies is being run
in 2 clones of a repository on the same machine at very
close to the same time. Then the user will know which they ran last,
and git-annex won't.

If that did become a problem, this could be dialed back to eg log
milliseconds with still some space saving.
2023-12-11 15:04:06 -04:00
Joey Hess
86dbe9a825
migrate: support adding size back to URL keys
migrate: Support adding size to URL keys that were added with --relaxed, by
running eg: git-annex migrate --backend=URL foo

Since url keys cannot be generated, that used to fail. Make it notice that
the backend is not changed, and just get the size of the content.

Sponsored-by: Brock Spratlen on Patreon
2023-12-08 16:22:14 -04:00
Joey Hess
b65379a107
fix missing space in warning message 2023-12-08 12:36:33 -04:00
Joey Hess
f1ce15036f
started migrate --update
This is most of the way there, but not quite working.

The layout of migrate.tree/ needs to be changed to follow this approach.
git log will list all the files in tree order, so the new layout needs
to alternate old and new keys. Can that be done? git may not document
tree order, or may not preserve it here.

Alternatively, change to using git log --format=raw and extract
the tree header from that, then use
git diff --raw $tree:migrate.tree/old $tree:migrate.tree/new
That will be a little more expensive, but only when there are lots of
migrations.

Sponsored-by: Joshua Antonishen on Patreon
2023-12-07 15:50:52 -04:00
Joey Hess
0bd8b17b59
log migration trees to git-annex branch
This will allow distributed migration: Start a migration in one clone of
a repo, and then update other clones.

commitMigration is a bit of a bear.. There is some inversion of control
that needs some TMVars. Also streamLogFile's finalizer does not handle
recording the trees, so an interrupt at just the wrong time can cause
migration.log to be emptied but the git-annex branch not updated.

Sponsored-by: Graham Spencer on Patreon
2023-12-06 15:40:03 -04:00
Joey Hess
fd0b510573
improve message about 1 copy
"Could only verify the existence of 0 out of 1 necessary copy"
does not sound right, but neither does it with "copies".

Kept the "1" rather than "only" or such since numcopies is mentioned.

Sponsored-by: Brock Spratlen on Patreon
2023-12-04 11:12:54 -04:00
Joey Hess
1654572bc1
fix --from overriding annex-ignore
Make git-annex get/copy/move --from foo override configuration of
remote.foo.annex-ignore, as documented.

This already worked for remotes supporting hasKeyCheap. For others though,
git-annex copy --from foo would silently not do anything, while
git-annex copy --to foo would use the annex-ignored remote.

Also improved the annex-ignore docs, to reflect that `git-annex get`
without --from will skip using annex-ignored remotes, for example.

Sponsored-by: Dartmouth College's DANDI project
2023-11-30 15:12:07 -04:00
Joey Hess
38b9ebc5fd
newtype MapLog
Noticed that the Semigroup instance of Map is not suitable to use
for MapLog. For example, it behaved like this:

ghci>  parseTrustLog "foo 1 timestamp=10\nfoo 2 timestamp=11" <> parseTrustLog "foo X timestamp=12"
fromList [(UUID "foo",LogEntry {changed = VectorClock 11s, value = SemiTrusted})]

Which was wrong, it lost the newer DeadTrusted value.

Luckily, nothing used that Semigroup when operating on a MapLog. And this
provides a safe instance.
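
A sketch of such a safe instance, with a simplified clock type; the
union keeps whichever entry has the newer clock, instead of Map's
left-biased (<>):

	import qualified Data.Map.Strict as M

	data LogEntry v = LogEntry { changed :: Integer, value :: v }

	newtype MapLog f v = MapLog (M.Map f (LogEntry v))

	-- unionWith consults both entries on a collision, so the newer
	-- value wins regardless of which side of <> it came from.
	instance Ord f => Semigroup (MapLog f v) where
	    MapLog a <> MapLog b = MapLog (M.unionWith newer a b)
	      where
	        newer x y = if changed y > changed x then y else x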

Sponsored-by: Graham Spencer on Patreon
2023-11-13 14:37:22 -04:00
Joey Hess
be6b56df4c
remove unused import 2023-11-01 13:14:39 -04:00
Joey Hess
eb42935e58
Windows: Fix CRLF handling in some log files
In particular, the mergedrefs file was written with CR added to each line,
but read without CRLF handling. This resulted in each update of the file
adding CR to each line in it, growing the number of lines, while also
preventing the optimisation from working, so it remerged unnecessarily.

writeFile and readFile do NewlineMode translation on Windows. But the
conversion to ByteString stopped that from happening.

I've audited for other cases of this, and found three more
(.git/annex/index.lck, .git/annex/ignoredrefs, and .git/annex/import/). All
of those also only prevent optimisations from working. Some other files are
currently both read and written with ByteString, but old git-annex may have
written them with NewlineMode translation. Other files are at risk for
breakage later if the reader gets converted to ByteString.

This is a minimal fix, but should be enough, as long as I remember to use
fileLines when splitting a ByteString into lines. This leaves files written
using ByteString without CR added, but that's ok because old git-annex has
no difficulty reading such files.
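
An assumed approximation of what fileLines does; it drops one trailing
CR per line, mirroring what NewlineMode translation would have done:

	import qualified Data.ByteString.Char8 as S8

	fileLines' :: S8.ByteString -> [S8.ByteString]
	fileLines' = map dropTrailingCR . S8.lines
	  where
	    dropTrailingCR b = case S8.unsnoc b of
	        Just (b', '\r') -> b'
	        _ -> b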

When the mergedrefs file has gotten lines that end with "\r\r\r\n", this
will eventually clean it up. Each update will remove a single trailing CR.

Note that S8.lines is still used in eg Command.Unused, where it is parsing
git show-ref, and similar in Git/*. git commands don't include CR in their
output so that's ok.

Sponsored-by: Joshua Antonishen on Patreon
2023-10-30 14:23:23 -04:00
Joey Hess
d9fd205cbb
push RawFilePath down into Annex.ReplaceFile
Minor optimisation, but a win in every case, except for a couple where
it's a wash.

Note that replaceFile still takes a FilePath, because it needs to
operate on Chars to truncate unicode filenames properly.
2023-10-26 13:36:49 -04:00
Joey Hess
c873586e14
eliminate s2w8 and w82s
Note that the use of s2w8 in genUUIDInNameSpace made it truncate unicode
characters. Luckily, genUUIDInNameSpace is only ever used on ASCII
strings as far as I can determine. In particular, git-remote-gcrypt's
gcrypt-id is an ASCII string.
2023-10-26 13:12:57 -04:00
Joey Hess
3742263c99
simplify base64 to only use ByteString
Note the use of fromString and toString from Data.ByteString.UTF8 dated
back to commit 9b93278e8a. Back then it
was using the dataenc package for base64, which operated on Word8 and
String. But with the switch to sandi, it uses ByteString, and indeed
fromB64' and toB64' were already using ByteString without that
complication. So I think there is no risk of such an encoding related
breakage.

I also tested the case that 9b93278e8a
fixed:

	git-annex metadata -s foo='a …' x
	git-annex metadata x
	metadata x
	  foo=a …

In Remote.Helper.Encryptable, it was avoiding using Utility.Base64
because of that UTF8 conversion. Since that's no longer done, it can
just use it now.
2023-10-26 13:10:05 -04:00
Joey Hess
0da1d40cd4
Improve memory use of --all when using annex.private
This does not improve Annex.Branch.files at all, since it still uses ++ to
combine the lists, so forcing all but the last one.

But when there are a lot of files in the private journal, it does keep
--all (or a bare repo) from buffering the filenames in memory.

See commit 653b719472 for prior discussion of
this buffering.

Sponsored-by: Graham Spencer on Patreon
2023-10-24 13:20:55 -04:00
Joey Hess
8bde6101e3
sqlite database for importfeed
importfeed: Use caching database to avoid needing to list urls on every
run, and avoid using too much memory.

Benchmarking in my podcasts repo, importfeed got 1.42 seconds faster,
and memory use dropped from 203000k to 59408k.

Database.ImportFeed is Database.ContentIdentifier with the serial number
filed off. There is a bit of code duplication I would like to avoid,
particularly recordAnnexBranchTree, and getAnnexBranchTree. But these use
the persistent sqlite tables, so despite the code being the same, they
cannot be factored out.

Since this database includes the contentidentifier metadata, it will be
slightly redundant if a sqlite database is ever added for metadata. I
did consider making such a generic database and using it for this. But,
that would then need importfeed to update both the url database and the
metadata database, which is twice as much work diffing the git-annex
branch trees. Or it would entangle updating two databases in a complex way.
So instead it seems better to optimise the database that
importfeed needs, and if the metadata database is used by another command,
use a little more disk space and do a little bit of redundant work to
update it.

Sponsored-by: unqueued on Patreon
2023-10-23 16:46:22 -04:00
Joey Hess
c268dc5878
only stage regular files from the journal
git-annex only writes regular files there, but other things may drop junk
like empty .DAV directories around the tree. And trying to hash such things
can have weird and hard to understand effects. So it seems best to do a
small amount of work in statting the journal file to make sure it's a
regular file.
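
The check amounts to one stat per journal file, sketched here with an
assumed helper name:

	import System.PosixCompat.Files (getSymbolicLinkStatus, isRegularFile)

	-- Stat without following symlinks; only regular files get staged,
	-- so stray directories like .DAV/ are skipped rather than hashed.
	stageableJournalFile :: FilePath -> IO Bool
	stageableJournalFile f = isRegularFile <$> getSymbolicLinkStatus f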

Sponsored-by: Jack Hill on Patreon
2023-10-10 13:22:02 -04:00
Joey Hess
724ceeb1a9
avoid unnecessary use of curl when conduit will do
Avoid using curl when annex.security.allowed-ip-addresses is set but
neither annex.web-options nor annex.security.allowed-url-schemes is set to
a value that needs curl.
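
The decision reduces to something like this sketch; the set of schemes
the conduit-based downloader handles is an assumption here:

	import Data.Maybe (isJust)

	-- curl is only needed when annex.web-options is set, or when
	-- annex.security.allowed-url-schemes names a scheme the built-in
	-- conduit-based downloader cannot handle.
	needsCurl :: Maybe String -> [String] -> Bool
	needsCurl weboptions allowedschemes =
	    isJust weboptions
	        || any (`notElem` conduitschemes) allowedschemes
	  where
	    conduitschemes = ["http", "https"]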

Bug introduced in 840bd50390

Sponsored-By: Brock Spratlen on Patreon
2023-08-22 10:25:53 -04:00
Joey Hess
10b5f79e2d
fix empty tree import when directory does not exist
Fix behavior when importing a tree from a directory remote when the
directory does not exist. An empty tree was imported, rather than the
import failing. Merging that tree would delete every file in the
branch, if those files had been exported to the directory before.

The problem was that dirContentsRecursive returned [] when the directory
did not exist. Better for it to throw an exception. But in commit
74f0d67aa3 back in 2012, I made it never
throw exceptions, because exceptions thrown inside unsafeInterleaveIO become
untrappable when the list is being traversed.

So, changed it to list the contents of the directory before entering
unsafeInterleaveIO. So exceptions are thrown for the directory. But still
not if it's unable to list the contents of a subdirectory. That's less of a
problem, because the subdirectory does exist (or if not, it got removed
after being listed, and it's ok to not include it in the list). A
subdirectory whose permissions don't allow listing it will still have
its contents omitted from the list.

(Might be better to have it return a type that includes indications of
errors listing contents of subdirectories?)

The rest of the changes are making callers of dirContentsRecursive
use emptyWhenDoesNotExist when they relied on the behavior of it not
throwing an exception when the directory does not exist. Note that
it's possible some callers of dirContentsRecursive that used to ignore
permissions problems listing a directory will now start throwing exceptions
on them.

The fix to the directory special remote consisted of not making its
call in listImportableContentsM use emptyWhenDoesNotExist. So it will
throw an exception as desired.
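
A simplified sketch of the restructuring; the top-level listing runs
eagerly so a missing directory throws, while recursion stays lazy and
omits unreadable subdirectories:

	import Control.Exception (IOException, catch)
	import System.Directory (doesDirectoryExist, listDirectory)
	import System.FilePath ((</>))
	import System.IO.Unsafe (unsafeInterleaveIO)

	dirContentsRecursive :: FilePath -> IO [FilePath]
	dirContentsRecursive dir = do
	    -- eager: throws when dir itself does not exist
	    names <- listDirectory dir
	    go names
	  where
	    go [] = pure []
	    go (n:ns) = unsafeInterleaveIO $ do
	        let f = dir </> n
	        isdir <- doesDirectoryExist f
	        rest <- go ns
	        if isdir
	            then do
	                -- lazy region: exceptions here would be
	                -- untrappable by the consumer, so unreadable
	                -- subdirectories are omitted instead
	                sub <- dirContentsRecursive f `catch` ignore
	                pure (sub ++ rest)
	            else pure (f : rest)
	    ignore :: IOException -> IO [FilePath]
	    ignore _ = pure []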

Sponsored-by: Joshua Antonishen on Patreon
2023-08-15 12:57:41 -04:00
Joey Hess
be028f10e5
split out Utility.Url.Parse
This is mostly for git-repair which can't include all of Utility.Url
without adding many dependencies that are not really necessary.
2023-08-14 12:28:10 -04:00
Joey Hess
adda6c1088
Add git-annex remote refs that are not newer to the merged refs list
Significant startup speed increase by avoiding repeatedly checking if some
remote git-annex branch refs need to be merged when they are not newer.

One way this could happen is when there are 2 remotes that are themselves
connected. The git-annex branch on the first remote gets updated. Then the
second remote pulls from the first, and merges in its git-annex branch.
Then the local repo pulls from the second remote, and merges its git-annex
branch. At this point, a pull from the first remote will get a git-annex
branch that is not newer, but is not on the merged refs list.

In my big repo, git-annex startup time dropped from 4 seconds to 0.1 seconds.
There were 5 to 10 such remote refs out of 18 remotes.
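
An abstract model of the startup check, with types and names assumed;
the middle case is the new behavior:

	import qualified Data.Set as S

	data Action = SkipMerge | DoMerge

	-- Decide whether a remote git-annex branch ref needs merging,
	-- recording not-newer refs in the merged set so the next startup
	-- skips the ancestor check entirely.
	checkRef :: (String -> Bool) -> S.Set String -> String -> (Action, S.Set String)
	checkRef isAncestorOfBranch merged sha
	    | sha `S.member` merged = (SkipMerge, merged)
	    | isAncestorOfBranch sha = (SkipMerge, S.insert sha merged)
	    | otherwise = (DoMerge, S.insert sha merged)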

Sponsored-by: Graham Spencer on Patreon
2023-08-09 13:31:36 -04:00
Joey Hess
3a52b4c4c3
fix hang when built with unix-2.8
The git-annex test suite hung when running git-annex add in an adjusted
unlocked branch. I couldn't reproduce the hang outside the test suite.

Seems that the code added in 26a9ea12d1
was buggy, and as that commit was made without testing it, building with
unix-2.8 exposed the bug.

I don't fully understand the bug, which involves fdToHandle
and then closing the fd, vs closing the handle. May somehow involve
laziness or forcing around the S.hGet? Using hClose solved it
in any case.

(Also eliminated checkcontentfollowssymlinks to fix a build warning
when it's not used.)
2023-08-01 20:22:28 -04:00
Joey Hess
eb8e30a2f1
fix build with unix-2.8.0
got the arguments the wrong way around when I wrote this

also squelch a build warning
2023-08-01 18:27:12 -04:00
Joey Hess
fa92383993
onlyingroup
* Support "onlyingroup=" in preferred content expressions.
* Support --onlyingroup= matching option.

Sponsored-by: Jack Hill on Patreon
2023-07-31 14:43:58 -04:00
Joey Hess
473d66132d
display explanations in --debug too
Even when --explain is not enabled, this can be useful debugging
information.

Sponsored-by: Dartmouth College's DANDI project
2023-07-31 13:06:40 -04:00
Joey Hess
846384fc3a
--explain for numcopies checks
And closed the todo as completed.

Sponsored-by: Dartmouth College's DANDI project
2023-07-31 12:53:17 -04:00
Joey Hess
518a51a8a0
--explain for preferred/required content matching
And annex.largefiles and annex.addunlocked.

Also git-annex matchexpression --explain explains why its input
expression matches or fails to match.

When there is no limit, avoid explaining why the lack of limit
matches. This is also done when no preferred content expression is set,
although in a few cases it defaults to a non-empty matcher, which will
be explained.

Sponsored-by: Dartmouth College's DANDI project
2023-07-26 14:50:04 -04:00
Joey Hess
f25eeedeac
initial implementation of --explain
Currently it only displays explanations of options like --in and --copies.

In the future, it should explain preferred content expression evaluation
and other decisions.

The explanations of a few things could be better. In particular,
"standard" will just appear as-is (or as "!standard" if it doesn't
match), rather than explaining why the standard preferred content expression
for the group matches or not.

Currently as implemented, it goes to stdout, and so commands like
git-annex find that have custom output will not display --explain
information. Perhaps that should change, dunno.

Sponsored-by: Dartmouth College's DANDI project
2023-07-25 16:52:57 -04:00
Joey Hess
cf40e2d4b6
Revert "use existing debug machinery for explain"
This reverts commit 409572c9e4.
2023-07-25 15:53:50 -04:00
Joey Hess
409572c9e4
use existing debug machinery for explain
explain is a kind of debug message, but not formatted in the same way.
So it makes sense to reuse the debug machinery for it, since that is
already quite optimised.

Sponsored-by: Dartmouth College's DANDI project
2023-07-25 15:47:58 -04:00
Joey Hess
e82823d448
nub list of files
When resumed, yt-dlp was observed to have written the same filename twice
into the file list. Perhaps once by the first download and once by the
resumed one?
2023-07-09 14:18:25 -04:00