Commit graph

1090 commits

Author SHA1 Message Date
Joey Hess
06064f897c
update Annex.reposizes when changing location logs
The live update is only needed when Annex.reposizes has already been
populated.
2024-08-15 13:27:14 -04:00
Joey Hess
745bc5c547
take maxsize into account for balanced preferred content
This is very innefficient, it will need to be optimised not to
calculate the sizes of repos every time.

Also, fixed a bug in balancedPicker that caused it to pick a too high
index when some repos were excluded due to being full.
2024-08-13 11:00:20 -04:00
Joey Hess
99a126bebb
added reposize database
The idea is that upon a merge of the git-annex branch, or a commit to
the git-annex branch, the reposize database will be updated. So it
should always accurately reflect the location log sizes, but it will
often be behind the actual current sizes.

Annex.reposizes will start with the value from the database, and get
updated with each transfer, so it will reflect a process's best
understanding of the current sizes.

When there are multiple processes all transferring to the same repo,
Annex.reposize will not reflect transfers made by the other processes
since the current process started. So when using balanced preferred
content, it may make suboptimal choices, including trying to transfer
content to the repo when another process has already filled it up.
But this is the same as if there are multiple processes running on
ifferent machines, so is acceptable. The reposize will eventually
get an accurate value reflecting changes made by other processes or in
other repos.
2024-08-12 11:19:58 -04:00
Joey Hess
1265d7e5df
implement maxsize log and command
* maxsize: New command to tell git-annex how large the expected maximum
  size of a repository is.
* vicfg: Include maxsize configuration.
2024-08-11 15:41:26 -04:00
Joey Hess
bd5affa362
use hmac in balanced preferred content
This deals with the possible security problem that someone could make an
unusually low UUID and generate keys that are all constructed to hash to
a number that, mod the number of repositories in the group, == 0.
So balanced preferred content would always put those keys in the
repository with the low UUID as long as the group contains the
number of repositories that the attacker anticipated.
Presumably the attacker than holds the data for ransom? Dunno.

Anyway, the partial solution is to use HMAC (sha256) with all the UUIDs
combined together as the "secret", and the key as the "message". Now any
change in the set of UUIDs in a group will invalidate the attacker's
constructed keys from hashing to anything in particular.

Given that there are plenty of other things someone can do if they can
write to the repository -- including modifying preferred content so only
their repository wants files, and numcopies so other repositories drom
them -- this seems like safeguard enough.

Note that, in balancedPicker, combineduuids is memoized.
2024-08-10 16:32:54 -04:00
Joey Hess
c15c32b5f8
releasing package git-annex version 10.20240808 2024-08-08 15:27:04 -04:00
Joey Hess
c1bc0bffc8
releasing package git-annex version 10.20240731 2024-07-31 14:05:01 -04:00
Joey Hess
d1b641cb1e
update stack.yaml to nightly-2024-07-29 and remove stack-lts-18.13.yaml
Primarily because Windows needs a dependency bump to get stm-2.5.1
for Servant build flag.

This includes Win32-2.13.4.0 and aws-0.24 which adds some features
that windows had been missing out on as well.

Lots of warnings about head and tail will need to eventually be
addressed. Of course AFAIK the uses of it in git-annex are all safe.
2024-07-29 20:09:37 -04:00
Joey Hess
54f2ea2b85
fix syntax 2024-07-29 19:13:31 -04:00
Joey Hess
ebe81ccdfa
servant build flag needs stm-2.5.1
For writeTMVar. Would be possible to rewrite to use something else, but
I don't want to. Might be possible to write a writeTMVar that works with
the old version of stm.
2024-07-29 19:10:00 -04:00
Joey Hess
ab22938c0b
fix build without servant 2024-07-24 15:13:02 -04:00
Joey Hess
b7454f1eeb
protocol version fallback on 404
and prettified errors
2024-07-23 14:58:49 -04:00
Joey Hess
b0eed55d4f
factor out http server and client into own modules
To avoid a cycle when Remote.Git uses the client.
2024-07-23 14:12:38 -04:00
Joey Hess
6bbc4565e6
started wiring p2phttp into Remote.Git
but we have a cycle, ugh
2024-07-23 13:53:10 -04:00
Joey Hess
5c39652235
starting support for remote.name.annexUrl set to annex+http
In this case, Remote.Git should not use that url for all access to
the repository. It will only be used for annex operations, which isn't
done yet.
2024-07-23 09:12:21 -04:00
Joey Hess
fdb888a56a
update servant build flag
make it work when building w/o assistant
2024-07-23 08:53:56 -04:00
Joey Hess
b290a72025
update deps 2024-07-11 14:51:45 -04:00
Joey Hess
9a592f946f
split module 2024-07-08 21:12:23 -04:00
Joey Hess
82d66ede5e
convert lockcontent api to http long polling
Websockets would work, but the problem with using them for this is that
each lockcontent call is a separate websocket connection. And that's an
actual TCP connection. One TCP connection per file dropped would be too
expensive. With http long polling, regular http pipelining can be used,
so it will reuse a TCP connection.

Unfortunately, at least with servant, bi-directional streams with long
polling don't result in true bidirectional full duplex communication.
Servant processes the whole client body stream before generating the server
body stream. I think it's entirely possible to do full bi-directional
communication over http, but it would need changes to servant.

And, there's no way for the client to tell if the server successfully
locked the content, since the server will keep processing the client
stream no matter what.:

So, added a new api endpoint, keeplocked. lockcontent will lock the key
for 10 minutes with retention lock, and then a call to keeplocked will
keep it locked for as long as needed. This does mean that there will
need to be a Map of locks by key, and I will probably want to add
some kind of lock identifier that lockcontent returns.
2024-07-08 12:57:46 -04:00
Joey Hess
9ee005e49a
dummy HasClient ClientM WebSocket
Enough to let lockcontent routes be included and servant-client be used.
But not enough to use servant-client with those routes. May need to
implement a separate runner for that part of the protocol?

Also some misc other stuff needed to use servant-client.

And fix exposing of UUID in the JSON types. UUID does actually have
aeson instances, but they're used elsewhere (metadata --batch, although
only included to get it to compile, not actually used in there) and not
suitable for use here since this must work with every possible UUID.
2024-07-07 21:21:45 -04:00
Joey Hess
9a726cedf6
servant server now compiling
Just need to fill in some undefined
2024-07-07 14:48:20 -04:00
Joey Hess
1dbb5ec70d
servant API type is complete 2024-07-07 12:59:12 -04:00
Joey Hess
86ce3bf1e4
started servant implementation of HTTP P2P protocol 2024-07-07 12:08:10 -04:00
Joey Hess
1243af4a18
toward SafeDropProof expiry checking
Added Maybe POSIXTime to SafeDropProof, which gets set when the proof is
based on a LockedCopy. If there are several LockedCopies, it uses the
closest expiry time. That is not optimal, it may be that the proof
expires based on one LockedCopy but another one has not expired. But
that seems unlikely to really happen, and anyway the user can just
re-run a drop if it fails due to expiry.

Pass the SafeDropProof to removeKey, which is responsible for checking
it for expiry in situations where that could be a problem. Which really
only means in Remote.Git.

Made Remote.Git check expiry when dropping from a local remote.

Checking expiry when dropping from a P2P remote is not yet implemented.
P2P.Protocol.remove has SafeDropProof plumbed through to it for that
purpose.

Fixing the remaining 2 build warnings should complete this work.

Note that the use of a POSIXTime here means that if the clock gets set
forward while git-annex is in the middle of a drop, it may say that
dropping took too long. That seems ok. Less ok is that if the clock gets
turned back a sufficient amount (eg 5 minutes), proof expiry won't be
noticed. It might be better to use the Monotonic clock, but that doesn't
advance when a laptop is suspended, and while there is the linux
Boottime clock, that is not available on other systems. Perhaps a
combination of POSIXTime and the Monotonic clock could detect laptop
suspension and also detect clock being turned back?

There is a potential future flag day where
p2pDefaultLockContentRetentionDuration is not assumed, but is probed
using the P2P protocol, and peers that don't support it can no longer
produce a LockedCopy. Until that happens, when git-annex is
communicating with older peers there is a risk of data loss when
a ssh connection closes during LOCKCONTENT.
2024-07-04 12:39:06 -04:00
Joey Hess
5b6150e5d5
factor out Utility.MonotonicClock 2024-07-03 17:54:01 -04:00
Joey Hess
543c610a31
REMOVE-BEFORE and GETTIMESTAMP
Only implemented server side, not used client side yet.

And not yet implemented for proxies/clusters, for which there's a build
warning about unhandled cases.

This is P2P protocol version 3. Probably will be the only change in that
version..

Added a dependency on clock to access a monotonic clock.
On i386-ancient, that is at version 0.2.0.0.
2024-07-03 17:01:58 -04:00
Joey Hess
20ebb54b6f
prep release 2024-07-01 15:13:10 -04:00
Joey Hess
0b72b85df5
added git-annex extendcluster
This works, but updatecluster does not work yet in multi-gateway
clusters, nor do gateways relay to other gateways.
2024-06-26 10:26:54 -04:00
Joey Hess
202ea3ff2a
don't sync with cluster nodes by default
Avoid `git-annex sync --content` etc from operating on cluster nodes by default
since syncing with a cluster implicitly syncs with its nodes. This avoids a
lot of unncessary work when a cluster has a lot of nodes just in checking
if each node's preferred content is satisfied. And it avoids content
being sent to nodes individually, so instead syncing with clusters always
fanout uploads to nodes.

The downside is that there are situations where a cluster's preferred content
settings can be met, but those of its nodes are not. Or where a node does not
contain a key, but the cluster does, and there are not enough copies of the key
yet, so it would be desirable the send it there. I think that's an acceptable
tradeoff. These kind of situations are ones where the cluster itself should
probably be responsible for copying content to the node. Which it can do much
less expensively than a client can. Part of the balanced preferred content
design that I will be working on in a couple of months involves rebalancing
clusters, so I expect to revisit this.

The use of annex-sync config does allow running git-annex sync with a specific
node, or nodes, and it will sync with it. And it's also possible to set
annex-sync git configs to make it sync with a node by default. (Although that
will require setting up an explicit git remote for the node rather than relying
on the proxied remote.)

Logs.Cluster.Basic is needed because Remote.Git cannot import Logs.Cluster
due to a cycle. And the Annex.Startup load of clusters happens
too late for Remote.Git to use that. This does mean one redundant load
of the cluster log, though only when there is a proxy.
2024-06-25 10:24:38 -04:00
Joey Hess
d34326ab76
factor out Annex.Proxy 2024-06-18 10:51:37 -04:00
Joey Hess
f0d6114286
refactor cluster code into own module 2024-06-18 10:36:04 -04:00
Joey Hess
780367200b
remove dead nodes when loading the cluster log
This is to avoid inserting a cluster uuid into the location log when
only dead nodes in the cluster contain the content of a key.

One reason why this is necessary is Remote.keyLocations, which excludes
dead repositories from the list. But there are probably many more.

Implementing this was challenging, because Logs.Location importing
Logs.Cluster which imports Logs.Trust which imports Remote.List resulted
in an import cycle through several other modules.

Resorted to making Logs.Location not import Logs.Cluster, and instead
it assumes that Annex.clusters gets populated when necessary before it's
called.

That's done in Annex.Startup, which is run by the git-annex command
(but not other commands) at early startup in initialized repos. Or,
is run after initialization.

Note that is Remote.Git, it is unable to import Annex.Startup, because
Remote.Git importing Logs.Cluster leads the the same import cycle.
So ensureInitialized is not passed annexStartup in there.

Other commands, like git-annex-shell currently don't run annexStartup
either.

So there are cases where Logs.Location will not see clusters. So it won't add
any cluster UUIDs when loading the log. That's ok, the only reason to do
that is to make display of where objects are located include clusters,
and to make commands like git-annex get --from treat keys as being located
in a cluster. git-annex-shell certainly does not do anything like that,
and I'm pretty sure Remote.Git (and callers to Remote.Git.onLocalRepo)
don't either.
2024-06-16 14:39:44 -04:00
Joey Hess
570ceffe8d
broke out initcluster
One benefit of this is that a typo in annex-cluster-node config won't
init a new cluster.

Also it gets the cluster description set and is consistent with
initremote.
2024-06-14 17:23:11 -04:00
Joey Hess
bbf261487d
add git-annex updatecluster command
Seems to work fine, making the right changes to the git-annex branch.
2024-06-14 15:02:01 -04:00
Joey Hess
aa56d433d5
implement cluster.log
Not used yet. (Or tested.)

I did consider making the log start with the uuid of the node, followed
by the cluster uuid (or uuids). That would perhaps mean a smaller write
to the git-annex branch when adding a node, but overall the log file
would be larger, and it will be read and cached near to startup on most
git-annex runs.
2024-06-13 16:00:58 -04:00
Joey Hess
501d65eeab
started implementing git-annex-shell proxy
So far, it negotiates VERSION with both parties. This is a tricky dance.

Untested.
2024-06-10 18:01:36 -04:00
Joey Hess
f97f4b8bdb
Added updateproxy command and remote.name.annex-proxy configuration
So far this only records proxy information on the git-annex branch.
2024-06-04 14:52:03 -04:00
Joey Hess
aeedca70ca
prep release 2024-05-30 17:53:33 -04:00
Joey Hess
424afe46d7
fix incremental push to preserve existing bundle keys in manifest
Also broke Manifest out to its own type with a smart constructor.

Sponsored-by: mycroft on Patreon
2024-05-13 09:47:05 -04:00
Joey Hess
ff5193c6ad
Merge branch 'master' into git-remote-annex 2024-05-10 14:20:36 -04:00
Joey Hess
e1447dc2e2
add git bundle interface
Sponsored-by: mycroft on Patreon
2024-05-07 14:22:41 -04:00
Joey Hess
c7731cdbd9
add Backend.GitRemoteAnnex
Making GITBUNDLE be in the backend list allows those keys to be
hashed to verify, both when git-remote-annex downloads them, and by other
transfers and by git fsck.

GITMANIFEST is not in the backend list, because those keys will never be
stored in .git/annex/objects and can't be verified in any case.

This does mean that git-annex version will include GITBUNDLE in the list
of backends.

Also documented these in backends.mdwn

Sponsored-by: Kevin Mueller on Patreon
2024-05-07 13:54:08 -04:00
Joey Hess
a01d64a4ad
add git-remote-annex stub and build machinery
Renamed git-remote-annex.sh, keeping it around for now for reference.

Sponsored-by: Graham Spencer on Patreon
2024-05-06 13:05:58 -04:00
Joey Hess
d6ad5b9b50
releasing package git-annex version 10.20240430 2024-04-30 15:27:31 -04:00
Joey Hess
d372553540
rclone special remote
Added rclone special remote, which can be used without needing to install
the git-annex-remote-rclone program. This needs a new version of rclone,
which supports "rclone gitannex".

This is implemented as a variant of an external special remote, that
runs "rclone gitannex" instead of the usual git-annex-remote- command.
Parameterized Remote.External to support that.

Sponsored-by: Luke T. Shumaker on Patreon
2024-04-17 15:20:37 -04:00
Joey Hess
016d1bee88
add reregisterurl command
What this can currently be used for is only to change an url from being
used by a special remote to being used by the web remote.

This could have been a --move-from option to registerurl. But, that would
have complicated its option and --batch processing, and also would have
complicated unregisterurl, which is implemented on top of
Command.Registerurl. So, a separate command was actually less complicated
to implement.

The generic description of the command is because I want to make this
command a catch-all for other url updating kind of things, if there are
ever any more. Also because it was hard to come up with a good name for the
specific action. I considered `git-annex moveurl`, but that seems to
indicate data is perhaps actually being moved, and seems to sit at the same
level as addurl and rmurl, and this command is at the plumbing
level of registerurl and unregisterurl.

Sponsored-by: Dartmouth College's DANDI project
2024-03-05 15:06:14 -04:00
Joey Hess
e7652b0997
implement URL to VURL migration
This needs the content to be present in order to hash it. But it's not
possible for a module used by Backend.URL to call inAnnex because that
would entail a dependency loop. So instead, rely on the fact that
Command.Migrate calls inAnnex before performing a migration.

But, Command.ExamineKey calls fastMigrate and the key may or may not
exist, and it's not wanting to actually perform a migration in any case.
To handle that, had to add an additional value to fastMigrate to
indicate whether the content is inAnnex.

Factored generateEquivilantKey out of Remote.Web.

Note that migrateFromURLToVURL hardcodes use of the SHA256E backend.
It would have been difficult not to, given all the dependency loop
issues. But --backend and annex.backend are used to tell git-annex
migrate to use VURL in any case, so there's no config knob that
the user could expect to configure that.

Sponsored-by: Brock Spratlen on Patreon
2024-03-01 16:42:02 -04:00
Joey Hess
cc17ac423b
implement isCryptographicallySecureKey for VURL
Considerable difficulty to work around an import cycle. Had to move the
list of backends (except for VURL) to Backend.Variety to VURL could use
it.

Sponsored-by: Kevin Mueller on Patreon
2024-02-29 17:26:35 -04:00
Joey Hess
55bf01b788
add equivilant key log for VURL keys
When downloading a VURL from the web, make sure that the equivilant key
log is populated.

Unfortunately, this does not hash the content while it's being
downloaded from the web. There is not an interface in Backend currently
for incrementally hash generation, only for incremental verification of an
existing hash. So this might add a noticiable delay, and it has to show
a "(checksum...") message. This could stand to be improved.

But, that separate hashing step only has to happen on the first download
of new content from the web. Once the hash is known, the VURL key can have
its hash verified incrementally while downloading except when the
content in the web has changed. (Doesn't happen yet because
verifyKeyContentIncrementally is not implemented yet for VURL keys.)

Note that the equivilant key log file is formatted as a presence log.
This adds a tiny bit of overhead (eg "1 ") per line over just listing the
urls. The reason I chose to use that format is it seems possible that
there will need to be a way to remove an equivilant key at some point in
the future. I don't know why that would be necessary, but it seemed wise
to allow for the possibility.

Downloads of VURL keys from other special remotes that claim urls,
like bittorrent for example, does not popilate the equivilant key log.
So for now, no checksum verification will be done for those.

Sponsored-by: Nicholas Golder-Manning on Patreon
2024-02-29 16:01:49 -04:00
Joey Hess
c2d6c02c27
Added dependency on unbounded-delays
And stop vendoring part of it.

This is a free dependency because tasty depends on it.

Sponsored-by: Leon Schuermann on Patreon
2024-02-27 13:11:59 -04:00