git-annex

Author	SHA1	Message	Date
Joey Hess	783e910d0c	sim: Add metadata command Only really needed for completeness, preferred content expressions can match against metadata.	2024-09-26 12:20:37 -04:00
Joey Hess	6cf9a101b8	sim: Fix size tracking for balanced preferred content	2024-09-23 12:42:32 -04:00
Joey Hess	52891711d2	git-annex sim command is working Had to add Read instances to Key and NumCopies and some other similar types. I only expect to use those in serializing a sim. Of course, this risks that implementation changes break reading old data. For a sim, that would not be a big problem.	2024-09-12 16:10:52 -04:00
Joey Hess	7e8274c6b7	implemented ActionDropUnwanted Not tested yet. This emulates the same checking that is done when dropping. Note that when dropping from a special remote it is not able to make a locked copy.	2024-09-12 10:44:31 -04:00
Joey Hess	4e11cb19ef	implemented cloneSimRepo Started on updateSimRepoState	2024-09-06 14:23:29 -04:00
Joey Hess	5807e1480c	correct comment This is not related to v5 versus newer versions.	2024-09-03 14:23:32 -04:00
Joey Hess	340bdd0dac	treat "not present" in preferred content as invalid Detect when a preferred content expression contains "not present", which would lead to repeatedly getting and then dropping files, and make it never match. This also applies to "not balanced" and "not sizebalanced". --explain will tell the user when this happens Note that getMatcher calls matchMrun' and does not check for unstable negated limits. While there is no --present anyway, if there was, it would not make sense for --not --present to complain about instability and fail to match.	2024-09-03 13:50:06 -04:00
Joey Hess	35ff8c8c00	use Utility.PID fixes build on i386ancient	2024-08-30 14:56:38 -04:00
Joey Hess	f89a1b8216	remove stale live changes from reposize database Reorganized the reposize database directory, and split up a column. checkStaleSizeChanges needs to run before needLiveUpdate, otherwise the process won't be holding a lock on its pid file, and another process could go in and expire the live update it records. It just so happens that they do get called in the correct order, since checking balanced preferred content calls getLiveRepoSizes before needLiveUpdate. The 1 minute delay between checks is arbitrary, but will avoid excess work. The downside of it is that, if a process is dropping a file and gets interrupted, for 1 minute another process can expect a repository will soon be smaller than it is. And so a process might send data to a repository when a file is not really going to be dropped from it. But note that can already happen if a drop takes some time in eg locking and then fails. So it seems possible that live updates should only be allowed to increase, rather than decrease the size of a repository.	2024-08-28 13:57:25 -04:00
Joey Hess	e006acef22	avoid reposize database locking overhead when not needed Only when the preferred content expression being matched uses balanced preferred content is this overhead needed. It might be possible to eliminate the locking entirely. Eg, check the live changes before and after the action and re-run if they are not stable. For now, this is good enough, it avoids existing preferred content getting slow. If balanced preferred content turns out to be too slow to check, that could be tried later.	2024-08-28 10:52:34 -04:00
Joey Hess	4d2f95853d	closing in on finishing live reposizes Fixed successfullyFinishedLiveSizeChange to not update the rolling total when a redundant change is in RecentChanges. Made setRepoSizes clear RecentChanges that are no longer needed. It might be possible to clear those earlier, this is only a convenient point to do it. The reason it's safe to clear RecentChanges here is that, in order for a live update to call successfullyFinishedLiveSizeChange, a change must be made to a location log. If a RecentChange gets cleared, and just after that a new live update is started, making the same change, the location log has already been changed (since the RecentChange exists), and so when the live update succeeds, it won't call successfullyFinishedLiveSizeChange. The reason it doesn't clear RecentChanges when there is a reduntant live update is because I didn't want to think through whether or not all races are avoided in that case. The rolling total in SizeChanges is never cleared. Instead, calcJournalledRepoSizes gets the initial value of it, and then getLiveRepoSizes subtracts that initial value from the current value. Since the rolling total can only be updated by updateRepoSize, which is called with the journal locked, locking the journal in calcJournalledRepoSizes ensures that the database does not change while reading the journal.	2024-08-27 12:54:46 -04:00
Joey Hess	d7813876a0	fixed the build Manually tested getLiveRepoSizes and it is working correctly.	2024-08-27 09:41:35 -04:00
Joey Hess	521e0a7062	fix a deadlock When finishedLiveUpdate was run on a different key than expected, it blocked forever waiting for an indication the database had been updated. Since the journal is locked when finishedLiveUpdate runs, this could also have caused other git-annex commands to hang.	2024-08-27 00:13:54 -04:00
Joey Hess	18f8d61f55	rolling total of size changes in RepoSize database When a live size change completes successfully, the same transaction that removes it from the database updates the rolling total for its repository. The idea is that when RepoSizes is read, SizeChanges will be as well, and cached locally. Any time a change is made, the local cache will be updated. So by comparing the local cache with the current SizeChanges, it can learn about size changes that were made by other processes. Then read the LiveSizeChanges, and add that in to get a live picture of the current sizes. Also added a SizeChangeId. This allows 2 different threads, or processes, to both record a live size change for the same repo and key, and update their own information without stepping on one-another's toes.	2024-08-25 10:34:47 -04:00
Joey Hess	d60a33fd13	improve live update starting In an expression like "balanced=foo and exclude=bar", avoid it starting a live update when the overall expression doesn't match.	2024-08-24 13:07:05 -04:00
Joey Hess	2f20b939b7	LiveUpdate db updates working I've tested the behavior of the thread that waits for the LiveUpdate to be finished, and it does get signaled and exit cleanly when the LiveUpdate is GCed instead. Made finishedLiveUpdate wait for the thread to finish updating the database. There is a case where GC doesn't happen in time and the database is left with a live update recorded in it. This should not be a problem as such stale data can also happen when interrupted and will need to be detected when loading the database. Balanced preferred content expressions now call startLiveUpdate.	2024-08-24 11:49:58 -04:00
Joey Hess	c3d40b9ec3	plumb in LiveUpdate (WIP) Each command that first checks preferred content (and/or required content) and then does something that can change the sizes of repositories needs to call prepareLiveUpdate, and plumb it through the preferred content check and the location log update. So far, only Command.Drop is done. Many other commands that don't need to do this have been updated to keep working. There may be some calls to NoLiveUpdate in places where that should be done. All will need to be double checked. Not currently in a compilable state.	2024-08-23 16:35:12 -04:00
Joey Hess	70e2fca257	Added the annex.fullybalancedthreshhold git config.	2024-08-22 07:15:55 -04:00
Joey Hess	99514f9d18	maxsize overview display and --json support	2024-08-18 12:08:13 -04:00
Joey Hess	b62b58b50b	git-annex info speed up using getRepoSizes	2024-08-17 14:54:31 -04:00
Joey Hess	99a126bebb	added reposize database The idea is that upon a merge of the git-annex branch, or a commit to the git-annex branch, the reposize database will be updated. So it should always accurately reflect the location log sizes, but it will often be behind the actual current sizes. Annex.reposizes will start with the value from the database, and get updated with each transfer, so it will reflect a process's best understanding of the current sizes. When there are multiple processes all transferring to the same repo, Annex.reposize will not reflect transfers made by the other processes since the current process started. So when using balanced preferred content, it may make suboptimal choices, including trying to transfer content to the repo when another process has already filled it up. But this is the same as if there are multiple processes running on ifferent machines, so is acceptable. The reposize will eventually get an accurate value reflecting changes made by other processes or in other repos.	2024-08-12 11:19:58 -04:00
Joey Hess	1265d7e5df	implement maxsize log and command * maxsize: New command to tell git-annex how large the expected maximum size of a repository is. * vicfg: Include maxsize configuration.	2024-08-11 15:41:26 -04:00
Joey Hess	bd5affa362	use hmac in balanced preferred content This deals with the possible security problem that someone could make an unusually low UUID and generate keys that are all constructed to hash to a number that, mod the number of repositories in the group, == 0. So balanced preferred content would always put those keys in the repository with the low UUID as long as the group contains the number of repositories that the attacker anticipated. Presumably the attacker than holds the data for ransom? Dunno. Anyway, the partial solution is to use HMAC (sha256) with all the UUIDs combined together as the "secret", and the key as the "message". Now any change in the set of UUIDs in a group will invalidate the attacker's constructed keys from hashing to anything in particular. Given that there are plenty of other things someone can do if they can write to the repository -- including modifying preferred content so only their repository wants files, and numcopies so other repositories drom them -- this seems like safeguard enough. Note that, in balancedPicker, combineduuids is memoized.	2024-08-10 16:32:54 -04:00
Joey Hess	3ea835c7e8	proxied exporttree=yes versionedexport=yes remotes are not untrusted This removes versionedExport, which was only used by the S3 special remote. Instead, versionedexport=yes is a common way for remotes to indicate that they are versioned.	2024-08-08 15:24:19 -04:00
Joey Hess	1038567881	proxy stores received keys to known export locations This handles the workflow where the branch is first pushed to the proxy, and then files in the exported tree are later are copied to the proxied remote. Turns out that the way the export log is structured, nothing needs to be done to finalize the export once the last key is sent to it. Which is great because that would have been a lot of complication. On receiving the push, Command.Export runs and calls recordExportBeginning, does as much as it can to update the export with the files currently on it, and then calls recordExportUnderway. At that point, the export.log records the export as "complete", but it's not really. And that's fine. The same happens when using `git-annex export` when some files are not available to send. Other repositories that have access to the special remote can already retrieve files from it. As the missing files get copied to the exported remote, all that needs to be done is record each in the export db. At this point, proxying to exporttree=yes annexobjects=yes special remotes is fully working. Except for in the case where multiple files in the tree use the same key, and the files are sent to the proxied remote before pushing the tree. It seems that even special remotes without annexobjects=yes will work if used with the workflow where the git-annex branch is pushed before copying files. But not with the `git-annex push` workflow.	2024-08-07 09:47:34 -04:00
Joey Hess	c34d1da22a	Remove debug output (to stderr) Accidentially included in last version. Only happens when running code that uses remoteUrl.	2024-08-02 14:13:29 -04:00
Joey Hess	cd89f91aa5	remove uuid from annex+http urls Not needed it turns out.	2024-07-28 20:29:42 -04:00
Joey Hess	bc9cc79e85	set remote's annexUrl automatically When the remote repository's git config file has annex.url set to an annex+http url.	2024-07-28 20:13:41 -04:00
Joey Hess	0bdeafc2c4	use annex+http for accessing proxies Doesn't work yet on the http server side, which is throwing 502 bad gateway.	2024-07-25 12:00:57 -04:00
Joey Hess	75b1d50b99	add remoteAnnexP2PHttpUrl to RemoveGitConfig This is always parsed, when building without servant, a Baseurl is not generated, and users of it will need to fail.	2024-07-23 09:57:01 -04:00
Joey Hess	9ee005e49a	dummy HasClient ClientM WebSocket Enough to let lockcontent routes be included and servant-client be used. But not enough to use servant-client with those routes. May need to implement a separate runner for that part of the protocol? Also some misc other stuff needed to use servant-client. And fix exposing of UUID in the JSON types. UUID does actually have aeson instances, but they're used elsewhere (metadata --batch, although only included to get it to compile, not actually used in there) and not suitable for use here since this must work with every possible UUID.	2024-07-07 21:21:45 -04:00
Joey Hess	1243af4a18	toward SafeDropProof expiry checking Added Maybe POSIXTime to SafeDropProof, which gets set when the proof is based on a LockedCopy. If there are several LockedCopies, it uses the closest expiry time. That is not optimal, it may be that the proof expires based on one LockedCopy but another one has not expired. But that seems unlikely to really happen, and anyway the user can just re-run a drop if it fails due to expiry. Pass the SafeDropProof to removeKey, which is responsible for checking it for expiry in situations where that could be a problem. Which really only means in Remote.Git. Made Remote.Git check expiry when dropping from a local remote. Checking expiry when dropping from a P2P remote is not yet implemented. P2P.Protocol.remove has SafeDropProof plumbed through to it for that purpose. Fixing the remaining 2 build warnings should complete this work. Note that the use of a POSIXTime here means that if the clock gets set forward while git-annex is in the middle of a drop, it may say that dropping took too long. That seems ok. Less ok is that if the clock gets turned back a sufficient amount (eg 5 minutes), proof expiry won't be noticed. It might be better to use the Monotonic clock, but that doesn't advance when a laptop is suspended, and while there is the linux Boottime clock, that is not available on other systems. Perhaps a combination of POSIXTime and the Monotonic clock could detect laptop suspension and also detect clock being turned back? There is a potential future flag day where p2pDefaultLockContentRetentionDuration is not assumed, but is probed using the P2P protocol, and peers that don't support it can no longer produce a LockedCopy. Until that happens, when git-annex is communicating with older peers there is a risk of data loss when a ssh connection closes during LOCKCONTENT.	2024-07-04 12:39:06 -04:00
Joey Hess	8b5fc94d50	add optional object file location to storeKey This will be used by the next commit to simplify the proxy.	2024-07-01 10:42:27 -04:00
Joey Hess	07e899c9d3	git-annex-shell: proxy nodes located beyond remote cluster gateways Walking a tightrope between security and convenience here, because git-annex-shell needs to only proxy for things when there has been an explicit, local action to configure them. In this case, the user has to have run `git-annex extendcluster`, which now sets annex-cluster-gateway on the remote. Note that any repositories that the gateway is recorded to proxy for will be proxied onward. This is not limited to cluster nodes, because checking the node log would not add any security; someone could add any uuid to it. The gateway of course then does its own checking to determine if it will allow proxying for the remote.	2024-06-26 12:56:16 -04:00
Joey Hess	202ea3ff2a	don't sync with cluster nodes by default Avoid `git-annex sync --content` etc from operating on cluster nodes by default since syncing with a cluster implicitly syncs with its nodes. This avoids a lot of unncessary work when a cluster has a lot of nodes just in checking if each node's preferred content is satisfied. And it avoids content being sent to nodes individually, so instead syncing with clusters always fanout uploads to nodes. The downside is that there are situations where a cluster's preferred content settings can be met, but those of its nodes are not. Or where a node does not contain a key, but the cluster does, and there are not enough copies of the key yet, so it would be desirable the send it there. I think that's an acceptable tradeoff. These kind of situations are ones where the cluster itself should probably be responsible for copying content to the node. Which it can do much less expensively than a client can. Part of the balanced preferred content design that I will be working on in a couple of months involves rebalancing clusters, so I expect to revisit this. The use of annex-sync config does allow running git-annex sync with a specific node, or nodes, and it will sync with it. And it's also possible to set annex-sync git configs to make it sync with a node by default. (Although that will require setting up an explicit git remote for the node rather than relying on the proxied remote.) Logs.Cluster.Basic is needed because Remote.Git cannot import Logs.Cluster due to a cycle. And the Annex.Startup load of clusters happens too late for Remote.Git to use that. This does mean one redundant load of the cluster log, though only when there is a proxy.	2024-06-25 10:24:38 -04:00
Joey Hess	b8016eeb65	add annex-proxied This makes git-annex sync and similar not treat proxied remotes as git syncable remotes. Also, display in git-annex info remote when the remote is proxied.	2024-06-24 10:16:59 -04:00
Joey Hess	0c111fc96a	fix git-annex sync --content with proxied remotes Loading the remote list a second time was removing all proxied remotes. That happened because setting up the proxied remote added some config fields to the in-memory git config, and on the second load, it saw those configs and decided not to overwrite them with the proxy. Now on the second load, that still happens. But now, the proxied git configs are used to generate a remote same as if those configs were all set. The reason that didn't happen before was twofold, the gitremotes cache was not dropped, and the remote's url field was not set correctly. The problem with the remote's url field is that while it was marked as proxy inherited, all other proxy inherited fields are annex- configs. And the code to inherit didn't work for the url field. Now it all works, but git-annex sync is left running git push/pull on the proxied remote, which doesn't work. That still needs to be fixed.	2024-06-24 09:45:51 -04:00
Joey Hess	f18740699e	P2P protocol version 2, adding SUCCESS-PLUS and ALREADY-HAVE-PLUS Client side support for SUCCESS-PLUS and ALREADY-HAVE-PLUS is complete, when a PUT stores to additional repositories than the expected on, the location log is updated with the additional UUIDs that contain the content. Started implementing PUT fanout to multiple remotes for clusters. It is untested, and I fear fencepost errors in the relative offset calculations. And it is missing proxying for the protocol after DATA.	2024-06-18 16:21:40 -04:00
Joey Hess	780367200b	remove dead nodes when loading the cluster log This is to avoid inserting a cluster uuid into the location log when only dead nodes in the cluster contain the content of a key. One reason why this is necessary is Remote.keyLocations, which excludes dead repositories from the list. But there are probably many more. Implementing this was challenging, because Logs.Location importing Logs.Cluster which imports Logs.Trust which imports Remote.List resulted in an import cycle through several other modules. Resorted to making Logs.Location not import Logs.Cluster, and instead it assumes that Annex.clusters gets populated when necessary before it's called. That's done in Annex.Startup, which is run by the git-annex command (but not other commands) at early startup in initialized repos. Or, is run after initialization. Note that is Remote.Git, it is unable to import Annex.Startup, because Remote.Git importing Logs.Cluster leads the the same import cycle. So ensureInitialized is not passed annexStartup in there. Other commands, like git-annex-shell currently don't run annexStartup either. So there are cases where Logs.Location will not see clusters. So it won't add any cluster UUIDs when loading the log. That's ok, the only reason to do that is to make display of where objects are located include clusters, and to make commands like git-annex get --from treat keys as being located in a cluster. git-annex-shell certainly does not do anything like that, and I'm pretty sure Remote.Git (and callers to Remote.Git.onLocalRepo) don't either.	2024-06-16 14:39:44 -04:00
Joey Hess	b3370a191c	insert cluster UUIDs when loading location logs, and omit when saving Inline isClusterUUID for speed.	2024-06-14 18:06:28 -04:00
Joey Hess	bbf261487d	add git-annex updatecluster command Seems to work fine, making the right changes to the git-annex branch.	2024-06-14 15:02:01 -04:00
Joey Hess	2844230dfe	add git configs for clusters	2024-06-14 12:20:17 -04:00
Joey Hess	da3c0115cb	make cluster UUIDs distinguishable from any other repository UUID A cluster UUID is a version 8 UUID, with first octets 'a' and 'c'. The rest of the content will be random. This avoids a class of attack where the UUID of a repository is used as the UUID of a cluster, which will prevent git-annex from updating location logs for that repository. I don't know why someone would want to do that, but let's prevent it. Also, isClusterUUID make it easy to filter out cluster UUIDs when writing the location logs.	2024-06-14 11:11:09 -04:00
Joey Hess	aa56d433d5	implement cluster.log Not used yet. (Or tested.) I did consider making the log start with the uuid of the node, followed by the cluster uuid (or uuids). That would perhaps mean a smaller write to the git-annex branch when adding a node, but overall the log file would be larger, and it will be read and cached near to startup on most git-annex runs.	2024-06-13 16:00:58 -04:00
Joey Hess	2e76a4744f	inherit remote.name.annex-bare Since a proxied remote uses the proxy's git repo, this makes sense. Although I don't think this config is ever used when accessing a remote via git-annex-shell.	2024-06-12 11:53:28 -04:00
Joey Hess	d2576e5f1a	git-annex-shell: accept uuid of remote that proxying is enabled for For NotifyChanges and also for the fallthrough case where git-annex-shell passes a command off to git-shell, proxying is currently ignored. So every remote that is accessed via a proxy will be treated as the same git repository. Every other command listed in cmdsMap will need to check if Annex.proxyremote is set, and if so handle the proxying appropriately. Probably only P2PStdio will need to support proxying. For now, everything else refuses to work when proxying. The part of that I don't like is that there's the possibility a command later gets added to the list that doesn't check proxying. When proxying is not enabled, it's important that git-annex-shell not leak information that it would not have exposed before. Such as the names or uuids of remotes. I decided that, in the case where a repository used to have proxying enabled, but no longer supports any proxies, it's ok to give the user a clear error message indicating that proxying is not configured, rather than a confusing uuid mismatch message. Similarly, if a repository has proxying enabled, but not for the requested repository, give a clear error message. A tricky thing here is how to handle the case where there is more than one remote, with proxying enabled, with the specified uuid. One way to handle that would be to plumb the proxyRemoteName all the way through from the remote git-annex to git-annex-shell, eg as a field, and use only a remote with the same name. That would be very intrusive though. Instead, I decided to let the proxy pick which remote it uses to access a given Remote. And so it picks the least expensive one. The client after all doesn't necessarily know any details about the proxy's configuration. This does mean though, that if the least expensive remote is not accessible, but another remote would have worked, an access via the proxy will fail.	2024-06-10 12:44:35 -04:00
Joey Hess	7f1cdb3107	remote git config inheritance for proxied remotes When there is a proxy remote, remotes that it proxies need to be constructed with the right subset of the remote git-config settings. Obviously, the url is the same, and the uuid is different. Added proxyInheritedFields that lists all the fields that should be inherited. These will be copied into the proxied remote when instantiating it. There were a lot of decisions here, made without certainty in some cases. May need to revisit them. The RemoteGitConfigField type was added to make sure that every config used in extractRemoteGitConfig gets considered for proxy inheritance, including new ones that get added going forward. And to avoid needing to write the field string more than once.	2024-06-06 16:30:40 -04:00
Joey Hess	f97f4b8bdb	Added updateproxy command and remote.name.annex-proxy configuration So far this only records proxy information on the git-annex branch.	2024-06-04 14:52:03 -04:00
Joey Hess	a31770c350	reorder constructors for consistency	2024-06-04 13:11:27 -04:00
Joey Hess	2ffe077cc2	git-remote-annex: brought back max-git-bundles config An incremental push that gets converted to a full push due to this config results in the inManifest having just one bundle in it, and the outManifest listing every other bundle. So it actually takes up more space on the special remote. But, it speeds up clone and fetch to not have to download a long series of bundles for incremental pushes.	2024-05-28 13:28:19 -04:00

1 2 3 4 5 ...

801 commits