git-annex

Author	SHA1	Message	Date
Joey Hess	d09a005f2b	update RepoSize database from git-annex branch incrementally The use of catObjectStream is optimally fast. Although it might be possible to combine this with git-annex branch merge to avoid some redundant work. Benchmarking, a git-annex branch that had 100000 files changed took less than 1.88 seconds to run through this.	2024-08-17 13:35:00 -04:00
Joey Hess	8239824d92	consistently omit clusters when calculating RepoSizes updateRepoSize is only called on the UUID of a repository, not any cluster it might be a node of. But overLocationLogs and overLocationLogsJournal were including cluster UUIDs. So it was inconsistent. Currently I don't see any reason to calculate RepoSize for a cluster. It's not even clear what it should mean, the total size of all nodes, or the amount of information stored in the cluster in total?	2024-08-17 11:24:14 -04:00
Joey Hess	06064f897c	update Annex.reposizes when changing location logs The live update is only needed when Annex.reposizes has already been populated.	2024-08-15 13:27:14 -04:00
Joey Hess	63a3cedc45	slightly improve hairy types	2024-08-14 16:04:18 -04:00
Joey Hess	3e6eb2a58d	implement journalledRepoSizes Plan is to run this when populating Annex.reposizes on demand. So Annex.reposizes will be up-to-date with the journal, including crucially journal entries for private repositories. But also anything that has been written to the journal by another process, especially if the process was run with annex.alwayscommit=false. From there, Annex.reposizes can be kept up to date with changes made by the running process.	2024-08-14 13:53:24 -04:00
Joey Hess	8ac2685b33	calcBranchRepoSizes without journal files This will be used to prime the RepoSizes database, which will always contain values that correspond to information in the git-annex branch, so without anything from journal files. Factored out overJournalFileContents which will later be used to update Annex.reposizes to include information from journal files. This will be particularly important to support private UUIDs which only ever get to journal files and not to the branch.	2024-08-14 03:19:30 -04:00
Joey Hess	5afbea25e7	avoid counting size of keys that are in the journal twice In calcRepoSizes and also git-annex info, when a key was in the journal, it was passed to the callback twice, so the calculated size was wrong.	2024-08-13 13:23:39 -04:00
Joey Hess	467d80101a	improve handling of unmerged git-annex branches in readonly repo git-annex info was displaying a message that didn't make sense in context. In calcRepoSizes, it seems better to return the information from the git-annex branch, rather than giving up. Especially since balanced preferred content uses it, and we can't just give up evaluating a preferred content expression if git-annex is to be usable in such a readonly repo. Commit `6d7ecd9e5d` nobly wanted git-annex to behave the same with such unmerged branches as it does when it can merge them. But for the purposes of preferred content, it seems to me there's a sense that such an unmerged branch is the same as a remote we have not pulled from. The balanced preferred content will either way operate under outdated information, and so make not the best choices.	2024-08-13 13:13:12 -04:00
Joey Hess	5c35b3d579	fix typo	2024-08-13 11:47:37 -04:00
Joey Hess	99a126bebb	added reposize database The idea is that upon a merge of the git-annex branch, or a commit to the git-annex branch, the reposize database will be updated. So it should always accurately reflect the location log sizes, but it will often be behind the actual current sizes. Annex.reposizes will start with the value from the database, and get updated with each transfer, so it will reflect a process's best understanding of the current sizes. When there are multiple processes all transferring to the same repo, Annex.reposize will not reflect transfers made by the other processes since the current process started. So when using balanced preferred content, it may make suboptimal choices, including trying to transfer content to the repo when another process has already filled it up. But this is the same as if there are multiple processes running on different machines, so is acceptable. The reposize will eventually get an accurate value reflecting changes made by other processes or in other repos.	2024-08-12 11:19:58 -04:00
Joey Hess	1265d7e5df	implement maxsize log and command * maxsize: New command to tell git-annex how large the expected maximum size of a repository is. * vicfg: Include maxsize configuration.	2024-08-11 15:41:26 -04:00
Joey Hess	bd5affa362	use hmac in balanced preferred content This deals with the possible security problem that someone could make an unusually low UUID and generate keys that are all constructed to hash to a number that, mod the number of repositories in the group, == 0. So balanced preferred content would always put those keys in the repository with the low UUID as long as the group contains the number of repositories that the attacker anticipated. Presumably the attacker then holds the data for ransom? Dunno. Anyway, the partial solution is to use HMAC (sha256) with all the UUIDs combined together as the "secret", and the key as the "message". Now any change in the set of UUIDs in a group will invalidate the attacker's constructed keys from hashing to anything in particular. Given that there are plenty of other things someone can do if they can write to the repository -- including modifying preferred content so only their repository wants files, and numcopies so other repositories drop them -- this seems like safeguard enough. Note that, in balancedPicker, combineduuids is memoized.	2024-08-10 16:32:54 -04:00
Joey Hess	3a1f39fbdf	Avoid loading cluster log at startup This fixes a problem with datalad's test suite, where loading the cluster log happened to cause the git-annex branch commits to take a different shape, with an additional commit. It's also faster though, since many commands don't need the cluster log. Just fill Annex.clusters with a thunk. Sponsored-by: the NIH-funded NICEMAN (ReproNim TR&D3) project	2024-07-31 15:54:14 -04:00
Joey Hess	202ea3ff2a	don't sync with cluster nodes by default Avoid `git-annex sync --content` etc from operating on cluster nodes by default since syncing with a cluster implicitly syncs with its nodes. This avoids a lot of unnecessary work when a cluster has a lot of nodes just in checking if each node's preferred content is satisfied. And it avoids content being sent to nodes individually, so instead syncing with clusters always fans out uploads to nodes. The downside is that there are situations where a cluster's preferred content settings can be met, but those of its nodes are not. Or where a node does not contain a key, but the cluster does, and there are not enough copies of the key yet, so it would be desirable to send it there. I think that's an acceptable tradeoff. These kinds of situations are ones where the cluster itself should probably be responsible for copying content to the node. Which it can do much less expensively than a client can. Part of the balanced preferred content design that I will be working on in a couple of months involves rebalancing clusters, so I expect to revisit this. The use of annex-sync config does allow running git-annex sync with a specific node, or nodes, and it will sync with it. And it's also possible to set annex-sync git configs to make it sync with a node by default. (Although that will require setting up an explicit git remote for the node rather than relying on the proxied remote.) Logs.Cluster.Basic is needed because Remote.Git cannot import Logs.Cluster due to a cycle. And the Annex.Startup load of clusters happens too late for Remote.Git to use that. This does mean one redundant load of the cluster log, though only when there is a proxy.	2024-06-25 10:24:38 -04:00
Joey Hess	291280ced2	started on git-annex-shell cluster support Works down to P2P protocol. The question now is, how to handle protocol version negotiation for clusters? Connecting to each node to find their protocol versions and using the lowest would be too expensive with a lot of nodes. So it seems that the cluster needs to pick its own protocol version to use with the client. Then it can either negotiate that same version with the nodes when it comes time to use them, or it can translate between multiple protocol versions. That seems complicated. Thinking it would be ok to refuse to use a node if it is not able to negotiate the same protocol version with it as with the client. That will mean that nodes sometimes need to be upgraded when upgrading the cluster's proxy. But protocol versions rarely change.	2024-06-17 15:10:04 -04:00
Joey Hess	780367200b	remove dead nodes when loading the cluster log This is to avoid inserting a cluster uuid into the location log when only dead nodes in the cluster contain the content of a key. One reason why this is necessary is Remote.keyLocations, which excludes dead repositories from the list. But there are probably many more. Implementing this was challenging, because Logs.Location importing Logs.Cluster which imports Logs.Trust which imports Remote.List resulted in an import cycle through several other modules. Resorted to making Logs.Location not import Logs.Cluster, and instead it assumes that Annex.clusters gets populated when necessary before it's called. That's done in Annex.Startup, which is run by the git-annex command (but not other commands) at early startup in initialized repos. Or, is run after initialization. Note that in Remote.Git, it is unable to import Annex.Startup, because Remote.Git importing Logs.Cluster leads to the same import cycle. So ensureInitialized is not passed annexStartup in there. Other commands, like git-annex-shell currently don't run annexStartup either. So there are cases where Logs.Location will not see clusters. So it won't add any cluster UUIDs when loading the log. That's ok, the only reason to do that is to make display of where objects are located include clusters, and to make commands like git-annex get --from treat keys as being located in a cluster. git-annex-shell certainly does not do anything like that, and I'm pretty sure Remote.Git (and callers to Remote.Git.onLocalRepo) don't either.	2024-06-16 14:39:44 -04:00
Joey Hess	b3370a191c	insert cluster UUIDs when loading location logs, and omit when saving Inline isClusterUUID for speed.	2024-06-14 18:06:28 -04:00
Joey Hess	bbf261487d	add git-annex updatecluster command Seems to work fine, making the right changes to the git-annex branch.	2024-06-14 15:02:01 -04:00
Joey Hess	de1d795dfe	cache getClusters in Annex state	2024-06-14 11:16:01 -04:00
Joey Hess	da3c0115cb	make cluster UUIDs distinguishable from any other repository UUID A cluster UUID is a version 8 UUID, with first octets 'a' and 'c'. The rest of the content will be random. This avoids a class of attack where the UUID of a repository is used as the UUID of a cluster, which will prevent git-annex from updating location logs for that repository. I don't know why someone would want to do that, but let's prevent it. Also, isClusterUUID makes it easy to filter out cluster UUIDs when writing the location logs.	2024-06-14 11:11:09 -04:00
Joey Hess	aa56d433d5	implement cluster.log Not used yet. (Or tested.) I did consider making the log start with the uuid of the node, followed by the cluster uuid (or uuids). That would perhaps mean a smaller write to the git-annex branch when adding a node, but overall the log file would be larger, and it will be read and cached near to startup on most git-annex runs.	2024-06-13 16:00:58 -04:00
Joey Hess	5b668f9ef1	add missing spaces	2024-06-12 13:06:14 -04:00
Joey Hess	649b87bedd	Merge branch 'master' into proxy	2024-06-10 14:26:18 -04:00
Joey Hess	f5532be954	graft in exported tree before updating the export log It was possible for the export.log to get written and then git-annex was interrupted, before it could graft in the exported tree. Which could result in export.log referencing a tree that got garbage collected.	2024-06-07 15:25:02 -04:00
Joey Hess	b43c835def	instantiate remotes that are behind a proxy remote Untested, but this should be close to working. The proxied remotes have the same url but a different uuid. When talking to current git-annex-shell, it will fail due to a uuid mismatch. Once it supports proxies, it will know that the presented uuid is for a remote that it proxies for. The check for any git config settings for a remote with the same name as the proxied remote is there for several reasons. One is security: Writing a name to the proxy log should not cause changes to how an existing, configured git remote operates in a different clone of the repo. It's possible that the user has been using a proxied remote, and decides to set a git config for it. We can't tell the difference between that scenario and an evil remote trying to eg, intercept a file upload by replacing their remote with a proxied remote. Also, if the user sets some git config, does it override the config inherited from the proxy remote? Seems a difficult question. Luckily, the above means we don't need to think through it. This does mean though, that in order for a user to change the config of a proxy remote, they have to manually set its annex-uuid and url, as well as the config they want to change. They may also have to set any of the inherited configs that they were relying on.	2024-06-06 17:15:32 -04:00
Joey Hess	a72d0f69d0	filter out illegal remote names when reading proxy log	2024-06-06 12:51:30 -04:00
Joey Hess	f97f4b8bdb	Added updateproxy command and remote.name.annex-proxy configuration So far this only records proxy information on the git-annex branch.	2024-06-04 14:52:03 -04:00
Joey Hess	8dbee28963	reorder imports for consistency	2024-06-04 13:25:48 -04:00
Yaroslav Halchenko	87e2ae2014	run codespell throughout fixing typos automagically === Do not change lines below === { "chain": [], "cmd": "codespell -w", "exit": 0, "extra_inputs": [], "inputs": [], "outputs": [], "pwd": "." } ^^^ Do not change lines above ^^^	2024-05-01 15:46:21 -04:00
Joey Hess	07baa7ffcf	fix build on windows deletestale renamed from cleanstale	2024-03-26 13:12:58 -04:00
Joey Hess	f04d9574d6	fix transfer lock file for Download to not include uuid While redundant concurrent transfers were already prevented in most cases, it failed to prevent the case where two different repositories were sending the same content to the same repository. By removing the uuid from the transfer lock file for Download transfers, one repository sending content will block the other one from also sending the same content. In order to interoperate with old git-annex, the old lock file is still locked, as well as locking the new one. That added a lot of extra code and work, and the plan is to eventually stop locking the old lock file, at some point in time when an old git-annex process is unlikely to be running at the same time. Note that in the case of 2 repositories both doing eg `git-annex copy foo --to origin` the output is not that great: copy b (to origin...) transfer already in progress, or unable to take transfer lock git-annex: transfer already in progress, or unable to take transfer lock 97% 966.81 MiB 534 GiB/s 0sp2pstdio: 1 failed Lost connection (fd:14: hPutBuf: resource vanished (Broken pipe)) Transfer failed Perhaps that output could be cleaned up? Anyway, it's a lot better than letting the redundant transfer happen and then failing with an obscure error about a temp file, which is what it did before. And it seems users don't often try to do this, since nobody ever reported this bug to me before. (The "97%" there is actually how far along the other transfer is.) Sponsored-by: Joshua Antonishen on Patreon	2024-03-25 14:47:46 -04:00
Joey Hess	310a49a76d	nub the equivalent key list If somehow the same key gets listed in the log twice, avoid doing double checksumming.	2024-03-01 14:39:01 -04:00
Joey Hess	55bf01b788	add equivalent key log for VURL keys When downloading a VURL from the web, make sure that the equivalent key log is populated. Unfortunately, this does not hash the content while it's being downloaded from the web. There is not an interface in Backend currently for incremental hash generation, only for incremental verification of an existing hash. So this might add a noticeable delay, and it has to show a "(checksum...)" message. This could stand to be improved. But, that separate hashing step only has to happen on the first download of new content from the web. Once the hash is known, the VURL key can have its hash verified incrementally while downloading except when the content in the web has changed. (Doesn't happen yet because verifyKeyContentIncrementally is not implemented yet for VURL keys.) Note that the equivalent key log file is formatted as a presence log. This adds a tiny bit of overhead (eg "1 ") per line over just listing the urls. The reason I chose to use that format is it seems possible that there will need to be a way to remove an equivalent key at some point in the future. I don't know why that would be necessary, but it seemed wise to allow for the possibility. Downloads of VURL keys from other special remotes that claim urls, like bittorrent for example, do not populate the equivalent key log. So for now, no checksum verification will be done for those. Sponsored-by: Nicholas Golder-Manning on Patreon	2024-02-29 16:01:49 -04:00
Joey Hess	9a67ed0f10	importtree: support preferred content expressions needing keys When importing from a special remote, support preferred content expressions that use terms that match on keys (eg "present", "copies=1"). Such terms are ignored when importing, since the key is not known yet. When "standard" or "groupwanted" is used, the terms in those expressions also get pruned accordingly. This does allow setting preferred content to "not (copies=1)" to make a special remote into a "source" type of repository. Importing from it will import all files. Then exporting to it will drop all files from it. In the case of setting preferred content to "present", it's pruned on import, so everything gets imported from it. Then on export, it's applied, and everything in it is left on it, and no new content is exported to it. Since the old behavior on these preferred content expressions was for importtree to error out, there's no backwards compatibility to worry about. Except that sync/pull/etc will now import where before it errored out.	2023-12-18 16:27:59 -04:00
Joey Hess	4ed71b34de	migrate --apply And avoid migrate --update/--apply migrating when the new key was already present in the repository, and got dropped. Luckily, the location log allows distinguishing from the new key never having been present! That is mostly useful for --apply because otherwise dropped files would keep coming back until the old objects were reaped as unused. But it seemed to make sense to also do it for --update, for consistency in edge cases if nothing else. One case where --update can use it is when one branch got migrated earlier, and we dropped the file, and now another branch has migrated the same file. Sponsored-by: Jack Hill on Patreon	2023-12-08 13:23:46 -04:00
Joey Hess	7c7c9912c1	migrate --update gets keys The git log is outputting the diff, but this only looks at the new files. When we have a new file, we can get the old filename by just replacing "new" with "old". And then use branchFileRef to refer to it allows catting the old key. While this does have to skip past the old files in the diff, it's still faster than calling git diff separately. Sponsored-by: Nicholas Golder-Manning on Patreon	2023-12-07 17:25:56 -04:00
Joey Hess	f1ce15036f	started migrate --update This is most of the way there, but not quite working. The layout of migrate.tree/ needs to be changed to follow this approach. git log will list all the files in tree order, so the new layout needs to alternate old and new keys. Can that be done? git may not document tree order, or may not preserve it here. Alternatively, change to using git log --format=raw and extract the tree header from that, then use git diff --raw $tree:migrate.tree/old $tree:migrate.tree/new That will be a little more expensive, but only when there are lots of migrations. Sponsored-by: Joshua Antonishen on Patreon	2023-12-07 15:50:52 -04:00
Joey Hess	d06aee7ce0	make commitMigration interruption safe Fixed inversion of control issue, so the tree is recorded in streamLogFile finalizer. Sponsored-by: Leon Schuermann on Patreon	2023-12-06 16:29:58 -04:00
Joey Hess	0bd8b17b59	log migration trees to git-annex branch This will allow distributed migration: Start a migration in one clone of a repo, and then update other clones. commitMigration is a bit of a bear.. There is some inversion of control that needs some TMVars. Also streamLogFile's finalizer does not handle recording the trees, so an interrupt at just the wrong time can cause migration.log to be emptied but the git-annex branch not updated. Sponsored-by: Graham Spencer on Patreon	2023-12-06 15:40:03 -04:00
Joey Hess	38b9ebc5fd	newtype MapLog Noticed that Semigroup instance of Map is not suitable to use for MapLog. For example, it behaved like this: ghci> parseTrustLog "foo 1 timestamp=10\nfoo 2 timestamp=11" <> parseTrustLog "foo X timestamp=12" fromList [(UUID "foo",LogEntry {changed = VectorClock 11s, value = SemiTrusted})] Which was wrong, it lost the newer DeadTrusted value. Luckily, nothing used that Semigroup when operating on a MapLog. And this provides a safe instance. Sponsored-by: Graham Spencer on Patreon	2023-11-13 14:37:22 -04:00
Joey Hess	574514545c	git-annex log --sizesof This can take a lot of memory. I decided to violate the usual rule in git-annex that it operate in constant memory no matter how many annexed objects. In this case, it would be hard to be fast without using a big map of the location logs. The main difficulty here is that there can be many git-annex branches and it needs to display a consistent view at a point in time, which means merging information from multiple git-annex branches. I have not checked if there are any laziness leaks in this code. It takes 1 gb to run in my big repo, which is around what I estimated before writing it. 2 options that are documented are not yet implemented. Small bug: With eg --when=1h, it will display at 12:00 then 1:10 if the next change after 12:59 is then. Then it waits until after 2:10 to display the next change. It ought to wait until after 2:00. Sponsored-by: Brock Spratlen on Patreon	2023-11-10 17:26:10 -04:00
Joey Hess	11cc9f1933	info: Added calculation of combined annex size of all repositories Factored out overLocationLogs from CmdLine.Seek, which can calculate this pretty fast even in a large repo. In my big repo, the time to run git-annex info went up from 1.33s to 8.5s. Note that the "backend usage" stats are for annexed files in the working tree only, not all annexed files. This new data source would let that be changed, but that would be a confusing behavior change. And I cannot retitle it either, out of fear something uses the current title (eg parsing the json). Also note that, while time says "402108maxresident" in my big repo now, up from "54092maxresident", top shows the RES constant at 64mb, and it was 48mb before. So I don't think there is a memory leak. I tried using deepseq to force full evaluation of addKeyCopies and memory use didn't change, which also says no memory leak. And indeed, not even calling addKeyCopies resulted in the same memory use. Probably the increased memory usage is buffering the stream of data from git in overLocationLogs. Sponsored-by: Brett Eisenberg on Patreon	2023-11-08 13:35:11 -04:00
Joey Hess	eb42935e58	Windows: Fix CRLF handling in some log files In particular, the mergedrefs file was written with CR added to each line, but read without CRLF handling. This resulted in each update of the file adding CR to each line in it, growing the number of lines, while also preventing the optimisation from working, so it remerged unnecessarily. writeFile and readFile do NewlineMode translation on Windows. But the ByteString conversion prevented that from happening any longer. I've audited for other cases of this, and found three more (.git/annex/index.lck, .git/annex/ignoredrefs, and .git/annex/import/). All of those also only prevent optimisations from working. Some other files are currently both read and written with ByteString, but old git-annex may have written them with NewlineMode translation. Other files are at risk for breakage later if the reader gets converted to ByteString. This is a minimal fix, but should be enough, as long as I remember to use fileLines when splitting a ByteString into lines. This leaves files written using ByteString without CR added, but that's ok because old git-annex has no difficulty reading such files. When the mergedrefs file has gotten lines that end with "\r\r\r\n", this will eventually clean it up. Each update will remove a single trailing CR. Note that S8.lines is still used in eg Command.Unused, where it is parsing git show-ref, and similar in Git/*. git commands don't include CR in their output so that's ok. Sponsored-by: Joshua Antonishen on Patreon	2023-10-30 14:23:23 -04:00
Joey Hess	d9fd205cbb	push RawFilePath down into Annex.ReplaceFile Minor optimisation, but a win in every case, except for a couple where it's a wash. Note that replaceFile still takes a FilePath, because it needs to operate on Chars to truncate unicode filenames properly.	2023-10-26 13:36:49 -04:00
Joey Hess	3742263c99	simplify base64 to only use ByteString Note the use of fromString and toString from Data.ByteString.UTF8 dated back to commit `9b93278e8a`. Back then it was using the dataenc package for base64, which operated on Word8 and String. But with the switch to sandi, it uses ByteString, and indeed fromB64' and toB64' were already using ByteString without that complication. So I think there is no risk of such an encoding related breakage. I also tested the case that `9b93278e8a` fixed: git-annex metadata -s foo='a …' x git-annex metadata x metadata x foo=a … In Remote.Helper.Encryptable, it was avoiding using Utility.Base64 because of that UTF8 conversion. Since that's no longer done, it can just use it now.	2023-10-26 13:10:05 -04:00
Joey Hess	c9866d2164	speed up populating the importfeed database Avoid conversion from ByteString to String for urls that will just be converted right back to ByteString to go into the database. Also setTempUrl is not used by importfeed, so avoid checking for temp urls in this code path. This benchmarks as only a small improvement. From 2.99s to 2.78s when populating a database with 33k urls. Note that it does not seem worth replacing URLString with URLByteString generally, because the ways urls are used all entails either parseURI, which takes a string, or passing a parameter to eg curl, which also is currently a string. Sponsored-by: Leon Schuermann on Patreon	2023-10-25 13:00:17 -04:00
Joey Hess	10b5f79e2d	fix empty tree import when directory does not exist Fix behavior when importing a tree from a directory remote when the directory does not exist. An empty tree was imported, rather than the import failing. Merging that tree would delete every file in the branch, if those files had been exported to the directory before. The problem was that dirContentsRecursive returned [] when the directory did not exist. Better for it to throw an exception. But in commit `74f0d67aa3` back in 2012, I made it never throw exceptions, because exceptions thrown inside unsafeInterleaveIO become untrappable when the list is being traversed. So, changed it to list the contents of the directory before entering unsafeInterleaveIO. So exceptions are thrown for the directory. But still not if it's unable to list the contents of a subdirectory. That's less of a problem, because the subdirectory does exist (or if not, it got removed after being listed, and it's ok to not include it in the list). A subdirectory that has permissions that don't allow listing it will have its contents omitted from the list still. (Might be better to have it return a type that includes indications of errors listing contents of subdirectories?) The rest of the changes are making callers of dirContentsRecursive use emptyWhenDoesNotExist when they relied on the behavior of it not throwing an exception when the directory does not exist. Note that it's possible some callers of dirContentsRecursive that used to ignore permissions problems listing a directory will now start throwing exceptions on them. The fix to the directory special remote consisted of not making its call in listImportableContentsM use emptyWhenDoesNotExist. So it will throw an exception as desired. Sponsored-by: Joshua Antonishen on Patreon	2023-08-15 12:57:41 -04:00
Joey Hess	518a51a8a0	--explain for preferred/required content matching And annex.largefiles and annex.addunlocked. Also git-annex matchexpression --explain explains why its input expression matches or fails to match. When there is no limit, avoid explaining why the lack of limit matches. This is also done when no preferred content expression is set, although in a few cases it defaults to a non-empty matcher, which will be explained. Sponsored-by: Dartmouth College's DANDI project	2023-07-26 14:50:04 -04:00
Joey Hess	6821ba8dab	sync: use log to track adjusted branch needs updating Speeds up sync in an adjusted branch by avoiding re-adjusting the branch unnecessarily, particularly when it is adjusted with --hide-missing or --unlock-present. When there are a lot of files, that was the majority of the time of a --no-content sync. Uses a log file, which is updated when content presence changes. This adds a little bit of overhead to every file get/drop when on such an adjusted branch. The overhead is minimal for get of any size of file, but might be noticeable for drop in some cases. It seems like a reasonable trade-off. It would be possible to update the log file only at the end, but then it would not happen if the command is interrupted. When not in an adjusted branch, there should be no additional overhead. (getCurrentBranch is an MVar read, and it avoids the MVar read of getGitConfig.) Note that this does not deal with situations such as: git checkout master, git-annex get, git checkout adjusted branch, git-annex sync. The sync won't know that the adjusted branch needs to be updated. Dealing with that would add overhead to operation in non-adjusted branches, which I don't like. Also, there are other situations like having two adjusted branches that both need to be updated like this, and switching between them and sync not updating. This does mean a behavior change to sync, since it did previously deal with those situations. But, the documentation did not say that it did. The man pages only talk about sync updating the adjusted branch after it transfers content. I did consider making sync keep track of content it transferred (and dropped) and only update the adjusted branch then, not to catch up to other changes made previously. That would perform better. But it seemed rather hard to implement, and also it would have problems with races with a concurrent get/drop, which this implementation avoids. 
And it seemed pretty likely someone had gotten used to get/drop followed by sync updating the branch. It seems much less likely someone is switching branches, doing get/drop, and then switching back and expecting sync to update the branch. Re-running git-annex adjust still does a full re-adjusting of the branch, for anyone who needs that. Sponsored-by: Leon Schuermann on Patreon	2023-06-08 14:35:41 -04:00
Joey Hess	3c15e0f7a0	cache negative lookups of global numcopies and mincopies Speeds up eg git-annex sync --content by up to 50%. When it does not need to transfer or drop anything, it now noops a lot more quickly. I didn't see anything else in sync --content noop loop that could really be sped up. It has to cat git objects to keys, stat object files, etc. Sponsored-by: unqueued on Patreon	2023-06-06 14:43:25 -04:00
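
Commit bd5affa362 above describes picking a repository by taking an HMAC (sha256) with the group's combined UUIDs as the "secret" and the key as the "message", then reducing modulo the number of repositories. A minimal Python sketch of that idea follows. It is not git-annex's balancedPicker: the function name `balanced_pick`, the sorting of UUIDs, the plain string concatenation, and the big-endian reading of the digest are all assumptions made for illustration.

```python
import hashlib
import hmac


def balanced_pick(group_uuids, key):
    """Pick one repository UUID from a non-empty group for the given key."""
    # Assumption: sort first, so every clone derives the same secret and index.
    ordered = sorted(group_uuids)
    # All UUIDs combined act as the HMAC "secret"; the key is the "message".
    secret = "".join(ordered).encode("utf-8")
    digest = hmac.new(secret, key.encode("utf-8"), hashlib.sha256).digest()
    # Assumption: read the digest as a big-endian integer, then reduce it
    # modulo the number of repositories in the group.
    return ordered[int.from_bytes(digest, "big") % len(ordered)]
```

Because the secret depends on the whole set of UUIDs, adding or removing a repository changes the HMAC for every key, which is the property the commit relies on to invalidate keys an attacker constructed in advance.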

546 commits
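
Commit 38b9ebc5fd notes that the plain Map Semigroup lost the newer value when two logs were combined. The merge it wants instead (newest entry per UUID wins) can be sketched in Python as below. This is only an illustration of the rule, not the MapLog instance itself: `merge_map_logs`, the `{uuid: (timestamp, value)}` layout, and the tie-breaking choice are hypothetical.

```python
def merge_map_logs(a, b):
    """Merge two logs of the form {uuid: (timestamp, value)}; newest wins."""
    merged = dict(a)
    for uuid, entry in b.items():
        # Replace only when b's entry is strictly newer.
        # Assumption: on equal timestamps the entry from `a` is kept.
        if uuid not in merged or entry[0] > merged[uuid][0]:
            merged[uuid] = entry
    return merged
```

With the values from the commit message, merging an entry for "foo" at timestamp 11 with one at timestamp 12 keeps the timestamp 12 entry regardless of argument order.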