git-annex

Author	SHA1	Message	Date
Joey Hess	d60a33fd13	improve live update starting In an expression like "balanced=foo and exclude=bar", avoid it starting a live update when the overall expression doesn't match.	2024-08-24 13:07:05 -04:00
Joey Hess	2f20b939b7	LiveUpdate db updates working I've tested the behavior of the thread that waits for the LiveUpdate to be finished, and it does get signaled and exit cleanly when the LiveUpdate is GCed instead. Made finishedLiveUpdate wait for the thread to finish updating the database. There is a case where GC doesn't happen in time and the database is left with a live update recorded in it. This should not be a problem as such stale data can also happen when interrupted and will need to be detected when loading the database. Balanced preferred content expressions now call startLiveUpdate.	2024-08-24 11:49:58 -04:00
Joey Hess	c3d40b9ec3	plumb in LiveUpdate (WIP) Each command that first checks preferred content (and/or required content) and then does something that can change the sizes of repositories needs to call prepareLiveUpdate, and plumb it through the preferred content check and the location log update. So far, only Command.Drop is done. Many other commands that don't need to do this have been updated to keep working. There may be some calls to NoLiveUpdate in places where that should be done. All will need to be double checked. Not currently in a compilable state.	2024-08-23 16:35:12 -04:00
Joey Hess	70e2fca257	Added the annex.fullybalancedthreshhold git config.	2024-08-22 07:15:55 -04:00
Joey Hess	76ece2a699	make --rebalance of balanced use fullysizebalanced when useful When the specified number of copies is > 1, and some repositories are too full, it can be better to move content from them to other less full repositories, in order to make space for new content. annex.fullybalancedthreshhold is documented, but not implemented yet This is not tested very well yet, and is known to sometimes take several runs to stabalize.	2024-08-21 17:59:08 -04:00
Joey Hess	9e87061de2	Support "sizebalanced=" and "fullysizebalanced=" too Might want to make --rebalance turn balanced=group:N where N > 1 to fullysizebalanced=group:N. Have not yet determined if that will improve situations enough to be worth the extra work.	2024-08-21 15:01:54 -04:00
Joey Hess	de7ac1bb70	fix	2024-08-20 13:52:46 -04:00
Joey Hess	476d223bce	implement fullbalanced=group:N Rebalancing this when it gets into a suboptimal situation will need further work.	2024-08-20 13:51:02 -04:00
Joey Hess	b62b58b50b	git-annex info speed up using getRepoSizes	2024-08-17 14:54:31 -04:00
Joey Hess	c200523bac	implement getRepoSizes At this point the RepoSize database is getting populated, and it all seems to be working correctly. Incremental updates still need to be done to make it performant.	2024-08-15 12:31:56 -04:00
Joey Hess	3e6eb2a58d	implement journalledRepoSizes Plan is to run this when populating Annex.reposizes on demand. So Annex.reposizes will be up-to-date with the journal, including crucially journal entries for private repositories. But also anything that has been written to the journal by another process, especially if the process was ran with annex.alwayscommit=false. From there, Annex.reposizes can be kept up to date with changes made by the running process.	2024-08-14 13:53:24 -04:00
Joey Hess	8ac2685b33	calcBranchRepoSizes without journal files This will be used to prime the RepoSizes database, which will always contain values that correpond to information in the git-annex branch, so without anything from journal files. Factored out overJournalFileContents which will later be used to update Annex.reposizes to include information from journal files. This will be partitcularly important to support private UUIDs which only ever get to journal files and not to the branch.	2024-08-14 03:19:30 -04:00
Joey Hess	08f55948e9	take all repository locations into account for balancing This fully fixes --rebalance stability, and also deals with an issue where a file is present in each balanced repository and it didn't want to remove it from any.	2024-08-13 13:46:47 -04:00
Joey Hess	10d8b3cc63	fixed --rebalance stability on drop Was checking the wrong uuid, oops	2024-08-13 13:32:11 -04:00
Joey Hess	467d80101a	improve handling of unmerged git-annex branches in readonly repo git-annex info was displaying a message that didn't make sense in context. In calcRepoSizes, it seems better to return the information from the git-annex branch, rather than giving up. Especially since balanced preferred content uses it, and we can't just give up evaluating a preferred content expression if git-annex is to be usable in such a readonly repo. Commit `6d7ecd9e5d` nobly wanted git-annex to behave the same with such unmerged branches as it does when it can merge them. But for the purposes of preferred content, it seems to me there's a sense that such an unmerged branch is the same as a remote we have not pulled from. The balanced preferred content will either way operate under outdated information, and so make not the best choices.	2024-08-13 13:13:12 -04:00
Joey Hess	745bc5c547	take maxsize into account for balanced preferred content This is very innefficient, it will need to be optimised not to calculate the sizes of repos every time. Also, fixed a bug in balancedPicker that caused it to pick a too high index when some repos were excluded due to being full.	2024-08-13 11:00:20 -04:00
Joey Hess	bd5affa362	use hmac in balanced preferred content This deals with the possible security problem that someone could make an unusually low UUID and generate keys that are all constructed to hash to a number that, mod the number of repositories in the group, == 0. So balanced preferred content would always put those keys in the repository with the low UUID as long as the group contains the number of repositories that the attacker anticipated. Presumably the attacker than holds the data for ransom? Dunno. Anyway, the partial solution is to use HMAC (sha256) with all the UUIDs combined together as the "secret", and the key as the "message". Now any change in the set of UUIDs in a group will invalidate the attacker's constructed keys from hashing to anything in particular. Given that there are plenty of other things someone can do if they can write to the repository -- including modifying preferred content so only their repository wants files, and numcopies so other repositories drom them -- this seems like safeguard enough. Note that, in balancedPicker, combineduuids is memoized.	2024-08-10 16:32:54 -04:00
Joey Hess	3ce2e95a5f	balanced preferred content and --rebalance This all works fine. But it doesn't check repository sizes yet, and without repository size checking, once a repository gets full, there will be no other repository that will want its files. Use of sha2 seems unncessary, probably alder2 or md5 or crc would have been enough. Possibly just summing up the bytes of the key mod the number of repositories would have sufficed. But sha2 is there, and probably hardware accellerated. I doubt very much there is any security benefit to using it though. If someone wants to construct a key that will be balanced onto a given repository, sha2 is certianly not going to stop them.	2024-08-09 14:16:09 -04:00
Joey Hess	64afbb0b93	don't count clusters as copies, continued Handled limitCopies, as well as everything using fromNumCopies and fromMinCopies. This should be everything, probably. Note that, git-annex info displays a count of repositories, which still includes cluster. I think that's ok. It would be possible to filter out clusters there, but to the user they're pretty much just another repository. The numcopies displayed by eg `git-annex info .` does not include clusters.	2024-06-16 15:14:53 -04:00
Joey Hess	cc17ac423b	implement isCryptographicallySecureKey for VURL Considerable difficulty to work around an import cycle. Had to move the list of backends (except for VURL) to Backend.Variety to VURL could use it. Sponsored-by: Kevin Mueller on Patreon	2024-02-29 17:26:35 -04:00
Joey Hess	b9e147d282	Added --expected-present file matching option	2024-01-25 12:56:41 -04:00
Joey Hess	6d83bcff0f	Fix behavior of onlyingroup Sponsored-by: k0ld on Patreon	2023-08-07 13:05:11 -04:00
Joey Hess	fa92383993	onlyingroup * Support "onlyingroup=" in preferred content expressions. * Support --onlyingroup= matching option. Sponsored-by: Jack Hill on Patreon	2023-07-31 14:43:58 -04:00
Joey Hess	499d014123	improve match explanations Using == and != proved too hard to read, so went with [TRUE] and [FALSE] after the term. I would kind of liked to have used emojis for a green check and red X, but probably that is too fancy to be a good idea. Make --explain output be inside [ ] with whitespace around them, to avoid it ending with eg "[FALSE]]" and to make it easier to visually find the start of it. Sponsored-by: Dartmouth College's DANDI project	2023-07-26 15:37:03 -04:00
Joey Hess	518a51a8a0	--explain for preferred/required content matching And annex.largefiles and annex.addunlocked. Also git-annex matchexpression --explain explains why its input expression matches or fails to match. When there is no limit, avoid explaining why the lack of limit matches. This is also done when no preferred content expression is set, although in a few cases it defaults to a non-empty matcher, which will be explained. Sponsored-by: Dartmouth College's DANDI project	2023-07-26 14:50:04 -04:00
Joey Hess	f25eeedeac	initial implementation of --explain Currently it only displays explanations of options like --in and --copies. In the future, it should explain preferred content expression evaluation and other decisions. The explanations of a few things could be better. In particular, "standard" will just appear as-is (or as "!standard" if it doesn't match), rather than explaining why the standard preferred content expression for the group matches or not. Currently as implemented, it goes to stdout, and so commands like git-annex find that have custom output will not display --explain information. Perhaps that should change, dunno. Sponsored-by: Dartmouth College's DANDI project	2023-07-25 16:52:57 -04:00
Joey Hess	cd544e548b	filter out control characters in error messages giveup changed to filter out control characters. (It is too low level to make it use StringContainingQuotedPath.) error still does not, but it should only be used for internal errors, where the message is not attacker-controlled. Changed a lot of existing error to giveup when it is not strictly an internal error. Of course, other exceptions can still be thrown, either by code in git-annex, or a library, that include some attacker-controlled value. This does not guard against those. Sponsored-by: Noam Kremen on Patreon	2023-04-10 13:50:51 -04:00
Yaroslav Halchenko	84b0a3707a	Apply codespell -w throughout	2023-03-17 15:14:58 -04:00
Yaroslav Halchenko	0ae5ff797f	Typo: sansative -> sensitive	2023-03-17 15:14:50 -04:00
Joey Hess	54ad1b4cfb	Windows: Support long filenames in more (possibly all) of the code Works around this bug in unix-compat: https://github.com/jacobstanley/unix-compat/issues/56 getFileStatus and other FilePath using functions in unix-compat do not do UNC conversion on Windows. Made Utility.RawFilePath use convertToWindowsNativeNamespace to do the necessary conversion on windows to support long filenames. Audited all imports of System.PosixCompat.Files to make sure that no functions that operate on FilePath were imported from it. Instead, use the equvilants from Utility.RawFilePath. In particular the re-export of that module in Common had to be removed, which led to lots of other changes throughout the code. The changes to Build.Configure, Build.DesktopFile, and Build.TestConfig make Utility.Directory not be needed to build setup. And so let it use Utility.RawFilePath, which depends on unix, which cannot be in setup-depends. Sponsored-by: Dartmouth College's Datalad project	2023-03-01 15:55:58 -04:00
Joey Hess	0b2dd374d8	--anything and --nothing Added --anything (and --nothing). Eg, git-annex find --anything will list all annexed files whether or not the content is present. This is slightly faster and clearer than --include=* or --exclude=* While I can't imagine how --nothing will be used, preferred content expressions already had anything and nothing, so might as well support both as matching options as well. Sponsored-by: Dartmouth College's Datalad project	2022-12-20 15:44:09 -04:00
Joey Hess	8a4cfd4f2d	use getSymbolicLinkStatus not getFileStatus to avoid crash on broken symlink Fix crash importing from a directory special remote that contains a broken symlink. The crash was in listImportableContentsM but some other places in Remote.Directory also seemed like they could have the same problem. Also audited for other places that have such a problem. Not all calls to getFileStatus are bad, in some cases it's better to crash on something unexpected. For example, `git-annex import path` when the path is a broken symlink should crash, the same as when it does not exist. Many of the getFileStatus calls are like that, particularly when they involve .git/annex/objects which should never have a broken symlink in it. Fixed a few other possible cases of the problem. Sponsored-by: Lawrence Brogan on Patreon	2022-09-05 13:46:32 -04:00
Joey Hess	d266a41f8d	prevent numcopies or mincopies being configured to 0 Ignore annex.numcopies set to 0 in gitattributes or git config, or by git-annex numcopies or by --numcopies, since that configuration would make git-annex easily lose data. Same for mincopies. This is a continuation of the work to make data only be able to be lost when --force is used. It earlier led to the --trust option being disabled, and similar reasoning applies here. Most numcopies configs had docs that strongly discouraged setting it to 0 anyway. And I can't imagine a use case for setting to 0. Not that there might not be one, but it's just so far from the intended use case of git-annex, of managing and storing your data, that it does not seem like it makes sense to cater to such a hypothetical use case, where any git-annex drop can lose your data at any time. Using a smart constructor makes sure every place avoids 0. Note that this does mean that NumCopies is for the configured desired values, and not the actual existing number of copies, which of course can be 0. The name configuredNumCopies is used to make that clear. Sponsored-by: Brock Spratlen on Patreon	2022-03-28 15:20:34 -04:00
Joey Hess	b5f5475ed6	New matching options --excludesamecontent and --includesamecontent The normalisation of filenames turns out to be the tricky part here, because the associated files coming out of the keys db may look like "./foo/bar" or "../bar". For the former to match a glob like "foo/", it needs to be normalised. Note that, on windows, normalise "./foo/bar" = "foo\\bar" which a glob like "foo/" won't match. So the glob is matched a second time, on the toInternalGitPath, so allowing the user to provide a glob with the slashes in either direction. However, this still won't support some wacky edge cases like the user providing a glob of "foo/bar\\*" Sponsored-by: Dartmouth College's Datalad project	2021-05-25 13:08:18 -04:00
Joey Hess	cbf94fd13d	prep for fixing find --branch --unlocked Added LinkType to ProvidedInfo, and unified MatchingKey with ProvidedInfo. They're both used in the same way, so there was no real reason to keep separate. Note that addLocked and addUnlocked still set matchNeedsFileName, because to handle MatchingFile, they do need it. However, they don't use it when MatchingInfo is provided. This should be ok, the --branch case will be able skip checking matchNeedsFileName, since it will provide a filename in any case.	2021-03-02 13:39:31 -04:00
Joey Hess	ee4fd38ecf	remove unused contentFile = Nothing	2021-03-01 16:35:38 -04:00
Joey Hess	25e4ab7e81	Prevent combinations of options such as --all with --include Previously such nonsensical combinations always treated the matching option as if it didn't match. For now, made find --branch refuse matching options that need a filename, because one is not provided to them in a way they'll use. There's an open bug report to support it, but making it error out is better than the old behavior of not finding what it was asked to. Also, made --mimetype combined with eg --all work, by looking at the object file when operating on keys.	2021-03-01 16:25:23 -04:00
Joey Hess	a3a19518d8	fix --time-limit It got broken in several ways by the streaming seeking optimisations around version 8.20201007. Moved time limit checking out of the matcher, which was a hack in the first place. So everywhere that uses Limit.getMatcher needs to check time limit. Well, almost everywhere. Command.Info uses it, but it does not make sense to time limit getting info. And Command.MultiCast uses it just to build up a list of files that then get passed to a command, so it would never have hit the timeout in a useful way. This implementation is a little more expensive when at time limit than necessary, since it continues seeking only to discard everything after the time limit. I did try making it close the file handles to force a faster shutdown, but that didn't work and hung. Could certianly be improved somehow, but seeking is probably not the expensive bit when a time limit is hit, so this seems acceptable for now.	2021-01-04 15:57:11 -04:00
Joey Hess	6b13574827	Windows: include= and exclude= containing '/' will also match filenames that are written using '\' And vice-versa, but it's better to use '/' for portability. Notably, standardPreferredContent contains "archive/*" and that might not match if the filename ends up coming in with the slashes the other way around.	2020-12-15 12:39:34 -04:00
Joey Hess	01527b21d8	add key to FileInfo MatchingKey is not the thing to use when matching on actual worktreee files. Fix reversion in 8.20201116 that made include= and exclude= in preferred/required content expressions match a path relative to the current directory, rather than the path from the top of the repository.	2020-12-14 17:42:02 -04:00
Joey Hess	9b0dde834e	convert getFileSize to RawFilePath Lots of nice wins from this in avoiding unncessary work, and I think nothing got slower. This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.	2020-11-05 11:32:57 -04:00
Joey Hess	87f91ce563	more RawFilePath conversion 451/645	2020-10-30 15:55:59 -04:00
Joey Hess	7036d0a4c1	add, import: Fix a reversion in 7.20191009 that broke handling of --largerthan and --smallerthan This commit was sponsored by Jochen Bartl on Patreon.	2020-10-19 15:36:18 -04:00
Joey Hess	15c1ee16d9	import --no-content: Check annex.largefiles Import small files into git, the same as is done when importing with content. Which means, for small files, --no-content does download them. If the largefiles expression needs the file content available (due to mimetype or mimeencoding being used), the import will fail. This commit was sponsored by Jake Vosloo on Patreon.	2020-09-28 13:28:57 -04:00
Joey Hess	8b74f01a26	split ProvidedInfo and UserProvidedInfo The latter is for git-annex matchexpression and matching against it can throw an exception. Splitting out the former reduces the potential for mistakes and avoids needing to worry about matching against that throwing an exception. This is more groundwork for matching largefiles while importing, without downloading content. This commit was sponsored by Graham Spencer on Patreon.	2020-09-28 12:12:38 -04:00
Joey Hess	00dbe35fbc	allow matching on files whose content is not present Anything that needs to examine the file content will fail to match, or fall back to other available information. But the intent is that the matcher be checked for matchNeedsFileContent and only be used if it does not, so the exact behavior doesn't much matter as it should never happen. The real point of this is to not need to provide a dummy content file when matching. This commit was sponsored by Martin D on Patreon.	2020-09-28 11:17:46 -04:00
Joey Hess	d81f549385	fix some compile warnings left in yesterday at least 2 could have caused a crash in some circumstances This commit was sponsored by Brett Eisenberg on Patreon.	2020-09-25 10:55:39 -04:00
Joey Hess	ace02f41b0	seek: defer matcher check until more info is known Sped up seeking for files to operate on, when using options like --copies or --in, by around 20%. Benchmark showed an increase for --copies from 155 seconds to 121 seconds, and --in remote will be similar to that. For --in here, the speedup was less, 5-10% or so. (both warm cache) This commit was sponsored by Jack Hill on Patreon.	2020-09-24 17:59:12 -04:00
Joey Hess	b7afcda887	fix some matchNeedsFileName values matchMagic: Always False for MatchingKey. Unsure why.. Could be a bug? limitUnused: Behaves differently when there is a filename. limitSize: When used with LimitDiskFiles, checks the size on disk of the filename.	2020-09-24 16:08:47 -04:00
Joey Hess	d89984b121	sync --all avoid unncessary first pass Sped up seeking to around twice as fast, by avoiding a pass over the worktree files when preferred content expressions of the local repo and remotes don't use include=/exclude=. Thanks to Lukey for identifying the optimisation. This commit was sponsored by Brock Spratlen on Patreon.	2020-09-24 15:12:09 -04:00

1 2 3 4

167 commits