git-annex

Author	SHA1	Message	Date
Joey Hess	d3d5d2b4ec	fix test suite failure when run with LANG=C	2021-08-18 17:36:00 -04:00
Joey Hess	fa62c98910	simplify and speed up Utility.FileSystemEncoding This eliminates the distinction between decodeBS and decodeBS', encodeBS and encodeBS', etc. The old implementation truncated at NUL, and the primed versions had to do extra work to avoid that problem. The new implementation does not truncate at NUL, and is also a lot faster. (Benchmarked at 2x faster for decodeBS and 3x for encodeBS; more for the primed versions.) Note that filepath-bytestring 1.4.2.1.8 contains the same optimisation, and upgrading to it will speed up to/fromRawFilePath. AFAIK, nothing relied on the old behavior of truncating at NUL. Some code used the faster versions in places where I was sure there would not be a NUL. So this change is unlikely to break anything. Also, moved s2w8 and w82s out of the module, as they do not involve filesystem encoding really. Sponsored-by: Shae Erisson on Patreon	2021-08-11 12:13:31 -04:00
Joey Hess	1acdd18ea8	deal better with clock skew situations, using vector clocks * Deal with clock skew, both forwards and backwards, when logging information to the git-annex branch. * GIT_ANNEX_VECTOR_CLOCK can now be set to a fixed value (eg 1) rather than needing to be advanced each time a new change is made. * Misuse of GIT_ANNEX_VECTOR_CLOCK will no longer confuse git-annex. When changing a file in the git-annex branch, the vector clock to use is now determined by first looking at the current time (or GIT_ANNEX_VECTOR_CLOCK when set), and comparing it to the newest vector clock already in use in that file. If a newer time stamp was already in use, advance it forward by a second instead. When the clock is set to a time in the past, this avoids logging with an old timestamp, which would risk that log line later being ignored in favor of a "newer" line that is really not newer. When a log entry has been made with a clock that was set far ahead in the future, this avoids newer information being logged with an older timestamp and so being ignored in favor of that future-timestamped information. Once all clocks get fixed, this will result in the vector clocks being incremented, until finally enough time has passed that time gets back ahead of the vector clock value, and then it will return to usual operation. (This latter situation is not ideal, but it seems the best that can be done. The issue with it is, since all writers will be incrementing the last vector clock they saw, there's no way to tell when one writer made a write significantly later in time than another, so the earlier write might arbitrarily be picked when merging. This problem is why git-annex uses timestamps in the first place, rather than pure vector clocks.) Advancing forward by 1 second is somewhat arbitrary. setDead advances a timestamp by just 1 picosecond, and the vector clock could too. But then it would interfere with setDead, which wants to be overruled by any change. So it could use 2 picoseconds or something, but that seems weird. It could just as well advance it forward by a minute or whatever, but then it would be harder for real time to catch up with the vector clock when forward clock slew had happened. A complication is that many log files contain several different pieces of information, and it may be best to only use vector clocks for the same piece of information. For example, a key's location log file contains InfoPresent/InfoMissing for each UUID, and it only looks at the vector clocks for the UUID that is being changed, and not other UUIDs. Although exactly where the dividing line is can be hard to determine. Consider metadata logs, where a field "tag" can have multiple values set at different times. Should it advance forward past the last tag? Probably. What about when a different field is set, should it look at the clocks of other fields? Perhaps not, but currently it does, and this does not seem like it will cause any problems. Another one I'm not entirely sure about is the export log, which is keyed by (fromuuid, touuid). So if multiple repos are exporting to the same remote, different vector clocks can be used for that remote. It looks like that's probably ok, because it does not try to determine what order things occurred when there was an export conflict. Sponsored-by: Jochen Bartl on Patreon	2021-08-04 12:33:46 -04:00
Joey Hess	ada61676ff	factor out pure code	2021-06-30 16:42:08 -04:00
Joey Hess	78da00c7a6	Future proof activity log parsing When the log has an activity that is not known, eg added by a future version of git-annex, it used to be treated as no activity at all, which would make git-annex expire think it should expire the repository, despite it having some kind of recent activity. Hopefully there will be no reason to add a new activity until enough time has passed that this commit is in use everywhere. Sponsored-by: Jake Vosloo on Patreon	2021-06-14 14:18:19 -04:00
Joey Hess	cedc28a783	prevent dropping required content of other file using same content When two files have the same content, and a required content expression matches one but not the other, dropping the latter file will fail as it would also remove the content of the required file. This will slow down drop (w/o --auto), dropunused, mirror, and move, by one keys db lookup per file. But I did include an optimisation to avoid a double db lookup in the drop --auto / sync --content case. I suspect that dropunused could also use PreferredContentChecked True, but haven't entirely thought it through and it's rarely used with enough files for the optimisation to matter. Sponsored-by: Dartmouth College's Datalad project	2021-05-25 11:34:06 -04:00
Joey Hess	1d16654a22	convert formatLsTree to ByteString for speed	2021-05-17 10:46:24 -04:00
Joey Hess	5cb05c43c9	fix build after merge	2021-04-23 13:26:10 -04:00
Joey Hess	d5a05655b4	Merge branch 'master' into hiddenannex	2021-04-23 13:06:33 -04:00
Joey Hess	b689f17062	refactoring	2021-04-23 11:44:10 -04:00
Joey Hess	657d55c401	convert withKnownUrls to use overBranchFileContents This only partly fixes importfeed to see journalled files, since it separately cats metadata directly from the branch. Held off on a changelog for a bug fix until that's dealt with.	2021-04-23 11:32:25 -04:00
Joey Hess	05989556a2	start implementing hidden git-annex repositories This adds a separate journal, which does not currently get committed to an index, but is planned to be committed to .git/annex/index-private. Changes that are regarding a UUID that is private will get written to this journal, and so will not be published into the git-annex branch. All log writing should have been made to indicate the UUID it's regarding, though I've not verified this yet. Currently, no UUIDs are treated as private yet, a way to configure that is needed. The implementation is careful to not add any additional IO work when privateUUIDsKnown is False. It will skip looking at the private journal at all. So this should be free, or nearly so, unless the feature is used. When it is used, all branch reads will be about twice as expensive. It is very lucky -- or very prudent design -- that Annex.Branch.change and maybeChange are the only ways to change a file on the branch, and Annex.Branch.set is only internal use. That let Annex.Branch.get always yield any private information that has been recorded, without the risk that Annex.Branch.set might be called, with a non-private UUID, and end up leaking the private information into the git-annex branch. And, this relies on the way git-annex union merges the git-annex branch. When reading a file, there can be a public and a private version, and they are just concatenated together. That will be handled the same as if there were two diverged git-annex branches that got union merged.	2021-04-20 15:04:53 -04:00
Joey Hess	8e7dc958d2	forget: Preserve currently exported trees Avoiding problems with exporttree remotes in some unusual circumstances. This commit was sponsored by Brett Eisenberg on Patreon.	2021-04-13 15:00:23 -04:00
Joey Hess	a10cc80997	split out Logs.Export.Pure This will allow Annex.Branch to use it, in transitions code. This commit was sponsored by Luke Shumaker on Patreon.	2021-04-13 14:10:23 -04:00
Joey Hess	4631d1ab56	Fix build with attoparsec-0.14 It changed parseOnly in the ByteString.Lazy module to take a lazy, not strict ByteString. In all these cases though, we actually had a strict ByteString, so the most efficient fix, which also happens to avoid needing ifdefs, is to use the non-lazy module instead. This commit was sponsored by Denis Dzyubenko on Patreon.	2021-03-24 12:11:50 -04:00
Joey Hess	a8b837aaef	add git ls-tree --long parser Not yet used, but allows getting the size of items in the tree fairly cheaply. I noticed that CmdLine.Seek uses ls-tree and then feeds the files into another long-running process to check their size. That would be an example of a place that might be sped up by using this. Although in that particular case, it only needs to know the size of unlocked files, not locked. And since enabling --long probably doubles the ls-tree runtime or more, the overhead of using it there may outweigh the benefit.	2021-03-23 12:47:00 -04:00
Joey Hess	f175d4cc90	rmurl: remove all forms of an url, no matter what the downloader is set to * rmurl: When youtube-dl was used for an url, it no longer needs to be prefixed with "yt:" in order to be removed. * rmurl: If an url is both used by the web and also claimed by another special remote, fix a bug that caused the url to not be removed. The youtube-dl change is a consequence of how the bug fix is implemented. But I also think it's the right thing to do. Consider that, before, git-annex addurl $url followed by git-annex rmurl $url would not remove the url in the case where youtube-dl was used. That was surprising behavior. In the unlikely case where a special remote claims an url, and it's been added using OtherDownloader, but it was also added already as a web url, it seems better for rmurl to remove both than to arbitrarily remove only one. And in the case the bug report was filed for, when an url was added as a web url, but a special remote now claims it, that should not prevent rmurl removing the web url. Calling setUrlMissing lets other callers of it behave differently. Probably the calls to it in eg, Remote.External and Remote.BitTorrent are fine, since they don't mangle the url and just remove what was provided, and the OtherDownloader form of a bittorrent url, respectively. I suspect unregisterurl needs to have a similar change made to rmurl, for similar reasons.	2021-03-22 12:09:15 -04:00
Joey Hess	dd39e9e255	suggest when user may want annex.stalldetection When annex.stalldetection is not enabled, and a likely stall is detected, display a suggestion to enable it. Note that the progress meter display is not taken down when displaying the message, so it will display like this: 0% 8 B 0 B/s Transfer seems to have stalled. To handle stalling transfers, configure annex.stalldetection 0% 10 B 0 B/s Although of course if it's really stalled, it will never update again after the message. Taking down the progress meter and starting a new one doesn't seem too necessary given how unusual this is, also this does help show the state it was at when it stalled. Use of uninterruptibleCancel here is ok, the thread it's canceling only does STM transactions and sleeps. The annex thread that gets forked off is separate to avoid it being canceled, so that it can be joined back at the end. A module cycle required moving from dupState the precaching of the remote list. Doing it at startConcurrency should cover all the cases where the remote list is used in concurrent actions. This commit was sponsored by Kevin Mueller on Patreon.	2021-02-03 15:57:19 -04:00
Joey Hess	e3224ff77d	formatLsTree did not use a tab where git does Fixed that, and made parserLsTree accept the space as well as tab. Fixes a reversion that made import of a tree from a special remote result in a merge that deleted files that were not preferred content of that special remote.	2021-01-28 12:36:37 -04:00
Joey Hess	cc89699457	mincopies This is conceptually very simple, just making a 1 that was hard coded be exposed as a config option. The hard part was plumbing all that, and dealing with complexities like reading it from git attributes at the same time that numcopies is read. Behavior change: When numcopies is set to 0, git-annex used to drop content without requiring any copies. Now to get that (highly unsafe) behavior, mincopies also needs to be set to 0. It seemed better to remove that edge case, than complicate mincopies by ignoring it when numcopies is 0. This commit was sponsored by Denis Dzyubenko on Patreon.	2021-01-06 14:15:19 -04:00
Joey Hess	36133f27c0	move untrust forcing from Logs.Trust into Remote No behavior changes here, but this is groundwork for letting remotes such as borg vary untrust forcing depending on configuration.	2020-12-28 15:22:10 -04:00
Joey Hess	46059ab0e5	split off versionedExport from appendonly S3 uses versionedExport, while GitLFS uses appendonly. This is groundwork for later changes.	2020-12-28 14:37:15 -04:00
Joey Hess	6280af2901	generate more compact git-annex branch for imports Especially from borg, where the content identifier logs all end up being the same identical file! But also, for other imports, the location tracking logs can, in some cases, be identical files. Bonus optimisation: Avoid looking up (and parsing when set) GIT_ANNEX_VECTOR_CLOCK env var every time a log is written to. Although the lookup does happen at startup even when no log will be written now.	2020-12-23 15:25:16 -04:00
Joey Hess	f8aadbfb9b	avoid grafting in an imported tree when it's not changed This just avoids some churn in the git-annex branch.	2020-12-23 14:31:14 -04:00
Joey Hess	7916fc98a3	graft in imported tree to avoid gc Fix a bug that could prevent getting files from an importtree=yes remote, because the imported tree was allowed to be garbage collected.	2020-12-23 14:27:38 -04:00
Joey Hess	5d8e4a7c74	avoid borg list of archives that have been listed before This makes sync a lot faster in the common case where there's no new backup. There's still room for it to be faster. Currently the old imported tree has to be traversed, to generate the ImportableContents. Which then gets turned around to generate the new imported tree, which is identical. So, it would be possible to just return a "no new imports", or an ImportableContents that has a way to graft in a tree. The latter is probably too far to go to optimise this, unless other things need it. The former might be worth it, but it's already pretty fast, since git ls-tree is pretty fast.	2020-12-22 14:06:40 -04:00
Joey Hess	06ef1b7d68	improve storage of redundant ContentIdentifiers When a ContentIdentifier is already recorded, don't add it to the log again, and avoid updating the log.	2020-12-22 12:03:25 -04:00
Joey Hess	1b5cb77acf	importtree only remotes are untrusted, same as exporttree remotes Importtree only remotes are new; importtree remotes used to always also be exporttree, so were untrusted. Since an import remote is one that can be edited by something other than git-annex, it's clearly not trustworthy at all.	2020-12-17 13:45:07 -04:00
Joey Hess	a422a056f2	make getViaTmpFrom no longer update location log All callers adjusted to update it themselves. In Command.ReKey, and Command.SetKey, the cleanup action already did, so it was updating the log twice before. This fixes a bug when annex.stalldetection is set, as now Command.Transferrer can skip updating the location log, and let it be updated by the calling process.	2020-12-11 11:50:13 -04:00
Joey Hess	4b739fc460	Fix build on Windows Thanks to bug reporter for the patch.	2020-11-19 12:33:00 -04:00
Joey Hess	885974be99	add newtypes for QuickCheck to avoid LANG=C issues All properties changed to use them, except for prop_encode_c_decode_c_roundtrip, which already filtered to ascii for other reasons. A few modules had to be split out, because Setup does not build-depend on QuickCheck.	2020-11-09 20:21:18 -04:00
Joey Hess	2c8cf06e75	more RawFilePath conversion Converted file mode setting to it, and follow-on changes. Compiles up through 369/646. This commit was sponsored by Ethan Aubin.	2020-11-05 18:45:37 -04:00
Joey Hess	eb42cd4d46	more RawFilePath conversion 535/645 This commit was sponsored by Brett Eisenberg on Patreon.	2020-11-03 10:11:04 -04:00
Joey Hess	681b44236a	more RawFilePath conversion at 377/645 This commit was sponsored by Svenne Krap on Patreon.	2020-10-29 14:20:57 -04:00
Joey Hess	f45ad178cb	more RawFilePath conversion At 318/645 after 4k lines of changes This commit was sponsored by Jake Vosloo on Patreon.	2020-10-29 12:03:50 -04:00
Joey Hess	e505c03bcc	more RawFilePath conversion nukeFile replaced with removeWhenExistsWith removeLink, which allows using RawFilePath. Utility.Directory cannot use RawFilePath since setup does not depend on posix. This commit was sponsored by Graham Spencer on Patreon.	2020-10-29 10:50:29 -04:00
Joey Hess	363acfb55b	more log file actions Which will be needed soon. And use more ByteStrings for speed. This commit was sponsored by Graham Spencer on Patreon.	2020-10-20 16:51:03 -04:00
Joey Hess	ca454c47f2	explicitly wait for a git process Eliminate a zombie that was only cleaned up by the later zombie cleanup code. This is still not ideal, it would be cleaner if it used conduit or something, and if the thread gets killed before waiting, it won't stop the process. Only remaining zombies are in CmdLine.Seek	2020-09-25 11:03:12 -04:00
Joey Hess	d89984b121	sync --all avoid unnecessary first pass Sped up seeking to around twice as fast, by avoiding a pass over the worktree files when preferred content expressions of the local repo and remotes don't use include=/exclude=. Thanks to Lukey for identifying the optimisation. This commit was sponsored by Brock Spratlen on Patreon.	2020-09-24 15:12:09 -04:00
Joey Hess	5cfcf1f05f	cache remote.log Unlikely to speed up any of the existing uses much, but I want to use it in a message that might be displayed many times.	2020-09-22 13:52:26 -04:00
Joey Hess	7b2d236556	importfeed: stream metadata for 5% speedup On top of the 10% speedup from streaming url logs.	2020-07-14 14:35:26 -04:00
Joey Hess	535cdc8d48	importfeed: Made checking known urls step around 10% faster. This was a bit disappointing, I was hoping for a 2x speedup. But, I think the metadata lookup is wasting a lot of time and also needs to be made to stream. The changes to catObjectStreamLsTree were benchmarked to not also speed up --all around 3% more. Seems I managed to make it polymorphic after all.	2020-07-14 12:47:51 -04:00
Joey Hess	9f6bd6cc05	add inRepoDetails planned to use for an optimisation most things using stagedDetails were not expecting to get dup files in a conflicted merge and deal with them, so converted them to use inRepoDetails.	2020-07-08 15:36:35 -04:00
Joey Hess	7347e50123	add stage number to stagedDetails parser And convert parser to attoparsec, probably faster. Before, a parse failure threw the whole --stage output line in to the filename, which was certainly a bad idea, so fixed that.	2020-07-08 15:05:12 -04:00
Joey Hess	57b89c635f	support required groupwanted When the required content is set to "groupwanted", use whatever expression has been set in groupwanted as the required content of the repo, similar to how setting required content to "standard" already worked.	2020-04-28 13:31:26 -04:00
Joey Hess	f85ca7dc80	fix all remaining -Wincomplete-uni-patterns warnings A couple of these were probably actual bugs in edge cases. Most of the changes I'm fine with. The fact that aeson's object returns something that we know will be an Object, but the type checker does not know is kind of annoying.	2020-04-15 13:55:08 -04:00
Joey Hess	0e4c92503e	fix warning I don't think the NoConfigValue case ever actually occurs here.	2020-04-15 13:04:00 -04:00
Joey Hess	9cb69dbb76	support boolean git configs that are represented by the name of the setting with no value Eg "core.bare" is the same as "core.bare = true". Note that git treats "core.bare =" the same as "core.bare = false", so the code had to become more complicated in order to treat the absence of a value differently than an empty value. Ugh.	2020-04-13 13:35:22 -04:00
Joey Hess	aeca7c2207	Sped up query commands that read the git-annex branch by around 5% The only price paid is one additional MVar read per write to the journal. Presumably writing a journal file dominates over a MVar read time by several orders of magnitude. --batch does not get the speedup because then it needs to notice when another process has made a change. Also made the assistant and other daemon modes bypass the optimisation, which would not help them anyway.	2020-04-09 13:54:43 -04:00
Joey Hess	c0cd07c36b	Ref ByteString conversion done Test suite passes.	2020-04-07 17:41:09 -04:00
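
The clock-skew rule described in commit 1acdd18ea8 (take the current time, or GIT_ANNEX_VECTOR_CLOCK when set, compare it with the newest clock already in the file, and step forward by one second if the file is already ahead) can be sketched roughly as below. This is an illustrative Python sketch, not git-annex's Haskell code: the function name `choose_vector_clock` and its parameters are hypothetical, and treating an equal timestamp as "already in use" is an assumption made so that a fixed GIT_ANNEX_VECTOR_CLOCK value still advances.

```python
from typing import Iterable, Optional


def choose_vector_clock(
    now: float,
    existing: Iterable[float],
    override: Optional[float] = None,
) -> float:
    """Pick the timestamp for a new log line (hypothetical helper).

    now      -- current wall-clock time, in seconds
    existing -- clocks already present for the same piece of information
    override -- stand-in for GIT_ANNEX_VECTOR_CLOCK, when it is set
    """
    # Start from the override if one is given, else from the real clock.
    candidate = override if override is not None else now
    newest = max(existing, default=None)
    # Assumption: an equal or newer clock already in the log means the
    # candidate would not win, so advance one second past the newest one.
    if newest is not None and newest >= candidate:
        return newest + 1.0
    return candidate
```

With a local clock set in the past, `choose_vector_clock(100.0, [500.0])` yields `501.0`, so the new line is not shadowed by the older entry; once real time passes the logged value, the function returns the real time again.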


470 commits
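
The distinction drawn in commit 9cb69dbb76, where a config key with no value at all means true while a key with an empty value means false, can be illustrated with a small sketch. This is Python written for illustration only, not the git-annex parser: `parse_git_bool` is a hypothetical name, `None` stands for "the key appeared with no value", and the accepted literals follow git's documented boolean words (small set only; other nonzero integers are not handled here).

```python
from typing import Optional

# Boolean literals as documented for git config (case-insensitive).
TRUE_WORDS = {"true", "yes", "on", "1"}
FALSE_WORDS = {"false", "no", "off", "0", ""}


def parse_git_bool(value: Optional[str]) -> Optional[bool]:
    """Interpret a git config value as a boolean (hypothetical helper).

    None means the key was present with no value, e.g. a bare "core.bare".
    An empty string means the key was written as "core.bare =".
    Returns None when the value is not a recognised boolean literal.
    """
    if value is None:
        # Absence of a value: the setting name alone means true.
        return True
    word = value.strip().lower()
    if word in TRUE_WORDS:
        return True
    if word in FALSE_WORDS:
        # Includes the empty string, which git treats as false.
        return False
    return None
```

Keeping "no value" and "empty value" as separate inputs is the point of the sketch: collapsing both to an empty string would make `core.bare` and `core.bare =` indistinguishable.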