git-annex

Author	SHA1	Message	Date
Joey Hess	5a9e6b1fd4	when private journal file exists, still read from git-annex branch Fix bug that caused stale git-annex branch information to read when annex.private or remote.name.annex-private is set. The private journal file should not prevent reading more current information from the git-annex branch, but used to. Note that, overBranchFileContents has to do additional work now, when there's a private journal file, it reads from the branch redundantly and more slowly. Sponsored-by: Jack Hill on Patreon	2021-10-26 13:43:50 -04:00
Joey Hess	e853ef3095	decorate openTempFile errors with the template name This is to track down what file in .git/annex/ is being written to via a temp file when the repository is read-only. Sponsored-by: Dartmouth College's Datalad project	2021-08-30 13:05:02 -04:00
Joey Hess	8b6dad11a2	add createMessage init: When annex.commitmessage is set, use that message for the commit that creates the git-annex branch. This will be used by filter-branch too, and it seems to make sense to let annex.commitmessage affect it.	2021-05-17 13:07:47 -04:00
Joey Hess	4ff8a1ae2b	refactoring filterBranch should be reusable for copy-branch command. Changed LogVariety to differentiate between LocationLog and UrlLog; only location logs contain uuids and need to be filtered by uuid, while url logs do not. This does not change current behavior, but it will let filterBranch be reused without filtering url logs incorrectly.	2021-05-13 14:43:25 -04:00
Joey Hess	32138b8cd8	implement annex.privateremote and remote.name.private configs The slightly unusual parsing in Types.GitConfig avoids the need to look at the remote list to get configs of remotes. annexPrivateRepos combines all the configs, and will only be calculated once, so it's nice and fast. privateUUIDsKnown and regardingPrivateUUID now need to read from the annex mvar, so are not entirely free. But that overhead can be optimised away, as seen in getJournalFileStale. The other call sites didn't seem worth optimising to save a single MVar access. The feature should have impreceptable speed overhead when not being used.	2021-04-23 14:21:57 -04:00
Joey Hess	d5a05655b4	Merge branch 'master' into hiddenannex	2021-04-23 13:06:33 -04:00
Joey Hess	657d55c401	convert withKnownUrls to use overBranchFileContents This only partly fixes importfeed to see journalled files, since it separately cats metadata directly from the branch. Held off on a changelog for a bug fix until that's dealt with.	2021-04-23 11:32:25 -04:00
Joey Hess	c687eae80b	got private repos really working This new TODO will need private indexes to resolve; until then the private journal has to be checked when private UUIDs are known.	2021-04-21 16:26:23 -04:00
Joey Hess	24eeacdba8	adapt recent bug fixes to support private journal At this point, private repos should mostly work, except for a few commands that directly read from the git-annex branch and will not see the private journal. Private index not yet implemented.	2021-04-21 16:01:13 -04:00
Joey Hess	0bb57702e1	Merge branch 'master' into hiddenannex	2021-04-21 15:45:12 -04:00
Joey Hess	653b719472	fix --all to include not yet committed files from the journal Fix bug caused by recent optimisations that could make git-annex not see recently recorded status information when configured with annex.alwayscommit=false. This does mean that --all can end up processing the same key more than once, but before the optimisations that introduced this bug, it used to also behave that way. So I didn't try to fix that; it's an edge case and anyway git-annex behaves well when run on the same key repeatedly. I am not too happy with the use of a MVar to buffer the list of files in the journal. I guess it doesn't defeat lazy streaming of the list, if that list is actually generated lazily, and anyway the size of the journal is normally capped and small, so if configs are changed to make it huge and this code path fire, git-annex using enough memory to buffer it all is not a large problem.	2021-04-21 15:40:32 -04:00
Joey Hess	74acf17a31	refactoring	2021-04-21 14:29:02 -04:00
Joey Hess	6eb3c0a6b4	fix branch precacheing bug by checking journal Fix bug caused by recent optimisations that could make git-annex not see recently recorded status information when configured with annex.alwayscommit=false. When not using --all, precaching only gets triggered when the command actually needs location logs, and so there's no speed hit there. This is a minor speed hit for --all, because it precaches even when the location log is not actually going to be used, and so checking the journal is not necessary. It would have been possible to defer checking the journal until the cache gets used. But that would complicate the usual Branch.get code path with two different kinds of caches, and the speed hit is really minimal. A better way to speed up --all, later, would be to avoid precaching at all when the location log is not going to be used.	2021-04-21 14:02:15 -04:00
Joey Hess	05989556a2	start implementing hidden git-annex repositories This adds a separate journal, which does not currently get committed to an index, but is planned to be committed to .git/annex/index-private. Changes that are regarding a UUID that is private will get written to this journal, and so will not be published into the git-annex branch. All log writing should have been made to indicate the UUID it's regarding, though I've not verified this yet. Currently, no UUIDs are treated as private yet, a way to configure that is needed. The implementation is careful to not add any additional IO work when privateUUIDsKnown is False. It will skip looking at the private journal at all. So this should be free, or nearly so, unless the feature is used. When it is used, all branch reads will be about twice as expensive. It is very lucky -- or very prudent design -- that Annex.Branch.change and maybeChange are the only ways to change a file on the branch, and Annex.Branch.set is only internal use. That let Annex.Branch.get always yield any private information that has been recorded, without the risk that Annex.Branch.set might be called, with a non-private UUID, and end up leaking the private information into the git-annex branch. And, this relies on the way git-annex union merges the git-annex branch. When reading a file, there can be a public and a private version, and they are just concacenated together. That will be handled the same as if there were two diverged git-annex branches that got union merged.	2021-04-20 15:04:53 -04:00
Joey Hess	b2222e4639	optimisation Avoid unnecessary conversion to/from String.	2021-04-20 13:13:45 -04:00
Joey Hess	8e7dc958d2	forget: Preserve currently exported trees Avoiding problems with exporttree remotes in some unusual circumstances. This commit was sponsored by Brett Eisenberg on Patreon.	2021-04-13 15:00:23 -04:00
Joey Hess	bdba2c5914	fastDebug Annex.Branch reads and writes Reads of cached data are not debugged, only cache misses are, and since many commands pre-cache location log data, this avoids a slew of fastDebug calls when running commands such as git-annex get --from	2021-04-06 16:48:24 -04:00
Joey Hess	5d78cd9d08	Sped up git-annex init in a clone of an existing repository Seems that hasOrigin was never finding origin's git-annex branch, so a new one got created each time. And so then it later needed to merge the two branches, which is expensive. Added --no-track to git branch to avoid it displaying a message about setting up tracking branches. Of course there's no reason to make the git-annex branch a tracking branch since git-annex auto-merges it.	2021-03-23 15:23:13 -04:00
Joey Hess	a8b837aaef	add git ls-tree --long parser Not yet used, but allows getting the size of items in the tree fairly cheaply. I noticed that CmdLine.Seek uses ls-tree and the feeds the files into another long-running process to check their size. That would be an example of a place that might be sped up by using this. Although in that particular case, it only needs to know the size of unlocked files, not locked. And since enabling --long probably doubles the ls-tree runtime or more, the overhead of using it there may outwweigh the benefit.	2021-03-23 12:47:00 -04:00
Joey Hess	a3b714ddd9	finish fixing removeLink on windows `9cb250f7be` got the ones in RawFilePath, but there were others that used the one from unix-compat, which fails at runtime on windows. To avoid this, import System.PosixCompat.Files hiding removeLink This commit was sponsored by Ethan Aubin.	2020-11-24 13:20:44 -04:00
Joey Hess	2c8cf06e75	more RawFilePath conversion Converted file mode setting to it, and follow-on changes. Compiles up through 369/646. This commit was sponsored by Ethan Aubin.	2020-11-05 18:45:37 -04:00
Joey Hess	681b44236a	more RawFilePath conversion at 377/645 This commit was sponsored by Svenne Krap on Patreon.	2020-10-29 14:20:57 -04:00
Joey Hess	e505c03bcc	more RawFilePath conversion nukeFile replaced with removeWhenExistsWith removeLink, which allows using RawFilePath. Utility.Directory cannot use RawFilePath since setup does not depend on posix. This commit was sponsored by Graham Spencer on Patreon.	2020-10-29 10:50:29 -04:00
Joey Hess	ca454c47f2	explicitly wait for a git process Eliminate a zombie that was only cleaned up by the later zombie cleanup code. This is still not ideal, it would be cleaner if it used conduit or something, and if the thread gets killed before waiting, it won't stop the process. Only remaining zombies are in CmdLine.Seek	2020-09-25 11:03:12 -04:00
Joey Hess	e72ec8b9b2	add back git-annex branch read cache The cache was removed way back in 2012, commit `3417c55189` Then I forgot I had removed it! I remember clearly multiple times when I thought, "this reads the same data twice, but the cache will avoid that being very expensive". The reason it was removed was it messed up the assistant noticing when other processes made changes. That same kind of problem has recently been addressed when adding the optimisation to avoid reading the journal unnecessarily. Indeed, enableInteractiveJournalAccess is run in just the right places, so can just piggyback on it to know when it's not safe to use the cache.	2020-07-06 12:22:33 -04:00
Joey Hess	cee6b344b4	cat-file resource pool Avoid running a large number of git cat-file child processes when run with a large -J value. This implementation takes care to avoid adding any overhead to git-annex when run without -J. When run with -J, there is a small bit of added overhead, to manipulate the resource pool. That optimisation added a fair bit of complexity.	2020-04-20 15:19:31 -04:00
Joey Hess	43a9808292	disable journal read optimisation when alwayscommit=false The journal read optimisation in `aeca7c220` later got fixed in `eedd73b84` to stage and commit any files that were left in the journal by a previous git-annex run. That's necessary for the optimisation to work correctly. But it also meant that alwayscommit=false started committing the previous git-annex processes journalled changes, which defeated the purpose of the config setting entirely. So, disable the optimisation when alwayscommit=false, leaving the files in the journal and not committing them. See my comments on the bug report for why this seemed the best approach. Also fixes a problem when annex.merge-annex-branches=false and there are changes in the journal. That config indirectly prevents committing the journal. (Which seems a bit odd given its name, but it always has..) So, when there were changes in the journal, perhaps left there due to alwayscommit=false being set before, the optimisation would prevent git-annex from reading the journal files, and it would operate with out of date information.	2020-04-15 13:24:33 -04:00
Joey Hess	eedd73b846	fix reversion caused by earlier optimisation to git-annex branch reads `aeca7c2207` was predicated on the assumption that updateTo would stage any journal files, but in one case it did not actually do so. The test suite happened to expose the bug.	2020-04-10 15:25:22 -04:00
Joey Hess	2caf579718	cache annex index filename for 1.5% speedup to queries	2020-04-10 13:37:04 -04:00
Joey Hess	aeca7c2207	Sped up query commands that read the git-annex branch by around 5% The only price paid is one additional MVar read per write to the journal. Presumably writing a journal file dominiates over a MVar read time by several orders of magnitude. --batch does not get the speedup because then it needs to notice when another process has made a change. Also made the assistant and other damon modes bypass the optimisation, which would not help them anyway.	2020-04-09 13:54:43 -04:00
Joey Hess	c0cd07c36b	Ref ByteString conversion done Test suite passes.	2020-04-07 17:41:09 -04:00
Joey Hess	6c81e0c8f1	ByteString Ref continued Several nice speed wins I think. At 340/633 files converted.	2020-04-07 13:27:11 -04:00
Joey Hess	879f52a116	annex.tune.branchhash1=true bugfix Fix support for repositories tuned with annex.tune.branchhash1=true, including --all not working and git-annex log not displaying anything for annexed files.	2020-02-14 15:22:48 -04:00
Joey Hess	c19211774f	use filepath-bytestring for annex object manipulations git-annex find is now RawFilePath end to end, no string conversions. So is git-annex get when it does not need to get anything. So this is a major milestone on optimisation. Benchmarks indicate around 30% speedup in both commands. Probably many other performance improvements. All or nearly all places where a file is statted use RawFilePath now.	2019-12-11 15:25:07 -04:00
Joey Hess	bdec7fed9c	convert TopFilePath to use RawFilePath Adds a dependency on filepath-bytestring, an as yet unreleased fork of filepath that operates on RawFilePath. Git.Repo also changed to use RawFilePath for the path to the repo. This does eliminate some RawFilePath -> FilePath -> RawFilePath conversions. And filepath-bytestring's </> is probably faster. But I don't expect a major performance improvement from this. This is mostly groundwork for making Annex.Location use RawFilePath, which will allow for a conversion-free pipleline.	2019-12-09 15:07:21 -04:00
Joey Hess	067aabdd48	wip RawFilePath 2x git-annex find speedup Finally builds (oh the agoncy of making it build), but still very unmergable, only Command.Find is included and lots of stuff is badly hacked to make it compile. Benchmarking vs master, this git-annex find is significantly faster! Specifically: num files old new speedup 48500 4.77 3.73 28% 12500 1.36 1.02 66% 20 0.075 0.074 0% (so startup time is unchanged) That's without really finishing the optimization. Things still to do: * Eliminate all the fromRawFilePath, toRawFilePath, encodeBS, decodeBS conversions. * Use versions of IO actions like getFileStatus that take a RawFilePath. * Eliminate some Data.ByteString.Lazy.toStrict, which is a slow copy. * Use ByteString for parsing git config to speed up startup. It's likely several of those will speed up git-annex find further. And other commands will certianly benefit even more.	2019-11-26 16:01:58 -04:00
Joey Hess	0be23bae2f	refactor Better to not have a single function module, and better to have a more specific type than Bool. This commit was sponsored by Jack Hill on Patreon	2019-11-11 19:10:52 -04:00
Joey Hess	3b34d123ed	Added annex.allowsign option. This commit was sponsored by Ilya Shlyakhter on Patreon.	2019-11-11 16:28:56 -04:00
Joey Hess	4306dfbe68	remove empty log files in transition forget --drop-dead: Remove several classes of git-annex log files when they become empty, further reducing the size of the git-annex branch. Noticed while testing sameas uuid removal, but it could happen other times too. An empty log file is always treated by git-annex the same as no file being present, and when the files are per-key, it can be a sizable space saving to exclude them from the tree.	2019-10-14 16:04:15 -04:00
Joey Hess	5e9a2cc37f	forget state of sameas remotes during DropDead transitions It would have been a lot less round-about to just make git annex dead also add the uuids of sameas remotes to the trust.log as dead. But, that would fail in the case where there's an unmerged other clone that has a sameas remote that the current repo does not know about. Then it would not get marked as dead. Handling it at transition time avoids that scenario. Note that the generation of trustmap' in dropDead should only happen once, due to the partial application.	2019-10-14 15:47:42 -04:00
Joey Hess	2a41712ef1	avoid stageJournal escaping withOtherTmp This is only done for correctness sake; I don't see any way that it would have caused a problem here. The jlog file escaped withOtherTmp so another process could swoop in and delete it, but the file is only used as a buffer for a list of filenames, and its handle gets rewound and they're read back out, which will still work even if it's already been deleted. The only reason I didn't just pre-delete the file and keep the handle open is I'm not sure that works on all OS's (eg Windows). If there was a problem that this fixed it might involve an OS that doesn't support deleting an open file or something like that.	2019-05-07 11:57:12 -04:00
Joey Hess	40ecf58d4b	update licenses from GPL to AGPL This does not change the overall license of the git-annex program, which was already AGPL due to a number of sources files being AGPL already. Legally speaking, I'm adding a new license under which these files are now available; I already released their current contents under the GPL license. Now they're dual licensed GPL and AGPL. However, I intend for all my future changes to these files to only be released under the AGPL license, and I won't be tracking the dual licensing status, so I'm simply changing the license statement to say it's AGPL. (In some cases, others wrote parts of the code of a file and released it under the GPL; but in all cases I have contributed a significant portion of the code in each file and it's that code that is getting the AGPL license; the GPL license of other contributors allows combining with AGPL code.)	2019-03-13 15:48:14 -04:00
Joey Hess	7af55de83c	optimisation: use graftTree to remember the export branch Sped up git-annex export in repositories with lots of keys. Old method read whole git-annex branch tree into memory.	2019-02-22 11:16:22 -04:00
Joey Hess	8fdea8f444	WIP Added graftTree but it's buggy. Should use graftTree in Annex.Branch.graftTreeish; it will be faster than the current implementation there. Started Annex.Import, but untested and it doesn't yet handle tree grafting.	2019-02-21 17:32:59 -04:00
Joey Hess	d5f2463702	misctmp cleanup * Switch to using .git/annex/othertmp for tmp files other than partial downloads, and make stale files left in that directory when git-annex is interrupted be cleaned up promptly by subsequent git-annex processes. * The .git/annex/misctmp directory is no longer used and git-annex will delete anything lingering in there after it's 1 week old. Also, in Annex.Ingest, made the filename it uses in the tmp dir be prefixed with "ingest-" to avoid potentially using a filename used by some other code.	2019-01-17 16:02:22 -04:00
Joey Hess	2eadb6cd68	convert transitions.log to attoparsec and bytestring-builder Not likely to be any speed gain here, but this completes porting every log file over. And, it let me get rid of code copied from ghc and modified, so simplifying the licensing.	2019-01-10 17:13:30 -04:00
Joey Hess	591e4b145f	convert old uuid-based log parsers to attoparsec This preserves the workaround for the old bug that caused NoUUID items to be stored in the log, prefixing log lines with " ". It's now handled implicitly, by using takeWhile1 (/= ' ') to get the uuid. There is a behavior change from the old parser, which split the value into words and then recombined it. That meant that "foo bar" and "foo\tbar" came out as "foo bar". That behavior was not documented, and seems surprising; it meant that after a git-annex describe here "foo bar", you wouldn't get that same string back out when git-annex displayed repo descriptions. Otoh, some other parsers relied on the old behavior, and the attoparsec rewrites had to deal with the issue themselves... For group.log, there are some edge cases around the user providing a group name with a leading or trailing space. The old parser would ignore such excess whitespace. The new parser does too, because the alternative is to refuse to parse something like " group1 group2 " due to excess whitespace, which would be even more confusing behavior. The only git-annex branch log file that is not converted to attoparsec and bytestring-builder now is transitions.log.	2019-01-10 16:34:20 -04:00
Joey Hess	232b1a08f3	simplification now that all logs use Builder	2019-01-09 14:10:05 -04:00
Joey Hess	bfc9039ead	convert git-annex branch access to ByteStrings and Builders Most of the individual logs are not converted yet, only presense logs have an efficient ByteString Builder implemented so far. The rest convert to and from String.	2019-01-03 13:21:48 -04:00
Joey Hess	7d51b0c109	import Utility.FileSystemEncoding in Common	2019-01-03 11:37:02 -04:00

1 2 3 4 5

206 commits