git-annex

Author	SHA1	Message	Date
Joey Hess	de18d92de6	efficient but unsafe journal file append This is only for checking performance, it's not safe. Sponsored-by: Dartmouth College's DANDI project	2022-07-18 14:17:12 -04:00
Joey Hess	ce455223df	split out appending to journal from writing, high level only Currently this is not an improvement, but it allows for optimising appendJournalFile later. With an optimised appendJournalFile, this will greatly speed up access patterns like git-annex addurl of a lot of urls to the same key, where the log file can grow rather large. Appending rather than re-writing the journal file for each line can save a lot of disk writes. It still has to read the current journal or branch file, to check if it can append to it, and so when the journal file does not exist yet, it can write the old content from the branch to it. Probably the re-reads are better cached by the filesystem than repeated writes. (If the re-reads turn out to keep performance bad, they could be eliminated, at the cost of not being able to compact the log when replacing old information in it. That could be enabled by a switch.) While the immediate need is to affect addurl writes, it was implemented at the level of presence logs, so will also perhaps speed up location logs. The only added overhead is the call to isNewInfo, which only needs to compare ByteStrings. Helping to balance that out, it avoids compactLog when it's able to append. Sponsored-by: Dartmouth College's DANDI project	2022-07-18 13:22:50 -04:00
Joey Hess	ad467791c1	optimise journal writes to not mkdir journal directory when it already exists Sponsored-by: Dartmouth College's DANDI project	2022-07-14 12:29:39 -04:00
Joey Hess	debcf86029	use RawFilePath version of rename Some small wins, almost certianly swamped by the system calls, but still worthwhile progress on the RawFilePath conversion. Sponsored-by: Erik Bjäreholt on Patreon	2022-06-22 16:47:34 -04:00
Joey Hess	5a9e6b1fd4	when private journal file exists, still read from git-annex branch Fix bug that caused stale git-annex branch information to read when annex.private or remote.name.annex-private is set. The private journal file should not prevent reading more current information from the git-annex branch, but used to. Note that, overBranchFileContents has to do additional work now, when there's a private journal file, it reads from the branch redundantly and more slowly. Sponsored-by: Jack Hill on Patreon	2021-10-26 13:43:50 -04:00
Joey Hess	32138b8cd8	implement annex.privateremote and remote.name.private configs The slightly unusual parsing in Types.GitConfig avoids the need to look at the remote list to get configs of remotes. annexPrivateRepos combines all the configs, and will only be calculated once, so it's nice and fast. privateUUIDsKnown and regardingPrivateUUID now need to read from the annex mvar, so are not entirely free. But that overhead can be optimised away, as seen in getJournalFileStale. The other call sites didn't seem worth optimising to save a single MVar access. The feature should have impreceptable speed overhead when not being used.	2021-04-23 14:21:57 -04:00
Joey Hess	d0c5f6d2f0	optimisation Avoid trying to read private journal files when no private uuids are known.	2021-04-21 16:02:56 -04:00
Joey Hess	24eeacdba8	adapt recent bug fixes to support private journal At this point, private repos should mostly work, except for a few commands that directly read from the git-annex branch and will not see the private journal. Private index not yet implemented.	2021-04-21 16:01:13 -04:00
Joey Hess	05989556a2	start implementing hidden git-annex repositories This adds a separate journal, which does not currently get committed to an index, but is planned to be committed to .git/annex/index-private. Changes that are regarding a UUID that is private will get written to this journal, and so will not be published into the git-annex branch. All log writing should have been made to indicate the UUID it's regarding, though I've not verified this yet. Currently, no UUIDs are treated as private yet, a way to configure that is needed. The implementation is careful to not add any additional IO work when privateUUIDsKnown is False. It will skip looking at the private journal at all. So this should be free, or nearly so, unless the feature is used. When it is used, all branch reads will be about twice as expensive. It is very lucky -- or very prudent design -- that Annex.Branch.change and maybeChange are the only ways to change a file on the branch, and Annex.Branch.set is only internal use. That let Annex.Branch.get always yield any private information that has been recorded, without the risk that Annex.Branch.set might be called, with a non-private UUID, and end up leaking the private information into the git-annex branch. And, this relies on the way git-annex union merges the git-annex branch. When reading a file, there can be a public and a private version, and they are just concacenated together. That will be handled the same as if there were two diverged git-annex branches that got union merged.	2021-04-20 15:04:53 -04:00
Joey Hess	b2222e4639	optimisation Avoid unnecessary conversion to/from String.	2021-04-20 13:13:45 -04:00
Joey Hess	c30557594e	remove now redundant function	2021-04-20 12:42:57 -04:00
Joey Hess	f45ad178cb	more RawFilePath conversion At 318/645 after 4k lines of changes This commit was sponsored by Jake Vosloo on Patreon.	2020-10-29 12:03:50 -04:00
Joey Hess	3d38ec9585	fix fileJournal My ByteString rewrite oversimplified it, resulting in any _ in a journal file turning into a / in the git-annex branch, which was often the wrong filename, or sometimes (//) an invalid filename that git refused to add.	2019-12-18 11:29:34 -04:00
Joey Hess	cee0d738fc	match also / path separator on windows	2019-12-11 17:08:08 -04:00
Joey Hess	c19211774f	use filepath-bytestring for annex object manipulations git-annex find is now RawFilePath end to end, no string conversions. So is git-annex get when it does not need to get anything. So this is a major milestone on optimisation. Benchmarks indicate around 30% speedup in both commands. Probably many other performance improvements. All or nearly all places where a file is statted use RawFilePath now.	2019-12-11 15:25:07 -04:00
Joey Hess	067aabdd48	wip RawFilePath 2x git-annex find speedup Finally builds (oh the agoncy of making it build), but still very unmergable, only Command.Find is included and lots of stuff is badly hacked to make it compile. Benchmarking vs master, this git-annex find is significantly faster! Specifically: num files old new speedup 48500 4.77 3.73 28% 12500 1.36 1.02 66% 20 0.075 0.074 0% (so startup time is unchanged) That's without really finishing the optimization. Things still to do: * Eliminate all the fromRawFilePath, toRawFilePath, encodeBS, decodeBS conversions. * Use versions of IO actions like getFileStatus that take a RawFilePath. * Eliminate some Data.ByteString.Lazy.toStrict, which is a slow copy. * Use ByteString for parsing git config to speed up startup. It's likely several of those will speed up git-annex find further. And other commands will certianly benefit even more.	2019-11-26 16:01:58 -04:00
Joey Hess	40ecf58d4b	update licenses from GPL to AGPL This does not change the overall license of the git-annex program, which was already AGPL due to a number of sources files being AGPL already. Legally speaking, I'm adding a new license under which these files are now available; I already released their current contents under the GPL license. Now they're dual licensed GPL and AGPL. However, I intend for all my future changes to these files to only be released under the AGPL license, and I won't be tracking the dual licensing status, so I'm simply changing the license statement to say it's AGPL. (In some cases, others wrote parts of the code of a file and released it under the GPL; but in all cases I have contributed a significant portion of the code in each file and it's that code that is getting the AGPL license; the GPL license of other contributors allows combining with AGPL code.)	2019-03-13 15:48:14 -04:00
Joey Hess	d5f2463702	misctmp cleanup * Switch to using .git/annex/othertmp for tmp files other than partial downloads, and make stale files left in that directory when git-annex is interrupted be cleaned up promptly by subsequent git-annex processes. * The .git/annex/misctmp directory is no longer used and git-annex will delete anything lingering in there after it's 1 week old. Also, in Annex.Ingest, made the filename it uses in the tmp dir be prefixed with "ingest-" to avoid potentially using a filename used by some other code.	2019-01-17 16:02:22 -04:00
Joey Hess	bfc9039ead	convert git-annex branch access to ByteStrings and Builders Most of the individual logs are not converted yet, only presense logs have an efficient ByteString Builder implemented so far. The rest convert to and from String.	2019-01-03 13:21:48 -04:00
Joey Hess	d1961e4498	back out incorrect IO interleaving change Fix regression in last release that crashes when using --all or running git-annex in a bare repository. May have also affected git-annex unused and git-annex info. Reversed the order of the (++) in Annex.Branch.files so --all will stream lazily still when there are not a bunch of uncommitted journal files. Added a todo to maybe improve this later. This commit was sponsored by Trenton Cronholm on Patreon.	2018-05-08 13:54:42 -04:00
Joey Hess	bea0ad220a	avoid --all buffering list of all keys In Annex.Branch.branch, the (++) was killing laziness. Rewrote so it streams lazily. filterM also kills laziness, so made loggedKeys use a Unchecked type, and check if the key is dead in the seek loop. Note that loggedKeysFor still buffers, so git-annex info <remote> and git-annex unused --from remote still use more memory than necessary. Also removed some unused functions from Annex.Journal.	2018-04-26 16:00:20 -04:00
Joey Hess	25703e1413	finally really add back custom-setup stanza Fourth or fifth try at this and finally found a way to make it work. Absurd amount of busy-work forced on me by change in cabal's behavior. Split up Utility modules that need posix stuff out of ones used by Setup. Various other hacks around inability for Setup to use anything that ifdefs a use of unix. Probably lost a full day of my life to this. This is how build systems make their users hate them. Just saying.	2017-12-31 16:36:39 -04:00
Joey Hess	8484c0c197	Always use filesystem encoding for all file and handle reads and writes. This is a big scary change. I have convinced myself it should be safe. I hope!	2016-12-24 14:46:31 -04:00
Joey Hess	737e45156e	remove 163 lines of code without changing anything except imports	2016-01-20 16:36:33 -04:00
Joey Hess	afc5153157	update my email address and homepage url	2015-01-21 12:50:09 -04:00
Joey Hess	c784ef4586	unify exception handling into Utility.Exception Removed old extensible-exceptions, only needed for very old ghc. Made webdav use Utility.Exception, to work after some changes in DAV's exception handling. Removed Annex.Exception. Mostly this was trivial, but note that tryAnnex is replaced with tryNonAsync and catchAnnex replaced with catchNonAsync. In theory that could be a behavior change, since the former caught all exceptions, and the latter don't catch async exceptions. However, in practice, nothing in the Annex monad uses async exceptions. Grepping for throwTo and killThread only find stuff in the assistant, which does not seem related. Command.Add.undo is changed to accept a SomeException, and things that use it for rollback now catch non-async exceptions, rather than only IOExceptions.	2014-08-07 22:03:29 -04:00
Joey Hess	26ee27915a	refactor locking	2014-07-10 00:32:23 -04:00
Joey Hess	e5b88713a1	refactor	2014-07-10 00:16:53 -04:00
Joey Hess	d9d76cf98b	Fix minor FD leak in journal code. Minor because normally only 1 FD is leaked per git-annex run. However, the test suite leaks a few hundred FDs, and this broke it on the Debian autobuilders, which seem to have a tigher than usual ulimit. The leak was introduced by the lazy getDirectoryContents' that was introduced in `e6330988dd` in order to scale to millions of journal files -- if the lazy list was never fully consumed, the directory handle did not get closed. Instead, pull in openDirectory/readDirectory/closeDirectory code that I already developed and submitted in a patch to the haskell directory library earlier. Using this in journalDirty avoids the place that the lazy list caused a problem. And using it in stageJournal eliminates the need for getDirectoryContents'. The getJournalFiles* functions are switched back to using the regular strict getDirectoryContents. I'm not sure if those always consume the whole list, so this avoids any leak. And the things that call those are things like git annex unused, which also look at every file committed to the git-annex branch, so would need more work to scale to insane numbers of files anyway.	2014-07-09 23:36:53 -04:00
Joey Hess	c90e4e8778	work around getDirectoryContents not streaming lazily	2014-07-04 17:59:26 -04:00
Joey Hess	1ab3d7c810	Windows: Fix bug introduced in last release that caused files in the git-annex branch to have lines teminated with \r.	2014-06-05 14:57:01 -04:00
Joey Hess	95ca3bb022	Fix encoding of data written to git-annex branch. Avoid truncating unicode characters to 8 bits. Allow any encoding to be used, as with filenames (but utf8 is the sane choice). Affects metadata and repository descriptions, and preferred content expressions. The question of what's the right encoding for the git-annex branch is a vexing one. utf-8 would be a nice choice, but this leaves the possibility of bad data getting into a git-annex branch somehow, and this resulting in git-annex crashing with encoding errors, which is a failure mode I want to avoid. (Also, preferred content expressions can refer to filenames, and filenames can have any encoding, so limiting to utf-8 would not be ideal.) The union merge code already took care to not assume any encoding for a file. Except it assumes that any \n is a literal newline, and not part of some encoding of a character that happens to contain a newline. (At least utf-8 avoids using newline for anything except liternal newlines.) Adapted the git-annex branch code to use this same approach. Note that there is a potential interop problem with Windows, since FileSystemEncoding doesn't work there, and instead things are always decoded as utf-8. If someone uses non-utf8 encoding for data on the git-annex branch, this can lead to an encoding error on windows. However, this commit doesn't actually make that any worse, because the union merge code would similarly fail with an encoding error on windows in that situation. This commit was sponsored by Kyle Meyer.	2014-05-27 14:16:33 -04:00
Joey Hess	a1432bce2f	Put non-object tmp files in .git/annex/misctmp, leaving .git/annex/tmp for only partially transferred objects. This allows eg, putting .git/annex/tmp on a ram disk, if the disk IO of temp object files is too annoying (and if you don't want to keep partially transferred objects across reboots). .git/annex/misctmp must be on the same filesystem as the git work tree, since files are moved to there in a way that will not work cross-device, as well as symlinked into there. I first wanted to put the tmp objects in .git/annex/objects/tmp, but that would pose transition problems on upgrade when partially transferred objects existed. git annex info does not currently show the size of .git/annex/misctemp, since it should stay small. It would also be ok to make something clean it out, periodically.	2014-02-26 16:52:56 -04:00
Joey Hess	891c85cd88	use locking on Windows This is all the easy cases, where there was already a separate lock file.	2014-01-28 14:42:03 -04:00
Joey Hess	56c3f68a53	use types to partially prove correctness of journal locking code My implementation does not guard against double locking of the journal. But it does ensure that the journal is always locked when operated on, by using a type that is only produced by lockJournal, and which is required as a parameter of all functions that operate on the journal. Note that I had to add the fooStale functions for cases where it does not make sense to lock the journal when querying it. I was more concerned about ensuring that anything that modifies the journal is locked. setJournalFile's implementation ensures that any query of the journal will get one value or the other atomically, even if the journal is being changed at the time.	2013-10-03 14:41:57 -04:00
Joey Hess	a3224ce35b	avoid more build warnings on Windows	2013-08-04 14:05:36 -04:00
Joey Hess	93f2371e09	get rid of __WINDOWS__, use mingw32_HOST_OS The latter is harder for me to remember, but avoids build failures in code used by the configure program.	2013-08-02 12:27:32 -04:00
Joey Hess	bf86b5ca16	improve robustness of fromDirect and replaceFile Made fromDirect check that a file in the tree has good content (and is not a broken symlink either) before copying it to another file that has the same key. Made replaceFile clean up the temp file if the action that creates it, or the file replacement action fails.	2013-05-25 15:06:02 -04:00
Joey Hess	03e8594369	fix the day's windows permissions damage	2013-05-12 19:09:48 -04:00
Joey Hess	2f3ce4c02f	fix	2013-05-12 15:43:59 -05:00
Joey Hess	838b984797	deal with dos path separators	2013-05-12 15:37:32 -05:00
Joey Hess	abe8d549df	fix permission damage (thanks, Windows)	2013-05-11 23:54:25 -04:00
Joey Hess	3c7e30a295	git-annex now builds on Windows (doesn't work)	2013-05-11 15:03:00 -05:00
Joey Hess	f87a781aa6	finished where indentation changes	2012-12-13 00:24:19 -04:00
Joey Hess	3417c55189	remove git-annex branch read cache This cache prevented noticing changes made by another process. The case I just ran into involved the assistant dropping a file, which cached its presence info. Then the same file was downloaded again, but the assistant didn't know its presence info had changed. I don't see a way to keep this cache. Will instead rely on the OS level file cache, for files in the journal. May need to add more higher-level caching of info that it's ok to have a potentially stale copy of, although much of git-annex already does so.	2012-10-19 14:25:15 -04:00
Joey Hess	e8188ea611	flip catchDefaultIO	2012-09-17 00:18:07 -04:00
Joey Hess	b98b69e8c6	honor core.sharedRepository when making all the other files in the annex Lock files, directories, etc.	2012-04-21 19:36:03 -04:00
Joey Hess	146c36ca54	IO exception rework ghc 7.4 comaplains about use of System.IO.Error to catch exceptions. Ok, use Control.Exception, with variants specialized to only catch IO exceptions.	2012-02-03 16:47:24 -04:00
Joey Hess	da95cbadca	split out Annex/Journal.hs	2011-12-12 18:03:28 -04:00

49 commits