git-annex

Author	SHA1	Message	Date
Joey Hess	770aac97a7	share single BranchState amoung all threads This fixes a problem when git-annex testremote is run against a cluster accessed via the http server. Annex.Cluster uses the location log to find nodes that contain a key when checking if the key is present or getting it. Just after a key was stored to a cluster node, reading the location log was not getting the UUID of that node. Apparently the Annex action that wrote to the location log, and the one that read from it were run with two different Annex states. The http server does use several different Annex threads. BranchState was part of the AnnexState, and so two threads could have different BranchStates. Moved BranchState to the AnnexRead, so all threads will see the common state. This might possibly impact performance. If one thread is writing changes to the branch, and another thread is reading from the branch, the writing thread will now invalidate the BranchState's cache, which will cause the reading thread to need to do extra work. But correctness is surely more important. If did is found to have impacted performance, it could probably be dealt with by doing smarter BranchState cache invalidation. Another way this might impact performance is that the BranchState has a small cache. If several threads were reading from the branch and relying on the value they just read still being in the case, now a cache miss will be more likely. Increasing the BranchState cache to the number of jobs might be a good idea to amelorate that. But the cache is currently an innefficient list, so making it large would need changes to the data types. (Commit `4304f1b6ae` dealt with a follow-on effect of the bug fixed here.)	2024-07-28 12:30:27 -04:00
Joey Hess	adcebbae47	clean up git-remote-annex git-annex branch handling Implemented alternateJournal, which git-remote-annex uses to avoid any writes to the git-annex branch while setting up a special remote from an annex:: url. That prevents the remote.log from being overwritten with the special remote configuration from the url, which might not be 100% the same as the existing special remote configuration. And it prevents an overwrite deleting of other stuff that was already in the remote.log. Also, when the branch was created by git-remote-annex, only delete it at the end if nothing else has been written to it by another command. This fixes the race condition described in `797f27ab05`, where git-remote-annex set up the branch and git-annex init and other commands were run at the same time and their writes to the branch were lost.	2024-05-15 17:33:38 -04:00
Joey Hess	928b2a4839	create journal directory in withJournalHandle Fixes a crash by git-annex repair when .git/annex/journal/ does not exist. Normally the journal directory is created before withJournalHandle gets run, but git-annex repair can be run in a situation where it does not exist.	2023-06-21 15:23:59 -04:00
Joey Hess	0aa98aa09b	fix perms for core.sharedRepository These two missed setting it. It rarely matters that the journal gets the right perm. But, when using annex.alwayscommit=false, someone else may come along later and want to append to the journal file. It probably never matters what the sentinal perms are, but for completeness.. Sponsored-by: Luke Shumaker on Patreon	2023-04-26 16:29:11 -04:00
Yaroslav Halchenko	84b0a3707a	Apply codespell -w throughout	2023-03-17 15:14:58 -04:00
Joey Hess	e60766543f	add annex.dbdir (WIP) WIP: This is mostly complete, but there is a problem: createDirectoryUnder throws an error when annex.dbdir is set to outside the git repo. annex.dbdir is a workaround for filesystems where sqlite does not work, due to eg, the filesystem not properly supporting locking. It's intended to be set before initializing the repository. Changing it in an existing repository can be done, but would be the same as making a new repository and moving all the annexed objects into it. While the databases get recreated from the git-annex branch in that situation, any information that is in the databases but not stored in the branch gets lost. It may be that no information ever gets stored in the databases that cannot be reconstructed from the branch, but I have not verified that. Sponsored-by: Dartmouth College's Datalad project	2022-08-11 16:58:53 -04:00
Joey Hess	cbe12b9bc3	force fully strict read of journal file again I was thinking that discardIncompleteAppend would make it strict, since it looks at the end of the bytestring. But, it's applied lazily.. This probably fixes windows, which was failing: git-annex.exe: .git\annex\journal\trust.log: DeleteFile "\\\\?\\C:\\Users\\runneradmin\\.t\\5\\tmprepo22\\.git\\annex\\journal\\trust.log": permission denied (The process cannot access the file because it is being used by another process.)	2022-07-22 11:36:21 -04:00
Joey Hess	d275874e6c	handling of interrupted appends An append that is interrupted and writes part of a line is now dealt with by subsequent reads and appends. This also handles a read that happens at the same time as an append to the file. Old versions of git-annex will still see a partially written line, and could get confused. Since appends are currently done for url logs and location logs, the confusion is limited to a substring of the actual url or UUID of the remote being read. This will not affect writes, since the journal file is locked when reading in preparation for writing. However, the bad data can be output by git-annex and used by other things, or could cause surprising behavior by git-annex. Including eg, downloading the content of the wrong url. So, something needs to be done to prevent old versions of git-annex from running in a repository where this appending is being done.. Sponsored-by: Dartmouth College's DANDI project	2022-07-20 12:40:49 -04:00
Joey Hess	36f0bdcd57	add annex.alwayscompact Added annex.alwayscompact setting which can be unset to speed up writes to the git-annex branch in some cases. Sponsored-by: Dartmouth College's DANDI project	2022-07-18 16:39:19 -04:00
Joey Hess	ccff639651	Merge branch 'master' into append	2022-07-18 14:17:15 -04:00
Joey Hess	de18d92de6	efficient but unsafe journal file append This is only for checking performance, it's not safe. Sponsored-by: Dartmouth College's DANDI project	2022-07-18 14:17:12 -04:00
Joey Hess	1c40b927aa	minor optimisation Avoid re-writing the file when the journal directory did not exist.	2022-07-18 13:50:35 -04:00
Joey Hess	ce455223df	split out appending to journal from writing, high level only Currently this is not an improvement, but it allows for optimising appendJournalFile later. With an optimised appendJournalFile, this will greatly speed up access patterns like git-annex addurl of a lot of urls to the same key, where the log file can grow rather large. Appending rather than re-writing the journal file for each line can save a lot of disk writes. It still has to read the current journal or branch file, to check if it can append to it, and so when the journal file does not exist yet, it can write the old content from the branch to it. Probably the re-reads are better cached by the filesystem than repeated writes. (If the re-reads turn out to keep performance bad, they could be eliminated, at the cost of not being able to compact the log when replacing old information in it. That could be enabled by a switch.) While the immediate need is to affect addurl writes, it was implemented at the level of presence logs, so will also perhaps speed up location logs. The only added overhead is the call to isNewInfo, which only needs to compare ByteStrings. Helping to balance that out, it avoids compactLog when it's able to append. Sponsored-by: Dartmouth College's DANDI project	2022-07-18 13:22:50 -04:00
Joey Hess	ad467791c1	optimise journal writes to not mkdir journal directory when it already exists Sponsored-by: Dartmouth College's DANDI project	2022-07-14 12:29:39 -04:00
Joey Hess	debcf86029	use RawFilePath version of rename Some small wins, almost certianly swamped by the system calls, but still worthwhile progress on the RawFilePath conversion. Sponsored-by: Erik Bjäreholt on Patreon	2022-06-22 16:47:34 -04:00
Joey Hess	5a9e6b1fd4	when private journal file exists, still read from git-annex branch Fix bug that caused stale git-annex branch information to read when annex.private or remote.name.annex-private is set. The private journal file should not prevent reading more current information from the git-annex branch, but used to. Note that, overBranchFileContents has to do additional work now, when there's a private journal file, it reads from the branch redundantly and more slowly. Sponsored-by: Jack Hill on Patreon	2021-10-26 13:43:50 -04:00
Joey Hess	32138b8cd8	implement annex.privateremote and remote.name.private configs The slightly unusual parsing in Types.GitConfig avoids the need to look at the remote list to get configs of remotes. annexPrivateRepos combines all the configs, and will only be calculated once, so it's nice and fast. privateUUIDsKnown and regardingPrivateUUID now need to read from the annex mvar, so are not entirely free. But that overhead can be optimised away, as seen in getJournalFileStale. The other call sites didn't seem worth optimising to save a single MVar access. The feature should have impreceptable speed overhead when not being used.	2021-04-23 14:21:57 -04:00
Joey Hess	d0c5f6d2f0	optimisation Avoid trying to read private journal files when no private uuids are known.	2021-04-21 16:02:56 -04:00
Joey Hess	24eeacdba8	adapt recent bug fixes to support private journal At this point, private repos should mostly work, except for a few commands that directly read from the git-annex branch and will not see the private journal. Private index not yet implemented.	2021-04-21 16:01:13 -04:00
Joey Hess	05989556a2	start implementing hidden git-annex repositories This adds a separate journal, which does not currently get committed to an index, but is planned to be committed to .git/annex/index-private. Changes that are regarding a UUID that is private will get written to this journal, and so will not be published into the git-annex branch. All log writing should have been made to indicate the UUID it's regarding, though I've not verified this yet. Currently, no UUIDs are treated as private yet, a way to configure that is needed. The implementation is careful to not add any additional IO work when privateUUIDsKnown is False. It will skip looking at the private journal at all. So this should be free, or nearly so, unless the feature is used. When it is used, all branch reads will be about twice as expensive. It is very lucky -- or very prudent design -- that Annex.Branch.change and maybeChange are the only ways to change a file on the branch, and Annex.Branch.set is only internal use. That let Annex.Branch.get always yield any private information that has been recorded, without the risk that Annex.Branch.set might be called, with a non-private UUID, and end up leaking the private information into the git-annex branch. And, this relies on the way git-annex union merges the git-annex branch. When reading a file, there can be a public and a private version, and they are just concacenated together. That will be handled the same as if there were two diverged git-annex branches that got union merged.	2021-04-20 15:04:53 -04:00
Joey Hess	b2222e4639	optimisation Avoid unnecessary conversion to/from String.	2021-04-20 13:13:45 -04:00
Joey Hess	c30557594e	remove now redundant function	2021-04-20 12:42:57 -04:00
Joey Hess	f45ad178cb	more RawFilePath conversion At 318/645 after 4k lines of changes This commit was sponsored by Jake Vosloo on Patreon.	2020-10-29 12:03:50 -04:00
Joey Hess	3d38ec9585	fix fileJournal My ByteString rewrite oversimplified it, resulting in any _ in a journal file turning into a / in the git-annex branch, which was often the wrong filename, or sometimes (//) an invalid filename that git refused to add.	2019-12-18 11:29:34 -04:00
Joey Hess	cee0d738fc	match also / path separator on windows	2019-12-11 17:08:08 -04:00
Joey Hess	c19211774f	use filepath-bytestring for annex object manipulations git-annex find is now RawFilePath end to end, no string conversions. So is git-annex get when it does not need to get anything. So this is a major milestone on optimisation. Benchmarks indicate around 30% speedup in both commands. Probably many other performance improvements. All or nearly all places where a file is statted use RawFilePath now.	2019-12-11 15:25:07 -04:00
Joey Hess	067aabdd48	wip RawFilePath 2x git-annex find speedup Finally builds (oh the agoncy of making it build), but still very unmergable, only Command.Find is included and lots of stuff is badly hacked to make it compile. Benchmarking vs master, this git-annex find is significantly faster! Specifically: num files old new speedup 48500 4.77 3.73 28% 12500 1.36 1.02 66% 20 0.075 0.074 0% (so startup time is unchanged) That's without really finishing the optimization. Things still to do: * Eliminate all the fromRawFilePath, toRawFilePath, encodeBS, decodeBS conversions. * Use versions of IO actions like getFileStatus that take a RawFilePath. * Eliminate some Data.ByteString.Lazy.toStrict, which is a slow copy. * Use ByteString for parsing git config to speed up startup. It's likely several of those will speed up git-annex find further. And other commands will certianly benefit even more.	2019-11-26 16:01:58 -04:00
Joey Hess	40ecf58d4b	update licenses from GPL to AGPL This does not change the overall license of the git-annex program, which was already AGPL due to a number of sources files being AGPL already. Legally speaking, I'm adding a new license under which these files are now available; I already released their current contents under the GPL license. Now they're dual licensed GPL and AGPL. However, I intend for all my future changes to these files to only be released under the AGPL license, and I won't be tracking the dual licensing status, so I'm simply changing the license statement to say it's AGPL. (In some cases, others wrote parts of the code of a file and released it under the GPL; but in all cases I have contributed a significant portion of the code in each file and it's that code that is getting the AGPL license; the GPL license of other contributors allows combining with AGPL code.)	2019-03-13 15:48:14 -04:00
Joey Hess	d5f2463702	misctmp cleanup * Switch to using .git/annex/othertmp for tmp files other than partial downloads, and make stale files left in that directory when git-annex is interrupted be cleaned up promptly by subsequent git-annex processes. * The .git/annex/misctmp directory is no longer used and git-annex will delete anything lingering in there after it's 1 week old. Also, in Annex.Ingest, made the filename it uses in the tmp dir be prefixed with "ingest-" to avoid potentially using a filename used by some other code.	2019-01-17 16:02:22 -04:00
Joey Hess	bfc9039ead	convert git-annex branch access to ByteStrings and Builders Most of the individual logs are not converted yet, only presense logs have an efficient ByteString Builder implemented so far. The rest convert to and from String.	2019-01-03 13:21:48 -04:00
Joey Hess	d1961e4498	back out incorrect IO interleaving change Fix regression in last release that crashes when using --all or running git-annex in a bare repository. May have also affected git-annex unused and git-annex info. Reversed the order of the (++) in Annex.Branch.files so --all will stream lazily still when there are not a bunch of uncommitted journal files. Added a todo to maybe improve this later. This commit was sponsored by Trenton Cronholm on Patreon.	2018-05-08 13:54:42 -04:00
Joey Hess	bea0ad220a	avoid --all buffering list of all keys In Annex.Branch.branch, the (++) was killing laziness. Rewrote so it streams lazily. filterM also kills laziness, so made loggedKeys use a Unchecked type, and check if the key is dead in the seek loop. Note that loggedKeysFor still buffers, so git-annex info <remote> and git-annex unused --from remote still use more memory than necessary. Also removed some unused functions from Annex.Journal.	2018-04-26 16:00:20 -04:00
Joey Hess	25703e1413	finally really add back custom-setup stanza Fourth or fifth try at this and finally found a way to make it work. Absurd amount of busy-work forced on me by change in cabal's behavior. Split up Utility modules that need posix stuff out of ones used by Setup. Various other hacks around inability for Setup to use anything that ifdefs a use of unix. Probably lost a full day of my life to this. This is how build systems make their users hate them. Just saying.	2017-12-31 16:36:39 -04:00
Joey Hess	8484c0c197	Always use filesystem encoding for all file and handle reads and writes. This is a big scary change. I have convinced myself it should be safe. I hope!	2016-12-24 14:46:31 -04:00
Joey Hess	737e45156e	remove 163 lines of code without changing anything except imports	2016-01-20 16:36:33 -04:00
Joey Hess	afc5153157	update my email address and homepage url	2015-01-21 12:50:09 -04:00
Joey Hess	c784ef4586	unify exception handling into Utility.Exception Removed old extensible-exceptions, only needed for very old ghc. Made webdav use Utility.Exception, to work after some changes in DAV's exception handling. Removed Annex.Exception. Mostly this was trivial, but note that tryAnnex is replaced with tryNonAsync and catchAnnex replaced with catchNonAsync. In theory that could be a behavior change, since the former caught all exceptions, and the latter don't catch async exceptions. However, in practice, nothing in the Annex monad uses async exceptions. Grepping for throwTo and killThread only find stuff in the assistant, which does not seem related. Command.Add.undo is changed to accept a SomeException, and things that use it for rollback now catch non-async exceptions, rather than only IOExceptions.	2014-08-07 22:03:29 -04:00
Joey Hess	26ee27915a	refactor locking	2014-07-10 00:32:23 -04:00
Joey Hess	e5b88713a1	refactor	2014-07-10 00:16:53 -04:00
Joey Hess	d9d76cf98b	Fix minor FD leak in journal code. Minor because normally only 1 FD is leaked per git-annex run. However, the test suite leaks a few hundred FDs, and this broke it on the Debian autobuilders, which seem to have a tigher than usual ulimit. The leak was introduced by the lazy getDirectoryContents' that was introduced in `e6330988dd` in order to scale to millions of journal files -- if the lazy list was never fully consumed, the directory handle did not get closed. Instead, pull in openDirectory/readDirectory/closeDirectory code that I already developed and submitted in a patch to the haskell directory library earlier. Using this in journalDirty avoids the place that the lazy list caused a problem. And using it in stageJournal eliminates the need for getDirectoryContents'. The getJournalFiles* functions are switched back to using the regular strict getDirectoryContents. I'm not sure if those always consume the whole list, so this avoids any leak. And the things that call those are things like git annex unused, which also look at every file committed to the git-annex branch, so would need more work to scale to insane numbers of files anyway.	2014-07-09 23:36:53 -04:00
Joey Hess	c90e4e8778	work around getDirectoryContents not streaming lazily	2014-07-04 17:59:26 -04:00
Joey Hess	1ab3d7c810	Windows: Fix bug introduced in last release that caused files in the git-annex branch to have lines teminated with \r.	2014-06-05 14:57:01 -04:00
Joey Hess	95ca3bb022	Fix encoding of data written to git-annex branch. Avoid truncating unicode characters to 8 bits. Allow any encoding to be used, as with filenames (but utf8 is the sane choice). Affects metadata and repository descriptions, and preferred content expressions. The question of what's the right encoding for the git-annex branch is a vexing one. utf-8 would be a nice choice, but this leaves the possibility of bad data getting into a git-annex branch somehow, and this resulting in git-annex crashing with encoding errors, which is a failure mode I want to avoid. (Also, preferred content expressions can refer to filenames, and filenames can have any encoding, so limiting to utf-8 would not be ideal.) The union merge code already took care to not assume any encoding for a file. Except it assumes that any \n is a literal newline, and not part of some encoding of a character that happens to contain a newline. (At least utf-8 avoids using newline for anything except liternal newlines.) Adapted the git-annex branch code to use this same approach. Note that there is a potential interop problem with Windows, since FileSystemEncoding doesn't work there, and instead things are always decoded as utf-8. If someone uses non-utf8 encoding for data on the git-annex branch, this can lead to an encoding error on windows. However, this commit doesn't actually make that any worse, because the union merge code would similarly fail with an encoding error on windows in that situation. This commit was sponsored by Kyle Meyer.	2014-05-27 14:16:33 -04:00
Joey Hess	a1432bce2f	Put non-object tmp files in .git/annex/misctmp, leaving .git/annex/tmp for only partially transferred objects. This allows eg, putting .git/annex/tmp on a ram disk, if the disk IO of temp object files is too annoying (and if you don't want to keep partially transferred objects across reboots). .git/annex/misctmp must be on the same filesystem as the git work tree, since files are moved to there in a way that will not work cross-device, as well as symlinked into there. I first wanted to put the tmp objects in .git/annex/objects/tmp, but that would pose transition problems on upgrade when partially transferred objects existed. git annex info does not currently show the size of .git/annex/misctemp, since it should stay small. It would also be ok to make something clean it out, periodically.	2014-02-26 16:52:56 -04:00
Joey Hess	891c85cd88	use locking on Windows This is all the easy cases, where there was already a separate lock file.	2014-01-28 14:42:03 -04:00
Joey Hess	56c3f68a53	use types to partially prove correctness of journal locking code My implementation does not guard against double locking of the journal. But it does ensure that the journal is always locked when operated on, by using a type that is only produced by lockJournal, and which is required as a parameter of all functions that operate on the journal. Note that I had to add the fooStale functions for cases where it does not make sense to lock the journal when querying it. I was more concerned about ensuring that anything that modifies the journal is locked. setJournalFile's implementation ensures that any query of the journal will get one value or the other atomically, even if the journal is being changed at the time.	2013-10-03 14:41:57 -04:00
Joey Hess	a3224ce35b	avoid more build warnings on Windows	2013-08-04 14:05:36 -04:00
Joey Hess	93f2371e09	get rid of __WINDOWS__, use mingw32_HOST_OS The latter is harder for me to remember, but avoids build failures in code used by the configure program.	2013-08-02 12:27:32 -04:00
Joey Hess	bf86b5ca16	improve robustness of fromDirect and replaceFile Made fromDirect check that a file in the tree has good content (and is not a broken symlink either) before copying it to another file that has the same key. Made replaceFile clean up the temp file if the action that creates it, or the file replacement action fails.	2013-05-25 15:06:02 -04:00
Joey Hess	03e8594369	fix the day's windows permissions damage	2013-05-12 19:09:48 -04:00

1 2

60 commits