git-annex

Author	SHA1	Message	Date
Joey Hess	6a3bd283b8	add restage log When pointer files need to be restaged, they're first written to the log, and then when the restage operation runs, it reads the log. This way, if the git-annex process is interrupted before it can do the restaging, a later git-annex process can do it. Currently, this lets a git-annex get/drop command be interrupted and then re-run, and as long as it gets/drops additional files, it will clean up after the interrupted command. But more changes are needed to make it easier to restage after an interrupted process. Kept using the git queue to run the restage action, even though the list of files that it builds up for that action is not actually used by the action. This could perhaps be simplified to make restaging a cleanup action that gets registered, rather than using the git queue for it. But I wasn't sure if that would cause visible behavior changes, when eg dropping a large number of files, currently the git queue flushes periodically, and so it restages incrementally, rather than all at the end. In restagePointerFiles, it reads the restage log twice, once to get the number of files and size, and a second time to process it. This seemed better than reading the whole file into memory, since potentially a huge number of files could be in there. Probably the OS will cache the file in memory and there will not be much performance impact. It might be better to keep running tallies in another file though. But updating that atomically with the log seems hard. Also note that it's possible for calcRestageLog to see a different file than streamRestageLog does. More files may be added to the log in between. That is ok, it will only cause the filterprocessfaster heuristic to operate with slightly out of date information, so it may make the wrong choice for the files that got added and be a little slower than ideal. Sponsored-by: Dartmouth College's DANDI project	2022-09-23 15:47:24 -04:00
Joey Hess	e60766543f	add annex.dbdir (WIP) WIP: This is mostly complete, but there is a problem: createDirectoryUnder throws an error when annex.dbdir is set to outside the git repo. annex.dbdir is a workaround for filesystems where sqlite does not work, due to eg, the filesystem not properly supporting locking. It's intended to be set before initializing the repository. Changing it in an existing repository can be done, but would be the same as making a new repository and moving all the annexed objects into it. While the databases get recreated from the git-annex branch in that situation, any information that is in the databases but not stored in the branch gets lost. It may be that no information ever gets stored in the databases that cannot be reconstructed from the branch, but I have not verified that. Sponsored-by: Dartmouth College's Datalad project	2022-08-11 16:58:53 -04:00
Joey Hess	faf84aa5c2	Avoid git status taking a long time after git-annex unlock of many files. Implemented by making Git.Queue have a FlushAction, which can accumulate along with another action on files, and runs only once the other action has run. This lets git-annex unlock queue up git update-index actions, without conflicting with the restagePointerFiles FlushActions. In a repository with filter-process enabled, git-annex unlock will often not take any more time than before, though it may when the files are large. Either way, it should always slow down less than git status speeds up. When filter-process is not enabled, git-annex unlock will slow down as much as git status speeds up. Sponsored-by: Jochen Bartl on Patreon	2022-02-18 15:06:40 -04:00
Joey Hess	c2e46f4707	improve git command queue flushing with time limit So that eg, addurl of several large files that take time to download will update the index for each file, rather than deferring the index updates to the end. In cases like an add of many smallish files, where a new file is being added every few seconds, the queue will still build up a lot of changes which are flushed at once, for best performance. Since the default queue size is 10240, often it only gets flushed once at the end, same as before. (Notice that updateQueue updated _lastchanged when adding a new item to the queue without flushing it; that is necessary to avoid it flushing the queue every 5 minutes in this case.) But, when it takes more than 5 minutes to add a file, the overhead of updating the index immediately is probably small, so do it after each file. This avoids git-annex potentially taking a very very long time indeed to stage newly added files, which can be annoying to the user who would like to get on with doing something with the files it's already added, eg using git mv to rename them to a better name. This is only likely to cause a problem if it takes say, 30 seconds to update the index; doing an extra 30 seconds of work after every 5 minute file add would be less optimal. Normally, updating the index takes significantly less time than that. On an SSD with 100k files it takes less than 1 second, and the index write time is bound by disk read and write so is not too much worse on a hard drive. So I hope this will not impact users, although if it does turn out to, the time limit could be made configurable. A perhaps better way to do it would be to have a background worker thread that wakes up every 60 seconds or so and flushes the queue. That is made somewhat difficult because the queue can contain Annex actions and so this would add a new source of concurrency issues. So I'm trying to avoid that approach if possible. Sponsored-by: Erik Bjäreholt on Patreon	2021-12-14 12:23:19 -04:00
Joey Hess	a0758bdd10	dynamically disable filter-process in restagePointerFile when it would be slower Based on my earlier benchmark, I have a rough cost model for how expensive it is for git-annex smudge to be run on a file, vs how expensive it is for a gigabyte of a file's content to be read and piped through to filter-process. So, using that cost model, it can decide if using filter-process will be more or less expensive than running the smudge filter on the files to be restaged. It turned out to be really annoying to temporarily disable filter-process. I did find a way, but urk, this is horrible. Notice that, if it's interrupted with it disabled, it will remain disabled until the next time restagePointerFile runs. Which could be some time later. If the user runs `git add` or `git checkout` on a lot of small files before that, they will see slower than expected performance. (This commit also deletes where I wrote down the benchmark results earlier.) Sponsored-by: Noam Kremen on Patreon	2021-11-08 16:20:34 -04:00
Joey Hess	1c5fc8f047	Git.Queue: allow providing git common options like -c	2021-01-04 12:51:55 -04:00
Joey Hess	681b44236a	more RawFilePath conversion at 377/645 This commit was sponsored by Svenne Krap on Patreon.	2020-10-29 14:20:57 -04:00
Joey Hess	99536e3a0b	remove one more warningIO Had to generalize Git.Queue so it can run an Annex action, yipes. Only remaining warningIO are in the legacy chunk code.	2019-11-12 10:45:52 -04:00
Joey Hess	b03e65d260	Improved locking when multiple git-annex processes are writing to the .git/index file	2019-05-06 15:15:12 -04:00
Joey Hess	40ecf58d4b	update licenses from GPL to AGPL This does not change the overall license of the git-annex program, which was already AGPL due to a number of source files being AGPL already. Legally speaking, I'm adding a new license under which these files are now available; I already released their current contents under the GPL license. Now they're dual licensed GPL and AGPL. However, I intend for all my future changes to these files to only be released under the AGPL license, and I won't be tracking the dual licensing status, so I'm simply changing the license statement to say it's AGPL. (In some cases, others wrote parts of the code of a file and released it under the GPL; but in all cases I have contributed a significant portion of the code in each file and it's that code that is getting the AGPL license; the GPL license of other contributors allows combining with AGPL code.)	2019-03-13 15:48:14 -04:00
Joey Hess	759a87ad70	fix git command queue to be concurrency safe Probably not noticed until now because the queue is large enough that two threads each filling theirs at the same time and flushing is unlikely to happen. Also made explicit that each worker thread gets its own queue. I think that was the case before, but if something was put in the queue before worker threads were forked off, they could have each inherited the same queue. Could have gone with a single shared queue, but per-worker queues is more efficient, because a worker can add lots of stuff to its own queue without any locking. This commit was sponsored by Ole-Morten Duesund on Patreon.	2018-08-28 13:16:33 -04:00
Joey Hess	54d49eeac8	avoid update-index race This commit was supported by the NSF-funded DataLad project.	2018-08-17 16:03:40 -04:00
Joey Hess	82c5dd8a01	queueing of internal IO actions on files This would be better if getInternalFiles were more polymorphic, but I can't see a good way to accomplish that without messing with Data.Typeable, which seemed like overkill. Reverted CommandAction back to the simpler version. This commit was sponsored by Eric Drechsel on Patreon.	2018-08-17 13:28:21 -04:00
Joey Hess	82a239675f	narrow the race where a file gets modified before update-index Check just before running update-index if the worktree file's content is still the same, don't update it when it's been modified. This narrows the race window a lot, from possibly minutes or hours, to seconds or less. (Use replaceFile so that the worktree update happens atomically, allowing the InodeCache of the new worktree file to itself be gathered w/o any other race.) This doesn't eliminate the race; it can still occur in the window before update-index runs. When annex.queue is large, a lot of files will be statted by the checks, and so the window may still be large enough to be a problem. When only a few files are being processed, the window is as small as it is in the race where a modification gets overwritten by git-annex when it updates the worktree. Or maybe as small as whatever race git checkout/pull/merge may have when the worktree gets modified during it. Still, I've kept a todo about this race. This commit was supported by the NSF-funded DataLad project.	2018-08-16 15:56:43 -04:00
Joey Hess	8148ee3d4b	withAltRepo needs a separate queue of changes The queue could potentially contain changes from before withAltRepo, and get flushed inside the call, which would apply the changes to the modified repo. Or, changes could be queued in withAltRepo that were intended to affect the modified repo, but don't get flushed until later. I don't know of any cases where either happens, but better safe than sorry. Note that this affects withIndexFile, which is used in git-annex branch updates. So, it potentially makes things slower. Should not be by much; the overhead consists only of querying the current queue a couple of times, and potentially flushing changes queued within withAltRepo earlier, that could have maybe been bundled with other later changes. Notice in particular that the existing queue is not flushed when calling withAltRepo. So eg when git annex add needs to stage files in the index, it will still bundle them together efficiently.	2016-06-03 13:57:00 -04:00
Joey Hess	737e45156e	remove 163 lines of code without changing anything except imports	2016-01-20 16:36:33 -04:00
Joey Hess	b52cf5697b	immediate queue flushing when annex.queuesize=1 Previously, it only flushed when the queue got larger than 1. Also, make the queue auto-flush when items are added, rather than needing to be flushed as a separate step. This simplifies the code and makes it more efficient too, as it avoids needing to read the queue out of the state to check if it should be flushed.	2016-01-13 14:55:01 -04:00
Joey Hess	31472161e4	merge git command queue when joining with concurrent thread	2015-11-05 18:21:48 -04:00
Joey Hess	afc5153157	update my email address and homepage url	2015-01-21 12:50:09 -04:00
Joey Hess	4008590c68	type based git config handling for remotes Still a couple of places that use git config ad-hoc, but this is most of it done.	2013-01-01 13:58:14 -04:00
Joey Hess	7f7c31df1c	type based git config handling Now there's a Config type, that's extracted from the git config at startup. Note that laziness means that individual config values are only looked up and parsed on demand, and so we get implicit memoization for all of them. So this is not only prettier and more type safe, it optimises several places that didn't have explicit memoization before. As well as getting rid of the ugly explicit memoization code. Not yet done for annex.<remote>.* configuration settings.	2012-12-29 23:10:18 -04:00
Joey Hess	f87a781aa6	finished where indentation changes	2012-12-13 00:24:19 -04:00
Joey Hess	e0095b0bdc	fishy commit	2012-06-14 00:01:48 -04:00
Joey Hess	c5707c84d3	queue size fix Increase queue size for update-index actions, because otherwise they'll never be flushed.	2012-06-10 13:56:04 -04:00
Joey Hess	20f425be19	make watch use the queue May not work. Certainly needs to flush the queue from time to time when only symlink changes are being made.	2012-06-07 15:40:44 -04:00
Joey Hess	0a11b35d89	extend Git.Queue to be able to queue more than simple git commands While I was in there, I noticed and fixed a bug in the queue size calculations. It was never encountered only because Queue.add was only ever run with 1 file in the list.	2012-06-07 15:19:44 -04:00
Joey Hess	7a6fb8ae4e	flush the git queue when a new type of action is being added to it This allows the queue to be used in a single process for multiple possibly conflicting commands, like add and rm, without running them out of order. This assumes that running the same git subcommand with different parameters cannot itself conflict.	2012-06-04 20:41:22 -04:00
Joey Hess	f7d8982672	Fix use of several config settings annex.ssh-options, annex.rsync-options, annex.bup-split-options. And adjust types to avoid the bugs that broke several config settings recently. Now "annex." prefixing is enforced at the type level.	2012-05-05 20:16:56 -04:00
Joey Hess	76102c1c75	display "Recording state in git..." when staging the journal A bit tricky to avoid printing it twice in a row when there are queued git commands to run and journal to stage. Added a generic way to run an action that may output multiple side messages, with only the first displayed.	2012-04-27 13:54:33 -04:00
Joey Hess	f1398b5583	use new getConfig	2012-03-22 17:32:47 -04:00
Joey Hess	52c5b164d8	Added an annex.queuesize setting useful when adding hundreds of thousands of files on a system with plenty of memory. git add gets quite slow in such a large repository, so if the system has more than the ~32 MB of memory the queue can use by default, it's a useful optimisation to increase the queue size, in order to decrease the number of times git add is run.	2012-02-15 11:14:19 -04:00
Joey Hess	bf460a0a98	reorder repo parameters last Many functions took the repo as their first parameter. Changing it consistently to be the last parameter allows doing some useful things with currying, that reduce boilerplate. In particular, g <- gitRepo is almost never needed now, instead use inRepo to run an IO action in the repo, and fromRepo to get a value from the repo. This also provides more opportunities to use monadic and applicative combinators.	2011-11-08 16:27:20 -04:00
Joey Hess	6a6ea06cee	rename	2011-10-05 16:02:51 -04:00
Joey Hess	cfe21e85e7	rename	2011-10-04 00:59:08 -04:00

34 commits