git-annex

Author	SHA1	Message	Date
Joey Hess	e62e4eaaf2	refactor for legibility	2022-09-23 18:53:06 -04:00
Joey Hess	2478e9e03a	restage: New git-annex command, handles restaging unlocked files This is much easier and less failure-prone than having the user run git update-index --refresh themselves. Sponsored-by: Dartmouth College's DANDI project	2022-09-23 16:29:59 -04:00
Joey Hess	f7146c153b	fix restaging of transferred files after stalldetection kicks in Sponsored-by: Dartmouth College's DANDI project	2022-09-23 15:55:40 -04:00
Joey Hess	6a3bd283b8	add restage log When pointer files need to be restaged, they're first written to the log, and then when the restage operation runs, it reads the log. This way, if the git-annex process is interrupted before it can do the restaging, a later git-annex process can do it. Currently, this lets a git-annex get/drop command be interrupted and then re-ran, and as long as it gets/drops additional files, it will clean up after the interrupted command. But more changes are needed to make it easier to restage after an interrupted process. Kept using the git queue to run the restage action, even though the list of files that it builds up for that action is not actually used by the action. This could perhaps be simplified to make restaging a cleanup action that gets registered, rather than using the git queue for it. But I wasn't sure if that would cause visible behavior changes, when eg dropping a large number of files, currently the git queue flushes periodically, and so it restages incrementally, rather than all at the end. In restagePointerFiles, it reads the restage log twice, once to get the number of files and size, and a second time to process it. This seemed better than reading the whole file into memory, since potentially a huge number of files could be in there. Probably the OS will cache the file in memory and there will not be much performance impact. It might be better to keep running tallies in another file though. But updating that atomically with the log seems hard. Also note that it's possible for calcRestageLog to see a different file than streamRestageLog does. More files may be added to the log in between. That is ok, it will only cause the filterprocessfaster heuristic to operate with slightly out of date information, so it may make the wrong choice for the files that got added and be a little slower than ideal. Sponsored-by: Dartmouth College's DANDI project	2022-09-23 15:47:24 -04:00
Joey Hess	8718125ae4	refactor the restage runner Sponsored-by: Dartmouth College's DANDI project	2022-09-23 13:12:17 -04:00
Joey Hess	debcf86029	use RawFilePath version of rename Some small wins, almost certainly swamped by the system calls, but still worthwhile progress on the RawFilePath conversion. Sponsored-by: Erik Bjäreholt on Patreon	2022-06-22 16:47:34 -04:00
Joey Hess	224a57f9ed	RawFilePath optimisation	2022-06-22 16:11:03 -04:00
Joey Hess	95a04920cf	remove objectDir'	2022-06-22 16:08:49 -04:00
Joey Hess	64ccb4734e	smudge: Warn when encountering a pointer file that has other content appended to it It will then proceed to add the file the same as if it were any other file containing possibly annexable content. Usually the file is one that was annexed before, so the new, probably corrupt content will also be added to the annex. If the file was not annexed before, the content will be added to git. It's not possible for the smudge filter to throw an error here, because git then just adds the file to git anyway. Sponsored-by: Dartmouth College's Datalad project	2022-02-23 15:17:08 -04:00
Joey Hess	67245ae00f	fully specify the pointer file format This format is designed to detect accidental appends, while having some room for future expansion. Detect when an unlocked file whose content is not present has gotten some other content appended to it, and avoid treating it as a pointer file, so that appended content will not be checked into git, but will be annexed like any other file. Dropped the max size of a pointer file down to 32kb, it was around 80 kb, but without any good reason and certainly there are no valid pointer files anywhere that are larger than 8kb, because it's just been specified what it means for a pointer file with additional data even looks like. I assume 32kb will be good enough for anyone. ;-) Really though, it needs to be some smallish number, because that much of a file in git gets read into memory when eg, catting pointer files. And since we have no use cases for the extra lines of a pointer file yet, except possibly to add some human-visible explanation that it is a git-annex pointer file, 32k seems as reasonable an arbitrary number as anything. Increasing it would be possible, eg to 64k, as long as users of such jumbo pointer files didn't mind upgrading all their git-annex installations to one that supports the new larger size. Sponsored-by: Dartmouth College's Datalad project	2022-02-23 14:20:31 -04:00
Joey Hess	5b373a9dd2	read a consistent amount from pointer file A few places were reading the max symlink size of a pointer file, then passing to parseLinkTargetOrPointer. Which is fine currently, but to support pointer files with lines of data after the pointer, enough has to be read that parseLinkTargetOrPointer can be assured of seeing enough of that data to know if it's correctly formatted. Sponsored-by: Dartmouth College's Datalad project	2022-02-23 12:52:34 -04:00
Joey Hess	4cd9325c2c	fold parseLinkTarget into parseLinkTargetOrPointer Only one place remained that differentiated between them. It is the case that a symlink target that happens to contain a newline somehow will be treated as a link to a key truncated at the newline. This is super unlikely to happen, and since a key cannot actually contain a newline, it's as good a behavior as any. Anyway, this commit does not change the behavior there, although arguably it should be changed. Note that getAnnexLinkTarget does prevent a symlink target containing a newline. Sponsored-by: Dartmouth College's Datalad project	2022-02-23 12:30:32 -04:00
Joey Hess	faf84aa5c2	Avoid git status taking a long time after git-annex unlock of many files. Implemented by making Git.Queue have a FlushAction, which can accumulate along with another action on files, and runs only once the other action has run. This lets git-annex unlock queue up git update-index actions, without conflicting with the restagePointerFiles FlushActions. In a repository with filter-process enabled, git-annex unlock will often not take any more time than before, though it may when the files are large. Either way, it should always slow down less than git-annex status speeds up. When filter-process is not enabled, git-annex unlock will slow down as much as git status speeds up. Sponsored-by: Jochen Bartl on Patreon	2022-02-18 15:06:40 -04:00
Joey Hess	a0758bdd10	dynamically disable filter-process in restagePointerFile when it would be slower Based on my earlier benchmark, I have a rough cost model for how expensive it is for git-annex smudge to be run on a file, vs how expensive it is for a gigabyte of a file's content to be read and piped through to filter-process. So, using that cost model, it can decide if using filter-process will be more or less expensive than running the smudge filter on the files to be restaged. It turned out to be really annoying to temporarily disable filter-process. I did find a way, but urk, this is horrible. Notice that, if it's interrupted with it disabled, it will remain disabled until the next time restagePointerFile runs. Which could be some time later. If the user runs `git add` or `git checkout` on a lot of small files before that, they will see slower than expected performance. (This commit also deletes where I wrote down the benchmark results earlier.) Sponsored-by: Noam Kremen on Patreon	2021-11-08 16:20:34 -04:00
Joey Hess	837025b14f	Revert "disable filter.annex.process in restagePointerFile" This reverts commit `afe327ac49`. Unfortunately, disabling it by setting it to "" does not work, git then ignores filter.annex.smudge/clean, and does not pass files through git-annex at all. I don't think there is a way to temporarily disable this git config from the git command line. Which seems like a bug in git. So, it may be more expensive than anticipated to enable filter.annex.process, since git checkout etc will pipe all annexed files being checked out through it.	2021-11-05 12:43:33 -04:00
Joey Hess	afe327ac49	disable filter.annex.process in restagePointerFile This means git will run git-annex smudge --clean once per file that is restaged, which can be slow. But probably not as slow as git feeding all the content of annexed files you've gotten through a pipe to git-annex filter-process. The only time this is probably not ideal is after a drop of a bunch of files, when filter-process would be faster.	2021-11-04 15:20:26 -04:00
Joey Hess	0f38ad9a69	close keys db to possibly work around WSL1 issue	2021-10-19 13:07:49 -04:00
Joey Hess	c4f1465a81	check symlink before reading file This is faster because when multiple files are in a directory, it gets cached.	2021-06-14 11:53:51 -04:00
Joey Hess	26a9ea12d1	handle edge case of symlink to something that is not really a pointer file That seems very unlikely to happen, but still, it's possible it could. And with the recent addition of locked files to the keys db, this could be called by places that did not call it before, so it seems even more important it's correct. Adds an extra stat of the file, and is potentially racy, but both problems are fixed by the unix-2.8.0 path. I have not tested that path builds because that package is not yet released and it would be difficult to install it since it's tightly tied to a ghc version.	2021-06-14 11:35:52 -04:00
Joey Hess	c2f612292a	start splitting out readonly values from AnnexState Values in AnnexRead can be read more efficiently, without MVar overhead. Only a few things have been moved into there, and the performance increase so far is not likely to be noticeable. This is groundwork for putting more stuff in there, particularly a value that indicates if debugging is enabled. The obvious next step is to change option parsing to not run in the Annex monad to set values in AnnexState, and instead return a pure value that gets stored in AnnexRead.	2021-04-02 15:51:44 -04:00
Joey Hess	34a535ebea	adjust: Fix some bad behavior when unlocked files use URL keys. This avoids the smudge --clean filter failing on the URL keys. git checkout runs the post-checkout hook, which runs smudge --update. That populates all the pointer files, but it neglected to store their inode caches in the keys db. With that done, and the keys db flushed before smudge --clean gets run (by restagePointerFile), the isUnmodifiedCheap check can tell the file is not modified, so will not try to re-ingest it, which does not work with URL keys because they do not support genKey. It also seems possible that the isUnmodifiedCheap was also failing for non-URL keys, which would cause them to be re-ingested, leading to a lot of extra work. I have not verified that, but don't see why it wouldn't have happened. So this probably also speeds up checking out adjusted branches. This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.	2021-01-25 17:25:42 -04:00
Joey Hess	2c8cf06e75	more RawFilePath conversion Converted file mode setting to it, and follow-on changes. Compiles up through 369/646. This commit was sponsored by Ethan Aubin.	2020-11-05 18:45:37 -04:00
Joey Hess	87f91ce563	more RawFilePath conversion 451/645	2020-10-30 15:55:59 -04:00
Joey Hess	ca80c3154c	more RawFilePath conversion removeFile changed to removeLink, because AFAICS it should be fine to remove non-file things here. In particular, it's fine to remove a symlink, since we're about to write a symlink. (removeLink does not remove directories, so file, symlink, and unix socket are the only possibilities.)	2020-10-30 13:07:41 -04:00
Joey Hess	681b44236a	more RawFilePath conversion at 377/645 This commit was sponsored by Svenne Krap on Patreon.	2020-10-29 14:20:57 -04:00
Joey Hess	b24ba92231	refactor out Annex.PidLock	2020-08-26 12:29:13 -04:00
Joey Hess	7bdb0cdc0d	add gitAnnexChildProcess and use instead of incorrect use of runsGitAnnexChildProcess Fixes reversion in 8.20200617 that made annex.pidlock being enabled result in some commands stalling, particularly those needing to autoinit. Renamed runsGitAnnexChildProcess to make clearer where it should be used. Arguably, it would be better to have a way to make any process git-annex runs have the env var set. But then it would need to take the pid lock when running any and all processes, and that would be a problem when git-annex runs two processes concurrently. So, I'm left doing it ad-hoc in places where git-annex really does run a child process, directly or indirectly via a particular git command.	2020-08-25 14:57:49 -04:00
Joey Hess	d5451afc8f	fix deadlock Fix a deadlock that could occur after git-annex got an unlocked file, causing the command to hang indefinitely. Known to happen on vfat filesystems, possibly others. Note that a deadlock is still theoretically possible, if anything smudge --clean does causes it to run the git queue for some other reason. Apparently that doesn't happen, but will need to keep an eye on it.	2020-06-18 12:56:29 -04:00
Joey Hess	82448bdf39	fix an annex.pidlock issue That made eg git-annex get of an unlocked file hang until the annex.pidlocktimeout and then fail. This fix should be fully thread safe no matter what else git-annex is doing. Only using runsGitAnnexChildProcess in the one place it's known to be a problem. Could audit for all places where git-annex runs itself as a child and add it to all of them, later.	2020-06-17 15:30:59 -04:00
Joey Hess	686791c4ed	more RawFilePath Remove dup definitions and just use the RawFilePath one. </> etc are enough faster that it's probably faster than building a String directly, although I have not benchmarked.	2019-12-18 17:10:28 -04:00
Joey Hess	c19211774f	use filepath-bytestring for annex object manipulations git-annex find is now RawFilePath end to end, no string conversions. So is git-annex get when it does not need to get anything. So this is a major milestone on optimisation. Benchmarks indicate around 30% speedup in both commands. Probably many other performance improvements. All or nearly all places where a file is statted use RawFilePath now.	2019-12-11 15:25:07 -04:00
Joey Hess	bdec7fed9c	convert TopFilePath to use RawFilePath Adds a dependency on filepath-bytestring, an as yet unreleased fork of filepath that operates on RawFilePath. Git.Repo also changed to use RawFilePath for the path to the repo. This does eliminate some RawFilePath -> FilePath -> RawFilePath conversions. And filepath-bytestring's </> is probably faster. But I don't expect a major performance improvement from this. This is mostly groundwork for making Annex.Location use RawFilePath, which will allow for a conversion-free pipeline.	2019-12-09 15:07:21 -04:00
Joey Hess	5f391179f1	use RawFilePath getFileStatus for speed Only done on those calls to getFileStatus that had a RawFilePath, not a FilePath. The others would probably be just as fast if converted to use it with toRawFilePath, but I'm not 100% sure. Note that genInodeCache' uses fromRawFilePath, but that value only gets used on Windows, so on unix the thunk will never be evaluated.	2019-12-06 14:44:42 -04:00
Joey Hess	067aabdd48	wip RawFilePath 2x git-annex find speedup Finally builds (oh the agony of making it build), but still very unmergeable, only Command.Find is included and lots of stuff is badly hacked to make it compile. Benchmarking vs master, this git-annex find is significantly faster! Specifically (num files, old, new, speedup): 48500, 4.77, 3.73, 28%; 12500, 1.36, 1.02, 66%; 20, 0.075, 0.074, 0% (so startup time is unchanged). That's without really finishing the optimization. Things still to do: * Eliminate all the fromRawFilePath, toRawFilePath, encodeBS, decodeBS conversions. * Use versions of IO actions like getFileStatus that take a RawFilePath. * Eliminate some Data.ByteString.Lazy.toStrict, which is a slow copy. * Use ByteString for parsing git config to speed up startup. It's likely several of those will speed up git-annex find further. And other commands will certainly benefit even more.	2019-11-26 16:01:58 -04:00
Joey Hess	99536e3a0b	remove one more warningIO Had to generalize Git.Queue so it can run an Annex action, yipes. Only remaining warningIO are in the legacy chunk code.	2019-11-12 10:45:52 -04:00
Joey Hess	40ecf58d4b	update licenses from GPL to AGPL This does not change the overall license of the git-annex program, which was already AGPL due to a number of sources files being AGPL already. Legally speaking, I'm adding a new license under which these files are now available; I already released their current contents under the GPL license. Now they're dual licensed GPL and AGPL. However, I intend for all my future changes to these files to only be released under the AGPL license, and I won't be tracking the dual licensing status, so I'm simply changing the license statement to say it's AGPL. (In some cases, others wrote parts of the code of a file and released it under the GPL; but in all cases I have contributed a significant portion of the code in each file and it's that code that is getting the AGPL license; the GPL license of other contributors allows combining with AGPL code.)	2019-03-13 15:48:14 -04:00
Joey Hess	1e95bc4fd1	avoid git warning about CRLF in restagePointerFile Saw it on Windows, could probably also happen on linux with some configuration. Since this is a pointer file, the warning does not apply.	2019-02-18 18:35:36 -04:00
Joey Hess	1a367cad83	Fix path separator bug on Windows that completely broke git-annex since version 7.20190122.	2019-02-18 17:16:39 -04:00
Joey Hess	5d98cba923	use ByteStrings when reading annex symlinks and pointers Now there's a ByteString used all the way from disk to Key. The main complication in this conversion was the use of fromInternalGitPath in several places to munge things on Windows. The things that used that were changed to parse the ByteString using either path separator. Also some code that had read from files to a String lazily was changed to read a minimal strict ByteString.	2019-01-14 15:37:08 -04:00
Joey Hess	53905490df	convert Git.HashObject to use ByteStrings Both lazy and strict, because sometimes it's more efficient to build a small strict bytestring, and other times better to lazily stream.	2019-01-03 13:21:01 -04:00
Joey Hess	7d51b0c109	import Utility.FileSystemEncoding in Common	2019-01-03 11:37:02 -04:00
Joey Hess	b3c69eaaf8	strict bytestring encoders and decoders Only had lazy ones before. Already sped up a few parts of the code.	2019-01-01 14:55:15 -04:00
Joey Hess	54d49eeac8	avoid update-index race This commit was supported by the NSF-funded DataLad project.	2018-08-17 16:03:40 -04:00
Joey Hess	0f25d48639	pass absolute path to update-index Test suite found a case where this is necessary. And the man page says this, although current behavior is not as documented.. Note that files beginning with . are discarded. This includes ./file and dir/./file. If you don’t want this, then use cleaner names. This may hit path length limits on Windows. shrug This commit was supported by the NSF-funded DataLad project.	2018-08-16 16:00:29 -04:00
Joey Hess	82a239675f	narrow the race where a file gets modified before update-index Check just before running update-index if the worktree file's content is still the same, don't update it when it's been modified. This narrows the race window a lot, from possibly minutes or hours, to seconds or less. (Use replaceFile so that the worktree update happens atomically, allowing the InodeCache of the new worktree file to itself be gathered w/o any other race.) This doesn't eliminate the race; it can still occur in the window before update-index runs. When annex.queue is large, a lot of files will be statted by the checks, and so the window may still be large enough to be a problem. When only a few files are being processed, the window is as small as it is in the race where a modification gets overwritten by git-annex when it updates the worktree. Or maybe as small as whatever race git checkout/pull/merge may have when the worktree gets modified during it. Still, I've kept a todo about this race. This commit was supported by the NSF-funded DataLad project.	2018-08-16 15:56:43 -04:00
Joey Hess	82cfcfc838	better index file refresh method Use git update-index --refresh, since it's a little bit more efficient and the user can be told to run it if a locked index prevents git-annex from running it. This also fixes the problem where an annexed file was deleted in the index and a get of another file that uses the same key caused the index update to add back the deleted file. update-index will not add back the deleted file. Documented in tips/unlocked_files.mdwn the gotcha that the index update may conflict with other operations. I can't see any way to possibly avoid that conflict. One new todo about a race that causes a modification to be accidentally staged. Note that the assistant only flushes the git command queue when it commits a modification. I have not tested the assistant with v6 unlocked files, but assume most users of the assistant won't care if the index shows a file as modified for a while. This commit was supported by the NSF-funded DataLad project.	2018-08-16 14:16:24 -04:00
Joey Hess	0b7f6d24d3	rename BlobType and add submodule to it This was badly named, it's not a blob necessarily, but anything that a tree can refer to. Also removed the Show instance which was used for serialization to git format, instead use fmtTreeItemType. This commit was supported by the NSF-funded DataLad project.	2018-05-14 14:45:41 -04:00
Joey Hess	fc845e6530	more lambda-case conversion	2017-12-05 15:00:50 -04:00
Joey Hess	8484c0c197	Always use filesystem encoding for all file and handle reads and writes. This is a big scary change. I have convinced myself it should be safe. I hope!	2016-12-24 14:46:31 -04:00
Joey Hess	34530e59d9	Avoid using a lot of memory when large objects are present in the git repository .. and have to be checked to see if they are a pointer to an annexed file. Cases where such memory use could occur included, but were not limited to: - git commit -a of a large unlocked file (in v5 mode) - git-annex adjust when a large file was checked into git directly Generally, any use of catKey was a potential problem. Fix by using git cat-file --batch-check to check size before catting. This adds another git batch process, which is included in the CatFileHandle for simplicity. There could be performance impact, anywhere catKey is used. Particularly likely to affect adjusted branch generation speed, and operations on unlocked files in v6 mode. Hopefully since the --batch-check and --batch read the same data, disk buffering will avoid most overhead. Leaving only the overhead of talking to the process over the pipe and whatever computation --batch-check needs to do. This commit was sponsored by Bruno BEAUFILS on Patreon.	2016-10-05 15:24:13 -04:00


86 commits