git-annex

Author	SHA1	Message	Date
Joey Hess	cd544e548b	filter out control characters in error messages giveup changed to filter out control characters. (It is too low level to make it use StringContainingQuotedPath.) error still does not, but it should only be used for internal errors, where the message is not attacker-controlled. Changed a lot of existing error to giveup when it is not strictly an internal error. Of course, other exceptions can still be thrown, either by code in git-annex, or a library, that include some attacker-controlled value. This does not guard against those. Sponsored-by: Noam Kremen on Patreon	2023-04-10 13:50:51 -04:00
Joey Hess	e47b4badb3	separate handles for cat-file and cat-file --batch-check This avoids starting one process when only the other one is needed. Eg in git-annex smudge --clean, this reduces the total number of cat-file processes that are started from 4 to 2. The only performance penalty is that when both are needed, it has to do twice as much work to maintain the two Maps. But both are very small, consisting of 1 or 2 items, so that work is negligible. Sponsored-by: Dartmouth College's Datalad project	2021-09-24 13:16:13 -04:00
Joey Hess	70dbe61fc2	remove unnecessary liftIO	2021-06-07 14:51:12 -04:00
Joey Hess	b88ecb36dd	remove unused code	2020-07-15 11:16:36 -04:00
Joey Hess	992fe446ad	unused import	2020-07-15 11:16:15 -04:00
Joey Hess	8cd0e351ba	avoid using cat-file --buffer if git is too old This is only needed for the i386ancient build, so build in the git version git-annex is built with, assuming git won't be upgraded, or if it is, they just won't get the speedup of --buffer	2020-07-15 10:56:32 -04:00
Joey Hess	535cdc8d48	importfeed: Made checking known urls step around 10% faster. This was a bit disappointing, I was hoping for a 2x speedup. But, I think the metadata lookup is wasting a lot of time and also needs to be made to stream. The changes to catObjectStreamLsTree were benchmarked to not also speed up --all around 3% more. Seems I managed to make it polymorphic after all.	2020-07-14 12:47:51 -04:00
Joey Hess	5387b95dcd	add catObjectMetaDataStream	2020-07-10 14:36:18 -04:00
Joey Hess	bd2d304064	better catObjectStream' and use Chan The catObjectStream' is generic enough to let it be nicely used from inside Annex monad. Chan will be faster than DList here. Bearing in mind, it is unbounded, but in reality will be bounded by the size of the stdio buffer through git cat-file. This speeds up --all by about 10% although I think only getting back to the previous performance before I introduced that DList.	2020-07-10 13:15:14 -04:00
Joey Hess	cb6e19f4c5	work around catObjectStream polymorism perf Breaking it up like this doesn't change perf, and lets another version be written in just a couple lines.	2020-07-09 14:27:07 -04:00
Joey Hess	d08c178f97	avoid catObjectStream skipping over unavailable shas Not needed as it's used for --all, but will be needed later.	2020-07-08 13:57:17 -04:00
Joey Hess	de3d7d044d	make catObjectStream support newline and carriage return in filenames Turns out the %(rest) trick was not needed. Instead, just maintain a list of files we've asked for, and each cat-file response is for the next file in the list. This actually benchmarks 25% faster than before! Very surprising, but it must be due to needing to shove less data through the pipe, and parse less.	2020-07-08 13:49:03 -04:00
Joey Hess	d010ab04be	sped up the --all option by 2x to 16x by using git cat-file --buffer This assumes that no location log files will have a newline or carriage return in their name. catObjectStream skips any such files due to cat-file not supporting them. Keys have been prevented from containing newlines since 2011, commit `480495beb4`. If some old repo had a key with a newline in it, --all will just skip processing that key. Other things, like .git/annex/unused files certianly assume no newlines in keys too, and AFAICR, such keys never actually worked. Carriage return is escaped by preSanitizeKeyName since 2013. WORM keys generated before that point could perhaps contain a CR. (URL probably not, http probably doesn't support an URL with a raw CR in it.) So, added a warning in fsck about such keys. Although, fsck --all will naturally skip them, so won't be able to warn about them. Not entirely satisfactory, but I'll bet there are not really any such keys in existence. Thanks to Lukey for finding this optimisation.	2020-07-07 13:54:04 -04:00
Joey Hess	438dbe3b66	convert to withCreateProcess for async exception safety This handles all sites where checkSuccessProcess/ignoreFailureProcess is used, except for one: Git.Command.pipeReadLazy That one will be significantly more work to convert to bracketing. (Also skipped Command.Assistant.autoStart, but it does not need to shut down the processes it started on exception because they are git-annex assistant daemons..) forceSuccessProcess is done, except for createProcessSuccess. All call sites of createProcessSuccess will need to be converted to bracketing. (process pools still todo also)	2020-06-04 12:44:09 -04:00
Joey Hess	86426036a0	optimise catfile interface with ByteString and Attoparsec Around 3% total speedup. Profiling git annex find --not --in web, it's now bytestring end-to-end, and there is only a little added overhead in eg accessing the Annex state MVar (3%). The rest of the runtime is spent reading symlinks, and in attoparsec. This feels like the end of the optimisation road, without a major change like caching information for faster queries.	2020-04-10 14:18:52 -04:00
Joey Hess	6c81e0c8f1	ByteString Ref continued Several nice speed wins I think. At 340/633 files converted.	2020-04-07 13:27:11 -04:00
Joey Hess	5e4deb3620	support sha256 git repos Git will eventually switch to sha2 and there will not be one single shaSize anymore, but two (40 and 64). Changed all parsers for git plumbing output to support both sizes of shas. One potential problem this does not deal with is, if somewhere in git-annex it reads two shas from different sources, and compares them to see if they're the same sha, it would fail if they're sha1 and sha256 of the same value. I don't know if that will really be a concern.	2020-01-07 12:22:19 -04:00
Joey Hess	ea3cb7d277	fix a case where file tracked by git unexpectedly becomes annex pointer file smudge: When annex.largefiles=anything, files that were already stored in git, and have not been modified could sometimes be converted to being stored in the annex. Changes in 7.20191024 made this more of a problem. This case is now detected and prevented.	2019-12-27 15:08:03 -04:00
Joey Hess	6a97ff6b3a	wip RawFilePath Goal is to make git-annex faster by using ByteString for all the worktree traversal. For now, this is focusing on Command.Find, in order to benchmark how much it helps. (All other commands are temporarily disabled) Currently in a very bad unbuildable in-between state.	2019-11-25 16:18:19 -04:00
Joey Hess	f4dd7d5191	work around windows having infected git's plumbing Work around git cat-file --batch's odd stripping of carriage return from the end of the line (some windows infection), avoiding crashing when the repo contains a filename ending in a carriage return.	2019-10-08 15:27:05 -04:00
Joey Hess	bf5dd723d3	Fix querying git for object type when operating on a file containing newlines This typo would make "git cat-file cat-file" fail, and the way it's used, I think it broke querying all info from filenames containing newlines, because the other queries are only run when it succeeds.	2019-08-07 13:35:42 -04:00
Joey Hess	40ecf58d4b	update licenses from GPL to AGPL This does not change the overall license of the git-annex program, which was already AGPL due to a number of sources files being AGPL already. Legally speaking, I'm adding a new license under which these files are now available; I already released their current contents under the GPL license. Now they're dual licensed GPL and AGPL. However, I intend for all my future changes to these files to only be released under the AGPL license, and I won't be tracking the dual licensing status, so I'm simply changing the license statement to say it's AGPL. (In some cases, others wrote parts of the code of a file and released it under the GPL; but in all cases I have contributed a significant portion of the code in each file and it's that code that is getting the AGPL license; the GPL license of other contributors allows combining with AGPL code.)	2019-03-13 15:48:14 -04:00
Joey Hess	7d51b0c109	import Utility.FileSystemEncoding in Common	2019-01-03 11:37:02 -04:00
Joey Hess	b3c69eaaf8	strict bytestring encoders and decoders Only had lazy ones before. Already sped up a few parts of the code.	2019-01-01 14:55:15 -04:00
Joey Hess	2aae6e84af	Support newlines in filenames. Work around git cat-file --batch's protocol not supporting newlines by running git cat-file not batched and passing the filename as a parameter. Of course this is quite a lot less efficient, especially because it currently runs it multiple times to query for different pieces of information. Also, it has subtly different behavior when the batch process was started and then some changes were made, in which case the batch process sees the old index but this workaround sees the current index. Since that batch behavior is mostly a problem that affects the assistant and has to be worked around in it, I think I can get away with this difference. I don't know of any other problems with newlines in filenames, everything else in git I can think of supports -z. And git-annex's json output supports newlines in filenames so downstream parsers from git-annex will be ok. git-annex commands that use --batch themselves don't support newlines in input filenames; using --json --batch is currently a way around that problem. This commit was sponsored by Ewen McNeill on Patreon.	2018-09-20 13:45:44 -04:00
Joey Hess	a1730cd6af	adeiu, MissingH Removed dependency on MissingH, instead depending on the split library. After laying groundwork for this since 2015, it was mostly straightforward. Added Utility.Tuple and Utility.Split. Eyeballed System.Path.WildMatch while implementing the same thing. Since MissingH's progress meter display was being used, I re-implemented my own. Bonus: Now progress is displayed for transfers of files of unknown size. This commit was sponsored by Shane-o on Patreon.	2017-05-16 01:03:52 -04:00
Joey Hess	8484c0c197	Always use filesystem encoding for all file and handle reads and writes. This is a big scary change. I have convinced myself it should be safe. I hope!	2016-12-24 14:46:31 -04:00
Joey Hess	e23028d19b	restart coprocess in raw mode Restarting a crashing git process could result in filename encoding issues when not in a unicode locale, as the restarted processes's handles were not read in raw mode. Since rawMode is always used when starting a coprocess, didn't bother to parameterise it and just always enable it for simplicity. This commit was sponsored by Jake Vosloo on Patreon.	2016-11-01 14:03:59 -04:00
Joey Hess	b530e43285	Fix reversion in 6.20161012 that prevented adding files with a space in their name.	2016-10-31 18:39:37 -04:00
Joey Hess	34530e59d9	Avoid using a lot of memory when large objects are present in the git repository .. and have to be checked to see if they are a pointed to an annexed file. Cases where such memory use could occur included, but were not limited to: - git commit -a of a large unlocked file (in v5 mode) - git-annex adjust when a large file was checked into git directly Generally, any use of catKey was a potential problem. Fix by using git cat-file --batch-check to check size before catting. This adds another git batch process, which is included in the CatFileHandle for simplicity. There could be performance impact, anywhere catKey is used. Particularly likely to affect adjusted branch generation speed, and operations on unlocked files in v6 mode. Hopefully since the --batch-check and --batch read the same data, disk buffering will avoid most overhead. Leaving only the overhead of talking to the process over the pipe and whatever computation --batch-check needs to do. This commit was sponsored by Bruno BEAUFILS on Patreon.	2016-10-05 15:24:13 -04:00
Joey Hess	11935c4d6f	fix parsing of commit with no parents	2016-03-31 17:12:01 -04:00
Joey Hess	fbf4d89e82	extract commit parent(s)	2016-03-11 12:47:14 -04:00
Joey Hess	1f91d1d0b7	add catCommit, with commit object parser	2016-02-25 15:14:47 -04:00
Joey Hess	70fee8208c	remove old TODO	2016-01-01 15:43:13 -04:00
Joey Hess	addc82dab7	removed all uses of undefined from code base It's a code smell, can lead to hard to diagnose error messages.	2015-04-19 00:38:29 -04:00
Joey Hess	afc5153157	update my email address and homepage url	2015-01-21 12:50:09 -04:00
Joey Hess	7b50b3c057	fix some mixed space+tab indentation This fixes all instances of " \t" in the code base. Most common case seems to be after a "where" line; probably vim copied the two space layout of that line. Done as a background task while listening to episode 2 of the Type Theory podcast.	2014-10-09 15:09:11 -04:00
Joey Hess	1052eeface	Windows: Fix some filename encoding bugs. http://git-annex.branchable.com/bugs/Unicode_file_names_ignored_on_Windows/ Not a complete fix yet.	2014-03-19 15:57:56 -04:00
Joey Hess	1192d98721	sync: Fix bug in direct mode that caused a file not checked into git to be deleted when merging with a remote that added a file by the same name. (Thanks, jkt)	2014-03-03 14:57:16 -04:00
Joey Hess	4e0be2792b	remove Read instance for Ref Removed instance, got it all to build using fromRef. (With a few things that really need to show something using a ref for debugging stubbed out.) Then added back Read instance, and made Logs.View use it for serialization. This changes the view log format.	2014-02-19 01:19:57 -04:00
Joey Hess	4f871f89ba	git-recover-repository 1/2 done	2013-10-20 17:50:51 -04:00
Joey Hess	f482de1b76	remove workaround for bug in git 1.8.4r0	2013-10-20 15:23:06 -04:00
Joey Hess	7390f08ef9	Use cryptohash rather than SHA for hashing. This is a massive win on OSX, which doesn't have a sha256sum normally. Only use external hash commands when the file is > 1 mb, since cryptohash is quite close to them in speed. SHA is still used to calculate HMACs. I don't quite understand cryptohash's API for those. Used the following benchmark to arrive at the 1 mb number. 1 mb file: benchmarking sha256/internal mean: 13.86696 ms, lb 13.83010 ms, ub 13.93453 ms, ci 0.950 std dev: 249.3235 us, lb 162.0448 us, ub 458.1744 us, ci 0.950 found 5 outliers among 100 samples (5.0%) 4 (4.0%) high mild 1 (1.0%) high severe variance introduced by outliers: 10.415% variance is moderately inflated by outliers benchmarking sha256/external mean: 14.20670 ms, lb 14.17237 ms, ub 14.27004 ms, ci 0.950 std dev: 230.5448 us, lb 150.7310 us, ub 427.6068 us, ci 0.950 found 3 outliers among 100 samples (3.0%) 2 (2.0%) high mild 1 (1.0%) high severe 2 mb file: benchmarking sha256/internal mean: 26.44270 ms, lb 26.23701 ms, ub 26.63414 ms, ci 0.950 std dev: 1.012303 ms, lb 925.8921 us, ub 1.122267 ms, ci 0.950 variance introduced by outliers: 35.540% variance is moderately inflated by outliers benchmarking sha256/external mean: 26.84521 ms, lb 26.77644 ms, ub 26.91433 ms, ci 0.950 std dev: 347.7867 us, lb 210.6283 us, ub 571.3351 us, ci 0.950 found 6 outliers among 100 samples (6.0%) import Crypto.Hash import Data.ByteString.Lazy as L import Criterion.Main import Common testfile :: FilePath testfile = "/run/shm/data" -- on ram disk main = defaultMain [ bgroup "sha256" [ bench "internal" $ whnfIO internal , bench "external" $ whnfIO external ] ] sha256 :: L.ByteString -> Digest SHA256 sha256 = hashlazy internal :: IO String internal = show . sha256 <$> L.readFile testfile external :: IO String external = do s <- readProcess "sha256sum" [testfile] return $ fst $ separate (== ' ') s	2013-09-22 20:06:02 -04:00
Joey Hess	006cf7976f	more completely solve catKey memory leak Done using a mode witness, which ensures it's fixed everywhere. Fixing catFileKey was a bear, because git cat-file does not provide a nice way to query for the mode of a file and there is no other efficient way to do it. Oh, for libgit2.. Note that I am looking at tree objects from HEAD, rather than the index. Because I cat-file cannot show a tree object for the index. So this fix is technically incomplete. The only cases where it matters are: 1. A new large file has been directly staged in git, but not committed. 2. A file that was committed to HEAD as a symlink has been staged directly in the index. This could be fixed a lot better using libgit2.	2013-09-19 16:41:21 -04:00
Joey Hess	f26c996dc6	interface to parse git tree objects	2013-09-19 15:58:35 -04:00
Joey Hess	a3224ce35b	avoid more build warnings on Windows	2013-08-04 14:05:36 -04:00
Joey Hess	d16114d024	Slow and ugly work around for bug #718517 in git, which broke git-cat-file --batch for filenames containing spaces. This runs git-cat-file in non-batch mode for all files with spaces. If a directory tree has a lot of them, and is in direct mode, even "git annex add" when there are few new files will need a lot of forks! The only reason buffering the whole file content to get the sha is not a memory leak is that git-annex only ever uses this on symlinks. This needs to be reverted as soon as a fix is available in git!	2013-08-01 17:30:47 -04:00
Joey Hess	91c4dcfc69	Can now restart certain long-running git processes if they crash, and continue working. Fuzz tests have shown that git cat-file --batch sometimes stops running. It's not yet known why (no error message; repo seems ok). But this is something we can deal with in the CoProcess framework, since all 3 types of long-running git processes should be restartable if they fail. Note that, as implemented, only IO errors are caught. So an error thrown by the reveiver, when it sees something that is not valid output from git cat-file (etc) will not cause a restart. I don't want it to retry if git commands change their output or are just outputting garbage. This does mean that if the command did a partial output and crashed in the middle, it would still not be restarted. There is currently no guard against restarting a command repeatedly, if, for example, it crashes repeatedly on startup.	2013-05-31 12:42:13 -04:00
Joey Hess	03e8594369	fix the day's windows permissions damage	2013-05-12 19:09:48 -04:00
Joey Hess	73d2f8b280	deal with git using / internally, even on DOS	2013-05-12 17:29:49 -05:00

1 2

72 commits