git-annex

Author	SHA1	Message	Date
Joey Hess	d5628a16b8	Merge branch 'bs' into sqlite-bs	2019-12-18 14:51:03 -04:00
Joey Hess	c19211774f	use filepath-bytestring for annex object manipulations git-annex find is now RawFilePath end to end, no string conversions. So is git-annex get when it does not need to get anything. So this is a major milestone on optimisation. Benchmarks indicate around 30% speedup in both commands. Probably many other performance improvements. All or nearly all places where a file is statted use RawFilePath now.	2019-12-11 15:25:07 -04:00
Joey Hess	bdec7fed9c	convert TopFilePath to use RawFilePath Adds a dependency on filepath-bytestring, an as yet unreleased fork of filepath that operates on RawFilePath. Git.Repo also changed to use RawFilePath for the path to the repo. This does eliminate some RawFilePath -> FilePath -> RawFilePath conversions. And filepath-bytestring's </> is probably faster. But I don't expect a major performance improvement from this. This is mostly groundwork for making Annex.Location use RawFilePath, which will allow for a conversion-free pipleline.	2019-12-09 15:07:21 -04:00
Joey Hess	2f9a80d803	merging sqlite and bs branches Since the sqlite branch uses blobs extensively, there are some performance benefits, ByteStrings now get stored and retrieved w/o conversion in some cases like in Database.Export.	2019-12-06 15:30:45 -04:00
Joey Hess	067aabdd48	wip RawFilePath 2x git-annex find speedup Finally builds (oh the agoncy of making it build), but still very unmergable, only Command.Find is included and lots of stuff is badly hacked to make it compile. Benchmarking vs master, this git-annex find is significantly faster! Specifically: num files old new speedup 48500 4.77 3.73 28% 12500 1.36 1.02 66% 20 0.075 0.074 0% (so startup time is unchanged) That's without really finishing the optimization. Things still to do: * Eliminate all the fromRawFilePath, toRawFilePath, encodeBS, decodeBS conversions. * Use versions of IO actions like getFileStatus that take a RawFilePath. * Eliminate some Data.ByteString.Lazy.toStrict, which is a slow copy. * Use ByteString for parsing git config to speed up startup. It's likely several of those will speed up git-annex find further. And other commands will certianly benefit even more.	2019-11-26 16:01:58 -04:00
Joey Hess	c35a9047d3	improve data types for sqlite This is a non-backwards compatable change, so not suitable for merging w/o a annex.version bump and transition code. Not yet tested. This improves performance of git-annex benchmark --databases across the board by 10-25%, since eg Key roundtrips as a ByteString. (serializeKey' produces a lazy ByteString, so there is still a copy involved in converting it to a strict ByteString. It may be faster to switch to using bytestring-strict-builder.) FilePath and Key are both stored as blobs. This avoids mojibake in some situations. It would be possible to use varchar instead, if persistent could avoid converting that to Text, but it seems there is no good way to do so. See doc/todo/sqlite_database_improvements.mdwn Eliminated some ugly artifacts of using Read/Show serialization; constructors and quoted strings are no longer stored in sqlite. Renamed SRef to SSha to reflect that it is only ever a git sha, not a ref name. Since it is limited to the characters in a sha, it is not affected by mojibake, so still uses String.	2019-10-29 17:05:36 -04:00
Joey Hess	94efc400e9	horrible impementation of isInodeKnown The only good thing about it is it does not require a major version bump to improve the database. That will need to happen at some point though. Potentially very very slow in a large repository. Ugly use of raw sql.	2019-10-23 14:37:29 -04:00
Joey Hess	3f0eef4baa	v7 for all repositories * Default to v7 for new repositories. * Automatically upgrade v5 repositories to v7.	2019-08-30 14:09:14 -04:00
Joey Hess	40ecf58d4b	update licenses from GPL to AGPL This does not change the overall license of the git-annex program, which was already AGPL due to a number of sources files being AGPL already. Legally speaking, I'm adding a new license under which these files are now available; I already released their current contents under the GPL license. Now they're dual licensed GPL and AGPL. However, I intend for all my future changes to these files to only be released under the AGPL license, and I won't be tracking the dual licensing status, so I'm simply changing the license statement to say it's AGPL. (In some cases, others wrote parts of the code of a file and released it under the GPL; but in all cases I have contributed a significant portion of the code in each file and it's that code that is getting the AGPL license; the GPL license of other contributors allows combining with AGPL code.)	2019-03-13 15:48:14 -04:00
Joey Hess	fcc9eea554	avoid closeDb opening the db if it's not already open	2018-10-30 22:19:05 -04:00
Joey Hess	a7f0b99a33	fix v6 deadlock with git 2.1.4 I don't know why git diff --raw would run the clean filter, but it did with this version of git. Perhaps it is cleaning the file to generate the diff to search with -G? But then why would newer gits not run the clean filter? It caused git annex to deadlock because the keys database was locked and ran a git command that ran git-annex, which tried to read from the keys database. This commit was sponsored by Brett Eisenberg on Patreon.	2018-09-13 13:55:25 -04:00
Joey Hess	50fa17aee6	v6: recover from race between git mv and git-annex get/drop Update pointer file next time reconcileStaged is run to recover from the race. Note that restagePointerFile causes git to run the clean filter, and that will run reconcileStaged. So, normally by the time the git annex get/drop command finishes, the race has already been dealt with. It may be that, in some case, that won't happen and the race will be dealt with at a later point. git-annex could run reconcileStaged at shutdown if that becomes a problem. This does not handle the situation where the git mv is committed before git-annex gets a chance to run again. git commit does run the clean filter, and that happens to re-inject the content if it was supposed to be dropped but is still populated. But, the case where the file was supposed to be gotten but is not populated is not handled yet. This commit was supported by the NSF-funded DataLad project.	2018-08-22 15:56:43 -04:00
Joey Hess	18ecf41917	avoid running reconcileStaged when the index has not changed This commit was supported by the NSF-funded DataLad project.	2018-08-22 13:04:12 -04:00
Joey Hess	5e56d9b620	v6: Update associated files database when git has staged changes to pointer files This commit was supported by the NSF-funded DataLad project.	2018-08-21 17:02:20 -04:00
Joey Hess	6ab14710fc	fix consistency bug reading from export database The export database has writes made to it and then expects to read back the same data immediately. But, the way that Database.Handle does writes, in order to support multiple writers, makes that not work, due to caching issues. This resulted in export re-uploading files it had already successfully renamed into place. Fixed by allowing databases to be opened in MultiWriter or SingleWriter mode. The export database only needs to support a single writer; it does not make sense for multiple exports to run at the same time to the same special remote. All other databases still use MultiWriter mode. And by inspection, nothing else in git-annex seems to be relying on being able to immediately query for changes that were just written to the database. This commit was supported by the NSF-funded DataLad project.	2017-09-06 17:19:07 -04:00
Joey Hess	3b22ad9f47	Work around sqlite's incorrect handling of umask when creating databases. Refactored some common code into initDb. This only deals with the problem when creating new databases. If a repo got bad permissions into it, it's up to the user to deal with it. This commit was sponsored by Ole-Morten Duesund on Patreon.	2017-02-13 17:39:16 -04:00
Joey Hess	206dab8b44	remove unused imports	2016-10-18 16:16:47 -04:00
Joey Hess	148bd0dbfd	refactor	2016-10-17 14:58:33 -04:00
Joey Hess	e34046de38	slightly more efficient checking of versionUsesKeysDatabase It's a mvar lookup either way, but I think this way will be slightly more efficient. And it reduces the number of places where it's checked to 1.	2016-07-19 14:02:49 -04:00
Joey Hess	2619019630	Avoid any access to keys database in v5 mode repositories, which are not supposed to use that database.	2016-07-19 12:12:19 -04:00
Joey Hess	5f0b551c0c	assistant: Fix race in v6 mode that caused downloaded file content to sometimes not replace pointer files. The keys database handle needs to be closed after merging, because the smudge filter, in another process, updates the database. Old cached info can be read for a while from the open database handle; closing it ensures that the info written by the smudge filter is available. This is pretty horribly ad-hoc, and it's especially nasty that the transferrer closes the database every time.	2016-05-16 14:49:12 -04:00
Joey Hess	d05a75e45a	fix bug in unlocked file scanner that skipped over executable unlocked files	2016-04-14 13:07:46 -04:00
Joey Hess	cf260d9a15	Fix storing of filenames of v6 unlocked files when the filename is not representable in the current locale. This is a mostly backwards compatable change. I broke backwards compatability in the case where a filename starts with double-quote. That seems likely to be very rare, and v6 unlocked files are a new feature anyway, and fsck needs to fix missing associated file mappings anyway. So, I decided that is good enough. The encoding used is to just show the String when it contains a problem character. While that adds some overhead to addAssociatedFile and removeAssociatedFile, those are not called very often. This approach has minimal decode overhead, because most filenames won't be encoded that way, and it only has to look for the leading double-quote to skip the expensive read. So, getAssociatedFiles remains fast. I did consider using ByteString instead, but getting a FilePath converted with all chars intact, even surrigates, is difficult, and it looks like instance PersistField ByteString uses Text, which I don't trust for problem encoded data. It would probably be slower too, and it would make the database less easy to inspect manually.	2016-02-14 16:37:25 -04:00
Joey Hess	9df13e73ae	if keys database cannot be opened due to permissions, ignore This lets readonly repos be used. If a repo is readonly, we can ignore the keys database, because nothing that we can do will change the state of the repo anyway.	2016-02-12 14:16:35 -04:00
Joey Hess	737e45156e	remove 163 lines of code without changing anything except imports	2016-01-20 16:36:33 -04:00
Joey Hess	423fffcd41	change keys database to use IKey type with more efficient serialization This breaks any existing keys database! IKey serializes more efficiently than SKey, although this limits the use of its Read/Show instances. This makes the keys database use less disk space, and so should be a win. Updated benchmark: benchmarking keys database/getAssociatedFiles from 1000 (hit) time 64.04 μs (63.95 μs .. 64.13 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 64.02 μs (63.96 μs .. 64.08 μs) std dev 218.2 ns (172.5 ns .. 299.3 ns) benchmarking keys database/getAssociatedFiles from 1000 (miss) time 52.53 μs (52.18 μs .. 53.21 μs) 0.999 R² (0.998 R² .. 1.000 R²) mean 52.31 μs (52.18 μs .. 52.91 μs) std dev 734.6 ns (206.2 ns .. 1.623 μs) benchmarking keys database/getAssociatedKey from 1000 (hit) time 64.60 μs (64.46 μs .. 64.77 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 64.74 μs (64.57 μs .. 65.20 μs) std dev 900.2 ns (389.7 ns .. 1.733 μs) benchmarking keys database/getAssociatedKey from 1000 (miss) time 52.46 μs (52.29 μs .. 52.68 μs) 1.000 R² (0.999 R² .. 1.000 R²) mean 52.63 μs (52.35 μs .. 53.37 μs) std dev 1.362 μs (562.7 ns .. 2.608 μs) variance introduced by outliers: 24% (moderately inflated) benchmarking keys database/addAssociatedFile to 1000 (old) time 487.3 μs (484.7 μs .. 490.1 μs) 1.000 R² (0.999 R² .. 1.000 R²) mean 490.9 μs (487.8 μs .. 496.5 μs) std dev 13.95 μs (6.841 μs .. 22.03 μs) variance introduced by outliers: 20% (moderately inflated) benchmarking keys database/addAssociatedFile to 1000 (new) time 6.633 ms (5.741 ms .. 7.751 ms) 0.905 R² (0.850 R² .. 0.965 R²) mean 8.252 ms (7.803 ms .. 8.602 ms) std dev 1.126 ms (900.3 μs .. 1.430 ms) variance introduced by outliers: 72% (severely inflated) benchmarking keys database/getAssociatedFiles from 10000 (hit) time 65.36 μs (64.71 μs .. 66.37 μs) 0.998 R² (0.995 R² .. 1.000 R²) mean 65.28 μs (64.72 μs .. 66.45 μs) std dev 2.576 μs (920.8 ns .. 4.122 μs) variance introduced by outliers: 42% (moderately inflated) benchmarking keys database/getAssociatedFiles from 10000 (miss) time 52.34 μs (52.25 μs .. 52.45 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 52.49 μs (52.42 μs .. 52.59 μs) std dev 255.4 ns (205.8 ns .. 312.9 ns) benchmarking keys database/getAssociatedKey from 10000 (hit) time 64.76 μs (64.67 μs .. 64.84 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 64.67 μs (64.62 μs .. 64.72 μs) std dev 177.3 ns (148.1 ns .. 217.1 ns) benchmarking keys database/getAssociatedKey from 10000 (miss) time 52.75 μs (52.66 μs .. 52.82 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 52.69 μs (52.63 μs .. 52.75 μs) std dev 210.6 ns (173.7 ns .. 265.9 ns) benchmarking keys database/addAssociatedFile to 10000 (old) time 489.7 μs (488.7 μs .. 490.7 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 490.4 μs (489.6 μs .. 492.2 μs) std dev 3.990 μs (2.435 μs .. 7.604 μs) benchmarking keys database/addAssociatedFile to 10000 (new) time 9.994 ms (9.186 ms .. 10.74 ms) 0.959 R² (0.928 R² .. 0.979 R²) mean 9.906 ms (9.343 ms .. 10.40 ms) std dev 1.384 ms (1.051 ms .. 2.100 ms) variance introduced by outliers: 69% (severely inflated)	2016-01-12 14:01:50 -04:00
Joey Hess	75f61df323	cleanup	2016-01-12 13:31:13 -04:00
Joey Hess	f9c5aa84e0	add database benchmark The benchmark shows that the database access is quite fast indeed! And, it scales linearly to the number of keys, with one exception, getAssociatedKey. Based on this benchmark, I don't think I need worry about optimising for cases where all files are locked and the database is mostly empty. In those cases, database access will be misses, and according to this benchmark, should add only 50 milliseconds to runtime. (NB: There may be some overhead to getting the database opened and locking the handle that this benchmark doesn't see.) joey@darkstar:~/src/git-annex>./git-annex benchmark setting up database with 1000 setting up database with 10000 benchmarking keys database/getAssociatedFiles from 1000 (hit) time 62.77 μs (62.70 μs .. 62.85 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 62.81 μs (62.76 μs .. 62.88 μs) std dev 201.6 ns (157.5 ns .. 259.5 ns) benchmarking keys database/getAssociatedFiles from 1000 (miss) time 50.02 μs (49.97 μs .. 50.07 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 50.09 μs (50.04 μs .. 50.17 μs) std dev 206.7 ns (133.8 ns .. 295.3 ns) benchmarking keys database/getAssociatedKey from 1000 (hit) time 211.2 μs (210.5 μs .. 212.3 μs) 1.000 R² (0.999 R² .. 1.000 R²) mean 211.0 μs (210.7 μs .. 212.0 μs) std dev 1.685 μs (334.4 ns .. 3.517 μs) benchmarking keys database/getAssociatedKey from 1000 (miss) time 173.5 μs (172.7 μs .. 174.2 μs) 1.000 R² (0.999 R² .. 1.000 R²) mean 173.7 μs (173.0 μs .. 175.5 μs) std dev 3.833 μs (1.858 μs .. 6.617 μs) variance introduced by outliers: 16% (moderately inflated) benchmarking keys database/getAssociatedFiles from 10000 (hit) time 64.01 μs (63.84 μs .. 64.18 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 64.85 μs (64.34 μs .. 66.02 μs) std dev 2.433 μs (547.6 ns .. 4.652 μs) variance introduced by outliers: 40% (moderately inflated) benchmarking keys database/getAssociatedFiles from 10000 (miss) time 50.33 μs (50.28 μs .. 50.39 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 50.32 μs (50.26 μs .. 50.38 μs) std dev 202.7 ns (167.6 ns .. 252.0 ns) benchmarking keys database/getAssociatedKey from 10000 (hit) time 1.142 ms (1.139 ms .. 1.146 ms) 1.000 R² (1.000 R² .. 1.000 R²) mean 1.142 ms (1.140 ms .. 1.144 ms) std dev 7.142 μs (4.994 μs .. 10.98 μs) benchmarking keys database/getAssociatedKey from 10000 (miss) time 1.094 ms (1.092 ms .. 1.096 ms) 1.000 R² (1.000 R² .. 1.000 R²) mean 1.095 ms (1.095 ms .. 1.097 ms) std dev 4.277 μs (2.591 μs .. 7.228 μs)	2016-01-12 13:07:03 -04:00
Joey Hess	8111eb21e6	split out raw sql interface	2016-01-11 15:52:11 -04:00
Joey Hess	b1a1b40a15	fix inverted logic in old associated files cleanup	2016-01-07 15:54:10 -04:00
Joey Hess	aa4f353e5d	clarify absPathFrom The repo path is typically relative, not absolute, so providing it to absPathFrom doesn't yield an absolute path. This is not a bug, just unclear documentation. Indeed, there seem to be no reason to simplifyPath here, which absPathFrom does, so instead just combine the repo path and the TopFilePath. Also, removed an export of the TopFilePath constructor; asTopFilePath is provided to construct one as-is.	2016-01-05 17:33:48 -04:00
Joey Hess	b3d60ca285	use TopFilePath for associated files Fixes several bugs with updates of pointer files. When eg, running git annex drop --from localremote it was updating the pointer file in the local repository, not the remote. Also, fixes drop ../foo when run in a subdir, and probably lots of other problems. Test suite drops from ~30 to 11 failures now. TopFilePath is used to force thinking about what the filepath is relative to. The data stored in the sqlite db is still just a plain string, and TopFilePath is a newtype, so there's no overhead involved in using it in DataBase.Keys.	2016-01-05 17:22:19 -04:00
Joey Hess	ec28151722	improve data type	2016-01-01 15:56:24 -04:00
Joey Hess	f7256842cc	wait for git lstree to exit	2016-01-01 15:51:29 -04:00
Joey Hess	9b99595473	only do scan when there's a branch, not in freshly created new repo	2016-01-01 15:16:16 -04:00
Joey Hess	f36f24197a	scan for unlocked files on init/upgrade of v6 repo	2016-01-01 15:09:42 -04:00
Joey Hess	c21567dfd3	typo	2015-12-24 13:06:03 -04:00
Joey Hess	4224fae71f	optimise read and write for Keys database (untested) Writes are optimised by queueing up multiple writes when possible. The queue is flushed after the Annex monad action finishes. That makes it happen on program termination, and also whenever a nested Annex monad action finishes. Reads are optimised by checking once (per AnnexState) if the database exists. If the database doesn't exist yet, all reads return mempty. Reads also cause queued writes to be flushed, so reads will always be consistent with writes (as long as they're made inside the same Annex monad). A future optimisation path would be to determine when that's not necessary, which is probably most of the time, and avoid flushing unncessarily. Design notes for this commit: - separate reads from writes - reuse a handle which is left open until program exit or until the MVar goes out of scope (and autoclosed then) - writes are queued - queue is flushed periodically - immediate queue flush before any read - auto-flush queue when database handle is garbage collected - flush queue on exit from Annex monad (Note that this may happen repeatedly for a single database connection; or a connection may be reused for multiple Annex monad actions, possibly even concurrent ones.) - if database does not exist (or is empty) the handle is not opened by reads; reads instead return empty results - writes open the handle if it was not open previously	2015-12-23 19:18:52 -04:00
Joey Hess	6d38f54db4	split out Database.Queue from Database.Handle Fsck can use the queue for efficiency since it is write-heavy, and only reads a value before writing it. But, the queue is not suited to the Keys database.	2015-12-23 14:59:58 -04:00
Joey Hess	38a23928e9	temporarily remove cached keys database connection The problem is that shutdown is not always called, particularly in the test suite. So, a database connection would be opened, possibly some changes queued, and then not shut down. One way this can happen is when using Annex.eval or Annex.run with a new state. A better fix might be to make both of them call Keys.shutdown (and be sure to do it even if the annex action threw an error). Complication: Sometimes they're run reusing an existing state, so shutting down a database connection could cause problems for other users of that same state. I think this would need a MVar holding the database handle, so it could be emptied once shut down, and another user of the database connection could then start up a new one if it got shut down. But, what if 2 threads were concurrently using the same database handle and one shut it down while the other was writing to it? Urgh. Might have to go that route eventually to get the database access to run fast enough. For now, a quick fix to get the test suite happier, at the expense of speed.	2015-12-16 14:05:26 -04:00
Joey Hess	1a051f4300	comment	2015-12-16 13:24:45 -04:00
Joey Hess	0a7a2dae4e	add getAssociatedKey I guess this is just as efficient as the getAssociatedFiles query, but I have not tried to optimise the database yet.	2015-12-15 13:05:23 -04:00
Joey Hess	ce73a96e4e	use InodeCache when dropping a key to see if a pointer file can be safely reset The Keys database can hold multiple inode caches for a given key. One for the annex object, and one for each pointer file, which may not be hard linked to it. Inode caches for a key are recorded when its content is added to the annex, but only if it has known pointer files. This is to avoid the overhead of maintaining the database when not needed. When the smudge filter outputs a file's content, the inode cache is not updated, because git's smudge interface doesn't let us write the file. So, dropping will fall back to doing an expensive verification then. Ideally, git's interface would be improved, and then the inode cache could be updated then too.	2015-12-09 17:54:54 -04:00
Joey Hess	5e8c628d2e	add inode cache to the db Renamed the db to keys, since it is various info about a Keys. Dropping a key will update its pointer files, as long as their content can be verified to be unmodified. This falls back to checksum verification, but I want it to use an InodeCache of the key, for speed. But, I have not made anything populate that cache yet.	2015-12-09 17:00:37 -04:00

1 2

94 commits