git-annex

Author	SHA1	Message	Date
Joey Hess	737e45156e	remove 163 lines of code without changing anything except imports	2016-01-20 16:36:33 -04:00
Joey Hess	927e1a067e	fix import warnings	2016-01-14 10:30:54 -04:00
Joey Hess	fd3d866dec	another fix for old ghc	2016-01-13 12:32:57 -04:00
Joey Hess	423fffcd41	change keys database to use IKey type with more efficient serialization This breaks any existing keys database! IKey serializes more efficiently than SKey, although this limits the use of its Read/Show instances. This makes the keys database use less disk space, and so should be a win. Updated benchmark: benchmarking keys database/getAssociatedFiles from 1000 (hit) time 64.04 μs (63.95 μs .. 64.13 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 64.02 μs (63.96 μs .. 64.08 μs) std dev 218.2 ns (172.5 ns .. 299.3 ns) benchmarking keys database/getAssociatedFiles from 1000 (miss) time 52.53 μs (52.18 μs .. 53.21 μs) 0.999 R² (0.998 R² .. 1.000 R²) mean 52.31 μs (52.18 μs .. 52.91 μs) std dev 734.6 ns (206.2 ns .. 1.623 μs) benchmarking keys database/getAssociatedKey from 1000 (hit) time 64.60 μs (64.46 μs .. 64.77 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 64.74 μs (64.57 μs .. 65.20 μs) std dev 900.2 ns (389.7 ns .. 1.733 μs) benchmarking keys database/getAssociatedKey from 1000 (miss) time 52.46 μs (52.29 μs .. 52.68 μs) 1.000 R² (0.999 R² .. 1.000 R²) mean 52.63 μs (52.35 μs .. 53.37 μs) std dev 1.362 μs (562.7 ns .. 2.608 μs) variance introduced by outliers: 24% (moderately inflated) benchmarking keys database/addAssociatedFile to 1000 (old) time 487.3 μs (484.7 μs .. 490.1 μs) 1.000 R² (0.999 R² .. 1.000 R²) mean 490.9 μs (487.8 μs .. 496.5 μs) std dev 13.95 μs (6.841 μs .. 22.03 μs) variance introduced by outliers: 20% (moderately inflated) benchmarking keys database/addAssociatedFile to 1000 (new) time 6.633 ms (5.741 ms .. 7.751 ms) 0.905 R² (0.850 R² .. 0.965 R²) mean 8.252 ms (7.803 ms .. 8.602 ms) std dev 1.126 ms (900.3 μs .. 1.430 ms) variance introduced by outliers: 72% (severely inflated) benchmarking keys database/getAssociatedFiles from 10000 (hit) time 65.36 μs (64.71 μs .. 66.37 μs) 0.998 R² (0.995 R² .. 1.000 R²) mean 65.28 μs (64.72 μs .. 66.45 μs) std dev 2.576 μs (920.8 ns .. 4.122 μs) variance introduced by outliers: 42% (moderately inflated) benchmarking keys database/getAssociatedFiles from 10000 (miss) time 52.34 μs (52.25 μs .. 52.45 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 52.49 μs (52.42 μs .. 52.59 μs) std dev 255.4 ns (205.8 ns .. 312.9 ns) benchmarking keys database/getAssociatedKey from 10000 (hit) time 64.76 μs (64.67 μs .. 64.84 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 64.67 μs (64.62 μs .. 64.72 μs) std dev 177.3 ns (148.1 ns .. 217.1 ns) benchmarking keys database/getAssociatedKey from 10000 (miss) time 52.75 μs (52.66 μs .. 52.82 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 52.69 μs (52.63 μs .. 52.75 μs) std dev 210.6 ns (173.7 ns .. 265.9 ns) benchmarking keys database/addAssociatedFile to 10000 (old) time 489.7 μs (488.7 μs .. 490.7 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 490.4 μs (489.6 μs .. 492.2 μs) std dev 3.990 μs (2.435 μs .. 7.604 μs) benchmarking keys database/addAssociatedFile to 10000 (new) time 9.994 ms (9.186 ms .. 10.74 ms) 0.959 R² (0.928 R² .. 0.979 R²) mean 9.906 ms (9.343 ms .. 10.40 ms) std dev 1.384 ms (1.051 ms .. 2.100 ms) variance introduced by outliers: 69% (severely inflated)	2016-01-12 14:01:50 -04:00
Joey Hess	75f61df323	cleanup	2016-01-12 13:31:13 -04:00
Joey Hess	ca2a527e93	add FileKeyIndex to Keys db to optimize getAssociatedKey This is a schema change so will break any existing keys databases. But, it's not been released yet, so I'm still able to make such changes. This speeds up the benchmark quite nicely: benchmarking keys database/getAssociatedKey from 1000 (hit) time 91.65 μs (91.48 μs .. 91.81 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 91.78 μs (91.66 μs .. 91.94 μs) std dev 468.3 ns (353.1 ns .. 624.3 ns) benchmarking keys database/getAssociatedKey from 1000 (miss) time 53.33 μs (53.23 μs .. 53.40 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 53.43 μs (53.36 μs .. 53.53 μs) std dev 274.2 ns (211.7 ns .. 361.5 ns) benchmarking keys database/getAssociatedKey from 10000 (hit) time 92.99 μs (92.74 μs .. 93.27 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 92.90 μs (92.76 μs .. 93.16 μs) std dev 608.7 ns (404.1 ns .. 963.5 ns) benchmarking keys database/getAssociatedKey from 10000 (miss) time 53.12 μs (52.91 μs .. 53.39 μs) 1.000 R² (0.999 R² .. 1.000 R²) mean 52.84 μs (52.68 μs .. 53.16 μs) std dev 715.4 ns (400.4 ns .. 1.370 μs)	2016-01-12 13:07:14 -04:00
Joey Hess	f9c5aa84e0	add database benchmark The benchmark shows that the database access is quite fast indeed! And, it scales linearly to the number of keys, with one exception, getAssociatedKey. Based on this benchmark, I don't think I need worry about optimising for cases where all files are locked and the database is mostly empty. In those cases, database access will be misses, and according to this benchmark, should add only 50 milliseconds to runtime. (NB: There may be some overhead to getting the database opened and locking the handle that this benchmark doesn't see.) joey@darkstar:~/src/git-annex>./git-annex benchmark setting up database with 1000 setting up database with 10000 benchmarking keys database/getAssociatedFiles from 1000 (hit) time 62.77 μs (62.70 μs .. 62.85 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 62.81 μs (62.76 μs .. 62.88 μs) std dev 201.6 ns (157.5 ns .. 259.5 ns) benchmarking keys database/getAssociatedFiles from 1000 (miss) time 50.02 μs (49.97 μs .. 50.07 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 50.09 μs (50.04 μs .. 50.17 μs) std dev 206.7 ns (133.8 ns .. 295.3 ns) benchmarking keys database/getAssociatedKey from 1000 (hit) time 211.2 μs (210.5 μs .. 212.3 μs) 1.000 R² (0.999 R² .. 1.000 R²) mean 211.0 μs (210.7 μs .. 212.0 μs) std dev 1.685 μs (334.4 ns .. 3.517 μs) benchmarking keys database/getAssociatedKey from 1000 (miss) time 173.5 μs (172.7 μs .. 174.2 μs) 1.000 R² (0.999 R² .. 1.000 R²) mean 173.7 μs (173.0 μs .. 175.5 μs) std dev 3.833 μs (1.858 μs .. 6.617 μs) variance introduced by outliers: 16% (moderately inflated) benchmarking keys database/getAssociatedFiles from 10000 (hit) time 64.01 μs (63.84 μs .. 64.18 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 64.85 μs (64.34 μs .. 66.02 μs) std dev 2.433 μs (547.6 ns .. 4.652 μs) variance introduced by outliers: 40% (moderately inflated) benchmarking keys database/getAssociatedFiles from 10000 (miss) time 50.33 μs (50.28 μs .. 50.39 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 50.32 μs (50.26 μs .. 50.38 μs) std dev 202.7 ns (167.6 ns .. 252.0 ns) benchmarking keys database/getAssociatedKey from 10000 (hit) time 1.142 ms (1.139 ms .. 1.146 ms) 1.000 R² (1.000 R² .. 1.000 R²) mean 1.142 ms (1.140 ms .. 1.144 ms) std dev 7.142 μs (4.994 μs .. 10.98 μs) benchmarking keys database/getAssociatedKey from 10000 (miss) time 1.094 ms (1.092 ms .. 1.096 ms) 1.000 R² (1.000 R² .. 1.000 R²) mean 1.095 ms (1.095 ms .. 1.097 ms) std dev 4.277 μs (2.591 μs .. 7.228 μs)	2016-01-12 13:07:03 -04:00
Joey Hess	8111eb21e6	split out raw sql interface	2016-01-11 15:52:11 -04:00
Joey Hess	b1a1b40a15	fix inverted logic in old associated files cleanup	2016-01-07 15:54:10 -04:00
Joey Hess	aa4f353e5d	clarify absPathFrom The repo path is typically relative, not absolute, so providing it to absPathFrom doesn't yield an absolute path. This is not a bug, just unclear documentation. Indeed, there seem to be no reason to simplifyPath here, which absPathFrom does, so instead just combine the repo path and the TopFilePath. Also, removed an export of the TopFilePath constructor; asTopFilePath is provided to construct one as-is.	2016-01-05 17:33:48 -04:00
Joey Hess	b3d60ca285	use TopFilePath for associated files Fixes several bugs with updates of pointer files. When eg, running git annex drop --from localremote it was updating the pointer file in the local repository, not the remote. Also, fixes drop ../foo when run in a subdir, and probably lots of other problems. Test suite drops from ~30 to 11 failures now. TopFilePath is used to force thinking about what the filepath is relative to. The data stored in the sqlite db is still just a plain string, and TopFilePath is a newtype, so there's no overhead involved in using it in DataBase.Keys.	2016-01-05 17:22:19 -04:00
Joey Hess	ec28151722	improve data type	2016-01-01 15:56:24 -04:00
Joey Hess	f7256842cc	wait for git lstree to exit	2016-01-01 15:51:29 -04:00
Joey Hess	9b99595473	only do scan when there's a branch, not in freshly created new repo	2016-01-01 15:16:16 -04:00
Joey Hess	f36f24197a	scan for unlocked files on init/upgrade of v6 repo	2016-01-01 15:09:42 -04:00
Joey Hess	bcdc6db2c3	fix build with pre-AMP ghc	2015-12-28 17:21:26 -04:00
Joey Hess	b61575516b	fix build with pre-AMP GHC	2015-12-28 12:41:47 -04:00
Joey Hess	9d3474ef1b	unused import	2015-12-24 13:07:42 -04:00
Joey Hess	c21567dfd3	typo	2015-12-24 13:06:03 -04:00
Joey Hess	4224fae71f	optimise read and write for Keys database (untested) Writes are optimised by queueing up multiple writes when possible. The queue is flushed after the Annex monad action finishes. That makes it happen on program termination, and also whenever a nested Annex monad action finishes. Reads are optimised by checking once (per AnnexState) if the database exists. If the database doesn't exist yet, all reads return mempty. Reads also cause queued writes to be flushed, so reads will always be consistent with writes (as long as they're made inside the same Annex monad). A future optimisation path would be to determine when that's not necessary, which is probably most of the time, and avoid flushing unncessarily. Design notes for this commit: - separate reads from writes - reuse a handle which is left open until program exit or until the MVar goes out of scope (and autoclosed then) - writes are queued - queue is flushed periodically - immediate queue flush before any read - auto-flush queue when database handle is garbage collected - flush queue on exit from Annex monad (Note that this may happen repeatedly for a single database connection; or a connection may be reused for multiple Annex monad actions, possibly even concurrent ones.) - if database does not exist (or is empty) the handle is not opened by reads; reads instead return empty results - writes open the handle if it was not open previously	2015-12-23 19:18:52 -04:00
Joey Hess	959b060e26	allow flushDbQueue to be run repeatedly	2015-12-23 16:36:08 -04:00
Joey Hess	d43ac8056b	auto-close database connections when MVar is GCed	2015-12-23 16:11:36 -04:00
Joey Hess	6d38f54db4	split out Database.Queue from Database.Handle Fsck can use the queue for efficiency since it is write-heavy, and only reads a value before writing it. But, the queue is not suited to the Keys database.	2015-12-23 14:59:58 -04:00
Joey Hess	38a23928e9	temporarily remove cached keys database connection The problem is that shutdown is not always called, particularly in the test suite. So, a database connection would be opened, possibly some changes queued, and then not shut down. One way this can happen is when using Annex.eval or Annex.run with a new state. A better fix might be to make both of them call Keys.shutdown (and be sure to do it even if the annex action threw an error). Complication: Sometimes they're run reusing an existing state, so shutting down a database connection could cause problems for other users of that same state. I think this would need a MVar holding the database handle, so it could be emptied once shut down, and another user of the database connection could then start up a new one if it got shut down. But, what if 2 threads were concurrently using the same database handle and one shut it down while the other was writing to it? Urgh. Might have to go that route eventually to get the database access to run fast enough. For now, a quick fix to get the test suite happier, at the expense of speed.	2015-12-16 14:05:26 -04:00
Joey Hess	622da992f8	reorder database shutdown to be concurrency safe If a DbHandle is in use by another thread, it could be queueing changes while shutdown is running. So, wait for the worker to finish before flushing the queue, so that any last-minute writes are included. Before this fix, they would be silently dropped. Of course, if the other thread continues to try to use a DbHandle once it's closed, it will block forever as the worker is no longer reading from the jobs MVar. So, that would crash with "thread blocked indefinitely in an MVar operation".	2015-12-16 13:52:43 -04:00
Joey Hess	1a051f4300	comment	2015-12-16 13:24:45 -04:00
Joey Hess	0a7a2dae4e	add getAssociatedKey I guess this is just as efficient as the getAssociatedFiles query, but I have not tried to optimise the database yet.	2015-12-15 13:05:23 -04:00
Joey Hess	ce73a96e4e	use InodeCache when dropping a key to see if a pointer file can be safely reset The Keys database can hold multiple inode caches for a given key. One for the annex object, and one for each pointer file, which may not be hard linked to it. Inode caches for a key are recorded when its content is added to the annex, but only if it has known pointer files. This is to avoid the overhead of maintaining the database when not needed. When the smudge filter outputs a file's content, the inode cache is not updated, because git's smudge interface doesn't let us write the file. So, dropping will fall back to doing an expensive verification then. Ideally, git's interface would be improved, and then the inode cache could be updated then too.	2015-12-09 17:54:54 -04:00
Joey Hess	5e8c628d2e	add inode cache to the db Renamed the db to keys, since it is various info about a Keys. Dropping a key will update its pointer files, as long as their content can be verified to be unmodified. This falls back to checksum verification, but I want it to use an InodeCache of the key, for speed. But, I have not made anything populate that cache yet.	2015-12-09 17:00:37 -04:00
Joey Hess	05b598a057	stash DbHandle in Annex state	2015-12-09 14:55:47 -04:00
Joey Hess	a6e5ee0d0e	associated files database	2015-12-07 14:35:37 -04:00
Joey Hess	5072c62932	avoid ugly error about MVar if the sqlite worker thread crashes	2015-10-12 13:00:22 -04:00
Joey Hess	4ed82e5328	fsck: Work around bug in persistent that broke display of problematically encoded filenames on stderr when using --incremental.	2015-09-09 17:02:00 -04:00
Joey Hess	bc4129cc77	fsck: Commit incremental fsck database after every 1000 files fscked, or every 5 minutes, whichever comes first. Previously, commits were made every 1000 files fscked. Also, improve docs	2015-07-31 16:42:15 -04:00
Joey Hess	ecb0d5c087	use lock pools throughout git-annex The one exception is in Utility.Daemon. As long as a process only daemonizes once, which seems reasonable, and as long as it avoids calling checkDaemon once it's already running as a daemon, the fcntl locking gotchas won't be a problem there. Annex.LockFile has it's own separate lock pool layer, which has been renamed to LockCache. This is a persistent cache of locks that persist until closed. This is not quite done; lockContent stil needs to be converted.	2015-05-19 14:09:52 -04:00
Joey Hess	ec267aa1ea	rejigger imports for clean build with ghc 7.10's AMP changes The explict import Prelude after import Control.Applicative is a trick to avoid a warning.	2015-05-10 16:20:30 -04:00
Joey Hess	addc82dab7	removed all uses of undefined from code base It's a code smell, can lead to hard to diagnose error messages.	2015-04-19 00:38:29 -04:00
Joey Hess	5d974b26fc	generated TH uses forall	2015-02-22 16:57:19 -04:00
Joey Hess	e143d5e7d1	avoid closing db handle when reconnecting to do a write	2015-02-22 14:21:39 -04:00
Joey Hess	bf80a16c2e	complete work around for sqlite SELECT ErrorBusy on new connection bug	2015-02-22 14:08:26 -04:00
Joey Hess	b541a5e38b	WIP	2015-02-18 17:46:58 -04:00
Joey Hess	a01285ff33	more extensions needed by newer version of persistent	2015-02-18 17:30:07 -04:00
Joey Hess	80683871ee	deal with rare SELECT ErrorBusy failures I think they might be a sqlite bug. In discussions with sqlite devs.	2015-02-18 16:56:52 -04:00
Joey Hess	af254615b2	use WAL mode to ensure read from db always works, even when it's being written to Also, moved the database to a subdir, as there are multiple files. This seems to work well with concurrent fscks, although they still do redundant work due to the commit granularity. Occasionally two writes will conflict, and one is then deferred and happens later. Except, with 3 concurrent fscks, I got failures: git-annex: user error (SQLite3 returned ErrorBusy while attempting to perform prepare "SELECT \"fscked\".\"key\"\nFROM \"fscked\"\nWHERE \"fscked\".\"key\" = ?\n": database is locked) Argh!!!	2015-02-18 15:54:24 -04:00
Joey Hess	17cb219231	more robust handling of deferred commits Still not robust enough. I have 3 fscks running concurrently, and am seeing: ("commit deferred",user error (SQLite3 returned ErrorBusy while attempting to perform step.)) and git-annex: user error (SQLite3 returned ErrorBusy while attempting to perform prepare "SELECT \"fscked\".\"key\"\nFROM \"fscked\"\nWHERE \"fscked\".\"key\" = ?\n": database is locked)	2015-02-18 14:11:27 -04:00
Joey Hess	3414229354	fsck: Multiple incremental fscks of different repos (some remote) can now be in progress at the same time in the same repo without it getting confused about which files have been checked for which remotes.	2015-02-17 17:08:11 -04:00
Joey Hess	a3370ac459	allow for concurrent incremental fsck processes again (sorta) Sqlite doesn't support multiple concurrent writers at all. One of them will fail to write. It's not even possible to have two processes building up separate transactions at the same time. Before using sqlite, incremental fsck could work perfectly well with multiple fsck processes running concurrently. I'd like to keep that working. My partial solution, so far, is to make git-annex buffer writes, and every so often send them all to sqlite at once, in a transaction. So most of the time, nothing is writing to the database. (And if it gets unlucky and a write fails due to a collision with another writer, it can just wait and retry the write later.) This lets multiple processes write to the database successfully. But, for the purposes of concurrent, incremental fsck, it's not ideal. Each process doesn't immediately learn of files that another process has checked. So they'll tend to do redundant work. Only way I can see to improve this is to use some other mechanism for short-term IPC between the fsck processes. Not yet done. ---- Also, make addDb check if an item is in the database already, and not try to re-add it. That fixes an intermittent crash with "SQLite3 returned ErrorConstraint while attempting to perform step." I am not 100% sure why; it only started happening when I moved write buffering into the queue. It seemed to generally happen on the same file each time, so could just be due to multiple files having the same key. However, I doubt my sound repo has many duplicate keys, and I suspect something else is going on. ---- Updated benchmark, with the 1000 item queue: 6m33.808s	2015-02-17 16:56:12 -04:00
Joey Hess	afb3e3e472	avoid crash when starting fsck --incremental when one is already running Turns out sqlite does not like having its database deleted out from underneath it. It might suffice to empty the table, but I would rather start each fsck over with a new database, so I added a lock file, and running incremental fscks use a shared lock. This leaves one concurrency bug left; running two concurrent fsck --more will lead to: "SQLite3 returned ErrorBusy while attempting to perform step." and one or both will fail. This is a concurrent writers problem.	2015-02-17 13:30:24 -04:00
Joey Hess	ea76d04e15	show error when sqlite crashes worker thread Better than "blocked indefinitely in MVar"..	2015-02-17 13:03:57 -04:00
Joey Hess	99a1287f4f	avoid fromIntegral overhead	2015-02-16 17:22:00 -04:00

1 2

53 commits