git-annex

Author	SHA1	Message	Date
Joey Hess	c633144d28	remove empty directories when removing from export The subtle part of this is what happens when the remote fails to remove an empty directory. The removal from the export needs to fail in that case, so the removal will be tried again later. However, removeExportLocation has already been run and changed the export db, so if the next run checks getExportLocation, it might decide nothing remains to be done, leaving the empty directory. Dealt with that by making removeEmptyDirectories handle a failure by calling addExportLocation, reverting the database changes so the next run will be guaranteed to try deleting the empty directory again. This commit was sponsored by Thomas Hochstein on Patreon.	2017-09-15 15:22:53 -04:00
Joey Hess	e223cf568f	add table to keep track of what subdirectories are populated in the export So empty subdirectories can be identified and removed. This commit was sponsored by Jochen Bartl on Patreon.	2017-09-15 14:35:22 -04:00
Joey Hess	6ab14710fc	fix consistency bug reading from export database The export database has writes made to it and then expects to read back the same data immediately. But, the way that Database.Handle does writes, in order to support multiple writers, makes that not work, due to caching issues. This resulted in export re-uploading files it had already successfully renamed into place. Fixed by allowing databases to be opened in MultiWriter or SingleWriter mode. The export database only needs to support a single writer; it does not make sense for multiple exports to run at the same time to the same special remote. All other databases still use MultiWriter mode. And by inspection, nothing else in git-annex seems to be relying on being able to immediately query for changes that were just written to the database. This commit was supported by the NSF-funded DataLad project.	2017-09-06 17:19:07 -04:00
Joey Hess	4da763439b	use export db to correctly handle duplicate files Removed incorrect UniqueKey key in db schema; a key can appear multiple times with different files. The database has to be flushed after each removal. But when adding files to the export, lots of changes are able to be queued up w/o flushing. So it's still fairly efficient. If large removals of files from exports are too slow, an alternative would be to make two passes over the diff, one pass queueing deletions from the database, then a flush and then a second pass updating the location log. But that would use more memory, and need to look up exportKey twice per removed file, so I've avoided such optimisation so far. This commit was supported by the NSF-funded DataLad project.	2017-09-04 14:39:32 -04:00
Joey Hess	2c90ed1fea	flush queued changes to export db on exit	2017-09-04 14:00:54 -04:00
Joey Hess	7eb9889bfd	track exported files in a sqlite database Went with a separate db per export remote, rather than a single export database. Mostly because there will probably not be a lot of separate export remotes, and it might be convenient to be able to delete a given remote's export database. This commit was supported by the NSF-funded DataLad project.	2017-09-04 13:53:08 -04:00
Joey Hess	ca0daa8bb8	factor non-type stuff out of Key	2017-02-24 13:42:30 -04:00
Joey Hess	3b22ad9f47	Work around sqlite's incorrect handling of umask when creating databases. Refactored some common code into initDb. This only deals with the problem when creating new databases. If a repo got bad permissions into it, it's up to the user to deal with it. This commit was sponsored by Ole-Morten Duesund on Patreon.	2017-02-13 17:39:16 -04:00
Joey Hess	23d71423e1	work around ghc segfault hSetEncoding of a closed handle segfaults. https://ghc.haskell.org/trac/ghc/ticket/7161 `8484c0c197` introduced the crash. In particular, stdin may get closed (by eg, getContents) and then trying to set its encoding will crash. We didn't need to adjust stdin's encoding anyway, but only stderr, to work around https://github.com/yesodweb/persistent/issues/474 Thanks to Mesar Hameed for assistance related to reproducing this bug.	2016-12-30 18:14:19 -04:00
Joey Hess	8484c0c197	Always use filesystem encoding for all file and handle reads and writes. This is a big scary change. I have convinced myself it should be safe. I hope!	2016-12-24 14:46:31 -04:00
Joey Hess	0a4479b8ec	Avoid backtraces on expected failures when built with ghc 8; only use backtraces for unexpected errors. ghc 8 added backtraces on uncaught errors. This is great, but git-annex was using error in many places for an error message targeted at the user, in some known problem case. A backtrace only confuses such a message, so omit it. Notably, commands like git annex drop that failed due to eg, numcopies, used to use error, so had a backtrace. This commit was sponsored by Ethan Aubin.	2016-11-15 21:29:54 -04:00
Joey Hess	206dab8b44	remove unused imports	2016-10-18 16:16:47 -04:00
Joey Hess	148bd0dbfd	refactor	2016-10-17 14:58:33 -04:00
Joey Hess	e34046de38	slightly more efficient checking of versionUsesKeysDatabase It's a mvar lookup either way, but I think this way will be slightly more efficient. And it reduces the number of places where it's checked to 1.	2016-07-19 14:02:49 -04:00
Joey Hess	2619019630	Avoid any access to keys database in v5 mode repositories, which are not supposed to use that database.	2016-07-19 12:12:19 -04:00
Joey Hess	5f0b551c0c	assistant: Fix race in v6 mode that caused downloaded file content to sometimes not replace pointer files. The keys database handle needs to be closed after merging, because the smudge filter, in another process, updates the database. Old cached info can be read for a while from the open database handle; closing it ensures that the info written by the smudge filter is available. This is pretty horribly ad-hoc, and it's especially nasty that the transferrer closes the database every time.	2016-05-16 14:49:12 -04:00
Joey Hess	d05a75e45a	fix bug in unlocked file scanner that skipped over executable unlocked files	2016-04-14 13:07:46 -04:00
Joey Hess	6702f5a9c6	fix typo	2016-02-14 19:31:40 -04:00
Joey Hess	49215d68ae	devblog	2016-02-14 18:01:35 -04:00
Joey Hess	cf260d9a15	Fix storing of filenames of v6 unlocked files when the filename is not representable in the current locale. This is a mostly backwards compatible change. I broke backwards compatibility in the case where a filename starts with double-quote. That seems likely to be very rare, and v6 unlocked files are a new feature anyway, and fsck needs to fix missing associated file mappings anyway. So, I decided that is good enough. The encoding used is to just show the String when it contains a problem character. While that adds some overhead to addAssociatedFile and removeAssociatedFile, those are not called very often. This approach has minimal decode overhead, because most filenames won't be encoded that way, and it only has to look for the leading double-quote to skip the expensive read. So, getAssociatedFiles remains fast. I did consider using ByteString instead, but getting a FilePath converted with all chars intact, even surrogates, is difficult, and it looks like instance PersistField ByteString uses Text, which I don't trust for problem encoded data. It would probably be slower too, and it would make the database less easy to inspect manually.	2016-02-14 16:37:25 -04:00
Joey Hess	9df13e73ae	if keys database cannot be opened due to permissions, ignore This lets readonly repos be used. If a repo is readonly, we can ignore the keys database, because nothing that we can do will change the state of the repo anyway.	2016-02-12 14:16:35 -04:00
Joey Hess	737e45156e	remove 163 lines of code without changing anything except imports	2016-01-20 16:36:33 -04:00
Joey Hess	927e1a067e	fix import warnings	2016-01-14 10:30:54 -04:00
Joey Hess	fd3d866dec	another fix for old ghc	2016-01-13 12:32:57 -04:00
Joey Hess	423fffcd41	change keys database to use IKey type with more efficient serialization This breaks any existing keys database! IKey serializes more efficiently than SKey, although this limits the use of its Read/Show instances. This makes the keys database use less disk space, and so should be a win. Updated benchmark: benchmarking keys database/getAssociatedFiles from 1000 (hit) time 64.04 μs (63.95 μs .. 64.13 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 64.02 μs (63.96 μs .. 64.08 μs) std dev 218.2 ns (172.5 ns .. 299.3 ns) benchmarking keys database/getAssociatedFiles from 1000 (miss) time 52.53 μs (52.18 μs .. 53.21 μs) 0.999 R² (0.998 R² .. 1.000 R²) mean 52.31 μs (52.18 μs .. 52.91 μs) std dev 734.6 ns (206.2 ns .. 1.623 μs) benchmarking keys database/getAssociatedKey from 1000 (hit) time 64.60 μs (64.46 μs .. 64.77 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 64.74 μs (64.57 μs .. 65.20 μs) std dev 900.2 ns (389.7 ns .. 1.733 μs) benchmarking keys database/getAssociatedKey from 1000 (miss) time 52.46 μs (52.29 μs .. 52.68 μs) 1.000 R² (0.999 R² .. 1.000 R²) mean 52.63 μs (52.35 μs .. 53.37 μs) std dev 1.362 μs (562.7 ns .. 2.608 μs) variance introduced by outliers: 24% (moderately inflated) benchmarking keys database/addAssociatedFile to 1000 (old) time 487.3 μs (484.7 μs .. 490.1 μs) 1.000 R² (0.999 R² .. 1.000 R²) mean 490.9 μs (487.8 μs .. 496.5 μs) std dev 13.95 μs (6.841 μs .. 22.03 μs) variance introduced by outliers: 20% (moderately inflated) benchmarking keys database/addAssociatedFile to 1000 (new) time 6.633 ms (5.741 ms .. 7.751 ms) 0.905 R² (0.850 R² .. 0.965 R²) mean 8.252 ms (7.803 ms .. 8.602 ms) std dev 1.126 ms (900.3 μs .. 1.430 ms) variance introduced by outliers: 72% (severely inflated) benchmarking keys database/getAssociatedFiles from 10000 (hit) time 65.36 μs (64.71 μs .. 66.37 μs) 0.998 R² (0.995 R² .. 1.000 R²) mean 65.28 μs (64.72 μs .. 66.45 μs) std dev 2.576 μs (920.8 ns .. 4.122 μs) variance introduced by outliers: 42% (moderately inflated) benchmarking keys database/getAssociatedFiles from 10000 (miss) time 52.34 μs (52.25 μs .. 52.45 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 52.49 μs (52.42 μs .. 52.59 μs) std dev 255.4 ns (205.8 ns .. 312.9 ns) benchmarking keys database/getAssociatedKey from 10000 (hit) time 64.76 μs (64.67 μs .. 64.84 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 64.67 μs (64.62 μs .. 64.72 μs) std dev 177.3 ns (148.1 ns .. 217.1 ns) benchmarking keys database/getAssociatedKey from 10000 (miss) time 52.75 μs (52.66 μs .. 52.82 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 52.69 μs (52.63 μs .. 52.75 μs) std dev 210.6 ns (173.7 ns .. 265.9 ns) benchmarking keys database/addAssociatedFile to 10000 (old) time 489.7 μs (488.7 μs .. 490.7 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 490.4 μs (489.6 μs .. 492.2 μs) std dev 3.990 μs (2.435 μs .. 7.604 μs) benchmarking keys database/addAssociatedFile to 10000 (new) time 9.994 ms (9.186 ms .. 10.74 ms) 0.959 R² (0.928 R² .. 0.979 R²) mean 9.906 ms (9.343 ms .. 10.40 ms) std dev 1.384 ms (1.051 ms .. 2.100 ms) variance introduced by outliers: 69% (severely inflated)	2016-01-12 14:01:50 -04:00
Joey Hess	75f61df323	cleanup	2016-01-12 13:31:13 -04:00
Joey Hess	ca2a527e93	add FileKeyIndex to Keys db to optimize getAssociatedKey This is a schema change so will break any existing keys databases. But, it's not been released yet, so I'm still able to make such changes. This speeds up the benchmark quite nicely: benchmarking keys database/getAssociatedKey from 1000 (hit) time 91.65 μs (91.48 μs .. 91.81 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 91.78 μs (91.66 μs .. 91.94 μs) std dev 468.3 ns (353.1 ns .. 624.3 ns) benchmarking keys database/getAssociatedKey from 1000 (miss) time 53.33 μs (53.23 μs .. 53.40 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 53.43 μs (53.36 μs .. 53.53 μs) std dev 274.2 ns (211.7 ns .. 361.5 ns) benchmarking keys database/getAssociatedKey from 10000 (hit) time 92.99 μs (92.74 μs .. 93.27 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 92.90 μs (92.76 μs .. 93.16 μs) std dev 608.7 ns (404.1 ns .. 963.5 ns) benchmarking keys database/getAssociatedKey from 10000 (miss) time 53.12 μs (52.91 μs .. 53.39 μs) 1.000 R² (0.999 R² .. 1.000 R²) mean 52.84 μs (52.68 μs .. 53.16 μs) std dev 715.4 ns (400.4 ns .. 1.370 μs)	2016-01-12 13:07:14 -04:00
Joey Hess	f9c5aa84e0	add database benchmark The benchmark shows that the database access is quite fast indeed! And, it scales linearly to the number of keys, with one exception, getAssociatedKey. Based on this benchmark, I don't think I need worry about optimising for cases where all files are locked and the database is mostly empty. In those cases, database access will be misses, and according to this benchmark, should add only 50 milliseconds to runtime. (NB: There may be some overhead to getting the database opened and locking the handle that this benchmark doesn't see.) joey@darkstar:~/src/git-annex>./git-annex benchmark setting up database with 1000 setting up database with 10000 benchmarking keys database/getAssociatedFiles from 1000 (hit) time 62.77 μs (62.70 μs .. 62.85 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 62.81 μs (62.76 μs .. 62.88 μs) std dev 201.6 ns (157.5 ns .. 259.5 ns) benchmarking keys database/getAssociatedFiles from 1000 (miss) time 50.02 μs (49.97 μs .. 50.07 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 50.09 μs (50.04 μs .. 50.17 μs) std dev 206.7 ns (133.8 ns .. 295.3 ns) benchmarking keys database/getAssociatedKey from 1000 (hit) time 211.2 μs (210.5 μs .. 212.3 μs) 1.000 R² (0.999 R² .. 1.000 R²) mean 211.0 μs (210.7 μs .. 212.0 μs) std dev 1.685 μs (334.4 ns .. 3.517 μs) benchmarking keys database/getAssociatedKey from 1000 (miss) time 173.5 μs (172.7 μs .. 174.2 μs) 1.000 R² (0.999 R² .. 1.000 R²) mean 173.7 μs (173.0 μs .. 175.5 μs) std dev 3.833 μs (1.858 μs .. 6.617 μs) variance introduced by outliers: 16% (moderately inflated) benchmarking keys database/getAssociatedFiles from 10000 (hit) time 64.01 μs (63.84 μs .. 64.18 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 64.85 μs (64.34 μs .. 66.02 μs) std dev 2.433 μs (547.6 ns .. 4.652 μs) variance introduced by outliers: 40% (moderately inflated) benchmarking keys database/getAssociatedFiles from 10000 (miss) time 50.33 μs (50.28 μs .. 50.39 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 50.32 μs (50.26 μs .. 50.38 μs) std dev 202.7 ns (167.6 ns .. 252.0 ns) benchmarking keys database/getAssociatedKey from 10000 (hit) time 1.142 ms (1.139 ms .. 1.146 ms) 1.000 R² (1.000 R² .. 1.000 R²) mean 1.142 ms (1.140 ms .. 1.144 ms) std dev 7.142 μs (4.994 μs .. 10.98 μs) benchmarking keys database/getAssociatedKey from 10000 (miss) time 1.094 ms (1.092 ms .. 1.096 ms) 1.000 R² (1.000 R² .. 1.000 R²) mean 1.095 ms (1.095 ms .. 1.097 ms) std dev 4.277 μs (2.591 μs .. 7.228 μs)	2016-01-12 13:07:03 -04:00
Joey Hess	8111eb21e6	split out raw sql interface	2016-01-11 15:52:11 -04:00
Joey Hess	b1a1b40a15	fix inverted logic in old associated files cleanup	2016-01-07 15:54:10 -04:00
Joey Hess	aa4f353e5d	clarify absPathFrom The repo path is typically relative, not absolute, so providing it to absPathFrom doesn't yield an absolute path. This is not a bug, just unclear documentation. Indeed, there seems to be no reason to simplifyPath here, which absPathFrom does, so instead just combine the repo path and the TopFilePath. Also, removed an export of the TopFilePath constructor; asTopFilePath is provided to construct one as-is.	2016-01-05 17:33:48 -04:00
Joey Hess	b3d60ca285	use TopFilePath for associated files Fixes several bugs with updates of pointer files. When eg, running git annex drop --from localremote it was updating the pointer file in the local repository, not the remote. Also, fixes drop ../foo when run in a subdir, and probably lots of other problems. Test suite drops from ~30 to 11 failures now. TopFilePath is used to force thinking about what the filepath is relative to. The data stored in the sqlite db is still just a plain string, and TopFilePath is a newtype, so there's no overhead involved in using it in Database.Keys.	2016-01-05 17:22:19 -04:00
Joey Hess	ec28151722	improve data type	2016-01-01 15:56:24 -04:00
Joey Hess	f7256842cc	wait for git ls-tree to exit	2016-01-01 15:51:29 -04:00
Joey Hess	9b99595473	only do scan when there's a branch, not in freshly created new repo	2016-01-01 15:16:16 -04:00
Joey Hess	f36f24197a	scan for unlocked files on init/upgrade of v6 repo	2016-01-01 15:09:42 -04:00
Joey Hess	bcdc6db2c3	fix build with pre-AMP ghc	2015-12-28 17:21:26 -04:00
Joey Hess	b61575516b	fix build with pre-AMP GHC	2015-12-28 12:41:47 -04:00
Joey Hess	9d3474ef1b	unused import	2015-12-24 13:07:42 -04:00
Joey Hess	c21567dfd3	typo	2015-12-24 13:06:03 -04:00
Joey Hess	4224fae71f	optimise read and write for Keys database (untested) Writes are optimised by queueing up multiple writes when possible. The queue is flushed after the Annex monad action finishes. That makes it happen on program termination, and also whenever a nested Annex monad action finishes. Reads are optimised by checking once (per AnnexState) if the database exists. If the database doesn't exist yet, all reads return mempty. Reads also cause queued writes to be flushed, so reads will always be consistent with writes (as long as they're made inside the same Annex monad). A future optimisation path would be to determine when that's not necessary, which is probably most of the time, and avoid flushing unnecessarily. Design notes for this commit: - separate reads from writes - reuse a handle which is left open until program exit or until the MVar goes out of scope (and autoclosed then) - writes are queued - queue is flushed periodically - immediate queue flush before any read - auto-flush queue when database handle is garbage collected - flush queue on exit from Annex monad (Note that this may happen repeatedly for a single database connection; or a connection may be reused for multiple Annex monad actions, possibly even concurrent ones.) - if database does not exist (or is empty) the handle is not opened by reads; reads instead return empty results - writes open the handle if it was not open previously	2015-12-23 19:18:52 -04:00
Joey Hess	959b060e26	allow flushDbQueue to be run repeatedly	2015-12-23 16:36:08 -04:00
Joey Hess	d43ac8056b	auto-close database connections when MVar is GCed	2015-12-23 16:11:36 -04:00
Joey Hess	6d38f54db4	split out Database.Queue from Database.Handle Fsck can use the queue for efficiency since it is write-heavy, and only reads a value before writing it. But, the queue is not suited to the Keys database.	2015-12-23 14:59:58 -04:00
Joey Hess	38a23928e9	temporarily remove cached keys database connection The problem is that shutdown is not always called, particularly in the test suite. So, a database connection would be opened, possibly some changes queued, and then not shut down. One way this can happen is when using Annex.eval or Annex.run with a new state. A better fix might be to make both of them call Keys.shutdown (and be sure to do it even if the annex action threw an error). Complication: Sometimes they're run reusing an existing state, so shutting down a database connection could cause problems for other users of that same state. I think this would need a MVar holding the database handle, so it could be emptied once shut down, and another user of the database connection could then start up a new one if it got shut down. But, what if 2 threads were concurrently using the same database handle and one shut it down while the other was writing to it? Urgh. Might have to go that route eventually to get the database access to run fast enough. For now, a quick fix to get the test suite happier, at the expense of speed.	2015-12-16 14:05:26 -04:00
Joey Hess	622da992f8	reorder database shutdown to be concurrency safe If a DbHandle is in use by another thread, it could be queueing changes while shutdown is running. So, wait for the worker to finish before flushing the queue, so that any last-minute writes are included. Before this fix, they would be silently dropped. Of course, if the other thread continues to try to use a DbHandle once it's closed, it will block forever as the worker is no longer reading from the jobs MVar. So, that would crash with "thread blocked indefinitely in an MVar operation".	2015-12-16 13:52:43 -04:00
Joey Hess	1a051f4300	comment	2015-12-16 13:24:45 -04:00
Joey Hess	0a7a2dae4e	add getAssociatedKey I guess this is just as efficient as the getAssociatedFiles query, but I have not tried to optimise the database yet.	2015-12-15 13:05:23 -04:00
Joey Hess	ce73a96e4e	use InodeCache when dropping a key to see if a pointer file can be safely reset The Keys database can hold multiple inode caches for a given key. One for the annex object, and one for each pointer file, which may not be hard linked to it. Inode caches for a key are recorded when its content is added to the annex, but only if it has known pointer files. This is to avoid the overhead of maintaining the database when not needed. When the smudge filter outputs a file's content, the inode cache is not updated, because git's smudge interface doesn't let us write the file. So, dropping will fall back to doing an expensive verification in that case. Ideally, git's interface would be improved, and then the inode cache could be updated too.	2015-12-09 17:54:54 -04:00
Joey Hess	5e8c628d2e	add inode cache to the db Renamed the db to keys, since it holds various info about Keys. Dropping a key will update its pointer files, as long as their content can be verified to be unmodified. This falls back to checksum verification, but I want it to use an InodeCache of the key, for speed. But, I have not made anything populate that cache yet.	2015-12-09 17:00:37 -04:00
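
The design notes in commit 4224fae71f (queued writes, a flush before any read, empty results when the database does not exist yet, and a handle that is opened only by writes) can be illustrated with a small sketch. This is not git-annex code, which is written in Haskell; it is a hypothetical Python analogue using the standard sqlite3 module. The class name QueuedDb, the table name "associated", and the method names are assumptions made for illustration only.

```python
import os
import sqlite3
import tempfile


class QueuedDb:
    """Hypothetical write-queueing wrapper around a sqlite database."""

    def __init__(self, path):
        self.path = path
        self.conn = None   # opened lazily; a read never creates the file
        self.queue = []    # pending (sql, params) writes

    def _open(self):
        if self.conn is None:
            self.conn = sqlite3.connect(self.path)
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS associated (key TEXT, file TEXT)")
        return self.conn

    def add_associated_file(self, key, file):
        # Writes are only queued; nothing touches the disk yet.
        self.queue.append(
            ("INSERT INTO associated (key, file) VALUES (?, ?)", (key, file)))

    def flush(self):
        # Safe to run repeatedly; an empty queue is a no-op.
        if not self.queue:
            return
        conn = self._open()
        with conn:  # commit the whole batch as one transaction
            for sql, params in self.queue:
                conn.execute(sql, params)
        self.queue.clear()

    def get_associated_files(self, key):
        # Flush before reading so a read always sees earlier writes.
        self.flush()
        if self.conn is None and not os.path.exists(self.path):
            return []  # no database yet: return an empty result
        rows = self._open().execute(
            "SELECT file FROM associated WHERE key = ? ORDER BY file", (key,))
        return [row[0] for row in rows]

    def close(self):
        # Flush any last queued writes before closing the handle.
        self.flush()
        if self.conn is not None:
            self.conn.close()
            self.conn = None
```

In this sketch a read on a fresh path returns an empty list without creating the database file, and a read issued after queued writes sees them because the queue is flushed first. It does not model the concurrency, garbage-collection, or Annex monad behaviour described in the commit messages.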


74 commits