git-annex

Author	SHA1	Message	Date
Joey Hess	8bde6101e3	sqlite datbase for importfeed importfeed: Use caching database to avoid needing to list urls on every run, and avoid using too much memory. Benchmarking in my podcasts repo, importfeed got 1.42 seconds faster, and memory use dropped from 203000k to 59408k. Database.ImportFeed is Database.ContentIdentifier with the serial number filed off. There is a bit of code duplication I would like to avoid, particularly recordAnnexBranchTree, and getAnnexBranchTree. But these use the persistent sqlite tables, so despite the code being the same, they cannot be factored out. Since this database includes the contentidentifier metadata, it will be slightly redundant if a sqlite database is ever added for metadata. I did consider making such a generic database and using it for this. But, that would then need importfeed to update both the url database and the metadata database, which is twice as much work diffing the git-annex branch trees. Or would entagle updating two databases in a complex way. So instead it seems better to optimise the database that importfeed needs, and if the metadata database is used by another command, use a little more disk space and do a little bit of redundant work to update it. Sponsored-by: unqueued on Patreon	2023-10-23 16:46:22 -04:00
Joey Hess	19d95c9bb8	enable TypeOperators more warnings about ~ needing it in a future ghc release	2023-08-02 09:47:42 -04:00
Joey Hess	cc36c8516a	Sped up sqlite inserts 2x when built with persistent 2.14.5.0 https://github.com/yesodweb/persistent/issues/1457 Sponsored-by: Dartmouth College's DANDI project	2023-03-31 14:38:25 -04:00
Yaroslav Halchenko	84b0a3707a	Apply codespell -w throughout	2023-03-17 15:14:58 -04:00
Joey Hess	cf892f4256	use insert_ for speed improvement persistent-2.14.4.1 makes insert_ faster than insert because it skips getting the key back. Sponsored-by: Dartmouth College's DANDI project	2022-12-26 15:59:41 -04:00
Joey Hess	c834d2025a	queue more changes to keys db Increasing the size of the queue 10x makes git-annex init 7% faster in a repository with 86000 annexed files. The memory use goes up, from 70876 kb to 85376 kb.	2022-11-18 13:29:34 -04:00
Joey Hess	8fcee4ac9d	Sped up the initial scanning for annexed files by 15% Avoids database querying overhead when the database is newly created. In the large repository where git-annex init took 24 seconds, this sped it up to 20.47 seconds, a speedup of around 15%. Sponsored-by: Dartmouth College's DANDI project	2022-11-18 13:16:57 -04:00
Yaroslav Halchenko	0151976676	Typo fix unncessary -> unnecessary. Detected while reading recent CHANGELOG entry but then decided to apply to entire codebase and docs since why not?	2022-08-20 09:40:19 -04:00
Joey Hess	c941ab6f5b	avoid double work in git-annex init, second try reconcileStaged populates the db, so scanAnnexedFiles does not need to do it again. It still makes a pass over the HEAD tree, but populating the db was most of the expensive part. Benchmarking with 100,000 files, git-annex init now takes 40 seconds, vs 37 seconds with the old, buggy version of this fix. It should be possible to win those 3 precious seconds per 100k files back, in the case when when annex.thin is not set, with improvements to reconcileStaged that avoid needing this second pass. Sponsored-by: Dartmouth College's Datalad project	2021-06-08 09:36:53 -04:00
Joey Hess	c831a562f5	faster associated file replacement with upsert Rather than first deleting and then inserting, upsert lets the key associated with a file be updated in place. Benchmarked with 100,000 files, and an empty keys database, running reconcileStaged. It improved from 47 seconds to 34 seconds. So this got reconcileStaged to be as fast as scanAssociatedFiles, or faster -- scanAssociatedFiles benchmarks at 37 seconds. (Also checked for other users of deleteWhere that could be sped up by upsert. There are a couple, but they are not in performance critical code paths, eg recordExportTreeCurrent is only run once per tree export.) I would have liked to rename FileKeyIndex to FileKeyUnique since it is being used as a uniqueness constraint now, not just to get an index. But, that gets converted into part of the SQL schema, and the name is used by the upsert, so it can't be changed. Sponsored-by: Dartmouth College's Datalad project	2021-06-08 07:53:36 -04:00
Joey Hess	eb6f6ff9b8	speed up keys database writes There seems to be no reason to check the time here. I think it was inherited from code in Database.Fsck, which does have a reason to commit every few minutes. Removing that syscall speeds up a git-annex init in a repo with 100000 annexed files by about 3 seconds. Sponsored-by: Dartmouth College's Datalad project	2021-05-31 15:01:00 -04:00
Joey Hess	675556fd9a	smudge: check for known annexed inodes before checking annex.largefiles smudge: Fix a case where an unlocked annexed file that annex.largefiles does not match could get its unchanged content checked into git, due to git running the smudge filter unecessarily. When the file has the same inodecache as an already annexed file, we can assume that the user is not intending to change how it's stored in git. Note that checkunchangedgitfile already handled the inverse case, where the file was added to git previously. That goes further and actually sha1 hashes the new file and checks if it's the same hash in the index. It would be possible to generate a key for the file and see if it's the same as the old key, however that could be considerably more expensive than sha1 of a small file is, and it is not necessary for the case I have, at least, where the file is not modified or touched, and so its inode will match the cache. git-annex add was changed, when adding a small file, to remove the inode cache for it. This is necessary to keep the recipe in doc/tips/largefiles.mdwn for converting from annex to git working. It also avoids bugs/case_where_using_pathspec_with_git-commit_leaves_s.mdwn which the earlier try at this change introduced.	2021-05-10 13:20:10 -04:00
Joey Hess	028d4517c7	enable extensions needed by new version of persistent Needed in order to use mkPersist in persistent version 2.11.0.1 persistent-template version 2.9.1.0	2020-11-07 14:09:17 -04:00
Joey Hess	029c883713	Merge branch 'master' into v8	2020-02-19 14:32:11 -04:00
Joey Hess	c9357bdc0e	ifdef persistent-template 2.8.0 fixes The i386ancient build has a ghc too old for these extensions. Build with persistent-template 2.8.0 tested.	2020-02-04 13:53:00 -04:00
Joey Hess	4920df6573	Fix build with newest version of persistent-template. This is untested because of rain, also I am operating from truncated copiler error messages in a bug report that also doesn't mention what the library version is. Still, it should work. May break builds with old ghc, in particular DerivingStrategies is I think fairly new? The pragmas could be ifdefed if necessary. Works with ghc 8.6.5.	2020-02-04 12:03:30 -04:00
Joey Hess	02e00fd7ab	Merge branch 'master' into sqlite	2019-12-19 16:33:42 -04:00
Joey Hess	535b153381	building again after merge Nice, several conversions fell out.	2019-12-18 15:02:46 -04:00
Joey Hess	d5628a16b8	Merge branch 'bs' into sqlite-bs	2019-12-18 14:51:03 -04:00
Joey Hess	bdec7fed9c	convert TopFilePath to use RawFilePath Adds a dependency on filepath-bytestring, an as yet unreleased fork of filepath that operates on RawFilePath. Git.Repo also changed to use RawFilePath for the path to the repo. This does eliminate some RawFilePath -> FilePath -> RawFilePath conversions. And filepath-bytestring's </> is probably faster. But I don't expect a major performance improvement from this. This is mostly groundwork for making Annex.Location use RawFilePath, which will allow for a conversion-free pipleline.	2019-12-09 15:07:21 -04:00
Joey Hess	4940a135af	eliminate raw sql LIKE query	2019-10-30 15:19:52 -04:00
Joey Hess	9085a2cfec	make sure all sqlite selects have indexes Bearing in mind that these indexes are really uniqueness constraints that just happen to also make sqlite generate indexes. In Database.ContentIndentifier, the ContentIndentifiersKeyRemoteCidIndex is fine as a uniqueness constraint because it contains all rows from the table. The ContentIndentifiersCidRemoteIndex is also ok because there can only be one key for a given (cid, uuid) combination. In Database.Export, the new ExportTreeFileKeyIndex is the same pair as the old ExportTreeKeyFileIndex (previously ExportTreeIndex). And in Database.Keys.SQL, the new InodeCacheKeyIndex is the same pair as the old KeyInodeCacheIndex.	2019-10-30 13:46:52 -04:00
Joey Hess	3732f27722	document indexes This was really confusing, though the code was ok. I think I now understand it fully again.	2019-10-30 13:28:00 -04:00
Joey Hess	f6cfb84dfe	rename row	2019-10-30 13:22:41 -04:00
Joey Hess	c35a9047d3	improve data types for sqlite This is a non-backwards compatable change, so not suitable for merging w/o a annex.version bump and transition code. Not yet tested. This improves performance of git-annex benchmark --databases across the board by 10-25%, since eg Key roundtrips as a ByteString. (serializeKey' produces a lazy ByteString, so there is still a copy involved in converting it to a strict ByteString. It may be faster to switch to using bytestring-strict-builder.) FilePath and Key are both stored as blobs. This avoids mojibake in some situations. It would be possible to use varchar instead, if persistent could avoid converting that to Text, but it seems there is no good way to do so. See doc/todo/sqlite_database_improvements.mdwn Eliminated some ugly artifacts of using Read/Show serialization; constructors and quoted strings are no longer stored in sqlite. Renamed SRef to SSha to reflect that it is only ever a git sha, not a ref name. Since it is limited to the characters in a sha, it is not affected by mojibake, so still uses String.	2019-10-29 17:05:36 -04:00
Joey Hess	0f7fd008d4	fix sql syntax	2019-10-24 11:57:17 -04:00
Joey Hess	94efc400e9	horrible impementation of isInodeKnown The only good thing about it is it does not require a major version bump to improve the database. That will need to happen at some point though. Potentially very very slow in a large repository. Ugly use of raw sql.	2019-10-23 14:37:29 -04:00
Joey Hess	2e6fd5de71	fix flipped diffUTCTime fsck --incremental/--more: Fix bug that prevented the incremental fsck information from being updated every 5 minutes as it was supposed to be; it was only updated after 1000 files were checked, which may be more files that are possible to fsck in a given fsck time window. Thanks to Peter Simons for help with analysis of this bug. Auditing for other cases of the same mistake, the keys db also had it backwards. This seems unlikely to really have been a problem; it would need associated files updates etc to be coming in slowly for some reason and then be interrupted to cause any problem. IIRC the design of the keys db assumes that any interruped operation will be restarted, and so it can lose any buffered database updates safely.	2019-10-03 09:54:19 -04:00
Joey Hess	018b5b8173	Support building with socks-0.6 and persistant-template-2.7 persistent-template now needs UndecidableInstances. socks changed defaultSocksConf to take a SockAddr.	2019-07-30 12:50:48 -04:00
Joey Hess	40ecf58d4b	update licenses from GPL to AGPL This does not change the overall license of the git-annex program, which was already AGPL due to a number of sources files being AGPL already. Legally speaking, I'm adding a new license under which these files are now available; I already released their current contents under the GPL license. Now they're dual licensed GPL and AGPL. However, I intend for all my future changes to these files to only be released under the AGPL license, and I won't be tracking the dual licensing status, so I'm simply changing the license statement to say it's AGPL. (In some cases, others wrote parts of the code of a file and released it under the GPL; but in all cases I have contributed a significant portion of the code in each file and it's that code that is getting the AGPL license; the GPL license of other contributors allows combining with AGPL code.)	2019-03-13 15:48:14 -04:00
Sean Parsons	ef5a735d47	Fixed a not equal condition in addAssociatedFile.	2018-11-06 22:50:00 +00:00
Sean Parsons	42bdc9fa2f	Removed Esqueleto as a dependency.	2018-11-06 22:18:55 +00:00
Joey Hess	148bd0dbfd	refactor	2016-10-17 14:58:33 -04:00
Joey Hess	cf260d9a15	Fix storing of filenames of v6 unlocked files when the filename is not representable in the current locale. This is a mostly backwards compatable change. I broke backwards compatability in the case where a filename starts with double-quote. That seems likely to be very rare, and v6 unlocked files are a new feature anyway, and fsck needs to fix missing associated file mappings anyway. So, I decided that is good enough. The encoding used is to just show the String when it contains a problem character. While that adds some overhead to addAssociatedFile and removeAssociatedFile, those are not called very often. This approach has minimal decode overhead, because most filenames won't be encoded that way, and it only has to look for the leading double-quote to skip the expensive read. So, getAssociatedFiles remains fast. I did consider using ByteString instead, but getting a FilePath converted with all chars intact, even surrigates, is difficult, and it looks like instance PersistField ByteString uses Text, which I don't trust for problem encoded data. It would probably be slower too, and it would make the database less easy to inspect manually.	2016-02-14 16:37:25 -04:00
Joey Hess	927e1a067e	fix import warnings	2016-01-14 10:30:54 -04:00
Joey Hess	fd3d866dec	another fix for old ghc	2016-01-13 12:32:57 -04:00
Joey Hess	423fffcd41	change keys database to use IKey type with more efficient serialization This breaks any existing keys database! IKey serializes more efficiently than SKey, although this limits the use of its Read/Show instances. This makes the keys database use less disk space, and so should be a win. Updated benchmark: benchmarking keys database/getAssociatedFiles from 1000 (hit) time 64.04 μs (63.95 μs .. 64.13 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 64.02 μs (63.96 μs .. 64.08 μs) std dev 218.2 ns (172.5 ns .. 299.3 ns) benchmarking keys database/getAssociatedFiles from 1000 (miss) time 52.53 μs (52.18 μs .. 53.21 μs) 0.999 R² (0.998 R² .. 1.000 R²) mean 52.31 μs (52.18 μs .. 52.91 μs) std dev 734.6 ns (206.2 ns .. 1.623 μs) benchmarking keys database/getAssociatedKey from 1000 (hit) time 64.60 μs (64.46 μs .. 64.77 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 64.74 μs (64.57 μs .. 65.20 μs) std dev 900.2 ns (389.7 ns .. 1.733 μs) benchmarking keys database/getAssociatedKey from 1000 (miss) time 52.46 μs (52.29 μs .. 52.68 μs) 1.000 R² (0.999 R² .. 1.000 R²) mean 52.63 μs (52.35 μs .. 53.37 μs) std dev 1.362 μs (562.7 ns .. 2.608 μs) variance introduced by outliers: 24% (moderately inflated) benchmarking keys database/addAssociatedFile to 1000 (old) time 487.3 μs (484.7 μs .. 490.1 μs) 1.000 R² (0.999 R² .. 1.000 R²) mean 490.9 μs (487.8 μs .. 496.5 μs) std dev 13.95 μs (6.841 μs .. 22.03 μs) variance introduced by outliers: 20% (moderately inflated) benchmarking keys database/addAssociatedFile to 1000 (new) time 6.633 ms (5.741 ms .. 7.751 ms) 0.905 R² (0.850 R² .. 0.965 R²) mean 8.252 ms (7.803 ms .. 8.602 ms) std dev 1.126 ms (900.3 μs .. 1.430 ms) variance introduced by outliers: 72% (severely inflated) benchmarking keys database/getAssociatedFiles from 10000 (hit) time 65.36 μs (64.71 μs .. 66.37 μs) 0.998 R² (0.995 R² .. 1.000 R²) mean 65.28 μs (64.72 μs .. 66.45 μs) std dev 2.576 μs (920.8 ns .. 4.122 μs) variance introduced by outliers: 42% (moderately inflated) benchmarking keys database/getAssociatedFiles from 10000 (miss) time 52.34 μs (52.25 μs .. 52.45 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 52.49 μs (52.42 μs .. 52.59 μs) std dev 255.4 ns (205.8 ns .. 312.9 ns) benchmarking keys database/getAssociatedKey from 10000 (hit) time 64.76 μs (64.67 μs .. 64.84 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 64.67 μs (64.62 μs .. 64.72 μs) std dev 177.3 ns (148.1 ns .. 217.1 ns) benchmarking keys database/getAssociatedKey from 10000 (miss) time 52.75 μs (52.66 μs .. 52.82 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 52.69 μs (52.63 μs .. 52.75 μs) std dev 210.6 ns (173.7 ns .. 265.9 ns) benchmarking keys database/addAssociatedFile to 10000 (old) time 489.7 μs (488.7 μs .. 490.7 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 490.4 μs (489.6 μs .. 492.2 μs) std dev 3.990 μs (2.435 μs .. 7.604 μs) benchmarking keys database/addAssociatedFile to 10000 (new) time 9.994 ms (9.186 ms .. 10.74 ms) 0.959 R² (0.928 R² .. 0.979 R²) mean 9.906 ms (9.343 ms .. 10.40 ms) std dev 1.384 ms (1.051 ms .. 2.100 ms) variance introduced by outliers: 69% (severely inflated)	2016-01-12 14:01:50 -04:00
Joey Hess	ca2a527e93	add FileKeyIndex to Keys db to optimize getAssociatedKey This is a schema change so will break any existing keys databases. But, it's not been released yet, so I'm still able to make such changes. This speeds up the benchmark quite nicely: benchmarking keys database/getAssociatedKey from 1000 (hit) time 91.65 μs (91.48 μs .. 91.81 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 91.78 μs (91.66 μs .. 91.94 μs) std dev 468.3 ns (353.1 ns .. 624.3 ns) benchmarking keys database/getAssociatedKey from 1000 (miss) time 53.33 μs (53.23 μs .. 53.40 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 53.43 μs (53.36 μs .. 53.53 μs) std dev 274.2 ns (211.7 ns .. 361.5 ns) benchmarking keys database/getAssociatedKey from 10000 (hit) time 92.99 μs (92.74 μs .. 93.27 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 92.90 μs (92.76 μs .. 93.16 μs) std dev 608.7 ns (404.1 ns .. 963.5 ns) benchmarking keys database/getAssociatedKey from 10000 (miss) time 53.12 μs (52.91 μs .. 53.39 μs) 1.000 R² (0.999 R² .. 1.000 R²) mean 52.84 μs (52.68 μs .. 53.16 μs) std dev 715.4 ns (400.4 ns .. 1.370 μs)	2016-01-12 13:07:14 -04:00
Joey Hess	f9c5aa84e0	add database benchmark The benchmark shows that the database access is quite fast indeed! And, it scales linearly to the number of keys, with one exception, getAssociatedKey. Based on this benchmark, I don't think I need worry about optimising for cases where all files are locked and the database is mostly empty. In those cases, database access will be misses, and according to this benchmark, should add only 50 milliseconds to runtime. (NB: There may be some overhead to getting the database opened and locking the handle that this benchmark doesn't see.) joey@darkstar:~/src/git-annex>./git-annex benchmark setting up database with 1000 setting up database with 10000 benchmarking keys database/getAssociatedFiles from 1000 (hit) time 62.77 μs (62.70 μs .. 62.85 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 62.81 μs (62.76 μs .. 62.88 μs) std dev 201.6 ns (157.5 ns .. 259.5 ns) benchmarking keys database/getAssociatedFiles from 1000 (miss) time 50.02 μs (49.97 μs .. 50.07 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 50.09 μs (50.04 μs .. 50.17 μs) std dev 206.7 ns (133.8 ns .. 295.3 ns) benchmarking keys database/getAssociatedKey from 1000 (hit) time 211.2 μs (210.5 μs .. 212.3 μs) 1.000 R² (0.999 R² .. 1.000 R²) mean 211.0 μs (210.7 μs .. 212.0 μs) std dev 1.685 μs (334.4 ns .. 3.517 μs) benchmarking keys database/getAssociatedKey from 1000 (miss) time 173.5 μs (172.7 μs .. 174.2 μs) 1.000 R² (0.999 R² .. 1.000 R²) mean 173.7 μs (173.0 μs .. 175.5 μs) std dev 3.833 μs (1.858 μs .. 6.617 μs) variance introduced by outliers: 16% (moderately inflated) benchmarking keys database/getAssociatedFiles from 10000 (hit) time 64.01 μs (63.84 μs .. 64.18 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 64.85 μs (64.34 μs .. 66.02 μs) std dev 2.433 μs (547.6 ns .. 4.652 μs) variance introduced by outliers: 40% (moderately inflated) benchmarking keys database/getAssociatedFiles from 10000 (miss) time 50.33 μs (50.28 μs .. 50.39 μs) 1.000 R² (1.000 R² .. 1.000 R²) mean 50.32 μs (50.26 μs .. 50.38 μs) std dev 202.7 ns (167.6 ns .. 252.0 ns) benchmarking keys database/getAssociatedKey from 10000 (hit) time 1.142 ms (1.139 ms .. 1.146 ms) 1.000 R² (1.000 R² .. 1.000 R²) mean 1.142 ms (1.140 ms .. 1.144 ms) std dev 7.142 μs (4.994 μs .. 10.98 μs) benchmarking keys database/getAssociatedKey from 10000 (miss) time 1.094 ms (1.092 ms .. 1.096 ms) 1.000 R² (1.000 R² .. 1.000 R²) mean 1.095 ms (1.095 ms .. 1.097 ms) std dev 4.277 μs (2.591 μs .. 7.228 μs)	2016-01-12 13:07:03 -04:00
Joey Hess	8111eb21e6	split out raw sql interface	2016-01-11 15:52:11 -04:00

40 commits