git-annex

Author	SHA1	Message	Date
Joey Hess	eb2cd944d9	update	2024-03-08 14:32:29 -04:00
Joey Hess	01b301b902	fix comment	2024-03-08 14:23:17 -04:00
Joey Hess	8a3beabf35	use RawFilePath for opening sqlite databases Fix a crash opening sqlite databases when run in a non-unicode locale, with a remote that uses a non-unicode filepath. In that situation converting to Text fails. The fix needs git-annex to be built with persistent-sqlite 2.13.3. Building against older versions still works, but that version is used when building with stack. Database.RawFilePath is a lot of code copied from persistent-sqlite and lightly modified, since only 1 function in persistent-sqlite was made to support RawFilePath. This is a bit of a pain, and I hope that persistent-sqlite will eventually switch to using OsPath, allowing this module to be removed from git-annex. Sponsored-by: k0ld on Patreon	2023-12-26 18:31:52 -04:00
Joey Hess	7776d03355	avoid unused import	2023-10-26 14:00:21 -04:00
Joey Hess	8bde6101e3	sqlite database for importfeed importfeed: Use caching database to avoid needing to list urls on every run, and avoid using too much memory. Benchmarking in my podcasts repo, importfeed got 1.42 seconds faster, and memory use dropped from 203000k to 59408k. Database.ImportFeed is Database.ContentIdentifier with the serial number filed off. There is a bit of code duplication I would like to avoid, particularly recordAnnexBranchTree, and getAnnexBranchTree. But these use the persistent sqlite tables, so despite the code being the same, they cannot be factored out. Since this database includes the contentidentifier metadata, it will be slightly redundant if a sqlite database is ever added for metadata. I did consider making such a generic database and using it for this. But, that would then need importfeed to update both the url database and the metadata database, which is twice as much work diffing the git-annex branch trees. Or would entangle updating two databases in a complex way. So instead it seems better to optimise the database that importfeed needs, and if the metadata database is used by another command, use a little more disk space and do a little bit of redundant work to update it. Sponsored-by: unqueued on Patreon	2023-10-23 16:46:22 -04:00
Joey Hess	19d95c9bb8	enable TypeOperators more warnings about ~ needing it in a future ghc release	2023-08-02 09:47:42 -04:00
Joey Hess	a0ab425c95	add ContentIndentifiersCidRemoteKeyIndex Optimise database to further speed up importing large trees from special remotes. See comment for details of why the other index didn't help cid queries. It would probably be better to manually create an index on only cid, rather than adding a second uniqueness constraint that is a larger index. But persistent does not support creating indexes, and an attempt to manually add it to the migration failed. Sponsored-by: Nicholas Golder-Manning on Patreon	2023-06-09 15:12:33 -04:00
Joey Hess	fe1b2dfb4b	speed up very first tree import by 25% Reading from the cidsdb is responsible for about 25% of the runtime of an import. Since the cidmap is used to store the same information in ram, the cidsdb is not written to during an import any longer. And so, if it started off empty (and updateFromLog wasn't needed), those reads can just be skipped. This is kind of a cheesy optimisation, since after any import from any special remote, the database will no longer be empty, so it's a single use optimisation. But it's probably not uncommon to start by importing a lot of files, and it can save a lot of time then. Sponsored-by: Brock Spratlen on Patreon	2023-06-02 13:30:30 -04:00
Joey Hess	f9baf11e11	tab indentation	2023-05-30 15:42:11 -04:00
Joey Hess	cc36c8516a	Sped up sqlite inserts 2x when built with persistent 2.14.5.0 https://github.com/yesodweb/persistent/issues/1457 Sponsored-by: Dartmouth College's DANDI project	2023-03-31 14:38:25 -04:00
Joey Hess	e60766543f	add annex.dbdir (WIP) WIP: This is mostly complete, but there is a problem: createDirectoryUnder throws an error when annex.dbdir is set to outside the git repo. annex.dbdir is a workaround for filesystems where sqlite does not work, due to eg, the filesystem not properly supporting locking. It's intended to be set before initializing the repository. Changing it in an existing repository can be done, but would be the same as making a new repository and moving all the annexed objects into it. While the databases get recreated from the git-annex branch in that situation, any information that is in the databases but not stored in the branch gets lost. It may be that no information ever gets stored in the databases that cannot be reconstructed from the branch, but I have not verified that. Sponsored-by: Dartmouth College's Datalad project	2022-08-11 16:58:53 -04:00
Joey Hess	f5b642318d	eliminate single/multi writer distinction After commit `f4bdecc4ec`, there is no longer any distinction between SingleWriter and MultiWriter's handling of read after write. Databases that were SingleWriter still have lock files that are used to prevent multiple writers. This does make writing to such databases a bit more expensive, because the MultiWriter code path that is now used opens a second db connection in order to write to them.	2021-10-20 12:26:30 -04:00
Joey Hess	c6e693b25d	remove ContentIndentifiersCidRemoteIndex uniqueness constraint For reasons explained in the bug report. Implemented using a persistent migration, which works fine. It may add a little startup overhead when a remote is enabled that uses this, but probably unnoticeable. On the next major version, it would be fine to delete this database, and regenerate it from the git-annex branch information. Then this change could be reverted. Did nothing about adding back the data that got dropped from the db due to the bug. Only the borg special remote was probably affected, and it's not been released yet. rm -rf .git/annex/cidsdb does work.	2020-12-23 14:03:33 -04:00
Joey Hess	028d4517c7	enable extensions needed by new version of persistent Needed in order to use mkPersist in persistent version 2.11.0.1 persistent-template version 2.9.1.0	2020-11-07 14:09:17 -04:00
Joey Hess	1db49497e0	finished this stage of the RawFilePath conversion This commit was sponsored by Denis Dzyubenko on Patreon.	2020-11-06 14:10:58 -04:00
Joey Hess	2c8cf06e75	more RawFilePath conversion Converted file mode setting to it, and follow-on changes. Compiles up through 369/646. This commit was sponsored by Ethan Aubin.	2020-11-05 18:45:37 -04:00
Joey Hess	681b44236a	more RawFilePath conversion at 377/645 This commit was sponsored by Svenne Krap on Patreon.	2020-10-29 14:20:57 -04:00
Joey Hess	029c883713	Merge branch 'master' into v8	2020-02-19 14:32:11 -04:00
Joey Hess	c9357bdc0e	ifdef persistent-template 2.8.0 fixes The i386ancient build has a ghc too old for these extensions. Build with persistent-template 2.8.0 tested.	2020-02-04 13:53:00 -04:00
Joey Hess	4920df6573	Fix build with newest version of persistent-template. This is untested because of rain, also I am operating from truncated compiler error messages in a bug report that also doesn't mention what the library version is. Still, it should work. May break builds with old ghc, in particular DerivingStrategies is I think fairly new? The pragmas could be ifdefed if necessary. Works with ghc 8.6.5.	2020-02-04 12:03:30 -04:00
Joey Hess	d5628a16b8	Merge branch 'bs' into sqlite-bs	2019-12-18 14:51:03 -04:00
Joey Hess	bdec7fed9c	convert TopFilePath to use RawFilePath Adds a dependency on filepath-bytestring, an as yet unreleased fork of filepath that operates on RawFilePath. Git.Repo also changed to use RawFilePath for the path to the repo. This does eliminate some RawFilePath -> FilePath -> RawFilePath conversions. And filepath-bytestring's </> is probably faster. But I don't expect a major performance improvement from this. This is mostly groundwork for making Annex.Location use RawFilePath, which will allow for a conversion-free pipeline.	2019-12-09 15:07:21 -04:00
Joey Hess	2f9a80d803	merging sqlite and bs branches Since the sqlite branch uses blobs extensively, there are some performance benefits, ByteStrings now get stored and retrieved w/o conversion in some cases like in Database.Export.	2019-12-06 15:30:45 -04:00
Joey Hess	067aabdd48	wip RawFilePath 2x git-annex find speedup Finally builds (oh the agony of making it build), but still very unmergeable, only Command.Find is included and lots of stuff is badly hacked to make it compile. Benchmarking vs master, this git-annex find is significantly faster! Specifically: num files old new speedup 48500 4.77 3.73 28% 12500 1.36 1.02 66% 20 0.075 0.074 0% (so startup time is unchanged) That's without really finishing the optimization. Things still to do: * Eliminate all the fromRawFilePath, toRawFilePath, encodeBS, decodeBS conversions. * Use versions of IO actions like getFileStatus that take a RawFilePath. * Eliminate some Data.ByteString.Lazy.toStrict, which is a slow copy. * Use ByteString for parsing git config to speed up startup. It's likely several of those will speed up git-annex find further. And other commands will certainly benefit even more.	2019-11-26 16:01:58 -04:00
Joey Hess	d3e4de0175	fix test suite The test suite found a bug; select_ can fail now because a uniqueness constraint has been added. Now the test suite passes. Also, I'm satisfied the changed PersistField instances work. Looking over what changed, and what I've already tested, Key, FilePath, and InodeCache are known working; ContentIdentifier is trivial ByteString to blob; and SSha is trivial String to varchar. Both are tested by the test suite. I've also tested the new FileSize and EpochTime instances already, and they work.	2019-10-30 15:51:37 -04:00
Joey Hess	9085a2cfec	make sure all sqlite selects have indexes Bearing in mind that these indexes are really uniqueness constraints that just happen to also make sqlite generate indexes. In Database.ContentIndentifier, the ContentIndentifiersKeyRemoteCidIndex is fine as a uniqueness constraint because it contains all rows from the table. The ContentIndentifiersCidRemoteIndex is also ok because there can only be one key for a given (cid, uuid) combination. In Database.Export, the new ExportTreeFileKeyIndex is the same pair as the old ExportTreeKeyFileIndex (previously ExportTreeIndex). And in Database.Keys.SQL, the new InodeCacheKeyIndex is the same pair as the old KeyInodeCacheIndex.	2019-10-30 13:46:52 -04:00
Joey Hess	c35a9047d3	improve data types for sqlite This is a non-backwards compatible change, so not suitable for merging w/o an annex.version bump and transition code. Not yet tested. This improves performance of git-annex benchmark --databases across the board by 10-25%, since eg Key roundtrips as a ByteString. (serializeKey' produces a lazy ByteString, so there is still a copy involved in converting it to a strict ByteString. It may be faster to switch to using bytestring-strict-builder.) FilePath and Key are both stored as blobs. This avoids mojibake in some situations. It would be possible to use varchar instead, if persistent could avoid converting that to Text, but it seems there is no good way to do so. See doc/todo/sqlite_database_improvements.mdwn Eliminated some ugly artifacts of using Read/Show serialization; constructors and quoted strings are no longer stored in sqlite. Renamed SRef to SSha to reflect that it is only ever a git sha, not a ref name. Since it is limited to the characters in a sha, it is not affected by mojibake, so still uses String.	2019-10-29 17:05:36 -04:00
Joey Hess	9828f45d85	add RemoteStateHandle This solves the problem of sameas remotes trampling over per-remote state. Used for: * per-remote state, of course * per-remote metadata, also of course * per-remote content identifiers, because two remote implementations could in theory generate the same content identifier for two different pieces of content While chunk logs are per-remote data, they don't use this, because the number and size of chunks stored is a common property across sameas remotes. External special remote had a complication, where it was theoretically possible for a remote to send SETSTATE or GETSTATE during INITREMOTE or EXPORTSUPPORTED. Since the uuid of the remote is typically generated in Remote.setup, it would only be possible to pass a Maybe RemoteStateHandle into it, and it would otherwise have to construct its own. Rather than go that route, I decided to send an ERROR in this case. It seems unlikely that any existing external special remote will be affected. They would have to make up a git-annex key, and set state for some reason during INITREMOTE. I can imagine such a hack, but it doesn't seem worth complicating the code in such an ugly way to support it. Unfortunately, both TestRemote and Annex.Import needed the Remote to have a new field added that holds its RemoteStateHandle.	2019-10-14 13:51:42 -04:00
Joey Hess	018b5b8173	Support building with socks-0.6 and persistent-template-2.7 persistent-template now needs UndecidableInstances. socks changed defaultSocksConf to take a SockAddr.	2019-07-30 12:50:48 -04:00
Joey Hess	6babb2c73f	remove wrong uniqueness constraint from ContentIdentifier db Fix bug that caused importing from a special remote to repeatedly download unchanged files when multiple files in the remote have the same content. Unfortunately, there's really no good way to remove a uniqueness constraint from a sqlite database. The best that can be done is to make a new table and copy the data over. But that would require using persistent's migrations or raw sql, and I don't want to do either. Instead, a sledgehammer approach: Renamed .git/annex/cid to .git/annex/cids. When the new database doesn't exist, it will be populated from the git-annex branch. Nothing deletes the old database. Don't want to delete it out from under some long-running git-annex process that might be using it. It could eventually be deleted. But this is such a new feature, probably few repos have the database in any case.	2019-04-09 19:58:24 -04:00
Joey Hess	50797ee2c5	remove obsolete comment	2019-03-07 13:02:46 -04:00
Joey Hess	71fec9060c	move	2019-03-07 12:56:40 -04:00
Joey Hess	ee251b2e2e	implement updating the ContentIdentifier db with info from the git-annex branch untested This won't be super slow, but it does need to diff two likely large trees, and since the git-annex branch rarely sits still, it will most likely be run at the beginning of every import. A possible speed improvement would be to only run this when the database did not contain a ContentIdentifier. But that would only speed up imports when there is no new version of a file on the special remote, at most renames of existing files being imported. A better speed improvement would be to record something in the git-annex branch that indicates when an import has been run, and only do the diff if the git-annex branch has record of a newer import than we've seen before. Then, it would only run when there is in fact new ContentIdentifier information available from a remote. Certainly doable, but didn't want to complicate things yet.	2019-03-06 18:04:30 -04:00
Joey Hess	f85f06aae3	change to more efficient IKey	2019-03-06 11:14:33 -04:00
Joey Hess	3c652e1499	limit to requested remote	2019-03-05 15:56:28 -04:00
Joey Hess	cd3a2b023a	initial try at using storeExportWithContentIdentifier Untested, and I'm not sure about the locking of the ContentIdentifier db.	2019-03-04 17:50:41 -04:00
Joey Hess	138d07eb97	add getContentIdentifiers Changed the database schema for this, with a new index.	2019-03-04 16:48:07 -04:00
Joey Hess	775e6ed86f	fix table name	2019-02-27 13:52:56 -04:00
Joey Hess	fd304dce60	split out Types.Import and some changes to the types in it	2019-02-21 13:39:09 -04:00
Joey Hess	a818bc5e73	add Database.ContentIdentifier Does not yet have a way to update with new information from the git-annex branch, which will be needed when multiple repos are importing from the same remote.	2019-02-20 16:59:10 -04:00

40 commits