git-annex

Author	SHA1	Message	Date
Joey Hess	f971b199ed	fix init .git/annex/ perms for core.sharedRepository init: Bug fix: Create .git/annex/ and .git/annex/fsckdb/ directories with permissions configured by core.sharedRepository. The fsckdb being created happens to create .git/annex/ and it was not using createAnnexDirectory. Probably a reversion partly, but maybe the database directory was always created not honoring core.sharedRepository? Sponsored-by: Noam Kremen on Patreon	2023-04-26 16:14:21 -04:00
Joey Hess	a3f433eac8	improve error message when commitDb' fails due to disk full or IO error There's still a 60 second delay in this situation because it retries, in case the failure was due to something recoverable like another process. Sponsored-by: unqueued on Patreon	2023-04-19 12:43:30 -04:00
Joey Hess	cd544e548b	filter out control characters in error messages giveup changed to filter out control characters. (It is too low level to make it use StringContainingQuotedPath.) error still does not, but it should only be used for internal errors, where the message is not attacker-controlled. Changed a lot of existing error to giveup when it is not strictly an internal error. Of course, other exceptions can still be thrown, either by code in git-annex, or a library, that include some attacker-controlled value. This does not guard against those. Sponsored-by: Noam Kremen on Patreon	2023-04-10 13:50:51 -04:00
Joey Hess	cc36c8516a	Sped up sqlite inserts 2x when built with persistent 2.14.5.0 https://github.com/yesodweb/persistent/issues/1457 Sponsored-by: Dartmouth College's DANDI project	2023-03-31 14:38:25 -04:00
Yaroslav Halchenko	84b0a3707a	Apply codespell -w throughout	2023-03-17 15:14:58 -04:00
Joey Hess	195508fc65	Improve error message when unable to read a sqlite database due to permissions problem Old message was: sqlite query crashed: thread blocked indefinitely in an MVar operation New message is eg: sqlite worker thread crashed: SQLite3 returned ErrorCan'tOpen while attempting to perform open ".git/annex/keysdb/db". The worker thread used to throw an exception. But before that exception was seen by anything waiting on the worker thread to finish, the takeMVar in queryDb would have crashed with BlockedIndefinitelyOnMVar. Sponsored-by: k0ld on Patreon	2023-02-23 15:28:22 -04:00
Joey Hess	672258c8f4	Revert "revert recent bug fix temporarily for release" This reverts commit `16f1e24665`.	2023-02-14 14:11:23 -04:00
Joey Hess	16f1e24665	revert recent bug fix temporarily for release Decided this bug is not severe enough to delay the release until tomorrow, so this will be re-applied after the release.	2023-02-14 14:06:29 -04:00
Joey Hess	12b45d3b89	use isBareRepo The advantage over Git.Config.isBare is that it avoids a string comparison and a map lookup. Sponsored-by: Noam Kremen on Patreon	2023-02-14 13:52:05 -04:00
Joey Hess	cf892f4256	use insert_ for speed improvement persistent-2.14.4.1 makes insert_ faster than insert because it skips getting the key back. Sponsored-by: Dartmouth College's DANDI project	2022-12-26 15:59:41 -04:00
Joey Hess	c834d2025a	queue more changes to keys db Increasing the size of the queue 10x makes git-annex init 7% faster in a repository with 86000 annexed files. The memory use goes up, from 70876 kb to 85376 kb.	2022-11-18 13:29:34 -04:00
Joey Hess	8fcee4ac9d	Sped up the initial scanning for annexed files by 15% Avoids database querying overhead when the database is newly created. In the large repository where git-annex init took 24 seconds, this sped it up to 20.47 seconds, a speedup of around 15%. Sponsored-by: Dartmouth College's DANDI project	2022-11-18 13:16:57 -04:00
Joey Hess	cde2e61105	improve sqlite retrying behavior Avoid hanging when a suspended git-annex process is keeping a sqlite database locked. Sponsored-by: Dartmouth College's Datalad project	2022-10-18 15:47:20 -04:00
Joey Hess	3149a1e2fe	More robust handling of ErrorBusy when writing to sqlite databases While ErrorBusy and other exceptions were caught and the write retried for up to 10 seconds, it was still possible for git-annex to eventually give up and error out without writing to the database. Now it will retry as long as necessary. This does mean that, if one git-annex process is suspended just as sqlite has locked the database for writing, another git-annex that tries to write to it might get stuck retrying forever. But, that could already happen when opening the sqlite database, which retries forever on ErrorBusy. This is an area where git-annex is known to not behave well, there's a todo about the general case of it. Sponsored-by: Dartmouth College's Datalad project	2022-10-17 15:56:19 -04:00
Joey Hess	0d762acf7e	update comment, probably not a sqlite bug Sqlite's page documenting WAL mode changed in Oct 2016 to mention ways that queries could fail with SQLITE_BUSY. http://web.archive.org/web/20161009044054/http://www.sqlite.org:80/wal.html Probably not coincidentally, I emailed sqlite-users about such a situation in Feb 2015. https://www.mail-archive.com/sqlite-users@mailinglists.sqlite.org/msg90580.html No one ever replied to me, but at least now I understand why it does that. Since it's documented now, it's no longer a bug.	2022-10-17 15:09:47 -04:00
Joey Hess	6fbd337e34	avoid unnecessary keys db writes; doubled speed! When running eg git-annex get, for each file it has to read from and write to the keys database. But it's reading exclusively from one table, and writing to a different table. So, it is not necessary to flush the write to the database before reading. This avoids writing the database once per file, instead it will buffer 1000 changes before writing. Benchmarking getting 1000 small files from a local origin, git-annex get now takes 13.62s, down from 22.41s! git-annex drop now takes 9.07s, down from 18.63s! Wowowowowowowow! (It would perhaps have been better if there were separate databases for the two tables. At least it would have avoided this complexity. Ah well, this is better than splitting the table in an annex.version upgrade.) Sponsored-by: Dartmouth College's Datalad project	2022-10-12 15:33:16 -04:00
Joey Hess	ba7ecbc6a9	avoid flushing keys db queue after each Annex action The flush was only done in Annex.run' to make sure that the queue was flushed before git-annex exits. But, doing it there means that as soon as one change gets queued, it gets flushed soon after, which contributes to excessive writes to the database, slowing git-annex down. (This does not yet speed git-annex up, but it is a stepping stone to doing so.) Database queues do not autoflush when garbage collected, so have to be flushed explicitly. I don't think it's possible to make them autoflush (except perhaps if git-annex switched to using ResourceT..). The comment in Database.Keys.closeDb used to be accurate, since the automatic flushing did mean that all writes reached the database even when closeDb was not called. But now, closeDb or flushDb needs to be called before stopping using an Annex state. So, removed that comment. In Remote.Git, change to using quiesce everywhere that it used to use stopCoProcesses. This means that uses of onLocal in there are just as slow as before. I considered only calling closeDb on the local git remotes when git-annex exits. But, the reason that Remote.Git calls stopCoProcesses in each onLocal is so as not to leave git processes running that have files open on the remote repo, when it's on removable media. So, it seemed to make sense to also closeDb after each one, since sqlite may also keep files open. Although that has not seemed to cause problems with removable media so far. It was also just easier to quiesce in each onLocal than once at the end. This does likely leave performance on the floor, so could be revisited. In Annex.Content.saveState, there was no reason to close the db, flushing it is enough. The rest of the changes are from auditing for Annex.new, and making sure that quiesce is called, after any action that might possibly need it. After that audit, I'm pretty sure that the change to Annex.run' is safe. The only concern might be that this does let more changes get queued for write to the db, and if git-annex is interrupted, those will be lost. But interrupting git-annex can obviously already prevent it from writing the most recent change to the db, so it must recover from such lost data... right? Sponsored-by: Dartmouth College's Datalad project	2022-10-12 14:12:23 -04:00
Yaroslav Halchenko	0151976676	Typo fix unncessary -> unnecessary. Detected while reading recent CHANGELOG entry but then decided to apply to entire codebase and docs since why not?	2022-08-20 09:40:19 -04:00
Joey Hess	b801812660	init: probe if sqlite works Help the user get annex.dbdir configured when their filesystem is not one that sqlite works on. The change in Database.Handle makes an error from sqlite not be ignored besides being displayed, which it was before. I can't see any reason git-annex would want to ignore these errors. I chose to use the fsck database rather than the keys database because opening the keys database populates it, and see commit `b3c4579c79`. The placement of the call to checkSqliteWorks inside checkInitializeAllowed avoids annex.uuid getting set before it's called. Sponsored-by: Dartmouth College's Datalad project	2022-08-17 13:12:26 -04:00
Joey Hess	4cfe17a9e8	use a subdirectory of annex.dbdir This allows annex.dbdir to be set globally or always set to the same value when needed. Each repository uses a subdirectory of it. Sponsored-by: Dartmouth College's Datalad project	2022-08-12 13:18:15 -04:00
Joey Hess	a335c1e46e	annex.dbdir fully working Completes work started in `e60766543f` I've verified that all the sqlite databases get stored in annex.dbdir and are created successfully. If annex.dbdir does not exist, it will be created; its parent directory must already exist though. Sponsored-by: Dartmouth College's Datalad project	2022-08-12 13:06:58 -04:00
Joey Hess	23c6e350cb	improve createDirectoryUnder to allow alternate top directories This should not change the behavior of it, unless there are multiple top directories, and then it should behave the same as if there was a single top directory that was actually above the directory to be created. Sponsored-by: Dartmouth College's Datalad project	2022-08-12 12:52:37 -04:00
Joey Hess	e60766543f	add annex.dbdir (WIP) WIP: This is mostly complete, but there is a problem: createDirectoryUnder throws an error when annex.dbdir is set to outside the git repo. annex.dbdir is a workaround for filesystems where sqlite does not work, due to eg, the filesystem not properly supporting locking. It's intended to be set before initializing the repository. Changing it in an existing repository can be done, but would be the same as making a new repository and moving all the annexed objects into it. While the databases get recreated from the git-annex branch in that situation, any information that is in the databases but not stored in the branch gets lost. It may be that no information ever gets stored in the databases that cannot be reconstructed from the branch, but I have not verified that. Sponsored-by: Dartmouth College's Datalad project	2022-08-11 16:58:53 -04:00
Joey Hess	2d65c4ff1d	avoid unix-compat's rename On Windows, that does not support long paths https://github.com/jacobstanley/unix-compat/issues/56 Instead, use System.Directory.renamePath, which does support long paths. Sponsored-by: Dartmouth College's Datalad project	2022-07-12 14:55:02 -04:00
Joey Hess	95a04920cf	remove objectDir'	2022-06-22 16:08:49 -04:00
Joey Hess	5da1a78508	add debugging around commits to sqlite dbs	2022-06-06 12:36:55 -04:00
Joey Hess	331c97df88	fix MVar deadlock when sqlite commit fails The database queue was left empty, which caused subsequent calls to flushDbQueue to deadlock. Sponsored-by: Dartmouth College's Datalad project	2022-06-06 12:16:55 -04:00
Joey Hess	09edb07ac5	add debugLocks around database operations to track down a blocked indefinitely on MVar that seems to occur after sqlite throws ErrorBusy but that I have not been able to reproduce when I made commits synthetically throw ErrorBusy. Sponsored-by: Dartmouth College's Datalad project	2022-06-03 14:16:28 -04:00
Joey Hess	649464619e	read up to and including maxPointerSz For consistency with everything else. Sponsored-by: Dartmouth College's Datalad project	2022-02-23 12:54:40 -04:00
Joey Hess	5b373a9dd2	read a consistent amount from pointer file A few places were reading the max symlink size of a pointer file, then passing to parseLinkTargetOrPointer. Which is fine currently, but to support pointer files with lines of data after the pointer, enough has to be read that parseLinkTargetOrPointer can be assured of seeing enough of that data to know if it's correctly formatted. Sponsored-by: Dartmouth College's Datalad project	2022-02-23 12:52:34 -04:00
Joey Hess	46d5098ff4	Pass --no-textconv when running git diff internally Seems that --no-ext-diff and -c diff.external= are not enough to disable external diff command when gitattributes textconv specifies it. I'm pretty sure that --no-ext-diff and -c diff.external= are not both needed, but not 100%. Something about -G may need the latter to fully disable diffs in some cases. So kept that part as it was. Sponsored-by: Dartmouth College's Datalad project	2022-02-01 13:43:18 -04:00
Joey Hess	f54c58f0df	Avoid crashing when run in a bare git repo that somehow contains an index file Do not populate the keys database with associated files, because a bare repo has no working tree, and so it does not make sense to populate it. Queries of associated files in the keys database always return empty lists in a bare repo, even if it's somehow populated. One way it could be populated is if a user converts a non-bare repo to a bare repo. Note that Git.Config.isBare does a string comparison, so this is not free! But, that string comparison is very small compared to a sqlite query. Sponsored-by: Erik Bjäreholt on Patreon	2022-01-11 13:01:49 -04:00
Joey Hess	d0ef8303cf	avoid using a second db connection for writes This is a potentially breaking change in a very delicate area. However, examining the code path for writes, I don't see any benefit to opening a second db connection for them. If the write throws an exception, commitDb will retry it with a new db connection. A potential benefit to not opening a second db connection, beyond using less resources, is it just might avoid problems in WSL with sqlite that I have hypothesized are caused by multiple db connections. Commit `5f9eff3f32` explains why it needs to shut down the db connection to force the database to be updated on disk: When closeDb does not get called, garbage collection of DbHandle may not give the workerThread time to cleanly shut down before git-annex exits, resulting in a recently written change not reaching disk.	2021-10-20 12:32:46 -04:00
Joey Hess	f5b642318d	eliminate single/multi writer distinction After commit `f4bdecc4ec`, there is no longer any distinction between SingleWriter and MultiWriter's handling of read after write. Databases that were SingleWriter still have lock files that are used to prevent multiple writers. This does make writing to such databases a bit more expensive, because the MultiWriter code path that is now used opens a second db connection in order to write to them.	2021-10-20 12:26:30 -04:00
Joey Hess	c47794991c	improve with continuation no behavior change	2021-10-20 12:13:49 -04:00
Joey Hess	f4bdecc4ec	improve sqlite MultiWriter handling of read after write This removes a messy caveat that was easy to forget and caused at least one bug. The price paid is that, after a write to a MultiWriter db, it has to close the db connection that it had been using to read, and open a new connection. So it might be a little bit slower. But, writes are usually batched together, so there's often only a single write, and so there should not be much of a slowdown. Notice that SingleWriter already closed the db connection after a write, so paid the same overhead. This is the second try at fixing a bug: git-annex get when run as the first git-annex command in a new repo did not populate all unlocked files. (Reversion in version 8.20210621) Sponsored-by: Boyd Stephen Smith Jr. on Patreon	2021-10-19 15:13:29 -04:00
Joey Hess	0f38ad9a69	close keys db to possibly work around WSL1 issue	2021-10-19 13:07:49 -04:00
Joey Hess	19e78816f0	convert Key to ShortByteString This adds the overhead of a copy when serializing and deserializing keys. I have not benchmarked much, but runtimes seem barely changed at all by that. When a lot of keys are in memory, it improves memory use. And, it prevents keys sometimes getting PINNED in memory and failing to GC, which is a problem ByteString has sometimes. In particular, git-annex sync from a borg special remote had that problem and this improved its memory use by a large amount. Sponsored-by: Shae Erisson on Patreon	2021-10-05 20:20:08 -04:00
Joey Hess	837116ef1e	Fix support for readonly git remotes Boolean blindness oops. (Reversion in version 8.20210621) Sponsored-by: Dartmouth College's Datalad project	2021-08-30 12:34:19 -04:00
Joey Hess	fa62c98910	simplify and speed up Utility.FileSystemEncoding This eliminates the distinction between decodeBS and decodeBS', encodeBS and encodeBS', etc. The old implementation truncated at NUL, and the primed versions had to do extra work to avoid that problem. The new implementation does not truncate at NUL, and is also a lot faster. (Benchmarked at 2x faster for decodeBS and 3x for encodeBS; more for the primed versions.) Note that filepath-bytestring 1.4.2.1.8 contains the same optimisation, and upgrading to it will speed up to/fromRawFilePath. AFAIK, nothing relied on the old behavior of truncating at NUL. Some code used the faster versions in places where I was sure there would not be a NUL. So this change is unlikely to break anything. Also, moved s2w8 and w82s out of the module, as they do not involve filesystem encoding really. Sponsored-by: Shae Erisson on Patreon	2021-08-11 12:13:31 -04:00
Joey Hess	9f94d2894e	remove unused code	2021-07-30 18:01:36 -04:00
Joey Hess	a306560374	use SQL.addInodeCaches This avoids deadlock when opening the database handle calls reconcileStaged.	2021-07-27 17:34:56 -04:00
Joey Hess	73e0cbbb19	fix problem populating pointer files This is a result of an audit of every use of getInodeCaches, to find places that misbehave when the annex object is not in the inode cache, despite pointer files for the same key being in the inode cache. Unfortunately, that is the case for objects that were in v7 repos that upgraded to v8. Added a note about this gotcha to getInodeCaches. Database.Keys.reconcileStaged, when annex.thin is set, would fail to populate pointer files in this situation. Changed it to check if the annex object is unmodified the same way inAnnex does, falling back to a checksum if the inode cache is not recorded. Sponsored-by: Dartmouth College's Datalad project	2021-07-27 14:26:49 -04:00
Joey Hess	e4b2a067e0	fix potential race in updating inode cache In Annex.Content, the object file was statted after pointer files were populated. But if annex.thin is set, once the pointer files are populated, the object file can potentially be modified via the hard link. So, it was possible, though seemingly very unlikely, for the inode of the modified object file to be cached. Command.Fix and Command.Fsck had similar problems, statting the work tree files after they were in place. Changed them to stat the temp file that gets moved into place. This does rely on .git/annex being on the same filesystem. If it's not, the cached inode will not be the same as the one that the temp file gets moved to. Result will be that git-annex will later need to do an expensive verification of the content of the worktree files. Note that the cross-filesystem move of the temp file already is a larger amount of extra work, so this seems acceptable. Sponsored-by: Luke Shumaker on Patreon	2021-07-27 12:29:10 -04:00
Joey Hess	af9fdf5dba	verify associated files when checking numcopies Most of this is just refactoring. But, handleDropsFrom did not verify that associated files from the keys db were still accurate, and has now been fixed to. A minor improvement to this would be to avoid calling catKeyFile twice on the same file, when getting the numcopies and mincopies value, in the common case where the same file has the highest value for both. But, it avoids checking every associated file, so it will scale well to lots of dups already. Sponsored-by: Kevin Mueller on Patreon	2021-06-15 11:14:52 -04:00
Joey Hess	7b6deb1109	display scanning message whenever reconcileStaged has enough files to chew on Clear visible progress bar first. Removed showSideActionAfter because it can't be used in reconcileStaged (import loop). Instead, it counts the number of files it processes and displays it after it's seen a sufficient number to know it's taking a while. Sponsored-by: Dartmouth College's Datalad project	2021-06-08 12:48:30 -04:00
Joey Hess	1a6fa5abc8	add debugging for reconcileStaged calls for benchmarking	2021-06-08 11:57:23 -04:00
Joey Hess	7f742589f9	claw back annexed file scan speedup Following commit `c941ab6f5b`, this avoids the second, redundant scan when annex.thin is not set. The benchmark now runs in 35.5 seconds, down from 40 seconds. Note that the inode cache of the annex object has to be passed to addInodeCaches now, because it might not already be in the inode caches, unlike previously. Sponsored-by: Dartmouth College's Datalad project	2021-06-08 11:09:15 -04:00
Joey Hess	ec1f2f246b	improve comment remove obsolete part about a commit preventing it seeing changes	2021-06-08 10:43:48 -04:00
Joey Hess	c941ab6f5b	avoid double work in git-annex init, second try reconcileStaged populates the db, so scanAnnexedFiles does not need to do it again. It still makes a pass over the HEAD tree, but populating the db was most of the expensive part. Benchmarking with 100,000 files, git-annex init now takes 40 seconds, vs 37 seconds with the old, buggy version of this fix. It should be possible to win those 3 precious seconds per 100k files back, in the case when annex.thin is not set, with improvements to reconcileStaged that avoid needing this second pass. Sponsored-by: Dartmouth College's Datalad project	2021-06-08 09:36:53 -04:00


246 commits