git-annex

Author	SHA1	Message	Date
Joey Hess	c834d2025a	queue more changes to keys db Increasing the size of the queue 10x makes git-annex init 7% faster in a repository with 86000 annexed files. The memory use goes up, from 70876 kb to 85376 kb.	2022-11-18 13:29:34 -04:00
Joey Hess	8fcee4ac9d	Sped up the initial scanning for annexed files by 15% Avoids database querying overhead when the database is newly created. In the large repository where git-annex init took 24 seconds, this sped it up to 20.47 seconds, a speedup of around 15%. Sponsored-by: Dartmouth College's DANDI project	2022-11-18 13:16:57 -04:00
Joey Hess	cde2e61105	improve sqlite retrying behavior Avoid hanging when a suspended git-annex process is keeping a sqlite database locked. Sponsored-by: Dartmouth College's Datalad project	2022-10-18 15:47:20 -04:00
Joey Hess	3149a1e2fe	More robust handling of ErrorBusy when writing to sqlite databases While ErrorBusy and other exceptions were caught and the write retried for up to 10 seconds, it was still possible for git-annex to eventually give up and error out without writing to the database. Now it will retry as long as necessary. This does mean that, if one git-annex process is suspended just as sqlite has locked the database for writing, another git-annex that tries to write it it might get stuck retrying forever. But, that could already happen when opening the sqlite database, which retries forever on ErrorBusy. This is an area where git-annex is known to not behave well, there's a todo about the general case of it. Sponsored-by: Dartmouth College's Datalad project	2022-10-17 15:56:19 -04:00
Joey Hess	0d762acf7e	update comment, probably not a sqlite bug Sqlite's page documenting WAL mode changed in Oct 2016 to mention ways that queries could fail with SQLITE_BUSY. http://web.archive.org/web/20161009044054/http://www.sqlite.org:80/wal.html Probably not cooincidentally, I emailed sqlite-users about such a situation in Feb 2015. https://www.mail-archive.com/sqlite-users@mailinglists.sqlite.org/msg90580.html Noone ever replied to me, but at least now I understand why it does that. Since it's documented now, it's no longer a bug.	2022-10-17 15:09:47 -04:00
Joey Hess	6fbd337e34	avoid uncessary keys db writes; doubled speed! When running eg git-annex get, for each file it has to read from and write to the keys database. But it's reading exclusively from one table, and writing to a different table. So, it is not necessary to flush the write to the database before reading. This avoids writing the database once per file, instead it will buffer 1000 changes before writing. Benchmarking getting 1000 small files from a local origin, git-annex get now takes 13.62s, down from 22.41s! git-annex drop now takes 9.07s, down from 18.63s! Wowowowowowowow! (It would perhaps have been better if there were separate databases for the two tables. At least it would have avoided this complexity. Ah well, this is better than splitting the table in a annex.version upgrade.) Sponsored-by: Dartmouth College's Datalad project	2022-10-12 15:33:16 -04:00
Joey Hess	ba7ecbc6a9	avoid flushing keys db queue after each Annex action The flush was only done Annex.run' to make sure that the queue was flushed before git-annex exits. But, doing it there means that as soon as one change gets queued, it gets flushed soon after, which contributes to excessive writes to the database, slowing git-annex down. (This does not yet speed git-annex up, but it is a stepping stone to doing so.) Database queues do not autoflush when garbage collected, so have to be flushed explicitly. I don't think it's possible to make them autoflush (except perhaps if git-annex sqitched to using ResourceT..). The comment in Database.Keys.closeDb used to be accurate, since the automatic flushing did mean that all writes reached the database even when closeDb was not called. But now, closeDb or flushDb needs to be called before stopping using an Annex state. So, removed that comment. In Remote.Git, change to using quiesce everywhere that it used to use stopCoProcesses. This means that uses on onLocal in there are just as slow as before. I considered only calling closeDb on the local git remotes when git-annex exits. But, the reason that Remote.Git calls stopCoProcesses in each onLocal is so as not to leave git processes running that have files open on the remote repo, when it's on removable media. So, it seemed to make sense to also closeDb after each one, since sqlite may also keep files open. Although that has not seemed to cause problems with removable media so far. It was also just easier to quiesce in each onLocal than once at the end. This does likely leave performance on the floor, so could be revisited. In Annex.Content.saveState, there was no reason to close the db, flushing it is enough. The rest of the changes are from auditing for Annex.new, and making sure that quiesce is called, after any action that might possibly need it. After that audit, I'm pretty sure that the change to Annex.run' is safe. The only concern might be that this does let more changes get queued for write to the db, and if git-annex is interrupted, those will be lost. But interrupting git-annex can obviously already prevent it from writing the most recent change to the db, so it must recover from such lost data... right? Sponsored-by: Dartmouth College's Datalad project	2022-10-12 14:12:23 -04:00
Yaroslav Halchenko	0151976676	Typo fix unncessary -> unnecessary. Detected while reading recent CHANGELOG entry but then decided to apply to entire codebase and docs since why not?	2022-08-20 09:40:19 -04:00
Joey Hess	b801812660	init: probe if sqlite works Help the user get annex.dbdir configured when their filesystem is not one that sqlite works on. The change in Database.Handle makes an error from sqlite not be ignored besides being displayed, which it was before. I can't see any reason git-annex would want to ignore these errors. I chose to use the fsck database rather than the keys database because opening the keys database populates it, and see commit `b3c4579c79`. The placement of the call to checkSqliteWorks inside checkInitializeAllowed avoids annex.uuid getting set before it's called. Sponsored-by: Dartmouth College's Datalad project	2022-08-17 13:12:26 -04:00
Joey Hess	4cfe17a9e8	use a subdirectory of annex.dbdir This allows annex.dbdir to be set globally or always set to the same value when needed. Each repository uses a subdirectory of it. Sponsored-by: Dartmouth College's Datalad project	2022-08-12 13:18:15 -04:00
Joey Hess	a335c1e46e	annex.dbdir fully working Completes work started in `e60766543f` I've verified that all the sqlite databases get stored in annex.dbdir and are created successfully. If annex.dbdir does not exist, it will be created; its parent directory must already exist though. Sponsored-by: Dartmouth College's Datalad project	2022-08-12 13:06:58 -04:00
Joey Hess	23c6e350cb	improve createDirectoryUnder to allow alternate top directories This should not change the behavior of it, unless there are multiple top directories, and then it should behave the same as if there was a single top directory that was actually above the directory to be created. Sponsored-by: Dartmouth College's Datalad project	2022-08-12 12:52:37 -04:00
Joey Hess	e60766543f	add annex.dbdir (WIP) WIP: This is mostly complete, but there is a problem: createDirectoryUnder throws an error when annex.dbdir is set to outside the git repo. annex.dbdir is a workaround for filesystems where sqlite does not work, due to eg, the filesystem not properly supporting locking. It's intended to be set before initializing the repository. Changing it in an existing repository can be done, but would be the same as making a new repository and moving all the annexed objects into it. While the databases get recreated from the git-annex branch in that situation, any information that is in the databases but not stored in the branch gets lost. It may be that no information ever gets stored in the databases that cannot be reconstructed from the branch, but I have not verified that. Sponsored-by: Dartmouth College's Datalad project	2022-08-11 16:58:53 -04:00
Joey Hess	2d65c4ff1d	avoid unix-compat's rename On Windows, that does not support long paths https://github.com/jacobstanley/unix-compat/issues/56 Instead, use System.Directory.renamePath, which does support long paths. Sponsored-by: Dartmouth College's Datalad project	2022-07-12 14:55:02 -04:00
Joey Hess	95a04920cf	remove objectDir'	2022-06-22 16:08:49 -04:00
Joey Hess	5da1a78508	add debugging around commits to sqlite dbs	2022-06-06 12:36:55 -04:00
Joey Hess	331c97df88	fix MVar deadlock when sqlite commit fails The database queue was left empty, which caused subsequent calls to flushDbQueue to deadlock. Sponsored-by: Dartmouth College's Datalad project	2022-06-06 12:16:55 -04:00
Joey Hess	09edb07ac5	add debugLocks around database operations to track down a blocked indefinitely on MVar that seems to occur after sqlite throws ErrorBusy but that I have not been able to reproduce when I made commits synthetically throw ErrorBusy. Sponsored-by: Dartmouth College's Datalad project	2022-06-03 14:16:28 -04:00
Joey Hess	649464619e	read up to and including maxPointerSz For consistency with everything else. Sponsored-by: Dartmouth College's Datalad project	2022-02-23 12:54:40 -04:00
Joey Hess	5b373a9dd2	read a consistent amount from pointer file A few places were reading the max symlink size of a pointer file, then passing tp parseLinkTargetOrPointer. Which is fine currently, but to support pointer files with lines of data after the pointer, enough has to be read that parseLinkTargetOrPointer can be assured of seeing enough of that data to know if it's correctly formatted. Sponsored-by: Dartmouth College's Datalad project	2022-02-23 12:52:34 -04:00
Joey Hess	46d5098ff4	Pass --no-textconv when running git diff internally Seems that --no-ext-diff and -c diff.external= are not enough to disable external diff command when gitattributes textconv specifies it. I'm pretty sure that --no-ext-diff and -c diff.external= are not both needed, but not 100%. Something about -G may need the latter to fully disable diffs in some cases. So kept that part as it was. Sponsored-by: Dartmouth College's Datalad project	2022-02-01 13:43:18 -04:00
Joey Hess	f54c58f0df	Avoid crashing when run in a bare git repo that somehow contains an index file Do not populate the keys database with associated files, because a bare repo has no working tree, and so it does not make sense to populate it. Queries of associated files in the keys database always return empty lists in a bare repo, even if it's somehow populated. One way it could be populated is if a user converts a non-bare repo to a bare repo. Note that Git.Config.isBare does a string comparison, so this is not free! But, that string comparison is very small compared to a sqlite query. Sponsored-by: Erik Bjäreholt on Patreon	2022-01-11 13:01:49 -04:00
Joey Hess	d0ef8303cf	avoid using a second db connection for writes This is a potentially breaking change in a very delicate area. However, examining the code path for writes, I don't see any benefit to opening a second db connection for them. If the write throws an exception, commitDb will retry it with a new db connection. A potential benefit to not opening a second db connection, beyond using less resources, is it just might avoid problems in WSL with sqlite that I have hypothesized are caused by multiple db connections. Commit `5f9eff3f32` explains why it needs to shut down the db connection to force the database to be updated on disk: When closeDb does not get called, garbage collection of DbHandle may not give the workterThread time to cleanly shut down before git-annex exits, resulting in a recently written change not reaching disk.	2021-10-20 12:32:46 -04:00
Joey Hess	f5b642318d	eliminate single/multi writer distinction After commit `f4bdecc4ec`, there is no longer any distinction between SingleWriter and MultiWriter's handling of read after write. Databases that were SingleWriter still have lock files that are used to prevent multiple writers. This does make writing to such databases a bit more expensive, because the MultiWriter code path that is now used opens a second db connection in order to write to them.	2021-10-20 12:26:30 -04:00
Joey Hess	c47794991c	improve with continuation no behavior change	2021-10-20 12:13:49 -04:00
Joey Hess	f4bdecc4ec	improve sqlite MultiWriter handling of read after write This removes a messy caveat that was easy to forget and caused at least one bug. The price paid is that, after a write to a MultiWriter db, it has to close the db connection that it had been using to read, and open a new connection. So it might be a little bit slower. But, writes are usually batched together, so there's often only a single write, and so there should not be much of a slowdown. Notice that SingleWriter already closed the db connection after a write, so paid the same overhead. This is the second try at fixing a bug: git-annex get when run as the first git-annex command in a new repo did not populate all unlocked files. (Reversion in version 8.20210621) Sponsored-by: Boyd Stephen Smith Jr. on Patreon	2021-10-19 15:13:29 -04:00
Joey Hess	0f38ad9a69	close keys db to possibly work around WSL1 issue	2021-10-19 13:07:49 -04:00
Joey Hess	19e78816f0	convert Key to ShortByteString This adds the overhead of a copy when serializing and deserializing keys. I have not benchmarked much, but runtimes seem barely changed at all by that. When a lot of keys are in memory, it improves memory use. And, it prevents keys sometimes getting PINNED in memory and failing to GC, which is a problem ByteString has sometimes. In particular, git-annex sync from a borg special remote had that problem and this improved its memory use by a large amount. Sponsored-by: Shae Erisson on Patreon	2021-10-05 20:20:08 -04:00
Joey Hess	837116ef1e	Fix support for readonly git remotes Boolean blindness oops. (Reversion in version 8.20210621) Sponsored-by: Dartmouth College's Datalad project	2021-08-30 12:34:19 -04:00
Joey Hess	fa62c98910	simplify and speed up Utility.FileSystemEncoding This eliminates the distinction between decodeBS and decodeBS', encodeBS and encodeBS', etc. The old implementation truncated at NUL, and the primed versions had to do extra work to avoid that problem. The new implementation does not truncate at NUL, and is also a lot faster. (Benchmarked at 2x faster for decodeBS and 3x for encodeBS; more for the primed versions.) Note that filepath-bytestring 1.4.2.1.8 contains the same optimisation, and upgrading to it will speed up to/fromRawFilePath. AFAIK, nothing relied on the old behavior of truncating at NUL. Some code used the faster versions in places where I was sure there would not be a NUL. So this change is unlikely to break anything. Also, moved s2w8 and w82s out of the module, as they do not involve filesystem encoding really. Sponsored-by: Shae Erisson on Patreon	2021-08-11 12:13:31 -04:00
Joey Hess	9f94d2894e	remove unused code	2021-07-30 18:01:36 -04:00
Joey Hess	a306560374	use SQL.addInodeCaches This avoids deadlock when opening the database handle calls reconcileStaged.	2021-07-27 17:34:56 -04:00
Joey Hess	73e0cbbb19	fix problem populating pointer files This is a result of an audit of every use of getInodeCaches, to find places that misbehave when the annex object is not in the inode cache, despite pointer files for the same key being in the inode cache. Unfortunately, that is the case for objects that were in v7 repos that upgraded to v8. Added a note about this gotcha to getInodeCaches. Database.Keys.reconcileStaged, then annex.thin is set, would fail to populate pointer files in this situation. Changed it to check if the annex object is unmodified the same way inAnnex does, falling back to a checksum if the inode cache is not recorded. Sponsored-by: Dartmouth College's Datalad project	2021-07-27 14:26:49 -04:00
Joey Hess	e4b2a067e0	fix potential race in updating inode cache In Annex.Content, the object file was statted after pointer files were populated. But if annex.thin is set, once the pointer files are populated, the object file can potentially be modified via the hard link. So, it was possible, though seemingly very unlikely, for the inode of the modified object file to be cached. Command.Fix and Command.Fsck had similar problems, statting the work tree files after they were in place. Changed them to stat the temp file that gets moved into place. This does rely on .git/annex being on the same filesystem. If it's not, the cached inode will not be the same as the one that the temp file gets moved to. Result will be that git-annex will later need to do an expensive verification of the content of the worktree files. Note that the cross-filesystem move of the temp file already is a larger amount of extra work, so this seems acceptable. Sponsored-by: Luke Shumaker on Patreon	2021-07-27 12:29:10 -04:00
Joey Hess	af9fdf5dba	verify associated files when checking numcopies Most of this is just refactoring. But, handleDropsFrom did not verify that associated files from the keys db were still accurate, and has now been fixed to. A minor improvement to this would be to avoid calling catKeyFile twice on the same file, when getting the numcopies and mincopies value, in the common case where the same file has the highest value for both. But, it avoids checking every associated file, so it will scale well to lots of dups already. Sponsored-by: Kevin Mueller on Patreon	2021-06-15 11:14:52 -04:00
Joey Hess	7b6deb1109	display scanning message whenever reconcileStaged has enough files to chew on Clear visible progress bar first. Removed showSideActionAfter because it can't be used in reconcileStaged (import loop). Instead, it counts the number of files it processes and displays it after it's seen a sufficient to know it's taking a while. Sponsored-by: Dartmouth College's Datalad project	2021-06-08 12:48:30 -04:00
Joey Hess	1a6fa5abc8	add debugging for reconcileStaged calls for benchmarking	2021-06-08 11:57:23 -04:00
Joey Hess	7f742589f9	claw back annexed file scan speedup Following commit `c941ab6f5b`, this avoids the second, redundant scan when annex.thin is not set. The benchmark now runs in 35.5 seconds, down from 40 seconds. Note that the inode cache of the annex object has to be passed to addInodeCaches now, because it might not already be in the inode caches, unlike previously. Sponsored-by: Dartmouth College's Datalad project	2021-06-08 11:09:15 -04:00
Joey Hess	ec1f2f246b	improve comment remove obsolete part about a commit preventing it seeing changes	2021-06-08 10:43:48 -04:00
Joey Hess	c941ab6f5b	avoid double work in git-annex init, second try reconcileStaged populates the db, so scanAnnexedFiles does not need to do it again. It still makes a pass over the HEAD tree, but populating the db was most of the expensive part. Benchmarking with 100,000 files, git-annex init now takes 40 seconds, vs 37 seconds with the old, buggy version of this fix. It should be possible to win those 3 precious seconds per 100k files back, in the case when when annex.thin is not set, with improvements to reconcileStaged that avoid needing this second pass. Sponsored-by: Dartmouth College's Datalad project	2021-06-08 09:36:53 -04:00
Joey Hess	2cb7b7b336	Revert "avoid double work in git-annex init" This reverts commit `0f10f208a7`. The implementation of this turns out to be unsafe; it can lead to a keys db deadlock. scanAnnexedFiles injects a call to inAnnex into reconcileStaged, but inAnnex sometimes needs to read from the keys db, which will try to re-open it when it's in the process of being opened. The exclusive lock of gitAnnexKeysDbLock will then deadlock. This needs to be done in some other way...	2021-06-08 09:11:24 -04:00
Joey Hess	c831a562f5	faster associated file replacement with upsert Rather than first deleting and then inserting, upsert lets the key associated with a file be updated in place. Benchmarked with 100,000 files, and an empty keys database, running reconcileStaged. It improved from 47 seconds to 34 seconds. So this got reconcileStaged to be as fast as scanAssociatedFiles, or faster -- scanAssociatedFiles benchmarks at 37 seconds. (Also checked for other users of deleteWhere that could be sped up by upsert. There are a couple, but they are not in performance critical code paths, eg recordExportTreeCurrent is only run once per tree export.) I would have liked to rename FileKeyIndex to FileKeyUnique since it is being used as a uniqueness constraint now, not just to get an index. But, that gets converted into part of the SQL schema, and the name is used by the upsert, so it can't be changed. Sponsored-by: Dartmouth College's Datalad project	2021-06-08 07:53:36 -04:00
Joey Hess	0f10f208a7	avoid double work in git-annex init reconcileStaged was doing a redundant scan to scannAnnexedFiles. It would probably make sense to move the body of scannAnnexedFiles into reconcileStaged, the separation does not really serve any purpose. Sponsored-by: Dartmouth College's Datalad project	2021-06-07 16:50:14 -04:00
Joey Hess	6ceb31a30a	optimise reconcileStaged with git cat-file streaming Commit `428c91606b` made it need to do more work in situations like switching between very different branches. Compare with seekFilteredKeys which has a similar optimisation. Might be possible to factor out the common part from these? Sponsored-by: Dartmouth College's Datalad project	2021-06-07 15:26:48 -04:00
Joey Hess	0101363eb8	correctly update keys db in merge conflict This is quite a subtle edge case, see the bug report for full details. The second git diff is needed only when there's a merge conflict. It would be possible to speed it up marginally by using --diff-filter=Unmerged, but probably not enough to bother with. Sponsored-by: Graham Spencer on Patreon	2021-06-07 12:52:36 -04:00
Joey Hess	5b7429e73a	avoid removing old associated file when there is a merge conflict It makes sense to keep the key used by the old version of an associated file, until the merge conflict is resolved. Note that, since in this case git diff is being run with --index, it's not possible to use -1 or -3, which would let the keys associated with the new versions of the file also be added. That would be better, because it's possible that the local modification to the file that caused the merge conflict has not yet gotten its new key recorded in the db. Opened a bug about a case this is thus not able to address. Sponsored-by: Boyd Stephen Smith Jr. on Patreon	2021-06-01 11:43:00 -04:00
Joey Hess	eb6f6ff9b8	speed up keys database writes There seems to be no reason to check the time here. I think it was inherited from code in Database.Fsck, which does have a reason to commit every few minutes. Removing that syscall speeds up a git-annex init in a repo with 100000 annexed files by about 3 seconds. Sponsored-by: Dartmouth College's Datalad project	2021-05-31 15:01:00 -04:00
Joey Hess	73e1507c72	fix deadlock git-annex test hung, at varying points depending on when git decided to run the smudge clean filter. Recent changes to reconcileStaged caused a deadlock, when git write-tree for some reason decides to run the smudge clean filter. Which tries to open the keys db, and blocks waiting for the lock file that its grandparent has locked. I don't know why git write-tree does that. It's supposed to only write a tree from the index which needs no smudge/clean filtering. I've verified that, in a situation where git write-tree runs the clean filter, disabling the filter results in a tree being written that contains the annex link, not eg, the worktree file content. So it seems safe to disable the clean filter, but also this seems likely to be working around a bug in git because it seems it is running the clean filter in a situation where the object has already been cleaned. Sponsored-by: Dartmouth College's Datalad project	2021-05-24 16:19:26 -04:00
Joey Hess	f46e4c9b7c	fix case where keys db was not initialized in time When the keys db is opened for read, and did not exist yet, it used to skip creating it, and return mempty values. But that prevents reconcileStaged from populating associated files information in time for the read. This fixes the one remaining case I know of where the fix in `a56b151f90` didn't work. Note that, when there is a permissions error, it still avoids creating the db and returns mempty for all queries. This does mean that reconcileStaged does not run and so it may want to drop files that it should not. However, presumably a permissions error on the keys database also means that the user does not have permission to delete annex objects, so they won't be able to drop the files anyway. Sponsored-by: Dartmouth College's Datalad project	2021-05-24 14:46:59 -04:00
Joey Hess	d62d6e2fcf	note about a wart All code that uses associated files already deals with this problem, which used to be worse. Unfortunately I was not able to entirely eliminate it, although it happens in fewer cases now.	2021-05-24 12:05:49 -04:00

1 2 3 4 5

236 commits