Commit graph

236 commits

Author SHA1 Message Date
Joey Hess
c834d2025a
queue more changes to keys db
Increasing the size of the queue 10x makes git-annex init 7% faster in a
repository with 86000 annexed files.

The memory use goes up, from 70876 kb to 85376 kb.
2022-11-18 13:29:34 -04:00
Joey Hess
8fcee4ac9d
Sped up the initial scanning for annexed files by 15%
Avoids database querying overhead when the database is newly created.

In the large repository where git-annex init took 24 seconds, this sped it
up to 20.47 seconds, a speedup of around 15%.

Sponsored-by: Dartmouth College's DANDI project
2022-11-18 13:16:57 -04:00
Joey Hess
cde2e61105
improve sqlite retrying behavior
Avoid hanging when a suspended git-annex process is keeping a sqlite
database locked.

Sponsored-by: Dartmouth College's Datalad project
2022-10-18 15:47:20 -04:00
Joey Hess
3149a1e2fe
More robust handling of ErrorBusy when writing to sqlite databases
While ErrorBusy and other exceptions were caught and the write retried for
up to 10 seconds, it was still possible for git-annex to eventually
give up and error out without writing to the database. Now it will retry
as long as necessary.

This does mean that, if one git-annex process is suspended just as sqlite
has locked the database for writing, another git-annex that tries to write
it it might get stuck retrying forever. But, that could already happen when
opening the sqlite database, which retries forever on ErrorBusy. This is an
area where git-annex is known to not behave well, there's a todo about the
general case of it.

Sponsored-by: Dartmouth College's Datalad project
2022-10-17 15:56:19 -04:00
Joey Hess
0d762acf7e
update comment, probably not a sqlite bug
Sqlite's page documenting WAL mode changed in Oct 2016 to mention ways
that queries could fail with SQLITE_BUSY.

http://web.archive.org/web/20161009044054/http://www.sqlite.org:80/wal.html

Probably not cooincidentally, I emailed sqlite-users about such a
situation in Feb 2015.
https://www.mail-archive.com/sqlite-users@mailinglists.sqlite.org/msg90580.html

Noone ever replied to me, but at least now I understand why it does that.
Since it's documented now, it's no longer a bug.
2022-10-17 15:09:47 -04:00
Joey Hess
6fbd337e34
avoid uncessary keys db writes; doubled speed!
When running eg git-annex get, for each file it has to read from and
write to the keys database. But it's reading exclusively from one table,
and writing to a different table. So, it is not necessary to flush the
write to the database before reading. This avoids writing the database
once per file, instead it will buffer 1000 changes before writing.

Benchmarking getting 1000 small files from a local origin,
git-annex get now takes 13.62s, down from 22.41s!
git-annex drop now takes 9.07s, down from 18.63s!
Wowowowowowowow!

(It would perhaps have been better if there were separate databases for
the two tables. At least it would have avoided this complexity. Ah well,
this is better than splitting the table in a annex.version upgrade.)

Sponsored-by: Dartmouth College's Datalad project
2022-10-12 15:33:16 -04:00
Joey Hess
ba7ecbc6a9
avoid flushing keys db queue after each Annex action
The flush was only done Annex.run' to make sure that the queue was flushed
before git-annex exits. But, doing it there means that as soon as one
change gets queued, it gets flushed soon after, which contributes to
excessive writes to the database, slowing git-annex down.
(This does not yet speed git-annex up, but it is a stepping stone to
doing so.)

Database queues do not autoflush when garbage collected, so have to
be flushed explicitly. I don't think it's possible to make them
autoflush (except perhaps if git-annex sqitched to using ResourceT..).
The comment in Database.Keys.closeDb used to be accurate, since the
automatic flushing did mean that all writes reached the database even
when closeDb was not called. But now, closeDb or flushDb needs to be
called before stopping using an Annex state. So, removed that comment.

In Remote.Git, change to using quiesce everywhere that it used to use
stopCoProcesses. This means that uses on onLocal in there are just as
slow as before. I considered only calling closeDb on the local git remotes
when git-annex exits. But, the reason that Remote.Git calls stopCoProcesses
in each onLocal is so as not to leave git processes running that have files
open on the remote repo, when it's on removable media. So, it seemed to make
sense to also closeDb after each one, since sqlite may also keep files
open. Although that has not seemed to cause problems with removable
media so far. It was also just easier to quiesce in each onLocal than
once at the end. This does likely leave performance on the floor, so
could be revisited.

In Annex.Content.saveState, there was no reason to close the db,
flushing it is enough.

The rest of the changes are from auditing for Annex.new, and making
sure that quiesce is called, after any action that might possibly need
it.

After that audit, I'm pretty sure that the change to Annex.run' is
safe. The only concern might be that this does let more changes get
queued for write to the db, and if git-annex is interrupted, those will be
lost. But interrupting git-annex can obviously already prevent it from
writing the most recent change to the db, so it must recover from such
lost data... right?

Sponsored-by: Dartmouth College's Datalad project
2022-10-12 14:12:23 -04:00
Yaroslav Halchenko
0151976676
Typo fix unncessary -> unnecessary.
Detected while reading recent CHANGELOG entry but then decided to apply
to entire codebase and docs since why not?
2022-08-20 09:40:19 -04:00
Joey Hess
b801812660
init: probe if sqlite works
Help the user get annex.dbdir configured when their filesystem is not
one that sqlite works on.

The change in Database.Handle makes an error from sqlite not be ignored
besides being displayed, which it was before. I can't see any reason
git-annex would want to ignore these errors.

I chose to use the fsck database rather than the keys database because
opening the keys database populates it, and see commit
b3c4579c79.

The placement of the call to checkSqliteWorks inside checkInitializeAllowed
avoids annex.uuid getting set before it's called.

Sponsored-by: Dartmouth College's Datalad project
2022-08-17 13:12:26 -04:00
Joey Hess
4cfe17a9e8
use a subdirectory of annex.dbdir
This allows annex.dbdir to be set globally or always set to the same
value when needed. Each repository uses a subdirectory of it.

Sponsored-by: Dartmouth College's Datalad project
2022-08-12 13:18:15 -04:00
Joey Hess
a335c1e46e
annex.dbdir fully working
Completes work started in e60766543f

I've verified that all the sqlite databases get stored in annex.dbdir
and are created successfully. If annex.dbdir does not exist, it will be
created; its parent directory must already exist though.

Sponsored-by: Dartmouth College's Datalad project
2022-08-12 13:06:58 -04:00
Joey Hess
23c6e350cb
improve createDirectoryUnder to allow alternate top directories
This should not change the behavior of it, unless there are multiple top
directories, and then it should behave the same as if there was a single
top directory that was actually above the directory to be created.

Sponsored-by: Dartmouth College's Datalad project
2022-08-12 12:52:37 -04:00
Joey Hess
e60766543f
add annex.dbdir (WIP)
WIP: This is mostly complete, but there is a problem: createDirectoryUnder
throws an error when annex.dbdir is set to outside the git repo.

annex.dbdir is a workaround for filesystems where sqlite does not work,
due to eg, the filesystem not properly supporting locking.

It's intended to be set before initializing the repository. Changing it
in an existing repository can be done, but would be the same as making a
new repository and moving all the annexed objects into it. While the
databases get recreated from the git-annex branch in that situation, any
information that is in the databases but not stored in the branch gets
lost. It may be that no information ever gets stored in the databases
that cannot be reconstructed from the branch, but I have not verified
that.

Sponsored-by: Dartmouth College's Datalad project
2022-08-11 16:58:53 -04:00
Joey Hess
2d65c4ff1d
avoid unix-compat's rename
On Windows, that does not support long paths
https://github.com/jacobstanley/unix-compat/issues/56

Instead, use System.Directory.renamePath, which does support long paths.

Sponsored-by: Dartmouth College's Datalad project
2022-07-12 14:55:02 -04:00
Joey Hess
95a04920cf
remove objectDir' 2022-06-22 16:08:49 -04:00
Joey Hess
5da1a78508
add debugging around commits to sqlite dbs 2022-06-06 12:36:55 -04:00
Joey Hess
331c97df88
fix MVar deadlock when sqlite commit fails
The database queue was left empty, which caused subsequent calls to
flushDbQueue to deadlock.

Sponsored-by: Dartmouth College's Datalad project
2022-06-06 12:16:55 -04:00
Joey Hess
09edb07ac5
add debugLocks around database operations
to track down a blocked indefinitely on MVar that seems to occur after
sqlite throws ErrorBusy but that I have not been able to reproduce when
I made commits synthetically throw ErrorBusy.

Sponsored-by: Dartmouth College's Datalad project
2022-06-03 14:16:28 -04:00
Joey Hess
649464619e
read up to and including maxPointerSz
For consistency with everything else.

Sponsored-by: Dartmouth College's Datalad project
2022-02-23 12:54:40 -04:00
Joey Hess
5b373a9dd2
read a consistent amount from pointer file
A few places were reading the max symlink size of a pointer file,
then passing tp parseLinkTargetOrPointer. Which is fine currently, but
to support pointer files with lines of data after the pointer, enough
has to be read that parseLinkTargetOrPointer can be assured of seeing
enough of that data to know if it's correctly formatted.

Sponsored-by: Dartmouth College's Datalad project
2022-02-23 12:52:34 -04:00
Joey Hess
46d5098ff4
Pass --no-textconv when running git diff internally
Seems that --no-ext-diff and -c diff.external= are not enough to disable
external diff command when gitattributes textconv specifies it.

I'm pretty sure that --no-ext-diff and -c diff.external= are not both
needed, but not 100%. Something about -G may need the latter to fully
disable diffs in some cases. So kept that part as it was.

Sponsored-by: Dartmouth College's Datalad project
2022-02-01 13:43:18 -04:00
Joey Hess
f54c58f0df
Avoid crashing when run in a bare git repo that somehow contains an index file
Do not populate the keys database with associated files,
because a bare repo has no working tree, and so it does not make sense to
populate it.

Queries of associated files in the keys database always return empty lists
in a bare repo, even if it's somehow populated. One way it could be
populated is if a user converts a non-bare repo to a bare repo.

Note that Git.Config.isBare does a string comparison, so this is not free!
But, that string comparison is very small compared to a sqlite query.

Sponsored-by: Erik Bjäreholt on Patreon
2022-01-11 13:01:49 -04:00
Joey Hess
d0ef8303cf
avoid using a second db connection for writes
This is a potentially breaking change in a very delicate area. However,
examining the code path for writes, I don't see any benefit to opening a
second db connection for them. If the write throws an exception,
commitDb will retry it with a new db connection.

A potential benefit to not opening a second db connection, beyond using
less resources, is it just might avoid problems in WSL with sqlite that
I have hypothesized are caused by multiple db connections.

Commit 5f9eff3f32 explains why it needs to
shut down the db connection to force the database to be updated on disk:
When closeDb does not get called, garbage collection of DbHandle may not
give the workterThread time to cleanly shut down before git-annex exits,
resulting in a recently written change not reaching disk.
2021-10-20 12:32:46 -04:00
Joey Hess
f5b642318d
eliminate single/multi writer distinction
After commit f4bdecc4ec, there is no
longer any distinction between SingleWriter and MultiWriter's handling
of read after write.

Databases that were SingleWriter still have lock files that are used to
prevent multiple writers.

This does make writing to such databases a bit more expensive,
because the MultiWriter code path that is now used opens a second db
connection in order to write to them.
2021-10-20 12:26:30 -04:00
Joey Hess
c47794991c
improve with continuation
no behavior change
2021-10-20 12:13:49 -04:00
Joey Hess
f4bdecc4ec
improve sqlite MultiWriter handling of read after write
This removes a messy caveat that was easy to forget and caused at least one
bug. The price paid is that, after a write to a MultiWriter db, it has to
close the db connection that it had been using to read, and open a new
connection. So it might be a little bit slower. But, writes are usually
batched together, so there's often only a single write, and so there should
not be much of a slowdown. Notice that SingleWriter already closed the db
connection after a write, so paid the same overhead.

This is the second try at fixing a bug: git-annex get when run as the first
git-annex command in a new repo did not populate all unlocked files.
(Reversion in version 8.20210621)

Sponsored-by: Boyd Stephen Smith Jr. on Patreon
2021-10-19 15:13:29 -04:00
Joey Hess
0f38ad9a69
close keys db to possibly work around WSL1 issue 2021-10-19 13:07:49 -04:00
Joey Hess
19e78816f0
convert Key to ShortByteString
This adds the overhead of a copy when serializing and deserializing keys.
I have not benchmarked much, but runtimes seem barely changed at all by that.

When a lot of keys are in memory, it improves memory use.

And, it prevents keys sometimes getting PINNED in memory and failing to GC,
which is a problem ByteString has sometimes. In particular, git-annex sync
from a borg special remote had that problem and this improved its memory
use by a large amount.

Sponsored-by: Shae Erisson on Patreon
2021-10-05 20:20:08 -04:00
Joey Hess
837116ef1e
Fix support for readonly git remotes
Boolean blindness oops.

(Reversion in version 8.20210621)

Sponsored-by: Dartmouth College's Datalad project
2021-08-30 12:34:19 -04:00
Joey Hess
fa62c98910
simplify and speed up Utility.FileSystemEncoding
This eliminates the distinction between decodeBS and decodeBS', encodeBS
and encodeBS', etc. The old implementation truncated at NUL, and the
primed versions had to do extra work to avoid that problem. The new
implementation does not truncate at NUL, and is also a lot faster.
(Benchmarked at 2x faster for decodeBS and 3x for encodeBS; more for the
primed versions.)

Note that filepath-bytestring 1.4.2.1.8 contains the same optimisation,
and upgrading to it will speed up to/fromRawFilePath.

AFAIK, nothing relied on the old behavior of truncating at NUL. Some
code used the faster versions in places where I was sure there would not
be a NUL. So this change is unlikely to break anything.

Also, moved s2w8 and w82s out of the module, as they do not involve
filesystem encoding really.

Sponsored-by: Shae Erisson on Patreon
2021-08-11 12:13:31 -04:00
Joey Hess
9f94d2894e
remove unused code 2021-07-30 18:01:36 -04:00
Joey Hess
a306560374
use SQL.addInodeCaches
This avoids deadlock when opening the database handle calls
reconcileStaged.
2021-07-27 17:34:56 -04:00
Joey Hess
73e0cbbb19
fix problem populating pointer files
This is a result of an audit of every use of getInodeCaches,
to find places that misbehave when the annex object is not in the inode
cache, despite pointer files for the same key being in the inode cache.

Unfortunately, that is the case for objects that were in v7 repos that
upgraded to v8. Added a note about this gotcha to getInodeCaches.

Database.Keys.reconcileStaged, then annex.thin is set, would fail to
populate pointer files in this situation. Changed it to check if the
annex object is unmodified the same way inAnnex does, falling back to a
checksum if the inode cache is not recorded.

Sponsored-by: Dartmouth College's Datalad project
2021-07-27 14:26:49 -04:00
Joey Hess
e4b2a067e0
fix potential race in updating inode cache
In Annex.Content, the object file was statted after pointer files were
populated. But if annex.thin is set, once the pointer files are
populated, the object file can potentially be modified via the hard
link. So, it was possible, though seemingly very unlikely, for the inode
of the modified object file to be cached.

Command.Fix and Command.Fsck had similar problems, statting the work
tree files after they were in place. Changed them to stat the temp file
that gets moved into place. This does rely on .git/annex being on the
same filesystem. If it's not, the cached inode will not be the same as
the one that the temp file gets moved to. Result will be that git-annex
will later need to do an expensive verification of the content of the
worktree files. Note that the cross-filesystem move of the temp file
already is a larger amount of extra work, so this seems acceptable.

Sponsored-by: Luke Shumaker on Patreon
2021-07-27 12:29:10 -04:00
Joey Hess
af9fdf5dba
verify associated files when checking numcopies
Most of this is just refactoring. But, handleDropsFrom
did not verify that associated files from the keys db were still
accurate, and has now been fixed to.

A minor improvement to this would be to avoid calling catKeyFile
twice on the same file, when getting the numcopies and mincopies value,
in the common case where the same file has the highest value for both.
But, it avoids checking every associated file, so it will scale well to
lots of dups already.

Sponsored-by: Kevin Mueller on Patreon
2021-06-15 11:14:52 -04:00
Joey Hess
7b6deb1109
display scanning message whenever reconcileStaged has enough files to chew on
Clear visible progress bar first.

Removed showSideActionAfter because it can't be used in reconcileStaged
(import loop). Instead, it counts the number of files it
processes and displays it after it's seen a sufficient to know it's
taking a while.

Sponsored-by: Dartmouth College's Datalad project
2021-06-08 12:48:30 -04:00
Joey Hess
1a6fa5abc8
add debugging for reconcileStaged calls for benchmarking 2021-06-08 11:57:23 -04:00
Joey Hess
7f742589f9
claw back annexed file scan speedup
Following commit c941ab6f5b, this avoids
the second, redundant scan when annex.thin is not set.

The benchmark now runs in 35.5 seconds, down from 40 seconds.

Note that the inode cache of the annex object has to be passed to
addInodeCaches now, because it might not already be in the inode caches,
unlike previously.

Sponsored-by: Dartmouth College's Datalad project
2021-06-08 11:09:15 -04:00
Joey Hess
ec1f2f246b
improve comment
remove obsolete part about a commit preventing it seeing changes
2021-06-08 10:43:48 -04:00
Joey Hess
c941ab6f5b
avoid double work in git-annex init, second try
reconcileStaged populates the db, so scanAnnexedFiles does not need to
do it again. It still makes a pass over the HEAD tree, but populating
the db was most of the expensive part.

Benchmarking with 100,000 files, git-annex init now takes 40 seconds,
vs 37 seconds with the old, buggy version of this fix. It should be
possible to win those 3 precious seconds per 100k files back, in the
case when when annex.thin is not set, with improvements to reconcileStaged
that avoid needing this second pass.

Sponsored-by: Dartmouth College's Datalad project
2021-06-08 09:36:53 -04:00
Joey Hess
2cb7b7b336
Revert "avoid double work in git-annex init"
This reverts commit 0f10f208a7.

The implementation of this turns out to be unsafe; it can lead to a keys
db deadlock. scanAnnexedFiles injects a call to inAnnex into
reconcileStaged, but inAnnex sometimes needs to read from the keys db,
which will try to re-open it when it's in the process of being opened.
The exclusive lock of gitAnnexKeysDbLock will then deadlock.

This needs to be done in some other way...
2021-06-08 09:11:24 -04:00
Joey Hess
c831a562f5
faster associated file replacement with upsert
Rather than first deleting and then inserting, upsert lets the key
associated with a file be updated in place.

Benchmarked with 100,000 files, and an empty keys database, running
reconcileStaged. It improved from 47 seconds to 34 seconds.
So this got reconcileStaged to be as fast as scanAssociatedFiles,
or faster -- scanAssociatedFiles benchmarks at 37 seconds.

(Also checked for other users of deleteWhere that could be sped up by
upsert. There are a couple, but they are not in performance critical
code paths, eg recordExportTreeCurrent is only run once per tree
export.)

I would have liked to rename FileKeyIndex to FileKeyUnique since it is
being used as a uniqueness constraint now, not just to get an index.
But, that gets converted into part of the SQL schema, and the name
is used by the upsert, so it can't be changed.

Sponsored-by: Dartmouth College's Datalad project
2021-06-08 07:53:36 -04:00
Joey Hess
0f10f208a7
avoid double work in git-annex init
reconcileStaged was doing a redundant scan to scannAnnexedFiles.

It would probably make sense to move the body of scannAnnexedFiles
into reconcileStaged, the separation does not really serve any purpose.

Sponsored-by: Dartmouth College's Datalad project
2021-06-07 16:50:14 -04:00
Joey Hess
6ceb31a30a
optimise reconcileStaged with git cat-file streaming
Commit 428c91606b made it need to do more
work in situations like switching between very different branches.

Compare with seekFilteredKeys which has a similar optimisation. Might be
possible to factor out the common part from these?

Sponsored-by: Dartmouth College's Datalad project
2021-06-07 15:26:48 -04:00
Joey Hess
0101363eb8
correctly update keys db in merge conflict
This is quite a subtle edge case, see the bug report for full details.

The second git diff is needed only when there's a merge conflict.
It would be possible to speed it up marginally by using
--diff-filter=Unmerged, but probably not enough to bother with.

Sponsored-by: Graham Spencer on Patreon
2021-06-07 12:52:36 -04:00
Joey Hess
5b7429e73a
avoid removing old associated file when there is a merge conflict
It makes sense to keep the key used by the old version of an
associated file, until the merge conflict is resolved.

Note that, since in this case git diff is being run with --index, it's
not possible to use -1 or -3, which would let the keys
associated with the new versions of the file also be added. That would
be better, because it's possible that the local modification to the file
that caused the merge conflict has not yet gotten its new key recorded
in the db.

Opened a bug about a case this is thus not able to address.

Sponsored-by: Boyd Stephen Smith Jr. on Patreon
2021-06-01 11:43:00 -04:00
Joey Hess
eb6f6ff9b8
speed up keys database writes
There seems to be no reason to check the time here. I think it was
inherited from code in Database.Fsck, which does have a reason to commit
every few minutes. Removing that syscall speeds up a git-annex init
in a repo with 100000 annexed files by about 3 seconds.

Sponsored-by: Dartmouth College's Datalad project
2021-05-31 15:01:00 -04:00
Joey Hess
73e1507c72
fix deadlock
git-annex test hung, at varying points depending
on when git decided to run the smudge clean filter.

Recent changes to reconcileStaged caused a deadlock, when git write-tree
for some reason decides to run the smudge clean filter. Which tries
to open the keys db, and blocks waiting for the lock file that its
grandparent has locked.

I don't know why git write-tree does that. It's supposed to only write a
tree from the index which needs no smudge/clean filtering.

I've verified that, in a situation where git write-tree runs the clean
filter, disabling the filter results in a tree being written that
contains the annex link, not eg, the worktree file content. So it seems
safe to disable the clean filter, but also this seems likely to be
working around a bug in git because it seems it is running the clean
filter in a situation where the object has already been cleaned.

Sponsored-by: Dartmouth College's Datalad project
2021-05-24 16:19:26 -04:00
Joey Hess
f46e4c9b7c
fix case where keys db was not initialized in time
When the keys db is opened for read, and did not exist yet, it used to
skip creating it, and return mempty values. But that prevents
reconcileStaged from populating associated files information in time for
the read. This fixes the one remaining case I know of where
the fix in a56b151f90 didn't work.

Note that, when there is a permissions error, it still avoids creating
the db and returns mempty for all queries. This does mean that
reconcileStaged does not run and so it may want to drop files that it
should not. However, presumably a permissions error on the keys database
also means that the user does not have permission to delete annex
objects, so they won't be able to drop the files anyway.

Sponsored-by: Dartmouth College's Datalad project
2021-05-24 14:46:59 -04:00
Joey Hess
d62d6e2fcf
note about a wart
All code that uses associated files already deals with this problem,
which used to be worse. Unfortunately I was not able to entirely
eliminate it, although it happens in fewer cases now.
2021-05-24 12:05:49 -04:00