Commit graph

286 commits

Author SHA1 Message Date
Joey Hess
c834d2025a
queue more changes to keys db
Increasing the size of the queue 10x makes git-annex init 7% faster in a
repository with 86000 annexed files.

The memory use goes up, from 70876 kb to 85376 kb.
2022-11-18 13:29:34 -04:00
Joey Hess
8fcee4ac9d
Sped up the initial scanning for annexed files by 15%
Avoids database querying overhead when the database is newly created.

In the large repository where git-annex init took 24 seconds, this sped it
up to 20.47 seconds, a speedup of around 15%.

Sponsored-by: Dartmouth College's DANDI project
2022-11-18 13:16:57 -04:00
Joey Hess
cde2e61105
improve sqlite retrying behavior
Avoid hanging when a suspended git-annex process is keeping a sqlite
database locked.

Sponsored-by: Dartmouth College's Datalad project
2022-10-18 15:47:20 -04:00
Joey Hess
3149a1e2fe
More robust handling of ErrorBusy when writing to sqlite databases
While ErrorBusy and other exceptions were caught and the write retried for
up to 10 seconds, it was still possible for git-annex to eventually
give up and error out without writing to the database. Now it will retry
as long as necessary.

This does mean that, if one git-annex process is suspended just as sqlite
has locked the database for writing, another git-annex that tries to write
it it might get stuck retrying forever. But, that could already happen when
opening the sqlite database, which retries forever on ErrorBusy. This is an
area where git-annex is known to not behave well, there's a todo about the
general case of it.

Sponsored-by: Dartmouth College's Datalad project
2022-10-17 15:56:19 -04:00
Joey Hess
0d762acf7e
update comment, probably not a sqlite bug
Sqlite's page documenting WAL mode changed in Oct 2016 to mention ways
that queries could fail with SQLITE_BUSY.

http://web.archive.org/web/20161009044054/http://www.sqlite.org:80/wal.html

Probably not cooincidentally, I emailed sqlite-users about such a
situation in Feb 2015.
https://www.mail-archive.com/sqlite-users@mailinglists.sqlite.org/msg90580.html

Noone ever replied to me, but at least now I understand why it does that.
Since it's documented now, it's no longer a bug.
2022-10-17 15:09:47 -04:00
Joey Hess
6fbd337e34
avoid uncessary keys db writes; doubled speed!
When running eg git-annex get, for each file it has to read from and
write to the keys database. But it's reading exclusively from one table,
and writing to a different table. So, it is not necessary to flush the
write to the database before reading. This avoids writing the database
once per file, instead it will buffer 1000 changes before writing.

Benchmarking getting 1000 small files from a local origin,
git-annex get now takes 13.62s, down from 22.41s!
git-annex drop now takes 9.07s, down from 18.63s!
Wowowowowowowow!

(It would perhaps have been better if there were separate databases for
the two tables. At least it would have avoided this complexity. Ah well,
this is better than splitting the table in a annex.version upgrade.)

Sponsored-by: Dartmouth College's Datalad project
2022-10-12 15:33:16 -04:00
Joey Hess
ba7ecbc6a9
avoid flushing keys db queue after each Annex action
The flush was only done Annex.run' to make sure that the queue was flushed
before git-annex exits. But, doing it there means that as soon as one
change gets queued, it gets flushed soon after, which contributes to
excessive writes to the database, slowing git-annex down.
(This does not yet speed git-annex up, but it is a stepping stone to
doing so.)

Database queues do not autoflush when garbage collected, so have to
be flushed explicitly. I don't think it's possible to make them
autoflush (except perhaps if git-annex sqitched to using ResourceT..).
The comment in Database.Keys.closeDb used to be accurate, since the
automatic flushing did mean that all writes reached the database even
when closeDb was not called. But now, closeDb or flushDb needs to be
called before stopping using an Annex state. So, removed that comment.

In Remote.Git, change to using quiesce everywhere that it used to use
stopCoProcesses. This means that uses on onLocal in there are just as
slow as before. I considered only calling closeDb on the local git remotes
when git-annex exits. But, the reason that Remote.Git calls stopCoProcesses
in each onLocal is so as not to leave git processes running that have files
open on the remote repo, when it's on removable media. So, it seemed to make
sense to also closeDb after each one, since sqlite may also keep files
open. Although that has not seemed to cause problems with removable
media so far. It was also just easier to quiesce in each onLocal than
once at the end. This does likely leave performance on the floor, so
could be revisited.

In Annex.Content.saveState, there was no reason to close the db,
flushing it is enough.

The rest of the changes are from auditing for Annex.new, and making
sure that quiesce is called, after any action that might possibly need
it.

After that audit, I'm pretty sure that the change to Annex.run' is
safe. The only concern might be that this does let more changes get
queued for write to the db, and if git-annex is interrupted, those will be
lost. But interrupting git-annex can obviously already prevent it from
writing the most recent change to the db, so it must recover from such
lost data... right?

Sponsored-by: Dartmouth College's Datalad project
2022-10-12 14:12:23 -04:00
Yaroslav Halchenko
0151976676
Typo fix unncessary -> unnecessary.
Detected while reading recent CHANGELOG entry but then decided to apply
to entire codebase and docs since why not?
2022-08-20 09:40:19 -04:00
Joey Hess
b801812660
init: probe if sqlite works
Help the user get annex.dbdir configured when their filesystem is not
one that sqlite works on.

The change in Database.Handle makes an error from sqlite not be ignored
besides being displayed, which it was before. I can't see any reason
git-annex would want to ignore these errors.

I chose to use the fsck database rather than the keys database because
opening the keys database populates it, and see commit
b3c4579c79.

The placement of the call to checkSqliteWorks inside checkInitializeAllowed
avoids annex.uuid getting set before it's called.

Sponsored-by: Dartmouth College's Datalad project
2022-08-17 13:12:26 -04:00
Joey Hess
4cfe17a9e8
use a subdirectory of annex.dbdir
This allows annex.dbdir to be set globally or always set to the same
value when needed. Each repository uses a subdirectory of it.

Sponsored-by: Dartmouth College's Datalad project
2022-08-12 13:18:15 -04:00
Joey Hess
a335c1e46e
annex.dbdir fully working
Completes work started in e60766543f

I've verified that all the sqlite databases get stored in annex.dbdir
and are created successfully. If annex.dbdir does not exist, it will be
created; its parent directory must already exist though.

Sponsored-by: Dartmouth College's Datalad project
2022-08-12 13:06:58 -04:00
Joey Hess
23c6e350cb
improve createDirectoryUnder to allow alternate top directories
This should not change the behavior of it, unless there are multiple top
directories, and then it should behave the same as if there was a single
top directory that was actually above the directory to be created.

Sponsored-by: Dartmouth College's Datalad project
2022-08-12 12:52:37 -04:00
Joey Hess
e60766543f
add annex.dbdir (WIP)
WIP: This is mostly complete, but there is a problem: createDirectoryUnder
throws an error when annex.dbdir is set to outside the git repo.

annex.dbdir is a workaround for filesystems where sqlite does not work,
due to eg, the filesystem not properly supporting locking.

It's intended to be set before initializing the repository. Changing it
in an existing repository can be done, but would be the same as making a
new repository and moving all the annexed objects into it. While the
databases get recreated from the git-annex branch in that situation, any
information that is in the databases but not stored in the branch gets
lost. It may be that no information ever gets stored in the databases
that cannot be reconstructed from the branch, but I have not verified
that.

Sponsored-by: Dartmouth College's Datalad project
2022-08-11 16:58:53 -04:00
Joey Hess
2d65c4ff1d
avoid unix-compat's rename
On Windows, that does not support long paths
https://github.com/jacobstanley/unix-compat/issues/56

Instead, use System.Directory.renamePath, which does support long paths.

Sponsored-by: Dartmouth College's Datalad project
2022-07-12 14:55:02 -04:00
Joey Hess
95a04920cf
remove objectDir' 2022-06-22 16:08:49 -04:00
Joey Hess
5da1a78508
add debugging around commits to sqlite dbs 2022-06-06 12:36:55 -04:00
Joey Hess
331c97df88
fix MVar deadlock when sqlite commit fails
The database queue was left empty, which caused subsequent calls to
flushDbQueue to deadlock.

Sponsored-by: Dartmouth College's Datalad project
2022-06-06 12:16:55 -04:00
Joey Hess
09edb07ac5
add debugLocks around database operations
to track down a blocked indefinitely on MVar that seems to occur after
sqlite throws ErrorBusy but that I have not been able to reproduce when
I made commits synthetically throw ErrorBusy.

Sponsored-by: Dartmouth College's Datalad project
2022-06-03 14:16:28 -04:00
Joey Hess
649464619e
read up to and including maxPointerSz
For consistency with everything else.

Sponsored-by: Dartmouth College's Datalad project
2022-02-23 12:54:40 -04:00
Joey Hess
5b373a9dd2
read a consistent amount from pointer file
A few places were reading the max symlink size of a pointer file,
then passing tp parseLinkTargetOrPointer. Which is fine currently, but
to support pointer files with lines of data after the pointer, enough
has to be read that parseLinkTargetOrPointer can be assured of seeing
enough of that data to know if it's correctly formatted.

Sponsored-by: Dartmouth College's Datalad project
2022-02-23 12:52:34 -04:00
Joey Hess
46d5098ff4
Pass --no-textconv when running git diff internally
Seems that --no-ext-diff and -c diff.external= are not enough to disable
external diff command when gitattributes textconv specifies it.

I'm pretty sure that --no-ext-diff and -c diff.external= are not both
needed, but not 100%. Something about -G may need the latter to fully
disable diffs in some cases. So kept that part as it was.

Sponsored-by: Dartmouth College's Datalad project
2022-02-01 13:43:18 -04:00
Joey Hess
f54c58f0df
Avoid crashing when run in a bare git repo that somehow contains an index file
Do not populate the keys database with associated files,
because a bare repo has no working tree, and so it does not make sense to
populate it.

Queries of associated files in the keys database always return empty lists
in a bare repo, even if it's somehow populated. One way it could be
populated is if a user converts a non-bare repo to a bare repo.

Note that Git.Config.isBare does a string comparison, so this is not free!
But, that string comparison is very small compared to a sqlite query.

Sponsored-by: Erik Bjäreholt on Patreon
2022-01-11 13:01:49 -04:00
Joey Hess
d0ef8303cf
avoid using a second db connection for writes
This is a potentially breaking change in a very delicate area. However,
examining the code path for writes, I don't see any benefit to opening a
second db connection for them. If the write throws an exception,
commitDb will retry it with a new db connection.

A potential benefit to not opening a second db connection, beyond using
less resources, is it just might avoid problems in WSL with sqlite that
I have hypothesized are caused by multiple db connections.

Commit 5f9eff3f32 explains why it needs to
shut down the db connection to force the database to be updated on disk:
When closeDb does not get called, garbage collection of DbHandle may not
give the workterThread time to cleanly shut down before git-annex exits,
resulting in a recently written change not reaching disk.
2021-10-20 12:32:46 -04:00
Joey Hess
f5b642318d
eliminate single/multi writer distinction
After commit f4bdecc4ec, there is no
longer any distinction between SingleWriter and MultiWriter's handling
of read after write.

Databases that were SingleWriter still have lock files that are used to
prevent multiple writers.

This does make writing to such databases a bit more expensive,
because the MultiWriter code path that is now used opens a second db
connection in order to write to them.
2021-10-20 12:26:30 -04:00
Joey Hess
c47794991c
improve with continuation
no behavior change
2021-10-20 12:13:49 -04:00
Joey Hess
f4bdecc4ec
improve sqlite MultiWriter handling of read after write
This removes a messy caveat that was easy to forget and caused at least one
bug. The price paid is that, after a write to a MultiWriter db, it has to
close the db connection that it had been using to read, and open a new
connection. So it might be a little bit slower. But, writes are usually
batched together, so there's often only a single write, and so there should
not be much of a slowdown. Notice that SingleWriter already closed the db
connection after a write, so paid the same overhead.

This is the second try at fixing a bug: git-annex get when run as the first
git-annex command in a new repo did not populate all unlocked files.
(Reversion in version 8.20210621)

Sponsored-by: Boyd Stephen Smith Jr. on Patreon
2021-10-19 15:13:29 -04:00
Joey Hess
0f38ad9a69
close keys db to possibly work around WSL1 issue 2021-10-19 13:07:49 -04:00
Joey Hess
19e78816f0
convert Key to ShortByteString
This adds the overhead of a copy when serializing and deserializing keys.
I have not benchmarked much, but runtimes seem barely changed at all by that.

When a lot of keys are in memory, it improves memory use.

And, it prevents keys sometimes getting PINNED in memory and failing to GC,
which is a problem ByteString has sometimes. In particular, git-annex sync
from a borg special remote had that problem and this improved its memory
use by a large amount.

Sponsored-by: Shae Erisson on Patreon
2021-10-05 20:20:08 -04:00
Joey Hess
837116ef1e
Fix support for readonly git remotes
Boolean blindness oops.

(Reversion in version 8.20210621)

Sponsored-by: Dartmouth College's Datalad project
2021-08-30 12:34:19 -04:00
Joey Hess
fa62c98910
simplify and speed up Utility.FileSystemEncoding
This eliminates the distinction between decodeBS and decodeBS', encodeBS
and encodeBS', etc. The old implementation truncated at NUL, and the
primed versions had to do extra work to avoid that problem. The new
implementation does not truncate at NUL, and is also a lot faster.
(Benchmarked at 2x faster for decodeBS and 3x for encodeBS; more for the
primed versions.)

Note that filepath-bytestring 1.4.2.1.8 contains the same optimisation,
and upgrading to it will speed up to/fromRawFilePath.

AFAIK, nothing relied on the old behavior of truncating at NUL. Some
code used the faster versions in places where I was sure there would not
be a NUL. So this change is unlikely to break anything.

Also, moved s2w8 and w82s out of the module, as they do not involve
filesystem encoding really.

Sponsored-by: Shae Erisson on Patreon
2021-08-11 12:13:31 -04:00
Joey Hess
9f94d2894e
remove unused code 2021-07-30 18:01:36 -04:00
Joey Hess
a306560374
use SQL.addInodeCaches
This avoids deadlock when opening the database handle calls
reconcileStaged.
2021-07-27 17:34:56 -04:00
Joey Hess
73e0cbbb19
fix problem populating pointer files
This is a result of an audit of every use of getInodeCaches,
to find places that misbehave when the annex object is not in the inode
cache, despite pointer files for the same key being in the inode cache.

Unfortunately, that is the case for objects that were in v7 repos that
upgraded to v8. Added a note about this gotcha to getInodeCaches.

Database.Keys.reconcileStaged, then annex.thin is set, would fail to
populate pointer files in this situation. Changed it to check if the
annex object is unmodified the same way inAnnex does, falling back to a
checksum if the inode cache is not recorded.

Sponsored-by: Dartmouth College's Datalad project
2021-07-27 14:26:49 -04:00
Joey Hess
e4b2a067e0
fix potential race in updating inode cache
In Annex.Content, the object file was statted after pointer files were
populated. But if annex.thin is set, once the pointer files are
populated, the object file can potentially be modified via the hard
link. So, it was possible, though seemingly very unlikely, for the inode
of the modified object file to be cached.

Command.Fix and Command.Fsck had similar problems, statting the work
tree files after they were in place. Changed them to stat the temp file
that gets moved into place. This does rely on .git/annex being on the
same filesystem. If it's not, the cached inode will not be the same as
the one that the temp file gets moved to. Result will be that git-annex
will later need to do an expensive verification of the content of the
worktree files. Note that the cross-filesystem move of the temp file
already is a larger amount of extra work, so this seems acceptable.

Sponsored-by: Luke Shumaker on Patreon
2021-07-27 12:29:10 -04:00
Joey Hess
af9fdf5dba
verify associated files when checking numcopies
Most of this is just refactoring. But, handleDropsFrom
did not verify that associated files from the keys db were still
accurate, and has now been fixed to.

A minor improvement to this would be to avoid calling catKeyFile
twice on the same file, when getting the numcopies and mincopies value,
in the common case where the same file has the highest value for both.
But, it avoids checking every associated file, so it will scale well to
lots of dups already.

Sponsored-by: Kevin Mueller on Patreon
2021-06-15 11:14:52 -04:00
Joey Hess
7b6deb1109
display scanning message whenever reconcileStaged has enough files to chew on
Clear visible progress bar first.

Removed showSideActionAfter because it can't be used in reconcileStaged
(import loop). Instead, it counts the number of files it
processes and displays it after it's seen a sufficient to know it's
taking a while.

Sponsored-by: Dartmouth College's Datalad project
2021-06-08 12:48:30 -04:00
Joey Hess
1a6fa5abc8
add debugging for reconcileStaged calls for benchmarking 2021-06-08 11:57:23 -04:00
Joey Hess
7f742589f9
claw back annexed file scan speedup
Following commit c941ab6f5b, this avoids
the second, redundant scan when annex.thin is not set.

The benchmark now runs in 35.5 seconds, down from 40 seconds.

Note that the inode cache of the annex object has to be passed to
addInodeCaches now, because it might not already be in the inode caches,
unlike previously.

Sponsored-by: Dartmouth College's Datalad project
2021-06-08 11:09:15 -04:00
Joey Hess
ec1f2f246b
improve comment
remove obsolete part about a commit preventing it seeing changes
2021-06-08 10:43:48 -04:00
Joey Hess
c941ab6f5b
avoid double work in git-annex init, second try
reconcileStaged populates the db, so scanAnnexedFiles does not need to
do it again. It still makes a pass over the HEAD tree, but populating
the db was most of the expensive part.

Benchmarking with 100,000 files, git-annex init now takes 40 seconds,
vs 37 seconds with the old, buggy version of this fix. It should be
possible to win those 3 precious seconds per 100k files back, in the
case when when annex.thin is not set, with improvements to reconcileStaged
that avoid needing this second pass.

Sponsored-by: Dartmouth College's Datalad project
2021-06-08 09:36:53 -04:00
Joey Hess
2cb7b7b336
Revert "avoid double work in git-annex init"
This reverts commit 0f10f208a7.

The implementation of this turns out to be unsafe; it can lead to a keys
db deadlock. scanAnnexedFiles injects a call to inAnnex into
reconcileStaged, but inAnnex sometimes needs to read from the keys db,
which will try to re-open it when it's in the process of being opened.
The exclusive lock of gitAnnexKeysDbLock will then deadlock.

This needs to be done in some other way...
2021-06-08 09:11:24 -04:00
Joey Hess
c831a562f5
faster associated file replacement with upsert
Rather than first deleting and then inserting, upsert lets the key
associated with a file be updated in place.

Benchmarked with 100,000 files, and an empty keys database, running
reconcileStaged. It improved from 47 seconds to 34 seconds.
So this got reconcileStaged to be as fast as scanAssociatedFiles,
or faster -- scanAssociatedFiles benchmarks at 37 seconds.

(Also checked for other users of deleteWhere that could be sped up by
upsert. There are a couple, but they are not in performance critical
code paths, eg recordExportTreeCurrent is only run once per tree
export.)

I would have liked to rename FileKeyIndex to FileKeyUnique since it is
being used as a uniqueness constraint now, not just to get an index.
But, that gets converted into part of the SQL schema, and the name
is used by the upsert, so it can't be changed.

Sponsored-by: Dartmouth College's Datalad project
2021-06-08 07:53:36 -04:00
Joey Hess
0f10f208a7
avoid double work in git-annex init
reconcileStaged was doing a redundant scan to scannAnnexedFiles.

It would probably make sense to move the body of scannAnnexedFiles
into reconcileStaged, the separation does not really serve any purpose.

Sponsored-by: Dartmouth College's Datalad project
2021-06-07 16:50:14 -04:00
Joey Hess
6ceb31a30a
optimise reconcileStaged with git cat-file streaming
Commit 428c91606b made it need to do more
work in situations like switching between very different branches.

Compare with seekFilteredKeys which has a similar optimisation. Might be
possible to factor out the common part from these?

Sponsored-by: Dartmouth College's Datalad project
2021-06-07 15:26:48 -04:00
Joey Hess
0101363eb8
correctly update keys db in merge conflict
This is quite a subtle edge case, see the bug report for full details.

The second git diff is needed only when there's a merge conflict.
It would be possible to speed it up marginally by using
--diff-filter=Unmerged, but probably not enough to bother with.

Sponsored-by: Graham Spencer on Patreon
2021-06-07 12:52:36 -04:00
Joey Hess
5b7429e73a
avoid removing old associated file when there is a merge conflict
It makes sense to keep the key used by the old version of an
associated file, until the merge conflict is resolved.

Note that, since in this case git diff is being run with --index, it's
not possible to use -1 or -3, which would let the keys
associated with the new versions of the file also be added. That would
be better, because it's possible that the local modification to the file
that caused the merge conflict has not yet gotten its new key recorded
in the db.

Opened a bug about a case this is thus not able to address.

Sponsored-by: Boyd Stephen Smith Jr. on Patreon
2021-06-01 11:43:00 -04:00
Joey Hess
eb6f6ff9b8
speed up keys database writes
There seems to be no reason to check the time here. I think it was
inherited from code in Database.Fsck, which does have a reason to commit
every few minutes. Removing that syscall speeds up a git-annex init
in a repo with 100000 annexed files by about 3 seconds.

Sponsored-by: Dartmouth College's Datalad project
2021-05-31 15:01:00 -04:00
Joey Hess
73e1507c72
fix deadlock
git-annex test hung, at varying points depending
on when git decided to run the smudge clean filter.

Recent changes to reconcileStaged caused a deadlock, when git write-tree
for some reason decides to run the smudge clean filter. Which tries
to open the keys db, and blocks waiting for the lock file that its
grandparent has locked.

I don't know why git write-tree does that. It's supposed to only write a
tree from the index which needs no smudge/clean filtering.

I've verified that, in a situation where git write-tree runs the clean
filter, disabling the filter results in a tree being written that
contains the annex link, not eg, the worktree file content. So it seems
safe to disable the clean filter, but also this seems likely to be
working around a bug in git because it seems it is running the clean
filter in a situation where the object has already been cleaned.

Sponsored-by: Dartmouth College's Datalad project
2021-05-24 16:19:26 -04:00
Joey Hess
f46e4c9b7c
fix case where keys db was not initialized in time
When the keys db is opened for read, and did not exist yet, it used to
skip creating it, and return mempty values. But that prevents
reconcileStaged from populating associated files information in time for
the read. This fixes the one remaining case I know of where
the fix in a56b151f90 didn't work.

Note that, when there is a permissions error, it still avoids creating
the db and returns mempty for all queries. This does mean that
reconcileStaged does not run and so it may want to drop files that it
should not. However, presumably a permissions error on the keys database
also means that the user does not have permission to delete annex
objects, so they won't be able to drop the files anyway.

Sponsored-by: Dartmouth College's Datalad project
2021-05-24 14:46:59 -04:00
Joey Hess
d62d6e2fcf
note about a wart
All code that uses associated files already deals with this problem,
which used to be worse. Unfortunately I was not able to entirely
eliminate it, although it happens in fewer cases now.
2021-05-24 12:05:49 -04:00
Joey Hess
13423f337c
refactoring 2021-05-24 11:38:22 -04:00
Joey Hess
efae085272
fixed reconcileStaged crash when index is locked or in conflict
Eg, when git commit runs the smudge filter.

Commit 428c91606b introduced the crash,
as write-tree fails in those situations. Now it will work, and git-annex
always gets up-to-date information even in those situations. It does
need to do a bit more work, each time git-annex is run with the index
locked. Although if the index is unmodified from the last time
write-tree succeeded, that work is avoided.
2021-05-24 11:33:23 -04:00
Joey Hess
428c91606b
include locked files in the keys database associated files
Before only unlocked files were included.

The initial scan now scans for locked as well as unlocked files. This
does mean it gets a little bit slower, although I optimised it as well
as I think it can be.

reconcileStaged changed to diff from the current index to the tree of
the previous index. This lets it handle deletions as well, removing
associated files for both locked and unlocked files, which did not
always happen before.

On upgrade, there will be no recorded previous tree, so it will diff
from the empty tree to current index, and so will fully populate the
associated files, as well as removing any stale associated files
that were present due to them not being removed before.

reconcileStaged now does a bit more work. Most of the time, this will
just be due to running more often, after some change is made to the
index, and since there will be few changes since the last time, it will
not be a noticable overhead. What may turn out to be a noticable
slowdown is after changing to a branch, it has to go through the diff
from the previous index to the new one, and if there are lots of
changes, that could take a long time. Also, after adding a lot of files,
or deleting a lot of files, or moving a large subdirectory, etc.

Command.Lock used removeAssociatedFile, but now that's wrong because a
newly locked file still needs to have its associated file tracked.

Command.Rekey used removeAssociatedFile when the file was unlocked.
It could remove it also when it's locked, but it is not really
necessary, because it changes the index, and so the next time git-annex
run and accesses the keys db, reconcileStaged will run and update it.

There are probably several other places that use addAssociatedFile and
don't need to any more for similar reasons. But there's no harm in
keeping them, and it probably is a good idea to, if only to support
mixing this with older versions of git-annex.

However, mixing this and older versions does risk reconcileStaged not
running, if the older version already ran it on a given index state. So
it's not a good idea to mix versions. This problem could be dealt with
by changing the name of the gitAnnexKeysDbIndexCache, but that would
leave the old file dangling, or it would need to keep trying to remove
it.
2021-05-21 16:24:37 -04:00
Joey Hess
675556fd9a
smudge: check for known annexed inodes before checking annex.largefiles
smudge: Fix a case where an unlocked annexed file that annex.largefiles
does not match could get its unchanged content checked into git, due to git
running the smudge filter unecessarily.

When the file has the same inodecache as an already annexed file,
we can assume that the user is not intending to change how it's stored in
git.

Note that checkunchangedgitfile already handled the inverse case, where the
file was added to git previously. That goes further and actually sha1
hashes the new file and checks if it's the same hash in the index.

It would be possible to generate a key for the file and see if it's the
same as the old key, however that could be considerably more expensive than
sha1 of a small file is, and it is not necessary for the case I have, at
least, where the file is not modified or touched, and so its inode will
match the cache.

git-annex add was changed, when adding a small file, to remove the inode
cache for it. This is necessary to keep the recipe in
doc/tips/largefiles.mdwn for converting from annex to git working.
It also avoids bugs/case_where_using_pathspec_with_git-commit_leaves_s.mdwn
which the earlier try at this change introduced.
2021-05-10 13:20:10 -04:00
Joey Hess
c2f612292a
start splitting out readonly values from AnnexState
Values in AnnexRead can be read more efficiently, without MVar overhead.
Only a few things have been moved into there, and the performance
increase so far is not likely to be noticable.

This is groundwork for putting more stuff in there, particularly a value
that indicates if debugging is enabled.

The obvious next step is to change option parsing to not run in the
Annex monad to set values in AnnexState, and instead return a pure value
that gets stored in AnnexRead.
2021-04-02 15:51:44 -04:00
Joey Hess
8868a3a4c7
Fix build with persistent-2.12.0.1
persistent stopped using askLogFunc, and the thing to use is askLoggerIO
from monad-logger. Bumped the dep to the first version that contained that.

Note that the i386ancient build uses a newer monad-logger than 0.3.10,
so the new versioned dep should not break it, and presumably nothing else
either.

This commit was sponsored by Noam Kremen on Patreon.
2021-04-01 12:21:02 -04:00
Joey Hess
fc61915230
use GIT keys for export of non-annexed files
This solves the problem that import of such files gets confused and
converts them back to annexed files.

The import code already used GIT keys internally when it determined a
file should not be annexed. So now when it sees a GIT key that export
used, it already does the right thing.

This also means that even older version of git-annex can import and will
do the right thing, once a fixed version has exported. Still, there may
be other complications around upgrades; still need to think it all
through.

Moved gitShaKey and keyGitSha from Key to Annex.Export since they're
only used for export/import.

Documented GIT keys in backends, since they do appear in the git-annex
branch now.

This commit was sponsored by Graham Spencer on Patreon.
2021-03-05 14:12:11 -04:00
Joey Hess
5ce61c6b2a
add: Significantly speed up adding lots of non-large files to git
* add: Significantly speed up adding lots of non-large files to git,
  by disabling the annex smudge filter when running git add.
* add --force-small: Run git add rather than updating the index itself,
  so any other smudge filters than the annex one that may be enabled will
  be used.
2021-01-04 13:12:28 -04:00
Joey Hess
c6e693b25d
remove ContentIndentifiersCidRemoteIndex uniqueness constraint
For reasons explained in the bug report.

Implemented using a persistent migration, which works fine. It may add a
little startup overhead when a remote is enabled that uses this, but
probably un-noticable.

On the next major version, it would be fine to delete this database,
and regenerate it from the git-annex branch information. Then this
change could be reverted.

Did nothing about adding back the data that got dropped from the db
due to the bug. Only the borg special remote was probably affected,
and it's not been released yet. rm -rf .git/annex/cidsdb does work.
2020-12-23 14:03:33 -04:00
Joey Hess
028d4517c7
enable extensions needed by new version of persistent
Needed in order to use mkPersist in
persistent version 2.11.0.1
persistent-template version 2.9.1.0
2020-11-07 14:09:17 -04:00
Joey Hess
1db49497e0
finished this stage of the RawFilePath conversion
This commit was sponsored by Denis Dzyubenko on Patreon.
2020-11-06 14:10:58 -04:00
Joey Hess
2c8cf06e75
more RawFilePath conversion
Converted file mode setting to it, and follow-on changes.

Compiles up through 369/646.

This commit was sponsored by Ethan Aubin.
2020-11-05 18:45:37 -04:00
Joey Hess
9b0dde834e
convert getFileSize to RawFilePath
Lots of nice wins from this in avoiding unncessary work, and I think
nothing got slower.

This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
2020-11-05 11:32:57 -04:00
Joey Hess
681b44236a
more RawFilePath conversion
at 377/645

This commit was sponsored by Svenne Krap on Patreon.
2020-10-29 14:20:57 -04:00
Joey Hess
f45ad178cb
more RawFilePath conversion
At 318/645 after 4k lines of changes

This commit was sponsored by Jake Vosloo on Patreon.
2020-10-29 12:03:50 -04:00
Joey Hess
8d66f7ba0f
more RawFilePath conversion
Added a RawFilePath createDirectory and kept making stuff build.

Up to 296/645

This commit was sponsored by Mark Reidenbach on Patreon.
2020-10-28 17:25:59 -04:00
Joey Hess
10e62a810e
remove unused import 2020-10-02 13:45:09 -04:00
Joey Hess
37426920d8
Fix build with Benchmark build flag
Broke a while ago during optimisation work, and not noticed since the flag
is disabled by default.

This commit was sponsored by Brock Spratlen on Patreon.
2020-10-02 13:30:24 -04:00
Joey Hess
aa1ad0b7ca
remove redundant imports
Clean build under ghc 8.8.3, which seems to do better at finding cases
where two imports both provide the same symbol, and warns about one of
them.

This commit was sponsored by Ilya Shlyakhter on Patreon.
2020-06-22 11:05:34 -04:00
Joey Hess
529f488ec4
fix a thundering herd problem
Avoid repeatedly opening keys db when accessing a local git remote and -J
is used.

What was happening was that Remote.Git.onLocal created a new annex state
as each thread started up. The way the MVar was used did not prevent that.
And that, in turn, led to repeated opening of the keys db, as well as
probably other extra work or resource use.

Also managed to get rid of Annex.remoteannexstate, and it turned out there
was an unncessary Maybe in the keysdbhandle, since the handle starts out
closed.
2020-04-17 17:09:29 -04:00
Joey Hess
6c81e0c8f1
ByteString Ref continued
Several nice speed wins I think.

At 340/633 files converted.
2020-04-07 13:27:11 -04:00
Joey Hess
279991604d
started converting Ref from String to ByteString
This should make code that reads shas and refs from git faster.

Does not compile yet, a lot needs to be done still.
2020-04-06 17:14:49 -04:00
Joey Hess
7f992ef59c
mostly finished with createDirectoryUnder conversion
Remaining things needing converted are in the assistant, and Annex.Ssh.

Every other remaining call to createDirectoryIfMissing True has been
audited and is not relevant. The ones in Build/ of course don't get
included in the program. Others included eg, Remote.Tahoe and
Config.Files which both write to dotfiles under the home directory.
2020-03-06 11:57:15 -04:00
Joey Hess
6d58ca94d6
some easy createDirectoryUnder conversions 2020-03-05 15:20:10 -04:00
Joey Hess
029c883713
Merge branch 'master' into v8 2020-02-19 14:32:11 -04:00
Joey Hess
c9357bdc0e
ifdef persistent-template 2.8.0 fixes
The i386ancient build has a ghc too old for these extensions.

Build with persistent-template 2.8.0 tested.
2020-02-04 13:53:00 -04:00
Joey Hess
4920df6573
Fix build with newest version of persistent-template.
This is untested because of rain, also I am operating from truncated
copiler error messages in a bug report that also doesn't mention what the
library version is. Still, it should work.

May break builds with old ghc, in particular DerivingStrategies is
I think fairly new? The pragmas could be ifdefed if necessary. Works with
ghc 8.6.5.
2020-02-04 12:03:30 -04:00
Joey Hess
6db4aee7df
use --no-abbrev instead of --abbrev=40
This avoids hardcoding the sha size, so when git uses sha256, it will
output the full sha256 and not a truncation to 40 characters.

I reviewed git's history, and while there have been some
bugs with commands not supporting --no-abbrev (eg git diff --no-index
--no-abbrev was broken in git 2.1), none of the commands git-annex
uses will be impacted by those old bugs.
2020-01-07 12:29:37 -04:00
Joey Hess
5e4deb3620
support sha256 git repos
Git will eventually switch to sha2 and there will not be one single
shaSize anymore, but two (40 and 64).

Changed all parsers for git plumbing output to support both sizes of
shas.

One potential problem this does not deal with is, if somewhere in
git-annex it reads two shas from different sources, and compares them
to see if they're the same sha, it would fail if they're sha1 and sha256
of the same value. I don't know if that will really be a concern.
2020-01-07 12:22:19 -04:00
Joey Hess
02e00fd7ab
Merge branch 'master' into sqlite 2019-12-19 16:33:42 -04:00
Joey Hess
686791c4ed
more RawFilePath
Remove dup definitions and just use the RawFilePath one. </> etc are
enough faster that it's probably faster than building a String directly,
although I have not benchmarked.
2019-12-18 17:10:28 -04:00
Joey Hess
535b153381
building again after merge
Nice, several conversions fell out.
2019-12-18 15:02:46 -04:00
Joey Hess
d5628a16b8
Merge branch 'bs' into sqlite-bs 2019-12-18 14:51:03 -04:00
Joey Hess
c19211774f
use filepath-bytestring for annex object manipulations
git-annex find is now RawFilePath end to end, no string conversions.
So is git-annex get when it does not need to get anything.
So this is a major milestone on optimisation.

Benchmarks indicate around 30% speedup in both commands.

Probably many other performance improvements. All or nearly all places
where a file is statted use RawFilePath now.
2019-12-11 15:25:07 -04:00
Joey Hess
bdec7fed9c
convert TopFilePath to use RawFilePath
Adds a dependency on filepath-bytestring, an as yet unreleased fork of
filepath that operates on RawFilePath.

Git.Repo also changed to use RawFilePath for the path to the repo.

This does eliminate some RawFilePath -> FilePath -> RawFilePath
conversions. And filepath-bytestring's </> is probably faster.
But I don't expect a major performance improvement from this.
This is mostly groundwork for making Annex.Location use RawFilePath,
which will allow for a conversion-free pipleline.
2019-12-09 15:07:21 -04:00
Joey Hess
2f9a80d803
merging sqlite and bs branches
Since the sqlite branch uses blobs extensively, there are some
performance benefits, ByteStrings now get stored and retrieved w/o
conversion in some cases like in Database.Export.
2019-12-06 15:30:45 -04:00
Joey Hess
067aabdd48
wip RawFilePath 2x git-annex find speedup
Finally builds (oh the agoncy of making it build), but still very
unmergable, only Command.Find is included and lots of stuff is badly
hacked to make it compile.

Benchmarking vs master, this git-annex find is significantly faster!
Specifically:

	num files	old	new	speedup
	48500		4.77	3.73	28%
	12500		1.36	1.02	66%
	20		0.075	0.074	0% (so startup time is unchanged)

That's without really finishing the optimization. Things still to do:

* Eliminate all the fromRawFilePath, toRawFilePath, encodeBS,
  decodeBS conversions.
* Use versions of IO actions like getFileStatus that take a RawFilePath.
* Eliminate some Data.ByteString.Lazy.toStrict, which is a slow copy.
* Use ByteString for parsing git config to speed up startup.

It's likely several of those will speed up git-annex find further.
And other commands will certianly benefit even more.
2019-11-26 16:01:58 -04:00
Joey Hess
81d402216d cache the serialization of a Key
This will speed up the common case where a Key is deserialized from
disk, but is then serialized to build eg, the path to the annex object.

Previously attempted in 4536c93bb2
and reverted in 96aba8eff7.
The problems mentioned in the latter commit are addressed now:

Read/Show of KeyData is backwards-compatible with Read/Show of Key from before
this change, so Types.Distribution will keep working.

The Eq instance is fixed.

Also, Key has smart constructors, avoiding needing to remember to update
the cached serialization.

Used git-annex benchmark:
  find is 7% faster
  whereis is 3% faster
  get when all files are already present is 5% faster
Generally, the benchmarks are running 0.1 seconds faster per 2000 files,
on a ram disk in my laptop.
2019-11-22 17:49:16 -04:00
Joey Hess
70a8716324
improve benchmark
addAssociatedFileNewBench would sometimes pick a random number that a
previous call had already added. Using a MVar, make it always advance,
so the same behavior is benchmarked each time.
2019-11-22 13:20:22 -04:00
Joey Hess
7263aafd2b
Merge branch 'master' into sqlite 2019-11-22 12:49:35 -04:00
Joey Hess
92e1bb250b
simplify the name of the test cases 2019-11-21 17:38:58 -04:00
Joey Hess
d4661959de
Merge branch 'master' into sqlite 2019-11-21 17:26:50 -04:00
Joey Hess
25ba8156bc
improve benchmark --databases
* benchmark: Changed --databases to take a parameter specifiying the size
  of the database to benchmark.
* benchmark --databases: Display size of the populated database.
* benchmark --databases: Improve the "addAssociatedFile to (new)"
  benchmark to really add new values, not overwriting old values.
2019-11-21 17:25:20 -04:00
Joey Hess
1b5f4b67b5
use new name for new format fsck db
The old db is cleaned up when a new incremental fsck is started.

The incremental fsck won't pick up where the old one left off, but I
consider this a minor enough thing that it can just be documented and
won't be a problem.
2019-11-06 16:27:25 -04:00
Joey Hess
d3e4de0175
fix test suite
The test suite found a bug; select_ can fail now because a uniqueness
constrain has been added.

Now the test suite passes.

Also, I'm satisfied the changed PersistField instances work.
Looking over what changed, and what I've already tested, Key, FilePath,
and InodeCache are known working; ContentIdentifier is trivial
ByteString to blob; and SSha is trivial String to varchar. Both are
tested by the test suite. I've also tested the new FileSize and
EpochTime instances already, and they work.
2019-10-30 15:51:37 -04:00
Joey Hess
4940a135af
eliminate raw sql LIKE query 2019-10-30 15:19:52 -04:00
Joey Hess
9085a2cfec
make sure all sqlite selects have indexes
Bearing in mind that these indexes are really uniqueness constraints
that just happen to also make sqlite generate indexes.

In Database.ContentIndentifier, the ContentIndentifiersKeyRemoteCidIndex
is fine as a uniqueness constraint because it contains all rows from the
table. The ContentIndentifiersCidRemoteIndex is also ok because there
can only be one key for a given (cid, uuid) combination.

In Database.Export, the new ExportTreeFileKeyIndex is the same pair as
the old ExportTreeKeyFileIndex (previously ExportTreeIndex). And
in Database.Keys.SQL, the new InodeCacheKeyIndex is the same pair as the
old KeyInodeCacheIndex.
2019-10-30 13:46:52 -04:00
Joey Hess
3732f27722
document indexes
This was really confusing, though the code was ok. I think I now
understand it fully again.
2019-10-30 13:28:00 -04:00
Joey Hess
f6cfb84dfe
rename row 2019-10-30 13:22:41 -04:00
Joey Hess
aa27969e55
improve layout 2019-10-29 17:08:36 -04:00