This is a non-backwards compatable change, so not suitable for merging
w/o a annex.version bump and transition code. Not yet tested.
This improves performance of git-annex benchmark --databases
across the board by 10-25%, since eg Key roundtrips as a ByteString.
(serializeKey' produces a lazy ByteString, so there is still a
copy involved in converting it to a strict ByteString. It may be faster
to switch to using bytestring-strict-builder.)
FilePath and Key are both stored as blobs. This avoids mojibake in some
situations. It would be possible to use varchar instead, if persistent
could avoid converting that to Text, but it seems there is no good
way to do so. See doc/todo/sqlite_database_improvements.mdwn
Eliminated some ugly artifacts of using Read/Show serialization;
constructors and quoted strings are no longer stored in sqlite.
Renamed SRef to SSha to reflect that it is only ever a git sha,
not a ref name. Since it is limited to the characters in a sha,
it is not affected by mojibake, so still uses String.
Rescued from commit 11d6e2e260 which removed
db benchmarks in favor of benchmarking arbitrary git-annex commands. Which
is nice and general, but microbenchmarks are useful too.
Rescued from commit 11d6e2e260 which removed
db benchmarks in favor of benchmarking arbitrary git-annex commands. Which
is nice and general, but microbenchmarks are useful too.
The only good thing about it is it does not require a major version bump
to improve the database. That will need to happen at some point though.
Potentially very very slow in a large repository.
Ugly use of raw sql.
This solves the problem of sameas remotes trampling over per-remote
state. Used for:
* per-remote state, of course
* per-remote metadata, also of course
* per-remote content identifiers, because two remote implementations
could in theory generate the same content identifier for two different
peices of content
While chunk logs are per-remote data, they don't use this, because the
number and size of chunks stored is a common property across sameas
remotes.
External special remote had a complication, where it was theoretically
possible for a remote to send SETSTATE or GETSTATE during INITREMOTE or
EXPORTSUPPORTED. Since the uuid of the remote is typically generate in
Remote.setup, it would only be possible to pass a Maybe
RemoteStateHandle into it, and it would otherwise have to construct its
own. Rather than go that route, I decided to send an ERROR in this case.
It seems unlikely that any existing external special remote will be
affected. They would have to make up a git-annex key, and set state for
some reason during INITREMOTE. I can imagine such a hack, but it doesn't
seem worth complicating the code in such an ugly way to support it.
Unfortunately, both TestRemote and Annex.Import needed the Remote
to have a new field added that holds its RemoteStateHandle.
fsck --incremental/--more: Fix bug that prevented the incremental fsck
information from being updated every 5 minutes as it was supposed to be; it
was only updated after 1000 files were checked, which may be more files
that are possible to fsck in a given fsck time window.
Thanks to Peter Simons for help with analysis of this bug.
Auditing for other cases of the same mistake, the keys db also had it
backwards. This seems unlikely to really have been a problem;
it would need associated files updates etc to be coming in slowly for some
reason and then be interrupted to cause any problem.
IIRC the design of the keys db assumes that any interruped
operation will be restarted, and so it can lose any buffered database
updates safely.
Had a report of close throwing ErrorBusy on CIFS.
Retrying up to 16 seconds is a balance between hopefully waiting long
enough for the problem to clear up and waiting so long that git-annex seems
to hang.
The new dependency is free; persistent depends on unliftio-core.
Drop support for building with ghc older than 8.4.4, and with older
versions of serveral haskell libraries than will be included in Debian 10.
The only remaining version ifdefs in the entire code base are now a couple
for aws!
This commit should only be merged after the Debian 10 release.
And perhaps it will need to wait longer than that; it would make
backporting new versions of git-annex to Debian 9 (stretch) which
has been actively happening as recently as this year.
This commit was sponsored by Ilya Shlyakhter.
Fix bug that caused importing from a special remote to repeatedly download
unchanged files when multiple files in the remote have the same content.
Unfortunately, there's really no good way to remove a uniqueness constraint
from a sqlite database. The best that can be done is to make a new table
and copy the data over. But that would require using persistent's
migrations or raw sql, and I don't want to do either.
Instead, a sledgehammer approach: Renamed .git/annex/cid to
.git/annex/cids. When the new database doesn't exist, it will be populated
from the git-annex branch.
Noting deletes the old database. Don't want to delete it out from under
some long-running git-annex process that might be using it. It could
eventually be deleted. But this is such a new feature, probably few repos
have the database in any case.
This does not change the overall license of the git-annex program, which
was already AGPL due to a number of sources files being AGPL already.
Legally speaking, I'm adding a new license under which these files are
now available; I already released their current contents under the GPL
license. Now they're dual licensed GPL and AGPL. However, I intend
for all my future changes to these files to only be released under the
AGPL license, and I won't be tracking the dual licensing status, so I'm
simply changing the license statement to say it's AGPL.
(In some cases, others wrote parts of the code of a file and released it
under the GPL; but in all cases I have contributed a significant portion
of the code in each file and it's that code that is getting the AGPL
license; the GPL license of other contributors allows combining with
AGPL code.)
untested
This won't be super slow, but it does need to diff two likely large
trees, and since the git-annex branch rarely sits still, it will most
likely be run at the beginning of every import.
A possible speed improvement would be to only run this when the database
did not contain a ContentIdentifier. But that would only speed up
imports when there is no new version of a file on the special remote,
at most renames of existing files being imported.
A better speed improvement would be to record something in the git-annex
branch that indicates when an import has been run, and only do the diff
if the git-annex branch has record of a newer import than we've seen
before. Then, it would only run when there is in fact new
ContentIdentifier information available from a remote. Certianly doable,
but didn't want to complicate things yet.
Does not yet have a way to update with new information from the
git-annex branch, which will be needed when multiple repos are importing
from the same remote.
Somehow forgot about the case where the current export db tree is the
same one in the export log, and it warned about an export conflict when
getting a file in that situation. Of course it's no conflict at all!
This commit was sponsored by Jochen Bartl on Patreon.
Like the earlier fixed one in Command.Export, it occurred when the same
tree was exported by multiple clones. Previous fix was incomplete since
several other places looked at the list of exported trees to detect when
there was an export conflict. Added a single unified function to avoid
missing any places it needed to be fixed.
This commit was sponsored by mo on Patreon.
What these generate is not really suitable to be used as a filename,
which is why keyFile and fileKey further escape it. These are just
serializing Keys.
Also removed a quickcheck test that was very unlikely to test anything
useful, since it relied on random chance creating something that looks
like a serialized key. The other test is sufficient for testing what
that was intended to test anyway.
When an export conflict prevents accessing a special remote, be clearer
about what the problem is and how to resolve it.
This commit was sponsored by Trenton Cronholm on Patreon.
I suspect this may be due to SQLITE_IOERR_SHORT_READ, but have not
verified.
I was able to reproduce it on Linux after running the test suite in a loop
for 1-3 hours until it failed.
The WAL mode entry change in 3963c5fcf5
may have hidden the problem I was seeing; I have not seen an ErrorIO
since then.
I don't know the circumstances, but have a report of this:
git-annex: failed to commit changes to sqlite database: Just SQLite3 returned
ErrorConstraint while attempting to perform step.
All 3 tables in the export db have uniqueness constraints on them,
insertUnique is used for all the rest, but this use of insertMany
means it doesn't check the constraint. I guess that's what caused the
crash, but I have not been able to test it yet.
Use putMany when available, as it should be faster than mapM of insertMany.
This commit was sponsored by Brock Spratlen on Patreon.
I don't know why git diff --raw would run the clean filter, but it did
with this version of git. Perhaps it is cleaning the file to generate the
diff to search with -G? But then why would newer gits not run the clean
filter?
It caused git annex to deadlock because the keys database was locked
and ran a git command that ran git-annex, which tried to read from the
keys database.
This commit was sponsored by Brett Eisenberg on Patreon.
Update pointer file next time reconcileStaged is run to recover from the
race.
Note that restagePointerFile causes git to run the clean filter,
and that will run reconcileStaged. So, normally by the time the git
annex get/drop command finishes, the race has already been dealt with.
It may be that, in some case, that won't happen and the race will be
dealt with at a later point. git-annex could run reconcileStaged at
shutdown if that becomes a problem.
This does not handle the situation where the git mv is committed before
git-annex gets a chance to run again. git commit does run the clean
filter, and that happens to re-inject the content if it was supposed to
be dropped but is still populated. But, the case where the file was
supposed to be gotten but is not populated is not handled yet.
This commit was supported by the NSF-funded DataLad project.
The bug occurred when closeDb was not called, and garbage collection of
the DbHandle didn't give the workerThread time to shut down. Fixed by
exiting the runSqlite action when a commit is made.
(MultiWriter mode already forked off a runSqlite action, so avoided the
problem.)
This commit was sponsored by Brock Spratlen on Patreon.
Now when one repository has exported a tree, another repository can get
files from the export, after syncing.
There's a bug: While the database update works, somehow the database on
disk does not get updated, and so the database update is run the next
time, etc. Wasn't able to figure out why yet.
This commit was sponsored by Ole-Morten Duesund on Patreon.
New table needed to look up what filenames are used in the currently
exported tree, for reasons explained in export.mdwn.
Also, added smart constructors for ExportLocation and ExportDirectory to
make sure they contain filepaths with the right direction slashes.
And some code refactoring.
This commit was sponsored by Francois Marier on Patreon.
The subtle part of this is what happens when the remote fails to remove
an empty directory. The removal from the export needs to fail in that
case, so the removal will be tried again later. However, removeExportLocation
has already been run and changed the export db, so if the next run
checks getExportLocation, it might decide nothing remains to be done,
leaving the empty directory.
Dealt with that by making removeEmptyDirectories, handle a failure
by calling addExportLocation, reverting the database changes so the next
run will be guaranteed to try deleting the empty directory again.
This commit was sponsored by Thomas Hochstein on Patreon.
The export database has writes made to it and then expects to read back
the same data immediately. But, the way that Database.Handle does
writes, in order to support multiple writers, makes that not work, due
to caching issues. This resulted in export re-uploading files it had
already successfully renamed into place.
Fixed by allowing databases to be opened in MultiWriter or SingleWriter
mode. The export database only needs to support a single writer; it does
not make sense for multiple exports to run at the same time to the same
special remote.
All other databases still use MultiWriter mode. And by inspection,
nothing else in git-annex seems to be relying on being able to
immediately query for changes that were just written to the database.
This commit was supported by the NSF-funded DataLad project.
Removed uncorrect UniqueKey key in db schema; a key can appear multiple
times with different files.
The database has to be flushed after each removal. But when adding files
to the export, lots of changes are able to be queued up w/o flushing.
So it's still fairly efficient.
If large removals of files from exports are too slow, an alternative
would be to make two passes over the diff, one pass queueing deletions
from the database, then a flush and the a second pass updating the
location log. But that would use more memory, and need to look up
exportKey twice per removed file, so I've avoided such optimisation yet.
This commit was supported by the NSF-funded DataLad project.
Went with a separate db per export remote, rather than a single export
database. Mostly because there will probably not be a lot of separate
export remotes, and it might be convenient to be able to delete a given
remote's export database.
This commit was supported by the NSF-funded DataLad project.
Refactored some common code into initDb.
This only deals with the problem when creating new databases. If a repo
got bad permissions into it, it's up to the user to deal with it.
This commit was sponsored by Ole-Morten Duesund on Patreon.
hSetEncoding of a closed handle segfaults.
https://ghc.haskell.org/trac/ghc/ticket/71618484c0c197 introduced the crash.
In particular, stdin may get closed (by eg, getContents) and then trying
to set its encoding will crash. We didn't need to adjust stdin's
encoding anyway, but only stderr, to work around
https://github.com/yesodweb/persistent/issues/474
Thanks to Mesar Hameed for assistance related to reproducing this bug.
ghc 8 added backtraces on uncaught errors. This is great, but git-annex was
using error in many places for a error message targeted at the user, in
some known problem case. A backtrace only confuses such a message, so omit it.
Notably, commands like git annex drop that failed due to eg, numcopies,
used to use error, so had a backtrace.
This commit was sponsored by Ethan Aubin.
The keys database handle needs to be closed after merging, because the
smudge filter, in another process, updates the database. Old cached info
can be read for a while from the open database handle; closing it ensures
that the info written by the smudge filter is available.
This is pretty horribly ad-hoc, and it's especially nasty that the
transferrer closes the database every time.
This is a mostly backwards compatable change. I broke backwards
compatability in the case where a filename starts with double-quote.
That seems likely to be very rare, and v6 unlocked files are a new feature
anyway, and fsck needs to fix missing associated file mappings anyway. So,
I decided that is good enough.
The encoding used is to just show the String when it contains a problem
character. While that adds some overhead to addAssociatedFile and
removeAssociatedFile, those are not called very often. This approach has
minimal decode overhead, because most filenames won't be encoded that way,
and it only has to look for the leading double-quote to skip the expensive
read. So, getAssociatedFiles remains fast.
I did consider using ByteString instead, but getting a FilePath converted
with all chars intact, even surrigates, is difficult, and it looks like
instance PersistField ByteString uses Text, which I don't trust for problem
encoded data. It would probably be slower too, and it would make the
database less easy to inspect manually.
This lets readonly repos be used. If a repo is readonly, we can ignore the
keys database, because nothing that we can do will change the state of the
repo anyway.
The benchmark shows that the database access is quite fast indeed!
And, it scales linearly to the number of keys, with one exception,
getAssociatedKey.
Based on this benchmark, I don't think I need worry about optimising
for cases where all files are locked and the database is mostly empty.
In those cases, database access will be misses, and according to this
benchmark, should add only 50 milliseconds to runtime.
(NB: There may be some overhead to getting the database opened and locking
the handle that this benchmark doesn't see.)
joey@darkstar:~/src/git-annex>./git-annex benchmark
setting up database with 1000
setting up database with 10000
benchmarking keys database/getAssociatedFiles from 1000 (hit)
time 62.77 μs (62.70 μs .. 62.85 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 62.81 μs (62.76 μs .. 62.88 μs)
std dev 201.6 ns (157.5 ns .. 259.5 ns)
benchmarking keys database/getAssociatedFiles from 1000 (miss)
time 50.02 μs (49.97 μs .. 50.07 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 50.09 μs (50.04 μs .. 50.17 μs)
std dev 206.7 ns (133.8 ns .. 295.3 ns)
benchmarking keys database/getAssociatedKey from 1000 (hit)
time 211.2 μs (210.5 μs .. 212.3 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 211.0 μs (210.7 μs .. 212.0 μs)
std dev 1.685 μs (334.4 ns .. 3.517 μs)
benchmarking keys database/getAssociatedKey from 1000 (miss)
time 173.5 μs (172.7 μs .. 174.2 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 173.7 μs (173.0 μs .. 175.5 μs)
std dev 3.833 μs (1.858 μs .. 6.617 μs)
variance introduced by outliers: 16% (moderately inflated)
benchmarking keys database/getAssociatedFiles from 10000 (hit)
time 64.01 μs (63.84 μs .. 64.18 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 64.85 μs (64.34 μs .. 66.02 μs)
std dev 2.433 μs (547.6 ns .. 4.652 μs)
variance introduced by outliers: 40% (moderately inflated)
benchmarking keys database/getAssociatedFiles from 10000 (miss)
time 50.33 μs (50.28 μs .. 50.39 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 50.32 μs (50.26 μs .. 50.38 μs)
std dev 202.7 ns (167.6 ns .. 252.0 ns)
benchmarking keys database/getAssociatedKey from 10000 (hit)
time 1.142 ms (1.139 ms .. 1.146 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 1.142 ms (1.140 ms .. 1.144 ms)
std dev 7.142 μs (4.994 μs .. 10.98 μs)
benchmarking keys database/getAssociatedKey from 10000 (miss)
time 1.094 ms (1.092 ms .. 1.096 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 1.095 ms (1.095 ms .. 1.097 ms)
std dev 4.277 μs (2.591 μs .. 7.228 μs)
The repo path is typically relative, not absolute, so
providing it to absPathFrom doesn't yield an absolute path.
This is not a bug, just unclear documentation.
Indeed, there seem to be no reason to simplifyPath here, which absPathFrom
does, so instead just combine the repo path and the TopFilePath.
Also, removed an export of the TopFilePath constructor; asTopFilePath
is provided to construct one as-is.
Fixes several bugs with updates of pointer files. When eg, running
git annex drop --from localremote
it was updating the pointer file in the local repository, not the remote.
Also, fixes drop ../foo when run in a subdir, and probably lots of other
problems. Test suite drops from ~30 to 11 failures now.
TopFilePath is used to force thinking about what the filepath is relative
to.
The data stored in the sqlite db is still just a plain string, and
TopFilePath is a newtype, so there's no overhead involved in using it in
DataBase.Keys.
Writes are optimised by queueing up multiple writes when possible.
The queue is flushed after the Annex monad action finishes. That makes it
happen on program termination, and also whenever a nested Annex monad action
finishes.
Reads are optimised by checking once (per AnnexState) if the database
exists. If the database doesn't exist yet, all reads return mempty.
Reads also cause queued writes to be flushed, so reads will always be
consistent with writes (as long as they're made inside the same Annex monad).
A future optimisation path would be to determine when that's not necessary,
which is probably most of the time, and avoid flushing unncessarily.
Design notes for this commit:
- separate reads from writes
- reuse a handle which is left open until program
exit or until the MVar goes out of scope (and autoclosed then)
- writes are queued
- queue is flushed periodically
- immediate queue flush before any read
- auto-flush queue when database handle is garbage collected
- flush queue on exit from Annex monad
(Note that this may happen repeatedly for a single database connection;
or a connection may be reused for multiple Annex monad actions,
possibly even concurrent ones.)
- if database does not exist (or is empty) the handle
is not opened by reads; reads instead return empty results
- writes open the handle if it was not open previously
Fsck can use the queue for efficiency since it is write-heavy, and only
reads a value before writing it. But, the queue is not suited to the Keys
database.
The problem is that shutdown is not always called, particularly in the test
suite. So, a database connection would be opened, possibly some changes
queued, and then not shut down.
One way this can happen is when using Annex.eval or Annex.run with a new
state. A better fix might be to make both of them call Keys.shutdown
(and be sure to do it even if the annex action threw an error).
Complication: Sometimes they're run reusing an existing state, so shutting
down a database connection could cause problems for other users of that
same state. I think this would need a MVar holding the database handle,
so it could be emptied once shut down, and another user of the database
connection could then start up a new one if it got shut down. But, what if
2 threads were concurrently using the same database handle and one shut it
down while the other was writing to it? Urgh.
Might have to go that route eventually to get the database access to run
fast enough. For now, a quick fix to get the test suite happier, at the
expense of speed.
If a DbHandle is in use by another thread, it could be queueing changes
while shutdown is running. So, wait for the worker to finish before
flushing the queue, so that any last-minute writes are included. Before
this fix, they would be silently dropped.
Of course, if the other thread continues to try to use a DbHandle once it's
closed, it will block forever as the worker is no longer reading from the
jobs MVar. So, that would crash with
"thread blocked indefinitely in an MVar operation".
The Keys database can hold multiple inode caches for a given key. One for
the annex object, and one for each pointer file, which may not be hard
linked to it.
Inode caches for a key are recorded when its content is added to the annex,
but only if it has known pointer files. This is to avoid the overhead of
maintaining the database when not needed.
When the smudge filter outputs a file's content, the inode cache is not
updated, because git's smudge interface doesn't let us write the file. So,
dropping will fall back to doing an expensive verification then. Ideally,
git's interface would be improved, and then the inode cache could be
updated then too.
Renamed the db to keys, since it is various info about a Keys.
Dropping a key will update its pointer files, as long as their content can
be verified to be unmodified. This falls back to checksum verification, but
I want it to use an InodeCache of the key, for speed. But, I have not made
anything populate that cache yet.
The one exception is in Utility.Daemon. As long as a process only
daemonizes once, which seems reasonable, and as long as it avoids calling
checkDaemon once it's already running as a daemon, the fcntl locking
gotchas won't be a problem there.
Annex.LockFile has it's own separate lock pool layer, which has been
renamed to LockCache. This is a persistent cache of locks that persist
until closed.
This is not quite done; lockContent stil needs to be converted.
Also, moved the database to a subdir, as there are multiple files.
This seems to work well with concurrent fscks, although they still do
redundant work due to the commit granularity. Occasionally two writes will
conflict, and one is then deferred and happens later.
Except, with 3 concurrent fscks, I got failures:
git-annex: user error (SQLite3 returned ErrorBusy while attempting to perform prepare "SELECT \"fscked\".\"key\"\nFROM \"fscked\"\nWHERE \"fscked\".\"key\" = ?\n": database is locked)
Argh!!!
Still not robust enough. I have 3 fscks running concurrently, and am
seeing:
("commit deferred",user error (SQLite3 returned ErrorBusy while attempting
to perform step.))
and
git-annex: user error (SQLite3 returned ErrorBusy while attempting to perform prepare "SELECT \"fscked\".\"key\"\nFROM \"fscked\"\nWHERE \"fscked\".\"key\" = ?\n": database is locked)
Sqlite doesn't support multiple concurrent writers
at all. One of them will fail to write. It's not even possible to have two
processes building up separate transactions at the same time. Before using
sqlite, incremental fsck could work perfectly well with multiple fsck
processes running concurrently. I'd like to keep that working.
My partial solution, so far, is to make git-annex buffer writes, and every
so often send them all to sqlite at once, in a transaction. So most of the
time, nothing is writing to the database. (And if it gets unlucky and
a write fails due to a collision with another writer, it can just wait and
retry the write later.) This lets multiple processes write to the database
successfully.
But, for the purposes of concurrent, incremental fsck, it's not ideal.
Each process doesn't immediately learn of files that another process has
checked. So they'll tend to do redundant work.
Only way I can see to improve this is to use some other mechanism for
short-term IPC between the fsck processes. Not yet done.
----
Also, make addDb check if an item is in the database already, and not try
to re-add it. That fixes an intermittent crash with
"SQLite3 returned ErrorConstraint while attempting to perform step."
I am not 100% sure why; it only started happening when I moved write
buffering into the queue. It seemed to generally happen on the same file
each time, so could just be due to multiple files having the same key.
However, I doubt my sound repo has many duplicate keys, and I suspect
something else is going on.
----
Updated benchmark, with the 1000 item queue: 6m33.808s
Turns out sqlite does not like having its database deleted out from
underneath it. It might suffice to empty the table, but I would rather
start each fsck over with a new database, so I added a lock file, and
running incremental fscks use a shared lock.
This leaves one concurrency bug left; running two concurrent fsck --more
will lead to: "SQLite3 returned ErrorBusy while attempting to perform step."
and one or both will fail. This is a concurrent writers problem.
Database.Handle can now be given a CommitPolicy, making it easy to specify
transaction granularity.
Benchmarking the old git-annex incremental fsck that flips sticky bits
to the new that uses sqlite, running in a repo with 37000 annexed files,
both from cold cache:
old: 6m6.906s
new: 6m26.913s
This commit was sponsored by TasLUG.
Did not keep backwards compat for sticky bit records. An incremental fsck
that is already in progress will start over on upgrade to this version.
This is not yet ready for merging. The autobuilders need to have sqlite
installed.
Also, interrupting a fsck --incremental does not commit the database.
So, resuming with fsck --more restarts from beginning.
Memory: Constant during a fsck of tens of thousands of files.
(But, it does seem to buffer whole transation in memory, so
may really scale with number of files.)
CPU: ?