Commit graph

78 commits

Author SHA1 Message Date
Joey Hess
f5b642318d
eliminate single/multi writer distinction
After commit f4bdecc4ec, there is no
longer any distinction between SingleWriter and MultiWriter's handling
of read after write.

Databases that were SingleWriter still have lock files that are used to
prevent multiple writers.

This does make writing to such databases a bit more expensive,
because the MultiWriter code path that is now used opens a second db
connection in order to write to them.
2021-10-20 12:26:30 -04:00
Joey Hess
0f38ad9a69
close keys db to possibly work around WSL1 issue
2021-10-19 13:07:49 -04:00
Joey Hess
837116ef1e
Fix support for readonly git remotes
Boolean blindness oops.

(Reversion in version 8.20210621)

Sponsored-by: Dartmouth College's Datalad project
2021-08-30 12:34:19 -04:00
Joey Hess
a306560374
use SQL.addInodeCaches
This avoids deadlock when opening the database handle calls
reconcileStaged.
2021-07-27 17:34:56 -04:00
Joey Hess
73e0cbbb19
fix problem populating pointer files
This is a result of an audit of every use of getInodeCaches,
to find places that misbehave when the annex object is not in the inode
cache, despite pointer files for the same key being in the inode cache.

Unfortunately, that is the case for objects that were in v7 repos that
upgraded to v8. Added a note about this gotcha to getInodeCaches.

Database.Keys.reconcileStaged, when annex.thin is set, would fail to
populate pointer files in this situation. Changed it to check if the
annex object is unmodified the same way inAnnex does, falling back to a
checksum if the inode cache is not recorded.

Sponsored-by: Dartmouth College's Datalad project
2021-07-27 14:26:49 -04:00
Joey Hess
e4b2a067e0
fix potential race in updating inode cache
In Annex.Content, the object file was statted after pointer files were
populated. But if annex.thin is set, once the pointer files are
populated, the object file can potentially be modified via the hard
link. So, it was possible, though seemingly very unlikely, for the inode
of the modified object file to be cached.

Command.Fix and Command.Fsck had similar problems, statting the work
tree files after they were in place. Changed them to stat the temp file
that gets moved into place. This does rely on .git/annex being on the
same filesystem. If it's not, the cached inode will not be the same as
the one that the temp file gets moved to. Result will be that git-annex
will later need to do an expensive verification of the content of the
worktree files. Note that the cross-filesystem move of the temp file
already is a larger amount of extra work, so this seems acceptable.

Sponsored-by: Luke Shumaker on Patreon
2021-07-27 12:29:10 -04:00
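The reasoning for statting the temp file can be sketched as follows. This is a minimal Python illustration, not git-annex's Haskell code; `stat_then_move` is a hypothetical helper, and the sketch assumes both paths are on the same filesystem, where a rename keeps the inode.

```python
import os
import tempfile


def stat_then_move(tmp_path, dest_path):
    """Stat the temp file first, then move it into place.

    Hypothetical helper: the (inode, size) pair is recorded before the
    file becomes visible at dest_path, so a later modification through
    dest_path cannot leak into the cached value.
    """
    st = os.stat(tmp_path)
    cached = (st.st_ino, st.st_size)
    os.rename(tmp_path, dest_path)  # same filesystem: inode is preserved
    return cached


with tempfile.TemporaryDirectory() as d:
    tmp = os.path.join(d, "tmp-object")
    dest = os.path.join(d, "worktree-file")
    with open(tmp, "w") as f:
        f.write("content")
    cached = stat_then_move(tmp, dest)
    same_inode = cached[0] == os.stat(dest).st_ino
```

If the temp file and destination were on different filesystems, the move would be a copy and the cached inode would no longer match, which is the fallback cost the message describes.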
Joey Hess
af9fdf5dba
verify associated files when checking numcopies
Most of this is just refactoring. But, handleDropsFrom
did not verify that associated files from the keys db were still
accurate, and has now been fixed to do so.

A minor improvement to this would be to avoid calling catKeyFile
twice on the same file, when getting the numcopies and mincopies value,
in the common case where the same file has the highest value for both.
But, it avoids checking every associated file, so it will scale well to
lots of dups already.

Sponsored-by: Kevin Mueller on Patreon
2021-06-15 11:14:52 -04:00
Joey Hess
7b6deb1109
display scanning message whenever reconcileStaged has enough files to chew on
Clear visible progress bar first.

Removed showSideActionAfter because it can't be used in reconcileStaged
(import loop). Instead, it counts the number of files it
processes and displays the message once it has seen enough to know it's
taking a while.

Sponsored-by: Dartmouth College's Datalad project
2021-06-08 12:48:30 -04:00
Joey Hess
1a6fa5abc8
add debugging for reconcileStaged calls for benchmarking
2021-06-08 11:57:23 -04:00
Joey Hess
7f742589f9
claw back annexed file scan speedup
Following commit c941ab6f5b, this avoids
the second, redundant scan when annex.thin is not set.

The benchmark now runs in 35.5 seconds, down from 40 seconds.

Note that the inode cache of the annex object has to be passed to
addInodeCaches now, because it might not already be in the inode caches,
unlike previously.

Sponsored-by: Dartmouth College's Datalad project
2021-06-08 11:09:15 -04:00
Joey Hess
ec1f2f246b
improve comment
remove obsolete part about a commit preventing it seeing changes
2021-06-08 10:43:48 -04:00
Joey Hess
c941ab6f5b
avoid double work in git-annex init, second try
reconcileStaged populates the db, so scanAnnexedFiles does not need to
do it again. It still makes a pass over the HEAD tree, but populating
the db was most of the expensive part.

Benchmarking with 100,000 files, git-annex init now takes 40 seconds,
vs 37 seconds with the old, buggy version of this fix. It should be
possible to win those 3 precious seconds per 100k files back, in the
case when annex.thin is not set, with improvements to reconcileStaged
that avoid needing this second pass.

Sponsored-by: Dartmouth College's Datalad project
2021-06-08 09:36:53 -04:00
Joey Hess
2cb7b7b336
Revert "avoid double work in git-annex init"
This reverts commit 0f10f208a7.

The implementation of this turns out to be unsafe; it can lead to a keys
db deadlock. scanAnnexedFiles injects a call to inAnnex into
reconcileStaged, but inAnnex sometimes needs to read from the keys db,
which will try to re-open it when it's in the process of being opened.
The exclusive lock of gitAnnexKeysDbLock will then deadlock.

This needs to be done in some other way...
2021-06-08 09:11:24 -04:00
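The shape of that deadlock can be sketched in a few lines. This is a hypothetical Python model, not the git-annex code: `keys_db_lock` stands in for the exclusive lock, `in_annex_check` for the injected inAnnex call, and a short timeout stands in for blocking forever.

```python
import threading

# Stand-in for the exclusive keys db lock (non-reentrant, like a lock file).
keys_db_lock = threading.Lock()


def open_keys_db(during_open, timeout=0.1):
    """Open the db under the lock, running during_open() while it is held."""
    if not keys_db_lock.acquire(timeout=timeout):
        # A real exclusive lock would block here indefinitely.
        return "would deadlock"
    try:
        result = during_open()
        return result if result is not None else "opened"
    finally:
        keys_db_lock.release()


def in_annex_check():
    # Needs to read from the keys db, so it tries to open it again
    # while the outer open still holds the lock.
    return open_keys_db(lambda: None)


plain_open = open_keys_db(lambda: None)
nested_open = open_keys_db(in_annex_check)
```

The plain open succeeds, while the nested one cannot take the lock that its caller already holds.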
Joey Hess
0f10f208a7
avoid double work in git-annex init
reconcileStaged was doing a redundant scan to scanAnnexedFiles.

It would probably make sense to move the body of scanAnnexedFiles
into reconcileStaged; the separation does not really serve any purpose.

Sponsored-by: Dartmouth College's Datalad project
2021-06-07 16:50:14 -04:00
Joey Hess
6ceb31a30a
optimise reconcileStaged with git cat-file streaming
Commit 428c91606b made it need to do more
work in situations like switching between very different branches.

Compare with seekFilteredKeys which has a similar optimisation. Might be
possible to factor out the common part from these?

Sponsored-by: Dartmouth College's Datalad project
2021-06-07 15:26:48 -04:00
Joey Hess
0101363eb8
correctly update keys db in merge conflict
This is quite a subtle edge case, see the bug report for full details.

The second git diff is needed only when there's a merge conflict.
It would be possible to speed it up marginally by using
--diff-filter=Unmerged, but probably not enough to bother with.

Sponsored-by: Graham Spencer on Patreon
2021-06-07 12:52:36 -04:00
Joey Hess
5b7429e73a
avoid removing old associated file when there is a merge conflict
It makes sense to keep the key used by the old version of an
associated file, until the merge conflict is resolved.

Note that, since in this case git diff is being run with --index, it's
not possible to use -1 or -3, which would let the keys
associated with the new versions of the file also be added. That would
be better, because it's possible that the local modification to the file
that caused the merge conflict has not yet gotten its new key recorded
in the db.

Opened a bug about a case this is thus not able to address.

Sponsored-by: Boyd Stephen Smith Jr. on Patreon
2021-06-01 11:43:00 -04:00
Joey Hess
73e1507c72
fix deadlock
git-annex test hung, at varying points depending
on when git decided to run the smudge clean filter.

Recent changes to reconcileStaged caused a deadlock, when git write-tree
for some reason decides to run the smudge clean filter. Which tries
to open the keys db, and blocks waiting for the lock file that its
grandparent has locked.

I don't know why git write-tree does that. It's supposed to only write a
tree from the index which needs no smudge/clean filtering.

I've verified that, in a situation where git write-tree runs the clean
filter, disabling the filter results in a tree being written that
contains the annex link, not eg, the worktree file content. So it seems
safe to disable the clean filter, but also this seems likely to be
working around a bug in git because it seems it is running the clean
filter in a situation where the object has already been cleaned.

Sponsored-by: Dartmouth College's Datalad project
2021-05-24 16:19:26 -04:00
Joey Hess
f46e4c9b7c
fix case where keys db was not initialized in time
When the keys db is opened for read, and did not exist yet, it used to
skip creating it, and return mempty values. But that prevents
reconcileStaged from populating associated files information in time for
the read. This fixes the one remaining case I know of where
the fix in a56b151f90 didn't work.

Note that, when there is a permissions error, it still avoids creating
the db and returns mempty for all queries. This does mean that
reconcileStaged does not run and so it may want to drop files that it
should not. However, presumably a permissions error on the keys database
also means that the user does not have permission to delete annex
objects, so they won't be able to drop the files anyway.

Sponsored-by: Dartmouth College's Datalad project
2021-05-24 14:46:59 -04:00
Joey Hess
d62d6e2fcf
note about a wart
All code that uses associated files already deals with this problem,
which used to be worse. Unfortunately I was not able to entirely
eliminate it, although it happens in fewer cases now.
2021-05-24 12:05:49 -04:00
Joey Hess
13423f337c
refactoring
2021-05-24 11:38:22 -04:00
Joey Hess
efae085272
fixed reconcileStaged crash when index is locked or in conflict
Eg, when git commit runs the smudge filter.

Commit 428c91606b introduced the crash,
as write-tree fails in those situations. Now it will work, and git-annex
always gets up-to-date information even in those situations. It does
need to do a bit more work, each time git-annex is run with the index
locked. Although if the index is unmodified from the last time
write-tree succeeded, that work is avoided.
2021-05-24 11:33:23 -04:00
Joey Hess
428c91606b
include locked files in the keys database associated files
Before only unlocked files were included.

The initial scan now scans for locked as well as unlocked files. This
does mean it gets a little bit slower, although I optimised it as well
as I think it can be.

reconcileStaged changed to diff from the current index to the tree of
the previous index. This lets it handle deletions as well, removing
associated files for both locked and unlocked files, which did not
always happen before.

On upgrade, there will be no recorded previous tree, so it will diff
from the empty tree to current index, and so will fully populate the
associated files, as well as removing any stale associated files
that were present due to them not being removed before.

reconcileStaged now does a bit more work. Most of the time, this will
just be due to running more often, after some change is made to the
index, and since there will be few changes since the last time, it will
not be a noticeable overhead. What may turn out to be a noticeable
slowdown is after changing to a branch, it has to go through the diff
from the previous index to the new one, and if there are lots of
changes, that could take a long time. Also, after adding a lot of files,
or deleting a lot of files, or moving a large subdirectory, etc.

Command.Lock used removeAssociatedFile, but now that's wrong because a
newly locked file still needs to have its associated file tracked.

Command.Rekey used removeAssociatedFile when the file was unlocked.
It could remove it also when it's locked, but it is not really
necessary, because it changes the index, and so the next time git-annex
runs and accesses the keys db, reconcileStaged will run and update it.

There are probably several other places that use addAssociatedFile and
don't need to any more for similar reasons. But there's no harm in
keeping them, and it probably is a good idea to, if only to support
mixing this with older versions of git-annex.

However, mixing this and older versions does risk reconcileStaged not
running, if the older version already ran it on a given index state. So
it's not a good idea to mix versions. This problem could be dealt with
by changing the name of the gitAnnexKeysDbIndexCache, but that would
leave the old file dangling, or it would need to keep trying to remove
it.
2021-05-21 16:24:37 -04:00
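The diff-based update described in this commit can be sketched roughly as below. This is a simplified Python model under stated assumptions, not the Haskell implementation: trees are plain dicts mapping path to key, `assoc` maps each key to its set of associated files, and `reconcile` is a hypothetical name.

```python
def reconcile(prev_tree, cur_tree, assoc):
    """Update assoc (key -> set of paths) from a diff of two trees.

    prev_tree and cur_tree map path -> key. An empty prev_tree models
    the upgrade case where no previous tree was recorded.
    """
    for path in prev_tree.keys() | cur_tree.keys():
        old, new = prev_tree.get(path), cur_tree.get(path)
        if old == new:
            continue  # unchanged path, nothing to do
        if old is not None:
            assoc.get(old, set()).discard(path)  # deletion or key change
        if new is not None:
            assoc.setdefault(new, set()).add(path)  # addition or key change
    # Drop keys that no longer have any associated file.
    for key in [k for k, paths in assoc.items() if not paths]:
        del assoc[key]
    return assoc


# First run: no previous tree, so everything in the index is added.
assoc = reconcile({}, {"a": "K1", "b": "K2"}, {})
# Later run: "b" was moved to "c"; only that difference is processed.
assoc = reconcile({"a": "K1", "b": "K2"}, {"a": "K1", "c": "K2"}, assoc)
```

The cost is proportional to the size of the diff, which is why a branch switch with many changes is the slow case mentioned above.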
Joey Hess
675556fd9a
smudge: check for known annexed inodes before checking annex.largefiles
smudge: Fix a case where an unlocked annexed file that annex.largefiles
does not match could get its unchanged content checked into git, due to git
running the smudge filter unnecessarily.

When the file has the same inodecache as an already annexed file,
we can assume that the user is not intending to change how it's stored in
git.

Note that checkunchangedgitfile already handled the inverse case, where the
file was added to git previously. That goes further and actually sha1
hashes the new file and checks if it's the same hash in the index.

It would be possible to generate a key for the file and see if it's the
same as the old key, however that could be considerably more expensive than
sha1 of a small file is, and it is not necessary for the case I have, at
least, where the file is not modified or touched, and so its inode will
match the cache.

git-annex add was changed, when adding a small file, to remove the inode
cache for it. This is necessary to keep the recipe in
doc/tips/largefiles.mdwn for converting from annex to git working.
It also avoids bugs/case_where_using_pathspec_with_git-commit_leaves_s.mdwn
which the earlier try at this change introduced.
2021-05-10 13:20:10 -04:00
Joey Hess
c2f612292a
start splitting out readonly values from AnnexState
Values in AnnexRead can be read more efficiently, without MVar overhead.
Only a few things have been moved into there, and the performance
increase so far is not likely to be noticeable.

This is groundwork for putting more stuff in there, particularly a value
that indicates if debugging is enabled.

The obvious next step is to change option parsing to not run in the
Annex monad to set values in AnnexState, and instead return a pure value
that gets stored in AnnexRead.
2021-04-02 15:51:44 -04:00
Joey Hess
5ce61c6b2a
add: Significantly speed up adding lots of non-large files to git
* add: Significantly speed up adding lots of non-large files to git,
  by disabling the annex smudge filter when running git add.
* add --force-small: Run git add rather than updating the index itself,
  so any other smudge filters than the annex one that may be enabled will
  be used.
2021-01-04 13:12:28 -04:00
Joey Hess
2c8cf06e75
more RawFilePath conversion
Converted file mode setting to it, and follow-on changes.

Compiles up through 369/646.

This commit was sponsored by Ethan Aubin.
2020-11-05 18:45:37 -04:00
Joey Hess
681b44236a
more RawFilePath conversion
at 377/645

This commit was sponsored by Svenne Krap on Patreon.
2020-10-29 14:20:57 -04:00
Joey Hess
529f488ec4
fix a thundering herd problem
Avoid repeatedly opening keys db when accessing a local git remote and -J
is used.

What was happening was that Remote.Git.onLocal created a new annex state
as each thread started up. The way the MVar was used did not prevent that.
And that, in turn, led to repeated opening of the keys db, as well as
probably other extra work or resource use.

Also managed to get rid of Annex.remoteannexstate, and it turned out there
was an unnecessary Maybe in the keysdbhandle, since the handle starts out
closed.
2020-04-17 17:09:29 -04:00
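The general fix for this kind of thundering herd is to make threads share one lazily created state instead of each building its own. A minimal Python sketch of that idea follows; `SharedState` is hypothetical, and a lock plays the role that the MVar plays in the Haskell code.

```python
import threading


class SharedState:
    """Lazily create one shared state, no matter how many threads ask."""

    def __init__(self):
        self._lock = threading.Lock()
        self._state = None
        self.init_count = 0  # how many times the expensive setup ran

    def get(self):
        # Checking and creating under the same lock means late arrivals
        # wait for the first thread's result rather than redoing the work.
        with self._lock:
            if self._state is None:
                self.init_count += 1
                self._state = {"keysdb": "handle"}  # placeholder for real state
            return self._state


shared = SharedState()
threads = [threading.Thread(target=shared.get) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With eight threads, the setup still runs only once.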
Joey Hess
6c81e0c8f1
ByteString Ref continued
Several nice speed wins I think.

At 340/633 files converted.
2020-04-07 13:27:11 -04:00
Joey Hess
029c883713
Merge branch 'master' into v8
2020-02-19 14:32:11 -04:00
Joey Hess
6db4aee7df
use --no-abbrev instead of --abbrev=40
This avoids hardcoding the sha size, so when git uses sha256, it will
output the full sha256 and not a truncation to 40 characters.

I reviewed git's history, and while there have been some
bugs with commands not supporting --no-abbrev (eg git diff --no-index
--no-abbrev was broken in git 2.1), none of the commands git-annex
uses will be impacted by those old bugs.
2020-01-07 12:29:37 -04:00
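The size difference behind this change is easy to check. The Python sketch below only compares hex digest lengths and builds an illustrative argument list; `diff_raw_args` is a hypothetical helper, not a function from git-annex.

```python
import hashlib


def diff_raw_args(full_hashes=True):
    """Build an illustrative `git diff --raw` argument list."""
    flag = "--no-abbrev" if full_hashes else "--abbrev=40"
    return ["git", "diff", "--raw", flag]


# A SHA-1 hex digest is 40 characters, a SHA-256 one is 64, so a
# hardcoded 40 would cut a SHA-256 object name short.
sha1_len = len(hashlib.sha1(b"x").hexdigest())
sha256_len = len(hashlib.sha256(b"x").hexdigest())
```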
Joey Hess
02e00fd7ab
Merge branch 'master' into sqlite
2019-12-19 16:33:42 -04:00
Joey Hess
686791c4ed
more RawFilePath
Remove dup definitions and just use the RawFilePath one. </> etc are
enough faster that it's probably faster than building a String directly,
although I have not benchmarked.
2019-12-18 17:10:28 -04:00
Joey Hess
d5628a16b8
Merge branch 'bs' into sqlite-bs
2019-12-18 14:51:03 -04:00
Joey Hess
c19211774f
use filepath-bytestring for annex object manipulations
git-annex find is now RawFilePath end to end, no string conversions.
So is git-annex get when it does not need to get anything.
So this is a major milestone on optimisation.

Benchmarks indicate around 30% speedup in both commands.

Probably many other performance improvements. All or nearly all places
where a file is statted use RawFilePath now.
2019-12-11 15:25:07 -04:00
Joey Hess
bdec7fed9c
convert TopFilePath to use RawFilePath
Adds a dependency on filepath-bytestring, an as yet unreleased fork of
filepath that operates on RawFilePath.

Git.Repo also changed to use RawFilePath for the path to the repo.

This does eliminate some RawFilePath -> FilePath -> RawFilePath
conversions. And filepath-bytestring's </> is probably faster.
But I don't expect a major performance improvement from this.
This is mostly groundwork for making Annex.Location use RawFilePath,
which will allow for a conversion-free pipeline.
2019-12-09 15:07:21 -04:00
Joey Hess
2f9a80d803
merging sqlite and bs branches
Since the sqlite branch uses blobs extensively, there are some
performance benefits, ByteStrings now get stored and retrieved w/o
conversion in some cases like in Database.Export.
2019-12-06 15:30:45 -04:00
Joey Hess
067aabdd48
wip RawFilePath 2x git-annex find speedup
Finally builds (oh the agony of making it build), but still very
unmergeable, only Command.Find is included and lots of stuff is badly
hacked to make it compile.

Benchmarking vs master, this git-annex find is significantly faster!
Specifically:

	num files	old	new	speedup
	48500		4.77	3.73	28%
	12500		1.36	1.02	33%
	20		0.075	0.074	0% (so startup time is unchanged)

That's without really finishing the optimization. Things still to do:

* Eliminate all the fromRawFilePath, toRawFilePath, encodeBS,
  decodeBS conversions.
* Use versions of IO actions like getFileStatus that take a RawFilePath.
* Eliminate some Data.ByteString.Lazy.toStrict, which is a slow copy.
* Use ByteString for parsing git config to speed up startup.

It's likely several of those will speed up git-annex find further.
And other commands will certainly benefit even more.
2019-11-26 16:01:58 -04:00
Joey Hess
c35a9047d3
improve data types for sqlite
This is a non-backwards-compatible change, so not suitable for merging
w/o an annex.version bump and transition code. Not yet tested.

This improves performance of git-annex benchmark --databases
across the board by 10-25%, since eg Key roundtrips as a ByteString.

(serializeKey' produces a lazy ByteString, so there is still a
copy involved in converting it to a strict ByteString. It may be faster
to switch to using bytestring-strict-builder.)

FilePath and Key are both stored as blobs. This avoids mojibake in some
situations. It would be possible to use varchar instead, if persistent
could avoid converting that to Text, but it seems there is no good
way to do so. See doc/todo/sqlite_database_improvements.mdwn

Eliminated some ugly artifacts of using Read/Show serialization;
constructors and quoted strings are no longer stored in sqlite.

Renamed SRef to SSha to reflect that it is only ever a git sha,
not a ref name. Since it is limited to the characters in a sha,
it is not affected by mojibake, so still uses String.
2019-10-29 17:05:36 -04:00
Joey Hess
94efc400e9
horrible implementation of isInodeKnown
The only good thing about it is it does not require a major version bump
to improve the database. That will need to happen at some point though.

Potentially very very slow in a large repository.

Ugly use of raw sql.
2019-10-23 14:37:29 -04:00
Joey Hess
3f0eef4baa
v7 for all repositories
* Default to v7 for new repositories.
* Automatically upgrade v5 repositories to v7.
2019-08-30 14:09:14 -04:00
Joey Hess
40ecf58d4b
update licenses from GPL to AGPL
This does not change the overall license of the git-annex program, which
was already AGPL due to a number of source files being AGPL already.

Legally speaking, I'm adding a new license under which these files are
now available; I already released their current contents under the GPL
license. Now they're dual licensed GPL and AGPL. However, I intend
for all my future changes to these files to only be released under the
AGPL license, and I won't be tracking the dual licensing status, so I'm
simply changing the license statement to say it's AGPL.

(In some cases, others wrote parts of the code of a file and released it
under the GPL; but in all cases I have contributed a significant portion
of the code in each file and it's that code that is getting the AGPL
license; the GPL license of other contributors allows combining with
AGPL code.)
2019-03-13 15:48:14 -04:00
Joey Hess
fcc9eea554
avoid closeDb opening the db if it's not already open
2018-10-30 22:19:05 -04:00
Joey Hess
a7f0b99a33
fix v6 deadlock with git 2.1.4
I don't know why git diff --raw would run the clean filter, but it did
with this version of git. Perhaps it is cleaning the file to generate the
diff to search with -G?  But then why would newer gits not run the clean
filter?

It caused git annex to deadlock because the keys database was locked
and ran a git command that ran git-annex, which tried to read from the
keys database.

This commit was sponsored by Brett Eisenberg on Patreon.
2018-09-13 13:55:25 -04:00
Joey Hess
50fa17aee6
v6: recover from race between git mv and git-annex get/drop
Update pointer file next time reconcileStaged is run to recover from the
race.

Note that restagePointerFile causes git to run the clean filter,
and that will run reconcileStaged. So, normally by the time the git
annex get/drop command finishes, the race has already been dealt with.
It may be that, in some case, that won't happen and the race will be
dealt with at a later point. git-annex could run reconcileStaged at
shutdown if that becomes a problem.

This does not handle the situation where the git mv is committed before
git-annex gets a chance to run again. git commit does run the clean
filter, and that happens to re-inject the content if it was supposed to
be dropped but is still populated. But, the case where the file was
supposed to be gotten but is not populated is not handled yet.

This commit was supported by the NSF-funded DataLad project.
2018-08-22 15:56:43 -04:00
Joey Hess
18ecf41917
avoid running reconcileStaged when the index has not changed
This commit was supported by the NSF-funded DataLad project.
2018-08-22 13:04:12 -04:00
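One way to skip work when the index has not changed is to compare a cheap stat signature of the index file with a cached one. The Python sketch below shows that idea only; the commit message does not say how git-annex detects the change, so `index_signature` and `needs_reconcile` are hypothetical.

```python
import os
import tempfile


def index_signature(path):
    """Cheap fingerprint of a file: inode, size and modification time."""
    st = os.stat(path)
    return (st.st_ino, st.st_size, st.st_mtime_ns)


def needs_reconcile(index_path, cached_sig):
    """True when the index no longer matches the cached signature."""
    return index_signature(index_path) != cached_sig


with tempfile.TemporaryDirectory() as d:
    index = os.path.join(d, "index")
    with open(index, "wb") as f:
        f.write(b"v1")
    sig = index_signature(index)
    first_check = needs_reconcile(index, sig)   # index untouched
    with open(index, "ab") as f:
        f.write(b"more")                        # size changes
    second_check = needs_reconcile(index, sig)  # index modified
```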
Joey Hess
5e56d9b620
v6: Update associated files database when git has staged changes to pointer files
This commit was supported by the NSF-funded DataLad project.
2018-08-21 17:02:20 -04:00
Joey Hess
6ab14710fc
fix consistency bug reading from export database
The export database has writes made to it and then expects to read back
the same data immediately. But, the way that Database.Handle does
writes, in order to support multiple writers, makes that not work, due
to caching issues. This resulted in export re-uploading files it had
already successfully renamed into place.

Fixed by allowing databases to be opened in MultiWriter or SingleWriter
mode. The export database only needs to support a single writer; it does
not make sense for multiple exports to run at the same time to the same
special remote.

All other databases still use MultiWriter mode. And by inspection,
nothing else in git-annex seems to be relying on being able to
immediately query for changes that were just written to the database.

This commit was supported by the NSF-funded DataLad project.
2017-09-06 17:19:07 -04:00
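A read-after-write gap of this general kind can be reproduced with two SQLite connections. The Python sketch below is an illustration under an assumption: that the write goes through a separate connection and is not yet committed when the read happens. The commit message itself only says "caching issues", and the table and file names here are made up.

```python
import os
import sqlite3
import tempfile


def read_after_write():
    """Count rows seen by a reader before and after the writer commits."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "export.db")
        writer = sqlite3.connect(path)
        reader = sqlite3.connect(path)
        writer.execute("CREATE TABLE exported (file TEXT)")
        writer.commit()

        # The insert is pending in the writer's transaction only.
        writer.execute("INSERT INTO exported VALUES ('foo')")
        before = reader.execute("SELECT count(*) FROM exported").fetchall()[0][0]

        writer.commit()
        after = reader.execute("SELECT count(*) FROM exported").fetchall()[0][0]

        writer.close()
        reader.close()
        return before, after


before, after = read_after_write()
```

The reader only sees the row once the writer has committed, so code that queries immediately after writing needs the read and the write to share a connection or to wait for the commit.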
Joey Hess
3b22ad9f47
Work around sqlite's incorrect handling of umask when creating databases.
Refactored some common code into initDb.

This only deals with the problem when creating new databases. If a repo
got bad permissions into it, it's up to the user to deal with it.

This commit was sponsored by Ole-Morten Duesund on Patreon.
2017-02-13 17:39:16 -04:00