Commit graph

262 commits

Author SHA1 Message Date
Joey Hess
c47794991c
improve with continuation
no behavior change
2021-10-20 12:13:49 -04:00
Joey Hess
f4bdecc4ec
improve sqlite MultiWriter handling of read after write
This removes a messy caveat that was easy to forget and caused at least one
bug. The price paid is that, after a write to a MultiWriter db, it has to
close the db connection that it had been using to read, and open a new
connection. So it might be a little bit slower. But, writes are usually
batched together, so there's often only a single write, and so there should
not be much of a slowdown. Notice that SingleWriter already closed the db
connection after a write, so paid the same overhead.

This is the second try at fixing a bug: git-annex get when run as the first
git-annex command in a new repo did not populate all unlocked files.
(Reversion in version 8.20210621)

Sponsored-by: Boyd Stephen Smith Jr. on Patreon
2021-10-19 15:13:29 -04:00
Joey Hess
0f38ad9a69
close keys db to possibly work around WSL1 issue 2021-10-19 13:07:49 -04:00
Joey Hess
19e78816f0
convert Key to ShortByteString
This adds the overhead of a copy when serializing and deserializing keys.
I have not benchmarked much, but runtimes seem barely changed at all by that.

When a lot of keys are in memory, it improves memory use.

And, it prevents keys sometimes getting PINNED in memory and failing to GC,
which is a problem ByteString has sometimes. In particular, git-annex sync
from a borg special remote had that problem and this improved its memory
use by a large amount.

Sponsored-by: Shae Erisson on Patreon
2021-10-05 20:20:08 -04:00
Joey Hess
837116ef1e
Fix support for readonly git remotes
Boolean blindness oops.

(Reversion in version 8.20210621)

Sponsored-by: Dartmouth College's Datalad project
2021-08-30 12:34:19 -04:00
Joey Hess
fa62c98910
simplify and speed up Utility.FileSystemEncoding
This eliminates the distinction between decodeBS and decodeBS', encodeBS
and encodeBS', etc. The old implementation truncated at NUL, and the
primed versions had to do extra work to avoid that problem. The new
implementation does not truncate at NUL, and is also a lot faster.
(Benchmarked at 2x faster for decodeBS and 3x for encodeBS; more for the
primed versions.)

Note that filepath-bytestring 1.4.2.1.8 contains the same optimisation,
and upgrading to it will speed up to/fromRawFilePath.

AFAIK, nothing relied on the old behavior of truncating at NUL. Some
code used the faster versions in places where I was sure there would not
be a NUL. So this change is unlikely to break anything.

Also, moved s2w8 and w82s out of the module, as they do not involve
filesystem encoding really.

Sponsored-by: Shae Erisson on Patreon
2021-08-11 12:13:31 -04:00
Joey Hess
9f94d2894e
remove unused code 2021-07-30 18:01:36 -04:00
Joey Hess
a306560374
use SQL.addInodeCaches
This avoids deadlock when opening the database handle calls
reconcileStaged.
2021-07-27 17:34:56 -04:00
Joey Hess
73e0cbbb19
fix problem populating pointer files
This is a result of an audit of every use of getInodeCaches,
to find places that misbehave when the annex object is not in the inode
cache, despite pointer files for the same key being in the inode cache.

Unfortunately, that is the case for objects that were in v7 repos that
upgraded to v8. Added a note about this gotcha to getInodeCaches.

Database.Keys.reconcileStaged, then annex.thin is set, would fail to
populate pointer files in this situation. Changed it to check if the
annex object is unmodified the same way inAnnex does, falling back to a
checksum if the inode cache is not recorded.

Sponsored-by: Dartmouth College's Datalad project
2021-07-27 14:26:49 -04:00
Joey Hess
e4b2a067e0
fix potential race in updating inode cache
In Annex.Content, the object file was statted after pointer files were
populated. But if annex.thin is set, once the pointer files are
populated, the object file can potentially be modified via the hard
link. So, it was possible, though seemingly very unlikely, for the inode
of the modified object file to be cached.

Command.Fix and Command.Fsck had similar problems, statting the work
tree files after they were in place. Changed them to stat the temp file
that gets moved into place. This does rely on .git/annex being on the
same filesystem. If it's not, the cached inode will not be the same as
the one that the temp file gets moved to. Result will be that git-annex
will later need to do an expensive verification of the content of the
worktree files. Note that the cross-filesystem move of the temp file
already is a larger amount of extra work, so this seems acceptable.

Sponsored-by: Luke Shumaker on Patreon
2021-07-27 12:29:10 -04:00
Joey Hess
af9fdf5dba
verify associated files when checking numcopies
Most of this is just refactoring. But, handleDropsFrom
did not verify that associated files from the keys db were still
accurate, and has now been fixed to.

A minor improvement to this would be to avoid calling catKeyFile
twice on the same file, when getting the numcopies and mincopies value,
in the common case where the same file has the highest value for both.
But, it avoids checking every associated file, so it will scale well to
lots of dups already.

Sponsored-by: Kevin Mueller on Patreon
2021-06-15 11:14:52 -04:00
Joey Hess
7b6deb1109
display scanning message whenever reconcileStaged has enough files to chew on
Clear visible progress bar first.

Removed showSideActionAfter because it can't be used in reconcileStaged
(import loop). Instead, it counts the number of files it
processes and displays it after it's seen a sufficient to know it's
taking a while.

Sponsored-by: Dartmouth College's Datalad project
2021-06-08 12:48:30 -04:00
Joey Hess
1a6fa5abc8
add debugging for reconcileStaged calls for benchmarking 2021-06-08 11:57:23 -04:00
Joey Hess
7f742589f9
claw back annexed file scan speedup
Following commit c941ab6f5b, this avoids
the second, redundant scan when annex.thin is not set.

The benchmark now runs in 35.5 seconds, down from 40 seconds.

Note that the inode cache of the annex object has to be passed to
addInodeCaches now, because it might not already be in the inode caches,
unlike previously.

Sponsored-by: Dartmouth College's Datalad project
2021-06-08 11:09:15 -04:00
Joey Hess
ec1f2f246b
improve comment
remove obsolete part about a commit preventing it seeing changes
2021-06-08 10:43:48 -04:00
Joey Hess
c941ab6f5b
avoid double work in git-annex init, second try
reconcileStaged populates the db, so scanAnnexedFiles does not need to
do it again. It still makes a pass over the HEAD tree, but populating
the db was most of the expensive part.

Benchmarking with 100,000 files, git-annex init now takes 40 seconds,
vs 37 seconds with the old, buggy version of this fix. It should be
possible to win those 3 precious seconds per 100k files back, in the
case when when annex.thin is not set, with improvements to reconcileStaged
that avoid needing this second pass.

Sponsored-by: Dartmouth College's Datalad project
2021-06-08 09:36:53 -04:00
Joey Hess
2cb7b7b336
Revert "avoid double work in git-annex init"
This reverts commit 0f10f208a7.

The implementation of this turns out to be unsafe; it can lead to a keys
db deadlock. scanAnnexedFiles injects a call to inAnnex into
reconcileStaged, but inAnnex sometimes needs to read from the keys db,
which will try to re-open it when it's in the process of being opened.
The exclusive lock of gitAnnexKeysDbLock will then deadlock.

This needs to be done in some other way...
2021-06-08 09:11:24 -04:00
Joey Hess
c831a562f5
faster associated file replacement with upsert
Rather than first deleting and then inserting, upsert lets the key
associated with a file be updated in place.

Benchmarked with 100,000 files, and an empty keys database, running
reconcileStaged. It improved from 47 seconds to 34 seconds.
So this got reconcileStaged to be as fast as scanAssociatedFiles,
or faster -- scanAssociatedFiles benchmarks at 37 seconds.

(Also checked for other users of deleteWhere that could be sped up by
upsert. There are a couple, but they are not in performance critical
code paths, eg recordExportTreeCurrent is only run once per tree
export.)

I would have liked to rename FileKeyIndex to FileKeyUnique since it is
being used as a uniqueness constraint now, not just to get an index.
But, that gets converted into part of the SQL schema, and the name
is used by the upsert, so it can't be changed.

Sponsored-by: Dartmouth College's Datalad project
2021-06-08 07:53:36 -04:00
Joey Hess
0f10f208a7
avoid double work in git-annex init
reconcileStaged was doing a redundant scan to scannAnnexedFiles.

It would probably make sense to move the body of scannAnnexedFiles
into reconcileStaged, the separation does not really serve any purpose.

Sponsored-by: Dartmouth College's Datalad project
2021-06-07 16:50:14 -04:00
Joey Hess
6ceb31a30a
optimise reconcileStaged with git cat-file streaming
Commit 428c91606b made it need to do more
work in situations like switching between very different branches.

Compare with seekFilteredKeys which has a similar optimisation. Might be
possible to factor out the common part from these?

Sponsored-by: Dartmouth College's Datalad project
2021-06-07 15:26:48 -04:00
Joey Hess
0101363eb8
correctly update keys db in merge conflict
This is quite a subtle edge case, see the bug report for full details.

The second git diff is needed only when there's a merge conflict.
It would be possible to speed it up marginally by using
--diff-filter=Unmerged, but probably not enough to bother with.

Sponsored-by: Graham Spencer on Patreon
2021-06-07 12:52:36 -04:00
Joey Hess
5b7429e73a
avoid removing old associated file when there is a merge conflict
It makes sense to keep the key used by the old version of an
associated file, until the merge conflict is resolved.

Note that, since in this case git diff is being run with --index, it's
not possible to use -1 or -3, which would let the keys
associated with the new versions of the file also be added. That would
be better, because it's possible that the local modification to the file
that caused the merge conflict has not yet gotten its new key recorded
in the db.

Opened a bug about a case this is thus not able to address.

Sponsored-by: Boyd Stephen Smith Jr. on Patreon
2021-06-01 11:43:00 -04:00
Joey Hess
eb6f6ff9b8
speed up keys database writes
There seems to be no reason to check the time here. I think it was
inherited from code in Database.Fsck, which does have a reason to commit
every few minutes. Removing that syscall speeds up a git-annex init
in a repo with 100000 annexed files by about 3 seconds.

Sponsored-by: Dartmouth College's Datalad project
2021-05-31 15:01:00 -04:00
Joey Hess
73e1507c72
fix deadlock
git-annex test hung, at varying points depending
on when git decided to run the smudge clean filter.

Recent changes to reconcileStaged caused a deadlock, when git write-tree
for some reason decides to run the smudge clean filter. Which tries
to open the keys db, and blocks waiting for the lock file that its
grandparent has locked.

I don't know why git write-tree does that. It's supposed to only write a
tree from the index which needs no smudge/clean filtering.

I've verified that, in a situation where git write-tree runs the clean
filter, disabling the filter results in a tree being written that
contains the annex link, not eg, the worktree file content. So it seems
safe to disable the clean filter, but also this seems likely to be
working around a bug in git because it seems it is running the clean
filter in a situation where the object has already been cleaned.

Sponsored-by: Dartmouth College's Datalad project
2021-05-24 16:19:26 -04:00
Joey Hess
f46e4c9b7c
fix case where keys db was not initialized in time
When the keys db is opened for read, and did not exist yet, it used to
skip creating it, and return mempty values. But that prevents
reconcileStaged from populating associated files information in time for
the read. This fixes the one remaining case I know of where
the fix in a56b151f90 didn't work.

Note that, when there is a permissions error, it still avoids creating
the db and returns mempty for all queries. This does mean that
reconcileStaged does not run and so it may want to drop files that it
should not. However, presumably a permissions error on the keys database
also means that the user does not have permission to delete annex
objects, so they won't be able to drop the files anyway.

Sponsored-by: Dartmouth College's Datalad project
2021-05-24 14:46:59 -04:00
Joey Hess
d62d6e2fcf
note about a wart
All code that uses associated files already deals with this problem,
which used to be worse. Unfortunately I was not able to entirely
eliminate it, although it happens in fewer cases now.
2021-05-24 12:05:49 -04:00
Joey Hess
13423f337c
refactoring 2021-05-24 11:38:22 -04:00
Joey Hess
efae085272
fixed reconcileStaged crash when index is locked or in conflict
Eg, when git commit runs the smudge filter.

Commit 428c91606b introduced the crash,
as write-tree fails in those situations. Now it will work, and git-annex
always gets up-to-date information even in those situations. It does
need to do a bit more work, each time git-annex is run with the index
locked. Although if the index is unmodified from the last time
write-tree succeeded, that work is avoided.
2021-05-24 11:33:23 -04:00
Joey Hess
428c91606b
include locked files in the keys database associated files
Before only unlocked files were included.

The initial scan now scans for locked as well as unlocked files. This
does mean it gets a little bit slower, although I optimised it as well
as I think it can be.

reconcileStaged changed to diff from the current index to the tree of
the previous index. This lets it handle deletions as well, removing
associated files for both locked and unlocked files, which did not
always happen before.

On upgrade, there will be no recorded previous tree, so it will diff
from the empty tree to current index, and so will fully populate the
associated files, as well as removing any stale associated files
that were present due to them not being removed before.

reconcileStaged now does a bit more work. Most of the time, this will
just be due to running more often, after some change is made to the
index, and since there will be few changes since the last time, it will
not be a noticable overhead. What may turn out to be a noticable
slowdown is after changing to a branch, it has to go through the diff
from the previous index to the new one, and if there are lots of
changes, that could take a long time. Also, after adding a lot of files,
or deleting a lot of files, or moving a large subdirectory, etc.

Command.Lock used removeAssociatedFile, but now that's wrong because a
newly locked file still needs to have its associated file tracked.

Command.Rekey used removeAssociatedFile when the file was unlocked.
It could remove it also when it's locked, but it is not really
necessary, because it changes the index, and so the next time git-annex
run and accesses the keys db, reconcileStaged will run and update it.

There are probably several other places that use addAssociatedFile and
don't need to any more for similar reasons. But there's no harm in
keeping them, and it probably is a good idea to, if only to support
mixing this with older versions of git-annex.

However, mixing this and older versions does risk reconcileStaged not
running, if the older version already ran it on a given index state. So
it's not a good idea to mix versions. This problem could be dealt with
by changing the name of the gitAnnexKeysDbIndexCache, but that would
leave the old file dangling, or it would need to keep trying to remove
it.
2021-05-21 16:24:37 -04:00
Joey Hess
675556fd9a
smudge: check for known annexed inodes before checking annex.largefiles
smudge: Fix a case where an unlocked annexed file that annex.largefiles
does not match could get its unchanged content checked into git, due to git
running the smudge filter unecessarily.

When the file has the same inodecache as an already annexed file,
we can assume that the user is not intending to change how it's stored in
git.

Note that checkunchangedgitfile already handled the inverse case, where the
file was added to git previously. That goes further and actually sha1
hashes the new file and checks if it's the same hash in the index.

It would be possible to generate a key for the file and see if it's the
same as the old key, however that could be considerably more expensive than
sha1 of a small file is, and it is not necessary for the case I have, at
least, where the file is not modified or touched, and so its inode will
match the cache.

git-annex add was changed, when adding a small file, to remove the inode
cache for it. This is necessary to keep the recipe in
doc/tips/largefiles.mdwn for converting from annex to git working.
It also avoids bugs/case_where_using_pathspec_with_git-commit_leaves_s.mdwn
which the earlier try at this change introduced.
2021-05-10 13:20:10 -04:00
Joey Hess
c2f612292a
start splitting out readonly values from AnnexState
Values in AnnexRead can be read more efficiently, without MVar overhead.
Only a few things have been moved into there, and the performance
increase so far is not likely to be noticable.

This is groundwork for putting more stuff in there, particularly a value
that indicates if debugging is enabled.

The obvious next step is to change option parsing to not run in the
Annex monad to set values in AnnexState, and instead return a pure value
that gets stored in AnnexRead.
2021-04-02 15:51:44 -04:00
Joey Hess
8868a3a4c7
Fix build with persistent-2.12.0.1
persistent stopped using askLogFunc, and the thing to use is askLoggerIO
from monad-logger. Bumped the dep to the first version that contained that.

Note that the i386ancient build uses a newer monad-logger than 0.3.10,
so the new versioned dep should not break it, and presumably nothing else
either.

This commit was sponsored by Noam Kremen on Patreon.
2021-04-01 12:21:02 -04:00
Joey Hess
fc61915230
use GIT keys for export of non-annexed files
This solves the problem that import of such files gets confused and
converts them back to annexed files.

The import code already used GIT keys internally when it determined a
file should not be annexed. So now when it sees a GIT key that export
used, it already does the right thing.

This also means that even older version of git-annex can import and will
do the right thing, once a fixed version has exported. Still, there may
be other complications around upgrades; still need to think it all
through.

Moved gitShaKey and keyGitSha from Key to Annex.Export since they're
only used for export/import.

Documented GIT keys in backends, since they do appear in the git-annex
branch now.

This commit was sponsored by Graham Spencer on Patreon.
2021-03-05 14:12:11 -04:00
Joey Hess
5ce61c6b2a
add: Significantly speed up adding lots of non-large files to git
* add: Significantly speed up adding lots of non-large files to git,
  by disabling the annex smudge filter when running git add.
* add --force-small: Run git add rather than updating the index itself,
  so any other smudge filters than the annex one that may be enabled will
  be used.
2021-01-04 13:12:28 -04:00
Joey Hess
c6e693b25d
remove ContentIndentifiersCidRemoteIndex uniqueness constraint
For reasons explained in the bug report.

Implemented using a persistent migration, which works fine. It may add a
little startup overhead when a remote is enabled that uses this, but
probably un-noticable.

On the next major version, it would be fine to delete this database,
and regenerate it from the git-annex branch information. Then this
change could be reverted.

Did nothing about adding back the data that got dropped from the db
due to the bug. Only the borg special remote was probably affected,
and it's not been released yet. rm -rf .git/annex/cidsdb does work.
2020-12-23 14:03:33 -04:00
Joey Hess
028d4517c7
enable extensions needed by new version of persistent
Needed in order to use mkPersist in
persistent version 2.11.0.1
persistent-template version 2.9.1.0
2020-11-07 14:09:17 -04:00
Joey Hess
1db49497e0
finished this stage of the RawFilePath conversion
This commit was sponsored by Denis Dzyubenko on Patreon.
2020-11-06 14:10:58 -04:00
Joey Hess
2c8cf06e75
more RawFilePath conversion
Converted file mode setting to it, and follow-on changes.

Compiles up through 369/646.

This commit was sponsored by Ethan Aubin.
2020-11-05 18:45:37 -04:00
Joey Hess
9b0dde834e
convert getFileSize to RawFilePath
Lots of nice wins from this in avoiding unncessary work, and I think
nothing got slower.

This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
2020-11-05 11:32:57 -04:00
Joey Hess
681b44236a
more RawFilePath conversion
at 377/645

This commit was sponsored by Svenne Krap on Patreon.
2020-10-29 14:20:57 -04:00
Joey Hess
f45ad178cb
more RawFilePath conversion
At 318/645 after 4k lines of changes

This commit was sponsored by Jake Vosloo on Patreon.
2020-10-29 12:03:50 -04:00
Joey Hess
8d66f7ba0f
more RawFilePath conversion
Added a RawFilePath createDirectory and kept making stuff build.

Up to 296/645

This commit was sponsored by Mark Reidenbach on Patreon.
2020-10-28 17:25:59 -04:00
Joey Hess
10e62a810e
remove unused import 2020-10-02 13:45:09 -04:00
Joey Hess
37426920d8
Fix build with Benchmark build flag
Broke a while ago during optimisation work, and not noticed since the flag
is disabled by default.

This commit was sponsored by Brock Spratlen on Patreon.
2020-10-02 13:30:24 -04:00
Joey Hess
aa1ad0b7ca
remove redundant imports
Clean build under ghc 8.8.3, which seems to do better at finding cases
where two imports both provide the same symbol, and warns about one of
them.

This commit was sponsored by Ilya Shlyakhter on Patreon.
2020-06-22 11:05:34 -04:00
Joey Hess
529f488ec4
fix a thundering herd problem
Avoid repeatedly opening keys db when accessing a local git remote and -J
is used.

What was happening was that Remote.Git.onLocal created a new annex state
as each thread started up. The way the MVar was used did not prevent that.
And that, in turn, led to repeated opening of the keys db, as well as
probably other extra work or resource use.

Also managed to get rid of Annex.remoteannexstate, and it turned out there
was an unncessary Maybe in the keysdbhandle, since the handle starts out
closed.
2020-04-17 17:09:29 -04:00
Joey Hess
6c81e0c8f1
ByteString Ref continued
Several nice speed wins I think.

At 340/633 files converted.
2020-04-07 13:27:11 -04:00
Joey Hess
279991604d
started converting Ref from String to ByteString
This should make code that reads shas and refs from git faster.

Does not compile yet, a lot needs to be done still.
2020-04-06 17:14:49 -04:00
Joey Hess
7f992ef59c
mostly finished with createDirectoryUnder conversion
Remaining things needing converted are in the assistant, and Annex.Ssh.

Every other remaining call to createDirectoryIfMissing True has been
audited and is not relevant. The ones in Build/ of course don't get
included in the program. Others included eg, Remote.Tahoe and
Config.Files which both write to dotfiles under the home directory.
2020-03-06 11:57:15 -04:00
Joey Hess
6d58ca94d6
some easy createDirectoryUnder conversions 2020-03-05 15:20:10 -04:00
Joey Hess
029c883713
Merge branch 'master' into v8 2020-02-19 14:32:11 -04:00
Joey Hess
c9357bdc0e
ifdef persistent-template 2.8.0 fixes
The i386ancient build has a ghc too old for these extensions.

Build with persistent-template 2.8.0 tested.
2020-02-04 13:53:00 -04:00
Joey Hess
4920df6573
Fix build with newest version of persistent-template.
This is untested because of rain, also I am operating from truncated
copiler error messages in a bug report that also doesn't mention what the
library version is. Still, it should work.

May break builds with old ghc, in particular DerivingStrategies is
I think fairly new? The pragmas could be ifdefed if necessary. Works with
ghc 8.6.5.
2020-02-04 12:03:30 -04:00
Joey Hess
6db4aee7df
use --no-abbrev instead of --abbrev=40
This avoids hardcoding the sha size, so when git uses sha256, it will
output the full sha256 and not a truncation to 40 characters.

I reviewed git's history, and while there have been some
bugs with commands not supporting --no-abbrev (eg git diff --no-index
--no-abbrev was broken in git 2.1), none of the commands git-annex
uses will be impacted by those old bugs.
2020-01-07 12:29:37 -04:00
Joey Hess
5e4deb3620
support sha256 git repos
Git will eventually switch to sha2 and there will not be one single
shaSize anymore, but two (40 and 64).

Changed all parsers for git plumbing output to support both sizes of
shas.

One potential problem this does not deal with is, if somewhere in
git-annex it reads two shas from different sources, and compares them
to see if they're the same sha, it would fail if they're sha1 and sha256
of the same value. I don't know if that will really be a concern.
2020-01-07 12:22:19 -04:00
Joey Hess
02e00fd7ab
Merge branch 'master' into sqlite 2019-12-19 16:33:42 -04:00
Joey Hess
686791c4ed
more RawFilePath
Remove dup definitions and just use the RawFilePath one. </> etc are
enough faster that it's probably faster than building a String directly,
although I have not benchmarked.
2019-12-18 17:10:28 -04:00
Joey Hess
535b153381
building again after merge
Nice, several conversions fell out.
2019-12-18 15:02:46 -04:00
Joey Hess
d5628a16b8
Merge branch 'bs' into sqlite-bs 2019-12-18 14:51:03 -04:00
Joey Hess
c19211774f
use filepath-bytestring for annex object manipulations
git-annex find is now RawFilePath end to end, no string conversions.
So is git-annex get when it does not need to get anything.
So this is a major milestone on optimisation.

Benchmarks indicate around 30% speedup in both commands.

Probably many other performance improvements. All or nearly all places
where a file is statted use RawFilePath now.
2019-12-11 15:25:07 -04:00
Joey Hess
bdec7fed9c
convert TopFilePath to use RawFilePath
Adds a dependency on filepath-bytestring, an as yet unreleased fork of
filepath that operates on RawFilePath.

Git.Repo also changed to use RawFilePath for the path to the repo.

This does eliminate some RawFilePath -> FilePath -> RawFilePath
conversions. And filepath-bytestring's </> is probably faster.
But I don't expect a major performance improvement from this.
This is mostly groundwork for making Annex.Location use RawFilePath,
which will allow for a conversion-free pipleline.
2019-12-09 15:07:21 -04:00
Joey Hess
2f9a80d803
merging sqlite and bs branches
Since the sqlite branch uses blobs extensively, there are some
performance benefits, ByteStrings now get stored and retrieved w/o
conversion in some cases like in Database.Export.
2019-12-06 15:30:45 -04:00
Joey Hess
067aabdd48
wip RawFilePath 2x git-annex find speedup
Finally builds (oh the agoncy of making it build), but still very
unmergable, only Command.Find is included and lots of stuff is badly
hacked to make it compile.

Benchmarking vs master, this git-annex find is significantly faster!
Specifically:

	num files	old	new	speedup
	48500		4.77	3.73	28%
	12500		1.36	1.02	66%
	20		0.075	0.074	0% (so startup time is unchanged)

That's without really finishing the optimization. Things still to do:

* Eliminate all the fromRawFilePath, toRawFilePath, encodeBS,
  decodeBS conversions.
* Use versions of IO actions like getFileStatus that take a RawFilePath.
* Eliminate some Data.ByteString.Lazy.toStrict, which is a slow copy.
* Use ByteString for parsing git config to speed up startup.

It's likely several of those will speed up git-annex find further.
And other commands will certianly benefit even more.
2019-11-26 16:01:58 -04:00
Joey Hess
81d402216d cache the serialization of a Key
This will speed up the common case where a Key is deserialized from
disk, but is then serialized to build eg, the path to the annex object.

Previously attempted in 4536c93bb2
and reverted in 96aba8eff7.
The problems mentioned in the latter commit are addressed now:

Read/Show of KeyData is backwards-compatible with Read/Show of Key from before
this change, so Types.Distribution will keep working.

The Eq instance is fixed.

Also, Key has smart constructors, avoiding needing to remember to update
the cached serialization.

Used git-annex benchmark:
  find is 7% faster
  whereis is 3% faster
  get when all files are already present is 5% faster
Generally, the benchmarks are running 0.1 seconds faster per 2000 files,
on a ram disk in my laptop.
2019-11-22 17:49:16 -04:00
Joey Hess
70a8716324
improve benchmark
addAssociatedFileNewBench would sometimes pick a random number that a
previous call had already added. Using a MVar, make it always advance,
so the same behavior is benchmarked each time.
2019-11-22 13:20:22 -04:00
Joey Hess
7263aafd2b
Merge branch 'master' into sqlite 2019-11-22 12:49:35 -04:00
Joey Hess
92e1bb250b
simplify the name of the test cases 2019-11-21 17:38:58 -04:00
Joey Hess
d4661959de
Merge branch 'master' into sqlite 2019-11-21 17:26:50 -04:00
Joey Hess
25ba8156bc
improve benchmark --databases
* benchmark: Changed --databases to take a parameter specifiying the size
  of the database to benchmark.
* benchmark --databases: Display size of the populated database.
* benchmark --databases: Improve the "addAssociatedFile to (new)"
  benchmark to really add new values, not overwriting old values.
2019-11-21 17:25:20 -04:00
Joey Hess
1b5f4b67b5
use new name for new format fsck db
The old db is cleaned up when a new incremental fsck is started.

The incremental fsck won't pick up where the old one left off, but I
consider this a minor enough thing that it can just be documented and
won't be a problem.
2019-11-06 16:27:25 -04:00
Joey Hess
d3e4de0175
fix test suite
The test suite found a bug; select_ can fail now because a uniqueness
constrain has been added.

Now the test suite passes.

Also, I'm satisfied the changed PersistField instances work.
Looking over what changed, and what I've already tested, Key, FilePath,
and InodeCache are known working; ContentIdentifier is trivial
ByteString to blob; and SSha is trivial String to varchar. Both are
tested by the test suite. I've also tested the new FileSize and
EpochTime instances already, and they work.
2019-10-30 15:51:37 -04:00
Joey Hess
4940a135af
eliminate raw sql LIKE query 2019-10-30 15:19:52 -04:00
Joey Hess
9085a2cfec
make sure all sqlite selects have indexes
Bearing in mind that these indexes are really uniqueness constraints
that just happen to also make sqlite generate indexes.

In Database.ContentIndentifier, the ContentIndentifiersKeyRemoteCidIndex
is fine as a uniqueness constraint because it contains all rows from the
table. The ContentIndentifiersCidRemoteIndex is also ok because there
can only be one key for a given (cid, uuid) combination.

In Database.Export, the new ExportTreeFileKeyIndex is the same pair as
the old ExportTreeKeyFileIndex (previously ExportTreeIndex). And
in Database.Keys.SQL, the new InodeCacheKeyIndex is the same pair as the
old KeyInodeCacheIndex.
2019-10-30 13:46:52 -04:00
Joey Hess
3732f27722
document indexes
This was really confusing, though the code was ok. I think I now
understand it fully again.
2019-10-30 13:28:00 -04:00
Joey Hess
f6cfb84dfe
rename row 2019-10-30 13:22:41 -04:00
Joey Hess
aa27969e55
improve layout 2019-10-29 17:08:36 -04:00
Joey Hess
e8437ae7a3
removed now unused import 2019-10-29 17:07:15 -04:00
Joey Hess
c35a9047d3
improve data types for sqlite
This is a non-backwards compatable change, so not suitable for merging
w/o a annex.version bump and transition code. Not yet tested.

This improves performance of git-annex benchmark --databases
across the board by 10-25%, since eg Key roundtrips as a ByteString.

(serializeKey' produces a lazy ByteString, so there is still a
copy involved in converting it to a strict ByteString. It may be faster
to switch to using bytestring-strict-builder.)

FilePath and Key are both stored as blobs. This avoids mojibake in some
situations. It would be possible to use varchar instead, if persistent
could avoid converting that to Text, but it seems there is no good
way to do so. See doc/todo/sqlite_database_improvements.mdwn

Eliminated some ugly artifacts of using Read/Show serialization;
constructors and quoted strings are no longer stored in sqlite.

Renamed SRef to SSha to reflect that it is only ever a git sha,
not a ref name. Since it is limited to the characters in a sha,
it is not affected by mojibake, so still uses String.
2019-10-29 17:05:36 -04:00
Joey Hess
e1b21a0491
benchmark: Add --databases to benchmark sqlite databases
Rescued from commit 11d6e2e260 which removed
db benchmarks in favor of benchmarking arbitrary git-annex commands. Which
is nice and general, but microbenchmarks are useful too.
2019-10-29 17:05:10 -04:00
Joey Hess
25f912de5b
benchmark: Add --databases to benchmark sqlite databases
Rescued from commit 11d6e2e260 which removed
db benchmarks in favor of benchmarking arbitrary git-annex commands. Which
is nice and general, but microbenchmarks are useful too.
2019-10-29 16:59:27 -04:00
Joey Hess
0f7fd008d4
fix sql syntax 2019-10-24 11:57:17 -04:00
Joey Hess
098afe144e
display sqlite error message when it crashes 2019-10-24 11:50:55 -04:00
Joey Hess
94efc400e9
horrible impementation of isInodeKnown
The only good thing about it is it does not require a major version bump
to improve the database. That will need to happen at some point though.

Potentially very very slow in a large repository.

Ugly use of raw sql.
2019-10-23 14:37:29 -04:00
Joey Hess
904b175707
Fix build with persistent-2.10.
Added an additional constraint that persistent needs.
This also builds with persistent-2.9.2 without needing any cpp.
2019-10-17 11:58:31 -04:00
Joey Hess
9828f45d85
add RemoteStateHandle
This solves the problem of sameas remotes trampling over per-remote
state. Used for:

* per-remote state, of course
* per-remote metadata, also of course
* per-remote content identifiers, because two remote implementations
  could in theory generate the same content identifier for two different
  peices of content

While chunk logs are per-remote data, they don't use this, because the
number and size of chunks stored is a common property across sameas
remotes.

External special remote had a complication, where it was theoretically
possible for a remote to send SETSTATE or GETSTATE during INITREMOTE or
EXPORTSUPPORTED. Since the uuid of the remote is typically generate in
Remote.setup, it would only be possible to pass a Maybe
RemoteStateHandle into it, and it would otherwise have to construct its
own. Rather than go that route, I decided to send an ERROR in this case.
It seems unlikely that any existing external special remote will be
affected. They would have to make up a git-annex key, and set state for
some reason during INITREMOTE. I can imagine such a hack, but it doesn't
seem worth complicating the code in such an ugly way to support it.

Unfortunately, both TestRemote and Annex.Import needed the Remote
to have a new field added that holds its RemoteStateHandle.
2019-10-14 13:51:42 -04:00
Joey Hess
2e6fd5de71
fix flipped diffUTCTime
fsck --incremental/--more: Fix bug that prevented the incremental fsck
information from being updated every 5 minutes as it was supposed to be; it
was only updated after 1000 files were checked, which may be more files
that are possible to fsck in a given fsck time window.

Thanks to Peter Simons for help with analysis of this bug.

Auditing for other cases of the same mistake, the keys db also had it
backwards. This seems unlikely to really have been a problem;
it would need associated files updates etc to be coming in slowly for some
reason and then be interrupted to cause any problem.

IIRC the design of the keys db assumes that any interruped
operation will be restarted, and so it can lose any buffered database
updates safely.
2019-10-03 09:54:19 -04:00
Joey Hess
9628ae2e67
Close sqlite databases more robustly.
Had a report of close throwing ErrorBusy on CIFS.

Retrying up to 16 seconds is a balance between hopefully waiting long
enough for the problem to clear up and waiting so long that git-annex seems
to hang.

The new dependency is free; persistent depends on unliftio-core.
2019-09-26 12:25:21 -04:00
Joey Hess
3f0eef4baa
v7 for all repositories
* Default to v7 for new repositories.
* Automatically upgrade v5 repositories to v7.
2019-08-30 14:09:14 -04:00
Joey Hess
018b5b8173
Support building with socks-0.6 and persistant-template-2.7
persistent-template now needs UndecidableInstances.

socks changed defaultSocksConf to take a SockAddr.
2019-07-30 12:50:48 -04:00
Joey Hess
9a5ddda511
remove many old version ifdefs
Drop support for building with ghc older than 8.4.4, and with older
versions of serveral haskell libraries than will be included in Debian 10.

The only remaining version ifdefs in the entire code base are now a couple
for aws!

This commit should only be merged after the Debian 10 release.
And perhaps it will need to wait longer than that; it would make
backporting new versions of  git-annex to Debian 9 (stretch) which
has been actively happening as recently as this year.

This commit was sponsored by Ilya Shlyakhter.
2019-07-05 15:09:37 -04:00
Joey Hess
6babb2c73f
remove wrong uniqueness constraint from ContentIdentifier db
Fix bug that caused importing from a special remote to repeatedly download
unchanged files when multiple files in the remote have the same content.

Unfortunately, there's really no good way to remove a uniqueness constraint
from a sqlite database. The best that can be done is to make a new table
and copy the data over. But that would require using persistent's
migrations or raw sql, and I don't want to do either.

Instead, a sledgehammer approach: Renamed .git/annex/cid to
.git/annex/cids. When the new database doesn't exist, it will be populated
from the git-annex branch.

Noting deletes the old database. Don't want to delete it out from under
some long-running git-annex process that might be using it. It could
eventually be deleted. But this is such a new feature, probably few repos
have the database in any case.
2019-04-09 19:58:24 -04:00
Joey Hess
40ecf58d4b
update licenses from GPL to AGPL
This does not change the overall license of the git-annex program, which
was already AGPL due to a number of sources files being AGPL already.

Legally speaking, I'm adding a new license under which these files are
now available; I already released their current contents under the GPL
license. Now they're dual licensed GPL and AGPL. However, I intend
for all my future changes to these files to only be released under the
AGPL license, and I won't be tracking the dual licensing status, so I'm
simply changing the license statement to say it's AGPL.

(In some cases, others wrote parts of the code of a file and released it
under the GPL; but in all cases I have contributed a significant portion
of the code in each file and it's that code that is getting the AGPL
license; the GPL license of other contributors allows combining with
AGPL code.)
2019-03-13 15:48:14 -04:00
Joey Hess
e3a704224f
fix export db locking deadlock 2019-03-07 16:06:02 -04:00
Joey Hess
9a72785307
fixes to export db lookup when accessing importtree=yes
Now in a fresh clone with a importtree=yes remote enabled,
git annex fsck --from the remote works.
2019-03-07 14:10:56 -04:00
Joey Hess
50797ee2c5
remove obsolete comment 2019-03-07 13:02:46 -04:00
Joey Hess
71fec9060c
move 2019-03-07 12:56:40 -04:00
Joey Hess
ee251b2e2e
implement updating the ContentIdentifier db with info from the git-annex branch
untested

This won't be super slow, but it does need to diff two likely large
trees, and since the git-annex branch rarely sits still, it will most
likely be run at the beginning of every import.

A possible speed improvement would be to only run this when the database
did not contain a ContentIdentifier. But that would only speed up
imports when there is no new version of a file on the special remote,
at most renames of existing files being imported.

A better speed improvement would be to record something in the git-annex
branch that indicates when an import has been run, and only do the diff
if the git-annex branch has record of a newer import than we've seen
before. Then, it would only run when there is in fact new
ContentIdentifier information available from a remote. Certianly doable,
but didn't want to complicate things yet.
2019-03-06 18:04:30 -04:00
Joey Hess
f85f06aae3
change to more efficient IKey 2019-03-06 11:14:33 -04:00
Joey Hess
3c652e1499
limit to requested remote 2019-03-05 15:56:28 -04:00
Joey Hess
cd3a2b023a
initial try at using storeExportWithContentIdentifier
Untested, and I'm not sure about the locking of the ContentIdentifier db.
2019-03-04 17:50:41 -04:00