Commit graph

114 commits

Author SHA1 Message Date
Joey Hess
7a42a47902
renaming 2020-07-10 14:17:35 -04:00
Joey Hess
c19211774f
use filepath-bytestring for annex object manipulations
git-annex find is now RawFilePath end to end, no string conversions.
So is git-annex get when it does not need to get anything.
So this is a major milestone on optimisation.

Benchmarks indicate around 30% speedup in both commands.

Probably many other performance improvements. All or nearly all places
where a file is statted use RawFilePath now.
2019-12-11 15:25:07 -04:00
Joey Hess
bdec7fed9c
convert TopFilePath to use RawFilePath
Adds a dependency on filepath-bytestring, an as yet unreleased fork of
filepath that operates on RawFilePath.

Git.Repo also changed to use RawFilePath for the path to the repo.

This does eliminate some RawFilePath -> FilePath -> RawFilePath
conversions. And filepath-bytestring's </> is probably faster.
But I don't expect a major performance improvement from this.
This is mostly groundwork for making Annex.Location use RawFilePath,
which will allow for a conversion-free pipleline.
2019-12-09 15:07:21 -04:00
Joey Hess
067aabdd48
wip RawFilePath 2x git-annex find speedup
Finally builds (oh the agoncy of making it build), but still very
unmergable, only Command.Find is included and lots of stuff is badly
hacked to make it compile.

Benchmarking vs master, this git-annex find is significantly faster!
Specifically:

	num files	old	new	speedup
	48500		4.77	3.73	28%
	12500		1.36	1.02	66%
	20		0.075	0.074	0% (so startup time is unchanged)

That's without really finishing the optimization. Things still to do:

* Eliminate all the fromRawFilePath, toRawFilePath, encodeBS,
  decodeBS conversions.
* Use versions of IO actions like getFileStatus that take a RawFilePath.
* Eliminate some Data.ByteString.Lazy.toStrict, which is a slow copy.
* Use ByteString for parsing git config to speed up startup.

It's likely several of those will speed up git-annex find further.
And other commands will certianly benefit even more.
2019-11-26 16:01:58 -04:00
Joey Hess
81d402216d cache the serialization of a Key
This will speed up the common case where a Key is deserialized from
disk, but is then serialized to build eg, the path to the annex object.

Previously attempted in 4536c93bb2
and reverted in 96aba8eff7.
The problems mentioned in the latter commit are addressed now:

Read/Show of KeyData is backwards-compatible with Read/Show of Key from before
this change, so Types.Distribution will keep working.

The Eq instance is fixed.

Also, Key has smart constructors, avoiding needing to remember to update
the cached serialization.

Used git-annex benchmark:
  find is 7% faster
  whereis is 3% faster
  get when all files are already present is 5% faster
Generally, the benchmarks are running 0.1 seconds faster per 2000 files,
on a ram disk in my laptop.
2019-11-22 17:49:16 -04:00
Joey Hess
3066bdb1fb
fix annex.largefiles largerthan/smallerthan bug
Fix bug in handling of annex.largefiles that use largerthan/smallerthan.
When adding a modified file, it incorrectly used the file size of the old
version of the file, not the current size.

That was the only largefiles limit that didn't directly look at the file on
disk already. Added a new type to keep straight the two different ways such
a limit can be matched. I kind of wanted to extend MatchingFile or FileInfo
to indicate that the matcher is supposed to operate on files from disk or
annex, but it turned out to be too complex to implement it that way.

This also changes the LimitAnnexFiles case when lookupFileKey does not find
a key. It used to fall back to statting the file, now it always returns
False. I doubt the old code could really get to that point, but if it
somehow does, it's better for preferred content matching to be consistent.
2019-09-30 17:15:08 -04:00
Joey Hess
b13a350556
added --unlocked and --locked 2019-09-19 12:33:13 -04:00
Joey Hess
fda1bdd679
Added --mimetype and --mimeencoding file matching options.
Already had these for largefiles matching, but I forgot to add them as
command-line options.
2019-09-19 12:09:59 -04:00
Joey Hess
d1a0c7b16f
make --in=here fast
Use the same optimisation for --in=here as has always been used for --in=.
rather than the slow code path that unncessarily queries the git-annex
branch.

It looks like when "here" got added as an alias for "." back in 2012, I
forgot about this place.

Also sped up some very unlikely ways of referring to the current
repository.

Note that, this could in some rare corner case cause a behavior
change, if the git-annex branch and inAnnex disagree about whether content
is present in the local repository. But --in=. already behaved
that way, and the truth on the ground should win also.
2019-08-01 00:29:47 -04:00
Joey Hess
aa7710982b
avoid list lookup by parseToken
Minor optimisation to parsing of a preferred content expression.
2019-05-14 13:11:29 -04:00
Joey Hess
9dd764e6f7
Added mimeencoding= term to annex.largefiles expressions.
* Added mimeencoding= term to annex.largefiles expressions.
  This is probably mostly useful to match non-text files with eg
  "mimeencoding=binary"
* git-annex matchexpression: Added --mimeencoding option.
2019-04-30 12:17:22 -04:00
Joey Hess
40ecf58d4b
update licenses from GPL to AGPL
This does not change the overall license of the git-annex program, which
was already AGPL due to a number of sources files being AGPL already.

Legally speaking, I'm adding a new license under which these files are
now available; I already released their current contents under the GPL
license. Now they're dual licensed GPL and AGPL. However, I intend
for all my future changes to these files to only be released under the
AGPL license, and I won't be tracking the dual licensing status, so I'm
simply changing the license statement to say it's AGPL.

(In some cases, others wrote parts of the code of a file and released it
under the GPL; but in all cases I have contributed a significant portion
of the code in each file and it's that code that is getting the AGPL
license; the GPL license of other contributors allows combining with
AGPL code.)
2019-03-13 15:48:14 -04:00
Joey Hess
467c3b393d
refactor magic 2019-01-23 12:40:59 -04:00
Joey Hess
727767e1e2
make everything build again after ByteString Key changes 2019-01-11 16:39:46 -04:00
Joey Hess
6f66b53a30
newtype Group to ByteString
This may speed up queries for things in groups, due to Eq and Ord being faster.
2019-01-09 15:05:49 -04:00
Joey Hess
029ae8d4db
support findred and --branch with file matching options
* findref: Support file matching options: --include, --exclude,
  --want-get, --want-drop, --largerthan, --smallerthan, --accessedwithin
* Commands supporting --branch now apply file matching options --include,
  --exclude, --want-get, --want-drop to filenames from the branch.
  Previously, combining --branch with those would fail to match anything.
* add, import, findref: Support --time-limit.

This commit was sponsored by Jake Vosloo on Patreon.
2018-12-09 13:38:35 -04:00
Joey Hess
6e6c9cc6d3
Added --accessedwithin matching option.
Useful for dropping old objects from cache repositories.

But also, quite a genrally useful thing to have..

Rather than imitiating find's -atime and other options, all of which are
pretty horrible to use, I made this match files accessed within a time
period, using the same duration format used by git-annex schedule and
--limit-time

In passing, changed the --limit-time option parser to parse the
duration, instead of having it later throw an error.

This commit was supported by the NSF-funded DataLad project.
2018-08-01 15:34:03 -04:00
Joey Hess
95f7295b67
followup 2018-06-04 12:12:56 -04:00
Joey Hess
f56594af9e
finish fixing inverted Ord for TrustLevel
Flipped all comparisons. When a TrustLevel list was wanted from Trusted
downwards, used Down to compare it in that order.

This commit was sponsored by mo on Patreon.
2018-04-13 15:17:54 -04:00
Joey Hess
a0e4b9678b
fix inverted Ord for TrustLevel (intermediate commit)
This commit removes the Ord and Enum instances, commenting out all code
that depends on them, to make sure that all code effected by the
inversion fix has been identified.

(Assuming no ifdefs involve TrustLevel.)

The next commit will fix up all the identified code.
2018-04-13 14:50:14 -04:00
Joey Hess
49114cf4ea
securehash matching
Added --securehash option to match files using a secure hash function, and
corresponding securehash preferred content expression.

This commit was sponsored by Ethan Aubin.
2017-02-27 15:02:44 -04:00
Joey Hess
9c4650358c
add KeyVariety type
Where before the "name" of a key and a backend was a string, this makes
it a concrete data type.

This is groundwork for allowing some varieties of keys to be disabled
in file2key, so git-annex won't use them at all.

Benchmarks ran in my big repo:

old git-annex info:

real	0m3.338s
user	0m3.124s
sys	0m0.244s

new git-annex info:

real	0m3.216s
user	0m3.024s
sys	0m0.220s

new git-annex find:

real	0m7.138s
user	0m6.924s
sys	0m0.252s

old git-annex find:

real	0m7.433s
user	0m7.240s
sys	0m0.232s

Surprising result; I'd have expected it to be slower since it now parses
all the key varieties. But, the parser is very simple and perhaps
sharing KeyVarieties uses less memory or something like that.

This commit was supported by the NSF-funded DataLad project.
2017-02-24 15:16:56 -04:00
Joey Hess
9eb10caa27
Some optimisations to string splitting code.
Turns out that Data.List.Utils.split is slow and makes a lot of
allocations. Here's a much simpler single character splitter that behaves
the same (even in wacky corner cases) while running in half the time and
75% the allocations.

As well as being an optimisation, this helps move toward eliminating use of
missingh.

(Data.List.Split.splitOn is nearly as slow as Data.List.Utils.split and
allocates even more.)

I have not benchmarked the effect on git-annex, but would not be surprised
to see some parsing of eg, large streams from git commands run twice as
fast, and possibly in less memory.

This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
2017-01-31 19:06:22 -04:00
Joey Hess
0a4479b8ec
Avoid backtraces on expected failures when built with ghc 8; only use backtraces for unexpected errors.
ghc 8 added backtraces on uncaught errors. This is great, but git-annex was
using error in many places for a error message targeted at the user, in
some known problem case. A backtrace only confuses such a message, so omit it.

Notably, commands like git annex drop that failed due to eg, numcopies,
used to use error, so had a backtrace.

This commit was sponsored by Ethan Aubin.
2016-11-15 21:29:54 -04:00
Joey Hess
eac26f13db
Fix bug in annex.largefiles mimetype= matching when git-annex is run in a subdirectory of the repository. 2016-04-12 14:19:34 -04:00
Joey Hess
b946ca44c3
Support --metadata field<number, --metadata field>number etc to match ranges of numeric values.
Similarly (well, for free), support preferred content expressions like
metadata=field<number and metadata=field>number
2016-02-27 10:55:02 -04:00
Joey Hess
a5bf674bec
Avoid crashing when built with MagicMime support, but when the magic database cannot be loaded. 2016-02-23 14:39:56 -04:00
Joey Hess
23cc315c38
matchexpression: Added --largefiles option to parse an annex.largefiles expression. 2016-02-03 16:58:36 -04:00
Joey Hess
5127cb59cc
annex.largefiles: Add support for mimetype=text/* etc, when git-annex is linked with libmagic. 2016-02-03 16:29:34 -04:00
Joey Hess
cdf5977053
simplify 2016-02-03 13:23:34 -04:00
Joey Hess
d37fe6a547
annex.largefiles can be configured in .gitattributes too
This is particulary useful for v6 repositories, since the .gitattributes
configuration will apply in all clones of the repository.
2016-02-02 15:18:17 -04:00
Joey Hess
d3ba9fe5c8
matchexpression: New plumbing command to check if a preferred content expression matches some data. 2016-01-25 16:16:18 -04:00
Joey Hess
737e45156e
remove 163 lines of code without changing anything except imports 2016-01-20 16:36:33 -04:00
Joey Hess
cdd27b8920
reorg 2015-12-15 15:34:28 -04:00
Joey Hess
983c1894eb
avoid unnecessary reading of git-annex branch data when matching on annex.largefiles
This makes git annex clean not look at the git-annex branch at all,
and so speeds it up by 50% or more.
2015-12-04 15:06:41 -04:00
Joey Hess
9dfe03dbcd Improve shutdown due to --time-limit, especially for fsck
* Perform a clean shutdown when --time-limit is reached.
  This includes running queued git commands, and cleanup actions normally
  run when a command is finished.
* fsck: Commit incremental fsck database when --time-limit is reached.
  Previously, some of the last files fscked did not make it into the
  database when using --time-limit.

Note that this changes Annex.addCleanup hooks, to run after --time-limit
expires. Fsck was using such a hook to clean up after a
--incremental-schedule, and that shouldn't run when --time-limit exipires
it. So, instead, moved that cleanup code to be run by cleanupIncremental.
Resulted in some data type juggling.
2015-07-31 16:01:54 -04:00
Joey Hess
8c46ea22c2 Added new "anything" preferred content expression, which matches all versions of all files. 2015-06-16 17:03:34 -04:00
Joey Hess
38c458b407 refactor 2015-04-30 14:02:56 -04:00
Joey Hess
b94eb9b22c relFile does not have to be relative; rename to currFile 2015-02-06 16:03:02 -04:00
Joey Hess
afc5153157 update my email address and homepage url 2015-01-21 12:50:09 -04:00
Joey Hess
4f657aa14e add getFileSize, which can get the real size of a large file on Windows
Avoid using fileSize which maxes out at just 2 gb on Windows.
Instead, use hFileSize, which doesn't have a bounded size.
Fixes support for files > 2 gb on Windows.

Note that the InodeCache code only needs to compare a file size,
so it doesn't matter it the file size wraps. So it has been
left as-is. This was necessary both to avoid invalidating existing inode
caches, and because the code passed FileStatus around and would have become
more expensive if it called getFileSize.

This commit was sponsored by Christian Dietrich.
2015-01-20 17:09:24 -04:00
Joey Hess
b61c6bc2ff hlint 2014-10-09 15:46:05 -04:00
Joey Hess
7b50b3c057 fix some mixed space+tab indentation
This fixes all instances of " \t" in the code base. Most common case
seems to be after a "where" line; probably vim copied the two space layout
of that line.

Done as a background task while listening to episode 2 of the Type Theory
podcast.
2014-10-09 15:09:11 -04:00
Joey Hess
c784ef4586 unify exception handling into Utility.Exception
Removed old extensible-exceptions, only needed for very old ghc.

Made webdav use Utility.Exception, to work after some changes in DAV's
exception handling.

Removed Annex.Exception. Mostly this was trivial, but note that
tryAnnex is replaced with tryNonAsync and catchAnnex replaced with
catchNonAsync. In theory that could be a behavior change, since the former
caught all exceptions, and the latter don't catch async exceptions.

However, in practice, nothing in the Annex monad uses async exceptions.
Grepping for throwTo and killThread only find stuff in the assistant,
which does not seem related.

Command.Add.undo is changed to accept a SomeException, and things
that use it for rollback now catch non-async exceptions, rather than
only IOExceptions.
2014-08-07 22:03:29 -04:00
Joey Hess
e880d0d22c replace (Key, Backend) with Key
Only fsck and reinject and the test suite used the Backend, and they can
look it up as needed from the Key. This simplifies the code and also speeds
it up.

There is a small behavior change here. Before, all commands would warn when
acting on an annexed file with an unknown backend. Now, only fsck and
reinject show that warning.
2014-04-17 18:03:39 -04:00
Joey Hess
fe19e15040 reorg matcher types; no non-type code changes 2014-03-29 14:43:34 -04:00
Joey Hess
ed30b81e2c Improve behavior when unable to parse a preferred content expression (thanks, ion).
Fall back to "present" as the preferred conent expression, which will
not result in any content movement.
2014-03-20 00:10:12 -04:00
Joey Hess
83ccce68a2 theoretical optimisation of --in
Avoids looking up the remote each time, but in practice, does not result in
a measurable speedup.
2014-03-13 18:51:44 -04:00
Joey Hess
24f8136504 --metadata field=value can now use globs to match, and matches case insensatively, the same as git annex view field=value does.
Also refactored glob code into its own module.
2014-02-21 18:34:34 -04:00
Joey Hess
2075cdeb59
limiting files based on metadata
Note that there is currently no caching, so
	--metadata foo=bar --metadata tag=blah
will currently read the log 2x per file.
2014-02-13 02:24:30 -04:00