Commit graph

416 commits

Author SHA1 Message Date
Joey Hess
37467a008f
annex.addunlocked expressions
* annex.addunlocked can be set to an expression with the same format used by
  annex.largefiles, in case you want to default to unlocking some files but
  not others.
* annex.addunlocked can be configured by git-annex config.

Added a git-annex-matching-expression man page, broken out from
tips/largefiles.

A tricky consequence of this is that git-annex add --relaxed
honors annex.addunlocked, but an expression might want to know the size
or content of an url, which it's not going to download. I decided it was
better not to fail, and just dummy up some plausible data in that case.

Performance impact should be negligible. The global config is already
loaded for annex.largefiles. The expression only has to be parsed once,
and in the simple true/false case, it should not do any additional work
matching it.
2019-12-20 15:56:25 -04:00
Joey Hess
ce3fb0b2e5
fixed an oversight that had always prevented annex.resolvemerge from being honored, when it was configured by git-annex config
forgot to add it to the merge function
2019-12-20 11:00:08 -04:00
Joey Hess
686791c4ed
more RawFilePath
Remove dup definitions and just use the RawFilePath one. </> etc are
enough faster that it's probably faster than building a String directly,
although I have not benchmarked.
2019-12-18 17:10:28 -04:00
Joey Hess
bdec7fed9c
convert TopFilePath to use RawFilePath
Adds a dependency on filepath-bytestring, an as yet unreleased fork of
filepath that operates on RawFilePath.

Git.Repo also changed to use RawFilePath for the path to the repo.

This does eliminate some RawFilePath -> FilePath -> RawFilePath
conversions. And filepath-bytestring's </> is probably faster.
But I don't expect a major performance improvement from this.
This is mostly groundwork for making Annex.Location use RawFilePath,
which will allow for a conversion-free pipleline.
2019-12-09 15:07:21 -04:00
Joey Hess
c20f4704a7
all commands building except for assistant
also, changed ConfigValue to a newtype, and moved it into Git.Config.
2019-12-05 14:41:18 -04:00
Joey Hess
b88f89c1ef
get the most commonly used commands building again
A quick benchmark of whereis shows not much speed improvement, maybe a
few percent. Profiling it found a hotspot, adds to todo.
2019-12-04 13:45:18 -04:00
Joey Hess
067aabdd48
wip RawFilePath 2x git-annex find speedup
Finally builds (oh the agoncy of making it build), but still very
unmergable, only Command.Find is included and lots of stuff is badly
hacked to make it compile.

Benchmarking vs master, this git-annex find is significantly faster!
Specifically:

	num files	old	new	speedup
	48500		4.77	3.73	28%
	12500		1.36	1.02	66%
	20		0.075	0.074	0% (so startup time is unchanged)

That's without really finishing the optimization. Things still to do:

* Eliminate all the fromRawFilePath, toRawFilePath, encodeBS,
  decodeBS conversions.
* Use versions of IO actions like getFileStatus that take a RawFilePath.
* Eliminate some Data.ByteString.Lazy.toStrict, which is a slow copy.
* Use ByteString for parsing git config to speed up startup.

It's likely several of those will speed up git-annex find further.
And other commands will certianly benefit even more.
2019-11-26 16:01:58 -04:00
Joey Hess
81d402216d cache the serialization of a Key
This will speed up the common case where a Key is deserialized from
disk, but is then serialized to build eg, the path to the annex object.

Previously attempted in 4536c93bb2
and reverted in 96aba8eff7.
The problems mentioned in the latter commit are addressed now:

Read/Show of KeyData is backwards-compatible with Read/Show of Key from before
this change, so Types.Distribution will keep working.

The Eq instance is fixed.

Also, Key has smart constructors, avoiding needing to remember to update
the cached serialization.

Used git-annex benchmark:
  find is 7% faster
  whereis is 3% faster
  get when all files are already present is 5% faster
Generally, the benchmarks are running 0.1 seconds faster per 2000 files,
on a ram disk in my laptop.
2019-11-22 17:49:16 -04:00
Joey Hess
ce48eb797c
make DropDead transition minimize remote.log for dead sameas remotes
All that needs to be retained in remote.log is the sameas-uuid.
The rest of the config is eliminated. This doesn't save enough space to
bother with, but it prevents anything sensitive in the config of the
dead sameas remote from lingering around.

Note that minimizesameasdead does not update the VectorClock when
changing the log line. That's normally a no-no, but in this case,
it makes each DropDead result in the exact same file contents,
and vector clocks are not needed because the transition breaks
the history chain.
2019-10-15 11:39:25 -04:00
Joey Hess
78f522e7c8
forgot to add this new file 2019-10-14 16:09:04 -04:00
Joey Hess
5e9a2cc37f
forget state of sameas remotes during DropDead transitions
It would have been a lot less round-about to just make git annex dead
also add the uuids of sameas remotes to the trust.log as dead.

But, that would fail in the case where there's an unmerged other clone
that has a sameas remote that the current repo does not know about.
Then it would not get marked as dead.

Handling it at transition time avoids that scenario.

Note that the generation of trustmap' in dropDead should only
happen once, due to the partial application.
2019-10-14 15:47:42 -04:00
Joey Hess
9828f45d85
add RemoteStateHandle
This solves the problem of sameas remotes trampling over per-remote
state. Used for:

* per-remote state, of course
* per-remote metadata, also of course
* per-remote content identifiers, because two remote implementations
  could in theory generate the same content identifier for two different
  peices of content

While chunk logs are per-remote data, they don't use this, because the
number and size of chunks stored is a common property across sameas
remotes.

External special remote had a complication, where it was theoretically
possible for a remote to send SETSTATE or GETSTATE during INITREMOTE or
EXPORTSUPPORTED. Since the uuid of the remote is typically generate in
Remote.setup, it would only be possible to pass a Maybe
RemoteStateHandle into it, and it would otherwise have to construct its
own. Rather than go that route, I decided to send an ERROR in this case.
It seems unlikely that any existing external special remote will be
affected. They would have to make up a git-annex key, and set state for
some reason during INITREMOTE. I can imagine such a hack, but it doesn't
seem worth complicating the code in such an ugly way to support it.

Unfortunately, both TestRemote and Annex.Import needed the Remote
to have a new field added that holds its RemoteStateHandle.
2019-10-14 13:51:42 -04:00
Joey Hess
91eed85fd4
add sameas inherited configs to newConfig
This makes initremote --sameas work with encryption inherited.
2019-10-11 13:05:20 -04:00
Joey Hess
c3975ff3b4
sameas RemoteConfig inheritance
I found a way to avoid inheritance complicating anything outside of
Logs.Remote. It seems fine to require all inherited values to be
inherited and not set in the sameas remote's config. Since inherited
values will be used for stuff like encryption and perhaps chunking, which
control the actual content stored on the remote, it seems likely that
there will not be any reason to need them to vary between two remotes
that access the same underlying data store.

The newer version of containers is free; the minimum ghc version is
bundled with a newer version than that.
2019-10-10 15:58:22 -04:00
Joey Hess
84e729fda5
fix init default description reversion
init: Fix a reversion in the last release that prevented automatically
generating and setting a description for the repository.

Seemed best to factor out uuidDescMapRaw that does not
have the default mempty descrition behavior.

I don't much like that behavior, but I know things depend on it.
One thing in particular is `git annex info` which lists the uuids and
descriptions; if the current repo has been initialized in some way that
means it does not have a description, it would not show up w/o that.

(Not only repos created due to this bug might lack that. For example a repo
that was marked dead and had --drop-dead delete its git-annex branch info,
and then came back from the dead would similarly not be in the uuid.log.
Also there have been other versions of git-annex that didn't set a default
description; for years there was no default description.)
2019-06-20 20:30:24 -04:00
Joey Hess
258a7c5cd1
add Key to all ActionItem constructors 2019-06-06 12:53:24 -04:00
Joey Hess
b646a85c4a
oops, fixed wrong change 2019-05-23 12:44:27 -04:00
Joey Hess
e1aa8b9001
avoid build failure with older base 2019-05-23 12:16:57 -04:00
Joey Hess
16a2bed710
avoid build warning on Windows about unused import 2019-05-23 12:15:33 -04:00
Joey Hess
e06feb7316
honor preferred content when importing
Importing from a special remote honors its preferred content too; unwanted
files are not imported. But, some preferred content expressions can't be
checked before files are imported, and trying to import with such an
expression will fail.

Tested this with scenarios including changing the preferred content
expression and making sure merging the import didn't delete files that were
no longer wanted.

There was one minor inefficiency mentioned in the todo that I punted on.
2019-05-21 14:38:06 -04:00
Joey Hess
97fd9da6e7
add back non-preferred files to imported tree
Prevents merging the import from deleting the non-preferred files from
the branch it's merged into.

adjustTree previously appended the new list of items to the old, which
could result in it generating a tree with multiple files with the same
name. That is not good and confuses some parts of git. Gave it a
function to resolve such conflicts.

That allowed dealing with the problem of what happens when the import
contains some files (or subtrees) with the same name as files that were
filtered out of the export. The files from the import win.
2019-05-20 16:43:52 -04:00
Joey Hess
354c0eb57f
support standard and groupwanted in keyless mode
Only when the preferred content expression includes them will a parse
failure due to them needing keys result in the preferred content
expression not parsing in keyless mode.
2019-05-14 14:59:03 -04:00
Joey Hess
9411a7c93c
matching preferred content before key is known
This will let import try to match preferred content expressions before
downloading the content and generating its key.

If an expression needs a key, it preferredContentParser with
preferredContentKeylessTokens will fail to parse it.

standard and groupwanted are not in preferredContentKeylessTokens
because they may refer to an expression that refers to a key.
That needs further work to support them.
2019-05-14 14:28:23 -04:00
Joey Hess
aa7710982b
avoid list lookup by parseToken
Minor optimisation to parsing of a preferred content expression.
2019-05-14 13:11:29 -04:00
Joey Hess
40ecf58d4b
update licenses from GPL to AGPL
This does not change the overall license of the git-annex program, which
was already AGPL due to a number of sources files being AGPL already.

Legally speaking, I'm adding a new license under which these files are
now available; I already released their current contents under the GPL
license. Now they're dual licensed GPL and AGPL. However, I intend
for all my future changes to these files to only be released under the
AGPL license, and I won't be tracking the dual licensing status, so I'm
simply changing the license statement to say it's AGPL.

(In some cases, others wrote parts of the code of a file and released it
under the GPL; but in all cases I have contributed a significant portion
of the code in each file and it's that code that is getting the AGPL
license; the GPL license of other contributors allows combining with
AGPL code.)
2019-03-13 15:48:14 -04:00
Joey Hess
ee251b2e2e
implement updating the ContentIdentifier db with info from the git-annex branch
untested

This won't be super slow, but it does need to diff two likely large
trees, and since the git-annex branch rarely sits still, it will most
likely be run at the beginning of every import.

A possible speed improvement would be to only run this when the database
did not contain a ContentIdentifier. But that would only speed up
imports when there is no new version of a file on the special remote,
at most renames of existing files being imported.

A better speed improvement would be to record something in the git-annex
branch that indicates when an import has been run, and only do the diff
if the git-annex branch has record of a newer import than we've seen
before. Then, it would only run when there is in fact new
ContentIdentifier information available from a remote. Certianly doable,
but didn't want to complicate things yet.
2019-03-06 18:04:30 -04:00
Joey Hess
1b3d04979e
speed up slow quickcheck test
Only test parsing of ContentIdentifier lists, not the whole log.
2019-03-06 16:43:41 -04:00
Joey Hess
6ef38df881
fix another parser bug 2019-03-06 14:51:31 -04:00
Joey Hess
c0bd202147
fix failing test case
An empty list of [ContentIdenfier] serialized to the same thing
as a single ContentIdentifier "". Avoid this ambiguity by requiring the
list be non-empty.
2019-03-06 14:27:15 -04:00
Joey Hess
b0fe4916b7
forgot to change list delimiter in parser 2019-03-06 11:59:42 -04:00
Joey Hess
7af55de83c
optimisation: use graftTree to remember the export branch
Sped up git-annex export in repositories with lots of keys.

Old method read whole git-annex branch tree into memory.
2019-02-22 11:16:22 -04:00
Joey Hess
56137ce0d2
use colon not space to delimit content identifier list
InodeCache serializes to a value with spaces, and seems likely other
things will too, and want to avoid unncessary base64 of content
identifiers when possible.
2019-02-21 13:45:16 -04:00
Joey Hess
1f6339ade7
fix parsing of empty content identifier
Seems very unlikely an empty content identifier would be used, but
quickcheck found this bug.
2019-02-21 13:44:09 -04:00
Joey Hess
9887a378fe
renamings to make clean when old-format logs are being used 2019-02-21 13:43:44 -04:00
Joey Hess
936aee6a60
quickcheck property for parsing of content identifier logs 2019-02-21 13:17:43 -04:00
Joey Hess
5a294f0dd7
add Logs.ContentIdentifier 2019-02-20 17:22:56 -04:00
Joey Hess
a818bc5e73
add Database.ContentIdentifier
Does not yet have a way to update with new information from the
git-annex branch, which will be needed when multiple repos are importing
from the same remote.
2019-02-20 16:59:10 -04:00
Joey Hess
e8bfc3640b
storing ContentIdentifier in the git-annex branch 2019-02-20 15:40:07 -04:00
Joey Hess
ad1d422dd7
fix false positive in export conflict detection
Like the earlier fixed one in Command.Export, it occurred when the same
tree was exported by multiple clones. Previous fix was incomplete since
several other places looked at the list of exported trees to detect when
there was an export conflict. Added a single unified function to avoid
missing any places it needed to be fixed.

This commit was sponsored by mo on Patreon.
2019-01-30 12:36:30 -04:00
Joey Hess
d913961b7b
avoid shadowing warning 2019-01-21 20:40:30 -04:00
Joey Hess
b79ef475ac
still fixing this sigh 2019-01-21 18:39:25 -04:00
Joey Hess
f603fc252d
fixed a different way this test case could fail in LANG=C 2019-01-21 16:46:40 -04:00
Joey Hess
0aea1a7b93
fix inverted logic 2019-01-21 15:41:35 -04:00
Joey Hess
e15d18058b
add back exclusion of whitespace and = 2019-01-21 12:38:00 -04:00
Joey Hess
928e08904d
avoid two test failures with LANG=C
Log.Remote.prop_parse_show_Config failed on an input of fromList [("\28162","")]
in LANG=C, encodeBS "\28162" == "\STX=", while in UTF-8 locale,
encodeBS "\28162" == "\230\184\130". So in the C locale, the String
that's the parsed Map key ends up being encoded differently than it was
in the input Map.

Logs.Presence.Pure.prop_parse_build_log was failing in LANG=C because
the Arbitrary LogLine for some reason sometimes generated LogInfo values
containing \n or \r, despite using suchThat to prevent that. I don't
understand why at all, but switching the suchThat to filter the
ByteString instead of the String before conversion with encodeBS
somehow avoids the problem.

Both of these suggest something wonky with encodeBS in LANG=C, but
I *think* it's not a problem except for with test data generated by
Arbitrary.
2019-01-18 13:33:48 -04:00
Joey Hess
d3ab5e626b
rename key2file and file2key
What these generate is not really suitable to be used as a filename,
which is why keyFile and fileKey further escape it. These are just
serializing Keys.

Also removed a quickcheck test that was very unlikely to test anything
useful, since it relied on random chance creating something that looks
like a serialized key. The other test is sufficient for testing what
that was intended to test anyway.
2019-01-14 13:03:35 -04:00
Joey Hess
2eadb6cd68
convert transitions.log to attoparsec and bytestring-builder
Not likely to be any speed gain here, but this completes porting every
log file over.

And, it let me get rid of code copied from ghc and modified, so
simplifying the licensing.
2019-01-10 17:13:30 -04:00
Joey Hess
591e4b145f
convert old uuid-based log parsers to attoparsec
This preserves the workaround for the old bug that caused NoUUID items
to be stored in the log, prefixing log lines with " ". It's now handled
implicitly, by using takeWhile1 (/= ' ') to get the uuid.

There is a behavior change from the old parser, which split the value
into words and then recombined it. That meant that "foo  bar" and "foo\tbar"
came out as "foo bar". That behavior was not documented, and seems
surprising; it meant that after a git-annex describe here "foo  bar",
you wouldn't get that same string back out when git-annex displayed repo
descriptions.

Otoh, some other parsers relied on the old behavior, and the attoparsec
rewrites had to deal with the issue themselves...

For group.log, there are some edge cases around the user providing a
group name with a leading or trailing space. The old parser would ignore
such excess whitespace. The new parser does too, because the alternative
is to refuse to parse something like " group1  group2 " due to excess
whitespace, which would be even more confusing behavior.

The only git-annex branch log file that is not converted to attoparsec
and bytestring-builder now is transitions.log.
2019-01-10 16:34:20 -04:00
Joey Hess
66603d6f75
attoparsec parsers for all new-format uuid-based logs
There should be some speed gains here, especially for chunk and remote
state logs, which are queried once per key.

Now only old-format uuid-based logs still need to be converted to attoparsec.
2019-01-10 13:30:36 -04:00
Joey Hess
7e54c215b4
Remove redundant endOfInput
parseLogLines consumes till endOfInput
2019-01-10 12:42:59 -04:00