* annex.addunlocked can be set to an expression with the same format used by
annex.largefiles, in case you want to default to unlocking some files but
not others.
* annex.addunlocked can be configured by git-annex config.
Added a git-annex-matching-expression man page, broken out from
tips/largefiles.
A tricky consequence of this is that git-annex add --relaxed
honors annex.addunlocked, but an expression might want to know the size
or content of an url, which it's not going to download. I decided it was
better not to fail, and just dummy up some plausible data in that case.
Performance impact should be negligible. The global config is already
loaded for annex.largefiles. The expression only has to be parsed once,
and in the simple true/false case, it should not do any additional work
matching it.
annex.largefiles can be configured by git-annex config, to more easily set
a default that will also be used by clones, without needing to shoehorn the
expression into the gitattributes file. The git config and gitattributes
override that.
Whenever something is added to git-annex config, we have to consider what
happens if a user puts a purposfully bad value in there. Or, if a new
git-annex adds some new value that an old git-annex can't parse.
In this case, a global annex.largefiles that can't be parsed currently
makes an error be thrown. That might not be ideal, but the gitattribute
behaves the same, and is almost equally repo-global.
Performance notes:
git-annex add and addurl construct a matcher once
and uses it for every file, so the added time penalty for reading the global
config log is minor. If the gitattributes annex.largefiles were deprecated,
git-annex add would get around 2% faster (excluding hashing), because
looking that up for each file is not fast. So this new way of setting
it is progress toward speeding up add.
git-annex smudge does need to load the log every time. As well as checking
the git attribute. Not ideal. Setting annex.gitaddtoannex=false avoids
both overheads.
Remove dup definitions and just use the RawFilePath one. </> etc are
enough faster that it's probably faster than building a String directly,
although I have not benchmarked.
git-annex find is now RawFilePath end to end, no string conversions.
So is git-annex get when it does not need to get anything.
So this is a major milestone on optimisation.
Benchmarks indicate around 30% speedup in both commands.
Probably many other performance improvements. All or nearly all places
where a file is statted use RawFilePath now.
Adds a dependency on filepath-bytestring, an as yet unreleased fork of
filepath that operates on RawFilePath.
Git.Repo also changed to use RawFilePath for the path to the repo.
This does eliminate some RawFilePath -> FilePath -> RawFilePath
conversions. And filepath-bytestring's </> is probably faster.
But I don't expect a major performance improvement from this.
This is mostly groundwork for making Annex.Location use RawFilePath,
which will allow for a conversion-free pipleline.
The parser and looking up config keys in the map should both be faster
due to using ByteString.
I had hoped this would speed up startup time, but any improvement to
that was too small to measure. Seems worth keeping though.
Note that the parser breaks up the ByteString, but a config map ends up
pointing to the config as read, which is retained in memory until every
value from it is no longer used. This can change memory usage
patterns marginally, but won't affect git-annex.
Finally builds (oh the agoncy of making it build), but still very
unmergable, only Command.Find is included and lots of stuff is badly
hacked to make it compile.
Benchmarking vs master, this git-annex find is significantly faster!
Specifically:
num files old new speedup
48500 4.77 3.73 28%
12500 1.36 1.02 66%
20 0.075 0.074 0% (so startup time is unchanged)
That's without really finishing the optimization. Things still to do:
* Eliminate all the fromRawFilePath, toRawFilePath, encodeBS,
decodeBS conversions.
* Use versions of IO actions like getFileStatus that take a RawFilePath.
* Eliminate some Data.ByteString.Lazy.toStrict, which is a slow copy.
* Use ByteString for parsing git config to speed up startup.
It's likely several of those will speed up git-annex find further.
And other commands will certianly benefit even more.
This will speed up the common case where a Key is deserialized from
disk, but is then serialized to build eg, the path to the annex object.
Previously attempted in 4536c93bb2
and reverted in 96aba8eff7.
The problems mentioned in the latter commit are addressed now:
Read/Show of KeyData is backwards-compatible with Read/Show of Key from before
this change, so Types.Distribution will keep working.
The Eq instance is fixed.
Also, Key has smart constructors, avoiding needing to remember to update
the cached serialization.
Used git-annex benchmark:
find is 7% faster
whereis is 3% faster
get when all files are already present is 5% faster
Generally, the benchmarks are running 0.1 seconds faster per 2000 files,
on a ram disk in my laptop.
* git-lfs: The url provided to initremote/enableremote will now be
stored in the git-annex branch, allowing enableremote to be used without
an url. initremote --sameas can be used to add additional urls.
* git-lfs: When there's a git remote with an url that's known to be
used for git-lfs, automatically enable the special remote.
Reasons to do this include:
1. I've gotten pretty used to git-annex's own progress display, which is
used for all transfers over ssh (except to old git-annex-shell),
and for most special remote transfers. It's getting to seem weird to see
the rsync progress display instead.
2. When -J was used, the rsync output could not be shown, and so there was
no progress display. Now there will be.
Progress will also be displayed now when cp CoW is used. But I'd expect a CoW
copy to typically run so fast that the progress display will barely be
noticable.
This commit was sponsored by Peter on Patreon.
Convert Utility.Url to return Either String so the error message can be
displated in the annex monad and so captured.
(When curl is used, its errors are still not caught.)
warningIO is not concurrent output safe, and it doesn't go to
--json-error-messages
There are a few more that would be too hard to remove, and there are also
several dozen direct prints to stderr still.
This solves the problem of sameas remotes trampling over per-remote
state. Used for:
* per-remote state, of course
* per-remote metadata, also of course
* per-remote content identifiers, because two remote implementations
could in theory generate the same content identifier for two different
peices of content
While chunk logs are per-remote data, they don't use this, because the
number and size of chunks stored is a common property across sameas
remotes.
External special remote had a complication, where it was theoretically
possible for a remote to send SETSTATE or GETSTATE during INITREMOTE or
EXPORTSUPPORTED. Since the uuid of the remote is typically generate in
Remote.setup, it would only be possible to pass a Maybe
RemoteStateHandle into it, and it would otherwise have to construct its
own. Rather than go that route, I decided to send an ERROR in this case.
It seems unlikely that any existing external special remote will be
affected. They would have to make up a git-annex key, and set state for
some reason during INITREMOTE. I can imagine such a hack, but it doesn't
seem worth complicating the code in such an ugly way to support it.
Unfortunately, both TestRemote and Annex.Import needed the Remote
to have a new field added that holds its RemoteStateHandle.
This is used by a special remote with sameas-uuid=
The remote's uuid is the sameas-uuid, but it needs to get
its RemoteConfig from the annex-config-uuid.
I found a way to avoid inheritance complicating anything outside of
Logs.Remote. It seems fine to require all inherited values to be
inherited and not set in the sameas remote's config. Since inherited
values will be used for stuff like encryption and perhaps chunking, which
control the actual content stored on the remote, it seems likely that
there will not be any reason to need them to vary between two remotes
that access the same underlying data store.
The newer version of containers is free; the minimum ghc version is
bundled with a newer version than that.
Initremote sets that, so after both initremote and enableremote,
the git config will be set.
Any remote that does not use Annex.SpecialRemote won't set
annex-config-uuid. But that's only Remote.Git, which doesn't use
RemoteConfig anyway.
This avoids some extra work, but I don't think it was possible for two ssh
endpoint discoveries run concurrently to both prompt for the ssh password;
Annex.Ssh itself deals with concurrency.
This is mostly groundwork for http password prompting.
tryGitConfigRead may run ensureInitialized first, but when checkuuid = false,
that is skipped. So, make sure it's run before all onLocal actions.
ensureInitialized is inexpensive, so the extra call by tryGitConfigRead
is not a big deal. But since it was easy to do, I made it only be run
once by all calls to onLocal.
A few calls to onLocal didn't call ensureInitialized before. Notably,
the checkPresent action didn't, and does now.
That means that there's a guarantee that any necessary repo upgrades
will be run before the checkPresent action runs in the repo. Which is
important especially for the direct mode conversion, because without
that upgrade, the checkPresent action would need to support direct mode
still. Now I can remove the last bits of direct mode support in
Annex.Content without worrying that it will break accessing remotes
that have not been upgraded.
This does necessarily mean that checkPresent needs to write to the disk
when performing such a repo upgrade. The other remote actions already
did, so retrieval from a readonly remote that needed to be upgraded would
fail. Having checkPresent also fail doesn't seem like a large reversion,
especially since it already failed in the default case when checkuuid = true.
Prompted by the test suite on windows failing to with "export foo failed"
and no information about what went wrong.
Note that only storeExportWithContentIdentifier has been converted.
storeExport still returns a Bool and so exceptions may be hidden.
However, storeExportWithContentIdentifier has many more failure modes,
since it needs to avoid overwriting modified files. So it's more
important it have better error display.
The test suite was intermittently failing with rsync complaining it
could not write to dest.
get foo (from origin...)
SHA256E-s20--e394a389d787383843decc5d3d99b6d184ffa5fddeec23b911f9ee7fc8b9ea77
20 100% 0.00kB/s 0:00:00 ^M 20 100% 0.00kB/s 0:00:00 (xfr#1, to-chk=0/1)
(from origin...)
SHA256E-s20--e394a389d787383843decc5d3d99b6d184ffa5fddeec23b911f9ee7fc8b9ea77
20 100% 0.00kB/s 0:00:00 ^M 20 100% 0.00kB/s 0:00:00 (xfr#1, to-chk=0/1)
rsync: open "/home/joey/src/git-annex/.t/tmprepo1103/.git/annex/tmp/SHA256E-s20--e394a389d787383843decc5d3d99b6d184ffa5fddeec23b911f9ee7fc8b9ea77" failed: Permission denied (13)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1207) [sender=3.1.3]
It seems that the first rsync actually transferred the file, but then for some
reason git-annex thinks it failed, so it retries. The second rsync then fails
because the first rsync copied the file mode over and so the file is not
writable now.
So, this fixes that problem, but leaves open the question of why git-annex
would think rsync failed when it wrote the file and didn't output any
error message. Possibly a bug in rsyncProgress that either hides an
error message, or somehow makes rsync unhappy?
Using Logs.RemoteState for this means that if the same key gets uploaded
twice to a git-lfs remote, but somehow has different content the two
times (eg it's an URL key with non-stable content), the sha256/size of
the newer content uploaded will overwrite what was remembered before. That
seems ok; it just means that git-annex will request the newer version of
the content when downloading from git-lfs.
It will remember the sha256 and size if both are not known, or if only
the sha256 is not known but the size is known, it only remembers the
sha256, to avoid wasting space on the size. I did not add special case
for when the sha256 is known and the size is not, because it's been a
long time since git-annex created SHA256 keys without a size.
(See doc/upgrades/SHA_size.mdwn)
The protocol design allows the server to respond with some other object;
if a server for some reason a server did that, it would not be right for
git-annex to download its content. I don't think it would be a security
hole, since git-annex is downloading a specific key and will verify the
key's content. Seems like a good idea to belt-and-suspenders test for
such a misuse of the protocol.