planned to use for an optimisation
most things using stagedDetails were not expecting to get dup files in a
conflicted merge and deal with them, so converted them to use
inRepoDetails.
And convert parser to attoparsec, probably faster.
Before, a parse failure threw the whole --stage output line in to the
filename, which was certianly a bad idea, so fixed that.
Turns out the %(rest) trick was not needed. Instead, just maintain a
list of files we've asked for, and each cat-file response is for the
next file in the list.
This actually benchmarks 25% faster than before! Very surprising, but it
must be due to needing to shove less data through the pipe, and parse
less.
This assumes that no location log files will have a newline or carriage
return in their name. catObjectStream skips any such files due to
cat-file not supporting them.
Keys have been prevented from containing newlines since 2011,
commit 480495beb4. If some old repo
had a key with a newline in it, --all will just skip processing that key.
Other things, like .git/annex/unused files certianly assume no newlines in
keys too, and AFAICR, such keys never actually worked.
Carriage return is escaped by preSanitizeKeyName since 2013. WORM keys
generated before that point could perhaps contain a CR. (URL probably not,
http probably doesn't support an URL with a raw CR in it.) So, added
a warning in fsck about such keys. Although, fsck --all will naturally
skip them, so won't be able to warn about them. Not entirely
satisfactory, but I'll bet there are not really any such keys in
existence.
Thanks to Lukey for finding this optimisation.
Fixes reversion in recent conversions, the old code relied on the GC
apparently, but the new code explicitly waits on the process, so must
close stdin handle first or the command will never exit.
Except for the assistant, which I think may use them between threads?
Most of the uses of SomeException were already catching only async exceptions.
But I did find a few places that were accidentially catching them.
This handles all createProcessSuccess callers, and aside from process
pools, the complete conversion of all process running to async exception
safety should be complete now.
Also, was able to remove from Utility.Process the old API that I now
know was not a good idea. And proof it was bad: The code size went *down*,
despite there being a fair bit of boilerplate for some future API to
reduce.
This handles all sites where checkSuccessProcess/ignoreFailureProcess
is used, except for one: Git.Command.pipeReadLazy
That one will be significantly more work to convert to bracketing.
(Also skipped Command.Assistant.autoStart, but it does not need to
shut down the processes it started on exception because they are
git-annex assistant daemons..)
forceSuccessProcess is done, except for createProcessSuccess.
All call sites of createProcessSuccess will need to be converted
to bracketing.
(process pools still todo also)
Not yet 100% done, so far I've grepped for waitForProcess and converted
everything that uses that to start the process with withCreateProcess.
Except for some things like P2P.IO and Assistant.TransferrerPool,
and Utility.CoProcess, that manage a pool of processes. See #2
in https://git-annex.branchable.com/todo/more_extensive_retries_to_mask_transient_failures/#comment-209f8a8c38e63fb3a704e1282cb269c7
for how those will need to be dealt with.
checkSuccessProcess, ignoreFailureProcess, and forceSuccessProcess calls waitForProcess, so
callers of them will also need to be dealt with, and have not been yet.
Added annex.skipunknown git config, that can be set to false to change the
behavior of commands like `git annex get foo*`, to not skip over files/dirs
that are not checked into git and are explicitly listed in the command
line.
Significant complexity was needed to handle git-annex add, which uses some
git ls-files calls, but needs to not use --error-unmatch because of course
the files are not known to git.
annex.skipunknown is planned to change to default to false in a
git-annex release in early 2022. There's a todo for that.
User reported git@my.gitlab.foo:username/myrepo.git didn't work with
git-repair, because it rewrites it to an url
ssh://git@my.gitlab.foo/~/username/myrepo.git
and the /~/ was not something the hosting site supported.
Since git-annex still generally needs the repo url to be well, an url, did
not change the conversion code. But in this case, we're running git fetch,
so we might as well pass it the remote name rather than the url.
Did a quick audit of repoLocation uses to see if there was anything else
like this problem elsewhere, and didn't see any. But this is not the first
time this special case in git and git-annex's attempt to de-special-case
it has caused a problem..
A couple of these were probably actual bugs in edge cases. Most of the
changes I'm fine with. The fact that aeson's object returns sometihng
that we know will be an Object, but the type checker does not know is
kind of annoying.
This was very susprising to me that it was not caught by -Wall, so I
enabled -Wincomplete-uni-patterns to catch such things. It found a
second one just lines above, but no others anywhere.
This change does impact git-annex config
eg "git annex config --set annex.addunlocked on"
will store "on" and new git-annex will understand that value, while
old git-annex will error:
git-annex: bad annex.addunlocked configuration in git annex config:
Parse failure: near "on"
That seems acceptable.
Not special remote configs that are only documented as =true or =false
however. Having git-annex support other values for those would break
backwards compatability when used with old versions of git-annex. And
older versions ignore invalid special remote configs.. That would not
be a good combination.
Eg"core.bare" is the same as "core.bare = true".
Note that git treats "core.bare =" the same as "core.bare = false", so the
code had to become more complicated in order to treat the absense of a
value differently than an empty value. Ugh.
Git has an obnoxious special case in git config, a line "foo" is the same
as "foo = true". That means there is no way to examine the output of
git config and tell if it was run with --null or not, since a "foo"
in the first line could be such a boolean, or could be followed by its
value on the next line if --null were used.
So, rather than trying to do such a detection, track the style of config
at all the points where it's generated.
Around 3% total speedup.
Profiling git annex find --not --in web, it's now bytestring end-to-end,
and there is only a little added overhead in eg accessing the Annex
state MVar (3%). The rest of the runtime is spent reading symlinks, and
in attoparsec.
This feels like the end of the optimisation road, without a major change
like caching information for faster queries.
Attoparsec parser for diff-tree.
Changed fromRef back to producing a String, to avoid needing to convert
every use of it. However, this does mean I'm going to miss some
opportunities where fromRef is used and the result converted back to a
ByteString. Would be worth revisiting that at some point maybe.
When the recorded submodule commit changes on an adjusted branch, the
change is carried in the function that reverseAdjustedCommit passes
for adjustTree's adjusttreeitem parameter. Update the CommitObject
handling in adjustTree to consider adjusttreeitem so that a submodule
change is synced back.
This breaks several parts of the upgrade code, when upgrading remotes
of the current repo, but those parts were buggy, and will need to be
fixed somehow anyway.
While git ls-files can actually be used on a repo that is not in the
cwd, it works inconsistently. For example, this fails:
git --git-dir=../foo/.git --work-tree=../foo ls-files ../foo
But change some of the paths to absolute and it will succeed. That seems
like a bug in git.
OTOH, this succeeds:
git --git-dir=../foo/.git --work-tree=../foo ls-files
But, that lists paths relative to the top of the --work-tree,
rather than the usual listing them relative to the cwd. Because the cwd
is not in the repo. And so anything parsing the ls-files output of that
is likely to operate on files in the wrong location. Indeed, there is
code in Upgrade/ that has this problem!
Remaining things needing converted are in the assistant, and Annex.Ssh.
Every other remaining call to createDirectoryIfMissing True has been
audited and is not relevant. The ones in Build/ of course don't get
included in the program. Others included eg, Remote.Tahoe and
Config.Files which both write to dotfiles under the home directory.
using git credential to get the password
One thing this doesn't do is wrap the password prompting inside the prompt
action. So with -J, the output can be a bit garbled.
Rather than leaking the name of the temp file, just say the config parse
failed, and where the config was downloaded from.
Not closing the bug report because two issues were reported in the same
bug report, because the universe wants me to continually re-read old
unclosed bug reports to waste my time determining what still needs to be
done.
This avoids hardcoding the sha size, so when git uses sha256, it will
output the full sha256 and not a truncation to 40 characters.
I reviewed git's history, and while there have been some
bugs with commands not supporting --no-abbrev (eg git diff --no-index
--no-abbrev was broken in git 2.1), none of the commands git-annex
uses will be impacted by those old bugs.
Git will eventually switch to sha2 and there will not be one single
shaSize anymore, but two (40 and 64).
Changed all parsers for git plumbing output to support both sizes of
shas.
One potential problem this does not deal with is, if somewhere in
git-annex it reads two shas from different sources, and compares them
to see if they're the same sha, it would fail if they're sha1 and sha256
of the same value. I don't know if that will really be a concern.
prop_encode_decode_roundtrip failed on "\175" in C locale.
This may be a new problem after the switch to RawFilePath, but it
already had filtering for high chars, so changed to only test ascii
chars.
smudge: When annex.largefiles=anything, files that were already stored in
git, and have not been modified could sometimes be converted to being
stored in the annex. Changes in 7.20191024 made this more of a problem.
This case is now detected and prevented.
* annex.addunlocked can be set to an expression with the same format used by
annex.largefiles, in case you want to default to unlocking some files but
not others.
* annex.addunlocked can be configured by git-annex config.
Added a git-annex-matching-expression man page, broken out from
tips/largefiles.
A tricky consequence of this is that git-annex add --relaxed
honors annex.addunlocked, but an expression might want to know the size
or content of an url, which it's not going to download. I decided it was
better not to fail, and just dummy up some plausible data in that case.
Performance impact should be negligible. The global config is already
loaded for annex.largefiles. The expression only has to be parsed once,
and in the simple true/false case, it should not do any additional work
matching it.
Adds a dependency on filepath-bytestring, an as yet unreleased fork of
filepath that operates on RawFilePath.
Git.Repo also changed to use RawFilePath for the path to the repo.
This does eliminate some RawFilePath -> FilePath -> RawFilePath
conversions. And filepath-bytestring's </> is probably faster.
But I don't expect a major performance improvement from this.
This is mostly groundwork for making Annex.Location use RawFilePath,
which will allow for a conversion-free pipleline.
Since the sqlite branch uses blobs extensively, there are some
performance benefits, ByteStrings now get stored and retrieved w/o
conversion in some cases like in Database.Export.
File mode is octal not decimal. This broke in the conversion to
attoparsec.
(I've submitted the content of Utility.Attoparsec to the attoparsec
developers.)
Test suite passes 100% now.
I had thought using ByteString would avoid the problem, but the
quickcheck property is still taking Arbitrary String input, so the use
of ByteString internally doesn't matter.
The parser and looking up config keys in the map should both be faster
due to using ByteString.
I had hoped this would speed up startup time, but any improvement to
that was too small to measure. Seems worth keeping though.
Note that the parser breaks up the ByteString, but a config map ends up
pointing to the config as read, which is retained in memory until every
value from it is no longer used. This can change memory usage
patterns marginally, but won't affect git-annex.
While L.toStrict copies, profiling showed it was only around 0.3% of
git-annex find runtime. Does not seem worth optimising that, which would
probably involve either a major refactoring, or a use of
UnsafeInterleaveIO.
Also, it seems to me that the latter would need to read chunks, and
preappend the leftover part to the next chunk. But a strict ByteString
append itself is a copy, so I'm not convinced that would be faster than
L.toStrict.
Finally builds (oh the agoncy of making it build), but still very
unmergable, only Command.Find is included and lots of stuff is badly
hacked to make it compile.
Benchmarking vs master, this git-annex find is significantly faster!
Specifically:
num files old new speedup
48500 4.77 3.73 28%
12500 1.36 1.02 66%
20 0.075 0.074 0% (so startup time is unchanged)
That's without really finishing the optimization. Things still to do:
* Eliminate all the fromRawFilePath, toRawFilePath, encodeBS,
decodeBS conversions.
* Use versions of IO actions like getFileStatus that take a RawFilePath.
* Eliminate some Data.ByteString.Lazy.toStrict, which is a slow copy.
* Use ByteString for parsing git config to speed up startup.
It's likely several of those will speed up git-annex find further.
And other commands will certianly benefit even more.
Goal is to make git-annex faster by using ByteString for all the
worktree traversal. For now, this is focusing on Command.Find,
in order to benchmark how much it helps. (All other commands are
temporarily disabled)
Currently in a very bad unbuildable in-between state.
Eg:
git clone url --bare r
git --git-dir r annex init
This resulted in worktree = Just "." and so several things that check
worktree to determine when the repo is bare ran code paths intended for
non-bare. One such code path[1] ran git checkout with --worktree=. which
actually makes it ignore core.bare config, and so the current directory
got populated with a checkout of the master branch in this example. There
was probably also other breakage.
The fix is a bit complicated because whether the repo is bare is not
known until after Git.Config reads the config, but Git.Config handles
setting the RepoLocations's worktree when core.worktree is set. So have
to assume the worktree is the cwd, let core.worktree override that,
and then if the repo turns out to be bare, it's set back to Nothing.
(And then GIT_WORK_TREE can still override all of that.)
[1] switchHEADBack, which runs even when the clone is not from a bare repo.
* git-lfs: The url provided to initremote/enableremote will now be
stored in the git-annex branch, allowing enableremote to be used without
an url. initremote --sameas can be used to add additional urls.
* git-lfs: When there's a git remote with an url that's known to be
used for git-lfs, automatically enable the special remote.
Renamed the database to .git/annex/keysdb;
the old .git/annex/keys gets deleted during the upgrade.
It is possible that an old git-annex process is running during the
upgrade. If so, it will be able to continue using the old keys db until the
upgrade is complete, and then will presumably fail in some ugly way. Or
perhaps the upgrade will be unable to delete the open files on some
systems, and so fail with an ugly error message.
It's also possible for multiple processes to be running the upgrade
concurrently. That should be fine; they will both write the same
information into the keys db.
Other databases still need to be upgraded.
Rescued from commit 11d6e2e260 which removed
db benchmarks in favor of benchmarking arbitrary git-annex commands. Which
is nice and general, but microbenchmarks are useful too.
Work around git cat-file --batch's odd stripping of carriage return from
the end of the line (some windows infection), avoiding crashing when the
repo contains a filename ending in a carriage return.
debian oldoldstable has 2.1, and that's what i386ancient uses. It would be
better to require git 2.2, which is needed to use adjusted branches, but
can't do that w/o losing support for some old linux kernels or a
complicated git backport.
This typo would make "git cat-file cat-file" fail, and the way it's used,
I think it broke querying all info from filenames containing newlines,
because the other queries are only run when it succeeds.
Drop support for building with ghc older than 8.4.4, and with older
versions of serveral haskell libraries than will be included in Debian 10.
The only remaining version ifdefs in the entire code base are now a couple
for aws!
This commit should only be merged after the Debian 10 release.
And perhaps it will need to wait longer than that; it would make
backporting new versions of git-annex to Debian 9 (stretch) which
has been actively happening as recently as this year.
This commit was sponsored by Ilya Shlyakhter.
The hoped for optimisation of CommandStart with -J did not materialize.
In fact, not runnign CommandStart in parallel is slower than -J3.
So, CommandStart are still run in parallel.
(The actual bad performance I've been seeing with -J in my big repo
has to do with building the remoteList.)
But, this is still progress toward making -J faster, because it gets rid
of the onlyActionOn roadblock in the way of making CommandCleanup jobs
run separate from CommandPerform jobs.
Added OnlyActionOn constructor for ActionItem which fixes the
onlyActionOn breakage in the last commit.
Made CustomOutput include an ActionItem, so even things using it can
specify OnlyActionOn.
In Command.Move and Command.Sync, there were CommandStarts that used
includeCommandAction, so output messages, which is no longer allowed.
Fixed by using startingCustomOutput, but that's still not quite right,
since it prevents message display for the includeCommandAction run
inside it too.
Prevents merging the import from deleting the non-preferred files from
the branch it's merged into.
adjustTree previously appended the new list of items to the old, which
could result in it generating a tree with multiple files with the same
name. That is not good and confuses some parts of git. Gave it a
function to resolve such conflicts.
That allowed dealing with the problem of what happens when the import
contains some files (or subtrees) with the same name as files that were
filtered out of the export. The files from the import win.
Needed for the --quiet to actually shut it up. The extra verification
this makes it do should be fine, as this is supposed to really return a
single tree's sha.
git rev-parse --quiet avoids "fatal: Invalid object name" when the
branch does not exist. Git.Ref.tree already returned a Maybe, so callers
already handle those cases themselves.
This does not change the overall license of the git-annex program, which
was already AGPL due to a number of sources files being AGPL already.
Legally speaking, I'm adding a new license under which these files are
now available; I already released their current contents under the GPL
license. Now they're dual licensed GPL and AGPL. However, I intend
for all my future changes to these files to only be released under the
AGPL license, and I won't be tracking the dual licensing status, so I'm
simply changing the license statement to say it's AGPL.
(In some cases, others wrote parts of the code of a file and released it
under the GPL; but in all cases I have contributed a significant portion
of the code in each file and it's that code that is getting the AGPL
license; the GPL license of other contributors allows combining with
AGPL code.)
The branch is only updated once the export is 100% complete. This way,
if an export is started but interrupted and so the remote does not yet
contain some of the files, an import will make a commit on the old
branch, and so won't delete the missing files.
Added graftTree but it's buggy.
Should use graftTree in Annex.Branch.graftTreeish; it will be faster
than the current implementation there.
Started Annex.Import, but untested and it doesn't yet handle tree
grafting.
My concern with using bytestring for this is the file needs to be split
into lines, and the encoding is not known. It's safe to split a utf-8
encoded file on the \n byte; only newlines get encoded to that byte in utf-8.
And this code already assumes utf-8 or ascii encoding, because it used
the filesystem encoding.