Some recent changes to use mask missed that async exceptions can still
be thrown inside it. The goal is to make sure a block of cleanup code
runs entirely, w/o being interrupted by an async exception, so use
uninterruptibleMask.
Also, converted a few to bracket, which is nicer.
Tested the forcerestart code path and it works.
The hairy part is, what if an async exception is caught when it's in
restart?
If it's in the part that stops the old process, the old process
is left in the handle. The next attempt to use the CoProcessHandle
will then throw an IO exception, which will result in restart getting
run again. So I think this will work, but have not actually tested it.
The use of withMVarMasked lets it start the new process and fill the
mvar with it, even if there's an async exception at that point.
Note that exceptions are masked while running forcerestart, so
do not need to worry about an async exception being thrown while it's
recovering from an async exception.
Audited for openFile and openFd, and this fixes all the ones I found
where an async exception could prevent the file getting closed.
Except for the lock pool, which is a whole other can of worms.
Remove old code that can be trivially implemented using async in a much
nicer way (that is async exception safe).
I've audited all forkOS calls (except for ones in the assistant),
and this was the last remaining one that is not async exception safe.
The rest look ok to me.
This handles all createProcessSuccess callers, and aside from process
pools, the complete conversion of all process running to async exception
safety should be complete now.
Also, was able to remove from Utility.Process the old API that I now
know was not a good idea. And proof it was bad: The code size went *down*,
despite there being a fair bit of boilerplate for some future API to
reduce.
This handles all sites where checkSuccessProcess/ignoreFailureProcess
is used, except for one: Git.Command.pipeReadLazy
That one will be significantly more work to convert to bracketing.
(Also skipped Command.Assistant.autoStart, but it does not need to
shut down the processes it started on exception because they are
git-annex assistant daemons..)
forceSuccessProcess is done, except for createProcessSuccess.
All call sites of createProcessSuccess will need to be converted
to bracketing.
(process pools still todo also)
Not yet 100% done, so far I've grepped for waitForProcess and converted
everything that uses that to start the process with withCreateProcess.
Except for some things like P2P.IO and Assistant.TransferrerPool,
and Utility.CoProcess, that manage a pool of processes. See #2
in https://git-annex.branchable.com/todo/more_extensive_retries_to_mask_transient_failures/#comment-209f8a8c38e63fb3a704e1282cb269c7
for how those will need to be dealt with.
checkSuccessProcess, ignoreFailureProcess, and forceSuccessProcess calls waitForProcess, so
callers of them will also need to be dealt with, and have not been yet.
Makes it stop the command if the consumer gets killed.
Also, it seems that the old version expected bracketOnError to return
the False from the error handler, but it does not, it would have thrown
the exception and ignored the False. That's fixed, it will now return
False when there is an exception.
This was a pre-withCreateProcess attempt at doing the same thing, so can
just call boolSystem now that it uses withCreateProcess.
There's a slight behavior change, since it used to wait, after an async
exception, for the command to finish, before re-throwing the exception.
Now, it rethrows the exception right away. I don't think that impact any
of the users of this.
This relicates git's behavior. It adds a few stat calls for the command
line parameters, so there is some minor slowdown, but even with thousands
of parameters it will not be very noticable, and git does the same statting
in similar circumstances.
Note that this does not prevent eg "git annex add symlink"; the symlink
will be added to git as usual. And "git annex find symlink" will silently
list nothing as well. It's only "symlink/foo" or "subdir/symlink/foo" that
triggers the warning.
* addurl --preserve-filename: New option, uses server-provided filename
without any sanitization, but with some security checking.
Not yet implemented for remotes other than the web.
* addurl, importfeed: Avoid adding filenames with leading '.', instead
it will be replaced with '_'.
This might be considered a security fix, but a CVE seems unwattanted.
It was possible for addurl to create a dotfile, which could change
behavior of some program. It was also possible for a web server to say
the file name was ".git" or "foo/.git". That would not overrwrite the
.git directory, but would cause addurl to fail; of course git won't
add "foo/.git".
sanitizeFilePath is too opinionated to remain in Utility, so moved it.
The changes to mkSafeFilePath are because it used sanitizeFilePath.
In particular:
isDrive will never succeed, because "c:" gets munged to "c_"
".." gets sanitized now
".git" gets sanitized now
It will never be null, because sanitizeFilePath keeps the length
the same, and splitDirectories never returns a null path.
Also, on the off chance a web server suggests a filename of "",
ignore that, rather than trying to save to such a filename, which would
fail in some way.
Due to eg, too long a path to the agent socket, caused by running gpg in a
container where /run is not mounted, and/or some other gpg behavior like
unnecessarily making relative paths to its home directory absolute.
addurl: When run with --fast on an url that
annex.security.allowed-ip-addresses prevents accessing, display a more
useful message.
(Also importfeed --fast potentially.)
Limited to min of -JN or number of CPU cores, because it will often be
CPU bound, once it's read the gitignore file for a directory.
In some situations it's more disk bound, but in any case it's unlikely
to be the main bottleneck that -J is used to avoid. Eg, when dropping,
this is used for numcopies checks, but the main bottleneck will be
accessing the remotes to verify presence. So the user might decide to
-J32 that, but having 32 check-attr processes would just waste however
many filehandles they open, and probably worsen their performance due to
CPU contention.
Note that, I first tried just letting up to the -JN be started. However,
even when it's no bottleneck at all, that still results in all of them
being started. Why? Well, all the worker threads start up nearly
simulantaneously, so there's a thundering herd..
Avoid running a large number of git cat-file child processes when run with
a large -J value.
This implementation takes care to avoid adding any overhead to git-annex
when run without -J. When run with -J, there is a small bit of added
overhead, to manipulate the resource pool. That optimisation added a
fair bit of complexity.
Attoparsec parser for diff-tree.
Changed fromRef back to producing a String, to avoid needing to convert
every use of it. However, this does mean I'm going to miss some
opportunities where fromRef is used and the result converted back to a
ByteString. Would be worth revisiting that at some point maybe.
It will create foo/.git/annex/, but not foo/.git/ and not foo/.
This will avoid it creating an empty path to a repo when a drive is
yanked out and the mount point goes away, for example.
Noticed that it gets the CWD unncessarily when the path is absolute.
I have not benchmarked this, but I guess that the small overhead of
isAbsolute is so tiny compared to the system call that it's worth
it even if most of the time relative paths are passed to absPath.
The 'fail' method has been moved to the 'MonadFail' class. I made the changes
so that the code still compiles with previous versions of 'base' that don't
have the new MonadFail class exported by Prelude yet.
If git-credential has it cached and does not prompt, this will
unfortunately result in a brief flicker, as the displayed console
regions are hidden while running it and then re-displayed. Better than a
corrupted display.
Actually, I tried it and don't see a visible flicker, so probably only
over a slow ssh will it be apparent.
using git credential to get the password
One thing this doesn't do is wrap the password prompting inside the prompt
action. So with -J, the output can be a bit garbled.
prop_encode_decode_roundtrip failed on "\175" in C locale.
This may be a new problem after the switch to RawFilePath, but it
already had filtering for high chars, so changed to only test ascii
chars.
eg, `git-annex get . ..` used to order the files strangly, because it
did not realize that when git ls-files output eg "foo", that should be
grouped with the first set of files and not the second set.
Fixed by making dirContains "." "./foo" = True
which makes sense, because dirContains ".." "../foo" = True
the encode' and decode' functions on Windows should not apply the
filesystem encoding, which does not work there. Instead, convert to and
from UTF-8.
Also, avoid exporting encodeW8 and decodeW8. Both use the filesystem
encoding, so won't work as expected on windows.
git-annex find is now RawFilePath end to end, no string conversions.
So is git-annex get when it does not need to get anything.
So this is a major milestone on optimisation.
Benchmarks indicate around 30% speedup in both commands.
Probably many other performance improvements. All or nearly all places
where a file is statted use RawFilePath now.
Since the sqlite branch uses blobs extensively, there are some
performance benefits, ByteStrings now get stored and retrieved w/o
conversion in some cases like in Database.Export.
Only done on those calls to getFileStatus that had a RawFilePath, not a
FilePath. The others would probably be just as fast if converted to use
it with toRawFilePath, but I'm not 100% sure.
Note that genInodeCache' uses fromRawFilePath, but that value only gets
used on Windows, so on unix the thunk will never be evaluated.
File mode is octal not decimal. This broke in the conversion to
attoparsec.
(I've submitted the content of Utility.Attoparsec to the attoparsec
developers.)
Test suite passes 100% now.
Finally builds (oh the agoncy of making it build), but still very
unmergable, only Command.Find is included and lots of stuff is badly
hacked to make it compile.
Benchmarking vs master, this git-annex find is significantly faster!
Specifically:
num files old new speedup
48500 4.77 3.73 28%
12500 1.36 1.02 66%
20 0.075 0.074 0% (so startup time is unchanged)
That's without really finishing the optimization. Things still to do:
* Eliminate all the fromRawFilePath, toRawFilePath, encodeBS,
decodeBS conversions.
* Use versions of IO actions like getFileStatus that take a RawFilePath.
* Eliminate some Data.ByteString.Lazy.toStrict, which is a slow copy.
* Use ByteString for parsing git config to speed up startup.
It's likely several of those will speed up git-annex find further.
And other commands will certianly benefit even more.
Goal is to make git-annex faster by using ByteString for all the
worktree traversal. For now, this is focusing on Command.Find,
in order to benchmark how much it helps. (All other commands are
temporarily disabled)
Currently in a very bad unbuildable in-between state.
This will speed up the common case where a Key is deserialized from
disk, but is then serialized to build eg, the path to the annex object.
Previously attempted in 4536c93bb2
and reverted in 96aba8eff7.
The problems mentioned in the latter commit are addressed now:
Read/Show of KeyData is backwards-compatible with Read/Show of Key from before
this change, so Types.Distribution will keep working.
The Eq instance is fixed.
Also, Key has smart constructors, avoiding needing to remember to update
the cached serialization.
Used git-annex benchmark:
find is 7% faster
whereis is 3% faster
get when all files are already present is 5% faster
Generally, the benchmarks are running 0.1 seconds faster per 2000 files,
on a ram disk in my laptop.
Used to work but was broken in version 7.20181031, specifically commit
5ab0f48ffb.
That this was not noticed over at least 1 daylight savings time zone
changes makes me wonder if the TSDelta stuff is still needed.
Perhaps the mtime on Windows no longer changes when the time zone is changed?
(cherry picked from commit 09ee6b0ccb)
Eliminated some dead code. In other cases, exported a currently unused
function, since it was a logical part of the API.
Of course this improves the API documentation. It may also sometimes
let ghc optimize code better, since it can know a function is internal
to a module.
364 modules still to go, according to
git grep -E 'module [A-Za-z.]+ where'
Convert Utility.Url to return Either String so the error message can be
displated in the annex monad and so captured.
(When curl is used, its errors are still not caught.)
Used to work but was broken in version 7.20181031, specifically commit
5ab0f48ffb.
That this was not noticed over at least 1 daylight savings time zone
changes makes me wonder if the TSDelta stuff is still needed.
Perhaps the mtime on Windows no longer changes when the time zone is changed?
The only good thing about it is it does not require a major version bump
to improve the database. That will need to happen at some point though.
Potentially very very slow in a large repository.
Ugly use of raw sql.
gksu is no longer in debian, even stable
kdesu in debian is not installed in PATH any longer, though the executable
is still present under /usr/lib
pkexec is packagekit's replacement for those older commands.
That git fixed a memory leak that could cause an OOM during the upgrade.
Most git-annex builds have a new enough git already.
OSX git was upgraded with brew.
Linux i386ancient build's git was too old. Upgrading it to a fixed
git didn't work (due to the newer git not working with the old ssh,
https://bugs.chromium.org/p/git/issues/detail?id=7 )
Choices to deal with that were:
* Somehow make direct mode upgrade work with the old git, avoiding its
OOM problem. One way would be to switch the repo to indirect mode
first, and so upgrade to a repo with locked files. Not good when
the filesystem does not support symlinks.
* backport the OOM fix from git 2.22
(And do what about the version number so git-annex knows it's fixed?)
* backport openssh (and possibly more stuff)
* move the i386ancient build to at least Debian stretch (still backporting git)
But this will make it no longer work with some of the ancient kernels it
targets.
Of those, backporting the OOM fix seemed the best approach. Put "oomfix"
in the git version number to indicate it.
I have not automated building the git backport, so here's the patch I
used:
diff -ur orig/git-2.1.4/convert.c git-2.1.4/convert.c
--- orig/git-2.1.4/convert.c 2014-12-18 18:42:18.000000000 +0000
+++ git-2.1.4/convert.c 2019-08-29 20:05:04.371872338 +0100
@@ -404,7 +404,7 @@
if (start_async(&async))
return 0; /* error was already reported */
- if (strbuf_read(&nbuf, async.out, len) < 0) {
+ if (strbuf_read(&nbuf, async.out, 0) < 0) {
error("read from external filter %s failed", cmd);
ret = 0;
}
diff -ur orig/git-2.1.4/GIT-VERSION-GEN git-2.1.4/GIT-VERSION-GEN
--- orig/git-2.1.4/GIT-VERSION-GEN 2014-12-18 18:42:18.000000000 +0000
+++ git-2.1.4/GIT-VERSION-GEN 2019-08-29 20:06:39.132743228 +0100
@@ -1,7 +1,7 @@
#!/bin/sh
GVF=GIT-VERSION-FILE
-DEF_VER=v2.1.4
+DEF_VER=v2.1.4.oomfix
LF='
'
diff -ur orig/git-2.1.4/configure git-2.1.4/configure
--- orig/git-2.1.4/configure 2014-12-18 18:42:19.000000000 +0000
+++ git-2.1.4/configure 2019-08-29 20:27:45.896380015 +0100
@@ -580,8 +580,8 @@
# Identity of this package.
PACKAGE_NAME='git'
PACKAGE_TARNAME='git'
-PACKAGE_VERSION='2.1.4'
-PACKAGE_STRING='git 2.1.4'
+PACKAGE_VERSION='2.1.4.oomfix'
+PACKAGE_STRING='git 2.1.4.oomfix'
PACKAGE_BUGREPORT='git@vger.kernel.org'
PACKAGE_URL=''
diff -ur orig/git-2.1.4/version git-2.1.4/version
--- orig/git-2.1.4/version 2014-12-18 18:42:19.000000000 +0000
+++ git-2.1.4/version 2019-08-29 20:06:17.572545210 +0100
@@ -1 +1 @@
-2.1.4
+2.1.4.oomfix
I'm seeing the github lfs server request an upload of an object that has
already been uploaded to it before. Probably because they offload
storage to S3 and so skipped the overhead of checking for an unncessary
upload.