The 'fail' method has been moved to the 'MonadFail' class. I made the changes
so that the code still compiles with previous versions of 'base' that don't
have the new MonadFail class exported by Prelude yet.
If git-credential has it cached and does not prompt, this will
unfortunately result in a brief flicker, as the displayed console
regions are hidden while running it and then re-displayed. Better than a
corrupted display.
Actually, I tried it and don't see a visible flicker, so probably only
over a slow ssh will it be apparent.
using git credential to get the password
One thing this doesn't do is wrap the password prompting inside the prompt
action. So with -J, the output can be a bit garbled.
prop_encode_decode_roundtrip failed on "\175" in C locale.
This may be a new problem after the switch to RawFilePath, but it
already had filtering for high chars, so changed to only test ascii
chars.
eg, `git-annex get . ..` used to order the files strangly, because it
did not realize that when git ls-files output eg "foo", that should be
grouped with the first set of files and not the second set.
Fixed by making dirContains "." "./foo" = True
which makes sense, because dirContains ".." "../foo" = True
the encode' and decode' functions on Windows should not apply the
filesystem encoding, which does not work there. Instead, convert to and
from UTF-8.
Also, avoid exporting encodeW8 and decodeW8. Both use the filesystem
encoding, so won't work as expected on windows.
git-annex find is now RawFilePath end to end, no string conversions.
So is git-annex get when it does not need to get anything.
So this is a major milestone on optimisation.
Benchmarks indicate around 30% speedup in both commands.
Probably many other performance improvements. All or nearly all places
where a file is statted use RawFilePath now.
Since the sqlite branch uses blobs extensively, there are some
performance benefits, ByteStrings now get stored and retrieved w/o
conversion in some cases like in Database.Export.
Only done on those calls to getFileStatus that had a RawFilePath, not a
FilePath. The others would probably be just as fast if converted to use
it with toRawFilePath, but I'm not 100% sure.
Note that genInodeCache' uses fromRawFilePath, but that value only gets
used on Windows, so on unix the thunk will never be evaluated.
File mode is octal not decimal. This broke in the conversion to
attoparsec.
(I've submitted the content of Utility.Attoparsec to the attoparsec
developers.)
Test suite passes 100% now.
Finally builds (oh the agoncy of making it build), but still very
unmergable, only Command.Find is included and lots of stuff is badly
hacked to make it compile.
Benchmarking vs master, this git-annex find is significantly faster!
Specifically:
num files old new speedup
48500 4.77 3.73 28%
12500 1.36 1.02 66%
20 0.075 0.074 0% (so startup time is unchanged)
That's without really finishing the optimization. Things still to do:
* Eliminate all the fromRawFilePath, toRawFilePath, encodeBS,
decodeBS conversions.
* Use versions of IO actions like getFileStatus that take a RawFilePath.
* Eliminate some Data.ByteString.Lazy.toStrict, which is a slow copy.
* Use ByteString for parsing git config to speed up startup.
It's likely several of those will speed up git-annex find further.
And other commands will certianly benefit even more.
Goal is to make git-annex faster by using ByteString for all the
worktree traversal. For now, this is focusing on Command.Find,
in order to benchmark how much it helps. (All other commands are
temporarily disabled)
Currently in a very bad unbuildable in-between state.
This will speed up the common case where a Key is deserialized from
disk, but is then serialized to build eg, the path to the annex object.
Previously attempted in 4536c93bb2
and reverted in 96aba8eff7.
The problems mentioned in the latter commit are addressed now:
Read/Show of KeyData is backwards-compatible with Read/Show of Key from before
this change, so Types.Distribution will keep working.
The Eq instance is fixed.
Also, Key has smart constructors, avoiding needing to remember to update
the cached serialization.
Used git-annex benchmark:
find is 7% faster
whereis is 3% faster
get when all files are already present is 5% faster
Generally, the benchmarks are running 0.1 seconds faster per 2000 files,
on a ram disk in my laptop.
Used to work but was broken in version 7.20181031, specifically commit
5ab0f48ffb.
That this was not noticed over at least 1 daylight savings time zone
changes makes me wonder if the TSDelta stuff is still needed.
Perhaps the mtime on Windows no longer changes when the time zone is changed?
(cherry picked from commit 09ee6b0ccb)
Eliminated some dead code. In other cases, exported a currently unused
function, since it was a logical part of the API.
Of course this improves the API documentation. It may also sometimes
let ghc optimize code better, since it can know a function is internal
to a module.
364 modules still to go, according to
git grep -E 'module [A-Za-z.]+ where'
Convert Utility.Url to return Either String so the error message can be
displated in the annex monad and so captured.
(When curl is used, its errors are still not caught.)
Used to work but was broken in version 7.20181031, specifically commit
5ab0f48ffb.
That this was not noticed over at least 1 daylight savings time zone
changes makes me wonder if the TSDelta stuff is still needed.
Perhaps the mtime on Windows no longer changes when the time zone is changed?
The only good thing about it is it does not require a major version bump
to improve the database. That will need to happen at some point though.
Potentially very very slow in a large repository.
Ugly use of raw sql.
gksu is no longer in debian, even stable
kdesu in debian is not installed in PATH any longer, though the executable
is still present under /usr/lib
pkexec is packagekit's replacement for those older commands.
That git fixed a memory leak that could cause an OOM during the upgrade.
Most git-annex builds have a new enough git already.
OSX git was upgraded with brew.
Linux i386ancient build's git was too old. Upgrading it to a fixed
git didn't work (due to the newer git not working with the old ssh,
https://bugs.chromium.org/p/git/issues/detail?id=7 )
Choices to deal with that were:
* Somehow make direct mode upgrade work with the old git, avoiding its
OOM problem. One way would be to switch the repo to indirect mode
first, and so upgrade to a repo with locked files. Not good when
the filesystem does not support symlinks.
* backport the OOM fix from git 2.22
(And do what about the version number so git-annex knows it's fixed?)
* backport openssh (and possibly more stuff)
* move the i386ancient build to at least Debian stretch (still backporting git)
But this will make it no longer work with some of the ancient kernels it
targets.
Of those, backporting the OOM fix seemed the best approach. Put "oomfix"
in the git version number to indicate it.
I have not automated building the git backport, so here's the patch I
used:
diff -ur orig/git-2.1.4/convert.c git-2.1.4/convert.c
--- orig/git-2.1.4/convert.c 2014-12-18 18:42:18.000000000 +0000
+++ git-2.1.4/convert.c 2019-08-29 20:05:04.371872338 +0100
@@ -404,7 +404,7 @@
if (start_async(&async))
return 0; /* error was already reported */
- if (strbuf_read(&nbuf, async.out, len) < 0) {
+ if (strbuf_read(&nbuf, async.out, 0) < 0) {
error("read from external filter %s failed", cmd);
ret = 0;
}
diff -ur orig/git-2.1.4/GIT-VERSION-GEN git-2.1.4/GIT-VERSION-GEN
--- orig/git-2.1.4/GIT-VERSION-GEN 2014-12-18 18:42:18.000000000 +0000
+++ git-2.1.4/GIT-VERSION-GEN 2019-08-29 20:06:39.132743228 +0100
@@ -1,7 +1,7 @@
#!/bin/sh
GVF=GIT-VERSION-FILE
-DEF_VER=v2.1.4
+DEF_VER=v2.1.4.oomfix
LF='
'
diff -ur orig/git-2.1.4/configure git-2.1.4/configure
--- orig/git-2.1.4/configure 2014-12-18 18:42:19.000000000 +0000
+++ git-2.1.4/configure 2019-08-29 20:27:45.896380015 +0100
@@ -580,8 +580,8 @@
# Identity of this package.
PACKAGE_NAME='git'
PACKAGE_TARNAME='git'
-PACKAGE_VERSION='2.1.4'
-PACKAGE_STRING='git 2.1.4'
+PACKAGE_VERSION='2.1.4.oomfix'
+PACKAGE_STRING='git 2.1.4.oomfix'
PACKAGE_BUGREPORT='git@vger.kernel.org'
PACKAGE_URL=''
diff -ur orig/git-2.1.4/version git-2.1.4/version
--- orig/git-2.1.4/version 2014-12-18 18:42:19.000000000 +0000
+++ git-2.1.4/version 2019-08-29 20:06:17.572545210 +0100
@@ -1 +1 @@
-2.1.4
+2.1.4.oomfix
I'm seeing the github lfs server request an upload of an object that has
already been uploaded to it before. Probably because they offload
storage to S3 and so skipped the overhead of checking for an unncessary
upload.
Let's keep this entirely pure.
git-annex has its own facilities for running a ssh command, that make it
respect various config settings, and cache connections, etc. So better
not to have the library run ssh itself.
In 40ecf58d4b I changed the license of code I
wrote from GPL to AGPL. But, two files containing code I wrote combined
with code by others were updated to say their license is AGPL, while in
fact part of it was (the code I wrote) but part remained under the original
license (the code written by others).
Remote/Ddar.hs is now changed entirely back to GPL 3.
Annex/DirHashes.hs stays AGPL, but I broke out Utility/MD5.hs with the code
not written by me, and corrected its license statement to GPL-2, which
is the actual version of the GPL included with the code in its original
distribution at http://www.cs.ox.ac.uk/people/ian.lynagh/md5/
I made some improvements to its API after splitting it out of git-annex,
so merge those back in.
This is groundwork for removing the embedded copy of it and depending on
it.
Also moved the managerResponseTimeout disabling to Annex.Url as it's
git-annex specific.
This commit was sponsored by Ethan Aubin on Patreon.
Avoid statting file, just try to remove it.
Also a comment to explain why it tries to remove it, which was puzzling
me when I revisited this code until I saw that cp fails to overwrite a
mode 444 file, including perhaps one left by a previous interrupted cp.
This commit was sponsored by Fernando Jimenez on Patreon.
Improved probing when CoW copies can be made between files on the same
drive. Now supports CoW between BTRFS subvolumes. And, falls back to rsync
instead of using cp when CoW won't work, eg copies between repos on the
same EXT4 filesystem.
Rather than trying cp --reflink=always for each file copied to a remote,
it's tried once and if it fails it falls back to using rsync thereafter
for the lifetime of the Remote object. That avoids overhead of calling cp
which while small, will add up over a large number of files.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
using a blake2 variant optimised for 4-way CPUs
This had been deferred because the Debian package of cryptonite, and
possibly other builds, was broken for blake2bp, but I've confirmed #892855
is fixed.
This commit was sponsored by Brett Eisenberg on Patreon.
Drop support for building with ghc older than 8.4.4, and with older
versions of serveral haskell libraries than will be included in Debian 10.
The only remaining version ifdefs in the entire code base are now a couple
for aws!
This commit should only be merged after the Debian 10 release.
And perhaps it will need to wait longer than that; it would make
backporting new versions of git-annex to Debian 9 (stretch) which
has been actively happening as recently as this year.
This commit was sponsored by Ilya Shlyakhter.
When downloading an url and the destination file exists but is empty,
avoid using http range to resume, since a range "bytes=0-" is an unusual
edge case that it's best to avoid relying on working.
This is known to fix a case where importfeed downloaded a partial feed from
such a server. Since importfeed uses withTmpFile, the destination always exists
empty, so it would particularly tickle such problem servers. Resuming from 0
is otherwise possible, but unlikely.
Add back support for ftp urls, which was disabled as part of the fix for
security hole CVE-2018-10857 (except for configurations which enabled curl
and bypassed public IP address restrictions). Now it will work if allowed
by annex.security.allowed-ip-addresses.
The NonEmpty instance was moved out of QuickCheck and into a package
with more deps than I want to drag in, so I'm providing my own instance,
but with older QuickCheck, use theirs to avoid overlapping.
This does not change the overall license of the git-annex program, which
was already AGPL due to a number of sources files being AGPL already.
Legally speaking, I'm adding a new license under which these files are
now available; I already released their current contents under the GPL
license. Now they're dual licensed GPL and AGPL. However, I intend
for all my future changes to these files to only be released under the
AGPL license, and I won't be tracking the dual licensing status, so I'm
simply changing the license statement to say it's AGPL.
(In some cases, others wrote parts of the code of a file and released it
under the GPL; but in all cases I have contributed a significant portion
of the code in each file and it's that code that is getting the AGPL
license; the GPL license of other contributors allows combining with
AGPL code.)
An empty list of [ContentIdenfier] serialized to the same thing
as a single ContentIdentifier "". Avoid this ambiguity by requiring the
list be non-empty.
It got removed from network-3.0.0.0 and nothing in the haskell ecosystem
currently provides it (which seems it ought to be fixed).
Tested new code on both little-endian and big-endian with:
ghci> hostAddressToTuple $ fromJust $ embeddedIpv4 (0,0,0,0,0,0xffff,0x7f00,1)
(127,0,0,1)
The gitAnnexTmpOtherDir cleanup made it be deleted too early sometimes,
and so the test suite failed. Also there was a report of a similar
failure which likely had a similar cause and hopwfully this fixes that
too.
This reverts commit 4536c93bb2.
That broke Read/Show of a Key, and unfortunately Key is read in at least
one place; the GitAnnexDistribution data type.
It would be worth bringing this optimisation back, but it would need
either a custom Read/Show instance that preserves back-compat, or
wrapping Key in a data type that contains the serialization, or changing
how GitAnnexDistribution is serialized.
Also, the Eq instance would need to compare keys with and without a
cached seralization the same.
The builder produces a lazy ByteString, and L.toStrict has to copy it,
but needing to use the builder is no longer to common case; the
serialization will normally be cached already as a strict ByteString,
and this avoids keyFile' needing to use L.toStrict . serializeKey'
Now there's a ByteString used all the way from disk to Key.
The main complication in this conversion was the use of fromInternalGitPath
in several places to munge things on Windows. The things that used that
were changed to parse the ByteString using either path separator.
Also some code that had read from files to a String lazily was changed
to read a minimal strict ByteString.
MetaField was already limited to alphanumerics, so it makes sense to use
Text for it.
Note that technically a UUID can contain invalid UTF-8, and so
remoteMetaDataPrefix's use of T.pack . fromUUID could replace non-UTF8
values with '?' or whatever. In practice, a UUID is usually also text,
I only kept open the possibility of it containing invalid UTF-8 to avoid
breaking parsing of strange UUIDs in git-annex branch files. So, I
decided to let this edge case slip by.
Have not updated the rest of the code base yet for this change, as the
change took 2.5 hours longer than I expected to get working properly.
This is not as efficient as using ByteStrings throughout, but converting
the String to ByteString is actually significantly faster than the old
parser.
benchmarking parse/old
time 9.657 μs (9.600 μs .. 9.732 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 9.703 μs (9.645 μs .. 9.785 μs)
std dev 231.6 ns (161.5 ns .. 323.7 ns)
variance introduced by outliers: 25% (moderately inflated)
benchmarking parse/new
time 834.6 ns (797.1 ns .. 886.9 ns)
0.987 R² (0.976 R² .. 0.999 R²)
mean 816.4 ns (802.7 ns .. 845.1 ns)
std dev 62.39 ns (37.66 ns .. 108.4 ns)
variance introduced by outliers: 82% (severely inflated)
There is a small behavior change from the old parsePOSIXTime,
which accepted any amount of trailing whitespace after the timestamp.
That behavior was not documented, and it doesn't seem anything relied on it.