the encode' and decode' functions on Windows should not apply the
filesystem encoding, which does not work there. Instead, convert to and
from UTF-8.
Also, avoid exporting encodeW8 and decodeW8. Both use the filesystem
encoding, so won't work as expected on windows.
git-annex find is now RawFilePath end to end, no string conversions.
So is git-annex get when it does not need to get anything.
So this is a major milestone on optimisation.
Benchmarks indicate around 30% speedup in both commands.
Probably many other performance improvements. All or nearly all places
where a file is statted use RawFilePath now.
Only done on those calls to getFileStatus that had a RawFilePath, not a
FilePath. The others would probably be just as fast if converted to use
it with toRawFilePath, but I'm not 100% sure.
Note that genInodeCache' uses fromRawFilePath, but that value only gets
used on Windows, so on unix the thunk will never be evaluated.
File mode is octal not decimal. This broke in the conversion to
attoparsec.
(I've submitted the content of Utility.Attoparsec to the attoparsec
developers.)
Test suite passes 100% now.
Finally builds (oh the agoncy of making it build), but still very
unmergable, only Command.Find is included and lots of stuff is badly
hacked to make it compile.
Benchmarking vs master, this git-annex find is significantly faster!
Specifically:
num files old new speedup
48500 4.77 3.73 28%
12500 1.36 1.02 66%
20 0.075 0.074 0% (so startup time is unchanged)
That's without really finishing the optimization. Things still to do:
* Eliminate all the fromRawFilePath, toRawFilePath, encodeBS,
decodeBS conversions.
* Use versions of IO actions like getFileStatus that take a RawFilePath.
* Eliminate some Data.ByteString.Lazy.toStrict, which is a slow copy.
* Use ByteString for parsing git config to speed up startup.
It's likely several of those will speed up git-annex find further.
And other commands will certianly benefit even more.
Goal is to make git-annex faster by using ByteString for all the
worktree traversal. For now, this is focusing on Command.Find,
in order to benchmark how much it helps. (All other commands are
temporarily disabled)
Currently in a very bad unbuildable in-between state.
This will speed up the common case where a Key is deserialized from
disk, but is then serialized to build eg, the path to the annex object.
Previously attempted in 4536c93bb2
and reverted in 96aba8eff7.
The problems mentioned in the latter commit are addressed now:
Read/Show of KeyData is backwards-compatible with Read/Show of Key from before
this change, so Types.Distribution will keep working.
The Eq instance is fixed.
Also, Key has smart constructors, avoiding needing to remember to update
the cached serialization.
Used git-annex benchmark:
find is 7% faster
whereis is 3% faster
get when all files are already present is 5% faster
Generally, the benchmarks are running 0.1 seconds faster per 2000 files,
on a ram disk in my laptop.
Used to work but was broken in version 7.20181031, specifically commit
5ab0f48ffb.
That this was not noticed over at least 1 daylight savings time zone
changes makes me wonder if the TSDelta stuff is still needed.
Perhaps the mtime on Windows no longer changes when the time zone is changed?
(cherry picked from commit 09ee6b0ccb)
Eliminated some dead code. In other cases, exported a currently unused
function, since it was a logical part of the API.
Of course this improves the API documentation. It may also sometimes
let ghc optimize code better, since it can know a function is internal
to a module.
364 modules still to go, according to
git grep -E 'module [A-Za-z.]+ where'
Convert Utility.Url to return Either String so the error message can be
displated in the annex monad and so captured.
(When curl is used, its errors are still not caught.)
The only good thing about it is it does not require a major version bump
to improve the database. That will need to happen at some point though.
Potentially very very slow in a large repository.
Ugly use of raw sql.
gksu is no longer in debian, even stable
kdesu in debian is not installed in PATH any longer, though the executable
is still present under /usr/lib
pkexec is packagekit's replacement for those older commands.
That git fixed a memory leak that could cause an OOM during the upgrade.
Most git-annex builds have a new enough git already.
OSX git was upgraded with brew.
Linux i386ancient build's git was too old. Upgrading it to a fixed
git didn't work (due to the newer git not working with the old ssh,
https://bugs.chromium.org/p/git/issues/detail?id=7 )
Choices to deal with that were:
* Somehow make direct mode upgrade work with the old git, avoiding its
OOM problem. One way would be to switch the repo to indirect mode
first, and so upgrade to a repo with locked files. Not good when
the filesystem does not support symlinks.
* backport the OOM fix from git 2.22
(And do what about the version number so git-annex knows it's fixed?)
* backport openssh (and possibly more stuff)
* move the i386ancient build to at least Debian stretch (still backporting git)
But this will make it no longer work with some of the ancient kernels it
targets.
Of those, backporting the OOM fix seemed the best approach. Put "oomfix"
in the git version number to indicate it.
I have not automated building the git backport, so here's the patch I
used:
diff -ur orig/git-2.1.4/convert.c git-2.1.4/convert.c
--- orig/git-2.1.4/convert.c 2014-12-18 18:42:18.000000000 +0000
+++ git-2.1.4/convert.c 2019-08-29 20:05:04.371872338 +0100
@@ -404,7 +404,7 @@
if (start_async(&async))
return 0; /* error was already reported */
- if (strbuf_read(&nbuf, async.out, len) < 0) {
+ if (strbuf_read(&nbuf, async.out, 0) < 0) {
error("read from external filter %s failed", cmd);
ret = 0;
}
diff -ur orig/git-2.1.4/GIT-VERSION-GEN git-2.1.4/GIT-VERSION-GEN
--- orig/git-2.1.4/GIT-VERSION-GEN 2014-12-18 18:42:18.000000000 +0000
+++ git-2.1.4/GIT-VERSION-GEN 2019-08-29 20:06:39.132743228 +0100
@@ -1,7 +1,7 @@
#!/bin/sh
GVF=GIT-VERSION-FILE
-DEF_VER=v2.1.4
+DEF_VER=v2.1.4.oomfix
LF='
'
diff -ur orig/git-2.1.4/configure git-2.1.4/configure
--- orig/git-2.1.4/configure 2014-12-18 18:42:19.000000000 +0000
+++ git-2.1.4/configure 2019-08-29 20:27:45.896380015 +0100
@@ -580,8 +580,8 @@
# Identity of this package.
PACKAGE_NAME='git'
PACKAGE_TARNAME='git'
-PACKAGE_VERSION='2.1.4'
-PACKAGE_STRING='git 2.1.4'
+PACKAGE_VERSION='2.1.4.oomfix'
+PACKAGE_STRING='git 2.1.4.oomfix'
PACKAGE_BUGREPORT='git@vger.kernel.org'
PACKAGE_URL=''
diff -ur orig/git-2.1.4/version git-2.1.4/version
--- orig/git-2.1.4/version 2014-12-18 18:42:19.000000000 +0000
+++ git-2.1.4/version 2019-08-29 20:06:17.572545210 +0100
@@ -1 +1 @@
-2.1.4
+2.1.4.oomfix
I'm seeing the github lfs server request an upload of an object that has
already been uploaded to it before. Probably because they offload
storage to S3 and so skipped the overhead of checking for an unncessary
upload.
Let's keep this entirely pure.
git-annex has its own facilities for running a ssh command, that make it
respect various config settings, and cache connections, etc. So better
not to have the library run ssh itself.
In 40ecf58d4b I changed the license of code I
wrote from GPL to AGPL. But, two files containing code I wrote combined
with code by others were updated to say their license is AGPL, while in
fact part of it was (the code I wrote) but part remained under the original
license (the code written by others).
Remote/Ddar.hs is now changed entirely back to GPL 3.
Annex/DirHashes.hs stays AGPL, but I broke out Utility/MD5.hs with the code
not written by me, and corrected its license statement to GPL-2, which
is the actual version of the GPL included with the code in its original
distribution at http://www.cs.ox.ac.uk/people/ian.lynagh/md5/
I made some improvements to its API after splitting it out of git-annex,
so merge those back in.
This is groundwork for removing the embedded copy of it and depending on
it.
Also moved the managerResponseTimeout disabling to Annex.Url as it's
git-annex specific.
This commit was sponsored by Ethan Aubin on Patreon.
Avoid statting file, just try to remove it.
Also a comment to explain why it tries to remove it, which was puzzling
me when I revisited this code until I saw that cp fails to overwrite a
mode 444 file, including perhaps one left by a previous interrupted cp.
This commit was sponsored by Fernando Jimenez on Patreon.
Improved probing when CoW copies can be made between files on the same
drive. Now supports CoW between BTRFS subvolumes. And, falls back to rsync
instead of using cp when CoW won't work, eg copies between repos on the
same EXT4 filesystem.
Rather than trying cp --reflink=always for each file copied to a remote,
it's tried once and if it fails it falls back to using rsync thereafter
for the lifetime of the Remote object. That avoids overhead of calling cp
which while small, will add up over a large number of files.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.