Remove dup definitions and just use the RawFilePath one. </> etc are
enough faster that it's probably faster than building a String directly,
although I have not benchmarked.
Adds a dependency on filepath-bytestring, an as yet unreleased fork of
filepath that operates on RawFilePath.
Git.Repo also changed to use RawFilePath for the path to the repo.
This does eliminate some RawFilePath -> FilePath -> RawFilePath
conversions. And filepath-bytestring's </> is probably faster.
But I don't expect a major performance improvement from this.
This is mostly groundwork for making Annex.Location use RawFilePath,
which will allow for a conversion-free pipleline.
This will speed up the common case where a Key is deserialized from
disk, but is then serialized to build eg, the path to the annex object.
Previously attempted in 4536c93bb2
and reverted in 96aba8eff7.
The problems mentioned in the latter commit are addressed now:
Read/Show of KeyData is backwards-compatible with Read/Show of Key from before
this change, so Types.Distribution will keep working.
The Eq instance is fixed.
Also, Key has smart constructors, avoiding needing to remember to update
the cached serialization.
Used git-annex benchmark:
find is 7% faster
whereis is 3% faster
get when all files are already present is 5% faster
Generally, the benchmarks are running 0.1 seconds faster per 2000 files,
on a ram disk in my laptop.
No longer used. The only possible user of it would be code in
Upgrade.V5, so I verified that the parts of Annex.Content it used were
not used to manipulate direct mode files.
When file matching options are specified when getting info of
something other than a directory, they won't have any effect, so error out
to avoid confusion.
This commit was sponsored by mo on Patreon.
Drop support for building with ghc older than 8.4.4, and with older
versions of serveral haskell libraries than will be included in Debian 10.
The only remaining version ifdefs in the entire code base are now a couple
for aws!
This commit should only be merged after the Debian 10 release.
And perhaps it will need to wait longer than that; it would make
backporting new versions of git-annex to Debian 9 (stretch) which
has been actively happening as recently as this year.
This commit was sponsored by Ilya Shlyakhter.
This does not change the overall license of the git-annex program, which
was already AGPL due to a number of sources files being AGPL already.
Legally speaking, I'm adding a new license under which these files are
now available; I already released their current contents under the GPL
license. Now they're dual licensed GPL and AGPL. However, I intend
for all my future changes to these files to only be released under the
AGPL license, and I won't be tracking the dual licensing status, so I'm
simply changing the license statement to say it's AGPL.
(In some cases, others wrote parts of the code of a file and released it
under the GPL; but in all cases I have contributed a significant portion
of the code in each file and it's that code that is getting the AGPL
license; the GPL license of other contributors allows combining with
AGPL code.)
Added graftTree but it's buggy.
Should use graftTree in Annex.Branch.graftTreeish; it will be faster
than the current implementation there.
Started Annex.Import, but untested and it doesn't yet handle tree
grafting.
What these generate is not really suitable to be used as a filename,
which is why keyFile and fileKey further escape it. These are just
serializing Keys.
Also removed a quickcheck test that was very unlikely to test anything
useful, since it relied on random chance creating something that looks
like a serialized key. The other test is sufficient for testing what
that was intended to test anyway.
This should make == comparison of UUIDs somewhat faster, and perhaps a
few other operations around maps of UUIDs etc.
FromUUID/ToUUID are used to convert String, which is still used for all
IO of UUIDs. Eventually the hope is those instances can be removed,
and all git-annex branch log files etc use ByteString throughout, for a
real speed improvement.
Note the use of fromRawFilePath / toRawFilePath -- while a UUID usually
contains only alphanumerics and so could be treated as ascii, it's
conceivable that some git-annex repository has been initialized using
a UUID that is not only not a canonical UUID, but contains high unicode
or invalid unicode. Using the filesystem encoding avoids any problems
with such a thing. However, a NUL in a UUID seems extremely unlikely,
so I didn't use encodeBS / decodeBS to avoid their extra overhead in
handling NULs.
The Read/Show instance for UUID luckily serializes the same way for
ByteString as it did for String.
Running git-annex linux builds in termux seems to work well enough that the
only reason to keep the Android app would be to support Android 4-5, which
the old Android app supported, and which I don't know if the termux method
works on (although I see no reason why it would not).
According to [1], Android 4-5 remains on around 29% of devices, down from
51% one year ago.
[1] https://www.statista.com/statistics/271774/share-of-android-platforms-on-mobile-devices-with-android-os/
This is a rather large commit, but mostly very straightfoward removal of
android ifdefs and patches and associated cruft.
Also, removed support for building with very old ghc < 8.0.1, and with
yesod < 1.4.3, and without concurrent-output, which were only being used
by the cross build.
Some documentation specific to the Android app (screenshots etc) needs
to be updated still.
This commit was sponsored by Brett Eisenberg on Patreon.
This is groundwork for nested seek loops, eg seeking over all files and
then performing commandActions on a list of remotes, which can be done
concurrently.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
Added -z option to git-annex commands that use --batch, useful for
supporting filenames containing newlines.
It only controls input to --batch, the output will still be line delimited
unless --json or etc is used to get some other output. While git often
makes -z affect both input and output, I don't like trying them together,
and making it affect output would have been a significant complication,
and also git-annex output is generally not intended to be machine parsed,
unless using --json or a format option.
Commands that take pairs like "file key" still separate them with a space
in --batch mode. All such commands take care to support filenames with
spaces when parsing that, so there was no need to change it, and it would
have needed significant changes to the batch machinery to separate tose
with a null.
To make fromkey and registerurl support -z, I had to give them a --batch
option. The implicit batch mode they enter when not provided with input
parameters does not support -z as that would have complicated option
parsing. Seemed better to move these toward using the same --batch as
everything else, though the implicit batch mode can still be used.
This commit was sponsored by Ole-Morten Duesund on Patreon.
https://prime.haskell.org/wiki/Libraries/Proposals/SemigroupMonoid
I am not happy with the fragile pile of CPP boilerplate required to support
ghc back to 7.0, which git-annex still targets for both the android build
and the standalone build targeting old linux kernels. It makes me unlikely
to want to use Semigroup more in git-annex, because the benefit of the
abstraction is swamped by the ugliness. I actually considered ripping out
all the Semigroup instances, but some are needed to use
optparse-applicative.
The problem, I think, is they made this transaction on too fast a timeline.
(Although ironically, work on it started in 2015 or earlier!)
In particular, Debian oldstable is not out of security support, and it's
not possible to follow the simpler workarounds documented on the wiki and
have it build on oldstable (because the semigroups package in it is too
old).
I have only tested this build with ghc 8.2.2, not the newer and older
versions that branches of the CPP support. So there could be typoes, we'll
see.
This commit was sponsored by Brock Spratlen on Patreon.
This leaves git annex unused --from remote still using loggedKeysFor
and buffering more than ought to be necessary, but I can't see a way to
improve that.
As long as all code imports Utility.Aeson rather than Data.Aeson,
and no Strings that may contain utf-8 characters are used for eg, object
keys via T.pack, this is guaranteed to fix the problem everywhere that
git-annex generates json.
It's kind of annoying to need to wrap ToJSON with a ToJSON', especially
since every data type that has a ToJSON instance has to be ported over.
However, that only took 50 lines of code, which is worth it to ensure full
coverage. I initially tried an alternative approach of a newtype FileEncoded,
which had to be used everywhere a String was fed into aeson, and chasing
down all the sites would have been far too hard. Did consider creating an
intentionally overlapping instance ToJSON String, and letting ghc fail
to build anything that passed in a String, but am not sure that wouldn't
pollute some library that git-annex depends on that happens to use ToJSON
String internally.
This commit was supported by the NSF-funded DataLad project.
Compare these...
numcopies stats:
numcopies -1: 1986
numcopies +0: 1170
numcopies -2: 769
numcopies +1: 716
numcopies -4: 696
numcopies -3: 485
numcopies -6: 230
numcopies -5: 111
numcopies -7: 91
numcopies -9: 9
numcopies stats:
numcopies +1: 716
numcopies +0: 1170
numcopies -1: 1986
numcopies -2: 769
numcopies -3: 485
numcopies -4: 696
numcopies -5: 111
numcopies -6: 230
numcopies -7: 91
numcopies -9: 9
I feel that the former is a jumbled mess that doesn't tell much overall,
while the second shows pretty clearly that most files are within 1 degree
of the desired number of copies, with some outliers without enough.
Added --json-error-messages option, which includes error messages in the
json output, rather than outputting them to stderr.
The actual rediretion of errors is not implemented yet, this is only
the docs and option plumbing.
This commit was supported by the NSF-funded DataLad project.
9c4650358c changed the Read instance for
Key.
I've checked all uses of that instance (by removing it and seeing what
breaks), and they're all limited to the webapp, except one.
That is GitAnnexDistribution's Read instance.
So, 9c4650358c would have broken upgrades
of git-annex from downloads.kitenet.net. Once the .info files there got
updated for a new release, old releases would have failed to parse them
and never upgraded.
To fix this, I found a way to make the .info files that contain
GitAnnexDistribution values be readable by the old version of git-annex.
This commit was sponsored by Ewen McNeill.
Where before the "name" of a key and a backend was a string, this makes
it a concrete data type.
This is groundwork for allowing some varieties of keys to be disabled
in file2key, so git-annex won't use them at all.
Benchmarks ran in my big repo:
old git-annex info:
real 0m3.338s
user 0m3.124s
sys 0m0.244s
new git-annex info:
real 0m3.216s
user 0m3.024s
sys 0m0.220s
new git-annex find:
real 0m7.138s
user 0m6.924s
sys 0m0.252s
old git-annex find:
real 0m7.433s
user 0m7.240s
sys 0m0.232s
Surprising result; I'd have expected it to be slower since it now parses
all the key varieties. But, the parser is very simple and perhaps
sharing KeyVarieties uses less memory or something like that.
This commit was supported by the NSF-funded DataLad project.
Note that get --from foo --failed will get things that a previous get --from bar
tried and failed to get, etc. I considered making --failed only retry
transfers from the same remote, but it was easier, and seems more useful,
to not have the same remote requirement.
Noisy due to some refactoring into Types/
Keeping Text.JSON use for now, because it seems a better fit for most of
the commands, which don't use very structured JSON objects, but just output
whatever fields suites them. But this lets Aeson be used when a more
structured data type is available to serialize to JSON.
This is a work in progress. It compiles and is able to do basic command
dispatch, including git autocorrection, while using optparse-applicative
for the core commandline parsing.
* Many commands are temporarily disabled before conversion.
* Options are not wired in yet.
* cmdnorepo actions don't work yet.
Also, removed the [Command] list, which was only used in one place.
This makes git annex unused use around 48 mb more memory than it did before,
but the massive increase in accuracy makes this worthwhile for all but the
smallest systems.
Also, I want to use the bloom filter for sync --all --content, to avoid
dropping files that the preferred content doesn't want, and 1/1000
false positives would be far too many in that use case, even if it were
acceptable for unused.
Actual memory use numbers:
1000: 21.06user 3.42system 0:26.40elapsed 92%CPU (0avgtext+0avgdata 501552maxresident)k
1000000: 21.41user 3.55system 0:26.84elapsed 93%CPU (0avgtext+0avgdata 549496maxresident)k
10000000: 21.84user 3.52system 0:27.89elapsed 90%CPU (0avgtext+0avgdata 549920maxresident)k
Based on these numbers, 10 million seemed a better pick than 1 million.
This is a nearly free feature; it piggybacks on the location log lookups
done for the numcopies stats. So, the only extra overhead is updating
the map of repository sizes.
However, I had to switch to Data.Map.Strict, which needs containers 0.5.
If backporting to wheezy, will probably need to revert this commit.
Avoid using fileSize which maxes out at just 2 gb on Windows.
Instead, use hFileSize, which doesn't have a bounded size.
Fixes support for files > 2 gb on Windows.
Note that the InodeCache code only needs to compare a file size,
so it doesn't matter it the file size wraps. So it has been
left as-is. This was necessary both to avoid invalidating existing inode
caches, and because the code passed FileStatus around and would have become
more expensive if it called getFileSize.
This commit was sponsored by Christian Dietrich.
* info: Can now display info about a given uuid.
* Added to remote/uuid info: Count of the number of keys present
on the remote, and their size. This is rather expensive to calculate,
so comes last and --fast will disable it.
* Git remote info now includes the date of the last sync with the remote.
Now `git annex info $remote` shows info specific to the type of the remote,
for example, it shows the rsync url.
Remote types that support encryption or chunking also include that in their
info.
This commit was sponsored by Ævar Arnfjörð Bjarmason.
This fixes all instances of " \t" in the code base. Most common case
seems to be after a "where" line; probably vim copied the two space layout
of that line.
Done as a background task while listening to episode 2 of the Type Theory
podcast.
unused: In direct mode, files that are deleted from the work tree are no longer incorrectly detected as unused.
Direct mode `git annex info` slows down a bit due to more stringent
checking, but not by a lot.
This allows eg, putting .git/annex/tmp on a ram disk, if the disk IO
of temp object files is too annoying (and if you don't want to keep
partially transferred objects across reboots).
.git/annex/misctmp must be on the same filesystem as the git work tree,
since files are moved to there in a way that will not work cross-device,
as well as symlinked into there.
I first wanted to put the tmp objects in .git/annex/objects/tmp, but
that would pose transition problems on upgrade when partially transferred
objects existed.
git annex info does not currently show the size of .git/annex/misctemp,
since it should stay small. It would also be ok to make something clean it
out, periodically.
I've been disliking how the command seek actions were written for some
time, with their inversion of control and ugly workarounds.
The last straw to fix it was sync --content, which didn't fit the
Annex [CommandStart] interface well at all. I have not yet made it take
advantage of the changed interface though.
The crucial change, and probably why I didn't do it this way from the
beginning, is to make each CommandStart action be run with exceptions
caught, and if it fails, increment a failure counter in annex state.
So I finally remove the very first code I wrote for git-annex, which
was before I had exception handling in the Annex monad, and so ran outside
that monad, passing state explicitly as it ran each CommandStart action.
This was a real slog from 1 to 5 am.
Test suite passes.
Memory usage is lower than before, sometimes by a couple of megabytes, and
remains constant, even when running in a large repo, and even when
repeatedly failing and incrementing the error counter. So no accidental
laziness space leaks.
Wall clock speed is identical, even in large repos.
This commit was sponsored by an anonymous bitcoiner.