Finally builds (oh the agoncy of making it build), but still very
unmergable, only Command.Find is included and lots of stuff is badly
hacked to make it compile.
Benchmarking vs master, this git-annex find is significantly faster!
Specifically:
num files old new speedup
48500 4.77 3.73 28%
12500 1.36 1.02 66%
20 0.075 0.074 0% (so startup time is unchanged)
That's without really finishing the optimization. Things still to do:
* Eliminate all the fromRawFilePath, toRawFilePath, encodeBS,
decodeBS conversions.
* Use versions of IO actions like getFileStatus that take a RawFilePath.
* Eliminate some Data.ByteString.Lazy.toStrict, which is a slow copy.
* Use ByteString for parsing git config to speed up startup.
It's likely several of those will speed up git-annex find further.
And other commands will certianly benefit even more.
Reversion from commit 436f10771, CustomOutput was forcing quiet output
which overrode the json setting.
find happened to be the only command that uses CustomOutput and also
outputs json. (metadata --get does also use CustomOutput and --json does not
enable json output for that, which may be an oversight, but was already the
behavior before this regression.)
Fixes the last wart in the StartMessage transition. A few commands
include other CommandStart actions that generate output, and
do not themselves need to display a start/end message.
Oh joyous day, this is probably git-annex's oldest implementation wart,
source of much unncessary bother.
Now that we have a StartMessage, showEndResult' can look at it to know
if it needs to display an end message or not.
This is also going to be faster, because it avoids an uncessary state
lookup for each file processed.
The hoped for optimisation of CommandStart with -J did not materialize.
In fact, not runnign CommandStart in parallel is slower than -J3.
So, CommandStart are still run in parallel.
(The actual bad performance I've been seeing with -J in my big repo
has to do with building the remoteList.)
But, this is still progress toward making -J faster, because it gets rid
of the onlyActionOn roadblock in the way of making CommandCleanup jobs
run separate from CommandPerform jobs.
Added OnlyActionOn constructor for ActionItem which fixes the
onlyActionOn breakage in the last commit.
Made CustomOutput include an ActionItem, so even things using it can
specify OnlyActionOn.
In Command.Move and Command.Sync, there were CommandStarts that used
includeCommandAction, so output messages, which is no longer allowed.
Fixed by using startingCustomOutput, but that's still not quite right,
since it prevents message display for the includeCommandAction run
inside it too.
The goal is to be able to run CommandStart in the main thread when -J is
used, rather than unncessarily passing it off to a worker thread, which
incurs overhead that is signficant when the CommandStart is going to
quickly decide to stop.
To do that, the message it displays needs to be displayed in the worker
thread, after the CommandStart has run.
Also, the change will mean that CommandStart will no longer necessarily
run with the same Annex state as CommandPerform. While its docs already
said it should avoid modifying Annex state, I audited all the
CommandStart code as part of the conversion. (Note that CommandSeek
already sometimes runs with a different Annex state, and that has not been
a source of any problems, so I am not too worried that this change will
lead to breakage going forward.)
The only modification of Annex state I found was it calling
allowMessages in some Commands that default to noMessages. Dealt with
that by adding a startCustomOutput and a startingUsualMessages.
This lets a command start with noMessages and then select the output it
wants for each CommandStart.
One bit of breakage: onlyActionOn has been removed from commands that used it.
The plan is that, since a StartMessage contains an ActionItem,
when a Key can be extracted from that, the parallel job runner can
run onlyActionOn' automatically. Then commands won't need to worry about
this detail. Future work.
Otherwise, this was a fairly straightforward process of making each
CommandStart compile again. Hopefully other behavior changes were mostly
avoided.
In a few cases, a command had a CommandStart that called a CommandPerform
that then called showStart multiple times. I have collapsed those
down to a single start action. The main command to perhaps suffer from it
is Command.Direct, which used to show a start for each file, and no
longer does.
Another minor behavior change is that some commands used showStart
before, but had an associated file and a Key available, so were changed
to ShowStart with an ActionItemAssociatedFile. That will not change the
normal output or behavior, but --json output will now include the key.
This should not break it for anyone using a real json parser.
Added the ability to run one job per CPU (core), by setting annex.jobs=cpus,
or using option --jobs=cpus or -Jcpus.
Built with future expansion in mind, including not defaulting matching on
Concurrency so more constructors can later be added, and using "cpu"
instead of "0".
This does not change the overall license of the git-annex program, which
was already AGPL due to a number of sources files being AGPL already.
Legally speaking, I'm adding a new license under which these files are
now available; I already released their current contents under the GPL
license. Now they're dual licensed GPL and AGPL. However, I intend
for all my future changes to these files to only be released under the
AGPL license, and I won't be tracking the dual licensing status, so I'm
simply changing the license statement to say it's AGPL.
(In some cases, others wrote parts of the code of a file and released it
under the GPL; but in all cases I have contributed a significant portion
of the code in each file and it's that code that is getting the AGPL
license; the GPL license of other contributors allows combining with
AGPL code.)
Of course, it wasn't used much in those modes, because normal output is
avoided. But it was still initialized and used in a few places,
including a call to hideRegionsWhile.
That could cause git-annex to get confused about whether a locked file's
content was present, when the object file got touched.
Unfortunately this means more work sometimes when annex.thin is set,
since it has to checksum the file to tell if it's still got the right
content.
Had to suppress output when inAnnex calls isUnmodified, otherwise
"(checksum...)" would be printed in places it ought not to be,
eg "git annex get" could turn out not need to get anything, and
so only display that.
This commit was sponsored by Ole-Morten Duesund on Patreon.
It's left up to the special remote to detect when git-annex is new enough
to support the message; an old git-annex will blow up.
This commit was supported by the NSF-funded DataLad project.
Clean up some uses of showStart with "" for the file,
or in some cases, a non-filename description string. That would
generate bad json, although none of the commands doing that
supported --json.
Using "" for the file resulted in output like "foo rest";
now the extra space is eliminated.
This commit was sponsored by Fernando Jimenez on Patreon.
When built with concurrent-output 1.9, ssh password prompts will no longer
interfere with the -J display.
To avoid flicker, only done when ssh actually does need to prompt;
ssh is first run in batch mode and if that succeeds the connection is up
and no need to clear regions.
This commit was supported by the NSF-funded DataLad project.
When ssh connection caching is enabled (and when GIT_ANNEX_USE_GIT_SSH is
not set), only one ssh password prompt will be made per host, and only one
ssh password prompt will be made at a time.
This also fixes a race in prepSocket's stale ssh connection stopping
when run with -J. It was possible for one thread to start a cached ssh
connection, and another thread to immediately stop it, resulting in excess
connections being made.
This commit was supported by the NSF-funded DataLad project.
This gets rid of quite a lot of ugly hacks around json generation.
I doubt that any real-world json parsers can parse incomplete objects, so
while it's not as nice to need to wait for the complete object, especially
for commands like `git annex info` that take a while, it doesn't seem worth
the added complexity.
This also causes the order of fields within the json objects to be
reordered. Since any real json parser shouldn't care, the only possible
problem would be with ad-hoc parsers of the old json output.
Avoid threads emitting json at the same time and scrambling, which was
still possible even with the buffering, just less likely.
Converted json IO actions to JSONChunk data too.
This makes -Jn work with --json and --quiet, where before
setting -Jn disabled those options.
Concurrent json output is currently a mess though since threads output
chunks over top of one-another.
Note that get --from foo --failed will get things that a previous get --from bar
tried and failed to get, etc. I considered making --failed only retry
transfers from the same remote, but it was easier, and seems more useful,
to not have the same remote requirement.
Noisy due to some refactoring into Types/
Keeping Text.JSON use for now, because it seems a better fit for most of
the commands, which don't use very structured JSON objects, but just output
whatever fields suites them. But this lets Aeson be used when a more
structured data type is available to serialize to JSON.
Show branch:file that is being operated on.
I had to make ActionItem a type and not a type class because
withKeyOptions' passed two different types of values when using the type
class, and I could not get the type checker to accept that.
Instead -J will behave as if it was built without concurrent-output support
in this situation. Ie, it will be mostly quiet, except when there's an
error.
Note that it's not a problem for a filename to contain invalid utf-8 when
in a utf-8 locale. That is handled ok by concurrent-output. It's only
displaying unicode characters in a non-unicode locale that doesn't work.
This allows things like Command.Find to use noMessages and generate their
own complete json objects. Previouly, Command.Find managed that only via a
hack, which wasn't compatable with batch mode.
Only Command.Find, Command.Smudge, and Commange.Status use noMessages
currently, and none except for Command.Find are impacted by this change.
Fixes find --json --batch output
Writes are optimised by queueing up multiple writes when possible.
The queue is flushed after the Annex monad action finishes. That makes it
happen on program termination, and also whenever a nested Annex monad action
finishes.
Reads are optimised by checking once (per AnnexState) if the database
exists. If the database doesn't exist yet, all reads return mempty.
Reads also cause queued writes to be flushed, so reads will always be
consistent with writes (as long as they're made inside the same Annex monad).
A future optimisation path would be to determine when that's not necessary,
which is probably most of the time, and avoid flushing unncessarily.
Design notes for this commit:
- separate reads from writes
- reuse a handle which is left open until program
exit or until the MVar goes out of scope (and autoclosed then)
- writes are queued
- queue is flushed periodically
- immediate queue flush before any read
- auto-flush queue when database handle is garbage collected
- flush queue on exit from Annex monad
(Note that this may happen repeatedly for a single database connection;
or a connection may be reused for multiple Annex monad actions,
possibly even concurrent ones.)
- if database does not exist (or is empty) the handle
is not opened by reads; reads instead return empty results
- writes open the handle if it was not open previously
Came up with a generic way to filter out progress messages while keeping
errors, for commands that use stderr for both.
--json mode will disable command outputs too.