Commit graph

282 commits

Author SHA1 Message Date
Joey Hess
2a8fdfc7d8
Display a warning message when asked to operate on a file inside a directory that's a symbolic link to elsewhere
This relicates git's behavior. It adds a few stat calls for the command
line parameters, so there is some minor slowdown, but even with thousands
of parameters it will not be very noticable, and git does the same statting
in similar circumstances.

Note that this does not prevent eg "git annex add symlink"; the symlink
will be added to git as usual. And "git annex find symlink" will silently
list nothing as well. It's only "symlink/foo" or "subdir/symlink/foo" that
triggers the warning.
2020-05-11 15:03:35 -04:00
Joey Hess
cee6b344b4
cat-file resource pool
Avoid running a large number of git cat-file child processes when run with
a large -J value.

This implementation takes care to avoid adding any overhead to git-annex
when run without -J. When run with -J, there is a small bit of added
overhead, to manipulate the resource pool. That optimisation added a
fair bit of complexity.
2020-04-20 15:19:31 -04:00
Joey Hess
fada5c120c
remove unused import 2020-04-17 15:19:49 -04:00
Joey Hess
fe9cf1256e
move remoteList into dupState
This does mean that RemoteDaemon.Transport.Tor's call runs it, otherwise
no change, but this is groundwork for doing more such expensive actions
in dupState.
2020-04-17 14:36:45 -04:00
Joey Hess
957a87b437
fix absolute filenames fed into --batch and git-annex info 2020-04-15 16:04:05 -04:00
Joey Hess
ca9c6c5f60
Fix a potential failure to parse git config
Git has an obnoxious special case in git config, a line "foo" is the same
as "foo = true". That means there is no way to examine the output of
git config and tell if it was run with --null or not, since a "foo"
in the first line could be such a boolean, or could be followed by its
value on the next line if --null were used.

So, rather than trying to do such a detection, track the style of config
at all the points where it's generated.
2020-04-13 13:05:41 -04:00
Joey Hess
aeca7c2207
Sped up query commands that read the git-annex branch by around 5%
The only price paid is one additional MVar read per write to the journal.
Presumably writing a journal file dominiates over a MVar read time by
several orders of magnitude.

--batch does not get the speedup because then it needs to notice when
another process has made a change. Also made the assistant and other damon
modes bypass the optimisation, which would not help them anyway.
2020-04-09 13:54:43 -04:00
Joey Hess
c8fec6ab03
Fix a minor bug that caused options provided with -c to be passed multiple times to git. 2020-03-16 13:06:44 -04:00
Joey Hess
f6d629e483
changelog and minor style 2020-02-28 12:57:55 -04:00
Peter Simons
73cf523a4b
Fix build with ghc-8.8.x.
The 'fail' method has been moved to the 'MonadFail' class. I made the changes
so that the code still compiles with previous versions of 'base' that don't
have the new MonadFail class exported by Prelude yet.
2020-02-28 12:54:20 -04:00
Joey Hess
c78b9b55b6
rename changeGitConfig to overrideGitConfig and avoid unncessary calls
It's important that it be clear that it overrides a config, such that
reloading the git config won't change it, and in particular, setConfig
won't change it.

Most of the calls to changeGitConfig were actually after setConfig,
which was redundant and unncessary. So removed those.

The only remaining one, besides --debug, is in the handling of
repository-global config values. That one's ok, because the
way mergeGitConfig is implemented, it does not override any value that
is set in git config. If a value with a repo-global setting was passed
to setConfig, it would set it in the git config, reload the git config,
re-apply mergeGitConfig, and use the newly set value, which is the right
thing.
2020-02-27 01:11:53 -04:00
Joey Hess
029c883713
Merge branch 'master' into v8 2020-02-19 14:32:11 -04:00
Joey Hess
aa949bbb7d
initremote --describe-other-params
Does not yet include descriptions from external special remote programs.
2020-01-20 16:05:51 -04:00
Joey Hess
3cd3757236
annex.dotfiles
The git add behavior changes could be avoided if it turns out to be
really annoying, but then it would need to behave the old way when
annex.dotfiles=false and the new way when annex.dotfiles=true. I'd
rather not have the config option result in such divergent behavior as
`git annex add .` skipping a dotfile (old) vs adding to annex (new).

Note that the assistant always adds dotfiles to the annex.
This is surprising, but not new behavior. Might be worth making it also
honor annex.dotfiles, but I wonder if perhaps some user somewhere uses
it and keeps large files in a directory that happens to begin with a
dot. Since dotfiles and dotdirs are a unix culture thing, and the
assistant users may not be part of that culture, it seems best to keep
its current behavior for now.
2019-12-26 16:33:39 -04:00
Joey Hess
7d9dff5b05
Merge branch 'master' into bs
and update changelog
2019-12-18 15:13:30 -04:00
Joey Hess
7fd5376334
inprogress: Support --key 2019-12-18 14:14:16 -04:00
Joey Hess
c19211774f
use filepath-bytestring for annex object manipulations
git-annex find is now RawFilePath end to end, no string conversions.
So is git-annex get when it does not need to get anything.
So this is a major milestone on optimisation.

Benchmarks indicate around 30% speedup in both commands.

Probably many other performance improvements. All or nearly all places
where a file is statted use RawFilePath now.
2019-12-11 15:25:07 -04:00
Joey Hess
bdec7fed9c
convert TopFilePath to use RawFilePath
Adds a dependency on filepath-bytestring, an as yet unreleased fork of
filepath that operates on RawFilePath.

Git.Repo also changed to use RawFilePath for the path to the repo.

This does eliminate some RawFilePath -> FilePath -> RawFilePath
conversions. And filepath-bytestring's </> is probably faster.
But I don't expect a major performance improvement from this.
This is mostly groundwork for making Annex.Location use RawFilePath,
which will allow for a conversion-free pipleline.
2019-12-09 15:07:21 -04:00
Joey Hess
a0168cd9a2
use RawFilePath getSymbolicLinkStatus for speed 2019-12-06 15:42:54 -04:00
Joey Hess
c20f4704a7
all commands building except for assistant
also, changed ConfigValue to a newtype, and moved it into Git.Config.
2019-12-05 14:41:18 -04:00
Joey Hess
3c7fd09ec8
get many more commands building again
about half are building now
2019-12-05 11:40:10 -04:00
Joey Hess
b88f89c1ef
get the most commonly used commands building again
A quick benchmark of whereis shows not much speed improvement, maybe a
few percent. Profiling it found a hotspot, adds to todo.
2019-12-04 13:45:18 -04:00
Joey Hess
d7833def66
use ByteString for git config
The parser and looking up config keys in the map should both be faster
due to using ByteString.

I had hoped this would speed up startup time, but any improvement to
that was too small to measure. Seems worth keeping though.

Note that the parser breaks up the ByteString, but a config map ends up
pointing to the config as read, which is retained in memory until every
value from it is no longer used. This can change memory usage
patterns marginally, but won't affect git-annex.
2019-11-27 17:40:09 -04:00
Joey Hess
067aabdd48
wip RawFilePath 2x git-annex find speedup
Finally builds (oh the agoncy of making it build), but still very
unmergable, only Command.Find is included and lots of stuff is badly
hacked to make it compile.

Benchmarking vs master, this git-annex find is significantly faster!
Specifically:

	num files	old	new	speedup
	48500		4.77	3.73	28%
	12500		1.36	1.02	66%
	20		0.075	0.074	0% (so startup time is unchanged)

That's without really finishing the optimization. Things still to do:

* Eliminate all the fromRawFilePath, toRawFilePath, encodeBS,
  decodeBS conversions.
* Use versions of IO actions like getFileStatus that take a RawFilePath.
* Eliminate some Data.ByteString.Lazy.toStrict, which is a slow copy.
* Use ByteString for parsing git config to speed up startup.

It's likely several of those will speed up git-annex find further.
And other commands will certianly benefit even more.
2019-11-26 16:01:58 -04:00
Joey Hess
6a97ff6b3a
wip RawFilePath
Goal is to make git-annex faster by using ByteString for all the
worktree traversal. For now, this is focusing on Command.Find,
in order to benchmark how much it helps. (All other commands are
temporarily disabled)

Currently in a very bad unbuildable in-between state.
2019-11-25 16:18:19 -04:00
Joey Hess
667d38a8f1
Fix a crash (STM deadlock) when -J is used with multiple files that point to the same key
See the comment for a trace of the deadlock.

Added a new StartStage. New worker threads begin in the StartStage.
Once a thread is ready to do work, it moves away from the StartStage,
and no thread will ever transition back to it.

A thread that blocks waiting on another thread that is processing
the same key will block while in the StartStage. That other thread
will never switch back to the StartStage, and so the deadlock is avoided.
2019-11-14 13:51:09 -04:00
Joey Hess
61b384d2b7
add --sameas option, not yet used 2019-10-01 12:36:25 -04:00
Joey Hess
2b55a2b882
remotedaemon: Don't list --stop in help since it's not supported.
Also, move out of plumbing section. When using tor, the remotedaemon is
part of the user's workflow, as it runs the tor hidden service.
2019-09-30 14:40:46 -04:00
Joey Hess
b13a350556
added --unlocked and --locked 2019-09-19 12:33:13 -04:00
Joey Hess
fda1bdd679
Added --mimetype and --mimeencoding file matching options.
Already had these for largefiles matching, but I forgot to add them as
command-line options.
2019-09-19 12:09:59 -04:00
Joey Hess
3f0eef4baa
v7 for all repositories
* Default to v7 for new repositories.
* Automatically upgrade v5 repositories to v7.
2019-08-30 14:09:14 -04:00
Joey Hess
9a5ddda511
remove many old version ifdefs
Drop support for building with ghc older than 8.4.4, and with older
versions of serveral haskell libraries than will be included in Debian 10.

The only remaining version ifdefs in the entire code base are now a couple
for aws!

This commit should only be merged after the Debian 10 release.
And perhaps it will need to wait longer than that; it would make
backporting new versions of  git-annex to Debian 9 (stretch) which
has been actively happening as recently as this year.

This commit was sponsored by Ilya Shlyakhter.
2019-07-05 15:09:37 -04:00
Joey Hess
ad6b8c5f77
fix STM deadlock in finishCommandActions
Happened every time, because it was taking the pool TMVar while threads
were still running, and then the thread would try to switch state.
2019-06-19 18:34:26 -04:00
Joey Hess
37d505dd6b
avoid STM deadlock
When all worker threads are running and enteringStage is called,
it waits for an idle slot. If all off the other threads then call it in
turn, a deadlock occurrs.

This is the same problem I didn't actually fix in
5a9842d7ed.

Fixed by doing two separate STM transactions, the first replaces its
active thread with an idle thread, and the second waits for another idle
thread. That guarantees there will eventually be an idle thread to find.

The changes to WorkerPool were necessary because it can't add an idle
thread containing the Annex state and go on to run an action using that
same state, so I had to remove the Annex state from IdleWorker.
2019-06-19 18:15:25 -04:00
Joey Hess
a0d3a699e2
fix concurrency
Broken by recent commits, because before dupState is called, the Annex
state needs to have concurrent output enabled, and the thread pool
populated.
2019-06-19 16:12:39 -04:00
Joey Hess
9671248fff
speed up enteringStage in non-concurrent mode
Avoid a STM transaction.

Also got rid of UnallocatedWorkerPool.
2019-06-19 15:47:54 -04:00
Joey Hess
53882ab4a7
make WorkerStage an open type
Rather than limiting it to PerformStage and CleanupStage, this opens it
up so any number of stages can be added as needed by commands.

Each concurrent command has a set of stages that it uses, and only
transitions between those can block waiting for a free slot in the
worker pool. Calling enteringStage for some other stage does not block,
and has very little overhead.

Note that while before the Annex state was duplicated on the first call
to commandAction, this now happens earlier, in startConcurrency.
That means that seek stage actions should that use startConcurrency
and then modify Annex state won't modify the state of worker threads
they then start. I audited all of them, and only Command.Seek
did so; prepMerge changes the working directory and so has to come
before startConcurrency.

Also, the remote list is built before duplicating the state, which means
that it gets built earlier now than it used to. This would only have an
effect of making commands that end up not needing to perform any actions
unncessary build the remote list (only when they're run with concurrency
enable), but that's a minor overhead compared to commands seeking
through the work tree and determining they don't need to do anything.
2019-06-19 13:05:03 -04:00
Joey Hess
5a9842d7ed
avoid STM deadlock onredundant call to changeStageTo
I couldn't find a way to avoid the deadlock w/o rewriting it to clearly
not have one. I'm not quite sure what was the actual cause of the
deadlock.

This makes me unsure how I now know it clearly doesn't have a
deadlock. But, it was easy to reproduce before (just call it twice in a
row) and doesn't happen now.
2019-06-17 14:51:30 -04:00
Joey Hess
ecbd456312
fix restoring worker pool bug
The bug might have led to a STM deadlock, if this case could ever
actually fire.
2019-06-17 12:52:57 -04:00
Joey Hess
70bc30acb1
get rid of implicitMessages state
Oh joyous day, this is probably git-annex's oldest implementation wart,
source of much unncessary bother.

Now that we have a StartMessage, showEndResult' can look at it to know
if it needs to display an end message or not.

This is also going to be faster, because it avoids an uncessary state
lookup for each file processed.
2019-06-12 14:01:41 -04:00
Joey Hess
8e5ea28c26
finish CommandStart transition
The hoped for optimisation of CommandStart with -J did not materialize.
In fact, not runnign CommandStart in parallel is slower than -J3.
So, CommandStart are still run in parallel.

(The actual bad performance I've been seeing with -J in my big repo
has to do with building the remoteList.)

But, this is still progress toward making -J faster, because it gets rid
of the onlyActionOn roadblock in the way of making CommandCleanup jobs
run separate from CommandPerform jobs.

Added OnlyActionOn constructor for ActionItem which fixes the
onlyActionOn breakage in the last commit.

Made CustomOutput include an ActionItem, so even things using it can
specify OnlyActionOn.

In Command.Move and Command.Sync, there were CommandStarts that used
includeCommandAction, so output messages, which is no longer allowed.
Fixed by using startingCustomOutput, but that's still not quite right,
since it prevents message display for the includeCommandAction run
inside it too.
2019-06-12 13:24:01 -04:00
Joey Hess
436f107715
make CommandStart return a StartMessage
The goal is to be able to run CommandStart in the main thread when -J is
used, rather than unncessarily passing it off to a worker thread, which
incurs overhead that is signficant when the CommandStart is going to
quickly decide to stop.

To do that, the message it displays needs to be displayed in the worker
thread, after the CommandStart has run.

Also, the change will mean that CommandStart will no longer necessarily
run with the same Annex state as CommandPerform. While its docs already
said it should avoid modifying Annex state, I audited all the
CommandStart code as part of the conversion. (Note that CommandSeek
already sometimes runs with a different Annex state, and that has not been
a source of any problems, so I am not too worried that this change will
lead to breakage going forward.)

The only modification of Annex state I found was it calling
allowMessages in some Commands that default to noMessages. Dealt with
that by adding a startCustomOutput and a startingUsualMessages.
This lets a command start with noMessages and then select the output it
wants for each CommandStart.

One bit of breakage: onlyActionOn has been removed from commands that used it.
The plan is that, since a StartMessage contains an ActionItem,
when a Key can be extracted from that, the parallel job runner can
run onlyActionOn' automatically. Then commands won't need to worry about
this detail. Future work.

Otherwise, this was a fairly straightforward process of making each
CommandStart compile again. Hopefully other behavior changes were mostly
avoided.

In a few cases, a command had a CommandStart that called a CommandPerform
that then called showStart multiple times. I have collapsed those
down to a single start action. The main command to perhaps suffer from it
is Command.Direct, which used to show a start for each file, and no
longer does.

Another minor behavior change is that some commands used showStart
before, but had an associated file and a Key available, so were changed
to ShowStart with an ActionItemAssociatedFile. That will not change the
normal output or behavior, but --json output will now include the key.
This should not break it for anyone using a real json parser.
2019-06-06 17:13:54 -04:00
Joey Hess
258a7c5cd1
add Key to all ActionItem constructors 2019-06-06 12:53:24 -04:00
Joey Hess
4932972487
fix STM deadlock
659640e224 was buggy, it had a STM
deadlock because two actions both wanted to takeTMVar the WorkerPool
and so blocked one-another.

Fixed by completely reworking how the pool is maintained. Maintenace
threads now wait for the Async actions and update the WorkerPool. This
means twice as many threads as before, but green threads so will only
use a few extra bytes ram per thread.
2019-06-05 20:07:35 -04:00
Joey Hess
659640e224
separate queue for cleanup actions
When running multiple concurrent actions, the cleanup phase is run in a
separate queue than the main action queue. This can make some commands
faster, because less time is spent on bookkeeping in between each file
transfer.

But as far as I can see, nothing will be sped up much by this yet, because
all the existing cleanup actions are very light-weight. This is just groundwork
for deferring checksum verification to cleanup time.

This change does mean that if the user expects -J2 will mean that they see no
more than 2 jobs running at a time, they may be surprised to see 4 in some
cases (if the cleanup actions are slow enough to notice).

It might also make sense to enable background cleanup without the -J,
for at least one cleanup action. Indeed, that's the behavior that -J1
has now. At some point in the future, it make make sense to make the
behavior with no -J the same as -J1. The only reason it's not currently
is that git-annex can build w/o concurrent-output, and also any bugs
in concurrent-output (such as perhaps misbehaving on non-VT100 compatible
terminals) are avoided by default by only using it when -J is used.
2019-06-05 17:54:35 -04:00
Joey Hess
c04b2af3e1
improved WorkerPool abstraction
No behavior changes.
2019-06-05 14:26:48 -04:00
Joey Hess
30286bf067
improve layout 2019-06-05 12:20:20 -04:00
Joey Hess
aa7710982b
avoid list lookup by parseToken
Minor optimisation to parsing of a preferred content expression.
2019-05-14 13:11:29 -04:00
Joey Hess
fa070df373
fix usage 2019-05-10 14:52:52 -04:00
Joey Hess
82186ca58f
annex.jobs=cpus etc
Added the ability to run one job per CPU (core), by setting annex.jobs=cpus,
or using option --jobs=cpus or -Jcpus.

Built with future expansion in mind, including not defaulting matching on
Concurrency so more constructors can later be added, and using "cpu"
instead of "0".
2019-05-10 13:27:08 -04:00