Commit graph

264 commits

Author SHA1 Message Date
Joey Hess
5cfcf1f05f
cache remote.log
Unlikely to speed up any of the existing uses much, but I want to use it
in a message that might be displayed many times.
2020-09-22 13:52:26 -04:00
Joey Hess
77c42782d0
differentiate between concurrency enabled at command line and by git config
The latter should not affect --batch mode.
2020-09-16 11:47:12 -04:00
Joey Hess
f912f8e5fd
refix bug in a better way
Always run Git.Config.store, so when the git config gets reloaded,
the override gets re-added to it, and changeGitRepo then calls extractGitConfig
on it and sees the annex.* settings from the override.

Remove any prior occurance of -c v and add it to the end. This way,
-c foo=1 -c foo=2 -c foo=1 will pass -c foo=1 to git, rather than -c foo=2

Note that, if git had some multiline config that got built up by
multiple -c's, this would not work still. But it never worked because
before the bug got fixed in the first place, the -c value was repeated
many times, so the multivalue thing would have been wrong. I don't think
-c can be used with multiline configs anyway, though git-config does
talk about them?
2020-07-02 13:32:33 -04:00
Joey Hess
d5451afc8f
fix deadlock
Fix a deadlock that could occur after git-annex got an unlocked file,
causing the command to hang indefinitely.

Known to happen on vfat filesystems, possibly others.

Note that a deadlock is still theoretically possible, if anything
smudge --clean does causes it to run the git queue for some other
reason.

Apparently that doesn't happen, but will need to keep an eye on it.
2020-06-18 12:56:29 -04:00
Joey Hess
04352ed9c5
check-ignore resource pool
Much like check-attr before.
2020-04-21 11:25:28 -04:00
Joey Hess
45fb7af21c
check-attr resource pool
Limited to min of -JN or number of CPU cores, because it will often be
CPU bound, once it's read the gitignore file for a directory.

In some situations it's more disk bound, but in any case it's unlikely
to be the main bottleneck that -J is used to avoid. Eg, when dropping,
this is used for numcopies checks, but the main bottleneck will be
accessing the remotes to verify presence. So the user might decide to
-J32 that, but having 32 check-attr processes would just waste however
many filehandles they open, and probably worsen their performance due to
CPU contention.

Note that, I first tried just letting up to the -JN be started. However,
even when it's no bottleneck at all, that still results in all of them
being started. Why? Well, all the worker threads start up nearly
simulantaneously, so there's a thundering herd..
2020-04-21 11:05:57 -04:00
Joey Hess
cee6b344b4
cat-file resource pool
Avoid running a large number of git cat-file child processes when run with
a large -J value.

This implementation takes care to avoid adding any overhead to git-annex
when run without -J. When run with -J, there is a small bit of added
overhead, to manipulate the resource pool. That optimisation added a
fair bit of complexity.
2020-04-20 15:19:31 -04:00
Joey Hess
529f488ec4
fix a thundering herd problem
Avoid repeatedly opening keys db when accessing a local git remote and -J
is used.

What was happening was that Remote.Git.onLocal created a new annex state
as each thread started up. The way the MVar was used did not prevent that.
And that, in turn, led to repeated opening of the keys db, as well as
probably other extra work or resource use.

Also managed to get rid of Annex.remoteannexstate, and it turned out there
was an unncessary Maybe in the keysdbhandle, since the handle starts out
closed.
2020-04-17 17:09:29 -04:00
Joey Hess
2caf579718
cache annex index filename for 1.5% speedup to queries 2020-04-10 13:37:04 -04:00
Joey Hess
c8fec6ab03
Fix a minor bug that caused options provided with -c to be passed multiple times to git. 2020-03-16 13:06:44 -04:00
Joey Hess
c78b9b55b6
rename changeGitConfig to overrideGitConfig and avoid unncessary calls
It's important that it be clear that it overrides a config, such that
reloading the git config won't change it, and in particular, setConfig
won't change it.

Most of the calls to changeGitConfig were actually after setConfig,
which was redundant and unncessary. So removed those.

The only remaining one, besides --debug, is in the handling of
repository-global config values. That one's ok, because the
way mergeGitConfig is implemented, it does not override any value that
is set in git config. If a value with a repo-global setting was passed
to setConfig, it would set it in the git config, reload the git config,
re-apply mergeGitConfig, and use the newly set value, which is the right
thing.
2020-02-27 01:11:53 -04:00
Joey Hess
c089f395b0
Bugfix: Don't ignore --debug when it is followed by -c 2020-02-27 00:52:37 -04:00
Joey Hess
4acbb40112
git-annex config annex.largefiles
annex.largefiles can be configured by git-annex config, to more easily set
a default that will also be used by clones, without needing to shoehorn the
expression into the gitattributes file. The git config and gitattributes
override that.

Whenever something is added to git-annex config, we have to consider what
happens if a user puts a purposfully bad value in there. Or, if a new
git-annex adds some new value that an old git-annex can't parse.
In this case, a global annex.largefiles that can't be parsed currently
makes an error be thrown. That might not be ideal, but the gitattribute
behaves the same, and is almost equally repo-global.

Performance notes:

git-annex add and addurl construct a matcher once
and uses it for every file, so the added time penalty for reading the global
config log is minor. If the gitattributes annex.largefiles were deprecated,
git-annex add would get around 2% faster (excluding hashing), because
looking that up for each file is not fast. So this new way of setting
it is progress toward speeding up add.

git-annex smudge does need to load the log every time. As well as checking
the git attribute. Not ideal. Setting annex.gitaddtoannex=false avoids
both overheads.
2019-12-20 13:01:41 -04:00
Joey Hess
f08cc8218b
optimisation
This was already optimised before, but profiling found that delEntry was
around 1.5% of the total runtime of git-annex whereis. It was being
called once per environment variable per file processed.

Fixed by better caching. Since withIndexFile is almost always run with
the same .git/annex/index file, it can cache the modified environment,
rather than re-modifying it each time called.

(cherry picked from commit 6535aea49a)
2019-12-04 14:30:22 -04:00
Joey Hess
99536e3a0b
remove one more warningIO
Had to generalize Git.Queue so it can run an Annex action, yipes.

Only remaining warningIO are in the legacy chunk code.
2019-11-12 10:45:52 -04:00
Joey Hess
fef3cd055d
Removed support for git versions older than 2.1
debian oldoldstable has 2.1, and that's what i386ancient uses. It would be
better to require git 2.2, which is needed to use adjusted branches, but
can't do that w/o losing support for some old linux kernels or a
complicated git backport.
2019-09-11 16:14:43 -04:00
Joey Hess
9671248fff
speed up enteringStage in non-concurrent mode
Avoid a STM transaction.

Also got rid of UnallocatedWorkerPool.
2019-06-19 15:47:54 -04:00
Joey Hess
659640e224
separate queue for cleanup actions
When running multiple concurrent actions, the cleanup phase is run in a
separate queue than the main action queue. This can make some commands
faster, because less time is spent on bookkeeping in between each file
transfer.

But as far as I can see, nothing will be sped up much by this yet, because
all the existing cleanup actions are very light-weight. This is just groundwork
for deferring checksum verification to cleanup time.

This change does mean that if the user expects -J2 will mean that they see no
more than 2 jobs running at a time, they may be surprised to see 4 in some
cases (if the cleanup actions are slow enough to notice).

It might also make sense to enable background cleanup without the -J,
for at least one cleanup action. Indeed, that's the behavior that -J1
has now. At some point in the future, it make make sense to make the
behavior with no -J the same as -J1. The only reason it's not currently
is that git-annex can build w/o concurrent-output, and also any bugs
in concurrent-output (such as perhaps misbehaving on non-VT100 compatible
terminals) are avoided by default by only using it when -J is used.
2019-06-05 17:54:35 -04:00
Joey Hess
c04b2af3e1
improved WorkerPool abstraction
No behavior changes.
2019-06-05 14:26:48 -04:00
Joey Hess
b03e65d260
Improved locking when multiple git-annex processes are writing to the .git/index file 2019-05-06 15:15:12 -04:00
Joey Hess
40ecf58d4b
update licenses from GPL to AGPL
This does not change the overall license of the git-annex program, which
was already AGPL due to a number of sources files being AGPL already.

Legally speaking, I'm adding a new license under which these files are
now available; I already released their current contents under the GPL
license. Now they're dual licensed GPL and AGPL. However, I intend
for all my future changes to these files to only be released under the
AGPL license, and I won't be tracking the dual licensing status, so I'm
simply changing the license statement to say it's AGPL.

(In some cases, others wrote parts of the code of a file and released it
under the GPL; but in all cases I have contributed a significant portion
of the code in each file and it's that code that is getting the AGPL
license; the GPL license of other contributors allows combining with
AGPL code.)
2019-03-13 15:48:14 -04:00
Joey Hess
e4e464da65
import command is updating tracking branch 2019-02-26 13:15:48 -04:00
Joey Hess
2e0e557e75
Support being built with ghc 8.0.1 (MonadFail)
Tested on an older ghc by enabling MonadFailDesugaring globally.

In TransferQueue, the lack of a MonadFail for STM exposed what would
normally be a bug in the pattern matching, although in this case an
earlier check that the queue was not empty avoided a pattern match
failure.
2019-01-05 11:55:15 -04:00
Joey Hess
894716512d
add a UUIDDesc type containing a ByteString
Groundwork for handling uuid.log using ByteString
2019-01-01 16:17:54 -04:00
Joey Hess
8be5a7269a
refactor getCurrentBranch
Both Command.Sync and Annex.Ingest had their own versions of this.

The one in Annex.Ingest used Git.Branch.currentUnsafe, but does not seem
to need it. That is only checking to see if it's in an adjusted unlocked
branch, and when in an adjusted branch, the branch does in fact exist,
so the added check that Git.Branch.current does is fine.

This commit was sponsored by Denis Dzyubenko on Patreon.
2018-10-19 17:29:18 -04:00
Joey Hess
759a87ad70
fix git command queue to be concurrency safe
Probably not noticed until now because the queue is large enough that two
threads each filling theirs at the same time and flushing is unlikely to
happen.

Also made explicit that each worker thread gets its own queue.
I think that was the case before, but if something was put in the queue
before worker threads were forked off, they could have each inherited the
same queue.

Could have gone with a single shared queue, but per-worker queues is more
efficient, because a worker can add lots of stuff to its own queue without
any locking.

This commit was sponsored by Ole-Morten Duesund on Patreon.
2018-08-28 13:16:33 -04:00
Joey Hess
256d8f07e8
avoid insertWith' depreaction warning
Switch to Data.Map.Strict everywhere that used it.

There are still lots of lazy maps in git-annex. I think switching these
is safe. The risk is that there might be a map that is used in a way
that relies on the values not being evaluated to WHNF, and switching to
strict might result in bad performance or memory use. So, I have not
switched everything.
2018-04-22 13:28:31 -04:00
Joey Hess
2ec07bc29f
Avoid running annex.http-headers-command more than once. 2018-04-04 15:15:08 -04:00
Joey Hess
fa65f1d240
fix --json-progress --json to be same as --json --json-progress
Fix behavior of --json-progress followed by --json, in which
the latter option disabled the former.

This commit was supported by the NSF-funded DataLad project.
2018-02-19 14:12:15 -04:00
Joey Hess
2b66492d6e
Improve startup time for commands that do not operate on remotes
And for tab completion, by not unnessessarily statting paths to remotes,
which used to cause eg, spin-up of removable drives.

Got rid of the remotes member of Git.Repo. This was a bit painful.

Remote.Git modifies the list of remotes as it reads their configs,
so still need a persistent list of remotes. So, put it in as
Annex.gitremotes. It's only populated by getGitRemotes, so commands
like examinekey that don't care about remotes won't do so.

This commit was sponsored by Jake Vosloo on Patreon.
2018-01-09 16:22:07 -04:00
Joey Hess
24f27ec39d
convert importfeed to youtube-dl
Fully working, including --fast/--relaxed.

Note that, while git-annex addurl --relaxed is not going to check
youtube-dl, I kept git annex importfeed --relaxed checking it.
Thinking is that, let's not break people's importfeed cron jobs, and
importfeed does not typically have to check a large number of new items,
so it's ok if it's a little bit slower when used with youtube playlist
feeds.

importfeed's behavior is also improved (?) when a feed has links in it
to non-media files. Before, those were skipped. Now, the content of the
link is downloaded. This had to be done, because trying to use
youtube-dl is slow, and if those were skipped, it would have to check
every time importfeed was run. While this behavior change may not be
desirable for some feeds, that intersperse links to web pages with
enclosures, it will be desirable for other feeds, that have
non-enclosure directy links to media files.

Remove old quvi modules.

This commit was sponsored by Øyvind Andersen Holm.
2017-11-29 17:30:02 -04:00
Joey Hess
e1ac299ad0
better dup key with -J fix
This avoids all the complication about redundant work discussed in
the previous try at fixing this. At the expense of needing each command
that could have the problem to be patched to simply wrap the action in
onlyActionOn once the key is known. But there do not seem to be many
such commands.

onlyActionOn' should not be used with a CommandStart (or CommandPerform),
although the types do allow it. onlyActionOn handles running the whole
CommandStart chain. I couldn't immediately see a way to avoid mistken
use of onlyActionOn'.

This commit was supported by the NSF-funded DataLad project.
2017-10-17 18:48:53 -04:00
Joey Hess
68a49adcda
Improve behavior when -J transfers multiple files that point to the same key
After a false start, I found a fairly non-intrusive way to deal with it.
Although it only handles transfers -- there may be issues with eg
concurrent dropping of the same key, or other operations.

There is no added overhead when -J is not used, other than an added
inAnnex check. When -J is used, it has to maintain and check a small
Set, which should be negligible overhead.

It could output some message saying that the transfer is being done by
another thread. Or it could even display the same progress info for both
files that are being downloaded since they have the same content. But I
opted to keep it simple, since this is rather an edge case, so it just
doesn't say anything about the transfer of the file until the other
thread finishes.

Since the deferred transfer action still runs, actions that do more than
transfer content will still get a chance to do their other work. (An
example of something that needs to do such other work is P2P.Annex,
where the download always needs to receive the content from the peer.)
And, if the first thread fails to complete a transfer, the second thread
can resume it.

But, this unfortunately means that there's a risk of redundant work
being done to transfer a key that just got transferred.
That's not ideal, but should never cause breakage; the same
thing can occur when running two separate git-annex processes.

The get/move/copy/mirror --from commands had extra inAnnex checks added,
inside the download actions. Without those checks, the first thread
downloaded the content, and then the second thread woke up and
downloaded the same content redundantly.

move/copy/mirror --to is left doing redundant uploads for now. It
would need a second checkPresent of the remote inside the upload
to avoid them, which would be expensive. A better way to avoid
redundant work needs to be found..

This commit was supported by the NSF-funded DataLad project.
2017-10-17 17:10:50 -04:00
Joey Hess
d39c120afa
add annex-ignore-command and annex-sync-command configs
Added remote configuration settings annex-ignore-command and
annex-sync-command, which are dynamic equivilants of the annex-ignore
and annex-sync configurations.

For this I needed a new DynamicConfig infrastructure. Its implementation
should be as fast as before when there is no dynamic config, and it caches
so shell commands are only run once.

Note that annex-ignore-command exits nonzero when the remote should be ignored.
While that may seem backwards, it allows using the same command for it as
for annex-sync-command when you want to disable both.

This commit was sponsored by Trenton Cronholm on Patreon.
2017-08-17 13:54:14 -04:00
Joey Hess
3f4b671486
fix sshCleanup race using STM 2017-05-11 18:29:51 -04:00
Joey Hess
6992fe133b
Ssh password prompting improved when using -J
When ssh connection caching is enabled (and when GIT_ANNEX_USE_GIT_SSH is
not set), only one ssh password prompt will be made per host, and only one
ssh password prompt will be made at a time.

This also fixes a race in prepSocket's stale ssh connection stopping
when run with -J. It was possible for one thread to start a cached ssh
connection, and another thread to immediately stop it, resulting in excess
connections being made.

This commit was supported by the NSF-funded DataLad project.
2017-05-11 17:36:03 -04:00
Joey Hess
4c1e3210fa
annex.backend is the new name for what was annex.backends
It takes a single key-value backend, rather than the unncessary and confusing list.
The old option still works if set.

Simplified some old old code too.

This commit was sponsored by Thomas Hochstein on Patreon.
2017-05-09 15:04:07 -04:00
Joey Hess
0534152685
get -J: Improve distribution of jobs amoung remotes when there are more jobs than remotes.
It was distributing jobs to remotes that were not being used by any other
job. But, suppose that there are only 2 remotes, and -J10. In such a case,
the first 2 downloads would be distributed amoung the 2 remotes, but
the other 8 would all go to remote #1. Improved by keeping a counter
of how many jobs are assigned to a remote, and prefer remotes with fewer
jobs.

Note use of Data.Map.Strict to avoid blowing up space. I kept the
bang-patterns as-is, although probably not needed with Data.Map.Strict.

This commit was sponsored by Jack Hill on Patreon.
2017-03-08 14:49:30 -04:00
Joey Hess
ed56dba868
annex.autocommit can be configured via git-annex config
... to control the default behavior in all clones of a repository.

This includes a new Configurable data type, so the GitConfig type indicates
which values can be configured this way.

The implementation should be quite efficient; the config log is only read
once, and only when a Configurable value has not already been set by
git-config.

Indeed, it would be nice in the future to extend this, so that git-config
is itself only read on demand. Some commands may not need to look at the
git configuration at all.

This commit was sponsored by Trenton Cronholm on Patreon.
2017-02-03 13:58:53 -04:00
Joey Hess
339464e847
config: New command for storing configuration in the git-annex branch.
Any config names can be set using this; git-annex commands will only look
at specific ones that make sense and are worth the overhead of querying the
branch.

This might also be useful for storing whatever other config-type stuff the
user might want to shove into the git-annex branch.

This commit was sponsored by Jochen Bartl on Patreon.
2017-01-30 16:46:38 -04:00
Joey Hess
1cd02762bf
Optimisations to git-annex branch query and setting, avoiding repeated copies of the environment.
Speeds up commands like  "git-annex find --in remote" by over 50%.

Profiling showed that adjustGitEnv was 21% of the time and 37% of the
allocations of that command. It copied the environment each time with
getEnvironment.

The only repeated use of adjustGitEnv is in withIndexFile, which tends to
be run at least once per file. So, it was optimised by keeping a cache of
the environment, which can be reused.

There could be other better ways to optimise this. Maybe get the while
environment once at startup. But, then it would have to be serialized back
out each time running a child process, so I doubt that would be a net win.

It might be better to cache a version of the environment that is
pre-modified to use .git-annex/index. But, profiling doesn't show that
modifying the enviroment is taking any significant time.
2016-09-29 13:36:48 -04:00
Joey Hess
8ef494a833
disentangle concurrency and message type
This makes -Jn work with --json and --quiet, where before
setting -Jn disabled those options.

Concurrent json output is currently a mess though since threads output
chunks over top of one-another.
2016-09-09 12:57:42 -04:00
Joey Hess
31289da691
get -J: Download different files from different remotes when the remotes have the same costs.
Only done in -J mode because only if there's concurrency can downloading
from two remotes be faster. Without concurrency, it's likely the case that
sequential downloads from the same remote are faster than switching back
and forth between two remotes.

There is some hairy MVar code here, but basically it just keeps
the activeremotes MVar full except when deciding which remote to assign
to a thread.

Also affects gets by sync --content -J

This commit was sponsored by Jochen Bartl.
2016-09-06 12:45:21 -04:00
Joey Hess
42b7ccc89f
git annex add in adjusted unlocked branch
Cached the current branch lookup just because it seems unnecessary overhead
to run an extra git command per add to query the current branch.
2016-03-29 13:26:06 -04:00
Joey Hess
88a4a6f396
Sped up git-annex add in direct mode and v6 by using git hash-object --batch.
Speeds up hashSymlink and hashPointerFile.
2016-03-14 15:58:46 -04:00
Joey Hess
f051b51645
remove 3 build flags
* Removed the webapp-secure build flag, rolling it into the webapp build
  flag.
* Removed the quvi and tahoe build flags, which only adds aeson to
  the core dependencies.
* Removed the feed build flag, which only adds feed to the core
  dependencies.

Build flags have cost in both code complexity and also make Setup configure
have to work harder to find a usable set of build flags when some
dependencies are missing.
2016-01-26 08:14:57 -04:00
Joey Hess
99c646615d
Bug fix: Git config settings passed to git-annex -c did not always take effect.
When Config.setConfig runs, it throws away the old Repo and loads a new
one. So, add an action to adjust the Repo so that -c settings will persist
across that.
2016-01-22 13:47:41 -04:00
Joey Hess
f839d407e3
flush keys db queue even on exception
Also fixed a bug in makeRunner; run' leaves the mvar empty so have to
refill it.
2015-12-23 19:38:18 -04:00
Joey Hess
4224fae71f
optimise read and write for Keys database (untested)
Writes are optimised by queueing up multiple writes when possible.
The queue is flushed after the Annex monad action finishes. That makes it
happen on program termination, and also whenever a nested Annex monad action
finishes.

Reads are optimised by checking once (per AnnexState) if the database
exists. If the database doesn't exist yet, all reads return mempty.

Reads also cause queued writes to be flushed, so reads will always be
consistent with writes (as long as they're made inside the same Annex monad).
A future optimisation path would be to determine when that's not necessary,
which is probably most of the time, and avoid flushing unncessarily.

Design notes for this commit:

- separate reads from writes
- reuse a handle which is left open until program
  exit or until the MVar goes out of scope (and autoclosed then)
- writes are queued
  - queue is flushed periodically
  - immediate queue flush before any read
  - auto-flush queue when database handle is garbage collected
  - flush queue on exit from Annex monad
    (Note that this may happen repeatedly for a single database connection;
    or a connection may be reused for multiple Annex monad actions,
    possibly even concurrent ones.)
- if database does not exist (or is empty) the handle
  is not opened by reads; reads instead return empty results
- writes open the handle if it was not open previously
2015-12-23 19:18:52 -04:00
Joey Hess
38a23928e9
temporarily remove cached keys database connection
The problem is that shutdown is not always called, particularly in the test
suite. So, a database connection would be opened, possibly some changes
queued, and then not shut down.

One way this can happen is when using Annex.eval or Annex.run with a new
state. A better fix might be to make both of them call Keys.shutdown
(and be sure to do it even if the annex action threw an error).

Complication: Sometimes they're run reusing an existing state, so shutting
down a database connection could cause problems for other users of that
same state. I think this would need a MVar holding the database handle,
so it could be emptied once shut down, and another user of the database
connection could then start up a new one if it got shut down. But, what if
2 threads were concurrently using the same database handle and one shut it
down while the other was writing to it? Urgh.

Might have to go that route eventually to get the database access to run
fast enough. For now, a quick fix to get the test suite happier, at the
expense of speed.
2015-12-16 14:05:26 -04:00