fix an annex.pidlock issue

That made e.g. git-annex get of an unlocked file hang until
annex.pidlocktimeout expired, and then fail.

This fix should be fully thread safe no matter what else git-annex is
doing.

This only uses runsGitAnnexChildProcess in the one place where it's known
to be a problem. Later, all the places where git-annex runs itself as a
child could be audited, and it could be added to each of them.
Joey Hess 2020-06-17 15:13:52 -04:00
parent 9583b267f5
commit 82448bdf39
GPG key ID: DB12DB0FF05F8F38
10 changed files with 247 additions and 43 deletions


@ -56,3 +56,5 @@ details of the setup are [in that PR](https://github.com/datalad/datalad-extensi
PS determining the boundaries and names of the tests git annex had run is a tricky business on its own -- I wondered if the test output formatting and annotation could be improved as well. E.g. there is unlikely to be a point in printing all output if a test passes. With `nose` in Python / datalad we get a summary of all failed tests (and what was output when they were run) at the end of the full sweep. That helps to avoid needing to search the entire long list
> I've fixed two tests now, so [[done]]. (Also git-annex test succeeded
> on vfat.) --[[Joey]]


@ -0,0 +1,50 @@
[[!comment format=mdwn
username="joey"
subject="""comment 8"""
date="2020-06-17T16:16:19Z"
content="""
Minimal test case for the hang:

    git init a
    cd a
    git annex init
    git annex adjust --unlock
    git config annex.pidlock true
    date > foo
    git annex add --force
    git commit -m add
    cd ..
    git clone a b
    cd b
    git annex init
    git annex adjust --unlock
    git config annex.pidlock true
    git annex get foo --force

That does not need vfat to hang.

    364479 pts/2 Sl+ 0:00 | \_ /home/joey/bin/git-annex get foo
    364504 pts/2 Sl+ 0:00 | \_ git --git-dir=.git --work-tree=. --literal-pathspecs -c core.safecrlf=false update-index -q --refresh -z --stdin
    364506 pts/2 S+ 0:00 | \_ /bin/sh -c git-annex smudge --clean -- 'foo' git-annex smudge --clean -- 'foo'
    364507 pts/2 Sl+ 0:00 | \_ git-annex smudge --clean -- foo

So is git-annex smudge waiting on the pidlock its parent has?

Yes: After setting annex.pidlocktimeout 2:

    2 second timeout exceeded while waiting for pid lock file .git/annex/pidlock
    git-annex: Gave up waiting for possibly stale pid lock file .git/annex/pidlock
    error: external filter 'git-annex smudge --clean -- %f' failed 1

What I'm not sure about: annex.pidlock is not set by default on vfat,
so why would the test suite have failed there, and intermittently?
Maybe annex.pidlock does get set in some circumstances?

Anyway, there's a clear problem that annex.pidlock prevents more than 1 git-annex
process that uses locking from running, and here we have a parent git-annex
that necessarily runs a child git-annex that does locking.

Could the child process check if a parent/grandparent has the pid lock held
and so safely skip taking it? Or do all places git-annex runs itself
have to shut down pid locking?
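
The blocking described above can be reproduced without git-annex. The following is a minimal shell sketch, not git-annex's actual locking code. It assumes a hypothetical lock directory created with `mkdir` as a stand-in for `.git/annex/pidlock`, and a `TIMEOUT` variable playing the role of `annex.pidlocktimeout`. The parent takes the lock and then runs a child (a subshell here) that wants the same lock, so the child can only wait and give up.

```sh
#!/bin/sh
# Hypothetical stand-in for a repository-wide pid lock: a directory
# created atomically with mkdir. Not git-annex's real lock file format.
LOCK="${TMPDIR:-/tmp}/pidlock-demo.$$"
TIMEOUT=2   # seconds, playing the role of annex.pidlocktimeout

take_lock() {
    waited=0
    # mkdir fails if the directory exists, so only one process can win.
    until mkdir "$LOCK" 2>/dev/null; do
        if [ "$waited" -ge "$TIMEOUT" ]; then
            echo "timeout waiting for lock" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}

drop_lock() { rmdir "$LOCK"; }

# The parent takes the lock, then runs a child that wants the same lock.
take_lock
if ( take_lock ); then
    child_status=acquired
else
    child_status=timed_out
fi
drop_lock
echo "child: $child_status"   # prints "child: timed_out"
```

Because the lock covers the whole repository rather than one resource, it makes no difference here that the child is working on behalf of the process that holds it.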
"""]]


@ -0,0 +1,50 @@
[[!comment format=mdwn
username="joey"
subject="""comment 9"""
date="2020-06-17T17:04:57Z"
content="""
Oh this is tricky.. git-annex is taking the gitAnnexGitQueueLock
while running the queued git update-index command. Which is the command
that then runs git-annex.

So dropping the pid lock before running the command won't work.

If pid locks were fine grained, this would not be a problem because the
child process is really locking a different resource than its grandparent.
But, I think the reasons for not making them fine grained do make sense:
Since git-annex sometimes takes a posix lock on a content file, it would
need to use some other lock file for the pid lock. So every place that
uses a lock file would have to specify the filename to use for pid
locking. Which makes pid locking complicate the rest of the code base
quite a lot, and every code path involving locking would have to be tested
twice, in order to check that the pid lock file used by that lock works.
Doubling the complexity of your file locking is a good thing to avoid.

Hmm.. I said "the child process is really locking a different resource than
its grandparent". And that generally has to be the case, or people using
git-annex w/o pid locking would notice that hey, these child processes
fail to take a lock and crash.

So.. If we assume that is the case, and that there are plenty of git-annex
users not using pid locking, then there's no need for a child process
to take the pid lock, if its parent currently has the pid lock held,
and will keep it held.

And this could be communicated via an env var. When the pid lock is taken
set `ANNEX_PIDLOCKED`, and unset when it's dropped. Then, so long
as children inherit that env variable, they can skip taking the pid lock when
it's set.
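
The env var idea can be sketched in shell. This is an illustration under assumptions, not the code git-annex uses: only the name `ANNEX_PIDLOCKED` comes from the text above, while the `mkdir`-based lock directory, `take_lock`, `drop_lock` and `lock_state` are hypothetical. The process that really takes the lock exports the variable, and any descendant that finds it set skips the lock and never releases it.

```sh
#!/bin/sh
# Sketch only: the lock directory and function names are hypothetical.
unset ANNEX_PIDLOCKED
LOCK="${TMPDIR:-/tmp}/pidlock-env-demo.$$"
lock_state=none

take_lock() {
    # An ancestor already holds the pid lock and will keep holding it,
    # so this process can skip taking it.
    if [ -n "${ANNEX_PIDLOCKED:-}" ]; then
        lock_state=inherited
        return 0
    fi
    mkdir "$LOCK" 2>/dev/null || return 1
    lock_state=owned
    ANNEX_PIDLOCKED=1
    export ANNEX_PIDLOCKED
}

drop_lock() {
    # Only the process that really took the lock releases it.
    if [ "$lock_state" = owned ]; then
        unset ANNEX_PIDLOCKED
        rmdir "$LOCK"
    fi
    lock_state=none
}

take_lock
# A real child process sees the exported variable.
seen_by_child=$(sh -c 'echo "${ANNEX_PIDLOCKED:-unset}"')
# A child using the same locking code (a subshell here) skips the lock.
child_state=$(take_lock && echo "$lock_state"; drop_lock)
drop_lock

# Once the lock is dropped, a child takes the lock itself again.
later_state=$(take_lock && echo "$lock_state"; drop_lock)

echo "$seen_by_child $child_state $later_state"   # prints "1 inherited owned"
```

Tracking whether the lock is owned or merely inherited matters, since a child that skipped the lock must not remove the lock its ancestor still relies on.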

To make sure that's safe, any time git-annex runs a child process
(or a git command that runs git-annex), it ought to hold the pid lock
while doing it. Holding any lock will do. The risk is, if one thread
has some lock held for whatever reason, and another thread runs the child
process, then the child process will rely on the unrelated thread keeping
the lock held. Explicitly holding some lock avoids such a scenario.

So, let's make it more explicit, add a runsGitAnnex that, when pid locking
is enabled, holds the pid lock while running the action. Then that has to
be wrapped around any places where a git-annex child process is run,
which can be done broadly, or just as these issues come up.
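
The shape of such a wrapper can be sketched in shell. The sketch rests on assumptions throughout: `runs_child`, `PIDLOCK_ENABLED` and the `mkdir`-based lock are hypothetical names, and it is not the Haskell implementation. The wrapper takes the lock itself, marks it with `ANNEX_PIDLOCKED`, runs the command, and releases the lock afterwards while passing the command's exit status through.

```sh
#!/bin/sh
# Sketch only: the names below are hypothetical, not git-annex's own.
unset ANNEX_PIDLOCKED
LOCK="${TMPDIR:-/tmp}/pidlock-wrap-demo.$$"
PIDLOCK_ENABLED=1   # stand-in for the annex.pidlock setting

# Run a command that may start a child git-annex, holding the lock
# explicitly for exactly as long as the command runs.
runs_child() {
    if [ "$PIDLOCK_ENABLED" != 1 ]; then
        "$@"
        return
    fi
    mkdir "$LOCK" 2>/dev/null || return 1
    ANNEX_PIDLOCKED=1
    export ANNEX_PIDLOCKED
    "$@"
    status=$?
    unset ANNEX_PIDLOCKED
    rmdir "$LOCK"
    return "$status"
}

# The child exits 7 only if it saw the marker while it ran.
runs_child sh -c '[ "${ANNEX_PIDLOCKED:-}" = 1 ] && exit 7'
wrapped_status=$?
marker_after=${ANNEX_PIDLOCKED:-unset}
echo "$wrapped_status $marker_after"   # prints "7 unset"
```

A single shell process cannot show the thread hazard described above. The sketch only shows why the wrapper takes the lock itself: the child's safety then depends on a lock whose lifetime is tied to the child's own run, not on a lock some unrelated code happens to hold.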
"""]]