fix an annex.pidlock issue

The issue made, e.g., git-annex get of an unlocked file hang until annex.pidlocktimeout expired, and then fail. This fix should be fully thread safe no matter what else git-annex is doing. For now, runsGitAnnexChildProcess is only used in the one place where this is known to be a problem. Later, all places where git-annex runs itself as a child could be audited and wrapped with it.
2020-06-17 15:13:52 -04:00
commit 82448bdf39
parent 9583b267f5
10 changed files with 247 additions and 43 deletions
--- a/doc/bugs/one_annex_test_FAILs_when_HOME_is_a_crippled_fs.mdwn
+++ b/doc/bugs/one_annex_test_FAILs_when_HOME_is_a_crippled_fs.mdwn
@@ -56,3 +56,5 @@ details of the setup are [in that PR](https://github.com/datalad/datalad-extensi
 PS determining the boundaries and names of the tests git annex ran is a tricky business on its own -- I wondered if the test output formatting and annotation could be improved as well. E.g. there is unlikely to be a point in printing all output if a test passes.  With `nose` in Python / datalad we get a summary of all failed tests (and what was output when they were run) at the end of the full sweep. That helps to avoid needing to search the entire long list 


+> I've fixed two tests now, so [[done]]. (Also git-annex test succeeded
+> on vfat.) --[[Joey]]
--- a/doc/bugs/one_annex_test_FAILs_when_HOME_is_a_crippled_fs/comment_8_1be9990d0cab0e69936de356072ea890._comment
+++ b/doc/bugs/one_annex_test_FAILs_when_HOME_is_a_crippled_fs/comment_8_1be9990d0cab0e69936de356072ea890._comment
@@ -0,0 +1,50 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 8"""
+ date="2020-06-17T16:16:19Z"
+ content="""
+Minimal test case for the hang: 
+
+	git init a
+	cd a
+	git annex init
+	git annex adjust --unlock
+	git config annex.pidlock true
+	date > foo
+	git annex add --force
+	git commit -m add
+	cd ..
+	git clone a b
+	cd b
+	git annex init
+	git annex adjust --unlock
+	git config annex.pidlock true
+	git annex get foo --force
+
+That does not need vfat to hang. 
+
+	 364479 pts/2    Sl+    0:00          |           \_ /home/joey/bin/git-annex get foo
+	 364504 pts/2    Sl+    0:00          |               \_ git --git-dir=.git --work-tree=. --literal-pathspecs -c core.safecrlf=false update-index -q --refresh -z --stdin
+	 364506 pts/2    S+     0:00          |                   \_ /bin/sh -c git-annex smudge --clean -- 'foo' git-annex smudge --clean -- 'foo'
+	 364507 pts/2    Sl+    0:00          |                       \_ git-annex smudge --clean -- foo
+
+So is git-annex smudge waiting on the pidlock its parent has?
+
+Yes: After setting annex.pidlocktimeout 2:
+
+	2 second timeout exceeded while waiting for pid lock file .git/annex/pidlock
+	git-annex: Gave up waiting for possibly stale pid lock file .git/annex/pidlock
+	error: external filter 'git-annex smudge --clean -- %f' failed 1
+
+What I'm not sure about: annex.pidlock is not set by default on vfat,
+so why would the test suite have failed there, and intermittently?
+Maybe annex.pidlock does get set in some circumstances?
+
+Anyway, there's a clear problem that annex.pidlock prevents more than 1 git-annex
+process that uses locking from running, and here we have a parent git-annex
+that necessarily runs a child git-annex that does locking.
+
+Could the child process check if a parent/grandparent has the pid lock held
+and so safely skip taking it? Or do all places git-annex runs itself
+have to shut down pid locking?
+"""]]
--- a/doc/bugs/one_annex_test_FAILs_when_HOME_is_a_crippled_fs/comment_9_cf253f0ad3f9857b9f0746e678d8dbd8._comment
+++ b/doc/bugs/one_annex_test_FAILs_when_HOME_is_a_crippled_fs/comment_9_cf253f0ad3f9857b9f0746e678d8dbd8._comment
@@ -0,0 +1,50 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 9"""
+ date="2020-06-17T17:04:57Z"
+ content="""
+Oh this is tricky.. git-annex is taking the gitAnnexGitQueueLock
+while running the queued git update-index command. Which is the command
+that then runs git-annex.
+
+So dropping the pid lock before running the command won't work.
+
+If pid locks were fine grained, this would not be a problem because the
+child process is really locking a different resource than its grandparent.
+
+But, I think the reasons for not making them fine grained do make sense:
+Since git-annex sometimes takes a posix lock on a content file, it would
+need to use some other lock file for the pid lock. So every place that
+uses a lock file would have to specify the filename to use for pid
+locking. Which makes pid locking complicate the rest of the code base
+quite a lot, and every code path involving locking would have to be tested
+twice, in order to check that the pid lock file used by that lock works.
+Doubling the complexity of your file locking is a good thing to avoid.
+
+Hmm.. I said "the child process is really locking a different resource than
+its grandparent". And that generally has to be the case, or people using
+git-annex w/o pid locking would notice that hey, these child processes
+fail to take a lock and crash. 
+
+So.. If we assume that is the case, and that there are plenty of git-annex
+users not using pid locking, then there's no need for a child process
+to take the pid lock, if its parent currently has the pid lock held,
+and will keep it held.
+
+And this could be communicated via an env var. When the pid lock is taken
+set `ANNEX_PIDLOCKED`, and unset when it's dropped. Then, so long
+as children inherit that env variable, they can skip taking the pid lock when
+it's set.
+
+To make sure that's safe, any time git-annex runs a child process
+(or a git command that runs git-annex), it ought to hold the pid lock
+while doing it. Holding any lock will do. The risk is, if one thread
+has some lock held for whatever reason, and another thread runs the child
+process, then the child process will rely on the unrelated thread keeping
+the lock held. Explicitly holding some lock avoids such a scenario.
+
+So, let's make it more explicit, add a runsGitAnnex that, when pid locking
+is enabled, holds the pid lock while running the action. Then that has to
+be wrapped around any places where a git-annex child process is run,
+which can be done broadly, or just as these issues come up.
+"""]]
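
The `ANNEX_PIDLOCKED` idea from comment 9 can be illustrated with a small shell sketch. This is not git-annex's implementation (which is Haskell): the function name `with_pidlock`, the use of a `mkdir`-based lock in place of the real pid lock file, and the immediate failure instead of a timeout are all assumptions made for illustration. It only shows the shape of the approach: hold the lock explicitly while a child runs, export the variable, and let descendants skip the lock when they see it.

```shell
#!/bin/sh
# Sketch only. with_pidlock is a hypothetical helper, not part of git-annex.
# Usage: with_pidlock LOCKDIR CMD [ARGS...]
# The body runs in a subshell, so its variables and the export stay local.
with_pidlock() (
    lockdir=$1
    shift
    if [ -n "${ANNEX_PIDLOCKED:-}" ]; then
        # An ancestor holds the pid lock and keeps it held while we run,
        # so taking it again would only block. Skip it.
        "$@"
        exit
    fi
    # mkdir is atomic, so it stands in for creating the pid lock file.
    # A real implementation would wait up to a timeout before giving up.
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "gave up waiting for pid lock $lockdir" >&2
        exit 1
    fi
    # Children inherit this and can skip taking the lock themselves.
    export ANNEX_PIDLOCKED=1
    status=0
    "$@" || status=$?
    rmdir "$lockdir"
    exit "$status"
)
```

With this sketch, `with_pidlock ./pidlock.demo with_pidlock ./pidlock.demo echo nested` succeeds, because the inner call sees the exported variable. Without that check the inner call would find the lock already taken and fail, which mirrors the smudge hang described in comment 8.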