This prevents serveLockContent from starting an unbounded number of
threads.
Note that, when it goes over this limit, git-annex is still able to drop
from the local repository in most situations; it just falls back to
checking content presence, and can still prove the drop is safe.
But of course there are some cases where an active lock is needed in order to
drop.
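To illustrate the shape of such a limit (a sketch only, not the actual
serveLockContent code; maxLockThreads and the counter are made up):

    import Control.Concurrent (forkIO)
    import Control.Concurrent.STM
    import Control.Exception (finally)

    maxLockThreads :: Int
    maxLockThreads = 128  -- assumed limit, not the real value

    -- Returns False when the limit is reached, so the caller can fall
    -- back to checking content presence instead of holding an active lock.
    forkLockThread :: TVar Int -> IO () -> IO Bool
    forkLockThread counter holdlock = do
        ok <- atomically $ do
            n <- readTVar counter
            if n >= maxLockThreads
                then return False
                else do
                    writeTVar counter (n + 1)
                    return True
        if ok
            then do
                _ <- forkIO $ holdlock
                    `finally` atomically (modifyTVar' counter (subtract 1))
                return True
            else return False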
The ugly getTimestamp hack works around a bug in the server. I suspect that
bug is also responsible for what happens if git-annex drop is interrupted
at the wrong time when checking the lock on the server -- besides the lock
fd being left open, the annex worker is not released to the pool, so later
connections to the server stall out. This needs to be
investigated, and the hack removed.
A deadlock eventually occurred when there were more concurrent clients
than the size of the annex worker pool.
A test case for the deadlock is multiple clients all running
git-annex get; git-annex drop in a loop. With more clients than the
server's -J, this tended to lock up the server fairly quickly.
The problem was that inAnnexWorker is run twice per request, once for
the P2P protocol handling thread, and once for the P2P protocol
generating thread. Those two threads were started concurrently, which,
when the worker pool is close to full, is equivalent to two locks being
taken in potentially two different orders, and so could deadlock.
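A stripped-down model of the problem, with a TSem standing in for the
worker pool (illustrative only, not the real code):

    import Control.Concurrent.Async (concurrently_)
    import Control.Concurrent.STM (atomically)
    import Control.Concurrent.STM.TSem
    import Control.Exception (bracket_)

    withWorker :: TSem -> IO a -> IO a
    withWorker pool = bracket_ (atomically (waitTSem pool))
                               (atomically (signalTSem pool))

    -- One request, structured as before the fix: the handling thread and
    -- the generating thread each take a worker slot, concurrently. With
    -- more requests than slots, every slot can end up held by a request's
    -- first thread while its second thread waits forever.
    request :: TSem -> IO () -> IO () -> IO ()
    request pool handler generator = concurrently_
        (withWorker pool handler)
        (withWorker pool generator)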
Fixed by making P2P.Http.Server use handleRequestAnnex instead of
inAnnexWorker. That forks off a new Annex state, runs the action in it,
and merges it back in.
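Roughly this shape, with a TVar standing in for the Annex state (only a
sketch of the idea, not how handleRequestAnnex is actually implemented):

    import Control.Concurrent.STM

    runInForkedState :: TVar st -> (st -> st -> st) -> (st -> IO (a, st)) -> IO a
    runInForkedState shared merge action = do
        snapshot <- readTVarIO shared        -- fork off a copy of the state
        (result, st') <- action snapshot     -- run the action in it
        atomically $ modifyTVar' shared (`merge` st')  -- merge it back in
        return result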
Also, made getP2PConnection wait to return until the inAnnexWorker action
has started. When there are more incoming requests than the size of the
worker pool, this prevents request handlers from starting
handleRequestAnnex until after getP2PConnection has started, avoiding
running more annex actions than the -J level.
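The waiting can be sketched like this, again with a TSem standing in for
the worker pool (names are illustrative, not git-annex's):

    import Control.Concurrent (forkIO)
    import Control.Concurrent.MVar
    import Control.Concurrent.STM (atomically)
    import Control.Concurrent.STM.TSem
    import Control.Exception (finally)

    -- Take a worker slot, then signal the caller, then do the work. The
    -- caller blocks on the MVar, so it does not return until a slot has
    -- actually been taken, and excess requests simply queue here.
    startWorker :: TSem -> IO () -> IO ()
    startWorker pool work = do
        started <- newEmptyMVar
        _ <- forkIO $ do
            atomically (waitTSem pool)
            putMVar started ()
            work `finally` atomically (signalTSem pool)
        takeMVar started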
Before, the server needed 2 jobs per request, so it could handle
concurrent requests only up to 1/2 of the -J level; now it matches
the -J level. Updated docs accordingly.
Note that serveLockContent starts a thread which keeps running after the
request finishes. Before, that thread still consumed a worker, which was
probably another way for the worker pool to get full. Now it does not.
So lots of calls to serveLockContent can result in lots of threads, but
they are lightweight since they only keep a lock held.
Considering this as a new DOS attack, the server would run out of FDs
before it runs out of memory. I'll address this in the next commit.
When multiple clients drop the same key at the same time, or potentially in
other situations where the content directory gets frozen at the wrong time,
writeContentRetentionTimestamp threw an exception when locking failed, and
the lock file descriptor was left open and never closed.
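The fix needs to guarantee the fd gets closed even when taking the lock
throws; something of this shape (a sketch, with a placeholder standing in
for the real fcntl-based locking):

    import Control.Exception (bracket)
    import System.IO

    withRetentionLock :: FilePath -> IO a -> IO a
    withRetentionLock lockfile a =
        bracket (openFile lockfile ReadWriteMode) hClose $ \h -> do
            takeExclusiveLock h  -- can throw when the directory is frozen
            a
      where
        takeExclusiveLock :: Handle -> IO ()
        takeExclusiveLock _ = return ()  -- placeholder for the real locking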
/dev/null was opened for reading, so writing to it would error out...
oops!
But, with --debug, we want to see the debug output. So avoid
/dev/nulling it.
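Sketch of the intended behavior (not the actual code; the debug flag and
the process plumbing here are schematic):

    import System.IO
    import System.Process

    -- Only redirect when not in debug mode, and open /dev/null for
    -- writing, since the child will be writing to it.
    runQuietly :: Bool -> CreateProcess -> IO ()
    runQuietly debugmode p
        | debugmode = go p
        | otherwise = withFile "/dev/null" WriteMode $ \devnull ->
            go p { std_out = UseHandle devnull, std_err = UseHandle devnull }
      where
        go p' = withCreateProcess p' $ \_ _ _ pid ->
            () <$ waitForProcess pid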