updated analysis

Joey Hess 2018-11-05 17:12:19 -04:00
parent c31ea81ee9
commit 14075afbd7

date="2018-11-05T19:14:44Z" date="2018-11-05T19:14:44Z"
content=""" content="""
Straces confirm the hanging git-annex-shell never gets to the point of Straces confirm the hanging git-annex-shell never gets to the point of
sending "DATA". sending "DATA". It does receive the "GET".
It does receive the "GET". annex.pidlock is set. That's unusual. It seems likely it's got
something to do with the problem, because pidlocks and concurrency don't
really mix; at best one git-annex process can hold the pid lock, so
other processes will need to wait, and could also time out and fail
to take the pid lock.
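
For context on what taking a pid lock involves here: the usual NFS-safe
technique is to write your host and pid to a unique temp file, then hard
link it into place as the lock file, since link(2) is atomic even over NFS.
A minimal sketch of that technique (illustrative names and file format,
not git-annex's actual code):

    -- Sketch of link(2)-based pid locking; the real pidlock file holds a
    -- Show'd "PidLock {lockingPid = ...}" record, this just writes a pair.
    import Control.Exception (try, IOException)
    import System.Posix.Files (createLink, removeLink)
    import System.Posix.Process (getProcessID)
    import System.Posix.Unistd (getSystemID, nodeName)

    tryPidLock :: FilePath -> FilePath -> IO Bool
    tryPidLock tmpfile lockfile = do
        host <- nodeName <$> getSystemID
        pid <- getProcessID
        writeFile tmpfile (show (host, pid))
        -- link() either succeeds atomically, or fails because the
        -- lock file already exists
        r <- try (createLink tmpfile lockfile) :: IO (Either IOException ())
        removeLink tmpfile
        return (either (const False) (const True) r)
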
Yeah: Setting annex.pidlock, I can reproduce the hang!

The strace of the non-hanging process shows it took the pid lock and dropped it,
twice. So the other process should have been able to take the pid lock too.

The hanging process does something around taking the pid lock just before
it hangs.

It checks the side lock, which is locked (this is `trySideLock`):

    6219 openat(AT_FDCWD, "/dev/shm/08d71513c120bda153ed3b773c1e59ff_BIDS_dbic_QA_.git_annex_pidlock.lck", O_RDWR|O_CREAT, 0666 <unfinished ...>
    6286 futex(0x555f91523d2c, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
    6219 <... openat resumed> ) = 11
    6219 fcntl(11, F_GETFD) = 0
    6219 fcntl(11, F_SETFD, FD_CLOEXEC) = 0
    6219 fcntl(11, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Resource temporarily unavailable)
    6219 close(11) = 0
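
That F_SETLK sequence is a non-blocking fcntl write-lock probe. The same
probe in Haskell would look roughly like this (a sketch assuming the unix
package's pre-2.8 openFd signature, not the real trySideLock):

    -- Probe a side lock file with a non-blocking fcntl write lock;
    -- setLock throws on EAGAIN when another process holds the lock.
    import Control.Exception (try, IOException)
    import System.IO (SeekMode(AbsoluteSeek))
    import System.Posix.IO

    probeSideLock :: FilePath -> IO Bool
    probeSideLock lckfile = do
        -- openat(..., O_RDWR|O_CREAT, 0666) as in the strace
        fd <- openFd lckfile ReadWrite (Just 0o666) defaultFileFlags
        r <- try (setLock fd (WriteLock, AbsoluteSeek, 0, 0))
                :: IO (Either IOException ())
        closeFd fd
        return (either (const False) (const True) r)
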
Then it checks the pid lock (this is `tryLock` when
`linkToLock` fails due to the side lock not being held):

    6219 openat(AT_FDCWD, "XXX/.git/annex/locktmp6219-0", O_RDWR|O_CREAT|O_EXCL|O_NOCTTY|O_NONBLOCK, 0600) = 11
    6219 openat(AT_FDCWD, "XXX/.git/annex/pidlock", O_RDONLY|O_NOCTTY|O_NONBLOCK) = 11
    6219 fstat(11, {st_mode=S_IFREG|0444, st_size=52, ...}) = 0
    6219 ioctl(11, TCGETS, 0x7ffd3c59ba90) = -1 ENOTTY (Inappropriate ioctl for device)
    6219 uname({sysname="Linux", nodename="XXX", ...}) = 0
    6219 stat("XXX/.git/annex/locktmp6219-0", {st_mode=S_IFREG|0444, st_size=52, ...}) = 0
    6219 read(11, "PidLock {lockingPid = 6221, lock"..., 8192) = 52
And gives up without overwriting it:

    6219 unlink("XXX/.git/annex/locktmp6219-0") = 0
    6219 unlink("XXX/.git/annex/locktmp6219-0") = -1 ENOENT (No such file or directory)
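
Presumably it reads the existing PidLock record to decide whether the lock
is stale, i.e. taken on this host by a pid that no longer exists; here it
apparently still looked valid, so it gave up. A sketch of that kind of
staleness check (hypothetical helpers, not git-annex's actual logic):

    -- kill(pid, 0) as an existence probe: succeeds iff the pid is alive
    import Control.Exception (try, IOException)
    import System.Posix.Signals (signalProcess, nullSignal)
    import System.Posix.Types (ProcessID)

    pidAlive :: ProcessID -> IO Bool
    pidAlive pid = do
        r <- try (signalProcess nullSignal pid) :: IO (Either IOException ())
        return (either (const False) (const True) r)

    -- A lock taken on another host cannot be probed this way, so only a
    -- same-host lock whose pid is gone can be considered stale.
    staleLock :: String -> String -> ProcessID -> IO Bool
    staleLock myhost lockhost lockpid
        | lockhost /= myhost = return False
        | otherwise = not <$> pidAlive lockpid
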
And then hangs waiting on stdin, it looks like:

    6219 poll([{fd=0, events=POLLIN}], 1, 0) = 0 (Timeout)
    6219 epoll_ctl(3, EPOLL_CTL_MOD, 0, {EPOLLIN|EPOLLONESHOT, {u32=0, u64=0}}) = 0
    6219 futex(0x555f91523c0c, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
    6285 <... read resumed> "\1\0\0\0\0\0\0\0", 8) = 8
    6285 read(4, "\1\0\0\0\0\0\0\0", 8) = 8
    6285 read(4, "\1\0\0\0\0\0\0\0", 8) = 8
    6285 read(4, "\1\0\0\0\0\0\0\0", 8) = 8
    6285 read(4, "\1\0\0\0\0\0\0\0", 8) = 8

Seems it was not in `waitLock`, because that should loop and `tryLock` again once
per second.
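
For comparison, the loop `waitLock` should be running would look roughly
like this (a sketch assuming a simple once-per-second retry of some
`IO Bool` try-lock action until a deadline):

    import Control.Concurrent (threadDelay)

    -- Retry a try-lock action once per second for the given number of
    -- seconds; False means we timed out without getting the lock.
    waitPidLock :: Int -> IO Bool -> IO Bool
    waitPidLock seconds tryIt
        | seconds <= 0 = return False
        | otherwise = do
            ok <- tryIt
            if ok
                then return True
                else do
                    threadDelay 1000000  -- one second
                    waitPidLock (seconds - 1) tryIt

So e.g. `waitPidLock 10 (probeSideLock lck)` would keep retrying for ten
seconds, and that once-per-second retry pattern is not what the strace shows.
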
So, maybe some part of git-annex is calling `tryLock`, and when it fails, it
somehow ends up blocked waiting for more protocol commands?

My next step now that I can reproduce it is to manually take the pid lock,
and run git-annex-shell p2pstdio, and see if it's indeed reading more commands from
stdin despite not responding to GET, or is sending an error or what.
"""]] """]]