[Cluster-devel] (no subject)
eric
zren at suse.com
Tue Oct 13 10:07:14 UTC 2015
Hi David and list,
I'm working on ocfs2 and have encountered a problem with DLM POSIX file locks.
After some investigation, I'd like to share what I found and get some
hints from you.
Environment:
kernel: 3.12.47
FS: OCFS2
stack: pacemaker
cluster: 2 testing nodes, node1, node2
Issue desc:
There is a deadlock test case for file locks in the ocfs2 test suite. The
test first prepares a test file, file1, on the shared disk. Then node1 calls
"fcntl(file1, F_SETLKW, {F_WRLCK, SEEK_SET, 0, 0})", and node2 sets
alarm(10s) and issues the same "fcntl(file1, F_SETLKW, {F_WRLCK, SEEK_SET, 0, 0})".
The test expects the alarm to expire and deliver SIGALRM, waking the sleeping
process, as "man fcntl" says: "If a signal is caught while waiting, then the call is
interrupted and (after the signal handler has returned)
returns immediately (with return value -1 and errno set to EINTR)".
But the process on node2 was in the "Dl" state according to ps, and the signal
did not wake it up, so the test case hung forever.
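
For reference, here is a minimal single-host sketch of what the test does. It is not the ocfs2-test code: the function name blocked_lock_is_interruptible is hypothetical, the lock holder and the waiter are two processes on one machine instead of node1 and node2, and the timeout is passed in rather than fixed at 10s. On a local filesystem the waiter is expected to get EINTR. On the setup described here the waiter would presumably stay blocked, so the parent's waitpid() would not return.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Empty handler: it only has to exist so that SIGALRM is "caught". */
static void on_alarm(int sig) { (void)sig; }

/* Request a write lock on the whole file: {F_WRLCK, SEEK_SET, 0, 0}. */
static int set_wrlock(int fd, int cmd)
{
    struct flock fl;
    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;
    return fcntl(fd, cmd, &fl);
}

/*
 * Hypothetical helper. The parent plays node1 and holds the lock; the
 * child plays node2 and blocks in F_SETLKW with an alarm pending.
 * Returns 0 if the child's fcntl() failed with EINTR, a positive value
 * if it ended any other way, and -1 on setup failure.
 */
int blocked_lock_is_interruptible(const char *path, unsigned timeout_sec)
{
    assert(path != NULL);

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (set_wrlock(fd, F_SETLK) < 0) {
        close(fd);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_alarm;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;            /* no SA_RESTART, so EINTR is visible */
        if (sigaction(SIGALRM, &sa, NULL) < 0)
            _exit(2);

        int cfd = open(path, O_RDWR);
        if (cfd < 0)
            _exit(2);

        alarm(timeout_sec);
        int rc = set_wrlock(cfd, F_SETLKW);   /* blocks behind the parent */
        _exit((rc == -1 && errno == EINTR) ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0) {
        close(fd);
        return -1;
    }
    close(fd);                      /* also drops the parent's lock */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling blocked_lock_is_interruptible() with a path on the filesystem under test and a timeout of 10 mirrors the node2 side of the test. The handler is installed without SA_RESTART on purpose, since with SA_RESTART the interrupted call would be restarted and EINTR would not be returned.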
Investigations:
* Key debug info:
process stack on node1:
n1:/opt/ocfs2-test/bin # cat /proc/22677/stack
[<ffffffff8104250b>] kvm_clock_get_cycles+0x1b/0x20
[<ffffffff810ba924>] __getnstimeofday+0x34/0xc0
[<ffffffff810ba9ba>] getnstimeofday+0xa/0x30
[<ffffffff811bb30d>] SyS_poll+0x5d/0xf0
[<ffffffff81529809>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
process stack on node2:
n2:~ # cat /proc/1534/stack
[<ffffffffa050fa65>] dlm_posix_lock+0x185/0x380 [dlm]
[<ffffffff811f39ce>] fcntl_setlk+0x12e/0x2d0
[<ffffffff811b8231>] SyS_fcntl+0x261/0x510
[<ffffffff81529809>] system_call_fastpath+0x16/0x1b
[<00007f3f5721eb42>] 0x7f3f5721eb42
[<ffffffffffffffff>] 0xffffffffffffffff
* dlm_posix_lock
By adding printk calls and recompiling the dlm kernel module, I located where
n2 hangs:
dlm_posix_lock -> wait_event_killable
wait_event_killable puts the process into the TASK_KILLABLE state, which is like
TASK_UNINTERRUPTIBLE except that it can be woken up by fatal signals. In my tests,
SIGTERM woke it up, but SIGALRM did not.
Does this go against POSIX file lock semantics? Any hints would be much appreciated!
I can provide more information if needed ;-)
Thanks,
Eric