[Linux-cluster] what should fence_xvm do if dom0 is down

Joel Heenan joelh at planetjoel.com
Thu Oct 7 00:05:34 UTC 2010


On Wed, Oct 6, 2010 at 6:51 AM, Lon Hohberger <lhh at redhat.com> wrote:

> On 10/01/2010 02:11 AM, Joel Heenan wrote:
>
>> So just further to this I found a Red Hat bug about this exact issue:
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=570373
>>
>> And for me it works perfectly if the dom0 is fenced using fence_node on
>> the command line. However, if the host becomes unavailable then it is
>> not fenced, and from reading the fenced man page it seems this is
>> because there isn't a shared resource like clvm or gfs, so therefore the
>> cluster doesn't see a need to fence the host. This means subsequent
>> fence_xvm commands fail.
>>
>> I guess I need to find a way to force fenced to operate without clvm and
>> fence dom0s?
>>
>> Joel
>>
>>
> fence_xvm/fence_xvmd is designed to handle two primary cases:
>
> 1) kill the misbehaving VM, or
> 2) Wait for the last-known owner of misbehaving VM to be dead.
>
> Effectively, (2) occurs when the host cluster node dies and the host is
> subsequently fenced.
>
> According to 570373, (2) stopped working at some point, but I haven't
> gotten enough information to adequately debug the problem.
>
> If you have a cluster which exhibits this behavior, please contact me on
> FreeNode in #linux-cluster.
>

Hi Lon,

I was able to re-create this issue and capture the logs as per the bug; I
will send them to your email address.

This is what it looks like from the guest:

"""
2010-10-06T23:26:31.902493+00:00 c013otin01-test fenced[1891]: c013otin07-test not a cluster member after 0 sec post_fail_delay
2010-10-06T23:26:31.902608+00:00 c013otin01-test fenced[1891]: fencing node "c013otin07-test"
2010-10-06T23:26:36.858569+00:00 c013otin01-test clurgmgrd[3519]: <info> Waiting for node #7 to be fenced
2010-10-06T23:27:04.440434+00:00 c013otin01-test fenced[1891]: agent "fence_xvm" reports: Timed out waiting for response
2010-10-06T23:27:04.440548+00:00 c013otin01-test ccsd[1862]: Attempt to close an unopened CCS descriptor (3035370).
2010-10-06T23:27:04.440595+00:00 c013otin01-test ccsd[1862]: Error while processing disconnect: Invalid request descriptor
2010-10-06T23:27:04.440633+00:00 c013otin01-test fenced[1891]: fence "c013otin07-test" failed
2010-10-06T23:27:09.444804+00:00 c013otin01-test fenced[1891]: fencing node "c013otin07-test"
2010-10-06T23:27:41.703023+00:00 c013otin01-test fenced[1891]: agent "fence_xvm" reports: Timed out waiting for response
2010-10-06T23:27:41.703146+00:00 c013otin01-test fenced[1891]: fence "c013otin07-test" failed
2010-10-06T23:27:46.703283+00:00 c013otin01-test fenced[1891]: fencing node "c013otin07-test"
2010-10-06T23:28:19.365666+00:00 c013otin01-test fenced[1891]: agent "fence_xvm" reports: Timed out waiting for response
2010-10-06T23:28:19.365967+00:00 c013otin01-test fenced[1891]: fence "c013otin07-test" failed
2010-10-06T23:28:24.365843+00:00 c013otin01-test fenced[1891]: fencing node "c013otin07-test"
2010-10-06T23:28:56.643939+00:00 c013otin01-test fenced[1891]: agent "fence_xvm" reports: Timed out waiting for response
2010-10-06T23:28:56.644226+00:00 c013otin01-test fenced[1891]: fence "c013otin07-test" failed
2010-10-06T23:29:01.644127+00:00 c013otin01-test fenced[1891]: fencing node "c013otin07-test"
2010-10-06T23:29:34.171420+00:00 c013otin01-test fenced[1891]: agent "fence_xvm" reports: Timed out waiting for response
2010-10-06T23:29:34.171507+00:00 c013otin01-test ccsd[1862]: Attempt to close an unopened CCS descriptor (3035970).
2010-10-06T23:29:34.171524+00:00 c013otin01-test ccsd[1862]: Error while processing disconnect: Invalid request descriptor
2010-10-06T23:29:34.171578+00:00 c013otin01-test fenced[1891]: fence "c013otin07-test" failed
2010-10-06T23:29:39.170656+00:00 c013otin01-test fenced[1891]: fencing node "c013otin07-test"
2010-10-06T23:30:01.418667+00:00 c013otin01-test rsync_policy_files: receiving file list ... done
2010-10-06T23:30:01.418699+00:00 c013otin01-test rsync_policy_files:
2010-10-06T23:30:01.418708+00:00 c013otin01-test rsync_policy_files: sent 30 bytes  received 12 bytes  84.00 bytes/sec
2010-10-06T23:30:01.418716+00:00 c013otin01-test rsync_policy_files: total size is 0  speedup is 0.00
2010-10-06T23:30:08.760903+00:00 c013otin01-test kernel: INFO: task clurgmgrd:25022 blocked for more than 120 seconds.
2010-10-06T23:30:08.760918+00:00 c013otin01-test kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2010-10-06T23:30:08.760923+00:00 c013otin01-test kernel: clurgmgrd     D ffff880001064b60     0 25022   3518         25023 25019 (NOTLB)
2010-10-06T23:30:08.760926+00:00 c013otin01-test kernel: ffff88016437fdb8 0000000000000286  0000000000000000  00000000ee8f8108
2010-10-06T23:30:08.760928+00:00 c013otin01-test kernel: 0000000000000008 ffff88098ed37080  ffff88097c9207a0  00000000000087b3
2010-10-06T23:30:08.760930+00:00 c013otin01-test kernel: ffff88098ed37268 ffffffff8029ed82
2010-10-06T23:30:08.760933+00:00 c013otin01-test kernel: Call Trace:
2010-10-06T23:30:08.760937+00:00 c013otin01-test kernel: [<ffffffff8029ed82>] futex_wake+0x50/0xd4
2010-10-06T23:30:08.760940+00:00 c013otin01-test kernel: [<ffffffff8023fe9c>] do_futex+0x2c2/0xcfb
2010-10-06T23:30:08.760942+00:00 c013otin01-test kernel: [<ffffffff802644cb>] __down_read+0x82/0x9a
2010-10-06T23:30:08.760945+00:00 c013otin01-test kernel: [<ffffffff8830b468>] :dlm:dlm_user_request+0x2d/0x175
"""
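
The retry pattern in the guest log can be pulled out mechanically. What follows is only a minimal Python sketch and was not part of the original report: the function name failed_fence_durations, the regular expression, and the SAMPLE list are illustrative assumptions, and it expects each log entry on a single line (wrapped entries need to be joined first). It pairs each fenced "fencing node" message with the next "fence ... failed" message and reports how long that attempt ran.

```python
import re
from datetime import datetime

# Hypothetical pattern for the syslog layout shown above:
# "<ISO timestamp> <host> fenced[<pid>]: <message>"
FENCED_RE = re.compile(r'^(?P<ts>\S+) \S+ fenced\[\d+\]: (?P<msg>.*)$')


def failed_fence_durations(lines):
    """Return the duration in seconds of every failed fence attempt."""
    durations = []
    started = None  # timestamp of the attempt currently in progress
    for line in lines:
        match = FENCED_RE.match(line.strip())
        if match is None:
            continue  # not a fenced message (ccsd, kernel, and so on)
        stamp = datetime.fromisoformat(match.group('ts'))
        msg = match.group('msg')
        if msg.startswith('fencing node'):
            started = stamp
        elif msg.startswith('fence "') and msg.endswith('failed'):
            if started is not None:
                durations.append((stamp - started).total_seconds())
                started = None
    return durations


# Two attempts copied from the log above, one entry per line.
SAMPLE = [
    '2010-10-06T23:26:31.902608+00:00 c013otin01-test fenced[1891]: fencing node "c013otin07-test"',
    '2010-10-06T23:27:04.440434+00:00 c013otin01-test fenced[1891]: agent "fence_xvm" reports: Timed out waiting for response',
    '2010-10-06T23:27:04.440633+00:00 c013otin01-test fenced[1891]: fence "c013otin07-test" failed',
    '2010-10-06T23:27:09.444804+00:00 c013otin01-test fenced[1891]: fencing node "c013otin07-test"',
    '2010-10-06T23:27:41.703146+00:00 c013otin01-test fenced[1891]: fence "c013otin07-test" failed',
]

if __name__ == '__main__':
    for seconds in failed_fence_durations(SAMPLE):
        print(f'fence attempt failed after {seconds:.1f} s')
```

On the entries above, each attempt runs for roughly 32 seconds before fence_xvm times out, and fenced retries about 5 seconds after each failure.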

Here is what the fence_xvmd log shows on one dom0:

"""
Request to fence: c013otin07-test
Evaluating Domain: c013otin07-test   Last Owner: 7   State 1
Domain                   UUID                                 Owner State
------                   ----                                 ----- -----
c013operations01-test    9654e57b-7bb6-019e-937b-dc009f734a13 00001 00001
c013otin01-test          6fc9063b-5e9f-ef86-5ae2-8faa5fcde84a 00001 00001
c013summary01-test       10432e54-673f-8c61-d08d-591c42adce6e 00001 00002
Domain-0                 00000000-0000-0000-0000-000000000000 00001 00001
Storing c013operations01-test
Storing c013otin01-test
Storing c013summary01-test
Request to fence: c013otin07-test
Evaluating Domain: c013otin07-test   Last Owner: 7   State 1
"""

I did notice that the group_tool state on the dom0 looks wrong:

"""
[root@dom0-01 ~]# group_tool
type             level name       id       state
fence            0     default    00000000 JOIN_STOP_WAIT
[1 2 3 4 5 6 7 8 9 10]
dlm              1     rgmanager  00000000 JOIN_STOP_WAIT
[1 2 3 4 5 6 7 8 9 10]
"""
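
For checking the same thing across several dom0s, the group_tool output above can be parsed with a few lines of Python. This is a minimal sketch and not part of the original report: the function names parse_group_tool and unsettled_groups are made up for illustration, and it assumes that a group which has finished joining reports the state "none", which should be verified against your own cluster version.

```python
def parse_group_tool(text):
    """Parse group_tool output into a list of dicts, one per group."""
    groups = []
    for line in text.splitlines():
        stripped = line.strip()
        fields = stripped.split()
        # A group row has five columns: type, level, name, id, state.
        # The header row is skipped because its "level" is not a number.
        if len(fields) == 5 and fields[1].isdigit():
            groups.append({
                'type': fields[0],
                'level': int(fields[1]),
                'name': fields[2],
                'id': fields[3],
                'state': fields[4],
                'members': [],
            })
        # The bracketed line after a group row lists its member node IDs.
        elif groups and stripped.startswith('[') and stripped.endswith(']'):
            groups[-1]['members'] = [int(n) for n in stripped[1:-1].split()]
    return groups


def unsettled_groups(groups, settled_state='none'):
    """Return groups whose state differs from the assumed settled state."""
    return [g for g in groups if g['state'] != settled_state]


# Output copied from the dom0 above.
SAMPLE = """\
type             level name       id       state
fence            0     default    00000000 JOIN_STOP_WAIT
[1 2 3 4 5 6 7 8 9 10]
dlm              1     rgmanager  00000000 JOIN_STOP_WAIT
[1 2 3 4 5 6 7 8 9 10]
"""

if __name__ == '__main__':
    for group in unsettled_groups(parse_group_tool(SAMPLE)):
        print(f"{group['type']}:{group['name']} is in state {group['state']}")
```

Run against the output above, both the fence and dlm groups are reported as still in JOIN_STOP_WAIT.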

Is the JOIN_STOP_WAIT state in the group_tool output the problem here? If so,
do you know how to fix it without rebooting all the nodes? I tried "fence_tool
leave" and "fence_tool join" on all dom0s, but that didn't resolve the
problem.

Thanks

Joel