[Linux-cluster] info on "A processor failed" message and fencing when going to single user mode

Gianluca Cecchi gianluca.cecchi at gmail.com
Mon Oct 5 10:08:43 UTC 2009


Hello,
I have a two-node cluster (node names virtfed and virtfedbis) running F11
x86_64, up to date as of today and without qdisk:
cman-3.0.2-1.fc11.x86_64
openais-1.0.1-1.fc11.x86_64
corosync-1.0.0-1.fc11.x86_64
and kernel 2.6.30.8-64.fc11.x86_64

I was in a situation where both nodes were up, with virtfedbis just
restarted and in the middle of starting a service.
One of that service's resources contains a loop that tests the availability
of a file, so the service was still in the "starting" state, but the cluster
infrastructure was up, as shown by these messages (a rough sketch of that
loop follows the log excerpt below):

Oct  5 11:44:39 virtfed corosync[4684]:   [CLM   ] CLM CONFIGURATION CHANGE
Oct  5 11:44:39 virtfed corosync[4684]:   [CLM   ] New Configuration:
Oct  5 11:44:39 virtfed corosync[4684]:   [CLM   ] #011r(0) ip(192.168.16.101)
Oct  5 11:44:39 virtfed corosync[4684]:   [CLM   ] Members Left:
Oct  5 11:44:39 virtfed corosync[4684]:   [CLM   ] #011r(0) ip(192.168.16.102)
Oct  5 11:44:39 virtfed corosync[4684]:   [CLM   ] Members Joined:
Oct  5 11:44:39 virtfed corosync[4684]:   [QUORUM] This node is within the primary component and will provide service.
Oct  5 11:44:39 virtfed corosync[4684]:   [QUORUM] Members[1]:
Oct  5 11:44:39 virtfed corosync[4684]:   [QUORUM]     1
Oct  5 11:44:39 virtfed corosync[4684]:   [CLM   ] CLM CONFIGURATION CHANGE
Oct  5 11:44:39 virtfed corosync[4684]:   [CLM   ] New Configuration:
Oct  5 11:44:39 virtfed corosync[4684]:   [CLM   ] #011r(0) ip(192.168.16.101)
Oct  5 11:44:39 virtfed corosync[4684]:   [CLM   ] Members Left:
Oct  5 11:44:39 virtfed corosync[4684]:   [CLM   ] Members Joined:
Oct  5 11:44:39 virtfed corosync[4684]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct  5 11:44:39 virtfed kernel: dlm: closing connection to node 2
Oct  5 11:44:39 virtfed corosync[4684]:   [MAIN  ] Completed service synchronization, ready to provide service.
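
For completeness, the loop inside that resource's start script has roughly
this shape (the file path, sleep interval, and variable names here are only
placeholders for illustration, not my real script):

# wait for a marker file to appear before declaring the resource started
attempts=0
while [ ! -e /path/to/marker_file ]; do
    attempts=$((attempts + 1))
    if [ "$attempts" -ge 10 ]; then
        # after 10 attempts give up: rgmanager sees the start as failed
        exit 1
    fi
    sleep 10
done
exit 0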

So now the two nodes are in this state, as reported by virtfedbis:
[root@virtfedbis ~]# clustat
Cluster Status for kvm @ Mon Oct  5 11:49:27 2009
Member Status: Quorate

 Member Name                  ID   Status
 ------ ----                  ---- ------
 kvm1                            1 Online, rgmanager
 kvm2                            2 Online, Local, rgmanager

 Service Name                 Owner (Last)         State
 ------- ----                 ----- ------         -----
 service:DRBDNODE1            kvm1                 started
 service:DRBDNODE2            kvm2                 starting

I realized I had forgotten something, so after 10 attempts the DRBDNODE2
service would not come up. I therefore decided to put virtfedbis into
single user mode, so I ran on it:

shutdown 0

I would expect virtfedbis to leave the cluster cleanly; instead it is fenced
and rebooted (via the fence_ilo agent).
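
What I expected, roughly, is the equivalent of stopping the cluster services
cleanly before the runlevel change; something like doing this by hand first
(assuming the stock init scripts shipped with the rgmanager and cman
packages):

# stop the resource manager so services are stopped/relocated cleanly,
# then leave the cluster membership before dropping to single user mode
service rgmanager stop
service cman stop
telinit 1

I had assumed the shutdown to single user mode would run the same stop
scripts in that order anyway.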

On virtfed these are the messages:
Oct  5 11:49:49 virtfed corosync[4684]:   [TOTEM ] A processor failed, forming new configuration.
Oct  5 11:49:54 virtfed corosync[4684]:   [CLM   ] CLM CONFIGURATION CHANGE
Oct  5 11:49:54 virtfed corosync[4684]:   [CLM   ] New Configuration:
Oct  5 11:49:54 virtfed corosync[4684]:   [CLM   ] #011r(0) ip(192.168.16.101)
Oct  5 11:49:54 virtfed corosync[4684]:   [CLM   ] Members Left:
Oct  5 11:49:54 virtfed corosync[4684]:   [CLM   ] #011r(0) ip(192.168.16.102)
Oct  5 11:49:54 virtfed corosync[4684]:   [CLM   ] Members Joined:
Oct  5 11:49:54 virtfed corosync[4684]:   [QUORUM] This node is within the primary component and will provide service.
Oct  5 11:49:54 virtfed corosync[4684]:   [QUORUM] Members[1]:
Oct  5 11:49:54 virtfed corosync[4684]:   [QUORUM]     1
Oct  5 11:49:54 virtfed corosync[4684]:   [CLM   ] CLM CONFIGURATION CHANGE
Oct  5 11:49:54 virtfed corosync[4684]:   [CLM   ] New Configuration:
Oct  5 11:49:54 virtfed corosync[4684]:   [CLM   ] #011r(0) ip(192.168.16.101)
Oct  5 11:49:54 virtfed corosync[4684]:   [CLM   ] Members Left:
Oct  5 11:49:54 virtfed corosync[4684]:   [CLM   ] Members Joined:
Oct  5 11:49:54 virtfed corosync[4684]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct  5 11:49:54 virtfed corosync[4684]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct  5 11:49:54 virtfed kernel: dlm: closing connection to node 2
Oct  5 11:49:54 virtfed fenced[4742]: fencing node kvm2
Oct  5 11:49:54 virtfed rgmanager[5496]: State change: kvm2 DOWN
Oct  5 11:50:26 virtfed fenced[4742]: fence kvm2 success

What I find on virtfedbis after the restart, in the /var/log/cluster
directory, is this:

corosync.log
Oct 05 11:49:49 corosync [TOTEM ] A processor failed, forming new configuration.
Oct 05 11:49:49 corosync [TOTEM ] The network interface is down.
Oct 05 11:49:54 corosync [CLM   ] CLM CONFIGURATION CHANGE
Oct 05 11:49:54 corosync [CLM   ] New Configuration:
Oct 05 11:49:54 corosync [CLM   ]       r(0) ip(127.0.0.1)
Oct 05 11:49:54 corosync [CLM   ] Members Left:
Oct 05 11:49:54 corosync [CLM   ]       r(0) ip(192.168.16.102)
Oct 05 11:49:54 corosync [CLM   ] Members Joined:
Oct 05 11:49:54 corosync [QUORUM] This node is within the primary component and will provide service.
Oct 05 11:49:54 corosync [QUORUM] Members[1]:
Oct 05 11:49:54 corosync [QUORUM]     1
Oct 05 11:49:54 corosync [CLM   ] CLM CONFIGURATION CHANGE
Oct 05 11:49:54 corosync [CLM   ] New Configuration:
Oct 05 11:49:54 corosync [CLM   ]       r(0) ip(127.0.0.1)
Oct 05 11:49:54 corosync [CLM   ] Members Left:
Oct 05 11:49:54 corosync [CLM   ] Members Joined:
Oct 05 11:49:54 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 05 11:49:54 corosync [CMAN  ] Killing node kvm2 because it has rejoined the cluster with existing state

I think there is something wrong in this behaviour.
This is a test cluster, so I have no qdisk.
Could the cause be inherent in my config, which has:

<cman expected_votes="1" two_node="1"/>
<fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="20"/>
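
For context, those two lines sit in a cluster.conf that is roughly shaped
like the sketch below (trimmed, with placeholder iLO addresses and
credentials; only the cluster name, node names, and fence agent match my
real setup):

<?xml version="1.0"?>
<cluster name="kvm" config_version="1">
        <cman expected_votes="1" two_node="1"/>
        <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="20"/>
        <clusternodes>
                <clusternode name="kvm1" nodeid="1">
                        <fence>
                                <method name="1">
                                        <device name="ilo1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="kvm2" nodeid="2">
                        <fence>
                                <method name="1">
                                        <device name="ilo2"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <fencedevices>
                <fencedevice name="ilo1" agent="fence_ilo" ipaddr="x.x.x.x" login="..." passwd="..."/>
                <fencedevice name="ilo2" agent="fence_ilo" ipaddr="x.x.x.x" login="..." passwd="..."/>
        </fencedevices>
        <rm>
                <!-- service:DRBDNODE1 and service:DRBDNODE2 defined here -->
        </rm>
</cluster>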

In general, if I do a shutdown -r now on one of the two nodes, I don't have
this kind of problem.

Thanks for any insight,
Gianluca