[Linux-cluster] node is reboot during stop cluster application (oracle) and unable to relocate cluster application between nodes

John Ruemker jruemker at redhat.com
Mon May 11 16:33:14 UTC 2009


On 05/11/2009 10:34 AM, Christopher Chen wrote:
> I hope you're planning to expand to at least a 3-node cluster before you
> go into production. You know two-node clusters are inherently unstable,
> right? I assume you've read the architectural overview of how the cluster
> suite achieves quorum.
>
> A cluster requires (n/2)+1 votes to continue to operate. If you restart
> or otherwise remove a machine from a two-node cluster, you've lost quorum,
> and by definition you've dissolved your cluster while in that state.
>
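The quorum arithmetic above can be sketched as follows (a minimal illustration, assuming one vote per node as in the poster's config):

```python
# Sketch of the quorum rule quoted above: a cluster needs
# floor(n/2) + 1 votes to hold quorum (one vote per node assumed).
def votes_needed(total_votes: int) -> int:
    return total_votes // 2 + 1

def has_quorum(votes_present: int, total_votes: int) -> bool:
    return votes_present >= votes_needed(total_votes)

# Two-node cluster: losing one node drops below (2/2)+1 = 2 votes,
# so quorum is lost; a three-node cluster survives one node loss.
```

For n=2, a single surviving vote is not enough, which is exactly why the two_node special case discussed below exists.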

Unless the special case two_node="1" is in use, and it is here:

        <cman expected_votes="1" two_node="1"/>

This allows the cluster to maintain quorum when only one vote is present. 
Fencing is occurring because the link is dropping.  See below:
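If the cluster is later expanded to three nodes, as Christopher suggests, the two_node special case must be removed, since it is only valid for exactly two nodes. A hedged sketch of what the cman stanza would become (values illustrative):

```xml
<!-- With three voting nodes the two_node special case is dropped:
     quorum is then the normal majority, (3/2)+1 = 2 votes. -->
<cman expected_votes="3"/>
```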

> I'm pretty sure the behavior you are describing is proper.
>
> Time flies like an arrow.
> Fruit flies like a banana.
>
> On May 11, 2009, at 4:08, "Viral .D. Ahire" <CISPLengineer.hz at ril.com
> <mailto:CISPLengineer.hz at ril.com>> wrote:
>
>> Hi,
>>
>> I have configured a two-node cluster on Red Hat 5. The problem is that
>> when I relocate, restart, or stop the running cluster service between
>> the two nodes, the node gets fenced and the server restarts. On the
>> other side, the server that takes over the cluster service leaves the
>> cluster, and its cluster service (cman) stops automatically, so it is
>> also fenced by the other server.
>>
>> I have observed that this problem occurs while stopping the cluster
>> service (oracle).
>>
>> Please help me resolve this problem.
>>
>> The log messages and cluster.conf file are given below.
>> -------------------------
>> /etc/cluster/cluster.conf
>> -------------------------
>> <?xml version="1.0"?>
>> <cluster config_version="59" name="new_cluster">
>> <fence_daemon post_fail_delay="0" post_join_delay="3"/>
>> <clusternodes>
>> <clusternode name="psfhost1" nodeid="1" votes="1">
>> <fence>
>> <method name="1">
>> <device name="cluster1"/>
>> </method>
>> </fence>
>> </clusternode>
>> <clusternode name="psfhost2" nodeid="2" votes="1">
>> <fence>
>> <method name="1">
>> <device name="cluster2"/>
>> </method>
>> </fence>
>> </clusternode>
>> </clusternodes>
>> <cman expected_votes="1" two_node="1"/>
>> <fencedevices>
>> <fencedevice agent="fence_ilo" hostname="ilonode1"
>> login="Administrator" name="cluster1" passwd="9M6X9CAU"/>
>> <fencedevice agent="fence_ilo" hostname="ilonode2"
>> login="Administrator" name="cluster2" passwd="ST69D87V"/>
>> </fencedevices>
>> <rm>
>> <failoverdomains>
>> <failoverdomain name="poy-cluster" ordered="0" restricted="0">
>> <failoverdomainnode name="psfhost1" priority="1"/>
>> <failoverdomainnode name="psfhost2" priority="1"/>
>> </failoverdomain>
>> </failoverdomains>
>> <resources>
>> <ip address="10.2.220.2" monitor_link="1"/>
>> <script file="/etc/init.d/httpd" name="httpd"/>
>> <fs device="/dev/cciss/c1d0p3" force_fsck="0" force_unmount="0"
>> fsid="52427" fstype="ext3" mountpoint="/app" name="app" options=""
>> self_fence="0"/>
>> <fs device="/dev/cciss/c1d0p4" force_fsck="0" force_unmount="0"
>> fsid="39388" fstype="ext3" mountpoint="/opt" name="opt" options=""
>> self_fence="0"/>
>> <fs device="/dev/cciss/c1d0p1" force_fsck="0" force_unmount="0"
>> fsid="62307" fstype="ext3" mountpoint="/data" name="data" options=""
>> self_fence="0"/>
>> <fs device="/dev/cciss/c1d0p2" force_fsck="0" force_unmount="0"
>> fsid="47234" fstype="ext3" mountpoint="/OPERATION" name="OPERATION"
>> options="" self_fence="0"/>
>> <script file="/etc/init.d/orcl" name="Oracle"/>
>> </resources>
>> <service autostart="0" name="oracle" recovery="relocate">
>> <fs ref="app"/>
>> <fs ref="opt"/>
>> <fs ref="data"/>
>> <fs ref="OPERATION"/>
>> <ip ref="10.2.220.2"/>
>> <script ref="Oracle"/>
>> </service>
>> </rm>
>> </cluster>
>>
>> ---------------- -------
>> /var/log/messages
>> -----------------------
>> The following logs were captured while relocating the cluster service
>> (oracle) between nodes.
>>
>> _*Node-1*_
>>
>> May 2 16:17:58 psfhost2 clurgmgrd[3793]: <notice> Starting stopped
>> service service:oracle
>> May 2 16:17:58 psfhost2 kernel: kjournald starting. Commit interval 5
>> seconds
>> May 2 16:17:58 psfhost2 kernel: EXT3-fs warning: maximal mount count
>> reached, running e2fsck is recommended
>> May 2 16:17:58 psfhost2 kernel: EXT3 FS on cciss/c1d0p3, internal journal
>> May 2 16:17:58 psfhost2 kernel: EXT3-fs: mounted filesystem with
>> ordered data mode.
>> May 2 16:17:58 psfhost2 kernel: kjournald starting. Commit interval 5
>> seconds
>> May 2 16:17:58 psfhost2 kernel: EXT3-fs warning: maximal mount count
>> reached, running e2fsck is recommended
>> May 2 16:17:58 psfhost2 kernel: EXT3 FS on cciss/c1d0p4, internal journal
>> May 2 16:17:58 psfhost2 kernel: EXT3-fs: mounted filesystem with
>> ordered data mode.
>> May 2 16:17:58 psfhost2 kernel: kjournald starting. Commit interval 5
>> seconds
>> May 2 16:17:58 psfhost2 kernel: EXT3-fs warning: maximal mount count
>> reached, running e2fsck is recommended
>> May 2 16:17:58 psfhost2 kernel: EXT3 FS on cciss/c1d0p1, internal journal
>> May 2 16:17:58 psfhost2 kernel: EXT3-fs: mounted filesystem with
>> ordered data mode.
>> May 2 16:17:59 psfhost2 kernel: kjournald starting. Commit interval 5
>> seconds
>> May 2 16:17:59 psfhost2 kernel: EXT3-fs warning: maximal mount count
>> reached, running e2fsck is recommended
>> May 2 16:17:59 psfhost2 kernel: EXT3 FS on cciss/c1d0p2, internal journal
>> May 2 16:17:59 psfhost2 kernel: EXT3-fs: mounted filesystem with
>> ordered data mode.
>> May 2 16:17:59 psfhost2 avahi-daemon[3661]: Registering new address
>> record for 10.2.220.2 on eth0.
>> May 2 16:18:00 psfhost2 in.rdiscd[5945]: setsockopt
>> (IP_ADD_MEMBERSHIP): Address already in use
>> May 2 16:18:00 psfhost2 in.rdiscd[5945]: Failed joining addresses
>> May 2 16:18:11 psfhost2 clurgmgrd[3793]: <notice> Service
>> service:oracle started
>> May 2 16:19:17 psfhost2 kernel: bnx2: eth1 NIC Link is Down

^^^^^
The cluster interconnect link went down, and thus this node could no 
longer communicate with the other node.
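The real fix is the flapping NIC or switch port visible in the bnx2 messages, but short interconnect drops can sometimes be ridden out by raising the totem token timeout in cluster.conf. A hedged sketch of that knob (the value shown is illustrative, not a recommendation):

```xml
<!-- Raise the totem token timeout (in milliseconds) so a brief link
     flap does not immediately partition the cluster. -->
<totem token="21500"/>
```

Note that a longer token timeout also delays legitimate failure detection, so it only masks, not cures, an unreliable link.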


>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] entering GATHER state
>> from 11.
>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] Saving state aru 1b
>> high seq received 1b
>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] Storing new sequence id
>> for ring 90
>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] entering COMMIT state.
>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] entering RECOVERY state.
>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] position [0] member
>> 10.2.220.6:
>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] previous ring seq 140
>> rep 10.2.220.6
>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] aru 9 high delivered 9
>> received flag 1
>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] position [1] member
>> 10.2.220.7:
>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] previous ring seq 136
>> rep 10.2.220.7
>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] aru 1b high delivered
>> 1b received flag 1
>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] Did not need to
>> originate any messages in recovery.
>> May 2 16:19:26 psfhost2 openais[3275]: [CLM ] CLM CONFIGURATION CHANGE
>> May 2 16:19:26 psfhost2 openais[3275]: [CLM ] New Configuration:
>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] r(0) ip(10.2.220.7)
>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] Members Left:
>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] Members Joined:
>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] CLM CONFIGURATION CHANGE
>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] New Configuration:
>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] r(0) ip(10.2.220.6)
>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] r(0) ip(10.2.220.7)
>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] Members Left:
>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] Members Joined:
>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] r(0) ip(10.2.220.6)
>> May 2 16:19:27 psfhost2 openais[3275]: [SYNC ] This node is within the
>> primary component and will provide service.
>> May 2 16:19:27 psfhost2 openais[3275]: [TOTEM] entering OPERATIONAL
>> state.
>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] got nodejoin message
>> 10.2.220.6
>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] got nodejoin message
>> 10.2.220.7
>> May 2 16:19:27 psfhost2 openais[3275]: [CPG ] got joinlist message
>> from node 2
>> May 2 16:19:29 psfhost2 kernel: bnx2: eth1 NIC Link is Up, 1000 Mbps
>> full duplex, receive & transmit flow control ON
>> May 2 16:19:31 psfhost2 kernel: bnx2: eth1 NIC Link is Down
>> May 2 16:19:35 psfhost2 kernel: bnx2: eth1 NIC Link is Up, 100 Mbps
>> full duplex, receive & transmit flow control ON
>> May 2 16:19:42 psfhost2 kernel: dlm: connecting to 1
>> May 2 16:20:36 psfhost2 ccsd[3265]: Update of cluster.conf complete
>> (version 57 -> 59).
>> May 2 16:20:43 psfhost2 clurgmgrd[3793]: <notice> Reconfiguring
>> May 2 16:21:15 psfhost2 clurgmgrd[3793]: <notice> Stopping service
>> service:oracle
>> May 2 16:21:25 psfhost2 avahi-daemon[3661]: Withdrawing address record
>> for 10.2.220.7 on eth0.
>> May 2 16:21:25 psfhost2 avahi-daemon[3661]: Leaving mDNS multicast
>> group on interface eth0.IPv4 with address 10.2.220.7.
>> May 2 16:21:25 psfhost2 avahi-daemon[3661]: Joining mDNS multicast
>> group on interface eth0.IPv4 with address 10.2.220.2.
>> May 2 16:21:25 psfhost2 clurgmgrd: [3793]: <err> Failed to remove
>> 10.2.220.2
>> May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] entering RECOVERY state.
>> May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] position [0] member
>> 127.0.0.1:
>> May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] previous ring seq 144
>> rep 10.2.220.6
>> May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] aru 31 high delivered
>> 31 received flag 1
>> May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] Did not need to
>> originate any messages in recovery.
>> May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] Sending initial ORF token
>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ] CLM CONFIGURATION CHANGE
>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ] New Configuration:
>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ] r(0) ip(127.0.0.1)
>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ] Members Left:
>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ] r(0) ip(10.2.220.7)
>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ] Members Joined:
>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ] CLM CONFIGURATION CHANGE
>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ] New Configuration:
>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ] r(0) ip(127.0.0.1)
>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ] Members Left:
>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ] Members Joined:
>> May 2 16:21:40 psfhost2 openais[3275]: [SYNC ] This node is within the
>> primary component and will provide service.
>> May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] entering OPERATIONAL state.
>> May 2 16:21:40 psfhost2 openais[3275]: [MAIN ] Killing node psfhost2
>> because it has rejoined the cluster without cman_tool join
>> May 2 16:21:40 psfhost2 openais[3275]: [CMAN ] cman killed by node 2
>> because we rejoined the cluster without a full restart
>> May 2 16:21:40 psfhost2 fenced[3291]: cman_get_nodes error -1 104
>> May 2 16:21:40 psfhost2 kernel: clurgmgrd[3793]: segfault at
>> 0000000000000000 rip 0000000000408c4a rsp 00007fff3c4a9e20 error 4
>> May 2 16:21:40 psfhost2 fenced[3291]: cluster is down, exiting
>> May 2 16:21:40 psfhost2 groupd[3283]: cman_get_nodes error -1 104
>> May 2 16:21:40 psfhost2 dlm_controld[3297]: cluster is down, exiting
>> May 2 16:21:40 psfhost2 gfs_controld[3303]: cluster is down, exiting
>> May 2 16:21:40 psfhost2 clurgmgrd[3792]: <crit> Watchdog: Daemon died,
>> rebooting...
>> May 2 16:21:40 psfhost2 kernel: dlm: closing connection to node 1
>> May 2 16:21:40 psfhost2 kernel: dlm: closing connection to node 2
>> May 2 16:21:40 psfhost2 kernel: md: stopping all md devices.
>> May 2 16:21:41 psfhost2 kernel: uhci_hcd 0000:01:04.4: HCRESET not
>> completed yet!
>> May 2 16:24:55 psfhost2 syslogd 1.4.1: restart.
>> May 2 16:24:55 psfhost2 kernel: klogd 1.4.1, log source = /proc/kmsg
>> started.
>> May 2 16:24:55 psfhost2 kernel: Linux version 2.6.18-53.el5
>> (brewbuilder at hs20-bc1-7.build.redhat.com
>> <mailto:brewbuilder at hs20-bc1-7.build.redhat.com>) (gcc version 4.1.2
>> 20070626 (Red Hat 4.1.2-14)) #1 SMP Wed Oct
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
