[Linux-cluster] node is reboot during stop cluster application (oracle) and unable to relocate cluster application between nodes

Christopher Chen muffaleta at gmail.com
Mon May 11 17:21:20 UTC 2009


On Mon, May 11, 2009 at 9:33 AM, John Ruemker <jruemker at redhat.com> wrote:
> On 05/11/2009 10:34 AM, Christopher Chen wrote:
>>
>> I hope you're planning to expand to at least a 3-node cluster before
>> you go into production. You know two-node clusters are inherently
>> unstable, right? I assume you've read the architectural overview of how
>> the cluster suite achieves quorum.
>>
>> A cluster requires (n/2)+1 votes to continue to operate. If you restart
>> or otherwise remove a machine from a two-node cluster, you've lost
>> quorum, and by definition you've dissolved your cluster while you're in
>> that state.
>>
>
> Unless the special case two_node="1" is in use, and it is here:
>
>       <cman expected_votes="1" two_node="1"/>
>
> This allows the cluster to maintain quorum when only one vote is present.
> Fencing is occurring because the link is dropping. See below:

I understand that that's an option, but how safe is it? Two node
clusters scare me.
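That said, if the brief eth1 flaps in the logs below are what is triggering the fence, one knob worth a look is the totem token timeout, so a short outage doesn't get the node declared dead immediately. This is only a sketch on my part, not something from your logs, and 21000 ms is an illustrative value rather than a tuned recommendation:

```xml
<!-- In cluster.conf, directly inside <cluster>; 21000 ms is illustrative -->
<totem token="21000"/>
```

Bump config_version and push the change out with `ccs_tool update /etc/cluster/cluster.conf` as usual. Fixing the flapping link itself is still the real cure.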
>
>> I'm pretty sure the behavior you are describing is proper.
>>
>> Time flies like an arrow.
>> Fruit flies like a banana.
>>
>> On May 11, 2009, at 4:08, "Viral .D. Ahire" <CISPLengineer.hz at ril.com
>> <mailto:CISPLengineer.hz at ril.com>> wrote:
>>
>>> Hi,
>>>
>>> I have configured a two-node cluster on Red Hat 5. The problem is that
>>> when I relocate, restart, or stop the running cluster service between
>>> the two nodes, the node gets fenced and the server restarts. On the
>>> other side, the server that takes over the cluster service leaves the
>>> cluster and its cluster service (cman) stops automatically, so it is
>>> also fenced by the other server.
>>>
>>> I observed that this problem occurs while stopping the cluster
>>> service (oracle).
>>>
>>> Please help me to resolve this problem.
>>>
>>> The log messages and cluster.conf file are given below.
>>> -------------------------
>>> /etc/cluster/cluster.conf
>>> -------------------------
>>> <?xml version="1.0"?>
>>> <cluster config_version="59" name="new_cluster">
>>>   <fence_daemon post_fail_delay="0" post_join_delay="3"/>
>>>   <clusternodes>
>>>     <clusternode name="psfhost1" nodeid="1" votes="1">
>>>       <fence>
>>>         <method name="1">
>>>           <device name="cluster1"/>
>>>         </method>
>>>       </fence>
>>>     </clusternode>
>>>     <clusternode name="psfhost2" nodeid="2" votes="1">
>>>       <fence>
>>>         <method name="1">
>>>           <device name="cluster2"/>
>>>         </method>
>>>       </fence>
>>>     </clusternode>
>>>   </clusternodes>
>>>   <cman expected_votes="1" two_node="1"/>
>>>   <fencedevices>
>>>     <fencedevice agent="fence_ilo" hostname="ilonode1"
>>>       login="Administrator" name="cluster1" passwd="9M6X9CAU"/>
>>>     <fencedevice agent="fence_ilo" hostname="ilonode2"
>>>       login="Administrator" name="cluster2" passwd="ST69D87V"/>
>>>   </fencedevices>
>>>   <rm>
>>>     <failoverdomains>
>>>       <failoverdomain name="poy-cluster" ordered="0" restricted="0">
>>>         <failoverdomainnode name="psfhost1" priority="1"/>
>>>         <failoverdomainnode name="psfhost2" priority="1"/>
>>>       </failoverdomain>
>>>     </failoverdomains>
>>>     <resources>
>>>       <ip address="10.2.220.2" monitor_link="1"/>
>>>       <script file="/etc/init.d/httpd" name="httpd"/>
>>>       <fs device="/dev/cciss/c1d0p3" force_fsck="0" force_unmount="0"
>>>         fsid="52427" fstype="ext3" mountpoint="/app" name="app"
>>>         options="" self_fence="0"/>
>>>       <fs device="/dev/cciss/c1d0p4" force_fsck="0" force_unmount="0"
>>>         fsid="39388" fstype="ext3" mountpoint="/opt" name="opt"
>>>         options="" self_fence="0"/>
>>>       <fs device="/dev/cciss/c1d0p1" force_fsck="0" force_unmount="0"
>>>         fsid="62307" fstype="ext3" mountpoint="/data" name="data"
>>>         options="" self_fence="0"/>
>>>       <fs device="/dev/cciss/c1d0p2" force_fsck="0" force_unmount="0"
>>>         fsid="47234" fstype="ext3" mountpoint="/OPERATION"
>>>         name="OPERATION" options="" self_fence="0"/>
>>>       <script file="/etc/init.d/orcl" name="Oracle"/>
>>>     </resources>
>>>     <service autostart="0" name="oracle" recovery="relocate">
>>>       <fs ref="app"/>
>>>       <fs ref="opt"/>
>>>       <fs ref="data"/>
>>>       <fs ref="OPERATION"/>
>>>       <ip ref="10.2.220.2"/>
>>>       <script ref="Oracle"/>
>>>     </service>
>>>   </rm>
>>> </cluster>
>>>
>>> -----------------------
>>> /var/log/messages
>>> -----------------------
>>> The following logs were captured while relocating the cluster service
>>> (oracle) between the nodes.
>>>
>>> _*Node-1*_
>>>
>>> May 2 16:17:58 psfhost2 clurgmgrd[3793]: <notice> Starting stopped
>>> service service:oracle
>>> May 2 16:17:58 psfhost2 kernel: kjournald starting. Commit interval 5
>>> seconds
>>> May 2 16:17:58 psfhost2 kernel: EXT3-fs warning: maximal mount count
>>> reached, running e2fsck is recommended
>>> May 2 16:17:58 psfhost2 kernel: EXT3 FS on cciss/c1d0p3, internal journal
>>> May 2 16:17:58 psfhost2 kernel: EXT3-fs: mounted filesystem with
>>> ordered data mode.
>>> May 2 16:17:58 psfhost2 kernel: kjournald starting. Commit interval 5
>>> seconds
>>> May 2 16:17:58 psfhost2 kernel: EXT3-fs warning: maximal mount count
>>> reached, running e2fsck is recommended
>>> May 2 16:17:58 psfhost2 kernel: EXT3 FS on cciss/c1d0p4, internal journal
>>> May 2 16:17:58 psfhost2 kernel: EXT3-fs: mounted filesystem with
>>> ordered data mode.
>>> May 2 16:17:58 psfhost2 kernel: kjournald starting. Commit interval 5
>>> seconds
>>> May 2 16:17:58 psfhost2 kernel: EXT3-fs warning: maximal mount count
>>> reached, running e2fsck is recommended
>>> May 2 16:17:58 psfhost2 kernel: EXT3 FS on cciss/c1d0p1, internal journal
>>> May 2 16:17:58 psfhost2 kernel: EXT3-fs: mounted filesystem with
>>> ordered data mode.
>>> May 2 16:17:59 psfhost2 kernel: kjournald starting. Commit interval 5
>>> seconds
>>> May 2 16:17:59 psfhost2 kernel: EXT3-fs warning: maximal mount count
>>> reached, running e2fsck is recommended
>>> May 2 16:17:59 psfhost2 kernel: EXT3 FS on cciss/c1d0p2, internal journal
>>> May 2 16:17:59 psfhost2 kernel: EXT3-fs: mounted filesystem with
>>> ordered data mode.
>>> May 2 16:17:59 psfhost2 avahi-daemon[3661]: Registering new address
>>> record for 10.2.220.2 on eth0.
>>> May 2 16:18:00 psfhost2 in.rdiscd[5945]: setsockopt
>>> (IP_ADD_MEMBERSHIP): Address already in use
>>> May 2 16:18:00 psfhost2 in.rdiscd[5945]: Failed joining addresses
>>> May 2 16:18:11 psfhost2 clurgmgrd[3793]: <notice> Service
>>> service:oracle started
>>> May 2 16:19:17 psfhost2 kernel: bnx2: eth1 NIC Link is Down
>
> ^^^^^
> The cluster interconnect link went down, and thus this node could no longer
> communicate with the other node.
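Seconding this: if eth1 is the dedicated interconnect, bonding it with a second NIC is the usual guard against a single flapping link taking the node out. A minimal RHEL-5-style sketch; the device names, IP, and bonding mode here are my assumptions, adjust to your hardware:

```shell
# /etc/sysconfig/network-scripts/ifcfg-bond0 (illustrative)
DEVICE=bond0
IPADDR=10.2.220.7
NETMASK=255.255.255.0
ONBOOT=yes
BONDING_OPTS="mode=active-backup miimon=100"

# /etc/sysconfig/network-scripts/ifcfg-eth1 (repeat for the second slave NIC)
DEVICE=eth1
MASTER=bond0
SLAVE=yes
ONBOOT=yes
```

active-backup (mode 1) is the safe choice for a cluster interconnect since it needs no switch-side configuration.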
>
>
>>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] entering GATHER state
>>> from 11.
>>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] Saving state aru 1b
>>> high seq received 1b
>>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] Storing new sequence id
>>> for ring 90
>>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] entering COMMIT state.
>>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] entering RECOVERY state.
>>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] position [0] member
>>> 10.2.220.6:
>>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] previous ring seq 140
>>> rep 10.2.220.6
>>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] aru 9 high delivered 9
>>> received flag 1
>>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] position [1] member
>>> 10.2.220.7:
>>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] previous ring seq 136
>>> rep 10.2.220.7
>>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] aru 1b high delivered
>>> 1b received flag 1
>>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] Did not need to
>>> originate any messages in recovery.
>>> May 2 16:19:26 psfhost2 openais[3275]: [CLM ] CLM CONFIGURATION CHANGE
>>> May 2 16:19:26 psfhost2 openais[3275]: [CLM ] New Configuration:
>>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] r(0) ip(10.2.220.7)
>>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] Members Left:
>>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] Members Joined:
>>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] CLM CONFIGURATION CHANGE
>>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] New Configuration:
>>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] r(0) ip(10.2.220.6)
>>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] r(0) ip(10.2.220.7)
>>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] Members Left:
>>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] Members Joined:
>>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] r(0) ip(10.2.220.6)
>>> May 2 16:19:27 psfhost2 openais[3275]: [SYNC ] This node is within the
>>> primary component and will provide service.
>>> May 2 16:19:27 psfhost2 openais[3275]: [TOTEM] entering OPERATIONAL
>>> state.
>>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] got nodejoin message
>>> 10.2.220.6
>>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] got nodejoin message
>>> 10.2.220.7
>>> May 2 16:19:27 psfhost2 openais[3275]: [CPG ] got joinlist message
>>> from node 2
>>> May 2 16:19:29 psfhost2 kernel: bnx2: eth1 NIC Link is Up, 1000 Mbps
>>> full duplex, receive & transmit flow control ON
>>> May 2 16:19:31 psfhost2 kernel: bnx2: eth1 NIC Link is Down
>>> May 2 16:19:35 psfhost2 kernel: bnx2: eth1 NIC Link is Up, 100 Mbps
>>> full duplex, receive & transmit flow control ON
>>> May 2 16:19:42 psfhost2 kernel: dlm: connecting to 1
>>> May 2 16:20:36 psfhost2 ccsd[3265]: Update of cluster.conf complete
>>> (version 57 -> 59).
>>> May 2 16:20:43 psfhost2 clurgmgrd[3793]: <notice> Reconfiguring
>>> May 2 16:21:15 psfhost2 clurgmgrd[3793]: <notice> Stopping service
>>> service:oracle
>>> May 2 16:21:25 psfhost2 avahi-daemon[3661]: Withdrawing address record
>>> for 10.2.220.7 on eth0.
>>> May 2 16:21:25 psfhost2 avahi-daemon[3661]: Leaving mDNS multicast
>>> group on interface eth0.IPv4 with address 10.2.220.7.
>>> May 2 16:21:25 psfhost2 avahi-daemon[3661]: Joining mDNS multicast
>>> group on interface eth0.IPv4 with address 10.2.220.2.
>>> May 2 16:21:25 psfhost2 clurgmgrd: [3793]: <err> Failed to remove
>>> 10.2.220.2
>>> May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] entering RECOVERY state.
>>> May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] position [0] member
>>> 127.0.0.1:
>>> May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] previous ring seq 144
>>> rep 10.2.220.6
>>> May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] aru 31 high delivered
>>> 31 received flag 1
>>> May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] Did not need to
>>> originate any messages in recovery.
>>> May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] Sending initial ORF token
>>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ] CLM CONFIGURATION CHANGE
>>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ] New Configuration:
>>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ] r(0) ip(127.0.0.1)
>>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ] Members Left:
>>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ] r(0) ip(10.2.220.7)
>>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ] Members Joined:
>>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ] CLM CONFIGURATION CHANGE
>>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ] New Configuration:
>>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ] r(0) ip(127.0.0.1)
>>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ] Members Left:
>>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ] Members Joined:
>>> May 2 16:21:40 psfhost2 openais[3275]: [SYNC ] This node is within the
>>> primary component and will provide service.
>>> May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] entering OPERATIONAL
>>> state.
>>> May 2 16:21:40 psfhost2 openais[3275]: [MAIN ] Killing node psfhost2
>>> because it has rejoined the cluster without cman_tool join
>>> May 2 16:21:40 psfhost2 openais[3275]: [CMAN ] cman killed by node 2
>>> because we rejoined the cluster without a full restart
>>> May 2 16:21:40 psfhost2 fenced[3291]: cman_get_nodes error -1 104
>>> May 2 16:21:40 psfhost2 kernel: clurgmgrd[3793]: segfault at
>>> 0000000000000000 rip 0000000000408c4a rsp 00007fff3c4a9e20 error 4
>>> May 2 16:21:40 psfhost2 fenced[3291]: cluster is down, exiting
>>> May 2 16:21:40 psfhost2 groupd[3283]: cman_get_nodes error -1 104
>>> May 2 16:21:40 psfhost2 dlm_controld[3297]: cluster is down, exiting
>>> May 2 16:21:40 psfhost2 gfs_controld[3303]: cluster is down, exiting
>>> May 2 16:21:40 psfhost2 clurgmgrd[3792]: <crit> Watchdog: Daemon died,
>>> rebooting...
>>> May 2 16:21:40 psfhost2 kernel: dlm: closing connection to node 1
>>> May 2 16:21:40 psfhost2 kernel: dlm: closing connection to node 2
>>> May 2 16:21:40 psfhost2 kernel: md: stopping all md devices.
>>> May 2 16:21:41 psfhost2 kernel: uhci_hcd 0000:01:04.4: HCRESET not
>>> completed yet!
>>> May 2 16:24:55 psfhost2 syslogd 1.4.1: restart.
>>> May 2 16:24:55 psfhost2 kernel: klogd 1.4.1, log source = /proc/kmsg
>>> started.
>>> May 2 16:24:55 psfhost2 kernel: Linux version 2.6.18-53.el5
>>> (brewbuilder at hs20-bc1-7.build.redhat.com
>>> <mailto:brewbuilder at hs20-bc1-7.build.redhat.com>) (gcc version 4.1.2
>>> 20070626 (Red Hat 4.1.2-14)) #1 SMP Wed Oct
>>>
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
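For what it's worth, here is the quorum arithmetic I was alluding to, as a quick sketch of the formula (my own illustration, not anything from the cluster suite itself):

```python
# Quorum rule: a cluster stays quorate with floor(n/2) + 1 votes,
# assuming one vote per node as in the cluster.conf above.
def quorum(total_votes: int) -> int:
    """Minimum votes required to keep quorum."""
    return total_votes // 2 + 1

# In a two-node cluster, losing one node leaves 1 vote, which is below
# quorum(2) == 2 -- hence the two_node="1" special case. With three
# nodes, one can drop out and the remaining 2 >= quorum(3) survive.
for n in (2, 3, 5):
    print(f"{n} nodes: need {quorum(n)} votes")
```

Which is exactly why a third node (or a quorum disk) makes these clusters so much less jumpy.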



-- 
Chris Chen <muffaleta at gmail.com>
"I want the kind of six pack you can't drink."
-- Micah



