[Linux-cluster] node is reboot during stop cluster application (oracle) and unable to relocate cluster application between nodes

Viral .D. Ahire CISPLengineer.hz at ril.com
Tue May 12 05:27:02 UTC 2009


Hi John,

Thanks for the reply.

As you said, the logs show the node getting fenced (rebooted) because the 
link is down, but there is no problem with the network connectivity. 
Actually, when I relocate the cluster application (oracle) from one node 
to the other, the node running the application reboots. Both nodes are 
connected with a crossover cable for the heartbeat link, which is why the 
logs show the link going down.
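To double-check whether these are real physical flaps or just the peer going down, the syslog can be scanned for the NIC link transitions. A minimal sketch in Python (the pattern is written against the bnx2 kernel lines quoted later in this thread; adjust it for other drivers):

```python
import re

# Matches kernel messages like:
#   May 2 16:19:17 psfhost2 kernel: bnx2: eth1 NIC Link is Down
LINK_RE = re.compile(
    r"^(?P<stamp>\w+\s+\d+ [\d:]+) (?P<host>\S+) kernel: "
    r"\S+: (?P<iface>\S+) NIC Link is (?P<state>Up|Down)"
)

def link_events(lines):
    """Yield (timestamp, host, interface, state) for each link change."""
    for line in lines:
        m = LINK_RE.match(line)
        if m:
            yield (m.group("stamp"), m.group("host"),
                   m.group("iface"), m.group("state"))

# Sample lines taken from the log below.
log = [
    "May 2 16:19:17 psfhost2 kernel: bnx2: eth1 NIC Link is Down",
    "May 2 16:19:29 psfhost2 kernel: bnx2: eth1 NIC Link is Up, 1000 Mbps",
]
for stamp, host, iface, state in link_events(log):
    print(f"{stamp} {host} {iface} link {state}")
```

Feeding it the whole of /var/log/messages makes it easy to see how often, and how close to fencing events, the interconnect actually dropped.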

Regards,
Viral Ahire

------------------------------------------------------------

    * From: John Ruemker <jruemker redhat com>
    * To: linux clustering <linux-cluster redhat com>
    * Subject: Re: [Linux-cluster] node is reboot during stop cluster
      application (oracle) and unable to relocate cluster application
      between nodes
    * Date: Mon, 11 May 2009 12:33:14 -0400

------------------------------------------------------------------------

On 05/11/2009 10:34 AM, Christopher Chen wrote:

    I hope you're planning to expand to at least a 3-node cluster before
    you go into production. You know two-node clusters are inherently
    unstable, right? I assume you've read the architectural overview of
    how the cluster suite achieves quorum.

    A cluster requires (n/2)+1 votes to continue to operate. If you
    restart or otherwise remove a machine from a two-node cluster, you've
    lost quorum, and by definition you've dissolved your cluster while
    you're in that state.
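The quorum arithmetic being described is just floor(n/2)+1; sketched in Python for illustration (this is not code from the cluster suite):

```python
def quorum_votes(total_votes: int) -> int:
    """Votes required for quorum: floor(n/2) + 1."""
    return total_votes // 2 + 1

# Two nodes at one vote each: quorum is 2, so losing either node
# dissolves the cluster -- unless two_node="1" is set, which lets a
# single surviving vote keep quorum, with fencing as the tiebreaker.
print(quorum_votes(2))  # 2
print(quorum_votes(3))  # 2
```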

      

Unless the special case two_node="1" is in use, and it is here:

       <cman expected_votes="1" two_node="1"/>

This allows for maintaining quorum when only one vote is present. 
Fencing is occurring because the link is dropping. See below:

    I'm pretty sure the behavior you are describing is proper.

    Time flies like an arrow.
    Fruit flies like a banana.

    On May 11, 2009, at 4:08, "Viral .D. Ahire" <CISPLengineer hz ril com>
    wrote:

      

        Hi,

        I have configured a two-node cluster on Red Hat 5. The problem is
        that when I relocate, restart, or stop the running cluster service
        between the two nodes, the node gets fenced and the server
        restarts. On the other side, the server that takes over the
        cluster service leaves the cluster and its cluster service (cman)
        stops automatically, so it is also fenced by the other server.

        I observed that this problem occurs while stopping the cluster
        service (oracle).

        Please help me to resolve this problem.

        The log messages and cluster.conf file are given below.
        -------------------------
        /etc/cluster/cluster.conf
        -------------------------
        <?xml version="1.0"?>
        <cluster config_version="59" name="new_cluster">
          <fence_daemon post_fail_delay="0" post_join_delay="3"/>
          <clusternodes>
            <clusternode name="psfhost1" nodeid="1" votes="1">
              <fence>
                <method name="1">
                  <device name="cluster1"/>
                </method>
              </fence>
            </clusternode>
            <clusternode name="psfhost2" nodeid="2" votes="1">
              <fence>
                <method name="1">
                  <device name="cluster2"/>
                </method>
              </fence>
            </clusternode>
          </clusternodes>
          <cman expected_votes="1" two_node="1"/>
          <fencedevices>
            <fencedevice agent="fence_ilo" hostname="ilonode1"
                         login="Administrator" name="cluster1" passwd="9M6X9CAU"/>
            <fencedevice agent="fence_ilo" hostname="ilonode2"
                         login="Administrator" name="cluster2" passwd="ST69D87V"/>
          </fencedevices>
          <rm>
            <failoverdomains>
              <failoverdomain name="poy-cluster" ordered="0" restricted="0">
                <failoverdomainnode name="psfhost1" priority="1"/>
                <failoverdomainnode name="psfhost2" priority="1"/>
              </failoverdomain>
            </failoverdomains>
            <resources>
              <ip address="10.2.220.2" monitor_link="1"/>
              <script file="/etc/init.d/httpd" name="httpd"/>
              <fs device="/dev/cciss/c1d0p3" force_fsck="0" force_unmount="0"
                  fsid="52427" fstype="ext3" mountpoint="/app" name="app"
                  options="" self_fence="0"/>
              <fs device="/dev/cciss/c1d0p4" force_fsck="0" force_unmount="0"
                  fsid="39388" fstype="ext3" mountpoint="/opt" name="opt"
                  options="" self_fence="0"/>
              <fs device="/dev/cciss/c1d0p1" force_fsck="0" force_unmount="0"
                  fsid="62307" fstype="ext3" mountpoint="/data" name="data"
                  options="" self_fence="0"/>
              <fs device="/dev/cciss/c1d0p2" force_fsck="0" force_unmount="0"
                  fsid="47234" fstype="ext3" mountpoint="/OPERATION"
                  name="OPERATION" options="" self_fence="0"/>
              <script file="/etc/init.d/orcl" name="Oracle"/>
            </resources>
            <service autostart="0" name="oracle" recovery="relocate">
              <fs ref="app"/>
              <fs ref="opt"/>
              <fs ref="data"/>
              <fs ref="OPERATION"/>
              <ip ref="10.2.220.2"/>
              <script ref="Oracle"/>
            </service>
          </rm>
        </cluster>







        -----------------------
        /var/log/messages
        -----------------------
        The following logs were captured while relocating the cluster
        service (oracle) between the nodes.

        Node-1

        May 2 16:17:58 psfhost2 clurgmgrd[3793]: <notice> Starting stopped service
        service:oracle
        May 2 16:17:58 psfhost2 kernel: kjournald starting. Commit interval 5
        seconds
        May 2 16:17:58 psfhost2 kernel: EXT3-fs warning: maximal mount count
        reached, running e2fsck is recommended
        May 2 16:17:58 psfhost2 kernel: EXT3 FS on cciss/c1d0p3, internal journal
        May 2 16:17:58 psfhost2 kernel: EXT3-fs: mounted filesystem with
        ordered data mode.
        May 2 16:17:58 psfhost2 kernel: kjournald starting. Commit interval 5
        seconds
        May 2 16:17:58 psfhost2 kernel: EXT3-fs warning: maximal mount count
        reached, running e2fsck is recommended
        May 2 16:17:58 psfhost2 kernel: EXT3 FS on cciss/c1d0p4, internal journal
        May 2 16:17:58 psfhost2 kernel: EXT3-fs: mounted filesystem with
        ordered data mode.
        May 2 16:17:58 psfhost2 kernel: kjournald starting. Commit interval 5
        seconds
        May 2 16:17:58 psfhost2 kernel: EXT3-fs warning: maximal mount count
        reached, running e2fsck is recommended
        May 2 16:17:58 psfhost2 kernel: EXT3 FS on cciss/c1d0p1, internal journal
        May 2 16:17:58 psfhost2 kernel: EXT3-fs: mounted filesystem with
        ordered data mode.
        May 2 16:17:59 psfhost2 kernel: kjournald starting. Commit interval 5
        seconds
        May 2 16:17:59 psfhost2 kernel: EXT3-fs warning: maximal mount count
        reached, running e2fsck is recommended
        May 2 16:17:59 psfhost2 kernel: EXT3 FS on cciss/c1d0p2, internal journal
        May 2 16:17:59 psfhost2 kernel: EXT3-fs: mounted filesystem with
        ordered data mode.
        May 2 16:17:59 psfhost2 avahi-daemon[3661]: Registering new address
        record for 10.2.220.2 on eth0.
        May 2 16:18:00 psfhost2 in.rdiscd[5945]: setsockopt
        (IP_ADD_MEMBERSHIP): Address already in use
        May 2 16:18:00 psfhost2 in.rdiscd[5945]: Failed joining addresses
        May 2 16:18:11 psfhost2 clurgmgrd[3793]: <notice> Service
        service:oracle started
        May 2 16:19:17 psfhost2 kernel: bnx2: eth1 NIC Link is Down
            

^^^^^

The cluster interconnect link went down, and thus this node could no 
longer communicate with the other node.
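If brief flaps like this are expected on a direct crossover link, one mitigation sometimes applied (a suggestion on my part, not something confirmed in this thread) is to raise the totem token-loss timeout in cluster.conf, so that a momentary drop does not immediately partition the membership and trigger fencing:

```xml
<!-- Illustrative fragment only; the value is in milliseconds and must be
     tuned for the environment. This goes directly under <cluster>. -->
<totem token="21000"/>
```

This only papers over short drops; a persistently flapping interconnect still needs to be fixed at the cable/NIC level.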

        May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] entering GATHER state
        from 11.
        May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] Saving state aru 1b
        high seq received 1b
        May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] Storing new sequence id
        for ring 90
        May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] entering COMMIT state.
        May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] entering RECOVERY state.
        May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] position [0] member
        10.2.220.6:
        May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] previous ring seq 140
        rep 10.2.220.6
        May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] aru 9 high delivered 9
        received flag 1
        May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] position [1] member
        10.2.220.7:
        May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] previous ring seq 136
        rep 10.2.220.7
        May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] aru 1b high delivered
        1b received flag 1
        May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] Did not need to
        originate any messages in recovery.
        May 2 16:19:26 psfhost2 openais[3275]: [CLM ] CLM CONFIGURATION CHANGE
        May 2 16:19:26 psfhost2 openais[3275]: [CLM ] New Configuration:
        May 2 16:19:27 psfhost2 openais[3275]: [CLM ] r(0) ip(10.2.220.7)
        May 2 16:19:27 psfhost2 openais[3275]: [CLM ] Members Left:
        May 2 16:19:27 psfhost2 openais[3275]: [CLM ] Members Joined:
        May 2 16:19:27 psfhost2 openais[3275]: [CLM ] CLM CONFIGURATION CHANGE
        May 2 16:19:27 psfhost2 openais[3275]: [CLM ] New Configuration:
        May 2 16:19:27 psfhost2 openais[3275]: [CLM ] r(0) ip(10.2.220.6)
        May 2 16:19:27 psfhost2 openais[3275]: [CLM ] r(0) ip(10.2.220.7)
        May 2 16:19:27 psfhost2 openais[3275]: [CLM ] Members Left:
        May 2 16:19:27 psfhost2 openais[3275]: [CLM ] Members Joined:
        May 2 16:19:27 psfhost2 openais[3275]: [CLM ] r(0) ip(10.2.220.6)
        May 2 16:19:27 psfhost2 openais[3275]: [SYNC ] This node is within the
        primary component and will provide service.
        May 2 16:19:27 psfhost2 openais[3275]: [TOTEM] entering OPERATIONAL
        state.
        May 2 16:19:27 psfhost2 openais[3275]: [CLM ] got nodejoin message
        10.2.220.6
        May 2 16:19:27 psfhost2 openais[3275]: [CLM ] got nodejoin message
        10.2.220.7
        May 2 16:19:27 psfhost2 openais[3275]: [CPG ] got joinlist message
        from node 2
        May 2 16:19:29 psfhost2 kernel: bnx2: eth1 NIC Link is Up, 1000 Mbps
        full duplex, receive & transmit flow control ON
        May 2 16:19:31 psfhost2 kernel: bnx2: eth1 NIC Link is Down
        May 2 16:19:35 psfhost2 kernel: bnx2: eth1 NIC Link is Up, 100 Mbps
        full duplex, receive & transmit flow control ON
        May 2 16:19:42 psfhost2 kernel: dlm: connecting to 1
        May 2 16:20:36 psfhost2 ccsd[3265]: Update of cluster.conf complete
        (version 57 -> 59).
        May 2 16:20:43 psfhost2 clurgmgrd[3793]: <notice> Reconfiguring
        May 2 16:21:15 psfhost2 clurgmgrd[3793]: <notice> Stopping service
        service:oracle
        May 2 16:21:25 psfhost2 avahi-daemon[3661]: Withdrawing address record
        for 10.2.220.7 on eth0.
        May 2 16:21:25 psfhost2 avahi-daemon[3661]: Leaving mDNS multicast
        group on interface eth0.IPv4 with address 10.2.220.7.
        May 2 16:21:25 psfhost2 avahi-daemon[3661]: Joining mDNS multicast
        group on interface eth0.IPv4 with address 10.2.220.2.
        May 2 16:21:25 psfhost2 clurgmgrd: [3793]: <err> Failed to remove
        10.2.220.2
        May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] entering RECOVERY state.
        May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] position [0] member
        127.0.0.1:
        May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] previous ring seq 144
        rep 10.2.220.6
        May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] aru 31 high delivered
        31 received flag 1
        May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] Did not need to
        originate any messages in recovery.
        May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] Sending initial ORF token
        May 2 16:21:40 psfhost2 openais[3275]: [CLM ] CLM CONFIGURATION CHANGE
        May 2 16:21:40 psfhost2 openais[3275]: [CLM ] New Configuration:
        May 2 16:21:40 psfhost2 openais[3275]: [CLM ] r(0) ip(127.0.0.1)
        May 2 16:21:40 psfhost2 openais[3275]: [CLM ] Members Left:
        May 2 16:21:40 psfhost2 openais[3275]: [CLM ] r(0) ip(10.2.220.7)
        May 2 16:21:40 psfhost2 openais[3275]: [CLM ] Members Joined:
        May 2 16:21:40 psfhost2 openais[3275]: [CLM ] CLM CONFIGURATION CHANGE
        May 2 16:21:40 psfhost2 openais[3275]: [CLM ] New Configuration:
        May 2 16:21:40 psfhost2 openais[3275]: [CLM ] r(0) ip(127.0.0.1)
        May 2 16:21:40 psfhost2 openais[3275]: [CLM ] Members Left:
        May 2 16:21:40 psfhost2 openais[3275]: [CLM ] Members Joined:
        May 2 16:21:40 psfhost2 openais[3275]: [SYNC ] This node is within the
        primary component and will provide service.
        May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] entering OPERATIONAL state.
        May 2 16:21:40 psfhost2 openais[3275]: [MAIN ] Killing node psfhost2
        because it has rejoined the cluster without cman_tool join
        May 2 16:21:40 psfhost2 openais[3275]: [CMAN ] cman killed by node 2
        because we rejoined the cluster without a full restart
        May 2 16:21:40 psfhost2 fenced[3291]: cman_get_nodes error -1 104
        May 2 16:21:40 psfhost2 kernel: clurgmgrd[3793]: segfault at
        0000000000000000 rip 0000000000408c4a rsp 00007fff3c4a9e20 error 4
        May 2 16:21:40 psfhost2 fenced[3291]: cluster is down, exiting
        May 2 16:21:40 psfhost2 groupd[3283]: cman_get_nodes error -1 104
        May 2 16:21:40 psfhost2 dlm_controld[3297]: cluster is down, exiting
        May 2 16:21:40 psfhost2 gfs_controld[3303]: cluster is down, exiting
        May 2 16:21:40 psfhost2 clurgmgrd[3792]: <crit> Watchdog: Daemon died,
        rebooting...
        May 2 16:21:40 psfhost2 kernel: dlm: closing connection to node 1
        May 2 16:21:40 psfhost2 kernel: dlm: closing connection to node 2
        May 2 16:21:40 psfhost2 kernel: md: stopping all md devices.
        May 2 16:21:41 psfhost2 kernel: uhci_hcd 0000:01:04.4: HCRESET not
        completed yet!
        May 2 16:24:55 psfhost2 syslogd 1.4.1: restart.
        May 2 16:24:55 psfhost2 kernel: klogd 1.4.1, log source = /proc/kmsg
        started.
        May 2 16:24:55 psfhost2 kernel: Linux version 2.6.18-53.el5
        (brewbuilder hs20-bc1-7 build redhat com) (gcc version 4.1.2
        20070626 (Red Hat 4.1.2-14)) #1 SMP Wed Oct

        --
        Linux-cluster mailing list
        Linux-cluster redhat com
        https://www.redhat.com/mailman/listinfo/linux-cluster
            

    ------------------------------------------------------------------------

      



