[Linux-cluster] Problems with logging and cluster instability

Michael Rauch (ATIX AG) rauch at atix.de
Fri Aug 8 06:37:20 UTC 2008


Hello Daniel,

the first issue sounds like a network problem.
On RHEL5 you have to enable IGMP and multicast traffic forwarding
on some network switches (on the logging network).
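A quick way to check whether multicast actually gets through the switches between two nodes is a small sender/receiver pair. The following is a minimal Python sketch, not part of the cluster suite. The group 239.192.0.99 and port 50099 are hypothetical test values; pick ones that do not clash with what the cluster itself uses. Run listen() on one node with its cluster-interface address and send() from another node. If nothing arrives, multicast forwarding or IGMP handling on the switch is a likely suspect.

```python
import socket
import struct

# Hypothetical test group and port, not the ones CMAN/OpenAIS use.
TEST_GROUP = "239.192.0.99"
TEST_PORT = 50099


def membership_request(group: str, iface_addr: str = "0.0.0.0") -> bytes:
    """Build the struct ip_mreq passed with IP_ADD_MEMBERSHIP."""
    # struct ip_mreq { struct in_addr imr_multiaddr; struct in_addr imr_interface; }
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(iface_addr))


def listen(group: str = TEST_GROUP, port: int = TEST_PORT, iface_addr: str = "0.0.0.0") -> None:
    """Join the group on the given interface and print every datagram received."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # Joining the group makes the kernel send an IGMP membership report,
    # which is what an IGMP-snooping switch needs to see before forwarding.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group, iface_addr))
    while True:
        data, sender = sock.recvfrom(1024)
        print(f"{sender[0]}: {data!r}")


def send(message: bytes, group: str = TEST_GROUP, port: int = TEST_PORT,
         iface_addr: str = "0.0.0.0", ttl: int = 1) -> None:
    """Send one datagram to the group, optionally out of a specific interface."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    if iface_addr != "0.0.0.0":
        # Force the outgoing interface, e.g. the cluster NIC rather than the LAN NIC.
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF,
                        socket.inet_aton(iface_addr))
    sock.sendto(message, (group, port))
    sock.close()
```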

Regards, Michael



On Thursday 07 August 2008, dake at novatec.de wrote:
> Hello folks,
>
> we've been having two nasty problems with a GFS cluster, currently
> running version 2.03.03 of the cluster suite and 0.80.3 of OpenAIS.
>
> The first is that logging has been broken for some time now. We're
> getting kernel log messages from the DLM and GFS modules, but the
> userland utilities (i.e. OpenAIS) refuse to log at all when used with
> the cluster suite. Logging works fine when OpenAIS is started without
> it (i.e. with the default OpenAIS config file), so I'm pretty sure it's
> not the logging setup. Somehow, it seems that CMAN is not passing the
> correct logging parameters to OpenAIS, and I really don't know why.
> I've tried including extra logging directives in cluster.conf, in
> various forms, but to no avail. The cluster.conf we're using now is
> as follows:
>
> <?xml version="1.0"?>
> <cluster name="gfscluster" config_version="6">
>
>    <clusternodes>
>      <clusternode name="smb1-cluster" nodeid="1">
>        <fence>
>          <method name="powerswitch">
>            <device name="powerswitch" port="1"/>
>          </method>
>          <method name="last_resort">
>            <device name="manual" nodename="smb1"/>
>          </method>
>        </fence>
>      </clusternode>
>      <clusternode name="smb2-cluster" nodeid="2">
>        <fence>
>          <method name="powerswitch">
>            <device name="powerswitch" port="2"/>
>          </method>
>          <method name="last_resort">
>            <device name="manual" nodename="smb2"/>
>          </method>
>        </fence>
>      </clusternode>
>      <clusternode name="mail-cluster" nodeid="3">
>        <fence>
>          <method name="powerswitch">
>            <device name="powerswitch" port="3"/>
>          </method>
>          <method name="last_resort">
>            <device name="manual" nodename="mail"/>
>          </method>
>        </fence>
>      </clusternode>
>      <clusternode name="backup-cluster" nodeid="4">
>        <fence>
>          <method name="powerswitch">
>            <device name="powerswitch" port="4"/>
>          </method>
>          <method name="last_resort">
>            <device name="manual" nodename="backup"/>
>          </method>
>        </fence>
>      </clusternode>
>    </clusternodes>
>
>    <fencedevices>
>      <fencedevice name="powerswitch" agent="fence_epc"
> host="192.168.10.xx" passwd="xxx" action="4"/>
>      <fencedevice name="manual" agent="fence_manual"/>
>    </fencedevices>
>
>    <fence_daemon post_join_delay="30">
>    </fence_daemon>
>
>    <logging to_syslog="yes" syslog_facility="local3">
>      <logger ident="CPG" to_syslog="yes">
>      </logger>
>      <logger ident="CMAN" to_syslog="yes">
>      </logger>
>      <logger ident="CLM" to_syslog="yes">
>      </logger>
>    </logging>
>
> </cluster>
>
> Any idea why this might not be working?
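One way to narrow down the first problem is to check, independently of OpenAIS, whether syslogd records anything at all on the local3 facility named in syslog_facility above. A minimal Python sketch, assuming a Python with the standard syslog module on the node; the ident "logtest" is a hypothetical name. The util-linux command `logger -p local3.info "test"` does the same job from the shell.

```python
import syslog

# Facility names accepted by this helper: a few common ones plus local0..local7.
_FACILITIES = {"user": syslog.LOG_USER, "daemon": syslog.LOG_DAEMON}
_FACILITIES.update({f"local{i}": getattr(syslog, f"LOG_LOCAL{i}") for i in range(8)})


def facility_code(name: str) -> int:
    """Map a facility name such as 'local3' (as used in syslog_facility) to its constant."""
    key = name.strip().lower()
    if key not in _FACILITIES:
        raise ValueError(f"unknown syslog facility: {name!r}")
    return _FACILITIES[key]


def send_test_message(facility_name: str = "local3", ident: str = "logtest") -> None:
    """Log one INFO message on the given facility so syslogd routing can be checked."""
    syslog.openlog(ident, syslog.LOG_PID, facility_code(facility_name))
    syslog.syslog(syslog.LOG_INFO, f"test message on facility {facility_name}")
    syslog.closelog()
```

Calling send_test_message() and then looking wherever syslog.conf routes local3 separates two cases: if the test line shows up but OpenAIS stays silent, the problem is more likely in how CMAN hands the logging settings to OpenAIS than in syslog itself.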
>
> The second problem is that once quorum is reached, any additional
> node joining makes the existing quorate cluster break apart. This
> behaviour has been seen in a three-node config with the third node
> joining, and in a four-node config with the fourth node joining. WHICH
> node is the last to join doesn't seem to make a difference. The
> "breaking apart" means that the newly joined node dies ("joining
> cluster with disallowed nodes, must die"), one of the existing nodes
> dies, and the other two existing nodes keep running but out of sync:
> they show differing cluster membership and differing disallowed nodes.
> This is after a fresh reboot, so there is NO state in any node before
> joining. The crash occurs at the cman_tool join stage.
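For reference, the usual majority rule shows why the two surviving nodes in the four-node case cannot carry on by themselves. A minimal Python sketch, assuming one vote per node (the cman default when no votes attribute is set, as in the cluster.conf above), no quorum disk and no two_node setting.

```python
# Assumes one vote per node, no quorum disk and no two_node special case.
def quorum(expected_votes: int) -> int:
    """Smallest number of votes that is a strict majority of expected_votes."""
    return expected_votes // 2 + 1


def is_quorate(votes_present: int, expected_votes: int) -> bool:
    """True if the votes present reach the quorum threshold."""
    return votes_present >= quorum(expected_votes)


if __name__ == "__main__":
    # Four-node case from the report: the joining node and one existing node
    # die, leaving two survivors out of four expected votes.
    print(quorum(4), is_quorate(2, 4))  # prints: 3 False
```

With four expected votes the threshold is three, so a two-node remnant stays inquorate whichever way its membership view diverges.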
>
> I have a gut feeling it might have something to do with our network
> config: three of the nodes have four Ethernet interfaces each, and the
> fourth has two. The first three have two iSCSI interfaces, plus one
> for cluster use and one for regular LAN access. The last has only one
> iSCSI interface and no LAN access for now. Routing tables etc. should
> be set up properly; as you can see above, cluster.conf uses special
> hostnames for the cluster interfaces, which are resolved to IPs using
> hosts files that are identical on all four machines. I have yet to do
> any packet sniffing, and I have very little information log-wise due
> to the first problem, so I know this is not a lot of info; but I
> thought I would include it anyway, in case someone can immediately
> point out the problem.
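Since the cluster interfaces are identified only through the *-cluster hostnames in the hosts files, it may be worth checking on every node that those names resolve to addresses on the same cluster subnet. A minimal Python sketch; the subnet 192.168.20.0/24 used below is a hypothetical placeholder for whatever the cluster network actually is.

```python
import ipaddress
import socket

# Cluster hostnames taken from the cluster.conf above.
CLUSTER_NAMES = ["smb1-cluster", "smb2-cluster", "mail-cluster", "backup-cluster"]


def resolve_all(names: list) -> dict:
    """Resolve each name through the local resolver (/etc/hosts, then DNS, etc.)."""
    return {name: socket.gethostbyname(name) for name in names}


def outside_network(addresses: dict, network: str) -> list:
    """Return, sorted, the names whose address is not inside `network`."""
    net = ipaddress.ip_network(network)
    return sorted(name for name, addr in addresses.items()
                  if ipaddress.ip_address(addr) not in net)
```

Running outside_network(resolve_all(CLUSTER_NAMES), "192.168.20.0/24") on each of the four machines should return an empty list everywhere; any name it reports resolves off the cluster subnet on that node.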
>
> Thanks in advance,
> Daniel
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



