[Linux-cluster] Cluster environment issue

Kaloyan Kovachev kkovachev at varna.net
Fri Jun 3 08:48:31 UTC 2011


Hi,

On Thu, 2 Jun 2011 08:37:07 -0700 (PDT), Srija <swap_project at yahoo.com>
wrote:
> Thank you so much for your reply again.
> 
> --- On Tue, 5/31/11, Kaloyan Kovachev <kkovachev at varna.net> wrote:
> 
>> If it is a switch restart you will have in your logs the
>> interface going
>> down/up, but more problematic is to find a short drop of
>> the multicast
> 
> I checked all the nodes and did not find anything about the interface
> going down/up, but all the nodes report that server19 (node 12) /
> server18 (node 11) are the problematic ones. Here are the logs from
> three of the 16 nodes:
> 
>    May 24 18:04:59 server7 openais[6113]: [TOTEM] entering GATHER state
>    from 12.
>    May 24 18:05:01 server7 crond[5068]: (root) CMD ( 
>    /opt/hp/hp-health/bin/check-for-restart-requests)
>    May 24 18:05:19 server7 openais[6113]: [TOTEM] entering GATHER state
>    from 11.
> 
>    May 24 18:04:59 server1 openais[6148]: [TOTEM] entering GATHER state
>    from 12.
>    May 24 18:05:01 server1 crond[2275]: (root) CMD ( 
>    /opt/hp/hp-health/bin/check-for-restart-requests)
>    May 24 18:05:19 server1 openais[6148]: [TOTEM] entering GATHER state
>    from 11.
> 
>    May 24 18:04:59 server8 openais[6279]: [TOTEM] entering GATHER state
>    from 12.
>    May 24 18:05:01 server8 crond[11125]: (root) CMD ( 
>    /opt/hp/hp-health/bin/check-for-restart-requests)
>    May 24 18:05:19 server8 openais[6279]: [TOTEM] entering GATHER state
>    from 11.
> 
> 
> Here are some lines from node12 at the same time
> ___________________________________________________
> 
> 
> May 24 18:04:59 server19 openais[5950]: [TOTEM] The token was lost in the OPERATIONAL state.
> May 24 18:04:59 server19 openais[5950]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes).
> May 24 18:04:59 server19 openais[5950]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
> May 24 18:04:59 server19 openais[5950]: [TOTEM] entering GATHER state from 2.
> May 24 18:05:19 server19 openais[5950]: [TOTEM] entering GATHER state from 11.
> May 24 18:05:20 server19 openais[5950]: [TOTEM] Saving state aru 39a8f
> high seq received 39a8f
> May 24 18:05:20 server19 openais[5950]: [TOTEM] Storing new sequence id
> for ring 2af0
> May 24 18:05:20 server19 openais[5950]: [TOTEM] entering COMMIT state.
> May 24 18:05:20 server19 openais[5950]: [TOTEM] entering RECOVERY state.
> 
> 
> Here are a few lines from node11, i.e. server18
> ------------------------------------------
> 
> May 24 18:04:48 server18
> May 24 18:10:14 server18 syslog-ng[5619]: syslog-ng starting up;
> version='2.0.10'
> May 24 18:10:14 server18 Bootdata ok (command line is ro
> root=/dev/vgroot_xen/lvroot rhgb quiet)
> 
> 
> So it seems that node11 rebooted just a few minutes after all the
> problems appeared in the logs of all the nodes.
> 
> 
>  > You may ask the network people to check for STP changes and
>> double check
>> the multicast configuration and you may also try to use
>> broadcast instead
>> of multicast or use a dedicated switch.
> 
> As for a dedicated switch, the network team says it is not possible.
> I asked about the STP changes; their answer is:
> 
> "there are no stp changes for the private network as there are no
> redundant devices in the environment. the multicast configs is  igmp
> snooping with Pim"
> 
> I have talked to the network team about using broadcast instead of
> multicast; as per them, they can set it up.
> 
> Please comment on this.
> 

To use broadcast (if the private addresses are in the same VLAN/subnet)
you just need to set it in the cman section of cluster.conf, but I am not
sure whether it can be done on a running cluster (without stopping or
breaking it).
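
For reference, the setting should look roughly like the sketch below -
untested here, so check the schema for your cman version first; the
cluster name and config_version are placeholders only:

    <cluster name="mycluster" config_version="42">
      <cman broadcast="yes"/>
      ...
    </cluster>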

>  > your interface and multicast address)
>>     ping -I ethX -b -L 239.x.x.x -c 1
>> and finally run this script until the cluster gets broken
> 
> Yes, I have checked it; it is working fine now. I have also set up a
> cron job for this script on one node.

There is no need for cron, and if you haven't changed the script, cron
will start several processes and your network will be overloaded !!!
The script was made to run on a console (or via screen) and it will exit
_only_ when multicast is lost.
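
As a minimal sketch of that idea, to be left running in the foreground on
a console or under screen (eth1 and 239.192.1.1 are placeholders here -
substitute your own interface and multicast group, as in the ping line
quoted above):

    #!/bin/sh
    # poll the multicast group once per second from a console/screen;
    # exit, printing a timestamp, on the first lost ping
    IFACE=eth1
    GROUP=239.192.1.1
    while ping -I $IFACE -b -L $GROUP -c 1 -w 1 >/dev/null 2>&1; do
        sleep 1
    done
    echo "`date`: multicast to $GROUP lost on $IFACE"

Start it once, leave it in the foreground, and note the timestamp it
prints when it dies - that is the moment multicast dropped.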

> 
> I have a few questions regarding the cluster configuration...
> 
> 
>    -  We are using CLVM in the cluster environment. As I understand,
>       it is active-active. The environment is Xen; all the Xen hosts
>       are in the cluster and each host has its guests. We keep the
>       option to live-migrate the guests from one host to another.
> 
>     - I was looking into the Red Hat knowledgebase document
>       https://access.redhat.com/kb/docs/DOC-3068; as per the document,
>       which do you think would be the better choice, CLVM or HA-LVM?
> 
> Please advise.

Can't comment on this, sorry.

>  
> 
> Thanks  and regards again.
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



