From gounini.geekarea at gmail.com  Wed Aug  1 08:28:07 2012
From: gounini.geekarea at gmail.com (GouNiNi)
Date: Wed, 1 Aug 2012 10:28:07 +0200 (CEST)
Subject: [Linux-cluster] Quorum device brain the cluster when master lose network
In-Reply-To: 
Message-ID: <841982910.2925.1343809687840.JavaMail.root@geekarea.fr>

I ran this test one more time and got the same result, with more detail:

When I shut down the network on 2 nodes, including the master, the master stays
alive while the 2 online nodes fence the offline non-master node. The cluster
goes inquorate afterwards.
When the fenced node comes back, it joins the cluster and the cluster becomes
quorate again. A new master is chosen and the old master is fenced.

# cman_tool status
Version: 6.2.0
Config Version: 144
Cluster Name: cluname
Cluster Id: 57462
Cluster Member: Yes
Cluster Generation: 488
Membership state: Cluster-Member
Nodes: 4
Expected votes: 5
Quorum device votes: 1
Total votes: 5
Quorum: 3
Active subsystems: 9
Flags: Dirty
Ports Bound: 0 177
Node name: nodename
Node ID: 2
Multicast addresses: ZZ.ZZ.ZZ.ZZ
Node addresses: YY.YY.YY.YY

--
 .`'`.   GouNiNi
: ': :
`. ` .`  GNU/Linux
  `'`    http://www.geekarea.fr

----- Mail original -----
> De: "emmanuel segura"
> À: "linux clustering"
> Envoyé: Lundi 30 Juillet 2012 17:35:39
> Objet: Re: [Linux-cluster] Quorum device brain the cluster when master lose network
>
> Can you send me the output from cman_tool status while the cluster
> is running?
>
> 2012/7/30 GouNiNi < gounini.geekarea at gmail.com >
>
> > ----- Mail original -----
> > De: "Digimer" < lists at alteeve.ca >
> > À: "linux clustering" < linux-cluster at redhat.com >
> > Cc: "GouNiNi" < gounini.geekarea at gmail.com >
> > Envoyé: Lundi 30 Juillet 2012 17:10:10
> > Objet: Re: [Linux-cluster] Quorum device brain the cluster when
> > master lose network
> >
> > On 07/30/2012 10:43 AM, GouNiNi wrote:
> > > Hello,
> > >
> > > I did some tests on a 4-node cluster with a quorum device and found
> > > a bad situation with one test, so I need your knowledge to correct
> > > my configuration.
> > >
> > > Configuration:
> > > 4 nodes, each with 1 vote
> > > quorum device with 1 vote (to keep services up with a minimum of 2 nodes)
> > > cman expected votes 5
> > >
> > > Situation:
> > > I shut down the network on 2 nodes, one of them the master.
> > >
> > > Observation:
> > > Fencing of one node (the master)... Quorum device offline, quorum
> > > dissolved! Services stopped.
> > > The fenced node reboots, the cluster is quorate, the 2nd offline node is
> > > fenced. Services restart.
> > > The 2nd offline node reboots.
> > >
> > > My cluster was not quorate for 8 min (very long hardware boot :-)
> > > and my services were offline.
> > >
> > > Do you know how to prevent this situation?
> > >
> > > Regards,
> >
> > Please tell us the name and version of the cluster software you are
> > using. Please also share your configuration file(s).
> > > > -- > > Digimer > > Papers and Projects: https://alteeve.com > > > > Sorry, RHEL5.6 64bits > > # rpm -q cman rgmanager > cman-2.0.115-68.el5 > rgmanager-2.0.52-9.el5 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > login="xxxx" name="fenceIBM_307" passwd="yyyy"/> > login="xxxx" name="fenceIBM_308" passwd="yyyy"/> > > > > > > <...> > > > post_join_delay="300"/> > > > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > esta es mi vida e me la vivo hasta que dios quiera > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From gounini.geekarea at gmail.com Wed Aug 1 08:29:02 2012 From: gounini.geekarea at gmail.com (GouNiNi) Date: Wed, 1 Aug 2012 10:29:02 +0200 (CEST) Subject: [Linux-cluster] How to change quorumd intervel and tko online? In-Reply-To: <1277138514.2193.1343659611647.JavaMail.root@geekarea.fr> Message-ID: <1145411303.2926.1343809742361.JavaMail.root@geekarea.fr> Infos: RHEL5.6 64bits # rpm -q cman rgmanager cman-2.0.115-68.el5 rgmanager-2.0.52-9.el5 Any idea? -- .`'`. GouNiNi : ': : `. ` .` GNU/Linux `'` http://www.geekarea.fr ----- Mail original ----- > De: "GouNiNi" > ?: "linux clustering" > Envoy?: Lundi 30 Juillet 2012 16:46:51 > Objet: [Linux-cluster] How to change quorumd intervel and tko online? > > Re, > > Juste two little questions. > How to change quorumd intervel and tko **online**? > How to check these values on online cluster? > > Thanks > Regards, > > -- > .`'`. GouNiNi > : ': : > `. ` .` GNU/Linux > `'` http://www.geekarea.fr > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From emi2fast at gmail.com Wed Aug 1 08:58:59 2012 From: emi2fast at gmail.com (emmanuel segura) Date: Wed, 1 Aug 2012 10:58:59 +0200 Subject: [Linux-cluster] Quorum device brain the cluster when master lose network In-Reply-To: <841982910.2925.1343809687840.JavaMail.root@geekarea.fr> References: <841982910.2925.1343809687840.JavaMail.root@geekarea.fr> Message-ID: Hello Gounini Sorry but it told you, remove and reboot the cluster Let the cluster calculate the expected votes 2012/8/1 GouNiNi > I do this test one more time and I got same result with more precisions: > > When I shutdown network on 2 nodes including the master, master stay alive > while the 2 online nodes are fencing the offline non-master node. The > cluster goes Inquorate after. > When fenced node came back, he joins cluster and cluster becomes quorate. > New master is chose and the old master is fenced. > > # cman_tool status > Version: 6.2.0 > Config Version: 144 > Cluster Name: cluname > Cluster Id: 57462 > Cluster Member: Yes > Cluster Generation: 488 > Membership state: Cluster-Member > Nodes: 4 > Expected votes: 5 > Quorum device votes: 1 > Total votes: 5 > Quorum: 3 > Active subsystems: 9 > Flags: Dirty > Ports Bound: 0 177 > Node name: nodename > Node ID: 2 > Multicast addresses: ZZ.ZZ.ZZ.ZZ > Node addresses: YY.YY.YY.YY > > -- > .`'`. GouNiNi > : ': : > `. ` .` GNU/Linux > `'` http://www.geekarea.fr > > > ----- Mail original ----- > > De: "emmanuel segura" > > ?: "linux clustering" > > Envoy?: Lundi 30 Juillet 2012 17:35:39 > > Objet: Re: [Linux-cluster] Quorum device brain the cluster when master > lose network > > > > > > can you send me the ouput from cman_tool status? 
when the cluster > > it's running > > > > > > 2012/7/30 GouNiNi < gounini.geekarea at gmail.com > > > > > > > > > > > ----- Mail original ----- > > > De: "Digimer" < lists at alteeve.ca > > > > ?: "linux clustering" < linux-cluster at redhat.com > > > > Cc: "GouNiNi" < gounini.geekarea at gmail.com > > > > Envoy?: Lundi 30 Juillet 2012 17:10:10 > > > Objet: Re: [Linux-cluster] Quorum device brain the cluster when > > > master lose network > > > > > > On 07/30/2012 10:43 AM, GouNiNi wrote: > > > > Hello, > > > > > > > > I did some tests on 4 nodes cluster with quorum device and I find > > > > a > > > > bad situation with one test, so I need your knowledges to correct > > > > my configuration. > > > > > > > > Configuation: > > > > 4 nodes, all vote for 1 > > > > quorum device vote for 1 (to hold services with minimum 2 nodes > > > > up) > > > > cman expected votes 5 > > > > > > > > Situation: > > > > I shut down network on 2 nodes, one of them is master. > > > > > > > > Observation: > > > > Fencing of one node (the master)... Quorum device Offline, Quorum > > > > disolved ! Services stopped. > > > > Fenced node reboot, cluster is quorate, 2nd offline node is > > > > fenced. > > > > Services restart. > > > > 2nd node offline reboot. > > > > > > > > My cluster is not quorate for 8 min (very long hardware boot :-) > > > > and my services were offline. > > > > > > > > Do you know how to prevent this situation? > > > > > > > > Regards, > > > > > > Please tell us the name and version of the cluster software you are > > > using, Please also share your configuration file(s). > > > > > > -- > > > Digimer > > > Papers and Projects: https://alteeve.com > > > > > > > Sorry, RHEL5.6 64bits > > > > # rpm -q cman rgmanager > > cman-2.0.115-68.el5 > > rgmanager-2.0.52-9.el5 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > login="xxxx" name="fenceIBM_307" passwd="yyyy"/> > > > login="xxxx" name="fenceIBM_308" passwd="yyyy"/> > > > > > > > > > > > > <...> > > > > > > > post_join_delay="300"/> > > > > > > > > > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > -- > > esta es mi vida e me la vivo hasta que dios quiera > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- esta es mi vida e me la vivo hasta que dios quiera -------------- next part -------------- An HTML attachment was scrubbed... URL: From piotr.pietrzak at hp.com Wed Aug 1 10:37:15 2012 From: piotr.pietrzak at hp.com (Pietrzak, Piotr (CMS rtBSS)) Date: Wed, 1 Aug 2012 10:37:15 +0000 Subject: [Linux-cluster] Named pipes not working on GFS2 in Redhat 5.x, In-Reply-To: <3ABFB3D87EB6904F9F6FB45C90E24F4D0808A4@G1W3650.americas.hpqcorp.net> References: <3ABFB3D87EB6904F9F6FB45C90E24F4D0808A4@G1W3650.americas.hpqcorp.net> Message-ID: <3ABFB3D87EB6904F9F6FB45C90E24F4D0808C9@G1W3650.americas.hpqcorp.net> Hello Bob, The problem has been reported by one of my customer, but I was not able to set up a real cluster so I have built a cluster with single machine, all software cluster and GFS2 installed and set up just one node. It allows me to mount GFS2 filesystem and conduct tests with application and shell steps. 
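For reference, the application-side check boils down to an FIONREAD ioctl on the
FIFO descriptor. A minimal standalone sketch of that check, assuming a FIFO on the
GFS2 mount (the path below is only an example, not the customer's code):

/* fifo_fionread.c - create a FIFO and ask how many bytes are queued in it */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>

int main(void)
{
    const char *path = "/gfs_test/msg_act_0.pipe";  /* example path on the GFS2 mount */
    int nbytes = 0;

    /* Create the named pipe if it does not exist yet. */
    if (mkfifo(path, 0644) == -1 && errno != EEXIST) {
        perror("mkfifo");
        return 1;
    }

    /* O_NONBLOCK so the open does not wait for a writer. */
    int fd = open(path, O_RDONLY | O_NONBLOCK);
    if (fd == -1) {
        perror("open");
        return 1;
    }

    /* The same call the application makes; this is where the reported failure shows up. */
    if (ioctl(fd, FIONREAD, &nbytes) == -1)
        fprintf(stderr, "ioctl(FIONREAD): %s\n", strerror(errno));
    else
        printf("%d byte(s) queued in the FIFO\n", nbytes);

    close(fd);
    return 0;
}

On a local filesystem this reports the number of queued bytes; the behaviour on the
GFS2 mount is what the rest of this mail describes.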
In the real system, the following piece of code crashes application start-up:

if ( ioctl( nFifoDescr, FIONREAD, &lSize) == -1 )
{
    errNo = errno;

However, when I built the test system I saw that when I create pipes on the GFS2
filesystem and try to use them from the shell, I get an error message.
For instance, when I try to write to the pipe the following error shows up immediately:

[root at erm4 gfs_test]# echo test > msg_act_0.pipe
-bash: echo: write error: Invalid argument

When I try to read from the pipe the following error comes up:

[root at erm4 gfs_test]# cat

From queszama at yahoo.in  Wed Aug  1 11:56:46 2012
From: queszama at yahoo.in (Zama Ques)
Date: Wed, 1 Aug 2012 19:56:46 +0800 (SGT)
Subject: [Linux-cluster] Creating two different cluster using same set of nodes.
Message-ID: <1343822206.68654.YahooMailNeo@web193005.mail.sg3.yahoo.com>

Hi All,

I need clarification on whether it is possible to create two different clusters
using the same set of nodes.

It looks like Redhat Cluster Suite does not support creating different clusters
using the same nodes. I am getting the following error while building the second
cluster using the same nodes through the luci interface:

====
[dismiss]

The following errors occurred:
    * Host system3.example.com is already a member of the cluster named "ClusterA"
    * Host system4.example.com is already a member of the cluster named "ClusterA"
===

My query is: does Redhat Cluster Suite allow, in any way, creating two different
clusters using the same nodes? If not, is there any reason for not allowing this
feature?

Thanks in Advance
Zaman
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From lists at alteeve.ca  Wed Aug  1 12:23:39 2012
From: lists at alteeve.ca (Digimer)
Date: Wed, 01 Aug 2012 08:23:39 -0400
Subject: [Linux-cluster] Creating two different cluster using same set of nodes.
In-Reply-To: <1343822206.68654.YahooMailNeo@web193005.mail.sg3.yahoo.com>
References: <1343822206.68654.YahooMailNeo@web193005.mail.sg3.yahoo.com>
Message-ID: <50191FCB.7000804@alteeve.ca>

On 08/01/2012 07:56 AM, Zama Ques wrote:
> Hi All ,
>
> Need clarifications whether it is possible to create two different
> cluster using the same set of nodes.
>
> Looks like Redhat Cluster Suite does not support creating different
> clusters using the same nodes. I am getting the following
> error while building the second cluster using the same nodes using luci
> interface .
>
> ====
> [dismiss]
>
> The following errors occurred:
>     * Host system3.example.com is already a member of the cluster named
> "ClusterA"
>     * Host system4.example.com is already a member of the cluster named
> "ClusterA"
> ===
>
> My query is that does Redhat Cluster Suite allows in any way to create
> two different clusters using same nodes. If not , any reason for not
> allowing this feature?.
>
> Thanks in Advance
> Zaman

It is not possible, no. A node must be in one cluster only. May I ask
why you're trying to do this?

--
Digimer
Papers and Projects: https://alteeve.com

From gianluca.cecchi at gmail.com  Wed Aug  1 14:10:48 2012
From: gianluca.cecchi at gmail.com (Gianluca Cecchi)
Date: Wed, 1 Aug 2012 16:10:48 +0200
Subject: [Linux-cluster] clvmd problems with centos 6.3 or normal clvmd behaviour?
Message-ID: 

Hello,
I am testing a three-node cluster + quorum disk and clvmd.
I was at CentOS 6.2 and I seem to remember being able to start a
single node. Correct?
Then I upgraded to CentOS 6.3 and had a working environment.
My config has At the moment two nodes are in another site that is powered down and I need to start a single node config. When the node starts it gets waiting for quorum and when quorum disk becomes master it goes ahead: # cman_tool nodes Node Sts Inc Joined Name 0 M 0 2012-08-01 15:41:58 /dev/block/253:4 1 X 0 intrarhev1 2 X 0 intrarhev2 3 M 1420 2012-08-01 15:39:58 intrarhev3 But the process hangs at clvmd start up. In particular at the step vgchange -aly Pid of "service clvmd start" command is 9335 # pstree -alp 9335 S24clvmd,9335 /etc/rc3.d/S24clvmd start ??vgchange,9363 -ayl # ll /proc/9363/fd/ total 0 lrwx------ 1 root root 64 Aug 1 15:44 0 -> /dev/console lrwx------ 1 root root 64 Aug 1 15:44 1 -> /dev/console lrwx------ 1 root root 64 Aug 1 15:44 2 -> /dev/console lrwx------ 1 root root 64 Aug 1 15:44 3 -> /dev/mapper/control lrwx------ 1 root root 64 Aug 1 15:44 4 -> socket:[1348167] lr-x------ 1 root root 64 Aug 1 15:44 5 -> /dev/dm-3 # lsof -p 9363 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME vgchange 9363 root cwd DIR 104,3 4096 2 / vgchange 9363 root rtd DIR 104,3 4096 2 / vgchange 9363 root txt REG 104,3 971464 132238 /sbin/lvm vgchange 9363 root mem REG 104,3 156872 210 /lib64/ld-2.12.so vgchange 9363 root mem REG 104,3 1918016 569 /lib64/libc-2.12.so vgchange 9363 root mem REG 104,3 22536 593 /lib64/libdl-2.12.so vgchange 9363 root mem REG 104,3 24000 832 /lib64/libdevmapper-event.so.1.02 vgchange 9363 root mem REG 104,3 124624 750 /lib64/libselinux.so.1 vgchange 9363 root mem REG 104,3 272008 2060 /lib64/libreadline.so.6.0 vgchange 9363 root mem REG 104,3 138280 2469 /lib64/libtinfo.so.5.7 vgchange 9363 root mem REG 104,3 61648 1694 /lib64/libudev.so.0.5.1 vgchange 9363 root mem REG 104,3 251112 1489 /lib64/libsepol.so.1 vgchange 9363 root mem REG 104,3 229024 1726 /lib64/libdevmapper.so.1.02 vgchange 9363 root mem REG 253,7 99158576 17029 /usr/lib/locale/locale-archive vgchange 9363 root mem REG 253,7 26060 134467 /usr/lib64/gconv/gconv-modules.cache vgchange 9363 root 0u CHR 5,1 0t0 5218 /dev/console vgchange 9363 root 1u CHR 5,1 0t0 5218 /dev/console vgchange 9363 root 2u CHR 5,1 0t0 5218 /dev/console vgchange 9363 root 3u CHR 10,58 0t0 5486 /dev/mapper/control vgchange 9363 root 4u unix 0xffff880879b309c0 0t0 1348167 socket vgchange 9363 root 5r BLK 253,3 0t143360 10773 /dev/dm-3 # strace -p 9363 Process 9363 attached - interrupt to quit read(4, multipath seems ok in general and for md=3 in particular # multipath -l /dev/mapper/mpathd mpathd (3600507630efe0b0c0000000000001181) dm-3 IBM,1750500 size=100G features='1 queue_if_no_path' hwhandler='0' wp=rw |-+- policy='round-robin 0' prio=0 status=active | |- 0:0:0:3 sdd 8:48 active undef running | `- 1:0:0:3 sdl 8:176 active undef running `-+- policy='round-robin 0' prio=0 status=enabled |- 0:0:1:3 sdq 65:0 active undef running `- 1:0:1:3 sdy 65:128 active undef running Currently I have lvm2-2.02.95-10.el6.x86_64 lvm2-cluster-2.02.95-10.el6.x86_64 startup is stuck as in image attached Logs messages: Aug 1 15:46:14 udevd[663]: worker [9379] unexpectedly returned with status 0x0100 Aug 1 15:46:14 udevd[663]: worker [9379] failed while handling '/devices/virtual/block/dm-15' dmesg DLM (built Jul 20 2012 01:56:50) installed dlm: Using TCP for communications qdiskd Aug 01 15:41:58 qdiskd Score sufficient for master operation (1/1; required=1); upgrading Aug 01 15:43:03 qdiskd Assuming master role corosync.log Aug 01 15:41:58 corosync [CMAN ] quorum device registered Aug 01 15:43:08 corosync [CMAN ] quorum regained, 
resuming activity Aug 01 15:43:08 corosync [QUORUM] This node is within the primary component and will provide service. Aug 01 15:43:08 corosync [QUORUM] Members[1]: 3 fenced.log Aug 01 15:43:09 fenced fenced 3.0.12.1 started Aug 01 15:43:09 fenced failed to get dbus connection dlm_controld.log Aug 01 15:43:10 dlm_controld dlm_controld 3.0.12.1 started gfs_controld.log Aug 01 15:43:11 gfs_controld gfs_controld 3.0.12.1 started Do I miss anything simple? Is it correct to say that clvmd can start only when one node is active, given that it has quorum under the cluster configuration rules set up? Or am I hitting any known bug/problem? Thanks in advance, Gianluca -------------- next part -------------- A non-text attachment was scrubbed... Name: clvms stuck.png Type: image/png Size: 21666 bytes Desc: not available URL: From sdake at redhat.com Wed Aug 1 14:14:48 2012 From: sdake at redhat.com (Steven Dake) Date: Wed, 01 Aug 2012 07:14:48 -0700 Subject: [Linux-cluster] Need HA for VMs on OpenStack? check out Heat V5 Message-ID: <501939D8.9080209@redhat.com> Hi folks, A few developers from HA community have been hard at work on a project called heat which provides native HA for OpenStack virtual machines. Heat provides a template based system with API matching AWS CloudFormation semantics specifically for OpenStack. In v5, instance heatlhchecking has been added. To get started on Fedora 16+ check out the getting started guide: https://github.com/heat-api/heat/blob/master/docs/GettingStarted.rst#readme or on Ubuntu Precise check out the devstack guide: https://github.com/heat-api/heat/wiki/Getting-Started-with-Heat-using-Master-on-Ubuntu An example template with instance HA features is here: https://github.com/heat-api/heat/blob/master/templates/WordPress_Single_Instance_With_IHA.template An example template with applicatoin HA features that includes escalation is here: https://github.com/heat-api/heat/blob/master/templates/WordPress_Single_Instance_With_HA.template Our website is here: http://www.heat-api.org The software can be downloaded from: https://github.com/heat-api/heat/downloads Enjoy -steve From emi2fast at gmail.com Wed Aug 1 14:26:38 2012 From: emi2fast at gmail.com (emmanuel segura) Date: Wed, 1 Aug 2012 16:26:38 +0200 Subject: [Linux-cluster] clvmd problems with centos 6.3 or normal clvmd behaviour? In-Reply-To: References: Message-ID: Hello GianLuca Why you don't remove expected_votes=3 and let the cluster automatic calculate that I told you be cause i had some many problems with that setting 2012/8/1 Gianluca Cecchi > Hello, > testing a three node cluster + quorum disk and clvmd. > I was at CentOS 6.2 and I seem to remember to be able to start a > single node. Correct? > Then I upgraded to CentOS 6.3 and had a working environment. > My config has > > > At the moment two nodes are in another site that is powered down and I > need to start a single node config. > > When the node starts it gets waiting for quorum and when quorum disk > becomes master it goes ahead: > > # cman_tool nodes > Node Sts Inc Joined Name > 0 M 0 2012-08-01 15:41:58 /dev/block/253:4 > 1 X 0 intrarhev1 > 2 X 0 intrarhev2 > 3 M 1420 2012-08-01 15:39:58 intrarhev3 > > But the process hangs at clvmd start up. 
In particular at the step > vgchange -aly > Pid of "service clvmd start" command is 9335 > > # pstree -alp 9335 > S24clvmd,9335 /etc/rc3.d/S24clvmd start > ??vgchange,9363 -ayl > > > # ll /proc/9363/fd/ > total 0 > lrwx------ 1 root root 64 Aug 1 15:44 0 -> /dev/console > lrwx------ 1 root root 64 Aug 1 15:44 1 -> /dev/console > lrwx------ 1 root root 64 Aug 1 15:44 2 -> /dev/console > lrwx------ 1 root root 64 Aug 1 15:44 3 -> /dev/mapper/control > lrwx------ 1 root root 64 Aug 1 15:44 4 -> socket:[1348167] > lr-x------ 1 root root 64 Aug 1 15:44 5 -> /dev/dm-3 > > # lsof -p 9363 > COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME > vgchange 9363 root cwd DIR 104,3 4096 2 / > vgchange 9363 root rtd DIR 104,3 4096 2 / > vgchange 9363 root txt REG 104,3 971464 132238 > /sbin/lvm > vgchange 9363 root mem REG 104,3 156872 210 > /lib64/ld-2.12.so > vgchange 9363 root mem REG 104,3 1918016 569 > /lib64/libc-2.12.so > vgchange 9363 root mem REG 104,3 22536 593 > /lib64/libdl-2.12.so > vgchange 9363 root mem REG 104,3 24000 832 > /lib64/libdevmapper-event.so.1.02 > vgchange 9363 root mem REG 104,3 124624 750 > /lib64/libselinux.so.1 > vgchange 9363 root mem REG 104,3 272008 2060 > /lib64/libreadline.so.6.0 > vgchange 9363 root mem REG 104,3 138280 2469 > /lib64/libtinfo.so.5.7 > vgchange 9363 root mem REG 104,3 61648 1694 > /lib64/libudev.so.0.5.1 > vgchange 9363 root mem REG 104,3 251112 1489 > /lib64/libsepol.so.1 > vgchange 9363 root mem REG 104,3 229024 1726 > /lib64/libdevmapper.so.1.02 > vgchange 9363 root mem REG 253,7 99158576 17029 > /usr/lib/locale/locale-archive > vgchange 9363 root mem REG 253,7 26060 134467 > /usr/lib64/gconv/gconv-modules.cache > vgchange 9363 root 0u CHR 5,1 0t0 5218 > /dev/console > vgchange 9363 root 1u CHR 5,1 0t0 5218 > /dev/console > vgchange 9363 root 2u CHR 5,1 0t0 5218 > /dev/console > vgchange 9363 root 3u CHR 10,58 0t0 5486 > /dev/mapper/control > vgchange 9363 root 4u unix 0xffff880879b309c0 0t0 1348167 socket > vgchange 9363 root 5r BLK 253,3 0t143360 10773 > /dev/dm-3 > > > # strace -p 9363 > Process 9363 attached - interrupt to quit > read(4, > > multipath seems ok in general and for md=3 in particular > # multipath -l /dev/mapper/mpathd > mpathd (3600507630efe0b0c0000000000001181) dm-3 IBM,1750500 > size=100G features='1 queue_if_no_path' hwhandler='0' wp=rw > |-+- policy='round-robin 0' prio=0 status=active > | |- 0:0:0:3 sdd 8:48 active undef running > | `- 1:0:0:3 sdl 8:176 active undef running > `-+- policy='round-robin 0' prio=0 status=enabled > |- 0:0:1:3 sdq 65:0 active undef running > `- 1:0:1:3 sdy 65:128 active undef running > > Currently I have > lvm2-2.02.95-10.el6.x86_64 > lvm2-cluster-2.02.95-10.el6.x86_64 > > startup is stuck as in image attached > > Logs > messages: > Aug 1 15:46:14 udevd[663]: worker [9379] unexpectedly returned with > status 0x0100 > Aug 1 15:46:14 udevd[663]: worker [9379] failed while handling > '/devices/virtual/block/dm-15' > > dmesg > DLM (built Jul 20 2012 01:56:50) installed > dlm: Using TCP for communications > > > qdiskd > Aug 01 15:41:58 qdiskd Score sufficient for master operation (1/1; > required=1); upgrading > Aug 01 15:43:03 qdiskd Assuming master role > > corosync.log > Aug 01 15:41:58 corosync [CMAN ] quorum device registered > Aug 01 15:43:08 corosync [CMAN ] quorum regained, resuming activity > Aug 01 15:43:08 corosync [QUORUM] This node is within the primary > component and will provide service. 
> Aug 01 15:43:08 corosync [QUORUM] Members[1]: 3
>
> fenced.log
> Aug 01 15:43:09 fenced fenced 3.0.12.1 started
> Aug 01 15:43:09 fenced failed to get dbus connection
>
> dlm_controld.log
> Aug 01 15:43:10 dlm_controld dlm_controld 3.0.12.1 started
>
> gfs_controld.log
> Aug 01 15:43:11 gfs_controld gfs_controld 3.0.12.1 started
>
>
> Do I miss anything simple?
> Is it correct to say that clvmd can start only when one node is
> active, given that it has quorum under the cluster configuration rules
> set up?
>
> Or am I hitting any known bug/problem?
>
> Thanks in advance,
> Gianluca
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
esta es mi vida e me la vivo hasta que dios quiera
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From gounini.geekarea at gmail.com  Wed Aug  1 14:32:17 2012
From: gounini.geekarea at gmail.com (GouNiNi)
Date: Wed, 1 Aug 2012 16:32:17 +0200 (CEST)
Subject: [Linux-cluster] reasons for sporadic token loss?
In-Reply-To: <5017E44D.2010702@itechnical.de>
Message-ID: <861018155.3131.1343831537507.JavaMail.root@geekarea.fr>

Hello,

These answers are only gut feelings; I have too little experience with RHCS 6.

A. Your loss is a token loss; the consensus timeout only comes into play after
   the token is lost.
B. Maybe your problem is in the network, not in the cluster tuning.
C. No idea.
D. I think it doesn't. Token multicast uses the network address that your node
   name resolves to.

I think you should run the interconnect link on a single interface for testing,
without bonding. If your problem disappears, your bond mode 5 is the culprit.

Regards,

--
 .`'`.   GouNiNi
: ': :
`. ` .`  GNU/Linux
  `'`    http://www.geekarea.fr

----- Mail original -----
> De: "Heiko Nardmann"
> À: linux-cluster at redhat.com
> Envoyé: Mardi 31 Juillet 2012 15:57:33
> Objet: [Linux-cluster] reasons for sporadic token loss?
>
> Hi together!
>
> I am experiencing sporadic problems with my cluster setup. Maybe
> someone has an idea? But first some facts:
>
> Type: RHEL 6.1 two node cluster (corosync 1.2.3-36) on two Dell R610,
> each with a quad port NIC
>
> NICs:
> - interfaces em1/em2 are bonded using mode 5; these interfaces are
> cross connected (intended to be used for the cluster housekeeping
> communication) - no network element in between
> - interfaces em3/em4 are bonded using mode 1; these interfaces are
> connected to two switches
>
> Cluster configuration:
>
> vboxhost="vboxhost.private" login="test" vmname="RHEL 6.1 x86_64
> DF-System Server 1" />
> vboxhost="vboxhost.private" login="test" vmname="RHEL 6.1 x86_64
> DF-System Server 2" />
>
> sleeptime="10"/>
>