From rohara at redhat.com  Mon Jul  2 14:17:41 2012
From: rohara at redhat.com (Ryan O'Hara)
Date: Mon, 02 Jul 2012 09:17:41 -0500
Subject: [Linux-cluster] Options for fencing at the node level .
In-Reply-To: <1340937126.95735.YahooMailNeo@web193003.mail.sg3.yahoo.com>
References: <1340937126.95735.YahooMailNeo@web193003.mail.sg3.yahoo.com>
Message-ID: <4FF1AD85.5060500@redhat.com>

On 06/28/2012 09:32 PM, Zama Ques wrote:
> Hi All ,
>
> I need to setup HA clustering using redhat cluster suite on two nodes , primary concern being high availability . Before trying it on production , I am trying to configure the setup on two desktop machines . For storage , I am creating a partition and sharing the partition as a iscsi target on a third machine . Would like to know what are the options for fencing available at the node level  .  I tried going through the conga interface for creating a
> shared fence device , I could see one option is using GNBD . virtual machine fencing is there in the list but that is for xen based HA
> cluster . scsi fencing is there , but as far as what I understand it does not support iscsi target as of now. Manual fencing is also there , and I am planning to use that , but would like to  know is there any other options are available for fencing at node level ?

SCSI fencing will work with iscsi if the iscsi target is SPC-3 
compliant. The target must also support the preempt-and-abort SCSI 
subcommand. It really depends on what iscsi target you are using. I've 
used fence_scsi with iscsi a few times and it has worked, but I know 
that some iscsi targets have problems.

Ryan




From urgrue at bulbous.org  Mon Jul  2 17:08:52 2012
From: urgrue at bulbous.org (urgrue)
Date: Mon, 02 Jul 2012 19:08:52 +0200
Subject: [Linux-cluster] CLVM in a 3-node cluster
Message-ID: <4FF1D5A4.3060105@bulbous.org>

I'm trying to set up a 3-node cluster with clvm. Problem is, one node 
can't access the storage, and I'm getting:
Error locking on node node3: Volume group for uuid not found: <snip>
whenever I try to activate the LVs on one of the working nodes.

This can't be "by design", can it?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120702/8cc92a52/attachment.htm>

From lists at alteeve.ca  Mon Jul  2 17:14:10 2012
From: lists at alteeve.ca (Digimer)
Date: Mon, 02 Jul 2012 13:14:10 -0400
Subject: [Linux-cluster] CLVM in a 3-node cluster
In-Reply-To: <4FF1D5A4.3060105@bulbous.org>
References: <4FF1D5A4.3060105@bulbous.org>
Message-ID: <4FF1D6E2.7010209@alteeve.ca>

On 07/02/2012 01:08 PM, urgrue wrote:
> I'm trying to set up a 3-node cluster with clvm. Problem is, one node
> can't access the storage, and I'm getting:
> Error locking on node node3: Volume group for uuid not found: <snip>
> whenever I try to activate the LVs on one of the working nodes.
>
> This can't be "by design", can it?

Does pvscan show the right device? Are all nodes in the cluster? What 
does 'cman_tool status' and 'dlm_tool ls' show?

-- 
Digimer
Papers and Projects: https://alteeve.com




From urgrue at bulbous.org  Mon Jul  2 21:39:07 2012
From: urgrue at bulbous.org (urgrue)
Date: Mon, 02 Jul 2012 23:39:07 +0200
Subject: [Linux-cluster] CLVM in a 3-node cluster
In-Reply-To: <4FF1D6E2.7010209@alteeve.ca>
References: <4FF1D5A4.3060105@bulbous.org> <4FF1D6E2.7010209@alteeve.ca>
Message-ID: <4FF214FB.7000906@bulbous.org>

On 2/7/12 19:14, Digimer wrote:
> On 07/02/2012 01:08 PM, urgrue wrote:
>> I'm trying to set up a 3-node cluster with clvm. Problem is, one node
>> can't access the storage, and I'm getting:
>> Error locking on node node3: Volume group for uuid not found: <snip>
>> whenever I try to activate the LVs on one of the working nodes.
>>
>> This can't be "by design", can it?
>
> Does pvscan show the right device? Are all nodes in the cluster? What 
> does 'cman_tool status' and 'dlm_tool ls' show?
>

Sorry, I realize now I was misleading, let me clarify:
The third node cannot access the storage, this is by design. I have 
three datacenters but only two have access to the active storage. The 
third datacenter only has an async copy, and will only activate 
(manually) in case of a massive disaster (failure of both the other 
datacenters).
So I deliberately have a failover domain with only node1 and node2. 
node3's function is to provide quorum, but also be able to be activated 
(manually is fine) in case of a massive disaster.
In other words node3 is part of the cluster, but it can't see the 
storage during normal operation.
Looking at it another way, it's kind of as if we had a 3-node cluster 
where one node had an HBA failure but is otherwise working. Surely node1 
and node2 should be able to continue running the services?
So my question is, do I have an error somehwere, or is clvm really 
actually not able to function without all nodes being active and able to 
access storage?





From emi2fast at gmail.com  Mon Jul  2 22:40:10 2012
From: emi2fast at gmail.com (emmanuel segura)
Date: Tue, 3 Jul 2012 00:40:10 +0200
Subject: [Linux-cluster] CLVM in a 3-node cluster
In-Reply-To: <4FF214FB.7000906@bulbous.org>
References: <4FF1D5A4.3060105@bulbous.org> <4FF1D6E2.7010209@alteeve.ca>
	<4FF214FB.7000906@bulbous.org>
Message-ID: <CAE7pJ3CPeVwUDkihpW6TDYFmLHqY+tp7dO2mJJKeBXgiMiNdmA@mail.gmail.com>

So my question is, do I have an error somehwere, or is clvm really actually
not able to function without all nodes being active and able to access
storage?

Clvm need to be in a quorate cluster for work & if you use clvm in one node
of the cluster i think the should has access to the storage


your using the 3node to provide the quorum?

esample: if one node of your two primary nodes goes down the it's still
quorute, but if two node goes down and you are no using a quorum disk, you
lose the quorum state

I  don't know why you use a node to privide the quorum, if you are use SAN
why not use a lun for use as quorum disk

All nodes in the cluster should has access to the storag

2012/7/2 urgrue <urgrue at bulbous.org>

> On 2/7/12 19:14, Digimer wrote:
>
>> On 07/02/2012 01:08 PM, urgrue wrote:
>>
>>> I'm trying to set up a 3-node cluster with clvm. Problem is, one node
>>> can't access the storage, and I'm getting:
>>> Error locking on node node3: Volume group for uuid not found: <snip>
>>> whenever I try to activate the LVs on one of the working nodes.
>>>
>>> This can't be "by design", can it?
>>>
>>
>> Does pvscan show the right device? Are all nodes in the cluster? What
>> does 'cman_tool status' and 'dlm_tool ls' show?
>>
>>
> Sorry, I realize now I was misleading, let me clarify:
> The third node cannot access the storage, this is by design. I have three
> datacenters but only two have access to the active storage. The third
> datacenter only has an async copy, and will only activate (manually) in
> case of a massive disaster (failure of both the other datacenters).
> So I deliberately have a failover domain with only node1 and node2.
> node3's function is to provide quorum, but also be able to be activated
> (manually is fine) in case of a massive disaster.
> In other words node3 is part of the cluster, but it can't see the storage
> during normal operation.
> Looking at it another way, it's kind of as if we had a 3-node cluster
> where one node had an HBA failure but is otherwise working. Surely node1
> and node2 should be able to continue running the services?
> So my question is, do I have an error somehwere, or is clvm really
> actually not able to function without all nodes being active and able to
> access storage?
>
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/**mailman/listinfo/linux-cluster<https://www.redhat.com/mailman/listinfo/linux-cluster>
>



-- 
esta es mi vida e me la vivo hasta que dios quiera
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120703/7f2d1b02/attachment.htm>

From sam at dotsec.com  Mon Jul  2 23:17:42 2012
From: sam at dotsec.com (Sam Wilson)
Date: Tue, 03 Jul 2012 09:17:42 +1000
Subject: [Linux-cluster] CLVM in a 3-node cluster
In-Reply-To: <CAE7pJ3CPeVwUDkihpW6TDYFmLHqY+tp7dO2mJJKeBXgiMiNdmA@mail.gmail.com>
References: <4FF1D5A4.3060105@bulbous.org> <4FF1D6E2.7010209@alteeve.ca>
	<4FF214FB.7000906@bulbous.org>
	<CAE7pJ3CPeVwUDkihpW6TDYFmLHqY+tp7dO2mJJKeBXgiMiNdmA@mail.gmail.com>
Message-ID: <4FF22C16.5010300@dotsec.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

As I understand it, you could have the node as a quorum only node by
running only corosync on it.

However for DR it seems to me like you would actually want the storage
replicated to Node3. So it seems logical to me that clvmd would have
to be running on it.

Cheers,

Sam
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)

iF4EAREIAAYFAk/yLBUACgkQFdt86iEfl/e3wgD9FMJl355ta20pJfdSvfSDuJDU
DK7jt6idjCAg1LNpFYIA/RswrmTCxdzWXETw1ny4WBOxKo5tDwYmKUBKq5UOdcuU
=HNtS
-----END PGP SIGNATURE-----



From fdinitto at redhat.com  Tue Jul  3 04:04:08 2012
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Tue, 03 Jul 2012 06:04:08 +0200
Subject: [Linux-cluster] CLVM in a 3-node cluster
In-Reply-To: <4FF214FB.7000906@bulbous.org>
References: <4FF1D5A4.3060105@bulbous.org> <4FF1D6E2.7010209@alteeve.ca>
	<4FF214FB.7000906@bulbous.org>
Message-ID: <4FF26F38.3040705@redhat.com>

On 07/02/2012 11:39 PM, urgrue wrote:
> On 2/7/12 19:14, Digimer wrote:
>> On 07/02/2012 01:08 PM, urgrue wrote:
>>> I'm trying to set up a 3-node cluster with clvm. Problem is, one node
>>> can't access the storage, and I'm getting:
>>> Error locking on node node3: Volume group for uuid not found: <snip>
>>> whenever I try to activate the LVs on one of the working nodes.
>>>
>>> This can't be "by design", can it?
>>
>> Does pvscan show the right device? Are all nodes in the cluster? What
>> does 'cman_tool status' and 'dlm_tool ls' show?
>>
> 
> Sorry, I realize now I was misleading, let me clarify:
> The third node cannot access the storage, this is by design. I have
> three datacenters but only two have access to the active storage. The
> third datacenter only has an async copy, and will only activate
> (manually) in case of a massive disaster (failure of both the other
> datacenters).
> So I deliberately have a failover domain with only node1 and node2.
> node3's function is to provide quorum, but also be able to be activated
> (manually is fine) in case of a massive disaster.
> In other words node3 is part of the cluster, but it can't see the
> storage during normal operation.
> Looking at it another way, it's kind of as if we had a 3-node cluster
> where one node had an HBA failure but is otherwise working. Surely node1
> and node2 should be able to continue running the services?
> So my question is, do I have an error somehwere, or is clvm really
> actually not able to function without all nodes being active and able to
> access storage?

CLVM requires a consistent view of the storage from all nodes in the
cluster. This is by design.

A storage failure during operations (aka you start with all nodes able
to access the storage and then downgrade) is handle correctly.

Fabio



From urgrue at bulbous.org  Tue Jul  3 11:06:57 2012
From: urgrue at bulbous.org (urgrue)
Date: Tue, 03 Jul 2012 14:06:57 +0300
Subject: [Linux-cluster] CLVM in a 3-node cluster
In-Reply-To: <4FF26F38.3040705@redhat.com>
References: <4FF1D5A4.3060105@bulbous.org> <4FF1D6E2.7010209@alteeve.ca>
	<4FF214FB.7000906@bulbous.org> <4FF26F38.3040705@redhat.com>
Message-ID: <1341313617.20197.140661097201337.09C0B36C@webmail.messagingengine.com>



On Tue, Jul 3, 2012, at 06:04, Fabio M. Di Nitto wrote:
> CLVM requires a consistent view of the storage from all nodes in the
> cluster. This is by design.
> 
> A storage failure during operations (aka you start with all nodes able
> to access the storage and then downgrade) is handle correctly.

Ok, I understand. I find it a little curious though, since I don't see
what the risk is in allowing startup as long as the cluster is quorate.
Imagine you have a multi-node cluster that suffers a total outage - a
wider infrastructure problem or some kind for example - and upon
recovery one node is still out of the cluster for whatever reason. 
It's pretty common in my experience that larger outages result in many
smaller resulting issues that take a while to clean-up.



From queszama at yahoo.in  Thu Jul  5 14:12:11 2012
From: queszama at yahoo.in (Zama Ques)
Date: Thu, 5 Jul 2012 22:12:11 +0800 (SGT)
Subject: [Linux-cluster] Options for fencing at the node level .
In-Reply-To: <4FF1AD85.5060500@redhat.com>
References: <1340937126.95735.YahooMailNeo@web193003.mail.sg3.yahoo.com>
	<4FF1AD85.5060500@redhat.com>
Message-ID: <1341497531.6847.YahooMailNeo@web193001.mail.sg3.yahoo.com>





________________________________
 From: Ryan O'Hara <rohara at redhat.com>
To: linux-cluster at redhat.com 
Sent: Monday, 2 July 2012 7:47 PM
Subject: Re: [Linux-cluster] Options for fencing at the node level .
 
On 06/28/2012 09:32 PM, Zama Ques wrote:
> Hi All ,
> 
> I need to setup HA clustering using redhat cluster suite on two nodes , primary concern being high availability . Before trying it on production , I am trying to configure the setup on two desktop machines . For storage , I am creating a partition and sharing the partition as a iscsi target on a third machine . Would like to know what are the options for fencing available at the node level? .? I tried going through the conga interface for creating a
> shared fence device , I could see one option is using GNBD . virtual machine fencing is there in the list but that is for xen based HA
> cluster . scsi fencing is there , but as far as what I understand it does not support iscsi target as of now. Manual fencing is also there , and I am planning to use that , but would like to? know is there any other options are available for fencing at node level ?

?> SCSI fencing will work with iscsi if the iscsi target is SPC-3 compliant. The target must also support the preempt-and-abort SCSI subcommand. > It really depends on what iscsi target you are using. I've used fence_scsi with iscsi a few times and it has worked, but I know that some iscsi >targets have problems.

Was actually trying to do the setup on two desktop nodes before doing it on production . So for that , has thought of using a third node and configure one of the partition on that node as iscsi target? and share it among the cluster nodes. Can we use fence_scsi to fence such linux based iscsi targets ? 

Thanks
Zaman
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120705/c952ef25/attachment.htm>

From queszama at yahoo.in  Thu Jul  5 14:37:43 2012
From: queszama at yahoo.in (Zama Ques)
Date: Thu, 5 Jul 2012 22:37:43 +0800 (SGT)
Subject: [Linux-cluster] cman service stucks during booting of cluster node
Message-ID: <1341499063.54762.YahooMailNeo@web193002.mail.sg3.yahoo.com>

Hi All,

I am facing some issues with startup of? cluster nodes after configuring a node two cluster using xen virtualization and redhat cluster suite. The issue is that when i fence any of the cluster nodes using fence_xvm or by using conga interface ,? the cluster host while booting up gets stucked at starting the fencing component of the cman service . The boot process got halts there . Same happens when I reboot the host. But if do chkconfig cman off and start the cman service after the host completely boots , then? cman service start successfully without any delay including the fencing component . So , my understanding is that? there is some dependency for fencing component of cman service which is available after the host boots up . I am using xen fencing and iptables is disabled on both the nodes.?

Please provide suggestions/steps how to troubleshoot this. 


Thanks
Zaman
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120705/8a7ca02e/attachment.htm>

From ali.bendriss at gmail.com  Tue Jul 10 08:45:17 2012
From: ali.bendriss at gmail.com (Ali Bendriss)
Date: Tue, 10 Jul 2012 10:45:17 +0200
Subject: [Linux-cluster] gfs2 quota tools
Message-ID: <201207101045.17891.ali.bendriss@gmail.com>

Hello,

It's look like recent version of GFS2 use the standard linux quota tools,
but I've tried the mainstream quota-tools (ver 4.00) without success.
Which version sould be used ?

thanks

--
Ali 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120710/4b2bd18f/attachment.htm>

From swhiteho at redhat.com  Tue Jul 10 09:20:34 2012
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Tue, 10 Jul 2012 10:20:34 +0100
Subject: [Linux-cluster] gfs2 quota tools
In-Reply-To: <201207101045.17891.ali.bendriss@gmail.com>
References: <201207101045.17891.ali.bendriss@gmail.com>
Message-ID: <1341912034.2717.0.camel@menhir>

Hi,

On Tue, 2012-07-10 at 10:45 +0200, Ali Bendriss wrote:
> Hello,
> 
> It's look like recent version of GFS2 use the standard linux quota
> tools,
> 
> but I've tried the mainstream quota-tools (ver 4.00) without success.
> 
> Which version sould be used ?
> 
> thanks
> 

The quota tools should work with GFS2. Can you explain which kernel
version you were using and what exactly didn't work? What mount options
did you use?

Steve.




From ali.bendriss at gmail.com  Tue Jul 10 10:11:38 2012
From: ali.bendriss at gmail.com (Ali Bendriss)
Date: Tue, 10 Jul 2012 12:11:38 +0200
Subject: [Linux-cluster] gfs2 quota tools
In-Reply-To: <1341912034.2717.0.camel@menhir>
References: <201207101045.17891.ali.bendriss@gmail.com>
	<1341912034.2717.0.camel@menhir>
Message-ID: <201207101211.38846.ali.bendriss@gmail.com>

> Hi,
> 
> On Tue, 2012-07-10 at 10:45 +0200, Ali Bendriss wrote:
> > Hello,
> > 
> > It's look like recent version of GFS2 use the standard linux quota
> > tools,
> > 
> > but I've tried the mainstream quota-tools (ver 4.00) without success.
> > 
> > Which version sould be used ?
> > 
> > thanks
> 
> The quota tools should work with GFS2. Can you explain which kernel
> version you were using and what exactly didn't work? What mount options
> did you use?
> 
> Steve.

Sorry for the missing information:
I'm running slackware with
kernel  : 3.4.3
cluster : 3.1.92
gfsutils: 3.1.4

The file system I want to use the quota with is
/dev/mapper/shared-desktop on /home/csamba/desktop type gfs2 
(rw,noatime,nodiratime,hostdata=jid=0,quota=on)

first I was using gfs2_quota, I was able to init and set the quota for users
but get command was wrong after (when the limit is reached).
in ex:
 du -h /home/csamba/desktop/abendriss
19M     /home/csamba/desktop/abendriss

# gfs2_quota get -f /home/csamba/desktop/ -u abendriss -m
user  PARIS8\abendriss:  limit: 20.0       warn: 0.0        value: 40810.8   

 gfs2_quota init -f /home/csamba/desktop/ -u abendriss -m
mismatch: user 3000272: scan = 8, quotafile = 16
mismatch: user 3000208: scan = 8, quotafile = 16
mismatch: user 3000335: scan = 8, quotafile = 16
root at minnie:/# gfs2_quota get -f /home/csamba/desktop/ -u abendriss -m
user  PARIS8\abendriss:  limit: 20.0       warn: 0.0        value: 18.0  

Each time I need to call init to get the real value back. I was thinking that 
the value were updated each 60s but on my system it's not the case.

The I tried then the quota-tools 4.00 (from source) and get:

root at minnie:/var/tmp/quota-4/quota-tools# ./quotacheck -v -c -u 
/home/csamba/desktop/
quotacheck: Scanning /dev/dm-10 [/home/csamba/desktop] done
quotacheck: Cannot stat old user quota file on: No such file or directory. 
Usage will not be substracted.
quotacheck: Old group file name could not been determined. Usage will not be 
substracted.
quotacheck: Checked 1102 directories and 1 files
quotacheck: Cannot turn user quotas off on /dev/dm-10: Function not 
implemented
Kernel won't know about changes quotacheck did.

thanks,

--
Ali
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120710/70048de9/attachment.htm>

From akinoztopuz at yahoo.com  Wed Jul 11 09:57:16 2012
From: akinoztopuz at yahoo.com (=?utf-8?B?QUtJTiDDv2ZmZmZmZmZmZmZkNlpUT1BVWg==?=)
Date: Wed, 11 Jul 2012 02:57:16 -0700 (PDT)
Subject: [Linux-cluster] CLVM in a 3-node cluster
In-Reply-To: <4FF26F38.3040705@redhat.com>
References: <4FF1D5A4.3060105@bulbous.org> <4FF1D6E2.7010209@alteeve.ca>
	<4FF214FB.7000906@bulbous.org> <4FF26F38.3040705@redhat.com>
Message-ID: <1342000636.45149.YahooMailNeo@web125802.mail.ne1.yahoo.com>

Hi 
?
I have 2-nodes cluster without quorum disks.? noticed a problem at below:
?
?
when I want to move resources to other node it is failed?? to relocate services to other node and again services?? run the orginal node.
?
but when I want to restart node it is ok
?
could you have any ideas?
 

________________________________
 From: Fabio M. Di Nitto <fdinitto at redhat.com>
To: linux-cluster at redhat.com 
Sent: Tuesday, July 3, 2012 7:04 AM
Subject: Re: [Linux-cluster] CLVM in a 3-node cluster
  
On 07/02/2012 11:39 PM, urgrue wrote:
> On 2/7/12 19:14, Digimer wrote:
>> On 07/02/2012 01:08 PM, urgrue wrote:
>>> I'm trying to set up a 3-node cluster with clvm. Problem is, one node
>>> can't access the storage, and I'm getting:
>>> Error locking on node node3: Volume group for uuid not found: <snip>
>>> whenever I try to activate the LVs on one of the working nodes.
>>>
>>> This can't be "by design", can it?
>>
>> Does pvscan show the right device? Are all nodes in the cluster? What
>> does 'cman_tool status' and 'dlm_tool ls' show?
>>
> 
> Sorry, I realize now I was misleading, let me clarify:
> The third node cannot access the storage, this is by design. I have
> three datacenters but only two have access to the active storage. The
> third datacenter only has an async copy, and will only activate
> (manually) in case of a massive disaster (failure of both the other
> datacenters).
> So I deliberately have a failover domain with only node1 and node2.
> node3's function is to provide quorum, but also be able to be activated
> (manually is fine) in case of a massive disaster.
> In other words node3 is part of the cluster, but it can't see the
> storage during normal operation.
> Looking at it another way, it's kind of as if we had a 3-node cluster
> where one node had an HBA failure but is otherwise working. Surely node1
> and node2 should be able to continue running the services?
> So my question is, do I have an error somehwere, or is clvm really
> actually not able to function without all nodes being active and able to
> access storage?

CLVM requires a consistent view of the storage from all nodes in the
cluster. This is by design.

A storage failure during operations (aka you start with all nodes able
to access the storage and then downgrade) is handle correctly.

Fabio

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120711/a11b8571/attachment.htm>

From swhiteho at redhat.com  Wed Jul 11 11:07:44 2012
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Wed, 11 Jul 2012 12:07:44 +0100
Subject: [Linux-cluster] gfs2 quota tools
In-Reply-To: <201207101211.38846.ali.bendriss@gmail.com>
References: <201207101045.17891.ali.bendriss@gmail.com>
	<1341912034.2717.0.camel@menhir>
	<201207101211.38846.ali.bendriss@gmail.com>
Message-ID: <1342004864.2700.28.camel@menhir>

Hi,

On Tue, 2012-07-10 at 12:11 +0200, Ali Bendriss wrote:
> > Hi,
> 
> > 
> 
> > On Tue, 2012-07-10 at 10:45 +0200, Ali Bendriss wrote:
> 
> > > Hello,
> 
> > > 
> 
> > > It's look like recent version of GFS2 use the standard linux quota
> 
> > > tools,
> 
> > > 
> 
> > > but I've tried the mainstream quota-tools (ver 4.00) without
> success.
> 
> > > 
> 
> > > Which version sould be used ?
> 
> > > 
> 
> > > thanks
> 
> > 
> 
> > The quota tools should work with GFS2. Can you explain which kernel
> 
> > version you were using and what exactly didn't work? What mount
> options
> 
> > did you use?
> 
> > 
> 
> > Steve.
> 
> Sorry for the missing information:
> 
> I'm running slackware with
> 
> kernel : 3.4.3
> 
> cluster : 3.1.92
> 
> gfsutils: 3.1.4
> 
> The file system I want to use the quota with is
> 
> /dev/mapper/shared-desktop on /home/csamba/desktop type gfs2
> (rw,noatime,nodiratime,hostdata=jid=0,quota=on)
> 
That looks ok...

> first I was using gfs2_quota, I was able to init and set the quota for
> users
> 
> but get command was wrong after (when the limit is reached).
> 
> in ex:
> 
> du -h /home/csamba/desktop/abendriss
> 
> 19M /home/csamba/desktop/abendriss
> 
> # gfs2_quota get -f /home/csamba/desktop/ -u abendriss -m
> 
> user PARIS8\abendriss: limit: 20.0 warn: 0.0 value: 40810.8 
> 
> gfs2_quota init -f /home/csamba/desktop/ -u abendriss -m
> 
> mismatch: user 3000272: scan = 8, quotafile = 16
> 
> mismatch: user 3000208: scan = 8, quotafile = 16
> 
> mismatch: user 3000335: scan = 8, quotafile = 16
> 
> root at minnie:/# gfs2_quota get -f /home/csamba/desktop/ -u abendriss -m
> 
> user PARIS8\abendriss: limit: 20.0 warn: 0.0 value: 18.0 
> 
> Each time I need to call init to get the real value back. I was
> thinking that the value were updated each 60s but on my system it's
> not the case.
> 
The GFS2 quota system is such that it is possible, depending on
circumstances to sometimes exceed the quota limits. There are settings
which allow you to bound the error in time and space, with the tradeoff
being that the more accurate the quotas, the greater the overhead of the
quota management system.

That said, the number of blocks should be correct given a sync of the
quota data on the node in question, in any case. Did you sync the quota
data before examining the quota file?

> The I tried then the quota-tools 4.00 (from source) and get:
> 
> root at minnie:/var/tmp/quota-4/quota-tools# ./quotacheck -v -c
> -u /home/csamba/desktop/
> 
> quotacheck: Scanning /dev/dm-10 [/home/csamba/desktop] done
> 
> quotacheck: Cannot stat old user quota file on: No such file or
> directory. Usage will not be substracted.
> 
> quotacheck: Old group file name could not been determined. Usage will
> not be substracted.
> 
> quotacheck: Checked 1102 directories and 1 files
> 
> quotacheck: Cannot turn user quotas off on /dev/dm-10: Function not
> implemented
> 
This is true. You can't turn quotas on and off using the quota tools,
but only by using the mount arguments (and mount -o remount). I don't
think that should be required in order to run quotacheck, but Abhi can
probably confirm whether that is the case or not,

Steve.

> Kernel won't know about changes quotacheck did.
> 
> thanks,
> 
> --
> 
> Ali
> 




From urgrue at bulbous.org  Wed Jul 11 11:26:51 2012
From: urgrue at bulbous.org (urgrue)
Date: Wed, 11 Jul 2012 14:26:51 +0300
Subject: [Linux-cluster] Third node unable to join cluster
Message-ID: <1342006011.15912.140661100612353.756F3A4D@webmail.messagingengine.com>

I have a third node unable to join my cluster (RHEL 6.3). It fails at
'joining fence domain'. Though I suspect that's a bit of a red herring.
The log isn't telling me much, even though I've increased verbosity. Can
someone point me in the right direction as to how to debug?

The error: 
   Joining fence domain... fence_tool: waiting for fenced to join the
   fence group.
fence_tool: fenced not running, no lockfile

>From fenced.log:
Jul 11 13:17:54 fenced fenced 3.0.12.1 started
Jul 11 13:17:55 fenced cpg_join fenced:daemon ...

And then the only errors/warning I see in corosync.log:
Jul 11 13:17:54 corosync [CMAN  ] daemon: About to process command
Jul 11 13:17:54 corosync [CMAN  ] memb: command to process is 90
Jul 11 13:17:54 corosync [CMAN  ] memb: command return code is 0
Jul 11 13:17:54 corosync [CMAN  ] daemon: Returning command data. length
= 440
Jul 11 13:17:54 corosync [CMAN  ] daemon: sending reply 40000090 to fd
18
Jul 11 13:17:54 corosync [CMAN  ] daemon: read 0 bytes from fd 18
Jul 11 13:17:54 corosync [CMAN  ] daemon: Freed 0 queued messages
Jul 11 13:17:54 corosync [TOTEM ] Received ringid(10.128.32.22:28272)
seq 61
Jul 11 13:17:54 corosync [TOTEM ] Delivering 2 to 61
Jul 11 13:17:54 corosync [TOTEM ] Delivering 2 to 61
Jul 11 13:17:54 corosync [TOTEM ] FAILED TO RECEIVE
Jul 11 13:17:54 corosync [TOTEM ] entering GATHER state from 6.
Jul 11 13:17:54 corosync [CONFDB] lib_init_fn: conn=0xd78100
Jul 11 13:17:54 corosync [CONFDB] exit_fn for conn=0xd78100
Jul 11 13:17:54 corosync [CONFDB] lib_init_fn: conn=0xd78100
Jul 11 13:17:54 corosync [CONFDB] exit_fn for conn=0xd78100
Jul 11 13:17:54 corosync [CONFDB] lib_init_fn: conn=0xd78100
Jul 11 13:17:54 corosync [CONFDB] exit_fn for conn=0xd78100
Jul 11 13:17:54 corosync [CMAN  ] daemon: read 20 bytes from fd 18

<snip>
Jul 11 13:17:59 corosync [CMAN  ] daemon: About to process command
Jul 11 13:17:59 corosync [CMAN  ] memb: command to process is 90
Jul 11 13:17:59 corosync [CMAN  ] memb: cmd_get_node failed: id=0,
name='<CC>^?'
Jul 11 13:17:59 corosync [CMAN  ] memb: command return code is -2
Jul 11 13:17:59 corosync [CMAN  ] daemon: Returning command data. length
= 0
Jul 11 13:17:59 corosync [CMAN  ] daemon: sending reply 40000090 to fd
23
Jul 11 13:17:59 corosync [CMAN  ] daemon: read 0 bytes from fd 23
Jul 11 13:17:59 corosync [CMAN  ] daemon: Freed 0 queued messages
Jul 11 13:17:59 corosync [CMAN  ] daemon: read 20 bytes from fd 23
Jul 11 13:17:59 corosync [CMAN  ] daemon: client command is 5
Jul 11 13:17:59 corosync [CMAN  ] daemon: About to process command
Jul 11 13:17:59 corosync [CMAN  ] memb: command to process is 5
Jul 11 13:17:59 corosync [CMAN  ] daemon: Returning command data. length
= 0
Jul 11 13:17:59 corosync [CMAN  ] daemon: sending reply 40000005 to fd
23
<snip>
Back in fenced.log:
Jul 11 13:18:05 fenced daemon cpg_join error retrying
Jul 11 13:18:15 fenced daemon cpg_join error retrying
Jul 11 13:18:21 fenced daemon cpg_join error 2
Jul 11 13:18:23 fenced cpg_leave fenced:daemon ...
Jul 11 13:18:23 fenced daemon cpg_leave error 9

And in /var/log/messages:
Jul 11 13:17:50 server3 corosync[31116]:   [SERV  ] Service engine
loaded: corosync cluster quorum service v0.1
Jul 11 13:17:50 server3 corosync[31116]:   [MAIN  ] Compatibility mode
set to whitetank.  Using V1 and V2 of the synchronization engine.
Jul 11 13:17:50 server3 corosync[31116]:   [TOTEM ] A processor joined
or left the membership and a new membership was formed.
Jul 11 13:17:50 server3 corosync[31116]:   [QUORUM] Members[1]: 3
Jul 11 13:17:50 server3 corosync[31116]:   [QUORUM] Members[1]: 3
Jul 11 13:17:50 server3 ntpd[1747]: synchronized to 10.135.136.17,
stratum 1
Jul 11 13:17:50 server3 corosync[31116]:   [CPG   ] chosen downlist:
sender r(0) ip(10.130.32.32) ; members(old:0 left:0)
Jul 11 13:17:50 server3 corosync[31116]:   [MAIN  ] Completed service
synchronization, ready to provide service.
Jul 11 13:17:50 server3 corosync[31116]:   [TOTEM ] A processor joined
or left the membership and a new membership was formed.
Jul 11 13:17:50 server3 corosync[31116]:   [CMAN  ] quorum regained,
resuming activity
Jul 11 13:17:50 server3 corosync[31116]:   [QUORUM] This node is within
the primary component and will provide service.
Jul 11 13:17:50 server3 corosync[31116]:   [QUORUM] Members[2]: 2 3
Jul 11 13:17:50 server3 corosync[31116]:   [QUORUM] Members[2]: 2 3
Jul 11 13:17:54 server3 corosync[31116]:   [TOTEM ] FAILED TO RECEIVE
Jul 11 13:17:54 server3 fenced[31174]: fenced 3.0.12.1 started
Jul 11 13:17:55 server3 dlm_controld[31192]: dlm_controld 3.0.12.1
started
Jul 11 13:18:05 server3 dlm_controld[31192]: daemon cpg_join error
retrying
Jul 11 13:18:05 server3 fenced[31174]: daemon cpg_join error retrying
Jul 11 13:18:05 server3 gfs_controld[31264]: gfs_controld 3.0.12.1
started
Jul 11 13:18:15 server3 dlm_controld[31192]: daemon cpg_join error
retrying
Jul 11 13:18:15 server3 fenced[31174]: daemon cpg_join error retrying
Jul 11 13:18:15 server3 gfs_controld[31264]: daemon cpg_join error
retrying
Jul 11 13:18:19 server3 abrtd: Directory
'ccpp-2012-07-11-13:18:18-31116' creation detected
Jul 11 13:18:19 server3 abrt[31313]: Saved core dump of pid 31116
(/usr/sbin/corosync) to /var/spool/abrt/ccpp-2012-07-11-13:18:18-31116
(47955968
Jul 11 13:18:21 server3 dlm_controld[31192]: daemon cpg_join error 2
Jul 11 13:18:21 server3 gfs_controld[31264]: daemon cpg_join error 2
Jul 11 13:18:21 server3 fenced[31174]: daemon cpg_join error 2
Jul 11 13:18:23 server3 kernel: dlm: closing connection to node 3
Jul 11 13:18:23 server3 kernel: dlm: closing connection to node 2
Jul 11 13:18:23 server3 dlm_controld[31192]: daemon cpg_leave error 9
Jul 11 13:18:23 server3 gfs_controld[31264]: daemon cpg_leave error 9
Jul 11 13:18:23 server3 fenced[31174]: daemon cpg_leave error 9
Jul 11 13:18:30 server3 abrtd: Sending an email...
Jul 11 13:18:30 server3 abrtd: Email was sent to: root at localhost
Jul 11 13:18:30 server3 abrtd: Duplicate: UUID
Jul 11 13:18:30 server3 abrtd: DUP_OF_DIR:
/var/spool/abrt/ccpp-2012-07-06-10:30:40-22107
Jul 11 13:18:30 server3 abrtd: Problem directory is a duplicate of
/var/spool/abrt/ccpp-2012-07-06-10:30:40-22107
Jul 11 13:18:30 server3 abrtd: Deleting problem directory
ccpp-2012-07-11-13:18:18-31116 (dup of ccpp-2012-07-06-10:30:40-22107)


Any tips much appreciated.



From lists at alteeve.ca  Wed Jul 11 14:04:09 2012
From: lists at alteeve.ca (Digimer)
Date: Wed, 11 Jul 2012 10:04:09 -0400
Subject: [Linux-cluster] CLVM in a 3-node cluster
In-Reply-To: <1342000636.45149.YahooMailNeo@web125802.mail.ne1.yahoo.com>
References: <4FF1D5A4.3060105@bulbous.org> <4FF1D6E2.7010209@alteeve.ca>
	<4FF214FB.7000906@bulbous.org> <4FF26F38.3040705@redhat.com>
	<1342000636.45149.YahooMailNeo@web125802.mail.ne1.yahoo.com>
Message-ID: <4FFD87D9.3010109@alteeve.ca>

Please start a new thread, with a new subject, and include your
cluster.conf file please.

Digimer

On 07/11/2012 05:57 AM, AKIN ?ffffffffffd6ZTOPUZ wrote:
> Hi
>  
> I have 2-nodes cluster without quorum disks.? noticed a problem at below:
>  
>  
> when I want to move resources to other node it is failed   to relocate
> services to other node and again services   run the orginal node.
>  
> but when I want to restart node it is ok
>  
> could you have any ideas?
> 
> *From:* Fabio M. Di Nitto <fdinitto at redhat.com>
> *To:* linux-cluster at redhat.com
> *Sent:* Tuesday, July 3, 2012 7:04 AM
> *Subject:* Re: [Linux-cluster] CLVM in a 3-node cluster
> 
> On 07/02/2012 11:39 PM, urgrue wrote:
>> On 2/7/12 19:14, Digimer wrote:
>>> On 07/02/2012 01:08 PM, urgrue wrote:
>>>> I'm trying to set up a 3-node cluster with clvm. Problem is, one node
>>>> can't access the storage, and I'm getting:
>>>> Error locking on node node3: Volume group for uuid not found: <snip>
>>>> whenever I try to activate the LVs on one of the working nodes.
>>>>
>>>> This can't be "by design", can it?
>>>
>>> Does pvscan show the right device? Are all nodes in the cluster? What
>>> does 'cman_tool status' and 'dlm_tool ls' show?
>>>
>>
>> Sorry, I realize now I was misleading, let me clarify:
>> The third node cannot access the storage, this is by design. I have
>> three datacenters but only two have access to the active storage. The
>> third datacenter only has an async copy, and will only activate
>> (manually) in case of a massive disaster (failure of both the other
>> datacenters).
>> So I deliberately have a failover domain with only node1 and node2.
>> node3's function is to provide quorum, but also be able to be activated
>> (manually is fine) in case of a massive disaster.
>> In other words node3 is part of the cluster, but it can't see the
>> storage during normal operation.
>> Looking at it another way, it's kind of as if we had a 3-node cluster
>> where one node had an HBA failure but is otherwise working. Surely node1
>> and node2 should be able to continue running the services?
>> So my question is, do I have an error somehwere, or is clvm really
>> actually not able to function without all nodes being active and able to
>> access storage?
> 
> CLVM requires a consistent view of the storage from all nodes in the
> cluster. This is by design.
> 
> A storage failure during operations (aka you start with all nodes able
> to access the storage and then downgrade) is handle correctly.
> 
> Fabio
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> 
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 


-- 
Digimer
Papers and Projects: https://alteeve.com




From urgrue at bulbous.org  Thu Jul 12 07:57:28 2012
From: urgrue at bulbous.org (urgrue)
Date: Thu, 12 Jul 2012 10:57:28 +0300
Subject: [Linux-cluster] Third node unable to join cluster
In-Reply-To: <1342006011.15912.140661100612353.756F3A4D@webmail.messagingengine.com>
References: <1342006011.15912.140661100612353.756F3A4D@webmail.messagingengine.com>
Message-ID: <1342079848.3795.140661101025673.1412A83D@webmail.messagingengine.com>

Solved. It seems the issue was that it was a two-node cluster and adding
the third means the cluster has to reconfigure itself from a 2-node to a
3-node cluster which requires a restart of the cluster.
I would've expected it could give a clear error message regarding this
but seems it just silently fails instead.

On Wed, Jul 11, 2012, at 14:26, urgrue wrote:
> I have a third node unable to join my cluster (RHEL 6.3). It fails at
> 'joining fence domain'. Though I suspect that's a bit of a red herring.
> The log isn't telling me much, even though I've increased verbosity. Can
> someone point me in the right direction as to how to debug?
> 
> The error: 
>    Joining fence domain... fence_tool: waiting for fenced to join the
>    fence group.
> fence_tool: fenced not running, no lockfile
> 
> >From fenced.log:
> Jul 11 13:17:54 fenced fenced 3.0.12.1 started
> Jul 11 13:17:55 fenced cpg_join fenced:daemon ...
> 
> And then the only errors/warning I see in corosync.log:
> Jul 11 13:17:54 corosync [CMAN  ] daemon: About to process command
> Jul 11 13:17:54 corosync [CMAN  ] memb: command to process is 90
> Jul 11 13:17:54 corosync [CMAN  ] memb: command return code is 0
> Jul 11 13:17:54 corosync [CMAN  ] daemon: Returning command data. length
> = 440
> Jul 11 13:17:54 corosync [CMAN  ] daemon: sending reply 40000090 to fd
> 18
> Jul 11 13:17:54 corosync [CMAN  ] daemon: read 0 bytes from fd 18
> Jul 11 13:17:54 corosync [CMAN  ] daemon: Freed 0 queued messages
> Jul 11 13:17:54 corosync [TOTEM ] Received ringid(10.128.32.22:28272)
> seq 61
> Jul 11 13:17:54 corosync [TOTEM ] Delivering 2 to 61
> Jul 11 13:17:54 corosync [TOTEM ] Delivering 2 to 61
> Jul 11 13:17:54 corosync [TOTEM ] FAILED TO RECEIVE
> Jul 11 13:17:54 corosync [TOTEM ] entering GATHER state from 6.
> Jul 11 13:17:54 corosync [CONFDB] lib_init_fn: conn=0xd78100
> Jul 11 13:17:54 corosync [CONFDB] exit_fn for conn=0xd78100
> Jul 11 13:17:54 corosync [CONFDB] lib_init_fn: conn=0xd78100
> Jul 11 13:17:54 corosync [CONFDB] exit_fn for conn=0xd78100
> Jul 11 13:17:54 corosync [CONFDB] lib_init_fn: conn=0xd78100
> Jul 11 13:17:54 corosync [CONFDB] exit_fn for conn=0xd78100
> Jul 11 13:17:54 corosync [CMAN  ] daemon: read 20 bytes from fd 18
> 
> <snip>
> Jul 11 13:17:59 corosync [CMAN  ] daemon: About to process command
> Jul 11 13:17:59 corosync [CMAN  ] memb: command to process is 90
> Jul 11 13:17:59 corosync [CMAN  ] memb: cmd_get_node failed: id=0,
> name='<CC>^?'
> Jul 11 13:17:59 corosync [CMAN  ] memb: command return code is -2
> Jul 11 13:17:59 corosync [CMAN  ] daemon: Returning command data. length
> = 0
> Jul 11 13:17:59 corosync [CMAN  ] daemon: sending reply 40000090 to fd
> 23
> Jul 11 13:17:59 corosync [CMAN  ] daemon: read 0 bytes from fd 23
> Jul 11 13:17:59 corosync [CMAN  ] daemon: Freed 0 queued messages
> Jul 11 13:17:59 corosync [CMAN  ] daemon: read 20 bytes from fd 23
> Jul 11 13:17:59 corosync [CMAN  ] daemon: client command is 5
> Jul 11 13:17:59 corosync [CMAN  ] daemon: About to process command
> Jul 11 13:17:59 corosync [CMAN  ] memb: command to process is 5
> Jul 11 13:17:59 corosync [CMAN  ] daemon: Returning command data. length
> = 0
> Jul 11 13:17:59 corosync [CMAN  ] daemon: sending reply 40000005 to fd
> 23
> <snip>
> Back in fenced.log:
> Jul 11 13:18:05 fenced daemon cpg_join error retrying
> Jul 11 13:18:15 fenced daemon cpg_join error retrying
> Jul 11 13:18:21 fenced daemon cpg_join error 2
> Jul 11 13:18:23 fenced cpg_leave fenced:daemon ...
> Jul 11 13:18:23 fenced daemon cpg_leave error 9
> 
> And in /var/log/messages:
> Jul 11 13:17:50 server3 corosync[31116]:   [SERV  ] Service engine
> loaded: corosync cluster quorum service v0.1
> Jul 11 13:17:50 server3 corosync[31116]:   [MAIN  ] Compatibility mode
> set to whitetank.  Using V1 and V2 of the synchronization engine.
> Jul 11 13:17:50 server3 corosync[31116]:   [TOTEM ] A processor joined
> or left the membership and a new membership was formed.
> Jul 11 13:17:50 server3 corosync[31116]:   [QUORUM] Members[1]: 3
> Jul 11 13:17:50 server3 corosync[31116]:   [QUORUM] Members[1]: 3
> Jul 11 13:17:50 server3 ntpd[1747]: synchronized to 10.135.136.17,
> stratum 1
> Jul 11 13:17:50 server3 corosync[31116]:   [CPG   ] chosen downlist:
> sender r(0) ip(10.130.32.32) ; members(old:0 left:0)
> Jul 11 13:17:50 server3 corosync[31116]:   [MAIN  ] Completed service
> synchronization, ready to provide service.
> Jul 11 13:17:50 server3 corosync[31116]:   [TOTEM ] A processor joined
> or left the membership and a new membership was formed.
> Jul 11 13:17:50 server3 corosync[31116]:   [CMAN  ] quorum regained,
> resuming activity
> Jul 11 13:17:50 server3 corosync[31116]:   [QUORUM] This node is within
> the primary component and will provide service.
> Jul 11 13:17:50 server3 corosync[31116]:   [QUORUM] Members[2]: 2 3
> Jul 11 13:17:50 server3 corosync[31116]:   [QUORUM] Members[2]: 2 3
> Jul 11 13:17:54 server3 corosync[31116]:   [TOTEM ] FAILED TO RECEIVE
> Jul 11 13:17:54 server3 fenced[31174]: fenced 3.0.12.1 started
> Jul 11 13:17:55 server3 dlm_controld[31192]: dlm_controld 3.0.12.1
> started
> Jul 11 13:18:05 server3 dlm_controld[31192]: daemon cpg_join error
> retrying
> Jul 11 13:18:05 server3 fenced[31174]: daemon cpg_join error retrying
> Jul 11 13:18:05 server3 gfs_controld[31264]: gfs_controld 3.0.12.1
> started
> Jul 11 13:18:15 server3 dlm_controld[31192]: daemon cpg_join error
> retrying
> Jul 11 13:18:15 server3 fenced[31174]: daemon cpg_join error retrying
> Jul 11 13:18:15 server3 gfs_controld[31264]: daemon cpg_join error
> retrying
> Jul 11 13:18:19 server3 abrtd: Directory
> 'ccpp-2012-07-11-13:18:18-31116' creation detected
> Jul 11 13:18:19 server3 abrt[31313]: Saved core dump of pid 31116
> (/usr/sbin/corosync) to /var/spool/abrt/ccpp-2012-07-11-13:18:18-31116
> (47955968
> Jul 11 13:18:21 server3 dlm_controld[31192]: daemon cpg_join error 2
> Jul 11 13:18:21 server3 gfs_controld[31264]: daemon cpg_join error 2
> Jul 11 13:18:21 server3 fenced[31174]: daemon cpg_join error 2
> Jul 11 13:18:23 server3 kernel: dlm: closing connection to node 3
> Jul 11 13:18:23 server3 kernel: dlm: closing connection to node 2
> Jul 11 13:18:23 server3 dlm_controld[31192]: daemon cpg_leave error 9
> Jul 11 13:18:23 server3 gfs_controld[31264]: daemon cpg_leave error 9
> Jul 11 13:18:23 server3 fenced[31174]: daemon cpg_leave error 9
> Jul 11 13:18:30 server3 abrtd: Sending an email...
> Jul 11 13:18:30 server3 abrtd: Email was sent to: root at localhost
> Jul 11 13:18:30 server3 abrtd: Duplicate: UUID
> Jul 11 13:18:30 server3 abrtd: DUP_OF_DIR:
> /var/spool/abrt/ccpp-2012-07-06-10:30:40-22107
> Jul 11 13:18:30 server3 abrtd: Problem directory is a duplicate of
> /var/spool/abrt/ccpp-2012-07-06-10:30:40-22107
> Jul 11 13:18:30 server3 abrtd: Deleting problem directory
> ccpp-2012-07-11-13:18:18-31116 (dup of ccpp-2012-07-06-10:30:40-22107)
> 
> 
> Any tips much appreciated.
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From akinoztopuz at yahoo.com  Thu Jul 12 08:20:36 2012
From: akinoztopuz at yahoo.com (=?utf-8?B?QUtJTiDDv2ZmZmZmZmZmZmZkNlpUT1BVWg==?=)
Date: Thu, 12 Jul 2012 01:20:36 -0700 (PDT)
Subject: [Linux-cluster] service relocate problem  in 2 nodes cluster
Message-ID: <1342081236.45360.YahooMailNeo@web125802.mail.ne1.yahoo.com>

????Hello
?
I have 2 nodes clsuter without quorum disk.
?
I saw a problem when I moved to services to other node.
?
disk? loyout is iscsi .
?
I th?nk problem is about gfs.
when I stop service in node1? and related file systems(included in service) are unmounted from that node and I want to mount it on node2 manually ?, I?am tak?ng a message about resource busy.?
?
[root at clsn2 ~]# mount -t gfs2? /dev/mapper/SAPClusterVG_d7-SAPClusterLV_d7 /usr/sap/PRO/ASCS01
/sbin/mount.gfs2: /dev/mapper/SAPClusterVG_d7-SAPClusterLV_d7 already mounted or /usr/sap/PRO/ASCS01 busy

?
?
Could you have any ideas?
?
?
?
cluster.conf is at the below:
?
?xml version="1.0"?>
<cluster alias="testsapcluster" config_version="197" name="testsapcluster">
??????? <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
??????? <clusternodes>
??????????????? <clusternode name="clsn1.eda.com" nodeid="1" votes="1">
??????????????????????? <fence>
??????????????????????????????? <method name="1">
??????????????????????????????????????? <device name="fence_node1"/>
??????????????????????????????? </method>
??????????????????????? </fence>
??????????????? </clusternode>
??????????????? <clusternode name="clsn2.eda.com" nodeid="2" votes="1">
??????????????????????? <fence>
??????????????????????????????? <method name="1">
??????????????????????????????????????? <device name="fence_node2"/>
??????????????????????????????? </method>
??????????????????????? </fence>
??????????????? </clusternode>
??????? </clusternodes>
??????? <cman expected_votes="1" two_node="1"/>
??????? <fencedevices>
??????????????? <fencedevice agent="fence_ilo" hostname="iloclsnode1" login="clsfenceadmin" name="ClsNode1Fence" passwd="***********"/>
??????????????? <fencedevice agent="fence_ilo" hostname="iloclsnode2" login="clsfenceadmin" name="ClsNode2Fence" passwd="************"/>
??????????????? <fencedevice agent="fence_ipmilan" ipaddr="192.168.11.68" login="clsfenceadmin" name="IPMI-Node1" passwd="**********"/>
??????????????? <fencedevice agent="fence_ipmilan" ipaddr="192.168.11.67" login="clsfenceadmin" name="IPMI-Node2" passwd="**********"/>
??????????????? <fencedevice agent="fence_ipmilan" ipaddr="10.34.1.68" login="clsfenceadmin" name="IPMI_1" passwd="********"/>
??????????????? <fencedevice agent="fence_ipmilan" ipaddr="10.34.1.67" login="clsfenceadmin" name="IPMI_2" passwd="********"/>
??????????????? <fencedevice agent="fence_ipmilan" ipaddr="192.168.11.68" lanplus="1" login="clsfenceadmin" method="cycle" name="fence_node1" passwd="*******" power_wait="4"/>
??????????????? <fencedevice agent="fence_ipmilan" ipaddr="192.168.11.67" lanplus="1" login="clsfenceadmin" method="cycle" name="fence_node2" passwd="*******" power_wait="4"/>
??????? </fencedevices>
??????? <rm log_level="7">
??????????????? <failoverdomains>
??????????????????????? <failoverdomain name="sapfailover" nofailback="0" ordered="1" restricted="0">
??????????????????????????????? <failoverdomainnode name="clsn1.eda.com" priority="1"/>
??????????????????????????????? <failoverdomainnode name="clsn2.eda.com" priority="1"/>
??????????????????????? </failoverdomain>
??????????????? </failoverdomains>
??????????????? <resources>
??????????????????????? <ip address="10.34.1.111" monitor_link="1"/>
??????????????????????? <ip address="10.34.1.246" monitor_link="0"/>
??????????????????????? <ip address="10.34.1.247" monitor_link="0"/>
??????????????????????? <clusterfs device="/dev/mapper/SAPClusterVG_d7-SAPClusterLV_d7" force_unmount="1" fsid="1689" fstype="gfs2" mountpoint="/usr/sap/PRO/ASCS01" name="/usr/sap/PRO/ASCS01" self_fence="0"/>
??????????????????????? <clusterfs device="/dev/mapper/SAPClusterVG_b2-SAPClusterLV_b2" force_unmount="1" fsid="52296" fstype="gfs2" mountpoint="/oracle" name="/oracle" self_fence="0"/>
??????????????????????? <clusterfs device="/dev/mapper/SAPClusterVG_b3-SAPClusterLV_b3" force_unmount="1" fsid="25486" fstype="gfs2" mountpoint="/oracle/client" name="/oracle/client" self_fence="0"/>
??????????????????????? <clusterfs device="/dev/mapper/SAPClusterVG_b5-SAPClusterLV_b5" force_unmount="1" fsid="5895" fstype="gfs2" mountpoint="/oracle/stage" name="/oracle/stage" self_fence="0"/>
??????????????????????? <clusterfs device="/dev/mapper/SAPClusterVG_b6-SAPClusterLV_b6" force_unmount="1" fsid="19741" fstype="gfs2" mountpoint="/oracle/PRO" name="/oracle/PRO" self_fence="0"/>
??????????????????????? <clusterfs device="/dev/mapper/SAPClusterVG_b7-SAPClusterLV_b7" force_unmount="1" fsid="6452" fstype="gfs2" mountpoint="/oracle/PRO/112_64" name="/oracle/PRO/112_64" self_fence="0"/>
??????????????????????? <clusterfs device="/dev/mapper/SAPClusterVG_b8-SAPClusterLV_b8" force_unmount="1" fsid="40841" fstype="gfs2" mountpoint="/oracle/PRO/origlogA" name="/oracle/PRO/origlogA" self_fence="0"/>
??????????????????????? <clusterfs device="/dev/mapper/SAPClusterVG_b9-SAPClusterLV_b9" force_unmount="1" fsid="52787" fstype="gfs2" mountpoint="/oracle/PRO/origlogB" name="/oracle/PRO/origlogB" self_fence="0"/>
??????????????????????? <clusterfs device="/dev/mapper/SAPClusterVG_c5-SAPClusterLV_c5" force_unmount="1" fsid="22219" fstype="gfs2" mountpoint="/oracle/PRO/sapdata1" name="/oracle/PRO/sapdata1" self_fence="0"/>
??????????????????????? <clusterfs device="/dev/mapper/SAPClusterVG_b10-SAPClusterLV_b10" force_unmount="1" fsid="47722" fstype="gfs2" mountpoint="/oracle/PRO/mirrlogA" name="/oracle/PRO/mirrlogA" self_fence="0"/>
??????????????????????? <clusterfs device="/dev/mapper/SAPClusterVG_c6-SAPClusterLV_c6" force_unmount="1" fsid="1905" fstype="gfs2" mountpoint="/oracle/PRO/sapdata2" name="/oracle/PRO/sapdata2" self_fence="0"/>
??????????????????????? <clusterfs device="/dev/mapper/SAPClusterVG_c1-SAPClusterLV_c1" force_unmount="1" fsid="60368" fstype="gfs2" mountpoint="/oracle/PRO/mirrlogB" name="/oracle/PRO/mirrlogB" self_fence="0"/>
??????????????????????? <clusterfs device="/dev/mapper/SAPClusterVG_c7-SAPClusterLV_c7" force_unmount="1" fsid="14311" fstype="gfs2" mountpoint="/oracle/PRO/sapdata3" name="/oracle/PRO/sapdata3" self_fence="0"/>
??????????????????????? <clusterfs device="/dev/mapper/SAPClusterVG_c2-SAPClusterLV_c2" force_unmount="1" fsid="8037" fstype="gfs2" mountpoint="/oracle/PRO/oraarch" name="/oracle/PRO/oraarch" self_fence="0"/>
??????????????????????? <clusterfs device="/dev/mapper/SAPClusterVG_c8-SAPClusterLV_c8" force_unmount="1" fsid="41540" fstype="gfs2" mountpoint="/oracle/PRO/sapdata4" name="/oracle/PRO/sapdata4" self_fence="0"/>
??????????????????????? <clusterfs device="/dev/mapper/SAPClusterVG_c3-SAPClusterLV_c3" force_unmount="1" fsid="23164" fstype="gfs2" mountpoint="/oracle/PRO/sapreorg" name="/oracle/PRO/sapreorg" self_fence="0"/>
??????????????????????? <clusterfs device="/dev/mapper/SAPClusterVG_c9-SAPClusterLV_c9" force_unmount="1" fsid="37586" fstype="gfs2" mountpoint="/oracle/PRO/sapdata5" name="/oracle/PRO/sapdata5" self_fence="0"/>
??????????????????????? <clusterfs device="/dev/mapper/SAPClusterVG_d1-SAPClusterLV_d1" force_unmount="1" fsid="61050" fstype="gfs2" mountpoint="/software" name="/software" self_fence="0"/>
??????????????????????? <clusterfs device="/dev/mapper/SAPClusterVG_d2-SAPClusterLV_d2" force_unmount="1" fsid="45919" fstype="gfs2" mountpoint="/saptmp" name="/saptmp" self_fence="0"/>
??????????????????????? <clusterfs device="/dev/mapper/SAPClusterVG_d3-SAPClusterLV_d3" force_unmount="1" fsid="56812" fstype="gfs2" mountpoint="/usr/sap/PRO" name="/usr/sap/PRO" self_fence="0"/>
??????????????????????? <clusterfs device="/dev/mapper/SAPClusterVG_d5-SAPClusterLV_d5" force_unmount="1" fsid="47829" fstype="gfs2" mountpoint="/usr/sap/DAA" name="/usr/sap/DAA" self_fence="0"/>
??????????????????????? <clusterfs device="/dev/mapper/SAPClusterVG_d6-SAPClusterLV_d6" force_unmount="1" fsid="1394" fstype="gfs2" mountpoint="/usr/sap/hostctrl" name="/usr/sap/hostctrl" self_fence="0"/>
??????????????????????? <clusterfs device="/dev/mapper/SAPClusterVG_d8-SAPClusterLV_d8" force_unmount="1" fsid="33058" fstype="gfs2" mountpoint="/usr/sap/PRO/DVEBMGS00" name="/usr/sap/PRO/DVEBMGS00" self_fence="0"/>
??????????????????????? <clusterfs device="/dev/mapper/SAPClusterVG_b1-SAPClusterLV_b1" force_unmount="0" fsid="1822" fstype="gfs2" mountpoint="/sapmnt/PRO" name="/sapmnt/PRO" self_fence="0"/>
??????????????????????? <SAPInstance DIR_EXECUTABLE="/usr/sap/PRO/ASCS01/exe" DIR_PROFILE="/usr/sap/PRO/SYS/profile" InstanceName="PRO_ASCS01_sapproascs" START_PROFILE="START_ASCS01_sapproascs"/>
??????????????????????? <SAPDatabase DBTYPE="ORA" DIR_EXECUTABLE="/usr/sap/PRO/ASCS01/exe" NETSERVICENAME="LISTENER" SID="PRO"/>
??????????????? </resources>
??????????????? <service autostart="0" domain="sapfailover" exclusive="0" name="DB">
??????????????????????? <clusterfs fstype="gfs" ref="/oracle/PRO"/>
??????????????????????? <clusterfs fstype="gfs" ref="/oracle/PRO/112_64"/>
??????????????????????? <clusterfs fstype="gfs" ref="/oracle/PRO/origlogA"/>
??????????????????????? <clusterfs fstype="gfs" ref="/oracle/PRO/origlogB"/>
??????????????????????? <clusterfs fstype="gfs" ref="/oracle/PRO/mirrlogA"/>
??????????????????????? <clusterfs fstype="gfs" ref="/oracle/PRO/mirrlogB"/>
??????????????????????? <clusterfs fstype="gfs" ref="/oracle/PRO/oraarch"/>
??????????????????????? <clusterfs fstype="gfs" ref="/oracle/PRO/sapreorg"/>
??????????????????????? <clusterfs fstype="gfs" ref="/oracle/PRO/sapdata1"/>
??????????????????????? <clusterfs fstype="gfs" ref="/oracle/PRO/sapdata2"/>
??????????????????????? <clusterfs fstype="gfs" ref="/oracle/PRO/sapdata3"/>
??????????????????????? <clusterfs fstype="gfs" ref="/oracle/PRO/sapdata4"/>
??????????????????????? <clusterfs fstype="gfs" ref="/oracle/PRO/sapdata5"/>
??????????????????????? <ip ref="10.34.1.247"/>
??????????????? </service>
??????????????? <service autostart="0" domain="sapfailover" exclusive="1" name="sap">
??????????????????????? <ip ref="10.34.1.246"/>
??????????????????????? <clusterfs ref="/usr/sap/PRO/ASCS01"/>
??????????????? </service>
??????? </rm>
</cluster>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120712/3f8f65cf/attachment.htm>

From carlopmart at gmail.com  Thu Jul 12 09:27:49 2012
From: carlopmart at gmail.com (C. L. Martinez)
Date: Thu, 12 Jul 2012 11:27:49 +0200
Subject: [Linux-cluster] Problems using fence_virt as a fence agent for two
	kvm guests
Message-ID: <CAEjQA5JxZ1MBFkmLRMC6nSXEmbF4zgZkcUO7Z=Q16+WDY9+=qw@mail.gmail.com>

Hi all,

 I have installed two kvm guests (CentOS 6.3) to do some tests using
RHCS under a CentOS 6.3 kvm host. As a fence device I am trying to use
fence_virt, but it doesn't works for me.

 fence_virt.conf in kvm host is:

fence_virtd {
	listener = "multicast";
	backend = "libvirt";
}

listeners {
	multicast {
		key_file = "/etc/fence_virt.key";
		interface = "siemif";
		address = "225.0.0.12";
		family = "ipv4";
	}
}

backends {
	libvirt {
		uri = "qemu:///system";
	}
}

fence_virt.key is located under /etc directory:

-r-------- 1 root root 18 Jul 12 09:48 /etc/fence_virt.key

cluster.conf on both kvm guest nodes is:

<?xml version="1.0"?>
<cluster config_version="1" name="TestCluster">
	<cman expected_votes="1" two_node="1"/>
	<clusternodes>
		<clusternode name="cosnode01.domain.local" nodeid="1">
			<fence>
				<method name="kvm">
					<device action="reboot" port="cosnode01" name="kvm_cosnode01"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="cosnode02.domain.local" nodeid="2">
			<fence>
				<method name="kvm">
					<device action="reboot" port="cosnode02" name="kvm_cosnode02"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<fencedevices>
		<fencedevice agent="fence_virt" ip_family="ipv4"
multicast_address="225.0.0.12" key_file="/etc/cluster/fence_virt.key"
name="kvm_cosnode01"/>
		<fencedevice agent="fence_virt" ip_family="ipv4"
multicast_address="225.0.0.12" key_file="/etc/cluster/fence_virt.key"
name="kvm_cosnode02"/>
	</fencedevices>
	<fence_daemon post_join_delay="30"/>
	<totem rrp_mode="none" secauth="off"/>
	<rm log_level="5">
		<failoverdomains>
			<failoverdomain name="only_node01" nofailback="1" ordered="0" restricted="1">
				<failoverdomainnode name="cosnode01.domain.local"/>
			</failoverdomain>
			<failoverdomain name="only_node02" nofailback="1" ordered="0" restricted="1">
				<failoverdomainnode name="cosnode02.domain.local"/>
			</failoverdomain>
			<failoverdomain name="primary_clu01" nofailback="1" ordered="1"
restricted="1">
				<failoverdomainnode name="cosnode01.domain.local" priority="1"/>
				<failoverdomainnode name="cosnode02.domain.local" priority="2"/>
			</failoverdomain>
			<failoverdomain name="primary_clu02" nofailback="1" ordered="1"
restricted="1">
				<failoverdomainnode name="cosnode01.domain.local" priority="2"/>
				<failoverdomainnode name="cosnode02.domain.local" priority="1"/>
			</failoverdomain>
		</failoverdomains>
	</rm>
</cluster>

of course, fence_virt.key is copied under /etc/cluster dir in both nodes.

In cosnode01 I see this error:

fenced[4074]: fence cosnode02.domain.local dev 0.0 agent fence_virt
result: error from agent
fenced[4074]: fence cosnode02.domain.loca failed

What am I doing wrong?? Do I need to modify libvirtd.conf to listen in
siemif interface??

Thanks.



From a_mdl at mail.ru  Thu Jul 12 09:52:41 2012
From: a_mdl at mail.ru (=?UTF-8?B?RGVuaXMgIE1lZHZlZGV2?=)
Date: Thu, 12 Jul 2012 13:52:41 +0400
Subject: [Linux-cluster] =?utf-8?q?2-node_or_degraded_3-nodes=3F?=
Message-ID: <1342086761.686739545@f323.mail.ru>

If I will plan to add more nodes later, but have only 2 right now,
is it better to make 2-nodes cluster or degraded 3 nodes?
I recently heard that you cannot add more nodes to 2-nodes cluster without a clusterwide reboot.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120712/31c87f2b/attachment.htm>

From fdinitto at redhat.com  Thu Jul 12 10:32:24 2012
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Thu, 12 Jul 2012 12:32:24 +0200
Subject: [Linux-cluster] 2-node or degraded 3-nodes?
In-Reply-To: <1342086761.686739545@f323.mail.ru>
References: <1342086761.686739545@f323.mail.ru>
Message-ID: <4FFEA7B8.4090401@redhat.com>

On 7/12/2012 11:52 AM, Denis  Medvedev wrote:
> If I will plan to add more nodes later, but have only 2 right now,
> is it better to make 2-nodes cluster or degraded 3 nodes?
> I recently heard that you cannot add more nodes to 2-nodes cluster
> without a clusterwide reboot.

Both have advantages and disadvantages.

In your position, I would make a 2 node cluster and the schedule
downtime to add more nodes later on.

The downtime will give you time to test the new nodes, test service
relocation, fencing and so on... that no matter how good you are as
sysadmin, it?s good practice to do before placing the cluster in
production anyway.

Fabio




From linuxis4me at gmail.com  Thu Jul 12 16:19:50 2012
From: linuxis4me at gmail.com (linux admin)
Date: Thu, 12 Jul 2012 21:49:50 +0530
Subject: [Linux-cluster] Cluster documents
Message-ID: <CACLY4ooj-Th71V7OTnx5+g0VuNHtMSSNve4M7vZ6nnRx8-WuYQ@mail.gmail.com>

Hi,

Can somebody  provide me the document or SOP to make a HA-Cluster . I am
new in the Clustering filed .I want to learn HA-Cluster

please provide me configuration steps  to make the Cluster.
-- 
Thanks
Ranveer singh
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120712/8d9b8125/attachment.htm>

From lists at alteeve.ca  Thu Jul 12 16:35:27 2012
From: lists at alteeve.ca (Digimer)
Date: Thu, 12 Jul 2012 12:35:27 -0400
Subject: [Linux-cluster] Cluster documents
In-Reply-To: <CACLY4ooj-Th71V7OTnx5+g0VuNHtMSSNve4M7vZ6nnRx8-WuYQ@mail.gmail.com>
References: <CACLY4ooj-Th71V7OTnx5+g0VuNHtMSSNve4M7vZ6nnRx8-WuYQ@mail.gmail.com>
Message-ID: <4FFEFCCF.6060707@alteeve.ca>

On 07/12/2012 12:19 PM, linux admin wrote:
> 
> Hi,
> 
> Can somebody  provide me the document or SOP to make a HA-Cluster . I am
> new in the Clustering filed .I want to learn HA-Cluster 
> 
> please provide me configuration steps  to make the Cluster. 
> -- 
> Thanks
> Ranveer singh

"Cluster" is a very broad term. What exactly are you trying to make
highly available? What OS/distro?

If it's for VMs on RHEL / Centos;

https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial

Digimer

-- 
Digimer
Papers and Projects: https://alteeve.com




From washer at trlp.com  Thu Jul 12 16:49:32 2012
From: washer at trlp.com (James Washer)
Date: Thu, 12 Jul 2012 09:49:32 -0700
Subject: [Linux-cluster] Cluster documents
In-Reply-To: <CACLY4ooj-Th71V7OTnx5+g0VuNHtMSSNve4M7vZ6nnRx8-WuYQ@mail.gmail.com>
References: <CACLY4ooj-Th71V7OTnx5+g0VuNHtMSSNve4M7vZ6nnRx8-WuYQ@mail.gmail.com>
Message-ID: <CAO=CEwE4DCP8=qV2+5ZjZvxCXkM1rzZB21_ufE5dAme1kBYs-w@mail.gmail.com>

Have you read the Redhat Cluster documentation? It's a good place to start.

On Thu, Jul 12, 2012 at 9:19 AM, linux admin <linuxis4me at gmail.com> wrote:

>
> Hi,
>
> Can somebody  provide me the document or SOP to make a HA-Cluster . I am
> new in the Clustering filed .I want to learn HA-Cluster
>
> please provide me configuration steps  to make the Cluster.
> --
> Thanks
> Ranveer singh
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 


 - jim
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120712/b198a888/attachment.htm>

From delete at fedoraproject.org  Thu Jul 12 22:38:17 2012
From: delete at fedoraproject.org (Matias Kreder)
Date: Thu, 12 Jul 2012 19:38:17 -0300
Subject: [Linux-cluster] gfs_fsck estimation
Message-ID: <CAH1v0ToWd62P3tmGd1d8eSMqTJtuss4Zk3QEqrhB3sZajjaKYw@mail.gmail.com>

Hi,

I'm trying to find a method to estimate the time that gfs_fsck will
take in a specific server. I have seen a lot of different results.
Do you know of any method/procedure already written? If not, which
variables should I consider to make an estimation?
I'm thinking on considering:
- filesystem size
- number of Journals
- CPU speed/number and memory capacity

Any thoughts?

Regards
Matias Kreder



From rpeterso at redhat.com  Fri Jul 13 12:18:23 2012
From: rpeterso at redhat.com (Bob Peterson)
Date: Fri, 13 Jul 2012 08:18:23 -0400 (EDT)
Subject: [Linux-cluster] gfs_fsck estimation
In-Reply-To: <CAH1v0ToWd62P3tmGd1d8eSMqTJtuss4Zk3QEqrhB3sZajjaKYw@mail.gmail.com>
Message-ID: <82f3f5b9-e04f-4619-b41b-f9a049d5b403@zmail12.collab.prod.int.phx2.redhat.com>

----- Original Message -----
| Hi,
| 
| I'm trying to find a method to estimate the time that gfs_fsck will
| take in a specific server. I have seen a lot of different results.
| Do you know of any method/procedure already written? If not, which
| variables should I consider to make an estimation?
| I'm thinking on considering:
| - filesystem size
| - number of Journals
| - CPU speed/number and memory capacity
| 
| Any thoughts?
| 
| Regards
| Matias Kreder

Hi Matias,

I don't think it's possible to estimate the run time of gfs_fsck.
If the file system is clean, it should be doable, but the problem is
that different kinds of corruption cause especially long delays,
and that corruption is unpredictable.

Another thing to be aware of: Starting with RHEL6.3, the fsck.gfs2
is now able to analyze and repair GFS1 file systems as well as GFS2,
and it is orders of magnitude faster. It's also much more accurate in
its analysis and more able to repair corruption that gfs_fsck would
just give up and throw away.

Regards,

Bob Peterson
Red Hat File Systems



From carlopmart at gmail.com  Fri Jul 13 12:26:35 2012
From: carlopmart at gmail.com (C. L. Martinez)
Date: Fri, 13 Jul 2012 14:26:35 +0200
Subject: [Linux-cluster] Problems using fence_virt as a fence agent for
	two kvm guests
In-Reply-To: <CAEjQA5JxZ1MBFkmLRMC6nSXEmbF4zgZkcUO7Z=Q16+WDY9+=qw@mail.gmail.com>
References: <CAEjQA5JxZ1MBFkmLRMC6nSXEmbF4zgZkcUO7Z=Q16+WDY9+=qw@mail.gmail.com>
Message-ID: <CAEjQA5+R15KJN7u6DN2b4Wz9dqqOF+BKpMf6m_2Mx0vPzF=_WQ@mail.gmail.com>

On Thu, Jul 12, 2012 at 11:27 AM, C. L. Martinez <carlopmart at gmail.com> wrote:
> Hi all,
>
>  I have installed two kvm guests (CentOS 6.3) to do some tests using
> RHCS under a CentOS 6.3 kvm host. As a fence device I am trying to use
> fence_virt, but it doesn't works for me.
>
>  fence_virt.conf in kvm host is:
>
> fence_virtd {
>         listener = "multicast";
>         backend = "libvirt";
> }
>
> listeners {
>         multicast {
>                 key_file = "/etc/fence_virt.key";
>                 interface = "siemif";
>                 address = "225.0.0.12";
>                 family = "ipv4";
>         }
> }
>
> backends {
>         libvirt {
>                 uri = "qemu:///system";
>         }
> }
>
> fence_virt.key is located under /etc directory:
>
> -r-------- 1 root root 18 Jul 12 09:48 /etc/fence_virt.key
>
> cluster.conf on both kvm guest nodes is:
>
> <?xml version="1.0"?>
> <cluster config_version="1" name="TestCluster">
>         <cman expected_votes="1" two_node="1"/>
>         <clusternodes>
>                 <clusternode name="cosnode01.domain.local" nodeid="1">
>                         <fence>
>                                 <method name="kvm">
>                                         <device action="reboot" port="cosnode01" name="kvm_cosnode01"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>                 <clusternode name="cosnode02.domain.local" nodeid="2">
>                         <fence>
>                                 <method name="kvm">
>                                         <device action="reboot" port="cosnode02" name="kvm_cosnode02"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>         </clusternodes>
>         <fencedevices>
>                 <fencedevice agent="fence_virt" ip_family="ipv4"
> multicast_address="225.0.0.12" key_file="/etc/cluster/fence_virt.key"
> name="kvm_cosnode01"/>
>                 <fencedevice agent="fence_virt" ip_family="ipv4"
> multicast_address="225.0.0.12" key_file="/etc/cluster/fence_virt.key"
> name="kvm_cosnode02"/>
>         </fencedevices>
>         <fence_daemon post_join_delay="30"/>
>         <totem rrp_mode="none" secauth="off"/>
>         <rm log_level="5">
>                 <failoverdomains>
>                         <failoverdomain name="only_node01" nofailback="1" ordered="0" restricted="1">
>                                 <failoverdomainnode name="cosnode01.domain.local"/>
>                         </failoverdomain>
>                         <failoverdomain name="only_node02" nofailback="1" ordered="0" restricted="1">
>                                 <failoverdomainnode name="cosnode02.domain.local"/>
>                         </failoverdomain>
>                         <failoverdomain name="primary_clu01" nofailback="1" ordered="1"
> restricted="1">
>                                 <failoverdomainnode name="cosnode01.domain.local" priority="1"/>
>                                 <failoverdomainnode name="cosnode02.domain.local" priority="2"/>
>                         </failoverdomain>
>                         <failoverdomain name="primary_clu02" nofailback="1" ordered="1"
> restricted="1">
>                                 <failoverdomainnode name="cosnode01.domain.local" priority="2"/>
>                                 <failoverdomainnode name="cosnode02.domain.local" priority="1"/>
>                         </failoverdomain>
>                 </failoverdomains>
>         </rm>
> </cluster>
>
> of course, fence_virt.key is copied under /etc/cluster dir in both nodes.
>
> In cosnode01 I see this error:
>
> fenced[4074]: fence cosnode02.domain.local dev 0.0 agent fence_virt
> result: error from agent
> fenced[4074]: fence cosnode02.domain.loca failed
>
> What am I doing wrong?? Do I need to modify libvirtd.conf to listen in
> siemif interface??
>
> Thanks.


Please, any help??



From mkreder at gmail.com  Fri Jul 13 16:04:36 2012
From: mkreder at gmail.com (Matias Kreder)
Date: Fri, 13 Jul 2012 13:04:36 -0300
Subject: [Linux-cluster] gfs_fsck estimation
In-Reply-To: <82f3f5b9-e04f-4619-b41b-f9a049d5b403@zmail12.collab.prod.int.phx2.redhat.com>
References: <CAH1v0ToWd62P3tmGd1d8eSMqTJtuss4Zk3QEqrhB3sZajjaKYw@mail.gmail.com>
	<82f3f5b9-e04f-4619-b41b-f9a049d5b403@zmail12.collab.prod.int.phx2.redhat.com>
Message-ID: <CAH1v0Tqa=Qmz2H3KS=m_ZX+m6u2OQzd7p=hq4iWzikOfagffqw@mail.gmail.com>

On Fri, Jul 13, 2012 at 9:18 AM, Bob Peterson <rpeterso at redhat.com> wrote:
> ----- Original Message -----
> | Hi,
> |
> | I'm trying to find a method to estimate the time that gfs_fsck will
> | take in a specific server. I have seen a lot of different results.
> | Do you know of any method/procedure already written? If not, which
> | variables should I consider to make an estimation?
> | I'm thinking on considering:
> | - filesystem size
> | - number of Journals
> | - CPU speed/number and memory capacity
> |
> | Any thoughts?
> |
> | Regards
> | Matias Kreder
>
> Hi Matias,
>
> I don't think it's possible to estimate the run time of gfs_fsck.
> If the file system is clean, it should be doable, but the problem is
> that different kinds of corruption cause especially long delays,
> and that corruption is unpredictable.
>
> Another thing to be aware of: Starting with RHEL6.3, the fsck.gfs2
> is now able to analyze and repair GFS1 file systems as well as GFS2,
> and it is orders of magnitude faster. It's also much more accurate in
> its analysis and more able to repair corruption that gfs_fsck would
> just give up and throw away.
>
> Regards,
>
> Bob Peterson
> Red Hat File Systems
>
Bob,

Thanks for the explanation. I didn't give you the full scenario. I'm
looking to estimate the time of fsck before GFS to GFS2 conversion so
I can assume that filesystems are clean prior to the fsck as they are
mounted and non-corrupted filesystems.

Regards
Matias Kreder
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From rpeterso at redhat.com  Fri Jul 13 16:17:53 2012
From: rpeterso at redhat.com (Bob Peterson)
Date: Fri, 13 Jul 2012 12:17:53 -0400 (EDT)
Subject: [Linux-cluster] gfs_fsck estimation
In-Reply-To: <CAH1v0Tqa=Qmz2H3KS=m_ZX+m6u2OQzd7p=hq4iWzikOfagffqw@mail.gmail.com>
Message-ID: <3884b756-a978-4ce4-9afb-92b39c1ea97d@zmail12.collab.prod.int.phx2.redhat.com>

| Bob,
| 
| Thanks for the explanation. I didn't give you the full scenario. I'm
| looking to estimate the time of fsck before GFS to GFS2 conversion so
| I can assume that filesystems are clean prior to the fsck as they are
| mounted and non-corrupted filesystems.
| 
| Regards
| Matias Kreder

Hi Matias,

If you're on RHEL6.3 or migrating to RHEL6.3, you can move the storage,
then run the new fsck.gfs2 before doing the gfs2_convert. Save you some
time. :)

Regards,

Bob Peterson
Red Hat File Systems



From mkreder at gmail.com  Fri Jul 13 16:36:32 2012
From: mkreder at gmail.com (Matias Kreder)
Date: Fri, 13 Jul 2012 13:36:32 -0300
Subject: [Linux-cluster] gfs_fsck estimation
In-Reply-To: <3884b756-a978-4ce4-9afb-92b39c1ea97d@zmail12.collab.prod.int.phx2.redhat.com>
References: <CAH1v0Tqa=Qmz2H3KS=m_ZX+m6u2OQzd7p=hq4iWzikOfagffqw@mail.gmail.com>
	<3884b756-a978-4ce4-9afb-92b39c1ea97d@zmail12.collab.prod.int.phx2.redhat.com>
Message-ID: <CAH1v0Tpg1C=d6a=rR9sL+C52Z4QJmaPQnq=ESpDZBBtKYWcSRw@mail.gmail.com>

On Fri, Jul 13, 2012 at 1:17 PM, Bob Peterson <rpeterso at redhat.com> wrote:
> | Bob,
> |
> | Thanks for the explanation. I didn't give you the full scenario. I'm
> | looking to estimate the time of fsck before GFS to GFS2 conversion so
> | I can assume that filesystems are clean prior to the fsck as they are
> | mounted and non-corrupted filesystems.
> |
> | Regards
> | Matias Kreder
>
> Hi Matias,
>
> If you're on RHEL6.3 or migrating to RHEL6.3, you can move the storage,
> then run the new fsck.gfs2 before doing the gfs2_convert. Save you some
> time. :)
>
> Regards,
>
> Bob Peterson
> Red Hat File Systems
>
Bob,

Unfortunately we will not be migrating to RHEL6 yet but we are
migrating from GFS to GFS2 on RHEL5 to gain the GFS benefits.

Any thoughts in how to estimate the fsck time?

Regards
Matias Kreder
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From jeff.sturm at eprize.com  Fri Jul 13 16:51:29 2012
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Fri, 13 Jul 2012 16:51:29 +0000
Subject: [Linux-cluster] gfs_fsck estimation
In-Reply-To: <CAH1v0Tpg1C=d6a=rR9sL+C52Z4QJmaPQnq=ESpDZBBtKYWcSRw@mail.gmail.com>
References: <CAH1v0Tqa=Qmz2H3KS=m_ZX+m6u2OQzd7p=hq4iWzikOfagffqw@mail.gmail.com>
	<3884b756-a978-4ce4-9afb-92b39c1ea97d@zmail12.collab.prod.int.phx2.redhat.com>
	<CAH1v0Tpg1C=d6a=rR9sL+C52Z4QJmaPQnq=ESpDZBBtKYWcSRw@mail.gmail.com>
Message-ID: <B1B9801C5CBC954680D0374CC4EEABA5365D9ED1@MailNode2.eprize.local>

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com]
> On Behalf Of Matias Kreder
> Sent: Friday, July 13, 2012 12:37 PM
> 
> Unfortunately we will not be migrating to RHEL6 yet but we are migrating from GFS to
> GFS2 on RHEL5 to gain the GFS benefits.
> 
> Any thoughts in how to estimate the fsck time?

If your SAN supports LUN snapshots, you could try gfs_fsck on a snapshot first, and see how long it runs.

-Jeff





From jvdiago at gmail.com  Mon Jul 16 16:03:34 2012
From: jvdiago at gmail.com (Javier Vela)
Date: Mon, 16 Jul 2012 18:03:34 +0200
Subject: [Linux-cluster] Strange behaviours in two-node cluster
Message-ID: <CAEAM5QVKOd+SbxKWkomSgtxeMrhFK0XH20vjEYrOTSKfMzLwLw@mail.gmail.com>

Hi, two weeks ago I asked for some help building a two-node cluster with
HA-LVM. After some e-mails, finally I got my cluster working. The problem
now is that sometimes, and in some clusters (I have three clusters with the
same configuration), I got very strange behaviours.

#1 Openais detects some problem and shutdown itself. The network is Ok, is
a virtual device in vmware, shared with the other cluster hearbet networks,
and only happens in one cluster. The error messages:

Jul 16 08:50:32 node1 openais[3641]: [TOTEM] FAILED TO RECEIVE
Jul 16 08:50:32 node1 openais[3641]: [TOTEM] entering GATHER state from 6.
Jul 16 08:50:36 node1 openais[3641]: [TOTEM] entering GATHER state from 0

Do you know what can I check in order to solve the problem? I don't know
from where I should start. What makes Openais to not receive messages?


#2 I'm getting a lot of RGmanager errors when rgmanager tries to change the
service status. i.e: clusvdcam -d service. Always happens when I have the
two nodes UP. If I shutdown one node, then the command finishes
succesfully. Prior to execute the command, I always check the status with
clustat, and everything is OK:

clurgmgrd[5667]: <err> #52: Failed changing RG status

Another time, what can I check in order to detect problems with rgmanager
that clustat and cman_tool doesn't show?

#3 Sometimes, not always, a node that has been fenced cannot join the
cluster after the reboot. With clustat I can see that there is quorum:

clustat:
[root at node2 ~]# clustat
Cluster Status test_cluster @ Mon Jul 16 05:46:57 2012
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 node1-hb                                  1 Offline
 node2-hb                               2 Online, Local, rgmanager
 /dev/disk/by-path/pci-0000:02:01.0-scsi-    0 Online, Quorum Disk

 Service Name                   Owner (Last)                   State
 ------- ----                   ----- ------                   -----
 service:test                   node2-hb                  started

The log show how node2 fenced node1:

node2 messages
Jul 13 04:00:31 node2 fenced[4219]: node1 not a cluster member after 0 sec
post_fail_delay
Jul 13 04:00:31 node2 fenced[4219]: fencing node "node1"
Jul 13 04:00:36 node2 clurgmgrd[4457]: <info> Waiting for node #1 to be
fenced
Jul 13 04:01:04 node2 fenced[4219]: fence "node1" success
Jul 13 04:01:06 node2 clurgmgrd[4457]: <info> Node #1 fenced; continuing

But the node that tries to join the cluster says that there isn't quorum.
Finally. It finishes inquorate, without seeing node1 and the quorum disk.

node1 messages
Jul 16 05:48:19 node1 ccsd[4207]: Error while processing connect:
Connection refused
Jul 16 05:48:19 node1 ccsd[4207]: Cluster is not quorate.  Refusing
connection.

Have something in common the three errors?  What should I check? I've
discarded cluster configuration because cluster is working, and the  errors
doesn't appear in all the nodes. The most annoying error cureently is the
#1. Every 10-15 minutes Openais fails and the nodes gets fenced. I attach
the cluster.conf.

Thanks in advance.

Regards, Javi
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120716/ac83a827/attachment.htm>
-------------- next part --------------
<?xml version="1.0"?>
<cluster alias="test_cluster" config_version="3" name="test_cluster">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="6"/>
        <clusternodes>
                <clusternode name="node1" nodeid="1" votes="1">
                         <fence>
                                <method name="fence">
                                        <device name="fence-vmware" uuid="77777777777777777777777"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="node2" nodeid="2" votes="1">
                         <fence>
                                <method name="fence">
                                        <device name="fence-vmware" uuid="777777777777777777777771"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman two_node="0" expected_votes="3"/>
        <fencedevices>
                <fencedevice agent="fence_vmware_soap" ipaddr="XX.XX.XX.XX" login="XXX" passwd="XXXXX" ssl="1" action="reboot" name="fence-vmware"/>
        </fencedevices>
        <rm log_facility="local4" log_level="7">
                <failoverdomains>
                        <failoverdomain name="test_cluster_fo" nofailback="1" ordered="1" restricted="1">
                                <failoverdomainnode name="node1" priority="1"/>
                                <failoverdomainnode name="node2" priority="2"/>
                        </failoverdomain>
                </failoverdomains>
        <resources/>
        <service autostart="1" domain="test_cluster_fo" exclusive="0" name="web_service" recovery="relocate">
                <ip address="192.168.1.1" monitor_link="1"/>
                <lvm name="vg_www" vg_name="vg_www" lv_name="www"/>
                <lvm name="vg_mysql" vg_name="vg_mysql" lv_name="mysql"/>
                <fs device="/dev/vg_www/www" force_fsck="1" force_unmount="1" fstype="ext3" mountpoint="/var/www" name="www" self_fence="0"/>
                <fs device="/dev/vg_mysql/mysql" force_fsck="1" force_unmount="1" fstype="ext3" mountpoint="/var/lib/pgsql" name="mysql" self_fence="0"/>
                <script file="/etc/init.d/mysql" name="mysql"/>
				<script file="/etc/init.d/httpd" name="httpd"/>
        </service>
        </rm>
        <totem consensus="4000" join="60" token="20000" token_retransmits_before_loss_const="20"/>
        <quorumd interval="1" label="test_qdisk" tko="10" votes="1">
                <heuristic program="/usr/share/cluster/check_eth_link.sh eth0" score="1" interval="2" tko="3"/>
        </quorumd>
 </cluster>

From lists at alteeve.ca  Mon Jul 16 17:20:16 2012
From: lists at alteeve.ca (Digimer)
Date: Mon, 16 Jul 2012 13:20:16 -0400
Subject: [Linux-cluster] Strange behaviours in two-node cluster
In-Reply-To: <CAEAM5QVKOd+SbxKWkomSgtxeMrhFK0XH20vjEYrOTSKfMzLwLw@mail.gmail.com>
References: <CAEAM5QVKOd+SbxKWkomSgtxeMrhFK0XH20vjEYrOTSKfMzLwLw@mail.gmail.com>
Message-ID: <50044D50.5060008@alteeve.ca>

Why did you set 'two_node="0" expected_votes="3"' on a two node cluster?
With this, losing a node will mean you lose quorum and all cluster
activity will stop. Please change this to 'two_node="1" expected_votes="1"'.

Did you confirm that your fencing actually works? Does 'fence_node
node1' and 'fence_node node2' actually kill the target?

Are you running into multicast issues? If your switch (virtual or real)
purges multicast groups periodically, it will break the cluster.

What version of the cluster software and what distro are you using?

Digimer


On 07/16/2012 12:03 PM, Javier Vela wrote:
> Hi, two weeks ago I asked for some help building a two-node cluster with
> HA-LVM. After some e-mails, finally I got my cluster working. The
> problem now is that sometimes, and in some clusters (I have three
> clusters with the same configuration), I got very strange behaviours.
> 
> #1 Openais detects some problem and shutdown itself. The network is Ok,
> is a virtual device in vmware, shared with the other cluster hearbet
> networks, and only happens in one cluster. The error messages:
> 
> Jul 16 08:50:32 node1 openais[3641]: [TOTEM] FAILED TO RECEIVE
> Jul 16 08:50:32 node1 openais[3641]: [TOTEM] entering GATHER state from 6.
> Jul 16 08:50:36 node1 openais[3641]: [TOTEM] entering GATHER state from 0
> 
> Do you know what can I check in order to solve the problem? I don't know
> from where I should start. What makes Openais to not receive messages?
> 
> 
> #2 I'm getting a lot of RGmanager errors when rgmanager tries to change
> the service status. i.e: clusvdcam -d service. Always happens when I
> have the two nodes UP. If I shutdown one node, then the command finishes
> succesfully. Prior to execute the command, I always check the status
> with clustat, and everything is OK:
> 
> clurgmgrd[5667]: <err> #52: Failed changing RG status
> 
> Another time, what can I check in order to detect problems with
> rgmanager that clustat and cman_tool doesn't show?
> 
> #3 Sometimes, not always, a node that has been fenced cannot join the
> cluster after the reboot. With clustat I can see that there is quorum:
> 
> clustat:
> [root at node2 ~]# clustat
> Cluster Status test_cluster @ Mon Jul 16 05:46:57 2012
> Member Status: Quorate
> 
>  Member Name                             ID   Status
>  ------ ----                             ---- ------
>  node1-hb                                  1 Offline
>  node2-hb                               2 Online, Local, rgmanager
>  /dev/disk/by-path/pci-0000:02:01.0-scsi-    0 Online, Quorum Disk
> 
>  Service Name                   Owner (Last)                   State
>  ------- ----                   ----- ------                   -----
>  service:test                   node2-hb                  started
> 
> The log show how node2 fenced node1:
> 
> node2 messages
> Jul 13 04:00:31 node2 fenced[4219]: node1 not a cluster member after 0
> sec post_fail_delay
> Jul 13 04:00:31 node2 fenced[4219]: fencing node "node1"
> Jul 13 04:00:36 node2 clurgmgrd[4457]: <info> Waiting for node #1 to be
> fenced
> Jul 13 04:01:04 node2 fenced[4219]: fence "node1" success
> Jul 13 04:01:06 node2 clurgmgrd[4457]: <info> Node #1 fenced; continuing
> 
> But the node that tries to join the cluster says that there isn't
> quorum. Finally. It finishes inquorate, without seeing node1 and the
> quorum disk.
> 
> node1 messages
> Jul 16 05:48:19 node1 ccsd[4207]: Error while processing connect:
> Connection refused
> Jul 16 05:48:19 node1 ccsd[4207]: Cluster is not quorate.  Refusing
> connection.
> 
> Have something in common the three errors?  What should I check? I've
> discarded cluster configuration because cluster is working, and the 
> errors doesn't appear in all the nodes. The most annoying error
> cureently is the #1. Every 10-15 minutes Openais fails and the nodes
> gets fenced. I attach the cluster.conf.
> 
> Thanks in advance.
> 
> Regards, Javi
> 
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 


-- 
Digimer
Papers and Projects: https://alteeve.com




From jvdiago at gmail.com  Mon Jul 16 18:46:27 2012
From: jvdiago at gmail.com (Javier Vela)
Date: Mon, 16 Jul 2012 20:46:27 +0200
Subject: [Linux-cluster] Strange behaviours in two-node cluster
In-Reply-To: <50044D50.5060008@alteeve.ca>
References: <CAEAM5QVKOd+SbxKWkomSgtxeMrhFK0XH20vjEYrOTSKfMzLwLw@mail.gmail.com>
	<50044D50.5060008@alteeve.ca>
Message-ID: <CAEAM5QUtVLLhK_1kmyMCdR-Yz5xtmUtTZygmg5EX0hrmo5Nv_g@mail.gmail.com>

Hi,

I set two_node=0 in purpose, because of I use a quorum disk with one
additional vote. If one one fails, I still have two votes, and the cluster
remains quorate, avoiding the split-brain situation. Is this approach
wrong? In my tests, this aspect of the quorum worked well.

Fencing works very well. When something happens, the fencing kills the
faulting server without any problems.

The first time I ran into problems I cheked multicast traffic between the
nodes with iperf and everything appeared to be OK. What I don't know is how
works the purge you said. I didn't know that any purge was running
whatsoever. How can I check if is happening? Moreover, when I did the test
only one cluster was running. Now there are 3 cluster running in the same
virtual switch.


Software:

Red Hat Enterprise Linux Server release 5.7 (Tikanga)
cman-2.0.115-85.el5
rgmanager-2.0.52-21.el5
openais-0.80.6-30.el5


 Regards, Javi

2012/7/16 Digimer <lists at alteeve.ca>

> Why did you set 'two_node="0" expected_votes="3"' on a two node cluster?
> With this, losing a node will mean you lose quorum and all cluster
> activity will stop. Please change this to 'two_node="1"
> expected_votes="1"'.
>
> Did you confirm that your fencing actually works? Does 'fence_node
> node1' and 'fence_node node2' actually kill the target?
>
> Are you running into multicast issues? If your switch (virtual or real)
> purges multicast groups periodically, it will break the cluster.
>
> What version of the cluster software and what distro are you using?
>
> Digimer
>
>
> On 07/16/2012 12:03 PM, Javier Vela wrote:
> > Hi, two weeks ago I asked for some help building a two-node cluster with
> > HA-LVM. After some e-mails, finally I got my cluster working. The
> > problem now is that sometimes, and in some clusters (I have three
> > clusters with the same configuration), I got very strange behaviours.
> >
> > #1 Openais detects some problem and shutdown itself. The network is Ok,
> > is a virtual device in vmware, shared with the other cluster hearbet
> > networks, and only happens in one cluster. The error messages:
> >
> > Jul 16 08:50:32 node1 openais[3641]: [TOTEM] FAILED TO RECEIVE
> > Jul 16 08:50:32 node1 openais[3641]: [TOTEM] entering GATHER state from
> 6.
> > Jul 16 08:50:36 node1 openais[3641]: [TOTEM] entering GATHER state from 0
> >
> > Do you know what can I check in order to solve the problem? I don't know
> > from where I should start. What makes Openais to not receive messages?
> >
> >
> > #2 I'm getting a lot of RGmanager errors when rgmanager tries to change
> > the service status. i.e: clusvdcam -d service. Always happens when I
> > have the two nodes UP. If I shutdown one node, then the command finishes
> > succesfully. Prior to execute the command, I always check the status
> > with clustat, and everything is OK:
> >
> > clurgmgrd[5667]: <err> #52: Failed changing RG status
> >
> > Another time, what can I check in order to detect problems with
> > rgmanager that clustat and cman_tool doesn't show?
> >
> > #3 Sometimes, not always, a node that has been fenced cannot join the
> > cluster after the reboot. With clustat I can see that there is quorum:
> >
> > clustat:
> > [root at node2 ~]# clustat
> > Cluster Status test_cluster @ Mon Jul 16 05:46:57 2012
> > Member Status: Quorate
> >
> >  Member Name                             ID   Status
> >  ------ ----                             ---- ------
> >  node1-hb                                  1 Offline
> >  node2-hb                               2 Online, Local, rgmanager
> >  /dev/disk/by-path/pci-0000:02:01.0-scsi-    0 Online, Quorum Disk
> >
> >  Service Name                   Owner (Last)                   State
> >  ------- ----                   ----- ------                   -----
> >  service:test                   node2-hb                  started
> >
> > The log show how node2 fenced node1:
> >
> > node2 messages
> > Jul 13 04:00:31 node2 fenced[4219]: node1 not a cluster member after 0
> > sec post_fail_delay
> > Jul 13 04:00:31 node2 fenced[4219]: fencing node "node1"
> > Jul 13 04:00:36 node2 clurgmgrd[4457]: <info> Waiting for node #1 to be
> > fenced
> > Jul 13 04:01:04 node2 fenced[4219]: fence "node1" success
> > Jul 13 04:01:06 node2 clurgmgrd[4457]: <info> Node #1 fenced; continuing
> >
> > But the node that tries to join the cluster says that there isn't
> > quorum. Finally. It finishes inquorate, without seeing node1 and the
> > quorum disk.
> >
> > node1 messages
> > Jul 16 05:48:19 node1 ccsd[4207]: Error while processing connect:
> > Connection refused
> > Jul 16 05:48:19 node1 ccsd[4207]: Cluster is not quorate.  Refusing
> > connection.
> >
> > Have something in common the three errors?  What should I check? I've
> > discarded cluster configuration because cluster is working, and the
> > errors doesn't appear in all the nodes. The most annoying error
> > cureently is the #1. Every 10-15 minutes Openais fails and the nodes
> > gets fenced. I attach the cluster.conf.
> >
> > Thanks in advance.
> >
> > Regards, Javi
> >
> >
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> >
>
>
> --
> Digimer
> Papers and Projects: https://alteeve.com
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120716/e7b9e685/attachment.htm>

From akinoztopuz at yahoo.com  Tue Jul 17 05:39:46 2012
From: akinoztopuz at yahoo.com (=?utf-8?B?QUtJTiDDv2ZmZmZmZmZmZmZkNlpUT1BVWg==?=)
Date: Mon, 16 Jul 2012 22:39:46 -0700 (PDT)
Subject: [Linux-cluster] re-post
Message-ID: <1342503586.78659.YahooMailNeo@web125804.mail.ne1.yahoo.com>

????????Hello
?
I am sending my post again?:are there anybody came accross with this issue before?
?
?
?I have 2 nodes clsuter without quorum disk. 
I saw a problem when I moved to services to other 
node. 
disk  loyout is iscsi . 
I th?nk problem is about gfs.
when I stop service in node1  and related file 
systems(included in service) are unmounted from that node and I want to mount it 
on node2 manually  , I am tak?ng a message about resource busy.  
[root clsn2 ~]# mount -t gfs2  
/dev/mapper/SAPClusterVG_d7-SAPClusterLV_d7 
/usr/sap/PRO/ASCS01
/sbin/mount.gfs2: 
/dev/mapper/SAPClusterVG_d7-SAPClusterLV_d7 already mounted or 
/usr/sap/PRO/ASCS01 busy
  
Could you have any ideas?   
cluster.conf is at the below: 
?xml version="1.0"?>
<cluster 
alias="testsapcluster" config_version="197" name="testsapcluster">
<fence_daemon clean_start="0" post_fail_delay="0" 
post_join_delay="3"/>
<clusternodes>
<clusternode name="clsn1.eda.com" nodeid="1" 
votes="1">
<fence>
<method 
name="1">
<device 
name="fence_node1"/>
</method>
</fence>
</clusternode>
<clusternode name="clsn2.eda.com" 
nodeid="2" votes="1">
<fence>
<method 
name="1">
<device 
name="fence_node2"/>
</method>
</fence>
</clusternode>
</clusternodes>
<cman 
expected_votes="1" two_node="1"/>
<fencedevices>
<fencedevice agent="fence_ilo" 
hostname="iloclsnode1" login="clsfenceadmin" name="ClsNode1Fence" 
passwd="***********"/>
<fencedevice agent="fence_ilo" 
hostname="iloclsnode2" login="clsfenceadmin" name="ClsNode2Fence" 
passwd="************"/>
<fencedevice 
agent="fence_ipmilan" ipaddr="192.168.11.68" login="clsfenceadmin" 
name="IPMI-Node1" passwd="**********"/>
<fencedevice 
agent="fence_ipmilan" ipaddr="192.168.11.67" login="clsfenceadmin" 
name="IPMI-Node2" passwd="**********"/>
<fencedevice 
agent="fence_ipmilan" ipaddr="10.34.1.68" login="clsfenceadmin" name="IPMI_1" 
passwd="********"/>
<fencedevice agent="fence_ipmilan" 
ipaddr="10.34.1.67" login="clsfenceadmin" name="IPMI_2" 
passwd="********"/>
<fencedevice agent="fence_ipmilan" 
ipaddr="192.168.11.68" lanplus="1" login="clsfenceadmin" method="cycle" 
name="fence_node1" passwd="*******" power_wait="4"/>
<fencedevice agent="fence_ipmilan" ipaddr="192.168.11.67" lanplus="1" 
login="clsfenceadmin" method="cycle" name="fence_node2" passwd="*******" 
power_wait="4"/>
</fencedevices>
<rm 
log_level="7">
<failoverdomains>
<failoverdomain 
name="sapfailover" nofailback="0" ordered="1" 
restricted="0">
<failoverdomainnode 
name="clsn1.eda.com" priority="1"/>
<failoverdomainnode name="clsn2.eda.com" 
priority="1"/>
</failoverdomain>
</failoverdomains>
<resources>
<ip address="10.34.1.111" 
monitor_link="1"/>
<ip address="10.34.1.246" 
monitor_link="0"/>
<ip address="10.34.1.247" 
monitor_link="0"/>
<clusterfs 
device="/dev/mapper/SAPClusterVG_d7-SAPClusterLV_d7" force_unmount="1" 
fsid="1689" fstype="gfs2" mountpoint="/usr/sap/PRO/ASCS01" 
name="/usr/sap/PRO/ASCS01" self_fence="0"/>
<clusterfs device="/dev/mapper/SAPClusterVG_b2-SAPClusterLV_b2" 
force_unmount="1" fsid="52296" fstype="gfs2" mountpoint="/oracle" name="/oracle" 
self_fence="0"/>
<clusterfs 
device="/dev/mapper/SAPClusterVG_b3-SAPClusterLV_b3" force_unmount="1" 
fsid="25486" fstype="gfs2" mountpoint="/oracle/client" name="/oracle/client" 
self_fence="0"/>
<clusterfs 
device="/dev/mapper/SAPClusterVG_b5-SAPClusterLV_b5" force_unmount="1" 
fsid="5895" fstype="gfs2" mountpoint="/oracle/stage" name="/oracle/stage" 
self_fence="0"/>
<clusterfs 
device="/dev/mapper/SAPClusterVG_b6-SAPClusterLV_b6" force_unmount="1" 
fsid="19741" fstype="gfs2" mountpoint="/oracle/PRO" name="/oracle/PRO" 
self_fence="0"/>
<clusterfs 
device="/dev/mapper/SAPClusterVG_b7-SAPClusterLV_b7" force_unmount="1" 
fsid="6452" fstype="gfs2" mountpoint="/oracle/PRO/112_64" 
name="/oracle/PRO/112_64" self_fence="0"/>
<clusterfs device="/dev/mapper/SAPClusterVG_b8-SAPClusterLV_b8" 
force_unmount="1" fsid="40841" fstype="gfs2" mountpoint="/oracle/PRO/origlogA" 
name="/oracle/PRO/origlogA" self_fence="0"/>
<clusterfs device="/dev/mapper/SAPClusterVG_b9-SAPClusterLV_b9" 
force_unmount="1" fsid="52787" fstype="gfs2" mountpoint="/oracle/PRO/origlogB" 
name="/oracle/PRO/origlogB" self_fence="0"/>
<clusterfs device="/dev/mapper/SAPClusterVG_c5-SAPClusterLV_c5" 
force_unmount="1" fsid="22219" fstype="gfs2" mountpoint="/oracle/PRO/sapdata1" 
name="/oracle/PRO/sapdata1" self_fence="0"/>
<clusterfs device="/dev/mapper/SAPClusterVG_b10-SAPClusterLV_b10" 
force_unmount="1" fsid="47722" fstype="gfs2" mountpoint="/oracle/PRO/mirrlogA" 
name="/oracle/PRO/mirrlogA" self_fence="0"/>
<clusterfs device="/dev/mapper/SAPClusterVG_c6-SAPClusterLV_c6" 
force_unmount="1" fsid="1905" fstype="gfs2" mountpoint="/oracle/PRO/sapdata2" 
name="/oracle/PRO/sapdata2" self_fence="0"/>
<clusterfs device="/dev/mapper/SAPClusterVG_c1-SAPClusterLV_c1" 
force_unmount="1" fsid="60368" fstype="gfs2" mountpoint="/oracle/PRO/mirrlogB" 
name="/oracle/PRO/mirrlogB" self_fence="0"/>
<clusterfs device="/dev/mapper/SAPClusterVG_c7-SAPClusterLV_c7" 
force_unmount="1" fsid="14311" fstype="gfs2" mountpoint="/oracle/PRO/sapdata3" 
name="/oracle/PRO/sapdata3" self_fence="0"/>
<clusterfs device="/dev/mapper/SAPClusterVG_c2-SAPClusterLV_c2" 
force_unmount="1" fsid="8037" fstype="gfs2" mountpoint="/oracle/PRO/oraarch" 
name="/oracle/PRO/oraarch" self_fence="0"/>
<clusterfs device="/dev/mapper/SAPClusterVG_c8-SAPClusterLV_c8" 
force_unmount="1" fsid="41540" fstype="gfs2" mountpoint="/oracle/PRO/sapdata4" 
name="/oracle/PRO/sapdata4" self_fence="0"/>
<clusterfs device="/dev/mapper/SAPClusterVG_c3-SAPClusterLV_c3" 
force_unmount="1" fsid="23164" fstype="gfs2" mountpoint="/oracle/PRO/sapreorg" 
name="/oracle/PRO/sapreorg" self_fence="0"/>
<clusterfs device="/dev/mapper/SAPClusterVG_c9-SAPClusterLV_c9" 
force_unmount="1" fsid="37586" fstype="gfs2" mountpoint="/oracle/PRO/sapdata5" 
name="/oracle/PRO/sapdata5" self_fence="0"/>
<clusterfs device="/dev/mapper/SAPClusterVG_d1-SAPClusterLV_d1" 
force_unmount="1" fsid="61050" fstype="gfs2" mountpoint="/software" 
name="/software" self_fence="0"/>
<clusterfs 
device="/dev/mapper/SAPClusterVG_d2-SAPClusterLV_d2" force_unmount="1" 
fsid="45919" fstype="gfs2" mountpoint="/saptmp" name="/saptmp" 
self_fence="0"/>
<clusterfs 
device="/dev/mapper/SAPClusterVG_d3-SAPClusterLV_d3" force_unmount="1" 
fsid="56812" fstype="gfs2" mountpoint="/usr/sap/PRO" name="/usr/sap/PRO" 
self_fence="0"/>
<clusterfs 
device="/dev/mapper/SAPClusterVG_d5-SAPClusterLV_d5" force_unmount="1" 
fsid="47829" fstype="gfs2" mountpoint="/usr/sap/DAA" name="/usr/sap/DAA" 
self_fence="0"/>
<clusterfs 
device="/dev/mapper/SAPClusterVG_d6-SAPClusterLV_d6" force_unmount="1" 
fsid="1394" fstype="gfs2" mountpoint="/usr/sap/hostctrl" 
name="/usr/sap/hostctrl" self_fence="0"/>
<clusterfs device="/dev/mapper/SAPClusterVG_d8-SAPClusterLV_d8" 
force_unmount="1" fsid="33058" fstype="gfs2" mountpoint="/usr/sap/PRO/DVEBMGS00" 
name="/usr/sap/PRO/DVEBMGS00" self_fence="0"/>
<clusterfs device="/dev/mapper/SAPClusterVG_b1-SAPClusterLV_b1" 
force_unmount="0" fsid="1822" fstype="gfs2" mountpoint="/sapmnt/PRO" 
name="/sapmnt/PRO" self_fence="0"/>
<SAPInstance DIR_EXECUTABLE="/usr/sap/PRO/ASCS01/exe" 
DIR_PROFILE="/usr/sap/PRO/SYS/profile" InstanceName="PRO_ASCS01_sapproascs" 
START_PROFILE="START_ASCS01_sapproascs"/>
<SAPDatabase DBTYPE="ORA" DIR_EXECUTABLE="/usr/sap/PRO/ASCS01/exe" 
NETSERVICENAME="LISTENER" SID="PRO"/>
</resources>
<service autostart="0" 
domain="sapfailover" exclusive="0" name="DB">
<clusterfs fstype="gfs" ref="/oracle/PRO"/>
<clusterfs fstype="gfs" 
ref="/oracle/PRO/112_64"/>
<clusterfs 
fstype="gfs" ref="/oracle/PRO/origlogA"/>
<clusterfs fstype="gfs" 
ref="/oracle/PRO/origlogB"/>
<clusterfs 
fstype="gfs" ref="/oracle/PRO/mirrlogA"/>
<clusterfs fstype="gfs" 
ref="/oracle/PRO/mirrlogB"/>
<clusterfs 
fstype="gfs" ref="/oracle/PRO/oraarch"/>
<clusterfs fstype="gfs" 
ref="/oracle/PRO/sapreorg"/>
<clusterfs 
fstype="gfs" ref="/oracle/PRO/sapdata1"/>
<clusterfs fstype="gfs" 
ref="/oracle/PRO/sapdata2"/>
<clusterfs 
fstype="gfs" ref="/oracle/PRO/sapdata3"/>
<clusterfs fstype="gfs" 
ref="/oracle/PRO/sapdata4"/>
<clusterfs 
fstype="gfs" ref="/oracle/PRO/sapdata5"/>
<ip 
ref="10.34.1.247"/>
</service>
<service autostart="0" domain="sapfailover" exclusive="1" 
name="sap">
<ip 
ref="10.34.1.246"/>
<clusterfs 
ref="/usr/sap/PRO/ASCS01"/>
</service>
</rm>
</cluster>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120716/b49f39dd/attachment.htm>

From jvdiago at gmail.com  Tue Jul 17 07:30:47 2012
From: jvdiago at gmail.com (Javier Vela)
Date: Tue, 17 Jul 2012 09:30:47 +0200
Subject: [Linux-cluster] Strange behaviours in two-node cluster
In-Reply-To: <CAEAM5QUtVLLhK_1kmyMCdR-Yz5xtmUtTZygmg5EX0hrmo5Nv_g@mail.gmail.com>
References: <CAEAM5QVKOd+SbxKWkomSgtxeMrhFK0XH20vjEYrOTSKfMzLwLw@mail.gmail.com>
	<50044D50.5060008@alteeve.ca>
	<CAEAM5QUtVLLhK_1kmyMCdR-Yz5xtmUtTZygmg5EX0hrmo5Nv_g@mail.gmail.com>
Message-ID: <CAEAM5QX-djOYBq0Tm3sYpDRzAFfwp3fY=7g2FbWBg+n4Gr4BqA@mail.gmail.com>

Hi, I'm also seeing a lot of log entries in the logs like that:

openais[4264]: [TOTEM] Retransmit List: 34 35 36 37 38 39 3a 3b 3c

I've searched through internet and this happens when there are some delay
between the nodes, but openais its supposed to recover gracefully. Can this
be a problem?

2012/7/16 Javier Vela <jvdiago at gmail.com>

> Hi,
>
> I set two_node=0 in purpose, because of I use a quorum disk with one
> additional vote. If one one fails, I still have two votes, and the cluster
> remains quorate, avoiding the split-brain situation. Is this approach
> wrong? In my tests, this aspect of the quorum worked well.
>
> Fencing works very well. When something happens, the fencing kills the
> faulting server without any problems.
>
> The first time I ran into problems I cheked multicast traffic between the
> nodes with iperf and everything appeared to be OK. What I don't know is how
> works the purge you said. I didn't know that any purge was running
> whatsoever. How can I check if is happening? Moreover, when I did the test
> only one cluster was running. Now there are 3 cluster running in the same
> virtual switch.
>
>
> Software:
>
> Red Hat Enterprise Linux Server release 5.7 (Tikanga)
> cman-2.0.115-85.el5
> rgmanager-2.0.52-21.el5
> openais-0.80.6-30.el5
>
>
>  Regards, Javi
>
>
> 2012/7/16 Digimer <lists at alteeve.ca>
>
>> Why did you set 'two_node="0" expected_votes="3"' on a two node cluster?
>> With this, losing a node will mean you lose quorum and all cluster
>> activity will stop. Please change this to 'two_node="1"
>> expected_votes="1"'.
>>
>> Did you confirm that your fencing actually works? Does 'fence_node
>> node1' and 'fence_node node2' actually kill the target?
>>
>> Are you running into multicast issues? If your switch (virtual or real)
>> purges multicast groups periodically, it will break the cluster.
>>
>> What version of the cluster software and what distro are you using?
>>
>> Digimer
>>
>>
>> On 07/16/2012 12:03 PM, Javier Vela wrote:
>> > Hi, two weeks ago I asked for some help building a two-node cluster with
>> > HA-LVM. After some e-mails, finally I got my cluster working. The
>> > problem now is that sometimes, and in some clusters (I have three
>> > clusters with the same configuration), I got very strange behaviours.
>> >
>> > #1 Openais detects some problem and shutdown itself. The network is Ok,
>> > is a virtual device in vmware, shared with the other cluster hearbet
>> > networks, and only happens in one cluster. The error messages:
>> >
>> > Jul 16 08:50:32 node1 openais[3641]: [TOTEM] FAILED TO RECEIVE
>> > Jul 16 08:50:32 node1 openais[3641]: [TOTEM] entering GATHER state from
>> 6.
>> > Jul 16 08:50:36 node1 openais[3641]: [TOTEM] entering GATHER state from
>> 0
>> >
>> > Do you know what can I check in order to solve the problem? I don't know
>> > from where I should start. What makes Openais to not receive messages?
>> >
>> >
>> > #2 I'm getting a lot of RGmanager errors when rgmanager tries to change
>> > the service status. i.e: clusvdcam -d service. Always happens when I
>> > have the two nodes UP. If I shutdown one node, then the command finishes
>> > succesfully. Prior to execute the command, I always check the status
>> > with clustat, and everything is OK:
>> >
>> > clurgmgrd[5667]: <err> #52: Failed changing RG status
>> >
>> > Another time, what can I check in order to detect problems with
>> > rgmanager that clustat and cman_tool doesn't show?
>> >
>> > #3 Sometimes, not always, a node that has been fenced cannot join the
>> > cluster after the reboot. With clustat I can see that there is quorum:
>> >
>> > clustat:
>> > [root at node2 ~]# clustat
>> > Cluster Status test_cluster @ Mon Jul 16 05:46:57 2012
>> > Member Status: Quorate
>> >
>> >  Member Name                             ID   Status
>> >  ------ ----                             ---- ------
>> >  node1-hb                                  1 Offline
>> >  node2-hb                               2 Online, Local, rgmanager
>> >  /dev/disk/by-path/pci-0000:02:01.0-scsi-    0 Online, Quorum Disk
>> >
>> >  Service Name                   Owner (Last)                   State
>> >  ------- ----                   ----- ------                   -----
>> >  service:test                   node2-hb                  started
>> >
>> > The log show how node2 fenced node1:
>> >
>> > node2 messages
>> > Jul 13 04:00:31 node2 fenced[4219]: node1 not a cluster member after 0
>> > sec post_fail_delay
>> > Jul 13 04:00:31 node2 fenced[4219]: fencing node "node1"
>> > Jul 13 04:00:36 node2 clurgmgrd[4457]: <info> Waiting for node #1 to be
>> > fenced
>> > Jul 13 04:01:04 node2 fenced[4219]: fence "node1" success
>> > Jul 13 04:01:06 node2 clurgmgrd[4457]: <info> Node #1 fenced; continuing
>> >
>> > But the node that tries to join the cluster says that there isn't
>> > quorum. Finally. It finishes inquorate, without seeing node1 and the
>> > quorum disk.
>> >
>> > node1 messages
>> > Jul 16 05:48:19 node1 ccsd[4207]: Error while processing connect:
>> > Connection refused
>> > Jul 16 05:48:19 node1 ccsd[4207]: Cluster is not quorate.  Refusing
>> > connection.
>> >
>> > Have something in common the three errors?  What should I check? I've
>> > discarded cluster configuration because cluster is working, and the
>> > errors doesn't appear in all the nodes. The most annoying error
>> > cureently is the #1. Every 10-15 minutes Openais fails and the nodes
>> > gets fenced. I attach the cluster.conf.
>> >
>> > Thanks in advance.
>> >
>> > Regards, Javi
>> >
>> >
>> >
>> > --
>> > Linux-cluster mailing list
>> > Linux-cluster at redhat.com
>> > https://www.redhat.com/mailman/listinfo/linux-cluster
>> >
>>
>>
>> --
>> Digimer
>> Papers and Projects: https://alteeve.com
>>
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120717/ba3a982b/attachment.htm>

From emi2fast at gmail.com  Tue Jul 17 07:40:46 2012
From: emi2fast at gmail.com (emmanuel segura)
Date: Tue, 17 Jul 2012 09:40:46 +0200
Subject: [Linux-cluster] re-post
In-Reply-To: <1342503586.78659.YahooMailNeo@web125804.mail.ne1.yahoo.com>
References: <1342503586.78659.YahooMailNeo@web125804.mail.ne1.yahoo.com>
Message-ID: <CAE7pJ3BeQDDj_ZJO-QvwVGR5xSj9ahjT4DrsqJhh2e8MuSoR=A@mail.gmail.com>

If you have a failover service, why do you use gfs2?

2012/7/17 AKIN ?ffffffffffd6ZTOPUZ <akinoztopuz at yahoo.com>

>         Hello
>
> I am sending my post again :are there anybody came accross with this issue
> before?
>
>
>
> I have 2 nodes clsuter without quorum disk.
>  I saw a problem when I moved to services to other node.
>  disk loyout is iscsi .
>  I th?nk problem is about gfs.
> when I stop service in node1 and related file systems(included in service)
> are unmounted from that node and I want to mount it on node2 manually , I
> am tak?ng a message about resource busy.
>  [root clsn2 ~]# mount -t gfs2
> /dev/mapper/SAPClusterVG_d7-SAPClusterLV_d7 /usr/sap/PRO/ASCS01
> /sbin/mount.gfs2: /dev/mapper/SAPClusterVG_d7-SAPClusterLV_d7 already
> mounted or /usr/sap/PRO/ASCS01 busy
>   Could you have any ideas?
>    cluster.conf is at the below:
>  ?xml version="1.0"?>
> <cluster alias="testsapcluster" config_version="197" name="testsapcluster">
> <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
> <clusternodes>
> <clusternode name="clsn1.eda.com" nodeid="1" votes="1">
> <fence>
> <method name="1">
> <device name="fence_node1"/>
> </method>
> </fence>
> </clusternode>
> <clusternode name="clsn2.eda.com" nodeid="2" votes="1">
> <fence>
> <method name="1">
> <device name="fence_node2"/>
> </method>
> </fence>
> </clusternode>
> </clusternodes>
> <cman expected_votes="1" two_node="1"/>
> <fencedevices>
> <fencedevice agent="fence_ilo" hostname="iloclsnode1"
> login="clsfenceadmin" name="ClsNode1Fence" passwd="***********"/>
> <fencedevice agent="fence_ilo" hostname="iloclsnode2"
> login="clsfenceadmin" name="ClsNode2Fence" passwd="************"/>
> <fencedevice agent="fence_ipmilan" ipaddr="192.168.11.68"
> login="clsfenceadmin" name="IPMI-Node1" passwd="**********"/>
> <fencedevice agent="fence_ipmilan" ipaddr="192.168.11.67"
> login="clsfenceadmin" name="IPMI-Node2" passwd="**********"/>
> <fencedevice agent="fence_ipmilan" ipaddr="10.34.1.68"
> login="clsfenceadmin" name="IPMI_1" passwd="********"/>
> <fencedevice agent="fence_ipmilan" ipaddr="10.34.1.67"
> login="clsfenceadmin" name="IPMI_2" passwd="********"/>
> <fencedevice agent="fence_ipmilan" ipaddr="192.168.11.68" lanplus="1"
> login="clsfenceadmin" method="cycle" name="fence_node1" passwd="*******"
> power_wait="4"/>
> <fencedevice agent="fence_ipmilan" ipaddr="192.168.11.67" lanplus="1"
> login="clsfenceadmin" method="cycle" name="fence_node2" passwd="*******"
> power_wait="4"/>
> </fencedevices>
> <rm log_level="7">
> <failoverdomains>
> <failoverdomain name="sapfailover" nofailback="0" ordered="1"
> restricted="0">
> <failoverdomainnode name="clsn1.eda.com" priority="1"/>
> <failoverdomainnode name="clsn2.eda.com" priority="1"/>
> </failoverdomain>
> </failoverdomains>
> <resources>
> <ip address="10.34.1.111" monitor_link="1"/>
> <ip address="10.34.1.246" monitor_link="0"/>
> <ip address="10.34.1.247" monitor_link="0"/>
> <clusterfs device="/dev/mapper/SAPClusterVG_d7-SAPClusterLV_d7"
> force_unmount="1" fsid="1689" fstype="gfs2"
> mountpoint="/usr/sap/PRO/ASCS01" name="/usr/sap/PRO/ASCS01" self_fence="0"/>
> <clusterfs device="/dev/mapper/SAPClusterVG_b2-SAPClusterLV_b2"
> force_unmount="1" fsid="52296" fstype="gfs2" mountpoint="/oracle"
> name="/oracle" self_fence="0"/>
> <clusterfs device="/dev/mapper/SAPClusterVG_b3-SAPClusterLV_b3"
> force_unmount="1" fsid="25486" fstype="gfs2" mountpoint="/oracle/client"
> name="/oracle/client" self_fence="0"/>
> <clusterfs device="/dev/mapper/SAPClusterVG_b5-SAPClusterLV_b5"
> force_unmount="1" fsid="5895" fstype="gfs2" mountpoint="/oracle/stage"
> name="/oracle/stage" self_fence="0"/>
> <clusterfs device="/dev/mapper/SAPClusterVG_b6-SAPClusterLV_b6"
> force_unmount="1" fsid="19741" fstype="gfs2" mountpoint="/oracle/PRO"
> name="/oracle/PRO" self_fence="0"/>
> <clusterfs device="/dev/mapper/SAPClusterVG_b7-SAPClusterLV_b7"
> force_unmount="1" fsid="6452" fstype="gfs2" mountpoint="/oracle/PRO/112_64"
> name="/oracle/PRO/112_64" self_fence="0"/>
> <clusterfs device="/dev/mapper/SAPClusterVG_b8-SAPClusterLV_b8"
> force_unmount="1" fsid="40841" fstype="gfs2"
> mountpoint="/oracle/PRO/origlogA" name="/oracle/PRO/origlogA"
> self_fence="0"/>
> <clusterfs device="/dev/mapper/SAPClusterVG_b9-SAPClusterLV_b9"
> force_unmount="1" fsid="52787" fstype="gfs2"
> mountpoint="/oracle/PRO/origlogB" name="/oracle/PRO/origlogB"
> self_fence="0"/>
> <clusterfs device="/dev/mapper/SAPClusterVG_c5-SAPClusterLV_c5"
> force_unmount="1" fsid="22219" fstype="gfs2"
> mountpoint="/oracle/PRO/sapdata1" name="/oracle/PRO/sapdata1"
> self_fence="0"/>
> <clusterfs device="/dev/mapper/SAPClusterVG_b10-SAPClusterLV_b10"
> force_unmount="1" fsid="47722" fstype="gfs2"
> mountpoint="/oracle/PRO/mirrlogA" name="/oracle/PRO/mirrlogA"
> self_fence="0"/>
> <clusterfs device="/dev/mapper/SAPClusterVG_c6-SAPClusterLV_c6"
> force_unmount="1" fsid="1905" fstype="gfs2"
> mountpoint="/oracle/PRO/sapdata2" name="/oracle/PRO/sapdata2"
> self_fence="0"/>
> <clusterfs device="/dev/mapper/SAPClusterVG_c1-SAPClusterLV_c1"
> force_unmount="1" fsid="60368" fstype="gfs2"
> mountpoint="/oracle/PRO/mirrlogB" name="/oracle/PRO/mirrlogB"
> self_fence="0"/>
> <clusterfs device="/dev/mapper/SAPClusterVG_c7-SAPClusterLV_c7"
> force_unmount="1" fsid="14311" fstype="gfs2"
> mountpoint="/oracle/PRO/sapdata3" name="/oracle/PRO/sapdata3"
> self_fence="0"/>
> <clusterfs device="/dev/mapper/SAPClusterVG_c2-SAPClusterLV_c2"
> force_unmount="1" fsid="8037" fstype="gfs2"
> mountpoint="/oracle/PRO/oraarch" name="/oracle/PRO/oraarch" self_fence="0"/>
> <clusterfs device="/dev/mapper/SAPClusterVG_c8-SAPClusterLV_c8"
> force_unmount="1" fsid="41540" fstype="gfs2"
> mountpoint="/oracle/PRO/sapdata4" name="/oracle/PRO/sapdata4"
> self_fence="0"/>
> <clusterfs device="/dev/mapper/SAPClusterVG_c3-SAPClusterLV_c3"
> force_unmount="1" fsid="23164" fstype="gfs2"
> mountpoint="/oracle/PRO/sapreorg" name="/oracle/PRO/sapreorg"
> self_fence="0"/>
> <clusterfs device="/dev/mapper/SAPClusterVG_c9-SAPClusterLV_c9"
> force_unmount="1" fsid="37586" fstype="gfs2"
> mountpoint="/oracle/PRO/sapdata5" name="/oracle/PRO/sapdata5"
> self_fence="0"/>
> <clusterfs device="/dev/mapper/SAPClusterVG_d1-SAPClusterLV_d1"
> force_unmount="1" fsid="61050" fstype="gfs2" mountpoint="/software"
> name="/software" self_fence="0"/>
> <clusterfs device="/dev/mapper/SAPClusterVG_d2-SAPClusterLV_d2"
> force_unmount="1" fsid="45919" fstype="gfs2" mountpoint="/saptmp"
> name="/saptmp" self_fence="0"/>
> <clusterfs device="/dev/mapper/SAPClusterVG_d3-SAPClusterLV_d3"
> force_unmount="1" fsid="56812" fstype="gfs2" mountpoint="/usr/sap/PRO"
> name="/usr/sap/PRO" self_fence="0"/>
> <clusterfs device="/dev/mapper/SAPClusterVG_d5-SAPClusterLV_d5"
> force_unmount="1" fsid="47829" fstype="gfs2" mountpoint="/usr/sap/DAA"
> name="/usr/sap/DAA" self_fence="0"/>
> <clusterfs device="/dev/mapper/SAPClusterVG_d6-SAPClusterLV_d6"
> force_unmount="1" fsid="1394" fstype="gfs2" mountpoint="/usr/sap/hostctrl"
> name="/usr/sap/hostctrl" self_fence="0"/>
> <clusterfs device="/dev/mapper/SAPClusterVG_d8-SAPClusterLV_d8"
> force_unmount="1" fsid="33058" fstype="gfs2"
> mountpoint="/usr/sap/PRO/DVEBMGS00" name="/usr/sap/PRO/DVEBMGS00"
> self_fence="0"/>
> <clusterfs device="/dev/mapper/SAPClusterVG_b1-SAPClusterLV_b1"
> force_unmount="0" fsid="1822" fstype="gfs2" mountpoint="/sapmnt/PRO"
> name="/sapmnt/PRO" self_fence="0"/>
> <SAPInstance DIR_EXECUTABLE="/usr/sap/PRO/ASCS01/exe"
> DIR_PROFILE="/usr/sap/PRO/SYS/profile" InstanceName="PRO_ASCS01_sapproascs"
> START_PROFILE="START_ASCS01_sapproascs"/>
> <SAPDatabase DBTYPE="ORA" DIR_EXECUTABLE="/usr/sap/PRO/ASCS01/exe"
> NETSERVICENAME="LISTENER" SID="PRO"/>
> </resources>
> <service autostart="0" domain="sapfailover" exclusive="0" name="DB">
> <clusterfs fstype="gfs" ref="/oracle/PRO"/>
> <clusterfs fstype="gfs" ref="/oracle/PRO/112_64"/>
> <clusterfs fstype="gfs" ref="/oracle/PRO/origlogA"/>
> <clusterfs fstype="gfs" ref="/oracle/PRO/origlogB"/>
> <clusterfs fstype="gfs" ref="/oracle/PRO/mirrlogA"/>
> <clusterfs fstype="gfs" ref="/oracle/PRO/mirrlogB"/>
> <clusterfs fstype="gfs" ref="/oracle/PRO/oraarch"/>
> <clusterfs fstype="gfs" ref="/oracle/PRO/sapreorg"/>
> <clusterfs fstype="gfs" ref="/oracle/PRO/sapdata1"/>
> <clusterfs fstype="gfs" ref="/oracle/PRO/sapdata2"/>
> <clusterfs fstype="gfs" ref="/oracle/PRO/sapdata3"/>
> <clusterfs fstype="gfs" ref="/oracle/PRO/sapdata4"/>
> <clusterfs fstype="gfs" ref="/oracle/PRO/sapdata5"/>
> <ip ref="10.34.1.247"/>
> </service>
> <service autostart="0" domain="sapfailover" exclusive="1" name="sap">
> <ip ref="10.34.1.246"/>
> <clusterfs ref="/usr/sap/PRO/ASCS01"/>
> </service>
> </rm>
> </cluster>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
esta es mi vida e me la vivo hasta que dios quiera
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120717/10f84d24/attachment.htm>

From akinoztopuz at yahoo.com  Tue Jul 17 10:12:14 2012
From: akinoztopuz at yahoo.com (=?utf-8?B?QUtJTiDDv2ZmZmZmZmZmZmZkNlpUT1BVWg==?=)
Date: Tue, 17 Jul 2012 03:12:14 -0700 (PDT)
Subject: [Linux-cluster] re-post
In-Reply-To: <CAE7pJ3BeQDDj_ZJO-QvwVGR5xSj9ahjT4DrsqJhh2e8MuSoR=A@mail.gmail.com>
References: <1342503586.78659.YahooMailNeo@web125804.mail.ne1.yahoo.com>
	<CAE7pJ3BeQDDj_ZJO-QvwVGR5xSj9ahjT4DrsqJhh2e8MuSoR=A@mail.gmail.com>
Message-ID: <1342519934.47015.YahooMailNeo@web125803.mail.ne1.yahoo.com>

could you please? make some clerificaiton ?about your comments for my understanding?
?
gfs for clustered file system
there are two services including oracle and? sap??? mount points.
 

________________________________
 From: emmanuel segura <emi2fast at gmail.com>
To: AKIN ?ffffffffffd6ZTOPUZ <akinoztopuz at yahoo.com>; linux clustering <linux-cluster at redhat.com> 
Sent: Tuesday, July 17, 2012 10:40 AM
Subject: Re: [Linux-cluster] re-post
  

If you have a failover service, why do you use gfs2?


2012/7/17 AKIN ?ffffffffffd6ZTOPUZ <akinoztopuz at yahoo.com>

????????Hello 
>?
>I am sending my post again?:are there anybody came accross with this issue before?
>?
>? 
>?I have 2 nodes clsuter without quorum disk. 
>I saw a problem when I moved to services to other 
node. 
>disk  loyout is iscsi . 
>I th?nk problem is about gfs.
>when I stop service in node1  and related file 
systems(included in service) are unmounted from that node and I want to mount it 
on node2 manually  , I am tak?ng a message about resource busy.  
>[root clsn2 ~]# mount -t gfs2  
/dev/mapper/SAPClusterVG_d7-SAPClusterLV_d7 
/usr/sap/PRO/ASCS01
>/sbin/mount.gfs2: 
/dev/mapper/SAPClusterVG_d7-SAPClusterLV_d7 already mounted or 
/usr/sap/PRO/ASCS01 busy
>  
>Could you have any ideas?   
>cluster.conf is at the below: 
>?xml version="1.0"?>
><cluster 
alias="testsapcluster" config_version="197" name="testsapcluster">
><fence_daemon clean_start="0" post_fail_delay="0" 
post_join_delay="3"/>
><clusternodes>
><clusternode name="clsn1.eda.com" nodeid="1" 
votes="1">
><fence>
><method 
name="1">
><device 
name="fence_node1"/>
></method>
></fence>
></clusternode>
><clusternode name="clsn2.eda.com" 
nodeid="2" votes="1">
><fence>
><method 
name="1">
><device 
name="fence_node2"/>
></method>
></fence>
></clusternode>
></clusternodes>
><cman 
expected_votes="1" two_node="1"/>
><fencedevices>
><fencedevice agent="fence_ilo" 
hostname="iloclsnode1" login="clsfenceadmin" name="ClsNode1Fence" 
passwd="***********"/>
><fencedevice agent="fence_ilo" 
hostname="iloclsnode2" login="clsfenceadmin" name="ClsNode2Fence" 
passwd="************"/>
><fencedevice 
agent="fence_ipmilan" ipaddr="192.168.11.68" login="clsfenceadmin" 
name="IPMI-Node1" passwd="**********"/>
><fencedevice 
agent="fence_ipmilan" ipaddr="192.168.11.67" login="clsfenceadmin" 
name="IPMI-Node2" passwd="**********"/>
><fencedevice 
agent="fence_ipmilan" ipaddr="10.34.1.68" login="clsfenceadmin" name="IPMI_1" 
passwd="********"/>
><fencedevice agent="fence_ipmilan" 
ipaddr="10.34.1.67" login="clsfenceadmin" name="IPMI_2" 
passwd="********"/>
><fencedevice agent="fence_ipmilan" 
ipaddr="192.168.11.68" lanplus="1" login="clsfenceadmin" method="cycle" 
name="fence_node1" passwd="*******" power_wait="4"/>
><fencedevice agent="fence_ipmilan" ipaddr="192.168.11.67" lanplus="1" 
login="clsfenceadmin" method="cycle" name="fence_node2" passwd="*******" 
power_wait="4"/>
></fencedevices>
><rm 
log_level="7">
><failoverdomains>
><failoverdomain 
name="sapfailover" nofailback="0" ordered="1" 
restricted="0">
><failoverdomainnode 
name="clsn1.eda.com" priority="1"/>
><failoverdomainnode name="clsn2.eda.com" 
priority="1"/>
></failoverdomain>
></failoverdomains>
><resources>
><ip address="10.34.1.111" 
monitor_link="1"/>
><ip address="10.34.1.246" 
monitor_link="0"/>
><ip address="10.34.1.247" 
monitor_link="0"/>
><clusterfs 
device="/dev/mapper/SAPClusterVG_d7-SAPClusterLV_d7" force_unmount="1" 
fsid="1689" fstype="gfs2" mountpoint="/usr/sap/PRO/ASCS01" 
name="/usr/sap/PRO/ASCS01" self_fence="0"/>
><clusterfs device="/dev/mapper/SAPClusterVG_b2-SAPClusterLV_b2" 
force_unmount="1" fsid="52296" fstype="gfs2" mountpoint="/oracle" name="/oracle" 
self_fence="0"/>
><clusterfs 
device="/dev/mapper/SAPClusterVG_b3-SAPClusterLV_b3" force_unmount="1" 
fsid="25486" fstype="gfs2" mountpoint="/oracle/client" name="/oracle/client" 
self_fence="0"/>
><clusterfs 
device="/dev/mapper/SAPClusterVG_b5-SAPClusterLV_b5" force_unmount="1" 
fsid="5895" fstype="gfs2" mountpoint="/oracle/stage" name="/oracle/stage" 
self_fence="0"/>
><clusterfs 
device="/dev/mapper/SAPClusterVG_b6-SAPClusterLV_b6" force_unmount="1" 
fsid="19741" fstype="gfs2" mountpoint="/oracle/PRO" name="/oracle/PRO" 
self_fence="0"/>
><clusterfs 
device="/dev/mapper/SAPClusterVG_b7-SAPClusterLV_b7" force_unmount="1" 
fsid="6452" fstype="gfs2" mountpoint="/oracle/PRO/112_64" 
name="/oracle/PRO/112_64" self_fence="0"/>
><clusterfs device="/dev/mapper/SAPClusterVG_b8-SAPClusterLV_b8" 
force_unmount="1" fsid="40841" fstype="gfs2" mountpoint="/oracle/PRO/origlogA" 
name="/oracle/PRO/origlogA" self_fence="0"/>
><clusterfs device="/dev/mapper/SAPClusterVG_b9-SAPClusterLV_b9" 
force_unmount="1" fsid="52787" fstype="gfs2" mountpoint="/oracle/PRO/origlogB" 
name="/oracle/PRO/origlogB" self_fence="0"/>
><clusterfs device="/dev/mapper/SAPClusterVG_c5-SAPClusterLV_c5" 
force_unmount="1" fsid="22219" fstype="gfs2" mountpoint="/oracle/PRO/sapdata1" 
name="/oracle/PRO/sapdata1" self_fence="0"/>
><clusterfs device="/dev/mapper/SAPClusterVG_b10-SAPClusterLV_b10" 
force_unmount="1" fsid="47722" fstype="gfs2" mountpoint="/oracle/PRO/mirrlogA" 
name="/oracle/PRO/mirrlogA" self_fence="0"/>
><clusterfs device="/dev/mapper/SAPClusterVG_c6-SAPClusterLV_c6" 
force_unmount="1" fsid="1905" fstype="gfs2" mountpoint="/oracle/PRO/sapdata2" 
name="/oracle/PRO/sapdata2" self_fence="0"/>
><clusterfs device="/dev/mapper/SAPClusterVG_c1-SAPClusterLV_c1" 
force_unmount="1" fsid="60368" fstype="gfs2" mountpoint="/oracle/PRO/mirrlogB" 
name="/oracle/PRO/mirrlogB" self_fence="0"/>
><clusterfs device="/dev/mapper/SAPClusterVG_c7-SAPClusterLV_c7" 
force_unmount="1" fsid="14311" fstype="gfs2" mountpoint="/oracle/PRO/sapdata3" 
name="/oracle/PRO/sapdata3" self_fence="0"/>
><clusterfs device="/dev/mapper/SAPClusterVG_c2-SAPClusterLV_c2" 
force_unmount="1" fsid="8037" fstype="gfs2" mountpoint="/oracle/PRO/oraarch" 
name="/oracle/PRO/oraarch" self_fence="0"/>
><clusterfs device="/dev/mapper/SAPClusterVG_c8-SAPClusterLV_c8" 
force_unmount="1" fsid="41540" fstype="gfs2" mountpoint="/oracle/PRO/sapdata4" 
name="/oracle/PRO/sapdata4" self_fence="0"/>
><clusterfs device="/dev/mapper/SAPClusterVG_c3-SAPClusterLV_c3" 
force_unmount="1" fsid="23164" fstype="gfs2" mountpoint="/oracle/PRO/sapreorg" 
name="/oracle/PRO/sapreorg" self_fence="0"/>
><clusterfs device="/dev/mapper/SAPClusterVG_c9-SAPClusterLV_c9" 
force_unmount="1" fsid="37586" fstype="gfs2" mountpoint="/oracle/PRO/sapdata5" 
name="/oracle/PRO/sapdata5" self_fence="0"/>
><clusterfs device="/dev/mapper/SAPClusterVG_d1-SAPClusterLV_d1" 
force_unmount="1" fsid="61050" fstype="gfs2" mountpoint="/software" 
name="/software" self_fence="0"/>
><clusterfs 
device="/dev/mapper/SAPClusterVG_d2-SAPClusterLV_d2" force_unmount="1" 
fsid="45919" fstype="gfs2" mountpoint="/saptmp" name="/saptmp" 
self_fence="0"/>
><clusterfs 
device="/dev/mapper/SAPClusterVG_d3-SAPClusterLV_d3" force_unmount="1" 
fsid="56812" fstype="gfs2" mountpoint="/usr/sap/PRO" name="/usr/sap/PRO" 
self_fence="0"/>
><clusterfs 
device="/dev/mapper/SAPClusterVG_d5-SAPClusterLV_d5" force_unmount="1" 
fsid="47829" fstype="gfs2" mountpoint="/usr/sap/DAA" name="/usr/sap/DAA" 
self_fence="0"/>
><clusterfs 
device="/dev/mapper/SAPClusterVG_d6-SAPClusterLV_d6" force_unmount="1" 
fsid="1394" fstype="gfs2" mountpoint="/usr/sap/hostctrl" 
name="/usr/sap/hostctrl" self_fence="0"/>
><clusterfs device="/dev/mapper/SAPClusterVG_d8-SAPClusterLV_d8" 
force_unmount="1" fsid="33058" fstype="gfs2" mountpoint="/usr/sap/PRO/DVEBMGS00" 
name="/usr/sap/PRO/DVEBMGS00" self_fence="0"/>
><clusterfs device="/dev/mapper/SAPClusterVG_b1-SAPClusterLV_b1" 
force_unmount="0" fsid="1822" fstype="gfs2" mountpoint="/sapmnt/PRO" 
name="/sapmnt/PRO" self_fence="0"/>
><SAPInstance DIR_EXECUTABLE="/usr/sap/PRO/ASCS01/exe" 
DIR_PROFILE="/usr/sap/PRO/SYS/profile" InstanceName="PRO_ASCS01_sapproascs" 
START_PROFILE="START_ASCS01_sapproascs"/>
><SAPDatabase DBTYPE="ORA" DIR_EXECUTABLE="/usr/sap/PRO/ASCS01/exe" 
NETSERVICENAME="LISTENER" SID="PRO"/>
></resources>
><service autostart="0" 
domain="sapfailover" exclusive="0" name="DB">
><clusterfs fstype="gfs" ref="/oracle/PRO"/>
><clusterfs fstype="gfs" 
ref="/oracle/PRO/112_64"/>
><clusterfs 
fstype="gfs" ref="/oracle/PRO/origlogA"/>
><clusterfs fstype="gfs" 
ref="/oracle/PRO/origlogB"/>
><clusterfs 
fstype="gfs" ref="/oracle/PRO/mirrlogA"/>
><clusterfs fstype="gfs" 
ref="/oracle/PRO/mirrlogB"/>
><clusterfs 
fstype="gfs" ref="/oracle/PRO/oraarch"/>
><clusterfs fstype="gfs" 
ref="/oracle/PRO/sapreorg"/>
><clusterfs 
fstype="gfs" ref="/oracle/PRO/sapdata1"/>
><clusterfs fstype="gfs" 
ref="/oracle/PRO/sapdata2"/>
><clusterfs 
fstype="gfs" ref="/oracle/PRO/sapdata3"/>
><clusterfs fstype="gfs" 
ref="/oracle/PRO/sapdata4"/>
><clusterfs 
fstype="gfs" ref="/oracle/PRO/sapdata5"/>
><ip 
ref="10.34.1.247"/>
></service>
><service autostart="0" domain="sapfailover" exclusive="1" 
name="sap">
><ip 
ref="10.34.1.246"/>
><clusterfs 
ref="/usr/sap/PRO/ASCS01"/>
></service>
></rm>
></cluster>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster
>


-- 
esta es mi vida e me la vivo hasta que dios quiera
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120717/970a1f6d/attachment.htm>

From akinoztopuz at yahoo.com  Tue Jul 17 10:35:57 2012
From: akinoztopuz at yahoo.com (=?utf-8?B?QUtJTiDDv2ZmZmZmZmZmZmZkNlpUT1BVWg==?=)
Date: Tue, 17 Jul 2012 03:35:57 -0700 (PDT)
Subject: [Linux-cluster] re-post
In-Reply-To: <CAE7pJ3AmYy5wfsqY+XN-GvnvXNW9OHbFfoV5AX-M2q-JJDvmFg@mail.gmail.com>
References: <1342503586.78659.YahooMailNeo@web125804.mail.ne1.yahoo.com>
	<CAE7pJ3BeQDDj_ZJO-QvwVGR5xSj9ahjT4DrsqJhh2e8MuSoR=A@mail.gmail.com>
	<1342519934.47015.YahooMailNeo@web125803.mail.ne1.yahoo.com>
	<CAE7pJ3AmYy5wfsqY+XN-GvnvXNW9OHbFfoV5AX-M2q-JJDvmFg@mail.gmail.com>
Message-ID: <1342521357.64460.YahooMailNeo@web125805.mail.ne1.yahoo.com>

you mean that , shared file systems will be in?each nodes fstab?? instead of? adding? cluster services??

 

________________________________
 From: emmanuel segura <emi2fast at gmail.com>
To: AKIN ?ffffffffffd6ZTOPUZ <akinoztopuz at yahoo.com> 
Sent: Tuesday, July 17, 2012 1:26 PM
Subject: Re: [Linux-cluster] re-post
  

When i use gfs a i use /etc/fstab for mount


2012/7/17 AKIN ?ffffffffffd6ZTOPUZ <akinoztopuz at yahoo.com>

could you please? make some clerificaiton ?about your comments for my understanding? 
>?
>gfs for clustered file system
>there are two services including oracle and? sap??? mount points.
>
> 
> From: emmanuel segura <emi2fast at gmail.com>
>To: AKIN ?ffffffffffd6ZTOPUZ <akinoztopuz at yahoo.com>; linux clustering <linux-cluster at redhat.com> 
>Sent: Tuesday, July 17, 2012 10:40 AM
>Subject: Re: [Linux-cluster] re-post
>  
>
>If you have a failover service, why do you use gfs2?
>
>
>2012/7/17 AKIN ?ffffffffffd6ZTOPUZ <akinoztopuz at yahoo.com>
>
>????????Hello 
>>?
>>I am sending my post again?:are there anybody came accross with this issue before?
>>?
>>? 
>>?I have 2 nodes clsuter without quorum disk. 
>>I saw a problem when I moved to services to other 
node. 
>>disk  loyout is iscsi . 
>>I th?nk problem is about gfs.
>>when I stop service in node1  and related file 
systems(included in service) are unmounted from that node and I want to mount it 
on node2 manually  , I am tak?ng a message about resource busy.  
>>[root clsn2 ~]# mount -t gfs2  
/dev/mapper/SAPClusterVG_d7-SAPClusterLV_d7 
/usr/sap/PRO/ASCS01
>>/sbin/mount.gfs2: 
/dev/mapper/SAPClusterVG_d7-SAPClusterLV_d7 already mounted or 
/usr/sap/PRO/ASCS01 busy
>>  
>>Could you have any ideas?   
>>cluster.conf is at the below: 
>>?xml version="1.0"?>
>><cluster 
alias="testsapcluster" config_version="197" name="testsapcluster">
>><fence_daemon clean_start="0" post_fail_delay="0" 
post_join_delay="3"/>
>><clusternodes>
>><clusternode name="clsn1.eda.com" nodeid="1" 
votes="1">
>><fence>
>><method 
name="1">
>><device 
name="fence_node1"/>
>></method>
>></fence>
>></clusternode>
>><clusternode name="clsn2.eda.com" 
nodeid="2" votes="1">
>><fence>
>><method 
name="1">
>><device 
name="fence_node2"/>
>></method>
>></fence>
>></clusternode>
>></clusternodes>
>><cman 
expected_votes="1" two_node="1"/>
>><fencedevices>
>><fencedevice agent="fence_ilo" 
hostname="iloclsnode1" login="clsfenceadmin" name="ClsNode1Fence" 
passwd="***********"/>
>><fencedevice agent="fence_ilo" 
hostname="iloclsnode2" login="clsfenceadmin" name="ClsNode2Fence" 
passwd="************"/>
>><fencedevice 
agent="fence_ipmilan" ipaddr="192.168.11.68" login="clsfenceadmin" 
name="IPMI-Node1" passwd="**********"/>
>><fencedevice 
agent="fence_ipmilan" ipaddr="192.168.11.67" login="clsfenceadmin" 
name="IPMI-Node2" passwd="**********"/>
>><fencedevice 
agent="fence_ipmilan" ipaddr="10.34.1.68" login="clsfenceadmin" name="IPMI_1" 
passwd="********"/>
>><fencedevice agent="fence_ipmilan" 
ipaddr="10.34.1.67" login="clsfenceadmin" name="IPMI_2" 
passwd="********"/>
>><fencedevice agent="fence_ipmilan" 
ipaddr="192.168.11.68" lanplus="1" login="clsfenceadmin" method="cycle" 
name="fence_node1" passwd="*******" power_wait="4"/>
>><fencedevice agent="fence_ipmilan" ipaddr="192.168.11.67" lanplus="1" 
login="clsfenceadmin" method="cycle" name="fence_node2" passwd="*******" 
power_wait="4"/>
>></fencedevices>
>><rm 
log_level="7">
>><failoverdomains>
>><failoverdomain 
name="sapfailover" nofailback="0" ordered="1" 
restricted="0">
>><failoverdomainnode 
name="clsn1.eda.com" priority="1"/>
>><failoverdomainnode name="clsn2.eda.com" 
priority="1"/>
>></failoverdomain>
>></failoverdomains>
>><resources>
>><ip address="10.34.1.111" 
monitor_link="1"/>
>><ip address="10.34.1.246" 
monitor_link="0"/>
>><ip address="10.34.1.247" 
monitor_link="0"/>
>><clusterfs 
device="/dev/mapper/SAPClusterVG_d7-SAPClusterLV_d7" force_unmount="1" 
fsid="1689" fstype="gfs2" mountpoint="/usr/sap/PRO/ASCS01" 
name="/usr/sap/PRO/ASCS01" self_fence="0"/>
>><clusterfs device="/dev/mapper/SAPClusterVG_b2-SAPClusterLV_b2" 
force_unmount="1" fsid="52296" fstype="gfs2" mountpoint="/oracle" name="/oracle" 
self_fence="0"/>
>><clusterfs 
device="/dev/mapper/SAPClusterVG_b3-SAPClusterLV_b3" force_unmount="1" 
fsid="25486" fstype="gfs2" mountpoint="/oracle/client" name="/oracle/client" 
self_fence="0"/>
>><clusterfs 
device="/dev/mapper/SAPClusterVG_b5-SAPClusterLV_b5" force_unmount="1" 
fsid="5895" fstype="gfs2" mountpoint="/oracle/stage" name="/oracle/stage" 
self_fence="0"/>
>><clusterfs 
device="/dev/mapper/SAPClusterVG_b6-SAPClusterLV_b6" force_unmount="1" 
fsid="19741" fstype="gfs2" mountpoint="/oracle/PRO" name="/oracle/PRO" 
self_fence="0"/>
>><clusterfs 
device="/dev/mapper/SAPClusterVG_b7-SAPClusterLV_b7" force_unmount="1" 
fsid="6452" fstype="gfs2" mountpoint="/oracle/PRO/112_64" 
name="/oracle/PRO/112_64" self_fence="0"/>
>><clusterfs device="/dev/mapper/SAPClusterVG_b8-SAPClusterLV_b8" 
force_unmount="1" fsid="40841" fstype="gfs2" mountpoint="/oracle/PRO/origlogA" 
name="/oracle/PRO/origlogA" self_fence="0"/>
>><clusterfs device="/dev/mapper/SAPClusterVG_b9-SAPClusterLV_b9" 
force_unmount="1" fsid="52787" fstype="gfs2" mountpoint="/oracle/PRO/origlogB" 
name="/oracle/PRO/origlogB" self_fence="0"/>
>><clusterfs device="/dev/mapper/SAPClusterVG_c5-SAPClusterLV_c5" 
force_unmount="1" fsid="22219" fstype="gfs2" mountpoint="/oracle/PRO/sapdata1" 
name="/oracle/PRO/sapdata1" self_fence="0"/>
>><clusterfs device="/dev/mapper/SAPClusterVG_b10-SAPClusterLV_b10" 
force_unmount="1" fsid="47722" fstype="gfs2" mountpoint="/oracle/PRO/mirrlogA" 
name="/oracle/PRO/mirrlogA" self_fence="0"/>
>><clusterfs device="/dev/mapper/SAPClusterVG_c6-SAPClusterLV_c6" 
force_unmount="1" fsid="1905" fstype="gfs2" mountpoint="/oracle/PRO/sapdata2" 
name="/oracle/PRO/sapdata2" self_fence="0"/>
>><clusterfs device="/dev/mapper/SAPClusterVG_c1-SAPClusterLV_c1" 
force_unmount="1" fsid="60368" fstype="gfs2" mountpoint="/oracle/PRO/mirrlogB" 
name="/oracle/PRO/mirrlogB" self_fence="0"/>
>><clusterfs device="/dev/mapper/SAPClusterVG_c7-SAPClusterLV_c7" 
force_unmount="1" fsid="14311" fstype="gfs2" mountpoint="/oracle/PRO/sapdata3" 
name="/oracle/PRO/sapdata3" self_fence="0"/>
>><clusterfs device="/dev/mapper/SAPClusterVG_c2-SAPClusterLV_c2" 
force_unmount="1" fsid="8037" fstype="gfs2" mountpoint="/oracle/PRO/oraarch" 
name="/oracle/PRO/oraarch" self_fence="0"/>
>><clusterfs device="/dev/mapper/SAPClusterVG_c8-SAPClusterLV_c8" 
force_unmount="1" fsid="41540" fstype="gfs2" mountpoint="/oracle/PRO/sapdata4" 
name="/oracle/PRO/sapdata4" self_fence="0"/>
>><clusterfs device="/dev/mapper/SAPClusterVG_c3-SAPClusterLV_c3" 
force_unmount="1" fsid="23164" fstype="gfs2" mountpoint="/oracle/PRO/sapreorg" 
name="/oracle/PRO/sapreorg" self_fence="0"/>
>><clusterfs device="/dev/mapper/SAPClusterVG_c9-SAPClusterLV_c9" 
force_unmount="1" fsid="37586" fstype="gfs2" mountpoint="/oracle/PRO/sapdata5" 
name="/oracle/PRO/sapdata5" self_fence="0"/>
>><clusterfs device="/dev/mapper/SAPClusterVG_d1-SAPClusterLV_d1" 
force_unmount="1" fsid="61050" fstype="gfs2" mountpoint="/software" 
name="/software" self_fence="0"/>
>><clusterfs 
device="/dev/mapper/SAPClusterVG_d2-SAPClusterLV_d2" force_unmount="1" 
fsid="45919" fstype="gfs2" mountpoint="/saptmp" name="/saptmp" 
self_fence="0"/>
>><clusterfs 
device="/dev/mapper/SAPClusterVG_d3-SAPClusterLV_d3" force_unmount="1" 
fsid="56812" fstype="gfs2" mountpoint="/usr/sap/PRO" name="/usr/sap/PRO" 
self_fence="0"/>
>><clusterfs 
device="/dev/mapper/SAPClusterVG_d5-SAPClusterLV_d5" force_unmount="1" 
fsid="47829" fstype="gfs2" mountpoint="/usr/sap/DAA" name="/usr/sap/DAA" 
self_fence="0"/>
>><clusterfs 
device="/dev/mapper/SAPClusterVG_d6-SAPClusterLV_d6" force_unmount="1" 
fsid="1394" fstype="gfs2" mountpoint="/usr/sap/hostctrl" 
name="/usr/sap/hostctrl" self_fence="0"/>
>><clusterfs device="/dev/mapper/SAPClusterVG_d8-SAPClusterLV_d8" 
force_unmount="1" fsid="33058" fstype="gfs2" mountpoint="/usr/sap/PRO/DVEBMGS00" 
name="/usr/sap/PRO/DVEBMGS00" self_fence="0"/>
>><clusterfs device="/dev/mapper/SAPClusterVG_b1-SAPClusterLV_b1" 
force_unmount="0" fsid="1822" fstype="gfs2" mountpoint="/sapmnt/PRO" 
name="/sapmnt/PRO" self_fence="0"/>
>><SAPInstance DIR_EXECUTABLE="/usr/sap/PRO/ASCS01/exe" 
DIR_PROFILE="/usr/sap/PRO/SYS/profile" InstanceName="PRO_ASCS01_sapproascs" 
START_PROFILE="START_ASCS01_sapproascs"/>
>><SAPDatabase DBTYPE="ORA" DIR_EXECUTABLE="/usr/sap/PRO/ASCS01/exe" 
NETSERVICENAME="LISTENER" SID="PRO"/>
>></resources>
>><service autostart="0" 
domain="sapfailover" exclusive="0" name="DB">
>><clusterfs fstype="gfs" ref="/oracle/PRO"/>
>><clusterfs fstype="gfs" 
ref="/oracle/PRO/112_64"/>
>><clusterfs 
fstype="gfs" ref="/oracle/PRO/origlogA"/>
>><clusterfs fstype="gfs" 
ref="/oracle/PRO/origlogB"/>
>><clusterfs 
fstype="gfs" ref="/oracle/PRO/mirrlogA"/>
>><clusterfs fstype="gfs" 
ref="/oracle/PRO/mirrlogB"/>
>><clusterfs 
fstype="gfs" ref="/oracle/PRO/oraarch"/>
>><clusterfs fstype="gfs" 
ref="/oracle/PRO/sapreorg"/>
>><clusterfs 
fstype="gfs" ref="/oracle/PRO/sapdata1"/>
>><clusterfs fstype="gfs" 
ref="/oracle/PRO/sapdata2"/>
>><clusterfs 
fstype="gfs" ref="/oracle/PRO/sapdata3"/>
>><clusterfs fstype="gfs" 
ref="/oracle/PRO/sapdata4"/>
>><clusterfs 
fstype="gfs" ref="/oracle/PRO/sapdata5"/>
>><ip 
ref="10.34.1.247"/>
>></service>
>><service autostart="0" domain="sapfailover" exclusive="1" 
name="sap">
>><ip 
ref="10.34.1.246"/>
>><clusterfs 
ref="/usr/sap/PRO/ASCS01"/>
>></service>
>></rm>
>></cluster>
>>--
>>Linux-cluster mailing list
>>Linux-cluster at redhat.com
>>https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
>-- 
>esta es mi vida e me la vivo hasta que dios quiera
>
>
>  


-- 
esta es mi vida e me la vivo hasta que dios quiera
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120717/bcb9b63b/attachment.htm>

From akinoztopuz at yahoo.com  Tue Jul 17 11:12:47 2012
From: akinoztopuz at yahoo.com (=?utf-8?B?QUtJTiDDv2ZmZmZmZmZmZmZkNlpUT1BVWg==?=)
Date: Tue, 17 Jul 2012 04:12:47 -0700 (PDT)
Subject: [Linux-cluster] re-post
In-Reply-To: <CAE7pJ3D2HQQmg0d7fSbFnTrKSY-t8_8KvLs9E=qnp5gDeB9Bjg@mail.gmail.com>
References: <1342503586.78659.YahooMailNeo@web125804.mail.ne1.yahoo.com>
	<CAE7pJ3BeQDDj_ZJO-QvwVGR5xSj9ahjT4DrsqJhh2e8MuSoR=A@mail.gmail.com>
	<1342519934.47015.YahooMailNeo@web125803.mail.ne1.yahoo.com>
	<CAE7pJ3AmYy5wfsqY+XN-GvnvXNW9OHbFfoV5AX-M2q-JJDvmFg@mail.gmail.com>
	<1342521357.64460.YahooMailNeo@web125805.mail.ne1.yahoo.com>
	<CAE7pJ3D2HQQmg0d7fSbFnTrKSY-t8_8KvLs9E=qnp5gDeB9Bjg@mail.gmail.com>
Message-ID: <1342523567.70101.YahooMailNeo@web125801.mail.ne1.yahoo.com>


????
thanks Emmanuel
? will try and let you know
 

________________________________
 From: emmanuel segura <emi2fast at gmail.com>
To: AKIN ?ffffffffffd6ZTOPUZ <akinoztopuz at yahoo.com> 
Sent: Tuesday, July 17, 2012 2:09 PM
Subject: Re: [Linux-cluster] re-post
  

yes with _netdev option


2012/7/17 AKIN ?ffffffffffd6ZTOPUZ <akinoztopuz at yahoo.com>

you mean that , shared file systems will be in?each nodes fstab?? instead of? adding? cluster services??
>
> 
> From: emmanuel segura <emi2fast at gmail.com>
>To: AKIN ?ffffffffffd6ZTOPUZ <akinoztopuz at yahoo.com> 
>Sent: Tuesday, July 17, 2012 1:26 PM
>Subject: Re: [Linux-cluster] re-post
>  
>
>When i use gfs a i use /etc/fstab for mount
>
>
>2012/7/17 AKIN ?ffffffffffd6ZTOPUZ <akinoztopuz at yahoo.com>
>
>could you please? make some clerificaiton ?about your comments for my understanding? 
>>?
>>gfs for clustered file system
>>there are two services including oracle and? sap??? mount points.
>>
>> 
>> From: emmanuel segura <emi2fast at gmail.com>
>>To: AKIN ?ffffffffffd6ZTOPUZ <akinoztopuz at yahoo.com>; linux clustering <linux-cluster at redhat.com> 
>>Sent: Tuesday, July 17, 2012 10:40 AM
>>Subject: Re: [Linux-cluster] re-post
>>  
>>
>>If you have a failover service, why do you use gfs2?
>>
>>
>>2012/7/17 AKIN ?ffffffffffd6ZTOPUZ <akinoztopuz at yahoo.com>
>>
>>????????Hello 
>>>?
>>>I am sending my post again?:are there anybody came accross with this issue before?
>>>?
>>>? 
>>>?I have 2 nodes clsuter without quorum disk. 
>>>I saw a problem when I moved to services to other 
node. 
>>>disk  loyout is iscsi . 
>>>I th?nk problem is about gfs.
>>>when I stop service in node1  and related file 
systems(included in service) are unmounted from that node and I want to mount it 
on node2 manually  , I am tak?ng a message about resource busy.  
>>>[root clsn2 ~]# mount -t gfs2  
/dev/mapper/SAPClusterVG_d7-SAPClusterLV_d7 
/usr/sap/PRO/ASCS01
>>>/sbin/mount.gfs2: 
/dev/mapper/SAPClusterVG_d7-SAPClusterLV_d7 already mounted or 
/usr/sap/PRO/ASCS01 busy
>>>  
>>>Could you have any ideas?   
>>>cluster.conf is at the below: 
>>>?xml version="1.0"?>
>>><cluster 
alias="testsapcluster" config_version="197" name="testsapcluster">
>>><fence_daemon clean_start="0" post_fail_delay="0" 
post_join_delay="3"/>
>>><clusternodes>
>>><clusternode name="clsn1.eda.com" nodeid="1" 
votes="1">
>>><fence>
>>><method 
name="1">
>>><device 
name="fence_node1"/>
>>></method>
>>></fence>
>>></clusternode>
>>><clusternode name="clsn2.eda.com" 
nodeid="2" votes="1">
>>><fence>
>>><method 
name="1">
>>><device 
name="fence_node2"/>
>>></method>
>>></fence>
>>></clusternode>
>>></clusternodes>
>>><cman 
expected_votes="1" two_node="1"/>
>>><fencedevices>
>>><fencedevice agent="fence_ilo" 
hostname="iloclsnode1" login="clsfenceadmin" name="ClsNode1Fence" 
passwd="***********"/>
>>><fencedevice agent="fence_ilo" 
hostname="iloclsnode2" login="clsfenceadmin" name="ClsNode2Fence" 
passwd="************"/>
>>><fencedevice 
agent="fence_ipmilan" ipaddr="192.168.11.68" login="clsfenceadmin" 
name="IPMI-Node1" passwd="**********"/>
>>><fencedevice 
agent="fence_ipmilan" ipaddr="192.168.11.67" login="clsfenceadmin" 
name="IPMI-Node2" passwd="**********"/>
>>><fencedevice 
agent="fence_ipmilan" ipaddr="10.34.1.68" login="clsfenceadmin" name="IPMI_1" 
passwd="********"/>
>>><fencedevice agent="fence_ipmilan" 
ipaddr="10.34.1.67" login="clsfenceadmin" name="IPMI_2" 
passwd="********"/>
>>><fencedevice agent="fence_ipmilan" 
ipaddr="192.168.11.68" lanplus="1" login="clsfenceadmin" method="cycle" 
name="fence_node1" passwd="*******" power_wait="4"/>
>>><fencedevice agent="fence_ipmilan" ipaddr="192.168.11.67" lanplus="1" 
login="clsfenceadmin" method="cycle" name="fence_node2" passwd="*******" 
power_wait="4"/>
>>></fencedevices>
>>><rm 
log_level="7">
>>><failoverdomains>
>>><failoverdomain 
name="sapfailover" nofailback="0" ordered="1" 
restricted="0">
>>><failoverdomainnode 
name="clsn1.eda.com" priority="1"/>
>>><failoverdomainnode name="clsn2.eda.com" 
priority="1"/>
>>></failoverdomain>
>>></failoverdomains>
>>><resources>
>>><ip address="10.34.1.111" 
monitor_link="1"/>
>>><ip address="10.34.1.246" 
monitor_link="0"/>
>>><ip address="10.34.1.247" 
monitor_link="0"/>
>>><clusterfs 
device="/dev/mapper/SAPClusterVG_d7-SAPClusterLV_d7" force_unmount="1" 
fsid="1689" fstype="gfs2" mountpoint="/usr/sap/PRO/ASCS01" 
name="/usr/sap/PRO/ASCS01" self_fence="0"/>
>>><clusterfs device="/dev/mapper/SAPClusterVG_b2-SAPClusterLV_b2" 
force_unmount="1" fsid="52296" fstype="gfs2" mountpoint="/oracle" name="/oracle" 
self_fence="0"/>
>>><clusterfs 
device="/dev/mapper/SAPClusterVG_b3-SAPClusterLV_b3" force_unmount="1" 
fsid="25486" fstype="gfs2" mountpoint="/oracle/client" name="/oracle/client" 
self_fence="0"/>
>>><clusterfs 
device="/dev/mapper/SAPClusterVG_b5-SAPClusterLV_b5" force_unmount="1" 
fsid="5895" fstype="gfs2" mountpoint="/oracle/stage" name="/oracle/stage" 
self_fence="0"/>
>>><clusterfs 
device="/dev/mapper/SAPClusterVG_b6-SAPClusterLV_b6" force_unmount="1" 
fsid="19741" fstype="gfs2" mountpoint="/oracle/PRO" name="/oracle/PRO" 
self_fence="0"/>
>>><clusterfs 
device="/dev/mapper/SAPClusterVG_b7-SAPClusterLV_b7" force_unmount="1" 
fsid="6452" fstype="gfs2" mountpoint="/oracle/PRO/112_64" 
name="/oracle/PRO/112_64" self_fence="0"/>
>>><clusterfs device="/dev/mapper/SAPClusterVG_b8-SAPClusterLV_b8" 
force_unmount="1" fsid="40841" fstype="gfs2" mountpoint="/oracle/PRO/origlogA" 
name="/oracle/PRO/origlogA" self_fence="0"/>
>>><clusterfs device="/dev/mapper/SAPClusterVG_b9-SAPClusterLV_b9" 
force_unmount="1" fsid="52787" fstype="gfs2" mountpoint="/oracle/PRO/origlogB" 
name="/oracle/PRO/origlogB" self_fence="0"/>
>>><clusterfs device="/dev/mapper/SAPClusterVG_c5-SAPClusterLV_c5" 
force_unmount="1" fsid="22219" fstype="gfs2" mountpoint="/oracle/PRO/sapdata1" 
name="/oracle/PRO/sapdata1" self_fence="0"/>
>>><clusterfs device="/dev/mapper/SAPClusterVG_b10-SAPClusterLV_b10" 
force_unmount="1" fsid="47722" fstype="gfs2" mountpoint="/oracle/PRO/mirrlogA" 
name="/oracle/PRO/mirrlogA" self_fence="0"/>
>>><clusterfs device="/dev/mapper/SAPClusterVG_c6-SAPClusterLV_c6" 
force_unmount="1" fsid="1905" fstype="gfs2" mountpoint="/oracle/PRO/sapdata2" 
name="/oracle/PRO/sapdata2" self_fence="0"/>
>>><clusterfs device="/dev/mapper/SAPClusterVG_c1-SAPClusterLV_c1" 
force_unmount="1" fsid="60368" fstype="gfs2" mountpoint="/oracle/PRO/mirrlogB" 
name="/oracle/PRO/mirrlogB" self_fence="0"/>
>>><clusterfs device="/dev/mapper/SAPClusterVG_c7-SAPClusterLV_c7" 
force_unmount="1" fsid="14311" fstype="gfs2" mountpoint="/oracle/PRO/sapdata3" 
name="/oracle/PRO/sapdata3" self_fence="0"/>
>>><clusterfs device="/dev/mapper/SAPClusterVG_c2-SAPClusterLV_c2" 
force_unmount="1" fsid="8037" fstype="gfs2" mountpoint="/oracle/PRO/oraarch" 
name="/oracle/PRO/oraarch" self_fence="0"/>
>>><clusterfs device="/dev/mapper/SAPClusterVG_c8-SAPClusterLV_c8" 
force_unmount="1" fsid="41540" fstype="gfs2" mountpoint="/oracle/PRO/sapdata4" 
name="/oracle/PRO/sapdata4" self_fence="0"/>
>>><clusterfs device="/dev/mapper/SAPClusterVG_c3-SAPClusterLV_c3" 
force_unmount="1" fsid="23164" fstype="gfs2" mountpoint="/oracle/PRO/sapreorg" 
name="/oracle/PRO/sapreorg" self_fence="0"/>
>>><clusterfs device="/dev/mapper/SAPClusterVG_c9-SAPClusterLV_c9" 
force_unmount="1" fsid="37586" fstype="gfs2" mountpoint="/oracle/PRO/sapdata5" 
name="/oracle/PRO/sapdata5" self_fence="0"/>
>>><clusterfs device="/dev/mapper/SAPClusterVG_d1-SAPClusterLV_d1" 
force_unmount="1" fsid="61050" fstype="gfs2" mountpoint="/software" 
name="/software" self_fence="0"/>
>>><clusterfs 
device="/dev/mapper/SAPClusterVG_d2-SAPClusterLV_d2" force_unmount="1" 
fsid="45919" fstype="gfs2" mountpoint="/saptmp" name="/saptmp" 
self_fence="0"/>
>>><clusterfs 
device="/dev/mapper/SAPClusterVG_d3-SAPClusterLV_d3" force_unmount="1" 
fsid="56812" fstype="gfs2" mountpoint="/usr/sap/PRO" name="/usr/sap/PRO" 
self_fence="0"/>
>>><clusterfs 
device="/dev/mapper/SAPClusterVG_d5-SAPClusterLV_d5" force_unmount="1" 
fsid="47829" fstype="gfs2" mountpoint="/usr/sap/DAA" name="/usr/sap/DAA" 
self_fence="0"/>
>>><clusterfs 
device="/dev/mapper/SAPClusterVG_d6-SAPClusterLV_d6" force_unmount="1" 
fsid="1394" fstype="gfs2" mountpoint="/usr/sap/hostctrl" 
name="/usr/sap/hostctrl" self_fence="0"/>
>>><clusterfs device="/dev/mapper/SAPClusterVG_d8-SAPClusterLV_d8" 
force_unmount="1" fsid="33058" fstype="gfs2" mountpoint="/usr/sap/PRO/DVEBMGS00" 
name="/usr/sap/PRO/DVEBMGS00" self_fence="0"/>
>>><clusterfs device="/dev/mapper/SAPClusterVG_b1-SAPClusterLV_b1" 
force_unmount="0" fsid="1822" fstype="gfs2" mountpoint="/sapmnt/PRO" 
name="/sapmnt/PRO" self_fence="0"/>
>>><SAPInstance DIR_EXECUTABLE="/usr/sap/PRO/ASCS01/exe" 
DIR_PROFILE="/usr/sap/PRO/SYS/profile" InstanceName="PRO_ASCS01_sapproascs" 
START_PROFILE="START_ASCS01_sapproascs"/>
>>><SAPDatabase DBTYPE="ORA" DIR_EXECUTABLE="/usr/sap/PRO/ASCS01/exe" 
NETSERVICENAME="LISTENER" SID="PRO"/>
>>></resources>
>>><service autostart="0" 
domain="sapfailover" exclusive="0" name="DB">
>>><clusterfs fstype="gfs" ref="/oracle/PRO"/>
>>><clusterfs fstype="gfs" 
ref="/oracle/PRO/112_64"/>
>>><clusterfs 
fstype="gfs" ref="/oracle/PRO/origlogA"/>
>>><clusterfs fstype="gfs" 
ref="/oracle/PRO/origlogB"/>
>>><clusterfs 
fstype="gfs" ref="/oracle/PRO/mirrlogA"/>
>>><clusterfs fstype="gfs" 
ref="/oracle/PRO/mirrlogB"/>
>>><clusterfs 
fstype="gfs" ref="/oracle/PRO/oraarch"/>
>>><clusterfs fstype="gfs" 
ref="/oracle/PRO/sapreorg"/>
>>><clusterfs 
fstype="gfs" ref="/oracle/PRO/sapdata1"/>
>>><clusterfs fstype="gfs" 
ref="/oracle/PRO/sapdata2"/>
>>><clusterfs 
fstype="gfs" ref="/oracle/PRO/sapdata3"/>
>>><clusterfs fstype="gfs" 
ref="/oracle/PRO/sapdata4"/>
>>><clusterfs 
fstype="gfs" ref="/oracle/PRO/sapdata5"/>
>>><ip 
ref="10.34.1.247"/>
>>></service>
>>><service autostart="0" domain="sapfailover" exclusive="1" 
name="sap">
>>><ip 
ref="10.34.1.246"/>
>>><clusterfs 
ref="/usr/sap/PRO/ASCS01"/>
>>></service>
>>></rm>
>>></cluster>
>>>--
>>>Linux-cluster mailing list
>>>Linux-cluster at redhat.com
>>>https://www.redhat.com/mailman/listinfo/linux-cluster
>>>
>>
>>
>>-- 
>>esta es mi vida e me la vivo hasta que dios quiera
>>
>>
>> 
>
>
>-- 
>esta es mi vida e me la vivo hasta que dios quiera
>
>
>  


-- 
esta es mi vida e me la vivo hasta que dios quiera
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120717/d637e050/attachment.htm>

From lists at alteeve.ca  Tue Jul 17 14:39:29 2012
From: lists at alteeve.ca (Digimer)
Date: Tue, 17 Jul 2012 10:39:29 -0400
Subject: [Linux-cluster] Strange behaviours in two-node cluster
In-Reply-To: <CAEAM5QUtVLLhK_1kmyMCdR-Yz5xtmUtTZygmg5EX0hrmo5Nv_g@mail.gmail.com>
References: <CAEAM5QVKOd+SbxKWkomSgtxeMrhFK0XH20vjEYrOTSKfMzLwLw@mail.gmail.com>
	<50044D50.5060008@alteeve.ca>
	<CAEAM5QUtVLLhK_1kmyMCdR-Yz5xtmUtTZygmg5EX0hrmo5Nv_g@mail.gmail.com>
Message-ID: <50057921.3020505@alteeve.ca>

I didn't notice the qdisk entry, so yes, expected_votes="3" is fine.

If fencing is working, then that isn't a problem.

In some switches, multicast groups are occasionally deleted, forcing
members to re-join the multicast group (I've not seen this myself, but
I've heard of it on Cisco switches, iirc). The idea was to remove unused
groups over time.

Is this a new cluster? If it is, have you considered using RHEL 6.3?
There are a lot of improvements in cluster stable 3. If not, can you
update to RHEL 5.8 to get all the outstanding updates?

Digimer

On 07/16/2012 02:46 PM, Javier Vela wrote:
> Hi,
> 
> I set two_node=0 in purpose, because of I use a quorum disk with one
> additional vote. If one one fails, I still have two votes, and the
> cluster remains quorate, avoiding the split-brain situation. Is this
> approach wrong? In my tests, this aspect of the quorum worked well.
> 
> Fencing works very well. When something happens, the fencing kills the
> faulting server without any problems.
> 
> The first time I ran into problems I cheked multicast traffic between
> the nodes with iperf and everything appeared to be OK. What I don't know
> is how works the purge you said. I didn't know that any purge was
> running whatsoever. How can I check if is happening? Moreover, when I
> did the test only one cluster was running. Now there are 3 cluster
> running in the same virtual switch.
> 
> 
> Software:
> 
> Red Hat Enterprise Linux Server release 5.7 (Tikanga)
> cman-2.0.115-85.el5
> rgmanager-2.0.52-21.el5
> openais-0.80.6-30.el5
> 
> 
>  Regards, Javi
> 
> 2012/7/16 Digimer <lists at alteeve.ca <mailto:lists at alteeve.ca>>
> 
>     Why did you set 'two_node="0" expected_votes="3"' on a two node cluster?
>     With this, losing a node will mean you lose quorum and all cluster
>     activity will stop. Please change this to 'two_node="1"
>     expected_votes="1"'.
> 
>     Did you confirm that your fencing actually works? Does 'fence_node
>     node1' and 'fence_node node2' actually kill the target?
> 
>     Are you running into multicast issues? If your switch (virtual or real)
>     purges multicast groups periodically, it will break the cluster.
> 
>     What version of the cluster software and what distro are you using?
> 
>     Digimer
> 
> 
>     On 07/16/2012 12:03 PM, Javier Vela wrote:
>     > Hi, two weeks ago I asked for some help building a two-node
>     cluster with
>     > HA-LVM. After some e-mails, finally I got my cluster working. The
>     > problem now is that sometimes, and in some clusters (I have three
>     > clusters with the same configuration), I got very strange behaviours.
>     >
>     > #1 Openais detects some problem and shutdown itself. The network
>     is Ok,
>     > is a virtual device in vmware, shared with the other cluster hearbet
>     > networks, and only happens in one cluster. The error messages:
>     >
>     > Jul 16 08:50:32 node1 openais[3641]: [TOTEM] FAILED TO RECEIVE
>     > Jul 16 08:50:32 node1 openais[3641]: [TOTEM] entering GATHER state
>     from 6.
>     > Jul 16 08:50:36 node1 openais[3641]: [TOTEM] entering GATHER state
>     from 0
>     >
>     > Do you know what can I check in order to solve the problem? I
>     don't know
>     > from where I should start. What makes Openais to not receive messages?
>     >
>     >
>     > #2 I'm getting a lot of RGmanager errors when rgmanager tries to
>     change
>     > the service status. i.e: clusvdcam -d service. Always happens when I
>     > have the two nodes UP. If I shutdown one node, then the command
>     finishes
>     > succesfully. Prior to execute the command, I always check the status
>     > with clustat, and everything is OK:
>     >
>     > clurgmgrd[5667]: <err> #52: Failed changing RG status
>     >
>     > Another time, what can I check in order to detect problems with
>     > rgmanager that clustat and cman_tool doesn't show?
>     >
>     > #3 Sometimes, not always, a node that has been fenced cannot join the
>     > cluster after the reboot. With clustat I can see that there is quorum:
>     >
>     > clustat:
>     > [root at node2 ~]# clustat
>     > Cluster Status test_cluster @ Mon Jul 16 05:46:57 2012
>     > Member Status: Quorate
>     >
>     >  Member Name                             ID   Status
>     >  ------ ----                             ---- ------
>     >  node1-hb                                  1 Offline
>     >  node2-hb                               2 Online, Local, rgmanager
>     >  /dev/disk/by-path/pci-0000:02:01.0-scsi-    0 Online, Quorum Disk
>     >
>     >  Service Name                   Owner (Last)                   State
>     >  ------- ----                   ----- ------                   -----
>     >  service:test                   node2-hb                  started
>     >
>     > The log show how node2 fenced node1:
>     >
>     > node2 messages
>     > Jul 13 04:00:31 node2 fenced[4219]: node1 not a cluster member after 0
>     > sec post_fail_delay
>     > Jul 13 04:00:31 node2 fenced[4219]: fencing node "node1"
>     > Jul 13 04:00:36 node2 clurgmgrd[4457]: <info> Waiting for node #1
>     to be
>     > fenced
>     > Jul 13 04:01:04 node2 fenced[4219]: fence "node1" success
>     > Jul 13 04:01:06 node2 clurgmgrd[4457]: <info> Node #1 fenced;
>     continuing
>     >
>     > But the node that tries to join the cluster says that there isn't
>     > quorum. Finally. It finishes inquorate, without seeing node1 and the
>     > quorum disk.
>     >
>     > node1 messages
>     > Jul 16 05:48:19 node1 ccsd[4207]: Error while processing connect:
>     > Connection refused
>     > Jul 16 05:48:19 node1 ccsd[4207]: Cluster is not quorate.  Refusing
>     > connection.
>     >
>     > Have something in common the three errors?  What should I check? I've
>     > discarded cluster configuration because cluster is working, and the
>     > errors doesn't appear in all the nodes. The most annoying error
>     > cureently is the #1. Every 10-15 minutes Openais fails and the nodes
>     > gets fenced. I attach the cluster.conf.
>     >
>     > Thanks in advance.
>     >
>     > Regards, Javi
>     >
>     >
>     >
>     > --
>     > Linux-cluster mailing list
>     > Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
>     > https://www.redhat.com/mailman/listinfo/linux-cluster
>     >
> 
> 
>     --
>     Digimer
>     Papers and Projects: https://alteeve.com
> 
> 
> 
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 


-- 
Digimer
Papers and Projects: https://alteeve.com




From lists at alteeve.ca  Tue Jul 17 14:45:12 2012
From: lists at alteeve.ca (Digimer)
Date: Tue, 17 Jul 2012 10:45:12 -0400
Subject: [Linux-cluster] Strange behaviours in two-node cluster
In-Reply-To: <CAEAM5QX-djOYBq0Tm3sYpDRzAFfwp3fY=7g2FbWBg+n4Gr4BqA@mail.gmail.com>
References: <CAEAM5QVKOd+SbxKWkomSgtxeMrhFK0XH20vjEYrOTSKfMzLwLw@mail.gmail.com>
	<50044D50.5060008@alteeve.ca>
	<CAEAM5QUtVLLhK_1kmyMCdR-Yz5xtmUtTZygmg5EX0hrmo5Nv_g@mail.gmail.com>
	<CAEAM5QX-djOYBq0Tm3sYpDRzAFfwp3fY=7g2FbWBg+n4Gr4BqA@mail.gmail.com>
Message-ID: <50057A78.2080304@alteeve.ca>

On 07/17/2012 03:30 AM, Javier Vela wrote:
> Hi, I'm also seeing a lot of log entries in the logs like that:
> 
> openais[4264]: [TOTEM] Retransmit List: 34 35 36 37 38 39 3a 3b 3c
> 
> I've searched through internet and this happens when there are some
> delay between the nodes, but openais its supposed to recover gracefully.
> Can this be a problem?
> 
> 2012/7/16 Javier Vela <jvdiago at gmail.com <mailto:jvdiago at gmail.com>>

I saw this happen with a bug in rhel 6.1 when the nodes were too slow.
I'm wondering if a) you have network problems somewhere or b) you have
insufficient performance on your nodes.

Usually it recovers on it's own, but I have seen it run away to the
point where I had to stop the cluster. That was on modest hardware in a
test environment. On all production machines I've seen, it recovered on
it's own.

-- 
Digimer
Papers and Projects: https://alteeve.com




From jvdiago at gmail.com  Tue Jul 17 15:10:37 2012
From: jvdiago at gmail.com (Javier Vela)
Date: Tue, 17 Jul 2012 17:10:37 +0200
Subject: [Linux-cluster] Strange behaviours in two-node cluster
In-Reply-To: <50057A78.2080304@alteeve.ca>
References: <CAEAM5QVKOd+SbxKWkomSgtxeMrhFK0XH20vjEYrOTSKfMzLwLw@mail.gmail.com>
	<50044D50.5060008@alteeve.ca>
	<CAEAM5QUtVLLhK_1kmyMCdR-Yz5xtmUtTZygmg5EX0hrmo5Nv_g@mail.gmail.com>
	<CAEAM5QX-djOYBq0Tm3sYpDRzAFfwp3fY=7g2FbWBg+n4Gr4BqA@mail.gmail.com>
	<50057A78.2080304@alteeve.ca>
Message-ID: <CAEAM5QWVfPQ_PV0OydBtv558yE4yKW5GWu_B7TTQw1zQX3vGDg@mail.gmail.com>

Hi,

Thank you for the quick reply. I'm going to ask if we can upgrade to Red
Hat 5.8.

Moreover, the machines don't have now performance problems (we are still in
pre). But all is virtual, under VMWare, so a punctual problem in the
VMWare  infrastructure can affect us. Do you know some way to test network
problems that could affect RHCS?

I tried tcpdump and iperf, but haven't seen anything.

Regards, Javi.

2012/7/17 Digimer <lists at alteeve.ca>

> On 07/17/2012 03:30 AM, Javier Vela wrote:
> > Hi, I'm also seeing a lot of log entries in the logs like that:
> >
> > openais[4264]: [TOTEM] Retransmit List: 34 35 36 37 38 39 3a 3b 3c
> >
> > I've searched through internet and this happens when there are some
> > delay between the nodes, but openais its supposed to recover gracefully.
> > Can this be a problem?
> >
> > 2012/7/16 Javier Vela <jvdiago at gmail.com <mailto:jvdiago at gmail.com>>
>
> I saw this happen with a bug in rhel 6.1 when the nodes were too slow.
> I'm wondering if a) you have network problems somewhere or b) you have
> insufficient performance on your nodes.
>
> Usually it recovers on it's own, but I have seen it run away to the
> point where I had to stop the cluster. That was on modest hardware in a
> test environment. On all production machines I've seen, it recovered on
> it's own.
>
> --
> Digimer
> Papers and Projects: https://alteeve.com
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120717/56dfe96b/attachment.htm>

From msalced at pucp.edu.pe  Wed Jul 18 22:34:01 2012
From: msalced at pucp.edu.pe (Mario Salcedo)
Date: Wed, 18 Jul 2012 17:34:01 -0500
Subject: [Linux-cluster] GNBD on Centos 6.3
Message-ID: <CAFgj4zyew37SYaZZULEjTDNgQSOcVafYEjdyBk3Me=bZuraQAA@mail.gmail.com>

Hi. I am configuring a cluster ha with six pcs Centos 6.3. I want
share the same apache data with the nodes with GNBD/GFS but I dont
find gnbd in th repost of Centos 6.3.

Where can i find this packet for Centos 6.3?



From urgrue at bulbous.org  Thu Jul 19 09:45:11 2012
From: urgrue at bulbous.org (urgrue)
Date: Thu, 19 Jul 2012 12:45:11 +0300
Subject: [Linux-cluster] How to deal with a node losing disks in HA-LVM
Message-ID: <1342691111.11657.140661103976441.38D93867@webmail.messagingengine.com>

Using HA-LVM (with LVM tags), if node1 loses access to the disks, it
obviously can't strip the tags.
Other nodes will refuse to recover because the tags are there, and node1
is still online.
If I fence node1, others will happily take over, because they see node1
is offline and they can safely strip the tags.
I've got self_fence=on on the resources, but it's unclear to me in what
conditions it will be triggered on. Apparently not this one (access to
disks lost on the active node for example by HBA problem or zoning
error).

How can I ensure the self_fence is triggered? Or any other better ideas?

Here's the relevant bits from cluster.conf:
<resources>
  <lvm name="res_sanvg" self_fence="on" vg_name="sanvg"/>
  <fs device="/dev/sanvg/sanlv" fsid="29088" mountpoint="/var/lib/mysql"
  name="fs_sanlv" self_fence="on"/>
  <mysql config_file="/etc/my.cnf" name="res_mysql" shutdown_wait="5"
  startup_wait="5"/>
</resources>
<service domain="DC0" name="srv_mysql" recovery="relocate">
  <lvm ref="res_sanvg">
    <fs ref="fs_sanlv">
      <mysql ref="res_mysql"/>
    </fs>
  </lvm>
</service>



From habdellaoui at gmail.com  Fri Jul 20 09:15:07 2012
From: habdellaoui at gmail.com (hakim abdellaoui)
Date: Fri, 20 Jul 2012 11:15:07 +0200
Subject: [Linux-cluster] problem quorum cman
Message-ID: <CANQ3Co9=qVZehgECe=ZGj_wEv-eGGR__-TbDF-M=yMgeQuRpyg@mail.gmail.com>

Hi,

I use rhel6.3 with packages :

cman-3.0.12.1-32.el6.x86_64
rgmanager-3.0.12.1-12.el6.x86_64
openais-1.1.1-7.el6.x86_64

I have two virtual nodes  (vmware) and a quorum share disk (it's a virtual
disk i use scsi sharing multi-write)

the cluster work sometime.

if i reboot node2 the cman not start i have   : *Waiting for quorum...
Timed-out waiting for cluster.*

On the log corosync i have :

*Jul 20 10:51:22 corosync [TOTEM ] A processor joined or left the
membership and a new membership was formed.
Jul 20 10:51:22 corosync [CPG   ] chosen downlist: sender r(0)
ip(192.168.10.154) ; members(old:1 left:0)
Jul 20 10:51:22 corosync [MAIN  ] Completed service synchronization, ready
to provide service.
Jul 20 10:51:23 corosync [CMAN  ] quorum device unregistered*


On the node1 when i type clustat i have :

Cluster Status for clusterweb @ Fri Jul 20 10:38:57 2012
Member Status: Quorate

 Member Name                                  ID   Status
 ------ ----                                  ---- ------
 server-1                                      1 Online, Local
 server-2                                      2 Offline
 /dev/block/8:16                           0 Online, Quorum Disk


If i restart cman on node1 and  i restart cman on node2 the cman start
properly a
When i  type clustat on both nodes i can see all online.


I don't understand why i must restart on node1 the cman if i want to add
the node2 on the
cluster .


You can see my cluster.conf

<?xml version="1.0"?>
<cluster config_version="6" name="clusterweb">
        <clusternodes>
                <clusternode name="server-1" nodeid="1"/>
                <clusternode name="server-2" nodeid="2"/>
        </clusternodes>
        <cman expected_votes="3"/>
        <quorumd label="quorum">
                <heuristic program="ping -c3 -t2 192.168.254.254"/>
        </quorumd>
</cluster>


Very thanks for your help

Best regards.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120720/f66ea60b/attachment.htm>

From fdinitto at redhat.com  Sun Jul 22 14:59:54 2012
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Sun, 22 Jul 2012 16:59:54 +0200
Subject: [Linux-cluster] GNBD on Centos 6.3
In-Reply-To: <CAFgj4zyew37SYaZZULEjTDNgQSOcVafYEjdyBk3Me=bZuraQAA@mail.gmail.com>
References: <CAFgj4zyew37SYaZZULEjTDNgQSOcVafYEjdyBk3Me=bZuraQAA@mail.gmail.com>
Message-ID: <500C156A.6050407@redhat.com>

On 07/19/2012 12:34 AM, Mario Salcedo wrote:
> Hi. I am configuring a cluster ha with six pcs Centos 6.3. I want
> share the same apache data with the nodes with GNBD/GFS but I dont
> find gnbd in th repost of Centos 6.3.
> 
> Where can i find this packet for Centos 6.3?

GNBD has been deprecated in favor of iSCSI or AoE that are recognized
standards.

Fabio



From msalced at pucp.edu.pe  Mon Jul 23 13:38:38 2012
From: msalced at pucp.edu.pe (Mario Salcedo)
Date: Mon, 23 Jul 2012 08:38:38 -0500
Subject: [Linux-cluster] GNBD on Centos 6.3
In-Reply-To: <500C156A.6050407@redhat.com>
References: <CAFgj4zyew37SYaZZULEjTDNgQSOcVafYEjdyBk3Me=bZuraQAA@mail.gmail.com>
	<500C156A.6050407@redhat.com>
Message-ID: <CAFgj4zzvxL0fijhjWuStkb=-KdBnb1Aoz+Ed4uYMraDQn=3GLQ@mail.gmail.com>

Thanks for your reply.

GNBD is good for test LVS Cluster, but not for service online. I
change to centos 5 only for my tests.

I am going to investigue about ATA over Ethernet.

Thanks

2012/7/22 Fabio M. Di Nitto <fdinitto at redhat.com>:
> On 07/19/2012 12:34 AM, Mario Salcedo wrote:
>> Hi. I am configuring a cluster ha with six pcs Centos 6.3. I want
>> share the same apache data with the nodes with GNBD/GFS but I dont
>> find gnbd in th repost of Centos 6.3.
>>
>> Where can i find this packet for Centos 6.3?
>
> GNBD has been deprecated in favor of iSCSI or AoE that are recognized
> standards.
>
> Fabio
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



From jeff.sturm at eprize.com  Mon Jul 23 14:17:34 2012
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Mon, 23 Jul 2012 14:17:34 +0000
Subject: [Linux-cluster] GNBD on Centos 6.3
In-Reply-To: <CAFgj4zzvxL0fijhjWuStkb=-KdBnb1Aoz+Ed4uYMraDQn=3GLQ@mail.gmail.com>
References: <CAFgj4zyew37SYaZZULEjTDNgQSOcVafYEjdyBk3Me=bZuraQAA@mail.gmail.com>
	<500C156A.6050407@redhat.com>
	<CAFgj4zzvxL0fijhjWuStkb=-KdBnb1Aoz+Ed4uYMraDQn=3GLQ@mail.gmail.com>
Message-ID: <B1B9801C5CBC954680D0374CC4EEABA536624C47@MailNode2.eprize.local>

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com]
> On Behalf Of Mario Salcedo
> Sent: Monday, July 23, 2012 9:39 AM
> 
> GNBD is good for test LVS Cluster, but not for service online. I change to centos 5 only
> for my tests.
> 
> I am going to investigue about ATA over Ethernet.

If you have a pair of nodes, DRBD works well, too.

-Jeff





From msalced at pucp.edu.pe  Tue Jul 24 13:08:22 2012
From: msalced at pucp.edu.pe (Mario Salcedo)
Date: Tue, 24 Jul 2012 08:08:22 -0500
Subject: [Linux-cluster] GNBD on Centos 6.3
In-Reply-To: <B1B9801C5CBC954680D0374CC4EEABA536624C47@MailNode2.eprize.local>
References: <CAFgj4zyew37SYaZZULEjTDNgQSOcVafYEjdyBk3Me=bZuraQAA@mail.gmail.com>
	<500C156A.6050407@redhat.com>
	<CAFgj4zzvxL0fijhjWuStkb=-KdBnb1Aoz+Ed4uYMraDQn=3GLQ@mail.gmail.com>
	<B1B9801C5CBC954680D0374CC4EEABA536624C47@MailNode2.eprize.local>
Message-ID: <CAFgj4zxG-f7we1zgavt2pzET4RdxELiZ6gGVcVjT5V5yWLpReQ@mail.gmail.com>

Good tip.

I am going tio probe that.

http://baberzahoor.blogspot.com/2009/07/linux-load-balancer-using-lvs-heartbeat.html



2012/7/23 Jeff Sturm <jeff.sturm at eprize.com>:
>> -----Original Message-----
>> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com]
>> On Behalf Of Mario Salcedo
>> Sent: Monday, July 23, 2012 9:39 AM
>>
>> GNBD is good for test LVS Cluster, but not for service online. I change to centos 5 only
>> for my tests.
>>
>> I am going to investigue about ATA over Ethernet.
>
> If you have a pair of nodes, DRBD works well, too.
>
> -Jeff
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



From peixubin at yahoo.com.cn  Thu Jul 26 14:22:28 2012
From: peixubin at yahoo.com.cn (=?gb2312?B?xeHQ8bHz?=)
Date: Thu, 26 Jul 2012 22:22:28 +0800 (CST)
Subject: [Linux-cluster] Frgjgtjhreugyufdfdgf
Message-ID: <1343312548.23111.yext-apple-iphone@web15004.mail.cnb.yahoo.com>



???? iPad

From peixubin at yahoo.com.cn  Thu Jul 26 14:23:32 2012
From: peixubin at yahoo.com.cn (=?gb2312?B?xeHQ8bHz?=)
Date: Thu, 26 Jul 2012 22:23:32 +0800 (CST)
Subject: [Linux-cluster] Where to get CLVM
Message-ID: <1343312612.42865.yext-apple-iphone@web15008.mail.cnb.yahoo.com>

Hodyiohohucffn

???? iPad

? 2012-5-26?11:53?Digimer <lists at alteeve.ca> ???

> On 05/25/2012 12:27 PM, Chen, Ming Ming wrote:
>> I'm going to install and configure a CentOS 6.2 cluster. I need  CLVM  Is CLVM  included in  CentOS 6.2? If so, which package should I pick to install? If not, where can I get it?
>> Thanks
>> Ming
> 
> CentOS is a binary compatible, community released version of Red Hat. So
> whatever is available in Red Hat is available in CentOS, including clvmd.
> 
> This tutorial covers, among other things, how to install and configure
> Clustered LVM. Be sure to configure fencing as without it, clvmd will
> hang (by design) the first time a node fails and can't be fenced.
> 
> https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial
> 
> -- 
> Digimer
> Papers and Projects: https://alteeve.com
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

From td250y at att.com  Thu Jul 26 15:44:39 2012
From: td250y at att.com (DIMITROV, TANIO)
Date: Thu, 26 Jul 2012 15:44:39 +0000
Subject: [Linux-cluster] RHEL 6 two-node cluster - nodes killing each
	other's cman
Message-ID: <EDB72A472210FD4B8A26174140D3063C016C15C7@MISOUT7MSGUSR9A.ITServices.sbc.com>

Hello,
I'm testing RHEL 6.2 cluster using CMAN.
It is a two-node cluster, no shared data. The problem is that if there is a connectivity problem between the nodes, each of them continues working as stand-alone - which is OK (no shared data, manual fencing). But when the connection comes back up the nodes kill each other's cman instances :

Jul 26 13:58:05.000 node1 corosync[15771]: cman killed by node 2 because we were killed by cman_tool or other application
Jul 26 13:58:05.000 node1 gfs_controld[15900]: cluster is down, exiting
Jul 26 13:58:05.000 node1 gfs_controld[15900]: daemon cpg_dispatch error 2
Jul 26 13:58:05.000 node1 dlm_controld[15848]: cluster is down, exiting

Can this be avoided somehow?

Thanks in advance!





From lists at alteeve.ca  Thu Jul 26 15:48:08 2012
From: lists at alteeve.ca (Digimer)
Date: Thu, 26 Jul 2012 11:48:08 -0400
Subject: [Linux-cluster] RHEL 6 two-node cluster - nodes killing each
 other's cman
In-Reply-To: <EDB72A472210FD4B8A26174140D3063C016C15C7@MISOUT7MSGUSR9A.ITServices.sbc.com>
References: <EDB72A472210FD4B8A26174140D3063C016C15C7@MISOUT7MSGUSR9A.ITServices.sbc.com>
Message-ID: <501166B8.6050807@alteeve.ca>

On 07/26/2012 11:44 AM, DIMITROV, TANIO wrote:
> Hello,
> I'm testing RHEL 6.2 cluster using CMAN.
> It is a two-node cluster, no shared data. The problem is that if there is a connectivity problem between the nodes, each of them continues working as stand-alone - which is OK (no shared data, manual fencing). But when the connection comes back up the nodes kill each other's cman instances :
>
> Jul 26 13:58:05.000 node1 corosync[15771]: cman killed by node 2 because we were killed by cman_tool or other application
> Jul 26 13:58:05.000 node1 gfs_controld[15900]: cluster is down, exiting
> Jul 26 13:58:05.000 node1 gfs_controld[15900]: daemon cpg_dispatch error 2
> Jul 26 13:58:05.000 node1 dlm_controld[15848]: cluster is down, exiting
>
> Can this be avoided somehow?
>
> Thanks in advance!

Use real fencing.

The problem is, I believe, that the CPG messages fall out of sync. You 
could try stopping cman on one node, reconnecting the network and 
restarting cman on the one node again.

-- 
Digimer
Papers and Projects: https://alteeve.com



From td250y at att.com  Thu Jul 26 16:04:53 2012
From: td250y at att.com (DIMITROV, TANIO)
Date: Thu, 26 Jul 2012 16:04:53 +0000
Subject: [Linux-cluster] RHEL 6 two-node cluster - nodes killing each
 other's cman
In-Reply-To: <501166B8.6050807@alteeve.ca>
References: <EDB72A472210FD4B8A26174140D3063C016C15C7@MISOUT7MSGUSR9A.ITServices.sbc.com>
	<501166B8.6050807@alteeve.ca>
Message-ID: <EDB72A472210FD4B8A26174140D3063C016C1609@MISOUT7MSGUSR9A.ITServices.sbc.com>

Thanks Digimer,

Yes, this works but it cannot be done automatically - and that's my problem.
I'm trying to figure out what is the reason for killing CMAN - what if I use SAN switch as a fencing device to block access to the SAN - my node won't be rebooted and I will run into the same situation?
Is it at all possible for the node to rejoin the cluster without rebooting /CMAN restarting?
And if it is not, what about the SAN switch fencing scenario? 



-----Original Message-----
From: Digimer [mailto:lists at alteeve.ca] 
Sent: Thursday, July 26, 2012 11:48 AM
To: linux clustering
Cc: DIMITROV, TANIO
Subject: Re: [Linux-cluster] RHEL 6 two-node cluster - nodes killing each other's cman

On 07/26/2012 11:44 AM, DIMITROV, TANIO wrote:
> Hello,
> I'm testing RHEL 6.2 cluster using CMAN.
> It is a two-node cluster, no shared data. The problem is that if there is a connectivity problem between the nodes, each of them continues working as stand-alone - which is OK (no shared data, manual fencing). But when the connection comes back up the nodes kill each other's cman instances :
>
> Jul 26 13:58:05.000 node1 corosync[15771]: cman killed by node 2 because we were killed by cman_tool or other application
> Jul 26 13:58:05.000 node1 gfs_controld[15900]: cluster is down, exiting
> Jul 26 13:58:05.000 node1 gfs_controld[15900]: daemon cpg_dispatch error 2
> Jul 26 13:58:05.000 node1 dlm_controld[15848]: cluster is down, exiting
>
> Can this be avoided somehow?
>
> Thanks in advance!

Use real fencing.

The problem is, I believe, that the CPG messages fall out of sync. You 
could try stopping cman on one node, reconnecting the network and 
restarting cman on the one node again.

-- 
Digimer
Papers and Projects: https://alteeve.com



From muhshaik at cisco.com  Thu Jul 26 17:09:03 2012
From: muhshaik at cisco.com (Muhammad Shaikh (muhshaik))
Date: Thu, 26 Jul 2012 17:09:03 +0000
Subject: [Linux-cluster] RHEL 6 two-node cluster - nodes killing each
 other's cman
In-Reply-To: <EDB72A472210FD4B8A26174140D3063C016C15C7@MISOUT7MSGUSR9A.ITServices.sbc.com>
Message-ID: <CC36C74A.617F%muhshaik@cisco.com>


I think you have split brain and both nodes are out of sync and likely
doing stonith to each other.

You may want to shutdown the crm resources on one of the node. And promote
one of the node as primary and go through the steps to join the other node
as part of the cluster as secondary node.

Muhammad

On 7/26/12 8:44 AM, "DIMITROV, TANIO" <td250y at att.com> wrote:

>Hello,
>I'm testing RHEL 6.2 cluster using CMAN.
>It is a two-node cluster, no shared data. The problem is that if there is
>a connectivity problem between the nodes, each of them continues working
>as stand-alone - which is OK (no shared data, manual fencing). But when
>the connection comes back up the nodes kill each other's cman instances :
>
>Jul 26 13:58:05.000 node1 corosync[15771]: cman killed by node 2 because
>we were killed by cman_tool or other application
>Jul 26 13:58:05.000 node1 gfs_controld[15900]: cluster is down, exiting
>Jul 26 13:58:05.000 node1 gfs_controld[15900]: daemon cpg_dispatch error 2
>Jul 26 13:58:05.000 node1 dlm_controld[15848]: cluster is down, exiting
>
>Can this be avoided somehow?
>
>Thanks in advance!
>
>
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster




From lists at alteeve.ca  Thu Jul 26 17:43:39 2012
From: lists at alteeve.ca (Digimer)
Date: Thu, 26 Jul 2012 13:43:39 -0400
Subject: [Linux-cluster] RHEL 6 two-node cluster - nodes killing each
 other's cman
In-Reply-To: <EDB72A472210FD4B8A26174140D3063C016C1675@MISOUT7MSGUSR9A.ITServices.sbc.com>
References: <EDB72A472210FD4B8A26174140D3063C016C15C7@MISOUT7MSGUSR9A.ITServices.sbc.com>
	<501166B8.6050807@alteeve.ca>
	<EDB72A472210FD4B8A26174140D3063C016C1609@MISOUT7MSGUSR9A.ITServices.sbc.com>
	<50117483.6020304@alteeve.ca>
	<EDB72A472210FD4B8A26174140D3063C016C1675@MISOUT7MSGUSR9A.ITServices.sbc.com>
Message-ID: <501181CB.9090405@alteeve.ca>

That's a non-standard use of the cluster stack. It's design (and safe 
guards) suppose a configuration where the nodes fully work together and 
and redundant. So though it works, it's not going to work perfectly in 
your use-case.

And yes, you do need to restart cman (one way or the other)

Digimer

PS - Please reply to the mailing list. These replies can help others 
later by being in the archives.

On 07/26/2012 01:18 PM, DIMITROV, TANIO wrote:
> The reason I don't want to reboot/fence the node is that my nodes are actually semi-independent - each one writes to its local file system which is then backed up on the other node when it becomes available.
>
> So, the only way to rejoin the cluster is to start CPG sequence from 0 (clean state) by either rebooting the node or restarting CMAN?
>
>
>
> -----Original Message-----
> From: Digimer [mailto:lists at alteeve.ca]
> Sent: Thursday, July 26, 2012 12:47 PM
> To: DIMITROV, TANIO
> Cc: linux clustering
> Subject: Re: [Linux-cluster] RHEL 6 two-node cluster - nodes killing each other's cman
>
> For automatic recovery, you have to use power fencing. Fabric fencing
> (like fencing at a SAN switch) is perfectly safe, but it requires human
> intervention.
>
> The problem is that the messages passed around the cluster in the closed
> process group (CPG) are sequenced. Once a node falls out of sequence, it
> needs to be restarted. To automate this, power fence the node. When it
> boots back up, it should automatically rejoin the cluster with a clean
> state.
>
> May I ask why you're so careful to avoid a restart? The whole idea of
> clustering is to have no/minimal interruption of service during a node
> failure.
>
> Digimer
>
> On 07/26/2012 12:04 PM, DIMITROV, TANIO wrote:
>> Thanks Digimer,
>>
>> Yes, this works but it cannot be done automatically - and that's my problem.
>> I'm trying to figure out what is the reason for killing CMAN - what if I use SAN switch as a fencing device to block access to the SAN - my node won't be rebooted and I will run into the same situation?
>> Is it at all possible for the node to rejoin the cluster without rebooting /CMAN restarting?
>> And if it is not, what about the SAN switch fencing scenario?
>>
>>
>>
>> -----Original Message-----
>> From: Digimer [mailto:lists at alteeve.ca]
>> Sent: Thursday, July 26, 2012 11:48 AM
>> To: linux clustering
>> Cc: DIMITROV, TANIO
>> Subject: Re: [Linux-cluster] RHEL 6 two-node cluster - nodes killing each other's cman
>>
>> On 07/26/2012 11:44 AM, DIMITROV, TANIO wrote:
>>> Hello,
>>> I'm testing RHEL 6.2 cluster using CMAN.
>>> It is a two-node cluster, no shared data. The problem is that if there is a connectivity problem between the nodes, each of them continues working as stand-alone - which is OK (no shared data, manual fencing). But when the connection comes back up the nodes kill each other's cman instances :
>>>
>>> Jul 26 13:58:05.000 node1 corosync[15771]: cman killed by node 2 because we were killed by cman_tool or other application
>>> Jul 26 13:58:05.000 node1 gfs_controld[15900]: cluster is down, exiting
>>> Jul 26 13:58:05.000 node1 gfs_controld[15900]: daemon cpg_dispatch error 2
>>> Jul 26 13:58:05.000 node1 dlm_controld[15848]: cluster is down, exiting
>>>
>>> Can this be avoided somehow?
>>>
>>> Thanks in advance!
>>
>> Use real fencing.
>>
>> The problem is, I believe, that the CPG messages fall out of sync. You
>> could try stopping cman on one node, reconnecting the network and
>> restarting cman on the one node again.
>>
>
>


-- 
Digimer
Papers and Projects: https://alteeve.com



From lists at alteeve.ca  Thu Jul 26 16:46:59 2012
From: lists at alteeve.ca (Digimer)
Date: Thu, 26 Jul 2012 12:46:59 -0400
Subject: [Linux-cluster] RHEL 6 two-node cluster - nodes killing each
 other's cman
In-Reply-To: <EDB72A472210FD4B8A26174140D3063C016C1609@MISOUT7MSGUSR9A.ITServices.sbc.com>
References: <EDB72A472210FD4B8A26174140D3063C016C15C7@MISOUT7MSGUSR9A.ITServices.sbc.com>
	<501166B8.6050807@alteeve.ca>
	<EDB72A472210FD4B8A26174140D3063C016C1609@MISOUT7MSGUSR9A.ITServices.sbc.com>
Message-ID: <50117483.6020304@alteeve.ca>

For automatic recovery, you have to use power fencing. Fabric fencing 
(like fencing at a SAN switch) is perfectly safe, but it requires human 
intervention.

The problem is that the messages passed around the cluster in the closed 
process group (CPG) are sequenced. Once a node falls out of sequence, it 
needs to be restarted. To automate this, power fence the node. When it 
boots back up, it should automatically rejoin the cluster with a clean 
state.

May I ask why you're so careful to avoid a restart? The whole idea of 
clustering is to have no/minimal interruption of service during a node 
failure.

Digimer

On 07/26/2012 12:04 PM, DIMITROV, TANIO wrote:
> Thanks Digimer,
>
> Yes, this works but it cannot be done automatically - and that's my problem.
> I'm trying to figure out what is the reason for killing CMAN - what if I use SAN switch as a fencing device to block access to the SAN - my node won't be rebooted and I will run into the same situation?
> Is it at all possible for the node to rejoin the cluster without rebooting /CMAN restarting?
> And if it is not, what about the SAN switch fencing scenario?
>
>
>
> -----Original Message-----
> From: Digimer [mailto:lists at alteeve.ca]
> Sent: Thursday, July 26, 2012 11:48 AM
> To: linux clustering
> Cc: DIMITROV, TANIO
> Subject: Re: [Linux-cluster] RHEL 6 two-node cluster - nodes killing each other's cman
>
> On 07/26/2012 11:44 AM, DIMITROV, TANIO wrote:
>> Hello,
>> I'm testing RHEL 6.2 cluster using CMAN.
>> It is a two-node cluster, no shared data. The problem is that if there is a connectivity problem between the nodes, each of them continues working as stand-alone - which is OK (no shared data, manual fencing). But when the connection comes back up the nodes kill each other's cman instances :
>>
>> Jul 26 13:58:05.000 node1 corosync[15771]: cman killed by node 2 because we were killed by cman_tool or other application
>> Jul 26 13:58:05.000 node1 gfs_controld[15900]: cluster is down, exiting
>> Jul 26 13:58:05.000 node1 gfs_controld[15900]: daemon cpg_dispatch error 2
>> Jul 26 13:58:05.000 node1 dlm_controld[15848]: cluster is down, exiting
>>
>> Can this be avoided somehow?
>>
>> Thanks in advance!
>
> Use real fencing.
>
> The problem is, I believe, that the CPG messages fall out of sync. You
> could try stopping cman on one node, reconnecting the network and
> restarting cman on the one node again.
>


-- 
Digimer
Papers and Projects: https://alteeve.com



From lists at alteeve.ca  Thu Jul 26 17:52:58 2012
From: lists at alteeve.ca (Digimer)
Date: Thu, 26 Jul 2012 13:52:58 -0400
Subject: [Linux-cluster] RHEL 6 two-node cluster - nodes killing each
 other's cman
In-Reply-To: <EDB72A472210FD4B8A26174140D3063C016C1685@MISOUT7MSGUSR9A.ITServices.sbc.com>
References: <EDB72A472210FD4B8A26174140D3063C016C15C7@MISOUT7MSGUSR9A.ITServices.sbc.com>
	<501166B8.6050807@alteeve.ca>
	<EDB72A472210FD4B8A26174140D3063C016C1609@MISOUT7MSGUSR9A.ITServices.sbc.com>
	<50117483.6020304@alteeve.ca>
	<EDB72A472210FD4B8A26174140D3063C016C1685@MISOUT7MSGUSR9A.ITServices.sbc.com>
Message-ID: <501183FA.8050905@alteeve.ca>

Ah, hehe, ignore my PS in the other reply then. :)

digimer

On 07/26/2012 01:20 PM, DIMITROV, TANIO wrote:
> Sorry, sent the message to the wrong address
>
>
> The reason I don't want to reboot/fence the node is that my nodes are actually semi-independent - each one writes to its local file system which is then backed up on the other node when it becomes available.
>
> So, the only way to rejoin the cluster is to start CPG sequence from 0 (clean state) by either rebooting the node or restarting CMAN?
>
>
> -----Original Message-----
> From: Digimer [mailto:lists at alteeve.ca]
> Sent: Thursday, July 26, 2012 12:47 PM
> To: DIMITROV, TANIO
> Cc: linux clustering
> Subject: Re: [Linux-cluster] RHEL 6 two-node cluster - nodes killing each other's cman
>
> For automatic recovery, you have to use power fencing. Fabric fencing
> (like fencing at a SAN switch) is perfectly safe, but it requires human
> intervention.
>
> The problem is that the messages passed around the cluster in the closed
> process group (CPG) are sequenced. Once a node falls out of sequence, it
> needs to be restarted. To automate this, power fence the node. When it
> boots back up, it should automatically rejoin the cluster with a clean
> state.
>
> May I ask why you're so careful to avoid a restart? The whole idea of
> clustering is to have no/minimal interruption of service during a node
> failure.
>
> Digimer
>
> On 07/26/2012 12:04 PM, DIMITROV, TANIO wrote:
>> Thanks Digimer,
>>
>> Yes, this works but it cannot be done automatically - and that's my problem.
>> I'm trying to figure out what is the reason for killing CMAN - what if I use SAN switch as a fencing device to block access to the SAN - my node won't be rebooted and I will run into the same situation?
>> Is it at all possible for the node to rejoin the cluster without rebooting /CMAN restarting?
>> And if it is not, what about the SAN switch fencing scenario?
>>
>>
>>
>> -----Original Message-----
>> From: Digimer [mailto:lists at alteeve.ca]
>> Sent: Thursday, July 26, 2012 11:48 AM
>> To: linux clustering
>> Cc: DIMITROV, TANIO
>> Subject: Re: [Linux-cluster] RHEL 6 two-node cluster - nodes killing each other's cman
>>
>> On 07/26/2012 11:44 AM, DIMITROV, TANIO wrote:
>>> Hello,
>>> I'm testing RHEL 6.2 cluster using CMAN.
>>> It is a two-node cluster, no shared data. The problem is that if there is a connectivity problem between the nodes, each of them continues working as stand-alone - which is OK (no shared data, manual fencing). But when the connection comes back up the nodes kill each other's cman instances :
>>>
>>> Jul 26 13:58:05.000 node1 corosync[15771]: cman killed by node 2 because we were killed by cman_tool or other application
>>> Jul 26 13:58:05.000 node1 gfs_controld[15900]: cluster is down, exiting
>>> Jul 26 13:58:05.000 node1 gfs_controld[15900]: daemon cpg_dispatch error 2
>>> Jul 26 13:58:05.000 node1 dlm_controld[15848]: cluster is down, exiting
>>>
>>> Can this be avoided somehow?
>>>
>>> Thanks in advance!
>>
>> Use real fencing.
>>
>> The problem is, I believe, that the CPG messages fall out of sync. You
>> could try stopping cman on one node, reconnecting the network and
>> restarting cman on the one node again.
>>
>
>


-- 
Digimer
Papers and Projects: https://alteeve.com



From td250y at att.com  Thu Jul 26 17:20:11 2012
From: td250y at att.com (DIMITROV, TANIO)
Date: Thu, 26 Jul 2012 17:20:11 +0000
Subject: [Linux-cluster] RHEL 6 two-node cluster - nodes killing each
 other's cman
In-Reply-To: <50117483.6020304@alteeve.ca>
References: <EDB72A472210FD4B8A26174140D3063C016C15C7@MISOUT7MSGUSR9A.ITServices.sbc.com>
	<501166B8.6050807@alteeve.ca>
	<EDB72A472210FD4B8A26174140D3063C016C1609@MISOUT7MSGUSR9A.ITServices.sbc.com>
	<50117483.6020304@alteeve.ca>
Message-ID: <EDB72A472210FD4B8A26174140D3063C016C1685@MISOUT7MSGUSR9A.ITServices.sbc.com>

Sorry, sent the message to the wrong address


The reason I don't want to reboot/fence the node is that my nodes are actually semi-independent - each one writes to its local file system which is then backed up on the other node when it becomes available. 

So, the only way to rejoin the cluster is to start CPG sequence from 0 (clean state) by either rebooting the node or restarting CMAN?


-----Original Message-----
From: Digimer [mailto:lists at alteeve.ca] 
Sent: Thursday, July 26, 2012 12:47 PM
To: DIMITROV, TANIO
Cc: linux clustering
Subject: Re: [Linux-cluster] RHEL 6 two-node cluster - nodes killing each other's cman

For automatic recovery, you have to use power fencing. Fabric fencing 
(like fencing at a SAN switch) is perfectly safe, but it requires human 
intervention.

The problem is that the messages passed around the cluster in the closed 
process group (CPG) are sequenced. Once a node falls out of sequence, it 
needs to be restarted. To automate this, power fence the node. When it 
boots back up, it should automatically rejoin the cluster with a clean 
state.

May I ask why you're so careful to avoid a restart? The whole idea of 
clustering is to have no/minimal interruption of service during a node 
failure.

Digimer

On 07/26/2012 12:04 PM, DIMITROV, TANIO wrote:
> Thanks Digimer,
>
> Yes, this works but it cannot be done automatically - and that's my problem.
> I'm trying to figure out what is the reason for killing CMAN - what if I use SAN switch as a fencing device to block access to the SAN - my node won't be rebooted and I will run into the same situation?
> Is it at all possible for the node to rejoin the cluster without rebooting /CMAN restarting?
> And if it is not, what about the SAN switch fencing scenario?
>
>
>
> -----Original Message-----
> From: Digimer [mailto:lists at alteeve.ca]
> Sent: Thursday, July 26, 2012 11:48 AM
> To: linux clustering
> Cc: DIMITROV, TANIO
> Subject: Re: [Linux-cluster] RHEL 6 two-node cluster - nodes killing each other's cman
>
> On 07/26/2012 11:44 AM, DIMITROV, TANIO wrote:
>> Hello,
>> I'm testing RHEL 6.2 cluster using CMAN.
>> It is a two-node cluster, no shared data. The problem is that if there is a connectivity problem between the nodes, each of them continues working as stand-alone - which is OK (no shared data, manual fencing). But when the connection comes back up the nodes kill each other's cman instances :
>>
>> Jul 26 13:58:05.000 node1 corosync[15771]: cman killed by node 2 because we were killed by cman_tool or other application
>> Jul 26 13:58:05.000 node1 gfs_controld[15900]: cluster is down, exiting
>> Jul 26 13:58:05.000 node1 gfs_controld[15900]: daemon cpg_dispatch error 2
>> Jul 26 13:58:05.000 node1 dlm_controld[15848]: cluster is down, exiting
>>
>> Can this be avoided somehow?
>>
>> Thanks in advance!
>
> Use real fencing.
>
> The problem is, I believe, that the CPG messages fall out of sync. You
> could try stopping cman on one node, reconnecting the network and
> restarting cman on the one node again.
>


-- 
Digimer
Papers and Projects: https://alteeve.com



From kailash.kumawat at rudrainfotainment.com  Thu Jul 26 18:19:47 2012
From: kailash.kumawat at rudrainfotainment.com (kailash kumawat)
Date: Thu, 26 Jul 2012 23:49:47 +0530
Subject: [Linux-cluster] if password change after configure cluster
Message-ID: <CALO-jX6+TCMa8J_VnEmqq+0EsFd1v9TYvRYWiRBL-SM+912j1g@mail.gmail.com>

Hi

I had configured 2 node cluster with a one luci server for the apache
server, its running fine but i have some doubt in this....

i want to change my 2 nodes password so how we can connect our luci server
with a new authentication because its reads old configuration whats the
solution for this tell me please

-- 
Regards
*Kailash Kumawat*
System Admin
09167396313
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120726/a9926f49/attachment.htm>

From rmitchel at redhat.com  Thu Jul 26 23:37:01 2012
From: rmitchel at redhat.com (Ryan Mitchell)
Date: Fri, 27 Jul 2012 09:37:01 +1000
Subject: [Linux-cluster] RHEL 6 two-node cluster - nodes killing each
 other's cman
In-Reply-To: <EDB72A472210FD4B8A26174140D3063C016C15C7@MISOUT7MSGUSR9A.ITServices.sbc.com>
References: <EDB72A472210FD4B8A26174140D3063C016C15C7@MISOUT7MSGUSR9A.ITServices.sbc.com>
Message-ID: <5011D49D.9000106@redhat.com>

On 07/27/2012 01:44 AM, DIMITROV, TANIO wrote:
> Hello,
> I'm testing RHEL 6.2 cluster using CMAN.
> It is a two-node cluster, no shared data. The problem is that if there is a connectivity problem between the nodes, each of them continues working as stand-alone - which is OK (no shared data, manual fencing). But when the connection comes back up the nodes kill each other's cman instances :
>
> Jul 26 13:58:05.000 node1 corosync[15771]: cman killed by node 2 because we were killed by cman_tool or other application
> Jul 26 13:58:05.000 node1 gfs_controld[15900]: cluster is down, exiting
> Jul 26 13:58:05.000 node1 gfs_controld[15900]: daemon cpg_dispatch error 2
> Jul 26 13:58:05.000 node1 dlm_controld[15848]: cluster is down, exiting
>
> Can this be avoided somehow?
>
> Thanks in advance!
Hi,

The error you see is the result of 2 clusters with existing state trying 
to merge.  Both nodes have previously been in a quorate cluster and 
therefore have existing cluster state.  At this time, CMAN and other 
tools do not support merging cluster states so that is why you hit this 
problem.  The solution is to implement fencing, because once a node is 
fenced and rebooted, it starts with no state (ie. not dirty) and can 
join the existing node (which has state, ie. dirty) successfully.

While it is possible to run clusters without fencing, behavior is 
designed with fencing in mind and you can end up with strange behavior 
like you've experienced when fencing doesn't trigger.  In some 
occasions, both nodes will kill each other and you'll lose both cluster 
nodes.  If this is really a critical system, I highly recommend fencing.

Regards,

Ryan Mitchell
Red Hat Global Support Services



From mgrac at redhat.com  Fri Jul 27 07:14:34 2012
From: mgrac at redhat.com (Marek Grac)
Date: Fri, 27 Jul 2012 09:14:34 +0200
Subject: [Linux-cluster] fence-agents-3.1.9 stable release
Message-ID: <50123FDA.1070608@redhat.com>

Welcome to the fence-agents 3.1.9 release.

This release includes these updates:
* support for HP BladeSystem (fence_hpblade) was added
* support for IBM iPDU (fence_ipdu) was added
* support for creating 'symlinks' to fence agents which can add 
additional options (e.g. fence_ilo3 -> fence_ipmilan with correct cycle, 
lanplus and power-wait parameters) so users do not have to set them
* autodetect EOL in fence agents during login
* fix attribute unique in XML metadata
* support option 'action' in old fence agents

The new source tarball can be downloaded here:

https://fedorahosted.org/releases/f/e/fence-agents/fence-agents-3.1.9.tar.xz 


To report bugs or issues:

https://bugzilla.redhat.com/

Would you like to meet the cluster team or members of its community?

    Join us on IRC (irc.freenode.net #linux-cluster) and share your
    experience  with other sysadministrators or power users.

Thanks/congratulations to all people that contributed to achieve this
great milestone.

m,



From gounini.geekarea at gmail.com  Mon Jul 30 14:43:25 2012
From: gounini.geekarea at gmail.com (GouNiNi)
Date: Mon, 30 Jul 2012 16:43:25 +0200 (CEST)
Subject: [Linux-cluster] Quorum device brain the cluster when master lose
	network
In-Reply-To: <1094353528.2191.1343659371625.JavaMail.root@geekarea.fr>
Message-ID: <1834035503.2192.1343659405772.JavaMail.root@geekarea.fr>

Hello,

I did some tests on 4 nodes cluster with quorum device and I find a bad situation with one test, so I need your knowledges to correct my configuration.

Configuation:
4 nodes, all vote for 1
quorum device vote for 1 (to hold services with minimum 2 nodes up)
cman expected votes 5

Situation:
I shut down network on 2 nodes, one of them is master.

Observation:
Fencing of one node (the master)... Quorum device Offline, Quorum disolved ! Services stopped.
Fenced node reboot, cluster is quorate, 2nd offline node is fenced. Services restart.
2nd node offline reboot.

My cluster is not quorate for 8 min (very long hardware boot :-) and my services were offline.

Do you know how to prevent this situation?

Regards,

-- 
  .`'`.   GouNiNi
 :  ': :  
 `. ` .`  GNU/Linux
   `'`    http://www.geekarea.fr



From gounini.geekarea at gmail.com  Mon Jul 30 14:46:51 2012
From: gounini.geekarea at gmail.com (GouNiNi)
Date: Mon, 30 Jul 2012 16:46:51 +0200 (CEST)
Subject: [Linux-cluster] How to change quorumd intervel and tko online?
Message-ID: <1277138514.2193.1343659611647.JavaMail.root@geekarea.fr>

Re,

Juste two little questions.
How to change quorumd intervel and tko **online**?
How to check these values on online cluster?

Thanks
Regards,

-- 
  .`'`.   GouNiNi
 :  ': :  
 `. ` .`  GNU/Linux
   `'`    http://www.geekarea.fr



From lists at alteeve.ca  Mon Jul 30 15:10:10 2012
From: lists at alteeve.ca (Digimer)
Date: Mon, 30 Jul 2012 11:10:10 -0400
Subject: [Linux-cluster] Quorum device brain the cluster when master
 lose network
In-Reply-To: <1834035503.2192.1343659405772.JavaMail.root@geekarea.fr>
References: <1834035503.2192.1343659405772.JavaMail.root@geekarea.fr>
Message-ID: <5016A3D2.5060605@alteeve.ca>

On 07/30/2012 10:43 AM, GouNiNi wrote:
> Hello,
>
> I did some tests on 4 nodes cluster with quorum device and I find a bad situation with one test, so I need your knowledges to correct my configuration.
>
> Configuation:
> 4 nodes, all vote for 1
> quorum device vote for 1 (to hold services with minimum 2 nodes up)
> cman expected votes 5
>
> Situation:
> I shut down network on 2 nodes, one of them is master.
>
> Observation:
> Fencing of one node (the master)... Quorum device Offline, Quorum disolved ! Services stopped.
> Fenced node reboot, cluster is quorate, 2nd offline node is fenced. Services restart.
> 2nd node offline reboot.
>
> My cluster is not quorate for 8 min (very long hardware boot :-) and my services were offline.
>
> Do you know how to prevent this situation?
>
> Regards,

Please tell us the name and version of the cluster software you are 
using, Please also share your configuration file(s).

-- 
Digimer
Papers and Projects: https://alteeve.com



From emi2fast at gmail.com  Mon Jul 30 15:15:27 2012
From: emi2fast at gmail.com (emmanuel segura)
Date: Mon, 30 Jul 2012 17:15:27 +0200
Subject: [Linux-cluster] Quorum device brain the cluster when master
	lose network
In-Reply-To: <1834035503.2192.1343659405772.JavaMail.root@geekarea.fr>
References: <1094353528.2191.1343659371625.JavaMail.root@geekarea.fr>
	<1834035503.2192.1343659405772.JavaMail.root@geekarea.fr>
Message-ID: <CAE7pJ3AhR1hO5uaad9+J2X3jnOgsi0_n3xJ0DJf4Op+Do+vidA@mail.gmail.com>

Hello GouNiNi

Don't use expected votes directive let the cluster calculate that, if you
wanna a cluster it's remain quorate with two nodes + quorum disk, the
quorum votes must be 2 votes

all votes = 6 : 6 - 2 = 4 and the resoult it's more then half

Sorry for my english, i hope the idea it's clear for you

2012/7/30 GouNiNi <gounini.geekarea at gmail.com>

> Hello,
>
> I did some tests on 4 nodes cluster with quorum device and I find a bad
> situation with one test, so I need your knowledges to correct my
> configuration.
>
> Configuation:
> 4 nodes, all vote for 1
> quorum device vote for 1 (to hold services with minimum 2 nodes up)
> cman expected votes 5
>
> Situation:
> I shut down network on 2 nodes, one of them is master.
>
> Observation:
> Fencing of one node (the master)... Quorum device Offline, Quorum disolved
> ! Services stopped.
> Fenced node reboot, cluster is quorate, 2nd offline node is fenced.
> Services restart.
> 2nd node offline reboot.
>
> My cluster is not quorate for 8 min (very long hardware boot :-) and my
> services were offline.
>
> Do you know how to prevent this situation?
>
> Regards,
>
> --
>   .`'`.   GouNiNi
>  :  ': :
>  `. ` .`  GNU/Linux
>    `'`    http://www.geekarea.fr
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
esta es mi vida e me la vivo hasta que dios quiera
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120730/dd9e30c4/attachment.htm>

From gounini.geekarea at gmail.com  Mon Jul 30 15:27:03 2012
From: gounini.geekarea at gmail.com (GouNiNi)
Date: Mon, 30 Jul 2012 17:27:03 +0200 (CEST)
Subject: [Linux-cluster] Quorum device brain the cluster when master
 lose network
In-Reply-To: <5016A3D2.5060605@alteeve.ca>
Message-ID: <361678790.2203.1343662022960.JavaMail.root@geekarea.fr>



----- Mail original -----
> De: "Digimer" <lists at alteeve.ca>
> ?: "linux clustering" <linux-cluster at redhat.com>
> Cc: "GouNiNi" <gounini.geekarea at gmail.com>
> Envoy?: Lundi 30 Juillet 2012 17:10:10
> Objet: Re: [Linux-cluster] Quorum device brain the cluster when master lose network
> 
> On 07/30/2012 10:43 AM, GouNiNi wrote:
> > Hello,
> >
> > I did some tests on 4 nodes cluster with quorum device and I find a
> > bad situation with one test, so I need your knowledges to correct
> > my configuration.
> >
> > Configuation:
> > 4 nodes, all vote for 1
> > quorum device vote for 1 (to hold services with minimum 2 nodes up)
> > cman expected votes 5
> >
> > Situation:
> > I shut down network on 2 nodes, one of them is master.
> >
> > Observation:
> > Fencing of one node (the master)... Quorum device Offline, Quorum
> > disolved ! Services stopped.
> > Fenced node reboot, cluster is quorate, 2nd offline node is fenced.
> > Services restart.
> > 2nd node offline reboot.
> >
> > My cluster is not quorate for 8 min (very long hardware boot :-)
> > and my services were offline.
> >
> > Do you know how to prevent this situation?
> >
> > Regards,
> 
> Please tell us the name and version of the cluster software you are
> using, Please also share your configuration file(s).
> 
> --
> Digimer
> Papers and Projects: https://alteeve.com
> 

Sorry, RHEL5.6 64bits

# rpm -q cman rgmanager
cman-2.0.115-68.el5
rgmanager-2.0.52-9.el5


<?xml version="1.0"?>
<cluster alias="cluname" config_version="144" name="cluname">
        <clusternodes>
                <clusternode name="node1" nodeid="1" votes="1">
                        <fence>
                                <method name="single">
                                        <device name="fenceIBM_307" port="12"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="node2" nodeid="2" votes="1">
                        <fence>
                                <method name="single">
                                        <device name="fenceIBM_307" port="11"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="node3" nodeid="3" votes="1">
                        <fence>
                                <method name="single">
                                        <device name="fenceIBM_308" port="6"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="node4" nodeid="4" votes="1">
                        <fence>
                                <method name="single">
                                        <device name="fenceIBM_308" port="7"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <fencedevices>
                <fencedevice agent="fence_bladecenter" ipaddr="XX.XX.XX.XX" login="xxxx" name="fenceIBM_307" passwd="yyyy"/>
                <fencedevice agent="fence_bladecenter" ipaddr="YY.YY.YY.YY" login="xxxx" name="fenceIBM_308" passwd="yyyy"/>
        </fencedevices>
        <rm log_level="7">
                <failoverdomains/>
                <resources/>
                <service ...>
                        <...>
                </service>
        </rm>
        <fence_daemon clean_start="0" post_fail_delay="15" post_join_delay="300"/>
        <cman expected_votes="5">
                <multicast addr="ZZ.ZZ.ZZ.ZZ"/>
        </cman>
        <quorumd interval="7" label="quorum" tko="12" votes="1"/>
</cluster>



From emi2fast at gmail.com  Mon Jul 30 15:35:39 2012
From: emi2fast at gmail.com (emmanuel segura)
Date: Mon, 30 Jul 2012 17:35:39 +0200
Subject: [Linux-cluster] Quorum device brain the cluster when master
	lose network
In-Reply-To: <361678790.2203.1343662022960.JavaMail.root@geekarea.fr>
References: <5016A3D2.5060605@alteeve.ca>
	<361678790.2203.1343662022960.JavaMail.root@geekarea.fr>
Message-ID: <CAE7pJ3Czn8c9kkFXfFG0haSL2-jsbSCN9xS642Pw0ptfd0H0Kw@mail.gmail.com>

can you send me the ouput from cman_tool status? when the cluster it's
running

2012/7/30 GouNiNi <gounini.geekarea at gmail.com>

>
>
> ----- Mail original -----
> > De: "Digimer" <lists at alteeve.ca>
> > ?: "linux clustering" <linux-cluster at redhat.com>
> > Cc: "GouNiNi" <gounini.geekarea at gmail.com>
> > Envoy?: Lundi 30 Juillet 2012 17:10:10
> > Objet: Re: [Linux-cluster] Quorum device brain the cluster when master
> lose network
> >
> > On 07/30/2012 10:43 AM, GouNiNi wrote:
> > > Hello,
> > >
> > > I did some tests on 4 nodes cluster with quorum device and I find a
> > > bad situation with one test, so I need your knowledges to correct
> > > my configuration.
> > >
> > > Configuation:
> > > 4 nodes, all vote for 1
> > > quorum device vote for 1 (to hold services with minimum 2 nodes up)
> > > cman expected votes 5
> > >
> > > Situation:
> > > I shut down network on 2 nodes, one of them is master.
> > >
> > > Observation:
> > > Fencing of one node (the master)... Quorum device Offline, Quorum
> > > disolved ! Services stopped.
> > > Fenced node reboot, cluster is quorate, 2nd offline node is fenced.
> > > Services restart.
> > > 2nd node offline reboot.
> > >
> > > My cluster is not quorate for 8 min (very long hardware boot :-)
> > > and my services were offline.
> > >
> > > Do you know how to prevent this situation?
> > >
> > > Regards,
> >
> > Please tell us the name and version of the cluster software you are
> > using, Please also share your configuration file(s).
> >
> > --
> > Digimer
> > Papers and Projects: https://alteeve.com
> >
>
> Sorry, RHEL5.6 64bits
>
> # rpm -q cman rgmanager
> cman-2.0.115-68.el5
> rgmanager-2.0.52-9.el5
>
>
> <?xml version="1.0"?>
> <cluster alias="cluname" config_version="144" name="cluname">
>         <clusternodes>
>                 <clusternode name="node1" nodeid="1" votes="1">
>                         <fence>
>                                 <method name="single">
>                                         <device name="fenceIBM_307"
> port="12"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>                 <clusternode name="node2" nodeid="2" votes="1">
>                         <fence>
>                                 <method name="single">
>                                         <device name="fenceIBM_307"
> port="11"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>                 <clusternode name="node3" nodeid="3" votes="1">
>                         <fence>
>                                 <method name="single">
>                                         <device name="fenceIBM_308"
> port="6"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>                 <clusternode name="node4" nodeid="4" votes="1">
>                         <fence>
>                                 <method name="single">
>                                         <device name="fenceIBM_308"
> port="7"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>         </clusternodes>
>         <fencedevices>
>                 <fencedevice agent="fence_bladecenter"
> ipaddr="XX.XX.XX.XX" login="xxxx" name="fenceIBM_307" passwd="yyyy"/>
>                 <fencedevice agent="fence_bladecenter"
> ipaddr="YY.YY.YY.YY" login="xxxx" name="fenceIBM_308" passwd="yyyy"/>
>         </fencedevices>
>         <rm log_level="7">
>                 <failoverdomains/>
>                 <resources/>
>                 <service ...>
>                         <...>
>                 </service>
>         </rm>
>         <fence_daemon clean_start="0" post_fail_delay="15"
> post_join_delay="300"/>
>         <cman expected_votes="5">
>                 <multicast addr="ZZ.ZZ.ZZ.ZZ"/>
>         </cman>
>         <quorumd interval="7" label="quorum" tko="12" votes="1"/>
> </cluster>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster




-- 
esta es mi vida e me la vivo hasta que dios quiera
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120730/dc4021e7/attachment.htm>

From heiko.nardmann at itechnical.de  Tue Jul 31 13:57:33 2012
From: heiko.nardmann at itechnical.de (Heiko Nardmann)
Date: Tue, 31 Jul 2012 15:57:33 +0200
Subject: [Linux-cluster] reasons for sporadic token loss?
Message-ID: <5017E44D.2010702@itechnical.de>

Hi together!

I am experiencing sporadic problems with my cluster setup. Maybe someone 
has an idea? But first some facts:

Type: RHEL 6.1 two node cluster (corosync 1.2.3-36) on two Dell R610 
each with a quad port NIC

NICs:
- interfaces em1/em2 are bonded using mode 5; these interfaces are cross 
connected (intended to be used for the cluster housekeeping 
communication) - no network element in between
- interfaces em3/em4 are bonded using mode 1; these interfaces are 
connected to two switches

Cluster configuration:

<?xml version="1.0"?>
<cluster config_version="51" name="my-cluster">
     <cman expected_votes="1" two_node="1"/>
     <clusternodes>
         <clusternode name="df1-clusterlink" nodeid="1">
             <fence>
                 <method name="VBoxManage-DF-1">
                     <device name="VBoxManage-DF-1" />
                 </method>
             </fence>
             <unfence>
             </unfence>
         </clusternode>
         <clusternode name="df2-clusterlink" nodeid="2">
             <fence>
                 <method name="VBoxManage-DF-2">
                     <device name="VBoxManage-DF-2" />
                 </method>

             </fence>
             <unfence>
             </unfence>
         </clusternode>
     </clusternodes>
     <fencedevices>
         <fencedevice name="VBoxManage-DF-1" agent="fence_vbox" 
vboxhost="vboxhost.private" login="test" vmname="RHEL 6.1 x86_64 
DF-System Server 1" />
         <fencedevice name="VBoxManage-DF-2" agent="fence_vbox" 
vboxhost="vboxhost.private" login="test" vmname="RHEL 6.1 x86_64 
DF-System Server 2" />
     </fencedevices>
     <rm>
         <resources>
             <ip address="10.200.104.15/27" monitor_link="on" 
sleeptime="10"/>
             <script file="/usr/share/cluster/app.sh" name="myapp"/>
         </resources>
         <failoverdomains>
             <failoverdomain name="fod-myapp" nofailback="0" ordered="1" 
restricted="0">
                 <failoverdomainnode name="df1-clusterlink" priority="1"/>
                 <failoverdomainnode name="df2-clusterlink" priority="2"/>
             </failoverdomain>
         </failoverdomains>
         <service domain="fod-myapp" exclusive="1" max_restarts="3" 
name="rg-myapp" recovery="restart" restart_expire_time="1">
             <script ref=myapp"/>
             <ip ref="10.200.104.15/27"/>
         </service>
     </rm>
     <logging debug="on"/>
     <gfs_controld enable_plock="0" plock_rate_limit="0"/>
     <dlm enable_plock="0" plock_ownership="1" plock_rate_limit="0"/>
</cluster>


--------------------------------------------------------------------------------

Problem:
Sometimes the second node "detects" that the token has been lost 
(corosync.log):

[no TOTEM messages before that]
Jul 28 13:00:10 corosync [TOTEM ] The token was lost in the OPERATIONAL 
state.
Jul 28 13:00:10 corosync [TOTEM ] A processor failed, forming new 
configuration.
Jul 28 13:00:10 corosync [TOTEM ] Receive multicast socket recv buffer 
size (262142 bytes).
Jul 28 13:00:10 corosync [TOTEM ] Transmit multicast socket send buffer 
size (262142 bytes).

This happens lets say once a week. This leads to fencing of the first 
node. What I see from 'corosync-objctl -a' is that this is maybe due to 
a consensus timeout (some excerpt from the commands output follows); I 
marked the lines which I so far consider as important:

totem.transport=udp
totem.version=2
totem.nodeid=2
totem.vsftype=none
totem.token=10000
totem.join=60
totem.fail_recv_const=2500
totem.consensus=2000
totem.rrp_mode=none
totem.secauth=1
totem.key=my-cluster
totem.interface.ringnumber=0
totem.interface.bindnetaddr=172.16.42.2
totem.interface.mcastaddr=239.192.187.168
totem.interface.mcastport=5405
runtime.totem.pg.mrp.srp.orf_token_tx=3
runtime.totem.pg.mrp.srp.orf_token_rx=1103226
runtime.totem.pg.mrp.srp.memb_merge_detect_tx=395
runtime.totem.pg.mrp.srp.memb_merge_detect_rx=1098359
runtime.totem.pg.mrp.srp.memb_join_tx=38
runtime.totem.pg.mrp.srp.memb_join_rx=50
runtime.totem.pg.mrp.srp.mcast_tx=218
runtime.totem.pg.mrp.srp.mcast_retx=0
runtime.totem.pg.mrp.srp.mcast_rx=541
runtime.totem.pg.mrp.srp.memb_commit_token_tx=12
runtime.totem.pg.mrp.srp.memb_commit_token_rx=18
runtime.totem.pg.mrp.srp.token_hold_cancel_tx=49
runtime.totem.pg.mrp.srp.token_hold_cancel_rx=173
runtime.totem.pg.mrp.srp.operational_entered=6
runtime.totem.pg.mrp.srp.operational_token_lost=1
^^^
runtime.totem.pg.mrp.srp.gather_entered=7
runtime.totem.pg.mrp.srp.gather_token_lost=0
runtime.totem.pg.mrp.srp.commit_entered=6
runtime.totem.pg.mrp.srp.commit_token_lost=0
runtime.totem.pg.mrp.srp.recovery_entered=6
runtime.totem.pg.mrp.srp.recovery_token_lost=0
runtime.totem.pg.mrp.srp.consensus_timeouts=1
^^^
runtime.totem.pg.mrp.srp.mtt_rx_token=1727
runtime.totem.pg.mrp.srp.avg_token_workload=62244458
runtime.totem.pg.mrp.srp.avg_backlog_calc=0
runtime.totem.pg.mrp.srp.rx_msg_dropped=0
runtime.totem.pg.mrp.srp.members.2.ip=r(0) ip(172.16.42.2)
runtime.totem.pg.mrp.srp.members.2.join_count=1
runtime.totem.pg.mrp.srp.members.2.status=joined
runtime.totem.pg.mrp.srp.members.1.ip=r(0) ip(172.16.42.1)
runtime.totem.pg.mrp.srp.members.1.join_count=3
runtime.totem.pg.mrp.srp.members.1.status=joined
runtime.blackbox.dump_flight_data=no
runtime.blackbox.dump_state=no

Some questions at this point:
A) why did the cluster lose the token? due to timeout? token (10000) or 
consensus (2000)?
B) why is the timeout ellapsed? maybe that is connected with the answer 
to A ... ?
C) is it normal that 'token=10000' and 'consensus=2000' although normal 
documentation says that default is 'token=1000' and 'consensus=1.2*token'?
D) since I suspect problems concerning the switches connecting the other 
interfaces (em3/em4 bonded to bond0) of those machines I wonder whether 
any traffic goes that way and not via bond1?

As I already stated: the connection of em3/em4 is a direct one without 
any network element.

So far I want to add the following line to cluster.conf and see whether 
the situation improves:

     <totem token_retransmits_before_loss_const="10" 
fail_recv_const="100" consensus="12000"/>

Any comment concerning that?

While googling for reasons I have seen that it is also a problem if both 
nodes are not synchronized concerning time; but in my case the ntpd on 
both nodes uses two stratum 2 NTP servers. I also cannot detect anything 
unsual like e.g. a jump of multiple seconds inside the log files 
although I have to admit that so far the ntpd does not run with debug 
enabled.


Thanks in advance for any hint or comment!


Kind regards,

     Heiko



From piotr.pietrzak at hp.com  Tue Jul 31 14:55:08 2012
From: piotr.pietrzak at hp.com (Pietrzak, Piotr (CMS rtBSS))
Date: Tue, 31 Jul 2012 14:55:08 +0000
Subject: [Linux-cluster] Named pipes not working on GFS2 in Redhat 5.x
Message-ID: <3ABFB3D87EB6904F9F6FB45C90E24F4D07FC6C@G1W3650.americas.hpqcorp.net>

Hello,

I have got a problem with named pipes (fifo files)on GFS2 shared filesystem. I can't use them on single machine (mean write and read on the single host in cluster).

It works well on GFS and local FS like ext3. I have tried to upgrade kernel version to 5.8 as it seems the problem has been addressed but the issue remain as it was before.

The bug registered is under id 450276, can anybody knows where the compiled resolution can be found

Regards
Piotr


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120731/a0589a67/attachment.htm>

From rpeterso at redhat.com  Tue Jul 31 15:37:14 2012
From: rpeterso at redhat.com (Bob Peterson)
Date: Tue, 31 Jul 2012 11:37:14 -0400 (EDT)
Subject: [Linux-cluster] Named pipes not working on GFS2 in Redhat 5.x
In-Reply-To: <3ABFB3D87EB6904F9F6FB45C90E24F4D07FC6C@G1W3650.americas.hpqcorp.net>
Message-ID: <616321622.6879120.1343749034989.JavaMail.root@redhat.com>

----- Original Message -----
| Hello,
| 
| I have got a problem with named pipes (fifo files)on GFS2 shared
| filesystem. I can't use them on single machine (mean write and read
| on the single host in cluster).
| 
| It works well on GFS and local FS like ext3. I have tried to upgrade
| kernel version to 5.8 as it seems the problem has been addressed but
| the issue remain as it was before.
| 
| The bug registered is under id 450276, can anybody knows where the
| compiled resolution can be found
| 
| Regards
| Piotr

Hi Piotr,

I'm not entirely sure what you mean, so I have some questions.

(1) You say "single machine" and "in a cluster" but this is contradictory.
I think of "single machine" as outside of a cluster. So does your failure
occur on a single machine or in a cluster? Or do you mean that the
problem occurs on one node of a cluster but not the other node?

(2) Does your program fail on the 5.8 kernel or on a different kernel?
If so, which kernel does it fail on?

(3) If you are a Red Hat customer, you should open a support case with
Red Hat. If you're not a Red Hat customer, you can open a new bugzilla 
record for the problem. But please include the program that works on
other file systems but not on GFS2. (Hopefully source code).

I verified that the patch for bz #450276 exists in the RHEL5.8 kernel.
This is a very old bz and the fix has been in many kernels now.

Regards,

Bob Peterson
Red Hat File Systems