From corey.kovacs at gmail.com Sun May 1 09:32:39 2011 From: corey.kovacs at gmail.com (Corey Kovacs) Date: Sun, 1 May 2011 10:32:39 +0100 Subject: [Linux-cluster] How do you HA your storage? In-Reply-To: <4DBC401A.6000902@bulbous.org> References: <1304154086.10889.1446718041@webmail.messagingengine.com> <4DBBD19C.1000606@bulbous.org> <4DBBDE59.9000107@bulbous.org> <4DBC401A.6000902@bulbous.org> Message-ID: You could probably do what you want using san level mirroring across two sans and device-mapper-multipath. I believe the sans will automatically put the alternate san copies into read/write if it cannot communicate with the first if it's configured to do so but I don't have access to that capability on my EVA for lack of license. Actually, it woulnd't require a whole other san but if I was mirroring things, that's what I'd opt for. This is the kind of problem that DMP was designed to handle. If you are booting from the san, you may have some other tweeks but in general I think DMP is still the way to go. Good luck -C On Sat, Apr 30, 2011 at 6:00 PM, urgrue wrote: > On 30/4/11 14:27, Corey Kovacs wrote: > > This has nothing to do with any network. It's all over the fiber... > > True, my bad, I was thinking of DRBD. > >> Points in time? It's a raid 1, it's relatively instant. It's more >> complex to manage a failover in the way you describe if anything. > > I didn't mean that. What I meant is with any enterprise storage filer I can > walk in and take a point in time snapshot of my entire datacenter - all > hundreds of servers - with almost no effort. And restore it. That's a pretty > fantastic thing to be able to do before, say, a major upgrade on hundreds of > servers. And you manage all of it in one place. Take a situation like if the > company decides it needs a third copy of the data. It'd be a fun job to map > and configure the third LUN on 500 servers, when on the SAN it'd be a a few > minutes to configure. Or if that third copy needs to be async instead, I > don't even think you can do that with LVM or software raid. > Host-based mirroring is great for many situations, but when it comes to > larger environments, I think most companies tend to prefer SAN mirroring. > >> Well, my $0.02 anyway. >> >> -C >> >> On Sat, Apr 30, 2011 at 11:03 AM, urgrue ?wrote: >>> >>> Yes, these work, but then I'm having each server handle the job of >>> mirroring >>> their own disks, which has some disadvantages. Network usage instead of >>> fiber, more complex management of points-in-time compared to a nice big >>> fat >>> centralized SAN, etc. In my experience most companies favor SAN-level >>> replication. >>> The challenge is just getting Linux to recover gracefully when the SAN >>> fails >>> over. Worst case you can just reboot, but, that's not very HA. >>> >>> >>> On 30/4/11 13:23, Corey Kovacs wrote: >>>> >>>> What you seem to be describing is the mirror target for device mapper. >>>> >>>> Another alternative would be to setup a software raid using multipath'd >>>> luns. >>>> >>>> SANVOL1 ? ? ? ? ? ?SANVOL2 >>>> ? ?| ? ? ? ? ? ? ? ? ? ? ? ? ? | >>>> ? ?\ ? ? ? ? ? ? ? ? ? ? ? ? ?/ >>>> ? ? \ ? ? ? ? ? ? ? ? ? ? ? / >>>> ? ? ? \ ? ? ? ? ? ? ? ? ? / >>>> ? ? MPATH1 ? ?MPATH2 >>>> ? ? ? ? ?\ ? ? ? ? ? ? / >>>> ? ? ? ?RAID 1 DEV >>>> ? ? ? ? ? ? ? ?| >>>> ? ? ? ? ? ? ?PV >>>> ? ? ? ? ? ? ? ?| >>>> ? ? ? ? ? ? ? VG >>>> ? ? ? ? ? ? ? ?| >>>> ? ? ? ? ? ? ? LV >>>> >>>> That might work >>>> >>>> -C >>>> >>>> >>>> On Sat, Apr 30, 2011 at 10:08 AM, urgrue ? 
?wrote: >>>>> >>>>> But, how do you get dm-multipath to consider two different LUNs to be >>>>> in >>>>> fact two paths to the same device? >>>>> I mean, normally multipath has two paths to one device. >>>>> When we're talking about san-level mirroring, we've got two paths to >>>>> two >>>>> different devices (which just happen to contain identical data). >>>>> >>>>> On 30/4/11 11:47, Kit Gerrits wrote: >>>>>> >>>>>> With dual-controller arrays, dm-multipath ?keeps checking if the >>>>>> current >>>>>> device is still responding and switches to a different path if it is >>>>>> not. >>>>>> (for examply, by reading sector 0) >>>>>> >>>>>> With SAN failover, you may need to tell the secondary SAN LUN to go >>>>>> into >>>>>> read-write mode. >>>>>> Unfortunately, I am not familiar with tying this into RHEL. >>>>>> (also, sector 0 will already be readable on the secundary LUN, but not >>>>>> writable) >>>>>> >>>>>> Maybe there is a write test, which tries to write to both SANs >>>>>> The one which allows write access will become the active LUN. >>>>>> >>>>>> If you can switch your SANs inside 30 seconds, you might even be able >>>>>> to >>>>>> salvage/execute pending write operations. >>>>>> >>>>>> >>>>>> Regards, >>>>>> >>>>>> Kit >>>>>> >>>>>> -----Original Message----- >>>>>> From: linux-cluster-bounces at redhat.com >>>>>> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of urgrue >>>>>> Sent: zaterdag 30 april 2011 11:01 >>>>>> To: linux-cluster at redhat.com >>>>>> Subject: [Linux-cluster] How do you HA your storage? >>>>>> >>>>>> I'm struggling to find the best way to deal with SAN failover. >>>>>> By this I mean the common scenario where you have SAN-based mirroring. >>>>>> It's pretty easy with host-based mirroring (md, DRBD, LVM, etc) but >>>>>> how >>>>>> can >>>>>> you minimize the impact and manual effort to recover from losing a >>>>>> LUN, >>>>>> and >>>>>> needing to somehow get your system to realize the data is now on a >>>>>> different >>>>>> LUN (the now-active mirror)? >>>>>> -- >>>>>> Linux-cluster mailing list >>>>>> Linux-cluster at redhat.com >>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>> >>>>>> -- >>>>>> Linux-cluster mailing list >>>>>> Linux-cluster at redhat.com >>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>> >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From Chris.Jankowski at hp.com Sun May 1 11:22:03 2011 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Sun, 1 May 2011 11:22:03 +0000 Subject: [Linux-cluster] How do you HA your storage? 
In-Reply-To: <4DBC3BED.6080104@bulbous.org> References: <1304154086.10889.1446718041@webmail.messagingengine.com> <036B68E61A28CA49AC2767596576CD596F6579D294@GVW1113EXC.americas.hpqcorp.net> <4DBC3BED.6080104@bulbous.org> Message-ID: <036B68E61A28CA49AC2767596576CD596F6579D2AB@GVW1113EXC.americas.hpqcorp.net> I think you might not appreciate subtleties involved in making the decision of failover and concurrency issues between different LUNs. Be that as it may, I'd suggest to close this discussion at this point, as it has nothing to do with the Linux cluster and everything to do with general BC and DR in multi-site environment. There are specialized forums for this. Regards, Chris -----Original Message----- From: urgrue [mailto:urgrue at bulbous.org] Sent: Sunday, 1 May 2011 02:42 To: linux clustering Cc: Jankowski, Chris Subject: Re: [Linux-cluster] How do you HA your storage? I do have RAID, multipath over multiple fabrics, etc. But what you're not at all protected from is major SAN failure, or a datacenter outage, for example. Which happens, and if you've got more than a few datacenters and dozens of SAN filers, you know they happen actually way too often for you to not miss a graceful, predictable recovery procedure. So like everyone else, you've got cluster nodes in each datacenter, and all of them connected to the same SAN. Everything will recover quite nicely from just about every type of failure - except failure of the SAN itself. Your cluster nodes in your backup datacenter will not be happy to see the disks disappear. You can activate your backup filer(s) in seconds - all your hundreds of passive nodes actually do now have functioning copies of the data and could/should be able to get back to work - but getting all of them to actually realize it and get back to work, can be hours of messy manual work. I wouldn't think it'd be very difficult to handle this gracefully, all the basic functionality is already there in multipath and LVM. I think it would be a pretty big deal in the enterprise world to be able to transparently switch SANs like this. As far as I know only z/os can do this currently and even then it's built around a very specific, complicated and expensive storage configuration. And there's a whole industry around "san virtualization" just because of this kind of sitautions, that would become obsolete overnight if the OS itself could handle it natively. On 30/4/11 16:29, Jankowski, Chris wrote: > I am just wondering, why would you like to do it this way? > > If you have SAN then by implication you have a storage array on the SAN. This storage array will normally have capability to give you highly available storage through RAID{1,5,6}. Moreover, any decent array will also provide redundancy in case of a failure of one of is controllers. Then standard dual fabric FC SAN configuration will give you multiple paths to the controllers of the array - normally at least 4 paths. What remains to be done on the servers is to configure device mapper multipath to fit your SAN configuration and capabilities of the array. Most modern arrays these days are active-active and support ALUA extensions. > > Nothing specifically needs to be done in the cluster software. This works the same way as for a single host. > > Are you trying to build a stretched cluster across multiple sites with a SAN array in each? 
> > Regards, > > Chris Jankowski > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of urgrue > Sent: Saturday, 30 April 2011 19:01 > To: linux-cluster at redhat.com > Subject: [Linux-cluster] How do you HA your storage? > > I'm struggling to find the best way to deal with SAN failover. > By this I mean the common scenario where you have SAN-based mirroring. > It's pretty easy with host-based mirroring (md, DRBD, LVM, etc) but how > can you minimize the impact and manual effort to recover from losing a > LUN, and needing to somehow get your system to realize the data is now > on a different LUN (the now-active mirror)? > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From ercan.karadeniz at vodafone.com Sun May 1 21:57:23 2011 From: ercan.karadeniz at vodafone.com (Karadeniz, Ercan, VF-Group) Date: Sun, 1 May 2011 23:57:23 +0200 Subject: [Linux-cluster] GFS2 daemon hangs during boot process Message-ID: <84220C5308E5B146BFA374E47437FB550674AEAE@VF-MBX12.internal.vodafone.com> Hi Linux-Cluster-List-Members, I'm a newbie in RHCS. I have visited recently the RH436 training. Currently I'm trying to get more experience by doing some hands-on on the course labs. My setup is as follows: * Physical Server where Dom-0 is running * 2 x xen virtual machines * 2 nodes cluster (via Conga) The two node cluster is setup by using Conga. The node1 and node2 are xen virtual machines. Everything worked so far. For fencing I'm using fence_xvmd. That is also working without any problems. To test the multicasting with different address (than the default one), I have done some changes on the multicast address and rebooted both nodes. Apparently when I start node1 or node2 (xm console node1 -c ) they hang during boot process on the "GFS2 daemon". I have tried to login via using the "Single" mode as boot parameter regrettably this didn't help. My question is how can I overcome this deadlock situation. I need somehow to boot both nodes and change my recent changes related to the multicasting address in the cluster.conf file. However I cannot login to any of the nodes? Furthermore is there a change within the xen virtual machine to get in to the interactive boot mode? It would be great if you can give me some hint here. Many thanks in advance! Warm regards from Düsseldorf/Germany Ercan Karadeniz -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux at alteeve.com Sun May 1 22:09:19 2011 From: linux at alteeve.com (Digimer) Date: Sun, 01 May 2011 18:09:19 -0400 Subject: [Linux-cluster] GFS2 daemon hangs during boot process In-Reply-To: <84220C5308E5B146BFA374E47437FB550674AEAE@VF-MBX12.internal.vodafone.com> References: <84220C5308E5B146BFA374E47437FB550674AEAE@VF-MBX12.internal.vodafone.com> Message-ID: <4DBDDA0F.4070509@alteeve.com> On 05/01/2011 05:57 PM, Karadeniz, Ercan, VF-Group wrote: > Hi Linux-Cluster-List-Members, > > I'm a newbie in RHCS. I have visited recently the RH436 training. > Currently I'm trying to get more experience by doing some hands-on on > the course labs. > > My setup is as follows: > > * Physical Server where Dom-0 is running > * 2 x xen virtual machines > * 2 nodes cluster (via Conga) > > The two node cluster is setup by using Conga. The node1 and node2 are > xen virtual machines.
Everything worked so far. For fencing I'm using > fence_xvmd. That is also working without any problems. > > To test the multicasting with different address (than the default one), > I have done some changes on the multicast address and rebooted both > nodes. Apparently when I start node1 or node2 (xm console node1 -c ) > they hang during boot process on the "GFS2 daemon". > > I have tried to login via using the "Single" mode as boot parameter > regrettably this didn't help. > > My question is how can I overcome this deadlock situation. I need > somehow to boot both nodes and change my recent changes related to the > multicasting address in the cluster.conf file. However I cannot login to > any of the nodes? Furthermore is there a change within the xen virtual > machine to get in to the interactive boot mode? > > It would be great if you can give me some hint here. > > Many thanks in advance! You could try booting the VM using the RHEL5 ISO as the first boot device, then boot into rescue mode. This should allow you to mount the system and edit you /etc/fstab and/or /etc/cluster/cluster.conf. As a side note, for testing, I like to 'chkconfig cman off'. This way, I know that even if I totally screw up, I'll always be able to boot into the host OS. :) Further, I'd set the GFS2 entry in fstab to not use 'defaults', but instead to use 'rw,suid,dev,exec,nouser,async'. This excludes the 'auto' option, so that a failure to mount the GFS2 partition won't cause dom0 to drop to single-user mode. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From mgrac at redhat.com Mon May 2 07:33:58 2011 From: mgrac at redhat.com (Marek Grac) Date: Mon, 02 May 2011 09:33:58 +0200 Subject: [Linux-cluster] Plugged out blade from bladecenter chassis - fence_bladecenter failed In-Reply-To: References: <4DBA71EA.9070303@redhat.com> Message-ID: <4DBE5E66.80802@redhat.com> Hi, On 04/29/2011 10:15 AM, Parvez Shaikh wrote: > Hi Marek, > > Can we give this option in cluster.conf file for bladecenter fencing > device or method for cluster.conf you should add ... missing_as_off="1" ... to fence configuration > > For IPMI, fencing is there similar option? > There is no such method for IPMI. m, From ercan.karadeniz at vodafone.com Mon May 2 11:35:27 2011 From: ercan.karadeniz at vodafone.com (Karadeniz, Ercan, VF-Group) Date: Mon, 2 May 2011 13:35:27 +0200 Subject: [Linux-cluster] GFS2 daemon hangs during boot process In-Reply-To: <4DBDDA0F.4070509@alteeve.com> References: <84220C5308E5B146BFA374E47437FB550674AEAE@VF-MBX12.internal.vodafone.com> <4DBDDA0F.4070509@alteeve.com> Message-ID: <84220C5308E5B146BFA374E47437FB550674AFD3@VF-MBX12.internal.vodafone.com> Many thanks for the hints. Regards, Ercan -----Original Message----- From: Digimer [mailto:linux at alteeve.com] Sent: Montag, 2. Mai 2011 00:09 To: linux clustering Cc: Karadeniz, Ercan, VF-Group Subject: Re: [Linux-cluster] GFS2 daemon hangs during boot process On 05/01/2011 05:57 PM, Karadeniz, Ercan, VF-Group wrote: > Hi Linux-Cluster-List-Members, > > I'm a newbie in RHCS. I have visited recently the RH436 training. > Currently I'm trying to get more experience by doing some hands-on on > the course labs. > > My setup is as follows: > > * Physical Server where Dom-0 is running > * 2 x xen virtual machines > * 2 nodes cluster (via Conga) > > The two node cluster is setup by using Conga. The node1 and node2 are > xen virtual machines.
That is also working without any problems. > > To test the multicasting with different address (than the default > one), I have done some changes on the multicast address and rebooted > both nodes. Apparently when I start node1 or node2 (xm console node1 > -c ) they hang during boot process on the "GFS2 daemon". > > I have tried to login via using the "Single" mode as boot parameter > regrettably this didn't help. > > My question is how can I overcome this deadlock situation. I need > somehow to boot both nodes and change my recent changes related to the > multicasting address in the cluster.conf file. However I cannot login > to any of the nodes? Furthermore is there a change within the xen > virtual machine to get in to the interactive boot mode? > > It would be great if you can give me some hint here. > > Many thanks in advance! You could try booting the VM using the RHEL5 ISO as the first boot device, then boot into rescue mode. This should allow you to mount the system and edit you /etc/fstab and/or /etc/cluster/cluster.conf. As a side note, for testing, I like to 'chkconfig cman off'. This way, I know that even if I totally screw up, I'll always be able to boot into the host OS. :) Further, I'd set the GFS2 entry in fstab to not use 'defaults', but instead to use 'rw,suid,dev,exec,nouser,async'. This excludes the 'auto' option, so that a failure to mount the GFS2 partition won't cause dom0 to drop to single-user mode. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From parvez.h.shaikh at gmail.com Mon May 2 13:19:17 2011 From: parvez.h.shaikh at gmail.com (Parvez Shaikh) Date: Mon, 2 May 2011 18:49:17 +0530 Subject: [Linux-cluster] Plugged out blade from bladecenter chassis - fence_bladecenter failed In-Reply-To: <4DBE5E66.80802@redhat.com> References: <4DBA71EA.9070303@redhat.com> <4DBE5E66.80802@redhat.com> Message-ID: Hi Marek, I tried the option missing_as_off="1" and now I get an another error - fenced[18433]: fence "node5.sscdomain" failed fenced[18433]: fencing node "node5.sscdomain" Sniplet of cluster.conf file is - .... .... Did I miss something? Thanks Parvez On Mon, May 2, 2011 at 1:03 PM, Marek Grac wrote: > Hi, > > > On 04/29/2011 10:15 AM, Parvez Shaikh wrote: > >> Hi Marek, >> >> Can we give this option in cluster.conf file for bladecenter fencing >> device or method >> > > for cluster.conf you should add ... missing_as_off="1" ... to fence > configuration > > > >> For IPMI, fencing is there similar option? >> >> > There is no such method for IPMI. > > > m, > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux at alteeve.com Wed May 4 15:37:27 2011 From: linux at alteeve.com (Digimer) Date: Wed, 04 May 2011 11:37:27 -0400 Subject: [Linux-cluster] GFS2 partition grew with 'gfs2_grow -T' Message-ID: <4DC172B7.3040206@alteeve.com> This is a little concerning... Can someone confirm that I didn't screw up before I lodge a bug? 
[root at xenmaster003 ~]# rpm -q cman gfs2-utils cman-2.0.115-68.el5_6.3 gfs2-utils-0.1.62-28.el5 [root at xenmaster003 ~]# lvextend -L +50G /dev/drbd_sh1_vg0/cluster_files /dev/drbd3 Extending logical volume cluster_files to 250.00 GB Logical volume cluster_files successfully resized [root at xenmaster003 ~]# gfs2_grow -T /dev/drbd_sh1_vg0/cluster_files /cluster_files/ (Test mode--File system will not be changed) FS: Mount Point: /cluster_files FS: Device: /dev/mapper/drbd_sh1_vg0-cluster_files FS: Size: 52428798 (0x31ffffe) FS: RG size: 65535 (0xffff) DEV: Size: 65536000 (0x3e80000) The file system grew by 51200MB. FS: Mount Point: /cluster_files FS: Device: /dev/mapper/drbd_sh1_vg0-cluster_files FS: Size: 52428798 (0x31ffffe) FS: RG size: 65535 (0xffff) DEV: Size: 65536000 (0x3e80000) The file system grew by 51200MB. gfs2_grow complete. [root at xenmaster003 ~]# gfs2_grow /dev/drbd_sh1_vg0/cluster_files /cluster_files/ FS: Mount Point: /cluster_files FS: Device: /dev/mapper/drbd_sh1_vg0-cluster_files FS: Size: 52428798 (0x31ffffe) FS: RG size: 65535 (0xffff) DEV: Size: 65536000 (0x3e80000) The file system grew by 51200MB. Error: The device has grown by less than one Resource Group (RG). The device grew by 0MB. One RG is 255MB for this file system. gfs2_grow complete. [root at xenmaster003 ~]# df -h Filesystem Size Used Avail Use% Mounted on /dev/md2 57G 2.7G 51G 6% / /dev/md0 251M 52M 187M 22% /boot tmpfs 7.7G 0 7.7G 0% /dev/shm /dev/mapper/drbd_sh0_vg0-xen_shared 56G 259M 56G 1% /xen_shared /dev/mapper/drbd_sh1_vg0-cluster_files 250G 145G 106G 58% /cluster_files -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From rpeterso at redhat.com Wed May 4 15:54:30 2011 From: rpeterso at redhat.com (Bob Peterson) Date: Wed, 4 May 2011 11:54:30 -0400 (EDT) Subject: [Linux-cluster] GFS2 partition grew with 'gfs2_grow -T' In-Reply-To: <4DC172B7.3040206@alteeve.com> Message-ID: <345920098.235604.1304524470146.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> ----- Original Message ----- | This is a little concerning... Can someone confirm that I didn't screw | up before I lodge a bug? | | [root at xenmaster003 ~]# rpm -q cman gfs2-utils | cman-2.0.115-68.el5_6.3 | gfs2-utils-0.1.62-28.el5 | | [root at xenmaster003 ~]# lvextend -L +50G | /dev/drbd_sh1_vg0/cluster_files | /dev/drbd3 | Extending logical volume cluster_files to 250.00 GB | Logical volume cluster_files successfully resized | [root at xenmaster003 ~]# gfs2_grow -T /dev/drbd_sh1_vg0/cluster_files | /cluster_files/ | (Test mode--File system will not be changed) | FS: Mount Point: /cluster_files | FS: Device: /dev/mapper/drbd_sh1_vg0-cluster_files | FS: Size: 52428798 (0x31ffffe) | FS: RG size: 65535 (0xffff) | DEV: Size: 65536000 (0x3e80000) | The file system grew by 51200MB. | FS: Mount Point: /cluster_files | FS: Device: /dev/mapper/drbd_sh1_vg0-cluster_files | FS: Size: 52428798 (0x31ffffe) | FS: RG size: 65535 (0xffff) | DEV: Size: 65536000 (0x3e80000) | The file system grew by 51200MB. | gfs2_grow complete. | | [root at xenmaster003 ~]# gfs2_grow /dev/drbd_sh1_vg0/cluster_files | /cluster_files/ | FS: Mount Point: /cluster_files | FS: Device: /dev/mapper/drbd_sh1_vg0-cluster_files | FS: Size: 52428798 (0x31ffffe) | FS: RG size: 65535 (0xffff) | DEV: Size: 65536000 (0x3e80000) | The file system grew by 51200MB. | Error: The device has grown by less than one Resource Group (RG). | The device grew by 0MB. One RG is 255MB for this file system. | gfs2_grow complete. 
| | [root at xenmaster003 ~]# df -h | Filesystem Size Used Avail Use% Mounted on | /dev/md2 57G 2.7G 51G 6% / | /dev/md0 251M 52M 187M 22% /boot | tmpfs 7.7G 0 7.7G 0% /dev/shm | /dev/mapper/drbd_sh0_vg0-xen_shared | 56G 259M 56G 1% /xen_shared | /dev/mapper/drbd_sh1_vg0-cluster_files | 250G 145G 106G 58% /cluster_files | | -- | Digimer | E-Mail: digimer at alteeve.com | AN!Whitepapers: http://alteeve.com | Node Assassin: http://nodeassassin.org Hi, Hm...This sounds like a bug to me. I'd open the bug record. Regards, Bob Peterson Red Hat File Systems From linux at alteeve.com Wed May 4 15:59:37 2011 From: linux at alteeve.com (Digimer) Date: Wed, 04 May 2011 11:59:37 -0400 Subject: [Linux-cluster] GFS2 partition grew with 'gfs2_grow -T' In-Reply-To: <345920098.235604.1304524470146.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> References: <345920098.235604.1304524470146.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: <4DC177E9.3060107@alteeve.com> On 05/04/2011 11:54 AM, Bob Peterson wrote: > ----- Original Message ----- > | This is a little concerning... Can someone confirm that I didn't screw > | up before I lodge a bug? > | > | [root at xenmaster003 ~]# rpm -q cman gfs2-utils > | cman-2.0.115-68.el5_6.3 > | gfs2-utils-0.1.62-28.el5 > | > | [root at xenmaster003 ~]# lvextend -L +50G > | /dev/drbd_sh1_vg0/cluster_files > | /dev/drbd3 > | Extending logical volume cluster_files to 250.00 GB > | Logical volume cluster_files successfully resized > | [root at xenmaster003 ~]# gfs2_grow -T /dev/drbd_sh1_vg0/cluster_files > | /cluster_files/ > | (Test mode--File system will not be changed) > | FS: Mount Point: /cluster_files > | FS: Device: /dev/mapper/drbd_sh1_vg0-cluster_files > | FS: Size: 52428798 (0x31ffffe) > | FS: RG size: 65535 (0xffff) > | DEV: Size: 65536000 (0x3e80000) > | The file system grew by 51200MB. > | FS: Mount Point: /cluster_files > | FS: Device: /dev/mapper/drbd_sh1_vg0-cluster_files > | FS: Size: 52428798 (0x31ffffe) > | FS: RG size: 65535 (0xffff) > | DEV: Size: 65536000 (0x3e80000) > | The file system grew by 51200MB. > | gfs2_grow complete. > | > | [root at xenmaster003 ~]# gfs2_grow /dev/drbd_sh1_vg0/cluster_files > | /cluster_files/ > | FS: Mount Point: /cluster_files > | FS: Device: /dev/mapper/drbd_sh1_vg0-cluster_files > | FS: Size: 52428798 (0x31ffffe) > | FS: RG size: 65535 (0xffff) > | DEV: Size: 65536000 (0x3e80000) > | The file system grew by 51200MB. > | Error: The device has grown by less than one Resource Group (RG). > | The device grew by 0MB. One RG is 255MB for this file system. > | gfs2_grow complete. > | > | [root at xenmaster003 ~]# df -h > | Filesystem Size Used Avail Use% Mounted on > | /dev/md2 57G 2.7G 51G 6% / > | /dev/md0 251M 52M 187M 22% /boot > | tmpfs 7.7G 0 7.7G 0% /dev/shm > | /dev/mapper/drbd_sh0_vg0-xen_shared > | 56G 259M 56G 1% /xen_shared > | /dev/mapper/drbd_sh1_vg0-cluster_files > | 250G 145G 106G 58% /cluster_files > | > | -- > | Digimer > | E-Mail: digimer at alteeve.com > | AN!Whitepapers: http://alteeve.com > | Node Assassin: http://nodeassassin.org > > Hi, > > Hm...This sounds like a bug to me. I'd open the bug record. > > Regards, > > Bob Peterson > Red Hat File Systems Will do, thanks for the prompt reply. 
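For reference, a quick way to cross-check whether a grow actually took effect is to compare the size of the underlying block device with the size the mounted filesystem itself reports. A minimal sketch, assuming the stock RHEL 5 util-linux and gfs2-utils tools and reusing the device and mount point names from the output above:

# Size of the logical volume, in bytes
blockdev --getsize64 /dev/mapper/drbd_sh1_vg0-cluster_files

# Block counts as seen by the mounted GFS2 filesystem
gfs2_tool df /cluster_files

If the device reports noticeably more space than the filesystem, the grow has not really been applied yet, and gfs2_grow (without -T) can be re-run and the two sizes compared again.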
-- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From linux at alteeve.com Wed May 4 16:08:39 2011 From: linux at alteeve.com (Digimer) Date: Wed, 04 May 2011 12:08:39 -0400 Subject: [Linux-cluster] GFS2 partition grew with 'gfs2_grow -T' In-Reply-To: <345920098.235604.1304524470146.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> References: <345920098.235604.1304524470146.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: <4DC17A07.6070004@alteeve.com> On 05/04/2011 11:54 AM, Bob Peterson wrote: > ----- Original Message ----- > | This is a little concerning... Can someone confirm that I didn't screw > | up before I lodge a bug? > | > | [root at xenmaster003 ~]# rpm -q cman gfs2-utils > | cman-2.0.115-68.el5_6.3 > | gfs2-utils-0.1.62-28.el5 > | > | [root at xenmaster003 ~]# lvextend -L +50G > | /dev/drbd_sh1_vg0/cluster_files > | /dev/drbd3 > | Extending logical volume cluster_files to 250.00 GB > | Logical volume cluster_files successfully resized > | [root at xenmaster003 ~]# gfs2_grow -T /dev/drbd_sh1_vg0/cluster_files > | /cluster_files/ > | (Test mode--File system will not be changed) > | FS: Mount Point: /cluster_files > | FS: Device: /dev/mapper/drbd_sh1_vg0-cluster_files > | FS: Size: 52428798 (0x31ffffe) > | FS: RG size: 65535 (0xffff) > | DEV: Size: 65536000 (0x3e80000) > | The file system grew by 51200MB. > | FS: Mount Point: /cluster_files > | FS: Device: /dev/mapper/drbd_sh1_vg0-cluster_files > | FS: Size: 52428798 (0x31ffffe) > | FS: RG size: 65535 (0xffff) > | DEV: Size: 65536000 (0x3e80000) > | The file system grew by 51200MB. > | gfs2_grow complete. > | > | [root at xenmaster003 ~]# gfs2_grow /dev/drbd_sh1_vg0/cluster_files > | /cluster_files/ > | FS: Mount Point: /cluster_files > | FS: Device: /dev/mapper/drbd_sh1_vg0-cluster_files > | FS: Size: 52428798 (0x31ffffe) > | FS: RG size: 65535 (0xffff) > | DEV: Size: 65536000 (0x3e80000) > | The file system grew by 51200MB. > | Error: The device has grown by less than one Resource Group (RG). > | The device grew by 0MB. One RG is 255MB for this file system. > | gfs2_grow complete. > | > | [root at xenmaster003 ~]# df -h > | Filesystem Size Used Avail Use% Mounted on > | /dev/md2 57G 2.7G 51G 6% / > | /dev/md0 251M 52M 187M 22% /boot > | tmpfs 7.7G 0 7.7G 0% /dev/shm > | /dev/mapper/drbd_sh0_vg0-xen_shared > | 56G 259M 56G 1% /xen_shared > | /dev/mapper/drbd_sh1_vg0-cluster_files > | 250G 145G 106G 58% /cluster_files > | > | -- > | Digimer > | E-Mail: digimer at alteeve.com > | AN!Whitepapers: http://alteeve.com > | Node Assassin: http://nodeassassin.org > > Hi, > > Hm...This sounds like a bug to me. I'd open the bug record. > > Regards, > > Bob Peterson > Red Hat File Systems For the archives: https://bugzilla.redhat.com/show_bug.cgi?id=702050 -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From ercan.karadeniz at vodafone.com Sat May 7 17:48:39 2011 From: ercan.karadeniz at vodafone.com (Karadeniz, Ercan, VF-Group) Date: Sat, 7 May 2011 19:48:39 +0200 Subject: [Linux-cluster] GFS File System Resource does not mount automatically => Bug or Feature Message-ID: <84220C5308E5B146BFA374E47437FB55067EB475@VF-MBX12.internal.vodafone.com> Hi All, I have a two node cluster setup, with a httpd service and with the resource IP, GFS file system (iscsi => /dev/sda2 /var/www/html) and httpd. Now when I start the service the GFS file system gets not automatically mounted. 
I have also tried to relocate the service between both nodes (node1 and node2). However the result has not changed. Moreover I have checked the logs but did not see any error messages. The used OS is RHEL 5.4. Is this a normal behaviour of the RHCS or is this a bug or am I doing something wrong? Since I'm a newbie, I will be thankful for any hint. Have a nice weekend! Warm regards, Ercan Karadeniz -------------- next part -------------- An HTML attachment was scrubbed... URL: From raju.rajsand at gmail.com Sun May 8 10:39:58 2011 From: raju.rajsand at gmail.com (Rajagopal Swaminathan) Date: Sun, 8 May 2011 16:09:58 +0530 Subject: [Linux-cluster] GFS File System Resource does not mount automatically => Bug or Feature In-Reply-To: <84220C5308E5B146BFA374E47437FB55067EB475@VF-MBX12.internal.vodafone.com> References: <84220C5308E5B146BFA374E47437FB55067EB475@VF-MBX12.internal.vodafone.com> Message-ID: Greetings, On Sat, May 7, 2011 at 11:18 PM, Karadeniz, Ercan, VF-Group wrote: > Hi All, > Can you post the config file here? -- Regards, Rajagopal From ercan.karadeniz at vodafone.com Sun May 8 10:56:35 2011 From: ercan.karadeniz at vodafone.com (Karadeniz, Ercan, VF-Group) Date: Sun, 8 May 2011 12:56:35 +0200 Subject: [Linux-cluster] GFS File System Resource does not mount automatically => Bug or Feature In-Reply-To: References: <84220C5308E5B146BFA374E47437FB55067EB475@VF-MBX12.internal.vodafone.com> Message-ID: <84220C5308E5B146BFA374E47437FB55067EB49A@VF-MBX12.internal.vodafone.com> Hi Rajagopal, Please find enclosed my cluster.conf file. Regards, Ercan -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Rajagopal Swaminathan Sent: Sonntag, 8. Mai 2011 12:40 To: linux clustering Subject: Re: [Linux-cluster] GFS File System Resource does not mount automatically => Bug or Feature Greetings, On Sat, May 7, 2011 at 11:18 PM, Karadeniz, Ercan, VF-Group wrote: > Hi All, > Can you post the config file here? -- Regards, Rajagopal -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: cluster.conf Type: application/octet-stream Size: 2232 bytes Desc: cluster.conf URL: From fdinitto at redhat.com Mon May 9 11:25:28 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Mon, 09 May 2011 13:25:28 +0200 Subject: [Linux-cluster] new RHCS upstream wiki Message-ID: <4DC7CF28.30602@redhat.com> Hi all, we are in the process of moving the old cluster wiki (http://sourceware.org/cluster/wiki/) to: https://fedorahosted.org/cluster/wiki/HomePage All pages from the old wiki have been imported and we are in the process to reformat the pages to match the new trac-wiki notation. If you own any page or content, please make sure to verify that the content is correct. In the process I also spotted an insane amount of spam, if you have 5 minutes to spare to help cleaning that up, https://fedorahosted.org/cluster/wiki/TitleIndex is a good starting point. The old wiki will be made readonly soon and any change will be discarded. If necessary I have a backup stored on my harddisk. Please update all your URLs. Fabio From mra at webtel.pl Mon May 9 13:32:39 2011 From: mra at webtel.pl (mr) Date: Mon, 09 May 2011 15:32:39 +0200 Subject: [Linux-cluster] gfs2 setting quota problem Message-ID: <4DC7ECF7.4060302@webtel.pl> Hello, I'm having problem to init gfs2 quota on my existing FS. 
I have 2TB gfs2 FS which is being used in 50%. I have decided to set up quotas. Setting warning and limit levels seemed OK - no errors (athought I had to reset all my existing setting gfs2_quota reset...) New quota calculation ends with the following error: gfs2_quota: Couldn't mount /tmp/.gfs2meta.WUXfKC : Invalid argument Getting some quota values fails - I'm always getting "value: 0.0" :( I have no idea what is wrong... Sombody could help? thx in advance Details: 2.6.18-194.11.1.el5 /tmp/test type gfs2 (rw,noatime,lockproto=lock_nolock,localflocks,localcaching,quota=on) gfs2-utils.i386 0.1.62-28.el5_6.1 kmod-gfs.i686 0.1.34-2.el5 cman.i386 2.0.98-1.el5_3.4 -- Best Regards, MR From swhiteho at redhat.com Mon May 9 18:07:21 2011 From: swhiteho at redhat.com (Steven Whitehouse) Date: Mon, 09 May 2011 19:07:21 +0100 Subject: [Linux-cluster] gfs2 setting quota problem In-Reply-To: <4DC7ECF7.4060302@webtel.pl> References: <4DC7ECF7.4060302@webtel.pl> Message-ID: <1304964441.2813.9.camel@menhir> Hi, On Mon, 2011-05-09 at 15:32 +0200, mr wrote: > Hello, > I'm having problem to init gfs2 quota on my existing FS. > > I have 2TB gfs2 FS which is being used in 50%. I have decided to set up > quotas. Setting warning and limit levels seemed OK - no errors (athought > I had to reset all my existing setting gfs2_quota reset...) New quota > calculation ends with the following error: > > gfs2_quota: Couldn't mount /tmp/.gfs2meta.WUXfKC : Invalid argument > Are you using selinux? The gfs2_quota tool tries to mount the GFS2 metafs in order to make the changes that you requested. For some reason it seems this mount is failing. > Getting some quota values fails - I'm always getting "value: 0.0" :( > > I have no idea what is wrong... Sombody could help? thx in advance > > Details: > 2.6.18-194.11.1.el5 > /tmp/test type gfs2 > (rw,noatime,lockproto=lock_nolock,localflocks,localcaching,quota=on) > gfs2-utils.i386 0.1.62-28.el5_6.1 > kmod-gfs.i686 0.1.34-2.el5 > cman.i386 2.0.98-1.el5_3.4 > > > Is this CentOS or a real RHEL installation? Steve. From raju.rajsand at gmail.com Tue May 10 03:25:31 2011 From: raju.rajsand at gmail.com (Rajagopal Swaminathan) Date: Tue, 10 May 2011 08:55:31 +0530 Subject: [Linux-cluster] GFS File System Resource does not mount automatically => Bug or Feature In-Reply-To: <84220C5308E5B146BFA374E47437FB55067EB49A@VF-MBX12.internal.vodafone.com> References: <84220C5308E5B146BFA374E47437FB55067EB475@VF-MBX12.internal.vodafone.com> <84220C5308E5B146BFA374E47437FB55067EB49A@VF-MBX12.internal.vodafone.com> Message-ID: Greetings, On Sun, May 8, 2011 at 4:26 PM, Karadeniz, Ercan, VF-Group wrote: > Hi Rajagopal, > > Please find enclosed my cluster.conf file. > Just in case, why not mount the GFS in rc.local in both the nodes? Not an elegent solution. but usually works. -- Regards, Rajagopal From mra at webtel.pl Tue May 10 06:06:38 2011 From: mra at webtel.pl (mr) Date: Tue, 10 May 2011 08:06:38 +0200 Subject: [Linux-cluster] gfs2 setting quota problem In-Reply-To: <1304964441.2813.9.camel@menhir> References: <4DC7ECF7.4060302@webtel.pl> <1304964441.2813.9.camel@menhir> Message-ID: <4DC8D5EE.7050001@webtel.pl> Hi, Steven Whitehouse pisze: > Hi, > > On Mon, 2011-05-09 at 15:32 +0200, mr wrote: > >> Hello, >> I'm having problem to init gfs2 quota on my existing FS. >> >> I have 2TB gfs2 FS which is being used in 50%. I have decided to set up >> quotas. Setting warning and limit levels seemed OK - no errors (athought >> I had to reset all my existing setting gfs2_quota reset...) 
New quota >> calculation ends with the following error: >> >> gfs2_quota: Couldn't mount /tmp/.gfs2meta.WUXfKC : Invalid argument >> >> > Are you using selinux? The gfs2_quota tool tries to mount the GFS2 > metafs in order to make the changes that you requested. For some reason > it seems this mount is failing. > Selinux is diabled. I'm also able to mount gfs2meta manually. > >> Getting some quota values fails - I'm always getting "value: 0.0" :( >> >> I have no idea what is wrong... Sombody could help? thx in advance >> >> Details: >> 2.6.18-194.11.1.el5 >> /tmp/test type gfs2 >> (rw,noatime,lockproto=lock_nolock,localflocks,localcaching,quota=on) >> gfs2-utils.i386 0.1.62-28.el5_6.1 >> kmod-gfs.i686 0.1.34-2.el5 >> cman.i386 2.0.98-1.el5_3.4 >> >> >> >> > Is this CentOS or a real RHEL installation? > Centos. > Steve. > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- Pozdrawiam, Miko?aj Radzewicz Tel. (022) 257 43 36 Webtel - Interactive Solutions Interaktywne rozwi?zania dla biznesu i marketingu Webtel Sp.z o.o. ul. Marynarska 11, 02-674 Warszawa S?d Rejonowy dla m. st. Warszawy, XIII Wydzia? Gospodarczy Krajowego Rejestru S?dowego KRS: 0000088129, NIP: 525-10-61-332, kapita? zak?adowy: 1 745 700 PLN www.webtel.pl Niniejsza wiadomo??, wraz z wszelkimi za??cznikami, jest poufna i przeznaczona wy??cznie do wiadomo?ci adresata. W przypadku omy?kowego otrzymania tej wiadomo?ci, prosimy o poinformowanie nadawcy oraz nie u?ywanie, nie przekazywanie i nie kopiowanie zawartych w niej tre?ci, kt?re jest prawnie zabronione. From koubat at fzu.cz Tue May 10 12:33:45 2011 From: koubat at fzu.cz (Tomas Kouba) Date: Tue, 10 May 2011 14:33:45 +0200 Subject: [Linux-cluster] Documentation for cluster beginner and pacemaker vs rgmanager Message-ID: <4DC930A9.9040803@fzu.cz> Hello HA magicians, I would like to bring our services to a more reliable level and I was googling around some basic information about RH cluster suite. I am not quite able to answer 2 questions so I'd like to ask here: 1) What is the starting documentation that you would recommend to a linux administrator who would like to setup a HA cluster? I have found http://www.linuxtopia.org/online_books/rhel6/rhel_6_cluster_admin/ is it good even though I use clone of RHEL? (Scientific Linux 6). 2) Which resource manager would you recommend? rgmanager or pacemaker? The following pages favor pacemaker but the documentation usually says rgmanager: http://www.spinics.net/lists/cluster/msg16401.html http://sources.redhat.com/cluster/wiki/RGManagerVsPacemaker Best regards, -- Tomas Kouba From adas at redhat.com Tue May 10 13:27:07 2011 From: adas at redhat.com (Abhijith Das) Date: Tue, 10 May 2011 09:27:07 -0400 (EDT) Subject: [Linux-cluster] gfs2 setting quota problem In-Reply-To: <4DC8D5EE.7050001@webtel.pl> Message-ID: <802130008.429868.1305034027382.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com> ----- Original Message ----- > From: "mr" > To: "linux clustering" > Sent: Tuesday, May 10, 2011 1:06:38 AM > Subject: Re: [Linux-cluster] gfs2 setting quota problem > Hi, > Steven Whitehouse pisze: > > Hi, > > > > On Mon, 2011-05-09 at 15:32 +0200, mr wrote: > > > >> Hello, > >> I'm having problem to init gfs2 quota on my existing FS. > >> > >> I have 2TB gfs2 FS which is being used in 50%. I have decided to > >> set up > >> quotas. Setting warning and limit levels seemed OK - no errors > >> (athought > >> I had to reset all my existing setting gfs2_quota reset...) 
New > >> quota > >> calculation ends with the following error: > >> > >> gfs2_quota: Couldn't mount /tmp/.gfs2meta.WUXfKC : Invalid argument > >> > >> > > Are you using selinux? The gfs2_quota tool tries to mount the GFS2 > > metafs in order to make the changes that you requested. For some > > reason > > it seems this mount is failing. > > > Selinux is diabled. I'm also able to mount gfs2meta manually. > > > >> Getting some quota values fails - I'm always getting "value: 0.0" > >> :( > >> > >> I have no idea what is wrong... Sombody could help? thx in advance > >> > >> Details: > >> 2.6.18-194.11.1.el5 > >> /tmp/test type gfs2 > >> (rw,noatime,lockproto=lock_nolock,localflocks,localcaching,quota=on) > >> gfs2-utils.i386 0.1.62-28.el5_6.1 > >> kmod-gfs.i686 0.1.34-2.el5 > >> cman.i386 2.0.98-1.el5_3.4 > >> > >> > >> > >> > > Is this CentOS or a real RHEL installation? > > > Centos. > > Steve. > > Hi, I found this bz: https://bugzilla.redhat.com/show_bug.cgi?id=459630, but the package versions you list are pretty recent and this was fixed quite a while ago. Are there any older gfs2-utils bits lying around? I'd like to see the strace of the gfs2_quota command that triggers the meta-mount error. Please also include the command line in your output. Thanks! --Abhi From andrew at beekhof.net Tue May 10 13:33:25 2011 From: andrew at beekhof.net (Andrew Beekhof) Date: Tue, 10 May 2011 15:33:25 +0200 Subject: [Linux-cluster] Documentation for cluster beginner and pacemaker vs rgmanager In-Reply-To: <4DC930A9.9040803@fzu.cz> References: <4DC930A9.9040803@fzu.cz> Message-ID: On Tue, May 10, 2011 at 2:33 PM, Tomas Kouba wrote: > Hello HA magicians, > > I would like to bring our services to a more reliable level and I was > googling around some > basic information about RH cluster suite. > I am not quite able to answer 2 questions so I'd like to ask here: > > 1) What is the starting documentation that you would recommend to a linux > administrator who would > like to setup a HA cluster? I have found > http://www.linuxtopia.org/online_books/rhel6/rhel_6_cluster_admin/ > is it good even though I use clone of RHEL? (Scientific Linux 6). > > 2) Which resource manager would you recommend? rgmanager or pacemaker? > The following pages favor pacemaker but the documentation usually says > rgmanager: > http://www.spinics.net/lists/cluster/msg16401.html > http://sources.redhat.com/cluster/wiki/RGManagerVsPacemaker Well I'm going to say Pacemaker + http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ But then you'd expect that since I wrote both :-) I think its fair to say that Pacemaker is better than rgmanager, but the stars didn't align in time for RHEL6.0 so full support wasn't an option. That said, you're using SL6 so community support on mailing lists such as this one might well be sufficient. Oh, but the cluster GUI only supports rgmanager if thats important to you. Pacemaker does have a shiny integrated CLI though. From linux at alteeve.com Tue May 10 14:06:49 2011 From: linux at alteeve.com (Digimer) Date: Tue, 10 May 2011 10:06:49 -0400 Subject: [Linux-cluster] Documentation for cluster beginner and pacemaker vs rgmanager In-Reply-To: References: <4DC930A9.9040803@fzu.cz> Message-ID: <4DC94679.7070507@alteeve.com> On 05/10/2011 09:33 AM, Andrew Beekhof wrote: > On Tue, May 10, 2011 at 2:33 PM, Tomas Kouba wrote: >> Hello HA magicians, >> >> I would like to bring our services to a more reliable level and I was >> googling around some >> basic information about RH cluster suite. 
>> I am not quite able to answer 2 questions so I'd like to ask here: >> >> 1) What is the starting documentation that you would recommend to a linux >> administrator who would >> like to setup a HA cluster? I have found >> http://www.linuxtopia.org/online_books/rhel6/rhel_6_cluster_admin/ >> is it good even though I use clone of RHEL? (Scientific Linux 6). >> >> 2) Which resource manager would you recommend? rgmanager or pacemaker? >> The following pages favor pacemaker but the documentation usually says >> rgmanager: >> http://www.spinics.net/lists/cluster/msg16401.html >> http://sources.redhat.com/cluster/wiki/RGManagerVsPacemaker > > Well I'm going to say Pacemaker + > http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ > But then you'd expect that since I wrote both :-) > > I think its fair to say that Pacemaker is better than rgmanager, but > the stars didn't align in time for RHEL6.0 so full support wasn't an > option. That said, you're using SL6 so community support on mailing > lists such as this one might well be sufficient. > > Oh, but the cluster GUI only supports rgmanager if thats important to you. > Pacemaker does have a shiny integrated CLI though. I'd agree with Andrew that Pacemaker is better, but I'd also say that rgmanager has it's bright spots, too. :) Pacemaker is far more flexible with regard to the resource management side of things, and rgmanager will be phased out over the next few years in favour of pacemaker. The two biggest arguments I'd make in favour of rgmanager are; If your resource managements needs are within it's capabilities and you are running RHCS, then it can be configured within the main cluster.conf file. It is also an old and well tested solution, which some find value in. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org "I feel confined, only free to expand myself within boundaries." From andrew at beekhof.net Tue May 10 14:38:39 2011 From: andrew at beekhof.net (Andrew Beekhof) Date: Tue, 10 May 2011 16:38:39 +0200 Subject: [Linux-cluster] Documentation for cluster beginner and pacemaker vs rgmanager In-Reply-To: <4DC94679.7070507@alteeve.com> References: <4DC930A9.9040803@fzu.cz> <4DC94679.7070507@alteeve.com> Message-ID: On Tue, May 10, 2011 at 4:06 PM, Digimer wrote: > On 05/10/2011 09:33 AM, Andrew Beekhof wrote: >> On Tue, May 10, 2011 at 2:33 PM, Tomas Kouba wrote: >>> Hello HA magicians, >>> >>> I would like to bring our services to a more reliable level and I was >>> googling around some >>> basic information about RH cluster suite. >>> I am not quite able to answer 2 questions so I'd like to ask here: >>> >>> 1) What is the starting documentation that you would recommend to a linux >>> administrator who would >>> like to setup a HA cluster? I have found >>> http://www.linuxtopia.org/online_books/rhel6/rhel_6_cluster_admin/ >>> is it good even though I use clone of RHEL? (Scientific Linux 6). >>> >>> 2) Which resource manager would you recommend? rgmanager or pacemaker? 
>>> The following pages favor pacemaker but the documentation usually says >>> rgmanager: >>> http://www.spinics.net/lists/cluster/msg16401.html >>> http://sources.redhat.com/cluster/wiki/RGManagerVsPacemaker >> >> Well I'm going to say Pacemaker + >> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ >> But then you'd expect that since I wrote both :-) >> >> I think its fair to say that Pacemaker is better than rgmanager, but >> the stars didn't align in time for RHEL6.0 so full support wasn't an >> option. ?That said, you're using SL6 so community support on mailing >> lists such as this one might well be sufficient. >> >> Oh, but the cluster GUI only supports rgmanager if thats important to you. >> Pacemaker does have a shiny integrated CLI though. > > I'd agree with Andrew that Pacemaker is better, but I'd also say that > rgmanager has it's bright spots, too. :) No argument there. > Pacemaker is far more flexible with regard to the resource management > side of things, and rgmanager will be phased out over the next few years > in favour of pacemaker. > > The two biggest arguments I'd make in favour of rgmanager are; If your > resource managements needs are within it's capabilities and you are > running RHCS, then it can be configured within the main cluster.conf > file. It is also an old and well tested solution, which some find value in. Pacemaker will be celebrating its 8th anniversary this year. So it's not a spring chicken either ;-) From victor.ramirez at prhin.net Wed May 11 16:09:24 2011 From: victor.ramirez at prhin.net (Victor Ramirez) Date: Wed, 11 May 2011 12:09:24 -0400 Subject: [Linux-cluster] Fencing problem on Cluster Suite 3.0.12: fenced throws agent error when invoking fence_xvm Message-ID: I cannot get fence_virt in multicast mode to fence automatically even though I can do it manually with the fence_node command. Lemme start from the beginning. I have a 2 node cluster, each node is a kvm guest running on a different physical host. All machines are RHEL 6 x64 and the cluster suite version is 3.0.12. Like I mentioned before, cluster is configured correctly since I can fence manually with the fence_node command, but when I trigger fenced to call fence_xvm automatically, fence_xvm fails silently with error status 1 and no multicast packet is sent. fence_xvm command does not write its output anywhere when invoked by fenced so I cannot know why it fails, but I suspect that it may be trying to use serial communication instead of multicast. During troubleshooting, I made a script to run in place of fence_xvm in order to write the piped arguments into a log file and the arguments seem to be correct: domain=prhin01-vm01 nodename=prhin01-vm01 agent=fence_xvm debug=5 Also, I used tcpdump to determine that no multicast packet was being sent by fence_xvm. I downloaded the fence-virt code but I am not too keen on debugging linux C code as I am a lowly Java webapp developer. More info can be found here: https://access.redhat.com/discussion/fencing-problem-cluster-suite-3012-fenced-throws-agent-error-when-invoking-fencexvm What else can I try to coax fence_xvm to work? How can I make fence_xvm write to its output to a log file? Can I call fence_virt instead and use a parameter to force multicast mode? Should I give up multicast and try to use some vmchannel scheme? -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ramiblanco at gmail.com Wed May 11 20:34:18 2011 From: ramiblanco at gmail.com (Ramiro Blanco) Date: Wed, 11 May 2011 17:34:18 -0300 Subject: [Linux-cluster] Write Performance Issues with GFS2 Message-ID: Hi, I have a 4 node cluster running gfs2 on top of a EMC SAN for a while now, and since couple of months ago we are randomly experiencing heavy write slowdowns. Write rate goes down from about 30 MB/s to 10 kB/s. It can affect 1 node or more at the same time. Umount and mount solves the problem on the affected node, but after some random time (hours, days) happens again. Operating system: Centos 5.6 x86_64 Kernel: 2.6.18-238.9.1.el5 Cman: cman-2.0.115-68.el5_6.3 Gfs2-utils: gfs2-utils-0.1.62-28.el5_6.1 3 nodes fibre channel 4gb 1 node on iscsi 1gb I've read that there's a bug concerning slow writes, but i think that affects newer kernels, isn't that right? Is there any other bug that could be the root of this? Cheers, -- Ramiro Blanco From amrossi at linux.it Wed May 11 22:07:58 2011 From: amrossi at linux.it (Andrea Modesto Rossi) Date: Thu, 12 May 2011 00:07:58 +0200 (CEST) Subject: [Linux-cluster] Write Performance Issues with GFS2 In-Reply-To: References: Message-ID: <2920eae44f09b0e9a8e24c98c21e1cff.squirrel@picard.linux.it> On Mer, 11 Maggio 2011 10:34 pm, Ramiro Blanco wrote: > Hi, > > I have a 4 node cluster running gfs2 on top of a EMC SAN for a while > now, and since couple of months ago we are randomly experiencing heavy > write slowdowns. Write rate goes down from about 30 MB/s to 10 kB/s. Hi! i've got a similar issue. In may case for example, an SCP copy begin with 30MB/s but after about 15 minuts it is less then 300Kb/s Why? -- Andrea Modesto Rossi Fedora Ambassador From adrew at redhat.com Wed May 11 22:15:58 2011 From: adrew at redhat.com (Adam Drew) Date: Wed, 11 May 2011 18:15:58 -0400 (EDT) Subject: [Linux-cluster] Write Performance Issues with GFS2 In-Reply-To: <2920eae44f09b0e9a8e24c98c21e1cff.squirrel@picard.linux.it> Message-ID: <1335256588.190299.1305152158222.JavaMail.root@zmail01.collab.prod.int.phx2.redhat.com> Hello, There's multiple reasons such things could be happening and though both of you see similar symptoms the underlying causes may be different. I highly suggest opening cases with Red Hat Support if you are Red Hat customers. As far as what it could be... well, there's a lot to pick from. One thing that comes to mind is: https://bugzilla.redhat.com/show_bug.cgi?id=683155 But that depends on the size of the file being written vs. rgrp size. Lock contention is also a possibility of course. I'd start with that bug and go from there. Again, Red Hat Support may be able to really assist you in this. Thanks, Adam Drew ----- Original Message ----- > On Mer, 11 Maggio 2011 10:34 pm, Ramiro Blanco wrote: > > Hi, > > > > I have a 4 node cluster running gfs2 on top of a EMC SAN for a while > > now, and since couple of months ago we are randomly experiencing > > heavy > > write slowdowns. Write rate goes down from about 30 MB/s to 10 kB/s. > > Hi! > > i've got a similar issue. In may case for example, an SCP copy begin > with > 30MB/s but after about 15 minuts it is less then 300Kb/s > > Why? 
> > > -- > Andrea Modesto Rossi > Fedora Ambassador > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From ramiblanco at gmail.com Wed May 11 23:32:57 2011 From: ramiblanco at gmail.com (Ramiro Blanco) Date: Wed, 11 May 2011 20:32:57 -0300 Subject: [Linux-cluster] Write Performance Issues with GFS2 In-Reply-To: <1335256588.190299.1305152158222.JavaMail.root@zmail01.collab.prod.int.phx2.redhat.com> References: <2920eae44f09b0e9a8e24c98c21e1cff.squirrel@picard.linux.it> <1335256588.190299.1305152158222.JavaMail.root@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: 2011/5/11 Adam Drew : > Hello, > > There's multiple reasons such things could be happening and though both of you see similar symptoms the underlying causes may be different. I highly suggest opening cases with Red Hat Support if you are Red Hat customers. > I'll do that. > As far as what it could be... well, there's a lot to pick from. One thing that comes to mind is: > > https://bugzilla.redhat.com/show_bug.cgi?id=683155 Can't access that one: "You are not authorized to access bug #683155" > > But that depends on the size of the file being written vs. rgrp size. Lock contention is also a possibility of course. I'd start with that bug and go from there. Again, Red Hat Support may be able to really assist you in this. > In my case, no mather what the file size is, it could be 100k or 1gb, the same performance is reached. Cheers, -- Ramiro Blanco From sufyan.khan at its.ws Thu May 12 06:27:13 2011 From: sufyan.khan at its.ws (Sufyan Khan) Date: Thu, 12 May 2011 09:27:13 +0300 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Message-ID: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> Dear All I need to setup HA cluster for mu oracle dabase. I have setup two node cluster using "System-Config-Cluster .." on RHEL 5.5 I created RG a shared file system "/emc01" ext3 , shared IP and DB script to monitor the DB. My cluster starts perfectly and fail over on shutting down primary node, also stopping shared IP fails node to failover node. But on kill PMON , or LSNR process the node does not fails and keep showing the status services running on primary node. I JUST NEED TO KNOW WHERE IS THE PROBLEM. ATTACHED IS DB scripts and "cluster.conf" file. Thanks in advance for help. Sufyan -------------- next part -------------- A non-text attachment was scrubbed... Name: cluster.conf Type: application/octet-stream Size: 1534 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: script_db.sh Type: application/octet-stream Size: 814 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: startdb.sh Type: application/octet-stream Size: 448 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: stopdb.sh Type: application/octet-stream Size: 414 bytes Desc: not available URL: From Chris.Jankowski at hp.com Thu May 12 06:44:19 2011 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Thu, 12 May 2011 06:44:19 +0000 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> Message-ID: <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> Sufyan, What username does the instance of Oracle DB run as? 
Is this "orainfra" or some other username? The scripts assume a user named "orainfra". If you use a different username then you need to modify the scripts accordingly. Regards, Chris Jankowski -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Sufyan Khan Sent: Thursday, 12 May 2011 16:27 To: 'linux clustering' Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Dear All I need to setup HA cluster for mu oracle dabase. I have setup two node cluster using "System-Config-Cluster .." on RHEL 5.5 I created RG a shared file system "/emc01" ext3 , shared IP and DB script to monitor the DB. My cluster starts perfectly and fail over on shutting down primary node, also stopping shared IP fails node to failover node. But on kill PMON , or LSNR process the node does not fails and keep showing the status services running on primary node. I JUST NEED TO KNOW WHERE IS THE PROBLEM. ATTACHED IS DB scripts and "cluster.conf" file. Thanks in advance for help. Sufyan From sufyan.khan at its.ws Thu May 12 07:22:01 2011 From: sufyan.khan at its.ws (Sufyan Khan) Date: Thu, 12 May 2011 10:22:01 +0300 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> Message-ID: <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> First of all thanks for you quick response. Secondly please note: the working "cluster.conf" file is attached here, the previous file was not correct. Yes the orainfra is the user name. Any othere clue please. sufyan -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jankowski, Chris Sent: Thursday, May 12, 2011 9:44 AM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Sufyan, What username does the instance of Oracle DB run as? Is this "orainfra" or some other username? The scripts assume a user named "orainfra". If you use a different username then you need to modify the scripts accordingly. Regards, Chris Jankowski -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Sufyan Khan Sent: Thursday, 12 May 2011 16:27 To: 'linux clustering' Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Dear All I need to setup HA cluster for mu oracle dabase. I have setup two node cluster using "System-Config-Cluster .." on RHEL 5.5 I created RG a shared file system "/emc01" ext3 , shared IP and DB script to monitor the DB. My cluster starts perfectly and fail over on shutting down primary node, also stopping shared IP fails node to failover node. But on kill PMON , or LSNR process the node does not fails and keep showing the status services running on primary node. I JUST NEED TO KNOW WHERE IS THE PROBLEM. ATTACHED IS DB scripts and "cluster.conf" file. Thanks in advance for help. Sufyan -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... 
Name: cluster.conf Type: application/octet-stream Size: 1457 bytes Desc: not available URL: From Andre.Gerbatsch at globalfoundries.com Fri May 13 10:09:40 2011 From: Andre.Gerbatsch at globalfoundries.com (Gerbatsch, Andre) Date: Fri, 13 May 2011 12:09:40 +0200 Subject: [Linux-cluster] qdiskd does not call heuristics regularly? Message-ID: <495A8D8FDB79D54DA88BDC8AE7F3A76D23D70533EC@VDRSEXMBXP1.gfoundries.com> Hello, Im at a point where I have different answers from different experts, read "qdiskd" source code by myself and would be happy if someone could help me: I expected in my configuration (see below) that a heuristics script will be called on a regularly bases (every "interval" s) to have a chance to influence quorumd scores if something happened with the cluster node. What I see is, that there were some cycles during quorum device initialization, after that heuristics is called "from time to time". Question: is this the expected behavior ? If yes, is there a chance to call heuristics regularly ? Question2: how can I determine the cman/qdisk version I use.. cman_1_0_??? (see rpm -qi cman) The final effect is: if I disconnect one node in a 2-node cluster from network the "wrong" node won - and heuristics had no influence on the fencing decision. Thank you in advance for any response Andre ================================================= == rpm -qi cman Name : cman Relocations: (not relocatable) Version : 2.0.115 Vendor: Red Hat, Inc. Release : 68.el5_6.1 Build Date: Mon Dec 20 19:28:36 2010 Install Date: Thu Apr 28 11:11:43 2011 Build Host: ls20-bc2-14.build.redhat.com Group : System Environment/Base Source RPM: cman-2.0.115-68.el5_6.1.src.rpm Size : 2619414 License: GPL Signature : DSA/SHA1, Fri Dec 31 06:29:03 2010, Key ID 5326810137017186 Packager : Red Hat, Inc. URL : http://sources.redhat.com/cluster/ Summary : cman - The Cluster Manager Description : cman - The Cluster Manager == cluster.conf: .. .. == > ps -eLf | grep qdiskd root 3976 1 3976 0 3 08:59 ? 00:00:00 qdiskd -Q root 3976 1 3978 0 3 08:59 ? 00:00:00 qdiskd -Q root 3976 1 4226 0 3 08:59 ? 00:00:00 qdiskd -Q root 21613 12673 21613 0 1 10:45 pts/0 00:00:00 grep qdiskd == strace "score thread" (hopefully :-) = it seems simply waiting for some timer... clock_gettime(CLOCK_MONOTONIC, {60774, 182881847}) = 0 clock_gettime(CLOCK_MONOTONIC, {60774, 182920847}) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], 8) = 0 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], NULL, 8) = 0 nanosleep({1, 0}, {1, 0}) = 0 clock_gettime(CLOCK_MONOTONIC, {60775, 202918847}) = 0 clock_gettime(CLOCK_MONOTONIC, {60775, 202961847}) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], 8) = 0 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], NULL, 8) = 0 nanosleep({1, 0}, {1, 0}) = 0 clock_gettime(CLOCK_MONOTONIC, {60776, 222868847}) = 0 clock_gettime(CLOCK_MONOTONIC, {60776, 222912847}) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], 8) = 0 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], NULL, 8) = 0 nanosleep({1, 0}, Process 3978 detached .. 
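On the final point above (the "wrong" node winning when the network is cut): the heuristic can only influence that if it actually tests the private interconnect rather than just logging and returning 0. A minimal sketch of such a check, with the target address as a placeholder:

#!/bin/sh
# Ping something on the private interconnect that is not the other cluster
# node (switch management address, gateway, ...). The IP is a placeholder.
TARGET=192.168.10.254
ping -c 1 -w 2 $TARGET >/dev/null 2>&1
rval=$?
echo "pvtlink: $(date) $0 target=$TARGET rval=$rval" >> /root/root/cluster/checkpvtlink.log
exit $rval

With a check like this, the node that loses the private network should fail its heuristic, drop below min_score and lose its qdiskd vote, instead of the outcome being effectively arbitrary. Coming back to the strace output above: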
seems to me that this is the score thread with a "wrong" h->nextrun.. but I think I simply do not understand smthg.. cman/qdiskd/score.c: from http://git.fedorahosted.org/git/?p=cluster.git;a=summary 99 fork_heuristic(struct h_data *h) 100 { ... 110 now = time(NULL); 111 if (now < h->nextrun) 112 return 0; 113 114 h->nextrun = now + h->interval; 115 116 pid = fork(); == output from heuristic testscript > cat checkpvtlink.sh #!/bin/sh rval=0 echo "dummy: $(date) $0 rval=$rval" >> /root/root/cluster/checkpvtlink.log exit $rval > tail checkpvtlink.log dummy: Fri May 13 09:03:35 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <== service qdiskd restart dummy: Fri May 13 09:05:17 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:06:58 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:08:40 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:10:22 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:12:04 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:23:46 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:31:48 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 10:20:19 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <== why so late ?? dummy: Fri May 13 10:40:29 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 Andre Gerbatsch MTS IT Systems Engineer Tel +49 (0) 351 277-1762 Fax +49 (0) 351 277-91762 andre.gerbatsch at globalfoundries.com GLOBALFOUNDRIES Dresden Module Two GmbH & Co. KG Wilschdorfer Landstr. 101, 01109 Dresden, Deutschland, Sitz Dresden I Registergericht Dresden HRA 4896 From Andre.Gerbatsch at globalfoundries.com Fri May 13 12:00:23 2011 From: Andre.Gerbatsch at globalfoundries.com (Gerbatsch, Andre) Date: Fri, 13 May 2011 14:00:23 +0200 Subject: [Linux-cluster] qdiskd does not call heuristics regularly? In-Reply-To: <495A8D8FDB79D54DA88BDC8AE7F3A76D23D70533EC@VDRSEXMBXP1.gfoundries.com> References: <495A8D8FDB79D54DA88BDC8AE7F3A76D23D70533EC@VDRSEXMBXP1.gfoundries.com> Message-ID: <495A8D8FDB79D54DA88BDC8AE7F3A76D23D70533EF@VDRSEXMBXP1.gfoundries.com> .. small correction of the qdiskd->heuristic script timing: dummy: Fri May 13 08:59:16 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1 <--qdiskd restart, rval=1 dummy: Fri May 13 08:59:21 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1 dummy: Fri May 13 08:59:26 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1 dummy: Fri May 13 08:59:31 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1 dummy: Fri May 13 08:59:36 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1 dummy: Fri May 13 08:59:41 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1 dummy: Fri May 13 08:59:46 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1 dummy: Fri May 13 08:59:51 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1 dummy: Fri May 13 08:59:56 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <--changed script, rval=0 dummy: Fri May 13 09:00:01 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:00:06 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:00:11 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <-- until this point ok (dt=5s) dummy: Fri May 13 09:01:53 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <-- below: ?? every 103s ? 
dummy: Fri May 13 09:03:35 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:05:17 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:06:58 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:08:40 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:10:22 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:12:04 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:23:46 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <-- ?? no regular checks ? dummy: Fri May 13 09:31:48 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 10:20:19 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 10:40:29 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Gerbatsch, Andre Sent: Freitag, 13. Mai 2011 12:10 To: 'linux-cluster at redhat.com' Subject: [Linux-cluster] qdiskd does not call heuristics regularly? Hello, Im at a point where I have different answers from different experts, read "qdiskd" source code by myself and would be happy if someone could help me: I expected in my configuration (see below) that a heuristics script will be called on a regularly bases (every "interval" s) to have a chance to influence quorumd scores if something happened with the cluster node. What I see is, that there were some cycles during quorum device initialization, after that heuristics is called "from time to time". Question: is this the expected behavior ? If yes, is there a chance to call heuristics regularly ? Question2: how can I determine the cman/qdisk version I use.. cman_1_0_??? (see rpm -qi cman) The final effect is: if I disconnect one node in a 2-node cluster from network the "wrong" node won - and heuristics had no influence on the fencing decision. Thank you in advance for any response Andre ================================================= == rpm -qi cman Name : cman Relocations: (not relocatable) Version : 2.0.115 Vendor: Red Hat, Inc. Release : 68.el5_6.1 Build Date: Mon Dec 20 19:28:36 2010 Install Date: Thu Apr 28 11:11:43 2011 Build Host: ls20-bc2-14.build.redhat.com Group : System Environment/Base Source RPM: cman-2.0.115-68.el5_6.1.src.rpm Size : 2619414 License: GPL Signature : DSA/SHA1, Fri Dec 31 06:29:03 2010, Key ID 5326810137017186 Packager : Red Hat, Inc. URL : http://sources.redhat.com/cluster/ Summary : cman - The Cluster Manager Description : cman - The Cluster Manager == cluster.conf: .. .. == > ps -eLf | grep qdiskd root 3976 1 3976 0 3 08:59 ? 00:00:00 qdiskd -Q root 3976 1 3978 0 3 08:59 ? 00:00:00 qdiskd -Q root 3976 1 4226 0 3 08:59 ? 00:00:00 qdiskd -Q root 21613 12673 21613 0 1 10:45 pts/0 00:00:00 grep qdiskd == strace "score thread" (hopefully :-) = it seems simply waiting for some timer... 
clock_gettime(CLOCK_MONOTONIC, {60774, 182881847}) = 0 clock_gettime(CLOCK_MONOTONIC, {60774, 182920847}) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], 8) = 0 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], NULL, 8) = 0 nanosleep({1, 0}, {1, 0}) = 0 clock_gettime(CLOCK_MONOTONIC, {60775, 202918847}) = 0 clock_gettime(CLOCK_MONOTONIC, {60775, 202961847}) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], 8) = 0 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], NULL, 8) = 0 nanosleep({1, 0}, {1, 0}) = 0 clock_gettime(CLOCK_MONOTONIC, {60776, 222868847}) = 0 clock_gettime(CLOCK_MONOTONIC, {60776, 222912847}) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], 8) = 0 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], NULL, 8) = 0 nanosleep({1, 0}, Process 3978 detached .. seems to me that this is the score thread with a "wrong" h->nextrun.. but I think I simply do not understand smthg.. cman/qdiskd/score.c: from http://git.fedorahosted.org/git/?p=cluster.git;a=summary 99 fork_heuristic(struct h_data *h) 100 { ... 110 now = time(NULL); 111 if (now < h->nextrun) 112 return 0; 113 114 h->nextrun = now + h->interval; 115 116 pid = fork(); == output from heuristic testscript > cat checkpvtlink.sh #!/bin/sh rval=0 echo "dummy: $(date) $0 rval=$rval" >> /root/root/cluster/checkpvtlink.log exit $rval > tail checkpvtlink.log dummy: Fri May 13 09:03:35 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <== service qdiskd restart dummy: Fri May 13 09:05:17 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:06:58 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:08:40 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:10:22 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:12:04 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:23:46 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 09:31:48 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 dummy: Fri May 13 10:20:19 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <== why so late ?? dummy: Fri May 13 10:40:29 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 Andre Gerbatsch MTS IT Systems Engineer Tel +49 (0) 351 277-1762 Fax +49 (0) 351 277-91762 andre.gerbatsch at globalfoundries.com GLOBALFOUNDRIES Dresden Module Two GmbH & Co. KG Wilschdorfer Landstr. 101, 01109 Dresden, Deutschland, Sitz Dresden I Registergericht Dresden HRA 4896 -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From klusterfsck at outofoptions.net Fri May 13 16:12:10 2011 From: klusterfsck at outofoptions.net (Kluster Fsck) Date: Fri, 13 May 2011 12:12:10 -0400 Subject: [Linux-cluster] Virtual Network Message-ID: <4DCD585A.1090304@outofoptions.net> I inherited an old cluster that RH won't support. Red Hat Linux Advanced Server release 2.1AS. The last day the old sys admin the cluster went down and never joined. (Customer owned equipment and the UPS is failed) As a quick fix I hard coded the address on the active node. 
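For anyone in the same position, the hand-rolled version of that quick fix is just an interface alias plus a gratuitous ARP so clients follow the address. The interface, addresses and peer hostname below are placeholders, and this is only a stopgap until the cluster software is managing the address again:

# On the node that should NOT own the shared address, make sure it is gone:
ssh other-node "ifconfig eth0:1 down"
# On the node that should own it, bring it up as an alias and announce it:
ifconfig eth0:1 192.168.0.50 netmask 255.255.255.0 up
arping -c 3 -A -I eth0 192.168.0.50   # gratuitous ARP; skip this line if your arping lacks -A

Doing it in that order keeps the two machines from fighting over the same IP after a power bump.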
Life was good until last night when another power bump occured and the other machine grabbed control. This is EOL hardware/software and we are working to get off of this in the next couple of weeks. My question. What is the mechanism for bringing up the shared address? After taking the hard coded nic down I tried: service cluster stop/start. I tried bringing up the preferred node a little ahead of the non-preferred node and then tried allowing it to come up completely before brining it up on the second node. Thanks for listening. From ajb2 at mssl.ucl.ac.uk Fri May 13 21:49:59 2011 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Fri, 13 May 2011 22:49:59 +0100 Subject: [Linux-cluster] Write Performance Issues with GFS2 In-Reply-To: References: <2920eae44f09b0e9a8e24c98c21e1cff.squirrel@picard.linux.it> <1335256588.190299.1305152158222.JavaMail.root@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <4DCDA787.5070508@mssl.ucl.ac.uk> On 12/05/11 00:32, Ramiro Blanco wrote: >> https://bugzilla.redhat.com/show_bug.cgi?id=683155 > Can't access that one: "You are not authorized to access bug #683155" There's no reason this bug should be private, however it's addressed in test kernel kernel-2.6.18-248.el5 Steve/Bob, how about opening this one up for public view? From rpeterso at redhat.com Fri May 13 22:21:05 2011 From: rpeterso at redhat.com (Bob Peterson) Date: Fri, 13 May 2011 18:21:05 -0400 (EDT) Subject: [Linux-cluster] Write Performance Issues with GFS2 In-Reply-To: <4DCDA787.5070508@mssl.ucl.ac.uk> Message-ID: <1466017320.34706.1305325265203.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> ----- Original Message ----- | On 12/05/11 00:32, Ramiro Blanco wrote: | | >> https://bugzilla.redhat.com/show_bug.cgi?id=683155 | > Can't access that one: "You are not authorized to access bug | > #683155" | | There's no reason this bug should be private, however it's addressed | in | test kernel kernel-2.6.18-248.el5 | | Steve/Bob, how about opening this one up for public view? Sounds okay to me. Not sure how that's done, and not sure if I have the right authority in bugzilla to do it. Regards, Bob Peterson Red Hat File Systems From ajb2 at mssl.ucl.ac.uk Sat May 14 13:01:09 2011 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Sat, 14 May 2011 14:01:09 +0100 Subject: [Linux-cluster] Write Performance Issues with GFS2 In-Reply-To: <1466017320.34706.1305325265203.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> References: <1466017320.34706.1305325265203.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: <4DCE7D15.8090709@mssl.ucl.ac.uk> On 13/05/11 23:21, Bob Peterson wrote: > | Steve/Bob, how about opening this one up for public view? > > Sounds okay to me. Not sure how that's done, and not sure if I have > the right authority in bugzilla to do it. I'm not entirely sure either but as the creator I think all you have to do is uncheck the private/developers boxes. AB From unknownboogyman at gmail.com Sat May 14 15:18:05 2011 From: unknownboogyman at gmail.com (Steve) Date: Sat, 14 May 2011 11:18:05 -0400 Subject: [Linux-cluster] (no subject) Message-ID: Hello all, Currently my group at college is working on a Senior Project and have created it pretty much successfully. We have a group of four test computers in a cluster before we go along with the eight we plan on. Right now we have tried a cluster software openMosix, or just follow the link below. Well, the dependencies didn't work and we couldn't install it. 
So, my question is does anyone know of stress testing software for CentOS clustering? Just regular stress testing too, like processing speed, hard-drive, yatta yatta. Basically, just four computers clustered over Ethernet (yes, I know, it'll most likely be slow). If you need anymore information, just let me know. This is the link for the one that didn't work: http://www.openmosixview.com/omtest/ -- -Steve -------------- next part -------------- An HTML attachment was scrubbed... URL: From parvez.h.shaikh at gmail.com Sat May 14 15:36:47 2011 From: parvez.h.shaikh at gmail.com (Parvez Shaikh) Date: Sat, 14 May 2011 21:06:47 +0530 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> Message-ID: Hi Sufyan Does your status function return 0 or 1 if database is up or down respectively (i.e. have you tested it works outside script_db.sh) when run as "root"? On Thu, May 12, 2011 at 12:52 PM, Sufyan Khan wrote: > First of all thanks for you quick response. > > Secondly please note: the working "cluster.conf" file is attached here, > the > previous file was not correct. > Yes the orainfra is the user name. > > Any othere clue please. > > sufyan > > > > > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jankowski, Chris > Sent: Thursday, May 12, 2011 9:44 AM > To: linux clustering > Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON > deamon > > Sufyan, > > What username does the instance of Oracle DB run as? Is this "orainfra" or > some other username? > > The scripts assume a user named "orainfra". > If you use a different username then you need to modify the scripts > accordingly. > > Regards, > > Chris Jankowski > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Sufyan Khan > Sent: Thursday, 12 May 2011 16:27 > To: 'linux clustering' > Subject: [Linux-cluster] oracle DB is not failing over on killin PMON > deamon > > Dear All > > I need to setup HA cluster for mu oracle dabase. > I have setup two node cluster using "System-Config-Cluster .." on RHEL 5.5 > I > created RG a shared file system "/emc01" ext3 , shared IP and DB script to > monitor the DB. > My cluster starts perfectly and fail over on shutting down primary node, > also stopping shared IP fails node to failover node. > But on kill PMON , or LSNR process the node does not fails and keep showing > the status services running on primary node. > > I JUST NEED TO KNOW WHERE IS THE PROBLEM. > > ATTACHED IS DB scripts and "cluster.conf" file. > > Thanks in advance for help. > > Sufyan > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From sufyan.khan at its.ws Sat May 14 19:21:25 2011 From: sufyan.khan at its.ws (Sufyan Khan) Date: Sat, 14 May 2011 22:21:25 +0300 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> Message-ID: <007301cc126c$200506b0$600f1410$@its.ws> Yes , you can see in attached script From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Parvez Shaikh Sent: Saturday, May 14, 2011 6:37 PM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Hi Sufyan Does your status function return 0 or 1 if database is up or down respectively (i.e. have you tested it works outside script_db.sh) when run as "root"? On Thu, May 12, 2011 at 12:52 PM, Sufyan Khan wrote: First of all thanks for you quick response. Secondly please note: the working "cluster.conf" file is attached here, the previous file was not correct. Yes the orainfra is the user name. Any othere clue please. sufyan -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jankowski, Chris Sent: Thursday, May 12, 2011 9:44 AM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Sufyan, What username does the instance of Oracle DB run as? Is this "orainfra" or some other username? The scripts assume a user named "orainfra". If you use a different username then you need to modify the scripts accordingly. Regards, Chris Jankowski -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Sufyan Khan Sent: Thursday, 12 May 2011 16:27 To: 'linux clustering' Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Dear All I need to setup HA cluster for mu oracle dabase. I have setup two node cluster using "System-Config-Cluster .." on RHEL 5.5 I created RG a shared file system "/emc01" ext3 , shared IP and DB script to monitor the DB. My cluster starts perfectly and fail over on shutting down primary node, also stopping shared IP fails node to failover node. But on kill PMON , or LSNR process the node does not fails and keep showing the status services running on primary node. I JUST NEED TO KNOW WHERE IS THE PROBLEM. ATTACHED IS DB scripts and "cluster.conf" file. Thanks in advance for help. Sufyan -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: script_db.sh Type: application/octet-stream Size: 814 bytes Desc: not available URL: From sufyan.khan at its.ws Sat May 14 19:25:37 2011 From: sufyan.khan at its.ws (Sufyan Khan) Date: Sat, 14 May 2011 22:25:37 +0300 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> Message-ID: <007901cc126c$b5ae03b0$210a0b10$@its.ws> I can run the script by command as root, but do see the script is running in background as a daemon, Mohammad Raza Sufyan Khan Team Leader (Technology&Infrastructure Group) Telco development Description: Description: ITS Logo.pngT. + (965) 22409100 ext. 379 M. + (965) 99871684 F. + (965) 22405201 E. sufyan.khan at its.ws Description: Description: degital signeture.png From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Parvez Shaikh Sent: Saturday, May 14, 2011 6:37 PM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Hi Sufyan Does your status function return 0 or 1 if database is up or down respectively (i.e. have you tested it works outside script_db.sh) when run as "root"? On Thu, May 12, 2011 at 12:52 PM, Sufyan Khan wrote: First of all thanks for you quick response. Secondly please note: the working "cluster.conf" file is attached here, the previous file was not correct. Yes the orainfra is the user name. Any othere clue please. sufyan -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jankowski, Chris Sent: Thursday, May 12, 2011 9:44 AM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Sufyan, What username does the instance of Oracle DB run as? Is this "orainfra" or some other username? The scripts assume a user named "orainfra". If you use a different username then you need to modify the scripts accordingly. Regards, Chris Jankowski -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Sufyan Khan Sent: Thursday, 12 May 2011 16:27 To: 'linux clustering' Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Dear All I need to setup HA cluster for mu oracle dabase. I have setup two node cluster using "System-Config-Cluster .." on RHEL 5.5 I created RG a shared file system "/emc01" ext3 , shared IP and DB script to monitor the DB. My cluster starts perfectly and fail over on shutting down primary node, also stopping shared IP fails node to failover node. But on kill PMON , or LSNR process the node does not fails and keep showing the status services running on primary node. I JUST NEED TO KNOW WHERE IS THE PROBLEM. ATTACHED IS DB scripts and "cluster.conf" file. Thanks in advance for help. Sufyan -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 130 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: image002.png Type: image/png Size: 175 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image003.png Type: image/png Size: 5748 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image004.png Type: image/png Size: 8941 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image005.jpg Type: image/jpeg Size: 3350 bytes Desc: not available URL: From raju.rajsand at gmail.com Sat May 14 21:13:30 2011 From: raju.rajsand at gmail.com (Rajagopal Swaminathan) Date: Sun, 15 May 2011 02:43:30 +0530 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: <007901cc126c$b5ae03b0$210a0b10$@its.ws> References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> <007901cc126c$b5ae03b0$210a0b10$@its.ws> Message-ID: Greetings, On Sun, May 15, 2011 at 12:55 AM, Sufyan Khan wrote: > I can run the script by command as root, but do see the script is running > in background as a daemon, > > > > > > Mohammad Raza Sufyan Khan > Team Leader (Technology&Infrastructure Group) > > Telco development > a stupid question: Have you used chkconfig to switchoff all the cluster controlled services? IMHO, They should be -- Regards, Rajagopal -------------- next part -------------- An HTML attachment was scrubbed... URL: From sufyan.khan at its.ws Sun May 15 07:06:21 2011 From: sufyan.khan at its.ws (Sufyan Khan) Date: Sun, 15 May 2011 10:06:21 +0300 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> <007901cc126c$b5ae03b0$210a0b10$@its.ws> Message-ID: <000901cc12ce$99532df0$cbf989d0$@its.ws> There is writing mistake, I cannot see the script is running in background. Off course I stop the cluster then I run the manual script. From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Rajagopal Swaminathan Sent: Sunday, May 15, 2011 12:14 AM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Greetings, On Sun, May 15, 2011 at 12:55 AM, Sufyan Khan wrote: I can run the script by command as root, but do see the script is running in background as a daemon, Mohammad Raza Sufyan Khan Team Leader (Technology&Infrastructure Group) Telco development a stupid question: Have you used chkconfig to switchoff all the cluster controlled services? IMHO, They should be -- Regards, Rajagopal -------------- next part -------------- An HTML attachment was scrubbed... URL: From mguazzardo76 at gmail.com Sun May 15 08:28:52 2011 From: mguazzardo76 at gmail.com (Marcelo Guazzardo) Date: Sun, 15 May 2011 05:28:52 -0300 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: <000901cc12ce$99532df0$cbf989d0$@its.ws> References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> <007901cc126c$b5ae03b0$210a0b10$@its.ws> <000901cc12ce$99532df0$cbf989d0$@its.ws> Message-ID: HI sufyan Maybe is irrelevant, but, do you try with oracledb.sh script?. 
I use that script and all work fine for me.... Regards, Marcelo PS: that script is placed in /usr/share/cluster/oracledb.sh, you must modify the oracle settings, like orauser, db_instance_name, and db_virtual name. 2011/5/15 Sufyan Khan > There is writing mistake, I *cannot* see the script is running in > background. > > > > Off course I stop the cluster then I run the manual script. > > > > > > > > *From:* linux-cluster-bounces at redhat.com [mailto: > linux-cluster-bounces at redhat.com] *On Behalf Of *Rajagopal Swaminathan > *Sent:* Sunday, May 15, 2011 12:14 AM > > *To:* linux clustering > *Subject:* Re: [Linux-cluster] oracle DB is not failing over on killin > PMON deamon > > > > Greetings, > > On Sun, May 15, 2011 at 12:55 AM, Sufyan Khan wrote: > > I ca nd as root, but do see the script is running in background as a > daemon, > > > > > > Mohammad Raza Sufyan Khan > Team Leader (Technology&Infrastructure Group) > > Telco development > > > > a stupid question: Have you used chkconfig to switchoff all the cluster > controlled services? > > ! IMHO, The >Regards, > > Rajagopal > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From sufyan.khan at its.ws Sun May 15 13:49:54 2011 From: sufyan.khan at its.ws (Sufyan Khan) Date: Sun, 15 May 2011 16:49:54 +0300 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> <007901cc126c$b5ae03b0$210a0b10$@its.ws> <000901cc12ce$99532df0$cbf989d0$@its.ws> Message-ID: <007801cc1306$f93e6850$ebbb38f0$@its.ws> Thanks for the tips, I need only oracle database to be monitor by Cluster NOT OPMN ( application server) Any clue sufyan From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Marcelo Guazzardo Sent: Sunday, May 15, 2011 11:29 AM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon HI sufyan Maybe is irrelevant, but, do you try with oracledb.sh script?. I use that script and all work fine for me.... Regards, Marcelo PS: that script is placed in /usr/share/cluster/oracledb.sh, you must modify the oracle settings, like orauser, db_instance_name, and db_virtual name. 2011/5/15 Sufyan Khan There is writing mistake, I cannot see the script is running in background. Off course I stop the cluster then I run the manual script. From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Rajagopal Swaminathan Sent: Sunday, May 15, 2011 12:14 AM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Greetings, On Sun, May 15, 2011 at 12:55 AM, Sufyan Khan wrote: I ca nd as root, but do see the script is running in background as a daemon, Mohammad Raza Sufyan Khan Team Leader (Technology&Infrastructure Group) Telco development a stupid question: Have you used chkconfig to switchoff all the cluster controlled services? ! 
IMHO, The >Regards, Rajagopal -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From mguazzardo76 at gmail.com Sun May 15 16:02:42 2011 From: mguazzardo76 at gmail.com (Marcelo Guazzardo) Date: Sun, 15 May 2011 13:02:42 -0300 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: <007801cc1306$f93e6850$ebbb38f0$@its.ws> References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> <007901cc126c$b5ae03b0$210a0b10$@its.ws> <000901cc12ce$99532df0$cbf989d0$@its.ws> <007801cc1306$f93e6850$ebbb38f0$@its.ws> Message-ID: ok, this script checks the listener and db , if you select "base" in database type. I 've worked with oracle10g r2 , and it worked fine for me. Thanks 2011/5/15 Sufyan Khan > Thanks for the tips, > > > > I need only oracle database to be monitor by Cluster NOT OPMN ( application > server) > > Any clue > > > > sufyan > > > > *From:* linux-cluster-bounces at redhat.com [mailto: > linux-cluster-bounces at redhat.com] *On Behalf Of *Marcelo Guazzardo > *Sent:* Sunday, May 15, 2011 11:29 AM > > *To:* linux clustering > *Subject:* Re: [Linux-cluster] oracle DB is not failing over on killin > PMON deamon > > > > HI sufyan > > > Maybe is irrelevant, but, do you try with oracledb.sh script?. > I use that script and all work fine for me.... > > Regards, > Marcelo > PS: that script is placed in /usr/share/cluster/oracledb.sh, you must > modify the oracle settings, like orauser, db_instance_name, and db_virtual > name. > > 2011/5/15 Sufyan! Khan < han at its.ws">sufyan.khan at its.ws> > > There is writing mistake, I *cannot* see the script is running in > background. > > > > Off course I stop the cluster then I run the manual script. > > > > > > > > *From:* linux-cluster-bounces at redhat.com [mailto: > linux-cluster-bounces at redhat.com] *On Behalf Of *Rajagopal Swaminathan > *Sent:* Sunday, May 15, 2011 12:14 AM > > > *To:* linux clustering > *Subject:* Re: [Linux-cluster] oracle DB is not failing over on killin > PMON deamon > > > > Greetings, > > I ca nd as root, but do see the script is running in background as a > daemon, > > > > > > Mohammad Raza Sufyan Khan > Team Leader (Technology&Infrastructure Group) > > Telco development > > > > a stupid question: Have you used chkconfig to switchoff all the cluster > controlled services? > > ! IMHO, The >Regards, > > Rajagopal > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > -- > Marcelo Guazzardo > mguazzardo76 at gmail.com > http://mguazzardo.blogspot.com > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From sufyan.khan at its.ws Sun May 15 18:14:10 2011 From: sufyan.khan at its.ws (Sufyan Khan) Date: Sun, 15 May 2011 21:14:10 +0300 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> <007901cc126c$b5ae03b0$210a0b10$@its.ws> <000901cc12ce$99532df0$cbf989d0$@its.ws> <007801cc1306$f93e6850$ebbb38f0$@its.ws> Message-ID: <008e01cc132b$e4692800$ad3b7800$@its.ws> Will you share if it is not confidential. From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Marcelo Guazzardo Sent: Sunday, May 15, 2011 7:03 PM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon ok, this script checks the listener and db , if you select "base" in database type. I 've worked with oracle10g r2 , and it worked fine for me. Thanks 2011/5/15 Sufyan Khan Thanks for the tips, I need only oracle database to be monitor by Cluster NOT OPMN ( application server) Any clue sufyan From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Marcelo Guazzardo Sent: Sunday, May 15, 2011 11:29 AM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon HI sufyan Maybe is irrelevant, but, do you try with oracledb.sh script?. I use that script and all work fine for me.... Regards, Marcelo PS: that script is placed in /usr/share/cluster/oracledb.sh, you must modify the oracle settings, like orauser, db_instance_name, and db_virtual name. 2011/5/15 Sufyan! Khan < han at its.ws">sufyan.khan at its.ws> There is writing mistake, I cannot see the script is running in background. Off course I stop the cluster then I run the manual script. From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Rajagopal Swaminathan Sent: Sunday, May 15, 2011 12:14 AM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Greetings, I ca nd as root, but do see the script is running in background as a daemon, Mohammad Raza Sufyan Khan Team Leader (Technology&Infrastructure Group) Telco development a stupid question: Have you used chkconfig to switchoff all the cluster controlled services? ! IMHO, The >Regards, Rajagopal -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com From linux at alteeve.com Mon May 16 03:39:10 2011 From: linux at alteeve.com (Digimer) Date: Sun, 15 May 2011 23:39:10 -0400 Subject: [Linux-cluster] Announcing - RHCS2 on EL5, Xen, DRBD and rgmanager 2-node cluster tutorial Message-ID: <4DD09C5E.70508@alteeve.com> Two years ago, I set out to learn clustering. I decided the best way to ensure that I learned it properly would be to write down, as a tutorial. I expect many warts to be found, but I think it is done enough to "officially" announce it, in hopes that it might help others. This tutorial shows how to build a 2-node cluster using Red Hat's Cluster Service Stable 2, using rgmanager for resource management, DRBD and Clustered LVM for shared storage, GFS2 for definition file storage and Xen for virtualization. 
The tutorial can be found here: http://wiki.alteeve.com/index.php/Red_Hat_Cluster_Service_2_Tutorial Anyone who has been around the #linux-cluster IRC channel has probably heard me talking about this tutorial. I need to give a tremendous thank you to many of the regulars in that channel. I've put a "thanks" section at the end, but it is woefully short of all the people who have helped me over the last two years. :) Any and all feedback, particularly critical ones, are welcome and appreciated! -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org "I feel confined, only free to expand myself within boundaries." From Chris.Jankowski at hp.com Mon May 16 04:29:04 2011 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Mon, 16 May 2011 04:29:04 +0000 Subject: [Linux-cluster] Announcing - RHCS2 on EL5, Xen, DRBD and rgmanager 2-node cluster tutorial In-Reply-To: <4DD09C5E.70508@alteeve.com> References: <4DD09C5E.70508@alteeve.com> Message-ID: <036B68E61A28CA49AC2767596576CD596F66412D43@GVW1113EXC.americas.hpqcorp.net> Digimer, I think you published an earlier version before. Isn't it the time to introduce versioning, release dates and also list of deltas from version to version? Mundane things, I know. But if you want to make this a useful document for others they are all very necessary, I think. Regards, Chris Jankowski -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Digimer Sent: Monday, 16 May 2011 13:39 To: linux clustering Subject: [Linux-cluster] Announcing - RHCS2 on EL5, Xen, DRBD and rgmanager 2-node cluster tutorial Two years ago, I set out to learn clustering. I decided the best way to ensure that I learned it properly would be to write down, as a tutorial. I expect many warts to be found, but I think it is done enough to "officially" announce it, in hopes that it might help others. This tutorial shows how to build a 2-node cluster using Red Hat's Cluster Service Stable 2, using rgmanager for resource management, DRBD and Clustered LVM for shared storage, GFS2 for definition file storage and Xen for virtualization. The tutorial can be found here: http://wiki.alteeve.com/index.php/Red_Hat_Cluster_Service_2_Tutorial Anyone who has been around the #linux-cluster IRC channel has probably heard me talking about this tutorial. I need to give a tremendous thank you to many of the regulars in that channel. I've put a "thanks" section at the end, but it is woefully short of all the people who have helped me over the last two years. :) Any and all feedback, particularly critical ones, are welcome and appreciated! -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org "I feel confined, only free to expand myself within boundaries." 
-- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From linux at alteeve.com Mon May 16 04:33:30 2011 From: linux at alteeve.com (Digimer) Date: Mon, 16 May 2011 00:33:30 -0400 Subject: [Linux-cluster] Announcing - RHCS2 on EL5, Xen, DRBD and rgmanager 2-node cluster tutorial In-Reply-To: <036B68E61A28CA49AC2767596576CD596F66412D43@GVW1113EXC.americas.hpqcorp.net> References: <4DD09C5E.70508@alteeve.com> <036B68E61A28CA49AC2767596576CD596F66412D43@GVW1113EXC.americas.hpqcorp.net> Message-ID: <4DD0A91A.3090701@alteeve.com> On 05/16/2011 12:29 AM, Jankowski, Chris wrote: > Digimer, > > I think you published an earlier version before. > Isn't it the time to introduce versioning, release dates and also list of deltas from version to version? > > Mundane things, I know. But if you want to make this a useful document for others they are all very necessary, I think. > > Regards, > > Chris Jankowski I mentioned it to some people off-list as it was being developed, but this is the first "official" announcement/release. You comment is valid, and partly addressed by the medium of being a wiki. Changes through time can be seen and tracked using the "History" button at the top of the page. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org "I feel confined, only free to expand myself within boundaries." From corey.kovacs at gmail.com Mon May 16 04:58:41 2011 From: corey.kovacs at gmail.com (Corey Kovacs) Date: Mon, 16 May 2011 05:58:41 +0100 Subject: [Linux-cluster] Announcing - RHCS2 on EL5, Xen, DRBD and rgmanager 2-node cluster tutorial In-Reply-To: <4DD0A91A.3090701@alteeve.com> References: <4DD09C5E.70508@alteeve.com> <036B68E61A28CA49AC2767596576CD596F66412D43@GVW1113EXC.americas.hpqcorp.net> <4DD0A91A.3090701@alteeve.com> Message-ID: Nice job, I am sure this will help quite a few... -C On Mon, May 16, 2011 at 5:33 AM, Digimer wrote: > On 05/16/2011 12:29 AM, Jankowski, Chris wrote: > > Digimer, > > > > I think you published an earlier version before. > > Isn't it the time to introduce versioning, release dates and also list of > deltas from version to version? > > > > Mundane things, I know. But if you want to make this a useful document > for others they are all very necessary, I think. > > > > Regards, > > > > Chris Jankowski > > I mentioned it to some people off-list as it was being developed, but > this is the first "official" announcement/release. You comment is valid, > and partly addressed by the medium of being a wiki. Changes through time > can be seen and tracked using the "History" button at the top of the page. > > -- > Digimer > E-Mail: digimer at alteeve.com > AN!Whitepapers: http://alteeve.com > Node Assassin: http://nodeassassin.org > "I feel confined, only free to expand myself within boundaries." > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From swhiteho at redhat.com Mon May 16 09:15:56 2011 From: swhiteho at redhat.com (Steven Whitehouse) Date: Mon, 16 May 2011 10:15:56 +0100 Subject: [Linux-cluster] Write Performance Issues with GFS2 In-Reply-To: <1466017320.34706.1305325265203.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> References: <1466017320.34706.1305325265203.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: <1305537356.2855.1.camel@menhir> Hi, On Fri, 2011-05-13 at 18:21 -0400, Bob Peterson wrote: > ----- Original Message ----- > | On 12/05/11 00:32, Ramiro Blanco wrote: > | > | >> https://bugzilla.redhat.com/show_bug.cgi?id=683155 > | > Can't access that one: "You are not authorized to access bug > | > #683155" > | > | There's no reason this bug should be private, however it's addressed > | in > | test kernel kernel-2.6.18-248.el5 > | > | Steve/Bob, how about opening this one up for public view? > > Sounds okay to me. Not sure how that's done, and not sure if I have > the right authority in bugzilla to do it. > You can just untick all the boxes which restrict it to certain groups, which I've now done, Steve. From mammadshah at hotmail.com Mon May 16 09:34:42 2011 From: mammadshah at hotmail.com (Muhammad Ammad Shah) Date: Mon, 16 May 2011 15:34:42 +0600 Subject: [Linux-cluster] rhel5.5 GFS2 Message-ID: Hi, I want to force mount the filesystem before starting other services and when relocating the services to another node, the other services should be stopped before filesystem should be unmounted on active node. Thanks, Muhammad Ammad Shah From mra at webtel.pl Mon May 16 11:29:06 2011 From: mra at webtel.pl (mr) Date: Mon, 16 May 2011 13:29:06 +0200 Subject: [Linux-cluster] gfs2 setting quota problem In-Reply-To: <802130008.429868.1305034027382.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com> References: <802130008.429868.1305034027382.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com> Message-ID: <4DD10A82.3000904@webtel.pl> Hello, No, I don't think so... Anyway I was able to finally set the quotes on my FS after two fails. I rebooted the server and moved the mounting point from /tmp/test to /mnt/test... It seems real strange to me but it worked. I can not find any reasonable explanation of that.... As I saw in strace gfs2 uses /tmp/ to mount its meta so maybe it was sth with that... The other thing is I'm trying to use gfs2_quota in chroot env. After some tests and changes I'm able to use gfs2_quota get command without any errors but gfs2_quota limit and gfs2_quota warn make some error although it works... "Warning: This filesystem doesn't seem to have the new quota list format or the quota list is corrupt. list, check and init operation performance will suffer due to this. It is recommended that you run the 'gfs2_quota reset' operation to reset the quota file. All current quota information will be lost and you will have to reassign all quota limits and warnings" In "real" env everything is ok, without any errors. I have already mount /dev, /proc and /sys in chroot env... I have noticed in strace output that the salt in chroot env is not being generated during quota tasks: fg. good ("real"): oldumount("/tmp/.gfs2meta.4Hd5aR") bad: oldumount("/tmp/.gfs2meta") I think sth is missing.... Any ideas? 
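One thing worth ruling out, since gfs2_quota creates and mounts a temporary directory for the GFS2 meta filesystem (the /tmp/.gfs2meta.XXXXXX entries in the straces): check that the chroot offers everything that mount needs. A rough sketch of the preparation, where the chroot path and user are placeholders and /mnt/test is the GFS2 mount point mentioned above:

#!/bin/sh
# Minimal chroot preparation for running gfs2_quota (paths are examples).
CHROOT=/var/chroot/gfs2admin
mkdir -p $CHROOT/tmp $CHROOT/dev $CHROOT/proc $CHROOT/sys $CHROOT/mnt/test
chmod 1777 $CHROOT/tmp                      # writable /tmp for the temporary meta mount point
mount --bind /dev  $CHROOT/dev
mount --bind /proc $CHROOT/proc
mount --bind /sys  $CHROOT/sys
mount --bind /mnt/test $CHROOT/mnt/test     # the GFS2 mount itself must be visible inside
chroot $CHROOT gfs2_quota get -u someuser -f /mnt/test

The missing random suffix on /tmp/.gfs2meta inside the chroot suggests the temporary-directory step falls back to a fixed name there, so a stale /tmp/.gfs2meta left over from an earlier failed run inside the chroot is also worth removing before retrying.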
Abhijith Das pisze: > ----- Original Message ----- > >> From: "mr" >> To: "linux clustering" >> Sent: Tuesday, May 10, 2011 1:06:38 AM >> Subject: Re: [Linux-cluster] gfs2 setting quota problem >> Hi, >> Steven Whitehouse pisze: >> >>> Hi, >>> >>> On Mon, 2011-05-09 at 15:32 +0200, mr wrote: >>> >>> >>>> Hello, >>>> I'm having problem to init gfs2 quota on my existing FS. >>>> >>>> I have 2TB gfs2 FS which is being used in 50%. I have decided to >>>> set up >>>> quotas. Setting warning and limit levels seemed OK - no errors >>>> (athought >>>> I had to reset all my existing setting gfs2_quota reset...) New >>>> quota >>>> calculation ends with the following error: >>>> >>>> gfs2_quota: Couldn't mount /tmp/.gfs2meta.WUXfKC : Invalid argument >>>> >>>> >>>> >>> Are you using selinux? The gfs2_quota tool tries to mount the GFS2 >>> metafs in order to make the changes that you requested. For some >>> reason >>> it seems this mount is failing. >>> >>> >> Selinux is diabled. I'm also able to mount gfs2meta manually. >> >>>> Getting some quota values fails - I'm always getting "value: 0.0" >>>> :( >>>> >>>> I have no idea what is wrong... Sombody could help? thx in advance >>>> >>>> Details: >>>> 2.6.18-194.11.1.el5 >>>> /tmp/test type gfs2 >>>> (rw,noatime,lockproto=lock_nolock,localflocks,localcaching,quota=on) >>>> gfs2-utils.i386 0.1.62-28.el5_6.1 >>>> kmod-gfs.i686 0.1.34-2.el5 >>>> cman.i386 2.0.98-1.el5_3.4 >>>> >>>> >>>> >>>> >>>> >>> Is this CentOS or a real RHEL installation? >>> >>> >> Centos. >> >>> Steve. >>> >>> > > Hi, > > I found this bz: https://bugzilla.redhat.com/show_bug.cgi?id=459630, but the package versions you list are pretty recent and this was fixed quite a while ago. Are there any older gfs2-utils bits lying around? > > I'd like to see the strace of the gfs2_quota command that triggers the meta-mount error. Please also include the command line in your output. > > Thanks! > --Abhi > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- mr From swhiteho at redhat.com Mon May 16 11:43:12 2011 From: swhiteho at redhat.com (Steven Whitehouse) Date: Mon, 16 May 2011 12:43:12 +0100 Subject: [Linux-cluster] gfs2 setting quota problem In-Reply-To: <4DD10A82.3000904@webtel.pl> References: <802130008.429868.1305034027382.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com> <4DD10A82.3000904@webtel.pl> Message-ID: <1305546192.2855.7.camel@menhir> Hi, On Mon, 2011-05-16 at 13:29 +0200, mr wrote: > Hello, > No, I don't think so... > > Anyway I was able to finally set the quotes on my FS after two fails. I > rebooted the server and moved the mounting point from /tmp/test to > /mnt/test... It seems real strange to me but it worked. I can not find > any reasonable explanation of that.... As I saw in strace gfs2 uses > /tmp/ to mount its meta so maybe it was sth with that... > Possibly... > The other thing is I'm trying to use gfs2_quota in chroot env. After > some tests and changes I'm able to use gfs2_quota get command without > any errors but gfs2_quota limit and gfs2_quota warn make some error > although it works... > > "Warning: This filesystem doesn't seem to have the new quota list format > or the quota list is corrupt. list, check and init operation performance > will suffer due to this. It is recommended that you run the 'gfs2_quota > reset' operation to reset the quota file. 
All current quota information > will be lost and you will have to reassign all quota limits and warnings" > That sounds like a pretty old version of gfs2_quota. > In "real" env everything is ok, without any errors. I have already mount > /dev, /proc and /sys in chroot env... > > I have noticed in strace output that the salt in chroot env is not > being generated during quota tasks: fg. > > good ("real"): > oldumount("/tmp/.gfs2meta.4Hd5aR") > > bad: > oldumount("/tmp/.gfs2meta") > > I think sth is missing.... Any ideas? > One of the problems with CentOS is that it doesn't have our more recent fixes. If you used Fedora or another more uptodate distro then this problem should have long since been fixed. Also with the latest Fedora (Abhi should be able to confirm the exact version) then the standard system quota tools are available to use with GFS2. The plan is to get rid of gfs2_quota (probably fairly shortly in Fedora, but it will stay much longer in RHEL - until the end of the release, of course) and use exclusively the system quota-tools package, Steve. > Abhijith Das pisze: > > ----- Original Message ----- > > > >> From: "mr" > >> To: "linux clustering" > >> Sent: Tuesday, May 10, 2011 1:06:38 AM > >> Subject: Re: [Linux-cluster] gfs2 setting quota problem > >> Hi, > >> Steven Whitehouse pisze: > >> > >>> Hi, > >>> > >>> On Mon, 2011-05-09 at 15:32 +0200, mr wrote: > >>> > >>> > >>>> Hello, > >>>> I'm having problem to init gfs2 quota on my existing FS. > >>>> > >>>> I have 2TB gfs2 FS which is being used in 50%. I have decided to > >>>> set up > >>>> quotas. Setting warning and limit levels seemed OK - no errors > >>>> (athought > >>>> I had to reset all my existing setting gfs2_quota reset...) New > >>>> quota > >>>> calculation ends with the following error: > >>>> > >>>> gfs2_quota: Couldn't mount /tmp/.gfs2meta.WUXfKC : Invalid argument > >>>> > >>>> > >>>> > >>> Are you using selinux? The gfs2_quota tool tries to mount the GFS2 > >>> metafs in order to make the changes that you requested. For some > >>> reason > >>> it seems this mount is failing. > >>> > >>> > >> Selinux is diabled. I'm also able to mount gfs2meta manually. > >> > >>>> Getting some quota values fails - I'm always getting "value: 0.0" > >>>> :( > >>>> > >>>> I have no idea what is wrong... Sombody could help? thx in advance > >>>> > >>>> Details: > >>>> 2.6.18-194.11.1.el5 > >>>> /tmp/test type gfs2 > >>>> (rw,noatime,lockproto=lock_nolock,localflocks,localcaching,quota=on) > >>>> gfs2-utils.i386 0.1.62-28.el5_6.1 > >>>> kmod-gfs.i686 0.1.34-2.el5 > >>>> cman.i386 2.0.98-1.el5_3.4 > >>>> > >>>> > >>>> > >>>> > >>>> > >>> Is this CentOS or a real RHEL installation? > >>> > >>> > >> Centos. > >> > >>> Steve. > >>> > >>> > > > > Hi, > > > > I found this bz: https://bugzilla.redhat.com/show_bug.cgi?id=459630, but the package versions you list are pretty recent and this was fixed quite a while ago. Are there any older gfs2-utils bits lying around? > > > > I'd like to see the strace of the gfs2_quota command that triggers the meta-mount error. Please also include the command line in your output. > > > > Thanks! 
> > --Abhi > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > From mguazzardo76 at gmail.com Mon May 16 12:20:07 2011 From: mguazzardo76 at gmail.com (Marcelo Guazzardo) Date: Mon, 16 May 2011 09:20:07 -0300 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: <008e01cc132b$e4692800$ad3b7800$@its.ws> References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> <007901cc126c$b5ae03b0$210a0b10$@its.ws> <000901cc12ce$99532df0$cbf989d0$@its.ws> <007801cc1306$f93e6850$ebbb38f0$@its.ws> <008e01cc132b$e4692800$ad3b7800$@its.ws> Message-ID: Hy sufyan I sent two files. First, is cluster.conf, second, is the oracledb.sh , this file must be placed in /usr/share/cluster in both nodes. (Or if you have more nodes, in all nodes). In oracledb.sh I 've changed oracle_user, oracle_sid, type of database, I used base (for monitor and listener), and virtual ip, If you have any doubt, just let me know I hope that helps you Regards, Marcelo 2011/5/15 Sufyan Khan > Will you share if it is not confidential. > > > > *From:* linux-cluster-bounces at redhat.com [mailto: > linux-cluster-bounces at redhat.com] *On Behalf Of *Marcelo Guazzardo > *Sent:* Sunday, May 15, 2011 7:03 PM > > *To:* linux clustering > *Subject:* Re: [Linux-cluster] oracle DB is not failing over on killin > PMON deamon > > > > ok, this script checks the listener a! nd db , i uot; in database type. > > I 've worked with oracle10g r2 , and it worked fine for me. > Thanks > > 2011/5/15 Sufyan Khan > > Thanks for the tips, > > > > I need only oracle database to be monitor by Cluster NOT OPMN ( application > server) > > Any clue > > > > sufyan > > > > *From:* linux-cluster-bounces at redhat.com [mailto: > linux-cluster-bounces at redhat.com] *On Behalf Of *Marcelo Guazzardo > *Sent:* Sunday, May 15, 2011 11:29 AM > > > *To:* linux clustering > *Subject:* Re: [Linux-cluster] oracle DB is not failing over on killin > PMON deamon > > > > HI sufyan > > > > Maybe is irrelevant, but, do you try with oracledb.sh script?. > I use that script and all work fine for me.... > > Regards, > Marcelo > PS: that script is placed in /usr/share/cluster/oracledb.sh, you must > modify the oracle settings, like orauser, db_instance_name, and db_virtual > name. > > 2011/5/15 Sufyan! Khan < han at its.ws">sufyan.khan at its.ws> > > There is writing mistake, I *cannot* see the script is running in > background.! > > Off course I stop the cluster then I run the manual script. > > > > > > > > *From:* linux-cluster-bounces at redhat.com [mailto: > linux-cluster-bounces at redhat.com] *On Behalf Of *Rajagopal Swaminathan > *Sent:* Sunday, May 15, 2011 12:14 AM > > > *To:* linux clustering > *Subject:* Re: [Linux-cluster] oracle DB is not failing over on killin > PMON deamon > > > > Greetings, > > I ca nd as root, but do see the script is running in background as a > daemon, > > > > > > Mohammad Raza Sufyan Khan > Team Leader (Technology&Infrastructure Group) > > Telco development > > > > a stupid question: Have you used chkconfig to switchoff all the cluster > controlled services? > > ! IMHO, The >Regards, > > ! 
Rajagopal v> > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > -- > Marcelo Guazzardo > mguazzardo76 at gmail.com > http://mguazzardo.blogspot.com > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-clust! er > > > > -- > Marcelo Guazzardo > mguazzardo76 at gmail.com > http://mguazzardo.blogspot.com > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- -------------- next part -------------- A non-text attachment was scrubbed... Name: oracledb.sh Type: application/x-sh Size: 21744 bytes Desc: not available URL: From Gert.Wieberdink at enovation.nl Mon May 16 12:21:18 2011 From: Gert.Wieberdink at enovation.nl (Gert Wieberdink) Date: Mon, 16 May 2011 14:21:18 +0200 Subject: [Linux-cluster] gfs2 setting quota problem In-Reply-To: <1305546192.2855.7.camel@menhir> References: <802130008.429868.1305034027382.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com> <4DD10A82.3000904@webtel.pl> <1305546192.2855.7.camel@menhir> Message-ID: <8634845864125D4D9B397A3E598995980555800F5C@MBX.emd.enovation.net> bij deze Met vriendelijke groet/With kind regards, Gert Wieberdink Sr. Engineer -----Oorspronkelijk bericht----- Van: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] Namens Steven Whitehouse Verzonden: maandag 16 mei 2011 13:43 Aan: linux clustering Onderwerp: Re: [Linux-cluster] gfs2 setting quota problem Hi, On Mon, 2011-05-16 at 13:29 +0200, mr wrote: > Hello, > No, I don't think so... > > Anyway I was able to finally set the quotes on my FS after two fails. I > rebooted the server and moved the mounting point from /tmp/test to > /mnt/test... It seems real strange to me but it worked. I can not find > any reasonable explanation of that.... As I saw in strace gfs2 uses > /tmp/ to mount its meta so maybe it was sth with that... > Possibly... > The other thing is I'm trying to use gfs2_quota in chroot env. After > some tests and changes I'm able to use gfs2_quota get command without > any errors but gfs2_quota limit and gfs2_quota warn make some error > although it works... > > "Warning: This filesystem doesn't seem to have the new quota list format > or the quota list is corrupt. list, check and init operation performance > will suffer due to this. It is recommended that you run the 'gfs2_quota > reset' operation to reset the quota file. All current quota information > will be lost and you will have to reassign all quota limits and warnings" > That sounds like a pretty old version of gfs2_quota. > In "real" env everything is ok, without any errors. I have already mount > /dev, /proc and /sys in chroot env... > > I have noticed in strace output that the salt in chroot env is not > being generated during quota tasks: fg. > > good ("real"): > oldumount("/tmp/.gfs2meta.4Hd5aR") > > bad: > oldumount("/tmp/.gfs2meta") > > I think sth is missing.... Any ideas? > One of the problems with CentOS is that it doesn't have our more recent fixes. If you used Fedora or another more uptodate distro then this problem should have long since been fixed. 
Also with the latest Fedora (Abhi should be able to confirm the exact version) then the standard system quota tools are available to use with GFS2. The plan is to get rid of gfs2_quota (probably fairly shortly in Fedora, but it will stay much longer in RHEL - until the end of the release, of course) and use exclusively the system quota-tools package, Steve. > Abhijith Das pisze: > > ----- Original Message ----- > > > >> From: "mr" > >> To: "linux clustering" > >> Sent: Tuesday, May 10, 2011 1:06:38 AM > >> Subject: Re: [Linux-cluster] gfs2 setting quota problem > >> Hi, > >> Steven Whitehouse pisze: > >> > >>> Hi, > >>> > >>> On Mon, 2011-05-09 at 15:32 +0200, mr wrote: > >>> > >>> > >>>> Hello, > >>>> I'm having problem to init gfs2 quota on my existing FS. > >>>> > >>>> I have 2TB gfs2 FS which is being used in 50%. I have decided to > >>>> set up > >>>> quotas. Setting warning and limit levels seemed OK - no errors > >>>> (athought > >>>> I had to reset all my existing setting gfs2_quota reset...) New > >>>> quota > >>>> calculation ends with the following error: > >>>> > >>>> gfs2_quota: Couldn't mount /tmp/.gfs2meta.WUXfKC : Invalid argument > >>>> > >>>> > >>>> > >>> Are you using selinux? The gfs2_quota tool tries to mount the GFS2 > >>> metafs in order to make the changes that you requested. For some > >>> reason > >>> it seems this mount is failing. > >>> > >>> > >> Selinux is diabled. I'm also able to mount gfs2meta manually. > >> > >>>> Getting some quota values fails - I'm always getting "value: 0.0" > >>>> :( > >>>> > >>>> I have no idea what is wrong... Sombody could help? thx in advance > >>>> > >>>> Details: > >>>> 2.6.18-194.11.1.el5 > >>>> /tmp/test type gfs2 > >>>> (rw,noatime,lockproto=lock_nolock,localflocks,localcaching,quota=on) > >>>> gfs2-utils.i386 0.1.62-28.el5_6.1 > >>>> kmod-gfs.i686 0.1.34-2.el5 > >>>> cman.i386 2.0.98-1.el5_3.4 > >>>> > >>>> > >>>> > >>>> > >>>> > >>> Is this CentOS or a real RHEL installation? > >>> > >>> > >> Centos. > >> > >>> Steve. > >>> > >>> > > > > Hi, > > > > I found this bz: https://bugzilla.redhat.com/show_bug.cgi?id=459630, but the package versions you list are pretty recent and this was fixed quite a while ago. Are there any older gfs2-utils bits lying around? > > > > I'd like to see the strace of the gfs2_quota command that triggers the meta-mount error. Please also include the command line in your output. > > > > Thanks! 
> > --Abhi > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From sufyan.khan at its.ws Mon May 16 13:01:44 2011 From: sufyan.khan at its.ws (Sufyan Khan) Date: Mon, 16 May 2011 16:01:44 +0300 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> <007901cc126c$b5ae03b0$210a0b10$@its.ws> <000901cc12ce$99532df0$cbf989d0$@its.ws> <007801cc1306$f93e6850$ebbb38f0$@its.ws> <008e01cc132b$e4692800$ad3b7800$@its.ws> Message-ID: <009701cc13c9$68fd80a0$3af881e0$@its.ws> Thanks Marcelo Thanks for support and help Let me try and update From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Marcelo Guazzardo Sent: Monday, May 16, 2011 3:20 PM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Hy sufyan I sent two files. First, is cluster.conf, second, is the oracledb.sh , this file must be placed in /usr/share/cluster in both nodes. (Or if you have more nodes, in all nodes). In oracledb.sh I 've changed oracle_user, oracle_sid, type of database, I used base (for monitor and listener), and virtual ip, If you have any doubt, just let me know I hope that helps you Regards, Marcelo 2011/5/15 Sufyan Khan Will you share if it is not confidential. From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Marcelo Guazzardo Sent: Sunday, May 15, 2011 7:03 PM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon ok, this script checks the listener a! nd db , i uot; in database type. I 've worked with oracle10g r2 , and it worked fine for me. Thanks 2011/5/15 Sufyan Khan Thanks for the tips, I need only oracle database to be monitor by Cluster NOT OPMN ( application server) Any clue sufyan From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Marcelo Guazzardo Sent: Sunday, May 15, 2011 11:29 AM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon HI sufyan Maybe is irrelevant, but, do you try with oracledb.sh script?. I use that script and all work fine for me.... Regards, Marcelo PS: that script is placed in /usr/share/cluster/oracledb.sh, you must modify the oracle settings, like orauser, db_instance_name, and db_virtual name. 2011/5/15 Sufyan! Khan < han at its.ws">sufyan.khan at its.ws> There is writing mistake, I cannot see the script is running in background.! Off course I stop the cluster then I run the manual script. From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Rajagopal Swaminathan Sent: Sunday, May 15, 2011 12:14 AM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Greetings, I ca nd as root, but do see the script is running in background as a daemon, Mohammad Raza Sufyan Khan Team Leader (Technology&Infrastructure Group) Telco development a stupid question: Have you used chkconfig to switchoff all the cluster controlled services? ! IMHO, The >Regards, ! 
Rajagopal v> -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From rhayden.public at gmail.com Mon May 16 13:16:50 2011 From: rhayden.public at gmail.com (Robert Hayden) Date: Mon, 16 May 2011 08:16:50 -0500 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: <007301cc126c$200506b0$600f1410$@its.ws> References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> <007301cc126c$200506b0$600f1410$@its.ws> Message-ID: On Sat, May 14, 2011 at 2:21 PM, Sufyan Khan wrote: > > Yes , you can see in attached script I can very well be miss reading the script, but with the status function, you are returning a "0" or a "1" appropriately, but I am not sure that return value is the return value for the script_db.sh. Isn't that just the return value for the status function? Meaning, you need to set the RETVAL variable in the status function to be then returned at the end of the bash script. I don't code in bash much, so RETVAL may be a special variable. I attempted to boil down the script to test. #!/bin/bash . /etc/rc.d/init.d/functions status() { return 1 } case "$1" in status) status ;; *) echo $" Not Applicable" exit 1 esac When I run the above, I see the "0" being returned. [root ~]# ./status.ksh Not Applicable [root ~]# ./status.ksh status exiting script with [root ~]# echo $? 0 echo "exiting script with $RETVAL" exit $RETVAL > > > > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Parvez Shaikh > Sent: Saturday, May 14, 2011 6:37 PM > To: linux clustering > Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon > > > > Hi Sufyan > > Does your status function r! eturn 0 o down respectively (i.e. have you tested it works outside script_db.sh) when run as "root"? > > On Thu, May 12, 2011 at 12:52 PM, Sufyan Khan wrote: > > First of all thanks for you quick response. > > Secondly please note: ?the working "cluster.conf" file is attached here, the > previous file was not correct. > Yes the ?orainfra is the user name. > > Any othere clue please. > > sufyan > > > > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jankowski, Chris > Sent: Thursday, May 12, 2011 9:44 AM > To: linux clustering > Subject: Re: [! Linux-clu iling over on killin PMON > > deamon > > Sufyan, > > What username does the instance of Oracle DB run as? Is this "orainfra" or > some other username? > > The scripts assume a user named "orainfra". > If you use a different username then you need to modify the scripts > accordingly. 
> > Regards, > > Chris Jankowski > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Sufyan Khan > Sent: Thursday, 12 May 2011 16:27 > To: 'linux clustering' > Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon > > Dear All > > I need to setup HA cluster for mu oracle dabase. > I have setup two node cluster using "System-Config-Cluster .." on RHEL 5.5 I > created RG a ?shared fil! e system shared IP and DB script to > monitor the DB. > My cluster starts perfectly and fail over on shutting down primary node, > also stopping shared IP ?fails node to failover node. > But on kill PMON , or LSNR process the node does not fails and keep showing > the status services running on primary node. > > I JUST NEED TO KNOW WHERE IS THE PROBLEM. > > ATTACHED IS DB scripts and "cluster.conf" file. > > Thanks in advance for help. > > Sufyan > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://! www.redha nux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From mra at webtel.pl Mon May 16 13:34:09 2011 From: mra at webtel.pl (mr) Date: Mon, 16 May 2011 15:34:09 +0200 Subject: [Linux-cluster] gfs2 setting quota problem In-Reply-To: <1305546192.2855.7.camel@menhir> References: <802130008.429868.1305034027382.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com> <4DD10A82.3000904@webtel.pl> <1305546192.2855.7.camel@menhir> Message-ID: <4DD127D1.1090204@webtel.pl> ok, but why gfs2_quota works fine in "real" env and errors/warnings only appera in "chroot" env then... If this is a issue of binaries I should have seen them in both real and "chroot" env, right? Steven Whitehouse pisze: > Hi, > > On Mon, 2011-05-16 at 13:29 +0200, mr wrote: > >> Hello, >> No, I don't think so... >> >> Anyway I was able to finally set the quotes on my FS after two fails. I >> rebooted the server and moved the mounting point from /tmp/test to >> /mnt/test... It seems real strange to me but it worked. I can not find >> any reasonable explanation of that.... As I saw in strace gfs2 uses >> /tmp/ to mount its meta so maybe it was sth with that... >> >> > Possibly... > > >> The other thing is I'm trying to use gfs2_quota in chroot env. After >> some tests and changes I'm able to use gfs2_quota get command without >> any errors but gfs2_quota limit and gfs2_quota warn make some error >> although it works... >> >> "Warning: This filesystem doesn't seem to have the new quota list format >> or the quota list is corrupt. list, check and init operation performance >> will suffer due to this. It is recommended that you run the 'gfs2_quota >> reset' operation to reset the quota file. All current quota information >> will be lost and you will have to reassign all quota limits and warnings" >> >> > That sounds like a pretty old version of gfs2_quota. > > >> In "real" env everything is ok, without any errors. I have already mount >> /dev, /proc and /sys in chroot env... >> >> I have noticed in strace output that the salt in chroot env is not >> being generated during quota tasks: fg. >> >> good ("real"): >> oldumount("/tmp/.gfs2meta.4Hd5aR") >> >> bad: >> oldumount("/tmp/.gfs2meta") >> >> I think sth is missing.... Any ideas? 
>> >> > One of the problems with CentOS is that it doesn't have our more recent > fixes. If you used Fedora or another more uptodate distro then this > problem should have long since been fixed. Also with the latest Fedora > (Abhi should be able to confirm the exact version) then the standard > system quota tools are available to use with GFS2. > > The plan is to get rid of gfs2_quota (probably fairly shortly in Fedora, > but it will stay much longer in RHEL - until the end of the release, of > course) and use exclusively the system quota-tools package, > > Steve. > > >> Abhijith Das pisze: >> >>> ----- Original Message ----- >>> >>> >>>> From: "mr" >>>> To: "linux clustering" >>>> Sent: Tuesday, May 10, 2011 1:06:38 AM >>>> Subject: Re: [Linux-cluster] gfs2 setting quota problem >>>> Hi, >>>> Steven Whitehouse pisze: >>>> >>>> >>>>> Hi, >>>>> >>>>> On Mon, 2011-05-09 at 15:32 +0200, mr wrote: >>>>> >>>>> >>>>> >>>>>> Hello, >>>>>> I'm having problem to init gfs2 quota on my existing FS. >>>>>> >>>>>> I have 2TB gfs2 FS which is being used in 50%. I have decided to >>>>>> set up >>>>>> quotas. Setting warning and limit levels seemed OK - no errors >>>>>> (athought >>>>>> I had to reset all my existing setting gfs2_quota reset...) New >>>>>> quota >>>>>> calculation ends with the following error: >>>>>> >>>>>> gfs2_quota: Couldn't mount /tmp/.gfs2meta.WUXfKC : Invalid argument >>>>>> >>>>>> >>>>>> >>>>>> >>>>> Are you using selinux? The gfs2_quota tool tries to mount the GFS2 >>>>> metafs in order to make the changes that you requested. For some >>>>> reason >>>>> it seems this mount is failing. >>>>> >>>>> >>>>> >>>> Selinux is diabled. I'm also able to mount gfs2meta manually. >>>> >>>> >>>>>> Getting some quota values fails - I'm always getting "value: 0.0" >>>>>> :( >>>>>> >>>>>> I have no idea what is wrong... Sombody could help? thx in advance >>>>>> >>>>>> Details: >>>>>> 2.6.18-194.11.1.el5 >>>>>> /tmp/test type gfs2 >>>>>> (rw,noatime,lockproto=lock_nolock,localflocks,localcaching,quota=on) >>>>>> gfs2-utils.i386 0.1.62-28.el5_6.1 >>>>>> kmod-gfs.i686 0.1.34-2.el5 >>>>>> cman.i386 2.0.98-1.el5_3.4 >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>> Is this CentOS or a real RHEL installation? >>>>> >>>>> >>>>> >>>> Centos. >>>> >>>> >>>>> Steve. >>>>> >>>>> >>>>> >>> Hi, >>> >>> I found this bz: https://bugzilla.redhat.com/show_bug.cgi?id=459630, but the package versions you list are pretty recent and this was fixed quite a while ago. Are there any older gfs2-utils bits lying around? >>> >>> I'd like to see the strace of the gfs2_quota command that triggers the meta-mount error. Please also include the command line in your output. >>> >>> Thanks! >>> --Abhi >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >>> >> > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- mr From Colin.Simpson at iongeo.com Mon May 16 17:27:25 2011 From: Colin.Simpson at iongeo.com (Colin Simpson) Date: Mon, 16 May 2011 18:27:25 +0100 Subject: [Linux-cluster] Announcing - RHCS2 on EL5, Xen, DRBD and rgmanager 2-node cluster tutorial In-Reply-To: References: Message-ID: <1305566845.4224.34.camel@cowie.iouk.ioroot.tld> Recently I constructed a cluster for Intranet services. 
I too had to dig around for information to get all this going, it wasn't easy to find (I kind of hoped RH would have more recipes and worked examples out there for all the different services). So I also decided to write up my setup too, and as it looks pretty similar technology underlying (DRBD, CLVMD and GFS2) but as I required different services I thought I'd mention it here. Sadly my howto isn't as neat and tidy as yours (just in a blog) but covers: File Services (NFS) Printing Services (CUPS) DHCP DNS Server (named) Clustered Samba (ctdb) Intranet Web Service (HTTP) http://catsysadminblog.blogspot.com/2011/04/building-rhel-6centos-6-ha-cluster-for.html Hopefully might help someone else out there Thanks Colin On Mon, 2011-05-16 at 05:58 +0100, Corey Kovacs wrote: > Nice job, I am sure this will help quite a few... > > -C > > On Mon, May 16, 2011 at 5:33 AM, Digimer wrote: > On 05/16/2011 12:29 AM, Jankowski, Chris wrote: > > Digimer, > > > > I think you published an earlier version before. > > Isn't it the time to introduce versioning, release dates and > also list of deltas from version to version? > > > > Mundane things, I know. But if you want to make this a > useful document for others they are all very necessary, I > think. > > > > Regards, > > > > Chris Jankowski > > > I mentioned it to some people off-list as it was being > developed, but > this is the first "official" announcement/release. You comment > is valid, > and partly addressed by the medium of being a wiki. Changes > through time > can be seen and tracked using the "History" button at the top > of the page. > > > -- > Digimer > E-Mail: digimer at alteeve.com > AN!Whitepapers: http://alteeve.com > Node Assassin: http://nodeassassin.org > "I feel confined, only free to expand myself within > boundaries." > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > plain text document attachment (ATT666054.txt) > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster This email and any files transmitted with it are confidential and are intended solely for the use of the individual or entity to whom they are addressed. If you are not the original recipient or the person responsible for delivering the email to the intended recipient, be advised that you have received this email in error, and that any use, dissemination, forwarding, printing, or copying of this email is strictly prohibited. If you received this email in error, please immediately notify the sender and delete the original. From lhh at redhat.com Mon May 16 21:39:20 2011 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 16 May 2011 17:39:20 -0400 Subject: [Linux-cluster] rg_test for testing other resource agent functions? In-Reply-To: <20110322212940.GF13584@mip.aaaaa.org> References: <20110304194923.GX934@mip.aaaaa.org> <20110307214919.GJ17423@redhat.com> <20110322212940.GF13584@mip.aaaaa.org> Message-ID: <20110516213919.GA23451@redhat.com> On Tue, Mar 22, 2011 at 04:29:40PM -0500, Ofer Inbar wrote: > > That could be useful. > Do you have any plans to distribute this tool with cluster suite? > -- Cos > (ancient thread resurrection) https://github.com/lhh/ccs2cib The 'rgm_flatten' command is in there. -- Lon Hohberger - Red Hat, Inc. 
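For the thread subject itself: rgmanager ships an rg_test utility that can exercise individual resource agents against a copy of cluster.conf without a running cluster. A minimal sketch of the usual invocations — the configuration path is the stock one, "myservice" is a placeholder for a <service> name from the local config, and the subcommand names are as shipped with the RHEL 5 rgmanager, so check rg_test's own usage output on other releases:

# show the resource tree rgmanager would build from the configuration
rg_test test /etc/cluster/cluster.conf

# dry run: print the operation ordering without calling the agents
rg_test noop /etc/cluster/cluster.conf start service myservice

# actually run the start and stop phases of one service's agents
rg_test test /etc/cluster/cluster.conf start service myservice
rg_test test /etc/cluster/cluster.conf stop service myservice

Note that the stop phase really stops the service, so this is normally run on a test node or while rgmanager is idle.
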
From alvaro.fernandez at sivsa.com Mon May 16 22:03:24 2011 From: alvaro.fernandez at sivsa.com (Alvaro Jose Fernandez) Date: Tue, 17 May 2011 00:03:24 +0200 Subject: [Linux-cluster] q about post_fail_delay Message-ID: <607D6181D9919041BE792D70EF2AEC480195B102@LIMENS.sivsa.int> Hi, Do using a post_fail_delay > 0, when triggered, blocks running resources on the node, if one is not using GFS? . For example, if one only uses a couple of fs resources locally mounted in HA configuration, not shared filesystems at all. Regards, alvaro -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux at alteeve.com Mon May 16 22:52:44 2011 From: linux at alteeve.com (Digimer) Date: Mon, 16 May 2011 18:52:44 -0400 Subject: [Linux-cluster] q about post_fail_delay In-Reply-To: <607D6181D9919041BE792D70EF2AEC480195B102@LIMENS.sivsa.int> References: <607D6181D9919041BE792D70EF2AEC480195B102@LIMENS.sivsa.int> Message-ID: <4DD1AABC.8080202@alteeve.com> On 05/16/2011 06:03 PM, Alvaro Jose Fernandez wrote: > Hi, > > Do using a post_fail_delay > 0, when triggered, blocks running resources > on the node, if one is not using GFS? . For example, if one only uses a > couple of fs resources locally mounted in HA configuration, not shared > filesystems at all. > > Regards, > > alvaro I believe that all IO blocks because the cluster is not able to ensure that messages arrived to all nodes in the same order, as the silent/failed node stopped responding. This is a trait called "virtual synchrony". -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org "I feel confined, only free to expand myself within boundaries." From mguazzardo76 at gmail.com Mon May 16 23:17:14 2011 From: mguazzardo76 at gmail.com (Marcelo Guazzardo) Date: Mon, 16 May 2011 20:17:14 -0300 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> <007301cc126c$200506b0$600f1410$@its.ws> Message-ID: Hy Sufyan Morning, I forgot nombrar the source that I 've followed to made a cluster this is http://people.redhat.com/lhh/oracle-rhel5-notes-0.6/oracle-notes.html Good Luck! Regards, 2011/5/16 Robert Hayden > On Sat, May 14, 2011 at 2:21 PM, Sufyan Khan wrote: > > > > Yes , you can see in attached script > > I can very well be miss reading the script, but with the status > function, you are returning a "0" or a "1" appropriately, but I am not > sure that return value is the return value for the script_db.sh. > Isn't that just the return value for the status function? Meaning, > you need to set the RETVAL variable in the status function to be then > returned at the end of the bash script. I don't code in bash much, so > RETVAL may be a special variable. I attempted to boil down the script > to test. > > #!/bin/bash > . /etc/rc.d/init.d/functions > > status() { > return 1 > } > > case "$1" in > status) > status > ;; > *) > echo $" Not Applicable" > exit 1 > esac > > When I run the above, I see the "0" being returned. > [root ~]# ./status.ksh > Not Applicable > [root ~]# ./status.ksh status > exiting script with > [root ~]# echo $? 
> 0 > > > echo "exiting script with $RETVAL" > exit $RETVAL > > > > > > > > > > > > From: linux-cluster-bounces at redhat.com [mailto: > linux-cluster-bounces at redhat.com] On Behalf Of Parvez Shaikh > > Sent: Saturday, May 14, 2011 6:37 PM > > To: linux clustering > > Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON > deamon > > > > > > > > Hi Sufyan > > > > Does your status function r! eturn 0 o down respectively (i.e. have you > tested it works outside script_db.sh) when run as "root"? > > > > On Thu, May 12, 2011 at 12:52 PM, Sufyan Khan > wrote: > > > > First of all thanks for you quick response. > > > > Secondly please note: the working "cluster.conf" file is attached here, > the > > previous file was not correct. > > Yes the orainfra is the user name. > > > > Any othere clue please. > > > > sufyan > > > > > > > > > > > > -----Original Message----- > > From: linux-cluster-bounces at redhat.com > > > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jankowski, Chris > > Sent: Thursday, May 12, 2011 9:44 AM > > To: linux clustering > > Subject: Re: [! Linux-clu iling over on killin PMON > > > > deamon > > > > Sufyan, > > > > What username does the instance of Oracle DB run as? Is this "orainfra" > or > > some other username? > > > > The scripts assume a user named "orainfra". > > If you use a different username then you need to modify the scripts > > accordingly. > > > > Regards, > > > > Chris Jankowski > > > > > > -----Original Message----- > > From: linux-cluster-bounces at redhat.com > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Sufyan Khan > > Sent: Thursday, 12 May 2011 16:27 > > To: 'linux clustering' > > Subject: [Linux-cluster] oracle DB is not failing over on killin PMON > deamon > > > > Dear All > > > > I need to setup HA cluster for mu oracle dabase. > > I have setup two node cluster using "System-Config-Cluster .." on RHEL > 5.5 I > > created RG a shared fil! e system shared IP and DB script to > > monitor the DB. > > My cluster starts perfectly and fail over on shutting down primary node, > > also stopping shared IP fails node to failover node. > > But on kill PMON , or LSNR process the node does not fails and keep > showing > > the status services running on primary node. > > > > I JUST NEED TO KNOW WHERE IS THE PROBLEM. > > > > ATTACHED IS DB scripts and "cluster.conf" file. > > > > Thanks in advance for help. > > > > Sufyan > > > > > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://! www.redha nux-cluster > > > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From carlopmart at gmail.com Tue May 17 16:06:27 2011 From: carlopmart at gmail.com (carlopmart) Date: Tue, 17 May 2011 18:06:27 +0200 Subject: [Linux-cluster] Corosync goes cpu to 95-99% Message-ID: <4DD29D03.9080901@gmail.com> Hi all, I am running cman-3.0.12-23.el6_0.6.i686 with corosync-1.2.3-21.el6_0.1.i686; the cluster consists of two systems running in KVM, each on a dedicated host. 
I have observed several times that corosync goes cpu to 95-99% in only one node. Is this a bug?? -- CL Martinez carlopmart {at} gmail {d0t} com -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdake at redhat.com Tue May 17 18:13:23 2011 From: sdake at redhat.com (Steven Dake) Date: Tue, 17 May 2011 11:13:23 -0700 Subject: [Linux-cluster] Corosync goes cpu to 95-99% In-Reply-To: <4DD29D03.9080901@gmail.com> References: <4DD29D03.9080901@gmail.com> Message-ID: <4DD2BAC3.50509@redhat.com> On 05/17/2011 09:06 AM, carlopmart wrote: > Hi all, > > I am running cman-3.0.12-23.el6_0.6.i686 with corosync-1.2.3-21.el6_0.1.i686; the cluster consists of two systems running in KVM, each on a dedicated host. > > I have observed several times that corosync goes cpu to 95-99% in only one node. > > Is this a bug?? > > > -- > CL Martinez > carlopmart {at} gmail {d0t} com > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster yes Believe this is fixed in 1.3.1 From carlopmart at gmail.com Tue May 17 18:25:01 2011 From: carlopmart at gmail.com (carlopmart) Date: Tue, 17 May 2011 20:25:01 +0200 Subject: [Linux-cluster] Corosync goes cpu to 95-99% In-Reply-To: <4DD2BAC3.50509@redhat.com> References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com> Message-ID: <4DD2BD7D.5070704@gmail.com> On 05/17/2011 08:13 PM, Steven Dake wrote: > On 05/17/2011 09:06 AM, carlopmart wrote: >> Hi all, >> >> I am running cman-3.0.12-23.el6_0.6.i686 with corosync-1.2.3-21.el6_0.1.i686; the cluster consists of two systems running in KVM, each on a dedicated host. >> >> I have observed several times that corosync goes cpu to 95-99% in only one node. >> >> Is this a bug?? >> >> >> -- >> CL Martinez >> carlopmart {at} gmail {d0t} com >> >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > yes > > Believe this is fixed in 1.3.1 > Thanks Steven ... But is it released for rhel6?? -- CL Martinez carlopmart {at} gmail {d0t} com From sdake at redhat.com Tue May 17 19:20:48 2011 From: sdake at redhat.com (Steven Dake) Date: Tue, 17 May 2011 12:20:48 -0700 Subject: [Linux-cluster] Corosync goes cpu to 95-99% In-Reply-To: <4DD2BD7D.5070704@gmail.com> References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com> <4DD2BD7D.5070704@gmail.com> Message-ID: <4DD2CA90.6090802@redhat.com> On 05/17/2011 11:25 AM, carlopmart wrote: > On 05/17/2011 08:13 PM, Steven Dake wrote: >> On 05/17/2011 09:06 AM, carlopmart wrote: >>> Hi all, >>> >>> I am running cman-3.0.12-23.el6_0.6.i686 with >>> corosync-1.2.3-21.el6_0.1.i686; the cluster consists of two systems >>> running in KVM, each on a dedicated host. >>> >>> I have observed several times that corosync goes cpu to 95-99% in >>> only one node. >>> >>> Is this a bug?? >>> >>> >>> -- >>> CL Martinez >>> carlopmart {at} gmail {d0t} com >>> >>> >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> yes >> >> Believe this is fixed in 1.3.1 >> > > Thanks Steven ... But is it released for rhel6?? > RHEL 6.1 has these problems resolved. If you have problems with rhel6.0 please open a support ticket. There is no SLA for bugzilla/mailing lists, and I can't modify shipped RHEL 6.0.z packages without support tickets. 
Regards -steve From carlopmart at gmail.com Tue May 17 19:28:50 2011 From: carlopmart at gmail.com (carlopmart) Date: Tue, 17 May 2011 21:28:50 +0200 Subject: [Linux-cluster] Corosync goes cpu to 95-99% In-Reply-To: <4DD2CA90.6090802@redhat.com> References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com> <4DD2BD7D.5070704@gmail.com> <4DD2CA90.6090802@redhat.com> Message-ID: <4DD2CC72.80404@gmail.com> On 05/17/2011 09:20 PM, Steven Dake >>> yes >>> >>> Believe this is fixed in 1.3.1 >>> >> >> Thanks Steven ... But is it released for rhel6?? >> > > RHEL 6.1 has these problems resolved. If you have problems with rhel6.0 > please open a support ticket. There is no SLA for bugzilla/mailing > lists, and I can't modify shipped RHEL 6.0.z packages without support > tickets. > > Regards > -steve Thanks Steve. -- CL Martinez carlopmart {at} gmail {d0t} com From sufyan.khan at its.ws Tue May 17 19:40:46 2011 From: sufyan.khan at its.ws (Sufyan Khan) Date: Tue, 17 May 2011 22:40:46 +0300 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> <007901cc126c$b5ae03b0$210a0b10$@its.ws> <000901cc12ce$99532df0$cbf989d0$@its.ws> <007801cc1306$f93e6850$ebbb38f0$@its.ws> <008e01cc132b$e4692800$ad3b7800$@its.ws> Message-ID: <002801cc14ca$52ed6260$f8c82720$@its.ws> Hi Marcelo I am succeeded to run oracle DB and its restarting automatically with killing pmon process. Thanks to all. I have another question (sorry I am new to RHEL cluster) my oracle application server and DB server has different HOME directory, if I used oracledb.sh , service fails in startup because in the oracledb.sh the HOME directory is same for PMON and OPMN process. What could be the solution , I am using ricci and luci. sufyan From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Marcelo Guazzardo Sent: Monday, May 16, 2011 3:20 PM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Hy sufyan I sent two files. First, is cluster.conf, second, is the oracledb.sh , this file must be placed in /usr/share/cluster in both nodes. (Or if you have more nodes, in all nodes). In oracledb.sh I 've changed oracle_user, oracle_sid, type of database, I used base (for monitor and listener), and virtual ip, If you have any doubt, just let me know I hope that helps you Regards, Marcelo 2011/5/15 Sufyan Khan Will you share if it is not confidential. From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Marcelo Guazzardo Sent: Sunday, May 15, 2011 7:03 PM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon ok, this script checks the listener a! nd db , i uot; in database type. I 've worked with oracle10g r2 , and it worked fine for me. Thanks 2011/5/15 Sufyan Khan Thanks for the tips, I need only oracle database to be monitor by Cluster NOT OPMN ( application server) Any clue sufyan From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Marcelo Guazzardo Sent: Sunday, May 15, 2011 11:29 AM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon HI sufyan Maybe is irrelevant, but, do you try with oracledb.sh script?. I use that script and all work fine for me.... 
Regards, Marcelo PS: that script is placed in /usr/share/cluster/oracledb.sh, you must modify the oracle settings, like orauser, db_instance_name, and db_virtual name. 2011/5/15 Sufyan! Khan < han at its.ws">sufyan.khan at its.ws> There is writing mistake, I cannot see the script is running in background.! Off course I stop the cluster then I run the manual script. From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Rajagopal Swaminathan Sent: Sunday, May 15, 2011 12:14 AM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Greetings, I ca nd as root, but do see the script is running in background as a daemon, Mohammad Raza Sufyan Khan Team Leader (Technology&Infrastructure Group) Telco development a stupid question: Have you used chkconfig to switchoff all the cluster controlled services? ! IMHO, The >Regards, ! Rajagopal v> -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Marcelo Guazzardo mguazzardo76 at gmail.com http://mguazzardo.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From mguazzardo76 at gmail.com Wed May 18 00:42:35 2011 From: mguazzardo76 at gmail.com (Marcelo Guazzardo) Date: Tue, 17 May 2011 21:42:35 -0300 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: <002801cc14ca$52ed6260$f8c82720$@its.ws> References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> <007901cc126c$b5ae03b0$210a0b10$@its.ws> <000901cc12ce$99532df0$cbf989d0$@its.ws> <007801cc1306$f93e6850$ebbb38f0$@its.ws> <008e01cc132b$e4692800$ad3b7800$@its.ws> <002801cc14ca$52ed6260$f8c82720$@its.ws> Message-ID: 2011/5/17 Sufyan Khan > Hi Marcelo > > > > I am succeeded to run oracle DB and its restarting automatically with > killing pmon process. > > > > Thanks to all. > > > > I have another question (sorry I am new to RHEL cluster) my oracle > application server and DB server has different HOME directory, if I used > oracledb.sh , service fails in startup because in the oracledb.sh the HOME > directory is same for PMON and OPMN process. > > What could be the solution , I am using ricci and luci. > > > > Sufyan Sorry, I am not DBA. I don't know how help you, Maybe in this list there are a dba who can help you Regards, Marcelo. -------------- next part -------------- An HTML attachment was scrubbed... URL: From munishdh at yahoo.com Wed May 18 02:45:44 2011 From: munishdh at yahoo.com (Munish) Date: Wed, 18 May 2011 10:45:44 +0800 Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon In-Reply-To: <002801cc14ca$52ed6260$f8c82720$@its.ws> References: <00ae01cc106d$a294c440$e7be4cc0$@its.ws> <036B68E61A28CA49AC2767596576CD596F6641246F@GVW1113EXC.americas.hpqcorp.net> <00b501cc1075$4a47cfa0$ded76ee0$@its.ws> <007901cc126c$b5ae03b0$210a0b10$@its.ws> <000901cc12ce$99532df0$cbf989d0$@its.ws> <007801cc1306$f93e6850$ebbb38f0$@its.ws> <008e01cc132b$e4692800$ad3b7800$@its.ws> <002801cc14ca$52ed6260$f8c82720$@its.ws> Message-ID: Where was the problem ? What has been done to fix it? 
Cheers!!! Munish On May 18, 2011, at 3:40 AM, Sufyan Khan wrote: > Hi Marcelo > > > > I am succeeded to run oracle DB and its restarting automatically with killing pmon process. > > > > Thanks to all. > > > > I have another question (sorry I am new to RHEL cluster) my oracle application server and DB server has different HOME directory, if I used oracledb.sh , service fails in startup because in the oracledb.sh the HOME directory is same for PMON and OPMN process. > > What could be the solution , I am using ricci and luci. > > > > sufyan > > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Marcelo Guazzardo > Sent: Monday, May 16, 2011 3:20 PM > To: linux clustering > Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon > > > > Hy sufyan > > I sent two files. First, is cluster.conf, second, is the oracledb.sh , this file must be placed in /usr/share/cluster in both nodes. (Or if you have more nodes, in all nodes). > In oracledb.sh I 've changed oracle_user, oracle_sid, type of database, I used base (for monitor and listener), and virtual ip, > > If you have any doubt, just let me know > I h! ope that Marcelo > > > 2011/5/15 Sufyan Khan > > Will you share if it is not confidential. > > > > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Marcelo Guazzardo > Sent: Sunday, May 15, 2011 7:03 PM > > > Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon > > > ok, this script checks the listener a! nd db , i uot; in database type. > > > I 've worked with oracle10g r2 , and it worked fine for me. > Thanks > > 2011/5/15 Sufyan Khan > > Thanks for the tips, > > > > I need only oracle database to be monitor by Cluster NOT OPMN ( application server) > > Any clue > > > > sufyan > > > > From: linux-cluster-bounces at redhat.com] On Behalf Of Marcelo Guazzardo > Sent: Sunday, May 15, 2011 11:29 AM > > > To: linux clustering > Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon > > > > HI sufyan > > > > Maybe is irrelevant, but, do you try with oracledb.sh script?. > I use that script and all work fine for me.... > > Regards, > Marcelo > PS: that script is placed in /usr/share/cluster/oracledb.sh, you must modify the oracle settings,! like ora nd db_virtual name. > > 2011/5/15 Sufyan! Khan < han at its.ws">sufyan.khan at its.ws> > > There is writing mistake, I cannot see the script is running in background.! > > Off course I stop the cluster then I run the manual script. > > > > > > > > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Rajagopal Swaminathan > Sent: Sunday, May 15, 2011 12:14 AM > > > To: linux clustering > Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon > > > > Greetings, > > I ca nd as root, but do see the script is running in background as a daemon, > > > > > > Mohammad Raza Sufyan Khan > Team Leader (Technology&Infrastructure Group) > > Telco development > > > > a stupid question: Have you used chkconfig to switchoff all the cluster controlled services? > > ! IMHO, The >Regards, > > ! 
Rajagopal v> > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > -- Marcel ailto:mguazzardo76 at gmail.com" target="_blank">mguazzardo76 at gmail.com > http://mguazzardo.blogspot.com > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-clust! er > > > > > -- > Marcelo Guazzardo > mguazzardo76 at gmail.com > http://mguazzardo.blogspot.com > > > -- > Linux-cluster mailing list > Linu! x-cluster f="https://www.redhat.com/mailman/listinfo/linux-cluster" target="_blank">https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > -- > Marcelo Guazzardo > mguazzardo76 at gmail.com > http://mguazzardo.blogspot.com > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From ajb2 at mssl.ucl.ac.uk Wed May 18 15:14:58 2011 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Wed, 18 May 2011 16:14:58 +0100 Subject: [Linux-cluster] |Optimizing DLM Speed Message-ID: <4DD3E272.7080709@mssl.ucl.ac.uk> Bob, Steve, Dave, Is there any progress on tuning the size of the tables (RHEL5) to allow larger values and see if they help things as far as caching goes? It would be advantageous to tweak the dentry limits too - the kernel limits this to 10% and attempts to increase are throttled back. This doesn't scale for larger memory sizes on fileservers and I think it's a hangover from 4Gb ram days. AB From swhiteho at redhat.com Wed May 18 15:31:55 2011 From: swhiteho at redhat.com (Steven Whitehouse) Date: Wed, 18 May 2011 16:31:55 +0100 Subject: [Linux-cluster] |Optimizing DLM Speed In-Reply-To: <4DD3E272.7080709@mssl.ucl.ac.uk> References: <4DD3E272.7080709@mssl.ucl.ac.uk> Message-ID: <1305732715.5294.32.camel@menhir> Hi, On Wed, 2011-05-18 at 16:14 +0100, Alan Brown wrote: > Bob, Steve, Dave, > > Is there any progress on tuning the size of the tables (RHEL5) to allow > larger values and see if they help things as far as caching goes? > There is a bz open, and you should ask for that to be linked to one of your support cases, if it hasn't already been. I thought we'd concluded though that this didn't actually affect your particular workload. > It would be advantageous to tweak the dentry limits too - the kernel > limits this to 10% and attempts to increase are throttled back. > Yes, I've not forgotten this. I've been working on some similar issues recently and I'll explore this more fully once I'm done with the writeback side of things. > This doesn't scale for larger memory sizes on fileservers and I think > it's a hangover from 4Gb ram days. > > AB > Yes, it might well be, so we should certainly look into it. Again though, please ensure that you raise this through support so that (a) it doesn't get missed by accident and (b) that we are all in the loop. If there are not tickets open for these, then we need to resolve that in order to push this forward, Steve. From lhh at redhat.com Wed May 18 15:41:39 2011 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 18 May 2011 11:41:39 -0400 Subject: [Linux-cluster] qdiskd does not call heuristics regularly? 
In-Reply-To: <495A8D8FDB79D54DA88BDC8AE7F3A76D23D70533EF@VDRSEXMBXP1.gfoundries.com> References: <495A8D8FDB79D54DA88BDC8AE7F3A76D23D70533EC@VDRSEXMBXP1.gfoundries.com> <495A8D8FDB79D54DA88BDC8AE7F3A76D23D70533EF@VDRSEXMBXP1.gfoundries.com> Message-ID: <20110518154138.GN11022@redhat.com> On Fri, May 13, 2011 at 02:00:23PM +0200, Gerbatsch, Andre wrote: > > . small correction of the qdiskd->heuristic script timing: > dummy: Fri May 13 08:59:16 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1 <--qdiskd restart, rval=1 http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=a47bc261ef58cb056077c448c06a7c518dd4191d -- Lon Hohberger - Red Hat, Inc. From Benjamin.Navaro at loto-quebec.com Wed May 18 16:54:44 2011 From: Benjamin.Navaro at loto-quebec.com (Navaro Benjamin) Date: Wed, 18 May 2011 12:54:44 -0400 Subject: [Linux-cluster] CLVM - Locking Disabled In-Reply-To: <20110518154138.GN11022@redhat.com> References: <495A8D8FDB79D54DA88BDC8AE7F3A76D23D70533EC@VDRSEXMBXP1.gfoundries.com><495A8D8FDB79D54DA88BDC8AE7F3A76D23D70533EF@VDRSEXMBXP1.gfoundries.com> <20110518154138.GN11022@redhat.com> Message-ID: Hi list, Is it normal for a fresh install that CLVM says that the locking is disabled while locking_type is set to 3 in lvm.conf ? [root at myhost ~]# clvmd -d CLVMD[e2bd6170]: May 18 12:42:24 CLVMD started CLVMD[e2bd6170]: May 18 12:42:24 Connected to CMAN CLVMD[e2bd6170]: May 18 12:42:24 CMAN initialisation complete CLVMD[e2bd6170]: May 18 12:42:25 DLM initialisation complete CLVMD[e2bd6170]: May 18 12:42:25 Cluster ready, doing some more initialisation CLVMD[e2bd6170]: May 18 12:42:25 starting LVM thread CLVMD[e2bd6170]: May 18 12:42:25 clvmd ready for work CLVMD[e2bd6170]: May 18 12:42:25 Using timeout of 60 seconds CLVMD[42aa8940]: May 18 12:42:25 LVM thread function started File descriptor 5 (/dev/zero) leaked on lvm invocation. Parent PID 6240: clvmd WARNING: Locking disabled. Be careful! This could corrupt your metadata. CLVMD[42aa8940]: May 18 12:42:25 LVM thread waiting for work I guess it's related to this following warning when trying to list the vg's (while clvmd is up) : [root at myhost ~]# vgs connect() failed on local socket: No such file or directory WARNING: Falling back to local file-based locking. Volume Groups with the clustered attribute will be inaccessible. VG #PV #LV #SN Attr VSize VFree vg00 1 7 0 wz--n- 24.28G 10.44G This prevents me from creating a clustered VG (actually I can create a clustered VG, but not the LV inside). [root at myhost ~]# vgcreate -c y vggfs01 /dev/sdb2 connect() failed on local socket: No such file or directory WARNING: Falling back to local file-based locking. Volume Groups with the clustered attribute will be inaccessible. No physical volume label read from /dev/sdb2 Physical volume "/dev/sdb2" successfully created Clustered volume group "vggfs01" successfully created [root at myhost ~]# lvcreate -L 500M -n lvgfs01 vggfs01 connect() failed on local socket: No such file or directory WARNING: Falling back to local file-based locking. Volume Groups with the clustered attribute will be inaccessible. Skipping clustered volume group vggfs01 [root at myhost ~]# The final goal is to build a GFS shared storage between 3 nodes. The cman part seems to be OK for the three nodes : [root at lhnq501l ~]# cman_tool services type level name id state fence 0 default 00010001 none [1 2 3] dlm 1 rgmanager 00020003 none [1 2 3] dlm 1 clvmd 00010003 none [1 2 3] This is my first RHEL cluster, and I'm not sure where to investigate right now. 
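(A sketch of the usual first checks for this fallback-to-file-based-locking symptom, for the record. Package and service names below are the RHEL 5 ones, and /var/run/lvm/clvmd.sock is the customary socket location; treat the exact paths as assumptions.)

# lvm2-cluster provides clvmd and the lvmconf helper; it must be installed on every node
rpm -q lvm2-cluster

# enable cluster-wide locking (writes locking_type = 3 into /etc/lvm/lvm.conf)
lvmconf --enable-cluster

# run clvmd as a service on every node rather than one-off in the foreground
service clvmd restart
chkconfig clvmd on

# "connect() failed on local socket" usually means this socket is absent,
# i.e. clvmd is not running (or has died) on the node where vgs was run
ls -l /var/run/lvm/clvmd.sock

# re-check: the file-based-locking warning should be gone and the
# clustered VG should now be visible to lvcreate
vgs
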
If anyone has ever seen this behaviour, any comment is appreciated, Thanks, - Ben. Mise en garde concernant la confidentialite : Le present message, comprenant tout fichier qui y est joint, est envoye a l'intention exclusive de son destinataire; il est de nature confidentielle et peut constituer une information protegee par le secret professionnel. Si vous n'etes pas le destinataire, nous vous avisons que toute impression, copie, distribution ou autre utilisation de ce message est strictement interdite. Si vous avez recu ce courriel par erreur, veuillez en aviser immediatement l'expediteur par retour de courriel et supprimer le courriel. Merci! Confidentiality Warning: This message, including any attachment, is sent only for the use of the intended recipient; it is confidential and may constitute privileged information. If you are not the intended recipient, you are hereby notified that any printing, copying, distribution or other use of this message is strictly prohibited. If you have received this email in error, please notify the sender immediately by return email, and delete it. Thank you! From ajb2 at mssl.ucl.ac.uk Wed May 18 17:34:45 2011 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Wed, 18 May 2011 18:34:45 +0100 Subject: [Linux-cluster] |Optimizing DLM Speed In-Reply-To: <1305732715.5294.32.camel@menhir> References: <4DD3E272.7080709@mssl.ucl.ac.uk> <1305732715.5294.32.camel@menhir> Message-ID: <4DD40335.4010406@mssl.ucl.ac.uk> Steven Whitehouse wrote: > Hi, > > On Wed, 2011-05-18 at 16:14 +0100, Alan Brown wrote: >> Bob, Steve, Dave, >> >> Is there any progress on tuning the size of the tables (RHEL5) to allow >> larger values and see if they help things as far as caching goes? >> > There is a bz open, I thought so, but I can't find it. > and you should ask for that to be linked to one of > your support cases, if it hasn't already been. I thought we'd concluded > though that this didn't actually affect your particular workload. Increasing them to 4096 hasn't but larger numbers might. >> It would be advantageous to tweak the dentry limits too - the kernel >> limits this to 10% and attempts to increase are throttled back. >> > Yes, I've not forgotten this. I've been working on some similar issues > recently and I'll explore this more fully once I'm done with the > writeback side of things. Do you have a BZ for this one? >> This doesn't scale for larger memory sizes on fileservers and I think >> it's a hangover from 4Gb ram days. >> >> AB >> > Yes, it might well be, so we should certainly look into it. Again > though, please ensure that you raise this through support so that (a) it > doesn't get missed by accident and (b) that we are all in the loop. If > there are not tickets open for these, then we need to resolve that in > order to push this forward, willdo. > > Steve. 
> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From swhiteho at redhat.com Wed May 18 17:52:24 2011 From: swhiteho at redhat.com (Steven Whitehouse) Date: Wed, 18 May 2011 18:52:24 +0100 Subject: [Linux-cluster] |Optimizing DLM Speed In-Reply-To: <4DD40335.4010406@mssl.ucl.ac.uk> References: <4DD3E272.7080709@mssl.ucl.ac.uk> <1305732715.5294.32.camel@menhir> <4DD40335.4010406@mssl.ucl.ac.uk> Message-ID: <1305741144.5294.39.camel@menhir> Hi, On Wed, 2011-05-18 at 18:34 +0100, Alan Brown wrote: > Steven Whitehouse wrote: > > Hi, > > > > On Wed, 2011-05-18 at 16:14 +0100, Alan Brown wrote: > >> Bob, Steve, Dave, > >> > >> Is there any progress on tuning the size of the tables (RHEL5) to allow > >> larger values and see if they help things as far as caching goes? > >> > > There is a bz open, > > I thought so, but I can't find it. > Its #678102, which you are on the cc list of. It probably needs a RHEL5 bug as well. Bryn posted a patch to it to make the change, but I'm not sure of the current status. I'm copying in Dave Teigland so that he can comment on the current status. > > and you should ask for that to be linked to one of > > your support cases, if it hasn't already been. I thought we'd concluded > > though that this didn't actually affect your particular workload. > > Increasing them to 4096 hasn't but larger numbers might. > > >> It would be advantageous to tweak the dentry limits too - the kernel > >> limits this to 10% and attempts to increase are throttled back. > >> > > Yes, I've not forgotten this. I've been working on some similar issues > > recently and I'll explore this more fully once I'm done with the > > writeback side of things. > > Do you have a BZ for this one? > The writeback issues are under #676626 at the moment, although this is a slightly different issue to what that bug was originally opened for. There isn't a bug for the dentries issue as that needs to have a ticket opened first, and then a bz opened by support if appropriate. I've copied in Bryn so that he can pick this up and make sure that it is done, Steve. From teigland at redhat.com Wed May 18 18:12:34 2011 From: teigland at redhat.com (David Teigland) Date: Wed, 18 May 2011 14:12:34 -0400 Subject: [Linux-cluster] |Optimizing DLM Speed In-Reply-To: <1305741144.5294.39.camel@menhir> References: <4DD3E272.7080709@mssl.ucl.ac.uk> <1305732715.5294.32.camel@menhir> <4DD40335.4010406@mssl.ucl.ac.uk> <1305741144.5294.39.camel@menhir> Message-ID: <20110518181234.GB3381@redhat.com> On Wed, May 18, 2011 at 06:52:24PM +0100, Steven Whitehouse wrote: > > >> Is there any progress on tuning the size of the tables (RHEL5) to allow > > >> larger values and see if they help things as far as caching goes? > > >> > > > There is a bz open, > > > > I thought so, but I can't find it. > > > Its #678102, which you are on the cc list of. It probably needs a RHEL5 > bug as well. Bryn posted a patch to it to make the change, but I'm not > sure of the current status. I'm copying in Dave Teigland so that he can > comment on the current status. > > > > and you should ask for that to be linked to one of > > > your support cases, if it hasn't already been. I thought we'd concluded > > > though that this didn't actually affect your particular workload. > > > > Increasing them to 4096 hasn't but larger numbers might. I'd suggest applying Bryn's vmalloc patch, and trying a higher value to see if it has any effect. 
If it does, we can certainly get that patch and larger default values queued up for various releases. Thanks, Dave From klusterfsck at outofoptions.net Thu May 19 14:27:58 2011 From: klusterfsck at outofoptions.net (Kluster Fsck) Date: Thu, 19 May 2011 10:27:58 -0400 Subject: [Linux-cluster] Cannot migrate VM's Message-ID: <4DD528EE.9030206@outofoptions.net> I upgraded Red Hat Enterprise Linux Server release 5.6 to try and solve some problems with an inherited broken cluster. After some effort I was able to migrate to the upgraded machine last night. This morning I upgraded the second machine and all seemed to go well until I tried to migrate the VM's back. The system just hangs and nothing happens whether I use virsh or Virtual Machine Manager. From the machine with the vm's currently running: May 19 09:48:26 julius libvirtd: 09:48:26.384: error : qemuDomainMigrateSetMaxDowntime:11792 : invalid argument in qemuDomainMigrateSetMaxDowntime: unsupported flags (0xbc614e) May 19 09:48:26 julius libvirtd: 09:48:26.400: error : qemuDomainMigrateSetMaxDowntime:11792 : invalid argument in qemuDomainMigrateSetMaxDowntime: unsupported flags (0xbc614e) May 19 09:49:04 julius libvirtd: 09:49:04.957: error : qemuDomainWaitForMigrationComplete:5066 : operation failed: Migration was cancelled by client From the machine trying to migrate too: May 19 09:43:11 justinian libvirtd: 09:43:11.682: warning : qemudStartup:1662 : Unable to create cgroup for driver: No such device or address May 19 09:48:25 justinian libvirtd: 09:48:25.202: error : qemudDomainGetVcpus:5993 : Requested operation is not valid: cannot list vcpu pinning for an inactive domain May 19 09:48:25 justinian libvirtd: 09:48:25.208: error : qemuDomainGetJobInfo:11726 : Requested operation is not valid: domain is not running May 19 09:48:25 justinian libvirtd: 09:48:25.230: error : qemudDomainGetVcpus:5993 : Requested operation is not valid: cannot list vcpu pinning for an inactive domain May 19 09:48:25 justinian libvirtd: 09:48:25.235: error : qemuDomainGetJobInfo:11726 : Requested operation is not valid: domain is not running May 19 09:48:25 justinian libvirtd: 09:48:25.256: error : qemudDomainGetVcpus:5993 : Requested operation is not valid: cannot list vcpu pinning for an inactive domain May 19 09:48:25 justinian libvirtd: 09:48:25.262: error : qemuDomainGetJobInfo:11726 : Requested operation is not valid: domain is not running May 19 09:48:25 justinian libvirtd: 09:48:25.279: error : qemudDomainGetVcpus:5993 : Requested operation is not valid: cannot list vcpu pinning for an inactive domain May 19 09:48:25 justinian libvirtd: 09:48:25.285: error : qemuDomainGetJobInfo:11726 : Requested operation is not valid: domain is not running May 19 09:48:25 justinian libvirtd: 09:48:25.347: error : qemudDomainGetVcpus:5993 : Requested operation is not valid: cannot list vcpu pinning for an inactive domain May 19 09:48:25 justinian libvirtd: 09:48:25.353: error : qemuDomainGetJobInfo:11726 : Requested operation is not valid: domain is not running May 19 09:49:13 justinian libvirtd: 09:49:13.006: error : qemuDomainObjBeginJob:362 : Timed out during operation: cannot acquire state change lock May 19 09:49:13 justinian libvirtd: 09:49:13.827: error : qemudDomainBlockStats:9500 : Requested operation is not valid: domain is not running From: http://libvirt.org/drvqemu.html I tried: mkdir /dev/cgroup mount -t cgroup none /dev/cgroup -o devices [root at julius vmdata]# mount -t cgroup none /dev/cgroup -o devices mount: unknown filesystem type 'cgroup' Any 
help would be appreciated. Thank You Ken Lowther
From rossnick-lists at cybercat.ca Thu May 19 15:14:28 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Thu, 19 May 2011 11:14:28 -0400 Subject: [Linux-cluster] RedHat EL6.1 Message-ID: <90062EBE94844CE19B00124C4559D9C2@versa> Hi all ! We are running our cluster with RHEL6, and now 6.1 is out. We have an 8 node cluster, and I want to know is it "safe" to update on a running cluster ? We use GFS2 on a FC network. Is it just a matter of taking the first node, moving its service to another one, yum update, reboot, and move the next one ? Thanks, Regards,
From fdinitto at redhat.com Thu May 19 16:59:39 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Thu, 19 May 2011 18:59:39 +0200 Subject: [Linux-cluster] RedHat EL6.1 In-Reply-To: <90062EBE94844CE19B00124C4559D9C2@versa> References: <90062EBE94844CE19B00124C4559D9C2@versa> Message-ID: <4DD54C7B.1040106@redhat.com> On 05/19/2011 05:14 PM, Nicolas Ross wrote: > Is it just a matter of taking the first node, moving its service to > another one, yum update, reboot, and move the next one ? Please contact GSS that will point you to the correct documentation to perform the upgrade. In general: take the first node, move its services to another, shut down all cluster services (cman), yum update, reboot, move to the next one. Fabio
From ableisch at redhat.com Fri May 20 11:51:30 2011 From: ableisch at redhat.com (Andreas Bleischwitz) Date: Fri, 20 May 2011 13:51:30 +0200 Subject: [Linux-cluster] Mirrored LVM device and recovery Message-ID: <4DD655C2.6080406@redhat.com> Hello all, we are currently facing some handling issues using mirrored LVM lvols in a cluster: We have two different storage systems which should be mirrored using host-based mirroring. AFAIK cmirrored lvols are the only supported mirroring solution under RHEL 5.6. So we have three multipath devices which are used for 2 data and one log volume. We added these three pvs to one volumegroup and created the logical volume using the following command: lvcreate -m1 -L 10G -n lv_mirrored /dev/mpath/mpath0p1 /dev/mpath2p1 /dev/mpath/mpath1p1 The volume replicates ok and everything is fine.... until we remove one storage side of the mirror. Then LVM simply removes the missing pv and the mirror is removed - which I think is ok, as long as it is recreated after re-adding the failed mirror side. Unfortunately LVM doesn't do any such thing - is there a special configuration option which we missed? And keep in mind: there might be a huge number of LVs which have to be re-mirrored.
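A pointer that may be relevant here, offered as a sketch rather than a verified fix for this exact setup: lvm.conf has mirror fault policies that decide whether a failed mirror image is simply dropped (the behaviour described above) or re-allocated from the remaining PVs, and lvconvert can rebuild the mirror once the failed side is visible again. "vg00" below is a placeholder volume group name, and the policy names should be checked against the lvm.conf shipped with that lvm2 version:

  # /etc/lvm/lvm.conf, activation section:
  #   "remove"   - default: drop the failed image, the LV falls back to linear
  #   "allocate" - try to rebuild the failed image on other PVs in the VG
  mirror_log_fault_policy   = "allocate"
  mirror_image_fault_policy = "allocate"

  # manual re-mirroring once the failed storage side is back:
  vgextend vg00 /dev/mpath/mpath1p1      # re-add the PV if it was dropped from the VG
  lvconvert --repair vg00/lv_mirrored    # rebuild the mirror leg (and log) as needed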
So manual interaction shouldn't be the default option ;) Regards, Andreas
From rossnick-lists at cybercat.ca Fri May 20 12:37:30 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Fri, 20 May 2011 08:37:30 -0400 Subject: [Linux-cluster] fence_apc and Apc AP-8941 References: <4D9C06D2.1040106@redhat.com><48C9F07371214F2AA678938FDE5BD3B5@versa><4D9C768D.4060106@redhat.com><8C77012A023D431CB5AA58E6CC676A35@versa> <4D9EC4A4.3060201@redhat.com> Message-ID: <7D713A7F1242489EB185872AC93ED25B@versa> > Add "cmd_prompt" into device_opt in fence_apc. Then you will have > possibility to set --command-prompt to "apc>". > > Both fixes will be simple, feel free to create bugzilla entry for them. Hi ! It appears that development management won't fix the problem : https://bugzilla.redhat.com/show_bug.cgi?id=694894 It's not all that bad, since I now use fence_apc_snmp instead. Regards,
From rossnick-lists at cybercat.ca Fri May 20 14:12:10 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Fri, 20 May 2011 10:12:10 -0400 Subject: [Linux-cluster] Corosync goes cpu to 95-99% References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com><4DD2BD7D.5070704@gmail.com> <4DD2CA90.6090802@redhat.com> Message-ID: <3B50BA7445114813AE429BEE51A2BA52@versa> >>> Believe this is fixed in 1.3.1 >>> >> >> Thanks Steven ... But is it released for rhel6?? >> > > RHEL 6.1 has these problems resolved. If you have problems with rhel6.0 > please open a support ticket. There is no SLA for bugzilla/mailing > lists, and I can't modify shipped RHEL 6.0.z packages without support > tickets. I am also observing this kind of behaviour, but at a different level. We have an 8 node cluster composed of dual quad-core xeon. I have now updated all the nodes to RHEL 6.1, cman is at 3.0.12-41.el6. And from time to time, for no apparent reason, one random node has a peak in cpu usage, where it's corosync that eats CPU for a minute or so. During that time services on that node respond very slowly and ssh shell access is very rough and slow as hell...
From carlopmart at gmail.com Sat May 21 09:42:32 2011 From: carlopmart at gmail.com (carlopmart) Date: Sat, 21 May 2011 11:42:32 +0200 Subject: [Linux-cluster] Corosync goes cpu to 95-99% In-Reply-To: <3B50BA7445114813AE429BEE51A2BA52@versa> References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com><4DD2BD7D.5070704@gmail.com> <4DD2CA90.6090802@redhat.com> <3B50BA7445114813AE429BEE51A2BA52@versa> Message-ID: <4DD78908.2030801@gmail.com> On 05/20/2011 04:12 PM, Nicolas Ross wrote: >>>> Believe this is fixed in 1.3.1 >>>> >>> >>> Thanks Steven ... But is it released for rhel6??
>>> >> >> RHEL 6.1 has these problems resolved. If you have problems with rhel6.0 >> please open a support ticket. There is no SLA for bugzilla/mailing >> lists, and I can't modify shipped RHEL 6.0.z packages without support >> tickets. > > I am also observing this kind of behaviour, but at a different level. We > have an 8 node cluster composed of dual quad-core xeon. I have now > updated all the nodes to RHEL 6.1, cman is at 3.0.12-41.el6. > > And from time to time, for no apparent reason, one random node has a peak > in cpu usage, where it's corosync that eats CPU for a minute or so. > During that time services on that node respond very slowly and ssh > shell access is very rough and slow as hell... Steven, is this problem confirmed to rhel6.1?? It seems that I need to downgrade my servers to rhel5.x ... -- CL Martinez carlopmart {at} gmail {d0t} com
From rossnick-lists at cybercat.ca Sat May 21 13:12:09 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Sat, 21 May 2011 09:12:09 -0400 Subject: [Linux-cluster] Corosync goes cpu to 95-99% In-Reply-To: <4DD78908.2030801@gmail.com> References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com> <4DD2BD7D.5070704@gmail.com> <4DD2CA90.6090802@redhat.com> <3B50BA7445114813AE429BEE51A2BA52@versa> <4DD78908.2030801@gmail.com> Message-ID: <0B1965C8-9807-42B6-9453-01BE0C0B1DCB@cybercat.ca> > Steven, is this problem confirmed to rhel6.1?? It seems that I need to downgrade my servers to rhel5.x ... I've opened a support case at redhat for this. While collecting the sosreport for redhat, I found out in my /var/log/messages file something about gfs2_quotad being stalled for more than 120 seconds. Thought I had disabled quotas with the noquota option. It appears that it's "quota=off". Since I cannot change the cluster config and remount the filesystems at the moment, I did not make the change to test it. It might help you.
From carlopmart at gmail.com Sat May 21 19:07:09 2011 From: carlopmart at gmail.com (carlopmart) Date: Sat, 21 May 2011 21:07:09 +0200 Subject: [Linux-cluster] Corosync goes cpu to 95-99% In-Reply-To: <0B1965C8-9807-42B6-9453-01BE0C0B1DCB@cybercat.ca> References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com> <4DD2BD7D.5070704@gmail.com> <4DD2CA90.6090802@redhat.com> <3B50BA7445114813AE429BEE51A2BA52@versa> <4DD78908.2030801@gmail.com> <0B1965C8-9807-42B6-9453-01BE0C0B1DCB@cybercat.ca> Message-ID: <4DD80D5D.10004@gmail.com> On 05/21/2011 03:12 PM, Nicolas Ross wrote: > >> Steven, is this problem confirmed to rhel6.1?? It seems that I need to downgrade my servers to rhel5.x ... > > I've opened a support case at redhat for this. While collecting the sosreport for redhat, I found out in my /var/log/messages file something about gfs2_quotad being stalled for more than 120 seconds. Thought I had disabled quotas with the noquota option. It appears that it's "quota=off". Since I cannot change the cluster config and remount the filesystems at the moment, I did not make the change to test it. > > It might help you. > Thanks Nicolas. what bugzilla id is it??
-- CL Martinez carlopmart {at} gmail {d0t} com From rossnick-lists at cybercat.ca Sun May 22 02:24:07 2011 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Sat, 21 May 2011 22:24:07 -0400 Subject: [Linux-cluster] Corosync goes cpu to 95-99% In-Reply-To: <4DD80D5D.10004@gmail.com> References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com> <4DD2BD7D.5070704@gmail.com> <4DD2CA90.6090802@redhat.com> <3B50BA7445114813AE429BEE51A2BA52@versa> <4DD78908.2030801@gmail.com> <0B1965C8-9807-42B6-9453-01BE0C0B1DCB@cybercat.ca> <4DD80D5D.10004@gmail.com> Message-ID: <4DD873C7.8080402@cybercat.ca> >> I've opened a support case at redhat for this. While collecting the >> sosreport for redhat, I found ot in my var/log/message file something >> about gfs2_quotad being stalled for more than 120 seconds. Tought I >> disabled quotas with the noquota option. It appears that it's >> "quota=off". Since I cannot chane thecluster config and remount the >> filessystems at the moment, I did not made the change to tes it. >> >> It might helps you. >> > > Thanks Nicolas. what bugzilla id is?? It's not a bugzilla, it's a support case. From fdinitto at redhat.com Wed May 25 07:34:57 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Wed, 25 May 2011 09:34:57 +0200 Subject: [Linux-cluster] fence-agents 3.1.4 stable release Message-ID: <4DDCB121.1020204@redhat.com> Welcome to the fence-agents 3.1.4 release. This release contains a few bug fixes and a new fence_xenapi contributed by Matt Clark that supports Citrix XenServer and XCP. The new source tarball can be downloaded here: https://fedorahosted.org/releases/f/e/fence-agents/fence-agents-3.1.4.tar.xz To report bugs or issues: https://bugzilla.redhat.com/ Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other sysadministrators or power users. Thanks/congratulations to all people that contributed to achieve this great milestone. Happy clustering, Fabio Under the hood (from 3.1.3): Cedric Buissart (1): ipmilan help: login same as -l Fabio M. Di Nitto (4): Fix file permissions build: add missing file from tarball release fence_rsa: readd test info build: allow selection of agents to build and fix configure help output Lon Hohberger (1): Revert "fence_ipmilan: Correct return code for diag operation" Marek 'marx' Grac (1): fence_ipmilan: Correct return code for diag operation Matt Clark (5): New fencing script for Citrix XenServer and XCP. Updated to include xenapi script. Updated to include xenapi script in Makefile.am. Clean up of fence_xenapi patches. Moved XenAPI.py to lib directory and added to Makefile.am. Cleanup of fence_xenapi patches. Added copyright information to doc/COPYRIGHT. Fixed static reference to lib directory in fence_xenapi.py. Fixed static reference to RELEASE_VERSION and BUILD_DATE in fence_xenapi.py. configure.ac | 38 +++++- doc/COPYRIGHT | 4 + fence/agents/Makefile.am | 40 +----- fence/agents/ipmilan/ipmilan.c | 4 +- fence/agents/lib/Makefile.am | 8 +- fence/agents/lib/XenAPI.py.py | 209 ++++++++++++++++++++++++++ fence/agents/rsa/fence_rsa.py | 1 + fence/agents/xenapi/Makefile.am | 17 ++ fence/agents/xenapi/fence_xenapi.py | 231 +++++++++++++++++++++++++++++ 9 files changed, 505 insertions(+), 47 deletions(-) From hiroysato at gmail.com Sat May 28 12:39:37 2011 From: hiroysato at gmail.com (Hiroyuki Sato) Date: Sat, 28 May 2011 21:39:37 +0900 Subject: [Linux-cluster] [Q] Good documentation about command line interface?? 
Message-ID: Dear members. I'm newbie Red Hat cluster. Could you point me to good documentation about command line interface?? ( cman_tool, css_tool, ccs_test, fence_ack_manual ..) Especially the following topics. * How to rejoin to node. * How to leave from node. * How to use fence_ack_manual * How to manage cluster with command line tools. One of my problem is here. The status of gfs3 which in my test cluster is JOIN_STOP_WAIT. I don't know how to re-join it. # /usr/sbin/cman_tool services type level name id state fence 0 default 00000000 JOIN_STOP_WAIT I found a keyword 'fenced_override'. This file. should be named pipe. Howevre I can't find that file in /var/run/cluter directory in my clusters. fenced working on all of clusters. Sincerely. * Environment CentOS 5.6 * Configurations [the cluster.conf posted here was stripped by the list archive] -- Hiroyuki Sato
From linux at alteeve.com Sat May 28 17:31:12 2011 From: linux at alteeve.com (Digimer) Date: Sat, 28 May 2011 13:31:12 -0400 Subject: [Linux-cluster] [Q] Good documentation about command line interface?? In-Reply-To: References: Message-ID: <4DE13160.7080709@alteeve.com> On 05/28/2011 08:39 AM, Hiroyuki Sato wrote: > Dear members. > > I'm newbie Red Hat cluster. Welcome! > Could you point me to good documentation about command line interface?? > ( cman_tool, css_tool, ccs_test, fence_ack_manual ..) The man pages for these tools are well documented. > fence_ack_manual This is not supported in any way, shape or form. You *must* use a proper fence device. Do your servers have IPMI (or OEM version like DRAC, iLO, etc?). Please read this: http://wiki.alteeve.com/index.php/Red_Hat_Cluster_Service_2_Tutorial#Concept.3B_Virtual_Synchrony Specifically; "Concept; Virtual Synchrony" and "Concept; Fencing" > Especially the following topics. > > * How to rejoin to node. > * How to leave from node. Starting and stopping the cman service will cause the node to join and leave, respectively. You can do it manually if you wish, please check the man pages. > * How to use fence_ack_manual Again, you can't. It is not supported. > * How to manage cluster with command line tools. ccs_tool is the main program to look at. > One of my problem is here. > > The status of gfs3 which in my test cluster is JOIN_STOP_WAIT. > I don't know how to re-join it. > > # /usr/sbin/cman_tool services > type level name id state > fence 0 default 00000000 JOIN_STOP_WAIT Without a working fence device, the cluster will block forever. As far as I know, once a fence call has been issued, there is nothing that can be done to abort it. I'd suggest pulling the power on the node, boot it cleanly and start cman. > I found a keyword 'fenced_override'. This file. should be named pipe. > Howevre I can't find that file in /var/run/cluter directory in my clusters. > fenced working on all of clusters. Again, it's not supported. > Sincerely. > > > * Environment > > CentOS 5.6 > > > * Configurations > [quoted cluster.conf stripped by the list archive] This is wrong, 'expected_votes' is the number of nodes in the cluster (plus qdisk votes, if you are using it). > [remainder of the quoted cluster.conf stripped by the list archive] If you are on IRC, join #linux-cluster, it is also a great place to get help. I am usually there and will be happy to help you get a) fencing working and b) get the rest working. Welcome to clustering!
:) -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org "I feel confined, only free to expand myself within boundaries." From hiroysato at gmail.com Sun May 29 10:14:41 2011 From: hiroysato at gmail.com (Hiroyuki Sato) Date: Sun, 29 May 2011 19:14:41 +0900 Subject: [Linux-cluster] [Q] Good documentation about command line interface?? In-Reply-To: <4DE13160.7080709@alteeve.com> References: <4DE13160.7080709@alteeve.com> Message-ID: Hello Digimer. Thank you for your information. This is the document that I'm looking for!!. This doc is very very usuful. Thanks!!. I want to ask one thing. Please take a look my cluster configration again. Mainly I want to use GNBD on gfs_clientX. GNBD server is gfs2, and gfs3. And gfs_client's hardwhere does not support IPMI, iLO..., Because That machine is Desktop computers. And no APC like UPS. The desktop machine is just support Wake On LAN. What fence device should I use?? I'm thinking fence_wake_on_lan is proper fence device. but that is nothing.. Thank you for your advice. Regards. 2011/5/29 Digimer : > On 05/28/2011 08:39 AM, Hiroyuki Sato wrote: >> >> Dear members. >> >> I'm newbie Red Hat cluster. > > Welcome! > >> Could you point me to good documentation about command line interface?? > >> ? ( cman_tool, css_tool, ccs_test, fence_ack_manual ..) > > The man pages for these tools are well documented. > >> ?fence_ack_manual > > bat> > > This is not supported in any way, shape or form. You *must* use a proper > fence device. Do your servers have IPMI (or OEM version like DRAC, iLO, > etc?). > > Please read this: > > http://wiki.alteeve.com/index.php/Red_Hat_Cluster_Service_2_Tutorial#Concept.3B_Virtual_Synchrony > > Specifically; "Concept; Virtual Synchrony" and "Concept; Fencing" > >> Especially the following topics. >> >> ? * How to rejoin to node. >> ? * How to leave from node. > > Starting and stopping the cman service will cause the node to join and > leave, respectively. You can do it manually if you wish, please check the > man pages. > >> ? * How to use fence_ack_manual > > Again, you can't. It is not supported. > >> ? * How to manage cluster with command line tools. > > ccs_tool is the main program to look at. > >> One of my problem is here. >> >> The status of gfs3 which in my test cluster is JOIN_STOP_WAIT. >> I don't know how to re-join it. >> >> # /usr/sbin/cman_tool services >> type ? ? ? ? ? ? level name ? ? id ? ? ? state >> fence ? ? ? ? ? ?0 ? ? default ?00000000 JOIN_STOP_WAIT > > Without a working fence device, the cluster will block forever. As far as I > know, once a fence call has been issued, there is nothing that can be done > to abort it. I'd suggest pulling the power on the node, boot it cleanly and > start cman. > >> I found a keyword 'fenced_override'. This file. should be named pipe. >> Howevre I can't find that file in /var/run/cluter directory in my >> clusters. >> fenced working on all of clusters. > > Again, it's not supported. > >> Sincerely. >> >> >> * Environment >> >> ? CentOS 5.6 >> >> >> * Configurations >> >> >> >> ? > > This is wrong, 'expected_votes' is the number of nodes in the cluster (plus > qdisk votes, if you are using it). > >> ? >> ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? 
? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? >> ? >> ? ? >> ? >> ? >> ? ? >> ? ? >> ? >> > > If you are on IRC, join #linux-cluster, it is also a great place to get > help. I am usually there and will be happy to help you get a) fencing > working and b) get the rest working. > > Welcome to clustering! :) > > -- > Digimer > E-Mail: digimer at alteeve.com > AN!Whitepapers: http://alteeve.com > Node Assassin: ?http://nodeassassin.org > "I feel confined, only free to expand myself within boundaries." > -- Hiroyuki Sato From linux at alteeve.com Sun May 29 16:00:57 2011 From: linux at alteeve.com (Digimer) Date: Sun, 29 May 2011 12:00:57 -0400 Subject: [Linux-cluster] [Q] Good documentation about command line interface?? In-Reply-To: References: <4DE13160.7080709@alteeve.com> Message-ID: <4DE26DB9.6020905@alteeve.com> On 05/29/2011 06:14 AM, Hiroyuki Sato wrote: > Hello Digimer. > > Thank you for your information. > > This is the document that I'm looking for!!. > This doc is very very usuful. Thanks!!. Wonderful, I'm glad you find it useful. :) > I want to ask one thing. > > Please take a look my cluster configration again. Will do, comments will be in-line. > Mainly I want to use GNBD on gfs_clientX. > GNBD server is gfs2, and gfs3. > > And gfs_client's hardwhere does not support IPMI, iLO..., > Because That machine is Desktop computers. > > And no APC like UPS. > > The desktop machine is just support Wake On LAN. > > What fence device should I use?? > I'm thinking fence_wake_on_lan is proper fence device. > but that is nothing.. The least expensive option for a commercial product would be APC's switched PDU. You have 13 machines, so you would need either 2 of the 1U models, or 1 of the 0U models. If you are in North America, you can use these: http://www.apc.com/products/resource/include/techspec_index.cfm?base_sku=AP7900 or http://www.apc.com/products/resource/include/techspec_index.cfm?base_sku=AP7931 If you are in Japan, you'll need to select the best one of these: http://www.apc.com/products/family/index.cfm?id=70&ISOCountryCode=JP Whichever you get, you can use the 'fence_apc' fence agent. > Thank you for your advice. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Regards. Outside of the "fence_manual" issue, this looks fine. You will probably want to get the GFS and GNBD stuff into rgmanager, but that can come later after you have fencing working and the core of the cluster tested and working. Take a look at this: http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Global_Network_Block_Device/s1-gnbd-mp-sn.html It discusses fencing with GNBD. Below is the start of the Red Hat document on GNBD in EL5 that you may find helpful, if you haven't read it already. 
http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Global_Network_Block_Device/ch-gnbd.html Let me know if you want/need any more help. I'll be happy to see what I can do. -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "I feel confined, only free to expand myself within boundaries." From tom+linux-cluster at oneshoeco.com Mon May 30 09:45:27 2011 From: tom+linux-cluster at oneshoeco.com (Tom Lanyon) Date: Mon, 30 May 2011 19:15:27 +0930 Subject: [Linux-cluster] How do you HA your storage? In-Reply-To: <1304154086.10889.1446718041@webmail.messagingengine.com> References: <1304154086.10889.1446718041@webmail.messagingengine.com> Message-ID: On 30/04/2011, at 6:31 PM, urgrue wrote: > > I'm struggling to find the best way to deal with SAN failover. > By this I mean the common scenario where you have SAN-based mirroring. > It's pretty easy with host-based mirroring (md, DRBD, LVM, etc) but how > can you minimize the impact and manual effort to recover from losing a > LUN, and needing to somehow get your system to realize the data is now > on a different LUN (the now-active mirror)? urgrue, As others have mentioned, this may be a little off-topic for the list. However, I reply in support of hopefully providing an answer to your original question. In my experience the destination array of storage-based (i.e. array-to-array) replication is able to present the replication target LUN with the same ID (e.g. WWN) as that of the source LUN on the source array. In this scenario, you would present the replicated LUN on the destination array to your server(s), and your multipathing (i.e. device-mapper-multipath) software would essentially see it as another path to the same device. You obviously need to ensure that the priority of these paths are such that no I/O operations will traverse them unless the paths to the source array have failed. In the case of a failure on the source array, it's paths will (hopefully!) be marked as failed, your multipath software will start queueing I/O, the destination array will detect the source array failure and switch its LUN presentation to read/write and your multipathing software will resume I/O on the new paths. There's a lot to consider here. Such live failover can often be asking for trouble, and given the total failure rates of high-end storage equipment is quite minimal, I'd only implement if absolutely required. The above assumes synchronous replication between the arrays. Hope this helps somewhat. Tom From Chris.Jankowski at hp.com Mon May 30 11:14:37 2011 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Mon, 30 May 2011 11:14:37 +0000 Subject: [Linux-cluster] How do you HA your storage? In-Reply-To: References: <1304154086.10889.1446718041@webmail.messagingengine.com> Message-ID: <036B68E61A28CA49AC2767596576CD596F6710F09B@GVW1113EXC.americas.hpqcorp.net> There is a school of thought among practitioners of Business Continuity that says: HA != DR The two cover different domains and mixing the two concepts leads to horror stories. Essentially, HA covers a single (small or large) component failure. If components can be duplicated and work in parallel (e.g. disks, paths, controllers) then failure of one component may be transparent to the end users. If they carry state e.g. a server then you replace the element and recover stable state - hence a HA cluster. 
The action taken is automatic and the outcome can be guaranteed if only one component failed. DR covers multiple and not necessarily simultaneous component failures. They may result from large catastrophic events such as a fire in the data centre. As the extent of damage is not known then a human must be in the loop - to declare a disaster and initiate execution of a disaster recovery plan. Software has horrible problems distinguishing between a hole in the ground from a massive bomb blast and a puff of smoke from a little short circuit in a power supply (:-)). Humans do better here. The results of execution of a disaster recovery plan can be achieved by very careful design for geographical separation, so a disaster does not invalidate redundancy. The execution itself can be automated, but is initiated by a human - push button solution. Typically DR is layered on top of HA e.g. HA clusters in each location to protect against single component failures and data replication from the active to the DR site to maintain complete state in geographically distant location. The typical cost ratios are 1=>4=>16 for single system => HA cluster => complete DR solution. That is why there are very few properly designed, built, tested and maintained DR solutions based on two HA clusters and replication. --------- I believe that you are trying to configure a stretched cluster that would provide some automatic DR capabilities. The problem with stretched cluster solutions is that they do not normally take into consideration multiple, non-simultaneous component failures. I suggest that you think carefully what happens in such system depending on which fibre melts first and which disk seizes up first in a fire. You will soon find out that the software lacks the notion of locally consistent groups. The only cluster that ever did that location stuff right was DEC VMS cluster 25 years ago. Stretched VMS clusters did work correctly. The cost was horrendous though. --------- You can also try to make storage somebody else's problem by using a storage array that enables you to build HA geographically extended configuration. Believe it or not, there is one like that - P4000 from HP (formerly from Left Hand Networks). Of course, you still would need to properly design and configure such extended configuration, but it is a fully supported solution from the vendor. You can play with it by downloading evaluation copies of the software - VSA - Virtual Storage Appliance from HP site. Regards, Chris Jankowski Once you ado -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Tom Lanyon Sent: Monday, 30 May 2011 19:45 To: linux clustering Subject: Re: [Linux-cluster] How do you HA your storage? On 30/04/2011, at 6:31 PM, urgrue wrote: > > I'm struggling to find the best way to deal with SAN failover. > By this I mean the common scenario where you have SAN-based mirroring. > It's pretty easy with host-based mirroring (md, DRBD, LVM, etc) but how > can you minimize the impact and manual effort to recover from losing a > LUN, and needing to somehow get your system to realize the data is now > on a different LUN (the now-active mirror)? urgrue, As others have mentioned, this may be a little off-topic for the list. However, I reply in support of hopefully providing an answer to your original question. In my experience the destination array of storage-based (i.e. array-to-array) replication is able to present the replication target LUN with the same ID (e.g. 
WWN) as that of the source LUN on the source array. In this scenario, you would present the replicated LUN on the destination array to your server(s), and your multipathing (i.e. device-mapper-multipath) software would essentially see it as another path to the same device. You obviously need to ensure that the priority of these paths are such that no I/O operations will traverse them unless the paths to the source array have failed. In the case of a failure on the source array, it's paths will (hopefully!) be marked as failed, your multipath software will start queueing I/O, the destination array will detect the source array failure and switch its LUN presentation to read/write and your multipathing software will resume I/O on the new paths. There's a lot to consider here. Such live failover can often be asking for trouble, and given the total failure rates of high-end storage equipment is quite minimal, I'd only implement if absolutely required. The above assumes synchronous replication between the arrays. Hope this helps somewhat. Tom -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From hiroysato at gmail.com Mon May 30 11:35:38 2011 From: hiroysato at gmail.com (Hiroyuki Sato) Date: Mon, 30 May 2011 20:35:38 +0900 Subject: [Linux-cluster] [Q] Good documentation about command line interface?? In-Reply-To: <4DE26DB9.6020905@alteeve.com> References: <4DE13160.7080709@alteeve.com> <4DE26DB9.6020905@alteeve.com> Message-ID: Hello Digimer Thank you for your advice. * GNBD I've already succeed to mount GNBD. locking_type = 1 Should I change lock_type = 3 ?, If not, what problem will be occur?? * fence_apc some of reason, I can't get use APC switch. (That configuration example is test environment. ) so I asked alternative solution. * fence_wol I can't find fence_wake_on_lan. so I'm thinking to create it. WOL supports Power on and Power off ( I'll test later ). So, It's will be fence tool. And I downloaded fence_na, It was written in Perl script. so I want to change fence_na to use wol command. Could you point me to good reference to build fence_wol. (Of course!!. fence_na is good reference) Thank you for your advice again. Regards. 2011/5/30 Digimer : > On 05/29/2011 06:14 AM, Hiroyuki Sato wrote: >> >> Hello Digimer. >> >> Thank you for your information. >> >> This is the document that I'm looking for!!. >> This doc is very very usuful. Thanks!!. > > Wonderful, I'm glad you find it useful. :) > >> I want to ask one thing. >> >> Please take a look my cluster configration again. > > Will do, comments will be in-line. > >> Mainly I want to use GNBD on gfs_clientX. >> GNBD server is gfs2, and gfs3. >> >> And gfs_client's hardwhere does not support IPMI, iLO..., >> Because That machine is Desktop computers. >> >> And no APC like UPS. >> >> The desktop machine is just support Wake On LAN. >> >> What fence device should I use?? >> I'm thinking fence_wake_on_lan is proper fence device. >> but that is nothing.. > > The least expensive option for a commercial product would be APC's switched > PDU. You have 13 machines, so you would need either 2 of the 1U models, or 1 > of the 0U models. 
> > If you are in North America, you can use these: > > http://www.apc.com/products/resource/include/techspec_index.cfm?base_sku=AP7900 > > or > > http://www.apc.com/products/resource/include/techspec_index.cfm?base_sku=AP7931 > > If you are in Japan, you'll need to select the best one of these: > > http://www.apc.com/products/family/index.cfm?id=70&ISOCountryCode=JP > > Whichever you get, you can use the 'fence_apc' fence agent. > >> Thank you for your advice. >> >> >> >> ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? ? >> ? ? ? >> ? ? ? ? >> ? ? ? ? ? >> ? ? ? ? >> ? ? ? >> ? ? >> ? >> ? >> ? ? >> ? >> ? >> ? ? >> ? ? >> ? >> >> >> Regards. > > Outside of the "fence_manual" issue, this looks fine. You will probably want > to get the GFS and GNBD stuff into rgmanager, but that can come later after > you have fencing working and the core of the cluster tested and working. > > Take a look at this: > > http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Global_Network_Block_Device/s1-gnbd-mp-sn.html > > It discusses fencing with GNBD. Below is the start of the Red Hat document > on GNBD in EL5 that you may find helpful, if you haven't read it already. > > http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Global_Network_Block_Device/ch-gnbd.html > > Let me know if you want/need any more help. I'll be happy to see what I can > do. > > -- > Digimer > E-Mail: ? ? ? ? ? ? ?digimer at alteeve.com > Freenode handle: ? ? digimer > Papers and Projects: http://alteeve.com > Node Assassin: ? ? ? http://nodeassassin.org > "I feel confined, only free to expand myself within boundaries." > -- Hiroyuki Sato From linux at alteeve.com Mon May 30 12:26:53 2011 From: linux at alteeve.com (Digimer) Date: Mon, 30 May 2011 08:26:53 -0400 Subject: [Linux-cluster] [Q] Good documentation about command line interface?? In-Reply-To: References: <4DE13160.7080709@alteeve.com> <4DE26DB9.6020905@alteeve.com> Message-ID: <4DE38D0D.8010800@alteeve.com> On 05/30/2011 07:35 AM, Hiroyuki Sato wrote: > Hello Digimer > > Thank you for your advice. > > * GNBD > I've already succeed to mount GNBD. > locking_type = 1 > Should I change lock_type = 3 ?, > If not, what problem will be occur?? To be honest, I'm not familiar with GNBD. The locking needs to use DLM I do believe, so check the documentation to ensure that is the case. > * fence_apc > some of reason, I can't get use APC switch. > (That configuration example is test environment. ) > so I asked alternative solution. Ah, ok. > * fence_wol > > I can't find fence_wake_on_lan. so I'm thinking to create it. > WOL supports Power on and Power off ( I'll test later ). > So, It's will be fence tool. > > And I downloaded fence_na, It was written in Perl script. 
> so I want to change fence_na to use wol command. > > > Could you point me to good reference to build fence_wol. > (Of course!!. fence_na is good reference) Does wake-on-lan allow for: a) Forcing a node to power off, or does it just start an ACPI shutdown? b) Can you check that the node is successfully off using wol? Unless wol can force a node off (ie: in the case of a hung OS) and can return the current power state of the node, then I would be hesitant to use it. As a general question though; You will need to write a script that follows the FenceAgentAPI: https://fedorahosted.org/cluster/wiki/FenceAgentAPI You could make a few Node Assassin devices if you have access to Arduino boards and don't mind soldering. However, you have 13 nodes, so you'd need five of them... Not sure if that is feasible. If you want to to a test cluster with fewer nodes though, it's probably more reasonable. -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "I feel confined, only free to expand myself within boundaries." From Ralph.Grothe at itdz-berlin.de Mon May 30 12:28:34 2011 From: Ralph.Grothe at itdz-berlin.de (Ralph.Grothe at itdz-berlin.de) Date: Mon, 30 May 2011 14:28:34 +0200 Subject: [Linux-cluster] How to integrate a custom resource agent into RHCS? Message-ID: Hi, I hope this is the right forum. So bear with me Pacemaker aficionados et alii when I talk about Red Hat Cluster Suite (RHCS). That's the clusterware product I am given to set up the cluster and I'm not free to chose another software of my liking. Though this may sound ridiculous, since days I've been labouring to get a fairly simple custom resource agent (hence RA) to be acknowledged by RHCS and correctly executed through its rgmanager. When scripting my RA I mostly adhered to http://www.linux-ha.org/doc/dev-guides/ra-dev-guide.html apart from where RHCS RAs differs from general OCF. I put my RA in /usr/share/cluster and afterwards restarted rgmanager on all nodes. When I try to start the service whereof my RA's managed resource is part of the service though gets started but not my resource, as if it wasn't part of the service at all. When I try to start my resource via rg_test nothing happens apart from this obscure log entry [root at aruba:~] # rg_test test /etc/cluster/cluster.conf start aDIStn_sec Running in test mode. Entity: line 2: parser error : Char 0x0 out of allowed range ^ Entity: line 2: parser error : Premature end of data in tag error line 1 ^ [root at aruba:~] # echo $? 0 [root at aruba:~] # grep rg_test /var/log/cluster.log|tail -1 May 30 13:54:55 aruba rg_test: [28643]: Cannot dump meta-data because '/usr/share/cluster/default.metadata' is missing Though this is true [root at aruba:~] # ls -l /usr/share/cluster/default.metadata ls: /usr/share/cluster/default.metadata: No such file or directory there isn't such a file part of the installed clusterware at all either [root at aruba:~] # yum groupinfo Clustering|tail -10|xargs rpm -ql|grep -c default\\.metadata 0 And besides, I don't understand this error because since I wrote my RA according to above mentioned RA Developer's Guide it of course dumps its metadata [root at aruba:~] # /usr/share/cluster/aDIStn_sec.sh meta-data|grep action (note, RHCS deviates from OCF here in naming its actions verify-all instead of validate-all and status instead of monitor. 
But both refer to the same case block in my RA) I also don't understand the "Char 0x0 out of allowed range" error from the XML parser. If it really refers to line 2 of my cluster.conf this looks pretty ok to me [root at aruba:~] # sed -n 2p /etc/cluster/cluster.conf If I run a validity check of the XML of my cluster.conf against RHCS's RNG schema I also get an incomprehensible error about extra elements in interleave. Nevertheless, all other resources of my cluster which rely on RHCS's standard RAs are managed ok by the clusterware. [root at aruba:~] # declare -f cluconf_valid cluconf_valid () { xmllint --noout --relaxng /usr/share/system-config-cluster/misc/cluster.ng ${1:-/etc/cluster/cluster.conf} } [root at aruba:~] # cluconf_valid Relax-NG validity error : Extra element cman in interleave /etc/cluster/cluster.conf:2: element cluster: Relax-NG validity error : Element cluster failed to validate content /etc/cluster/cluster.conf fails to validate Btw. is there a schema file available to check an RA's metadata for validity? Of course did I test my RA script for correct functionality when used like an init script (to which end I provide the required environment of OCF_RESKEY_parameter(s)), and it starts, stops and monitors my resource as intended. Can anyone help? Regards Ralph From linux at alteeve.com Mon May 30 13:15:37 2011 From: linux at alteeve.com (Digimer) Date: Mon, 30 May 2011 09:15:37 -0400 Subject: [Linux-cluster] How to integrate a custom resource agent into RHCS? In-Reply-To: References: Message-ID: <4DE39879.4070406@alteeve.com> On 05/30/2011 08:28 AM, Ralph.Grothe at itdz-berlin.de wrote: > Hi, > > I hope this is the right forum. So bear with me Pacemaker > aficionados et alii when I talk about Red Hat Cluster Suite > (RHCS). > That's the clusterware product I am given to set up the cluster > and I'm not free to chose another software of my liking. > > Though this may sound ridiculous, since days I've been labouring > to get a fairly simple custom resource agent (hence RA) to be > acknowledged by RHCS and correctly executed through its > rgmanager. > > When scripting my RA I mostly adhered to > http://www.linux-ha.org/doc/dev-guides/ra-dev-guide.html apart > from where RHCS RAs differs from general OCF. > > I put my RA in /usr/share/cluster and afterwards restarted > rgmanager on all nodes. > > When I try to start the service whereof my RA's managed resource > is part of the service though gets started but not my resource, > as if it wasn't part of the service at all. > > > When I try to start my resource via rg_test nothing happens apart > from this obscure log entry > > > [root at aruba:~] > # rg_test test /etc/cluster/cluster.conf start aDIStn_sec > Running in test mode. > Entity: line 2: parser error : Char 0x0 out of allowed range > > ^ > Entity: line 2: parser error : Premature end of data in tag error > line 1 > > ^ > [root at aruba:~] > # echo $? 
> 0 > > [root at aruba:~] > # grep rg_test /var/log/cluster.log|tail -1 > May 30 13:54:55 aruba rg_test: [28643]: Cannot dump > meta-data because '/usr/share/cluster/default.metadata' is > missing > > > Though this is true > > [root at aruba:~] > # ls -l /usr/share/cluster/default.metadata > ls: /usr/share/cluster/default.metadata: No such file or > directory > > there isn't such a file part of the installed clusterware at all > either > > [root at aruba:~] > # yum groupinfo Clustering|tail -10|xargs rpm -ql|grep -c > default\\.metadata > 0 > > And besides, I don't understand this error because since I wrote > my RA according to above mentioned RA Developer's Guide it of > course dumps its metadata > > > [root at aruba:~] > # /usr/share/cluster/aDIStn_sec.sh meta-data|grep action > > > > > > > > > > > (note, RHCS deviates from OCF here in naming its actions > verify-all instead of validate-all and status instead of monitor. > But both refer to the same case block in my RA) > > > I also don't understand the "Char 0x0 out of allowed range" error > from the XML parser. > > If it really refers to line 2 of my cluster.conf this looks > pretty ok to me > > > [root at aruba:~] > # sed -n 2p /etc/cluster/cluster.conf > > > > If I run a validity check of the XML of my cluster.conf against > RHCS's RNG schema I also get an incomprehensible error about > extra elements in interleave. > > Nevertheless, all other resources of my cluster which rely on > RHCS's standard RAs are managed ok by the clusterware. > > > > [root at aruba:~] > # declare -f cluconf_valid > cluconf_valid () > { > xmllint --noout --relaxng > /usr/share/system-config-cluster/misc/cluster.ng > ${1:-/etc/cluster/cluster.conf} > } > [root at aruba:~] > # cluconf_valid > Relax-NG validity error : Extra element cman in interleave > /etc/cluster/cluster.conf:2: element cluster: Relax-NG validity > error : Element cluster failed to validate content > /etc/cluster/cluster.conf fails to validate > > > Btw. is there a schema file available to check an RA's metadata > for validity? > > > > Of course did I test my RA script for correct functionality when > used like an init script (to which end I provide the required > environment of OCF_RESKEY_parameter(s)), > and it starts, stops and monitors my resource as intended. > > > Can anyone help? > > > Regards > Ralph Can you paste in your cluster.conf file? Please only alter the passwords. Generally speaking, if your scripts can work like init.d script (taking start/stop/status arguments), then you should be able to use the "script" resource type. I am not too familiar with OCF, I am afraid, but I think I can help with RHCS as that is what I am most familiar with. -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "I feel confined, only free to expand myself within boundaries." From hiroysato at gmail.com Mon May 30 14:49:15 2011 From: hiroysato at gmail.com (Hiroyuki Sato) Date: Mon, 30 May 2011 23:49:15 +0900 Subject: [Linux-cluster] [Q] Good documentation about command line interface?? In-Reply-To: <4DE38D0D.8010800@alteeve.com> References: <4DE13160.7080709@alteeve.com> <4DE26DB9.6020905@alteeve.com> <4DE38D0D.8010800@alteeve.com> Message-ID: Hello Digimer. Thank you for your advice. It is very very useful information for me. > a) Forcing a node to power off, or does it just start an ACPI shutdown? Maybe ok. I'll test it. > b) Can you check that the node is successfully off using wol? 
I'm not sure, I'll test it. Could you tell me one more thing. Where fenced will call fence agent?? It is mean that the following * Can I check where fenced daemon will call fence_agent when I execute fence_node?? (that message send to master fenced, or localhost??) * And Can I check ``where are master'' with command?? (If fenced is master-slave type) * Can I control master priority. (for example I want to specify gfs1, gfs2, gfs3 as fenced master) Thanks again Regards. 2011/5/30 Digimer : > On 05/30/2011 07:35 AM, Hiroyuki Sato wrote: >> >> Hello Digimer >> >> Thank you for your advice. >> >> * GNBD >> ? I've already succeed to mount GNBD. >> ? locking_type = 1 >> ? Should I change lock_type = 3 ?, >> ? If not, what problem will be occur?? > > To be honest, I'm not familiar with GNBD. The locking needs to use DLM I do > believe, so check the documentation to ensure that is the case. > >> * fence_apc >> ?some of reason, I can't get use APC switch. >> ?(That configuration example is test environment. ) >> ?so I asked alternative solution. > > Ah, ok. > >> * fence_wol >> >> ?I can't find fence_wake_on_lan. so I'm thinking to create it. >> ?WOL supports Power on and Power off ( I'll test later ). >> ?So, It's will be fence tool. >> >> ?And I downloaded fence_na, It was written in Perl script. >> ?so I want to change fence_na to use wol command. >> >> >> ?Could you point me to good reference to build fence_wol. >> ?(Of course!!. fence_na is good reference) > > Does wake-on-lan allow for: > > a) Forcing a node to power off, or does it just start an ACPI shutdown? > b) Can you check that the node is successfully off using wol? > > Unless wol can force a node off (ie: in the case of a hung OS) and can > return the current power state of the node, then I would be hesitant to use > it. > > As a general question though; You will need to write a script that follows > the FenceAgentAPI: > > https://fedorahosted.org/cluster/wiki/FenceAgentAPI > > You could make a few Node Assassin devices if you have access to Arduino > boards and don't mind soldering. However, you have 13 nodes, so you'd need > five of them... Not sure if that is feasible. If you want to to a test > cluster with fewer nodes though, it's probably more reasonable. > > -- > Digimer > E-Mail: ? ? ? ? ? ? ?digimer at alteeve.com > Freenode handle: ? ? digimer > Papers and Projects: http://alteeve.com > Node Assassin: ? ? ? http://nodeassassin.org > "I feel confined, only free to expand myself within boundaries." > -- Hiroyuki Sato From linux at alteeve.com Mon May 30 15:06:45 2011 From: linux at alteeve.com (Digimer) Date: Mon, 30 May 2011 11:06:45 -0400 Subject: [Linux-cluster] [Q] Good documentation about command line interface?? In-Reply-To: References: <4DE13160.7080709@alteeve.com> <4DE26DB9.6020905@alteeve.com> <4DE38D0D.8010800@alteeve.com> Message-ID: <4DE3B285.1010105@alteeve.com> On 05/30/2011 10:49 AM, Hiroyuki Sato wrote: > Hello Digimer. > > Thank you for your advice. > It is very very useful information for me. > >> a) Forcing a node to power off, or does it just start an ACPI shutdown? > > Maybe ok. I'll test it. To test, hang the host (echo c > /proc/sysrq-trigger), then try to force it to power off with wol. If this succeeds, you are in business. I have my doubts though. >> b) Can you check that the node is successfully off using wol? > > I'm not sure, I'll test it. Please do. If you can though, it will make IPMI far less needed. :) > Could you tell me one more thing. > > Where fenced will call fence agent?? 
> It is mean that the following > > * Can I check where fenced daemon will call fence_agent when I > execute fence_node?? > (that message send to master fenced, or localhost??) > * And Can I check ``where are master'' with command?? (If fenced is > master-slave type) > * Can I control master priority. > (for example I want to specify gfs1, gfs2, gfs3 as fenced master) > > Thanks again > > Regards. I'm not sure about the internals of cman, so I am not sure which machine actually sends the fence command. I do know that it has to come from a machine with quorum, and I do believe it is handled by the cluster manager. It's not like pacemaker where a DC is clearly defined. I'll try to sort out how the internals work and will let you know. -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "I feel confined, only free to expand myself within boundaries." From linux at alteeve.com Mon May 30 15:23:55 2011 From: linux at alteeve.com (Digimer) Date: Mon, 30 May 2011 11:23:55 -0400 Subject: [Linux-cluster] [Q] Good documentation about command line interface?? In-Reply-To: References: <4DE13160.7080709@alteeve.com> <4DE26DB9.6020905@alteeve.com> <4DE38D0D.8010800@alteeve.com> Message-ID: <4DE3B68B.9080605@alteeve.com> On 05/30/2011 10:49 AM, Hiroyuki Sato wrote: > Where fenced will call fence agent?? > It is mean that the following > > * Can I check where fenced daemon will call fence_agent when I > execute fence_node?? > (that message send to master fenced, or localhost??) > * And Can I check ``where are master'' with command?? (If fenced is > master-slave type) > * Can I control master priority. > (for example I want to specify gfs1, gfs2, gfs3 as fenced master) > > Thanks again > > Regards. It looks like the node with the lowest ID that is quorate sends the actual fence call. -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "I feel confined, only free to expand myself within boundaries." From kkovachev at varna.net Mon May 30 15:30:41 2011 From: kkovachev at varna.net (Kaloyan Kovachev) Date: Mon, 30 May 2011 18:30:41 +0300 Subject: [Linux-cluster] =?utf-8?q?=5BQ=5D_Good_documentation_about_comman?= =?utf-8?q?d_line_interface=3F=3F?= In-Reply-To: <4DE3B285.1010105@alteeve.com> References: <4DE13160.7080709@alteeve.com> <4DE26DB9.6020905@alteeve.com> <4DE38D0D.8010800@alteeve.com> <4DE3B285.1010105@alteeve.com> Message-ID: <8ccb8ebec7d0cd0071d3e2898d312b8e@mx.varna.net> On Mon, 30 May 2011 11:06:45 -0400, Digimer wrote: > On 05/30/2011 10:49 AM, Hiroyuki Sato wrote: >> Hello Digimer. >> >> Thank you for your advice. >> It is very very useful information for me. >> >>> a) Forcing a node to power off, or does it just start an ACPI shutdown? >> >> Maybe ok. I'll test it. > > To test, hang the host (echo c > /proc/sysrq-trigger), then try to force > it to power off with wol. If this succeeds, you are in business. I have > my doubts though. > >>> b) Can you check that the node is successfully off using wol? >> >> I'm not sure, I'll test it. > > Please do. If you can though, it will make IPMI far less needed. :) > >> Could you tell me one more thing. >> >> Where fenced will call fence agent?? >> It is mean that the following >> >> * Can I check where fenced daemon will call fence_agent when I >> execute fence_node?? >> (that message send to master fenced, or localhost??) 
>> * And Can I check ``where are master'' with command?? (If fenced is >> master-slave type) >> * Can I control master priority. >> (for example I want to specify gfs1, gfs2, gfs3 as fenced master) >> >> Thanks again >> >> Regards. > > I'm not sure about the internals of cman, so I am not sure which machine > actually sends the fence command. I do know that it has to come from a > machine with quorum, and I do believe it is handled by the cluster > manager. It's not like pacemaker where a DC is clearly defined. > > I'll try to sort out how the internals work and will let you know. Not sure where i got this information from (i think it was on this list), but for sure: the node with the lowest ID, which is quorate, will take the responsibility to call the fencing script From linux at alteeve.com Mon May 30 15:34:30 2011 From: linux at alteeve.com (Digimer) Date: Mon, 30 May 2011 11:34:30 -0400 Subject: [Linux-cluster] [Q] Good documentation about command line interface?? In-Reply-To: <8ccb8ebec7d0cd0071d3e2898d312b8e@mx.varna.net> References: <4DE13160.7080709@alteeve.com> <4DE26DB9.6020905@alteeve.com> <4DE38D0D.8010800@alteeve.com> <4DE3B285.1010105@alteeve.com> <8ccb8ebec7d0cd0071d3e2898d312b8e@mx.varna.net> Message-ID: <4DE3B906.4080005@alteeve.com> On 05/30/2011 11:30 AM, Kaloyan Kovachev wrote: >> actually sends the fence command. I do know that it has to come from a >> machine with quorum, and I do believe it is handled by the cluster >> manager. It's not like pacemaker where a DC is clearly defined. >> >> I'll try to sort out how the internals work and will let you know. > > Not sure where i got this information from (i think it was on this list), > but for sure: the node with the lowest ID, which is quorate, will take the > responsibility to call the fencing script Indeed, you are right. :) -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "I feel confined, only free to expand myself within boundaries." From hiroysato at gmail.com Mon May 30 16:54:30 2011 From: hiroysato at gmail.com (Hiroyuki Sato) Date: Tue, 31 May 2011 01:54:30 +0900 Subject: [Linux-cluster] [Q] Good documentation about command line interface?? In-Reply-To: <4DE3B906.4080005@alteeve.com> References: <4DE13160.7080709@alteeve.com> <4DE26DB9.6020905@alteeve.com> <4DE38D0D.8010800@alteeve.com> <4DE3B285.1010105@alteeve.com> <8ccb8ebec7d0cd0071d3e2898d312b8e@mx.varna.net> <4DE3B906.4080005@alteeve.com> Message-ID: Hello Digimer and Kaloyan Thank you for your information. I'll set gfs1, gfs2 and gfs3 with lowest ID (ex, 1,2,3). I found the following Notes in fenced/recover.c recover.c Notes: - When fenced is started, the complete list is initialized to all the nodes in cluster.conf. - fence_victims actually only runs on one of the nodes in the domain so that a victim isn't fenced by everyone. - The node to run fence_victims is the node with lowest id that's in both complete and prev lists. - This node will never be a node that's just joining since by definition the joining node wasn't in the last complete group. - An exception to this is when there is just one node in the group in which case it's chosen even if it wasn't in the last complete group. - There's also a leaving list that parallels the victims list but are not fenced. Here is call procedures. recover.c do_recovery fence_victims dispatch_fence_agent agent.c dispatch_fence_agent use_device run_agent exec fence_XXXX Regards. 
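For readers following the fence_wol idea from earlier in this thread, a very rough sketch of the agent that ends up being exec'd by run_agent is shown below. It only illustrates the calling convention described on the FenceAgentAPI page Digimer linked: fenced hands the agent its arguments as name=value lines on stdin, and exit 0 means success. The key names mac and ipaddr and the ether-wake call are illustrative assumptions, not anything the API defines, and as Digimer already noted wake-on-lan cannot force a hung node off or report its power state, so a real deployment would still need IPMI, a PDU or similar behind power_off and node_alive.

#!/bin/bash
# Sketch of a FenceAgentAPI-style agent, NOT a working fence_wol.
# fenced passes arguments as name=value pairs, one per line, on stdin.

ACTION="reboot"; MAC=""; IPADDR=""

while read -r line; do
    case "$line" in
        action=*|option=*) ACTION="${line#*=}" ;;  # key name varies by agent generation
        mac=*)             MAC="${line#*=}" ;;     # assumed cluster.conf attribute
        ipaddr=*)          IPADDR="${line#*=}" ;;  # assumed cluster.conf attribute
    esac
done

power_on() {
    # ether-wake (from net-tools) sends the magic packet; wakeonlan is
    # a Perl alternative closer in spirit to fence_na.
    ether-wake "$MAC"
}

power_off() {
    # Wake-on-lan has no way to force a node off. Returning failure
    # here keeps fenced from believing a hung node was really killed.
    return 1
}

node_alive() {
    # Weak substitute for a power-state query: only proves the OS is up.
    ping -c 1 -w 2 "$IPADDR" >/dev/null 2>&1
}

case "$ACTION" in
    on)              power_on ;;
    off)             power_off ;;
    reboot)          power_off && power_on ;;
    status|monitor)  node_alive ;;
    metadata)        exit 0 ;;  # newer agents print their parameter description here; omitted
    *)               exit 1 ;;
esac

The script exits with the status of whichever branch ran, which is all fenced looks at; everything beyond that stdin/exit-code contract is deliberately left out.
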
2011/5/31 Digimer : > On 05/30/2011 11:30 AM, Kaloyan Kovachev wrote: >>> >>> actually sends the fence command. I do know that it has to come from a >>> machine with quorum, and I do believe it is handled by the cluster >>> manager. It's not like pacemaker where a DC is clearly defined. >>> >>> I'll try to sort out how the internals work and will let you know. >> >> Not sure where i got this information from (i think it was on this list), >> but for sure: the node with the lowest ID, which is quorate, will take the >> responsibility to call the fencing script > > Indeed, you are right. :) > > -- > Digimer > E-Mail: ? ? ? ? ? ? ?digimer at alteeve.com > Freenode handle: ? ? digimer > Papers and Projects: http://alteeve.com > Node Assassin: ? ? ? http://nodeassassin.org > "I feel confined, only free to expand myself within boundaries." > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Hiroyuki Sato From swap_project at yahoo.com Mon May 30 19:17:07 2011 From: swap_project at yahoo.com (Srija) Date: Mon, 30 May 2011 12:17:07 -0700 (PDT) Subject: [Linux-cluster] Cluster environment issue In-Reply-To: Message-ID: <17723.13598.qm@web112805.mail.gq1.yahoo.com> Hi, I am very new to the redhat cluster. Need some help and suggession for the cluster configuration. We have sixteen node cluster of OS : Linux Server release 5.5 (Tikanga) kernel : 2.6.18-194.3.1.el5xen. The problem is sometimes the cluster is getting broken. The solution is (still yet)to reboot the sixteen nodes. Otherwise the nodes are not joining We are using clvm and not using any quorum disk. The quorum is by default. When it is getting broken, clustat commands shows evrything offline except the node from where the clustat command executed. If we execute vgs, lvs command, those commands are getting hung. Here is at present the clustat report ------------------------------------- [server1]# clustat Cluster Status for newcluster @ Mon May 30 14:55:10 2011 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ server1 1 Online server2 2 Online, Local server3 3 Online server4 4 Online server5 5 Online server6 6 Online server7 7 Online server8 8 Online server9 9 Online server10 10 Online server11 11 Online server12 12 Online server13 13 Online server14 14 Online server15 15 Online server16 16 Online Here the cman_tool status output from one server -------------------------------------------------- [server1 ~]# cman_tool status Version: 6.2.0 Config Version: 23 Cluster Name: newcluster Cluster Id: 53322 Cluster Member: Yes Cluster Generation: 11432 Membership state: Cluster-Member Nodes: 16 Expected votes: 16 Total votes: 16 Quorum: 9 Active subsystems: 8 Flags: Dirty Ports Bound: 0 11 Node name: server1 Node ID: 1 Multicast addresses: xxx.xxx.xxx.xx Node addresses: 192.168.xxx.xx Here is the cluster.conf file. ------------------------------ [ ... sinp .....] .......... 
Here is the lvm.conf file -------------------------- devices { dir = "/dev" scan = [ "/dev" ] preferred_names = [ ] filter = [ "r/scsi.*/","r/pci.*/","r/sd.*/","a/.*/" ] cache_dir = "/etc/lvm/cache" cache_file_prefix = "" write_cache_state = 1 sysfs_scan = 1 md_component_detection = 1 md_chunk_alignment = 1 data_alignment_detection = 1 data_alignment = 0 data_alignment_offset_detection = 1 ignore_suspended_devices = 0 } log { verbose = 0 syslog = 1 overwrite = 0 level = 0 indent = 1 command_names = 0 prefix = " " } backup { backup = 1 backup_dir = "/etc/lvm/backup" archive = 1 archive_dir = "/etc/lvm/archive" retain_min = 10 retain_days = 30 } shell { history_size = 100 } global { library_dir = "/usr/lib64" umask = 077 test = 0 units = "h" si_unit_consistency = 0 activation = 1 proc = "/proc" locking_type = 3 wait_for_locks = 1 fallback_to_clustered_locking = 1 fallback_to_local_locking = 1 locking_dir = "/var/lock/lvm" prioritise_write_locks = 1 } activation { udev_sync = 1 missing_stripe_filler = "error" reserved_stack = 256 reserved_memory = 8192 process_priority = -18 mirror_region_size = 512 readahead = "auto" mirror_log_fault_policy = "allocate" mirror_image_fault_policy = "remove" } dmeventd { mirror_library = "libdevmapper-event-lvm2mirror.so" snapshot_library = "libdevmapper-event-lvm2snapshot.so" } If you need more information, I can provide ... Thanks for your help Priya From kkovachev at varna.net Mon May 30 20:05:38 2011 From: kkovachev at varna.net (Kaloyan Kovachev) Date: Mon, 30 May 2011 23:05:38 +0300 Subject: [Linux-cluster] Cluster environment issue In-Reply-To: <17723.13598.qm@web112805.mail.gq1.yahoo.com> References: <17723.13598.qm@web112805.mail.gq1.yahoo.com> Message-ID: Hi, when your cluster gets broken, most likely the reason is, there is a network problem (switch restart or multicast traffic is lost for a while) on the interface where serverX-priv IPs are configured. Having a quorum disk may help by giving a quorum vote to one of the servers, so it can fence the others, but the best thing to do is to fix your network and preferably add a redundant link for the cluster communication to avoid breakage in the first place On Mon, 30 May 2011 12:17:07 -0700 (PDT), Srija wrote: > Hi, > > I am very new to the redhat cluster. Need some help and suggession for the > cluster configuration. > We have sixteen node cluster of > > OS : Linux Server release 5.5 (Tikanga) > kernel : 2.6.18-194.3.1.el5xen. > > The problem is sometimes the cluster is getting broken. The solution is > (still yet)to reboot the > sixteen nodes. Otherwise the nodes are not joining > > We are using clvm and not using any quorum disk. The quorum is by default. > > When it is getting broken, clustat commands shows evrything offline > except the node from where > the clustat command executed. If we execute vgs, lvs command, those > commands are getting hung. 
> > Here is at present the clustat report > ------------------------------------- > > [server1]# clustat > Cluster Status for newcluster @ Mon May 30 14:55:10 2011 > Member Status: Quorate > > Member Name ID Status > ------ ---- ---- ------ > server1 1 Online > server2 2 Online, Local > server3 3 Online > server4 4 Online > server5 5 Online > server6 6 Online > server7 7 Online > server8 8 Online > server9 9 Online > server10 10 Online > server11 11 Online > server12 12 Online > server13 13 Online > server14 14 Online > server15 15 Online > server16 16 Online > > Here the cman_tool status output from one server > -------------------------------------------------- > > [server1 ~]# cman_tool status > Version: 6.2.0 > Config Version: 23 > Cluster Name: newcluster > Cluster Id: 53322 > Cluster Member: Yes > Cluster Generation: 11432 > Membership state: Cluster-Member > Nodes: 16 > Expected votes: 16 > Total votes: 16 > Quorum: 9 > Active subsystems: 8 > Flags: Dirty > Ports Bound: 0 11 > Node name: server1 > Node ID: 1 > Multicast addresses: xxx.xxx.xxx.xx > Node addresses: 192.168.xxx.xx > > > Here is the cluster.conf file. > ------------------------------ > > > > > > > > > > > > > > > > > > > > > > > > > > [ ... sinp .....] > > > > > > > > > > > > > > > name="ilo-server1r" passwd="xxxxx"/> > .......... > name="ilo-server16r" passwd="xxxxx"/> > > > > > > > Here is the lvm.conf file > -------------------------- > > devices { > > dir = "/dev" > scan = [ "/dev" ] > preferred_names = [ ] > filter = [ "r/scsi.*/","r/pci.*/","r/sd.*/","a/.*/" ] > cache_dir = "/etc/lvm/cache" > cache_file_prefix = "" > write_cache_state = 1 > sysfs_scan = 1 > md_component_detection = 1 > md_chunk_alignment = 1 > data_alignment_detection = 1 > data_alignment = 0 > data_alignment_offset_detection = 1 > ignore_suspended_devices = 0 > } > > log { > > verbose = 0 > syslog = 1 > overwrite = 0 > level = 0 > indent = 1 > command_names = 0 > prefix = " " > } > > backup { > > backup = 1 > backup_dir = "/etc/lvm/backup" > archive = 1 > archive_dir = "/etc/lvm/archive" > retain_min = 10 > retain_days = 30 > } > > shell { > > history_size = 100 > } > global { > library_dir = "/usr/lib64" > umask = 077 > test = 0 > units = "h" > si_unit_consistency = 0 > activation = 1 > proc = "/proc" > locking_type = 3 > wait_for_locks = 1 > fallback_to_clustered_locking = 1 > fallback_to_local_locking = 1 > locking_dir = "/var/lock/lvm" > prioritise_write_locks = 1 > } > > activation { > udev_sync = 1 > missing_stripe_filler = "error" > reserved_stack = 256 > reserved_memory = 8192 > process_priority = -18 > mirror_region_size = 512 > readahead = "auto" > mirror_log_fault_policy = "allocate" > mirror_image_fault_policy = "remove" > } > dmeventd { > > mirror_library = "libdevmapper-event-lvm2mirror.so" > snapshot_library = "libdevmapper-event-lvm2snapshot.so" > } > > > If you need more information, I can provide ... > > Thanks for your help > Priya > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From swap_project at yahoo.com Tue May 31 01:22:00 2011 From: swap_project at yahoo.com (Srija) Date: Mon, 30 May 2011 18:22:00 -0700 (PDT) Subject: [Linux-cluster] Cluster environment issue In-Reply-To: Message-ID: <584441.81591.qm@web112812.mail.gq1.yahoo.com> Thanks for your quick reply. I talked to the network people , but they are saying everything is good at their end. Is there anyway at the server end, to figure it for the switch restart or multicast traffic? 
I think you have already checked the cluster.conf file.. Except quorum disk, do you think that the cluster configuration is sufficient for handling the sixteen node cluster!! thanks again . regards --- On Mon, 5/30/11, Kaloyan Kovachev wrote: > From: Kaloyan Kovachev > Subject: Re: [Linux-cluster] Cluster environment issue > To: "linux clustering" > Date: Monday, May 30, 2011, 4:05 PM > Hi, > when your cluster gets broken, most likely the reason is, > there is a > network problem (switch restart or multicast traffic is > lost for a while) > on the interface where serverX-priv IPs are configured. > Having a quorum > disk may help by giving a quorum vote to one of the > servers, so it can > fence the others, but the best thing to do is to fix your > network and > preferably add a redundant link for the cluster > communication to avoid > breakage in the first place > > On Mon, 30 May 2011 12:17:07 -0700 (PDT), Srija > wrote: > > Hi, > > > > I am very new to the redhat cluster. Need some help > and suggession for > the > > cluster configuration. > > We have sixteen node cluster of > > > >? ? ? ? ? ???OS > : Linux Server release 5.5 (Tikanga) > >? ? ? ? ? > ???kernel :? 2.6.18-194.3.1.el5xen. > > > > The problem is sometimes the cluster is getting? > broken. The solution is > > (still yet)to reboot the > > sixteen nodes. Otherwise the nodes are not joining > > > > We are using? clvm and not using any quorum disk. > The quorum is by > default. > > > > When it is getting broken, clustat commands > shows? evrything? offline > > except the node from where > > the clustat command executed.? If we execute vgs, > lvs command, those > > commands are getting hung. > > > > Here is at present the clustat report > > ------------------------------------- > > > > [server1]# clustat > > Cluster Status for newcluster @ Mon May 30 14:55:10 > 2011 > > Member Status: Quorate > > > >? Member Name? ? ? ? ? > ? ? ? ? ? ? > ID???Status > >? ------ ----? ? ? ? ? > ? ? ? ? ? ? ---- ------ > >? server1? ? ? ? ? ? > ? ? ? ? ? ? ? 1 Online > >? server2? ? ? ? ? ? > ? ? ? ? ? ? ? 2 Online, > Local > >? server3? ? ? ? ? ? > ? ? ? ? ? ? ? 3 Online > >? server4? ? ? ? ? ? > ? ? ? ? ? ? ? 4 Online > >? server5? ? ? ? ? ? > ? ? ? ? ? ? ? 5 Online > >? server6? ? ? ? ? ? > ? ? ? ? ? ? ? 6 Online > >? server7? ? ? ? ? ? > ? ? ? ? ? ? ? 7 Online > >? server8? ? ? ? ? ? > ? ? ? ? ? ? ? 8 Online > >? server9? ? ? ? ? ? > ? ? ? ? ? ? ? 9 Online > >? server10? ? ? ? ? > ? ? ? ? ? ? > ???10 Online > >? server11? ? ? ? ? > ? ? ? ? ? ? > ???11 Online > >? server12? ? ? ? ? > ? ? ? ? ? ? > ???12 Online > >? server13? ? ? ? ? > ? ? ? ? ? ? > ???13 Online > >? server14? ? ? ? ? > ? ? ? ? ? ? > ???14 Online > >? server15? ? ? ? ? > ? ? ? ? ? ? > ???15 Online > >? server16? ? ? ? ? > ? ? ? ? ? ? > ???16 Online > > > > Here the cman_tool status? output? from one > server > > -------------------------------------------------- > > > > [server1 ~]# cman_tool status > > Version: 6.2.0 > > Config Version: 23 > > Cluster Name: newcluster > > Cluster Id: 53322 > > Cluster Member: Yes > > Cluster Generation: 11432 > > Membership state: Cluster-Member > > Nodes: 16 > > Expected votes: 16 > > Total votes: 16 > > Quorum: 9? > > Active subsystems: 8 > > Flags: Dirty > > Ports Bound: 0 11? > > Node name: server1 > > Node ID: 1 > > Multicast addresses: xxx.xxx.xxx.xx > > Node addresses: 192.168.xxx.xx > > > > > > Here is the cluster.conf file. 
> > ------------------------------ > > > > > > name="newcluster"> > > post_join_delay="15"/> > > > > > > > > votes="1"> > >? ? ? ? ? ? ? ? > ? > >? ? ? ? ? ? ? ? > ? > >? ? ? ? ? ? ? ? > ? > > > > > > votes="1"> > >? ? ? > ??? > >? ? ? ??? name="ilo-server2r"/> > >? ? ? ??? > > > > > > votes="1"> > >? ? ? > ??? > >? ? ? ??? name="ilo-server3r"/> > >? ? ? ??? > > > > > > [ ... sinp .....] > > > > votes="1"> > >? ? ? ? name="1"> > >? ? ? ? name="ilo-server16r"/> > >? ? ? ? > > > > > > > > > > > > > > > > > > > >? ? ? ??? agent="fence_ilo" hostname="server1r" login="Admin" > >? ? ? > ???name="ilo-server1r" passwd="xxxxx"/> > >? ? ? ???.......... > >? ? ? ??? agent="fence_ilo" hostname="server16r" > login="Admin" > >? ? ? > ???name="ilo-server16r" passwd="xxxxx"/> > > > > > > > > > > > > > > Here is the lvm.conf file > > -------------------------- > > > > devices { > > > >? ???dir = "/dev" > >? ???scan = [ "/dev" ] > >? ???preferred_names = [ ] > >? ???filter = [ > "r/scsi.*/","r/pci.*/","r/sd.*/","a/.*/" ] > >? ???cache_dir = "/etc/lvm/cache" > >? ???cache_file_prefix = "" > >? ???write_cache_state = 1 > >? ???sysfs_scan = 1 > >? ???md_component_detection = 1 > >? ???md_chunk_alignment = 1 > >? ???data_alignment_detection = 1 > >? ???data_alignment = 0 > >? > ???data_alignment_offset_detection = 1 > >? ???ignore_suspended_devices = 0 > > } > > > > log { > > > >? ???verbose = 0 > >? ???syslog = 1 > >? ???overwrite = 0 > >? ???level = 0 > >? ???indent = 1 > >? ???command_names = 0 > >? ???prefix = "? " > > } > > > > backup { > > > >? ???backup = 1 > >? ???backup_dir = > "/etc/lvm/backup" > >? ???archive = 1 > >? ???archive_dir = > "/etc/lvm/archive" > >? ???retain_min = 10 > >? ???retain_days = 30 > > } > > > > shell { > > > >? ???history_size = 100 > > } > > global { > >? ???library_dir = "/usr/lib64" > >? ???umask = 077 > >? ???test = 0 > >? ???units = "h" > >? ???si_unit_consistency = 0 > >? ???activation = 1 > >? ???proc = "/proc" > >? ???locking_type = 3 > >? ???wait_for_locks = 1 > >? ???fallback_to_clustered_locking > = 1 > >? ???fallback_to_local_locking = 1 > >? ???locking_dir = "/var/lock/lvm" > >? ???prioritise_write_locks = 1 > > } > > > > activation { > >? ???udev_sync = 1 > >? ???missing_stripe_filler = > "error" > >? ???reserved_stack = 256 > >? ???reserved_memory = 8192 > >? ???process_priority = -18 > >? ???mirror_region_size = 512 > >? ???readahead = "auto" > >? ???mirror_log_fault_policy = > "allocate" > >? ???mirror_image_fault_policy = > "remove" > > } > > dmeventd { > > > >? ???mirror_library = > "libdevmapper-event-lvm2mirror.so" > >? ???snapshot_library = > "libdevmapper-event-lvm2snapshot.so" > > } > > > > > > If you need more? information,? I can > provide ... > > > > Thanks for your help > > Priya > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From hiroysato at gmail.com Tue May 31 03:03:58 2011 From: hiroysato at gmail.com (Hiroyuki Sato) Date: Tue, 31 May 2011 12:03:58 +0900 Subject: [Linux-cluster] Cluster environment issue In-Reply-To: <584441.81591.qm@web112812.mail.gq1.yahoo.com> References: <584441.81591.qm@web112812.mail.gq1.yahoo.com> Message-ID: Hello I'm not sure, This is useful or not. Have you ever checked ``ping some_where'' on domU when cluster is broken?? ( I thought you are using Xen, because you are using 2.6.18-194.3.1.el5xen. 
) If it does not respond anything, you should check iptables. (ex, disable iptables) -- Hiroyuki Sato 2011/5/31 Srija : > Thanks for your quick reply. > > I talked to the network people , but they are saying everything is good at their end. Is there anyway at the server end, to figure it ?for the switch restart or multicast traffic? > > I think you have already checked the cluster.conf file.. Except quorum disk, do you think that the cluster configuration is sufficient for handling the sixteen node cluster!! > > thanks again . > regards > > --- On Mon, 5/30/11, Kaloyan Kovachev wrote: > >> From: Kaloyan Kovachev >> Subject: Re: [Linux-cluster] Cluster environment issue >> To: "linux clustering" >> Date: Monday, May 30, 2011, 4:05 PM >> Hi, >> ?when your cluster gets broken, most likely the reason is, >> there is a >> network problem (switch restart or multicast traffic is >> lost for a while) >> on the interface where serverX-priv IPs are configured. >> Having a quorum >> disk may help by giving a quorum vote to one of the >> servers, so it can >> fence the others, but the best thing to do is to fix your >> network and >> preferably add a redundant link for the cluster >> communication to avoid >> breakage in the first place >> >> On Mon, 30 May 2011 12:17:07 -0700 (PDT), Srija >> wrote: >> > Hi, >> > >> > I am very new to the redhat cluster. Need some help >> and suggession for >> the >> > cluster configuration. >> > We have sixteen node cluster of >> > >> >? ? ? ? ? ???OS >> : Linux Server release 5.5 (Tikanga) >> > >> ???kernel :? 2.6.18-194.3.1.el5xen. >> > >> > The problem is sometimes the cluster is getting >> broken. The solution is >> > (still yet)to reboot the >> > sixteen nodes. Otherwise the nodes are not joining >> > >> > We are using? clvm and not using any quorum disk. >> The quorum is by >> default. >> > >> > When it is getting broken, clustat commands >> shows? evrything? offline >> > except the node from where >> > the clustat command executed.? If we execute vgs, >> lvs command, those >> > commands are getting hung. >> > >> > Here is at present the clustat report >> > ------------------------------------- >> > >> > [server1]# clustat >> > Cluster Status for newcluster @ Mon May 30 14:55:10 >> 2011 >> > Member Status: Quorate >> > >> >? Member Name >> >> ID???Status >> >? ------ ---- >> ? ? ? ? ? ? ---- ------ >> >? server1 >> ? ? ? ? ? ? ? 1 Online >> >? server2 >> ? ? ? ? ? ? ? 2 Online, >> Local >> >? server3 >> ? ? ? ? ? ? ? 3 Online >> >? server4 >> ? ? ? ? ? ? ? 4 Online >> >? server5 >> ? ? ? ? ? ? ? 5 Online >> >? server6 >> ? ? ? ? ? ? ? 6 Online >> >? server7 >> ? ? ? ? ? ? ? 7 Online >> >? server8 >> ? ? ? ? ? ? ? 8 Online >> >? server9 >> ? ? ? ? ? ? ? 9 Online >> >? server10 >> >> ???10 Online >> >? server11 >> >> ???11 Online >> >? server12 >> >> ???12 Online >> >? server13 >> >> ???13 Online >> >? server14 >> >> ???14 Online >> >? server15 >> >> ???15 Online >> >? server16 >> >> ???16 Online >> > >> > Here the cman_tool status? output? 
from one >> server >> > -------------------------------------------------- >> > >> > [server1 ~]# cman_tool status >> > Version: 6.2.0 >> > Config Version: 23 >> > Cluster Name: newcluster >> > Cluster Id: 53322 >> > Cluster Member: Yes >> > Cluster Generation: 11432 >> > Membership state: Cluster-Member >> > Nodes: 16 >> > Expected votes: 16 >> > Total votes: 16 >> > Quorum: 9 >> > Active subsystems: 8 >> > Flags: Dirty >> > Ports Bound: 0 11 >> > Node name: server1 >> > Node ID: 1 >> > Multicast addresses: xxx.xxx.xxx.xx >> > Node addresses: 192.168.xxx.xx >> > >> > >> > Here is the cluster.conf file. >> > ------------------------------ >> > >> > >> > > name="newcluster"> >> > > post_join_delay="15"/> >> > >> > >> > >> > > votes="1"> >> > >> ? >> > >> ? >> > >> ? >> > >> > >> > > votes="1"> >> > >> ??? >> >? ? ? ???> name="ilo-server2r"/> >> >? ? ? ??? >> > >> > >> > > votes="1"> >> > >> ??? >> >? ? ? ???> name="ilo-server3r"/> >> >? ? ? ??? >> > >> > >> > [ ... sinp .....] >> > >> > > votes="1"> >> >? ? ? ? > name="1"> >> >? ? ? ? > name="ilo-server16r"/> >> >? ? ? ? >> > >> > >> > >> > >> > >> > >> > >> > >> > >> >? ? ? ???> agent="fence_ilo" hostname="server1r" login="Admin" >> > >> ???name="ilo-server1r" passwd="xxxxx"/> >> >? ? ? ???.......... >> >? ? ? ???> agent="fence_ilo" hostname="server16r" >> login="Admin" >> > >> ???name="ilo-server16r" passwd="xxxxx"/> >> > >> > >> > >> > >> > >> > >> > Here is the lvm.conf file >> > -------------------------- >> > >> > devices { >> > >> >? ???dir = "/dev" >> >? ???scan = [ "/dev" ] >> >? ???preferred_names = [ ] >> >? ???filter = [ >> "r/scsi.*/","r/pci.*/","r/sd.*/","a/.*/" ] >> >? ???cache_dir = "/etc/lvm/cache" >> >? ???cache_file_prefix = "" >> >? ???write_cache_state = 1 >> >? ???sysfs_scan = 1 >> >? ???md_component_detection = 1 >> >? ???md_chunk_alignment = 1 >> >? ???data_alignment_detection = 1 >> >? ???data_alignment = 0 >> > >> ???data_alignment_offset_detection = 1 >> >? ???ignore_suspended_devices = 0 >> > } >> > >> > log { >> > >> >? ???verbose = 0 >> >? ???syslog = 1 >> >? ???overwrite = 0 >> >? ???level = 0 >> >? ???indent = 1 >> >? ???command_names = 0 >> >? ???prefix = "? " >> > } >> > >> > backup { >> > >> >? ???backup = 1 >> >? ???backup_dir = >> "/etc/lvm/backup" >> >? ???archive = 1 >> >? ???archive_dir = >> "/etc/lvm/archive" >> >? ???retain_min = 10 >> >? ???retain_days = 30 >> > } >> > >> > shell { >> > >> >? ???history_size = 100 >> > } >> > global { >> >? ???library_dir = "/usr/lib64" >> >? ???umask = 077 >> >? ???test = 0 >> >? ???units = "h" >> >? ???si_unit_consistency = 0 >> >? ???activation = 1 >> >? ???proc = "/proc" >> >? ???locking_type = 3 >> >? ???wait_for_locks = 1 >> >? ???fallback_to_clustered_locking >> = 1 >> >? ???fallback_to_local_locking = 1 >> >? ???locking_dir = "/var/lock/lvm" >> >? ???prioritise_write_locks = 1 >> > } >> > >> > activation { >> >? ???udev_sync = 1 >> >? ???missing_stripe_filler = >> "error" >> >? ???reserved_stack = 256 >> >? ???reserved_memory = 8192 >> >? ???process_priority = -18 >> >? ???mirror_region_size = 512 >> >? ???readahead = "auto" >> >? ???mirror_log_fault_policy = >> "allocate" >> >? ???mirror_image_fault_policy = >> "remove" >> > } >> > dmeventd { >> > >> >? ???mirror_library = >> "libdevmapper-event-lvm2mirror.so" >> >? ???snapshot_library = >> "libdevmapper-event-lvm2snapshot.so" >> > } >> > >> > >> > If you need more? information,? I can >> provide ... 
>> > >> > Thanks for your help >> > Priya >> > >> > -- >> > Linux-cluster mailing list >> > Linux-cluster at redhat.com >> > https://www.redhat.com/mailman/listinfo/linux-cluster >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From fdinitto at redhat.com Tue May 31 04:36:05 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Tue, 31 May 2011 06:36:05 +0200 Subject: [Linux-cluster] [Cluster-devel] new RHCS upstream wiki In-Reply-To: <4DC7CF28.30602@redhat.com> References: <4DC7CF28.30602@redhat.com> Message-ID: <4DE47035.9040405@redhat.com> On 05/09/2011 01:25 PM, Fabio M. Di Nitto wrote: > Hi all, > > we are in the process of moving the old cluster wiki > (http://sourceware.org/cluster/wiki/) to: > > https://fedorahosted.org/cluster/wiki/HomePage The relocation is now complete and the old wiki is redirecting users to the new one. I'd like to thanks Digimer for doing the heavy lifting of fixing all pages. The very last thing left to do is to create a proper default home page with a summary and maybe a logo... anybody would like to suggest one? the winner will get a month free support on IRC #linux-cluster ;) Fabio From kkovachev at varna.net Tue May 31 09:18:51 2011 From: kkovachev at varna.net (Kaloyan Kovachev) Date: Tue, 31 May 2011 12:18:51 +0300 Subject: [Linux-cluster] Cluster environment issue In-Reply-To: <584441.81591.qm@web112812.mail.gq1.yahoo.com> References: <584441.81591.qm@web112812.mail.gq1.yahoo.com> Message-ID: <9537f2a4eb5ae1c11038deed2e3fe40f@mx.varna.net> On Mon, 30 May 2011 18:22:00 -0700 (PDT), Srija wrote: > Thanks for your quick reply. > > I talked to the network people , but they are saying everything is good at > their end. Is there anyway at the server end, to figure it for the switch > restart or multicast traffic? > If it is a switch restart you will have in your logs the interface going down/up, but more problematic is to find a short drop of the multicast traffic (even with a ping script you may miss it) which is more likely the case, as your cluster is working fine, but suddenly looses connection to all nodes at the same time. You may ask the network people to check for STP changes and double check the multicast configuration and you may also try to use broadcast instead of multicast or use a dedicated switch. > I think you have already checked the cluster.conf file.. Except quorum > disk, do you think that the cluster configuration is sufficient for > handling the sixteen node cluster!! > The config is OK ... probably add specific multicast address in the cman section to avoid surprises, but the default is also fine. To confirm it is a multicast drop (if you are lucky not ti miss it) - on just one of the nodes enable icmp broadcasts: echo 0 >/proc/sys/net/ipv4/icmp_echo_ignore_broadcasts then ping from another node, check if just a single one replies (change to your interface and multicast address) ping -I ethX -b -L 239.x.x.x -c 1 and finaly run this script until the cluster gets broken while [ $((`ping -I ethX -w 1 -b -L 239.x.x.x -c 1 | grep -c ' 0% packet loss'`)) -eq 1 ]; do sleep 1; done; echo "missed ping at `date`" if you get 'missed ping' at the same when cluster goes down - it is confirmed :) > thanks again . 
> regards > > --- On Mon, 5/30/11, Kaloyan Kovachev wrote: > >> From: Kaloyan Kovachev >> Subject: Re: [Linux-cluster] Cluster environment issue >> To: "linux clustering" >> Date: Monday, May 30, 2011, 4:05 PM >> Hi, >> when your cluster gets broken, most likely the reason is, >> there is a >> network problem (switch restart or multicast traffic is >> lost for a while) >> on the interface where serverX-priv IPs are configured. >> Having a quorum >> disk may help by giving a quorum vote to one of the >> servers, so it can >> fence the others, but the best thing to do is to fix your >> network and >> preferably add a redundant link for the cluster >> communication to avoid >> breakage in the first place >> >> On Mon, 30 May 2011 12:17:07 -0700 (PDT), Srija >> wrote: >> > Hi, >> > >> > I am very new to the redhat cluster. Need some help >> and suggession for >> the >> > cluster configuration. >> > We have sixteen node cluster of >> > >> > OS >> : Linux Server release 5.5 (Tikanga) >> > >> kernel : 2.6.18-194.3.1.el5xen. >> > >> > The problem is sometimes the cluster is getting >> broken. The solution is >> > (still yet)to reboot the >> > sixteen nodes. Otherwise the nodes are not joining >> > >> > We are using clvm and not using any quorum disk. >> The quorum is by >> default. >> > >> > When it is getting broken, clustat commands >> shows evrything offline >> > except the node from where >> > the clustat command executed. If we execute vgs, >> lvs command, those >> > commands are getting hung. >> > >> > Here is at present the clustat report >> > ------------------------------------- >> > >> > [server1]# clustat >> > Cluster Status for newcluster @ Mon May 30 14:55:10 >> 2011 >> > Member Status: Quorate >> > >> > Member Name >> >> ID Status >> > ------ ---- >> ---- ------ >> > server1 >> 1 Online >> > server2 >> 2 Online, >> Local >> > server3 >> 3 Online >> > server4 >> 4 Online >> > server5 >> 5 Online >> > server6 >> 6 Online >> > server7 >> 7 Online >> > server8 >> 8 Online >> > server9 >> 9 Online >> > server10 >> >> 10 Online >> > server11 >> >> 11 Online >> > server12 >> >> 12 Online >> > server13 >> >> 13 Online >> > server14 >> >> 14 Online >> > server15 >> >> 15 Online >> > server16 >> >> 16 Online >> > >> > Here the cman_tool status output from one >> server >> > -------------------------------------------------- >> > >> > [server1 ~]# cman_tool status >> > Version: 6.2.0 >> > Config Version: 23 >> > Cluster Name: newcluster >> > Cluster Id: 53322 >> > Cluster Member: Yes >> > Cluster Generation: 11432 >> > Membership state: Cluster-Member >> > Nodes: 16 >> > Expected votes: 16 >> > Total votes: 16 >> > Quorum: 9 >> > Active subsystems: 8 >> > Flags: Dirty >> > Ports Bound: 0 11 >> > Node name: server1 >> > Node ID: 1 >> > Multicast addresses: xxx.xxx.xxx.xx >> > Node addresses: 192.168.xxx.xx >> > >> > >> > Here is the cluster.conf file. >> > ------------------------------ >> > >> > >> > > name="newcluster"> >> > > post_join_delay="15"/> >> > >> > >> > >> > > votes="1"> >> > >> >> > >> >> > >> >> > >> > >> > > votes="1"> >> > >> >> > > name="ilo-server2r"/> >> > >> > >> > >> > > votes="1"> >> > >> >> > > name="ilo-server3r"/> >> > >> > >> > >> > [ ... sinp .....] >> > >> > > votes="1"> >> > > name="1"> >> > > name="ilo-server16r"/> >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > > agent="fence_ilo" hostname="server1r" login="Admin" >> > >> name="ilo-server1r" passwd="xxxxx"/> >> > .......... 
>> > > agent="fence_ilo" hostname="server16r" >> login="Admin" >> > >> name="ilo-server16r" passwd="xxxxx"/> >> > >> > >> > >> > >> > >> > >> > Here is the lvm.conf file >> > -------------------------- >> > >> > devices { >> > >> > dir = "/dev" >> > scan = [ "/dev" ] >> > preferred_names = [ ] >> > filter = [ >> "r/scsi.*/","r/pci.*/","r/sd.*/","a/.*/" ] >> > cache_dir = "/etc/lvm/cache" >> > cache_file_prefix = "" >> > write_cache_state = 1 >> > sysfs_scan = 1 >> > md_component_detection = 1 >> > md_chunk_alignment = 1 >> > data_alignment_detection = 1 >> > data_alignment = 0 >> > >> data_alignment_offset_detection = 1 >> > ignore_suspended_devices = 0 >> > } >> > >> > log { >> > >> > verbose = 0 >> > syslog = 1 >> > overwrite = 0 >> > level = 0 >> > indent = 1 >> > command_names = 0 >> > prefix = " " >> > } >> > >> > backup { >> > >> > backup = 1 >> > backup_dir = >> "/etc/lvm/backup" >> > archive = 1 >> > archive_dir = >> "/etc/lvm/archive" >> > retain_min = 10 >> > retain_days = 30 >> > } >> > >> > shell { >> > >> > history_size = 100 >> > } >> > global { >> > library_dir = "/usr/lib64" >> > umask = 077 >> > test = 0 >> > units = "h" >> > si_unit_consistency = 0 >> > activation = 1 >> > proc = "/proc" >> > locking_type = 3 >> > wait_for_locks = 1 >> > fallback_to_clustered_locking >> = 1 >> > fallback_to_local_locking = 1 >> > locking_dir = "/var/lock/lvm" >> > prioritise_write_locks = 1 >> > } >> > >> > activation { >> > udev_sync = 1 >> > missing_stripe_filler = >> "error" >> > reserved_stack = 256 >> > reserved_memory = 8192 >> > process_priority = -18 >> > mirror_region_size = 512 >> > readahead = "auto" >> > mirror_log_fault_policy = >> "allocate" >> > mirror_image_fault_policy = >> "remove" >> > } >> > dmeventd { >> > >> > mirror_library = >> "libdevmapper-event-lvm2mirror.so" >> > snapshot_library = >> "libdevmapper-event-lvm2snapshot.so" >> > } >> > >> > >> > If you need more information, I can >> provide ... >> > >> > Thanks for your help >> > Priya >> > >> > -- >> > Linux-cluster mailing list >> > Linux-cluster at redhat.com >> > https://www.redhat.com/mailman/listinfo/linux-cluster >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From hlawatschek at atix.de Tue May 31 12:16:43 2011 From: hlawatschek at atix.de (Mark Hlawatschek) Date: Tue, 31 May 2011 14:16:43 +0200 (CEST) Subject: [Linux-cluster] How to integrate a custom resource agent into RHCS? In-Reply-To: Message-ID: <1531003971.3181.1306844203629.JavaMail.root@axgroupware01-1.gallien.atix> Hi Ralph, could you post your RA script and the service definition element from your cluster.conf? Best regards, Mark ----- "Ralph Grothe" wrote: > Hi, > > I hope this is the right forum. So bear with me Pacemaker > aficionados et alii when I talk about Red Hat Cluster Suite > (RHCS). > That's the clusterware product I am given to set up the cluster > and I'm not free to chose another software of my liking. > > Though this may sound ridiculous, since days I've been labouring > to get a fairly simple custom resource agent (hence RA) to be > acknowledged by RHCS and correctly executed through its > rgmanager. > > When scripting my RA I mostly adhered to > http://www.linux-ha.org/doc/dev-guides/ra-dev-guide.html apart > from where RHCS RAs differs from general OCF. 
> > I put my RA in /usr/share/cluster and afterwards restarted > rgmanager on all nodes. > > When I try to start the service whereof my RA's managed resource > is part of the service though gets started but not my resource, > as if it wasn't part of the service at all. > > > When I try to start my resource via rg_test nothing happens apart > from this obscure log entry > > > [root at aruba:~] > # rg_test test /etc/cluster/cluster.conf start aDIStn_sec > Running in test mode. > Entity: line 2: parser error : Char 0x0 out of allowed range > > ^ > Entity: line 2: parser error : Premature end of data in tag error > line 1 > > ^ > [root at aruba:~] > # echo $? > 0 > > [root at aruba:~] > # grep rg_test /var/log/cluster.log|tail -1 > May 30 13:54:55 aruba rg_test: [28643]: Cannot dump > meta-data because '/usr/share/cluster/default.metadata' is > missing > > > Though this is true > > [root at aruba:~] > # ls -l /usr/share/cluster/default.metadata > ls: /usr/share/cluster/default.metadata: No such file or > directory > > there isn't such a file part of the installed clusterware at all > either > > [root at aruba:~] > # yum groupinfo Clustering|tail -10|xargs rpm -ql|grep -c > default\\.metadata > 0 > > And besides, I don't understand this error because since I wrote > my RA according to above mentioned RA Developer's Guide it of > course dumps its metadata > > > [root at aruba:~] > # /usr/share/cluster/aDIStn_sec.sh meta-data|grep action > > > > > > > > > > > (note, RHCS deviates from OCF here in naming its actions > verify-all instead of validate-all and status instead of monitor. > But both refer to the same case block in my RA) > > > I also don't understand the "Char 0x0 out of allowed range" error > from the XML parser. > > If it really refers to line 2 of my cluster.conf this looks > pretty ok to me > > > [root at aruba:~] > # sed -n 2p /etc/cluster/cluster.conf > > > > If I run a validity check of the XML of my cluster.conf against > RHCS's RNG schema I also get an incomprehensible error about > extra elements in interleave. > > Nevertheless, all other resources of my cluster which rely on > RHCS's standard RAs are managed ok by the clusterware. > > > > [root at aruba:~] > # declare -f cluconf_valid > cluconf_valid () > { > xmllint --noout --relaxng > /usr/share/system-config-cluster/misc/cluster.ng > ${1:-/etc/cluster/cluster.conf} > } > [root at aruba:~] > # cluconf_valid > Relax-NG validity error : Extra element cman in interleave > /etc/cluster/cluster.conf:2: element cluster: Relax-NG validity > error : Element cluster failed to validate content > /etc/cluster/cluster.conf fails to validate > > > Btw. is there a schema file available to check an RA's metadata > for validity? > > > > Of course did I test my RA script for correct functionality when > used like an init script (to which end I provide the required > environment of OCF_RESKEY_parameter(s)), > and it starts, stops and monitors my resource as intended. > > > Can anyone help? 
> > > Regards > Ralph > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Mark Hlawatschek ATIX Informationstechnologie und Consulting AG | Einsteinstrasse 10 | 85716 Unterschleissheim | www.atix.de http://www.linux-subscriptions.com From hlawatschek at atix.de Tue May 31 12:26:39 2011 From: hlawatschek at atix.de (Mark Hlawatschek) Date: Tue, 31 May 2011 14:26:39 +0200 (CEST) Subject: [Linux-cluster] Mirrored LVM device and recovery In-Reply-To: <4DD655C2.6080406@redhat.com> Message-ID: <1373329283.3184.1306844799557.JavaMail.root@axgroupware01-1.gallien.atix> Hi Andreas, your system works as designed. If a storage leg of a mirrored LVM volume fails, it simply gets removed from the LVM mirror. LVM does not provide automatic resync if the storage is available again. Best regards, Mark ----- "Andreas Bleischwitz" wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Hello all, > > we are currently facing some handling issues using mirrored LVM-lvols > in > a cluster: > > We have two diffent storage systems which should be mirrored using > host-based mirroring. > AFAIK cmirrored lvols are the only supported mirroring solution under > RHEL56. So we have three multipath devices which are used for 2 data > and > one log-volume. > We added these three pvs to one volumegroup and created the logical > volume using the following command: > lvcreate -m1 -L 10G -n lv_mirrored /dev/mpath/mpath0p1 /dev/mpath2p1 > /dev/mpath/mpath1p1 > > The volume replicates ok and everything is fine.... until we remove > one > storage-side of the mirror. Then LVM simply removes the missing pv > and > the mirror is simply removed - which I think is ok; if it will be > recreated after readding the failed mirror-side. > Unfortunately LVM doesn't do anything such - is there a special > configuration-option which we missed? > > And keep in mind: there might be a huge amount of lvms which have to > be > re-mirrored. 
So manual interaction shouldn't be the default option ;) > > Regards, > > Andreas > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v2.0.14 (GNU/Linux) > Comment: Using GnuPG with Red Hat - http://enigmail.mozdev.org/ > > iQIcBAEBAgAGBQJN1lXCAAoJEK45y/Z6LXho/4wP/AjjoOGX3pRoe3XRARumkTKW > C/+8Jjm4+aC4VP1ycVkHZhrdGlzy4QmTtCFTCv40AgPU2YT/Aq+PqfNnn4SNqpwR > c815zd9Gk+uQwwR55kloX232eZzFEw2wVa9PWxOmKwaeuYSEOz8GmLVZrPVc3V9p > MNr6wkV5gzTzhC2v75KOZ4PchOiuYEDbhCd5GFDKmpyTeHTq/uNW2yRnjInAX8L9 > 8UCJ1JEzo4ry2mIBK1J+du5YtKx4uDLB893rgbf+T5Cci3hsLJ9/gfF1VU80b+o/ > uVc5t31rwUwMaFSyt9wtEhMQB0ggbyiQqzzjSP5wnnakd6lbJKhB06wM5XqGuUkS > ZetkZdH+etALFpt3PrV7F4+LDwGnP7Hw438czKjD+Xk21fd7idSo3vhtWjArPgKp > L+b5fxB8JoUGN7x2S3239aDMI6BmxTTZ+QnsamYzSy0IdHYghPSjPSsx8H5laJWd > I03F2sfPWwB8vWVweHvNbxfFjZfmEaawoMqGanoGktj/RYgvUpPZJD+YHDVGXohN > VoRVmB+t4JVSWb15BzOhzkAI//LtXjSHmtcnBuYQf8G0Q2v/r0x/hv04F9/0fQ0l > dPlU1vh244fh0nG5BMCJlKPcdcpGJnGy4kIKOknOi+NuI2ZxxvSLIY5WbrqAwkBX > QXYt6plJ5DgzWCa8fNYN > =0I6Z > -----END PGP SIGNATURE----- > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Mark Hlawatschek ATIX Informationstechnologie und Consulting AG | Einsteinstrasse 10 | 85716 Unterschleissheim | www.atix.de http://www.linux-subscriptions.com From agk at redhat.com Tue May 31 12:37:14 2011 From: agk at redhat.com (Alasdair G Kergon) Date: Tue, 31 May 2011 13:37:14 +0100 Subject: [Linux-cluster] Mirrored LVM device and recovery In-Reply-To: <4DD655C2.6080406@redhat.com> References: <4DD655C2.6080406@redhat.com> Message-ID: <20110531123713.GJ11145@agk-dp.fab.redhat.com> On Fri, May 20, 2011 at 01:51:30PM +0200, Andreas Bleischwitz wrote: > Unfortunately LVM doesn't do anything such - is there a special > configuration-option which we missed? Try mirror_image_fault_policy = "allocate" in the activation section of lvm.conf. Alasdair From swap_project at yahoo.com Tue May 31 13:51:35 2011 From: swap_project at yahoo.com (Srija) Date: Tue, 31 May 2011 06:51:35 -0700 (PDT) Subject: [Linux-cluster] Cluster environment issue In-Reply-To: Message-ID: <832312.52153.qm@web112812.mail.gq1.yahoo.com> Thanks again for the reply. Yes, this cluster environment is of xen hosts. When the cluster is detatched, all the guests are pingable, there is no issue for that. Only as I said , clustat command shows everything 'offline', also can't able to execute the lvm related commands. iptables are 'off' already in this cluster environment. regards. --- On Mon, 5/30/11, Hiroyuki Sato wrote: > From: Hiroyuki Sato > Subject: Re: [Linux-cluster] Cluster environment issue > To: "linux clustering" > Date: Monday, May 30, 2011, 11:03 PM > Hello > > I'm not sure, This is useful or not. > > Have you ever checked ``ping some_where'' on domU when > cluster is broken?? > ( I thought you are using Xen, because you are using > 2.6.18-194.3.1.el5xen. ) > If it does not respond anything, you should check > iptables. > (ex, disable iptables) > > -- > Hiroyuki Sato > > 2011/5/31 Srija : > > Thanks for your quick reply. > > > > I talked to the network people , but they are saying > everything is good at their end. Is there anyway at the > server end, to figure it ?for the switch restart or > multicast traffic? > > > > I think you have already checked the cluster.conf > file.. Except quorum disk, do you think that the cluster > configuration is sufficient for handling the sixteen node > cluster!! > > > > thanks again . 
> > regards > > > > --- On Mon, 5/30/11, Kaloyan Kovachev > wrote: > > > >> From: Kaloyan Kovachev > >> Subject: Re: [Linux-cluster] Cluster environment > issue > >> To: "linux clustering" > >> Date: Monday, May 30, 2011, 4:05 PM > >> Hi, > >> ?when your cluster gets broken, most likely the > reason is, > >> there is a > >> network problem (switch restart or multicast > traffic is > >> lost for a while) > >> on the interface where serverX-priv IPs are > configured. > >> Having a quorum > >> disk may help by giving a quorum vote to one of > the > >> servers, so it can > >> fence the others, but the best thing to do is to > fix your > >> network and > >> preferably add a redundant link for the cluster > >> communication to avoid > >> breakage in the first place > >> > >> On Mon, 30 May 2011 12:17:07 -0700 (PDT), Srija > > >> wrote: > >> > Hi, > >> > > >> > I am very new to the redhat cluster. Need > some help > >> and suggession for > >> the > >> > cluster configuration. > >> > We have sixteen node cluster of > >> > > >> >? ? ? ? ? ???OS > >> : Linux Server release 5.5 (Tikanga) > >> > > >> ???kernel :? 2.6.18-194.3.1.el5xen. > >> > > >> > The problem is sometimes the cluster is > getting > >> broken. The solution is > >> > (still yet)to reboot the > >> > sixteen nodes. Otherwise the nodes are not > joining > >> > > >> > We are using? clvm and not using any quorum > disk. > >> The quorum is by > >> default. > >> > > >> > When it is getting broken, clustat commands > >> shows? evrything? offline > >> > except the node from where > >> > the clustat command executed.? If we execute > vgs, > >> lvs command, those > >> > commands are getting hung. > >> > > >> > Here is at present the clustat report > >> > ------------------------------------- > >> > > >> > [server1]# clustat > >> > Cluster Status for newcluster @ Mon May 30 > 14:55:10 > >> 2011 > >> > Member Status: Quorate > >> > > >> >? Member Name > >> > >> ID???Status > >> >? ------ ---- > >> ? ? ? ? ? ? ---- ------ > >> >? server1 > >> ? ? ? ? ? ? ? 1 Online > >> >? server2 > >> ? ? ? ? ? ? ? 2 Online, > >> Local > >> >? server3 > >> ? ? ? ? ? ? ? 3 Online > >> >? server4 > >> ? ? ? ? ? ? ? 4 Online > >> >? server5 > >> ? ? ? ? ? ? ? 5 Online > >> >? server6 > >> ? ? ? ? ? ? ? 6 Online > >> >? server7 > >> ? ? ? ? ? ? ? 7 Online > >> >? server8 > >> ? ? ? ? ? ? ? 8 Online > >> >? server9 > >> ? ? ? ? ? ? ? 9 Online > >> >? server10 > >> > >> ???10 Online > >> >? server11 > >> > >> ???11 Online > >> >? server12 > >> > >> ???12 Online > >> >? server13 > >> > >> ???13 Online > >> >? server14 > >> > >> ???14 Online > >> >? server15 > >> > >> ???15 Online > >> >? server16 > >> > >> ???16 Online > >> > > >> > Here the cman_tool status? output? from > one > >> server > >> > > -------------------------------------------------- > >> > > >> > [server1 ~]# cman_tool status > >> > Version: 6.2.0 > >> > Config Version: 23 > >> > Cluster Name: newcluster > >> > Cluster Id: 53322 > >> > Cluster Member: Yes > >> > Cluster Generation: 11432 > >> > Membership state: Cluster-Member > >> > Nodes: 16 > >> > Expected votes: 16 > >> > Total votes: 16 > >> > Quorum: 9 > >> > Active subsystems: 8 > >> > Flags: Dirty > >> > Ports Bound: 0 11 > >> > Node name: server1 > >> > Node ID: 1 > >> > Multicast addresses: xxx.xxx.xxx.xx > >> > Node addresses: 192.168.xxx.xx > >> > > >> > > >> > Here is the cluster.conf file. 
> >> > ------------------------------ > >> > > >> > > >> > config_version="23" > >> name="newcluster"> > >> > post_fail_delay="0" > >> post_join_delay="15"/> > >> > > >> > > >> > > >> > nodeid="1" > >> votes="1"> > >> > > >> ? > >> > > >> ? name="ilo-server1r"/> > >> > > >> ? > >> > > >> > > >> > nodeid="3" > >> votes="1"> > >> > > >> ??? > >> >? ? ? ??? >> name="ilo-server2r"/> > >> >? ? ? ??? > >> > > >> > > >> > nodeid="2" > >> votes="1"> > >> > > >> ??? > >> >? ? ? ??? >> name="ilo-server3r"/> > >> >? ? ? ??? > >> > > >> > > >> > [ ... sinp .....] > >> > > >> > nodeid="16" > >> votes="1"> > >> >? ? ? ? >> name="1"> > >> >? ? ? ? >> name="ilo-server16r"/> > >> >? ? ? ? > >> > > >> > > >> > > >> > > >> > > >> > plock_rate_limit="0"/> > >> > > >> > > >> > > >> >? ? ? ??? >> agent="fence_ilo" hostname="server1r" > login="Admin" > >> > > >> ???name="ilo-server1r" passwd="xxxxx"/> > >> >? ? ? ???.......... > >> >? ? ? ??? >> agent="fence_ilo" hostname="server16r" > >> login="Admin" > >> > > >> ???name="ilo-server16r" passwd="xxxxx"/> > >> > > >> > > >> > > >> > > >> > > >> > > >> > Here is the lvm.conf file > >> > -------------------------- > >> > > >> > devices { > >> > > >> >? ???dir = "/dev" > >> >? ???scan = [ "/dev" ] > >> >? ???preferred_names = [ ] > >> >? ???filter = [ > >> "r/scsi.*/","r/pci.*/","r/sd.*/","a/.*/" ] > >> >? ???cache_dir = "/etc/lvm/cache" > >> >? ???cache_file_prefix = "" > >> >? ???write_cache_state = 1 > >> >? ???sysfs_scan = 1 > >> >? ???md_component_detection = 1 > >> >? ???md_chunk_alignment = 1 > >> >? ???data_alignment_detection = 1 > >> >? ???data_alignment = 0 > >> > > >> ???data_alignment_offset_detection = 1 > >> >? ???ignore_suspended_devices = 0 > >> > } > >> > > >> > log { > >> > > >> >? ???verbose = 0 > >> >? ???syslog = 1 > >> >? ???overwrite = 0 > >> >? ???level = 0 > >> >? ???indent = 1 > >> >? ???command_names = 0 > >> >? ???prefix = "? " > >> > } > >> > > >> > backup { > >> > > >> >? ???backup = 1 > >> >? ???backup_dir = > >> "/etc/lvm/backup" > >> >? ???archive = 1 > >> >? ???archive_dir = > >> "/etc/lvm/archive" > >> >? ???retain_min = 10 > >> >? ???retain_days = 30 > >> > } > >> > > >> > shell { > >> > > >> >? ???history_size = 100 > >> > } > >> > global { > >> >? ???library_dir = "/usr/lib64" > >> >? ???umask = 077 > >> >? ???test = 0 > >> >? ???units = "h" > >> >? ???si_unit_consistency = 0 > >> >? ???activation = 1 > >> >? ???proc = "/proc" > >> >? ???locking_type = 3 > >> >? ???wait_for_locks = 1 > >> >? ???fallback_to_clustered_locking > >> = 1 > >> >? ???fallback_to_local_locking = 1 > >> >? ???locking_dir = "/var/lock/lvm" > >> >? ???prioritise_write_locks = 1 > >> > } > >> > > >> > activation { > >> >? ???udev_sync = 1 > >> >? ???missing_stripe_filler = > >> "error" > >> >? ???reserved_stack = 256 > >> >? ???reserved_memory = 8192 > >> >? ???process_priority = -18 > >> >? ???mirror_region_size = 512 > >> >? ???readahead = "auto" > >> >? ???mirror_log_fault_policy = > >> "allocate" > >> >? ???mirror_image_fault_policy = > >> "remove" > >> > } > >> > dmeventd { > >> > > >> >? ???mirror_library = > >> "libdevmapper-event-lvm2mirror.so" > >> >? ???snapshot_library = > >> "libdevmapper-event-lvm2snapshot.so" > >> > } > >> > > >> > > >> > If you need more? information,? I can > >> provide ... 
> >> > > >> > Thanks for your help > >> > Priya > >> > > >> > -- > >> > Linux-cluster mailing list > >> > Linux-cluster at redhat.com > >> > https://www.redhat.com/mailman/listinfo/linux-cluster > >> > >> -- > >> Linux-cluster mailing list > >> Linux-cluster at redhat.com > >> https://www.redhat.com/mailman/listinfo/linux-cluster > >> > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From claudio.martin at abilene.it Tue May 31 16:22:01 2011 From: claudio.martin at abilene.it (Martin Claudio) Date: Tue, 31 May 2011 18:22:01 +0200 Subject: [Linux-cluster] quorum dissolved but resources are still alive Message-ID: <4DE515A9.40003@abilene.it> Hi, i have a problem with a 2 node cluster with this conf: all is ok but when node 2 goes down quorum dissolved but resources is not stopped, here log: clurgmgrd[1302]: #1: Quorum Dissolved kernel: dlm: closing connection to node 2 openais[971]: [CLM ] r(0) ip(10.1.1.11) openais[971]: [CLM ] Members Left: openais[971]: [CLM ] r(0) ip(10.1.1.12) openais[971]: [CLM ] Members Joined: openais[971]: [CMAN ] quorum lost, blocking activity openais[971]: [CLM ] CLM CONFIGURATION CHANGE openais[971]: [CLM ] New Configuration: openais[971]: [CLM ] r(0) ip(10.1.1.11) openais[971]: [CLM ] Members Left: openais[971]: [CLM ] Members Joined: openais[971]: [SYNC ] This node is within the primary component and will provide service. openais[971]: [TOTEM] entering OPERATIONAL state. openais[971]: [CLM ] got nodejoin message 10.1.1.11 openais[971]: [CPG ] got joinlist message from node 1 ccsd[964]: Cluster is not quorate. Refusing connection. cluster recognized that quorum is dissolved but resource manager doesn't stop resource, ip address is still alive, filesystem is still mount, i'll expect an emergency shutdown but it does not happen.... -- Distinti Saluti Claudio Martin Abilene Net Solutions S.r.l. From linux at alteeve.com Tue May 31 17:05:17 2011 From: linux at alteeve.com (Digimer) Date: Tue, 31 May 2011 13:05:17 -0400 Subject: [Linux-cluster] quorum dissolved but resources are still alive In-Reply-To: <4DE515A9.40003@abilene.it> References: <4DE515A9.40003@abilene.it> Message-ID: <4DE51FCD.5050904@alteeve.com> On 05/31/2011 12:22 PM, Martin Claudio wrote: > Hi, > > i have a problem with a 2 node cluster with this conf: > > > > > > > > > > > There are a couple of problems here; You need: With a two-node, quorum is effectively useless, as a single node is allowed to continue. Also, without proper fencing, things will not fail properly. This means that you are in somewhat of an undefined area. Can you setup proper fencing, make the change and then try again? If the problem persists, please paste your entire cluster.conf (please only alter passwords) along with the relevant sections of logs from both nodes? -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "I feel confined, only free to expand myself within boundaries." 
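The list archiver stripped the XML out of the two messages above, so for anyone reading the archive: the directive Digimer is describing, which lets the surviving node of a two-node cluster stay quorate on its own, is normally written as

  <cman expected_votes="1" two_node="1"/>

and the fencing he is asking for means each <clusternode> carries a <fence> block pointing at a real device, along these lines (the device name, agent and the ipaddr/login/passwd values are placeholders only, not a recommendation for any particular hardware):

  <clusternode name="node1" nodeid="1">
    <fence>
      <method name="1">
        <device name="ipmi-node1"/>
      </method>
    </fence>
  </clusternode>

  <fencedevices>
    <fencedevice agent="fence_ipmilan" name="ipmi-node1" ipaddr="10.0.0.1" login="admin" passwd="secret"/>
  </fencedevices>

The second node gets an equivalent stanza. This is only the general shape of the configuration the stripped XML referred to; the qdisk and fencing discussion in the replies that follow still applies.
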
From ajb2 at mssl.ucl.ac.uk Tue May 31 17:56:56 2011 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Tue, 31 May 2011 18:56:56 +0100 Subject: [Linux-cluster] quorum dissolved but resources are still alive In-Reply-To: <4DE51FCD.5050904@alteeve.com> References: <4DE515A9.40003@abilene.it> <4DE51FCD.5050904@alteeve.com> Message-ID: <4DE52BE8.8020009@mssl.ucl.ac.uk> Digimer wrote: > > With a two-node, quorum is effectively useless, as a single node is > allowed to continue. That's what qdiskd is for. It's also useful in larger clusters. > Also, without proper fencing, things will not fail > properly. This means that you are in somewhat of an undefined area. Undefined = likely to cause data corruption. The OP needs to sort this out first before going on to anything else. From claudio.martin at abilene.it Tue May 31 18:33:03 2011 From: claudio.martin at abilene.it (Martin Claudio) Date: Tue, 31 May 2011 20:33:03 +0200 Subject: [Linux-cluster] quorum dissolved but resources are still alive In-Reply-To: <4DE51FCD.5050904@alteeve.com> References: <4DE515A9.40003@abilene.it> <4DE51FCD.5050904@alteeve.com> Message-ID: <4DE5345F.6000804@abilene.it> Il 31/05/2011 19.05, Digimer ha scritto: > > There are a couple of problems here; You need: > > > > With a two-node, quorum is effectively useless, as a single node is > allowed to continue. Also, without proper fencing, things will not fail > properly. This means that you are in somewhat of an undefined area. > > Can you setup proper fencing, make the change and then try > again? If the problem persists, please paste your entire cluster.conf > (please only alter passwords) along with the relevant sections of logs > from both nodes? > i know that quorum in a "two way cluster" is useless, but i need to config cluster in this way : node 1 votes 1 node 2 votes 2 quorum 2 When all nodes are working total votes is 3, quorum is 2 and all is working fine... if link between nodes is down node 1 alone has no quorum ( votes = 1 ) and it has to shutdown his resources while node 2 has quorum ( votes = 2) and it has to bring up resources. In this way i avoid "split brain situation". I know that in this config i have a single-point-of-failure, infact if node 2 goes down, also node 1 goes down ( no quorum ) but for me is ok... I also plannig to implement some way to fencing nodes, but at the moment it's only a simulation lab.... Anyway i still have the problem, node without quorum has not shutdown resources, any help for me plese? Distinti Saluti Claudio Martin Abilene Net Solutions S.r.l. From linux at alteeve.com Tue May 31 18:34:43 2011 From: linux at alteeve.com (Digimer) Date: Tue, 31 May 2011 14:34:43 -0400 Subject: [Linux-cluster] quorum dissolved but resources are still alive In-Reply-To: <4DE52BE8.8020009@mssl.ucl.ac.uk> References: <4DE515A9.40003@abilene.it> <4DE51FCD.5050904@alteeve.com> <4DE52BE8.8020009@mssl.ucl.ac.uk> Message-ID: <4DE534C3.8030302@alteeve.com> On 05/31/2011 01:56 PM, Alan Brown wrote: > Digimer wrote: >> >> With a two-node, quorum is effectively useless, as a single node is >> allowed to continue. > > That's what qdiskd is for. It's also useful in larger clusters. Agreed, but there are 2 caveats that need addressing; 1. qdisk requires a SAN (DRBD will not do). 2. qdisk works up to 16 nodes only. >> Also, without proper fencing, things will not fail properly. This >> means that you are in somewhat of an undefined area. > > Undefined = likely to cause data corruption. > > The OP needs to sort this out first before going on to anything else. 
-- 
Digimer
E-Mail: digimer at alteeve.com
Freenode handle: digimer
Papers and Projects: http://alteeve.com
Node Assassin: http://nodeassassin.org
"I feel confined, only free to expand myself within boundaries."

From linux at alteeve.com  Tue May 31 18:56:05 2011
From: linux at alteeve.com (Digimer)
Date: Tue, 31 May 2011 14:56:05 -0400
Subject: [Linux-cluster] quorum dissolved but resources are still alive
In-Reply-To: <4DE5345F.6000804@abilene.it>
References: <4DE515A9.40003@abilene.it> <4DE51FCD.5050904@alteeve.com>
	<4DE5345F.6000804@abilene.it>
Message-ID: <4DE539C5.9000700@alteeve.com>

On 05/31/2011 02:33 PM, Martin Claudio wrote:
> I am also planning to implement some form of fencing, but at the moment
> this is only a simulation lab.

Please read this:

http://wiki.alteeve.com/index.php/Red_Hat_Cluster_Service_2_Tutorial#Concept.3B_Fencing

> Anyway, I still have the problem: the node without quorum does not shut
> down its resources. Any help for me, please?

We'd like to help you, but we've been here before. Without getting
fencing working, there is no real sense in going forward. Please take
the time now to get fencing working. The cluster stack has no concept
of "a test cluster"; all clusters are treated as mission critical by
the software.

-- 
Digimer
E-Mail: digimer at alteeve.com
Freenode handle: digimer
Papers and Projects: http://alteeve.com
Node Assassin: http://nodeassassin.org
"I feel confined, only free to expand myself within boundaries."

From rossnick-lists at cybercat.ca  Tue May 31 19:00:11 2011
From: rossnick-lists at cybercat.ca (Nicolas Ross)
Date: Tue, 31 May 2011 15:00:11 -0400
Subject: [Linux-cluster] Corosync goes cpu to 95-99%
References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com>
	<4DD2BD7D.5070704@gmail.com> <4DD2CA90.6090802@redhat.com>
	<3B50BA7445114813AE429BEE51A2BA52@versa> <4DD78908.2030801@gmail.com>
	<0B1965C8-9807-42B6-9453-01BE0C0B1DCB@cybercat.ca> <4DD80D5D.10004@gmail.com>
	<4DD873C7.8080402@cybercat.ca>
Message-ID: <22E7D11CD5E64E338A66811F31F06238@versa>

>>> I've opened a support case at redhat for this. While collecting the
>>> sosreport for redhat, I found out in my /var/log/messages file
>>> something about gfs2_quotad being stalled for more than 120 seconds.
>>> Thought I disabled quotas with the noquota option. It appears that
>>> it's "quota=off". Since I cannot change the cluster config and
>>> remount the filesystems at the moment, I did not make the change to
>>> test it.
>>
>> Thanks Nicolas. What is the bugzilla id?
>
> It's not a bugzilla, it's a support case.

Hi!

FYI, my support ticket is still open, and GSS are searching for the
cause of the problem. In the meantime, they suggested that I start
corosync with the -p option and see if that changes anything.

I wanted to know how to do that, since it's cman that starts corosync?

Regards,

From hlawatschek at atix.de  Tue May 31 19:10:26 2011
From: hlawatschek at atix.de (Mark Hlawatschek)
Date: Tue, 31 May 2011 21:10:26 +0200 (CEST)
Subject: [Linux-cluster] quorum dissolved but resources are still alive
In-Reply-To: <1977462131.3330.1306868972931.JavaMail.root@axgroupware01-1.gallien.atix>
Message-ID: <580423379.3332.1306869026208.JavaMail.root@axgroupware01-1.gallien.atix>

Martin,

I did some testing with RHEL5.6 and no additional asynchronous updates.
I remember that it worked as you expected: if rgmanager notices that
quorum has dissolved, it triggers an emergency shutdown for all services
running on the nodes that lost quorum.

Which version of rgmanager are you using?
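(If you are not sure, "rpm -q rgmanager cman" on both nodes will show
the exact package versions.)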
Best regards,
Mark

----- "Martin Claudio" wrote:

> Hi,
>
> I have a problem with a 2-node cluster with this conf:
>
>
>
>
> all is ok, but when node 2 goes down quorum is dissolved but the
> resources are not stopped. Here is the log:
>
> clurgmgrd[1302]: #1: Quorum Dissolved
> kernel: dlm: closing connection to node 2
> openais[971]: [CLM ] r(0) ip(10.1.1.11)
> openais[971]: [CLM ] Members Left:
> openais[971]: [CLM ] r(0) ip(10.1.1.12)
> openais[971]: [CLM ] Members Joined:
> openais[971]: [CMAN ] quorum lost, blocking activity
> openais[971]: [CLM ] CLM CONFIGURATION CHANGE
> openais[971]: [CLM ] New Configuration:
> openais[971]: [CLM ] r(0) ip(10.1.1.11)
> openais[971]: [CLM ] Members Left:
> openais[971]: [CLM ] Members Joined:
> openais[971]: [SYNC ] This node is within the primary component and will
> provide service.
> openais[971]: [TOTEM] entering OPERATIONAL state.
> openais[971]: [CLM ] got nodejoin message 10.1.1.11
> openais[971]: [CPG ] got joinlist message from node 1
> ccsd[964]: Cluster is not quorate. Refusing connection.
>
> The cluster recognized that quorum is dissolved, but the resource
> manager doesn't stop the resources: the IP address is still alive and
> the filesystem is still mounted. I'd expect an emergency shutdown, but
> it does not happen.
>
> --
> Best regards,
> Claudio Martin
> Abilene Net Solutions S.r.l.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-- 
Mark Hlawatschek
ATIX Informationstechnologie und Consulting AG | Einsteinstrasse 10 |
85716 Unterschleissheim | www.atix.de
http://www.linux-subscriptions.com

Registergericht: Amtsgericht Muenchen, Registernummer: HRB 168930
USt.-Id.: DE209485962
Vorstand: Thomas Merz (Vors.), Marc Grimme, Mark Hlawatschek, Jan R. Bergrath
Vorsitzender des Aufsichtsrats: Dr. Martin Buss

From linux at alteeve.com  Tue May 31 19:13:31 2011
From: linux at alteeve.com (Digimer)
Date: Tue, 31 May 2011 15:13:31 -0400
Subject: [Linux-cluster] quorum dissolved but resources are still alive
In-Reply-To: <580423379.3332.1306869026208.JavaMail.root@axgroupware01-1.gallien.atix>
References: <580423379.3332.1306869026208.JavaMail.root@axgroupware01-1.gallien.atix>
Message-ID: <4DE53DDB.8080501@alteeve.com>

On 05/31/2011 03:10 PM, Mark Hlawatschek wrote:
> Martin,
>
> I did some testing with RHEL5.6 and no additional asynchronous updates.
> I remember that it worked as you expected: if rgmanager notices that
> quorum has dissolved, it triggers an emergency shutdown for all services
> running on the nodes that lost quorum.
>
> Which version of rgmanager are you using?
>
> Best regards,
> Mark

The "openais" log prefixes lead me to believe it's EL5.x.

-- 
Digimer
E-Mail: digimer at alteeve.com
Freenode handle: digimer
Papers and Projects: http://alteeve.com
Node Assassin: http://nodeassassin.org
"I feel confined, only free to expand myself within boundaries."

From bergman at merctech.com  Tue May 31 19:35:54 2011
From: bergman at merctech.com (bergman at merctech.com)
Date: Tue, 31 May 2011 15:35:54 -0400
Subject: [Linux-cluster] recommended method for changing quorum device
Message-ID: <21868.1306870554@localhost>

I've got a 3-node RHCS cluster and the quorum device is on a SAN disk
array that needs to be replaced.
The relevant versions are:

	CentOS 5.6 (2.6.18-238.9.1.el5)
	openais-0.80.6-28.el5_6.1
	cman-2.0.115-68.el5_6.3
	rgmanager-2.0.52-9.el5.centos.1

Currently the cluster is configured with each node having one vote and
the quorum device having 2 votes, to allow operation in the event of
multiple node failures.

I'd like to know if there's any recommended method for changing the
quorum disk "in place", without shutting down the cluster.

The following approaches come to mind:

1. Create a new quorum device (multipath, mkqdisk).

   Ensure that at least 2 of the 3 nodes are up.

   Change the cluster configuration to use the new path to the new
   device instead of the old device.

   Commit the change to the cluster.

2. Create a new quorum device (multipath, mkqdisk).

   Ensure that at least 2 of the 3 nodes are up.

   Change the cluster configuration to not use any quorum device.

   Commit the change to the cluster.

   Change the cluster configuration to use the new quorum device.

   Commit the change to the cluster.

3. Create a new quorum device (multipath, mkqdisk).

   Change the cluster configuration to use both quorum devices.

   Commit the change to the cluster.

   --------------------------------------------------
   Note: the 'mkqdisk' manual page (dated July 2006) states:

       using multiple different devices is currently not supported

   Is that still accurate?
   --------------------------------------------------

   Change the cluster configuration to use just the new quorum device
   instead of the old device.

   Commit the change to the cluster.

Thanks for any suggestions.

Mark

From claudio.martin at abilene.it  Tue May 31 19:40:48 2011
From: claudio.martin at abilene.it (Martin Claudio)
Date: Tue, 31 May 2011 21:40:48 +0200
Subject: [Linux-cluster] quorum dissolved but resources are still alive
In-Reply-To: <4DE53DDB.8080501@alteeve.com>
References: <580423379.3332.1306869026208.JavaMail.root@axgroupware01-1.gallien.atix>
	<4DE53DDB.8080501@alteeve.com>
Message-ID: <4DE54440.5040404@abilene.it>

First of all, thanks to everybody for helping me.

RHEL 5.5
rgmanager-2.0.52-6.0.1.el5
cman-2.0.115-34.el5

Best regards,
Claudio Martin
Abilene Net Solutions S.r.l.

On 31/05/2011 21.13, Digimer wrote:
> On 05/31/2011 03:10 PM, Mark Hlawatschek wrote:
>> Martin,
>>
>> I did some testing with RHEL5.6 and no additional asynchronous updates.
>> I remember that it worked as you expected: if rgmanager notices that
>> quorum has dissolved, it triggers an emergency shutdown for all services
>> running on the nodes that lost quorum.
>>
>> Which version of rgmanager are you using?
>>
>> Best regards,
>> Mark
>
> The "openais" log prefixes lead me to believe it's EL5.x.
>

From linux at alteeve.com  Tue May 31 19:42:16 2011
From: linux at alteeve.com (Digimer)
Date: Tue, 31 May 2011 15:42:16 -0400
Subject: [Linux-cluster] recommended method for changing quorum device
In-Reply-To: <21868.1306870554@localhost>
References: <21868.1306870554@localhost>
Message-ID: <4DE54498.1000509@alteeve.com>

On 05/31/2011 03:35 PM, bergman at merctech.com wrote:
> I've got a 3-node RHCS cluster and the quorum device is on a SAN disk
> array that needs to be replaced. The relevant versions are:
>
> CentOS 5.6 (2.6.18-238.9.1.el5)
> openais-0.80.6-28.el5_6.1
> cman-2.0.115-68.el5_6.3
> rgmanager-2.0.52-9.el5.centos.1
>
> Currently the cluster is configured with each node having one vote and
> the quorum device having 2 votes, to allow operation in the event of
> multiple node failures.
> I'd like to know if there's any recommended method for changing the
> quorum disk "in place", without shutting down the cluster.
>
> The following approaches come to mind:
>
> 1. Create a new quorum device (multipath, mkqdisk).
>
>    Ensure that at least 2 of the 3 nodes are up.
>
>    Change the cluster configuration to use the new path to the new
>    device instead of the old device.
>
>    Commit the change to the cluster.
>
> 2. Create a new quorum device (multipath, mkqdisk).
>
>    Ensure that at least 2 of the 3 nodes are up.
>
>    Change the cluster configuration to not use any quorum device.
>
>    Commit the change to the cluster.
>
>    Change the cluster configuration to use the new quorum device.
>
>    Commit the change to the cluster.
>
> 3. Create a new quorum device (multipath, mkqdisk).
>
>    Change the cluster configuration to use both quorum devices.
>
>    Commit the change to the cluster.
>
>    --------------------------------------------------
>    Note: the 'mkqdisk' manual page (dated July 2006) states:
>
>        using multiple different devices is currently not supported
>
>    Is that still accurate?
>    --------------------------------------------------
>
>    Change the cluster configuration to use just the new quorum device
>    instead of the old device.
>
>    Commit the change to the cluster.
>
> Thanks for any suggestions.
>
> Mark

With the caveat that I have not done this and make no claims to being
an expert: option 2 strikes me as the best choice.

-- 
Digimer
E-Mail: digimer at alteeve.com
Freenode handle: digimer
Papers and Projects: http://alteeve.com
Node Assassin: http://nodeassassin.org
"I feel confined, only free to expand myself within boundaries."

From sdake at redhat.com  Tue May 31 19:47:35 2011
From: sdake at redhat.com (Steven Dake)
Date: Tue, 31 May 2011 12:47:35 -0700
Subject: [Linux-cluster] Corosync goes cpu to 95-99%
In-Reply-To: <22E7D11CD5E64E338A66811F31F06238@versa>
References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com>
	<4DD2BD7D.5070704@gmail.com> <4DD2CA90.6090802@redhat.com>
	<3B50BA7445114813AE429BEE51A2BA52@versa> <4DD78908.2030801@gmail.com>
	<0B1965C8-9807-42B6-9453-01BE0C0B1DCB@cybercat.ca> <4DD80D5D.10004@gmail.com>
	<4DD873C7.8080402@cybercat.ca> <22E7D11CD5E64E338A66811F31F06238@versa>
Message-ID: <4DE545D7.1080703@redhat.com>

On 05/31/2011 12:00 PM, Nicolas Ross wrote:
>>>> I've opened a support case at redhat for this. While collecting the
>>>> sosreport for redhat, I found out in my /var/log/messages file
>>>> something about gfs2_quotad being stalled for more than 120 seconds.
>>>> Thought I disabled quotas with the noquota option. It appears that
>>>> it's "quota=off". Since I cannot change the cluster config and
>>>> remount the filesystems at the moment, I did not make the change to
>>>> test it.
>>>
>>> Thanks Nicolas. What is the bugzilla id?
>>
>> It's not a bugzilla, it's a support case.
>
> Hi!
>
> FYI, my support ticket is still open, and GSS are searching for the
> cause of the problem. In the meantime, they suggested that I start
> corosync with the -p option and see if that changes anything.
>
> I wanted to know how to do that, since it's cman that starts corosync?
>

cman_tool join is called in /etc/rc.d/init.d/cman I believe. Add a -P
option to it.
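
For reference, the change is just appending the option to the existing
"cman_tool ... join" invocation in /etc/rc.d/init.d/cman. The exact line
differs between releases, so treat the snippet below as an illustration
only (the timeout and options variables are placeholders, not
necessarily what your script uses):

    # before (illustrative)
    cman_tool -t $CMAN_JOIN_TIMEOUT -w join $cman_join_opts

    # after
    cman_tool -t $CMAN_JOIN_TIMEOUT -w join -P $cman_join_opts

cman will need to be restarted on that node for the option to take
effect.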
Regards
-steve

> Regards,
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

From hlawatschek at atix.de  Tue May 31 20:22:44 2011
From: hlawatschek at atix.de (Mark Hlawatschek)
Date: Tue, 31 May 2011 22:22:44 +0200 (CEST)
Subject: [Linux-cluster] recommended method for changing quorum device
In-Reply-To: <713182518.3335.1306873331051.JavaMail.root@axgroupware01-1.gallien.atix>
Message-ID: <215272920.3337.1306873364406.JavaMail.root@axgroupware01-1.gallien.atix>

Mark,

without guarantee ;-) I believe that the following method should work:

1. make sure that all 3 nodes are running and part of the cluster
2. stop qdiskd on all nodes (#service qdiskd stop)
3. create the new quorum disk (#mkqdisk ...)
4. modify cluster.conf (and increment config_version)
5. #ccs_tool update /etc/cluster/cluster.conf
6. start qdiskd on all nodes (#service qdiskd start)

Kind regards,
Mark

----- bergman at merctech.com wrote:

> I've got a 3-node RHCS cluster and the quorum device is on a SAN disk
> array that needs to be replaced. The relevant versions are:
>
> CentOS 5.6 (2.6.18-238.9.1.el5)
> openais-0.80.6-28.el5_6.1
> cman-2.0.115-68.el5_6.3
> rgmanager-2.0.52-9.el5.centos.1
>
> Currently the cluster is configured with each node having one vote and
> the quorum device having 2 votes, to allow operation in the event of
> multiple node failures.
>
> I'd like to know if there's any recommended method for changing the
> quorum disk "in place", without shutting down the cluster.
>
> The following approaches come to mind:
>
> 1. Create a new quorum device (multipath, mkqdisk).
>
>    Ensure that at least 2 of the 3 nodes are up.
>
>    Change the cluster configuration to use the new path to the new
>    device instead of the old device.
>
>    Commit the change to the cluster.
>
> 2. Create a new quorum device (multipath, mkqdisk).
>
>    Ensure that at least 2 of the 3 nodes are up.
>
>    Change the cluster configuration to not use any quorum device.
>
>    Commit the change to the cluster.
>
>    Change the cluster configuration to use the new quorum device.
>
>    Commit the change to the cluster.
>
> 3. Create a new quorum device (multipath, mkqdisk).
>
>    Change the cluster configuration to use both quorum devices.
>
>    Commit the change to the cluster.
>
>    --------------------------------------------------
>    Note: the 'mkqdisk' manual page (dated July 2006) states:
>
>        using multiple different devices is currently not supported
>
>    Is that still accurate?
>    --------------------------------------------------
>
>    Change the cluster configuration to use just the new quorum device
>    instead of the old device.
>
>    Commit the change to the cluster.
>
> Thanks for any suggestions.
> Mark
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-- 
Mark Hlawatschek
ATIX Informationstechnologie und Consulting AG | Einsteinstrasse 10 |
85716 Unterschleissheim | www.atix.de
http://www.linux-subscriptions.com

From rossnick-lists at cybercat.ca  Tue May 31 22:34:21 2011
From: rossnick-lists at cybercat.ca (Nicolas Ross)
Date: Tue, 31 May 2011 18:34:21 -0400
Subject: [Linux-cluster] Corosync goes cpu to 95-99%
In-Reply-To: <4DE545D7.1080703@redhat.com>
References: <4DD29D03.9080901@gmail.com> <4DD2BAC3.50509@redhat.com>
	<4DD2BD7D.5070704@gmail.com> <4DD2CA90.6090802@redhat.com>
	<3B50BA7445114813AE429BEE51A2BA52@versa> <4DD78908.2030801@gmail.com>
	<0B1965C8-9807-42B6-9453-01BE0C0B1DCB@cybercat.ca> <4DD80D5D.10004@gmail.com>
	<4DD873C7.8080402@cybercat.ca> <22E7D11CD5E64E338A66811F31F06238@versa>
	<4DE545D7.1080703@redhat.com>
Message-ID: <068AEB47E11A41C3A8EC25F71D30B82F@Inspiron>

>> FYI, my support ticket is still open, and GSS are searching for the
>> cause of the problem. In the meantime, they suggested that I start
>> corosync with the -p option and see if that changes anything.
>>
>> I wanted to know how to do that, since it's cman that starts corosync?
>>
>
> cman_tool join is called in /etc/rc.d/init.d/cman I believe. Add a -P
> option to it.

That did it. I will do it for a couple of nodes and see what happens.

Regards,
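
P.S. To see whether the option took effect after restarting cman, and
whether it actually helps, I plan to keep an eye on corosync on each
node with something like:

    ps -o pid,cls,rtprio,pcpu,comm -C corosync

That shows the scheduling class and realtime priority along with the
CPU usage, so it should be easy to compare the nodes running with -P
against the ones still running the old way.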