From ccaulfie at redhat.com  Mon Jan  4 07:58:11 2010
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Mon, 04 Jan 2010 07:58:11 +0000
Subject: [Linux-cluster] cannot add 3rd node to running cluster
In-Reply-To: <8ee061010912310813g3f45bf6ekfc52c3d5420a5826@mail.gmail.com>
References: <8ee061010912291130n68f0bad6l496f71df2cd703ac@mail.gmail.com>	<74e9d01e0912291520l3bc36ac4yc7a17b1f96fa123d@mail.gmail.com>	<8ee061010912300813i7fd29c70hd81bf691d574df0c@mail.gmail.com>
	<8ee061010912310813g3f45bf6ekfc52c3d5420a5826@mail.gmail.com>
Message-ID: <4B419F93.5070704@redhat.com>

On 31/12/09 16:13, Terry wrote:
> On Wed, Dec 30, 2009 at 10:13 AM, Terry<td3201 at gmail.com>  wrote:
>> On Tue, Dec 29, 2009 at 5:20 PM, Jason W.<jwellband at gmail.com>  wrote:
>>> On Tue, Dec 29, 2009 at 2:30 PM, Terry<td3201 at gmail.com>  wrote:
>>>> Hello,
>>>>
>>>> I have a working 2 node cluster that I am trying to add a third node
>>>> to.   I am trying to use Red Hat's conga (luci) to add the node in but
>>>
>>> If you have two node cluster with two_node=1 in cluster.conf - such as
>>> two nodes with no quorum device to break a tie - you'll need to bring
>>> the cluster down, change two_node to 0 on both nodes (and rev the
>>> cluster version at the top of cluster.conf), bring the cluster up and
>>> then add the third node.
>>>
>>> For troubleshooting any cluster issue, take a look at syslog
>>> (/var/log/messages by default). It can help to watch it on a
>>> centralized syslog server that all of your nodes forward logs to.
>>>
>>> --
>>> HTH, YMMV, HANW :)
>>>
>>> Jason
>>>
>>> The path to enlightenment is /usr/bin/enlightenment.
>>
>> Thank you for the response.  /var/log/messages doesn't have any
>> errors.  It says cman started then says can't connect to cluster
>> infrastructure after a few seconds.  My cluster does not have the
>> two_node=1 config now.  Conga took that out for me.  That bit me last
>> night because I needed to put that back in.
>>
>
> CMAN still will not start and gives no debug information.  Anyone know
> why cman_tool -d join would not print any output at all?
> Troubleshooting this is kind of a nightmare.  I verified that two_node
> is not in play.


If cman_tool join -d doesn't produce any output then the most likely 
problem is a mismatch between the cman and openais versions. Because 
cman is a configuration module for openais it loads very early in the 
initialisation sequence.

If you are sure the versions are right (ie they match those on the 
running nodes of the cluster) then do

# strace -f cman_tool join -d


and post the results here and I'll have a look for you.

Chrisie



From diamondiona at gmail.com  Mon Jan  4 08:25:59 2010
From: diamondiona at gmail.com (Diamond Li)
Date: Mon, 4 Jan 2010 16:25:59 +0800
Subject: [Linux-cluster] gfs2_grow does not work
In-Reply-To: <dd23a5e0912302316x5b8c1fb2y61ff503acb0dfd3f@mail.gmail.com>
References: <dd23a5e0912302255h4f11f77ay1152bb7639530acf@mail.gmail.com>
	<dd23a5e0912302316x5b8c1fb2y61ff503acb0dfd3f@mail.gmail.com>
Message-ID: <dd23a5e1001040025v152cd5f4qfdbde38a8b660a09@mail.gmail.com>

could someone kindly help me to get through?

thanks in advance!

On Thu, Dec 31, 2009 at 3:16 PM, Diamond Li <diamondiona at gmail.com> wrote:
> from system log, I can see the erorr message:
>
> Dec 31 15:04:56 wplccdlvm446 kernel: GFS2: gfs2 mount does not exist
>
> but I have mounted gfs2 file system under /gfs folder and I can do
> operations such as mkdir, rm, successfully.
>
>
>
> On Thu, Dec 31, 2009 at 2:55 PM, Diamond Li <diamondiona at gmail.com> wrote:
>> Hello,
>>
>> I am trying to grow a gfs2 file system, unfortunately ?it does not work.
>>
>> anyone has similar issues or I always have bad luck?
>>
>> [root at wplccdlvm446 gfs]# mount
>>
>> /dev/mapper/vg100-lvol0 on /gfs type gfs2 (rw,hostdata=jid=0:id=131074:first=1)
>>
>> [root at wplccdlvm446 gfs]# gfs2_grow -v /gfs
>> Initializing lists...
>> gfs2_grow: Couldn't mount /tmp/.gfs2meta : Invalid argument
>>
>> [root at wplccdlvm446 gfs]# ls -a /tmp/.gfs2meta/
>> . ?..
>>
>>
>> [root at wplccdlvm446 gfs]# uname -r
>> 2.6.18-164.el5
>>
>> [root at wplccdlvm446 gfs]# cat /etc/redhat-release
>> Red Hat Enterprise Linux Server release 5.4 (Tikanga)
>>
>



From a.alawi at auckland.ac.nz  Mon Jan  4 19:24:40 2010
From: a.alawi at auckland.ac.nz (Abraham Alawi)
Date: Tue, 5 Jan 2010 08:24:40 +1300
Subject: [Linux-cluster] gfs2_grow does not work
In-Reply-To: <dd23a5e1001040025v152cd5f4qfdbde38a8b660a09@mail.gmail.com>
References: <dd23a5e0912302255h4f11f77ay1152bb7639530acf@mail.gmail.com>
	<dd23a5e0912302316x5b8c1fb2y61ff503acb0dfd3f@mail.gmail.com>
	<dd23a5e1001040025v152cd5f4qfdbde38a8b660a09@mail.gmail.com>
Message-ID: <96F1FFC8-83C1-48C9-8CF0-271CB9A998F2@auckland.ac.nz>

I used gfs2_grow before but never experienced this error. Probably it's /tmp related issue, got the right permission (1777) + does it have enough space? strace could be of great help as well. 

Good luck

On 4/01/2010, at 9:25 PM, Diamond Li wrote:

> could someone kindly help me to get through?
> 
> thanks in advance!
> 
> On Thu, Dec 31, 2009 at 3:16 PM, Diamond Li <diamondiona at gmail.com> wrote:
>> from system log, I can see the erorr message:
>> 
>> Dec 31 15:04:56 wplccdlvm446 kernel: GFS2: gfs2 mount does not exist
>> 
>> but I have mounted gfs2 file system under /gfs folder and I can do
>> operations such as mkdir, rm, successfully.
>> 
>> 
>> 
>> On Thu, Dec 31, 2009 at 2:55 PM, Diamond Li <diamondiona at gmail.com> wrote:
>>> Hello,
>>> 
>>> I am trying to grow a gfs2 file system, unfortunately  it does not work.
>>> 
>>> anyone has similar issues or I always have bad luck?
>>> 
>>> [root at wplccdlvm446 gfs]# mount
>>> 
>>> /dev/mapper/vg100-lvol0 on /gfs type gfs2 (rw,hostdata=jid=0:id=131074:first=1)
>>> 
>>> [root at wplccdlvm446 gfs]# gfs2_grow -v /gfs
>>> Initializing lists...
>>> gfs2_grow: Couldn't mount /tmp/.gfs2meta : Invalid argument
>>> 
>>> [root at wplccdlvm446 gfs]# ls -a /tmp/.gfs2meta/
>>> .  ..
>>> 
>>> 
>>> [root at wplccdlvm446 gfs]# uname -r
>>> 2.6.18-164.el5
>>> 
>>> [root at wplccdlvm446 gfs]# cat /etc/redhat-release
>>> Red Hat Enterprise Linux Server release 5.4 (Tikanga)
>>> 
>> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

''''''''''''''''''''''''''''''''''''''''''''''''''''''
Abraham Alawi

Unix/Linux Systems Administrator
Science IT
University of Auckland
e: a.alawi at auckland.ac.nz
p: +64-9-373 7599, ext#: 87572

''''''''''''''''''''''''''''''''''''''''''''''''''''''




From a.alawi at auckland.ac.nz  Mon Jan  4 19:34:00 2010
From: a.alawi at auckland.ac.nz (Abraham Alawi)
Date: Tue, 5 Jan 2010 08:34:00 +1300
Subject: [Linux-cluster] cannot add 3rd node to running cluster
In-Reply-To: <8ee061010912310813g3f45bf6ekfc52c3d5420a5826@mail.gmail.com>
References: <8ee061010912291130n68f0bad6l496f71df2cd703ac@mail.gmail.com>
	<74e9d01e0912291520l3bc36ac4yc7a17b1f96fa123d@mail.gmail.com>
	<8ee061010912300813i7fd29c70hd81bf691d574df0c@mail.gmail.com>
	<8ee061010912310813g3f45bf6ekfc52c3d5420a5826@mail.gmail.com>
Message-ID: <37B32C9E-A8C3-4BA4-BFB4-FAB6235985D5@auckland.ac.nz>


On 1/01/2010, at 5:13 AM, Terry wrote:

> On Wed, Dec 30, 2009 at 10:13 AM, Terry <td3201 at gmail.com> wrote:
>> On Tue, Dec 29, 2009 at 5:20 PM, Jason W. <jwellband at gmail.com> wrote:
>>> On Tue, Dec 29, 2009 at 2:30 PM, Terry <td3201 at gmail.com> wrote:
>>>> Hello,
>>>> 
>>>> I have a working 2 node cluster that I am trying to add a third node
>>>> to.   I am trying to use Red Hat's conga (luci) to add the node in but
>>> 
>>> If you have two node cluster with two_node=1 in cluster.conf - such as
>>> two nodes with no quorum device to break a tie - you'll need to bring
>>> the cluster down, change two_node to 0 on both nodes (and rev the
>>> cluster version at the top of cluster.conf), bring the cluster up and
>>> then add the third node.
>>> 
>>> For troubleshooting any cluster issue, take a look at syslog
>>> (/var/log/messages by default). It can help to watch it on a
>>> centralized syslog server that all of your nodes forward logs to.
>>> 
>>> --
>>> HTH, YMMV, HANW :)
>>> 
>>> Jason
>>> 
>>> The path to enlightenment is /usr/bin/enlightenment.
>> 
>> Thank you for the response.  /var/log/messages doesn't have any
>> errors.  It says cman started then says can't connect to cluster
>> infrastructure after a few seconds.  My cluster does not have the
>> two_node=1 config now.  Conga took that out for me.  That bit me last
>> night because I needed to put that back in.
>> 
> 
> CMAN still will not start and gives no debug information.  Anyone know
> why cman_tool -d join would not print any output at all?
> Troubleshooting this is kind of a nightmare.  I verified that two_node
> is not in play.
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


Try this line in your cluster.conf file:
<logging debug="on" logfile="/var/log/rhcs.log" to_file="yes"/>

Also, if you are sure your cluster.conf is correct then copy it manually to all the nodes and add clean_start="1" to the fence_daemon line in cluster.conf and run 'service cman start' simultaneously on all the nodes (probably a good idea to do that from runlevel 1 but make sure you have the network up first)

Cheers,

  -- Abraham

''''''''''''''''''''''''''''''''''''''''''''''''''''''
Abraham Alawi

Unix/Linux Systems Administrator
Science IT
University of Auckland
e: a.alawi at auckland.ac.nz
p: +64-9-373 7599, ext#: 87572

''''''''''''''''''''''''''''''''''''''''''''''''''''''




From adas at redhat.com  Mon Jan  4 22:27:15 2010
From: adas at redhat.com (Abhijith Das)
Date: Mon, 4 Jan 2010 17:27:15 -0500 (EST)
Subject: [Linux-cluster] gfs2_grow does not work
In-Reply-To: <1186328466.2354981262643939629.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com>
Message-ID: <330034033.2355001262644035840.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com>

Hi,

>From the following message, it looks like the gfs2meta mount routine is not able to locate the gfs2 mountpoint.
"Dec 31 15:04:56 wplccdlvm446 kernel: GFS2: gfs2 mount does not exist"
Can you confirm that /proc/mounts and /etc/mtab all agree on the mounted gfs2 at /gfs?
Also, can you run gfs2_grow under strace so that we can see what arguments gfs2_grow passes to the mount() system call when it tries to mount the gfs2meta filesystem?

Thanks!
--Abhi

----- "Diamond Li" <diamondiona at gmail.com> wrote:

> From: "Diamond Li" <diamondiona at gmail.com>
> To: "linux clustering" <linux-cluster at redhat.com>
> Sent: Monday, January 4, 2010 2:25:59 AM GMT -06:00 US/Canada Central
> Subject: Re: [Linux-cluster] gfs2_grow does not work
>
> could someone kindly help me to get through?
> 
> thanks in advance!
> 
> On Thu, Dec 31, 2009 at 3:16 PM, Diamond Li <diamondiona at gmail.com>
> wrote:
> > from system log, I can see the erorr message:
> >
> > Dec 31 15:04:56 wplccdlvm446 kernel: GFS2: gfs2 mount does not
> exist
> >
> > but I have mounted gfs2 file system under /gfs folder and I can do
> > operations such as mkdir, rm, successfully.
> >
> >
> >
> > On Thu, Dec 31, 2009 at 2:55 PM, Diamond Li <diamondiona at gmail.com>
> wrote:
> >> Hello,
> >>
> >> I am trying to grow a gfs2 file system, unfortunately ?it does not
> work.
> >>
> >> anyone has similar issues or I always have bad luck?
> >>
> >> [root at wplccdlvm446 gfs]# mount
> >>
> >> /dev/mapper/vg100-lvol0 on /gfs type gfs2
> (rw,hostdata=jid=0:id=131074:first=1)
> >>
> >> [root at wplccdlvm446 gfs]# gfs2_grow -v /gfs
> >> Initializing lists...
> >> gfs2_grow: Couldn't mount /tmp/.gfs2meta : Invalid argument
> >>
> >> [root at wplccdlvm446 gfs]# ls -a /tmp/.gfs2meta/
> >> . ?..
> >>
> >>
> >> [root at wplccdlvm446 gfs]# uname -r
> >> 2.6.18-164.el5
> >>
> >> [root at wplccdlvm446 gfs]# cat /etc/redhat-release
> >> Red Hat Enterprise Linux Server release 5.4 (Tikanga)
> >>
> >
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From diamondiona at gmail.com  Tue Jan  5 02:12:12 2010
From: diamondiona at gmail.com (Diamond Li)
Date: Tue, 5 Jan 2010 10:12:12 +0800
Subject: [Linux-cluster] gfs2_grow does not work
In-Reply-To: <330034033.2355001262644035840.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com>
References: <1186328466.2354981262643939629.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com>
	<330034033.2355001262644035840.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com>
Message-ID: <dd23a5e1001041812r5a9aa04mfd58e1a3e3b37ea3@mail.gmail.com>

[root at wplccdlvm445 proc]# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                      28376956   9144384  17767844  34% /
/dev/sda1               101086     12055     83812  13% /boot
tmpfs                  1037748         0   1037748   0% /dev/shm
/dev/mapper/vg100-lvol0
                        819024    794264     24760  97% /gfs

[root at wplccdlvm445 proc]# ls -ld /tmp
drwxrwxrwt 8 root root 4096 Jan  5 04:02 /tmp
[root at wplccdlvm445 proc]# ls -ld /tmp/.gfs2meta/
drwx------ 2 root root 4096 Dec 31 14:24 /tmp/.gfs2meta/


[root at wplccdlvm445 proc]# cat /proc/mounts
rootfs / rootfs rw 0 0
/dev/root / ext3 rw,data=ordered 0 0
/dev /dev tmpfs rw 0 0
/proc /proc proc rw 0 0
/sys /sys sysfs rw 0 0
/proc/bus/usb /proc/bus/usb usbfs rw 0 0
devpts /dev/pts devpts rw 0 0
/dev/sda1 /boot ext3 rw,data=ordered 0 0
tmpfs /dev/shm tmpfs rw 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
/etc/auto.misc /misc autofs
rw,fd=7,pgrp=2200,timeout=300,minproto=5,maxproto=5,indirect 0 0
-hosts /net autofs
rw,fd=13,pgrp=2200,timeout=300,minproto=5,maxproto=5,indirect 0 0
none /sys/kernel/config configfs rw 0 0
/dev/mapper/vg100-lvol0 /gfs gfs2 rw,hostdata=jid=0:id=65537:first=1 0 0

[root at wplccdlvm445 proc]# cat /etc/mtab
/dev/mapper/VolGroup00-LogVol00 / ext3 rw 0 0
proc /proc proc rw 0 0
sysfs /sys sysfs rw 0 0
devpts /dev/pts devpts rw,gid=5,mode=620 0 0
/dev/sda1 /boot ext3 rw 0 0
tmpfs /dev/shm tmpfs rw 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
none /sys/kernel/config configfs rw 0 0
/dev/mapper/vg100-lvol0 /gfs gfs2 rw,hostdata=jid=0:id=65537:first=1 0 0


[root at wplccdlvm445 proc]# strace gfs2_grow -v /gfs
execve("/sbin/gfs2_grow", ["gfs2_grow", "-v", "/gfs"], [/* 29 vars */]) = 0
brk(0)                                  = 0x942d000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)      = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=90296, ...}) = 0
mmap2(NULL, 90296, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7f96000
close(3)                                = 0
open("/lib/libvolume_id.so.0", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\360 at k\0004\0\0\0"...,
512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=32144, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0xb7f95000
mmap2(0x6b3000, 33540, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE,
3, 0) = 0x6b3000
mmap2(0x6bb000, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x7) = 0x6bb000
close(3)                                = 0
open("/lib/libc.so.6", O_RDONLY)        = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\340\17X\0004\0\0\0"...,
512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1611564, ...}) = 0
mmap2(0x56b000, 1332676, PROT_READ|PROT_EXEC,
MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x56b000
mprotect(0x6aa000, 4096, PROT_NONE)     = 0
mmap2(0x6ab000, 12288, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x13f) = 0x6ab000
mmap2(0x6ae000, 9668, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x6ae000
close(3)                                = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0xb7f94000
set_thread_area({entry_number:-1 -> 6, base_addr:0xb7f946c0,
limit:1048575, seg_32bit:1, contents:0, read_exec_only:0,
limit_in_pages:1, seg_not_present:0, useable:1}) = 0
mprotect(0x6ab000, 8192, PROT_READ)     = 0
mprotect(0x567000, 4096, PROT_READ)     = 0
munmap(0xb7f96000, 90296)               = 0
time(NULL)                              = 1262655667
getpid()                                = 18781
brk(0)                                  = 0x942d000
brk(0x944e000)                          = 0x944e000
open("/gfs", O_RDONLY|O_LARGEFILE)      = 3
open("/proc/mounts", O_RDONLY|O_LARGEFILE) = 4
lstat64("/gfs", {st_mode=S_IFDIR|0755, st_size=3864, ...}) = 0
fstat64(4, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0xb7fac000
read(4, "rootfs / rootfs rw 0 0\n/dev/root"..., 4096) = 659
close(4)                                = 0
munmap(0xb7fac000, 4096)                = 0
open("/dev/mapper/vg100-lvol0", O_RDWR|O_LARGEFILE) = 4
fstat64(4, {st_mode=S_IFBLK|0660, st_rdev=makedev(253, 4), ...}) = 0
_llseek(4, 0, [1677721600], SEEK_END)   = 0
fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0xb7fac000
write(1, "Initializing lists...\n", 22Initializing lists...
) = 22
_llseek(4, 65536, [65536], SEEK_SET)    = 0
read(4, "\1\26\31p\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0d\0\0\0\0\0\0\7\t\0\0\7l"...,
4096) = 4096
_llseek(4, 0, [1677721600], SEEK_END)   = 0
open("/proc/mounts", O_RDONLY|O_LARGEFILE) = 5
fstat64(5, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0xb7fab000
read(5, "rootfs / rootfs rw 0 0\n/dev/root"..., 4096) = 659
read(5, "", 4096)                       = 0
close(5)                                = 0
munmap(0xb7fab000, 4096)                = 0
open("/tmp/.gfs2meta", O_RDONLY|O_LARGEFILE) = 5
fstat64(5, {st_mode=S_IFDIR|0700, st_size=4096, ...}) = 0
close(5)                                = 0
mount("/dev/mapper/vg100-lvol0", "/tmp/.gfs2meta", "gfs2meta", 0,
NULL) = -1 EINVAL (Invalid argument)
write(2, "gfs2_grow: ", 11gfs2_grow: )             = 11
write(2, "Couldn't mount /tmp/.gfs2meta : "..., 49Couldn't mount
/tmp/.gfs2meta : Invalid argument
) = 49
exit_group(1)                           = ?


On Tue, Jan 5, 2010 at 6:27 AM, Abhijith Das <adas at redhat.com> wrote:
> Hi,
>
> >From the following message, it looks like the gfs2meta mount routine is not able to locate the gfs2 mountpoint.
> "Dec 31 15:04:56 wplccdlvm446 kernel: GFS2: gfs2 mount does not exist"
> Can you confirm that /proc/mounts and /etc/mtab all agree on the mounted gfs2 at /gfs?
> Also, can you run gfs2_grow under strace so that we can see what arguments gfs2_grow passes to the mount() system call when it tries to mount the gfs2meta filesystem?
>
> Thanks!
> --Abhi
>
> ----- "Diamond Li" <diamondiona at gmail.com> wrote:
>
>> From: "Diamond Li" <diamondiona at gmail.com>
>> To: "linux clustering" <linux-cluster at redhat.com>
>> Sent: Monday, January 4, 2010 2:25:59 AM GMT -06:00 US/Canada Central
>> Subject: Re: [Linux-cluster] gfs2_grow does not work
>>
>> could someone kindly help me to get through?
>>
>> thanks in advance!
>>
>> On Thu, Dec 31, 2009 at 3:16 PM, Diamond Li <diamondiona at gmail.com>
>> wrote:
>> > from system log, I can see the erorr message:
>> >
>> > Dec 31 15:04:56 wplccdlvm446 kernel: GFS2: gfs2 mount does not
>> exist
>> >
>> > but I have mounted gfs2 file system under /gfs folder and I can do
>> > operations such as mkdir, rm, successfully.
>> >
>> >
>> >
>> > On Thu, Dec 31, 2009 at 2:55 PM, Diamond Li <diamondiona at gmail.com>
>> wrote:
>> >> Hello,
>> >>
>> >> I am trying to grow a gfs2 file system, unfortunately ?it does not
>> work.
>> >>
>> >> anyone has similar issues or I always have bad luck?
>> >>
>> >> [root at wplccdlvm446 gfs]# mount
>> >>
>> >> /dev/mapper/vg100-lvol0 on /gfs type gfs2
>> (rw,hostdata=jid=0:id=131074:first=1)
>> >>
>> >> [root at wplccdlvm446 gfs]# gfs2_grow -v /gfs
>> >> Initializing lists...
>> >> gfs2_grow: Couldn't mount /tmp/.gfs2meta : Invalid argument
>> >>
>> >> [root at wplccdlvm446 gfs]# ls -a /tmp/.gfs2meta/
>> >> . ?..
>> >>
>> >>
>> >> [root at wplccdlvm446 gfs]# uname -r
>> >> 2.6.18-164.el5
>> >>
>> >> [root at wplccdlvm446 gfs]# cat /etc/redhat-release
>> >> Red Hat Enterprise Linux Server release 5.4 (Tikanga)
>> >>
>> >
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From adas at redhat.com  Tue Jan  5 05:26:15 2010
From: adas at redhat.com (Abhijith Das)
Date: Tue, 5 Jan 2010 00:26:15 -0500 (EST)
Subject: [Linux-cluster] gfs2_grow does not work
In-Reply-To: <678287746.2365421262668986749.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com>
Message-ID: <1979230168.2365461262669175952.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com>

Hi Diamond,

Could I also have the kernel and gfs2-utils rpm versions you are using so I can try this on my setup? I just spotted something in your strace output that could be a problem if you have a newer kernel, but not a newer gfs2-utils package.

The mount syscall in your strace output takes the device as the first arg to mount the metafs. A recent kernel patch from https://bugzilla.redhat.com/show_bug.cgi?id=457798 changed that to take the mountpoint as the first arg instead. There was a corresponding userland patch to gfs2-utils in https://bugzilla.redhat.com/show_bug.cgi?id=459630#c3 that fixed this mismatch.
I'm not sure if you're seeing this. If so, an upgrade of these packages should fix what you're seeing.

Cheers!
--Abhi

----- "Diamond Li" <diamondiona at gmail.com> wrote:

> From: "Diamond Li" <diamondiona at gmail.com>
> To: "linux clustering" <linux-cluster at redhat.com>
> Sent: Monday, January 4, 2010 8:12:12 PM GMT -06:00 US/Canada Central
> Subject: Re: [Linux-cluster] gfs2_grow does not work
>
> [root at wplccdlvm445 proc]# df -k
> Filesystem           1K-blocks      Used Available Use% Mounted on
> /dev/mapper/VolGroup00-LogVol00
>                       28376956   9144384  17767844  34% /
> /dev/sda1               101086     12055     83812  13% /boot
> tmpfs                  1037748         0   1037748   0% /dev/shm
> /dev/mapper/vg100-lvol0
>                         819024    794264     24760  97% /gfs
> 
> [root at wplccdlvm445 proc]# ls -ld /tmp
> drwxrwxrwt 8 root root 4096 Jan  5 04:02 /tmp
> [root at wplccdlvm445 proc]# ls -ld /tmp/.gfs2meta/
> drwx------ 2 root root 4096 Dec 31 14:24 /tmp/.gfs2meta/
> 
> 
> [root at wplccdlvm445 proc]# cat /proc/mounts
> rootfs / rootfs rw 0 0
> /dev/root / ext3 rw,data=ordered 0 0
> /dev /dev tmpfs rw 0 0
> /proc /proc proc rw 0 0
> /sys /sys sysfs rw 0 0
> /proc/bus/usb /proc/bus/usb usbfs rw 0 0
> devpts /dev/pts devpts rw 0 0
> /dev/sda1 /boot ext3 rw,data=ordered 0 0
> tmpfs /dev/shm tmpfs rw 0 0
> none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
> sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
> /etc/auto.misc /misc autofs
> rw,fd=7,pgrp=2200,timeout=300,minproto=5,maxproto=5,indirect 0 0
> -hosts /net autofs
> rw,fd=13,pgrp=2200,timeout=300,minproto=5,maxproto=5,indirect 0 0
> none /sys/kernel/config configfs rw 0 0
> /dev/mapper/vg100-lvol0 /gfs gfs2 rw,hostdata=jid=0:id=65537:first=1 0
> 0
> 
> [root at wplccdlvm445 proc]# cat /etc/mtab
> /dev/mapper/VolGroup00-LogVol00 / ext3 rw 0 0
> proc /proc proc rw 0 0
> sysfs /sys sysfs rw 0 0
> devpts /dev/pts devpts rw,gid=5,mode=620 0 0
> /dev/sda1 /boot ext3 rw 0 0
> tmpfs /dev/shm tmpfs rw 0 0
> none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
> sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
> none /sys/kernel/config configfs rw 0 0
> /dev/mapper/vg100-lvol0 /gfs gfs2 rw,hostdata=jid=0:id=65537:first=1 0
> 0
> 
> 
> [root at wplccdlvm445 proc]# strace gfs2_grow -v /gfs
> execve("/sbin/gfs2_grow", ["gfs2_grow", "-v", "/gfs"], [/* 29 vars
> */]) = 0
> brk(0)                                  = 0x942d000
> access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or
> directory)
> open("/etc/ld.so.cache", O_RDONLY)      = 3
> fstat64(3, {st_mode=S_IFREG|0644, st_size=90296, ...}) = 0
> mmap2(NULL, 90296, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7f96000
> close(3)                                = 0
> open("/lib/libvolume_id.so.0", O_RDONLY) = 3
> read(3,
> "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\360 at k\0004\0\0\0"...,
> 512) = 512
> fstat64(3, {st_mode=S_IFREG|0755, st_size=32144, ...}) = 0
> mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
> -1,
> 0) = 0xb7f95000
> mmap2(0x6b3000, 33540, PROT_READ|PROT_EXEC,
> MAP_PRIVATE|MAP_DENYWRITE,
> 3, 0) = 0x6b3000
> mmap2(0x6bb000, 4096, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x7) = 0x6bb000
> close(3)                                = 0
> open("/lib/libc.so.6", O_RDONLY)        = 3
> read(3,
> "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\340\17X\0004\0\0\0"...,
> 512) = 512
> fstat64(3, {st_mode=S_IFREG|0755, st_size=1611564, ...}) = 0
> mmap2(0x56b000, 1332676, PROT_READ|PROT_EXEC,
> MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x56b000
> mprotect(0x6aa000, 4096, PROT_NONE)     = 0
> mmap2(0x6ab000, 12288, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x13f) = 0x6ab000
> mmap2(0x6ae000, 9668, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x6ae000
> close(3)                                = 0
> mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
> -1,
> 0) = 0xb7f94000
> set_thread_area({entry_number:-1 -> 6, base_addr:0xb7f946c0,
> limit:1048575, seg_32bit:1, contents:0, read_exec_only:0,
> limit_in_pages:1, seg_not_present:0, useable:1}) = 0
> mprotect(0x6ab000, 8192, PROT_READ)     = 0
> mprotect(0x567000, 4096, PROT_READ)     = 0
> munmap(0xb7f96000, 90296)               = 0
> time(NULL)                              = 1262655667
> getpid()                                = 18781
> brk(0)                                  = 0x942d000
> brk(0x944e000)                          = 0x944e000
> open("/gfs", O_RDONLY|O_LARGEFILE)      = 3
> open("/proc/mounts", O_RDONLY|O_LARGEFILE) = 4
> lstat64("/gfs", {st_mode=S_IFDIR|0755, st_size=3864, ...}) = 0
> fstat64(4, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
> -1,
> 0) = 0xb7fac000
> read(4, "rootfs / rootfs rw 0 0\n/dev/root"..., 4096) = 659
> close(4)                                = 0
> munmap(0xb7fac000, 4096)                = 0
> open("/dev/mapper/vg100-lvol0", O_RDWR|O_LARGEFILE) = 4
> fstat64(4, {st_mode=S_IFBLK|0660, st_rdev=makedev(253, 4), ...}) = 0
> _llseek(4, 0, [1677721600], SEEK_END)   = 0
> fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
> mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
> -1,
> 0) = 0xb7fac000
> write(1, "Initializing lists...\n", 22Initializing lists...
> ) = 22
> _llseek(4, 65536, [65536], SEEK_SET)    = 0
> read(4,
> "\1\26\31p\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0d\0\0\0\0\0\0\7\t\0\0\7l"...,
> 4096) = 4096
> _llseek(4, 0, [1677721600], SEEK_END)   = 0
> open("/proc/mounts", O_RDONLY|O_LARGEFILE) = 5
> fstat64(5, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
> -1,
> 0) = 0xb7fab000
> read(5, "rootfs / rootfs rw 0 0\n/dev/root"..., 4096) = 659
> read(5, "", 4096)                       = 0
> close(5)                                = 0
> munmap(0xb7fab000, 4096)                = 0
> open("/tmp/.gfs2meta", O_RDONLY|O_LARGEFILE) = 5
> fstat64(5, {st_mode=S_IFDIR|0700, st_size=4096, ...}) = 0
> close(5)                                = 0
> mount("/dev/mapper/vg100-lvol0", "/tmp/.gfs2meta", "gfs2meta", 0,
> NULL) = -1 EINVAL (Invalid argument)
> write(2, "gfs2_grow: ", 11gfs2_grow: )             = 11
> write(2, "Couldn't mount /tmp/.gfs2meta : "..., 49Couldn't mount
> /tmp/.gfs2meta : Invalid argument
> ) = 49
> exit_group(1)                           = ?
> 
> 
> On Tue, Jan 5, 2010 at 6:27 AM, Abhijith Das <adas at redhat.com> wrote:
> > Hi,
> >
> > >From the following message, it looks like the gfs2meta mount
> routine is not able to locate the gfs2 mountpoint.
> > "Dec 31 15:04:56 wplccdlvm446 kernel: GFS2: gfs2 mount does not
> exist"
> > Can you confirm that /proc/mounts and /etc/mtab all agree on the
> mounted gfs2 at /gfs?
> > Also, can you run gfs2_grow under strace so that we can see what
> arguments gfs2_grow passes to the mount() system call when it tries to
> mount the gfs2meta filesystem?
> >
> > Thanks!
> > --Abhi
> >
> > ----- "Diamond Li" <diamondiona at gmail.com> wrote:
> >
> >> From: "Diamond Li" <diamondiona at gmail.com>
> >> To: "linux clustering" <linux-cluster at redhat.com>
> >> Sent: Monday, January 4, 2010 2:25:59 AM GMT -06:00 US/Canada
> Central
> >> Subject: Re: [Linux-cluster] gfs2_grow does not work
> >>
> >> could someone kindly help me to get through?
> >>
> >> thanks in advance!
> >>
> >> On Thu, Dec 31, 2009 at 3:16 PM, Diamond Li
> <diamondiona at gmail.com>
> >> wrote:
> >> > from system log, I can see the erorr message:
> >> >
> >> > Dec 31 15:04:56 wplccdlvm446 kernel: GFS2: gfs2 mount does not
> >> exist
> >> >
> >> > but I have mounted gfs2 file system under /gfs folder and I can
> do
> >> > operations such as mkdir, rm, successfully.
> >> >
> >> >
> >> >
> >> > On Thu, Dec 31, 2009 at 2:55 PM, Diamond Li
> <diamondiona at gmail.com>
> >> wrote:
> >> >> Hello,
> >> >>
> >> >> I am trying to grow a gfs2 file system, unfortunately ?it does
> not
> >> work.
> >> >>
> >> >> anyone has similar issues or I always have bad luck?
> >> >>
> >> >> [root at wplccdlvm446 gfs]# mount
> >> >>
> >> >> /dev/mapper/vg100-lvol0 on /gfs type gfs2
> >> (rw,hostdata=jid=0:id=131074:first=1)
> >> >>
> >> >> [root at wplccdlvm446 gfs]# gfs2_grow -v /gfs
> >> >> Initializing lists...
> >> >> gfs2_grow: Couldn't mount /tmp/.gfs2meta : Invalid argument
> >> >>
> >> >> [root at wplccdlvm446 gfs]# ls -a /tmp/.gfs2meta/
> >> >> . ?..
> >> >>
> >> >>
> >> >> [root at wplccdlvm446 gfs]# uname -r
> >> >> 2.6.18-164.el5
> >> >>
> >> >> [root at wplccdlvm446 gfs]# cat /etc/redhat-release
> >> >> Red Hat Enterprise Linux Server release 5.4 (Tikanga)
> >> >>
> >> >
> >>
> >> --
> >> Linux-cluster mailing list
> >> Linux-cluster at redhat.com
> >> https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From diamondiona at gmail.com  Tue Jan  5 05:57:08 2010
From: diamondiona at gmail.com (Diamond Li)
Date: Tue, 5 Jan 2010 13:57:08 +0800
Subject: [Linux-cluster] gfs2_grow does not work
In-Reply-To: <1979230168.2365461262669175952.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com>
References: <678287746.2365421262668986749.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com>
	<1979230168.2365461262669175952.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com>
Message-ID: <dd23a5e1001042157q7ae1debcqf0f52edf54f2d473@mail.gmail.com>

Appreciate for your reply!


[root at wplccdlvm445 ~]# uname -r
2.6.18-164.el5

[root at wplccdlvm445 ~]# rpm -qa |grep gfs
gfs2-utils-0.1.44-1.el5
gfs-utils-0.1.17-1.el5

I am using the packages shipped by Redhat, no any customization. Does
it mean gfs2_grow does not work at all in 5.4 release(I wish I am
wrong)?

I did not find any patch for x86 32 bit CPU,  it does have x86_64.



On Tue, Jan 5, 2010 at 1:26 PM, Abhijith Das <adas at redhat.com> wrote:
> Hi Diamond,
>
> Could I also have the kernel and gfs2-utils rpm versions you are using so I can try this on my setup? I just spotted something in your strace output that could be a problem if you have a newer kernel, but not a newer gfs2-utils package.
>
> The mount syscall in your strace output takes the device as the first arg to mount the metafs. A recent kernel patch from https://bugzilla.redhat.com/show_bug.cgi?id=457798 changed that to take the mountpoint as the first arg instead. There was a corresponding userland patch to gfs2-utils in https://bugzilla.redhat.com/show_bug.cgi?id=459630#c3 that fixed this mismatch.
> I'm not sure if you're seeing this. If so, an upgrade of these packages should fix what you're seeing.
>
> Cheers!
> --Abhi
>
> ----- "Diamond Li" <diamondiona at gmail.com> wrote:
>
>> From: "Diamond Li" <diamondiona at gmail.com>
>> To: "linux clustering" <linux-cluster at redhat.com>
>> Sent: Monday, January 4, 2010 8:12:12 PM GMT -06:00 US/Canada Central
>> Subject: Re: [Linux-cluster] gfs2_grow does not work
>>
>> [root at wplccdlvm445 proc]# df -k
>> Filesystem ? ? ? ? ? 1K-blocks ? ? ?Used Available Use% Mounted on
>> /dev/mapper/VolGroup00-LogVol00
>> ? ? ? ? ? ? ? ? ? ? ? 28376956 ? 9144384 ?17767844 ?34% /
>> /dev/sda1 ? ? ? ? ? ? ? 101086 ? ? 12055 ? ? 83812 ?13% /boot
>> tmpfs ? ? ? ? ? ? ? ? ?1037748 ? ? ? ? 0 ? 1037748 ? 0% /dev/shm
>> /dev/mapper/vg100-lvol0
>> ? ? ? ? ? ? ? ? ? ? ? ? 819024 ? ?794264 ? ? 24760 ?97% /gfs
>>
>> [root at wplccdlvm445 proc]# ls -ld /tmp
>> drwxrwxrwt 8 root root 4096 Jan ?5 04:02 /tmp
>> [root at wplccdlvm445 proc]# ls -ld /tmp/.gfs2meta/
>> drwx------ 2 root root 4096 Dec 31 14:24 /tmp/.gfs2meta/
>>
>>
>> [root at wplccdlvm445 proc]# cat /proc/mounts
>> rootfs / rootfs rw 0 0
>> /dev/root / ext3 rw,data=ordered 0 0
>> /dev /dev tmpfs rw 0 0
>> /proc /proc proc rw 0 0
>> /sys /sys sysfs rw 0 0
>> /proc/bus/usb /proc/bus/usb usbfs rw 0 0
>> devpts /dev/pts devpts rw 0 0
>> /dev/sda1 /boot ext3 rw,data=ordered 0 0
>> tmpfs /dev/shm tmpfs rw 0 0
>> none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
>> sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
>> /etc/auto.misc /misc autofs
>> rw,fd=7,pgrp=2200,timeout=300,minproto=5,maxproto=5,indirect 0 0
>> -hosts /net autofs
>> rw,fd=13,pgrp=2200,timeout=300,minproto=5,maxproto=5,indirect 0 0
>> none /sys/kernel/config configfs rw 0 0
>> /dev/mapper/vg100-lvol0 /gfs gfs2 rw,hostdata=jid=0:id=65537:first=1 0
>> 0
>>
>> [root at wplccdlvm445 proc]# cat /etc/mtab
>> /dev/mapper/VolGroup00-LogVol00 / ext3 rw 0 0
>> proc /proc proc rw 0 0
>> sysfs /sys sysfs rw 0 0
>> devpts /dev/pts devpts rw,gid=5,mode=620 0 0
>> /dev/sda1 /boot ext3 rw 0 0
>> tmpfs /dev/shm tmpfs rw 0 0
>> none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
>> sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
>> none /sys/kernel/config configfs rw 0 0
>> /dev/mapper/vg100-lvol0 /gfs gfs2 rw,hostdata=jid=0:id=65537:first=1 0
>> 0
>>
>>
>> [root at wplccdlvm445 proc]# strace gfs2_grow -v /gfs
>> execve("/sbin/gfs2_grow", ["gfs2_grow", "-v", "/gfs"], [/* 29 vars
>> */]) = 0
>> brk(0) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?= 0x942d000
>> access("/etc/ld.so.preload", R_OK) ? ? ?= -1 ENOENT (No such file or
>> directory)
>> open("/etc/ld.so.cache", O_RDONLY) ? ? ?= 3
>> fstat64(3, {st_mode=S_IFREG|0644, st_size=90296, ...}) = 0
>> mmap2(NULL, 90296, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7f96000
>> close(3) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?= 0
>> open("/lib/libvolume_id.so.0", O_RDONLY) = 3
>> read(3,
>> "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\360 at k\0004\0\0\0"...,
>> 512) = 512
>> fstat64(3, {st_mode=S_IFREG|0755, st_size=32144, ...}) = 0
>> mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
>> -1,
>> 0) = 0xb7f95000
>> mmap2(0x6b3000, 33540, PROT_READ|PROT_EXEC,
>> MAP_PRIVATE|MAP_DENYWRITE,
>> 3, 0) = 0x6b3000
>> mmap2(0x6bb000, 4096, PROT_READ|PROT_WRITE,
>> MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x7) = 0x6bb000
>> close(3) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?= 0
>> open("/lib/libc.so.6", O_RDONLY) ? ? ? ?= 3
>> read(3,
>> "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\340\17X\0004\0\0\0"...,
>> 512) = 512
>> fstat64(3, {st_mode=S_IFREG|0755, st_size=1611564, ...}) = 0
>> mmap2(0x56b000, 1332676, PROT_READ|PROT_EXEC,
>> MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x56b000
>> mprotect(0x6aa000, 4096, PROT_NONE) ? ? = 0
>> mmap2(0x6ab000, 12288, PROT_READ|PROT_WRITE,
>> MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x13f) = 0x6ab000
>> mmap2(0x6ae000, 9668, PROT_READ|PROT_WRITE,
>> MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x6ae000
>> close(3) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?= 0
>> mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
>> -1,
>> 0) = 0xb7f94000
>> set_thread_area({entry_number:-1 -> 6, base_addr:0xb7f946c0,
>> limit:1048575, seg_32bit:1, contents:0, read_exec_only:0,
>> limit_in_pages:1, seg_not_present:0, useable:1}) = 0
>> mprotect(0x6ab000, 8192, PROT_READ) ? ? = 0
>> mprotect(0x567000, 4096, PROT_READ) ? ? = 0
>> munmap(0xb7f96000, 90296) ? ? ? ? ? ? ? = 0
>> time(NULL) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?= 1262655667
>> getpid() ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?= 18781
>> brk(0) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?= 0x942d000
>> brk(0x944e000) ? ? ? ? ? ? ? ? ? ? ? ? ?= 0x944e000
>> open("/gfs", O_RDONLY|O_LARGEFILE) ? ? ?= 3
>> open("/proc/mounts", O_RDONLY|O_LARGEFILE) = 4
>> lstat64("/gfs", {st_mode=S_IFDIR|0755, st_size=3864, ...}) = 0
>> fstat64(4, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
>> mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
>> -1,
>> 0) = 0xb7fac000
>> read(4, "rootfs / rootfs rw 0 0\n/dev/root"..., 4096) = 659
>> close(4) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?= 0
>> munmap(0xb7fac000, 4096) ? ? ? ? ? ? ? ?= 0
>> open("/dev/mapper/vg100-lvol0", O_RDWR|O_LARGEFILE) = 4
>> fstat64(4, {st_mode=S_IFBLK|0660, st_rdev=makedev(253, 4), ...}) = 0
>> _llseek(4, 0, [1677721600], SEEK_END) ? = 0
>> fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
>> mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
>> -1,
>> 0) = 0xb7fac000
>> write(1, "Initializing lists...\n", 22Initializing lists...
>> ) = 22
>> _llseek(4, 65536, [65536], SEEK_SET) ? ?= 0
>> read(4,
>> "\1\26\31p\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0d\0\0\0\0\0\0\7\t\0\0\7l"...,
>> 4096) = 4096
>> _llseek(4, 0, [1677721600], SEEK_END) ? = 0
>> open("/proc/mounts", O_RDONLY|O_LARGEFILE) = 5
>> fstat64(5, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
>> mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
>> -1,
>> 0) = 0xb7fab000
>> read(5, "rootfs / rootfs rw 0 0\n/dev/root"..., 4096) = 659
>> read(5, "", 4096) ? ? ? ? ? ? ? ? ? ? ? = 0
>> close(5) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?= 0
>> munmap(0xb7fab000, 4096) ? ? ? ? ? ? ? ?= 0
>> open("/tmp/.gfs2meta", O_RDONLY|O_LARGEFILE) = 5
>> fstat64(5, {st_mode=S_IFDIR|0700, st_size=4096, ...}) = 0
>> close(5) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?= 0
>> mount("/dev/mapper/vg100-lvol0", "/tmp/.gfs2meta", "gfs2meta", 0,
>> NULL) = -1 EINVAL (Invalid argument)
>> write(2, "gfs2_grow: ", 11gfs2_grow: ) ? ? ? ? ? ? = 11
>> write(2, "Couldn't mount /tmp/.gfs2meta : "..., 49Couldn't mount
>> /tmp/.gfs2meta : Invalid argument
>> ) = 49
>> exit_group(1) ? ? ? ? ? ? ? ? ? ? ? ? ? = ?
>>
>>
>> On Tue, Jan 5, 2010 at 6:27 AM, Abhijith Das <adas at redhat.com> wrote:
>> > Hi,
>> >
>> > >From the following message, it looks like the gfs2meta mount
>> routine is not able to locate the gfs2 mountpoint.
>> > "Dec 31 15:04:56 wplccdlvm446 kernel: GFS2: gfs2 mount does not
>> exist"
>> > Can you confirm that /proc/mounts and /etc/mtab all agree on the
>> mounted gfs2 at /gfs?
>> > Also, can you run gfs2_grow under strace so that we can see what
>> arguments gfs2_grow passes to the mount() system call when it tries to
>> mount the gfs2meta filesystem?
>> >
>> > Thanks!
>> > --Abhi
>> >
>> > ----- "Diamond Li" <diamondiona at gmail.com> wrote:
>> >
>> >> From: "Diamond Li" <diamondiona at gmail.com>
>> >> To: "linux clustering" <linux-cluster at redhat.com>
>> >> Sent: Monday, January 4, 2010 2:25:59 AM GMT -06:00 US/Canada
>> Central
>> >> Subject: Re: [Linux-cluster] gfs2_grow does not work
>> >>
>> >> could someone kindly help me to get through?
>> >>
>> >> thanks in advance!
>> >>
>> >> On Thu, Dec 31, 2009 at 3:16 PM, Diamond Li
>> <diamondiona at gmail.com>
>> >> wrote:
>> >> > from system log, I can see the erorr message:
>> >> >
>> >> > Dec 31 15:04:56 wplccdlvm446 kernel: GFS2: gfs2 mount does not
>> >> exist
>> >> >
>> >> > but I have mounted gfs2 file system under /gfs folder and I can
>> do
>> >> > operations such as mkdir, rm, successfully.
>> >> >
>> >> >
>> >> >
>> >> > On Thu, Dec 31, 2009 at 2:55 PM, Diamond Li
>> <diamondiona at gmail.com>
>> >> wrote:
>> >> >> Hello,
>> >> >>
>> >> >> I am trying to grow a gfs2 file system, unfortunately ?it does
>> not
>> >> work.
>> >> >>
>> >> >> anyone has similar issues or I always have bad luck?
>> >> >>
>> >> >> [root at wplccdlvm446 gfs]# mount
>> >> >>
>> >> >> /dev/mapper/vg100-lvol0 on /gfs type gfs2
>> >> (rw,hostdata=jid=0:id=131074:first=1)
>> >> >>
>> >> >> [root at wplccdlvm446 gfs]# gfs2_grow -v /gfs
>> >> >> Initializing lists...
>> >> >> gfs2_grow: Couldn't mount /tmp/.gfs2meta : Invalid argument
>> >> >>
>> >> >> [root at wplccdlvm446 gfs]# ls -a /tmp/.gfs2meta/
>> >> >> . ?..
>> >> >>
>> >> >>
>> >> >> [root at wplccdlvm446 gfs]# uname -r
>> >> >> 2.6.18-164.el5
>> >> >>
>> >> >> [root at wplccdlvm446 gfs]# cat /etc/redhat-release
>> >> >> Red Hat Enterprise Linux Server release 5.4 (Tikanga)
>> >> >>
>> >> >
>> >>
>> >> --
>> >> Linux-cluster mailing list
>> >> Linux-cluster at redhat.com
>> >> https://www.redhat.com/mailman/listinfo/linux-cluster
>> >
>> > --
>> > Linux-cluster mailing list
>> > Linux-cluster at redhat.com
>> > https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From diamondiona at gmail.com  Tue Jan  5 06:00:33 2010
From: diamondiona at gmail.com (Diamond Li)
Date: Tue, 5 Jan 2010 14:00:33 +0800
Subject: [Linux-cluster] gfs2_grow does not work
In-Reply-To: <dd23a5e1001042157q7ae1debcqf0f52edf54f2d473@mail.gmail.com>
References: <678287746.2365421262668986749.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com>
	<1979230168.2365461262669175952.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com>
	<dd23a5e1001042157q7ae1debcqf0f52edf54f2d473@mail.gmail.com>
Message-ID: <dd23a5e1001042200v50f7c9d8uabc3b93b553ab5bf@mail.gmail.com>

by the way, from the link, the version is 5.3. but my version is
[root at wplccdlvm445 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.4 (Tikanga)



On Tue, Jan 5, 2010 at 1:57 PM, Diamond Li <diamondiona at gmail.com> wrote:
> Appreciate for your reply!
>
>
> [root at wplccdlvm445 ~]# uname -r
> 2.6.18-164.el5
>
> [root at wplccdlvm445 ~]# rpm -qa |grep gfs
> gfs2-utils-0.1.44-1.el5
> gfs-utils-0.1.17-1.el5
>
> I am using the packages shipped by Redhat, no any customization. Does
> it mean gfs2_grow does not work at all in 5.4 release(I wish I am
> wrong)?
>
> I did not find any patch for x86 32 bit CPU, ?it does have x86_64.
>
>
>
> On Tue, Jan 5, 2010 at 1:26 PM, Abhijith Das <adas at redhat.com> wrote:
>> Hi Diamond,
>>
>> Could I also have the kernel and gfs2-utils rpm versions you are using so I can try this on my setup? I just spotted something in your strace output that could be a problem if you have a newer kernel, but not a newer gfs2-utils package.
>>
>> The mount syscall in your strace output takes the device as the first arg to mount the metafs. A recent kernel patch from https://bugzilla.redhat.com/show_bug.cgi?id=457798 changed that to take the mountpoint as the first arg instead. There was a corresponding userland patch to gfs2-utils in https://bugzilla.redhat.com/show_bug.cgi?id=459630#c3 that fixed this mismatch.
>> I'm not sure if you're seeing this. If so, an upgrade of these packages should fix what you're seeing.
>>
>> Cheers!
>> --Abhi
>>
>> ----- "Diamond Li" <diamondiona at gmail.com> wrote:
>>
>>> From: "Diamond Li" <diamondiona at gmail.com>
>>> To: "linux clustering" <linux-cluster at redhat.com>
>>> Sent: Monday, January 4, 2010 8:12:12 PM GMT -06:00 US/Canada Central
>>> Subject: Re: [Linux-cluster] gfs2_grow does not work
>>>
>>> [root at wplccdlvm445 proc]# df -k
>>> Filesystem ? ? ? ? ? 1K-blocks ? ? ?Used Available Use% Mounted on
>>> /dev/mapper/VolGroup00-LogVol00
>>> ? ? ? ? ? ? ? ? ? ? ? 28376956 ? 9144384 ?17767844 ?34% /
>>> /dev/sda1 ? ? ? ? ? ? ? 101086 ? ? 12055 ? ? 83812 ?13% /boot
>>> tmpfs ? ? ? ? ? ? ? ? ?1037748 ? ? ? ? 0 ? 1037748 ? 0% /dev/shm
>>> /dev/mapper/vg100-lvol0
>>> ? ? ? ? ? ? ? ? ? ? ? ? 819024 ? ?794264 ? ? 24760 ?97% /gfs
>>>
>>> [root at wplccdlvm445 proc]# ls -ld /tmp
>>> drwxrwxrwt 8 root root 4096 Jan ?5 04:02 /tmp
>>> [root at wplccdlvm445 proc]# ls -ld /tmp/.gfs2meta/
>>> drwx------ 2 root root 4096 Dec 31 14:24 /tmp/.gfs2meta/
>>>
>>>
>>> [root at wplccdlvm445 proc]# cat /proc/mounts
>>> rootfs / rootfs rw 0 0
>>> /dev/root / ext3 rw,data=ordered 0 0
>>> /dev /dev tmpfs rw 0 0
>>> /proc /proc proc rw 0 0
>>> /sys /sys sysfs rw 0 0
>>> /proc/bus/usb /proc/bus/usb usbfs rw 0 0
>>> devpts /dev/pts devpts rw 0 0
>>> /dev/sda1 /boot ext3 rw,data=ordered 0 0
>>> tmpfs /dev/shm tmpfs rw 0 0
>>> none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
>>> sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
>>> /etc/auto.misc /misc autofs
>>> rw,fd=7,pgrp=2200,timeout=300,minproto=5,maxproto=5,indirect 0 0
>>> -hosts /net autofs
>>> rw,fd=13,pgrp=2200,timeout=300,minproto=5,maxproto=5,indirect 0 0
>>> none /sys/kernel/config configfs rw 0 0
>>> /dev/mapper/vg100-lvol0 /gfs gfs2 rw,hostdata=jid=0:id=65537:first=1 0
>>> 0
>>>
>>> [root at wplccdlvm445 proc]# cat /etc/mtab
>>> /dev/mapper/VolGroup00-LogVol00 / ext3 rw 0 0
>>> proc /proc proc rw 0 0
>>> sysfs /sys sysfs rw 0 0
>>> devpts /dev/pts devpts rw,gid=5,mode=620 0 0
>>> /dev/sda1 /boot ext3 rw 0 0
>>> tmpfs /dev/shm tmpfs rw 0 0
>>> none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
>>> sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
>>> none /sys/kernel/config configfs rw 0 0
>>> /dev/mapper/vg100-lvol0 /gfs gfs2 rw,hostdata=jid=0:id=65537:first=1 0
>>> 0
>>>
>>>
>>> [root at wplccdlvm445 proc]# strace gfs2_grow -v /gfs
>>> execve("/sbin/gfs2_grow", ["gfs2_grow", "-v", "/gfs"], [/* 29 vars
>>> */]) = 0
>>> brk(0) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?= 0x942d000
>>> access("/etc/ld.so.preload", R_OK) ? ? ?= -1 ENOENT (No such file or
>>> directory)
>>> open("/etc/ld.so.cache", O_RDONLY) ? ? ?= 3
>>> fstat64(3, {st_mode=S_IFREG|0644, st_size=90296, ...}) = 0
>>> mmap2(NULL, 90296, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7f96000
>>> close(3) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?= 0
>>> open("/lib/libvolume_id.so.0", O_RDONLY) = 3
>>> read(3,
>>> "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\360 at k\0004\0\0\0"...,
>>> 512) = 512
>>> fstat64(3, {st_mode=S_IFREG|0755, st_size=32144, ...}) = 0
>>> mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
>>> -1,
>>> 0) = 0xb7f95000
>>> mmap2(0x6b3000, 33540, PROT_READ|PROT_EXEC,
>>> MAP_PRIVATE|MAP_DENYWRITE,
>>> 3, 0) = 0x6b3000
>>> mmap2(0x6bb000, 4096, PROT_READ|PROT_WRITE,
>>> MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x7) = 0x6bb000
>>> close(3) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?= 0
>>> open("/lib/libc.so.6", O_RDONLY) ? ? ? ?= 3
>>> read(3,
>>> "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\340\17X\0004\0\0\0"...,
>>> 512) = 512
>>> fstat64(3, {st_mode=S_IFREG|0755, st_size=1611564, ...}) = 0
>>> mmap2(0x56b000, 1332676, PROT_READ|PROT_EXEC,
>>> MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x56b000
>>> mprotect(0x6aa000, 4096, PROT_NONE) ? ? = 0
>>> mmap2(0x6ab000, 12288, PROT_READ|PROT_WRITE,
>>> MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x13f) = 0x6ab000
>>> mmap2(0x6ae000, 9668, PROT_READ|PROT_WRITE,
>>> MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x6ae000
>>> close(3) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?= 0
>>> mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
>>> -1,
>>> 0) = 0xb7f94000
>>> set_thread_area({entry_number:-1 -> 6, base_addr:0xb7f946c0,
>>> limit:1048575, seg_32bit:1, contents:0, read_exec_only:0,
>>> limit_in_pages:1, seg_not_present:0, useable:1}) = 0
>>> mprotect(0x6ab000, 8192, PROT_READ) ? ? = 0
>>> mprotect(0x567000, 4096, PROT_READ) ? ? = 0
>>> munmap(0xb7f96000, 90296) ? ? ? ? ? ? ? = 0
>>> time(NULL) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?= 1262655667
>>> getpid() ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?= 18781
>>> brk(0) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?= 0x942d000
>>> brk(0x944e000) ? ? ? ? ? ? ? ? ? ? ? ? ?= 0x944e000
>>> open("/gfs", O_RDONLY|O_LARGEFILE) ? ? ?= 3
>>> open("/proc/mounts", O_RDONLY|O_LARGEFILE) = 4
>>> lstat64("/gfs", {st_mode=S_IFDIR|0755, st_size=3864, ...}) = 0
>>> fstat64(4, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
>>> mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
>>> -1,
>>> 0) = 0xb7fac000
>>> read(4, "rootfs / rootfs rw 0 0\n/dev/root"..., 4096) = 659
>>> close(4) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?= 0
>>> munmap(0xb7fac000, 4096) ? ? ? ? ? ? ? ?= 0
>>> open("/dev/mapper/vg100-lvol0", O_RDWR|O_LARGEFILE) = 4
>>> fstat64(4, {st_mode=S_IFBLK|0660, st_rdev=makedev(253, 4), ...}) = 0
>>> _llseek(4, 0, [1677721600], SEEK_END) ? = 0
>>> fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
>>> mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
>>> -1,
>>> 0) = 0xb7fac000
>>> write(1, "Initializing lists...\n", 22Initializing lists...
>>> ) = 22
>>> _llseek(4, 65536, [65536], SEEK_SET) ? ?= 0
>>> read(4,
>>> "\1\26\31p\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0d\0\0\0\0\0\0\7\t\0\0\7l"...,
>>> 4096) = 4096
>>> _llseek(4, 0, [1677721600], SEEK_END) ? = 0
>>> open("/proc/mounts", O_RDONLY|O_LARGEFILE) = 5
>>> fstat64(5, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
>>> mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
>>> -1,
>>> 0) = 0xb7fab000
>>> read(5, "rootfs / rootfs rw 0 0\n/dev/root"..., 4096) = 659
>>> read(5, "", 4096) ? ? ? ? ? ? ? ? ? ? ? = 0
>>> close(5) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?= 0
>>> munmap(0xb7fab000, 4096) ? ? ? ? ? ? ? ?= 0
>>> open("/tmp/.gfs2meta", O_RDONLY|O_LARGEFILE) = 5
>>> fstat64(5, {st_mode=S_IFDIR|0700, st_size=4096, ...}) = 0
>>> close(5) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?= 0
>>> mount("/dev/mapper/vg100-lvol0", "/tmp/.gfs2meta", "gfs2meta", 0,
>>> NULL) = -1 EINVAL (Invalid argument)
>>> write(2, "gfs2_grow: ", 11gfs2_grow: ) ? ? ? ? ? ? = 11
>>> write(2, "Couldn't mount /tmp/.gfs2meta : "..., 49Couldn't mount
>>> /tmp/.gfs2meta : Invalid argument
>>> ) = 49
>>> exit_group(1) ? ? ? ? ? ? ? ? ? ? ? ? ? = ?
>>>
>>>
>>> On Tue, Jan 5, 2010 at 6:27 AM, Abhijith Das <adas at redhat.com> wrote:
>>> > Hi,
>>> >
>>> > >From the following message, it looks like the gfs2meta mount
>>> routine is not able to locate the gfs2 mountpoint.
>>> > "Dec 31 15:04:56 wplccdlvm446 kernel: GFS2: gfs2 mount does not
>>> exist"
>>> > Can you confirm that /proc/mounts and /etc/mtab all agree on the
>>> mounted gfs2 at /gfs?
>>> > Also, can you run gfs2_grow under strace so that we can see what
>>> arguments gfs2_grow passes to the mount() system call when it tries to
>>> mount the gfs2meta filesystem?
>>> >
>>> > Thanks!
>>> > --Abhi
>>> >
>>> > ----- "Diamond Li" <diamondiona at gmail.com> wrote:
>>> >
>>> >> From: "Diamond Li" <diamondiona at gmail.com>
>>> >> To: "linux clustering" <linux-cluster at redhat.com>
>>> >> Sent: Monday, January 4, 2010 2:25:59 AM GMT -06:00 US/Canada
>>> Central
>>> >> Subject: Re: [Linux-cluster] gfs2_grow does not work
>>> >>
>>> >> could someone kindly help me to get through?
>>> >>
>>> >> thanks in advance!
>>> >>
>>> >> On Thu, Dec 31, 2009 at 3:16 PM, Diamond Li
>>> <diamondiona at gmail.com>
>>> >> wrote:
>>> >> > from system log, I can see the erorr message:
>>> >> >
>>> >> > Dec 31 15:04:56 wplccdlvm446 kernel: GFS2: gfs2 mount does not
>>> >> exist
>>> >> >
>>> >> > but I have mounted gfs2 file system under /gfs folder and I can
>>> do
>>> >> > operations such as mkdir, rm, successfully.
>>> >> >
>>> >> >
>>> >> >
>>> >> > On Thu, Dec 31, 2009 at 2:55 PM, Diamond Li
>>> <diamondiona at gmail.com>
>>> >> wrote:
>>> >> >> Hello,
>>> >> >>
>>> >> >> I am trying to grow a gfs2 file system, unfortunately ?it does
>>> not
>>> >> work.
>>> >> >>
>>> >> >> anyone has similar issues or I always have bad luck?
>>> >> >>
>>> >> >> [root at wplccdlvm446 gfs]# mount
>>> >> >>
>>> >> >> /dev/mapper/vg100-lvol0 on /gfs type gfs2
>>> >> (rw,hostdata=jid=0:id=131074:first=1)
>>> >> >>
>>> >> >> [root at wplccdlvm446 gfs]# gfs2_grow -v /gfs
>>> >> >> Initializing lists...
>>> >> >> gfs2_grow: Couldn't mount /tmp/.gfs2meta : Invalid argument
>>> >> >>
>>> >> >> [root at wplccdlvm446 gfs]# ls -a /tmp/.gfs2meta/
>>> >> >> . ?..
>>> >> >>
>>> >> >>
>>> >> >> [root at wplccdlvm446 gfs]# uname -r
>>> >> >> 2.6.18-164.el5
>>> >> >>
>>> >> >> [root at wplccdlvm446 gfs]# cat /etc/redhat-release
>>> >> >> Red Hat Enterprise Linux Server release 5.4 (Tikanga)
>>> >> >>
>>> >> >
>>> >>
>>> >> --
>>> >> Linux-cluster mailing list
>>> >> Linux-cluster at redhat.com
>>> >> https://www.redhat.com/mailman/listinfo/linux-cluster
>>> >
>>> > --
>>> > Linux-cluster mailing list
>>> > Linux-cluster at redhat.com
>>> > https://www.redhat.com/mailman/listinfo/linux-cluster
>>>
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>



From adas at redhat.com  Tue Jan  5 16:48:15 2010
From: adas at redhat.com (Abhijith Das)
Date: Tue, 5 Jan 2010 11:48:15 -0500 (EST)
Subject: [Linux-cluster] gfs2_grow does not work
In-Reply-To: <1328423852.2405981262710057211.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com>
Message-ID: <1673557480.2406041262710095635.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com>


----- "Diamond Li" <diamondiona at gmail.com> wrote:

> From: "Diamond Li" <diamondiona at gmail.com>
> To: "linux clustering" <linux-cluster at redhat.com>
> Sent: Monday, January 4, 2010 11:57:08 PM GMT -06:00 US/Canada Central
> Subject: Re: [Linux-cluster] gfs2_grow does not work
>
> Appreciate for your reply!
> 
> 
> [root at wplccdlvm445 ~]# uname -r
> 2.6.18-164.el5
> 
> [root at wplccdlvm445 ~]# rpm -qa |grep gfs
> gfs2-utils-0.1.44-1.el5

There's your problem :). This gfs2-utils package is pretty old (RHEL5.2 timeframe). The one that shipped with RHEL5.4 is gfs2-utils-0.1.62-1.el5. Please upgrade to this version and try again.

Cheers!
--Abhi



From diamondiona at gmail.com  Wed Jan  6 02:53:33 2010
From: diamondiona at gmail.com (Diamond Li)
Date: Wed, 6 Jan 2010 10:53:33 +0800
Subject: [Linux-cluster] gfs2_grow does not work
In-Reply-To: <1673557480.2406041262710095635.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com>
References: <1328423852.2405981262710057211.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com>
	<1673557480.2406041262710095635.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com>
Message-ID: <dd23a5e1001051853x1ed07fe8g2876aca457ec3857@mail.gmail.com>

Super!

yes, I made mistake, installed this package from 5.2 yum repository.
now, it is working perfectly!

really appreciate for your help!

On Wed, Jan 6, 2010 at 12:48 AM, Abhijith Das <adas at redhat.com> wrote:
>
> ----- "Diamond Li" <diamondiona at gmail.com> wrote:
>
>> From: "Diamond Li" <diamondiona at gmail.com>
>> To: "linux clustering" <linux-cluster at redhat.com>
>> Sent: Monday, January 4, 2010 11:57:08 PM GMT -06:00 US/Canada Central
>> Subject: Re: [Linux-cluster] gfs2_grow does not work
>>
>> Appreciate for your reply!
>>
>>
>> [root at wplccdlvm445 ~]# uname -r
>> 2.6.18-164.el5
>>
>> [root at wplccdlvm445 ~]# rpm -qa |grep gfs
>> gfs2-utils-0.1.44-1.el5
>
> There's your problem :). This gfs2-utils package is pretty old (RHEL5.2 timeframe). The one that shipped with RHEL5.4 is gfs2-utils-0.1.62-1.el5. Please upgrade to this version and try again.
>
> Cheers!
> --Abhi
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



From ccook at pandora.com  Wed Jan  6 18:03:05 2010
From: ccook at pandora.com (Christopher Strider Cook)
Date: Wed, 06 Jan 2010 10:03:05 -0800
Subject: [Linux-cluster] Will a service fail cause a node to fence?
Message-ID: <4B44D059.6030801@pandora.com>

If a cluster looses communication with a node then fencing will take 
place, but if a service fails and fails to stop/exit cleanly so another 
node can take over, will a fencing operation take place?

Cluster 3, corosync 1, rgmanager 3


Thanks,
Chris




From pradhanparas at gmail.com  Wed Jan  6 23:47:08 2010
From: pradhanparas at gmail.com (Paras pradhan)
Date: Wed, 6 Jan 2010 17:47:08 -0600
Subject: [Linux-cluster] GFS and performace
Message-ID: <8b711df41001061547h757f57fdi23fed852b5ac536a@mail.gmail.com>

I have a GFS based shared storage cluster that connects to SAN by fibre
channel. This GFS shared storage hold several virtual machines. While
running hdparam from the host to a GFS share, I get following results.

--
hdparm -t /guest_vms1

/dev/mapper/test_vg1-prd_vg1_lv:
Timing buffered disk reads:  262 MB in  3.00 seconds =  87.24 MB/sec
---


Now from within the virtual machines, the I/O is low

---
hdparm -t /dev/mapper/VolGroup00-LogVol00

/dev/mapper/VolGroup00-LogVol00:
 Timing buffered disk reads:   88 MB in  3.00 seconds =  29.31 MB/sec
---

I am looking for possibilities if I can increase my I/O read write within my
virtual machines. Tuning GFS does help in this case?

Sorry if my question is not relevant to this list


Thanks
Paras.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100106/086d8c49/attachment.htm>

From sdake at redhat.com  Wed Jan  6 23:55:40 2010
From: sdake at redhat.com (Steven Dake)
Date: Wed, 06 Jan 2010 16:55:40 -0700
Subject: [Linux-cluster] GFS and performace
In-Reply-To: <8b711df41001061547h757f57fdi23fed852b5ac536a@mail.gmail.com>
References: <8b711df41001061547h757f57fdi23fed852b5ac536a@mail.gmail.com>
Message-ID: <1262822140.2588.3.camel@localhost.localdomain>

Virtual machines use memory copies between the physical device and the
guest OS.  Clearly this is an area where more work is being done in the
virtualization community but is outside the scope of the typical
filesystem gfs or otherwise.  You might ask about io performance tuning
on the respective virtualization technology mailing list you use.

Regards
-steve

 On Wed, 2010-01-06 at 17:47 -0600, Paras pradhan wrote:
> I have a GFS based shared storage cluster that connects to SAN by
> fibre channel. This GFS shared storage hold several virtual machines.
> While running hdparam from the host to a GFS share, I get following
> results.
> 
> 
> --
> hdparm -t /guest_vms1
> 
> 
> /dev/mapper/test_vg1-prd_vg1_lv:
> Timing buffered disk reads:  262 MB in  3.00 seconds =  87.24 MB/sec
> ---
> 
> 
> 
> 
> Now from within the virtual machines, the I/O is low
> 
> 
> ---
> hdparm -t /dev/mapper/VolGroup00-LogVol00 
> 
> 
> /dev/mapper/VolGroup00-LogVol00:
>  Timing buffered disk reads:   88 MB in  3.00 seconds =  29.31 MB/sec
> ---
> 
> 
> I am looking for possibilities if I can increase my I/O read write
> within my virtual machines. Tuning GFS does help in this case?
> 
> 
> Sorry if my question is not relevant to this list
> 
> 
> 
> 
> Thanks
> Paras.
> 
> 
> 
> 
> 
> 
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From gordan at bobich.net  Thu Jan  7 00:13:50 2010
From: gordan at bobich.net (Gordan Bobic)
Date: Thu, 07 Jan 2010 00:13:50 +0000
Subject: [Linux-cluster] GFS and performace
In-Reply-To: <8b711df41001061547h757f57fdi23fed852b5ac536a@mail.gmail.com>
References: <8b711df41001061547h757f57fdi23fed852b5ac536a@mail.gmail.com>
Message-ID: <4B45273E.8050807@bobich.net>

Paras pradhan wrote:
> I have a GFS based shared storage cluster that connects to SAN by fibre 
> channel. This GFS shared storage hold several virtual machines. While 
> running hdparam from the host to a GFS share, I get following results.
> 
> --
> hdparm -t /guest_vms1
> 
> /dev/mapper/test_vg1-prd_vg1_lv:
> Timing buffered disk reads:  262 MB in  3.00 seconds =  87.24 MB/sec
> ---
> 
> 
> Now from within the virtual machines, the I/O is low
> 
> ---
> hdparm -t /dev/mapper/VolGroup00-LogVol00 
> 
> /dev/mapper/VolGroup00-LogVol00:
>  Timing buffered disk reads:   88 MB in  3.00 seconds =  29.31 MB/sec
> ---
> 
> I am looking for possibilities if I can increase my I/O read write 
> within my virtual machines. Tuning GFS does help in this case?
> 
> Sorry if my question is not relevant to this list

I suspect you'll find that is pretty normal for virtualization-induced 
I/O penalty. Virtualization really, trully, utterly sucks when it comes 
to I/O performance.

My I/O performance tests (done using kernel building) show that the 
bottleneck was always disk I/O (including when the entire kernel source 
tree is pre-cached, with a 2GB of RAM guest. The _least_ horribly 
performing virtualization solution was VMware (tested with latest player 
3.0, but verified against the latest server, too). That managed to 
complete the task in "only" 140% of the time the bare metal machine did 
(the host machine had it's memory limited to 2GB with the mem= kernel 
option to make sure the test was fair). So, 40% slower than bare metal.

Paravirtualized Xen was close behind, followed very closely by 
non-paravirtualized KVM (which was actually slower when paravirtualized 
drivers were used!). VirtualBox came so far behind it's not even worth 
mentioning.

Nevertheless, it shows that the whole "performance being close to bare 
metal" premise is completely mythical and comes from very selective 
tests (e.g. only testing CPU intensive tasks). But then again we all 
knew that, right?

Gordan



From frank at si.ct.upc.edu  Thu Jan  7 07:43:05 2010
From: frank at si.ct.upc.edu (frank)
Date: Thu, 07 Jan 2010 08:43:05 +0100
Subject: [Linux-cluster] lock_dlm but local flocks = true?
In-Reply-To: <mailman.4029.1261990315.8260.linux-cluster@redhat.com>
References: <mailman.4029.1261990315.8260.linux-cluster@redhat.com>
Message-ID: <4B459089.5030807@si.ct.upc.edu>

Hi Steve, I have not answered before because I was on holidays. By the 
way, happy new year.

I have looked /proc/mounts as you told me, and ... surprise:

/dev/mapper/volCluster-lvol0 /mnt/gfs gfs 
rw,hostdata=jid=0:id=196610:first=1,localflocks 0 0

"localflocks" is there! I don't understand because I mount it using 
"/etc/init.d/gfs start" which looks at /etc/fstab, and there the line is:

/dev/volCluster/lvol0    /mnt/gfs     gfs    defaults  0 0

I must admit that there is a particular thing in this system which I 
thought it didn't affect, but I am not so sure now, and that is it is a 
OpenVZ patched kernel. Can this have something to do with gfs mounts?

Thanks for your help once more.

Frank
> Date: Wed, 23 Dec 2009 15:15:28 +0000 From: Steven Whitehouse 
> <swhiteho at redhat.com> To: linux clustering <linux-cluster at redhat.com> 
> Subject: Re: [Linux-cluster] lock_dlm but local flocks = true? 
> Message-ID: <1261581328.14393.113.camel at localhost.localdomain> 
> Content-Type: text/plain Hi, On Wed, 2009-12-23 at 15:53 +0100, frank 
> wrote:
>> >  Hi Steve, thanks for your answer
>> >  but I have not put the "localflocks" mount parameter anywhere. Look at
>> >  "gfs_tool df" output:
>> >  
>> >  # gfs_tool df /mnt/gfs
>> >  /mnt/gfs:
>> >      SB lock proto = "lock_dlm"
>> >      SB lock table = "H-N:gfs01"
>> >      SB ondisk format = 1309
>> >      SB multihost format = 1401
>> >      Block size = 4096
>> >      Journals = 2
>> >      Resource Groups = 200
>> >      Mounted lock proto = "lock_dlm"
>> >      Mounted lock table = "H-N:gfs01"
>> >      Mounted host data = "jid=0:id=196610:first=1"
>> >      Journal number = 0
>> >      Lock module flags = 0
>> >      Local flocks = TRUE
>> >      Local caching = FALSE
>> >      Oopses OK = FALSE
>> >  
>> >  it says 'Mounted lock proto = "lock_dlm" ' because that is what I did.
>> >  So why is it using "local flocks"?
>> >  
>>      
> I don't know. What does it say in /proc/mounts? (or what was your mount
> command line?)
>
> Steve.
>    


-- 
Aquest missatge ha estat analitzat per MailScanner
a la cerca de virus i d'altres continguts perillosos,
i es considera que est? net.
For all your IT requirements visit: http://www.transtec.co.uk



From pasik at iki.fi  Thu Jan  7 08:34:29 2010
From: pasik at iki.fi (Pasi =?iso-8859-1?Q?K=E4rkk=E4inen?=)
Date: Thu, 7 Jan 2010 10:34:29 +0200
Subject: [Linux-cluster] GFS and performace
In-Reply-To: <4B45273E.8050807@bobich.net>
References: <8b711df41001061547h757f57fdi23fed852b5ac536a@mail.gmail.com>
	<4B45273E.8050807@bobich.net>
Message-ID: <20100107083429.GO25902@reaktio.net>

On Thu, Jan 07, 2010 at 12:13:50AM +0000, Gordan Bobic wrote:
> Paras pradhan wrote:
> >I have a GFS based shared storage cluster that connects to SAN by fibre 
> >channel. This GFS shared storage hold several virtual machines. While 
> >running hdparam from the host to a GFS share, I get following results.
> >
> >--
> >hdparm -t /guest_vms1
> >
> >/dev/mapper/test_vg1-prd_vg1_lv:
> >Timing buffered disk reads:  262 MB in  3.00 seconds =  87.24 MB/sec
> >---
> >
> >
> >Now from within the virtual machines, the I/O is low
> >
> >---
> >hdparm -t /dev/mapper/VolGroup00-LogVol00 
> >
> >/dev/mapper/VolGroup00-LogVol00:
> > Timing buffered disk reads:   88 MB in  3.00 seconds =  29.31 MB/sec
> >---
> >
> >I am looking for possibilities if I can increase my I/O read write 
> >within my virtual machines. Tuning GFS does help in this case?
> >
> >Sorry if my question is not relevant to this list
> 
> I suspect you'll find that is pretty normal for virtualization-induced 
> I/O penalty. Virtualization really, trully, utterly sucks when it comes 
> to I/O performance.
> 
> My I/O performance tests (done using kernel building) show that the 
> bottleneck was always disk I/O (including when the entire kernel source 
> tree is pre-cached, with a 2GB of RAM guest. The _least_ horribly 
> performing virtualization solution was VMware (tested with latest player 
> 3.0, but verified against the latest server, too). That managed to 
> complete the task in "only" 140% of the time the bare metal machine did 
> (the host machine had it's memory limited to 2GB with the mem= kernel 
> option to make sure the test was fair). So, 40% slower than bare metal.
> 
> Paravirtualized Xen was close behind, followed very closely by 
> non-paravirtualized KVM (which was actually slower when paravirtualized 
> drivers were used!). VirtualBox came so far behind it's not even worth 
> mentioning.
> 

What, you're saying VMware Server (and player) were faster than Xen PV?

I have hard time believing that.. based on my own experiences.

-- Pasi



From gordan at bobich.net  Thu Jan  7 09:24:22 2010
From: gordan at bobich.net (Gordan Bobic)
Date: Thu, 07 Jan 2010 09:24:22 +0000
Subject: [Linux-cluster] GFS and performace
In-Reply-To: <20100107083429.GO25902@reaktio.net>
References: <8b711df41001061547h757f57fdi23fed852b5ac536a@mail.gmail.com>	<4B45273E.8050807@bobich.net>
	<20100107083429.GO25902@reaktio.net>
Message-ID: <4B45A846.1090909@bobich.net>

Pasi K?rkk?inen wrote:
> On Thu, Jan 07, 2010 at 12:13:50AM +0000, Gordan Bobic wrote:
>> Paras pradhan wrote:
>>> I have a GFS based shared storage cluster that connects to SAN by fibre 
>>> channel. This GFS shared storage hold several virtual machines. While 
>>> running hdparam from the host to a GFS share, I get following results.
>>>
>>> --
>>> hdparm -t /guest_vms1
>>>
>>> /dev/mapper/test_vg1-prd_vg1_lv:
>>> Timing buffered disk reads:  262 MB in  3.00 seconds =  87.24 MB/sec
>>> ---
>>>
>>>
>>> Now from within the virtual machines, the I/O is low
>>>
>>> ---
>>> hdparm -t /dev/mapper/VolGroup00-LogVol00 
>>>
>>> /dev/mapper/VolGroup00-LogVol00:
>>> Timing buffered disk reads:   88 MB in  3.00 seconds =  29.31 MB/sec
>>> ---
>>>
>>> I am looking for possibilities if I can increase my I/O read write 
>>> within my virtual machines. Tuning GFS does help in this case?
>>>
>>> Sorry if my question is not relevant to this list
>> I suspect you'll find that is pretty normal for virtualization-induced 
>> I/O penalty. Virtualization really, trully, utterly sucks when it comes 
>> to I/O performance.
>>
>> My I/O performance tests (done using kernel building) show that the 
>> bottleneck was always disk I/O (including when the entire kernel source 
>> tree is pre-cached, with a 2GB of RAM guest. The _least_ horribly 
>> performing virtualization solution was VMware (tested with latest player 
>> 3.0, but verified against the latest server, too). That managed to 
>> complete the task in "only" 140% of the time the bare metal machine did 
>> (the host machine had it's memory limited to 2GB with the mem= kernel 
>> option to make sure the test was fair). So, 40% slower than bare metal.
>>
>> Paravirtualized Xen was close behind, followed very closely by 
>> non-paravirtualized KVM (which was actually slower when paravirtualized 
>> drivers were used!). VirtualBox came so far behind it's not even worth 
>> mentioning.
>>
> 
> What, you're saying VMware Server (and player) were faster than Xen PV?
> 
> I have hard time believing that.. based on my own experiences.

Yes, that is exactly what I'm saying. But the best performing 
virtualization solution (vmware) still had a 40% performance penalty in 
disk I/O compared to bare metal. But regardless of which one is least 
slow, they are all so slow as to only be worth considering if you are 
doing anything other than consolidating idle machines. The VM may feel 
faster in terms of boot times and such-like (the second time around when 
all the data is cached in the host's RAM), but it's all smoke and 
mirrors and doesn't stand up to scrutiny.

The only virtualization solutions that deliver on the sort of 
performance claims the big vendors are making are the likes of OpenVZ 
and VServers, but those are mostly just chroots, more like FreeBSD jails 
or Solaris zones with a bit of network interface virtualization thrown 
in than what proper virtualization.

If you don't believe me, try it yourself. Do a full kernel build with 
the stock RH .config file with make -j 8 on a quad core box with the 2GB 
of RAM VM and then on the bare metal box limited to 2GB with mem= kernel 
boot parameter and see how long it takes. I make it 6.5 minutes on bare 
metal on my 3.2GHz Core2 vs about 9.5 minutes in a VM on the same 
machine (vmware, paravirtualized xen and KVM come reasonably close 
together). Each was tested multiple times, and the results were holding 
consistent.

Gordan



From brem.belguebli at gmail.com  Thu Jan  7 11:30:50 2010
From: brem.belguebli at gmail.com (brem belguebli)
Date: Thu, 7 Jan 2010 12:30:50 +0100
Subject: [Linux-cluster] qdisk max_error_cycles setting
In-Reply-To: <29ae894c0912300738g16c2a808u6aa1afe38270cce3@mail.gmail.com>
References: <29ae894c0912300738g16c2a808u6aa1afe38270cce3@mail.gmail.com>
Message-ID: <29ae894c1001070330g195ba815o360d3057e8b163d9@mail.gmail.com>

Hi All,

Any idea about that ?

Regards

2009/12/30 brem belguebli <brem.belguebli at gmail.com>:
> Hi,
>
> It looks like the quorumd max_error_cycles parameter it not taken into account.
>
> Here's the test I'm doing:
>
> A 3 nodes cluster (RHEL 5.4) with a iscsi qdisk lun from a RHEL 5.4
> target server.
>
> All 3 cluster nodes have the following cqdisk configuration:
>
> <quorumd device="/dev/iscsi/storage.quorum" interval="1"
> log_facility="local5" log_level="7" tko="10" votes="1"
> max_error_cycles="10">
>
> When I block access from the 3 nodes to the target server (iptables
> rule that prevents all ip flows from the 3 nodes to the target
> server), I see the Quorum disk go offline but qdisk never gets stopped
> and keeps on retrying the qdisk device despite the fact that I
> instructed it to abort after 10 cycles (max_error_cycles=10).
>
> Am I misunderstanding the max_error_cycles definition in the qdisk man page ?
>
> Regards
>
> PS: As consequence of not being killed after this max-error_cycles,
> qdisk ?keeps on growing (memory usage virtual size) and if the
> situation lasts too long OOM killer gets involved.....
>



From swhiteho at redhat.com  Thu Jan  7 11:54:35 2010
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Thu, 07 Jan 2010 11:54:35 +0000
Subject: [Linux-cluster] lock_dlm but local flocks = true?
In-Reply-To: <4B459089.5030807@si.ct.upc.edu>
References: <mailman.4029.1261990315.8260.linux-cluster@redhat.com>
	<4B459089.5030807@si.ct.upc.edu>
Message-ID: <1262865275.2240.4.camel@localhost>

Hi,

On Thu, 2010-01-07 at 08:43 +0100, frank wrote:
> Hi Steve, I have not answered before because I was on holidays. By the 
> way, happy new year.
> 
> I have looked /proc/mounts as you told me, and ... surprise:
> 
> /dev/mapper/volCluster-lvol0 /mnt/gfs gfs 
> rw,hostdata=jid=0:id=196610:first=1,localflocks 0 0
> 
> "localflocks" is there! I don't understand because I mount it using 
> "/etc/init.d/gfs start" which looks at /etc/fstab, and there the line is:
> 
> /dev/volCluster/lvol0    /mnt/gfs     gfs    defaults  0 0
> 
> I must admit that there is a particular thing in this system which I 
> thought it didn't affect, but I am not so sure now, and that is it is a 
> OpenVZ patched kernel. Can this have something to do with gfs mounts?
> 
> Thanks for your help once more.
> 
That does seem strange. You could try stracing the mount command when
its run and that might show you the source of the localflocks flag,

Steve.




From hicheerup at gmail.com  Fri Jan  8 05:53:10 2010
From: hicheerup at gmail.com (linux-crazy)
Date: Fri, 8 Jan 2010 11:23:10 +0530
Subject: [Linux-cluster] Will a service fail cause a node to fence?
In-Reply-To: <4B44D059.6030801@pandora.com>
References: <4B44D059.6030801@pandora.com>
Message-ID: <29e045b81001072153q7ecfb70aw51d17e8b9f470ad2@mail.gmail.com>

Hi,


 AFAIK Rgmanager  tries to restart the failed service 6 times(not
sure), if it fails to restart on the that node , then it will try to
relocate the service to another node.



On Wed, Jan 6, 2010 at 11:33 PM, Christopher Strider Cook
<ccook at pandora.com> wrote:
> If a cluster looses communication with a node then fencing will take place,
> but if a service fails and fails to stop/exit cleanly so another node can
> take over, will a fencing operation take place?
>
> Cluster 3, corosync 1, rgmanager 3
>
>
> Thanks,
> Chris
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



From gianluca.cecchi at gmail.com  Fri Jan  8 12:03:38 2010
From: gianluca.cecchi at gmail.com (Gianluca Cecchi)
Date: Fri, 8 Jan 2010 13:03:38 +0100
Subject: [Linux-cluster] suggestion on freeze-on-node1 and unfreeze-on-node2
	approach?
Message-ID: <561c252c1001080403r2052be58qec9c05cf9d8114a5@mail.gmail.com>

Hello, I have a cluster with an Oracle service and rhel 5.4 nodes.
Tipically one sets the "shutdown abort" of the DB as the default
mechanism to close the service, to prevent stalling and accelerate
switch of service itself in case of problems.
The same approach is indeed used by the rhcs provided script, that I'm using.

But sometimes we have to do maintenance on DB and use the strategy to
freeze the service, manually stop DB, make modifications, manually
start DB and unfreeze the service.

This is useful when all the work is done on the same node carrying the
service at that moment.
Sometimes we need activities where we want to relocate the service too.
And for the DBAs is desirable to clean shutdown the DB when there is a
planned activity in place.
With the same approach we do something like this:

node1 with active service
- freeze of the service:
clusvcadm -Z SRV

- maintenance activities with manual stop of service components (eg
listener and Oracle instance)

- shutdown of node1
shutdown -h now

The shutdown takes about 2 minutes
it is necessary to do a shutdown, because any command I tried, gave
the error that the service was frozen and that I cannot run that
command...

- Wait on the survival node that:
1) it becomes master for the quorum disk  otherwise it looses quorum
Messagges in /var/log/qdiskd.log
Jan  7 17:57:55 oracs1 qdiskd[7043]: <info> Node 2 shutdown
Jan  7 17:57:55 oracs1 qdiskd[7043]: <debug> Making bid for master
Jan  7 17:58:30 oracs1 qdiskd[7043]: <info> Assuming master role

it takes about 1 minute, after shutdown of the other one

2) the cluster registers that the other node has gone
Messages in /var/log/qdiskd.log
Jan  7 18:00:35 oracs1 openais[7014]: [TOTEM] The token was lost in
the OPERATIONAL state.
Jan  7 18:00:35 oracs1 openais[7014]: [TOTEM] Receive multicast socket
recv buffer size (320000 bytes).
Jan  7 18:00:35 oracs1 openais[7014]: [TOTEM] Transmit multicast
socket send buffer size (320000 bytes).
Jan  7 18:00:35 oracs1 openais[7014]: [TOTEM] entering GATHER state from 2.
Jan  7 18:00:40 oracs1 openais[7014]: [TOTEM] entering GATHER state from 0.
Jan  7 18:00:40 oracs1 openais[7014]: [TOTEM] Creating commit token
because I am the rep.
Jan  7 18:00:40 oracs1 openais[7014]: [TOTEM] Saving state aru 24 high
seq received 24
Jan  7 18:00:40 oracs1 openais[7014]: [TOTEM] Storing new sequence id
for ring 4da34
Jan  7 18:00:40 oracs1 openais[7014]: [TOTEM] entering COMMIT state.
Jan  7 18:00:40 oracs1 openais[7014]: [TOTEM] entering RECOVERY state.
Jan  7 18:00:40 oracs1 openais[7014]: [TOTEM] position [0] member 192.168.16.1:
Jan  7 18:00:40 oracs1 openais[7014]: [TOTEM] previous ring seq 318000
rep 192.168.16.1
Jan  7 18:00:40 oracs1 openais[7014]: [TOTEM] aru 24 high delivered 24
received flag 1
Jan  7 18:00:40 oracs1 openais[7014]: [TOTEM] Did not need to
originate any messages in recovery.
Jan  7 18:00:40 oracs1 openais[7014]: [TOTEM] Sending initial ORF token
Jan  7 18:00:40 oracs1 openais[7014]: [CLM  ] CLM CONFIGURATION CHANGE
Jan  7 18:00:40 oracs1 openais[7014]: [CLM  ] New Configuration:
Jan  7 18:00:40 oracs1 openais[7014]: [CLM  ]   r(0) ip(192.168.16.1)
Jan  7 18:00:40 oracs1 openais[7014]: [CLM  ] Members Left:
Jan  7 18:00:40 oracs1 openais[7014]: [CLM  ]   r(0) ip(192.168.16.8)
Jan  7 18:00:40 oracs1 openais[7014]: [CLM  ] Members Joined:
Jan  7 18:00:40 oracs1 openais[7014]: [CLM  ] CLM CONFIGURATION CHANGE
Jan  7 18:00:40 oracs1 openais[7014]: [CLM  ] New Configuration:
Jan  7 18:00:40 oracs1 openais[7014]: [CLM  ]   r(0) ip(192.168.16.1)
Jan  7 18:00:41 oracs1 openais[7014]: [CLM  ] Members Left:
Jan  7 18:00:41 oracs1 openais[7014]: [CLM  ] Members Joined:
Jan  7 18:00:41 oracs1 openais[7014]: [SYNC ] This node is within the
primary component and will provide service.
Jan  7 18:00:41 oracs1 openais[7014]: [TOTEM] entering OPERATIONAL state.
Jan  7 18:00:41 oracs1 openais[7014]: [CLM  ] got nodejoin message 192.168.16.1
Jan  7 18:00:41 oracs1 openais[7014]: [CPG  ] got joinlist message from node 1

It takes about 2 minutes (also due to timeouts set up because of
qdisk, cman and multipath interactions needs)

Total of about 5 minutes. And after this we can work on node2:

- unfreeze of the service
clusvcadm -U SRV

This is not enough to have service start automatically.
clustat gives service as "started" on the other node and remains so.
Even if theoretically the node knows that the other one has left the
cluster...... sort of bug in my opinion....

- disable of the service
clusvcadm -d SRV

- enable of the service
clusvcadm -e SRV

At this time the service suddenly starts as there is only one node
alive and it is not necessary to specify the "-m " switch

After a few minutes we can restart the node1 that will join the
cluster again without problems:

Messages in /var/log/qdiskd.log of the node2
Jan  7 18:12:50 oracs1 openais[7014]: [TOTEM] entering GATHER state from 11.
Jan  7 18:12:50 oracs1 openais[7014]: [TOTEM] Creating commit token
because I am the rep.
Jan  7 18:12:50 oracs1 openais[7014]: [TOTEM] Saving state aru 1c high
seq received 1c
Jan  7 18:12:50 oracs1 openais[7014]: [TOTEM] Storing new sequence id
for ring 4da38
Jan  7 18:12:50 oracs1 openais[7014]: [TOTEM] entering COMMIT state.
Jan  7 18:12:50 oracs1 openais[7014]: [TOTEM] entering RECOVERY state.
Jan  7 18:12:50 oracs1 openais[7014]: [TOTEM] position [0] member 192.168.16.1:
Jan  7 18:12:50 oracs1 openais[7014]: [TOTEM] previous ring seq 318004
rep 192.168.16.1
Jan  7 18:12:50 oracs1 openais[7014]: [TOTEM] aru 1c high delivered 1c
received flag 1
Jan  7 18:12:50 oracs1 openais[7014]: [TOTEM] position [1] member 192.168.16.8:
Jan  7 18:12:50 oracs1 openais[7014]: [TOTEM] previous ring seq 318004
rep 192.168.16.8
Jan  7 18:12:50 oracs1 openais[7014]: [TOTEM] aru a high delivered a
received flag 1
Jan  7 18:12:50 oracs1 openais[7014]: [TOTEM] Did not need to
originate any messages in recovery.
Jan  7 18:12:50 oracs1 openais[7014]: [TOTEM] Sending initial ORF token
Jan  7 18:12:50 oracs1 openais[7014]: [CLM  ] CLM CONFIGURATION CHANGE
Jan  7 18:12:50 oracs1 openais[7014]: [CLM  ] New Configuration:
Jan  7 18:12:50 oracs1 openais[7014]: [CLM  ]   r(0) ip(192.168.16.1)
Jan  7 18:12:50 oracs1 openais[7014]: [CLM  ] Members Left:
Jan  7 18:12:50 oracs1 openais[7014]: [CLM  ] Members Joined:
Jan  7 18:12:50 oracs1 openais[7014]: [CLM  ] CLM CONFIGURATION CHANGE
Jan  7 18:12:51 oracs1 openais[7014]: [CLM  ] New Configuration:
Jan  7 18:12:51 oracs1 openais[7014]: [CLM  ]   r(0) ip(192.168.16.1)
Jan  7 18:12:51 oracs1 openais[7014]: [CLM  ]   r(0) ip(192.168.16.8)
Jan  7 18:12:51 oracs1 openais[7014]: [CLM  ] Members Left:
Jan  7 18:12:51 oracs1 openais[7014]: [CLM  ] Members Joined:
Jan  7 18:12:51 oracs1 openais[7014]: [CLM  ]   r(0) ip(192.168.16.8)
Jan  7 18:12:51 oracs1 openais[7014]: [SYNC ] This node is within the
primary component and will provide service.
Jan  7 18:12:51 oracs1 openais[7014]: [TOTEM] entering OPERATIONAL state.
Jan  7 18:12:51 oracs1 openais[7014]: [CLM  ] got nodejoin message 192.168.16.1
Jan  7 18:12:51 oracs1 openais[7014]: [CLM  ] got nodejoin message 192.168.16.8
Jan  7 18:12:51 oracs1 openais[7014]: [CPG  ] got joinlist message from node 1
Jan  7 18:13:20 oracs1 qdiskd[7043]: <debug> Node 2 is UP

So the steps above let us clean switch the db with this limits:
1) it takes about 10-15 minutes to have the whole cluster up again
with both nodes active
2) we have to shutdown one node and in case of clusters with more than
only one service this could be a blocker at all of the approach
itself.

Any hints?

Thanks,
Gianluca



From lhh at redhat.com  Fri Jan  8 14:06:57 2010
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 08 Jan 2010 09:06:57 -0500
Subject: [Linux-cluster] suggestion on freeze-on-node1 and
 unfreeze-on-node2 approach?
In-Reply-To: <561c252c1001080403r2052be58qec9c05cf9d8114a5@mail.gmail.com>
References: <561c252c1001080403r2052be58qec9c05cf9d8114a5@mail.gmail.com>
Message-ID: <1262959617.7436.30.camel@localhost.localdomain>

On Fri, 2010-01-08 at 13:03 +0100, Gianluca Cecchi wrote:

> But sometimes we have to do maintenance on DB and use the strategy to
> freeze the service, manually stop DB, make modifications, manually
> start DB and unfreeze the service.

You could set 'recovery="relocate"', freeze the service, stop the
database cleanly, then unfreeze the service.

When rgmanager does a status check on the service, it will fail, issue a
stop (which will be a no-op for the database), then start it on the
other node.

It's kind of ... odd to do it that way, but it should work.

Alternatively, you could make oracledb.sh use clean shutdown of the
database, wait some period of time, then do a hard-shutdown.

In a failover case where the node fails, the recovery time will be no
different.

-- Lon




From cluster at xinet.it  Fri Jan  8 15:00:10 2010
From: cluster at xinet.it (cluster at xinet.it)
Date: Fri, 8 Jan 2010 16:00:10 +0100
Subject: [Linux-cluster] Rename scsi devices
Message-ID: <00d301ca9073$4609da10$d21d8e30$@it>

Hi all,

 

do someone know how i could rename an iscsi device from

 

/dev/sde to /dev/sdg?

 

I mean, I have an iscsi device from a storage with address 7:0:0:12 and I
see it in one host with the name /dev/sde while in the second host i see it
with name /dev/sdg.

 

I need to rename due to virtualization purposes.

 

Can someone help me?

 

Thanks all,

Francesco Gallo

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100108/af19e3ee/attachment.htm>

From gianluca.cecchi at gmail.com  Fri Jan  8 15:12:21 2010
From: gianluca.cecchi at gmail.com (Gianluca Cecchi)
Date: Fri, 8 Jan 2010 16:12:21 +0100
Subject: [Linux-cluster] suggestion on freeze-on-node1 and
	unfreeze-on-node2 approach?
In-Reply-To: <561c252c1001080403r2052be58qec9c05cf9d8114a5@mail.gmail.com>
References: <561c252c1001080403r2052be58qec9c05cf9d8114a5@mail.gmail.com>
Message-ID: <561c252c1001080712w18f9c941hcfc58917301c4165@mail.gmail.com>

On Fri, 08 Jan 2010 09:06:57 -0500 Lon Hohberger wrote:
>You could set 'recovery="relocate"', freeze the service, stop the
> database cleanly, then unfreeze the service.

Ah, thanks, it should work.
The only "limit" would be that any recovery action will imply
relocation, correct?
(Some problems here with Oracle license in theory, because they let
you pay only one license in a two node cluster only if total time
where DB runs on one of the two node is less than a small amount of
time....)

Re-reading the manual about rhel 5.4 cluster administration, it puts
another doubt in my mind....

Section D.4 Failure Recovery and Independent Subtrees:
"... if any of the scripts defined in this service fail, the normal course
of action is to restart (or relocate or disable, according to the
service recovery policy) the service..."

Does this mean that if my service definition is the one below:

                 <service autostart="1" name="ACSSRV" recovery="relocate">
                        <ip ref="10.4.5.123"/>
                        <fs ref="oradata"/>
                        <fs ref="orasave"/>
                        <fs ref="rdoffline"/>
                        <fs ref="appl"/>
                        <script ref="ACS"/>
                </service>

If I get a network problem and my vip goes down for more than 30
seconds (that should be default interval between checks), it will
cause a relocation of the whole service and not a try-restart of only
the vip, correct?

Without the recovery="relocate" option would this imply again that the
whole service would be restarted  in the running node in this network
problem scenario (so shutdown abort of DB, umount of the FS and
restart of all of them)?

If this is true, could the modification below prevent this (I don't
care about fs, because if one of them goes down, probably I have
problems impacting DB itself, so it is safer to stop it....) for the
general case

                 <service autostart="1" name="ACSSRV" >
                        <ip ref="10.4.5.123" __independent_subtree="1" />
                        <fs ref="oradata"/>
                        <fs ref="orasave"/>
                        <fs ref="rdoffline"/>
                        <fs ref="appl"/>
                        <script ref="ACS"/>
                </service>


And what about this one, does it make sense at all if I add the
recovery=relocate policy?

                 <service autostart="1" name="ACSSRV" recovery="relocate">
                        <ip ref="10.4.5.123" __independent_subtree="1" />
                        <fs ref="oradata"/>
                        <fs ref="orasave"/>
                        <fs ref="rdoffline"/>
                        <fs ref="appl"/>
                        <script ref="ACS"/>
                </service>

Thanks again,
Gianluca



From cthulhucalling at gmail.com  Fri Jan  8 15:22:24 2010
From: cthulhucalling at gmail.com (Ian Hayes)
Date: Fri, 8 Jan 2010 07:22:24 -0800
Subject: [Linux-cluster] Rename scsi devices
In-Reply-To: <00d301ca9073$4609da10$d21d8e30$@it>
References: <00d301ca9073$4609da10$d21d8e30$@it>
Message-ID: <36df569a1001080722g57322fb6l2f1ff813dd2aa717@mail.gmail.com>

You could use a udev rule to symlink it to a new device name

On Jan 8, 2010 7:07 AM, <cluster at xinet.it> wrote:

 Hi all,



do someone know how i could rename an iscsi device from



/dev/sde to /dev/sdg?



I mean, I have an iscsi device from a storage with address 7:0:0:12 and I
see it in one host with the name /dev/sde while in the second host i see it
with name /dev/sdg.



I need to rename due to virtualization purposes.



Can someone help me?



Thanks all,

Francesco Gallo

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100108/c3ceff31/attachment.htm>

From berthiaume_wayne at emc.com  Fri Jan  8 15:35:34 2010
From: berthiaume_wayne at emc.com (berthiaume_wayne at emc.com)
Date: Fri, 8 Jan 2010 10:35:34 -0500
Subject: [Linux-cluster] Rename scsi devices
In-Reply-To: <36df569a1001080722g57322fb6l2f1ff813dd2aa717@mail.gmail.com>
References: <00d301ca9073$4609da10$d21d8e30$@it>
	<36df569a1001080722g57322fb6l2f1ff813dd2aa717@mail.gmail.com>
Message-ID: <8F08A56613D77044BD153E2AC5DA0F84036A044E@CORPUSMX40A.corp.emc.com>

/dev/sd names are not persistent across boots. You should use the
disk-by-id names. These persist across boots.

________________________________

From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ian Hayes
Sent: Friday, January 08, 2010 10:22 AM
To: linux clustering
Subject: Re: [Linux-cluster] Rename scsi devices



You could use a udev rule to symlink it to a new device name

	On Jan 8, 2010 7:07 AM, <cluster at xinet.it> wrote:
	
	

	Hi all,

	 

	do someone know how i could rename an iscsi device from

	 

	/dev/sde to /dev/sdg?

	 

	I mean, I have an iscsi device from a storage with address
7:0:0:12 and I see it in one host with the name /dev/sde while in the
second host i see it with name /dev/sdg.

	 

	I need to rename due to virtualization purposes.

	 

	Can someone help me?

	 

	Thanks all,

	Francesco Gallo


	--
	Linux-cluster mailing list
	Linux-cluster at redhat.com
	https://www.redhat.com/mailman/listinfo/linux-cluster
	

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100108/2ec28cda/attachment.htm>

From avi at myphonebook.co.in  Fri Jan  8 15:42:56 2010
From: avi at myphonebook.co.in (avi at myphonebook.co.in)
Date: Fri, 08 Jan 2010 21:12:56 +0530 (IST)
Subject: [Linux-cluster] Apache/Cluster issue -- Single public IP address
Message-ID: <1262965376.10914@myphonebook.co.in>

I am running Centos...hope that qualifies to post on this forum, if not plz ignore.

I want to set up a cluster that is used for web hosting (RHCS cluster). Recently, I experimented with a 2 node cluster and was able to run it successfully. However, I have started facing some issues with Apache, which I had configured as a service using the Cluster configuration tool. To do this Apache needs a floating IP address. This address is in addition to the network interface IP address. For example if my NIC IP is 192.168.1.1, then I need 192.168.1.2 as a floating IP address, and this has to be bound to apache. At least this is what I understood after several hours of experimentation. I also figured that if I configured 192.168.1.1 as a cluster resource and bound it to apache, the cluster manager failed to start apache and the NIC was put "off". ( In effect 192.168.1.1 was removed from the NIC ). 

However, the Apache service ran fine when bound to the ADDITIONAL IP i.e. 192.168.1.2 ( the floating IP ).

The above example was just a lab setup to test the cluster.

My problem is that I only have a single public IP address, to serve my web server and my question is this:

Is this the end of the road, for installing apache as a service with ONE public IP address??

I am willing to give further info, if required. 

From cluster at xinet.it  Fri Jan  8 16:45:46 2010
From: cluster at xinet.it (cluster at xinet.it)
Date: Fri, 8 Jan 2010 17:45:46 +0100
Subject: [Linux-cluster] R:  Rename scsi devices
In-Reply-To: <36df569a1001080722g57322fb6l2f1ff813dd2aa717@mail.gmail.com>
References: <00d301ca9073$4609da10$d21d8e30$@it>
	<36df569a1001080722g57322fb6l2f1ff813dd2aa717@mail.gmail.com>
Message-ID: <00ff01ca9082$067e5010$137af030$@it>

Thanks a lot for your answer!

 

Can you post some example for doing this?

 

My problem is that I have to refer to a right device in my VM configuration
for xen:

 

disk = [?phy:/dev/sdj,xvda,w?]

 

for live migration purposes the reference to disk must be the same in all
the physical hosts, so I must be shore that sdj points to the right ISCSI
LUN.

 

Thanks for any help!

 

Regards,

Francesco Gallo

 

Da: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] Per conto di Ian Hayes
Inviato: venerd? 8 gennaio 2010 16:22
A: linux clustering
Oggetto: Re: [Linux-cluster] Rename scsi devices

 

You could use a udev rule to symlink it to a new device name

On Jan 8, 2010 7:07 AM, <cluster at xinet.it> wrote:

Hi all,

 

do someone know how i could rename an iscsi device from

 

/dev/sde to /dev/sdg?

 

I mean, I have an iscsi device from a storage with address 7:0:0:12 and I
see it in one host with the name /dev/sde while in the second host i see it
with name /dev/sdg.

 

I need to rename due to virtualization purposes.

 

Can someone help me?

 

Thanks all,

Francesco Gallo


--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Nessun virus nel messaggio in arrivo.
Controllato da AVG - www.avg.com
Versione: 9.0.725 / Database dei virus: 270.14.129/2606 - Data di rilascio:
01/07/10 20:35:00

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100108/c9edd141/attachment.htm>

From lhh at redhat.com  Fri Jan  8 17:05:24 2010
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 08 Jan 2010 12:05:24 -0500
Subject: [Linux-cluster] suggestion on freeze-on-node1 and
 unfreeze-on-node2 approach?
In-Reply-To: <561c252c1001080712w18f9c941hcfc58917301c4165@mail.gmail.com>
References: <561c252c1001080403r2052be58qec9c05cf9d8114a5@mail.gmail.com>
	<561c252c1001080712w18f9c941hcfc58917301c4165@mail.gmail.com>
Message-ID: <1262970324.7436.34.camel@localhost.localdomain>

On Fri, 2010-01-08 at 16:12 +0100, Gianluca Cecchi wrote:

> If I get a network problem and my vip goes down for more than 30
> seconds (that should be default interval between checks), it will
> cause a relocation of the whole service and not a try-restart of only
> the vip, correct?

Yes.

> Without the recovery="relocate" option would this imply again that the
> whole service would be restarted  in the running node in this network
> problem scenario (so shutdown abort of DB, umount of the FS and
> restart of all of them)?

Yes.

> If this is true, could the modification below prevent this (I don't
> care about fs, because if one of them goes down, probably I have
> problems impacting DB itself, so it is safer to stop it....) for the
> general case
> 
>                  <service autostart="1" name="ACSSRV" >
>                         <ip ref="10.4.5.123" __independent_subtree="1" />
>                         <fs ref="oradata"/>
>                         <fs ref="orasave"/>
>                         <fs ref="rdoffline"/>
>                         <fs ref="appl"/>
>                         <script ref="ACS"/>
>                 </service>

Exactly.

> And what about this one, does it make sense at all if I add the
> recovery=relocate policy?
> 
>                  <service autostart="1" name="ACSSRV" recovery="relocate">
>                         <ip ref="10.4.5.123" __independent_subtree="1" />
>                         <fs ref="oradata"/>
>                         <fs ref="orasave"/>
>                         <fs ref="rdoffline"/>
>                         <fs ref="appl"/>
>                         <script ref="ACS"/>
>                 </service>

Independent subtree means that the IP address failure is not considered
a service failure unless restarting it fails.  (There's a RFE open to
effectively extend 'max_restarts' and 'restart_expire_time' to
individual resources, but it has not been implemented.)

So, if ip fails, just ip is restarted.

If oradata, orasave, ACS, appl, or rdoffline fail, the service is
relocated.

-- Lon




From lhh at redhat.com  Fri Jan  8 17:18:29 2010
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 08 Jan 2010 12:18:29 -0500
Subject: [Linux-cluster] Apache/Cluster issue -- Single public IP address
In-Reply-To: <1262965376.10914@myphonebook.co.in>
References: <1262965376.10914@myphonebook.co.in>
Message-ID: <1262971109.7436.46.camel@localhost.localdomain>

On Fri, 2010-01-08 at 21:12 +0530, avi at myphonebook.co.in wrote:
> My problem is that I only have a single public IP address, to serve my web server and my question is this:
> 
> Is this the end of the road, for installing apache as a service with ONE public IP address??

In order to manage the 'only possible' ip address on an interface or set
of interfaces, the ip.sh script would have to handle a whole lot more
than it was designed:

- device activation
- module loading
- routing configuration

This would it very complex.  Right now, it is very simple: use existing
routes and devices.

Possible solutions:

* Edit the ip.sh script to do what you want directly.

* Use a custom script to start apache & bring up your IP address rather
than using ip.sh to do it.

* Statically assign a couple of IP addresses and mask them using noarp
option or arptables.

-- Lon




From robert at exa-omicron.nl  Fri Jan  8 17:38:11 2010
From: robert at exa-omicron.nl (Robert Verspuy)
Date: Fri, 08 Jan 2010 18:38:11 +0100
Subject: [Linux-cluster] R:  Rename scsi devices
In-Reply-To: <00ff01ca9082$067e5010$137af030$@it>
References: <00d301ca9073$4609da10$d21d8e30$@it>	<36df569a1001080722g57322fb6l2f1ff813dd2aa717@mail.gmail.com>
	<00ff01ca9082$067e5010$137af030$@it>
Message-ID: <4B476D83.6090001@exa-omicron.nl>

See 
http://www.cyberciti.biz/tips/linux-assign-static-names-to-scsi-devices.html 
for an example of such an udev rule.
Then you can use the symlink in your xen config.

With kind regards,
Robert Verspuy

On 01/08/2010 05:45 PM, cluster at xinet.it wrote:
>
> Thanks a lot for your answer!
>
> Can you post some example for doing this?
>
> My problem is that I have to refer to a right device in my VM 
> configuration for xen:
>
> disk = ['phy:/dev/sdj,xvda,w']
>
> for live migration purposes the reference to disk must be the same in 
> all the physical hosts, so I must be shore that sdj points to the 
> right ISCSI LUN.
>
> Thanks for any help!
>
> Regards,
>
> Francesco Gallo
>
> *Da:* linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] *Per conto di *Ian Hayes
> *Inviato:* venerd? 8 gennaio 2010 16:22
> *A:* linux clustering
> *Oggetto:* Re: [Linux-cluster] Rename scsi devices
>
> You could use a udev rule to symlink it to a new device name
>
>     On Jan 8, 2010 7:07 AM, <cluster at xinet.it
>     <mailto:cluster at xinet.it>> wrote:
>
>     Hi all,
>
>     do someone know how i could rename an iscsi device from
>
>     /dev/sde to /dev/sdg?
>
>     I mean, I have an iscsi device from a storage with address
>     7:0:0:12 and I see it in one host with the name /dev/sde while in
>     the second host i see it with name /dev/sdg.
>
>     I need to rename due to virtualization purposes.
>
>     Can someone help me?
>
>     Thanks all,
>
>     Francesco Gallo
>
>
>     --
>     Linux-cluster mailing list
>     Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
>     https://www.redhat.com/mailman/listinfo/linux-cluster
>
> Nessun virus nel messaggio in arrivo.
> Controllato da AVG - www.avg.com
> Versione: 9.0.725 / Database dei virus: 270.14.129/2606 - Data di 
> rilascio: 01/07/10 20:35:00
>
> ------------------------------------------------------------------------
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


-- 
*Exa-Omicron*
Patroonsweg 10
3892 DB Zeewolde
Tel.: 088-OMICRON (66 427 66)
http://www.exa-omicron.nl
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100108/eb23d38a/attachment.htm>

From brem.belguebli at gmail.com  Fri Jan  8 17:53:23 2010
From: brem.belguebli at gmail.com (brem belguebli)
Date: Fri, 8 Jan 2010 18:53:23 +0100
Subject: [Linux-cluster] qdisk max_error_cycles setting
In-Reply-To: <29ae894c1001070330g195ba815o360d3057e8b163d9@mail.gmail.com>
References: <29ae894c0912300738g16c2a808u6aa1afe38270cce3@mail.gmail.com>
	<29ae894c1001070330g195ba815o360d3057e8b163d9@mail.gmail.com>
Message-ID: <29ae894c1001080953o27e08b62j899718e4ed38a34f@mail.gmail.com>

Hi,

May be you Lon have an idea about this...


Thanks in advance

Brem

2010/1/7 brem belguebli <brem.belguebli at gmail.com>:
> Hi All,
>
> Any idea about that ?
>
> Regards
>
> 2009/12/30 brem belguebli <brem.belguebli at gmail.com>:
>> Hi,
>>
>> It looks like the quorumd max_error_cycles parameter it not taken into account.
>>
>> Here's the test I'm doing:
>>
>> A 3 nodes cluster (RHEL 5.4) with a iscsi qdisk lun from a RHEL 5.4
>> target server.
>>
>> All 3 cluster nodes have the following cqdisk configuration:
>>
>> <quorumd device="/dev/iscsi/storage.quorum" interval="1"
>> log_facility="local5" log_level="7" tko="10" votes="1"
>> max_error_cycles="10">
>>
>> When I block access from the 3 nodes to the target server (iptables
>> rule that prevents all ip flows from the 3 nodes to the target
>> server), I see the Quorum disk go offline but qdisk never gets stopped
>> and keeps on retrying the qdisk device despite the fact that I
>> instructed it to abort after 10 cycles (max_error_cycles=10).
>>
>> Am I misunderstanding the max_error_cycles definition in the qdisk man page ?
>>
>> Regards
>>
>> PS: As consequence of not being killed after this max-error_cycles,
>> qdisk ?keeps on growing (memory usage virtual size) and if the
>> situation lasts too long OOM killer gets involved.....
>>
>



From frederic at ovsg.univ-ag.fr  Fri Jan  8 23:30:13 2010
From: frederic at ovsg.univ-ag.fr (frederic)
Date: Fri, 8 Jan 2010 18:30:13 -0500
Subject: [Linux-cluster] debian lenny qdisk
Message-ID: <201001081830.13876.frederic@ovsg.univ-ag.fr>

Hi,

I have a 4 node cluster running fine with redhat el5 as operating system. I 
added a debian lenny system to that cluster. It joins the cluster fine but i 
have some issues with

1- qdisk
 error = Specified device /dev/vgr10one/qdisk does match kernel's reported 
sector size (0 != -1)

2- rgmanager
	error = clurgmgrd[8247]: <err> #34: Cannot get status for service 
service:nfssanconf
	the error appears for all configured services

Please help me out,

Thanks a lot

redhat os:
	
	uname -r :
		2.6.18-92.1.6.el5xen
	yum info rgmanager:
		Name       : rgmanager
		Arch       : x86_64
		Version    : 2.0.52
		Release    : 1.el5
	yum info cman:
		Arch       : x86_64
		Version    : 2.0.115
		Release    : 1.el5


debian lenny:
	uname -r :
			2.6.26-2-xen-amd64
	apt-cache show rgmanager:
			Architecture: amd64
			Source: redhat-cluster
			Version: 2.20081102-1
	apt-cache show cman:
			Architecture: amd64
			Source: redhat-cluster
			Version: 2.20081102-1



From frederic at ovsg.univ-ag.fr  Fri Jan  8 23:41:57 2010
From: frederic at ovsg.univ-ag.fr (frederic)
Date: Fri, 8 Jan 2010 18:41:57 -0500
Subject: [Linux-cluster] debian lenny qdisk
In-Reply-To: <201001081830.13876.frederic@ovsg.univ-ag.fr>
References: <201001081830.13876.frederic@ovsg.univ-ag.fr>
Message-ID: <201001081841.57624.frederic@ovsg.univ-ag.fr>

First problem ( qdisk solved ) : i put label instead of device in 
cluster.conf. Sorry. 
Second problem remains: rgmanager

Thanks a lot

Le Friday 08 January 2010 18:30:13 frederic, vous avez ?crit?:
> Hi,
>
> I have a 4 node cluster running fine with redhat el5 as operating system. I
> added a debian lenny system to that cluster. It joins the cluster fine but
> i have some issues with
>
> 1- qdisk
>  error = Specified device /dev/vgr10one/qdisk does match kernel's reported
> sector size (0 != -1)
>
> 2- rgmanager
> 	error = clurgmgrd[8247]: <err> #34: Cannot get status for service
> service:nfssanconf
> 	the error appears for all configured services
>
> Please help me out,
>
> Thanks a lot
>
> redhat os:
>
> 	uname -r :
> 		2.6.18-92.1.6.el5xen
> 	yum info rgmanager:
> 		Name       : rgmanager
> 		Arch       : x86_64
> 		Version    : 2.0.52
> 		Release    : 1.el5
> 	yum info cman:
> 		Arch       : x86_64
> 		Version    : 2.0.115
> 		Release    : 1.el5
>
>
> debian lenny:
> 	uname -r :
> 			2.6.26-2-xen-amd64
> 	apt-cache show rgmanager:
> 			Architecture: amd64
> 			Source: redhat-cluster
> 			Version: 2.20081102-1
> 	apt-cache show cman:
> 			Architecture: amd64
> 			Source: redhat-cluster
> 			Version: 2.20081102-1





From broder at mit.edu  Fri Jan  8 22:58:22 2010
From: broder at mit.edu (Evan Broder)
Date: Fri, 8 Jan 2010 16:58:22 -0600
Subject: [Linux-cluster] Deadlock when using clvmd + OpenAIS + Corosync
Message-ID: <178868c41001081458h3c0b70denef831712883a6fec@mail.gmail.com>

[please preserve the CC when replying, thanks]

Hi -
   We're attempting to setup a clvm (2.02.56) cluster using OpenAIS
(1.1.1) and Corosync (1.1.2). We've gotten bitten hard in the past by
crashes leaving DLM state around and forcing us to reboot our nodes,
so we're specifically looking for a solution that doesn't involve
in-kernel locking.

We're also running the Pacemaker OpenAIS service, as we're hoping to
use it for management of some other resources going forward.

We've managed to form the OpenAIS cluster, and get clvmd running on
both of our nodes. Operations using LVM succeed, so long as only one
operation runs at a time. However, if we attempt to run two operations
(say, one lvcreate on each host) at a time, they both hang, and both
clvmd processes appear to deadlock.

When they deadlock, it doesn't appear to affect the other clustering
processes - both corosync and pacemaker still report a fully formed
cluster, so it seems the issue is localized to clvmd.

I've looked at logs from corosync and pacemaker, and I've straced
various processes, but I don't want to blast a bunch of useless
information at the list. What information can I provide to make it
easier to debug and fix this deadlock?

Thanks,
 - Evan



From pradhanparas at gmail.com  Fri Jan  8 23:12:17 2010
From: pradhanparas at gmail.com (Paras pradhan)
Date: Fri, 8 Jan 2010 17:12:17 -0600
Subject: [Linux-cluster] cluster timeout
Message-ID: <8b711df41001081512x5c356eb8p98b973673299fcc3@mail.gmail.com>

Hi,

I have a gfs2 share in my RHEL 5.4. gfs2 is created using device mapper for
multipathing using 2 ports qlogic HBA card.

Now when I pull off the cable from port 1 in HBA, it takes sometime (around
1+ minute) to failover to the 2nd port/link. . When the failover to the 2nd
link is complete, I can see cman is not running in this node. So, how do I
handle this situation?

I hope i explained properly :( .

Thanks
Paras.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100108/dd64c79b/attachment.htm>

From avi at myphonebook.co.in  Sat Jan  9 03:45:55 2010
From: avi at myphonebook.co.in (avi at myphonebook.co.in)
Date: Sat, 09 Jan 2010 09:15:55 +0530 (IST)
Subject: [Linux-cluster] Apache/Cluster issue -- Single public IP address
Message-ID: <1263008755.18900@myphonebook.co.in>

Lon Hohberger <lhh at redhat.com> wrote ..
> On Fri, 2010-01-08 at 21:12 +0530, avi at myphonebook.co.in wrote:
> > My problem is that I only have a single public IP address, to serve my web server
> and my question is this:
> > 
> > Is this the end of the road, for installing apache as a service with ONE public
> IP address??
> 
> In order to manage the 'only possible' ip address on an interface or set
> of interfaces, the ip.sh script would have to handle a whole lot more
> than it was designed:
> 
> - device activation
> - module loading
> - routing configuration
> 
> This would it very complex.  Right now, it is very simple: use existing
> routes and devices.
> 
> Possible solutions:
> 
> * Edit the ip.sh script to do what you want directly.
> 
> * Use a custom script to start apache & bring up your IP address rather
> than using ip.sh to do it.
> 
> * Statically assign a couple of IP addresses and mask them using noarp
> option or arptables.
> 
> -- Lon

Thanks for the suggestion, but I went in for another public IP from my ISP,
so this issue is resolved. ( lightened up my pocket a bit..but thats ok ;P )

From dhoffutt at gmail.com  Sun Jan 10 17:05:04 2010
From: dhoffutt at gmail.com (Dustin Henry Offutt)
Date: Mon, 11 Jan 2010 11:05:04 +1800
Subject: [Linux-cluster] Failover Domains and Voting for Quorum
Message-ID: <d740f8db1001100905p1e58c513i3614d87737cc1142@mail.gmail.com>

Hello Linux Clustering,

Have been searching for this for hours, perhaps someone here knows...

Please consider the following scenario:

Cluster considering of servers:
* node1
* node2
* node3
* node4

Failover domains:
* ip1 ordered and restricted failover domain consisting of node1, node4,
qdisk
* ip2 ordered and restricted failover domain consisting of node2, node3,
node4, qdisk

I can see how to set votes cluster wide, but how can one manipuate the
number-of-votes-for-quorum setting in a Failover Domain level?

Thank you very much,

Dusty
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100111/afa2f43d/attachment.htm>

From agx at sigxcpu.org  Sun Jan 10 19:35:12 2010
From: agx at sigxcpu.org (Guido =?iso-8859-1?Q?G=FCnther?=)
Date: Sun, 10 Jan 2010 20:35:12 +0100
Subject: [Linux-cluster] [PATCH] initscript: Do nothing if cman_tool isn't
	there
Message-ID: <20100110193512.GA30643@bogon.sigxcpu.org>

Hi,
attached patch makes sure the init script doesn't error out if the cman
package has been removed but not purged by Debian's dpkg - conf files
like init scripts are still around then. Patch is against 3.0.6.
Cheers,
 -- Guido

---
 cman/init.d/cman.in |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/cman/init.d/cman.in b/cman/init.d/cman.in
index f090064..0801105 100644
--- a/cman/init.d/cman.in
+++ b/cman/init.d/cman.in
@@ -19,6 +19,8 @@
 # set secure PATH
 PATH="/bin:/usr/bin:/sbin:/usr/sbin:@SBINDIR@"
 
+test -x @SBINDIR@/cman_tool || exit 0
+
 local_chkconfig()
 {
 	ls /etc/rc${2}.d/S*${3} > /dev/null 2>/dev/null
-- 



From alfredo.moralejo at roche.com  Mon Jan 11 08:35:36 2010
From: alfredo.moralejo at roche.com (Moralejo, Alfredo)
Date: Mon, 11 Jan 2010 09:35:36 +0100
Subject: [Linux-cluster] cluster timeout
In-Reply-To: <8b711df41001081512x5c356eb8p98b973673299fcc3@mail.gmail.com>
References: <8b711df41001081512x5c356eb8p98b973673299fcc3@mail.gmail.com>
Message-ID: <C64734E4E1C80E49955AD539DB2FBC3A1B450C28@rkamsem703.emea.roche.com>

This can help:

http://kbase.redhat.com/faq/docs/DOC-2882

Regards,

Alfredo

________________________________
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Paras pradhan
Sent: Saturday, January 09, 2010 12:12 AM
To: linux clustering
Subject: [Linux-cluster] cluster timeout

Hi,

I have a gfs2 share in my RHEL 5.4. gfs2 is created using device mapper for multipathing using 2 ports qlogic HBA card.

Now when I pull off the cable from port 1 in HBA, it takes sometime (around 1+ minute) to failover to the 2nd port/link. . When the failover to the 2nd link is complete, I can see cman is not running in this node. So, how do I handle this situation?

I hope i explained properly :( .

Thanks
Paras.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100111/4d1d3644/attachment.htm>

From ccaulfie at redhat.com  Mon Jan 11 09:03:46 2010
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Mon, 11 Jan 2010 09:03:46 +0000
Subject: [Linux-cluster] Deadlock when using clvmd + OpenAIS + Corosync
In-Reply-To: <178868c41001081458h3c0b70denef831712883a6fec@mail.gmail.com>
References: <178868c41001081458h3c0b70denef831712883a6fec@mail.gmail.com>
Message-ID: <4B4AE972.6040004@redhat.com>

On 08/01/10 22:58, Evan Broder wrote:
> [please preserve the CC when replying, thanks]
>
> Hi -
>     We're attempting to setup a clvm (2.02.56) cluster using OpenAIS
> (1.1.1) and Corosync (1.1.2). We've gotten bitten hard in the past by
> crashes leaving DLM state around and forcing us to reboot our nodes,
> so we're specifically looking for a solution that doesn't involve
> in-kernel locking.
>
> We're also running the Pacemaker OpenAIS service, as we're hoping to
> use it for management of some other resources going forward.
>
> We've managed to form the OpenAIS cluster, and get clvmd running on
> both of our nodes. Operations using LVM succeed, so long as only one
> operation runs at a time. However, if we attempt to run two operations
> (say, one lvcreate on each host) at a time, they both hang, and both
> clvmd processes appear to deadlock.
>
> When they deadlock, it doesn't appear to affect the other clustering
> processes - both corosync and pacemaker still report a fully formed
> cluster, so it seems the issue is localized to clvmd.
>
> I've looked at logs from corosync and pacemaker, and I've straced
> various processes, but I don't want to blast a bunch of useless
> information at the list. What information can I provide to make it
> easier to debug and fix this deadlock?
>

To start with, the best logging to produce is the clvmd logs which can 
be got with clvmd -d (see the man page for details). Ideally these 
should be from all nodes in the cluster so they can be correlated. If 
you're still using DLM then a dlm lock dump from all nodes is often 
helpful in conjunction with the clvmd logs.

Also, did you know it's possible to use clvmd without the DLM? The -I 
openais option will tell it to use the Lck service in userspace - though 
if there are DLM bugs I think we'd like to fix them if possible ;-)


Chrissie



From broder at mit.edu  Mon Jan 11 09:32:08 2010
From: broder at mit.edu (Evan Broder)
Date: Mon, 11 Jan 2010 04:32:08 -0500
Subject: [Linux-cluster] Deadlock when using clvmd + OpenAIS + Corosync
In-Reply-To: <4B4AE972.6040004@redhat.com>
References: <178868c41001081458h3c0b70denef831712883a6fec@mail.gmail.com>
	<4B4AE972.6040004@redhat.com>
Message-ID: <178868c41001110132s1350c7cep942148bbce62dfc0@mail.gmail.com>

On Mon, Jan 11, 2010 at 4:03 AM, Christine Caulfield
<ccaulfie at redhat.com> wrote:
> On 08/01/10 22:58, Evan Broder wrote:
>>
>> [please preserve the CC when replying, thanks]
>>
>> Hi -
>> ? ?We're attempting to setup a clvm (2.02.56) cluster using OpenAIS
>> (1.1.1) and Corosync (1.1.2). We've gotten bitten hard in the past by
>> crashes leaving DLM state around and forcing us to reboot our nodes,
>> so we're specifically looking for a solution that doesn't involve
>> in-kernel locking.
>>
>> We're also running the Pacemaker OpenAIS service, as we're hoping to
>> use it for management of some other resources going forward.
>>
>> We've managed to form the OpenAIS cluster, and get clvmd running on
>> both of our nodes. Operations using LVM succeed, so long as only one
>> operation runs at a time. However, if we attempt to run two operations
>> (say, one lvcreate on each host) at a time, they both hang, and both
>> clvmd processes appear to deadlock.
>>
>> When they deadlock, it doesn't appear to affect the other clustering
>> processes - both corosync and pacemaker still report a fully formed
>> cluster, so it seems the issue is localized to clvmd.
>>
>> I've looked at logs from corosync and pacemaker, and I've straced
>> various processes, but I don't want to blast a bunch of useless
>> information at the list. What information can I provide to make it
>> easier to debug and fix this deadlock?
>>
>
> To start with, the best logging to produce is the clvmd logs which can be
> got with clvmd -d (see the man page for details). Ideally these should be
> from all nodes in the cluster so they can be correlated. If you're still
> using DLM then a dlm lock dump from all nodes is often helpful in
> conjunction with the clvmd logs.

Sure, no problem. I've posted the logs from clvmd on both processes in
<http://web.mit.edu/broder/Public/clvmd/>. I've annotated them at a
few points with what I was doing - the annotations all start with "
>> ", so they should be easy to spot.

One interesting thing was the output from lvcreate when I SIGKILLed
the clvmd process:

root at black-mesa:~# lvcreate -L 1G -n broder-test-1 xenvg
  Error reading data from clvmd: Connection reset by peer
  Aborting. Failed to activate new LV to wipe the start of it.
  Error writing data to clvmd: Broken pipe
  Unable to deactivate failed new LV. Manual intervention required.
  Error writing data to clvmd: Broken pipe
  Internal error: Volume Group xenvg was not unlocked
  Device '/dev/sdb' has been left open.
  Device '/dev/sdc' has been left open.
  Device '/dev/sdc' has been left open.
  Device '/dev/sdb' has been left open.

I'm not entirely sure what to make of that, or how worried I should
be. After I killed clvmd and restarted corosync, broder-test-1 (the
first lvcreate I started) did seem to exist, although I didn't look
too hard at it.

> Also, did you know it's possible to use clvmd without the DLM? The -I
> openais option will tell it to use the Lck service in userspace - though if
> there are DLM bugs I think we'd like to fix them if possible ;-)

I suspect that the DLM bugs we ran into were all from using old
software - it just left a bit of a bitter taste in our mouth when we
were left with no way to recover from problems. We  built the clvmd
we're using now without support for any clustering infrastructure but
openais, so we should always be using the Lck service.

>
> Chrissie
>

Let us know if there's any other information we can provide.

- Evan



From ccaulfie at redhat.com  Mon Jan 11 09:38:04 2010
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Mon, 11 Jan 2010 09:38:04 +0000
Subject: [Linux-cluster] Deadlock when using clvmd + OpenAIS + Corosync
In-Reply-To: <178868c41001110132s1350c7cep942148bbce62dfc0@mail.gmail.com>
References: <178868c41001081458h3c0b70denef831712883a6fec@mail.gmail.com>	
	<4B4AE972.6040004@redhat.com>
	<178868c41001110132s1350c7cep942148bbce62dfc0@mail.gmail.com>
Message-ID: <4B4AF17C.4040306@redhat.com>

On 11/01/10 09:32, Evan Broder wrote:
> On Mon, Jan 11, 2010 at 4:03 AM, Christine Caulfield
> <ccaulfie at redhat.com>  wrote:
>> On 08/01/10 22:58, Evan Broder wrote:
>>>
>>> [please preserve the CC when replying, thanks]
>>>
>>> Hi -
>>>     We're attempting to setup a clvm (2.02.56) cluster using OpenAIS
>>> (1.1.1) and Corosync (1.1.2). We've gotten bitten hard in the past by
>>> crashes leaving DLM state around and forcing us to reboot our nodes,
>>> so we're specifically looking for a solution that doesn't involve
>>> in-kernel locking.
>>>
>>> We're also running the Pacemaker OpenAIS service, as we're hoping to
>>> use it for management of some other resources going forward.
>>>
>>> We've managed to form the OpenAIS cluster, and get clvmd running on
>>> both of our nodes. Operations using LVM succeed, so long as only one
>>> operation runs at a time. However, if we attempt to run two operations
>>> (say, one lvcreate on each host) at a time, they both hang, and both
>>> clvmd processes appear to deadlock.
>>>
>>> When they deadlock, it doesn't appear to affect the other clustering
>>> processes - both corosync and pacemaker still report a fully formed
>>> cluster, so it seems the issue is localized to clvmd.
>>>
>>> I've looked at logs from corosync and pacemaker, and I've straced
>>> various processes, but I don't want to blast a bunch of useless
>>> information at the list. What information can I provide to make it
>>> easier to debug and fix this deadlock?
>>>
>>
>> To start with, the best logging to produce is the clvmd logs which can be
>> got with clvmd -d (see the man page for details). Ideally these should be
>> from all nodes in the cluster so they can be correlated. If you're still
>> using DLM then a dlm lock dump from all nodes is often helpful in
>> conjunction with the clvmd logs.
>
> Sure, no problem. I've posted the logs from clvmd on both processes in
> <http://web.mit.edu/broder/Public/clvmd/>. I've annotated them at a
> few points with what I was doing - the annotations all start with "
>>> ", so they should be easy to spot.
>
> One interesting thing was the output from lvcreate when I SIGKILLed
> the clvmd process:
>
> root at black-mesa:~# lvcreate -L 1G -n broder-test-1 xenvg
>    Error reading data from clvmd: Connection reset by peer
>    Aborting. Failed to activate new LV to wipe the start of it.
>    Error writing data to clvmd: Broken pipe
>    Unable to deactivate failed new LV. Manual intervention required.
>    Error writing data to clvmd: Broken pipe
>    Internal error: Volume Group xenvg was not unlocked
>    Device '/dev/sdb' has been left open.
>    Device '/dev/sdc' has been left open.
>    Device '/dev/sdc' has been left open.
>    Device '/dev/sdb' has been left open.


That's perfectly normal for LVM if you killed clvmd with SIGKILL

> I'm not entirely sure what to make of that, or how worried I should
> be. After I killed clvmd and restarted corosync, broder-test-1 (the
> first lvcreate I started) did seem to exist, although I didn't look
> too hard at it.
>
>> Also, did you know it's possible to use clvmd without the DLM? The -I
>> openais option will tell it to use the Lck service in userspace - though if
>> there are DLM bugs I think we'd like to fix them if possible ;-)
>
> I suspect that the DLM bugs we ran into were all from using old
> software - it just left a bit of a bitter taste in our mouth when we
> were left with no way to recover from problems. We  built the clvmd
> we're using now without support for any clustering infrastructure but
> openais, so we should always be using the Lck service.
>

If you were using the DLM in RHEL4 (or equivalent) then that doesn't 
entirely surprise me, it was written and tested largely to support GFS 
and other applications were a bit of an afterthought, sadly. The new 
upstream DLM is a huge improvement.

I'll try and get some time to look at your logs, thanks. But it might 
not be today.

Chrissie



From dhoffutt at gmail.com  Mon Jan 11 14:15:38 2010
From: dhoffutt at gmail.com (Dustin Henry Offutt)
Date: Mon, 11 Jan 2010 08:15:38 -0600
Subject: [Linux-cluster] Failover Domains and Voting for Quorum
In-Reply-To: <d740f8db1001100905p1e58c513i3614d87737cc1142@mail.gmail.com>
References: <d740f8db1001100905p1e58c513i3614d87737cc1142@mail.gmail.com>
Message-ID: <d740f8db1001110615l7201c342ved09b58fe9a342ed@mail.gmail.com>

PS: Using the cluster suite version as released with RHEL5U4.

Link to example configuration: http://www.pastebin.org/74282

If a moment for trivia, what does a service/ip _independent_subtree being on
do?

Thank  you for any consideration.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100111/a91ffb4e/attachment.htm>

From fdinitto at redhat.com  Mon Jan 11 14:51:19 2010
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Mon, 11 Jan 2010 15:51:19 +0100
Subject: [Linux-cluster] [PATCH] initscript: Do nothing if cman_tool
 isn't there
In-Reply-To: <20100110193512.GA30643@bogon.sigxcpu.org>
References: <20100110193512.GA30643@bogon.sigxcpu.org>
Message-ID: <4B4B3AE7.8080001@redhat.com>

On 1/10/2010 8:35 PM, Guido G?nther wrote:
> Hi,
> attached patch makes sure the init script doesn't error out if the cman
> package has been removed but not purged by Debian's dpkg - conf files
> like init scripts are still around then. Patch is against 3.0.6.
> Cheers,
>  -- Guido
> 

Hi Guido,

thanks for the patch. I merged a slightly different one in the tree with
the same functionality.

Fabio



From sklemer at gmail.com  Mon Jan 11 19:54:23 2010
From: sklemer at gmail.com (=?UTF-8?B?16nXnNeV150g16fXnNee16g=?=)
Date: Mon, 11 Jan 2010 21:54:23 +0200
Subject: [Linux-cluster] RedHat Cluster suite 5.3 ( HALVM) fencing& service
	Q.
Message-ID: <2746211a1001111154l4cd47670i46838f6b8c0b034a@mail.gmail.com>

Hello !!!!


Hello !!!



I build Redhat cluster suite 5.3 with latest cman,openais,rgmanager rpms.

The hardware is dell r610 with  fence_drac5             .



Lvm is configured as HA-LVM.



The cluster is working properly , but customer asked for some features which
i

Hope someone can help me define them.



First :



   1. if there is a network fail on one system , Customer ask for 30 sec
   before fencing

      The failing system ,& if network is up again ? not to fence the system
at all.

( each time the network fail, the  fenced immediately. )



   1. define service which become online only on one system ( on startup ) ,
   but can failover the other cluster node .( no failback )
   2. can I define bonding with ( mode1 ?active/passive) with failback
   option ??



I attached the cluster.conf file





Thanks in advance



Shalom.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100111/61a360be/attachment.htm>
-------------- next part --------------
?ubierp2 /etc/cluster # cat cluster.conf
<?xml version="1.0"?>
<cluster alias="bicluster" config_version="70" name="bicluster">
        <quorumd device="/dev/mapper/EmcQdisk" interval="1" label="qdisk" min_score="1" tko="23" votes="1">
                <heuristic interval="2" program="ping -c1 -w1 140.4.0.1" score="1"/>
        </quorumd>
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="ubierp1" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device module_name="" name="ubierp1-drac5"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="ubierp2" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device module_name="" name="ubierp2-drac5"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="3" two_node="0"/>
        <fencedevices>
                <fencedevice agent="fence_drac5" cmd_prompt="admin1-" ipaddr="140.4.20.136" login="root" name="ubierp1-drac5" passwd="eds123"/>
                <fencedevice agent="fence_drac5" cmd_prompt="admin1-" ipaddr="140.4.20.137" login="root" name="ubierp2-drac5" passwd="eds123"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="etldom" nofailback="1" ordered="1" restricted="0">
                                <failoverdomainnode name="ubierp1" priority="1"/>
                                <failoverdomainnode name="ubierp2" priority="2"/>
                        </failoverdomain>
                        <failoverdomain name="dwhdom" nofailback="1" ordered="1" restricted="0">
                                <failoverdomainnode name="ubierp2" priority="1"/>
                                <failoverdomainnode name="ubierp1" priority="2"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <ip address="140.4.20.132" monitor_link="1"/>
                        <ip address="140.4.20.135" monitor_link="1"/>
                        <lvm lv_name="" name="etl-vg" vg_name="vgetl"/>
                        <lvm lv_name="" name="dwh-vg" vg_name="vgdwh"/>
                        <fs device="/dev/mapper/vgetl-etlinfo" force_fsck="0" force_unmount="1" fsid="34796" fstype="ext3" mountpoint="/etl/info" name="etlinfo" options="" self_fence="0"/>
                        <fs device="/dev/mapper/vgetl-etldata" force_fsck="0" force_unmount="1" fsid="17880" fstype="ext3" mountpoint="/etl/data" name="etldata" options="" self_fence="0"/>
                        <fs device="/dev/mapper/vgdwh-dwhoradat" force_fsck="0" force_unmount="1" fsid="19572" fstype="ext3" mountpoint="/dwh/data" name="dwhoradat" options="" self_fence="0"/>
                        <fs device="/dev/mapper/vgdwh-dwhias" force_fsck="0" force_unmount="1" fsid="34067" fstype="ext3" mountpoint="/dwh/ias" name="dwhias" options="" self_fence="0"/>
                        <fs device="/dev/mapper/vgdwh-dwhbiee" force_fsck="0" force_unmount="1" fsid="29551" fstype="ext3" mountpoint="/dwh/obiee" name="dwhbiee" options="" self_fence="0"/>
                </resources>
                <service autostart="1" domain="dwhdom" exclusive="1" name="dwhsrv" recovery="relocate">
                        <ip ref="140.4.20.135"/>
                        <lvm ref="dwh-vg"/>
                        <fs ref="dwhoradat"/>
                        <fs ref="dwhias"/>
                        <fs ref="dwhbiee"/>
                </service>
                <service autostart="1" domain="etldom" exclusive="1" name="etlsrv" recovery="relocate">
                        <ip ref="140.4.20.132"/>
                        <lvm ref="etl-vg"/>
                        <fs ref="etlinfo"/>
                        <fs ref="etldata"/>
                </service>
        </rm>
</cluster>

From ccaulfie at redhat.com  Tue Jan 12 08:54:41 2010
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Tue, 12 Jan 2010 08:54:41 +0000
Subject: [Linux-cluster] Deadlock when using clvmd + OpenAIS + Corosync
In-Reply-To: <4B4AF17C.4040306@redhat.com>
References: <178868c41001081458h3c0b70denef831712883a6fec@mail.gmail.com>		<4B4AE972.6040004@redhat.com>	<178868c41001110132s1350c7cep942148bbce62dfc0@mail.gmail.com>
	<4B4AF17C.4040306@redhat.com>
Message-ID: <4B4C38D1.6000608@redhat.com>

On 11/01/10 09:38, Christine Caulfield wrote:
> On 11/01/10 09:32, Evan Broder wrote:
>> On Mon, Jan 11, 2010 at 4:03 AM, Christine Caulfield
>> <ccaulfie at redhat.com> wrote:
>>> On 08/01/10 22:58, Evan Broder wrote:
>>>>
>>>> [please preserve the CC when replying, thanks]
>>>>
>>>> Hi -
>>>> We're attempting to setup a clvm (2.02.56) cluster using OpenAIS
>>>> (1.1.1) and Corosync (1.1.2). We've gotten bitten hard in the past by
>>>> crashes leaving DLM state around and forcing us to reboot our nodes,
>>>> so we're specifically looking for a solution that doesn't involve
>>>> in-kernel locking.
>>>>
>>>> We're also running the Pacemaker OpenAIS service, as we're hoping to
>>>> use it for management of some other resources going forward.
>>>>
>>>> We've managed to form the OpenAIS cluster, and get clvmd running on
>>>> both of our nodes. Operations using LVM succeed, so long as only one
>>>> operation runs at a time. However, if we attempt to run two operations
>>>> (say, one lvcreate on each host) at a time, they both hang, and both
>>>> clvmd processes appear to deadlock.
>>>>
>>>> When they deadlock, it doesn't appear to affect the other clustering
>>>> processes - both corosync and pacemaker still report a fully formed
>>>> cluster, so it seems the issue is localized to clvmd.
>>>>
>>>> I've looked at logs from corosync and pacemaker, and I've straced
>>>> various processes, but I don't want to blast a bunch of useless
>>>> information at the list. What information can I provide to make it
>>>> easier to debug and fix this deadlock?
>>>>
>>>
>>> To start with, the best logging to produce is the clvmd logs which
>>> can be
>>> got with clvmd -d (see the man page for details). Ideally these
>>> should be
>>> from all nodes in the cluster so they can be correlated. If you're still
>>> using DLM then a dlm lock dump from all nodes is often helpful in
>>> conjunction with the clvmd logs.
>>
>> Sure, no problem. I've posted the logs from clvmd on both processes in
>> <http://web.mit.edu/broder/Public/clvmd/>. I've annotated them at a
>> few points with what I was doing - the annotations all start with "
>>>> ", so they should be easy to spot.


Ironically it looks like a bug in the clvmd-openais code. I can 
reproduce it on my systems here. I don't see the problem when using the dlm!

Can you try -Icorosync and see if that helps? In the meantime I'll have 
a look at the openais bits to try and find out what is wrong.

Chrissie



From fdinitto at redhat.com  Tue Jan 12 11:16:26 2010
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Tue, 12 Jan 2010 12:16:26 +0100
Subject: [Linux-cluster] Cluster 3.0.7 stable release
Message-ID: <4B4C5A0A.6010203@redhat.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

The cluster team and its community are proud to announce the 3.0.7
stable release from the STABLE3 branch.

This release contains a few major bug fixes. We strongly recommend
people to update your clusters.

In order to build/run the 3.0.7 release you will need:

- - corosync 1.2.0
- - openais 1.1.1
- - linux kernel 2.6.31

The new source tarball can be downloaded here:

https://fedorahosted.org/releases/c/l/cluster/cluster-3.0.7.tar.gz
ftp://sources.redhat.com/pub/cluster/releases/cluster-3.0.7.tar.gz

To report bugs or issues:

   https://bugzilla.redhat.com/

Would you like to meet the cluster team or members of its community?

   Join us on IRC (irc.freenode.net #linux-cluster) and share your
   experience  with other sysadministrators or power users.

Thanks/congratulations to all people that contributed to achieve this
great milestone.

Happy clustering,
Fabio

Under the hood (from 3.0.6):

Abhijith Das (1):
      gfs2_quota: Bug 536902 - quota file size not a multiple of struct
gfs2_quota

Andrew Beekhof (1):
      gfs_controld.pcmk - Implement the setup_cluster and
fenced_node_info hooks

Bob Peterson (3):
      GFS2: gfs2_edit savemeta bugs
      gfs_fsck: Don't try to convert free_meta to free if -n was specified.
      gfs_fsck: Don't try to convert free inodes to free if -n was
specified.

Christine Caulfield (1):
      cman: Make cman quit if it has an old config when it starts up

David Teigland (3):
      group_tool: remove "groupd not running" (another)
      cluster.rng: fence_daemon updates
      man pages: fenced, fence_tool, fence_node, fence_ack_manual

Fabio M. Di Nitto (15):
      cman init: don't start cluster if NM is running or about to run
      cman_tool: make it the authoritative tool to enable a new
configuration
      cman init: add horrid workaround to some other issues
      gfs2: fix build failure
      fence lpar: fix quoting
      gfs: disable gfs kernel and userland tool build by default
      build: fix typo in configure
      build: fix Makefile filters for gfs
      build: fix fence man page stuff
      cman init: check for cman_tool before executing the script
      cman init: drop use of test in favour of shell bits
      cman init: add check for executable
      Revert "cman init: add horrid workaround to some other issues"
      fence agents: drop build date
      misc: update copyright year across the board

Lon Hohberger (14):
      rgmanager: Fix relocation & migration errors
      rgmanager: Add debug information to dumps
      rgmanager: Move prototypes in to correct header
      config: Remove data type="string"
      config: Fix build warning in rng2ldif
      rgmanager: Fix build warnings
      rgmanager: Fix bind mount handling in fs.sh
      fence-agents: Add missing state handling to fence_virsh
      rgmanager: Fix erroneous bind mount warning in fs.sh
      rgmanager: Make clusvcadm check msg_send return code
      rgmanager: Fix event generation with central_processing
      resource-agents: Add missing btrfs & ext4 support
      rgmanager: Allow non-root clustat
      rgmanager: Fix compiler warning

Marek 'marx' Grac (4):
      fencing: Change default timeout values for LPAR and IPMI
      fencing: Version information not shown
      fencing: Add support for identity files
      fencing: fence_rsa fails to login

Ryan O'Hara (3):
      Add error messages to fence_scsi as well as a few comments.
      Add debug messages to subroutines that perform sg_persist operations.
      Add support for metadata action.

Shane Bradley (1):
      rgmanager: Fix ipv6 handling

 Makefile                                        |    6 +-
 cman/cman_tool/main.c                           |   13 +-
 cman/daemon/ais.c                               |    3 +-
 cman/daemon/cman.h                              |    1 +
 cman/daemon/commands.c                          |   23 ++
 cman/init.d/cman.in                             |   49 ++++-
 config/tools/ldap/rng2ldif/tree.c               |    2 +-
 config/tools/xml/cluster.rng.in                 |   87 +++-----
 configure                                       |   20 +-
 doc/COPYRIGHT                                   |   58 +++---
 fence/agents/alom/fence_alom.py                 |    2 +-
 fence/agents/apc/fence_apc.py                   |    2 +-
 fence/agents/drac5/fence_drac5.py               |    2 +-
 fence/agents/ilo_mp/fence_ilo_mp.py             |    2 +-
 fence/agents/ipmilan/ipmilan.c                  |    2 +-
 fence/agents/lib/fencing.py.py                  |    7 +-
 fence/agents/lpar/fence_lpar.py                 |    9 +-
 fence/agents/rsa/fence_rsa.py                   |    5 +-
 fence/agents/scsi/fence_scsi.pl                 |  292
++++++++++++++++-------
 fence/agents/virsh/fence_virsh.py               |    2 +-
 fence/agents/vmware/fence_vmware.py             |    2 +-
 fence/agents/wti/fence_wti.py                   |    2 +-
 fence/man/Makefile                              |    2 +-
 fence/man/fence.8                               |   30 ---
 fence/man/fence_ack_manual.8                    |   55 +++--
 fence/man/fence_node.8                          |  133 +++++++++--
 fence/man/fence_tool.8                          |   70 ++++--
 fence/man/fenced.8                              |  297
+++++++++++++----------
 gfs/gfs_fsck/pass4.c                            |   70 +++---
 gfs/gfs_fsck/pass5.c                            |   10 +-
 gfs2/edit/hexedit.c                             |   30 ++-
 gfs2/edit/hexedit.h                             |    7 +-
 gfs2/edit/savemeta.c                            |  190 ++++++++-------
 gfs2/libgfs2/ondisk.c                           |    6 +-
 gfs2/quota/check.c                              |  128 ++++++-----
 gfs2/quota/gfs2_quota.h                         |    5 +-
 gfs2/quota/main.c                               |  267 +++++---------------
 group/gfs_controld/member_pcmk.c                |   41 +++-
 group/tool/main.c                               |    1 -
 make/copyright.cf                               |    2 +-
 make/defines.mk.input                           |    4 +-
 make/release.mk                                 |    2 +-
 rgmanager/include/event.h                       |    2 +
 rgmanager/include/fo_domain.h                   |    1 +
 rgmanager/include/message.h                     |    1 +
 rgmanager/include/resgroup.h                    |    5 +
 rgmanager/include/reslist.h                     |   11 +-
 rgmanager/src/clulib/msg_socket.c               |   17 ++-
 rgmanager/src/clulib/rg_strings.c               |    3 +
 rgmanager/src/daemons/Makefile                  |    2 +-
 rgmanager/src/daemons/event_config.c            |   31 ++-
 rgmanager/src/daemons/fo_domain.c               |   41 +++-
 rgmanager/src/daemons/groups.c                  |   24 ++
 rgmanager/src/daemons/main.c                    |   71 +++++-
 rgmanager/src/daemons/reslist.c                 |   69 ++++--
 rgmanager/src/daemons/resrules.c                |   97 +++++---
 rgmanager/src/daemons/restree.c                 |   57 +++--
 rgmanager/src/daemons/rg_forward.c              |    4 +-
 rgmanager/src/daemons/rg_state.c                |   14 +-
 rgmanager/src/daemons/service_op.c              |    2 +-
 rgmanager/src/daemons/slang_event.c             |    2 +-
 rgmanager/src/daemons/test.c                    |   18 +-
 rgmanager/src/daemons/tests/test012.conf        |    2 +-
 rgmanager/src/resources/default_event_script.sl |    8 +
 rgmanager/src/resources/follow-service.sl       |    2 +-
 rgmanager/src/resources/fs.sh.in                |   28 ++-
 rgmanager/src/resources/ip.sh                   |   84 ++++++-
 rgmanager/src/resources/vm.sh                   |    6 +-
 rgmanager/src/utils/clustat.c                   |    2 +-
 rgmanager/src/utils/clusvcadm.c                 |   12 +-
 70 files changed, 1526 insertions(+), 1031 deletions(-)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAktMWgcACgkQhCzbekR3nhgU8wCdEVDvXiTMsvOytt0nYSQkYHpu
n2oAoJ/x+IcPwIhiI9+uoFvmU/Vv/3BU
=LxVE
-----END PGP SIGNATURE-----



From fdinitto at redhat.com  Tue Jan 12 11:18:17 2010
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Tue, 12 Jan 2010 12:18:17 +0100
Subject: [Linux-cluster] cluster 2 EOL
Message-ID: <4B4C5A79.5090804@redhat.com>

Hi all,

as previously announced, STABLE2 tree is now to be considered deprecated
and unsupported.

the STABLE2 branch in git has been marked read-only and no more
cluster-2 releases will take place.

Cheers
Fabio



From dirk.schulz at kinzesberg.de  Tue Jan 12 12:08:35 2010
From: dirk.schulz at kinzesberg.de (Dirk H. Schulz)
Date: Tue, 12 Jan 2010 13:08:35 +0100
Subject: [Linux-cluster] clusvcadm -M for VM inside service container
Message-ID: <4B4C6643.6010409@kinzesberg.de>

Hi folks,

I am new to this list and hope this has not been asked too often, but I 
did not find many posts on this problem on the web, and none with a 
solution.

I am running a cluster on CentOS 5.4 with VMs as services, defined like 
this:
<service autostart="1" exclusive="0" domain="VirtMach02" 
name="S-MasseCent5-20" recovery="relocate" hardrecovery="0" 
max_restarts="5">
            <vm name="MasseCent5-20" path="/etc/xen/config-san" 
migrate="live" use_virsh="0" hypervisor="xen" migration_mapping="...,..."/>
</service>

When I do a live migration using xm it works like a charm.
When I do a live migration using "clusvcadm -M S-MasseCent5-20 -m 
/targnode/" I get the following error:
> Trying to migrate service:S-MasseCent5-20 to targnode...Invalid 
> operation for resource
Since clusvcadm -M is restricted to be used with VMs only, I suspect 
that it does not work for VMs defined inside <service> containers, just 
for "naked" <vm> definitions.
Is that correct? Or is there some other reason?

Any hint or help is appreciated.

Dirk



From dirk.schulz at kinzesberg.de  Tue Jan 12 15:14:14 2010
From: dirk.schulz at kinzesberg.de (Dirk H. Schulz)
Date: Tue, 12 Jan 2010 16:14:14 +0100
Subject: [Linux-cluster] clusvcadm -M for VM inside service container
In-Reply-To: <4B4C6643.6010409@kinzesberg.de>
References: <4B4C6643.6010409@kinzesberg.de>
Message-ID: <4B4C91C6.4070006@kinzesberg.de>

Okay, I should have tested myself before asking.

It is as suspected below:
A <vm> defined inside a <service> definition can not be migrated live. 
It can be relocated, but that implies stopping and starting.
A <vm> defined "on it's own" can be migrated live (that is using 
clusvcadm -M).

I would have preferred to bundle up several vms to an entity that can be 
started, stopped and live migrated as one. That seems to be impossible, 
then.

Is that correct or do I overlook something?

Dirk

Dirk H. Schulz schrieb:
> Hi folks,
>
> I am new to this list and hope this has not been asked too often, but 
> I did not find many posts on this problem on the web, and none with a 
> solution.
>
> I am running a cluster on CentOS 5.4 with VMs as services, defined 
> like this:
> <service autostart="1" exclusive="0" domain="VirtMach02" 
> name="S-MasseCent5-20" recovery="relocate" hardrecovery="0" 
> max_restarts="5">
>            <vm name="MasseCent5-20" path="/etc/xen/config-san" 
> migrate="live" use_virsh="0" hypervisor="xen" 
> migration_mapping="...,..."/>
> </service>
>
> When I do a live migration using xm it works like a charm.
> When I do a live migration using "clusvcadm -M S-MasseCent5-20 -m 
> /targnode/" I get the following error:
>> Trying to migrate service:S-MasseCent5-20 to targnode...Invalid 
>> operation for resource
> Since clusvcadm -M is restricted to be used with VMs only, I suspect 
> that it does not work for VMs defined inside <service> containers, 
> just for "naked" <vm> definitions.
> Is that correct? Or is there some other reason?
>
> Any hint or help is appreciated.
>
> Dirk
>
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster




From broder at mit.edu  Tue Jan 12 16:21:14 2010
From: broder at mit.edu (Evan Broder)
Date: Tue, 12 Jan 2010 11:21:14 -0500
Subject: [Linux-cluster] Deadlock when using clvmd + OpenAIS + Corosync
In-Reply-To: <4B4C38D1.6000608@redhat.com>
References: <178868c41001081458h3c0b70denef831712883a6fec@mail.gmail.com>
	<4B4AE972.6040004@redhat.com>
	<178868c41001110132s1350c7cep942148bbce62dfc0@mail.gmail.com>
	<4B4AF17C.4040306@redhat.com> <4B4C38D1.6000608@redhat.com>
Message-ID: <178868c41001120821g6885c03cr75a3d42eb0196428@mail.gmail.com>

On Tue, Jan 12, 2010 at 3:54 AM, Christine Caulfield
<ccaulfie at redhat.com> wrote:
> On 11/01/10 09:38, Christine Caulfield wrote:
>>
>> On 11/01/10 09:32, Evan Broder wrote:
>>>
>>> On Mon, Jan 11, 2010 at 4:03 AM, Christine Caulfield
>>> <ccaulfie at redhat.com> wrote:
>>>>
>>>> On 08/01/10 22:58, Evan Broder wrote:
>>>>>
>>>>> [please preserve the CC when replying, thanks]
>>>>>
>>>>> Hi -
>>>>> We're attempting to setup a clvm (2.02.56) cluster using OpenAIS
>>>>> (1.1.1) and Corosync (1.1.2). We've gotten bitten hard in the past by
>>>>> crashes leaving DLM state around and forcing us to reboot our nodes,
>>>>> so we're specifically looking for a solution that doesn't involve
>>>>> in-kernel locking.
>>>>>
>>>>> We're also running the Pacemaker OpenAIS service, as we're hoping to
>>>>> use it for management of some other resources going forward.
>>>>>
>>>>> We've managed to form the OpenAIS cluster, and get clvmd running on
>>>>> both of our nodes. Operations using LVM succeed, so long as only one
>>>>> operation runs at a time. However, if we attempt to run two operations
>>>>> (say, one lvcreate on each host) at a time, they both hang, and both
>>>>> clvmd processes appear to deadlock.
>>>>>
>>>>> When they deadlock, it doesn't appear to affect the other clustering
>>>>> processes - both corosync and pacemaker still report a fully formed
>>>>> cluster, so it seems the issue is localized to clvmd.
>>>>>
>>>>> I've looked at logs from corosync and pacemaker, and I've straced
>>>>> various processes, but I don't want to blast a bunch of useless
>>>>> information at the list. What information can I provide to make it
>>>>> easier to debug and fix this deadlock?
>>>>>
>>>>
>>>> To start with, the best logging to produce is the clvmd logs which
>>>> can be
>>>> got with clvmd -d (see the man page for details). Ideally these
>>>> should be
>>>> from all nodes in the cluster so they can be correlated. If you're still
>>>> using DLM then a dlm lock dump from all nodes is often helpful in
>>>> conjunction with the clvmd logs.
>>>
>>> Sure, no problem. I've posted the logs from clvmd on both processes in
>>> <http://web.mit.edu/broder/Public/clvmd/>. I've annotated them at a
>>> few points with what I was doing - the annotations all start with "
>>>>>
>>>>> ", so they should be easy to spot.
>
>
> Ironically it looks like a bug in the clvmd-openais code. I can reproduce it
> on my systems here. I don't see the problem when using the dlm!
>
> Can you try -Icorosync and see if that helps? In the meantime I'll have a
> look at the openais bits to try and find out what is wrong.
>
> Chrissie
>

I'll see what we can pull together, but the nodes running the clvm
cluster are also Xen dom0's. They're currently running on (Ubuntu
Hardy's) 2.6.24, so upgrading them to something new enough to support
DLM 3 would be...challenging.

It would be much, much better for us if we could get clvmd-openais working.

Is there any chance this would work better if we dropped back to
openais whitetank instead of corosync + openais wilson?

- Evan



From ccaulfie at redhat.com  Tue Jan 12 16:32:59 2010
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Tue, 12 Jan 2010 16:32:59 +0000
Subject: [Linux-cluster] Deadlock when using clvmd + OpenAIS + Corosync
In-Reply-To: <178868c41001120821g6885c03cr75a3d42eb0196428@mail.gmail.com>
References: <178868c41001081458h3c0b70denef831712883a6fec@mail.gmail.com>	
	<4B4AE972.6040004@redhat.com>	
	<178868c41001110132s1350c7cep942148bbce62dfc0@mail.gmail.com>	
	<4B4AF17C.4040306@redhat.com> <4B4C38D1.6000608@redhat.com>
	<178868c41001120821g6885c03cr75a3d42eb0196428@mail.gmail.com>
Message-ID: <4B4CA43B.4000706@redhat.com>

On 12/01/10 16:21, Evan Broder wrote:
> On Tue, Jan 12, 2010 at 3:54 AM, Christine Caulfield
> <ccaulfie at redhat.com>  wrote:
>> On 11/01/10 09:38, Christine Caulfield wrote:
>>>
>>> On 11/01/10 09:32, Evan Broder wrote:
>>>>
>>>> On Mon, Jan 11, 2010 at 4:03 AM, Christine Caulfield
>>>> <ccaulfie at redhat.com>  wrote:
>>>>>
>>>>> On 08/01/10 22:58, Evan Broder wrote:
>>>>>>
>>>>>> [please preserve the CC when replying, thanks]
>>>>>>
>>>>>> Hi -
>>>>>> We're attempting to setup a clvm (2.02.56) cluster using OpenAIS
>>>>>> (1.1.1) and Corosync (1.1.2). We've gotten bitten hard in the past by
>>>>>> crashes leaving DLM state around and forcing us to reboot our nodes,
>>>>>> so we're specifically looking for a solution that doesn't involve
>>>>>> in-kernel locking.
>>>>>>
>>>>>> We're also running the Pacemaker OpenAIS service, as we're hoping to
>>>>>> use it for management of some other resources going forward.
>>>>>>
>>>>>> We've managed to form the OpenAIS cluster, and get clvmd running on
>>>>>> both of our nodes. Operations using LVM succeed, so long as only one
>>>>>> operation runs at a time. However, if we attempt to run two operations
>>>>>> (say, one lvcreate on each host) at a time, they both hang, and both
>>>>>> clvmd processes appear to deadlock.
>>>>>>
>>>>>> When they deadlock, it doesn't appear to affect the other clustering
>>>>>> processes - both corosync and pacemaker still report a fully formed
>>>>>> cluster, so it seems the issue is localized to clvmd.
>>>>>>
>>>>>> I've looked at logs from corosync and pacemaker, and I've straced
>>>>>> various processes, but I don't want to blast a bunch of useless
>>>>>> information at the list. What information can I provide to make it
>>>>>> easier to debug and fix this deadlock?
>>>>>>
>>>>>
>>>>> To start with, the best logging to produce is the clvmd logs which
>>>>> can be
>>>>> got with clvmd -d (see the man page for details). Ideally these
>>>>> should be
>>>>> from all nodes in the cluster so they can be correlated. If you're still
>>>>> using DLM then a dlm lock dump from all nodes is often helpful in
>>>>> conjunction with the clvmd logs.
>>>>
>>>> Sure, no problem. I've posted the logs from clvmd on both processes in
>>>> <http://web.mit.edu/broder/Public/clvmd/>. I've annotated them at a
>>>> few points with what I was doing - the annotations all start with "
>>>>>>
>>>>>> ", so they should be easy to spot.
>>
>>
>> Ironically it looks like a bug in the clvmd-openais code. I can reproduce it
>> on my systems here. I don't see the problem when using the dlm!
>>
>> Can you try -Icorosync and see if that helps? In the meantime I'll have a
>> look at the openais bits to try and find out what is wrong.
>>
>> Chrissie
>>
>
> I'll see what we can pull together, but the nodes running the clvm
> cluster are also Xen dom0's. They're currently running on (Ubuntu
> Hardy's) 2.6.24, so upgrading them to something new enough to support
> DLM 3 would be...challenging.
>
> It would be much, much better for us if we could get clvmd-openais working.
>
> Is there any chance this would work better if we dropped back to
> openais whitetank instead of corosync + openais wilson?
>


I don't think so. I think it's a clvmd flaw. If you want to downgrade 
something then that's the software I would suggest. Probably to 2.0.53 
or earlier but I can't be certain of that as I haven't tested it..

The clvmd locking operations have been changed a little over the last 
few releases, the DLM code has been updated but the openais code has 
not. I've been staring at it for a while now and I'm still not sure 
what's wrong. If I get anywhere I'll send you a fix to test...

Chrissie



From ccaulfie at redhat.com  Wed Jan 13 09:59:30 2010
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Wed, 13 Jan 2010 09:59:30 +0000
Subject: [Linux-cluster] Deadlock when using clvmd + OpenAIS + Corosync
In-Reply-To: <178868c41001120821g6885c03cr75a3d42eb0196428@mail.gmail.com>
References: <178868c41001081458h3c0b70denef831712883a6fec@mail.gmail.com>	
	<4B4AE972.6040004@redhat.com>	
	<178868c41001110132s1350c7cep942148bbce62dfc0@mail.gmail.com>	
	<4B4AF17C.4040306@redhat.com> <4B4C38D1.6000608@redhat.com>
	<178868c41001120821g6885c03cr75a3d42eb0196428@mail.gmail.com>
Message-ID: <4B4D9982.6060505@redhat.com>

On 12/01/10 16:21, Evan Broder wrote:
> On Tue, Jan 12, 2010 at 3:54 AM, Christine Caulfield
> <ccaulfie at redhat.com>  wrote:
>> On 11/01/10 09:38, Christine Caulfield wrote:
>>>
>>> On 11/01/10 09:32, Evan Broder wrote:
>>>>
>>>> On Mon, Jan 11, 2010 at 4:03 AM, Christine Caulfield
>>>> <ccaulfie at redhat.com>  wrote:
>>>>>
>>>>> On 08/01/10 22:58, Evan Broder wrote:
>>>>>>
>>>>>> [please preserve the CC when replying, thanks]
>>>>>>
>>>>>> Hi -
>>>>>> We're attempting to setup a clvm (2.02.56) cluster using OpenAIS
>>>>>> (1.1.1) and Corosync (1.1.2). We've gotten bitten hard in the past by
>>>>>> crashes leaving DLM state around and forcing us to reboot our nodes,
>>>>>> so we're specifically looking for a solution that doesn't involve
>>>>>> in-kernel locking.
>>>>>>
>>>>>> We're also running the Pacemaker OpenAIS service, as we're hoping to
>>>>>> use it for management of some other resources going forward.
>>>>>>
>>>>>> We've managed to form the OpenAIS cluster, and get clvmd running on
>>>>>> both of our nodes. Operations using LVM succeed, so long as only one
>>>>>> operation runs at a time. However, if we attempt to run two operations
>>>>>> (say, one lvcreate on each host) at a time, they both hang, and both
>>>>>> clvmd processes appear to deadlock.
>>>>>>
>>>>>> When they deadlock, it doesn't appear to affect the other clustering
>>>>>> processes - both corosync and pacemaker still report a fully formed
>>>>>> cluster, so it seems the issue is localized to clvmd.
>>>>>>
>>>>>> I've looked at logs from corosync and pacemaker, and I've straced
>>>>>> various processes, but I don't want to blast a bunch of useless
>>>>>> information at the list. What information can I provide to make it
>>>>>> easier to debug and fix this deadlock?
>>>>>>
>>>>>
>>>>> To start with, the best logging to produce is the clvmd logs which
>>>>> can be
>>>>> got with clvmd -d (see the man page for details). Ideally these
>>>>> should be
>>>>> from all nodes in the cluster so they can be correlated. If you're still
>>>>> using DLM then a dlm lock dump from all nodes is often helpful in
>>>>> conjunction with the clvmd logs.
>>>>
>>>> Sure, no problem. I've posted the logs from clvmd on both processes in
>>>> <http://web.mit.edu/broder/Public/clvmd/>. I've annotated them at a
>>>> few points with what I was doing - the annotations all start with "
>>>>>>
>>>>>> ", so they should be easy to spot.
>>
>>
>> Ironically it looks like a bug in the clvmd-openais code. I can reproduce it
>> on my systems here. I don't see the problem when using the dlm!
>>
>> Can you try -Icorosync and see if that helps? In the meantime I'll have a
>> look at the openais bits to try and find out what is wrong.
>>
>> Chrissie
>>
>
> I'll see what we can pull together, but the nodes running the clvm
> cluster are also Xen dom0's. They're currently running on (Ubuntu
> Hardy's) 2.6.24, so upgrading them to something new enough to support
> DLM 3 would be...challenging.
>
> It would be much, much better for us if we could get clvmd-openais working.
>
> Is there any chance this would work better if we dropped back to
> openais whitetank instead of corosync + openais wilson?
>


OK, I've found the bug and it IS in openais. The attached patch will fix it.

Chrissie
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: lck-continue.patch
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100113/564048ba/attachment.ksh>

From glisha at gmail.com  Wed Jan 13 15:20:32 2010
From: glisha at gmail.com (Georgi Stanojevski)
Date: Wed, 13 Jan 2010 16:20:32 +0100
Subject: [Linux-cluster] fencing with iRMC (integrated Remote Management
	Controller ) on fujitsu primergy server
Message-ID: <b71e4df91001130720s7f850081x9f001a55a541b78e@mail.gmail.com>

Hi,

has any one configured a RedHat cluster on a Fujitsu Primergi
(rx300S5) with it's iRMC (integrated Remote Management Controller ) as
a fencing device?

The documentation says that Fujitsu RSB can be used as a fencing
device. Is the iRMC a newer version of a RSB and can be configured as
a fencing device?

Thanks for any input.

-- 
Glisha



From klaus.steinberger at Physik.Uni-Muenchen.DE  Wed Jan 13 17:55:51 2010
From: klaus.steinberger at Physik.Uni-Muenchen.DE (Klaus Steinberger)
Date: Wed, 13 Jan 2010 18:55:51 +0100
Subject: [Linux-cluster] encing with iRMC (integrated Remote,
 > 	Management	Controller ) on fujitsu primergy server
In-Reply-To: <mailman.43.1263402025.31753.linux-cluster@redhat.com>
References: <mailman.43.1263402025.31753.linux-cluster@redhat.com>
Message-ID: <4B4E0927.3080109@Physik.Uni-Muenchen.DE>


> has any one configured a RedHat cluster on a Fujitsu Primergi
> (rx300S5) with it's iRMC (integrated Remote Management Controller ) as
> a fencing device?
Yes.

> The documentation says that Fujitsu RSB can be used as a fencing
> device. Is the iRMC a newer version of a RSB and can be configured as
> a fencing device?
Yes, the iRMC talks most protocols you need, at least it talk IPMI.
Configure a user for fencing in the iRMC with the following rights:

LAN Privilege:  OEM

the following fence device should work (with the user and password from above
action of course).

<fencedevice action="reboot" agent="fence_ipmilan" lanplus="1" auth="password"
login="somelogin" name="ipmi" passwd="somepassword"/>

To be complete, you should also install net-snmp and the Serverview Management
Agents on the host(s), then the iRMC will know more about you're system, and
will detect ECC's or other errors correctly.

Sincerly,
Klaus


-------------- next part --------------
A non-text attachment was scrubbed...
Name: klaus_steinberger.vcf
Type: text/x-vcard
Size: 355 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100113/cf4a8e91/attachment.vcf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 260 bytes
Desc: OpenPGP digital signature
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100113/cf4a8e91/attachment.sig>

From rohara at redhat.com  Wed Jan 13 17:59:16 2010
From: rohara at redhat.com (Ryan O'Hara)
Date: Wed, 13 Jan 2010 11:59:16 -0600
Subject: [Linux-cluster] Deadlock when using clvmd + OpenAIS + Corosync
In-Reply-To: <178868c41001120821g6885c03cr75a3d42eb0196428@mail.gmail.com>
References: <178868c41001081458h3c0b70denef831712883a6fec@mail.gmail.com>
	<4B4AE972.6040004@redhat.com>
	<178868c41001110132s1350c7cep942148bbce62dfc0@mail.gmail.com>
	<4B4AF17C.4040306@redhat.com> <4B4C38D1.6000608@redhat.com>
	<178868c41001120821g6885c03cr75a3d42eb0196428@mail.gmail.com>
Message-ID: <20100113175916.GB31184@redhat.com>

On Tue, Jan 12, 2010 at 11:21:14AM -0500, Evan Broder wrote:
> On Tue, Jan 12, 2010 at 3:54 AM, Christine Caulfield
> <ccaulfie at redhat.com> wrote:
> > On 11/01/10 09:38, Christine Caulfield wrote:
> >>
> >> On 11/01/10 09:32, Evan Broder wrote:
> >>>
> >>> On Mon, Jan 11, 2010 at 4:03 AM, Christine Caulfield
> >>> <ccaulfie at redhat.com> wrote:
> >>>>
> >>>> On 08/01/10 22:58, Evan Broder wrote:
> >>>>>
> >>>>> [please preserve the CC when replying, thanks]
> >>>>>
> >>>>> Hi -
> >>>>> We're attempting to setup a clvm (2.02.56) cluster using OpenAIS
> >>>>> (1.1.1) and Corosync (1.1.2). We've gotten bitten hard in the past by
> >>>>> crashes leaving DLM state around and forcing us to reboot our nodes,
> >>>>> so we're specifically looking for a solution that doesn't involve
> >>>>> in-kernel locking.
> >>>>>
> >>>>> We're also running the Pacemaker OpenAIS service, as we're hoping to
> >>>>> use it for management of some other resources going forward.
> >>>>>
> >>>>> We've managed to form the OpenAIS cluster, and get clvmd running on
> >>>>> both of our nodes. Operations using LVM succeed, so long as only one
> >>>>> operation runs at a time. However, if we attempt to run two operations
> >>>>> (say, one lvcreate on each host) at a time, they both hang, and both
> >>>>> clvmd processes appear to deadlock.
> >>>>>
> >>>>> When they deadlock, it doesn't appear to affect the other clustering
> >>>>> processes - both corosync and pacemaker still report a fully formed
> >>>>> cluster, so it seems the issue is localized to clvmd.
> >>>>>
> >>>>> I've looked at logs from corosync and pacemaker, and I've straced
> >>>>> various processes, but I don't want to blast a bunch of useless
> >>>>> information at the list. What information can I provide to make it
> >>>>> easier to debug and fix this deadlock?
> >>>>>
> >>>>
> >>>> To start with, the best logging to produce is the clvmd logs which
> >>>> can be
> >>>> got with clvmd -d (see the man page for details). Ideally these
> >>>> should be
> >>>> from all nodes in the cluster so they can be correlated. If you're still
> >>>> using DLM then a dlm lock dump from all nodes is often helpful in
> >>>> conjunction with the clvmd logs.
> >>>
> >>> Sure, no problem. I've posted the logs from clvmd on both processes in
> >>> <http://web.mit.edu/broder/Public/clvmd/>. I've annotated them at a
> >>> few points with what I was doing - the annotations all start with "
> >>>>>
> >>>>> ", so they should be easy to spot.
> >
> >
> > Ironically it looks like a bug in the clvmd-openais code. I can reproduce it
> > on my systems here. I don't see the problem when using the dlm!
> >
> > Can you try -Icorosync and see if that helps? In the meantime I'll have a
> > look at the openais bits to try and find out what is wrong.
> >
> > Chrissie
> >
> 
> I'll see what we can pull together, but the nodes running the clvm
> cluster are also Xen dom0's. They're currently running on (Ubuntu
> Hardy's) 2.6.24, so upgrading them to something new enough to support
> DLM 3 would be...challenging.
> 
> It would be much, much better for us if we could get clvmd-openais working.
> 
> Is there any chance this would work better if we dropped back to
> openais whitetank instead of corosync + openais wilson?

No. The LCK service in the wilson branch was only partially
implemented and contained a number of bugs. You'll need at least
openais 1.0 to have a functional LCK service.

Ryan



From dirk.schulz at kinzesberg.de  Thu Jan 14 15:53:49 2010
From: dirk.schulz at kinzesberg.de (Dirk H. Schulz)
Date: Thu, 14 Jan 2010 16:53:49 +0100
Subject: [Linux-cluster] restart_expire_time in vm.sh
Message-ID: <4B4F3E0D.6090007@kinzesberg.de>

Hi folks,

for vm services there is a parameter restart_expire_time but it is not 
described in vm.sh; I did not find anything detailled on the web as well.

Does that parameter define the time after which a restart of a vm has to 
be defined as failed?

Dirk



From dirk.schulz at kinzesberg.de  Thu Jan 14 15:58:02 2010
From: dirk.schulz at kinzesberg.de (Dirk H. Schulz)
Date: Thu, 14 Jan 2010 16:58:02 +0100
Subject: [Linux-cluster] Parameter snapshot in vm.sh
Message-ID: <4B4F3F0A.8000509@kinzesberg.de>

Hi folks,

in vm services there is a parameter "snapshot=PATH" which in vm.sh is 
explained as "Path to the snapshot directory where the virtual machine 
image will be stored".
I do not understand this as the virtual machine image is stored wherever 
it is configured in the virtual machine config file - and what does it 
have to do with snapshots (i. e. what is a "snapshot directory")?

Any hint or help is appreciated.

Dirk



From thomas at sjolshagen.net  Thu Jan 14 16:28:09 2010
From: thomas at sjolshagen.net (Thomas Sjolshagen)
Date: Thu, 14 Jan 2010 11:28:09 -0500
Subject: [Linux-cluster] Parameter snapshot in vm.sh
In-Reply-To: <4B4F3F0A.8000509@kinzesberg.de>
References: <4B4F3F0A.8000509@kinzesberg.de>
Message-ID: <20100114112809.77157fo03aw6s9og@www.sjolshagen.net>


Quoting "Dirk H. Schulz" <dirk.schulz at kinzesberg.de>:

> Hi folks,
>
> I do not understand this as the virtual machine image is stored  
> wherever it is configured in the virtual machine config file - and  
> what does it have to do with snapshots (i. e. what is a "snapshot  
> directory")?
>

The snapshot directory is the directory into which vm.sh will "virsh  
save" the memory contents of the guest when shutting it down (when  
use_virsh="1" at least). As I understand it, during rgmanager (or  
clusvcadm -d) activities, the vm is saved to the snapshot directory  
and restored from the file in that directory - if it exists.

Somewhat unrelated; I'd love to have a parameter that defines the  
"stop" action as either "snapshot" or "shutdown". It takes a long time  
to snapshot 10 guests to a gfs2 file system and the restore  
functionality does a valiant effort, but the Linux kernel does not  
seem to like being restored as of yet (the guest usually hangs after a  
seemingly arbitrary amount of time). As a result, my innodb based  
tables get a little cranky from the interruption/crash.

// Thomas

----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.




From dirk.schulz at kinzesberg.de  Thu Jan 14 17:07:37 2010
From: dirk.schulz at kinzesberg.de (Dirk H. Schulz)
Date: Thu, 14 Jan 2010 18:07:37 +0100
Subject: [Linux-cluster] Parameter snapshot in vm.sh
In-Reply-To: <20100114112809.77157fo03aw6s9og@www.sjolshagen.net>
References: <4B4F3F0A.8000509@kinzesberg.de>
	<20100114112809.77157fo03aw6s9og@www.sjolshagen.net>
Message-ID: <4B4F4F59.9030000@kinzesberg.de>

Hi Thomas,

Thomas Sjolshagen schrieb:
>
> Quoting "Dirk H. Schulz" <dirk.schulz at kinzesberg.de>:
>
>> Hi folks,
>>
>> I do not understand this as the virtual machine image is stored 
>> wherever it is configured in the virtual machine config file - and 
>> what does it have to do with snapshots (i. e. what is a "snapshot 
>> directory")?
>>
>
> The snapshot directory is the directory into which vm.sh will "virsh 
> save" the memory contents of the guest when shutting it down (when 
> use_virsh="1" at least). As I understand it, during rgmanager (or 
> clusvcadm -d) activities, the vm is saved to the snapshot directory 
> and restored from the file in that directory - if it exists.

Thanks for pointing to the right direction. I should have looked more 
deeply into vm.sh - you are right, it is used only in conjunction with 
use_virsh=1 so I can look up there what is done exactly (when I need it, 
at the moment it is all xm).
>
> Somewhat unrelated; I'd love to have a parameter that defines the 
> "stop" action as either "snapshot" or "shutdown". It takes a long time 
> to snapshot 10 guests to a gfs2 file system and the restore 
> functionality does a valiant effort, but the Linux kernel does not 
> seem to like being restored as of yet (the guest usually hangs after a 
> seemingly arbitrary amount of time). As a result, my innodb based 
> tables get a little cranky from the interruption/crash.
Is far as I can see in vm.sh the standard action is shutdown:

> case $1 in
>         start)
>                   validate_all || exit $OCF_ERR_ARGS
>                 do_start
>                 exit $?
>                 ;;
>         stop)
>                  validate_all || exit $OCF_ERR_ARGS
>              do_stop shutdown destroy
>              exit $?
>                 ;;
In the "do_stop" line the method is set as a parameter to the do_stop 
function (which hands it to the xm/virsh stop functions). If you need 
both you could seperate a vm-stop.sh and vm-snap.sh script and use 
relevant <vm-stop name=...> containers in cluster.conf (I did not test 
that but it should work like this, should it not?).

Dirk



From dirk.schulz at kinzesberg.de  Thu Jan 14 17:21:03 2010
From: dirk.schulz at kinzesberg.de (Dirk H. Schulz)
Date: Thu, 14 Jan 2010 18:21:03 +0100
Subject: [Linux-cluster] Snapshotting GFS and freezing
Message-ID: <4B4F527F.8030105@kinzesberg.de>

Hi folks,

I found several howtos on the web stating that if you want to snapshot a 
gfs volume (on top of clvm, of course), you have to freeze gfs (using 
gfs_tool) to make that possible.

What comes in mind then is: If I have to freeze the gfs volume anyway, 
do I need a snapshot at all? Can't I copy the contents off the frozen 
gfs volume directly?
Because the basic purpose of a snapshot is to provide files that do not 
change during beeing copied. And I would expect a frozen gfs volume to 
provide exactly that.

Please help in clarification.

Dirk



From jeff.sturm at eprize.com  Thu Jan 14 19:57:03 2010
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Thu, 14 Jan 2010 14:57:03 -0500
Subject: [Linux-cluster] Snapshotting GFS and freezing
In-Reply-To: <4B4F527F.8030105@kinzesberg.de>
References: <4B4F527F.8030105@kinzesberg.de>
Message-ID: <64D0546C5EBBD147B75DE133D798665F03F3F472@hugo.eprize.local>

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com]
> On Behalf Of Dirk H. Schulz
> Sent: Thursday, January 14, 2010 12:21 PM
> To: linux clustering
> Subject: [Linux-cluster] Snapshotting GFS and freezing
> 
> Hi folks,
> 
> I found several howtos on the web stating that if you want to snapshot
a
> gfs volume (on top of clvm, of course), you have to freeze gfs (using
> gfs_tool) to make that possible.
> 
> What comes in mind then is: If I have to freeze the gfs volume anyway,
> do I need a snapshot at all? Can't I copy the contents off the frozen
> gfs volume directly?
> Because the basic purpose of a snapshot is to provide files that do
not
> change during beeing copied. And I would expect a frozen gfs volume to
> provide exactly that.

The important consideration here is:  How long can you withstand the
filesystem being frozen?

You can freeze it, copy all files (e.g. with rsync) and unfreeze,
potentially over the course of a few hours.  If that's okay with your
application and users, go right ahead.

Or you can freeze, snapshot and unfreeze, and potentially be done within
a few seconds.  That's the beauty of snapshots--we do it online with
little or no user disruption.

-Jeff





From dirk.schulz at kinzesberg.de  Fri Jan 15 06:06:53 2010
From: dirk.schulz at kinzesberg.de (Dirk H. Schulz)
Date: Fri, 15 Jan 2010 07:06:53 +0100
Subject: [Linux-cluster] Snapshotting GFS and freezing
In-Reply-To: <64D0546C5EBBD147B75DE133D798665F03F3F472@hugo.eprize.local>
References: <4B4F527F.8030105@kinzesberg.de>
	<64D0546C5EBBD147B75DE133D798665F03F3F472@hugo.eprize.local>
Message-ID: <4B5005FD.1030204@kinzesberg.de>

Hi Jeff,

Jeff Sturm schrieb:
>
> The important consideration here is:  How long can you withstand the
> filesystem being frozen?
>
> You can freeze it, copy all files (e.g. with rsync) and unfreeze,
> potentially over the course of a few hours.  If that's okay with your
> application and users, go right ahead.
>
> Or you can freeze, snapshot and unfreeze, and potentially be done within
> a few seconds.  That's the beauty of snapshots--we do it online with
> little or no user disruption.
>   
That means you can unfreeze gfs before undoing the snapshot? The 
snapshot has to be CREATED with frozen gfs but can go on WORKING with 
unfrozen gfs?
If that is right I would like to understand this in detail. Why does gfs 
have to be frozen for creating a snapshot?

Dirk
> -Jeff
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>   




From cluster at xinet.it  Fri Jan 15 07:25:36 2010
From: cluster at xinet.it (cluster at xinet.it)
Date: Fri, 15 Jan 2010 08:25:36 +0100
Subject: [Linux-cluster] CLVMD hangs
Message-ID: <00b101ca95b3$ee5ff0b0$cb1fd210$@it>

Hi all,

 

i use redhat cluster my os version  is 5.4. I have a three-node cluster and
clustered VG (XenVG) with many LV (one for each VM). After a cluster reboot
I'm not able to restart clvmd anymore due to vgscan hangs.

 

Is there a way to convert the clustered VG to a non-clustered one? I tried
changing the locking_typ parameter from 3 to 1 but it wasn't the right
solution.

 

The system hangs each time I use the command vgscan, vgdisplay and vgchange.

 

Any suggestion?

 

Thanks a lot for any idea.

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100115/fb805a30/attachment.htm>

From momogentoo at gmail.com  Fri Jan 15 07:53:19 2010
From: momogentoo at gmail.com (Gentoo Momo)
Date: Fri, 15 Jan 2010 15:53:19 +0800
Subject: [Linux-cluster] How to push through NFS unmount if NFS server is
	gone?
Message-ID: <6ce0dbb91001142353v29079043r94fa215eb3b8d75f@mail.gmail.com>

Hi,
    I'm configuring a two-node cluster with NFS3 shared storage via "netfs"
agent for data repository. I've also configured a virtual ip on the
interface for NFS access. If the server is gone or the link for NFS is
broken ("tested by ifconfig ethX down"), the cluster tries to stop the
service by unmounting the NFS while it always hangs there. Is there any sort
which can solve this problem?
    Any suggestions are appreciated!
    Thanks!

Han.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100115/0eeef36c/attachment.htm>

From pasik at iki.fi  Fri Jan 15 09:35:14 2010
From: pasik at iki.fi (Pasi =?iso-8859-1?Q?K=E4rkk=E4inen?=)
Date: Fri, 15 Jan 2010 11:35:14 +0200
Subject: [Linux-cluster] Snapshotting GFS and freezing
In-Reply-To: <64D0546C5EBBD147B75DE133D798665F03F3F472@hugo.eprize.local>
References: <4B4F527F.8030105@kinzesberg.de>
	<64D0546C5EBBD147B75DE133D798665F03F3F472@hugo.eprize.local>
Message-ID: <20100115093514.GJ17978@reaktio.net>

On Thu, Jan 14, 2010 at 02:57:03PM -0500, Jeff Sturm wrote:
> > -----Original Message-----
> > From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com]
> > On Behalf Of Dirk H. Schulz
> > Sent: Thursday, January 14, 2010 12:21 PM
> > To: linux clustering
> > Subject: [Linux-cluster] Snapshotting GFS and freezing
> > 
> > Hi folks,
> > 
> > I found several howtos on the web stating that if you want to snapshot
> a
> > gfs volume (on top of clvm, of course), you have to freeze gfs (using
> > gfs_tool) to make that possible.
> > 
> > What comes in mind then is: If I have to freeze the gfs volume anyway,
> > do I need a snapshot at all? Can't I copy the contents off the frozen
> > gfs volume directly?
> > Because the basic purpose of a snapshot is to provide files that do
> not
> > change during beeing copied. And I would expect a frozen gfs volume to
> > provide exactly that.
> 
> The important consideration here is:  How long can you withstand the
> filesystem being frozen?
> 
> You can freeze it, copy all files (e.g. with rsync) and unfreeze,
> potentially over the course of a few hours.  If that's okay with your
> application and users, go right ahead.
> 
> Or you can freeze, snapshot and unfreeze, and potentially be done within
> a few seconds.  That's the beauty of snapshots--we do it online with
> little or no user disruption.
> 

Hmm.. so snapshots with CLVM are possible nowadays? 

-- Pasi



From ccaulfie at redhat.com  Fri Jan 15 14:19:34 2010
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Fri, 15 Jan 2010 14:19:34 +0000
Subject: [Linux-cluster] CLVMD hangs
In-Reply-To: <00b101ca95b3$ee5ff0b0$cb1fd210$@it>
References: <00b101ca95b3$ee5ff0b0$cb1fd210$@it>
Message-ID: <4B507976.30709@redhat.com>

On 15/01/10 07:25, cluster at xinet.it wrote:
> Hi all,
>
> i use redhat cluster my os version is 5.4. I have a three-node cluster
> and clustered VG (XenVG) with many LV (one for each VM). After a cluster
> reboot I?m not able to restart clvmd anymore due to vgscan hangs.
>
> Is there a way to convert the clustered VG to a non-clustered one? I
> tried changing the locking_typ parameter from 3 to 1 but it wasn?t the
> right solution.

You can remove the cluster flag from a VG with the following command

vgchange -cn <vg> --config 'global {locking_type = 0}'


If you're amenable it would be helpful to try and debug this. Can you add

CLVMDOPTS="-d2"

to the file /etc/sysconfig/cluster


and post the clvmd parts of system to me please.

Thanks

Chrissie



From jeff.sturm at eprize.com  Fri Jan 15 14:34:50 2010
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Fri, 15 Jan 2010 09:34:50 -0500
Subject: [Linux-cluster] Snapshotting GFS and freezing
In-Reply-To: <4B5005FD.1030204@kinzesberg.de>
References: <4B4F527F.8030105@kinzesberg.de><64D0546C5EBBD147B75DE133D798665F03F3F472@hugo.eprize.local>
	<4B5005FD.1030204@kinzesberg.de>
Message-ID: <64D0546C5EBBD147B75DE133D798665F03F3F47F@hugo.eprize.local>

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com]
> On Behalf Of Dirk H. Schulz
> Sent: Friday, January 15, 2010 1:07 AM
> To: linux clustering
> Subject: Re: [Linux-cluster] Snapshotting GFS and freezing
> 
> > Or you can freeze, snapshot and unfreeze, and potentially be done
within
> > a few seconds.  That's the beauty of snapshots--we do it online with
> > little or no user disruption.
> >
> That means you can unfreeze gfs before undoing the snapshot? The
> snapshot has to be CREATED with frozen gfs but can go on WORKING with
> unfrozen gfs?

Snapshots are copy-on-write.  Once the snapshot is taken, further writes
to the origin device do not alter the snapshot.

> If that is right I would like to understand this in detail. Why does
gfs
> have to be frozen for creating a snapshot?

Good question--I've taken snapshots of ext3 volumes where no "freeze"
command is available.  It might have to do with flushing buffers or
playing journals.  Perhaps someone on the cluster list can explain
precisely what "gfs_tool freeze" accomplishes?





From jeff.sturm at eprize.com  Fri Jan 15 14:38:30 2010
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Fri, 15 Jan 2010 09:38:30 -0500
Subject: [Linux-cluster] Snapshotting GFS and freezing
In-Reply-To: <20100115093514.GJ17978@reaktio.net>
References: <4B4F527F.8030105@kinzesberg.de><64D0546C5EBBD147B75DE133D798665F03F3F472@hugo.eprize.local>
	<20100115093514.GJ17978@reaktio.net>
Message-ID: <64D0546C5EBBD147B75DE133D798665F03F3F480@hugo.eprize.local>

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com]
> On Behalf Of Pasi K?rkk?inen
> Sent: Friday, January 15, 2010 4:35 AM
> To: linux clustering
> Subject: Re: [Linux-cluster] Snapshotting GFS and freezing
>  
> Hmm.. so snapshots with CLVM are possible nowadays?

Honestly, I think I heard it is possible, but I don't know exactly when and how.  We are running 5.2 and I can say definitively that CLVM snapshots do not work in 5.2.  (I have, as a workaround, dropped the cluster bit, taken a snapshot, and re-clustered the volume group.  Not too painful to do provided I have exclusive access from one node.)

I know 5.3 enabled clustered mirroring, which I think is a prerequisite to clustered snapshots.  I've seen patches to make snapshots work on a clustered vg.  5.4 may fix it, but I haven't tried that yet.





From jose.neto at liber4e.com  Fri Jan 15 14:53:53 2010
From: jose.neto at liber4e.com (jose nuno neto)
Date: Fri, 15 Jan 2010 14:53:53 -0000 (GMT)
Subject: [Linux-cluster] fence_ipmilan
Message-ID: <8ef0d127d8fec3ccc7f0f5e844815533.squirrel@fela.liber4e.com>

Hi,

I'm trying to set up fencing with ipmi on 2node cluster.
I have ipmi working fine and I can issue shutdown/reboot commands through
ipmitool, but when I use fence_ipmilan the command is executed but
fence_ipmilan keeps waiting for aknowledge ( I guess ) and closes with
error.
Then fenced daemon keeps trying to fence again.

Is there a way to prevent this? What is fenced_ipmilan expecting from the
machine it has been shutdown?

Much Thanks
Jose

-- 
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.



From rvandolson at esri.com  Fri Jan 15 17:12:11 2010
From: rvandolson at esri.com (Ray Van Dolson)
Date: Fri, 15 Jan 2010 09:12:11 -0800
Subject: [Linux-cluster] Snapshotting GFS and freezing
In-Reply-To: <20100115093514.GJ17978@reaktio.net>
References: <4B4F527F.8030105@kinzesberg.de>
	<64D0546C5EBBD147B75DE133D798665F03F3F472@hugo.eprize.local>
	<20100115093514.GJ17978@reaktio.net>
Message-ID: <20100115171211.GA13575@esri.com>

On Fri, Jan 15, 2010 at 01:35:14AM -0800, Pasi K?rkk?inen wrote:
> On Thu, Jan 14, 2010 at 02:57:03PM -0500, Jeff Sturm wrote:
> > > -----Original Message-----
> > > From: linux-cluster-bounces at redhat.com
> > [mailto:linux-cluster-bounces at redhat.com]
> > > On Behalf Of Dirk H. Schulz
> > > Sent: Thursday, January 14, 2010 12:21 PM
> > > To: linux clustering
> > > Subject: [Linux-cluster] Snapshotting GFS and freezing
> > > 
> > > Hi folks,
> > > 
> > > I found several howtos on the web stating that if you want to snapshot
> > a
> > > gfs volume (on top of clvm, of course), you have to freeze gfs (using
> > > gfs_tool) to make that possible.
> > > 
> > > What comes in mind then is: If I have to freeze the gfs volume anyway,
> > > do I need a snapshot at all? Can't I copy the contents off the frozen
> > > gfs volume directly?
> > > Because the basic purpose of a snapshot is to provide files that do
> > not
> > > change during beeing copied. And I would expect a frozen gfs volume to
> > > provide exactly that.
> > 
> > The important consideration here is:  How long can you withstand the
> > filesystem being frozen?
> > 
> > You can freeze it, copy all files (e.g. with rsync) and unfreeze,
> > potentially over the course of a few hours.  If that's okay with your
> > application and users, go right ahead.
> > 
> > Or you can freeze, snapshot and unfreeze, and potentially be done within
> > a few seconds.  That's the beauty of snapshots--we do it online with
> > little or no user disruption.
> > 
> 
> Hmm.. so snapshots with CLVM are possible nowadays? 
> 

No....

RH has stated (recently on this list) that patches exist to do it, but
it hasn't been a high enough priority for them to complete the work to
the point where it could be distributed to customers.

Ray



From yvette at dbtgroup.com  Fri Jan 15 20:11:50 2010
From: yvette at dbtgroup.com (yvette hirth)
Date: Fri, 15 Jan 2010 20:11:50 +0000
Subject: [Linux-cluster] How to push through NFS unmount if NFS server
 is	gone?
In-Reply-To: <6ce0dbb91001142353v29079043r94fa215eb3b8d75f@mail.gmail.com>
References: <6ce0dbb91001142353v29079043r94fa215eb3b8d75f@mail.gmail.com>
Message-ID: <4B50CC06.20701@dbtgroup.com>

Gentoo Momo wrote:
> Hi,
>     I'm configuring a two-node cluster with NFS3 shared storage via 
> "netfs" agent for data repository. I've also configured a virtual ip on 
> the interface for NFS access. If the server is gone or the link for NFS 
> is broken ("tested by ifconfig ethX down"), the cluster tries to stop 
> the service by unmounting the NFS while it always hangs there. Is there 
> any sort which can solve this problem?
>     Any suggestions are appreciated!
>     Thanks!
>  
> Han.
> 
> 
> ------------------------------------------------------------------------
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

try the -f and -l options of umount.

yvette hirth



From fdinitto at redhat.com  Sat Jan 16 07:02:29 2010
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Sat, 16 Jan 2010 08:02:29 +0100
Subject: [Linux-cluster] fence_ipmilan
In-Reply-To: <8ef0d127d8fec3ccc7f0f5e844815533.squirrel@fela.liber4e.com>
References: <8ef0d127d8fec3ccc7f0f5e844815533.squirrel@fela.liber4e.com>
Message-ID: <4B516485.706@redhat.com>

On 1/15/2010 3:53 PM, jose nuno neto wrote:
> Hi,
> 
> I'm trying to set up fencing with ipmi on 2node cluster.
> I have ipmi working fine and I can issue shutdown/reboot commands through
> ipmitool, but when I use fence_ipmilan the command is executed but
> fence_ipmilan keeps waiting for aknowledge ( I guess ) and closes with
> error.
> Then fenced daemon keeps trying to fence again.
> 
> Is there a way to prevent this? What is fenced_ipmilan expecting from the
> machine it has been shutdown?
> 
> Much Thanks
> Jose
> 

Might be helpful to know what version of fence_ipmilan you are using and
how you configured it in cluster.conf. Did you try using fence_ipmilan
from command line?

Fabio



From arturogf at gmail.com  Sat Jan 16 20:34:42 2010
From: arturogf at gmail.com (Arturo Gonzalez Ferrer)
Date: Sat, 16 Jan 2010 21:34:42 +0100
Subject: [Linux-cluster] Problem with write performance
Message-ID: <d2aeadbf1001161234r76231254i9f3c6426030f68c2@mail.gmail.com>

Hello,

We've set up recently a rhel 5.4 cluster of 3 nodes for a Moodle
high-availability website, where the sessions and data are share in a GFS2
volume.

We found that, while the read performance have been constantly good, there
is a problem with writes, as the system decrease its peformance after some
conditions. We think that it can be related with our backup procedure:

We do an NFS export of the GFS2 volume from one of the nodes, so that we can
backup the volume every night externally, from a veritas backup client.
After that, we find next morning that the write performance has decreased a
lot, so that it is practically unusable for some big files and for the
operation of cloning an existing course (zip, unzip the data of the course
in a new folder). After some experiments with the writes and clone
operations, we have found a way that improve the issue, but we think that
there should be a better way. What we did was to add the next entry to
crontab in every node:

0-59/10 * * * * sync; echo 3 > /proc/sys/vm/drop_caches

So that the lock caches are cleaned every ten minutes, as we didn't notice
that it affects badly to the performance of the system, and effectivily, it
improves the write performance somehow, at least making it usable.

Do you think this could be an option? Do you have a better explanation for
it? Any other ideas what could we do?

We have been since then having problems with apache service, being stopped
sometimes (not very offen) in one of the nodes. I think that it could be
related to this maintenance of the vm caches... but I'm not sure.

Best regards,
Arturo.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100116/733c67a6/attachment.htm>

From dirk.schulz at kinzesberg.de  Sun Jan 17 13:47:00 2010
From: dirk.schulz at kinzesberg.de (Dirk H. Schulz)
Date: Sun, 17 Jan 2010 14:47:00 +0100
Subject: [Linux-cluster] Snapshotting GFS and freezing
In-Reply-To: <20100115171211.GA13575@esri.com>
References: <4B4F527F.8030105@kinzesberg.de>	<64D0546C5EBBD147B75DE133D798665F03F3F472@hugo.eprize.local>	<20100115093514.GJ17978@reaktio.net>
	<20100115171211.GA13575@esri.com>
Message-ID: <4B5314D4.8010900@kinzesberg.de>

Ray Van Dolson schrieb:
>> Hmm.. so snapshots with CLVM are possible nowadays?
>
> No....
>
> RH has stated (recently on this list) that patches exist to do it, but
> it hasn't been a high enough priority for them to complete the work to
> the point where it could be distributed to customers.
>   

Okay then, how do you people out there with clusters that run virtual 
machines with live migration do backups of the virtual machines?
I surely do not want to shut down the virtual machine to be able to copy 
the image safely away if I have live migration available.

At the moment there is only one way I can see. PLEASE prove me to be wrong.
Searching in RedHat's documentation I found that the problem is that lvm 
snapshots need exclusively allocated logical volumes. So I think the 
following
should be technically possible:
0. Start environment: the logical volume containing the images of the 
virtual machines uses gfs and is mounted on all relevant cluster nodes 
since VMs are running on several cluster nodes.
1. All VMs have to be migrated to one of the cluster nodes
2. On all other nodes, the gfs volume is unmounted
3. On the remaining node (where all VMs now run) the logical volume is 
bound exclusively with "lvchange -aey LOGICALVOLUME"
    (I hope this is possible without deactivating it first)
4. Now GFS on this volume is frozen: "gfs_tool freeze 
/mountpoint/of/local/volume"
5. Now the snapshot is generated and mounted
6. GFS is unfrozen again
7. The virtual machine images are copied off the snapshot
8. The snapshot is umounted und undone
9. The LV is activated on all cluster nodes again, GFS file system is 
mounted on all relevant cluster nodes again.
10. The virtual machines are migrated back to whereever they belong.

Now I would like to know:
- is that basically the correct approach?
- is there any better method to backup the virtual machines without 
shutting them down?
- is anyone out there using the above steps regularly in a production 
environment?

Dirk



From a.alawi at auckland.ac.nz  Sun Jan 17 22:17:47 2010
From: a.alawi at auckland.ac.nz (Abraham Alawi)
Date: Mon, 18 Jan 2010 11:17:47 +1300
Subject: [Linux-cluster] Anyone had any luck changing the NETMTU +1500 in
	cluster.conf?
Message-ID: <89EFE9C3-B44C-4C7A-8708-841D7E7B3C41@auckland.ac.nz>

I'm using AoE (CoRAID) as backend storage for GFS2 and they support jumbo frames (MTU=9000), changing the MTU at the interface level works and I can see AoE packets (layer 2) bigger than 1500, it does reach +8000. RHCS supports changing the MTU in cluster.conf via the 'netmtu' option in the totem directive and theoretically it's supposed to double the throughput performance, quote from openais.conf man page:

> Increasing  the  MTU  from 1500 to 8982 doubles throughput performance from 30MB/sec to 60MB/sec as measured with evsbench with 175000 byte messages with the secauth directive set to off


I've changed it in cluster.conf and has been accepted by CCS but I can't see any IP (layer 3/4) packets larger than 1472, anyone had any luck changing that and worked for them?

Cheers,

  -- Abraham

''''''''''''''''''''''''''''''''''''''''''''''''''''''
Abraham Alawi

Unix/Linux Systems Administrator
Science IT
University of Auckland
e: a.alawi at auckland.ac.nz
p: +64-9-373 7599, ext#: 87572

''''''''''''''''''''''''''''''''''''''''''''''''''''''




From linux at alteeve.com  Sun Jan 17 22:37:10 2010
From: linux at alteeve.com (Madison Kelly)
Date: Sun, 17 Jan 2010 17:37:10 -0500
Subject: [Linux-cluster] Programming a fence device to be compatible with
	Redhat
Message-ID: <4B539116.1060702@alteeve.com>

Hi all,

   Is there a document that explains what is needed to make a fence 
device recognized and usable by the Red Hat Cluster software? 
Specifically, a single device that sits on a network.

Thanks in advance!

Madi



From cthulhucalling at gmail.com  Sun Jan 17 23:06:31 2010
From: cthulhucalling at gmail.com (Ian Hayes)
Date: Sun, 17 Jan 2010 15:06:31 -0800
Subject: [Linux-cluster] Programming a fence device to be compatible
	with Redhat
In-Reply-To: <4B539116.1060702@alteeve.com>
References: <4B539116.1060702@alteeve.com>
Message-ID: <36df569a1001171506w71856000k30f2bfb797c9d83c@mail.gmail.com>

You can specify any script you have written, although I don't know if you
can modify the list of fence devices in system-config-cluster or conga. I've
usually just specified manual fencing in the config, then edited the
cluster.conf file by hand to use the custom script.

On Jan 17, 2010 2:44 PM, "Madison Kelly" <linux at alteeve.com> wrote:

Hi all,

 Is there a document that explains what is needed to make a fence device
recognized and usable by the Red Hat Cluster software? Specifically, a
single device that sits on a network.

Thanks in advance!

Madi

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100117/50ed96ad/attachment.htm>

From carlopmart at gmail.com  Mon Jan 18 10:28:07 2010
From: carlopmart at gmail.com (carlopmart)
Date: Mon, 18 Jan 2010 11:28:07 +0100
Subject: [Linux-cluster] Problems with RHCS across firewalls
Message-ID: <4B5437B7.3060901@gmail.com>

Hi all,

  I have several problems to setup a rhcs (two nodes) when are installed and secured 
on two separated networks with a different firewalls.

  I have setup these rules on my internal firewalls to allow traffic for rhcs:

  source: hostA
  destination: hostB
  services: 5404 and 50405 udp, 21064 tcp, 50006-50008-50009 tcp, 50007 udp

  source: hostB
  destination: hostA
  services: 5404 and 50405 udp, 21064 tcp, 50006-50008-50009 tcp, 50007 udp

  source: hostA and hostB
  destination: 255.255.255.255
  services: 5404 and 50405 udp

  source: hostA and hostB
  destination: multicast_address
  service: any

  ... but it doesn't works ... What am I doing wrong?? I don't use NAT on these 
firewalls due to all traffic needs to across internal networks.

  Thanks.

-- 
CL Martinez
carlopmart {at} gmail {d0t} com



From pasik at iki.fi  Mon Jan 18 10:34:39 2010
From: pasik at iki.fi (Pasi =?iso-8859-1?Q?K=E4rkk=E4inen?=)
Date: Mon, 18 Jan 2010 12:34:39 +0200
Subject: [Linux-cluster] Snapshotting GFS and freezing
In-Reply-To: <4B5314D4.8010900@kinzesberg.de>
References: <4B4F527F.8030105@kinzesberg.de>
	<64D0546C5EBBD147B75DE133D798665F03F3F472@hugo.eprize.local>
	<20100115093514.GJ17978@reaktio.net>
	<20100115171211.GA13575@esri.com> <4B5314D4.8010900@kinzesberg.de>
Message-ID: <20100118103439.GK17978@reaktio.net>

On Sun, Jan 17, 2010 at 02:47:00PM +0100, Dirk H. Schulz wrote:
> Ray Van Dolson schrieb:
> >>Hmm.. so snapshots with CLVM are possible nowadays?
> >
> >No....
> >
> >RH has stated (recently on this list) that patches exist to do it, but
> >it hasn't been a high enough priority for them to complete the work to
> >the point where it could be distributed to customers.
> >  
> 
> Okay then, how do you people out there with clusters that run virtual 
> machines with live migration do backups of the virtual machines?
> I surely do not want to shut down the virtual machine to be able to copy 
> the image safely away if I have live migration available.
> 
> At the moment there is only one way I can see. PLEASE prove me to be wrong.
> Searching in RedHat's documentation I found that the problem is that lvm 
> snapshots need exclusively allocated logical volumes. So I think the 
> following
> should be technically possible:
> 0. Start environment: the logical volume containing the images of the 
> virtual machines uses gfs and is mounted on all relevant cluster nodes 
> since VMs are running on several cluster nodes.
> 1. All VMs have to be migrated to one of the cluster nodes
> 2. On all other nodes, the gfs volume is unmounted
> 3. On the remaining node (where all VMs now run) the logical volume is 
> bound exclusively with "lvchange -aey LOGICALVOLUME"
>    (I hope this is possible without deactivating it first)
> 4. Now GFS on this volume is frozen: "gfs_tool freeze 
> /mountpoint/of/local/volume"

Before freezing the GFS you should make sure the VMs are in consistent
state, and the VMs have flushed their caches/buffers/disks.

> 5. Now the snapshot is generated and mounted
> 6. GFS is unfrozen again
> 7. The virtual machine images are copied off the snapshot
> 8. The snapshot is umounted und undone
> 9. The LV is activated on all cluster nodes again, GFS file system is 
> mounted on all relevant cluster nodes again.
> 10. The virtual machines are migrated back to whereever they belong.
> 
> Now I would like to know:
> - is that basically the correct approach?
> - is there any better method to backup the virtual machines without 
> shutting them down?
> - is anyone out there using the above steps regularly in a production 
> environment?
> 

-- Pasi



From raju.rajsand at gmail.com  Mon Jan 18 12:14:33 2010
From: raju.rajsand at gmail.com (Rajagopal Swaminathan)
Date: Mon, 18 Jan 2010 17:44:33 +0530
Subject: [Linux-cluster] Anyone had any luck changing the NETMTU +1500
	in cluster.conf?
In-Reply-To: <89EFE9C3-B44C-4C7A-8708-841D7E7B3C41@auckland.ac.nz>
References: <89EFE9C3-B44C-4C7A-8708-841D7E7B3C41@auckland.ac.nz>
Message-ID: <8786b91c1001180414m2c3b25ag8a350a62fe96d27a@mail.gmail.com>

Greetings,

On Mon, Jan 18, 2010 at 3:47 AM, Abraham Alawi <a.alawi at auckland.ac.nz> wrote:
>
>
> I've changed it in cluster.conf and has been accepted by CCS but I can't see any IP (layer 3/4) packets larger than 1472, anyone had any luck changing that and worked for them?
>

Has the switch they are connected to configured to handle Jumbo
frames? else they will dice the frames when passing through them...

Regards

Rajagopal



From jose.neto at liber4e.com  Mon Jan 18 13:13:25 2010
From: jose.neto at liber4e.com (jose nuno neto)
Date: Mon, 18 Jan 2010 13:13:25 -0000 (GMT)
Subject: [Linux-cluster] fence_ipmilan
Message-ID: <cccb38f22fd2bc70945da5b1281ea934.squirrel@fela.liber4e.com>

Hi

Sorry for not posting the info before....
I repeated the steps, and now fence_ipmilan -o reboot powers down the
machine, no power up, and ILOm gets reseted to factory defaults ????

Below is the info, I've been finding the fence_ipmilan not very reliable,
any one using it in prod? what is the most used fencing methods?


##### Info for ipmi #####
SUN BLADE X6270 SERVER MODULE, ILOM v3.0.6.10, r48729
fence_ipmilan 2.0.115 (built Sep  2 2009 11:45:31)

fence_ipmilan -a 172.26.18.3 -l xxxx -p xxxx -o status
Getting status of IPMI:172.26.18.3...Chassis power = On
Done

fence_ipmilan -a 172.26.255.1 -l xxxx -p xxxx -o reboot
Rebooting machine @ IPMI:172.26.255.1...ipmilan: Failed to connect after
10 seconds

Powers Down, no power Up, Ilo goes to factory Reset

/etc/ipmi_lan.conf
addr 0.0.0.0
priv_limit admin
allowed_auths_callback md5
allowed_auths_user md5
allowed_auths_operator md5
allowed_auths_admin md5
user 2 true xxxx xxxx admin 1 md5

ipmitool lan print

Set in Progress         : Set Complete
Auth Type Support       : NONE MD2 MD5 PASSWORD
Auth Type Enable        : Callback : MD2 MD5 PASSWORD
                        : User     : MD2 MD5 PASSWORD
                        : Operator : MD2 MD5 PASSWORD
                        : Admin    : MD2 MD5 PASSWORD
                        : OEM      :
IP Address Source       : Static Address
IP Address              : 172.26.18.5
Subnet Mask             : 255.255.240.0
MAC Address             : 00:14:4f:ca:83:4e
SNMP Community String   : public
IP Header               : TTL=0x00 Flags=0x00 Precedence=0x00 TOS=0x00
BMC ARP Control         : ARP Responses Enabled, Gratuitous ARP Enabled
Gratituous ARP Intrvl   : 5.0 seconds
Default Gateway IP      : 0.0.0.0
Default Gateway MAC     : 00:00:00:00:00:00
Backup Gateway IP       : 0.0.0.0
Backup Gateway MAC      : 00:00:00:00:00:00
802.1q VLAN ID          : Disabled
802.1q VLAN Priority    : 0
RMCP+ Cipher Suites     : 2,3,0
Cipher Suite Priv Max   : XXXXXXXXXXXXXXX
                        :     X=Cipher Suite Unused
                        :     c=CALLBACK
                        :     u=USER
                        :     o=OPERATOR
                        :     a=ADMIN
                        :     O=OEM



ipmitool -U ipmi -P ipmi -H malta8 chassis status
System Power         : on
Power Overload       : false
Power Interlock      : inactive
Main Power Fault     : false
Power Control Fault  : false
Power Restore Policy : always-off
Last Power Event     :
Chassis Intrusion    : inactive
Front-Panel Lockout  : inactive
Drive Fault          : false
Cooling/Fan Fault    : false



> On 1/15/2010 3:53 PM, jose nuno neto wrote:
>> Hi,
>> I'm trying to set up fencing with ipmi on 2node cluster.
>> I have ipmi working fine and I can issue shutdown/reboot commands through
>> ipmitool, but when I use fence_ipmilan the command is executed but
fence_ipmilan keeps waiting for aknowledge ( I guess ) and closes with
error.
>> Then fenced daemon keeps trying to fence again.
>> Is there a way to prevent this? What is fenced_ipmilan expecting from the
>> machine it has been shutdown?
>> Much Thanks
>> Jose
>
> Might be helpful to know what version of fence_ipmilan you are using and
how you configured it in cluster.conf. Did you try using fence_ipmilan
from command line?
>
> Fabio
>
> --
> This message has been scanned for viruses and
> dangerous content by MailScanner, and is
> believed to be clean.
>




-- 
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.



From jfriesse at redhat.com  Mon Jan 18 13:27:30 2010
From: jfriesse at redhat.com (Jan Friesse)
Date: Mon, 18 Jan 2010 14:27:30 +0100
Subject: [Linux-cluster] fence_ipmilan
In-Reply-To: <cccb38f22fd2bc70945da5b1281ea934.squirrel@fela.liber4e.com>
References: <cccb38f22fd2bc70945da5b1281ea934.squirrel@fela.liber4e.com>
Message-ID: <4B5461C2.50804@redhat.com>

Jose,
this is known bug/feature of newer IPMI implementations. See
https://bugzilla.redhat.com/show_bug.cgi?id=482913 for more informations.

Regards,
  Honza

jose nuno neto wrote:
> Hi
> 
> Sorry for not posting the info before....
> I repeated the steps, and now fence_ipmilan -o reboot powers down the
> machine, no power up, and ILOm gets reseted to factory defaults ????
> 
> Below is the info, I've been finding the fence_ipmilan not very reliable,
> any one using it in prod? what is the most used fencing methods?
> 
> 
> ##### Info for ipmi #####
> SUN BLADE X6270 SERVER MODULE, ILOM v3.0.6.10, r48729
> fence_ipmilan 2.0.115 (built Sep  2 2009 11:45:31)
> 
> fence_ipmilan -a 172.26.18.3 -l xxxx -p xxxx -o status
> Getting status of IPMI:172.26.18.3...Chassis power = On
> Done
> 
> fence_ipmilan -a 172.26.255.1 -l xxxx -p xxxx -o reboot
> Rebooting machine @ IPMI:172.26.255.1...ipmilan: Failed to connect after
> 10 seconds
> 
> Powers Down, no power Up, Ilo goes to factory Reset
> 
> /etc/ipmi_lan.conf
> addr 0.0.0.0
> priv_limit admin
> allowed_auths_callback md5
> allowed_auths_user md5
> allowed_auths_operator md5
> allowed_auths_admin md5
> user 2 true xxxx xxxx admin 1 md5
> 
> ipmitool lan print
> 
> Set in Progress         : Set Complete
> Auth Type Support       : NONE MD2 MD5 PASSWORD
> Auth Type Enable        : Callback : MD2 MD5 PASSWORD
>                         : User     : MD2 MD5 PASSWORD
>                         : Operator : MD2 MD5 PASSWORD
>                         : Admin    : MD2 MD5 PASSWORD
>                         : OEM      :
> IP Address Source       : Static Address
> IP Address              : 172.26.18.5
> Subnet Mask             : 255.255.240.0
> MAC Address             : 00:14:4f:ca:83:4e
> SNMP Community String   : public
> IP Header               : TTL=0x00 Flags=0x00 Precedence=0x00 TOS=0x00
> BMC ARP Control         : ARP Responses Enabled, Gratuitous ARP Enabled
> Gratituous ARP Intrvl   : 5.0 seconds
> Default Gateway IP      : 0.0.0.0
> Default Gateway MAC     : 00:00:00:00:00:00
> Backup Gateway IP       : 0.0.0.0
> Backup Gateway MAC      : 00:00:00:00:00:00
> 802.1q VLAN ID          : Disabled
> 802.1q VLAN Priority    : 0
> RMCP+ Cipher Suites     : 2,3,0
> Cipher Suite Priv Max   : XXXXXXXXXXXXXXX
>                         :     X=Cipher Suite Unused
>                         :     c=CALLBACK
>                         :     u=USER
>                         :     o=OPERATOR
>                         :     a=ADMIN
>                         :     O=OEM
> 
> 
> 
> ipmitool -U ipmi -P ipmi -H malta8 chassis status
> System Power         : on
> Power Overload       : false
> Power Interlock      : inactive
> Main Power Fault     : false
> Power Control Fault  : false
> Power Restore Policy : always-off
> Last Power Event     :
> Chassis Intrusion    : inactive
> Front-Panel Lockout  : inactive
> Drive Fault          : false
> Cooling/Fan Fault    : false
> 
> 
> 
>> On 1/15/2010 3:53 PM, jose nuno neto wrote:
>>> Hi,
>>> I'm trying to set up fencing with ipmi on 2node cluster.
>>> I have ipmi working fine and I can issue shutdown/reboot commands through
>>> ipmitool, but when I use fence_ipmilan the command is executed but
> fence_ipmilan keeps waiting for aknowledge ( I guess ) and closes with
> error.
>>> Then fenced daemon keeps trying to fence again.
>>> Is there a way to prevent this? What is fenced_ipmilan expecting from the
>>> machine it has been shutdown?
>>> Much Thanks
>>> Jose
>> Might be helpful to know what version of fence_ipmilan you are using and
> how you configured it in cluster.conf. Did you try using fence_ipmilan
> from command line?
>> Fabio
>>
>> --
>> This message has been scanned for viruses and
>> dangerous content by MailScanner, and is
>> believed to be clean.
>>
> 
> 
> 
> 



From francisco_javier.pena at roche.com  Mon Jan 18 14:29:34 2010
From: francisco_javier.pena at roche.com (Pena, Francisco Javier)
Date: Mon, 18 Jan 2010 15:29:34 +0100
Subject: [Linux-cluster] Problem with write performance
In-Reply-To: <d2aeadbf1001161234r76231254i9f3c6426030f68c2@mail.gmail.com>
References: <d2aeadbf1001161234r76231254i9f3c6426030f68c2@mail.gmail.com>
Message-ID: <9E370B183F253D48A26A4D803EDF355A07F7712C@RBAMSEM705.emea.roche.com>

Hello Arturo,

The following Red Hat KB article may be useful in your case: http://kbase.redhat.com/faq/docs/DOC-6533 . The glock_purge and demote_secs tunables have been quite useful to us in some cases, and your case looks similar.

The only drawback of this method is that you will need to write a script somewhere to set the tunables, as they are not persistent.

Regards,

Javier

________________________________________
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Arturo Gonzalez Ferrer
Sent: Saturday, January 16, 2010 9:35 PM
To: linux clustering
Subject: [Linux-cluster] Problem with write performance

Hello,

We've set up recently a rhel 5.4 cluster of 3 nodes for a Moodle high-availability website, where the sessions and data are share in a GFS2 volume.

We found that, while the read performance have been constantly good, there is a problem with writes, as the system decrease its peformance after some conditions. We think that it can be related with our backup procedure:

We do an NFS export of the GFS2 volume from one of the nodes, so that we can backup the volume every night externally, from a veritas backup client. After that, we find next morning that the write performance has decreased a lot, so that it is practically unusable for some big files and for the operation of cloning an existing course (zip, unzip the data of the course in a new folder). After some experiments with the writes and clone operations, we have found a way that improve the issue, but we think that there should be a better way. What we did was to add the next entry to crontab in every node:

0-59/10 * * * * sync; echo 3 > /proc/sys/vm/drop_caches

So that the lock caches are cleaned every ten minutes, as we didn't notice that it affects badly to the performance of the system, and effectivily, it improves the write performance somehow, at least making it usable.

Do you think this could be an option? Do you have a better explanation for it? Any other ideas what could we do?

We have been since then having problems with apache service, being stopped sometimes (not very offen) in one of the nodes. I think that it could be related to this maintenance of the vm caches... but I'm not sure.

Best regards,
Arturo.





From carlopmart at gmail.com  Mon Jan 18 14:45:32 2010
From: carlopmart at gmail.com (carlopmart)
Date: Mon, 18 Jan 2010 15:45:32 +0100
Subject: [Linux-cluster] Problems with RHCS across firewalls
In-Reply-To: <4B5437B7.3060901@gmail.com>
References: <4B5437B7.3060901@gmail.com>
Message-ID: <4B54740C.6080708@gmail.com>

carlopmart wrote:
> Hi all,
> 
>  I have several problems to setup a rhcs (two nodes) when are installed 
> and secured on two separated networks with a different firewalls.
> 
>  I have setup these rules on my internal firewalls to allow traffic for 
> rhcs:
> 
>  source: hostA
>  destination: hostB
>  services: 5404 and 50405 udp, 21064 tcp, 50006-50008-50009 tcp, 50007 udp
> 
>  source: hostB
>  destination: hostA
>  services: 5404 and 50405 udp, 21064 tcp, 50006-50008-50009 tcp, 50007 udp
> 
>  source: hostA and hostB
>  destination: 255.255.255.255
>  services: 5404 and 50405 udp
> 
>  source: hostA and hostB
>  destination: multicast_address
>  service: any
> 
>  ... but it doesn't works ... What am I doing wrong?? I don't use NAT on 
> these firewalls due to all traffic needs to across internal networks.
> 
>  Thanks.
> 

Please, any ideas??

-- 
CL Martinez
carlopmart {at} gmail {d0t} com



From ccaulfie at redhat.com  Mon Jan 18 15:03:50 2010
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Mon, 18 Jan 2010 15:03:50 +0000
Subject: [Linux-cluster] Problems with RHCS across firewalls
In-Reply-To: <4B5437B7.3060901@gmail.com>
References: <4B5437B7.3060901@gmail.com>
Message-ID: <4B547856.6050401@redhat.com>

On 18/01/10 10:28, carlopmart wrote:
> Hi all,
>
> I have several problems to setup a rhcs (two nodes) when are installed
> and secured on two separated networks with a different firewalls.
>
> I have setup these rules on my internal firewalls to allow traffic for
> rhcs:
>
> source: hostA
> destination: hostB
> services: 5404 and 50405 udp, 21064 tcp, 50006-50008-50009 tcp, 50007 udp
>
> source: hostB
> destination: hostA
> services: 5404 and 50405 udp, 21064 tcp, 50006-50008-50009 tcp, 50007 udp
>
> source: hostA and hostB
> destination: 255.255.255.255
> services: 5404 and 50405 udp
>
> source: hostA and hostB
> destination: multicast_address
> service: any
>
> ... but it doesn't works ... What am I doing wrong?? I don't use NAT on
> these firewalls due to all traffic needs to across internal networks.
>
> Thanks.
>

You mention port 50405 - that should be 5405. Now that might be a typo 
in your email, but if that's the port you've opened on the servers then 
the cluster won't start.

Here's the documentation for enabling ports for cluster suite:

http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5.4/html/Cluster_Administration/s1-iptables-CA.html

Chrissie



From arturogf at gmail.com  Mon Jan 18 15:05:15 2010
From: arturogf at gmail.com (Arturo Gonzalez Ferrer)
Date: Mon, 18 Jan 2010 16:05:15 +0100
Subject: [Linux-cluster] Problem with write performance
In-Reply-To: <9E370B183F253D48A26A4D803EDF355A07F7712C@RBAMSEM705.emea.roche.com>
References: <d2aeadbf1001161234r76231254i9f3c6426030f68c2@mail.gmail.com>
	<9E370B183F253D48A26A4D803EDF355A07F7712C@RBAMSEM705.emea.roche.com>
Message-ID: <d2aeadbf1001180705m3284885ave6d932b413a3debf@mail.gmail.com>

Thank you F. Javier,

But the problem is that we use gfs2, and it does not seem to have these
tunables as you can read at:
http://www.mail-archive.com/linux-cluster at redhat.com/msg04673.html

So, we do not have the chance to test it...I think.

It would be very nice to, at least, know what is this problem related to.

Regards,
Arturo.

2010/1/18 Pena, Francisco Javier <francisco_javier.pena at roche.com>

> Hello Arturo,
>
> The following Red Hat KB article may be useful in your case:
> http://kbase.redhat.com/faq/docs/DOC-6533 . The glock_purge and
> demote_secs tunables have been quite useful to us in some cases, and your
> case looks similar.
>
> The only drawback of this method is that you will need to write a script
> somewhere to set the tunables, as they are not persistent.
>
> Regards,
>
> Javier
>
> ________________________________________
> From: linux-cluster-bounces at redhat.com [mailto:
> linux-cluster-bounces at redhat.com] On Behalf Of Arturo Gonzalez Ferrer
> Sent: Saturday, January 16, 2010 9:35 PM
> To: linux clustering
> Subject: [Linux-cluster] Problem with write performance
>
> Hello,
>
> We've set up recently a rhel 5.4 cluster of 3 nodes for a Moodle
> high-availability website, where the sessions and data are share in a GFS2
> volume.
>
> We found that, while the read performance have been constantly good, there
> is a problem with writes, as the system decrease its peformance after some
> conditions. We think that it can be related with our backup procedure:
>
> We do an NFS export of the GFS2 volume from one of the nodes, so that we
> can backup the volume every night externally, from a veritas backup client.
> After that, we find next morning that the write performance has decreased a
> lot, so that it is practically unusable for some big files and for the
> operation of cloning an existing course (zip, unzip the data of the course
> in a new folder). After some experiments with the writes and clone
> operations, we have found a way that improve the issue, but we think that
> there should be a better way. What we did was to add the next entry to
> crontab in every node:
>
> 0-59/10 * * * * sync; echo 3 > /proc/sys/vm/drop_caches
>
> So that the lock caches are cleaned every ten minutes, as we didn't notice
> that it affects badly to the performance of the system, and effectivily, it
> improves the write performance somehow, at least making it usable.
>
> Do you think this could be an option? Do you have a better explanation for
> it? Any other ideas what could we do?
>
> We have been since then having problems with apache service, being stopped
> sometimes (not very offen) in one of the nodes. I think that it could be
> related to this maintenance of the vm caches... but I'm not sure.
>
> Best regards,
> Arturo.
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100118/659bd26f/attachment.htm>

From carlopmart at gmail.com  Mon Jan 18 15:38:24 2010
From: carlopmart at gmail.com (carlopmart)
Date: Mon, 18 Jan 2010 16:38:24 +0100
Subject: [Linux-cluster] Problems with RHCS across firewalls
In-Reply-To: <4B547856.6050401@redhat.com>
References: <4B5437B7.3060901@gmail.com> <4B547856.6050401@redhat.com>
Message-ID: <4B548070.4030907@gmail.com>

Christine Caulfield wrote:
> On 18/01/10 10:28, carlopmart wrote:
>> Hi all,
>>
>> I have several problems to setup a rhcs (two nodes) when are installed
>> and secured on two separated networks with a different firewalls.
>>
>> I have setup these rules on my internal firewalls to allow traffic for
>> rhcs:
>>
>> source: hostA
>> destination: hostB
>> services: 5404 and 50405 udp, 21064 tcp, 50006-50008-50009 tcp, 50007 udp
>>
>> source: hostB
>> destination: hostA
>> services: 5404 and 50405 udp, 21064 tcp, 50006-50008-50009 tcp, 50007 udp
>>
>> source: hostA and hostB
>> destination: 255.255.255.255
>> services: 5404 and 50405 udp
>>
>> source: hostA and hostB
>> destination: multicast_address
>> service: any
>>
>> ... but it doesn't works ... What am I doing wrong?? I don't use NAT on
>> these firewalls due to all traffic needs to across internal networks.
>>
>> Thanks.
>>
> 
> You mention port 50405 - that should be 5405. Now that might be a typo 
> in your email, but if that's the port you've opened on the servers then 
> the cluster won't start.
> 
> Here's the documentation for enabling ports for cluster suite:
> 
> http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5.4/html/Cluster_Administration/s1-iptables-CA.html 
> 
> 
> Chrissie
> 
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 

Sorry, 50405 is a typo. I am using 5405 port under firewall policies ...


-- 
CL Martinez
carlopmart {at} gmail {d0t} com



From carlopmart at gmail.com  Mon Jan 18 15:43:41 2010
From: carlopmart at gmail.com (carlopmart)
Date: Mon, 18 Jan 2010 16:43:41 +0100
Subject: [Linux-cluster] tunneling all cluster traffic across ipsec tunnel
Message-ID: <4B5481AD.6020104@gmail.com>

Hi all,

  Is it possible to route all cluster traffic: multicast address, ccsd ports, cman 
ports, etc. over a ipsec host-to-host tunnel?? Or maybe a stupid question??

  Thanks.

-- 
CL Martinez
carlopmart {at} gmail {d0t} com



From carlopmart at gmail.com  Mon Jan 18 17:23:12 2010
From: carlopmart at gmail.com (carlopmart)
Date: Mon, 18 Jan 2010 18:23:12 +0100
Subject: [Linux-cluster] Problems with RHCS across firewalls
In-Reply-To: <4B548070.4030907@gmail.com>
References: <4B5437B7.3060901@gmail.com> <4B547856.6050401@redhat.com>
	<4B548070.4030907@gmail.com>
Message-ID: <4B549900.2030002@gmail.com>

carlopmart wrote:
> Christine Caulfield wrote:
>> On 18/01/10 10:28, carlopmart wrote:
>>> Hi all,
>>>
>>> I have several problems to setup a rhcs (two nodes) when are installed
>>> and secured on two separated networks with a different firewalls.
>>>
>>> I have setup these rules on my internal firewalls to allow traffic for
>>> rhcs:
>>>
>>> source: hostA
>>> destination: hostB
>>> services: 5404 and 50405 udp, 21064 tcp, 50006-50008-50009 tcp, 50007 
>>> udp
>>>
>>> source: hostB
>>> destination: hostA
>>> services: 5404 and 50405 udp, 21064 tcp, 50006-50008-50009 tcp, 50007 
>>> udp
>>>
>>> source: hostA and hostB
>>> destination: 255.255.255.255
>>> services: 5404 and 50405 udp
>>>
>>> source: hostA and hostB
>>> destination: multicast_address
>>> service: any
>>>
>>> ... but it doesn't works ... What am I doing wrong?? I don't use NAT on
>>> these firewalls due to all traffic needs to across internal networks.
>>>
>>> Thanks.
>>>
>>
>> You mention port 50405 - that should be 5405. Now that might be a typo 
>> in your email, but if that's the port you've opened on the servers 
>> then the cluster won't start.
>>
>> Here's the documentation for enabling ports for cluster suite:
>>
>> http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5.4/html/Cluster_Administration/s1-iptables-CA.html 
>>
>>
>> Chrissie
>>
>> -- 
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
> 
> Sorry, 50405 is a typo. I am using 5405 port under firewall policies ...
> 
> 
Nothing??

-- 
CL Martinez
carlopmart {at} gmail {d0t} com



From pcs.fagnonr at pcsb.org  Mon Jan 18 19:53:12 2010
From: pcs.fagnonr at pcsb.org (Fagnon Raymond)
Date: Mon, 18 Jan 2010 14:53:12 -0500
Subject: [Linux-cluster] virtual ip
In-Reply-To: <mailman.0.1263844027.8387.linux-cluster@redhat.com>
References: <mailman.0.1263844027.8387.linux-cluster@redhat.com>
Message-ID: <D24B5EDEE969354CA10D44C56305CB4B1D9B7D4A0D@a0040semail3.pinellas.local>

Hello and thank you for the help in advance. I am in charge of a two node HA cluster. I am trying to figure out where the virtual ip resides. When I do an ifconfig on node one there is no alias. The same on node two. Where does the virtual ip get set at, and what command can you type to find out the virtual ip?



From gordan at bobich.net  Mon Jan 18 20:07:48 2010
From: gordan at bobich.net (Gordan Bobic)
Date: Mon, 18 Jan 2010 20:07:48 +0000
Subject: [Linux-cluster] virtual ip
In-Reply-To: <D24B5EDEE969354CA10D44C56305CB4B1D9B7D4A0D@a0040semail3.pinellas.local>
References: <mailman.0.1263844027.8387.linux-cluster@redhat.com>
	<D24B5EDEE969354CA10D44C56305CB4B1D9B7D4A0D@a0040semail3.pinellas.local>
Message-ID: <4B54BF94.4060306@bobich.net>

Fagnon Raymond wrote:
> Hello and thank you for the help in advance. I am in charge of a two node HA cluster. I am trying to figure out where the virtual ip resides. When I do an ifconfig on node one there is no alias. The same on node two. Where does the virtual ip get set at, and what command can you type to find out the virtual ip?

ip addr

man ip

That should get you moving in the right direction.

Gordan



From pcs.fagnonr at pcsb.org  Mon Jan 18 20:23:15 2010
From: pcs.fagnonr at pcsb.org (Fagnon Raymond)
Date: Mon, 18 Jan 2010 15:23:15 -0500
Subject: [Linux-cluster] virtual ip
In-Reply-To: <4B54BF94.4060306@bobich.net>
References: <mailman.0.1263844027.8387.linux-cluster@redhat.com>
	<D24B5EDEE969354CA10D44C56305CB4B1D9B7D4A0D@a0040semail3.pinellas.local>
	<4B54BF94.4060306@bobich.net>
Message-ID: <D24B5EDEE969354CA10D44C56305CB4B1D9B7D4A11@a0040semail3.pinellas.local>

Thank you 

Why does it not show up in ifconfig?

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Gordan Bobic
Sent: Monday, January 18, 2010 3:08 PM
To: linux clustering
Subject: Re: [Linux-cluster] virtual ip

Fagnon Raymond wrote:
> Hello and thank you for the help in advance. I am in charge of a two node HA cluster. I am trying to figure out where the virtual ip resides. When I do an ifconfig on node one there is no alias. The same on node two. Where does the virtual ip get set at, and what command can you type to find out the virtual ip?

ip addr

man ip

That should get you moving in the right direction.

Gordan

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From skbenja at gmail.com  Mon Jan 18 20:38:54 2010
From: skbenja at gmail.com (Stephen Benjamin)
Date: Mon, 18 Jan 2010 15:38:54 -0500
Subject: [Linux-cluster] virtual ip
In-Reply-To: <D24B5EDEE969354CA10D44C56305CB4B1D9B7D4A11@a0040semail3.pinellas.local>
References: <mailman.0.1263844027.8387.linux-cluster@redhat.com> 
	<D24B5EDEE969354CA10D44C56305CB4B1D9B7D4A0D@a0040semail3.pinellas.local>
	<4B54BF94.4060306@bobich.net>
	<D24B5EDEE969354CA10D44C56305CB4B1D9B7D4A11@a0040semail3.pinellas.local>
Message-ID: <d67f5d1b1001181238x1ace12acyd53045c8bae765ef@mail.gmail.com>

A subinterface (eth#:1) would show up in ifconfig, but RHCS adds the VIP
directly to the eth# interface itself as a second address, so you have to
use 'ip addr' to view all the IP's. Ifconfig will only show you the first.


- Steve


On Mon, Jan 18, 2010 at 3:23 PM, Fagnon Raymond <pcs.fagnonr at pcsb.org>wrote:

> Thank you
>
> Why does it not show up in ifconfig?
>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com [mailto:
> linux-cluster-bounces at redhat.com] On Behalf Of Gordan Bobic
> Sent: Monday, January 18, 2010 3:08 PM
> To: linux clustering
> Subject: Re: [Linux-cluster] virtual ip
>
> Fagnon Raymond wrote:
> > Hello and thank you for the help in advance. I am in charge of a two node
> HA cluster. I am trying to figure out where the virtual ip resides. When I
> do an ifconfig on node one there is no alias. The same on node two. Where
> does the virtual ip get set at, and what command can you type to find out
> the virtual ip?
>
> ip addr
>
> man ip
>
> That should get you moving in the right direction.
>
> Gordan
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100118/11af4985/attachment.htm>

From jos at xos.nl  Mon Jan 18 20:47:10 2010
From: jos at xos.nl (Jos Vos)
Date: Mon, 18 Jan 2010 21:47:10 +0100
Subject: [Linux-cluster] virtual ip
In-Reply-To: <D24B5EDEE969354CA10D44C56305CB4B1D9B7D4A0D@a0040semail3.pinellas.local>
References: <mailman.0.1263844027.8387.linux-cluster@redhat.com>
	<D24B5EDEE969354CA10D44C56305CB4B1D9B7D4A0D@a0040semail3.pinellas.local>
Message-ID: <20100118204710.GA21734@jasmine.xos.nl>

On Mon, Jan 18, 2010 at 02:53:12PM -0500, Fagnon Raymond wrote:

> Hello and thank you for the help in advance. I am in charge of a
> two node HA cluster. I am trying to figure out where the virtual ip
> resides. When I do an ifconfig on node one there is no alias. The same
> on node two. Where does the virtual ip get set at, and what command can
> you type to find out the virtual ip?

You can't see them with ifconfig.  Try "/sbin/ip addr show".

-- 
--    Jos Vos <jos at xos.nl>
--    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
--    Amsterdam, The Netherlands        |     Fax: +31 20 6948204



From bohdan at harazd.net  Mon Jan 18 20:50:06 2010
From: bohdan at harazd.net (Bohdan Sydor)
Date: Mon, 18 Jan 2010 21:50:06 +0100
Subject: [Linux-cluster] virtual ip
In-Reply-To: <D24B5EDEE969354CA10D44C56305CB4B1D9B7D4A11@a0040semail3.pinellas.local>
References: <mailman.0.1263844027.8387.linux-cluster@redhat.com>	<D24B5EDEE969354CA10D44C56305CB4B1D9B7D4A0D@a0040semail3.pinellas.local>	<4B54BF94.4060306@bobich.net>
	<D24B5EDEE969354CA10D44C56305CB4B1D9B7D4A11@a0040semail3.pinellas.local>
Message-ID: <4B54C97E.5010209@harazd.net>

Fagnon Raymond wrote:
> Why does it not show up in ifconfig?

Hi,
generally, ifconfig is obsolete.
This article might be helpful to clarify this issue:
http://kbase.redhat.com/faq/docs/DOC-20647

Regards

-- 
Bohdan Sydor
RHC{E,I,X}
www.sydor.net



From pcs.fagnonr at pcsb.org  Mon Jan 18 21:24:48 2010
From: pcs.fagnonr at pcsb.org (Fagnon Raymond)
Date: Mon, 18 Jan 2010 16:24:48 -0500
Subject: [Linux-cluster] Luci home page
Message-ID: <D24B5EDEE969354CA10D44C56305CB4B1D9B7D4A17@a0040semail3.pinellas.local>

When I log into my luci homepage and click on cluster. My Cluster does not appear.  The servers show up under the storage tab and so forth.

As far as I know this is a nfs cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100118/4a2b8d31/attachment.htm>

From celsowebber at yahoo.com  Tue Jan 19 01:12:39 2010
From: celsowebber at yahoo.com (Celso K. Webber)
Date: Mon, 18 Jan 2010 17:12:39 -0800 (PST)
Subject: [Linux-cluster] Luci home page
In-Reply-To: <D24B5EDEE969354CA10D44C56305CB4B1D9B7D4A17@a0040semail3.pinellas.local>
References: <D24B5EDEE969354CA10D44C56305CB4B1D9B7D4A17@a0040semail3.pinellas.local>
Message-ID: <5571.59324.qm@web32106.mail.mud.yahoo.com>

Hi,

I don't know if your problem is the same as mine, but I'm experiencing a problem where luci is a little bit unstable.

I can log in into luci, but when I click the "cluster" tab, it takes some time and returns:
"An error occurred when trying to contact any of the nodes in the rhcs-xen cluster."

In the server where luci is running (node1), I see the following messages in /var/log/messages:
Jan  7 17:43:01 node1 luci[5262]: Error reading from node2.localdomain:11111: timeout
Jan  7 17:43:01 node1 luci[5262]: Error reading from node2.localdomain:11111: timeout

Jan  7 17:43:06 node1 luci[5262]: Error reading from node1.localdomain:11111: timeout
Jan  7 17:43:06 node1 luci[5262]: Error reading from node1.localdomain:11111: timeout


I've checked name resolution, host names, /etc/hosts, etc. Firewall and SELinux is disabled. I can do a "telnet nodeX.localdomain 11111" from anywhere successfully.

Whenever I receive this kind of "timeout error" in luci, if I click the same link again or if I do a "reload" in the browser, then luci usually responds correctly. There is the inconvenience of re-clicking almost every link I use in luci.

I'm using a fresh RHEL 5.4 installation, did "yum update" recently and the problem persistend. I indeed tried to remove luci and ricci, rm -rf /var/lib/luci and /var/lib/ricci, but with no change in behaviour.

Does anyone has these same symptoms?

Thanks, Celso.



________________________________
From: Fagnon Raymond <pcs.fagnonr at pcsb.org>
To: linux clustering <linux-cluster at redhat.com>
Sent: Mon, January 18, 2010 7:24:48 PM
Subject: [Linux-cluster] Luci home page

 
When I log into my luci homepage and click on cluster. My
Cluster does not appear.  The servers show up under the storage tab and so
forth.
 
As far as I know this is a nfs cluster 


      
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100118/43b3c95f/attachment.htm>

From celsowebber at yahoo.com  Tue Jan 19 01:20:21 2010
From: celsowebber at yahoo.com (Celso K. Webber)
Date: Mon, 18 Jan 2010 17:20:21 -0800 (PST)
Subject: [Linux-cluster] virtual ip
In-Reply-To: <d67f5d1b1001181238x1ace12acyd53045c8bae765ef@mail.gmail.com>
References: <mailman.0.1263844027.8387.linux-cluster@redhat.com>
	<D24B5EDEE969354CA10D44C56305CB4B1D9B7D4A0D@a0040semail3.pinellas.local>
	<4B54BF94.4060306@bobich.net>
	<D24B5EDEE969354CA10D44C56305CB4B1D9B7D4A11@a0040semail3.pinellas.local>
	<d67f5d1b1001181238x1ace12acyd53045c8bae765ef@mail.gmail.com>
Message-ID: <967276.46324.qm@web32103.mail.mud.yahoo.com>

Hello,

As far as I can tell, this way of adding VIPs was introduced in RH Cluster Suite 4.0 and onwards, Cluster Suite under RHEL versions 2.1 and 3.x used to implement VIPs using Linux alias interfaces (eth0:0, eth0:1, etc.).

Celso.




________________________________
From: Stephen Benjamin <skbenja at gmail.com>
To: linux clustering <linux-cluster at redhat.com>
Sent: Mon, January 18, 2010 6:38:54 PM
Subject: Re: [Linux-cluster] virtual ip

A subinterface (eth#:1) would show up in ifconfig, but RHCS adds the VIP directly to the eth# interface itself as a second address, so you have to use 'ip addr' to view all the IP's. Ifconfig will only show you the first.


- Steve



On Mon, Jan 18, 2010 at 3:23 PM, Fagnon Raymond <pcs.fagnonr at pcsb.org> wrote:

>
>
>Thank you
>
>>Why does it not show up in ifconfig?
>
>
>>-----Original Message-----
>>From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Gordan Bobic
>>
>
>Sent: Monday, January 18, 2010 3:08 PM
>>To: linux clustering
>>Subject: Re: [Linux-cluster] virtual ip
>
>>Fagnon Raymond wrote:
>>> Hello and thank you for the help in advance. I am in charge of a two node HA cluster. I am trying to figure out where the virtual ip resides. When I do an ifconfig on node one there is no alias. The same on node two. Where does the virtual ip get set at, and what command can you type to find out the virtual ip?
>
>>ip addr
>
>>man ip
>
>>That should get you moving in the right direction.
>
>>Gordan
>
>>--
>>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster
>
>>--
>>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster
>



      
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100118/b9f9fe77/attachment.htm>

From pradhanparas at gmail.com  Tue Jan 19 15:44:48 2010
From: pradhanparas at gmail.com (Paras pradhan)
Date: Tue, 19 Jan 2010 09:44:48 -0600
Subject: [Linux-cluster] Luci home page
In-Reply-To: <5571.59324.qm@web32106.mail.mud.yahoo.com>
References: <D24B5EDEE969354CA10D44C56305CB4B1D9B7D4A17@a0040semail3.pinellas.local>
	<5571.59324.qm@web32106.mail.mud.yahoo.com>
Message-ID: <8b711df41001190744p527cfdf4s17b19f90be8fc845@mail.gmail.com>

Yes I believe lots of people are having this issue including me. Luci is
almost useless.. specially the storage tab.

Paras.


On Mon, Jan 18, 2010 at 7:12 PM, Celso K. Webber <celsowebber at yahoo.com>wrote:

> Hi,
>
> I don't know if your problem is the same as mine, but I'm experiencing a
> problem where luci is a little bit unstable.
>
> I can log in into luci, but when I click the "cluster" tab, it takes some
> time and returns:
> "An error occurred when trying to contact any of the nodes in the rhcs-xen
> cluster."
>
> In the server where luci is running (node1), I see the following messages
> in /var/log/messages:
> Jan  7 17:43:01 node1 luci[5262]: Error reading from
> node2.localdomain:11111: timeout
> Jan  7 17:43:01 node1 luci[5262]: Error reading from
> node2.localdomain:11111: timeout
>
> Jan  7 17:43:06 node1 luci[5262]: Error reading from
> node1.localdomain:11111: timeout
> Jan  7 17:43:06 node1 luci[5262]: Error reading from
> node1.localdomain:11111: timeout
>
> I've checked name resolution, host names, /etc/hosts, etc. Firewall and
> SELinux is disabled. I can do a "telnet nodeX.localdomain 11111" from
> anywhere successfully.
>
> Whenever I receive this kind of "timeout error" in luci, if I click the
> same link again or if I do a "reload" in the browser, then luci usually
> responds correctly. There is the inconvenience of re-clicking almost every
> link I use in luci.
>
> I'm using a fresh RHEL 5.4 installation, did "yum update" recently and the
> problem persistend. I indeed tried to remove luci and ricci, rm -rf
> /var/lib/luci and /var/lib/ricci, but with no change in behaviour.
>
> Does anyone has these same symptoms?
>
> Thanks, Celso.
>
> ------------------------------
> *From:* Fagnon Raymond <pcs.fagnonr at pcsb.org>
> *To:* linux clustering <linux-cluster at redhat.com>
> *Sent:* Mon, January 18, 2010 7:24:48 PM
> *Subject:* [Linux-cluster] Luci home page
>
>  When I log into my luci homepage and click on cluster. My Cluster does
> not appear.  The servers show up under the storage tab and so forth.
>
>
>
> As far as I know this is a nfs cluster
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100119/60ab4f24/attachment.htm>

From pcs.fagnonr at pcsb.org  Tue Jan 19 16:35:52 2010
From: pcs.fagnonr at pcsb.org (Fagnon Raymond)
Date: Tue, 19 Jan 2010 11:35:52 -0500
Subject: [Linux-cluster] GFS and other questions
Message-ID: <D24B5EDEE969354CA10D44C56305CB4B1D9B7D4BBD@a0040semail3.pinellas.local>

Is there a command that will tell you how many journals are on a GFS?  Also is there a command to tell what the status of a cluster is information on the nodes and so forth? I have been doing a lot of reading but most it covers high level stuff.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100119/0d477bab/attachment.htm>

From pradhanparas at gmail.com  Tue Jan 19 16:58:27 2010
From: pradhanparas at gmail.com (Paras pradhan)
Date: Tue, 19 Jan 2010 10:58:27 -0600
Subject: [Linux-cluster] GFS and other questions
In-Reply-To: <D24B5EDEE969354CA10D44C56305CB4B1D9B7D4BBD@a0040semail3.pinellas.local>
References: <D24B5EDEE969354CA10D44C56305CB4B1D9B7D4BBD@a0040semail3.pinellas.local>
Message-ID: <8b711df41001190858h3c5ab12bs430355590e75268c@mail.gmail.com>

Journals:

gfs2_tool df |grep Journals

Cluster status:

clustat


Paras.


On Tue, Jan 19, 2010 at 10:35 AM, Fagnon Raymond <pcs.fagnonr at pcsb.org>wrote:

>  Is there a command that will tell you how many journals are on a GFS?
> Also is there a command to tell what the status of a cluster is information
> on the nodes and so forth? I have been doing a lot of reading but most it
> covers high level stuff.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100119/b5622bf1/attachment.htm>

From celsowebber at yahoo.com  Tue Jan 19 17:39:37 2010
From: celsowebber at yahoo.com (Celso K. Webber)
Date: Tue, 19 Jan 2010 09:39:37 -0800 (PST)
Subject: [Linux-cluster] Luci home page
In-Reply-To: <8b711df41001190744p527cfdf4s17b19f90be8fc845@mail.gmail.com>
References: <D24B5EDEE969354CA10D44C56305CB4B1D9B7D4A17@a0040semail3.pinellas.local>
	<5571.59324.qm@web32106.mail.mud.yahoo.com>
	<8b711df41001190744p527cfdf4s17b19f90be8fc845@mail.gmail.com>
Message-ID: <446680.18433.qm@web32107.mail.mud.yahoo.com>

Hi Paras,

Do you also see the same error messages I see in my /var/log/messages file?
Jan  7 17:43:01 node1 luci[5262]: Error reading from node2.localdomain:11111: timeout
Jan  7 17:43:01 node1 luci[5262]: Error reading from node2.localdomain:11111: timeout

Jan  7 17:43:06 node1 luci[5262]: Error reading from node1.localdomain:11111: timeout
Jan  7 17:43:06 node1 luci[5262]: Error reading from node1.localdomain:11111: timeout

If so, I believe we have a similar problem, but in my case, I'm using the Cluster tab only.

Regards, Celso.




________________________________
From: Paras pradhan <pradhanparas at gmail.com>
To: linux clustering <linux-cluster at redhat.com>
Sent: Tue, January 19, 2010 1:44:48 PM
Subject: Re: [Linux-cluster] Luci home page

Yes I believe lots of people are having this issue including me. Luci is almost useless.. specially the storage tab.

Paras.



On Mon, Jan 18, 2010 at 7:12 PM, Celso K. Webber <celsowebber at yahoo.com> wrote:

Hi,
>
>I don't know if your problem is the same as mine, but I'm experiencing a problem where luci is a little bit unstable.
>
>I can log in into luci, but when I click the "cluster" tab, it takes some time and returns:
>"An error occurred when trying to contact any of the nodes in the rhcs-xen cluster."
>
>In the server where luci is running (node1), I see the following messages in /var/log/messages:
>>Jan  7 17:43:01 node1 luci[5262]: Error reading from node2.localdomain:11111: timeout
>Jan  7 17:43:01 node1 luci[5262]: Error reading from node2.localdomain:11111: timeout
>
>Jan  7 17:43:06 node1 luci[5262]: Error reading from node1.localdomain:11111: timeout
>>Jan  7 17:43:06 node1 luci[5262]: Error reading from node1.localdomain:11111:
> timeout
>
>
>I've checked name resolution, host names, /etc/hosts, etc. Firewall and SELinux is disabled. I can do a "telnet nodeX.localdomain 11111" from anywhere successfully.
>
>Whenever I receive this kind of "timeout error" in luci, if I click the same link again or if I do a "reload" in the browser, then luci usually responds correctly. There is the inconvenience of re-clicking almost every link I use in luci.
>
>I'm using a fresh RHEL 5.4 installation, did "yum update" recently and the problem persistend. I indeed tried to remove luci and ricci, rm -rf /var/lib/luci and /var/lib/ricci, but with no change in behaviour.
>
>Does anyone has these same symptoms?
>
>Thanks, Celso.
>
>
>
________________________________
From: Fagnon Raymond <pcs.fagnonr at pcsb.org>
>To: linux clustering <linux-cluster at redhat.com>
>Sent: Mon, January 18, 2010 7:24:48 PM
>Subject: [Linux-cluster] Luci home page
>
>
>
>>
>When I log into my luci homepage and click on cluster. My
>Cluster does not appear.  The servers show up under the storage tab and so
>forth.
> 
>As far as I know this is a nfs cluster 
>
>--
>>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster
>



      
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100119/49db0b69/attachment.htm>

From dirk.schulz at kinzesberg.de  Tue Jan 19 18:14:28 2010
From: dirk.schulz at kinzesberg.de (Dirk H. Schulz)
Date: Tue, 19 Jan 2010 19:14:28 +0100
Subject: [Linux-cluster] Luci home page
In-Reply-To: <5571.59324.qm@web32106.mail.mud.yahoo.com>
References: <D24B5EDEE969354CA10D44C56305CB4B1D9B7D4A17@a0040semail3.pinellas.local>
	<5571.59324.qm@web32106.mail.mud.yahoo.com>
Message-ID: <4B55F684.9070607@kinzesberg.de>

I have experienced this kind of difficulties when I started testing 
conga. One of the first things I tried was setting SElinux to permissive 
on the conga server, and from then on I could work well with it.

I did not look into audit.log to find out if there is a solution to it, 
so far.

Dirk

Celso K. Webber schrieb:
> Hi,
>
> I don't know if your problem is the same as mine, but I'm experiencing 
> a problem where luci is a little bit unstable.
>
> I can log in into luci, but when I click the "cluster" tab, it takes 
> some time and returns:
> "An error occurred when trying to contact any of the nodes in the 
> rhcs-xen cluster."
>
> In the server where luci is running (node1), I see the following 
> messages in /var/log/messages:
> Jan  7 17:43:01 node1 luci[5262]: Error reading from 
> node2.localdomain:11111: timeout
> Jan  7 17:43:01 node1 luci[5262]: Error reading from 
> node2.localdomain:11111: timeout
>
> Jan  7 17:43:06 node1 luci[5262]: Error reading from 
> node1.localdomain:11111: timeout
> Jan  7 17:43:06 node1 luci[5262]: Error reading from 
> node1.localdomain:11111: timeout
>
> I've checked name resolution, host names, /etc/hosts, etc. Firewall 
> and SELinux is disabled. I can do a "telnet nodeX.localdomain 11111" 
> from anywhere successfully.
>
> Whenever I receive this kind of "timeout error" in luci, if I click 
> the same link again or if I do a "reload" in the browser, then luci 
> usually responds correctly. There is the inconvenience of re-clicking 
> almost every link I use in luci.
>
> I'm using a fresh RHEL 5.4 installation, did "yum update" recently and 
> the problem persistend. I indeed tried to remove luci and ricci, rm 
> -rf /var/lib/luci and /var/lib/ricci, but with no change in behaviour.
>
> Does anyone has these same symptoms?
>
> Thanks, Celso.
>
> ------------------------------------------------------------------------
> *From:* Fagnon Raymond <pcs.fagnonr at pcsb.org>
> *To:* linux clustering <linux-cluster at redhat.com>
> *Sent:* Mon, January 18, 2010 7:24:48 PM
> *Subject:* [Linux-cluster] Luci home page
>
> When I log into my luci homepage and click on cluster. My Cluster does 
> not appear.  The servers show up under the storage tab and so forth.
>
>  
>
> As far as I know this is a nfs cluster
>
>
> ------------------------------------------------------------------------
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster




From dirk.schulz at kinzesberg.de  Tue Jan 19 18:18:38 2010
From: dirk.schulz at kinzesberg.de (Dirk H. Schulz)
Date: Tue, 19 Jan 2010 19:18:38 +0100
Subject: [Linux-cluster] Problems with RHCS across firewalls
In-Reply-To: <4B549900.2030002@gmail.com>
References: <4B5437B7.3060901@gmail.com>
	<4B547856.6050401@redhat.com>	<4B548070.4030907@gmail.com>
	<4B549900.2030002@gmail.com>
Message-ID: <4B55F77E.6070306@kinzesberg.de>

This is what I am using:

UDP: 5404,5405,50007
TCP: 11111,16851,21064,41966,41967,41968,41969,50006,50008,50009

This works so far.

Dirk


carlopmart schrieb:
> carlopmart wrote:
>> Christine Caulfield wrote:
>>> On 18/01/10 10:28, carlopmart wrote:
>>>> Hi all,
>>>>
>>>> I have several problems to setup a rhcs (two nodes) when are installed
>>>> and secured on two separated networks with a different firewalls.
>>>>
>>>> I have setup these rules on my internal firewalls to allow traffic for
>>>> rhcs:
>>>>
>>>> source: hostA
>>>> destination: hostB
>>>> services: 5404 and 50405 udp, 21064 tcp, 50006-50008-50009 tcp, 
>>>> 50007 udp
>>>>
>>>> source: hostB
>>>> destination: hostA
>>>> services: 5404 and 50405 udp, 21064 tcp, 50006-50008-50009 tcp, 
>>>> 50007 udp
>>>>
>>>> source: hostA and hostB
>>>> destination: 255.255.255.255
>>>> services: 5404 and 50405 udp
>>>>
>>>> source: hostA and hostB
>>>> destination: multicast_address
>>>> service: any
>>>>
>>>> ... but it doesn't works ... What am I doing wrong?? I don't use 
>>>> NAT on
>>>> these firewalls due to all traffic needs to across internal networks.
>>>>
>>>> Thanks.
>>>>
>>>
>>> You mention port 50405 - that should be 5405. Now that might be a 
>>> typo in your email, but if that's the port you've opened on the 
>>> servers then the cluster won't start.
>>>
>>> Here's the documentation for enabling ports for cluster suite:
>>>
>>> http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5.4/html/Cluster_Administration/s1-iptables-CA.html 
>>>
>>>
>>> Chrissie
>>>
>>> -- 
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>
>>
>> Sorry, 50405 is a typo. I am using 5405 port under firewall policies ...
>>
>>
> Nothing??
>




From pradhanparas at gmail.com  Tue Jan 19 18:25:50 2010
From: pradhanparas at gmail.com (Paras pradhan)
Date: Tue, 19 Jan 2010 12:25:50 -0600
Subject: [Linux-cluster] Luci home page
In-Reply-To: <446680.18433.qm@web32107.mail.mud.yahoo.com>
References: <D24B5EDEE969354CA10D44C56305CB4B1D9B7D4A17@a0040semail3.pinellas.local>
	<5571.59324.qm@web32106.mail.mud.yahoo.com>
	<8b711df41001190744p527cfdf4s17b19f90be8fc845@mail.gmail.com>
	<446680.18433.qm@web32107.mail.mud.yahoo.com>
Message-ID: <8b711df41001191025y176b8968lee0a043c79349e3a@mail.gmail.com>

Yes Celso. I see that 11111 timeout in my log as well.

Paras.


On Tue, Jan 19, 2010 at 11:39 AM, Celso K. Webber <celsowebber at yahoo.com>wrote:

> Hi Paras,
>
> Do you also see the same error messages I see in my /var/log/messages file?
>
> Jan  7 17:43:01 node1 luci[5262]: Error reading from
> node2.localdomain:11111: timeout
> Jan  7 17:43:01 node1 luci[5262]: Error reading from
> node2.localdomain:11111: timeout
>
> Jan  7 17:43:06 node1 luci[5262]: Error reading from
> node1.localdomain:11111: timeout
> Jan  7 17:43:06 node1 luci[5262]: Error reading from
> node1.localdomain:11111: timeout
>
> If so, I believe we have a similar problem, but in my case, I'm using the
> Cluster tab only.
>
> Regards, Celso.
>
> ------------------------------
> *From:* Paras pradhan <pradhanparas at gmail.com>
>
> *To:* linux clustering <linux-cluster at redhat.com>
> *Sent:* Tue, January 19, 2010 1:44:48 PM
> *Subject:* Re: [Linux-cluster] Luci home page
>
> Yes I believe lots of people are having this issue including me. Luci is
> almost useless.. specially the storage tab.
>
> Paras.
>
>
> On Mon, Jan 18, 2010 at 7:12 PM, Celso K. Webber <celsowebber at yahoo.com>wrote:
>
>> Hi,
>>
>> I don't know if your problem is the same as mine, but I'm experiencing a
>> problem where luci is a little bit unstable.
>>
>> I can log in into luci, but when I click the "cluster" tab, it takes some
>> time and returns:
>> "An error occurred when trying to contact any of the nodes in the rhcs-xen
>> cluster."
>>
>> In the server where luci is running (node1), I see the following messages
>> in /var/log/messages:
>> Jan  7 17:43:01 node1 luci[5262]: Error reading from
>> node2.localdomain:11111: timeout
>> Jan  7 17:43:01 node1 luci[5262]: Error reading from
>> node2.localdomain:11111: timeout
>>
>> Jan  7 17:43:06 node1 luci[5262]: Error reading from
>> node1.localdomain:11111: timeout
>> Jan  7 17:43:06 node1 luci[5262]: Error reading from
>> node1.localdomain:11111: timeout
>>
>> I've checked name resolution, host names, /etc/hosts, etc. Firewall and
>> SELinux is disabled. I can do a "telnet nodeX.localdomain 11111" from
>> anywhere successfully.
>>
>> Whenever I receive this kind of "timeout error" in luci, if I click the
>> same link again or if I do a "reload" in the browser, then luci usually
>> responds correctly. There is the inconvenience of re-clicking almost every
>> link I use in luci.
>>
>> I'm using a fresh RHEL 5.4 installation, did "yum update" recently and the
>> problem persistend. I indeed tried to remove luci and ricci, rm -rf
>> /var/lib/luci and /var/lib/ricci, but with no change in behaviour.
>>
>> Does anyone has these same symptoms?
>>
>> Thanks, Celso.
>>
>> ------------------------------
>> *From:* Fagnon Raymond <pcs.fagnonr at pcsb.org>
>> *To:* linux clustering <linux-cluster at redhat.com>
>> *Sent:* Mon, January 18, 2010 7:24:48 PM
>> *Subject:* [Linux-cluster] Luci home page
>>
>>  When I log into my luci homepage and click on cluster. My Cluster does
>> not appear.  The servers show up under the storage tab and so forth.
>>
>>
>>
>> As far as I know this is a nfs cluster
>>
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100119/83e2258d/attachment.htm>

From celsowebber at yahoo.com  Tue Jan 19 21:24:51 2010
From: celsowebber at yahoo.com (Celso K. Webber)
Date: Tue, 19 Jan 2010 13:24:51 -0800 (PST)
Subject: [Linux-cluster] Luci home page
In-Reply-To: <4B55F684.9070607@kinzesberg.de>
References: <D24B5EDEE969354CA10D44C56305CB4B1D9B7D4A17@a0040semail3.pinellas.local>
	<5571.59324.qm@web32106.mail.mud.yahoo.com>
	<4B55F684.9070607@kinzesberg.de>
Message-ID: <782081.52039.qm@web32102.mail.mud.yahoo.com>

Hi Dirk,

I've double checked my environment, and I have SELinux DISABLED on both nodes, as I have the firewall DISABLED also.

Thanks.



----- Original Message ----
From: Dirk H. Schulz <dirk.schulz at kinzesberg.de>
To: linux clustering <linux-cluster at redhat.com>
Sent: Tue, January 19, 2010 4:14:28 PM
Subject: Re: [Linux-cluster] Luci home page

I have experienced this kind of difficulties when I started testing conga. One of the first things I tried was setting SElinux to permissive on the conga server, and from then on I could work well with it.

I did not look into audit.log to find out if there is a solution to it, so far.

Dirk

Celso K. Webber schrieb:
> Hi,
> 
> I don't know if your problem is the same as mine, but I'm experiencing a problem where luci is a little bit unstable.
> 
> I can log in into luci, but when I click the "cluster" tab, it takes some time and returns:
> "An error occurred when trying to contact any of the nodes in the rhcs-xen cluster."
> 
> In the server where luci is running (node1), I see the following messages in /var/log/messages:
> Jan  7 17:43:01 node1 luci[5262]: Error reading from node2.localdomain:11111: timeout
> Jan  7 17:43:01 node1 luci[5262]: Error reading from node2.localdomain:11111: timeout
> 
> Jan  7 17:43:06 node1 luci[5262]: Error reading from node1.localdomain:11111: timeout
> Jan  7 17:43:06 node1 luci[5262]: Error reading from node1.localdomain:11111: timeout
> 
> I've checked name resolution, host names, /etc/hosts, etc. Firewall and SELinux is disabled. I can do a "telnet nodeX.localdomain 11111" from anywhere successfully.
> 
> Whenever I receive this kind of "timeout error" in luci, if I click the same link again or if I do a "reload" in the browser, then luci usually responds correctly. There is the inconvenience of re-clicking almost every link I use in luci.
> 
> I'm using a fresh RHEL 5.4 installation, did "yum update" recently and the problem persistend. I indeed tried to remove luci and ricci, rm -rf /var/lib/luci and /var/lib/ricci, but with no change in behaviour.
> 
> Does anyone has these same symptoms?
> 
> Thanks, Celso.
> 
> ------------------------------------------------------------------------
> *From:* Fagnon Raymond <pcs.fagnonr at pcsb.org>
> *To:* linux clustering <linux-cluster at redhat.com>
> *Sent:* Mon, January 18, 2010 7:24:48 PM
> *Subject:* [Linux-cluster] Luci home page
> 
> When I log into my luci homepage and click on cluster. My Cluster does not appear.  The servers show up under the storage tab and so forth.
> 
>  
> As far as I know this is a nfs cluster
> 
> 
> ------------------------------------------------------------------------
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



      



From rmicmirregs at gmail.com  Tue Jan 19 22:27:29 2010
From: rmicmirregs at gmail.com (Rafael =?ISO-8859-1?Q?Mic=F3?= Miranda)
Date: Tue, 19 Jan 2010 23:27:29 +0100
Subject: [Linux-cluster] Expected qdiskd behaviour on node failure and
	reboot implementation
Message-ID: <1263940049.6656.21.camel@mecatol>

Hi all,

Today I was shocked while making a test to one of my cluster
configurations.

A) Environment
- 6 x different servers used as cluster nodes, with dual FC HBA
- iLO/DRAC fencing devices for each cluster node
- 2 x different fabrics, each build with 3 FC SAN switches
- 2 x storage arrays, with 23 270GB LUNs of data each. 
- 1x Qdisk: a 24th LUN located in one of the storage arrays
- 2x qdisk heuristics

B) Test
- Removing the 2 service interface wires on node A

C) Expected behaviour (due to qdisk and cman timers)
- Qdiskd should notice the lost of the heuristics on node A
- CMAN should notice the lost of connectivity with node A
- The rest of the nodes should fence node A

D) Experienced behaviour
- Qdisk notices the lost of the heuristic on node A
- Qdisk reboots via "hard reset" node A
- CMAN notices the lost of connectivity with node A
- The rest of the nodes fence it (I see the 2 reboots in the iLO log of
the system)

I was shocked with the capacity of Qdisk of doing a "hard reset" of the
system. I mean: it was not a clean shutdown of the system via a "reboot"
or "poweroff" O.S. command. It was more likely to be a power reset in
the system. I was expecting to qdisk to stop the CMAN service or, in the
strongest situation, doing a clean reboot of the system.

After that, I found this in the qdisk man page:

"By default, only nodes scoring over 1/2 of the total maximum score will
claim they are available via the quorum disk, and a node (master or
otherwise) whose score drops too low will remove itself (usually, by
rebooting).
[...]
reboot="1"
        If set to 0 (off), qdiskd will *not* reboot after a negative
        transition as a result in a change in score (see section 2.2).
        The default for this value is 1 (on)."
        
So my thoughts were wrong and this is the default behaviour, isn't it?
I'm pretty sure in my previous tests I did not see this behaviour.

Another question is: how does qdisk implement the "reboot" function? Is
it really a "hard reset"? 

Thanks in advance,

Rafael

-- 
Rafael Mic? Miranda



From pradhanparas at gmail.com  Tue Jan 19 22:59:52 2010
From: pradhanparas at gmail.com (Paras pradhan)
Date: Tue, 19 Jan 2010 16:59:52 -0600
Subject: [Linux-cluster] statfs
Message-ID: <8b711df41001191459x41c55814pf0de439f450debbd@mail.gmail.com>

hi.

Why I am not seeing statfs_fast ?

[root at prd tune]# gfs2_tool gettune /guest_vms1
new_files_directio = 0
new_files_jdata = 0
quota_scale = 1.0000   (1, 1)
logd_secs = 1
recoverd_secs = 60
statfs_quantum = 30
stall_secs = 600
quota_cache_secs = 300
quota_simul_sync = 64
statfs_slow = 0
complain_secs = 10
max_readahead = 262144
quota_quantum = 60
quota_warn_period = 10
jindex_refresh_secs = 60
log_flush_secs = 60
incore_log_blocks = 1024

-----

I do not see statfs_fast at this location. /sys/fs/gfs2/prd:guest_vms2/tune/


OS: 5.4 EL

Thanks!
Paras.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100119/24f4706b/attachment.htm>

From mammadshah at hotmail.com  Wed Jan 20 05:16:00 2010
From: mammadshah at hotmail.com (Muhammad Ammad Shah)
Date: Wed, 20 Jan 2010 10:16:00 +0500
Subject: [Linux-cluster] Red Hat two node cluster to setup Oracle 11
Message-ID: <BLU110-W20C5B787E97B35E96E4A10D4640@phx.gbl>




HI,

I want to install oracle 11 on RHEL 5.3, following are hardware details.

HP-DL 580  (2 Nodes)
HP-Storage ( 4.5TB=DB storage, 100MB=Quorum Disk)

Need help to setup Oracle RHEL cluster. i have performed the following steps.

gfs_mkfs -p lock_dlm -t alpha:mydata1 -j 8 /dev/vg01/lvol0    
mkqdisk -c /dev/vg01/lvol1  -l db_qdisk

I want to know, Parameter to define in "system-config-cluster" for "quorum disk" and for "oracle " 

 
Thanks,
Muhammad Ammad Shah
 


 		 	   		  
_________________________________________________________________
Windows Live Hotmail: Your friends can get your Facebook updates, right from Hotmail?.
http://www.microsoft.com/middleeast/windows/windowslive/see-it-in-action/social-network-basics.aspx?ocid=PID23461::T:WLMTAGL:ON:WL:en-xm:SI_SB_4:092009
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100120/dcaf0e0b/attachment.htm>

From carlopmart at gmail.com  Wed Jan 20 09:06:30 2010
From: carlopmart at gmail.com (carlopmart)
Date: Wed, 20 Jan 2010 10:06:30 +0100
Subject: [Linux-cluster] Problems with RHCS across firewalls (SOLVED)
In-Reply-To: <4B55F77E.6070306@kinzesberg.de>
References: <4B5437B7.3060901@gmail.com>	<4B547856.6050401@redhat.com>	<4B548070.4030907@gmail.com>	<4B549900.2030002@gmail.com>
	<4B55F77E.6070306@kinzesberg.de>
Message-ID: <4B56C796.4090802@gmail.com>

Dirk H. Schulz wrote:
> This is what I am using:
> 
> UDP: 5404,5405,50007
> TCP: 11111,16851,21064,41966,41967,41968,41969,50006,50008,50009
> 
> This works so far.
> 
> Dirk
> 
> 

Ummm... Are you sure?? My problem was multicast routing. I have configured my 
internal firewalls to allow multicast routing and all works ok now (using firewall 
rules previously posted).

Thanks.

-- 
CL Martinez
carlopmart {at} gmail {d0t} com



From jose.neto at liber4e.com  Wed Jan 20 09:58:41 2010
From: jose.neto at liber4e.com (jose nuno neto)
Date: Wed, 20 Jan 2010 09:58:41 -0000 (GMT)
Subject: [Linux-cluster] fence_ipmilan
In-Reply-To: <4B5461C2.50804@redhat.com>
References: <cccb38f22fd2bc70945da5b1281ea934.squirrel@fela.liber4e.com>
	<4B5461C2.50804@redhat.com>
Message-ID: <56b46216bfab23c2a48e7756f995e6fb.squirrel@fela.liber4e.com>

HI
Thanks for the info

I'm having other fencing problems with fence_ilo and I'm thinking of
writing a script to do the fencing, ipmitool works fine, I'll just write a
script to call it. what do you think?

Also, I'd like to know what fencing solutions are the most common, reliable.
I'm doing a proof-of-concept for a big institution and would like to know
what are the more stable solutions

Best Regards
Jose


> Jose,
> this is known bug/feature of newer IPMI implementations. See
> https://bugzilla.redhat.com/show_bug.cgi?id=482913 for more informations.
>
> Regards,
>   Honza
>
> jose nuno neto wrote:
>> Hi
>>
>> Sorry for not posting the info before....
>> I repeated the steps, and now fence_ipmilan -o reboot powers down the
>> machine, no power up, and ILOm gets reseted to factory defaults ????
>>
>> Below is the info, I've been finding the fence_ipmilan not very
>> reliable,
>> any one using it in prod? what is the most used fencing methods?
>>
>>
>> ##### Info for ipmi #####
>> SUN BLADE X6270 SERVER MODULE, ILOM v3.0.6.10, r48729
>> fence_ipmilan 2.0.115 (built Sep  2 2009 11:45:31)
>>
>> fence_ipmilan -a 172.26.18.3 -l xxxx -p xxxx -o status
>> Getting status of IPMI:172.26.18.3...Chassis power = On
>> Done
>>
>> fence_ipmilan -a 172.26.255.1 -l xxxx -p xxxx -o reboot
>> Rebooting machine @ IPMI:172.26.255.1...ipmilan: Failed to connect after
>> 10 seconds
>>
>> Powers Down, no power Up, Ilo goes to factory Reset
>>
>> /etc/ipmi_lan.conf
>> addr 0.0.0.0
>> priv_limit admin
>> allowed_auths_callback md5
>> allowed_auths_user md5
>> allowed_auths_operator md5
>> allowed_auths_admin md5
>> user 2 true xxxx xxxx admin 1 md5
>>
>> ipmitool lan print
>>
>> Set in Progress         : Set Complete
>> Auth Type Support       : NONE MD2 MD5 PASSWORD
>> Auth Type Enable        : Callback : MD2 MD5 PASSWORD
>>                         : User     : MD2 MD5 PASSWORD
>>                         : Operator : MD2 MD5 PASSWORD
>>                         : Admin    : MD2 MD5 PASSWORD
>>                         : OEM      :
>> IP Address Source       : Static Address
>> IP Address              : 172.26.18.5
>> Subnet Mask             : 255.255.240.0
>> MAC Address             : 00:14:4f:ca:83:4e
>> SNMP Community String   : public
>> IP Header               : TTL=0x00 Flags=0x00 Precedence=0x00 TOS=0x00
>> BMC ARP Control         : ARP Responses Enabled, Gratuitous ARP Enabled
>> Gratituous ARP Intrvl   : 5.0 seconds
>> Default Gateway IP      : 0.0.0.0
>> Default Gateway MAC     : 00:00:00:00:00:00
>> Backup Gateway IP       : 0.0.0.0
>> Backup Gateway MAC      : 00:00:00:00:00:00
>> 802.1q VLAN ID          : Disabled
>> 802.1q VLAN Priority    : 0
>> RMCP+ Cipher Suites     : 2,3,0
>> Cipher Suite Priv Max   : XXXXXXXXXXXXXXX
>>                         :     X=Cipher Suite Unused
>>                         :     c=CALLBACK
>>                         :     u=USER
>>                         :     o=OPERATOR
>>                         :     a=ADMIN
>>                         :     O=OEM
>>
>>
>>
>> ipmitool -U ipmi -P ipmi -H malta8 chassis status
>> System Power         : on
>> Power Overload       : false
>> Power Interlock      : inactive
>> Main Power Fault     : false
>> Power Control Fault  : false
>> Power Restore Policy : always-off
>> Last Power Event     :
>> Chassis Intrusion    : inactive
>> Front-Panel Lockout  : inactive
>> Drive Fault          : false
>> Cooling/Fan Fault    : false
>>
>>
>>
>>> On 1/15/2010 3:53 PM, jose nuno neto wrote:
>>>> Hi,
>>>> I'm trying to set up fencing with ipmi on 2node cluster.
>>>> I have ipmi working fine and I can issue shutdown/reboot commands
>>>> through
>>>> ipmitool, but when I use fence_ipmilan the command is executed but
>> fence_ipmilan keeps waiting for aknowledge ( I guess ) and closes with
>> error.
>>>> Then fenced daemon keeps trying to fence again.
>>>> Is there a way to prevent this? What is fenced_ipmilan expecting from
>>>> the
>>>> machine it has been shutdown?
>>>> Much Thanks
>>>> Jose
>>> Might be helpful to know what version of fence_ipmilan you are using
>>> and
>> how you configured it in cluster.conf. Did you try using fence_ipmilan
>> from command line?
>>> Fabio
>>>
>>> --
>>> This message has been scanned for viruses and
>>> dangerous content by MailScanner, and is
>>> believed to be clean.
>>>
>>
>>
>>
>>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> --
> This message has been scanned for viruses and
> dangerous content by MailScanner, and is
> believed to be clean.
>


-- 
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.



From swhiteho at redhat.com  Wed Jan 20 10:31:18 2010
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Wed, 20 Jan 2010 10:31:18 +0000
Subject: [Linux-cluster] statfs
In-Reply-To: <8b711df41001191459x41c55814pf0de439f450debbd@mail.gmail.com>
References: <8b711df41001191459x41c55814pf0de439f450debbd@mail.gmail.com>
Message-ID: <1263983478.2528.9.camel@localhost>

Hi,

On Tue, 2010-01-19 at 16:59 -0600, Paras pradhan wrote:
> hi.
> 
> 
> Why I am not seeing statfs_fast ?
> 
> 
> [root at prd tune]# gfs2_tool gettune /guest_vms1
> new_files_directio = 0
> new_files_jdata = 0
> quota_scale = 1.0000   (1, 1)
> logd_secs = 1
> recoverd_secs = 60
> statfs_quantum = 30
> stall_secs = 600
> quota_cache_secs = 300
> quota_simul_sync = 64
> statfs_slow = 0
  ^^^^ Its called statfs_slow

Steve.





From dirk.schulz at kinzesberg.de  Wed Jan 20 14:14:02 2010
From: dirk.schulz at kinzesberg.de (Dirk H. Schulz)
Date: Wed, 20 Jan 2010 15:14:02 +0100
Subject: [Linux-cluster] Snapshotting GFS and freezing
In-Reply-To: <20100118103439.GK17978@reaktio.net>
References: <4B4F527F.8030105@kinzesberg.de>	<64D0546C5EBBD147B75DE133D798665F03F3F472@hugo.eprize.local>	<20100115093514.GJ17978@reaktio.net>	<20100115171211.GA13575@esri.com>
	<4B5314D4.8010900@kinzesberg.de>
	<20100118103439.GK17978@reaktio.net>
Message-ID: <4B570FAA.1090202@kinzesberg.de>

Pasi K?rkk?inen schrieb:
> On Sun, Jan 17, 2010 at 02:47:00PM +0100, Dirk H. Schulz wrote:
>   
>> Ray Van Dolson schrieb:
>>     
>>>> Hmm.. so snapshots with CLVM are possible nowadays?
>>>>         
>>> No....
>>>
>>> RH has stated (recently on this list) that patches exist to do it, but
>>> it hasn't been a high enough priority for them to complete the work to
>>> the point where it could be distributed to customers.
>>>  
>>>       
>> Okay then, how do you people out there with clusters that run virtual 
>> machines with live migration do backups of the virtual machines?
>> I surely do not want to shut down the virtual machine to be able to copy 
>> the image safely away if I have live migration available.
>>
>> At the moment there is only one way I can see. PLEASE prove me to be wrong.
>> Searching in RedHat's documentation I found that the problem is that lvm 
>> snapshots need exclusively allocated logical volumes. So I think the 
>> following
>> should be technically possible:
>> 0. Start environment: the logical volume containing the images of the 
>> virtual machines uses gfs and is mounted on all relevant cluster nodes 
>> since VMs are running on several cluster nodes.
>> 1. All VMs have to be migrated to one of the cluster nodes
>> 2. On all other nodes, the gfs volume is unmounted
>> 3. On the remaining node (where all VMs now run) the logical volume is 
>> bound exclusively with "lvchange -aey LOGICALVOLUME"
>>    (I hope this is possible without deactivating it first)
>> 4. Now GFS on this volume is frozen: "gfs_tool freeze 
>> /mountpoint/of/local/volume"
>>     
>
> Before freezing the GFS you should make sure the VMs are in consistent
> state, and the VMs have flushed their caches/buffers/disks.
>   
That means calling  
        sync
        echo 3 > /proc/sys/vm/drop_caches
inside the VM, right? Or is there anything more to do to flush everything?

Dirk



From rvandolson at esri.com  Wed Jan 20 16:04:01 2010
From: rvandolson at esri.com (Ray Van Dolson)
Date: Wed, 20 Jan 2010 08:04:01 -0800
Subject: [Linux-cluster] Snapshotting GFS and freezing
In-Reply-To: <4B570FAA.1090202@kinzesberg.de>
References: <4B4F527F.8030105@kinzesberg.de>
	<64D0546C5EBBD147B75DE133D798665F03F3F472@hugo.eprize.local>
	<20100115093514.GJ17978@reaktio.net>
	<20100115171211.GA13575@esri.com> <4B5314D4.8010900@kinzesberg.de>
	<20100118103439.GK17978@reaktio.net>
	<4B570FAA.1090202@kinzesberg.de>
Message-ID: <20100120160401.GA10468@esri.com>

On Wed, Jan 20, 2010 at 06:14:02AM -0800, Dirk H. Schulz wrote:
> Pasi K?rkk?inen schrieb:
> > On Sun, Jan 17, 2010 at 02:47:00PM +0100, Dirk H. Schulz wrote:
> >   
> >> Ray Van Dolson schrieb:
> >>     
> >>>> Hmm.. so snapshots with CLVM are possible nowadays?
> >>>>         
> >>> No....
> >>>
> >>> RH has stated (recently on this list) that patches exist to do it, but
> >>> it hasn't been a high enough priority for them to complete the work to
> >>> the point where it could be distributed to customers.
> >>>  
> >>>       
> >> Okay then, how do you people out there with clusters that run virtual 
> >> machines with live migration do backups of the virtual machines?
> >> I surely do not want to shut down the virtual machine to be able to copy 
> >> the image safely away if I have live migration available.
> >>
> >> At the moment there is only one way I can see. PLEASE prove me to be wrong.
> >> Searching in RedHat's documentation I found that the problem is that lvm 
> >> snapshots need exclusively allocated logical volumes. So I think the 
> >> following
> >> should be technically possible:
> >> 0. Start environment: the logical volume containing the images of the 
> >> virtual machines uses gfs and is mounted on all relevant cluster nodes 
> >> since VMs are running on several cluster nodes.
> >> 1. All VMs have to be migrated to one of the cluster nodes
> >> 2. On all other nodes, the gfs volume is unmounted
> >> 3. On the remaining node (where all VMs now run) the logical volume is 
> >> bound exclusively with "lvchange -aey LOGICALVOLUME"
> >>    (I hope this is possible without deactivating it first)
> >> 4. Now GFS on this volume is frozen: "gfs_tool freeze 
> >> /mountpoint/of/local/volume"
> >>     
> >
> > Before freezing the GFS you should make sure the VMs are in consistent
> > state, and the VMs have flushed their caches/buffers/disks.
> >   
> That means calling  
>         sync
>         echo 3 > /proc/sys/vm/drop_caches
> inside the VM, right? Or is there anything more to do to flush everything?
> 
> Dirk

I would think the best way would be to actually pause the VM.  If you
snapshot a VM while it's running, even with the above, how can you
guarantee an application won't do something on the VM right as the
snapshot occurs?

Ray



From pradhanparas at gmail.com  Wed Jan 20 16:07:16 2010
From: pradhanparas at gmail.com (Paras pradhan)
Date: Wed, 20 Jan 2010 10:07:16 -0600
Subject: [Linux-cluster] statfs
In-Reply-To: <1263983478.2528.9.camel@localhost>
References: <8b711df41001191459x41c55814pf0de439f450debbd@mail.gmail.com>
	<1263983478.2528.9.camel@localhost>
Message-ID: <8b711df41001200807l69580e16g6f2fd470c19b3c01@mail.gmail.com>

So.. does statfs_slow=0 means statfs_fast has been enabled?

Paras.


On Wed, Jan 20, 2010 at 4:31 AM, Steven Whitehouse <swhiteho at redhat.com>wrote:

> Hi,
>
> On Tue, 2010-01-19 at 16:59 -0600, Paras pradhan wrote:
> > hi.
> >
> >
> > Why I am not seeing statfs_fast ?
> >
> >
> > [root at prd tune]# gfs2_tool gettune /guest_vms1
> > new_files_directio = 0
> > new_files_jdata = 0
> > quota_scale = 1.0000   (1, 1)
> > logd_secs = 1
> > recoverd_secs = 60
> > statfs_quantum = 30
> > stall_secs = 600
> > quota_cache_secs = 300
> > quota_simul_sync = 64
> > statfs_slow = 0
>   ^^^^ Its called statfs_slow
>
> Steve.
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100120/4f405585/attachment.htm>

From pradhanparas at gmail.com  Wed Jan 20 16:27:03 2010
From: pradhanparas at gmail.com (Paras pradhan)
Date: Wed, 20 Jan 2010 10:27:03 -0600
Subject: [Linux-cluster] GFS resource
Message-ID: <8b711df41001200827n1b43d259t98f553ca408e4694@mail.gmail.com>

Hi,

I have a 3 nodes cluster with shared GFS2 partition connected through SAN.
This is Xen virtualization cluster and fencing is working properly. So when
there is a problem in any nodes, the node is being fenced and other nodes
are working properly. Now my question is do I need to add a GFS resource in
this cluster.  If yes why?


Thanks!
Paras.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100120/8219246f/attachment.htm>

From swhiteho at redhat.com  Wed Jan 20 17:21:17 2010
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Wed, 20 Jan 2010 17:21:17 +0000
Subject: [Linux-cluster] statfs
In-Reply-To: <8b711df41001200807l69580e16g6f2fd470c19b3c01@mail.gmail.com>
References: <8b711df41001191459x41c55814pf0de439f450debbd@mail.gmail.com>
	<1263983478.2528.9.camel@localhost>
	<8b711df41001200807l69580e16g6f2fd470c19b3c01@mail.gmail.com>
Message-ID: <1264008077.14393.230.camel@localhost.localdomain>

Hi,

On Wed, 2010-01-20 at 10:07 -0600, Paras pradhan wrote:
> So.. does statfs_slow=0 means statfs_fast has been enabled?
> 
Basically yes. With newer GFS2 you can also alter this behaviour via the
mount command line. I should check to see whether we've got the man
page updated as well.

As a general principle we are trying to gradually move all the
tweekables for the filesystem to the mount command line (and allow them
to be set/reset via mount -o remount as required) rather than using the
gfs2_tool settune system (which is non-standard). So if there is a mount
command line equivalent command, then I would generally encourage the
use of that instead,

Steve.

> 
> Paras.
> 
> 
> On Wed, Jan 20, 2010 at 4:31 AM, Steven Whitehouse
> <swhiteho at redhat.com> wrote:
>         Hi,
>         
>         On Tue, 2010-01-19 at 16:59 -0600, Paras pradhan wrote:
>         > hi.
>         >
>         >
>         > Why I am not seeing statfs_fast ?
>         >
>         >
>         > [root at prd tune]# gfs2_tool gettune /guest_vms1
>         > new_files_directio = 0
>         > new_files_jdata = 0
>         > quota_scale = 1.0000   (1, 1)
>         > logd_secs = 1
>         > recoverd_secs = 60
>         > statfs_quantum = 30
>         > stall_secs = 600
>         > quota_cache_secs = 300
>         > quota_simul_sync = 64
>         > statfs_slow = 0
>         
>          ^^^^ Its called statfs_slow
>         
>         Steve.
>         
>         
>         
>         --
>         Linux-cluster mailing list
>         Linux-cluster at redhat.com
>         https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From rpeterso at redhat.com  Wed Jan 20 17:24:22 2010
From: rpeterso at redhat.com (Bob Peterson)
Date: Wed, 20 Jan 2010 12:24:22 -0500 (EST)
Subject: [Linux-cluster] statfs
In-Reply-To: <8b711df41001200807l69580e16g6f2fd470c19b3c01@mail.gmail.com>
Message-ID: <907892973.226781264008262558.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com>

----- "Paras pradhan" <pradhanparas at gmail.com> wrote:
| So.. does statfs_slow=0 means statfs_fast has been enabled?
| 
| 
| Paras.

Yes, gfs2 uses the fast statfs method unless it's turned off
by setting statfs_slow=1.

Regards,

Bob Peterson
Red Hat File Systems



From pradhanparas at gmail.com  Wed Jan 20 17:36:15 2010
From: pradhanparas at gmail.com (Paras pradhan)
Date: Wed, 20 Jan 2010 11:36:15 -0600
Subject: [Linux-cluster] statfs
In-Reply-To: <907892973.226781264008262558.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com>
References: <8b711df41001200807l69580e16g6f2fd470c19b3c01@mail.gmail.com>
	<907892973.226781264008262558.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com>
Message-ID: <8b711df41001200936k1fbd987dm674835517d3f8224@mail.gmail.com>

Thanks Bob , Steve for the info.

Paras.


On Wed, Jan 20, 2010 at 11:24 AM, Bob Peterson <rpeterso at redhat.com> wrote:

> ----- "Paras pradhan" <pradhanparas at gmail.com> wrote:
> | So.. does statfs_slow=0 means statfs_fast has been enabled?
> |
> |
> | Paras.
>
> Yes, gfs2 uses the fast statfs method unless it's turned off
> by setting statfs_slow=1.
>
> Regards,
>
> Bob Peterson
> Red Hat File Systems
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100120/3c6d437b/attachment.htm>

From pasik at iki.fi  Wed Jan 20 19:15:48 2010
From: pasik at iki.fi (Pasi =?iso-8859-1?Q?K=E4rkk=E4inen?=)
Date: Wed, 20 Jan 2010 21:15:48 +0200
Subject: [Linux-cluster] Snapshotting GFS and freezing
In-Reply-To: <4B570FAA.1090202@kinzesberg.de>
References: <4B4F527F.8030105@kinzesberg.de>
	<64D0546C5EBBD147B75DE133D798665F03F3F472@hugo.eprize.local>
	<20100115093514.GJ17978@reaktio.net>
	<20100115171211.GA13575@esri.com> <4B5314D4.8010900@kinzesberg.de>
	<20100118103439.GK17978@reaktio.net>
	<4B570FAA.1090202@kinzesberg.de>
Message-ID: <20100120191548.GE2861@reaktio.net>

On Wed, Jan 20, 2010 at 03:14:02PM +0100, Dirk H. Schulz wrote:
> Pasi K?rkk?inen schrieb:
>> On Sun, Jan 17, 2010 at 02:47:00PM +0100, Dirk H. Schulz wrote:
>>   
>>> Ray Van Dolson schrieb:
>>>     
>>>>> Hmm.. so snapshots with CLVM are possible nowadays?
>>>>>         
>>>> No....
>>>>
>>>> RH has stated (recently on this list) that patches exist to do it, but
>>>> it hasn't been a high enough priority for them to complete the work to
>>>> the point where it could be distributed to customers.
>>>>        
>>> Okay then, how do you people out there with clusters that run virtual 
>>> machines with live migration do backups of the virtual machines?
>>> I surely do not want to shut down the virtual machine to be able to 
>>> copy the image safely away if I have live migration available.
>>>
>>> At the moment there is only one way I can see. PLEASE prove me to be wrong.
>>> Searching in RedHat's documentation I found that the problem is that 
>>> lvm snapshots need exclusively allocated logical volumes. So I think 
>>> the following
>>> should be technically possible:
>>> 0. Start environment: the logical volume containing the images of the 
>>> virtual machines uses gfs and is mounted on all relevant cluster 
>>> nodes since VMs are running on several cluster nodes.
>>> 1. All VMs have to be migrated to one of the cluster nodes
>>> 2. On all other nodes, the gfs volume is unmounted
>>> 3. On the remaining node (where all VMs now run) the logical volume 
>>> is bound exclusively with "lvchange -aey LOGICALVOLUME"
>>>    (I hope this is possible without deactivating it first)
>>> 4. Now GFS on this volume is frozen: "gfs_tool freeze  
>>> /mountpoint/of/local/volume"
>>>     
>>
>> Before freezing the GFS you should make sure the VMs are in consistent
>> state, and the VMs have flushed their caches/buffers/disks.
>>   
> That means calling         sync
>        echo 3 > /proc/sys/vm/drop_caches
> inside the VM, right? Or is there anything more to do to flush everything?
>

And before those commands stop any databases or complex applications.. 
or make them flush their buffers and lock the tables for writes.

-- Pasi



From pasik at iki.fi  Wed Jan 20 19:28:11 2010
From: pasik at iki.fi (Pasi =?iso-8859-1?Q?K=E4rkk=E4inen?=)
Date: Wed, 20 Jan 2010 21:28:11 +0200
Subject: [Linux-cluster] Snapshotting GFS and freezing
In-Reply-To: <20100120191548.GE2861@reaktio.net>
References: <4B4F527F.8030105@kinzesberg.de>
	<64D0546C5EBBD147B75DE133D798665F03F3F472@hugo.eprize.local>
	<20100115093514.GJ17978@reaktio.net>
	<20100115171211.GA13575@esri.com> <4B5314D4.8010900@kinzesberg.de>
	<20100118103439.GK17978@reaktio.net>
	<4B570FAA.1090202@kinzesberg.de>
	<20100120191548.GE2861@reaktio.net>
Message-ID: <20100120192811.GF2861@reaktio.net>

On Wed, Jan 20, 2010 at 09:15:48PM +0200, Pasi K?rkk?inen wrote:
> On Wed, Jan 20, 2010 at 03:14:02PM +0100, Dirk H. Schulz wrote:
> > Pasi K?rkk?inen schrieb:
> >> On Sun, Jan 17, 2010 at 02:47:00PM +0100, Dirk H. Schulz wrote:
> >>   
> >>> Ray Van Dolson schrieb:
> >>>     
> >>>>> Hmm.. so snapshots with CLVM are possible nowadays?
> >>>>>         
> >>>> No....
> >>>>
> >>>> RH has stated (recently on this list) that patches exist to do it, but
> >>>> it hasn't been a high enough priority for them to complete the work to
> >>>> the point where it could be distributed to customers.
> >>>>        
> >>> Okay then, how do you people out there with clusters that run virtual 
> >>> machines with live migration do backups of the virtual machines?
> >>> I surely do not want to shut down the virtual machine to be able to 
> >>> copy the image safely away if I have live migration available.
> >>>
> >>> At the moment there is only one way I can see. PLEASE prove me to be wrong.
> >>> Searching in RedHat's documentation I found that the problem is that 
> >>> lvm snapshots need exclusively allocated logical volumes. So I think 
> >>> the following
> >>> should be technically possible:
> >>> 0. Start environment: the logical volume containing the images of the 
> >>> virtual machines uses gfs and is mounted on all relevant cluster 
> >>> nodes since VMs are running on several cluster nodes.
> >>> 1. All VMs have to be migrated to one of the cluster nodes
> >>> 2. On all other nodes, the gfs volume is unmounted
> >>> 3. On the remaining node (where all VMs now run) the logical volume 
> >>> is bound exclusively with "lvchange -aey LOGICALVOLUME"
> >>>    (I hope this is possible without deactivating it first)
> >>> 4. Now GFS on this volume is frozen: "gfs_tool freeze  
> >>> /mountpoint/of/local/volume"
> >>>     
> >>
> >> Before freezing the GFS you should make sure the VMs are in consistent
> >> state, and the VMs have flushed their caches/buffers/disks.
> >>   
> > That means calling         sync
> >        echo 3 > /proc/sys/vm/drop_caches
> > inside the VM, right? Or is there anything more to do to flush everything?
> >
> 
> And before those commands stop any databases or complex applications.. 
> or make them flush their buffers and lock the tables for writes.
> 

You might also be interested of this thread:
http://lists.xensource.com/archives/html/xen-users/2010-01/msg00393.html

-- Pasi



From dirk.schulz at kinzesberg.de  Thu Jan 21 10:42:26 2010
From: dirk.schulz at kinzesberg.de (Dirk H. Schulz)
Date: Thu, 21 Jan 2010 11:42:26 +0100
Subject: [Linux-cluster] Snapshotting GFS and freezing
In-Reply-To: <20100120192811.GF2861@reaktio.net>
References: <4B4F527F.8030105@kinzesberg.de>	<64D0546C5EBBD147B75DE133D798665F03F3F472@hugo.eprize.local>	<20100115093514.GJ17978@reaktio.net>	<20100115171211.GA13575@esri.com>
	<4B5314D4.8010900@kinzesberg.de>	<20100118103439.GK17978@reaktio.net>	<4B570FAA.1090202@kinzesberg.de>	<20100120191548.GE2861@reaktio.net>
	<20100120192811.GF2861@reaktio.net>
Message-ID: <4B582F92.4090400@kinzesberg.de>

Pasi K?rkk?inen schrieb:
> You might also be interested of this thread:
> http://lists.xensource.com/archives/html/xen-users/2010-01/msg00393.html
>   

That is interesting. Sounds like it could be interesting to make it a 
community effort.

Dirk
> -- Pasi
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>   




From adam.king at intechnology.com  Thu Jan 21 13:25:38 2010
From: adam.king at intechnology.com (King, Adam)
Date: Thu, 21 Jan 2010 13:25:38 -0000
Subject: [Linux-cluster] fence_ilo
Message-ID: <ED48EF11677FC44BB3440460C24F58E405A4BE88@mgmtexm03101.intechnology.co.uk>

Hi, 

I have a 2 node cluster (xen1 and xen2) running on dl380 G3's. There are
2 virtual machines running on these. From first node from the shell
command line I can fence the 2nd  node through the hp lights out port
using /sbin/fence_ilo -a 192.168.1.53 -l aking -p password -v. I can
also do the same from the second node and successfully fence the first
node. When I fence the node that has control of one or more virtual
machines they are migrated to the other node. 

If I fence a node in Luci this works too. 

However, if I have a node running one or more virtual machines and run
/sbin/reboot -f on that node, in /var/log/messages in the other node I
get messages such as 

xen1 fenced[3126]: agent "fence_ilo" reports: Unable to connect/login to
fencing device

xen1 fenced[3126]: fence "192.168.50.116" failed

 

Can anyone advise why this is happening? 

Cheers

Adam 
 

Adam King
Systems Administrator
adam.king at intechnology.com


InTechnology plc
Support 0845 120 7070
Telephone 01423 850000
Facsimile 01423 858866
www.intechnology.com

InTechnology provides industry leading network, hosting, data
and IP telephony services to over 800 UK businesses.


* Over 25 years in business employing 220 employees in the UK
* Annual turnover of ?50m, profitable and financially secure
* InTechnology manages over 14,000 managed network circuits
* InTechnology manages over 5PB (over 5.2 million GB) of customer data
* InTechnology hosts more than 16,000 IP telephony seats
* InTechnology operates four managed data centres - 50,000 sq ft capacity



This is an email from InTechnology plc, Central House, Beckwith Knowle, Harrogate, UK, HG3 1UG.
Registered in England 3916586.

The contents of this message may be privileged and confidential. If you have received this message in error, you may not use,
disclose, copy or distribute its content in any way. Please notify the sender immediately. All messages are scanned for all viruses.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100121/4a4b882b/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: InTechnology-logo-circle.gif
Type: image/gif
Size: 6495 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100121/4a4b882b/attachment.gif>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: InTechnology-signature-dots.gif
Type: image/gif
Size: 76 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100121/4a4b882b/attachment-0001.gif>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: arrow.gif
Type: image/gif
Size: 97 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100121/4a4b882b/attachment-0002.gif>

From pcs.fagnonr at pcsb.org  Thu Jan 21 14:23:34 2010
From: pcs.fagnonr at pcsb.org (Fagnon Raymond)
Date: Thu, 21 Jan 2010 09:23:34 -0500
Subject: [Linux-cluster] fail over
Message-ID: <D24B5EDEE969354CA10D44C56305CB4B1D9B89019D@a0040semail3.pinellas.local>

I want to fail my virtual server over to my standby node. I am going to do this via a reboot. When the primary is backup what command can I issue to have the virtual server failover to the primary again?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100121/139de31d/attachment.htm>

From broder at mit.edu  Thu Jan 21 15:17:25 2010
From: broder at mit.edu (Evan Broder)
Date: Thu, 21 Jan 2010 10:17:25 -0500
Subject: [Linux-cluster] Deadlock when using clvmd + OpenAIS + Corosync
In-Reply-To: <4B4D9982.6060505@redhat.com>
References: <178868c41001081458h3c0b70denef831712883a6fec@mail.gmail.com>
	<4B4AE972.6040004@redhat.com>
	<178868c41001110132s1350c7cep942148bbce62dfc0@mail.gmail.com>
	<4B4AF17C.4040306@redhat.com> <4B4C38D1.6000608@redhat.com>
	<178868c41001120821g6885c03cr75a3d42eb0196428@mail.gmail.com>
	<4B4D9982.6060505@redhat.com>
Message-ID: <178868c41001210717u82779edp128006b7fc8e67e8@mail.gmail.com>

On Wed, Jan 13, 2010 at 4:59 AM, Christine Caulfield
<ccaulfie at redhat.com> wrote:
> On 12/01/10 16:21, Evan Broder wrote:
>>
>> On Tue, Jan 12, 2010 at 3:54 AM, Christine Caulfield
>> <ccaulfie at redhat.com> ?wrote:
>>>
>>> On 11/01/10 09:38, Christine Caulfield wrote:
>>>>
>>>> On 11/01/10 09:32, Evan Broder wrote:
>>>>>
>>>>> On Mon, Jan 11, 2010 at 4:03 AM, Christine Caulfield
>>>>> <ccaulfie at redhat.com> ?wrote:
>>>>>>
>>>>>> On 08/01/10 22:58, Evan Broder wrote:
>>>>>>>
>>>>>>> [please preserve the CC when replying, thanks]
>>>>>>>
>>>>>>> Hi -
>>>>>>> We're attempting to setup a clvm (2.02.56) cluster using OpenAIS
>>>>>>> (1.1.1) and Corosync (1.1.2). We've gotten bitten hard in the past by
>>>>>>> crashes leaving DLM state around and forcing us to reboot our nodes,
>>>>>>> so we're specifically looking for a solution that doesn't involve
>>>>>>> in-kernel locking.
>>>>>>>
>>>>>>> We're also running the Pacemaker OpenAIS service, as we're hoping to
>>>>>>> use it for management of some other resources going forward.
>>>>>>>
>>>>>>> We've managed to form the OpenAIS cluster, and get clvmd running on
>>>>>>> both of our nodes. Operations using LVM succeed, so long as only one
>>>>>>> operation runs at a time. However, if we attempt to run two
>>>>>>> operations
>>>>>>> (say, one lvcreate on each host) at a time, they both hang, and both
>>>>>>> clvmd processes appear to deadlock.
>>>>>>>
>>>>>>> When they deadlock, it doesn't appear to affect the other clustering
>>>>>>> processes - both corosync and pacemaker still report a fully formed
>>>>>>> cluster, so it seems the issue is localized to clvmd.
>>>>>>>
>>>>>>> I've looked at logs from corosync and pacemaker, and I've straced
>>>>>>> various processes, but I don't want to blast a bunch of useless
>>>>>>> information at the list. What information can I provide to make it
>>>>>>> easier to debug and fix this deadlock?
>>>>>>>
>>>>>>
>>>>>> To start with, the best logging to produce is the clvmd logs which
>>>>>> can be
>>>>>> got with clvmd -d (see the man page for details). Ideally these
>>>>>> should be
>>>>>> from all nodes in the cluster so they can be correlated. If you're
>>>>>> still
>>>>>> using DLM then a dlm lock dump from all nodes is often helpful in
>>>>>> conjunction with the clvmd logs.
>>>>>
>>>>> Sure, no problem. I've posted the logs from clvmd on both processes in
>>>>> <http://web.mit.edu/broder/Public/clvmd/>. I've annotated them at a
>>>>> few points with what I was doing - the annotations all start with "
>>>>>>>
>>>>>>> ", so they should be easy to spot.
>>>
>>>
>>> Ironically it looks like a bug in the clvmd-openais code. I can reproduce
>>> it
>>> on my systems here. I don't see the problem when using the dlm!
>>>
>>> Can you try -Icorosync and see if that helps? In the meantime I'll have a
>>> look at the openais bits to try and find out what is wrong.
>>>
>>> Chrissie
>>>
>>
>> I'll see what we can pull together, but the nodes running the clvm
>> cluster are also Xen dom0's. They're currently running on (Ubuntu
>> Hardy's) 2.6.24, so upgrading them to something new enough to support
>> DLM 3 would be...challenging.
>>
>> It would be much, much better for us if we could get clvmd-openais
>> working.
>>
>> Is there any chance this would work better if we dropped back to
>> openais whitetank instead of corosync + openais wilson?
>>
>
>
> OK, I've found the bug and it IS in openais. The attached patch will fix it.
>
> Chrissie
>

Awesome. That patch fixed our problem.

We are running into one other problem - performing LVM operations on
one node is substantially slower than performing them on the other
node:

root at black-mesa:~# time lvcreate -n test -L 1G xenvg
  Logical volume "test" created

real	0m0.309s
user	0m0.000s
sys	0m0.008s
root at black-mesa:~# time lvremove -f /dev/xenvg/test
  Logical volume "test" successfully removed

real	0m0.254s
user	0m0.004s
sys	0m0.008s


root at torchwood-institute:~# time lvcreate -n test -L 1G xenvg
  Logical volume "test" created

real	0m7.282s
user	0m6.396s
sys	0m0.312s
root at torchwood-institute:~# time lvremove -f /dev/xenvg/test
  Logical volume "test" successfully removed

real	0m7.277s
user	0m6.420s
sys	0m0.292s

Any idea why this is happening and if there's anything we can do about it?

Thanks again for your help,
 - Evan



From celsowebber at yahoo.com  Thu Jan 21 15:44:12 2010
From: celsowebber at yahoo.com (Celso K. Webber)
Date: Thu, 21 Jan 2010 07:44:12 -0800 (PST)
Subject: [Linux-cluster] fail over
In-Reply-To: <D24B5EDEE969354CA10D44C56305CB4B1D9B89019D@a0040semail3.pinellas.local>
References: <D24B5EDEE969354CA10D44C56305CB4B1D9B89019D@a0040semail3.pinellas.local>
Message-ID: <946600.62867.qm@web32103.mail.mud.yahoo.com>

Hi,

You can do this through Conga (Web user interface) or via the "clusvcadm" command line. I believe you can see this functionality in the documentation. Try also "man clusvcadm" and look for the "relocate" function.

Regards, Celso.




________________________________
From: Fagnon Raymond <pcs.fagnonr at pcsb.org>
To: linux clustering <linux-cluster at redhat.com>
Sent: Thu, January 21, 2010 12:23:34 PM
Subject: [Linux-cluster] fail over

 
I want to fail my virtual server over to my standby node. I
am going to do this via a reboot. When the primary is backup what command can I
issue to have the virtual server failover to the primary again?


      
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100121/ce7979b9/attachment.htm>

From ccook at pandora.com  Thu Jan 21 22:44:29 2010
From: ccook at pandora.com (Christopher Strider Cook)
Date: Thu, 21 Jan 2010 14:44:29 -0800
Subject: [Linux-cluster] Cluster failover troubleshooting
Message-ID: <4B58D8CD.7020103@pandora.com>

Please help me figure out why this cluster failed over. This has 
happened several times in the past month, while previously it had been 
quite stable. What can trigger the corosync " [TOTEM ] A processor 
failed, forming new configuration. " message? By all appearances the 
primary server was functioning properly until it was fenced by the 
secondary.

I've got cluster3 running on debian lenny 2.6.30-1-amd64
ii  openais        1.0.0-3local1  Standards-based cluster framework 
(daemon an
ii  corosync       1.0.0-4        Standards-based cluster framework 
(daemon an
ii  rgmanager      3.0.0-1~agx0lo clustered resource group manager
ii  cman           3.0.0-1~agx0lo cluster manager


A bunch of successful status checks on the active server, nicks, leading 
up to:

Jan 21 04:28:08 wonder corosync[2856]: [TOTEM ] A processor failed, 
forming new configuration.
Jan 21 04:28:09 wonder qdiskd[2873]: Writing eviction notice for node 2
Jan 21 04:28:10 wonder qdiskd[2873]: Node 2 evicted
Jan 21 04:28:11 nicks corosync[2991]: [CMAN ] lost contact with quorum 
device
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] CLM CONFIGURATION CHANGE
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] New Configuration:
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] r(0) ip(192.168.255.20)
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] Members Left:
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] r(0) ip(192.168.255.21)
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] Members Joined:
Jan 21 04:28:12 wonder corosync[2856]: [QUORUM] This node is within the 
primary component and will provide service.
Jan 21 04:28:12 wonder corosync[2856]: [QUORUM] Members[1]:
Jan 21 04:28:12 wonder corosync[2856]: [QUORUM] 1
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] CLM CONFIGURATION CHANGE
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] New Configuration:
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] r(0) ip(192.168.255.20)
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] Members Left:
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] Members Joined:
Jan 21 04:28:12 wonder corosync[2856]: [TOTEM ] A processor joined or 
left the membership and a new membership was formed.
Jan 21 04:28:12 wonder corosync[2856]: [MAIN ] Completed service 
synchronization, ready to provide service.
Jan 21 04:28:12 wonder rgmanager[3578]: State change: nicks-p DOWN
Jan 21 04:28:13 wonder kernel: [1298595.738213] dlm: closing connection 
to node 2
Jan 21 04:28:13 wonder fenced[3206]: fencing node nicks-p
Jan 21 04:28:12 nicks corosync[2991]: [QUORUM] This node is within the 
primary component and will provide service.
Jan 21 04:28:13 nicks corosync[2991]: [QUORUM] Members[2]:
Jan 21 04:28:13 nicks corosync[2991]: [QUORUM] 1
Jan 21 04:28:13 nicks corosync[2991]: [QUORUM] 2
Jan 21 04:28:16 wonder fenced[3206]: fence nicks-p success
Jan 21 04:28:16 wonder fenced[3206]: fence nicks-p success
Jan 21 04:28:17 wonder rgmanager[3578]: Taking over service 
service:MailHost from down member nicks-p
Jan 21 04:28:17 wonder bash[13236]: Unknown file system type 'ext4' for 
device /dev/dm-0. Assuming fsck is required.
Jan 21 04:28:17 wonder bash[13259]: Running fsck on /dev/dm-0
Jan 21 04:28:18 wonder bash[13284]: mounting /dev/dm-0 on /home
Jan 21 04:28:18 wonder bash[13306]: mount -t ext4 -o 
defaults,noatime,nodiratime /dev/dm-0 /home
Jan 21 04:28:19 wonder bash[13335]: quotaon not found in 
/bin:/sbin:/usr/bin:/usr/sbin
Jan 21 04:28:19 wonder bash[13335]: quotaon not found in 
/bin:/sbin:/usr/bin:/usr/sbin
Jan 21 04:28:19 wonder bash[13368]: mounting /dev/dm-1 on /var/cluster
Jan 21 04:28:19 wonder bash[13390]: mount -t ext3 -o defaults /dev/dm-1 
/var/cluster
Jan 21 04:28:19 wonder bash[13415]: quotaon not found in 
/bin:/sbin:/usr/bin:/usr/sbin
Jan 21 04:28:19 wonder bash[13415]: quotaon not found in 
/bin:/sbin:/usr/bin:/usr/sbin
Jan 21 04:28:20 wonder bash[13467]: Link for eth0: Detected
Jan 21 04:28:20 wonder bash[13489]: Adding IPv4 address 172.25.16.58/22 
to eth0
Jan 21 04:28:20 wonder bash[13513]: Sending gratuitous ARP: 172.25.16.58 
00:30:48:c6:df:ce brd ff:ff:ff:ff:ff:ff
Jan 21 04:28:21 wonder bash[13551]: Executing 
/etc/cluster/MailHost-misc-early start
Jan 21 04:28:21 wonder bash[13606]: Executing 
/etc/cluster/saslauthd-cluster start
Jan 21 04:28:21 wonder bash[13679]: Executing 
/etc/cluster/postfix-cluster start
Jan 21 04:28:22 wonder bash[13788]: Executing 
/etc/cluster/dovecot-wrapper start
Jan 21 04:28:22 wonder bash[13850]: Executing 
/etc/cluster/mailman-wrapper start
Jan 21 04:28:23 wonder bash[13901]: Executing 
/etc/cluster/apache2-mailhost start


<?xml version="1.0"?>
<cluster name="alpha" config_version="44">

<cman two_node="0" expected_votes="3">
</cman>

<clusternodes>
<clusternode name="wonder-p" votes="1" nodeid="1">
<fence>
<method name="single">
<device name="pwr01" option="off"/>
<device name="pwr02" option="off"/>
<device name="pwr01" option="on"/>
<device name="pwr02" option="on"/>
</method>
</fence>
</clusternode>
<clusternode name="nicks-p" votes="1" nodeid="2">
<fence>
<method name="single">
<device name="pwr03" option="off"/>
<device name="pwr04" option="off"/>
<device name="pwr03" option="on"/>
<device name="pwr04" option="on"/>
</method>
</fence>
</clusternode>
</clusternodes>

<quorumd interval="1" tko="10" votes="1" label="quorumdisk">
<heuristic program="ping 172.25.19.254 -c1 -t1" score="1" interval="2" 
tko="3"/>
</quorumd>

<fence_daemon post_join_delay="20">
</fence_daemon>

<fencedevices>
<fencedevice agent="fence_apc_snmp" ipaddr="pdu-paul-2-2" port="4" 
name="pwr01" udpport="161" />
<fencedevice agent="fence_apc_snmp" ipaddr="pdu-paul-2-3" port="4" 
name="pwr02" udpport="161" />
<fencedevice agent="fence_apc_snmp" ipaddr="pdu-paul-2-2" port="3" 
name="pwr03" udpport="161" />
<fencedevice agent="fence_apc_snmp" ipaddr="pdu-paul-2-3" port="3" 
name="pwr04" udpport="161" />
</fencedevices>

<rm>

<failoverdomains>
<failoverdomain name="mailcluster" restricted="1" ordered="0" >
<failoverdomainnode name="wonder-p" priority="1"/>
<failoverdomainnode name="nicks-p" priority="1"/>
</failoverdomain>
</failoverdomains>

<service name="MailHost" autostart="1" domain="mailcluster" >
<script name="MailHost-early" file="/etc/cluster/MailHost-misc-early" />
<fs name="mailhome" mountpoint="/home" device="/dev/dm-0" fstype="ext4" 
force_unmount="1" active_monitor="1" 
options="defaults,noatime,nodiratime" />
<fs name="mailcluster" mountpoint="/var/cluster" device="/dev/dm-1" 
fstype="ext3" force_unmount="1" active_monitor="1" options="defaults" />
<ip address="172.25.16.58" monitor_link="1" />
<script name="saslauthd" file="/etc/cluster/saslauthd-cluster" />
<script name="postfix" file="/etc/cluster/postfix-cluster" />
<script name="dovecot" file="/etc/cluster/dovecot-wrapper" 
__independent_subtree="1" />
<script name="mailman" file="/etc/cluster/mailman-wrapper" 
__independent_subtree="1" />
<script name="apache2-mailhost" file="/etc/cluster/apache2-mailhost" 
__independent_subtree="1" />
<script name="usermin" file="/etc/init.d/usermin-sb" 
__independent_subtree="1" />
<script name="MailHost-late" file="/etc/cluster/MailHost-misc-late" />
</service>

</rm>
</cluster>


Thanks

Chris
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100121/d23caec5/attachment.htm>

From pmdyer at ctgcentral2.com  Thu Jan 21 23:28:58 2010
From: pmdyer at ctgcentral2.com (Paul M. Dyer)
Date: Thu, 21 Jan 2010 17:28:58 -0600 (CST)
Subject: [Linux-cluster] Luci home page
In-Reply-To: <782081.52039.qm@web32102.mail.mud.yahoo.com>
Message-ID: <13627661.21264116538974.JavaMail.root@athena>

Hi,

I had this problem and worked with RHN for about 4 weeks.   They have been able to reproduce it, and expect to send the bugzilla report to engineering at some point.

For me, I was able to bypass the timeout error by starting all ricci agents as root.   Somehow, this is a workaround for a saslauthd issue.

Edit /etc/init.d/ricci, and change line 155 to: 

daemon "$RICCID"

removing the -u "$NewUID" parameter.

Paul

----- Original Message -----
From: "Celso K. Webber" <celsowebber at yahoo.com>
To: "linux clustering" <linux-cluster at redhat.com>
Sent: Tuesday, January 19, 2010 3:24:51 PM (GMT-0600) America/Chicago
Subject: Re: [Linux-cluster] Luci home page

Hi Dirk,

I've double checked my environment, and I have SELinux DISABLED on both nodes, as I have the firewall DISABLED also.

Thanks.



----- Original Message ----
From: Dirk H. Schulz <dirk.schulz at kinzesberg.de>
To: linux clustering <linux-cluster at redhat.com>
Sent: Tue, January 19, 2010 4:14:28 PM
Subject: Re: [Linux-cluster] Luci home page

I have experienced this kind of difficulties when I started testing conga. One of the first things I tried was setting SElinux to permissive on the conga server, and from then on I could work well with it.

I did not look into audit.log to find out if there is a solution to it, so far.

Dirk

Celso K. Webber schrieb:
> Hi,
> 
> I don't know if your problem is the same as mine, but I'm experiencing a problem where luci is a little bit unstable.
> 
> I can log in into luci, but when I click the "cluster" tab, it takes some time and returns:
> "An error occurred when trying to contact any of the nodes in the rhcs-xen cluster."
> 
> In the server where luci is running (node1), I see the following messages in /var/log/messages:
> Jan  7 17:43:01 node1 luci[5262]: Error reading from node2.localdomain:11111: timeout
> Jan  7 17:43:01 node1 luci[5262]: Error reading from node2.localdomain:11111: timeout
> 
> Jan  7 17:43:06 node1 luci[5262]: Error reading from node1.localdomain:11111: timeout
> Jan  7 17:43:06 node1 luci[5262]: Error reading from node1.localdomain:11111: timeout
> 
> I've checked name resolution, host names, /etc/hosts, etc. Firewall and SELinux is disabled. I can do a "telnet nodeX.localdomain 11111" from anywhere successfully.
> 
> Whenever I receive this kind of "timeout error" in luci, if I click the same link again or if I do a "reload" in the browser, then luci usually responds correctly. There is the inconvenience of re-clicking almost every link I use in luci.
> 
> I'm using a fresh RHEL 5.4 installation, did "yum update" recently and the problem persistend. I indeed tried to remove luci and ricci, rm -rf /var/lib/luci and /var/lib/ricci, but with no change in behaviour.
> 
> Does anyone has these same symptoms?
> 
> Thanks, Celso.
> 
> ------------------------------------------------------------------------
> *From:* Fagnon Raymond <pcs.fagnonr at pcsb.org>
> *To:* linux clustering <linux-cluster at redhat.com>
> *Sent:* Mon, January 18, 2010 7:24:48 PM
> *Subject:* [Linux-cluster] Luci home page
> 
> When I log into my luci homepage and click on cluster. My Cluster does not appear.  The servers show up under the storage tab and so forth.
> 
>  
> As far as I know this is a nfs cluster
> 
> 
> ------------------------------------------------------------------------
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



      

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From brettcave at gmail.com  Fri Jan 22 07:28:13 2010
From: brettcave at gmail.com (Brett Cave)
Date: Fri, 22 Jan 2010 09:28:13 +0200
Subject: [Linux-cluster] fence_ilo
In-Reply-To: <ED48EF11677FC44BB3440460C24F58E405A4BE88@mgmtexm03101.intechnology.co.uk>
References: <ED48EF11677FC44BB3440460C24F58E405A4BE88@mgmtexm03101.intechnology.co.uk>
Message-ID: <c0773fd31001212328i314a03cay5227bb0fddfc3c73@mail.gmail.com>

On Thu, Jan 21, 2010 at 3:25 PM, King, Adam <adam.king at intechnology.com>wrote:

>   Hi,
>
> I have a 2 node cluster (xen1 and xen2) running on dl380 G3?s. There are 2
> virtual machines running on these. From first node from the shell command
> line I can fence the 2nd  node through the hp lights out port using
> /sbin/fence_ilo ?a 192.168.1.53 ?l aking ?p password ?v. I can also do the
> same from the second node and successfully fence the first node. When I
> fence the node that has control of one or more virtual machines they are
> migrated to the other node.
>
I had some issues with the default fence_ilo, found a nice fence_fast_ilo
fencing script in an rpm: comoonics-bootimage-fenceclient. This seems to
work much better.


>  If I fence a node in Luci this works too.
>
> However, if I have a node running one or more virtual machines and run
> /sbin/reboot ?f on that node, in /var/log/messages in the other node I get
> messages such as
>
> xen1 fenced[3126]: agent "fence_ilo" reports: Unable to connect/login to
> fencing device
>
> xen1 fenced[3126]: fence "192.168.50.116" failed
>

assuming that 192.168.1.53 is the 2nd server's ilo address and
192.168.50.116 is the 1st servers ilo address?

Can anyone advise why this is happening?
>
> Cheers
>
> Adam
>
>    Adam King
>  Systems Administrator
>  adam.king at intechnology.com
>    InTechnology plc
>  Support 0845 120 7070
>  Telephone 01423 850000
>  Facsimile 01423 858866
> www.intechnology.com
>
>
>   *InTechnology provides industry leading network, hosting, data
> and IP telephony services to over 800 UK businesses.
>
> *    *Over 25 years in business employing 220 employees in the UK*     *Annual
> turnover of ?50m, profitable and financially secure*     *InTechnology
> manages over 14,000 managed network circuits*     *InTechnology manages
> over 5PB (over 5.2 million GB) of customer data*     *InTechnology hosts
> more than 16,000 IP telephony seats*     *InTechnology operates four
> managed data centres - 50,000 sq ft capacity*
>
>     This is an email from InTechnology plc, Central House, Beckwith
> Knowle, Harrogate, UK, HG3 1UG.
> Registered in England 3916586.
>
> The contents of this message may be privileged and confidential. If you
> have received this message in error, you may not use,
> disclose, copy or distribute its content in any way. Please notify the
> sender immediately. All messages are scanned for all viruses.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100122/0624d1fa/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: InTechnology-signature-dots.gif
Type: image/gif
Size: 76 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100122/0624d1fa/attachment.gif>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: InTechnology-logo-circle.gif
Type: image/gif
Size: 6495 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100122/0624d1fa/attachment-0001.gif>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: arrow.gif
Type: image/gif
Size: 97 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100122/0624d1fa/attachment-0002.gif>

From adam.king at intechnology.com  Fri Jan 22 09:09:03 2010
From: adam.king at intechnology.com (King, Adam)
Date: Fri, 22 Jan 2010 09:09:03 -0000
Subject: [Linux-cluster] fence_ilo
In-Reply-To: <c0773fd31001212328i314a03cay5227bb0fddfc3c73@mail.gmail.com>
References: <ED48EF11677FC44BB3440460C24F58E405A4BE88@mgmtexm03101.intechnology.co.uk>
	<c0773fd31001212328i314a03cay5227bb0fddfc3c73@mail.gmail.com>
Message-ID: <ED48EF11677FC44BB3440460C24F58E405A4BFA6@mgmtexm03101.intechnology.co.uk>

Hi, 

Thanks for your reply, 192.168.1.152 is the first servers ilo and 192.168.50.112 is the first servers eth0, the first servers eth1 is 192.168.1.50. The second servers eth0 is 192.168.50.116 and its ilo is 192.168.1.153, the second servers eth1 is 192.168.1.51. 

I have found the rpm you suggested and will try that,

Thanks again

 

 

From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Brett Cave
Sent: 22 January 2010 07:28
To: linux clustering
Subject: Re: [Linux-cluster] fence_ilo

 

 

On Thu, Jan 21, 2010 at 3:25 PM, King, Adam <adam.king at intechnology.com> wrote:



Hi, 

I have a 2 node cluster (xen1 and xen2) running on dl380 G3's. There are 2 virtual machines running on these. From first node from the shell command line I can fence the 2nd  node through the hp lights out port using /sbin/fence_ilo -a 192.168.1.53 -l aking -p password -v. I can also do the same from the second node and successfully fence the first node. When I fence the node that has control of one or more virtual machines they are migrated to the other node.

I had some issues with the default fence_ilo, found a nice fence_fast_ilo fencing script in an rpm: comoonics-bootimage-fenceclient. This seems to work much better.

 

	If I fence a node in Luci this works too. 

	However, if I have a node running one or more virtual machines and run /sbin/reboot -f on that node, in /var/log/messages in the other node I get messages such as 

	xen1 fenced[3126]: agent "fence_ilo" reports: Unable to connect/login to fencing device

	xen1 fenced[3126]: fence "192.168.50.116" failed

 

assuming that 192.168.1.53 is the 2nd server's ilo address and 192.168.50.116 is the 1st servers ilo address?

 

	Can anyone advise why this is happening? 

	Cheers

	Adam 

	  

	Adam King 

	Systems Administrator 

	adam.king at intechnology.com 

	InTechnology plc 

	Support 0845 120 7070 

	Telephone 01423 850000 

	Facsimile 01423 858866 

	www.intechnology.com <http://www.intechnology.com/> 

	
	   

InTechnology provides industry leading network, hosting, data
and IP telephony services to over 800 UK businesses.

	 

 

Over 25 years in business employing 220 employees in the UK 

 



Annual turnover of ?50m, profitable and financially secure 





InTechnology manages over 14,000 managed network circuits 





InTechnology manages over 5PB (over 5.2 million GB) of customer data 





InTechnology hosts more than 16,000 IP telephony seats 





InTechnology operates four managed data centres - 50,000 sq ft capacity 



 

	This is an email from InTechnology plc, Central House, Beckwith Knowle, Harrogate, UK, HG3 1UG.
	Registered in England 3916586.
	
	The contents of this message may be privileged and confidential. If you have received this message in error, you may not use,
	disclose, copy or distribute its content in any way. Please notify the sender immediately. All messages are scanned for all viruses.

	
	--
	Linux-cluster mailing list
	Linux-cluster at redhat.com
	https://www.redhat.com/mailman/listinfo/linux-cluster 
 

Adam King
Systems Administrator
adam.king at intechnology.com


InTechnology plc
Support 0845 120 7070
Telephone 01423 850000
Facsimile 01423 858866
www.intechnology.com

 


This is an email from InTechnology plc, Central House, Beckwith Knowle, Harrogate, UK, HG3 1UG.
Registered in England 3916586.

The contents of this message may be privileged and confidential. If you have received this message in error, you may not use,

disclose, copy or distribute its content in any way. Please notify the sender immediately. All messages are scanned for all viruses.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100122/4556023f/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.gif
Type: image/gif
Size: 6495 bytes
Desc: image001.gif
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100122/4556023f/attachment.gif>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image002.gif
Type: image/gif
Size: 97 bytes
Desc: image002.gif
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100122/4556023f/attachment-0001.gif>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image003.gif
Type: image/gif
Size: 76 bytes
Desc: image003.gif
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100122/4556023f/attachment-0002.gif>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: InTechnology-logo-circle.gif
Type: image/gif
Size: 6495 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100122/4556023f/attachment-0003.gif>

From pasik at iki.fi  Fri Jan 22 09:54:13 2010
From: pasik at iki.fi (Pasi =?iso-8859-1?Q?K=E4rkk=E4inen?=)
Date: Fri, 22 Jan 2010 11:54:13 +0200
Subject: [Linux-cluster] Snapshotting GFS and freezing
In-Reply-To: <4B582F92.4090400@kinzesberg.de>
References: <4B4F527F.8030105@kinzesberg.de>
	<64D0546C5EBBD147B75DE133D798665F03F3F472@hugo.eprize.local>
	<20100115093514.GJ17978@reaktio.net>
	<20100115171211.GA13575@esri.com> <4B5314D4.8010900@kinzesberg.de>
	<20100118103439.GK17978@reaktio.net>
	<4B570FAA.1090202@kinzesberg.de>
	<20100120191548.GE2861@reaktio.net>
	<20100120192811.GF2861@reaktio.net>
	<4B582F92.4090400@kinzesberg.de>
Message-ID: <20100122095413.GS2861@reaktio.net>

On Thu, Jan 21, 2010 at 11:42:26AM +0100, Dirk H. Schulz wrote:
> Pasi K?rkk?inen schrieb:
>> You might also be interested of this thread:
>> http://lists.xensource.com/archives/html/xen-users/2010-01/msg00393.html
>>   
>
> That is interesting. Sounds like it could be interesting to make it a  
> community effort.
>

Indeed. We've been discussing this with multiple people. 

-- Pasi



From celsowebber at yahoo.com  Fri Jan 22 11:06:04 2010
From: celsowebber at yahoo.com (Celso K. Webber)
Date: Fri, 22 Jan 2010 03:06:04 -0800 (PST)
Subject: [Linux-cluster] Luci home page
In-Reply-To: <13627661.21264116538974.JavaMail.root@athena>
References: <13627661.21264116538974.JavaMail.root@athena>
Message-ID: <758393.1799.qm@web32107.mail.mud.yahoo.com>

Hi Paul!!!

Thank you very much for this power tip! I have a ticket opened with Red Hat, now I can just tell them that there were other similar cases.

Thanks again!

Regards, Celso.



----- Original Message ----
From: Paul M. Dyer <pmdyer at ctgcentral2.com>
To: linux clustering <linux-cluster at redhat.com>
Sent: Thu, January 21, 2010 9:28:58 PM
Subject: Re: [Linux-cluster] Luci home page

Hi,

I had this problem and worked with RHN for about 4 weeks.   They have been able to reproduce it, and expect to send the bugzilla report to engineering at some point.

For me, I was able to bypass the timeout error by starting all ricci agents as root.   Somehow, this is a workaround for a saslauthd issue.

Edit /etc/init.d/ricci, and change line 155 to: 

daemon "$RICCID"

removing the -u "$NewUID" parameter.

Paul

----- Original Message -----
From: "Celso K. Webber" <celsowebber at yahoo.com>
To: "linux clustering" <linux-cluster at redhat.com>
Sent: Tuesday, January 19, 2010 3:24:51 PM (GMT-0600) America/Chicago
Subject: Re: [Linux-cluster] Luci home page

Hi Dirk,

I've double checked my environment, and I have SELinux DISABLED on both nodes, as I have the firewall DISABLED also.

Thanks.



----- Original Message ----
From: Dirk H. Schulz <dirk.schulz at kinzesberg.de>
To: linux clustering <linux-cluster at redhat.com>
Sent: Tue, January 19, 2010 4:14:28 PM
Subject: Re: [Linux-cluster] Luci home page

I have experienced this kind of difficulties when I started testing conga. One of the first things I tried was setting SElinux to permissive on the conga server, and from then on I could work well with it.

I did not look into audit.log to find out if there is a solution to it, so far.

Dirk

Celso K. Webber schrieb:
> Hi,
> 
> I don't know if your problem is the same as mine, but I'm experiencing a problem where luci is a little bit unstable.
> 
> I can log in into luci, but when I click the "cluster" tab, it takes some time and returns:
> "An error occurred when trying to contact any of the nodes in the rhcs-xen cluster."
> 
> In the server where luci is running (node1), I see the following messages in /var/log/messages:
> Jan  7 17:43:01 node1 luci[5262]: Error reading from node2.localdomain:11111: timeout
> Jan  7 17:43:01 node1 luci[5262]: Error reading from node2.localdomain:11111: timeout
> 
> Jan  7 17:43:06 node1 luci[5262]: Error reading from node1.localdomain:11111: timeout
> Jan  7 17:43:06 node1 luci[5262]: Error reading from node1.localdomain:11111: timeout
> 
> I've checked name resolution, host names, /etc/hosts, etc. Firewall and SELinux is disabled. I can do a "telnet nodeX.localdomain 11111" from anywhere successfully.
> 
> Whenever I receive this kind of "timeout error" in luci, if I click the same link again or if I do a "reload" in the browser, then luci usually responds correctly. There is the inconvenience of re-clicking almost every link I use in luci.
> 
> I'm using a fresh RHEL 5.4 installation, did "yum update" recently and the problem persistend. I indeed tried to remove luci and ricci, rm -rf /var/lib/luci and /var/lib/ricci, but with no change in behaviour.
> 
> Does anyone has these same symptoms?
> 
> Thanks, Celso.
> 
> ------------------------------------------------------------------------
> *From:* Fagnon Raymond <pcs.fagnonr at pcsb.org>
> *To:* linux clustering <linux-cluster at redhat.com>
> *Sent:* Mon, January 18, 2010 7:24:48 PM
> *Subject:* [Linux-cluster] Luci home page
> 
> When I log into my luci homepage and click on cluster. My Cluster does not appear.  The servers show up under the storage tab and so forth.
> 
>  
> As far as I know this is a nfs cluster
> 
> 
> ------------------------------------------------------------------------
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



      

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



      



From ccaulfie at redhat.com  Fri Jan 22 11:41:59 2010
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Fri, 22 Jan 2010 11:41:59 +0000
Subject: [Linux-cluster] Deadlock when using clvmd + OpenAIS + Corosync
In-Reply-To: <178868c41001210717u82779edp128006b7fc8e67e8@mail.gmail.com>
References: <178868c41001081458h3c0b70denef831712883a6fec@mail.gmail.com>	
	<4B4AE972.6040004@redhat.com>	
	<178868c41001110132s1350c7cep942148bbce62dfc0@mail.gmail.com>	
	<4B4AF17C.4040306@redhat.com> <4B4C38D1.6000608@redhat.com>	
	<178868c41001120821g6885c03cr75a3d42eb0196428@mail.gmail.com>	
	<4B4D9982.6060505@redhat.com>
	<178868c41001210717u82779edp128006b7fc8e67e8@mail.gmail.com>
Message-ID: <4B598F07.1070002@redhat.com>

On 21/01/10 15:17, Evan Broder wrote:
> On Wed, Jan 13, 2010 at 4:59 AM, Christine Caulfield
> <ccaulfie at redhat.com>  wrote:
>> On 12/01/10 16:21, Evan Broder wrote:
>>>
>>> On Tue, Jan 12, 2010 at 3:54 AM, Christine Caulfield
>>> <ccaulfie at redhat.com>    wrote:
>>>>
>>>> On 11/01/10 09:38, Christine Caulfield wrote:
>>>>>
>>>>> On 11/01/10 09:32, Evan Broder wrote:
>>>>>>
>>>>>> On Mon, Jan 11, 2010 at 4:03 AM, Christine Caulfield
>>>>>> <ccaulfie at redhat.com>    wrote:
>>>>>>>
>>>>>>> On 08/01/10 22:58, Evan Broder wrote:
>>>>>>>>
>>>>>>>> [please preserve the CC when replying, thanks]
>>>>>>>>
>>>>>>>> Hi -
>>>>>>>> We're attempting to setup a clvm (2.02.56) cluster using OpenAIS
>>>>>>>> (1.1.1) and Corosync (1.1.2). We've gotten bitten hard in the past by
>>>>>>>> crashes leaving DLM state around and forcing us to reboot our nodes,
>>>>>>>> so we're specifically looking for a solution that doesn't involve
>>>>>>>> in-kernel locking.
>>>>>>>>
>>>>>>>> We're also running the Pacemaker OpenAIS service, as we're hoping to
>>>>>>>> use it for management of some other resources going forward.
>>>>>>>>
>>>>>>>> We've managed to form the OpenAIS cluster, and get clvmd running on
>>>>>>>> both of our nodes. Operations using LVM succeed, so long as only one
>>>>>>>> operation runs at a time. However, if we attempt to run two
>>>>>>>> operations
>>>>>>>> (say, one lvcreate on each host) at a time, they both hang, and both
>>>>>>>> clvmd processes appear to deadlock.
>>>>>>>>
>>>>>>>> When they deadlock, it doesn't appear to affect the other clustering
>>>>>>>> processes - both corosync and pacemaker still report a fully formed
>>>>>>>> cluster, so it seems the issue is localized to clvmd.
>>>>>>>>
>>>>>>>> I've looked at logs from corosync and pacemaker, and I've straced
>>>>>>>> various processes, but I don't want to blast a bunch of useless
>>>>>>>> information at the list. What information can I provide to make it
>>>>>>>> easier to debug and fix this deadlock?
>>>>>>>>
>>>>>>>
>>>>>>> To start with, the best logging to produce is the clvmd logs which
>>>>>>> can be
>>>>>>> got with clvmd -d (see the man page for details). Ideally these
>>>>>>> should be
>>>>>>> from all nodes in the cluster so they can be correlated. If you're
>>>>>>> still
>>>>>>> using DLM then a dlm lock dump from all nodes is often helpful in
>>>>>>> conjunction with the clvmd logs.
>>>>>>
>>>>>> Sure, no problem. I've posted the logs from clvmd on both processes in
>>>>>> <http://web.mit.edu/broder/Public/clvmd/>. I've annotated them at a
>>>>>> few points with what I was doing - the annotations all start with "
>>>>>>>>
>>>>>>>> ", so they should be easy to spot.
>>>>
>>>>
>>>> Ironically it looks like a bug in the clvmd-openais code. I can reproduce
>>>> it
>>>> on my systems here. I don't see the problem when using the dlm!
>>>>
>>>> Can you try -Icorosync and see if that helps? In the meantime I'll have a
>>>> look at the openais bits to try and find out what is wrong.
>>>>
>>>> Chrissie
>>>>
>>>
>>> I'll see what we can pull together, but the nodes running the clvm
>>> cluster are also Xen dom0's. They're currently running on (Ubuntu
>>> Hardy's) 2.6.24, so upgrading them to something new enough to support
>>> DLM 3 would be...challenging.
>>>
>>> It would be much, much better for us if we could get clvmd-openais
>>> working.
>>>
>>> Is there any chance this would work better if we dropped back to
>>> openais whitetank instead of corosync + openais wilson?
>>>
>>
>>
>> OK, I've found the bug and it IS in openais. The attached patch will fix it.
>>
>> Chrissie
>>
>
> Awesome. That patch fixed our problem.
>
> We are running into one other problem - performing LVM operations on
> one node is substantially slower than performing them on the other
> node:
>
> root at black-mesa:~# time lvcreate -n test -L 1G xenvg
>    Logical volume "test" created
>
> real	0m0.309s
> user	0m0.000s
> sys	0m0.008s
> root at black-mesa:~# time lvremove -f /dev/xenvg/test
>    Logical volume "test" successfully removed
>
> real	0m0.254s
> user	0m0.004s
> sys	0m0.008s
>
>
> root at torchwood-institute:~# time lvcreate -n test -L 1G xenvg
>    Logical volume "test" created
>
> real	0m7.282s
> user	0m6.396s
> sys	0m0.312s
> root at torchwood-institute:~# time lvremove -f /dev/xenvg/test
>    Logical volume "test" successfully removed
>
> real	0m7.277s
> user	0m6.420s
> sys	0m0.292s
>
> Any idea why this is happening and if there's anything we can do about it?
>


I'm not at all sure why that should be happening. I suppose the best 
thing to do would be to enable clvmd logging (clvmd -d) and see what is 
taking the time.

Chrissie



From adam.king at intechnology.com  Fri Jan 22 14:08:23 2010
From: adam.king at intechnology.com (King, Adam)
Date: Fri, 22 Jan 2010 14:08:23 -0000
Subject: [Linux-cluster] fence_ilo
In-Reply-To: <c0773fd31001212328i314a03cay5227bb0fddfc3c73@mail.gmail.com>
References: <ED48EF11677FC44BB3440460C24F58E405A4BE88@mgmtexm03101.intechnology.co.uk>
	<c0773fd31001212328i314a03cay5227bb0fddfc3c73@mail.gmail.com>
Message-ID: <ED48EF11677FC44BB3440460C24F58E405A4C043@mgmtexm03101.intechnology.co.uk>

Hi, 

I have just had a look at the rpm you suggested. Do I need to do anything to configure the cluster to use this instead of the default fencing script? Also, do I need to do anything to alter either of these files?

/etc/comoonics/bootimage/rpms.initrd.d/fence_ilo.list
/opt/atix/comoonics-fencing/fence_ilo

 

 

From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Brett Cave
Sent: 22 January 2010 07:28
To: linux clustering
Subject: Re: [Linux-cluster] fence_ilo

 

 

On Thu, Jan 21, 2010 at 3:25 PM, King, Adam <adam.king at intechnology.com> wrote:



Hi, 

I have a 2 node cluster (xen1 and xen2) running on dl380 G3's. There are 2 virtual machines running on these. From first node from the shell command line I can fence the 2nd  node through the hp lights out port using /sbin/fence_ilo -a 192.168.1.53 -l aking -p password -v. I can also do the same from the second node and successfully fence the first node. When I fence the node that has control of one or more virtual machines they are migrated to the other node.

I had some issues with the default fence_ilo, found a nice fence_fast_ilo fencing script in an rpm: comoonics-bootimage-fenceclient. This seems to work much better.

 

	If I fence a node in Luci this works too. 

	However, if I have a node running one or more virtual machines and run /sbin/reboot -f on that node, in /var/log/messages in the other node I get messages such as 

	xen1 fenced[3126]: agent "fence_ilo" reports: Unable to connect/login to fencing device

	xen1 fenced[3126]: fence "192.168.50.116" failed

 

assuming that 192.168.1.53 is the 2nd server's ilo address and 192.168.50.116 is the 1st servers ilo address?

 

	Can anyone advise why this is happening? 

	Cheers

	Adam 

	  

	Adam King 

	Systems Administrator 

	adam.king at intechnology.com 

	InTechnology plc 

	Support 0845 120 7070 

	Telephone 01423 850000 

	Facsimile 01423 858866 

	www.intechnology.com <http://www.intechnology.com/> 

	
	   

InTechnology provides industry leading network, hosting, data
and IP telephony services to over 800 UK businesses.

	 

 

Over 25 years in business employing 220 employees in the UK 

 



Annual turnover of ?50m, profitable and financially secure 





InTechnology manages over 14,000 managed network circuits 





InTechnology manages over 5PB (over 5.2 million GB) of customer data 





InTechnology hosts more than 16,000 IP telephony seats 





InTechnology operates four managed data centres - 50,000 sq ft capacity 



 

	This is an email from InTechnology plc, Central House, Beckwith Knowle, Harrogate, UK, HG3 1UG.
	Registered in England 3916586.
	
	The contents of this message may be privileged and confidential. If you have received this message in error, you may not use,
	disclose, copy or distribute its content in any way. Please notify the sender immediately. All messages are scanned for all viruses.

	
	--
	Linux-cluster mailing list
	Linux-cluster at redhat.com
	https://www.redhat.com/mailman/listinfo/linux-cluster 
 

Adam King
Systems Administrator
adam.king at intechnology.com


InTechnology plc
Support 0845 120 7070
Telephone 01423 850000
Facsimile 01423 858866
www.intechnology.com

 


This is an email from InTechnology plc, Central House, Beckwith Knowle, Harrogate, UK, HG3 1UG.
Registered in England 3916586.

The contents of this message may be privileged and confidential. If you have received this message in error, you may not use,

disclose, copy or distribute its content in any way. Please notify the sender immediately. All messages are scanned for all viruses.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100122/f022b625/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.gif
Type: image/gif
Size: 6495 bytes
Desc: image001.gif
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100122/f022b625/attachment.gif>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image002.gif
Type: image/gif
Size: 97 bytes
Desc: image002.gif
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100122/f022b625/attachment-0001.gif>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image003.gif
Type: image/gif
Size: 76 bytes
Desc: image003.gif
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100122/f022b625/attachment-0002.gif>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: InTechnology-logo-circle.gif
Type: image/gif
Size: 6495 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100122/f022b625/attachment-0003.gif>

From td3201 at gmail.com  Fri Jan 22 14:45:29 2010
From: td3201 at gmail.com (Terry)
Date: Fri, 22 Jan 2010 08:45:29 -0600
Subject: [Linux-cluster] cannot add 3rd node to running cluster
In-Reply-To: <37B32C9E-A8C3-4BA4-BFB4-FAB6235985D5@auckland.ac.nz>
References: <8ee061010912291130n68f0bad6l496f71df2cd703ac@mail.gmail.com>
	<74e9d01e0912291520l3bc36ac4yc7a17b1f96fa123d@mail.gmail.com>
	<8ee061010912300813i7fd29c70hd81bf691d574df0c@mail.gmail.com>
	<8ee061010912310813g3f45bf6ekfc52c3d5420a5826@mail.gmail.com>
	<37B32C9E-A8C3-4BA4-BFB4-FAB6235985D5@auckland.ac.nz>
Message-ID: <8ee061011001220645j62b075c5u11ce1568df0085a5@mail.gmail.com>

On Mon, Jan 4, 2010 at 1:34 PM, Abraham Alawi <a.alawi at auckland.ac.nz> wrote:
>
> On 1/01/2010, at 5:13 AM, Terry wrote:
>
>> On Wed, Dec 30, 2009 at 10:13 AM, Terry <td3201 at gmail.com> wrote:
>>> On Tue, Dec 29, 2009 at 5:20 PM, Jason W. <jwellband at gmail.com> wrote:
>>>> On Tue, Dec 29, 2009 at 2:30 PM, Terry <td3201 at gmail.com> wrote:
>>>>> Hello,
>>>>>
>>>>> I have a working 2 node cluster that I am trying to add a third node
>>>>> to. ? I am trying to use Red Hat's conga (luci) to add the node in but
>>>>
>>>> If you have two node cluster with two_node=1 in cluster.conf - such as
>>>> two nodes with no quorum device to break a tie - you'll need to bring
>>>> the cluster down, change two_node to 0 on both nodes (and rev the
>>>> cluster version at the top of cluster.conf), bring the cluster up and
>>>> then add the third node.
>>>>
>>>> For troubleshooting any cluster issue, take a look at syslog
>>>> (/var/log/messages by default). It can help to watch it on a
>>>> centralized syslog server that all of your nodes forward logs to.
>>>>
>>>> --
>>>> HTH, YMMV, HANW :)
>>>>
>>>> Jason
>>>>
>>>> The path to enlightenment is /usr/bin/enlightenment.
>>>
>>> Thank you for the response. ?/var/log/messages doesn't have any
>>> errors. ?It says cman started then says can't connect to cluster
>>> infrastructure after a few seconds. ?My cluster does not have the
>>> two_node=1 config now. ?Conga took that out for me. ?That bit me last
>>> night because I needed to put that back in.
>>>
>>
>> CMAN still will not start and gives no debug information. ?Anyone know
>> why cman_tool -d join would not print any output at all?
>> Troubleshooting this is kind of a nightmare. ?I verified that two_node
>> is not in play.
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
> Try this line in your cluster.conf file:
> <logging debug="on" logfile="/var/log/rhcs.log" to_file="yes"/>
>
> Also, if you are sure your cluster.conf is correct then copy it manually to all the nodes and add clean_start="1" to the fence_daemon line in cluster.conf and run 'service cman start' simultaneously on all the nodes (probably a good idea to do that from runlevel 1 but make sure you have the network up first)
>
> Cheers,
>
> ?-- Abraham
>
> ''''''''''''''''''''''''''''''''''''''''''''''''''''''
> Abraham Alawi
>
> Unix/Linux Systems Administrator
> Science IT
> University of Auckland
> e: a.alawi at auckland.ac.nz
> p: +64-9-373 7599, ext#: 87572
>
> ''''''''''''''''''''''''''''''''''''''''''''''''''''''
>
>

I am still battling this.  I stopped the cluster completely, modified
the config and then started it, but that didn't work either.  Same
issue.  I noticed clurgmgrd wasn't staying running so I then tried
this:

[root at omadvnfs01c ~]# clurgmgrd -d -f
[7014] notice: Waiting for CMAN to start

Then in another window I issued:
[root at omadvnfs01c ~]# cman_tool join


Then back in the other window below "[7014] notice: Waiting for CMAN
to start", I got:
failed acquiring lockspace: Transport endpoint is not connected
Locks not working!

Anyone know what could be going on?



From adam.king at intechnology.com  Fri Jan 22 15:00:44 2010
From: adam.king at intechnology.com (King, Adam)
Date: Fri, 22 Jan 2010 15:00:44 -0000
Subject: [Linux-cluster] cannot add 3rd node to running cluster
In-Reply-To: <8ee061011001220645j62b075c5u11ce1568df0085a5@mail.gmail.com>
References: <8ee061010912291130n68f0bad6l496f71df2cd703ac@mail.gmail.com><74e9d01e0912291520l3bc36ac4yc7a17b1f96fa123d@mail.gmail.com><8ee061010912300813i7fd29c70hd81bf691d574df0c@mail.gmail.com><8ee061010912310813g3f45bf6ekfc52c3d5420a5826@mail.gmail.com><37B32C9E-A8C3-4BA4-BFB4-FAB6235985D5@auckland.ac.nz>
	<8ee061011001220645j62b075c5u11ce1568df0085a5@mail.gmail.com>
Message-ID: <ED48EF11677FC44BB3440460C24F58E405A4C069@mgmtexm03101.intechnology.co.uk>

I'm assuming you have read this? http://sources.redhat.com/cluster/wiki/FAQ/CMAN#cman_2to3


 

Adam King
Systems Administrator
adam.king at intechnology.com


InTechnology plc
Support 0845 120 7070
Telephone 01423 850000
Facsimile 01423 858866
www.intechnology.com

 
-----Original Message-----

From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Terry
Sent: 22 January 2010 14:45
To: linux clustering
Subject: Re: [Linux-cluster] cannot add 3rd node to running cluster

On Mon, Jan 4, 2010 at 1:34 PM, Abraham Alawi <a.alawi at auckland.ac.nz> wrote:
>
> On 1/01/2010, at 5:13 AM, Terry wrote:
>
>> On Wed, Dec 30, 2009 at 10:13 AM, Terry <td3201 at gmail.com> wrote:
>>> On Tue, Dec 29, 2009 at 5:20 PM, Jason W. <jwellband at gmail.com> wrote:
>>>> On Tue, Dec 29, 2009 at 2:30 PM, Terry <td3201 at gmail.com> wrote:
>>>>> Hello,
>>>>>
>>>>> I have a working 2 node cluster that I am trying to add a third node
>>>>> to.   I am trying to use Red Hat's conga (luci) to add the node in but
>>>>
>>>> If you have two node cluster with two_node=1 in cluster.conf - such as
>>>> two nodes with no quorum device to break a tie - you'll need to bring
>>>> the cluster down, change two_node to 0 on both nodes (and rev the
>>>> cluster version at the top of cluster.conf), bring the cluster up and
>>>> then add the third node.
>>>>
>>>> For troubleshooting any cluster issue, take a look at syslog
>>>> (/var/log/messages by default). It can help to watch it on a
>>>> centralized syslog server that all of your nodes forward logs to.
>>>>
>>>> --
>>>> HTH, YMMV, HANW :)
>>>>
>>>> Jason
>>>>
>>>> The path to enlightenment is /usr/bin/enlightenment.
>>>
>>> Thank you for the response.  /var/log/messages doesn't have any
>>> errors.  It says cman started then says can't connect to cluster
>>> infrastructure after a few seconds.  My cluster does not have the
>>> two_node=1 config now.  Conga took that out for me.  That bit me last
>>> night because I needed to put that back in.
>>>
>>
>> CMAN still will not start and gives no debug information.  Anyone know
>> why cman_tool -d join would not print any output at all?
>> Troubleshooting this is kind of a nightmare.  I verified that two_node
>> is not in play.
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
> Try this line in your cluster.conf file:
> <logging debug="on" logfile="/var/log/rhcs.log" to_file="yes"/>
>
> Also, if you are sure your cluster.conf is correct then copy it manually to all the nodes and add clean_start="1" to the fence_daemon line in cluster.conf and run 'service cman start' simultaneously on all the nodes (probably a good idea to do that from runlevel 1 but make sure you have the network up first)
>
> Cheers,
>
>  -- Abraham
>
> ''''''''''''''''''''''''''''''''''''''''''''''''''''''
> Abraham Alawi
>
> Unix/Linux Systems Administrator
> Science IT
> University of Auckland
> e: a.alawi at auckland.ac.nz
> p: +64-9-373 7599, ext#: 87572
>
> ''''''''''''''''''''''''''''''''''''''''''''''''''''''
>
>

I am still battling this.  I stopped the cluster completely, modified
the config and then started it, but that didn't work either.  Same
issue.  I noticed clurgmgrd wasn't staying running so I then tried
this:

[root at omadvnfs01c ~]# clurgmgrd -d -f
[7014] notice: Waiting for CMAN to start

Then in another window I issued:
[root at omadvnfs01c ~]# cman_tool join


Then back in the other window below "[7014] notice: Waiting for CMAN
to start", I got:
failed acquiring lockspace: Transport endpoint is not connected
Locks not working!

Anyone know what could be going on?

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
This is an email from InTechnology plc, Central House, Beckwith Knowle, Harrogate, UK, HG3 1UG.
Registered in England 3916586.

The contents of this message may be privileged and confidential. If you have received this message in error, you may not use,

disclose, copy or distribute its content in any way. Please notify the sender immediately. All messages are scanned for all viruses.



From Martin.Waite at datacash.com  Fri Jan 22 15:10:36 2010
From: Martin.Waite at datacash.com (Martin Waite)
Date: Fri, 22 Jan 2010 15:10:36 -0000
Subject: [Linux-cluster] unable to post questions from googlemail
Message-ID: <A78DB34D00374344A0AB65B6523C05DC04F2ADB9@marsden.win.datacash.com>

Hi,

 

I moved my subscription to my home email account, but since then have
been unable to post questions on this list.

 

I have recreated a work account to see if this works.

 

Is googlemail blocked from the list ?

 

-- 

Martin Waite

Solutions Architect

DataCash

 

Tel (direct): +44 (0)131 538 8431

Tel (mobile): +44 (0)7866 750 509

 

DataCash Ltd, Suite 3/1 Great Michael House,
14 Links Place, Edinburgh, EH6 7EZ, United Kingdom.

 

Tel: +44 (0)870 7274 762

Fax: +44 (0)870 7274 782

 

www.datacash.com <http://www.datacash.com/> 

 

DISCLAIMER: This email and any files transmitted with it are
confidential to DataCash Group plc and its group companies. It is
intended only for the person to whom it is addressed. If you have
received this email in error, please forward it to info at datacash.com
<mailto:info at datacash.com>  with the subject line "Received in Error".
If you are not the intended recipient you must not use, disclose, copy,
print, distribute or rely on this email or any transmitted files.
DataCash Ltd is registered in England and Wales no. 3430157. DataCash
Ltd is part of the DataCash Group plc. DataCash Group plc is registered
in England and Wales no. 3168091. DataCash Ltd and DataCash Group plc
registered address is Descartes House, 8 Gate Street, London, WC2A 3HP,
United Kingdom. 

 

Save a tree...Please only print this page if essential

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100122/e88f1e51/attachment.htm>

From td3201 at gmail.com  Fri Jan 22 15:19:45 2010
From: td3201 at gmail.com (Terry)
Date: Fri, 22 Jan 2010 09:19:45 -0600
Subject: [Linux-cluster] cannot add 3rd node to running cluster
In-Reply-To: <ED48EF11677FC44BB3440460C24F58E405A4C069@mgmtexm03101.intechnology.co.uk>
References: <8ee061010912291130n68f0bad6l496f71df2cd703ac@mail.gmail.com>
	<74e9d01e0912291520l3bc36ac4yc7a17b1f96fa123d@mail.gmail.com>
	<8ee061010912300813i7fd29c70hd81bf691d574df0c@mail.gmail.com>
	<8ee061010912310813g3f45bf6ekfc52c3d5420a5826@mail.gmail.com>
	<37B32C9E-A8C3-4BA4-BFB4-FAB6235985D5@auckland.ac.nz>
	<8ee061011001220645j62b075c5u11ce1568df0085a5@mail.gmail.com>
	<ED48EF11677FC44BB3440460C24F58E405A4C069@mgmtexm03101.intechnology.co.uk>
Message-ID: <8ee061011001220719j4405a3b0ue9e9cfeb4399a359@mail.gmail.com>

On Fri, Jan 22, 2010 at 9:00 AM, King, Adam <adam.king at intechnology.com> wrote:
> I'm assuming you have read this? http://sources.redhat.com/cluster/wiki/FAQ/CMAN#cman_2to3
>
>
>
>
> Adam King
> Systems Administrator
> adam.king at intechnology.com
>
>
> InTechnology plc
> Support 0845 120 7070
> Telephone 01423 850000
> Facsimile 01423 858866
> www.intechnology.com
>
>
> -----Original Message-----
>
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Terry
> Sent: 22 January 2010 14:45
> To: linux clustering
> Subject: Re: [Linux-cluster] cannot add 3rd node to running cluster
>
> On Mon, Jan 4, 2010 at 1:34 PM, Abraham Alawi <a.alawi at auckland.ac.nz> wrote:
>>
>> On 1/01/2010, at 5:13 AM, Terry wrote:
>>
>>> On Wed, Dec 30, 2009 at 10:13 AM, Terry <td3201 at gmail.com> wrote:
>>>> On Tue, Dec 29, 2009 at 5:20 PM, Jason W. <jwellband at gmail.com> wrote:
>>>>> On Tue, Dec 29, 2009 at 2:30 PM, Terry <td3201 at gmail.com> wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I have a working 2 node cluster that I am trying to add a third node
>>>>>> to. ? I am trying to use Red Hat's conga (luci) to add the node in but
>>>>>
>>>>> If you have two node cluster with two_node=1 in cluster.conf - such as
>>>>> two nodes with no quorum device to break a tie - you'll need to bring
>>>>> the cluster down, change two_node to 0 on both nodes (and rev the
>>>>> cluster version at the top of cluster.conf), bring the cluster up and
>>>>> then add the third node.
>>>>>
>>>>> For troubleshooting any cluster issue, take a look at syslog
>>>>> (/var/log/messages by default). It can help to watch it on a
>>>>> centralized syslog server that all of your nodes forward logs to.
>>>>>
>>>>> --
>>>>> HTH, YMMV, HANW :)
>>>>>
>>>>> Jason
>>>>>
>>>>> The path to enlightenment is /usr/bin/enlightenment.
>>>>
>>>> Thank you for the response. ?/var/log/messages doesn't have any
>>>> errors. ?It says cman started then says can't connect to cluster
>>>> infrastructure after a few seconds. ?My cluster does not have the
>>>> two_node=1 config now. ?Conga took that out for me. ?That bit me last
>>>> night because I needed to put that back in.
>>>>
>>>
>>> CMAN still will not start and gives no debug information. ?Anyone know
>>> why cman_tool -d join would not print any output at all?
>>> Troubleshooting this is kind of a nightmare. ?I verified that two_node
>>> is not in play.
>>>
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>>
>> Try this line in your cluster.conf file:
>> <logging debug="on" logfile="/var/log/rhcs.log" to_file="yes"/>
>>
>> Also, if you are sure your cluster.conf is correct then copy it manually to all the nodes and add clean_start="1" to the fence_daemon line in cluster.conf and run 'service cman start' simultaneously on all the nodes (probably a good idea to do that from runlevel 1 but make sure you have the network up first)
>>
>> Cheers,
>>
>> ?-- Abraham
>>
>> ''''''''''''''''''''''''''''''''''''''''''''''''''''''
>> Abraham Alawi
>>
>> Unix/Linux Systems Administrator
>> Science IT
>> University of Auckland
>> e: a.alawi at auckland.ac.nz
>> p: +64-9-373 7599, ext#: 87572
>>
>> ''''''''''''''''''''''''''''''''''''''''''''''''''''''
>>
>>
>
> I am still battling this. ?I stopped the cluster completely, modified
> the config and then started it, but that didn't work either. ?Same
> issue. ?I noticed clurgmgrd wasn't staying running so I then tried
> this:
>
> [root at omadvnfs01c ~]# clurgmgrd -d -f
> [7014] notice: Waiting for CMAN to start
>
> Then in another window I issued:
> [root at omadvnfs01c ~]# cman_tool join
>
>
> Then back in the other window below "[7014] notice: Waiting for CMAN
> to start", I got:
> failed acquiring lockspace: Transport endpoint is not connected
> Locks not working!
>
> Anyone know what could be going on?
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> This is an email from InTechnology plc, Central House, Beckwith Knowle, Harrogate, UK, HG3 1UG.
> Registered in England 3916586.
>
> The contents of this message may be privileged and confidential. If you have received this message in error, you may not use,
>
> disclose, copy or distribute its content in any way. Please notify the sender immediately. All messages are scanned for all viruses.
>
> --

I didn't but I performed those steps anyways.  As it sits, I have a
three node cluster with only two nodes in it.  Which is bad too but it
is what it is until I figure this out.  Here's my cluster.conf just
for completeness:

<cluster alias="omadvnfs01" config_version="53" name="omadvnfs01">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="omadvnfs01a.sec.jel.lc" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="omadvnfs01a-drac"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="omadvnfs01b.sec.jel.lc" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="omadvnfs01b-drac"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="omadvnfs01c.sec.jel.lc" nodeid="3" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="omadvnfs01c-drac"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman/>
        <fencedevices>
                <fencedevice agent="fence_drac" ipaddr="10.98.1.211"
login="root" name="omadvnfs01a-drac" passwd="foo"/>
                <fencedevice agent="fence_drac" ipaddr="10.98.1.212"
login="root" name="omadvnfs01b-drac" passwd="foo"/>
                <fencedevice agent="fence_drac" ipaddr="10.98.1.213"
login="root" name="omadvnfs01c-drac" passwd="foo"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="fd_omadvnfs01a-nfs"
nofailback="1" ordered="1" restricted="0">
                                <failoverdomainnode
name="omadvnfs01a.sec.jel.lc" priority="1"/>
                        </failoverdomain>
                        <failoverdomain name="fd_omadvnfs01b-nfs"
nofailback="1" ordered="1" restricted="0">
                                <failoverdomainnode
name="omadvnfs01b.sec.jel.lc" priority="2"/>
                        </failoverdomain>
                        <failoverdomain name="fd_omadvnfs01c-nfs"
nofailback="1" ordered="1" restricted="0">
                                <failoverdomainnode
name="omadvnfs01c.sec.jel.lc" priority="1"/>
                        </failoverdomain>
                </failoverdomains>

I am not sure if I did a restart after I did the work though.  When it
says "shutdown cluster software" that is simply a 'service cman stop'
on redhat, right?  Want to make sure I don't need to kill any other
components before updating the configuration manually.  I appreciate
the help.  I am probably going to try it again this afternoon to
double check my work.



From td3201 at gmail.com  Fri Jan 22 15:20:48 2010
From: td3201 at gmail.com (Terry)
Date: Fri, 22 Jan 2010 09:20:48 -0600
Subject: [Linux-cluster] unable to post questions from googlemail
In-Reply-To: <A78DB34D00374344A0AB65B6523C05DC04F2ADB9@marsden.win.datacash.com>
References: <A78DB34D00374344A0AB65B6523C05DC04F2ADB9@marsden.win.datacash.com>
Message-ID: <8ee061011001220720s3ae83ed9oe89ab536f32ac6bd@mail.gmail.com>

On Fri, Jan 22, 2010 at 9:10 AM, Martin Waite <Martin.Waite at datacash.com> wrote:
> Hi,
>
>
>
> I moved my subscription to my home email account, but since then have been
> unable to post questions on this list.
>
>
>
> I have recreated a work account to see if this works.
>
>
>
> Is googlemail blocked from the list ?
>
>
>
> --
>
> Martin Waite
>
> Solutions Architect
>
> DataCash
>
>
>
> Tel (direct): +44 (0)131 538 8431
>
> Tel (mobile): +44 (0)7866 750 509
>
>
>
> DataCash Ltd, Suite 3/1 Great Michael House,
> 14 Links Place, Edinburgh, EH6 7EZ, United Kingdom.
>
>
>
> Tel: +44 (0)870 7274 762
>
> Fax: +44 (0)870 7274 782
>
>
>
> www.datacash.com
>
>
>
> DISCLAIMER: This email and any files transmitted with it are confidential to
> DataCash Group plc and its group companies. It is intended only for the
> person to whom it is addressed. If you have received this email in error,
> please forward it to info at datacash.com with the subject line "Received in
> Error". If you are not the intended recipient you must not use, disclose,
> copy, print, distribute or rely on this email or any transmitted files.
> DataCash Ltd is registered in England and Wales no. 3430157. DataCash Ltd is
> part of the DataCash Group plc. DataCash Group plc is registered in England
> and Wales no. 3168091. DataCash Ltd and DataCash Group plc registered
> address is Descartes House, 8 Gate Street, London, WC2A 3HP, United Kingdom.
>
>
>
> Save a tree...Please only print this page if essential
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>

Is googlemail different that @gmail.com?  If not, then it works.  :)



From Martin.Waite at datacash.com  Fri Jan 22 15:37:55 2010
From: Martin.Waite at datacash.com (Martin Waite)
Date: Fri, 22 Jan 2010 15:37:55 -0000
Subject: [Linux-cluster] unable to post questions from googlemail
In-Reply-To: <8ee061011001220720s3ae83ed9oe89ab536f32ac6bd@mail.gmail.com>
References: <A78DB34D00374344A0AB65B6523C05DC04F2ADB9@marsden.win.datacash.com>
	<8ee061011001220720s3ae83ed9oe89ab536f32ac6bd@mail.gmail.com>
Message-ID: <A78DB34D00374344A0AB65B6523C05DC04F2ADF1@marsden.win.datacash.com>

Hi,

There was a dispute over the gmail domain name at some point (I don't
know how it was resolved),  but when I signed up "@gmail.com" accounts
were no longer available and I was given an "@googlemail.com" account
instead.

This is very frustrating:  I hate posting to the list using Outlook
(work client) because I cannot get it to format replies nicely - like
this one, for example :-(

I just wondered whether my inability to post was due to some spam-filter
or whether the list-manager software was getting confused over my
changed account.

I tried to contact the list admins, but got no reply.   I will try again
using this account.

Sorry for the noise.

regards,
Martin


-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Terry
Sent: 22 January 2010 15:21
To: linux clustering
Subject: Re: [Linux-cluster] unable to post questions from googlemail

On Fri, Jan 22, 2010 at 9:10 AM, Martin Waite
<Martin.Waite at datacash.com> wrote:
> Hi,
>
>
>
> I moved my subscription to my home email account, but since then have
been
> unable to post questions on this list.
>
>
>
> I have recreated a work account to see if this works.
>
>
>
> Is googlemail blocked from the list ?
>
>
>
> --
>
> Martin Waite
>
> Solutions Architect
>
> DataCash
>
>
>
> Tel (direct): +44 (0)131 538 8431
>
> Tel (mobile): +44 (0)7866 750 509
>
>
>
> DataCash Ltd, Suite 3/1 Great Michael House,
> 14 Links Place, Edinburgh, EH6 7EZ, United Kingdom.
>
>
>
> Tel: +44 (0)870 7274 762
>
> Fax: +44 (0)870 7274 782
>
>
>
> www.datacash.com
>
>
>
> DISCLAIMER: This email and any files transmitted with it are
confidential to
> DataCash Group plc and its group companies. It is intended only for
the
> person to whom it is addressed. If you have received this email in
error,
> please forward it to info at datacash.com with the subject line "Received
in
> Error". If you are not the intended recipient you must not use,
disclose,
> copy, print, distribute or rely on this email or any transmitted
files.
> DataCash Ltd is registered in England and Wales no. 3430157. DataCash
Ltd is
> part of the DataCash Group plc. DataCash Group plc is registered in
England
> and Wales no. 3168091. DataCash Ltd and DataCash Group plc registered
> address is Descartes House, 8 Gate Street, London, WC2A 3HP, United
Kingdom.
>
>
>
> Save a tree...Please only print this page if essential
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>

Is googlemail different that @gmail.com?  If not, then it works.  :)

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From adam.king at intechnology.com  Fri Jan 22 15:44:10 2010
From: adam.king at intechnology.com (King, Adam)
Date: Fri, 22 Jan 2010 15:44:10 -0000
Subject: [Linux-cluster] cannot add 3rd node to running cluster
In-Reply-To: <8ee061011001220719j4405a3b0ue9e9cfeb4399a359@mail.gmail.com>
References: <8ee061010912291130n68f0bad6l496f71df2cd703ac@mail.gmail.com><74e9d01e0912291520l3bc36ac4yc7a17b1f96fa123d@mail.gmail.com><8ee061010912300813i7fd29c70hd81bf691d574df0c@mail.gmail.com><8ee061010912310813g3f45bf6ekfc52c3d5420a5826@mail.gmail.com><37B32C9E-A8C3-4BA4-BFB4-FAB6235985D5@auckland.ac.nz><8ee061011001220645j62b075c5u11ce1568df0085a5@mail.gmail.com><ED48EF11677FC44BB3440460C24F58E405A4C069@mgmtexm03101.intechnology.co.uk>
	<8ee061011001220719j4405a3b0ue9e9cfeb4399a359@mail.gmail.com>
Message-ID: <ED48EF11677FC44BB3440460C24F58E405A4C092@mgmtexm03101.intechnology.co.uk>

This is the way I would do it:
Have it as a 2 node cluster in your cluster.conf
/etc/init.d/cman stop
/etc/init.d/modclusterd stop (not sure if this is necessary or not)
On both nodes.
Edit your cluster.conf on the first node to include the third node
Delete the two_node=1 line 
Scp the cluster.conf file from your first node to the second and third to cut out the risk of any editing mistakes.
Make sure chkconfig has the relevant daemons starting at boot time on all 3 nodes
Reboot all 3 servers using /sbin/reboot -f



 

Adam King
Systems Administrator
adam.king at intechnology.com


InTechnology plc
Support 0845 120 7070
Telephone 01423 850000
Facsimile 01423 858866
www.intechnology.com

 
-----Original Message-----

From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Terry
Sent: 22 January 2010 15:20
To: linux clustering
Subject: Re: [Linux-cluster] cannot add 3rd node to running cluster

On Fri, Jan 22, 2010 at 9:00 AM, King, Adam <adam.king at intechnology.com> wrote:
> I'm assuming you have read this? http://sources.redhat.com/cluster/wiki/FAQ/CMAN#cman_2to3
>
>
>
>
> Adam King
> Systems Administrator
> adam.king at intechnology.com
>
>
> InTechnology plc
> Support 0845 120 7070
> Telephone 01423 850000
> Facsimile 01423 858866
> www.intechnology.com
>
>
> -----Original Message-----
>
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Terry
> Sent: 22 January 2010 14:45
> To: linux clustering
> Subject: Re: [Linux-cluster] cannot add 3rd node to running cluster
>
> On Mon, Jan 4, 2010 at 1:34 PM, Abraham Alawi <a.alawi at auckland.ac.nz> wrote:
>>
>> On 1/01/2010, at 5:13 AM, Terry wrote:
>>
>>> On Wed, Dec 30, 2009 at 10:13 AM, Terry <td3201 at gmail.com> wrote:
>>>> On Tue, Dec 29, 2009 at 5:20 PM, Jason W. <jwellband at gmail.com> wrote:
>>>>> On Tue, Dec 29, 2009 at 2:30 PM, Terry <td3201 at gmail.com> wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I have a working 2 node cluster that I am trying to add a third node
>>>>>> to.   I am trying to use Red Hat's conga (luci) to add the node in but
>>>>>
>>>>> If you have two node cluster with two_node=1 in cluster.conf - such as
>>>>> two nodes with no quorum device to break a tie - you'll need to bring
>>>>> the cluster down, change two_node to 0 on both nodes (and rev the
>>>>> cluster version at the top of cluster.conf), bring the cluster up and
>>>>> then add the third node.
>>>>>
>>>>> For troubleshooting any cluster issue, take a look at syslog
>>>>> (/var/log/messages by default). It can help to watch it on a
>>>>> centralized syslog server that all of your nodes forward logs to.
>>>>>
>>>>> --
>>>>> HTH, YMMV, HANW :)
>>>>>
>>>>> Jason
>>>>>
>>>>> The path to enlightenment is /usr/bin/enlightenment.
>>>>
>>>> Thank you for the response.  /var/log/messages doesn't have any
>>>> errors.  It says cman started then says can't connect to cluster
>>>> infrastructure after a few seconds.  My cluster does not have the
>>>> two_node=1 config now.  Conga took that out for me.  That bit me last
>>>> night because I needed to put that back in.
>>>>
>>>
>>> CMAN still will not start and gives no debug information.  Anyone know
>>> why cman_tool -d join would not print any output at all?
>>> Troubleshooting this is kind of a nightmare.  I verified that two_node
>>> is not in play.
>>>
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>>
>> Try this line in your cluster.conf file:
>> <logging debug="on" logfile="/var/log/rhcs.log" to_file="yes"/>
>>
>> Also, if you are sure your cluster.conf is correct then copy it manually to all the nodes and add clean_start="1" to the fence_daemon line in cluster.conf and run 'service cman start' simultaneously on all the nodes (probably a good idea to do that from runlevel 1 but make sure you have the network up first)
>>
>> Cheers,
>>
>>  -- Abraham
>>
>> ''''''''''''''''''''''''''''''''''''''''''''''''''''''
>> Abraham Alawi
>>
>> Unix/Linux Systems Administrator
>> Science IT
>> University of Auckland
>> e: a.alawi at auckland.ac.nz
>> p: +64-9-373 7599, ext#: 87572
>>
>> ''''''''''''''''''''''''''''''''''''''''''''''''''''''
>>
>>
>
> I am still battling this.  I stopped the cluster completely, modified
> the config and then started it, but that didn't work either.  Same
> issue.  I noticed clurgmgrd wasn't staying running so I then tried
> this:
>
> [root at omadvnfs01c ~]# clurgmgrd -d -f
> [7014] notice: Waiting for CMAN to start
>
> Then in another window I issued:
> [root at omadvnfs01c ~]# cman_tool join
>
>
> Then back in the other window below "[7014] notice: Waiting for CMAN
> to start", I got:
> failed acquiring lockspace: Transport endpoint is not connected
> Locks not working!
>
> Anyone know what could be going on?
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> This is an email from InTechnology plc, Central House, Beckwith Knowle, Harrogate, UK, HG3 1UG.
> Registered in England 3916586.
>
> The contents of this message may be privileged and confidential. If you have received this message in error, you may not use,
>
> disclose, copy or distribute its content in any way. Please notify the sender immediately. All messages are scanned for all viruses.
>
> --

I didn't but I performed those steps anyways.  As it sits, I have a
three node cluster with only two nodes in it.  Which is bad too but it
is what it is until I figure this out.  Here's my cluster.conf just
for completeness:

<cluster alias="omadvnfs01" config_version="53" name="omadvnfs01">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="omadvnfs01a.sec.jel.lc" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="omadvnfs01a-drac"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="omadvnfs01b.sec.jel.lc" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="omadvnfs01b-drac"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="omadvnfs01c.sec.jel.lc" nodeid="3" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="omadvnfs01c-drac"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman/>
        <fencedevices>
                <fencedevice agent="fence_drac" ipaddr="10.98.1.211"
login="root" name="omadvnfs01a-drac" passwd="foo"/>
                <fencedevice agent="fence_drac" ipaddr="10.98.1.212"
login="root" name="omadvnfs01b-drac" passwd="foo"/>
                <fencedevice agent="fence_drac" ipaddr="10.98.1.213"
login="root" name="omadvnfs01c-drac" passwd="foo"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="fd_omadvnfs01a-nfs"
nofailback="1" ordered="1" restricted="0">
                                <failoverdomainnode
name="omadvnfs01a.sec.jel.lc" priority="1"/>
                        </failoverdomain>
                        <failoverdomain name="fd_omadvnfs01b-nfs"
nofailback="1" ordered="1" restricted="0">
                                <failoverdomainnode
name="omadvnfs01b.sec.jel.lc" priority="2"/>
                        </failoverdomain>
                        <failoverdomain name="fd_omadvnfs01c-nfs"
nofailback="1" ordered="1" restricted="0">
                                <failoverdomainnode
name="omadvnfs01c.sec.jel.lc" priority="1"/>
                        </failoverdomain>
                </failoverdomains>

I am not sure if I did a restart after I did the work though.  When it
says "shutdown cluster software" that is simply a 'service cman stop'
on redhat, right?  Want to make sure I don't need to kill any other
components before updating the configuration manually.  I appreciate
the help.  I am probably going to try it again this afternoon to
double check my work.

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
This is an email from InTechnology plc, Central House, Beckwith Knowle, Harrogate, UK, HG3 1UG.
Registered in England 3916586.

The contents of this message may be privileged and confidential. If you have received this message in error, you may not use,

disclose, copy or distribute its content in any way. Please notify the sender immediately. All messages are scanned for all viruses.



From alkol6 at gmail.com  Fri Jan 22 16:00:08 2010
From: alkol6 at gmail.com (Senol Erdogan)
Date: Fri, 22 Jan 2010 18:00:08 +0200
Subject: [Linux-cluster] cannot add 3rd node to running cluster
In-Reply-To: <8ee061011001220719j4405a3b0ue9e9cfeb4399a359@mail.gmail.com>
References: <8ee061010912291130n68f0bad6l496f71df2cd703ac@mail.gmail.com>
	<74e9d01e0912291520l3bc36ac4yc7a17b1f96fa123d@mail.gmail.com>
	<8ee061010912300813i7fd29c70hd81bf691d574df0c@mail.gmail.com>
	<8ee061010912310813g3f45bf6ekfc52c3d5420a5826@mail.gmail.com>
	<37B32C9E-A8C3-4BA4-BFB4-FAB6235985D5@auckland.ac.nz>
	<8ee061011001220645j62b075c5u11ce1568df0085a5@mail.gmail.com>
	<ED48EF11677FC44BB3440460C24F58E405A4C069@mgmtexm03101.intechnology.co.uk>
	<8ee061011001220719j4405a3b0ue9e9cfeb4399a359@mail.gmail.com>
Message-ID: <93bf230a1001220800k5ec58703v70a68e5340088bf4@mail.gmail.com>

hi,
maybe u have your cluster have Votes problem,

when u use "cman_tool status" and if u see like a "Quorum: 2 Activity
Blocked" message, then
use "cman_tool expected -e 1". Again use "cman_tool status", u will see
removed "Activity Blocked" on "Quorum:".

This command pull down Qourum of cluster and cluster will runing (expected
vote: n/2+1). This solution a tepmrolary solution for inadequate cluster
qourum. (before check config_version="{ver.num}" , u know, must be same
number in all nodes)

i hope server for your problem

(yep, i know.. my englsh are ultra-professionel-imba :) )


2010/1/22 Terry <td3201 at gmail.com>

> On Fri, Jan 22, 2010 at 9:00 AM, King, Adam <adam.king at intechnology.com>
> wrote:
> > I'm assuming you have read this?
> http://sources.redhat.com/cluster/wiki/FAQ/CMAN#cman_2to3
> >
> >
> >
> >
> > Adam King
> > Systems Administrator
> > adam.king at intechnology.com
> >
> >
> > InTechnology plc
> > Support 0845 120 7070
> > Telephone 01423 850000
> > Facsimile 01423 858866
> > www.intechnology.com
> >
> >
> > -----Original Message-----
> >
> > From: linux-cluster-bounces at redhat.com [mailto:
> linux-cluster-bounces at redhat.com] On Behalf Of Terry
> > Sent: 22 January 2010 14:45
> > To: linux clustering
> > Subject: Re: [Linux-cluster] cannot add 3rd node to running cluster
> >
> > On Mon, Jan 4, 2010 at 1:34 PM, Abraham Alawi <a.alawi at auckland.ac.nz>
> wrote:
> >>
> >> On 1/01/2010, at 5:13 AM, Terry wrote:
> >>
> >>> On Wed, Dec 30, 2009 at 10:13 AM, Terry <td3201 at gmail.com> wrote:
> >>>> On Tue, Dec 29, 2009 at 5:20 PM, Jason W. <jwellband at gmail.com>
> wrote:
> >>>>> On Tue, Dec 29, 2009 at 2:30 PM, Terry <td3201 at gmail.com> wrote:
> >>>>>> Hello,
> >>>>>>
> >>>>>> I have a working 2 node cluster that I am trying to add a third node
> >>>>>> to.   I am trying to use Red Hat's conga (luci) to add the node in
> but
> >>>>>
> >>>>> If you have two node cluster with two_node=1 in cluster.conf - such
> as
> >>>>> two nodes with no quorum device to break a tie - you'll need to bring
> >>>>> the cluster down, change two_node to 0 on both nodes (and rev the
> >>>>> cluster version at the top of cluster.conf), bring the cluster up and
> >>>>> then add the third node.
> >>>>>
> >>>>> For troubleshooting any cluster issue, take a look at syslog
> >>>>> (/var/log/messages by default). It can help to watch it on a
> >>>>> centralized syslog server that all of your nodes forward logs to.
> >>>>>
> >>>>> --
> >>>>> HTH, YMMV, HANW :)
> >>>>>
> >>>>> Jason
> >>>>>
> >>>>> The path to enlightenment is /usr/bin/enlightenment.
> >>>>
> >>>> Thank you for the response.  /var/log/messages doesn't have any
> >>>> errors.  It says cman started then says can't connect to cluster
> >>>> infrastructure after a few seconds.  My cluster does not have the
> >>>> two_node=1 config now.  Conga took that out for me.  That bit me last
> >>>> night because I needed to put that back in.
> >>>>
> >>>
> >>> CMAN still will not start and gives no debug information.  Anyone know
> >>> why cman_tool -d join would not print any output at all?
> >>> Troubleshooting this is kind of a nightmare.  I verified that two_node
> >>> is not in play.
> >>>
> >>> --
> >>> Linux-cluster mailing list
> >>> Linux-cluster at redhat.com
> >>> https://www.redhat.com/mailman/listinfo/linux-cluster
> >>
> >>
> >> Try this line in your cluster.conf file:
> >> <logging debug="on" logfile="/var/log/rhcs.log" to_file="yes"/>
> >>
> >> Also, if you are sure your cluster.conf is correct then copy it manually
> to all the nodes and add clean_start="1" to the fence_daemon line in
> cluster.conf and run 'service cman start' simultaneously on all the nodes
> (probably a good idea to do that from runlevel 1 but make sure you have the
> network up first)
> >>
> >> Cheers,
> >>
> >>  -- Abraham
> >>
> >> ''''''''''''''''''''''''''''''''''''''''''''''''''''''
> >> Abraham Alawi
> >>
> >> Unix/Linux Systems Administrator
> >> Science IT
> >> University of Auckland
> >> e: a.alawi at auckland.ac.nz
> >> p: +64-9-373 7599, ext#: 87572
> >>
> >> ''''''''''''''''''''''''''''''''''''''''''''''''''''''
> >>
> >>
> >
> > I am still battling this.  I stopped the cluster completely, modified
> > the config and then started it, but that didn't work either.  Same
> > issue.  I noticed clurgmgrd wasn't staying running so I then tried
> > this:
> >
> > [root at omadvnfs01c ~]# clurgmgrd -d -f
> > [7014] notice: Waiting for CMAN to start
> >
> > Then in another window I issued:
> > [root at omadvnfs01c ~]# cman_tool join
> >
> >
> > Then back in the other window below "[7014] notice: Waiting for CMAN
> > to start", I got:
> > failed acquiring lockspace: Transport endpoint is not connected
> > Locks not working!
> >
> > Anyone know what could be going on?
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> > This is an email from InTechnology plc, Central House, Beckwith Knowle,
> Harrogate, UK, HG3 1UG.
> > Registered in England 3916586.
> >
> > The contents of this message may be privileged and confidential. If you
> have received this message in error, you may not use,
> >
> > disclose, copy or distribute its content in any way. Please notify the
> sender immediately. All messages are scanned for all viruses.
> >
> > --
>
> I didn't but I performed those steps anyways.  As it sits, I have a
> three node cluster with only two nodes in it.  Which is bad too but it
> is what it is until I figure this out.  Here's my cluster.conf just
> for completeness:
>
> <cluster alias="omadvnfs01" config_version="53" name="omadvnfs01">
>        <fence_daemon clean_start="0" post_fail_delay="0"
> post_join_delay="3"/>
>        <clusternodes>
>                <clusternode name="omadvnfs01a.sec.jel.lc" nodeid="1"
> votes="1">
>                        <fence>
>                                <method name="1">
>                                        <device name="omadvnfs01a-drac"/>
>                                </method>
>                        </fence>
>                </clusternode>
>                <clusternode name="omadvnfs01b.sec.jel.lc" nodeid="2"
> votes="1">
>                        <fence>
>                                <method name="1">
>                                        <device name="omadvnfs01b-drac"/>
>                                </method>
>                        </fence>
>                </clusternode>
>                <clusternode name="omadvnfs01c.sec.jel.lc" nodeid="3"
> votes="1">
>                        <fence>
>                                <method name="1">
>                                        <device name="omadvnfs01c-drac"/>
>                                </method>
>                        </fence>
>                </clusternode>
>        </clusternodes>
>        <cman/>
>        <fencedevices>
>                <fencedevice agent="fence_drac" ipaddr="10.98.1.211"
> login="root" name="omadvnfs01a-drac" passwd="foo"/>
>                <fencedevice agent="fence_drac" ipaddr="10.98.1.212"
> login="root" name="omadvnfs01b-drac" passwd="foo"/>
>                <fencedevice agent="fence_drac" ipaddr="10.98.1.213"
> login="root" name="omadvnfs01c-drac" passwd="foo"/>
>        </fencedevices>
>        <rm>
>                <failoverdomains>
>                        <failoverdomain name="fd_omadvnfs01a-nfs"
> nofailback="1" ordered="1" restricted="0">
>                                <failoverdomainnode
> name="omadvnfs01a.sec.jel.lc" priority="1"/>
>                        </failoverdomain>
>                        <failoverdomain name="fd_omadvnfs01b-nfs"
> nofailback="1" ordered="1" restricted="0">
>                                <failoverdomainnode
> name="omadvnfs01b.sec.jel.lc" priority="2"/>
>                        </failoverdomain>
>                        <failoverdomain name="fd_omadvnfs01c-nfs"
> nofailback="1" ordered="1" restricted="0">
>                                <failoverdomainnode
> name="omadvnfs01c.sec.jel.lc" priority="1"/>
>                        </failoverdomain>
>                </failoverdomains>
>
> I am not sure if I did a restart after I did the work though.  When it
> says "shutdown cluster software" that is simply a 'service cman stop'
> on redhat, right?  Want to make sure I don't need to kill any other
> components before updating the configuration manually.  I appreciate
> the help.  I am probably going to try it again this afternoon to
> double check my work.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100122/f6368c53/attachment.htm>

From alkol6 at gmail.com  Fri Jan 22 16:10:50 2010
From: alkol6 at gmail.com (Senol Erdogan)
Date: Fri, 22 Jan 2010 18:10:50 +0200
Subject: [Linux-cluster] fail over
In-Reply-To: <D24B5EDEE969354CA10D44C56305CB4B1D9B89019D@a0040semail3.pinellas.local>
References: <D24B5EDEE969354CA10D44C56305CB4B1D9B89019D@a0040semail3.pinellas.local>
Message-ID: <93bf230a1001220810q26516a54w15593e9d708e4277@mail.gmail.com>

hi,
-set "restricted" mode this failover domain
-use "priority" system on this failower domain
- Primary node must have "1" priority and second node have "2"

mean, when u use "Priority" on nodes, failback will run. (if nodes havent
any problem joining to cluster and if rgmanager on autostart )

2010/1/21 Fagnon Raymond <pcs.fagnonr at pcsb.org>

>  I want to fail my virtual server over to my standby node. I am going to
> do this via a reboot. When the primary is backup what command can I issue to
> have the virtual server failover to the primary again?
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100122/f59ef5df/attachment.htm>

From pradhanparas at gmail.com  Fri Jan 22 16:15:17 2010
From: pradhanparas at gmail.com (Paras pradhan)
Date: Fri, 22 Jan 2010 10:15:17 -0600
Subject: [Linux-cluster] Luci home page
In-Reply-To: <13627661.21264116538974.JavaMail.root@athena>
References: <782081.52039.qm@web32102.mail.mud.yahoo.com>
	<13627661.21264116538974.JavaMail.root@athena>
Message-ID: <8b711df41001220815u53829e23rfb3c300161cb6efa@mail.gmail.com>

This fixes the 11111 timeout issue but not the un-responsiveness of conga
storage tab when you have device mapper running.

Thanks
Paras.


On Thu, Jan 21, 2010 at 5:28 PM, Paul M. Dyer <pmdyer at ctgcentral2.com>wrote:

> Hi,
>
> I had this problem and worked with RHN for about 4 weeks.   They have been
> able to reproduce it, and expect to send the bugzilla report to engineering
> at some point.
>
> For me, I was able to bypass the timeout error by starting all ricci agents
> as root.   Somehow, this is a workaround for a saslauthd issue.
>
> Edit /etc/init.d/ricci, and change line 155 to:
>
> daemon "$RICCID"
>
> removing the -u "$NewUID" parameter.
>
> Paul
>
> ----- Original Message -----
> From: "Celso K. Webber" <celsowebber at yahoo.com>
> To: "linux clustering" <linux-cluster at redhat.com>
> Sent: Tuesday, January 19, 2010 3:24:51 PM (GMT-0600) America/Chicago
> Subject: Re: [Linux-cluster] Luci home page
>
> Hi Dirk,
>
> I've double checked my environment, and I have SELinux DISABLED on both
> nodes, as I have the firewall DISABLED also.
>
> Thanks.
>
>
>
> ----- Original Message ----
> From: Dirk H. Schulz <dirk.schulz at kinzesberg.de>
> To: linux clustering <linux-cluster at redhat.com>
> Sent: Tue, January 19, 2010 4:14:28 PM
> Subject: Re: [Linux-cluster] Luci home page
>
> I have experienced this kind of difficulties when I started testing conga.
> One of the first things I tried was setting SElinux to permissive on the
> conga server, and from then on I could work well with it.
>
> I did not look into audit.log to find out if there is a solution to it, so
> far.
>
> Dirk
>
> Celso K. Webber schrieb:
> > Hi,
> >
> > I don't know if your problem is the same as mine, but I'm experiencing a
> problem where luci is a little bit unstable.
> >
> > I can log in into luci, but when I click the "cluster" tab, it takes some
> time and returns:
> > "An error occurred when trying to contact any of the nodes in the
> rhcs-xen cluster."
> >
> > In the server where luci is running (node1), I see the following messages
> in /var/log/messages:
> > Jan  7 17:43:01 node1 luci[5262]: Error reading from
> node2.localdomain:11111: timeout
> > Jan  7 17:43:01 node1 luci[5262]: Error reading from
> node2.localdomain:11111: timeout
> >
> > Jan  7 17:43:06 node1 luci[5262]: Error reading from
> node1.localdomain:11111: timeout
> > Jan  7 17:43:06 node1 luci[5262]: Error reading from
> node1.localdomain:11111: timeout
> >
> > I've checked name resolution, host names, /etc/hosts, etc. Firewall and
> SELinux is disabled. I can do a "telnet nodeX.localdomain 11111" from
> anywhere successfully.
> >
> > Whenever I receive this kind of "timeout error" in luci, if I click the
> same link again or if I do a "reload" in the browser, then luci usually
> responds correctly. There is the inconvenience of re-clicking almost every
> link I use in luci.
> >
> > I'm using a fresh RHEL 5.4 installation, did "yum update" recently and
> the problem persistend. I indeed tried to remove luci and ricci, rm -rf
> /var/lib/luci and /var/lib/ricci, but with no change in behaviour.
> >
> > Does anyone has these same symptoms?
> >
> > Thanks, Celso.
> >
> > ------------------------------------------------------------------------
> > *From:* Fagnon Raymond <pcs.fagnonr at pcsb.org>
> > *To:* linux clustering <linux-cluster at redhat.com>
> > *Sent:* Mon, January 18, 2010 7:24:48 PM
> > *Subject:* [Linux-cluster] Luci home page
> >
> > When I log into my luci homepage and click on cluster. My Cluster does
> not appear.  The servers show up under the storage tab and so forth.
> >
> >
> > As far as I know this is a nfs cluster
> >
> >
> > ------------------------------------------------------------------------
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100122/ae972ffa/attachment.htm>

From pcs.fagnonr at pcsb.org  Fri Jan 22 16:18:02 2010
From: pcs.fagnonr at pcsb.org (Fagnon Raymond)
Date: Fri, 22 Jan 2010 11:18:02 -0500
Subject: [Linux-cluster] heartbeat
Message-ID: <D24B5EDEE969354CA10D44C56305CB4B1D9B8904AF@a0040semail3.pinellas.local>

I am new to this so please bear with me. But where is the heartbeat setup? Is it in the cluster.conf? And if so on what stanza?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100122/39f0fbb4/attachment.htm>

From alkol6 at gmail.com  Fri Jan 22 16:32:00 2010
From: alkol6 at gmail.com (Senol Erdogan)
Date: Fri, 22 Jan 2010 18:32:00 +0200
Subject: [Linux-cluster] heartbeat
In-Reply-To: <D24B5EDEE969354CA10D44C56305CB4B1D9B8904AF@a0040semail3.pinellas.local>
References: <D24B5EDEE969354CA10D44C56305CB4B1D9B8904AF@a0040semail3.pinellas.local>
Message-ID: <93bf230a1001220832g45367217sadb9e0f899f62635@mail.gmail.com>

mean failover services seting  or check pulse frequency setting?

2010/1/22 Fagnon Raymond <pcs.fagnonr at pcsb.org>

>  I am new to this so please bear with me. But where is the heartbeat
> setup? Is it in the cluster.conf? And if so on what stanza?
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100122/fc0403ba/attachment.htm>

From dirk.schulz at kinzesberg.de  Fri Jan 22 17:33:01 2010
From: dirk.schulz at kinzesberg.de (Dirk H. Schulz)
Date: Fri, 22 Jan 2010 18:33:01 +0100
Subject: [Linux-cluster] CLVM and snapshot: test results and questions
Message-ID: <4B59E14D.4080500@kinzesberg.de>

Hi all,

according to RedHat's docs creating snapshots in a cluster environment 
should be basically possible.
Here is what the "Cluster_Logical_Volume_Manager.pdf" says on page 49:
> LVM snapshots are not cluster-aware, so they require exclusive access 
> to a volume. For information
> on activating logical volumes on individual nodes in a cluster, see 
> Section 8, ?Activating
> Logical Volumes on Individual Nodes in a Cluster?.
In Section 8 they write
> To activate logical volumes exclusively on one node, use the lvchange 
> -aey command.
I have a 2 node cluster using GFS shared volumes based on CLVM. 
According to the above I tried the following:

- umount a GFS volume on node 2
- gfs_tool freeze /mountpoint/of/GFSvolume
- lvchange -aey LOGICALVOLUMEgfsresideson on node 1

I got an "error locking ..." then.

Next I tried "lvchange -an LOGICALVOLUME..." on node 2 which threw the 
same error, but after a while
"lvchange -aey LOGICALVOLUMEgfsresideson" on node 1 worked and the 
logical volume did not appear on node 2 with "lvdisplay".

So now I tried "lvcreate ... -s ..." on node 1 to create a snapshot, but 
received the standard error that shapshots of clustered volumes are not 
implemented yet.

Now there is 2 possibilities how this could be understood:
1. lvcreate refuses snapshotting cluster logical volumes if the cluster 
bit is there - disregarding the exclusivity of the activation state.
2. I did something wrong or incomplete.

I really would regret number 1 being true, and because of the errors I 
received inbetween I hope that 2 comes up as winner. :-)
E.g. on HPUX you have to activate the volume group exclusively, not the 
logical volume. Could that be the reason.

Is anyone out there who can shed enough light on all this?

Thanks for any hint or help

Dirk



From daniela.anzellotti at roma1.infn.it  Fri Jan 22 18:37:47 2010
From: daniela.anzellotti at roma1.infn.it (Daniela Anzellotti)
Date: Fri, 22 Jan 2010 19:37:47 +0100
Subject: [Linux-cluster] CLVMD hangs
In-Reply-To: <4B507976.30709@redhat.com>
References: <00b101ca95b3$ee5ff0b0$cb1fd210$@it> <4B507976.30709@redhat.com>
Message-ID: <4B59F07B.2050602@roma1.infn.it>

Hi all,

same problem for me: today after a reboot of a cluster node, clvmd could 
not start properly anymore since pvscan hangs.

If I try to start vgscan on the other node it hangs too!! If I 'strace' 
on pvscan I can see it stops while reading...

It's a cluster of two nodes, with quorum disk (two_node="0"), offering 
several VMs installed on raw partitions in a CLVM environment.

It's running version 5.4, x86_64 , with the following packages:

cman-2.0.115-1.el5_4.3
rgmanager-2.0.52-1.el5_4.2
lvm2-2.02.46-8.el5_4.2
lvm2-cluster-2.02.46-8.el5_4.1

kernel-xen-2.6.18-164.2.1.el5
kernel-xen-devel-2.6.18-164.2.1.el5
xen-3.0.3-94.el5_4.2
xen-libs-3.0.3-94.el5_4.2


Any suggestion on what can I check/change/install? I don't have any idea 
on what happened...

Thanks,
Daniela



Christine Caulfield wrote:
> On 15/01/10 07:25, cluster at xinet.it wrote:
>> Hi all,
>>
>> i use redhat cluster my os version is 5.4. I have a three-node cluster
>> and clustered VG (XenVG) with many LV (one for each VM). After a cluster
>> reboot I?m not able to restart clvmd anymore due to vgscan hangs.
>>
>> Is there a way to convert the clustered VG to a non-clustered one? I
>> tried changing the locking_typ parameter from 3 to 1 but it wasn?t the
>> right solution.
> 
> You can remove the cluster flag from a VG with the following command
> 
> vgchange -cn <vg> --config 'global {locking_type = 0}'
> 
> 
> If you're amenable it would be helpful to try and debug this. Can you add
> 
> CLVMDOPTS="-d2"
> 
> to the file /etc/sysconfig/cluster
> 
> 
> and post the clvmd parts of system to me please.
> 
> Thanks
> 
> Chrissie
> 
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 

-- 
- Daniela Anzellotti ------------------------------------
  INFN Roma - tel.: +39.06.49914282 - fax: +39.06.490354
  e-mail: daniela.anzellotti at roma1.infn.it
---------------------------------------------------------

Comunicato della Presidenza della Repubblica  -  Roma, 7 ottobre 2009
Tutti sanno da che parte sta il Presidente della Repubblica. Sta dalla 
parte della Costituzione, esercitando le sue funzioni con assoluta 
imparzialit? e in uno spirito di leale collaborazione istituzionale.
--



From raju.rajsand at gmail.com  Sat Jan 23 04:29:41 2010
From: raju.rajsand at gmail.com (Rajagopal Swaminathan)
Date: Sat, 23 Jan 2010 09:59:41 +0530
Subject: [Linux-cluster] GFS resource
In-Reply-To: <8b711df41001200827n1b43d259t98f553ca408e4694@mail.gmail.com>
References: <8b711df41001200827n1b43d259t98f553ca408e4694@mail.gmail.com>
Message-ID: <8786b91c1001222029m3b9bb2das5a9bfbdd6956981e@mail.gmail.com>

Greetings,

On Wed, Jan 20, 2010 at 9:57 PM, Paras pradhan <pradhanparas at gmail.com>wrote:

> Hi,
>
> Now my question is do I need to add a GFS resource in this cluster.  If yes
> why?
>
>
If you want to create VMs or expose the LUNs to the VMs, might as well use
the LVs (for clusterwide usage one anyway has to use CLVM)

GFS adds overhead when giving advantage clusterwide locking facility etc.

HTH

Regards,

Rajagopal
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100123/1948a2ae/attachment.htm>

From xavier.montagutelli at unilim.fr  Sat Jan 23 08:26:56 2010
From: xavier.montagutelli at unilim.fr (Xavier Montagutelli)
Date: Sat, 23 Jan 2010 09:26:56 +0100
Subject: [Linux-cluster] CLVM and snapshot: test results and questions
In-Reply-To: <4B59E14D.4080500@kinzesberg.de>
References: <4B59E14D.4080500@kinzesberg.de>
Message-ID: <201001230926.56431.xavier.montagutelli@unilim.fr>

On Friday 22 January 2010 18:33:01 Dirk H. Schulz wrote:
> Hi all,
> 
> according to RedHat's docs creating snapshots in a cluster environment
> should be basically possible.
> 
> Here is what the "Cluster_Logical_Volume_Manager.pdf" says on page 49:
> > LVM snapshots are not cluster-aware, so they require exclusive access
> > to a volume. For information
> > on activating logical volumes on individual nodes in a cluster, see
> > Section 8, ?Activating
> > Logical Volumes on Individual Nodes in a Cluster?.
[..]
> So now I tried "lvcreate ... -s ..." on node 1 to create a snapshot, but
> received the standard error that shapshots of clustered volumes are not
> implemented yet.
> 
> Now there is 2 possibilities how this could be understood:
> 1. lvcreate refuses snapshotting cluster logical volumes if the cluster
> bit is there - disregarding the exclusivity of the activation state.
> 2. I did something wrong or incomplete.
> 
> I really would regret number 1 being true, and because of the errors I
> received inbetween I hope that 2 comes up as winner. :-)
> E.g. on HPUX you have to activate the volume group exclusively, not the
> logical volume. Could that be the reason.
> 
> Is anyone out there who can shed enough light on all this?

A word of warning : I am not an expert. 

Unfortunately the answer #1 seems to be the good one ...

http://archives.free.net.ph/message/20081014.081254.df482534.en.html

Latest thread on "linux-lvm" ML :

https://www.redhat.com/archives/linux-lvm/2010-January/msg00007.html

-- 
Xavier Montagutelli                      Tel : +33 (0)5 55 45 77 20
Service Commun Informatique              Fax : +33 (0)5 55 45 75 95
Universite de Limoges
123, avenue Albert Thomas
87060 Limoges cedex



From dirk.schulz at kinzesberg.de  Sat Jan 23 09:52:07 2010
From: dirk.schulz at kinzesberg.de (Dirk H. Schulz)
Date: Sat, 23 Jan 2010 10:52:07 +0100
Subject: [Linux-cluster] CLVM and snapshot: test results and questions
In-Reply-To: <201001230926.56431.xavier.montagutelli@unilim.fr>
References: <4B59E14D.4080500@kinzesberg.de>
	<201001230926.56431.xavier.montagutelli@unilim.fr>
Message-ID: <4B5AC6C7.8070408@kinzesberg.de>

Hi Xavier,

Xavier Montagutelli schrieb:
>
> Unfortunately the answer #1 seems to be the good one ...
>
> http://archives.free.net.ph/message/20081014.081254.df482534.en.html
>   
That's a great hint, thanks alot.

Now I just have to find out what happened to this patch in the last 15 
months.

Dirk



From dirk.schulz at kinzesberg.de  Sat Jan 23 10:06:19 2010
From: dirk.schulz at kinzesberg.de (Dirk H. Schulz)
Date: Sat, 23 Jan 2010 11:06:19 +0100
Subject: [Linux-cluster] GFS resource
In-Reply-To: <8b711df41001200827n1b43d259t98f553ca408e4694@mail.gmail.com>
References: <8b711df41001200827n1b43d259t98f553ca408e4694@mail.gmail.com>
Message-ID: <4B5ACA1B.9040807@kinzesberg.de>

Paras pradhan schrieb:
> Hi,
>
> I have a 3 nodes cluster with shared GFS2 partition connected through 
> SAN. This is Xen virtualization cluster and fencing is working 
> properly. So when there is a problem in any nodes, the node is being 
> fenced and other nodes are working properly. Now my question is do I 
> need to add a GFS resource in this cluster.  If yes why?
I am thinking about the same question. I guess it depends if you have 
CLVM "below" your GFS2 partitions. If yes, clvmd must be running before 
the GFS2 partitions can be mounted (you cannot mount them via fstab at 
system startup then), which means that you have to have them defined as 
resource to get the dependency configured.

If you have setup GFS2 directly on the SAN LUNs I cannot say from 
experience, but from my understanding it's the same situation, since GFS 
uses dlm for locking (but wasn't there an option to use something else?).

In my personal opinion (that is why I am thinking about it in the first 
place) it is a question of clear design. If you have EVERYTHING your 
cluster services need defined as a resource you have a straightforward 
design that
- is easily understood
- is easily documented
- is easily remembered over some years
and all that leads to much easier and less error prone handling in 
emergency situations.

Dirk
>
>
> Thanks!
> Paras.
>
> ------------------------------------------------------------------------
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster




From dirk.schulz at kinzesberg.de  Sat Jan 23 10:12:54 2010
From: dirk.schulz at kinzesberg.de (Dirk H. Schulz)
Date: Sat, 23 Jan 2010 11:12:54 +0100
Subject: [Linux-cluster] Snapshotting GFS and freezing
In-Reply-To: <20100120160401.GA10468@esri.com>
References: <4B4F527F.8030105@kinzesberg.de>	<64D0546C5EBBD147B75DE133D798665F03F3F472@hugo.eprize.local>	<20100115093514.GJ17978@reaktio.net>	<20100115171211.GA13575@esri.com>
	<4B5314D4.8010900@kinzesberg.de>	<20100118103439.GK17978@reaktio.net>	<4B570FAA.1090202@kinzesberg.de>
	<20100120160401.GA10468@esri.com>
Message-ID: <4B5ACBA6.90408@kinzesberg.de>

Ray Van Dolson schrieb:
> On Wed, Jan 20, 2010 at 06:14:02AM -0800, Dirk H. Schulz wrote:
>   
>> That means calling
>>         sync
>>         echo 3 > /proc/sys/vm/drop_caches
>> inside the VM, right? Or is there anything more to do to flush everything?
>>
>> Dirk
>>     
>
> I would think the best way would be to actually pause the VM.  If you
> snapshot a VM while it's running, even with the above, how can you
> guarantee an application won't do something on the VM right as the
> snapshot occurs?
>   

That is an important point, I am already pondering about it. Of course 
pausing the vm is the better option - if you can afford it.

I am working systematically on an environment where the possibility of 
very high level availability vms is part of the design (e.g. 99,99 %). 
At the moment the only approach I can see is doing snapshots of a 
running vm and making those as reliable as possible.

Dirk




From dirk.schulz at kinzesberg.de  Sat Jan 23 10:31:49 2010
From: dirk.schulz at kinzesberg.de (Dirk H. Schulz)
Date: Sat, 23 Jan 2010 11:31:49 +0100
Subject: [Linux-cluster] Cluster blocked because "waiting for 1 more stopped
	message"
Message-ID: <4B5AD015.2080902@kinzesberg.de>

Hi folks,

testing around with activating/deactivating my cluster logical volumes I 
drove the cluster into a situation where clvmd on one node was stuck, so 
I decided to reboot the node.
This did not work because the kernel could not unmount some file system. 
I had to power it off. So far my fault, I thought.

On the other node group_tool dump gave back:
> 1264171959 0:default waiting for 1 more stopped messages before 
> LEAVE_ALL_STOPPED 2
> 1264171959 2:XenImages waiting for 1 more stopped messages before 
> LEAVE_ALL_STOPPED 2
> 1264171959 1:XenImages waiting for 1 more stopped messages before 
> LEAVE_ALL_STOPPED 2
> 1264171959 1:clvmd waiting for 1 more stopped messages before 
> LEAVE_ALL_STOPPED 2
> 1264171959 got client 13 dump
And even after rebooting the problem node and restarting cman, clvmd and 
rgmanager, services on the working node were stuck as well with the 
above  messages being shown.

I did not find any way to push the cluster back into working condition 
other than rebooting the working node also. Even a "kill -9" on clvmd 
did not work!

Is there any way to manually fake the awaited "stopped message" to make 
the rest of the cluster go on? There MUST be, because otherwise this 
would kill the concept of a cluster on the whole: waiting for a dead 
nodes last "stopped" message before going on clustering does not make 
much sense to me.

If anyone out there could help me understand why it is implemented that 
way and point me at what to do in such a case, I would be very happy.

Dirk




From dirk.schulz at kinzesberg.de  Sat Jan 23 10:40:55 2010
From: dirk.schulz at kinzesberg.de (Dirk H. Schulz)
Date: Sat, 23 Jan 2010 11:40:55 +0100
Subject: [Linux-cluster] Startup delay for VMs in a cluster
Message-ID: <4B5AD237.5050509@kinzesberg.de>

Hi folks,

the VMs in my cluster are defined via <vm> containers which are not 
inside <service> containers because the latter prevents live migration 
(as far as I could find out).

Now when I start up the cluster, all VMs are started at once, which I 
would like to prevent. Is there any way to define a startup sequence or 
startup delay (like you can for the xendomains service) without misusing 
dependencies?

Dirk



From pasik at iki.fi  Sat Jan 23 13:18:06 2010
From: pasik at iki.fi (Pasi =?iso-8859-1?Q?K=E4rkk=E4inen?=)
Date: Sat, 23 Jan 2010 15:18:06 +0200
Subject: [Linux-cluster] Snapshotting GFS and freezing
In-Reply-To: <4B5ACBA6.90408@kinzesberg.de>
References: <4B4F527F.8030105@kinzesberg.de>
	<64D0546C5EBBD147B75DE133D798665F03F3F472@hugo.eprize.local>
	<20100115093514.GJ17978@reaktio.net>
	<20100115171211.GA13575@esri.com> <4B5314D4.8010900@kinzesberg.de>
	<20100118103439.GK17978@reaktio.net>
	<4B570FAA.1090202@kinzesberg.de> <20100120160401.GA10468@esri.com>
	<4B5ACBA6.90408@kinzesberg.de>
Message-ID: <20100123131806.GL2861@reaktio.net>

On Sat, Jan 23, 2010 at 11:12:54AM +0100, Dirk H. Schulz wrote:
> Ray Van Dolson schrieb:
>> On Wed, Jan 20, 2010 at 06:14:02AM -0800, Dirk H. Schulz wrote:
>>   
>>> That means calling
>>>         sync
>>>         echo 3 > /proc/sys/vm/drop_caches
>>> inside the VM, right? Or is there anything more to do to flush everything?
>>>
>>> Dirk
>>>     
>>
>> I would think the best way would be to actually pause the VM.  If you
>> snapshot a VM while it's running, even with the above, how can you
>> guarantee an application won't do something on the VM right as the
>> snapshot occurs?
>>   
>
> That is an important point, I am already pondering about it. Of course  
> pausing the vm is the better option - if you can afford it.
>
> I am working systematically on an environment where the possibility of  
> very high level availability vms is part of the design (e.g. 99,99 %).  
> At the moment the only approach I can see is doing snapshots of a  
> running vm and making those as reliable as possible.
>

Pausing the VM doesn't really help. It doesn't guarangee in any way that 
the applications (say MySQL) are in a safe/consistent state.

You really need to coordinate the process with the applications, if you 
realy need/want to make consistent backups with snapshots.

-- Pasi



From mammadshah at hotmail.com  Sat Jan 23 14:36:58 2010
From: mammadshah at hotmail.com (Muhammad Ammad Shah)
Date: Sat, 23 Jan 2010 19:36:58 +0500
Subject: [Linux-cluster] redhat apache cluster probelm
Message-ID: <BLU110-W224AA3E9C805127EFA9E4AD4610@phx.gbl>




Hi,



Using RHEL 5.3 and HP-ILo as fencing device. i did the following for testing Apache-Cluster.



1. root#mkqdisk -c /dev/sda1  -l DB_Q

2. root#chkconfig --level 345 qdiskd on

3. root#service qdiskd start

4. root#system-config-cluster



5. Cluster Name : jkz and selected quorum disk with following options

         Interval = 1

         TKO      = 10

         votes    = 1

         Minimum score = 3

         Device   = /dev/sda1

         Label    = DB_Q



6. Quorum Disk Heuristic 

         Program   = ping -c 2 10.10.10.1

         Score     = 1

         Interval  = 2



7. Add new node to cluster

      Node Name    =  node1.example.com

      Quorum votes = 1

      Node Name    =  node2.example.com

      Quorum votes = 1



8. New Fence Device 

      HP ILO Device

      Name        = ILOGB89xxxxxx

      user        = manage

      password    = manage

      Hostname    = 10.10.10.100



 

      HP ILO Device

      Name        = ILOGB88xxxxxx

      user        = manage

      password    = manage

      Hostname    = 10.10.10.101





9.   selected Node1 and "Manage fencing for this node"

      Add New Fencing level -> Add Fencing to this Level. selected ILOGB89xxxxxx



10. selected Node2 and "Manage fencing for this node"

      Add New Fencing level -> Add Fencing to this Level. selected ILOGB88xxxxxx





11. Created failover domains "failover-cluster" and selected
"node1.example.com and node2.example.com" from menu, and selected
"Restrict to this Failover Domain"





12. Create Resource 

           New Resource = Apache Server

           Name    = Apache HTTP Server service

           Server Root = /etc/httpd

           Config File = /etc/httpd/conf/httpd.conf

           httpd options = /etc/rc.d/init.d/httpd





13.  Create a new Resource "File system"



          Name    = httpd-content

          File System type = ext3

          Mount point = /var/www/html

          device    = /dev/sdb1



14.  Create a new Resource "IP "



         10.10.10.200



           

15. Create a New Service "Web-Service"

      Failover Domain = failover-cluster



 And selected "Add shared resource to this service" 



     A.   Apache HTTP Server Service

     B.   Httpd-Content

     C.   IP Address (10.10.10.200)

      





After doing all this. when i start cman service, it hangs and error puts in /var/log/messages



root## service cman start

Starting cluster:

   Loading modules... done

   Mounting configfs... done

   Starting ccsd... done

   Starting cman... done

   Starting qdiskd... done

   Starting daemons... done

   Starting fencing...





root#tail /var/log/messages



Jan 23 17:31:24 node1 ccsd[17103]: Cluster is not quorate.  Refusing connection.

Jan 23 17:31:24 node1 ccsd[17103]: Error while processing connect: Connection refused

Jan 23 17:31:29 node1 ccsd[17103]: Cluster is not quorate.  Refusing connection.

Jan 23 17:31:29 node1 ccsd[17103]: Error while processing connect: Connection refused

Jan 23 17:31:34 node1 ccsd[17103]: Cluster is not quorate.  Refusing connection.

Jan 23 17:31:34 node1 ccsd[17103]: Error while processing connect: Connection refused

Jan 23 17:31:39 node1 ccsd[17103]: Cluster is not quorate.  Refusing connection.

Jan 23 17:31:39 node1 ccsd[17103]: Error while processing connect: Connection refused		
			 
			  
				
			

 
Thanks,
 Ammad Shah
 


 		 	   		  
_________________________________________________________________
Windows Live: Keep your friends up to date with what you do online.
http://www.microsoft.com/middleeast/windows/windowslive/see-it-in-action/social-network-basics.aspx?ocid=PID23461::T:WLMTAGL:ON:WL:en-xm:SI_SB_1:092010
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100123/491ee3f8/attachment.htm>

From broder at mit.edu  Sat Jan 23 17:35:36 2010
From: broder at mit.edu (Evan Broder)
Date: Sat, 23 Jan 2010 12:35:36 -0500
Subject: [Linux-cluster] Deadlock when using clvmd + OpenAIS + Corosync
In-Reply-To: <4B598F07.1070002@redhat.com>
References: <178868c41001081458h3c0b70denef831712883a6fec@mail.gmail.com>
	<4B4AE972.6040004@redhat.com>
	<178868c41001110132s1350c7cep942148bbce62dfc0@mail.gmail.com>
	<4B4AF17C.4040306@redhat.com> <4B4C38D1.6000608@redhat.com>
	<178868c41001120821g6885c03cr75a3d42eb0196428@mail.gmail.com>
	<4B4D9982.6060505@redhat.com>
	<178868c41001210717u82779edp128006b7fc8e67e8@mail.gmail.com>
	<4B598F07.1070002@redhat.com>
Message-ID: <178868c41001230935v752a811clbee3a679d13c8f0b@mail.gmail.com>

On Fri, Jan 22, 2010 at 6:41 AM, Christine Caulfield
<ccaulfie at redhat.com> wrote:
> On 21/01/10 15:17, Evan Broder wrote:
>>
>> On Wed, Jan 13, 2010 at 4:59 AM, Christine Caulfield
>> <ccaulfie at redhat.com> ?wrote:
>>>
>>> On 12/01/10 16:21, Evan Broder wrote:
>>>>
>>>> On Tue, Jan 12, 2010 at 3:54 AM, Christine Caulfield
>>>> <ccaulfie at redhat.com> ? ?wrote:
>>>>>
>>>>> On 11/01/10 09:38, Christine Caulfield wrote:
>>>>>>
>>>>>> On 11/01/10 09:32, Evan Broder wrote:
>>>>>>>
>>>>>>> On Mon, Jan 11, 2010 at 4:03 AM, Christine Caulfield
>>>>>>> <ccaulfie at redhat.com> ? ?wrote:
>>>>>>>>
>>>>>>>> On 08/01/10 22:58, Evan Broder wrote:
>>>>>>>>>
>>>>>>>>> [please preserve the CC when replying, thanks]
>>>>>>>>>
>>>>>>>>> Hi -
>>>>>>>>> We're attempting to setup a clvm (2.02.56) cluster using OpenAIS
>>>>>>>>> (1.1.1) and Corosync (1.1.2). We've gotten bitten hard in the past
>>>>>>>>> by
>>>>>>>>> crashes leaving DLM state around and forcing us to reboot our
>>>>>>>>> nodes,
>>>>>>>>> so we're specifically looking for a solution that doesn't involve
>>>>>>>>> in-kernel locking.
>>>>>>>>>
>>>>>>>>> We're also running the Pacemaker OpenAIS service, as we're hoping
>>>>>>>>> to
>>>>>>>>> use it for management of some other resources going forward.
>>>>>>>>>
>>>>>>>>> We've managed to form the OpenAIS cluster, and get clvmd running on
>>>>>>>>> both of our nodes. Operations using LVM succeed, so long as only
>>>>>>>>> one
>>>>>>>>> operation runs at a time. However, if we attempt to run two
>>>>>>>>> operations
>>>>>>>>> (say, one lvcreate on each host) at a time, they both hang, and
>>>>>>>>> both
>>>>>>>>> clvmd processes appear to deadlock.
>>>>>>>>>
>>>>>>>>> When they deadlock, it doesn't appear to affect the other
>>>>>>>>> clustering
>>>>>>>>> processes - both corosync and pacemaker still report a fully formed
>>>>>>>>> cluster, so it seems the issue is localized to clvmd.
>>>>>>>>>
>>>>>>>>> I've looked at logs from corosync and pacemaker, and I've straced
>>>>>>>>> various processes, but I don't want to blast a bunch of useless
>>>>>>>>> information at the list. What information can I provide to make it
>>>>>>>>> easier to debug and fix this deadlock?
>>>>>>>>>
>>>>>>>>
>>>>>>>> To start with, the best logging to produce is the clvmd logs which
>>>>>>>> can be
>>>>>>>> got with clvmd -d (see the man page for details). Ideally these
>>>>>>>> should be
>>>>>>>> from all nodes in the cluster so they can be correlated. If you're
>>>>>>>> still
>>>>>>>> using DLM then a dlm lock dump from all nodes is often helpful in
>>>>>>>> conjunction with the clvmd logs.
>>>>>>>
>>>>>>> Sure, no problem. I've posted the logs from clvmd on both processes
>>>>>>> in
>>>>>>> <http://web.mit.edu/broder/Public/clvmd/>. I've annotated them at a
>>>>>>> few points with what I was doing - the annotations all start with "
>>>>>>>>>
>>>>>>>>> ", so they should be easy to spot.
>>>>>
>>>>>
>>>>> Ironically it looks like a bug in the clvmd-openais code. I can
>>>>> reproduce
>>>>> it
>>>>> on my systems here. I don't see the problem when using the dlm!
>>>>>
>>>>> Can you try -Icorosync and see if that helps? In the meantime I'll have
>>>>> a
>>>>> look at the openais bits to try and find out what is wrong.
>>>>>
>>>>> Chrissie
>>>>>
>>>>
>>>> I'll see what we can pull together, but the nodes running the clvm
>>>> cluster are also Xen dom0's. They're currently running on (Ubuntu
>>>> Hardy's) 2.6.24, so upgrading them to something new enough to support
>>>> DLM 3 would be...challenging.
>>>>
>>>> It would be much, much better for us if we could get clvmd-openais
>>>> working.
>>>>
>>>> Is there any chance this would work better if we dropped back to
>>>> openais whitetank instead of corosync + openais wilson?
>>>>
>>>
>>>
>>> OK, I've found the bug and it IS in openais. The attached patch will fix
>>> it.
>>>
>>> Chrissie
>>>
>>
>> Awesome. That patch fixed our problem.
>>
>> We are running into one other problem - performing LVM operations on
>> one node is substantially slower than performing them on the other
>> node:
>>
>> root at black-mesa:~# time lvcreate -n test -L 1G xenvg
>> ? Logical volume "test" created
>>
>> real ? ?0m0.309s
>> user ? ?0m0.000s
>> sys ? ? 0m0.008s
>> root at black-mesa:~# time lvremove -f /dev/xenvg/test
>> ? Logical volume "test" successfully removed
>>
>> real ? ?0m0.254s
>> user ? ?0m0.004s
>> sys ? ? 0m0.008s
>>
>>
>> root at torchwood-institute:~# time lvcreate -n test -L 1G xenvg
>> ? Logical volume "test" created
>>
>> real ? ?0m7.282s
>> user ? ?0m6.396s
>> sys ? ? 0m0.312s
>> root at torchwood-institute:~# time lvremove -f /dev/xenvg/test
>> ? Logical volume "test" successfully removed
>>
>> real ? ?0m7.277s
>> user ? ?0m6.420s
>> sys ? ? 0m0.292s
>>
>> Any idea why this is happening and if there's anything we can do about it?
>>
>
>
> I'm not at all sure why that should be happening. I suppose the best thing
> to do would be to enable clvmd logging (clvmd -d) and see what is taking the
> time.
>
> Chrissie
>

No problem. I've collected another set of logs - they're in
<http://web.mit.edu/broder/Public/clvmd-slow/>.

After spinning up corosync and clvmd, the commands I ran were, in order:

  root at black-mesa:~# vgchange -a y xenvg
    0 logical volume(s) in volume group "xenvg" now active
  root at black-mesa:~# time lvcreate -n test1 -L 1G xenvg
    Logical volume "test1" created

  real    0m0.685s
  user    0m0.004s
  sys     0m0.000s
  root at black-mesa:~# time lvremove -f /dev/xenvg/test1
    Logical volume "test1" successfully removed

  real    0m0.235s
  user    0m0.004s
  sys     0m0.004s
  root at torchwood-institute:~# time lvcreate -n test2 -L 1G xenvg
    Logical volume "test2" created

  real    0m8.007s
  user    0m6.396s
  sys     0m0.312s
  root at torchwood-institute:~# time lvremove -f /dev/xenvg/test2
    Logical volume "test2" successfully removed

  real    0m7.364s
  user    0m6.436s
  sys     0m0.300s
  root at black-mesa:~# vgchange -a n xenvg
    0 logical volume(s) in volume group "xenvg" now active

(black-mesa is node 1, and torchwood-institute is node 2)

Thanks again for your help,
 - Evan



From mammadshah at hotmail.com  Sun Jan 24 08:18:00 2010
From: mammadshah at hotmail.com (Muhammad Ammad Shah)
Date: Sun, 24 Jan 2010 13:18:00 +0500
Subject: [Linux-cluster] fence_ilo
In-Reply-To: <82d055df1001231015na727d56t8bd26d22fb2cd82e@mail.gmail.com>
References: <82d055df1001231015na727d56t8bd26d22fb2cd82e@mail.gmail.com>
Message-ID: <BLU110-W1EA55655B09DEA6847282D4600@phx.gbl>



Hello Rajat,  


1. HP-ILO firmware is 1.6.1 (using default fence_ilo package)

2. Yes i have copied the same file to node2 and md5sum on both nodes are equal.

3. first i started cman, but services waits on fence.....

4. I am starting cman first on node1. 

5. I am using RHEL 5.3 (2.6.18xen) iptables and selinux disabled.

6. Is it necessary to have 4 nic, can i use 2 nic for test lab using, if i install RHEL5.4



I have some improvement on this (time was not synchronized on both nodes). i made some changes to configuration of cluster. now everything works fine when both nodes are up. but when any of the nodes shutdown (forced by me) then i have logs in messages. 


When i execute "clustat" on single-alive node. it have the following 



 root# clustat

Cluster Status for DB_Clust @ Sun Jan 24 08:28:54 2010

Member Status: Inquorate



 Member Name                                    ID   Status

 ------ ----                                    ---- ------

 node1.example.com                       1 Online, Local

 node2.example.com                       2 Offline

 /dev/disk/by-id/scsi-3600c0ff000d7ba264cac4d4b0    0 Offline



and not getting reply from alternate node (the one is alive). 


 
Thanks,
 Ammad Shah
 




Date: Sat, 23 Jan 2010 23:45:02 +0530
Subject: fence_ilo
From: rajatjpatel at gmail.com
To: mammadshah at hotmail.com

Hello Mr, Shah

1 What is firmware for you HP ilo (if it is more 1.6X than attach file you can replace in /bin/fence_ilo in single user mode and the restart the both server)
2 you have config you cluster from node1 suppose scp /etc/cluster/cluster.conf node2:/etc/cluster/cluster.conf and check the md5sum on both the node 

3 then you can start cman and rgmanager
4 normally cman has to start 1st if there  your getting error then you can start rgmanager 1st and than cman
5 if you are using RHEL 5.4 os and that is 2 node cluster then you dont need to have qdisk 

6 before setting up cluster in your HP 460 you may have 4 nic card so you need to config in attach diagram 
 Regards,

Rajat J Patel
D 803 Royal Classic
Link Road
Andheri West
Mumbai 53

+919920121211
www.taashee.com

FIRST THEY IGNORE YOU...
THEN THEY LAUGH AT YOU...
THEN THEY FIGHT YOU...
THEN YOU WIN...
 		 	   		  
_________________________________________________________________
Windows Live: Keep your friends up to date with what you do online.
http://www.microsoft.com/middleeast/windows/windowslive/see-it-in-action/social-network-basics.aspx?ocid=PID23461::T:WLMTAGL:ON:WL:en-xm:SI_SB_1:092010
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100124/fc8ea1c9/attachment.htm>

From dirk.schulz at kinzesberg.de  Sun Jan 24 08:23:31 2010
From: dirk.schulz at kinzesberg.de (Dirk H. Schulz)
Date: Sun, 24 Jan 2010 09:23:31 +0100
Subject: [Linux-cluster] Snapshotting GFS and freezing
In-Reply-To: <20100123131806.GL2861@reaktio.net>
References: <4B4F527F.8030105@kinzesberg.de>	<64D0546C5EBBD147B75DE133D798665F03F3F472@hugo.eprize.local>	<20100115093514.GJ17978@reaktio.net>	<20100115171211.GA13575@esri.com>
	<4B5314D4.8010900@kinzesberg.de>	<20100118103439.GK17978@reaktio.net>	<4B570FAA.1090202@kinzesberg.de>
	<20100120160401.GA10468@esri.com>	<4B5ACBA6.90408@kinzesberg.de>
	<20100123131806.GL2861@reaktio.net>
Message-ID: <4B5C0383.5040600@kinzesberg.de>

Pasi K?rkk?inen schrieb:
> On Sat, Jan 23, 2010 at 11:12:54AM +0100, Dirk H. Schulz wrote:
>   
>> Ray Van Dolson schrieb:
>>     
>>> On Wed, Jan 20, 2010 at 06:14:02AM -0800, Dirk H. Schulz wrote:
>>>   
>>>       
>>>> That means calling
>>>>         sync
>>>>         echo 3 > /proc/sys/vm/drop_caches
>>>> inside the VM, right? Or is there anything more to do to flush everything?
>>>>
>>>> Dirk
>>>>     
>>>>         
>>> I would think the best way would be to actually pause the VM.  If you
>>> snapshot a VM while it's running, even with the above, how can you
>>> guarantee an application won't do something on the VM right as the
>>> snapshot occurs?
>>>   
>>>       
>> That is an important point, I am already pondering about it. Of course  
>> pausing the vm is the better option - if you can afford it.
>>
>> I am working systematically on an environment where the possibility of  
>> very high level availability vms is part of the design (e.g. 99,99 %).  
>> At the moment the only approach I can see is doing snapshots of a  
>> running vm and making those as reliable as possible.
>>     
>
> Pausing the VM doesn't really help. It doesn't guarangee in any way that 
> the applications (say MySQL) are in a safe/consistent state.
>
> You really need to coordinate the process with the applications, if you 
> realy need/want to make consistent backups with snapshots.
>   
Yes, of course, that is still another point in the list that has to be 
considered. But pausing your application (say, stop MySQL writing to the 
database) is not sufficient, you also have to write down buffers and 
caches of kernel, filesystem etc. to make the image as consistent as 
possible.

Dirk



From rajatjpatel at gmail.com  Sun Jan 24 08:35:48 2010
From: rajatjpatel at gmail.com (rajatjpatel)
Date: Sun, 24 Jan 2010 14:05:48 +0530
Subject: [Linux-cluster] fence_ilo
In-Reply-To: <BLU110-W1EA55655B09DEA6847282D4600@phx.gbl>
References: <82d055df1001231015na727d56t8bd26d22fb2cd82e@mail.gmail.com>
	<BLU110-W1EA55655B09DEA6847282D4600@phx.gbl>
Message-ID: <82d055df1001240035h370f4de8t34a02b5499006e00@mail.gmail.com>

CAN SEND ME /VA/LOG/MESSAGES FROM FOR THE BOTH NODE
Regards,

Rajat J Patel
D 803 Royal Classic
Link Road
Andheri West
Mumbai 53
+919920121211
www.taashee.com

FIRST THEY IGNORE YOU...
THEN THEY LAUGH AT YOU...
THEN THEY FIGHT YOU...
THEN YOU WIN...


On Sun, Jan 24, 2010 at 1:48 PM, Muhammad Ammad Shah <mammadshah at hotmail.com
> wrote:

>
> Hello Rajat,
>
> 1. HP-ILO firmware is 1.6.1 (using default fence_ilo package)
> 2. Yes i have copied the same file to node2 and md5sum on both nodes are
> equal.
> 3. first i started cman, but services waits on fence.....
> 4. I am starting cman first on node1.
> 5. I am using RHEL 5.3 (2.6.18xen) iptables and selinux disabled.
> 6. Is it necessary to have 4 nic, can i use 2 nic for test lab using, if i
> install RHEL5.4
>
>
> I have some improvement on this (time was not synchronized on both nodes).
> i made some changes to configuration of cluster. now everything works fine
> when both nodes are up. but when any of the nodes shutdown (forced by me)
> then i have logs in messages.
>
> When i execute "clustat" on single-alive node. it have the following
>
>  root# clustat
> Cluster Status for DB_Clust @ Sun Jan 24 08:28:54 2010
> Member Status: Inquorate
>
>  Member Name                                    ID   Status
>  ------ ----                                    ---- ------
>  node1.example.com                       1 Online, Local
>  node2.example.com                       2 Offline
>  /dev/disk/by-id/scsi-3600c0ff000d7ba264cac4d4b0    0 Offline
>
> and not getting reply from alternate node (the one is alive).
>
>
>
> Thanks,
>  Ammad Shah
>
>
>
>
>
> ------------------------------
> Date: Sat, 23 Jan 2010 23:45:02 +0530
> Subject: fence_ilo
> From: rajatjpatel at gmail.com
> To: mammadshah at hotmail.com
>
>
> Hello Mr, Shah
>
> 1 What is firmware for you HP ilo (if it is more 1.6X than attach file you
> can replace in /bin/fence_ilo in single user mode and the restart the both
> server)
> 2 you have config you cluster from node1 suppose scp
> /etc/cluster/cluster.conf node2:/etc/cluster/cluster.conf and check the
> md5sum on both the node
> 3 then you can start cman and rgmanager
> 4 normally cman has to start 1st if there  your getting error then you can
> start rgmanager 1st and than cman
> 5 if you are using RHEL 5.4 os and that is 2 node cluster then you dont
> need to have qdisk
> 6 before setting up cluster in your HP 460 you may have 4 nic card so you
> need to config in attach diagram
>
> Regards,
>
> Rajat J Patel
> D 803 Royal Classic
> Link Road
> Andheri West
> Mumbai 53
> +919920121211
> www.taashee.com
>
> FIRST THEY IGNORE YOU...
> THEN THEY LAUGH AT YOU...
> THEN THEY FIGHT YOU...
> THEN YOU WIN...
>
> ------------------------------
> Windows Live: Keep your friends up to date with what you do online.<http://www.microsoft.com/middleeast/windows/windowslive/see-it-in-action/social-network-basics.aspx?ocid=PID23461::T:WLMTAGL:ON:WL:en-xm:SI_SB_1:092010>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100124/5252d065/attachment.htm>

From raju.rajsand at gmail.com  Sun Jan 24 13:56:51 2010
From: raju.rajsand at gmail.com (Rajagopal Swaminathan)
Date: Sun, 24 Jan 2010 19:26:51 +0530
Subject: [Linux-cluster] fence_ilo
In-Reply-To: <BLU110-W1EA55655B09DEA6847282D4600@phx.gbl>
References: <82d055df1001231015na727d56t8bd26d22fb2cd82e@mail.gmail.com>
	<BLU110-W1EA55655B09DEA6847282D4600@phx.gbl>
Message-ID: <8786b91c1001240556l15c50ft41117aeb6335a19a@mail.gmail.com>

Greetings,

On Sun, Jan 24, 2010 at 1:48 PM, Muhammad Ammad Shah <mammadshah at hotmail.com
> wrote:

>
>
> When i execute "clustat" on single-alive node. it have the following
>
>  root# clustat
> Cluster Status for DB_Clust @ Sun Jan 24 08:28:54 2010
> Member Status: Inquorate
>
>  Member Name                                    ID   Status
>  ------ ----                                    ---- ------
>  node1.example.com                       1 Online, Local
>  node2.example.com                       2 Offline
>  /dev/disk/by-id/scsi-3600c0ff000d7ba264cac4d4b0    0 Offline
>
>
Some silly questions:

Q1. is it a two node cluster? if so, have you configured is as such using
two_nodes
---- http://sources.redhat.com/cluster/wiki/FAQ/CMAN says

[quote]
How can I define a two-node cluster if a majority is needed to reach quorum?

We had to allow two-node clusters, so we made a special exception to the
quorum rules. There is a special setting "two_node" in the /etc/cluster.conf
file that looks like this:

<cman expected_votes="1" two_node="1"/>

This will allow one node to be considered enough to establish a quorum. Note
that if you configure a quorum disk/partition, you don't want two_node="1".

[/quote]
Q2. Are the cluster member nodes able to resolve each others names? (either
through DNS or /etc/hosts file)
Q3. have you used cman_tool join command to join the nodes to the cluster?


Regards,

Rajagopal
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100124/a113ed50/attachment.htm>

From chirag_linuxforum at rediffmail.com  Mon Jan 25 05:59:34 2010
From: chirag_linuxforum at rediffmail.com (Chirag )
Date: 25 Jan 2010 05:59:34 -0000
Subject: [Linux-cluster] =?utf-8?q?Queries_on_RHCS_command_line_tools?=
Message-ID: <20100125055934.61041.qmail@f6mail-144-202.rediffmail.com>

Hi
I am using Redhat cluster suite. I am trying to configure cluster on linux using command line tools. I have following queries. Please answer them as early as possible. I have gone through man pages of ccs_tool, cman_tool, clusvcadm etc. commands but I could not find anwers of following questions.
1. Can I add node in cluster without adding fence device using command line tool? If yes then how?2. I want to add failover domain in cluster. How can I add it using command line tool?3. I want to add resources in cluster. How can I add it using command line tool?4. I want to define services in cluster. How can I define them using command line tool?
RegardsChirag
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100125/28b75b96/attachment.htm>

From xavier.montagutelli at unilim.fr  Mon Jan 25 08:03:26 2010
From: xavier.montagutelli at unilim.fr (Xavier Montagutelli)
Date: Mon, 25 Jan 2010 09:03:26 +0100
Subject: [Linux-cluster] Queries on RHCS command line tools
In-Reply-To: <20100125055934.61041.qmail@f6mail-144-202.rediffmail.com>
References: <20100125055934.61041.qmail@f6mail-144-202.rediffmail.com>
Message-ID: <201001250903.26529.xavier.montagutelli@unilim.fr>

On Monday 25 January 2010 06:59:34 Chirag wrote:
> Hi
> I am using Redhat cluster suite. I am trying to configure cluster on linux
>  using command line tools. I have following queries. Please answer them as
>  early as possible. I have gone through man pages of ccs_tool, cman_tool,
>  clusvcadm etc. commands but I could not find anwers of following
>  questions. 1. Can I add node in cluster without adding fence device using
>  command line tool? If yes then how?2. I want to add failover domain in
>  cluster. How can I add it using command line tool?3. I want to add
>  resources in cluster. How can I add it using command line tool?

>  4. I want
>  to define services in cluster. How can I define them using command line
>  tool?

My two cents : 

When I want to add a service, I use (programmaticaly) ed or vi to add the new 
service and increase the version number in the cluster.conf file, then reread 
it with "ccs_tool update ...".

> 

-- 
Xavier Montagutelli                      Tel : +33 (0)5 55 45 77 20
Service Commun Informatique              Fax : +33 (0)5 55 45 75 95
Universite de Limoges
123, avenue Albert Thomas
87060 Limoges cedex



From ccaulfie at redhat.com  Mon Jan 25 09:17:50 2010
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Mon, 25 Jan 2010 09:17:50 +0000
Subject: [Linux-cluster] Queries on RHCS command line tools
In-Reply-To: <20100125055934.61041.qmail@f6mail-144-202.rediffmail.com>
References: <20100125055934.61041.qmail@f6mail-144-202.rediffmail.com>
Message-ID: <4B5D61BE.5060702@redhat.com>

On 25/01/10 05:59, Chirag wrote:
> Hi
>
> I am using Redhat cluster suite. I am trying to configure cluster on
> linux using command line tools. I have following queries. Please answer
> them as early as possible. I have gone through man pages of ccs_tool,
> cman_tool, clusvcadm etc. commands but I could not find anwers of
> following questions.
>
> 1. Can I add node in cluster without adding fence device using command
> line tool? If yes then how?

No, you can't leave off a fencing device. All nodes need a fencing 
device associated with them. The command-line tool simply enforces this.

> 2. I want to add failover domain in cluster. How can I add it using
> command line tool?
> 3. I want to add resources in cluster. How can I add it using command
> line tool?
> 4. I want to define services in cluster. How can I define them using
> command line tool?

You can't. ccs_tool only manages nodes and fencing devices. Because of 
the complexity and variability of services you will have to either edit 
the XML yourself or use one of the GUIs, sorry

Chrissie



From Martin.Waite at datacash.com  Mon Jan 25 09:49:29 2010
From: Martin.Waite at datacash.com (Martin Waite)
Date: Mon, 25 Jan 2010 09:49:29 -0000
Subject: [Linux-cluster] how do I reduce the risk of split-brain while
	performing node maintenance
Message-ID: <A78DB34D00374344A0AB65B6523C05DC04F2AFC4@marsden.win.datacash.com>

Hi,

Suppose that I have a cluster with an odd-number of nodes - say 3 nodes.
In the event of a network partition, split brain is avoided by moving
(or keeping) leadership to the partition that contains two nodes. 

In the event of routine maintenance, where one of the members needs to
be brought down in order to be replaced or upgraded, I now have a 2-node
system which presumably is at risk of split brain in the event of a
network partition happening during the maintenance window.

What is the best way of reducing the risk of split brain during the
maintenance window when the cluster is not at full strength ?

regards,
Martin

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100125/52909c75/attachment.htm>

From ccaulfie at redhat.com  Mon Jan 25 10:32:05 2010
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Mon, 25 Jan 2010 10:32:05 +0000
Subject: [Linux-cluster] Deadlock when using clvmd + OpenAIS + Corosync
In-Reply-To: <178868c41001230935v752a811clbee3a679d13c8f0b@mail.gmail.com>
References: <178868c41001081458h3c0b70denef831712883a6fec@mail.gmail.com>	
	<4B4AE972.6040004@redhat.com>	
	<178868c41001110132s1350c7cep942148bbce62dfc0@mail.gmail.com>	
	<4B4AF17C.4040306@redhat.com> <4B4C38D1.6000608@redhat.com>	
	<178868c41001120821g6885c03cr75a3d42eb0196428@mail.gmail.com>	
	<4B4D9982.6060505@redhat.com>	
	<178868c41001210717u82779edp128006b7fc8e67e8@mail.gmail.com>	
	<4B598F07.1070002@redhat.com>
	<178868c41001230935v752a811clbee3a679d13c8f0b@mail.gmail.com>
Message-ID: <4B5D7325.3000605@redhat.com>

On 23/01/10 17:35, Evan Broder wrote:
> On Fri, Jan 22, 2010 at 6:41 AM, Christine Caulfield
> <ccaulfie at redhat.com>  wrote:
>> On 21/01/10 15:17, Evan Broder wrote:
>>>
>>> On Wed, Jan 13, 2010 at 4:59 AM, Christine Caulfield
>>> <ccaulfie at redhat.com>    wrote:
>>>>
>>>> On 12/01/10 16:21, Evan Broder wrote:
>>>>>
>>>>> On Tue, Jan 12, 2010 at 3:54 AM, Christine Caulfield
>>>>> <ccaulfie at redhat.com>      wrote:
>>>>>>
>>>>>> On 11/01/10 09:38, Christine Caulfield wrote:
>>>>>>>
>>>>>>> On 11/01/10 09:32, Evan Broder wrote:
>>>>>>>>
>>>>>>>> On Mon, Jan 11, 2010 at 4:03 AM, Christine Caulfield
>>>>>>>> <ccaulfie at redhat.com>      wrote:
>>>>>>>>>
>>>>>>>>> On 08/01/10 22:58, Evan Broder wrote:
>>>>>>>>>>
>>>>>>>>>> [please preserve the CC when replying, thanks]
>>>>>>>>>>
>>>>>>>>>> Hi -
>>>>>>>>>> We're attempting to setup a clvm (2.02.56) cluster using OpenAIS
>>>>>>>>>> (1.1.1) and Corosync (1.1.2). We've gotten bitten hard in the past
>>>>>>>>>> by
>>>>>>>>>> crashes leaving DLM state around and forcing us to reboot our
>>>>>>>>>> nodes,
>>>>>>>>>> so we're specifically looking for a solution that doesn't involve
>>>>>>>>>> in-kernel locking.
>>>>>>>>>>
>>>>>>>>>> We're also running the Pacemaker OpenAIS service, as we're hoping
>>>>>>>>>> to
>>>>>>>>>> use it for management of some other resources going forward.
>>>>>>>>>>
>>>>>>>>>> We've managed to form the OpenAIS cluster, and get clvmd running on
>>>>>>>>>> both of our nodes. Operations using LVM succeed, so long as only
>>>>>>>>>> one
>>>>>>>>>> operation runs at a time. However, if we attempt to run two
>>>>>>>>>> operations
>>>>>>>>>> (say, one lvcreate on each host) at a time, they both hang, and
>>>>>>>>>> both
>>>>>>>>>> clvmd processes appear to deadlock.
>>>>>>>>>>
>>>>>>>>>> When they deadlock, it doesn't appear to affect the other
>>>>>>>>>> clustering
>>>>>>>>>> processes - both corosync and pacemaker still report a fully formed
>>>>>>>>>> cluster, so it seems the issue is localized to clvmd.
>>>>>>>>>>
>>>>>>>>>> I've looked at logs from corosync and pacemaker, and I've straced
>>>>>>>>>> various processes, but I don't want to blast a bunch of useless
>>>>>>>>>> information at the list. What information can I provide to make it
>>>>>>>>>> easier to debug and fix this deadlock?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> To start with, the best logging to produce is the clvmd logs which
>>>>>>>>> can be
>>>>>>>>> got with clvmd -d (see the man page for details). Ideally these
>>>>>>>>> should be
>>>>>>>>> from all nodes in the cluster so they can be correlated. If you're
>>>>>>>>> still
>>>>>>>>> using DLM then a dlm lock dump from all nodes is often helpful in
>>>>>>>>> conjunction with the clvmd logs.
>>>>>>>>
>>>>>>>> Sure, no problem. I've posted the logs from clvmd on both processes
>>>>>>>> in
>>>>>>>> <http://web.mit.edu/broder/Public/clvmd/>. I've annotated them at a
>>>>>>>> few points with what I was doing - the annotations all start with "
>>>>>>>>>>
>>>>>>>>>> ", so they should be easy to spot.
>>>>>>
>>>>>>
>>>>>> Ironically it looks like a bug in the clvmd-openais code. I can
>>>>>> reproduce
>>>>>> it
>>>>>> on my systems here. I don't see the problem when using the dlm!
>>>>>>
>>>>>> Can you try -Icorosync and see if that helps? In the meantime I'll have
>>>>>> a
>>>>>> look at the openais bits to try and find out what is wrong.
>>>>>>
>>>>>> Chrissie
>>>>>>
>>>>>
>>>>> I'll see what we can pull together, but the nodes running the clvm
>>>>> cluster are also Xen dom0's. They're currently running on (Ubuntu
>>>>> Hardy's) 2.6.24, so upgrading them to something new enough to support
>>>>> DLM 3 would be...challenging.
>>>>>
>>>>> It would be much, much better for us if we could get clvmd-openais
>>>>> working.
>>>>>
>>>>> Is there any chance this would work better if we dropped back to
>>>>> openais whitetank instead of corosync + openais wilson?
>>>>>
>>>>
>>>>
>>>> OK, I've found the bug and it IS in openais. The attached patch will fix
>>>> it.
>>>>
>>>> Chrissie
>>>>
>>>
>>> Awesome. That patch fixed our problem.
>>>
>>> We are running into one other problem - performing LVM operations on
>>> one node is substantially slower than performing them on the other
>>> node:
>>>
>>> root at black-mesa:~# time lvcreate -n test -L 1G xenvg
>>>    Logical volume "test" created
>>>
>>> real    0m0.309s
>>> user    0m0.000s
>>> sys     0m0.008s
>>> root at black-mesa:~# time lvremove -f /dev/xenvg/test
>>>    Logical volume "test" successfully removed
>>>
>>> real    0m0.254s
>>> user    0m0.004s
>>> sys     0m0.008s
>>>
>>>
>>> root at torchwood-institute:~# time lvcreate -n test -L 1G xenvg
>>>    Logical volume "test" created
>>>
>>> real    0m7.282s
>>> user    0m6.396s
>>> sys     0m0.312s
>>> root at torchwood-institute:~# time lvremove -f /dev/xenvg/test
>>>    Logical volume "test" successfully removed
>>>
>>> real    0m7.277s
>>> user    0m6.420s
>>> sys     0m0.292s
>>>
>>> Any idea why this is happening and if there's anything we can do about it?
>>>
>>
>>
>> I'm not at all sure why that should be happening. I suppose the best thing
>> to do would be to enable clvmd logging (clvmd -d) and see what is taking the
>> time.
>>
>> Chrissie
>>
>
> No problem. I've collected another set of logs - they're in
> <http://web.mit.edu/broder/Public/clvmd-slow/>.
>
> After spinning up corosync and clvmd, the commands I ran were, in order:
>
>    root at black-mesa:~# vgchange -a y xenvg
>      0 logical volume(s) in volume group "xenvg" now active
>    root at black-mesa:~# time lvcreate -n test1 -L 1G xenvg
>      Logical volume "test1" created
>
>    real    0m0.685s
>    user    0m0.004s
>    sys     0m0.000s
>    root at black-mesa:~# time lvremove -f /dev/xenvg/test1
>      Logical volume "test1" successfully removed
>
>    real    0m0.235s
>    user    0m0.004s
>    sys     0m0.004s
>    root at torchwood-institute:~# time lvcreate -n test2 -L 1G xenvg
>      Logical volume "test2" created
>
>    real    0m8.007s
>    user    0m6.396s
>    sys     0m0.312s
>    root at torchwood-institute:~# time lvremove -f /dev/xenvg/test2
>      Logical volume "test2" successfully removed
>
>    real    0m7.364s
>    user    0m6.436s
>    sys     0m0.300s
>    root at black-mesa:~# vgchange -a n xenvg
>      0 logical volume(s) in volume group "xenvg" now active
>
> (black-mesa is node 1, and torchwood-institute is node 2)
>
> Thanks again for your help,
>

Hiya,

Oddly I can't find any delays in the clvmd logs at all. There are some 
7-second gaps in the log files but those are between commands coming 
from the lvm command-line and not internal to clvmd itself.

What sort of storage are you using for these LVs? The only thing I can 
think of now is that there is some sort of delay in lvm opening the 
device for writing as it updates the metadata. An LVM debug log might be 
helpful here, though I'm not sure, off-hand, how to put time-stamps on 
that - I'm not really an LVM developer any more.

Chrissie



From frederic at ovsg.univ-ag.fr  Mon Jan 25 11:26:42 2010
From: frederic at ovsg.univ-ag.fr (frederic randriamora)
Date: Mon, 25 Jan 2010 11:26:42 +0000
Subject: [Linux-cluster] weird rgmanager
Message-ID: <4B5D7FF2.3070704@ovsg.univ-ag.fr>

Hi,


In a four node cluster running RH5.4, connected to a FC SAN

clustat says:
node 1 and node 3 are online with rgmanager
node 2 and node 4 are offline

The cluster remains quorate because of a qdiskd running on each node


Fine.


BUT, node 4, which is offline as per clustat and cman_tool nodes, is 
still reported by clustat as running services ( those services are 
actually dead ).
I have on the two alive nodes ( node 1 and node 3 ):

cman_tool services
type             level name       id       state       
fence            0     default    00010004 FAIL_ALL_STOPPED
[1 2 3 4]
dlm              1     clvmd      00020004 LEAVE_STOP_WAIT
[1 2 3 4]
dlm              1     rgmanager  00030004 FAIL_ALL_STOPPED
[1 3 4]


they are running services ( xen vm, nfs and dns ) OK.

The other two dead nodes ( they don't run ccs neither cman neither 
nothing ) can access the SAN as is displayed by multipath -ll
I know I can restart the whole cluster but i would like to know why this 
is happening.

I someone please can help.

Frederic



From mammadshah at hotmail.com  Mon Jan 25 12:27:53 2010
From: mammadshah at hotmail.com (Muhammad Ammad Shah)
Date: Mon, 25 Jan 2010 17:27:53 +0500
Subject: [Linux-cluster] gfs and cluster nodes rebooting
Message-ID: <BLU110-W2673BAD5E4130D1EAB988FD45F0@phx.gbl>



HI,



I have configured two node cluster and its working fine for SAN (ext3 file system). after this i configured GFS using following.



root# pvcreate  /dev/sdb

root#vgcreate -c y vg1_gfs /dev/sdc1

root#lvcreate -n db_store -l 100%FREE vg1_gfs

root#/etc/init.d/clvmd start



 Started on both nodes.



root#mkfs -t gfs2 -p lock_dlm -t db_clust:db_store -j 4 /dev/vg1_gfs/db_store

root# service gfs start



root#chkconfig --level 345 clvmd on 

root#chkconfig --level 345 gfs on



----------------

the problem is, as i changed File system (ex3) resource to GFS Resource.



nodes are rebooting..



there is nothing in /var/log/messages. but when i checked console of the node there was some message related to GFS. 

DLM id:0 ...



so i removed GFS and switched back to File system(ext3) resource. 



can i install oracle on  Resource File system(ext3) ?



or how to troubleshoot GFS reboot..



need help,		
			 
			  
				
			
 
Thanks,
Muhammad Ammad Shah
 


 		 	   		  
_________________________________________________________________
Windows Live: Make it easier for your friends to see what you?re up to on Facebook.
http://www.microsoft.com/middleeast/windows/windowslive/see-it-in-action/social-network-basics.aspx?ocid=PID23461::T:WLMTAGL:ON:WL:en-xm:SI_SB_2:092009
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100125/f29c2418/attachment.htm>

From mammadshah at hotmail.com  Mon Jan 25 12:54:37 2010
From: mammadshah at hotmail.com (Muhammad Ammad Shah)
Date: Mon, 25 Jan 2010 17:54:37 +0500
Subject: [Linux-cluster] GFS continue to reboot nodes
In-Reply-To: <82d055df1001240035h370f4de8t34a02b5499006e00@mail.gmail.com>
References: <82d055df1001231015na727d56t8bd26d22fb2cd82e@mail.gmail.com>,
	<BLU110-W1EA55655B09DEA6847282D4600@phx.gbl>,
	<82d055df1001240035h370f4de8t34a02b5499006e00@mail.gmail.com>
Message-ID: <BLU110-W1CBDBB93586064A976323D45F0@phx.gbl>



Dear Rajat,

HI,



I have configured two node cluster and its working fine for SAN (ext3 file system). after this i configured GFS using following.



root# pvcreate  /dev/sdb

root#vgcreate -c y vg1_gfs /dev/sdc1

root#lvcreate -n db_store -l 100%FREE vg1_gfs

root#/etc/init.d/clvmd start



 Started on both nodes.



root#mkfs -t gfs2 -p lock_dlm -t db_clust:db_store -j 4 /dev/vg1_gfs/db_store

root# service gfs start



root#chkconfig --level 345 clvmd on 

root#chkconfig --level 345 gfs on



----------------

the problem is, as i changed File system (ex3) resource to GFS Resource.



nodes are rebooting..



there is nothing in /var/log/messages. but when i checked console of the node there was some message related to GFS. 

DLM id:0 ...



so i removed GFS and switched back to File system(ext3) resource. 



can i install oracle on  Resource File system(ext3) ?



or how to troubleshoot GFS reboot..

need help,		
			 
			  
				
			



 
Thanks,
Muhammad Ammad Shah
 

 		 	   		  
_________________________________________________________________
Windows Live Hotmail: Your friends can get your Facebook updates, right from Hotmail?.
http://www.microsoft.com/middleeast/windows/windowslive/see-it-in-action/social-network-basics.aspx?ocid=PID23461::T:WLMTAGL:ON:WL:en-xm:SI_SB_4:092009
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100125/2d4bfd8e/attachment.htm>

From swhiteho at redhat.com  Mon Jan 25 13:10:20 2010
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Mon, 25 Jan 2010 13:10:20 +0000
Subject: [Linux-cluster] GFS continue to reboot nodes
In-Reply-To: <BLU110-W1CBDBB93586064A976323D45F0@phx.gbl>
References: <82d055df1001231015na727d56t8bd26d22fb2cd82e@mail.gmail.com>
	, <BLU110-W1EA55655B09DEA6847282D4600@phx.gbl>
	, <82d055df1001240035h370f4de8t34a02b5499006e00@mail.gmail.com>
	<BLU110-W1CBDBB93586064A976323D45F0@phx.gbl>
Message-ID: <1264425020.14393.246.camel@localhost.localdomain>

Hi,

On Mon, 2010-01-25 at 17:54 +0500, Muhammad Ammad Shah wrote:
> 
> Dear Rajat,
> 
> HI,
> 
> I have configured two node cluster and its working fine for SAN (ext3
> file system). after this i configured GFS using following.
> 
> root# pvcreate /dev/sdb
> root#vgcreate -c y vg1_gfs /dev/sdc1
> root#lvcreate -n db_store -l 100%FREE vg1_gfs
> root#/etc/init.d/clvmd start
> 
> Started on both nodes.
> 
> root#mkfs -t gfs2 -p lock_dlm -t db_clust:db_store -j
> 4 /dev/vg1_gfs/db_store
> root# service gfs start
> 
> root#chkconfig --level 345 clvmd on 
> root#chkconfig --level 345 gfs on
> 
> ----------------
> the problem is, as i changed File system (ex3) resource to GFS
> Resource.
> 
> nodes are rebooting..
> 
> there is nothing in /var/log/messages. but when i checked console of
> the node there was some message related to GFS. 
> DLM id:0 ...
> 
What is the exact message?

Steve.




From broder at mit.edu  Mon Jan 25 13:13:31 2010
From: broder at mit.edu (Evan Broder)
Date: Mon, 25 Jan 2010 08:13:31 -0500
Subject: [Linux-cluster] Deadlock when using clvmd + OpenAIS + Corosync
In-Reply-To: <4B5D7325.3000605@redhat.com>
References: <178868c41001081458h3c0b70denef831712883a6fec@mail.gmail.com>
	<178868c41001110132s1350c7cep942148bbce62dfc0@mail.gmail.com>
	<4B4AF17C.4040306@redhat.com> <4B4C38D1.6000608@redhat.com>
	<178868c41001120821g6885c03cr75a3d42eb0196428@mail.gmail.com>
	<4B4D9982.6060505@redhat.com>
	<178868c41001210717u82779edp128006b7fc8e67e8@mail.gmail.com>
	<4B598F07.1070002@redhat.com>
	<178868c41001230935v752a811clbee3a679d13c8f0b@mail.gmail.com>
	<4B5D7325.3000605@redhat.com>
Message-ID: <178868c41001250513o39d9e7d3i7f9ace8b0dc2e070@mail.gmail.com>

On Mon, Jan 25, 2010 at 5:32 AM, Christine Caulfield
<ccaulfie at redhat.com> wrote:
> On 23/01/10 17:35, Evan Broder wrote:
>>
>> On Fri, Jan 22, 2010 at 6:41 AM, Christine Caulfield
>> <ccaulfie at redhat.com> ?wrote:
>>>
>>> On 21/01/10 15:17, Evan Broder wrote:
>>>>
>>>> On Wed, Jan 13, 2010 at 4:59 AM, Christine Caulfield
>>>> <ccaulfie at redhat.com> ? ?wrote:
>>>>>
>>>>> On 12/01/10 16:21, Evan Broder wrote:
>>>>>>
>>>>>> On Tue, Jan 12, 2010 at 3:54 AM, Christine Caulfield
>>>>>> <ccaulfie at redhat.com> ? ? ?wrote:
>>>>>>>
>>>>>>> On 11/01/10 09:38, Christine Caulfield wrote:
>>>>>>>>
>>>>>>>> On 11/01/10 09:32, Evan Broder wrote:
>>>>>>>>>
>>>>>>>>> On Mon, Jan 11, 2010 at 4:03 AM, Christine Caulfield
>>>>>>>>> <ccaulfie at redhat.com> ? ? ?wrote:
>>>>>>>>>>
>>>>>>>>>> On 08/01/10 22:58, Evan Broder wrote:
>>>>>>>>>>>
>>>>>>>>>>> [please preserve the CC when replying, thanks]
>>>>>>>>>>>
>>>>>>>>>>> Hi -
>>>>>>>>>>> We're attempting to setup a clvm (2.02.56) cluster using OpenAIS
>>>>>>>>>>> (1.1.1) and Corosync (1.1.2). We've gotten bitten hard in the
>>>>>>>>>>> past
>>>>>>>>>>> by
>>>>>>>>>>> crashes leaving DLM state around and forcing us to reboot our
>>>>>>>>>>> nodes,
>>>>>>>>>>> so we're specifically looking for a solution that doesn't involve
>>>>>>>>>>> in-kernel locking.
>>>>>>>>>>>
>>>>>>>>>>> We're also running the Pacemaker OpenAIS service, as we're hoping
>>>>>>>>>>> to
>>>>>>>>>>> use it for management of some other resources going forward.
>>>>>>>>>>>
>>>>>>>>>>> We've managed to form the OpenAIS cluster, and get clvmd running
>>>>>>>>>>> on
>>>>>>>>>>> both of our nodes. Operations using LVM succeed, so long as only
>>>>>>>>>>> one
>>>>>>>>>>> operation runs at a time. However, if we attempt to run two
>>>>>>>>>>> operations
>>>>>>>>>>> (say, one lvcreate on each host) at a time, they both hang, and
>>>>>>>>>>> both
>>>>>>>>>>> clvmd processes appear to deadlock.
>>>>>>>>>>>
>>>>>>>>>>> When they deadlock, it doesn't appear to affect the other
>>>>>>>>>>> clustering
>>>>>>>>>>> processes - both corosync and pacemaker still report a fully
>>>>>>>>>>> formed
>>>>>>>>>>> cluster, so it seems the issue is localized to clvmd.
>>>>>>>>>>>
>>>>>>>>>>> I've looked at logs from corosync and pacemaker, and I've straced
>>>>>>>>>>> various processes, but I don't want to blast a bunch of useless
>>>>>>>>>>> information at the list. What information can I provide to make
>>>>>>>>>>> it
>>>>>>>>>>> easier to debug and fix this deadlock?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> To start with, the best logging to produce is the clvmd logs which
>>>>>>>>>> can be
>>>>>>>>>> got with clvmd -d (see the man page for details). Ideally these
>>>>>>>>>> should be
>>>>>>>>>> from all nodes in the cluster so they can be correlated. If you're
>>>>>>>>>> still
>>>>>>>>>> using DLM then a dlm lock dump from all nodes is often helpful in
>>>>>>>>>> conjunction with the clvmd logs.
>>>>>>>>>
>>>>>>>>> Sure, no problem. I've posted the logs from clvmd on both processes
>>>>>>>>> in
>>>>>>>>> <http://web.mit.edu/broder/Public/clvmd/>. I've annotated them at a
>>>>>>>>> few points with what I was doing - the annotations all start with "
>>>>>>>>>>>
>>>>>>>>>>> ", so they should be easy to spot.
>>>>>>>
>>>>>>>
>>>>>>> Ironically it looks like a bug in the clvmd-openais code. I can
>>>>>>> reproduce
>>>>>>> it
>>>>>>> on my systems here. I don't see the problem when using the dlm!
>>>>>>>
>>>>>>> Can you try -Icorosync and see if that helps? In the meantime I'll
>>>>>>> have
>>>>>>> a
>>>>>>> look at the openais bits to try and find out what is wrong.
>>>>>>>
>>>>>>> Chrissie
>>>>>>>
>>>>>>
>>>>>> I'll see what we can pull together, but the nodes running the clvm
>>>>>> cluster are also Xen dom0's. They're currently running on (Ubuntu
>>>>>> Hardy's) 2.6.24, so upgrading them to something new enough to support
>>>>>> DLM 3 would be...challenging.
>>>>>>
>>>>>> It would be much, much better for us if we could get clvmd-openais
>>>>>> working.
>>>>>>
>>>>>> Is there any chance this would work better if we dropped back to
>>>>>> openais whitetank instead of corosync + openais wilson?
>>>>>>
>>>>>
>>>>>
>>>>> OK, I've found the bug and it IS in openais. The attached patch will
>>>>> fix
>>>>> it.
>>>>>
>>>>> Chrissie
>>>>>
>>>>
>>>> Awesome. That patch fixed our problem.
>>>>
>>>> We are running into one other problem - performing LVM operations on
>>>> one node is substantially slower than performing them on the other
>>>> node:
>>>>
>>>> root at black-mesa:~# time lvcreate -n test -L 1G xenvg
>>>> ? Logical volume "test" created
>>>>
>>>> real ? ?0m0.309s
>>>> user ? ?0m0.000s
>>>> sys ? ? 0m0.008s
>>>> root at black-mesa:~# time lvremove -f /dev/xenvg/test
>>>> ? Logical volume "test" successfully removed
>>>>
>>>> real ? ?0m0.254s
>>>> user ? ?0m0.004s
>>>> sys ? ? 0m0.008s
>>>>
>>>>
>>>> root at torchwood-institute:~# time lvcreate -n test -L 1G xenvg
>>>> ? Logical volume "test" created
>>>>
>>>> real ? ?0m7.282s
>>>> user ? ?0m6.396s
>>>> sys ? ? 0m0.312s
>>>> root at torchwood-institute:~# time lvremove -f /dev/xenvg/test
>>>> ? Logical volume "test" successfully removed
>>>>
>>>> real ? ?0m7.277s
>>>> user ? ?0m6.420s
>>>> sys ? ? 0m0.292s
>>>>
>>>> Any idea why this is happening and if there's anything we can do about
>>>> it?
>>>>
>>>
>>>
>>> I'm not at all sure why that should be happening. I suppose the best
>>> thing
>>> to do would be to enable clvmd logging (clvmd -d) and see what is taking
>>> the
>>> time.
>>>
>>> Chrissie
>>>
>>
>> No problem. I've collected another set of logs - they're in
>> <http://web.mit.edu/broder/Public/clvmd-slow/>.
>>
>> After spinning up corosync and clvmd, the commands I ran were, in order:
>>
>> ? root at black-mesa:~# vgchange -a y xenvg
>> ? ? 0 logical volume(s) in volume group "xenvg" now active
>> ? root at black-mesa:~# time lvcreate -n test1 -L 1G xenvg
>> ? ? Logical volume "test1" created
>>
>> ? real ? ?0m0.685s
>> ? user ? ?0m0.004s
>> ? sys ? ? 0m0.000s
>> ? root at black-mesa:~# time lvremove -f /dev/xenvg/test1
>> ? ? Logical volume "test1" successfully removed
>>
>> ? real ? ?0m0.235s
>> ? user ? ?0m0.004s
>> ? sys ? ? 0m0.004s
>> ? root at torchwood-institute:~# time lvcreate -n test2 -L 1G xenvg
>> ? ? Logical volume "test2" created
>>
>> ? real ? ?0m8.007s
>> ? user ? ?0m6.396s
>> ? sys ? ? 0m0.312s
>> ? root at torchwood-institute:~# time lvremove -f /dev/xenvg/test2
>> ? ? Logical volume "test2" successfully removed
>>
>> ? real ? ?0m7.364s
>> ? user ? ?0m6.436s
>> ? sys ? ? 0m0.300s
>> ? root at black-mesa:~# vgchange -a n xenvg
>> ? ? 0 logical volume(s) in volume group "xenvg" now active
>>
>> (black-mesa is node 1, and torchwood-institute is node 2)
>>
>> Thanks again for your help,
>>
>
> Hiya,
>
> Oddly I can't find any delays in the clvmd logs at all. There are some
> 7-second gaps in the log files but those are between commands coming from
> the lvm command-line and not internal to clvmd itself.
>
> What sort of storage are you using for these LVs? The only thing I can think
> of now is that there is some sort of delay in lvm opening the device for
> writing as it updates the metadata. An LVM debug log might be helpful here,
> though I'm not sure, off-hand, how to put time-stamps on that - I'm not
> really an LVM developer any more.
>
> Chrissie
>

Well, uh, you see...at the moment I'm using a VM on a personal server
running iscsitarget from the other side of MIT's campus :)

But that's only because our normal RAID is in the process of being
RMA'd. Normally it's a Fibre channel RAID sitting right next to the
two servers, and we still had the same timing problems (with the same
server being slower) then. The two servers are identical hardware and
sitting right next to each other.

In any case, thanks for the suggestion to look at LVM itself instead
of clvmd - I sort of forgot my basic debugging techniques before
asking for help on this one. LVM logs to syslog, and its logging mode
includes timestamps.

When I cranked up the debug level, I noticed that the lvcreates and
lvremoves were hanging on "Archiving volume group "xenvg" metadata
(seqno 48).". If I turn off the metadata archives in lvm.conf on
torchwood-institute, its performance goes back to what I'd expect, so
I guess it's having some sort of disk problems.

Thanks again for all of our help,
 - Evan



From rajatjpatel at gmail.com  Mon Jan 25 13:17:44 2010
From: rajatjpatel at gmail.com (rajatjpatel)
Date: Mon, 25 Jan 2010 18:47:44 +0530
Subject: [Linux-cluster] GFS continue to reboot nodes
In-Reply-To: <1264425020.14393.246.camel@localhost.localdomain>
References: <82d055df1001231015na727d56t8bd26d22fb2cd82e@mail.gmail.com>
	<BLU110-W1EA55655B09DEA6847282D4600@phx.gbl>
	<82d055df1001240035h370f4de8t34a02b5499006e00@mail.gmail.com>
	<BLU110-W1CBDBB93586064A976323D45F0@phx.gbl>
	<1264425020.14393.246.camel@localhost.localdomain>
Message-ID: <82d055df1001250517k37b5e5f2j18578cd38af5b8b4@mail.gmail.com>

This happens because one node of the cluster has two interfaces on the same
network segment with an IP in the same subnet. This node sends out cluster
messages by the wrong source IP instead of the IP defined in the
/etc/cluster/cluster.conf.



To solve the issue, just need shutdown the IP that is not defined in
/etc/cluster/cluster.conf.

Regards,

Rajat J Patel
D 803 Royal Classic
Link Road
Andheri West
Mumbai 53
+919920121211
www.taashee.com

FIRST THEY IGNORE YOU...
THEN THEY LAUGH AT YOU...
THEN THEY FIGHT YOU...
THEN YOU WIN...


On Mon, Jan 25, 2010 at 6:40 PM, Steven Whitehouse <swhiteho at redhat.com>wrote:

> Hi,
>
> On Mon, 2010-01-25 at 17:54 +0500, Muhammad Ammad Shah wrote:
> >
> > Dear Rajat,
> >
> > HI,
> >
> > I have configured two node cluster and its working fine for SAN (ext3
> > file system). after this i configured GFS using following.
> >
> > root# pvcreate /dev/sdb
> > root#vgcreate -c y vg1_gfs /dev/sdc1
> > root#lvcreate -n db_store -l 100%FREE vg1_gfs
> > root#/etc/init.d/clvmd start
> >
> > Started on both nodes.
> >
> > root#mkfs -t gfs2 -p lock_dlm -t db_clust:db_store -j
> > 4 /dev/vg1_gfs/db_store
> > root# service gfs start
> >
> > root#chkconfig --level 345 clvmd on
> > root#chkconfig --level 345 gfs on
> >
> > ----------------
> > the problem is, as i changed File system (ex3) resource to GFS
> > Resource.
> >
> > nodes are rebooting..
> >
> > there is nothing in /var/log/messages. but when i checked console of
> > the node there was some message related to GFS.
> > DLM id:0 ...
> >
> What is the exact message?
>
> Steve.
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100125/99881aaa/attachment.htm>

From rajatjpatel at gmail.com  Mon Jan 25 13:18:58 2010
From: rajatjpatel at gmail.com (rajatjpatel)
Date: Mon, 25 Jan 2010 18:48:58 +0530
Subject: [Linux-cluster] GFS continue to reboot nodes
In-Reply-To: <BLU110-W1CBDBB93586064A976323D45F0@phx.gbl>
References: <82d055df1001231015na727d56t8bd26d22fb2cd82e@mail.gmail.com>
	<BLU110-W1EA55655B09DEA6847282D4600@phx.gbl>
	<82d055df1001240035h370f4de8t34a02b5499006e00@mail.gmail.com>
	<BLU110-W1CBDBB93586064A976323D45F0@phx.gbl>
Message-ID: <82d055df1001250518i5f6fcb22la2fde2d38de708a0@mail.gmail.com>

Hi Shah,

Have you config your multipath or your using /dev/sda from your san

This happens because one node of the cluster has two interfaces on the same
network segment with an IP in the same subnet. This node sends out cluster
messages by the wrong source IP instead of the IP defined in the
/etc/cluster/cluster.conf.



To solve the issue, just need shutdown the IP that is not defined in
/etc/cluster/cluster.conf.

Regards,

Rajat J Patel
D 803 Royal Classic
Link Road
Andheri West
Mumbai 53
+919920121211
www.taashee.com

FIRST THEY IGNORE YOU...
THEN THEY LAUGH AT YOU...
THEN THEY FIGHT YOU...
THEN YOU WIN...


On Mon, Jan 25, 2010 at 6:24 PM, Muhammad Ammad Shah <mammadshah at hotmail.com
> wrote:

>
> Dear Rajat,
>
> HI,
>
> I have configured two node cluster and its working fine for SAN (ext3 file
> system). after this i configured GFS using following.
>
> root# pvcreate /dev/sdb
> root#vgcreate -c y vg1_gfs /dev/sdc1
> root#lvcreate -n db_store -l 100%FREE vg1_gfs
> root#/etc/init.d/clvmd start
>
> Started on both nodes.
>
> root#mkfs -t gfs2 -p lock_dlm -t db_clust:db_store -j 4
> /dev/vg1_gfs/db_store
> root# service gfs start
>
> root#chkconfig --level 345 clvmd on
> root#chkconfig --level 345 gfs on
>
> ----------------
> the problem is, as i changed File system (ex3) resource to GFS Resource.
>
> nodes are rebooting..
>
> there is nothing in /var/log/messages. but when i checked console of the
> node there was some message related to GFS.
> DLM id:0 ...
>
> so i removed GFS and switched back to File system(ext3) resource.
>
> can i install oracle on Resource File system(ext3) ?
>
> or how to troubleshoot GFS reboot..
> need help,
>
>
>
>
> Thanks,
> Muhammad Ammad Shah
>
>
>
> ------------------------------
> Windows Live Hotmail: Your friends can get your Facebook updates, right
> from Hotmail?.<http://www.microsoft.com/middleeast/windows/windowslive/see-it-in-action/social-network-basics.aspx?ocid=PID23461::T:WLMTAGL:ON:WL:en-xm:SI_SB_4:092009>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100125/06750eba/attachment.htm>

From rajatjpatel at gmail.com  Mon Jan 25 13:20:29 2010
From: rajatjpatel at gmail.com (rajatjpatel)
Date: Mon, 25 Jan 2010 18:50:29 +0530
Subject: [Linux-cluster] GFS continue to reboot nodes
In-Reply-To: <BLU110-W1CBDBB93586064A976323D45F0@phx.gbl>
References: <82d055df1001231015na727d56t8bd26d22fb2cd82e@mail.gmail.com>
	<BLU110-W1EA55655B09DEA6847282D4600@phx.gbl>
	<82d055df1001240035h370f4de8t34a02b5499006e00@mail.gmail.com>
	<BLU110-W1CBDBB93586064A976323D45F0@phx.gbl>
Message-ID: <82d055df1001250520k7c6651f7w3191245e1267c8a2@mail.gmail.com>

root#mkfs -t gfs2 -p lock_dlm -t db_clust:db_store -j 4/dev/vg1_gfs/db_store

root#mkfs -t gfs2 -p lock_dlm -t db_clust:db_store -j 3
/dev/vg1_gfs/db_store
Regards,

Rajat J Patel
D 803 Royal Classic
Link Road
Andheri West
Mumbai 53
+919920121211
www.taashee.com

FIRST THEY IGNORE YOU...
THEN THEY LAUGH AT YOU...
THEN THEY FIGHT YOU...
THEN YOU WIN...


On Mon, Jan 25, 2010 at 6:24 PM, Muhammad Ammad Shah <mammadshah at hotmail.com
> wrote:

>
> Dear Rajat,
>
> HI,
>
> I have configured two node cluster and its working fine for SAN (ext3 file
> system). after this i configured GFS using following.
>
> root# pvcreate /dev/sdb
> root#vgcreate -c y vg1_gfs /dev/sdc1
> root#lvcreate -n db_store -l 100%FREE vg1_gfs
> root#/etc/init.d/clvmd start
>
> Started on both nodes.
>
> root#mkfs -t gfs2 -p lock_dlm -t db_clust:db_store -j 4
> /dev/vg1_gfs/db_store
> root# service gfs start
>
> root#chkconfig --level 345 clvmd on
> root#chkconfig --level 345 gfs on
>
> ----------------
> the problem is, as i changed File system (ex3) resource to GFS Resource.
>
> nodes are rebooting..
>
> there is nothing in /var/log/messages. but when i checked console of the
> node there was some message related to GFS.
> DLM id:0 ...
>
> so i removed GFS and switched back to File system(ext3) resource.
>
> can i install oracle on Resource File system(ext3) ?
>
> or how to troubleshoot GFS reboot..
> need help,
>
>
>
>
> Thanks,
> Muhammad Ammad Shah
>
>
>
> ------------------------------
> Windows Live Hotmail: Your friends can get your Facebook updates, right
> from Hotmail?.<http://www.microsoft.com/middleeast/windows/windowslive/see-it-in-action/social-network-basics.aspx?ocid=PID23461::T:WLMTAGL:ON:WL:en-xm:SI_SB_4:092009>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100125/3f8f5fd3/attachment.htm>

From broder at mit.edu  Mon Jan 25 13:28:27 2010
From: broder at mit.edu (Evan Broder)
Date: Mon, 25 Jan 2010 08:28:27 -0500
Subject: [Linux-cluster] Deadlock when using clvmd + OpenAIS + Corosync
In-Reply-To: <178868c41001250513o39d9e7d3i7f9ace8b0dc2e070@mail.gmail.com>
References: <178868c41001081458h3c0b70denef831712883a6fec@mail.gmail.com>
	<4B4AF17C.4040306@redhat.com> <4B4C38D1.6000608@redhat.com>
	<178868c41001120821g6885c03cr75a3d42eb0196428@mail.gmail.com>
	<4B4D9982.6060505@redhat.com>
	<178868c41001210717u82779edp128006b7fc8e67e8@mail.gmail.com>
	<4B598F07.1070002@redhat.com>
	<178868c41001230935v752a811clbee3a679d13c8f0b@mail.gmail.com>
	<4B5D7325.3000605@redhat.com>
	<178868c41001250513o39d9e7d3i7f9ace8b0dc2e070@mail.gmail.com>
Message-ID: <178868c41001250528h7fe51965p89990eb2f5ffd542@mail.gmail.com>

On Mon, Jan 25, 2010 at 8:13 AM, Evan Broder <broder at mit.edu> wrote:
> On Mon, Jan 25, 2010 at 5:32 AM, Christine Caulfield
> <ccaulfie at redhat.com> wrote:
>> On 23/01/10 17:35, Evan Broder wrote:
>>>
>>> On Fri, Jan 22, 2010 at 6:41 AM, Christine Caulfield
>>> <ccaulfie at redhat.com> ?wrote:
>>>>
>>>> On 21/01/10 15:17, Evan Broder wrote:
>>>>>
>>>>> On Wed, Jan 13, 2010 at 4:59 AM, Christine Caulfield
>>>>> <ccaulfie at redhat.com> ? ?wrote:
>>>>>>
>>>>>> On 12/01/10 16:21, Evan Broder wrote:
>>>>>>>
>>>>>>> On Tue, Jan 12, 2010 at 3:54 AM, Christine Caulfield
>>>>>>> <ccaulfie at redhat.com> ? ? ?wrote:
>>>>>>>>
>>>>>>>> On 11/01/10 09:38, Christine Caulfield wrote:
>>>>>>>>>
>>>>>>>>> On 11/01/10 09:32, Evan Broder wrote:
>>>>>>>>>>
>>>>>>>>>> On Mon, Jan 11, 2010 at 4:03 AM, Christine Caulfield
>>>>>>>>>> <ccaulfie at redhat.com> ? ? ?wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 08/01/10 22:58, Evan Broder wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> [please preserve the CC when replying, thanks]
>>>>>>>>>>>>
>>>>>>>>>>>> Hi -
>>>>>>>>>>>> We're attempting to setup a clvm (2.02.56) cluster using OpenAIS
>>>>>>>>>>>> (1.1.1) and Corosync (1.1.2). We've gotten bitten hard in the
>>>>>>>>>>>> past
>>>>>>>>>>>> by
>>>>>>>>>>>> crashes leaving DLM state around and forcing us to reboot our
>>>>>>>>>>>> nodes,
>>>>>>>>>>>> so we're specifically looking for a solution that doesn't involve
>>>>>>>>>>>> in-kernel locking.
>>>>>>>>>>>>
>>>>>>>>>>>> We're also running the Pacemaker OpenAIS service, as we're hoping
>>>>>>>>>>>> to
>>>>>>>>>>>> use it for management of some other resources going forward.
>>>>>>>>>>>>
>>>>>>>>>>>> We've managed to form the OpenAIS cluster, and get clvmd running
>>>>>>>>>>>> on
>>>>>>>>>>>> both of our nodes. Operations using LVM succeed, so long as only
>>>>>>>>>>>> one
>>>>>>>>>>>> operation runs at a time. However, if we attempt to run two
>>>>>>>>>>>> operations
>>>>>>>>>>>> (say, one lvcreate on each host) at a time, they both hang, and
>>>>>>>>>>>> both
>>>>>>>>>>>> clvmd processes appear to deadlock.
>>>>>>>>>>>>
>>>>>>>>>>>> When they deadlock, it doesn't appear to affect the other
>>>>>>>>>>>> clustering
>>>>>>>>>>>> processes - both corosync and pacemaker still report a fully
>>>>>>>>>>>> formed
>>>>>>>>>>>> cluster, so it seems the issue is localized to clvmd.
>>>>>>>>>>>>
>>>>>>>>>>>> I've looked at logs from corosync and pacemaker, and I've straced
>>>>>>>>>>>> various processes, but I don't want to blast a bunch of useless
>>>>>>>>>>>> information at the list. What information can I provide to make
>>>>>>>>>>>> it
>>>>>>>>>>>> easier to debug and fix this deadlock?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> To start with, the best logging to produce is the clvmd logs which
>>>>>>>>>>> can be
>>>>>>>>>>> got with clvmd -d (see the man page for details). Ideally these
>>>>>>>>>>> should be
>>>>>>>>>>> from all nodes in the cluster so they can be correlated. If you're
>>>>>>>>>>> still
>>>>>>>>>>> using DLM then a dlm lock dump from all nodes is often helpful in
>>>>>>>>>>> conjunction with the clvmd logs.
>>>>>>>>>>
>>>>>>>>>> Sure, no problem. I've posted the logs from clvmd on both processes
>>>>>>>>>> in
>>>>>>>>>> <http://web.mit.edu/broder/Public/clvmd/>. I've annotated them at a
>>>>>>>>>> few points with what I was doing - the annotations all start with "
>>>>>>>>>>>>
>>>>>>>>>>>> ", so they should be easy to spot.
>>>>>>>>
>>>>>>>>
>>>>>>>> Ironically it looks like a bug in the clvmd-openais code. I can
>>>>>>>> reproduce
>>>>>>>> it
>>>>>>>> on my systems here. I don't see the problem when using the dlm!
>>>>>>>>
>>>>>>>> Can you try -Icorosync and see if that helps? In the meantime I'll
>>>>>>>> have
>>>>>>>> a
>>>>>>>> look at the openais bits to try and find out what is wrong.
>>>>>>>>
>>>>>>>> Chrissie
>>>>>>>>
>>>>>>>
>>>>>>> I'll see what we can pull together, but the nodes running the clvm
>>>>>>> cluster are also Xen dom0's. They're currently running on (Ubuntu
>>>>>>> Hardy's) 2.6.24, so upgrading them to something new enough to support
>>>>>>> DLM 3 would be...challenging.
>>>>>>>
>>>>>>> It would be much, much better for us if we could get clvmd-openais
>>>>>>> working.
>>>>>>>
>>>>>>> Is there any chance this would work better if we dropped back to
>>>>>>> openais whitetank instead of corosync + openais wilson?
>>>>>>>
>>>>>>
>>>>>>
>>>>>> OK, I've found the bug and it IS in openais. The attached patch will
>>>>>> fix
>>>>>> it.
>>>>>>
>>>>>> Chrissie
>>>>>>
>>>>>
>>>>> Awesome. That patch fixed our problem.
>>>>>
>>>>> We are running into one other problem - performing LVM operations on
>>>>> one node is substantially slower than performing them on the other
>>>>> node:
>>>>>
>>>>> root at black-mesa:~# time lvcreate -n test -L 1G xenvg
>>>>> ? Logical volume "test" created
>>>>>
>>>>> real ? ?0m0.309s
>>>>> user ? ?0m0.000s
>>>>> sys ? ? 0m0.008s
>>>>> root at black-mesa:~# time lvremove -f /dev/xenvg/test
>>>>> ? Logical volume "test" successfully removed
>>>>>
>>>>> real ? ?0m0.254s
>>>>> user ? ?0m0.004s
>>>>> sys ? ? 0m0.008s
>>>>>
>>>>>
>>>>> root at torchwood-institute:~# time lvcreate -n test -L 1G xenvg
>>>>> ? Logical volume "test" created
>>>>>
>>>>> real ? ?0m7.282s
>>>>> user ? ?0m6.396s
>>>>> sys ? ? 0m0.312s
>>>>> root at torchwood-institute:~# time lvremove -f /dev/xenvg/test
>>>>> ? Logical volume "test" successfully removed
>>>>>
>>>>> real ? ?0m7.277s
>>>>> user ? ?0m6.420s
>>>>> sys ? ? 0m0.292s
>>>>>
>>>>> Any idea why this is happening and if there's anything we can do about
>>>>> it?
>>>>>
>>>>
>>>>
>>>> I'm not at all sure why that should be happening. I suppose the best
>>>> thing
>>>> to do would be to enable clvmd logging (clvmd -d) and see what is taking
>>>> the
>>>> time.
>>>>
>>>> Chrissie
>>>>
>>>
>>> No problem. I've collected another set of logs - they're in
>>> <http://web.mit.edu/broder/Public/clvmd-slow/>.
>>>
>>> After spinning up corosync and clvmd, the commands I ran were, in order:
>>>
>>> ? root at black-mesa:~# vgchange -a y xenvg
>>> ? ? 0 logical volume(s) in volume group "xenvg" now active
>>> ? root at black-mesa:~# time lvcreate -n test1 -L 1G xenvg
>>> ? ? Logical volume "test1" created
>>>
>>> ? real ? ?0m0.685s
>>> ? user ? ?0m0.004s
>>> ? sys ? ? 0m0.000s
>>> ? root at black-mesa:~# time lvremove -f /dev/xenvg/test1
>>> ? ? Logical volume "test1" successfully removed
>>>
>>> ? real ? ?0m0.235s
>>> ? user ? ?0m0.004s
>>> ? sys ? ? 0m0.004s
>>> ? root at torchwood-institute:~# time lvcreate -n test2 -L 1G xenvg
>>> ? ? Logical volume "test2" created
>>>
>>> ? real ? ?0m8.007s
>>> ? user ? ?0m6.396s
>>> ? sys ? ? 0m0.312s
>>> ? root at torchwood-institute:~# time lvremove -f /dev/xenvg/test2
>>> ? ? Logical volume "test2" successfully removed
>>>
>>> ? real ? ?0m7.364s
>>> ? user ? ?0m6.436s
>>> ? sys ? ? 0m0.300s
>>> ? root at black-mesa:~# vgchange -a n xenvg
>>> ? ? 0 logical volume(s) in volume group "xenvg" now active
>>>
>>> (black-mesa is node 1, and torchwood-institute is node 2)
>>>
>>> Thanks again for your help,
>>>
>>
>> Hiya,
>>
>> Oddly I can't find any delays in the clvmd logs at all. There are some
>> 7-second gaps in the log files but those are between commands coming from
>> the lvm command-line and not internal to clvmd itself.
>>
>> What sort of storage are you using for these LVs? The only thing I can think
>> of now is that there is some sort of delay in lvm opening the device for
>> writing as it updates the metadata. An LVM debug log might be helpful here,
>> though I'm not sure, off-hand, how to put time-stamps on that - I'm not
>> really an LVM developer any more.
>>
>> Chrissie
>>
>
> Well, uh, you see...at the moment I'm using a VM on a personal server
> running iscsitarget from the other side of MIT's campus :)
>
> But that's only because our normal RAID is in the process of being
> RMA'd. Normally it's a Fibre channel RAID sitting right next to the
> two servers, and we still had the same timing problems (with the same
> server being slower) then. The two servers are identical hardware and
> sitting right next to each other.
>
> In any case, thanks for the suggestion to look at LVM itself instead
> of clvmd - I sort of forgot my basic debugging techniques before
> asking for help on this one. LVM logs to syslog, and its logging mode
> includes timestamps.
>
> When I cranked up the debug level, I noticed that the lvcreates and
> lvremoves were hanging on "Archiving volume group "xenvg" metadata
> (seqno 48).". If I turn off the metadata archives in lvm.conf on
> torchwood-institute, its performance goes back to what I'd expect, so
> I guess it's having some sort of disk problems.
>
> Thanks again for all of our help,
> ?- Evan
>

Actually, the real problem turned out to be much more straightforward:

root at torchwood-institute:~# ls -a /etc/lvm/archive | wc -l
301369

For some reason, torchwood-institute had built up a massive archive of
files named like
/etc/lvm/archive/.lvm_torchwood-institute.mit.edu_17470_993238909.
Listing them all took...about 7 seconds.

Deleting them all put performance back where it should be.

- Evan



From ccaulfie at redhat.com  Mon Jan 25 13:30:40 2010
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Mon, 25 Jan 2010 13:30:40 +0000
Subject: [Linux-cluster] Deadlock when using clvmd + OpenAIS + Corosync
In-Reply-To: <178868c41001250528h7fe51965p89990eb2f5ffd542@mail.gmail.com>
References: <178868c41001081458h3c0b70denef831712883a6fec@mail.gmail.com>	
	<4B4AF17C.4040306@redhat.com> <4B4C38D1.6000608@redhat.com>	
	<178868c41001120821g6885c03cr75a3d42eb0196428@mail.gmail.com>	
	<4B4D9982.6060505@redhat.com>	
	<178868c41001210717u82779edp128006b7fc8e67e8@mail.gmail.com>	
	<4B598F07.1070002@redhat.com>	
	<178868c41001230935v752a811clbee3a679d13c8f0b@mail.gmail.com>	
	<4B5D7325.3000605@redhat.com>	
	<178868c41001250513o39d9e7d3i7f9ace8b0dc2e070@mail.gmail.com>
	<178868c41001250528h7fe51965p89990eb2f5ffd542@mail.gmail.com>
Message-ID: <4B5D9D00.8020504@redhat.com>

On 25/01/10 13:28, Evan Broder wrote:
> On Mon, Jan 25, 2010 at 8:13 AM, Evan Broder<broder at mit.edu>  wrote:
>> On Mon, Jan 25, 2010 at 5:32 AM, Christine Caulfield
>> <ccaulfie at redhat.com>  wrote:
>>> On 23/01/10 17:35, Evan Broder wrote:
>>>>
>>>> On Fri, Jan 22, 2010 at 6:41 AM, Christine Caulfield
>>>> <ccaulfie at redhat.com>    wrote:
>>>>>
>>>>> On 21/01/10 15:17, Evan Broder wrote:
>>>>>>
>>>>>> On Wed, Jan 13, 2010 at 4:59 AM, Christine Caulfield
>>>>>> <ccaulfie at redhat.com>      wrote:
>>>>>>>
>>>>>>> On 12/01/10 16:21, Evan Broder wrote:
>>>>>>>>
>>>>>>>> On Tue, Jan 12, 2010 at 3:54 AM, Christine Caulfield
>>>>>>>> <ccaulfie at redhat.com>        wrote:
>>>>>>>>>
>>>>>>>>> On 11/01/10 09:38, Christine Caulfield wrote:
>>>>>>>>>>
>>>>>>>>>> On 11/01/10 09:32, Evan Broder wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jan 11, 2010 at 4:03 AM, Christine Caulfield
>>>>>>>>>>> <ccaulfie at redhat.com>        wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 08/01/10 22:58, Evan Broder wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> [please preserve the CC when replying, thanks]
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi -
>>>>>>>>>>>>> We're attempting to setup a clvm (2.02.56) cluster using OpenAIS
>>>>>>>>>>>>> (1.1.1) and Corosync (1.1.2). We've gotten bitten hard in the
>>>>>>>>>>>>> past
>>>>>>>>>>>>> by
>>>>>>>>>>>>> crashes leaving DLM state around and forcing us to reboot our
>>>>>>>>>>>>> nodes,
>>>>>>>>>>>>> so we're specifically looking for a solution that doesn't involve
>>>>>>>>>>>>> in-kernel locking.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We're also running the Pacemaker OpenAIS service, as we're hoping
>>>>>>>>>>>>> to
>>>>>>>>>>>>> use it for management of some other resources going forward.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We've managed to form the OpenAIS cluster, and get clvmd running
>>>>>>>>>>>>> on
>>>>>>>>>>>>> both of our nodes. Operations using LVM succeed, so long as only
>>>>>>>>>>>>> one
>>>>>>>>>>>>> operation runs at a time. However, if we attempt to run two
>>>>>>>>>>>>> operations
>>>>>>>>>>>>> (say, one lvcreate on each host) at a time, they both hang, and
>>>>>>>>>>>>> both
>>>>>>>>>>>>> clvmd processes appear to deadlock.
>>>>>>>>>>>>>
>>>>>>>>>>>>> When they deadlock, it doesn't appear to affect the other
>>>>>>>>>>>>> clustering
>>>>>>>>>>>>> processes - both corosync and pacemaker still report a fully
>>>>>>>>>>>>> formed
>>>>>>>>>>>>> cluster, so it seems the issue is localized to clvmd.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I've looked at logs from corosync and pacemaker, and I've straced
>>>>>>>>>>>>> various processes, but I don't want to blast a bunch of useless
>>>>>>>>>>>>> information at the list. What information can I provide to make
>>>>>>>>>>>>> it
>>>>>>>>>>>>> easier to debug and fix this deadlock?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> To start with, the best logging to produce is the clvmd logs which
>>>>>>>>>>>> can be
>>>>>>>>>>>> got with clvmd -d (see the man page for details). Ideally these
>>>>>>>>>>>> should be
>>>>>>>>>>>> from all nodes in the cluster so they can be correlated. If you're
>>>>>>>>>>>> still
>>>>>>>>>>>> using DLM then a dlm lock dump from all nodes is often helpful in
>>>>>>>>>>>> conjunction with the clvmd logs.
>>>>>>>>>>>
>>>>>>>>>>> Sure, no problem. I've posted the logs from clvmd on both processes
>>>>>>>>>>> in
>>>>>>>>>>> <http://web.mit.edu/broder/Public/clvmd/>. I've annotated them at a
>>>>>>>>>>> few points with what I was doing - the annotations all start with "
>>>>>>>>>>>>>
>>>>>>>>>>>>> ", so they should be easy to spot.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Ironically it looks like a bug in the clvmd-openais code. I can
>>>>>>>>> reproduce
>>>>>>>>> it
>>>>>>>>> on my systems here. I don't see the problem when using the dlm!
>>>>>>>>>
>>>>>>>>> Can you try -Icorosync and see if that helps? In the meantime I'll
>>>>>>>>> have
>>>>>>>>> a
>>>>>>>>> look at the openais bits to try and find out what is wrong.
>>>>>>>>>
>>>>>>>>> Chrissie
>>>>>>>>>
>>>>>>>>
>>>>>>>> I'll see what we can pull together, but the nodes running the clvm
>>>>>>>> cluster are also Xen dom0's. They're currently running on (Ubuntu
>>>>>>>> Hardy's) 2.6.24, so upgrading them to something new enough to support
>>>>>>>> DLM 3 would be...challenging.
>>>>>>>>
>>>>>>>> It would be much, much better for us if we could get clvmd-openais
>>>>>>>> working.
>>>>>>>>
>>>>>>>> Is there any chance this would work better if we dropped back to
>>>>>>>> openais whitetank instead of corosync + openais wilson?
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> OK, I've found the bug and it IS in openais. The attached patch will
>>>>>>> fix
>>>>>>> it.
>>>>>>>
>>>>>>> Chrissie
>>>>>>>
>>>>>>
>>>>>> Awesome. That patch fixed our problem.
>>>>>>
>>>>>> We are running into one other problem - performing LVM operations on
>>>>>> one node is substantially slower than performing them on the other
>>>>>> node:
>>>>>>
>>>>>> root at black-mesa:~# time lvcreate -n test -L 1G xenvg
>>>>>>    Logical volume "test" created
>>>>>>
>>>>>> real    0m0.309s
>>>>>> user    0m0.000s
>>>>>> sys     0m0.008s
>>>>>> root at black-mesa:~# time lvremove -f /dev/xenvg/test
>>>>>>    Logical volume "test" successfully removed
>>>>>>
>>>>>> real    0m0.254s
>>>>>> user    0m0.004s
>>>>>> sys     0m0.008s
>>>>>>
>>>>>>
>>>>>> root at torchwood-institute:~# time lvcreate -n test -L 1G xenvg
>>>>>>    Logical volume "test" created
>>>>>>
>>>>>> real    0m7.282s
>>>>>> user    0m6.396s
>>>>>> sys     0m0.312s
>>>>>> root at torchwood-institute:~# time lvremove -f /dev/xenvg/test
>>>>>>    Logical volume "test" successfully removed
>>>>>>
>>>>>> real    0m7.277s
>>>>>> user    0m6.420s
>>>>>> sys     0m0.292s
>>>>>>
>>>>>> Any idea why this is happening and if there's anything we can do about
>>>>>> it?
>>>>>>
>>>>>
>>>>>
>>>>> I'm not at all sure why that should be happening. I suppose the best
>>>>> thing
>>>>> to do would be to enable clvmd logging (clvmd -d) and see what is taking
>>>>> the
>>>>> time.
>>>>>
>>>>> Chrissie
>>>>>
>>>>
>>>> No problem. I've collected another set of logs - they're in
>>>> <http://web.mit.edu/broder/Public/clvmd-slow/>.
>>>>
>>>> After spinning up corosync and clvmd, the commands I ran were, in order:
>>>>
>>>>    root at black-mesa:~# vgchange -a y xenvg
>>>>      0 logical volume(s) in volume group "xenvg" now active
>>>>    root at black-mesa:~# time lvcreate -n test1 -L 1G xenvg
>>>>      Logical volume "test1" created
>>>>
>>>>    real    0m0.685s
>>>>    user    0m0.004s
>>>>    sys     0m0.000s
>>>>    root at black-mesa:~# time lvremove -f /dev/xenvg/test1
>>>>      Logical volume "test1" successfully removed
>>>>
>>>>    real    0m0.235s
>>>>    user    0m0.004s
>>>>    sys     0m0.004s
>>>>    root at torchwood-institute:~# time lvcreate -n test2 -L 1G xenvg
>>>>      Logical volume "test2" created
>>>>
>>>>    real    0m8.007s
>>>>    user    0m6.396s
>>>>    sys     0m0.312s
>>>>    root at torchwood-institute:~# time lvremove -f /dev/xenvg/test2
>>>>      Logical volume "test2" successfully removed
>>>>
>>>>    real    0m7.364s
>>>>    user    0m6.436s
>>>>    sys     0m0.300s
>>>>    root at black-mesa:~# vgchange -a n xenvg
>>>>      0 logical volume(s) in volume group "xenvg" now active
>>>>
>>>> (black-mesa is node 1, and torchwood-institute is node 2)
>>>>
>>>> Thanks again for your help,
>>>>
>>>
>>> Hiya,
>>>
>>> Oddly I can't find any delays in the clvmd logs at all. There are some
>>> 7-second gaps in the log files but those are between commands coming from
>>> the lvm command-line and not internal to clvmd itself.
>>>
>>> What sort of storage are you using for these LVs? The only thing I can think
>>> of now is that there is some sort of delay in lvm opening the device for
>>> writing as it updates the metadata. An LVM debug log might be helpful here,
>>> though I'm not sure, off-hand, how to put time-stamps on that - I'm not
>>> really an LVM developer any more.
>>>
>>> Chrissie
>>>
>>
>> Well, uh, you see...at the moment I'm using a VM on a personal server
>> running iscsitarget from the other side of MIT's campus :)
>>
>> But that's only because our normal RAID is in the process of being
>> RMA'd. Normally it's a Fibre channel RAID sitting right next to the
>> two servers, and we still had the same timing problems (with the same
>> server being slower) then. The two servers are identical hardware and
>> sitting right next to each other.
>>
>> In any case, thanks for the suggestion to look at LVM itself instead
>> of clvmd - I sort of forgot my basic debugging techniques before
>> asking for help on this one. LVM logs to syslog, and its logging mode
>> includes timestamps.
>>
>> When I cranked up the debug level, I noticed that the lvcreates and
>> lvremoves were hanging on "Archiving volume group "xenvg" metadata
>> (seqno 48).". If I turn off the metadata archives in lvm.conf on
>> torchwood-institute, its performance goes back to what I'd expect, so
>> I guess it's having some sort of disk problems.
>>
>> Thanks again for all of our help,
>>   - Evan
>>
>
> Actually, the real problem turned out to be much more straightforward:
>
> root at torchwood-institute:~# ls -a /etc/lvm/archive | wc -l
> 301369
>
> For some reason, torchwood-institute had built up a massive archive of
> files named like
> /etc/lvm/archive/.lvm_torchwood-institute.mit.edu_17470_993238909.
> Listing them all took...about 7 seconds.
>
> Deleting them all put performance back where it should be.

:-)

Sometimes the simpler thing actually is the problem!

clvmd is usually the first place I do look for a slowdown as it's 
complicated and does a lot of locking. It's good that you found the 
problem elsewhere.

Thanks for letting me know.

Chrissie



From mammadshah at hotmail.com  Mon Jan 25 13:33:26 2010
From: mammadshah at hotmail.com (Muhammad Ammad Shah)
Date: Mon, 25 Jan 2010 18:33:26 +0500
Subject: [Linux-cluster] GFS continue to reboot nodes
In-Reply-To: <82d055df1001250520k7c6651f7w3191245e1267c8a2@mail.gmail.com>
References: <82d055df1001231015na727d56t8bd26d22fb2cd82e@mail.gmail.com>,
	<BLU110-W1EA55655B09DEA6847282D4600@phx.gbl>,
	<82d055df1001240035h370f4de8t34a02b5499006e00@mail.gmail.com>,
	<BLU110-W1CBDBB93586064A976323D45F0@phx.gbl>,
	<82d055df1001250520k7c6651f7w3191245e1267c8a2@mail.gmail.com>
Message-ID: <BLU110-W4F75074D8B00EB1FCF85DD45F0@phx.gbl>



Dear Rajat,

According to Red Hat GFS, journals are equal to number of nodes in cluster, and according to my understanding it cant be increased later on if i want to add more nodes in cluster, am i right ?

you set it to 3, 2 for cluster nodes and 1 for quorum ? kindly let me know if  i am wrong.



 
Thanks,
Muhammad Ammad Shah
 




Date: Mon, 25 Jan 2010 18:50:29 +0530
Subject: Re: GFS continue to reboot nodes
From: rajatjpatel at gmail.com
To: mammadshah at hotmail.com
CC: linux-cluster at redhat.com

root#mkfs -t gfs2 -p lock_dlm -t db_clust:db_store -j 4 /dev/vg1_gfs/db_store

root#mkfs -t gfs2 -p lock_dlm -t db_clust:db_store -j 3 /dev/vg1_gfs/db_store
Regards,

Rajat J Patel
D 803 Royal Classic
Link Road
Andheri West
Mumbai 53
+919920121211
www.taashee.com

FIRST THEY IGNORE YOU...
THEN THEY LAUGH AT YOU...

THEN THEY FIGHT YOU...
THEN YOU WIN...



On Mon, Jan 25, 2010 at 6:24 PM, Muhammad Ammad Shah <mammadshah at hotmail.com> wrote:







Dear Rajat,

HI,



I have configured two node cluster and its working fine for SAN (ext3 file system). after this i configured GFS using following.



root# pvcreate  /dev/sdb

root#vgcreate -c y vg1_gfs /dev/sdc1

root#lvcreate -n db_store -l 100%FREE vg1_gfs

root#/etc/init.d/clvmd start



 Started on both nodes.



root#mkfs -t gfs2 -p lock_dlm -t db_clust:db_store -j 4 /dev/vg1_gfs/db_store

root# service gfs start



root#chkconfig --level 345 clvmd on 

root#chkconfig --level 345 gfs on



----------------

the problem is, as i changed File system (ex3) resource to GFS Resource.



nodes are rebooting..



there is nothing in /var/log/messages. but when i checked console of the node there was some message related to GFS. 

DLM id:0 ...



so i removed GFS and switched back to File system(ext3) resource. 



can i install oracle on  Resource File system(ext3) ?



or how to troubleshoot GFS reboot..

need help,		
			 
			  
				
			



 
Thanks,
Muhammad Ammad Shah
 

 		 	   		  
Windows Live Hotmail:  Your friends can get your Facebook updates, right from Hotmail?.


 		 	   		  
_________________________________________________________________
Windows Live Hotmail: Your friends can get your Facebook updates, right from Hotmail?.
http://www.microsoft.com/middleeast/windows/windowslive/see-it-in-action/social-network-basics.aspx?ocid=PID23461::T:WLMTAGL:ON:WL:en-xm:SI_SB_4:092009
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100125/50ce52d6/attachment.htm>

From rajatjpatel at gmail.com  Mon Jan 25 13:34:57 2010
From: rajatjpatel at gmail.com (rajatjpatel)
Date: Mon, 25 Jan 2010 19:04:57 +0530
Subject: [Linux-cluster] GFS continue to reboot nodes
In-Reply-To: <BLU110-W4F75074D8B00EB1FCF85DD45F0@phx.gbl>
References: <82d055df1001231015na727d56t8bd26d22fb2cd82e@mail.gmail.com>
	<BLU110-W1EA55655B09DEA6847282D4600@phx.gbl>
	<82d055df1001240035h370f4de8t34a02b5499006e00@mail.gmail.com>
	<BLU110-W1CBDBB93586064A976323D45F0@phx.gbl>
	<82d055df1001250520k7c6651f7w3191245e1267c8a2@mail.gmail.com>
	<BLU110-W4F75074D8B00EB1FCF85DD45F0@phx.gbl>
Message-ID: <82d055df1001250534u44acd7cep83eeb156dce83102@mail.gmail.com>

can remove qdisk and try it out it will work
Regards,

Rajat J Patel
D 803 Royal Classic
Link Road
Andheri West
Mumbai 53
+919920121211
www.taashee.com

FIRST THEY IGNORE YOU...
THEN THEY LAUGH AT YOU...
THEN THEY FIGHT YOU...
THEN YOU WIN...


On Mon, Jan 25, 2010 at 7:03 PM, Muhammad Ammad Shah <mammadshah at hotmail.com
> wrote:

>
> Dear Rajat,
>
> According to Red Hat GFS, journals are equal to number of nodes in cluster,
> and according to my understanding it cant be increased later on if i want to
> add more nodes in cluster, am i right ?
>
> you set it to 3, 2 for cluster nodes and 1 for quorum ? kindly let me know
> if  i am wrong.
>
>
>
>
> Thanks,
> Muhammad Ammad Shah
>
>
>
>
>
> ------------------------------
> Date: Mon, 25 Jan 2010 18:50:29 +0530
> Subject: Re: GFS continue to reboot nodes
> From: rajatjpatel at gmail.com
> To: mammadshah at hotmail.com
> CC: linux-cluster at redhat.com
>
>
> root#mkfs -t gfs2 -p lock_dlm -t db_clust:db_store -j 4/dev/vg1_gfs/db_store
>
> root#mkfs -t gfs2 -p lock_dlm -t db_clust:db_store -j 3
> /dev/vg1_gfs/db_store
> Regards,
>
> Rajat J Patel
> D 803 Royal Classic
> Link Road
> Andheri West
> Mumbai 53
> +919920121211
> www.taashee.com
>
> FIRST THEY IGNORE YOU...
> THEN THEY LAUGH AT YOU...
> THEN THEY FIGHT YOU...
> THEN YOU WIN...
>
>
> On Mon, Jan 25, 2010 at 6:24 PM, Muhammad Ammad Shah <
> mammadshah at hotmail.com> wrote:
>
>
> Dear Rajat,
>
> HI,
>
> I have configured two node cluster and its working fine for SAN (ext3 file
> system). after this i configured GFS using following.
>
> root# pvcreate /dev/sdb
> root#vgcreate -c y vg1_gfs /dev/sdc1
> root#lvcreate -n db_store -l 100%FREE vg1_gfs
> root#/etc/init.d/clvmd start
>
> Started on both nodes.
>
> root#mkfs -t gfs2 -p lock_dlm -t db_clust:db_store -j 4
> /dev/vg1_gfs/db_store
> root# service gfs start
>
> root#chkconfig --level 345 clvmd on
> root#chkconfig --level 345 gfs on
>
> ----------------
> the problem is, as i changed File system (ex3) resource to GFS Resource.
>
> nodes are rebooting..
>
> there is nothing in /var/log/messages. but when i checked console of the
> node there was some message related to GFS.
> DLM id:0 ...
>
> so i removed GFS and switched back to File system(ext3) resource.
>
> can i install oracle on Resource File system(ext3) ?
>
> or how to troubleshoot GFS reboot..
> need help,
>
>
>
>
> Thanks,
> Muhammad Ammad Shah
>
>
>
> ------------------------------
> Windows Live Hotmail: Your friends can get your Facebook updates, right
> from Hotmail?.<http://www.microsoft.com/middleeast/windows/windowslive/see-it-in-action/social-network-basics.aspx?ocid=PID23461::T:WLMTAGL:ON:WL:en-xm:SI_SB_4:092009>
>
>
>
> ------------------------------
> Windows Live Hotmail: Your friends can get your Facebook updates, right
> from Hotmail?.<http://www.microsoft.com/middleeast/windows/windowslive/see-it-in-action/social-network-basics.aspx?ocid=PID23461::T:WLMTAGL:ON:WL:en-xm:SI_SB_4:092009>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100125/eb0e89f3/attachment.htm>

From mammadshah at hotmail.com  Mon Jan 25 13:42:47 2010
From: mammadshah at hotmail.com (Muhammad Ammad Shah)
Date: Mon, 25 Jan 2010 18:42:47 +0500
Subject: [Linux-cluster] GFS continue to reboot nodes
In-Reply-To: <82d055df1001250534u44acd7cep83eeb156dce83102@mail.gmail.com>
References: <82d055df1001231015na727d56t8bd26d22fb2cd82e@mail.gmail.com>,
	<BLU110-W1EA55655B09DEA6847282D4600@phx.gbl>,
	<82d055df1001240035h370f4de8t34a02b5499006e00@mail.gmail.com>,
	<BLU110-W1CBDBB93586064A976323D45F0@phx.gbl>,
	<82d055df1001250520k7c6651f7w3191245e1267c8a2@mail.gmail.com>,
	<BLU110-W4F75074D8B00EB1FCF85DD45F0@phx.gbl>,
	<82d055df1001250534u44acd7cep83eeb156dce83102@mail.gmail.com>
Message-ID: <BLU110-W182534D2FFA5F53DD13698D45F0@phx.gbl>


can you explain this (remove qdisk), do i use 3 journals  or  8 journals. 

really gr8 help by you. 


 
Thanks,
Muhammad Ammad Shah
 




Date: Mon, 25 Jan 2010 19:04:57 +0530
Subject: Re: GFS continue to reboot nodes
From: rajatjpatel at gmail.com
To: mammadshah at hotmail.com
CC: linux-cluster at redhat.com

can remove qdisk and try it out it will workRegards,

Rajat J Patel
D 803 Royal Classic
Link Road
Andheri West
Mumbai 53
+919920121211
www.taashee.com


FIRST THEY IGNORE YOU...
THEN THEY LAUGH AT YOU...
THEN THEY FIGHT YOU...
THEN YOU WIN...



On Mon, Jan 25, 2010 at 7:03 PM, Muhammad Ammad Shah <mammadshah at hotmail.com> wrote:







Dear Rajat,

According to Red Hat GFS, journals are equal to number of nodes in cluster, and according to my understanding it cant be increased later on if i want to add more nodes in cluster, am i right ?


you set it to 3, 2 for cluster nodes and 1 for quorum ? kindly let me know if  i am wrong.



 
Thanks,
Muhammad Ammad Shah
 




Date: Mon, 25 Jan 2010 18:50:29 +0530

Subject: Re: GFS continue to reboot nodes
From: rajatjpatel at gmail.com
To: mammadshah at hotmail.com

CC: linux-cluster at redhat.com

root#mkfs -t gfs2 -p lock_dlm -t db_clust:db_store -j 4 /dev/vg1_gfs/db_store


root#mkfs -t gfs2 -p lock_dlm -t db_clust:db_store -j 3 /dev/vg1_gfs/db_store
Regards,

Rajat J Patel
D 803 Royal Classic
Link Road
Andheri West
Mumbai 53
+919920121211
www.taashee.com

FIRST THEY IGNORE YOU...
THEN THEY LAUGH AT YOU...


THEN THEY FIGHT YOU...
THEN YOU WIN...



On Mon, Jan 25, 2010 at 6:24 PM, Muhammad Ammad Shah <mammadshah at hotmail.com> wrote:








Dear Rajat,

HI,



I have configured two node cluster and its working fine for SAN (ext3 file system). after this i configured GFS using following.



root# pvcreate  /dev/sdb

root#vgcreate -c y vg1_gfs /dev/sdc1

root#lvcreate -n db_store -l 100%FREE vg1_gfs

root#/etc/init.d/clvmd start



 Started on both nodes.



root#mkfs -t gfs2 -p lock_dlm -t db_clust:db_store -j 4 /dev/vg1_gfs/db_store

root# service gfs start



root#chkconfig --level 345 clvmd on 

root#chkconfig --level 345 gfs on



----------------

the problem is, as i changed File system (ex3) resource to GFS Resource.



nodes are rebooting..



there is nothing in /var/log/messages. but when i checked console of the node there was some message related to GFS. 

DLM id:0 ...



so i removed GFS and switched back to File system(ext3) resource. 



can i install oracle on  Resource File system(ext3) ?



or how to troubleshoot GFS reboot..

need help,		
			 
			  
				
			



 
Thanks,
Muhammad Ammad Shah
 

 		 	   		  
Windows Live Hotmail:  Your friends can get your Facebook updates, right from Hotmail?.



 		 	   		  
Windows Live Hotmail:  Your friends can get your Facebook updates, right from Hotmail?.


 		 	   		  
_________________________________________________________________
Windows Live: Friends get your Flickr, Yelp, and Digg updates when they e-mail you.
http://www.microsoft.com/middleeast/windows/windowslive/see-it-in-action/social-network-basics.aspx?ocid=PID23461::T:WLMTAGL:ON:WL:en-xm:SI_SB_3:092010
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100125/76718b22/attachment.htm>

From rpeterso at redhat.com  Mon Jan 25 13:56:40 2010
From: rpeterso at redhat.com (Bob Peterson)
Date: Mon, 25 Jan 2010 08:56:40 -0500 (EST)
Subject: [Linux-cluster] GFS continue to reboot nodes
In-Reply-To: <BLU110-W1CBDBB93586064A976323D45F0@phx.gbl>
Message-ID: <262964534.90191264427800717.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com>

----- "Muhammad Ammad Shah" <mammadshah at hotmail.com> wrote:
| there is nothing in /var/log/messages. but when i checked console of
| the node there was some message related to GFS.
| DLM id:0 ...
| 
| so i removed GFS and switched back to File system(ext3) resource.
| 
| can i install oracle on Resource File system(ext3) ?
| 
| or how to troubleshoot GFS reboot..
| need help,

Hi Muhammad,

To answer a few questions on this thread:
The number of journals should not cause GFS any problems.  It is
perfectly okay to create a number of journals greater or equal to
the number of nodes.  So 3 and 4 are both okay.

Second, if you created the volume group without clvmd running, you
may have created it without the clustered bit set. You may need to
do this command:  vgchange -cy /dev/your_vg

Third, GFS does not reboot the system.  The system is most likely
being fenced due to a problem discovered by the cluster infrastructure.
That problem might be GFS's fault, but we have no way to know.

Fourth, if your file system is shared between nodes, you probably
want to use a shared file system.  Using ext3 will only work properly
if the file system is only mounted by a single computer.  If it's
mounted by more than one computer, use GFS or GFS2.

Fifth, to debug the problem please send the console messages to
this mailing list so we can tell more about the symptoms.

I hope this helps.

Regards,

Bob Peterson
Red Hat File Systems



From jose.neto at liber4e.com  Mon Jan 25 15:59:16 2010
From: jose.neto at liber4e.com (jose nuno neto)
Date: Mon, 25 Jan 2010 15:59:16 -0000 (GMT)
Subject: [Linux-cluster] Included Service Scripts Stop error
In-Reply-To: <56b46216bfab23c2a48e7756f995e6fb.squirrel@fela.liber4e.com>
References: <cccb38f22fd2bc70945da5b1281ea934.squirrel@fela.liber4e.com>
	<4B5461C2.50804@redhat.com>
	<56b46216bfab23c2a48e7756f995e6fb.squirrel@fela.liber4e.com>
Message-ID: <5c7b98ed19fe3eebfea9c371aa3d970b.squirrel@fela.liber4e.com>

Hi,

I'm building a cluster with
cman-2.0.115-1.el5_4.2
rgmanager-2.0.52-1.el5


Right now I'm testing services start/stop but can't get a clean stop from
the included scripts in /usr/share/cluster
It starts ok, but stop no, althouth the app is stoped
This happens either with apache or mysqld scripts
Any extra setting needed? timing tunings?

Comands runned:

rg_test test /etc/cluster/cluster.conf start service MySQL
Running in test mode.
Starting MySQL...
<debug>  Link for bond0: Detected
<info>   Adding IPv4 address 172.26.18.7/20 to bond0
<debug>  Sending gratuitous ARP: 172.26.18.7 00:14:4f:ca:83:4c brd
ff:ff:ff:ff:ff:ff
<debug>  Verifying Configuration Of mysql:MySQLTest
<debug>  Verifying Configuration Of mysql:MySQLTest > Succeed
<info>   Starting Service mysql:MySQLTest
<debug>  Looking For IP Address > Succeed -  IP Address Found
<debug>  Starting Service mysql:MySQLTest > Succeed
Start of MySQL complete


rg_test test /etc/cluster/cluster.conf stop service MySQL
Running in test mode.
Stopping MySQL...
<debug>  Verifying Configuration Of mysql:MySQLTest
<debug>  Verifying Configuration Of mysql:MySQLTest > Succeed
<info>   Stopping Service mysql:MySQLTest
<error>  Stopping Service mysql:MySQLTest > Failed - Application Is Still
Running
<error>  Stopping Service mysql:MySQLTest > Failed
<info>   Removing IPv4 address 172.26.18.7/20 from bond0

rg_test test /etc/cluster/cluster.conf stop service MySQL
Running in test mode.
Stopping MySQL...
<debug>  Verifying Configuration Of mysql:MySQLTest
<debug>  Verifying Configuration Of mysql:MySQLTest > Succeed
<info>   Stopping Service mysql:MySQLTest
<error>  Checking Existence Of File
/var/run/cluster/mysql/mysql:MySQLTest.pid [mysql:MySQLTest] > Failed -
File Doesn't Exist
<info>   Stopping Service mysql:MySQLTest > Succeed
<debug>  172.26.18.7 is not configured
Stop of MySQL complete


-- 
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.



From alex.urbanowicz at gmail.com  Mon Jan 25 17:48:45 2010
From: alex.urbanowicz at gmail.com (Alex Urbanowicz)
Date: Mon, 25 Jan 2010 18:48:45 +0100
Subject: [Linux-cluster] GFS problem
Message-ID: <9e4c4ee71001250948i433feb14v5e6223c9caa375a3@mail.gmail.com>

Hello

I have a problem with shared GFS resource on a 12-node Cluster Manager
cluster.

The cluster starts up properly if all nodes are booted at once. Any major
interaction with one of the nodes (reboot, cman restart) causes the GFS to
lock out the GFS, and for the cluster to fal into some unstable split state.

In this state, logs, clustat and "cman_tool status" report the cluster as
fully connected and working, while "cman_tool resources" reports only the
fence resource in JOIN_START_WAIT (or JOIN_STOP WAIT, depending on what was
done to the cluster in the meantime) state with overlapping but different
node sets, depending on the node I run the "cman_tool resources" command.

So far, the only functioning method to get the cluster out of the state is
to manually reboot all the nodes at once, but this is unfeasible due to
uptime expectations and high load carried by the cluster.

We're completely in the dark about the possible cause of the problem, any
help is appreciated.

TIA

Alex
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100125/53aa33ab/attachment.htm>

From yvette at dbtgroup.com  Mon Jan 25 19:58:08 2010
From: yvette at dbtgroup.com (yvette hirth)
Date: Mon, 25 Jan 2010 19:58:08 +0000
Subject: [Linux-cluster] newbie
Message-ID: <4B5DF7D0.8030403@dbtgroup.com>

hi,

i'm new at clustered file systems and i have a "should i?"-type question.

i have three boxes - a file server (FOO), and two workstations (DT0 and 
DT1).  they all run ubuntu (more on why below).

i'm working now to get SCST running with my qla2xxx fcp hba's.

i have a raid array that presents four arrays, each with one LUN, to 
FOO, and two LUNS for two of those arrays are presented to desktops as 
well - one to DT0 and one to DT1.  when the desktops use "their two 
arrays", FOO has them unmounted, and vice-versa.  it works, i get fast 
dasd access, but of course it's unworkable over the long run; it's a 
stop-gap until SCST is in.

once SCST is running, then i plan on having FOO present one LUN each for 
three raid arrays to the two desktops.  array0 will be shared by FOO and 
DT0; array1 will be shared by FOO and DT1; and array2 will be shared by 
FOO, DT0 and DT1.

the scenario for my question is this:

1) SCST is in and running
2) all four of the raid arrays will be available and mounted on FOO
3) three of the four raid arrays will be presented to more than one host
    at a time
4) all access to the arrays will be controlled by FOO
5) FOO will see all modified buffers / data going to-from the three
    shared raid arrays.  (in effect, this is "DAS through a fileserver",
    not a "SAN Fabric".)

so here's my question:

will a "non-clustered" filesystem like ext3 / xfs be sufficient?  i 
can't use zfs because this is FCP SCSI not iSCSI.

i've looked at OPEN-E, OpenFiler (which has a 64-bit version that 
doesn't like HP DL380 G6's), DataCore and tons more and all fall short 
or failed the install.

initially i tried RHEL 5.4 (which is how i got on this mailing list) and 
SLES 11 and i'm farther along with ubuntu than with either of those. 
i'm not saying anything bad about RHEL/SLES; it's just that the 
"learning how to setup a cluster" curve for a debian person is 
non-trivial.  i still have the SLES 11 config on a bootable partition in 
case i can never get SCST working with qlogic HBA's on ubuntu and thus 
ubuntu doesn't work out.

i somewhat grok NFS locking, but when it comes to FCP SCSI locking i'm 
lost.  do i need GFS/2 or will ext3 on a file server be sufficient?

we could certainly go iSCSI (which seems to be a trend), but we've 
invested some non-trivial budget in 8gbps FCP and don't feel it's 
antiquidated just yet.

if anyone could share their advice, i'd appreciate it.

thanks!
yvette hirth



From jpalmae at gmail.com  Mon Jan 25 21:43:58 2010
From: jpalmae at gmail.com (Jorge Palma)
Date: Mon, 25 Jan 2010 18:43:58 -0300
Subject: [Linux-cluster] GFS problem
In-Reply-To: <9e4c4ee71001250948i433feb14v5e6223c9caa375a3@mail.gmail.com>
References: <9e4c4ee71001250948i433feb14v5e6223c9caa375a3@mail.gmail.com>
Message-ID: <5b65f1b11001251343p659d3b96gf07dd2165adf521e@mail.gmail.com>

Please send your fence configuration and cluster.conf

Regards


2010/1/25, Alex Urbanowicz <alex.urbanowicz at gmail.com>:
> Hello
>
> I have a problem with shared GFS resource on a 12-node Cluster Manager
> cluster.
>
> The cluster starts up properly if all nodes are booted at once. Any major
> interaction with one of the nodes (reboot, cman restart) causes the GFS to
> lock out the GFS, and for the cluster to fal into some unstable split state.
>
> In this state, logs, clustat and "cman_tool status" report the cluster as
> fully connected and working, while "cman_tool resources" reports only the
> fence resource in JOIN_START_WAIT (or JOIN_STOP WAIT, depending on what was
> done to the cluster in the meantime) state with overlapping but different
> node sets, depending on the node I run the "cman_tool resources" command.
>
> So far, the only functioning method to get the cluster out of the state is
> to manually reboot all the nodes at once, but this is unfeasible due to
> uptime expectations and high load carried by the cluster.
>
> We're completely in the dark about the possible cause of the problem, any
> help is appreciated.
>
> TIA
>
> Alex
>


-- 
Jorge Palma Escobar
Ingeniero de Sistemas
Red Hat Linux Certified Engineer
Certificate N? 804005089418233



From broder at mit.edu  Tue Jan 26 01:56:24 2010
From: broder at mit.edu (Evan Broder)
Date: Mon, 25 Jan 2010 20:56:24 -0500
Subject: [Linux-cluster] Deadlock when using clvmd + OpenAIS + Corosync
In-Reply-To: <4B5D9D00.8020504@redhat.com>
References: <178868c41001081458h3c0b70denef831712883a6fec@mail.gmail.com>
	<178868c41001120821g6885c03cr75a3d42eb0196428@mail.gmail.com>
	<4B4D9982.6060505@redhat.com>
	<178868c41001210717u82779edp128006b7fc8e67e8@mail.gmail.com>
	<4B598F07.1070002@redhat.com>
	<178868c41001230935v752a811clbee3a679d13c8f0b@mail.gmail.com>
	<4B5D7325.3000605@redhat.com>
	<178868c41001250513o39d9e7d3i7f9ace8b0dc2e070@mail.gmail.com>
	<178868c41001250528h7fe51965p89990eb2f5ffd542@mail.gmail.com>
	<4B5D9D00.8020504@redhat.com>
Message-ID: <178868c41001251756l525fb806n2d6dc03ea4dfba5b@mail.gmail.com>

On Mon, Jan 25, 2010 at 8:30 AM, Christine Caulfield
<ccaulfie at redhat.com> wrote:
> On 25/01/10 13:28, Evan Broder wrote:
>>
>> On Mon, Jan 25, 2010 at 8:13 AM, Evan Broder<broder at mit.edu> ?wrote:
>>>
>>> On Mon, Jan 25, 2010 at 5:32 AM, Christine Caulfield
>>> <ccaulfie at redhat.com> ?wrote:
>>>>
>>>> On 23/01/10 17:35, Evan Broder wrote:
>>>>>
>>>>> On Fri, Jan 22, 2010 at 6:41 AM, Christine Caulfield
>>>>> <ccaulfie at redhat.com> ? ?wrote:
>>>>>>
>>>>>> On 21/01/10 15:17, Evan Broder wrote:
>>>>>>>
>>>>>>> On Wed, Jan 13, 2010 at 4:59 AM, Christine Caulfield
>>>>>>> <ccaulfie at redhat.com> ? ? ?wrote:
>>>>>>>>
>>>>>>>> On 12/01/10 16:21, Evan Broder wrote:
>>>>>>>>>
>>>>>>>>> On Tue, Jan 12, 2010 at 3:54 AM, Christine Caulfield
>>>>>>>>> <ccaulfie at redhat.com> ? ? ? ?wrote:
>>>>>>>>>>
>>>>>>>>>> On 11/01/10 09:38, Christine Caulfield wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 11/01/10 09:32, Evan Broder wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jan 11, 2010 at 4:03 AM, Christine Caulfield
>>>>>>>>>>>> <ccaulfie at redhat.com> ? ? ? ?wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 08/01/10 22:58, Evan Broder wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [please preserve the CC when replying, thanks]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi -
>>>>>>>>>>>>>> We're attempting to setup a clvm (2.02.56) cluster using
>>>>>>>>>>>>>> OpenAIS
>>>>>>>>>>>>>> (1.1.1) and Corosync (1.1.2). We've gotten bitten hard in the
>>>>>>>>>>>>>> past
>>>>>>>>>>>>>> by
>>>>>>>>>>>>>> crashes leaving DLM state around and forcing us to reboot our
>>>>>>>>>>>>>> nodes,
>>>>>>>>>>>>>> so we're specifically looking for a solution that doesn't
>>>>>>>>>>>>>> involve
>>>>>>>>>>>>>> in-kernel locking.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We're also running the Pacemaker OpenAIS service, as we're
>>>>>>>>>>>>>> hoping
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>> use it for management of some other resources going forward.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We've managed to form the OpenAIS cluster, and get clvmd
>>>>>>>>>>>>>> running
>>>>>>>>>>>>>> on
>>>>>>>>>>>>>> both of our nodes. Operations using LVM succeed, so long as
>>>>>>>>>>>>>> only
>>>>>>>>>>>>>> one
>>>>>>>>>>>>>> operation runs at a time. However, if we attempt to run two
>>>>>>>>>>>>>> operations
>>>>>>>>>>>>>> (say, one lvcreate on each host) at a time, they both hang,
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>> both
>>>>>>>>>>>>>> clvmd processes appear to deadlock.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When they deadlock, it doesn't appear to affect the other
>>>>>>>>>>>>>> clustering
>>>>>>>>>>>>>> processes - both corosync and pacemaker still report a fully
>>>>>>>>>>>>>> formed
>>>>>>>>>>>>>> cluster, so it seems the issue is localized to clvmd.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've looked at logs from corosync and pacemaker, and I've
>>>>>>>>>>>>>> straced
>>>>>>>>>>>>>> various processes, but I don't want to blast a bunch of
>>>>>>>>>>>>>> useless
>>>>>>>>>>>>>> information at the list. What information can I provide to
>>>>>>>>>>>>>> make
>>>>>>>>>>>>>> it
>>>>>>>>>>>>>> easier to debug and fix this deadlock?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> To start with, the best logging to produce is the clvmd logs
>>>>>>>>>>>>> which
>>>>>>>>>>>>> can be
>>>>>>>>>>>>> got with clvmd -d (see the man page for details). Ideally these
>>>>>>>>>>>>> should be
>>>>>>>>>>>>> from all nodes in the cluster so they can be correlated. If
>>>>>>>>>>>>> you're
>>>>>>>>>>>>> still
>>>>>>>>>>>>> using DLM then a dlm lock dump from all nodes is often helpful
>>>>>>>>>>>>> in
>>>>>>>>>>>>> conjunction with the clvmd logs.
>>>>>>>>>>>>
>>>>>>>>>>>> Sure, no problem. I've posted the logs from clvmd on both
>>>>>>>>>>>> processes
>>>>>>>>>>>> in
>>>>>>>>>>>> <http://web.mit.edu/broder/Public/clvmd/>. I've annotated them
>>>>>>>>>>>> at a
>>>>>>>>>>>> few points with what I was doing - the annotations all start
>>>>>>>>>>>> with "
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ", so they should be easy to spot.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Ironically it looks like a bug in the clvmd-openais code. I can
>>>>>>>>>> reproduce
>>>>>>>>>> it
>>>>>>>>>> on my systems here. I don't see the problem when using the dlm!
>>>>>>>>>>
>>>>>>>>>> Can you try -Icorosync and see if that helps? In the meantime I'll
>>>>>>>>>> have
>>>>>>>>>> a
>>>>>>>>>> look at the openais bits to try and find out what is wrong.
>>>>>>>>>>
>>>>>>>>>> Chrissie
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'll see what we can pull together, but the nodes running the clvm
>>>>>>>>> cluster are also Xen dom0's. They're currently running on (Ubuntu
>>>>>>>>> Hardy's) 2.6.24, so upgrading them to something new enough to
>>>>>>>>> support
>>>>>>>>> DLM 3 would be...challenging.
>>>>>>>>>
>>>>>>>>> It would be much, much better for us if we could get clvmd-openais
>>>>>>>>> working.
>>>>>>>>>
>>>>>>>>> Is there any chance this would work better if we dropped back to
>>>>>>>>> openais whitetank instead of corosync + openais wilson?
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> OK, I've found the bug and it IS in openais. The attached patch will
>>>>>>>> fix
>>>>>>>> it.
>>>>>>>>
>>>>>>>> Chrissie
>>>>>>>>
>>>>>>>
>>>>>>> Awesome. That patch fixed our problem.
>>>>>>>
>>>>>>> We are running into one other problem - performing LVM operations on
>>>>>>> one node is substantially slower than performing them on the other
>>>>>>> node:
>>>>>>>
>>>>>>> root at black-mesa:~# time lvcreate -n test -L 1G xenvg
>>>>>>> ? Logical volume "test" created
>>>>>>>
>>>>>>> real ? ?0m0.309s
>>>>>>> user ? ?0m0.000s
>>>>>>> sys ? ? 0m0.008s
>>>>>>> root at black-mesa:~# time lvremove -f /dev/xenvg/test
>>>>>>> ? Logical volume "test" successfully removed
>>>>>>>
>>>>>>> real ? ?0m0.254s
>>>>>>> user ? ?0m0.004s
>>>>>>> sys ? ? 0m0.008s
>>>>>>>
>>>>>>>
>>>>>>> root at torchwood-institute:~# time lvcreate -n test -L 1G xenvg
>>>>>>> ? Logical volume "test" created
>>>>>>>
>>>>>>> real ? ?0m7.282s
>>>>>>> user ? ?0m6.396s
>>>>>>> sys ? ? 0m0.312s
>>>>>>> root at torchwood-institute:~# time lvremove -f /dev/xenvg/test
>>>>>>> ? Logical volume "test" successfully removed
>>>>>>>
>>>>>>> real ? ?0m7.277s
>>>>>>> user ? ?0m6.420s
>>>>>>> sys ? ? 0m0.292s
>>>>>>>
>>>>>>> Any idea why this is happening and if there's anything we can do
>>>>>>> about
>>>>>>> it?
>>>>>>>
>>>>>>
>>>>>>
>>>>>> I'm not at all sure why that should be happening. I suppose the best
>>>>>> thing
>>>>>> to do would be to enable clvmd logging (clvmd -d) and see what is
>>>>>> taking
>>>>>> the
>>>>>> time.
>>>>>>
>>>>>> Chrissie
>>>>>>
>>>>>
>>>>> No problem. I've collected another set of logs - they're in
>>>>> <http://web.mit.edu/broder/Public/clvmd-slow/>.
>>>>>
>>>>> After spinning up corosync and clvmd, the commands I ran were, in
>>>>> order:
>>>>>
>>>>> ? root at black-mesa:~# vgchange -a y xenvg
>>>>> ? ? 0 logical volume(s) in volume group "xenvg" now active
>>>>> ? root at black-mesa:~# time lvcreate -n test1 -L 1G xenvg
>>>>> ? ? Logical volume "test1" created
>>>>>
>>>>> ? real ? ?0m0.685s
>>>>> ? user ? ?0m0.004s
>>>>> ? sys ? ? 0m0.000s
>>>>> ? root at black-mesa:~# time lvremove -f /dev/xenvg/test1
>>>>> ? ? Logical volume "test1" successfully removed
>>>>>
>>>>> ? real ? ?0m0.235s
>>>>> ? user ? ?0m0.004s
>>>>> ? sys ? ? 0m0.004s
>>>>> ? root at torchwood-institute:~# time lvcreate -n test2 -L 1G xenvg
>>>>> ? ? Logical volume "test2" created
>>>>>
>>>>> ? real ? ?0m8.007s
>>>>> ? user ? ?0m6.396s
>>>>> ? sys ? ? 0m0.312s
>>>>> ? root at torchwood-institute:~# time lvremove -f /dev/xenvg/test2
>>>>> ? ? Logical volume "test2" successfully removed
>>>>>
>>>>> ? real ? ?0m7.364s
>>>>> ? user ? ?0m6.436s
>>>>> ? sys ? ? 0m0.300s
>>>>> ? root at black-mesa:~# vgchange -a n xenvg
>>>>> ? ? 0 logical volume(s) in volume group "xenvg" now active
>>>>>
>>>>> (black-mesa is node 1, and torchwood-institute is node 2)
>>>>>
>>>>> Thanks again for your help,
>>>>>
>>>>
>>>> Hiya,
>>>>
>>>> Oddly I can't find any delays in the clvmd logs at all. There are some
>>>> 7-second gaps in the log files but those are between commands coming
>>>> from
>>>> the lvm command-line and not internal to clvmd itself.
>>>>
>>>> What sort of storage are you using for these LVs? The only thing I can
>>>> think
>>>> of now is that there is some sort of delay in lvm opening the device for
>>>> writing as it updates the metadata. An LVM debug log might be helpful
>>>> here,
>>>> though I'm not sure, off-hand, how to put time-stamps on that - I'm not
>>>> really an LVM developer any more.
>>>>
>>>> Chrissie
>>>>
>>>
>>> Well, uh, you see...at the moment I'm using a VM on a personal server
>>> running iscsitarget from the other side of MIT's campus :)
>>>
>>> But that's only because our normal RAID is in the process of being
>>> RMA'd. Normally it's a Fibre channel RAID sitting right next to the
>>> two servers, and we still had the same timing problems (with the same
>>> server being slower) then. The two servers are identical hardware and
>>> sitting right next to each other.
>>>
>>> In any case, thanks for the suggestion to look at LVM itself instead
>>> of clvmd - I sort of forgot my basic debugging techniques before
>>> asking for help on this one. LVM logs to syslog, and its logging mode
>>> includes timestamps.
>>>
>>> When I cranked up the debug level, I noticed that the lvcreates and
>>> lvremoves were hanging on "Archiving volume group "xenvg" metadata
>>> (seqno 48).". If I turn off the metadata archives in lvm.conf on
>>> torchwood-institute, its performance goes back to what I'd expect, so
>>> I guess it's having some sort of disk problems.
>>>
>>> Thanks again for all of our help,
>>> ?- Evan
>>>
>>
>> Actually, the real problem turned out to be much more straightforward:
>>
>> root at torchwood-institute:~# ls -a /etc/lvm/archive | wc -l
>> 301369
>>
>> For some reason, torchwood-institute had built up a massive archive of
>> files named like
>> /etc/lvm/archive/.lvm_torchwood-institute.mit.edu_17470_993238909.
>> Listing them all took...about 7 seconds.
>>
>> Deleting them all put performance back where it should be.
>
> :-)
>
> Sometimes the simpler thing actually is the problem!
>
> clvmd is usually the first place I do look for a slowdown as it's
> complicated and does a lot of locking. It's good that you found the problem
> elsewhere.
>
> Thanks for letting me know.
>
> Chrissie
>

Hi again!

Sorry to bother everyone again, but deadlocks on simultaneous LVM
operations seem to have resurfaced somehow. As far as I know, we
haven't made any changes, other than restarting the cluster a few
times, but if I run

root at black-mesa:~# lvcreate -n test1 -L 1G xenvg & lvcreate -n test2 -L 1G xenvg

it seems to consistently deadlock.

I should have logs of everything in
<http://web.mit.edu/broder/Public/clvmd/node-1-syslog.20100125> and
<http://web.mit.edu/broder/Public/clvmd/node-2-syslog.20100125>

Thanks again!
 - Evan



From raju.rajsand at gmail.com  Tue Jan 26 05:32:19 2010
From: raju.rajsand at gmail.com (Rajagopal Swaminathan)
Date: Tue, 26 Jan 2010 11:02:19 +0530
Subject: [Linux-cluster] newbie
In-Reply-To: <4B5DF7D0.8030403@dbtgroup.com>
References: <4B5DF7D0.8030403@dbtgroup.com>
Message-ID: <8786b91c1001252132h70782584ga1a71684562143f8@mail.gmail.com>

Greetings,

On Tue, Jan 26, 2010 at 1:28 AM, yvette hirth <yvette at dbtgroup.com> wrote:

> hi,
>
> once SCST is running, then i plan on having FOO present one LUN each for
> three raid arrays to the two desktops.  array0 will be shared by FOO and
> DT0; array1 will be shared by FOO and DT1; and array2 will be shared by FOO,
> DT0 and DT1.
>
> the scenario for my question is this:
>
will a "non-clustered" filesystem like ext3 / xfs be sufficient?  i can't
> use zfs because this is FCP SCSI not iSCSI.
>
NO
 You will require GFS / 2 in which case you will need Centos/RHEL cluster
infrastructure

>
> i've looked at OPEN-E, OpenFiler (which has a 64-bit version that doesn't
> like HP DL380 G6's), DataCore and tons more and all fall short or failed the
> install.
>
>

> we could certainly go iSCSI (which seems to be a trend), but we've invested
> some non-trivial budget in 8gbps FCP and don't feel it's antiquidated just
> yet.
>

ISCSI presents LUNs

For concurrent access from multiple nodes to the same filesystem a clustered
filesystem like GFS/OCFS/HPFS _will_ be required or of course you can go the
NFS way.

my 0.02

Regards

Rajagopal
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100126/235f7e60/attachment.htm>

From ccaulfie at redhat.com  Tue Jan 26 11:23:16 2010
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Tue, 26 Jan 2010 11:23:16 +0000
Subject: [Linux-cluster] Deadlock when using clvmd + OpenAIS + Corosync
In-Reply-To: <178868c41001251756l525fb806n2d6dc03ea4dfba5b@mail.gmail.com>
References: <178868c41001081458h3c0b70denef831712883a6fec@mail.gmail.com>	
	<178868c41001120821g6885c03cr75a3d42eb0196428@mail.gmail.com>	
	<4B4D9982.6060505@redhat.com>	
	<178868c41001210717u82779edp128006b7fc8e67e8@mail.gmail.com>	
	<4B598F07.1070002@redhat.com>	
	<178868c41001230935v752a811clbee3a679d13c8f0b@mail.gmail.com>	
	<4B5D7325.3000605@redhat.com>	
	<178868c41001250513o39d9e7d3i7f9ace8b0dc2e070@mail.gmail.com>	
	<178868c41001250528h7fe51965p89990eb2f5ffd542@mail.gmail.com>	
	<4B5D9D00.8020504@redhat.com>
	<178868c41001251756l525fb806n2d6dc03ea4dfba5b@mail.gmail.com>
Message-ID: <4B5ED0A4.2020201@redhat.com>

On 26/01/10 01:56, Evan Broder wrote:
> Hi again!
>
> Sorry to bother everyone again, but deadlocks on simultaneous LVM
> operations seem to have resurfaced somehow. As far as I know, we
> haven't made any changes, other than restarting the cluster a few
> times, but if I run
>
> root at black-mesa:~# lvcreate -n test1 -L 1G xenvg&  lvcreate -n test2 -L 1G xenvg
>
> it seems to consistently deadlock.
>
> I should have logs of everything in
> <http://web.mit.edu/broder/Public/clvmd/node-1-syslog.20100125>  and
> <http://web.mit.edu/broder/Public/clvmd/node-2-syslog.20100125>
>

Hiya

I can't reproduce that one here. And your logs don't show clvmd output 
so I can't really look into it further at the moment.

Chrissie



From td3201 at gmail.com  Tue Jan 26 21:29:51 2010
From: td3201 at gmail.com (Terry)
Date: Tue, 26 Jan 2010 15:29:51 -0600
Subject: [Linux-cluster] one node of two node cluster is not activating LVM
	volume groups
Message-ID: <8ee061011001261329l4b6aa333s321d0d7a709bf998@mail.gmail.com>

I have a two node cluster that was working fine but now one of my
nodes is not able to see all of my clustered volumes (clvmd):

[root at omadvnfs01a ~]# vgscan
  Reading all physical volumes.  This may take a while...
  Skipping clustered volume group vg_data01h
  Skipping clustered volume group vg_data01e
  Found volume group "VolGroup02" using metadata type lvm2
  Skipping clustered volume group vg_data01b
  Skipping clustered volume group vg_data01d
  Skipping clustered volume group vg_data01a
  Skipping clustered volume group vg_data01c
  Skipping clustered volume group vg_data01i
  Found volume group "VolGroup00" using metadata type lvm2


The other node is fine:
[root at omadvnfs01b ~]# vgscan
  Reading all physical volumes.  This may take a while...
  Found volume group "vg_data01h" using metadata type lvm2
  Found volume group "vg_data01e" using metadata type lvm2
  Found volume group "vg_data01d" using metadata type lvm2
  Found volume group "vg_data01b" using metadata type lvm2
  Found volume group "vg_data01a" using metadata type lvm2
  Found volume group "vg_data01c" using metadata type lvm2
  Found volume group "VolGroup02" using metadata type lvm2
  Found volume group "vg_data01i" using metadata type lvm2
  Found volume group "VolGroup00" using metadata type lvm2


I am not sure how to troubleshoot this.  I see clvmd running on both
nodes.  The cluster appears to be fine other than this.  Of course I
have tried restarting the entire cluster.  Any thoughts?



From td3201 at gmail.com  Tue Jan 26 21:41:02 2010
From: td3201 at gmail.com (Terry)
Date: Tue, 26 Jan 2010 15:41:02 -0600
Subject: [Linux-cluster] one node of two node cluster is not activating
	LVM volume groups
In-Reply-To: <8ee061011001261329l4b6aa333s321d0d7a709bf998@mail.gmail.com>
References: <8ee061011001261329l4b6aa333s321d0d7a709bf998@mail.gmail.com>
Message-ID: <8ee061011001261341jedfb004xb19f368b7c8112bd@mail.gmail.com>

On Tue, Jan 26, 2010 at 3:29 PM, Terry <td3201 at gmail.com> wrote:
> I have a two node cluster that was working fine but now one of my
> nodes is not able to see all of my clustered volumes (clvmd):
>
> [root at omadvnfs01a ~]# vgscan
> ?Reading all physical volumes. ?This may take a while...
> ?Skipping clustered volume group vg_data01h
> ?Skipping clustered volume group vg_data01e
> ?Found volume group "VolGroup02" using metadata type lvm2
> ?Skipping clustered volume group vg_data01b
> ?Skipping clustered volume group vg_data01d
> ?Skipping clustered volume group vg_data01a
> ?Skipping clustered volume group vg_data01c
> ?Skipping clustered volume group vg_data01i
> ?Found volume group "VolGroup00" using metadata type lvm2
>
>
> The other node is fine:
> [root at omadvnfs01b ~]# vgscan
> ?Reading all physical volumes. ?This may take a while...
> ?Found volume group "vg_data01h" using metadata type lvm2
> ?Found volume group "vg_data01e" using metadata type lvm2
> ?Found volume group "vg_data01d" using metadata type lvm2
> ?Found volume group "vg_data01b" using metadata type lvm2
> ?Found volume group "vg_data01a" using metadata type lvm2
> ?Found volume group "vg_data01c" using metadata type lvm2
> ?Found volume group "VolGroup02" using metadata type lvm2
> ?Found volume group "vg_data01i" using metadata type lvm2
> ?Found volume group "VolGroup00" using metadata type lvm2
>
>
> I am not sure how to troubleshoot this. ?I see clvmd running on both
> nodes. ?The cluster appears to be fine other than this. ?Of course I
> have tried restarting the entire cluster. ?Any thoughts?
>

Well, I fail, epically.   Apparently locking_type got set to 1
magically.  I can't imagine a patch would have done it but that was
the reason things weren't working for me.



From fdinitto at redhat.com  Tue Jan 26 21:45:29 2010
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Tue, 26 Jan 2010 22:45:29 +0100
Subject: [Linux-cluster] one node of two node cluster is not activating
 LVM	volume groups
In-Reply-To: <8ee061011001261329l4b6aa333s321d0d7a709bf998@mail.gmail.com>
References: <8ee061011001261329l4b6aa333s321d0d7a709bf998@mail.gmail.com>
Message-ID: <4B5F6279.1050500@redhat.com>

On 1/26/2010 10:29 PM, Terry wrote:
> I have a two node cluster that was working fine but now one of my
> nodes is not able to see all of my clustered volumes (clvmd):
> 
> [root at omadvnfs01a ~]# vgscan
>   Reading all physical volumes.  This may take a while...
>   Skipping clustered volume group vg_data01h
>   Skipping clustered volume group vg_data01e
>   Found volume group "VolGroup02" using metadata type lvm2
>   Skipping clustered volume group vg_data01b
>   Skipping clustered volume group vg_data01d
>   Skipping clustered volume group vg_data01a
>   Skipping clustered volume group vg_data01c
>   Skipping clustered volume group vg_data01i
>   Found volume group "VolGroup00" using metadata type lvm2


Check lvm.conf on this node and compare to the other node to see if they
are using the same locking type?

Fabio



From loudredz71 at yahoo.com  Tue Jan 26 23:15:00 2010
From: loudredz71 at yahoo.com (Jonathan Horne)
Date: Tue, 26 Jan 2010 15:15:00 -0800 (PST)
Subject: [Linux-cluster] NFS clustering, with SAN
Message-ID: <137818.60607.qm@web45308.mail.sp1.yahoo.com>

greetings, i am new to this list, my apologies if this topic has been touched on before (but so far searching has not yielded a direction for me to follow)

i have 2 servers that share a file system over SAN.  both servers export this file system as a NFS share.  i need to implement some sort of clustering so that if one server goes down, clients dont lose their connection to the exported file system.  im hoping to find some implementation that will have a pretty instantaneous failover, as there will be pretty constant filewrites/reads from multiple client servers.

can i please get some recommendations on what i should research to make this happen? ill gladly accept tips, or documents to read, i can work with either.

thanks,
Jonathan


      



From td3201 at gmail.com  Tue Jan 26 23:57:23 2010
From: td3201 at gmail.com (Terry)
Date: Tue, 26 Jan 2010 17:57:23 -0600
Subject: [Linux-cluster] NFS clustering, with SAN
In-Reply-To: <137818.60607.qm@web45308.mail.sp1.yahoo.com>
References: <137818.60607.qm@web45308.mail.sp1.yahoo.com>
Message-ID: <8ee061011001261557g736976e3r3101c3caeff8044b@mail.gmail.com>

On Tue, Jan 26, 2010 at 5:15 PM, Jonathan Horne <loudredz71 at yahoo.com> wrote:
> greetings, i am new to this list, my apologies if this topic has been touched on before (but so far searching has not yielded a direction for me to follow)
>
> i have 2 servers that share a file system over SAN. ?both servers export this file system as a NFS share. ?i need to implement some sort of clustering so that if one server goes down, clients dont lose their connection to the exported file system. ?im hoping to find some implementation that will have a pretty instantaneous failover, as there will be pretty constant filewrites/reads from multiple client servers.
>
> can i please get some recommendations on what i should research to make this happen? ill gladly accept tips, or documents to read, i can work with either.
>
> thanks,
> Jonathan
>

Hi Jonathan,

Welcome to the list.  I have a large nfs cluster I manage.  All of the
configuration magic to reduce the downtime and headache is on the NFS
client side.  When there is a failover event, there is a short outage
period for the IP to switch over to the other node.  A lot of things
affect this, including your switching (arp) environment.  Is it a
single export?  If not, you can create each export as a separate
service, with associating separate IP address and create an
active/active type of NFS environment.  Just a thought.

Thanks,
Terry



From loudredz71 at yahoo.com  Wed Jan 27 00:36:07 2010
From: loudredz71 at yahoo.com (Jonathan Horne)
Date: Tue, 26 Jan 2010 16:36:07 -0800 (PST)
Subject: [Linux-cluster] NFS clustering, with SAN
In-Reply-To: <8ee061011001261557g736976e3r3101c3caeff8044b@mail.gmail.com>
Message-ID: <234260.83036.qm@web45316.mail.sp1.yahoo.com>

--- On Tue, 1/26/10, Terry <td3201 at gmail.com> wrote:

> From: Terry <td3201 at gmail.com>
> Subject: Re: [Linux-cluster] NFS clustering, with SAN
> To: "linux clustering" <linux-cluster at redhat.com>
> Date: Tuesday, January 26, 2010, 5:57 PM
> On Tue, Jan 26, 2010 at 5:15 PM,
> Jonathan Horne <loudredz71 at yahoo.com>
> wrote:
> > greetings, i am new to this list, my apologies if this
> topic has been touched on before (but so far searching has
> not yielded a direction for me to follow)
> >
> > i have 2 servers that share a file system over SAN.
> ?both servers export this file system as a NFS share. ?i
> need to implement some sort of clustering so that if one
> server goes down, clients dont lose their connection to the
> exported file system. ?im hoping to find some
> implementation that will have a pretty instantaneous
> failover, as there will be pretty constant filewrites/reads
> from multiple client servers.
> >
> > can i please get some recommendations on what i should
> research to make this happen? ill gladly accept tips, or
> documents to read, i can work with either.
> >
> > thanks,
> > Jonathan
> >
> 
> Hi Jonathan,
> 
> Welcome to the list.? I have a large nfs cluster I
> manage.? All of the
> configuration magic to reduce the downtime and headache is
> on the NFS
> client side.? When there is a failover event, there is
> a short outage
> period for the IP to switch over to the other node.? A
> lot of things
> affect this, including your switching (arp)
> environment.? Is it a
> single export?? If not, you can create each export as
> a separate
> service, with associating separate IP address and create
> an
> active/active type of NFS environment.? Just a
> thought.
> 
> Thanks,
> Terry
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 

Terry, thanks for the reply.

my setup is like this: 2 servers, with a shared LUN.  LUN is mounted on both as /opt/data, both have /opt/data listed in the /etc/exports file.  so basically 2 servers, and showmount -e against both of them show the exact same thing (and of course, its a given that since its a shared LUN (ofcs2 file system), both exports contain the exact same data).  right now were in the testing phase of this project, and there are 4 NFS clients that are connecting (2 oracle servers writing files to the export, and 2 weblogic servers reading the files from oracle from the export).

ultimately trying to fix this up in an HA setup, so that if one NFS server drops off, the clients dont know the difference.  environment is fully switched, and all nodes (NFS servers and clients) have bonded network interfaces connected to separate switches (which connect via port-channel).

you mentioned that you have setup on the client side that takes care of your failover headaches, im interested in hearing more about how this works.

thanks,
Jonathan


      




From pmdyer at ctgcentral2.com  Wed Jan 27 00:48:46 2010
From: pmdyer at ctgcentral2.com (Paul M. Dyer)
Date: Tue, 26 Jan 2010 18:48:46 -0600 (CST)
Subject: [Linux-cluster] NFS clustering, with SAN
In-Reply-To: <234260.83036.qm@web45316.mail.sp1.yahoo.com>
Message-ID: <18656321.11264553326207.JavaMail.root@athena>

http://sources.redhat.com/cluster/doc/nfscookbook.pdf

This cookbook, starting at about page 14 talks about creating an NFS resource.

Paul

----- Original Message -----
From: "Jonathan Horne" <loudredz71 at yahoo.com>
To: "linux clustering" <linux-cluster at redhat.com>
Sent: Tuesday, January 26, 2010 6:36:07 PM (GMT-0600) America/Chicago
Subject: Re: [Linux-cluster] NFS clustering, with SAN

--- On Tue, 1/26/10, Terry <td3201 at gmail.com> wrote:

> From: Terry <td3201 at gmail.com>
> Subject: Re: [Linux-cluster] NFS clustering, with SAN
> To: "linux clustering" <linux-cluster at redhat.com>
> Date: Tuesday, January 26, 2010, 5:57 PM
> On Tue, Jan 26, 2010 at 5:15 PM,
> Jonathan Horne <loudredz71 at yahoo.com>
> wrote:
> > greetings, i am new to this list, my apologies if this
> topic has been touched on before (but so far searching has
> not yielded a direction for me to follow)
> >
> > i have 2 servers that share a file system over SAN.
> ?both servers export this file system as a NFS share. ?i
> need to implement some sort of clustering so that if one
> server goes down, clients dont lose their connection to the
> exported file system. ?im hoping to find some
> implementation that will have a pretty instantaneous
> failover, as there will be pretty constant filewrites/reads
> from multiple client servers.
> >
> > can i please get some recommendations on what i should
> research to make this happen? ill gladly accept tips, or
> documents to read, i can work with either.
> >
> > thanks,
> > Jonathan
> >
> 
> Hi Jonathan,
> 
> Welcome to the list.? I have a large nfs cluster I
> manage.? All of the
> configuration magic to reduce the downtime and headache is
> on the NFS
> client side.? When there is a failover event, there is
> a short outage
> period for the IP to switch over to the other node.? A
> lot of things
> affect this, including your switching (arp)
> environment.? Is it a
> single export?? If not, you can create each export as
> a separate
> service, with associating separate IP address and create
> an
> active/active type of NFS environment.? Just a
> thought.
> 
> Thanks,
> Terry
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 

Terry, thanks for the reply.

my setup is like this: 2 servers, with a shared LUN.  LUN is mounted on both as /opt/data, both have /opt/data listed in the /etc/exports file.  so basically 2 servers, and showmount -e against both of them show the exact same thing (and of course, its a given that since its a shared LUN (ofcs2 file system), both exports contain the exact same data).  right now were in the testing phase of this project, and there are 4 NFS clients that are connecting (2 oracle servers writing files to the export, and 2 weblogic servers reading the files from oracle from the export).

ultimately trying to fix this up in an HA setup, so that if one NFS server drops off, the clients dont know the difference.  environment is fully switched, and all nodes (NFS servers and clients) have bonded network interfaces connected to separate switches (which connect via port-channel).

you mentioned that you have setup on the client side that takes care of your failover headaches, im interested in hearing more about how this works.

thanks,
Jonathan


      


--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster




From alex.urbanowicz at gmail.com  Wed Jan 27 10:22:28 2010
From: alex.urbanowicz at gmail.com (Alex Urbanowicz)
Date: Wed, 27 Jan 2010 11:22:28 +0100
Subject: [Linux-cluster] GFS problem
Message-ID: <9e4c4ee71001270222j3f4664at4503694a0dd1c73d@mail.gmail.com>

>
> From: Jorge Palma <jpalmae at gmail.com>
> To: linux clustering <linux-cluster at redhat.com>
> Subject: Re: [Linux-cluster] GFS problem
> Message-ID:
>        <5b65f1b11001251343p659d3b96gf07dd2165adf521e at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Please send your fence configuration and cluster.conf
>

cluster.conf:

<?xml version="1.0"?>
<!--
** puppet managed file $Revision: 2889 $
-->
<cluster config_version="14" name="gfs-filmweb">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="www1" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="rsysrq"
nodename="www1" password="fencepassword" port="9" operation="1bbbb"/>
                                </method>
                                <!-- if method 1 happen to fail - use method
2 -->
                                <method name="2">
                                        <device name="manual"
nodename="www1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="www2" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="rsysrq"
nodename="www2" password="fencepassword" port="9" operation="1bbbb"/>
                                </method>
                                <method name="2">
                                        <device name="manual"
nodename="www2"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="app1" nodeid="65" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="rsysrq"
nodename="app1" password="fencepassword" port="9" operation="1bbbb"/>
                                </method>
                                <method name="2">
                                        <device name="manual"
nodename="app1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="app2" nodeid="66" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="rsysrq"
nodename="app2" password="fencepassword" port="9" operation="1bbbb"/>
                                </method>
                                <method name="2">
                                        <device name="manual"
nodename="app2"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="app3" nodeid="67" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="rsysrq"
nodename="app3" password="fencepassword" port="9" operation="1bbbb"/>
                                </method>
                                <method name="2">
                                        <device name="manual"
nodename="app3"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="app4" nodeid="68" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="rsysrq"
nodename="app4" password="fencepassword" port="9" operation="1bbbb"/>
                                </method>
                                <method name="2">
                                        <device name="manual"
nodename="app4"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="app5" nodeid="69" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="rsysrq"
nodename="app5" password="fencepassword" port="9" operation="1bbbb"/>
                                </method>
                                <method name="2">
                                        <device name="manual"
nodename="app5"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="app6" nodeid="70" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="rsysrq"
nodename="app6" password="fencepassword" port="9" operation="1bbbb"/>
                                </method>
                                <method name="2">
                                        <device name="manual"
nodename="app6"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="app7" nodeid="71" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="rsysrq"
nodename="app7" password="fencepassword" port="9" operation="1bbbb"/>
                                </method>
                                <method name="2">
                                        <device name="manual"
nodename="app7"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="blade403" nodeid="72" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="rsysrq"
nodename="blade403" password="fencepassword" port="9" operation="1bbbb"/>
                                </method>
                                <method name="2">
                                        <device name="manual"
nodename="blade403"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="blade404" nodeid="73" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="rsysrq"
nodename="blade404" password="fencepassword" port="9" operation="1bbbb"/>
                                </method>
                                <method name="2">
                                        <device name="manual"
nodename="blade404"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="blade405" nodeid="74" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="rsysrq"
nodename="blade405" password="fencepassword" port="9" operation="1bbbb"/>
                                </method>
                                <method name="2">
                                        <device name="manual"
nodename="blade405"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="blade406" nodeid="75" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="rsysrq"
nodename="blade406" password="fencepassword" port="9" operation="1bbbb"/>
                                </method>
                                <method name="2">
                                        <device name="manual"
nodename="blade406"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="blade407" nodeid="76" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="rsysrq"
nodename="blade407" password="fencepassword" port="9" operation="1bbbb"/>
                                </method>
                                <method name="2">
                                        <device name="manual"
nodename="blade407"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="blade408" nodeid="77" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="rsysrq"
nodename="blade408" password="fencepassword" port="9" operation="1bbbb"/>
                                </method>
                                <method name="2">
                                        <device name="manual"
nodename="blade408"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="8" two_node="0"/>
        <fencedevices>
                <fencedevice agent="fence_rsysrq" name="rsysrq"/>
                <fencedevice agent="fence_manual" name="manual"/>
        </fencedevices>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
</cluster>

fencing is done using fence_rsysrq so there is no configuration to speak of
except the iptables/modprobe part:

options ipt_SYSRQ passwd="fencepassword" tolerance=3720

-A INPUT  -i bond0.108 -s 10.100.108.0/24 -d <hostip> -p udp -m udp --dport
9 -j SYSRQ

Alex.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100127/532fb018/attachment.htm>

From marcos.david at efacec.com  Wed Jan 27 10:53:13 2010
From: marcos.david at efacec.com (Marcos David)
Date: Wed, 27 Jan 2010 10:53:13 +0000
Subject: [Linux-cluster] Failover issues when shuting down node
Message-ID: <4B601B19.6070900@efacec.com>

Hi,

after a few tests with a four-node cluster (mainly shutting one down to
see if the failover was working properly) we had the following messages:

*Jan 27 03:31:02 node2_pub clurgmgrd[4240]: <err> #75: Failed changing
service status
Jan 27 03:31:02 node2_pub clurgmgrd[4240]: <debug> Stopping failed
service service:PID_PA-SA-R2
Jan 27 03:31:07 node2_pub clurgmgrd[4240]: <notice> Stopping service
service:PID_PA-SA-R2*
...
other checks
...
*Jan 27 03:31:25 node2_pub openais[3480]: [TOTEM] entering GATHER state
from 12.
Jan 27 03:31:27 node2_pub clurgmgrd[4240]: <err> #52: Failed changing RG
status*
*Jan 27 03:31:27 node2_pub clurgmgrd[4240]: <crit> #13: Service
service:PID_PA-SA-R2 failed to stop cleanly
Jan 27 03:31:27 node2_pub clurgmgrd[4240]: <debug> Handling failure
request for RG service:PID_PA-SA-R2*
Jan 27 03:31:30 node2_pub openais[3480]: [TOTEM] entering GATHER state
from 11.

The problem is that the same service was running on two nodes which
isn't supposed to happen....


The service in question consists of a virtual ip and a script.
The script's stop doesn't return an error in any circumstance.

Cluster is running RHEL5.3.

What could have caused these errors?


Thanks for any insight and help ;)

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100127/85acfd2a/attachment.htm>

From adam.king at intechnology.com  Wed Jan 27 11:23:39 2010
From: adam.king at intechnology.com (King, Adam)
Date: Wed, 27 Jan 2010 11:23:39 -0000
Subject: [Linux-cluster] fence_ilo
In-Reply-To: <ED48EF11677FC44BB3440460C24F58E405A4BFA6@mgmtexm03101.intechnology.co.uk>
References: <ED48EF11677FC44BB3440460C24F58E405A4BE88@mgmtexm03101.intechnology.co.uk><c0773fd31001212328i314a03cay5227bb0fddfc3c73@mail.gmail.com>
	<ED48EF11677FC44BB3440460C24F58E405A4BFA6@mgmtexm03101.intechnology.co.uk>
Message-ID: <ED48EF11677FC44BB3440460C24F58E405C99D73@mgmtexm03101.intechnology.co.uk>

How do I get the system to use this instead of the default /sbin/fence_ilo? I cant find any instructions/man page for this. 

 

 

From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of King, Adam
Sent: 22 January 2010 09:09
To: linux clustering
Subject: Re: [Linux-cluster] fence_ilo

 

Hi, 

Thanks for your reply, 192.168.1.152 is the first servers ilo and 192.168.50.112 is the first servers eth0, the first servers eth1 is 192.168.1.50. The second servers eth0 is 192.168.50.116 and its ilo is 192.168.1.153, the second servers eth1 is 192.168.1.51. 

I have found the rpm you suggested and will try that,

Thanks again

 

 

  

Adam King 

Systems Administrator 

adam.king at intechnology.com 

InTechnology plc 

Support 0845 120 7070 

Telephone 01423 850000 

Facsimile 01423 858866 

www.intechnology.com <http://www.intechnology.com/> 


   

From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Brett Cave
Sent: 22 January 2010 07:28
To: linux clustering
Subject: Re: [Linux-cluster] fence_ilo

 

 

On Thu, Jan 21, 2010 at 3:25 PM, King, Adam <adam.king at intechnology.com> wrote:




Hi, 

I have a 2 node cluster (xen1 and xen2) running on dl380 G3's. There are 2 virtual machines running on these. From first node from the shell command line I can fence the 2nd  node through the hp lights out port using /sbin/fence_ilo -a 192.168.1.53 -l aking -p password -v. I can also do the same from the second node and successfully fence the first node. When I fence the node that has control of one or more virtual machines they are migrated to the other node.

I had some issues with the default fence_ilo, found a nice fence_fast_ilo fencing script in an rpm: comoonics-bootimage-fenceclient. This seems to work much better.

 

	If I fence a node in Luci this works too. 

	However, if I have a node running one or more virtual machines and run /sbin/reboot -f on that node, in /var/log/messages in the other node I get messages such as 

	xen1 fenced[3126]: agent "fence_ilo" reports: Unable to connect/login to fencing device

	xen1 fenced[3126]: fence "192.168.50.116" failed

 

assuming that 192.168.1.53 is the 2nd server's ilo address and 192.168.50.116 is the 1st servers ilo address?

 

	Can anyone advise why this is happening? 

	Cheers

	Adam 

	  

	Adam King 

	Systems Administrator 

	adam.king at intechnology.com 

	InTechnology plc 

	Support 0845 120 7070 

	Telephone 01423 850000 

	Facsimile 01423 858866 

	www.intechnology.com <http://www.intechnology.com/> 

	
	  

InTechnology provides industry leading network, hosting, data
and IP telephony services to over 800 UK businesses.

	 

 

Over 25 years in business employing 220 employees in the UK 

 



Annual turnover of ?50m, profitable and financially secure 





InTechnology manages over 14,000 managed network circuits 





InTechnology manages over 5PB (over 5.2 million GB) of customer data 





InTechnology hosts more than 16,000 IP telephony seats 





InTechnology operates four managed data centres - 50,000 sq ft capacity 



 

	This is an email from InTechnology plc, Central House, Beckwith Knowle, Harrogate, UK, HG3 1UG.
	Registered in England 3916586.
	
	The contents of this message may be privileged and confidential. If you have received this message in error, you may not use,
	disclose, copy or distribute its content in any way. Please notify the sender immediately. All messages are scanned for all viruses.

	
	--
	Linux-cluster mailing list
	Linux-cluster at redhat.com
	https://www.redhat.com/mailman/listinfo/linux-cluster

 

This is an email from InTechnology plc, Central House, Beckwith Knowle, Harrogate, UK, HG3 1UG.
Registered in England 3916586.

The contents of this message may be privileged and confidential. If you have received this message in error, you may not use,
disclose, copy or distribute its content in any way. Please notify the sender immediately. All messages are scanned for all viruses. 
 

Adam King
Systems Administrator
adam.king at intechnology.com


InTechnology plc
Support 0845 120 7070
Telephone 01423 850000
Facsimile 01423 858866
www.intechnology.com

 


This is an email from InTechnology plc, Central House, Beckwith Knowle, Harrogate, UK, HG3 1UG.
Registered in England 3916586.

The contents of this message may be privileged and confidential. If you have received this message in error, you may not use,

disclose, copy or distribute its content in any way. Please notify the sender immediately. All messages are scanned for all viruses.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100127/ec2fe147/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.gif
Type: image/gif
Size: 6495 bytes
Desc: image001.gif
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100127/ec2fe147/attachment.gif>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image002.gif
Type: image/gif
Size: 97 bytes
Desc: image002.gif
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100127/ec2fe147/attachment-0001.gif>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image003.gif
Type: image/gif
Size: 76 bytes
Desc: image003.gif
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100127/ec2fe147/attachment-0002.gif>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: InTechnology-logo-circle.gif
Type: image/gif
Size: 6495 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100127/ec2fe147/attachment-0003.gif>

From alfredo.moralejo at roche.com  Wed Jan 27 14:06:25 2010
From: alfredo.moralejo at roche.com (Moralejo, Alfredo)
Date: Wed, 27 Jan 2010 15:06:25 +0100
Subject: [Linux-cluster] Failover issues when shuting down node
In-Reply-To: <4B601B19.6070900@efacec.com>
References: <4B601B19.6070900@efacec.com>
Message-ID: <C64734E4E1C80E49955AD539DB2FBC3A57F27B19@rkamsem703.emea.roche.com>


Hi,

I think this is related to bug:

https://bugzilla.redhat.com/show_bug.cgi?id=548133

Adding a sleep after stopping rgmanager can help.

Best regards,

Alfredo



________________________________
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Marcos David
Sent: Wednesday, January 27, 2010 11:53 AM
To: linux clustering
Subject: [Linux-cluster] Failover issues when shuting down node

Hi,

after a few tests with a four-node cluster (mainly shutting one down to see if the failover was working properly) we had the following messages:

Jan 27 03:31:02 node2_pub clurgmgrd[4240]: <err> #75: Failed changing service status
Jan 27 03:31:02 node2_pub clurgmgrd[4240]: <debug> Stopping failed service service:PID_PA-SA-R2
Jan 27 03:31:07 node2_pub clurgmgrd[4240]: <notice> Stopping service service:PID_PA-SA-R2
...
other checks
...
Jan 27 03:31:25 node2_pub openais[3480]: [TOTEM] entering GATHER state from 12.
Jan 27 03:31:27 node2_pub clurgmgrd[4240]: <err> #52: Failed changing RG status
Jan 27 03:31:27 node2_pub clurgmgrd[4240]: <crit> #13: Service service:PID_PA-SA-R2 failed to stop cleanly
Jan 27 03:31:27 node2_pub clurgmgrd[4240]: <debug> Handling failure request for RG service:PID_PA-SA-R2
Jan 27 03:31:30 node2_pub openais[3480]: [TOTEM] entering GATHER state from 11.

The problem is that the same service was running on two nodes which isn't supposed to happen....


The service in question consists of a virtual ip and a script.
The script's stop doesn't return an error in any circumstance.

Cluster is running RHEL5.3.

What could have caused these errors?


Thanks for any insight and help ;)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100127/0c69c9e8/attachment.htm>

From td3201 at gmail.com  Wed Jan 27 14:53:44 2010
From: td3201 at gmail.com (Terry)
Date: Wed, 27 Jan 2010 08:53:44 -0600
Subject: [Linux-cluster] NFS clustering, with SAN
In-Reply-To: <234260.83036.qm@web45316.mail.sp1.yahoo.com>
References: <8ee061011001261557g736976e3r3101c3caeff8044b@mail.gmail.com>
	<234260.83036.qm@web45316.mail.sp1.yahoo.com>
Message-ID: <8ee061011001270653k744de6b7j6e9a76deb75a9b77@mail.gmail.com>

On Tue, Jan 26, 2010 at 6:36 PM, Jonathan Horne <loudredz71 at yahoo.com> wrote:
> --- On Tue, 1/26/10, Terry <td3201 at gmail.com> wrote:
>
>> From: Terry <td3201 at gmail.com>
>> Subject: Re: [Linux-cluster] NFS clustering, with SAN
>> To: "linux clustering" <linux-cluster at redhat.com>
>> Date: Tuesday, January 26, 2010, 5:57 PM
>> On Tue, Jan 26, 2010 at 5:15 PM,
>> Jonathan Horne <loudredz71 at yahoo.com>
>> wrote:
>> > greetings, i am new to this list, my apologies if this
>> topic has been touched on before (but so far searching has
>> not yielded a direction for me to follow)
>> >
>> > i have 2 servers that share a file system over SAN.
>> ?both servers export this file system as a NFS share. ?i
>> need to implement some sort of clustering so that if one
>> server goes down, clients dont lose their connection to the
>> exported file system. ?im hoping to find some
>> implementation that will have a pretty instantaneous
>> failover, as there will be pretty constant filewrites/reads
>> from multiple client servers.
>> >
>> > can i please get some recommendations on what i should
>> research to make this happen? ill gladly accept tips, or
>> documents to read, i can work with either.
>> >
>> > thanks,
>> > Jonathan
>> >
>>
>> Hi Jonathan,
>>
>> Welcome to the list.? I have a large nfs cluster I
>> manage.? All of the
>> configuration magic to reduce the downtime and headache is
>> on the NFS
>> client side.? When there is a failover event, there is
>> a short outage
>> period for the IP to switch over to the other node.? A
>> lot of things
>> affect this, including your switching (arp)
>> environment.? Is it a
>> single export?? If not, you can create each export as
>> a separate
>> service, with associating separate IP address and create
>> an
>> active/active type of NFS environment.? Just a
>> thought.
>>
>> Thanks,
>> Terry
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
> Terry, thanks for the reply.
>
> my setup is like this: 2 servers, with a shared LUN. ?LUN is mounted on both as /opt/data, both have /opt/data listed in the /etc/exports file. ?so basically 2 servers, and showmount -e against both of them show the exact same thing (and of course, its a given that since its a shared LUN (ofcs2 file system), both exports contain the exact same data). ?right now were in the testing phase of this project, and there are 4 NFS clients that are connecting (2 oracle servers writing files to the export, and 2 weblogic servers reading the files from oracle from the export).
>
> ultimately trying to fix this up in an HA setup, so that if one NFS server drops off, the clients dont know the difference. ?environment is fully switched, and all nodes (NFS servers and clients) have bonded network interfaces connected to separate switches (which connect via port-channel).
>
> you mentioned that you have setup on the client side that takes care of your failover headaches, im interested in hearing more about how this works.
>
> thanks,
> Jonathan
>
>
>
>
>
> --

The link Paul posted is a great resource.  Start there.  In short,
you're going to have downtime in the event of a failover, short, but
downtime.  Increase nodes to reduce the chances.



From tuckerd at lyle.smu.edu  Wed Jan 27 15:08:47 2010
From: tuckerd at lyle.smu.edu (Doug Tucker)
Date: Wed, 27 Jan 2010 09:08:47 -0600
Subject: [Linux-cluster] Cluster Suite 4 to 5 upgrade?
Message-ID: <1264604927.3276.1.camel@thor.seas.smu.edu>

I have a RHCS currently on RH 4.8. My question is 3 part. Is it possible
to join a RH 5.4 cluster machine to a 4.8 cluster? If not, is there any
upgrade path that does not compromise the data? IOW, could I take the
backend storage currently running on the 4.8, and connect it to a new
5.4 without data loss?  Or, can I do an "upgrade" of the existing
cluster machines from 4.8 to 5.4?

Thanks in advance,

Doug Tucker



From gordan at bobich.net  Wed Jan 27 15:18:08 2010
From: gordan at bobich.net (Gordan Bobic)
Date: Wed, 27 Jan 2010 15:18:08 +0000
Subject: [Linux-cluster] Cluster Suite 4 to 5 upgrade?
In-Reply-To: <1264604927.3276.1.camel@thor.seas.smu.edu>
References: <1264604927.3276.1.camel@thor.seas.smu.edu>
Message-ID: <4B605930.3090103@bobich.net>

Doug Tucker wrote:
> I have a RHCS currently on RH 4.8. My question is 3 part. Is it possible
> to join a RH 5.4 cluster machine to a 4.8 cluster?

In a word - no.

> If not, is there any
> upgrade path that does not compromise the data? IOW, could I take the
> backend storage currently running on the 4.8, and connect it to a new
> 5.4 without data loss?

I presume you mean that you have a GFS volume on a SAN that you want to 
keep after the upgrade. I'm not 100% sure (perhaps someone from RH here 
could confirm), but I believe the on-disk GFS format is the same between 
the versions that ship with RHEL 4.8 and 5.x. Note that you still won't 
be able to have the volume mounted on both 4.x and 5.x machines at the 
same time.

> Or, can I do an "upgrade" of the existing
> cluster machines from 4.8 to 5.4?

AFAIK, the "upgrade" process isn't supported. It might work, but the 
chances are that you'll be better off doing a clean install and copying 
the minimum required configs over.

Gordan



From dirk.schulz at kinzesberg.de  Wed Jan 27 17:10:03 2010
From: dirk.schulz at kinzesberg.de (Dirk H. Schulz)
Date: Wed, 27 Jan 2010 18:10:03 +0100
Subject: [Linux-cluster] LVM and GFS as resource/service not starting
Message-ID: <4B60736B.50903@kinzesberg.de>

Hi folks,

I tried to integrate a CLVM vg and the GFS partitions living there into 
cluster.conf.

This is what I added:
>         <resources>
>                 <script name="sshd-virt" file="/etc/init.d/sshd-virt"/>
>                 <lvm name="VolGroup10" vg_name="VolGroup10"/>
>                 <clusterfs name="XenConfigs" fstype="gfs" 
> device="/dev/VolGroup10/XenConfigs" mountpoint="/mount/point1" 
> force_unmount="0"/>     
>                 <clusterfs name="XenImages" fstype="gfs" 
> device="/dev/VolGroup10/XenImages" mountpoint="/mount/point2" 
> force_unmount="0"/>     
>         </resources>
>         <service name="GFS-XenVols" autostart="1" domain="ALL">
>             <lvm ref="VolGroup10"/>
>             <clusterfs ref="XenConfigs"/>
>             <clusterfs ref="XenImages"/>
>         </service>
But rgmanager failed to start the service. This is all I get:
> /var/log/messages:Jan 27 17:54:57 pclus1cent5-01 clurgmgrd[27744]: 
> <err> #43: Service service:GFS-XenVols has failed; can not start.
> /var/log/messages:Jan 27 17:54:57 pclus1cent5-01 clurgmgrd[27744]: 
> <crit> #13: Service service:GFS-XenVols failed to stop cleanly
How can I get more detailled output from clurgmgrd?

And is there any syntactical error in the above entries in cluster.conf?

Any hint or help is appreciated.

Dirk




From glisha at gmail.com  Wed Jan 27 20:26:38 2010
From: glisha at gmail.com (Georgi Stanojevski)
Date: Wed, 27 Jan 2010 21:26:38 +0100
Subject: [Linux-cluster] start cluster with only one node present?
Message-ID: <b71e4df91001271226w73eefa58rf10895a1b33cd886@mail.gmail.com>

Hi folks,

i'm testing a two node cluster (on redhat 5.4, hp dl380g3). I'm using
fence_ilo, ?have a quorum disk (*cluster.conf) and have an IP resource
and GFS2 mount point.

Every think works well, until I completely disconnect the power from
one node (ex. node1).

Now node2 can't fence node1 (since there is no connection to the fence
ilo device) and I can't start the cluster at all. When I try to reboot
node2 it hangs on Starting fencing...

I guess my question is: Is there a way to start the cluster if only
one node is present?

Thank you for any input.

*cluster.conf
<?xml version="1.0"?>
<cluster alias="hpcluster" config_version="16" name="hpcluster">
? ? ? ?<fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
? ? ? ?<clusternodes>
? ? ? ? ? ? ? ?<clusternode name="node2.domain.com" nodeid="1" votes="1">
? ? ? ? ? ? ? ? ? ? ? ?<fence>
? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?<method name="1">
? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?<device name="node2-ilo"/>
? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?</method>
? ? ? ? ? ? ? ? ? ? ? ?</fence>
? ? ? ? ? ? ? ?</clusternode>
? ? ? ? ? ? ? ?<clusternode name="node1.domain.com" nodeid="2" votes="1">
? ? ? ? ? ? ? ? ? ? ? ?<fence>
? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?<method name="1">
? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?<device name="node1-ilo"/>
? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?</method>
? ? ? ? ? ? ? ? ? ? ? ?</fence>
? ? ? ? ? ? ? ?</clusternode>
? ? ? ?</clusternodes>
? ? ? ?<cman expected_votes="3" two_node="0"/>
? ? ? ?<fencedevices>
? ? ? ? ? ? ? ?<fencedevice agent="fence_ilo" hostname="node1-ilo"
login="admin" name="node1-ilo" passwd="admin"/>
? ? ? ? ? ? ? ?<fencedevice agent="fence_ilo" hostname="node2-ilo"
login="admin" name="node2-ilo" passwd="admin"/>
? ? ? ?</fencedevices>
? ? ? ?<rm>
? ? ? ? ? ? ? ?<failoverdomains/>
? ? ? ? ? ? ? ?<resources>
? ? ? ? ? ? ? ? ? ? ? ?<ip address="10.10.10.15" monitor_link="1"/>
? ? ? ? ? ? ? ?</resources>
? ? ? ? ? ? ? ?<service autostart="0" exclusive="0" name="myservice"
recovery="relocate">
? ? ? ? ? ? ? ? ? ? ? ?<ip ref="10.10.10.15"/>
? ? ? ? ? ? ? ?</service>
? ? ? ?</rm>
? ? ? ?<quorumd device="/dev/emcpowerb1" votes="1" status_file="/tmp/qdisk">
? ? ? ? ? ? ? ?<heuristic program="ping 10.10.10.1 -c1 -t1" />
? ? ? ?</quorumd>
</cluster>



--
Glisha



From pmdyer at ctgcentral2.com  Wed Jan 27 23:41:29 2010
From: pmdyer at ctgcentral2.com (Paul M. Dyer)
Date: Wed, 27 Jan 2010 17:41:29 -0600 (CST)
Subject: [Linux-cluster] fence cold shutdown
Message-ID: <1797236.01264635689522.JavaMail.root@athena>

Hi,

I am testing failover of Xen VM services on a 3-node Cluster Suite install.   To begin, I just did a reboot of node-1, that had one VM service running.   The VM service was migrated as part of the node shutdown.  

Okay, now for something more severe.  I started node-1 again, and migrated that same VM service back there.   This time, I did a hard poweroff of the node.   The VM service showed state "stopping" in clustat, as the node showed "Offline".   I left the node powered off for about 10 minutes, but the VM service never got migrated or restarted on the other nodes of the cluster.

I then powered on node-1.  The VM service started back up on node-3, which is the preferred node for it's failover domain.

Shouldn't that VM service restart automatically even while node-1 was down?

Thanks for help.

Paul

P.S.
RHEL 5.4 xen kernel 64bit
cman-2.0.115-1.el5_4.9
rgmanager-2.0.52-1.el5_4.3



From zaeem.arshad at gmail.com  Thu Jan 28 04:46:36 2010
From: zaeem.arshad at gmail.com (Zaeem Arshad)
Date: Thu, 28 Jan 2010 09:46:36 +0500
Subject: [Linux-cluster] DRBD with GFS applicable for this scenario?
Message-ID: <3e1809421001272046w4111e6e9m3b2d385166d1f7cf@mail.gmail.com>

Hi List,

We have 2 geographically distant sites located approximately 35km
apart with dark fiber connectivity available between them. Mail01 and
SAN1 is placed at site A while Mail02 and SAN2 is at site B. Our
requirement is to have the mail servers in a cluster configuration in
an active/active mode. To cater for the loss of connectivity or losing
a SAN itself, I have come up with the following design.

1) Export 1 block device from each SAN to its mail server i.e. SAN1
exports to Mail01
2) Use DRBD to configure a block device comprising of the 2 SAN
volumes and use it as a physical volume in clvm.
3) Create a GFS logical volume from this PV that can be used by both servers.

I am wondering if this is a correct design as theoretically it looks
to address both node and SAN failure or connectivity loss.



Regards

--
Zaeem



From gordan at bobich.net  Thu Jan 28 05:25:23 2010
From: gordan at bobich.net (Gordan Bobic)
Date: Thu, 28 Jan 2010 05:25:23 +0000
Subject: [Linux-cluster] DRBD with GFS applicable for this scenario?
In-Reply-To: <3e1809421001272046w4111e6e9m3b2d385166d1f7cf@mail.gmail.com>
References: <3e1809421001272046w4111e6e9m3b2d385166d1f7cf@mail.gmail.com>
Message-ID: <4B611FC3.2040007@bobich.net>

Zaeem Arshad wrote:
> Hi List,
> 
> We have 2 geographically distant sites located approximately 35km
> apart with dark fiber connectivity available between them. Mail01 and
> SAN1 is placed at site A while Mail02 and SAN2 is at site B. Our
> requirement is to have the mail servers in a cluster configuration in
> an active/active mode. To cater for the loss of connectivity or losing
> a SAN itself, I have come up with the following design.

What's the point of having a SAN if you're using DRBD? You might as well 
have DAS in each of the two mail servers. Unless you need so much 
storage space that you can't put enough disks directly into the server...

> 1) Export 1 block device from each SAN to its mail server i.e. SAN1
> exports to Mail01
> 2) Use DRBD to configure a block device comprising of the 2 SAN
> volumes and use it as a physical volume in clvm.

The CLVM bit is isn't relevant per se, you don't strictly need it, but 
it won't hurt.

> 3) Create a GFS logical volume from this PV that can be used by both servers.

That's fine.

> I am wondering if this is a correct design as theoretically it looks
> to address both node and SAN failure or connectivity loss.

The problem you have is that you have no way of enacting fencing if the 
connectivity between the sites fails. If a node fails, any cluster file 
system (GFS included) will mandate a fencing action to ensure that one 
of the nodes gets taken down and stays down. If you have lost cross-site 
connectivity, the nodes won't be able to fence each other, and GFS will 
simply block until connectivity is restored and fencing succeeds. The 
chances are that when this happens, it'll also cause a fencing shoot-out 
and both nodes may well end up getting fenced.

You could use some kind of cheat-fencing, say, by setting a firewall 
rule that will prevent the nodes from re-connecting (you'd need to write 
your own fencing agent, but that's not particularly difficult), but then 
you would be pretty much guaranteeing a split-brain situation, where the 
nodes would end up operating independently without any hope of ever 
re-synchronising.

The bottom line is that you need reliable out-of-band fencing mechanism. 
If you have GSM/wireless signal in both areas you could rig up a 
separate, small fencing "server" on each site with a GSM modem, and 
write a fencing agent that sends a fencing request by SMS. When the 
fencing server receives a fencing request, you'd have to make it issue a 
local fencing action using one of the more standard fencing agents. Note 
that in this case, due to high latency of things like SMS, you'd need to 
implement accurate time stamping and deliberately semi-randomize the 
delay between fencing requests being sent so that you could check time 
stamps and the fencing servers could sensibly decide whether to obey the 
local fencing request or the remote one.

You have to get a little creative about it and write a few lines of code 
to glue it together. I've been meaning to implement something like this 
for a while, but I haven't gotten around to it yet.

Gordan



From zaeem.arshad at gmail.com  Thu Jan 28 06:11:50 2010
From: zaeem.arshad at gmail.com (Zaeem Arshad)
Date: Thu, 28 Jan 2010 11:11:50 +0500
Subject: [Linux-cluster] DRBD with GFS applicable for this scenario?
In-Reply-To: <4B611FC3.2040007@bobich.net>
References: <3e1809421001272046w4111e6e9m3b2d385166d1f7cf@mail.gmail.com>
	<4B611FC3.2040007@bobich.net>
Message-ID: <3e1809421001272211y5dbe51c1u99c4b0845376e0f0@mail.gmail.com>

On Thu, Jan 28, 2010 at 10:25 AM, Gordan Bobic <gordan at bobich.net> wrote:
> Zaeem Arshad wrote:
>>
>> Hi List,
>>
>> We have 2 geographically distant sites located approximately 35km
>> apart with dark fiber connectivity available between them. Mail01 and
>> SAN1 is placed at site A while Mail02 and SAN2 is at site B. Our
>> requirement is to have the mail servers in a cluster configuration in
>> an active/active mode. To cater for the loss of connectivity or losing
>> a SAN itself, I have come up with the following design.
>
> What's the point of having a SAN if you're using DRBD? You might as well
> have DAS in each of the two mail servers. Unless you need so much storage
> space that you can't put enough disks directly into the server...

We have already bought the SAN, that's why. We do expect our storage
needs to be on the higher side. Also, I am sharing storage on both
SANs as theoretically DRBD will use the local resource first for
read/write and failover to the IP reachable disk volume.

>
>> 1) Export 1 block device from each SAN to its mail server i.e. SAN1
>> exports to Mail01
>> 2) Use DRBD to configure a block device comprising of the 2 SAN
>> volumes and use it as a physical volume in clvm.
>
> The CLVM bit is isn't relevant per se, you don't strictly need it, but it
> won't hurt.
>
>> 3) Create a GFS logical volume from this PV that can be used by both
>> servers.
>
> That's fine.
>
>> I am wondering if this is a correct design as theoretically it looks
>> to address both node and SAN failure or connectivity loss.
>
> The problem you have is that you have no way of enacting fencing if the
> connectivity between the sites fails. If a node fails, any cluster file
> system (GFS included) will mandate a fencing action to ensure that one of
> the nodes gets taken down and stays down. If you have lost cross-site
> connectivity, the nodes won't be able to fence each other, and GFS will
> simply block until connectivity is restored and fencing succeeds. The
> chances are that when this happens, it'll also cause a fencing shoot-out and
> both nodes may well end up getting fenced.
>
> You could use some kind of cheat-fencing, say, by setting a firewall rule
> that will prevent the nodes from re-connecting (you'd need to write your own
> fencing agent, but that's not particularly difficult), but then you would be
> pretty much guaranteeing a split-brain situation, where the nodes would end
> up operating independently without any hope of ever re-synchronising.
>

In such a case where we lose site connectivity altogether, I'd like
the Site 2 servers to shut itself down to avoid a split-brain
condition. Since, I am implementing clustering, won't the quorum
server take care of this issue?


Regards

--
Zaeem



From hassoony01 at hotmail.com  Thu Jan 28 06:19:48 2010
From: hassoony01 at hotmail.com (hussein hussein)
Date: Thu, 28 Jan 2010 09:19:48 +0300
Subject: [Linux-cluster] (no subject)
Message-ID: <SNT101-DS237DA3390410612B17016FD05C0@phx.gbl>

An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100128/5747c008/attachment.htm>

From zaeem.arshad at gmail.com  Thu Jan 28 08:46:06 2010
From: zaeem.arshad at gmail.com (Zaeem Arshad)
Date: Thu, 28 Jan 2010 13:46:06 +0500
Subject: [Linux-cluster] DRBD with GFS applicable for this scenario?
In-Reply-To: <4B611FC3.2040007@bobich.net>
References: <3e1809421001272046w4111e6e9m3b2d385166d1f7cf@mail.gmail.com>
	<4B611FC3.2040007@bobich.net>
Message-ID: <3e1809421001280046h418d47een5597ee1e1d385f4c@mail.gmail.com>

On Thu, Jan 28, 2010 at 10:25 AM, Gordan Bobic <gordan at bobich.net> wrote:

> The problem you have is that you have no way of enacting fencing if the
> connectivity between the sites fails. If a node fails, any cluster file
> system (GFS included) will mandate a fencing action to ensure that one of
> the nodes gets taken down and stays down. If you have lost cross-site
> connectivity, the nodes won't be able to fence each other, and GFS will
> simply block until connectivity is restored and fencing succeeds. The
> chances are that when this happens, it'll also cause a fencing shoot-out and
> both nodes may well end up getting fenced.
>
> You could use some kind of cheat-fencing, say, by setting a firewall rule
> that will prevent the nodes from re-connecting (you'd need to write your own
> fencing agent, but that's not particularly difficult), but then you would be
> pretty much guaranteeing a split-brain situation, where the nodes would end
> up operating independently without any hope of ever re-synchronising.
>
> The bottom line is that you need reliable out-of-band fencing mechanism. If
> you have GSM/wireless signal in both areas you could rig up a separate,
> small fencing "server" on each site with a GSM modem, and write a fencing
> agent that sends a fencing request by SMS. When the fencing server receives
> a fencing request, you'd have to make it issue a local fencing action using
> one of the more standard fencing agents. Note that in this case, due to high
> latency of things like SMS, you'd need to implement accurate time stamping
> and deliberately semi-randomize the delay between fencing requests being
> sent so that you could check time stamps and the fencing servers could
> sensibly decide whether to obey the local fencing request or the remote one.
>
> You have to get a little creative about it and write a few lines of code to
> glue it together. I've been meaning to implement something like this for a
> while, but I haven't gotten around to it yet.
>


This seems to address the split brain condition that can occur in case
of network blackouts involving a 2-node cluster.

https://bugzilla.redhat.com/show_bug.cgi?id=372901



From jose.neto at liber4e.com  Thu Jan 28 10:41:11 2010
From: jose.neto at liber4e.com (jose nuno neto)
Date: Thu, 28 Jan 2010 10:41:11 -0000 (GMT)
Subject: [Linux-cluster] Seeking Advice on 2node/3node Architecture
In-Reply-To: <56b46216bfab23c2a48e7756f995e6fb.squirrel@fela.liber4e.com>
References: <cccb38f22fd2bc70945da5b1281ea934.squirrel@fela.liber4e.com>
	<4B5461C2.50804@redhat.com>
	<56b46216bfab23c2a48e7756f995e6fb.squirrel@fela.liber4e.com>
Message-ID: <731f412e4ca6d582efc2c2a357d63532.squirrel@fela.liber4e.com>

Hellos

I've been testing cluster suite for a few weeks and starting to get a
better understanding of it. I'm preparing a design for a high-availability
solution and would like to ear from you in regards to used solutions,
stable, more tested....

The scenario I have is this:

.2 Datacenters ( 20Kms a apart ) linked with Fiber Channel Loop / and TCP
links
.Storage available on both datacenters
.1Node on each Datacenter

the goal is to build a high-available solution for oracle db, comprising
data protection, 2 node systems active/passive ( eventually 3node, or
qdiskd )

my current doubts are on preventing a splitbrain on network comunication
break between the 2datacenters, making sure data is protected, and 1node
keeps up service.
would a simple 2node config be able to do it? or I should think of qdisk
with some heuristics

I'd apreciate some thought on this

Best Regards
Jose

-- 
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.



From gordan at bobich.net  Thu Jan 28 10:44:08 2010
From: gordan at bobich.net (Gordan Bobic)
Date: Thu, 28 Jan 2010 10:44:08 +0000
Subject: [Linux-cluster] DRBD with GFS applicable for this scenario?
In-Reply-To: <3e1809421001272211y5dbe51c1u99c4b0845376e0f0@mail.gmail.com>
References: <3e1809421001272046w4111e6e9m3b2d385166d1f7cf@mail.gmail.com>	<4B611FC3.2040007@bobich.net>
	<3e1809421001272211y5dbe51c1u99c4b0845376e0f0@mail.gmail.com>
Message-ID: <4B616A78.5010201@bobich.net>

Zaeem Arshad wrote:

>>> We have 2 geographically distant sites located approximately 35km
>>> apart with dark fiber connectivity available between them. Mail01 and
>>> SAN1 is placed at site A while Mail02 and SAN2 is at site B. Our
>>> requirement is to have the mail servers in a cluster configuration in
>>> an active/active mode. To cater for the loss of connectivity or losing
>>> a SAN itself, I have come up with the following design.
>> What's the point of having a SAN if you're using DRBD? You might as well
>> have DAS in each of the two mail servers. Unless you need so much storage
>> space that you can't put enough disks directly into the server...
> 
> We have already bought the SAN, that's why. We do expect our storage
> needs to be on the higher side. Also, I am sharing storage on both
> SANs as theoretically DRBD will use the local resource first for
> read/write and failover to the IP reachable disk volume.

Hmm... What sort of ping time do you get? I presume you have established 
that it is on the sensible side.

In terms of performance you will need to make sure that machines tend to 
access only their own sub-paths on the file system (e.g. spool/1 and 
spool/2, and server 1 doesn't touch spool/2 until server 2 goes down). 
Otherwise the performance is going to be attrocious since file locks 
will end up bouncing between the machines. These normally live in cache 
on a conventional file system so if they have to start getting exchanged 
at most accesses you are looking at a latency degradation from ~ 50ns 
down to some milliseconds. If your connectivity is VERY good, if it's 
35km I would be surprised if your latencies are better than 10ms, which 
you'll feel even on the disk latency, let along memory latency - we are 
talking 200,000x slower in the best case scenario.

>>> 1) Export 1 block device from each SAN to its mail server i.e. SAN1
>>> exports to Mail01
>>> 2) Use DRBD to configure a block device comprising of the 2 SAN
>>> volumes and use it as a physical volume in clvm.
>> The CLVM bit is isn't relevant per se, you don't strictly need it, but it
>> won't hurt.
>>
>>> 3) Create a GFS logical volume from this PV that can be used by both
>>> servers.
>> That's fine.
>>
>>> I am wondering if this is a correct design as theoretically it looks
>>> to address both node and SAN failure or connectivity loss.
>> The problem you have is that you have no way of enacting fencing if the
>> connectivity between the sites fails. If a node fails, any cluster file
>> system (GFS included) will mandate a fencing action to ensure that one of
>> the nodes gets taken down and stays down. If you have lost cross-site
>> connectivity, the nodes won't be able to fence each other, and GFS will
>> simply block until connectivity is restored and fencing succeeds. The
>> chances are that when this happens, it'll also cause a fencing shoot-out and
>> both nodes may well end up getting fenced.
>>
>> You could use some kind of cheat-fencing, say, by setting a firewall rule
>> that will prevent the nodes from re-connecting (you'd need to write your own
>> fencing agent, but that's not particularly difficult), but then you would be
>> pretty much guaranteeing a split-brain situation, where the nodes would end
>> up operating independently without any hope of ever re-synchronising.
>>
> 
> In such a case where we lose site connectivity altogether, I'd like
> the Site 2 servers to shut itself down to avoid a split-brain
> condition. Since, I am implementing clustering, won't the quorum
> server take care of this issue?

So you propose to have a quorum disk on site 2? OK, that works. The 
problem is that fencing works by one server fencing another, not itself. 
So you'll still need a reliable OOB fencing mechanism such as the one I 
described.

Gordan



From carlopmart at gmail.com  Thu Jan 28 11:59:23 2010
From: carlopmart at gmail.com (carlopmart)
Date: Thu, 28 Jan 2010 12:59:23 +0100
Subject: [Linux-cluster] Assign cluster IP address to an interface without IP
Message-ID: <4B617C1B.5040601@gmail.com>

Hi all,

  Is it possible to assign a cluster IP address to an interface without IP? I have a 
two-node cluster with three interfaces each one:

  - eth0: management interface
  - eth1: multicast, cluster operational interface
  - eth2: service interface

  Eth0 and eth1 have an ip address but eth2 don't on both hosts. I need to assign 
several cluster ip address to eth2 on both hosts, but these ip groups are on 
different network segment than eth0 has. Can I do this??

  Many thanks.

-- 
CL Martinez
carlopmart {at} gmail {d0t} com



From zaeem.arshad at gmail.com  Fri Jan 29 05:33:48 2010
From: zaeem.arshad at gmail.com (Zaeem Arshad)
Date: Fri, 29 Jan 2010 10:33:48 +0500
Subject: [Linux-cluster] DRBD with GFS applicable for this scenario?
In-Reply-To: <4B616A78.5010201@bobich.net>
References: <3e1809421001272046w4111e6e9m3b2d385166d1f7cf@mail.gmail.com>
	<4B611FC3.2040007@bobich.net>
	<3e1809421001272211y5dbe51c1u99c4b0845376e0f0@mail.gmail.com>
	<4B616A78.5010201@bobich.net>
Message-ID: <3e1809421001282133j13f08d37ifbb022c00d5a6816@mail.gmail.com>

On Thu, Jan 28, 2010 at 3:44 PM, Gordan Bobic <gordan at bobich.net> wrote:

You have made interesting observations. Let me see if I can answer these.

> Hmm... What sort of ping time do you get? I presume you have established
> that it is on the sensible side.

It's 7 ms to be precise. I was assuming the distance to be around 35km
but it seems less.

>
> In terms of performance you will need to make sure that machines tend to
> access only their own sub-paths on the file system (e.g. spool/1 and
> spool/2, and server 1 doesn't touch spool/2 until server 2 goes down).
> Otherwise the performance is going to be attrocious since file locks will
> end up bouncing between the machines. These normally live in cache on a
> conventional file system so if they have to start getting exchanged at most
> accesses you are looking at a latency degradation from ~ 50ns down to some
> milliseconds. If your connectivity is VERY good, if it's 35km I would be
> surprised if your latencies are better than 10ms, which you'll feel even on
> the disk latency, let along memory latency - we are talking 200,000x slower
> in the best case scenario.
>

I think this is how DRBD works. It will usually read/write to the
local device first and in case of local device failure will it switch
over to the other block over IP. I'd be surprised if it doesn't do
that.


> So you propose to have a quorum disk on site 2? OK, that works. The problem
> is that fencing works by one server fencing another, not itself. So you'll
> still need a reliable OOB fencing mechanism such as the one I described.
>
>

Yeah. I can probably use the following heuristics to make sure I fall
back to primary

Heuristics:
Check if QD is accessible at SAN2           5
yahoo check                                           3
POPXYZ check                                       2
Volume X visible on SAN1 by Mail1           4

By using these heuristics and associated scores, I am able to decide
which server has quorum. I am also looking into the OOB fencing idea
too. One question that pops in my mind is that while I will have only
2 nodes in the DRBD volume, will more than 2 cluster members be able
to access the GFS volume on top of the DRBD volume?


Regards

--
Zaeem



From kkovachev at varna.net  Fri Jan 29 11:06:13 2010
From: kkovachev at varna.net (Kaloyan Kovachev)
Date: Fri, 29 Jan 2010 13:06:13 +0200
Subject: [Linux-cluster] DRBD with GFS applicable for this scenario?
In-Reply-To: <3e1809421001282133j13f08d37ifbb022c00d5a6816@mail.gmail.com>
References: <3e1809421001272046w4111e6e9m3b2d385166d1f7cf@mail.gmail.com>
	<4B611FC3.2040007@bobich.net>
	<3e1809421001272211y5dbe51c1u99c4b0845376e0f0@mail.gmail.com>
	<4B616A78.5010201@bobich.net>
	<3e1809421001282133j13f08d37ifbb022c00d5a6816@mail.gmail.com>
Message-ID: <20100129105002.M34046@varna.net>

On Fri, 29 Jan 2010 10:33:48 +0500, Zaeem Arshad wrote
> On Thu, Jan 28, 2010 at 3:44 PM, Gordan Bobic <gordan at bobich.net> wrote:
> 
> You have made interesting observations. Let me see if I can answer these.
> 
> > Hmm... What sort of ping time do you get? I presume you have established
> > that it is on the sensible side.
> 
> It's 7 ms to be precise. I was assuming the distance to be around 35km
> but it seems less.
> 
> >
> > In terms of performance you will need to make sure that machines tend to
> > access only their own sub-paths on the file system (e.g. spool/1 and
> > spool/2, and server 1 doesn't touch spool/2 until server 2 goes down).
> > Otherwise the performance is going to be attrocious since file locks will
> > end up bouncing between the machines. These normally live in cache on a
> > conventional file system so if they have to start getting exchanged at most
> > accesses you are looking at a latency degradation from ~ 50ns down to some
> > milliseconds. If your connectivity is VERY good, if it's 35km I would be
> > surprised if your latencies are better than 10ms, which you'll feel even on
> > the disk latency, let along memory latency - we are talking 200,000x slower
> > in the best case scenario.
> >
> 
> I think this is how DRBD works. It will usually read/write to the
> local device first and in case of local device failure will it switch
> over to the other block over IP. I'd be surprised if it doesn't do
> that.
> 
> > So you propose to have a quorum disk on site 2? OK, that works. The problem
> > is that fencing works by one server fencing another, not itself. So you'll
> > still need a reliable OOB fencing mechanism such as the one I described.
> >
> >
> 
> Yeah. I can probably use the following heuristics to make sure I fall
> back to primary
> 
> Heuristics:
> Check if QD is accessible at SAN2           5
> yahoo check                                           3
> POPXYZ check                                       2
> Volume X visible on SAN1 by Mail1           4
> 
> By using these heuristics and associated scores, I am able to decide
> which server has quorum. I am also looking into the OOB fencing idea
> too. One question that pops in my mind is that while I will have only
> 2 nodes in the DRBD volume, will more than 2 cluster members be able
> to access the GFS volume on top of the DRBD volume?
> 

If you use DRBD on top of the SANs you should access the drbd resource from
the other nodes. I have similar setup: DAS in two servers replicated via DRBD
in active/active mode, then they are exported via iscsi and used from the
nodes (and the storages are nodes too) with multipath multibus.

In your case the fencing is the main problem ... instead of SMS you may use
the mobiles as GPRS modems directly to the remote fence device

> Regards
> 
> --
> Zaeem
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From adam.king at intechnology.com  Fri Jan 29 11:39:10 2010
From: adam.king at intechnology.com (King, Adam)
Date: Fri, 29 Jan 2010 11:39:10 -0000
Subject: [Linux-cluster]  fence_ilo
In-Reply-To: <ED48EF11677FC44BB3440460C24F58E405A4BE88@mgmtexm03101.intechnology.co.uk>
References: <ED48EF11677FC44BB3440460C24F58E405A4BE88@mgmtexm03101.intechnology.co.uk>
Message-ID: <ED48EF11677FC44BB3440460C24F58E405C9A161@mgmtexm03101.intechnology.co.uk>

Hi, 

Trying this again with a few updated details and a slightly different
problem...

I have a 2 node cluster (xen1 and xen2) running on dl380 G3's. There are
2 virtual machines running on these. From xen1 from the shell command
line I can fence xen2 through the hp lights out port using
/sbin/fence_ilo -a 192.168.1.53 -l aking -p password -v, or using the
comoonics-bootimage-fenceclient-ilo-0.1-16.noarch.rpm. I can also do the
same from the second node and successfully fence the first node. When I
fence the node that has control of one or more virtual machines they are
migrated to the other node. 

If I fence a node in Luci this works too. 

I have xen1-ilo and xen2-ilo set in /etc/hosts with their respective ip
addresses

When I run an /sbin/reboot -f on xen2, xen1 successfully fences that
node and if necessary takes control of any virtual machines that are on
it. 

When I run an /sbin/reboot -f on xen1, xen2 does not even try and fence
node1 and just waits for it to come back up and rejoin the cluster.
/sbin/fenced is running on both machines, I cannot see any difference in
the running services between them. 

The problem here is that if there is that problem with xen1, the virtual
machines will be down. This kinda ruins the idea of a cluster for me. 

 

Can anyone advise why this is happening? 

Cheers

Adam 

  

The contents of this message may be privileged and confidential. If you
have received this message in error, you may not use,
disclose, copy or distribute its content in any way. Please notify the
sender immediately. All messages are scanned for all viruses. 
 

Adam King
Systems Administrator
adam.king at intechnology.com


InTechnology plc
Support 0845 120 7070
Telephone 01423 850000
Facsimile 01423 858866
www.intechnology.com

 


This is an email from InTechnology plc, Central House, Beckwith Knowle, Harrogate, UK, HG3 1UG.
Registered in England 3916586.

The contents of this message may be privileged and confidential. If you have received this message in error, you may not use,

disclose, copy or distribute its content in any way. Please notify the sender immediately. All messages are scanned for all viruses.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100129/ea6450b1/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: InTechnology-logo-circle.gif
Type: image/gif
Size: 6495 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100129/ea6450b1/attachment.gif>

From zaeem.arshad at gmail.com  Fri Jan 29 12:36:48 2010
From: zaeem.arshad at gmail.com (Zaeem Arshad)
Date: Fri, 29 Jan 2010 17:36:48 +0500
Subject: [Linux-cluster] DRBD with GFS applicable for this scenario?
In-Reply-To: <20100129105002.M34046@varna.net>
References: <3e1809421001272046w4111e6e9m3b2d385166d1f7cf@mail.gmail.com>
	<4B611FC3.2040007@bobich.net>
	<3e1809421001272211y5dbe51c1u99c4b0845376e0f0@mail.gmail.com>
	<4B616A78.5010201@bobich.net>
	<3e1809421001282133j13f08d37ifbb022c00d5a6816@mail.gmail.com>
	<20100129105002.M34046@varna.net>
Message-ID: <3e1809421001290436v6a88dc90r56fcbb2c90abb20b@mail.gmail.com>

On Fri, Jan 29, 2010 at 4:06 PM, Kaloyan Kovachev <kkovachev at varna.net> wrote:

>
> If you use DRBD on top of the SANs you should access the drbd resource from
> the other nodes. I have similar setup: DAS in two servers replicated via DRBD
> in active/active mode, then they are exported via iscsi and used from the
> nodes (and the storages are nodes too) with multipath multibus.
>
> In your case the fencing is the main problem ... instead of SMS you may use
> the mobiles as GPRS modems directly to the remote fence device


I have worked out heuristic checks and the associated scored for
quorum disk to be put in. I have tried to work out on paper the
different scenarios that will play out and how the cluster membership
should work based on the eventual score. I think I can do away with
the GPRS modems for now but I will definitely integrate them in the
solution at a later stage.


Regards

--
Zaeem



From gordan at bobich.net  Fri Jan 29 12:47:23 2010
From: gordan at bobich.net (Gordan Bobic)
Date: Fri, 29 Jan 2010 12:47:23 +0000
Subject: [Linux-cluster] DRBD with GFS applicable for this scenario?
In-Reply-To: <3e1809421001290436v6a88dc90r56fcbb2c90abb20b@mail.gmail.com>
References: <3e1809421001272046w4111e6e9m3b2d385166d1f7cf@mail.gmail.com>	<4B611FC3.2040007@bobich.net>	<3e1809421001272211y5dbe51c1u99c4b0845376e0f0@mail.gmail.com>	<4B616A78.5010201@bobich.net>	<3e1809421001282133j13f08d37ifbb022c00d5a6816@mail.gmail.com>	<20100129105002.M34046@varna.net>
	<3e1809421001290436v6a88dc90r56fcbb2c90abb20b@mail.gmail.com>
Message-ID: <4B62D8DB.908@bobich.net>

Zaeem Arshad wrote:
> On Fri, Jan 29, 2010 at 4:06 PM, Kaloyan Kovachev <kkovachev at varna.net> wrote:
> 
>> If you use DRBD on top of the SANs you should access the drbd resource from
>> the other nodes. I have similar setup: DAS in two servers replicated via DRBD
>> in active/active mode, then they are exported via iscsi and used from the
>> nodes (and the storages are nodes too) with multipath multibus.
>>
>> In your case the fencing is the main problem ... instead of SMS you may use
>> the mobiles as GPRS modems directly to the remote fence device
> 
> 
> I have worked out heuristic checks and the associated scored for
> quorum disk to be put in. I have tried to work out on paper the
> different scenarios that will play out and how the cluster membership
> should work based on the eventual score. I think I can do away with
> the GPRS modems for now but I will definitely integrate them in the
> solution at a later stage.

Indeed - the only failure mode that you need GPRS modems for is if the 
inter-site link goes down.

Gordan



From dirk.schulz at kinzesberg.de  Fri Jan 29 12:47:23 2010
From: dirk.schulz at kinzesberg.de (Dirk H. Schulz)
Date: Fri, 29 Jan 2010 13:47:23 +0100
Subject: [Linux-cluster] Startup delay for VMs in a cluster
In-Reply-To: <4B5AD237.5050509@kinzesberg.de>
References: <4B5AD237.5050509@kinzesberg.de>
Message-ID: <4B62D8DB.6070707@kinzesberg.de>

Dirk H. Schulz schrieb:
> Hi folks,
>
> the VMs in my cluster are defined via <vm> containers which are not 
> inside <service> containers because the latter prevents live migration 
> (as far as I could find out).
>
> Now when I start up the cluster, all VMs are started at once, which I 
> would like to prevent. Is there any way to define a startup sequence 
> or startup delay (like you can for the xendomains service) without 
> misusing dependencies?
>
> Dirk
>
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
Just for the archive: I have an idea now.

I will try defining soft dependencies between the VMs (as I understand 
soft is only regarded at startup), so I can have a startup sequence 
which is even more than just a startup delay.

Any reasons why this should not work?

Dirk



From thomas at sjolshagen.net  Fri Jan 29 15:01:08 2010
From: thomas at sjolshagen.net (Thomas Sjolshagen)
Date: Fri, 29 Jan 2010 10:01:08 -0500
Subject: [Linux-cluster] Startup delay for VMs in a cluster
In-Reply-To: <4B62D8DB.6070707@kinzesberg.de>
References: <4B5AD237.5050509@kinzesberg.de> <4B62D8DB.6070707@kinzesberg.de>
Message-ID: <FCB557EE-EC33-4C06-9E7D-4449E06BA5B3@sjolshagen.net>

Keep in mind that the sevice (vm) is considered "up" the moment the  
guest starts, ie not booted, so the delay would be minimal.

A more thorough approach to ordering guest start-up and delaying it  
(either until some in-guest service is available and/or some time  
interval after the previous guest has started would probably require  
using your own script and the <script> resource. However, I think that  
would mean that you also would need to roll your own solution relative  
to live migration, etc since I believe rgmanager only supports live  
migration of KVM guests as <vm> resources..?

// Thomas

On Jan 29, 2010, at 7:47, "Dirk H. Schulz" <dirk.schulz at kinzesberg.de>  
wrote:

> Dirk H. Schulz schrieb:
>> Hi folks,
>>
>> the VMs in my cluster are defined via <vm> containers which are not  
>> inside <service> containers because the latter prevents live  
>> migration (as far as I could find out).
>>
>> Now when I start up the cluster, all VMs are started at once, which  
>> I would like to prevent. Is there any way to define a startup  
>> sequence or startup delay (like you can for the xendomains service)  
>> without misusing dependencies?
>>
>> Dirk
>>
>> -- 
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
> Just for the archive: I have an idea now.
>
> I will try defining soft dependencies between the VMs (as I  
> understand soft is only regarded at startup), so I can have a  
> startup sequence which is even more than just a startup delay.
>
> Any reasons why this should not work?
>
> Dirk
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From pcs.fagnonr at pcsb.org  Fri Jan 29 15:38:50 2010
From: pcs.fagnonr at pcsb.org (Fagnon Raymond)
Date: Fri, 29 Jan 2010 10:38:50 -0500
Subject: [Linux-cluster] heartbeat
Message-ID: <D24B5EDEE969354CA10D44C56305CB4B1D9B94541A@a0040semail3.pinellas.local>

I have a two node cluster and I want to setup the heartbeat monitor over the network. Does anyone have an example of what I need to add to the cluster.conf file to make this happen?

I don't  have any nics that I can dedicate for a cross over cable for heartbeat so I was thinking I could use the serial or the network itself. Is it possible to just use the primary interface for heartbeat since it is my understanding it is nothing more than a ping.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100129/f49cdb6a/attachment.htm>

From dirk.schulz at kinzesberg.de  Fri Jan 29 15:42:05 2010
From: dirk.schulz at kinzesberg.de (Dirk H. Schulz)
Date: Fri, 29 Jan 2010 16:42:05 +0100
Subject: [Linux-cluster] Startup delay for VMs in a cluster
In-Reply-To: <FCB557EE-EC33-4C06-9E7D-4449E06BA5B3@sjolshagen.net>
References: <4B5AD237.5050509@kinzesberg.de> <4B62D8DB.6070707@kinzesberg.de>
	<FCB557EE-EC33-4C06-9E7D-4449E06BA5B3@sjolshagen.net>
Message-ID: <4B6301CD.5010106@kinzesberg.de>

Thomas Sjolshagen schrieb:
> Keep in mind that the sevice (vm) is considered "up" the moment the 
> guest starts, ie not booted, so the delay would be minimal.
>
> A more thorough approach to ordering guest start-up and delaying it 
> (either until some in-guest service is available and/or some time 
> interval after the previous guest has started would probably require 
> using your own script and the <script> resource. However, I think that 
> would mean that you also would need to roll your own solution relative 
> to live migration, etc since I believe rgmanager only supports live 
> migration of KVM guests as <vm> resources..?

Well, you're right, vm service must be considered up as soon as "xm 
create" returns it's exit code which is very early in the vm startup 
process.

Yeah, and your also right with rgmanager supporting live migration only 
for standalone <vm> resources, not for <vm> rescources inside service 
definitions (and that regardless of KVM or Xen). I have already run into 
that.

But what about copying vm.sh to vm-something.sh and filling in a crude 
"sleep 60" after the vm start? Would you see any problems there (apart 
from updates to vm.sh going past it)?

Thanks for sharing thoughts.

Dirk
>
> // Thomas
>
> On Jan 29, 2010, at 7:47, "Dirk H. Schulz" <dirk.schulz at kinzesberg.de> 
> wrote:
>
>> Dirk H. Schulz schrieb:
>>> Hi folks,
>>>
>>> the VMs in my cluster are defined via <vm> containers which are not 
>>> inside <service> containers because the latter prevents live 
>>> migration (as far as I could find out).
>>>
>>> Now when I start up the cluster, all VMs are started at once, which 
>>> I would like to prevent. Is there any way to define a startup 
>>> sequence or startup delay (like you can for the xendomains service) 
>>> without misusing dependencies?
>>>
>>> Dirk
>>>
>>> -- 
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>> Just for the archive: I have an idea now.
>>
>> I will try defining soft dependencies between the VMs (as I 
>> understand soft is only regarded at startup), so I can have a startup 
>> sequence which is even more than just a startup delay.
>>
>> Any reasons why this should not work?
>>
>> Dirk
>>
>> -- 
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster




From dirk.schulz at kinzesberg.de  Fri Jan 29 16:17:58 2010
From: dirk.schulz at kinzesberg.de (Dirk H. Schulz)
Date: Fri, 29 Jan 2010 17:17:58 +0100
Subject: [Linux-cluster] heartbeat
In-Reply-To: <D24B5EDEE969354CA10D44C56305CB4B1D9B94541A@a0040semail3.pinellas.local>
References: <D24B5EDEE969354CA10D44C56305CB4B1D9B94541A@a0040semail3.pinellas.local>
Message-ID: <4B630A36.9010603@kinzesberg.de>

Fagnon Raymond schrieb:
>
> I have a two node cluster and I want to setup the heartbeat monitor 
> over the network. Does anyone have an example of what I need to add to 
> the cluster.conf file to make this happen?
>
> I don?t have any nics that I can dedicate for a cross over cable for 
> heartbeat so I was thinking I could use the serial or the network 
> itself. Is it possible to just use the primary interface for heartbeat 
> since it is my understanding it is nothing more than a ping.
>
Excuse me, are you talking of the old Linux-HA which is often referenced 
as "Heartbeat"?

And what kind of two node cluster do you have? What software packages 
have you set up already?

This is RedHat cluster suite, and it uses a fairly complex cluster 
communication. If your cluster communication and all other communication 
use the same network/nics, you might run into trouble if "other 
communication" uses up your network bandwidth.

Dirk



From pcs.fagnonr at pcsb.org  Fri Jan 29 18:21:30 2010
From: pcs.fagnonr at pcsb.org (Fagnon Raymond)
Date: Fri, 29 Jan 2010 13:21:30 -0500
Subject: [Linux-cluster] heartbeat
In-Reply-To: <4B630A36.9010603@kinzesberg.de>
References: <D24B5EDEE969354CA10D44C56305CB4B1D9B94541A@a0040semail3.pinellas.local>
	<4B630A36.9010603@kinzesberg.de>
Message-ID: <D24B5EDEE969354CA10D44C56305CB4B1D9B9454CD@a0040semail3.pinellas.local>

Here are the packages that are installed that I can find. I inherited this system so I am not sure if everything is installed. 
rgmanager-2.0.46-1.el5
cman-2.0.98-1.el5_3.7
Cluster_Administration-en-US-5.2-1
modcluster-0.12.1-2.el5
system-config-cluster-1.0.55-1.0

This cluster I believe was designed to be a failover. 
Here is what clustat tells me
[root at a0040slinuxgfs1 ~]# clustat
Cluster Status for LinuxCluster @ Fri Jan 29 12:12:02 2010
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 a0040slinuxgfs1                                                     1 Online, Local, rgmanager
 a0040slinuxgfs2                                                     2 Online, rgmanager

 Service Name                                             Owner (Last)                                             State
 ------- ----                                             ----- ------                                             -----
 service:Virtual Server                                   a0040slinuxgfs1                                          started

This is a gig network so bandwidth should not be an issue I believe. Please inform me on the heartbeat since I very new at this.

Raymond

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Dirk H. Schulz
Sent: Friday, January 29, 2010 11:18 AM
To: linux clustering
Subject: Re: [Linux-cluster] heartbeat

Fagnon Raymond schrieb:
>
> I have a two node cluster and I want to setup the heartbeat monitor 
> over the network. Does anyone have an example of what I need to add to 
> the cluster.conf file to make this happen?
>
> I don't have any nics that I can dedicate for a cross over cable for 
> heartbeat so I was thinking I could use the serial or the network 
> itself. Is it possible to just use the primary interface for heartbeat 
> since it is my understanding it is nothing more than a ping.
>
Excuse me, are you talking of the old Linux-HA which is often referenced 
as "Heartbeat"?

And what kind of two node cluster do you have? What software packages 
have you set up already?

This is RedHat cluster suite, and it uses a fairly complex cluster 
communication. If your cluster communication and all other communication 
use the same network/nics, you might run into trouble if "other 
communication" uses up your network bandwidth.

Dirk

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From thomas at sjolshagen.net  Fri Jan 29 18:47:10 2010
From: thomas at sjolshagen.net (Thomas Sjolshagen)
Date: Fri, 29 Jan 2010 13:47:10 -0500
Subject: [Linux-cluster] Startup delay for VMs in a cluster
In-Reply-To: <4B6301CD.5010106@kinzesberg.de>
References: <4B5AD237.5050509@kinzesberg.de> <4B62D8DB.6070707@kinzesberg.de>
	<FCB557EE-EC33-4C06-9E7D-4449E06BA5B3@sjolshagen.net>
	<4B6301CD.5010106@kinzesberg.de>
Message-ID: <46FAF558-F8CB-4487-B248-680BCD85E1A2@sjolshagen.net>


On Jan 29, 2010, at 10:42 AM, Dirk H. Schulz wrote:

[SNIP]

> 
> But what about copying vm.sh to vm-something.sh and filling in a crude "sleep 60" after the vm start? Would you see any problems there (apart from updates to vm.sh going past it)?

Dirk,

I think you'd have to randomize the delay somewhat, i.e. something along the lines of the example below. However,
this example isn't the best since, in my view, there are a couple of problems related to the loop (could be "expensive")
as well as the maxwait vs. vm.sh 'start timeout' value relationship.

#!/bin/bash
# Set a max value for the sleep time per guest.
# NOTE: Must make sure the start timeout for the vm.sh script
# is longer than the $maxwait setting (for obvious reasons)
maxwait=30

value=$RANDOM
while [ $value -gt $maxwait ] ; do
     # Theoretically, this could loop for "a while" since $RANDOM can be
     # a value between 0 and 32K...
     value=$RANDOM
done

# Sleep for that time.
sleep $value




From hicheerup at gmail.com  Fri Jan 29 19:32:57 2010
From: hicheerup at gmail.com (linux-crazy)
Date: Sat, 30 Jan 2010 01:02:57 +0530
Subject: [Linux-cluster] heartbeat
In-Reply-To: <D24B5EDEE969354CA10D44C56305CB4B1D9B9454CD@a0040semail3.pinellas.local>
References: <D24B5EDEE969354CA10D44C56305CB4B1D9B94541A@a0040semail3.pinellas.local>
	<4B630A36.9010603@kinzesberg.de>
	<D24B5EDEE969354CA10D44C56305CB4B1D9B9454CD@a0040semail3.pinellas.local>
Message-ID: <29e045b81001291132p5f07a3b8o23250164eca8e312@mail.gmail.com>

Hi,

Check this out

http://sources.redhat.com/cluster/wiki/FAQ/CMAN

 I don't think it necessary to go for any separate NIC For cluster
hearbeat , if you are running system with multiple NIC then go with
ether net channel Bonding with active passive mode instead of round
robin , because  with round-robin mode i faced some packet loss that
may cause cluster to miss some heartbeat messages.

 Note:  I  done an tons of two node RHEL5 Cluster implementation , all
works like an charm without any seperate heart beat and more over all
are running Oracle DB  with more that 3 lakhs transction per day.

  If  you have any more doubt please revert me back



On Fri, Jan 29, 2010 at 11:51 PM, Fagnon Raymond <pcs.fagnonr at pcsb.org> wrote:
> Here are the packages that are installed that I can find. I inherited this system so I am not sure if everything is installed.
> rgmanager-2.0.46-1.el5
> cman-2.0.98-1.el5_3.7
> Cluster_Administration-en-US-5.2-1
> modcluster-0.12.1-2.el5
> system-config-cluster-1.0.55-1.0
>
> This cluster I believe was designed to be a failover.
> Here is what clustat tells me
> [root at a0040slinuxgfs1 ~]# clustat
> Cluster Status for LinuxCluster @ Fri Jan 29 12:12:02 2010
> Member Status: Quorate
>
> ?Member Name ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ID ? Status
> ?------ ---- ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ---- ------
> ?a0040slinuxgfs1 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 1 Online, Local, rgmanager
> ?a0040slinuxgfs2 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 2 Online, rgmanager
>
> ?Service Name ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? Owner (Last) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? State
> ?------- ---- ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ----- ------ ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? -----
> ?service:Virtual Server ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? a0040slinuxgfs1 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?started
>
> This is a gig network so bandwidth should not be an issue I believe. Please inform me on the heartbeat since I very new at this.
>
> Raymond
>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Dirk H. Schulz
> Sent: Friday, January 29, 2010 11:18 AM
> To: linux clustering
> Subject: Re: [Linux-cluster] heartbeat
>
> Fagnon Raymond schrieb:
>>
>> I have a two node cluster and I want to setup the heartbeat monitor
>> over the network. Does anyone have an example of what I need to add to
>> the cluster.conf file to make this happen?
>>
>> I don't have any nics that I can dedicate for a cross over cable for
>> heartbeat so I was thinking I could use the serial or the network
>> itself. Is it possible to just use the primary interface for heartbeat
>> since it is my understanding it is nothing more than a ping.
>>
> Excuse me, are you talking of the old Linux-HA which is often referenced
> as "Heartbeat"?
>
> And what kind of two node cluster do you have? What software packages
> have you set up already?
>
> This is RedHat cluster suite, and it uses a fairly complex cluster
> communication. If your cluster communication and all other communication
> use the same network/nics, you might run into trouble if "other
> communication" uses up your network bandwidth.
>
> Dirk
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



From nehemiasjahcob at gmail.com  Fri Jan 29 19:45:20 2010
From: nehemiasjahcob at gmail.com (Nehemias Jahcob)
Date: Fri, 29 Jan 2010 16:45:20 -0300
Subject: [Linux-cluster] Included Service Scripts Stop error
In-Reply-To: <5c7b98ed19fe3eebfea9c371aa3d970b.squirrel@fela.liber4e.com>
References: <cccb38f22fd2bc70945da5b1281ea934.squirrel@fela.liber4e.com>
	<4B5461C2.50804@redhat.com>
	<56b46216bfab23c2a48e7756f995e6fb.squirrel@fela.liber4e.com>
	<5c7b98ed19fe3eebfea9c371aa3d970b.squirrel@fela.liber4e.com>
Message-ID: <5f61ab381001291145kded59b8mc389096b89a61fa7@mail.gmail.com>

Mostrar forma romanizada
Please send:

/etc/cluster.conf
/etc/sysconfig/cman | cluster

Best regards

NJ


2010/1/25 jose nuno neto <jose.neto at liber4e.com>

> Hi,
>
> I'm building a cluster with
> cman-2.0.115-1.el5_4.2
> rgmanager-2.0.52-1.el5
>
>
> Right now I'm testing services start/stop but can't get a clean stop from
> the included scripts in /usr/share/cluster
> It starts ok, but stop no, althouth the app is stoped
> This happens either with apache or mysqld scripts
> Any extra setting needed? timing tunings?
>
> Comands runned:
>
> rg_test test /etc/cluster/cluster.conf start service MySQL
> Running in test mode.
> Starting MySQL...
> <debug>  Link for bond0: Detected
> <info>   Adding IPv4 address 172.26.18.7/20 to bond0
> <debug>  Sending gratuitous ARP: 172.26.18.7 00:14:4f:ca:83:4c brd
> ff:ff:ff:ff:ff:ff
> <debug>  Verifying Configuration Of mysql:MySQLTest
> <debug>  Verifying Configuration Of mysql:MySQLTest > Succeed
> <info>   Starting Service mysql:MySQLTest
> <debug>  Looking For IP Address > Succeed -  IP Address Found
> <debug>  Starting Service mysql:MySQLTest > Succeed
> Start of MySQL complete
>
>
> rg_test test /etc/cluster/cluster.conf stop service MySQL
> Running in test mode.
> Stopping MySQL...
> <debug>  Verifying Configuration Of mysql:MySQLTest
> <debug>  Verifying Configuration Of mysql:MySQLTest > Succeed
> <info>   Stopping Service mysql:MySQLTest
> <error>  Stopping Service mysql:MySQLTest > Failed - Application Is Still
> Running
> <error>  Stopping Service mysql:MySQLTest > Failed
> <info>   Removing IPv4 address 172.26.18.7/20 from bond0
>
> rg_test test /etc/cluster/cluster.conf stop service MySQL
> Running in test mode.
> Stopping MySQL...
> <debug>  Verifying Configuration Of mysql:MySQLTest
> <debug>  Verifying Configuration Of mysql:MySQLTest > Succeed
> <info>   Stopping Service mysql:MySQLTest
> <error>  Checking Existence Of File
> /var/run/cluster/mysql/mysql:MySQLTest.pid [mysql:MySQLTest] > Failed -
> File Doesn't Exist
> <info>   Stopping Service mysql:MySQLTest > Succeed
> <debug>  172.26.18.7 is not configured
> Stop of MySQL complete
>
>
> --
> This message has been scanned for viruses and
> dangerous content by MailScanner, and is
> believed to be clean.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100129/9ae91413/attachment.htm>

From dirk.schulz at kinzesberg.de  Sat Jan 30 07:07:59 2010
From: dirk.schulz at kinzesberg.de (Dirk H. Schulz)
Date: Sat, 30 Jan 2010 08:07:59 +0100
Subject: [Linux-cluster] heartbeat
In-Reply-To: <D24B5EDEE969354CA10D44C56305CB4B1D9B9454CD@a0040semail3.pinellas.local>
References: <D24B5EDEE969354CA10D44C56305CB4B1D9B94541A@a0040semail3.pinellas.local>	<4B630A36.9010603@kinzesberg.de>
	<D24B5EDEE969354CA10D44C56305CB4B1D9B9454CD@a0040semail3.pinellas.local>
Message-ID: <4B63DACF.1020603@kinzesberg.de>

Fagnon Raymond schrieb:
> Here are the packages that are installed that I can find. I inherited this system so I am not sure if everything is installed. 
> rgmanager-2.0.46-1.el5
> cman-2.0.98-1.el5_3.7
> Cluster_Administration-en-US-5.2-1
> modcluster-0.12.1-2.el5
> system-config-cluster-1.0.55-1.0
>
> This cluster I believe was designed to be a failover. 
> Here is what clustat tells me
> [root at a0040slinuxgfs1 ~]# clustat
> Cluster Status for LinuxCluster @ Fri Jan 29 12:12:02 2010
> Member Status: Quorate
>
>  Member Name                                                     ID   Status
>  ------ ----                                                     ---- ------
>  a0040slinuxgfs1                                                     1 Online, Local, rgmanager
>  a0040slinuxgfs2                                                     2 Online, rgmanager
>
>  Service Name                                             Owner (Last)                                             State
>  ------- ----                                             ----- ------                                             -----
>  service:Virtual Server                                   a0040slinuxgfs1                                          started
>
> This is a gig network so bandwidth should not be an issue I believe. Please inform me on the heartbeat since I very new at this.
>
>   
Raymond,

this is a redhat cluster suite cluster. Cluster communication 
("heartbeat" and much more) is done by cman.

You might want to have a look at the docs at 
http://www.redhat.com/docs/manuals/csgfs/.

Dirk



From dirk.schulz at kinzesberg.de  Sat Jan 30 07:13:25 2010
From: dirk.schulz at kinzesberg.de (Dirk H. Schulz)
Date: Sat, 30 Jan 2010 08:13:25 +0100
Subject: [Linux-cluster] Logical Dependency problem
Message-ID: <4B63DC15.5080706@kinzesberg.de>

I would like to define my shared GFS volumes as cluster resource and use 
them in services.

Now there is GFS volumes that have to be mounted on more than one nodes. 
The only way to achive that I found is to define a separate service for 
each cluster node - or is there any way to define ONE service that 
starts at several nodes simultaneously?

Well, but IF I define a fs service for every node and use it in 
dependency definitions (i.e. define another service B that is dependent 
on the fs service), then service B only runs on the one node where this 
fs service is started.

So the question is:
How can I mount a GFS volume on several nodes via rgmanager and let 
other services depend on it without restricting them to single nodes?

Dirk



From celsowebber at yahoo.com  Sun Jan 31 20:22:44 2010
From: celsowebber at yahoo.com (Celso K. Webber)
Date: Sun, 31 Jan 2010 12:22:44 -0800 (PST)
Subject: [Linux-cluster] Luci home page
In-Reply-To: <8b711df41001220815u53829e23rfb3c300161cb6efa@mail.gmail.com>
References: <782081.52039.qm@web32102.mail.mud.yahoo.com>
	<13627661.21264116538974.JavaMail.root@athena>
	<8b711df41001220815u53829e23rfb3c300161cb6efa@mail.gmail.com>
Message-ID: <633587.97517.qm@web32105.mail.mud.yahoo.com>

Hi,

Just to report that the workaround SOLVED my problem about the 'An error occurred when trying to contact any of the nodes in the <cluster_name> cluster.' message. Thanks for the tip!

Best regards,

Celso.




________________________________
From: Paras pradhan <pradhanparas at gmail.com>
To: linux clustering <linux-cluster at redhat.com>
Sent: Fri, January 22, 2010 2:15:17 PM
Subject: Re: [Linux-cluster] Luci home page

This fixes the 11111 timeout issue but not the un-responsiveness of conga storage tab when you have device mapper running.

Thanks
Paras.



On Thu, Jan 21, 2010 at 5:28 PM, Paul M. Dyer <pmdyer at ctgcentral2.com> wrote:

Hi,
>
>>I had this problem and worked with RHN for about 4 weeks.   They have been able to reproduce it, and expect to send the bugzilla report to engineering at some point.
>
>>For me, I was able to bypass the timeout error by starting all ricci agents as root.   Somehow, this is a workaround for a saslauthd issue.
>
>>Edit /etc/init.d/ricci, and change line 155 to:
>
>>daemon "$RICCID"
>
>>removing the -u "$NewUID" parameter.
>
>>Paul
>
>
>>----- Original Message -----
>>From: "Celso K. Webber" <celsowebber at yahoo.com>
>>To: "linux clustering" <linux-cluster at redhat.com>
>
>Sent: Tuesday, January 19, 2010 3:24:51 PM (GMT-0600) America/Chicago
>>Subject: Re: [Linux-cluster] Luci home page
>
>>Hi Dirk,
>
>>I've double checked my environment, and I have SELinux DISABLED on both nodes, as I have the firewall DISABLED also.
>
>>Thanks.
>
>
>
>>----- Original Message ----
>>From: Dirk H. Schulz <dirk.schulz at kinzesberg.de>
>>To: linux clustering <linux-cluster at redhat.com>
>>Sent: Tue, January 19, 2010 4:14:28 PM
>>Subject: Re: [Linux-cluster] Luci home page
>
>>I have experienced this kind of difficulties when I started testing conga. One of the first things I tried was setting SElinux to permissive on the conga server, and from then on I could work well with it.
>
>>I did not look into audit.log to find out if there is a solution to it, so far.
>
>>Dirk
>
>>Celso K. Webber schrieb:
>>> Hi,
>>>
>>> I don't know if your problem is the same as mine, but I'm experiencing a problem where luci is a little bit unstable.
>>>
>>> I can log in into luci, but when I click the "cluster" tab, it takes some time and returns:
>>> "An error occurred when trying to contact any of the nodes in the rhcs-xen cluster."
>>>
>>> In the server where luci is running (node1), I see the following messages in /var/log/messages:
>>> Jan  7 17:43:01 node1 luci[5262]: Error reading from node2.localdomain:11111: timeout
>>> Jan  7 17:43:01 node1 luci[5262]: Error reading from node2.localdomain:11111: timeout
>>>
>>> Jan  7 17:43:06 node1 luci[5262]: Error reading from node1.localdomain:11111: timeout
>>> Jan  7 17:43:06 node1 luci[5262]: Error reading from node1.localdomain:11111: timeout
>>>
>>> I've checked name resolution, host names, /etc/hosts, etc. Firewall and SELinux is disabled. I can do a "telnet nodeX.localdomain 11111" from anywhere successfully.
>>>
>>> Whenever I receive this kind of "timeout error" in luci, if I click the same link again or if I do a "reload" in the browser, then luci usually responds correctly. There is the inconvenience of re-clicking almost every link I use in luci.
>>
>>
>>> I'm using a fresh RHEL 5.4 installation, did "yum update" recently and the problem persistend. I indeed tried to remove luci and ricci, rm -rf /var/lib/luci and /var/lib/ricci, but with no change in behaviour.
>>
>>
>>> Does anyone has these same symptoms?
>>>
>>> Thanks, Celso.
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Fagnon Raymond <pcs.fagnonr at pcsb.org>
>>> *To:* linux clustering <linux-cluster at redhat.com>
>>> *Sent:* Mon, January 18, 2010 7:24:48 PM
>>> *Subject:* [Linux-cluster] Luci home page
>>>
>>> When I log into my luci homepage and click on cluster. My Cluster does not appear.  The servers show up under the storage tab and so forth.
>>>
>>>
>>> As far as I know this is a nfs cluster
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>>--
>>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>
>
>
>>--
>>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster
>
>>--
>>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster
>



      
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100131/8bab3789/attachment.htm>